cs.CV (2024-10-14)

📊 36 papers in total | 🔗 9 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (17 🔗4) | Pillar 2: RL & Architecture (6) | Pillar 3: Perception & Semantics (5 🔗1) | Pillar 7: Motion Retargeting (3 🔗2) | Pillar 1: Robot Control (2 🔗2) | Pillar 4: Generative Motion (2) | Pillar 6: Video Extraction (1)

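🔬 Pillar 9: Embodied Foundation Models (17 papers)

#	Title	One-line summary	Tags	🔗	⭐
1	ForgeryGPT: Multimodal Large Language Model For Explainable Image Forgery Detection and Localization	Proposes ForgeryGPT, which uses a multimodal large language model for explainable image forgery detection and localization.	large language model multimodal instruction following
2	X-Fi: A Modality-Invariant Foundation Model for Multimodal Human Sensing	Proposes X-Fi, a modality-invariant foundation model for multimodal human sensing.	foundation model multimodal
3	TWIST & SCOUT: Grounding Multimodal LLM-Experts by Forget-Free Tuning	Proposes the TWIST & SCOUT framework, which improves the visual grounding ability of MLLMs through forget-free tuning.	large language model multimodal visual grounding
4	EchoApex: A General-Purpose Vision Foundation Model for Echocardiography	EchoApex: a general-purpose vision foundation model for echocardiography.	foundation model
5	TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models	TemporalBench: a benchmark for fine-grained temporal understanding in multimodal video models.	multimodal
6	Towards Foundation Models for 3D Vision: How Close Are We?	Proposes the UniQA-3D benchmark to evaluate and improve the capabilities of 3D vision foundation models.	foundation model	✅
7	CAFuser: Condition-Aware Multimodal Fusion for Robust Semantic Perception of Driving Scenes	Proposes CAFuser, a condition-aware multimodal fusion method that improves the robustness of semantic perception in driving scenes.	multimodal	✅
8	MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks	MEGA-Bench: builds a multimodal evaluation benchmark of 500+ real-world tasks covering a wide range of application scenarios.	multimodal
9	Class Balancing Diversity Multimodal Ensemble for Alzheimer's Disease Diagnosis and Early Detection	Proposes IMBALMED, a class-balancing diversity multimodal ensemble method for early diagnosis of Alzheimer's disease.	multimodal
10	Performance Evaluation of Deep Learning and Transformer Models Using Multimodal Data for Breast Cancer Classification	Proposes deep learning models based on multimodal data fusion to improve breast cancer classification performance.	multimodal
11	MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models	Proposes MMIE, a massive multimodal interleaved comprehension benchmark for evaluating large vision-language models.	multimodal	✅
12	LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content	Proposes LiveXiv, a live multimodal benchmark based on arXiv paper content for evaluating large multimodal models.	foundation model
13	Both Ears Wide Open: Towards Language-Driven Spatial Audio Generation	Proposes the SpatialSonic model for language-driven immersive spatial audio generation.	multimodal
14	Cross-Modal Few-Shot Learning: a Generative Transfer Learning Framework	Proposes GTL, a generative transfer learning framework for cross-modal few-shot learning.	multimodal
15	MoTE: Reconciling Generalization with Specialization for Visual-Language to Video Knowledge Transfer	Proposes the MoTE framework, which balances generalization and task-specific performance in video recognition.	foundation model	✅
16	Hybrid Transformer for Early Alzheimer's Detection: Integration of Handwriting-Based 2D Images and 1D Signal Features	Proposes a hybrid Transformer model that fuses handwriting images and signal features for early detection of Alzheimer's disease.	multimodal
17	Spatial-Aware Efficient Projector for MLLMs via Multi-Layer Feature Aggregation	Proposes SAEP, a spatial-aware efficient projector that improves MLLM efficiency and spatial understanding through multi-layer feature aggregation.	multimodal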

🔬 Pillar 2: RL & Architecture (6 papers)

#	Title	One-line summary	Tags	🔗	⭐
18	4DStyleGaussian: Zero-shot 4D Style Transfer with Gaussian Splatting	Proposes 4DStyleGaussian, which uses Gaussian splatting for zero-shot 4D style transfer.	distillation gaussian splatting splatting
19	V2M: Visual 2-Dimensional Mamba for Image Representation Learning	Proposes V2M, a visual 2-dimensional Mamba model for image representation learning.	Mamba SSM state space model
20	Hi-Mamba: Hierarchical Mamba for Efficient Image Super-Resolution	Hi-Mamba: a hierarchical Mamba network for efficient image super-resolution.	Mamba SSM state space model
21	DrivingDojo Dataset: Advancing Interactive and Knowledge-Enriched Driving World Model	DrivingDojo: proposes an interactive, knowledge-enriched dataset for driving world models to advance the modeling of complex driving scenarios.	world model instruction following
22	GlobalMamba: Global Image Serialization for Vision Mamba	GlobalMamba: improves Vision Mamba performance through global image serialization.	Mamba
23	Depth Any Video with Scalable Synthetic Data	Proposes the Depth Any Video model, which uses scalable synthetic data for video depth estimation.	flow matching depth estimation

🔬 Pillar 3: Perception & Semantics (5 papers)

#	Title	One-line summary	Tags	🔗	⭐
24	Few-shot Novel View Synthesis using Depth Aware 3D Gaussian Splatting	Proposes depth-aware 3D Gaussian splatting to address performance degradation in few-shot novel view synthesis.	monocular depth 3D gaussian splatting 3DGS	✅
25	4-LEGS: 4D Language Embedded Gaussian Splatting	Proposes 4-LEGS, a language-embedded 4D Gaussian splatting method for spatio-temporal event localization.	3D gaussian splatting gaussian splatting splatting
26	Self-Assessed Generation: Trustworthy Label Generation for Optical Flow and Stereo Matching in Real-world	Proposes the Self-Assessed Generation (SAG) framework to improve the generalization of optical flow and stereo matching in real-world scenes.	optical flow geometric consistency
27	3DArticCyclists: Generating Synthetic Articulated 8D Pose-Controllable Cyclist Data for Computer Vision Applications	Proposes the 3DArticCyclists framework, which generates controllable synthetic 3D cyclist data to address the scarcity of cyclist data in autonomous driving.	3D gaussian splatting 3DGS gaussian splatting
28	StegaINR4MIH: steganography by implicit neural representation for multi-image hiding	StegaINR4MIH: steganography for multi-image hiding using implicit neural representations.	implicit representation

🔬 Pillar 7: Motion Retargeting (3 papers)

#	Title	One-line summary	Tags	🔗	⭐
29	Cavia: Camera-controllable Multi-view Video Diffusion with View-Integrated Attention	Cavia: proposes a camera-controllable multi-view video diffusion model based on view-integrated attention.	geometric consistency spatiotemporal	✅
30	DragEntity: Trajectory Guided Video Generation using Entity and Positional Relationships	DragEntity: trajectory-guided video generation using entity and positional relationships.	spatial relationship
31	FlexGen: Flexible Multi-View Generation from Text and Image Inputs	FlexGen: proposes a flexible multi-view generation framework that accepts text and image inputs.	spatial relationship	✅

🔬 Pillar 1: Robot Control (2 papers)

#	Title	One-line summary	Tags	🔗	⭐
32	Sitcom-Crafter: A Plot-Driven Human Motion Generation System in 3D Scenes	Sitcom-Crafter: a plot-driven human motion generation system in 3D scenes.	locomotion motion synthesis motion generation	✅
33	Out-of-Bounding-Box Triggers: A Stealthy Approach to Cheat Object Detectors	Proposes a stealthy out-of-bounding-box trigger attack that deceives object detectors.	manipulation	✅

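🔬 Pillar 4: Generative Motion (2 papers)

#	Title	One-line summary	Tags	🔗	⭐
34	MaskControl: Spatio-Temporal Control for Masked Motion Synthesis	MaskControl: introduces spatio-temporal control into generative masked motion models, improving control precision and motion quality.	motion diffusion model motion diffusion text-to-motion
35	Boosting Camera Motion Control for Video Diffusion Transformers	Proposes Camera Motion Guidance (CMG), which significantly improves camera motion control precision in video diffusion Transformers.	classifier-free guidance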

🔬 Pillar 6: Video Extraction (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
36	A Consistency-Aware Spot-Guided Transformer for Versatile and Hierarchical Point Cloud Registration	Proposes a consistency-aware spot-guided Transformer for versatile and hierarchical point cloud registration.	feature matching geometric consistency
