| # | Title | Summary | Keywords | ✅ |
|---|-------|---------|----------|----|
| 1 | Vector-Quantized Vision Foundation Models for Object-Centric Learning | Proposes VQ-VFM-OCL, which improves object-centric learning by sharing quantized vision foundation model representations. | foundation model | ✅ |
| 2 | Do computer vision foundation models learn the low-level characteristics of the human visual system? | Evaluates how closely computer vision foundation models match the human visual system on low-level characteristics. | foundation model | |
| 3 | Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think | Proposes Dream Engine, a unified image-generation framework with text-image interleaved control. | multimodal | |
| 4 | Rethinking Multimodal Learning from the Perspective of Mitigating Classification Ability Disproportion | Proposes a boosting-based multimodal learning method that mitigates the disproportion in classification ability across modalities. | multimodal | ✅ |
| 5 | Joint Fusion and Encoding: Advancing Multimodal Retrieval from the Ground Up | Proposes a joint fusion-and-encoding framework that strengthens multimodal retrieval from the ground up. | multimodal | |
| 6 | Can Large Language Models Unveil the Mysteries? An Exploration of Their Ability to Unlock Information in Complex Scenarios | Introduces the CVQA and CPVQA benchmarks, revealing the limitations of large language models in compositional reasoning over complex scenarios. | large language model | |
| 7 | C-Drag: Chain-of-Thought Driven Motion Controller for Video Generation | Proposes C-Drag, a chain-of-thought driven motion controller for finer-grained controllable video generation. | chain-of-thought | ✅ |
| 8 | Reliable Multimodal Learning Via Multi-Level Adaptive DeConfusion | Proposes a multi-level adaptive deconfusion method that improves the reliability of multimodal learning under noise. | multimodal | |
| 9 | Visual Reasoning at Urban Intersections: FineTuning GPT-4o for Traffic Conflict Detection | Fine-tunes GPT-4o for traffic conflict detection at urban intersections, improving its visual reasoning ability. | large language model, multimodal | ✅ |
| 10 | CoCa-CXR: Contrastive Captioners Learn Strong Temporal Structures for Chest X-Ray Vision-Language Understanding | CoCa-CXR: contrastive captioning models learn temporal structures for chest X-ray vision-language understanding. | large language model, foundation model | |
| 11 | VideoA11y: Method and Dataset for Accessible Video Description | VideoA11y: a method and dataset that use multimodal large language models to generate accessible video descriptions. | large language model, multimodal | ✅ |
| 12 | New Dataset and Methods for Fine-Grained Compositional Referring Expression Comprehension via Specialist-MLLM Collaboration | Proposes a method and dataset for fine-grained compositional referring expression comprehension via specialist-model and MLLM collaboration. | large language model, multimodal | ✅ |
| 13 | AsymLoRA: Harmonizing Data Conflicts and Commonalities in MLLMs | Proposes AsymLoRA, which harmonizes data conflicts and commonalities in MLLMs via asymmetric LoRA, improving multimodal task performance. | large language model, multimodal | ✅ |
| 14 | Improving Adversarial Transferability in MLLMs via Dynamic Vision-Language Alignment Attack | Proposes the Dynamic Vision-Language Alignment attack (DynVLA) to improve the transferability of adversarial attacks on MLLMs. | large language model, multimodal | |
| 15 | Interpreting CLIP with Hierarchical Sparse Autoencoders | Proposes Matryoshka SAE for interpretability analysis and control of the CLIP model. | multimodal | ✅ |
| 16 | Visual Adaptive Prompting for Compositional Zero-Shot Learning | Proposes VAPS, a visual adaptive prompting system that addresses the underuse of visual information in compositional zero-shot learning. | multimodal | |
| 17 | Avat3r: Large Animatable Gaussian Reconstruction Model for High-fidelity 3D Head Avatars | Avat3r: a large Gaussian-reconstruction-based model for animatable 3D head avatars, requiring only a few input images. | foundation model | ✅ |
| 18 | ReCon: Enhancing True Correspondence Discrimination through Relation Consistency for Robust Noisy Correspondence Learning | ReCon: enhances true-correspondence discrimination through relation consistency, enabling robust noisy correspondence learning. | multimodal | ✅ |
| 19 | One Model for ALL: Low-Level Task Interaction Is a Key to Task-Agnostic Image Fusion | Proposes GIFNet, which leverages low-level vision task interaction to achieve task-agnostic image fusion. | multimodal | ✅ |
| 20 | ProAPO: Progressively Automatic Prompt Optimization for Visual Classification | Proposes ProAPO to address prompt optimization for visual classification. | large language model | |