cs.CV (2025-02-27)

📊 42 papers in total | 🔗 14 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (20 🔗11) · Pillar 3: Perception & Semantics (9 🔗2) · Pillar 2: RL & Architecture (9) · Pillar 6: Video Extraction (2) · Pillar 1: Robot Control (1) · Pillar 4: Generative Motion (1 🔗1)

🔬 Pillar 9: Embodied Foundation Models (20 papers)

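| # | Title | One-sentence summary | Tags |
|---|---|---|---|
| 1 | Vector-Quantized Vision Foundation Models for Object-Centric Learning | Proposes VQ-VFM-OCL, which improves object-centric learning by sharing quantized vision foundation model representations. | foundation model |
| 2 | Do computer vision foundation models learn the low-level characteristics of the human visual system? | Evaluates how closely computer vision foundation models match the human visual system on low-level characteristics. | foundation model |
| 3 | Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think | Proposes Dream Engine, a unified framework for image generation with interleaved text-image control. | multimodal |
| 4 | Rethinking Multimodal Learning from the Perspective of Mitigating Classification Ability Disproportion | Proposes a boosting-based multimodal learning method that mitigates disproportion in classification ability. | multimodal |
| 5 | Joint Fusion and Encoding: Advancing Multimodal Retrieval from the Ground Up | Proposes a joint fusion and encoding framework that strengthens multimodal retrieval from the ground up. | multimodal |
| 6 | Can Large Language Models Unveil the Mysteries? An Exploration of Their Ability to Unlock Information in Complex Scenarios | Proposes the CVQA and CPVQA benchmarks, revealing the limitations of large language models in compositional reasoning over complex scenarios. | large language model |
| 7 | C-Drag: Chain-of-Thought Driven Motion Controller for Video Generation | Proposes C-Drag, a chain-of-thought driven motion controller for finer-grained controllable video generation. | chain-of-thought |
| 8 | Reliable Multimodal Learning Via Multi-Level Adaptive DeConfusion | Proposes a multi-level adaptive deconfusion method that improves the reliability of multimodal learning in noisy settings. | multimodal |
| 9 | Visual Reasoning at Urban Intersections: FineTuning GPT-4o for Traffic Conflict Detection | Fine-tunes GPT-4o for traffic conflict detection at urban intersections, improving its visual reasoning. | large language model, multimodal |
| 10 | CoCa-CXR: Contrastive Captioners Learn Strong Temporal Structures for Chest X-Ray Vision-Language Understanding | CoCa-CXR: contrastive captioners learn temporal structure for chest X-ray vision-language understanding. | large language model, foundation model |
| 11 | VideoA11y: Method and Dataset for Accessible Video Description | VideoA11y: a method and dataset for generating accessible video descriptions with multimodal large language models. | large language model, multimodal |
| 12 | New Dataset and Methods for Fine-Grained Compositional Referring Expression Comprehension via Specialist-MLLM Collaboration | Proposes a method and dataset for fine-grained compositional referring expression comprehension, based on collaboration between specialist models and MLLMs. | large language model, multimodal |
| 13 | AsymLoRA: Harmonizing Data Conflicts and Commonalities in MLLMs | Proposes AsymLoRA, which uses asymmetric LoRA to reconcile data conflicts and commonalities in MLLMs and improve multimodal task performance. | large language model, multimodal |
| 14 | Improving Adversarial Transferability in MLLMs via Dynamic Vision-Language Alignment Attack | Proposes the Dynamic Vision-Language Alignment attack (DynVLA), which improves the transferability of adversarial attacks on MLLMs. | large language model, multimodal |
| 15 | Interpreting CLIP with Hierarchical Sparse Autoencoders | Proposes Matryoshka SAE for interpretability analysis and control of CLIP models. | multimodal |
| 16 | Visual Adaptive Prompting for Compositional Zero-Shot Learning | Proposes VAPS, a visual adaptive prompting system that addresses the under-use of visual information in compositional zero-shot learning. | multimodal |
| 17 | Avat3r: Large Animatable Gaussian Reconstruction Model for High-fidelity 3D Head Avatars | Avat3r: a large animatable Gaussian reconstruction model for 3D head avatars that needs only a few input images. | foundation model |
| 18 | ReCon: Enhancing True Correspondence Discrimination through Relation Consistency for Robust Noisy Correspondence Learning | ReCon: strengthens true-correspondence discrimination through relation consistency for robust noisy correspondence learning. | multimodal |
| 19 | One Model for ALL: Low-Level Task Interaction Is a Key to Task-Agnostic Image Fusion | Proposes GIFNet, which uses interaction between low-level vision tasks to achieve task-agnostic image fusion. | multimodal |
| 20 | ProAPO: Progressively Automatic Prompt Optimization for Visual Classification | Proposes ProAPO to address prompt optimization for visual classification. | large language model |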

🔬 Pillar 3: Perception & Semantics (9 papers)

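| # | Title | One-sentence summary | Tags |
|---|---|---|---|
| 21 | 3D-AffordanceLLM: Harnessing Large Language Models for Open-Vocabulary Affordance Detection in 3D Worlds | Proposes 3D-AffordanceLLM for open-vocabulary affordance detection in 3D environments. | open-vocabulary, open vocabulary, affordance |
| 22 | TrackGS: Optimizing COLMAP-Free 3D Gaussian Splatting with Global Track Constraints | TrackGS: optimizes COLMAP-free 3D Gaussian Splatting with global track constraints. | 3D gaussian splatting, 3DGS, gaussian splatting |
| 23 | UniDepthV2: Universal Monocular Metric Depth Estimation Made Simpler | UniDepthV2: simplifies universal monocular metric depth estimation and improves cross-domain generalization. | depth estimation, metric depth, UniDepth |
| 24 | Open-Vocabulary Semantic Part Segmentation of 3D Human | Proposes the HumanCLIP model and the MaskFusion module for open-vocabulary semantic part segmentation of 3D humans. | 3D gaussian splatting, gaussian splatting, splatting |
| 25 | Efficient Gaussian Splatting for Monocular Dynamic Scene Rendering via Sparse Time-Variant Attribute Modeling | Proposes EDGS, which achieves efficient, high-quality rendering of monocular dynamic scenes through sparse time-variant attribute modeling. | gaussian splatting, splatting |
| 26 | Learning to Generalize without Bias for Open-Vocabulary Action Recognition | Proposes the Open-MeDe framework, which addresses the poor generalization caused by CLIP's static bias in open-vocabulary action recognition. | open-vocabulary, open vocabulary |
| 27 | SegLocNet: Multimodal Localization Network for Autonomous Driving via Bird's-Eye-View Segmentation | SegLocNet: a multimodal localization network based on bird's-eye-view segmentation for accurate, robust localization in autonomous driving. | semantic map, multimodal |
| 28 | InsTaG: Learning Personalized 3D Talking Head from Few-Second Video | InsTaG: a framework for quickly learning a personalized 3D talking head from a small amount of video. | 3DGS |
| 29 | 4Deform: Neural Surface Deformation for Robust Shape Interpolation | 4Deform: a deformation method based on neural implicit surfaces for robust shape interpolation, particularly suited to unstructured data. | implicit representation |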

🔬 Pillar 2: RL & Architecture (9 papers)

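| # | Title | One-sentence summary | Tags |
|---|---|---|---|
| 30 | From Thousands to Billions: 3D Visual Language Grounding via Render-Supervised Distillation from 2D VLMs | LIFT-GS: large-scale 3D vision-language understanding via distillation from 2D vision-language models. | distillation, open-vocabulary, open vocabulary |
| 31 | Investigating and Enhancing Vision-Audio Capability in Omnimodal Large Language Models | Proposes a self-knowledge distillation method that improves omnimodal large language models on vision-audio tasks. | distillation, large language model, multimodal |
| 32 | I see what you mean: Co-Speech Gestures for Reference Resolution in Multimodal Dialogue | Proposes a co-speech gesture embedding method based on self-supervised pretraining for reference resolution in multimodal dialogue. | representation learning, multimodal |
| 33 | CFTrack: Enhancing Lightweight Visual Tracking through Contrastive Learning and Feature Matching | CFTrack: enhances lightweight visual tracking through contrastive learning and feature matching. | contrastive learning, feature matching |
| 34 | Identity-preserving Distillation Sampling by Fixed-Point Iterator | Proposes Identity-preserving Distillation Sampling (IDS), which uses fixed-point iteration regularization to address identity drift in SDS-based image editing. | distillation, NeRF, neural radiance field |
| 35 | Enhanced Contrastive Learning with Multi-view Longitudinal Data for Chest X-ray Report Generation | Proposes the MLRG model, which uses multi-view longitudinal data and contrastive learning to improve chest X-ray report generation. | contrastive learning, spatiotemporal |
| 36 | Multi-Scale Neighborhood Occupancy Masked Autoencoder for Self-Supervised Learning in LiDAR Point Clouds | Proposes the multi-scale Neighborhood Occupancy Masked Autoencoder (NOMAE) for self-supervised learning on LiDAR point clouds. | masked autoencoder, MAE |
| 37 | Learning Mask Invariant Mutual Information for Masked Image Modeling | Proposes MI-MAE, which improves masked image modeling through mutual information maximization and minimization. | masked autoencoder, MAE, contrastive learning |
| 38 | SAC-ViT: Semantic-Aware Clustering Vision Transformer with Early Exit | Proposes SAC-ViT, which improves the computational efficiency of Vision Transformers through semantic-aware clustering and early exit. | SAC |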

🔬 Pillar 6: Video Extraction (2 papers)

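| # | Title | One-sentence summary | Tags |
|---|---|---|---|
| 39 | PI-HMR: Towards Robust In-bed Temporal Human Shape Reconstruction with Contact Pressure Sensing | PI-HMR: uses contact pressure sensing for robust temporal reconstruction of in-bed human shape. | HMR |
| 40 | EgoNormia: Benchmarking Physical Social Norm Understanding | Proposes EgoNormia to evaluate understanding of physical social norms. | egocentric |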

🔬 Pillar 1: Robot Control (1 paper)

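| # | Title | One-sentence summary | Tags |
|---|---|---|---|
| 41 | InterMimic: Towards Universal Whole-Body Control for Physics-Based Human-Object Interactions | InterMimic: universal whole-body control for physics-based interactions, learned from imperfect motion capture data. | whole-body control, human-object interaction, HOI |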

🔬 Pillar 4: Generative Motion (1 paper)

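| # | Title | One-sentence summary | Tags |
|---|---|---|---|
| 42 | UniTok: A Unified Tokenizer for Visual Generation and Understanding | Proposes UniTok, a tokenizer that unifies visual generation and understanding through a multi-codebook quantization mechanism. | VQ-VAE |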
