cs.CV (2025-03-12)

📊 34 papers in total | 🔗 7 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (12 🔗4) · Pillar 3: Perception & Semantics (7 🔗3) · Pillar 2: RL & Architecture (6) · Pillar 1: Robot Control (4) · Pillar 8: Physics-based Animation (3) · Pillar 6: Video Extraction (2)

🔬 Pillar 9: Embodied Foundation Models (12 papers)

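| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 1 | CombatVLA: An Efficient Vision-Language-Action Model for Combat Tasks in 3D Action Role-Playing Games | CombatVLA: an efficient vision-language-action model for combat tasks in 3D action role-playing games. | vision-language-action, VLA | |
| 2 | Robust Multimodal Survival Prediction with the Latent Differentiation Conditional Variational AutoEncoder | Proposes LD-CVAE for robust multimodal analysis in cancer survival prediction when genomic data are missing. | multimodal | |
| 3 | Parameter-Efficient Adaptation of Geospatial Foundation Models through Embedding Deflection | Proposes DEFLECT, which efficiently adapts geospatial foundation models via embedding deflection and improves multispectral satellite image processing. | foundation model | |
| 4 | Post-interactive Multimodal Trajectory Prediction for Autonomous Driving | Proposes Pioformer, which explicitly models post-interaction features to improve trajectory prediction accuracy for autonomous driving. | multimodal | |
| 5 | Multi-Modal Foundation Models for Computational Pathology: A Survey | Surveys multimodal foundation models in computational pathology, covering three paradigms: vision-language, vision-knowledge graph, and vision-gene expression. | foundation model | |
| 6 | Isolated Channel Vision Transformers: From Single-Channel Pretraining to Multi-Channel Finetuning | Proposes IC-ViT, which uses single-channel pretraining and multi-channel finetuning to improve ViT performance on multi-channel image tasks. | foundation model, multimodal | |
| 7 | MindGYM: What Matters in Question Synthesis for Thinking-Centric Fine-Tuning? | MindGYM: proposes a thinking-centric fine-tuning framework that improves large-model reasoning through question synthesis. | foundation model | |
| 8 | Project-Probe-Aggregate: Efficient Fine-Tuning for Group Robustness | Proposes Project-Probe-Aggregate to address bias in image-text models. | foundation model | |
| 9 | ForAug: Recombining Foregrounds and Backgrounds to Improve Vision Transformer Training with Bias Mitigation | ForAug: recombines foregrounds and backgrounds to mitigate bias and improve Vision Transformer training. | foundation model | |
| 10 | Generative Frame Sampler for Long Video Understanding | Proposes Generative Frame Sampler (GenS) to improve the efficiency and performance of VideoLLMs on long video understanding. | large language model | |
| 11 | TA-V2A: Textually Assisted Video-to-Audio Generation | TA-V2A: proposes a text-assisted video-to-audio generation method that improves semantic understanding and generation quality. | large language model | |
| 12 | Discovering Influential Neuron Path in Vision Transformers | Proposes a method for discovering neuron paths in Vision Transformers, improving model interpretability, with an application to model pruning. | foundation model | |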

🔬 Pillar 3: Perception & Semantics (7 papers)

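| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 13 | Close-up-GS: Enhancing Close-Up View Synthesis in 3D Gaussian Splatting with Progressive Self-Training | Proposes Close-up-GS, based on progressive self-training, to improve close-up view synthesis quality in 3D Gaussian Splatting. | 3D gaussian splatting, 3DGS, gaussian splatting | |
| 14 | Motion Blender Gaussian Splatting for Dynamic Scene Reconstruction | Proposes Motion Blender Gaussian Splatting for controllable dynamic scene reconstruction and motion editing. | gaussian splatting, splatting, scene reconstruction | |
| 15 | SDD-4DGS: Static-Dynamic Aware Decoupling in Gaussian Splatting for 4D Scene Reconstruction | SDD-4DGS: 4D scene reconstruction with static-dynamic decoupling, based on Gaussian Splatting. | gaussian splatting, splatting, scene reconstruction | |
| 16 | GASPACHO: Gaussian Splatting for Controllable Humans and Objects | GASPACHO: proposes a Gaussian Splatting-based method for controllable rendering of human-object interaction. | gaussian splatting, splatting, physically plausible | |
| 17 | OpenVidVRD: Open-Vocabulary Video Visual Relation Detection via Prompt-Driven Semantic Space Alignment | Proposes the OpenVidVRD framework, which achieves open-vocabulary video visual relation detection via prompt-driven semantic space alignment. | open-vocabulary, open vocabulary, spatiotemporal | |
| 18 | DitHub: A Modular Framework for Incremental Open-Vocabulary Object Detection | Proposes the DitHub framework to address adaptability in open-vocabulary object detection. | open-vocabulary, open vocabulary | |
| 19 | Investigation of Frame Differences as Motion Cues for Video Object Segmentation | Proposes a frame-difference-based video object segmentation method suited to resource-constrained edge devices. | optical flow | |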

🔬 Pillar 2: RL & Architecture (6 papers)

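| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 20 | CleverDistiller: Simple and Spatially Consistent Cross-modal Distillation | CleverDistiller: a simple and spatially consistent cross-modal knowledge distillation method that improves 3D perception performance. | distillation, semantic map, foundation model | |
| 21 | LuciBot: Automated Robot Policy Learning from Generated Videos | LuciBot: automatically learns robot policies from generated videos, improving performance on complex embodied tasks. | policy learning, large language model | |
| 22 | ViM-VQ: Efficient Post-Training Vector Quantization for Visual Mamba | ViM-VQ: an efficient post-training vector quantization method for Visual Mamba that improves low-bit quantization accuracy. | Mamba, state space model | |
| 23 | Patch-Wise Hypergraph Contrastive Learning with Dual Normal Distribution Weighting for Multi-Domain Stain Transfer | Proposes STNHCL, which achieves multi-domain stain transfer via hypergraph contrastive learning and dual normal distribution weighting. | contrastive learning | |
| 24 | Astrea: A MOE-based Visual Understanding Model with Progressive Alignment | Astrea: a visual understanding model based on MoE and progressive alignment that addresses heterogeneous tasks and imbalanced expert load. | contrastive learning, multimodal | |
| 25 | Memory-enhanced Retrieval Augmentation for Long Video Understanding | Proposes MemVid, a memory-enhanced retrieval augmentation method for long video understanding. | reinforcement learning, curriculum learning | |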

🔬 Pillar 1: Robot Control (4 papers)

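| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 26 | 2HandedAfforder: Learning Precise Actionable Bimanual Affordances from Human Videos | Proposes 2HandedAfforder, which learns precise, actionable bimanual affordances from human videos. | manipulation, bi-manual, affordance | |
| 27 | Oh-A-DINO: Understanding and Enhancing Attribute-Level Information in Self-Supervised Object-Centric Representations | Oh-A-DINO: improves self-supervised object-centric representations by enhancing attribute-level information. | manipulation | |
| 28 | A PyTorch-Enabled Tool for Synthetic Event Camera Data Generation and Algorithm Development | SENPI: a PyTorch-based tool for synthetic event camera data generation and algorithm development. | manipulation | |
| 29 | Fully-Synthetic Training for Visual Quality Inspection in Automotive Production | Proposes a fully synthetic training approach for visual quality inspection in automotive production, improving defect detection accuracy. | domain randomization | |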

🔬 Pillar 8: Physics-based Animation (3 papers)

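| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 30 | Bidirectional Learned Facial Animation Codec for Low Bitrate Talking Head Videos | Proposes a bidirectional learned facial animation codec to address low-bitrate video. | ASE | |
| 31 | I2V3D: Controllable image-to-video generation with 3D guidance | I2V3D: controllable image-to-video generation using 3D guidance. | character animation | |
| 32 | Pig behavior dataset and Spatial-temporal perception and enhancement networks based on the attention mechanism for pig behavior recognition | Proposes an attention-based spatial-temporal perception and enhancement network for pig behavior recognition, and builds an accompanying dataset. | spatiotemporal | |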

🔬 Pillar 6: Video Extraction (2 papers)

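| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 33 | Exo2Ego: Exocentric Knowledge Guided MLLM for Egocentric Video Understanding | Proposes Exo2Ego, which uses exocentric knowledge to guide an MLLM for egocentric video understanding. | egocentric, large language model, multimodal | |
| 34 | Monte Carlo Diffusion for Generalizable Learning-Based RANSAC | Proposes a Monte Carlo diffusion-based approach to generalizable learning-based RANSAC, improving robustness on out-of-distribution data. | feature matching | |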
