cs.CV (2025-03-12)

📊 34 papers in total | 🔗 7 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (12 🔗4) · Pillar 3: Perception & Semantics (7 🔗3) · Pillar 2: RL & Architecture (6) · Pillar 1: Robot Control (4) · Pillar 8: Physics-based Animation (3) · Pillar 6: Video Extraction (2)

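🔬 Pillar 9: Embodied Foundation Models (12 papers)

#	Title	One-line summary	Tags	🔗	⭐
1	CombatVLA: An Efficient Vision-Language-Action Model for Combat Tasks in 3D Action Role-Playing Games	CombatVLA: an efficient vision-language-action model for combat tasks in 3D action role-playing games.	vision-language-action VLA	✅
2	Robust Multimodal Survival Prediction with the Latent Differentiation Conditional Variational AutoEncoder	Proposes LD-CVAE for robust multimodal analysis in cancer survival prediction when genomic data are missing.	multimodal
3	Parameter-Efficient Adaptation of Geospatial Foundation Models through Embedding Deflection	Proposes DEFLECT, which efficiently adapts geospatial foundation models via embedding deflection to improve multispectral satellite image processing.	foundation model
4	Post-interactive Multimodal Trajectory Prediction for Autonomous Driving	Proposes Pioformer, which explicitly models post-interaction features to improve trajectory prediction accuracy for autonomous driving.	multimodal
5	Multi-Modal Foundation Models for Computational Pathology: A Survey	Surveys multimodal foundation models in computational pathology, covering three paradigms: vision-language, vision-knowledge graph, and vision-gene expression.	foundation model
6	Isolated Channel Vision Transformers: From Single-Channel Pretraining to Multi-Channel Finetuning	Proposes IC-ViT, which uses single-channel pretraining and multi-channel finetuning to improve ViT performance on multi-channel image processing tasks.	foundation model multimodal	✅
7	MindGYM: What Matters in Question Synthesis for Thinking-Centric Fine-Tuning?	MindGYM: proposes a thinking-centric fine-tuning framework that improves the reasoning ability of large models through question synthesis.	foundation model
8	Project-Probe-Aggregate: Efficient Fine-Tuning for Group Robustness	Proposes Project-Probe-Aggregate to address bias in image-text models.	foundation model
9	ForAug: Recombining Foregrounds and Backgrounds to Improve Vision Transformer Training with Bias Mitigation	ForAug: recombines foregrounds and backgrounds to mitigate bias and improve Vision Transformer training.	foundation model	✅
10	Generative Frame Sampler for Long Video Understanding	Proposes the Generative Frame Sampler (GenS) to improve the efficiency and performance of VideoLLMs on long video understanding.	large language model
11	TA-V2A: Textually Assisted Video-to-Audio Generation	TA-V2A: proposes a text-assisted video-to-audio generation method that improves semantic understanding and generation quality.	large language model
12	Discovering Influential Neuron Path in Vision Transformers	Proposes a method for discovering neuron paths in Vision Transformers, improving model interpretability, with an application to model pruning.	foundation model	✅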

🔬 Pillar 3: Perception & Semantics (7 papers)

#	Title	One-line summary	Tags	🔗	⭐
13	Close-up-GS: Enhancing Close-Up View Synthesis in 3D Gaussian Splatting with Progressive Self-Training	Proposes Close-up-GS, based on progressive self-training, to improve the quality of close-up view synthesis in 3D Gaussian Splatting.	3D gaussian splatting 3DGS gaussian splatting
14	Motion Blender Gaussian Splatting for Dynamic Scene Reconstruction	Proposes Motion Blender Gaussian Splatting for controllable dynamic scene reconstruction and motion editing.	gaussian splatting splatting scene reconstruction	✅
15	SDD-4DGS: Static-Dynamic Aware Decoupling in Gaussian Splatting for 4D Scene Reconstruction	SDD-4DGS: 4D scene reconstruction with static-dynamic decoupling, based on Gaussian Splatting.	gaussian splatting splatting scene reconstruction
16	GASPACHO: Gaussian Splatting for Controllable Humans and Objects	GASPACHO: proposes a Gaussian Splatting-based method for controllable rendering of human-object interactions.	gaussian splatting splatting physically plausible	✅
17	OpenVidVRD: Open-Vocabulary Video Visual Relation Detection via Prompt-Driven Semantic Space Alignment	Proposes the OpenVidVRD framework, which performs open-vocabulary video visual relation detection through prompt-driven semantic space alignment.	open-vocabulary open vocabulary spatiotemporal
18	DitHub: A Modular Framework for Incremental Open-Vocabulary Object Detection	Proposes the DitHub framework to address adaptability in open-vocabulary object detection.	open-vocabulary open vocabulary	✅
19	Investigation of Frame Differences as Motion Cues for Video Object Segmentation	Proposes a frame-difference-based video object segmentation method suited to resource-constrained edge devices.	optical flow

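🔬 Pillar 2: RL & Architecture (6 papers)

#	Title	One-line summary	Tags	🔗	⭐
20	CleverDistiller: Simple and Spatially Consistent Cross-modal Distillation	CleverDistiller: a simple, spatially consistent cross-modal knowledge distillation method that improves 3D perception performance.	distillation semantic map foundation model
21	LuciBot: Automated Robot Policy Learning from Generated Videos	LuciBot: automatically learns robot policies from generated videos, improving performance on complex embodied tasks.	policy learning large language model
22	ViM-VQ: Efficient Post-Training Vector Quantization for Visual Mamba	ViM-VQ: an efficient post-training vector quantization method for Visual Mamba that improves low-bit quantization accuracy.	Mamba state space model
23	Patch-Wise Hypergraph Contrastive Learning with Dual Normal Distribution Weighting for Multi-Domain Stain Transfer	Proposes STNHCL, which performs multi-domain stain transfer through hypergraph contrastive learning and dual normal distribution weighting.	contrastive learning
24	Astrea: A MOE-based Visual Understanding Model with Progressive Alignment	Astrea: a visual understanding model based on MOE and progressive alignment that addresses heterogeneous tasks and expert load imbalance.	contrastive learning multimodal
25	Memory-enhanced Retrieval Augmentation for Long Video Understanding	Proposes MemVid, a memory-enhanced retrieval augmentation method for long video understanding.	reinforcement learning curriculum learning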

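🔬 Pillar 1: Robot Control (4 papers)

#	Title	One-line summary	Tags	🔗	⭐
26	2HandedAfforder: Learning Precise Actionable Bimanual Affordances from Human Videos	Proposes 2HandedAfforder, which learns precise, actionable bimanual affordances from human videos.	manipulation bi-manual affordance
27	Oh-A-DINO: Understanding and Enhancing Attribute-Level Information in Self-Supervised Object-Centric Representations	Oh-A-DINO: improves self-supervised object-centric representations by enhancing attribute-level information.	manipulation
28	A PyTorch-Enabled Tool for Synthetic Event Camera Data Generation and Algorithm Development	SENPI: a PyTorch-based tool for synthetic event camera data generation and algorithm development.	manipulation
29	Fully-Synthetic Training for Visual Quality Inspection in Automotive Production	Proposes a training method for visual quality inspection in automotive production based on fully synthetic data, improving defect detection accuracy.	domain randomization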

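🔬 Pillar 8: Physics-based Animation (3 papers)

#	Title	One-line summary	Tags	🔗	⭐
30	Bidirectional Learned Facial Animation Codec for Low Bitrate Talking Head Videos	Proposes a bidirectional learned facial animation codec to address the problem of low-bitrate video.	ASE
31	I2V3D: Controllable image-to-video generation with 3D guidance	I2V3D: controllable image-to-video generation using 3D guidance.	character animation
32	Pig behavior dataset and Spatial-temporal perception and enhancement networks based on the attention mechanism for pig behavior recognition	Proposes an attention-based spatial-temporal perception and enhancement network for pig behavior recognition, and builds an accompanying dataset.	spatiotemporal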

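🔬 Pillar 6: Video Extraction (2 papers)

#	Title	One-line summary	Tags	🔗	⭐
33	Exo2Ego: Exocentric Knowledge Guided MLLM for Egocentric Video Understanding	Proposes Exo2Ego, which uses exocentric knowledge to guide an MLLM for egocentric video understanding.	egocentric large language model multimodal
34	Monte Carlo Diffusion for Generalizable Learning-Based RANSAC	Proposes a Monte Carlo diffusion-based method for generalizable learning-based RANSAC, improving robustness on out-of-distribution data.	feature matching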
