cs.CV (2025-03-13)

📊 60 papers in total | 🔗 13 with code

🎯 Topic Navigation

Pillar 9: Embodied Foundation Models (21 🔗4) · Pillar 3: Perception & Semantics (16 🔗4) · Pillar 2: RL & Architecture (12 🔗3) · Pillar 1: Robot Control (5) · Pillar 7: Motion Retargeting (2 🔗1) · Pillar 4: Generative Motion (2 🔗1) · Pillar 5: Interaction & Reaction (1) · Pillar 8: Physics-based Animation (1)

🔬 Pillar 9: Embodied Foundation Models (21 papers)

# | Title | One-line Summary | Tags | 🔗
1 | TokenCarve: Information-Preserving Visual Token Compression in Multimodal Large Language Models | TokenCarve: an information-preserving visual token compression method for multimodal large language models | large language model, multimodal
2 | EscapeCraft: A 3D Room Escape Environment for Benchmarking Complex Multimodal Reasoning Ability | Proposes EscapeCraft to benchmark complex multimodal reasoning ability | large language model, multimodal, visual grounding
3 | ChatGPT Encounters Morphing Attack Detection: Zero-Shot MAD with Multi-Modal Large Language Models and General Vision Models | Proposes a zero-shot face morphing attack detection method based on multimodal large language models and general vision models | large language model, multimodal
4 | VisualPRM: An Effective Process Reward Model for Multimodal Reasoning | Proposes VisualPRM, an effective multimodal process reward model that improves MLLM reasoning | large language model, multimodal
5 | Style Evolving along Chain-of-Thought for Unknown-Domain Object Detection | Proposes a chain-of-thought-based style evolution method that improves generalization of unknown-domain object detection across complex styles | multimodal, chain-of-thought
6 | DriveLMM-o1: A Step-by-Step Reasoning Dataset and Large Multimodal Model for Driving Scenario Understanding | Proposes the DriveLMM-o1 dataset and a large multimodal model for step-by-step reasoning in driving scenario understanding | multimodal
7 | VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search | VisualWebInstruct: scales up multimodal instruction data through web search, improving vision-language model reasoning | multimodal
8 | Interactive Multimodal Fusion with Temporal Modeling | Proposes an interactive multimodal fusion method with temporal modeling for in-the-wild valence-arousal estimation | multimodal
9 | A Multimodal Fusion Model Leveraging MLP Mixer and Handcrafted Features-based Deep Learning Networks for Facial Palsy Detection | Proposes a multimodal facial palsy detection model fusing MLP Mixer and handcrafted-feature deep networks | multimodal
10 | Proxy-Tuning: Tailoring Multimodal Autoregressive Models for Subject-Driven Image Generation | Proposes Proxy-Tuning, which uses diffusion models to strengthen autoregressive models for subject-driven image generation | multimodal
11 | PiSA: A Self-Augmented Data Engine and Training Strategy for 3D Understanding with Large Models | Proposes PiSA-Engine to generate high-quality 3D spatial-semantic instruction data, improving the understanding ability of 3D large models | large language model, multimodal
12 | CINEMA: Coherent Multi-Subject Video Generation via MLLM-Based Guidance | CINEMA: a coherent multi-subject video generation framework guided by MLLMs | large language model, multimodal
13 | Hybrid Agents for Image Restoration | Proposes HybridAgent, which fuses multiple image restoration modes for intelligent, efficient user interaction | large language model, multimodal
14 | UVE: Are MLLMs Unified Evaluators for AI-Generated Videos? | Proposes UVE-Bench to explore whether MLLMs can serve as unified evaluators of AI-generated videos | large language model, multimodal
15 | UniGoal: Towards Universal Zero-shot Goal-oriented Navigation | UniGoal: a universal zero-shot goal-oriented navigation framework that handles multiple goal types uniformly | large language model
16 | Unifying 2D and 3D Vision-Language Understanding | Proposes UniVLG, which unifies 2D and 3D vision-language understanding and improves 3D scene understanding performance | language conditioned
17 | Taxonomic Reasoning for Rare Arthropods: Combining Dense Image Captioning and RAG for Interpretable Classification | Combines dense image captioning with RAG to improve the accuracy and interpretability of rare arthropod classification | large language model
18 | TruthPrInt: Mitigating LVLM Object Hallucination Via Latent Truthful-Guided Pre-Intervention | TruthPrInt: mitigates object hallucination in LVLMs via latent truthful-guided pre-intervention | large language model
19 | IDEA: Inverted Text with Cooperative Deformable Aggregation for Multi-modal Object Re-Identification | Proposes the IDEA framework, which uses inverted text and cooperative deformable aggregation for multimodal object re-identification | large language model
20 | Unveiling the Invisible: Reasoning Complex Occlusions Amodally with AURA | Proposes the AURA model for amodal reasoning segmentation in complex occlusion scenarios | large language model
21 | Singular Value Fine-tuning for Few-Shot Class-Incremental Learning | Proposes SVFCL, which uses singular value fine-tuning to mitigate overfitting in few-shot class-incremental learning | foundation model

🔬 Pillar 3: Perception & Semantics (16 papers)

# | Title | One-line Summary | Tags | 🔗
22 | 4D LangSplat: 4D Language Gaussian Splatting via Multimodal Large Language Models | Proposes 4D LangSplat, achieving 4D language Gaussian splatting for dynamic scenes via multimodal large language models | gaussian splatting, splatting, open-vocabulary
23 | MuDG: Taming Multi-modal Diffusion with Gaussian Splatting for Urban Scene Reconstruction | MuDG: tames multimodal diffusion with Gaussian splatting for urban scene reconstruction | 3DGS, gaussian splatting, splatting
24 | VicaSplat: A Single Run is All You Need for 3D Gaussian Splatting and Camera Estimation from Unposed Video Frames | VicaSplat: a single run suffices for 3D Gaussian splatting reconstruction and camera estimation from unposed video frames | 3D gaussian splatting, gaussian splatting, splatting
25 | OVTR: End-to-End Open-Vocabulary Multiple Object Tracking with Transformer | Proposes OVTR, the first end-to-end open-vocabulary multiple object tracking Transformer | open-vocabulary, open vocabulary, multimodal
26 | OSMa-Bench: Evaluating Open Semantic Mapping Under Varying Lighting Conditions | OSMa-Bench: an automated LLM/LVLM-based pipeline for evaluating open semantic mapping algorithms under varying lighting conditions | semantic mapping, semantic map, ConceptGraphs
27 | RI3D: Few-Shot Gaussian Splatting With Repair and Inpainting Diffusion Priors | RI3D: few-shot Gaussian splatting with repair and inpainting diffusion priors | 3DGS, gaussian splatting, splatting
28 | Flow-NeRF: Joint Learning of Geometry, Poses, and Dense Flow within Unified Neural Representations | Proposes Flow-NeRF to address scene reconstruction without pose priors | depth estimation, NeRF, neural radiance field
29 | GaussHDR: High Dynamic Range Gaussian Splatting via Learning Unified 3D and 2D Local Tone Mapping | GaussHDR: achieves high dynamic range Gaussian splatting by learning unified 3D and 2D local tone mapping | 3D gaussian splatting, gaussian splatting, splatting
30 | 3D Student Splatting and Scooping | Proposes Student Splatting and Scooping (SSS), improving the expressiveness and parameter efficiency of 3D Gaussian splatting | 3D gaussian splatting, 3DGS, gaussian splatting
31 | LHM: Large Animatable Human Reconstruction Model from a Single Image in Seconds | Proposes LHM: a large model for fast animatable human reconstruction from a single image | 3D gaussian splatting, gaussian splatting, splatting
32 | TARS: Traffic-Aware Radar Scene Flow Estimation | TARS: traffic-aware radar scene flow estimation, improving autonomous driving perception | scene understanding, scene flow
33 | The Power of One: A Single Example is All it Takes for Segmentation in VLMs | Fine-tuning on a single example significantly improves the segmentation performance of vision-language models | open-vocabulary, open vocabulary, multimodal
34 | ROODI: Reconstructing Occluded Objects with Denoising Inpainters | ROODI: reconstructs occluded objects in 3D Gaussian splatting scenes using denoising inpainters | 3D gaussian splatting, gaussian splatting, splatting
35 | ST-FlowNet: An Efficient Spiking Neural Network for Event-Based Optical Flow Estimation | Proposes ST-FlowNet, an efficient spiking neural network for event-based optical flow estimation | optical flow
36 | MouseGPT: A Large-scale Vision-Language Model for Mouse Behavior Analysis | MouseGPT: a large-scale vision-language model for mouse behavior analysis | open-vocabulary, open vocabulary
37 | Speedy MASt3R | Speedy MASt3R: accelerates image matching via post-training optimization, enabling real-time 3D scene understanding | scene reconstruction

🔬 Pillar 2: RL & Architecture (12 papers)

# | Title | One-line Summary | Tags | 🔗
38 | A Hierarchical Semantic Distillation Framework for Open-Vocabulary Object Detection | Proposes the HD-OVD framework, improving open-vocabulary object detection via hierarchical semantic distillation | distillation, open-vocabulary, open vocabulary
39 | R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization | R1-Onevision: advances generalized multimodal reasoning through cross-modal formalization | reinforcement learning, large language model, multimodal
40 | RoMA: Scaling up Mamba-based Foundation Models for Remote Sensing | RoMA: scales up Mamba-based foundation models for remote sensing, improving high-resolution image processing | Mamba, foundation model
41 | Multi-Modal Mamba Modeling for Survival Prediction (M4Survive): Adapting Joint Foundation Model Representations | M4Survive: a multimodal Mamba-based survival prediction model fusing medical imaging and pathology information | Mamba, foundation model
42 | Trajectory Mamba: Efficient Attention-Mamba Forecasting Model Based on Selective SSM | Proposes Trajectory Mamba, an efficient trajectory forecasting model based on selective SSMs | Mamba, SSM
43 | HiCMamba: Enhancing Hi-C Resolution and Identifying 3D Genome Structures with State Space Modeling | HiCMamba: uses state space modeling to enhance Hi-C resolution and identify 3D genome structures | Mamba, state space model
44 | MoFlow: One-Step Flow Matching for Human Trajectory Forecasting via Implicit Maximum Likelihood Estimation based Distillation | MoFlow: one-step flow matching for human trajectory forecasting via implicit-maximum-likelihood-based distillation | flow matching, distillation
45 | Mamba-VA: A Mamba-based Approach for Continuous Emotion Recognition in Valence-Arousal Space | Proposes the Mamba-VA model, using the Mamba architecture for continuous emotion recognition and improving emotion modeling in valence-arousal space | Mamba, masked autoencoder, MAE
46 | Enhancing Multi-Agent Systems via Reinforcement Learning with LLM-based Planner and Graph-based Policy | Proposes the LGC-MARL framework, combining an LLM planner with a graph-based policy to improve multi-agent collaboration on complex tasks | reinforcement learning, large language model
47 | OuroMamba: A Data-Free Quantization Framework for Vision Mamba | OuroMamba: the first data-free quantization framework for Vision Mamba models | Mamba, contrastive learning
48 | Technical Approach for the EMI Challenge in the 8th Affective Behavior Analysis in-the-Wild Competition | Proposes a two-stage cross-modal alignment framework, improving in-the-wild emotional mimicry intensity estimation | contrastive learning, multimodal
49 | Studying Classifier(-Free) Guidance From a Classifier-Centric Perspective | Studies classifier(-free) guidance for conditional generation in diffusion models from a classifier-centric perspective | flow matching, classifier-free guidance

🔬 Pillar 1: Robot Control (5 papers)

# | Title | One-line Summary | Tags | 🔗
50 | HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model | HybridVLA: a unified vision-language-action model combining diffusion and autoregression | manipulation, vision-language-action, VLA
51 | NIL: No-data Imitation Learning by Leveraging Pre-trained Video Diffusion Models | NIL: no-data imitation learning that leverages pre-trained video diffusion models to improve robot motor skills | quadruped, humanoid, humanoid robot
52 | 6D Object Pose Tracking in Internet Videos for Robotic Manipulation | Proposes a prior-free method for 6D object pose tracking in internet videos for robotic manipulation | manipulation, trajectory optimization, 6D pose estimation
53 | AdvPaint: Protecting Images from Inpainting Manipulation via Adversarial Attention Disruption | Proposes AdvPaint, which protects images from diffusion-based inpainting manipulation by adversarially disrupting attention | manipulation
54 | Towards Fast, Memory-based and Data-Efficient Vision-Language Policy | LiteVLP: a fast, memory-based, data-efficient vision-language policy model | manipulation

🔬 Pillar 7: Motion Retargeting (2 papers)

# | Title | One-line Summary | Tags | 🔗
55 | GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing | Proposes the GoT framework to address insufficient reasoning in image generation and editing | spatial relationship, large language model, multimodal
56 | PanoGen++: Domain-Adapted Text-Guided Panoramic Environment Generation for Vision-and-Language Navigation | PanoGen++: domain-adapted text-guided panoramic environment generation for vision-and-language navigation | spatial relationship, VLN

🔬 Pillar 4: Generative Motion (2 papers)

# | Title | One-line Summary | Tags | 🔗
57 | VMBench: A Benchmark for Perception-Aligned Video Motion Generation | VMBench: a perception-aligned benchmark for evaluating video motion generation | motion generation
58 | Cosh-DiT: Co-Speech Gesture Video Synthesis via Hybrid Audio-Visual Diffusion Transformers | Proposes Cosh-DiT, realistic co-speech gesture video synthesis via hybrid audio-visual diffusion Transformers | VQ-VAE

🔬 Pillar 5: Interaction & Reaction (1 paper)

# | Title | One-line Summary | Tags | 🔗
59 | Hoi2Threat: An Interpretable Threat Detection Method for Human Violence Scenarios Guided by Human-Object Interaction | Hoi2Threat: an interpretable threat detection method for human violence scenarios guided by human-object interaction | human-object interaction, HOI, multimodal

🔬 Pillar 8: Physics-based Animation (1 paper)

# | Title | One-line Summary | Tags | 🔗
60 | Lightweight Models for Emotional Analysis in Video | Proposes lightweight emotional analysis models based on MobileNetV4 and a multi-scale 3D MLP-Mixer | spatiotemporal
