| # | Title | Summary | Keywords | |
| --- | --- | --- | --- | --- |
| 1 | Visual Funnel: Resolving Contextual Blindness in Multimodal Large Language Models | Proposes Visual Funnel to resolve contextual blindness in multimodal large language models. | large language model, multimodal | |
| 2 | VGent: Visual Grounding via Modular Design for Disentangling Reasoning and Prediction | Proposes VGent, achieving efficient visual grounding through a modular design that disentangles reasoning from prediction. | large language model, multimodal, visual grounding | |
| 3 | Efficient-VLN: A Training-Efficient Vision-Language Navigation Model | Efficient-VLN: a training-efficient vision-language navigation model that significantly reduces training cost. | VLN, large language model, multimodal | |
| 4 | BabyVLM-V2: Toward Developmentally Grounded Pretraining and Benchmarking of Vision Foundation Models | BabyVLM-V2: a developmentally grounded pretraining and benchmarking framework for vision foundation models. | foundation model, multimodal | |
| 5 | Blink: Dynamic Visual Token Resolution for Enhanced Multimodal Understanding | Blink: a dynamic visual token resolution method for enhanced multimodal understanding. | large language model, multimodal | |
| 6 | Information-driven Fusion of Pathology Foundation Models for Enhanced Disease Characterization | Proposes an information-driven fusion of pathology foundation models to enhance disease characterization. | foundation model | |
| 7 | DuetSVG: Unified Multimodal SVG Generation with Internal Visual Guidance | DuetSVG: a unified multimodal SVG generation model that improves generation quality via internal visual guidance. | multimodal | |
| 8 | SoccerMaster: A Vision Foundation Model for Soccer Understanding | Proposes SoccerMaster, a vision foundation model for soccer that unifies multiple soccer-understanding tasks. | foundation model | |
| 9 | MultiHateLoc: Towards Temporal Localisation of Multimodal Hate Content in Online Videos | Proposes MultiHateLoc, a framework for weakly supervised temporal localisation of multimodal hate content in online videos. | multimodal | |
| 10 | EchoingPixels: Cross-Modal Adaptive Token Reduction for Efficient Audio-Visual LLMs | Proposes EchoingPixels, which improves audio-visual LLM efficiency through cross-modal adaptive token reduction. | large language model, multimodal | |
| 11 | Vision-Language Models for Infrared Industrial Sensing in Additive Manufacturing Scene Description | VLM-IRIS: a zero-shot vision-language framework for infrared industrial sensing in additive manufacturing. | foundation model | |
| 12 | Image Tiling for High-Resolution Reasoning: Balancing Local Detail with Global Context | Reproduces and analyzes tiling-based high-resolution vision-language models, examining the influence of global context. | multimodal | |
| 13 | AlcheMinT: Fine-grained Temporal Control for Multi-Reference Consistent Video Generation | AlcheMinT: a fine-grained temporal control method for multi-reference consistent video generation. | TAMP | ✅ |
| 14 | FoundationMotion: Auto-Labeling and Reasoning about Spatial Movement in Videos | FoundationMotion: an auto-labeling and reasoning framework that improves understanding of spatial movement in videos. | large language model | |
| 15 | MMSI-Video-Bench: A Holistic Benchmark for Video-Based Spatial Intelligence | MMSI-Video-Bench: a benchmark for evaluating the video-based spatial intelligence of multimodal large models. | chain-of-thought | |
| 16 | Leveraging Text Guidance for Enhancing Demographic Fairness in Gender Classification | Proposes a text-guided method to improve the demographic fairness of facial gender classification. | multimodal | |
| 17 | PoseGAM: Robust Unseen Object Pose Estimation via Geometry-Aware Multi-View Reasoning | PoseGAM: robust unseen object pose estimation via geometry-aware multi-view reasoning. | foundation model | ✅ |
| 18 | CoSPlan: Corrective Sequential Planning via Scene Graph Incremental Updates | Proposes CoSPlan, a corrective sequential planning method based on incremental scene graph updates, improving VLM reasoning on complex tasks. | chain-of-thought | ✅ |