cs.CV (2025-12-19)

📊 31 papers total | 🔗 6 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (12 🔗1) · Pillar 3: Perception & Semantics (8 🔗1) · Pillar 2: RL & Architecture (6 🔗3) · Pillar 6: Video Extraction (2 🔗1) · Pillar 1: Robot Control (1) · Pillar 7: Motion Retargeting (1) · Pillar 4: Generative Motion (1)

🔬 Pillar 9: Embodied Foundation Models (12 papers)

| # | Title | One-line Summary | Tags | 🔗 |
|---|---|---|---|---|
| 1 | GroundingME: Exposing the Visual Grounding Gap in MLLMs through Multi-Dimensional Evaluation | GroundingME: a multi-dimensional evaluation that exposes the visual grounding gap in MLLMs | large language model, multimodal, visual grounding | |
| 2 | Adversarial Robustness of Vision in Open Foundation Models | Reveals the vulnerability of open vision foundation models to adversarial attacks, finding that robustness does not correlate directly with benchmark performance | foundation model | |
| 3 | PathFLIP: Fine-grained Language-Image Pretraining for Versatile Computational Pathology | PathFLIP: fine-grained language-image pretraining for versatile computational pathology | large language model, multimodal, instruction following | |
| 4 | Foundation Model Priors Enhance Object Focus in Feature Space for Source-Free Object Detection | FALCON-SFOD: uses foundation-model priors to enhance object focus in source-free object detection | foundation model | |
| 5 | MULTIAQUA: A multimodal maritime dataset and robust training strategies for multimodal semantic segmentation | Introduces the MULTIAQUA multimodal maritime dataset and robust training strategies that improve water-surface semantic segmentation | multimodal | |
| 6 | HeadHunt-VAD: Hunting Robust Anomaly-Sensitive Heads in MLLM for Tuning-Free Video Anomaly Detection | HeadHunt-VAD: hunts anomaly-sensitive heads in MLLMs for tuning-free video anomaly detection | large language model, multimodal | |
| 7 | Robust-R1: Degradation-Aware Reasoning for Robust Visual Understanding | Robust-R1: explicitly models visual degradation to improve the real-world robustness of multimodal LLMs | large language model, multimodal | |
| 8 | A Benchmark for Ultra-High-Resolution Remote Sensing MLLMs | Introduces RSHR-Bench to evaluate ultra-high-resolution visual understanding in remote sensing | large language model, multimodal | |
| 9 | Deep But Reliable: Advancing Multi-turn Reasoning for Thinking with Images | DRIM: improves multi-turn self-reflection of vision-language models when reasoning with images | multimodal, chain-of-thought | |
| 10 | Keypoint Counting Classifiers: Turning Vision Transformers into Self-Explainable Models Without Training | Training-free Keypoint Counting Classifiers that turn Vision Transformers into self-explainable models | foundation model | |
| 11 | Auxiliary Descriptive Knowledge for Few-Shot Adaptation of Vision-Language Model | ADK: auxiliary descriptive knowledge that improves few-shot adaptation of vision-language models | large language model | |
| 12 | ABE-CLIP: Training-Free Attribute Binding Enhancement for Compositional Image-Text Matching | ABE-CLIP: training-free attribute-binding enhancement for compositional image-text matching | multimodal | |

🔬 Pillar 3: Perception & Semantics (8 papers)

| # | Title | One-line Summary | Tags | 🔗 |
|---|---|---|---|---|
| 13 | Generative Human-Object Interaction Detection via Differentiable Cognitive Steering of Multi-modal LLMs | GRASP-HO: generative human-object interaction detection via differentiable cognitive steering of multimodal LLMs | open-vocabulary, human-object interaction | |
| 14 | Chorus: Multi-Teacher Pretraining for Holistic 3D Gaussian Scene Encoding | Chorus: multi-teacher pretraining for holistic 3D Gaussian scene encoding | 3D gaussian splatting, 3DGS, gaussian splatting | |
| 15 | G3Splat: Geometrically Consistent Generalizable Gaussian Splatting | G3Splat: generalizable Gaussian splatting via geometric-consistency priors | gaussian splatting, splatting | |
| 16 | Long-Range depth estimation using learning based Hybrid Distortion Model for CCTV cameras | A learning-based hybrid distortion model for long-range depth estimation with CCTV cameras | depth estimation | |
| 17 | 3D-RE-GEN: 3D Reconstruction of Indoor Scenes with a Generative Framework | 3D-RE-GEN: a generative framework for single-image indoor scene reconstruction that produces artist-editable meshes | scene reconstruction, spatial relationship | |
| 18 | SAVeD: A First-Person Social Media Video Dataset for ADAS-equipped vehicle Near-Miss and Crash Event Analyses | SAVeD: a first-person social-media video dataset for analyzing near-miss and crash events of ADAS-equipped vehicles | depth estimation, monocular depth | |
| 19 | InsertAnywhere: Bridging 4D Scene Geometry and Diffusion Models for Realistic Video Object Insertion | InsertAnywhere: bridges 4D scene geometry and diffusion models for realistic video object insertion | scene understanding | |
| 20 | SynergyWarpNet: Attention-Guided Cooperative Warping for Neural Portrait Animation | SynergyWarpNet: an attention-guided cooperative warping network for neural portrait animation | optical flow | |

🔬 Pillar 2: RL & Architecture (6 papers)

| # | Title | One-line Summary | Tags | 🔗 |
|---|---|---|---|---|
| 21 | Learning When to Look: A Disentangled Curriculum for Strategic Perception in Multimodal Reasoning | A disentangled curriculum-learning framework that addresses forgetting of visual information in multimodal reasoning | reinforcement learning, large language model, multimodal | |
| 22 | Re-Depth Anything: Test-Time Depth Refinement via Self-Supervised Re-lighting | Re-Depth Anything: test-time depth refinement via self-supervised re-lighting | distillation, depth estimation, monocular depth | |
| 23 | FLEG: Feed-Forward Language Embedded Gaussian Splatting from Any Views | FLEG: feed-forward language-embedded Gaussian splatting from arbitrary views | contrastive learning, gaussian splatting, splatting | |
| 24 | Xiaomi MiMo-VL-Miloco Technical Report | MiMo-VL-Miloco: scene understanding for smart-home scenarios | reinforcement learning, multimodal, chain-of-thought | |
| 25 | PhysFire-WM: A Physics-Informed World Model for Emulating Fire Spread Dynamics | PhysFire-WM: a physics-informed world model for emulating fire-spread dynamics | world model, multimodal | |
| 26 | EMAG: Self-Rectifying Diffusion Sampling with Exponential Moving Average Guidance | EMAG: self-rectifying diffusion sampling with exponential moving average guidance, improving generation quality | flow matching, classifier-free guidance | |

🔬 Pillar 6: Video Extraction (2 papers)

| # | Title | One-line Summary | Tags | 🔗 |
|---|---|---|---|---|
| 27 | Mitty: Diffusion-based Human-to-Robot Video Generation | Mitty: diffusion-based Human2Robot video generation for end-to-end robot learning | egocentric, human-to-robot | |
| 28 | ClothHMR: 3D Mesh Recovery of Humans in Diverse Clothing from Single Image | ClothHMR: 3D human mesh recovery under diverse clothing from a single image | human mesh recovery | |

🔬 Pillar 1: Robot Control (1 paper)

| # | Title | One-line Summary | Tags | 🔗 |
|---|---|---|---|---|
| 29 | Dexterous World Models | DWM: dexterous world models enabling interactive digital twins via video diffusion | locomotion, manipulation, world model | |

🔬 Pillar 7: Motion Retargeting (1 paper)

| # | Title | One-line Summary | Tags | 🔗 |
|---|---|---|---|---|
| 30 | Any-Optical-Model: A Universal Foundation Model for Optical Remote Sensing | Any-Optical-Model: a universal foundation model for optical remote sensing across sensors, resolutions, and missing bands | spatial relationship, foundation model | |

🔬 Pillar 4: Generative Motion (1 paper)

| # | Title | One-line Summary | Tags | 🔗 |
|---|---|---|---|---|
| 31 | RadarGen: Automotive Radar Point Cloud Generation from Cameras | RadarGen: an image-conditioned diffusion model for generating automotive radar point clouds | physically plausible, foundation model, multimodal | |
