cs.CV (2025-12-19)

📊 31 papers total | 🔗 6 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (12 🔗1) · Pillar 3: Perception & Semantics (8 🔗1) · Pillar 2: RL & Architecture (6 🔗3) · Pillar 6: Video Extraction (2 🔗1) · Pillar 1: Robot Control (1) · Pillar 7: Motion Retargeting (1) · Pillar 4: Generative Motion (1)

🔬 Pillar 9: Embodied Foundation Models (12 papers)

| # | Title | One-line Summary | Tags | 🔗 |
|---|---|---|---|---|
| 1 | GroundingME: Exposing the Visual Grounding Gap in MLLMs through Multi-Dimensional Evaluation | GroundingME: a multi-dimensional evaluation that exposes the visual grounding gap in MLLMs | large language model, multimodal, visual grounding | |
| 2 | Adversarial Robustness of Vision in Open Foundation Models | Reveals the vulnerability of open vision foundation models to adversarial attacks, finding that robustness does not correlate directly with benchmark performance | foundation model | |
| 3 | PathFLIP: Fine-grained Language-Image Pretraining for Versatile Computational Pathology | PathFLIP: fine-grained language-image pretraining for versatile computational pathology | large language model, multimodal, instruction following | |
| 4 | Foundation Model Priors Enhance Object Focus in Feature Space for Source-Free Object Detection | FALCON-SFOD: uses foundation-model priors to enhance object focus in source-free object detection | foundation model | |
| 5 | MULTIAQUA: A multimodal maritime dataset and robust training strategies for multimodal semantic segmentation | Introduces the MULTIAQUA multimodal maritime dataset and robust training strategies that improve water-surface semantic segmentation | multimodal | |
| 6 | HeadHunt-VAD: Hunting Robust Anomaly-Sensitive Heads in MLLM for Tuning-Free Video Anomaly Detection | HeadHunt-VAD: hunts anomaly-sensitive heads in MLLMs for tuning-free video anomaly detection | large language model, multimodal | |
| 7 | Robust-R1: Degradation-Aware Reasoning for Robust Visual Understanding | Robust-R1: explicitly models visual degradation to improve the real-world robustness of multimodal LLMs | large language model, multimodal | |
| 8 | A Benchmark for Ultra-High-Resolution Remote Sensing MLLMs | Introduces RSHR-Bench to evaluate ultra-high-resolution visual understanding in remote sensing | large language model, multimodal | |
| 9 | Deep But Reliable: Advancing Multi-turn Reasoning for Thinking with Images | DRIM: improves multi-turn self-reflection of vision-language models when reasoning with images | multimodal, chain-of-thought | |
| 10 | Keypoint Counting Classifiers: Turning Vision Transformers into Self-Explainable Models Without Training | Training-free Keypoint Counting Classifiers that turn Vision Transformers into self-explainable models | foundation model | |
| 11 | Auxiliary Descriptive Knowledge for Few-Shot Adaptation of Vision-Language Model | ADK: auxiliary descriptive knowledge that improves few-shot adaptation of vision-language models | large language model | |
| 12 | ABE-CLIP: Training-Free Attribute Binding Enhancement for Compositional Image-Text Matching | ABE-CLIP: training-free attribute-binding enhancement for compositional image-text matching | multimodal | |

🔬 Pillar 3: Perception & Semantics (8 papers)

| # | Title | One-line Summary | Tags | 🔗 |
|---|---|---|---|---|
| 13 | Generative Human-Object Interaction Detection via Differentiable Cognitive Steering of Multi-modal LLMs | GRASP-HO: generative human-object interaction detection via differentiable cognitive steering of multimodal LLMs | open-vocabulary, human-object interaction | |
| 14 | Chorus: Multi-Teacher Pretraining for Holistic 3D Gaussian Scene Encoding | Chorus: multi-teacher pretraining for holistic 3D Gaussian scene encoding | 3D gaussian splatting, 3DGS, gaussian splatting | |
| 15 | G3Splat: Geometrically Consistent Generalizable Gaussian Splatting | G3Splat: generalizable Gaussian splatting via geometric-consistency priors | gaussian splatting, splatting | |
| 16 | Long-Range depth estimation using learning based Hybrid Distortion Model for CCTV cameras | A learning-based hybrid distortion model for long-range depth estimation with CCTV cameras | depth estimation | |
| 17 | 3D-RE-GEN: 3D Reconstruction of Indoor Scenes with a Generative Framework | 3D-RE-GEN: a generative framework for single-image indoor scene reconstruction that produces artist-editable meshes | scene reconstruction, spatial relationship | |
| 18 | SAVeD: A First-Person Social Media Video Dataset for ADAS-equipped vehicle Near-Miss and Crash Event Analyses | SAVeD: a first-person social-media video dataset for analyzing near-miss and crash events of ADAS-equipped vehicles | depth estimation, monocular depth | |
| 19 | InsertAnywhere: Bridging 4D Scene Geometry and Diffusion Models for Realistic Video Object Insertion | InsertAnywhere: bridges 4D scene geometry and diffusion models for realistic video object insertion | scene understanding | |
| 20 | SynergyWarpNet: Attention-Guided Cooperative Warping for Neural Portrait Animation | SynergyWarpNet: an attention-guided cooperative warping network for neural portrait animation | optical flow | |

🔬 Pillar 2: RL & Architecture (6 papers)

| # | Title | One-line Summary | Tags | 🔗 |
|---|---|---|---|---|
| 21 | Learning When to Look: A Disentangled Curriculum for Strategic Perception in Multimodal Reasoning | A disentangled curriculum-learning framework that addresses forgetting of visual information in multimodal reasoning | reinforcement learning, large language model, multimodal | |
| 22 | Re-Depth Anything: Test-Time Depth Refinement via Self-Supervised Re-lighting | Re-Depth Anything: test-time depth refinement via self-supervised re-lighting | distillation, depth estimation, monocular depth | |
| 23 | FLEG: Feed-Forward Language Embedded Gaussian Splatting from Any Views | FLEG: feed-forward language-embedded Gaussian splatting from arbitrary views | contrastive learning, gaussian splatting, splatting | |
| 24 | Xiaomi MiMo-VL-Miloco Technical Report | MiMo-VL-Miloco: scene understanding for smart-home scenarios | reinforcement learning, multimodal, chain-of-thought | |
| 25 | PhysFire-WM: A Physics-Informed World Model for Emulating Fire Spread Dynamics | PhysFire-WM: a physics-informed world model for emulating fire-spread dynamics | world model, multimodal | |
| 26 | EMAG: Self-Rectifying Diffusion Sampling with Exponential Moving Average Guidance | EMAG: self-rectifying diffusion sampling with exponential moving average guidance, improving generation quality | flow matching, classifier-free guidance | |

🔬 Pillar 6: Video Extraction (2 papers)

| # | Title | One-line Summary | Tags | 🔗 |
|---|---|---|---|---|
| 27 | Mitty: Diffusion-based Human-to-Robot Video Generation | Mitty: diffusion-based Human2Robot video generation for end-to-end robot learning | egocentric, human-to-robot | |
| 28 | ClothHMR: 3D Mesh Recovery of Humans in Diverse Clothing from Single Image | ClothHMR: 3D human mesh recovery under diverse clothing from a single image | human mesh recovery | |

🔬 Pillar 1: Robot Control (1 paper)

| # | Title | One-line Summary | Tags | 🔗 |
|---|---|---|---|---|
| 29 | Dexterous World Models | DWM: dexterous world models enabling interactive digital twins via video diffusion | locomotion, manipulation, world model | |

🔬 Pillar 7: Motion Retargeting (1 paper)

| # | Title | One-line Summary | Tags | 🔗 |
|---|---|---|---|---|
| 30 | Any-Optical-Model: A Universal Foundation Model for Optical Remote Sensing | Any-Optical-Model: a universal foundation model for optical remote sensing across sensors, resolutions, and missing bands | spatial relationship, foundation model | |

🔬 Pillar 4: Generative Motion (1 paper)

| # | Title | One-line Summary | Tags | 🔗 |
|---|---|---|---|---|
| 31 | RadarGen: Automotive Radar Point Cloud Generation from Cameras | RadarGen: an image-conditioned diffusion model for generating automotive radar point clouds | physically plausible, foundation model, multimodal | |
