cs.CV (2025-12-19)

📊 36 papers in total | 🔗 7 with code

🎯 Navigation by Interest Area

Pillar 9: Embodied Foundation Models (13 🔗1) · Pillar 3: Perception & Semantics (9 🔗1) · Pillar 2: RL & Architecture (8 🔗3) · Pillar 6: Video Extraction (2 🔗1) · Pillar 4: Generative Motion (2 🔗1) · Pillar 1: Robot Control (1) · Pillar 7: Motion Retargeting (1)

🔬 Pillar 9: Embodied Foundation Models (13 papers)

| # | Title | Key Point | Tags |
|---|-------|-----------|------|
| 1 | FPBench: A Comprehensive Benchmark of Multimodal Large Language Models for Fingerprint Analysis | FPBench: the first comprehensive benchmark for multimodal large language models on fingerprint analysis | large language model, foundation model, multimodal |
| 2 | GroundingME: Exposing the Visual Grounding Gap in MLLMs through Multi-Dimensional Evaluation | GroundingME: exposes the visual grounding gap in multimodal LLMs through multi-dimensional evaluation | large language model, multimodal, visual grounding |
| 3 | Adversarial Robustness of Vision in Open Foundation Models | Shows that the visual modality is an effective attack surface for open vision-language models (VLMs), and that robustness does not correlate directly with benchmark performance | foundation model |
| 4 | PathFLIP: Fine-grained Language-Image Pretraining for Versatile Computational Pathology | PathFLIP: fine-grained language-image pretraining for versatile computational pathology | large language model, multimodal, instruction following |
| 5 | Foundation Model Priors Enhance Object Focus in Feature Space for Source-Free Object Detection | FALCON-SFOD: uses foundation-model priors to enhance object focus in feature space for source-free object detection | foundation model |
| 6 | MULTIAQUA: A multimodal maritime dataset and robust training strategies for multimodal semantic segmentation | Introduces the MULTIAQUA multimodal maritime dataset and explores robust training strategies for multimodal semantic segmentation | multimodal |
| 7 | HeadHunt-VAD: Hunting Robust Anomaly-Sensitive Heads in MLLM for Tuning-Free Video Anomaly Detection | HeadHunt-VAD: finds robust anomaly-sensitive heads in MLLMs for tuning-free video anomaly detection | large language model, multimodal |
| 8 | Robust-R1: Degradation-Aware Reasoning for Robust Visual Understanding | Proposes the Robust-R1 framework, which models visual degradation explicitly for robust visual understanding | large language model, multimodal |
| 9 | A Benchmark for Ultra-High-Resolution Remote Sensing MLLMs | Proposes RSHR-Bench: a benchmark for ultra-high-resolution remote-sensing multimodal LLMs | large language model, multimodal |
| 10 | Deep But Reliable: Advancing Multi-turn Reasoning for Thinking with Images | DRIM: improves multi-turn self-reflective reasoning of vision-language models when thinking with images | multimodal, chain-of-thought |
| 11 | Keypoint Counting Classifiers: Turning Vision Transformers into Self-Explainable Models Without Training | Proposes training-free Keypoint Counting Classifiers that turn ViTs into self-explainable models | foundation model |
| 12 | Auxiliary Descriptive Knowledge for Few-Shot Adaptation of Vision-Language Model | Proposes Auxiliary Descriptive Knowledge (ADK) to improve few-shot adaptation of vision-language models | large language model |
| 13 | ABE-CLIP: Training-Free Attribute Binding Enhancement for Compositional Image-Text Matching | Proposes ABE-CLIP, a training-free method that enhances CLIP's attribute binding for compositional image-text matching | multimodal |

🔬 Pillar 3: Perception & Semantics (9 papers)

| # | Title | Key Point | Tags |
|---|-------|-----------|------|
| 14 | Generative Human-Object Interaction Detection via Differentiable Cognitive Steering of Multi-modal LLMs | Proposes GRASP-HO for open-vocabulary human-object interaction detection | open-vocabulary, open vocabulary, human-object interaction |
| 15 | Chorus: Multi-Teacher Pretraining for Holistic 3D Gaussian Scene Encoding | Proposes Chorus, multi-teacher pretraining for holistic encoding of 3D Gaussian scenes | 3D gaussian splatting, 3DGS, gaussian splatting |
| 16 | G3Splat: Geometrically Consistent Generalizable Gaussian Splatting | Proposes G3Splat to address geometric consistency in generalizable Gaussian splatting | gaussian splatting, splatting |
| 17 | Name That Part: 3D Part Segmentation and Naming | Proposes ALIGN-Parts, 3D part segmentation and naming via set alignment | open-vocabulary, open vocabulary, affordance |
| 18 | Long-Range depth estimation using learning based Hybrid Distortion Model for CCTV cameras | Proposes a learning-based hybrid distortion model for long-range depth estimation with CCTV cameras | depth estimation |
| 19 | 3D-RE-GEN: 3D Reconstruction of Indoor Scenes with a Generative Framework | 3D-RE-GEN: a generative framework for single-image indoor scene reconstruction that yields artist-editable meshes | scene reconstruction, spatial relationship |
| 20 | SAVeD: A First-Person Social Media Video Dataset for ADAS-equipped vehicle Near-Miss and Crash Event Analyses | SAVeD: the first first-person social-media video dataset for near-miss and crash event analysis of ADAS-equipped vehicles | depth estimation, monocular depth |
| 21 | InsertAnywhere: Bridging 4D Scene Geometry and Diffusion Models for Realistic Video Object Insertion | InsertAnywhere: bridges 4D scene geometry and diffusion models for realistic video object insertion | scene understanding |
| 22 | SynergyWarpNet: Attention-Guided Cooperative Warping for Neural Portrait Animation | SynergyWarpNet: attention-guided cooperative warping for neural portrait animation | optical flow |

🔬 Pillar 2: RL & Architecture (8 papers)

| # | Title | Key Point | Tags |
|---|-------|-----------|------|
| 23 | Learning When to Look: A Disentangled Curriculum for Strategic Perception in Multimodal Reasoning | Proposes a disentangled curriculum framework that mitigates forgetting of visual information in multimodal reasoning | reinforcement learning, large language model, multimodal |
| 24 | Re-Depth Anything: Test-Time Depth Refinement via Self-Supervised Re-lighting | Re-Depth Anything: test-time depth refinement via self-supervised re-lighting | distillation, depth estimation, monocular depth |
| 25 | FLEG: Feed-Forward Language Embedded Gaussian Splatting from Any Views | FLEG: feed-forward language-embedded Gaussian splatting reconstruction from arbitrary views | contrastive learning, gaussian splatting, splatting |
| 26 | SERA-H: Beyond Native Sentinel Spatial Limits for High-Resolution Canopy Height Mapping | SERA-H: goes beyond the native spatial resolution of Sentinel satellites for high-resolution canopy height mapping | MAE, height map, spatiotemporal |
| 27 | SAM Audio: Segment Anything in Audio | SAM Audio: a general-purpose audio segmentation foundation model supporting text, visual, and temporal prompts | flow matching, foundation model, multimodal |
| 28 | Xiaomi MiMo-VL-Miloco Technical Report | Xiaomi releases MiMo-VL-Miloco-7B, a vision-language model focused on smart-home scenarios | reinforcement learning, multimodal, chain-of-thought |
| 29 | PhysFire-WM: A Physics-Informed World Model for Emulating Fire Spread Dynamics | Proposes PhysFire-WM, a physics-informed world model that emulates fire-spread dynamics with improved predictive accuracy | world model, multimodal |
| 30 | EMAG: Self-Rectifying Diffusion Sampling with Exponential Moving Average Guidance | Proposes EMAG, a self-rectifying diffusion sampling method with exponential moving average guidance that improves generation quality | flow matching, classifier-free guidance |

🔬 Pillar 6: Video Extraction (2 papers)

| # | Title | Key Point | Tags |
|---|-------|-----------|------|
| 31 | Mitty: Diffusion-based Human-to-Robot Video Generation | Mitty: diffusion-based human-to-robot video generation for end-to-end robot learning | egocentric, human-to-robot |
| 32 | ClothHMR: 3D Mesh Recovery of Humans in Diverse Clothing from Single Image | Proposes ClothHMR for 3D mesh recovery of humans in diverse clothing from a single image | human mesh recovery |

🔬 Pillar 4: Generative Motion (2 papers)

| # | Title | Key Point | Tags |
|---|-------|-----------|------|
| 33 | Diffusion Forcing for Multi-Agent Interaction Sequence Modeling | Proposes MAGNet, which uses diffusion models and Transformers for multi-agent interaction sequence modeling | motion generation, multi-person interaction |
| 34 | RadarGen: Automotive Radar Point Cloud Generation from Cameras | RadarGen: a diffusion-based method that generates automotive radar point clouds from camera images | physically plausible, foundation model, multimodal |

🔬 Pillar 1: Robot Control (1 paper)

| # | Title | Key Point | Tags |
|---|-------|-----------|------|
| 35 | Dexterous World Models | Proposes Dexterous World Models (DWM), interactive digital twins built on video diffusion | locomotion, manipulation, world model |

🔬 Pillar 7: Motion Retargeting (1 paper)

| # | Title | Key Point | Tags |
|---|-------|-----------|------|
| 36 | Any-Optical-Model: A Universal Foundation Model for Optical Remote Sensing | Proposes Any-Optical-Model, a universal foundation model for optical remote sensing that generalizes across sensors and resolutions | spatial relationship, foundation model |
