cs.CV (2025-12-11)

📊 52 papers in total | 🔗 10 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (18 🔗3) · Pillar 2: RL & Architecture (13 🔗3) · Pillar 3: Perception & Semantics (12 🔗2) · Pillar 7: Motion Retargeting (3 🔗1) · Pillar 4: Generative Motion (2 🔗1) · Pillar 1: Robot Control (2) · Pillar 8: Physics-based Animation (2)

🔬 Pillar 9: Embodied Foundation Models (18 papers)

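# | Title | One-line summary | Tags | 🔗
1 | Visual Funnel: Resolving Contextual Blindness in Multimodal Large Language Models | Proposes Visual Funnel to resolve contextual blindness in multimodal large language models | large language model, multimodal
2 | VGent: Visual Grounding via Modular Design for Disentangling Reasoning and Prediction | Proposes VGent, which achieves efficient visual grounding through a modular design that decouples reasoning from prediction | large language model, multimodal, visual grounding
3 | Efficient-VLN: A Training-Efficient Vision-Language Navigation Model | Efficient-VLN: a training-efficient vision-language navigation model that significantly reduces training overhead | VLN, large language model, multimodal
4 | BabyVLM-V2: Toward Developmentally Grounded Pretraining and Benchmarking of Vision Foundation Models | BabyVLM-V2: a developmentally grounded pretraining and evaluation framework for vision foundation models | foundation model, multimodal
5 | Blink: Dynamic Visual Token Resolution for Enhanced Multimodal Understanding | Blink: a dynamic visual token resolution method for multimodal understanding | large language model, multimodal
6 | Information-driven Fusion of Pathology Foundation Models for Enhanced Disease Characterization | Proposes an information-driven method for fusing pathology foundation models to improve disease characterization | foundation model
7 | DuetSVG: Unified Multimodal SVG Generation with Internal Visual Guidance | DuetSVG: a unified multimodal SVG generation model that uses internal visual guidance to improve generation quality | multimodal
8 | SoccerMaster: A Vision Foundation Model for Soccer Understanding | Proposes SoccerMaster, a vision foundation model for soccer that handles multiple soccer understanding tasks in a unified way | foundation model
9 | MultiHateLoc: Towards Temporal Localisation of Multimodal Hate Content in Online Videos | Proposes the MultiHateLoc framework for weakly supervised temporal localization of multimodal hate content in online videos | multimodal
10 | EchoingPixels: Cross-Modal Adaptive Token Reduction for Efficient Audio-Visual LLMs | Proposes EchoingPixels, which improves the efficiency of audio-visual LLMs through cross-modal adaptive token reduction | large language model, multimodal
11 | Vision-Language Models for Infrared Industrial Sensing in Additive Manufacturing Scene Description | VLM-IRIS: a zero-shot vision-language model framework for infrared industrial sensing in additive manufacturing | foundation model
12 | Image Tiling for High-Resolution Reasoning: Balancing Local Detail with Global Context | Reproduces and analyzes image-tiling-based high-resolution vision-language models, examining the effect of global context | multimodal
13 | AlcheMinT: Fine-grained Temporal Control for Multi-Reference Consistent Video Generation | AlcheMinT: a fine-grained temporal control method for multi-reference consistent video generation | TAMP
14 | FoundationMotion: Auto-Labeling and Reasoning about Spatial Movement in Videos | FoundationMotion: proposes an auto-labeling and reasoning framework that improves understanding of spatial movement in videos | large language model
15 | MMSI-Video-Bench: A Holistic Benchmark for Video-Based Spatial Intelligence | MMSI-Video-Bench: a benchmark for evaluating video-based spatial intelligence in multimodal large models | chain-of-thought
16 | Leveraging Text Guidance for Enhancing Demographic Fairness in Gender Classification | Proposes a text-guided method to improve the demographic fairness of facial gender classification algorithms | multimodal
17 | PoseGAM: Robust Unseen Object Pose Estimation via Geometry-Aware Multi-View Reasoning | PoseGAM: robust pose estimation for unseen objects via geometry-aware multi-view reasoning | foundation model
18 | CoSPlan: Corrective Sequential Planning via Scene Graph Incremental Updates | Proposes CoSPlan, a corrective sequential planning method based on incremental scene graph updates, to improve VLM reasoning on complex tasks | chain-of-thought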

🔬 Pillar 2: RL & Architecture (13 papers)

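# | Title | One-line summary | Tags | 🔗
19 | Grounding Everything in Tokens for Multimodal Large Language Models | Proposes GETok, which uses learnable tokens to strengthen an MLLM's ability to localize objects in 2D space | reinforcement learning, spatial relationship, large language model
20 | Latent Chain-of-Thought World Modeling for End-to-End Driving | Proposes Latent-CoT-Drive, which uses latent-space chain-of-thought for end-to-end autonomous driving decisions | reinforcement learning, world model, vision-language-action
21 | LDP: Parameter-Efficient Fine-Tuning of Multimodal LLM for Medical Report Generation | Proposes the LDP framework for efficiently fine-tuning multimodal LLMs for medical report generation, significantly reducing computational cost | DPO, direct preference optimization, large language model
22 | ConStruct: Structural Distillation of Foundation Models for Prototype-Based Weakly Supervised Histopathology Segmentation | ConStruct: weakly supervised histopathology segmentation using structural distillation and prototype learning | distillation, foundation model
23 | WorldLens: Full-Spectrum Evaluations of Driving World Models in Real World | WorldLens: a comprehensive real-world evaluation benchmark for driving world models that measures how generated worlds actually behave | world model, geometric consistency, embodied AI
24 | StainNet: Scaling Self-Supervised Foundation Models on Immunohistochemistry and Special Stains for Computational Pathology | StainNet: a self-supervised pretrained model for computational pathology, targeting immunohistochemistry and special stains | distillation, foundation model
25 | E-RayZer: Self-supervised 3D Reconstruction as Spatial Visual Pre-training | E-RayZer: proposes a self-supervised 3D reconstruction framework that serves as a spatial visual pre-training model | visual pre-training, VGGT, foundation model
26 | VDAWorld: World Modelling via VLM-Directed Abstraction and Simulation | VDAWorld: proposes a world modeling framework based on VLM-directed abstraction and simulation | world model, latent dynamics
27 | Weakly Supervised Tuberculosis Localization in Chest X-rays through Knowledge Distillation | Uses knowledge distillation for weakly supervised tuberculosis localization in chest X-rays | teacher-student, distillation
28 | Hybrid Transformer-Mamba Architecture for Weakly Supervised Volumetric Medical Segmentation | Proposes TranSamba, a hybrid Transformer-Mamba architecture for weakly supervised volumetric medical image segmentation | Mamba, state space model
29 | Fast-FoundationStereo: Real-Time Zero-Shot Stereo Matching | Fast-FoundationStereo: real-time zero-shot stereo matching that balances speed and generalization | distillation, foundation model
30 | Reading or Reasoning? Format Decoupled Reinforcement Learning for Document OCR | Proposes format-decoupled reinforcement learning (FD-RL) to improve document OCR models' recognition of text with complex formatting | reinforcement learning
31 | TransLocNet: Cross-Modal Attention for Aerial-Ground Vehicle Localization with Contrastive Learning | TransLocNet: UAV-ground vehicle localization based on cross-modal attention and contrastive learning | contrastive learning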

🔬 Pillar 3: Perception & Semantics (12 papers)

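# | Title | One-line summary | Tags | 🔗
32 | Breaking the Vicious Cycle: Coherent 3D Gaussian Splatting from Sparse and Motion-Blurred Views | Proposes CoherentGS to address 3D reconstruction from sparse and motion-blurred views | 3D gaussian splatting, 3DGS, gaussian splatting
33 | Omni-Attribute: Open-vocabulary Attribute Encoder for Visual Concept Personalization | Proposes Omni-Attribute, an open-vocabulary attribute encoder for visual concept personalization | open-vocabulary, open vocabulary
34 | GaussianHeadTalk: Wobble-Free 3D Talking Heads with Audio Driven Gaussian Splatting | Proposes GaussianHeadTalk, which uses audio-driven Gaussian splatting to generate wobble-free 3D talking heads | gaussian splatting, splatting
35 | Geo6DPose: Fast Zero-Shot 6D Object Pose Estimation via Geometry-Filtered Feature Matching | Geo6DPose: fast zero-shot 6D object pose estimation based on geometry-filtered feature matching | 6D pose estimation, feature matching, foundation model
36 | RaLiFlow: Scene Flow Estimation with 4D Radar and LiDAR Point Clouds | Proposes RaLiFlow, the first scene flow estimation framework based on 4D radar and LiDAR point clouds | scene flow, multimodal
37 | Error-Propagation-Free Learned Video Compression With Dual-Domain Progressive Temporal Alignment | Proposes a learned video compression framework free of error propagation, using dual-domain progressive temporal alignment | optical flow, motion estimation
38 | Empowering Dynamic Urban Navigation with Stereo and Mid-Level Vision | StereoWalker: combines stereo vision and mid-level vision to improve dynamic urban navigation performance | depth estimation, foundation model
39 | Mull-Tokens: Modality-Agnostic Latent Thinking | Proposes Mull-Tokens, modality-agnostic latent representations for improving multimodal reasoning | affordance, multimodal
40 | Any4D: Unified Feed-Forward Metric 4D Reconstruction | Any4D: a unified feed-forward metric 4D reconstruction framework that supports multimodal inputs | scene flow, egocentric
41 | VL-JEPA: Joint Embedding Predictive Architecture for Vision-language | VL-JEPA: a joint embedding predictive architecture for vision-language, with fewer parameters and stronger performance | open-vocabulary, open vocabulary
42 | Video Depth Propagation | VeloDepth: proposes an efficient, robust video depth propagation method for real-time depth estimation | depth estimation, spatiotemporal
43 | Robust Shape from Focus via Multiscale Directional Dilated Laplacian and Recurrent Network | Proposes a robust shape-from-focus method based on a multiscale directional dilated Laplacian and a recurrent network | depth estimation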

🔬 Pillar 7: Motion Retargeting (3 papers)

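# | Title | One-line summary | Tags | 🔗
44 | Tool-Augmented Spatiotemporal Reasoning for Streamlining Video Question Answering Task | Proposes a tool-augmented spatiotemporal reasoning framework that improves MLLM performance on video question answering | spatial relationship, spatiotemporal, large language model
45 | Point2Pose: A Generative Framework for 3D Human Pose Estimation with Multi-View Point Cloud Dataset | Proposes the Point2Pose generative framework, which uses multi-view point cloud data for 3D human pose estimation | human motion
46 | StereoSpace: Depth-Free Synthesis of Stereo Geometry via End-to-End Diffusion in a Canonical Space | StereoSpace: proposes a diffusion-based, depth-free method for generating stereo images from monocular images | geometric consistency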

🔬 Pillar 4: Generative Motion (2 papers)

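# | Title | One-line summary | Tags | 🔗
47 | IRG-MotionLLM: Interleaving Motion Generation, Assessment and Refinement for Text-to-Motion Generation | Proposes IRG-MotionLLM, which improves text-to-motion generation by interleaving motion generation, assessment, and refinement | text-to-motion, motion generation, large language model
48 | Topology-Agnostic Animal Motion Generation from Text Prompt | Proposes the OmniZoo dataset and a topology-agnostic animal motion generation framework, addressing heterogeneous skeletons and text-driven animal motion generation | text-driven motion, motion generation, physically plausible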

🔬 Pillar 1: Robot Control (2 papers)

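# | Title | One-line summary | Tags | 🔗
49 | TriDF: Evaluating Perception, Detection, and Hallucination for Interpretable DeepFake Detection | TriDF: a comprehensive benchmark for interpretable DeepFake detection that evaluates perception, detection, and hallucination | manipulation, large language model, multimodal
50 | XDen-1K: A Density Field Dataset of Real-World Objects | XDen-1K: the first large-scale density field dataset of real-world objects, supporting robotic manipulation and physical simulation | manipulation, embodied AI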

🔬 Pillar 8: Physics-based Animation (2 papers)

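# | Title | One-line summary | Tags | 🔗
51 | Data-Efficient American Sign Language Recognition via Few-Shot Prototypical Networks | Proposes an American Sign Language recognition method based on few-shot prototypical networks to address data scarcity | spatiotemporal
52 | 3D Blood Pulsation Maps | Proposes the Pulse3DFace dataset for estimating 3D blood pulsation maps, supporting research on remote pulse estimation | PULSE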
