cs.CV (2025-03-13)

📊 60 papers in total | 🔗 13 with code

🎯 Topic Navigation

Pillar 9: Embodied Foundation Models (21 🔗4) · Pillar 3: Perception & Semantics (16 🔗4) · Pillar 2: RL & Architecture (12 🔗3) · Pillar 1: Robot Control (5) · Pillar 7: Motion Retargeting (2 🔗1) · Pillar 4: Generative Motion (2 🔗1) · Pillar 5: Interaction & Reaction (1) · Pillar 8: Physics-based Animation (1)

🔬 Pillar 9: Embodied Foundation Models (21 papers)

# | Title | One-line Summary | Tags | 🔗
1 | TokenCarve: Information-Preserving Visual Token Compression in Multimodal Large Language Models | TokenCarve: an information-preserving visual token compression method for multimodal large language models | large language model, multimodal
2 | EscapeCraft: A 3D Room Escape Environment for Benchmarking Complex Multimodal Reasoning Ability | Proposes EscapeCraft to benchmark complex multimodal reasoning ability | large language model, multimodal, visual grounding
3 | ChatGPT Encounters Morphing Attack Detection: Zero-Shot MAD with Multi-Modal Large Language Models and General Vision Models | Proposes a zero-shot face morphing attack detection method based on multimodal large language models and general vision models | large language model, multimodal
4 | VisualPRM: An Effective Process Reward Model for Multimodal Reasoning | Proposes VisualPRM, an effective multimodal process reward model that improves MLLM reasoning | large language model, multimodal
5 | Style Evolving along Chain-of-Thought for Unknown-Domain Object Detection | Proposes a chain-of-thought-based style evolution method that improves generalization of unknown-domain object detection across complex styles | multimodal, chain-of-thought
6 | DriveLMM-o1: A Step-by-Step Reasoning Dataset and Large Multimodal Model for Driving Scenario Understanding | Proposes the DriveLMM-o1 dataset and a large multimodal model for step-by-step reasoning in driving scenario understanding | multimodal
7 | VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search | VisualWebInstruct: scales up multimodal instruction data through web search, improving vision-language model reasoning | multimodal
8 | Interactive Multimodal Fusion with Temporal Modeling | Proposes an interactive multimodal fusion method with temporal modeling for in-the-wild valence-arousal estimation | multimodal
9 | A Multimodal Fusion Model Leveraging MLP Mixer and Handcrafted Features-based Deep Learning Networks for Facial Palsy Detection | Proposes a multimodal facial palsy detection model fusing MLP Mixer and handcrafted-feature deep networks | multimodal
10 | Proxy-Tuning: Tailoring Multimodal Autoregressive Models for Subject-Driven Image Generation | Proposes Proxy-Tuning, which uses diffusion models to strengthen autoregressive models for subject-driven image generation | multimodal
11 | PiSA: A Self-Augmented Data Engine and Training Strategy for 3D Understanding with Large Models | Proposes PiSA-Engine to generate high-quality 3D spatial-semantic instruction data, improving the understanding ability of 3D large models | large language model, multimodal
12 | CINEMA: Coherent Multi-Subject Video Generation via MLLM-Based Guidance | CINEMA: a coherent multi-subject video generation framework guided by MLLMs | large language model, multimodal
13 | Hybrid Agents for Image Restoration | Proposes HybridAgent, which fuses multiple image restoration modes for intelligent, efficient user interaction | large language model, multimodal
14 | UVE: Are MLLMs Unified Evaluators for AI-Generated Videos? | Proposes UVE-Bench to explore whether MLLMs can serve as unified evaluators of AI-generated videos | large language model, multimodal
15 | UniGoal: Towards Universal Zero-shot Goal-oriented Navigation | UniGoal: a universal zero-shot goal-oriented navigation framework that handles multiple goal types uniformly | large language model
16 | Unifying 2D and 3D Vision-Language Understanding | Proposes UniVLG, which unifies 2D and 3D vision-language understanding and improves 3D scene understanding performance | language conditioned
17 | Taxonomic Reasoning for Rare Arthropods: Combining Dense Image Captioning and RAG for Interpretable Classification | Combines dense image captioning with RAG to improve the accuracy and interpretability of rare arthropod classification | large language model
18 | TruthPrInt: Mitigating LVLM Object Hallucination Via Latent Truthful-Guided Pre-Intervention | TruthPrInt: mitigates object hallucination in LVLMs via latent truthful-guided pre-intervention | large language model
19 | IDEA: Inverted Text with Cooperative Deformable Aggregation for Multi-modal Object Re-Identification | Proposes the IDEA framework, which uses inverted text and cooperative deformable aggregation for multimodal object re-identification | large language model
20 | Unveiling the Invisible: Reasoning Complex Occlusions Amodally with AURA | Proposes the AURA model for amodal reasoning segmentation in complex occlusion scenarios | large language model
21 | Singular Value Fine-tuning for Few-Shot Class-Incremental Learning | Proposes SVFCL, which uses singular value fine-tuning to mitigate overfitting in few-shot class-incremental learning | foundation model

🔬 Pillar 3: Perception & Semantics (16 papers)

# | Title | One-line Summary | Tags | 🔗
22 | 4D LangSplat: 4D Language Gaussian Splatting via Multimodal Large Language Models | Proposes 4D LangSplat, achieving 4D language Gaussian splatting for dynamic scenes via multimodal large language models | gaussian splatting, splatting, open-vocabulary
23 | MuDG: Taming Multi-modal Diffusion with Gaussian Splatting for Urban Scene Reconstruction | MuDG: tames multimodal diffusion with Gaussian splatting for urban scene reconstruction | 3DGS, gaussian splatting, splatting
24 | VicaSplat: A Single Run is All You Need for 3D Gaussian Splatting and Camera Estimation from Unposed Video Frames | VicaSplat: a single run suffices for 3D Gaussian splatting reconstruction and camera estimation from unposed video frames | 3D gaussian splatting, gaussian splatting, splatting
25 | OVTR: End-to-End Open-Vocabulary Multiple Object Tracking with Transformer | Proposes OVTR, the first end-to-end open-vocabulary multiple object tracking Transformer | open-vocabulary, open vocabulary, multimodal
26 | OSMa-Bench: Evaluating Open Semantic Mapping Under Varying Lighting Conditions | OSMa-Bench: an automated LLM/LVLM-based pipeline for evaluating open semantic mapping algorithms under varying lighting conditions | semantic mapping, semantic map, ConceptGraphs
27 | RI3D: Few-Shot Gaussian Splatting With Repair and Inpainting Diffusion Priors | RI3D: few-shot Gaussian splatting with repair and inpainting diffusion priors | 3DGS, gaussian splatting, splatting
28 | Flow-NeRF: Joint Learning of Geometry, Poses, and Dense Flow within Unified Neural Representations | Proposes Flow-NeRF to address scene reconstruction without pose priors | depth estimation, NeRF, neural radiance field
29 | GaussHDR: High Dynamic Range Gaussian Splatting via Learning Unified 3D and 2D Local Tone Mapping | GaussHDR: achieves high dynamic range Gaussian splatting by learning unified 3D and 2D local tone mapping | 3D gaussian splatting, gaussian splatting, splatting
30 | 3D Student Splatting and Scooping | Proposes Student Splatting and Scooping (SSS), improving the expressiveness and parameter efficiency of 3D Gaussian splatting | 3D gaussian splatting, 3DGS, gaussian splatting
31 | LHM: Large Animatable Human Reconstruction Model from a Single Image in Seconds | Proposes LHM: a large model for fast animatable human reconstruction from a single image | 3D gaussian splatting, gaussian splatting, splatting
32 | TARS: Traffic-Aware Radar Scene Flow Estimation | TARS: traffic-aware radar scene flow estimation, improving autonomous driving perception | scene understanding, scene flow
33 | The Power of One: A Single Example is All it Takes for Segmentation in VLMs | Fine-tuning on a single example significantly improves the segmentation performance of vision-language models | open-vocabulary, open vocabulary, multimodal
34 | ROODI: Reconstructing Occluded Objects with Denoising Inpainters | ROODI: reconstructs occluded objects in 3D Gaussian splatting scenes using denoising inpainters | 3D gaussian splatting, gaussian splatting, splatting
35 | ST-FlowNet: An Efficient Spiking Neural Network for Event-Based Optical Flow Estimation | Proposes ST-FlowNet, an efficient spiking neural network for event-based optical flow estimation | optical flow
36 | MouseGPT: A Large-scale Vision-Language Model for Mouse Behavior Analysis | MouseGPT: a large-scale vision-language model for mouse behavior analysis | open-vocabulary, open vocabulary
37 | Speedy MASt3R | Speedy MASt3R: accelerates image matching via post-training optimization, enabling real-time 3D scene understanding | scene reconstruction

🔬 Pillar 2: RL & Architecture (12 papers)

# | Title | One-line Summary | Tags | 🔗
38 | A Hierarchical Semantic Distillation Framework for Open-Vocabulary Object Detection | Proposes the HD-OVD framework, improving open-vocabulary object detection via hierarchical semantic distillation | distillation, open-vocabulary, open vocabulary
39 | R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization | R1-Onevision: advances generalized multimodal reasoning through cross-modal formalization | reinforcement learning, large language model, multimodal
40 | RoMA: Scaling up Mamba-based Foundation Models for Remote Sensing | RoMA: scales up Mamba-based foundation models for remote sensing, improving high-resolution image processing | Mamba, foundation model
41 | Multi-Modal Mamba Modeling for Survival Prediction (M4Survive): Adapting Joint Foundation Model Representations | M4Survive: a multimodal Mamba-based survival prediction model fusing medical imaging and pathology information | Mamba, foundation model
42 | Trajectory Mamba: Efficient Attention-Mamba Forecasting Model Based on Selective SSM | Proposes Trajectory Mamba, an efficient trajectory forecasting model based on selective SSMs | Mamba, SSM
43 | HiCMamba: Enhancing Hi-C Resolution and Identifying 3D Genome Structures with State Space Modeling | HiCMamba: uses state space modeling to enhance Hi-C resolution and identify 3D genome structures | Mamba, state space model
44 | MoFlow: One-Step Flow Matching for Human Trajectory Forecasting via Implicit Maximum Likelihood Estimation based Distillation | MoFlow: one-step flow matching for human trajectory forecasting via implicit-maximum-likelihood-based distillation | flow matching, distillation
45 | Mamba-VA: A Mamba-based Approach for Continuous Emotion Recognition in Valence-Arousal Space | Proposes the Mamba-VA model, using the Mamba architecture for continuous emotion recognition and improving emotion modeling in valence-arousal space | Mamba, masked autoencoder, MAE
46 | Enhancing Multi-Agent Systems via Reinforcement Learning with LLM-based Planner and Graph-based Policy | Proposes the LGC-MARL framework, combining an LLM planner with a graph-based policy to improve multi-agent collaboration on complex tasks | reinforcement learning, large language model
47 | OuroMamba: A Data-Free Quantization Framework for Vision Mamba | OuroMamba: the first data-free quantization framework for Vision Mamba models | Mamba, contrastive learning
48 | Technical Approach for the EMI Challenge in the 8th Affective Behavior Analysis in-the-Wild Competition | Proposes a two-stage cross-modal alignment framework, improving in-the-wild emotional mimicry intensity estimation | contrastive learning, multimodal
49 | Studying Classifier(-Free) Guidance From a Classifier-Centric Perspective | Studies classifier(-free) guidance for conditional generation in diffusion models from a classifier-centric perspective | flow matching, classifier-free guidance

🔬 Pillar 1: Robot Control (5 papers)

# | Title | One-line Summary | Tags | 🔗
50 | HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model | HybridVLA: a unified vision-language-action model combining diffusion and autoregression | manipulation, vision-language-action, VLA
51 | NIL: No-data Imitation Learning by Leveraging Pre-trained Video Diffusion Models | NIL: no-data imitation learning that leverages pre-trained video diffusion models to improve robot motor skills | quadruped, humanoid, humanoid robot
52 | 6D Object Pose Tracking in Internet Videos for Robotic Manipulation | Proposes a prior-free method for 6D object pose tracking in internet videos for robotic manipulation | manipulation, trajectory optimization, 6D pose estimation
53 | AdvPaint: Protecting Images from Inpainting Manipulation via Adversarial Attention Disruption | Proposes AdvPaint, which protects images from diffusion-based inpainting manipulation by adversarially disrupting attention | manipulation
54 | Towards Fast, Memory-based and Data-Efficient Vision-Language Policy | LiteVLP: a fast, memory-based, data-efficient vision-language policy model | manipulation

🔬 Pillar 7: Motion Retargeting (2 papers)

# | Title | One-line Summary | Tags | 🔗
55 | GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing | Proposes the GoT framework to address insufficient reasoning in image generation and editing | spatial relationship, large language model, multimodal
56 | PanoGen++: Domain-Adapted Text-Guided Panoramic Environment Generation for Vision-and-Language Navigation | PanoGen++: domain-adapted text-guided panoramic environment generation for vision-and-language navigation | spatial relationship, VLN

🔬 Pillar 4: Generative Motion (2 papers)

# | Title | One-line Summary | Tags | 🔗
57 | VMBench: A Benchmark for Perception-Aligned Video Motion Generation | VMBench: a perception-aligned benchmark for evaluating video motion generation | motion generation
58 | Cosh-DiT: Co-Speech Gesture Video Synthesis via Hybrid Audio-Visual Diffusion Transformers | Proposes Cosh-DiT, realistic co-speech gesture video synthesis via hybrid audio-visual diffusion Transformers | VQ-VAE

🔬 Pillar 5: Interaction & Reaction (1 paper)

# | Title | One-line Summary | Tags | 🔗
59 | Hoi2Threat: An Interpretable Threat Detection Method for Human Violence Scenarios Guided by Human-Object Interaction | Hoi2Threat: an interpretable threat detection method for human violence scenarios guided by human-object interaction | human-object interaction, HOI, multimodal

🔬 Pillar 8: Physics-based Animation (1 paper)

# | Title | One-line Summary | Tags | 🔗
60 | Lightweight Models for Emotional Analysis in Video | Proposes lightweight emotional analysis models based on MobileNetV4 and a multi-scale 3D MLP-Mixer | spatiotemporal
