cs.CV (2025-05-27)

📊 67 papers in total | 🔗 24 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (24 🔗8) · Pillar 2: RL & Architecture (18 🔗6) · Pillar 3: Perception & Semantics (10 🔗5) · Pillar 1: Robot Control (5 🔗2) · Pillar 7: Motion Retargeting (3) · Pillar 6: Video Extraction (3 🔗2) · Pillar 4: Generative Motion (2) · Pillar 5: Interaction & Reaction (1) · Pillar 8: Physics-based Animation (1 🔗1)

🔬 Pillar 9: Embodied Foundation Models (24 papers)

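| # | Title | One-sentence summary | Tags |
|---|---|---|---|
| 1 | GeoLLaVA-8K: Scaling Remote-Sensing Multimodal Large Language Models to 8K Resolution | GeoLLaVA-8K: proposes the first remote-sensing multimodal large language model at 8K resolution, addressing the difficulty of processing ultra-high-resolution imagery. | large language model, foundation model, multimodal |
| 2 | DynamicVL: Benchmarking Multimodal Large Language Models for Dynamic City Understanding | Proposes DVL-Suite to evaluate multimodal large language models on dynamic city understanding, and builds DVLChat to improve performance. | large language model, multimodal |
| 3 | Fork-Merge Decoding: Enhancing Multimodal Understanding in Audio-Visual Large Language Models | Proposes Fork-Merge Decoding to address modality bias in audio-visual large language models. | large language model, multimodal |
| 4 | MMTBENCH: A Unified Benchmark for Complex Multimodal Table Reasoning | MMTBENCH: a unified benchmark for complex multimodal table reasoning. | large language model, multimodal |
| 5 | MME-VideoOCR: Evaluating OCR-Based Capabilities of Multimodal LLMs in Video Scenarios | Proposes MME-VideoOCR to address insufficient OCR capability in video scenarios. | large language model, multimodal |
| 6 | Think Twice, Act Once: Token-Aware Compression and Action Reuse for Efficient Inference in Vision-Language-Action Models | FlashVLA: an efficient inference framework for VLA models based on token-aware compression and action reuse. | vision-language-action, VLA |
| 7 | AVCD: Mitigating Hallucinations in Audio-Visual Large Language Models through Contrastive Decoding | Proposes AVCD, which mitigates hallucinations in audio-visual large language models through contrastive decoding. | large language model, multimodal |
| 8 | EaqVLA: Encoding-aligned Quantization for Vision-Language-Action Models | Proposes the EaqVLA framework, which addresses encoding alignment in VLA model quantization and improves end-to-end control performance. | vision-language-action, VLA |
| 9 | Music's Multimodal Complexity in AVQA: Why We Need More than General Multimodal LLMs | Reveals the limitations of general multimodal LLMs on music AVQA and stresses the importance of domain-specific methods. | large language model, multimodal |
| 10 | Leveraging Large Language Models in Visual Speech Recognition: Model Scaling, Context-Aware Decoding, and Iterative Polishing | Uses large language models to improve visual speech recognition through model scaling, context-aware decoding, and iterative polishing. | large language model |
| 11 | Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers | Paper2Poster proposes a multimodal framework for automatic poster generation, addressing the difficulty of making posters from scientific papers. | multimodal |
| 12 | HoliTom: Holistic Token Merging for Fast Video Large Language Models | HoliTom: a holistic token merging method for fast video large language models. | large language model |
| 13 | PARTONOMY: Large Multimodal Models with Part-Level Visual Understanding | Proposes the PARTONOMY benchmark and the PLUM model to improve part-level visual understanding in large models. | multimodal |
| 14 | Understand, Think, and Answer: Advancing Visual Reasoning with Large Multimodal Models | Proposes Griffon-R, which improves LMM performance on complex visual reasoning tasks through a unified visual reasoning mechanism. | multimodal |
| 15 | MedBridge: Bridging Foundation Vision-Language Models to Medical Image Diagnosis in Chest X-Ray | MedBridge: bridges vision-language foundation models to medical image diagnosis on chest X-rays. | foundation model, multimodal |
| 16 | Think Before You Diffuse: Infusing Physical Rules into Video Diffusion | DiffPhy: a video diffusion model that incorporates physical rules to improve the physical realism of generated videos. | large language model, multimodal |
| 17 | Adversarial Attacks against Closed-Source MLLMs via Feature Optimal Alignment | Proposes FOA-Attack, which improves the transferability of adversarial attacks on multimodal large language models through feature optimal alignment, particularly against closed-source models. | large language model, multimodal |
| 18 | Mitigating Hallucination in Large Vision-Language Models via Adaptive Attention Calibration | Proposes the CAAC framework, which mitigates hallucination in large vision-language models through adaptive attention calibration. | multimodal, visual grounding |
| 19 | OASIS: Online Sample Selection for Continual Visual Instruction Tuning | OASIS: an adaptive online sample selection method for continual visual instruction tuning. | foundation model |
| 20 | Mentor3AD: Feature Reconstruction-based 3D Anomaly Detection via Multi-modality Mentor Learning | Mentor3AD: a feature-reconstruction-based 3D anomaly detection method using multi-modality mentor learning. | multimodal |
| 21 | Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning? | Proposes the Video-Holmes benchmark to evaluate whether MLLMs can think like Holmes in complex video reasoning. | multimodal |
| 22 | Inverse Virtual Try-On: Generating Multi-Category Product-Style Images from Clothed Individuals | Proposes TEMU-VTOFF, which generates multi-category product images from images of clothed people, improving performance on the inverse virtual try-on task. | multimodal |
| 23 | Advancing high-fidelity 3D and Texture Generation with 2.5D latents | Proposes a joint 3D geometry and texture generation framework based on 2.5D latent representations, improving generation quality. | foundation model |
| 24 | Unified Text-Image-to-Video Generation: A Training-Free Approach to Flexible Visual Conditioning | Proposes FlexTI2V, a training-free unified text-image-to-video generation method that enables flexible visual conditioning. | foundation model |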

🔬 Pillar 2: RL & Architecture (18 papers)

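| # | Title | One-sentence summary | Tags |
|---|---|---|---|
| 25 | Active-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO | Proposes the ACTIVE-O3 framework, which gives multimodal large language models active perception through reinforcement learning. | reinforcement learning, large language model, multimodal |
| 26 | Mamba-Driven Topology Fusion for Monocular 3D Human Pose Estimation | Proposes a Mamba-driven topology fusion framework that improves the accuracy and efficiency of monocular 3D human pose estimation. | Mamba, SSM, state space model |
| 27 | Cross from Left to Right Brain: Adaptive Text Dreamer for Vision-and-Language Navigation | Proposes an Adaptive Text Dreamer to address vision-and-language navigation. | dreamer, VLN, large language model |
| 28 | Object Concepts Emerge from Motion | Proposes an unsupervised framework for learning object concepts from motion information, improving visual representations. | contrastive learning, depth estimation, monocular depth |
| 29 | TACO: Think-Answer Consistency for Optimized Long-Chain Reasoning and Efficient Data Learning via Reinforcement Learning in LVLMs | Proposes the TACO algorithm, which uses reinforcement learning to optimize long-chain reasoning and data learning in LVLMs, addressing issues such as inconsistent reasoning. | reinforcement learning, large language model, multimodal |
| 30 | MagicTryOn: Harnessing Diffusion Transformer for Garment-Preserving Video Virtual Try-on | MagicTryOn: uses a diffusion Transformer for video virtual try-on that preserves garment details. | distillation, human motion, spatiotemporal |
| 31 | ZigzagPointMamba: Spatial-Semantic Mamba for Point Cloud Understanding | ZigzagPointMamba: improves point cloud understanding with a spatial-semantic Mamba network. | Mamba, SSM, state space model |
| 32 | MUSEG: Reinforcing Video Temporal Understanding via Timestamp-Aware Multi-Segment Grounding | MUSEG: strengthens video temporal understanding through timestamp-aware multi-segment grounding. | reinforcement learning, large language model, multimodal |
| 33 | Policy Optimized Text-to-Image Pipeline Design | Proposes a reinforcement-learning-based method for optimizing text-to-image generation pipelines, improving image quality and diversity. | reinforcement learning, classifier-free guidance, large language model |
| 34 | OccLE: Label-Efficient 3D Semantic Occupancy Prediction | OccLE: a label-efficient method for 3D semantic occupancy prediction. | Mamba, scene understanding, foundation model |
| 35 | OmniSync: Towards Universal Lip Synchronization via Diffusion Transformers | OmniSync: a universal lip synchronization framework based on diffusion Transformers, applicable to diverse visual scenarios. | flow matching, classifier-free guidance, spatiotemporal |
| 36 | DreamBoothDPO: Improving Personalized Generation using Direct Preference Optimization | DreamBoothDPO: uses direct preference optimization to improve personalized image generation. | DPO, direct preference optimization |
| 37 | PMA: Towards Parameter-Efficient Point Cloud Understanding via Point Mamba Adapter | Proposes PMA to address insufficient use of information in point cloud understanding. | Mamba |
| 38 | Rendering-Aware Reinforcement Learning for Vector Graphics Generation | Proposes RLRF, a reinforcement learning method that uses rendering feedback to improve the quality of vector graphics generation. | reinforcement learning |
| 39 | Hierarchical Instruction-aware Embodied Visual Tracking | Proposes HIEVT, which uses hierarchical instruction awareness to bridge the gap between instruction understanding and action generation in embodied visual tracking. | reinforcement learning, VLA |
| 40 | Temporal Saliency-Guided Distillation: A Scalable Framework for Distilling Video Datasets | Proposes a temporal-saliency-guided video dataset distillation framework for efficient video data compression. | distillation |
| 41 | Supervised Contrastive Learning for Ordinal Engagement Measurement | Proposes an ordinal student engagement measurement method based on supervised contrastive learning, addressing imbalanced classification. | contrastive learning |
| 42 | LPOI: Listwise Preference Optimization for Vision Language Models | Proposes LPOI, which reduces hallucination in vision-language models through listwise preference optimization. | RLHF, DPO |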

🔬 Pillar 3: Perception & Semantics (10 papers)

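| # | Title | One-sentence summary | Tags |
|---|---|---|---|
| 43 | Intern-GS: Vision Model Guided Sparse-View 3D Gaussian Splatting | Intern-GS: sparse-view 3D Gaussian splatting guided by vision models. | 3D gaussian splatting, gaussian splatting, splatting |
| 44 | Uni3D-MoE: Scalable Multimodal 3D Scene Understanding via Mixture of Experts | Proposes Uni3D-MoE, which achieves scalable multimodal 3D scene understanding through a mixture of experts (MoE). | scene understanding, large language model, multimodal |
| 45 | Generalizable and Relightable Gaussian Splatting for Human Novel View Synthesis | Proposes GRGS for generalizable and relightable human novel view synthesis. | gaussian splatting, splatting, geometric consistency |
| 46 | 3D-UIR: 3D Gaussian for Underwater 3D Scene Reconstruction via Physics Based Appearance-Medium Decoupling | Proposes a physics-based 3D Gaussian method for underwater scene reconstruction that decouples appearance from medium effects. | 3D gaussian splatting, 3DGS, gaussian splatting |
| 47 | Empowering Vector Graphics with Consistently Arbitrary Viewing and View-dependent Visibility | Dream3DVG: proposes a text-to-vector-graphics generation method that supports arbitrary viewpoints, progressive detail optimization, and view-dependent visibility. | 3D gaussian splatting, 3DGS, gaussian splatting |
| 48 | Occlusion Boundary and Depth: Mutual Enhancement via Multi-Task Learning | Proposes the MoDOT framework, in which occlusion boundary estimation and monocular depth estimation enhance each other through multi-task learning. | depth estimation, monocular depth, geometric consistency |
| 49 | Compositional Scene Understanding through Inverse Generative Modeling | Proposes a compositional scene understanding method based on inverse generative modeling, enabling robust parsing of complex scenes. | scene understanding |
| 50 | Plenodium: UnderWater 3D Scene Reconstruction with Plenoptic Medium Representation | Plenodium: a plenoptic medium representation for underwater 3D scene reconstruction. | scene reconstruction |
| 51 | Robust Video-Based Pothole Detection and Area Estimation for Intelligent Vehicles with Depth Map and Kalman Smoothing | Proposes ACSH-YOLOv8 and CDKF for robust video-based pothole detection and area estimation in intelligent vehicles. | depth estimation, monocular depth, Depth Anything |
| 52 | OmniIndoor3D: Comprehensive Indoor 3D Reconstruction | OmniIndoor3D: a comprehensive indoor 3D reconstruction framework based on Gaussian representations. | 3DGS, scene understanding |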

🔬 Pillar 1: Robot Control (5 papers)

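| # | Title | One-sentence summary | Tags |
|---|---|---|---|
| 53 | HTMNet: A Hybrid Network with Transformer-Mamba Bottleneck Multimodal Fusion for Transparent and Reflective Objects Depth Completion | HTMNet: a hybrid Transformer-Mamba network for depth completion of transparent and reflective objects. | manipulation, Mamba, state space model |
| 54 | Right Side Up? Disentangling Orientation Understanding in MLLMs with Fine-grained Multi-axis Perception Tasks | DORI: proposes a fine-grained multi-axis perception benchmark to disentangle orientation understanding in multimodal large models. | manipulation, scene reconstruction, scene understanding |
| 55 | FastFace: Tuning Identity Preservation in Distilled Diffusion via Guidance and Attention | FastFace: tunes identity preservation in distilled diffusion models through guidance and attention mechanisms. | manipulation, distillation, classifier-free guidance |
| 56 | RefAV: Towards Planning-Centric Scenario Mining | RefAV: proposes a planning-centric scenario mining method, addressing the difficulty of analyzing autonomous driving logs. | motion planning |
| 57 | Geometry-Editable and Appearance-Preserving Object Compositon | Proposes the DGAD model, which decouples geometry editing from appearance preservation to achieve controllable and realistic object composition. | manipulation |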

🔬 Pillar 7: Motion Retargeting (3 papers)

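| # | Title | One-sentence summary | Tags |
|---|---|---|---|
| 58 | HuMoCon: Concept Discovery for Human Motion Understanding | HuMoCon: proposes a concept discovery framework for human motion understanding, improving multimodal feature alignment and the representation of high-frequency information. | human motion |
| 59 | Diffusion Model-based Activity Completion for AI Motion Capture from Videos | Proposes a diffusion-model-based motion completion method that generates natural, continuous motion for AI motion capture from videos. | human motion |
| 60 | ProBA: Probabilistic Bundle Adjustment with the Bhattacharyya Coefficient | Proposes ProBA, a probabilistic bundle adjustment method based on the Bhattacharyya coefficient, addressing unknown camera intrinsics and inaccurate initial estimates. | geometric consistency |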

🔬 Pillar 6: Video Extraction (3 papers)

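| # | Title | One-sentence summary | Tags |
|---|---|---|---|
| 61 | ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models | Proposes the ViewSpatial-Bench benchmark to evaluate vision-language models on multi-perspective spatial localization. | egocentric, spatial relationship, embodied AI |
| 62 | HCQA-1.5 @ Ego4D EgoSchema Challenge 2025 | Proposes an extended HCQA framework based on multi-source aggregation and confidence filtering, improving accuracy on egocentric video question answering. | egocentric, Ego4D |
| 63 | SANSA: Unleashing the Hidden Semantics in SAM2 for Few-Shot Segmentation | SANSA: exploits the latent semantic information in SAM2 for few-shot segmentation. | feature matching |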

🔬 Pillar 4: Generative Motion (2 papers)

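| # | Title | One-sentence summary | Tags |
|---|---|---|---|
| 64 | Exploring Timeline Control for Facial Motion Generation | Proposes a timeline-controlled facial motion generation method for fine-grained control of facial motion. | motion generation |
| 65 | Normalized Attention Guidance: Universal Negative Guidance for Diffusion Models | Proposes Normalized Attention Guidance (NAG), addressing the failure of negative guidance in diffusion models. | classifier-free guidance |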

🔬 Pillar 5: Interaction & Reaction (1 paper)

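| # | Title | One-sentence summary | Tags |
|---|---|---|---|
| 66 | OmniResponse: Online Multimodal Conversational Response Generation in Dyadic Interactions | Proposes OmniResponse, addressing listener feedback generation in online multimodal conversation. | dyadic interaction, large language model, multimodal |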

🔬 Pillar 8: Physics-based Animation (1 paper)

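| # | Title | One-sentence summary | Tags |
|---|---|---|---|
| 67 | AgriFM: A Multi-source Temporal Remote Sensing Foundation Model for Agriculture Mapping | AgriFM: a multi-source temporal remote sensing foundation model for agriculture mapping. | spatiotemporal, foundation model |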
