cs.CV(2025-05-27)

📊 65 papers in total | 🔗 24 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (24 🔗8) · Pillar 2: RL & Architecture (18 🔗6) · Pillar 3: Perception & Semantics (10 🔗5) · Pillar 1: Robot Control (5 🔗2) · Pillar 6: Video Extraction (3 🔗2) · Pillar 4: Generative Motion (2) · Pillar 5: Interaction & Reaction (1) · Pillar 8: Physics-based Animation (1 🔗1) · Pillar 7: Motion Retargeting (1)

🔬 Pillar 9: Embodied Foundation Models (24 papers)

# | Title | One-line summary | Tags | 🔗
1 | GeoLLaVA-8K: Scaling Remote-Sensing Multimodal Large Language Models to 8K Resolution | Proposes GeoLLaVA-8K to handle ultra-high-resolution remote-sensing imagery | large language model, foundation model, multimodal
2 | DynamicVL: Benchmarking Multimodal Large Language Models for Dynamic City Understanding | Proposes DVL-Suite to address the shortcomings of multimodal large language models in understanding urban dynamics | large language model, multimodal
3 | Fork-Merge Decoding: Enhancing Multimodal Understanding in Audio-Visual Large Language Models | Proposes Fork-Merge Decoding to address modality bias in audio-visual large language models | large language model, multimodal
4 | MMTBENCH: A Unified Benchmark for Complex Multimodal Table Reasoning | Proposes MMTBENCH to address complex multimodal table reasoning | large language model, multimodal
5 | MME-VideoOCR: Evaluating OCR-Based Capabilities of Multimodal LLMs in Video Scenarios | Proposes MME-VideoOCR to address inadequate OCR performance in video scenarios | large language model, multimodal
6 | Think Twice, Act Once: Token-Aware Compression and Action Reuse for Efficient Inference in Vision-Language-Action Models | Proposes FlashVLA to address the low inference efficiency of VLA models | vision-language-action, VLA
7 | AVCD: Mitigating Hallucinations in Audio-Visual Large Language Models through Contrastive Decoding | Proposes AVCD to mitigate hallucinations in audio-visual large language models | large language model, multimodal
8 | EaqVLA: Encoding-aligned Quantization for Vision-Language-Action Models | Proposes EaqVLA to address quantization efficiency in VLA models | vision-language-action, VLA
9 | Music's Multimodal Complexity in AVQA: Why We Need More than General Multimodal LLMs | Proposes specialized methods to address the complexity of music audio-visual question answering | large language model, multimodal
10 | Leveraging Large Language Models in Visual Speech Recognition: Model Scaling, Context-Aware Decoding, and Iterative Polishing | Proposes methods that use large language models to improve visual speech recognition | large language model
11 | Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers | Proposes Paper2Poster to address automatic academic poster generation | multimodal
12 | HoliTom: Holistic Token Merging for Fast Video Large Language Models | Proposes HoliTom to address the computational efficiency of video large language models | large language model
13 | PARTONOMY: Large Multimodal Models with Part-Level Visual Understanding | Proposes PARTONOMY to address part-level recognition in large multimodal models | multimodal
14 | Understand, Think, and Answer: Advancing Visual Reasoning with Large Multimodal Models | Proposes a unified visual reasoning mechanism to improve the compositional reasoning of multimodal models | multimodal
15 | MedBridge: Bridging Foundation Vision-Language Models to Medical Image Diagnosis in Chest X-Ray | Proposes MedBridge to address domain adaptation in medical image diagnosis | foundation model, multimodal
16 | Think Before You Diffuse: Infusing Physical Rules into Video Diffusion | Proposes the DiffPhy framework to address physical accuracy in video generation | large language model, multimodal
17 | Adversarial Attacks against Closed-Source MLLMs via Feature Optimal Alignment | Proposes FOA-Attack for adversarial attacks against closed-source MLLMs | large language model, multimodal
18 | Mitigating Hallucination in Large Vision-Language Models via Adaptive Attention Calibration | Proposes the CAAC framework to mitigate hallucination in large vision-language models | multimodal, visual grounding
19 | OASIS: Online Sample Selection for Continual Visual Instruction Tuning | Proposes OASIS to address sample selection in continual visual instruction tuning | foundation model
20 | Mentor3AD: Feature Reconstruction-based 3D Anomaly Detection via Multi-modality Mentor Learning | Proposes Mentor3AD to address feature reconstruction in 3D anomaly detection | multimodal
21 | Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning? | Proposes the Video-Holmes benchmark to address complex video reasoning | multimodal
22 | Inverse Virtual Try-On: Generating Multi-Category Product-Style Images from Clothed Individuals | Proposes TEMU-VTOFF to address the inverse virtual try-on problem | multimodal
23 | Advancing high-fidelity 3D and Texture Generation with 2.5D latents | Proposes a new framework to address inconsistency between generated 3D geometry and texture | foundation model
24 | Unified Text-Image-to-Video Generation: A Training-Free Approach to Flexible Visual Conditioning | Proposes FlexTI2V to address high training cost and limited conditioning settings | foundation model

🔬 Pillar 2: RL & Architecture (18 papers)

# | Title | One-line summary | Tags | 🔗
25 | Active-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO | Proposes ACTIVE-O3 to address active perception in multimodal large language models | reinforcement learning, large language model, multimodal
26 | Mamba-Driven Topology Fusion for Monocular 3D Human Pose Estimation | Proposes a Mamba-driven topology fusion framework to address computational challenges in monocular 3D human pose estimation | Mamba, SSM, state space model
27 | Cross from Left to Right Brain: Adaptive Text Dreamer for Vision-and-Language Navigation | Proposes an Adaptive Text Dreamer to address vision-and-language navigation | dreamer, VLN, large language model
28 | Object Concepts Emerge from Motion | Proposes an unsupervised framework for learning object concepts from motion | contrastive learning, depth estimation, monocular depth
29 | TACO: Think-Answer Consistency for Optimized Long-Chain Reasoning and Efficient Data Learning via Reinforcement Learning in LVLMs | Proposes TACO to address consistency and learning efficiency in long-chain reasoning | reinforcement learning, large language model, multimodal
30 | ZigzagPointMamba: Spatial-Semantic Mamba for Point Cloud Understanding | Proposes ZigzagPointMamba to address spatial continuity and semantic modeling in point cloud understanding | Mamba, SSM, state space model
31 | MUSEG: Reinforcing Video Temporal Understanding via Timestamp-Aware Multi-Segment Grounding | Proposes MUSEG to address video temporal understanding | reinforcement learning, large language model, multimodal
32 | Policy Optimized Text-to-Image Pipeline Design | Proposes reinforcement-learning-based text-to-image pipeline design to address efficiency | reinforcement learning, classifier-free guidance, large language model
33 | OccLE: Label-Efficient 3D Semantic Occupancy Prediction | Proposes OccLE to address label efficiency in 3D semantic occupancy prediction | Mamba, scene understanding, foundation model
34 | OmniSync: Towards Universal Lip Synchronization via Diffusion Transformers | Proposes OmniSync to address lip synchronization across diverse scenarios | flow matching, classifier-free guidance, spatiotemporal
35 | DreamBoothDPO: Improving Personalized Generation using Direct Preference Optimization | Proposes DreamBoothDPO to address preference optimization in personalized generation | DPO, direct preference optimization
36 | PMA: Towards Parameter-Efficient Point Cloud Understanding via Point Mamba Adapter | Proposes PMA to address underused information in point cloud understanding | Mamba
37 | Rendering-Aware Reinforcement Learning for Vector Graphics Generation | Proposes RLRF to address rendering feedback in SVG generation | reinforcement learning
38 | Hierarchical Instruction-aware Embodied Visual Tracking | Proposes HIEVT to address the challenges of user-centric visual tracking | reinforcement learning, VLA
39 | Temporal Saliency-Guided Distillation: A Scalable Framework for Distilling Video Datasets | Proposes a temporal saliency-guided distillation framework to address video dataset compression | distillation
40 | Supervised Contrastive Learning for Ordinal Engagement Measurement | Proposes supervised contrastive learning to address class imbalance in student engagement measurement | contrastive learning
41 | MagicTryOn: Harnessing Diffusion Transformer for Garment-Preserving Video Virtual Try-on | Proposes MagicTryOn to address garment preservation in video virtual try-on | distillation, spatiotemporal
42 | LPOI: Listwise Preference Optimization for Vision Language Models | Proposes LPOI to address hallucination in vision-language models | RLHF, DPO

🔬 Pillar 3: Perception & Semantics (10 papers)

# | Title | One-line summary | Tags | 🔗
43 | Intern-GS: Vision Model Guided Sparse-View 3D Gaussian Splatting | Proposes Intern-GS to address sparse-view 3D reconstruction | 3D gaussian splatting, gaussian splatting, splatting
44 | Uni3D-MoE: Scalable Multimodal 3D Scene Understanding via Mixture of Experts | Proposes Uni3D-MoE to address inadequate multimodal 3D scene understanding | scene understanding, large language model, multimodal
45 | Generalizable and Relightable Gaussian Splatting for Human Novel View Synthesis | Proposes the GRGS framework to address high-fidelity novel view synthesis of humans | gaussian splatting, splatting, geometric consistency
46 | 3D-UIR: 3D Gaussian for Underwater 3D Scene Reconstruction via Physics Based Appearance-Medium Decoupling | Proposes a physics-based 3D Gaussian model to address underwater scene reconstruction | 3D gaussian splatting, 3DGS, gaussian splatting
47 | Empowering Vector Graphics with Consistently Arbitrary Viewing and View-dependent Visibility | Proposes Dream3DVG to address viewpoint and occlusion issues in text-to-vector-graphics generation | 3D gaussian splatting, 3DGS, gaussian splatting
48 | Occlusion Boundary and Depth: Mutual Enhancement via Multi-Task Learning | Proposes the MoDOT framework to jointly address monocular depth estimation and occlusion boundary estimation as complementary tasks | depth estimation, monocular depth, geometric consistency
49 | Compositional Scene Understanding through Inverse Generative Modeling | Proposes a compositional scene understanding method via inverse generative modeling | scene understanding
50 | Plenodium: UnderWater 3D Scene Reconstruction with Plenoptic Medium Representation | Proposes Plenodium to address underwater 3D scene reconstruction | scene reconstruction
51 | Robust Video-Based Pothole Detection and Area Estimation for Intelligent Vehicles with Depth Map and Kalman Smoothing | Proposes a robust video-based pothole detection and area estimation method to improve intelligent vehicle safety | depth estimation, monocular depth, Depth Anything
52 | OmniIndoor3D: Comprehensive Indoor 3D Reconstruction | Proposes OmniIndoor3D to address insufficient accuracy in indoor 3D reconstruction | 3DGS, scene understanding

🔬 Pillar 1: Robot Control (5 papers)

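# | Title | One-line summary | Tags | 🔗
53 | HTMNet: A Hybrid Network with Transformer-Mamba Bottleneck Multimodal Fusion for Transparent and Reflective Objects Depth Completion | Proposes HTMNet to address depth completion for transparent and reflective objects | manipulation, Mamba, state space model
54 | Right Side Up? Disentangling Orientation Understanding in MLLMs with Fine-grained Multi-axis Perception Tasks | Proposes the DORI benchmark to address object orientation understanding in multimodal systems | manipulation, scene reconstruction, scene understanding
55 | FastFace: Tuning Identity Preservation in Distilled Diffusion via Guidance and Attention | Proposes the FastFace framework to address the training efficiency of identity-preservation adapters in diffusion models | manipulation, distillation, classifier-free guidance
56 | RefAV: Towards Planning-Centric Scenario Mining | Proposes RefAV to address scenario mining for autonomous driving | motion planning
57 | Geometry-Editable and Appearance-Preserving Object Compositon | Proposes the DGAD model to address geometry editing and appearance preservation in object composition | manipulation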

🔬 Pillar 6: Video Extraction (3 papers)

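# | Title | One-line summary | Tags | 🔗
58 | ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models | Proposes ViewSpatial-Bench to address multi-perspective spatial localization | egocentric, spatial relationship, embodied AI
59 | HCQA-1.5 @ Ego4D EgoSchema Challenge 2025 | Proposes an extension of the HCQA framework to improve the reliability of egocentric video question answering | egocentric, Ego4D
60 | SANSA: Unleashing the Hidden Semantics in SAM2 for Few-Shot Segmentation | Proposes SANSA to address semantic understanding in few-shot segmentation | feature matching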

🔬 Pillar 4: Generative Motion (2 papers)

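# | Title | One-line summary | Tags | 🔗
61 | Exploring Timeline Control for Facial Motion Generation | Proposes timeline control signals to improve the precision of facial motion generation | motion generation
62 | Normalized Attention Guidance: Universal Negative Guidance for Diffusion Models | Proposes Normalized Attention Guidance to address negative guidance in diffusion models | classifier-free guidance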

🔬 Pillar 5: Interaction & Reaction (1 paper)

# | Title | One-line summary | Tags | 🔗
63 | OmniResponse: Online Multimodal Conversational Response Generation in Dyadic Interactions | Proposes OmniResponse to address multimodal conversational response generation | dyadic interaction, large language model, multimodal

🔬 Pillar 8: Physics-based Animation (1 paper)

# | Title | One-line summary | Tags | 🔗
64 | AgriFM: A Multi-source Temporal Remote Sensing Foundation Model for Crop Mapping | Proposes AgriFM to address spatiotemporal feature extraction in agricultural crop mapping | spatiotemporal, foundation model

🔬 Pillar 7: Motion Retargeting (1 paper)

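# | Title | One-line summary | Tags | 🔗
65 | ProBA: Probabilistic Bundle Adjustment with the Bhattacharyya Coefficient | Proposes ProBA to address the initialization problem of traditional bundle adjustment methods | geometric consistency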
