cs.CV (2025-08-14)

📊 37 papers in total | 🔗 13 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (13 🔗6)
Pillar 2: RL & Architecture (12 🔗4)
Pillar 3: Perception & Semantics (3 🔗1)
Pillar 7: Motion Retargeting (2 🔗1)
Pillar 4: Generative Motion (2)
Pillar 1: Robot Control (2)
Pillar 8: Physics-based Animation (2 🔗1)
Pillar 5: Interaction & Reaction (1)

🔬 Pillar 9: Embodied Foundation Models (13 papers)

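# | Title | One-sentence summary | Tags | 🔗
1 | Contrast Sensitivity in Multimodal Large Language Models: A Psychophysics-Inspired Evaluation | Proposes a psychophysics-based contrast sensitivity function evaluation method to diagnose the perceptual abilities of multimodal large language models | large language model, multimodal
2 | Towards Agentic AI for Multimodal-Guided Video Object Segmentation | Proposes a multimodal agent for multimodal-guided video object segmentation | large language model, foundation model, multimodal
3 | Empowering Multimodal LLMs with External Tools: A Comprehensive Survey | Survey: augmenting multimodal large language models with external tools to improve performance, evaluation, and data quality | large language model, multimodal
4 | Failures to Surface Harmful Contents in Video Large Language Models | Reveals weaknesses of video large language models in identifying harmful video content and proposes targeted attacks | large language model
5 | A Mutual-Structure Weighted Sub-Pixel Multimodal Optical Remote Sensing Image Matching Method | Proposes a mutual-structure weighted sub-pixel matching method for multimodal remote sensing images to improve matching accuracy | multimodal
6 | Performance of GPT-5 in Brain Tumor MRI Reasoning | Evaluates GPT-5 family models on brain tumor MRI question answering; the results indicate some potential but remain far from clinical use | large language model, chain-of-thought
7 | UI-Venus Technical Report: Building High-performance UI Agents with RFT | UI-Venus: builds high-performance UI agents with RFT, reaching SOTA performance on UI understanding and navigation tasks | large language model, multimodal
8 | MRFD: Multi-Region Fusion Decoding with Self-Consistency for Mitigating Hallucinations in LVLMs | Proposes MRFD, a multi-region fusion decoding method, to mitigate hallucinations in LVLMs | multimodal, chain-of-thought
9 | ToonComposer: Streamlining Cartoon Production with Generative Post-Keyframing | ToonComposer: streamlines cartoon production through generative post-keyframing | foundation model
10 | Insights from the Algonauts 2025 Winners | Brain activity prediction from long-form multimodal movies: insights from the Algonauts 2025 challenge | multimodal
11 | AEGIS: Authenticity Evaluation Benchmark for AI-Generated Video Sequences | AEGIS: a benchmark dataset for evaluating the authenticity of AI-generated video sequences | multimodal
12 | Deep Learning for Crack Detection: A Review of Learning Paradigms, Generalizability, and Datasets | Survey of deep learning for crack detection: learning paradigms, generalizability, and datasets | foundation model
13 | A Sub-Pixel Multimodal Optical Remote Sensing Images Matching Method | Proposes the PCWLAD method to address matching accuracy for multimodal optical images | multimodal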

🔬 Pillar 2: RL & Architecture (12 papers)

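# | Title | One-sentence summary | Tags | 🔗
14 | EgoCross: Benchmarking Multimodal Large Language Models for Cross-Domain Egocentric Video Question Answering | Proposes EgoCross to address cross-domain egocentric video question answering | reinforcement learning, egocentric, large language model
15 | MAESTRO: Masked AutoEncoders for Multimodal, Multitemporal, and Multispectral Earth Observation Data | Proposes MAESTRO, which uses masked autoencoders to handle multimodal, multitemporal, and multispectral Earth observation data | masked autoencoder, multimodal
16 | HumanSense: From Multimodal Perception to Empathetic Context-Aware Responses through Reasoning MLLMs | Proposes the HumanSense benchmark to evaluate the perception and interaction abilities of multimodal LLMs in human-centered scenarios | reinforcement learning, large language model, multimodal
17 | EgoMusic-driven Human Dance Motion Estimation with Skeleton Mamba | Proposes an EgoMusic motion network based on Skeleton Mamba for estimating human dance motion from egocentric video and music | Mamba, egocentric, human motion
18 | Hierarchical Fine-grained Preference Optimization for Physically Plausible Video Generation | PhysHPO: hierarchical fine-grained preference optimization for physically plausible video generation | direct preference optimization, physically plausible
19 | Trajectory-aware Shifted State Space Models for Online Video Super-Resolution | Proposes an online video super-resolution method based on trajectory-aware shifted state space models, improving the efficiency of spatiotemporal information aggregation | Mamba, SSM, state space model
20 | BLADE: Block-Sparse Attention Meets Step Distillation for Efficient Video Generation | Proposes the BLADE framework, which accelerates video generation through block-sparse attention and step distillation | distillation, spatiotemporal
21 | From Diagnosis to Improvement: Probing Spatio-Physical Reasoning in Vision Language Models | Diagnoses and improves spatio-physical reasoning in vision language models | reinforcement learning, world model, multimodal
22 | VIFSS: View-Invariant and Figure Skating-Specific Pose Representation Learning for Temporal Action Segmentation | Proposes the VIFSS framework to address view invariance and data scarcity in temporal segmentation of figure skating jumps | representation learning, contrastive learning
23 | Beyond conventional vision: RGB-event fusion for robust object detection in dynamic traffic scenarios | Proposes MCFNet, which fuses RGB images with event camera data to improve the robustness of object detection in dynamic traffic scenes | Mamba, optical flow, spatiotemporal
24 | Integrating Reinforcement Learning with Visual Generative Models: Foundations and Advances | Survey: reinforcement learning for visual generative models, improving controllability and realism | reinforcement learning
25 | Integrating Reinforcement Learning with Visual Generative Models: Foundations and Advances | Combines reinforcement learning with visual generative models to optimize generation quality | reinforcement learning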

🔬 Pillar 3: Perception & Semantics (3 papers)

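# | Title | One-sentence summary | Tags | 🔗
26 | Multi-Sample Anti-Aliasing and Constrained Optimization for 3D Gaussian Splatting | Proposes a multi-sample anti-aliasing and constrained optimization framework to improve detail reconstruction quality in 3D Gaussian Splatting | 3D gaussian splatting, gaussian splatting, splatting
27 | Advancing 3D Scene Understanding with MV-ScanQA Multi-View Reasoning Evaluation and TripAlign Pre-training Dataset | Proposes MV-ScanQA and TripAlign to advance multi-view 3D scene understanding and reasoning | scene understanding, multimodal
28 | Cooperative Face Liveness Detection from Optical Flow | Proposes an optical-flow-based cooperative face liveness detection method to improve security | optical flow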

🔬 Pillar 7: Motion Retargeting (2 papers)

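# | Title | One-sentence summary | Tags | 🔗
29 | Human-in-Context: Unified Cross-Domain 3D Human Motion Modeling via In-Context Learning | Proposes Human-in-Context (HiC), which achieves unified cross-domain 3D human motion modeling through in-context learning | human motion
30 | Novel View Synthesis using DDIM Inversion | Proposes a novel view synthesis method based on DDIM inversion and a pose-conditioned U-Net to improve image quality | geometric consistency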

🔬 Pillar 4: Generative Motion (2 papers)

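# | Title | One-sentence summary | Tags | 🔗
31 | InterSyn: Interleaved Learning for Dynamic Motion Synthesis in the Wild | InterSyn: dynamic motion synthesis in the wild through interleaved learning | text-to-motion, motion synthesis
32 | Increasing the Utility of Synthetic Images through Chamfer Guidance | Proposes Chamfer Guidance to improve the quality and diversity of synthetic images and boost downstream task performance | classifier-free guidance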

🔬 Pillar 1: Robot Control (2 papers)

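# | Title | One-sentence summary | Tags | 🔗
33 | Can Multi-modal (reasoning) LLMs detect document manipulation? | Evaluates the effectiveness of multimodal LLMs at detecting document manipulation, revealing the link between model capability and detection performance | manipulation, large language model
34 | Lameness detection in dairy cows using pose estimation and bidirectional LSTMs | Proposes a lameness detection method for dairy cows based on pose estimation and bidirectional LSTMs | locomotion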

🔬 Pillar 8: Physics-based Animation (2 papers)

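# | Title | One-sentence summary | Tags | 🔗
35 | STRIDE-QA: Visual Question Answering Dataset for Spatiotemporal Reasoning in Urban Driving Scenes | Proposes STRIDE-QA to address spatiotemporal reasoning in urban driving scenes | spatiotemporal
36 | HyperTea: A Hypergraph-based Temporal Enhancement and Alignment Network for Moving Infrared Small Target Detection | HyperTea: a hypergraph-based temporal enhancement and alignment network for moving infrared small target detection | spatiotemporal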

🔬 Pillar 5: Interaction & Reaction (1 paper)

# | Title | One-sentence summary | Tags | 🔗
37 | JRDB-Reasoning: A Difficulty-Graded Benchmark for Visual Reasoning in Robotics | Proposes JRDB-Reasoning to address the complexity issue in visual reasoning benchmarks | human-object interaction, embodied AI, large language model
