cs.CV (2025-08-14)

📊 37 papers in total | 🔗 13 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (13 🔗6) | Pillar 2: RL & Architecture (12 🔗4) | Pillar 3: Perception & Semantics (3 🔗1) | Pillar 7: Motion Retargeting (2 🔗1) | Pillar 4: Generative Motion (2) | Pillar 1: Robot Control (2) | Pillar 8: Physics-based Animation (2 🔗1) | Pillar 5: Interaction & Reaction (1)

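🔬 Pillar 9: Embodied Foundation Models (13 papers)

#	Title	One-line summary	Tags	🔗	⭐
1	Contrast Sensitivity in Multimodal Large Language Models: A Psychophysics-Inspired Evaluation	Proposes a psychophysics-inspired contrast sensitivity function evaluation to diagnose the perceptual abilities of multimodal large language models	large language model multimodal
2	Towards Agentic AI for Multimodal-Guided Video Object Segmentation	Proposes a multimodal agent for multimodal-guided video object segmentation	large language model foundation model multimodal
3	Empowering Multimodal LLMs with External Tools: A Comprehensive Survey	Survey: augmenting multimodal large language models with external tools to improve performance, evaluation, and data quality	large language model multimodal	✅
4	Failures to Surface Harmful Contents in Video Large Language Models	Reveals failures of video large language models to identify harmful video content and proposes targeted attacks	large language model
5	A Mutual-Structure Weighted Sub-Pixel Multimodal Optical Remote Sensing Image Matching Method	Proposes a mutual-structure weighted sub-pixel matching method for multimodal remote sensing images to improve matching accuracy	multimodal	✅
6	Performance of GPT-5 in Brain Tumor MRI Reasoning	Evaluates the GPT-5 model family on brain tumor MRI question answering; the results show some potential but remain far from clinical use	large language model chain-of-thought
7	UI-Venus Technical Report: Building High-performance UI Agents with RFT	UI-Venus: builds a high-performance UI agent with RFT, achieving SOTA performance on UI understanding and navigation tasks	large language model multimodal	✅
8	MRFD: Multi-Region Fusion Decoding with Self-Consistency for Mitigating Hallucinations in LVLMs	Proposes MRFD, a multi-region fusion decoding method that mitigates hallucinations in LVLMs	multimodal chain-of-thought
9	ToonComposer: Streamlining Cartoon Production with Generative Post-Keyframing	ToonComposer: streamlines cartoon production through generative post-keyframing	foundation model
10	Insights from the Algonauts 2025 Winners	Brain activity prediction from long multimodal movies: insights from the Algonauts 2025 challenge	multimodal
11	AEGIS: Authenticity Evaluation Benchmark for AI-Generated Video Sequences	AEGIS: a benchmark dataset for evaluating the authenticity of AI-generated video sequences	multimodal	✅
12	Deep Learning for Crack Detection: A Review of Learning Paradigms, Generalizability, and Datasets	Reviews deep learning for crack detection: learning paradigms, generalizability, and datasets	foundation model	✅
13	A Sub-Pixel Multimodal Optical Remote Sensing Images Matching Method	Proposes the PCWLAD method to improve matching accuracy for multimodal optical images	multimodal	✅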

🔬 Pillar 2: RL & Architecture (12 papers)

#	Title	One-line summary	Tags	🔗	⭐
14	EgoCross: Benchmarking Multimodal Large Language Models for Cross-Domain Egocentric Video Question Answering	Proposes EgoCross to address cross-domain egocentric video question answering	reinforcement learning egocentric large language model	✅
15	MAESTRO: Masked AutoEncoders for Multimodal, Multitemporal, and Multispectral Earth Observation Data	Proposes MAESTRO, which uses masked autoencoders for multimodal, multitemporal, and multispectral Earth observation data	masked autoencoder multimodal	✅
16	HumanSense: From Multimodal Perception to Empathetic Context-Aware Responses through Reasoning MLLMs	Proposes the HumanSense benchmark to evaluate the perception and interaction abilities of multimodal LLMs in human-centered scenarios	reinforcement learning large language model multimodal	✅
17	EgoMusic-driven Human Dance Motion Estimation with Skeleton Mamba	Proposes the EgoMusic motion network, based on Skeleton Mamba, for estimating human dance motion from egocentric video and music	Mamba egocentric human motion
18	Hierarchical Fine-grained Preference Optimization for Physically Plausible Video Generation	PhysHPO: hierarchical fine-grained preference optimization for physically plausible video generation	direct preference optimization physically plausible
19	Trajectory-aware Shifted State Space Models for Online Video Super-Resolution	Proposes an online video super-resolution method based on trajectory-aware shifted state space models, improving the efficiency of spatiotemporal information aggregation	Mamba SSM state space model
20	BLADE: Block-Sparse Attention Meets Step Distillation for Efficient Video Generation	Proposes the BLADE framework, which combines block-sparse attention with step distillation for efficient video generation	distillation spatiotemporal
21	From Diagnosis to Improvement: Probing Spatio-Physical Reasoning in Vision Language Models	Diagnoses and improves spatio-physical reasoning in vision language models	reinforcement learning world model multimodal
22	VIFSS: View-Invariant and Figure Skating-Specific Pose Representation Learning for Temporal Action Segmentation	Proposes the VIFSS framework to address view invariance and data scarcity in temporal action segmentation of figure skating jumps	representation learning contrastive learning
23	Beyond conventional vision: RGB-event fusion for robust object detection in dynamic traffic scenarios	Proposes MCFNet, which fuses RGB images with event camera data to improve the robustness of object detection in dynamic traffic scenes	Mamba optical flow spatiotemporal	✅
24	Integrating Reinforcement Learning with Visual Generative Models: Foundations and Advances	Survey: reinforcement learning for visual generative models, improving controllability and realism	reinforcement learning
25	Integrating Reinforcement Learning with Visual Generative Models: Foundations and Advances	Combines reinforcement learning with visual generative models to optimize generation quality	reinforcement learning

🔬 Pillar 3: Perception & Semantics (3 papers)

#	Title	One-line summary	Tags	🔗	⭐
26	Multi-Sample Anti-Aliasing and Constrained Optimization for 3D Gaussian Splatting	Proposes a multi-sample anti-aliasing and constrained optimization framework that improves detail reconstruction quality in 3D Gaussian Splatting	3D gaussian splatting gaussian splatting splatting
27	Advancing 3D Scene Understanding with MV-ScanQA Multi-View Reasoning Evaluation and TripAlign Pre-training Dataset	Proposes MV-ScanQA and TripAlign to advance multi-view 3D scene understanding and reasoning	scene understanding multimodal	✅
28	Cooperative Face Liveness Detection from Optical Flow	Proposes an optical-flow-based cooperative face liveness detection method to improve security	optical flow

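🔬 Pillar 7: Motion Retargeting (2 papers)

#	Title	One-line summary	Tags	🔗	⭐
29	Human-in-Context: Unified Cross-Domain 3D Human Motion Modeling via In-Context Learning	Proposes Human-in-Context (HiC), which unifies cross-domain 3D human motion modeling through in-context learning	human motion	✅
30	Novel View Synthesis using DDIM Inversion	Proposes a novel view synthesis method based on DDIM inversion and a pose-conditioned U-Net to improve image quality	geometric consistency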

🔬 Pillar 4: Generative Motion (2 papers)

#	Title	One-line summary	Tags	🔗	⭐
31	InterSyn: Interleaved Learning for Dynamic Motion Synthesis in the Wild	InterSyn: dynamic motion synthesis in the wild through interleaved learning	text-to-motion motion synthesis
32	Increasing the Utility of Synthetic Images through Chamfer Guidance	Proposes Chamfer Guidance to improve the quality and diversity of synthetic images and boost downstream task performance	classifier-free guidance

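🔬 Pillar 1: Robot Control (2 papers)

#	Title	One-line summary	Tags	🔗	⭐
33	Can Multi-modal (reasoning) LLMs detect document manipulation?	Evaluates the effectiveness of multimodal LLMs at detecting document manipulation and reveals how model capability relates to detection performance	manipulation large language model
34	Lameness detection in dairy cows using pose estimation and bidirectional LSTMs	Proposes a lameness detection method for dairy cows based on pose estimation and bidirectional LSTMs	locomotion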

🔬 Pillar 8: Physics-based Animation (2 papers)

#	Title	One-line summary	Tags	🔗	⭐
35	STRIDE-QA: Visual Question Answering Dataset for Spatiotemporal Reasoning in Urban Driving Scenes	Proposes STRIDE-QA to address spatiotemporal reasoning in urban driving scenes	spatiotemporal
36	HyperTea: A Hypergraph-based Temporal Enhancement and Alignment Network for Moving Infrared Small Target Detection	HyperTea: a hypergraph-based temporal enhancement and alignment network for moving infrared small target detection	spatiotemporal	✅

🔬 Pillar 5: Interaction & Reaction (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
37	JRDB-Reasoning: A Difficulty-Graded Benchmark for Visual Reasoning in Robotics	Proposes JRDB-Reasoning to address the complexity of visual reasoning benchmarks	human-object interaction embodied AI large language model
