cs.CV (2025-06-13)

📊 29 papers | 🔗 7 with code

🎯 Topic Navigation

Pillar 9: Embodied Foundation Models (11 🔗3)
Pillar 2: RL Algorithms & Architecture (9 🔗2)
Pillar 3: Spatial Perception & Semantics (3 🔗1)
Pillar 7: Motion Retargeting (2)
Pillar 8: Physics-based Animation (2)
Pillar 6: Video Extraction & Matching (1 🔗1)
Pillar 1: Robot Control (1)

🔬 Pillar 9: Embodied Foundation Models (11 papers)

#	Title	Key Point	Tags	🔗	⭐
1	Dynamic Mixture of Curriculum LoRA Experts for Continual Multimodal Instruction Tuning	Proposes a dynamic mixture of curriculum LoRA experts for continual multimodal instruction tuning	large language model multimodal
2	DaMO: A Data-Efficient Multimodal Orchestrator for Temporal Reasoning with Video LLMs	Proposes DaMO to improve temporal reasoning in video LLMs	large language model multimodal
3	TAViS: Text-bridged Audio-Visual Segmentation with Foundation Models	Proposes TAViS to address cross-modal alignment in audio-visual segmentation	foundation model multimodal
4	VGR: Visual Grounded Reasoning	Proposes VGR to mitigate language bias in multimodal reasoning	large language model multimodal chain-of-thought
5	Exploring the Effectiveness of Deep Features from Domain-Specific Foundation Models in Retinal Image Synthesis	Proposes deep-feature-based loss functions to improve retinal image synthesis	foundation model
6	VFaith: Do Large Multimodal Models Really Reason on Seen Images Rather than Previous Memories?	Proposes VFaith to evaluate the visual reasoning ability of multimodal models	multimodal
7	Enhance Multimodal Consistency and Coherence for Text-Image Plan Generation	Proposes a framework that enhances multimodal consistency and coherence for text-image plan generation	multimodal	✅
8	Manager: Aggregating Insights from Unimodal Experts in Two-Tower VLMs and MLLMs	Proposes the Manager plugin to aggregate insights from unimodal experts in two-tower VLMs and MLLMs	large language model multimodal	✅
9	CLIP Meets Diffusion: A Synergistic Approach to Anomaly Detection	Proposes CLIPFUSION to address multimodal fusion in anomaly detection	foundation model
10	Quizzard@INOVA Challenge 2025 -- Track A: Plug-and-Play Technique in Interleaved Multi-Image Model	Applies a plug-and-play technique to LLaVA-NeXT-Interleave for multi-image reasoning	foundation model	✅
11	A$^2$LC: Active and Automated Label Correction for Semantic Segmentation	Proposes the A$^2$LC framework for label correction in semantic segmentation	foundation model

🔬 Pillar 2: RL Algorithms & Architecture (9 papers)

#	Title	Key Point	Tags	🔗	⭐
12	MambaVSR: Content-Aware Scanning State Space Model for Video Super-Resolution	Proposes MambaVSR to model non-local dependencies in video super-resolution	Mamba SSM state space model
13	How Visual Representations Map to Language Feature Space in Multimodal LLMs	Proposes frozen models with a linear adapter to study vision-language alignment	representation learning large language model multimodal
14	InceptionMamba: Efficient Multi-Stage Feature Enhancement with Selective State Space Model for Microscopic Medical Image Segmentation	Proposes InceptionMamba for efficient microscopic medical image segmentation	Mamba state space model
15	AgentSense: Virtual Sensor Data Generation Using LLM Agents in Simulated Home Environments	Proposes AgentSense to address the lack of diverse labeled data in smart homes	world model embodied AI large language model	✅
16	Stop learning it all to mitigate visual hallucination, Focus on the hallucination target	Proposes a preference learning method to mitigate visual hallucination in multimodal LLMs	preference learning large language model multimodal
17	Self-supervised Learning of Echocardiographic Video Representations via Online Cluster Distillation	Proposes DISCOVR for echocardiographic video representation learning	representation learning distillation	✅
18	DAVID-XR1: Detecting AI-Generated Videos with Explainable Reasoning	Proposes DAVID-XR1 for explainable detection of AI-generated videos	distillation chain-of-thought
19	EasyARC: Evaluating Vision Language Models on True Visual Reasoning	Proposes EasyARC to evaluate multimodal visual reasoning	reinforcement learning multimodal
20	Auto-Connect: Connectivity-Preserving RigFormer with Direct Preference Optimization	Proposes Auto-Connect to preserve skeleton connectivity in automatic rigging	direct preference optimization

🔬 Pillar 3: Spatial Perception & Semantics (3 papers)

#	Title	Key Point	Tags	🔗	⭐
21	GraphGSOcc: Semantic-Geometric Graph Transformer with Dynamic-Static Decoupling for 3D Gaussian Splatting-based Occupancy Prediction	Proposes GraphGSOcc to address dynamic-static coupling in 3D semantic occupancy prediction	3D gaussian splatting 3DGS gaussian splatting
22	Affogato: Learning Open-Vocabulary Affordance Grounding with Automated Data Generation at Scale	Proposes Affogato for open-vocabulary affordance grounding	open-vocabulary open vocabulary affordance	✅
23	OV-MAP : Open-Vocabulary Zero-Shot 3D Instance Segmentation Map for Robots	Proposes OV-MAP for open-world 3D instance segmentation	open-vocabulary open vocabulary

🔬 Pillar 7: Motion Retargeting (2 papers)

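#	Title	Key Point	Tags	🔗	⭐
24	Dynamic Double Space Tower	Proposes the Dynamic Double Space Tower to address insufficient reasoning in visual question answering	spatial relationship multimodal
25	SphereDrag: Spherical Geometry-Aware Panoramic Image Editing	Proposes SphereDrag to handle spherical geometry in panoramic image editing	geometric consistency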

🔬 Pillar 8: Physics-based Animation (2 papers)

#	Title	Key Point	Tags	🔗	⭐
26	SignAligner: Harmonizing Complementary Pose Modalities for Coherent Sign Language Generation	Proposes SignAligner to coordinate complementary pose modalities in sign language generation	spatiotemporal multimodal
27	EyeSim-VQA: A Free-Energy-Guided Eye Simulation Framework for Video Quality Assessment	Proposes EyeSim-VQA to address adaptive restoration in video quality assessment	spatiotemporal

🔬 Pillar 6: Video Extraction & Matching (1 paper)

#	Title	Key Point	Tags	🔗	⭐
28	EgoPrivacy: What Your First-Person Camera Says About You?	Proposes EgoPrivacy to assess the privacy risks of first-person video	egocentric egocentric vision first-person view	✅

🔬 Pillar 1: Robot Control (1 paper)

#	Title	Key Point	Tags	🔗	⭐
29	Efficient Multi-Camera Tokenization with Triplanes for End-to-End Driving	Proposes an efficient triplane-based multi-camera tokenization method to improve end-to-end driving performance	motion planning
