cs.CV (2025-05-28)

📊 63 papers in total | 🔗 19 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (21 🔗4) Pillar 2: RL & Architecture (21 🔗8) Pillar 3: Perception & Semantics (12 🔗5) Pillar 6: Video Extraction (3) Pillar 1: Robot Control (3 🔗2) Pillar 5: Interaction & Reaction (1) Pillar 4: Generative Motion (1) Pillar 7: Motion Retargeting (1)

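🔬 Pillar 9: Embodied Foundation Models (21 papers)

#	Title	One-sentence summary	Tags	🔗
1	Look & Mark: Leveraging Radiologist Eye Fixations and Bounding boxes in Multimodal Large Language Models for Chest X-ray Report Generation	Proposes the Look & Mark strategy, which uses radiologist eye fixations and bounding boxes to improve the quality of chest X-ray report generation	large language model multimodal
2	Farm-LightSeek: An Edge-centric Multimodal Agricultural IoT Data Analytics Framework with Lightweight LLMs	Farm-LightSeek: an edge-computing-driven multimodal data analytics framework for agricultural IoT built on lightweight LLMs	large language model multimodal
3	Do You See Me : A Multidimensional Benchmark for Evaluating Visual Perception in Multimodal LLMs	Proposes a multidimensional benchmark for evaluating the visual perception abilities of multimodal large language models	large language model multimodal	✅
4	Zero-Shot 3D Visual Grounding from Vision-Language Models	Proposes SeeGround, which uses 2D vision-language models for zero-shot 3D visual grounding	visual grounding
5	HiDream-I1: A High-Efficient Image Generative Foundation Model with Sparse Diffusion Transformer	HiDream-I1: an efficient image generation foundation model based on a sparse Diffusion Transformer	foundation model	✅
6	3DLLM-Mem: Long-Term Spatial-Temporal Memory for Embodied 3D Large Language Model	Proposes 3DLLM-Mem for long-term spatial-temporal memory modeling in embodied 3D large language models	large language model
7	YH-MINER: Multimodal Intelligent System for Natural Ecological Reef Metric Extraction	YH-MINER: a multimodal intelligent system for extracting natural ecological coral reef metrics	multimodal
8	MObyGaze: a film dataset of multimodal objectification densely annotated by experts	Proposes MObyGaze, a film dataset for analyzing and quantifying multimodal objectification	multimodal
9	AquaMonitor: A multimodal multi-view image sequence dataset for real-life aquatic invertebrate biodiversity monitoring	AquaMonitor: a multimodal, multi-view image sequence dataset for monitoring aquatic invertebrate biodiversity	multimodal
10	VidText: Towards Comprehensive Evaluation for Video Text Understanding	Proposes the VidText benchmark for comprehensive evaluation of video text understanding, filling a gap in existing video understanding benchmarks	multimodal chain-of-thought
11	Thinking with Generated Images	Proposes a visual reasoning approach based on generated images that improves large models' cognitive abilities on complex visual tasks	multimodal chain-of-thought	✅
12	CADReview: Automatically Reviewing CAD Programs with Error Detection and Correction	Proposes the ReCAD framework, which automatically detects and corrects errors in CAD programs to improve 3D object design quality	large language model multimodal
13	Cross-modal RAG: Sub-dimensional Text-to-Image Retrieval-Augmented Generation	Proposes Cross-modal RAG to address fine-grained knowledge retrieval augmentation in text-to-image generation	large language model multimodal
14	OSPO: Object-centric Self-improving Preference Optimization for Text-to-Image Generation	Proposes OSPO, an object-centric self-improving preference optimization method that addresses object hallucination in text-to-image generation	large language model multimodal
15	Beyond Perception: Evaluating Abstract Visual Reasoning through Multi-Stage Task	Proposes the MultiStAR benchmark for evaluating abstract visual reasoning	large language model multimodal
16	EdgeVidSum: Real-Time Personalized Video Summarization at the Edge	EdgeVidSum: a lightweight method for real-time personalized video summarization on edge devices	TAMP
17	MAC-Gaze: Motion-Aware Continual Calibration for Mobile Gaze Tracking	MAC-Gaze: a motion-aware continual calibration method for mobile gaze tracking	multimodal
18	Zero-Shot Vision Encoder Grafting via LLM Surrogates	Grafts vision encoders zero-shot via LLM surrogates, reducing VLM training cost	large language model	✅
19	Sherlock: Self-Correcting Reasoning in Vision-Language Models	Sherlock: a self-correction-based training framework for vision-language models that improves performance on complex reasoning tasks	multimodal
20	Let Them Talk: Audio-Driven Multi-Person Conversational Video Generation	Proposes the MultiTalk framework for audio-driven video generation in multi-person conversation scenarios	instruction following
21	Universal Visuo-Tactile Video Understanding for Embodied Interaction	Proposes VTV-LLM to address the insufficient integration of tactile information	large language model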

🔬 Pillar 2: RL & Architecture (21 papers)

#	Title	One-sentence summary	Tags	🔗
22	SAM-R1: Leveraging SAM for Reward Feedback in Multimodal Segmentation via Reinforcement Learning	Proposes SAM-R1, which uses reinforcement learning and SAM to improve reasoning in multimodal image segmentation	reinforcement learning multimodal
23	Rhetorical Text-to-Image Generation via Two-layer Diffusion Policy Optimization	Proposes Rhet2Pix, which tackles rhetorical text-to-image generation through two-layer diffusion policy optimization	diffusion policy large language model multimodal
24	SemIRNet: A Semantic Irony Recognition Network for Multimodal Sarcasm Detection	Proposes SemIRNet, which uses knowledge fusion and cross-modal similarity detection to improve multimodal sarcasm recognition accuracy	contrastive learning multimodal
25	OmniAD: Detect and Understand Industrial Anomaly via Multimodal Reasoning	OmniAD: detects and explains industrial anomalies through multimodal reasoning	reinforcement learning multimodal
26	Research on Driving Scenario Technology Based on Multimodal Large Lauguage Model Optimization	Proposes a multimodal large model optimization method that improves scene perception for autonomous driving	distillation multimodal
27	IMTS is Worth Time $\times$ Channel Patches: Visual Masked Autoencoders for Irregular Multivariate Time Series Prediction	VIMTS: uses a visual MAE for irregular multivariate time series prediction, improving robustness to missing data	masked autoencoder MAE foundation model	✅
28	RiverMamba: A State Space Model for Global River Discharge and Flood Forecasting	RiverMamba: a state space model for global river discharge and flood forecasting	Mamba state space model	✅
29	Zooming from Context to Cue: Hierarchical Preference Optimization for Multi-Image MLLMs	Proposes Context-to-Cue DPO, which addresses hallucination in multi-image MLLMs and improves multimodal understanding	DPO direct preference optimization large language model
30	GeoDrive: 3D Geometry-Informed Driving World Model with Precise Action Control	GeoDrive: a driving world model informed by 3D geometry that achieves precise action control	world model geometric consistency
31	cadrille: Multi-modal CAD Reconstruction with Online Reinforcement Learning	cadrille: a multimodal CAD reconstruction model based on online reinforcement learning that generates more accurate 3D models	reinforcement learning large language model
32	Improving Contrastive Learning for Referring Expression Counting	Proposes the C-REX contrastive learning framework, which improves discriminative representation learning for referring expression counting	representation learning MAE contrastive learning	✅
33	RICO: Improving Accuracy and Completeness in Image Recaptioning via Visual Reconstruction	RICO: improves the accuracy and completeness of image recaptioning through visual reconstruction	DPO large language model multimodal	✅
34	Self-Reflective Reinforcement Learning for Diffusion-based Image Reasoning Generation	Proposes SRRL, a self-reflective reinforcement learning algorithm that lets diffusion models generate images with reasoning capability	reinforcement learning chain-of-thought	✅
35	D-Fusion: Direct Preference Optimization for Aligning Diffusion Models with Visually Consistent Samples	D-Fusion: aligns diffusion models through direct preference optimization with visually consistent samples	reinforcement learning DPO direct preference optimization
36	CAST: Contrastive Adaptation and Distillation for Semi-Supervised Instance Segmentation	Proposes the CAST framework, which improves semi-supervised instance segmentation through contrastive adaptation and distillation	distillation foundation model
37	Q-VDiT: Towards Accurate Quantization and Distillation of Video-Generation Diffusion Transformers	Q-VDiT: an accurate quantization and distillation framework for video-generation Diffusion Transformers	distillation spatiotemporal	✅
38	Learning World Models for Interactive Video Generation	Proposes VRAG, which uses video retrieval-augmented generation to build a world model for interactive long video generation	world model spatiotemporal
39	Dynamic-Aware Video Distillation: Optimizing Temporal Resolution Based on Video Semantics	Proposes DAViD, a reinforcement-learning-based, dynamic-aware video distillation method that optimizes the temporal resolution of video datasets	reinforcement learning distillation
40	StateSpaceDiffuser: Bringing Long Context to Diffusion World Models	Proposes StateSpaceDiffuser, which brings long-context modeling to diffusion world models	world model	✅
41	Improve Multi-Modal Embedding Learning via Explicit Hard Negative Gradient Amplifying	Proposes explicit hard negative gradient amplification to improve multimodal embedding learning	contrastive learning large language model	✅
42	InfoSAM: Fine-Tuning the Segment Anything Model from An Information-Theoretic Perspective	InfoSAM: fine-tunes SAM from an information-theoretic perspective to improve its segmentation performance in specific domains	distillation foundation model

🔬 Pillar 3: Perception & Semantics (12 papers)

#	Title	One-sentence summary	Tags	🔗
43	Diffusion-Denoised Hyperspectral Gaussian Splatting	Proposes a diffusion-denoised hyperspectral Gaussian splatting method for 3D reconstruction of hyperspectral scenes	3D gaussian splatting 3DGS gaussian splatting	✅
44	CLIPGaussian: Universal and Multimodal Style Transfer Based on Gaussian Splatting	Proposes CLIPGaussian for universal, multimodal style transfer based on Gaussian splatting	gaussian splatting splatting multimodal
45	Learning Fine-Grained Geometry for Sparse-View Splatting via Cascade Depth Loss	Proposes the HDGS framework, which learns fine-grained geometry via a cascade depth loss to improve splatting under sparse views	monocular depth 3D gaussian splatting 3DGS
46	A Survey on Training-free Open-Vocabulary Semantic Segmentation	Survey: research progress on training-free open-vocabulary semantic segmentation methods	open-vocabulary open vocabulary foundation model
47	Learning Hierarchical Sparse Transform Coding of 3DGS	Proposes SHTC, a sparsity-guided hierarchical transform coding method for efficient compression of 3DGS models	3D gaussian splatting 3DGS gaussian splatting	✅
48	Test-Time Adaptation of Vision-Language Models for Open-Vocabulary Semantic Segmentation	Proposes MLMP for test-time adaptation of vision-language models in open-vocabulary semantic segmentation	open-vocabulary open vocabulary	✅
49	Can NeRFs See without Cameras?	Proposes a NeRF based on multipath signals that reconstructs indoor environments without cameras	NeRF neural radiance field
50	Towards Comprehensive Scene Understanding: Integrating First and Third-Person Views for LVLMs	Proposes the E3VQA benchmark and the M3CoT prompting method, which integrate first- and third-person views to improve LVLM scene understanding	scene understanding egocentric	✅
51	Task-Driven Implicit Representations for Automated Design of LiDAR Systems	Proposes a task-driven implicit representation method for the automated design of LiDAR systems	implicit representation
52	MR.NAVI: Mixed-Reality Navigation Assistant for the Visually Impaired	MR.NAVI: a mixed-reality navigation assistant for the visually impaired	depth estimation scene understanding
53	SPIRAL: Semantic-Aware Progressive LiDAR Scene Generation and Understanding	Proposes SPIRAL, a semantic-aware progressive framework for LiDAR scene generation and understanding	semantic map
54	On Geometry-Enhanced Parameter-Efficient Fine-Tuning for 3D Scene Segmentation	Proposes GEM, a geometry-enhanced parameter-efficient fine-tuning method for 3D scene segmentation	scene understanding	✅

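🔬 Pillar 6: Video Extraction (3 papers)

#	Title	One-sentence summary	Tags
55	A Probabilistic Jump-Diffusion Framework for Open-World Egocentric Activity Recognition	Proposes ProbRes, a jump-diffusion-based probabilistic residual search framework for open-world egocentric activity recognition	egocentric
56	Fast Feature Matching of UAV Images via Matrix Band Reduction-based GPU Data Schedule	Proposes a GPU data scheduling algorithm based on matrix band reduction to accelerate UAV image feature matching	feature matching
57	Event-based Egocentric Human Pose Estimation in Dynamic Environment	Proposes the D-EventEgo framework for event-camera-based egocentric human pose estimation in dynamic environments	egocentric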

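🔬 Pillar 1: Robot Control (3 papers)

#	Title	One-sentence summary	Tags	🔗
58	Flexible Tool Selection through Low-dimensional Attribute Alignment of Vision and Language	Proposes a vision-language tool selection framework based on low-dimensional attribute alignment, enabling efficient and flexible tool selection	manipulation multimodal
59	ATI: Any Trajectory Instruction for Controllable Video Generation	Proposes a unified framework for trajectory instructions in controllable video generation	manipulation	✅
60	FaceEditTalker: Controllable Talking Head Generation with Facial Attribute Editing	Proposes FaceEditTalker to address controllable facial attribute editing	manipulation	✅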

🔬 Pillar 5: Interaction & Reaction (1 paper)

#	Title	One-sentence summary	Tags	🔗
61	Prototype Embedding Optimization for Human-Object Interaction Detection in Livestreaming	Proposes PeO-HOI, a prototype embedding optimization method that addresses object bias in HOI detection for livestreaming	human-object interaction HOI

🔬 Pillar 4: Generative Motion (1 paper)

#	Title	One-sentence summary	Tags	🔗
62	UniMoGen: Universal Motion Generation	UniMoGen: a universal, skeleton-agnostic diffusion model for motion generation	motion generation character animation

🔬 Pillar 7: Motion Retargeting (1 paper)

#	Title	One-sentence summary	Tags	🔗
63	LatentMove: Towards Complex Human Movement Video Generation	LatentMove: a DiT framework for complex human movement video generation	human motion

