cs.CV (2025-07-14)

📊 32 papers total | 🔗 12 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (16 🔗8) | Pillar 3: Perception & Semantics (6) | Pillar 2: RL & Architecture (5 🔗3) | Pillar 1: Robot Control (2 🔗1) | Pillar 4: Generative Motion (1) | Pillar 8: Physics-based Animation (1) | Pillar 6: Video Extraction (1)

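🔬 Pillar 9: Embodied Foundation Models (16 papers)

#	Title	One-line summary	Tags	🔗	⭐
1	ViTCoT: Video-Text Interleaved Chain-of-Thought for Boosting Video Understanding in Large Language Models	Proposes ViTCoT, a video-text interleaved chain of thought that improves video understanding in large language models	embodied AI large language model chain-of-thought
2	FaceLLM: A Multimodal Large Language Model for Face Understanding	FaceLLM: a multimodal large language model for face understanding that improves performance on face-related tasks	large language model multimodal
3	Test-Time Canonicalization by Foundation Models for Robust Perception	Proposes FOCAL, which uses pretrained models for test-time canonicalization to make perception systems more robust	foundation model	✅
4	Synthesizing Near-Boundary OOD Samples for Out-of-Distribution Detection	SynOOD: synthesizes near-boundary OOD samples with generative models to improve OOD detection	large language model foundation model multimodal	✅
5	Boosting Multimodal Learning via Disentangled Gradient Learning	Proposes the disentangled gradient learning (DGL) framework to resolve the optimization conflict between modality encoders and the fusion module in multimodal learning	multimodal	✅
6	CoSMo: A Multimodal Transformer for Page Stream Segmentation in Comic Books	Proposes CoSMo, a multimodal Transformer for page stream segmentation in comic books	multimodal
7	(Almost) Free Modality Stitching of Foundation Models	Proposes the Hyma framework, which uses hypernetworks for efficient multimodal model stitching and optimal unimodal model selection	foundation model
8	IGD: Instructional Graphic Design with Multimodal Layer Generation	Proposes IGD: editable instructional graphic design via multimodal layer generation	multimodal
9	Text-Visual Semantic Constrained AI-Generated Image Quality Assessment	Proposes the SC-AGIQA framework, which uses text-visual semantic constraints to improve the accuracy of AI-generated image quality assessment	large language model multimodal	✅
10	DisCo: Towards Distinct and Coherent Visual Encapsulation in Video MLLMs	Proposes DisCo, which improves the semantic distinctness and temporal coherence of visual encapsulation in video MLLMs	large language model multimodal	✅
11	A Training-Free, Task-Agnostic Framework for Enhancing MLLM Performance on High-Resolution Images	Proposes the ECP framework, which improves fine-grained localization and reasoning of MLLMs on high-resolution images without training	large language model multimodal	✅
12	Can GPT-4o mini and Gemini 2.0 Flash Predict Fine-Grained Fashion Product Attributes? A Zero-Shot Analysis	Zero-shot analysis: evaluates how well GPT-4o mini and Gemini 2.0 Flash predict fine-grained fashion product attributes	large language model multimodal	✅
13	A Survey on MLLM-based Visually Rich Document Understanding: Methods, Challenges, and Emerging Trends	Survey: methods, challenges, and emerging trends in MLLM-based visually rich document understanding	large language model multimodal
14	DEARLi: Decoupled Enhancement of Recognition and Localization for Semi-supervised Panoptic Segmentation	DEARLi: decoupled enhancement of recognition and localization for semi-supervised panoptic segmentation	foundation model	✅
15	Cross-modal Associations in Vision and Language Models: Revisiting the Bouba-Kiki Effect	Revisits the Bouba-Kiki effect to evaluate cross-modal associations in vision-language models	multimodal
16	Generative Audio Language Modeling with Continuous-valued Tokens and Masked Next-Token Prediction	Proposes a generative audio language model based on continuous-valued tokens and masked next-token prediction, improving audio generation quality	large language model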

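🔬 Pillar 3: Perception & Semantics (6 papers)

#	Title	One-line summary	Tags	🔗	⭐
17	OpenHuman4D: Open-Vocabulary 4D Human Parsing	Proposes the OpenHuman4D framework for fast, open-vocabulary 4D human parsing	open-vocabulary open vocabulary
18	3DGAA: Realistic and Robust 3D Gaussian-based Adversarial Attack for Autonomous Driving	Proposes 3DGAA, a 3D Gaussian-based adversarial attack framework, to help improve the safety of autonomous-driving object detection systems	3D gaussian splatting 3DGS gaussian splatting
19	LLM-Guided Agentic Object Detection for Open-World Understanding	Proposes an LLM-guided agentic object detection framework for zero-shot, label-free open-world understanding	open-vocabulary open vocabulary large language model
20	Cameras as Relative Positional Encoding	Proposes PRoPE, which treats camera parameters as relative positional encoding to improve the 3D perception of multi-view Transformers	depth estimation stereo depth
21	Spatial Lifting for Dense Prediction	Proposes Spatial Lifting (SL), an efficient, low-parameter method for dense prediction tasks	depth estimation
22	MoVieS: Motion-Aware 4D Dynamic View Synthesis in One Second	MoVieS: motion-aware 4D dynamic novel view synthesis from monocular video in about one second	scene flow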

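🔬 Pillar 2: RL & Architecture (5 papers)

#	Title	One-line summary	Tags	🔗	⭐
23	Mind the Gap: Aligning Vision Foundation Models to Image Feature Matching	Proposes the IMD framework, which aligns vision foundation models to resolve the multi-instance problem in image feature matching	contrastive learning feature matching foundation model
24	Reprogramming Vision Foundation Models for Spatio-Temporal Forecasting	Proposes ST-VFM, which reprograms vision foundation models for spatio-temporal forecasting	representation learning large language model foundation model
25	Improving Multimodal Learning via Imbalanced Learning	Proposes the asymmetric representation learning (ARL) strategy, which improves multimodal fusion performance through imbalanced learning	representation learning multimodal	✅
26	Inversion-DPO: Precise and Efficient Post-Training for Diffusion Models	Inversion-DPO: a precise and efficient post-training method for diffusion models that needs no reward model	DPO direct preference optimization	✅
27	FIX-CLIP: Dual-Branch Hierarchical Contrastive Learning via Synthetic Captions for Better Understanding of Long Text	Proposes FIX-CLIP, which uses dual-branch hierarchical contrastive learning and synthetic captions to improve CLIP on long-text understanding tasks	contrastive learning	✅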

🔬 Pillar 1: Robot Control (2 papers)

#	Title	One-line summary	Tags	🔗	⭐
28	EmbRACE-3K: Embodied Reasoning and Action in Complex Environments	EmbRACE-3K: a benchmark dataset for embodied reasoning and action in complex environments	manipulation reinforcement learning scene understanding
29	A New Dataset and Performance Benchmark for Real-time Spacecraft Segmentation in Onboard Flight Computers	Proposes SWiM, a new dataset for real-time spacecraft segmentation, along with a YOLO performance benchmark	manipulation	✅

🔬 Pillar 4: Generative Motion (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
30	Quantize-then-Rectify: Efficient VQ-VAE Training	Proposes the ReVQ framework, which speeds up VQ-VAE training and lowers compute cost via quantize-then-rectify	VQ-VAE multimodal

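🔬 Pillar 8: Physics-based Animation (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
31	Resolution Revolution: A Physics-Guided Deep Learning Framework for Spatiotemporal Temperature Reconstruction	Proposes a physics-guided deep learning framework for temperature reconstruction at high spatiotemporal resolution	spatiotemporal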

🔬 Pillar 6: Video Extraction (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
32	Glance-MCMT: A General MCMT Framework with Glance Initialization and Progressive Association	Proposes the Glance-MCMT framework for multi-camera multi-target tracking	feature matching
