cs.CV (2025-07-14)

📊 32 papers in total | 🔗 12 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (16 🔗8) · Pillar 3: Perception & Semantics (6) · Pillar 2: RL & Architecture (5 🔗3) · Pillar 1: Robot Control (2 🔗1) · Pillar 4: Generative Motion (1) · Pillar 8: Physics-based Animation (1) · Pillar 6: Video Extraction (1)

🔬 Pillar 9: Embodied Foundation Models (16 papers)

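| # | Title | One-line summary | Tags | 🔗 |
| --- | --- | --- | --- | --- |
| 1 | ViTCoT: Video-Text Interleaved Chain-of-Thought for Boosting Video Understanding in Large Language Models | Proposes ViTCoT, a video-text interleaved chain-of-thought that improves video understanding in large language models. | embodied AI, large language model, chain-of-thought | |
| 2 | FaceLLM: A Multimodal Large Language Model for Face Understanding | FaceLLM: a multimodal large language model for face understanding that improves performance on face-related tasks. | large language model, multimodal | |
| 3 | Test-Time Canonicalization by Foundation Models for Robust Perception | Proposes FOCAL, which uses pretrained models to perform canonicalization at test time, improving the robustness of perception systems. | foundation model | |
| 4 | Synthesizing Near-Boundary OOD Samples for Out-of-Distribution Detection | SynOOD: uses generative models to synthesize near-boundary OOD samples, improving OOD detection performance. | large language model, foundation model, multimodal | |
| 5 | Boosting Multimodal Learning via Disentangled Gradient Learning | Proposes DGL, a disentangled gradient learning framework that resolves the optimization conflict between modality encoders and the fusion module in multimodal learning. | multimodal | |
| 6 | CoSMo: A Multimodal Transformer for Page Stream Segmentation in Comic Books | Proposes CoSMo, a multimodal Transformer for page stream segmentation in comic books. | multimodal | |
| 7 | (Almost) Free Modality Stitching of Foundation Models | Proposes the Hyma framework, which uses hypernetworks for efficient stitching of multimodal models and selection of the best unimodal models. | foundation model | |
| 8 | IGD: Instructional Graphic Design with Multimodal Layer Generation | Proposes IGD, which produces editable instructional graphic designs through multimodal layer generation. | multimodal | |
| 9 | Text-Visual Semantic Constrained AI-Generated Image Quality Assessment | Proposes the SC-AGIQA framework, which improves the accuracy of AI-generated image quality assessment through text-visual semantic constraints. | large language model, multimodal | |
| 10 | DisCo: Towards Distinct and Coherent Visual Encapsulation in Video MLLMs | Proposes DisCo, which improves the semantic distinctness and temporal coherence of visual encapsulation in video MLLMs. | large language model, multimodal | |
| 11 | A Training-Free, Task-Agnostic Framework for Enhancing MLLM Performance on High-Resolution Images | Proposes the ECP framework, which improves MLLMs' fine-grained localization and reasoning on high-resolution images without any training. | large language model, multimodal | |
| 12 | Can GPT-4o mini and Gemini 2.0 Flash Predict Fine-Grained Fashion Product Attributes? A Zero-Shot Analysis | Zero-shot analysis: evaluates how well GPT-4o mini and Gemini 2.0 Flash predict fine-grained fashion product attributes. | large language model, multimodal | |
| 13 | A Survey on MLLM-based Visually Rich Document Understanding: Methods, Challenges, and Emerging Trends | Survey: methods, challenges, and emerging trends in MLLM-based visually rich document understanding. | large language model, multimodal | |
| 14 | DEARLi: Decoupled Enhancement of Recognition and Localization for Semi-supervised Panoptic Segmentation | DEARLi: decouples recognition and localization to enhance semi-supervised panoptic segmentation. | foundation model | |
| 15 | Cross-modal Associations in Vision and Language Models: Revisiting the Bouba-Kiki Effect | Revisits the Bouba-Kiki effect to evaluate cross-modal association ability in vision-language models. | multimodal | |
| 16 | Generative Audio Language Modeling with Continuous-valued Tokens and Masked Next-Token Prediction | Proposes a generative audio language model based on continuous-valued tokens and masked prediction, improving audio generation quality. | large language model | |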

🔬 Pillar 3: Perception & Semantics (6 papers)

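| # | Title | One-line summary | Tags | 🔗 |
| --- | --- | --- | --- | --- |
| 17 | OpenHuman4D: Open-Vocabulary 4D Human Parsing | Proposes the OpenHuman4D framework for fast, open-vocabulary 4D human parsing. | open-vocabulary, open vocabulary | |
| 18 | 3DGAA: Realistic and Robust 3D Gaussian-based Adversarial Attack for Autonomous Driving | Proposes 3DGAA, a 3D Gaussian-based adversarial attack framework, to improve the safety of object detection systems for autonomous driving. | 3D gaussian splatting, 3DGS, gaussian splatting | |
| 19 | LLM-Guided Agentic Object Detection for Open-World Understanding | Proposes an LLM-guided agentic object detection framework for zero-shot, label-free open-world understanding. | open-vocabulary, open vocabulary, large language model | |
| 20 | Cameras as Relative Positional Encoding | Proposes PRoPE, which uses camera parameters as relative positional encoding to improve the 3D perception of multi-view Transformers. | depth estimation, stereo depth | |
| 21 | Spatial Lifting for Dense Prediction | Proposes Spatial Lifting (SL), an efficient method for dense prediction with a small parameter count. | depth estimation | |
| 22 | MoVieS: Motion-Aware 4D Dynamic View Synthesis in One Second | MoVieS: generates motion-aware 4D dynamic novel views from monocular video on the order of a second. | scene flow | |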

🔬 Pillar 2: RL & Architecture (5 papers)

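| # | Title | One-line summary | Tags | 🔗 |
| --- | --- | --- | --- | --- |
| 23 | Mind the Gap: Aligning Vision Foundation Models to Image Feature Matching | Proposes the IMD framework, which aligns vision foundation models to resolve the multi-instance problem in image feature matching. | contrastive learning, feature matching, foundation model | |
| 24 | Reprogramming Vision Foundation Models for Spatio-Temporal Forecasting | Proposes ST-VFM, which reprograms vision foundation models for spatio-temporal forecasting. | representation learning, large language model, foundation model | |
| 25 | Improving Multimodal Learning via Imbalanced Learning | Proposes an Asymmetric Representation Learning (ARL) strategy that improves multimodal fusion performance through imbalanced learning. | representation learning, multimodal | |
| 26 | Inversion-DPO: Precise and Efficient Post-Training for Diffusion Models | Inversion-DPO: a precise and efficient post-training method for diffusion models that requires no reward model. | DPO, direct preference optimization | |
| 27 | FIX-CLIP: Dual-Branch Hierarchical Contrastive Learning via Synthetic Captions for Better Understanding of Long Text | Proposes FIX-CLIP, which uses dual-branch hierarchical contrastive learning and synthetic captions to improve CLIP on long-text understanding tasks. | contrastive learning | |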

🔬 Pillar 1: Robot Control (2 papers)

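| # | Title | One-line summary | Tags | 🔗 |
| --- | --- | --- | --- | --- |
| 28 | EmbRACE-3K: Embodied Reasoning and Action in Complex Environments | EmbRACE-3K: a benchmark dataset for embodied reasoning and action in complex environments. | manipulation, reinforcement learning, scene understanding | |
| 29 | A New Dataset and Performance Benchmark for Real-time Spacecraft Segmentation in Onboard Flight Computers | Proposes SWiM, a new dataset for real-time spacecraft segmentation, together with a YOLO performance benchmark. | manipulation | |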

🔬 Pillar 4: Generative Motion (1 paper)

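| # | Title | One-line summary | Tags | 🔗 |
| --- | --- | --- | --- | --- |
| 30 | Quantize-then-Rectify: Efficient VQ-VAE Training | Proposes the ReVQ framework, which accelerates VQ-VAE training through quantization rectification and lowers compute cost. | VQ-VAE, multimodal | |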

🔬 Pillar 8: Physics-based Animation (1 paper)

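| # | Title | One-line summary | Tags | 🔗 |
| --- | --- | --- | --- | --- |
| 31 | Resolution Revolution: A Physics-Guided Deep Learning Framework for Spatiotemporal Temperature Reconstruction | Proposes a physics-guided deep learning framework for temperature reconstruction at high spatiotemporal resolution. | spatiotemporal | |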

🔬 Pillar 6: Video Extraction (1 paper)

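| # | Title | One-line summary | Tags | 🔗 |
| --- | --- | --- | --- | --- |
| 32 | Glance-MCMT: A General MCMT Framework with Glance Initialization and Progressive Association | Proposes the Glance-MCMT framework for multi-camera multi-target tracking. | feature matching | |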
