cs.CV (2025-07-07)

📊 35 papers in total | 🔗 10 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (16 🔗4) · Pillar 2: RL & Architecture (7 🔗1) · Pillar 3: Perception & Semantics (6 🔗3) · Pillar 1: Robot Control (2 🔗1) · Pillar 8: Physics-based Animation (1) · Pillar 6: Video Extraction (1 🔗1) · Pillar 4: Generative Motion (1) · Pillar 7: Motion Retargeting (1)

🔬 Pillar 9: Embodied Foundation Models (16 papers)

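| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 1 | Can Video LLMs Refuse to Answer? Alignment for Answerability in Video Large Language Models | Proposes an answerability alignment framework that improves the ability of video large language models to refuse to answer irrelevant questions. | large language model, multimodal | |
| 2 | ReLoop: "Seeing Twice and Thinking Backwards" via Closed-loop Training to Mitigate Hallucinations in Multimodal understanding | Proposes ReLoop, a closed-loop training framework that mitigates hallucinations in multimodal large language models. | large language model, multimodal | |
| 3 | VectorLLM: Human-like Extraction of Structured Building Contours vis Multimodal LLMs | Proposes VectorLLM to address building contour extraction. | large language model, multimodal | |
| 4 | MODA: MOdular Duplex Attention for Multimodal Perception, Cognition, and Emotion Understanding | Proposes MODA, which uses a modular duplex attention mechanism to improve multimodal perception, cognition, and emotion understanding. | large language model, multimodal | |
| 5 | Beyond Simple Edits: X-Planner for Complex Instruction-Based Image Editing | Proposes X-Planner, which uses an MLLM to plan complex instruction-based image edits, improving editing quality and identity preservation. | large language model, multimodal, chain-of-thought | |
| 6 | Differential Attention for Multimodal Crisis Event Analysis | Proposes a differential attention mechanism that improves feature alignment and classification performance in multimodal crisis event analysis. | multimodal | |
| 7 | MurreNet: Modeling Holistic Multimodal Interactions Between Histopathology and Genomic Profiles for Survival Prediction | MurreNet: models holistic multimodal interactions between histopathology and genomic profiles for survival prediction. | multimodal | |
| 8 | Geometric-Guided Few-Shot Dental Landmark Detection with Human-Centric Foundation Model | GeoSapiens: few-shot dental landmark detection that combines geometric constraints with a human-centric foundation model. | foundation model | |
| 9 | HumanVideo-MME: Benchmarking MLLMs for Human-Centric Video Understanding | Proposes HV-MMBench for comprehensive evaluation of MLLMs on human-centric video understanding. | large language model, multimodal | |
| 10 | From Imitation to Innovation: The Emergence of AI Unique Artistic Styles and the Challenge of Copyright Protection | Proposes the ArtBulb framework for AI art copyright assessment and builds AICD, the first AI art copyright dataset. | large language model, multimodal | |
| 11 | An analysis of vision-language models for fabric retrieval | For fabric retrieval, proposes a zero-shot vision-language retrieval scheme based on automatic annotation by multimodal large language models. | large language model, multimodal | |
| 12 | Learning Robust Stereo Matching in the Wild with Selective Mixture-of-Experts | Proposes SMoEStereo, which uses a selective mixture-of-experts model to improve the robustness of stereo matching in complex scenes. | foundation model | |
| 13 | SPARC: Concept-Aligned Sparse Autoencoders for Cross-Model and Cross-Modal Interpretability | SPARC: concept-aligned sparse autoencoders for cross-model and cross-modal interpretability. | multimodal | |
| 14 | Llama Nemoretriever Colembed: Top-Performing Text-Image Retrieval Model | Llama Nemoretriever Colembed: a high-performance text-image cross-modal retrieval model. | multimodal | |
| 15 | INTER: Mitigating Hallucination in Large Vision-Language Models by Interaction Guidance Sampling | Proposes INTER, which mitigates hallucination in large vision-language models through interaction guidance sampling. | multimodal | |
| 16 | Transcribing Spanish Texts from the Past: Experiments with Transkribus, Tesseract and Granite | The GRESEL team explores several OCR methods for transcribing historical Spanish texts, providing a comparison for the PastReader task. | multimodal | |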

🔬 Pillar 2: RL & Architecture (7 papers)

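| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 17 | SegmentDreamer: Towards High-fidelity Text-to-3D Synthesis with Segmented Consistency Trajectory Distillation | SegmentDreamer: high-fidelity text-to-3D synthesis via segmented consistency trajectory distillation. | dreamer, distillation, 3D gaussian splatting | |
| 18 | Tempo-R0: A Video-MLLM for Temporal Video Grounding through Efficient Temporal Sensing Reinforcement Learning | Proposes Tempo-R0, which tackles temporal video grounding through efficient temporal sensing reinforcement learning. | reinforcement learning, large language model, multimodal | |
| 19 | VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents | VLM2Vec-V2: proposes a unified multimodal embedding framework that supports videos, images, and visual documents, broadening its application scenarios. | representation learning, multimodal | |
| 20 | Beyond One Shot, Beyond One Perspective: Cross-View and Long-Horizon Distillation for Better LiDAR Representations | Proposes the LiMA framework, which improves LiDAR representation learning through cross-view and long-horizon distillation. | representation learning, distillation, spatiotemporal | |
| 21 | Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning | Proposes Open Vision Reasoner, which strengthens multimodal visual reasoning by transferring linguistic cognitive behavior. | reinforcement learning, large language model, multimodal | |
| 22 | Neural-Driven Image Editing | LoongX: proposes a hands-free image editing method driven by multimodal neural signals. | contrastive learning, multimodal | |
| 23 | RIPE: Reinforcement Learning on Unlabeled Image Pairs for Robust Keypoint Extraction | Proposes RIPE, a weakly supervised keypoint extraction framework based on reinforcement learning. | reinforcement learning | |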

🔬 Pillar 3: Perception & Semantics (6 papers)

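| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 24 | InterGSEdit: Interactive 3D Gaussian Splatting Editing with 3D Geometry-Consistent Attention Prior | InterGSEdit: interactive 3D Gaussian Splatting editing using a geometry-consistent attention prior. | 3D gaussian splatting, 3DGS, gaussian splatting | |
| 25 | All in One: Visual-Description-Guided Unified Point Cloud Segmentation | Proposes VDG-Uni3DSeg, a unified framework guided by visual descriptions for point cloud segmentation. | scene understanding, large language model, multimodal | |
| 26 | OpenWorldSAM: Extending SAM2 for Universal Image Segmentation with Language Prompts | Proposes OpenWorldSAM to address open-vocabulary image segmentation. | open-vocabulary, open vocabulary | |
| 27 | MoDiT: Learning Highly Consistent 3D Motion Coefficients with Diffusion Transformer for Talking Head Generation | MoDiT: uses a diffusion Transformer to learn consistent 3D motion coefficients for realistic talking head generation. | optical flow | |
| 28 | TLB-VFI: Temporal-Aware Latent Brownian Bridge Diffusion for Video Frame Interpolation | Proposes TLB-VFI, a temporal-aware latent Brownian bridge diffusion model that efficiently addresses video frame interpolation. | optical flow | |
| 29 | MCFormer: A Multi-Cost-Volume Network and Comprehensive Benchmark for Particle Image Velocimetry | Proposes MCFormer, a multi-cost-volume network, and builds a comprehensive PIV benchmark, addressing the lack of systematic evaluation of deep learning for PIV. | optical flow | |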

🔬 Pillar 1: Robot Control (2 papers)

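| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 30 | VOTE: Vision-Language-Action Optimization with Trajectory Ensemble Voting | VOTE: vision-language-action optimization based on trajectory ensemble voting, improving the efficiency of robot manipulation. | manipulation, vision-language-action, VLA | |
| 31 | Mastering Regional 3DGS: Locating, Initializing, and Editing with Diverse 2D Priors | Proposes a regional 3DGS editing method based on 2D priors, improving editing efficiency and quality. | manipulation, 3D gaussian splatting, 3DGS | |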

🔬 Pillar 8: Physics-based Animation (1 paper)

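| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 32 | ChangeBridge: Spatiotemporal Image Generation with Multimodal Controls for Remote Sensing | Proposes ChangeBridge for spatiotemporal remote sensing image generation, addressing the inability of existing methods to model cross-temporal changes. | spatiotemporal, multimodal | |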

🔬 Pillar 6: Video Extraction (1 paper)

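| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 33 | Spatio-Temporal LLM: Reasoning about Environments and Actions | Proposes a spatio-temporal LLM to address the challenges multimodal large models face in environment understanding and action reasoning. | egocentric, large language model, multimodal | |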

🔬 Pillar 4: Generative Motion (1 paper)

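| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 34 | Motion Generation: A Survey of Generative Approaches and Benchmarks | A survey of motion generation: a comprehensive review and taxonomy of generative approaches and benchmarks. | motion generation | |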

🔬 Pillar 7: Motion Retargeting (1 paper)

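| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 35 | A Visual Leap in CLIP Compositionality Reasoning through Generation of Counterfactual Sets | Proposes a compositional reasoning method for vision-language models based on counterfactual set generation, improving model performance. | spatial relationship, large language model | |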
