cs.CV (2025-10-23)

📊 32 papers in total | 🔗 11 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (10 🔗5) · Pillar 2: RL & Architecture (10 🔗2) · Pillar 3: Perception & Semantics (6 🔗3) · Pillar 6: Video Extraction (2 🔗1) · Pillar 1: Robot Control (1) · Pillar 8: Physics-based Animation (1) · Pillar 4: Generative Motion (1) · Pillar 7: Motion Retargeting (1)

🔬 Pillar 9: Embodied Foundation Models (10 papers)

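| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 1 | BioCAP: Exploiting Synthetic Captions Beyond Labels in Biological Foundation Models | BioCAP: uses synthetic captions to strengthen biological foundation models beyond label supervision. | large language model, foundation model, multimodal | |
| 2 | EmbodiedBrain: Expanding Performance Boundaries of Task Planning for Embodied Intelligence | EmbodiedBrain: improves task planning performance for embodied intelligence via Step-GRPO. | embodied AI, large language model, foundation model | |
| 3 | Diagnosing Visual Reasoning: Challenges, Insights, and a Path Forward | Proposes an agent-based architecture that improves the performance of multimodal large language models on visual reasoning tasks. | large language model, multimodal, chain-of-thought | |
| 4 | Metis-HOME: Hybrid Optimized Mixture-of-Experts for Multimodal Reasoning | Proposes Metis-HOME, which uses a mixture-of-experts model to address efficiency and generalization challenges in multimodal reasoning. | multimodal | |
| 5 | HyperET: Efficient Training in Hyperbolic Space for Multi-modal Large Language Models | HyperET: trains multimodal large language models efficiently in hyperbolic space, improving cross-modal alignment. | large language model | |
| 6 | Calibrating Multimodal Consensus for Emotion Recognition | Proposes a calibrated multimodal consensus model to address semantic inconsistency in emotion recognition. | multimodal | |
| 7 | Fake-in-Facext: Towards Fine-Grained Explainable DeepFake Analysis | Proposes the Fake-in-Facext framework for fine-grained, explainable DeepFake face analysis. | large language model, multimodal | |
| 8 | Small Drafts, Big Verdict: Information-Intensive Visual Reasoning via Speculation | Proposes the Speculative Verdict framework for visual reasoning over information-intensive images. | multimodal | |
| 9 | SeViCES: Unifying Semantic-Visual Evidence Consensus for Long Video Understanding | Proposes the SeViCES framework, which improves long-video understanding through semantic-visual consensus. | large language model | |
| 10 | Breakdance Video classification in the age of Generative AI | Analyzes how well video foundation models (encoders and decoders) suit breakdance video classification in the era of generative AI. | foundation model | |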

🔬 Pillar 2: RL & Architecture (10 papers)

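| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 11 | VESSA: Video-based objEct-centric Self-Supervised Adaptation for Visual Foundation Models | Proposes VESSA, a video-based, object-centric self-supervised adaptation method for visual foundation models. | distillation, foundation model | |
| 12 | Conan: Progressive Learning to Reason Like a Detective over Multi-Scale Visual Evidence | Conan: proposes a progressive learning framework over multi-scale visual evidence that improves the performance of multimodal large language models on video reasoning tasks. | reinforcement learning, large language model, multimodal | |
| 13 | A Structured Review and Quantitative Profiling of Public Brain MRI Datasets for Foundation Model Development | Systematically evaluates the diversity and consistency of public datasets for brain MRI foundation models. | representation learning, foundation model | |
| 14 | GranViT: A Fine-Grained Vision Model With Autoregressive Perception For MLLMs | GranViT: a fine-grained vision model for MLLMs that improves performance through autoregressive perception. | distillation, large language model, multimodal | |
| 15 | Addressing Corner Cases in Autonomous Driving: A World Model-based Approach with Mixture of Experts and LLMs | Proposes the WM-MoE framework, which uses a world model and a mixture of experts to address corner cases in autonomous driving. | world model, large language model | |
| 16 | Towards Objective Obstetric Ultrasound Assessment: Contrastive Representation Learning for Fetal Movement Detection | Proposes the CURL framework, which uses contrastive learning to detect fetal movement in fetal ultrasound video. | representation learning, contrastive learning | |
| 17 | Generative Point Tracking with Flow Matching | Proposes GenPT, a generative point tracker based on flow matching, to address multimodal trajectory prediction under visual occlusion. | flow matching | |
| 18 | TernaryCLIP: Efficiently Compressing Vision-Language Models with Ternary Weights and Distilled Knowledge | TernaryCLIP: compresses vision-language models efficiently with ternary weights and knowledge distillation. | distillation, multimodal | |
| 19 | IB-GAN: Disentangled Representation Learning with Information Bottleneck Generative Adversarial Networks | Proposes IB-GAN, which uses the information bottleneck to improve disentangled representation learning in GANs. | representation learning | |
| 20 | TOMCAT: Test-time Comprehensive Knowledge Accumulation for Compositional Zero-Shot Learning | Proposes TOMCAT, which addresses distribution shift in compositional zero-shot learning through test-time knowledge accumulation. | representation learning, multimodal | |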

🔬 Pillar 3: Perception & Semantics (6 papers)

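| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 21 | COS3D: Collaborative Open-Vocabulary 3D Segmentation | Proposes COS3D, a collaborative prompt-segmentation framework that addresses the fusion of language and segmentation in open-vocabulary 3D segmentation. | gaussian splatting, splatting, open-vocabulary | |
| 22 | Deep Learning-Powered Visual SLAM Aimed at Assisting Visually Impaired Navigation | Proposes SELM-SLAM3, which enhances visual SLAM with deep learning to assist navigation for visually impaired people. | visual SLAM | |
| 23 | RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling | RAPO++: optimizes prompts across stages for text-to-video generation through data alignment and test-time scaling. | optical flow, large language model | |
| 24 | PartNeXt: A Next-Generation Dataset for Fine-Grained and Hierarchical 3D Part Understanding | Proposes the PartNeXt dataset for fine-grained, hierarchical 3D part understanding, improving model performance. | open-vocabulary, open vocabulary | |
| 25 | From Far and Near: Perceptual Evaluation of Crowd Representations Across Levels of Detail | Studies the perceptual quality of crowd representations at different levels of detail to optimize crowd rendering strategies. | neural radiance field | |
| 26 | PPMStereo: Pick-and-Play Memory Construction for Consistent Dynamic Stereo Matching | Proposes PPMStereo, which achieves temporal consistency in dynamic stereo matching through Pick-and-Play memory construction. | depth estimation | |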

🔬 Pillar 6: Video Extraction (2 papers)

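| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 27 | DMC$^3$: Dual-Modal Counterfactual Contrastive Construction for Egocentric Video Question Answering | Proposes the DMC$^3$ framework to address challenges in egocentric video question answering. | egocentric, Ego4D | |
| 28 | Radar-Camera Fused Multi-Object Tracking: Online Calibration and Common Feature | Proposes a radar-camera fused multi-object tracking framework with online calibration and use of common features. | feature matching | |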

🔬 Pillar 1: Robot Control (1 paper)

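| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 29 | BioDet: Boosting Industrial Object Detection with Image Preprocessing Strategies | BioDet: improves industrial object detection performance with image preprocessing strategies. | manipulation, open-vocabulary, open vocabulary | |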

🔬 Pillar 8: Physics-based Animation (1 paper)

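| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 30 | Video Prediction of Dynamic Physical Simulations With Pixel-Space Spatiotemporal Transformers | Proposes a video prediction method for physical simulations based on pixel-space spatiotemporal Transformers. | spatiotemporal, large language model | |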

🔬 Pillar 4: Generative Motion (1 paper)

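| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 31 | ARGenSeg: Image Segmentation with Autoregressive Image Generation Model | ARGenSeg: proposes an image segmentation method based on an autoregressive image generation model. | VQ-VAE, large language model, multimodal | |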

🔬 Pillar 7: Motion Retargeting (1 paper)

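| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 32 | AutoScape: Geometry-Consistent Long-Horizon Scene Generation | AutoScape: proposes a geometry-consistent, long-horizon driving scene generation framework. | geometric consistency | |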
