cs.CV (2025-10-23)

📊 32 papers in total | 🔗 11 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (10 🔗5) · Pillar 2: RL & Architecture (10 🔗2) · Pillar 3: Perception & Semantics (6 🔗3) · Pillar 6: Video Extraction (2 🔗1) · Pillar 1: Robot Control (1) · Pillar 8: Physics-based Animation (1) · Pillar 4: Generative Motion (1) · Pillar 7: Motion Retargeting (1)

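🔬 Pillar 9: Embodied Foundation Models (10 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
1	BioCAP: Exploiting Synthetic Captions Beyond Labels in Biological Foundation Models	BioCAP: uses synthetic captions to strengthen biological foundation models beyond label-only supervision	large language model foundation model multimodal
2	EmbodiedBrain: Expanding Performance Boundaries of Task Planning for Embodied Intelligence	EmbodiedBrain: improves task-planning performance for embodied intelligence via Step-GRPO	embodied AI large language model foundation model	✅
3	Diagnosing Visual Reasoning: Challenges, Insights, and a Path Forward	Proposes an agent-based architecture that improves multimodal large language models on visual reasoning tasks	large language model multimodal chain-of-thought
4	Metis-HOME: Hybrid Optimized Mixture-of-Experts for Multimodal Reasoning	Proposes Metis-HOME, which uses a mixture-of-experts model to address the efficiency and generalization challenges of multimodal reasoning	multimodal	✅
5	HyperET: Efficient Training in Hyperbolic Space for Multi-modal Large Language Models	HyperET: trains multimodal large language models efficiently in hyperbolic space to improve cross-modal alignment	large language model
6	Calibrating Multimodal Consensus for Emotion Recognition	Proposes a calibrated multimodal consensus model to address semantic inconsistency in emotion recognition	multimodal	✅
7	Fake-in-Facext: Towards Fine-Grained Explainable DeepFake Analysis	Proposes the Fake-in-Facext framework for fine-grained, explainable DeepFake face analysis	large language model multimodal	✅
8	Small Drafts, Big Verdict: Information-Intensive Visual Reasoning via Speculation	Proposes the Speculative Verdict framework for visual reasoning over information-intensive images	multimodal	✅
9	SeViCES: Unifying Semantic-Visual Evidence Consensus for Long Video Understanding	Proposes the SeViCES framework, which improves long video understanding through semantic-visual consensus	large language model
10	Breakdance Video classification in the age of Generative AI	Analyzes the suitability of video foundation models (encoders and decoders) for breakdance video classification in the age of generative AI	foundation model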

🔬 Pillar 2: RL & Architecture (10 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
11	VESSA: Video-based objEct-centric Self-Supervised Adaptation for Visual Foundation Models	Proposes VESSA, a video-based, object-centric self-supervised adaptation method for visual foundation models	distillation foundation model	✅
12	Conan: Progressive Learning to Reason Like a Detective over Multi-Scale Visual Evidence	Conan: proposes a progressive learning framework based on multi-scale visual evidence that improves multimodal large language models on video reasoning tasks	reinforcement learning large language model multimodal
13	A Structured Review and Quantitative Profiling of Public Brain MRI Datasets for Foundation Model Development	For brain MRI foundation models, systematically assesses diversity and consistency issues in public datasets	representation learning foundation model
14	GranViT: A Fine-Grained Vision Model With Autoregressive Perception For MLLMs	GranViT: a fine-grained vision model for MLLMs that improves performance through autoregressive perception	distillation large language model multimodal
15	Addressing Corner Cases in Autonomous Driving: A World Model-based Approach with Mixture of Experts and LLMs	Proposes the WM-MoE framework, which uses a world model and a mixture of experts to address corner cases in autonomous driving	world model large language model
16	Towards Objective Obstetric Ultrasound Assessment: Contrastive Representation Learning for Fetal Movement Detection	Proposes the CURL framework, which uses contrastive learning to detect fetal movement in fetal ultrasound videos	representation learning contrastive learning
17	Generative Point Tracking with Flow Matching	Proposes GenPT, a flow-matching-based generative point tracker for multimodal trajectory prediction under visual occlusion	flow matching
18	TernaryCLIP: Efficiently Compressing Vision-Language Models with Ternary Weights and Distilled Knowledge	TernaryCLIP: efficiently compresses vision-language models with ternary weights and knowledge distillation	distillation multimodal
19	IB-GAN: Disentangled Representation Learning with Information Bottleneck Generative Adversarial Networks	Proposes IB-GAN, which uses the information bottleneck to improve disentangled representation learning in GANs	representation learning
20	TOMCAT: Test-time Comprehensive Knowledge Accumulation for Compositional Zero-Shot Learning	Proposes TOMCAT, which addresses distribution shift in compositional zero-shot learning by accumulating knowledge at test time	representation learning multimodal	✅

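🔬 Pillar 3: Perception & Semantics (6 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
21	COS3D: Collaborative Open-Vocabulary 3D Segmentation	Proposes COS3D, a collaborative prompt-segmentation framework that addresses the fusion of language and segmentation in open-vocabulary 3D segmentation	gaussian splatting splatting open-vocabulary	✅
22	Deep Learning-Powered Visual SLAM Aimed at Assisting Visually Impaired Navigation	Proposes SELM-SLAM3, which enhances visual SLAM with deep learning to assist navigation for visually impaired people	visual SLAM
23	RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling	RAPO++: optimizes cross-stage prompts for text-to-video generation via data alignment and test-time scaling	optical flow large language model	✅
24	PartNeXt: A Next-Generation Dataset for Fine-Grained and Hierarchical 3D Part Understanding	Proposes the PartNeXt dataset for fine-grained, hierarchical 3D part understanding, improving model performance	open-vocabulary open vocabulary
25	From Far and Near: Perceptual Evaluation of Crowd Representations Across Levels of Detail	Studies the perceptual quality of crowd representations at different levels of detail to optimize crowd rendering strategies	neural radiance field
26	PPMStereo: Pick-and-Play Memory Construction for Consistent Dynamic Stereo Matching	Proposes PPMStereo, which achieves temporal consistency in dynamic stereo matching through Pick-and-Play memory construction	depth estimation	✅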

🔬 Pillar 6: Video Extraction (2 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
27	DMC$^3$: Dual-Modal Counterfactual Contrastive Construction for Egocentric Video Question Answering	Proposes the DMC$^3$ framework to address challenges in egocentric video question answering	egocentric Ego4D
28	Radar-Camera Fused Multi-Object Tracking: Online Calibration and Common Feature	Proposes a radar-camera fused multi-object tracking framework with online calibration and common-feature utilization	feature matching	✅

🔬 Pillar 1: Robot Control (1 paper)

#	Title	One-sentence summary	Tags	🔗	⭐
29	BioDet: Boosting Industrial Object Detection with Image Preprocessing Strategies	BioDet: boosts industrial object detection performance with image preprocessing strategies	manipulation open-vocabulary open vocabulary

🔬 Pillar 8: Physics-based Animation (1 paper)

#	Title	One-sentence summary	Tags	🔗	⭐
30	Video Prediction of Dynamic Physical Simulations With Pixel-Space Spatiotemporal Transformers	Proposes a video prediction method for physical simulations based on pixel-space spatiotemporal Transformers	spatiotemporal large language model

🔬 Pillar 4: Generative Motion (1 paper)

#	Title	One-sentence summary	Tags	🔗	⭐
31	ARGenSeg: Image Segmentation with Autoregressive Image Generation Model	ARGenSeg: proposes an image segmentation method based on an autoregressive image generation model	VQ-VAE large language model multimodal

🔬 Pillar 7: Motion Retargeting (1 paper)

#	Title	One-sentence summary	Tags	🔗	⭐
32	AutoScape: Geometry-Consistent Long-Horizon Scene Generation	AutoScape: proposes a geometry-consistent long-horizon driving scene generation framework	geometric consistency
