cs.CV (2025-09-25)

📊 16 papers in total | 🔗 4 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (7 🔗1) Pillar 2: RL & Architecture (5 🔗1) Pillar 3: Perception & Semantics (2 🔗1) Pillar 8: Physics-based Animation (1 🔗1) Pillar 4: Generative Motion (1)

🔬 Pillar 9: Embodied Foundation Models (7 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
1	Reasoning-Enhanced Domain-Adaptive Pretraining of Multimodal Large Language Models for Short Video Content Governance	Proposes a reasoning-enhanced, domain-adaptive pretraining method for multimodal large language models, applied to short-video content governance	large language model multimodal chain-of-thought
2	Instruction-tuned Self-Questioning Framework for Multimodal Reasoning	Proposes SQ-InstructBLIP, an instruction-tuned self-questioning framework for strengthening multimodal reasoning	large language model multimodal
3	X-CoT: Explainable Text-to-Video Retrieval via LLM-based Chain-of-Thought Reasoning	Proposes X-CoT, which uses LLM chain-of-thought reasoning for explainable text-to-video retrieval	chain-of-thought	✅
4	VideoJudge: Bootstrapping Enables Scalable Supervision of MLLM-as-a-Judge for Video Understanding	VideoJudge: uses bootstrapping to enable scalable supervision of MLLM-as-a-judge for video understanding	large language model multimodal chain-of-thought
5	A Sentinel-3 foundation model for ocean colour	Proposes a Sentinel-3-based foundation model for ocean colour to improve performance on ocean observation tasks	foundation model
6	Decipher-MR: A Vision-Language Foundation Model for 3D MRI Representations	Decipher-MR: a vision-language foundation model for 3D MRI representations	foundation model
7	CompareBench: A Benchmark for Visual Comparison Reasoning in Vision-Language Models	Proposes CompareBench, a benchmark for evaluating visual comparison reasoning in vision-language models	multimodal

🔬 Pillar 2: RL & Architecture (5 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
8	MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources	MMR1: enhances multimodal reasoning through variance-aware sampling and open resources	reinforcement learning multimodal chain-of-thought	✅
9	SlideMamba: Entropy-Based Adaptive Fusion of GNN and Mamba for Enhanced Representation Learning in Digital Pathology	SlideMamba: an entropy-based adaptive fusion framework combining GNN and Mamba to improve representation learning in digital pathology	predictive model Mamba representation learning
10	FantasyWorld: Geometry-Consistent World Modeling via Unified Video and 3D Prediction	FantasyWorld: geometry-consistent world modeling via unified video and 3D prediction	world model foundation model
11	X-Streamer: Unified Human World Modeling with Audiovisual Interaction	X-Streamer: proposes a unified human world modeling framework based on audiovisual interaction, enabling real-time interaction with digital humans	world model multimodal
12	Enhancing Contrastive Learning for Geolocalization by Discovering Hard Negatives on Semivariograms	Proposes a semivariogram-based contrastive learning method to improve image geolocalization accuracy	contrastive learning

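🔬 Pillar 3: Perception & Semantics (2 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
13	Dense Semantic Matching with VGGT Prior	Proposes a dense semantic matching method based on a VGGT prior, improving geometry awareness and matching reliability	VGGT foundation model
14	Quantized Visual Geometry Grounded Transformer	Proposes QuantVGGT, which addresses the difficulty of quantizing VGGT and enables efficient 3D reconstruction in resource-constrained settings	VGGT	✅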

🔬 Pillar 8: Physics-based Animation (1 paper)

#	Title	One-sentence summary	Tags	🔗	⭐
15	MORPH: PDE Foundation Models with Arbitrary Data Modality	Proposes the MORPH model for handling multimodal partial differential equation data	spatiotemporal foundation model multimodal	✅

🔬 Pillar 4: Generative Motion (1 paper)

#	Title	One-sentence summary	Tags	🔗	⭐
16	Does FLUX Already Know How to Perform Physically Plausible Image Composition?	Proposes the SHINE framework, which achieves physically plausible image composition without training	physically plausible
