cs.CV (2025-10-20)

📊 37 papers in total | 🔗 7 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (11 🔗3) · Pillar 2: RL & Architecture (9) · Pillar 3: Perception & Semantics (8 🔗1) · Pillar 1: Robot Control (3 🔗1) · Pillar 6: Video Extraction (2) · Pillar 8: Physics-based Animation (2 🔗2) · Pillar 4: Generative Motion (1) · Pillar 7: Motion Retargeting (1)

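🔬 Pillar 9: Embodied Foundation Models (11 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
1	MT-Video-Bench: A Holistic Video Understanding Benchmark for Evaluating Multimodal LLMs in Multi-Turn Dialogues	Proposes MT-Video-Bench, a benchmark for evaluating the video understanding of multimodal LLMs in multi-turn dialogues	large language model multimodal
2	$\mathcal{V}isi\mathcal{P}runer$: Decoding Discontinuous Cross-Modal Dynamics for Efficient Multimodal LLMs	Proposes VisiPruner to reduce the computational overhead of multimodal large language models	large language model multimodal	✅
3	Towards a Generalizable Fusion Architecture for Multimodal Object Detection	Proposes the FMCAF architecture to improve the generalization and robustness of multimodal object detection	multimodal
4	Glyph: Scaling Context Windows via Visual-Text Compression	Glyph: extends LLM context windows through visual-text compression, improving long-text processing efficiency	large language model multimodal	✅
5	Xihe: Scalable Zero-Shot Time Series Learner Via Hierarchical Interleaved Block Attention	Proposes Xihe, built on Hierarchical Interleaved Block Attention (HIBA), for scalable zero-shot time series learning	foundation model zero-shot transfer
6	iDETEX: Empowering MLLMs for Intelligent DETailed EXplainable IQA	Proposes iDETEX, which enables multimodal large language models to perform intelligent, detailed, explainable image quality assessment	large language model multimodal
7	SparseVILA: Decoupling Visual Sparsity for Efficient VLM Inference	SparseVILA: decouples visual sparsity to accelerate VLM inference	multimodal
8	Elastic ViTs from Pretrained Models without Retraining	Proposes SnapViT, which derives elastic inference capability from pretrained ViT models without retraining	foundation model
9	ImaGGen: Zero-Shot Generation of Co-Speech Semantic Gestures Grounded in Language and Image Input	ImaGGen: a zero-shot method for generating co-speech semantic gestures grounded in language and image input	multimodal	✅
10	Context-Aware Pseudo-Label Scoring for Zero-Shot Video Summarization	Proposes a zero-shot video summarization framework based on context-aware pseudo-label scoring	large language model
11	Monitoring Horses in Stalls: From Object to Event Detection	Proposes a system built on YOLOv11 and BoT-SORT that monitors the behavior of horses in stalls for early detection of health problems	foundation model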

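🔬 Pillar 2: RL & Architecture (9 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
12	UltraCUA: A Foundation Model for Computer Use Agents with Hybrid Action	UltraCUA: a foundation model for computer-use agents that combines GUI actions with high-level tools	reinforcement learning foundation model
13	Intelligent Communication Mixture-of-Experts Boosted-Medical Image Segmentation Foundation Model	Proposes the IC-MoE model, which improves medical image segmentation with an intelligent-communication mixture-of-experts network	contrastive learning foundation model
14	Closed-Loop Transfer for Weakly-supervised Affordance Grounding	Proposes LoopTrans, a closed-loop framework that addresses knowledge transfer in weakly supervised affordance grounding	distillation affordance egocentric
15	CausalMamba: Scalable Conditional State Space Models for Neural Causal Inference	CausalMamba: scalable conditional state space models for neural causal inference	Mamba state space model
16	Token-Level Inference-Time Alignment for Vision-Language Models	Proposes TITA, a lightweight framework for token-level inference-time alignment of vision-language models	DPO direct preference optimization multimodal
17	World-in-World: World Models in a Closed-Loop World	World-in-World: the first closed-loop benchmark platform for world models, evaluating the predictive perception capabilities of embodied agents	world model
18	Online In-Context Distillation for Low-Resource Vision Language Models	Proposes an online in-context distillation method to improve the performance of low-resource vision-language models	distillation
19	SparseWorld: A Flexible, Adaptive, and Efficient 4D Occupancy World Model Powered by Sparse and Dynamic Queries	SparseWorld: a flexible and efficient 4D occupancy world model based on sparse dynamic queries	world model
20	GACO-CAD: Geometry-Augmented and Conciseness-Optimized CAD Model Generation from Single Image	GACO-CAD: generates CAD models from a single image through geometry augmentation and conciseness optimization	reinforcement learning large language model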

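🔬 Pillar 3: Perception & Semantics (8 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
21	Enhanced Motion Forecasting with Plug-and-Play Multimodal Large Language Models	Proposes Plug-and-Forecast, which uses multimodal large language models to augment motion forecasting models and improve generalization	scene understanding motion prediction large language model
22	From Volume Rendering to 3D Gaussian Splatting: Theory and Applications	A survey of 3D Gaussian Splatting, from volume rendering to applications, addressing real-time rendering and high-quality reconstruction	3D gaussian splatting 3DGS gaussian splatting
23	RaindropGS: A Benchmark for 3D Gaussian Splatting under Raindrop Conditions	RaindropGS: a comprehensive benchmark for 3D Gaussian Splatting reconstruction under raindrop conditions	3D gaussian splatting 3DGS gaussian splatting
24	Initialize to Generalize: A Stronger Initialization Pipeline for Sparse-View 3DGS	Proposes ItG-GS, a stronger initialization pipeline that significantly improves sparse-view 3DGS reconstruction quality	3D gaussian splatting 3DGS gaussian splatting	✅
25	PAGE-4D: Disentangled Pose and Geometry Estimation for VGGT-4D Perception	PAGE-4D: VGGT-4D perception of dynamic scenes with disentangled pose and geometry estimation	depth estimation VGGT
26	Towards 3D Objectness Learning in an Open World	Proposes OP3Det for generic object detection in open-world 3D scenes	open-vocabulary open vocabulary foundation model
27	HouseTour: A Virtual Real Estate A(I)gent	HouseTour: uses diffusion models to generate spatially aware 3D camera trajectories and natural-language summaries for real estate scenes	3D gaussian splatting gaussian splatting splatting
28	DeepDetect: Learning All-in-One Dense Keypoints	DeepDetect: an end-to-end dense keypoint detection method that combines the strengths of classical detectors	visual odometry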

🔬 Pillar 1: Robot Control (3 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
29	GSPlane: Concise and Accurate Planar Reconstruction via Structured Representation	GSPlane: concise and accurate planar reconstruction via structured representation	manipulation gaussian splatting splatting
30	SafeCoop: Unravelling Full Stack Safety in Agentic Collaborative Driving	SafeCoop: a full-stack safety defense framework for natural-language-based collaborative driving	manipulation	✅
31	ConsistEdit: Highly Consistent and Precise Training-free Visual Editing	ConsistEdit: a training-free visual editing method with high consistency and precision	manipulation

🔬 Pillar 6: Video Extraction (2 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
32	ManzaiSet: A Multimodal Dataset of Viewer Responses to Japanese Manzai Comedy	Introduces ManzaiSet, a large-scale multimodal dataset for studying viewer responses to Japanese manzai comedy	HuMoR multimodal
33	Leveraging AV1 motion vectors for Fast and Dense Feature Matching	Uses AV1 motion vectors for fast, dense feature matching, improving SfM efficiency	feature matching

🔬 Pillar 8: Physics-based Animation (2 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
34	ViBED-Net: Video Based Engagement Detection Network Using Face-Aware and Scene-Aware Spatiotemporal Cues	ViBED-Net: detects engagement in video using face-aware and scene-aware spatiotemporal cues	spatiotemporal	✅
35	MUG-V 10B: High-efficiency Training Pipeline for Large Video Generation Models	MUG-V 10B: a high-efficiency training framework for large-scale video generation models	spatiotemporal	✅

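🔬 Pillar 4: Generative Motion (1 paper)

#	Title	One-Sentence Summary	Tags	🔗	⭐
36	Capturing Head Avatar with Hand Contacts from a Monocular Video	Proposes a method for reconstructing head avatars from monocular video that models the deformation caused by hand interaction	penetration spatial relationship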

🔬 Pillar 7: Motion Retargeting (1 paper)

#	Title	One-Sentence Summary	Tags	🔗	⭐
37	ShapeCraft: LLM Agents for Structured, Textured and Interactive 3D Modeling	ShapeCraft: uses LLM agents for structured, textured, and interactive 3D modeling	spatial relationship
