cs.CV (2025-12-24)

📊 29 papers in total | 🔗 5 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (11 🔗3)
Pillar 2: RL Algorithms and Architecture (9 🔗1)
Pillar 3: Spatial Perception and Semantics (5 🔗1)
Pillar 1: Robot Control (1)
Pillar 5: Interaction and Reaction (1)
Pillar 4: Generative Motion (1)
Pillar 6: Video Extraction and Matching (1)

🔬 Pillar 9: Embodied Foundation Models (11 papers)

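#	Title	One-sentence summary	Tags	🔗	⭐
1	PanoGrounder: Bridging 2D and 3D with Panoramic Scene Representations for VLM-based 3D Visual Grounding	PanoGrounder: uses panoramic scene representations to bridge 2D and 3D for VLM-based 3D visual grounding	visual grounding
2	Streaming Video Instruction Tuning	Proposes Streamo, a general-purpose interactive assistant for real-time streaming video understanding.	multimodal instruction following
3	TGC-Net: A Structure-Aware and Semantically-Aligned Framework for Text-Guided Medical Image Segmentation	TGC-Net: a structure-aware, semantically aligned framework for text-guided medical image segmentation	large language model multimodal
4	ALIVE: An Avatar-Lecture Interactive Video Engine with Content-Aware Retrieval for Real-Time Interaction	ALIVE: an interactive avatar-lecture video engine with content-aware retrieval for real-time interaction	large language model multimodal
5	Understanding Virality: A Rubric based Vision-Language Model Framework for Short-Form Edutainment Evaluation	Proposes a rubric-based vision-language model framework for evaluating short-form edutainment videos, improving user-engagement prediction.	multimodal
6	Fast SAM2 with Text-Driven Token Pruning	Proposes Fast SAM2 with text-driven token pruning, accelerating video object segmentation and reducing resource consumption.	foundation model
7	Leveraging Lightweight Entity Extraction for Scalable Event-Based Image Retrieval	Proposes a two-stage image retrieval method based on event entity extraction, improving retrieval accuracy in complex scenes.	multimodal	✅
8	Latent Implicit Visual Reasoning	Proposes a latent implicit visual reasoning method that improves the visual reasoning of LMMs without explicit supervision	multimodal
9	T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation	Proposes T2AV-Compass for unified evaluation of text-to-audio-video generation models.	instruction following
10	Reasoning-Driven Amodal Completion: Collaborative Agents and Perceptual Evaluation	Proposes an amodal completion framework based on collaborative multi-agent reasoning, addressing semantic consistency and structural integrity.	chain-of-thought	✅
11	Beyond Weight Adaptation: Feature-Space Domain Injection for Cross-Modal Ship Re-Identification	Proposes Domain Representation Injection (DRI) to address the modality gap in cross-modal ship re-identification.	foundation model	✅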

🔬 Pillar 2: RL Algorithms and Architecture (9 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
12	SegMo: Segment-aligned Text to 3D Human Motion Generation	Proposes the SegMo framework, which aligns text and motion segments for finer-grained text-driven 3D human motion generation.	contrastive learning motion generation human motion
13	Multimodal Skeleton-Based Action Representation Learning via Decomposition and Composition	Proposes a decomposition-and-composition framework for multimodal skeleton-based action representation learning, improving efficiency and performance.	representation learning multimodal
14	Surgical Scene Segmentation using a Spike-Driven Video Transformer with Real-Time Potential	Proposes SpikeSurgSeg, a spike-driven video Transformer for surgical scene segmentation with real-time potential.	representation learning scene understanding spatiotemporal
15	TICON: A Slide-Level Tile Contextualizer for Histopathology Representation Learning	TICON: a slide-level tile contextualization method for histopathology representation learning	representation learning foundation model
16	Learning from Next-Frame Prediction: Autoregressive Video Modeling Encodes Effective Representations	Proposes NExT-Vid, an autoregressive video modeling framework based on next-frame prediction that improves visual representation learning.	flow matching representation learning visual pre-training
17	A Graph-Augmented knowledge Distillation based Dual-Stream Vision Transformer with Region-Aware Attention for Gastrointestinal Disease Classification with Explainable AI	Proposes a dual-stream Vision Transformer based on graph-augmented knowledge distillation for explainable gastrointestinal disease classification	teacher-student distillation
18	Self-supervised Multiplex Consensus Mamba for General Image Fusion	Proposes the SMC-Mamba framework for general image fusion, improving performance across multiple fusion tasks.	Mamba contrastive learning
19	PUFM++: Point Cloud Upsampling via Enhanced Flow Matching	PUFM++: point cloud upsampling via enhanced flow matching, improving geometric fidelity and robustness	flow matching	✅
20	XGrid-Mapping: Explicit Implicit Hybrid Grid Submaps for Efficient Incremental Neural LiDAR Mapping	Proposes XGrid-Mapping, which uses explicit-implicit hybrid grid submaps for efficient incremental neural LiDAR mapping	distillation implicit representation

🔬 Pillar 3: Spatial Perception and Semantics (5 papers)

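#	Title	One-sentence summary	Tags	🔗	⭐
21	Quantile Rendering: Efficiently Embedding High-dimensional Feature on 3D Gaussian Splatting	Proposes Quantile Rendering, which efficiently embeds high-dimensional features in 3D Gaussian Splatting and improves open-vocabulary segmentation.	3D gaussian splatting gaussian splatting splatting
22	Towards Arbitrary Motion Completing via Hierarchical Continuous Representation	Proposes the NAME framework, based on a hierarchical continuous representation, for motion completion at arbitrary frame rates	implicit representation human motion
23	ORCA: Object Recognition and Comprehension for Archiving Marine Species	ORCA: a multimodal benchmark for object recognition and comprehension for archiving marine species	open-vocabulary open vocabulary visual grounding
24	Optical Flow-Guided 6DoF Object Pose Tracking with an Event Camera	Proposes an optical-flow-guided 6DoF object pose tracking method for event cameras, improving accuracy and robustness.	optical flow
25	UniPR-3D: Towards Universal Visual Place Recognition with Visual Geometry Grounded Transformer	Proposes UniPR-3D, which uses a Visual Geometry Grounded Transformer for universal visual place recognition.	VGGT	✅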

🔬 Pillar 1: Robot Control (1 paper)

#	Title	One-sentence summary	Tags	🔗	⭐
26	Human Motion Estimation with Everyday Wearables	EveryWear: a human motion estimation method based on everyday wearables	sim-to-real teacher-student egocentric

🔬 Pillar 5: Interaction and Reaction (1 paper)

#	Title	One-sentence summary	Tags	🔗	⭐
27	DGSAN: Dual-Graph Spatiotemporal Attention Network for Pulmonary Nodule Malignancy Prediction	Proposes a dual-graph spatiotemporal attention network for pulmonary nodule malignancy prediction	mutual attention spatiotemporal multimodal

🔬 Pillar 4: Generative Motion (1 paper)

#	Title	One-sentence summary	Tags	🔗	⭐
28	ACD: Direct Conditional Control for Video Diffusion Models via Attention Supervision	ACD: direct conditional control of video diffusion models via attention supervision	classifier-free guidance

🔬 Pillar 6: Video Extraction and Matching (1 paper)

#	Title	One-sentence summary	Tags	🔗	⭐
29	Lightweight framework for underground pipeline recognition and spatial localization based on multi-view 2D GPR images	Proposes the DCO-YOLO framework for recognizing and localizing underground pipelines in multi-view GPR images	feature matching
