cs.CV (2025-12-24)

📊 29 papers in total | 🔗 5 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (11 🔗3)
Pillar 2: RL & Architecture (9 🔗1)
Pillar 3: Perception & Semantics (5 🔗1)
Pillar 1: Robot Control (1)
Pillar 5: Interaction & Reaction (1)
Pillar 4: Generative Motion (1)
Pillar 6: Video Extraction (1)

🔬 Pillar 9: Embodied Foundation Models (11 papers)

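| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 1 | PanoGrounder: Bridging 2D and 3D with Panoramic Scene Representations for VLM-based 3D Visual Grounding | PanoGrounder: bridges 2D and 3D with panoramic scene representations for VLM-based 3D visual grounding. | visual grounding | |
| 2 | Streaming Video Instruction Tuning | Proposes Streamo, a general-purpose interactive assistant for real-time streaming video understanding. | multimodal, instruction following | |
| 3 | TGC-Net: A Structure-Aware and Semantically-Aligned Framework for Text-Guided Medical Image Segmentation | TGC-Net: a structure-aware, semantically aligned framework for text-guided medical image segmentation. | large language model, multimodal | |
| 4 | ALIVE: An Avatar-Lecture Interactive Video Engine with Content-Aware Retrieval for Real-Time Interaction | ALIVE: an interactive avatar-lecture video engine with content-aware retrieval for real-time interaction. | large language model, multimodal | |
| 5 | Understanding Virality: A Rubric based Vision-Language Model Framework for Short-Form Edutainment Evaluation | Proposes a rubric-based vision-language model framework for evaluating short-form edutainment content, improving user engagement prediction. | multimodal | |
| 6 | Fast SAM2 with Text-Driven Token Pruning | Proposes Fast SAM2 with text-driven token pruning, accelerating video object segmentation and reducing resource consumption. | foundation model | |
| 7 | Leveraging Lightweight Entity Extraction for Scalable Event-Based Image Retrieval | Proposes a two-stage image retrieval method based on event entity extraction, improving retrieval accuracy in complex scenes. | multimodal | |
| 8 | Latent Implicit Visual Reasoning | Proposes an implicit visual reasoning method that improves the visual reasoning of LMMs without explicit supervision. | multimodal | |
| 9 | T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation | Proposes T2AV-Compass for unified evaluation of text-to-audio-video generation models. | instruction following | |
| 10 | Reasoning-Driven Amodal Completion: Collaborative Agents and Perceptual Evaluation | Proposes an amodal completion framework based on collaborative multi-agent reasoning, addressing semantic consistency and structural integrity. | chain-of-thought | |
| 11 | Beyond Weight Adaptation: Feature-Space Domain Injection for Cross-Modal Ship Re-Identification | Proposes Domain Representation Injection (DRI) to address the modality gap in cross-modal ship re-identification. | foundation model | |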

🔬 Pillar 2: RL & Architecture (9 papers)

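| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 12 | SegMo: Segment-aligned Text to 3D Human Motion Generation | Proposes the SegMo framework, which aligns text and motion segments for finer-grained text-driven 3D human motion generation. | contrastive learning, motion generation, human motion | |
| 13 | Multimodal Skeleton-Based Action Representation Learning via Decomposition and Composition | Proposes a decomposition-and-composition framework for multimodal skeleton-based action representation learning, improving efficiency and performance. | representation learning, multimodal | |
| 14 | Surgical Scene Segmentation using a Spike-Driven Video Transformer with Real-Time Potential | Proposes SpikeSurgSeg, a spike-driven video Transformer for surgical scene segmentation with real-time potential. | representation learning, scene understanding, spatiotemporal | |
| 15 | TICON: A Slide-Level Tile Contextualizer for Histopathology Representation Learning | TICON: a slide-level tile contextualization method for histopathology representation learning. | representation learning, foundation model | |
| 16 | Learning from Next-Frame Prediction: Autoregressive Video Modeling Encodes Effective Representations | Proposes NExT-Vid, an autoregressive video modeling framework based on next-frame prediction that improves visual representation learning. | flow matching, representation learning, visual pre-training | |
| 17 | A Graph-Augmented knowledge Distillation based Dual-Stream Vision Transformer with Region-Aware Attention for Gastrointestinal Disease Classification with Explainable AI | Proposes a dual-stream Vision Transformer based on graph-augmented knowledge distillation for explainable gastrointestinal disease classification. | teacher-student, distillation | |
| 18 | Self-supervised Multiplex Consensus Mamba for General Image Fusion | Proposes the SMC-Mamba framework for general image fusion, improving performance across multiple fusion tasks. | Mamba, contrastive learning | |
| 19 | PUFM++: Point Cloud Upsampling via Enhanced Flow Matching | PUFM++: point cloud upsampling via enhanced flow matching, improving geometric fidelity and robustness. | flow matching | |
| 20 | XGrid-Mapping: Explicit Implicit Hybrid Grid Submaps for Efficient Incremental Neural LiDAR Mapping | Proposes XGrid-Mapping, which uses explicit-implicit hybrid grid submaps for efficient incremental neural LiDAR mapping. | distillation, implicit representation | |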

🔬 Pillar 3: Perception & Semantics (5 papers)

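| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 21 | Quantile Rendering: Efficiently Embedding High-dimensional Feature on 3D Gaussian Splatting | Proposes Quantile Rendering, which efficiently embeds high-dimensional features in 3D Gaussian Splatting and improves open-vocabulary segmentation performance. | 3D gaussian splatting, gaussian splatting, splatting | |
| 22 | Towards Arbitrary Motion Completing via Hierarchical Continuous Representation | Proposes the NAME framework, based on a hierarchical continuous representation, for motion completion at arbitrary frame rates. | implicit representation, human motion | |
| 23 | ORCA: Object Recognition and Comprehension for Archiving Marine Species | ORCA: proposes a multimodal benchmark for object recognition and comprehension for archiving marine species. | open-vocabulary, open vocabulary, visual grounding | |
| 24 | Optical Flow-Guided 6DoF Object Pose Tracking with an Event Camera | Proposes an optical flow-guided 6DoF object pose tracking method for event cameras, improving accuracy and robustness. | optical flow | |
| 25 | UniPR-3D: Towards Universal Visual Place Recognition with Visual Geometry Grounded Transformer | Proposes UniPR-3D, which uses a Visual Geometry Grounded Transformer for universal visual place recognition. | VGGT | |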

🔬 Pillar 1: Robot Control (1 paper)

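| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 26 | Human Motion Estimation with Everyday Wearables | EveryWear: a human motion estimation method based on everyday wearable devices. | sim-to-real, teacher-student, egocentric | |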

🔬 Pillar 5: Interaction & Reaction (1 paper)

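| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 27 | DGSAN: Dual-Graph Spatiotemporal Attention Network for Pulmonary Nodule Malignancy Prediction | Proposes a dual-graph spatiotemporal attention network for pulmonary nodule malignancy prediction. | mutual attention, spatiotemporal, multimodal | |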

🔬 Pillar 4: Generative Motion (1 paper)

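| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 28 | ACD: Direct Conditional Control for Video Diffusion Models via Attention Supervision | ACD: direct conditional control for video diffusion models via attention supervision. | classifier-free guidance | |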

🔬 Pillar 6: Video Extraction (1 paper)

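| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 29 | Lightweight framework for underground pipeline recognition and spatial localization based on multi-view 2D GPR images | Proposes the DCO-YOLO framework to address recognition and localization of underground pipelines in multi-view GPR images. | feature matching | |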
