cs.CV (2025-03-06)

📊 36 papers in total | 🔗 9 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (12 🔗6) · Pillar 3: Perception & Semantics (8 🔗1) · Pillar 2: RL & Architecture (8 🔗1) · Pillar 1: Robot Control (4 🔗1) · Pillar 8: Physics-based Animation (2) · Pillar 4: Generative Motion (1) · Pillar 7: Motion Retargeting (1)

🔬 Pillar 9: Embodied Foundation Models (12 papers)

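| # | Title | One-line summary | Tags |
|---|---|---|---|
| 1 | MASTER: Multimodal Segmentation with Text Prompts | Proposes MASTER, a multimodal segmentation framework that uses text prompts to improve RGB-Thermal fusion performance in complex scenes | large language model, multimodal |
| 2 | PP-DocBee: Improving Multimodal Document Understanding Through a Bag of Tricks | Proposes PP-DocBee to address document image understanding | large language model, multimodal |
| 3 | Leveraging Large Language Models For Scalable Vector Graphics Processing: A Review | Review: processing scalable vector graphics with large language models | large language model |
| 4 | Adaptive Prototype Learning for Multimodal Cancer Survival Analysis | Proposes Adaptive Prototype Learning (APL) for multimodal cancer survival analysis, improving prediction accuracy | multimodal |
| 5 | DuCos: Duality Constrained Depth Super-Resolution via Foundation Model | DuCos: a depth super-resolution method based on foundation models and Lagrangian duality | foundation model |
| 6 | The Role of Visual Modality in Multimodal Mathematical Reasoning: Challenges and Insights | Reveals the role of the visual modality in multimodal mathematical reasoning and proposes the HC-M3D dataset to strengthen visual dependence | multimodal |
| 7 | FirePlace: Geometric Refinements of LLM Common Sense Reasoning for 3D Object Placement | FirePlace: a 3D object placement framework combining geometric constraints with common-sense reasoning | large language model, multimodal |
| 8 | DSV-LFS: Unifying LLM-Driven Semantic Cues with Visual Features for Robust Few-Shot Segmentation | DSV-LFS: fuses LLM semantic cues with visual features to improve the robustness of few-shot segmentation | large language model, multimodal |
| 9 | RetinalGPT: A Retinal Clinical Preference Conversational Assistant Powered by Large Vision-Language Models | RetinalGPT: a retinal clinical preference conversational assistant based on large vision-language models | large language model, multimodal |
| 10 | Gate-Shift-Pose: Enhancing Action Recognition in Sports with Skeleton Information | Gate-Shift-Pose: a sports action recognition method that incorporates skeleton information, improving fall detection accuracy in figure skating | multimodal |
| 11 | TPC: Cross-Temporal Prediction Connection for Vision-Language Model Hallucination Reduction | Proposes Cross-Temporal Prediction Connection (TPC) to reduce vision-language model hallucination | large language model |
| 12 | ToFu: Visual Tokens Reduction via Fusion for Multi-modal, Multi-patch, Multi-image Task | ToFu: a visual token fusion method for improving the efficiency of multimodal, multi-image tasks | multimodal |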

🔬 Pillar 3: Perception & Semantics (8 papers)

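| # | Title | One-line summary | Tags |
|---|---|---|---|
| 13 | An Egocentric Vision-Language Model based Portable Real-time Smart Assistant | Vinci: a portable real-time smart assistant based on an egocentric vision-language model | scene understanding, egocentric, egocentric vision |
| 14 | GaussianGraph: 3D Gaussian-based Scene Graph Generation for Open-world Scene Understanding | GaussianGraph: 3D Gaussian-based scene graph generation for open-world scene understanding | 3D gaussian splatting, 3DGS, gaussian splatting |
| 15 | S2Gaussian: Sparse-View Super-Resolution 3D Gaussian Splatting | Proposes S2Gaussian for high-quality 3D scene reconstruction from sparse, low-resolution views | 3D gaussian splatting, gaussian splatting, splatting |
| 16 | A Novel Solution for Drone Photogrammetry with Low-overlap Aerial Images using Monocular Depth Estimation | Proposes a drone image reconstruction method based on monocular depth estimation to handle low-overlap imagery | depth estimation, monocular depth, metric depth |
| 17 | GaussianVideo: Efficient Video Representation and Compression by Gaussian Splatting | GaussianVideo: an efficient video representation and compression method based on Gaussian splatting | gaussian splatting, splatting, spatiotemporal |
| 18 | Floxels: Fast Unsupervised Voxel Based Scene Flow Estimation | Floxels: a fast, voxel-based unsupervised scene flow estimation method | scene flow |
| 19 | Robust Computer-Vision based Construction Site Detection for Assistive-Technology Applications | Proposes a robust computer-vision-based construction site detection system to help visually impaired people navigate safely | open-vocabulary, open vocabulary |
| 20 | Extracting Symbolic Sequences from Visual Representations via Self-Supervised Learning | Proposes a self-supervised method for extracting symbolic sequences from visual representations | scene understanding |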

🔬 Pillar 2: RL & Architecture (8 papers)

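| # | Title | One-line summary | Tags |
|---|---|---|---|
| 21 | STORM: Token-Efficient Long Video Understanding for Multimodal LLMs | Proposes STORM to address insufficient temporal modeling in long video understanding | Mamba, state space model, spatiotemporal |
| 22 | Simulating the Real World: A Unified Survey of Multimodal Generative Models | A unified survey of multimodal generative models, advancing research on real-world simulation | world model, multimodal |
| 23 | ObjMST: An Object-Focused Multimodal Style Transfer Framework | ObjMST: an object-focused multimodal style transfer framework | representation learning, multimodal |
| 24 | CA-W3D: Leveraging Context-Aware Knowledge for Weakly Supervised Monocular 3D Detection | Proposes CA-W3D, which uses context-aware knowledge for weakly supervised monocular 3D object detection | distillation, open-vocabulary, open vocabulary |
| 25 | Spectral Informed Mamba for Robust Point Cloud Processing | Proposes a spectral-informed Mamba method for robust point cloud processing, improving point cloud classification, segmentation, and few-shot learning performance | Mamba, state space model, masked autoencoder |
| 26 | Learning 3D Medical Image Models From Brain Functional Connectivity Network Supervision For Mental Disorder Diagnosis | Proposes the CINP framework, which uses contrastive learning to fuse sMRI and fMRI information, improving mental disorder diagnosis accuracy | representation learning, contrastive learning, multimodal |
| 27 | WeakSupCon: Weakly Supervised Contrastive Learning for Encoder Pre-training | Proposes WeakSupCon for encoder pre-training in weakly supervised multiple instance learning | representation learning, contrastive learning |
| 28 | Teach YOLO to Remember: A Self-Distillation Approach for Continual Object Detection | Proposes YOLO LwF, a self-distillation-based continual object detection method for YOLO that significantly mitigates catastrophic forgetting | distillation |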

🔬 Pillar 1: Robot Control (4 papers)

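| # | Title | One-line summary | Tags |
|---|---|---|---|
| 29 | Instrument-Splatting: Controllable Photorealistic Reconstruction of Surgical Instruments Using Gaussian Splatting | Proposes Instrument-Splatting for controllable, photorealistic 3D Gaussian reconstruction of surgical instruments | real2sim, 3D gaussian splatting, gaussian splatting |
| 30 | High-Precision Transformer-Based Visual Servoing for Humanoid Robots in Aligning Tiny Objects | Proposes a Transformer-based visual servoing method for high-precision alignment of tiny objects by humanoid robots | humanoid, humanoid robot |
| 31 | Adapt3R: Adaptive 3D Scene Representation for Domain Transfer in Imitation Learning | Adapt3R: an adaptive 3D scene representation for domain transfer in imitation learning | manipulation, imitation learning, zero-shot transfer |
| 32 | Omnidirectional Multi-Object Tracking | Proposes the OmniTrack framework to address distortion and motion challenges in multi-object tracking on panoramic images | quadruped |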

🔬 Pillar 8: Physics-based Animation (2 papers)

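| # | Title | One-line summary | Tags |
|---|---|---|---|
| 33 | EVE: Towards End-to-End Video Subtitle Extraction with Vision-Language Models | Proposes the EVE framework, which uses vision-language models for end-to-end video subtitle extraction, and builds the large-scale dataset ViSa | spatiotemporal, TAMP |
| 34 | FluidNexus: 3D Fluid Reconstruction and Prediction from a Single Video | FluidNexus: a framework for 3D fluid reconstruction and prediction from a single video | differentiable simulation |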

🔬 Pillar 4: Generative Motion (1 paper)

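| # | Title | One-line summary | Tags |
|---|---|---|---|
| 35 | How to Move Your Dragon: Text-to-Motion Synthesis for Large-Vocabulary Objects | Proposes a text-driven general skeletal animation generation method that addresses motion synthesis across heterogeneous skeleton templates | motion diffusion model, motion diffusion, text-to-motion |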

🔬 Pillar 7: Motion Retargeting (1 paper)

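| # | Title | One-line summary | Tags |
|---|---|---|---|
| 36 | Spatial-Temporal Perception with Causal Inference for Naturalistic Driving Action Recognition | Proposes STP, a spatial-temporal perception network based on causal inference, for naturalistic driving action recognition | spatial relationship, multimodal |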
