cs.CV (2025-03-06)

📊 36 papers in total | 🔗 9 with code

🎯 Interest Areas

Pillar 9: Embodied Foundation Models (12 🔗6) | Pillar 3: Perception & Semantics (8 🔗1) | Pillar 2: RL & Architecture (8 🔗1) | Pillar 1: Robot Control (4 🔗1) | Pillar 8: Physics-based Animation (2) | Pillar 4: Generative Motion (1) | Pillar 7: Motion Retargeting (1)

🔬 Pillar 9: Embodied Foundation Models (12 papers)

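#	Title	One-line summary	Tags	🔗	⭐
1	MASTER: Multimodal Segmentation with Text Prompts	Proposes MASTER, a multimodal segmentation framework driven by text prompts that improves RGB-Thermal fusion performance in complex scenes	large language model multimodal
2	PP-DocBee: Improving Multimodal Document Understanding Through a Bag of Tricks	Proposes PP-DocBee to tackle document image understanding	large language model multimodal	✅
3	Leveraging Large Language Models For Scalable Vector Graphics Processing: A Review	Survey: using large language models to process scalable vector graphics	large language model
4	Adaptive Prototype Learning for Multimodal Cancer Survival Analysis	Proposes Adaptive Prototype Learning (APL) for multimodal cancer survival analysis, improving prediction accuracy	multimodal	✅
5	DuCos: Duality Constrained Depth Super-Resolution via Foundation Model	DuCos: a depth super-resolution method based on foundation models and Lagrangian duality	foundation model
6	The Role of Visual Modality in Multimodal Mathematical Reasoning: Challenges and Insights	Reveals the role of the visual modality in multimodal mathematical reasoning and proposes the HC-M3D dataset to strengthen visual dependence	multimodal	✅
7	FirePlace: Geometric Refinements of LLM Common Sense Reasoning for 3D Object Placement	FirePlace: a 3D object placement framework combining geometric constraints with common-sense reasoning	large language model multimodal
8	DSV-LFS: Unifying LLM-Driven Semantic Cues with Visual Features for Robust Few-Shot Segmentation	DSV-LFS: fuses LLM semantic cues with visual features to improve the robustness of few-shot segmentation	large language model multimodal	✅
9	RetinalGPT: A Retinal Clinical Preference Conversational Assistant Powered by Large Vision-Language Models	RetinalGPT: a retinal clinical preference conversational assistant built on large vision-language models	large language model multimodal	✅
10	Gate-Shift-Pose: Enhancing Action Recognition in Sports with Skeleton Information	Gate-Shift-Pose: a sports action recognition method that incorporates skeleton information, improving fall detection accuracy in figure skating	multimodal	✅
11	TPC: Cross-Temporal Prediction Connection for Vision-Language Model Hallucination Reduction	Proposes Cross-Temporal Prediction Connection (TPC) to reduce hallucination in vision-language models	large language model
12	ToFu: Visual Tokens Reduction via Fusion for Multi-modal, Multi-patch, Multi-image Task	ToFu: a visual token fusion method that improves efficiency on multimodal, multi-image tasks	multimodal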

🔬 Pillar 3: Perception & Semantics (8 papers)

#	Title	One-line summary	Tags	🔗	⭐
13	An Egocentric Vision-Language Model based Portable Real-time Smart Assistant	Vinci: a portable real-time smart assistant based on an egocentric vision-language model	scene understanding egocentric egocentric vision	✅
14	GaussianGraph: 3D Gaussian-based Scene Graph Generation for Open-world Scene Understanding	GaussianGraph: 3D Gaussian-based scene graph generation for open-world scene understanding	3D gaussian splatting 3DGS gaussian splatting
15	S2Gaussian: Sparse-View Super-Resolution 3D Gaussian Splatting	Proposes S2Gaussian for high-quality 3D scene reconstruction from sparse, low-resolution views	3D gaussian splatting gaussian splatting splatting
16	A Novel Solution for Drone Photogrammetry with Low-overlap Aerial Images using Monocular Depth Estimation	Proposes a drone image reconstruction method based on monocular depth estimation to handle low-overlap imagery	depth estimation monocular depth metric depth
17	GaussianVideo: Efficient Video Representation and Compression by Gaussian Splatting	GaussianVideo: an efficient video representation and compression method based on Gaussian splatting	gaussian splatting splatting spatiotemporal
18	Floxels: Fast Unsupervised Voxel Based Scene Flow Estimation	Floxels: a fast, voxel-based unsupervised scene flow estimation method	scene flow
19	Robust Computer-Vision based Construction Site Detection for Assistive-Technology Applications	Proposes a robust computer-vision-based construction site detection system to help visually impaired people navigate safely	open-vocabulary open vocabulary
20	Extracting Symbolic Sequences from Visual Representations via Self-Supervised Learning	Proposes a self-supervised method for extracting symbolic sequences from visual representations	scene understanding

🔬 Pillar 2: RL & Architecture (8 papers)

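#	Title	One-line summary	Tags	🔗	⭐
21	STORM: Token-Efficient Long Video Understanding for Multimodal LLMs	Proposes STORM to address insufficient temporal modeling in long video understanding	Mamba state space model spatiotemporal
22	Simulating the Real World: A Unified Survey of Multimodal Generative Models	A unified survey of multimodal generative models, aimed at advancing research on real-world simulation	world model multimodal
23	ObjMST: An Object-Focused Multimodal Style Transfer Framework	ObjMST: an object-focused multimodal style transfer framework	representation learning multimodal	✅
24	CA-W3D: Leveraging Context-Aware Knowledge for Weakly Supervised Monocular 3D Detection	Proposes CA-W3D, which leverages context-aware knowledge for weakly supervised monocular 3D object detection	distillation open-vocabulary open vocabulary
25	Spectral Informed Mamba for Robust Point Cloud Processing	Proposes a spectral-informed Mamba method for robust point cloud processing, improving point cloud classification, segmentation, and few-shot learning	Mamba state space model masked autoencoder
26	Learning 3D Medical Image Models From Brain Functional Connectivity Network Supervision For Mental Disorder Diagnosis	Proposes the CINP framework, which uses contrastive learning to fuse sMRI and fMRI information and improve mental disorder diagnosis accuracy	representation learning contrastive learning multimodal
27	WeakSupCon: Weakly Supervised Contrastive Learning for Encoder Pre-training	Proposes WeakSupCon for encoder pre-training in weakly supervised multiple-instance learning	representation learning contrastive learning
28	Teach YOLO to Remember: A Self-Distillation Approach for Continual Object Detection	Proposes YOLO LwF, a self-distillation-based continual object detection method for YOLO that markedly mitigates catastrophic forgetting	distillation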

🔬 Pillar 1: Robot Control (4 papers)

#	Title	One-line summary	Tags	🔗	⭐
29	Instrument-Splatting: Controllable Photorealistic Reconstruction of Surgical Instruments Using Gaussian Splatting	Proposes Instrument-Splatting for controllable, photorealistic 3D Gaussian reconstruction of surgical instruments	real2sim 3D gaussian splatting gaussian splatting
30	High-Precision Transformer-Based Visual Servoing for Humanoid Robots in Aligning Tiny Objects	Proposes a Transformer-based visual servoing method for high-precision alignment of tiny objects by humanoid robots	humanoid humanoid robot
31	Adapt3R: Adaptive 3D Scene Representation for Domain Transfer in Imitation Learning	Adapt3R: an adaptive 3D scene representation for domain transfer in imitation learning	manipulation imitation learning zero-shot transfer
32	Omnidirectional Multi-Object Tracking	Proposes the OmniTrack framework to address distortion and motion challenges in multi-object tracking on panoramic images	quadruped	✅

🔬 Pillar 8: Physics-based Animation (2 papers)

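#	Title	One-line summary	Tags	🔗	⭐
33	EVE: Towards End-to-End Video Subtitle Extraction with Vision-Language Models	Proposes the EVE framework for end-to-end video subtitle extraction with vision-language models, and builds the large-scale dataset ViSa	spatiotemporal TAMP
34	FluidNexus: 3D Fluid Reconstruction and Prediction from a Single Video	FluidNexus: a framework for 3D fluid reconstruction and prediction from a single video	differentiable simulation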

🔬 Pillar 4: Generative Motion (1 paper)

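#	Title	One-line summary	Tags	🔗	⭐
35	How to Move Your Dragon: Text-to-Motion Synthesis for Large-Vocabulary Objects	Proposes a general text-driven skeletal animation generation method that handles motion synthesis across heterogeneous skeleton templates	motion diffusion model motion diffusion text-to-motion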

🔬 Pillar 7: Motion Retargeting (1 paper)

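#	Title	One-line summary	Tags	🔗	⭐
36	Spatial-Temporal Perception with Causal Inference for Naturalistic Driving Action Recognition	Proposes STP, a causal-inference-based spatial-temporal perception network for naturalistic driving action recognition	spatial relationship multimodal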
