cs.CV (2025-02-19)

📊 23 papers in total | 🔗 2 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (9 🔗1) · Pillar 2: RL & Architecture (5 🔗1) · Pillar 3: Perception & Semantics (4) · Pillar 1: Robot Control (2) · Pillar 7: Motion Retargeting (2) · Pillar 4: Generative Motion (1)

🔬 Pillar 9: Embodied Foundation Models (9 papers)

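# | Title | One-line summary | Tags | 🔗
1 | Enhancing Cognition and Explainability of Multimodal Foundation Models with Self-Synthesized Data | Proposes a visual rejection-sampling framework based on self-synthesized data to improve the cognition and explainability of large multimodal models | foundation model, multimodal |
2 | Triad: Vision Foundation Model for 3D Magnetic Resonance Imaging | Triad: a vision foundation model for 3D magnetic resonance imaging that improves medical image analysis performance | foundation model |
3 | PedDet: Adaptive Spectral Optimization for Multimodal Pedestrian Detection | Proposes PedDet to address insufficient information fusion in multimodal pedestrian detection | multimodal |
4 | A Chain-of-Thought Subspace Meta-Learning for Few-shot Image Captioning with Large Vision and Language Models | Proposes a chain-of-thought subspace meta-learning method to improve few-shot image captioning | chain-of-thought |
5 | Generative Video Semantic Communication via Multimodal Semantic Fusion with Large Model | Proposes a generative video semantic communication framework based on multimodal semantic fusion, improving video reconstruction quality at low bandwidth | multimodal |
6 | CARE: Confidence-Aware Regression Estimation of building density fine-tuning EO Foundation Models | Proposes the CARE model for regression estimation of building density from remote sensing imagery, with confidence quantification and self-correction | foundation model |
7 | CAPability: A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Thoroughness | Proposes CAPability, a comprehensive visual captioning benchmark for evaluating correctness and thoroughness | large language model, multimodal |
8 | Geolocation with Real Human Gameplay Data: A Large-Scale Dataset and Human-Like Reasoning Framework | Proposes the GeoComp dataset and the GeoCoT framework to improve geolocation accuracy and interpretability | chain-of-thought |
9 | SegSub: Evaluating Robustness to Knowledge Conflicts and Hallucinations in Vision-Language Models | Proposes the SegSub framework to address knowledge conflicts in vision-language models | multimodal |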

🔬 Pillar 2: RL & Architecture (5 papers)

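# | Title | One-line summary | Tags | 🔗
10 | CardiacMamba: A Multimodal RGB-RF Fusion Framework with State Space Models for Remote Physiological Measurement | CardiacMamba: proposes a multimodal RGB-RF fusion framework for remote physiological signal measurement | Mamba, SSM, state space model |
11 | SNN-Driven Multimodal Human Action Recognition via Sparse Spatial-Temporal Data Fusion | Proposes an SNN-based multimodal human action recognition framework that addresses high power consumption in resource-constrained settings | Mamba, multimodal |
12 | ModSkill: Physical Character Skill Modularization | ModSkill: proposes a skill modularization framework for physical characters, improving the generalization and scalability of motion imitation learning | policy learning, imitation learning, motion generation |
13 | MambaLiteSR: Image Super-Resolution with Low-Rank Mamba using Knowledge Distillation | Proposes MambaLiteSR, a lightweight image super-resolution model based on low-rank Mamba and knowledge distillation, suited to edge devices | Mamba, distillation |
14 | Pretrained Image-Text Models are Secretly Video Captioners | Uses pretrained image-text models to achieve high-performance video captioning with only a small amount of video data | reinforcement learning, multimodal |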

🔬 Pillar 3: Perception & Semantics (4 papers)

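# | Title | One-line summary | Tags | 🔗
15 | 3D Gaussian Splatting aided Localization for Large and Complex Indoor-Environments | Uses 3D Gaussian Splatting to aid localization, improving the accuracy and reliability of visual localization in complex indoor environments | visual SLAM, 3D gaussian splatting, 3DGS |
16 | GlossGau: Efficient Inverse Rendering for Glossy Surface with Anisotropic Spherical Gaussian | GlossGau: an efficient inverse rendering method for glossy surfaces using anisotropic spherical Gaussians | 3D gaussian splatting, gaussian splatting, splatting |
17 | Exploring Mutual Cross-Modal Attention for Context-Aware Human Affordance Generation | Proposes a mutual attention mechanism for context-aware human behavior prediction in 2D scenes | affordance |
18 | Physical Depth-aware Early Accident Anticipation: A Multi-dimensional Visual Feature Fusion Framework | Proposes a physical depth-aware early accident anticipation framework that fuses multi-dimensional visual features | monocular depth, Depth Anything |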

🔬 Pillar 1: Robot Control (2 papers)

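# | Title | One-line summary | Tags | 🔗
19 | Sce2DriveX: A Generalized MLLM Framework for Scene-to-Drive Learning | Sce2DriveX: a generalized MLLM framework for scene-to-drive learning | motion planning, scene understanding, spatiotemporal |
20 | EfficientPose 6D: Scalable and Efficient 6D Object Pose Estimation | Proposes EfficientPose 6D, built on GDRNPP, which uses the AMIS algorithm to adaptively balance accuracy and efficiency | manipulation |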

🔬 Pillar 7: Motion Retargeting (2 papers)

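# | Title | One-line summary | Tags | 🔗
21 | MagicGeo: Training-Free Text-Guided Geometric Diagram Generation | MagicGeo: proposes a training-free framework for text-guided geometric diagram generation | spatial relationship, large language model |
22 | Object-centric Binding in Contrastive Language-Image Pretraining | Proposes a binding module that combines scene graphs with structured image representations, improving CLIP models' understanding of complex scenes | spatial relationship |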

🔬 Pillar 4: Generative Motion (1 paper)

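# | Title | One-line summary | Tags | 🔗
23 | Betsu-Betsu: Multi-View Separable 3D Reconstruction of Two Interacting Objects | Proposes Betsu-Betsu, a neural implicit method for multi-view separable 3D reconstruction of interacting objects | penetration |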
