cs.CV (2024-11-05)
📊 20 papers in total | 🔗 5 with code
🎯 Interest Area Navigation
Pillar 2: RL Algorithms and Architecture (7 🔗3)
Pillar 9: Embodied Foundation Models (6 🔗2)
Pillar 3: Spatial Perception and Semantics (5)
Pillar 6: Video Extraction and Matching (1)
Pillar 1: Robot Control (1)
🔬 Pillar 2: RL Algorithms and Architecture (7 papers)
| # | Title | One-sentence summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 1 | Efficient and Effective Adaptation of Multimodal Foundation Models in Sequential Recommendation | Proposes the IISAN-Versa framework, which efficiently adapts multimodal foundation models to sequential recommendation and achieves SOTA performance. | representation learning, large language model, foundation model | ✅ | |
| 2 | Object and Contact Point Tracking in Demonstrations Using 3D Gaussian Splatting | Tracks objects and contact points in interactive demonstrations using 3D Gaussian Splatting. | imitation learning, 3D gaussian splatting, gaussian splatting | | |
| 3 | V-DPO: Mitigating Hallucination in Large Vision Language Models via Vision-Guided Direct Preference Optimization | Proposes V-DPO, which mitigates hallucination in large vision-language models through vision-guided direct preference optimization. | preference learning, DPO, direct preference optimization | ✅ | |
| 4 | Estimating Ego-Body Pose from Doubly Sparse Egocentric Video Data | Proposes an ego-body pose estimation method for doubly sparse egocentric video, improving motion capture accuracy. | masked autoencoder, egocentric, spatiotemporal | | |
| 5 | LiVOS: Light Video Object Segmentation with Gated Linear Matching | LiVOS: lightweight video object segmentation with gated linear matching. | linear attention, spatiotemporal, foundation model | | |
| 6 | Pre-trained Visual Dynamics Representations for Efficient Policy Learning | Proposes PVDR, which uses pre-trained visual dynamics representations to make policy learning in reinforcement learning more efficient. | reinforcement learning, policy learning | | |
| 7 | ShadowMamba: State-Space Model with Boundary-Region Selective Scan for Shadow Removal | ShadowMamba: a state-space model with boundary-region selective scan for shadow removal. | Mamba | ✅ | |
🔬 Pillar 9: Embodied Foundation Models (6 papers)
| # | Title | One-sentence summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
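| 8 | Personalized Video Summarization by Multimodal Video Understanding | Proposes a personalized video summarization method based on multimodal video understanding, improving user experience. | multimodal | | |
| 9 | MME-Finance: A Multimodal Finance Benchmark for Expert-level Understanding and Reasoning | MME-Finance: a multimodal finance benchmark for expert-level understanding and reasoning. | multimodal | | |
| 10 | Fine-Grained Spatial and Verbal Losses for 3D Visual Grounding | Proposes AsphaltNet, which improves 3D visual grounding performance through fine-grained spatial and verbal losses. | visual grounding | | |
| 11 | FlexCAD: Unified and Versatile Controllable CAD Generation with Fine-tuned Large Language Models | Proposes FlexCAD to address the inefficiency of controllable CAD generation. | large language model | ✅ | |
| 12 | Inference Optimal VLMs Need Fewer Visual Tokens and More Parameters | Inference optimization for vision-language models: using fewer visual tokens and more model parameters is more effective. | large language model | ✅ | |
| 13 | CRT-Fusion: Camera, Radar, Temporal Fusion Using Motion Information for 3D Object Detection | CRT-Fusion: a 3D object detection method that fuses camera, radar, and temporal information. | TAMP | | |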
🔬 Pillar 3: Spatial Perception and Semantics (5 papers)
| # | Title | One-sentence summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 14 | Multi-modal NeRF Self-Supervision for LiDAR Semantic Segmentation | Proposes a multi-modal NeRF self-supervision framework that improves LiDAR semantic segmentation in autonomous driving scenes. | NeRF, foundation model | | |
| 15 | CAD-NeRF: Learning NeRFs from Uncalibrated Few-view Images by CAD Model Retrieval | CAD-NeRF: learns NeRFs from uncalibrated few-view images using CAD model retrieval. | NeRF, neural radiance field | | |
| 16 | Exploring Seasonal Variability in the Context of Neural Radiance Fields for 3D Reconstruction on Satellite Imagery | Proposes Planet-NeRF, which uses monthly embedding vectors to strengthen seasonal prediction in NeRFs for satellite imagery. | NeRF, neural radiance field | | |
| 17 | Correlation of Object Detection Performance with Visual Saliency and Depth Estimation | Studies how object detection performance correlates with visual saliency and depth estimation, offering guidance for optimizing model architectures. | depth estimation, Depth Anything | | |
| 18 | HFGaussian: Learning Generalizable Gaussian Human with Integrated Human Features | HFGaussian: proposes a generalizable Gaussian human modeling method that integrates human features. | 3D gaussian splatting, gaussian splatting, splatting | | |
🔬 Pillar 6: Video Extraction and Matching (1 paper)
| # | Title | One-sentence summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 19 | Self Supervised Networks for Learning Latent Space Representations of Human Body Scans and Motions | Proposes the self-supervised networks VariShaPE and MoGeN for learning latent space representations of human body scans and motions. | SMPL | | |
🔬 Pillar 1: Robot Control (1 paper)
| # | Title | One-sentence summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 20 | Lost in Context: The Influence of Context on Feature Attribution Methods for Object Recognition | Studies the influence of context on feature attribution methods for object recognition models. | manipulation | | |