cs.CV(2024-06-28)
📊 18 papers in total | 🔗 7 with code
🎯 Interest Area Navigation
Pillar 9: Embodied Foundation Models (6 🔗3)
Pillar 3: Perception & Semantics (5 🔗1)
Pillar 7: Motion Retargeting (3 🔗1)
Pillar 2: RL & Architecture (3 🔗2)
Pillar 1: Robot Control (1)
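🔬 Pillar 9: Embodied Foundation Models (6 papers)
| # | Title | One-sentence summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 1 | MM-Instruct: Generated Visual Instructions for Large Multimodal Model Alignment | MM-Instruct: generates visual instruction data to improve the instruction-following ability of large multimodal models | large language model, multimodal, instruction following | ✅ | |
| 2 | Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs | Proposes the Web2Code dataset and evaluation framework to improve webpage understanding and code generation in multimodal LLMs | large language model, multimodal | ✅ | |
| 3 | Multimodal Prototyping for cancer survival prediction | Proposes a cancer survival prediction method based on multimodal prototype learning that significantly reduces computation and improves interpretability | multimodal | | |
| 4 | PathGen-1.6M: 1.6 Million Pathology Image-text Pairs Generation through Multi-agent Collaboration | PathGen-1.6M: generates 1.6 million pathology image-text pairs through multi-agent collaboration to improve pathology VLM performance | large language model, multimodal | | |
| 5 | EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model | Proposes EVF-SAM, which improves the segmentation performance of text-prompted SAM through early vision-language fusion | multimodal | | |
| 6 | InfiniBench: A Benchmark for Large Multi-Modal Models in Long-Form Movies and TV Shows | InfiniBench: a long-video benchmark for large multimodal models that tests understanding of movies and TV shows | multimodal | ✅ | |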
🔬 Pillar 3: Perception & Semantics (5 papers)
| # | Title | One-sentence summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 7 | EgoGaussian: Dynamic Scene Understanding from Egocentric Video with 3D Gaussian Splatting | EgoGaussian: understands dynamic scenes from egocentric video using 3D Gaussian Splatting | 3D gaussian splatting, gaussian splatting, splatting | | |
| 8 | SpotlessSplats: Ignoring Distractors in 3D Gaussian Splatting | SpotlessSplats: removes distractors in 3D Gaussian Splatting using robust optimization and pretrained features | 3D gaussian splatting, 3DGS, gaussian splatting | | |
| 9 | Deep Learning-based Depth Estimation Methods from Monocular Image and Videos: A Comprehensive Survey | A survey of deep learning methods for depth estimation from monocular images and videos: architectures, supervision, and evolution | depth estimation, monocular depth | | |
| 10 | ASSR-NeRF: Arbitrary-Scale Super-Resolution on Voxel Grid for High-Quality Radiance Fields Reconstruction | Proposes ASSR-NeRF, which achieves high-quality radiance field reconstruction through arbitrary-scale super-resolution on a voxel grid | NeRF | | |
| 11 | LightStereo: Channel Boost Is All You Need for Efficient 2D Cost Aggregation | LightStereo: efficient 2D cost aggregation for stereo matching via channel boost | scene flow | ✅ | |
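🔬 Pillar 7: Motion Retargeting (3 papers)
| # | Title | One-sentence summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 12 | FootBots: A Transformer-based Architecture for Motion Prediction in Soccer | FootBots: a Transformer-based architecture for motion prediction in soccer that exploits equivariance to improve prediction accuracy | motion prediction | | |
| 13 | MimicMotion: High-Quality Human Motion Video Generation with Confidence-aware Pose Guidance | MimicMotion: high-quality human motion video generation with confidence-aware pose guidance | human motion | ✅ | |
| 14 | Optimized 3D Point Labeling with Leaders Using the Beams Displacement Method | Proposes an optimized labeling method for 3D point features based on the beams displacement method, addressing label overlap and direction deviation | spatial relationship | | |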
🔬 Pillar 2: RL & Architecture (3 papers)
| # | Title | One-sentence summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 15 | Structure-aware World Model for Probe Guidance via Large-scale Self-supervised Pre-train | Proposes a structure-aware world model that improves ultrasound probe guidance accuracy through large-scale self-supervised pretraining | world model, spatial relationship | | |
| 16 | CSAKD: Knowledge Distillation with Cross Self-Attention for Hyperspectral and Multispectral Image Fusion | Proposes CSAKD, a knowledge distillation model based on cross self-attention, for hyperspectral and multispectral image fusion | distillation, HSI | ✅ | |
| 17 | PopAlign: Population-Level Alignment for Fair Text-to-Image Generation | Proposes PopAlign to address population-level bias in text-to-image generation | reinforcement learning, RLHF, DPO | ✅ | |
🔬 Pillar 1: Robot Control (1 paper)
| # | Title | One-sentence summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 18 | SemUV: Deep Learning based semantic manipulation over UV texture map of virtual human heads | SemUV: proposes a deep learning-based method for semantic manipulation of faces in UV texture space | manipulation | | |