cs.CV(2024-07-23)
📊 22 papers total | 🔗 7 with code
🎯 Interest Area Navigation
Pillar 3: Spatial Perception & Semantics (7 🔗2)
Pillar 9: Embodied Foundation Models (7 🔗3)
Pillar 2: RL Algorithms & Architecture (4 🔗1)
Pillar 6: Video Extraction & Matching (2 🔗1)
Pillar 1: Robot Control (1)
Pillar 7: Motion Retargeting (1)
🔬 Pillar 3: Spatial Perception & Semantics (7 papers)
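🔬 Pillar 9: Embodied Foundation Models (7 papers)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 8 | MLLM-CompBench: A Comparative Reasoning Benchmark for Multimodal LLMs | MLLM-CompBench: a benchmark for evaluating the comparative reasoning ability of multimodal large language models. | large language model, multimodal | | |
| 9 | PartGLEE: A Foundation Model for Recognizing and Parsing Any Objects | PartGLEE: a part-level foundation model for recognizing and parsing the parts of any object. | foundation model | ✅ | |
| 10 | Histopathology image embedding based on foundation models features aggregation for patient treatment response prediction | Proposes a histopathology image embedding method based on foundation-model feature aggregation to predict treatment response in patients with diffuse large B-cell lymphoma. | foundation model | | |
| 11 | C3T: Cross-modal Transfer Through Time for Sensor-based Human Activity Recognition | C3T: cross-modal transfer through time, improving sensor-based human activity recognition under unsupervised modality adaptation. | multimodal | | |
| 12 | Unveiling and Mitigating Bias in Audio Visual Segmentation | Addresses audio priming bias and visual prior bias in audio-visual segmentation with a perception module and a contrastive learning strategy. | visual grounding | | |
| 13 | Category-Extensible Out-of-Distribution Detection via Hierarchical Context Descriptions | Proposes CATEX, which achieves category-extensible OOD detection through hierarchical context descriptions. | large language model | ✅ | |
| 14 | Harmonizing Visual Text Comprehension and Generation | TextHarmony: proposes Slide-LoRA to unify visual text comprehension and generation. | multimodal | ✅ | |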
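🔬 Pillar 2: RL Algorithms & Architecture (4 papers)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 15 | Diffusion Models for Monocular Depth Estimation: Overcoming Challenging Conditions | Proposes a diffusion-model-based approach to monocular depth estimation that improves robustness in challenging scenes. | distillation, depth estimation, monocular depth | | |
| 16 | MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequence | MovieDreamer: proposes a hierarchical generation framework for movie-level video generation with coherent long visual sequences. | dreamer, multimodal | ✅ | |
| 17 | Accelerating Learned Video Compression via Low-Resolution Representation Learning | Proposes an accelerated video compression framework based on low-resolution representation learning that significantly increases encoding and decoding speed. | representation learning | | |
| 18 | A Multi-view Mask Contrastive Learning Graph Convolutional Neural Network for Age Estimation | Proposes a multi-view mask contrastive learning graph convolutional network for facial age estimation. | contrastive learning | | |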
🔬 Pillar 6: Video Extraction & Matching (2 papers)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 19 | EgoCVR: An Egocentric Benchmark for Fine-Grained Composed Video Retrieval | EgoCVR: an egocentric benchmark dataset for fine-grained composed video retrieval. | egocentric | ✅ | |
| 20 | Motion Capture from Inertial and Vision Sensors | Proposes the MINIONS dataset and the SparseNet framework for low-cost human motion capture from inertial and vision sensors. | SMPL | | |
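🔬 Pillar 1: Robot Control (1 paper)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 21 | Coarse-to-Fine Proposal Refinement Framework for Audio Temporal Forgery Detection and Localization | Proposes a coarse-to-fine framework for audio temporal forgery detection and localization, addressing the inability of existing methods to localize tampered segments. | manipulation, representation learning, TAMP | | |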
🔬 Pillar 7: Motion Retargeting (1 paper)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 22 | VisMin: Visual Minimal-Change Understanding | Proposes the VisMin benchmark for evaluating vision-language models on fine-grained visual understanding. | spatial relationship, large language model | | |