cs.CV (2025-02-07)

📊 18 papers in total | 🔗 3 with code

🎯 Navigation by Area of Interest

Pillar 9: Embodied Foundation Models (8 papers, 🔗 1 with code)
Pillar 3: Perception & Semantics (4 papers)
Pillar 2: RL & Architecture (3 papers, 🔗 1 with code)
Pillar 7: Motion Retargeting (2 papers)
Pillar 5: Interaction & Reaction (1 paper, 🔗 1 with code)

🔬 Pillar 9: Embodied Foundation Models (8 papers)

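| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 1 | Lost in Time: Clock and Calendar Understanding Challenges in Multimodal LLMs | Reveals the challenges that multimodal large language models face in understanding clocks and calendars. | large language model, multimodal | |
| 2 | QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation | QLIP: text-aligned visual tokenization that unifies auto-regressive multimodal understanding and generation. | multimodal | |
| 3 | Chest X-ray Foundation Model with Global and Local Representations Integration | CheXFound: a chest X-ray foundation model that integrates global and local representations. | foundation model | |
| 4 | Goku: Flow Based Video Generative Foundation Models | Goku: flow-based video generative foundation models that achieve industry-leading performance in joint image and video generation. | foundation model | |
| 5 | Survey on AI-Generated Media Detection: From Non-MLLM to MLLM | Surveys AI-generated media detection, tracing its evolution from non-MLLM to MLLM approaches and the open challenges. | large language model, multimodal | |
| 6 | Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy | Long-VITA: a multimodal model that supports contexts of up to one million tokens while preserving short-context accuracy. | large language model | |
| 7 | Multitwine: Multi-Object Compositing with Text and Layout Control | Multitwine: the first multi-object compositing generative model with both text and layout control. | multimodal | |
| 8 | ELITE: Enhanced Language-Image Toxicity Evaluation for Safety | Proposes the ELITE benchmark and evaluator to improve the quality and diversity of safety evaluation for vision-language models. | multimodal | |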

🔬 Pillar 3: Perception & Semantics (4 papers)

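| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 9 | AutoOcc: Automatic Open-Ended Semantic Occupancy Annotation via Vision-Language Guided Gaussian Splatting | Proposes AutoOcc, which uses vision-language guided Gaussian splatting for automatic open-ended semantic occupancy annotation. | gaussian splatting, splatting | |
| 10 | PoI: A Filter to Extract Pixel of Interest from Novel View Synthesis for Scene Coordinate Regression | Proposes the PoI filter, which extracts reliable pixels from novel view synthesis to improve scene coordinate regression accuracy. | 3D gaussian splatting, 3DGS, gaussian splatting | |
| 11 | SC-OmniGS: Self-Calibrating Omnidirectional Gaussian Splatting | SC-OmniGS: a self-calibrating omnidirectional Gaussian splatting method for fast and accurate omnidirectional radiance field reconstruction. | gaussian splatting, splatting | |
| 12 | High-Speed Dynamic 3D Imaging with Sensor Fusion Splatting | Proposes a sensor fusion method based on Gaussian splatting for high-speed dynamic 3D scene reconstruction. | gaussian splatting, splatting | |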

🔬 Pillar 2: RL & Architecture (3 papers)

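| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 13 | CMamba: Learned Image Compression with State Space Models | CMamba: a learned image compression method based on state space models that combines high performance with low complexity. | Mamba, SSM, state space model | |
| 14 | Learning Street View Representations with Spatiotemporal Contrast | Proposes a spatiotemporal contrastive learning framework that learns representations of urban street view images to support urban sustainability tasks. | representation learning, contrastive learning, spatiotemporal | |
| 15 | Trust-Aware Diversion for Data-Effective Distillation | Proposes Trust-Aware Diversion to address dataset distillation with noisy labels. | distillation | |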

🔬 Pillar 7: Motion Retargeting (2 papers)

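| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 16 | HumanDiT: Pose-Guided Diffusion Transformer for Long-form Human Motion Video Generation | HumanDiT: a pose-guided diffusion Transformer for generating long-form human motion videos. | human motion | |
| 17 | LP-DETR: Layer-wise Progressive Relations for Object Detection | LP-DETR: improves DETR object detection performance by modeling layer-wise progressive relations. | spatial relationship | |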

🔬 Pillar 5: Interaction & Reaction (1 paper)

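| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 18 | Hummingbird: High Fidelity Image Generation via Multimodal Context Alignment | Proposes Hummingbird to keep generated images consistent with the multimodal context. | human-object interaction, HOI, spatial relationship | |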
