| 1 |
ImgCoT: Compressing Long Chain of Thought into Compact Visual Tokens for Efficient Reasoning of Large Language Model |
Proposes ImgCoT to address the compression of long chains of thought |
large language model chain-of-thought |
|
|
| 2 |
Head-Aware Visual Cropping: Enhancing Fine-Grained VQA with Attention-Guided Subimage |
Proposes Head-Aware Visual Cropping to improve the visual grounding ability of multimodal large language models in fine-grained VQA. |
large language model multimodal visual grounding |
|
|
| 3 |
ShotFinder: Imagination-Driven Open-Domain Video Shot Retrieval via Web Search |
ShotFinder: proposes a benchmark and method for open-domain video shot retrieval, driven by web search and imagination |
large language model multimodal |
|
|
| 4 |
Compact Hypercube Embeddings for Fast Text-based Wildlife Observation Retrieval |
Proposes compact hypercube embeddings to speed up text-based wildlife observation retrieval |
foundation model multimodal |
|
|
| 5 |
VisionTrim: Unified Vision Token Compression for Training-Free MLLM Acceleration |
VisionTrim: a unified vision token compression framework for training-free MLLM acceleration |
large language model multimodal |
✅ |
|
| 6 |
PhoStream: Benchmarking Real-World Streaming for Omnimodal Assistants in Mobile Scenarios |
PhoStream: evaluates real-world streaming understanding for omnimodal assistants in mobile scenarios |
large language model multimodal |
✅ |
|
| 7 |
ScribbleSense: Generative Scribble-Based Texture Editing with Intent Prediction |
ScribbleSense: generative scribble-based texture editing combined with intent prediction, improving interactive 3D asset creation. |
large language model multimodal |
|
|
| 8 |
Structured Over Scale: Learning Spatial Reasoning from Educational Video |
Proposes the DoraVQA dataset and uses structured information in educational videos to improve the spatial reasoning of vision-language models. |
multimodal |
|
|
| 9 |
One-shot Optimized Steering Vector for Hallucination Mitigation for VLMs |
Proposes OSGA, which mitigates hallucination in vision-language models by optimizing a steering vector from a single sample. |
multimodal |
|
|
| 10 |
StreamSense: Streaming Social Task Detection with Selective Vision-Language Model Routing |
StreamSense: streaming social task detection based on selective VLM routing |
TAMP |
|
|