| 1 |
GeM-VG: Towards Generalized Multi-image Visual Grounding with Multimodal Large Language Models |
Proposes GeM-VG, a multimodal large language model for generalized multi-image visual grounding. |
large language model multimodal visual grounding |
|
|
| 2 |
SOVABench: A Vehicle Surveillance Action Retrieval Benchmark for Multimodal Large Language Models |
Proposes SOVABench, a vehicle surveillance action retrieval benchmark for evaluating multimodal large language models. |
large language model multimodal instruction following |
|
|
| 3 |
Forge-and-Quench: Enhancing Image Generation for Higher Fidelity in Unified Multimodal Models |
Proposes the Forge-and-Quench framework, which uses understanding to improve image generation fidelity. |
multimodal instruction following |
✅ |
|
| 4 |
Cutting AI Research Costs: How Task-Aware Compression Makes Large Language Model Agents Affordable |
AgentCompress: task-aware compression that lowers the research cost of large language model agents. |
large language model |
|
|
| 5 |
VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice |
VideoAuto-R1: efficient video auto reasoning via "thinking once, answering twice". |
large language model multimodal chain-of-thought |
|
|
| 6 |
Atlas 2 -- Foundation models for clinical deployment |
Atlas 2: a pathology vision foundation model for clinical deployment that balances performance, robustness, and efficiency. |
foundation model |
|
|
| 7 |
Prototypicality Bias Reveals Blindspots in Multimodal Evaluation Metrics |
Proposes ProtoScore to address prototypicality bias in multimodal evaluation. |
multimodal |
|
|
| 8 |
Vision-Language Introspection: Mitigating Overconfident Hallucinations in MLLMs via Interpretable Bi-Causal Steering |
Proposes Vision-Language Introspection, which mitigates hallucinations in multimodal large language models through interpretable bi-causal steering. |
large language model multimodal |
|
|
| 9 |
Re-Align: Structured Reasoning-guided Alignment for In-Context Image Generation and Editing |
Re-Align: a structured reasoning-guided framework for in-context image generation and editing. |
multimodal chain-of-thought |
|
|
| 10 |
AIVD: Adaptive Edge-Cloud Collaboration for Accurate and Efficient Industrial Visual Detection |
Proposes the AIVD framework, which achieves accurate and efficient industrial visual detection through edge-cloud collaboration. |
large language model multimodal |
|
|
| 11 |
MiLDEdit: Reasoning-Based Multi-Layer Design Document Editing |
Proposes MiLDEAgent to address fine-grained editing of multi-layer design documents. |
multimodal instruction following |
|
|
| 12 |
All Changes May Have Invariant Principles: Improving Ever-Shifting Harmful Meme Detection via Design Concept Reproduction |
Proposes RepMD, which improves detection of ever-shifting harmful memes through design concept reproduction. |
large language model multimodal |
|
|
| 13 |
Scaling Vision Language Models for Pharmaceutical Long Form Video Reasoning on Industrial GenAI Platform |
Scales vision-language models for pharmaceutical long-form video reasoning on an industrial GenAI platform. |
multimodal |
|
|
| 14 |
Skeletonization-Based Adversarial Perturbations on Large Vision Language Model's Mathematical Text Recognition |
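Proposes skeletonization-based adversarial perturbations that attack the mathematical text recognition of large vision-language models. |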
foundation model |
|
|