| 1 |
GeoLLaVA-8K: Scaling Remote-Sensing Multimodal Large Language Models to 8K Resolution |
Proposes GeoLLaVA-8K to address ultra-high-resolution remote-sensing image processing |
large language model foundation model multimodal |
|
|
| 2 |
DynamicVL: Benchmarking Multimodal Large Language Models for Dynamic City Understanding |
Proposes DVL-Suite to address the shortcomings of multimodal large language models in understanding urban dynamics |
large language model multimodal |
|
|
| 3 |
Fork-Merge Decoding: Enhancing Multimodal Understanding in Audio-Visual Large Language Models |
Proposes Fork-Merge Decoding to address modality bias in audio-visual large language models |
large language model multimodal |
|
|
| 4 |
MMTBENCH: A Unified Benchmark for Complex Multimodal Table Reasoning |
Proposes MMTBENCH to address complex multimodal table reasoning |
large language model multimodal |
|
|
| 5 |
MME-VideoOCR: Evaluating OCR-Based Capabilities of Multimodal LLMs in Video Scenarios |
Proposes MME-VideoOCR to address inadequate OCR performance in video scenarios |
large language model multimodal |
|
|
| 6 |
Think Twice, Act Once: Token-Aware Compression and Action Reuse for Efficient Inference in Vision-Language-Action Models |
Proposes FlashVLA to address the low inference efficiency of VLA models |
vision-language-action VLA |
|
|
| 7 |
AVCD: Mitigating Hallucinations in Audio-Visual Large Language Models through Contrastive Decoding |
Proposes AVCD to address hallucinations in audio-visual large language models |
large language model multimodal |
✅ |
|
| 8 |
EaqVLA: Encoding-aligned Quantization for Vision-Language-Action Models |
Proposes EaqVLA to address quantization efficiency in VLA models |
vision-language-action VLA |
|
|
| 9 |
Music's Multimodal Complexity in AVQA: Why We Need More than General Multimodal LLMs |
Proposes specialized methods to address the complexity of music audio-visual question answering |
large language model multimodal |
✅ |
|
| 10 |
Leveraging Large Language Models in Visual Speech Recognition: Model Scaling, Context-Aware Decoding, and Iterative Polishing |
Proposes methods that use large language models to improve visual speech recognition performance |
large language model |
|
|
| 11 |
Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers |
Proposes Paper2Poster to address automatic academic poster generation |
multimodal |
✅ |
|
| 12 |
HoliTom: Holistic Token Merging for Fast Video Large Language Models |
Proposes HoliTom to address the computational efficiency of video large language models |
large language model |
|
|
| 13 |
PARTONOMY: Large Multimodal Models with Part-Level Visual Understanding |
Proposes PARTONOMY to address part recognition in large multimodal models |
multimodal |
|
|
| 14 |
Understand, Think, and Answer: Advancing Visual Reasoning with Large Multimodal Models |
Proposes a unified visual reasoning mechanism to improve the compositional reasoning of multimodal models |
multimodal |
✅ |
|
| 15 |
MedBridge: Bridging Foundation Vision-Language Models to Medical Image Diagnosis in Chest X-Ray |
Proposes MedBridge to address domain adaptation in medical image diagnosis |
foundation model multimodal |
✅ |
|
| 16 |
Think Before You Diffuse: Infusing Physical Rules into Video Diffusion |
Proposes the DiffPhy framework to address physical accuracy in video generation |
large language model multimodal |
✅ |
|
| 17 |
Adversarial Attacks against Closed-Source MLLMs via Feature Optimal Alignment |
Proposes FOA-Attack to address adversarial attacks on closed-source MLLMs |
large language model multimodal |
✅ |
|
| 18 |
Mitigating Hallucination in Large Vision-Language Models via Adaptive Attention Calibration |
Proposes the CAAC framework to address hallucination in large vision-language models |
multimodal visual grounding |
|
|
| 19 |
OASIS: Online Sample Selection for Continual Visual Instruction Tuning |
Proposes OASIS to address sample selection in continual visual instruction tuning |
foundation model |
|
|
| 20 |
Mentor3AD: Feature Reconstruction-based 3D Anomaly Detection via Multi-modality Mentor Learning |
Proposes Mentor3AD to address feature reconstruction in 3D anomaly detection |
multimodal |
|
|
| 21 |
Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning? |
Proposes the Video-Holmes benchmark to address complex video reasoning |
multimodal |
✅ |
|
| 22 |
Inverse Virtual Try-On: Generating Multi-Category Product-Style Images from Clothed Individuals |
Proposes TEMU-VTOFF to address the inverse virtual try-on problem |
multimodal |
|
|
| 23 |
Advancing high-fidelity 3D and Texture Generation with 2.5D latents |
Proposes a new framework to address inconsistency between 3D geometry and texture generation |
foundation model |
|
|
| 24 |
Unified Text-Image-to-Video Generation: A Training-Free Approach to Flexible Visual Conditioning |
Proposes FlexTI2V to address high training cost and limited conditioning settings |
foundation model |
|
|