| 1 |
MicroWorld: Empowering Multimodal Large Language Models to Bridge the Microscopic Domain Gap with Multimodal Attribute Graph |
MicroWorld: Enhances MLLM reasoning in the microscopic domain with a multimodal attribute graph |
large language model multimodal |
✅ |
|
| 2 |
CapVector: Learning Transferable Capability Vectors in Parametric Space for Vision-Language-Action Models |
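Proposes CapVector, which gives vision-language-action models lightweight capability enhancement by decoupling capabilities in parameter space |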
vision-language-action VLA |
|
|
| 3 |
Dynamic Cross-Modal Prompt Generation for Multimodal Continual Instruction Tuning |
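Proposes the DRAPE framework, which uses dynamic cross-modal prompt generation to address catastrophic forgetting in multimodal continual instruction tuning |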
large language model multimodal |
|
|
| 4 |
EnergyLens: Interpretable Closed-Form Energy Models for Multimodal LLM Inference Serving |
Proposes EnergyLens, a closed-form energy model based on symbolic regression, for optimizing the energy efficiency of multimodal LLM inference |
large language model multimodal |
|
|
| 5 |
SciVQR: A Multidisciplinary Multimodal Benchmark for Advanced Scientific Reasoning Evaluation |
Proposes SciVQR, a multidisciplinary multimodal benchmark for comprehensively evaluating large models on complex scientific reasoning |
large language model multimodal |
✅ |
|
| 6 |
C-CoT: Counterfactual Chain-of-Thought with Vision-Language Models for Safe Autonomous Driving |
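Proposes the C-CoT counterfactual chain-of-thought framework, which uses vision-language models to improve the safety of autonomous driving decisions |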
chain-of-thought |
|
|
| 7 |
Personal Visual Context Learning in Large Multimodal Models |
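Proposes the Personal Visual Context Learning (Personal VCL) framework and the Agentic Context Bank to improve large models' understanding of user-specific visual information |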
multimodal |
|
|
| 8 |
Qwen-Image-2.0 Technical Report |
Qwen-Image-2.0: Proposes an all-purpose image generation foundation model that unifies high-fidelity generation and precise editing |
foundation model multimodal instruction following |
|
|
| 9 |
BGG: Bridging the Geometric Gap between Cross-View images by Vision Foundation Model Adaptation for Geo-Localization |
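Proposes the BGG framework, which adapts a vision foundation model to bridge the geometric gap between cross-view images and improve geo-localization performance |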
foundation model |
|
|
| 10 |
ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models |
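Proposes ViSRA, a training-free video spatial reasoning agent designed to improve the 3D spatial understanding of multimodal large language models |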
large language model |
|
|
| 11 |
TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models |
Proposes the TOC-Bench benchmark to evaluate how well video large language models reason about temporal object consistency |
large language model |
|
|
| 12 |
Count Anything at Any Granularity |
Proposes the multi-granularity counting framework HieraCount and the large-scale dataset KubriCount for accurate open-world object counting |
large language model multimodal |
✅ |
|
| 13 |
V-ABS: Action-Observer Driven Beam Search for Dynamic Visual Reasoning |
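Proposes the V-ABS framework, which uses action-observer driven beam search to address the IAO bias in dynamic visual reasoning with multimodal large language models |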
large language model multimodal |
|
|
| 14 |
ERASE: Eliminating Redundant Visual Tokens via Adaptive Two-Stage Token Pruning |
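Proposes the ERASE framework, which uses adaptive two-stage visual token pruning to reduce computational redundancy in multimodal large language models |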
large language model multimodal |
✅ |
|
| 15 |
Adversarial Attacks Against MLLMs via Progressive Resolution Processing and Adaptive Feature Alignment |
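Proposes the PRAF-Attack framework, which improves the transferability of black-box attacks on MLLMs through progressive resolution processing and adaptive feature alignment |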
large language model multimodal |
|
|
| 16 |
The Cartesian Shortcut: Re-evaluate Vision Reasoning in Polar Coordinate Space |
Proposes the Polaris-Bench benchmark to expose the reliance of multimodal large language models on a Cartesian shortcut in visual reasoning |
large language model multimodal |
|
|
| 17 |
BabelDOC: Better Layout-Preserving PDF Translation via Intermediate Representation |
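Proposes the BabelDOC framework, which uses an intermediate representation (IR) to translate PDF documents while preserving layout with high fidelity |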
multimodal |
|
|
| 18 |
Break the Brake, Not the Wheel: Untargeted Jailbreak via Entropy Maximization |
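Proposes UJEM-KL, an untargeted jailbreak method based on entropy maximization that significantly improves attack transferability against vision-language models |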
multimodal |
|
|
| 19 |
AllocMV: Optimal Resource Allocation for Music Video Generation via Structured Persistent State |
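Proposes the AllocMV framework, which generates music videos efficiently by combining structured persistent state with a multiple-choice knapsack problem solver |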
multimodal |
|
|
| 20 |
Vocabulary Hijacking in LVLMs: Unveiling Critical Attention Heads by Excluding Inert Tokens to Mitigate Hallucination |
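Proposes the HAVAE intervention strategy, which mitigates LVLM hallucination by identifying and suppressing the "vocabulary hijacking" phenomenon |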
multimodal |
✅ |
|
| 21 |
Thinking with Novel Views: A Systematic Analysis of Generative-Augmented Spatial Intelligence |
Proposes the TwNV framework, which enhances the spatial reasoning of large models through generative novel-view synthesis |
multimodal |
|
|
| 22 |
Sens-VisualNews: A Benchmark Dataset for Sensational Image Detection |
Proposes the Sens-VisualNews benchmark dataset to advance research on detecting sensational content in news images |
multimodal |
|
|
| 23 |
SleepWalk: A Three-Tier Benchmark for Stress-Testing Instruction-Guided Vision-Language Navigation |
Proposes the SleepWalk benchmark, designed to stress-test instruction-guided vision-language navigation and embodied reasoning |
multimodal |
|
|