| 1 |
Attention-guided Fine-tuning of Multimodal Large Language Models Improves Chain-of-Thought Reasoning |
Proposes Attentive-CoT, which uses attention-guided fine-tuning to improve the CoT reasoning of multimodal large language models |
large language model multimodal chain-of-thought |
|
|
| 2 |
Jailbreaking Multimodal Large Language Models using Multi-Clip Video |
Proposes Multi-Clip Video SafetyBench to evaluate how video input diversity affects jailbreak attacks on multimodal large language models. |
large language model multimodal |
|
|
| 3 |
Mitigating Perceptual Judgment Bias in Multimodal LLM-as-a-Judge via Perceptual Perturbation and Reward Modeling |
Proposes a method based on perceptual perturbation and reward modeling to mitigate judgment bias in multimodal LLM-as-a-Judge |
large language model multimodal |
|
|
| 4 |
ProtoAda: Prototype-Guided Adaptive Adapter Expansion and Geometric Consolidation for Multimodal Continual Instruction Tuning |
ProtoAda: prototype-guided adaptive adapter expansion and geometric consolidation for multimodal continual instruction tuning |
large language model multimodal |
|
|
| 5 |
Improving Visual Token Reduction via Rectifying Distortions for Efficient Multimodal LLM Inference |
RESTORE: improves multimodal LLM inference efficiency by rectifying visual distortions |
large language model multimodal |
|
|
| 6 |
Do Multimodal Agents Really Benefit from Tool Use? A Systematic Study of Capability Gains |
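Shows that the gains from tool use in multimodal agents may be overestimated; invoking tools does not by itself indicate improved capability |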
multimodal |
|
|
| 7 |
Multimodal Approaches for Visually-Rich Document Type Classification: A Comparative Analysis |
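Comparative analysis of multimodal methods for visually-rich document type classification, revealing the contribution of each modality. |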
multimodal |
|
|
| 8 |
InfoMerge: Information-aware Token Compression for Efficient Video Large Language Models |
InfoMerge: an information-aware token compression method for efficient video large language models |
large language model |
|
|
| 9 |
Multimodal Action Diffusion for Robust End-to-End Autonomous Driving |
Proposes Action Diffusion Transformer for robust multimodal action prediction in end-to-end autonomous driving. |
multimodal |
|
|
| 10 |
The Image Reconstruction Game: Drawing Common Ground Through Iterative Multimodal Dialogue |
Proposes the Image Reconstruction Game benchmark, which improves image generation quality through iterative multimodal dialogue. |
multimodal |
|
|
| 11 |
FlatVPR: Plug-and-play Geo-linear Residual Adapter for Geometric Rectification of Foundation Model Feature Manifolds |
Proposes FlatVPR to address the feature reconstruction problem in visual place recognition |
foundation model |
|
|
| 12 |
PathAR: Structure-First Autoregressive Synthesis of Multimodal Pathology Images |
PathAR: a structure-first autoregressive model for synthesizing multimodal pathology images |
multimodal |
|
|
| 13 |
AdaCodec: A Predictive Visual Code for Video MLLMs |
AdaCodec: a predictive visual code for video MLLMs that significantly reduces computational cost and improves performance. |
large language model multimodal |
|
|
| 14 |
Moment-Video: Diagnosing Temporal Fidelity of Video MLLMs on Momentary Visual Events |
Moment-Video: diagnosing the temporal fidelity of video multimodal large language models on momentary visual events |
large language model multimodal |
|
|
| 15 |
Thinking in Blender: Staged Executable Inverse Graphics with Vision-Language Models |
Proposes the SEIG framework, which uses vision-language models to reconstruct an editable Blender scene from a single image. |
foundation model |
|
|
| 16 |
Not All Points Are Equal: Uncertainty-Aware 4D LiDAR Scene Synthesis |
Proposes the U4D framework, which uses uncertainty to guide 4D LiDAR scene synthesis, improving scene fidelity and temporal consistency. |
embodied AI |
|
|
| 17 |
A Structured Benchmark for Text-Guided Anomaly Detection: When Language Stops Conditioning the Decision |
Proposes the TGAD benchmark, revealing that existing text-guided anomaly detection methods rely too little on the language condition |
multimodal |
|
|
| 18 |
Density-Aware Translation of Spurious Correlations in Zero-Shot VLMs |
Proposes Density-Aware Translation (DAT) to improve the robustness of zero-shot VLMs to spurious correlations |
multimodal |
|
|
| 19 |
Goal2Pixel: Grounding Goals to Pixels for Vision-Language Navigation |
Goal2Pixel: grounding goals to pixels for vision-language navigation |
VLN |
✅ |
|