| 1 |
NutriMLLM: Multimodal Large Language Models for Dietary Micronutrient Analysis |
Proposes NutriMLLM to address dietary micronutrient analysis |
large language model multimodal |
|
|
| 2 |
Streaming Interventions: Can Video Large Language Models Correct Mistakes as They Occur? |
Proposes Ego-MC-Bench and Ego-CoMist to address real-time mistake correction by video LLMs |
large language model multimodal |
|
|
| 3 |
GD-MIL: Grade-Disentangled Multiple Instance Learning for Multimodal Biochemical Recurrence Prediction in Prostate Cancer |
Proposes GD-MIL to address biochemical recurrence prediction in prostate cancer |
foundation model multimodal |
|
|
| 4 |
Scaling by Diversified Experience for Vision-Language-Action Models |
Proposes SyVLA to address control and reasoning in vision-language-action models |
vision-language-action VLA |
|
|
| 5 |
Securing Self-supervised Data Curation for Foundation Models Robustness |
Proposes a toxic-data detector to ensure the integrity of self-supervised data |
foundation model |
|
|
| 6 |
Reason Twice: Segmentation via Candidate Discovery and Comparative Reasoning |
Proposes the Rea2Seg framework to address complex image segmentation |
large language model foundation model multimodal |
|
|
| 7 |
CAMF-Det: Closure-Aware Multimodal Fusion for LiDAR-Camera 3D Object Detection on UAV Platforms |
Proposes CAMF-Det to address occlusion on UAV platforms |
multimodal |
|
|
| 8 |
CRANE: Knowledge Editing for Reasoning MLLMs |
Proposes the CRANE framework to address knowledge editing for reasoning multimodal large language models |
large language model multimodal chain-of-thought |
|
|
| 9 |
When Vision Misleads, Let Location Speak: A Worldwide Image Geo-Localization Method via Location Attention Mechanism and Large Multimodal Models |
Proposes TransGeoCLIP to address worldwide image geo-localization |
multimodal |
|
|
| 10 |
DifferSeg: Towards Diverse Multimodal Binary Segmentation via Differential Perception and Frequency Guidance |
Proposes DifferSeg to address adaptability and decoding efficiency in multimodal binary segmentation |
multimodal |
|
|
| 11 |
HDRAgent: An Agentic Framework for Multi-Exposure HDR Imaging |
Proposes HDRAgent to address HDR imaging artifacts in dynamic scenes |
large language model multimodal |
|
|
| 12 |
A multi-agent system for spine MRI report generation from multi-sequence imaging |
Proposes SpineAgent to address the complexity of spine MRI report generation |
foundation model multimodal |
|
|
| 13 |
HDSL: A Hierarchical Domain-Specific Language for Structured 3D Indoor Scene Generation and Localized Editing with LLM Agents |
Proposes HDSL to address text-driven indoor scene generation and editing |
multimodal |
|
|
| 14 |
Counterfactual Reasoning for Fine-Grained Evidence Disentanglement in VideoQA |
Proposes the CREDiT framework to address causal reasoning in video question answering |
multimodal |
|
|
| 15 |
See More, Think Deeper: Query-Expanded Visual Evidence and Answer-Clue Guided Reflection for Long Video Understanding |
Proposes the CoVER framework to address evidence acquisition and feedback in long video understanding |
large language model |
|
|
| 16 |
Are Reasoning Vision-Language Models Robust to Semantic Visual Distractions? |
Proposes Distract-Bench to address the robustness of vision-language models to semantic distractions |
multimodal |
✅ |
|