| 1 |
Fuel Gauge: Estimating Chain-of-Thought Length Ahead of Time in Large Multimodal Models |
Proposes Fuel Gauge, which predicts the chain-of-thought (CoT) length of large models ahead of time to optimize resource allocation. |
multimodal chain-of-thought |
|
|
| 2 |
GeoSense: Internalizing Geometric Necessity Perception for Multimodal Reasoning |
GeoSense: enhances multimodal reasoning through geometric necessity perception. |
large language model multimodal |
|
|
| 3 |
Med-DualLoRA: Local Adaptation of Foundation Models for 3D Cardiac MRI |
Proposes Med-DualLoRA to address the adaptation problem for 3D cardiac MRI. |
foundation model |
|
|
| 4 |
Beyond Sequential Distance: Inter-Modal Distance Invariant Position Encoding |
Proposes inter-modal distance-invariant position encoding (DIPE) to mitigate visual information decay in MLLMs in long-text scenarios. |
large language model multimodal visual grounding |
✅ |
|
| 5 |
UniCom: Unified Multimodal Modeling via Compressed Continuous Semantic Representations |
UniCom: achieves unified multimodal modeling via compressed continuous semantic representations. |
multimodal |
|
|
| 6 |
RandMark: On Random Watermarking of Visual Foundation Models |
RandMark: proposes an ownership verification method for visual foundation models based on random watermarking. |
foundation model |
|
|
| 7 |
Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation |
Evaluates the sensitivity of promptable foundation models to human prompts in musculoskeletal CT segmentation. |
foundation model |
✅ |
|
| 8 |
GroundCount: Grounding Vision-Language Models with Object Detection for Mitigating Counting Hallucinations |
GroundCount: augments vision-language models with object detection to mitigate counting hallucinations. |
symbolic grounding |
|
|
| 9 |
Taking Shortcuts for Categorical VQA Using Super Neurons |
Uses super neurons to accelerate categorical visual question answering. |
large language model |
|
|
| 10 |
How To Embed Matters: Evaluation of EO Embedding Design Choices |
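Systematically evaluates Earth observation (EO) embedding design choices to improve the performance and scalability of GeoFMs on remote sensing tasks. |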
foundation model |
|
|
| 11 |
Towards Cognitive Defect Analysis in Active Infrared Thermography with Vision-Text Cues |
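Proposes a vision-language-model-based framework for cognitive defect analysis in infrared thermography, achieving zero-shot defect detection without training data. |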
multimodal |
|
|
| 12 |
Fighting Hallucinations with Counterfactuals: Diffusion-Guided Perturbations for LVLM Hallucination Suppression |
Proposes CIPHER, which suppresses LVLM hallucinations through diffusion-guided adversarial perturbations. |
multimodal |
✅ |
|
| 13 |
Learning to Wander: Improving the Global Image Geolocation Ability of LMMs via Actionable Reasoning |
Proposes the GeoAoT framework, which improves the global image geolocation ability of LMMs through actionable reasoning. |
multimodal |
|
|