| 1 |
Chain-of-Thought Re-ranking for Image Retrieval Tasks |
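Proposes CoTRR, a chain-of-thought re-ranking method that improves the performance of multimodal large language models on image retrieval tasks. |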
large language model multimodal chain-of-thought |
✅ |
|
| 2 |
Walk and Read Less: Improving the Efficiency of Vision-and-Language Navigation via Tuning-Free Multimodal Token Pruning |
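Proposes Navigation-Aware Pruning (NAP), which improves the efficiency of vision-and-language navigation through tuning-free multimodal token pruning. |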
VLN large language model multimodal |
|
|
| 3 |
How Good are Foundation Models in Step-by-Step Embodied Reasoning? |
Proposes the FoMER benchmark for evaluating the step-by-step reasoning ability of foundation models in embodied environments. |
foundation model multimodal |
|
|
| 4 |
Unleashing the Potential of Multimodal LLMs for Zero-Shot Spatio-Temporal Video Grounding |
Uses multimodal LLMs for zero-shot spatio-temporal video grounding and proposes the DSTH and TAS strategies. |
large language model multimodal |
✅ |
|
| 5 |
From Pixels to Urban Policy-Intelligence: Recovering Legacy Effects of Redlining with a Multimodal LLM |
Uses a multimodal LLM to move from pixels to urban policy intelligence, recovering the legacy effects of redlining. |
large language model multimodal |
|
|
| 6 |
Two Web Toolkits for Multimodal Piano Performance Dataset Acquisition and Fingering Annotation |
Proposes two web toolkits for multimodal piano performance dataset acquisition and fingering annotation. |
multimodal |
|
|
| 7 |
Trade-offs in Cross-Domain Generalization of Foundation Model Fine-Tuned for Biometric Applications |
Studies the trade-off between generalization and over-specialization when CLIP is fine-tuned for biometric recognition tasks. |
foundation model |
|
|
| 8 |
Attention Lattice Adapter: Visual Explanation Generation for Visual Foundation Model |
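Proposes the Attention Lattice Adapter (ALA) and the Alternating Epoch Architecture (AEA) for generating visual explanations of visual foundation models. |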
foundation model |
|
|
| 9 |
V-SenseDrive: A Privacy-Preserving Road Video and In-Vehicle Sensor Fusion Framework for Road Safety & Driver Behaviour Modelling |
V-SenseDrive: a privacy-preserving framework that fuses road video and in-vehicle sensor data for road safety and driver behaviour modelling. |
multimodal |
|
|
| 10 |
ORCA: Agentic Reasoning For Hallucination and Adversarial Robustness in Vision-Language Models |
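Proposes the ORCA framework, which uses agentic reasoning to reduce hallucination and improve adversarial robustness in vision-language models. |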
multimodal |
|
|
| 11 |
ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data |
ScaleCUA: scaling open-source computer use agents with cross-platform data. |
foundation model |
✅ |
|
| 12 |
QuizRank: Picking Images by Quizzing VLMs |
QuizRank: ranks images by quizzing vision-language models, improving the quality of images that illustrate Wikipedia articles. |
large language model |
|
|
| 13 |
Seeing 3D Through 2D Lenses: 3D Few-Shot Class-Incremental Learning via Cross-Modal Geometric Rectification |
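Proposes the Cross-Modal Geometric Rectification (CMGR) framework to address geometric misalignment and texture bias in 3D few-shot class-incremental learning. |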
foundation model |
|
|
| 14 |
DACoN: DINO for Anime Paint Bucket Colorization with Any Number of Reference Images |
DACoN: automatic colorization of anime line art using DINO and any number of reference images. |
foundation model |
✅ |
|