| 1 |
Reflect to Inform: Boosting Multimodal Reasoning via Information-Gain-Driven Verification |
Proposes the Visual Re-Examination (VRE) framework, which improves the visual reasoning of multimodal LLMs and reduces hallucination. |
large language model multimodal |
✅ |
|
| 2 |
SALMUBench: A Benchmark for Sensitive Association-Level Multimodal Unlearning |
SALMUBench: a benchmark for sensitive association-level unlearning in multimodal models. |
multimodal |
|
|
| 3 |
Finding Distributed Object-Centric Properties in Self-Supervised Transformers |
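Proposes Object-DINO, a training-free method that extracts distributed object-centric properties from self-supervised ViTs, improving object discovery and multimodal alignment. |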
large language model multimodal visual grounding |
|
|
| 4 |
Bridging Pixels and Words: Mask-Aware Local Semantic Fusion for Multimodal Media Verification |
Proposes the MaLSF framework, which tackles multimodal media verification through mask-aware local semantic fusion. |
multimodal |
|
|
| 5 |
FairLLaVA: Fairness-Aware Parameter-Efficient Fine-Tuning for Large Vision-Language Assistants |
FairLLaVA: a fairness-aware parameter-efficient fine-tuning method for large vision-language models. |
large language model multimodal instruction following |
✅ |
|
| 6 |
MA-Bench: Towards Fine-grained Micro-Action Understanding |
Proposes the MA-Bench benchmark for evaluating multimodal large language models on fine-grained micro-action understanding. |
large language model multimodal |
|
|
| 7 |
Label-Free Cross-Task LoRA Merging with Null-Space Compression |
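Proposes a label-free cross-task LoRA merging method based on null-space compression, addressing the difficulty of merging heterogeneous tasks. |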
large language model foundation model |
|
|
| 8 |
TaxaAdapter: Vision Taxonomy Models are Key to Fine-grained Image Generation over the Tree of Life |
TaxaAdapter: uses vision taxonomy models for fine-grained image generation over the Tree of Life. |
large language model multimodal |
|
|
| 9 |
SkinGPT-X: A Self-Evolving Collaborative Multi-Agent System for Transparent and Trustworthy Dermatological Diagnosis |
SkinGPT-X: a self-evolving collaborative multi-agent system for transparent and trustworthy dermatological diagnosis. |
large language model multimodal |
|
|
| 10 |
Rethinking Token Pruning for Historical Screenshots in GUI Visual Agents: Semantic, Spatial, and Temporal Perspectives |
Proposes effective token pruning strategies to optimize how GUI visual agents process historical screenshots. |
large language model multimodal |
|
|
| 11 |
Make Geometry Matter for Spatial Reasoning |
Proposes the GeoSR framework, which strengthens the spatial reasoning of vision-language models in static and dynamic scenes. |
foundation model |
✅ |
|
| 12 |
Generation Is Compression: Zero-Shot Video Coding via Stochastic Rectified Flow |
Proposes GVC, a generative video codec that performs zero-shot video coding and improves compression efficiency. |
foundation model |
|
|
| 13 |
From Pen to Pixel: Translating Hand-Drawn Plots into Graphical APIs via a Novel Benchmark and Efficient Adapter |
Proposes the HDpy-13 dataset and Plot-Adapter, improving graphical API recommendation from hand-drawn plots. |
large language model |
|
|
| 14 |
HINT: Composed Image Retrieval with Dual-path Compositional Contextualized Network |
Proposes HINT, a dual-path compositional contextualized network that improves matching discrimination in composed image retrieval. |
multimodal |
✅ |
|
| 15 |
Towards GUI Agents: Vision-Language Diffusion Models for GUI Grounding |
Proposes a diffusion-model-based GUI agent that improves target grounding and interaction in GUI environments. |
multimodal |
|
|
| 16 |
ComVi: Context-Aware Optimized Comment Display in Video Playback |
ComVi: a context-aware system that optimizes comment display during video playback to improve user immersion. |
TAMP |
|
|