| 1 |
Grounding-Aware Token Pruning: Recovering from Drastic Performance Drops in Visual Grounding Caused by Pruning |
Proposes grounding-aware token pruning to address performance drops in visual grounding |
large language model multimodal visual grounding |
|
|
| 2 |
Can Video Large Multimodal Models Think Like Doubters-or Double-Down: A Study on Defeasible Video Entailment |
Proposes a defeasible video entailment task to improve the reasoning ability of video multimodal models |
large language model multimodal |
|
|
| 3 |
TaleForge: Interactive Multimodal System for Personalized Story Creation |
Proposes TaleForge to address insufficient engagement in personalized story creation |
large language model multimodal |
|
|
| 4 |
COOCO -- Common Objects Out-of-Context -- Semantic Violation in Scenes: Investigating Multimodal Context in Referential Communication |
Proposes the COOCO dataset to study the role of multimodal context in referential communication |
multimodal |
✅ |
|
| 5 |
RetFiner: A Vision-Language Refinement Scheme for Retinal Foundation Models |
Proposes RetFiner to address insufficient semantic understanding in retinal foundation models |
foundation model |
✅ |
|
| 6 |
Towards Scalable and Robust White Matter Lesion Localization via Multimodal Deep Learning |
Proposes a multimodal deep learning framework for white matter lesion localization |
multimodal |
|
|
| 7 |
TASeg: Text-aware RGB-T Semantic Segmentation based on Fine-tuning Vision Foundation Models |
Proposes the TASeg framework to address missing text information in RGB-T semantic segmentation |
foundation model |
|
|
| 8 |
SPAZER: Spatial-Semantic Progressive Reasoning Agent for Zero-shot 3D Visual Grounding |
Proposes SPAZER for zero-shot 3D visual grounding |
visual grounding |
|
|
| 9 |
Few-Shot Segmentation of Historical Maps via Linear Probing of Vision Foundation Models |
Proposes a few-shot historical map segmentation method based on linear probing |
foundation model |
✅ |
|
| 10 |
CAL-RAG: Retrieval-Augmented Multi-Agent Generation for Content-Aware Layout Design |
Proposes CAL-RAG for content-aware layout generation |
large language model multimodal |
|
|
| 11 |
Exploring Task-Solving Paradigm for Generalized Cross-Domain Face Anti-Spoofing via Reinforcement Fine-Tuning |
Proposes a reinforcement fine-tuning method for cross-domain face anti-spoofing to address generalization |
large language model multimodal |
|
|
| 12 |
LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs |
Proposes LLaVA-Scissor for token compression in video multimodal large language models |
large language model multimodal |
✅ |
|
| 13 |
GameTileNet: A Semantic Dataset for Low-Resolution Game Art in Procedural Content Generation |
Proposes GameTileNet to address low-resolution game art generation |
large language model |
|
|
| 14 |
Seamless Interaction: Dyadic Audiovisual Motion Modeling and Large-Scale Dataset |
Proposes a seamless interaction model to address the understanding of nonverbal signals in human-computer interaction |
multimodal |
|
|
| 15 |
Test-Time Consistency in Vision Language Models |
Proposes a test-time consistency framework to address inconsistency in vision-language models |
multimodal |
|
|
| 16 |
Rethinking Visual Token Reduction in LVLMs under Cross-modal Misalignment |
Proposes VisionDrop to address visual token redundancy in LVLMs |
large language model |
|
|
| 17 |
ProSAM: Enhancing the Robustness of SAM-based Visual Reference Segmentation with Probabilistic Prompts |
Proposes ProSAM to address the stability of SAM-based visual reference segmentation |
foundation model |
|
|