| 1 |
IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs |
Proposes IV-Bench, a benchmark that evaluates multimodal LLMs on image-grounded video perception and reasoning |
large language model multimodal |
✅ |
|
| 2 |
Event2Vec: Processing Neuromorphic Events directly by Representations in Vector Space |
Proposes Event2Vec, which processes neuromorphic event data directly through vector-space representations |
large language model multimodal |
✅ |
|
| 3 |
LongPerceptualThoughts: Distilling System-2 Reasoning for System-1 Perception |
Proposes the LongPerceptualThoughts dataset to strengthen System-2-style reasoning in visual perception tasks |
chain-of-thought |
|
|
| 4 |
Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models |
Eagle 2.5: improves the performance of frontier vision-language models through long-context post-training |
multimodal |
|
|
| 5 |
Zero-Shot, But at What Cost? Unveiling the Hidden Overhead of MILS's LLM-CLIP Framework for Image Captioning |
Reveals the hidden cost of the MILS image captioning framework: zero-shot performance bought with high computational overhead |
multimodal |
|
|
| 6 |
Cognitive-Inspired Hierarchical Attention Fusion With Visual and Textual for Cross-Domain Sequential Recommendation |
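Proposes the HAF-VT model, which fuses visual and textual information to address user-interest modeling in cross-domain sequential recommendation |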
multimodal |
|
|
| 7 |
ScanEdit: Hierarchically-Guided Functional 3D Scan Editing |
ScanEdit: proposes a hierarchically guided method for functional 3D scan editing that enables instruction-driven scene editing |
large language model |
|
|
| 8 |
Insert Anything: Image Insertion via In-Context Editing in DiT |
Proposes the Insert Anything framework, which seamlessly inserts reference images through in-context editing in DiT |
multimodal |
|
|
| 9 |
Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation |
Proposes the FG-BMK benchmark for comprehensively evaluating large vision-language models on fine-grained image tasks |
multimodal |
✅ |
|
| 10 |
DyFo: A Training-Free Dynamic Focus Visual Search for Enhancing LMMs in Fine-Grained Visual Understanding |
DyFo: a training-free dynamic-focus visual search that improves the fine-grained visual understanding of LMMs |
multimodal |
✅ |
|
| 11 |
Object-Level Verbalized Confidence Calibration in Vision-Language Models via Semantic Perturbation |
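Proposes a confidence calibration framework based on semantic perturbation that improves the reliability of object-level confidence in vision-language models |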
multimodal |
|
|