| 1 |
MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning |
Proposes MINT-CoT to address the integration of visual signals in multimodal mathematical reasoning |
large language model multimodal chain-of-thought |
✅ |
|
| 2 |
From Objects to Anywhere: A Holistic Benchmark for Multi-level Visual Grounding in 3D Scenes |
Proposes Anywhere3D-Bench to address multi-level visual grounding in 3D scenes |
large language model multimodal visual grounding |
|
|
| 3 |
Interpretable Multimodal Framework for Human-Centered Street Assessment: Integrating Visual-Language Models for Perceptual Urban Diagnostics |
Proposes a multimodal street assessment framework to address the lack of subjective perception in urban design |
large language model multimodal |
|
|
| 4 |
Unfolding Spatial Cognition: Evaluating Multimodal Models on Visual Simulations |
Proposes the STARE benchmark to evaluate the spatial cognition of multimodal models on visual simulations |
large language model multimodal |
|
|
| 5 |
When Semantics Mislead Vision: Mitigating Large Multimodal Models Hallucinations in Scene Text Spotting and Understanding |
Proposes ZoomText and Grounded Layer Correction to mitigate semantic hallucinations in scene text understanding |
multimodal |
|
|
| 6 |
MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning |
Proposes MORSE-500 to address the shortcomings of existing multimodal reasoning benchmarks |
multimodal |
|
|
| 7 |
VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos |
Proposes VideoMathQA to address mathematical reasoning in videos |
multimodal |
✅ |
|
| 8 |
Can Foundation Models Generalise the Presentation Attack Detection Capabilities on ID Cards? |
Uses foundation models to improve presentation attack detection on ID cards |
foundation model |
|
|
| 9 |
MokA: Multimodal Low-Rank Adaptation for MLLMs |
Proposes MokA to address the adaptation of multimodal large language models |
multimodal |
✅ |
|
| 10 |
Single GPU Task Adaptation of Pathology Foundation Models for Whole Slide Image Analysis |
Proposes TAPFM to address the adaptation of pathology foundation models for whole slide image analysis |
foundation model |
|
|
| 11 |
PixCell: A generative foundation model for digital histopathology images |
Proposes PixCell to address the generation of digital pathology images |
foundation model |
|
|
| 12 |
Deep histological synthesis from mass spectrometry imaging for multimodal registration |
Proposes pix2pix-based histological image synthesis to address multimodal registration |
multimodal |
✅ |
|
| 13 |
BYO-Eval: Build Your Own Dataset for Fine-Grained Visual Assessment of Multimodal Language Models |
Proposes BYO-Eval to address the evaluation of multimodal language models |
multimodal |
✅ |
|
| 14 |
OpenMaskDINO3D : Reasoning 3D Segmentation via Large Language Model |
Proposes OpenMaskDINO3D to address reasoning-based 3D segmentation |
large language model |
|
|
| 15 |
SparseMM: Head Sparsity Emerges from Visual Concept Responses in MLLMs |
Proposes SparseMM to improve the visual understanding efficiency of multimodal large language models |
large language model multimodal |
✅ |
|
| 16 |
Towards Vision-Language-Garment Models for Web Knowledge Garment Understanding and Generation |
Proposes the VLG model to address knowledge transfer in garment generation |
foundation model multimodal |
|
|
| 17 |
Quantifying Cross-Modality Memorization in Vision-Language Models |
Quantifies cross-modality memorization in vision-language models to improve knowledge transfer |
large language model multimodal |
|
|
| 18 |
A Survey on Vietnamese Document Analysis and Recognition: Challenges and Future Directions |
Surveys Vietnamese document analysis and recognition techniques to address their unique challenges |
large language model multimodal |
|
|
| 19 |
TextVidBench: A Benchmark for Long Video Scene Text Understanding |
Proposes TextVidBench to address scene text understanding in long videos |
large language model multimodal |
|
|
| 20 |
APVR: Hour-Level Long Video Understanding with Adaptive Pivot Visual Information Retrieval |
Proposes APVR to address information retrieval in long video understanding |
large language model multimodal |
|
|
| 21 |
Refer to Any Segmentation Mask Group With Vision-Language Prompts |
Proposes omnimodal referring expression segmentation to address insufficient vision-language interaction |
multimodal |
|
|
| 22 |
Perceive Anything: Recognize, Explain, Caption, and Segment Anything in Images and Videos |
Proposes the Perceive Anything model to address region-level understanding of images and videos |
large language model |
|
|
| 23 |
MonkeyOCR: Document Parsing with a Structure-Recognition-Relation Triplet Paradigm |
Proposes MonkeyOCR to address the efficiency and accuracy of document parsing |
multimodal |
✅ |
|
| 24 |
SeedEdit 3.0: Fast and High-Quality Generative Image Editing |
Proposes SeedEdit 3.0 to address high-quality image editing |
instruction following |
|
|
| 25 |
FlowDirector: Training-Free Flow Steering for Precise Text-to-Video Editing |
Proposes FlowDirector to address the inversion process problem in video editing |
instruction following |
|
|
| 26 |
LLMs Can Compensate for Deficiencies in Visual Representations |
Proposes a vision-language model to compensate for deficiencies in visual representations |
multimodal |
|
|
| 27 |
Towards Holistic Visual Quality Assessment of AI-Generated Videos: A LLM-Based Multi-Dimensional Evaluation Model |
Proposes a multi-dimensional evaluation model to address the visual quality assessment of AI-generated videos |
large language model |
✅ |
|
| 28 |
Line of Sight: On Linear Representations in VLLMs |
Proposes a multimodal sparse autoencoder to enhance the image representation capability of VLLMs |
multimodal |
|
|
| 29 |
HoliSafe: Holistic Safety Benchmarking and Modeling for Vision-Language Model |
Proposes HoliSafe to address insufficient safety in vision-language models |
multimodal |
|
|