| 1 |
ENC-Bench: A Benchmark for Evaluating Multimodal Large Language Models in Electronic Navigational Chart Understanding |
Proposes ENC-Bench, a benchmark for evaluating multimodal large language models on electronic navigational chart understanding. |
large language model multimodal symbolic grounding |
|
|
| 2 |
YOLOv10 with Kolmogorov-Arnold networks and vision-language foundation models for interpretable object detection and trustworthy multimodal AI in computer vision perception |
Proposes a YOLOv10 variant built on Kolmogorov-Arnold networks and vision-language models for interpretable object detection and trustworthy multimodal AI |
foundation model multimodal |
|
|
| 3 |
MLLM-HWSI: A Multimodal Large Language Model for Hierarchical Whole Slide Image Understanding |
MLLM-HWSI: a multimodal large language model for hierarchical whole slide image understanding |
large language model multimodal |
✅ |
|
| 4 |
ForestPrune: High-ratio Visual Token Compression for Video Multimodal Large Language Models via Spatial-Temporal Forest Modeling |
ForestPrune: high-ratio visual token compression for video multimodal large language models via spatial-temporal forest modeling |
large language model multimodal |
|
|
| 5 |
SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning |
SpecEyes: accelerating agentic multimodal LLMs via speculative perception and planning |
large language model multimodal |
|
|
| 6 |
GeoTikzBridge: Advancing Multimodal Code Generation for Geometric Perception and Reasoning |
GeoTikzBridge: strengthening the geometric perception and reasoning of multimodal large models through TikZ code generation |
large language model multimodal |
✅ |
|
| 7 |
3DCity-LLM: Empowering Multi-modality Large Language Models for 3D City-scale Perception and Understanding |
Proposes 3DCity-LLM, which empowers multimodal large language models for 3D city-scale perception and understanding |
large language model |
✅ |
|
| 8 |
Multimodal Industrial Anomaly Detection via Geometric Prior |
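Proposes a multimodal industrial anomaly detection network based on a geometric prior, improving detection accuracy on defects with complex geometry. |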
multimodal |
|
|
| 9 |
UniFunc3D: Unified Active Spatial-Temporal Grounding for 3D Functionality Segmentation |
UniFunc3D: a unified active spatial-temporal grounding framework for 3D functionality segmentation |
large language model multimodal |
|
|
| 10 |
ViKey: Enhancing Temporal Understanding in Videos via Visual Prompting |
ViKey: enhancing the temporal understanding of video large language models via visual prompting |
large language model multimodal |
|
|
| 11 |
SMSP: A Plug-and-Play Strategy of Multi-Scale Perception for MLLMs to Perceive Visual Illusions |
Proposes SMSP, a multi-scale perception strategy that improves the ability of MLLMs to recognize visual illusions. |
large language model multimodal |
✅ |
|
| 12 |
Cog3DMap: Multi-View Vision-Language Reasoning with 3D Cognitive Maps |
Cog3DMap: multi-view vision-language reasoning with 3D cognitive maps |
large language model multimodal |
|
|
| 13 |
ForeSea: AI Forensic Search with Multi-modal Queries for Video Surveillance |
ForeSea: an AI forensic search system with multimodal queries for video surveillance |
large language model multimodal |
|
|
| 14 |
Focus, Don't Prune: Identifying Instruction-Relevant Regions for Information-Rich Image Understanding |
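PinPoint: focuses rather than prunes, identifying instruction-relevant regions in information-dense images to improve vision-language model efficiency. |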
large language model multimodal |
|
|
| 15 |
Know3D: Prompting 3D Generation with Knowledge from Vision-Language Models |
Know3D: prompts 3D generation with knowledge from vision-language models, enabling controllable back-view generation. |
large language model multimodal |
|
|
| 16 |
OccAny: Generalized Unconstrained Urban 3D Occupancy |
OccAny: the first generalized, unconstrained urban 3D occupancy prediction model, improving generalization and geometric completion. |
foundation model |
✅ |
|
| 17 |
DetPO: In-Context Learning with Multi-Modal LLMs for Few-Shot Object Detection |
DetPO: uses in-context learning with multimodal LLMs for few-shot object detection, improving generalization. |
visual grounding |
✅ |
|
| 18 |
AgentFoX: LLM Agent-Guided Fusion with eXplainability for AI-Generated Image Detection |
AgentFoX: an LLM agent-guided fusion framework with explainability for AI-generated image detection |
large language model |
|
|
| 19 |
SOUPLE: Enhancing Audio-Visual Localization and Segmentation with Learnable Prompt Contexts |
SOUPLE: enhancing audio-visual localization and segmentation with learnable prompt contexts |
multimodal |
|
|
| 20 |
Think 360°: Evaluating the Width-centric Reasoning Capability of MLLMs Beyond Depth |
Proposes the Think 360° benchmark, which evaluates the reasoning width of multimodal large models. |
multimodal |
|
|