| # | Title | Summary | Keywords | |
| --- | --- | --- | --- | --- |
| 1 | Visual Funnel: Resolving Contextual Blindness in Multimodal Large Language Models | Proposes Visual Funnel to resolve contextual blindness in multimodal large language models. | large language model, multimodal | |
| 2 | VGent: Visual Grounding via Modular Design for Disentangling Reasoning and Prediction | Proposes VGent, achieving efficient visual grounding through a modular design that disentangles reasoning from prediction. | large language model, multimodal, visual grounding | |
| 3 | Efficient-VLN: A Training-Efficient Vision-Language Navigation Model | Efficient-VLN: a training-efficient vision-language navigation model that significantly reduces training cost. | VLN, large language model, multimodal | |
| 4 | BabyVLM-V2: Toward Developmentally Grounded Pretraining and Benchmarking of Vision Foundation Models | BabyVLM-V2: a developmentally grounded pretraining and benchmarking framework for vision foundation models. | foundation model, multimodal | |
| 5 | Blink: Dynamic Visual Token Resolution for Enhanced Multimodal Understanding | Blink: a dynamic visual token resolution method for enhanced multimodal understanding. | large language model, multimodal | |
| 6 | Information-driven Fusion of Pathology Foundation Models for Enhanced Disease Characterization | Proposes an information-driven fusion of pathology foundation models to enhance disease characterization. | foundation model | |
| 7 | DuetSVG: Unified Multimodal SVG Generation with Internal Visual Guidance | DuetSVG: a unified multimodal SVG generation model that improves generation quality via internal visual guidance. | multimodal | |
| 8 | SoccerMaster: A Vision Foundation Model for Soccer Understanding | Proposes SoccerMaster, a vision foundation model for soccer that unifies multiple soccer-understanding tasks. | foundation model | |
| 9 | MultiHateLoc: Towards Temporal Localisation of Multimodal Hate Content in Online Videos | Proposes MultiHateLoc, a framework for weakly supervised temporal localisation of multimodal hate content in online videos. | multimodal | |
| 10 | EchoingPixels: Cross-Modal Adaptive Token Reduction for Efficient Audio-Visual LLMs | Proposes EchoingPixels, which improves audio-visual LLM efficiency through cross-modal adaptive token reduction. | large language model, multimodal | |
| 11 | Vision-Language Models for Infrared Industrial Sensing in Additive Manufacturing Scene Description | VLM-IRIS: a zero-shot vision-language framework for infrared industrial sensing in additive manufacturing. | foundation model | |
| 12 | Image Tiling for High-Resolution Reasoning: Balancing Local Detail with Global Context | Reproduces and analyzes tiling-based high-resolution vision-language models, examining the influence of global context. | multimodal | |
| 13 | AlcheMinT: Fine-grained Temporal Control for Multi-Reference Consistent Video Generation | AlcheMinT: a fine-grained temporal control method for multi-reference consistent video generation. | TAMP | ✅ |
| 14 | FoundationMotion: Auto-Labeling and Reasoning about Spatial Movement in Videos | FoundationMotion: an auto-labeling and reasoning framework that improves understanding of spatial movement in videos. | large language model | |
| 15 | MMSI-Video-Bench: A Holistic Benchmark for Video-Based Spatial Intelligence | MMSI-Video-Bench: a benchmark for evaluating the video-based spatial intelligence of multimodal large models. | chain-of-thought | |
| 16 | Leveraging Text Guidance for Enhancing Demographic Fairness in Gender Classification | Proposes a text-guided method to improve the demographic fairness of facial gender classification. | multimodal | |
| 17 | PoseGAM: Robust Unseen Object Pose Estimation via Geometry-Aware Multi-View Reasoning | PoseGAM: robust unseen object pose estimation via geometry-aware multi-view reasoning. | foundation model | ✅ |
| 18 | CoSPlan: Corrective Sequential Planning via Scene Graph Incremental Updates | Proposes CoSPlan, a corrective sequential planning method based on incremental scene graph updates, improving VLM reasoning on complex tasks. | chain-of-thought | ✅ |