| 1 |
Can Large Multimodal Models Inspect Buildings? A Hierarchical Benchmark for Structural Pathology Reasoning |
Proposes DefectBench, a benchmark for evaluating the ability of large models to reason about structural pathologies in buildings |
foundation model multimodal |
|
|
| 2 |
Enhancing Alignment for Unified Multimodal Models via Semantically-Grounded Supervision |
Proposes the Semantically-Grounded Supervision (SeGroS) framework to improve alignment in unified multimodal models. |
multimodal visual grounding |
|
|
| 3 |
FlowScene: Style-Consistent Indoor Scene Generation with Multimodal Graph Rectified Flow |
FlowScene: proposes a multimodal graph rectified flow model for style-consistent indoor scene generation. |
multimodal language conditioned |
|
|
| 4 |
MedSPOT: A Workflow-Aware Sequential Grounding Benchmark for Clinical GUI |
MedSPOT: a sequential visual grounding benchmark for clinical GUI workflows |
large language model multimodal visual grounding |
✅ |
|
| 5 |
Evaluating Vision Foundation Models for Pixel and Object Classification in Microscopy |
Evaluates the potential of vision foundation models for pixel and object classification in microscopy |
foundation model |
|
|
| 6 |
Template-based Object Detection Using a Foundation Model |
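Proposes a template-matching object detection method built on a segmentation foundation model; it requires no training and can be applied to automated GUI testing. |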
foundation model |
|
|
| 7 |
FREAK: A Fine-grained Hallucination Evaluation Benchmark for Advanced MLLMs |
FREAK: a fine-grained hallucination evaluation benchmark for advanced multimodal large language models |
large language model multimodal chain-of-thought |
|
|
| 8 |
Adapting a Pre-trained Single-Cell Foundation Model to Spatial Gene Expression Generation from Histology Images |
HINGE: generates spatial gene expression from histology images by effectively leveraging a pre-trained single-cell model. |
foundation model |
|
|
| 9 |
Unbiased Dynamic Multimodal Fusion |
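Proposes an unbiased dynamic multimodal learning framework that addresses modality quality assessment and dependency bias in dynamic scenarios. |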
multimodal |
✅ |
|
| 10 |
Disentangle-then-Align: Non-Iterative Hybrid Multimodal Image Registration via Cross-Scale Feature Disentanglement |
Proposes HRNet, which performs non-iterative hybrid multimodal image registration through disentanglement and alignment |
multimodal |
|
|
| 11 |
LumosX: Relate Any Identities with Their Attributes for Personalized Video Generation |
LumosX: personalized video generation by relating identities with their attributes |
large language model multimodal |
✅ |
|
| 12 |
Detached Skip-Links and $R$-Probe: Decoupling Feature Aggregation from Gradient Propagation for MLLM OCR |
Proposes detached skip-links and R-Probe to improve the fine-grained recognition ability of MLLMs on OCR tasks |
large language model multimodal |
|
|
| 13 |
MedQ-Engine: A Closed-Loop Data Engine for Evolving MLLMs in Medical Image Quality Assessment |
MedQ-Engine: a closed-loop data engine for evolving MLLMs in medical image quality assessment |
large language model multimodal |
|
|
| 14 |
One Model, Two Minds: Task-Conditioned Reasoning for Unified Image Quality and Aesthetic Assessment |
Proposes the TATAR framework, which unifies image quality and aesthetic assessment through task-conditioned reasoning |
large language model multimodal |
✅ |
|
| 15 |
Evaluating Image Editing with LLMs: A Comprehensive Benchmark and Intermediate-Layer Probing Approach |
Proposes the TIEdit benchmark and the EditProbe evaluator to make the evaluation of text-guided image editing more reliable |
large language model multimodal |
|
|
| 16 |
CurveStream: Boosting Streaming Video Understanding in MLLMs via Curvature-Aware Hierarchical Visual Memory Management |
CurveStream: proposes curvature-aware hierarchical visual memory management to improve MLLM performance on streaming video understanding. |
large language model multimodal |
✅ |
|
| 17 |
CoVR-R: Reason-Aware Composed Video Retrieval |
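Proposes CoVR-R, a reasoning-based composed video retrieval method that addresses the failure of existing methods to account for the after-effects of edits. |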
multimodal |
✅ |
|
| 18 |
TSegAgent: Zero-Shot Tooth Segmentation via Geometry-Aware Vision-Language Agents |
TSegAgent: zero-shot tooth segmentation based on geometry-aware vision-language agents |
foundation model |
|
|