| 1 | ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing | Proposes the ThinkSound framework to address the challenge of high-fidelity audio generation from video | large language model, foundation model, multimodal | | |
| 2 | DRISHTIKON: Visual Grounding at Multiple Granularities in Documents | Proposes DRISHTIKON to address visual grounding in document images | large language model, visual grounding | | |
| 3 | Multimodal Prompt Alignment for Facial Expression Recognition | Proposes a multimodal prompt alignment framework to improve facial expression recognition accuracy | large language model, multimodal | | |
| 4 | Bridging Video Quality Scoring and Justification via Large Multimodal Models | Proposes a SIG-based multimodal model to improve video quality scoring and justification | multimodal, chain-of-thought | | |
| 5 | SiM3D: Single-instance Multiview Multimodal and Multisetup 3D Anomaly Detection Benchmark | Proposes SiM3D to address single-instance, multiview, multimodal 3D anomaly detection | multimodal | | |
| 6 | Benchmarking Deep Learning and Vision Foundation Models for Atypical vs. Normal Mitosis Classification with Cross-Dataset Evaluation | A benchmark study of deep-learning-based atypical mitosis classification | foundation model | ✅ | |
| 7 | SimVecVis: A Dataset for Enhancing MLLMs in Visualization Understanding | Proposes SimVec to address the challenges multimodal large language models face in visualization understanding | large language model, multimodal, chain-of-thought | ✅ | |
| 8 | SAMURAI: Shape-Aware Multimodal Retrieval for 3D Object Identification | Proposes SAMURAI to address 3D object retrieval in complex indoor environments | multimodal | | |
| 9 | LASFNet: A Lightweight Attention-Guided Self-Modulation Feature Fusion Network for Multimodal Object Detection | Proposes LASFNet to simplify feature fusion in multimodal object detection | multimodal | ✅ | |
| 10 | FOCUS: Internal MLLM Representations for Efficient Fine-Grained Visual Question Answering | Proposes FOCUS to address visual cropping in fine-grained visual question answering | large language model, multimodal | | |
| 11 | Exploring the Design Space of 3D MLLMs for CT Report Generation | Proposes 3D multimodal large language models to improve CT report generation | large language model, multimodal | ✅ | |
| 12 | LLaVA-Pose: Enhancing Human Pose and Action Understanding via Keypoint-Integrated Instruction Tuning | Proposes LLaVA-Pose to address human pose and action understanding | multimodal, instruction following | ✅ | |
| 13 | GroundFlow: A Plug-in Module for Temporal Reasoning on 3D Point Cloud Sequential Grounding | Proposes the GroundFlow module to address temporal reasoning in 3D point cloud sequential grounding | large language model, visual grounding | | |
| 14 | Task-Aware KV Compression For Cost-Effective Long Video Understanding | Proposes Video-X^2L to address KV compression in long video understanding | large language model, multimodal | | |
| 15 | OracleFusion: Assisting the Decipherment of Oracle Bone Script with Structurally Constrained Semantic Typography | Proposes OracleFusion to address the difficulty of deciphering oracle bone script characters | large language model, multimodal | | |
| 16 | Evidence-based diagnostic reasoning with multi-agent copilot for human pathology | Proposes PathChat+ to address insufficient diagnostic reasoning in pathology | large language model, multimodal | | |
| 17 | Global and Local Entailment Learning for Natural World Imagery | Proposes Radial Cross-Modal Embeddings to address reasoning in vision-language models | foundation model | ✅ | |
| 18 | ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models | Proposes ShotBench to address insufficient understanding of cinematic language | multimodal | | |
| 19 | FaSTA$^*$: Fast-Slow Toolpath Agent with Subroutine Mining for Efficient Multi-turn Image Editing | Proposes FaSTA$^*$ to address efficient multi-turn image editing | large language model | | |