| 1 |
QVLA: Not All Channels Are Equal in Vision-Language-Action Model's Quantization |
QVLA: an action-sensitive channel-wise quantization framework for VLA models, aimed at embodied control |
vision-language-action VLA large language model |
|
|
| 2 |
MM-SCALE: Grounded Multimodal Moral Reasoning via Scalar Judgment and Listwise Alignment |
Proposes the MM-SCALE dataset, which improves multimodal moral reasoning through scalar judgment and listwise alignment |
multimodal |
|
|
| 3 |
Quasi-multimodal-based pathophysiological feature learning for retinal disease diagnosis |
Proposes a quasi-multimodal pathophysiological feature learning framework for retinal disease diagnosis. |
multimodal |
|
|
| 4 |
Seeing Through the Chain: Mitigate Hallucination in Multimodal Reasoning Models via CoT Compression and Contrastive Preference Optimization |
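Proposes the C3PO framework, which mitigates hallucination in multimodal reasoning models through CoT compression and contrastive preference optimization. |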
multimodal |
|
|
| 5 |
Z3D: Zero-Shot 3D Visual Grounding from Images |
Proposes Z3D, which addresses zero-shot 3D visual grounding using only multi-view images |
visual grounding |
✅ |
|
| 6 |
Full end-to-end diagnostic workflow automation of 3D OCT via foundation model-driven AI for retinal diseases |
Proposes FOCUS, a framework built on a vision foundation model that automates the full diagnostic workflow for retinal diseases from 3D OCT |
foundation model |
|
|
| 7 |
FSOD-VFM: Few-Shot Object Detection with Vision Foundation Models and Graph Diffusion |
FSOD-VFM: few-shot object detection using vision foundation models and graph diffusion |
foundation model |
✅ |
|
| 8 |
FinMTM: A Multi-Turn Multimodal Benchmark for Financial Reasoning and Agent Evaluation |
Proposes FinMTM, a multi-turn multimodal benchmark for financial reasoning and agent evaluation |
multimodal |
|
|
| 9 |
A generalizable large-scale foundation model for musculoskeletal radiographs |
SKELEX: a generalizable large-scale foundation model for musculoskeletal radiographs |
foundation model |
|
|
| 10 |
VOILA: Value-of-Information Guided Fidelity Selection for Cost-Aware Multimodal Question Answering |
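Proposes the VOILA framework, which uses value-of-information guided fidelity selection in multimodal question answering to optimize resource-constrained settings. |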
multimodal |
|
|
| 11 |
PnP-U3D: Plug-and-Play 3D Framework Bridging Autoregression and Diffusion for Unified Understanding and Generation |
Proposes the PnP-U3D framework, which combines autoregressive and diffusion models to unify 3D understanding and generation tasks. |
large language model multimodal |
|
|
| 12 |
Refer-Agent: A Collaborative Multi-Agent System with Reasoning and Reflection for Referring Video Object Segmentation |
Proposes Refer-Agent to address reasoning and reflection in video object segmentation |
large language model |
|
|
| 13 |
SlowFocus: Enhancing Fine-grained Temporal Understanding in Video LLM |
Proposes the SlowFocus mechanism, which enhances a video LLM's understanding of fine-grained temporal information |
large language model |
|
|
| 14 |
Interpretable Logical Anomaly Classification via Constraint Decomposition and Instruction Fine-Tuning |
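Proposes the LogiCls framework, which achieves interpretable logical anomaly classification for industrial images through constraint decomposition and instruction fine-tuning. |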
chain-of-thought |
|
|
| 15 |
MUSE: A Multi-agent Framework for Unconstrained Story Envisioning via Closed-Loop Cognitive Orchestration |
MUSE: a multi-agent framework with closed-loop cognitive orchestration for unconstrained story envisioning |
multimodal |
|
|