| 1 | Evaluating Multimodal Large Language Models on Video Captioning via Monte Carlo Tree Search | Proposes the AutoCaption framework to address the evaluation of video captioning. | large language model, multimodal | ✅ | |
| 2 | EfficientVLA: Training-Free Acceleration and Compression for Vision-Language-Action Models | Proposes EfficientVLA to accelerate and compress VLA models. | vision-language-action, VLA | | |
| 3 | Kvasir-VQA-x1: A Multimodal Dataset for Medical Reasoning and Robust MedVQA in Gastrointestinal Endoscopy | Proposes Kvasir-VQA-x1 to address the shortage of medical visual question answering datasets. | large language model, multimodal | ✅ | |
| 4 | OctoNav: Towards Generalist Embodied Navigation | Proposes OctoNav to unify multimodal navigation tasks. | embodied AI, VLA, VLN | | |
| 5 | AnimateAnyMesh: A Feed-Forward 4D Foundation Model for Text-Driven Universal Mesh Animation | Proposes AnimateAnyMesh to address high-quality animation generation for 3D models. | foundation model | | |
| 6 | Class Similarity-Based Multimodal Classification under Heterogeneous Category Sets | Proposes a class-similarity-based multimodal classification method to handle heterogeneous category sets. | multimodal | | |
| 7 | Prompt-Guided Latent Diffusion with Predictive Class Conditioning for 3D Prostate MRI Generation | Proposes CCELLA to address the scarcity of medical imaging data. | large language model, foundation model | ✅ | |
| 8 | HSENet: Hybrid Spatial Encoding Network for 3D Medical Vision-Language Understanding | Proposes HSENet to address language-vision fusion in 3D medical image understanding. | large language model, multimodal | ✅ | |
| 9 | Revisit What You See: Disclose Language Prior in Vision Tokens for LVLM Decoding | Proposes ReVisiT to address the insufficient use of visual information in LVLM decoding. | multimodal, visual grounding | | |
| 10 | Digitization of Document and Information Extraction using OCR | Proposes a framework combining OCR with large language models to improve the accuracy of document information extraction. | large language model | | |
| 11 | DreamCS: Geometry-Aware Text-to-3D Generation with Unpaired 3D Reward Supervision | Proposes DreamCS to address geometric bias in text-to-3D generation. | large language model | | |
| 12 | Q-SAM2: Accurate Quantization for Segment Anything Model 2 | Proposes Q-SAM2 to address quantization of the SAM2 model for resource-constrained devices. | foundation model | | |
| 13 | LLM-to-Phy3D: Physically Conform Online 3D Object Generation with LLMs | Proposes LLM-to-Phy3D to address 3D object generation under physical constraints. | large language model | | |