| 1 |
EE-MLLM: A Data-Efficient and Compute-Efficient Multimodal Large Language Model |
Proposes EE-MLLM, a data-efficient and compute-efficient multimodal large language model built on a composite attention mechanism |
large language model multimodal |
|
|
| 2 |
CaRDiff: Video Salient Object Ranking Chain of Thought Reasoning for Saliency Prediction with Diffusion |
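Proposes the CaRDiff framework, which improves video saliency prediction by combining chain-of-thought reasoning over ranked salient video objects with a diffusion model. |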
large language model multimodal chain-of-thought |
|
|
| 3 |
UniFashion: A Unified Vision-Language Model for Multimodal Fashion Retrieval and Generation |
UniFashion: a unified vision-language model for multimodal fashion retrieval and generation |
large language model multimodal |
✅ |
|
| 4 |
GRAB: A Challenging GRaph Analysis Benchmark for Large Multimodal Models |
GRAB: a challenging benchmark for evaluating the graph analysis capabilities of large multimodal models |
multimodal |
|
|
| 5 |
Video-to-Text Pedestrian Monitoring (VTPM): Leveraging Computer Vision and Large Language Models for Privacy-Preserve Pedestrian Activity Monitoring at Intersections |
Proposes VTPM, which uses computer vision and LLMs for privacy-preserving monitoring of pedestrian activity at intersections. |
large language model |
|
|
| 6 |
MCDubber: Multimodal Context-Aware Expressive Video Dubbing |
MCDubber: proposes a multimodal context-aware video dubbing model that improves dubbing expressiveness |
multimodal |
✅ |
|
| 7 |
Semi-supervised 3D Semantic Scene Completion with 2D Vision Foundation Model Guidance |
Proposes a semi-supervised 3D semantic scene completion framework guided by 2D vision foundation models. |
foundation model |
|
|
| 8 |
TWLV-I: Analysis and Insights from Holistic Evaluation on Video Foundation Models |
TWLV-I: improves appearance and motion understanding through holistic evaluation of video foundation models |
foundation model |
✅ |
|
| 9 |
SEA: Supervised Embedding Alignment for Token-Level Visual-Textual Integration in MLLMs |
Proposes SEA, a supervised embedding alignment method for token-level visual-textual alignment in MLLMs |
large language model multimodal |
|
|
| 10 |
OE3DIS: Open-Ended 3D Point Cloud Instance Segmentation |
Proposes OE3DIS, which addresses open-ended 3D point cloud instance segmentation without predefined class names |
large language model multimodal |
|
|
| 11 |
EMO-LLaMA: Enhancing Facial Emotion Understanding with Instruction Tuning |
EMO-LLaMA: uses instruction tuning to strengthen multimodal large language models' understanding of facial expressions |
large language model multimodal |
✅ |
|
| 12 |
Image Score: Learning and Evaluating Human Preferences for Mercari Search |
Uses LLMs and chain-of-thought (CoT) reasoning to learn and evaluate image quality preferences for the Mercari e-commerce platform |
large language model chain-of-thought |
|
|
| 13 |
MSCPT: Few-shot Whole Slide Image Classification with Multi-scale and Context-focused Prompt Tuning |
Proposes MSCPT, which uses multi-scale, context-focused prompt tuning for few-shot classification of pathology whole slide images |
large language model |
✅ |
|
| 14 |
T2VIndexer: A Generative Video Indexer for Efficient Text-Video Retrieval |
Proposes T2VIndexer, a generative video indexer for efficient text-video retrieval. |
multimodal |
✅ |
|
| 15 |
EAGLE: Elevating Geometric Reasoning through LLM-empowered Visual Instruction Tuning |
EAGLE: improves geometric reasoning through LLM-empowered visual instruction tuning |
large language model |
|
|