| 1 |
Towards a Multimodal Large Language Model with Pixel-Level Insight for Biomedicine |
MedPLIB: a multimodal large language model with pixel-level understanding for biomedicine |
large language model multimodal |
✅ |
|
| 2 |
EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM |
EasyRef: uses a multimodal LLM to enable generalized multi-reference image generation for diffusion models |
large language model multimodal instruction following |
|
|
| 3 |
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions |
Proposes InternLM-XComposer2.5-OmniLive, a multimodal system for long-term streaming video and audio interaction |
large language model foundation model multimodal |
|
|
| 4 |
Exemplar Masking for Multimodal Incremental Learning |
Proposes the Exemplar Masking framework to address storage and computation bottlenecks in multimodal incremental learning |
large language model multimodal |
✅ |
|
| 5 |
Agtech Framework for Cranberry-Ripening Analysis Using Vision Foundation Models |
Proposes a cranberry-ripening analysis framework based on vision foundation models for precision agriculture |
foundation model |
|
|
| 6 |
V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding |
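Proposes V2PE, which improves the multimodal long-context capability of vision-language models through variable visual position encoding |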
multimodal |
|
|
| 7 |
Multimodal Music Generation with Explicit Bridges and Retrieval Augmentation |
Proposes the VMB framework, which uses explicit bridges and retrieval augmentation for high-quality multimodal music generation |
multimodal |
✅ |
|
| 8 |
MaskTerial: A Foundation Model for Automated 2D Material Flake Detection |
MaskTerial: a foundation model for automated 2D material flake detection |
foundation model |
|
|
| 9 |
Foundation Models and Adaptive Feature Selection: A Synergistic Approach to Video Question Answering |
Proposes the LGQAVE model, which improves video question answering through adaptive feature selection and foundation models |
foundation model |
|
|
| 10 |
Enriching Multimodal Sentiment Analysis through Textual Emotional Descriptions of Visual-Audio Content |
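Proposes the DEVA framework, which enhances multimodal sentiment analysis of visual-audio content through textual emotional descriptions |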
multimodal |
|
|
| 11 |
Pinpoint Counterfactuals: Reducing social bias in foundation models via localized counterfactual generation |
Proposes a localized counterfactual generation method to reduce social bias in foundation models |
foundation model |
|
|
| 12 |
ViCaS: A Dataset for Combining Holistic and Pixel-level Video Understanding using Captions with Grounded Segmentation |
Proposes the ViCaS dataset, which combines holistic video understanding with pixel-level segmentation |
large language model multimodal |
✅ |
|
| 13 |
Olympus: A Universal Task Router for Computer Vision Tasks |
Olympus: a universal task-routing framework for computer vision tasks |
large language model multimodal |
|
|
| 14 |
SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding |
SynerGen-VL: synergistic image understanding and generation using vision experts and token folding |
large language model multimodal |
|
|
| 15 |
Do MLLMs Exhibit Human-like Perceptual Behaviors? HVSBench: A Benchmark for MLLM Alignment with Human Perceptual Behavior |
Proposes the HVSBench benchmark to test whether MLLMs exhibit human-like perceptual behavior, revealing a significant gap |
large language model multimodal |
|
|
| 16 |
Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition |
Lyra: an efficient, speech-centric framework for omni-cognition |
large language model multimodal |
|
|
| 17 |
GenEx: Generating an Explorable World |
GenEx: builds an explorable 3D world through generative imagination to improve the capabilities of embodied agents |
embodied AI |
|
|
| 18 |
TimeRefine: Temporal Grounding with Time Refining Video LLM |
TimeRefine: temporal grounding with a time-refining video LLM |
TAMP |
|
|
| 19 |
Vision-Language Models Generate More Homogeneous Stories for Phenotypically Black Individuals |
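Vision-language models generate more homogeneous stories for phenotypically Black individuals, revealing a within-group homogeneity bias |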
large language model |
|
|
| 20 |
FD2-Net: Frequency-Driven Feature Decomposition Network for Infrared-Visible Object Detection |
Proposes FD2-Net, which improves infrared-visible object detection through frequency-driven feature decomposition |
multimodal |
|
|
| 21 |
Towards Long-Horizon Vision-Language Navigation: Platform, Benchmark and Method |
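Proposes a long-horizon vision-language navigation task and benchmark, and designs a multi-granularity dynamic memory model to improve navigation performance |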
VLN |
|
|
| 22 |
Geo-LLaVA: A Large Multi-Modal Model for Solving Geometry Math Problems with Meta In-Context Learning |
Proposes Geo-LLaVA, which uses meta in-context learning to solve difficult geometry math problems |
large language model |
|
|