| 1 | Visatronic: A Multimodal Decoder-Only Model for Speech Synthesis | Visatronic: a multimodal decoder-only model for speech synthesis that generates speech from video and text. | large language model, foundation model, multimodal | ✅ | |
| 2 | InsightEdit: Towards Better Instruction Following for Image Editing | InsightEdit: uses multimodal large language models to improve instruction-driven image editing. | large language model, multimodal, instruction following | | |
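| 3 | Multimodal Alignment and Fusion: A Survey | Surveys multimodal alignment and fusion techniques, covering structural perspectives and methodological paradigms, with the aim of improving the generalization of multimodal learning systems. | embodied AI, large language model, multimodal | | |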
| 4 | NEMO: Can Multimodal LLMs Identify Attribute-Modified Objects? | NEMO: evaluates the ability of multimodal large language models to identify attribute-modified objects. | large language model, multimodal | | |
| 5 | ShowUI: One Vision-Language-Action Model for GUI Visual Agent | Proposes ShowUI, a vision-language-action model for GUI visual agents. | vision-language-action, instruction following | ✅ | |
| 6 | Grounding-IQA: Grounding Multimodal Language Model for Image Quality Assessment | Proposes Grounding-IQA, which uses multimodal grounding to make image quality assessment more fine-grained. | large language model, multimodal | ✅ | |
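| 7 | Real-Time Multimodal Signal Processing for HRI in RoboCup: Understanding a Human Referee | Proposes a real-time multimodal signal processing method for human-robot interaction in RoboCup, aimed at understanding a human referee. | multimodal | | |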
| 8 | Video-Guided Foley Sound Generation with Multimodal Controls | MultiFoley: a video-guided Foley sound generation model with multimodal controls. | multimodal | ✅ | |
| 9 | HyperSeg: Towards Universal Visual Segmentation with Large Language Model | HyperSeg: a universal visual segmentation model based on a large language model, providing pixel-level understanding of images and videos. | large language model | | |
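| 10 | Multimodal Outer Arithmetic Block Dual Fusion of Whole Slide Images and Omics Data for Precision Oncology | Proposes a dual-fusion method built on a multimodal outer arithmetic block, improving the accuracy of tumor subtype diagnosis when fusing WSIs with genomic data. | multimodal | | |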
| 11 | Efficient Multi-modal Large Language Models via Visual Token Grouping | Proposes VisToG, which improves the efficiency of multimodal large language models by grouping visual tokens. | large language model | | |
| 12 | Exploring Aleatoric Uncertainty in Object Detection via Vision Foundation Models | Uses vision foundation models to explore aleatoric uncertainty in object detection, improving model robustness. | foundation model | | |
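| 13 | Advancing Content Moderation: Evaluating Large Language Models for Detecting Sensitive Content Across Text, Images, and Videos | Evaluates the ability of large language models to detect sensitive content in text, images, and videos, with the aim of improving content moderation. | large language model | | |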
| 14 | SatVision-TOA: A Geospatial Foundation Model for Coarse-Resolution All-Sky Remote Sensing Imagery | SatVision-TOA: a geospatial foundation model for coarse-resolution all-sky remote sensing imagery. | foundation model | | |
| 15 | Filter, Correlate, Compress: Training-Free Token Reduction for MLLM Acceleration | Proposes the FiCoCo framework, which accelerates multimodal large language models through training-free token reduction. | large language model, multimodal | ✅ | |
| 16 | SketchAgent: Language-Driven Sequential Sketch Generation | SketchAgent: proposes a language-driven method for sequential sketch generation. | large language model, multimodal | | |
| 17 | HEIE: MLLM-Based Hierarchical Explainable AIGC Image Implausibility Evaluator | Proposes HEIE, an MLLM-based hierarchical, explainable evaluator of implausibility in AIGC images. | large language model, multimodal | ✅ | |
| 18 | DOGR: Towards Versatile Visual Document Grounding and Referring | DOGR: a model, data engine, and benchmark for versatile visual document grounding and referring. | large language model, multimodal | | |
| 19 | OpenAD: Open-World Autonomous Driving Benchmark for 3D Object Detection | OpenAD: an open-world autonomous driving benchmark for 3D object detection. | large language model, multimodal | ✅ | |
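| 20 | The Context of Crash Occurrence: A Complexity-Infused Approach Integrating Semantic, Contextual, and Kinematic Features | Proposes a road complexity analysis framework that integrates semantic, contextual, and kinematic features to improve crash prediction accuracy. | large language model | | |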
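| 21 | Bi-ICE: An Inner Interpretable Framework for Image Classification via Bi-directional Interactions between Concept and Input Embeddings | Proposes Bi-ICE, which improves the inner interpretability of image classification through bi-directional interactions between concept and input embeddings. | large language model | | |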
| 22 | Scene Co-pilot: Procedural Text to Video Generation with Human in the Loop | Proposes the Scene Co-pilot framework, which combines LLMs with procedural 3D scene generation for controllable text-to-video generation. | large language model | | |
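| 23 | FLEX-CLIP: Feature-Level GEneration Network Enhanced CLIP for X-shot Cross-modal Retrieval | Proposes FLEX-CLIP, which enhances CLIP with a feature generation network to address feature degradation and data imbalance in X-shot cross-modal retrieval. | multimodal | | |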
| 24 | VL-RewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models | Proposes VL-RewardBench for evaluating and improving vision-language generative reward models. | multimodal | | |
| 25 | in-Car Biometrics (iCarB) Datasets for Driver Recognition: Face, Fingerprint, and Voice | Releases the iCarB in-car biometrics datasets for driver recognition, covering three modalities: face, fingerprint, and voice. | multimodal | | |
| 26 | Symmetry Strikes Back: From Single-Image Symmetry Detection to 3D Generation | Reflect3D: uses single-image symmetry detection to achieve high-quality 3D generation. | foundation model | | |
| 27 | MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding | MUSE-VL: models a unified vision-language model through semantic discrete encoding. | multimodal | | |