| 1 |
RzenEmbed: Towards Comprehensive Multimodal Retrieval |
RzenEmbed: proposes a unified multimodal embedding framework that significantly improves video and document retrieval performance |
large language model multimodal instruction following |
✅ |
|
| 2 |
Image Hashing via Cross-View Code Alignment in the Age of Foundation Models |
Proposes CroVCA, which achieves efficient image hashing retrieval through cross-view code alignment |
foundation model multimodal |
|
|
| 3 |
Understanding the Implicit User Intention via Reasoning with Large Language Model for Image Editing |
Proposes CIELR, which uses LLM reasoning to decompose complex image editing instructions into simple actions, without joint fine-tuning. |
large language model foundation model |
✅ |
|
| 4 |
Can MLLMs Read the Room? A Multimodal Benchmark for Verifying Truthfulness in Multi-Party Social Interactions |
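Proposes the MIVA benchmark, which evaluates the ability of multimodal large language models to detect lies in multi-party social interactions |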
large language model multimodal |
|
|
| 5 |
Sketch-to-Layout: Sketch-Guided Multimodal Layout Generation |
Proposes the Sketch-to-Layout framework, which uses sketches to guide multimodal layout generation and improve the design experience. |
multimodal |
✅ |
|
| 6 |
Towards Universal Video Retrieval: Generalizing Video Embedding via Synthesized Multimodal Pyramid Curriculum |
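Proposes a universal video retrieval framework that generalizes video embeddings through a synthesized multimodal pyramid curriculum |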
multimodal |
|
|
| 7 |
E-MMDiT: Revisiting Multimodal Diffusion Transformer Design for Fast Image Synthesis under Limited Resources |
Proposes E-MMDiT, a lightweight multimodal diffusion Transformer for fast image synthesis under limited resources. |
multimodal |
✅ |
|
| 8 |
CompAgent: An Agentic Framework for Visual Compliance Verification |
Proposes CompAgent, an agentic framework for visual compliance verification that improves fine-grained reasoning. |
large language model multimodal |
|
|
| 9 |
FOCUS: Efficient Keyframe Selection for Long Video Understanding |
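Proposes FOCUS, an efficient keyframe selection method that improves the performance of multimodal large language models on long video understanding. |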
large language model multimodal |
✅ |
|
| 10 |
Generating Accurate and Detailed Captions for High-Resolution Images |
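Proposes a multi-stage pipeline that combines vision-language models, large language models, and object detection to generate more accurate and detailed captions for high-resolution images. |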
large language model multimodal |
|
|
| 11 |
FLoC: Facility Location-Based Efficient Visual Token Compression for Long Video Understanding |
FLoC: an efficient visual token compression method for long videos based on facility location |
multimodal |
|
|
| 12 |
NegoCollab: A Common Representation Negotiation Approach for Heterogeneous Collaborative Perception |
NegoCollab: a negotiation-based common representation approach for heterogeneous collaborative perception |
multimodal |
|
|
| 13 |
MapSAM2: Adapting SAM2 for Automatic Segmentation of Historical Map Images and Time Series |
MapSAM2: automatic segmentation of historical map images and time series by adapting SAM2 |
foundation model |
|
|
| 14 |
Mitigating Semantic Collapse in Partially Relevant Video Retrieval |
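Proposes text correlation preservation learning and cross-branch video alignment to mitigate semantic collapse in partially relevant video retrieval. |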
foundation model |
|
|
| 15 |
MoRE: 3D Visual Geometry Reconstruction Meets Mixture-of-Experts |
Proposes MoRE, a mixture-of-experts framework for 3D visual geometry reconstruction that improves scalability and adaptability. |
foundation model |
|
|
| 16 |
Multi-Modal Feature Fusion for Spatial Morphology Analysis of Traditional Villages via Hierarchical Graph Neural Networks |
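Proposes a multimodal feature fusion method based on hierarchical graph neural networks for spatial morphology analysis of traditional villages. |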
multimodal |
|
|