| 1 |
Harmonizing Visual Representations for Unified Multimodal Understanding and Generation |
Proposes Harmon, a unified autoregressive framework for multimodal understanding and generation tasks. |
multimodal |
✅ |
|
| 2 |
On Large Multimodal Models as Open-World Image Classifiers |
Evaluates the performance and challenges of large multimodal models in open-world image classification. |
multimodal |
|
|
| 3 |
PS-ReID: Advancing Person Re-Identification and Precise Segmentation with Multimodal Retrieval |
PS-ReID: combines image-text multimodal retrieval to achieve more precise person re-identification and segmentation. |
multimodal |
|
|
| 4 |
Multimodal surface defect detection from wooden logs for sawing optimization |
Proposes a multimodal-fusion method for detecting knots on wood surfaces to optimize sawing. |
multimodal |
|
|
| 5 |
HyperFree: A Channel-adaptive and Tuning-free Foundation Model for Hyperspectral Remote Sensing Imagery |
Proposes HyperFree, a channel-adaptive, tuning-free foundation model for hyperspectral remote sensing imagery. |
foundation model |
|
|
| 6 |
AdaMHF: Adaptive Multimodal Hierarchical Fusion for Survival Prediction |
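Proposes AdaMHF, an adaptive multimodal hierarchical fusion approach that improves survival prediction accuracy, especially when data is missing. |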
multimodal |
|
|
| 7 |
iMedImage Technical Report |
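iMedImage: an end-to-end multimodal foundation model for general medical image recognition that improves chromosome abnormality detection accuracy. |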
foundation model, multimodal, chain-of-thought |
|
|
| 8 |
Online Reasoning Video Segmentation with Just-in-Time Digital Twins |
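Proposes an online reasoning video segmentation framework based on just-in-time digital twins, addressing issues in existing methods such as limited reasoning ability and reliance on fine-tuning. |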
embodied AI, large language model, multimodal |
|
|
| 9 |
FaceBench: A Multi-View Multi-Level Facial Attribute VQA Dataset for Benchmarking Face Perception MLLMs |
FaceBench: a multi-view, multi-level facial attribute VQA dataset for evaluating face-perception multimodal large language models. |
large language model, multimodal |
✅ |
|
| 10 |
VALLR: Visual ASR Language Model for Lip Reading |
VALLR: proposes a visual ASR language model for lip reading that significantly reduces word error rate. |
large language model, multimodal |
|
|
| 11 |
InternVL-X: Advancing and Accelerating InternVL Series with Efficient Visual Token Compression |
InternVL-X: improves the performance and efficiency of the InternVL model series through efficient visual token compression. |
large language model, multimodal |
|
|
| 12 |
Differential Evolution for Grassmann Manifold Optimization: A Projection Approach |
Proposes a projection-based differential evolution algorithm for optimization problems on the Grassmann manifold. |
multimodal |
|
|
| 13 |
StarFlow: Generating Structured Workflow Outputs From Sketch Images |
StarFlow: uses vision-language models to generate structured workflows from sketches. |
foundation model |
|
|
| 14 |
Mobile-VideoGPT: Fast and Accurate Video Understanding Language Model |
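Proposes Mobile-VideoGPT, an efficient video understanding language model with fewer than one billion parameters that achieves real-time throughput. |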
multimodal |
✅ |
|
| 15 |
Stable-SCore: A Stable Registration-based Framework for 3D Shape Correspondence |
Proposes the Stable-SCore framework, which achieves more robust 3D shape correspondence through stable registration. |
foundation model |
|
|
| 16 |
Comparative Analysis of Image, Video, and Audio Classifiers for Automated News Video Segmentation |
Proposes deep-learning-based image, video, and audio classifiers for automated news video segmentation. |
multimodal |
|
|
| 17 |
FineCIR: Explicit Parsing of Fine-Grained Modification Semantics for Composed Image Retrieval |
Proposes the FineCIR framework, which improves composed image retrieval accuracy by explicitly parsing fine-grained semantics. |
multimodal |
✅ |
|
| 18 |
Vision-to-Music Generation: A Survey |
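Survey of vision-to-music generation: a systematic review of technical progress and future directions in video-to-music and image-to-music generation. |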
multimodal |
✅ |
|
| 19 |
M-DocSum: Do LVLMs Genuinely Comprehend Interleaved Image-Text in Document Summarization? |
Proposes M-DocSum-Bench to evaluate how well LVLMs comprehend content in multimodal document summarization. |
multimodal |
✅ |
|
| 20 |
Towards Generalizable Forgery Detection and Reasoning |
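Proposes the FakeReasoning framework, which uses multimodal large language models for generalizable forgery detection and reasoning on AI-generated images. |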
large language model |
|
|
| 21 |
DSU-Net:An Improved U-Net Model Based on DINOv2 and SAM2 with Multi-scale Cross-model Feature Enhancement |
DSU-Net: a U-Net with multi-scale cross-model feature enhancement that fuses DINOv2 and SAM2 to improve image segmentation performance. |
foundation model |
✅ |
|
| 22 |
A Multi-Modal Knowledge-Enhanced Framework for Vessel Trajectory Prediction |
Proposes MAKER, a multimodal knowledge-enhanced framework that improves vessel trajectory prediction accuracy. |
large language model |
|
|