| 1 |
UniVBench: Towards Unified Evaluation for Video Foundation Models |
Proposes UniVBench to address the fragmented evaluation of video foundation models |
foundation model multimodal instruction following |
|
|
| 2 |
MindDriver: Introducing Progressive Multimodal Reasoning for Autonomous Driving |
MindDriver: a progressive multimodal reasoning framework for autonomous driving |
multimodal chain-of-thought |
✅ |
|
| 3 |
SkyReels-V4: Multi-modal Video-Audio Generation, Inpainting and Editing model |
SkyReels-V4: a foundation model unifying multimodal video-audio generation, inpainting, and editing |
large language model foundation model multimodal |
|
|
| 4 |
RGB-Event HyperGraph Prompt for Kilometer Marker Recognition based on Pre-trained Foundation Models |
Proposes a pre-trained model based on RGB-Event hypergraph prompts for recognizing metro kilometer markers in GNSS-denied environments |
foundation model |
✅ |
|
| 5 |
Dynamic Multimodal Activation Steering for Hallucination Mitigation in Large Vision-Language Models |
Proposes a dynamic multimodal activation steering method to mitigate hallucinations in large vision-language models |
multimodal |
|
|
| 6 |
E-comIQ-ZH: A Human-Aligned Dataset and Benchmark for Fine-Grained Evaluation of E-commerce Posters with Chain-of-Thought |
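Proposes the E-comIQ-ZH framework for fine-grained quality evaluation of Chinese e-commerce posters, addressing the neglect of text artifacts in existing methods |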
chain-of-thought |
✅ |
|
| 7 |
CARE: A Molecular-Guided Foundation Model with Adaptive Region Modeling for Whole Slide Image Analysis |
Proposes CARE: a molecular-guided foundation model with adaptive region modeling for pathology slide image analysis |
foundation model |
|
|
| 8 |
SEF-MAP: Subspace-Decomposed Expert Fusion for Robust Multimodal HD Map Prediction |
SEF-MAP: a subspace-decomposed expert fusion method for robust multimodal HD map prediction |
multimodal |
|
|
| 9 |
WeaveTime: Stream from Earlier Frames into Emergent Memory in VideoLLMs |
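WeaveTime: improves the temporal understanding of video LLMs in streaming scenarios by weaving information from earlier frames into emergent memory |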
large language model multimodal |
✅ |
|
| 10 |
Global-Local Dual Perception for MLLMs in High-Resolution Text-Rich Image Translation |
GLoTran: a global-local dual-perception MLLM framework for high-resolution text-rich image translation |
large language model multimodal |
|
|
| 11 |
StoryMovie: A Dataset for Semantic Alignment of Visual Stories with Movie Scripts and Subtitles |
The StoryMovie dataset improves the accuracy of semantic relations in visual stories by aligning them with movie scripts and subtitles |
visual grounding TAMP |
|
|
| 12 |
TranX-Adapter: Bridging Artifacts and Semantics within MLLMs for Robust AI-generated Image Detection |
Proposes TranX-Adapter to make MLLMs more robust at detecting AI-generated images |
large language model multimodal |
|
|
| 13 |
NoLan: Mitigating Object Hallucinations in Large Vision-Language Models via Dynamic Suppression of Language Priors |
NoLan: mitigates object hallucinations in large vision-language models by dynamically suppressing language priors |
multimodal |
✅ |
|
| 14 |
RobustVisRAG: Causality-Aware Vision-Based Retrieval-Augmented Generation under Visual Degradations |
RobustVisRAG: a causality-aware retrieval-augmented generation framework robust to visual degradations |
multimodal |
|
|