| 1 |
Impromptu VLA: Open Weights and Open Data for Driving Vision-Language-Action Models |
Impromptu VLA: open data and weights for autonomous-driving vision-language-action models |
vision-language-action VLA |
✅ |
|
| 2 |
Argus: Vision-Centric Reasoning with Grounded Chain-of-Thought |
Argus: proposes a chain-of-thought method grounded in visual attention to improve multimodal reasoning |
large language model multimodal chain-of-thought |
✅ |
|
| 3 |
Preemptive Hallucination Reduction: An Input-Level Approach for Multimodal Language Model |
Proposes an input-preprocessing method for reducing hallucination in multimodal language models |
large language model multimodal |
|
|
| 4 |
OpenUni: A Simple Baseline for Unified Multimodal Understanding and Generation |
OpenUni: a simple baseline model for unified multimodal understanding and generation |
large language model multimodal |
✅ |
|
| 5 |
MaskAdapt: Unsupervised Geometry-Aware Domain Adaptation Using Multimodal Contextual Learning and RGB-Depth Masking |
MaskAdapt: unsupervised geometry-aware domain adaptation using multimodal contextual learning and RGB-D masking |
multimodal |
|
|
| 6 |
Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence |
Spatial-MLLM: enhances the visual spatial intelligence of MLLMs with visual geometry priors |
large language model foundation model multimodal |
✅ |
|
| 7 |
FMG-Det: Foundation Model Guided Robust Object Detection |
FMG-Det: a foundation-model-guided robust object detection method that addresses model training under noisy annotations |
foundation model |
|
|
| 8 |
VF-Eval: Evaluating Multimodal LLMs for Generating Feedback on AIGC Videos |
Proposes VF-Eval to evaluate multimodal LLMs on generating feedback for AIGC videos |
multimodal |
|
|
| 9 |
EndoBench: A Comprehensive Evaluation of Multi-Modal Large Language Models for Endoscopy Analysis |
EndoBench: builds a comprehensive evaluation benchmark for multimodal large language models in endoscopy analysis |
large language model |
|
|
| 10 |
OmniEarth-Bench: Towards Holistic Evaluation of Earth's Six Spheres and Cross-Spheres Interactions with Multimodal Observational Earth Data |
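Proposes OmniEarth-Bench for holistically evaluating learning from multimodal observational data across Earth's six spheres and their cross-sphere interactions |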
multimodal |
|
|
| 11 |
VAU-R1: Advancing Video Anomaly Understanding via Reinforcement Fine-Tuning |
VAU-R1: improves video anomaly understanding through reinforcement fine-tuning |
large language model multimodal chain-of-thought |
✅ |
|
| 12 |
MCFNet: A Multimodal Collaborative Fusion Network for Fine-Grained Semantic Classification |
Proposes MCFNet to address the difficulty of cross-modal information fusion in fine-grained semantic classification |
multimodal |
|
|
| 13 |
VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning? |
VideoReasonBench: proposes a benchmark for evaluating multimodal large models on vision-centric complex reasoning |
large language model multimodal chain-of-thought |
|
|
| 14 |
ThinkGeo: Evaluating Tool-Augmented Agents for Remote Sensing Tasks |
ThinkGeo: evaluates the performance of tool-augmented agents on remote sensing tasks |
large language model multimodal |
|
|
| 15 |
Position Paper: Metadata Enrichment Model: Integrating Neural Networks and Semantic Knowledge Graphs for Cultural Heritage Applications |
Proposes the Metadata Enrichment Model, which combines neural networks with knowledge graphs to improve metadata quality in cultural heritage digitization |
large language model TAMP |
|
|
| 16 |
CMIE: Combining MLLM Insights with External Evidence for Explainable Out-of-Context Misinformation Detection |
Proposes the CMIE framework, which combines MLLM insights with external evidence to tackle out-of-context misinformation detection |
large language model multimodal |
|
|
| 17 |
Vid-SME: Membership Inference Attacks against Large Video Understanding Models |
Proposes Vid-SME, an efficient membership inference attack against large video understanding models |
large language model multimodal |
|
|
| 18 |
DGIQA: Depth-guided Feature Attention and Refinement for Generalizable Image Quality Assessment |
DGIQA: proposes depth-guided feature attention and refinement to improve the generalization of image quality assessment |
multimodal |
|
|
| 19 |
VisualSphinx: Large-Scale Synthetic Vision Logic Puzzles for RL |
VisualSphinx: a large-scale synthetic dataset of visual logic puzzles for reinforcement learning |
multimodal |
|
|
| 20 |
ScaleLong: A Multi-Timescale Benchmark for Long Video Understanding |
Proposes ScaleLong, a multi-timescale benchmark for long video understanding that enables direct comparison of model performance across timescales |
multimodal |
✅ |
|
| 21 |
D-AR: Diffusion via Autoregressive Models |
D-AR: recasts the image diffusion process as an autoregressive model for image generation |
large language model |
✅ |
|
| 22 |
ZeroSep: Separate Anything in Audio with Zero Training |
ZeroSep: training-free audio separation using pretrained text-guided audio diffusion models |
foundation model |
|
|
| 23 |
Uni-MuMER: Unified Multi-Task Fine-Tuning of Vision-Language Model for Handwritten Mathematical Expression Recognition |
Uni-MuMER: handwritten mathematical expression recognition through unified multi-task fine-tuning of a vision-language model |
chain-of-thought |
✅ |
|
| 24 |
TerraIncognita: A Dynamic Benchmark for Species Discovery Using Frontier Models |
TerraIncognita: a dynamic benchmark for species discovery that uses frontier models to identify unknown insect species |
multimodal |
✅ |
|
| 25 |
VCapsBench: A Large-scale Fine-grained Benchmark for Video Caption Quality Evaluation |
Proposes VCapsBench, a large-scale fine-grained benchmark for evaluating video caption quality, aimed at improving text-to-video generation |
large language model |
✅ |
|