| # | Title | Summary | Keywords | |
|---|-------|---------|----------|---|
| 1 | Risk-adaptive Activation Steering for Safe Multimodal Large Language Models | Proposes Risk-adaptive Activation Steering (RAS), improving the safety of multimodal large language models while accelerating inference. | large language model, multimodal | |
| 2 | Model-agnostic Adversarial Attack and Defense for Vision-Language-Action Models | Proposes a model-agnostic adversarial attack and defense method for vision-language-action models. | vision-language-action, VLA | ✅ |
| 3 | Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs | Proposes the Honey-Data-15M dataset and the Bee-8B model, advancing the performance of fully open-source multimodal large language models. | large language model, multimodal, chain-of-thought | |
| 4 | Uni-MMMU: A Massive Multi-discipline Multimodal Unified Benchmark | Proposes Uni-MMMU, a massive multi-discipline multimodal unified benchmark for evaluating the bidirectional synergy between visual understanding and generation models. | multimodal | |
| 5 | Fusion Meets Diverse Conditions: A High-diversity Benchmark and Baseline for UAV-based Multimodal Object Detection with Condition Cues | Proposes a condition-aware dynamic fusion method to improve the robustness of UAV-based multimodal object detection in complex scenes. | multimodal | |
| 6 | Language as a Label: Zero-Shot Multimodal Classification of Everyday Postures under Data Scarcity | Uses language as labels for zero-shot multimodal classification, addressing everyday posture recognition under data scarcity. | multimodal | |
| 7 | OS-HGAdapter: Open Semantic Hypergraph Adapter for Large Language Models Assisted Entropy-Enhanced Image-Text Alignment | Proposes OS-HGAdapter, which leverages large language models to enhance image-text alignment and significantly improve cross-modal retrieval. | large language model | |
| 8 | Reasoning in Space via Grounding in the World | Proposes a world-grounded Grounded-Spatial Reasoner to strengthen 3D spatial reasoning. | visual grounding, chain-of-thought | |
| 9 | RECODE: Reasoning Through Code Generation for Visual Question Answering | Proposes the RECODE framework, enabling more precise, verifiable reasoning in visual question answering via code generation. | large language model, multimodal | |
| 10 | OmniGaze: Reward-inspired Generalizable Gaze Estimation In The Wild | Proposes OmniGaze, a reward-inspired framework for generalizable gaze estimation that addresses in-the-wild generalization. | large language model, multimodal | |
| 11 | Vgent: Graph-based Retrieval-Reasoning-Augmented Generation For Long Video Understanding | Proposes Vgent, improving long video understanding through graph-based retrieval-reasoning-augmented generation. | large language model | ✅ |
| 12 | Adaptive Visual Conditioning for Semantic Consistency in Diffusion-Based Story Continuation | Proposes the AVC framework, which adaptively conditions diffusion models on visual inputs to improve semantic consistency in story continuation. | large language model | |
| 13 | InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue | Proposes InteractiveOmni, a unified omni-modal large language model for audio-visual multi-turn dialogue. | large language model | |
| 14 | Towards Adversarial Robustness and Uncertainty Quantification in DINOv2-based Few-Shot Anomaly Detection | Studies adversarial robustness and uncertainty quantification in DINOv2-based few-shot anomaly detection. | foundation model | |
| 15 | Visual Interestingness Decoded: How GPT-4o Mirrors Human Interests | Explores GPT-4o's understanding of visual interestingness and applies it to improve learning-to-rank models. | multimodal | |
| 16 | Self-Augmented Visual Contrastive Decoding | Proposes Self-Augmented Visual Contrastive Decoding, improving the factual consistency of large vision-language models. | multimodal | |
| 17 | MMLongCite: A Benchmark for Evaluating Fidelity of Long-Context Vision-Language Models | Proposes the MMLongCite benchmark for evaluating the information fidelity of long-context vision-language models. | multimodal | |
| 18 | What "Not" to Detect: Negation-Aware VLMs via Structured Reasoning and Token Merging | Proposes the NegToMe module and the CoVAND dataset, improving VLM detection of objects described with negation. | chain-of-thought | |