| 1 |
A Generative Foundation Model for Multimodal Histopathology |
MuPD: a generative foundation model for multimodal histopathology that enables cross-modal synthesis and virtual staining. |
foundation model multimodal |
|
|
| 2 |
A Patch-based Cross-view Regularized Framework for Backdoor Defense in Multimodal Large Language Models |
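Proposes a framework based on patch augmentation and cross-view regularization to defend multimodal large language models against backdoor attacks. |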
large language model multimodal |
|
|
| 3 |
CoLA: Cross-Modal Low-rank Adaptation for Multimodal Downstream Tasks |
CoLA: cross-modal low-rank adaptation for multimodal downstream tasks. |
foundation model multimodal visual grounding |
|
|
| 4 |
When Does Multimodal AI Help? Diagnostic Complementarity of Vision-Language Models and CNNs for Spectrum Management in Satellite-Terrestrial Networks |
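Proposes the SpectrumQA benchmark to diagnose the complementarity of VLMs and CNNs for spectrum management in satellite-terrestrial networks. |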
foundation model multimodal chain-of-thought |
|
|
| 5 |
Beyond Static Vision: Scene Dynamic Field Unlocks Intuitive Physics Understanding in Multi-modal Large Language Models |
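Proposes the Scene Dynamic Field (SDF) to improve multimodal large language models' understanding of the physics of continuous object dynamics. |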
large language model multimodal |
|
|
| 6 |
The Indra Representation Hypothesis for Multimodal Alignment |
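Proposes a multimodal alignment method based on the Indra representation hypothesis, achieving training-free, robust cross-modal alignment. |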
foundation model multimodal |
|
|
| 7 |
Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation |
Firebolt-VL: efficient vision-language understanding via cross-modality modulation. |
large language model foundation model multimodal |
|
|
| 8 |
Chain-of-Frames: Advancing Video Understanding in Multimodal LLMs via Frame-Aware Reasoning |
Proposes the Chain-of-Frames framework to improve frame-aware reasoning in multimodal LLMs for video understanding. |
large language model multimodal |
|
|
| 9 |
ZINA: Multimodal Fine-grained Hallucination Detection and Editing |
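ZINA: proposes a multimodal fine-grained hallucination detection and editing method to address MLLM outputs that are inconsistent with the visual content. |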
large language model multimodal |
|
|
| 10 |
Image Hashing via Cross-View Code Alignment in the Age of Foundation Models |
Proposes CroVCA, which achieves efficient image hashing through cross-view code alignment and is suitable for large-scale retrieval. |
foundation model multimodal |
|
|
| 11 |
EI: Early Intervention for Multimodal Imaging based Disease Recognition |
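Proposes the EI framework, which improves disease recognition accuracy on multimodal medical imaging through early modality intervention and MoR adaptation. |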
foundation model multimodal |
|
|
| 12 |
LongTail Driving Scenarios with Reasoning Traces: The KITScenes LongTail Dataset |
KITScenes LongTail dataset: a dataset of long-tail driving scenarios with reasoning traces, for end-to-end driving. |
VLA multimodal instruction following |
|
|
| 13 |
Automated Segmentation and Tracking of Group Housed Pigs Using Foundation Models |
Uses foundation models for automated segmentation and tracking of group-housed pigs, advancing intelligent livestock farming. |
foundation model |
|
|
| 14 |
Multimodal Urban Tree Detection from Satellite and Street-Level Imagery via Annotation-Efficient Deep Learning Strategies |
Proposes an urban tree detection method based on multimodal imagery and annotation-efficient strategies. |
multimodal |
|
|
| 15 |
A Physics-Informed, Behavior-Aware Digital Twin for Robust Multimodal Forecasting of Core Body Temperature in Precision Livestock Farming |
Proposes a physics-informed digital twin to improve the accuracy of core body temperature prediction for dairy cows. |
multimodal |
|
|
| 16 |
Multimodal Backdoor Attack on VLMs for Autonomous Driving via Graffiti and Cross-Lingual Triggers |
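Proposes a multimodal backdoor attack based on graffiti and cross-lingual triggers that threatens vision-language models for autonomous driving. |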
multimodal |
|
|
| 17 |
ClickAIXR: On-Device Multimodal Vision-Language Interaction with Real-World Objects in Extended Reality |
ClickAIXR: a framework for on-device multimodal vision-language interaction with real-world objects in extended reality. |
multimodal |
|
|
| 18 |
Robust Adaptation of Foundation Models with Black-Box Visual Prompting |
Proposes BlackVIP, which achieves robust adaptation of large models under limited resources through black-box visual prompting. |
foundation model |
|
|
| 19 |
Automated Wildfire Damage Assessment from Multi view Ground level Imagery Via Vision Language Models |
Proposes an automated wildfire damage assessment method based on multi-view ground-level imagery and vision-language models. |
large language model multimodal chain-of-thought |
|
|
| 20 |
Revisiting Multimodal Positional Encoding in Vision-Language Models |
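Proposes Multi-Head Rotary Position Embedding (MHRoPE) and its variant MRoPE-I to improve multimodal positional encoding in vision-language models. |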
multimodal |
|
|
| 21 |
Vision-as-Inverse-Graphics Agent via Interleaved Multimodal Reasoning |
Proposes VIGA: a vision-as-inverse-graphics agent built on interleaved multimodal reasoning. |
multimodal |
|
|
| 22 |
Improving Multimodal Learning with Dispersive and Anchoring Regularization |
Proposes Dispersive and Anchoring Regularization to improve representation quality and fusion in multimodal learning. |
multimodal |
|
|
| 23 |
Inference-Path Optimization via Circuit Duplication in Frozen Visual Transformers for Marine Species Classification |
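For marine species classification, proposes an inference-path optimization method based on circuit duplication in frozen vision Transformers. |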
large language model foundation model |
|
|
| 24 |
Stabilizing Unsupervised Self-Evolution of MLLMs via Continuous Softened Retracing reSampling |
Proposes the CSRS method to stabilize the unsupervised self-evolution of multimodal large language models on geometry tasks. |
large language model multimodal |
|
|
| 25 |
ITIScore: An Image-to-Text-to-Image Rating Framework for the Image Captioning Ability of MLLMs |
Proposes ITIScore: an image-to-text-to-image rating framework for evaluating the image captioning ability of multimodal large language models. |
large language model multimodal |
|
|
| 26 |
BoxComm: Benchmarking Category-Aware Commentary Generation and Narration Rhythm in Boxing |
BoxComm: proposes a dataset and evaluation suite for boxing commentary generation, filling a gap in AI research on combat-sports commentary. |
large language model multimodal |
|
|
| 27 |
Beyond Standard Benchmarks: A Systematic Audit of Vision-Language Model's Robustness to Natural Semantic Variation Across Diverse Tasks |
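Systematically evaluates the robustness of vision-language models to natural semantic variation, revealing their fragility across diverse tasks. |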
multimodal zero-shot transfer |
|
|
| 28 |
Synthesis4AD: Synthetic Anomalies are All You Need for 3D Anomaly Detection |
Synthesis4AD: uses synthetic anomaly data to improve 3D anomaly detection performance. |
large language model multimodal |
|
|
| 29 |
Compact Hypercube Embeddings for Fast Text-based Wildlife Observation Retrieval |
Proposes compact hypercube embeddings to speed up text-based wildlife observation retrieval. |
foundation model multimodal |
|
|
| 30 |
SafeScreen: A Safety-First Screening Framework for Personalized Video Retrieval for Vulnerable Users |
SafeScreen: a safety-first personalized video retrieval framework for vulnerable users. |
multimodal |
|
|
| 31 |
When Sinks Help or Hurt: Unified Framework for Attention Sink in Large Vision-Language Models |
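Proposes a layer-wise sink gating (LSG) module that improves the balance between global reasoning and local perception in large vision-language models. |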
multimodal |
|
|
| 32 |
Gaze to Insight: A Scalable AI Approach for Detecting Gaze Behaviours in Face-to-Face Collaborative Learning |
Proposes a scalable AI approach that detects gaze behaviours in face-to-face collaborative learning without manual annotation. |
foundation model |
|
|
| 33 |
Bridging the Dimensionality Gap: A Taxonomy and Survey of 2D Vision Model Adaptation for 3D Analysis |
Surveys methods for adapting 2D vision models to 3D analysis, bridging the dimensionality gap. |
foundation model |
|
|
| 34 |
Focus Matters: Phase-Aware Suppression for Hallucination in Vision-Language Models |
Proposes a phase-aware suppression method to address hallucination in vision-language models. |
multimodal |
|
|
| 35 |
Love Me, Love My Label: Rethinking the Role of Labels in Prompt Retrieval for Visual In-Context Learning |
LaPR: proposes a label-aware prompt retrieval framework for visual in-context learning that improves task performance. |
foundation model |
|
|
| 36 |
DSERT-RoLL: Robust Multi-Modal Perception for Diverse Driving Conditions with Stereo Event-RGB-Thermal Cameras, 4D Radar, and Dual-LiDAR |
DSERT-RoLL: a dataset and fusion framework for robust multimodal perception in diverse driving conditions. |
multimodal |
|
|
| 37 |
SciLT: Long-Tailed Classification in Scientific Image Domains |
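SciLT: proposes an adaptive feature fusion and dual-supervision learning framework for long-tailed classification in scientific image domains. |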
foundation model |
|
|
| 38 |
SGTA: Scene-Graph Based Multi-Modal Traffic Agent for Video Understanding |
Proposes a scene-graph-based multimodal traffic agent (SGTA) for traffic video understanding. |
large language model |
|
|
| 39 |
A Systematic Study of Cross-Modal Typographic Attacks on Audio-Visual Reasoning |
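Proposes multimodal typographic attacks, revealing the vulnerability of large audio-visual reasoning models to cross-modal adversarial attacks. |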
large language model |
|
|
| 40 |
ATSS: Detecting AI-Generated Videos via Anomalous Temporal Self-Similarity |
Proposes the ATSS method, which detects AI-generated videos via anomalous temporal self-similarity. |
multimodal |
|
|
| 41 |
Graph-to-Frame RAG: Visual-Space Knowledge Fusion for Training-Free and Auditable Video Reasoning |
Proposes Graph-to-Frame RAG to address knowledge fusion in video reasoning. |
multimodal |
|
|
| 42 |
Don't Waste Bits! Adaptive KV-Cache Quantization for Lightweight On-Device LLMs |
Proposes an adaptive KV-cache quantization method that optimizes memory and latency for lightweight on-device LLMs. |
large language model |
|
|
| 43 |
DIRECT: Video Mashup Creation via Hierarchical Multi-Agent Planning and Intent-Guided Editing |
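Proposes the DIRECT framework, which produces high-quality video mashups through hierarchical multi-agent planning and intent-guided editing. |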
multimodal |
|
|
| 44 |
FileGram: Grounding Agent Personalization in File-System Behavioral Traces |
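FileGram: proposes an agent personalization framework grounded in file-system behavioral traces, addressing agent customization under data constraints. |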
multimodal |
|
|
| 45 |
Rethinking Model Efficiency: Multi-Agent Inference with Large Models |
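Proposes a multi-agent inference framework that combines the strengths of large and small models to improve the efficiency of vision-language models. |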
large language model |
|
|
| 46 |
VideoCoF: Unified Video Editing with Temporal Reasoner |
VideoCoF: proposes a unified video editing framework based on temporal reasoning that achieves precise edits without masks. |
chain-of-thought |
|
|
| 47 |
SubspaceAD: Training-Free Few-Shot Anomaly Detection via Subspace Modeling |
SubspaceAD: a training-free few-shot anomaly detection method based on subspace modeling. |
foundation model |
|
|
| 48 |
Event6D: Event-based Novel Object 6D Pose Tracking |
EventTrack6D: proposes an event-camera-based 6D pose tracking framework for generic objects. |
TAMP |
|
|
| 49 |
JAMMEval: A Refined Collection of Japanese Benchmarks for Reliable VLM Evaluation |
Proposes JAMMEval, a refined collection of benchmarks for reliable evaluation of Japanese VLMs. |
visual grounding |
|
|
| 50 |
VLA-InfoEntropy: A Training-Free Vision-Attention Information Entropy Approach for Vision-Language-Action Models Inference Acceleration and Success |
VLA-InfoEntropy: a training-free vision-attention information entropy method that accelerates and improves VLA model inference. |
vision-language-action VLA |
|
|
| 51 |
Let Geometry GUIDE: Layer-wise Unrolling of Geometric Priors in Multimodal LLMs |
Proposes the GUIDE framework to address spatial perception in multimodal large language models. |
large language model foundation model multimodal |
|
|
| 52 |
Analogical Reasoning as a Doctor: A Foundation Model for Gastrointestinal Endoscopy Diagnosis |
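RATNet: an analogical-reasoning-based foundation model for gastrointestinal endoscopy diagnosis that improves generalization and robustness. |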
foundation model zero-shot transfer |
|
|
| 53 |
A Unified Foundation Model for All-in-One Multi-Modal Remote Sensing Image Restoration and Fusion with Language Prompting |
Proposes LLaRS: a unified foundation model for multimodal remote sensing image restoration and fusion. |
foundation model language conditioned |
✅ |
|
| 54 |
Lightweight Multimodal Adaptation of Vision Language Models for Species Recognition and Habitat Context Interpretation in Drone Thermal Imagery |
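Proposes a lightweight multimodal adaptation framework for species recognition and habitat context interpretation in drone thermal imagery. |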
multimodal |
|
|
| 55 |
Mixture-of-Modality-Experts with Holistic Token Learning for Fine-Grained Multimodal Visual Analytics in Driver Action Recognition |
Proposes the MoME framework and HTL strategy to improve fine-grained multimodal visual analytics in driver action recognition. |
multimodal |
|
|
| 56 |
Leveraging Image Editing Foundation Models for Data-Efficient CT Metal Artifact Reduction |
Uses image editing foundation models to reduce CT metal artifacts in a data-efficient way. |
foundation model |
✅ |
|
| 57 |
PDMP: Rethinking Balanced Multimodal Learning via Performance-Dominant Modality Prioritization |
Proposes the Performance-Dominant Modality Prioritization (PDMP) strategy to address under-optimization in multimodal learning. |
multimodal |
|
|
| 58 |
SGANet: Semantic and Geometric Alignment for Multimodal Multi-view Anomaly Detection |
SGANet: a semantic and geometric alignment network for multimodal multi-view anomaly detection. |
multimodal |
|
|
| 59 |
Evaluation Before Generation: A Paradigm for Robust Multimodal Sentiment Analysis with Missing Modalities |
Proposes an evaluation-based missing-modality adaptation framework for multimodal sentiment analysis. |
multimodal |
✅ |
|
| 60 |
Prior-guided Fusion of Multimodal Features for Change Detection from Optical-SAR Images |
Proposes STSF-Net, which uses prior-guided multimodal feature fusion for change detection from optical-SAR images. |
multimodal |
✅ |
|
| 61 |
UAVReason: A Unified, Large-Scale Benchmark for Multimodal Aerial Scene Reasoning and Generation |
Proposes UAVReason: a large-scale unified benchmark for multimodal aerial scene understanding and generation. |
multimodal |
|
|
| 62 |
FoleyDesigner: Immersive Stereo Foley Generation with Precise Spatio-Temporal Alignment for Film Clips |
FoleyDesigner: proposes an immersive stereo Foley generation framework with precise spatio-temporal alignment for film clips. |
large language model TAMP |
✅ |
|
| 63 |
DetailVerifyBench: A Benchmark for Dense Hallucination Localization in Long Image Captions |
Proposes DetailVerifyBench, a benchmark for fine-grained hallucination localization in long image captions. |
large language model multimodal |
✅ |
|
| 64 |
EchoAgent: Towards Reliable Echocardiography Interpretation with "Eyes","Hands" and "Minds" |
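Proposes EchoAgent, which delivers reliable end-to-end echocardiography interpretation by emulating the coordinated "eyes, hands, and minds" of a physician. |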
large language model multimodal |
|
|
| 65 |
VideoStir: Understanding Long Videos via Spatio-Temporally Structured and Intent-Aware RAG |
VideoStir: proposes a spatio-temporally structured, intent-aware RAG framework for understanding long videos. |
large language model multimodal |
|
|
| 66 |
CoStream: Codec-Guided Resource-Efficient System for Video Streaming Analytics |
CoStream: a codec-guided, resource-efficient system for video streaming analytics. |
multimodal |
|
|
| 67 |
AICA-Bench: Holistically Examining the Capabilities of VLMs in Affective Image Content Analysis |
Proposes the AICA-Bench benchmark for holistically evaluating the capabilities of VLMs in affective image content analysis. |
multimodal |
|
|
| 68 |
Physics-Aware Video Instance Removal Benchmark |
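Proposes PVIR, a physics-aware video instance removal benchmark that evaluates how well algorithms remove instances while preserving physical consistency. |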
instruction following |
|
|
| 69 |
Few-Shot Semantic Segmentation Meets SAM3 |
Proposes an unsupervised few-shot semantic segmentation method based on SAM3. |
foundation model |
✅ |
|
| 70 |
Beyond Semantic Search: Towards Referential Anchoring in Composed Image Retrieval |
Proposes the object-anchored composed image retrieval task and the AdaFocal framework to address instance-level consistency. |
multimodal |
|
|