| 1 | Enhancing Multimodal Large Language Models for Safety-Critical Driving Video Analysis | Proposes an MLLM enhancement scheme that fuses multimodal information for safety-critical driving video analysis. | large language model, multimodal | | |
| 2 | PointLLM-R: Enhancing 3D Point Cloud Reasoning via Chain-of-Thought | PointLLM-R: enhances 3D point cloud reasoning via chain-of-thought. | multimodal, instruction following, chain-of-thought | | |
| 3 | AgroTools: A Benchmark for Tool-Augmented Multimodal Agents in Agriculture | AgroTools: a benchmark for tool-augmented multimodal agents in agriculture. | large language model, multimodal | ✅ | |
| 4 | Seizure-Semiology-Suite (S3): A Clinically Multimodal Dataset, Benchmark, and Models for Seizure Semiology Understanding | Proposes the Seizure-Semiology-Suite dataset and benchmark for evaluating and improving multimodal large models' understanding of seizure semiology. | large language model, multimodal | | |
| 5 | Rethinking Noise-Robust Training for Frozen Vision Foundation Models: A Cross-Dataset Benchmark with a Case Study of Small-Loss Failure | Noise-robust training for frozen vision foundation models: a cross-dataset benchmark and a case study of small-loss failure. | foundation model | | |
| 6 | MOTOR: A Multimodal Dataset for Two-Wheeler Rider Behavior Understanding | Proposes the MOTOR dataset for two-wheeler rider behavior understanding. | multimodal | | |
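| 7 | Case-Aware Medical Image Classification with Multimodal Knowledge Graphs and Reliability-Guided Refinement | Proposes a case-aware medical image classification framework based on multimodal knowledge graphs and reliability-guided refinement. | multimodal | | |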
| 8 | Bernini: Latent Semantic Planning for Video Diffusion | Bernini: proposes a video diffusion model based on latent semantic planning for high-quality video generation and editing. | large language model, multimodal, chain-of-thought | | |
| 9 | Accelerating Vision Foundation Models with Drop-in Depthwise Convolution | Proposes a depthwise-convolution replacement to accelerate vision foundation models. | foundation model | | |
| 10 | VISTA: Validation-Guided Integration of Spatial and Temporal Foundation Models with Anatomical Decoding for Rare-Pathology VCE Event Detection -- after competition results | VISTA: integrates spatial and temporal foundation models with anatomical decoding for rare-pathology VCE event detection. | foundation model | | |
| 11 | AgroVG: A Large-Scale Multi-Source Benchmark for Agricultural Visual Grounding | AgroVG: a large-scale multi-source benchmark dataset for agricultural visual grounding. | visual grounding | | |
| 12 | Two-Stage Multimodal Framework for Emotion Mimicry Intensity Prediction | Proposes a two-stage multimodal fusion framework for emotion mimicry intensity prediction, placing third in the Hume-ABAW10 challenge. | multimodal | | |
| 13 | Learning Emergent Modular Representations in Multi-modality Medical Vision Foundation Models | Proposes the Director-Experts (DEX) model to address representation collapse caused by non-IID features in multi-modality medical imaging. | foundation model | ✅ | |
| 14 | GLeVE: Graph-Guided Lesion Grounding with Proposal Verification in 3D CT | Proposes the GLeVE framework, which uses graph guidance and proposal verification to localize lesions precisely in 3D CT images. | foundation model, multimodal | | |
| 15 | VGenST-Bench: A Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis | Proposes VGenST-Bench, which evaluates spatio-temporal reasoning in multimodal large language models through active video synthesis. | large language model, multimodal | | |
| 16 | GeoWeaver: Grounding Visual Tokens with Geometric Evidence before Scene Reasoning | GeoWeaver: proposes a pre-reasoning geometric grounding framework that improves spatio-temporal reasoning in vision-language models. | large language model, multimodal | ✅ | |
| 17 | FashionLens: Toward Versatile Fashion Image Retrieval via Task-Adaptive Learning | Proposes FashionLens to address diverse fashion image retrieval tasks. | large language model, multimodal | ✅ | |
| 18 | EvoIR-Agent: Self-Evolving Image Restoration Agentic System via Experience-Driven Learning | Proposes EvoIR-Agent, a self-evolving image restoration agentic system built on experience-driven learning. | large language model, multimodal | | |
| 19 | Zero-Shot Temporal Action Localization Through Textual Guidance | Proposes TEGU, which uses textual guidance for zero-shot temporal action localization without training data. | large language model, zero-shot transfer | | |
| 20 | MLLMs Know When Before Speaking: Revealing and Recovering Temporal Grounding via Attention Cues | Reveals and recovers temporal grounding via attention cues, improving MLLM performance on video temporal grounding tasks. | large language model, multimodal | ✅ | |
| 21 | Cambrian-P: Pose-Grounded Video Understanding | Cambrian-P: proposes a camera-pose-grounded multimodal video understanding model that improves spatial reasoning. | multimodal | | |
| 22 | DecQ: Detail-Condensing Queries for Enhanced Reconstruction and Generation in Representation Autoencoders | DecQ: uses detail-condensing queries to enhance reconstruction and generation in representation autoencoders. | foundation model | | |
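| 23 | Conceptualizing Embeddings: Sparse Disentanglement for Vision-Language Models | Proposes CEDAR, which improves the interpretability of vision-language model embeddings through a sparse disentangling transformation, without increasing dimensionality. | multimodal | | |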
| 24 | SceneAligner: 3D-Grounded Floorplan Localization in the Wild | SceneAligner: a floorplan localization method for in-the-wild settings based on 3D scene reconstruction. | foundation model | | |
| 25 | Translating Signals to Languages for sEMG-Based Activity Recognition | Proposes the LLM-sEMG framework, which uses large language models for high-accuracy sEMG-based activity recognition. | large language model | | |
| 26 | Direct content-based retrieval from music scores images | Proposes a direct content-based retrieval method for music score images, improving music information retrieval efficiency. | large language model | | |
| 27 | EventGait: Towards Robust Gait Recognition with Event Streams | EventGait: uses event streams for robust gait recognition, performing especially well in low-light conditions. | foundation model | ✅ | |
| 28 | GenHAR: Generalizing Cross-domain Human Activity Recognition for Last-mile Delivery | GenHAR: a generalization framework for cross-domain human activity recognition in last-mile delivery. | foundation model | ✅ | |
| 29 | Thermo-VL: Extending Vision-Language Models to Thermal Infrared Perception | Thermo-VL: extends vision-language models to thermal infrared perception, improving scene understanding in low-light conditions. | visual grounding | ✅ | |