| 1 | All You Need for Object Detection: From Pixels, Points, and Prompts to Next-Gen Fusion and Multimodal LLMs/VLMs in Autonomous Vehicles | A survey of next-generation fusion-based object detection for autonomous driving, built on multimodal LLMs/VLMs | large language model, multimodal | | |
| 2 | OracleAgent: A Multimodal Reasoning Agent for Oracle Bone Script Research | OracleAgent: a multimodal reasoning agent system for oracle bone script research | large language model, multimodal | | |
| 3 | AD-SAM: Fine-Tuning the Segment Anything Vision Foundation Model for Autonomous Driving Perception | AD-SAM: fine-tuning the SAM vision foundation model for autonomous driving perception | foundation model | | |
| 4 | ProstNFound+: A Prospective Study using Medical Foundation Models for Prostate Cancer Detection | ProstNFound+: a prospective study using medical foundation models for prostate cancer detection | foundation model | | |
| 5 | FARM: Fine-Tuning Geospatial Foundation Models for Intra-Field Crop Yield Regression | FARM: fine-tuning geospatial foundation models for intra-field crop yield regression | foundation model | | |
| 6 | SpinalSAM-R1: A Vision-Language Multimodal Interactive System for Spine CT Segmentation | SpinalSAM-R1: a vision-language multimodal interactive system for spine CT segmentation | multimodal | ✅ | |
| 7 | MoME: Mixture of Visual Language Medical Experts for Medical Imaging Segmentation | Proposes MoME, a vision-language mixture-of-experts model for medical image segmentation | large language model, foundation model | | |
| 8 | WOD-E2E: Waymo Open Dataset for End-to-End Driving in Challenging Long-tail Scenarios | WOD-E2E: a Waymo Open Dataset targeting long-tail scenarios in end-to-end driving | large language model, multimodal | | |
| 9 | Semantic Frame Aggregation-based Transformer for Live Video Comment Generation | Proposes a live video comment generation model based on a semantic frame aggregation Transformer, improving comment relevance | multimodal | | |
| 10 | OmniX: From Unified Panoramic Generation and Perception to Graphics-Ready 3D Scenes | OmniX: uses panoramic generation and perception to produce 3D scenes ready for graphics rendering | multimodal | | |
| 11 | SteerVLM: Robust Model Control through Lightweight Activation Steering for Vision Language Models | SteerVLM: robust control of vision-language models through lightweight activation steering | multimodal | | |
| 12 | Representation-Level Counterfactual Calibration for Debiased Zero-Shot Recognition | Proposes a representation-level counterfactual calibration method to address context bias in zero-shot recognition | multimodal | | |
| 13 | Which Way Does Time Flow? A Psychophysics-Grounded Evaluation for Vision-Language Models | Proposes the AoT-PsyPhyBENCH benchmark to evaluate how well vision-language models understand the direction of time flow in videos | multimodal | | |
| 14 | ConceptScope: Characterizing Dataset Bias via Disentangled Visual Concepts | ConceptScope: quantifies and identifies dataset bias through disentangled visual concept representations | foundation model | | |