| 1 |
Vision Foundation Models for Domain Generalisable Cross-View Localisation in Planetary Ground-Aerial Robotic Teams |
Proposes a cross-view localisation method based on vision foundation models for planetary ground-aerial robot collaboration. |
foundation model |
|
|
| 2 |
LP-LLM: End-to-End Real-World Degraded License Plate Text Recognition via Large Multimodal Models |
Proposes LP-LLM, which uses large models to recognise degraded license plates end to end in real-world scenes. |
multimodal |
|
|
| 3 |
CogRail: Benchmarking VLMs in Cognitive Intrusion Perception for Intelligent Railway Transportation Systems |
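CogRail: builds a benchmark for cognitive perception of railway intrusions and proposes a joint fine-tuning framework to improve VLM performance. |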
foundation model multimodal |
✅ |
|
| 4 |
Video-MSR: Benchmarking Multi-hop Spatial Reasoning Capabilities of MLLMs |
Video-MSR: the first benchmark for evaluating multi-step spatial reasoning in dynamic videos. |
large language model multimodal |
✅ |
|
| 5 |
See More, Store Less: Memory-Efficient Resolution for Video Moment Retrieval |
Proposes the SMORE framework to address memory efficiency in video moment retrieval. |
large language model multimodal |
|
|
| 6 |
Multi-Modal LLM based Image Captioning in ICT: Bridging the Gap Between General and Industry Domain |
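Proposes an image captioning model for the ICT domain, trained in multiple progressive stages, that improves understanding of domain knowledge. |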
large language model multimodal |
|
|
| 7 |
Hot-Start from Pixels: Low-Resolution Visual Tokens for Chinese Language Modeling |
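Proposes a Chinese language modeling method based on low-resolution pixels that makes effective use of the visual information in Chinese characters. |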
large language model |
|
|
| 8 |
Beyond the final layer: Attentive multilayer fusion for vision transformers |
Proposes an attention-based multilayer fusion method that improves the linear probing performance of Vision Transformers. |
foundation model |
|
|
| 9 |
PhyRPR: Training-Free Physics-Constrained Video Generation |
Proposes PhyRPR to address video generation under physical constraints. |
multimodal |
|
|
| 10 |
Towards Open Environments and Instructions: General Vision-Language Navigation via Fast-Slow Interactive Reasoning |
Proposes Slow4fast-VLN, which achieves general vision-language navigation through fast-slow interactive reasoning. |
VLN |
|
|