| 1 |
TimeSoccer: An End-to-End Multimodal Large Language Model for Soccer Commentary Generation |
Proposes TimeSoccer, an end-to-end multimodal large language model for soccer match commentary generation. |
large language model multimodal TAMP |
|
|
| 2 |
TRACE: Textual Relevance Augmentation and Contextual Encoding for Multimodal Hate Detection |
Proposes the TRACE framework to address the detection of malicious content on social media. |
multimodal visual grounding |
|
|
| 3 |
Token Sequence Compression for Efficient Multimodal Computing |
Proposes a visual token compression method based on cluster-level token aggregation to improve the efficiency of multimodal computing. |
multimodal |
|
|
| 4 |
Plasma State Monitoring and Disruption Characterization using Multimodal VAEs |
Proposes a multimodal-VAE-based method for plasma state monitoring and disruption characterization. |
multimodal |
|
|
| 5 |
Hierarchical and Multimodal Data for Daily Activity Understanding |
DARai: a hierarchical multimodal dataset for daily activity understanding that supports counterfactual activity analysis. |
multimodal |
|
|
| 6 |
Fine-tune Smarter, Not Harder: Parameter-Efficient Fine-Tuning for Geospatial Foundation Models |
Proposes parameter-efficient fine-tuning methods for geospatial foundation models that fine-tune smarter, not harder. |
foundation model |
|
|
| 7 |
Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency |
Proposes the VCBENCH benchmark to evaluate LVLMs on multimodal mathematical reasoning with explicit visual dependency. |
multimodal |
✅ |
|
| 8 |
MASR: Self-Reflective Reasoning through Multimodal Hierarchical Attention Focusing for Agent-based Video Understanding |
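Proposes the MASR framework, which improves agent-based video understanding through self-reflective reasoning with multimodal hierarchical attention focusing. |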
multimodal |
|
|
| 9 |
FashionM3: Multimodal, Multitask, and Multiround Fashion Assistant based on Unified Vision-Language Model |
FashionM3: a multiround, multitask fashion assistant based on a unified vision-language model. |
multimodal |
|
|
| 10 |
Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models |
Proposes Token-Shuffle to improve the efficiency and quality of autoregressive models in high-resolution image generation. |
large language model multimodal |
|
|
| 11 |
DiMeR: Disentangled Mesh Reconstruction Model |
DiMeR: a disentangled mesh reconstruction model for sparse-view 3D reconstruction. |
foundation model |
|
|
| 12 |
FRAG: Frame Selection Augmented Generation for Long Video and Long Document Understanding |
Proposes FRAG, a frame-selection-augmented generation framework for long video and long document understanding. |
multimodal |
✅ |
|
| 13 |
TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos |
Proposes TimeChat-Online, which addresses redundancy in online video streams through differential token dropping. |
large language model |
|
|
| 14 |
VEU-Bench: Towards Comprehensive Understanding of Video Editing |
Proposes VEU-Bench to evaluate and improve the video-editing understanding of video large language models. |
large language model |
|
|