| 1 |
Empowering Video Translation using Multimodal Large Language Models |
Uses multimodal large language models to empower video translation, overcoming the limitations of traditional pipelines. |
large language model multimodal |
|
|
| 2 |
BoxTuning: Directly Injecting the Object Box for Multimodal Model Fine-Tuning |
BoxTuning: fine-tunes multimodal models by directly injecting object box information, improving video question answering performance |
large language model multimodal |
|
|
| 3 |
Pseudo-Unification: Entropy Probing Reveals Divergent Information Patterns in Unified Multimodal Models |
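Proposes an entropy-probing diagnostic framework for pseudo-unification, revealing inconsistent information flow in unified multimodal models |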
large language model multimodal |
|
|
| 4 |
LARY: A Latent Action Representation Yielding Benchmark for Generalizable Vision-to-Action Alignment |
LARY: a latent action representation benchmark for generalizable vision-to-action alignment |
vision-language-action VLA foundation model |
|
|
| 5 |
HuiYanEarth-SAR: A Foundation Model for High-Fidelity and Low-Cost Global Remote Sensing Imagery Generation |
HuiYanEarth-SAR: the first foundation model to generate high-fidelity global SAR imagery from geographic coordinates |
foundation model |
|
|
| 6 |
MedP-CLIP: Medical CLIP with Region-Aware Prompt Integration |
MedP-CLIP: a medical CLIP model with region-aware prompt integration, improving fine-grained understanding of medical images |
large language model multimodal zero-shot transfer |
|
|
| 7 |
Ambivalence/Hesitancy Recognition in Videos for Personalized Digital Health Interventions |
Explores deep learning for recognizing ambivalence/hesitancy in videos, for personalized digital health interventions |
large language model multimodal |
|
|
| 8 |
POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs |
POINTS-Long: proposes a dual-mode visual reasoning MLLM that addresses visual token scalability in long-video and streaming scenarios. |
large language model multimodal |
|
|
| 9 |
MLLM-as-a-Judge Exhibits Model Preference Bias |
Proposes Philautia-Eval to evaluate preference bias in MLLMs, and mitigates that bias with the Pomms ensemble model. |
large language model multimodal |
|
|
| 10 |
Reasoning Resides in Layers: Restoring Temporal Reasoning in Video-Language Models with Layer-Selective Merging |
Proposes MERIT, which restores temporal reasoning in video-language models through layer-selective model merging. |
large language model multimodal |
|
|
| 11 |
Script-a-Video: Deep Structured Audio-visual Captions via Factorized Streams and Relational Grounding |
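Proposes a multi-stream scene script, MTSS, which decouples video information to improve multimodal large language models on video understanding and generation tasks. |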
large language model multimodal |
|
|
| 12 |
rPPG-VQA: A Video Quality Assessment Framework for Unsupervised rPPG Training |
Proposes the rPPG-VQA framework for assessing video quality and improving unsupervised rPPG training |
large language model multimodal |
✅ |
|
| 13 |
Semantic-Geometric Dual Compression: Training-Free Visual Token Reduction for Ultra-High-Resolution Remote Sensing Understanding |
Proposes DualComp, which performs efficient, task-adaptive visual token compression for ultra-high-resolution remote sensing imagery. |
large language model multimodal |
|
|
| 14 |
Test-time Scaling over Perception: Resolving the Grounding Paradox in Thinking with Images |
Proposes the TTSP framework, which resolves the Grounding Paradox in multimodal large models through test-time scaling over perception |
large language model multimodal |
|
|
| 15 |
Panoptic Pairwise Distortion Graph |
Proposes Distortion Graph, a region-based structured representation for fine-grained quality assessment of image pairs. |
large language model multimodal |
|
|
| 16 |
TraversalBench: Challenging Paths to Follow for Vision Language Models |
TraversalBench: a new benchmark for evaluating the reasoning ability of vision-language models on complex visual paths |
multimodal visual grounding |
|
|
| 17 |
SVD-Prune: Training-Free Token Pruning For Efficient Vision-Language Models |
Proposes SVD-Prune, a training-free token pruning method for vision-language models that improves efficiency. |
multimodal |
|
|
| 18 |
Sign Language Recognition in the Age of LLMs |
Explores the capability of LLMs in zero-shot sign language recognition, revealing the importance of model scale and data diversity |
multimodal |
|
|
| 19 |
Hierarchical Textual Knowledge for Enhanced Image Clustering |
Proposes the KEC method, which uses hierarchical textual knowledge to enhance image clustering |
large language model |
|
|
| 20 |
ReSpinQuant: Efficient Layer-Wise LLM Quantization via Subspace Residual Rotation Approximation |
ReSpinQuant: efficient layer-wise LLM quantization via subspace residual rotation approximation |
large language model |
|
|
| 21 |
ArtiCAD: Articulated CAD Assembly Design via Multi-Agent Code Generation |
ArtiCAD: articulated CAD assembly design based on multi-agent code generation |
embodied AI |
|
|