| 1 |
EasyTune: Efficient Step-Aware Fine-Tuning for Diffusion-Based Motion Generation |
EasyTune: an efficient step-aware fine-tuning method for diffusion-based motion generation. |
preference learning motion generation |
✅ |
|
| 2 |
MambaFusion: Adaptive State-Space Fusion for Multimodal 3D Object Detection |
MambaFusion: adaptive state-space fusion for multimodal 3D object detection |
Mamba SSM multimodal |
|
|
| 3 |
Multimodal Information Fusion for Chart Understanding: A Survey of MLLMs -- Evolution, Limitations, and Cognitive Enhancement |
A survey of MLLMs for chart understanding: evolution, limitations, and cognitive enhancement |
reinforcement learning large language model multimodal |
|
|
| 4 |
ViT-5: Vision Transformers for The Mid-2020s |
ViT-5: a stronger Vision Transformer backbone for mid-2020s vision tasks, achieved through architectural improvements. |
representation learning foundation model |
|
|
| 5 |
MIND: Benchmarking Memory Consistency and Action Control in World Models |
MIND: a comprehensive benchmark for evaluating memory consistency and action control in world models |
world model |
✅ |
|
| 6 |
Geometry-Aware Rotary Position Embedding for Consistent Video World Model |
Proposes ViewRope, which improves the long-term consistency of video world models through geometry-aware rotary position embedding |
world model |
|
|
| 7 |
PAND: Prompt-Aware Neighborhood Distillation for Lightweight Fine-Grained Visual Classification |
Proposes PAND: prompt-aware neighborhood distillation for lightweight fine-grained image classification |
distillation |
✅ |
|
| 8 |
Robustness of Vision Language Models Against Split-Image Harmful Input Attacks |
Proposes the SIVA attack, revealing the vulnerability of vision language models to split-image harmful inputs |
RLHF distillation |
|
|