| 1 |
SAMoE-VLA: A Scene Adaptive Mixture-of-Experts Vision-Language-Action Model for Autonomous Driving |
Proposes SAMoE-VLA, which uses a scene-adaptive MoE to improve the performance and safety of VLA models for autonomous driving. |
world model vision-language-action VLA |
|
|
| 2 |
Toward Unified Multimodal Representation Learning for Autonomous Driving |
Proposes a contrastive tensor pre-training framework for unified multimodal representation learning in autonomous driving. |
representation learning contrastive learning scene understanding |
|
|
| 3 |
SGG-R$^{\rm 3}$: From Next-Token Prediction to End-to-End Unbiased Scene Graph Generation |
Proposes SGG-R$^{\rm 3}$ to address bias and sparsity in scene graph generation. |
reinforcement learning large language model multimodal |
|
|
| 4 |
MINT: Molecularly Informed Training with Spatial Transcriptomics Supervision for Pathology Foundation Models |
MINT: molecularly informed training of pathology foundation models with spatial transcriptomics supervision. |
distillation foundation model |
|
|
| 5 |
Not Like Transformers: Drop the Beat Representation for Dance Generation with Mamba-Based Diffusion Model |
Proposes MambaDance, built on Mamba and a diffusion model, to address temporal modeling and beat synchronization in dance generation. |
Mamba human motion |
✅ |
|
| 6 |
Geometric Transformation-Embedded Mamba for Learned Video Compression |
Proposes a geometric transformation-embedded Mamba model to improve the performance of learned video compression. |
Mamba motion estimation |
✅ |
|
| 7 |
BuildMamba: A Visual State-Space Based Model for Multi-Task Building Segmentation and Height Estimation from Satellite Images |
BuildMamba: a visual state-space model for multi-task building segmentation and height estimation from satellite images. |
Mamba monocular depth |
|
|
| 8 |
It's Time to Get It Right: Improving Analog Clock Reading and Clock-Hand Spatial Reasoning in Vision-Language Models |
Proposes TickTockVQA to address the challenges vision-language models face in reading analog clocks. |
DPO direct preference optimization multimodal |
|
|
| 9 |
ER-Pose: Rethinking Keypoint-Driven Representation Learning for Real-Time Human Pose Estimation |
ER-Pose: rethinks keypoint-driven single-stage human pose estimation to improve accuracy and efficiency. |
representation learning |
|
|
| 10 |
SPIRAL: A Closed-Loop Framework for Self-Improving Action World Models via Reflective Planning Agents |
SPIRAL: a closed-loop framework for self-improving action world models via reflective planning agents. |
world model |
|
|
| 11 |
SAVE: Speech-Aware Video Representation Learning for Video-Text Retrieval |
Proposes the SAVE model, which improves video-text retrieval through speech-aware video representation learning. |
representation learning |
|
|
| 12 |
MM-TS: Multi-Modal Temperature and Margin Schedules for Contrastive Learning with Long-Tail Data |
MM-TS: a method for dynamically adjusting temperature and margin in multimodal contrastive learning on long-tail data. |
contrastive learning |
|
|
| 13 |
ImageEdit-R1: Boosting Multi-Agent Image Editing via Reinforcement Learning |
ImageEdit-R1: a reinforcement-learning-driven multi-agent image editing framework. |
reinforcement learning |
|
|
| 14 |
Missing No More: Dictionary-Guided Cross-Modal Image Fusion under Missing Infrared |
Proposes a dictionary-guided cross-modal image fusion framework for image fusion when the infrared modality is missing. |
representation learning large language model |
✅ |
|