| 1 |
SpatialBench: Is Your Spatial Foundation Model an All-Round Player? |
SpatialBench: a cross-domain, multi-task benchmark for evaluating the generalization ability of spatial foundation models. |
representation learning egocentric foundation model |
|
|
| 2 |
FoundObj: Self-supervised Foundation Models as Rewards for Label-free 3D Object Segmentation |
FoundObj: label-free 3D object segmentation using rewards from self-supervised foundation models |
reinforcement learning foundation model |
|
|
| 3 |
Gemini Embedding 2: A Native Multimodal Embedding Model from Gemini |
Gemini Embedding 2: a native multimodal embedding model that gives video, audio, images, and text a unified representation |
contrastive learning multimodal |
|
|
| 4 |
O-MARC: Omni Memory-Augmented Compression Distillation for Efficient Video Understanding |
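Proposes the O-MARC framework, which uses compression distillation to improve the efficiency and performance of multimodal large models in video understanding. |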
distillation large language model multimodal |
|
|
| 5 |
Re-M3Dr: Rebalanced MultiModal Mean Deviation Regression |
Proposes Re-M3Dr to address performance degradation in visual field defect assessment with multimodal medical image fusion |
contrastive learning multimodal |
|
|
| 6 |
DinoComplete: 3D Shape Completion with Distilled Semantic Priors and State Space Models |
DinoComplete: 3D shape completion using distilled semantic priors and state space models |
Mamba state space model foundation model |
|
|
| 7 |
OmniRetriever: Any-to-Any Audio-Video-Text Retrieval via Fusion-as-Teacher Distillation |
Proposes OmniRetriever, built on fusion-as-teacher distillation, for any-to-any audio-video-text retrieval |
distillation multimodal |
|
|
| 8 |
JetViT: Efficient High-Resolution Vision Transformer with Post-Training Attention Search |
JetViT: an efficient high-resolution Vision Transformer obtained through post-training attention search |
linear attention Depth Anything foundation model |
|
|
| 9 |
LongCat-Video-Avatar 1.5 Technical Report |
LongCat-Video-Avatar 1.5: an open-source audio-driven video generation framework for commercial-grade applications |
RLHF distillation multi-person interaction |
|
|
| 10 |
PlayClass: Automated Play Behaviour Classification in Poultry |
PlayClass: a pipeline method for automated classification of play behaviour in poultry |
JEPA foundation model |
|
|
| 11 |
Touch-R1: Reinforcing Touch Reasoning in MLLMs |
Touch-R1: strengthens touch reasoning in multimodal large models through tactile reinforcement learning |
reinforcement learning multimodal |
|
|
| 12 |
Semi-Supervised Gaze Estimation via Disentangled Subspace Contrastive Learning |
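Proposes a semi-supervised gaze estimation method based on disentangled subspace contrastive learning that improves domain generalization. |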
contrastive learning |
✅ |
|
| 13 |
REVERSE: Reinforcing Evidence Verification and Search for Agentic Image geo-localization |
Proposes the REVERSE framework, which achieves agentic image geo-localization by reinforcing evidence verification and search |
reinforcement learning visual grounding |
✅ |
|
| 14 |
InterSketch: An Interleaved Reasoning Model with Self-correcting Visual Sketch and Stepwise Reward |
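InterSketch: an interleaved reasoning model based on self-correcting visual sketches and stepwise rewards that improves the performance of vision-language models on complex visual reasoning tasks. |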
reinforcement learning chain-of-thought |
|
|
| 15 |
JLT: Clean-Latent Prediction in Latent Diffusion Transformers |
JLT: improves image generation quality in latent diffusion Transformers through clean-latent prediction |
flow matching classifier-free guidance |
|
|
| 16 |
Triadic Dynamics Aware Diffusion Posterior Sampling for Inverse Problems: Optimizing Guidance and Stochasticity Schedules |
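Proposes TriPS, which optimizes the guidance and stochasticity schedules of diffusion posterior sampling and significantly improves imaging results for inverse problems. |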
reinforcement learning classifier-free guidance |
|
|