| 1 |
VideoWeaver: Multimodal Multi-View Video-to-Video Transfer for Embodied Agents |
VideoWeaver: a multimodal, multi-view video-to-video transfer framework for embodied agents |
policy learning egocentric embodied AI |
|
|
| 2 |
Hierarchy-Guided Multimodal Representation Learning for Taxonomic Inference |
Proposes hierarchy-guided multimodal representation learning to address biodiversity recognition |
representation learning foundation model multimodal |
|
|
| 3 |
Bridging Perception and Reasoning: Token Reweighting for RLVR in Multimodal LLMs |
Proposes a token-reweighting strategy to improve the perception and reasoning abilities of multimodal LLMs on RLVR tasks |
reinforcement learning large language model multimodal |
|
|
| 4 |
MSRL: Scaling Generative Multimodal Reward Modeling via Multi-Stage Reinforcement Learning |
Proposes MSRL, a multi-stage reinforcement learning approach for scaling the training of generative multimodal reward models. |
reinforcement learning distillation multimodal |
✅ |
|
| 5 |
GDPO-Listener: Expressive Interactive Head Generation via Auto-Regressive Flow Matching and Group reward-Decoupled Policy Optimization |
GDPO-Listener: expressive interactive head generation via auto-regressive flow matching and group reward-decoupled policy optimization |
flow matching motion generation dyadic interaction |
|
|
| 6 |
Multimodal Dataset Distillation via Phased Teacher Models |
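Proposes the PTM-ST framework to address the insufficient capture of the teacher model's dynamically evolving knowledge in multimodal dataset distillation. |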
distillation multimodal |
✅ |
|
| 7 |
Out of Sight but Not Out of Mind: Hybrid Memory for Dynamic Video World Models |
Proposes a hybrid memory mechanism to address subjects disappearing and reappearing in dynamic video world models |
world model world models spatiotemporal |
|
|
| 8 |
Label What Matters: Modality-Balanced and Difficulty-Aware Multimodal Active Learning |
Proposes the RL-MBA framework to address modality balance and dynamically changing sample difficulty in multimodal active learning. |
reinforcement learning multimodal |
|
|
| 9 |
Vega: Learning to Drive with Natural Language Instructions |
Proposes the Vega model, which enables personalized autonomous driving through natural language instructions. |
world model world models vision-language-action |
|
|
| 10 |
LanteRn: Latent Visual Structured Reasoning |
LanteRn: proposes a latent-space visual structured reasoning framework to improve the visual understanding of multimodal models |
reinforcement learning multimodal visual grounding |
|
|
| 11 |
CLIP-RD: Relational Distillation for Efficient CLIP Knowledge Distillation |
Proposes CLIP-RD, which improves the efficiency of CLIP knowledge distillation through relational distillation. |
contrastive learning teacher-student distillation |
|
|
| 12 |
VideoTIR: Accurate Understanding for Long Videos with Efficient Tool-Integrated Reasoning |
VideoTIR: uses reinforcement learning and tool-integrated reasoning to improve the accuracy and efficiency of long-video understanding |
reinforcement learning large language model multimodal |
|
|
| 13 |
TIGFlow-GRPO: Trajectory Forecasting via Interaction-Aware Flow Matching and Reward-Driven Optimization |
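Proposes the TIGFlow-GRPO framework, which uses interaction-aware flow matching and reward-driven optimization to produce trajectory forecasts that better respect social norms and physical constraints. |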
flow matching multimodal |
|
|
| 14 |
Towards Controllable Low-Light Image Enhancement: A Continuous Multi-illumination Dataset and Efficient State Space Framework |
Proposes a controllable low-light image enhancement method to address the shortcomings of existing methods |
SSM state space model multimodal |
|
|
| 15 |
FD$^2$: A Dedicated Framework for Fine-Grained Dataset Distillation |
Proposes the FD$^2$ framework for fine-grained dataset distillation, improving few-shot learning performance. |
distillation |
|
|
| 16 |
AnyDoc: Enhancing Document Generation via Large-Scale HTML/CSS Data Synthesis and Height-Aware Reinforcement Optimization |
AnyDoc: enhances document generation through large-scale HTML/CSS data synthesis and height-aware reinforcement optimization |
reinforcement learning large language model |
|
|
| 17 |
MoE-GRPO: Optimizing Mixture-of-Experts via Reinforcement Learning in Vision-Language Models |
Proposes MoE-GRPO, which uses reinforcement learning to optimize expert routing in MoE-VLMs and improve multimodal understanding. |
reinforcement learning |
|
|
| 18 |
Towards Video Anomaly Detection from Event Streams: A Baseline and Benchmark Datasets |
Proposes the EWAD framework to address data sparsity and model-training difficulties in video anomaly detection from event streams. |
distillation spatiotemporal |
|
|
| 19 |
Image Rotation Angle Estimation: Comparing Circular-Aware Methods |
Compares five circular-aware methods for image rotation angle estimation and validates the effectiveness of probabilistic methods. |
Mamba MAE |
|
|
| 20 |
Learning to Rank Caption Chains for Video-Text Alignment |
Proposes a ranking-optimization-based video-text alignment method that improves the quality of long-text generation. |
DPO direct preference optimization |
|
|
| 21 |
Reinforcing Structured Chain-of-Thought for Video Understanding |
Proposes a Summary-Driven RL framework to strengthen the reasoning ability and generalization of MLLMs in video understanding |
reinforcement learning large language model chain-of-thought |
|
|
| 22 |
Collision-Aware Vision-Language Learning for End-to-End Driving with Multimodal Infraction Datasets |
Proposes the VLAAD and CARLA-Collide datasets to improve collision avoidance in end-to-end autonomous driving. |
representation learning multimodal |
|
|
| 23 |
LEMON: a foundation model for nuclear morphology in Computational Pathology |
LEMON: a foundation model for nuclear morphology in computational pathology |
representation learning foundation model |
✅ |
|
| 24 |
GazeQwen: Lightweight Gaze-Conditioned LLM Modulation for Streaming Video Understanding |
GazeQwen: a lightweight gaze-aware LLM modulation method for streaming video understanding |
JEPA large language model multimodal |
✅ |
|
| 25 |
CLIP-RD: Relational Distillation for Efficient CLIP Knowledge Distillation |
Proposes CLIP-RD, which improves the efficiency of CLIP knowledge distillation through relational distillation. |
contrastive learning teacher-student distillation |
|
|
| 26 |
Geo$^\textbf{2}$: Geometry-Guided Cross-view Geo-Localization and Image Synthesis |
Geo$^2$: proposes a unified geometry-guided framework for cross-view geo-localization and image synthesis, achieving SOTA performance. |
flow matching VGGT foundation model |
|
|
| 27 |
World Reasoning Arena |
Proposes WR-Arena for evaluating world models on action simulation, long-horizon prediction, and reasoning and planning. |
world model world models physically plausible |
✅ |
|
| 28 |
DiReCT: Disentangled Regularization of Contrastive Trajectories for Physics-Refined Video Generation |
DiReCT: disentangled regularization of contrastive trajectories, improving the quality of physics-constrained video generation |
flow matching contrastive learning |
|
|