| 22 |
SAM-R1: Leveraging SAM for Reward Feedback in Multimodal Segmentation via Reinforcement Learning |
Proposes SAM-R1, which leverages reinforcement learning and SAM to improve reasoning in multimodal image segmentation |
reinforcement learning multimodal |
|
|
| 23 |
Rhetorical Text-to-Image Generation via Two-layer Diffusion Policy Optimization |
Proposes Rhet2Pix, which tackles rhetorical text-to-image generation via two-layer diffusion policy optimization |
diffusion policy large language model multimodal |
|
|
| 24 |
SemIRNet: A Semantic Irony Recognition Network for Multimodal Sarcasm Detection |
Proposes SemIRNet, which improves multimodal sarcasm recognition via knowledge fusion and cross-modal similarity detection |
contrastive learning multimodal |
|
|
| 25 |
OmniAD: Detect and Understand Industrial Anomaly via Multimodal Reasoning |
OmniAD: detects and understands industrial anomalies via multimodal reasoning |
reinforcement learning multimodal |
|
|
| 26 |
Research on Driving Scenario Technology Based on Multimodal Large Lauguage Model Optimization |
Proposes a multimodal large language model optimization method to improve scene perception for autonomous driving |
distillation multimodal |
|
|
| 27 |
IMTS is Worth Time $\times$ Channel Patches: Visual Masked Autoencoders for Irregular Multivariate Time Series Prediction |
VIMTS: applies visual masked autoencoders to irregular multivariate time series prediction, improving robustness to missing data |
masked autoencoder (MAE) foundation model |
✅ |
|
| 28 |
RiverMamba: A State Space Model for Global River Discharge and Flood Forecasting |
RiverMamba: a state space model for global river discharge and flood forecasting |
Mamba state space model |
✅ |
|
| 29 |
Zooming from Context to Cue: Hierarchical Preference Optimization for Multi-Image MLLMs |
Proposes Context-to-Cue DPO, which mitigates hallucination in multi-image MLLMs and improves multimodal understanding |
DPO (direct preference optimization) large language model |
|
|
| 30 |
GeoDrive: 3D Geometry-Informed Driving World Model with Precise Action Control |
GeoDrive: a driving world model that incorporates 3D geometric information for precise action control |
world model geometric consistency |
|
|
| 31 |
cadrille: Multi-modal CAD Reconstruction with Online Reinforcement Learning |
Cadrille: a multimodal CAD reconstruction model trained with online reinforcement learning for more accurate 3D model generation |
reinforcement learning large language model |
|
|
| 32 |
Improving Contrastive Learning for Referring Expression Counting |
Proposes the C-REX contrastive learning framework, improving discriminative representation learning for referring expression counting |
representation learning MAE contrastive learning |
✅ |
|
| 33 |
RICO: Improving Accuracy and Completeness in Image Recaptioning via Visual Reconstruction |
RICO: improves the accuracy and completeness of image recaptioning via visual reconstruction |
DPO large language model multimodal |
✅ |
|
| 34 |
Self-Reflective Reinforcement Learning for Diffusion-based Image Reasoning Generation |
Proposes SRRL, a self-reflective reinforcement learning algorithm for reasoning-capable image generation with diffusion models |
reinforcement learning chain-of-thought |
✅ |
|
| 35 |
D-Fusion: Direct Preference Optimization for Aligning Diffusion Models with Visually Consistent Samples |
D-Fusion: aligns diffusion models with visually consistent samples via direct preference optimization |
reinforcement learning DPO (direct preference optimization) |
|
|
| 36 |
CAST: Contrastive Adaptation and Distillation for Semi-Supervised Instance Segmentation |
Proposes the CAST framework, improving semi-supervised instance segmentation via contrastive adaptation and distillation |
distillation foundation model |
|
|
| 37 |
Q-VDiT: Towards Accurate Quantization and Distillation of Video-Generation Diffusion Transformers |
Q-VDiT: an accurate quantization and distillation framework for video-generation diffusion Transformers |
distillation spatiotemporal |
✅ |
|
| 38 |
Learning World Models for Interactive Video Generation |
Proposes VRAG, a world model for interactive long-video generation via video retrieval-augmented generation |
world model spatiotemporal |
|
|
| 39 |
Dynamic-Aware Video Distillation: Optimizing Temporal Resolution Based on Video Semantics |
Proposes DAViD, a reinforcement-learning-based dynamic-aware video distillation method that optimizes the temporal resolution of video datasets |
reinforcement learning distillation |
|
|
| 40 |
StateSpaceDiffuser: Bringing Long Context to Diffusion World Models |
Proposes StateSpaceDiffuser, bringing long-context modeling to diffusion world models |
world model |
✅ |
|
| 41 |
Improve Multi-Modal Embedding Learning via Explicit Hard Negative Gradient Amplifying |
Proposes explicit hard-negative gradient amplification to improve multimodal embedding learning |
contrastive learning large language model |
✅ |
|
| 42 |
InfoSAM: Fine-Tuning the Segment Anything Model from An Information-Theoretic Perspective |
InfoSAM: fine-tunes SAM from an information-theoretic perspective, improving its segmentation performance in specialized domains |
distillation foundation model |
|
|