| 1 |
MMAT-1M: A Large Reasoning Dataset for Multimodal Agent Tuning |
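Proposes MMAT-1M, a large-scale multimodal agent tuning reasoning dataset for improving the reasoning and tool-use capabilities of multimodal large language models. |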
large language model multimodal chain-of-thought |
✅ |
|
| 2 |
Automated Label Placement on Maps via Large Language Models |
Proposes a large-language-model-based method for automated map label placement, addressing the inefficiency of manual labeling. |
large language model foundation model |
✅ |
|
| 3 |
ArtSeek: Deep artwork understanding via multimodal in-context reasoning and late interaction retrieval |
ArtSeek: deep artwork understanding through multimodal in-context reasoning and late interaction retrieval. |
large language model multimodal |
✅ |
|
| 4 |
Aether Weaver: Multimodal Affective Narrative Co-Generation with Dynamic Scene Graphs |
Aether Weaver: proposes a multimodal affective narrative co-generation framework driven by dynamic scene graphs. |
large language model multimodal |
|
|
| 5 |
MAGE: Multimodal Alignment and Generation Enhancement via Bridging Visual and Semantic Spaces |
MAGE: enhances multimodal alignment and generation by bridging visual and semantic spaces. |
large language model multimodal |
✅ |
|
| 6 |
Meta CLIP 2: A Worldwide Scaling Recipe |
Meta CLIP 2: proposes an effective recipe for scaling CLIP training worldwide. |
large language model foundation model multimodal |
|
|
| 7 |
Attention-Driven Multimodal Alignment for Long-term Action Quality Assessment |
Proposes an attention-based multimodal alignment network for long-term action quality assessment. |
multimodal |
|
|
| 8 |
Chain-of-Cooking: Cooking Process Visualization via Bidirectional Chain-of-Thought Guidance |
Proposes the Chain-of-Cooking model, which visualizes cooking processes through bidirectional chain-of-thought (CoT) guidance. |
chain-of-thought |
|
|
| 9 |
From Waveforms to Pixels: A Survey on Audio-Visual Segmentation |
A survey of audio-visual segmentation: a comprehensive review of problems, methods, and future trends. |
foundation model multimodal |
|
|
| 10 |
AI in Agriculture: A Survey of Deep Learning Techniques for Crops, Fisheries and Livestock |
Survey paper: applications of deep learning to crops, fisheries, and livestock in agriculture. |
foundation model multimodal |
✅ |
|
| 11 |
EMIT: Enhancing MLLMs for Industrial Anomaly Detection via Difficulty-Aware GRPO |
EMIT: enhances MLLM performance in industrial anomaly detection via difficulty-aware GRPO. |
large language model multimodal |
|
|
| 12 |
Temporally Consistent Unsupervised Segmentation for Mobile Robot Perception |
Proposes Frontier-Seg for temporally consistent unsupervised terrain segmentation in mobile robot video streams. |
foundation model |
|
|
| 13 |
CAPE: A CLIP-Aware Pointing Ensemble of Complementary Heatmap Cues for Embodied Reference Understanding |
CAPE: a CLIP-aware ensemble of complementary heatmap cues for embodied reference understanding. |
multimodal |
|
|
| 14 |
MSGCoOp: Multiple Semantic-Guided Context Optimization for Few-Shot Learning |
Proposes the MSGCoOp framework, which improves few-shot learning generalization through multiple semantic-guided context optimization. |
large language model |
✅ |
|
| 15 |
AU-LLM: Micro-Expression Action Unit Detection via Enhanced LLM-Based Feature Fusion |
Proposes AU-LLM, the first use of an LLM for micro-expression action unit detection, significantly improving performance. |
large language model |
✅ |
|
| 16 |
Decoupled Spatio-Temporal Consistency Learning for Self-Supervised Tracking |
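Proposes SSTrack, a self-supervised tracking framework that improves tracking performance through decoupled spatio-temporal consistency learning. |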
TAMP |
✅ |
|
| 17 |
Describe, Adapt and Combine: Empowering CLIP Encoders for Open-set 3D Object Retrieval |
Proposes the DAC framework, which uses CLIP and MLLMs to enhance open-set 3D object retrieval. |
large language model |
✅ |
|