| 1 |
VideoLoom: A Video Large Language Model for Joint Spatial-Temporal Understanding |
VideoLoom: a video large language model for joint spatial-temporal understanding |
large language model multimodal |
|
|
| 2 |
Robust Multicentre Detection and Classification of Colorectal Liver Metastases on CT: Application of Foundation Models |
Robust detection and classification of colorectal liver metastases on multicentre CT images using foundation models |
foundation model |
|
|
| 3 |
A Multimodal Dataset of Student Oral Presentations with Sensors and Evaluation Data |
SOPHIAS: a multimodal dataset for evaluating oral presentations |
multimodal |
|
|
| 4 |
SIRR-LMM: Single-image Reflection Removal via Large Multimodal Model |
Proposes SIRR-LMM, which uses a large model to address single-image reflection removal, and builds a high-quality synthetic dataset. |
multimodal |
|
|
| 5 |
ShowUI-Aloha: Human-Taught GUI Agent |
ShowUI-Aloha: a GUI agent framework based on human demonstration |
Aloha |
|
|
| 6 |
DIVER: Dynamic Iterative Visual Evidence Reasoning for Multimodal Fake News Detection |
Proposes DIVER, a dynamic iterative visual evidence reasoning framework for multimodal fake news detection |
multimodal |
|
|
| 7 |
HiVid-Narrator: Hierarchical Video Narrative Generation with Scene-Primed ASR-anchored Compression |
HiVid-Narrator: proposes a hierarchical video narrative generation framework with scene-based, ASR-anchored compression for e-commerce videos. |
multimodal chain-of-thought |
|
|
| 8 |
Seeing Right but Saying Wrong: Inter- and Intra-Layer Refinement in MLLMs without Training |
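Proposes DualPD, a training-free method that improves inter-layer consistency in MLLMs, addressing the "knowing but not acting accordingly" problem |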
large language model multimodal |
|
|
| 9 |
A Visual Semantic Adaptive Watermark grounded by Prefix-Tuning for Large Vision-Language Model |
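Proposes VISA-Mark, a prefix-tuning-based visual-semantic adaptive watermarking method for protecting the content copyright of large vision-language models. |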
multimodal visual grounding |
|
|
| 10 |
VENUS: Visual Editing with Noise Inversion Using Scene Graphs |
VENUS: a training-free visual image editing framework based on scene graphs and noise inversion |
large language model multimodal |
|
|
| 11 |
PARL: Position-Aware Relation Learning Network for Document Layout Analysis |
Proposes PARL, a position-aware relation learning network for improving document layout analysis performance. |
multimodal |
|
|
| 12 |
BenchSeg: A Large-Scale Dataset and Benchmark for Multi-View Food Video Segmentation |
BenchSeg: a large-scale multi-view food video segmentation dataset and benchmark |
multimodal |
✅ |
|
| 13 |
PanoSAMic: Panoramic Image Segmentation from SAM Feature Encoding and Dual View Fusion |
PanoSAMic: panoramic image segmentation based on SAM feature encoding and dual-view fusion |
foundation model |
✅ |
|
| 14 |
Focal Guidance: Unlocking Controllability from Semantic-Weak Layers in Video Diffusion Models |
Proposes Focal Guidance to address the control problem of semantic-weak layers in video diffusion models |
instruction following |
|
|