| 1 |
HomeWorld: A Unified Floorplan-to-Furnished Framework for Generating Controllable, Densely Interactive Whole-Home Scenes |
Proposes a unified framework for generating controllable whole-home indoor scenes |
embodied AI, large language model |
|
|
| 2 |
Towards One-to-Many Temporal Grounding |
Proposes a method for multi-segment video temporal grounding |
chain-of-thought |
|
|
| 3 |
Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models |
Proposes the GeoVR framework to address 3D perception in multimodal large language models |
large language model, foundation model, multimodal |
|
|
| 4 |
Multimodal Sexism Identification and Characterization using Large Language Models and Gradient Boosting |
Proposes a multimodal sexism identification and characterization method for social media content analysis |
large language model, multimodal |
|
|
| 5 |
LoomVideo: Unifying Multimodal Inputs into Video Generation and Editing |
Proposes LoomVideo to address the computational complexity of multimodal video generation and editing |
large language model, foundation model, multimodal |
|
|
| 6 |
VTI-CoT: Visual-Textual Interleaved Chain of Thought for Video Reasoning |
Proposes VTI-CoT to address missing visual information in video reasoning |
multimodal, chain-of-thought |
|
|
| 7 |
WorldBench: A Challenging and Visually Diverse Multimodal Reasoning Benchmark |
Proposes WorldBench to address the shortcomings of multimodal models in visual understanding |
large language model, multimodal |
|
|
| 8 |
GRAMformer: Any-Order Modality Interactions via Volumetric Multimodal Cross-Attention |
Proposes GRAMformer to address the complexity of modeling multimodal interactions |
multimodal |
|
|
| 9 |
ExpSpeech-Net: Multimodal Fusion of Expression and Speech for Deepfake Detection |
Proposes ExpSpeech-Net to address the efficiency of deepfake video detection |
multimodal |
|
|
| 10 |
Multi-Task Crack Foundation Model for Engineering-Reliable Crack Representation and Topology Preservation in Civil Infrastructure |
Proposes CrackGeoFM to address crack assessment in civil infrastructure |
foundation model |
|
|
| 11 |
Almieyar-Oryx-BloomBench: A Bilingual Multimodal Benchmark for Cognitively Informed Evaluation of Vision-Language Models |
Proposes BloomBench to address insufficient assessment of cognitive abilities in multimodal model evaluation |
multimodal |
✅ |
|
| 12 |
Faithful, Enriched, and Precise: Benchmarking Natural-Science Illustration Generation by T2I models |
Proposes FEPBench to address inadequate evaluation of scientific illustration generation |
large language model, multimodal |
|
|
| 13 |
Resonant Minds: Closed-Loop Social Avatars with Theory of Mind |
Proposes a closed-loop dual-agent framework to address the limited social intelligence of digital humans |
large language model, multimodal |
|
|
| 14 |
LongSpace: Exploring Long-Horizon Spatial Memory from Perception to Recall in Video |
Proposes the LongSpace framework to address spatial memory in long videos |
large language model, multimodal |
|
|
| 15 |
UltraVR: A Diagnostic Ultra-Resolution Image-VQA Benchmark for Evidence-Grounded Reasoning |
Proposes UltraVR to address reasoning over ultra-resolution images |
multimodal, chain-of-thought |
|
|
| 16 |
Global-Local Monte Carlo Tree Search in Vision-Language Models for Text-to-3D Indoor Scene Generation |
Proposes global-local Monte Carlo tree search for text-to-3D indoor scene generation |
chain-of-thought |
|
|
| 17 |
Beyond Absolute Scores: Relative Edit-induced Difference for Generalizable Image Aesthetic Assessment |
Proposes the RED-Aes framework to address the limitations of conventional image aesthetic assessment |
chain-of-thought |
|
|
| 18 |
ShotCrop$^3$: Cropping Human-Centric Images into Cinematic Triple-Shot Compositions |
Proposes Triple-Shot compositions to address the limited narrative capacity of a single crop |
chain-of-thought |
|
|