cs.CV (2026-06-04)

📊 39 papers in total | 🔗 5 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (18 🔗1) · Pillar 2: RL & Architecture (10 🔗3) · Pillar 3: Perception & Semantics (6 🔗1) · Pillar 1: Robot Control (3) · Pillar 4: Generative Motion (1) · Pillar 8: Physics-based Animation (1)

🔬 Pillar 9: Embodied Foundation Models (18 papers)

# | Title | One-line summary | Tags
1 | HomeWorld: A Unified Floorplan-to-Furnished Framework for Generating Controllable, Densely Interactive Whole-Home Scenes | Proposes a unified framework for generating controllable whole-home indoor scenes. | embodied AI, large language model
2 | Towards One-to-Many Temporal Grounding | Proposes a method for temporal grounding of multiple video segments. | chain-of-thought
3 | Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models | Proposes the GeoVR framework to address 3D perception in multimodal large language models. | large language model, foundation model, multimodal
4 | Multimodal Sexism Identification and Characterization using Large Language Models and Gradient Boosting | Proposes a multimodal sexism identification and characterization method for social media content analysis. | large language model, multimodal
5 | LoomVideo: Unifying Multimodal Inputs into Video Generation and Editing | Proposes LoomVideo to address the computational complexity of multimodal video generation and editing. | large language model, foundation model, multimodal
6 | VTI-CoT: Visual-Textual Interleaved Chain of Thought for Video Reasoning | Proposes VTI-CoT to address missing visual information in video reasoning. | multimodal, chain-of-thought
7 | WorldBench: A Challenging and Visually Diverse Multimodal Reasoning Benchmark | Proposes WorldBench to address the shortcomings of multimodal models in visual understanding. | large language model, multimodal
8 | GRAMformer: Any-Order Modality Interactions via Volumetric Multimodal Cross-Attention | Proposes GRAMformer to address the complexity of modeling multimodal interactions. | multimodal
9 | ExpSpeech-Net: Multimodal Fusion of Expression and Speech for Deepfake Detection | Proposes ExpSpeech-Net to address the efficiency of deepfake video detection. | multimodal
10 | Multi-Task Crack Foundation Model for Engineering-Reliable Crack Representation and Topology Preservation in Civil Infrastructure | Proposes CrackGeoFM to address crack assessment in civil infrastructure. | foundation model
11 | Almieyar-Oryx-BloomBench: A Bilingual Multimodal Benchmark for Cognitively Informed Evaluation of Vision-Language Models | Proposes BloomBench to address the lack of cognitive-ability assessment in multimodal model evaluation. | multimodal
12 | Faithful, Enriched, and Precise: Benchmarking Natural-Science Illustration Generation by T2I models | Proposes FEPBench to address the inadequate evaluation of scientific illustration generation. | large language model, multimodal
13 | Resonant Minds: Closed-Loop Social Avatars with Theory of Mind | Proposes a closed-loop dual-agent framework to address the limited social intelligence of digital humans. | large language model, multimodal
14 | LongSpace: Exploring Long-Horizon Spatial Memory from Perception to Recall in Video | Proposes the LongSpace framework to address spatial memory in long videos. | large language model, multimodal
15 | UltraVR: A Diagnostic Ultra-Resolution Image-VQA Benchmark for Evidence-Grounded Reasoning | Proposes UltraVR to address reasoning over ultra-resolution images. | multimodal, chain-of-thought
16 | Global-Local Monte Carlo Tree Search in Vision-Language Models for Text-to-3D Indoor Scene Generation | Proposes global-local Monte Carlo tree search for text-to-3D indoor scene generation. | chain-of-thought
17 | Beyond Absolute Scores: Relative Edit-induced Difference for Generalizable Image Aesthetic Assessment | Proposes the RED-Aes framework to address the limitations of conventional image aesthetic assessment. | chain-of-thought
18 | ShotCrop$^3$: Cropping Human-Centric Images into Cinematic Triple-Shot Compositions | Proposes triple-shot compositions to address the limited narrative capacity of a single crop. | chain-of-thought

🔬 Pillar 2: RL & Architecture (10 papers)

# | Title | One-line summary | Tags
19 | DisasterBench: A Multimodal Benchmark for UAV-Based Disaster Response in Complex Environments | Proposes DisasterBench to address multimodal reasoning for UAV disaster response in complex environments. | reinforcement learning, multimodal, chain-of-thought
20 | Unveiling the Unknown: Open Vocabulary Object Detection with Scene Graphs | Proposes a scene-guided relation modeling framework for open-vocabulary object detection. | distillation, open-vocabulary, open vocabulary
21 | PAR3D: A Unified 3D-MLLM with Part-Aware Representation for Scene Understanding | Proposes PAR3D to address part-level modeling in 3D scene understanding. | representation learning, scene understanding, large language model
22 | ViCuR: Visual Cues as Recoverable Privilege for Multimodal On-Policy Distillation | Proposes ViCuR to address teacher privilege in multimodal distillation. | distillation, multimodal
23 | Thinking with Imagination: Agentic Visual Spatial Reasoning with World Simulators | Proposes the Astra framework to address spatial reasoning in vision-language models. | world model, world models, egocentric
24 | What's Under the Skin? Estimating Swine Body Condition | Proposes PigFormer to automate swine body condition monitoring. | MAE, distillation, height map
25 | Video-Rate Streaming Stylization on a Vision-Aware MLLM-Conditioned Edit Diffusion: Asymmetric Batched Inference on a Distilled UNet + MLLM Text Encoder | Proposes a video-rate streaming stylization method to address the real-time text-to-image generation bottleneck. | distillation, large language model, multimodal
26 | DRIFT: A Residual Flow Adapter for Decoding Continuous Outputs in Vision-Language Models | Proposes DRIFT to address decoding of continuous outputs in vision-language models. | flow matching, VLA, visual grounding
27 | Noise-Aware Visual Representation Learning for Medical Visual Question Answering | Proposes noise-aware visual representation learning to improve medical visual question answering performance. | representation learning, large language model
28 | T-SAR-JEPA: Self-Supervised Temporal Anomaly Detection in SAR Amplitude Stacks via Latent Prediction | Proposes T-SAR-JEPA to address temporal anomaly detection in SAR amplitude stacks. | JEPA

🔬 Pillar 3: Perception & Semantics (6 papers)

# | Title | One-line summary | Tags
29 | Deep Learning-based 3D Oral Cavity Reconstruction Using 2D Intraoral Images | Proposes a deep learning-based 3D oral cavity reconstruction method to address the limitations of conventional scanning. | 3D reconstruction
30 | CamFlow+: Hybrid Motion Bases for 2D Camera Motion Estimation with Stabilization Applications | Proposes CamFlow+ to address the planar assumption in 2D camera motion estimation. | optical flow, motion estimation
31 | T-FunS3D: Task-Driven Hierarchical Open-Vocabulary 3D Functionality Segmentation | Proposes T-FunS3D to address open-vocabulary 3D functionality segmentation. | open-vocabulary, open vocabulary
32 | RPC-GS: Gaussian Splatting with native RPC Rendering for Satellite Imagery | Proposes the RPC-GS framework for high-accuracy reconstruction from satellite imagery. | gaussian splatting, splatting
33 | GS-NFS: Bandwidth-adaptive Streaming of Dynamic Gaussian Splats and Point Clouds | Proposes GS-NFS to accelerate bandwidth-adaptive streaming of dynamic Gaussian point clouds. | 3D gaussian splatting, 3DGS, gaussian splatting
34 | RigPAPR: Rig-Based Animation of Static Neural Point Clouds from a Fixed-Viewpoint Video | Proposes RigPAPR to address animation of static neural point clouds. | gaussian splatting, splatting

🔬 Pillar 1: Robot Control (3 papers)

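# | Title | One-line summary | Tags
35 | Let It Be Simple: One-Step Action Generation for Vision-Language-Action Models | Proposes a simple one-step action generation method to improve vision-language-action models. | bi-manual, distillation, vision-language-action
36 | Synthetic Data Generation and Vision-based Wrinkle and Keypoint Detection for Bimanual Cloth Manipulation | Proposes synthetic data generation and vision-based wrinkle and keypoint detection for bimanual cloth manipulation. | manipulation, bi-manual
37 | Inverse Design of Realizable Metasurface based Absorbers using Improved Conditioning and Diversity Enhanced Progressively Growing GANs | Proposes generative adversarial networks for the inverse design of metasurface-based absorbers. | manipulation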

🔬 Pillar 4: Generative Motion (1 paper)

# | Title | One-line summary | Tags
38 | KV-Control: Parameter-Efficient K/V Injection for Trajectory-Controlled Text-to-Motion | Proposes KV-Control to address control precision in text-driven motion generation. | text-to-motion, motion generation, human motion

🔬 Pillar 8: Physics-based Animation (1 paper)

# | Title | One-line summary | Tags
39 | HDST-GNN: Heterogeneous Dynamic Spatiotemporal Graph Neural Networks for Multi-Object Tracking in UAV Aerial Imagery | Proposes HDST-GNN to address multi-object tracking in UAV imagery. | spatiotemporal
