cs.CV (2026-03-20)

📊 50 papers in total | 🔗 14 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (18 🔗6) · Pillar 2: RL & Architecture (15 🔗5) · Pillar 3: Perception & Semantics (8) · Pillar 6: Video Extraction (3) · Pillar 7: Motion Retargeting (2) · Pillar 8: Physics-based Animation (2 🔗1) · Pillar 4: Generative Motion (1 🔗1) · Pillar 5: Interaction & Reaction (1 🔗1)

🔬 Pillar 9: Embodied Foundation Models (18 papers)

# | Title | One-line Summary | Tags
1 | Can Large Multimodal Models Inspect Buildings? A Hierarchical Benchmark for Structural Pathology Reasoning | Proposes DefectBench, a benchmark for evaluating large models' reasoning about structural pathologies in buildings | foundation model, multimodal
2 | Enhancing Alignment for Unified Multimodal Models via Semantically-Grounded Supervision | Proposes the Semantically-Grounded Supervision (SeGroS) framework to improve alignment in unified multimodal models | multimodal, visual grounding
3 | FlowScene: Style-Consistent Indoor Scene Generation with Multimodal Graph Rectified Flow | FlowScene: a multimodal graph rectified-flow model for style-consistent indoor scene generation | multimodal, language conditioned
4 | MedSPOT: A Workflow-Aware Sequential Grounding Benchmark for Clinical GUI | MedSPOT: a workflow-aware sequential visual-grounding benchmark for clinical GUIs | large language model, multimodal, visual grounding
5 | Evaluating Vision Foundation Models for Pixel and Object Classification in Microscopy | Evaluates the potential of vision foundation models for pixel and object classification in microscopy | foundation model
6 | Template-based Object Detection Using a Foundation Model | Proposes template-matching object detection built on a segmentation foundation model; training-free and applicable to GUI automation testing | foundation model
7 | FREAK: A Fine-grained Hallucination Evaluation Benchmark for Advanced MLLMs | FREAK: a fine-grained hallucination evaluation benchmark for advanced multimodal LLMs | large language model, multimodal, chain-of-thought
8 | Adapting a Pre-trained Single-Cell Foundation Model to Spatial Gene Expression Generation from Histology Images | HINGE: generates spatial gene expression from histology images, effectively leveraging a pre-trained single-cell model | foundation model
9 | Unbiased Dynamic Multimodal Fusion | Proposes an unbiased dynamic multimodal learning framework, addressing modality-quality assessment and dependence bias in dynamic scenarios | multimodal
10 | Disentangle-then-Align: Non-Iterative Hybrid Multimodal Image Registration via Cross-Scale Feature Disentanglement | Proposes HRNet, achieving non-iterative hybrid multimodal image registration via disentanglement and alignment | multimodal
11 | LumosX: Relate Any Identities with Their Attributes for Personalized Video Generation | LumosX: personalized video generation by relating identities with their attributes | large language model, multimodal
12 | Detached Skip-Links and $R$-Probe: Decoupling Feature Aggregation from Gradient Propagation for MLLM OCR | Proposes detached skip-links and an R-Probe to improve MLLMs' fine-grained recognition on OCR tasks | large language model, multimodal
13 | MedQ-Engine: A Closed-Loop Data Engine for Evolving MLLMs in Medical Image Quality Assessment | MedQ-Engine: a closed-loop data engine for evolving MLLMs in medical image quality assessment | large language model, multimodal
14 | One Model, Two Minds: Task-Conditioned Reasoning for Unified Image Quality and Aesthetic Assessment | Proposes the TATAR framework, unifying image quality and aesthetic assessment via task-conditioned reasoning | large language model, multimodal
15 | Evaluating Image Editing with LLMs: A Comprehensive Benchmark and Intermediate-Layer Probing Approach | Proposes the TIEdit benchmark and EditProbe evaluator, making evaluation of text-guided image editing more reliable | large language model, multimodal
16 | CurveStream: Boosting Streaming Video Understanding in MLLMs via Curvature-Aware Hierarchical Visual Memory Management | CurveStream: curvature-aware hierarchical visual memory management that improves MLLM streaming-video understanding | large language model, multimodal
17 | CoVR-R: Reason-Aware Composed Video Retrieval | Proposes CoVR-R, a reasoning-aware composed video retrieval method that addresses existing methods' neglect of post-edit effects | multimodal
18 | TSegAgent: Zero-Shot Tooth Segmentation via Geometry-Aware Vision-Language Agents | TSegAgent: zero-shot tooth segmentation via geometry-aware vision-language agents | foundation model

🔬 Pillar 2: RL & Architecture (15 papers)

# | Title | One-line Summary | Tags
19 | MFil-Mamba: Multi-Filter Scanning for Spatial Redundancy-Aware Visual State Space Models | MFil-Mamba: a spatial-redundancy-aware visual state space model with multi-filter scanning | Mamba, SSM, state space model
20 | SIMPLER: Efficient Foundation Model Adaptation via Similarity-Guided Layer Pruning for Earth Observation | SIMPLER: efficient foundation-model adaptation for Earth observation via similarity-guided layer pruning | MAE, foundation model, multimodal
21 | PFM-VEPAR: Prompting Foundation Models for RGB-Event Camera based Pedestrian Attribute Recognition | Proposes the PFM-VEPAR framework, using event-camera information to improve pedestrian attribute recognition under low light and motion blur | representation learning, foundation model, multimodal
22 | CFCML: A Coarse-to-Fine Crossmodal Learning Framework For Disease Diagnosis Using Multimodal Images and Tabular Data | Proposes the CFCML framework, improving disease-diagnosis accuracy from multimodal medical images and tabular data via coarse-to-fine cross-modal learning | contrastive learning, multimodal
23 | X-World: Controllable Ego-Centric Multi-Camera World Models for Scalable End-to-End Driving | Proposes X-World, a controllable ego-centric multi-camera world model for scalable end-to-end autonomous driving | world model, geometric consistency, VLA
24 | BALM: A Model-Agnostic Framework for Balanced Multimodal Learning under Imbalanced Missing Rates | BALM: a model-agnostic balanced multimodal learning framework that handles imbalanced missing rates | representation learning, multimodal
25 | EgoForge: Goal-Directed Egocentric World Simulator | EgoForge: generates goal-directed egocentric world-simulation videos from a single image and an instruction | world model, egocentric
26 | Chain-of-Adaptation: Surgical Vision-Language Adaptation with Reinforcement Learning | Proposes the Chain-of-Adaptation framework, achieving domain adaptation of surgical vision-language models via reinforcement learning | reinforcement learning, multimodal
27 | ReconMIL: Synergizing Latent Space Reconstruction with Bi-Stream Mamba for Whole Slide Image Analysis | ReconMIL: combines latent-space reconstruction with a bi-stream Mamba for whole-slide image analysis | Mamba, foundation model
28 | WorldAgents: Can Foundation Image Models be Agents for 3D World Models? | WorldAgents: builds 3D world models from 2D foundation image models | world model, foundation model
29 | CS-MUNet: A Channel-Spatial Dual-Stream Mamba Network for Multi-Organ Segmentation | CS-MUNet: a channel-spatial dual-stream Mamba network for multi-organ segmentation | Mamba, SSM
30 | Improving Image-to-Image Translation via a Rectified Flow Reformulation | Proposes I2I-RFR, improving image-to-image translation via a rectified-flow reformulation | distillation, multimodal
31 | Diffusion-Based Makeup Transfer with Facial Region-Aware Makeup Features | Proposes a diffusion model with facial region-aware makeup features for finer, more controllable makeup transfer | contrastive learning, foundation model
32 | Accelerating Diffusion Decoders via Multi-Scale Sampling and One-Step Distillation | Proposes multi-scale sampling and one-step distillation to accelerate diffusion decoders and reduce latency | distillation
33 | Dual-Domain Representation Alignment: Bridging 2D and 3D Vision via Geometry-Aware Architecture Search | Proposes EvoNAS to address deploying large-scale vision models on edge devices | Mamba, distillation

🔬 Pillar 3: Perception & Semantics (8 papers)

# | Title | One-line Summary | Tags
34 | HUGE-Bench: A Benchmark for High-Level UAV Vision-Language-Action Tasks | Proposes HUGE-Bench, a benchmark for evaluating high-level UAV vision-language-action tasks | 3D gaussian splatting, 3DGS, gaussian splatting
35 | 3D Gaussian Splatting with Self-Constrained Priors for High Fidelity Surface Reconstruction | Proposes 3D Gaussian splatting with self-constrained priors for high-fidelity surface reconstruction | 3D gaussian splatting, 3DGS, gaussian splatting
36 | Fourier Splatting: Generalized Fourier encoded primitives for scalable radiance fields | Proposes Fourier Splatting, achieving scalable radiance-field rendering via Fourier-encoded primitives | 3D gaussian splatting, 3DGS, gaussian splatting
37 | StreetForward: Perceiving Dynamic Street with Feedforward Causal Attention | StreetForward: fast reconstruction of dynamic street scenes via feedforward causal attention | depth estimation, 3D gaussian splatting, gaussian splatting
38 | PCSTracker: Long-Term Scene Flow Estimation for Point Cloud Sequences | PCSTracker: an end-to-end framework for long-term scene flow estimation on point cloud sequences | scene flow, motion estimation
39 | Generalizable NGP-SR: Generalizable Neural Radiance Fields Super-Resolution via Neural Graph Primitives | Proposes Generalizable NGP-SR, achieving generalizable neural radiance field super-resolution via neural graph primitives | NeRF, neural radiance field
40 | SeeClear: Reliable Transparent Object Depth Estimation via Generative Opacification | SeeClear: reliable transparent-object depth estimation via generative opacification | depth estimation, monocular depth
41 | GravCal: Single-Image Calibration of IMU Gravity Priors with Per-Sample Confidence | GravCal: a single-image model for calibrating IMU gravity priors, improving visual-inertial system robustness | VIO

🔬 Pillar 6: Video Extraction (3 papers)

# | Title | One-line Summary | Tags
42 | RAM: Recover Any 3D Human Motion in-the-Wild | RAM: a general framework for recovering arbitrary 3D human motion in complex in-the-wild scenes | HMR, human motion, human motion reconstruction
43 | Investigating a Policy-Based Formulation for Endoscopic Camera Pose Recovery | Proposes a policy-based method for endoscopic camera pose recovery, addressing localization under weak texture and lighting changes | feature matching, motion prediction
44 | IUP-Pose: Decoupled Iterative Uncertainty Propagation for Real-time Relative Pose Regression via Implicit Dense Alignment | IUP-Pose: decoupled iterative uncertainty propagation for relative pose regression via implicit dense alignment, running in real time | feature matching

🔬 Pillar 7: Motion Retargeting (2 papers)

# | Title | One-line Summary | Tags
45 | FlashCap: Millisecond-Accurate Human Motion Capture via Flashing LEDs and Event-Based Vision | FlashCap: a millisecond-accurate human motion capture system using flashing LEDs and event cameras | human motion
46 | OrbitNVS: Harnessing Video Diffusion Priors for Novel View Synthesis | OrbitNVS: leverages video diffusion priors for high-quality novel view synthesis | geometric consistency

🔬 Pillar 8: Physics-based Animation (2 papers)

# | Title | One-line Summary | Tags
47 | Semantic Audio-Visual Navigation in Continuous Environments | Proposes MAGNet, addressing target-information loss in semantic audio-visual navigation in continuous environments | PULSE, multimodal
48 | PhysNeXt: Next-Generation Dual-Branch Structured Attention Fusion Network for Remote Photoplethysmography Measurement | PhysNeXt: a dual-branch structured-attention fusion network for remote photoplethysmography measurement | spatiotemporal, PULSE

🔬 Pillar 4: Generative Motion (1 paper)

# | Title | One-line Summary | Tags
49 | Controllable Text-to-Motion Generation via Modular Body-Part Phase Control | Proposes modular body-part phase control, achieving controllable text-to-motion generation | text-to-motion, motion generation

🔬 Pillar 5: Interaction & Reaction (1 paper)

# | Title | One-line Summary | Tags
50 | MuSteerNet: Human Reaction Generation from Videos via Observation-Reaction Mutual Steering | MuSteerNet: generates realistic human reaction motion from videos via observation-reaction mutual steering | reaction synthesis, human motion
