cs.CV (2026-02-28)

📊 91 papers in total

🎯 Interest area navigation

Pillar 9: Embodied Foundation Models (27) · Pillar 3: Perception & Semantics (26) · Pillar 2: RL & Architecture (24) · Pillar 1: Robot Control (5) · Pillar 4: Generative Motion (3) · Pillar 8: Physics-based Animation (3) · Pillar 7: Motion Retargeting (2) · Pillar 6: Video Extraction (1)

🔬 Pillar 9: Embodied Foundation Models (27 papers)

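# | Title | One-line summary | Tags
1 | Mitigating Multimodal Hallucinations via Gradient-based Self-Reflection | Proposes GACD, a gradient-based self-reflection method that mitigates hallucinations in multimodal large language models. | large language model, multimodal, visual grounding
2 | Self-adaptive Dataset Construction for Real-World Multimodal Safety Scenarios | Proposes an image-driven, self-adaptive dataset construction method for real-world multimodal safety scenarios. | large language model, multimodal
3 | SkyReels-V4: Multi-modal Video-Audio Generation, Inpainting and Editing model | SkyReels-V4: a unified multimodal foundation model for video-audio generation, inpainting, and editing. | large language model, foundation model, multimodal
4 | Enabling clinical use of foundation models in histopathology | Proposes a downstream training method based on a robustness loss to improve the generalization of pathology foundation models in clinical use. | foundation model
5 | SimpleOCR: Rendering Visualized Questions to Teach MLLMs to Read | SimpleOCR: trains MLLMs on rendered, visualized questions to improve their reading ability. | large language model, multimodal, visual grounding
6 | Plug, Play, and Fortify: A Low-Cost Module for Robust Multimodal Image Understanding Models | Proposes a multimodal weight-allocation module that makes multimodal image understanding models more robust to missing modalities. | multimodal
7 | Asymmetric Idiosyncrasies in Multimodal Models | Studies asymmetric idiosyncrasies in multimodal models and reveals the loss of style information in text-to-image generation. | multimodal
8 | Towards Multimodal Domain Generalization with Few Labels | Proposes a semi-supervised multimodal domain generalization framework for cross-domain generalization when labeled data is scarce. | multimodal
9 | MM-NeuroOnco: A Multimodal Benchmark and Instruction Dataset for MRI-Based Brain Tumor Diagnosis | Proposes MM-NeuroOnco, a multimodal benchmark and instruction dataset for MRI-based brain tumor diagnosis that improves diagnostic reasoning. | multimodal
10 | Efficient Encoder-Free Fourier-based 3D Large Multimodal Model | Proposes Fase3D, an efficient encoder-free Fourier-based 3D large model for large-scale point cloud scenes. | multimodal
11 | AgentVista: Evaluating Multimodal Agents in Ultra-Challenging Realistic Visual Scenarios | AgentVista: a benchmark for evaluating multimodal agents in ultra-challenging realistic visual scenarios. | multimodal
12 | Large Multimodal Models as General In-Context Classifiers | Proposes CIRCLE, a method that improves the in-context learning ability of large models for open-world classification. | multimodal
13 | Visual Instruction Pretraining for Domain-Specific Foundation Models | Proposes visual instruction pretraining to improve the performance of domain-specific foundation models. | foundation model
14 | Breaking the Visual Shortcuts in Multimodal Knowledge-Based Visual Question Answering | Proposes the RETINA benchmark and the MIMIR model to address visual shortcuts in multimodal knowledge-based VQA. | multimodal
15 | OneVision-Encoder: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence | OneVision-Encoder: codec-aligned sparsity as a foundational principle for multimodal intelligence. | multimodal
16 | StruXLIP: Enhancing Vision-language Models with Multimodal Structural Cues | StruXLIP: enhances vision-language models with multimodal structural cues, improving cross-modal retrieval. | multimodal
17 | MammoWise: Multi-Model Local RAG Pipeline for Mammography Report Generation | MammoWise: a local multi-model RAG pipeline for mammography report generation. | multimodal, chain-of-thought
18 | SoPE: Spherical Coordinate-Based Positional Embedding for Enhancing Spatial Perception of 3D LVLMs | Proposes SoPE, a spherical-coordinate positional embedding that enhances the spatial perception of 3D LVLMs. | large language model, multimodal
19 | Can Agents Distinguish Visually Hard-to-Separate Diseases in a Zero-Shot Setting? A Pilot Study | A contrastive-adjudication multi-agent framework for zero-shot discrimination of visually confusable diseases. | large language model, multimodal
20 | Cytoarchitecture in Words: Weakly Supervised Vision-Language Modeling for Human Brain Microscopy | Proposes a weakly supervised vision-language modeling method for cytoarchitecture analysis of human brain microscopy images. | large language model, foundation model
21 | Scaling Audio-Visual Quality Assessment Dataset via Crowdsourcing | Proposes a crowdsourcing-based method for building audio-visual quality assessment datasets and releases the large-scale YT-NTU-AVQ dataset. | multimodal
22 | HulluEdit: Single-Pass Evidence-Consistent Subspace Editing for Mitigating Hallucinations in Large Vision-Language Models | HulluEdit: single-pass, evidence-consistent subspace editing that mitigates hallucinations in large vision-language models. | visual grounding
23 | SubspaceAD: Training-Free Few-Shot Anomaly Detection via Subspace Modeling | SubspaceAD: a training-free few-shot anomaly detection method based on subspace modeling. | foundation model
24 | ThinkOmni: Lifting Textual Reasoning to Omni-modal Scenarios via Guidance Decoding | ThinkOmni: lifts textual reasoning ability to omni-modal scenarios via guidance decoding. | large language model
25 | Dual-IPO: Dual-Iterative Preference Optimization for Text-to-Video Generation | Proposes Dual-IPO, a dual-iterative optimization framework that improves text-to-video generation quality and aligns it with user preferences. | foundation model
26 | PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions | Proposes PoSh, which uses scene graphs to guide LLM evaluation of image descriptions, making the evaluation more fine-grained and accurate. | foundation model
27 | Towards Seamless Interaction: Causal Turn-Level Modeling of Interactive 3D Conversational Head Dynamics | Proposes TIMAR, which models the causal turn-level dynamics of interactive 3D conversational heads to improve the expressiveness of avatars and robots. | multimodal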

🔬 Pillar 3: Perception & Semantics (26 papers)

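# | Title | One-line summary | Tags
28 | Joint Optimization for 4D Human-Scene Reconstruction in the Wild | Proposes JOSH for joint 4D human-scene reconstruction from in-the-wild monocular video. | scene reconstruction, human-scene interaction, human mesh recovery
29 | Bridging Geometric and Semantic Foundation Models for Generalized Monocular Depth Estimation | BriGeS: fuses geometric and semantic foundation models to improve monocular depth estimation. | depth estimation, monocular depth, foundation model
30 | Distractor-free Generalizable 3D Gaussian Splatting | Proposes DGGS for distractor-free scene reconstruction in generalizable 3D Gaussian Splatting. | 3D gaussian splatting, 3DGS, gaussian splatting
31 | Proxy-GS: Unified Occlusion Priors for Training and Inference in Structured 3D Gaussian Splatting | Proxy-GS: uses unified occlusion priors to accelerate training and inference in structured 3D Gaussian Splatting. | 3D gaussian splatting, 3DGS, gaussian splatting
32 | AeroDGS: Physically Consistent Dynamic Gaussian Splatting for Single-Sequence Aerial 4D Reconstruction | AeroDGS: physically consistent dynamic Gaussian Splatting for 4D reconstruction from monocular aerial footage. | gaussian splatting, splatting, scene reconstruction
33 | GIFSplat: Generative Prior-Guided Iterative Feed-Forward 3D Gaussian Splatting from Sparse Views | GIFSplat: iterative feed-forward 3D Gaussian Splatting guided by generative priors for reconstruction from sparse views. | 3D gaussian splatting, gaussian splatting, splatting
34 | Latent Gaussian Splatting for 4D Panoptic Occupancy Tracking | Proposes latent Gaussian Splatting for 4D panoptic occupancy tracking. | gaussian splatting, splatting, scene understanding
35 | G4Splat: Geometry-Guided Gaussian Splatting with Generative Prior | G4Splat: geometry-guided Gaussian Splatting with a generative prior that improves 3D scene reconstruction quality. | gaussian splatting, splatting, NeRF
36 | ST-GS: Vision-Based 3D Semantic Occupancy Prediction with Spatial-Temporal Gaussian Splatting | Proposes the ST-GS framework, which uses spatial-temporal Gaussian Splatting to improve 3D semantic occupancy prediction in vision-centric autonomous driving. | gaussian splatting, splatting, scene understanding
37 | Monocular Open Vocabulary Occupancy Prediction for Indoor Scenes | Proposes a monocular open-vocabulary occupancy prediction method for indoor scenes that improves understanding of complex environments. | splatting, open-vocabulary, open vocabulary
38 | GSTurb: Gaussian Splatting for Atmospheric Turbulence Mitigation | GSTurb: uses Gaussian Splatting to mitigate atmospheric turbulence and improve long-range imaging quality. | gaussian splatting, splatting, optical flow
39 | Pix2Key: Controllable Open-Vocabulary Retrieval with Semantic Decomposition and Self-Supervised Visual Dictionary Learning | Pix2Key: a controllable open-vocabulary image retrieval method based on semantic decomposition and self-supervised visual dictionary learning. | open-vocabulary, open vocabulary
40 | Retrieve and Segment: Are a Few Examples Enough to Bridge the Supervision Gap in Open-Vocabulary Segmentation? | Proposes a retrieval-augmented test-time adapter that bridges the supervision gap in open-vocabulary segmentation with a few examples. | open-vocabulary, open vocabulary
41 | From Open Vocabulary to Open World: Teaching Vision Language Models to Detect Novel Objects | Proposes OWEL and MSCAL, which give open-vocabulary object detectors the ability to detect novel objects in the open world. | open-vocabulary, open vocabulary
42 | BetterScene: 3D Scene Synthesis with Representation-Aligned Generative Model | Proposes BetterScene for novel view synthesis from sparse photos. | 3D gaussian splatting, 3DGS, gaussian splatting
43 | SplatSDF: Boosting SDF-NeRF via Architecture-Level Fusion with Gaussian Splats | SplatSDF: accelerates SDF-NeRF training and convergence through architecture-level fusion with Gaussian splats. | 3DGS, NeRF
44 | SwiftNDC: Fast Neural Depth Correction for High-Fidelity 3D Reconstruction | SwiftNDC: fast neural depth correction for high-fidelity 3D reconstruction. | 3D gaussian splatting, 3DGS, gaussian splatting
45 | Instruction-based Image Editing with Planning, Reasoning, and Generation | Proposes an instruction-based image editing framework built on planning, reasoning, and generation that improves editing quality in complex scenes. | scene understanding, large language model, chain-of-thought
46 | Unveiling Deep Shadows: A Survey and Benchmark on Image and Video Shadow Detection, Removal, and Generation in the Deep Learning Era | Shadow detection, removal, and generation in the deep learning era: a unified survey, benchmark, and future directions. | scene understanding, foundation model, multimodal
47 | Loc$^2$: Interpretable Cross-View Localization via Depth-Lifted Local Feature Matching | Proposes Loc$^2$, which achieves interpretable cross-view localization via depth-lifted local feature matching. | monocular depth, feature matching
48 | DICArt: Advancing Category-level Articulated Object Pose Estimation in Discrete State-Spaces | DICArt: a category-level articulated object pose estimation method based on discrete diffusion. | 6D pose estimation, embodied AI
49 | Motion-aware Event Suppression for Event Cameras | Proposes a motion-aware event suppression framework that filters, in real time, events caused by independently moving objects and ego-motion in event cameras. | visual odometry, IMoS
50 | PackUV: Packed Gaussian UV Maps for 4D Volumetric Video | PackUV: a compact UV-map-based Gaussian representation for efficient storage and transmission of 4D volumetric video. | gaussian splatting, splatting
51 | FLIGHT: Fibonacci Lattice-based Inference for Geometric Heading in real-Time | FLIGHT: real-time geometric heading estimation based on Fibonacci lattice inference. | visual odometry
52 | Velocity and stroke rate reconstruction of canoe sprint team boats based on panned and zoomed video recordings | Proposes a framework that reconstructs canoe velocity and stroke rate from panned and zoomed video, without on-board sensors. | optical flow
53 | SuperQuadricOcc: Multi-Layer Gaussian Approximation of Superquadrics for Real-Time Self-Supervised Occupancy Estimation | Proposes SuperQuadricOcc, which uses superquadrics for real-time self-supervised occupancy estimation and significantly reduces memory usage. | scene understanding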

🔬 Pillar 2: RL & Architecture (24 papers)

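# | Title | One-line summary | Tags
54 | CrossLLM-Mamba: Multimodal State Space Fusion of LLMs for RNA Interaction Prediction | Proposes CrossLLM-Mamba, which fuses LLMs through state space modeling for RNA interaction prediction. | Mamba, large language model, multimodal
55 | Unified Multimodal Models as Auto-Encoders | Proposes Unified-GRPO, which jointly optimizes image-to-text and text-to-image tasks through an auto-encoder view and reinforcement learning. | reinforcement learning, multimodal, instruction following
56 | SPATIALALIGN: Aligning Dynamic Spatial Relationships in Video Generation | SPATIALALIGN: a self-improvement framework that strengthens how text-to-video generation models capture dynamic spatial relationships. | DPO, direct preference optimization, spatial relationship
57 | A data- and compute-efficient chest X-ray foundation model beyond aggressive scaling | Proposes CheXficient, which builds a chest X-ray foundation model efficiently through active data selection. | representation learning, foundation model
58 | From Blind Spots to Gains: Diagnostic-Driven Iterative Training for Large Multimodal Models | Proposes DPE, a diagnostic-driven iterative training scheme that improves the continual learning ability of large multimodal models on open-ended tasks. | reinforcement learning, multimodal
59 | MediX-R1: Open Ended Medical Reinforcement Learning | MediX-R1: an open-ended reinforcement learning framework for medical multimodal large language models. | reinforcement learning, large language model, multimodal
60 | ThinkRL-Edit: Thinking in Reinforcement Learning for Reasoning-Centric Image Editing | Proposes ThinkRL-Edit, which uses reinforcement learning to improve the quality of reasoning-driven image editing. | reinforcement learning, multimodal, chain-of-thought
61 | Benchmarking Video Foundation Models for Remote Parkinson's Disease Screening | A benchmark study of video foundation models for remote Parkinson's disease screening. | representation learning, foundation model
62 | ProjFlow: Projection Sampling with Flow Matching for Zero-Shot Exact Spatial Motion Control | ProjFlow: projection sampling based on flow matching for zero-shot, exact spatial motion control. | flow matching, human motion
63 | SpectralMamba-UNet: Frequency-Disentangled State Space Modeling for Texture-Structure Consistent Medical Image Segmentation | Proposes SpectralMamba-UNet, which achieves texture-structure-consistent medical image segmentation through frequency-disentangled modeling. | Mamba, state space model
64 | VQ-Style: Disentangling Style and Content in Motion with Residual Quantized Representations | Proposes VQ-Style, a framework based on residual quantized representations that disentangles style and content in human motion. | contrastive learning, VQ-VAE, human motion
65 | Object-Centric Representation Learning for Enhanced 3D Semantic Scene Graph Prediction | Proposes an object-centric representation learning method based on contrastive pretraining that improves 3D semantic scene graph prediction accuracy. | representation learning, open-vocabulary, open vocabulary
66 | MSJoE: Jointly Evolving MLLM and Sampler for Efficient Long-Form Video Understanding | Proposes MSJoE, which jointly optimizes the MLLM and the sampler for efficient long-form video understanding. | reinforcement learning, large language model, multimodal
67 | USF-Net: A Unified Spatiotemporal Fusion Network for Ground-Based Remote Sensing Cloud Image Sequence Extrapolation | Proposes USF-Net for extrapolating ground-based remote sensing cloud image sequences, improving prediction accuracy and efficiency. | SSM, spatiotemporal
68 | Enhancing Multi-Modal LLMs Reasoning via Difficulty-Aware Group Normalization | Proposes difficulty-aware group normalization (Durian) to improve the reasoning ability of multimodal LLMs. | reinforcement learning, large language model, multimodal
69 | GeoWorld: Geometric World Models | GeoWorld: a world model based on hyperbolic geometry that improves multi-step visual planning. | reinforcement learning, world model
70 | SPMamba-YOLO: An Underwater Object Detection Network Based on Multi-Scale Feature Enhancement and Global Context Modeling | SPMamba-YOLO: an underwater object detection network that combines multi-scale feature enhancement with global context modeling. | Mamba, state space model
71 | UCM: Unifying Camera Control and Memory with Time-aware Positional Encoding Warping for World Models | UCM: a world model that unifies camera control and memory via time-aware positional encoding warping. | world model
72 | WARM-CAT: Warm-Started Test-Time Comprehensive Knowledge Accumulation for Compositional Zero-Shot Learning | Proposes WARM-CAT to address distribution shift in test-time knowledge accumulation for compositional zero-shot learning. | representation learning, multimodal
73 | ManifoldGD: Training-Free Hierarchical Manifold Guidance for Diffusion-Based Dataset Distillation | ManifoldGD: a training-free hierarchical manifold guidance method for diffusion-based dataset distillation. | distillation
74 | PartSAM: A Scalable Promptable Part Segmentation Model Trained on Native 3D Data | PartSAM: a scalable, promptable 3D part segmentation model trained on native 3D data. | representation learning, foundation model
75 | Deforming Videos to Masks: Flow Matching for Referring Video Segmentation | Proposes the FlowRVS framework to address language guidance in video object segmentation. | flow matching
76 | Solaris: Building a Multiplayer Video World Model in Minecraft | Solaris: builds a multiplayer video world model in Minecraft that delivers consistent multi-view simulation. | world model
77 | ViT-Linearizer: Distilling Quadratic Knowledge into Linear-Time Vision Models | ViT-Linearizer: uses knowledge distillation to turn quadratic-complexity ViT models into linear-complexity models, improving efficiency on high-resolution images. | Mamba, distillation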

🔬 Pillar 1: Robot Control (5 papers)

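# | Title | One-line summary | Tags
78 | EmbodMocap: In-the-Wild 4D Human-Scene Reconstruction for Embodied Agents | EmbodMocap: a portable 4D human-scene reconstruction method based on two iPhones, aimed at embodied agents. | humanoid, humanoid robot, sim-to-real
79 | GigaBrain-0.5M*: a VLA That Learns From World Model-Based Reinforcement Learning | GigaBrain-0.5M*: a VLA model trained with world-model-based reinforcement learning that improves robot manipulation performance. | manipulation, reinforcement learning, world model
80 | Risk-Aware World Model Predictive Control for Generalizable End-to-End Autonomous Driving | Proposes risk-aware world model predictive control (RaWMPC) to address the generalization of end-to-end autonomous driving. | model predictive control, imitation learning, world model
81 | ArtPro: Self-Supervised Articulated Object Reconstruction with Adaptive Integration of Mobility Proposals | ArtPro: articulated object reconstruction based on self-supervision and adaptive integration of mobility proposals. | manipulation, 3D gaussian splatting, gaussian splatting
82 | OSDaR-AR: Enhancing Railway Perception Datasets via Multi-modal Augmented Reality | OSDaR-AR: improves the quality of railway perception datasets through multimodal augmented reality. | sim-to-real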

🔬 Pillar 4: Generative Motion (3 papers)

# | Title | One-line summary | Tags
83 | Causal Motion Diffusion Models for Autoregressive Motion Generation | Proposes the causal motion diffusion model (CMDM) for high-quality, low-latency autoregressive motion generation. | motion diffusion model, motion diffusion, text-to-motion
84 | DyaDiT: A Multi-Modal Diffusion Transformer for Socially Favorable Dyadic Gesture Generation | DyaDiT: a multimodal diffusion transformer that generates socially appropriate gestures for two-person conversations. | motion generation, human motion
85 | DisQ-HNet: A Disentangled Quantized Half-UNet for Interpretable Multimodal Image Synthesis Applications to Tau-PET Synthesis from T1 and FLAIR MRI | Proposes DisQ-HNet for synthesizing Tau-PET from MRI and attributing the contribution of each modality. | VQ-VAE, multimodal

🔬 Pillar 8: Physics-based Animation (3 papers)

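# | Title | One-line summary | Tags
86 | FUSAR-GPT: A Spatiotemporal Feature-Embedded and Two-Stage Decoupled Visual Language Model for SAR Imagery | FUSAR-GPT: a two-stage decoupled vision-language model with spatiotemporal feature embedding for SAR imagery. | spatiotemporal
87 | Multidimensional Task Learning: A Unified Tensor Framework for Computer Vision Tasks | Proposes a tensor-based multidimensional task learning framework that handles classification, segmentation, and detection in computer vision in a unified way. | spatiotemporal
88 | Spatio-Temporal Token Pruning for Efficient High-Resolution GUI Agents | GUIPruner: spatio-temporal token pruning for high-resolution GUI agents that improves efficiency while preserving performance. | spatiotemporal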

🔬 Pillar 7: Motion Retargeting (2 papers)

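# | Title | One-line summary | Tags
89 | Motion-Aware Animatable Gaussian Avatars Deblurring | Proposes a motion-aware deblurring method for animatable Gaussian avatars that handles reconstruction from blurry video. | human motion
90 | VLM-Pruner: Buffering for Spatial Sparsity in an Efficient VLM Centrifugal Token Pruning Paradigm | VLM-Pruner: centrifugal token pruning with spatial-sparsity buffering for efficient VLMs. | spatial relationship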

🔬 Pillar 6: Video Extraction (1 paper)

# | Title | One-line summary | Tags
91 | Exploring Multimodal LMMs for Online Episodic Memory Question Answering on the Edge | Explores multimodal LLMs for online episodic memory question answering on edge devices. | Ego4D, large language model, multimodal
