cs.CV (2026-03-30)

📊 45 papers in total | 🔗 13 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (13 🔗5) · Pillar 3: Perception & Semantics (12 🔗5) · Pillar 2: RL & Architecture (10 🔗3) · Pillar 1: Robot Control (8) · Pillar 8: Physics-based Animation (1) · Pillar 6: Video Extraction (1)

🔬 Pillar 9: Embodied Foundation Models (13 papers)

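# | Title | One-sentence summary | Tags | 🔗
1 | Integrating Multimodal Large Language Model Knowledge into Amodal Completion | Proposes AmodalCG, which uses multimodal large language model knowledge to guide amodal completion. | large language model, multimodal |
2 | AutoCut: End-to-end advertisement video editing based on multimodal discretization and controllable generation | AutoCut: proposes an end-to-end advertisement video editing framework based on multimodal discretization and controllable generation. | large language model, foundation model, multimodal |
3 | ResAdapt: Adaptive Resolution for Efficient Multimodal Reasoning | ResAdapt: adaptive resolution improves multimodal reasoning efficiency and addresses the bottleneck of visual token growth. | large language model, multimodal |
4 | MarkushGrapher-2: End-to-end Multimodal Recognition of Chemical Structures | Proposes MarkushGrapher-2 for end-to-end multimodal chemical structure recognition, significantly improving recognition accuracy. | multimodal |
5 | GEMS: Agent-Native Multimodal Generation with Memory and Skills | GEMS: an agent-native multimodal generation framework with memory and skills that improves performance on complex instructions and downstream tasks. | multimodal |
6 | Unsafe2Safe: Controllable Image Anonymization for Downstream Utility | Unsafe2Safe: proposes a controllable image anonymization method that protects privacy while maintaining downstream task performance. | large language model, multimodal |
7 | Progressive Prompt-Guided Cross-Modal Reasoning for Referring Image Segmentation | Proposes the PPCR framework, which improves referring image segmentation performance through progressive prompt-guided cross-modal reasoning. | large language model, multimodal |
8 | AdaptToken: Entropy-based Adaptive Token Selection for MLLM Long Video Understanding | AdaptToken: a long video understanding method based on entropy-driven adaptive token selection. | large language model |
9 | Domain-Invariant Prompt Learning for Vision-Language Models | Proposes DiCoOp, which uses adversarial training to improve vision-language model performance on domain generalization tasks. | zero-shot transfer |
10 | INSID3: Training-Free In-Context Segmentation with DINOv3 | INSID3: training-free in-context segmentation with DINOv3, requiring no supervision. | foundation model |
11 | RecycleLoRA: Rank-Revealing QR-Based Dual-LoRA Subspace Adaptation for Domain Generalized Semantic Segmentation | RecycleLoRA: dual-LoRA subspace adaptation based on RRQR decomposition for domain-generalized semantic segmentation. | foundation model |
12 | MDPBench: A Benchmark for Multilingual Document Parsing in Real-World Scenarios | MDPBench: the first benchmark for multilingual document parsing in real-world scenarios, revealing performance bottlenecks of open-source models. | multimodal |
13 | Event6D: Event-based Novel Object 6D Pose Tracking | EventTrack6D: an event-camera-based 6D pose tracking framework for novel objects. | TAMP |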

🔬 Pillar 3: Perception & Semantics (12 papers)

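# | Title | One-sentence summary | Tags | 🔗
14 | GeoHCC: Local Geometry-Aware Hierarchical Context Compression for 3D Gaussian Splatting | GeoHCC: proposes a local geometry-aware hierarchical context compression method for efficient 3D Gaussian Splatting. | 3D gaussian splatting, 3DGS, gaussian splatting |
15 | SVGS: Single-View to 3D Object Editing via Gaussian Splatting | Proposes SVGS, which uses Gaussian Splatting for single-view, text-driven 3D object editing. | 3D gaussian splatting, 3DGS, gaussian splatting |
16 | Physically Inspired Gaussian Splatting for HDR Novel View Synthesis | Proposes PhysHDR-GS, which achieves HDR novel view synthesis through physically inspired Gaussian Splatting, significantly improving detail reconstruction. | gaussian splatting, splatting |
17 | RehearsalNeRF: Decoupling Intrinsic Neural Fields of Dynamic Illuminations for Scene Editing | RehearsalNeRF: decouples intrinsic neural fields under dynamic illumination to enable scene editing. | neural radiance field, optical flow, geometric consistency |
18 | FlowIt: Global Matching for Optical Flow with Confidence-Guided Refinement | FlowIt: a confidence-guided global matching method for optical flow estimation that improves robustness in large-displacement scenes. | optical flow |
19 | AffordMatcher: Affordance Learning in 3D Scenes from Visual Signifiers | AffordMatcher: affordance learning in 3D scenes using visual cues. | affordance |
20 | Industrial3D: A Terrestrial LiDAR Point Cloud Dataset and CrossParadigm Benchmark for Industrial Infrastructure | Proposes the Industrial3D dataset for point cloud semantic understanding of industrial infrastructure and cross-paradigm benchmarking. | scene understanding, foundation model |
21 | DiffAttn: Diffusion-Based Drivers' Visual Attention Prediction with LLM-Enhanced Semantic Reasoning | DiffAttn: driver visual attention prediction based on diffusion models and LLM-enhanced semantic reasoning. | scene understanding, large language model |
22 | Explaining CLIP Zero-shot Predictions Through Concepts | EZPC: explains CLIP's zero-shot predictions through concepts, improving model interpretability. | open-vocabulary, open vocabulary |
23 | 4DSurf: High-Fidelity Dynamic Scene Surface Reconstruction | Proposes 4DSurf, which achieves high-fidelity dynamic scene surface reconstruction through SDF flow regularization induced by Gaussian deformation. | gaussian splatting, splatting |
24 | SegRGB-X: General RGB-X Semantic Segmentation Model | Proposes SegRGB-X, a general semantic segmentation framework that unifies multimodal data segmentation and reaches SOTA. | scene understanding |
25 | ForestSim: A Synthetic Benchmark for Intelligent Vehicle Perception in Unstructured Forest Environments | ForestSim: a synthetic benchmark dataset for intelligent vehicle perception in unstructured forest environments. | scene understanding |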

🔬 Pillar 2: RL & Architecture (10 papers)

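# | Title | One-sentence summary | Tags | 🔗
26 | MedLoc-R1: Performance-Aware Curriculum Reward Scheduling for GRPO-Based Medical Visual Grounding | MedLoc-R1: performance-aware curriculum reward scheduling for GRPO-based medical visual grounding. | reinforcement learning, multimodal, visual grounding |
27 | To View Transform or Not to View Transform: NeRF-based Pre-training Perspective | Proposes NeRP3D, which resolves the prior conflict introduced by view transformation in NeRF-based pre-training and improves 3D object detection performance. | representation learning, NeRF, neural radiance field |
28 | CiQi-Agent: Aligning Vision, Tools and Aesthetics in Multimodal Agent for Cultural Reasoning on Chinese Porcelains | CiQi-Agent: a multimodal agent for cultural reasoning on Chinese porcelain that aligns vision, tools, and aesthetics. | reinforcement learning, multimodal |
29 | Ghost-FWL: A Large-Scale Full-Waveform LiDAR Dataset for Ghost Detection and Removal | Proposes the Ghost-FWL dataset and the FWL-MAE model for ghost point detection and removal in mobile LiDAR. | representation learning, masked autoencoder, MAE |
30 | PoseDreamer: Scalable and Photorealistic Human Data Generation Pipeline with Diffusion Models | PoseDreamer: uses diffusion models to generate scalable, photorealistic human data for 3D human mesh estimation. | direct preference optimization, dreamer |
31 | $R_{dm}$: Re-conceptualizing Distribution Matching as a Reward for Diffusion Distillation | Proposes the Rdm framework, which recasts distribution matching as a reward for diffusion distillation, improving generation quality and efficiency. | reinforcement learning, distillation |
32 | ColorFLUX: A Structure-Color Decoupling Framework for Old Photo Colorization | ColorFLUX: an old photo colorization framework based on structure-color decoupling. | DPO, direct preference optimization, structure preservation |
33 | ToLL: Topological Layout Learning with Structural Multi-view Augmentation for 3D Scene Graph Pretraining | Proposes the ToLL framework for 3D scene graph pre-training via topological layout learning and structural multi-view augmentation. | representation learning, distillation, affordance |
34 | Bridging the Geometry Mismatch: Frequency-Aware Anisotropic Serialization for Thin-Structure SSMs | Proposes FGOS-Net to address the geometry mismatch problem in thin-structure SSMs. | SSM |
35 | Beyond Dataset Distillation: Lossless Dataset Concentration via Diffusion-Assisted Distribution Alignment | Proposes the DsCo framework, which uses diffusion models to compress datasets losslessly and improve training efficiency. | distillation |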

🔬 Pillar 1: Robot Control (8 papers)

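# | Title | One-sentence summary | Tags | 🔗
36 | HandX: Scaling Bimanual Motion and Interaction Generation | HandX: proposes a foundational framework for scaling bimanual motion capture and interaction generation. | bi-manual, motion generation, human motion |
37 | ObjectMorpher: 3D-Aware Image Editing via Deformable 3DGS Models | Proposes ObjectMorpher to address the lack of 3D awareness in 2D image editing. | manipulation, 3D gaussian splatting, 3DGS |
38 | Rethinking Structure Preservation in Text-Guided Image Editing with Visual Autoregressive Models | Proposes a new framework to address structure preservation in text-guided image editing. | manipulation, reinforcement learning, structure preservation |
39 | Learning Multi-View Spatial Reasoning from Cross-View Relations | Proposes the XVR dataset, which improves vision-language models on multi-view spatial reasoning and robot manipulation. | manipulation, spatial relationship, embodied AI |
40 | Generalizable Detection of AI Generated Images with Large Models and Fuzzy Decision Tree | Proposes an AI-generated image detection framework that incorporates a fuzzy decision tree to address insufficient generalization. | manipulation, large language model, multimodal |
41 | Sim-to-Real Fruit Detection Using Synthetic Data: Quantitative Evaluation and Embedded Deployment with Isaac Sim | Uses synthetic data from Isaac Sim to achieve sim-to-real fruit detection, with deployment on embedded devices. | sim-to-real |
42 | SHOW3D: Capturing Scenes of 3D Hands and Objects in the Wild | Proposes the SHOW3D dataset for capturing 3D hand-object interactions in real-world scenes. | manipulation, egocentric |
43 | ConceptWeaver: Weaving Disentangled Concepts with Flow | ConceptWeaver: uses flow models to disentangle concepts, enabling one-shot concept customization for synthesis and editing. | manipulation |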

🔬 Pillar 8: Physics-based Animation (1 paper)

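# | Title | One-sentence summary | Tags | 🔗
44 | VistaGEN: Consistent Driving Video Generation with Fine-Grained Control Using Multiview Visual-Language Reasoning | VistaGEN: consistent driving video generation with fine-grained control using multi-view visual-language reasoning. | spatiotemporal |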

🔬 Pillar 6: Video Extraction (1 paper)

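# | Title | One-sentence summary | Tags | 🔗
45 | Beyond Scanpaths: Graph-Based Gaze Simulation in Dynamic Scenes | Proposes a graph-based gaze simulation method for dynamic scenes that goes beyond traditional scanpaths. | egocentric |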
