cs.CV (2026-03-30)

📊 45 papers in total | 🔗 13 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (13 🔗5) · Pillar 3: Perception & Semantics (12 🔗5) · Pillar 2: RL & Architecture (10 🔗3) · Pillar 1: Robot Control (8) · Pillar 8: Physics-based Animation (1) · Pillar 6: Video Extraction (1)

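🔬 Pillar 9: Embodied Foundation Models (13 papers)

#	Title	Key Point	Tags	🔗	⭐
1	Integrating Multimodal Large Language Model Knowledge into Amodal Completion	Proposes AmodalCG, which uses multimodal large language model knowledge to guide amodal completion	large language model multimodal
2	AutoCut: End-to-end advertisement video editing based on multimodal discretization and controllable generation	AutoCut: an end-to-end advertisement video editing framework based on multimodal discretization and controllable generation	large language model foundation model multimodal
3	ResAdapt: Adaptive Resolution for Efficient Multimodal Reasoning	ResAdapt: adaptive resolution for more efficient multimodal reasoning, addressing the visual-token growth bottleneck	large language model multimodal	✅
4	MarkushGrapher-2: End-to-end Multimodal Recognition of Chemical Structures	Proposes MarkushGrapher-2 for end-to-end multimodal chemical structure recognition, significantly improving recognition accuracy	multimodal
5	GEMS: Agent-Native Multimodal Generation with Memory and Skills	GEMS: an agent-native multimodal generation framework with memory and skills that improves performance on complex instructions and downstream tasks	multimodal
6	Unsafe2Safe: Controllable Image Anonymization for Downstream Utility	Unsafe2Safe: a controllable image anonymization method that protects privacy while preserving downstream task performance	large language model multimodal
7	Progressive Prompt-Guided Cross-Modal Reasoning for Referring Image Segmentation	Proposes the PPCR framework, which uses progressive prompt-guided cross-modal reasoning to improve referring image segmentation performance	large language model multimodal
8	AdaptToken: Entropy-based Adaptive Token Selection for MLLM Long Video Understanding	AdaptToken: an entropy-based adaptive token selection method for long video understanding	large language model	✅
9	Domain-Invariant Prompt Learning for Vision-Language Models	Proposes DiCoOp, which uses adversarial training to improve vision-language model performance on domain generalization tasks	zero-shot transfer
10	INSID3: Training-Free In-Context Segmentation with DINOv3	INSID3: training-free in-context segmentation with DINOv3, requiring no supervision	foundation model	✅
11	RecycleLoRA: Rank-Revealing QR-Based Dual-LoRA Subspace Adaptation for Domain Generalized Semantic Segmentation	RecycleLoRA: RRQR-based dual-LoRA subspace adaptation for domain-generalized semantic segmentation	foundation model
12	MDPBench: A Benchmark for Multilingual Document Parsing in Real-World Scenarios	MDPBench: the first benchmark for multilingual document parsing in real-world scenarios, revealing performance bottlenecks in open-source models	multimodal	✅
13	Event6D: Event-based Novel Object 6D Pose Tracking	EventTrack6D: an event-camera-based framework for 6D pose tracking of novel objects	TAMP	✅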

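🔬 Pillar 3: Perception & Semantics (12 papers)

#	Title	Key Point	Tags	🔗	⭐
14	GeoHCC: Local Geometry-Aware Hierarchical Context Compression for 3D Gaussian Splatting	GeoHCC: a local geometry-aware hierarchical context compression method for efficient 3D Gaussian splatting	3D gaussian splatting 3DGS gaussian splatting
15	SVGS: Single-View to 3D Object Editing via Gaussian Splatting	Proposes SVGS, which uses Gaussian splatting for single-view, text-driven 3D object editing	3D gaussian splatting 3DGS gaussian splatting	✅
16	Physically Inspired Gaussian Splatting for HDR Novel View Synthesis	Proposes PhysHDR-GS, which achieves HDR novel view synthesis through physically inspired Gaussian splatting, significantly improving detail reconstruction	gaussian splatting splatting	✅
17	RehearsalNeRF: Decoupling Intrinsic Neural Fields of Dynamic Illuminations for Scene Editing	RehearsalNeRF: decouples intrinsic neural fields under dynamic illumination to enable scene editing	neural radiance field optical flow geometric consistency
18	FlowIt: Global Matching for Optical Flow with Confidence-Guided Refinement	FlowIt: a global-matching optical flow estimation method with confidence-guided refinement that improves robustness in large-displacement scenes	optical flow
19	AffordMatcher: Affordance Learning in 3D Scenes from Visual Signifiers	AffordMatcher: affordance learning in 3D scenes from visual signifiers	affordance
20	Industrial3D: A Terrestrial LiDAR Point Cloud Dataset and CrossParadigm Benchmark for Industrial Infrastructure	Proposes the Industrial3D dataset for point cloud semantic understanding of industrial infrastructure and cross-paradigm benchmarking	scene understanding foundation model	✅
21	DiffAttn: Diffusion-Based Drivers' Visual Attention Prediction with LLM-Enhanced Semantic Reasoning	DiffAttn: diffusion-based prediction of drivers' visual attention with LLM-enhanced semantic reasoning	scene understanding large language model
22	Explaining CLIP Zero-shot Predictions Through Concepts	EZPC: explains CLIP's zero-shot predictions through concepts, improving model interpretability	open-vocabulary open vocabulary	✅
23	4DSurf: High-Fidelity Dynamic Scene Surface Reconstruction	Proposes 4DSurf, which achieves high-fidelity dynamic scene surface reconstruction through SDF flow regularization induced by Gaussian deformation	gaussian splatting splatting
24	SegRGB-X: General RGB-X Semantic Segmentation Model	Proposes SegRGB-X, a general semantic segmentation framework that unifies segmentation across multimodal data and achieves SOTA results	scene understanding
25	ForestSim: A Synthetic Benchmark for Intelligent Vehicle Perception in Unstructured Forest Environments	ForestSim: a synthetic benchmark dataset for intelligent vehicle perception in unstructured forest environments	scene understanding	✅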

🔬 Pillar 2: RL & Architecture (10 papers)

#	Title	Key Point	Tags	🔗	⭐
26	MedLoc-R1: Performance-Aware Curriculum Reward Scheduling for GRPO-Based Medical Visual Grounding	MedLoc-R1: performance-aware curriculum reward scheduling for GRPO-based medical visual grounding	reinforcement learning multimodal visual grounding	✅
27	To View Transform or Not to View Transform: NeRF-based Pre-training Perspective	Proposes NeRP3D, which resolves the prior conflict introduced by view transformation in NeRF pre-training and improves 3D object detection performance	representation learning NeRF neural radiance field
28	CiQi-Agent: Aligning Vision, Tools and Aesthetics in Multimodal Agent for Cultural Reasoning on Chinese Porcelains	CiQi-Agent: a multimodal agent for cultural reasoning on Chinese porcelains that aligns vision, tools, and aesthetics	reinforcement learning multimodal	✅
29	Ghost-FWL: A Large-Scale Full-Waveform LiDAR Dataset for Ghost Detection and Removal	Proposes the Ghost-FWL dataset and the FWL-MAE model for ghost point detection and removal in mobile LiDAR	representation learning masked autoencoder MAE	✅
30	PoseDreamer: Scalable and Photorealistic Human Data Generation Pipeline with Diffusion Models	PoseDreamer: uses diffusion models to generate scalable, photorealistic human data for 3D human mesh estimation	direct preference optimization dreamer
31	$R_{dm}$: Re-conceptualizing Distribution Matching as a Reward for Diffusion Distillation	Proposes the $R_{dm}$ framework, which recasts distribution matching as a reward for diffusion distillation, improving generation quality and efficiency	reinforcement learning distillation
32	ColorFLUX: A Structure-Color Decoupling Framework for Old Photo Colorization	ColorFLUX: a structure-color decoupling framework for old photo colorization	DPO direct preference optimization structure preservation
33	ToLL: Topological Layout Learning with Structural Multi-view Augmentation for 3D Scene Graph Pretraining	Proposes the ToLL framework for 3D scene graph pre-training via topological layout learning and structural multi-view augmentation	representation learning distillation affordance
34	Bridging the Geometry Mismatch: Frequency-Aware Anisotropic Serialization for Thin-Structure SSMs	Proposes FGOS-Net to address the geometry mismatch in thin-structure SSMs	SSM
35	Beyond Dataset Distillation: Lossless Dataset Concentration via Diffusion-Assisted Distribution Alignment	Proposes the DsCo framework, which uses diffusion models for lossless dataset concentration, improving training efficiency	distillation

🔬 Pillar 1: Robot Control (8 papers)

#	Title	Key Point	Tags	🔗	⭐
36	HandX: Scaling Bimanual Motion and Interaction Generation	HandX: a foundational framework for scaling bimanual motion and interaction generation	bi-manual motion generation human motion
37	ObjectMorpher: 3D-Aware Image Editing via Deformable 3DGS Models	Proposes ObjectMorpher to address the lack of 3D awareness in 2D image editing	manipulation 3D gaussian splatting 3DGS
38	Rethinking Structure Preservation in Text-Guided Image Editing with Visual Autoregressive Models	Proposes a new framework to address structure preservation in text-guided image editing	manipulation reinforcement learning structure preservation
39	Learning Multi-View Spatial Reasoning from Cross-View Relations	Proposes the XVR dataset, which improves vision-language models' multi-view spatial reasoning and robotic manipulation capabilities	manipulation spatial relationship embodied AI
40	Generalizable Detection of AI Generated Images with Large Models and Fuzzy Decision Tree	Proposes an AI-generated image detection framework incorporating a fuzzy decision tree to address poor generalization	manipulation large language model multimodal
41	Sim-to-Real Fruit Detection Using Synthetic Data: Quantitative Evaluation and Embedded Deployment with Isaac Sim	Uses synthetic data from Isaac Sim for sim-to-real fruit detection, with deployment on embedded devices	sim-to-real
42	SHOW3D: Capturing Scenes of 3D Hands and Objects in the Wild	Proposes the SHOW3D dataset for capturing 3D hand-object interactions in real-world scenes	manipulation egocentric
43	ConceptWeaver: Weaving Disentangled Concepts with Flow	ConceptWeaver: uses flow models to disentangle concepts, enabling one-shot concept customization for synthesis and editing	manipulation

🔬 Pillar 8: Physics-based Animation (1 paper)

#	Title	Key Point	Tags	🔗	⭐
44	VistaGEN: Consistent Driving Video Generation with Fine-Grained Control Using Multiview Visual-Language Reasoning	VistaGEN: consistent driving video generation with fine-grained control using multi-view visual-language reasoning	spatiotemporal

🔬 Pillar 6: Video Extraction (1 paper)

#	Title	Key Point	Tags	🔗	⭐
45	Beyond Scanpaths: Graph-Based Gaze Simulation in Dynamic Scenes	Proposes a graph-based gaze simulation method for dynamic scenes that goes beyond traditional scanpaths	egocentric
