cs.CV (2026-03-26)

📊 79 papers in total | 🔗 19 with code

🎯 Interest area navigation

Pillar 2: RL & Architecture (28 🔗5) · Pillar 9: Embodied Foundation Models (19 🔗3) · Pillar 3: Perception & Semantics (17 🔗5) · Pillar 1: Robot Control (5 🔗2) · Pillar 8: Physics-based Animation (4 🔗1) · Pillar 4: Generative Motion (2 🔗1) · Pillar 5: Interaction & Reaction (2) · Pillar 6: Video Extraction (1 🔗1) · Pillar 7: Motion Retargeting (1 🔗1)

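🔬 Pillar 2: RL & Architecture (28 papers)

#	Title	One-line summary	Tags	🔗
1	VideoWeaver: Multimodal Multi-View Video-to-Video Transfer for Embodied Agents	VideoWeaver: a multimodal, multi-view video-to-video transfer framework for embodied agents	policy learning egocentric embodied AI
2	Hierarchy-Guided Multimodal Representation Learning for Taxonomic Inference	Proposes hierarchy-guided multimodal representation learning for biodiversity identification	representation learning foundation model multimodal
3	Bridging Perception and Reasoning: Token Reweighting for RLVR in Multimodal LLMs	Proposes a token-reweighting strategy to improve the perception and reasoning of multimodal LLMs under RLVR	reinforcement learning large language model multimodal
4	MSRL: Scaling Generative Multimodal Reward Modeling via Multi-Stage Reinforcement Learning	Proposes multi-stage reinforcement learning (MSRL) to scale the training of generative multimodal reward models	reinforcement learning distillation multimodal	✅
5	GDPO-Listener: Expressive Interactive Head Generation via Auto-Regressive Flow Matching and Group reward-Decoupled Policy Optimization	GDPO-Listener: expressive interactive head generation via auto-regressive flow matching and group reward-decoupled policy optimization	flow matching motion generation dyadic interaction
6	Multimodal Dataset Distillation via Phased Teacher Models	Proposes the PTM-ST framework to address insufficient capture of the teacher model's dynamically evolving knowledge in multimodal dataset distillation	distillation multimodal	✅
7	Out of Sight but Not Out of Mind: Hybrid Memory for Dynamic Video World Models	Proposes a hybrid memory mechanism to handle subjects that disappear and reappear in dynamic video world models	world model world models spatiotemporal
8	Label What Matters: Modality-Balanced and Difficulty-Aware Multimodal Active Learning	Proposes the RL-MBA framework to address modality balance and dynamically changing sample difficulty in multimodal active learning	reinforcement learning multimodal
9	Vega: Learning to Drive with Natural Language Instructions	Proposes the Vega model for personalized autonomous driving via natural language instructions	world model world models vision-language-action
10	LanteRn: Latent Visual Structured Reasoning	LanteRn: a latent-space visual structured reasoning framework that improves the visual understanding of multimodal models	reinforcement learning multimodal visual grounding
11	CLIP-RD: Relational Distillation for Efficient CLIP Knowledge Distillation	Proposes CLIP-RD, which improves the efficiency of CLIP knowledge distillation via relational distillation	contrastive learning teacher-student distillation
12	VideoTIR: Accurate Understanding for Long Videos with Efficient Tool-Integrated Reasoning	VideoTIR: uses reinforcement learning and tool-integrated reasoning to improve the accuracy and efficiency of long-video understanding	reinforcement learning large language model multimodal
13	TIGFlow-GRPO: Trajectory Forecasting via Interaction-Aware Flow Matching and Reward-Driven Optimization	Proposes the TIGFlow-GRPO framework, which uses interaction-aware flow matching and reward-driven optimization for trajectory forecasting that better respects social norms and physical constraints	flow matching multimodal
14	Towards Controllable Low-Light Image Enhancement: A Continuous Multi-illumination Dataset and Efficient State Space Framework	Proposes a controllable low-light image enhancement method, with a continuous multi-illumination dataset and an efficient state space framework	SSM state space model multimodal
15	FD$^2$: A Dedicated Framework for Fine-Grained Dataset Distillation	Proposes the FD$^2$ framework for fine-grained dataset distillation, improving few-shot learning performance	distillation
16	AnyDoc: Enhancing Document Generation via Large-Scale HTML/CSS Data Synthesis and Height-Aware Reinforcement Optimization	AnyDoc: enhances document generation via large-scale HTML/CSS data synthesis and height-aware reinforcement optimization	reinforcement learning large language model
17	MoE-GRPO: Optimizing Mixture-of-Experts via Reinforcement Learning in Vision-Language Models	Proposes MoE-GRPO, which uses reinforcement learning to optimize expert routing in MoE-VLMs and improve multimodal understanding	reinforcement learning
18	Towards Video Anomaly Detection from Event Streams: A Baseline and Benchmark Datasets	Proposes the EWAD framework to address data sparsity and training difficulty in event-stream video anomaly detection	distillation spatiotemporal
19	Image Rotation Angle Estimation: Comparing Circular-Aware Methods	Compares five circular-aware methods for image rotation angle estimation and validates the effectiveness of probabilistic methods	Mamba MAE
20	Learning to Rank Caption Chains for Video-Text Alignment	Proposes a ranking-optimization-based video-text alignment method that improves long-text generation quality	DPO direct preference optimization
21	Reinforcing Structured Chain-of-Thought for Video Understanding	Proposes a Summary-Driven RL framework that strengthens the reasoning and generalization of MLLMs in video understanding	reinforcement learning large language model chain-of-thought
22	Collision-Aware Vision-Language Learning for End-to-End Driving with Multimodal Infraction Datasets	Proposes VLAAD and CARLA-Collide datasets to improve collision avoidance in end-to-end autonomous driving	representation learning multimodal
23	LEMON: a foundation model for nuclear morphology in Computational Pathology	LEMON: a foundation model for nuclear morphology in computational pathology	representation learning foundation model	✅
24	GazeQwen: Lightweight Gaze-Conditioned LLM Modulation for Streaming Video Understanding	GazeQwen: a lightweight gaze-conditioned LLM modulation method for streaming video understanding	JEPA large language model multimodal	✅
25	CLIP-RD: Relational Distillation for Efficient CLIP Knowledge Distillation	Proposes CLIP-RD, which improves the efficiency of CLIP knowledge distillation via relational distillation	contrastive learning teacher-student distillation
26	Geo$^\textbf{2}$: Geometry-Guided Cross-view Geo-Localization and Image Synthesis	Geo$^2$: a unified geometry-guided framework for cross-view geo-localization and image synthesis that reports SOTA performance	flow matching VGGT foundation model
27	World Reasoning Arena	Proposes WR-Arena to evaluate world models on action simulation, long-horizon prediction, and reasoning and planning	world model world models physically plausible	✅
28	DiReCT: Disentangled Regularization of Contrastive Trajectories for Physics-Refined Video Generation	DiReCT: disentangled regularization of contrastive trajectories to improve the quality of physics-constrained video generation	flow matching contrastive learning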

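🔬 Pillar 9: Embodied Foundation Models (19 papers)

#	Title	One-line summary	Tags	🔗
29	Photon: Speedup Volume Understanding with Efficient Multimodal Large Language Models	Photon: speeds up 3D medical image understanding with efficient multimodal large language models	large language model multimodal
30	Visual Attention Drifts, but Anchors Hold: Mitigating Hallucination in Multimodal Large Language Models via Cross-Layer Visual Anchors	Proposes CLVA, which mitigates hallucination in multimodal large language models via cross-layer visual anchors	large language model multimodal
31	Demographic Fairness in Multimodal LLMs: A Benchmark of Gender and Ethnicity Bias in Face Verification	Evaluates gender and ethnicity bias of multimodal large language models in face verification	large language model multimodal
32	MuRF: Unlocking the Multi-Scale Potential of Vision Foundation Models	MuRF: unlocks the multi-scale potential of vision foundation models to improve inference performance	foundation model
33	Seeing to Ground: Visual Attention for Hallucination-Resilient MDLLMs	Proposes the VISAGE framework, which calibrates visual attention to make MDLLMs more resilient to multimodal hallucination	large language model multimodal visual grounding
34	GeoHeight-Bench: Towards Height-Aware Multimodal Reasoning in Remote Sensing	Proposes GeoHeight-Bench to address the lack of height awareness in large models for remote sensing	multimodal
35	SlotVTG: Object-Centric Adapter for Generalizable Video Temporal Grounding	Proposes SlotVTG, an object-centric adapter for generalizable video temporal grounding	large language model multimodal
36	Pixelis: Reasoning in Pixels, from Seeing to Acting	Pixelis: a pixel-level reasoning agent that executes operations and learns from their outcomes, improving the generalization and physical grounding of vision-language systems	multimodal chain-of-thought
37	Self-Corrected Image Generation with Explainable Latent Rewards	Proposes the xLARD framework, which uses explainable latent rewards for self-corrected image generation	large language model multimodal	✅
38	BFMD: A Full-Match Badminton Dense Dataset for Dense Shot Captioning	Proposes BFMD, a full-match dense badminton dataset for dense captioning of shot events	multimodal
39	Knowledge-Guided Failure Prediction: Detecting When Object Detectors Miss Safety-Critical Objects	Proposes a knowledge-guided failure prediction method to detect when object detectors miss objects in safety-critical scenarios	foundation model
40	PMT: Plain Mask Transformer for Image and Video Segmentation with Frozen Vision Encoders	Proposes PMT, a Plain Mask Transformer for image and video segmentation built on frozen vision encoders	foundation model	✅
41	GIFT: Global Irreplaceability Frame Targeting for Efficient Video Understanding	GIFT: a global irreplaceability frame targeting method for efficient video understanding	large language model
42	Synergistic Event-SVE Imaging for Quantitative Propellant Combustion Diagnostics	Proposes a synergistic Event-SVE imaging system for quantitative propellant combustion diagnostics, addressing high dynamic range and smoke occlusion	multimodal
43	BEVMAPMATCH: Multimodal BEV Neural Map Matching for Robust Re-Localization of Autonomous Vehicles	BEVMapMatch: a multimodal BEV neural map matching method for robust re-localization of autonomous vehicles in adverse environments	multimodal	✅
44	Good Scores, Bad Data: A Metric for Multimodal Coherence	Proposes a multimodal coherence metric to address data inconsistency	multimodal
45	THFM: A Unified Video Foundation Model for 4D Human Perception and Beyond	Proposes THFM, a unified video foundation model for 4D human perception and other tasks	foundation model
46	Speech-Synchronized Whiteboard Generation via VLM-Driven Structured Drawing Representations	Proposes a VLM-based speech-synchronized whiteboard generation method for automatically generating educational video content	multimodal TAMP
47	GUIDE: A Benchmark for Understanding and Assisting Users in Open-Ended GUI Tasks	Proposes the GUIDE benchmark for understanding and assisting users in open-ended GUI tasks	multimodal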

🔬 Pillar 3: Perception & Semantics (17 papers)

#	Title	One-line summary	Tags	🔗
48	Learning Explicit Continuous Motion Representation for Dynamic Gaussian Splatting from Monocular Videos	Proposes a dynamic Gaussian splatting method based on SE(3) B-spline motion bases for high-quality dynamic scene reconstruction from monocular videos	gaussian splatting splatting motion representation	✅
49	AirSplat: Alignment and Rating for Robust Feed-Forward 3D Gaussian Splatting	AirSplat: alignment and rating for robust feed-forward 3D Gaussian splatting	3D gaussian splatting gaussian splatting splatting
50	Relaxed Rigidity with Ray-based Grouping for Dynamic Gaussian Splatting	Proposes a relaxed-rigidity method with ray-based grouping for dynamic Gaussian splatting, improving monocular video reconstruction quality	3D gaussian splatting gaussian splatting splatting
51	ViewSplat: View-Adaptive Dynamic Gaussian Splatting for Feed-Forward Synthesis	ViewSplat: view-adaptive dynamic Gaussian splatting for fast, high-fidelity novel view synthesis	3D gaussian splatting gaussian splatting splatting
52	Towards Comprehensive Real-Time Scene Understanding in Ophthalmic Surgery through Multimodal Image Fusion	Proposes a multimodal image fusion network for real-time scene understanding and precise instrument tracking in ophthalmic surgery	scene understanding multimodal
53	Towards Foundation Models for 3D Scene Understanding: Instance-Aware Self-Supervised Learning for Point Clouds	PointINS: instance-aware self-supervised learning for point clouds, improving 3D scene understanding	scene understanding foundation model
54	MegaFlow: Zero-Shot Large Displacement Optical Flow	MegaFlow: a zero-shot large-displacement optical flow estimation method that needs no domain-specific fine-tuning	optical flow motion estimation	✅
55	Training-free Detection and 6D Pose Estimation of Unseen Surgical Instruments	Proposes a training-free 6D pose estimation method for surgical instruments that applies to unseen instruments	scene understanding 6D pose estimation geometric consistency
56	Less Gaussians, Texture More: 4K Feed-Forward Textured Splatting	LGTM: feed-forward novel view synthesis at 4K resolution via textured Gaussians	3D gaussian splatting gaussian splatting splatting	✅
57	Colon-Bench: An Agentic Workflow for Scalable Dense Lesion Annotation in Full-Procedure Colonoscopy Videos	Proposes Colon-Bench for scalable dense lesion annotation in colonoscopy videos, to advance AI for early colon cancer screening	open-vocabulary open vocabulary large language model
58	EgoXtreme: A Dataset for Robust Object Pose Estimation in Egocentric Views under Extreme Conditions	EgoXtreme: a dataset for robust object pose estimation in egocentric views under extreme conditions	6D pose estimation egocentric egocentric vision	✅
59	GaussFusion: Improving 3D Reconstruction in the Wild with A Geometry-Informed Video Generator	GaussFusion: uses a geometry-informed video generator to improve 3D reconstruction quality in the wild	3D gaussian splatting 3DGS gaussian splatting
60	MoRGS: Efficient Per-Gaussian Motion Reasoning for Streamable Dynamic 3D Scenes	MoRGS: efficient per-Gaussian motion reasoning for streamable dynamic 3D scene reconstruction	3D gaussian splatting gaussian splatting splatting
61	Interpretable Zero-shot Referring Expression Comprehension with Query-driven Scene Graphs	Proposes SGREC for zero-shot referring expression comprehension	scene understanding spatial relationship large language model
62	Infinite Gaze Generation for Videos with Autoregressive Diffusion	Proposes an infinite gaze generation framework based on autoregressive diffusion to predict human gaze trajectories in videos of arbitrary length	scene understanding multimodal TAMP
63	HeSS: Head Sensitivity Score for Sparsity Redistribution in VGGT	Proposes HeSS, which uses head sensitivity to guide VGGT sparsification, improving accuracy at high sparsity	VGGT	✅
64	Few TensoRF: Enhance the Few-shot on Tensorial Radiance Fields	Few TensoRF: combines tensor decomposition with frequency regularization to improve few-shot 3D reconstruction	NeRF

🔬 Pillar 1: Robot Control (5 papers)

#	Title	One-line summary	Tags	🔗
65	LaMP: Learning Vision-Language-Action Policies with 3D Scene Flow as Latent Motion Prior	LaMP: learns vision-language-action policies using 3D scene flow as a latent motion prior	manipulation flow matching scene flow	✅
66	Probabilistic Concept Graph Reasoning for Multimodal Misinformation Detection	Proposes PCGR, a probabilistic concept graph reasoning framework for interpretable multimodal misinformation detection	manipulation large language model multimodal
67	Beyond Attention Magnitude: Leveraging Inter-layer Rank Consistency for Efficient Vision-Language-Action Models	Proposes the TIES framework, which uses inter-layer rank consistency to improve VLA model efficiency and outperforms attention-magnitude-based selection	manipulation vision-language-action VLA
68	PAWS: Perception of Articulation in the Wild at Scale from Egocentric Videos	PAWS: large-scale perception of object articulation in the wild from egocentric videos	manipulation scene understanding egocentric	✅
69	THEMIS: Towards Holistic Evaluation of MLLMs for Scientific Paper Fraud Forensics	Proposes the THEMIS benchmark for holistic evaluation of multimodal large language models in scientific paper fraud forensics	manipulation large language model multimodal

🔬 Pillar 8: Physics-based Animation (4 papers)

#	Title	One-line summary	Tags	🔗
70	UNIC: Neural Garment Deformation Field for Real-time Clothed Character Animation	Proposes UNIC, a neural-deformation-field-based method for real-time garment animation	character animation
71	PackForcing: Short Video Training Suffices for Long Video Sampling and Long Context Inference	PackForcing: short-video training suffices for long-video sampling and long-context inference	spatiotemporal	✅
72	GeoNDC: A Queryable Neural Data Cube for Planetary-Scale Earth Observation	GeoNDC: a queryable neural data cube for planetary-scale Earth observation	spatiotemporal
73	Dynamic LIBRAS Gesture Recognition via CNN over Spatiotemporal Matrix Representation	Proposes a dynamic LIBRAS gesture recognition method based on a spatiotemporal matrix and a CNN, for controlling home automation devices	spatiotemporal

🔬 Pillar 4: Generative Motion (2 papers)

#	Title	One-line summary	Tags	🔗
74	Bilingual Text-to-Motion Generation: A New Benchmark and Baselines	Proposes the BiHumanML3D benchmark for bilingual text-to-motion generation	motion diffusion text-to-motion motion synthesis	✅
75	Unleashing Guidance Without Classifiers for Human-Object Interaction Animation	Proposes LIGHT to improve contact quality in human-object interaction animation generation	classifier-free guidance contact-aware human-object interaction

🔬 Pillar 5: Interaction & Reaction (2 papers)

#	Title	One-line summary	Tags	🔗
76	Challenges in Hyperspectral Imaging for Autonomous Driving: The HSI-Drive Case	Analyzes vision techniques on the HSI-Drive dataset to address the challenges of hyperspectral imaging for autonomous driving	HSI
77	ArtHOI: Taming Foundation Models for Monocular 4D Reconstruction of Hand-Articulated-Object Interactions	ArtHOI: uses foundation models for monocular 4D reconstruction of hand-articulated-object interactions	HOI large language model foundation model

🔬 Pillar 6: Video Extraction (1 paper)

#	Title	One-line summary	Tags	🔗
78	AG-EgoPose: Leveraging Action-Guided Motion and Kinematic Joint Encoding for Egocentric 3D Pose Estimation	AG-EgoPose: uses action-guided motion and kinematic joint encoding for egocentric 3D pose estimation	egocentric first-person view	✅

🔬 Pillar 7: Motion Retargeting (1 paper)

#	Title	One-line summary	Tags	🔗
79	ICTPolarReal: A Polarized Reflection and Material Dataset of Real World Objects	Proposes the ICTPolarReal dataset to improve reflection and material modeling of real-world objects	geometric consistency	✅
