cs.CV (2026-02-24)

📊 39 papers in total | 🔗 9 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (11 🔗2) | Pillar 3: Perception & Semantics (9 🔗1) | Pillar 2: RL & Architecture (8 🔗2) | Pillar 1: Robot Control (4 🔗1) | Pillar 7: Motion Retargeting (3 🔗1) | Pillar 6: Video Extraction (2 🔗1) | Pillar 8: Physics-based Animation (2 🔗1)

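🔬 Pillar 9: Embodied Foundation Models (11 papers)

#	Title	Key point	Tags	🔗	⭐
1	Are Multimodal Large Language Models Good Annotators for Image Tagging?	Proposes the TagLLM framework to improve the annotation quality of multimodal large language models on image tagging.	large language model multimodal
2	Efficient and Explainable End-to-End Autonomous Driving via Masked Vision-Language-Action Diffusion	Proposes MVLAD-AD, which uses a masked diffusion model for efficient, explainable end-to-end autonomous driving.	vision-language-action large language model
3	CrystaL: Spontaneous Emergence of Visual Latents in MLLMs	CrystaL: spontaneous emergence of visual latents in MLLMs, improving fine-grained visual understanding.	large language model multimodal chain-of-thought
4	OrthoDiffusion: A Generalizable Multi-Task Diffusion Foundation Model for Musculoskeletal MRI Interpretation	OrthoDiffusion: a generalizable multi-task diffusion model for musculoskeletal MRI interpretation.	foundation model
5	An interactive enhanced driving dataset for autonomous driving	Proposes the Interactive Enhanced Driving Dataset (IEDD) to address data sparsity and insufficient multimodal alignment in autonomous-driving VLA models.	vision-language-action VLA multimodal
6	UDVideoQA: A Traffic Video Question Answering Dataset for Multi-Object Spatio-Temporal Reasoning in Urban Dynamics	Proposes the UDVideoQA dataset for video question answering with multi-object spatio-temporal reasoning in urban traffic videos.	multimodal visual grounding	✅
7	OmniOCR: Generalist OCR for Ethnic Minority Languages	OmniOCR: a generalist OCR framework for ethnic minority languages that improves recognition accuracy in low-resource settings.	foundation model multimodal	✅
8	Skullptor: High Fidelity 3D Head Reconstruction in Seconds with Multi-View Normal Prediction	Skullptor: fast, high-fidelity 3D head reconstruction based on multi-view normal prediction.	foundation model
9	VII: Visual Instruction Injection for Jailbreaking Image-to-Video Generation Models	Proposes the VII framework, which uses visual instruction injection to jailbreak the safety restrictions of image-to-video models.	instruction following
10	Cycle-Consistent Tuning for Layered Image Decomposition	Proposes a cycle-consistent tuning method for diffusion-based layered image decomposition.	foundation model
11	On the Explainability of Vision-Language Models in Art History	Studies the explainability of CLIP's visual reasoning in art history and evaluates the effectiveness of XAI methods.	multimodal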

🔬 Pillar 3: Perception & Semantics (9 papers)

#	Title	Key point	Tags	🔗	⭐
12	SceMoS: Scene-Aware 3D Human Motion Synthesis by Planning with Geometry-Grounded Tokens	SceMoS: scene-aware 3D human motion synthesis by planning with geometry-grounded tokens.	height map occupancy grid affordance
13	RU4D-SLAM: Reweighting Uncertainty in Gaussian Splatting SLAM for 4D Scene Reconstruction	RU4D-SLAM: 4D Gaussian splatting SLAM reconstruction of dynamic scenes via uncertainty reweighting.	3D gaussian splatting gaussian splatting splatting
14	Dropping Anchor and Spherical Harmonics for Sparse-view Gaussian Splatting	Proposes DropAnSH-GS, which improves Gaussian splatting under sparse views through anchor dropout and spherical-harmonics sparsification.	3D gaussian splatting 3DGS gaussian splatting
15	VAGNet: Grounding 3D Affordance from Human-Object Interactions in Videos	VAGNet: grounds 3D affordance regions from human-object interactions in videos.	affordance human-object interaction HOI
16	Probing and Bridging Geometry-Interaction Cues for Affordance Reasoning in Vision Foundation Models	Fuses geometry and interaction cues to improve zero-shot affordance reasoning in vision foundation models.	affordance foundation model
17	BrepGaussian: CAD reconstruction from Multi-View Images with Gaussian Splatting	Proposes BrepGaussian, which reconstructs CAD models from multi-view images using Gaussian splatting.	gaussian splatting splatting
18	Monocular Endoscopic Tissue 3D Reconstruction with Multi-Level Geometry Regularization	Proposes a monocular endoscopic tissue 3D reconstruction method with multi-level geometry regularization, achieving real-time rendering and smooth surfaces.	3D gaussian splatting gaussian splatting splatting
19	WildGHand: Learning Anti-Perturbation Gaussian Hand Avatars from Monocular In-the-Wild Videos	WildGHand: learns anti-perturbation Gaussian hand avatars reconstructed from monocular in-the-wild videos.	3D gaussian splatting gaussian splatting splatting	✅
20	Real-time Motion Segmentation with Event-based Normal Flow	Proposes a real-time motion segmentation framework based on event-based normal flow, significantly improving the efficiency of dynamic scene understanding.	scene understanding

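🔬 Pillar 2: RL & Architecture (8 papers)

#	Title	Key point	Tags	🔗	⭐
21	LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding	Proposes LongVideo-R1, which achieves low-cost long-video understanding through active, reasoning-driven navigation.	reinforcement learning large language model multimodal	✅
22	GatedCLIP: Gated Multimodal Fusion for Hateful Memes Detection	Proposes GatedCLIP, which improves hateful memes detection through gated multimodal fusion.	contrastive learning multimodal
23	RAYNOVA: 3D-Geometry-Free Auto-Regressive Driving World Modeling with Unified Spatio-Temporal Representation	RAYNOVA: an auto-regressive driving world modeling method that needs no 3D geometric priors and uses a unified spatio-temporal representation.	world model physically plausible foundation model	✅
24	Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models	Proposes MMHNet to address the difficulty video-to-audio generation models have in generalizing to long durations.	Mamba multimodal
25	Communication-Inspired Tokenization for Structured Image Representations	Proposes COMiT, which learns structured image representations by mimicking human communication, improving compositional generalization and relational reasoning.	flow matching multimodal
26	A Lightweight Vision-Language Fusion Framework for Predicting App Ratings from User Interfaces and Metadata	Proposes a lightweight vision-language fusion framework that predicts app ratings from user interfaces and metadata.	MAE multimodal
27	Path-Decoupled Hyperbolic Flow Matching for Few-Shot Adaptation	Proposes path-decoupled hyperbolic flow matching (HFM) to address path entanglement in few-shot cross-modal transfer.	flow matching
28	PropFly: Learning to Propagate via On-the-Fly Supervision from Pre-trained Video Diffusion Models	PropFly: learns video edit propagation using on-the-fly supervision from pre-trained video diffusion models.	flow matching classifier-free guidance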

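🔬 Pillar 1: Robot Control (4 papers)

#	Title	Key point	Tags	🔗	⭐
29	Object-Scene-Camera Decomposition and Recomposition for Data-Efficient Monocular 3D Object Detection	Proposes an object-scene-camera decomposition and recomposition method to improve the data efficiency of monocular 3D object detection.	manipulation
30	From Perception to Action: An Interactive Benchmark for Vision Reasoning	Proposes the CHAIN benchmark for evaluating how well vision reasoning models can act in interactive physical environments.	manipulation	✅
31	See and Fix the Flaws: Enabling VLMs and Diffusion Models to Comprehend Visual Artifacts via Agentic Data Synthesis	ArtiAgent: enables VLMs and diffusion models to comprehend visual artifacts via agentic data synthesis.	manipulation
32	RecoverMark: Robust Watermarking for Localization and Recovery of Manipulated Faces	RecoverMark: a robust watermarking method for localizing and recovering manipulated faces.	manipulation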

🔬 Pillar 7: Motion Retargeting (3 papers)

#	Title	Key point	Tags	🔗	⭐
33	VGGDrive: Empowering Vision-Language Models with Cross-View Geometric Grounding for Autonomous Driving	VGGDrive: enhances vision-language models for autonomous driving through cross-view geometric grounding.	motion prediction foundation model
34	SpatiaLQA: A Benchmark for Evaluating Spatial Logical Reasoning in Vision-Language Models	Proposes the SpatiaLQA benchmark for evaluating vision-language models on complex spatial logical reasoning.	spatial relationship foundation model	✅
35	SIMSPINE: A Biomechanics-Aware Simulation Framework for 3D Spine Motion Annotation and Benchmarking	SIMSPINE: a biomechanics-aware simulation framework for 3D spine motion annotation and benchmarking.	motion estimation

🔬 Pillar 6: Video Extraction (2 papers)

#	Title	Key point	Tags	🔗	⭐
36	Interaction-aware Representation Modeling with Co-occurrence Consistency for Egocentric Hand-Object Parsing	Proposes InterFormer, which improves egocentric hand-object parsing through interaction-aware modeling and co-occurrence consistency.	egocentric	✅
37	Long-Term Multi-Session 3D Reconstruction Under Substantial Appearance Change	Proposes a joint SfM reconstruction method for 3D reconstruction under long-term appearance change.	feature matching

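🔬 Pillar 8: Physics-based Animation (2 papers)

#	Title	Key point	Tags	🔗	⭐
38	PFGNet: A Fully Convolutional Frequency-Guided Peripheral Gating Network for Efficient Spatiotemporal Predictive Learning	PFGNet: a fully convolutional frequency-guided peripheral gating network for efficient spatiotemporal predictive learning.	spatiotemporal	✅
39	Human Video Generation from a Single Image with 3D Pose and View Control	Proposes the HVG model, which generates high-quality human videos with 3D pose and view control from a single image.	spatiotemporal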
