cs.CV (2026-04-29)

📊 35 papers in total | 🔗 5 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (11 🔗1) | Pillar 3: Perception & Semantics (8 🔗1) | Pillar 2: RL & Architecture (7 🔗1) | Pillar 5: Interaction & Reaction (2) | Pillar 1: Robot Control (2) | Pillar 6: Video Extraction (2 🔗1) | Pillar 7: Motion Retargeting (2 🔗1) | Pillar 4: Generative Motion (1)

🔬 Pillar 9: Embodied Foundation Models (11 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
1	CheXthought: A global multimodal dataset of clinical chain-of-thought reasoning and visual attention for chest X-ray interpretation	Proposes CheXthought to improve multimodal reasoning for chest X-ray interpretation	multimodal chain-of-thought
2	Three-Step Nav: A Hierarchical Global-Local Planner for Zero-Shot Vision-and-Language Navigation	Proposes Three-Step Nav to address drift and early stopping in zero-shot vision-and-language navigation	VLN large language model multimodal	✅
3	AnimateAnyMesh++: A Flexible 4D Foundation Model for High-Fidelity Text-Driven Mesh Animation	AnimateAnyMesh++: a flexible 4D foundation model for high-fidelity text-driven mesh animation	foundation model
4	TAP into the Patch Tokens: Leveraging Vision Foundation Model Features for AI-Generated Image Detection	Proposes TAP, which leverages vision foundation model features to improve AI-generated image detection	foundation model
5	Decoupled Prototype Matching with Vision Foundation Models for Few-Shot Industrial Object Detection	Uses vision foundation models and decoupled prototype matching for few-shot industrial object detection	foundation model
6	FASH-iCNN: Making Editorial Fashion Identity Inspectable Through Multimodal CNN Probing	FASH-iCNN: makes editorial fashion identity inspectable through multimodal CNN probing	multimodal
7	State Beyond Appearance: Diagnosing and Improving State Consistency in Dial-Based Measurement Reading	Proposes the TriSCA framework to improve MLLM state consistency in dial reading, addressing performance degradation under viewpoint and illumination changes	large language model multimodal
8	Adaptive Transform Coding for Semantic Compression	Proposes an adaptive transform coding method for semantic compression that improves machine vision task performance	foundation model
9	Featurising Pixels from Dynamic 3D Scenes with Linear In-Context Learners	Proposes LILA, which uses linear in-context learning to learn pixel-level features from dynamic 3D scenes	foundation model
10	Sparsity as a Key: Unlocking New Insights from Latent Structures for Out-of-Distribution Detection	Proposes a sparse-autoencoder-based out-of-distribution detection method for ViTs to improve model safety	large language model
11	Beyond Shortcuts: Mitigating Visual Illusions in Frozen VLMs via Qualitative Reasoning	Proposes the SQI framework, which uses qualitative reasoning to make the perception of frozen VLMs more robust to visual illusions	visual grounding

🔬 Pillar 3: Perception & Semantics (8 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
12	MesonGS++: Post-training Compression of 3D Gaussian Splatting with Hyperparameter Searching	MesonGS++: post-training compression of 3D Gaussian Splatting via hyperparameter search, significantly reducing storage cost	3D gaussian splatting 3DGS gaussian splatting	✅
13	EnerGS: Energy-Based Gaussian Splatting with Partial Geometric Priors	EnerGS: energy-based 3D Gaussian Splatting that uses partial geometric priors to improve reconstruction quality	3D gaussian splatting 3DGS gaussian splatting
14	MemOVCD: Training-Free Open-Vocabulary Change Detection via Cross-Temporal Memory Reasoning and Global-Local Adaptive Rectification	Proposes MemOVCD for training-free open-vocabulary change detection via cross-temporal memory reasoning and adaptive rectification	open-vocabulary open vocabulary foundation model
15	Last-Layer-Centric Feature Recombination: Unleashing 3D Geometric Knowledge in DINOv3 for Monocular Depth Estimation	Proposes a Last-Layer-Centric Feature Recombination module to make better use of DINOv3's geometric information in monocular depth estimation	depth estimation monocular depth foundation model
16	Seeking Consensus: Geometric-Semantic On-the-Fly Recalibration for Open-Vocabulary Remote Sensing Semantic Segmentation	Proposes the SeeCo framework, which improves open-vocabulary remote sensing semantic segmentation through geometric-semantic consensus recalibration	open-vocabulary open vocabulary
17	Color-Encoded Illumination for High-Speed Volumetric Scene Reconstruction	Proposes a high-speed volumetric 3D reconstruction method based on color-encoded illumination that requires no camera hardware modification	gaussian splatting splatting scene reconstruction
18	AirZoo: A Unified Large-Scale Dataset for Grounding Aerial Geometric 3D Vision	AirZoo: a unified large-scale dataset and benchmark for aerial geometric 3D vision	metric depth Depth Anything 3D reconstruction
19	Semantic Foam: Unifying Spatial and Semantic Scene Decomposition	Semantic Foam: unifies spatial and semantic scene decomposition to improve interactive graphics applications	3D gaussian splatting gaussian splatting splatting

🔬 Pillar 2: RL & Architecture (7 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
20	GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents	GLM-5V-Turbo: a native foundation model for multimodal agents	reinforcement learning foundation model multimodal
21	Multiple Consistent 2D-3D Mappings for Robust Zero-Shot 3D Visual Grounding	MCM-VG: robust zero-shot 3D visual grounding via multiple consistent 2D-3D mappings	distillation open-vocabulary open vocabulary
22	World2VLM: Distilling World Model Imagination into VLMs for Dynamic Spatial Reasoning	World2VLM: distills world-model imagination into VLMs for dynamic spatial reasoning	world model world models egocentric
23	A Multimodal Pre-trained Network for Integrated EEG-Video Seizure Detection	Proposes EEGVFusion, which integrates EEG and video information to improve the reliability of seizure detection in mice	representation learning multimodal
24	$\text{PKS}^4$: Parallel Kinematic Selective State Space Scanners for Efficient Video Understanding	Proposes PKS$^4$, which achieves efficient video understanding with parallel kinematic selective state space scanners	SSM state space model spatial relationship
25	GaitKD: A Universal Decoupled Distillation Framework for Efficient Gait Recognition	GaitKD: a universal decoupled distillation framework for efficient gait recognition	teacher-student distillation	✅
26	Edge AI for Automotive Vulnerable Road User Safety: Deployable Detection via Knowledge Distillation	Proposes a knowledge-distillation-based edge AI solution that improves INT8-quantized accuracy for vulnerable road user detection in autonomous driving	distillation

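🔬 Pillar 5: Interaction & Reaction (2 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
27	Cross-Domain Transfer of Hyperspectral Foundation Models	Proposes cross-domain transfer of hyperspectral foundation models to improve semantic segmentation in proximal remote sensing	HSI foundation model
28	HOI-aware Adaptive Network for Weakly-supervised Action Segmentation	Proposes AdaAct, an HOI-aware adaptive network for weakly supervised action segmentation	human-object interaction HOI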

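🔬 Pillar 1: Robot Control (2 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
29	Attribution-Guided Multimodal Deepfake Detection via Cross-Modal Forensic Fingerprints	Proposes an attribution-guided multimodal deepfake detection framework that improves detection accuracy through cross-modal fingerprints	manipulation multimodal
30	GIFGuard: Proactive Forensics against Deepfakes in Facial GIFs via Spatiotemporal Watermarking	Proposes GIFGuard, which uses spatiotemporal watermarking for proactive forensics against deepfakes in GIF images	manipulation spatiotemporal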

🔬 Pillar 6: Video Extraction (2 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
31	DenseStep2M: A Scalable, Training-Free Pipeline for Dense Instructional Video Annotation	Proposes DenseStep2M, a scalable, training-free pipeline for dense instructional video annotation	egocentric large language model multimodal	✅
32	ViBE: Visual-to-M/EEG Brain Encoding via Spatio-Temporal VAE and Distribution-Aligned Projection	ViBE: visual-to-M/EEG brain encoding via a spatio-temporal VAE and distribution-aligned projection	feature matching

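🔬 Pillar 7: Motion Retargeting (2 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
33	GateMOT: Q-Gated Attention for Dense Object Tracking	Proposes GateMOT with Q-Gated Attention to address the computational bottleneck of high-resolution features in dense object tracking	motion estimation
34	Motion-Driven Multi-Object Tracking of Model Organisms in Space Science Experiments	ART-Track: motion-driven multi-object tracking of model organisms in space science experiments	motion estimation	✅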

🔬 Pillar 4: Generative Motion (1 paper)

#	Title	One-Sentence Summary	Tags	🔗	⭐
35	Delta Score Matters! Spatial Adaptive Multi Guidance in Diffusion Models	Proposes Spatial Adaptive Multi Guidance (SAMG) to address missing details and artifacts in diffusion models	classifier-free guidance
