cs.CV (2025-12-30)

📊 25 papers in total | 🔗 4 with code

🎯 Interest Area Navigation

Pillar 3: Perception & Semantics (7 🔗1) | Pillar 9: Embodied Foundation Models (7 🔗1) | Pillar 2: RL & Architecture (6 🔗1) | Pillar 1: Robot Control (3 🔗1) | Pillar 7: Motion Retargeting (1) | Pillar 4: Generative Motion (1)

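🔬 Pillar 3: Perception & Semantics (7 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
1	Towards Open-Vocabulary Industrial Defect Understanding with a Large-Scale Multimodal Dataset	Introduces IMDD-1M, a large-scale industrial multimodal defect dataset for open-vocabulary industrial defect understanding.	open-vocabulary open vocabulary foundation model	✅
2	Improved 3D Gaussian Splatting of Unknown Spacecraft Structure Using Space Environment Illumination Knowledge	Proposes a 3D Gaussian Splatting reconstruction method that uses knowledge of the Sun's position to handle dynamic illumination.	3D gaussian splatting 3DGS gaussian splatting
3	ARM: A Learnable, Plug-and-Play Module for CLIP-based Open-vocabulary Semantic Segmentation	Proposes the ARM module for CLIP-based open-vocabulary semantic segmentation.	open-vocabulary open vocabulary foundation model
4	Guided Diffusion-based Generation of Adversarial Objects for Real-World Monocular Depth Estimation Attacks	Proposes a diffusion-based method for generating adversarial objects, improving the realism and effectiveness of attacks on monocular depth estimation.	depth estimation monocular depth physically plausible
5	Robust Egocentric Referring Video Object Segmentation via Dual-Modal Causal Intervention	Proposes the CERES framework, which uses dual-modal causal intervention to address bias and confounding in Ego-RVOS.	metric depth egocentric
6	Structure-Guided Allocation of 2D Gaussians for Image Representation and Compression	Proposes a structure-guided 2D Gaussian allocation method that improves rate-distortion performance for image representation and compression.	gaussian splatting splatting
7	PipeFlow: Pipelined Processing and Motion-Aware Frame Selection for Long-Form Video Editing	PipeFlow: a pipelined processing and motion-aware frame selection method for long-form video editing.	optical flow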

🔬 Pillar 9: Embodied Foundation Models (7 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
8	DiffThinker: Towards Generative Multimodal Reasoning with Diffusion Models	Proposes DiffThinker, which uses diffusion models for generative multimodal reasoning and improves performance on vision-centric tasks.	large language model multimodal
9	Using Large Language Models To Translate Machine Results To Human Results	Uses large language models to convert machine results into human-readable radiology reports.	large language model
10	F2IDiff: Real-world Image Super-resolution using Feature to Image Diffusion Foundation Model	Proposes F2IDiff, which uses a feature-to-image diffusion model to improve real-world image super-resolution and reduce artifacts.	foundation model
11	Virtual-Eyes: Quantitative Validation of a Lung CT Quality-Control Pipeline for Foundation-Model Cancer Risk Prediction	Virtual-Eyes: a CT quality-control pipeline for lung cancer risk prediction that improves the performance of general-purpose foundation models.	foundation model
12	MGML: A Plug-and-Play Meta-Guided Multi-Modal Learning Framework for Incomplete Multimodal Brain Tumor Segmentation	Proposes the MGML framework to address incomplete multimodal MRI data in brain tumor segmentation.	multimodal	✅
13	Taming Hallucinations: Boosting MLLMs' Video Understanding via Counterfactual Video Generation	Proposes the DualityForge framework to address hallucinations in video understanding by multimodal large language models.	large language model multimodal
14	Enhancing LLM-Based Neural Network Generation: Few-Shot Prompting and Efficient Validation for Automated Architecture Design	Proposes FSAP and Whitespace-Normalized Hash Validation to improve the efficiency of LLM-based automated architecture design for computer vision.	large language model

🔬 Pillar 2: RL & Architecture (6 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
15	MotivNet: Evolving Meta-Sapiens into an Emotionally Intelligent Foundation Model	Proposes MotivNet to address generalization in facial emotion recognition.	masked autoencoder foundation model	✅
16	Bridging the Perception-Cognition Gap: Re-engineering SAM2 with Hilbert-Mamba for Robust VLM-based Medical Diagnosis	Proposes Hilbert-VLM, which enhances SAM2 with Hilbert-Mamba to improve the robustness of VLMs in medical diagnosis.	Mamba SSM state space model
17	MambaSeg: Harnessing Mamba for Accurate and Efficient Image-Event Semantic Segmentation	Proposes MambaSeg, which uses the Mamba architecture for efficient and accurate image-event semantic segmentation.	Mamba multimodal
18	RSAgent: Learning to Reason and Act for Text-Guided Segmentation via Multi-Turn Tool Invocations	Proposes RSAgent, which performs text-guided image segmentation through multi-turn tool invocations and significantly improves segmentation accuracy.	reinforcement learning large language model multimodal
19	DyStream: Streaming Dyadic Talking Heads Generation via Flow Matching-based Autoregressive Model	DyStream: streaming dyadic talking-head generation based on a flow-matching autoregressive model.	flow matching distillation
20	Balanced Hierarchical Contrastive Learning with Decoupled Queries for Fine-grained Object Detection in Remote Sensing Images	Proposes balanced hierarchical contrastive learning with decoupled queries to improve fine-grained object detection in remote sensing images.	representation learning contrastive learning

🔬 Pillar 1: Robot Control (3 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
21	UniAct: Unified Motion Generation and Action Streaming for Humanoid Robots	UniAct: unified motion generation and action streaming for humanoid robots.	humanoid humanoid robot humanoid control
22	Think Before You Move: Latent Motion Reasoning for Text-to-Motion Generation	Proposes the Latent Motion Reasoning (LMR) framework to address the semantic-motion impedance mismatch in text-to-motion generation.	motion planning text-to-motion motion generation	✅
23	SenseNova-MARS: Empowering Multimodal Agentic Reasoning and Search via Reinforcement Learning	Proposes SenseNova-MARS, which uses reinforcement learning to strengthen the reasoning and search capabilities of multimodal agents.	manipulation reinforcement learning multimodal

🔬 Pillar 7: Motion Retargeting (1 paper)

#	Title	One-Sentence Summary	Tags	🔗	⭐
24	GeoBench: Rethinking Multimodal Geometric Problem-Solving via Hierarchical Evaluation	GeoBench: rethinking multimodal geometric problem-solving through hierarchical evaluation.	spatial relationship multimodal chain-of-thought

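🔬 Pillar 4: Generative Motion (1 paper)

#	Title	One-Sentence Summary	Tags	🔗	⭐
25	Guiding a Diffusion Transformer with the Internal Dynamics of Itself	Proposes the Internal Guidance (IG) strategy, which improves the image generation quality and training efficiency of diffusion Transformers.	classifier-free guidance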
