cs.CV (2025-12-30)

📊 25 papers in total | 🔗 4 with code

🎯 Interest Area Navigation

Pillar 3: Perception & Semantics (7 🔗1) · Pillar 9: Embodied Foundation Models (7 🔗1) · Pillar 2: RL & Architecture (6 🔗1) · Pillar 1: Robot Control (3 🔗1) · Pillar 7: Motion Retargeting (1) · Pillar 4: Generative Motion (1)

🔬 Pillar 3: Perception & Semantics (7 papers)

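# | Title | Key point | Tags | 🔗
1 | Towards Open-Vocabulary Industrial Defect Understanding with a Large-Scale Multimodal Dataset | Introduces IMDD-1M, a large-scale industrial multimodal defect dataset for open-vocabulary industrial defect understanding. | open-vocabulary, open vocabulary, foundation model
2 | Improved 3D Gaussian Splatting of Unknown Spacecraft Structure Using Space Environment Illumination Knowledge | Proposes a 3D Gaussian splatting reconstruction method that uses knowledge of the Sun's position to handle dynamic illumination. | 3D gaussian splatting, 3DGS, gaussian splatting
3 | ARM: A Learnable, Plug-and-Play Module for CLIP-based Open-vocabulary Semantic Segmentation | Proposes the ARM module for CLIP-based open-vocabulary semantic segmentation. | open-vocabulary, open vocabulary, foundation model
4 | Guided Diffusion-based Generation of Adversarial Objects for Real-World Monocular Depth Estimation Attacks | Proposes a diffusion-based method for generating adversarial objects that improves the realism and effectiveness of attacks on monocular depth estimation. | depth estimation, monocular depth, physically plausible
5 | Robust Egocentric Referring Video Object Segmentation via Dual-Modal Causal Intervention | Proposes the CERES framework, which uses dual-modal causal intervention to address bias and confounding in Ego-RVOS. | metric depth, egocentric
6 | Structure-Guided Allocation of 2D Gaussians for Image Representation and Compression | Proposes a structure-guided allocation method for 2D Gaussians that improves rate-distortion performance in image representation and compression. | gaussian splatting, splatting
7 | PipeFlow: Pipelined Processing and Motion-Aware Frame Selection for Long-Form Video Editing | PipeFlow: a pipelined processing and motion-aware frame selection method for long-form video editing. | optical flow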

🔬 Pillar 9: Embodied Foundation Models (7 papers)

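# | Title | Key point | Tags | 🔗
8 | DiffThinker: Towards Generative Multimodal Reasoning with Diffusion Models | Proposes DiffThinker, which uses diffusion models for generative multimodal reasoning and improves performance on vision-centric tasks. | large language model, multimodal
9 | Using Large Language Models To Translate Machine Results To Human Results | Uses large language models to convert machine results into human-readable radiology reports. | large language model
10 | F2IDiff: Real-world Image Super-resolution using Feature to Image Diffusion Foundation Model | Proposes F2IDiff, which uses a feature-to-image diffusion model to improve real-world image super-resolution and reduce artifacts. | foundation model
11 | Virtual-Eyes: Quantitative Validation of a Lung CT Quality-Control Pipeline for Foundation-Model Cancer Risk Prediction | Virtual-Eyes: a CT quality-control pipeline for lung cancer risk prediction that improves the performance of general-purpose foundation models. | foundation model
12 | MGML: A Plug-and-Play Meta-Guided Multi-Modal Learning Framework for Incomplete Multimodal Brain Tumor Segmentation | Proposes the MGML framework to address incomplete multimodal MRI data in brain tumor segmentation. | multimodal
13 | Taming Hallucinations: Boosting MLLMs' Video Understanding via Counterfactual Video Generation | Proposes the DualityForge framework to address hallucinations in video understanding by multimodal large language models. | large language model, multimodal
14 | Enhancing LLM-Based Neural Network Generation: Few-Shot Prompting and Efficient Validation for Automated Architecture Design | Proposes FSAP and Whitespace-Normalized Hash Validation to improve the efficiency of LLM-based automated architecture design for computer vision. | large language model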

🔬 Pillar 2: RL & Architecture (6 papers)

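# | Title | Key point | Tags | 🔗
15 | MotivNet: Evolving Meta-Sapiens into an Emotionally Intelligent Foundation Model | Proposes MotivNet to address generalization in facial emotion recognition. | masked autoencoder, foundation model
16 | Bridging the Perception-Cognition Gap: Re-engineering SAM2 with Hilbert-Mamba for Robust VLM-based Medical Diagnosis | Proposes Hilbert-VLM, which enhances SAM2 with Hilbert-Mamba to improve the robustness of VLM-based medical diagnosis. | Mamba, SSM, state space model
17 | MambaSeg: Harnessing Mamba for Accurate and Efficient Image-Event Semantic Segmentation | Proposes MambaSeg, which uses the Mamba architecture for efficient and accurate image-event semantic segmentation. | Mamba, multimodal
18 | RSAgent: Learning to Reason and Act for Text-Guided Segmentation via Multi-Turn Tool Invocations | Proposes RSAgent, which performs text-guided image segmentation through multi-turn tool invocations and significantly improves segmentation accuracy. | reinforcement learning, large language model, multimodal
19 | DyStream: Streaming Dyadic Talking Heads Generation via Flow Matching-based Autoregressive Model | DyStream: streaming dyadic talking-head generation based on a flow-matching autoregressive model. | flow matching, distillation
20 | Balanced Hierarchical Contrastive Learning with Decoupled Queries for Fine-grained Object Detection in Remote Sensing Images | Proposes balanced hierarchical contrastive learning with decoupled queries to improve fine-grained object detection in remote sensing images. | representation learning, contrastive learning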

🔬 Pillar 1: Robot Control (3 papers)

# | Title | Key point | Tags | 🔗
21 | UniAct: Unified Motion Generation and Action Streaming for Humanoid Robots | UniAct: unified motion generation and action streaming for humanoid robots. | humanoid, humanoid robot, humanoid control
22 | Think Before You Move: Latent Motion Reasoning for Text-to-Motion Generation | Proposes the Latent Motion Reasoning (LMR) framework to address the semantic-motion impedance mismatch in text-to-motion generation. | motion planning, text-to-motion, motion generation
23 | SenseNova-MARS: Empowering Multimodal Agentic Reasoning and Search via Reinforcement Learning | Proposes SenseNova-MARS, which uses reinforcement learning to strengthen the reasoning and search capabilities of multimodal agents. | manipulation, reinforcement learning, multimodal

🔬 Pillar 7: Motion Retargeting (1 paper)

# | Title | Key point | Tags | 🔗
24 | GeoBench: Rethinking Multimodal Geometric Problem-Solving via Hierarchical Evaluation | GeoBench: rethinking multimodal geometric problem-solving through hierarchical evaluation. | spatial relationship, multimodal, chain-of-thought

🔬 Pillar 4: Generative Motion (1 paper)

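# | Title | Key point | Tags | 🔗
25 | Guiding a Diffusion Transformer with the Internal Dynamics of Itself | Proposes an Internal Guidance (IG) strategy that improves the image generation quality and training efficiency of diffusion Transformers. | classifier-free guidance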
