cs.CV(2025-08-28)

📊 34 papers in total | 🔗 14 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (13 🔗6) · Pillar 2: RL & Architecture (6) · Pillar 3: Perception & Semantics (5 🔗1) · Pillar 7: Motion Retargeting (4 🔗4) · Pillar 1: Robot Control (4 🔗2) · Pillar 4: Generative Motion (2 🔗1)

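🔬 Pillar 9: Embodied Foundation Models (13 papers)

# | Title | One-line summary | Tags | 🔗
1 | CogVLA: Cognition-Aligned Vision-Language-Action Model via Instruction-Driven Routing & Sparsification | CogVLA: a cognition-aligned vision-language-action model built on instruction-driven routing and sparsification | vision-language-action, VLA, multimodal
2 | Dino U-Net: Exploiting High-Fidelity Dense Features from Foundation Models for Medical Image Segmentation | Dino U-Net: uses DINOv3's high-fidelity dense features to improve medical image segmentation accuracy | foundation model
3 | PathMR: Multimodal Visual Reasoning for Interpretable Pathology Diagnosis | Proposes PathMR, a multimodal visual reasoning framework for interpretable pathology diagnosis | multimodal
4 | Adapting Foundation Model for Dental Caries Detection with Dual-View Co-Training | Proposes DVCTNet, which uses dual-view co-training to improve dental caries detection accuracy | foundation model
5 | Graph-Based Uncertainty Modeling and Multimodal Fusion for Salient Object Detection | Proposes a salient object detection network with graph-based uncertainty modeling and multimodal fusion, improving detection accuracy in complex scenes | multimodal
6 | MedFoundationHub: A Lightweight and Secure Toolkit for Deploying Medical Vision Language Foundation Models | MedFoundationHub: a lightweight, secure toolkit for deploying medical vision-language models that addresses the risk of PHI exposure | foundation model
7 | Veritas: Generalizable Deepfake Detection via Pattern-Aware Reasoning | Proposes Veritas, which generalizes deepfake detection through pattern-aware reasoning and is evaluated on the HydraFake dataset | large language model, chain-of-thought
8 | R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforce Learning | R-4B: uses bi-mode annealing and reinforcement learning to incentivize general-purpose auto-thinking in MLLMs | large language model, multimodal
9 | GENNAV: Polygon Mask Generation for Generalized Referring Navigable Regions | GENNAV: polygon mask generation for generalized referring navigable regions | zero-shot transfer
10 | Generalizable Object Re-Identification via Visual In-Context Prompting | Proposes a general object re-identification method based on visual in-context prompting that needs no category-specific training | foundation model
11 | MMG-Vid: Maximizing Marginal Gains at Segment-level and Token-level for Efficient Video LLMs | MMG-Vid: improves video LLM efficiency by maximizing marginal gains at the segment and token levels | large language model
12 | Looking Beyond the Obvious: A Survey on Abstract Concept Recognition for Video Understanding | Survey: abstract concept recognition in video, using foundation models to advance video understanding | foundation model
13 | Improving Alignment in LVLMs with Debiased Self-Judgment | Proposes an LVLM alignment method based on debiased self-judgment, improving the safety and accuracy of vision-language models | large language model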

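🔬 Pillar 2: RL & Architecture (6 papers)

# | Title | One-line summary | Tags | 🔗
14 | "Humor, Art, or Misinformation?": A Multimodal Dataset for Intent-Aware Synthetic Image Detection | Proposes the S-HArM dataset for intent-aware synthetic image detection, addressing existing methods' neglect of the intent behind image generation | contrastive learning, HuMoR, multimodal
15 | HiddenObject: Modality-Agnostic Fusion for Multimodal Hidden Object Detection | Proposes HiddenObject, which uses Mamba to fuse RGB, depth, and thermal data to improve hidden object detection | Mamba, multimodal
16 | OneReward: Unified Mask-Guided Image Generation via Multi-Task Human Preference Learning | OneReward: unified mask-guided image generation via multi-task human preference learning | reinforcement learning, preference learning
17 | PHD: Personalized 3D Human Body Fitting with Point Diffusion | PHD: personalized 3D human body fitting with point diffusion, improving video pose estimation accuracy | distillation, human mesh recovery, HMR
18 | Contrastive Learning through Auxiliary Branch for Video Object Detection | Proposes CLAB, which uses a contrastive-learning auxiliary branch to improve the robustness of video object detection | contrastive learning
19 | Enhancing Mamba Decoder with Bidirectional Interaction in Multi-Task Dense Prediction | Proposes the Bidirectional Interaction Mamba (BIM) model to improve multi-task dense prediction | Mamba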

🔬 Pillar 3: Perception & Semantics (5 papers)

# | Title | One-line summary | Tags | 🔗
20 | ${C}^{3}$-GS: Learning Context-aware, Cross-dimension, Cross-scale Feature for Generalizable Gaussian Splatting | Proposes C³-GS to address insufficient feature encoding in high-quality view synthesis | gaussian splatting, splatting
21 | Adam SLAM - the last mile of camera calibration with 3DGS | Adam SLAM: fine-tunes camera calibration with 3DGS to improve novel view synthesis quality | 3DGS, NeRF
22 | Realistic and Controllable 3D Gaussian-Guided Object Editing for Driving Video Generation | Proposes G^2Editor for realistic, controllable 3D Gaussian-guided object editing in autonomous driving videos | 3D gaussian splatting, gaussian splatting, splatting
23 | Enhancing Pseudo-Boxes via Data-Level LiDAR-Camera Fusion for Unsupervised 3D Object Detection | Proposes data-level LiDAR-camera fusion for unsupervised 3D object detection, significantly improving pseudo-label quality | depth estimation, foundation model
24 | AvatarBack: Back-Head Generation for Complete 3D Avatars from Front-View Images | AvatarBack: a new framework that generates the back of the head to complete 3D avatars from front-view images | gaussian splatting, splatting

🔬 Pillar 7: Motion Retargeting (4 papers)

# | Title | One-line summary | Tags | 🔗
25 | SeqVLM: Proposal-Guided Multi-View Sequences Reasoning via VLM for Zero-Shot 3D Visual Grounding | SeqVLM: zero-shot 3D visual grounding through multi-view sequence reasoning with a VLM | spatial relationship, visual grounding
26 | COMETH: Convex Optimization for Multiview Estimation and Tracking of Humans | COMETH: convex-optimization-based multi-view pose estimation and tracking of multiple people | human motion, motion tracking
27 | SYNBUILD-3D: A large, multi-modal, and semantically rich synthetic dataset of 3D building models at Level of Detail 4 | SYNBUILD-3D: a large-scale, multimodal, semantically rich synthetic dataset of 3D building models at LoD4 | geometric consistency
28 | FW-GAN: Frequency-Driven Handwriting Synthesis with Wave-Modulated MLP Generator | FW-GAN: handwriting synthesis with a frequency-driven, wave-modulated MLP generator | spatial relationship

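🔬 Pillar 1: Robot Control (4 papers)

# | Title | One-line summary | Tags | 🔗
29 | DrivingGaussian++: Towards Realistic Reconstruction and Editable Simulation for Surrounding Dynamic Driving Scenes | DrivingGaussian++: realistic reconstruction and editable simulation for autonomous driving scenes | manipulation, scene reconstruction, large language model
30 | Webly-Supervised Image Manipulation Localization via Category-Aware Auto-Annotation | Proposes a webly supervised image manipulation localization method based on category-aware auto-annotation, easing data scarcity | manipulation
31 | Enhancing Corpus Callosum Segmentation in Fetal MRI via Pathology-Informed Domain Randomization | Proposes pathology-informed domain randomization to improve corpus callosum segmentation in fetal MRI, especially in cases of corpus callosum dysgenesis | domain randomization
32 | Towards Mechanistic Defenses Against Typographic Attacks in CLIP | Proposes a selective-ablation-based defense against typographic attacks on CLIP | manipulation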

🔬 Pillar 4: Generative Motion (2 papers)

# | Title | One-line summary | Tags | 🔗
33 | Embracing Aleatoric Uncertainty: Generating Diverse 3D Human Motion | Diverse-T2M: generates diverse 3D human motion by introducing uncertainty | text-to-motion, motion generation, human motion
34 | EmoCAST: Emotional Talking Portrait via Emotive Text Description | EmoCAST: a diffusion-based framework for text-driven emotional talking portrait generation | motion synthesis
