cs.CV (2025-12-18)

📊 39 papers in total | 🔗 11 with code

🎯 Interest Area Navigation

Pillar 2: RL Algorithms & Architecture (13 🔗4) · Pillar 3: Spatial Perception & Semantics (8 🔗2) · Pillar 9: Embodied Foundation Models (8 🔗4) · Pillar 1: Robot Control (5) · Pillar 8: Physics-based Animation (3 🔗1) · Pillar 4: Generative Motion (2)

🔬 Pillar 2: RL Algorithms & Architecture (13 papers)

| # | Title | One-line takeaway | Tags |
| --- | --- | --- | --- |
| 1 | KineST: A Kinematics-guided Spatiotemporal State Space Model for Human Motion Tracking from Sparse Signals | A kinematics-guided spatiotemporal state space model for tracking human motion from sparse signals | state space model, representation learning, human motion |
| 2 | BrepLLM: Native Boundary Representation Understanding with Large Language Models | The first large language model framework for native boundary representation (B-rep) understanding | contrastive learning, semantic mapping, semantic map |
| 3 | SNOW: Spatio-Temporal Scene Understanding with World Knowledge for Open-World Embodied Reasoning | A spatio-temporal scene understanding framework that fuses world knowledge for open-world embodied reasoning | world model, scene understanding, multimodal |
| 4 | AdaTooler-V: Adaptive Tool-Use for Images and Videos | AdaTooler-V improves the reasoning efficiency and performance of multimodal LLMs on image and video tasks through adaptive tool use | reinforcement learning, large language model, multimodal |
| 5 | Instant Expressive Gaussian Head Avatar via 3D-Aware Expression Distillation | An instant, highly expressive Gaussian head-avatar method based on 3D-aware expression distillation | distillation, gaussian splatting, splatting |
| 6 | SARMAE: Masked Autoencoder for SAR Representation Learning | SARMAE: a noise-aware masked autoencoder for SAR image representation learning | representation learning, masked autoencoder |
| 7 | 4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation | 4D-RGPT strengthens region-level reasoning of MLLMs in 4D scene understanding via perceptual distillation | distillation, multimodal |
| 8 | The World is Your Canvas: Painting Promptable Events with Reference Images, Trajectories, and Text | WorldCanvas combines text, trajectories, and reference images for controllable simulation of world events | world model, multimodal, visual grounding |
| 9 | Task-Oriented Data Synthesis and Control-Rectify Sampling for Remote Sensing Semantic Segmentation | TODSynth: a data synthesis and control-rectified sampling framework for remote sensing semantic segmentation | flow matching, foundation model, multimodal |
| 10 | Predictive Modeling of Maritime Radar Data Using Transformer Architecture | Explores Transformers for predictive modeling of maritime radar data, filling a gap in existing research | predictive model, spatiotemporal |
| 11 | MACL: Multi-Label Adaptive Contrastive Learning Loss for Remote Sensing Image Retrieval | MACL: a multi-label adaptive contrastive learning loss for remote sensing image retrieval | representation learning, contrastive learning |
| 12 | Skeleton-Snippet Contrastive Learning with Multiscale Feature Fusion for Action Localization | An action localization method based on skeleton-snippet contrastive learning and multiscale feature fusion | contrastive learning |
| 13 | MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Model for Embodied Task Planning | MomaGraph builds state-aware unified scene graphs with a vision-language model for embodied task planning | reinforcement learning, scene understanding |

🔬 Pillar 3: Spatial Perception & Semantics (8 papers)

| # | Title | One-line takeaway | Tags |
| --- | --- | --- | --- |
| 14 | Depth Any Panoramas: A Foundation Model for Panoramic Depth Estimation | DAP, a foundation model for panoramic depth estimation that improves cross-scene distance generalization and geometric consistency | depth estimation, metric depth, geometric consistency |
| 15 | SDFoam: Signed-Distance Foam for explicit surface reconstruction | SDFoam combines an explicit Voronoi diagram with an implicit SDF for accurate surface reconstruction | 3D gaussian splatting, 3DGS, gaussian splatting |
| 16 | N3D-VLM: Native 3D Grounding Enables Accurate Spatial Reasoning in Vision-Language Models | N3D-VLM: native 3D grounding enables accurate spatial reasoning in vision-language models | depth estimation, spatial relationship, multimodal |
| 17 | 4D Primitive-Mâché: Glueing Primitives for Persistent 4D Scene Reconstruction | 4D Primitive-Mâché: persistent 4D scene reconstruction from monocular video | scene reconstruction |
| 18 | Using Gaussian Splats to Create High-Fidelity Facial Geometry and Texture | Reconstructs high-fidelity facial geometry and texture with Gaussian splats, enabling controllable face generation | gaussian splatting, splatting, NeRF |
| 19 | CountZES: Counting via Zero-Shot Exemplar Selection | CountZES tackles object counting in zero-shot settings via zero-shot exemplar selection | open-vocabulary, open vocabulary |
| 20 | Privacy-Aware Sharing of Raw Spatial Sensor Data for Cooperative Perception | SHARP, a framework that addresses privacy leakage when sharing raw spatial sensor data for cooperative vehicle perception | scene understanding |
| 21 | Infinite-Homography as Robust Conditioning for Camera-Controlled Video Generation | InfCam uses infinite homography as robust conditioning for camera-controlled video generation | depth estimation |

🔬 Pillar 9: Embodied Foundation Models (8 papers)

| # | Title | One-line takeaway | Tags |
| --- | --- | --- | --- |
| 22 | Causal-Tune: Mining Causal Factors from Vision Foundation Models for Domain Generalized Semantic Segmentation | Causal-Tune mines causal factors from vision foundation models for domain-generalized semantic segmentation | foundation model |
| 23 | Smile on the Face, Sadness in the Eyes: Bridging the Emotion Gap with a Multimodal Dataset of Eye and Facial Behaviors | Proposes the EMERT model and EMER dataset, using eye behaviors to bridge the gap between facial expression recognition and emotion recognition | multimodal |
| 24 | Kling-Omni Technical Report | Kling-Omni: a general generative framework for end-to-end synthesis of high-quality video from multimodal inputs | multimodal, instruction following |
| 25 | Sketch-in-Latents: Eliciting Unified Reasoning in MLLMs | Sketch-in-Latents (SkiLa) enables unified multimodal reasoning and visual imagination in MLLMs | large language model, multimodal |
| 26 | A Benchmark and Agentic Framework for Omni-Modal Reasoning and Tool Use in Long Videos | LongShOTBench, a benchmark for omni-modal reasoning and tool use in long videos, plus the LongShOTAgent agentic framework | multimodal |
| 27 | VIVA: VLM-Guided Instruction-Based Video Editing with Reward Optimization | The VIVA framework tackles instruction generalization in video editing | instruction following |
| 28 | REGLUE Your Latents with Global and Local Semantics for Entangled Diffusion | REGLUE: a diffusion model that fuses global and local semantics to disentangle latents, improving image synthesis quality and convergence speed | foundation model |
| 29 | VenusBench-GD: A Comprehensive Multi-Platform GUI Benchmark for Diverse Grounding Tasks | VenusBench-GD: a comprehensive multi-platform GUI benchmark for diverse grounding tasks | multimodal |

🔬 Pillar 1: Robot Control (5 papers)

| # | Title | One-line takeaway | Tags |
| --- | --- | --- | --- |
| 30 | GeoPredict: Leveraging Predictive Kinematics and 3D Gaussian Geometry for Precise VLA Manipulation | GeoPredict leverages predictive kinematics and 3D Gaussian geometry for precise VLA manipulation | manipulation, vision-language-action, VLA |
| 31 | Make-It-Poseable: Feed-forward Latent Posing Model for 3D Humanoid Character Animation | Make-It-Poseable animates 3D humanoid characters through latent-space pose transformations | humanoid, character animation |
| 32 | OPENTOUCH: Bringing Full-Hand Touch to Real-World Interaction | OpenTouch builds a dataset and benchmark for full-hand touch interaction in real-world settings | manipulation, egocentric, multimodal |
| 33 | Animate Any Character in Any World | AniX: a general character animation framework for controlling character behavior in arbitrary 3D scenes | locomotion, world model, 3DGS |
| 34 | TextEditBench: Evaluating Reasoning-aware Text Editing Beyond Rendering | TextEditBench evaluates models on reasoning-aware text editing in images | manipulation, multimodal |

🔬 Pillar 8: Physics-based Animation (3 papers)

| # | Title | One-line takeaway | Tags |
| --- | --- | --- | --- |
| 35 | EverybodyDance: Bipartite Graph-Based Identity Correspondence for Multi-Character Animation | EverybodyDance: a bipartite-graph-based character matching method for preserving identity consistency in multi-character animation | character animation |
| 36 | Characterizing Motion Encoding in Video Diffusion Timesteps | Reveals how video diffusion models encode motion by quantifying the motion-appearance trade-off across timesteps | spatiotemporal |
| 37 | EasyV2V: A High-quality Instruction-based Video Editing Framework | EasyV2V: a high-quality instruction-driven framework for flexible, controllable video editing | spatiotemporal |

🔬 Pillar 4: Generative Motion (2 papers)

| # | Title | One-line takeaway | Tags |
| --- | --- | --- | --- |
| 38 | Prime and Reach: Synthesising Body Motion for Gaze-Primed Object Reach | A gaze-primed human motion synthesis method for object reach and placement tasks | motion generation, human motion, human motion generation |
| 39 | Flowing from Reasoning to Motion: Learning 3D Hand Trajectory Prediction from Egocentric Human Interaction Videos | EgoMAN learns 3D hand trajectory prediction from egocentric interaction videos, bridging reasoning and motion | motion generation, egocentric |
