cs.CV (2025-12-18)

📊 39 papers in total | 🔗 11 with code

🎯 Interest Area Navigation

Pillar 2: RL Algorithms & Architecture (13 🔗4) · Pillar 3: Spatial Perception & Semantics (8 🔗2) · Pillar 9: Embodied Foundation Models (8 🔗4) · Pillar 1: Robot Control (5) · Pillar 8: Physics-based Animation (3 🔗1) · Pillar 4: Generative Motion (2)

🔬 Pillar 2: RL Algorithms & Architecture (13 papers)

| # | Title | One-line takeaway | Tags |
| --- | --- | --- | --- |
| 1 | KineST: A Kinematics-guided Spatiotemporal State Space Model for Human Motion Tracking from Sparse Signals | A kinematics-guided spatiotemporal state space model for tracking human motion from sparse signals | state space model, representation learning, human motion |
| 2 | BrepLLM: Native Boundary Representation Understanding with Large Language Models | The first large language model framework for native boundary representation (B-rep) understanding | contrastive learning, semantic mapping, semantic map |
| 3 | SNOW: Spatio-Temporal Scene Understanding with World Knowledge for Open-World Embodied Reasoning | A spatio-temporal scene understanding framework that fuses world knowledge for open-world embodied reasoning | world model, scene understanding, multimodal |
| 4 | AdaTooler-V: Adaptive Tool-Use for Images and Videos | AdaTooler-V improves the reasoning efficiency and performance of multimodal LLMs on image and video tasks through adaptive tool use | reinforcement learning, large language model, multimodal |
| 5 | Instant Expressive Gaussian Head Avatar via 3D-Aware Expression Distillation | An instant, highly expressive Gaussian head-avatar method based on 3D-aware expression distillation | distillation, gaussian splatting, splatting |
| 6 | SARMAE: Masked Autoencoder for SAR Representation Learning | SARMAE: a noise-aware masked autoencoder for SAR image representation learning | representation learning, masked autoencoder |
| 7 | 4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation | 4D-RGPT strengthens region-level reasoning of MLLMs in 4D scene understanding via perceptual distillation | distillation, multimodal |
| 8 | The World is Your Canvas: Painting Promptable Events with Reference Images, Trajectories, and Text | WorldCanvas combines text, trajectories, and reference images for controllable simulation of world events | world model, multimodal, visual grounding |
| 9 | Task-Oriented Data Synthesis and Control-Rectify Sampling for Remote Sensing Semantic Segmentation | TODSynth: a data synthesis and control-rectified sampling framework for remote sensing semantic segmentation | flow matching, foundation model, multimodal |
| 10 | Predictive Modeling of Maritime Radar Data Using Transformer Architecture | Explores Transformers for predictive modeling of maritime radar data, filling a gap in existing research | predictive model, spatiotemporal |
| 11 | MACL: Multi-Label Adaptive Contrastive Learning Loss for Remote Sensing Image Retrieval | MACL: a multi-label adaptive contrastive learning loss for remote sensing image retrieval | representation learning, contrastive learning |
| 12 | Skeleton-Snippet Contrastive Learning with Multiscale Feature Fusion for Action Localization | An action localization method based on skeleton-snippet contrastive learning and multiscale feature fusion | contrastive learning |
| 13 | MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Model for Embodied Task Planning | MomaGraph builds state-aware unified scene graphs with a vision-language model for embodied task planning | reinforcement learning, scene understanding |

🔬 Pillar 3: Spatial Perception & Semantics (8 papers)

| # | Title | One-line takeaway | Tags |
| --- | --- | --- | --- |
| 14 | Depth Any Panoramas: A Foundation Model for Panoramic Depth Estimation | DAP, a foundation model for panoramic depth estimation that improves cross-scene distance generalization and geometric consistency | depth estimation, metric depth, geometric consistency |
| 15 | SDFoam: Signed-Distance Foam for explicit surface reconstruction | SDFoam combines an explicit Voronoi diagram with an implicit SDF for accurate surface reconstruction | 3D gaussian splatting, 3DGS, gaussian splatting |
| 16 | N3D-VLM: Native 3D Grounding Enables Accurate Spatial Reasoning in Vision-Language Models | N3D-VLM: native 3D grounding enables accurate spatial reasoning in vision-language models | depth estimation, spatial relationship, multimodal |
| 17 | 4D Primitive-Mâché: Glueing Primitives for Persistent 4D Scene Reconstruction | 4D Primitive-Mâché: persistent 4D scene reconstruction from monocular video | scene reconstruction |
| 18 | Using Gaussian Splats to Create High-Fidelity Facial Geometry and Texture | Reconstructs high-fidelity facial geometry and texture with Gaussian splats, enabling controllable face generation | gaussian splatting, splatting, NeRF |
| 19 | CountZES: Counting via Zero-Shot Exemplar Selection | CountZES tackles object counting in zero-shot settings via zero-shot exemplar selection | open-vocabulary, open vocabulary |
| 20 | Privacy-Aware Sharing of Raw Spatial Sensor Data for Cooperative Perception | SHARP, a framework that addresses privacy leakage when sharing raw spatial sensor data for cooperative vehicle perception | scene understanding |
| 21 | Infinite-Homography as Robust Conditioning for Camera-Controlled Video Generation | InfCam uses infinite homography as robust conditioning for camera-controlled video generation | depth estimation |

🔬 Pillar 9: Embodied Foundation Models (8 papers)

| # | Title | One-line takeaway | Tags |
| --- | --- | --- | --- |
| 22 | Causal-Tune: Mining Causal Factors from Vision Foundation Models for Domain Generalized Semantic Segmentation | Causal-Tune mines causal factors from vision foundation models for domain-generalized semantic segmentation | foundation model |
| 23 | Smile on the Face, Sadness in the Eyes: Bridging the Emotion Gap with a Multimodal Dataset of Eye and Facial Behaviors | Proposes the EMERT model and EMER dataset, using eye behaviors to bridge the gap between facial expression recognition and emotion recognition | multimodal |
| 24 | Kling-Omni Technical Report | Kling-Omni: a general generative framework for end-to-end synthesis of high-quality video from multimodal inputs | multimodal, instruction following |
| 25 | Sketch-in-Latents: Eliciting Unified Reasoning in MLLMs | Sketch-in-Latents (SkiLa) enables unified multimodal reasoning and visual imagination in MLLMs | large language model, multimodal |
| 26 | A Benchmark and Agentic Framework for Omni-Modal Reasoning and Tool Use in Long Videos | LongShOTBench, a benchmark for omni-modal reasoning and tool use in long videos, plus the LongShOTAgent agentic framework | multimodal |
| 27 | VIVA: VLM-Guided Instruction-Based Video Editing with Reward Optimization | The VIVA framework tackles instruction generalization in video editing | instruction following |
| 28 | REGLUE Your Latents with Global and Local Semantics for Entangled Diffusion | REGLUE: a diffusion model that fuses global and local semantics to disentangle latents, improving image synthesis quality and convergence speed | foundation model |
| 29 | VenusBench-GD: A Comprehensive Multi-Platform GUI Benchmark for Diverse Grounding Tasks | VenusBench-GD: a comprehensive multi-platform GUI benchmark for diverse grounding tasks | multimodal |

🔬 Pillar 1: Robot Control (5 papers)

| # | Title | One-line takeaway | Tags |
| --- | --- | --- | --- |
| 30 | GeoPredict: Leveraging Predictive Kinematics and 3D Gaussian Geometry for Precise VLA Manipulation | GeoPredict leverages predictive kinematics and 3D Gaussian geometry for precise VLA manipulation | manipulation, vision-language-action, VLA |
| 31 | Make-It-Poseable: Feed-forward Latent Posing Model for 3D Humanoid Character Animation | Make-It-Poseable animates 3D humanoid characters through latent-space pose transformations | humanoid, character animation |
| 32 | OPENTOUCH: Bringing Full-Hand Touch to Real-World Interaction | OpenTouch builds a dataset and benchmark for full-hand touch interaction in real-world settings | manipulation, egocentric, multimodal |
| 33 | Animate Any Character in Any World | AniX: a general character animation framework for controlling character behavior in arbitrary 3D scenes | locomotion, world model, 3DGS |
| 34 | TextEditBench: Evaluating Reasoning-aware Text Editing Beyond Rendering | TextEditBench evaluates models on reasoning-aware text editing in images | manipulation, multimodal |

🔬 Pillar 8: Physics-based Animation (3 papers)

| # | Title | One-line takeaway | Tags |
| --- | --- | --- | --- |
| 35 | EverybodyDance: Bipartite Graph-Based Identity Correspondence for Multi-Character Animation | EverybodyDance: a bipartite-graph-based character matching method for preserving identity consistency in multi-character animation | character animation |
| 36 | Characterizing Motion Encoding in Video Diffusion Timesteps | Reveals how video diffusion models encode motion by quantifying the motion-appearance trade-off across timesteps | spatiotemporal |
| 37 | EasyV2V: A High-quality Instruction-based Video Editing Framework | EasyV2V: a high-quality instruction-driven framework for flexible, controllable video editing | spatiotemporal |

🔬 Pillar 4: Generative Motion (2 papers)

| # | Title | One-line takeaway | Tags |
| --- | --- | --- | --- |
| 38 | Prime and Reach: Synthesising Body Motion for Gaze-Primed Object Reach | A gaze-primed human motion synthesis method for object reach and placement tasks | motion generation, human motion, human motion generation |
| 39 | Flowing from Reasoning to Motion: Learning 3D Hand Trajectory Prediction from Egocentric Human Interaction Videos | EgoMAN learns 3D hand trajectory prediction from egocentric interaction videos, bridging reasoning and motion | motion generation, egocentric |
