cs.CV (2026-01-21)

📊 25 papers in total | 🔗 3 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (10 🔗3) · Pillar 3: Perception & Semantics (5) · Pillar 2: RL & Architecture (5) · Pillar 1: Robot Control (3) · Pillar 4: Generative Motion (1) · Pillar 8: Physics-based Animation (1)

🔬 Pillar 9: Embodied Foundation Models (10 papers)

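#	Title	One-sentence summary	Tags	🔗	⭐
1	Multimodal system for skin cancer detection	Proposes a multimodal skin cancer detection system based on ordinary photographs and metadata, improving diagnostic accessibility.	multimodal
2	Iterative Refinement Improves Compositional Image Generation	Proposes an iterative refinement framework that uses vision-language model feedback to improve compositional image generation quality.	large language model chain-of-thought
3	Towards Understanding Best Practices for Quantization of Vision-Language Models	Studies best practices for quantizing vision-language models to improve efficiency on multimodal tasks.	large language model multimodal	✅
4	LiViBench: An Omnimodal Benchmark for Interactive Livestream Video Understanding	Proposes LiViBench, an omnimodal benchmark for interactive livestream video understanding.	large language model multimodal
5	FunCineForge: A Unified Dataset Toolkit and Model for Zero-Shot Movie Dubbing in Diverse Cinematic Scenes	FunCineForge: a unified toolkit and model for zero-shot movie dubbing across diverse cinematic scenes.	multimodal instruction following
6	HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding	Proposes HERMES, which uses a hierarchical KV cache for efficient streaming video understanding.	large language model multimodal
7	3D Space as a Scratchpad for Editable Text-to-Image Generation	Proposes an editable text-to-image generation framework that uses 3D space as a scratchpad, improving spatial reasoning.	large language model chain-of-thought	✅
8	Rethinking Video Generation Model for the Embodied World	Targeting embodied AI, proposes RBench, a benchmark for evaluating robot video generation, and RoVid-X, a large-scale dataset.	embodied AI
9	Training-Free and Interpretable Hateful Video Detection via Multi-stage Adversarial Reasoning	Proposes MARS, a training-free and interpretable multi-stage adversarial reasoning framework for detecting hateful videos.	multimodal	✅
10	Symmetry Informative and Agnostic Feature Disentanglement for 3D Shapes	Proposes a method for disentangling symmetry-informative and symmetry-agnostic features, improving 3D shape analysis.	foundation model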

🔬 Pillar 3: Perception & Semantics (5 papers)

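#	Title	One-sentence summary	Tags	🔗	⭐
11	POTR: Post-Training 3DGS Compression	POTR: a post-training compression method for 3D Gaussian splatting that significantly speeds up inference and reduces storage requirements.	3D gaussian splatting 3DGS gaussian splatting
12	GAT-NeRF: Geometry-Aware-Transformer Enhanced Neural Radiance Fields for High-Fidelity 4D Facial Avatars	Proposes GAT-NeRF, which enhances NeRF with a geometry-aware Transformer to reconstruct high-fidelity 4D facial avatars.	NeRF neural radiance field
13	SpatialMem: Unified 3D Memory with Metric Anchoring and Fast Retrieval	SpatialMem: a unified 3D memory system with metric anchoring and fast retrieval.	open-vocabulary open vocabulary egocentric
14	RayRoPE: Projective Ray Positional Encoding for Multi-view Attention	RayRoPE: a projective ray positional encoding for multi-view attention that improves novel view synthesis and depth estimation.	depth estimation stereo depth
15	ScenDi: 3D-to-2D Scene Diffusion Cascades for Urban Generation	ScenDi: an urban scene generation method that combines 3D and 2D diffusion models.	3DGS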

🔬 Pillar 2: RL & Architecture (5 papers)

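#	Title	One-sentence summary	Tags	🔗	⭐
16	ReinPath: A Multimodal Reinforcement Learning Approach for Pathology	Proposes ReinPath, a multimodal reinforcement learning method for pathology analysis.	reinforcement learning large language model multimodal
17	Deep Leakage with Generative Flow Matching Denoiser	Proposes a deep leakage attack based on a generative flow matching denoiser, making privacy attacks on federated learning more effective.	flow matching foundation model
18	UBATrack: Spatio-Temporal State Space Model for General Multi-Modal Tracking	UBATrack: a general multi-modal object tracking framework based on a spatio-temporal state space model.	Mamba state space model
19	FlowSSC: Universal Generative Monocular Semantic Scene Completion via One-Step Latent Diffusion	FlowSSC: universal generative monocular semantic scene completion based on one-step latent diffusion.	flow matching spatial relationship
20	M2I2HA: A Multi-modal Object Detection Method Based on Intra- and Inter-Modal Hypergraph Attention	Proposes M2I2HA, which uses hypergraph attention for multi-modal object detection, improving detection accuracy in complex environments.	Mamba SSM state space model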

🔬 Pillar 1: Robot Control (3 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
21	Walk through Paintings: Egocentric World Models from Internet Priors	Proposes EgoWM, which uses priors from internet video to build a controllable egocentric world model.	humanoid manipulation world model
22	DrivIng: A Large-Scale Multimodal Driving Dataset with Full Digital Twin Integration	DrivIng: a large-scale multimodal autonomous driving dataset with full digital twin integration.	sim-to-real multimodal
23	LuxRemix: Lighting Decomposition and Remixing for Indoor Scenes	LuxRemix: an interactive lighting editing method that decomposes and remixes lighting in indoor scenes.	manipulation 3D gaussian splatting gaussian splatting

🔬 Pillar 4: Generative Motion (1 paper)

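#	Title	One-sentence summary	Tags	🔗	⭐
24	Reconstruction-Anchored Diffusion Model for Text-to-Motion Generation	Proposes a reconstruction-anchored diffusion model to address missing information in text-to-motion generation.	motion diffusion model motion diffusion text-to-motion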

🔬 Pillar 8: Physics-based Animation (1 paper)

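#	Title	One-sentence summary	Tags	🔗	⭐
25	Breaking the accuracy-resource dilemma: a lightweight adaptive video inference enhancement	Proposes a fuzzy-control-based adaptive video inference enhancement framework that addresses the accuracy-resource dilemma.	spatiotemporal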
