cs.CV (2025-01-23)

📊 31 papers in total | 🔗 8 with code

🎯 Interest Area Navigation

Pillar 2: RL & Architecture (10 🔗2) · Pillar 9: Embodied Foundation Models (9 🔗2) · Pillar 3: Perception & Semantics (5) · Pillar 1: Robot Control (3 🔗2) · Pillar 8: Physics-based Animation (2 🔗1) · Pillar 7: Motion Retargeting (1 🔗1) · Pillar 4: Generative Motion (1)

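🔬 Pillar 2: RL & Architecture (10 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
1	Multi-aspect Knowledge Distillation with Large Language Model	Proposes a multi-aspect knowledge distillation method based on a multimodal large language model to improve image classification performance.	distillation large language model multimodal
2	QMamba: Post-Training Quantization for Vision State Space Models	QMamba: a post-training quantization framework for vision state space models.	Mamba SSM state space model
3	MultiDreamer3D: Multi-concept 3D Customization with Concept-Aware Diffusion Guidance	MultiDreamer3D: proposes a multi-concept 3D customization method with concept-aware diffusion guidance.	dreamer 3D gaussian splatting gaussian splatting
4	Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step	Proposes a CoT-based image generation method that significantly improves autoregressive image generation quality through verification and reinforcement steps.	DPO direct preference optimization chain-of-thought	✅
5	MV-GMN: State Space Model for Multi-View Action Recognition	Proposes the MV-GMN model to efficiently process multimodal, multi-view, and multi-temporal data in multi-view action recognition.	Mamba state space model
6	Contrast: A Hybrid Architecture of Transformers and State Space Models for Low-Level Vision	Proposes Contrast, a hybrid architecture that fuses Transformers and state space models to improve image super-resolution performance.	Mamba state space model
7	Temporal Preference Optimization for Long-Form Video Understanding	Proposes the Temporal Preference Optimization (TPO) framework to improve the temporal grounding ability of large video models on long videos.	preference learning multimodal	✅
8	Improving Video Generation with Human Feedback	Proposes a human-feedback-based optimization pipeline for video generation, addressing unsmooth motion and alignment problems.	reinforcement learning DPO direct preference optimization
9	Retrievals Can Be Detrimental: A Contrastive Backdoor Attack Paradigm on Retrieval-Augmented Diffusion Models	Proposes BadRDM, a contrastive backdoor attack on retrieval-augmented diffusion models, exposing the security risks introduced by RAG.	contrastive learning multimodal
10	A Cognitive Paradigm Approach to Probe the Perception-Reasoning Interface in VLMs	Proposes a cognitive-paradigm evaluation framework to dissect the perception-reasoning interface in vision-language models.	DRL HOI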

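🔬 Pillar 9: Embodied Foundation Models (9 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
11	Revisiting CLIP: Efficient Alignment of 3D MRI and Tabular Data using Domain-Specific Foundation Models	Proposes an efficient method for aligning MRI and tabular data based on domain-specific 3D foundation models.	foundation model
12	GeoPixel: Pixel Grounding Large Multimodal Model in Remote Sensing	GeoPixel: the first large multimodal model for pixel-level grounding in remote sensing, supporting interactive mask generation.	multimodal
13	EchoVideo: Identity-Preserving Human Video Generation by Multimodal Feature Fusion	EchoVideo: identity-preserving human video generation via multimodal feature fusion.	multimodal
14	MetaWild: A Multimodal Dataset for Animal Re-Identification with Environmental Metadata	MetaWild: proposes a multimodal animal re-identification dataset with environmental metadata, along with a meta-feature adapter.	multimodal
15	ReasVQA: Advancing VideoQA with Imperfect Reasoning Process	ReasVQA: improves video question answering performance by leveraging imperfect reasoning processes.	large language model multimodal
16	Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge	Proposes the StreamChat framework, which uses memory-enhanced knowledge for streaming video understanding and multi-round interaction.	large language model multimodal	✅
17	Eye Gaze as a Signal for Conveying User Attention in Contextual AI Systems	Uses eye tracking as a signal of user attention in contextual AI systems.	multimodal
18	Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos	Proposes Video-MMMU to evaluate the ability of multimodal models to acquire knowledge from professional videos.	multimodal
19	MPG-SAM 2: Adapting SAM 2 with Mask Priors and Global Context for Referring Video Object Segmentation	MPG-SAM 2: adapts SAM 2 with mask priors and global context for referring video object segmentation.	multimodal	✅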

🔬 Pillar 3: Perception & Semantics (5 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
20	PromptMono: Cross Prompting Attention for Self-Supervised Monocular Depth Estimation in Challenging Environments	PromptMono: uses cross prompting attention to improve monocular depth estimation in challenging environments.	depth estimation monocular depth
21	GoDe: Gaussians on Demand for Progressive Level of Detail and Scalable Compression	Proposes GoDe, an on-demand Gaussian approach to progressive level of detail and scalable compression.	3D gaussian splatting 3DGS gaussian splatting
22	GC-ConsFlow: Leveraging Optical Flow Residuals and Global Context for Robust Deepfake Detection	GC-ConsFlow: leverages optical flow residuals and global context to make deepfake detection more robust.	optical flow spatiotemporal
23	Deblur-Avatar: Animatable Avatars from Motion-Blurred Monocular Videos	Deblur-Avatar: reconstructs animatable, high-fidelity 3D human avatars from motion-blurred monocular videos.	3D gaussian splatting 3DGS gaussian splatting
24	Symmetrization Weighted Binary Cross-Entropy: Modeling Perceptual Asymmetry for Human-Consistent Neural Edge Detection	Proposes the SWBCE loss function to address perceptual asymmetry in edge detection.	scene understanding

🔬 Pillar 1: Robot Control (3 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
25	LLM-guided Instance-level Image Manipulation with Diffusion U-Net Cross-Attention Maps	Proposes an LLM-guided instance-level image manipulation method that uses diffusion U-Net cross-attention maps for precise editing.	manipulation open-vocabulary open vocabulary	✅
26	Integrating Persian Lip Reading in Surena-V Humanoid Robot for Human-Robot Interaction	Integrates Persian lip reading into the Surena-V robot to improve human-robot interaction.	humanoid humanoid robot
27	mmEgoHand: Egocentric Hand Pose Estimation and Gesture Recognition with Head-mounted Millimeter-wave Radar and IMU	Proposes mmEgoHand, which uses a head-mounted millimeter-wave radar and IMU for hand pose estimation and gesture recognition.	teleoperation egocentric	✅

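🔬 Pillar 8: Physics-based Animation (2 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
28	EventVL: Understand Event Streams via Multimodal Large Language Model	Proposes EventVL, the first generative multimodal large language model for event cameras, for explicit semantic understanding.	spatiotemporal large language model multimodal
29	Towards Robust Multimodal Open-set Test-time Adaptation via Adaptive Entropy-aware Optimization	Proposes the AEO framework for multimodal open-set test-time adaptation, improving the ability to distinguish unknown-class samples.	AMP multimodal	✅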

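🔬 Pillar 7: Motion Retargeting (1 paper)

#	Title	One-sentence summary	Tags	🔗	⭐
30	ME-CPT: Multi-Task Enhanced Cross-Temporal Point Transformer for Urban 3D Change Detection	Proposes ME-CPT for urban 3D change detection, improving the extraction of semantic change features from multi-temporal point clouds.	spatial relationship spatiotemporal	✅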

🔬 Pillar 4: Generative Motion (1 paper)

#	Title	One-sentence summary	Tags	🔗	⭐
31	Implicit Neural Surface Deformation with Explicit Velocity Fields	Proposes an unsupervised neural implicit surface deformation method based on explicit velocity fields.	physically plausible
