cs.CV (2025-02-18)

📊 29 papers in total | 🔗 10 with code

🎯 Interest Area Navigation

Pillar 2: RL Algorithms & Architecture (9 🔗5) | Pillar 3: Spatial Perception & Semantics (8 🔗2) | Pillar 9: Embodied Foundation Models (6) | Pillar 1: Robot Control (2 🔗1) | Pillar 6: Video Extraction & Matching (2 🔗1) | Pillar 8: Physics-based Animation (2 🔗1)

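🔬 Pillar 2: RL Algorithms & Architecture (9 papers)

#	Title	Key Point	Tags	🔗	⭐
1	Multimodal Mamba: Decoder-only Multimodal State Space Model via Quadratic to Linear Distillation	Proposes mmMamba, which distills multimodal large language models into linear-complexity state space models.	Mamba state space model distillation	✅
2	Re-Align: Aligning Vision Language Models via Retrieval-Augmented Direct Preference Optimization	Proposes Re-Align, a framework that aligns vision-language models through retrieval-augmented direct preference optimization and effectively mitigates cross-modal hallucination.	reinforcement learning RLHF DPO	✅
3	S2C: Learning Noise-Resistant Differences for Unsupervised Change Detection in Multimodal Remote Sensing Images	Proposes S2C, a framework that uses visual foundation models and contrastive learning for unsupervised change detection in multimodal remote sensing images.	contrastive learning foundation model multimodal
4	RealSyn: An Effective and Scalable Multimodal Interleaved Document Transformation Paradigm	RealSyn: an effective and scalable multimodal interleaved document transformation paradigm that improves contrastive vision-language representation learning.	representation learning multimodal zero-shot transfer	✅
5	CAST: Component-Aligned 3D Scene Reconstruction from an RGB Image	CAST: proposes a component-aligned method for reconstructing a 3D scene from a single RGB image.	MAE scene reconstruction penetration
6	RAD: Training an End-to-End Driving Policy via Large-Scale 3DGS-based Reinforcement Learning	RAD: trains an end-to-end autonomous driving policy via large-scale 3DGS-based reinforcement learning.	reinforcement learning imitation learning 3DGS	✅
7	DAMamba: Vision State Space Model with Dynamic Adaptive Scan	Proposes a dynamic adaptive scan to address the limitations of vision state space models.	Mamba SSM state space model	✅
8	RecDreamer: Consistent Text-to-3D Generation via Uniform Score Distillation	RecDreamer addresses the multi-face Janus problem in text-to-3D generation through uniform score distillation.	dreamer distillation
9	Contrast-Unity for Partially-Supervised Temporal Sentence Grounding	Proposes the Contrast-Unity framework for partially-supervised temporal sentence grounding, reducing annotation cost.	contrastive learning TAMP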

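🔬 Pillar 3: Spatial Perception & Semantics (8 papers)

#	Title	Key Point	Tags	🔗	⭐
10	GS-QA: Comprehensive Quality Assessment Benchmark for Gaussian Splatting View Synthesis	GS-QA: a comprehensive quality assessment benchmark for Gaussian splatting view synthesis.	gaussian splatting splatting NeRF
11	SHADeS: Self-supervised Monocular Depth Estimation Through Non-Lambertian Image Decomposition	Proposes SHADeS, which performs self-supervised monocular depth estimation in colonoscopy videos through non-Lambertian image decomposition.	depth estimation monocular depth scene reconstruction	✅
12	ROI-NeRFs: Hi-Fi Visualization of Objects of Interest within a Scene by NeRFs Composition	Proposes ROI-NeRFs, which composes NeRFs for high-fidelity visualization of objects of interest within a scene.	NeRF neural radiance field
13	High-Fidelity Novel View Synthesis via Splatting-Guided Diffusion	SplatDiff: proposes a splatting-guided diffusion model for high-fidelity novel view synthesis.	splatting
14	L4P: Towards Unified Low-Level 4D Vision Perception	Proposes L4P to address low-level 4D vision perception tasks in a unified way.	optical flow
15	PartSDF: Part-Based Implicit Neural Representation for Composite 3D Shape Parametrization and Optimization	Proposes PartSDF for composite 3D shape representation and optimization.	implicit representation	✅
16	Spiking Vision Transformer with Saccadic Attention	Proposes a spiking vision transformer based on a biologically inspired saccadic attention mechanism to address insufficient performance.	scene understanding
17	Understanding and Evaluating Hallucinations in 3D Visual Language Models	Systematically studies hallucinations in 3D vision-language models and proposes evaluation metrics.	scene understanding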

🔬 Pillar 9: Embodied Foundation Models (6 papers)

#	Title	Key Point	Tags	🔗	⭐
18	SafeEraser: Enhancing Safety in Multimodal Large Language Models through Multimodal Machine Unlearning	Proposes the SAFEERASER benchmark and Prompt Decouple Loss to improve the safety of multimodal large language models.	large language model multimodal
19	CutPaste&Find: Efficient Multimodal Hallucination Detector with Visual-aid Knowledge Base	Proposes CutPaste&Find, which uses a visual-aid knowledge base to detect multimodal hallucinations efficiently.	multimodal
20	Zero-shot Emotion Annotation in Facial Images Using Large Multimodal Models: Benchmarking and Prospects for Multi-Class, Multi-Frame Approaches	Uses large multimodal models for zero-shot emotion annotation of facial images, exploring multi-class and multi-frame approaches.	multimodal
21	Corrupted but Not Broken: Understanding and Mitigating the Negative Impacts of Corrupted Data in Visual Instruction Tuning	Proposes a robust training method for corrupted data in visual instruction tuning, improving multimodal large language model performance.	large language model multimodal
22	Understanding and Rectifying Safety Perception Distortion in VLMs	Proposes ShiftDC to correct safety perception distortion in vision-language models.	multimodal
23	Robust Disentangled Counterfactual Learning for Physical Audiovisual Commonsense Reasoning	Proposes RDCL to address missing modalities and insufficient causal reasoning in physical audiovisual commonsense reasoning.	multimodal

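🔬 Pillar 1: Robot Control (2 papers)

#	Title	Key Point	Tags	🔗	⭐
24	Magma: A Foundation Model for Multimodal AI Agents	Magma: a foundation model for multimodal AI agents that advances embodied intelligence.	manipulation foundation model multimodal	✅
25	Predicate Hierarchies Improve Few-Shot State Classification	Proposes PHIER, which uses predicate hierarchies to improve few-shot state classification for robots.	manipulation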

🔬 Pillar 6: Video Extraction & Matching (2 papers)

#	Title	Key Point	Tags	🔗	⭐
26	MotionMatcher: Motion Customization of Text-to-Video Diffusion Models via Motion Feature Matching	MotionMatcher: customizes motion in text-to-video diffusion models via motion feature matching.	feature matching
27	MomentSeeker: A Task-Oriented Benchmark For Long-Video Moment Retrieval	Proposes MomentSeeker, a task-oriented benchmark for long-video moment retrieval that covers a variety of real-world scenarios.	egocentric	✅

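🔬 Pillar 8: Physics-based Animation (2 papers)

#	Title	Key Point	Tags	🔗	⭐
28	Spatiotemporal Multi-Camera Calibration using Freely Moving People	Proposes a spatiotemporal multi-camera calibration method based on freely moving people.	spatiotemporal
29	Enhancing Audio-Visual Spiking Neural Networks through Semantic-Alignment and Cross-Modal Residual Learning	Proposes the S-CMRL framework, which strengthens semantic alignment and cross-modal residual learning in audio-visual spiking neural networks.	spatiotemporal multimodal	✅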
