cs.CV (2025-02-18)

📊 29 papers in total | 🔗 10 with code

🎯 Navigation by Interest Area

Pillar 2: RL & Architecture (9 🔗5) · Pillar 3: Perception & Semantics (8 🔗2) · Pillar 9: Embodied Foundation Models (6) · Pillar 1: Robot Control (2 🔗1) · Pillar 6: Video Extraction (2 🔗1) · Pillar 8: Physics-based Animation (2 🔗1)

🔬 Pillar 2: RL & Architecture (9 papers)

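# | Title | Key point | Tags | 🔗
1 | Multimodal Mamba: Decoder-only Multimodal State Space Model via Quadratic to Linear Distillation | Proposes mmMamba, which converts multimodal large language models into linear-complexity state space models via distillation. | Mamba state space model distillation
2 | Re-Align: Aligning Vision Language Models via Retrieval-Augmented Direct Preference Optimization | Proposes the Re-Align framework, which aligns vision-language models via retrieval-augmented direct preference optimization and effectively mitigates cross-modal hallucination. | reinforcement learning RLHF DPO
3 | S2C: Learning Noise-Resistant Differences for Unsupervised Change Detection in Multimodal Remote Sensing Images | Proposes the S2C framework, which uses visual foundation models and contrastive learning for unsupervised change detection in multimodal remote sensing images. | contrastive learning foundation model multimodal
4 | RealSyn: An Effective and Scalable Multimodal Interleaved Document Transformation Paradigm | RealSyn: an effective and scalable multimodal interleaved document transformation paradigm that improves contrastive vision-language representation learning. | representation learning multimodal zero-shot transfer
5 | CAST: Component-Aligned 3D Scene Reconstruction from an RGB Image | CAST: proposes a component-aligned method for 3D scene reconstruction from a single RGB image. | MAE scene reconstruction penetration
6 | RAD: Training an End-to-End Driving Policy via Large-Scale 3DGS-based Reinforcement Learning | RAD: trains an end-to-end autonomous driving policy via large-scale 3DGS-based reinforcement learning. | reinforcement learning imitation learning 3DGS
7 | DAMamba: Vision State Space Model with Dynamic Adaptive Scan | Proposes dynamic adaptive scan to address the limitations of vision state space models. | Mamba SSM state space model
8 | RecDreamer: Consistent Text-to-3D Generation via Uniform Score Distillation | RecDreamer addresses the multi-face Janus problem in text-to-3D generation via uniform score distillation. | dreamer distillation
9 | Contrast-Unity for Partially-Supervised Temporal Sentence Grounding | Proposes the Contrast-Unity framework for partially supervised temporal sentence grounding, reducing annotation cost. | contrastive learning TAMP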

🔬 Pillar 3: Perception & Semantics (8 papers)

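# | Title | Key point | Tags | 🔗
10 | GS-QA: Comprehensive Quality Assessment Benchmark for Gaussian Splatting View Synthesis | GS-QA: a comprehensive quality assessment benchmark for Gaussian splatting view synthesis. | gaussian splatting splatting NeRF
11 | SHADeS: Self-supervised Monocular Depth Estimation Through Non-Lambertian Image Decomposition | Proposes SHADeS, which performs self-supervised monocular depth estimation in colonoscopy videos via non-Lambertian image decomposition. | depth estimation monocular depth scene reconstruction
12 | ROI-NeRFs: Hi-Fi Visualization of Objects of Interest within a Scene by NeRFs Composition | Proposes ROI-NeRFs, which compose NeRFs for high-fidelity visualization of objects of interest within a scene. | NeRF neural radiance field
13 | High-Fidelity Novel View Synthesis via Splatting-Guided Diffusion | SplatDiff: proposes a splatting-guided diffusion model for high-fidelity novel view synthesis. | splatting
14 | L4P: Towards Unified Low-Level 4D Vision Perception | Proposes L4P to address low-level 4D vision perception tasks in a unified way. | optical flow
15 | PartSDF: Part-Based Implicit Neural Representation for Composite 3D Shape Parametrization and Optimization | Proposes PartSDF for composite 3D shape representation and optimization. | implicit representation
16 | Spiking Vision Transformer with Saccadic Attention | Proposes a spiking vision transformer based on a biological saccadic mechanism to address insufficient performance. | scene understanding
17 | Understanding and Evaluating Hallucinations in 3D Visual Language Models | Systematically studies hallucinations in 3D visual language models and proposes evaluation metrics. | scene understanding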

🔬 Pillar 9: Embodied Foundation Models (6 papers)

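# | Title | Key point | Tags | 🔗
18 | SafeEraser: Enhancing Safety in Multimodal Large Language Models through Multimodal Machine Unlearning | Proposes the SAFEERASER benchmark and Prompt Decouple Loss to improve the safety of multimodal large language models. | large language model multimodal
19 | CutPaste&Find: Efficient Multimodal Hallucination Detector with Visual-aid Knowledge Base | Proposes CutPaste&Find, which uses a visual-aid knowledge base to detect multimodal hallucinations efficiently. | multimodal
20 | Zero-shot Emotion Annotation in Facial Images Using Large Multimodal Models: Benchmarking and Prospects for Multi-Class, Multi-Frame Approaches | Uses large multimodal models for zero-shot emotion annotation of facial images, exploring multi-class and multi-frame approaches. | multimodal
21 | Corrupted but Not Broken: Understanding and Mitigating the Negative Impacts of Corrupted Data in Visual Instruction Tuning | Proposes a robust training method for corrupted data in visual instruction tuning, improving multimodal large language model performance. | large language model multimodal
22 | Understanding and Rectifying Safety Perception Distortion in VLMs | Proposes ShiftDC to rectify safety perception distortion in vision-language models. | multimodal
23 | Robust Disentangled Counterfactual Learning for Physical Audiovisual Commonsense Reasoning | Proposes RDCL to address missing modalities and insufficient causal reasoning in physical audiovisual commonsense reasoning. | multimodal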

🔬 Pillar 1: Robot Control (2 papers)

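# | Title | Key point | Tags | 🔗
24 | Magma: A Foundation Model for Multimodal AI Agents | Magma: a foundation model for multimodal AI agents that advances embodied intelligence. | manipulation foundation model multimodal
25 | Predicate Hierarchies Improve Few-Shot State Classification | Proposes PHIER, which uses predicate hierarchies to improve few-shot state classification for robots. | manipulation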

🔬 Pillar 6: Video Extraction (2 papers)

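# | Title | Key point | Tags | 🔗
26 | MotionMatcher: Motion Customization of Text-to-Video Diffusion Models via Motion Feature Matching | MotionMatcher: motion customization of text-to-video diffusion models via motion feature matching. | feature matching
27 | MomentSeeker: A Task-Oriented Benchmark For Long-Video Moment Retrieval | Proposes MomentSeeker, a task-oriented benchmark for long-video moment retrieval covering a variety of real-world scenarios. | egocentric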

🔬 Pillar 8: Physics-based Animation (2 papers)

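# | Title | Key point | Tags | 🔗
28 | Spatiotemporal Multi-Camera Calibration using Freely Moving People | Proposes a spatiotemporal multi-camera calibration method based on freely moving people. | spatiotemporal
29 | Enhancing Audio-Visual Spiking Neural Networks through Semantic-Alignment and Cross-Modal Residual Learning | Proposes the S-CMRL framework, which strengthens semantic alignment and cross-modal residual learning in audio-visual spiking neural networks. | spatiotemporal multimodal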
