cs.CV (2025-03-20)

📊 49 papers | 🔗 13 with code

🎯 Interest Area Navigation

Pillar 3: Spatial Perception & Semantics (17 🔗 4) · Pillar 9: Embodied Foundation Models (15 🔗 5) · Pillar 2: RL Algorithms & Architecture (13 🔗 4) · Pillar 1: Robot Control (3) · Pillar 5: Interaction & Reaction (1)

🔬 Pillar 3: Spatial Perception & Semantics (17 papers)

# | Title | One-line takeaway | Tags | 🔗
1 | Reconstructing In-the-Wild Open-Vocabulary Human-Object Interactions | Proposes the Open3DHOI dataset and a Gaussian-HOI optimizer for open-vocabulary 3D human-object interaction reconstruction in the wild. | open-vocabulary · open vocabulary · human-object interaction
2 | Cross-Modal and Uncertainty-Aware Agglomeration for Open-Vocabulary 3D Scene Understanding | Proposes CUA-O3D, fusing multi-modal knowledge with uncertainty awareness to improve open-vocabulary 3D scene understanding. | scene understanding · open-vocabulary · open vocabulary
3 | IRef-VLA: A Benchmark for Interactive Referential Grounding with Imperfect Language in 3D Scenes | IRef-VLA: a benchmark dataset for interactive referential grounding in 3D scenes, focusing on imperfect language. | scene understanding · VLA · large language model
4 | BARD-GS: Blur-Aware Reconstruction of Dynamic Scenes via Gaussian Splatting | BARD-GS: a blur-aware dynamic scene reconstruction method based on Gaussian splatting. | 3D gaussian splatting · 3DGS · gaussian splatting
5 | Learning to Efficiently Adapt Foundation Models for Self-Supervised Endoscopic 3D Scene Reconstruction from Any Cameras | Endo3DAC: efficient self-supervised endoscopic 3D reconstruction that adapts pretrained foundation models and jointly optimizes depth, pose, and camera intrinsics. | depth estimation · scene reconstruction · foundation model
6 | 4D Gaussian Splatting SLAM | Proposes 4D Gaussian splatting SLAM for camera localization and radiance-field reconstruction in dynamic scenes. | gaussian splatting · splatting · optical flow
7 | 1000+ FPS 4D Gaussian Splatting for Dynamic Scene Rendering | Proposes 4DGS-1K, raising Gaussian splatting rendering speed for dynamic scenes to 1000+ FPS. | gaussian splatting · splatting
8 | Enhancing Close-up Novel View Synthesis via Pseudo-labeling | Proposes a pseudo-labeling strategy that improves novel view synthesis quality at close-up viewpoints. | 3D gaussian splatting · 3DGS · gaussian splatting
9 | QuartDepth: Post-Training Quantization for Real-Time Depth Estimation on the Edge | Proposes QuartDepth to enable deployment of depth estimation models on edge devices. | depth estimation · monocular depth
10 | Gaussian Graph Network: Learning Efficient and Generalizable Gaussian Representations from Multi-view Images | Proposes a Gaussian Graph Network that learns efficient, generalizable Gaussian representations from multi-view images, improving novel view synthesis. | 3D gaussian splatting · 3DGS · gaussian splatting
11 | Jasmine: Harnessing Diffusion Prior for Self-supervised Depth Estimation | Jasmine: a self-supervised depth estimation framework leveraging diffusion priors to improve the sharpness and generalization of monocular depth estimation. | depth estimation · monocular depth
12 | Automating 3D Dataset Generation with Neural Radiance Fields | Proposes a NeRF-based pipeline for automatic 3D dataset generation, addressing the scarcity of training data for 3D detection models. | neural radiance field
13 | Digitally Prototype Your Eye Tracker: Simulating Hardware Performance using 3D Synthetic Data | Proposes a 3D-synthetic-data method for evaluating eye-tracker hardware performance, accelerating hardware prototyping. | NeRF · neural radiance field
14 | DreamTexture: Shape from Virtual Texture with Analysis by Augmentation | DreamTexture: monocular-image 3D reconstruction via virtual texture and analysis by augmentation. | monocular depth
15 | Dynamic Point Maps: A Versatile Representation for Dynamic 3D Reconstruction | Proposes Dynamic Point Maps (DPM) for motion segmentation, scene flow estimation, and object tracking in dynamic 3D reconstruction. | scene flow
16 | EDEN: Enhanced Diffusion for High-quality Large-motion Video Frame Interpolation | EDEN: an enhanced diffusion model addressing generation quality and temporal consistency in large-motion video frame interpolation. | optical flow
17 | OffsetOPT: Explicit Surface Reconstruction without Normals | OffsetOPT: an explicit surface reconstruction method requiring no normals, with improved preservation of sharp features. | implicit representation

🔬 Pillar 9: Embodied Foundation Models (15 papers)

# | Title | One-line takeaway | Tags | 🔗
18 | Plug-and-Play 1.x-Bit KV Cache Quantization for Video Large Language Models | VidKV: a plug-and-play 1.x-bit KV cache quantization method for video large language models. | large language model
19 | MapGlue: Multimodal Remote Sensing Image Matching | Proposes the MapGlue framework and MapData dataset to tackle multimodal remote sensing image matching. | multimodal
20 | Disentangled and Interpretable Multimodal Attention Fusion for Cancer Survival Prediction | Proposes DIMAF, which disentangles multimodal attention fusion to improve cancer survival prediction performance and interpretability. | multimodal
21 | Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models | Proposes HICom, a video token compression method with hybrid-level instruction injection, improving video understanding in multimodal LLMs. | large language model
22 | A Survey on fMRI-based Brain Decoding for Reconstructing Multimodal Stimuli | Survey: fMRI-based brain decoding techniques for reconstructing multimodal stimuli. | multimodal
23 | UniCrossAdapter: Multimodal Adaptation of CLIP for Radiology Report Generation | Proposes UniCrossAdapter for multimodal transfer of CLIP to radiology report generation. | multimodal
24 | Chain of Functions: A Programmatic Pipeline for Fine-Grained Chart Reasoning Data | Proposes the Chain of Functions (CoF) framework for generating high-quality, diverse chart-reasoning data. | large language model · multimodal
25 | When Less is Enough: Adaptive Token Reduction for Efficient Image Representation | Proposes an adaptive token reduction method based on an autoencoder and Gumbel-Softmax, improving image representation efficiency. | multimodal
26 | GAEA: A Geolocation Aware Conversational Assistant | Proposes GAEA: a geolocation-aware conversational assistant for image geolocalization. | multimodal
27 | OSLoPrompt: Bridging Low-Supervision Challenges and Open-Set Domain Generalization in CLIP | OSLoPrompt: bridging low-supervision challenges and open-set domain generalization in CLIP. | foundation model
28 | UniHDSA: A Unified Relation Prediction Approach for Hierarchical Document Structure Analysis | UniHDSA: a unified relation prediction approach for hierarchical document structure analysis. | multimodal
29 | Enhancing Zero-Shot Image Recognition in Vision-Language Models through Human-like Concept Guidance | Proposes a concept-guided, human-like Bayesian reasoning framework that improves zero-shot image recognition in vision-language models. | large language model
30 | MASH-VLM: Mitigating Action-Scene Hallucination in Video-LLMs through Disentangled Spatial-Temporal Representations | MASH-VLM mitigates action-scene hallucination in video LLMs via disentangled spatial-temporal representations. | large language model
31 | What can Off-the-Shelves Large Multi-Modal Models do for Dynamic Scene Graph Generation? | Applies off-the-shelf large multimodal models to dynamic scene graph generation, achieving significant performance gains. | multimodal
32 | GraPLUS: Graph-based Placement Using Semantics for Image Composition | GraPLUS: a graph-neural-network method using semantics for object placement in image composition. | large language model

🔬 Pillar 2: RL Algorithms & Architecture (13 papers)

# | Title | One-line takeaway | Tags | 🔗
33 | GAIR: Improving Multimodal Geo-Foundation Model with Geo-Aligned Implicit Representations | GAIR: improving a multimodal geo-foundation model with geo-aligned implicit representations. | contrastive learning · implicit representation · spatial relationship
34 | VideoRFSplat: Direct Scene-Level Text-to-3D Gaussian Splatting Generation with Flexible Pose and Multi-View Joint Modeling | VideoRFSplat: leverages a video generation model for direct scene-level text-to-3D Gaussian splatting with flexible pose and multi-view joint modeling. | distillation · 3D gaussian splatting · 3DGS
35 | RL4Med-DDPO: Reinforcement Learning for Controlled Guidance Towards Diverse Medical Image Generation using Vision-Language Foundation Models | Proposes RL4Med-DDPO, using reinforcement learning to guide vision-language foundation models toward high-quality, controllable medical image generation that improves diagnostic performance. | reinforcement learning · foundation model
36 | DynamicVis: An Efficient and General Visual Foundation Model for Remote Sensing Image Understanding | DynamicVis: an efficient, general-purpose visual foundation model for remote sensing image understanding. | state space model · foundation model
37 | JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse | Proposes JARVIS-VLA, post-training vision-language models to improve VLA decision-making in Minecraft. | imitation learning · VLA
38 | A Vision Centric Remote Sensing Benchmark | Proposes a remote sensing multimodal visual patterns benchmark (RSMMVP) to evaluate and improve MLLM performance on remote sensing. | representation learning · large language model · multimodal
39 | SaMam: Style-aware State Space Model for Arbitrary Image Style Transfer | Proposes SaMam: a style-aware state space model for arbitrary image style transfer. | Mamba · SSM · state space model
40 | EDiT: Efficient Diffusion Transformers with Linear Compressed Attention | Proposes EDiT: an efficient diffusion Transformer with linear compressed attention, accelerating high-resolution image generation. | linear attention · distillation · multimodal
41 | iFlame: Interleaving Full and Linear Attention for Efficient Mesh Generation | iFlame: an efficient mesh generation method that interleaves full and linear attention. | linear attention
42 | Accurate Scene Text Recognition with Efficient Model Scaling and Cloze Self-Distillation | Proposes efficient model scaling and cloze self-distillation to improve scene text recognition accuracy. | distillation
43 | Binarized Mamba-Transformer for Lightweight Quad Bayer HybridEVS Demosaicing | Proposes BMTNet, a lightweight binarized Mamba-Transformer network for demosaicing Quad Bayer hybrid event-based vision sensor (HybridEVS) images. | Mamba
44 | Acc3D: Accelerating Single Image to 3D Diffusion Models via Edge Consistency Guided Score Distillation | Acc3D: accelerates single-image-to-3D diffusion models via edge-consistency-guided score distillation. | distillation
45 | Learning 3D Scene Analogies with Neural Contextual Scene Maps | Proposes neural contextual scene maps for learning 3D scene analogies and transferring knowledge between scenes. | imitation learning · spatial relationship

🔬 Pillar 1: Robot Control (3 papers)

# | Title | One-line takeaway | Tags | 🔗
46 | M3: 3D-Spatial MultiModal Memory | M3: a 3D-spatial multimodal memory system for retaining information about static scenes in visual perception. | quadruped · distillation · 3D gaussian splatting
47 | TruthLens: Visual Grounding for Universal DeepFake Reasoning | TruthLens: a visual grounding framework for universal DeepFake reasoning. | manipulation · scene understanding · large language model
48 | Physically Grounded Monocular Depth via Nanophotonic Wavefront Prompting | Uses nanophotonic wavefront prompting to achieve physically grounded monocular depth estimation. | sim-to-real · monocular depth · metric depth

🔬 Pillar 5: Interaction & Reaction (1 paper)

# | Title | One-line takeaway | Tags | 🔗
49 | SceneMI: Motion In-betweening for Modeling Human-Scene Interactions | SceneMI: a scene-aware motion in-betweening framework for modeling human-scene interactions. | human-scene interaction · HSI · scene-aware motion
