cs.CV (2025-03-20)

📊 49 papers | 🔗 13 with code

🎯 Interest Area Navigation

Pillar 3: Spatial Perception & Semantics (17 🔗 4) · Pillar 9: Embodied Foundation Models (15 🔗 5) · Pillar 2: RL Algorithms & Architecture (13 🔗 4) · Pillar 1: Robot Control (3) · Pillar 5: Interaction & Reaction (1)

🔬 Pillar 3: Spatial Perception & Semantics (17 papers)

# | Title | One-line takeaway | Tags | 🔗
1 | Reconstructing In-the-Wild Open-Vocabulary Human-Object Interactions | Proposes the Open3DHOI dataset and a Gaussian-HOI optimizer for open-vocabulary 3D human-object interaction reconstruction in the wild. | open-vocabulary · open vocabulary · human-object interaction
2 | Cross-Modal and Uncertainty-Aware Agglomeration for Open-Vocabulary 3D Scene Understanding | Proposes CUA-O3D, fusing multi-modal knowledge with uncertainty awareness to improve open-vocabulary 3D scene understanding. | scene understanding · open-vocabulary · open vocabulary
3 | IRef-VLA: A Benchmark for Interactive Referential Grounding with Imperfect Language in 3D Scenes | IRef-VLA: a benchmark dataset for interactive referential grounding in 3D scenes, focusing on imperfect language. | scene understanding · VLA · large language model
4 | BARD-GS: Blur-Aware Reconstruction of Dynamic Scenes via Gaussian Splatting | BARD-GS: a blur-aware dynamic scene reconstruction method based on Gaussian splatting. | 3D gaussian splatting · 3DGS · gaussian splatting
5 | Learning to Efficiently Adapt Foundation Models for Self-Supervised Endoscopic 3D Scene Reconstruction from Any Cameras | Endo3DAC: efficient self-supervised endoscopic 3D reconstruction that adapts pretrained foundation models and jointly optimizes depth, pose, and camera intrinsics. | depth estimation · scene reconstruction · foundation model
6 | 4D Gaussian Splatting SLAM | Proposes 4D Gaussian splatting SLAM for camera localization and radiance-field reconstruction in dynamic scenes. | gaussian splatting · splatting · optical flow
7 | 1000+ FPS 4D Gaussian Splatting for Dynamic Scene Rendering | Proposes 4DGS-1K, raising Gaussian splatting rendering speed for dynamic scenes to 1000+ FPS. | gaussian splatting · splatting
8 | Enhancing Close-up Novel View Synthesis via Pseudo-labeling | Proposes a pseudo-labeling strategy that improves novel view synthesis quality at close-up viewpoints. | 3D gaussian splatting · 3DGS · gaussian splatting
9 | QuartDepth: Post-Training Quantization for Real-Time Depth Estimation on the Edge | Proposes QuartDepth to enable deployment of depth estimation models on edge devices. | depth estimation · monocular depth
10 | Gaussian Graph Network: Learning Efficient and Generalizable Gaussian Representations from Multi-view Images | Proposes a Gaussian Graph Network that learns efficient, generalizable Gaussian representations from multi-view images, improving novel view synthesis. | 3D gaussian splatting · 3DGS · gaussian splatting
11 | Jasmine: Harnessing Diffusion Prior for Self-supervised Depth Estimation | Jasmine: a self-supervised depth estimation framework leveraging diffusion priors to improve the sharpness and generalization of monocular depth estimation. | depth estimation · monocular depth
12 | Automating 3D Dataset Generation with Neural Radiance Fields | Proposes a NeRF-based pipeline for automatic 3D dataset generation, addressing the scarcity of training data for 3D detection models. | neural radiance field
13 | Digitally Prototype Your Eye Tracker: Simulating Hardware Performance using 3D Synthetic Data | Proposes a 3D-synthetic-data method for evaluating eye-tracker hardware performance, accelerating hardware prototyping. | NeRF · neural radiance field
14 | DreamTexture: Shape from Virtual Texture with Analysis by Augmentation | DreamTexture: monocular-image 3D reconstruction via virtual texture and analysis by augmentation. | monocular depth
15 | Dynamic Point Maps: A Versatile Representation for Dynamic 3D Reconstruction | Proposes Dynamic Point Maps (DPM) for motion segmentation, scene flow estimation, and object tracking in dynamic 3D reconstruction. | scene flow
16 | EDEN: Enhanced Diffusion for High-quality Large-motion Video Frame Interpolation | EDEN: an enhanced diffusion model addressing generation quality and temporal consistency in large-motion video frame interpolation. | optical flow
17 | OffsetOPT: Explicit Surface Reconstruction without Normals | OffsetOPT: an explicit surface reconstruction method requiring no normals, with improved preservation of sharp features. | implicit representation

🔬 Pillar 9: Embodied Foundation Models (15 papers)

# | Title | One-line takeaway | Tags | 🔗
18 | Plug-and-Play 1.x-Bit KV Cache Quantization for Video Large Language Models | VidKV: a plug-and-play 1.x-bit KV cache quantization method for video large language models. | large language model
19 | MapGlue: Multimodal Remote Sensing Image Matching | Proposes the MapGlue framework and MapData dataset to tackle multimodal remote sensing image matching. | multimodal
20 | Disentangled and Interpretable Multimodal Attention Fusion for Cancer Survival Prediction | Proposes DIMAF, which disentangles multimodal attention fusion to improve cancer survival prediction performance and interpretability. | multimodal
21 | Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models | Proposes HICom, a video token compression method with hybrid-level instruction injection, improving video understanding in multimodal LLMs. | large language model
22 | A Survey on fMRI-based Brain Decoding for Reconstructing Multimodal Stimuli | Survey: fMRI-based brain decoding techniques for reconstructing multimodal stimuli. | multimodal
23 | UniCrossAdapter: Multimodal Adaptation of CLIP for Radiology Report Generation | Proposes UniCrossAdapter for multimodal transfer of CLIP to radiology report generation. | multimodal
24 | Chain of Functions: A Programmatic Pipeline for Fine-Grained Chart Reasoning Data | Proposes the Chain of Functions (CoF) framework for generating high-quality, diverse chart-reasoning data. | large language model · multimodal
25 | When Less is Enough: Adaptive Token Reduction for Efficient Image Representation | Proposes an adaptive token reduction method based on an autoencoder and Gumbel-Softmax, improving image representation efficiency. | multimodal
26 | GAEA: A Geolocation Aware Conversational Assistant | Proposes GAEA: a geolocation-aware conversational assistant for image geolocalization. | multimodal
27 | OSLoPrompt: Bridging Low-Supervision Challenges and Open-Set Domain Generalization in CLIP | OSLoPrompt: bridging low-supervision challenges and open-set domain generalization in CLIP. | foundation model
28 | UniHDSA: A Unified Relation Prediction Approach for Hierarchical Document Structure Analysis | UniHDSA: a unified relation prediction approach for hierarchical document structure analysis. | multimodal
29 | Enhancing Zero-Shot Image Recognition in Vision-Language Models through Human-like Concept Guidance | Proposes a concept-guided, human-like Bayesian reasoning framework that improves zero-shot image recognition in vision-language models. | large language model
30 | MASH-VLM: Mitigating Action-Scene Hallucination in Video-LLMs through Disentangled Spatial-Temporal Representations | MASH-VLM mitigates action-scene hallucination in video LLMs via disentangled spatial-temporal representations. | large language model
31 | What can Off-the-Shelves Large Multi-Modal Models do for Dynamic Scene Graph Generation? | Applies off-the-shelf large multimodal models to dynamic scene graph generation, achieving significant performance gains. | multimodal
32 | GraPLUS: Graph-based Placement Using Semantics for Image Composition | GraPLUS: a graph-neural-network method using semantics for object placement in image composition. | large language model

🔬 Pillar 2: RL Algorithms & Architecture (13 papers)

# | Title | One-line takeaway | Tags | 🔗
33 | GAIR: Improving Multimodal Geo-Foundation Model with Geo-Aligned Implicit Representations | GAIR: improving a multimodal geo-foundation model with geo-aligned implicit representations. | contrastive learning · implicit representation · spatial relationship
34 | VideoRFSplat: Direct Scene-Level Text-to-3D Gaussian Splatting Generation with Flexible Pose and Multi-View Joint Modeling | VideoRFSplat: leverages a video generation model for direct scene-level text-to-3D Gaussian splatting with flexible pose and multi-view joint modeling. | distillation · 3D gaussian splatting · 3DGS
35 | RL4Med-DDPO: Reinforcement Learning for Controlled Guidance Towards Diverse Medical Image Generation using Vision-Language Foundation Models | Proposes RL4Med-DDPO, using reinforcement learning to guide vision-language foundation models toward high-quality, controllable medical image generation that improves diagnostic performance. | reinforcement learning · foundation model
36 | DynamicVis: An Efficient and General Visual Foundation Model for Remote Sensing Image Understanding | DynamicVis: an efficient, general-purpose visual foundation model for remote sensing image understanding. | state space model · foundation model
37 | JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse | Proposes JARVIS-VLA, post-training vision-language models to improve VLA decision-making in Minecraft. | imitation learning · VLA
38 | A Vision Centric Remote Sensing Benchmark | Proposes a remote sensing multimodal visual patterns benchmark (RSMMVP) to evaluate and improve MLLM performance on remote sensing. | representation learning · large language model · multimodal
39 | SaMam: Style-aware State Space Model for Arbitrary Image Style Transfer | Proposes SaMam: a style-aware state space model for arbitrary image style transfer. | Mamba · SSM · state space model
40 | EDiT: Efficient Diffusion Transformers with Linear Compressed Attention | Proposes EDiT: an efficient diffusion Transformer with linear compressed attention, accelerating high-resolution image generation. | linear attention · distillation · multimodal
41 | iFlame: Interleaving Full and Linear Attention for Efficient Mesh Generation | iFlame: an efficient mesh generation method that interleaves full and linear attention. | linear attention
42 | Accurate Scene Text Recognition with Efficient Model Scaling and Cloze Self-Distillation | Proposes efficient model scaling and cloze self-distillation to improve scene text recognition accuracy. | distillation
43 | Binarized Mamba-Transformer for Lightweight Quad Bayer HybridEVS Demosaicing | Proposes BMTNet, a lightweight binarized Mamba-Transformer network for demosaicing Quad Bayer hybrid event-based vision sensor (HybridEVS) images. | Mamba
44 | Acc3D: Accelerating Single Image to 3D Diffusion Models via Edge Consistency Guided Score Distillation | Acc3D: accelerates single-image-to-3D diffusion models via edge-consistency-guided score distillation. | distillation
45 | Learning 3D Scene Analogies with Neural Contextual Scene Maps | Proposes neural contextual scene maps for learning 3D scene analogies and transferring knowledge between scenes. | imitation learning · spatial relationship

🔬 Pillar 1: Robot Control (3 papers)

# | Title | One-line takeaway | Tags | 🔗
46 | M3: 3D-Spatial MultiModal Memory | M3: a 3D-spatial multimodal memory system for retaining information about static scenes in visual perception. | quadruped · distillation · 3D gaussian splatting
47 | TruthLens: Visual Grounding for Universal DeepFake Reasoning | TruthLens: a visual grounding framework for universal DeepFake reasoning. | manipulation · scene understanding · large language model
48 | Physically Grounded Monocular Depth via Nanophotonic Wavefront Prompting | Uses nanophotonic wavefront prompting to achieve physically grounded monocular depth estimation. | sim-to-real · monocular depth · metric depth

🔬 Pillar 5: Interaction & Reaction (1 paper)

# | Title | One-line takeaway | Tags | 🔗
49 | SceneMI: Motion In-betweening for Modeling Human-Scene Interactions | SceneMI: a scene-aware motion in-betweening framework for modeling human-scene interactions. | human-scene interaction · HSI · scene-aware motion
