cs.CV (2025-03-19)

📊 58 papers in total | 🔗 23 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (20 🔗8) | Pillar 2: RL & Architecture (15 🔗9) | Pillar 3: Perception & Semantics (14 🔗3) | Pillar 6: Video Extraction (2) | Pillar 4: Generative Motion (2 🔗1) | Pillar 1: Robot Control (2 🔗1) | Pillar 8: Physics-based Animation (2 🔗1) | Pillar 7: Motion Retargeting (1)

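🔬 Pillar 9: Embodied Foundation Models (20 papers)

#	Title	One-sentence summary	Tags	🔗
1	VisNumBench: Evaluating Number Sense of Multimodal Large Language Models	Proposes VisNumBench, a benchmark for evaluating the number sense of multimodal large language models (MLLMs).	large language model multimodal chain-of-thought	✅
2	UPME: An Unsupervised Peer Review Framework for Multimodal Large Language Model Evaluation	Proposes UPME, an unsupervised evaluation framework for multimodal large language models that reduces reliance on human annotation.	large language model multimodal
3	Visual Position Prompt for MLLM based Visual Grounding	VPP-LLaVA: strengthens the visual grounding ability of MLLMs through visual position prompts.	large language model multimodal visual grounding	✅
4	Benchmarking Large Language Models for Handwritten Text Recognition	Evaluates large language models on handwritten text recognition and explores their zero-shot transfer ability.	large language model multimodal
5	EarthScape: A Multimodal Dataset for Surficial Geologic Mapping and Earth Surface Analysis	EarthScape: a multimodal dataset for surficial geologic mapping and Earth surface analysis.	multimodal	✅
6	LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning	LLaVA-MORE: a comparative study of LLMs and visual backbones in multimodal large language models, aimed at improving visual instruction tuning.	large language model multimodal instruction following	✅
7	Visual Persona: Foundation Model for Full-Body Human Customization	Visual Persona: a foundation model for full-body human customization.	foundation model
8	EdgeRegNet: Edge Feature-based Multimodal Registration Network between Images and LiDAR Point Clouds	EdgeRegNet: an edge-feature-based multimodal registration network between images and LiDAR point clouds.	multimodal
9	Generating Multimodal Driving Scenes via Next-Scene Prediction	Proposes UMGen, which generates multimodal autonomous-driving scenes via next-scene prediction and supports the map modality.	multimodal	✅
10	Spot the Fake: Large Multimodal Model-Based Synthetic Image Detection with Artifact Explanation	Proposes FakeVLM, a large multimodal model for synthetic image detection with artifact explanation.	multimodal	✅
11	Cube: A Roblox View of 3D Intelligence	Proposes Cube, a foundation model for 3D intelligence from Roblox's perspective, targeting 3D content generation and understanding.	large language model foundation model
12	EfficientLLaVA:Generalizable Auto-Pruning for Large Vision-language Models	EfficientLLaVA: a generalizable auto-pruning method for large vision-language models.	large language model multimodal
13	TruthLens:A Training-Free Paradigm for DeepFake Detection	Proposes TruthLens, a training-free deepfake detection framework with improved interpretability.	large language model multimodal
14	MathFlow: Enhancing the Perceptual Flow of MLLMs for Visual Mathematical Problems	MathFlow: improves the perception ability of MLLMs on visual mathematical problems.	large language model multimodal	✅
15	FAVOR-Bench: A Comprehensive Benchmark for Fine-Grained Video Motion Understanding	FAVOR-Bench: a comprehensive benchmark for fine-grained video motion understanding.	large language model multimodal
16	Mitigating Object Hallucinations in MLLMs via Multi-Frequency Perturbations	Proposes Multi-Frequency Perturbations (MFP) to mitigate object hallucination in multimodal large language models.	large language model multimodal
17	Unlocking the Capabilities of Large Vision-Language Models for Generalizable and Explainable Deepfake Detection	Proposes a knowledge-guided forgery detection framework that improves the generalization and explainability of large vision-language models in deepfake detection.	large language model multimodal
18	Vision-Speech Models: Teaching Speech Models to Converse about Images	Proposes MoshiVis, which gives a speech model visual understanding so that it can hold spoken conversations about images.	multimodal
19	Forensics-Bench: A Comprehensive Forgery Detection Benchmark Suite for Large Vision Language Models	Proposes Forensics-Bench for comprehensively evaluating large vision-language models on forgery detection.	multimodal	✅
20	Universal Scene Graph Generation	Proposes the Universal Scene Graph (USG) representation and a parser for it, enabling comprehensive understanding of multimodal scene semantics.	multimodal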

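🔬 Pillar 2: RL & Architecture (15 papers)

#	Title	One-sentence summary	Tags	🔗
21	EgoDTM: Towards 3D-Aware Egocentric Video-Language Pretraining	EgoDTM: improves video representation learning through 3D-aware egocentric video-language pretraining.	representation learning contrastive learning depth estimation	✅
22	Machine Unlearning in Hyperbolic vs. Euclidean Multimodal Contrastive Learning: Adapting Alignment Calibration to MERU	Adapts Alignment Calibration to the hyperbolic-space MERU model to achieve concept unlearning in multimodal contrastive learning.	contrastive learning multimodal	✅
23	Toward Scalable, Flexible Scene Flow for Point Clouds	Builds a scalable, flexible scene flow estimator for point clouds with improved generalization and performance.	distillation scene flow
24	Distilling 3D distinctive local descriptors for 6D pose estimation	Proposes knowledge-distilled 3D local descriptors to speed up 6D pose estimation.	distillation 6D pose estimation	✅
25	Decompositional Neural Scene Reconstruction with Generative Diffusion Prior	DP-Recon: uses a generative diffusion prior for decompositional neural scene reconstruction, addressing occlusion under sparse views.	distillation scene reconstruction	✅
26	Object-Centric Pretraining via Target Encoder Bootstrapping	Proposes OCEBO, which pretrains object-centric representations by bootstrapping a target encoder, without relying on non-object-centric pretrained models.	representation learning distillation foundation model	✅
27	When the Future Becomes the Past: Taming Temporal Correspondence for Self-supervised Video Representation Learning	Proposes T-CoRe, which exploits temporal correspondence for self-supervised video representation learning.	representation learning distillation	✅
28	xMOD: Cross-Modal Distillation for 2D/3D Multi-Object Discovery from 2D motion	Proposes xMOD, which distills 2D motion information for unsupervised 2D/3D multi-object discovery.	teacher-student distillation	✅
29	Tables Guide Vision: Learning to See the Heart through Tabular Data	Proposes a tabular-data-guided contrastive learning framework that improves representation learning for cardiovascular imaging.	representation learning contrastive learning multimodal
30	Taming Flow Matching with Unbalanced Optimal Transport into Fast Pansharpening	Proposes a flow matching framework based on unbalanced optimal transport for fast, high-quality pansharpening of remote sensing imagery.	flow matching distillation	✅
31	Text-Derived Relational Graph-Enhanced Network for Skeleton-Based Action Segmentation	Proposes TRG-Net, which uses text-derived relational graphs to enhance skeleton-based action segmentation for more accurate action understanding.	contrastive learning large language model
32	Single-Step Bidirectional Unpaired Image Translation Using Implicit Bridge Consistency Distillation	Proposes Implicit Bridge Consistency Distillation (IBCD) for single-step bidirectional unpaired image translation.	distillation	✅
33	Aligning Information Capacity Between Vision and Language via Dense-to-Sparse Feature Distillation for Image-Text Matching	Proposes D2S-VSE, which aligns information capacity for image-text matching through dense-to-sparse feature distillation.	distillation
34	When Domain Generalization meets Generalized Category Discovery: An Adaptive Task-Arithmetic Driven Approach	Proposes DG2CD-Net, an adaptive task-arithmetic-driven domain generalization approach to generalized category discovery.	representation learning foundation model
35	TULIP: Towards Unified Language-Image Pretraining	TULIP: toward unified language-image pretraining, improving visual understanding and cross-modal performance.	contrastive learning depth estimation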

🔬 Pillar 3: Perception & Semantics (14 papers)

#	Title	One-sentence summary	Tags	🔗
36	SPNeRF: Open Vocabulary 3D Neural Scene Segmentation with Superpoints	SPNeRF: open-vocabulary 3D neural scene segmentation using superpoints.	NeRF open-vocabulary open vocabulary
37	Recover and Match: Open-Vocabulary Multi-Label Recognition through Knowledge-Constrained Optimal Transport	Proposes the RAM framework, which performs open-vocabulary multi-label recognition via knowledge-constrained optimal transport.	open-vocabulary open vocabulary	✅
38	DiST-4D: Disentangled Spatiotemporal Diffusion with Metric Depth for 4D Driving Scene Generation	Proposes DiST-4D, a disentangled spatiotemporal diffusion method that generates 4D driving scenes with metric depth.	metric depth spatiotemporal
39	Fine-Grained Open-Vocabulary Object Detection with Fined-Grained Prompts: Task, Dataset and Benchmark	Proposes the 3F-OVD task to address unfair evaluation in open-vocabulary object detection.	open-vocabulary open vocabulary
40	SemanticFlow: A Self-Supervised Framework for Joint Scene Flow Prediction and Instance Segmentation in Dynamic Environments	SemanticFlow: a self-supervised framework for joint scene flow prediction and instance segmentation in dynamic scenes.	scene understanding scene flow
41	GO-N3RDet: Geometry Optimized NeRF-enhanced 3D Object Detector	GO-N3RDet: a geometry-optimized, NeRF-enhanced multi-view 3D object detector.	NeRF neural radiance field	✅
42	MultiBARF: Integrating Imagery of Different Wavelength Regions by Using Neural Radiance Fields	MultiBARF: integrates imagery from different wavelength regions using neural radiance fields, simplifying multi-sensor fusion.	NeRF neural radiance field
43	USAM-Net: A U-Net-based Network for Improved Stereo Correspondence and Scene Depth Estimation using Features from a Pre-trained Image Segmentation network	USAM-Net: a U-Net stereo matching and depth estimation network that fuses features from a pre-trained segmentation network.	depth estimation stereo depth
44	Learning 4D Panoptic Scene Graph Generation from Rich 2D Visual Scene	Proposes a 4D panoptic scene graph generation framework based on knowledge transfer from 2D visual scenes, addressing data scarcity.	open-vocabulary open vocabulary large language model
45	DPFlow: Adaptive Optical Flow Estimation with a Dual-Pyramid Framework	Proposes DPFlow, a dual-pyramid adaptive optical flow estimation framework for high-resolution video.	optical flow
46	DiffPortrait360: Consistent Portrait Diffusion for 360 View Synthesis	DiffPortrait360: a consistent portrait diffusion model for 360-degree view synthesis.	NeRF neural radiance field
47	3D Engine-ready Photorealistic Avatars via Dynamic Textures	Proposes a dynamic-texture-based method for generating photorealistic avatars ready for use in 3D engines.	NeRF implicit representation
48	The Change You Want To Detect: Semantic Change Detection In Earth Observation With Hybrid Data Generation	Proposes HySCDG, a hybrid data generation pipeline that improves semantic change detection in remote sensing imagery.	semantic map	✅
49	Temporal-Consistent Video Restoration with Pre-trained Diffusion Models	Proposes a temporally consistent video restoration framework based on pre-trained diffusion models, improving visual quality and temporal stability.	optical flow

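🔬 Pillar 6: Video Extraction (2 papers)

#	Title	One-sentence summary	Tags	🔗
50	Challenges and Trends in Egocentric Vision: A Survey	A survey analyzing the challenges and trends in egocentric vision understanding, offering a reference for AR/VR and related fields.	egocentric egocentric vision multimodal
51	CHROME: Clothed Human Reconstruction with Occlusion-Resilience and Multiview-Consistency from a Single Image	CHROME: occlusion-resilient, multiview-consistent clothed human reconstruction from a single image.	SMPL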

🔬 Pillar 4: Generative Motion (2 papers)

#	Title	One-sentence summary	Tags	🔗
52	GenM$^3$: Generative Pretrained Multi-path Motion Model for Text Conditional Human Motion Generation	GenM³: a generative pretrained multi-path motion model for text-conditional human motion generation.	motion generation VQ-VAE large language model
53	MotionStreamer: Streaming Motion Generation via Diffusion-based Autoregressive Model in Causal Latent Space	MotionStreamer: a diffusion-based autoregressive model for streaming motion generation in a causal latent space.	motion generation motion latent	✅

🔬 Pillar 1: Robot Control (2 papers)

#	Title	One-sentence summary	Tags	🔗
54	DeepMesh: Auto-Regressive Artist-mesh Creation with Reinforcement Learning	DeepMesh: an auto-regressive method for artist-style mesh generation based on reinforcement learning.	manipulation reinforcement learning DPO	✅
55	LEGION: Learning to Ground and Explain for Synthetic Image Detection	Proposes the LEGION framework for synthetic image detection, with forgery localization and explanation.	manipulation large language model multimodal

🔬 Pillar 8: Physics-based Animation (2 papers)

#	Title	One-sentence summary	Tags	🔗
56	GASP: Unifying Geometric and Semantic Self-Supervised Pre-training for Autonomous Driving	GASP: a unified framework for geometric and semantic self-supervised pre-training for autonomous driving.	spatiotemporal large language model foundation model
57	Reducing Annotation Burden: Exploiting Image Knowledge for Few-Shot Medical Video Object Segmentation via Spatiotemporal Consistency Relearning	Proposes spatiotemporal consistency relearning, which uses image knowledge for few-shot medical video object segmentation.	spatiotemporal	✅

🔬 Pillar 7: Motion Retargeting (1 paper)

#	Title	One-sentence summary	Tags	🔗
58	Deep Polycuboid Fitting for Compact 3D Representation of Indoor Scenes	Proposes a deep-learning-based polycuboid fitting framework for compact 3D representation of indoor scenes.	spatial relationship
