cs.CV (2025-03-19)

📊 58 papers total | 🔗 23 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (20 🔗8) · Pillar 2: RL & Architecture (15 🔗9) · Pillar 3: Perception & Semantics (14 🔗3) · Pillar 6: Video Extraction (2) · Pillar 4: Generative Motion (2 🔗1) · Pillar 1: Robot Control (2 🔗1) · Pillar 8: Physics-based Animation (2 🔗1) · Pillar 7: Motion Retargeting (1)

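🔬 Pillar 9: Embodied Foundation Models (20 papers)

# | Title | One-line summary | Tags | 🔗
1 | VisNumBench: Evaluating Number Sense of Multimodal Large Language Models | Proposes VisNumBench for evaluating the number sense of multimodal large language models (MLLMs). | large language model, multimodal, chain-of-thought
2 | UPME: An Unsupervised Peer Review Framework for Multimodal Large Language Model Evaluation | Proposes UPME, an unsupervised evaluation framework for multimodal large language models that reduces reliance on human annotation. | large language model, multimodal
3 | Visual Position Prompt for MLLM based Visual Grounding | VPP-LLaVA: enhances the visual grounding ability of MLLMs through visual position prompts. | large language model, multimodal, visual grounding
4 | Benchmarking Large Language Models for Handwritten Text Recognition | Evaluates the performance of large language models on handwritten text recognition and explores their zero-shot transfer ability. | large language model, multimodal
5 | EarthScape: A Multimodal Dataset for Surficial Geologic Mapping and Earth Surface Analysis | EarthScape: a multimodal dataset for surficial geologic mapping and Earth surface analysis. | multimodal
6 | LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning | LLaVA-MORE: a comparative study of LLMs and visual backbones in multimodal large language models, aimed at improving visual instruction tuning. | large language model, multimodal, instruction following
7 | Visual Persona: Foundation Model for Full-Body Human Customization | Visual Persona: a foundation model for full-body human customization. | foundation model
8 | EdgeRegNet: Edge Feature-based Multimodal Registration Network between Images and LiDAR Point Clouds | EdgeRegNet: an edge-feature-based multimodal registration network between images and LiDAR point clouds. | multimodal
9 | Generating Multimodal Driving Scenes via Next-Scene Prediction | Proposes UMGen, which generates multimodal autonomous-driving scenes via next-scene prediction, with support for the map modality. | multimodal
10 | Spot the Fake: Large Multimodal Model-Based Synthetic Image Detection with Artifact Explanation | Proposes FakeVLM, a large-multimodal-model-based approach to synthetic image detection with forgery explanation. | multimodal
11 | Cube: A Roblox View of 3D Intelligence | Proposes Cube, a foundation model for 3D intelligence from Roblox's perspective, enabling 3D content generation and understanding. | large language model, foundation model
12 | EfficientLLaVA: Generalizable Auto-Pruning for Large Vision-language Models | EfficientLLaVA: a generalizable auto-pruning method for large vision-language models. | large language model, multimodal
13 | TruthLens: A Training-Free Paradigm for DeepFake Detection | Proposes TruthLens, a training-free deepfake detection framework with improved explainability. | large language model, multimodal
14 | MathFlow: Enhancing the Perceptual Flow of MLLMs for Visual Mathematical Problems | MathFlow: improves the perception ability of MLLMs on visual mathematical problems. | large language model, multimodal
15 | FAVOR-Bench: A Comprehensive Benchmark for Fine-Grained Video Motion Understanding | FAVOR-Bench: a comprehensive benchmark for fine-grained video motion understanding. | large language model, multimodal
16 | Mitigating Object Hallucinations in MLLMs via Multi-Frequency Perturbations | Proposes Multi-Frequency Perturbations (MFP) to mitigate object hallucinations in multimodal large language models. | large language model, multimodal
17 | Unlocking the Capabilities of Large Vision-Language Models for Generalizable and Explainable Deepfake Detection | Proposes a knowledge-guided forgery detection framework that improves the generalizability and explainability of large vision-language models for deepfake detection. | large language model, multimodal
18 | Vision-Speech Models: Teaching Speech Models to Converse about Images | Proposes MoshiVis, which gives a speech model visual understanding so it can hold spoken conversations about images. | multimodal
19 | Forensics-Bench: A Comprehensive Forgery Detection Benchmark Suite for Large Vision Language Models | Proposes Forensics-Bench for comprehensively evaluating the forgery detection capabilities of large vision-language models. | multimodal
20 | Universal Scene Graph Generation | Proposes the Universal Scene Graph (USG) representation and a parser for it, enabling comprehensive understanding of multimodal scene semantics. | multimodal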

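🔬 Pillar 2: RL & Architecture (15 papers)

# | Title | One-line summary | Tags | 🔗
21 | EgoDTM: Towards 3D-Aware Egocentric Video-Language Pretraining | EgoDTM: improves video representation learning through 3D-aware egocentric video-language pretraining. | representation learning, contrastive learning, depth estimation
22 | Machine Unlearning in Hyperbolic vs. Euclidean Multimodal Contrastive Learning: Adapting Alignment Calibration to MERU | Proposes a decoupling method for the hyperbolic-space MERU model, enabling concept unlearning in multimodal contrastive learning. | contrastive learning, multimodal
23 | Toward Scalable, Flexible Scene Flow for Point Clouds | Builds a scalable, flexible scene flow estimator for point clouds, improving generalization and performance. | distillation, scene flow
24 | Distilling 3D distinctive local descriptors for 6D pose estimation | Proposes knowledge-distilled 3D local descriptors to accelerate 6D pose estimation. | distillation, 6D pose estimation
25 | Decompositional Neural Scene Reconstruction with Generative Diffusion Prior | DP-Recon: uses a generative diffusion prior for decompositional neural scene reconstruction, addressing occlusion under sparse views. | distillation, scene reconstruction
26 | Object-Centric Pretraining via Target Encoder Bootstrapping | Proposes OCEBO, which pretrains object-centric representations through target encoder bootstrapping, without relying on non-object-centric pretrained models. | representation learning, distillation, foundation model
27 | When the Future Becomes the Past: Taming Temporal Correspondence for Self-supervised Video Representation Learning | Proposes T-CoRe, which exploits temporal correspondence for self-supervised video representation learning. | representation learning, distillation
28 | xMOD: Cross-Modal Distillation for 2D/3D Multi-Object Discovery from 2D motion | Proposes xMOD, which distills 2D motion information for unsupervised 2D/3D multi-object discovery. | teacher-student, distillation
29 | Tables Guide Vision: Learning to See the Heart through Tabular Data | Proposes a tabular-data-guided contrastive learning framework that improves representation learning for cardiovascular imaging. | representation learning, contrastive learning, multimodal
30 | Taming Flow Matching with Unbalanced Optimal Transport into Fast Pansharpening | Proposes a flow matching framework based on unbalanced optimal transport for fast, high-quality remote sensing image fusion (pansharpening). | flow matching, distillation
31 | Text-Derived Relational Graph-Enhanced Network for Skeleton-Based Action Segmentation | Proposes TRG-Net, which uses text-derived relational graphs to enhance skeleton-based action segmentation for more accurate action understanding. | contrastive learning, large language model
32 | Single-Step Bidirectional Unpaired Image Translation Using Implicit Bridge Consistency Distillation | Proposes Implicit Bridge Consistency Distillation (IBCD) for single-step bidirectional unpaired image translation. | distillation
33 | Aligning Information Capacity Between Vision and Language via Dense-to-Sparse Feature Distillation for Image-Text Matching | Proposes D2S-VSE, which aligns information capacity for image-text matching through dense-to-sparse feature distillation. | distillation
34 | When Domain Generalization meets Generalized Category Discovery: An Adaptive Task-Arithmetic Driven Approach | Proposes DG2CD-Net, an adaptive task-arithmetic-driven domain generalization approach to generalized category discovery. | representation learning, foundation model
35 | TULIP: Towards Unified Language-Image Pretraining | TULIP: towards unified language-image pretraining, improving visual understanding and cross-modal performance. | contrastive learning, depth estimation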

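🔬 Pillar 3: Perception & Semantics (14 papers)

# | Title | One-line summary | Tags | 🔗
36 | SPNeRF: Open Vocabulary 3D Neural Scene Segmentation with Superpoints | SPNeRF: open-vocabulary 3D neural scene segmentation using superpoints. | NeRF, open-vocabulary, open vocabulary
37 | Recover and Match: Open-Vocabulary Multi-Label Recognition through Knowledge-Constrained Optimal Transport | Proposes the RAM framework for open-vocabulary multi-label recognition via knowledge-constrained optimal transport. | open-vocabulary, open vocabulary
38 | DiST-4D: Disentangled Spatiotemporal Diffusion with Metric Depth for 4D Driving Scene Generation | Proposes DiST-4D, a disentangled spatiotemporal diffusion method for generating 4D driving scenes with metric depth. | metric depth, spatiotemporal
39 | Fine-Grained Open-Vocabulary Object Detection with Fined-Grained Prompts: Task, Dataset and Benchmark | Proposes the 3F-OVD task to address unfair evaluation in open-vocabulary object detection. | open-vocabulary, open vocabulary
40 | SemanticFlow: A Self-Supervised Framework for Joint Scene Flow Prediction and Instance Segmentation in Dynamic Environments | SemanticFlow: a self-supervised framework for joint scene flow prediction and instance segmentation in dynamic scenes. | scene understanding, scene flow
41 | GO-N3RDet: Geometry Optimized NeRF-enhanced 3D Object Detector | GO-N3RDet: a geometry-optimized, NeRF-enhanced multi-view 3D object detector. | NeRF, neural radiance field
42 | MultiBARF: Integrating Imagery of Different Wavelength Regions by Using Neural Radiance Fields | MultiBARF: integrates imagery from different wavelength regions using neural radiance fields, simplifying multi-sensor fusion. | NeRF, neural radiance field
43 | USAM-Net: A U-Net-based Network for Improved Stereo Correspondence and Scene Depth Estimation using Features from a Pre-trained Image Segmentation network | USAM-Net: a U-Net-based stereo matching and depth estimation network that incorporates features from a pretrained segmentation network. | depth estimation, stereo depth
44 | Learning 4D Panoptic Scene Graph Generation from Rich 2D Visual Scene | Proposes a 4D panoptic scene graph generation framework based on knowledge transfer from 2D visual scenes, addressing data scarcity. | open-vocabulary, open vocabulary, large language model
45 | DPFlow: Adaptive Optical Flow Estimation with a Dual-Pyramid Framework | Proposes DPFlow, a dual-pyramid adaptive optical flow estimation framework that tackles optical flow estimation for high-resolution video. | optical flow
46 | DiffPortrait360: Consistent Portrait Diffusion for 360 View Synthesis | DiffPortrait360: proposes a consistent portrait diffusion model for 360-degree view synthesis. | NeRF, neural radiance field
47 | 3D Engine-ready Photorealistic Avatars via Dynamic Textures | Proposes a dynamic-texture-based method for generating photorealistic, 3D-engine-ready avatars. | NeRF, implicit representation
48 | The Change You Want To Detect: Semantic Change Detection In Earth Observation With Hybrid Data Generation | Proposes HySCDG, a hybrid data generation pipeline for improving semantic change detection in remote sensing imagery. | semantic map
49 | Temporal-Consistent Video Restoration with Pre-trained Diffusion Models | Proposes a temporally consistent video restoration framework based on pretrained diffusion models, improving visual quality and temporal stability. | optical flow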

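🔬 Pillar 6: Video Extraction (2 papers)

# | Title | One-line summary | Tags | 🔗
50 | Challenges and Trends in Egocentric Vision: A Survey | A survey analyzing the challenges and trends in egocentric vision understanding, offering a reference for AR/VR and related fields. | egocentric, egocentric vision, multimodal
51 | CHROME: Clothed Human Reconstruction with Occlusion-Resilience and Multiview-Consistency from a Single Image | CHROME: multiview-consistent clothed human reconstruction from a single image under occlusion. | SMPL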

🔬 Pillar 4: Generative Motion (2 papers)

# | Title | One-line summary | Tags | 🔗
52 | GenM$^3$: Generative Pretrained Multi-path Motion Model for Text Conditional Human Motion Generation | GenM³: a generative pretrained multi-path motion model for text-conditional human motion generation. | motion generation, VQ-VAE, large language model
53 | MotionStreamer: Streaming Motion Generation via Diffusion-based Autoregressive Model in Causal Latent Space | MotionStreamer: proposes a diffusion-based autoregressive model for streaming motion generation in a causal latent space. | motion generation, motion latent

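🔬 Pillar 1: Robot Control (2 papers)

# | Title | One-line summary | Tags | 🔗
54 | DeepMesh: Auto-Regressive Artist-mesh Creation with Reinforcement Learning | DeepMesh: proposes a reinforcement-learning-based auto-regressive method for generating artist-style meshes. | manipulation, reinforcement learning, DPO
55 | LEGION: Learning to Ground and Explain for Synthetic Image Detection | Proposes the LEGION framework for synthetic image detection, with the ability to localize and explain forged regions. | manipulation, large language model, multimodal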

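🔬 Pillar 8: Physics-based Animation (2 papers)

# | Title | One-line summary | Tags | 🔗
56 | GASP: Unifying Geometric and Semantic Self-Supervised Pre-training for Autonomous Driving | GASP: a unified framework for geometric and semantic self-supervised pretraining for autonomous driving. | spatiotemporal, large language model, foundation model
57 | Reducing Annotation Burden: Exploiting Image Knowledge for Few-Shot Medical Video Object Segmentation via Spatiotemporal Consistency Relearning | Proposes a spatiotemporal consistency relearning method that exploits image knowledge for few-shot medical video object segmentation. | spatiotemporal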

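🔬 Pillar 7: Motion Retargeting (1 paper)

# | Title | One-line summary | Tags | 🔗
58 | Deep Polycuboid Fitting for Compact 3D Representation of Indoor Scenes | Proposes a deep-learning-based polycuboid fitting framework for compactly representing the 3D structure of indoor scenes. | spatial relationship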
