cs.CV (2025-04-07)

📊 38 papers | 🔗 10 with code

🎯 Areas of Interest

Pillar 9: Embodied Foundation Models (15 🔗2) · Pillar 2: RL & Architecture (13 🔗3) · Pillar 3: Perception & Semantics (4 🔗1) · Pillar 1: Robot Control (3 🔗3) · Pillar 8: Physics-based Animation (2 🔗1) · Pillar 4: Generative Motion (1)

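🔬 Pillar 9: Embodied Foundation Models (15 papers)

# | Title | One-line summary | Tags | 🔗
1 | Towards Visual Text Grounding of Multimodal Large Language Model | Proposes the TRIG benchmark to address visual text grounding by multimodal large language models on text-rich images. | large language model, multimodal, visual grounding
2 | SCAM: A Real-World Typographic Robustness Evaluation for Multimodal Foundation Models | SCAM: a real-world dataset for evaluating the robustness of multimodal foundation models against typographic attacks. | large language model, foundation model, multimodal
3 | OCC-MLLM-CoT-Alpha: Towards Multi-stage Occlusion Recognition Based on Large Language Models via 3D-Aware Supervision and Chain-of-Thoughts Guidance | Proposes OCC-MLLM-CoT-Alpha, which improves MLLM performance on occlusion recognition through 3D-aware supervision and CoT guidance. | large language model, chain-of-thought
4 | LEO-MINI: An Efficient Multimodal Large Language Model using Conditional Token Reduction and Mixture of Multi-Modal Experts | LEO-MINI: uses conditional token reduction and a mixture of multimodal experts to improve the efficiency and visual reasoning of multimodal large language models. | large language model, multimodal
5 | Training state-of-the-art pathology foundation models with orders of magnitude less data | Trains competitive pathology foundation models using far less data than SOTA models. | foundation model
6 | The 1st Solution for 4th PVUW MeViS Challenge: Unleashing the Potential of Large Multimodal Models for Referring Video Segmentation | Uses large multimodal models to tackle motion-expression video segmentation, winning first place in the PVUW MeViS Challenge. | multimodal
7 | SSLFusion: Scale & Space Aligned Latent Fusion Model for Multimodal 3D Object Detection | SSLFusion: proposes a scale- and space-aligned latent fusion model for multimodal 3D object detection. | multimodal
8 | AsyReC: A Multimodal Graph-based Framework for Spatio-Temporal Asymmetric Dyadic Relationship Classification | AsyReC: proposes a framework for asymmetric spatio-temporal dyadic relationship classification based on multimodal graph neural networks. | multimodal
9 | Lumina-OmniLV: A Unified Multimodal Framework for General Low-Level Vision | Lumina-OmniLV: a unified multimodal framework for general low-level vision. | multimodal
10 | REEF: Relevance-Aware and Efficient LLM Adapter for Video Understanding | Proposes REEF, a relevance-aware and efficient LLM adapter for video understanding. | large language model, foundation model
11 | URECA: Unique Region Caption Anything | Proposes the URECA dataset and model to address uniqueness and consistency in multi-granularity region captioning. | large language model, multimodal
12 | Seeking and Updating with Live Visual Knowledge | Proposes the LiveVQA dataset for evaluating and updating the understanding of live visual knowledge in multimodal large language models. | large language model, multimodal
13 | Explaining Low Perception Model Competency with High-Competency Counterfactuals | Proposes five methods for generating high-competency counterfactual images to explain low perception-model competency. | large language model, multimodal
14 | InstructionBench: An Instructional Video Understanding Benchmark | Proposes InstructionBench for evaluating the temporal reasoning of video large language models on instructional video understanding. | large language model
15 | Video-Bench: Human-Aligned Video Generation Benchmark | Proposes Video-Bench, a video generation evaluation benchmark that is better aligned with human perception. | large language model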

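🔬 Pillar 2: RL & Architecture (13 papers)

# | Title | One-line summary | Tags | 🔗
16 | PanoDreamer: Consistent Text to 360-Degree Scene Generation | PanoDreamer: proposes a consistent, text-driven method for 360-degree panoramic scene generation. | dreamer, 3D gaussian splatting, gaussian splatting
17 | SCRAMBLe : Enhancing Multimodal LLM Compositionality with Synthetic Preference Data | SCRAMBLe: uses synthetic preference data to improve the compositional reasoning of multimodal LLMs. | preference learning, large language model, multimodal
18 | OrderChain: Towards General Instruct-Tuning for Stimulating the Ordinal Understanding Ability of MLLM | OrderChain: improves the ordinal understanding of multimodal large language models through instruction tuning. | MAE, large language model, multimodal
19 | REVEAL: Relation-based Video Representation Learning for Video-Question-Answering | Proposes the REVEAL framework, which improves the quality and efficiency of video representations for video question answering through relation modeling. | representation learning, spatiotemporal
20 | Learning Activity View-invariance Under Extreme Viewpoint Changes via Curriculum Knowledge Distillation | Proposes a view-invariance learning method based on curriculum knowledge distillation to address activity recognition under extreme viewpoint changes. | curriculum learning, distillation
21 | Leveraging State Space Models in Long Range Genomics | Uses state space models to capture long-range dependencies in genomics. | SSM, state space model
22 | Dynamic Vision Mamba | Dynamic Vision Mamba (DyVM): improves the efficiency of Mamba vision models through dynamic token pruning and block selection. | Mamba, SSM
23 | Uni4D: A Unified Self-Supervised Learning Framework for Point Cloud Videos | Uni4D: a unified self-supervised learning framework for point cloud videos that disentangles geometric and semantic information. | representation learning, masked autoencoder, MAE
24 | Towards Efficient Real-Time Video Motion Transfer via Generative Time Series Modeling | Proposes a real-time video motion transfer framework based on generative time-series modeling that improves bandwidth efficiency. | MAE, optical flow
25 | S^4M: Boosting Semi-Supervised Instance Segmentation with SAM | S^4M: uses SAM to improve semi-supervised instance segmentation. | teacher-student, distillation
26 | DebGCD: Debiased Learning with Distribution Guidance for Generalized Category Discovery | DebGCD: proposes a debiased learning framework with distribution guidance for generalized category discovery. | curriculum learning, distillation
27 | CADCrafter: Generating Computer-Aided Design Models from Unconstrained Images | CADCrafter: proposes a new framework for generating parametric CAD models from unconstrained images. | DPO, direct preference optimization
28 | Dual Consistent Constraint via Disentangled Consistency and Complementarity for Multi-view Clustering | Proposes a multi-view clustering framework with a dual consistency constraint based on disentangled consistency and complementarity. | representation learning, contrastive learning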

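🔬 Pillar 3: Perception & Semantics (4 papers)

# | Title | One-line summary | Tags | 🔗
29 | DeclutterNeRF: Generative-Free 3D Scene Recovery for Occlusion Removal | DeclutterNeRF: a 3D scene reconstruction method for occlusion removal that uses no generative priors. | 3D gaussian splatting, 3DGS, gaussian splatting
30 | Grounding 3D Object Affordance with Language Instructions, Visual Observations and Interactions | Proposes LMAffordance3D, which grounds 3D object affordances using language instructions, visual observations, and interactions. | affordance
31 | Stereo-LiDAR Fusion by Semi-Global Matching With Discrete Disparity-Matching Cost and Semidensification | Proposes a stereo-LiDAR fusion method based on semi-global matching with a discrete disparity-matching cost. | depth estimation
32 | DFormerv2: Geometry Self-Attention for RGBD Semantic Segmentation | DFormerv2: a geometry self-attention mechanism for RGBD semantic segmentation. | scene understanding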

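🔬 Pillar 1: Robot Control (3 papers)

# | Title | One-line summary | Tags | 🔗
33 | MotionPRO: Exploring the Role of Pressure in Human MoCap and Beyond | MotionPRO: explores the role of pressure in human motion capture to improve physical plausibility. | humanoid, humanoid robot, penetration
34 | FantasyTalking: Realistic Talking Portrait Generation via Coherent Motion Synthesis | Proposes FantasyTalking to tackle animation generation from static portraits. | manipulation, motion synthesis
35 | Continuous Locomotive Crowd Behavior Generation | Proposes a diffusion-based framework for continuous crowd behavior generation, addressing the difficulty existing methods have in simulating realistic crowd dynamics. | locomotion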

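🔬 Pillar 8: Physics-based Animation (2 papers)

# | Title | One-line summary | Tags | 🔗
36 | Caption Anything in Video: Fine-grained Object-centric Captioning via Spatiotemporal Multimodal Prompting | Proposes CAT-V, a training-free framework for fine-grained, object-centric video captioning. | spatiotemporal, multimodal, chain-of-thought
37 | Inter-event Interval Microscopy for Event Cameras | Proposes IEIM, a microscopy method based on inter-event intervals, for fluorescence microscopy imaging with a static event camera. | PULSE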

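🔬 Pillar 4: Generative Motion (1 paper)

# | Title | One-line summary | Tags | 🔗
38 | From Sparse Signal to Smooth Motion: Real-Time Motion Generation with Rolling Prediction Models | Proposes the Rolling Prediction Model (RPM) to generate smooth full-body motion from sparse, unstable hand-tracking signals in XR. | motion generation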
