cs.CV (2025-05-19)

📊 48 papers in total | 🔗 12 with code

🎯 Interest Area Navigation

Pillar 2: RL & Architecture (16 🔗5) · Pillar 9: Embodied Foundation Models (13 🔗5) · Pillar 3: Perception & Semantics (11 🔗2) · Pillar 7: Motion Retargeting (3) · Pillar 8: Physics-based Animation (2) · Pillar 6: Video Extraction (1) · Pillar 5: Interaction & Reaction (1) · Pillar 4: Generative Motion (1)

🔬 Pillar 2: RL & Architecture (16 papers)

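# | Title | One-line summary | Tags | 🔗
1 | KinTwin: Imitation Learning with Torque and Muscle Driven Biomechanical Models Enables Precise Replication of Able-Bodied and Impaired Movement from Markerless Motion Capture | KinTwin: uses torque- and muscle-driven biomechanical models with imitation learning to precisely replicate able-bodied and impaired movement from markerless motion capture. | imitation learning markerless motion capture |
2 | Unlocking the Potential of Difficulty Prior in RL-based Multimodal Reasoning | Proposes a reinforcement learning method based on a difficulty prior to improve multimodal reasoning. | reinforcement learning multimodal |
3 | Mamba-Adaptor: State Space Model Adaptor for Visual Recognition | Proposes Mamba-Adaptor to address Mamba's shortcomings in global context modeling, long-range dependencies, and spatial structure modeling for visual recognition. | Mamba SSM state space model |
4 | G1: Bootstrapping Perception and Reasoning Abilities of Vision-Language Model via Reinforcement Learning | G1: uses reinforcement learning to bootstrap the perception and reasoning abilities of vision-language models, improving decision-making in interactive game environments. | reinforcement learning multimodal |
5 | SPKLIP: Aligning Spike Video Streams with Natural Language | SPKLIP: proposes a new architecture for spike video-language alignment, addressing the performance bottleneck caused by the modality gap. | contrastive learning VLA multimodal |
6 | AutoMat: Enabling Automated Crystal Structure Reconstruction from Microscopy via Agentic Tool Use | AutoMat: automated crystal structure reconstruction from microscopy images via agentic tool use. | MAE large language model multimodal |
7 | BusterX: MLLM-Powered AI-Generated Video Forgery Detection and Explanation | BusterX: proposes an MLLM-based framework for detecting and explaining AI-generated video forgeries, and builds the large-scale dataset GenBuster-200K. | reinforcement learning large language model multimodal |
8 | Few-Step Diffusion via Score identity Distillation | Proposes the SiD framework, based on Score identity Distillation, to accelerate text-to-image models such as Stable Diffusion XL. | distillation classifier-free guidance |
9 | Sat2Sound: A Unified Framework for Zero-Shot Soundscape Mapping | Sat2Sound: a unified multimodal framework for zero-shot soundscape mapping. | representation learning contrastive learning multimodal |
10 | Safe-Sora: Safe Text-to-Video Generation via Graphical Watermarking | Safe-Sora: safe text-to-video generation via graphical watermarking. | Mamba state space model spatiotemporal |
11 | DD-Ranking: Rethinking the Evaluation of Dataset Distillation | DD-Ranking: rethinks the evaluation of dataset distillation and proposes a fairer evaluation framework. | distillation |
12 | RMMSS: Towards Advanced Robust Multi-Modal Semantic Segmentation with Hybrid Prototype Distillation and Feature Selection | RMMSS: proposes a hybrid prototype distillation and feature selection framework for robust multimodal semantic segmentation. | distillation |
13 | Coarse Attribute Prediction with Task Agnostic Distillation for Real World Clothes Changing ReID | Proposes the RLQ framework, which improves the robustness of clothes-changing ReID in real-world settings through coarse attribute prediction and task-agnostic distillation. | distillation |
14 | RoPECraft: Training-Free Motion Transfer with Trajectory-Guided RoPE Optimization on Diffusion Transformers | RoPECraft: training-free video motion transfer for diffusion transformers based on trajectory-guided RoPE optimization. | flow matching optical flow |
15 | Touch2Shape: Touch-Conditioned 3D Diffusion for Shape Exploration and Reconstruction | Touch2Shape: proposes a touch-conditioned 3D diffusion model for shape exploration and reconstruction. | reinforcement learning reward design |
16 | Towards Low-Latency Event Stream-based Visual Object Tracking: A Slow-Fast Approach | Proposes SFTrack, a slow-fast approach to low-latency event-stream visual object tracking. | representation learning distillation |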

🔬 Pillar 9: Embodied Foundation Models (13 papers)

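# | Title | One-line summary | Tags | 🔗
17 | FEALLM: Advancing Facial Emotion Analysis in Multimodal Large Language Models with Emotional Synergy and Reasoning | FEALLM: uses emotional synergy and reasoning to improve multimodal large language models on facial emotion analysis. | large language model multimodal |
18 | Specialized Foundation Models for Intelligent Operating Rooms | Proposes ORQA, a specialized foundation model that fuses multimodal data and is designed for intelligent operating rooms. | foundation model multimodal |
19 | Semantic Change Detection of Roads and Bridges: A Fine-grained Dataset and Multimodal Frequency-driven Detector | Proposes a multimodal frequency-driven change detector to address semantic change detection of roads and bridges. | multimodal |
20 | Reasoning-OCR: Can Large Multimodal Models Solve Complex Logical Reasoning Problems from OCR Cues? | Proposes the Reasoning-OCR benchmark to evaluate the complex logical reasoning of large multimodal models from OCR cues. | multimodal |
21 | FLASH: Latent-Aware Semi-Autoregressive Speculative Decoding for Multimodal Tasks | Proposes FLASH, a latent-aware semi-autoregressive speculative decoding framework for multimodal tasks that accelerates LMM inference. | multimodal |
22 | VLC Fusion: Vision-Language Conditioned Sensor Fusion for Robust Object Detection | Proposes VLC Fusion, which uses a vision-language model to condition sensor fusion, improving the robustness of object detection. | language conditioned |
23 | Any-to-Any Learning in Computational Pathology via Triplet Multimodal Pretraining | Proposes the ALTER framework, which achieves any-to-any modality learning in computational pathology through triplet multimodal pretraining. | multimodal |
24 | Mitigating Hallucination in VideoLLMs via Temporal-Aware Activation Engineering | Proposes a temporal-aware activation engineering framework that effectively mitigates hallucination in video large language models. | large language model multimodal |
25 | Computer Vision Models Show Human-Like Sensitivity to Geometric and Topological Concepts | Finds that Transformer models show human-like sensitivity in understanding geometric and topological concepts, while multimodal models perform worse. | multimodal |
26 | From Local Details to Global Context: Advancing Vision-Language Models with Attention-Based Selection | Proposes ABS, an attention-based selection method, to improve the generalization of vision-language models on zero-shot tasks. | large language model |
27 | Industrial Synthetic Segment Pre-training | Proposes the InsCore synthetic dataset for pre-training instance segmentation in industrial settings, requiring no real images or manual annotation. | foundation model |
28 | Scalable Video-to-Dataset Generation for Cross-Platform Mobile Agents | MONDAY: scalable video-to-dataset generation for cross-platform mobile agents. | large language model |
29 | Temporal-Oriented Recipe for Transferring Large Vision-Language Model to Video Understanding | Proposes a temporal-oriented training recipe that improves large vision-language models on video understanding tasks. | large language model |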

🔬 Pillar 3: Perception & Semantics (11 papers)

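# | Title | One-line summary | Tags | 🔗
30 | Hybrid 3D-4D Gaussian Splatting for Fast Dynamic Scene Representation | Proposes hybrid 3D-4D Gaussian splatting, which speeds up dynamic scene representation and improves rendering quality. | gaussian splatting splatting scene reconstruction |
31 | 3D Visual Illusion Depth Estimation | Proposes a 3D visual illusion depth estimation framework based on fusing vision-language commonsense, improving depth estimation accuracy. | depth estimation monocular depth spatial relationship |
32 | eStonefish-scenes: A synthetically generated dataset for underwater event-based optical flow prediction tasks | Proposes eStonefish-scenes, a synthetic dataset for underwater event-camera optical flow prediction, to support underwater robotics research. | visual odometry optical flow |
33 | IPENS:Interactive Unsupervised Framework for Rapid Plant Phenotyping Extraction via NeRF-SAM2 Fusion | IPENS: an interactive unsupervised framework for rapid plant phenotype extraction based on NeRF-SAM2 fusion. | NeRF |
34 | TACOcc:Target-Adaptive Cross-Modal Fusion with Volume Rendering for 3D Semantic Occupancy | Proposes TACOcc, which achieves 3D semantic occupancy prediction through target-adaptive cross-modal fusion and volume rendering. | 3D gaussian splatting gaussian splatting splatting |
35 | Predicting Reaction Time to Comprehend Scenes with Foveated Scene Understanding Maps | Proposes a reaction time prediction model based on Foveated Scene Understanding Maps (F-SUM) for predicting scene comprehension time. | scene understanding |
36 | Recollection from Pensieve: Novel View Synthesis via Learning from Uncalibrated Videos | Proposes Pensieve, which achieves high-quality novel view synthesis by learning from uncalibrated videos. | gaussian splatting splatting |
37 | Event-Driven Dynamic Scene Depth Completion | Proposes the EventDC framework, which uses event camera data for depth completion in dynamic scenes. | depth estimation |
38 | FlowCut: Unsupervised Video Instance Segmentation via Temporal Mask Matching | FlowCut: proposes an unsupervised video instance segmentation method based on temporal mask matching. | optical flow |
39 | Just Dance with $π$! A Poly-modal Inductor for Weakly-supervised Video Anomaly Detection | Proposes PI-VAD, a poly-modal induction framework for improving the robustness of weakly-supervised video anomaly detection. | optical flow |
40 | IA-MVS: Instance-Focused Adaptive Depth Sampling for Multi-View Stereo | IA-MVS: instance-focused adaptive depth sampling for multi-view stereo. | depth estimation |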

🔬 Pillar 7: Motion Retargeting (3 papers)

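# | Title | One-line summary | Tags | 🔗
41 | CacheFlow: Fast Human Motion Prediction by Cached Normalizing Flow | CacheFlow: accelerates human motion prediction with cached normalizing flows. | human motion |
42 | Multi-Resolution Haar Network: Enhancing human motion prediction via Haar transform | Proposes HaarMoDic, a multi-resolution network based on the Haar transform, to improve human motion prediction accuracy. | human motion |
43 | GeoRanker: Distance-Aware Ranking for Worldwide Image Geolocalization | Proposes GeoRanker, which uses distance-aware ranking for worldwide image geolocalization. | spatial relationship multimodal |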

🔬 Pillar 8: Physics-based Animation (2 papers)

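# | Title | One-line summary | Tags | 🔗
44 | Joint Depth and Reflectivity Estimation using Single-Photon LiDAR | Proposes SPLiDER for joint depth and reflectivity estimation with single-photon LiDAR in fast-moving scenes. | PULSE TAMP |
45 | Long-RVOS: A Comprehensive Benchmark for Long-term Referring Video Object Segmentation | Proposes the Long-RVOS long-video benchmark and designs the ReferMo model for long-term referring video object segmentation. | spatiotemporal |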

🔬 Pillar 6: Video Extraction (1 paper)

# | Title | One-line summary | Tags | 🔗
46 | HiERO: understanding the hierarchy of human behavior enhances reasoning on egocentric videos | HiERO: uses the hierarchy of human behavior to enhance reasoning on egocentric videos. | egocentric egocentric vision Ego4D |

🔬 Pillar 5: Interaction & Reaction (1 paper)

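# | Title | One-line summary | Tags | 🔗
47 | Single Image Reflection Separation via Dual Prior Interaction Transformer | Proposes a dual-prior interaction Transformer that effectively separates the reflection and transmission layers in a single image. | interaction transformer |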

🔬 Pillar 4: Generative Motion (1 paper)

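# | Title | One-line summary | Tags | 🔗
48 | FinePhys: Fine-grained Human Action Generation by Explicitly Incorporating Physical Laws for Effective Skeletal Guidance | FinePhys: fine-grained human action generation with effective skeletal guidance by explicitly incorporating physical laws. | physically plausible |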
