cs.CV (2025-05-19)

📊 48 papers in total | 🔗 12 with code

🎯 Interest Area Navigation

Pillar 2: RL Algorithms & Architecture (16 🔗5) Pillar 9: Embodied Foundation Models (13 🔗5) Pillar 3: Spatial Perception & Semantics (11 🔗2) Pillar 7: Motion Retargeting (3) Pillar 8: Physics-based Animation (2) Pillar 6: Video Extraction & Matching (1) Pillar 5: Interaction & Reaction (1) Pillar 4: Generative Motion (1)

🔬 Pillar 2: RL Algorithms & Architecture (16 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
1	KinTwin: Imitation Learning with Torque and Muscle Driven Biomechanical Models Enables Precise Replication of Able-Bodied and Impaired Movement from Markerless Motion Capture	KinTwin: uses imitation learning with torque- and muscle-driven biomechanical models to precisely replicate able-bodied and impaired movement from markerless motion capture	imitation learning markerless motion capture
2	Unlocking the Potential of Difficulty Prior in RL-based Multimodal Reasoning	Proposes a reinforcement learning method based on a difficulty prior to improve multimodal reasoning	reinforcement learning multimodal
3	Mamba-Adaptor: State Space Model Adaptor for Visual Recognition	Proposes Mamba-Adaptor to address Mamba's shortcomings in global context modeling, long-range dependencies, and spatial structure modeling for visual recognition	Mamba SSM state space model
4	G1: Bootstrapping Perception and Reasoning Abilities of Vision-Language Model via Reinforcement Learning	G1: bootstraps the perception and reasoning abilities of vision-language models via reinforcement learning, improving decision-making in interactive game environments	reinforcement learning multimodal	✅
5	SPKLIP: Aligning Spike Video Streams with Natural Language	SPKLIP: proposes a new architecture for spike video-language alignment, addressing the performance bottleneck caused by the modality gap	contrastive learning VLA multimodal
6	AutoMat: Enabling Automated Crystal Structure Reconstruction from Microscopy via Agentic Tool Use	AutoMat: automated crystal structure reconstruction from microscopy images via agentic tool use	MAE large language model multimodal	✅
7	BusterX: MLLM-Powered AI-Generated Video Forgery Detection and Explanation	BusterX: proposes an MLLM-based framework for detecting and explaining AI-generated video forgery, and builds the large-scale dataset GenBuster-200K	reinforcement learning large language model multimodal
8	Few-Step Diffusion via Score identity Distillation	Proposes the SiD framework, based on Score identity Distillation, to accelerate text-to-image models such as Stable Diffusion XL	distillation classifier-free guidance	✅
9	Sat2Sound: A Unified Framework for Zero-Shot Soundscape Mapping	Sat2Sound: a unified multimodal framework for zero-shot soundscape mapping	representation learning contrastive learning multimodal
10	Safe-Sora: Safe Text-to-Video Generation via Graphical Watermarking	Safe-Sora: safe text-to-video generation via graphical watermarking	Mamba state space model spatiotemporal	✅
11	DD-Ranking: Rethinking the Evaluation of Dataset Distillation	DD-Ranking: rethinks the evaluation of dataset distillation and proposes a fairer evaluation framework	distillation
12	RMMSS: Towards Advanced Robust Multi-Modal Semantic Segmentation with Hybrid Prototype Distillation and Feature Selection	RMMSS: proposes a hybrid prototype distillation and feature selection framework for robust multimodal semantic segmentation	distillation
13	Coarse Attribute Prediction with Task Agnostic Distillation for Real World Clothes Changing ReID	Proposes the RLQ framework, which improves the robustness of real-world clothes-changing ReID via coarse attribute prediction and task-agnostic distillation	distillation
14	RoPECraft: Training-Free Motion Transfer with Trajectory-Guided RoPE Optimization on Diffusion Transformers	RoPECraft: training-free video motion transfer for diffusion transformers via trajectory-guided RoPE optimization	flow matching optical flow
15	Touch2Shape: Touch-Conditioned 3D Diffusion for Shape Exploration and Reconstruction	Touch2Shape: proposes a touch-conditioned 3D diffusion model for shape exploration and reconstruction	reinforcement learning reward design
16	Towards Low-Latency Event Stream-based Visual Object Tracking: A Slow-Fast Approach	Proposes SFTrack: a slow-fast approach to low-latency event-stream visual object tracking	representation learning distillation	✅

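🔬 Pillar 9: Embodied Foundation Models (13 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
17	FEALLM: Advancing Facial Emotion Analysis in Multimodal Large Language Models with Emotional Synergy and Reasoning	FEALLM: improves multimodal large language models on facial emotion analysis through emotional synergy and reasoning	large language model multimodal	✅
18	Specialized Foundation Models for Intelligent Operating Rooms	Proposes ORQA: a specialized foundation model for intelligent operating rooms that fuses multimodal data	foundation model multimodal
19	Semantic Change Detection of Roads and Bridges: A Fine-grained Dataset and Multimodal Frequency-driven Detector	Proposes a multimodal frequency-driven change detector to address semantic change detection of roads and bridges	multimodal	✅
20	Reasoning-OCR: Can Large Multimodal Models Solve Complex Logical Reasoning Problems from OCR Cues?	Proposes the Reasoning-OCR benchmark to evaluate the complex logical reasoning of large multimodal models over OCR cues	multimodal	✅
21	FLASH: Latent-Aware Semi-Autoregressive Speculative Decoding for Multimodal Tasks	Proposes FLASH: a latent-aware semi-autoregressive speculative decoding framework for multimodal tasks that accelerates LMM inference	multimodal	✅
22	VLC Fusion: Vision-Language Conditioned Sensor Fusion for Robust Object Detection	Proposes VLC Fusion, which uses a vision-language model to condition sensor fusion, improving the robustness of object detection	language conditioned
23	Any-to-Any Learning in Computational Pathology via Triplet Multimodal Pretraining	Proposes the ALTER framework, which enables any-to-any modality learning in computational pathology via triplet multimodal pretraining	multimodal
24	Mitigating Hallucination in VideoLLMs via Temporal-Aware Activation Engineering	Proposes a temporal-aware activation engineering framework that effectively mitigates hallucination in video large language models	large language model multimodal
25	Computer Vision Models Show Human-Like Sensitivity to Geometric and Topological Concepts	Finds that Transformer models show human-like sensitivity to geometric and topological concepts, while multimodal models perform worse	multimodal
26	From Local Details to Global Context: Advancing Vision-Language Models with Attention-Based Selection	Proposes ABS, an attention-based selection method that improves the zero-shot generalization of vision-language models	large language model	✅
27	Industrial Synthetic Segment Pre-training	Proposes InsCore, a synthetic dataset for pre-training instance segmentation in industrial settings that requires no real images or manual annotation	foundation model
28	Scalable Video-to-Dataset Generation for Cross-Platform Mobile Agents	MONDAY: scalable video-to-dataset generation for cross-platform mobile agents	large language model
29	Temporal-Oriented Recipe for Transferring Large Vision-Language Model to Video Understanding	Proposes a temporal-oriented training recipe that improves large vision-language models on video understanding tasks	large language model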

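🔬 Pillar 3: Spatial Perception & Semantics (11 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
30	Hybrid 3D-4D Gaussian Splatting for Fast Dynamic Scene Representation	Proposes hybrid 3D-4D Gaussian splatting, which speeds up dynamic scene representation and improves rendering quality	gaussian splatting splatting scene reconstruction
31	3D Visual Illusion Depth Estimation	Proposes a depth estimation framework for 3D visual illusions that fuses vision-language commonsense, improving depth estimation accuracy	depth estimation monocular depth spatial relationship
32	eStonefish-scenes: A synthetically generated dataset for underwater event-based optical flow prediction tasks	Proposes eStonefish-scenes, a synthetic dataset for underwater event-camera optical flow prediction, supporting underwater robotics research	visual odometry optical flow
33	IPENS: Interactive Unsupervised Framework for Rapid Plant Phenotyping Extraction via NeRF-SAM2 Fusion	IPENS: an interactive unsupervised framework for rapid plant phenotype extraction based on NeRF-SAM2 fusion	NeRF
34	TACOcc: Target-Adaptive Cross-Modal Fusion with Volume Rendering for 3D Semantic Occupancy	Proposes TACOcc, which performs 3D semantic occupancy prediction via target-adaptive cross-modal fusion and volume rendering	3D gaussian splatting gaussian splatting splatting
35	Predicting Reaction Time to Comprehend Scenes with Foveated Scene Understanding Maps	Proposes a reaction-time prediction model based on Foveated Scene Understanding Maps (F-SUM) to predict scene comprehension time	scene understanding
36	Recollection from Pensieve: Novel View Synthesis via Learning from Uncalibrated Videos	Proposes Pensieve, which achieves high-quality novel view synthesis by learning from uncalibrated videos	gaussian splatting splatting	✅
37	Event-Driven Dynamic Scene Depth Completion	Proposes the EventDC framework, which uses event camera data for depth completion in dynamic scenes	depth estimation
38	FlowCut: Unsupervised Video Instance Segmentation via Temporal Mask Matching	FlowCut: proposes an unsupervised video instance segmentation method based on temporal mask matching	optical flow
39	Just Dance with $π$! A Poly-modal Inductor for Weakly-supervised Video Anomaly Detection	Proposes PI-VAD: a poly-modal induction framework that improves the robustness of weakly-supervised video anomaly detection	optical flow
40	IA-MVS: Instance-Focused Adaptive Depth Sampling for Multi-View Stereo	IA-MVS: instance-focused adaptive depth sampling for multi-view stereo	depth estimation	✅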

🔬 Pillar 7: Motion Retargeting (3 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
41	CacheFlow: Fast Human Motion Prediction by Cached Normalizing Flow	CacheFlow: accelerates human motion prediction with a cached normalizing flow	human motion
42	Multi-Resolution Haar Network: Enhancing human motion prediction via Haar transform	Proposes HaarMoDic, a multi-resolution network based on the Haar transform, to improve human motion prediction accuracy	human motion
43	GeoRanker: Distance-Aware Ranking for Worldwide Image Geolocalization	Proposes GeoRanker, which uses distance-aware ranking for worldwide image geolocalization	spatial relationship multimodal

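🔬 Pillar 8: Physics-based Animation (2 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
44	Joint Depth and Reflectivity Estimation using Single-Photon LiDAR	Proposes SPLiDER for joint depth and reflectivity estimation with single-photon LiDAR in fast-moving scenes	PULSE TAMP
45	Long-RVOS: A Comprehensive Benchmark for Long-term Referring Video Object Segmentation	Proposes the Long-RVOS long-video benchmark and designs the ReferMo model for long-term referring video object segmentation	spatiotemporal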

🔬 Pillar 6: Video Extraction & Matching (1 paper)

#	Title	One-sentence summary	Tags	🔗	⭐
46	HiERO: understanding the hierarchy of human behavior enhances reasoning on egocentric videos	HiERO: uses the hierarchy of human behavior to enhance reasoning on egocentric videos	egocentric egocentric vision Ego4D

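🔬 Pillar 5: Interaction & Reaction (1 paper)

#	Title	One-sentence summary	Tags	🔗	⭐
47	Single Image Reflection Separation via Dual Prior Interaction Transformer	Proposes a dual-prior interaction Transformer that effectively separates the reflection and transmission layers of a single image	interaction transformer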

🔬 Pillar 4: Generative Motion (1 paper)

#	Title	One-sentence summary	Tags	🔗	⭐
48	FinePhys: Fine-grained Human Action Generation by Explicitly Incorporating Physical Laws for Effective Skeletal Guidance	FinePhys: fine-grained human action generation with effective skeletal guidance by explicitly incorporating physical laws	physically plausible
