cs.CV (2025-03-27)

📊 51 papers in total | 🔗 14 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (22 🔗7) | Pillar 3: Perception & Semantics (11 🔗3) | Pillar 2: RL & Architecture (7 🔗2) | Pillar 6: Video Extraction (4 🔗1) | Pillar 8: Physics-based Animation (3 🔗1) | Pillar 1: Robot Control (2) | Pillar 4: Generative Motion (2)

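🔬 Pillar 9: Embodied Foundation Models (22 papers)

#	Title	One-line summary	Tags	🔗
1	Harmonizing Visual Representations for Unified Multimodal Understanding and Generation	Proposes Harmon, a unified autoregressive framework for multimodal understanding and generation.	multimodal	✅
2	On Large Multimodal Models as Open-World Image Classifiers	Evaluates the performance and challenges of large multimodal models in open-world image classification.	multimodal
3	PS-ReID: Advancing Person Re-Identification and Precise Segmentation with Multimodal Retrieval	PS-ReID: combines image-text multimodal retrieval for more accurate person re-identification and segmentation.	multimodal
4	Multimodal surface defect detection from wooden logs for sawing optimization	Proposes a multimodal-fusion method for detecting surface knots on wooden logs to optimize sawing.	multimodal
5	HyperFree: A Channel-adaptive and Tuning-free Foundation Model for Hyperspectral Remote Sensing Imagery	Proposes HyperFree, a channel-adaptive, tuning-free foundation model for hyperspectral remote sensing imagery.	foundation model
6	AdaMHF: Adaptive Multimodal Hierarchical Fusion for Survival Prediction	Proposes AdaMHF, adaptive multimodal hierarchical fusion that improves survival prediction accuracy, especially when data are missing.	multimodal
7	iMedImage Technical Report	iMedImage: an end-to-end multimodal foundation model for general medical image recognition that improves chromosome abnormality detection accuracy.	foundation model multimodal chain-of-thought
8	Online Reasoning Video Segmentation with Just-in-Time Digital Twins	Proposes an online reasoning video segmentation framework based on just-in-time digital twins, addressing the limited reasoning ability and fine-tuning dependence of existing methods.	embodied AI large language model multimodal
9	FaceBench: A Multi-View Multi-Level Facial Attribute VQA Dataset for Benchmarking Face Perception MLLMs	FaceBench: a multi-view, multi-level facial attribute VQA dataset for evaluating face-perception multimodal large language models.	large language model multimodal	✅
10	VALLR: Visual ASR Language Model for Lip Reading	VALLR: proposes a visual ASR language model for lip reading that significantly reduces word error rate.	large language model multimodal
11	InternVL-X: Advancing and Accelerating InternVL Series with Efficient Visual Token Compression	InternVL-X: improves the performance and efficiency of the InternVL series through efficient visual token compression.	large language model multimodal
12	Differential Evolution for Grassmann Manifold Optimization: A Projection Approach	Proposes a projection-based differential evolution algorithm for optimization on the Grassmann manifold.	multimodal
13	StarFlow: Generating Structured Workflow Outputs From Sketch Images	StarFlow: uses vision-language models to generate structured workflows from sketches.	foundation model
14	Mobile-VideoGPT: Fast and Accurate Video Understanding Language Model	Proposes Mobile-VideoGPT, an efficient video understanding language model with fewer than 1 billion parameters that achieves real-time throughput.	multimodal	✅
15	Stable-SCore: A Stable Registration-based Framework for 3D Shape Correspondence	Proposes the Stable-SCore framework, which achieves more robust 3D shape correspondence through stable registration.	foundation model
16	Comparative Analysis of Image, Video, and Audio Classifiers for Automated News Video Segmentation	Proposes deep-learning image, video, and audio classifiers for automated news video segmentation.	multimodal
17	FineCIR: Explicit Parsing of Fine-Grained Modification Semantics for Composed Image Retrieval	Proposes the FineCIR framework, which improves composed image retrieval accuracy by explicitly parsing fine-grained semantics.	multimodal	✅
18	Vision-to-Music Generation: A Survey	Survey of vision-to-music generation: a systematic review of technical progress and future directions in video-to-music and image-to-music generation.	multimodal	✅
19	M-DocSum: Do LVLMs Genuinely Comprehend Interleaved Image-Text in Document Summarization?	Proposes M-DocSum-Bench to evaluate how well LVLMs comprehend content in multimodal document summarization.	multimodal	✅
20	Towards Generalizable Forgery Detection and Reasoning	Proposes the FakeReasoning framework, which uses multimodal large language models for generalizable forgery detection and reasoning on AI-generated images.	large language model
21	DSU-Net:An Improved U-Net Model Based on DINOv2 and SAM2 with Multi-scale Cross-model Feature Enhancement	DSU-Net: a U-Net with multi-scale cross-model feature enhancement that fuses DINOv2 and SAM2 to improve image segmentation.	foundation model	✅
22	A Multi-Modal Knowledge-Enhanced Framework for Vessel Trajectory Prediction	Proposes MAKER, a multimodal knowledge-enhanced framework that improves vessel trajectory prediction accuracy.	large language model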

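🔬 Pillar 3: Perception & Semantics (11 papers)

#	Title	One-line summary	Tags	🔗
23	X$^{2}$-Gaussian: 4D Radiative Gaussian Splatting for Continuous-time Tomographic Reconstruction	Proposes X²-Gaussian, which achieves continuous-time tomographic reconstruction through dynamic radiative Gaussian splatting.	gaussian splatting splatting spatiotemporal	✅
24	LandMarkSystem Technical Report	LandMarkSystem: a computing framework for large-scale, high-quality 3D reconstruction and rendering.	3D gaussian splatting 3DGS gaussian splatting	✅
25	Frequency-Aware Gaussian Splatting Decomposition	Proposes frequency-aware Gaussian splatting decomposition for efficient, controllable novel view synthesis.	3D gaussian splatting gaussian splatting splatting
26	Semantic Library Adaptation: LoRA Retrieval and Fusion for Open-Vocabulary Semantic Segmentation	Proposes a semantic library adaptation framework for open-vocabulary semantic segmentation.	open-vocabulary open vocabulary
27	UGNA-VPR: A Novel Training Paradigm for Visual Place Recognition Based on Uncertainty-Guided NeRF Augmentation	UGNA-VPR: a new training paradigm for visual place recognition based on uncertainty-guided NeRF augmentation.	NeRF	✅
28	SC-NeRF: NeRF-based Point Cloud Reconstruction using a Stationary Camera for Agricultural Applications	Proposes SC-NeRF, a stationary-camera method for point cloud reconstruction in high-throughput agricultural plant phenotyping.	NeRF
29	StyledStreets: Multi-style Street Simulator with Spatial and Temporal Consistency	StyledStreets: proposes a spatially and temporally consistent multi-style street simulator for urban environment reconstruction.	gaussian splatting splatting scene reconstruction
30	Can Video Diffusion Model Reconstruct 4D Geometry?	Sora3R: uses a video diffusion model to reconstruct dynamic 4D geometry from monocular video.	optical flow spatiotemporal
31	HS-SLAM: Hybrid Representation with Structural Supervision for Improved Dense SLAM	HS-SLAM: a hybrid representation with structural supervision that improves dense SLAM performance.	NeRF
32	ICG-MVSNet: Learning Intra-view and Cross-view Relationships for Guidance in Multi-View Stereo	ICG-MVSNet: learns intra-view and cross-view relationships to guide multi-view stereo.	depth estimation
33	GenFusion: Closing the Loop between Reconstruction and Generation via Videos	GenFusion: closes the loop between reconstruction and generation via videos, bridging the gap between 3D reconstruction and generation.	scene reconstruction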

🔬 Pillar 2: RL & Architecture (7 papers)

#	Title	One-line summary	Tags	🔗
34	Multimodal Data Integration for Sustainable Indoor Gardening: Tracking Anyplant with Time Series Foundation Model	Uses multimodal data fusion and the time series model Anyplant to monitor plant health for sustainable indoor gardening.	MAE foundation model multimodal
35	VADMamba: Exploring State Space Models for Fast Video Anomaly Detection	Proposes VADMamba, which uses state space models to accelerate video anomaly detection and improve inference speed.	Mamba state space model optical flow	✅
36	Video-R1: Reinforcing Video Reasoning in MLLMs	Video-R1: improves video reasoning in multimodal large language models through rule-based reinforcement learning.	reinforcement learning large language model multimodal	✅
37	What Changed and What Could Have Changed? State-Change Counterfactuals for Procedure-Aware Video Representation Learning	Uses state-change descriptions and counterfactual reasoning to improve procedural video representation learning.	representation learning large language model
38	AssistPDA: An Online Video Surveillance Assistant for Video Anomaly Prediction, Detection, and Analysis	Proposes AssistPDA to address real-time video anomaly detection.	distillation spatiotemporal large language model
39	Q-MambaIR: Accurate Quantized Mamba for Efficient Image Restoration	Proposes Q-MambaIR, an accurate quantized Mamba model for efficient image restoration.	Mamba SSM
40	Delving Deep into Semantic Relation Distillation	Proposes Semantic Relation Knowledge Distillation (SeRKD) to improve model compression and generalization.	distillation

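🔬 Pillar 6: Video Extraction (4 papers)

#	Title	One-line summary	Tags	🔗
41	Gaze-Guided 3D Hand Motion Prediction for Detecting Intent in Egocentric Grasping Tasks	Proposes a gaze-guided 3D hand motion prediction method for intent detection in assistive grasping tasks.	egocentric
42	AgRowStitch: A High-fidelity Image Stitching Pipeline for Ground-based Agricultural Images	AgRowStitch: a high-fidelity image stitching pipeline for ground-based agricultural images that requires no additional data.	feature matching
43	Reconstructing Humans with a Biomechanically Accurate Skeleton	Proposes a single-image 3D human reconstruction method based on a biomechanical skeleton model, improving reconstruction under extreme poses.	human mesh recovery	✅
44	ClimbingCap: Multi-Modal Dataset and Method for Rock Climbing in World Coordinate	Proposes ClimbingCap to address the challenges of rock climbing motion capture.	HMR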

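🔬 Pillar 8: Physics-based Animation (3 papers)

#	Title	One-line summary	Tags	🔗
45	Uni4D: Unifying Visual Foundation Models for 4D Modeling from a Single Video	Uni4D: unifies visual foundation models for 4D modeling from a single video.	motion tracking foundation model
46	CMD-HAR: Cross-Modal Disentanglement for Wearable Human Activity Recognition	Proposes the CMD-HAR model, which uses cross-modal disentanglement to address data mixing and heterogeneity in wearable human activity recognition.	spatiotemporal multimodal
47	DynamiCtrl: Rethinking the Basic Structure and the Role of Text for High-quality Human Image Animation	Proposes the DynamiCtrl framework, which improves the controllability and semantic consistency of diffusion Transformers for high-quality human image animation.	character control	✅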

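🔬 Pillar 1: Robot Control (2 papers)

#	Title	One-line summary	Tags	🔗
48	Semantic Consistent Language Gaussian Splatting for Point-Level Open-vocabulary Querying	Proposes semantically consistent language Gaussian splatting for point-level open-vocabulary querying.	manipulation 3D gaussian splatting gaussian splatting
49	CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models	Proposes CoT-VLA, which improves the manipulation ability of vision-language-action models through visual chain-of-thought reasoning.	manipulation vision-language-action VLA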

🔬 Pillar 4: Generative Motion (2 papers)

#	Title	One-line summary	Tags	🔗
50	ChatAnyone: Stylized Real-time Portrait Video Generation with Hierarchical Motion Diffusion Model	ChatAnyone: stylized real-time portrait video generation based on a hierarchical motion diffusion model.	motion diffusion model motion diffusion
51	StyleMotif: Multi-Modal Motion Stylization using Style-Content Cross Fusion	StyleMotif: proposes a multimodal stylized motion latent diffusion model for generating stylized motion.	motion synthesis motion generation motion latent
