cs.CV (2025-04-07)

📊 38 papers in total | 🔗 10 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (15 🔗2) Pillar 2: RL & Architecture (13 🔗3) Pillar 3: Perception & Semantics (4 🔗1) Pillar 1: Robot Control (3 🔗3) Pillar 8: Physics-based Animation (2 🔗1) Pillar 4: Generative Motion (1)

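🔬 Pillar 9: Embodied Foundation Models (15 papers)

#	Title	One-line summary	Tags	🔗	⭐
1	Towards Visual Text Grounding of Multimodal Large Language Model	Proposes the TRIG benchmark to address visual text grounding for multimodal large language models on text-rich images.	large language model multimodal visual grounding
2	SCAM: A Real-World Typographic Robustness Evaluation for Multimodal Foundation Models	SCAM: a real-world dataset for evaluating the robustness of multimodal foundation models to typographic attacks.	large language model foundation model multimodal
3	OCC-MLLM-CoT-Alpha: Towards Multi-stage Occlusion Recognition Based on Large Language Models via 3D-Aware Supervision and Chain-of-Thoughts Guidance	Proposes OCC-MLLM-CoT-Alpha, which improves MLLM performance on occlusion recognition through 3D awareness and chain-of-thought (CoT) guidance.	large language model chain-of-thought
4	LEO-MINI: An Efficient Multimodal Large Language Model using Conditional Token Reduction and Mixture of Multi-Modal Experts	LEO-MINI: uses conditional token reduction and a mixture of multimodal experts to improve the efficiency and visual reasoning of multimodal large language models.	large language model multimodal
5	Training state-of-the-art pathology foundation models with orders of magnitude less data	Trains competitive pathology foundation models with far less data than SOTA models use.	foundation model
6	The 1st Solution for 4th PVUW MeViS Challenge: Unleashing the Potential of Large Multimodal Models for Referring Video Segmentation	Uses large multimodal models to tackle motion-expression video segmentation, winning first place in the PVUW MeViS challenge.	multimodal
7	SSLFusion: Scale & Space Aligned Latent Fusion Model for Multimodal 3D Object Detection	SSLFusion: proposes a scale- and space-aligned latent fusion model for multimodal 3D object detection.	multimodal
8	AsyReC: A Multimodal Graph-based Framework for Spatio-Temporal Asymmetric Dyadic Relationship Classification	AsyReC: proposes a multimodal graph neural network framework for spatio-temporal asymmetric dyadic relationship classification.	multimodal	✅
9	Lumina-OmniLV: A Unified Multimodal Framework for General Low-Level Vision	Lumina-OmniLV: a unified multimodal framework for general low-level vision.	multimodal
10	REEF: Relevance-Aware and Efficient LLM Adapter for Video Understanding	Proposes REEF, a relevance-aware, efficient LLM adapter for video understanding.	large language model foundation model
11	URECA: Unique Region Caption Anything	Proposes the URECA dataset and model to address uniqueness and consistency in multi-granularity region captioning.	large language model multimodal
12	Seeking and Updating with Live Visual Knowledge	Proposes the LiveVQA dataset for evaluating and updating multimodal large language models' understanding of live visual knowledge.	large language model multimodal
13	Explaining Low Perception Model Competency with High-Competency Counterfactuals	Proposes five methods for generating high-confidence counterfactual images to explain low perception model competency.	large language model multimodal
14	InstructionBench: An Instructional Video Understanding Benchmark	Proposes InstructionBench for evaluating the temporal reasoning of video large language models on instructional video understanding.	large language model	✅
15	Video-Bench: Human-Aligned Video Generation Benchmark	Proposes Video-Bench, a video generation evaluation benchmark better aligned with human perception.	large language model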

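🔬 Pillar 2: RL & Architecture (13 papers)

#	Title	One-line summary	Tags	🔗	⭐
16	PanoDreamer: Consistent Text to 360-Degree Scene Generation	PanoDreamer: proposes a method for consistent text-driven 360-degree panoramic scene generation.	dreamer 3D gaussian splatting gaussian splatting
17	SCRAMBLe : Enhancing Multimodal LLM Compositionality with Synthetic Preference Data	SCRAMBLe: uses synthetic preference data to improve the compositional reasoning of multimodal LLMs.	preference learning large language model multimodal	✅
18	OrderChain: Towards General Instruct-Tuning for Stimulating the Ordinal Understanding Ability of MLLM	OrderChain: improves the ordinal understanding of multimodal large language models through instruction tuning.	MAE large language model multimodal	✅
19	REVEAL: Relation-based Video Representation Learning for Video-Question-Answering	Proposes the REVEAL framework, which uses relation modeling to improve the quality and efficiency of video representations for video question answering.	representation learning spatiotemporal
20	Learning Activity View-invariance Under Extreme Viewpoint Changes via Curriculum Knowledge Distillation	Proposes a curriculum knowledge distillation method for learning view invariance, targeting activity recognition under extreme viewpoint changes.	curriculum learning distillation
21	Leveraging State Space Models in Long Range Genomics	Uses state space models for dependency modeling in long-range genomics.	SSM state space model
22	Dynamic Vision Mamba	Dynamic Vision Mamba (DyVM): improves the efficiency of Mamba vision models through dynamic token pruning and block selection.	Mamba SSM
23	Uni4D: A Unified Self-Supervised Learning Framework for Point Cloud Videos	Uni4D: a unified self-supervised learning framework for point cloud videos that decouples geometric and semantic information.	representation learning masked autoencoder MAE
24	Towards Efficient Real-Time Video Motion Transfer via Generative Time Series Modeling	Proposes a real-time video motion transfer framework based on generative time series models, improving bandwidth efficiency.	MAE optical flow
25	S^4M: Boosting Semi-Supervised Instance Segmentation with SAM	S^4M: uses SAM to improve semi-supervised instance segmentation.	teacher-student distillation
26	DebGCD: Debiased Learning with Distribution Guidance for Generalized Category Discovery	DebGCD: proposes a distribution-guided debiased learning framework for generalized category discovery.	curriculum learning distillation	✅
27	CADCrafter: Generating Computer-Aided Design Models from Unconstrained Images	CADCrafter: proposes a new framework for generating parametric CAD models from unconstrained images.	DPO direct preference optimization
28	Dual Consistent Constraint via Disentangled Consistency and Complementarity for Multi-view Clustering	Proposes a multi-view clustering framework with a dual consistency constraint based on disentangled consistency and complementarity.	representation learning contrastive learning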

🔬 Pillar 3: Perception & Semantics (4 papers)

#	Title	One-line summary	Tags	🔗	⭐
29	DeclutterNeRF: Generative-Free 3D Scene Recovery for Occlusion Removal	DeclutterNeRF: a 3D scene reconstruction method without generative priors, for occlusion removal.	3D gaussian splatting 3DGS gaussian splatting
30	Grounding 3D Object Affordance with Language Instructions, Visual Observations and Interactions	Proposes LMAffordance3D, which grounds 3D object affordances using language instructions, visual observations, and interactions.	affordance
31	Stereo-LiDAR Fusion by Semi-Global Matching With Discrete Disparity-Matching Cost and Semidensification	Proposes a stereo-LiDAR fusion method based on semi-global matching and a discrete disparity-matching cost.	depth estimation
32	DFormerv2: Geometry Self-Attention for RGBD Semantic Segmentation	DFormerv2: a geometry self-attention mechanism for RGBD semantic segmentation.	scene understanding	✅

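🔬 Pillar 1: Robot Control (3 papers)

#	Title	One-line summary	Tags	🔗	⭐
33	MotionPRO: Exploring the Role of Pressure in Human MoCap and Beyond	MotionPRO: explores the role of pressure in human motion capture to improve physical plausibility.	humanoid humanoid robot penetration	✅
34	FantasyTalking: Realistic Talking Portrait Generation via Coherent Motion Synthesis	Proposes FantasyTalking to address animation generation from static portraits.	manipulation motion synthesis	✅
35	Continuous Locomotive Crowd Behavior Generation	Proposes a diffusion-based framework for continuous crowd behavior generation, addressing existing methods' difficulty in simulating realistic crowd dynamics.	locomotion	✅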

🔬 Pillar 8: Physics-based Animation (2 papers)

#	Title	One-line summary	Tags	🔗	⭐
36	Caption Anything in Video: Fine-grained Object-centric Captioning via Spatiotemporal Multimodal Prompting	Proposes CAT-V, a training-free framework for fine-grained, object-centric video captioning.	spatiotemporal multimodal chain-of-thought	✅
37	Inter-event Interval Microscopy for Event Cameras	Proposes IEIM, an inter-event interval microscopy method for fluorescence microscopy with a static event camera.	PULSE

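🔬 Pillar 4: Generative Motion (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
38	From Sparse Signal to Smooth Motion: Real-Time Motion Generation with Rolling Prediction Models	Proposes the Rolling Prediction Model (RPM) for generating smooth full-body motion from sparse, unstable hand-tracking signals in XR.	motion generation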
