cs.CV (2025-03-04)

📊 22 papers in total | 🔗 8 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (6 🔗3)
Pillar 2: RL & Architecture (4 🔗1)
Pillar 3: Perception & Semantics (3 🔗1)
Pillar 4: Generative Motion (3)
Pillar 8: Physics-based Animation (2 🔗1)
Pillar 5: Interaction & Reaction (1 🔗1)
Pillar 7: Motion Retargeting (1 🔗1)
Pillar 1: Robot Control (1)
Pillar 6: Video Extraction (1)

🔬 Pillar 9: Embodied Foundation Models (6 papers)

#	Title	One-line summary	Tags	🔗	⭐
1	A Token-level Text Image Foundation Model for Document Understanding	Proposes TokenOCR, a token-level text image foundation model for document understanding	large language model, foundation model	✅
2	Multimodal Deep Learning for Subtype Classification in Breast Cancer Using Histopathological Images and Gene Expression Data	Proposes a multimodal deep learning framework for breast cancer subtype classification	multimodal
3	SPIDER: A Comprehensive Multi-Organ Supervised Pathology Dataset and Baseline Models	SPIDER: builds a multi-organ pathology image dataset and proposes baseline models to advance AI pathology research	foundation model, multimodal	✅
4	BioD2C: A Dual-level Semantic Consistency Constraint Framework for Biomedical VQA	BioD2C: a dual-level semantic consistency constraint framework that improves biomedical VQA performance	large language model, multimodal
5	CADDI: An in-Class Activity Detection Dataset using IMU data from low-cost sensors	CADDI: proposes a classroom activity detection dataset based on low-cost IMUs to advance activity recognition in educational settings	multimodal
6	StageDesigner: Artistic Stage Generation for Scenography via Theater Scripts	StageDesigner: a comprehensive framework that generates artistic stage scenes from theater scripts	large language model	✅

🔬 Pillar 2: RL & Architecture (4 papers)

#	Title	One-line summary	Tags	🔗	⭐
7	Developing a PET/CT Foundation Model for Cross-Modal Anatomical and Functional Imaging	Proposes the Cross-Fraternal Twin Masked Autoencoder for PET/CT cross-modal anatomical and functional imaging	representation learning, masked autoencoder, foundation model
8	LLaVE: Large Language and Vision Embedding Models with Hardness-Weighted Contrastive Learning	LLaVE: large language and vision embedding models based on hardness-weighted contrastive learning, achieving SOTA performance	representation learning, contrastive learning, multimodal
9	SSNet: Saliency Prior and State Space Model-based Network for Salient Object Detection in RGB-D Images	Proposes SSNet, based on a saliency prior and a state space model, for salient object detection in RGB-D images	SSM, state space model, scene understanding
10	WMNav: Integrating Vision-Language Models into World Models for Object Goal Navigation	WMNav: an object goal navigation framework that integrates vision-language models with world models	world model, embodied AI	✅

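🔬 Pillar 3: Perception & Semantics (3 papers)

#	Title	One-line summary	Tags	🔗	⭐
11	2DGS-Avatar: Animatable High-fidelity Clothed Avatar via 2D Gaussian Splatting	Proposes 2DGS-Avatar, which uses 2D Gaussian splatting for real-time rendering of high-fidelity, animatable clothed avatars	3D gaussian splatting, 3DGS, gaussian splatting
12	Resource-Efficient Affordance Grounding with Complementary Depth and Semantic Prompts	Proposes the BiT-Align framework, which uses complementary depth and semantic prompts to improve affordance reasoning under resource constraints	affordance, multimodal	✅
13	Label-Efficient LiDAR Panoptic Segmentation	Proposes L3PS, which achieves efficient LiDAR panoptic segmentation with a small amount of labeled data	scene understanding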

🔬 Pillar 4: Generative Motion (3 papers)

#	Title	One-line summary	Tags	🔗	⭐
14	SPG: Improving Motion Diffusion by Smooth Perturbation Guidance	SPG: improves the generation quality of motion diffusion models through smooth perturbation guidance	motion diffusion model, motion diffusion
15	ARC-Flow : Articulated, Resolution-Agnostic, Correspondence-Free Matching and Interpolation of 3D Shapes Under Flow Fields	Proposes ARC-Flow, which uses flow fields for correspondence-free matching and interpolation of articulated 3D shapes	physically plausible
16	Efficient Training-Free High-Resolution Synthesis with Energy Rectification in Diffusion Models	Proposes RectifiedHR, an efficient training-free method for high-resolution image synthesis with diffusion models	classifier-free guidance

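🔬 Pillar 8: Physics-based Animation (2 papers)

#	Title	One-line summary	Tags	🔗	⭐
17	MM-OR: A Large Multimodal Operating Room Dataset for Semantic Understanding of High-Intensity Surgical Environments	Proposes the MM-OR multimodal operating room dataset and the MM2SG model to improve semantic understanding of high-intensity surgical environments	spatiotemporal, multimodal	✅
18	TReND: Transformer derived features and Regularized NMF for neonatal functional network Delineation	Proposes the TReND framework, which uses Transformer features and regularized NMF to delineate neonatal functional networks	spatiotemporal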

🔬 Pillar 5: Interaction & Reaction (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
19	Seeing is Understanding: Unlocking Causal Attention into Modality-Mutual Attention for Multimodal LLMs	Proposes MapleLeaf AKI, which decouples causal attention to enable modality-mutual attention in multimodal LLMs	mutual attention, large language model, foundation model	✅

🔬 Pillar 7: Motion Retargeting (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
20	CMMLoc: Advancing Text-to-PointCloud Localization with Cauchy-Mixture-Model Based Framework	CMMLoc: a text-to-point-cloud localization framework based on a Cauchy mixture model	spatial relationship	✅

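🔬 Pillar 1: Robot Control (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
21	Monocular Person Localization under Camera Ego-motion	Proposes a localization method using a four-point human model under monocular camera motion, improving localization accuracy in human-machine interaction	quadruped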

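🔬 Pillar 6: Video Extraction (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
22	mmDEAR: mmWave Point Cloud Density Enhancement for Accurate Human Body Reconstruction	Proposes the mmDEAR framework, which enhances mmWave point cloud density to improve human body reconstruction accuracy	SMPL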
