cs.CV (2025-01-03)

📊 24 papers in total | 🔗 6 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (9 🔗2) | Pillar 3: Perception & Semantics (7 🔗1) | Pillar 2: RL & Architecture (4 🔗3) | Pillar 1: Robot Control (2) | Pillar 4: Generative Motion (1) | Pillar 8: Physics-based Animation (1)

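🔬 Pillar 9: Embodied Foundation Models (9 papers)

#	Title	One-line summary	Tags	🔗	⭐
1	Interpretable Face Anti-Spoofing: Enhancing Generalization with Multimodal Large Language Models	Proposes I-FAS, which uses multimodal large language models to improve the generalization and interpretability of face anti-spoofing.	large language model multimodal
2	Multimodal classification of forest biodiversity potential from 2D orthophotos and 3D airborne laser scanning point clouds	Proposes a deep-learning multimodal fusion method that uses orthophotos and LiDAR data to assess forest biodiversity potential.	multimodal
3	Google is all you need: Semi-Supervised Transfer Learning Strategy For Light Multimodal Multi-Task Classification Model	Proposes a semi-supervised transfer learning strategy for a lightweight multimodal multi-task classification model, improving image labeling accuracy.	multimodal
4	VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction	VITA-1.5: a multimodal large model for real-time vision and speech interaction, targeting GPT-4o-level performance.	large language model multimodal	✅
5	Virgo: A Preliminary Exploration on Reproducing o1-like MLLM	Virgo: fine-tunes an MLLM on textual long-form thought data to explore multimodal slow-thinking reasoning.	large language model multimodal	✅
6	HLV-1K: A Large-scale Hour-Long Video Benchmark for Time-Specific Long Video Understanding	Builds HLV-1K, a large-scale hour-long video benchmark, to advance time-aware long video understanding.	large language model multimodal
7	AVTrustBench: Assessing and Enhancing Reliability and Robustness in Audio-Visual LLMs	AVTrustBench: evaluates and improves the reliability and robustness of audio-visual LLMs.	large language model
8	MoEE: Mixture of Emotion Experts for Audio-Driven Portrait Animation	Proposes the MoEE model and the DH-FaceEmoVid-150 dataset for generating audio-driven portrait animation with complex emotions.	multimodal
9	LogicAD: Explainable Anomaly Detection via VLM-based Text Feature Extraction	LogicAD: explainable anomaly detection based on VLM text feature extraction.	multimodal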

🔬 Pillar 3: Perception & Semantics (7 papers)

#	Title	One-line summary	Tags	🔗	⭐
10	CrossView-GS: Cross-view Gaussian Splatting For Large-scale Scene Reconstruction	Proposes CrossView-GS to address 3DGS optimization difficulties in cross-view reconstruction of large-scale scenes.	3D gaussian splatting 3DGS gaussian splatting
11	PG-SAG: Parallel Gaussian Splatting for Fine-Grained Large-Scale Urban Buildings Reconstruction via Semantic-Aware Grouping	Proposes PG-SAG, which reconstructs large-scale urban buildings with parallel Gaussian splatting via semantic-aware grouping.	3D gaussian splatting 3DGS gaussian splatting	✅
12	DreamMask: Boosting Open-vocabulary Panoptic Segmentation with Synthetic Data	DreamMask: uses synthetic data to improve open-vocabulary panoptic segmentation.	open-vocabulary open vocabulary
13	Cloth-Splatting: 3D Cloth State Estimation from RGB Supervision	Cloth-Splatting: 3D cloth state estimation from RGB supervision.	3D gaussian splatting gaussian splatting splatting
14	SafeAug: Safety-Critical Driving Data Augmentation from Naturalistic Datasets	SafeAug: augments safety-critical autonomous driving data from naturalistic datasets.	depth estimation
15	VideoLifter: Lifting Videos to 3D with Fast Hierarchical Stereo Alignment	VideoLifter: lifts videos to 3D models using fast hierarchical stereo alignment.	scene understanding
16	D$^3$-Human: Dynamic Disentangled Digital Human from Monocular Video	D$^3$-Human: proposes a disentangled dynamic digital human reconstruction method that addresses clothing occlusion in monocular video.	implicit representation

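🔬 Pillar 2: RL & Architecture (4 papers)

#	Title	One-line summary	Tags	🔗	⭐
17	A Separable Self-attention Inspired by the State Space Model for Computer Vision	Proposes a separable self-attention mechanism inspired by state space models for computer vision tasks.	Mamba SSM state space model	✅
18	MoVE-KD: Knowledge Distillation for VLMs with Mixture of Visual Encoders	Proposes MoVE-KD, which uses knowledge distillation to transfer the capabilities of multiple visual encoders into a single efficient VLM.	distillation foundation model	✅
19	3D Cloud reconstruction through geospatially-aware Masked Autoencoders	Proposes a geospatially-aware masked autoencoder for 3D cloud reconstruction.	masked autoencoder MAE
20	Merging Context Clustering with Visual State Space Models for Medical Image Segmentation	Proposes CCViM, which merges context clustering with visual state space models to improve medical image segmentation.	Mamba state space model	✅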

🔬 Pillar 1: Robot Control (2 papers)

#	Title	One-line summary	Tags	🔗	⭐
21	Aesthetic Matters in Music Perception for Image Stylization: A Emotion-driven Music-to-Visual Manipulation	EmoMV: proposes an emotion-driven music-to-visual image stylization method.	manipulation multimodal
22	IAM: Enhancing RGB-D Instance Segmentation with New Benchmarks	Proposes IAM, a new benchmark for RGB-D instance segmentation, along with an effective data fusion method to improve scene understanding.	manipulation scene understanding

🔬 Pillar 4: Generative Motion (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
23	JoyGen: Audio-Driven 3D Depth-Aware Talking-Face Video Editing	JoyGen: proposes a depth-aware, audio-driven 3D talking-face video editing framework.	motion generation

🔬 Pillar 8: Physics-based Animation (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
24	VidFormer: A novel end-to-end framework fused by 3DCNN and Transformer for Video-based Remote Physiological Measurement	Proposes the VidFormer framework for video-based remote physiological measurement.	spatiotemporal
