cs.CV (2025-01-02)

📊 22 papers in total | 🔗 6 with code

🎯 Navigation by Interest Area

Pillar 3: Perception & Semantics (8 🔗1) · Pillar 9: Embodied Foundation Models (7 🔗3) · Pillar 2: RL & Architecture (3) · Pillar 6: Video Extraction (2 🔗1) · Pillar 8: Physics-based Animation (1) · Pillar 4: Generative Motion (1 🔗1)

🔬 Pillar 3: Perception & Semantics (8 papers)

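| # | Title | One-line summary | Tags |
|---|---|---|---|
| 1 | EasySplat: View-Adaptive Learning makes 3D Gaussian Splatting Easy | EasySplat: simplifies 3D Gaussian splatting modeling through view-adaptive learning | 3D gaussian splatting, 3DGS, gaussian splatting |
| 2 | ViGiL3D: A Linguistically Diverse Dataset for 3D Visual Grounding | ViGiL3D: a linguistically diverse dataset for 3D visual grounding, aimed at improving model generalization | open-vocabulary, open vocabulary, embodied AI |
| 3 | Leverage Cross-Attention for End-to-End Open-Vocabulary Panoptic Reconstruction | PanopticRecon++: end-to-end open-vocabulary panoptic reconstruction using cross-attention | scene understanding, open-vocabulary, open vocabulary |
| 4 | Deformable Gaussian Splatting for Efficient and High-Fidelity Reconstruction of Surgical Scenes | Proposes EH-SurGS for efficient, high-fidelity reconstruction of deformable surgical scenes | 3D gaussian splatting, gaussian splatting, splatting |
| 5 | PatchRefiner V2: Fast and Lightweight Real-Domain High-Resolution Metric Depth Estimation | PatchRefiner V2: a fast, lightweight method for real-domain high-resolution metric depth estimation | depth estimation, metric depth |
| 6 | 3D-LLaVA: Towards Generalist 3D LMMs with Omni Superpoint Transformer | 3D-LLaVA: builds a generalist 3D large multimodal model with the Omni Superpoint Transformer | scene understanding, multimodal |
| 7 | GPT4Scene: Understand 3D Scenes from Videos with Vision-Language Models | GPT4Scene: uses vision-language models to understand 3D scenes from videos, advancing embodied intelligence | scene understanding |
| 8 | Sparis: Neural Implicit Surface Reconstruction of Indoor Scenes from Sparse Views | Sparis: neural implicit surface reconstruction of indoor scenes from sparse views | scene reconstruction |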

🔬 Pillar 9: Embodied Foundation Models (7 papers)

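| # | Title | One-line summary | Tags |
|---|---|---|---|
| 9 | Retrieval-Augmented Dynamic Prompt Tuning for Incomplete Multimodal Learning | Proposes the RAGPT framework, which addresses incomplete multimodal learning through retrieval-augmented dynamic prompt tuning | multimodal |
| 10 | Face-Human-Bench: A Comprehensive Benchmark of Face and Human Understanding for Multi-modal Assistants | Proposes Face-Human-Bench for comprehensively evaluating multimodal assistants' face and human understanding | large language model, chain-of-thought |
| 11 | Towards Interactive Deepfake Analysis | Proposes DFA-GPT, an interactive deepfake analysis system that improves deepfake detection and analysis | large language model, instruction following |
| 12 | SAFER: Sharpness Aware layer-selective Finetuning for Enhanced Robustness in vision transformers | SAFER: layer-selective fine-tuning for vision transformers to improve robustness | foundation model |
| 13 | Unifying Specialized Visual Encoders for Video Language Models | MERV: unifies multiple visual encoders to improve the understanding ability of video language models | large language model |
| 14 | Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models | Proposes VA-VAE, aligned with pretrained vision models, to accelerate latent diffusion model training and improve generation quality | foundation model |
| 15 | Asymmetric Reinforcing against Multi-modal Representation Bias | Proposes ARM, an asymmetric reinforcing method, to address multimodal representation bias | multimodal |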

🔬 Pillar 2: RL & Architecture (3 papers)

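| # | Title | One-line summary | Tags |
|---|---|---|---|
| 16 | SeFAR: Semi-supervised Fine-grained Action Recognition with Temporal Perturbation and Learning Stabilization | SeFAR: a semi-supervised fine-grained action recognition framework combining temporal perturbation and learning stabilization | teacher-student, large language model, foundation model |
| 17 | Detail Matters: Mamba-Inspired Joint Unfolding Network for Snapshot Spectral Compressive Imaging | Proposes MiJUN, a Mamba-inspired joint unfolding network for snapshot spectral compressive imaging, improving detail reconstruction | Mamba, state space model, HSI |
| 18 | Event Masked Autoencoder: Point-wise Action Recognition with Event-Based Cameras | Proposes an event masked autoencoder for point-cloud action recognition with event-based cameras | masked autoencoder, MAE, spatiotemporal |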

🔬 Pillar 6: Video Extraction (2 papers)

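| # | Title | One-line summary | Tags |
|---|---|---|---|
| 19 | R-SCoRe: Revisiting Scene Coordinate Regression for Robust Large-Scale Visual Localization | R-SCoRe: robust large-scale visual localization through improved scene coordinate regression | feature matching |
| 20 | Source-free Semantic Regularization Learning for Semi-supervised Domain Adaptation | Proposes the SERL framework, which improves semi-supervised domain adaptation through semantic regularization learning | HMR |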

🔬 Pillar 8: Physics-based Animation (1 paper)

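| # | Title | One-line summary | Tags |
|---|---|---|---|
| 21 | Transferability of Adversarial Attacks in Video-based MLLMs: A Cross-modal Image-to-Video Approach | Proposes the I2V-MLLM attack, which improves the black-box transferability of adversarial examples against video-based multimodal large language models | spatiotemporal, large language model, multimodal |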

🔬 Pillar 4: Generative Motion (1 paper)

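| # | Title | One-line summary | Tags |
|---|---|---|---|
| 22 | Learning 3D Garment Animation from Trajectories of A Piece of Cloth | Proposes EUNet, which learns 3D garment animation from the trajectories of a single piece of cloth, improving generalization and physical plausibility | physically plausible, differentiable simulation |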
