cs.CV(2025-01-02)
📊 22 papers in total | 🔗 6 with code
🎯 Navigation by interest area
Pillar 3: Spatial Perception and Semantics (Perception & Semantics) (8 🔗1)
Pillar 9: Embodied Foundation Models (7 🔗3)
Pillar 2: RL Algorithms and Architecture (RL & Architecture) (3)
Pillar 6: Video Extraction and Matching (Video Extraction) (2 🔗1)
Pillar 8: Physics-based Animation (1)
Pillar 4: Generative Motion (1 🔗1)
🔬 Pillar 3: Spatial Perception and Semantics (Perception & Semantics) (8 papers)
🔬 Pillar 9: Embodied Foundation Models (7 papers)
| # | Title | Key point | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
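| 9 | Retrieval-Augmented Dynamic Prompt Tuning for Incomplete Multimodal Learning | Proposes the RAGPT framework, which addresses incomplete multimodal learning through retrieval-augmented dynamic prompt tuning. | multimodal | ✅ | |
| 10 | Face-Human-Bench: A Comprehensive Benchmark of Face and Human Understanding for Multi-modal Assistants | Proposes Face-Human-Bench for comprehensively evaluating the face and human understanding abilities of multi-modal assistants. | large language model, chain-of-thought | | |
| 11 | Towards Interactive Deepfake Analysis | Proposes DFA-GPT, an interactive deepfake analysis system that improves deepfake detection and analysis. | large language model, instruction following | ✅ | |
| 12 | SAFER: Sharpness Aware layer-selective Finetuning for Enhanced Robustness in vision transformers | SAFER: layer-selective fine-tuning for vision transformers that improves robustness. | foundation model | | |
| 13 | Unifying Specialized Visual Encoders for Video Language Models | MERV: unifies multiple visual encoders to improve the understanding ability of video language models. | large language model | | |
| 14 | Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models | Proposes VA-VAE, which is aligned with pretrained vision models, to speed up latent diffusion model training and improve generation quality. | foundation model | ✅ | |
| 15 | Asymmetric Reinforcing against Multi-modal Representation Bias | Proposes ARM, an asymmetric reinforcing method that addresses multi-modal representation bias. | multimodal | | |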
🔬 Pillar 2: RL Algorithms and Architecture (RL & Architecture) (3 papers)
| # | Title | Key point | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 16 | SeFAR: Semi-supervised Fine-grained Action Recognition with Temporal Perturbation and Learning Stabilization | SeFAR: a semi-supervised fine-grained action recognition framework that combines temporal perturbation with learning stabilization. | teacher-student, large language model, foundation model | | |
| 17 | Detail Matters: Mamba-Inspired Joint Unfolding Network for Snapshot Spectral Compressive Imaging | Proposes MiJUN, a Mamba-inspired joint unfolding network for snapshot spectral compressive imaging that improves detail reconstruction. | Mamba, state space model, HSI | | |
| 18 | Event Masked Autoencoder: Point-wise Action Recognition with Event-Based Cameras | Proposes an event masked autoencoder for point-cloud-based action recognition with event cameras. | masked autoencoder, MAE, spatiotemporal | | |
🔬 Pillar 6: Video Extraction and Matching (Video Extraction) (2 papers)
| # | Title | Key point | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
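| 19 | R-SCoRe: Revisiting Scene Coordinate Regression for Robust Large-Scale Visual Localization | R-SCoRe: achieves robust large-scale visual localization by improving scene coordinate regression. | feature matching | ✅ | |
| 20 | Source-free Semantic Regularization Learning for Semi-supervised Domain Adaptation | Proposes the SERL framework, which improves semi-supervised domain adaptation performance through semantic regularization learning. | HMR | | |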
🔬 Pillar 8: Physics-based Animation (1 paper)
| # | Title | Key point | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
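| 21 | Transferability of Adversarial Attacks in Video-based MLLMs: A Cross-modal Image-to-Video Approach | Proposes the I2V-MLLM attack, which improves the black-box transferability of adversarial examples against video-based multimodal large language models. | spatiotemporal, large language model, multimodal | | |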
🔬 Pillar 4: Generative Motion (1 paper)
| # | Title | Key point | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
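| 22 | Learning 3D Garment Animation from Trajectories of A Piece of Cloth | Proposes EUNet, which learns 3D garment animation from the trajectories of a single piece of cloth, improving generalization and physical plausibility. | physically plausible, differentiable simulation | ✅ | |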