cs.CV (2024-11-20)
📊 30 papers total | 🔗 13 with code
🎯 Interest-area navigation
Pillar 2: RL Algorithms & Architecture (13 🔗5)
Pillar 3: Spatial Perception & Semantics (7 🔗2)
Pillar 9: Embodied Foundation Models (7 🔗4)
Pillar 6: Video Extraction & Matching (2 🔗1)
Pillar 4: Generative Motion (1 🔗1)
🔬 Pillar 2: RL Algorithms & Architecture (13 papers)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 1 | Unsupervised Foundation Model-Agnostic Slide-Level Representation Learning | Proposes COBRA for representation learning on pathology whole-slide images. | Mamba, representation learning, foundation model | ✅ | |
| 2 | FAST-Splat: Fast, Ambiguity-Free Semantics Transfer in Gaussian Splatting | FAST-Splat: a fast, ambiguity-free method for semantics transfer in Gaussian Splatting. | distillation, gaussian splatting, splatting | | |
| 3 | XMask3D: Cross-modal Mask Reasoning for Open Vocabulary 3D Semantic Segmentation | Proposes XMask3D, which achieves open-vocabulary 3D semantic segmentation through cross-modal mask reasoning. | distillation, open-vocabulary, open vocabulary | ✅ | |
| 4 | Efficient Masked AutoEncoder for Video Object Counting and A Large-Scale Benchmark | Proposes a density-embedded efficient masked autoencoder counting framework (E-MAC) to address the dynamic foreground-background imbalance in video object counting. | representation learning, masked autoencoder, optical flow | | |
| 5 | Extending Video Masked Autoencoders to 128 frames | Proposes a long-video masked autoencoder (LVMAE) that effectively handles 128-frame videos and improves video understanding performance. | masked autoencoder, MAE, foundation model | | |
| 6 | MambaDETR: Query-based Temporal Modeling using State Space Model for Multi-View 3D Object Detection | MambaDETR: query-based temporal modeling with a state space model for multi-view 3D object detection. | Mamba, state space model | | |
| 7 | Decompose and Leverage Preferences from Expert Models for Improving Trustworthiness of MLLMs | Proposes the DecompGen framework, which uses expert models to decompose and evaluate MLLM responses, improving their trustworthiness. | preference learning, large language model, multimodal | | |
| 8 | Intensity-Spatial Dual Masked Autoencoder for Multi-Scale Feature Learning in Chest CT Segmentation | Proposes an intensity-spatial dual masked autoencoder (ISD-MAE) for multi-scale feature learning and segmentation in chest CT. | masked autoencoder, MAE, contrastive learning | ✅ | |
| 9 | Find Any Part in 3D | Uses a data engine driven by 2D foundation models to enable open-world segmentation of any part of any 3D object. | world model, foundation model | ✅ | |
| 10 | Identity Preserving 3D Head Stylization with Multiview Score Distillation | Proposes a 3D head stylization method based on multiview score distillation that improves identity preservation. | distillation | ✅ | |
| 11 | Cross-Camera Distracted Driver Classification through Feature Disentanglement and Contrastive Learning | Proposes DBMNet, which performs cross-camera distracted driver classification through feature disentanglement and contrastive learning. | contrastive learning | | |
| 12 | Collaborative Feature-Logits Contrastive Learning for Open-Set Semi-Supervised Object Detection | Proposes CFL-Detector to address OOD misclassification in open-set semi-supervised object detection. | contrastive learning | | |
| 13 | RobustFormer: Noise-Robust Pre-training for images and videos | RobustFormer: a noise-robust pre-training method for images and videos that uses the DWT to improve Transformer performance under noise. | masked autoencoder, MAE | | |
🔬 Pillar 3: Spatial Perception & Semantics (7 papers)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 14 | GazeGaussian: High-Fidelity Gaze Redirection with 3D Gaussian Splatting | GazeGaussian: high-fidelity gaze redirection based on 3D Gaussian Splatting. | 3D gaussian splatting, 3DGS, gaussian splatting | ✅ | |
| 15 | Generating 3D-Consistent Videos from Unposed Internet Photos | Proposes a self-supervised method that generates 3D-consistent videos from unposed internet photos. | 3D gaussian splatting, gaussian splatting, splatting | | |
| 16 | Sparse Input View Synthesis: 3D Representations and Reliable Priors | Proposes solutions based on 3D representations and reliable priors for novel view synthesis from sparse input views. | NeRF, neural radiance field, optical flow | | |
| 17 | Robust SG-NeRF: Robust Scene Graph Aided Neural Surface Reconstruction | Proposes Robust SG-NeRF, which uses a scene graph to aid neural surface reconstruction and address camera pose noise. | NeRF | | |
| 18 | DATAP-SfM: Dynamic-Aware Tracking Any Point for Robust Structure from Motion in the Wild | DATAP-SfM: dynamic-aware tracking of any point for robust structure from motion in the wild. | depth estimation, optical flow | | |
| 19 | Geometric Algebra Planes: Convex Implicit Neural Volumes | Proposes GA-Planes, an implicit neural field representation trainable by convex optimization, for volume modeling. | implicit representation | | |
| 20 | Practical Compact Deep Compressed Sensing | Proposes PCNet, a practical and compact deep compressed sensing network that improves image reconstruction quality. | implicit representation | ✅ | |
🔬 Pillar 9: Embodied Foundation Models (7 papers)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
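| 21 | VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation | VideoAutoArena: an arena-style benchmark that automatically evaluates large multimodal models in video analysis through user simulation. | multimodal | | |
| 22 | Adapting Vision Foundation Models for Robust Cloud Segmentation in Remote Sensing Images | Proposes Cloud-Adapter, which uses vision foundation models for robust cloud segmentation in remote sensing images. | foundation model | ✅ | |
| 23 | Hints of Prompt: Enhancing Visual Representation for Multimodal LLMs in Autonomous Driving | Proposes the Hints of Prompt (HoP) framework to enhance the visual representation of multimodal LLMs in autonomous driving scenarios. | multimodal | | |
| 24 | MEGL: Multimodal Explanation-Guided Learning | Proposes MEGL, a multimodal explanation-guided learning framework that improves the interpretability and performance of image classification models. | multimodal | | |
| 25 | Unsupervised Homography Estimation on Multimodal Image Pair via Alternating Optimization | Proposes AltO, which solves unsupervised homography estimation on multimodal image pairs via alternating optimization. | multimodal | ✅ | |
| 26 | On the Consistency of Video Large Language Models in Temporal Comprehension | Addresses the consistency of video large language models in temporal comprehension by proposing event temporal verification tuning. | large language model | ✅ | |
| 27 | FabuLight-ASD: Unveiling Speech Activity via Body Language | FabuLight-ASD: uses body language to enhance speech activity detection in multimodal settings. | multimodal | ✅ | |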
🔬 Pillar 6: Video Extraction & Matching (2 papers)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 28 | DIS-Mine: Instance Segmentation for Disaster-Awareness in Poor-Light Condition in Underground Mines | DIS-Mine: an instance segmentation method for disaster awareness in poor-light underground mines. | feature matching | | |
| 29 | X as Supervision: Contending with Depth Ambiguity in Unsupervised Monocular 3D Pose Estimation | Proposes an unsupervised monocular 3D pose estimation method based on multi-hypothesis detection and 3D priors. | SMPL | ✅ | |
🔬 Pillar 4: Generative Motion (1 paper)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 30 | REDUCIO! Generating 1K Video within 16 Seconds using Extremely Compressed Motion Latents | REDUCIO: generates 1K video within 16 seconds using an extremely compressed motion latent space. | motion latent | ✅ | |