cs.CV(2025-02-11)
📊 22 papers in total | 🔗 7 with code
🎯 Interest area navigation
Pillar 2: RL Algorithms and Architecture (6 🔗2)
Pillar 6: Video Extraction and Matching (5 🔗3)
Pillar 1: Robot Control (4)
Pillar 9: Embodied Foundation Models (4 🔗1)
Pillar 3: Spatial Perception and Semantics (2 🔗1)
Pillar 8: Physics-based Animation (1)
🔬 Pillar 2: RL Algorithms and Architecture (6 papers)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 1 | Flow Distillation Sampling: Regularizing 3D Gaussians with Pre-trained Matching Priors | Proposes Flow Distillation Sampling, which regularizes 3D Gaussians with pre-trained matching priors to improve geometric reconstruction quality. | distillation 3D gaussian splatting 3DGS | ✅ | |
| 2 | A Survey on Mamba Architecture for Vision Applications | Surveys applications of the Mamba architecture in vision tasks and explores its potential for image and video understanding. | Mamba SSM spatiotemporal | | |
| 3 | HOMIE: Histopathology Omni-modal Embedding for Pathology Composed Retrieval | HOMIE: a histopathology omni-modal embedding method for pathology composed retrieval. | predictive model large language model multimodal | | |
| 4 | A Survey of Representation Learning, Optimization Strategies, and Applications for Omnidirectional Vision | A survey of deep learning for omnidirectional vision, focusing on representation learning, optimization strategies, and applications. | representation learning optical flow | | |
| 5 | PlaySlot: Learning Inverse Latent Dynamics for Controllable Object-Centric Video Prediction and Planning | PlaySlot: learns inverse latent dynamics for controllable, object-centric video prediction and planning. | world model latent dynamics | ✅ | |
| 6 | Articulate That Object Part (ATOP): 3D Part Articulation via Text and Motion Personalization | ATOP: proposes a 3D part articulation method based on text and motion personalization. | distillation motion generation | | |
🔬 Pillar 6: Video Extraction and Matching (5 papers)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 7 | Towards Zero-Shot Anomaly Detection and Reasoning with Multimodal Large Language Models | Proposes Anomaly-OV for zero-shot anomaly detection and reasoning, significantly improving fine-grained anomaly recognition. | feature matching large language model multimodal | ✅ | |
| 8 | EgoTextVQA: Towards Egocentric Scene-Text Aware Video Question Answering | Proposes the EgoTextVQA benchmark for evaluating egocentric, scene-text-aware video question answering. | egocentric large language model multimodal | ✅ | |
| 9 | PRVQL: Progressive Knowledge-guided Refinement for Robust Egocentric Visual Query Localization | Proposes PRVQL, which uses progressive knowledge-guided refinement to improve visual query localization in egocentric video. | egocentric Ego4D | ✅ | |
| 10 | EventEgo3D++: 3D Human Motion Capture from a Head-Mounted Event Camera | EventEgo3D++: 3D human motion capture using a head-mounted event camera. | SMPL egocentric | | |
| 11 | Few-Shot Multi-Human Neural Rendering Using Geometry Constraints | Proposes a geometry-constrained, few-shot multi-human neural rendering method that addresses occlusion and clutter. | SMPL | | |
🔬 Pillar 1: Robot Control (4 papers)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 12 | TranSplat: Surface Embedding-guided 3D Gaussian Splatting for Transparent Object Manipulation | TranSplat: surface embedding-guided 3D Gaussian splatting for transparent object manipulation. | manipulation 3D gaussian splatting gaussian splatting | | |
| 13 | DeepSeek on a Trip: Inducing Targeted Visual Hallucinations via Representation Vulnerabilities | Induces targeted visual hallucinations in DeepSeek models by exploiting representation vulnerabilities. | manipulation large language model multimodal | | |
| 14 | Semi-Supervised Vision-Centric 3D Occupancy World Model for Autonomous Driving | Proposes PreWorld, a semi-supervised, vision-centric 3D occupancy world model for autonomous driving. | motion planning world model | | |
| 15 | Diffusion Suction Grasping with Large-Scale Parcel Dataset | Proposes Diffusion-Suction to address suction grasp planning for complex parcel picking. | manipulation affordance grasp prediction | | |
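🔬 Pillar 9: Embodied Foundation Models (4 papers)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 16 | Towards a Robust Framework for Multimodal Hate Detection: A Study on Video vs. Image-based Content | Proposes a robust framework for multimodal hate detection, focusing on the differences between video and image content. | multimodal | ✅ | |
| 17 | NanoVLMs: How small can we go and still make coherent Vision Language Models? | Proposes NanoVLMs, exploring the smallest model size at which vision-language models remain coherent. | large language model multimodal | | |
| 18 | Scaling Pre-training to One Hundred Billion Data for Vision Language Models | Large-scale vision-language pre-training: explores how hundred-billion-scale data affects model performance and cultural diversity. | multimodal | | |
| 19 | Confidence-calibrated covariate shift correction for few-shot classification in Vision-Language Models | Proposes CalShift, which calibrates confidence and corrects covariate shift to improve the generalization of vision-language models in few-shot classification. | foundation model | | |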
🔬 Pillar 3: Spatial Perception and Semantics (2 papers)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 20 | TRAVEL: Training-Free Retrieval and Alignment for Vision-and-Language Navigation | Proposes TRAVEL, a training-free retrieval and alignment method for vision-and-language navigation. | semantic map VLMAP VLN | | |
| 21 | VidCRAFT3: Camera, Object, and Lighting Control for Image-to-Video Generation | VidCRAFT3: image-to-video generation with control over camera, object, and lighting. | optical flow | ✅ | |
🔬 Pillar 8: Physics-based Animation (1 paper)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 22 | Enhancing Video Understanding: Deep Neural Networks for Spatiotemporal Analysis | Explores spatiotemporal features and deep networks, surveying video understanding algorithms and datasets. | spatiotemporal | | |