cs.CV (2024-05-28)

📊 38 papers in total | 🔗 17 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (14 🔗8) · Pillar 2: RL & Architecture (11 🔗5) · Pillar 3: Perception & Semantics (9 🔗2) · Pillar 1: Robot Control (1 🔗1) · Pillar 5: Interaction & Reaction (1 🔗1) · Pillar 4: Generative Motion (1) · Pillar 6: Video Extraction (1)

🔬 Pillar 9: Embodied Foundation Models (14 papers)

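# | Title | Key point | Tags | 🔗
1 | SkinCaRe: A Multimodal Dermatology Dataset Annotated with Medical Caption and Chain-of-Thought Reasoning | SkinCaRe: a multimodal dermatology dataset with medical captions and chain-of-thought reasoning | large language model, multimodal, chain-of-thought
2 | Visual Anchors Are Strong Information Aggregators For Multimodal Large Language Model | Proposes AcFormer, a low-cost, efficient connector for multimodal large language models based on visual anchors | large language model, multimodal
3 | EffoVPR: Effective Foundation Model Utilization for Visual Place Recognition | EffoVPR: effective use of foundation models for visual place recognition, reaching SOTA zero-shot and single-stage performance | foundation model
4 | Reliable Object Tracking by Multimodal Hybrid Feature Extraction and Transformer-Based Fusion | Proposes MMHT, a model built on multimodal hybrid feature extraction and Transformer-based fusion, to improve object tracking reliability in complex scenes | multimodal
5 | White-box Multimodal Jailbreaks Against Large Vision-Language Models | Proposes a white-box multimodal jailbreak attack to strengthen adversarial robustness evaluation of vision-language models | multimodal
6 | XTrack: Multimodal Training Boosts RGB-X Video Object Trackers | XTrack: multimodal training improves the performance of RGB-X video object trackers | multimodal
7 | MMPareto: Boosting Multimodal Learning with Innocent Unimodal Assistance | MMPareto: improves multimodal learning through harmless unimodal assistance | multimodal
8 | Mitigating Object Hallucination in MLLMs via Data-augmented Phrase-level Alignment | Proposes data-augmented phrase-level alignment (DPA) to mitigate object hallucination in multimodal large language models | large language model, multimodal
9 | Multi-modal Generation via Cross-Modal In-Context Learning | Proposes MGCC, which uses cross-modal in-context learning to generate new images from multimodal prompt sequences | large language model, multimodal
10 | Intent3D: 3D Object Detection in RGB-D Scans Based on Human Intention | Proposes the Intent3D dataset and the IntentNet model for 3D object detection in RGB-D scenes based on human intention | visual grounding
11 | Text-only Synthesis for Image Captioning | Proposes ToCa, a text-only synthesis method for image captioning that markedly improves zero-shot generalization | large language model
12 | VeLoRA: Memory Efficient Training using Rank-1 Sub-Token Projections | VeLoRA: memory-efficient LLM training using rank-1 sub-token projections | large language model
13 | Seeing the Image: Prioritizing Visual Correlation by Contrastive Alignment | Proposes Contrastive Alignment (CAL), which weights text tokens by their visual correlation to improve vision-language models | multimodal
14 | MMDisCo: Multi-Modal Discriminator-Guided Cooperative Diffusion for Joint Audio and Video Generation | MMDisCo: multimodal discriminator-guided cooperative diffusion for joint audio and video generation | multimodal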

🔬 Pillar 2: RL & Architecture (11 papers)

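# | Title | Key point | Tags | 🔗
15 | OV-DQUO: Open-Vocabulary DETR with Denoising Text Query Training and Open-World Unknown Objects Supervision | Proposes OV-DQUO, which improves open-vocabulary object detection through denoising text query training and open-world unknown-object supervision | contrastive learning, open-vocabulary, open vocabulary
16 | SCE-MAE: Selective Correspondence Enhancement with Masked Autoencoder for Self-Supervised Landmark Estimation | SCE-MAE: selective correspondence enhancement with a masked autoencoder for self-supervised landmark estimation | masked autoencoder, MAE
17 | EG4D: Explicit Generation of 4D Object without Score Distillation | Proposes the EG4D framework, which explicitly generates high-quality 4D dynamic objects without score distillation | distillation, gaussian splatting, splatting
18 | Aligning in a Compact Space: Contrastive Knowledge Distillation between Heterogeneous Architectures | Proposes the LFCC framework, which uses contrastive learning for knowledge distillation between heterogeneous networks | contrastive learning, teacher-student, distillation
19 | Benchmarking Skeleton-based Motion Encoder Models for Clinical Applications: Estimating Parkinson's Disease Severity in Walking Sequences | Evaluates skeleton-based motion encoder models in a clinical application: estimating Parkinson's disease severity from gait | predictive model, human motion
20 | Cardiovascular Disease Detection from Multi-View Chest X-rays with BI-Mamba | Proposes BI-Mamba for cardiovascular disease risk prediction from multi-view chest X-rays, reducing radiation exposure | Mamba, SSM
21 | DiG: Scalable and Efficient Diffusion Models with Gated Linear Attention | Proposes DiG, a diffusion model with gated linear attention, to address the efficiency problem of diffusion models | Mamba, linear attention
22 | ViG: Linear-complexity Visual Sequence Learning with Gated Linear Attention | ViG: an efficient visual sequence learning network based on gated linear attention, with linear complexity | linear attention, representation learning
23 | Relational Self-supervised Distillation with Compact Descriptors for Image Copy Detection | Proposes relational self-supervised distillation with compact descriptors for efficient image copy detection | contrastive learning, distillation
24 | Visualizing the loss landscape of Self-supervised Vision Transformer | Analyzes the generalization of self-supervised ViTs by visualizing the loss landscape | masked autoencoder, MAE, distillation
25 | SelMatch: Effectively Scaling Up Dataset Distillation via Selection-Based Initialization and Partial Updates by Trajectory Matching | SelMatch: scales up dataset distillation effectively via selection-based initialization and partial updates through trajectory matching | distillation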

🔬 Pillar 3: Perception & Semantics (9 papers)

# | Title | Key point | Tags | 🔗
26 | HFGS: 4D Gaussian Splatting with Emphasis on Spatial and Temporal High-Frequency Components for Endoscopic Scene Reconstruction | Proposes HFGS to address shortcomings in dynamic-scene handling for endoscopic scene reconstruction | 3D gaussian splatting, gaussian splatting, splatting
27 | FreeSplat: Generalizable 3D Gaussian Splatting Towards Free-View Synthesis of Indoor Scenes | FreeSplat: generalizable 3D Gaussian splatting for free-view synthesis of indoor scenes | 3D gaussian splatting, gaussian splatting, splatting
28 | Deform3DGS: Flexible Deformation for Fast Surgical Scene Reconstruction with Gaussian Splatting | Deform3DGS: fast surgical scene reconstruction with flexible deformation based on Gaussian splatting | 3D gaussian splatting, gaussian splatting, splatting
29 | A Refined 3D Gaussian Representation for High-Quality Dynamic Scene Reconstruction | Proposes a refined 3D Gaussian representation for high-quality dynamic scene reconstruction | 3D gaussian splatting, gaussian splatting, splatting
30 | SafeguardGS: 3D Gaussian Primitive Pruning While Avoiding Catastrophic Scene Destruction | SafeguardGS: prunes Gaussian primitives while avoiding catastrophic scene destruction, optimizing 3DGS | 3D gaussian splatting, 3DGS, gaussian splatting
31 | NeRAF: 3D Scene Infused Neural Radiance and Acoustic Fields | NeRAF: 3D scene reconstruction and acoustic rendering that combines neural radiance fields with acoustic fields | NeRF, implicit representation, PULSE
32 | RT-GS2: Real-Time Generalizable Semantic Segmentation for 3D Gaussian Representations of Radiance Fields | RT-GS2: the first real-time generalizable semantic segmentation method for 3D Gaussian radiance fields | gaussian splatting, splatting
33 | GFlow: Recovering 4D World from Monocular Video | GFlow: recovers a dynamic 4D world from monocular video without camera parameters | optical flow
34 | Flow-Assisted Motion Learning Network for Weakly-Supervised Group Activity Recognition | Proposes Flaming-Net, an optical-flow-assisted motion learning network for weakly supervised group activity recognition | optical flow

🔬 Pillar 1: Robot Control (1 paper)

# | Title | Key point | Tags | 🔗
35 | 3DitScene: Editing Any Scene via Language-guided Disentangled Gaussian Splatting | 3DitScene: proposes language-guided disentangled Gaussian splatting for editing any scene | manipulation, gaussian splatting, splatting

🔬 Pillar 5: Interaction & Reaction (1 paper)

# | Title | Key point | Tags | 🔗
36 | Do Egocentric Video-Language Models Truly Understand Hand-Object Interactions? | Proposes EgoNCE++ to improve egocentric video-language models' understanding of hand-object interactions | HOI, egocentric, large language model

🔬 Pillar 4: Generative Motion (1 paper)

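# | Title | Key point | Tags | 🔗
37 | Towards Open Domain Text-Driven Synthesis of Multi-Person Motions | Proposes an open-domain, text-driven multi-person motion synthesis method based on a Transformer diffusion model | text-to-motion, motion generation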

🔬 Pillar 6: Video Extraction (1 paper)

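# | Title | Key point | Tags | 🔗
38 | VividPose: Advancing Stable Video Diffusion for Realistic Human Image Animation | VividPose: proposes an end-to-end framework based on Stable Video Diffusion (SVD) for realistic, temporally stable human image animation | SMPL, SMPL-X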
