cs.CV (2024-11-14)

📊 23 papers in total | 🔗 6 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (7 🔗3) · Pillar 3: Perception & Semantics (6 🔗2) · Pillar 2: RL & Architecture (5) · Pillar 1: Robot Control (4) · Pillar 4: Generative Motion (1 🔗1)

🔬 Pillar 9: Embodied Foundation Models (7 papers)

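| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 1 | LHRS-Bot-Nova: Improved Multimodal Large Language Model for Remote Sensing Vision-Language Interpretation | Proposes LHRS-Bot-Nova, improving multimodal large language model performance on remote sensing image understanding. | large language model, multimodal, instruction following | |
| 2 | Image Regeneration: Evaluating Text-to-Image Model via Generating Identical Image with Multimodal Large Language Models | Proposes an image-regeneration evaluation framework based on multimodal large language models for assessing text-to-image generation models. | large language model, multimodal | |
| 3 | Spider: Any-to-Many Multimodal LLM | Proposes the Spider framework for any-to-many multimodal generation, overcoming the modality-combination limits of multimodal large language models. | large language model, multimodal | |
| 4 | Jailbreak Attacks and Defenses against Multimodal Generative Models: A Survey | Surveys jailbreak attacks and defenses for multimodal generative models, with the aim of keeping their use safe and reliable. | foundation model, multimodal | |
| 5 | Detecting Children with Autism Spectrum Disorder based on Script-Centric Behavior Understanding with Emotional Enhancement | Proposes a zero-shot autism spectrum disorder detection framework based on script-centric behavior understanding with emotional enhancement. | large language model, multimodal | |
| 6 | Advancing Fine-Grained Visual Understanding with Multi-Scale Alignment in Multi-Modal Models | Proposes a multi-scale alignment method that improves multimodal models on fine-grained visual understanding tasks. | large language model | |
| 7 | LES-Talker: Fine-Grained Emotion Editing for Talking Head Generation in Linear Emotion Space | Proposes LES-Talker for high-precision, controllable talking head generation based on a linear emotion space. | multimodal | |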

🔬 Pillar 3: Perception & Semantics (6 papers)

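| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 8 | Harnessing Vision Foundation Models for High-Performance, Training-Free Open Vocabulary Segmentation | Proposes the Trident framework for high-performance, training-free open-vocabulary segmentation. | open-vocabulary, open vocabulary, foundation model | |
| 9 | Self-Supervised Monocular 4D Scene Reconstruction for Egocentric Videos | Proposes EgoMono4D for self-supervised monocular 4D scene reconstruction from egocentric videos. | scene reconstruction, egocentric | |
| 10 | DyGASR: Dynamic Generalized Exponential Splatting with Surface Alignment for Accelerated 3D Mesh Reconstruction | DyGASR: accelerated 3D mesh reconstruction via dynamic generalized exponential splatting with surface alignment. | 3D gaussian splatting, 3DGS, gaussian splatting | |
| 11 | Architect: Generating Vivid and Interactive 3D Scenes with Hierarchical 2D Inpainting | Architect: generates vivid, interactive 3D scenes using hierarchical 2D inpainting. | depth estimation, embodied AI, large language model | |
| 12 | CropCraft: Complete Structural Characterization of Crop Plants From Images | CropCraft: proposes a method for reconstructing the complete 3D structure of crop plants based on inverse procedural modeling. | neural radiance field | |
| 13 | MFTIQ: Multi-Flow Tracker with Independent Matching Quality Estimation | MFTIQ: a multi-flow tracker with independent matching quality estimation that improves long-term tracking. | optical flow | |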

🔬 Pillar 2: RL & Architecture (5 papers)

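| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 14 | Advancing Stroke Risk Prediction Using a Multi-modal Foundation Model | Proposes a multimodal self-supervised framework that fuses brain imaging and clinical data to improve stroke risk prediction accuracy. | predictive model, contrastive learning, foundation model | |
| 15 | Towards Neural Foundation Models for Vision: Aligning EEG, MEG, and fMRI Representations for Decoding, Encoding, and Modality Conversion | Proposes a neural foundation model that aligns EEG, MEG, and fMRI representations to enable multimodal conversion of visual information. | contrastive learning, foundation model, multimodal | |
| 16 | VPBSD: Vessel-Pattern-Based Semi-Supervised Distillation for Efficient 3D Microscopic Cerebrovascular Segmentation | Proposes a vessel-pattern-based semi-supervised distillation method (VpbSD) for efficient 3D microscopic cerebrovascular segmentation. | distillation | |
| 17 | Long-Tailed Object Detection Pre-training: Dynamic Rebalancing Contrastive Learning with Dual Reconstruction | Proposes 2DRCL for long-tailed object detection pre-training, improving performance on tail classes. | contrastive learning | |
| 18 | BEARD: Benchmarking the Adversarial Robustness for Dataset Distillation | BEARD: a benchmark for the adversarial robustness of dataset distillation, addressing the lack of security evaluation in existing methods. | distillation | |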

🔬 Pillar 1: Robot Control (4 papers)

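| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 19 | How Good is ChatGPT at Audiovisual Deepfake Detection: A Comparative Study of ChatGPT, AI Models and Human Perception | Evaluates ChatGPT's ability at audiovisual deepfake detection and compares it with AI models and human perception. | manipulation, spatiotemporal, large language model | |
| 20 | MagicQuill: An Intelligent Interactive Image Editing System | MagicQuill: an intelligent interactive image editing system that predicts editing intent in real time with a multimodal LLM. | manipulation, large language model, multimodal | |
| 21 | VidMan: Exploiting Implicit Dynamics from Video Diffusion Model for Effective Robot Manipulation | VidMan: exploits the implicit dynamics in video diffusion models to improve robot manipulation performance. | manipulation, world model | |
| 22 | Computational metaoptics for imaging | Computational metaoptics: combines metasurfaces with computational imaging to overcome the limits of conventional imaging. | manipulation | |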

🔬 Pillar 4: Generative Motion (1 paper)

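| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 23 | JoyVASA: Portrait and Animal Image Animation with Diffusion-Based Audio-Driven Facial Dynamics and Head Motion Generation | JoyVASA: proposes an audio-driven portrait and animal image animation method based on decoupled representations and diffusion models. | motion generation | |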
