cs.CV (2024-11-21)

📊 28 papers in total | 🔗 8 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (13 🔗3) · Pillar 3: Perception & Semantics (7 🔗2) · Pillar 2: RL & Architecture (5 🔗2) · Pillar 8: Physics-based Animation (1) · Pillar 6: Video Extraction (1 🔗1) · Pillar 7: Motion Retargeting (1)

🔬 Pillar 9: Embodied Foundation Models (13 papers)

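# | Title | One-sentence summary | Tags | 🔗
1 | Solving Zero-Shot 3D Visual Grounding as Constraint Satisfaction Problems | Proposes a zero-shot 3D visual grounding method based on constraint satisfaction problems, improving understanding of complex scenes. | large language model, visual grounding
2 | Panther: Illuminate the Sight of Multimodal LLMs with Instruction-Guided Visual Prompts | Panther: uses instruction-guided visual prompts to enhance the visual perception of multimodal LLMs. | large language model, multimodal
3 | A Multimodal Approach to The Detection and Classification of Skin Diseases | Proposes a multimodal method for skin disease detection and classification that combines image and text information to improve diagnostic accuracy. | large language model, multimodal
4 | GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI | Proposes GMAI-VL, a general medical vision-language model built on a large-scale multimodal medical dataset. | multimodal
5 | Multimodal 3D Brain Tumor Segmentation with Adversarial Training and Conditional Random Field | Proposes a 3D multimodal brain tumor segmentation method based on adversarial training and a conditional random field. | multimodal
6 | Multimodal Autoregressive Pre-training of Large Vision Encoders | Proposes AIMV2, a large-scale vision encoder based on multimodal autoregressive pre-training that significantly improves downstream task performance. | multimodal
7 | Looking Beyond Text: Reducing Language bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance | Proposes the LACING framework, which reduces language bias in large vision-language models through multimodal dual-attention and soft-image guidance. | multimodal
8 | SMoLoRA: Exploring and Defying Dual Catastrophic Forgetting in Continual Visual Instruction Tuning | SMoLoRA: explores and addresses dual catastrophic forgetting in continual visual instruction tuning. | large language model, multimodal, instruction following
9 | Beyond Training: Dynamic Token Merging for Zero-Shot Video Understanding | Proposes DYTO, a dynamic token merging framework for zero-shot video understanding. | large language model, multimodal
10 | FoPru: Focal Pruning for Efficient Large Vision-Language Models | Proposes FoPru, an attention-based focal pruning method that improves the efficiency of large vision-language models. | large language model, multimodal
11 | LLaVA-MR: Large Language-and-Vision Assistant for Video Moment Retrieval | Proposes LLaVA-MR, which uses multimodal large language models to tackle video moment retrieval. | large language model, multimodal
12 | FocusLLaVA: A Coarse-to-Fine Approach for Efficient and Effective Visual Token Compression | FocusLLaVA: a coarse-to-fine visual token compression method that improves the efficiency and performance of multimodal large models. | large language model
13 | Quantization without Tears | Proposes QwT, which achieves efficient, general, and high-accuracy network quantization through a lightweight linear-layer structure. | multimodal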

🔬 Pillar 3: Perception & Semantics (7 papers)

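# | Title | One-sentence summary | Tags | 🔗
14 | NexusSplats: Efficient 3D Gaussian Splatting in the Wild | NexusSplats: efficient 3D Gaussian splatting for scenes with complex lighting and occlusion. | 3D gaussian splatting, 3DGS, gaussian splatting
15 | CLIPer: Hierarchically Improving Spatial Representation of CLIP for Open-Vocabulary Semantic Segmentation | CLIPer: hierarchically improves CLIP's spatial representation for open-vocabulary semantic segmentation. | open-vocabulary, open vocabulary, foundation model
16 | Baking Gaussian Splatting into Diffusion Denoiser for Fast and Scalable Single-stage Image-to-3D Generation and Reconstruction | Proposes DiffusionGS, which uses a diffusion model to directly generate 3D Gaussian point clouds for fast single-stage image-to-3D generation and reconstruction. | gaussian splatting, splatting, scene reconstruction
17 | Multimodal 3D Reasoning Segmentation with Complex Scenes | Proposes the MORE3D network and the ReasonSeg3D dataset for multimodal 3D reasoning segmentation in complex scenes. | scene understanding, embodied AI, multimodal
18 | DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding | DINO-X: a unified vision model for open-world object detection and understanding. | open-vocabulary, open vocabulary
19 | StereoCrafter-Zero: Zero-Shot Stereo Video Generation with Noisy Restart | StereoCrafter-Zero: a zero-shot stereo video generation framework based on noisy restart. | depth estimation
20 | Transforming Static Images Using Generative Models for Video Salient Object Detection | Uses generative models to transform static images, improving video salient object detection performance. | optical flow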

🔬 Pillar 2: RL & Architecture (5 papers)

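# | Title | One-sentence summary | Tags | 🔗
21 | Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models | Insight-V: explores long-chain visual reasoning with multimodal large language models, improving performance on complex tasks. | DPO, large language model, multimodal
22 | PhysFlow: Unleashing the Potential of Multi-modal Foundation Models and Video Diffusion for 4D Dynamic Physical Scene Simulation | PhysFlow: uses multimodal foundation models and video diffusion for 4D dynamic physical scene simulation. | distillation, optical flow, foundation model
23 | Segment Any Class (SAC): Multi-Class Few-Shot Semantic Segmentation via Class Region Proposals | Proposes SAC, a training-free multi-class few-shot semantic segmentation method based on class region proposals. | SAC, foundation model
24 | BiomedCoOp: Learning to Prompt for Biomedical Vision-Language Models | Proposes BiomedCoOp, which uses prompt learning to improve the accuracy and generalization of BiomedCLIP in biomedical image classification. | representation learning, distillation, large language model
25 | WARLearn: Weather-Adaptive Representation Learning | WARLearn: a weather-adaptive representation learning framework that improves model performance in adverse weather. | representation learning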

🔬 Pillar 8: Physics-based Animation (1 paper)

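# | Title | One-sentence summary | Tags | 🔗
26 | Spatiotemporal Decoupling for Efficient Vision-Based Occupancy Forecasting | Proposes the spatiotemporally decoupled EfficientOCF, which efficiently forecasts future occupancy states in autonomous driving environments. | spatiotemporal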

🔬 Pillar 6: Video Extraction (1 paper)

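# | Title | One-sentence summary | Tags | 🔗
27 | VAGUE: Visual Contexts Clarify Ambiguous Expressions | VAGUE: uses visual context to resolve ambiguous expressions, improving multimodal reasoning. | Ego4D, multimodal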

🔬 Pillar 7: Motion Retargeting (1 paper)

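# | Title | One-sentence summary | Tags | 🔗
28 | Enhancing GeoAI and location encoding with spatial point pattern statistics: A Case Study of Terrain Feature Classification | A GeoAI model incorporating spatial point pattern statistics that improves terrain feature classification accuracy. | spatial relationship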
