cs.CV (2024-11-26)

📊 55 papers in total | 🔗 17 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (27 🔗7) · Pillar 3: Perception & Semantics (12 🔗5) · Pillar 2: RL & Architecture (11 🔗4) · Pillar 1: Robot Control (2) · Pillar 8: Physics-based Animation (2) · Pillar 4: Generative Motion (1 🔗1)

🔬 Pillar 9: Embodied Foundation Models (27 papers)

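| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 1 | Visatronic: A Multimodal Decoder-Only Model for Speech Synthesis | Visatronic: a multimodal decoder-only model for speech synthesis that generates speech from video and text. | large language model, foundation model, multimodal | |
| 2 | InsightEdit: Towards Better Instruction Following for Image Editing | InsightEdit: uses multimodal large language models to improve instruction-driven image editing. | large language model, multimodal, instruction following | |
| 3 | Multimodal Alignment and Fusion: A Survey | Surveys multimodal alignment and fusion techniques, covering structural perspectives and methodological paradigms, with the aim of improving the generalization of multimodal learning systems. | embodied AI, large language model, multimodal | |
| 4 | NEMO: Can Multimodal LLMs Identify Attribute-Modified Objects? | NEMO: evaluates the ability of multimodal large language models to identify attribute-modified objects. | large language model, multimodal | |
| 5 | ShowUI: One Vision-Language-Action Model for GUI Visual Agent | Proposes ShowUI, a vision-language-action model for GUI visual agents. | vision-language-action, instruction following | |
| 6 | Grounding-IQA: Grounding Multimodal Language Model for Image Quality Assessment | Proposes Grounding-IQA, which uses multimodal grounding to make image quality assessment more fine-grained. | large language model, multimodal | |
| 7 | Real-Time Multimodal Signal Processing for HRI in RoboCup: Understanding a Human Referee | Proposes a real-time multimodal signal processing method for understanding a human referee in RoboCup human-robot interaction. | multimodal | |
| 8 | Video-Guided Foley Sound Generation with Multimodal Controls | MultiFoley: a video-guided Foley sound generation model with multimodal controls. | multimodal | |
| 9 | HyperSeg: Towards Universal Visual Segmentation with Large Language Model | HyperSeg: a universal visual segmentation model based on a large language model, providing pixel-level understanding of images and videos. | large language model | |
| 10 | Multimodal Outer Arithmetic Block Dual Fusion of Whole Slide Images and Omics Data for Precision Oncology | Proposes a dual-fusion Multimodal Outer Arithmetic Block method that improves tumor subtype diagnosis accuracy when fusing WSI and genomic data. | multimodal | |
| 11 | Efficient Multi-modal Large Language Models via Visual Token Grouping | Proposes VisToG, which improves the efficiency of multimodal large language models through visual token grouping. | large language model | |
| 12 | Exploring Aleatoric Uncertainty in Object Detection via Vision Foundation Models | Uses vision foundation models to explore aleatoric uncertainty in object detection and improve model robustness. | foundation model | |
| 13 | Advancing Content Moderation: Evaluating Large Language Models for Detecting Sensitive Content Across Text, Images, and Videos | Evaluates how well large language models detect sensitive content in text, images, and videos, with the goal of improving content moderation. | large language model | |
| 14 | SatVision-TOA: A Geospatial Foundation Model for Coarse-Resolution All-Sky Remote Sensing Imagery | SatVision-TOA: a geospatial foundation model for coarse-resolution all-sky remote sensing imagery. | foundation model | |
| 15 | Filter, Correlate, Compress: Training-Free Token Reduction for MLLM Acceleration | Proposes the FiCoCo framework, which accelerates multimodal large language models through training-free token reduction. | large language model, multimodal | |
| 16 | SketchAgent: Language-Driven Sequential Sketch Generation | SketchAgent: proposes a language-driven method for sequential sketch generation. | large language model, multimodal | |
| 17 | HEIE: MLLM-Based Hierarchical Explainable AIGC Image Implausibility Evaluator | Proposes HEIE, an MLLM-based hierarchical and explainable evaluator of implausibility in AIGC images. | large language model, multimodal | |
| 18 | DOGR: Towards Versatile Visual Document Grounding and Referring | DOGR: a model, data engine, and benchmark for versatile visual document grounding and referring. | large language model, multimodal | |
| 19 | OpenAD: Open-World Autonomous Driving Benchmark for 3D Object Detection | OpenAD: an open-world autonomous driving benchmark for 3D object detection. | large language model, multimodal | |
| 20 | The Context of Crash Occurrence: A Complexity-Infused Approach Integrating Semantic, Contextual, and Kinematic Features | Proposes a road-complexity analysis framework that integrates semantic, contextual, and kinematic features to improve crash prediction accuracy. | large language model | |
| 21 | Bi-ICE: An Inner Interpretable Framework for Image Classification via Bi-directional Interactions between Concept and Input Embeddings | Proposes Bi-ICE, which improves the inner interpretability of image classification through bi-directional interactions between concept and input embeddings. | large language model | |
| 22 | Scene Co-pilot: Procedural Text to Video Generation with Human in the Loop | Proposes the Scene Co-pilot framework, which combines an LLM with procedural 3D scene generation for controllable text-to-video generation. | large language model | |
| 23 | FLEX-CLIP: Feature-Level GEneration Network Enhanced CLIP for X-shot Cross-modal Retrieval | Proposes FLEX-CLIP, which enhances CLIP with a feature generation network to address feature degradation and data imbalance in X-shot cross-modal retrieval. | multimodal | |
| 24 | VL-RewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models | Proposes VL-RewardBench for evaluating and improving vision-language generative reward models. | multimodal | |
| 25 | in-Car Biometrics (iCarB) Datasets for Driver Recognition: Face, Fingerprint, and Voice | Releases the iCarB in-car biometrics datasets for driver recognition, covering three modalities: face, fingerprint, and voice. | multimodal | |
| 26 | Symmetry Strikes Back: From Single-Image Symmetry Detection to 3D Generation | Reflect3D: uses single-image symmetry detection to achieve high-quality 3D generation. | foundation model | |
| 27 | MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding | MUSE-VL: models a unified vision-language model through semantic discrete encoding. | multimodal | |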

🔬 Pillar 3: Perception & Semantics (12 papers)

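| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 28 | DROID-Splat: Combining end-to-end SLAM with 3D Gaussian Splatting | DroidSplat: combines end-to-end SLAM with 3D Gaussian Splatting to achieve state-of-the-art tracking and rendering. | monocular depth, 3D gaussian splatting, gaussian splatting | |
| 29 | Distractor-free Generalizable 3D Gaussian Splatting | Proposes DGGS, which addresses distractor-free reconstruction in cross-scene generalizable 3D Gaussian Splatting. | 3D gaussian splatting, 3DGS, gaussian splatting | |
| 30 | 4D Scaffold Gaussian Splatting with Dynamic-Aware Anchor Growing for Efficient and High-Fidelity Dynamic Scene Reconstruction | Proposes 4D Scaffold Gaussian Splatting with dynamic-aware anchor growing for efficient, high-fidelity dynamic scene reconstruction. | gaussian splatting, splatting, scene reconstruction | |
| 31 | SelfSplat: Pose-Free and 3D Prior-Free Generalizable 3D Gaussian Splatting | SelfSplat: proposes a generalizable 3D Gaussian Splatting method that requires neither poses nor 3D priors. | 3D gaussian splatting, gaussian splatting, splatting | |
| 32 | Distilling Spectral Graph for Object-Context Aware Open-Vocabulary Semantic Segmentation | Proposes an object-context-aware open-vocabulary semantic segmentation method based on spectral graph distillation. | open-vocabulary, open vocabulary, foundation model | |
| 33 | HSI-Drive v2.0: More Data for New Challenges in Scene Understanding for Autonomous Driving | HSI-Drive v2.0: extends the hyperspectral image dataset to improve scene understanding for autonomous driving. | scene understanding, HSI | |
| 34 | MLI-NeRF: Multi-Light Intrinsic-Aware Neural Radiance Fields | Proposes MLI-NeRF, which uses multi-light information to tackle intrinsic image decomposition in NeRF. | NeRF, neural radiance field | |
| 35 | DepthCues: Evaluating Monocular Depth Perception in Large Vision Models | DepthCues: evaluates monocular depth perception in large vision models. | depth estimation, monocular depth | |
| 36 | Self-supervised Monocular Depth and Pose Estimation for Endoscopy with Generative Latent Priors | Proposes a self-supervised monocular depth and pose estimation method for endoscopy based on generative latent priors. | monocular depth | |
| 37 | Puzzle Similarity: A Perceptually-guided Cross-Reference Metric for Artifact Detection in 3D Scene Reconstructions | Proposes Puzzle Similarity for no-reference artifact detection in 3D reconstructions, improving reconstruction quality. | scene reconstruction | |
| 38 | Buffer Anytime: Zero-Shot Video Depth and Normal from Image Priors | Buffer Anytime: uses image priors for zero-shot video depth and normal estimation. | Depth Anything, optical flow | |
| 39 | Box for Mask and Mask for Box: weak losses for multi-task partially supervised learning | Proposes the BoMBo strategy, which uses weak losses for multi-task partially supervised learning to improve object detection and semantic segmentation. | scene understanding | |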

🔬 Pillar 2: RL & Architecture (11 papers)

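| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 40 | VersatileMotion: A Unified Framework for Motion Synthesis and Comprehension | VersatileMotion: a unified multimodal motion LLM framework for motion synthesis and comprehension. | flow matching, motion synthesis, motion tokenizer | |
| 41 | FTMoMamba: Motion Generation with Frequency and Text State Space Models | FTMoMamba: generates motion using frequency and text state space models. | Mamba, state space model, text-to-motion | |
| 42 | BadScan: An Architectural Backdoor Attack on Visual State Space Models | BadScan: an architectural backdoor attack on visual state space models. | Mamba, SSM, state space model | |
| 43 | Learning Robust Anymodal Segmentor with Unimodal and Cross-modal Distillation | Proposes a robust anymodal segmentor based on modality distillation, addressing unimodal bias in multimodal segmentation. | distillation, multimodal | |
| 44 | D$^2$-World: An Efficient World Model through Decoupled Dynamic Flow | D$^2$-World: efficiently predicts future point clouds through decoupled dynamic flow. | world model, foundation model | |
| 45 | SVGDreamer++: Advancing Editability and Diversity in Text-Guided SVG Generation | SVGDreamer++: proposes HIVE and VPSD to improve the editability and diversity of text-guided SVG generation. | dreamer, distillation | |
| 46 | Spatially Visual Perception for End-to-End Robotic Learning | Proposes a spatial-perception-based end-to-end robotic learning framework that improves generalization under lighting changes. | imitation learning, depth estimation, monocular depth | |
| 47 | TinyViM: Frequency Decoupling for Tiny Hybrid Vision Mamba | TinyViM: builds a tiny hybrid vision Mamba model through frequency decoupling, improving performance and speeding up inference. | Mamba | |
| 48 | DWCL: Dual-Weighted Contrastive Learning for Multi-View Clustering | Proposes Dual-Weighted Contrastive Learning (DWCL) to address representation degeneration and unreliable views in multi-view clustering. | contrastive learning | |
| 49 | Large-Scale Data-Free Knowledge Distillation for ImageNet via Multi-Resolution Data Generation | Proposes MUSE, which achieves large-scale data-free knowledge distillation for ImageNet through multi-resolution data generation. | distillation | |
| 50 | Semantic Anchor Transport: Robust Test-Time Adaptation for Vision-Language Models | Proposes the Semantic Anchor Transport (SAT) method to make test-time adaptation of vision-language models more robust. | representation learning, contrastive learning, distillation | |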

🔬 Pillar 1: Robot Control (2 papers)

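| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 51 | vesselFM: A Foundation Model for Universal 3D Blood Vessel Segmentation | Proposes vesselFM, a medical imaging foundation model for universal 3D blood vessel segmentation. | domain randomization, flow matching, foundation model | |
| 52 | GMFlow: Global Motion-Guided Recurrent Flow for 6D Object Pose Estimation | Proposes GMFlow, a global motion-guided recurrent flow for 6D object pose estimation. | manipulation, linear attention | |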

🔬 Pillar 8: Physics-based Animation (2 papers)

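| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 53 | AIGV-Assessor: Benchmarking and Evaluating the Perceptual Quality of Text-to-Video Generation with LMM | Proposes AIGV-Assessor, which uses an LMM to evaluate the perceptual quality of text-to-video generation, and builds the large-scale AIGVQA-DB dataset. | spatiotemporal, multimodal | |
| 54 | Selfish Evolution: Making Discoveries in Extreme Label Noise with the Help of Overfitting Dynamics | Proposes Selfish Evolution, which uses overfitting dynamics to discover and correct labels under extreme label noise. | spatiotemporal | |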

🔬 Pillar 4: Generative Motion (1 paper)

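| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 55 | I2VControl: Disentangled and Unified Video Motion Synthesis Control | I2VControl: a disentangled, unified framework for video motion synthesis control that combines multiple control types without conflict. | motion synthesis | |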
