cs.CV (2024-08-21)

📊 37 papers in total | 🔗 15 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (15 🔗6) · Pillar 3: Perception & Semantics (9 🔗3) · Pillar 2: RL & Architecture (8 🔗6) · Pillar 4: Generative Motion (2) · Pillar 7: Motion Retargeting (1) · Pillar 1: Robot Control (1) · Pillar 6: Video Extraction (1)

🔬 Pillar 9: Embodied Foundation Models (15 papers)

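| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 1 | EE-MLLM: A Data-Efficient and Compute-Efficient Multimodal Large Language Model | Proposes EE-MLLM, a data-efficient and compute-efficient multimodal large language model built on a composite attention mechanism. | large language model, multimodal | |
| 2 | CaRDiff: Video Salient Object Ranking Chain of Thought Reasoning for Saliency Prediction with Diffusion | Proposes the CaRDiff framework, which uses chain-of-thought reasoning over video salient object ranking together with a diffusion model to improve video saliency prediction. | large language model, multimodal, chain-of-thought | |
| 3 | UniFashion: A Unified Vision-Language Model for Multimodal Fashion Retrieval and Generation | UniFashion: a unified vision-language model for multimodal fashion retrieval and generation. | large language model, multimodal | |
| 4 | GRAB: A Challenging GRaph Analysis Benchmark for Large Multimodal Models | GRAB: a challenging benchmark for evaluating the graph analysis abilities of large multimodal models. | multimodal | |
| 5 | Video-to-Text Pedestrian Monitoring (VTPM): Leveraging Computer Vision and Large Language Models for Privacy-Preserve Pedestrian Activity Monitoring at Intersections | Proposes VTPM, which uses computer vision and LLMs for privacy-preserving pedestrian activity monitoring at intersections. | large language model | |
| 6 | MCDubber: Multimodal Context-Aware Expressive Video Dubbing | MCDubber: a multimodal context-aware video dubbing model that improves dubbing expressiveness. | multimodal | |
| 7 | Semi-supervised 3D Semantic Scene Completion with 2D Vision Foundation Model Guidance | Proposes a semi-supervised 3D semantic scene completion framework guided by 2D vision foundation models. | foundation model | |
| 8 | TWLV-I: Analysis and Insights from Holistic Evaluation on Video Foundation Models | TWLV-I: improves appearance and motion understanding through holistic evaluation of video foundation models. | foundation model | |
| 9 | SEA: Supervised Embedding Alignment for Token-Level Visual-Textual Integration in MLLMs | Proposes SEA, a supervised embedding alignment method for token-level visual-textual alignment in MLLMs. | large language model, multimodal | |
| 10 | OE3DIS: Open-Ended 3D Point Cloud Instance Segmentation | Proposes OE3DIS for 3D point cloud instance segmentation in open-ended settings without predefined class names. | large language model, multimodal | |
| 11 | EMO-LLaMA: Enhancing Facial Emotion Understanding with Instruction Tuning | EMO-LLaMA: enhances the facial expression understanding of multimodal large language models through instruction tuning. | large language model, multimodal | |
| 12 | Image Score: Learning and Evaluating Human Preferences for Mercari Search | Uses LLMs and chain-of-thought (CoT) to learn and evaluate image quality preferences for the Mercari e-commerce platform. | large language model, chain-of-thought | |
| 13 | MSCPT: Few-shot Whole Slide Image Classification with Multi-scale and Context-focused Prompt Tuning | Proposes MSCPT, which uses multi-scale, context-focused prompt tuning for few-shot classification of pathology whole slide images. | large language model | |
| 14 | T2VIndexer: A Generative Video Indexer for Efficient Text-Video Retrieval | Proposes T2VIndexer, a generative video indexer for efficient text-video retrieval. | multimodal | |
| 15 | EAGLE: Elevating Geometric Reasoning through LLM-empowered Visual Instruction Tuning | EAGLE: improves geometric reasoning through LLM-empowered visual instruction tuning. | large language model | |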

🔬 Pillar 3: Perception & Semantics (9 papers)

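| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 16 | Video Emotion Open-vocabulary Recognition Based on Multimodal Large Language Model | Proposes a video open-vocabulary emotion recognition method based on multimodal large language models. | open-vocabulary, open vocabulary, large language model | |
| 17 | DeRainGS: Gaussian Splatting for Enhanced Scene Reconstruction in Rainy Environments | Proposes DeRainGS, a Gaussian splatting method for enhanced scene reconstruction in rainy environments. | 3DGS, gaussian splatting, splatting | |
| 18 | Robust 3D Gaussian Splatting for Novel View Synthesis in Presence of Distractors | Proposes a robust 3D Gaussian splatting method for novel view synthesis in the presence of distractors. | 3D gaussian splatting, gaussian splatting, splatting | |
| 19 | Irregularity Inspection using Neural Radiance Field | Proposes a NeRF-based 3D twin model for irregularity inspection of large mechanical equipment. | NeRF, neural radiance field | |
| 20 | EmbodiedSAM: Online Segment Any 3D Thing in Real Time | EmbodiedSAM: online, real-time segmentation of any 3D object, enabling embodied intelligence. | open-vocabulary, open vocabulary, foundation model | |
| 21 | Visual Localization in 3D Maps: Comparing Point Cloud, Mesh, and NeRF Representations | Proposes a general visual localization system that localizes monocular images in several 3D map representations, including point clouds, meshes, and NeRF. | NeRF | |
| 22 | Pano2Room: Novel View Synthesis from a Single Indoor Panorama | Pano2Room: synthesizes high-quality novel views of indoor scenes from a single panorama. | 3D gaussian splatting, gaussian splatting, splatting | |
| 23 | LiFCal: Online Light Field Camera Calibration via Bundle Adjustment | Proposes LiFCal, an online light field camera calibration method that estimates parameters accurately in scenes without calibration targets. | depth estimation | |
| 24 | SelfDRSC++: Self-Supervised Learning for Dual Reversed Rolling Shutter Correction | Proposes SelfDRSC++ to address rolling shutter distortion in dynamic scenes. | optical flow | |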

🔬 Pillar 2: RL & Architecture (8 papers)

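| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 25 | GaussianOcc: Fully Self-supervised and Efficient 3D Occupancy Estimation with Gaussian Splatting | GaussianOcc: fully self-supervised and efficient 3D occupancy estimation using Gaussian splatting. | representation learning, gaussian splatting, splatting | |
| 26 | MambaCSR: Dual-Interleaved Scanning for Compressed Image Super-Resolution With SSMs | MambaCSR: a Mamba framework with dual-interleaved scanning for compressed image super-resolution. | Mamba, SSM, state space model | |
| 27 | MambaOcc: Visual State Space Model for BEV-based Occupancy Prediction with Local Adaptive Reordering | MambaOcc: BEV-based occupancy prediction with a visual state space model and local adaptive reordering. | Mamba, state space model | |
| 28 | HMT-UNet: A hybird Mamba-Transformer Vision UNet for Medical Image Segmentation | Proposes HMT-UNet, a hybrid Mamba-Transformer UNet that improves medical image segmentation performance. | Mamba, SSM, state space model | |
| 29 | Supervised Representation Learning towards Generalizable Assembly State Recognition | Proposes ISIL, a representation-learning-based method for assembly state recognition. | representation learning | |
| 30 | BadVim: Unveiling Backdoor Threats in Visual State Space Model | BadVim: reveals backdoor threats in visual state space models. | state space model | |
| 31 | Positional Prompt Tuning for Efficient 3D Representation Learning | Proposes PPT, an efficient positional prompt tuning method for 3D representation learning. | representation learning | |
| 32 | LAKD-Activation Mapping Distillation Based on Local Learning | Proposes local-learning-based activation mapping distillation (LAKD), improving the efficiency and interpretability of knowledge distillation. | distillation | |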

🔬 Pillar 4: Generative Motion (2 papers)

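| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 33 | Story3D-Agent: Exploring 3D Storytelling Visualization with Large Language Models | Story3D-Agent: explores 3D storytelling visualization with large language models. | motion synthesis, large language model | |
| 34 | SynPlay: Large-Scale Synthetic Human Data with Real-World Diversity for Aerial-View Perception | SynPlay: a large-scale synthetic human dataset with real-world diversity for aerial-view perception. | motion generation, human motion | |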

🔬 Pillar 7: Motion Retargeting (1 paper)

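| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 35 | Pixel Is Not a Barrier: An Effective Evasion Attack for Pixel-Domain Diffusion Models | Proposes AtkPDM, an effective evasion attack against pixel-domain diffusion models. | latent optimization | |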

🔬 Pillar 1: Robot Control (1 paper)

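| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 36 | AIM 2024 Challenge on Compressed Video Quality Assessment: Methods and Results | AIM 2024 Compressed Video Quality Assessment Challenge: analysis of methods and results. | manipulation | |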

🔬 Pillar 6: Video Extraction (1 paper)

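| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 37 | HumanCoser: Layered 3D Human Generation via Semantic-Aware Diffusion Model | HumanCoser: proposes a semantic-aware diffusion model for reusable, layered 3D human generation. | SMPL | |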
