cs.CV(2024-12-30)

📊 23 papers in total | 🔗 5 with code

🎯 Topic navigation

Pillar 9: Embodied Foundation Models (8, 🔗 1)
Pillar 3: Perception & Semantics (4, 🔗 1)
Pillar 1: Robot Control (3, 🔗 1)
Pillar 2: RL & Architecture (3, 🔗 1)
Pillar 4: Generative Motion (2)
Pillar 6: Video Extraction (1, 🔗 1)
Pillar 7: Motion Retargeting (1)
Pillar 8: Physics-based Animation (1)

🔬 Pillar 9: Embodied Foundation Models (8 papers)

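| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 1 | M$^3$oralBench: A MultiModal Moral Benchmark for LVLMs | Proposes M$^3$oralBench for evaluating the multimodal moral understanding and reasoning of LVLMs. | large language model, foundation model, multimodal | |
| 2 | Enhanced Multimodal RAG-LLM for Accurate Visual Question Answering | Proposes a scene-graph-enhanced multimodal RAG-LLM that improves visual question answering accuracy. | large language model, multimodal | |
| 3 | Dialogue Director: Bridging the Gap in Dialogue Visualization for Multimodal Storytelling | Dialogue Director: a multimodal framework that turns dialogue scripts into dynamic multi-view storyboards. | multimodal, chain-of-thought | |
| 4 | Enhancing Table Recognition with Vision LLMs: A Benchmark and Neighbor-Guided Toolchain Reasoner | Proposes a table recognition benchmark for vision LLMs and NGTR, a neighbor-guided toolchain reasoner. | large language model, foundation model | |
| 5 | Social-LLaVA: Enhancing Robot Navigation through Human-Language Reasoning in Social Spaces | Social-LLaVA: improves robot navigation in social spaces through human-language reasoning. | chain-of-thought | |
| 6 | Towards Compatible Fine-tuning for Vision-Language Model Updates | Proposes ContCoOp, which keeps fine-tuned modules compatible after the underlying vision-language model is updated. | foundation model | |
| 7 | Learning to Rank Pre-trained Vision-Language Models for Downstream Tasks | Proposes VEGA, an unsupervised method for ranking vision-language models to select one for a downstream task. | large language model | |
| 8 | Enhancing Visual Representation for Text-based Person Searching | Proposes VFE-TPS, which enhances visual representations to improve text-based person search accuracy. | multimodal | |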

🔬 Pillar 3: Perception & Semantics (4 papers)

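| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 9 | KeyGS: A Keyframe-Centric Gaussian Splatting Method for Monocular Image Sequences | Proposes KeyGS to make 3D reconstruction from monocular image sequences more efficient. | 3D gaussian splatting, 3DGS, gaussian splatting | |
| 10 | 4D Gaussian Splatting: Modeling Dynamic Scenes with Native 4D Primitives | Models dynamic scenes with native 4D Gaussian primitives, enabling real-time rendering of high-resolution dynamic scenes. | gaussian splatting, splatting, scene understanding | |
| 11 | YOLO-UniOW: Efficient Universal Open-World Object Detection | YOLO-UniOW: an efficient universal open-world object detector that addresses the limitations of conventional object detection. | open-vocabulary, open vocabulary, multimodal | |
| 12 | FPGA-based Acceleration of Neural Network for Image Classification using Vitis AI | Uses Vitis AI to accelerate an image-classification neural network on an FPGA, improving throughput and energy efficiency. | depth estimation | |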

🔬 Pillar 1: Robot Control (3 papers)

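| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 13 | ReFlow6D: Refraction-Guided Transparent Object 6D Pose Estimation via Intermediate Representation Learning | ReFlow6D: 6D pose estimation of transparent objects via refraction-guided intermediate representation learning. | manipulation, representation learning, 6D pose estimation | |
| 14 | PERSE: Personalized 3D Generative Avatars from A Single Portrait | PERSE: generates a personalized, controllable 3D avatar from a single portrait, with disentangled editing of facial attributes. | manipulation, 3D gaussian splatting, gaussian splatting | |
| 15 | Edicho: Consistent Image Editing in the Wild | Edicho: a diffusion-based method that uses explicit image correspondences to edit in-the-wild images consistently. | manipulation, classifier-free guidance | |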

🔬 Pillar 2: RL & Architecture (3 papers)

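| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 16 | Hierarchical Banzhaf Interaction for General Video-Language Representation Learning | Proposes a hierarchical Banzhaf interaction model that strengthens fine-grained semantic interaction in general video-language representation learning. | representation learning, contrastive learning, multimodal | |
| 17 | VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation | Proposes the VisionReward framework for aligning image and video generation with human preferences. | reinforcement learning, preference learning | |
| 18 | ILDiff: Generate Transparent Animated Stickers by Implicit Layout Distillation | Proposes ILDiff, which generates high-quality transparent animated stickers via implicit layout distillation. | distillation | |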

🔬 Pillar 4: Generative Motion (2 papers)

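| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 19 | LS-GAN: Human Motion Synthesis with Latent-space GANs | LS-GAN: efficient human motion synthesis with latent-space GANs. | motion synthesis | |
| 20 | Diffgrasp: Whole-Body Grasping Synthesis Guided by Object Motion Using a Diffusion Model | Diffgrasp: whole-body grasp synthesis guided by object motion, using a diffusion model. | contact-aware, human-object interaction | |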

🔬 Pillar 6: Video Extraction (1 paper)

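| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 21 | Vinci: A Real-time Embodied Smart Assistant based on Egocentric Vision-Language Model | Vinci: a real-time embodied smart assistant built on an egocentric vision-language model. | egocentric, egocentric vision | |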

🔬 Pillar 7: Motion Retargeting (1 paper)

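| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 22 | Slow Perception: Let's Perceive Geometric Figures Step-by-step | Proposes a "slow perception" strategy that improves LVLMs' ability to understand and reproduce geometric figures. | spatial relationship | |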

🔬 Pillar 8: Physics-based Animation (1 paper)

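| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 23 | LTX-Video: Realtime Video Latent Diffusion | LTX-Video: a Transformer-based latent diffusion model for real-time video generation. | spatiotemporal | |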
