cs.CV (2024-12-30)

📊 23 papers in total | 🔗 5 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (8 🔗1) · Pillar 3: Perception & Semantics (4 🔗1) · Pillar 1: Robot Control (3 🔗1) · Pillar 2: RL & Architecture (3 🔗1) · Pillar 4: Generative Motion (2) · Pillar 6: Video Extraction (1 🔗1) · Pillar 7: Motion Retargeting (1) · Pillar 8: Physics-based Animation (1)

🔬 Pillar 9: Embodied Foundation Models (8 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
1	M$^3$oralBench: A MultiModal Moral Benchmark for LVLMs	Proposes M$^3$oralBench to evaluate LVLMs' multimodal moral understanding and reasoning.	large language model foundation model multimodal
2	Enhanced Multimodal RAG-LLM for Accurate Visual Question Answering	Proposes a scene-graph-enhanced multimodal RAG-LLM that improves visual question answering accuracy.	large language model multimodal
3	Dialogue Director: Bridging the Gap in Dialogue Visualization for Multimodal Storytelling	Dialogue Director: a multimodal framework that turns dialogue scripts into dynamic multi-view storyboards.	multimodal chain-of-thought
4	Enhancing Table Recognition with Vision LLMs: A Benchmark and Neighbor-Guided Toolchain Reasoner	Proposes a table recognition benchmark based on vision LLMs, along with NGTR, a neighbor-guided toolchain reasoner.	large language model foundation model
5	Social-LLaVA: Enhancing Robot Navigation through Human-Language Reasoning in Social Spaces	Social-LLaVA: improves robot navigation in social spaces through human-language reasoning.	chain-of-thought
6	Towards Compatible Fine-tuning for Vision-Language Model Updates	Proposes ContCoOp to keep fine-tuned modules compatible after the underlying vision-language model is updated.	foundation model
7	Learning to Rank Pre-trained Vision-Language Models for Downstream Tasks	Proposes VEGA, an unsupervised method for ranking vision-language models to guide model selection on downstream tasks.	large language model
8	Enhancing Visual Representation for Text-based Person Searching	Proposes the VFE-TPS model, which enhances visual representations to improve text-based person search accuracy.	multimodal	✅

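🔬 Pillar 3: Perception & Semantics (4 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
9	KeyGS: A Keyframe-Centric Gaussian Splatting Method for Monocular Image Sequences	Proposes KeyGS to improve the efficiency of 3D reconstruction from monocular image sequences.	3D gaussian splatting 3DGS gaussian splatting
10	4D Gaussian Splatting: Modeling Dynamic Scenes with Native 4D Primitives	Proposes dynamic scene modeling with native 4D Gaussians, enabling real-time rendering of high-resolution dynamic scenes.	gaussian splatting splatting scene understanding
11	YOLO-UniOW: Efficient Universal Open-World Object Detection	YOLO-UniOW: an efficient universal open-world object detection model that addresses the limitations of traditional object detection.	open-vocabulary open vocabulary multimodal	✅
12	FPGA-based Acceleration of Neural Network for Image Classification using Vitis AI	Uses Vitis AI to accelerate image classification neural networks on FPGA, improving throughput and energy efficiency.	depth estimation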

🔬 Pillar 1: Robot Control (3 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
13	ReFlow6D: Refraction-Guided Transparent Object 6D Pose Estimation via Intermediate Representation Learning	ReFlow6D: 6D pose estimation of transparent objects via refraction-guided intermediate representation learning.	manipulation representation learning 6D pose estimation	✅
14	PERSE: Personalized 3D Generative Avatars from A Single Portrait	PERSE: generates personalized, controllable 3D avatars from a single portrait, with disentangled editing of facial attributes.	manipulation 3D gaussian splatting gaussian splatting
15	Edicho: Consistent Image Editing in the Wild	Edicho: a diffusion model built on explicit image correspondence for consistent image editing in the wild.	manipulation classifier-free guidance

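🔬 Pillar 2: RL & Architecture (3 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
16	Hierarchical Banzhaf Interaction for General Video-Language Representation Learning	Proposes a hierarchical Banzhaf interaction model to strengthen fine-grained semantic interaction in general video-language representation learning.	representation learning contrastive learning multimodal
17	VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation	Proposes the VisionReward framework to address human preference alignment in visual generation.	reinforcement learning preference learning	✅
18	ILDiff: Generate Transparent Animated Stickers by Implicit Layout Distillation	Proposes ILDiff, which generates high-quality transparent animated stickers through implicit layout distillation.	distillation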

🔬 Pillar 4: Generative Motion (2 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
19	LS-GAN: Human Motion Synthesis with Latent-space GANs	LS-GAN: efficient human motion synthesis with latent-space GANs.	motion synthesis
20	Diffgrasp: Whole-Body Grasping Synthesis Guided by Object Motion Using a Diffusion Model	Diffgrasp: whole-body grasping synthesis guided by object motion, using a diffusion model.	contact-aware human-object interaction

🔬 Pillar 6: Video Extraction (1 paper)

#	Title	One-sentence summary	Tags	🔗	⭐
21	Vinci: A Real-time Embodied Smart Assistant based on Egocentric Vision-Language Model	Vinci: a real-time embodied smart assistant based on an egocentric vision-language model.	egocentric egocentric vision	✅

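🔬 Pillar 7: Motion Retargeting (1 paper)

#	Title	One-sentence summary	Tags	🔗	⭐
22	Slow Perception: Let's Perceive Geometric Figures Step-by-step	Proposes a "slow perception" strategy that improves LVLMs' ability to understand and reproduce geometric figures.	spatial relationship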

🔬 Pillar 8: Physics-based Animation (1 paper)

#	Title	One-sentence summary	Tags	🔗	⭐
23	LTX-Video: Realtime Video Latent Diffusion	LTX-Video: a Transformer-based latent diffusion model for real-time video generation.	spatiotemporal
