cs.CV (2024-06-21)

📊 15 papers in total | 🔗 3 with code

🎯 Navigation by interest area

Pillar 3: Perception & Semantics (5 🔗2) · Pillar 9: Embodied Foundation Models (4) · Pillar 2: RL & Architecture (3 🔗1) · Pillar 1: Robot Control (2) · Pillar 8: Physics-based Animation (1)

🔬 Pillar 3: Perception & Semantics (5 papers)

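#	Title	One-line summary	Tags	🔗	⭐
1	Open-Vocabulary Temporal Action Localization using Multimodal Guidance	Proposes OVFormer, which uses multimodal guidance for open-vocabulary temporal action localization	open-vocabulary open vocabulary large language model
2	E2GS: Event Enhanced Gaussian Splatting	Proposes E2GS, which uses event-camera data to enhance Gaussian splatting for fast, high-quality novel view synthesis.	gaussian splatting splatting NeRF	✅
3	Taming 3DGS: High-Quality Radiance Fields with Limited Resources	Proposes a budget-constrained 3DGS optimization method for high-quality novel view synthesis with low resource usage.	3D gaussian splatting 3DGS gaussian splatting
4	Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning	Proposes multimodal task vectors to address long-context in-context learning in large multimodal models	implicit representation multimodal	✅
5	Relighting Scenes with Object Insertions in Neural Radiance Fields	Proposes a NeRF-based method for object insertion and relighting, enabling realistic AR scene compositing	NeRF neural radiance field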

🔬 Pillar 9: Embodied Foundation Models (4 papers)

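#	Title	One-line summary	Tags	🔗	⭐
6	Multimodal Deformable Image Registration for Long-COVID Analysis Based on Progressive Alignment and Multi-perspective Loss	Proposes a multimodal deformable image registration method based on progressive alignment and a multi-perspective loss, for long-COVID analysis.	multimodal
7	Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models	Proposes the SpatialEval benchmark, revealing shortcomings and counterintuitive behavior in the spatial reasoning of VLMs.	large language model multimodal
8	TraceNet: Segment one thing efficiently	TraceNet: efficient single-instance segmentation driven by user clicks, designed for mobile imaging applications	multimodal
9	Accessible, At-Home Detection of Parkinson's Disease via Multi-task Video Analysis	Proposes UFNet, an uncertainty-calibrated fusion network, to assist detection of Parkinson's disease in home settings	multimodal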

🔬 Pillar 2: RL & Architecture (3 papers)

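#	Title	One-line summary	Tags	🔗	⭐
10	CLIP-Decoder : ZeroShot Multilabel Classification using Multimodal CLIP Aligned Representation	Proposes CLIP-Decoder, which uses multimodally aligned representations for zero-shot multilabel classification	representation learning multimodal
11	VividDreamer: Towards High-Fidelity and Efficient Text-to-3D Generation	VividDreamer: proposes pose-dependent consistency distillation sampling for high-quality, efficient text-to-3D generation	dreamer distillation	✅
12	VideoScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation	VideoScore: builds an automatic video evaluation metric that simulates human feedback to improve video generation quality	reinforcement learning RLHF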

🔬 Pillar 1: Robot Control (2 papers)

#	Title	One-line summary	Tags	🔗	⭐
13	An End-to-End, Segmentation-Free, Arabic Handwritten Recognition Model on KHATT	Proposes an end-to-end, segmentation-free Arabic handwriting recognition model, validated on the KHATT dataset.	manipulation
14	Landscape More Secure Than Portrait? Zooming Into the Directionality of Digital Images With Security Implications	Reveals how image directionality affects media security and proposes improvements	manipulation

🔬 Pillar 8: Physics-based Animation (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
15	Real-Time Hand Gesture Recognition: Integrating Skeleton-Based Data Fusion and Multi-Stream CNN	Proposes a real-time hand gesture recognition framework based on skeleton data fusion and a multi-stream CNN	spatiotemporal
