cs.CV (2025-02-06)

📊 23 papers total | 🔗 6 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (13, 🔗 4) · Pillar 2: RL & Architecture (4, 🔗 1) · Pillar 8: Physics-based Animation (2) · Pillar 4: Generative Motion (1, 🔗 1) · Pillar 3: Perception & Semantics (1) · Pillar 7: Motion Retargeting (1) · Pillar 6: Video Extraction (1)

🔬 Pillar 9: Embodied Foundation Models (13 papers)

1. PixFoundation: Are We Heading in the Right Direction with Pixel-level Vision Foundation Models?
   Reveals the limitations of pixel-level vision foundation models in visual question answering and grounding, and explores the potential of MLLMs without pixel-level supervision.
   Tags: large language model, foundation model

2. LeAP: Consistent multi-domain 3D labeling using Foundation Models
   Uses foundation models for consistent automatic 3D point-cloud labeling across multiple domains.
   Tags: foundation model

3. WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs
   Introduces the WorldSense benchmark for evaluating the omnimodal understanding of multimodal LLMs in real-world scenarios.
   Tags: multimodal

4. A Self-supervised Multimodal Deep Learning Approach to Differentiate Post-radiotherapy Progression from Pseudoprogression in Glioblastoma
   Proposes a self-supervised multimodal deep learning method to differentiate true progression from pseudoprogression after radiotherapy in glioblastoma.
   Tags: multimodal

5. LR0.FM: Low-Res Benchmark and Improving Robustness for Zero-Shot Classification in Foundation Models
   A low-resolution benchmark; improves the zero-shot classification robustness of vision-language foundation models on low-resolution images.
   Tags: foundation model

6. No Free Lunch in Annotation either: An objective evaluation of foundation models for streamlining annotation in animal tracking
   Objectively evaluates how effective foundation models are at streamlining annotation for animal tracking.
   Tags: foundation model

7. FairT2I: Mitigating Social Bias in Text-to-Image Generation via Large Language Model-Assisted Detection and Attribute Rebalancing
   FairT2I mitigates social bias in text-to-image generation through LLM-assisted detection and attribute rebalancing.
   Tags: large language model

8. Time-VLM: Exploring Multimodal Vision-Language Models for Augmented Time Series Forecasting
   Proposes Time-VLM, which leverages multimodal vision-language models to augment time-series forecasting.
   Tags: multimodal

9. Color in Visual-Language Models: CLIP deficiencies
   Reveals CLIP's deficiencies in color understanding: a bias against achromatic stimuli and a tendency to prioritize text.
   Tags: multimodal

10. Ola: Pushing the Frontiers of Omni-Modal Language Model
    Ola is an omni-modal language model that matches specialized models in image, video, and audio understanding.
    Tags: large language model

11. Keep It Light! Simplifying Image Clustering Via Text-Free Adapters
    SCP simplifies image clustering via text-free adapters, achieving performance comparable to the state of the art.
    Tags: large language model

12. CAD-Editor: A Locate-then-Infill Framework with Automated Training Data Synthesis for Text-Based CAD Editing
    Proposes the CAD-Editor framework, enabling text-driven CAD model editing through automated data synthesis and a locate-then-infill strategy.
    Tags: large language model

13. RWKV-UI: UI Understanding with Enhanced Perception and Reasoning
    Proposes RWKV-UI, improving vision-language models' performance in UI understanding and interactive reasoning.
    Tags: chain-of-thought

🔬 Pillar 2: RL & Architecture (4 papers)

14. Seeing in the Dark: A Teacher-Student Framework for Dark Video Action Recognition via Knowledge Distillation and Contrastive Learning
    ActLumos is a knowledge-distillation and contrastive-learning framework for action recognition in dark videos.
    Tags: contrastive learning, teacher-student, distillation

15. Taking A Closer Look at Interacting Objects: Interaction-Aware Open Vocabulary Scene Graph Generation
    Proposes the INOVA framework, improving open-vocabulary scene-graph generation through an interaction-aware mechanism.
    Tags: distillation, open-vocabulary

16. Adapting Human Mesh Recovery with Vision-Language Feedback
    Proposes a human mesh recovery method adapted via vision-language feedback to address model-image alignment.
    Tags: contrastive learning, VQ-VAE, human mesh recovery

17. Adaptive Margin Contrastive Learning for Ambiguity-aware 3D Semantic Segmentation
    Proposes AMContrast3D, using adaptive-margin contrastive learning to handle unreliable labels on ambiguous points in 3D semantic segmentation.
    Tags: contrastive learning

🔬 Pillar 8: Physics-based Animation (2 papers)

18. TerraQ: Spatiotemporal Question-Answering on Satellite Image Archives
    TerraQ is a spatiotemporal question-answering engine for satellite image archives.
    Tags: spatiotemporal

19. MotionCanvas: Cinematic Shot Design with Controllable Image-to-Video Generation
    MotionCanvas enables cinematic shot design through controllable image-to-video generation.
    Tags: spatiotemporal

🔬 Pillar 4: Generative Motion (1 paper)

20. DICE: Distilling Classifier-Free Guidance into Text Embeddings
    Proposes DICE to reduce the computational cost of text-to-image generation.
    Tags: classifier-free guidance

🔬 Pillar 3: Perception & Semantics (1 paper)

21. sshELF: Single-Shot Hierarchical Extrapolation of Latent Features for 3D Reconstruction from Sparse-Views
    Proposes sshELF, achieving 3D reconstruction from sparse views via single-shot hierarchical extrapolation of latent features.
    Tags: scene reconstruction, scene understanding, foundation model

🔬 Pillar 7: Motion Retargeting (1 paper)

22. Vision-Integrated LLMs for Autonomous Driving Assistance: Human Performance Comparison and Trust Evaluation
    Proposes a vision-integrated LLM driving-assistance system, improving scene understanding and decision-making in complex scenarios.
    Tags: spatial relationship, large language model

🔬 Pillar 6: Video Extraction (1 paper)

23. HD-EPIC: A Highly-Detailed Egocentric Video Dataset
    HD-EPIC is a highly detailed egocentric video dataset of kitchen scenes for evaluating and improving vision-language models.
    Tags: egocentric
