cs.CV (2025-02-06)

📊 23 papers total | 🔗 6 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (13, 🔗 4) · Pillar 2: RL & Architecture (4, 🔗 1) · Pillar 8: Physics-based Animation (2) · Pillar 4: Generative Motion (1, 🔗 1) · Pillar 3: Perception & Semantics (1) · Pillar 7: Motion Retargeting (1) · Pillar 6: Video Extraction (1)

🔬 Pillar 9: Embodied Foundation Models (13 papers)

1. PixFoundation: Are We Heading in the Right Direction with Pixel-level Vision Foundation Models?
   Reveals the limitations of pixel-level vision foundation models in visual question answering and grounding, and explores the potential of MLLMs without pixel-level supervision.
   Tags: large language model, foundation model

2. LeAP: Consistent multi-domain 3D labeling using Foundation Models
   Uses foundation models for consistent automatic 3D point-cloud labeling across multiple domains.
   Tags: foundation model

3. WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs
   Introduces the WorldSense benchmark for evaluating the omnimodal understanding of multimodal LLMs in real-world scenarios.
   Tags: multimodal

4. A Self-supervised Multimodal Deep Learning Approach to Differentiate Post-radiotherapy Progression from Pseudoprogression in Glioblastoma
   Proposes a self-supervised multimodal deep learning method to differentiate true progression from pseudoprogression after radiotherapy in glioblastoma.
   Tags: multimodal

5. LR0.FM: Low-Res Benchmark and Improving Robustness for Zero-Shot Classification in Foundation Models
   A low-resolution benchmark; improves the zero-shot classification robustness of vision-language foundation models on low-resolution images.
   Tags: foundation model

6. No Free Lunch in Annotation either: An objective evaluation of foundation models for streamlining annotation in animal tracking
   Objectively evaluates how effective foundation models are at streamlining annotation for animal tracking.
   Tags: foundation model

7. FairT2I: Mitigating Social Bias in Text-to-Image Generation via Large Language Model-Assisted Detection and Attribute Rebalancing
   FairT2I mitigates social bias in text-to-image generation through LLM-assisted detection and attribute rebalancing.
   Tags: large language model

8. Time-VLM: Exploring Multimodal Vision-Language Models for Augmented Time Series Forecasting
   Proposes Time-VLM, which leverages multimodal vision-language models to augment time-series forecasting.
   Tags: multimodal

9. Color in Visual-Language Models: CLIP deficiencies
   Reveals CLIP's deficiencies in color understanding: a bias against achromatic stimuli and a tendency to prioritize text.
   Tags: multimodal

10. Ola: Pushing the Frontiers of Omni-Modal Language Model
    Ola is an omni-modal language model that matches specialized models in image, video, and audio understanding.
    Tags: large language model

11. Keep It Light! Simplifying Image Clustering Via Text-Free Adapters
    SCP simplifies image clustering via text-free adapters, achieving performance comparable to the state of the art.
    Tags: large language model

12. CAD-Editor: A Locate-then-Infill Framework with Automated Training Data Synthesis for Text-Based CAD Editing
    Proposes the CAD-Editor framework, enabling text-driven CAD model editing through automated data synthesis and a locate-then-infill strategy.
    Tags: large language model

13. RWKV-UI: UI Understanding with Enhanced Perception and Reasoning
    Proposes RWKV-UI, improving vision-language models' performance in UI understanding and interactive reasoning.
    Tags: chain-of-thought

🔬 Pillar 2: RL & Architecture (4 papers)

14. Seeing in the Dark: A Teacher-Student Framework for Dark Video Action Recognition via Knowledge Distillation and Contrastive Learning
    ActLumos is a knowledge-distillation and contrastive-learning framework for action recognition in dark videos.
    Tags: contrastive learning, teacher-student, distillation

15. Taking A Closer Look at Interacting Objects: Interaction-Aware Open Vocabulary Scene Graph Generation
    Proposes the INOVA framework, improving open-vocabulary scene-graph generation through an interaction-aware mechanism.
    Tags: distillation, open-vocabulary

16. Adapting Human Mesh Recovery with Vision-Language Feedback
    Proposes a human mesh recovery method adapted via vision-language feedback to address model-image alignment.
    Tags: contrastive learning, VQ-VAE, human mesh recovery

17. Adaptive Margin Contrastive Learning for Ambiguity-aware 3D Semantic Segmentation
    Proposes AMContrast3D, using adaptive-margin contrastive learning to handle unreliable labels on ambiguous points in 3D semantic segmentation.
    Tags: contrastive learning

🔬 Pillar 8: Physics-based Animation (2 papers)

18. TerraQ: Spatiotemporal Question-Answering on Satellite Image Archives
    TerraQ is a spatiotemporal question-answering engine for satellite image archives.
    Tags: spatiotemporal

19. MotionCanvas: Cinematic Shot Design with Controllable Image-to-Video Generation
    MotionCanvas enables cinematic shot design through controllable image-to-video generation.
    Tags: spatiotemporal

🔬 Pillar 4: Generative Motion (1 paper)

20. DICE: Distilling Classifier-Free Guidance into Text Embeddings
    Proposes DICE to reduce the computational cost of text-to-image generation.
    Tags: classifier-free guidance

🔬 Pillar 3: Perception & Semantics (1 paper)

21. sshELF: Single-Shot Hierarchical Extrapolation of Latent Features for 3D Reconstruction from Sparse-Views
    Proposes sshELF, achieving 3D reconstruction from sparse views via single-shot hierarchical extrapolation of latent features.
    Tags: scene reconstruction, scene understanding, foundation model

🔬 Pillar 7: Motion Retargeting (1 paper)

22. Vision-Integrated LLMs for Autonomous Driving Assistance: Human Performance Comparison and Trust Evaluation
    Proposes a vision-integrated LLM driving-assistance system, improving scene understanding and decision-making in complex scenarios.
    Tags: spatial relationship, large language model

🔬 Pillar 6: Video Extraction (1 paper)

23. HD-EPIC: A Highly-Detailed Egocentric Video Dataset
    HD-EPIC is a highly detailed egocentric video dataset of kitchen scenes for evaluating and improving vision-language models.
    Tags: egocentric
