cs.CV (2025-01-31)

📊 27 papers in total | 🔗 6 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (10 🔗2) · Pillar 3: Perception & Semantics (7 🔗1) · Pillar 2: RL & Architecture (6 🔗3) · Pillar 4: Generative Motion (3) · Pillar 1: Robot Control (1)

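🔬 Pillar 9: Embodied Foundation Models (10 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
1	AIN: The Arabic INclusive Large Multimodal Model	Proposes AIN, an Arabic-inclusive large multimodal model that outperforms GPT-4o across multiple domains.	large language model multimodal
2	Text-to-CAD Generation Through Infusing Visual Feedback in Large Language Models	CADFusion: augments large language models with visual feedback for text-to-CAD model generation.	large language model multimodal
3	CerraData-4MM: A multimodal benchmark dataset on Cerrado for land use and land cover classification	Proposes the CerraData-4MM multimodal dataset for land use and land cover classification in the Cerrado.	multimodal	✅
4	Transformation trees -- documentation of multimodal image registration	Proposes a transformation-tree-based approach to multimodal image registration that improves the traceability and reproducibility of medical image processing.	multimodal
5	Deep Ensembling with Multimodal Image Fusion for Efficient Classification of Lung Cancer	Proposes the DEMF network, based on deep ensembling and multimodal image fusion, for efficient lung cancer classification.	multimodal
6	Fairness Analysis of CLIP-Based Foundation Models for X-Ray Image Classification	Analyzes the fairness of CLIP-like models across demographic attributes in X-ray image classification.	foundation model
7	PixelWorld: How Far Are We from Perceiving Everything as Pixels?	Proposes the PixelWorld benchmark, which explores a unified "everything as pixels" perception paradigm for evaluating vision-language models.	multimodal chain-of-thought
8	RedundancyLens: Revealing and Exploiting Visual Token Processing Redundancy for Efficient Decoder-Only MLLMs	RedundancyLens reveals and exploits redundancy in visual token processing to improve the efficiency of decoder-only MLLMs.	large language model multimodal	✅
9	Poison as Cure: Visual Noise for Mitigating Object Hallucinations in LVMs	Proposes visual adversarial perturbation (VAP) to mitigate object hallucinations in large vision-language models.	large language model
10	TV-Dialogue: Crafting Theme-Aware Video Dialogues with Immersive Interaction	Proposes the TV-Dialogue framework for generating theme-aware, immersive video dialogues.	multimodal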

🔬 Pillar 3: Perception & Semantics (7 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
11	LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models	LLMDet: a strong open-vocabulary object detector trained under the supervision of large language models.	open-vocabulary open vocabulary large language model	✅
12	JGHand: Joint-Driven Animatable Hand Avater via 3D Gaussian Splatting	Proposes JGHand, a joint-driven animatable hand avatar based on 3D Gaussian splatting that achieves high-quality real-time rendering.	3D gaussian splatting 3DGS gaussian splatting
13	Advancing Dense Endoscopic Reconstruction with Gaussian Splatting-driven Surface Normal-aware Tracking and Mapping	Proposes Endo-2DTAM, which uses Gaussian-splatting-driven, surface-normal-aware tracking and mapping to improve the accuracy of dense endoscopic reconstruction.	3D gaussian splatting 3DGS gaussian splatting
14	RaySplats: Ray Tracing based Gaussian Splatting	RaySplats: proposes a ray-tracing-based Gaussian splatting method to handle lighting, shadows, and reflections.	3D gaussian splatting 3DGS gaussian splatting
15	Contrast-Aware Calibration for Fine-Tuned CLIP: Leveraging Image-Text Alignment	Proposes contrast-aware calibration (CAC) to improve the confidence calibration of fine-tuned CLIP in open-vocabulary classification.	open-vocabulary open vocabulary multimodal
16	A Survey on Class-Agnostic Counting: Advancements from Reference-Based to Open-World Text-Guided Approaches	Surveys class-agnostic counting methods, covering reference-based, reference-less, and open-world text-guided approaches.	open-vocabulary open vocabulary
17	Lifting by Gaussians: A Simple, Fast and Flexible Method for 3D Instance Segmentation	Proposes the Lifting by Gaussians method for 3D instance segmentation.	3DGS

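🔬 Pillar 2: RL & Architecture (6 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
18	Laser: Efficient Language-Guided Segmentation in Neural Radiance Fields	Proposes Laser, which achieves efficient language-guided segmentation in neural radiance fields through CLIP feature distillation.	distillation neural radiance field	✅
19	EgoMe: A New Dataset and Challenge for Following Me via Egocentric View in Real World	EgoMe: proposes a new dataset and challenge for imitation learning from an egocentric view in the real world.	imitation learning egocentric	✅
20	XRF V2: A Dataset for Action Summarization with Wi-Fi Signals, and IMUs in Phones, Watches, Earbuds, and Glasses	Proposes the XRF V2 dataset and the XRFMamba network for action summarization driven by Wi-Fi and IMU signals.	Mamba foundation model multimodal	✅
21	Imagine with the Teacher: Complete Shape in a Multi-View Distillation Way	Proposes a point cloud completion network based on multi-view distillation to improve 3D shape completion.	teacher-student distillation
22	RLS3: RL-Based Synthetic Sample Selection to Enhance Spatial Reasoning in Vision-Language Models for Indoor Autonomous Perception	Proposes RLS3, a reinforcement-learning-based synthetic sample selection method that enhances the spatial reasoning of vision-language models for indoor autonomous perception.	reinforcement learning visual grounding
23	Improving vision-language alignment with graph spiking hybrid Networks	Proposes the graph spiking hybrid network (GSHN) to improve vision-language alignment and strengthen semantic representation.	contrastive learning spatiotemporal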

🔬 Pillar 4: Generative Motion (3 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
24	MotionPCM: Real-Time Motion Synthesis with Phased Consistency Model	MotionPCM: real-time human motion synthesis based on a phased consistency model.	motion synthesis
25	GDO:Gradual Domain Osmosis	Proposes Gradual Domain Osmosis (GDO) to address knowledge transfer in gradual domain adaptation.	penetration
26	OmniPhysGS: 3D Constitutive Gaussians for General Physics-Based Dynamics Generation	Proposes OmniPhysGS for generating physics-based dynamics of complex objects.	physically plausible

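🔬 Pillar 1: Robot Control (1 paper)

#	Title	One-Sentence Summary	Tags	🔗	⭐
27	UP-VLA: A Unified Understanding and Prediction Model for Embodied Agent	Proposes the UP-VLA model, which improves embodied agent performance through unified understanding and prediction objectives.	manipulation vision-language-action VLA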
