cs.CV (2025-02-11)

📊 22 papers in total | 🔗 7 with code

🎯 Interest Area Navigation

Pillar 2: RL & Architecture (6 🔗2) · Pillar 6: Video Extraction (5 🔗3) · Pillar 1: Robot Control (4) · Pillar 9: Embodied Foundation Models (4 🔗1) · Pillar 3: Perception & Semantics (2 🔗1) · Pillar 8: Physics-based Animation (1)

🔬 Pillar 2: RL & Architecture (6 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
1	Flow Distillation Sampling: Regularizing 3D Gaussians with Pre-trained Matching Priors	Proposes Flow Distillation Sampling, which regularizes 3D Gaussians with pre-trained matching priors to improve geometric reconstruction quality.	distillation 3D gaussian splatting 3DGS	✅
2	A Survey on Mamba Architecture for Vision Applications	Surveys applications of the Mamba architecture to vision tasks, exploring its potential for image and video understanding.	Mamba SSM spatiotemporal
3	HOMIE: Histopathology Omni-modal Embedding for Pathology Composed Retrieval	HOMIE: a histopathology omni-modal embedding method for pathology composed retrieval.	predictive model large language model multimodal
4	A Survey of Representation Learning, Optimization Strategies, and Applications for Omnidirectional Vision	A survey of deep learning for omnidirectional vision, focused on representation learning, optimization strategies, and applications.	representation learning optical flow
5	PlaySlot: Learning Inverse Latent Dynamics for Controllable Object-Centric Video Prediction and Planning	PlaySlot: learns inverse latent dynamics for controllable, object-centric video prediction and planning.	world model latent dynamics	✅
6	Articulate That Object Part (ATOP): 3D Part Articulation via Text and Motion Personalization	ATOP: proposes a 3D part articulation method based on text and motion personalization.	distillation motion generation

🔬 Pillar 6: Video Extraction (5 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
7	Towards Zero-Shot Anomaly Detection and Reasoning with Multimodal Large Language Models	Proposes Anomaly-OV for zero-shot anomaly detection and reasoning, significantly improving fine-grained anomaly recognition.	feature matching large language model multimodal	✅
8	EgoTextVQA: Towards Egocentric Scene-Text Aware Video Question Answering	Proposes the EgoTextVQA benchmark for evaluating egocentric, scene-text-aware video question answering.	egocentric large language model multimodal	✅
9	PRVQL: Progressive Knowledge-guided Refinement for Robust Egocentric Visual Query Localization	Proposes PRVQL, which refines visual query localization in egocentric video through progressive knowledge guidance.	egocentric Ego4D	✅
10	EventEgo3D++: 3D Human Motion Capture from a Head-Mounted Event Camera	EventEgo3D++: 3D human motion capture using a head-mounted event camera.	SMPL egocentric
11	Few-Shot Multi-Human Neural Rendering Using Geometry Constraints	Proposes a few-shot multi-human neural rendering method based on geometry constraints, addressing occlusion and clutter.	SMPL

🔬 Pillar 1: Robot Control (4 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
12	TranSplat: Surface Embedding-guided 3D Gaussian Splatting for Transparent Object Manipulation	TranSplat: surface embedding-guided 3D Gaussian splatting for transparent object manipulation.	manipulation 3D gaussian splatting gaussian splatting
13	DeepSeek on a Trip: Inducing Targeted Visual Hallucinations via Representation Vulnerabilities	Induces targeted visual hallucinations in DeepSeek models by exploiting representation vulnerabilities.	manipulation large language model multimodal
14	Semi-Supervised Vision-Centric 3D Occupancy World Model for Autonomous Driving	Proposes PreWorld, a semi-supervised, vision-centric 3D occupancy world model for autonomous driving.	motion planning world model
15	Diffusion Suction Grasping with Large-Scale Parcel Dataset	Proposes Diffusion-Suction to address suction grasp planning for complex parcel picking.	manipulation affordance grasp prediction

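🔬 Pillar 9: Embodied Foundation Models (4 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
16	Towards a Robust Framework for Multimodal Hate Detection: A Study on Video vs. Image-based Content	Proposes a robust framework for multimodal hate detection, focusing on the differences between video and image content.	multimodal	✅
17	NanoVLMs: How small can we go and still make coherent Vision Language Models?	Proposes NanoVLMs, exploring the smallest model size at which vision-language models remain coherent.	large language model multimodal
18	Scaling Pre-training to One Hundred Billion Data for Vision Language Models	Large-scale vision-language pre-training: explores how 100 billion examples affect model performance and cultural diversity.	multimodal
19	Confidence-calibrated covariate shift correction for few-shot classification in Vision-Language Models	Proposes CalShift, which calibrates confidence and corrects covariate shift to improve the generalization of vision-language models in few-shot classification.	foundation model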

🔬 Pillar 3: Perception & Semantics (2 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
20	TRAVEL: Training-Free Retrieval and Alignment for Vision-and-Language Navigation	Proposes TRAVEL, a training-free retrieval and alignment method for vision-and-language navigation.	semantic map VLMAP VLN
21	VidCRAFT3: Camera, Object, and Lighting Control for Image-to-Video Generation	VidCRAFT3: image-to-video generation with camera, object, and lighting control.	optical flow	✅

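🔬 Pillar 8: Physics-based Animation (1 paper)

#	Title	One-Sentence Summary	Tags	🔗	⭐
22	Enhancing Video Understanding: Deep Neural Networks for Spatiotemporal Analysis	Explores spatiotemporal features and deep networks, surveying video understanding algorithms and datasets.	spatiotemporal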
