cs.CV (2026-01-30)

📊 30 papers in total | 🔗 4 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (10 🔗2) · Pillar 2: RL & Architecture (8) · Pillar 3: Perception & Semantics (7 🔗1) · Pillar 4: Generative Motion (2) · Pillar 8: Physics-based Animation (1) · Pillar 6: Video Extraction (1 🔗1) · Pillar 7: Motion Retargeting (1)

🔬 Pillar 9: Embodied Foundation Models (10 papers)

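| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 1 | ImgCoT: Compressing Long Chain of Thought into Compact Visual Tokens for Efficient Reasoning of Large Language Model | Proposes ImgCoT to compress long chains of thought. | large language model, chain-of-thought | |
| 2 | Head-Aware Visual Cropping: Enhancing Fine-Grained VQA with Attention-Guided Subimage | Proposes Head-Aware Visual Cropping to improve the visual grounding of multimodal large models in fine-grained VQA. | large language model, multimodal, visual grounding | |
| 3 | ShotFinder: Imagination-Driven Open-Domain Video Shot Retrieval via Web Search | ShotFinder: a benchmark and method for open-domain video shot retrieval driven by web search and imagination. | large language model, multimodal | |
| 4 | Compact Hypercube Embeddings for Fast Text-based Wildlife Observation Retrieval | Proposes compact hypercube embeddings to speed up text-based wildlife observation retrieval. | foundation model, multimodal | |
| 5 | VisionTrim: Unified Vision Token Compression for Training-Free MLLM Acceleration | VisionTrim: a unified vision token compression framework for training-free MLLM acceleration. | large language model, multimodal | |
| 6 | PhoStream: Benchmarking Real-World Streaming for Omnimodal Assistants in Mobile Scenarios | PhoStream: evaluates the real-world streaming understanding of omnimodal assistants in mobile scenarios. | large language model, multimodal | |
| 7 | ScribbleSense: Generative Scribble-Based Texture Editing with Intent Prediction | ScribbleSense: scribble-based generative texture editing with intent prediction, improving interactive 3D asset creation. | large language model, multimodal | |
| 8 | Structured Over Scale: Learning Spatial Reasoning from Educational Video | Proposes the DoraVQA dataset and uses structured information in educational videos to improve the spatial reasoning of vision-language models. | multimodal | |
| 9 | One-shot Optimized Steering Vector for Hallucination Mitigation for VLMs | Proposes OSGA, which mitigates hallucination in vision-language models by optimizing a steering vector from a single sample. | multimodal | |
| 10 | StreamSense: Streaming Social Task Detection with Selective Vision-Language Model Routing | StreamSense: streaming social task detection based on selective VLM routing. | TAMP | |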

🔬 Pillar 2: RL & Architecture (8 papers)

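| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 11 | How Much of a Model Do We Need? Redundancy and Slimmability in Remote Sensing Foundation Models | Analyzes redundancy in remote sensing foundation models and slims them, revealing the limits within which parameter scaling remains effective. | masked autoencoder, MAE, foundation model | |
| 12 | Inference-Time Dynamic Modality Selection for Incomplete Multimodal Classification | DyMo: an inference-time dynamic modality selection framework for incomplete multimodal classification. | representation learning, multimodal | |
| 13 | Video-o3: Native Interleaved Clue Seeking for Long Video Multi-Hop Reasoning | Video-o3: a native interleaved clue-seeking framework for multi-hop reasoning over long videos. | reinforcement learning, large language model, multimodal | |
| 14 | Med-Scout: Curing MLLMs' Geometric Blindness in Medical Perception via Geometry-Aware RL Post-Training | Med-Scout: addresses the geometric blindness of MLLMs in medical perception through geometry-aware reinforcement learning post-training. | reinforcement learning, large language model, multimodal | |
| 15 | Gated Relational Alignment via Confidence-based Distillation for Efficient VLMs | GRACE: a quantization training framework for efficient vision-language models that uses gated relational alignment with confidence-based distillation. | distillation, multimodal | |
| 16 | VideoGPA: Distilling Geometry Priors for 3D-Consistent Video Generation | VideoGPA: achieves 3D-consistent video generation by distilling geometry priors. | DPO, direct preference optimization, foundation model | |
| 17 | Region-Normalized DPO for Medical Image Segmentation under Noisy Judges | Proposes region-normalized DPO for fine-tuning medical image segmentation models under noisy judges. | DPO, direct preference optimization | |
| 18 | DINO-SAE: DINO Spherical Autoencoder for High-Fidelity Image Reconstruction and Generation | DINO-SAE: a DINO spherical autoencoder for high-fidelity image reconstruction and generation. | flow matching, foundation model | |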

🔬 Pillar 3: Perception & Semantics (7 papers)

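| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 19 | ExpAlign: Expectation-Guided Vision-Language Alignment for Open-Vocabulary Grounding | Proposes ExpAlign for vision-language alignment in open-vocabulary grounding. | open-vocabulary, open vocabulary | |
| 20 | Diachronic Stereo Matching for Multi-Date Satellite Imagery | Proposes a diachronic stereo matching method for 3D reconstruction from multi-date satellite imagery. | monocular depth, gaussian splatting, splatting | |
| 21 | Segment Any Events with Language | SEAL: the first language-prompted instance segmentation framework for event data, with open-vocabulary support. | scene understanding, open-vocabulary, open vocabulary | |
| 22 | FlowCalib: LiDAR-to-Vehicle Miscalibration Detection using Scene Flows | FlowCalib: detects LiDAR-to-vehicle extrinsic miscalibration using scene flow. | scene flow | |
| 23 | Deep in the Jungle: Towards Automating Chimpanzee Population Estimation | Proposes an automated method for estimating chimpanzee population size based on monocular depth estimation. | depth estimation, monocular depth, Depth Anything | |
| 24 | Under-Canopy Terrain Reconstruction in Dense Forests Using RGB Imaging and Neural 3D Reconstruction | Proposes a method for reconstructing under-canopy forest terrain from RGB images and neural radiance fields. | NeRF, neural radiance field | |
| 25 | Hi-Light: A Path to high-fidelity, high-resolution video relighting with a Novel Evaluation Paradigm | Proposes Hi-Light to address stability and detail preservation in video relighting. | optical flow | |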

🔬 Pillar 4: Generative Motion (2 papers)

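| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 26 | Cross-Domain Few-Shot Learning for Hyperspectral Image Classification Based on Mixup Foundation Model | Proposes MIFOMO, based on a Mixup foundation model, for cross-domain few-shot hyperspectral image classification. | MDM, HSI, foundation model | |
| 27 | Training-Free Representation Guidance for Diffusion Models with a Representation Alignment Projector | Proposes a training-free guidance method for diffusion models based on a representation alignment projector, improving image generation quality. | classifier-free guidance | |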

🔬 Pillar 8: Physics-based Animation (1 paper)

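| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 28 | Mitigating Hallucinations in Video Large Language Models via Spatiotemporal-Semantic Contrastive Decoding | Proposes spatiotemporal-semantic contrastive decoding to mitigate hallucinations in video large language models. | spatiotemporal, large language model | |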

🔬 Pillar 6: Video Extraction (1 paper)

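| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 29 | PEAR: Pixel-aligned Expressive humAn mesh Recovery | PEAR: a fast, pixel-aligned human mesh recovery framework that addresses missing detail and slow speed. | human mesh recovery, SMPL-X | |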

🔬 Pillar 7: Motion Retargeting (1 paper)

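| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 30 | UniGeo: A Unified 3D Indoor Object Detection Framework Integrating Geometry-Aware Learning and Dynamic Channel Gating | UniGeo: a unified 3D indoor object detection framework that combines geometry-aware learning and dynamic channel gating. | spatial relationship | |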
