cs.CV (2025-10-21)

📊 38 papers in total | 🔗 11 with code

🎯 Navigation by interest area

Pillar 3: Perception & Semantics (11 🔗3) · Pillar 9: Embodied Foundation Models (11 🔗1) · Pillar 2: RL & Architecture (10 🔗5) · Pillar 6: Video Extraction (2) · Pillar 5: Interaction & Reaction (1) · Pillar 7: Motion Retargeting (1) · Pillar 8: Physics-based Animation (1 🔗1) · Pillar 1: Robot Control (1 🔗1)

🔬 Pillar 3: Perception & Semantics (11 papers)

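| # | Title | One-line summary | Tags |
|---|-------|------------------|------|
| 1 | Moving Light Adaptive Colonoscopy Reconstruction via Illumination-Attenuation-Aware 3D Gaussian Splatting | Proposes ColIAGS, which uses illumination-attenuation-aware 3D Gaussian Splatting for colonoscopy reconstruction that adapts to a moving light source. | 3D gaussian splatting, 3DGS, gaussian splatting |
| 2 | Re-Activating Frozen Primitives for 3D Gaussian Splatting | ReAct-GS: re-activates frozen primitives to resolve over-reconstruction artifacts in 3D Gaussian Splatting. | 3D gaussian splatting, gaussian splatting, splatting |
| 3 | OpenInsGaussian: Open-vocabulary Instance Gaussian Segmentation with Context-aware Cross-view Fusion | Proposes OpenInsGaussian, which achieves open-vocabulary instance Gaussian segmentation through context-aware cross-view fusion. | gaussian splatting, splatting, scene understanding |
| 4 | GeoDiff: Geometry-Guided Diffusion for Metric Depth Estimation | GeoDiff: proposes a geometry-guided diffusion model for metric depth estimation that requires no retraining. | depth estimation, monocular depth, metric depth |
| 5 | BlendCLIP: Bridging Synthetic and Real Domains for Zero-Shot 3D Object Classification with Multimodal Pretraining | BlendCLIP: bridges synthetic and real domains through multimodal pretraining for zero-shot 3D object classification. | open-vocabulary, open vocabulary, multimodal |
| 6 | Mono4DGS-HDR: High Dynamic Range 4D Gaussian Splatting from Alternating-exposure Monocular Videos | Mono4DGS-HDR: proposes a Gaussian Splatting-based method for HDR 4D reconstruction from monocular alternating-exposure videos. | gaussian splatting, splatting |
| 7 | Think with 3D: Geometric Imagination Grounded Spatial Reasoning from Limited Views | Proposes 3DThinker, which performs spatial reasoning grounded in geometric imagination from limited views. | VGGT, spatial relationship, foundation model |
| 8 | PLANA3R: Zero-shot Metric Planar 3D Reconstruction via Feed-Forward Planar Splatting | PLANA3R: zero-shot metric planar 3D reconstruction based on feed-forward planar splatting. | depth estimation, splatting |
| 9 | UWBench: A Comprehensive Vision-Language Benchmark for Underwater Understanding | Proposes UWBench, an underwater vision-language benchmark, to advance research on underwater environment understanding. | scene understanding, multimodal, visual grounding |
| 10 | MoAlign: Motion-Centric Representation Alignment for Video Diffusion Models | MoAlign: proposes a motion-centric representation alignment method for video diffusion models that improves temporal consistency and physical plausibility. | optical flow, physically plausible |
| 11 | VelocityNet: Real-Time Crowd Anomaly Detection via Person-Specific Velocity Analysis | VelocityNet: real-time crowd anomaly detection via person-specific velocity analysis. | optical flow |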

🔬 Pillar 9: Embodied Foundation Models (11 papers)

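| # | Title | One-line summary | Tags |
|---|-------|------------------|------|
| 12 | The Impact of Image Resolution on Biomedical Multimodal Large Language Models | Studies how image resolution affects the performance of biomedical multimodal large language models and proposes a mixed-resolution training strategy. | large language model, multimodal |
| 13 | Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs | Proposes GAR to address region-level understanding in multimodal large language models. | large language model, multimodal |
| 14 | VLSU: Mapping the Limits of Joint Multimodal Understanding for AI Safety | VLSU: builds a multimodal AI safety evaluation framework that reveals the limits of joint vision-language understanding. | foundation model, multimodal |
| 15 | Detection and Simulation of Urban Heat Islands Using a Fine-Tuned Geospatial Foundation Model for Microclimate Impact Prediction | Fine-tunes a geospatial foundation model to detect and simulate urban heat islands and predict microclimate impact. | foundation model |
| 16 | IF-VidCap: Can Video Caption Models Follow Instructions? | Proposes the IF-VidCap benchmark to evaluate the instruction-following ability of video captioning models. | large language model, multimodal, instruction following |
| 17 | Beyond Single Models: Mitigating Multimodal Hallucinations via Adaptive Token Ensemble Decoding | Proposes Adaptive Token Ensemble Decoding (ATED), a training-free method that effectively mitigates hallucinations in multimodal large models. | multimodal |
| 18 | Robust Driving QA through Metadata-Grounded Context and Task-Specific Prompts | Proposes a driving-scene question answering system built on metadata and task-specific prompts to improve robustness. | multimodal, chain-of-thought |
| 19 | See the Text: From Tokenization to Visual Reading | Proposes SeeTok, which treats text as images and uses a multimodal LLM for efficient visual reading comprehension. | large language model, multimodal |
| 20 | SITS-DECO: A Generative Decoder Is All You Need For Multitask Satellite Image Time Series Modelling | SITS-DECO: uses only a generative decoder for multitask satellite image time series modelling. | large language model, foundation model |
| 21 | PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions | Proposes PoSh, which uses scene graphs to guide LLM evaluation of image descriptions, and releases the DOCENT dataset. | foundation model |
| 22 | Gestura: A LVLM-Powered System Bridging Motion and Semantics for Real-Time Free-Form Gesture Understanding | Gestura: an LVLM-based system for real-time free-form gesture understanding that bridges the gap between motion and semantics. | chain-of-thought |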

🔬 Pillar 2: RL & Architecture (10 papers)

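| # | Title | One-line summary | Tags |
|---|-------|------------------|------|
| 23 | Proactive Reasoning-with-Retrieval Framework for Medical Multimodal Large Language Models | Proposes the Med-RwR framework, which strengthens the reasoning of medical multimodal large language models through proactive retrieval. | reinforcement learning, large language model, multimodal |
| 24 | CovMatch: Cross-Covariance Guided Multimodal Dataset Distillation with Trainable Text Encoder | CovMatch: multimodal dataset distillation via cross-covariance guidance and a trainable text encoder. | contrastive learning, distillation, multimodal |
| 25 | Vision Foundation Models Can Be Good Tokenizers for Latent Diffusion Models | Proposes VFM-VAE, which directly uses a vision foundation model as the tokenizer for latent diffusion models, significantly improving generation quality and efficiency. | distillation, foundation model |
| 26 | Activating Visual Context and Commonsense Reasoning through Masked Prediction in VLMs | Proposes a masked-prediction-based method for activating context and commonsense, improving the reasoning of vision-language models in multimodal scenarios. | reinforcement learning, large language model, multimodal |
| 27 | Ranking-based Preference Optimization for Diffusion Models from Implicit User Feedback | Proposes Diffusion-DRO, which improves user-preference alignment of diffusion models through ranking-based optimization and online negative samples. | reinforcement learning, inverse reinforcement learning, preference learning |
| 28 | OmniNWM: Omniscient Driving Navigation World Models | OmniNWM: an omniscient panoramic navigation world model for autonomous driving. | world model, metric depth |
| 29 | Embodied Navigation with Auxiliary Task of Action Description Prediction | Proposes an embodied navigation method with action description prediction as an auxiliary task, improving navigation performance and interpretability. | reinforcement learning, distillation, multimodal |
| 30 | UniHPR: Unified Human Pose Representation via Singular Value Contrastive Learning | Proposes UniHPR, which unifies multimodal human pose representations through singular value contrastive learning. | representation learning, contrastive learning |
| 31 | Exploring a Unified Vision-Centric Contrastive Alternatives on Multi-Modal Web Documents | Proposes vision-centric contrastive learning (VC2L) to unify representation learning for multimodal web documents. | representation learning, contrastive learning, multimodal |
| 32 | ProCLIP: Progressive Vision-Language Alignment via LLM-based Embedder | ProCLIP: proposes a progressive vision-language alignment framework based on an LLM embedder, improving CLIP's handling of long text. | contrastive learning, curriculum learning, distillation |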

🔬 Pillar 6: Video Extraction (2 papers)

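| # | Title | One-line summary | Tags |
|---|-------|------------------|------|
| 33 | Latent-Info and Low-Dimensional Learning for Human Mesh Recovery and Parallel Optimization | Proposes a human mesh recovery and parallel optimization method based on latent information and low-dimensional learning. | human mesh recovery, human motion |
| 34 | Hyperbolic Space Learning Method Leveraging Temporal Motion Priors for Human Mesh Recovery | Proposes a hyperbolic space learning method that leverages temporal motion priors for human mesh recovery. | human mesh recovery |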

🔬 Pillar 5: Interaction & Reaction (1 paper)

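| # | Title | One-line summary | Tags |
|---|-------|------------------|------|
| 35 | Learning Human-Object Interaction as Groups | Proposes the GroupHOI framework, which improves human-object interaction detection from a group-interaction perspective. | human-object interaction, HOI |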

🔬 Pillar 7: Motion Retargeting (1 paper)

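| # | Title | One-line summary | Tags |
|---|-------|------------------|------|
| 36 | DSI-Bench: A Benchmark for Dynamic Spatial Intelligence | Proposes the DSI-Bench benchmark for evaluating dynamic spatial intelligence, revealing the limitations of existing VLMs in understanding dynamic 3D scenes. | spatial relationship |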

🔬 Pillar 8: Physics-based Animation (1 paper)

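| # | Title | One-line summary | Tags |
|---|-------|------------------|------|
| 37 | A Renaissance of Explicit Motion Information Mining from Transformers for Action Recognition | Proposes EMIM, an explicit motion information mining module that strengthens how Transformers model motion information in action recognition. | spatiotemporal |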

🔬 Pillar 1: Robot Control (1 paper)

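| # | Title | One-line summary | Tags |
|---|-------|------------------|------|
| 38 | Efficient Few-shot Identity Preserving Attribute Editing for 3D-aware Deep Generative Models | Proposes an efficient few-shot 3D face attribute editing method that preserves identity. | manipulation |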
