cs.CV (2026-04-07)

📊 187 papers in total | 🔗 15 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (70 🔗7)
Pillar 2: RL & Architecture (45)
Pillar 3: Perception & Semantics (38 🔗5)
Pillar 1: Robot Control (10)
Pillar 8: Physics-based Animation (7)
Pillar 4: Generative Motion (6 🔗1)
Pillar 7: Motion Retargeting (5 🔗2)
Pillar 6: Video Extraction (4)
Pillar 5: Interaction & Reaction (2)

🔬 Pillar 9: Embodied Foundation Models (70 papers)

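# | Title | One-line summary | Tags | 🔗
1 | A Generative Foundation Model for Multimodal Histopathology | MuPD: a generative foundation model for multimodal histopathology that enables cross-modal synthesis and virtual staining. | foundation model, multimodal
2 | A Patch-based Cross-view Regularized Framework for Backdoor Defense in Multimodal Large Language Models | Proposes a framework based on patch augmentation and cross-view regularization to defend multimodal large language models against backdoor attacks. | large language model, multimodal
3 | CoLA: Cross-Modal Low-rank Adaptation for Multimodal Downstream Tasks | CoLA: cross-modal low-rank adaptation for multimodal downstream tasks. | foundation model, multimodal, visual grounding
4 | When Does Multimodal AI Help? Diagnostic Complementarity of Vision-Language Models and CNNs for Spectrum Management in Satellite-Terrestrial Networks | Proposes the SpectrumQA benchmark to diagnose the complementarity of VLMs and CNNs for spectrum management in satellite-terrestrial networks. | foundation model, multimodal, chain-of-thought
5 | Beyond Static Vision: Scene Dynamic Field Unlocks Intuitive Physics Understanding in Multi-modal Large Language Models | Proposes the Scene Dynamic Field (SDF) to improve multimodal large language models' understanding of the physics of continuous object dynamics. | large language model, multimodal
6 | The Indra Representation Hypothesis for Multimodal Alignment | Proposes a multimodal alignment method based on the Indra Representation Hypothesis, achieving training-free, robust cross-modal alignment. | foundation model, multimodal
7 | Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation | Firebolt-VL: efficient vision-language understanding via cross-modality modulation. | large language model, foundation model, multimodal
8 | Chain-of-Frames: Advancing Video Understanding in Multimodal LLMs via Frame-Aware Reasoning | Proposes the Chain-of-Frames framework to improve frame-aware reasoning in multimodal LLMs for video understanding. | large language model, multimodal
9 | ZINA: Multimodal Fine-grained Hallucination Detection and Editing | ZINA: proposes a multimodal fine-grained hallucination detection and editing method to address mismatches between MLLM outputs and visual content. | large language model, multimodal
10 | Image Hashing via Cross-View Code Alignment in the Age of Foundation Models | Proposes CroVCA, which achieves efficient image hashing through cross-view code alignment and is suited to large-scale retrieval. | foundation model, multimodal
11 | EI: Early Intervention for Multimodal Imaging based Disease Recognition | Proposes the EI framework, which improves disease recognition accuracy on multimodal medical imaging through early modality intervention and MoR adaptation. | foundation model, multimodal
12 | LongTail Driving Scenarios with Reasoning Traces: The KITScenes LongTail Dataset | KITScenes LongTail dataset: a long-tail driving scenario dataset with reasoning traces, for end-to-end driving. | VLA, multimodal, instruction following
13 | Automated Segmentation and Tracking of Group Housed Pigs Using Foundation Models | Uses foundation models for automated segmentation and tracking of group-housed pigs, advancing intelligent livestock farming. | foundation model
14 | Multimodal Urban Tree Detection from Satellite and Street-Level Imagery via Annotation-Efficient Deep Learning Strategies | Proposes an urban tree detection method based on multimodal imagery and annotation-efficient strategies. | multimodal
15 | A Physics-Informed, Behavior-Aware Digital Twin for Robust Multimodal Forecasting of Core Body Temperature in Precision Livestock Farming | Proposes a physics-informed digital twin to improve the accuracy of core body temperature prediction in dairy cows. | multimodal
16 | Multimodal Backdoor Attack on VLMs for Autonomous Driving via Graffiti and Cross-Lingual Triggers | Proposes a multimodal backdoor attack based on graffiti and cross-lingual triggers that threatens vision-language models for autonomous driving. | multimodal
17 | ClickAIXR: On-Device Multimodal Vision-Language Interaction with Real-World Objects in Extended Reality | ClickAIXR: a framework for on-device multimodal vision-language interaction with real-world objects in extended reality. | multimodal
18 | Robust Adaptation of Foundation Models with Black-Box Visual Prompting | Proposes BlackVIP, which achieves robust adaptation of large models under limited resources via black-box visual prompting. | foundation model
19 | Automated Wildfire Damage Assessment from Multi view Ground level Imagery Via Vision Language Models | Proposes an automated wildfire damage assessment method based on multi-view ground-level imagery and vision-language models. | large language model, multimodal, chain-of-thought
20 | Revisiting Multimodal Positional Encoding in Vision-Language Models | Proposes multi-head rotary position embedding (MHRoPE) and its variant MRoPE-I to improve multimodal positional encoding in vision-language models. | multimodal
21 | Vision-as-Inverse-Graphics Agent via Interleaved Multimodal Reasoning | Proposes VIGA: a vision-as-inverse-graphics agent via interleaved multimodal reasoning. | multimodal
22 | Improving Multimodal Learning with Dispersive and Anchoring Regularization | Proposes Dispersive and Anchoring Regularization to improve representation quality and fusion in multimodal learning. | multimodal
23 | Inference-Path Optimization via Circuit Duplication in Frozen Visual Transformers for Marine Species Classification | For marine species classification, proposes an inference-path optimization method based on circuit duplication in frozen vision Transformers. | large language model, foundation model
24 | Stabilizing Unsupervised Self-Evolution of MLLMs via Continuous Softened Retracing reSampling | Proposes the CSRS method to stabilize unsupervised self-evolution of multimodal large language models on geometry tasks. | large language model, multimodal
25 | ITIScore: An Image-to-Text-to-Image Rating Framework for the Image Captioning Ability of MLLMs | Proposes ITIScore: an image-to-text-to-image rating framework for evaluating the image captioning ability of multimodal large language models. | large language model, multimodal
26 | BoxComm: Benchmarking Category-Aware Commentary Generation and Narration Rhythm in Boxing | BoxComm: proposes a dataset and evaluation suite for boxing commentary generation, filling a gap in AI research on combat sports commentary. | large language model, multimodal
27 | Beyond Standard Benchmarks: A Systematic Audit of Vision-Language Model's Robustness to Natural Semantic Variation Across Diverse Tasks | Systematically evaluates the robustness of vision-language models to natural semantic variation, revealing their fragility across diverse tasks. | multimodal, zero-shot transfer
28 | Synthesis4AD: Synthetic Anomalies are All You Need for 3D Anomaly Detection | Synthesis4AD: uses synthetic anomaly data to improve 3D anomaly detection performance. | large language model, multimodal
29 | Compact Hypercube Embeddings for Fast Text-based Wildlife Observation Retrieval | Proposes compact hypercube embeddings to speed up text-based wildlife observation retrieval. | foundation model, multimodal
30 | SafeScreen: A Safety-First Screening Framework for Personalized Video Retrieval for Vulnerable Users | SafeScreen: a safety-first personalized video retrieval framework for vulnerable users. | multimodal
31 | When Sinks Help or Hurt: Unified Framework for Attention Sink in Large Vision-Language Models | Proposes a layer-wise sink gating (LSG) module that improves the balance between global reasoning and local perception in large vision-language models. | multimodal
32 | Gaze to Insight: A Scalable AI Approach for Detecting Gaze Behaviours in Face-to-Face Collaborative Learning | Proposes a scalable AI approach that detects gaze behaviours in face-to-face collaborative learning without manual annotation. | foundation model
33 | Bridging the Dimensionality Gap: A Taxonomy and Survey of 2D Vision Model Adaptation for 3D Analysis | Surveys methods for adapting 2D vision models to 3D analysis, bridging the dimensionality gap. | foundation model
34 | Focus Matters: Phase-Aware Suppression for Hallucination in Vision-Language Models | Proposes a phase-aware suppression method to address hallucination in vision-language models. | multimodal
35 | Love Me, Love My Label: Rethinking the Role of Labels in Prompt Retrieval for Visual In-Context Learning | LaPR: proposes a label-aware prompt retrieval framework for visual in-context learning that improves task performance. | foundation model
36 | DSERT-RoLL: Robust Multi-Modal Perception for Diverse Driving Conditions with Stereo Event-RGB-Thermal Cameras, 4D Radar, and Dual-LiDAR | DSERT-RoLL: a dataset and fusion framework for robust multimodal perception under diverse driving conditions. | multimodal
37 | SciLT: Long-Tailed Classification in Scientific Image Domains | SciLT: proposes an adaptive feature fusion and dual-supervision learning framework for long-tailed classification in scientific image domains. | foundation model
38 | SGTA: Scene-Graph Based Multi-Modal Traffic Agent for Video Understanding | Proposes a scene-graph-based multimodal traffic agent (SGTA) for traffic video understanding. | large language model
39 | A Systematic Study of Cross-Modal Typographic Attacks on Audio-Visual Reasoning | Proposes multimodal typographic attacks, revealing the vulnerability of large audio-visual reasoning models to cross-modal adversarial attacks. | large language model
40 | ATSS: Detecting AI-Generated Videos via Anomalous Temporal Self-Similarity | Proposes ATSS, which detects AI-generated videos via anomalous temporal self-similarity. | multimodal
41 | Graph-to-Frame RAG: Visual-Space Knowledge Fusion for Training-Free and Auditable Video Reasoning | Proposes Graph-to-Frame RAG to address knowledge fusion in video reasoning. | multimodal
42 | Don't Waste Bits! Adaptive KV-Cache Quantization for Lightweight On-Device LLMs | Proposes an adaptive KV-cache quantization method to optimize memory and latency for lightweight on-device LLMs. | large language model
43 | DIRECT: Video Mashup Creation via Hierarchical Multi-Agent Planning and Intent-Guided Editing | Proposes the DIRECT framework, which produces high-quality video mashups through hierarchical multi-agent planning and intent-guided editing. | multimodal
44 | FileGram: Grounding Agent Personalization in File-System Behavioral Traces | FileGram: proposes an agent personalization framework grounded in file-system behavioral traces, addressing agent customization under data constraints. | multimodal
45 | Rethinking Model Efficiency: Multi-Agent Inference with Large Models | Proposes a multi-agent inference framework that exploits the strengths of large and small models to improve vision-language model efficiency. | large language model
46 | VideoCoF: Unified Video Editing with Temporal Reasoner | VideoCoF: proposes a unified video editing framework based on temporal reasoning that achieves precise editing without masks. | chain-of-thought
47 | SubspaceAD: Training-Free Few-Shot Anomaly Detection via Subspace Modeling | SubspaceAD: a training-free few-shot anomaly detection method based on subspace modeling. | foundation model
48 | Event6D: Event-based Novel Object 6D Pose Tracking | EventTrack6D: proposes a general event-camera-based 6D object pose tracking framework. | TAMP
49 | JAMMEval: A Refined Collection of Japanese Benchmarks for Reliable VLM Evaluation | Proposes JAMMEval, a refined collection of benchmarks for reliable evaluation of Japanese VLMs. | visual grounding
50 | VLA-InfoEntropy: A Training-Free Vision-Attention Information Entropy Approach for Vision-Language-Action Models Inference Acceleration and Success | VLA-InfoEntropy: a training-free vision-attention information entropy approach that accelerates and improves VLA model inference. | vision-language-action, VLA
51 | Let Geometry GUIDE: Layer-wise Unrolling of Geometric Priors in Multimodal LLMs | Proposes the GUIDE framework to address spatial perception in multimodal large language models. | large language model, foundation model, multimodal
52 | Analogical Reasoning as a Doctor: A Foundation Model for Gastrointestinal Endoscopy Diagnosis | RATNet: a foundation model for gastrointestinal endoscopy diagnosis based on analogical reasoning, improving generalization and robustness. | foundation model, zero-shot transfer
53 | A Unified Foundation Model for All-in-One Multi-Modal Remote Sensing Image Restoration and Fusion with Language Prompting | Proposes LLaRS: a unified foundation model for multimodal remote sensing image restoration and fusion. | foundation model, language conditioned
54 | Lightweight Multimodal Adaptation of Vision Language Models for Species Recognition and Habitat Context Interpretation in Drone Thermal Imagery | Proposes a lightweight multimodal adaptation framework for species recognition and habitat context interpretation in drone thermal imagery. | multimodal
55 | Mixture-of-Modality-Experts with Holistic Token Learning for Fine-Grained Multimodal Visual Analytics in Driver Action Recognition | Proposes the MoME framework and HTL strategy to improve fine-grained multimodal visual analytics in driver action recognition. | multimodal
56 | Leveraging Image Editing Foundation Models for Data-Efficient CT Metal Artifact Reduction | Uses image editing foundation models to reduce CT metal artifacts in a data-efficient way. | foundation model
57 | PDMP: Rethinking Balanced Multimodal Learning via Performance-Dominant Modality Prioritization | Proposes the Performance-Dominant Modality Prioritization (PDMP) strategy to address under-optimization in multimodal learning. | multimodal
58 | SGANet: Semantic and Geometric Alignment for Multimodal Multi-view Anomaly Detection | SGANet: a semantic and geometric alignment network for multimodal multi-view anomaly detection. | multimodal
59 | Evaluation Before Generation: A Paradigm for Robust Multimodal Sentiment Analysis with Missing Modalities | Proposes an evaluation-based missing-modality adaptation framework for multimodal sentiment analysis. | multimodal
60 | Prior-guided Fusion of Multimodal Features for Change Detection from Optical-SAR Images | Proposes STSF-Net, which uses prior-guided multimodal feature fusion for change detection from optical-SAR images. | multimodal
61 | UAVReason: A Unified, Large-Scale Benchmark for Multimodal Aerial Scene Reasoning and Generation | Proposes UAVReason: a large-scale unified benchmark for multimodal aerial scene understanding and generation. | multimodal
62 | FoleyDesigner: Immersive Stereo Foley Generation with Precise Spatio-Temporal Alignment for Film Clips | FoleyDesigner: proposes an immersive stereo Foley generation framework with precise spatio-temporal alignment for film clips. | large language model, TAMP
63 | DetailVerifyBench: A Benchmark for Dense Hallucination Localization in Long Image Captions | Proposes DetailVerifyBench, a benchmark for fine-grained hallucination localization in long image captions. | large language model, multimodal
64 | EchoAgent: Towards Reliable Echocardiography Interpretation with "Eyes","Hands" and "Minds" | Proposes EchoAgent for reliable end-to-end echocardiography interpretation, emulating how a physician's "eyes, hands, and mind" work together. | large language model, multimodal
65 | VideoStir: Understanding Long Videos via Spatio-Temporally Structured and Intent-Aware RAG | VideoStir: proposes a spatio-temporally structured, intent-aware RAG framework for understanding long videos. | large language model, multimodal
66 | CoStream: Codec-Guided Resource-Efficient System for Video Streaming Analytics | CoStream: a codec-guided, resource-efficient system for video streaming analytics. | multimodal
67 | AICA-Bench: Holistically Examining the Capabilities of VLMs in Affective Image Content Analysis | Proposes the AICA-Bench benchmark for holistically evaluating the capabilities of VLMs in affective image content analysis. | multimodal
68 | Physics-Aware Video Instance Removal Benchmark | Proposes PVIR, a physics-aware video instance removal benchmark that evaluates how well algorithms preserve physical consistency during removal. | instruction following
69 | Few-Shot Semantic Segmentation Meets SAM3 | Proposes an unsupervised few-shot semantic segmentation method based on SAM3. | foundation model
70 | Beyond Semantic Search: Towards Referential Anchoring in Composed Image Retrieval | Proposes the object-anchored composed image retrieval task and the AdaFocal framework to address instance-level consistency. | multimodal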

🔬 Pillar 2: RL & Architecture (45 papers)

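# | Title | One-line summary | Tags | 🔗
71 | BiTDiff: Fine-Grained 3D Conducting Motion Generation via BiMamba-Transformer Diffusion | BiTDiff: fine-grained 3D conducting motion generation via a BiMamba-Transformer diffusion model. | Mamba, motion synthesis, motion generation
72 | TreeGaussian: Tree-Guided Cascaded Contrastive Learning for Hierarchical Consistent 3D Gaussian Scene Segmentation and Understanding | TreeGaussian: tree-guided cascaded contrastive learning for hierarchically consistent 3D Gaussian scene segmentation and understanding. | contrastive learning, 3D gaussian splatting, 3DGS
73 | Disentangled World Models: Learning to Transfer Semantic Knowledge from Distracting Videos for Reinforcement Learning | Proposes DisWM, which improves the sample efficiency of visual reinforcement learning in complex environments through offline knowledge distillation and disentanglement constraints. | reinforcement learning, world model, world models
74 | Restore-R1: Efficient Image Restoration Agents via Reinforcement Learning with Multimodal LLM Perceptual Feedback | Proposes Restore-R1, which uses reinforcement learning and multimodal LLM feedback to efficiently solve complex image restoration problems. | reinforcement learning, large language model, multimodal
75 | MuDD: A Multimodal Deception Detection Dataset and GSR-Guided Progressive Distillation for Non-Contact Deception Detection | Proposes the MuDD dataset and the GPD framework for non-contact multimodal deception detection. | representation learning, distillation, multimodal
76 | CoLoRSMamba: Conditional LoRA-Steered Mamba for Supervised Multimodal Violence Detection | Proposes CoLoRSMamba, a conditional LoRA-steered Mamba model for multimodal violence detection. | Mamba, multimodal
77 | Training a Student Expert via Semi-Supervised Foundation Model Distillation | Proposes a semi-supervised knowledge distillation framework that compresses vision foundation models into lightweight expert models, improving instance segmentation performance. | distillation, foundation model
78 | FLEG: Feed-Forward Language Embedded Gaussian Splatting from Any Views via Compact Semantic Representation | FLEG: feed-forward language-embedded Gaussian splatting from any views via compact semantic representation. | distillation, gaussian splatting, splatting
79 | A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens | Proposes DeltaWorld, which efficiently generates diverse future video frames using delta tokens and significantly reduces computational cost. | world model, world models, foundation model
80 | Universal Skeleton Understanding via Differentiable Rendering and MLLMs | SkeletonLLM: universal skeleton understanding via differentiable rendering and MLLMs. | distillation, open-vocabulary, open vocabulary
81 | Learning Additively Compositional Latent Actions for Embodied AI | Proposes AC-LAM, which learns additively compositional latent actions to improve embodied AI performance on tabletop tasks. | policy learning, embodied AI
82 | Multimodal Structure Learning: Disentangling Shared and Specific Topology via Cross-Modal Graphical Lasso | Proposes CM-GLasso, which disentangles shared and modality-specific topology via cross-modal graphical lasso to improve multimodal representation learning. | distillation, multimodal
83 | TAPE: A two-stage parameter-efficient adaptation framework for foundation models in OCT-OCTA analysis | TAPE: a two-stage adaptation framework for efficiently fine-tuning medical foundation models in OCT-OCTA analysis. | MAE, foundation model
84 | CLEAR: Unlocking Generative Potential for Degraded Image Understanding in Unified Multimodal Models | The CLEAR framework uses generative capabilities to improve the robustness of unified multimodal models on degraded image understanding. | reinforcement learning, multimodal
85 | LinguDistill: Recovering Linguistic Ability in Vision-Language Models via Selective Cross-Modal Distillation | LinguDistill: recovers linguistic ability in vision-language models via selective cross-modal distillation. | distillation, multimodal, visual grounding
86 | OpenWorldLib: A Unified Codebase and Definition of Advanced World Models | OpenWorldLib: a unified codebase and definition for advanced world models, promoting efficient reuse and collaborative inference. | world model, world models
87 | RL-AWB: Deep Reinforcement Learning for Auto White Balance Correction in Low-Light Night-time Scenes | Proposes RL-AWB, which uses deep reinforcement learning to solve auto white balance in low-light night-time scenes. | reinforcement learning, deep reinforcement learning
88 | HandDreamer: Zero-Shot Text to 3D Hand Model Generation using Corrective Hand Shape Guidance | HandDreamer: zero-shot text-to-3D hand model generation using corrective hand shape guidance. | dreamer, distillation, MANO
89 | VitaTouch: Property-Aware Vision-Tactile-Language Model for Robotic Quality Inspection in Manufacturing | Proposes VitaTouch, which fuses vision, touch, and language for robotic quality inspection in smart manufacturing. | contrastive learning, large language model, multimodal
90 | V-Reflection: Transforming MLLMs from Passive Observers to Active Interrogators | V-Reflection: improves multimodal large language models on fine-grained perception tasks through active visual querying. | distillation, large language model, multimodal
91 | Incomplete Multi-View Multi-Label Classification via Shared Codebook and Fused-Teacher Self-Distillation | Proposes an incomplete multi-view multi-label classification method based on a shared codebook and fused-teacher self-distillation. | representation learning, contrastive learning, distillation
92 | TIGFlow-GRPO: Trajectory Forecasting via Interaction-Aware Flow Matching and Reward-Guided Optimization | Proposes TIGFlow-GRPO, which achieves more socially compliant trajectory forecasting through interaction-aware flow matching and reward-guided optimization. | flow matching, multimodal
93 | DriveVA: Video Action Models are Zero-Shot Drivers | DriveVA: uses video action models to achieve zero-shot generalization in autonomous driving. | world model, world models, scene understanding
94 | Spatially-Weighted CLIP for Street-View Geo-localization | Proposes spatially-weighted CLIP for street-view geo-localization. | representation learning, contrastive learning, multimodal
95 | Scalable and Generalizable Correspondence Pruning via Geometry-Consistent Pre-training | Proposes a scalable, generalizable correspondence pruning method based on geometry-consistent pre-training. | representation learning, masked autoencoder, geometric consistency
96 | FinPercep-RM: A Fine-grained Reward Model and Co-evolutionary Curriculum for RL-based Real-world Super-Resolution | Proposes FinPercep-RM and CCL to improve the perceptual quality of RL-based real-world super-resolution. | reinforcement learning, policy learning, RLHF
97 | HandMCM: Multi-modal Point Cloud-based Correspondence State Space Model for 3D Hand Pose Estimation | Proposes HandMCM, which uses multimodal point clouds and Correspondence Mamba to handle occlusion in 3D hand pose estimation. | Mamba, state space model
98 | Deep Image Clustering Based on Curriculum Learning and Density Information | Proposes a deep image clustering method based on curriculum learning and density information that improves clustering of complex images. | curriculum learning
99 | TORA: Topological Representation Alignment for 3D Shape Assembly | TORA: more efficient and accurate 3D shape assembly through topological representation alignment. | flow matching, zero-shot transfer
100 | OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models | Proposes OP-GRPO, which improves the generation quality and training efficiency of flow-matching models through off-policy optimization. | flow matching
101 | Reinforce to Learn, Elect to Reason: A Dual Paradigm for Video Reasoning | Proposes the RLER dual paradigm, which generates evidence with reinforcement learning and reasons by election, improving the reliability and interpretability of video reasoning. | reinforcement learning, multimodal
102 | Discovering Failure Modes in Vision-Language Models using RL | Proposes a reinforcement-learning-based framework that automatically discovers failure modes in vision-language models. | reinforcement learning, multimodal
103 | Patch-Wise Hypergraph Contrastive Learning with Dual Normal Distribution Weighting for Multi-Domain Stain Transfer | Proposes STNHCL, which achieves multi-domain stain transfer through hypergraph contrastive learning and dual normal distribution weighting. | contrastive learning
104 | MT-PCR: Hybrid Mamba-Transformer Network with Spatial Serialization for Point Cloud Registration | Proposes MT-PCR: a hybrid Mamba-Transformer network that performs point cloud registration via spatial serialization. | Mamba
105 | SPHINX: A Synthetic Environment for Visual Perception and Reasoning | SPHINX: a synthetic environment for visual perception and reasoning, targeting cognitive primitive tasks. | reinforcement learning, multimodal
106 | MPDiT: Multi-Patch Global-to-Local Transformer Architecture For Efficient Flow Matching and Diffusion Model | Proposes MPDiT: a multi-scale Transformer architecture for efficient flow matching and diffusion models. | flow matching
107 | Free-Range Gaussians: Non-Grid-Aligned Generative 3D Gaussian Reconstruction | Proposes Free-Range Gaussians to address non-grid-aligned 3D Gaussian reconstruction from few views. | flow matching, classifier-free guidance
108 | VidNum-1.4K: A Comprehensive Benchmark for Video-based Numerical Reasoning | Proposes VidNum-1.4K, a comprehensive benchmark for evaluating video-based numerical reasoning. | world model, world models
109 | Saliency-Guided Representation with Consistency Policy Learning for Visual Unsupervised Reinforcement Learning | Proposes the SRCP framework, based on saliency guidance and consistency policy learning, to improve zero-shot generalization in visual unsupervised reinforcement learning. | reinforcement learning, policy learning, consistency policy
110 | MMEmb-R1: Reasoning-Enhanced Multimodal Embedding with Pair-Aware Selection and Adaptive Control | Proposes MMEmb-R1, which enhances multimodal embeddings with adaptive reasoning and significantly improves MMEB-V2 performance. | reinforcement learning, multimodal, chain-of-thought
111 | Purify-then-Align: Towards Robust Human Sensing under Modality Missing with Knowledge Distillation from Noisy Multimodal Teacher | Proposes the PTA framework, which improves the robustness of human sensing under missing modalities through a purify-then-align strategy. | distillation, multimodal
112 | Action Images: End-to-End Policy Learning via Multiview Video Generation | Action Images: end-to-end robot policy learning via multiview video generation. | policy learning, world model, world models
113 | Scientific Graphics Program Synthesis via Dual Self-Consistency Reinforcement Learning | Proposes a scientific graphics program synthesis method based on dual self-consistency reinforcement learning, improving TikZ code generation quality. | reinforcement learning, large language model, multimodal
114 | SVC 2026: the Second Multimodal Deception Detection Challenge and the First Domain Generalized Remote Physiological Measurement Challenge | SVC 2026: a challenge on multimodal deception detection and domain-generalized remote physiological measurement. | representation learning, multimodal
115 | Semantic-Topological Graph Reasoning for Language-Guided Pulmonary Screening | Proposes a semantic-topological graph reasoning framework for language-guided pulmonary screening that significantly improves segmentation accuracy. | distillation, large language model, foundation model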

🔬 Pillar 3: Perception & Semantics (38 papers)

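# | Title | One-line summary | Tags | 🔗
116 | ShelfGaussian: Shelf-Supervised Open-Vocabulary Gaussian-based 3D Scene Understanding | Proposes ShelfGaussian, which uses self-supervised VFMs for open-vocabulary Gaussian-based 3D scene understanding. | scene understanding, open-vocabulary, open vocabulary
117 | 3D Gaussian Splatting for Annular Dark Field Scanning Transmission Electron Microscopy Tomography Reconstruction | Proposes the DenZa-Gaussian method to address artifacts in sparse-view ADF-STEM tomography reconstruction. | 3D gaussian splatting, 3DGS, gaussian splatting
118 | Can Protective Watermarking Safeguard the Copyright of 3D Gaussian Splatting? | Proposes the GSPure framework, which effectively removes 3D Gaussian splatting watermarks while preserving scene integrity. | 3D gaussian splatting, 3DGS, gaussian splatting
119 | GA-GS: Generation-Assisted Gaussian Splatting for Static Scene Reconstruction | Proposes GA-GS, which uses generative models to assist Gaussian splatting in reconstructing the static background of dynamic scenes. | gaussian splatting, splatting, scene reconstruction
120 | M2StyleGS: Multi-Modality 3D Style Transfer with Gaussian Splatting | M2StyleGS: 3D style transfer using Gaussian splatting and multimodal information, enabling real-time stylized rendering. | 3D gaussian splatting, 3DGS, gaussian splatting
121 | SpectralSplat: Appearance-Disentangled Feed-Forward Gaussian Splatting for Driving Scenes | SpectralSplat: appearance-disentangled feed-forward Gaussian splatting for driving scenes. | 3D gaussian splatting, gaussian splatting, splatting
122 | MedGS: Gaussian Splatting for Multi-Modal 3D Medical Imaging | MedGS: Gaussian splatting reconstruction for multimodal 3D medical imaging. | 3D gaussian splatting, gaussian splatting, splatting
123 | 4C4D: 4 Camera 4D Gaussian Splatting | Proposes the 4C4D framework, which achieves high-quality 4D Gaussian splatting reconstruction of dynamic scenes using only four cameras. | gaussian splatting, splatting
124 | Parameter-Efficient Semantic Augmentation for Enhancing Open-Vocabulary Object Detection | Proposes HSA-DINO, which improves open-vocabulary object detection under domain shift through parameter-efficient semantic augmentation. | open-vocabulary, open vocabulary
125 | SpikeStereoNet: A Brain-Inspired Framework for Stereo Depth Estimation from Spike Streams | SpikeStereoNet: a brain-inspired framework for stereo depth estimation from spike streams. | depth estimation, stereo depth
126 | A Step to Decouple Optimization in 3DGS | Decoupling 3DGS optimization: proposes Sparse Adam, reset regularization, and decoupled attribute regularization to improve optimization efficiency and representational capacity. | 3D gaussian splatting, 3DGS, gaussian splatting
127 | RAD: Retrieval-Augmented Monocular Metric Depth Estimation for Underrepresented Classes | RAD: retrieval-augmented monocular depth estimation that improves depth prediction accuracy for underrepresented classes. | depth estimation, metric depth
128 | BrepGaussian: CAD reconstruction from Multi-View Images with Gaussian Splatting | Proposes BrepGaussian, which reconstructs CAD models from multi-view images using Gaussian splatting. | gaussian splatting, splatting
129 | 3D-IDE: 3D Implicit Depth Emergent | Proposes 3D-IDE, which gives multimodal LLMs efficient 3D scene understanding through geometric self-supervision. | scene understanding, large language model, foundation model
130 | FunFact: Building Probabilistic Functional 3D Scene Graphs via Factor-Graph Reasoning | FunFact: builds probabilistic functional 3D scene graphs via factor-graph reasoning. | scene understanding, open-vocabulary, open vocabulary
131 | CGHair: Compact Gaussian Hair Reconstruction with Card Clustering | Proposes a compact Gaussian hair reconstruction method based on card clustering that significantly reduces storage and rendering costs. | 3D gaussian splatting, 3DGS, gaussian splatting
132 | PR-IQA: Partial-Reference Image Quality Assessment for Diffusion-Based Novel View Synthesis | Proposes PR-IQA for assessing the quality of novel-view images synthesized by diffusion models, improving 3D reconstruction. | 3D gaussian splatting, 3DGS, gaussian splatting
133 | 2D Triangle Splatting for Direct Differentiable Mesh Training | Proposes 2D triangle splatting for direct differentiable mesh training, enabling efficient, high-fidelity 3D reconstruction. | splatting, NeRF
134 | Hierarchical Awareness Adapters with Hybrid Pyramid Feature Fusion for Dense Depth Prediction | Proposes a hierarchy-aware monocular depth estimation model based on Swin Transformer that significantly improves accuracy and efficiency. | depth estimation, monocular depth, spatial relationship
135 | AvatarPointillist: AutoRegressive 4D Gaussian Avatarization | AvatarPointillist: proposes an autoregressive 4D Gaussian avatar generation framework that produces dynamic avatars from a single portrait. | 3D gaussian splatting, gaussian splatting, splatting
136 | PointTPA: Dynamic Network Parameter Adaptation for 3D Scene Understanding | Proposes PointTPA, which improves 3D scene understanding through dynamic network parameter adaptation. | scene understanding
137 | ViBA: Implicit Bundle Adjustment with Geometric and Temporal Consistency for Robust Visual Matching | ViBA: implicit bundle adjustment that combines geometric and temporal consistency to improve the robustness of visual matching. | visual odometry, geometric consistency
138 | GaMO: Geometry-aware Multi-view Diffusion Outpainting for Sparse-View 3D Reconstruction | GaMO: geometry-aware multi-view diffusion outpainting for sparse-view 3D reconstruction. | NeRF, geometric consistency
139 | SBF: An Effective Representation to Augment Skeleton for Video-based Human Action Recognition | Proposes the SBF representation to augment skeleton information, improving video-based human action recognition accuracy. | optical flow, human-object interaction
140 | DreamLifting: A Plug-in Module Lifting MV Diffusion Models for 3D Asset Generation | Proposes the LGAA framework, which uses multi-view diffusion models to efficiently generate 3D assets with PBR materials. | gaussian splatting, splatting
141 | Towards Context-Aware Image Anonymization with Multi-Agent Reasoning | Proposes the CAIAMAR framework, which uses multi-agent reasoning for context-aware image anonymization to protect personally identifiable information. | open-vocabulary, open vocabulary
142 | DINO-VO: Learning Where to Focus for Enhanced State Estimation | DINO-VO: monocular visual odometry that learns where to focus for enhanced state estimation. | visual odometry
143 | More than the Sum: Panorama-Language Models for Adverse Omni-Scenes | Proposes panorama-language models to address the limitations of conventional vision-language models. | scene understanding
144 | 3D Smoke Scene Reconstruction Guided by Vision Priors from Multimodal Large Language Models | Proposes Smoke-GS, which uses vision priors to reconstruct smoke-degraded multi-view 3D scenes. | 3D gaussian splatting, gaussian splatting, splatting
145 | In Depth We Trust: Reliable Monocular Depth Supervision for Gaussian Splatting | Proposes a Gaussian splatting method with monocular depth supervision that improves geometric accuracy and rendering quality. | depth estimation, monocular depth, 3D gaussian splatting
146 | Indoor Asset Detection in Large Scale 360° Drone-Captured Imagery via 3D Gaussian Splatting | Proposes a 3D Gaussian splatting-based indoor asset detection method for large-scale 360° drone imagery. | 3D gaussian splatting, 3DGS, gaussian splatting
147 | Appearance Decomposition Gaussian Splatting for Multi-Traversal Reconstruction | Proposes ADM-GS, which addresses illumination inconsistency in multi-view reconstruction through explicit appearance decomposition. | gaussian splatting, splatting, scene reconstruction
148 | LSGS-Loc: Towards Robust 3DGS-Based Visual Localization for Large-Scale UAV Scenarios | LSGS-Loc: robust 3DGS-based visual localization for large-scale UAV scenarios. | 3D gaussian splatting, 3DGS, gaussian splatting
149 | PanopticQuery: Unified Query-Time Reasoning for 4D Scenes | PanopticQuery: a unified query-time reasoning framework for 4D scenes. | gaussian splatting, splatting, scene reconstruction
150 | SmokeGS-R: Physics-Guided Pseudo-Clean 3DGS for Real-World Multi-View Smoke Restoration | SmokeGS-R: a physics-guided pseudo-clean 3D Gaussian model for real-world multi-view smoke removal. | 3D gaussian splatting, 3DGS, gaussian splatting
151 | 3DTurboQuant: Training-Free Near-Optimal Quantization for 3D Reconstruction Models | 3DTurboQuant: a training-free, near-optimal quantization scheme for 3D reconstruction models. | 3D gaussian splatting, 3DGS, gaussian splatting
152 | FunRec: Reconstructing Functional 3D Scenes from Egocentric Interaction Videos | FunRec: reconstructs functional 3D scenes from egocentric interaction videos. | affordance, egocentric
153 | GaussianGrow: Geometry-aware Gaussian Growing from 3D Point Clouds with Text Guidance | GaussianGrow: proposes a geometry-aware Gaussian growing method that generates high-quality 3D Gaussian models from point clouds. | 3D gaussian splatting, gaussian splatting, splatting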

🔬 Pillar 1: Robot Control (10 papers)

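# | Title | One-line summary | Tags | 🔗
154 | VLA-Forget: Vision-Language-Action Unlearning for Embodied Foundation Models | VLA-Forget: coordinated, controllable vision-language-action unlearning for embodied foundation models. | manipulation, vision-language-action, VLA
155 | HOIGS: Human-Object Interaction Gaussian Splatting | HOIGS: proposes a Gaussian splatting-based method for reconstructing dynamic human-object interaction scenes. | manipulation, gaussian splatting, splatting
156 | E-VLA: Event-Augmented Vision-Language-Action Model for Dark and Blurred Scenes | E-VLA: an event-camera-augmented vision-language-action model that improves manipulation robustness in dark and blurred scenes. | manipulation, teleoperation, vision-language-action
157 | ActDistill: General Action-Guided Self-Derived Distillation for Efficient Vision-Language-Action Models | ActDistill: a general action-guided self-distillation framework for efficient vision-language-action models. | manipulation, distillation, vision-language-action
158 | ResGuard: Enhancing Robustness Against Known Original Attacks in Deep Watermarking | ResGuard: enhances the robustness of deep watermarking against known original attacks. | manipulation
159 | SymphoMotion: Joint Control of Camera Motion and Object Dynamics for Coherent Video Generation | SymphoMotion: jointly controls camera motion and object dynamics for coherent video generation. | manipulation
160 | UENR-600K: A Large-Scale Physically Grounded Dataset for Nighttime Video Deraining | Proposes UENR-600K: a large-scale, physically grounded dataset for nighttime video deraining that improves model generalization. | sim-to-real
161 | SpatialEdit: Benchmarking Fine-Grained Image Spatial Editing | Proposes SpatialEdit-Bench for evaluating the geometric fidelity and perceptual plausibility of image spatial editing. | manipulation
162 | SnapFlow: One-Step Action Generation for Flow-Matching VLAs via Progressive Self-Distillation | SnapFlow: one-step action generation for flow-matching VLAs via progressive self-distillation. | manipulation, flow matching, distillation
163 | Benchmarking Vision-Language Models under Contradictory Virtual Content Attacks in Augmented Reality | ContrAR: a benchmark of vision-language models under contradictory virtual content attacks in augmented reality. | manipulation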

🔬 Pillar 8: Physics-based Animation (7 papers)

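# | Title | One-line summary | Tags | 🔗
164 | KiToke: Kernel-based Interval-aware Token Compression for Video Large Language Models | KiToke: kernel-based interval-aware token compression for video large language models. | spatiotemporal, large language model
165 | HighFM: Towards a Foundation Model for Learning Representations from High-Frequency Earth Observation Data | HighFM: a foundation model for learning remote sensing representations from high-frequency Earth observation data. | spatiotemporal, foundation model
166 | Markovian Reeb Graphs for Simulating Spatiotemporal Patterns of Life | Proposes Markovian Reeb Graphs for simulating spatiotemporal patterns of life and generating trajectories. | spatiotemporal
167 | Low-Bitrate Video Compression through Semantic-Conditioned Diffusion | Proposes DiSCo: a low-bitrate video compression framework based on semantic-conditioned diffusion that significantly improves perceptual quality. | spatiotemporal, multimodal
168 | PollutionNet: A Vision Transformer Framework for Climatological Assessment of NO$_2$ and SO$_2$ Using Satellite-Ground Data Fusion | PollutionNet: a Vision Transformer framework for air pollution assessment that fuses satellite and ground data. | spatiotemporal
169 | Can Natural Image Autoencoders Compactly Tokenize fMRI Volumes for Long-Range Dynamics Modeling? | TABLeT: uses natural image autoencoders to compactly tokenize fMRI data for long-range dynamics modeling. | spatiotemporal
170 | Preserving Forgery Artifacts: AI-Generated Video Detection at Native Scale | Proposes a native-scale AI-generated video detection framework that effectively improves forged-video detection accuracy. | spatiotemporal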

🔬 Pillar 4: Generative Motion (6 papers)

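# | Title | One-line summary | Tags | 🔗
171 | Diffusion Path Alignment for Long-Range Motion Generation and Domain Transitions | Proposes a diffusion-based path alignment framework for long-range motion generation and domain transitions. | motion generation, human motion, human motion generation
172 | Next-Scale Autoregressive Models for Text-to-Motion Generation | Proposes MoScale: a next-scale autoregressive model for text-driven human motion generation. | text-to-motion, motion generation
173 | InfBaGel: Human-Object-Scene Interaction Generation with Dynamic Perception and Iterative Refinement | Proposes InfBaGel to address human-object-scene interaction generation. | penetration, human-object interaction, HOI
174 | THOM: Generating Physically Plausible Hand-Object Meshes From Text | Proposes the THOM framework, which generates physically plausible 3D hand-object interaction meshes from text. | physically plausible, contact-aware, HOI
175 | Vid-Freeze: Protecting Images from Malicious Image-to-Video Generation via Temporal Freezing | Vid-Freeze: defends against malicious image-to-video generation via temporal freezing. | motion synthesis
176 | Human Interaction-Aware 3D Reconstruction from a Single Image | Proposes the HUG3D framework, which reconstructs physically plausible 3D models of interacting people from a single image. | physically plausible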

🔬 Pillar 7: Motion Retargeting (5 papers)

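# | Title | One-line summary | Tags | 🔗
177 | EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs | EgoMind: activates spatial cognition through linguistic reasoning in multimodal large language models. | spatial relationship, large language model, multimodal
178 | 3AM: 3egment Anything with Geometric Consistency in Videos | Proposes 3AM to address geometric consistency in video object segmentation. | geometric consistency
179 | MonoSAOD: Monocular 3D Object Detection with Sparsely Annotated Label | Proposes MonoSAOD to address the performance bottleneck of monocular 3D object detection under sparse annotation. | geometric consistency
180 | HumANDiff: Articulated Noise Diffusion for Motion-Consistent Human Video Generation | HumANDiff: motion-consistent human video generation via articulated noise diffusion. | human motion, spatiotemporal
181 | GESS: Multi-cue Guided Local Feature Learning via Geometric and Semantic Synergy | GESS: multi-cue guided local feature learning via geometric and semantic synergy. | geometric consistency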

🔬 Pillar 6: Video Extraction (4 papers)

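# | Title | One-line summary | Tags | 🔗
182 | ToG-Bench: Task-Oriented Spatio-Temporal Grounding in Egocentric Videos | Proposes ToG-Bench: a task-oriented spatio-temporal grounding benchmark for egocentric videos. | egocentric, foundation model
183 | LoMa: Local Feature Matching Revisited | Proposes LoMa to improve local feature matching performance. | feature matching
184 | SkillSight: Efficient First-Person Skill Assessment with Gaze | Proposes SkillSight for efficient first-person skill assessment. | egocentric
185 | Sub-metre Lunar DEM Generation and Validation from Chandrayaan-2 OHRC Multi-View Imagery Using an Open-Source Pipeline | Generates a sub-metre lunar DEM from Chandrayaan-2 OHRC multi-view imagery using an open-source pipeline. | feature matching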

🔬 Pillar 5: Interaction & Reaction (2 papers)

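# | Title | One-line summary | Tags | 🔗
186 | Leveraging Gaze and Set-of-Mark in VLLMs for Human-Object Interaction Anticipation from Egocentric Videos | Proposes a method combining gaze and Set-of-Mark in VLLMs for human-object interaction anticipation. | human-object interaction, egocentric, egocentric vision
187 | HVG-3D: Bridging Real and Simulation Domains for 3D-Conditional Hand-Object Interaction Video Synthesis | HVG-3D: proposes a 3D-conditioned hand-object interaction video synthesis framework that bridges the gap between real and simulation domains. | HOI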
