cs.CV (2026-04-07)

📊 187 papers in total | 🔗 15 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (70 🔗7) · Pillar 2: RL & Architecture (45) · Pillar 3: Perception & Semantics (38 🔗5) · Pillar 1: Robot Control (10) · Pillar 8: Physics-based Animation (7) · Pillar 4: Generative Motion (6 🔗1) · Pillar 7: Motion Retargeting (5 🔗2) · Pillar 6: Video Extraction (4) · Pillar 5: Interaction & Reaction (2)

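🔬 Pillar 9: Embodied Foundation Models (70 papers)

#	Title	One-line summary	Tags	🔗
1	A Generative Foundation Model for Multimodal Histopathology	MuPD: a generative foundation model for multimodal histopathology that enables cross-modal synthesis and virtual staining.	foundation model multimodal
2	A Patch-based Cross-view Regularized Framework for Backdoor Defense in Multimodal Large Language Models	Proposes a framework based on patch augmentation and cross-view regularization to defend multimodal large language models against backdoor attacks.	large language model multimodal
3	CoLA: Cross-Modal Low-rank Adaptation for Multimodal Downstream Tasks	CoLA: cross-modal low-rank adaptation for multimodal downstream tasks.	foundation model multimodal visual grounding
4	When Does Multimodal AI Help? Diagnostic Complementarity of Vision-Language Models and CNNs for Spectrum Management in Satellite-Terrestrial Networks	Proposes the SpectrumQA benchmark to diagnose the complementarity of VLMs and CNNs for spectrum management in satellite-terrestrial networks.	foundation model multimodal chain-of-thought
5	Beyond Static Vision: Scene Dynamic Field Unlocks Intuitive Physics Understanding in Multi-modal Large Language Models	Proposes the Scene Dynamic Field (SDF) to improve multimodal large language models' understanding of the physics of continuous object dynamics.	large language model multimodal
6	The Indra Representation Hypothesis for Multimodal Alignment	Proposes a multimodal alignment method based on the Indra representation hypothesis, achieving training-free, robust cross-modal alignment.	foundation model multimodal
7	Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation	Firebolt-VL: efficient vision-language understanding via cross-modality modulation.	large language model foundation model multimodal
8	Chain-of-Frames: Advancing Video Understanding in Multimodal LLMs via Frame-Aware Reasoning	Proposes the Chain-of-Frames framework to improve the frame-aware reasoning of multimodal LLMs in video understanding.	large language model multimodal
9	ZINA: Multimodal Fine-grained Hallucination Detection and Editing	ZINA: proposes a multimodal fine-grained hallucination detection and editing method that addresses MLLM outputs inconsistent with the visual content.	large language model multimodal
10	Image Hashing via Cross-View Code Alignment in the Age of Foundation Models	Proposes CroVCA, which achieves efficient image hashing via cross-view code alignment and is suited to large-scale retrieval.	foundation model multimodal
11	EI: Early Intervention for Multimodal Imaging based Disease Recognition	Proposes the EI framework, which improves the accuracy of disease recognition from multimodal medical imaging through early modality intervention and MoR adaptation.	foundation model multimodal
12	LongTail Driving Scenarios with Reasoning Traces: The KITScenes LongTail Dataset	KITScenes LongTail dataset: a long-tail driving scenario dataset with reasoning traces for end-to-end driving.	VLA multimodal instruction following
13	Automated Segmentation and Tracking of Group Housed Pigs Using Foundation Models	Uses foundation models for automated segmentation and tracking of group-housed pigs, advancing smart livestock farming.	foundation model
14	Multimodal Urban Tree Detection from Satellite and Street-Level Imagery via Annotation-Efficient Deep Learning Strategies	Proposes an urban tree detection method based on multimodal imagery and annotation-efficient strategies.	multimodal
15	A Physics-Informed, Behavior-Aware Digital Twin for Robust Multimodal Forecasting of Core Body Temperature in Precision Livestock Farming	Proposes a physics-informed digital twin to improve the accuracy of core body temperature prediction for dairy cows.	multimodal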
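16	Multimodal Backdoor Attack on VLMs for Autonomous Driving via Graffiti and Cross-Lingual Triggers	Proposes a multimodal backdoor attack based on graffiti and cross-lingual triggers that threatens vision-language models for autonomous driving.	multimodal
17	ClickAIXR: On-Device Multimodal Vision-Language Interaction with Real-World Objects in Extended Reality	ClickAIXR: a framework for on-device multimodal vision-language interaction with real-world objects in extended reality.	multimodal
18	Robust Adaptation of Foundation Models with Black-Box Visual Prompting	Proposes BlackVIP, which achieves robust adaptation of large models under limited resources via black-box visual prompting.	foundation model
19	Automated Wildfire Damage Assessment from Multi view Ground level Imagery Via Vision Language Models	Proposes an automated wildfire damage assessment method based on multi-view ground-level imagery and vision-language models.	large language model multimodal chain-of-thought
20	Revisiting Multimodal Positional Encoding in Vision-Language Models	Proposes multi-head rotary position embedding (MHRoPE) and its variant MRoPE-I to improve multimodal positional encoding in vision-language models.	multimodal
21	Vision-as-Inverse-Graphics Agent via Interleaved Multimodal Reasoning	Proposes VIGA: a vision-as-inverse-graphics agent built on interleaved multimodal reasoning.	multimodal
22	Improving Multimodal Learning with Dispersive and Anchoring Regularization	Proposes Dispersive and Anchoring Regularization to improve representation quality and fusion in multimodal learning.	multimodal
23	Inference-Path Optimization via Circuit Duplication in Frozen Visual Transformers for Marine Species Classification	For marine species classification, proposes an inference-path optimization method based on circuit duplication in frozen vision transformers.	large language model foundation model
24	Stabilizing Unsupervised Self-Evolution of MLLMs via Continuous Softened Retracing reSampling	Proposes the CSRS method to stabilize the unsupervised self-evolution of multimodal large language models on geometry tasks.	large language model multimodal
25	ITIScore: An Image-to-Text-to-Image Rating Framework for the Image Captioning Ability of MLLMs	Proposes ITIScore: an image-to-text-to-image rating framework for evaluating the image captioning ability of multimodal large language models.	large language model multimodal
26	BoxComm: Benchmarking Category-Aware Commentary Generation and Narration Rhythm in Boxing	BoxComm: proposes a dataset and evaluation suite for boxing commentary generation, filling a gap in AI commentary research for combat sports.	large language model multimodal
27	Beyond Standard Benchmarks: A Systematic Audit of Vision-Language Model's Robustness to Natural Semantic Variation Across Diverse Tasks	Systematically evaluates the robustness of vision-language models under natural semantic variation, revealing their fragility across diverse tasks.	multimodal zero-shot transfer
28	Synthesis4AD: Synthetic Anomalies are All You Need for 3D Anomaly Detection	Synthesis4AD: uses synthetic anomaly data to improve 3D anomaly detection performance.	large language model multimodal
29	Compact Hypercube Embeddings for Fast Text-based Wildlife Observation Retrieval	Proposes compact hypercube embeddings to accelerate text-based wildlife observation retrieval.	foundation model multimodal
30	SafeScreen: A Safety-First Screening Framework for Personalized Video Retrieval for Vulnerable Users	SafeScreen: a safety-first personalized video retrieval framework for vulnerable users.	multimodal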
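31	When Sinks Help or Hurt: Unified Framework for Attention Sink in Large Vision-Language Models	Proposes the layer-wise Sink gating (LSG) module to better balance global reasoning and local perception in large vision-language models.	multimodal
32	Gaze to Insight: A Scalable AI Approach for Detecting Gaze Behaviours in Face-to-Face Collaborative Learning	Proposes a scalable AI approach that detects gaze behaviours in face-to-face collaborative learning without manual annotation.	foundation model
33	Bridging the Dimensionality Gap: A Taxonomy and Survey of 2D Vision Model Adaptation for 3D Analysis	Surveys methods for adapting 2D vision models to 3D analysis, bridging the dimensionality gap.	foundation model
34	Focus Matters: Phase-Aware Suppression for Hallucination in Vision-Language Models	Proposes a phase-aware suppression method to address hallucination in vision-language models.	multimodal
35	Love Me, Love My Label: Rethinking the Role of Labels in Prompt Retrieval for Visual In-Context Learning	LaPR: proposes a label-aware prompt retrieval framework for visual in-context learning that improves task performance.	foundation model
36	DSERT-RoLL: Robust Multi-Modal Perception for Diverse Driving Conditions with Stereo Event-RGB-Thermal Cameras, 4D Radar, and Dual-LiDAR	DSERT-RoLL: a robust multimodal perception dataset and fusion framework for diverse driving conditions.	multimodal
37	SciLT: Long-Tailed Classification in Scientific Image Domains	SciLT: proposes an adaptive feature fusion and dual-supervision learning framework for long-tailed classification in scientific image domains.	foundation model
38	SGTA: Scene-Graph Based Multi-Modal Traffic Agent for Video Understanding	Proposes a scene-graph-based multimodal traffic agent (SGTA) for traffic video understanding.	large language model
39	A Systematic Study of Cross-Modal Typographic Attacks on Audio-Visual Reasoning	Proposes multimodal typographic attacks, revealing the vulnerability of large audio-visual reasoning models to cross-modal adversarial inputs.	large language model
40	ATSS: Detecting AI-Generated Videos via Anomalous Temporal Self-Similarity	Proposes the ATSS method, which detects AI-generated videos via anomalous temporal self-similarity.	multimodal
41	Graph-to-Frame RAG: Visual-Space Knowledge Fusion for Training-Free and Auditable Video Reasoning	Proposes Graph-to-Frame RAG to address knowledge fusion in video reasoning.	multimodal
42	Don't Waste Bits! Adaptive KV-Cache Quantization for Lightweight On-Device LLMs	Proposes an adaptive KV-cache quantization method to optimize memory and latency for lightweight on-device LLMs.	large language model
43	DIRECT: Video Mashup Creation via Hierarchical Multi-Agent Planning and Intent-Guided Editing	Proposes the DIRECT framework, which produces high-quality video mashups through hierarchical multi-agent planning and intent-guided editing.	multimodal
44	FileGram: Grounding Agent Personalization in File-System Behavioral Traces	FileGram: proposes an agent personalization framework grounded in file-system behavioral traces, addressing agent customization under data constraints.	multimodal
45	Rethinking Model Efficiency: Multi-Agent Inference with Large Models	Proposes a multi-agent inference framework that combines the strengths of large and small models to improve vision-language model efficiency.	large language model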
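46	VideoCoF: Unified Video Editing with Temporal Reasoner	VideoCoF: proposes a unified video editing framework based on temporal reasoning that achieves precise edits without masks.	chain-of-thought
47	SubspaceAD: Training-Free Few-Shot Anomaly Detection via Subspace Modeling	SubspaceAD: a training-free few-shot anomaly detection method based on subspace modeling.	foundation model
48	Event6D: Event-based Novel Object 6D Pose Tracking	EventTrack6D: proposes a general event-camera-based framework for 6D object pose tracking.	TAMP
49	JAMMEval: A Refined Collection of Japanese Benchmarks for Reliable VLM Evaluation	Proposes JAMMEval, a refined benchmark collection for reliable evaluation of Japanese VLMs.	visual grounding
50	VLA-InfoEntropy: A Training-Free Vision-Attention Information Entropy Approach for Vision-Language-Action Models Inference Acceleration and Success	VLA-InfoEntropy: a training-free vision-attention information entropy method that accelerates and improves VLA model inference.	vision-language-action VLA
51	Let Geometry GUIDE: Layer-wise Unrolling of Geometric Priors in Multimodal LLMs	Proposes the GUIDE framework to address spatial perception in multimodal large language models.	large language model foundation model multimodal
52	Analogical Reasoning as a Doctor: A Foundation Model for Gastrointestinal Endoscopy Diagnosis	RATNet: an analogical-reasoning-based foundation model for gastrointestinal endoscopy diagnosis that improves generalization and robustness.	foundation model zero-shot transfer
53	A Unified Foundation Model for All-in-One Multi-Modal Remote Sensing Image Restoration and Fusion with Language Prompting	Proposes LLaRS: a unified foundation model for multimodal remote sensing image restoration and fusion.	foundation model language conditioned	✅
54	Lightweight Multimodal Adaptation of Vision Language Models for Species Recognition and Habitat Context Interpretation in Drone Thermal Imagery	Proposes a lightweight multimodal adaptation framework for species recognition and habitat context interpretation in drone thermal imagery.	multimodal
55	Mixture-of-Modality-Experts with Holistic Token Learning for Fine-Grained Multimodal Visual Analytics in Driver Action Recognition	Proposes the MoME framework and the HTL strategy to improve fine-grained multimodal visual analytics in driver action recognition.	multimodal
56	Leveraging Image Editing Foundation Models for Data-Efficient CT Metal Artifact Reduction	Uses image editing foundation models to reduce CT metal artifacts in a data-efficient way.	foundation model	✅
57	PDMP: Rethinking Balanced Multimodal Learning via Performance-Dominant Modality Prioritization	Proposes the Performance-Dominant Modality Prioritization (PDMP) strategy to address under-optimization in multimodal learning.	multimodal
58	SGANet: Semantic and Geometric Alignment for Multimodal Multi-view Anomaly Detection	SGANet: a semantic and geometric alignment network for multimodal multi-view anomaly detection.	multimodal
59	Evaluation Before Generation: A Paradigm for Robust Multimodal Sentiment Analysis with Missing Modalities	Proposes an evaluation-based missing-modality adaptation framework for multimodal sentiment analysis.	multimodal	✅
60	Prior-guided Fusion of Multimodal Features for Change Detection from Optical-SAR Images	Proposes STSF-Net, which uses prior-guided multimodal feature fusion for change detection in optical-SAR images.	multimodal	✅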
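61	UAVReason: A Unified, Large-Scale Benchmark for Multimodal Aerial Scene Reasoning and Generation	Proposes UAVReason: a unified, large-scale benchmark for multimodal aerial scene understanding and generation.	multimodal
62	FoleyDesigner: Immersive Stereo Foley Generation with Precise Spatio-Temporal Alignment for Film Clips	FoleyDesigner: proposes an immersive stereo Foley generation framework with precise spatio-temporal alignment for film clips.	large language model TAMP	✅
63	DetailVerifyBench: A Benchmark for Dense Hallucination Localization in Long Image Captions	Proposes DetailVerifyBench, a benchmark for fine-grained hallucination localization in long image captions.	large language model multimodal	✅
64	EchoAgent: Towards Reliable Echocardiography Interpretation with "Eyes","Hands" and "Minds"	Proposes EchoAgent, which delivers reliable end-to-end echocardiography interpretation by mimicking how a physician's "eyes, hands, and mind" work together.	large language model multimodal
65	VideoStir: Understanding Long Videos via Spatio-Temporally Structured and Intent-Aware RAG	VideoStir: proposes a spatio-temporally structured, intent-aware RAG framework for understanding long videos.	large language model multimodal
66	CoStream: Codec-Guided Resource-Efficient System for Video Streaming Analytics	CoStream: a codec-guided, resource-efficient system for video streaming analytics.	multimodal
67	AICA-Bench: Holistically Examining the Capabilities of VLMs in Affective Image Content Analysis	Proposes the AICA-Bench benchmark for holistically evaluating the capabilities of VLMs in affective image content analysis.	multimodal
68	Physics-Aware Video Instance Removal Benchmark	Proposes PVIR, a physics-aware video instance removal benchmark that evaluates how well algorithms preserve physical consistency during removal.	instruction following
69	Few-Shot Semantic Segmentation Meets SAM3	Proposes an unsupervised few-shot semantic segmentation method based on SAM3.	foundation model	✅
70	Beyond Semantic Search: Towards Referential Anchoring in Composed Image Retrieval	Proposes the object-anchored composed image retrieval task and the AdaFocal framework to address instance-level consistency.	multimodal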

🔬 Pillar 2: RL & Architecture (45 papers)

#	Title	One-line summary	Tags
71	BiTDiff: Fine-Grained 3D Conducting Motion Generation via BiMamba-Transformer Diffusion	BiTDiff: fine-grained 3D conducting motion generation with a BiMamba-Transformer diffusion model.	Mamba motion synthesis motion generation
72	TreeGaussian: Tree-Guided Cascaded Contrastive Learning for Hierarchical Consistent 3D Gaussian Scene Segmentation and Understanding	TreeGaussian: tree-guided cascaded contrastive learning for hierarchically consistent 3D Gaussian scene segmentation and understanding.	contrastive learning 3D gaussian splatting 3DGS
73	Disentangled World Models: Learning to Transfer Semantic Knowledge from Distracting Videos for Reinforcement Learning	Proposes DisWM, which improves the sample efficiency of visual reinforcement learning in complex environments through offline knowledge distillation and disentanglement constraints.	reinforcement learning world model world models
74	Restore-R1: Efficient Image Restoration Agents via Reinforcement Learning with Multimodal LLM Perceptual Feedback	Proposes Restore-R1, which uses reinforcement learning and multimodal LLM feedback to solve complex image restoration problems efficiently.	reinforcement learning large language model multimodal
75	MuDD: A Multimodal Deception Detection Dataset and GSR-Guided Progressive Distillation for Non-Contact Deception Detection	Proposes the MuDD dataset and the GPD framework for non-contact multimodal deception detection.	representation learning distillation multimodal
76	CoLoRSMamba: Conditional LoRA-Steered Mamba for Supervised Multimodal Violence Detection	Proposes CoLoRSMamba, which uses a conditional LoRA-steered Mamba model for multimodal violence detection.	Mamba multimodal
77	Training a Student Expert via Semi-Supervised Foundation Model Distillation	Proposes a semi-supervised knowledge distillation framework that compresses vision foundation models into lightweight expert models, improving instance segmentation performance.	distillation foundation model
78	FLEG: Feed-Forward Language Embedded Gaussian Splatting from Any Views via Compact Semantic Representation	FLEG: feed-forward language-embedded Gaussian splatting from any views, based on a compact semantic representation.	distillation gaussian splatting splatting
79	A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens	Proposes DeltaWorld, which efficiently generates diverse future video frames via Delta Tokens at significantly lower compute cost.	world model world models foundation model
80	Universal Skeleton Understanding via Differentiable Rendering and MLLMs	SkeletonLLM: universal skeleton understanding via differentiable rendering and MLLMs.	distillation open-vocabulary open vocabulary
81	Learning Additively Compositional Latent Actions for Embodied AI	Proposes AC-LAM, which uses additively compositional latent action learning to improve embodied AI performance on tabletop tasks.	policy learning embodied AI
82	Multimodal Structure Learning: Disentangling Shared and Specific Topology via Cross-Modal Graphical Lasso	Proposes CM-GLasso, which disentangles shared and modality-specific topology via cross-modal graphical Lasso to improve multimodal representation learning.	distillation multimodal
83	TAPE: A two-stage parameter-efficient adaptation framework for foundation models in OCT-OCTA analysis	TAPE: a two-stage adaptation framework for efficiently fine-tuning medical foundation models in OCT-OCTA analysis.	MAE foundation model
84	CLEAR: Unlocking Generative Potential for Degraded Image Understanding in Unified Multimodal Models	The CLEAR framework uses generative capability to improve the robustness of unified multimodal models on degraded image understanding.	reinforcement learning multimodal
85	LinguDistill: Recovering Linguistic Ability in Vision-Language Models via Selective Cross-Modal Distillation	LinguDistill: recovers linguistic ability in vision-language models via selective cross-modal distillation.	distillation multimodal visual grounding
86	OpenWorldLib: A Unified Codebase and Definition of Advanced World Models	OpenWorldLib: a unified codebase and definition for advanced world models, promoting efficient reuse and collaborative inference.	world model world models
87	RL-AWB: Deep Reinforcement Learning for Auto White Balance Correction in Low-Light Night-time Scenes	Proposes RL-AWB, which uses deep reinforcement learning to solve auto white balance in low-light night-time scenes.	reinforcement learning deep reinforcement learning
88	HandDreamer: Zero-Shot Text to 3D Hand Model Generation using Corrective Hand Shape Guidance	HandDreamer: zero-shot text-to-3D hand model generation using corrective hand shape guidance.	dreamer distillation MANO
89	VitaTouch: Property-Aware Vision-Tactile-Language Model for Robotic Quality Inspection in Manufacturing	Proposes VitaTouch, which fuses vision, touch, and language for robotic quality inspection in smart manufacturing.	contrastive learning large language model multimodal
90	V-Reflection: Transforming MLLMs from Passive Observers to Active Interrogators	V-Reflection: improves multimodal large language models on fine-grained perception tasks through active visual querying.	distillation large language model multimodal
91	Incomplete Multi-View Multi-Label Classification via Shared Codebook and Fused-Teacher Self-Distillation	Proposes an incomplete multi-view multi-label classification method based on a shared codebook and fused-teacher self-distillation.	representation learning contrastive learning distillation
92	TIGFlow-GRPO: Trajectory Forecasting via Interaction-Aware Flow Matching and Reward-Guided Optimization	Proposes TIGFlow-GRPO, which achieves more socially compliant trajectory forecasting through interaction-aware flow matching and reward-guided optimization.	flow matching multimodal
93	DriveVA: Video Action Models are Zero-Shot Drivers	DriveVA: uses video action models to achieve zero-shot generalization in autonomous driving.	world model world models scene understanding
94	Spatially-Weighted CLIP for Street-View Geo-localization	Proposes spatially-weighted CLIP for street-view geo-localization.	representation learning contrastive learning multimodal
95	Scalable and Generalizable Correspondence Pruning via Geometry-Consistent Pre-training	Proposes a scalable, generalizable correspondence pruning method based on geometry-consistent pre-training.	representation learning masked autoencoder geometric consistency
96	FinPercep-RM: A Fine-grained Reward Model and Co-evolutionary Curriculum for RL-based Real-world Super-Resolution	Proposes FinPercep-RM and CCL to improve the perceptual quality of RL-based real-world super-resolution.	reinforcement learning policy learning RLHF
97	HandMCM: Multi-modal Point Cloud-based Correspondence State Space Model for 3D Hand Pose Estimation	Proposes HandMCM, which uses multimodal point clouds and Correspondence Mamba to handle occlusion in 3D hand pose estimation.	Mamba state space model
98	Deep Image Clustering Based on Curriculum Learning and Density Information	Proposes a deep image clustering method based on curriculum learning and density information that improves clustering of complex images.	curriculum learning
99	TORA: Topological Representation Alignment for 3D Shape Assembly	TORA: more efficient and accurate 3D shape assembly via topological representation alignment.	flow matching zero-shot transfer
100	OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models	Proposes OP-GRPO, which improves the generation quality and training efficiency of flow-matching models through off-policy optimization.	flow matching
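101	Reinforce to Learn, Elect to Reason: A Dual Paradigm for Video Reasoning	Proposes the RLER dual paradigm, which uses reinforcement learning to generate evidence and then reasons by election, improving the reliability and interpretability of video reasoning.	reinforcement learning multimodal
102	Discovering Failure Modes in Vision-Language Models using RL	Proposes a reinforcement-learning-based framework that automatically discovers failure modes in vision-language models.	reinforcement learning multimodal
103	Patch-Wise Hypergraph Contrastive Learning with Dual Normal Distribution Weighting for Multi-Domain Stain Transfer	Proposes STNHCL, which achieves multi-domain stain transfer via hypergraph contrastive learning and dual normal distribution weighting.	contrastive learning
104	MT-PCR: Hybrid Mamba-Transformer Network with Spatial Serialization for Point Cloud Registration	Proposes MT-PCR: a hybrid Mamba-Transformer network that performs point cloud registration via spatial serialization.	Mamba
105	SPHINX: A Synthetic Environment for Visual Perception and Reasoning	SPHINX: a synthetic environment for visual perception and reasoning that targets cognitive-primitive tasks.	reinforcement learning multimodal
106	MPDiT: Multi-Patch Global-to-Local Transformer Architecture For Efficient Flow Matching and Diffusion Model	Proposes MPDiT: a multi-scale Transformer architecture for efficient flow matching and diffusion models.	flow matching
107	Free-Range Gaussians: Non-Grid-Aligned Generative 3D Gaussian Reconstruction	Proposes Free-Range Gaussians to address non-grid-aligned 3D Gaussian reconstruction from few views.	flow matching classifier-free guidance
108	VidNum-1.4K: A Comprehensive Benchmark for Video-based Numerical Reasoning	Proposes VidNum-1.4K, a comprehensive benchmark for evaluating video-based numerical reasoning.	world model world models
109	Saliency-Guided Representation with Consistency Policy Learning for Visual Unsupervised Reinforcement Learning	Proposes the SRCP framework, based on saliency guidance and consistency policy learning, to improve zero-shot generalization in visual unsupervised reinforcement learning.	reinforcement learning policy learning consistency policy
110	MMEmb-R1: Reasoning-Enhanced Multimodal Embedding with Pair-Aware Selection and Adaptive Control	Proposes MMEmb-R1, which strengthens multimodal embeddings with adaptive reasoning and significantly improves MMEB-V2 performance.	reinforcement learning multimodal chain-of-thought
111	Purify-then-Align: Towards Robust Human Sensing under Modality Missing with Knowledge Distillation from Noisy Multimodal Teacher	Proposes the PTA framework, which improves the robustness of human sensing under missing modalities through a purify-then-align strategy.	distillation multimodal
112	Action Images: End-to-End Policy Learning via Multiview Video Generation	Action Images: end-to-end robot policy learning via multiview video generation.	policy learning world model world models
113	Scientific Graphics Program Synthesis via Dual Self-Consistency Reinforcement Learning	Proposes a scientific graphics program synthesis method based on dual self-consistency reinforcement learning that improves TikZ code generation quality.	reinforcement learning large language model multimodal
114	SVC 2026: the Second Multimodal Deception Detection Challenge and the First Domain Generalized Remote Physiological Measurement Challenge	SVC 2026: challenges on multimodal deception detection and domain-generalized remote physiological measurement.	representation learning multimodal
115	Semantic-Topological Graph Reasoning for Language-Guided Pulmonary Screening	Proposes a semantic-topological graph reasoning framework for language-guided pulmonary screening that significantly improves segmentation accuracy.	distillation large language model foundation model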

🔬 Pillar 3: Perception & Semantics (38 papers)

#	Title	One-line summary	Tags	🔗
116	ShelfGaussian: Shelf-Supervised Open-Vocabulary Gaussian-based 3D Scene Understanding	Proposes ShelfGaussian, which uses self-supervised VFMs for open-vocabulary Gaussian-based 3D scene understanding.	scene understanding open-vocabulary open vocabulary
117	3D Gaussian Splatting for Annular Dark Field Scanning Transmission Electron Microscopy Tomography Reconstruction	Proposes the DenZa-Gaussian method to address artifacts in sparse-view ADF-STEM tomography reconstruction.	3D gaussian splatting 3DGS gaussian splatting
118	Can Protective Watermarking Safeguard the Copyright of 3D Gaussian Splatting?	Proposes the GSPure framework, which effectively removes watermarks from 3D Gaussian splatting while preserving scene integrity.	3D gaussian splatting 3DGS gaussian splatting
119	GA-GS: Generation-Assisted Gaussian Splatting for Static Scene Reconstruction	Proposes GA-GS, which uses generative models to assist Gaussian splatting in reconstructing the static background of dynamic scenes.	gaussian splatting splatting scene reconstruction
120	M2StyleGS: Multi-Modality 3D Style Transfer with Gaussian Splatting	M2StyleGS: 3D style transfer using Gaussian splatting and multi-modality information, with real-time stylized rendering.	3D gaussian splatting 3DGS gaussian splatting
121	SpectralSplat: Appearance-Disentangled Feed-Forward Gaussian Splatting for Driving Scenes	SpectralSplat: appearance-disentangled feed-forward Gaussian splatting for driving scenes.	3D gaussian splatting gaussian splatting splatting
122	MedGS: Gaussian Splatting for Multi-Modal 3D Medical Imaging	MedGS: Gaussian splatting reconstruction for multi-modal 3D medical imaging.	3D gaussian splatting gaussian splatting splatting
123	4C4D: 4 Camera 4D Gaussian Splatting	Proposes the 4C4D framework, which achieves high-quality 4D Gaussian splatting reconstruction of dynamic scenes with only four cameras.	gaussian splatting splatting
124	Parameter-Efficient Semantic Augmentation for Enhancing Open-Vocabulary Object Detection	Proposes HSA-DINO, which improves open-vocabulary object detection under domain shift through parameter-efficient semantic augmentation.	open-vocabulary open vocabulary
125	SpikeStereoNet: A Brain-Inspired Framework for Stereo Depth Estimation from Spike Streams	SpikeStereoNet: a brain-inspired framework for stereo depth estimation from spike streams.	depth estimation stereo depth
126	A Step to Decouple Optimization in 3DGS	Decoupling 3DGS optimization: proposes Sparse Adam, reset regularization, and decoupled attribute regularization to improve optimization efficiency and representational capacity.	3D gaussian splatting 3DGS gaussian splatting
127	RAD: Retrieval-Augmented Monocular Metric Depth Estimation for Underrepresented Classes	RAD: retrieval-augmented monocular depth estimation that improves depth prediction accuracy for underrepresented classes.	depth estimation metric depth
128	BrepGaussian: CAD reconstruction from Multi-View Images with Gaussian Splatting	Proposes BrepGaussian, which reconstructs CAD models from multi-view images using Gaussian splatting.	gaussian splatting splatting
129	3D-IDE: 3D Implicit Depth Emergent	Proposes 3D-IDE, which gives multimodal LLMs efficient 3D scene understanding through geometric self-supervision.	scene understanding large language model foundation model
130	FunFact: Building Probabilistic Functional 3D Scene Graphs via Factor-Graph Reasoning	FunFact: builds probabilistic functional 3D scene graphs via factor-graph reasoning.	scene understanding open-vocabulary open vocabulary
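131	CGHair: Compact Gaussian Hair Reconstruction with Card Clustering	Proposes a compact Gaussian hair reconstruction method based on card clustering that significantly reduces storage and rendering costs.	3D gaussian splatting 3DGS gaussian splatting
132	PR-IQA: Partial-Reference Image Quality Assessment for Diffusion-Based Novel View Synthesis	Proposes PR-IQA for assessing the quality of novel-view images synthesized by diffusion models, improving 3D reconstruction.	3D gaussian splatting 3DGS gaussian splatting
133	2D Triangle Splatting for Direct Differentiable Mesh Training	Proposes 2D triangle splatting for direct differentiable mesh training, enabling efficient, high-fidelity 3D reconstruction.	splatting NeRF
134	Hierarchical Awareness Adapters with Hybrid Pyramid Feature Fusion for Dense Depth Prediction	Proposes a Swin Transformer-based, hierarchy-aware monocular depth estimation model that significantly improves accuracy and efficiency.	depth estimation monocular depth spatial relationship
135	AvatarPointillist: AutoRegressive 4D Gaussian Avatarization	AvatarPointillist: proposes an autoregressive 4D Gaussian avatar generation framework that creates dynamic avatars from a single portrait.	3D gaussian splatting gaussian splatting splatting
136	PointTPA: Dynamic Network Parameter Adaptation for 3D Scene Understanding	Proposes PointTPA, which improves 3D scene understanding through dynamic network parameter adaptation.	scene understanding
137	ViBA: Implicit Bundle Adjustment with Geometric and Temporal Consistency for Robust Visual Matching	ViBA: implicit bundle adjustment that combines geometric and temporal consistency to improve the robustness of visual matching.	visual odometry geometric consistency
138	GaMO: Geometry-aware Multi-view Diffusion Outpainting for Sparse-View 3D Reconstruction	GaMO: geometry-aware multi-view diffusion outpainting for sparse-view 3D reconstruction.	NeRF geometric consistency
139	SBF: An Effective Representation to Augment Skeleton for Video-based Human Action Recognition	Proposes the SBF representation to augment skeleton information, improving the accuracy of video-based human action recognition.	optical flow human-object interaction
140	DreamLifting: A Plug-in Module Lifting MV Diffusion Models for 3D Asset Generation	Proposes the LGAA framework, which uses multi-view diffusion models to efficiently generate 3D assets with PBR materials.	gaussian splatting splatting
141	Towards Context-Aware Image Anonymization with Multi-Agent Reasoning	Proposes the CAIAMAR framework, which uses multi-agent reasoning for context-aware image anonymization to protect personally identifiable information.	open-vocabulary open vocabulary
142	DINO-VO: Learning Where to Focus for Enhanced State Estimation	DINO-VO: monocular visual odometry that learns where to focus for enhanced state estimation.	visual odometry
143	More than the Sum: Panorama-Language Models for Adverse Omni-Scenes	Proposes panorama-language models to overcome the limitations of conventional vision-language models.	scene understanding
144	3D Smoke Scene Reconstruction Guided by Vision Priors from Multimodal Large Language Models	Proposes Smoke-GS, which uses vision priors to reconstruct smoke-degraded multi-view 3D scenes.	3D gaussian splatting gaussian splatting splatting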
145	In Depth We Trust: Reliable Monocular Depth Supervision for Gaussian Splatting	Proposes a Gaussian splatting method with monocular depth supervision that improves geometric accuracy and rendering quality.	depth estimation monocular depth 3D gaussian splatting
146	Indoor Asset Detection in Large Scale 360° Drone-Captured Imagery via 3D Gaussian Splatting	Proposes a 3D Gaussian splatting-based indoor asset detection method for large-scale 360° drone imagery.	3D gaussian splatting 3DGS gaussian splatting
147	Appearance Decomposition Gaussian Splatting for Multi-Traversal Reconstruction	Proposes ADM-GS, which resolves lighting inconsistency in multi-traversal reconstruction through explicit appearance decomposition.	gaussian splatting splatting scene reconstruction	✅
148	LSGS-Loc: Towards Robust 3DGS-Based Visual Localization for Large-Scale UAV Scenarios	LSGS-Loc: robust 3DGS-based visual localization for large-scale UAV scenarios.	3D gaussian splatting 3DGS gaussian splatting	✅
149	PanopticQuery: Unified Query-Time Reasoning for 4D Scenes	PanopticQuery: a unified query-time reasoning framework for 4D scenes.	gaussian splatting splatting scene reconstruction
150	SmokeGS-R: Physics-Guided Pseudo-Clean 3DGS for Real-World Multi-View Smoke Restoration	SmokeGS-R: a physics-guided pseudo-clean 3D Gaussian model for real-world multi-view smoke removal.	3D gaussian splatting 3DGS gaussian splatting	✅
151	3DTurboQuant: Training-Free Near-Optimal Quantization for 3D Reconstruction Models	3DTurboQuant: a training-free, near-optimal quantization scheme for 3D reconstruction models.	3D gaussian splatting 3DGS gaussian splatting	✅
152	FunRec: Reconstructing Functional 3D Scenes from Egocentric Interaction Videos	FunRec: reconstructs functional 3D scenes from egocentric interaction videos.	affordance egocentric
153	GaussianGrow: Geometry-aware Gaussian Growing from 3D Point Clouds with Text Guidance	GaussianGrow: proposes a geometry-aware Gaussian growing method that generates high-quality 3D Gaussian models from point clouds.	3D gaussian splatting gaussian splatting splatting	✅

🔬 Pillar 1: Robot Control (10 papers)

#	Title	One-line summary	Tags
154	VLA-Forget: Vision-Language-Action Unlearning for Embodied Foundation Models	VLA-Forget: coordinated, controllable vision-language-action unlearning for embodied foundation models.	manipulation vision-language-action VLA
155	HOIGS: Human-Object Interaction Gaussian Splatting	HOIGS: proposes a Gaussian splatting-based method for reconstructing dynamic scenes with human-object interaction.	manipulation gaussian splatting splatting
156	E-VLA: Event-Augmented Vision-Language-Action Model for Dark and Blurred Scenes	E-VLA: an event-camera-augmented vision-language-action model that improves manipulation robustness in dark and blurred scenes.	manipulation teleoperation vision-language-action
157	ActDistill: General Action-Guided Self-Derived Distillation for Efficient Vision-Language-Action Models	ActDistill: a general action-guided self-distillation framework for efficient vision-language-action models.	manipulation distillation vision-language-action
158	ResGuard: Enhancing Robustness Against Known Original Attacks in Deep Watermarking	ResGuard: strengthens the robustness of deep watermarking against known original attacks.	manipulation
159	SymphoMotion: Joint Control of Camera Motion and Object Dynamics for Coherent Video Generation	SymphoMotion: jointly controls camera motion and object dynamics for coherent video generation.	manipulation
160	UENR-600K: A Large-Scale Physically Grounded Dataset for Nighttime Video Deraining	Proposes UENR-600K: a large-scale, physically grounded nighttime video deraining dataset that improves model generalization.	sim-to-real
161	SpatialEdit: Benchmarking Fine-Grained Image Spatial Editing	Proposes SpatialEdit-Bench for evaluating the geometric fidelity and perceptual plausibility of image spatial editing.	manipulation
162	SnapFlow: One-Step Action Generation for Flow-Matching VLAs via Progressive Self-Distillation	SnapFlow: one-step action generation for flow-matching VLAs via progressive self-distillation.	manipulation flow matching distillation
163	Benchmarking Vision-Language Models under Contradictory Virtual Content Attacks in Augmented Reality	ContrAR: a benchmark of vision-language models under contradictory virtual content attacks in augmented reality.	manipulation

🔬 Pillar 8: Physics-based Animation (7 papers)

#	Title	One-line summary	Tags
164	KiToke: Kernel-based Interval-aware Token Compression for Video Large Language Models	KiToke: kernel-based, interval-aware token compression for video large language models.	spatiotemporal large language model
165	HighFM: Towards a Foundation Model for Learning Representations from High-Frequency Earth Observation Data	HighFM: a foundation model for learning remote sensing representations from high-frequency Earth observation data.	spatiotemporal foundation model
166	Markovian Reeb Graphs for Simulating Spatiotemporal Patterns of Life	Proposes Markovian Reeb Graphs for simulating spatiotemporal pattern-of-life trajectories.	spatiotemporal
167	Low-Bitrate Video Compression through Semantic-Conditioned Diffusion	Proposes DiSCo: a low-bitrate video compression framework based on semantic-conditioned diffusion that significantly improves perceptual quality.	spatiotemporal multimodal
168	PollutionNet: A Vision Transformer Framework for Climatological Assessment of NO$_2$ and SO$_2$ Using Satellite-Ground Data Fusion	PollutionNet: a Vision Transformer framework for air pollution assessment that fuses satellite and ground data.	spatiotemporal
169	Can Natural Image Autoencoders Compactly Tokenize fMRI Volumes for Long-Range Dynamics Modeling?	TABLeT: uses natural-image autoencoders to compactly tokenize fMRI data for long-range dynamics modeling.	spatiotemporal
170	Preserving Forgery Artifacts: AI-Generated Video Detection at Native Scale	Proposes a native-scale AI-generated video detection framework that effectively improves forged-video detection accuracy.	spatiotemporal

🔬 Pillar 4: Generative Motion (6 papers)

#	Title	One-line summary	Tags	🔗
171	Diffusion Path Alignment for Long-Range Motion Generation and Domain Transitions	Proposes a diffusion-based path alignment framework for long-range motion generation and domain transitions.	motion generation human motion human motion generation
172	Next-Scale Autoregressive Models for Text-to-Motion Generation	Proposes MoScale: a next-scale autoregressive model for text-driven human motion generation.	text-to-motion motion generation
173	InfBaGel: Human-Object-Scene Interaction Generation with Dynamic Perception and Iterative Refinement	Proposes InfBaGel to address human-object-scene interaction generation.	penetration human-object interaction HOI
174	THOM: Generating Physically Plausible Hand-Object Meshes From Text	Proposes the THOM framework, which generates physically plausible 3D hand-object interaction meshes from text.	physically plausible contact-aware HOI
175	Vid-Freeze: Protecting Images from Malicious Image-to-Video Generation via Temporal Freezing	Vid-Freeze: defends against malicious image-to-video generation via temporal freezing.	motion synthesis
176	Human Interaction-Aware 3D Reconstruction from a Single Image	Proposes the HUG3D framework, which reconstructs physically plausible 3D models of interacting people from a single image.	physically plausible	✅

🔬 Pillar 7: Motion Retargeting (5 papers)

#	Title	One-line summary	Tags	🔗
177	EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs	EgoMind: activates spatial cognition through linguistic reasoning in multimodal large language models.	spatial relationship large language model multimodal
178	3AM: 3egment Anything with Geometric Consistency in Videos	Proposes 3AM to address geometric consistency in video object segmentation.	geometric consistency
179	MonoSAOD: Monocular 3D Object Detection with Sparsely Annotated Label	Proposes MonoSAOD to address the performance bottleneck of monocular 3D object detection under sparse annotation.	geometric consistency
180	HumANDiff: Articulated Noise Diffusion for Motion-Consistent Human Video Generation	HumANDiff: motion-consistent human video generation via articulated noise diffusion.	human motion spatiotemporal	✅
181	GESS: Multi-cue Guided Local Feature Learning via Geometric and Semantic Synergy	GESS: multi-cue guided local feature learning via geometric and semantic synergy.	geometric consistency	✅

🔬 Pillar 6: Video Extraction (4 papers)

#	Title	One-line summary	Tags
182	ToG-Bench: Task-Oriented Spatio-Temporal Grounding in Egocentric Videos	Proposes ToG-Bench: a task-oriented spatio-temporal grounding benchmark for egocentric videos.	egocentric foundation model
183	LoMa: Local Feature Matching Revisited	Proposes LoMa to improve local feature matching performance.	feature matching
184	SkillSight: Efficient First-Person Skill Assessment with Gaze	Proposes SkillSight for efficient first-person skill assessment.	egocentric
185	Sub-metre Lunar DEM Generation and Validation from Chandrayaan-2 OHRC Multi-View Imagery Using an Open-Source Pipeline	Generates sub-metre lunar DEMs from Chandrayaan-2 OHRC multi-view imagery using an open-source pipeline.	feature matching

🔬 Pillar 5: Interaction & Reaction (2 papers)

#	Title	One-line summary	Tags	🔗
186	Leveraging Gaze and Set-of-Mark in VLLMs for Human-Object Interaction Anticipation from Egocentric Videos	Proposes a method that combines gaze and Set-of-Mark in VLLMs for human-object interaction anticipation.	human-object interaction egocentric egocentric vision
187	HVG-3D: Bridging Real and Simulation Domains for 3D-Conditional Hand-Object Interaction Video Synthesis	HVG-3D: proposes a 3D-conditioned hand-object interaction video synthesis framework that bridges the gap between the real and simulation domains.	HOI
