cs.CV (2026-04-30)

📊 43 papers in total | 🔗 12 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (12 🔗3) · Pillar 3: Perception & Semantics (11 🔗4) · Pillar 2: RL & Architecture (10 🔗3) · Pillar 6: Video Extraction (3 🔗1) · Pillar 1: Robot Control (2) · Pillar 7: Motion Retargeting (2) · Pillar 8: Physics-based Animation (2 🔗1) · Pillar 4: Generative Motion (1)

🔬 Pillar 9: Embodied Foundation Models (12 papers)

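# | Title | One-sentence summary | Tags | 🔗
1 | Judge, Then Drive: A Critic-Centric Vision Language Action Framework for Autonomous Driving | Proposes the CriticVLA framework, which uses a vision-language-action model to improve decision quality in autonomous driving | vision-language-action, VLA, multimodal
2 | COHERENCE: Benchmarking Fine-Grained Image-Text Alignment in Interleaved Multimodal Contexts | Proposes the COHERENCE benchmark for evaluating MLLMs' fine-grained image-text alignment in interleaved multimodal contexts | large language model, multimodal
3 | Linguistically Informed Multimodal Fusion for Vietnamese Scene-Text Image Captioning: Dataset, Graph Framework, and Phonological Attention | Proposes HSTFG and PhonoSTFG to address Vietnamese scene-text image captioning | multimodal
4 | Decoding Scientific Experimental Images: The SPUR Benchmark for Perception, Understanding, and Reasoning | Proposes the SPUR benchmark for evaluating multimodal large language models on understanding and reasoning over scientific experimental images | large language model, multimodal, chain-of-thought
5 | AEGIS: A Holistic Benchmark for Evaluating Forensic Analysis of AI-Generated Academic Images | AEGIS: a holistic benchmark for evaluating forensic analysis of AI-generated academic images | large language model, multimodal
6 | World2Minecraft: Occupancy-Driven Simulated Scenes Construction | World2Minecraft: proposes a method for automatically constructing Minecraft scenes based on occupancy prediction | embodied AI, VLN
7 | FineState-Bench: Benchmarking State-Conditioned Grounding for Fine-grained GUI State Setting | FineState-Bench: a state-conditioned grounding benchmark for fine-grained GUI state setting | visual grounding
8 | Frequency-Aware Semantic Fusion with Gated Injection for AI-generated Image Detection | Proposes FGINet, a frequency-aware gated injection network, to improve the generalization of AI-generated image detection | foundation model
9 | ClipTBP: Clip-Pair based Temporal Boundary Prediction with Boundary-Aware Learning for Moment Retrieval | ClipTBP: a moment retrieval method based on clip pairs and boundary-aware learning | multimodal
10 | EdgeFM: Efficient Edge Inference for Vision-Language Models | EdgeFM: an efficient vision-language model inference framework for cross-platform industrial edge scenarios | VLA
11 | Context as Prior: Bayesian-Inspired Intent Inference for Non-Speaking Agents with a Household Cat Testbed | CatSignal: proposes a Bayesian intent inference framework for understanding the behavior of non-speaking agents in household environments | multimodal
12 | Iterative Definition Refinement for Zero-Shot Classification via LLM-Based Semantic Prototype Optimization | Proposes an iterative definition refinement method based on LLM semantic prototype optimization to improve zero-shot classification performance | foundation model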

🔬 Pillar 3: Perception & Semantics (11 papers)

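# | Title | One-sentence summary | Tags | 🔗
13 | Faster 3D Gaussian Splatting Convergence via Structure-Aware Densification | Proposes structure-aware density control to accelerate 3D Gaussian Splatting convergence and improve reconstruction quality | 3D gaussian splatting, gaussian splatting, splatting
14 | Sparse-View 3D Gaussian Splatting in the Wild | Proposes a sparse-view 3D Gaussian Splatting method for novel view synthesis in real-world scenes | 3D gaussian splatting, gaussian splatting, splatting
15 | Residual Gaussian Splatting for Ultra Sparse-View CBCT Reconstruction | Proposes Residual Gaussian Splatting (RGS) for ultra-sparse-view CBCT reconstruction, improving detail fidelity | 3D gaussian splatting, 3DGS, gaussian splatting
16 | Stop Holding Your Breath: CT-Informed Gaussian Splatting for Dynamic Bronchoscopy | Proposes a CT-guided Gaussian splatting method for dynamic bronchoscopy that requires no breath-holding | gaussian splatting, splatting
17 | TransVLM: A Vision-Language Framework and Benchmark for Detecting Any Shot Transitions | Proposes TransVLM for detecting any type of shot transition in video, addressing the inadequate handling of complex transitions by traditional methods | optical flow, motion representation, spatiotemporal
18 | Training-Free Tunnel Defect Inspection and Engineering Interpretation via Visual Recalibration and Entity Reconstruction | Proposes TunnelMIND, which achieves training-free tunnel defect inspection and engineering interpretation through visual recalibration and entity reconstruction | open-vocabulary, open vocabulary, foundation model
19 | 3D Reconstruction Techniques in the Manufacturing Domain: Applications, Research Opportunities and Use Cases | Surveys 3D reconstruction techniques in manufacturing, identifying applications, research opportunities, and use cases, and addressing the lack of a unified framework | 3D reconstruction
20 | RayFormer: Modeling Inter- and Intra-Ray Similarity for NeRF-Based Video Snapshot Compressive Imaging | RayFormer: improves the quality of NeRF-based video snapshot compressive imaging by modeling inter- and intra-ray similarity | NeRF
21 | Softmax-GS: Generalized Gaussians Learning When to Blend or Bound | Proposes Softmax-GS to address the problem of overlapping 3D Gaussians | 3D gaussian splatting, gaussian splatting, splatting
22 | VkSplat: High-Performance 3DGS Training in Vulkan Compute | VkSplat: a high-performance 3DGS training framework built on Vulkan Compute | 3D gaussian splatting, 3DGS, gaussian splatting
23 | REALM: An RGB and Event Aligned Latent Manifold for Cross-Modal Perception | REALM: proposes an RGB and event aligned latent manifold for cross-modal perception | depth estimation, feature matching, foundation model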

🔬 Pillar 2: RL & Architecture (10 papers)

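# | Title | One-sentence summary | Tags | 🔗
24 | PRISM: Pre-alignment via Black-box On-policy Distillation for Multimodal Reinforcement Learning | PRISM: improves multimodal reinforcement learning performance through pre-alignment via black-box on-policy distillation | reinforcement learning, distillation, multimodal
25 | HERMES++: Toward a Unified Driving World Model for 3D Scene Understanding and Generation | Proposes HERMES++, a driving world model that unifies 3D scene understanding and future geometry prediction | world model, world models, scene understanding
26 | Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling | Proposes a five-level taxonomy of intelligent visual generation, advancing visual generation from atomic mapping toward agentic world modeling | flow matching, world model, world models
27 | Echo-α: Large Agentic Multimodal Reasoning Model for Ultrasound Interpretation | Proposes Echo-α, an agentic multimodal reasoning model for ultrasound image interpretation | reinforcement learning, large language model, multimodal
28 | JI-ADF: Joint-Individual Learning with Adaptive Decision Fusion for Multimodal Skin Lesion Classification | Proposes the JI-ADF framework, which fuses multimodal information to improve the accuracy and clinical utility of skin lesion classification | representation learning, multimodal
29 | Leveraging Verifier-Based Reinforcement Learning in Image Editing | Proposes the Edit-R1 framework, which uses verifier-based reinforcement learning to improve image editing results | reinforcement learning, RLHF, chain-of-thought
30 | Action Motifs: Self-Supervised Hierarchical Representation of Human Body Movements | Proposes A4Mer, a self-supervised learning framework for hierarchical representation of human body movements that improves behavior modeling performance | representation learning, SMPL, motion prediction
31 | Generalizable Sparse-View 3D Reconstruction from Unconstrained Images | GenWildSplat: proposes a generalizable sparse-view 3D reconstruction framework for unconstrained images | curriculum learning, 3D reconstruction
32 | Beyond Gaussian Bottlenecks: Topologically Aligned Encoding of Vision-Transformer Feature Spaces | Proposes S²VAE, which encodes Vision Transformer feature spaces with topological alignment to improve 3D reconstruction | world model, world models, depth estimation
33 | LA-Pose: Latent Action Pretraining Meets Pose Estimation | LA-Pose: uses latent action pretraining to improve camera pose estimation accuracy | world model, world models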

🔬 Pillar 6: Video Extraction (3 papers)

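# | Title | One-sentence summary | Tags | 🔗
34 | ResiHMR: Residual-Limb Aware Single-Image 3D Human Mesh Recovery for Individuals with Limb Loss | ResiHMR: residual-limb-aware single-image 3D human mesh recovery for individuals with limb loss | human mesh recovery, HMR
35 | MoCapAnything V2: End-to-End Motion Capture for Arbitrary Skeletons | MoCapAnything V2: proposes an end-to-end motion capture framework applicable to animation generation for arbitrary skeletons | video-to-pose
36 | Adaptive Geodesic Conformal Prediction for Egocentric Camera Pose Estimation | Proposes DINOv2-Bridge adaptive conformal prediction to improve uncertainty coverage in egocentric camera pose estimation | egocentric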

🔬 Pillar 1: Robot Control (2 papers)

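# | Title | One-sentence summary | Tags | 🔗
37 | Fake3DGS: A Benchmark for 3D Manipulation Detection in Neural Rendering | Proposes the Fake3DGS benchmark for evaluating 3D manipulation detection algorithms in neural rendering | manipulation, 3D gaussian splatting, 3D reconstruction
38 | SpaAct: Spatially-Activated Transition Learning with Curriculum Adaptation for Vision-Language Navigation | SpaAct: improves vision-language navigation performance through spatially-activated transition learning and curriculum adaptation | locomotion, curriculum learning, VLN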

🔬 Pillar 7: Motion Retargeting (2 papers)

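# | Title | One-sentence summary | Tags | 🔗
39 | CasLayout: Cascaded 3D Layout Diffusion for Indoor Scene Synthesis with Implicit Relation Modeling | CasLayout: a cascaded diffusion model that synthesizes indoor scenes through implicit relation modeling | spatial relationship, large language model
40 | 3D-ReGen: A Unified 3D Geometry Regeneration Framework | Proposes 3D-ReGen, a controllable 3D geometry regeneration framework for 3D object enhancement, reconstruction, and editing | geometric consistency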

🔬 Pillar 8: Physics-based Animation (2 papers)

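# | Title | One-sentence summary | Tags | 🔗
41 | YOSE: You Only Select Essential Tokens for Efficient DiT-based Video Object Removal | YOSE: proposes an efficient DiT-based video object removal framework that significantly accelerates inference by selecting only the essential tokens | spatiotemporal, diff-sim
42 | MAEPose: Self-Supervised Spatiotemporal Learning for Human Pose Estimation on mmWave Video | MAEPose: self-supervised spatiotemporal human pose estimation on mmWave video | spatiotemporal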

🔬 Pillar 4: Generative Motion (1 paper)

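# | Title | One-sentence summary | Tags | 🔗
43 | Uni-HOI: A Unified framework for Learning the Joint distribution of Text and Human-Object Interaction | Uni-HOI: proposes a unified framework that learns the joint distribution of text and human-object interaction, enabling multi-task HOI generation and prediction | motion generation, VQ-VAE, human-object interaction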
