cs.CV (2025-11-28)

📊 46 papers in total | 🔗 9 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (16 🔗2) · Pillar 3: Perception & Semantics (12 🔗3) · Pillar 2: RL & Architecture (10 🔗1) · Pillar 1: Robot Control (5 🔗2) · Pillar 8: Physics-based Animation (1 🔗1) · Pillar 4: Generative Motion (1) · Pillar 3: Perception & SLAM (1)

🔬 Pillar 9: Embodied Foundation Models (16 papers)

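# | Title | One-sentence summary | Tags | 🔗
1 | Some Modalities are More Equal Than Others: Decoding and Architecting Multimodal Integration in MLLMs | Proposes a modality-alignment tuning strategy that improves MLLMs' multimodal reasoning under conflicting modalities | large language model, multimodal
2 | Instruction Tuning of Large Language Models for Tabular Data Generation-in One Day | Proposes an efficient instruction-tuning method for tabular data generation that rivals GPT-4o within one day | large language model
3 | Buffer replay enhances the robustness of multimodal learning under missing-modality | Proposes REplay Prompting (REP) to make multimodal learning more robust when modalities are missing | multimodal
4 | JarvisEvo: Towards a Self-Evolving Photo Editing Agent with Synergistic Editor-Evaluator Optimization | Proposes JarvisEvo, a self-evolving image-editing agent built on synergistic editor-evaluator optimization | multimodal, instruction following, chain-of-thought
5 | Breaking the Visual Shortcuts in Multimodal Knowledge-Based Visual Question Answering | Proposes the RETINA benchmark and the MIMIR model to address visual shortcuts in multimodal knowledge-graph VQA | multimodal
6 | RobotSeg: A Model and Dataset for Segmenting Robots in Image and Video | RobotSeg: a model and dataset for segmenting robots in images and videos | VLA, foundation model
7 | AutocleanEEG ICVision: Automated ICA Artifact Classification Using Vision-Language AI | ICVision: uses vision-language AI to classify EEG ICA artifacts automatically, emulating expert-level EEG analysis | large language model, multimodal
8 | Visual Generation Tuning | Proposes Visual Generation Tuning (VGT), which unlocks the visual generation ability of pretrained VLMs and accelerates autoregressive modeling | foundation model, multimodal
9 | Unlocking Multilingual Reasoning Capability of LLMs and LVLMs through Representation Engineering | Proposes MRRE, which uses representation engineering to improve the cross-lingual reasoning of LLMs and LVLMs | large language model
10 | Zero-Shot Multi-Criteria Visual Quality Inspection for Semi-Controlled Industrial Environments via Real-Time 3D Digital Twin Simulation | Proposes a zero-shot, multi-criteria visual quality inspection method based on real-time 3D digital twin simulation for semi-controlled industrial environments | multimodal
11 | Analyzing Image Beyond Visual Aspect: Image Emotion Classification via Multiple-Affective Captioning | Proposes the ACIEC framework, which classifies image emotion via multiple affective captions and effectively narrows the affective gap | chain-of-thought
12 | MathSight: A Benchmark Exploring Have Vision-Language Models Really Seen in University-Level Mathematical Reasoning? | MathSight: a benchmark that evaluates how much vision-language models actually use visual information in university-level mathematical reasoning | multimodal
13 | Contrastive Heliophysical Image Pretraining for Solar Dynamics Observatory Records | Proposes SolarCHIP, contrastive pretraining for SDO solar images that improves cross-modal translation and flare classification | multimodal
14 | From Points to Clouds: Learning Robust Semantic Distributions for Multi-modal Prompts | Proposes the Points-to-Clouds (P2C) framework, which learns robust semantic distributions to improve the generalization of multimodal prompt learning | multimodal
15 | CoordSpeaker: Exploiting Gesture Captioning for Coordinated Caption-Empowered Co-Speech Gesture Generation | CoordSpeaker: uses gesture captioning to achieve coordinated, text-driven co-speech gesture generation | multimodal
16 | Resolving Evidence Sparsity: Agentic Context Engineering for Long-Document Understanding | Proposes SLEUTH, a multi-agent framework that addresses evidence sparsity and redundancy in long-document understanding | multimodal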

🔬 Pillar 3: Perception & Semantics (12 papers)

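# | Title | One-sentence summary | Tags | 🔗
17 | MrGS: Multi-modal Radiance Fields with 3D Gaussian Splatting for RGB-Thermal Novel View Synthesis | MrGS: multimodal radiance fields based on 3D Gaussian splatting for RGB-thermal novel view synthesis | 3D gaussian splatting, 3DGS, gaussian splatting
18 | Geometry-Consistent 4D Gaussian Splatting for Sparse-Input Dynamic View Synthesis | Proposes GC-4DGS, which uses geometric consistency to improve 4D Gaussian splatting rendering quality for dynamic scenes under sparse inputs | monocular depth, gaussian splatting, splatting
19 | HMR3D: Hierarchical Multimodal Representation for 3D Scene Understanding with Large Vision-Language Model | HMR3D: a hierarchical multimodal representation for 3D scene understanding with large vision-language models | scene understanding, spatial relationship, multimodal
20 | DenseScan: Advancing 3D Scene Understanding with 2D Dense Annotation | DenseScan: improves 3D scene understanding with dense 2D annotations | scene understanding, spatial relationship, large language model
21 | FACT-GS: Frequency-Aligned Complexity-Aware Texture Reparameterization for 2D Gaussian Splatting | FACT-GS: frequency-aligned, complexity-aware texture reparameterization for Gaussian splatting that improves rendering quality | gaussian splatting, splatting
22 | See, Rank, and Filter: Important Word-Aware Clip Filtering via Scene Understanding for Moment Retrieval and Highlight Detection | Proposes an important-word-aware video clip filtering method for video moment retrieval and highlight detection | scene understanding, large language model, multimodal
23 | Image Valuation in NeRF-based 3D reconstruction | Proposes an image valuation method for optimizing image selection in NeRF-based 3D reconstruction | NeRF, neural radiance field, scene reconstruction
24 | Robust 3DGS-based SLAM via Adaptive Kernel Smoothing | Proposes a robust 3DGS-SLAM based on adaptive kernel smoothing that improves camera pose tracking accuracy | 3DGS, scene reconstruction
25 | SpaceMind: Camera-Guided Modality Fusion for Spatial Reasoning in Vision-Language Models | Proposes SpaceMind, which strengthens spatial reasoning in vision-language models through camera-guided multimodal fusion | VGGT, large language model, multimodal
26 | DenoiseGS: Gaussian Reconstruction Model for Burst Denoising | DenoiseGS: efficient burst image denoising with a Gaussian reconstruction model | 3D gaussian splatting, gaussian splatting, splatting
27 | Taming the Light: Illumination-Invariant Semantic 3DGS-SLAM | Proposes an illumination-invariant semantic 3DGS-SLAM that addresses SLAM performance degradation under extreme lighting | 3DGS
28 | DualCamCtrl: Dual-Branch Diffusion Model for Geometry-Aware Camera-Controlled Video Generation | DualCamCtrl: a dual-branch diffusion model for geometry-aware, camera-controlled video generation | scene understanding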

🔬 Pillar 2: RL & Architecture (10 papers)

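# | Title | One-sentence summary | Tags | 🔗
29 | Hunyuan-GameCraft-2: Instruction-following Interactive Game World Model | Hunyuan-GameCraft-2: proposes an instruction-driven approach to interactive game world modeling | world model, foundation model, instruction following
30 | Pathryoshka: Compressing Pathology Foundation Models via Multi-Teacher Knowledge Distillation with Nested Embeddings | Pathryoshka: compresses pathology foundation models via multi-teacher knowledge distillation with nested embeddings | representation learning, distillation, foundation model
31 | Video-R2: Reinforcing Consistent and Grounded Reasoning in Multimodal Language Models | Video-R2 improves multimodal language models on video understanding by reinforcing temporal alignment and reasoning consistency | reinforcement learning, large language model, multimodal
32 | VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction | Proposes VQRAE, a representation quantization autoencoder that unifies multimodal understanding, generation, and reconstruction | distillation, foundation model, multimodal
33 | ReactionMamba: Generating Short & Long Human Reaction Sequences | ReactionMamba: proposes a Mamba-based framework for generating short- and long-horizon human reaction motion sequences | Mamba, ReMoS
34 | Breaking Scale Anchoring: Frequency Representation Learning for Accurate High-Resolution Inference from Low-Resolution Training | Proposes frequency representation learning to address scale anchoring when training at low resolution and inferring at high resolution | representation learning, spatiotemporal
35 | MANTA: Physics-Informed Generalized Underwater Object Tracking | MANTA: proposes a physics-informed generalized underwater object tracking framework that improves adaptability to underwater environments | representation learning, contrastive learning, geometric consistency
36 | REVEAL: Reasoning-enhanced Forensic Evidence Analysis for Explainable AI-generated Image Detection | Proposes REVEAL, a reasoning-enhanced framework for AI-generated image detection that improves explainability and generalization | reinforcement learning, multimodal
37 | From Illusion to Intention: Visual Rationale Learning for Vision-Language Reasoning | Proposes Visual Rationale Learning (ViRL), which uses visual reasoning chains to make vision-language models more transparent and trustworthy | reward shaping, chain-of-thought
38 | McSc: Motion-Corrective Preference Alignment for Video Generation with Self-Critic Hierarchical Reasoning | Proposes the McSc framework, which achieves motion-corrective preference alignment for video generation through self-critic hierarchical reasoning | reinforcement learning, direct preference optimization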

🔬 Pillar 1: Robot Control (5 papers)

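# | Title | One-sentence summary | Tags | 🔗
39 | Video-CoM: Interactive Video Reasoning via Chain of Manipulations | Proposes Video-CoM, which enables interactive video reasoning through a chain of manipulations and improves spatiotemporal understanding | manipulation, reinforcement learning, large language model
40 | DEAL-300K: Diffusion-based Editing Area Localization with a 300K-Scale Dataset and Frequency-Prompted Baseline | Proposes the DEAL-300K dataset and a frequency-prompted baseline for localizing regions edited by diffusion models | manipulation, large language model, foundation model
41 | Hybrid Synthetic Data Generation with Domain Randomization Enables Zero-Shot Vision-Based Part Inspection Under Extreme Class Imbalance | Proposes a zero-shot part inspection method based on hybrid synthetic data and domain randomization to handle extreme class imbalance | domain randomization
42 | Synthetic Industrial Object Detection: GenAI vs. Feature-Based Methods | Compares GenAI with traditional methods for efficiently synthesizing industrial object detection data and improving sim-to-real performance | sim-to-real, domain randomization
43 | NumeriKontrol: Adding Numeric Control to Diffusion Transformers for Instruction-based Image Editing | NumeriKontrol: adds numeric control to diffusion Transformers for instruction-based image editing | manipulation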

🔬 Pillar 8: Physics-based Animation (1 paper)

# | Title | One-sentence summary | Tags | 🔗
44 | One-to-All Animation: Alignment-Free Character Animation and Image Pose Transfer | Proposes One-to-All Animation to address pose misalignment | character animation

🔬 Pillar 4: Generative Motion (1 paper)

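# | Title | One-sentence summary | Tags | 🔗
45 | Guiding Visual Autoregressive Models through Spectrum Weakening | Proposes a spectrum-weakening guidance method for visual autoregressive models that improves generation quality without retraining | classifier-free guidance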

🔬 Pillar 3: Perception & SLAM (1 paper)

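# | Title | One-sentence summary | Tags | 🔗
46 | GLOW: Global Illumination-Aware Inverse Rendering of Indoor Scenes Captured with Dynamic Co-Located Light & Camera | GLOW: global illumination-aware inverse rendering of indoor scenes captured with a dynamic co-located light and camera | neural radiance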
