cs.CV (2025-10-15)

📊 46 papers in total | 🔗 5 with code

🎯 Navigation by area of interest

Pillar 9: Embodied Foundation Models (18 🔗2) · Pillar 2: RL & Architecture (9 🔗1) · Pillar 3: Perception & Semantics (8 🔗1) · Pillar 1: Robot Control (3 🔗1) · Pillar 8: Physics-based Animation (3) · Pillar 4: Generative Motion (3) · Pillar 7: Motion Retargeting (1) · Pillar 6: Video Extraction (1)

🔬 Pillar 9: Embodied Foundation Models (18 papers)

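# | Title | One-sentence summary | Tags | 🔗
1 | Risk-adaptive Activation Steering for Safe Multimodal Large Language Models | Proposes Risk-adaptive Activation Steering (RAS), improving the safety of multimodal large language models and speeding up inference. | large language model, multimodal
2 | Model-agnostic Adversarial Attack and Defense for Vision-Language-Action Models | Proposes model-agnostic adversarial attack and defense methods for vision-language-action models. | vision-language-action, VLA
3 | Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs | Introduces the Honey-Data-15M dataset and the Bee-8B model, improving the performance of fully open multimodal large language models. | large language model, multimodal, chain-of-thought
4 | Uni-MMMU: A Massive Multi-discipline Multimodal Unified Benchmark | Proposes Uni-MMMU, a massive multi-discipline multimodal unified benchmark for evaluating the bidirectional synergy between visual understanding and generation in models. | multimodal
5 | Fusion Meets Diverse Conditions: A High-diversity Benchmark and Baseline for UAV-based Multimodal Object Detection with Condition Cues | Proposes a condition-aware dynamic fusion method to make UAV-based multimodal object detection robust in complex scenes. | multimodal
6 | Language as a Label: Zero-Shot Multimodal Classification of Everyday Postures under Data Scarcity | Uses language labels for zero-shot multimodal classification, addressing everyday posture recognition under data scarcity. | multimodal
7 | OS-HGAdapter: Open Semantic Hypergraph Adapter for Large Language Models Assisted Entropy-Enhanced Image-Text Alignment | Proposes OS-HGAdapter, which uses large language models to strengthen image-text alignment and significantly improves cross-modal retrieval performance. | large language model
8 | Reasoning in Space via Grounding in the World | Proposes the Grounded-Spatial Reasoner, grounded in perception of the world, to improve 3D spatial reasoning. | visual grounding, chain-of-thought
9 | RECODE: Reasoning Through Code Generation for Visual Question Answering | Proposes the RECODE framework, which uses code generation to enable more precise, verifiable reasoning in visual question answering. | large language model, multimodal
10 | OmniGaze: Reward-inspired Generalizable Gaze Estimation In The Wild | OmniGaze: proposes a reward-driven, generalizable gaze estimation framework to address generalization in in-the-wild scenes. | large language model, multimodal
11 | Vgent: Graph-based Retrieval-Reasoning-Augmented Generation For Long Video Understanding | Proposes Vgent, which improves long video understanding through graph-based retrieval-reasoning-augmented generation. | large language model
12 | Adaptive Visual Conditioning for Semantic Consistency in Diffusion-Based Story Continuation | Proposes the AVC framework, which adaptively controls the visual conditioning of diffusion models to improve semantic consistency in story continuation. | large language model
13 | InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue | Proposes InteractiveOmni, a unified omni-modal large language model for audio-visual multi-turn interaction. | large language model
14 | Towards Adversarial Robustness and Uncertainty Quantification in DINOv2-based Few-Shot Anomaly Detection | Studies the adversarial robustness and uncertainty quantification of DINOv2 in few-shot anomaly detection. | foundation model
15 | Visual Interestingness Decoded: How GPT-4o Mirrors Human Interests | Explores GPT-4o's understanding of visual interestingness and uses it to improve learning-to-rank models. | multimodal
16 | Self-Augmented Visual Contrastive Decoding | Proposes self-augmented visual contrastive decoding to improve the factual consistency of large vision-language models. | multimodal
17 | MMLongCite: A Benchmark for Evaluating Fidelity of Long-Context Vision-Language Models | Proposes the MMLongCite benchmark for evaluating the information fidelity of long-context vision-language models. | multimodal
18 | What "Not" to Detect: Negation-Aware VLMs via Structured Reasoning and Token Merging | Proposes the NegToMe module and the CoVAND dataset, improving VLM performance on detecting objects described with negation. | chain-of-thought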

🔬 Pillar 2: RL & Architecture (9 papers)

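# | Title | One-sentence summary | Tags | 🔗
19 | PhysMaster: Mastering Physical Representation for Video Generation via Reinforcement Learning | Proposes PhysMaster, which learns physical representations via reinforcement learning to improve the physical plausibility of video generation models. | reinforcement learning, DPO, direct preference optimization
20 | XD-RCDepth: Lightweight Radar-Camera Depth Estimation with Explainability-Aligned and Distribution-Aware Distillation | XD-RCDepth: a lightweight radar-camera depth estimation method for autonomous driving, with explainability-aligned knowledge distillation. | MAE, distillation, depth estimation
21 | Generative Universal Verifier as Multimodal Meta-Reasoner | Proposes a generative universal verifier that enables multimodal models to reflect on and refine visual outputs. | world model, multimodal
22 | UniME-V2: MLLM-as-a-Judge for Universal Multimodal Embedding Learning | UniME-V2: uses an MLLM as a judge for universal multimodal embedding learning. | representation learning, multimodal
23 | Generalizing WiFi Gesture Recognition via Large-Model-Aware Semantic Distillation and Alignment | Proposes the GLSDA framework, which uses semantic knowledge from large models to improve the generalization of WiFi gesture recognition. | representation learning, distillation, foundation model
24 | End-to-End Multi-Modal Diffusion Mamba | Proposes Multi-modal Diffusion Mamba (MDM) for unified multimodal processing and improved generation performance. | Mamba, representation learning, MDM
25 | Scaling Vision Transformers for Functional MRI with Flat Maps | Scales functional MRI research using flat maps and Vision Transformers. | masked autoencoder, MAE, spatiotemporal
26 | Shortcutting Pre-trained Flow Matching Diffusion Models is Almost Free Lunch | Proposes an acceleration method for Flow Matching models based on velocity-field self-distillation, enabling efficient few-step sampling. | flow matching, distillation
27 | Seeing and Knowing in the Wild: Open-domain Visual Entity Recognition with Large-scale Knowledge Graphs via Contrastive Learning | Proposes a knowledge-guided contrastive learning framework for open-domain visual entity recognition. | contrastive learning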

🔬 Pillar 3: Perception & Semantics (8 papers)

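# | Title | One-sentence summary | Tags | 🔗
28 | FlyAwareV2: A Multimodal Cross-Domain UAV Dataset for Urban Scene Understanding | FlyAwareV2: a multimodal cross-domain UAV dataset for urban scene understanding. | depth estimation, monocular depth, scene understanding
29 | InsideOut: Integrated RGB-Radiative Gaussian Splatting for Comprehensive 3D Object Representation | InsideOut: integrated RGB and radiative Gaussian splatting for comprehensive 3D object representation. | 3D gaussian splatting, 3DGS, gaussian splatting
30 | STT-GS: Sample-Then-Transmit Edge Gaussian Splatting with Joint Client Selection and Power Control | Proposes STT-GS, an edge Gaussian splatting method that jointly optimizes client selection and power control to improve reconstruction quality in low-altitude scenes. | gaussian splatting, splatting, scene reconstruction
31 | Leveraging 2D Priors and SDF Guidance for Dynamic Urban Scene Rendering | Proposes a dynamic urban scene rendering method that combines 2D priors with SDF guidance. | 3D gaussian splatting, 3DGS, gaussian splatting
32 | Accelerated Feature Detectors for Visual SLAM: A Comparative Study of FPGA vs GPU | Compares the performance and energy efficiency of FPGA- and GPU-accelerated feature detectors in visual SLAM. | visual SLAM
33 | Removing Cost Volumes from Optical Flow Estimators | Proposes a training strategy that allows cost volumes to be removed from optical flow estimators, significantly increasing inference speed and reducing memory use. | optical flow
34 | Capture, Canonicalize, Splat: Zero-Shot 3D Gaussian Avatars from Unstructured Phone Images | Proposes Capture, Canonicalize, Splat, a zero-shot method for generating 3D Gaussian avatars. | gaussian splatting, splatting
35 | InstantSfM: Fully Sparse and Parallel Structure-from-Motion | InstantSfM: fully sparse and parallel Structure-from-Motion that accelerates large-scale scene reconstruction. | VGGT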

🔬 Pillar 1: Robot Control (3 papers)

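# | Title | One-sentence summary | Tags | 🔗
36 | DepthVLA: Enhancing Vision-Language-Action Models with Depth-Aware Spatial Reasoning | DepthVLA: enhances vision-language-action models with depth-aware spatial reasoning. | manipulation, vision-language-action, VLA
37 | Trace Anything: Representing Any Video in 4D via Trajectory Fields | Trace Anything: proposes a trajectory-field-based 4D video representation for efficient spatiotemporal modeling. | manipulation
38 | NoisePrints: Distortion-Free Watermarks for Authorship in Private Diffusion Models | Proposes NoisePrints, a distortion-free watermarking method for verifying authorship in private diffusion models. | manipulation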

🔬 Pillar 8: Physics-based Animation (3 papers)

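# | Title | One-sentence summary | Tags | 🔗
39 | EPIPTrack: Rethinking Prompt Modeling with Explicit and Implicit Prompts for Multi-Object Tracking | EPIPTrack: a new prompt-modeling approach that uses explicit and implicit prompts for multi-object tracking. | spatiotemporal, large language model, multimodal
40 | Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs | Reveals the information-flow pathways in VideoLLMs by analyzing temporal reasoning through mechanistic interpretability. | spatiotemporal, large language model
41 | Real-Time Sign Language to text Translation using Deep Learning: A Comparative study of LSTM and 3D CNN | Compares LSTM and 3D CNN for real-time sign-language-to-text translation with deep learning. | spatiotemporal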

🔬 Pillar 4: Generative Motion (3 papers)

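# | Title | One-sentence summary | Tags | 🔗
42 | MimicParts: Part-aware Style Injection for Speech-Driven 3D Motion Generation | MimicParts: a part-aware style injection method for speech-driven 3D human motion generation. | motion generation
43 | CanvasMAR: Improving Masked Autoregressive Video Generation With Canvas | CanvasMAR: improves masked autoregressive video generation with a canvas mechanism, addressing slow start and error accumulation. | classifier-free guidance
44 | Group-Wise Optimization for Self-Extensible Codebooks in Vector Quantized Models | Proposes Group-VQ, which addresses codebook collapse in VQ-VAE through group-wise optimization of self-extensible codebooks. | VQ-VAE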

🔬 Pillar 7: Motion Retargeting (1 paper)

# | Title | One-sentence summary | Tags | 🔗
45 | MVCustom: Multi-View Customized Diffusion via Geometric Latent Rendering and Completion | MVCustom: multi-view customized diffusion via geometric latent rendering and completion. | geometric consistency

🔬 Pillar 6: Video Extraction (1 paper)

# | Title | One-sentence summary | Tags | 🔗
46 | VisCoP: Visual Probing for Video Domain Adaptation of Vision Language Models | VisCoP: adapts vision-language models to the video domain through visual probes. | egocentric
