cs.CV (2025-07-10)

📊 38 papers in total | 🔗 12 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (13 🔗4) · Pillar 3: Perception & Semantics (11 🔗4) · Pillar 2: RL & Architecture (8 🔗2) · Pillar 4: Generative Motion (3 🔗1) · Pillar 7: Motion Retargeting (1 🔗1) · Pillar 1: Robot Control (1) · Pillar 8: Physics-based Animation (1)

🔬 Pillar 9: Embodied Foundation Models (13 papers)

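# | Title | One-sentence summary | Tags | 🔗
1 | Corvid: Improving Multimodal Large Language Models Towards Chain-of-Thought Reasoning | Corvid: enhancing multimodal large language models through chain-of-thought reasoning | large language model, multimodal, instruction following |
2 | Rationale-Enhanced Decoding for Multi-modal Chain-of-Thought | Proposes Rationale-Enhanced Decoding (RED) to improve how well rationales are used in multimodal chain-of-thought reasoning | large language model, chain-of-thought |
3 | Impact of Pretraining Word Co-occurrence on Compositional Generalization in Multimodal Models | Reveals how word co-occurrence in pretraining data affects the compositional generalization of multimodal models | multimodal |
4 | Input Conditioned Layer Dropping in Speech Foundation Models | Proposes an input-conditioned layer-dropping method for dynamic inference acceleration of speech foundation models on edge devices | foundation model |
5 | MeD-3D: A Multimodal Deep Learning Framework for Precise Recurrence Prediction in Clear Cell Renal Cell Carcinoma (ccRCC) | Proposes MeD-3D, a multimodal deep learning framework for precise prediction of recurrence risk in clear cell renal cell carcinoma | multimodal |
6 | NexViTAD: Few-shot Unsupervised Cross-Domain Defect Detection via Vision Foundation Models and Multi-Task Learning | NexViTAD: few-shot unsupervised cross-domain defect detection based on vision foundation models and multi-task learning | foundation model |
7 | MAPEX: Modality-Aware Pruning of Experts for Remote Sensing Foundation Models | MAPEX: a modality-aware expert pruning method for remote sensing foundation models | foundation model |
8 | Multigranular Evaluation for Brain Visual Decoding | Proposes the BASIC framework for multigranular, neuroscience-driven, comprehensive evaluation of brain visual decoding | large language model, multimodal |
9 | MIRA: A Novel Framework for Fusing Modalities in Medical RAG | MIRA: a new framework for fusing multimodal information in medical RAG that significantly improves factual accuracy | large language model, multimodal |
10 | SpatialViz-Bench: An MLLM Benchmark for Spatial Visualization | Proposes SpatialViz-Bench for evaluating the spatial visualization abilities of multimodal large language models | large language model, chain-of-thought |
11 | EPIC: Efficient Prompt Interaction for Text-Image Classification | EPIC: an efficient prompt interaction method for text-image classification that significantly reduces computational cost | foundation model, multimodal |
12 | Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs | Proposes STTM, a training-free spatio-temporal token merging method that accelerates video LLM inference | large language model |
13 | Where are we with calibration under dataset shift in image classification? | Compares and analyzes calibration methods under dataset shift in image classification and offers practical recommendations | foundation model |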

🔬 Pillar 3: Perception & Semantics (11 papers)

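# | Title | One-sentence summary | Tags | 🔗
14 | RegGS: Unposed Sparse Views Gaussian Splatting with 3DGS Registration | RegGS: pose-free sparse-view Gaussian splatting reconstruction based on 3DGS registration | 3D gaussian splatting, 3DGS, gaussian splatting |
15 | Seg-Wild: Interactive Segmentation based on 3D Gaussian Splatting for Unconstrained Image Collections | Seg-Wild: an interactive segmentation method based on 3D Gaussian splatting for unconstrained image collections | 3D gaussian splatting, gaussian splatting, splatting |
16 | Occlusion-Aware Temporally Consistent Amodal Completion for 3D Human-Object Interaction Reconstruction | Proposes an occlusion-aware, temporally consistent amodal completion method for 3D human-object interaction reconstruction | 3D gaussian splatting, gaussian splatting, splatting |
17 | An Embedded Real-time Object Alert System for Visually Impaired: A Monocular Depth Estimation based Approach through Computer Vision | Proposes an embedded real-time assistive system for the visually impaired based on monocular depth estimation, addressing travel safety in complex urban environments | depth estimation, monocular depth |
18 | OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding | OST-Bench: a benchmark for evaluating the online spatio-temporal scene understanding of MLLMs | scene understanding, large language model, multimodal |
19 | MUVOD: A Novel Multi-view Video Object Segmentation Dataset and A Benchmark for 3D Segmentation | Proposes MUVOD, a multi-view video object segmentation dataset, and provides a benchmark for 3D segmentation | 3D gaussian splatting, gaussian splatting, splatting |
20 | Hardware-Aware Feature Extraction Quantisation for Real-Time Visual Odometry on FPGA Platforms | Proposes a hardware-aware quantized SuperPoint to accelerate real-time visual odometry on FPGA platforms | visual odometry |
21 | Spline Deformation Field | Proposes a trajectory modeling method based on spline deformation fields that improves spatial coherence and temporal interpolation | implicit representation, scene reconstruction, spatiotemporal |
22 | PacGDC: Label-Efficient Generalizable Depth Completion with Projection Ambiguity and Consistency | PacGDC: label-efficient, generalizable depth completion that exploits projection ambiguity and consistency | metric depth, foundation model |
23 | LOSC: LiDAR Open-voc Segmentation Consolidator | LOSC: LiDAR open-vocabulary segmentation using image vision-language models, with significant performance gains | open-vocabulary, open vocabulary |
24 | HOTA: Hierarchical Overlap-Tiling Aggregation for Large-Area 3D Flood Mapping | Proposes HOTA, a hierarchical overlap-tiling aggregation method for large-area 3D flood mapping | depth estimation |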

🔬 Pillar 2: RL & Architecture (8 papers)

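# | Title | One-sentence summary | Tags | 🔗
25 | Tree-Mamba: A Tree-Aware Mamba for Underwater Monocular Depth Estimation | Proposes Tree-Mamba, an underwater monocular depth estimation method that addresses the insufficient structural feature modeling of existing methods | Mamba, depth estimation, monocular depth |
26 | PUMA: Layer-Pruned Language Model for Efficient Unified Multimodal Retrieval with Modality-Adaptive Learning | Proposes PUMA, a layer-pruned language model for efficient unified multimodal retrieval with modality-adaptive learning | contrastive learning, distillation, large language model |
27 | Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling | Proposes Geometry Forcing, which combines video diffusion models with 3D representations to improve the 3D consistency of generated video | world model, foundation model |
28 | Martian World Model: Controllable Video Synthesis with Physically Accurate 3D Reconstructions | Proposes M3arsSynth and MarsGen for generating realistic, controllable Martian landscape videos for mission rehearsal and robot simulation | world model, multimodal |
29 | Scaling RL to Long Videos | Proposes the LongVILA framework, which uses reinforcement learning to improve vision-language models' reasoning over long videos | reinforcement learning, chain-of-thought |
30 | Diffusion-Guided Knowledge Distillation for Weakly-Supervised Low-Light Semantic Segmentation | Proposes the DGKD-WLSS framework for weakly supervised low-light semantic segmentation | distillation |
31 | Robust and Generalizable Heart Rate Estimation via Deep Learning for Remote Photoplethysmography in Complex Scenarios | Proposes a deep learning method for heart rate estimation in complex scenarios | MAE, PULSE |
32 | Single-pass Adaptive Image Tokenization for Minimum Program Search | Proposes KARL, a single-pass adaptive image tokenization method for minimum program search | reinforcement learning, representation learning |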

🔬 Pillar 4: Generative Motion (3 papers)

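# | Title | One-sentence summary | Tags | 🔗
33 | MGVQ: Could VQ-VAE Beat VAE? A Generalizable Tokenizer with Multi-group Quantization | MGVQ: a generalizable tokenizer based on multi-group quantization that significantly improves VQ-VAE image reconstruction quality | VQ-VAE |
34 | D-CNN and VQ-VAE Autoencoders for Compression and Denoising of Industrial X-ray Computed Tomography Images | Uses D-CNN and VQ-VAE autoencoders to compress and denoise industrial X-ray CT images | VQ-VAE |
35 | T-GVC: Trajectory-Guided Generative Video Coding at Ultra-Low Bitrates | T-GVC: trajectory-guided generative video coding that addresses video quality at ultra-low bitrates | physically plausible, motion tracking |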

🔬 Pillar 7: Motion Retargeting (1 paper)

# | Title | One-sentence summary | Tags | 🔗
36 | SURPRISE3D: A Dataset for Spatial Understanding and Reasoning in Complex 3D Scenes | SURPRISE3D: a dataset for spatial understanding and reasoning in complex 3D scenes | spatial relationship, embodied AI, visual grounding |

🔬 Pillar 1: Robot Control (1 paper)

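# | Title | One-sentence summary | Tags | 🔗
37 | Behave Your Motion: Habit-preserved Cross-category Animal Motion Transfer | Proposes a cross-category animal motion transfer framework that preserves animals' behavioral habits | quadruped, motion retargeting, large language model |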

🔬 Pillar 8: Physics-based Animation (1 paper)

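# | Title | One-sentence summary | Tags | 🔗
38 | A novel attention mechanism for noise-adaptive and robust segmentation of microtubules in microscopy images | Proposes a noise-adaptive attention mechanism for robust segmentation of microtubules in microscopy images | ASE |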
