cs.CV (2025-06-27)

📊 41 papers in total | 🔗 9 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (17 🔗4) · Pillar 2: RL & Architecture (13 🔗4) · Pillar 7: Motion Retargeting (3) · Pillar 8: Physics-based Animation (2) · Pillar 3: Perception & Semantics (2 🔗1) · Pillar 1: Robot Control (2) · Pillar 4: Generative Motion (1) · Pillar 6: Video Extraction (1)

🔬 Pillar 9: Embodied Foundation Models (17 papers)

| # | Title | One-line summary | Tags | 🔗 |
| 1 | Grounding-Aware Token Pruning: Recovering from Drastic Performance Drops in Visual Grounding Caused by Pruning | Proposes grounding-aware token pruning to address performance drops in visual grounding | large language model, multimodal, visual grounding | |
| 2 | Can Video Large Multimodal Models Think Like Doubters-or Double-Down: A Study on Defeasible Video Entailment | Proposes a defeasible video entailment task to improve the reasoning ability of video multimodal models | large language model, multimodal | |
| 3 | TaleForge: Interactive Multimodal System for Personalized Story Creation | Proposes TaleForge to address insufficient engagement in personalized story creation | large language model, multimodal | |
| 4 | COOCO -- Common Objects Out-of-Context -- Semantic Violation in Scenes: Investigating Multimodal Context in Referential Communication | Proposes the COOCO dataset to study the role of multimodal context in referential communication | multimodal | |
| 5 | RetFiner: A Vision-Language Refinement Scheme for Retinal Foundation Models | Proposes RetFiner to address the limited semantic understanding of retinal foundation models | foundation model | |
| 6 | Towards Scalable and Robust White Matter Lesion Localization via Multimodal Deep Learning | Proposes a multimodal deep learning framework for white matter lesion localization | multimodal | |
| 7 | TASeg: Text-aware RGB-T Semantic Segmentation based on Fine-tuning Vision Foundation Models | Proposes the TASeg framework to address missing text information in RGB-T semantic segmentation | foundation model | |
| 8 | SPAZER: Spatial-Semantic Progressive Reasoning Agent for Zero-shot 3D Visual Grounding | Proposes SPAZER for zero-shot 3D visual grounding | visual grounding | |
| 9 | Few-Shot Segmentation of Historical Maps via Linear Probing of Vision Foundation Models | Proposes a few-shot historical map segmentation method based on linear probing | foundation model | |
| 10 | CAL-RAG: Retrieval-Augmented Multi-Agent Generation for Content-Aware Layout Design | Proposes CAL-RAG for content-aware layout generation | large language model, multimodal | |
| 11 | Exploring Task-Solving Paradigm for Generalized Cross-Domain Face Anti-Spoofing via Reinforcement Fine-Tuning | Proposes a reinforcement fine-tuning method for cross-domain face anti-spoofing to address generalization | large language model, multimodal | |
| 12 | LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs | Proposes LLaVA-Scissor for token compression in video multimodal large language models | large language model, multimodal | |
| 13 | GameTileNet: A Semantic Dataset for Low-Resolution Game Art in Procedural Content Generation | Proposes GameTileNet to address low-resolution game art generation | large language model | |
| 14 | Seamless Interaction: Dyadic Audiovisual Motion Modeling and Large-Scale Dataset | Proposes a Seamless Interaction model to address the understanding of nonverbal signals in human-computer interaction | multimodal | |
| 15 | Test-Time Consistency in Vision Language Models | Proposes a test-time consistency framework to address inconsistency in vision-language models | multimodal | |
| 16 | Rethinking Visual Token Reduction in LVLMs under Cross-modal Misalignment | Proposes VisionDrop to address visual token redundancy in LVLMs | large language model | |
| 17 | ProSAM: Enhancing the Robustness of SAM-based Visual Reference Segmentation with Probabilistic Prompts | Proposes ProSAM to address the stability of SAM-based visual reference segmentation | foundation model | |

🔬 Pillar 2: RL & Architecture (13 papers)

| # | Title | One-line summary | Tags | 🔗 |
| 18 | Unifying Biomedical Vision-Language Expertise: Towards a Generalist Foundation Model via Multi-CLIP Knowledge Distillation | Proposes MMKD-CLIP to address model generalization in the biomedical domain | distillation, foundation model | |
| 19 | Periodic-MAE: Periodic Video Masked Autoencoder for rPPG Estimation | Proposes a periodic video masked autoencoder for rPPG estimation | masked autoencoder, MAE, PULSE | |
| 20 | BrainMT: A Hybrid Mamba-Transformer Architecture for Modeling Long-Range Dependencies in Functional MRI Data | Proposes BrainMT to model long-range dependencies in fMRI data | Mamba, spatial relationship, spatiotemporal | |
| 21 | EAMamba: Efficient All-Around Vision State Space Model for Image Restoration | Proposes EAMamba to address computational complexity in low-level vision tasks | Mamba, state space model | |
| 22 | ReF-LLE: Personalized Low-Light Enhancement via Reference-Guided Deep Reinforcement Learning | Proposes ReF-LLE to address personalization in low-light image enhancement | reinforcement learning, deep reinforcement learning | |
| 23 | SPADE: Spatial Transcriptomics and Pathology Alignment Using a Mixture of Data Experts for an Expressive Latent Space | Proposes SPADE to integrate pathology images with spatial transcriptomics data | representation learning, contrastive learning, foundation model | |
| 24 | Seg-R1: Segmentation Can Be Surprisingly Simple with Reinforcement Learning | Proposes Seg-R1 to improve the pixel-level understanding of multimodal models | reinforcement learning, multimodal | |
| 25 | R1-Track: Direct Application of MLLMs to Visual Object Tracking via Reinforcement Learning | Proposes R1-Track to address template matching in visual object tracking | reinforcement learning, large language model | |
| 26 | MiCo: Multi-image Contrast for Reinforcement Visual Reasoning | Proposes MiCo to address logical association in multi-image reasoning | reinforcement learning, representation learning, chain-of-thought | |
| 27 | CaO$_2$: Rectifying Inconsistencies in Diffusion-Based Dataset Distillation | Proposes CaO$_2$ to address inconsistencies in diffusion-based dataset distillation | distillation | |
| 28 | OutDreamer: Video Outpainting with a Diffusion Transformer | Proposes OutDreamer to address consistency in video outpainting | dreamer | |
| 29 | Exploring Semantic Masked Autoencoder for Self-supervised Point Cloud Understanding | Proposes a semantic masked autoencoder to capture semantic relationships in point cloud understanding | masked autoencoder | |
| 30 | RAUM-Net: Regional Attention and Uncertainty-aware Mamba Network | Proposes RAUM-Net to address uncertainty in fine-grained visual classification | Mamba | |

🔬 Pillar 7: Motion Retargeting (3 papers)

| # | Title | One-line summary | Tags | 🔗 |
| 31 | RoomCraft: Controllable and Complete 3D Indoor Scene Generation | Proposes RoomCraft to address multiple constraints in 3D indoor scene generation | spatial relationship, geometric consistency | |
| 32 | Visual Structures Helps Visual Reasoning: Addressing the Binding Problem in VLMs | Proposes VISER to address the binding problem in vision-language models | spatial relationship, chain-of-thought | |
| 33 | WarpRF: Multi-View Consistency for Training-Free Uncertainty Quantification and Applications in Radiance Fields | Proposes the WarpRF framework for uncertainty quantification in radiance fields | geometric consistency | |

🔬 Pillar 8: Physics-based Animation (2 papers)

| # | Title | One-line summary | Tags | 🔗 |
| 34 | 4D-VLA: Spatiotemporal Vision-Language-Action Pretraining with Cross-Scene Calibration | Proposes 4D-VLA to address the chaos problem in robot pretraining | spatiotemporal, vision-language-action, VLA | |
| 35 | Q-Frame: Query-aware Frame Selection and Multi-Resolution Adaptation for Video-LLMs | Proposes Q-Frame to address frame selection and multi-resolution adaptation in video understanding | spatiotemporal, large language model, multimodal | |

🔬 Pillar 3: Perception & Semantics (2 papers)

| # | Title | One-line summary | Tags | 🔗 |
| 36 | BézierGS: Dynamic Urban Scene Reconstruction with Bézier Curve Gaussian Splatting | Proposes BézierGS for dynamic urban scene reconstruction | gaussian splatting, splatting, scene reconstruction | |
| 37 | DIGS: Dynamic CBCT Reconstruction using Deformation-Informed 4D Gaussian Splatting and a Low-Rank Free-Form Deformation Model | Proposes a deformation-informed 4D Gaussian splatting method for dynamic CBCT reconstruction | gaussian splatting, splatting | |

🔬 Pillar 1: Robot Control (2 papers)

| # | Title | One-line summary | Tags | 🔗 |
| 38 | RoboEnvision: A Long-Horizon Video Generation Model for Multi-Task Robot Manipulation | Proposes RoboEnvision to address long-horizon video generation for robot manipulation | manipulation, motion generation | |
| 39 | Shape-for-Motion: Precise and Consistent Video Editing with 3D Proxy | Proposes Shape-for-Motion to address precise control in video editing | manipulation | |

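🔬 Pillar 4: Generative Motion (1 paper)

| # | Title | One-line summary | Tags | 🔗 |
| 40 | Generating Attribute-Aware Human Motions from Textual Prompt | Proposes a new framework to address the influence of attributes in text-driven human motion generation | motion generation | |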

🔬 Pillar 6: Video Extraction (1 paper)

| # | Title | One-line summary | Tags | 🔗 |
| 41 | MatChA: Cross-Algorithm Matching with Feature Augmentation | Proposes MatChA to address cross-algorithm feature matching | feature matching | |
