cs.CV (2025-09-28)

📊 46 papers in total | 🔗 7 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (15 🔗3) · Pillar 2: RL & Architecture (11 🔗1) · Pillar 3: Perception & Semantics (8 🔗2) · Pillar 1: Robot Control (4) · Pillar 4: Generative Motion (3) · Pillar 7: Motion Retargeting (2) · Pillar 8: Physics-based Animation (2) · Pillar 5: Interaction & Reaction (1 🔗1)

🔬 Pillar 9: Embodied Foundation Models (15 papers)

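# | Title | One-line summary | Tags | 🔗
1 | LUQ: Layerwise Ultra-Low Bit Quantization for Multimodal Large Language Models | Proposes LUQ, a layerwise ultra-low-bit quantization method for multimodal large language models that reduces memory footprint. | large language model, multimodal
2 | PCRI: Measuring Context Robustness in Multimodal Models for Enterprise Applications | Proposes the PCRI metric to evaluate the robustness of multimodal models to visual context in enterprise applications. | large language model, multimodal
3 | Assessing Visual Privacy Risks in Multimodal AI: A Novel Taxonomy-Grounded Evaluation of Vision-Language Models | Proposes a visual privacy taxonomy and evaluates the limitations of vision-language models in privacy understanding. | large language model, multimodal
4 | RCI: A Score for Evaluating Global and Local Reasoning in Multimodal Benchmarks | Proposes the RCI score to evaluate how much multimodal benchmarks depend on global versus local reasoning. | large language model, multimodal
5 | LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training | LLaVA-OneVision-1.5: a fully open multimodal training framework that lowers training cost and improves performance. | multimodal, chain-of-thought
6 | Uncovering Grounding IDs: How External Cues Shape Multimodal Binding | Proposes the concept of Grounding IDs to reveal how external cues shape multimodal binding. | multimodal
7 | HunyuanImage 3.0 Technical Report | Tencent Hunyuan releases HunyuanImage 3.0, open-sourcing the largest-scale image generation MoE model. | foundation model, multimodal, chain-of-thought
8 | Adapting Large Language Models to Mitigate Skin Tone Biases in Clinical Dermatology Tasks: A Mixed-Methods Study | Mitigates skin tone bias in clinical dermatology tasks by adapting large language models. | large language model
9 | ColLab: A Collaborative Spatial Progressive Data Engine for Referring Expression Comprehension and Generation | ColLab: a collaborative spatial progressive data engine for referring expression comprehension and generation. | large language model, multimodal
10 | HiDe: Rethinking The Zoom-IN method in High Resolution MLLMs via Hierarchical Decoupling | HiDe: rethinks the Zoom-IN method in high-resolution MLLMs via hierarchical decoupling. | large language model, multimodal
11 | SVAC: Scaling Is All You Need For Referring Video Object Segmentation | SVAC: improves referring video object segmentation by scaling up the input and the segmentation tokens. | large language model
12 | Revisit the Imbalance Optimization in Multi-task Learning: An Experimental Analysis | Adjusts loss weights by gradient norm to address optimization imbalance in multi-task learning. | foundation model
13 | HIVTP: A Training-Free Method to Improve VLMs Efficiency via Hierarchical Visual Token Pruning Using Middle-Layer-Based Importance Score | Proposes HIVTP, a training-free hierarchical visual token pruning method that improves VLM inference efficiency. | multimodal
14 | RIV: Recursive Introspection Mask Diffusion Vision Language Model | Proposes the Recursive Introspection Mask Diffusion Vision Language Model (RIV), giving the model a self-correction ability. | multimodal
15 | StolenLoRA: Exploring LoRA Extraction Attacks via Synthetic Data | StolenLoRA: proposes a LoRA extraction attack based on synthetic data. | large language model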

🔬 Pillar 2: RL & Architecture (11 papers)

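# | Title | One-line summary | Tags | 🔗
16 | Preserving Cross-Modal Stability for Visual Unlearning in Multimodal Scenarios | Proposes CCU, a cross-modal contrastive unlearning framework, to address knowledge retention during visual unlearning in multimodal scenarios. | contrastive learning, multimodal
17 | Reinforcement Learning with Inverse Rewards for World Model Post-training | Proposes the RLIR framework, which improves the action-following ability of video world models through learning with inverse rewards. | reinforcement learning, world model
18 | FrameMind: Frame-Interleaved Video Reasoning via Reinforcement Learning | Proposes FrameMind, which uses reinforcement learning to sample frames dynamically during video reasoning, improving video understanding. | reinforcement learning, chain-of-thought
19 | A Modality-Tailored Graph Modeling Framework for Urban Region Representation via Contrastive Learning | Proposes the MTGRR framework for urban region representation via contrastive learning, addressing heterogeneity in multimodal data fusion. | contrastive learning, multimodal
20 | Joint Superpixel and Self-Representation Learning for Scalable Hyperspectral Image Clustering | Proposes a joint superpixel and self-representation learning framework for scalable hyperspectral image clustering. | representation learning, HSI
21 | Hazy Pedestrian Trajectory Prediction via Physical Priors and Graph-Mamba | Proposes a pedestrian trajectory prediction model based on physical priors and Graph-Mamba for prediction in hazy conditions. | Mamba, state space model
22 | GenView++: Unifying Adaptive Generative Augmentation and Quality-Driven Supervision for Contrastive Representation Learning | GenView++: a contrastive representation learning framework that unifies adaptive generative augmentation with quality-driven supervision. | representation learning, contrastive learning
23 | MSD-KMamba: Bidirectional Spatial-Aware Multi-Modal 3D Brain Segmentation via Multi-scale Self-Distilled Fusion Strategy | Proposes MSD-KMamba, which achieves efficient and accurate multi-modal 3D brain segmentation through bidirectional spatial awareness and multi-scale self-distilled fusion. | Mamba, distillation
24 | Poivre: Self-Refining Visual Pointing with Reinforcement Learning | Proposes Poivre, a self-refining visual pointing method based on reinforcement learning. | reinforcement learning
25 | ReWatch-R1: Boosting Complex Video Reasoning in Large Vision-Language Models through Agentic Data Synthesis | Proposes ReWatch-R1 to address the data bottleneck in complex video reasoning. | reinforcement learning, chain-of-thought
26 | FlowLUT: Efficient Image Enhancement via Differentiable LUTs and Iterative Flow Matching | Proposes FlowLUT to address the trade-off between efficiency and expressiveness in image enhancement. | flow matching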

🔬 Pillar 3: Perception & Semantics (8 papers)

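# | Title | One-line summary | Tags | 🔗
27 | RPG360: Robust 360 Depth Estimation with Perspective Foundation Models and Graph Optimization | RPG360: robust 360-degree depth estimation using perspective foundation models and graph optimization. | depth estimation, monocular depth, feature matching
28 | CrashSplat: 2D to 3D Vehicle Damage Segmentation in Gaussian Splatting | CrashSplat: a 2D-to-3D vehicle damage segmentation method based on Gaussian splatting. | 3D gaussian splatting, gaussian splatting, splatting
29 | OVSeg3R: Learn Open-vocabulary Instance Segmentation from 2D via 3D Reconstruction | OVSeg3R: learns open-vocabulary instance segmentation from 2D via 3D reconstruction. | open-vocabulary, open vocabulary
30 | Efficient Domain-Adaptive Multi-Task Dense Prediction with Vision Foundation Models | Proposes the FAMDA framework, which uses vision foundation models for efficient domain-adaptive multi-task dense prediction. | depth estimation, foundation model
31 | Uni4D-LLM: A Unified SpatioTemporal-Aware VLM for 4D Understanding and Generation | Proposes Uni4D-LLM, a spatiotemporal-aware VLM framework that unifies 4D scene understanding and generation. | scene understanding, spatiotemporal
32 | FastViDAR: Real-Time Omnidirectional Depth Estimation via Alternative Hierarchical Attention | FastViDAR: proposes a real-time omnidirectional depth estimation framework based on alternative hierarchical attention. | depth estimation
33 | Color-Pair Guided Robust Zero-Shot 6D Pose Estimation and Tracking of Cluttered Objects on Edge Devices | Proposes a color-pair guided, robust zero-shot 6D pose estimation and tracking method for cluttered objects on edge devices. | 6D pose estimation
34 | GRS-SLAM3R: Real-Time Dense SLAM with Gated Recurrent State | GRS-SLAM3R: real-time dense SLAM based on a gated recurrent state, improving reconstruction accuracy and global consistency. | visual SLAM, scene reconstruction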

🔬 Pillar 1: Robot Control (4 papers)

# | Title | One-line summary | Tags | 🔗
35 | InteractMove: Text-Controlled Human-Object Interaction Generation in 3D Scenes with Movable Objects | InteractMove: proposes a text-controlled method for generating human-object interactions with movable objects in 3D scenes. | manipulation, affordance, physically plausible
36 | From Fields to Splats: A Cross-Domain Survey of Real-Time Neural Scene Representations | Survey: a cross-domain study of real-time neural scene representations, from NeRF to 3DGS. | manipulation, teleoperation, 3D gaussian splatting
37 | UniAlignment: Semantic Alignment for Unified Image Generation, Understanding, Manipulation and Perception | Proposes UniAlignment to address semantic consistency in multimodal generation. | manipulation, multimodal, instruction following
38 | AssemblyHands-X: Modeling 3D Hand-Body Coordination for Understanding Bimanual Human Activities | AssemblyHands-X: proposes the first markerless benchmark dataset for 3D hand-body coordinated action recognition. | bi-manual, SMPL, SMPL-X

🔬 Pillar 4: Generative Motion (3 papers)

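# | Title | One-line summary | Tags | 🔗
39 | Unified Multi-Modal Interactive & Reactive 3D Motion Generation via Rectified Flow | DualFlow: a unified multimodal framework for interactive 3D human motion generation based on rectified flow. | text-to-motion, motion synthesis, motion generation
40 | MotionVerse: A Unified Multimodal Framework for Motion Comprehension, Generation and Editing | MotionVerse: a unified multimodal framework for motion comprehension, generation, and editing. | motion tokenizer, human motion, large language model
41 | CrimEdit: Controllable Editing for Counterfactual Object Removal, Insertion, and Movement | CrimEdit: proposes a controllable editing framework for counterfactual object removal, insertion, and movement. | classifier-free guidance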

🔬 Pillar 7: Motion Retargeting (2 papers)

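# | Title | One-line summary | Tags | 🔗
42 | FUSAR-KLIP: Towards Multimodal Foundation Models for Remote Sensing | Proposes FUSAR-KLIP to address cognitive inconsistency in remote sensing image understanding. | spatial relationship, foundation model, multimodal
43 | Sparse-Up: Learnable Sparse Upsampling for 3D Generation with High-Fidelity Textures | Sparse-Up: learnable sparse upsampling for 3D generation with high-fidelity textures. | geometric consistency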

🔬 Pillar 8: Physics-based Animation (2 papers)

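# | Title | One-line summary | Tags | 🔗
44 | Token Merging via Spatiotemporal Information Mining for Surgical Video Understanding | Proposes STIM-TM, which merges tokens in surgical video via spatiotemporal information mining to improve efficiency. | spatiotemporal
45 | Autoregressive Video Generation beyond Next Frames Prediction | VideoAR: proposes an autoregressive video generation framework based on spatiotemporal cubes, going beyond frame-by-frame prediction. | spatiotemporal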

🔬 Pillar 5: Interaction & Reaction (1 paper)

# | Title | One-line summary | Tags | 🔗
46 | MoReact: Generating Reactive Motion from Textual Descriptions | MoReact: proposes a diffusion model that generates reactive motion from textual descriptions. | reactive motion
