cs.CV (2025-10-11)

📊 26 papers in total | 🔗 4 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (9 🔗2) | Pillar 3: Perception & Semantics (7 🔗1) | Pillar 2: RL & Architecture (5 🔗1) | Pillar 1: Robot Control (3) | Pillar 8: Physics-based Animation (2)

🔬 Pillar 9: Embodied Foundation Models (9 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
1	MIMO: A medical vision language model with visual referring multimodal input and pixel grounding multimodal output	MIMO: a medical vision-language model with visual referring multimodal input and pixel-grounding multimodal output	multimodal instruction following
2	Vision4PPG: Emergent PPG Analysis Capability of Vision Foundation Models for Vital Signs like Blood Pressure	Vision4PPG: applies vision foundation models to PPG analysis to predict vital signs such as blood pressure	foundation model
3	ESCA: Contextualizing Embodied Agents via Scene-Graph Generation	Proposes the ESCA framework, which strengthens the contextual awareness of embodied agents through scene-graph generation	large language model foundation model	✅
4	From Generic to Specialized: A Subspecialty Diagnostic System Powered by Self-Supervised Learning for Cervical Histopathology	CerS-Path: a subspecialty diagnostic system for cervical histopathology based on self-supervised learning	foundation model multimodal
5	CoIDO: Efficient Data Selection for Visual Instruction Tuning via Coupled Importance-Diversity Optimization	CoIDO: efficient data selection for visual instruction tuning via coupled importance-diversity optimization	large language model multimodal
6	Q-Adapter: Visual Query Adapter for Extracting Textually-related Features in Video Captioning	Proposes Q-Adapter, which uses learnable query tokens to efficiently extract caption-relevant visual features for parameter-efficient video captioning	large language model multimodal
7	Scaling Traffic Insights with AI and Language Model-Powered Camera Systems for Data-Driven Transportation Decision Making	Proposes a traffic camera system based on AI and language models for large-scale traffic insights and data-driven decision making	large language model multimodal
8	EditCast3D: Single-Frame-Guided 3D Editing with Video Propagation and View Selection	EditCast3D: single-frame-guided 3D editing using video propagation and view selection	foundation model	✅
9	From Programs to Poses: Factored Real-World Scene Generation via Learned Program Libraries	FactoredScenes: generates factored real-world scenes via learned program libraries, addressing data scarcity	large language model

🔬 Pillar 3: Perception & Semantics (7 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
10	Gesplat: Robust Pose-Free 3D Reconstruction via Geometry-Guided Gaussian Splatting	Gesplat: robust pose-free 3D reconstruction via geometry-guided Gaussian splatting	depth estimation 3D gaussian splatting 3DGS
11	Opacity-Gradient Driven Density Control for Compact and Efficient Few-Shot 3D Gaussian Splatting	Proposes an opacity-gradient-based density control method that improves the efficiency and compactness of few-shot 3D Gaussian splatting	3D gaussian splatting 3DGS gaussian splatting
12	P-4DGS: Predictive 4D Gaussian Splatting with 90$\times$ Compression	Proposes P-4DGS to address high memory consumption in dynamic scene modeling	3D gaussian splatting 3DGS gaussian splatting
13	Ordinal Scale Traffic Congestion Classification with Multi-Modal Vision-Language and Motion Analysis	Proposes a multimodal fusion framework for classifying traffic congestion levels on an ordinal scale	open-vocabulary open vocabulary multimodal
14	Ortho-Fuse: Orthomosaic Generation for Sparse High-Resolution Crop Health Maps Through Intermediate Optical Flow Estimation	Ortho-Fuse: generates orthomosaics for sparse high-resolution crop health maps through optical flow estimation	optical flow
15	B2N3D: Progressive Learning from Binary to N-ary Relationships for 3D Object Grounding	Proposes the B2N3D framework, which achieves more precise 3D object grounding through progressive learning from binary to n-ary relationships	scene understanding spatial relationship
16	Color3D: Controllable and Consistent 3D Colorization with Personalized Colorizer	Color3D: a controllable and consistent 3D colorization framework based on a personalized colorizer	gaussian splatting splatting	✅

🔬 Pillar 2: RL & Architecture (5 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
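17	Answer-Consistent Chain-of-thought Reinforcement Learning For Multi-modal Large Language Models	Proposes ACRE, which uses consistency-based reinforcement learning to improve the reasoning consistency of multimodal large models on visual question answering	reinforcement learning large language model multimodal
18	Bridging Perspectives: Foundation Model Guided BEV Maps for 3D Object Detection and Tracking	Proposes DualViewDistill, which uses foundation-model-guided BEV maps to improve 3D object detection and tracking	distillation foundation model
19	Probabilistic Hyper-Graphs using Multiple Randomly Masked Autoencoders for Semi-supervised Multi-modal Multi-task Learning	Proposes the PHG-MAE model, which combines neural graphs with masked autoencoders for semi-supervised multi-modal multi-task learning	masked autoencoder MAE distillation
20	Complementary and Contrastive Learning for Audio-Visual Segmentation	Proposes CCFormer, which achieves more accurate audio-visual segmentation through complementary and contrastive learning	contrastive learning multimodal	✅
21	SaFiRe: Saccade-Fixation Reiteration with Mamba for Referring Image Segmentation	Proposes the SaFiRe framework, which uses Mamba to handle complex expressions in referring image segmentation	Mamba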

🔬 Pillar 1: Robot Control (3 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
22	Training-Free In-Context Forensic Chain for Image Manipulation Detection and Localization	Proposes ICFC, a training-free in-context forensic chain for image manipulation detection and localization	manipulation large language model
23	SecureWebArena: A Holistic Security Evaluation Benchmark for LVLM-based Web Agents	SecureWebArena: a holistic benchmark for evaluating the security of LVLM-based web agents	manipulation
24	BurstDeflicker: A Benchmark Dataset for Flicker Removal in Dynamic Scenes	Proposes the BurstDeflicker dataset for research on flicker removal in dynamic scenes	manipulation

🔬 Pillar 8: Physics-based Animation (2 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
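25	Tracking the Spatiotemporal Evolution of Landslide Scars Using a Vision Foundation Model: A Novel and Universal Framework	Proposes a framework based on a vision foundation model for tracking the spatiotemporal evolution of landslide scars, enabling continuous monitoring and early warning	spatiotemporal foundation model
26	Semi-disentangled spatiotemporal implicit neural representations of longitudinal neuroimaging data for trajectory classification	Proposes a semi-disentangled spatiotemporal implicit neural representation method for trajectory classification of longitudinal neuroimaging data	spatiotemporal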
