cs.CV (2025-10-21)

📊 38 papers in total | 🔗 11 with code

🎯 Interest Area Navigation

Pillar 3: Perception & Semantics (11 🔗3) · Pillar 9: Embodied Foundation Models (11 🔗1) · Pillar 2: RL & Architecture (10 🔗5) · Pillar 6: Video Extraction (2) · Pillar 5: Interaction & Reaction (1) · Pillar 7: Motion Retargeting (1) · Pillar 8: Physics-based Animation (1 🔗1) · Pillar 1: Robot Control (1 🔗1)

🔬 Pillar 3: Perception & Semantics (11 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
1	Moving Light Adaptive Colonoscopy Reconstruction via Illumination-Attenuation-Aware 3D Gaussian Splatting	Proposes ColIAGS, which uses illumination-attenuation-aware 3D Gaussian Splatting for colonoscopy reconstruction that adapts to a moving light source.	3D gaussian splatting 3DGS gaussian splatting
2	Re-Activating Frozen Primitives for 3D Gaussian Splatting	ReAct-GS: addresses over-reconstruction artifacts in 3D Gaussian Splatting by re-activating frozen primitives.	3D gaussian splatting gaussian splatting splatting
3	OpenInsGaussian: Open-vocabulary Instance Gaussian Segmentation with Context-aware Cross-view Fusion	Proposes OpenInsGaussian, which achieves open-vocabulary instance Gaussian segmentation through context-aware cross-view fusion.	gaussian splatting splatting scene understanding
4	GeoDiff: Geometry-Guided Diffusion for Metric Depth Estimation	GeoDiff: proposes a geometry-guided diffusion model for metric depth estimation that requires no retraining.	depth estimation monocular depth metric depth
5	BlendCLIP: Bridging Synthetic and Real Domains for Zero-Shot 3D Object Classification with Multimodal Pretraining	BlendCLIP: bridges synthetic and real domains through multimodal pretraining for zero-shot 3D object classification.	open-vocabulary open vocabulary multimodal	✅
6	Mono4DGS-HDR: High Dynamic Range 4D Gaussian Splatting from Alternating-exposure Monocular Videos	Mono4DGS-HDR: proposes a Gaussian Splatting-based method for HDR 4D reconstruction from alternating-exposure monocular videos.	gaussian splatting splatting
7	Think with 3D: Geometric Imagination Grounded Spatial Reasoning from Limited Views	Proposes 3DThinker, which performs spatial reasoning grounded in geometric imagination from limited views.	VGGT spatial relationship foundation model	✅
8	PLANA3R: Zero-shot Metric Planar 3D Reconstruction via Feed-Forward Planar Splatting	PLANA3R: zero-shot metric planar 3D reconstruction based on feed-forward planar splatting.	depth estimation splatting	✅
9	UWBench: A Comprehensive Vision-Language Benchmark for Underwater Understanding	Proposes UWBench, an underwater vision-language benchmark, to advance research on underwater scene understanding.	scene understanding multimodal visual grounding
10	MoAlign: Motion-Centric Representation Alignment for Video Diffusion Models	MoAlign: proposes a motion-centric representation alignment method for video diffusion models that improves temporal consistency and physical plausibility.	optical flow physically plausible
11	VelocityNet: Real-Time Crowd Anomaly Detection via Person-Specific Velocity Analysis	VelocityNet: real-time crowd anomaly detection via person-specific velocity analysis.	optical flow

🔬 Pillar 9: Embodied Foundation Models (11 papers)

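#	Title	One-sentence summary	Tags	🔗	⭐
12	The Impact of Image Resolution on Biomedical Multimodal Large Language Models	Studies how image resolution affects the performance of biomedical multimodal large language models and proposes a mixed-resolution training strategy.	large language model multimodal
13	Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs	Proposes GAR to address region-level understanding in multimodal large language models.	large language model multimodal
14	VLSU: Mapping the Limits of Joint Multimodal Understanding for AI Safety	VLSU: builds a multimodal AI safety evaluation framework that reveals the limits of joint vision-language understanding.	foundation model multimodal
15	Detection and Simulation of Urban Heat Islands Using a Fine-Tuned Geospatial Foundation Model for Microclimate Impact Prediction	Fine-tunes a geospatial foundation model to detect and simulate urban heat islands and predict microclimate impact.	foundation model
16	IF-VidCap: Can Video Caption Models Follow Instructions?	Proposes the IF-VidCap benchmark to evaluate how well video captioning models follow instructions.	large language model multimodal instruction following
17	Beyond Single Models: Mitigating Multimodal Hallucinations via Adaptive Token Ensemble Decoding	Proposes Adaptive Token Ensemble Decoding (ATED), a training-free method that effectively mitigates hallucinations in multimodal large models.	multimodal	✅
18	Robust Driving QA through Metadata-Grounded Context and Task-Specific Prompts	Proposes a driving-scene question answering system based on metadata and task-specific prompts to improve robustness.	multimodal chain-of-thought
19	See the Text: From Tokenization to Visual Reading	Proposes SeeTok, which treats text as images and uses a multimodal LLM for efficient visual reading comprehension.	large language model multimodal
20	SITS-DECO: A Generative Decoder Is All You Need For Multitask Satellite Image Time Series Modelling	SITS-DECO: multitask satellite image time series modelling using only a generative decoder.	large language model foundation model
21	PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions	Proposes PoSh, which uses scene graphs to guide LLM-based evaluation of image descriptions, and releases the DOCENT dataset.	foundation model
22	Gestura: A LVLM-Powered System Bridging Motion and Semantics for Real-Time Free-Form Gesture Understanding	Gestura: an LVLM-based system for real-time free-form gesture understanding that bridges the gap between motion and semantics.	chain-of-thought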

🔬 Pillar 2: RL & Architecture (10 papers)

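#	Title	One-sentence summary	Tags	🔗	⭐
23	Proactive Reasoning-with-Retrieval Framework for Medical Multimodal Large Language Models	Proposes the Med-RwR framework, which strengthens the reasoning of medical multimodal large language models through proactive retrieval.	reinforcement learning large language model multimodal	✅
24	CovMatch: Cross-Covariance Guided Multimodal Dataset Distillation with Trainable Text Encoder	CovMatch: multimodal dataset distillation via cross-covariance guidance and a trainable text encoder.	contrastive learning distillation multimodal
25	Vision Foundation Models Can Be Good Tokenizers for Latent Diffusion Models	Proposes VFM-VAE, which directly uses vision foundation models as tokenizers for latent diffusion models, significantly improving generation quality and efficiency.	distillation foundation model
26	Activating Visual Context and Commonsense Reasoning through Masked Prediction in VLMs	Proposes a masked-prediction method for activating visual context and commonsense reasoning, improving vision-language model reasoning in multimodal settings.	reinforcement learning large language model multimodal
27	Ranking-based Preference Optimization for Diffusion Models from Implicit User Feedback	Proposes Diffusion-DRO, which improves user preference alignment of diffusion models through ranking-based optimization and online negative samples.	reinforcement learning inverse reinforcement learning preference learning	✅
28	OmniNWM: Omniscient Driving Navigation World Models	OmniNWM: an omniscient panoramic navigation world model for autonomous driving.	world model metric depth	✅
29	Embodied Navigation with Auxiliary Task of Action Description Prediction	Proposes an embodied navigation method with action description prediction as an auxiliary task, improving navigation performance and interpretability.	reinforcement learning distillation multimodal
30	UniHPR: Unified Human Pose Representation via Singular Value Contrastive Learning	Proposes UniHPR, which unifies multimodal human pose representations through singular value contrastive learning.	representation learning contrastive learning
31	Exploring a Unified Vision-Centric Contrastive Alternatives on Multi-Modal Web Documents	Proposes VC2L, a vision-centric contrastive learning approach for unified representation learning on multimodal web documents.	representation learning contrastive learning multimodal	✅
32	ProCLIP: Progressive Vision-Language Alignment via LLM-based Embedder	ProCLIP: proposes a progressive vision-language alignment framework based on an LLM embedder, improving CLIP's handling of long text.	contrastive learning curriculum learning distillation	✅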

🔬 Pillar 6: Video Extraction (2 papers)

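#	Title	One-sentence summary	Tags	🔗	⭐
33	Latent-Info and Low-Dimensional Learning for Human Mesh Recovery and Parallel Optimization	Proposes a human mesh recovery and parallel optimization method based on latent information and low-dimensional learning.	human mesh recovery human motion
34	Hyperbolic Space Learning Method Leveraging Temporal Motion Priors for Human Mesh Recovery	Proposes a hyperbolic space learning method that leverages temporal motion priors for human mesh recovery.	human mesh recovery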

🔬 Pillar 5: Interaction & Reaction (1 paper)

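#	Title	One-sentence summary	Tags	🔗	⭐
35	Learning Human-Object Interaction as Groups	Proposes the GroupHOI framework, which improves human-object interaction detection from a group interaction perspective.	human-object interaction HOI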

🔬 Pillar 7: Motion Retargeting (1 paper)

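#	Title	One-sentence summary	Tags	🔗	⭐
36	DSI-Bench: A Benchmark for Dynamic Spatial Intelligence	Proposes the DSI-Bench benchmark for evaluating dynamic spatial intelligence, revealing the limitations of existing VLMs in understanding dynamic 3D scenes.	spatial relationship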

🔬 Pillar 8: Physics-based Animation (1 paper)

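#	Title	One-sentence summary	Tags	🔗	⭐
37	A Renaissance of Explicit Motion Information Mining from Transformers for Action Recognition	Proposes EMIM, an explicit motion information mining module that strengthens how Transformers model motion information for action recognition.	spatiotemporal	✅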

🔬 Pillar 1: Robot Control (1 paper)

#	Title	One-sentence summary	Tags	🔗	⭐
38	Efficient Few-shot Identity Preserving Attribute Editing for 3D-aware Deep Generative Models	Proposes an efficient few-shot 3D face attribute editing method that preserves identity.	manipulation	✅
