cs.CV (2025-09-26)

📊 30 papers in total | 🔗 8 with code

🎯 Navigation by interest area

Pillar 9: Embodied Foundation Models (13 🔗1) | Pillar 2: RL & Architecture (10 🔗5) | Pillar 3: Perception & Semantics (5 🔗2) | Pillar 8: Physics-based Animation (1) | Pillar 1: Robot Control (1)

🔬 Pillar 9: Embodied Foundation Models (13 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
1	Explaining multimodal LLMs via intra-modal token interactions	Improves the interpretability of multimodal LLMs through intra-modal token interactions	large language model multimodal
2	JanusVLN: Decoupling Semantics and Spatiality with Dual Implicit Memory for Vision-Language Navigation	JanusVLN: decouples semantic and spatial information with dual implicit memory to improve vision-language navigation performance	VLN large language model multimodal	✅
3	Introducing Multimodal Paradigm for Learning Sleep Staging PSG via General-Purpose Model	Proposes a new sleep-staging paradigm based on a general-purpose multimodal model, improving the accuracy and robustness of PSG analysis	multimodal
4	Effectiveness of Large Multimodal Models in Detecting Disinformation: Experimental Results	Uses GPT-4o with optimized prompt engineering to tackle multimodal disinformation detection	multimodal
5	MILR: Improving Multimodal Image Generation via Test-Time Latent Reasoning	Proposes MILR, a test-time latent reasoning method that improves multimodal image generation quality	multimodal
6	DeHate: A Stable Diffusion-based Multimodal Approach to Mitigate Hate Speech in Images	Proposes DeHate, a Stable Diffusion-based multimodal method for mitigating hate speech in images	multimodal
7	DynaNav: Dynamic Feature and Layer Selection for Efficient Visual Navigation	DynaNav: a dynamic feature and layer selection framework for efficient visual navigation	embodied AI foundation model
8	FishAI 2.0: Marine Fish Image Classification with Multi-modal Few-shot Learning	FishAI 2.0: marine fish image classification with multimodal few-shot learning	large language model multimodal
9	LABELING COPILOT: A Deep Research Agent for Automated Data Curation in Computer Vision	Proposes Labeling Copilot, a deep research agent for automated data curation in computer vision	foundation model multimodal
10	UML-CoT: Structured Reasoning and Planning with Unified Modeling Language for Robotic Room Cleaning	Proposes the UML-CoT framework, which uses UML for structured reasoning and planning in robotic room-cleaning tasks	large language model chain-of-thought
11	Where MLLMs Attend and What They Rely On: Explaining Autoregressive Token Generation	EAGLE: a lightweight black-box framework that explains autoregressive token generation in multimodal LLMs	large language model multimodal
12	Beyond the Vision Encoder: Identifying and Mitigating Spatial Bias in Large Vision-Language Models	Proposes Adaptive Global Context Injection (AGCI) to address spatial bias in large vision-language models	large language model multimodal
13	CircuitSense: A Hierarchical Circuit System Benchmark Bridging Visual Comprehension and Symbolic Reasoning in Engineering Design Process	CircuitSense: a circuit system benchmark that bridges visual comprehension and symbolic reasoning in engineering design	large language model

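🔬 Pillar 2: RL & Architecture (10 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
14	On Robustness of Vision-Language-Action Model against Multi-Modal Perturbations	Proposes RobustVLA to strengthen the robustness of vision-language-action models under multimodal perturbations	flow matching vision-language-action VLA
15	Multimodal Slice Interaction Network Enhanced by Transfer Learning for Precise Segmentation of Internal Gross Tumor Volume in Lung Cancer PET/CT Imaging	Proposes a transfer-learning-based multimodal slice interaction network for precise IGTV segmentation in lung cancer PET/CT images	Mamba multimodal
16	TRUST: Test-Time Refinement using Uncertainty-Guided SSM Traverses	Proposes TRUST, which performs test-time refinement with uncertainty-guided SSM traversals to improve robustness under distribution shift	Mamba SSM state space model
17	SPARK: Synergistic Policy And Reward Co-Evolving Framework	SPARK: a reinforcement learning framework for LLMs/LVLMs in which policy and reward co-evolve synergistically	reinforcement learning RLHF large language model
18	PSTTS: A Plug-and-Play Token Selector for Efficient Event-based Spatio-temporal Representation Learning	Proposes PSTTS, a plug-and-play module that reduces computational redundancy in spatio-temporal representation learning on event data	Mamba representation learning
19	VideoScore2: Think before You Score in Generative Video Evaluation	VideoScore2: a multi-dimensional, interpretable framework for evaluating generated video that improves evaluation accuracy and controllability	reinforcement learning chain-of-thought	✅
20	CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning	Proposes CapRL, which uses reinforcement learning to make image captions denser and more useful	reinforcement learning	✅
21	NIFTY: a Non-Local Image Flow Matching for Texture Synthesis	NIFTY: a non-local image flow matching method for texture synthesis	flow matching	✅
22	Rule-Based Reinforcement Learning for Document Image Classification with Vision Language Models	Proposes a rule-based reinforcement learning method that improves generalization in document image classification	reinforcement learning	✅
23	ERGO: Efficient High-Resolution Visual Understanding for Vision-Language Models	Proposes ERGO, which improves vision-language model performance through efficient high-resolution visual understanding	reinforcement learning multimodal	✅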

🔬 Pillar 3: Perception & Semantics (5 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
24	Learning Unified Representation of 3D Gaussian Splatting	Proposes a unified representation for 3D Gaussian splatting based on continuous submanifold fields	3D gaussian splatting 3DGS gaussian splatting	✅
25	Lightweight Structured Multimodal Reasoning for Clinical Scene Understanding in Robotics	Proposes a lightweight structured multimodal reasoning framework for clinical scene understanding in robotics	scene understanding multimodal chain-of-thought
26	GaussianVision: Vision-Language Alignment from Compressed Image Representations using 2D Gaussian Splatting	GaussianVision: vision-language alignment from compressed image representations using 2D Gaussian splatting	gaussian splatting splatting multimodal
27	EfficientDepth: A Fast and Detail-Preserving Monocular Depth Estimation Model	EfficientDepth: a fast, detail-preserving monocular depth estimation model	depth estimation monocular depth geometric consistency
28	CCNeXt: An Effective Self-Supervised Stereo Depth Estimation Approach	Proposes CCNeXt, an efficient self-supervised stereo depth estimation method suited to compute-constrained settings	depth estimation stereo depth	✅

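🔬 Pillar 8: Physics-based Animation (1 paper)

#	Title	One-sentence summary	Tags	🔗	⭐
29	Learning Human-Perceived Fakeness in AI-Generated Videos via Multimodal LLMs	Proposes the DeeptraceReward benchmark, which uses multimodal LLMs to learn the traces of fakeness that humans perceive in AI-generated videos	spatiotemporal multimodal TAMP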

🔬 Pillar 1: Robot Control (1 paper)

#	Title	One-sentence summary	Tags	🔗	⭐
30	MesaTask: Towards Task-Driven Tabletop Scene Generation via 3D Spatial Reasoning	MesaTask: a task-driven tabletop scene generation framework based on 3D spatial reasoning	manipulation DPO physically plausible
