cs.CV (2025-07-21)

📊 30 papers in total | 🔗 6 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (9 🔗2) Pillar 3: Perception & Semantics (7 🔗1) Pillar 2: RL & Architecture (6 🔗1) Pillar 6: Video Extraction (3) Pillar 1: Robot Control (2 🔗1) Pillar 5: Interaction & Reaction (1 🔗1) Pillar 8: Physics-based Animation (1) Pillar 7: Motion Retargeting (1)

🔬 Pillar 9: Embodied Foundation Models (9 papers)

#	Title	One-line summary	Tags	🔗	⭐
1	Extracting Visual Facts from Intermediate Layers for Mitigating Hallucinations in Multimodal Large Language Models	Proposes EVA, which mitigates hallucinations in multimodal large language models by extracting visual information from intermediate layers	large language model multimodal
2	True Multimodal In-Context Learning Needs Attention to the Visual Context	Proposes DARA and the TrueMICL dataset to improve the use of visual information in multimodal in-context learning	large language model multimodal	✅
3	Pixels, Patterns, but No Poetry: To See The World like Humans	Proposes the Turing Eye Test to evaluate how far multimodal large language models fall short of human-like perception	large language model multimodal
4	SIA: Enhancing Safety via Intent Awareness for Vision-Language Models	SIA: improves the safety of vision-language models through intent awareness, with no additional training	multimodal chain-of-thought
5	FreeCus: Free Lunch Subject-driven Customization in Diffusion Transformers	Proposes FreeCus, a training-free method for subject-driven customization in diffusion transformers	large language model multimodal	✅
6	Weak Links in LinkedIn: Enhancing Fake Profile Detection in the Age of LLMs	Proposes GPT-assisted adversarial training to make LinkedIn fake-profile detectors more robust to LLM-generated content	large language model
7	ConformalSAM: Unlocking the Potential of Foundational Segmentation Models in Semi-Supervised Semantic Segmentation with Conformal Prediction	ConformalSAM: uses conformal prediction to unlock the potential of foundational segmentation models in semi-supervised semantic segmentation	foundation model
8	CHROMA: Consistent Harmonization of Multi-View Appearance via Bilateral Grid Prediction	CHROMA: harmonizes multi-view appearance consistently via bilateral grid prediction	foundation model
9	DynImg: Key Frames with Visual Prompts are Good Representation for Multi-Modal Video Understanding	DynImg: improves multimodal video understanding by using key frames with visual prompts	large language model

🔬 Pillar 3: Perception & Semantics (7 papers)

#	Title	One-line summary	Tags	🔗	⭐
10	DWTGS: Rethinking Frequency Regularization for Sparse-view 3D Gaussian Splatting	DWTGS: uses the wavelet transform to improve frequency regularization for sparse-view 3D Gaussian splatting	3D gaussian splatting 3DGS gaussian splatting
11	SurfaceSplat: Connecting Surface Reconstruction and Gaussian Splatting	SurfaceSplat: combines surface reconstruction with Gaussian splatting to improve reconstruction and rendering quality under sparse views	3D gaussian splatting 3DGS gaussian splatting	✅
12	BenchDepth: Are We on the Right Way to Evaluate Depth Foundation Models?	Proposes the BenchDepth benchmark, which evaluates depth foundation models through downstream tasks to avoid alignment bias	depth estimation scene reconstruction foundation model
13	Hi^2-GSLoc: Dual-Hierarchical Gaussian-Specific Visual Relocalization for Remote Sensing	Proposes Hi^2-GSLoc to address visual relocalization in remote sensing	3D gaussian splatting 3DGS gaussian splatting
14	Dense-depth map guided deep Lidar-Visual Odometry with Sparse Point Clouds and Images	Proposes a deep LiDAR-visual odometry method guided by dense depth maps to improve pose estimation accuracy	visual odometry
15	Can Your Model Separate Yolks with a Water Bottle? Benchmarking Physical Commonsense Understanding in Video Generation Models	PhysVidBench: a physical commonsense benchmark that evaluates video generation models on tool use and related abilities	affordance
16	DAViD: Data-efficient and Accurate Vision Models from Synthetic Data	DAViD: trains human-centric vision models using efficient and accurate synthetic data	depth estimation

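🔬 Pillar 2: RL & Architecture (6 papers)

#	Title	One-line summary	Tags	🔗	⭐
17	MeshMamba: State Space Models for Articulated 3D Mesh Generation and Reconstruction	MeshMamba: uses state space models for articulated 3D mesh generation and reconstruction	Mamba SSM state space model
18	CLAMP: Contrastive Learning with Adaptive Multi-loss and Progressive Fusion for Multimodal Aspect-Based Sentiment Analysis	Proposes the CLAMP framework, which addresses cross-modal alignment in multimodal sentiment analysis through contrastive learning and adaptive multi-loss fusion	contrastive learning multimodal
19	Few-Shot Object Detection via Spatial-Channel State Space Model	Proposes a spatial-channel state space model for few-shot object detection	Mamba state space model spatial relationship
20	Visual-Language Model Knowledge Distillation Method for Image Quality Assessment	Proposes an image quality assessment method based on vision-language model knowledge distillation, improving model efficiency and local feature recognition	distillation multimodal
21	Local Dense Logit Relations for Enhanced Knowledge Distillation	Proposes Local Dense Relational Logit Distillation (LDRLD), which improves knowledge distillation through fine-grained logit relations	distillation
22	Efficient Face Image Quality Assessment via Self-training and Knowledge Distillation	Proposes an efficient face image quality assessment method based on self-training and knowledge distillation, suited to practical deployment	distillation	✅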

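🔬 Pillar 6: Video Extraction (3 papers)

#	Title	One-line summary	Tags	🔗	⭐
23	Is Tracking really more challenging in First Person Egocentric Vision?	Presents a benchmark study of object tracking in first-person vision that separates the challenges of viewpoint from those of the scene	egocentric egocentric vision first-person view
24	SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction	Proposes the SeC framework, which uses concept construction to address semantic understanding in complex video segmentation	feature matching
25	Procedure Learning via Regularized Gromov-Wasserstein Optimal Transport	Proposes a self-supervised procedure learning framework based on regularized Gromov-Wasserstein optimal transport	egocentric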

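🔬 Pillar 1: Robot Control (2 papers)

#	Title	One-line summary	Tags	🔗	⭐
26	Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos	Being-H0: a vision-language-action model pretrained on large-scale human videos, improving dexterous manipulation	manipulation sim-to-real motion generation	✅
27	Discovering and using Spelke segments	Proposes SpelkeNet, which discovers Spelke objects by predicting object motion relationships, improving performance on physical interaction tasks	manipulation world model affordance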

🔬 Pillar 5: Interaction & Reaction (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
28	HOLa: Zero-Shot HOI Detection with Low-Rank Decomposed VLM Feature Adaptation	HOLa: a zero-shot HOI detection method based on low-rank decomposed VLM feature adaptation	human-object interaction HOI	✅

🔬 Pillar 8: Physics-based Animation (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
29	EgoPrune: Efficient Token Pruning for Egomotion Video Reasoning in Embodied Agent	EgoPrune: an efficient token pruning method for egomotion video reasoning in embodied agents	spatiotemporal embodied AI multimodal

🔬 Pillar 7: Motion Retargeting (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
30	Toward a Real-Time Framework for Accurate Monocular 3D Human Pose Estimation with Geometric Priors	Proposes a real-time monocular 3D human pose estimation framework that incorporates geometric priors	human motion
