cs.CV (2025-07-21)

📊 30 papers in total | 🔗 6 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (9 🔗2) · Pillar 3: Perception & Semantics (7 🔗1) · Pillar 2: RL & Architecture (6 🔗1) · Pillar 6: Video Extraction (3) · Pillar 1: Robot Control (2 🔗1) · Pillar 5: Interaction & Reaction (1 🔗1) · Pillar 8: Physics-based Animation (1) · Pillar 7: Motion Retargeting (1)

🔬 Pillar 9: Embodied Foundation Models (9 papers)

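# | Title | One-line summary | Tags | 🔗
1 | Extracting Visual Facts from Intermediate Layers for Mitigating Hallucinations in Multimodal Large Language Models | Proposes EVA, which mitigates hallucinations in multimodal large language models by extracting visual information from intermediate layers | large language model, multimodal
2 | True Multimodal In-Context Learning Needs Attention to the Visual Context | Proposes DARA and the TrueMICL dataset to improve the use of visual information in multimodal in-context learning | large language model, multimodal
3 | Pixels, Patterns, but No Poetry: To See The World like Humans | Proposes the Turing Eye Test to assess how far multimodal large language models fall short of human-like perception | large language model, multimodal
4 | SIA: Enhancing Safety via Intent Awareness for Vision-Language Models | SIA: improves the safety of vision-language models through intent awareness, with no additional training | multimodal, chain-of-thought
5 | FreeCus: Free Lunch Subject-driven Customization in Diffusion Transformers | Proposes FreeCus, a training-free method for subject-driven customization in diffusion Transformers | large language model, multimodal
6 | Weak Links in LinkedIn: Enhancing Fake Profile Detection in the Age of LLMs | Proposes GPT-assisted adversarial training to make LinkedIn fake-profile detectors more robust to LLM-generated content | large language model
7 | ConformalSAM: Unlocking the Potential of Foundational Segmentation Models in Semi-Supervised Semantic Segmentation with Conformal Prediction | ConformalSAM: uses conformal prediction to unlock the potential of foundational segmentation models in semi-supervised semantic segmentation | foundation model
8 | CHROMA: Consistent Harmonization of Multi-View Appearance via Bilateral Grid Prediction | CHROMA: harmonizes multi-view appearance consistently via bilateral grid prediction | foundation model
9 | DynImg: Key Frames with Visual Prompts are Good Representation for Multi-Modal Video Understanding | DynImg: uses key frames with visual prompts to improve multimodal video understanding | large language model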

🔬 Pillar 3: Perception & Semantics (7 papers)

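# | Title | One-line summary | Tags | 🔗
10 | DWTGS: Rethinking Frequency Regularization for Sparse-view 3D Gaussian Splatting | DWTGS: uses the wavelet transform to improve frequency regularization for sparse-view 3D Gaussian splatting | 3D gaussian splatting, 3DGS, gaussian splatting
11 | SurfaceSplat: Connecting Surface Reconstruction and Gaussian Splatting | SurfaceSplat: combines surface reconstruction with Gaussian splatting to improve reconstruction and rendering quality under sparse views | 3D gaussian splatting, 3DGS, gaussian splatting
12 | BenchDepth: Are We on the Right Way to Evaluate Depth Foundation Models? | Proposes the BenchDepth benchmark, which evaluates depth foundation models through downstream tasks to avoid alignment bias | depth estimation, scene reconstruction, foundation model
13 | Hi^2-GSLoc: Dual-Hierarchical Gaussian-Specific Visual Relocalization for Remote Sensing | Proposes Hi^2-GSLoc to address visual relocalization in remote sensing | 3D gaussian splatting, 3DGS, gaussian splatting
14 | Dense-depth map guided deep Lidar-Visual Odometry with Sparse Point Clouds and Images | Proposes a deep LiDAR-visual odometry method guided by dense depth maps to improve pose estimation accuracy | visual odometry
15 | Can Your Model Separate Yolks with a Water Bottle? Benchmarking Physical Commonsense Understanding in Video Generation Models | PhysVidBench: builds a physical commonsense benchmark to evaluate video generation models on tool use and related abilities | affordance
16 | DAViD: Data-efficient and Accurate Vision Models from Synthetic Data | DAViD: trains data-efficient, accurate human-centric vision models on synthetic data | depth estimation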

🔬 Pillar 2: RL & Architecture (6 papers)

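# | Title | One-line summary | Tags | 🔗
17 | MeshMamba: State Space Models for Articulated 3D Mesh Generation and Reconstruction | MeshMamba: uses state space models for articulated 3D mesh generation and reconstruction | Mamba, SSM, state space model
18 | CLAMP: Contrastive Learning with Adaptive Multi-loss and Progressive Fusion for Multimodal Aspect-Based Sentiment Analysis | Proposes the CLAMP framework, which addresses cross-modal alignment in multimodal sentiment analysis through contrastive learning and adaptive multi-loss fusion | contrastive learning, multimodal
19 | Few-Shot Object Detection via Spatial-Channel State Space Model | Proposes a spatial-channel state space model for few-shot object detection | Mamba, state space model, spatial relationship
20 | Visual-Language Model Knowledge Distillation Method for Image Quality Assessment | Proposes an image quality assessment method based on vision-language model knowledge distillation, improving model efficiency and local feature recognition | distillation, multimodal
21 | Local Dense Logit Relations for Enhanced Knowledge Distillation | Proposes Local Dense Relational Logit Distillation (LDRLD), which improves knowledge distillation through fine-grained logit relations | distillation
22 | Efficient Face Image Quality Assessment via Self-training and Knowledge Distillation | Proposes an efficient face image quality assessment method based on self-training and knowledge distillation, suited to practical deployment | distillation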

🔬 Pillar 6: Video Extraction (3 papers)

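# | Title | One-line summary | Tags | 🔗
23 | Is Tracking really more challenging in First Person Egocentric Vision? | Presents a benchmark study of object tracking in first-person vision that separates the challenges due to viewpoint from those due to the scene | egocentric, egocentric vision, first-person view
24 | SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction | Proposes the SeC framework, which uses concept construction to address semantic understanding difficulties in complex video segmentation | feature matching
25 | Procedure Learning via Regularized Gromov-Wasserstein Optimal Transport | Proposes a self-supervised procedure learning framework based on regularized Gromov-Wasserstein optimal transport | egocentric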

🔬 Pillar 1: Robot Control (2 papers)

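# | Title | One-line summary | Tags | 🔗
26 | Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos | Being-H0: a vision-language-action pretraining model built on large-scale human videos that improves dexterous manipulation | manipulation, sim-to-real, motion generation
27 | Discovering and using Spelke segments | Proposes SpelkeNet, which discovers Spelke objects by predicting object motion relationships and improves performance on physical interaction tasks | manipulation, world model, affordance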

🔬 Pillar 5: Interaction & Reaction (1 paper)

# | Title | One-line summary | Tags | 🔗
28 | HOLa: Zero-Shot HOI Detection with Low-Rank Decomposed VLM Feature Adaptation | HOLa: a zero-shot HOI detection method based on low-rank decomposed VLM feature adaptation | human-object interaction, HOI

🔬 Pillar 8: Physics-based Animation (1 paper)

# | Title | One-line summary | Tags | 🔗
29 | EgoPrune: Efficient Token Pruning for Egomotion Video Reasoning in Embodied Agent | EgoPrune: an efficient token pruning method for egomotion video reasoning in embodied agents | spatiotemporal, embodied AI, multimodal

🔬 Pillar 7: Motion Retargeting (1 paper)

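# | Title | One-line summary | Tags | 🔗
30 | Toward a Real-Time Framework for Accurate Monocular 3D Human Pose Estimation with Geometric Priors | Proposes a real-time monocular 3D human pose estimation framework that incorporates geometric priors | human motion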
