cs.CV (2025-05-30)

📊 56 papers in total | 🔗 15 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (18 🔗6)
Pillar 2: RL Algorithms & Architecture (14 🔗3)
Pillar 3: Spatial Perception & Semantics (8 🔗3)
Pillar 6: Video Extraction & Matching (6 🔗1)
Pillar 1: Robot Control (4)
Pillar 7: Motion Retargeting (2 🔗1)
Pillar 8: Physics-based Animation (2 🔗1)
Pillar 4: Generative Motion (2)

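🔬 Pillar 9: Embodied Foundation Models (18 papers)

# | Title | One-line summary | Tags | 🔗
1 | Period-LLM: Extending the Periodic Capability of Multimodal Large Language Model | Proposes Period-LLM to improve multimodal large language models on periodic tasks | large language model, multimodal
2 | Mixpert: Mitigating Multimodal Learning Conflicts with Efficient Mixture-of-Vision-Experts | Mixpert: mitigates multimodal learning conflicts with an efficient mixture of vision experts | large language model, multimodal
3 | DisTime: Distribution-based Time Representation for Video Large Language Models | DisTime: a distribution-based time representation for video large language models | large language model, TAMP
4 | Reasoning Can Hurt the Inductive Abilities of Large Language Models | Finds that chain-of-thought reasoning can hurt the inductive abilities of large language models and proposes remedies | large language model, chain-of-thought
5 | Agent-X: Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks | Agent-X: a large-scale benchmark for evaluating the multimodal reasoning of vision-centric agents | multimodal
6 | Geospatial Foundation Models to Enable Progress on Sustainable Development Goals | Proposes the SustainFM benchmark framework to assess the potential of geospatial foundation models for the Sustainable Development Goals | foundation model
7 | Beyond Quantity: Distribution-Aware Labeling for Visual Grounding | Proposes the DAL framework, which improves visual grounding through distribution-aware pseudo-labeling | visual grounding
8 | From Hallucinations to Jailbreaks: Rethinking the Vulnerability of Large Foundation Models | A unified theoretical framework that reveals the intrinsic link between hallucinations and jailbreak attacks in large models | foundation model
9 | Seeing is Not Reasoning: MVPBench for Graph-based Evaluation of Multi-path Visual Physical CoT | Proposes MVPBench: a graph-based evaluation of the multi-step reasoning of multimodal large models in visual physical commonsense reasoning | large language model, multimodal, chain-of-thought
10 | The Butterfly Effect in Pathology: Exploring Security in Pathology Foundation Models | An adversarial attack study on pathology foundation models that reveals security risks in WSI analysis | foundation model
11 | CSVQA: A Chinese Multimodal Benchmark for Evaluating STEM Reasoning Capabilities of VLMs | Proposes CSVQA: a Chinese multimodal benchmark for evaluating the STEM reasoning of VLMs | multimodal
12 | Federated Foundation Model for GI Endoscopy Images | Proposes a federated-learning-based foundation model for GI endoscopy images, addressing model training under data privacy constraints | foundation model
13 | SiLVR: A Simple Language-based Video Reasoning Framework | Proposes the SiLVR framework, which uses language models to strengthen video understanding and reasoning without additional training | large language model, multimodal
14 | SORCE: Small Object Retrieval in Complex Environments | SORCE: proposes a new benchmark for text-based small object retrieval in complex environments and a multi-embedding representation method | large language model, multimodal
15 | Threading Keyframe with Narratives: MLLMs as Strong Long Video Comprehenders | Proposes Nar-KFC, which uses narrative keyframes to improve MLLM long-video comprehension | large language model, multimodal
16 | Geo-Sign: Hyperbolic Contrastive Regularisation for Geometrically Aware Sign Language Translation | Geo-Sign: uses hyperbolic contrastive regularisation to improve geometrically aware sign language translation | large language model
17 | ViStoryBench: Comprehensive Benchmark Suite for Story Visualization | ViStoryBench: a comprehensive benchmark for story visualization covering diverse narrative structures and styles | large language model
18 | Conformal Prediction for Zero-Shot Models | Proposes Conf-OT to improve the efficiency of conformal prediction for zero-shot models under domain shift | foundation model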

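🔬 Pillar 2: RL Algorithms & Architecture (14 papers)

# | Title | One-line summary | Tags | 🔗
19 | MoDoMoDo: Multi-Domain Data Mixtures for Multimodal LLM Reinforcement Learning | MoDoMoDo: multi-domain data mixtures for reinforcement learning of multimodal LLMs | reinforcement learning, large language model, multimodal
20 | Harnessing Foundation Models for Robust and Generalizable 6-DOF Bronchoscopy Localization | Proposes PANSv2 to address robustness and generalization in bronchoscopy localization | Mamba, depth estimation, foundation model
21 | Reinforcing Video Reasoning with Focused Thinking | Proposes the TW-GRPO framework, which improves video reasoning through focused thinking and fine-grained rewards | reinforcement learning, spatiotemporal, large language model
22 | VideoCAD: A Dataset and Model for Learning Long-Horizon 3D CAD UI Interactions from Video | Proposes the VideoCAD dataset and the VideoCADFormer model for learning long-horizon 3D CAD UI interactions | behavior cloning, large language model, multimodal
23 | ACM-UNet: Adaptive Integration of CNNs and Mamba for Efficient Medical Image Segmentation | ACM-UNet: adaptively integrates CNNs and Mamba for efficient medical image segmentation | Mamba, SSM, state space model
24 | LTM3D: Bridging Token Spaces for Conditional 3D Generation with Auto-Regressive Diffusion Framework | LTM3D: conditional 3D generation with an auto-regressive diffusion framework that bridges token spaces | masked autoencoder, 3D gaussian splatting, gaussian splatting
25 | A Mathematical Perspective On Contrastive Learning | Treats contrastive learning as optimization over probability distributions, offering a new perspective and new algorithms for cross-modal tasks | contrastive learning, multimodal
26 | Revisiting Cross-Modal Knowledge Distillation: A Disentanglement Approach for RGBD Semantic Segmentation | Proposes CroDiNo-KD, which performs cross-modal knowledge distillation for RGBD semantic segmentation through disentangled representation learning | contrastive learning, distillation
27 | Progressive Class-level Distillation | Proposes Progressive Class-level Distillation (PCD) to address the insufficient transfer of knowledge about low-probability classes in knowledge distillation | teacher-student, distillation
28 | A Cross Branch Fusion-Based Contrastive Learning Framework for Point Cloud Self-supervised Learning | Proposes PoCCA, a contrastive learning framework based on cross-branch fusion, for point cloud self-supervised learning | contrastive learning
29 | EgoVIS@CVPR: What Changed and What Could Have Changed? State-Change Counterfactuals for Procedure-Aware Video Representation Learning | EgoVIS proposes a procedure-aware video representation learning method based on state-change counterfactual reasoning | representation learning
30 | Reason-SVG: Hybrid Reward RL for Aha-Moments in Vector Graphics Generation | Reason-SVG: uses hybrid-reward reinforcement learning to improve LLM reasoning in vector graphics generation | reinforcement learning, large language model
31 | STORK: Faster Diffusion And Flow Matching Sampling By Resolving Both Stiffness And Structure-Dependence | STORK: accelerates sampling for diffusion and flow matching models by resolving stiffness and structure-dependence | flow matching
32 | State Estimation and Control of Dynamic Systems from High-Dimensional Image Data | Proposes a CNN-GRU neural network architecture for state estimation and control of dynamic systems from high-dimensional image data | reinforcement learning, policy learning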

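🔬 Pillar 3: Spatial Perception & Semantics (8 papers)

# | Title | One-line summary | Tags | 🔗
33 | Tackling View-Dependent Semantics in 3D Language Gaussian Splatting | LaGa: addresses semantic understanding in 3D language Gaussian splatting by modeling view-dependent semantics | 3D gaussian splatting, gaussian splatting, splatting
34 | InteractAnything: Zero-shot Human Object Interaction Synthesis via LLM Feedback and Object Affordance Parsing | InteractAnything: zero-shot human-object interaction synthesis via LLM feedback and object affordance parsing | affordance, human-object interaction, HOI
35 | Weakly-Supervised Affordance Grounding Guided by Part-Level Semantic Priors | Proposes a weakly supervised affordance grounding method based on part-level semantic priors, substantially improving performance | affordance, human-object interaction, egocentric
36 | un$^2$CLIP: Improving CLIP's Visual Detail Capturing Ability via Inverting unCLIP | un$^2$CLIP: improves CLIP's ability to capture visual detail by inverting unCLIP | open-vocabulary, open vocabulary, large language model
37 | 3D Gaussian Splat Vulnerabilities | Exposes vulnerabilities in 3D Gaussian splatting: proposes the CLOAK and DAGGER attacks, which threaten safety-critical applications | 3D gaussian splatting, 3DGS, gaussian splatting
38 | Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision Geometry Priors | Proposes VG LLM, which uses 3D geometry priors from video to strengthen MLLM 3D scene understanding | scene understanding, large language model, multimodal
39 | 6D Pose Estimation on Point Cloud Data through Prior Knowledge Integration: A Case Study in Autonomous Disassembly | Proposes a point cloud 6D pose estimation method that integrates prior knowledge, applied to automated bolt disassembly | 6D pose estimation
40 | AdaHuman: Animatable Detailed 3D Human Generation with Compositional Multiview Diffusion | AdaHuman: detailed, animatable 3D human generation based on compositional multiview diffusion | 3DGS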

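🔬 Pillar 6: Video Extraction & Matching (6 papers)

# | Title | One-line summary | Tags | 🔗
41 | Out of Sight, Not Out of Context? Egocentric Spatial Reasoning in VLMs Across Disjoint Frames | Proposes the Disjoint-3DQA benchmark to evaluate the egocentric spatial reasoning of VLMs across disjoint views | egocentric, embodied AI
42 | Leadership Assessment in Pediatric Intensive Care Unit Team Training | Proposes an automated framework for assessing PICU team leadership from egocentric video | egocentric, egocentric vision, multimodal
43 | Learning reusable concepts across different egocentric video understanding tasks | Proposes the Hier-EgoPack framework for learning reusable concepts across different egocentric video understanding tasks | egocentric
44 | PCIE_Interaction Solution for Ego4D Social Interaction Challenge | The PCIE_Interaction solution addresses the LAM and TTM tasks of the Ego4D Social Interaction Challenge | Ego4D
45 | Reading Recognition in the Wild | Proposes the Reading in the Wild dataset and uses a Transformer model for reading recognition on smart glasses | egocentric, multimodal
46 | PCIE_Pose Solution for EgoExo4D Pose and Proficiency Estimation Challenge | Proposes the HP-ViT+ model for hand and body pose estimation and proficiency estimation in the EgoExo4D challenge | egocentric, multimodal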

🔬 Pillar 1: Robot Control (4 papers)

# | Title | One-line summary | Tags | 🔗
47 | Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces | Proposes the Visual Embodied Brain framework, which enables multimodal large language models to perceive, reason, and control in embodied tasks | legged robot, large language model, multimodal
48 | Towards a Generalizable Bimanual Foundation Policy via Flow-based Video Prediction | Proposes a general bimanual robot policy based on optical-flow video prediction, improving generalization | manipulation, bi-manual, dual-arm
49 | S4-Driver: Scalable Self-Supervised Driving Multimodal Large Language Model with Spatio-Temporal Visual Representation | S4-Driver: a scalable self-supervised driving multimodal large language model based on spatio-temporal visual representation | motion planning, large language model, multimodal
50 | Benchmarking Foundation Models for Zero-Shot Biometric Tasks | Benchmarks foundation models on zero-shot biometric tasks | manipulation, large language model, foundation model

🔬 Pillar 7: Motion Retargeting (2 papers)

# | Title | One-line summary | Tags | 🔗
51 | MIRAGE: Assessing Hallucination in Multimodal Reasoning Chains of MLLM | Proposes the MIRAGE benchmark to address hallucination in multimodal large language models | spatial relationship, large language model, multimodal
52 | Draw ALL Your Imagine: A Holistic Benchmark and Agent Framework for Complex Instruction-based Image Generation | Proposes the LongBench-T2I benchmark and the Plan2Gen framework to evaluate and improve image generation under complex instructions | spatial relationship, large language model

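🔬 Pillar 8: Physics-based Animation (2 papers)

# | Title | One-line summary | Tags | 🔗
53 | S3CE-Net: Spike-guided Spatiotemporal Semantic Coupling and Expansion Network for Long Sequence Event Re-Identification | Proposes S3CE-Net, which uses spiking neural networks for long-sequence event-camera person re-identification | spatiotemporal
54 | Spatiotemporal Analysis of Forest Machine Operations Using 3D Video Classification | Proposes a spatiotemporal analysis method for forest machine operations based on 3D video classification | spatiotemporal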

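🔬 Pillar 4: Generative Motion (2 papers)

# | Title | One-line summary | Tags | 🔗
55 | Ctrl-Crash: Controllable Diffusion for Realistic Car Crashes | Ctrl-Crash: a controllable diffusion model that generates realistic car crash videos to support traffic safety research | classifier-free guidance
56 | MiniMax-Remover: Taming Bad Noise Helps Video Object Removal | MiniMax-Remover: improves video object removal by taming bad noise | classifier-free guidance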
