cs.CV (2025-04-21)

📊 29 papers in total | 🔗 8 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (11 🔗4) · Pillar 2: RL & Architecture (6 🔗1) · Pillar 3: Perception & Semantics (5 🔗1) · Pillar 1: Robot Control (4 🔗2) · Pillar 8: Physics-based Animation (2) · Pillar 7: Motion Retargeting (1)

🔬 Pillar 9: Embodied Foundation Models (11 papers)

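| # | Title | Key point | Tags |
|---|---|---|---|
| 1 | IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs | Proposes the IV-Bench benchmark to evaluate image-grounded video perception and reasoning in multimodal LLMs | large language model, multimodal |
| 2 | Event2Vec: Processing Neuromorphic Events directly by Representations in Vector Space | Proposes Event2Vec, which processes neuromorphic event data directly through vector-space representations | large language model, multimodal |
| 3 | LongPerceptualThoughts: Distilling System-2 Reasoning for System-1 Perception | Proposes the LongPerceptualThoughts dataset to improve System-2-style reasoning in visual perception tasks | chain-of-thought |
| 4 | Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models | Eagle 2.5: improves frontier vision-language model performance through long-context post-training | multimodal |
| 5 | Zero-Shot, But at What Cost? Unveiling the Hidden Overhead of MILS's LLM-CLIP Framework for Image Captioning | Reveals the hidden cost of the MILS image captioning framework: zero-shot performance at high computational overhead | multimodal |
| 6 | Cognitive-Inspired Hierarchical Attention Fusion With Visual and Textual for Cross-Domain Sequential Recommendation | Proposes the HAF-VT model, which fuses visual and textual information to model user interests in cross-domain sequential recommendation | multimodal |
| 7 | ScanEdit: Hierarchically-Guided Functional 3D Scan Editing | ScanEdit: proposes a hierarchically guided method for functional 3D scan editing, enabling instruction-driven scene editing | large language model |
| 8 | Insert Anything: Image Insertion via In-Context Editing in DiT | Proposes the Insert Anything framework for seamless insertion of reference images via in-context editing in DiT | multimodal |
| 9 | Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation | Proposes the FG-BMK benchmark for a comprehensive evaluation of large vision-language models on fine-grained image tasks | multimodal |
| 10 | DyFo: A Training-Free Dynamic Focus Visual Search for Enhancing LMMs in Fine-Grained Visual Understanding | DyFo: a training-free dynamic focus visual search that improves the fine-grained visual understanding of LMMs | multimodal |
| 11 | Object-Level Verbalized Confidence Calibration in Vision-Language Models via Semantic Perturbation | Proposes a confidence calibration framework based on semantic perturbation to make object-level confidence in vision-language models more reliable | multimodal |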

🔬 Pillar 2: RL & Architecture (6 papers)

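| # | Title | Key point | Tags |
|---|---|---|---|
| 12 | StyleMe3D: Stylization with Disentangled Priors by Multiple Encoders on 3D Gaussians | StyleMe3D: performs style transfer on 3D Gaussian models using disentangled priors from multiple encoders | distillation, 3D gaussian splatting, 3DGS |
| 13 | VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models | VisuLogic: a new benchmark for evaluating visual reasoning in multimodal large language models | reinforcement learning, large language model, multimodal |
| 14 | MonoTher-Depth: Enhancing Thermal Depth Estimation via Confidence-Aware Distillation | Proposes a confidence-aware knowledge distillation method to improve the accuracy of monocular depth estimation from thermal images | distillation, depth estimation, monocular depth |
| 15 | DSPO: Direct Semantic Preference Optimization for Real-World Image Super-Resolution | Proposes DSPO, which aligns with human feedback through semantic preference optimization to improve real-world image super-resolution | DPO, direct preference optimization, large language model |
| 16 | CAPTURe: Evaluating Spatial Reasoning in Vision Language Models via Occluded Object Counting | Proposes the CAPTURe benchmark to evaluate the spatial reasoning of vision-language models in occluded scenes | world model, spatial relationship |
| 17 | Hybrid Knowledge Transfer through Attention and Logit Distillation for On-Device Vision Systems in Agricultural IoT | Proposes a hybrid knowledge distillation framework for optimizing on-device vision systems in agricultural IoT | distillation |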

🔬 Pillar 3: Perception & Semantics (5 papers)

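| # | Title | Key point | Tags |
|---|---|---|---|
| 18 | MoBGS: Motion Deblurring Dynamic 3D Gaussian Splatting for Blurry Monocular Video | Proposes MoBGS for dynamic 3D Gaussian splatting deblurring and novel view synthesis from motion-blurred monocular video | 3D gaussian splatting, 3DGS, gaussian splatting |
| 19 | Uni3C: Unifying Precisely 3D-Enhanced Camera and Human Motion Controls for Video Generation | Uni3C: unifies 3D-enhanced camera and human motion control for video generation | monocular depth, SMPL, SMPL-X |
| 20 | Multimodal Large Language Models for Enhanced Traffic Safety: A Comprehensive Review and Future Trends | Reviews multimodal large language models as a means of improving traffic safety | scene understanding, large language model, multimodal |
| 21 | VistaDepth: Improving far-range Depth Estimation with Spectral Modulation and Adaptive Reweighting | VistaDepth: improves far-range monocular depth estimation through spectral modulation and adaptive reweighting | depth estimation, monocular depth |
| 22 | PIV-FlowDiffuser: Transfer-learning-based denoising diffusion models for PIV | Proposes PIV-FlowDiffuser, a transfer-learning-based denoising diffusion model that improves the accuracy and generalization of PIV analysis | optical flow |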

🔬 Pillar 1: Robot Control (4 papers)

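| # | Title | Key point | Tags |
|---|---|---|---|
| 23 | Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs | All-Angles Bench: evaluates the multi-view understanding of multimodal large language models | manipulation, geometric consistency, large language model |
| 24 | DyST-XL: Dynamic Layout Planning and Content Control for Compositional Text-to-Video Generation | DyST-XL: proposes a training-free framework that improves text-to-video generation through dynamic layout planning and content control | trajectory optimization, large language model |
| 25 | DRAWER: Digital Reconstruction and Articulation With Environment Realism | DRAWER: video-based digital reconstruction of indoor scenes and generation of interactive environments | sim-to-real |
| 26 | Spectral Dictionary Learning for Generative Image Modeling | Proposes an image generation model based on spectral dictionary learning, enabling controllable image synthesis | manipulation |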

🔬 Pillar 8: Physics-based Animation (2 papers)

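| # | Title | Key point | Tags |
|---|---|---|---|
| 27 | RealisDance-DiT: Simple yet Strong Baseline towards Controllable Character Animation in the Wild | RealisDance-DiT: a simple yet strong DiT-based baseline for controllable character animation | character animation, foundation model |
| 28 | An LMM for Efficient Video Understanding via Reinforced Compression of Video Cubes | Quicksviewer: an LMM for efficient video understanding that uses reinforced compression of video cubes | spatiotemporal, multimodal |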

🔬 Pillar 7: Motion Retargeting (1 paper)

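| # | Title | Key point | Tags |
|---|---|---|---|
| 29 | ICGM-FRAX: Iterative Cross Graph Matching for Hip Fracture Risk Assessment using Dual-energy X-ray Absorptiometry Images | Proposes ICGM-FRAX for hip fracture risk assessment using dual-energy X-ray absorptiometry images | spatial relationship |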
