cs.CV (2024-09-25)

📊 27 papers in total | 🔗 4 with code

🎯 Navigation by Interest Area

Pillar 9: Embodied Foundation Models (14 🔗4) · Pillar 3: Perception & Semantics (8) · Pillar 2: RL & Architecture (3) · Pillar 7: Motion Retargeting (1) · Pillar 5: Interaction & Reaction (1)

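🔬 Pillar 9: Embodied Foundation Models (14 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
1	EAGLE: Towards Efficient Arbitrary Referring Visual Prompts Comprehension for Multimodal Large Language Models	EAGLE: efficient comprehension of arbitrary referring visual prompts for multimodal large language models	large language model multimodal instruction following
2	Unveiling Ontological Commitment in Multi-Modal Foundation Models	Proposes a method for extracting concept hierarchies from multimodal models, for use in validating and calibrating the models	foundation model multimodal
3	First Place Solution to the ECCV 2024 BRAVO Challenge: Evaluating Robustness of Vision Foundation Models for Semantic Segmentation	Uses the DINOv2 vision foundation model with a simple segmentation decoder to improve the robustness of semantic segmentation	foundation model	✅
4	Block Expanded DINORET: Adapting Natural Domain Foundation Models for Retinal Imaging Without Catastrophic Forgetting	Proposes Block Expanded DINORET to address catastrophic forgetting when transferring natural-domain pretrained models to retinal imaging	foundation model
5	Targeted Neural Architectures in Multi-Objective Frameworks for Complete Glioma Characterization from Multimodal MRI	Targeted neural architectures for multimodal MRI, within a multi-objective framework for complete glioma characterization	multimodal
6	ControlCity: A Multimodal Diffusion Model Based Approach for Accurate Geospatial Data Generation and Urban Morphology Analysis	ControlCity: generates accurate geospatial data and analyzes urban morphology with a multimodal diffusion model	multimodal
7	Robust Scene Change Detection Using Visual Foundation Models and Cross-Attention Mechanisms	Proposes a robust scene change detection method based on DINOv2 and cross-attention	foundation model	✅
8	MaViLS, a Benchmark Dataset for Video-to-Slide Alignment, Assessing Baseline Accuracy with a Multimodal Alignment Algorithm Leveraging Speech, OCR, and Visual Features	MaViLS: a benchmark dataset and multimodal alignment algorithm for video-to-slide alignment	multimodal
9	Pix2Next: Leveraging Vision Foundation Models for RGB to NIR Image Translation	Pix2Next: uses vision foundation models to translate RGB images to near-infrared (NIR) images	foundation model
10	Underwater Camouflaged Object Tracking Meets Vision-Language SAM2	Introduces UW-COT220, the first large-scale multimodal dataset for underwater camouflaged object tracking, and proposes VL-SAM2, a SAM2-based vision-language tracking framework	foundation model multimodal	✅
11	ChatCam: Empowering Camera Control through Conversational AI	ChatCam: empowers camera control through conversational AI, mimicking a professional cinematographer's workflow	large language model
12	Navigating the Nuances: A Fine-grained Evaluation of Vision-Language Navigation	Proposes a fine-grained evaluation framework for vision-language navigation based on a context-free grammar	VLN
13	Attention Prompting on Image for Large Vision-Language Models	Proposes attention prompting on images to improve how well large vision-language models follow text instructions	large language model
14	DALDA: Data Augmentation Leveraging Diffusion Model and LLM with Adaptive Guidance Scaling	DALDA: data augmentation with a diffusion model and an LLM, using adaptive guidance scaling to improve few-shot learning performance	large language model	✅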

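🔬 Pillar 3: Perception & Semantics (8 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
15	SeaSplat: Representing Underwater Scenes with 3D Gaussian Splatting and a Physically Grounded Image Formation Model	SeaSplat: real-time rendering of underwater scenes using 3D Gaussian splatting and a physical image formation model	3D gaussian splatting 3DGS gaussian splatting
16	Optical Lens Attack on Deep Learning Based Monocular Depth Estimation	Proposes LensAttack, a physical attack that uses optical lenses to disrupt monocular depth estimation	depth estimation monocular depth
17	Generative Object Insertion in Gaussian Splatting with a Multi-View Diffusion Model	Proposes an object insertion method for Gaussian splatting based on a multi-view diffusion model, achieving high-quality 3D scene reconstruction	gaussian splatting splatting
18	3DDX: Bone Surface Reconstruction from a Single Standard-Geometry Radiograph via Dual-Face Depth Estimation	Proposes 3DDX, which reconstructs bone surfaces from a single standard radiograph via dual-face depth estimation	depth estimation penetration
19	Disco4D: Disentangled 4D Human Generation and Animation from a Single Image	Disco4D: a disentangled 4D human generation and animation framework that produces realistic dynamic humans from a single image	gaussian splatting splatting SMPL
20	Parameter-efficient Bayesian Neural Networks for Uncertainty-aware Depth Estimation	Proposes parameter-efficient Bayesian neural networks for uncertainty-aware depth estimation	depth estimation monocular depth
21	EventHDR: from Event to High-Speed HDR Videos and Beyond	EventHDR: proposes an event-camera-based method for reconstructing high-speed HDR video and builds a real-world dataset	depth estimation monocular depth optical flow
22	Pose-Guided Fine-Grained Sign Language Video Generation	Proposes a pose-guided motion model for generating fine-grained, temporally consistent sign language videos	optical flow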

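🔬 Pillar 2: RL & Architecture (3 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
23	Towards General Text-guided Image Synthesis for Customized Multimodal Brain MRI Generation	Proposes TUMSyn, a text-guided general-purpose brain MRI synthesis model for customized multimodal MRI generation	contrastive learning multimodal
24	PASS: Path-selective State Space Model for Event-based Recognition	Proposes the PASS framework, which uses a path-selective state space model to improve frequency generalization in event-camera recognition	SSM state space model spatiotemporal
25	Adverse Weather Optical Flow: Cumulative Homogeneous-Heterogeneous Adaptation	Proposes a cumulative homogeneous-heterogeneous adaptation framework for optical flow estimation in adverse weather	distillation optical flow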

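🔬 Pillar 7: Motion Retargeting (1 paper)

#	Title	One-sentence summary	Tags	🔗	⭐
26	Spotlight Text Detector: Spotlight on Candidate Regions Like a Camera	Proposes the Spotlight Text Detector (STD) to handle irregular shapes and dense overlapping text in scene text detection	spatial relationship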

🔬 Pillar 5: Interaction & Reaction (1 paper)

#	Title	One-sentence summary	Tags	🔗	⭐
27	A Versatile and Differentiable Hand-Object Interaction Representation	Proposes CHOIR, a versatile and differentiable hand-object interaction representation for accurate HOI synthesis	HOI
