cs.CV (2024-07-23)

📊 22 papers in total | 🔗 7 with code

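🎯 Interest Area Navigation

Pillar 3: Spatial Perception and Semantics (Perception & Semantics) (7 🔗2) Pillar 9: Embodied Large Models (Embodied Foundation Models) (7 🔗3) Pillar 2: RL Algorithms and Architecture (RL & Architecture) (4 🔗1) Pillar 6: Video Extraction and Matching (Video Extraction) (2 🔗1) Pillar 1: Robot Control (1) Pillar 7: Motion Retargeting (1)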

🔬 Pillar 3: Spatial Perception and Semantics (Perception & Semantics) (7 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
1	HDRSplat: Gaussian Splatting for High Dynamic Range 3D Scene Reconstruction from Raw Images	HDRSplat: 3D Gaussian splatting scene reconstruction from high dynamic range raw images.	3D gaussian splatting 3DGS gaussian splatting
2	MicroEmo: Time-Sensitive Multimodal Emotion Recognition with Micro-Expression Dynamics in Video Dialogues	MicroEmo: a time-sensitive multimodal emotion recognition model targeting micro-expression dynamics in video dialogues.	open-vocabulary open vocabulary large language model
3	DHGS: Decoupled Hybrid Gaussian Splatting for Driving Scene	Proposes Decoupled Hybrid Gaussian Splatting (DHGS) to improve novel view synthesis quality for driving scenes.	gaussian splatting splatting	✅
4	SAM-CP: Marrying SAM with Composable Prompts for Versatile Segmentation	SAM-CP: combines SAM with composable prompts for versatile segmentation.	open-vocabulary open vocabulary foundation model
5	ToDER: Towards Colonoscopy Depth Estimation and Reconstruction with Geometry Constraint Adaptation	Proposes ToDER, which performs colonoscopy depth estimation and reconstruction through geometry constraint adaptation.	depth estimation
6	SINDER: Repairing the Singular Defects of DINOv2	SINDER repairs the singular defects of DINOv2 through smooth regularization, improving downstream task performance.	depth estimation	✅
7	VRP-UDF: Towards Unbiased Learning of Unsigned Distance Functions from Multi-view Images with Volume Rendering Priors	Proposes VRP-UDF, which uses volume rendering priors to address bias in learning unsigned distance functions from multi-view images.	implicit representation

🔬 Pillar 9: Embodied Large Models (Embodied Foundation Models) (7 papers)

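#	Title	One-Sentence Summary	Tags	🔗	⭐
8	MLLM-CompBench: A Comparative Reasoning Benchmark for Multimodal LLMs	MLLM-CompBench: a benchmark for evaluating the comparative reasoning ability of multimodal large language models.	large language model multimodal
9	PartGLEE: A Foundation Model for Recognizing and Parsing Any Objects	PartGLEE: a part-level foundation model for recognizing and parsing the parts of any object.	foundation model	✅
10	Histopathology image embedding based on foundation models features aggregation for patient treatment response prediction	Proposes a histopathology image embedding method based on foundation model feature aggregation, for predicting treatment response in patients with diffuse large B-cell lymphoma.	foundation model
11	C3T: Cross-modal Transfer Through Time for Sensor-based Human Activity Recognition	C3T: cross-modal transfer through time improves sensor-based human activity recognition under unsupervised modality adaptation.	multimodal
12	Unveiling and Mitigating Bias in Audio Visual Segmentation	Proposes a perception module and a contrastive learning strategy targeting audio priming bias and visual prior bias in audio-visual segmentation.	visual grounding
13	Category-Extensible Out-of-Distribution Detection via Hierarchical Context Descriptions	Proposes CATEX, which enables category-extensible OOD detection through hierarchical context descriptions.	large language model	✅
14	Harmonizing Visual Text Comprehension and Generation	TextHarmony: proposes Slide-LoRA to unify visual text comprehension and generation tasks.	multimodal	✅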

🔬 Pillar 2: RL Algorithms and Architecture (RL & Architecture) (4 papers)

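#	Title	One-Sentence Summary	Tags	🔗	⭐
15	Diffusion Models for Monocular Depth Estimation: Overcoming Challenging Conditions	Proposes a diffusion-model-based monocular depth estimation method that improves robustness in challenging scenes.	distillation depth estimation monocular depth
16	MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequence	MovieDreamer: proposes a hierarchical generation framework for coherent long visual sequences, enabling movie-level video generation.	dreamer multimodal	✅
17	Accelerating Learned Video Compression via Low-Resolution Representation Learning	Proposes an accelerated learned video compression framework based on low-resolution representation learning, significantly increasing encoding and decoding speed.	representation learning
18	A Multi-view Mask Contrastive Learning Graph Convolutional Neural Network for Age Estimation	Proposes a multi-view mask contrastive learning graph convolutional network for facial age estimation.	contrastive learning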

🔬 Pillar 6: Video Extraction and Matching (Video Extraction) (2 papers)

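#	Title	One-Sentence Summary	Tags	🔗	⭐
19	EgoCVR: An Egocentric Benchmark for Fine-Grained Composed Video Retrieval	EgoCVR: an egocentric benchmark dataset for fine-grained composed video retrieval.	egocentric	✅
20	Motion Capture from Inertial and Vision Sensors	Proposes the MINIONS dataset and the SparseNet framework for low-cost human motion capture based on inertial and vision sensors.	SMPL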

🔬 Pillar 1: Robot Control (1 paper)

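#	Title	One-Sentence Summary	Tags	🔗	⭐
21	Coarse-to-Fine Proposal Refinement Framework for Audio Temporal Forgery Detection and Localization	Proposes a coarse-to-fine framework for audio temporal forgery detection and localization, addressing the inability of existing methods to localize tampered segments.	manipulation representation learning TAMP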

🔬 Pillar 7: Motion Retargeting (1 paper)

#	Title	One-Sentence Summary	Tags	🔗	⭐
22	VisMin: Visual Minimal-Change Understanding	Proposes the VisMin benchmark for evaluating the fine-grained visual understanding of vision-language models.	spatial relationship large language model
