cs.CV (2024-05-30)

📊 46 papers in total | 🔗 12 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (13 🔗3) · Pillar 3: Perception & Semantics (12 🔗4) · Pillar 2: RL & Architecture (9 🔗5) · Pillar 1: Robot Control (5) · Pillar 6: Video Extraction (5) · Pillar 4: Generative Motion (2)

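🔬 Pillar 9: Embodied Foundation Models (13 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
1	Transfer Attack for Bad and Good: Explain and Boost Adversarial Transferability across Multimodal Large Language Models	Proposes an adversarial transfer attack method to improve the robustness of multimodal large language models	large language model multimodal
2	Temporal Grounding of Activities using Multimodal Large Language Models	Proposes a temporal activity grounding method based on multimodal large language models that outperforms existing video LLMs	large language model multimodal
3	Visual Perception by Large Language Model's Weights	Proposes VLoRA to address the computational efficiency of multimodal large language models	large language model multimodal
4	LLMGeo: Benchmarking Large Language Models on Image Geolocation In-the-wild	LLMGeo: evaluates large language models on image geolocation in complex real-world scenes	large language model multimodal
5	Instruction-Guided Visual Masking	Proposes Instruction-Guided Visual Masking (IVM) to improve multimodal models' understanding of and alignment with complex instructions	multimodal instruction following visual grounding	✅
6	A Multimodal Dangerous State Recognition and Early Warning System for Elderly with Intermittent Dementia	Proposes a multimodal dangerous-state recognition and early-warning system for elderly people with dementia, addressing the problem of wandering and getting lost	multimodal
7	FMARS: Annotating Remote Sensing Images for Disaster Management using Foundation Models	FMARS: uses foundation models to annotate remote sensing imagery in support of disaster management	foundation model	✅
8	Learning Robust Correlation with Foundation Model for Weakly-Supervised Few-Shot Segmentation	Proposes CORENet, which uses foundation models to learn robust correlations for weakly supervised few-shot segmentation	foundation model
9	Uncovering Bias in Large Vision-Language Models at Scale with Counterfactuals	Uses counterfactuals to uncover bias in large vision-language models at scale	large language model multimodal
10	AutoBreach: Universal and Adaptive Jailbreaking with Efficient Wordplay-Guided Optimization	AutoBreach: universal, adaptive jailbreaking of large language models via efficient wordplay-guided optimization	large language model chain-of-thought
11	VAAD: Visual Attention Analysis Dashboard applied to e-Learning	VAAD: a visual attention analysis dashboard for e-learning that improves insight into learning behavior	multimodal
12	LLM as a Complementary Optimizer to Gradient Descent: A Case Study in Prompt Tuning	Proposes an optimization framework in which an LLM complements gradient descent, improving prompt tuning	large language model	✅
13	Enhancing Large Vision Language Models with Self-Training on Image Comprehension	Proposes STIC, which enhances large vision-language models through self-training on image comprehension and reduces reliance on labeled data	large language model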

🔬 Pillar 3: Perception & Semantics (12 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
14	GaussianRoom: Improving 3D Gaussian Splatting with SDF Guidance and Monocular Cues for Indoor Scene Reconstruction	GaussianRoom: combines SDF guidance and monocular cues to improve 3D Gaussian Splatting for indoor scene reconstruction	3D gaussian splatting 3DGS gaussian splatting
15	$\textit{S}^3$Gaussian: Self-Supervised Street Gaussians for Autonomous Driving	Proposes a self-supervised street Gaussians method that decomposes the dynamic and static elements of autonomous driving scenes without 3D annotations	3D gaussian splatting 3DGS gaussian splatting	✅
16	OpenDAS: Open-Vocabulary Domain Adaptation for 2D and 3D Segmentation	Proposes OpenDAS, which improves 2D/3D segmentation through open-vocabulary domain adaptation	open-vocabulary open vocabulary	✅
17	RTGen: Generating Region-Text Pairs for Open-Vocabulary Object Detection	RTGen: generates region-text pairs to improve open-vocabulary object detection	open-vocabulary open vocabulary
18	EMAG: Ego-motion Aware and Generalizable 2D Hand Forecasting from Egocentric Videos	Proposes EMAG to address viewpoint dependence and limited generalization in hand forecasting from egocentric videos	optical flow egocentric Ego4D	✅
19	IReNe: Instant Recoloring of Neural Radiance Fields	IReNe: instant recoloring of neural radiance fields, improving editing efficiency and realism	NeRF neural radiance field scene reconstruction
20	Uncertainty-guided Optimal Transport in Depth Supervised Sparse-View 3D Gaussian	Proposes UGOT, which uses uncertainty-guided optimal transport for sparse-view 3D Gaussian reconstruction	depth estimation monocular depth 3D gaussian splatting
21	A Pixel Is Worth More Than One 3D Gaussians in Single-View 3D Reconstruction	Proposes a hierarchical Splatter Image method that uses multiple Gaussians per pixel to better model occluded regions in single-view 3D reconstruction	3D gaussian splatting 3DGS gaussian splatting
22	View-Consistent Hierarchical 3D Segmentation Using Ultrametric Feature Fields	Proposes a view-consistent hierarchical 3D segmentation method based on ultrametric feature fields, addressing inconsistency across views	NeRF neural radiance field foundation model	✅
23	TetSphere Splatting: Representing High-Quality Geometry with Lagrangian Volumetric Meshes	Proposes TetSphere Splatting, which uses tetrahedral meshes for high-quality 3D shape modeling	splatting
24	Gated Fields: Learning Scene Reconstruction from Gated Videos	Proposes Gated Fields, which uses active gated video sequences for accurate 3D reconstruction of outdoor scenes	scene reconstruction
25	CLAY: A Controllable Large-scale Generative Model for Creating High-quality 3D Assets	CLAY: a controllable large-scale generative model for creating high-quality 3D assets	implicit representation

🔬 Pillar 2: RL & Architecture (9 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
26	NoiseBoost: Alleviating Hallucination with Noise Perturbation for Multimodal Large Language Models	Proposes NoiseBoost to alleviate hallucination in multimodal large language models	reinforcement learning large language model multimodal	✅
27	PLA4D: Pixel-Level Alignments for Text-to-4D Gaussian Splatting	Proposes PLA4D to resolve conflicts between motion and geometry in text-driven 4D rendering	contrastive learning distillation gaussian splatting
28	Multimodal Cross-Domain Few-Shot Learning for Egocentric Action Recognition	Proposes MM-CDFSL, which tackles cross-domain few-shot learning for egocentric action recognition through multimodal distillation and masked inference	distillation egocentric multimodal	✅
29	EgoSurgery-Phase: A Dataset of Surgical Phase Recognition from Egocentric Open Surgery Videos	EgoSurgery-Phase: releases the first egocentric (head-mounted camera) video dataset for phase recognition in open surgery and proposes a gaze-guided masked autoencoder	masked autoencoder MAE egocentric	✅
30	Multi-Label Guided Soft Contrastive Learning for Efficient Earth Observation Pretraining	Proposes multi-label guided soft contrastive learning for efficient pretraining of Earth observation models	contrastive learning foundation model	✅
31	Boost Your Human Image Generation Model via Direct Preference Optimization	Proposes HG-DPO to improve the realism of human image generation models	DPO direct preference optimization curriculum learning
32	MotionDreamer: Exploring Semantic Video Diffusion features for Zero-Shot 3D Mesh Animation	MotionDreamer: uses semantic features from video diffusion models for zero-shot 3D mesh animation	dreamer
33	Estimating Human Poses Across Datasets: A Unified Skeleton and Multi-Teacher Distillation Approach	Proposes a unified skeleton and multi-teacher distillation approach to improve cross-dataset generalization in human pose estimation	distillation
34	DeMamba: AI-Generated Video Detection on Million-Scale GenVideo Benchmark	Proposes the DeMamba module and the GenVideo benchmark to improve the generalization and robustness of AI-generated video detection	Mamba	✅

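🔬 Pillar 1: Robot Control (5 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
35	SAM-E: Leveraging Visual Foundation Model with Sequence Imitation for Embodied Manipulation	SAM-E: leverages a visual foundation model and sequence imitation for embodied manipulation	manipulation scene understanding foundation model
36	May the Dance be with You: Dance Generation Framework for Non-Humanoids	Proposes a dance generation framework for non-humanoid agents that learns dance motions by associating visual rhythm with music	humanoid reinforcement learning contrastive learning
37	Learning 3D Robotics Perception using Inductive Priors	Uses inductive priors to learn 3D robotics perception, improving generalization and reducing data dependence	sim2real scene understanding semantic map
38	HINT: Learning Complete Human Neural Representations from Limited Viewpoints	HINT: a NeRF-based method for learning human neural representations that models complete humans from limited viewpoints	humanoid NeRF
39	ParSEL: Parameterized Shape Editing with Language	ParSEL: a language-based parameterized shape editing method that enables controllable editing of 3D assets	manipulation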

🔬 Pillar 6: Video Extraction (5 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
40	MotionLLM: Understanding Human Behaviors from Human Motions and Videos	Proposes MotionLLM for multimodal human behavior understanding	SMPL human motion large language model
41	SMPLX-Lite: A Realistic and Drivable Avatar Benchmark with Rich Geometry and Texture Annotations	Proposes the SMPLX-Lite dataset and parametric model for driving realistic, controllable full-body avatars	SMPL-X human motion
42	Video Question Answering for People with Visual Impairments Using an Egocentric 360-Degree Camera	Proposes a visual question answering dataset based on egocentric 360-degree video to assist people with visual impairments	egocentric
43	OmniHands: Towards Robust 4D Hand Mesh Recovery via A Versatile Transformer	OmniHands: robust 4D hand mesh recovery via a versatile Transformer	hand reconstruction
44	Can't make an Omelette without Breaking some Eggs: Plausible Action Anticipation using Large Video-Language Models	Proposes PlausiVL, which uses large video-language models to anticipate plausible action sequences	Ego4D

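🔬 Pillar 4: Generative Motion (2 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
45	RapVerse: Coherent Vocals and Whole-Body Motions Generations from Text	RapVerse: a unified framework for generating coherent vocals and whole-body motions from text	motion generation human motion multimodal
46	Stratified Avatar Generation from Sparse Observations	Proposes a stratified generation approach that reconstructs full-body avatars from sparse observations	VQ-VAE SMPL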
