cs.CV (2024-11-21)

📊 28 papers in total | 🔗 8 with code

🎯 Browse by interest area

Pillar 9: Embodied Foundation Models (13 🔗3) | Pillar 3: Perception & Semantics (7 🔗2) | Pillar 2: RL & Architecture (5 🔗2) | Pillar 8: Physics-based Animation (1) | Pillar 6: Video Extraction (1 🔗1) | Pillar 7: Motion Retargeting (1)

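🔬 Pillar 9: Embodied Foundation Models (13 papers)

#	Title	One-line summary	Tags	🔗	⭐
1	Solving Zero-Shot 3D Visual Grounding as Constraint Satisfaction Problems	Proposes a zero-shot 3D visual grounding method formulated as a constraint satisfaction problem, improving understanding of complex scenes.	large language model, visual grounding	✅
2	Panther: Illuminate the Sight of Multimodal LLMs with Instruction-Guided Visual Prompts	Panther: uses instruction-guided visual prompts to strengthen the visual perception of multimodal LLMs.	large language model, multimodal
3	A Multimodal Approach to The Detection and Classification of Skin Diseases	Proposes a multimodal method for detecting and classifying skin diseases that combines image and text information to improve diagnostic accuracy.	large language model, multimodal
4	GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI	Proposes GMAI-VL, a general medical vision-language model built on a large-scale multimodal medical dataset.	multimodal
5	Multimodal 3D Brain Tumor Segmentation with Adversarial Training and Conditional Random Field	Proposes a multimodal 3D brain tumor segmentation method based on adversarial training and a conditional random field.	multimodal
6	Multimodal Autoregressive Pre-training of Large Vision Encoders	Proposes AIMV2, a large-scale vision encoder pre-trained with a multimodal autoregressive objective, which significantly improves downstream task performance.	multimodal
7	Looking Beyond Text: Reducing Language bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance	Proposes the LACING framework, which reduces language bias in large vision-language models through multimodal dual-attention and soft-image guidance.	multimodal
8	SMoLoRA: Exploring and Defying Dual Catastrophic Forgetting in Continual Visual Instruction Tuning	SMoLoRA: explores and addresses dual catastrophic forgetting in continual visual instruction tuning.	large language model, multimodal, instruction following	✅
9	Beyond Training: Dynamic Token Merging for Zero-Shot Video Understanding	Proposes DYTO, a dynamic token merging framework for zero-shot video understanding.	large language model, multimodal
10	FoPru: Focal Pruning for Efficient Large Vision-Language Models	Proposes FoPru, an attention-based focal pruning method that improves the efficiency of large vision-language models.	large language model, multimodal
11	LLaVA-MR: Large Language-and-Vision Assistant for Video Moment Retrieval	Proposes LLaVA-MR, which uses a multimodal large language model to tackle video moment retrieval.	large language model, multimodal
12	FocusLLaVA: A Coarse-to-Fine Approach for Efficient and Effective Visual Token Compression	FocusLLaVA: a coarse-to-fine visual token compression method that improves both the efficiency and the performance of multimodal large models.	large language model
13	Quantization without Tears	Proposes QwT, which uses a lightweight linear-layer structure to achieve efficient, general, and high-accuracy network quantization.	multimodal	✅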

🔬 Pillar 3: Perception & Semantics (7 papers)

#	Title	One-line summary	Tags	🔗	⭐
14	NexusSplats: Efficient 3D Gaussian Splatting in the Wild	NexusSplats: efficient 3D Gaussian splatting for scenes with complex lighting and occlusion.	3D gaussian splatting, 3DGS, gaussian splatting
15	CLIPer: Hierarchically Improving Spatial Representation of CLIP for Open-Vocabulary Semantic Segmentation	CLIPer: hierarchically improves CLIP's spatial representation for open-vocabulary semantic segmentation.	open-vocabulary, open vocabulary, foundation model
16	Baking Gaussian Splatting into Diffusion Denoiser for Fast and Scalable Single-stage Image-to-3D Generation and Reconstruction	Proposes DiffusionGS, which uses a diffusion model to directly generate 3D Gaussian point clouds, enabling fast single-stage image-to-3D generation and reconstruction.	gaussian splatting, splatting, scene reconstruction	✅
17	Multimodal 3D Reasoning Segmentation with Complex Scenes	Proposes the MORE3D network and the ReasonSeg3D dataset for multimodal 3D reasoning segmentation in complex scenes.	scene understanding, embodied AI, multimodal
18	DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding	DINO-X: a unified vision model for open-world object detection and understanding.	open-vocabulary, open vocabulary
19	StereoCrafter-Zero: Zero-Shot Stereo Video Generation with Noisy Restart	StereoCrafter-Zero: a zero-shot stereo video generation framework based on noisy restart.	depth estimation	✅
20	Transforming Static Images Using Generative Models for Video Salient Object Detection	Uses generative models to transform static images, improving video salient object detection performance.	optical flow

🔬 Pillar 2: RL & Architecture (5 papers)

#	Title	One-line summary	Tags	🔗	⭐
21	Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models	Insight-V: explores long-chain visual reasoning with multimodal large language models, improving performance on complex tasks.	DPO, large language model, multimodal
22	PhysFlow: Unleashing the Potential of Multi-modal Foundation Models and Video Diffusion for 4D Dynamic Physical Scene Simulation	PhysFlow: uses multimodal foundation models and video diffusion for 4D dynamic physical scene simulation.	distillation, optical flow, foundation model
23	Segment Any Class (SAC): Multi-Class Few-Shot Semantic Segmentation via Class Region Proposals	Proposes SAC, a training-free multi-class few-shot semantic segmentation method based on class region proposals.	SAC, foundation model
24	BiomedCoOp: Learning to Prompt for Biomedical Vision-Language Models	Proposes BiomedCoOp, which uses prompt learning to improve the accuracy and generalization of BiomedCLIP in biomedical image classification.	representation learning, distillation, large language model	✅
25	WARLearn: Weather-Adaptive Representation Learning	WARLearn: a weather-adaptive representation learning framework that improves model performance in adverse weather.	representation learning	✅

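🔬 Pillar 8: Physics-based Animation (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
26	Spatiotemporal Decoupling for Efficient Vision-Based Occupancy Forecasting	Proposes EfficientOCF, which decouples space and time to efficiently forecast future occupancy in autonomous driving environments.	spatiotemporal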

🔬 Pillar 6: Video Extraction (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
27	VAGUE: Visual Contexts Clarify Ambiguous Expressions	VAGUE: uses visual context to resolve ambiguous expressions, improving multimodal reasoning.	Ego4D, multimodal	✅

🔬 Pillar 7: Motion Retargeting (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
28	Enhancing GeoAI and location encoding with spatial point pattern statistics: A Case Study of Terrain Feature Classification	A GeoAI model that incorporates spatial point pattern statistics to improve terrain feature classification accuracy.	spatial relationship
