CLIP, Intuitively and Exhaustively Explained


Creating powerful image and language representations for general machine learning tasks.

“Contrasting Modes” by Daniel Warfield using MidJourney. All images by the author unless otherwise specified.

In this article you will learn about "Contrastive Language-Image Pre-training" (CLIP), a strategy for creating vision and language representations so robust that they can be used to build highly specific, high-performing classifiers without any training data. We will step through the theory, discuss how CLIP differs from conventional approaches, and then walk through the architecture step by step.

CLIP predicting highly specific labels for classification tasks it was never directly trained on. Source
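To make the idea of zero-shot classification concrete, here is a minimal sketch of the comparison step: an image embedding is matched against the embeddings of candidate captions, and the closest caption becomes the predicted label. The embedding vectors below are made-up placeholder values; in real CLIP they would come from the trained image and text encoders, so only the similarity logic is illustrated here.

```python
# Hedged sketch of CLIP-style zero-shot classification.
# The embedding values are hypothetical; only the comparison
# logic mirrors what CLIP does at inference time.
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Pretend these came from CLIP's encoders (made-up numbers).
image_embedding = np.array([0.9, 0.1, 0.2])
text_embeddings = {
    "a photo of a cat": np.array([0.8, 0.2, 0.1]),
    "a photo of a dog": np.array([0.1, 0.9, 0.3]),
}

# Zero-shot prediction: choose the caption whose embedding is
# most similar to the image embedding.
scores = {caption: cosine_similarity(image_embedding, emb)
          for caption, emb in text_embeddings.items()}
prediction = max(scores, key=scores.get)
print(prediction)
```

Because the "classes" are just text, new labels can be added by writing new captions, with no retraining — this is the property the figure above demonstrates.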

Who is this useful for? Anyone interested in computer vision, natural language processing (NLP), or multimodal modeling.

How advanced is this post? This article should be accessible to junior data scientists, but may be hard to follow without any data science experience. The difficulty ramps up once we start discussing the loss function.

Prerequisites: Some basic familiarity with computer vision and natural language processing.

A Typical Image Classifier

When training a model to detect whether an image contains a cat or a dog, a common approach is to feed the model images of cats and dogs, then incrementally adjust the model based on its errors until it learns to distinguish between the two.

A conceptual diagram of what supervised learning might look like. Imagine we have a new model which doesn’t know anything about images. We can feed it an image, ask it to predict the class of the image, then update the parameters of the model based on how wrong it is. We can then do this numerous times until the model starts performing well at the task. I explore back propagation in this post, which is the mechanism which makes this generally possible.
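The predict-measure-update loop described above can be sketched with a tiny classifier. The example below trains a logistic-regression model on two made-up feature clusters standing in for "cat" and "dog" images; the data, cluster locations, and learning rate are all illustrative assumptions, not anything from CLIP.

```python
# Illustrative supervised training loop: predict, measure error,
# nudge parameters, repeat. All data here is synthetic.
import numpy as np

rng = np.random.default_rng(0)

# Fake 2-D features: "cats" cluster near (-1, -1), "dogs" near (+1, +1).
cats = rng.normal(loc=-1.0, scale=0.5, size=(50, 2))
dogs = rng.normal(loc=+1.0, scale=0.5, size=(50, 2))
X = np.vstack([cats, dogs])
y = np.array([0] * 50 + [1] * 50)  # 0 = cat, 1 = dog

w = np.zeros(2)   # model parameters, initially "knowing nothing"
b = 0.0
lr = 0.1          # learning rate (arbitrary choice)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(200):
    p = sigmoid(X @ w + b)          # predict P("dog") for each image
    grad_w = X.T @ (p - y) / len(y)  # how wrong, per parameter
    grad_b = np.mean(p - y)
    w -= lr * grad_w                 # update toward less error
    b -= lr * grad_b

accuracy = np.mean((sigmoid(X @ w + b) > 0.5) == y)
print(f"training accuracy: {accuracy:.2f}")
```

The gradient step here plays the role of back propagation in the diagram: each pass, the parameters move in the direction that reduces the prediction error.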

This traditional supervised learning approach is acceptable for many use cases, and has been shown to perform well across a wide variety of tasks. However, it also produces highly specialized models that only perform well within the scope of their original training.

Comparing CLIP with a more traditional supervised model. Each of the models were trained on and perform well on ImageNet (a popular image classification dataset), but when exposed to similar datasets containing the same classes in different representations, the supervised model experiences a large degradation in performance, while CLIP does not. This implies that the representations in CLIP are more robust and generalizable than other methods. Source

To address this problem of over-specialization, CLIP approaches classification in a fundamentally different way: by trying to learn…