RINCE/ReLIC/ReLICv2

date

Dec 3, 2022

Last edited time

Mar 27, 2023 08:40 AM

1. Robust Contrastive Learning against Noisy Views (RINCE)

Robust Contrastive Learning against Noisy Views

Contrastive learning relies on an assumption that positive pairs contain related views, e.g., patches of an image or co-occurring multimodal signals of a video, that share certain underlying information about an instance. But what if this assumption is violated?

https://arxiv.org/abs/2201.04309

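This paper proposes RINCE, a loss that is robust to noisy views (for example, over-aggressive image augmentation, over-dubbed video, or misaligned video and captions) and that does not require explicitly estimating the noise.

The paper notes that RINCE is a contrastive lower bound on mutual information expressed through the Wasserstein dependency measure, whereas InfoNCE is a lower bound on mutual information expressed through the KL divergence.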

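Let the data distribution be $\mathcal{D}$, over inputs $x$ and binary labels $y \in \{-1, +1\}$, and let $\mathcal{D}_\eta$ be the noisy dataset, in which a label is correct, i.e. $\tilde{y} = y$, with probability $1 - \eta$. The objective is to minimize

$$R_\ell^\eta(f) = \mathbb{E}_{(x, \tilde{y}) \sim \mathcal{D}_\eta}\big[\ell(f(x), \tilde{y})\big] \tag{1}$$

where $\ell$ is a binary classification loss such as binary cross-entropy.

Symmetric loss functions are robust to label noise. A loss $\ell$ is symmetric if it satisfies

$$\ell(s, 1) + \ell(s, -1) = c \tag{2}$$

($c$ is a constant), where $s = f(x)$ is the prediction score produced by $f$ (the gradient of $\ell$ with respect to $s$ should satisfy the same symmetry).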
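The symmetry condition can be checked numerically. The sketch below is only an illustration: `exp_loss` (the loss $\ell(s, y) = -y e^{s}$) and `bce_loss` are helper names chosen here, not code from the paper. It evaluates $\ell(s, 1) + \ell(s, -1)$ at a few scores and shows that the sum is constant for the exponential loss but varies with $s$ for binary cross-entropy.

```python
import math

def exp_loss(s, y):
    """Illustrative symmetric loss l(s, y) = -y * exp(s), with y in {+1, -1}."""
    return -y * math.exp(s)

def bce_loss(s, y):
    """Binary cross-entropy on a logit s with label y in {+1, -1}."""
    return math.log1p(math.exp(-y * s))

def symmetry_sum(loss, s):
    """Left-hand side of the symmetry condition: l(s, 1) + l(s, -1)."""
    return loss(s, 1) + loss(s, -1)

scores = [-2.0, 0.0, 1.5]
exp_sums = [symmetry_sum(exp_loss, s) for s in scores]  # same value for every s
bce_sums = [symmetry_sum(bce_loss, s) for s in scores]  # changes with s

print("exp loss:", exp_sums)
print("BCE loss:", bce_sums)
```

A loss whose sum changes with the score, as binary cross-entropy does here, is not symmetric in this sense.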

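A symmetric contrastive loss has the form

$$\mathcal{L}(s) = \ell(s^+, 1) + \lambda \sum_{i=1}^{K} \ell(s_i^-, -1) \tag{3}$$

where $s^+$ and $s_i^-$ are the scores of the positive pair and of the $K$ negative pairs, respectively. The weight $\lambda$ reflects the relative importance of positive and negative pairs.

InfoNCE does not satisfy the symmetry condition on the gradient with respect to $s$.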

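The RINCE loss is

$$\mathcal{L}_{\text{RINCE}}^{\lambda, q}(s) = -\frac{e^{q \cdot s^+}}{q} + \frac{\big(\lambda \cdot (e^{s^+} + \sum_{i=1}^{K} e^{s_i^-})\big)^{q}}{q} \tag{4}$$

where $q$ and $\lambda$ both lie in $(0, 1]$ (experiments show that the loss is insensitive to the value of $\lambda$).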

💡

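When $q = 1$, RINCE fully satisfies the symmetric form (3) (here $\ell(s, y) = -y e^{s}$, which satisfies (2) with $c = 0$):

$$\mathcal{L}_{\text{RINCE}}^{\lambda, q=1}(s) = -(1 - \lambda)\, e^{s^+} + \lambda \sum_{i=1}^{K} e^{s_i^-}$$

In this case the loss is robust to noise.

As $q \to 0$, RINCE asymptotically approaches InfoNCE.

For any value of $q$, the loss decreases as the positive score increases and as the negative scores decrease.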
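The properties above can be checked with a small sketch of the RINCE loss for a single anchor, written as $-e^{q s^+}/q + (\lambda (e^{s^+} + \sum_i e^{s_i^-}))^q / q$. The function names and the default values of `q` and `lam` are assumptions made for this example. For small $q$ the sketch compares RINCE against InfoNCE plus the constant $\log \lambda$, which is what a first-order expansion in $q$ gives.

```python
import math

def rince_loss(s_pos, s_negs, q=0.5, lam=0.01):
    """RINCE for one anchor; q and lam defaults are illustrative only."""
    total = math.exp(s_pos) + sum(math.exp(s) for s in s_negs)
    return -math.exp(q * s_pos) / q + (lam * total) ** q / q

def infonce_loss(s_pos, s_negs):
    """InfoNCE for one anchor: -log softmax probability of the positive."""
    total = math.exp(s_pos) + sum(math.exp(s) for s in s_negs)
    return -math.log(math.exp(s_pos) / total)

s_pos, s_negs, lam = 0.8, [0.1, -0.3], 0.2

# q = 1: RINCE reduces to -(1 - lam) * e^{s+} + lam * sum_i e^{s_i^-}.
closed_form = -(1 - lam) * math.exp(s_pos) + lam * sum(math.exp(s) for s in s_negs)
at_q1 = rince_loss(s_pos, s_negs, q=1.0, lam=lam)

# q -> 0: RINCE approaches InfoNCE up to the additive constant log(lam).
near_zero = rince_loss(s_pos, s_negs, q=1e-6, lam=lam)
shifted_infonce = infonce_loss(s_pos, s_negs) + math.log(lam)

print(at_q1, closed_form)
print(near_zero, shifted_infonce)
```

The additive constant does not affect gradients, so for optimization purposes the small-$q$ regime behaves like InfoNCE.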

💡

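In the gradient computation:

When $q = 1$, RINCE places more weight on easy positives (positive pairs with high scores);

InfoNCE ($q \to 0$) places more weight on hard positives (positive pairs with low scores);

both place more weight on hard negatives (negative pairs with high scores).

As a result, InfoNCE converges faster when there is no noise, while RINCE with $q = 1$ is more robust to noise. In practice, $q$ is chosen within $(0, 1]$.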
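The weighting behaviour can be made concrete by differentiating both losses by hand. The sketch below assumes the RINCE form $-e^{q s^+}/q + (\lambda (e^{s^+} + \sum_i e^{s_i^-}))^q / q$; the function names, scores, and the value of `lam` are hypothetical choices for illustration. It compares gradient magnitudes for an easy positive (score 2.0) and a hard positive (score -2.0), and for a hard versus an easy negative.

```python
import math

def rince_grads(s_pos, s_negs, q, lam):
    """Analytic gradients of RINCE w.r.t. the positive score and each negative score."""
    total = math.exp(s_pos) + sum(math.exp(s) for s in s_negs)
    common = lam ** q * total ** (q - 1.0)
    g_pos = -math.exp(q * s_pos) + common * math.exp(s_pos)
    g_negs = [common * math.exp(s) for s in s_negs]
    return g_pos, g_negs

def infonce_grads(s_pos, s_negs):
    """Analytic gradients of InfoNCE w.r.t. the positive score and each negative score."""
    total = math.exp(s_pos) + sum(math.exp(s) for s in s_negs)
    g_pos = math.exp(s_pos) / total - 1.0
    g_negs = [math.exp(s) / total for s in s_negs]
    return g_pos, g_negs

negs = [0.0, 0.0]
rince_easy, _ = rince_grads(2.0, negs, q=1.0, lam=0.01)
rince_hard, _ = rince_grads(-2.0, negs, q=1.0, lam=0.01)
nce_easy, _ = infonce_grads(2.0, negs)
nce_hard, _ = infonce_grads(-2.0, negs)

# Negatives: first entry is a hard negative (high score), second an easy one.
_, rince_neg = rince_grads(0.5, [1.0, -1.0], q=1.0, lam=0.01)
_, nce_neg = infonce_grads(0.5, [1.0, -1.0])

print("RINCE q=1 |grad| easy vs hard positive:", abs(rince_easy), abs(rince_hard))
print("InfoNCE   |grad| easy vs hard positive:", abs(nce_easy), abs(nce_hard))
```

With $q = 1$ a low-scoring (possibly noisy) positive receives a small gradient, which is one way to read the robustness claim.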

2. Representation Learning via Invariant Causal Mechanisms (ReLIC)

Representation Learning via Invariant Causal Mechanisms

Self-supervised learning has emerged as a strategy to reduce the reliance on costly supervised signal by pretraining representations only using unlabeled data. These methods combine heuristic proxy classification tasks with data augmentations and have achieved significant success, but our theoretical understanding of this success remains limited.

https://arxiv.org/abs/2010.07922

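This paper splits the data into content and style (for example, when classifying whether an image shows a dog, the dog is the content, while factors such as background and lighting are the style). The learned representation should depend only on content.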

💡

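To enforce invariance, we use a KL divergence to constrain the features to be invariant to data augmentation:

$$\mathbb{E}_{X}\, \mathbb{E}_{a_{lk}, a_{qt} \sim \mathcal{A} \times \mathcal{A}} \sum_{b \in \{a_{lk}, a_{qt}\}} \mathcal{L}_b\big(Y^R, f(X)\big) \quad \text{s.t.} \quad KL\big(p^{do(a_{lk})}(Y^R \mid f(X)),\; p^{do(a_{qt})}(Y^R \mid f(X))\big) \le \rho$$

where $\mathcal{L}$ is the proxy task loss and KL is the Kullback-Leibler (KL) divergence. Note that any distance measure on distributions can be used in place of the KL divergence.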

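Data augmentation is used to change the style while preserving the content (e.g. image rotation, grayscale changes, cropping, and translation), which forms the positive pairs. The overall loss can be written as

$$-\sum_{i=1}^{N} \sum_{a_{lk}} \log \frac{\exp\big(\phi(f(x_i^{a_l}), h(x_i^{a_k})) / \tau\big)}{\sum_{m=1}^{M} \exp\big(\phi(f(x_i^{a_l}), h(x_m^{a_k})) / \tau\big)} + \alpha \sum_{a_{lk}, a_{qt}} KL\big(p^{do(a_{lk})},\; p^{do(a_{qt})}\big)$$

Here $f$ denotes the neural network; $h$ is related to $f$ and is often taken to be $h = f$; $\phi(f(x_i), h(x_j)) = \langle g(f(x_i)), g(h(x_j)) \rangle$, where $g$ is a fully-connected layer and $\langle \cdot, \cdot \rangle$ computes the cosine similarity of its two arguments. $\tau$ is the temperature, $\alpha$ weights the invariance penalty, $M$ is the size of the contrast set, and $p^{do(a_{lk})}(Y^R = j \mid f(x_i)) \propto \exp\big(\phi(f(x_i^{a_l}), h(x_j^{a_k})) / \tau\big)$. $a_{lk} = (a_l, a_k)$ is a pair of data augmentations. That is, two augmentations are applied to each of two images, the cosine similarity between the results is computed, and the KL divergence constrains this similarity to stay unchanged across different augmentations.

The first term is the usual contrastive loss, and the second is the invariance penalty (or invariance loss) on augmentations, i.e. augmentations should change the content as little as possible. This term reduces the intra-class distance.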
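The two-term structure (contrastive loss plus KL invariance penalty) can be sketched in a few lines. This is a simplified illustration under several assumptions, not the authors' implementation: the embeddings `z1` and `z2` stand for already projected features of the same batch under two augmentations, a single network is used for both views, the invariance penalty compares the two orderings of that one augmentation pair, and the `tau` and `alpha` defaults are placeholders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def softmax(row):
    """Numerically stable softmax over a list of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    z = sum(exps)
    return [e / z for e in exps]

def pair_distributions(z_a, z_b, tau):
    """Row i is the softmax over j of cos(z_a[i], z_b[j]) / tau."""
    return [softmax([cosine(u, v) / tau for v in z_b]) for u in z_a]

def kl(p, q):
    """KL(p || q) for two discrete distributions with positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def relic_loss(z1, z2, tau=0.1, alpha=1.0):
    """Return (total, contrastive, invariance) for two augmented views of a batch."""
    p12 = pair_distributions(z1, z2, tau)
    p21 = pair_distributions(z2, z1, tau)
    n = len(z1)
    contrastive = -sum(math.log(p12[i][i]) + math.log(p21[i][i]) for i in range(n))
    invariance = sum(kl(p12[i], p21[i]) for i in range(n))
    return contrastive + alpha * invariance, contrastive, invariance

z1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
z2 = [[0.9, 0.1], [0.2, 1.0], [1.0, 0.8]]
total, contrastive, invariance = relic_loss(z1, z2)
print(total, contrastive, invariance)
```

When the two views produce identical embeddings, the invariance term is zero and only the contrastive term remains.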

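The paper also explains why self-supervised learning succeeds, by proving the following. Let $\mathcal{Y} = \{Y_t\}_{t=1}^{T}$ be a set of downstream tasks, and let the task $Y^R$ be a refinement of every task in $\mathcal{Y}$. If the representation learned through $Y^R$ depends only on content, then this representation generalizes to all downstream tasks in $\mathcal{Y}$.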

3. Pushing the limits of self-supervised ResNets: Can we outperform supervised learning without labels on ImageNet? (ReLICv2)

Pushing the limits of self-supervised ResNets: Can we outperform supervised learning without labels on ImageNet?

Despite recent progress made by self-supervised methods in representation learning with residual networks, they still underperform supervised learning on the ImageNet classification benchmark, limiting their applicability in performance-critical settings. Building on prior theoretical insights from ReLIC [Mitrovic et al., 2021], we include additional inductive biases into self-supervised learning.

http://arxiv.org/abs/2201.05119

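The loss function of ReLICv2 is similar to that of ReLIC: a contrastive term combined with a KL invariance penalty across augmentations.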

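The difference from ReLIC lies in how positive and negative samples are chosen. Positives are generated by first applying multi-crop augmentation and saliency-based background removal, followed by the standard SimCLR augmentation scheme. Negatives could be selected with hard-negative sampling, but the paper simply samples them uniformly at random from the batch.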
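Uniform negative sampling within a batch is simple to sketch. The helper below is hypothetical (its name, signature, and the choice to sample without replacement are assumptions, not details from the paper): it returns the batch indices used as negatives for one anchor, excluding the anchor itself.

```python
import random

def sample_negatives(batch_size, anchor, num_negatives, rng):
    """Uniformly sample distinct negative indices from the batch, excluding the anchor."""
    candidates = [i for i in range(batch_size) if i != anchor]
    return rng.sample(candidates, num_negatives)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
negatives = sample_negatives(batch_size=8, anchor=3, num_negatives=4, rng=rng)
print(negatives)
```

A hard-negative scheme would replace the uniform draw with one weighted by similarity to the anchor.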