Self Supervised Boy – Telegram
Reading papers on self/semi/weak supervised DL methods. Papers here: https://www.notion.so/Self-Supervised-Boy-papers-reading-751aa85ffca948d28feacc45dc3cb0c0
contact me @martolod
One more update on the Teacher-Student paradigm by Quoc Le.
Now the Teacher is continuously updated to direct the Student towards the optimum w.r.t. the labeled data. On each step we take the update gradient for the Teacher model as the gradient towards the current pseudo-label, and then scale this gradient by the cosine distance between two gradients of the Student model: the one from unlabeled and the one from labeled data.
Achieved a new SoTA on ImageNet (+1.6% top-1 acc).
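A rough PyTorch-style sketch of the gradient comparison behind the Teacher update (function and variable names are mine, and I use plain cosine similarity here; check the paper for the exact scaling coefficient):

```python
import torch
import torch.nn.functional as F

def student_gradient_agreement(student, x_l, y_l, x_u, pseudo_y):
    """Compare the Student's gradient on labeled data with its gradient on
    pseudo-labeled data; the resulting scalar would scale the Teacher's update."""
    loss_l = F.cross_entropy(student(x_l), y_l)
    grad_l = torch.autograd.grad(loss_l, list(student.parameters()))

    loss_u = F.cross_entropy(student(x_u), pseudo_y)
    grad_u = torch.autograd.grad(loss_u, list(student.parameters()))

    flat_l = torch.cat([g.flatten() for g in grad_l])
    flat_u = torch.cat([g.flatten() for g in grad_u])
    return F.cosine_similarity(flat_l, flat_u, dim=0)  # scalar in [-1, 1]
```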

More detailed with formulas here: https://www.notion.so/Meta-Pseudo-Label-b83ac7b7086e47e1bef749bc3e8e2124
Source here: https://arxiv.org/pdf/2003.10580.pdf
Oral from ICLR 2021 on using the teacher-student setup for cross-domain transfer learning. The teacher is trained on the labelled data and produces pseudolabels for the unlabelled data in the target domain. This allows the student to learn useful in-domain representations and gain 2.9% of accuracy on one-shot learning with relatively low training effort.
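A minimal sketch of that setup (the loss choices and the weighting are my assumptions, not the paper's exact recipe):

```python
import torch
import torch.nn.functional as F

def student_step(student, teacher, x_src, y_src, x_tgt, w_unsup=1.0):
    """The teacher (trained on labeled source data) soft-labels unlabeled
    target-domain images; the student fits both the source labels and the
    soft pseudolabels."""
    with torch.no_grad():
        pseudo = F.softmax(teacher(x_tgt), dim=1)        # soft target-domain labels
    loss_src = F.cross_entropy(student(x_src), y_src)
    loss_tgt = F.kl_div(F.log_softmax(student(x_tgt), dim=1), pseudo,
                        reduction='batchmean')
    return loss_src + w_unsup * loss_tgt
```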

With more fluff here: https://www.notion.so/Self-training-for-Few-shot-Transfer-Across-Extreme-Task-Differences-bfe820f60b4b474796fd0a5b6b6ad312
Source here: https://openreview.net/pdf?id=O3Y56aqpChA
One more oral from ICLR 2021. Theoretical this time, so there is no way I can set up a detailed overview.

Key points:
1) Authors altered the definition of neighbourhood. Instead of measuring the distance between samples directly, they call a sample x' a neighbour of x if there is an augmentation A such that the distance |A(x) - x'| is lower than a threshold (written out after this list).
2) Assumption 1: any small subset of the in-class samples should expand (via adding neighbours to the subset) to a larger in-class subset of samples.
3) Assumption 2: the probability that x' is a neighbour of x while they have different ground-truth labels is low and almost negligible.
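To make point 1 concrete, here is the neighbourhood written out (the notation is my paraphrase, not necessarily the paper's exact formulation):

```latex
% x' is a neighbour of x if some augmentation of x lands within distance r of x':
\mathcal{N}(x) = \{\, x' : \exists\, A \in \mathcal{A} \ \text{such that} \ \|A(x) - x'\| \le r \,\}
% Expansion (Assumption 1, informally): for any small in-class subset S,
% its neighbourhood \mathcal{N}(S) covers a strictly larger in-class subset.
```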

Authors show that these are sufficient requirements for consistency regularisation in self-supervision and transfer learning to show good results.

This nicely complements the previous paper on transfer learning, where the authors showed how consistency regularisation helps. It also ties in with works on smart augmentation strategies.

source: https://openreview.net/pdf?id=rC8sJ4i6kaH
Spotlight at ICLR 2021 by Schmidhuber. Proposes a method for unsupervised keypoint localisation with an RL application on Atari.

Very clear and simple idea:
1. Compress the image with a VAE and later use features from some intermediate layer of the encoder.
2. Try to predict each feature vector from its surrounding vectors. If the prediction error is high, we have found some important object (see the sketch right after this list).
3. Compress the error map of the image as a mixture of Gaussians with fixed covariance, each centre representing one keypoint.
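A rough sketch of step 2 (since prediction is from the surrounding vectors, the centre location should really be masked out; the plain 3x3 conv here is a simplification, and the feature source is assumed to be an intermediate VAE encoder layer):

```python
import torch.nn as nn

class LocalPredictabilityError(nn.Module):
    """Predict each feature vector from its spatial neighbourhood and keep the
    per-location prediction error as an 'importance' map."""
    def __init__(self, channels):
        super().__init__()
        self.predictor = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, feats):                               # feats: (B, C, H, W)
        pred = self.predictor(feats)
        return (pred - feats.detach()).pow(2).mean(dim=1)   # (B, H, W) error map
```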

SoTA on Atari games, more robust to input noise.

Probably it could also be used outside the simple Atari setup if you have enough data to train on and take later layers of the encoder.

With colourful images here: https://www.notion.so/Unsupervised-Object-Keypoint-Learning-Using-Local-Spatial-Predictability-ddcf36a856ff4e389050b3089cd710bc
Source here: https://openreview.net/pdf?id=GJwMHetHc73
Yet another paper from ICLR 2021. This one proposes an advanced method of pseudolabel generation.

In a few words, we simultaneously train an encoder-decoder model to predict the segmentation on supervised data, and to produce consistent pseudolabels and predictions on unsupervised data regardless of the augmentation applied.
As the pseudolabel we use a specifically calibrated Grad-CAM from the encoder part of the model, fused with the prediction of the decoder part, again with a fancy procedure.

With some more fluff and notes here.
Source here.
Pretty simple keypoint localisation pipeline with self-supervision constraints for unlabeled data. Again from ICLR 2021.

The key ideas are:
1. Add a classification task for the type of keypoint as a function of the localisation network features. This is usually not required because of the fixed order of keypoints in model predictions, but this small additional loss actually boosts performance more than the next two constraints.
2. Add a constraint that if we localise keypoints on a spatially augmented image, the result should be the same as the spatially augmented localisation map (see the sketch right after this list).
3. Add a constraint that the representation vectors of keypoints should be invariant to augmentation.
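A minimal sketch of constraint 2 (the warp is assumed to be a differentiable spatial transform applicable both to images and to heatmaps, and the model is assumed to output (B, K, H, W) localisation maps):

```python
import torch.nn.functional as F

def equivariance_loss(model, images, warp):
    """Heatmaps predicted on a spatially warped image should match the warped
    heatmaps of the original image."""
    heatmaps = model(images)             # predictions on the original images
    warped_pred = model(warp(images))    # predictions on the warped images
    return F.mse_loss(warped_pred, warp(heatmaps))
```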

And here they are, getting SoTA results on several challenging datasets, even when 100% of the dataset is used as labeled data.

With a bit more information here.
Source here.
Yet again, a simple approach leading to unsupervised segmentation. Mostly useful as pre-training though.

The proposed pipeline first mines salient object areas (with any available framework, possibly supervised) and then runs contrastive learning on pixel embeddings inside those regions. During the second step each individual pixel embedding is attracted to the mean embedding of its object and pushed away from the mean embeddings of other objects. This detail differs from some previously proposed pipelines and allows larger-scale training, because the number of loss pairs grows more slowly.
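A sketch of that second step (normalisation, temperature and the way pixels are sampled are my assumptions):

```python
import torch
import torch.nn.functional as F

def pixel_to_object_contrast(pixel_emb, object_ids, temperature=0.1):
    """Attract each pixel embedding to the mean embedding of its own salient
    object and push it away from the means of the other objects.
    pixel_emb: (N, D) embeddings of pixels inside salient regions,
    object_ids: (N,) integer id of the object each pixel belongs to."""
    pixel_emb = F.normalize(pixel_emb, dim=1)
    uniq = object_ids.unique()                                    # sorted object ids
    protos = torch.stack([pixel_emb[object_ids == i].mean(0) for i in uniq])
    protos = F.normalize(protos, dim=1)                           # (M, D) mean embeddings
    logits = pixel_emb @ protos.t() / temperature                 # (N, M)
    targets = torch.searchsorted(uniq, object_ids)                # index of own object
    return F.cross_entropy(logits, targets)
```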

Less briefly and with some external links here.
Source here.
A bit old (NeurIPS 2019), but an interesting take on saliency prediction.

Instead of using a direct mixture of different unsupervised salient-region prediction algorithms and focusing on the fusion strategy, authors propose to use distillation as a way to refine each algorithm's predictions separately. The paper shows several steps of distillation, self-training and use of a moving average to stabilise the predictions of each method separately. After these steps, the authors employ the accumulated averages as labels for the final network training.

Slightly more words here.
Source here.
Forwarded from Gradient Dude
Facebook open-sourced a library for state-of-the-art self-supervised learning: VISSL.

+ It contains reproducible reference implementations of SOTA self-supervision approaches (like SimCLR, MoCo, PIRL, SwAV etc.) and their components that can be reused. Also supports supervised training.
+ Easy to train models on 1 GPU, multi-GPU and multi-node setups. Seamless scaling to large-scale data and model sizes with FP16, LARC etc.

Finally somebody unified all recent works in one modular framework. I don't know about you, but I'm very happy 😌!

VISSL: https://vissl.ai/
Blogpost: https://ai.facebook.com/blog/seer-the-start-of-a-more-powerful-flexible-and-accessible-era-for-computer-vision
Tutorials in Google Colab: https://vissl.ai/tutorials/
New paper from Yann LeCun, who continues to introduce new features inspired by biology.

For each batch of input samples we produce two batches of vector representations which differ only by the (randomly sampled) augmentation. From these we compute the cross-correlation matrix between the two sets of augmented representations. The loss pushes this matrix to be as close to identity as possible. Intuitively, this makes the representations invariant to the augmentation, since the main diagonal tends to 1, and non-redundant and non-trivial, since the values off the main diagonal tend to 0.
Authors show via rigorous ablation tests how this helps to (1) ease the requirement for a large batch size, (2) avoid the fuss of negative mining and (3) take advantage of the dimensionality of the representation.
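The loss itself is simple enough to sketch in a few lines (the off-diagonal weight is an assumption):

```python
import torch

def cross_correlation_loss(z_a, z_b, lambda_offdiag=5e-3):
    """Push the cross-correlation matrix of the two augmented views towards
    identity. z_a, z_b: (N, D) representations of the same batch under two
    different augmentations."""
    n = z_a.shape[0]
    z_a = (z_a - z_a.mean(0)) / z_a.std(0)          # normalise each dimension
    z_b = (z_b - z_b.mean(0)) / z_b.std(0)
    c = z_a.t() @ z_b / n                           # (D, D) cross-correlation
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()
    return on_diag + lambda_offdiag * off_diag
```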

More expanded here.
Much more discussion and great area overview in source.

P.S. It is always such a pleasure to read papers like this, where the authors propose such clear concepts that they have much more space left for discussion.
Interactive Weak Supervision paper from ICLR 2021.

In contrast to classical active learning, where experts are queried to assess individual samples, the idea of this paper is to assess labeling heuristics that are automatically generated. Authors argue that since experts are good at writing such heuristics from scratch, they should also be able to judge auto-generated ones. To rank not-yet-assessed heuristics, the authors propose to train an ensemble of models to predict the assessor's mark for a heuristic. As input for these models they use a fingerprint of the heuristic: its concatenated predictions on some subset of the data.
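A sketch of how that could look (the fingerprint construction follows the description above; the ensemble model choice is my assumption, not necessarily the paper's):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def fingerprint(heuristic, X_subset):
    """Represent a labeling heuristic by its predictions on a fixed data subset."""
    return np.array([heuristic(x) for x in X_subset])

def rank_candidates(assessed_fps, expert_marks, candidate_fps, n_models=10):
    """Train an ensemble on expert-assessed heuristics (fingerprint -> mark) and
    score not-yet-assessed candidates by the averaged predicted probability."""
    scores = np.zeros(len(candidate_fps))
    for seed in range(n_models):
        model = RandomForestClassifier(n_estimators=50, random_state=seed)
        model.fit(assessed_fps, expert_marks)
        scores += model.predict_proba(candidate_fps)[:, 1]
    return scores / n_models
```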

There are no very fancy results, there are some concerns raised by reviewers, and there are some strange notations in this paper. Yet the idea looks interesting to me.

With a bit deeper description (and one unanswered question) here.
Source (and rebuttal comments with important links) there.
Transferable Visual Words. A paper which exploits the assumption that medical images are well aligned for its pseudo-labeling procedure.

Authors propose to use the fact that the structure of medical images is fixed due to imaging procedures and anatomical semantics. They generate pseudo-labels under the assumption that the same spatial region of different images represents more or less the same semantic features. To enforce this assumption further, they train an AE model and select training samples which are close in the latent space.

They use the index of the cropping region as the pseudo-label and train a denoising AE with a classification head on these crops as the model for further fine-tuning.
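A rough sketch of that model (the encoder/decoder internals and the flat latent code are my assumptions):

```python
import torch.nn as nn

class DenoisingAEWithRegionHead(nn.Module):
    """Denoising AE that reconstructs the clean crop, plus a classification head
    predicting the index of the cropping region (the pseudo-label), which works
    because the scans are roughly aligned."""
    def __init__(self, encoder, decoder, code_dim, num_regions):
        super().__init__()
        self.encoder, self.decoder = encoder, decoder
        self.region_head = nn.Linear(code_dim, num_regions)

    def forward(self, noisy_crop):
        z = self.encoder(noisy_crop)           # assumed (B, code_dim) latent code
        recon = self.decoder(z)                # denoising target: the clean crop
        region_logits = self.region_head(z)    # which region was this crop taken from?
        return recon, region_logits
```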

Not only does this method surpass the presented self-supervised baselines, it is also beneficial for combined pre-training with them.

More precise training and labeling procedure here.
Original paper here.
Self-supervision paper from arxiv for histopathology CV.

Authors draw inspiration from how histopathologists tend to review the images, and from how those images are stored. Histopathology images are multiscale slices of enormous size (tens of thousands of pixels on a side), and area experts constantly move through different levels of magnification to keep in mind both the fine and the coarse structure of the tissue.

Therefore, in this paper the loss is proposed to capture the relation between different magnification levels. Authors propose to train the network to order concentric patches by their magnification level. They organise it as a classification task: the network predicts the id of the order permutation instead of predicting the order itself.
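A sketch of that classification setup (the number of magnification levels, the shared backbone and the concatenation head are my assumptions):

```python
import itertools
import torch
import torch.nn as nn

PERMUTATIONS = list(itertools.permutations(range(4)))    # 4 magnification levels

class MagnificationOrderNet(nn.Module):
    """Encode the shuffled concentric patches with a shared backbone and predict
    the id of the permutation that was applied to their magnification order."""
    def __init__(self, backbone, feat_dim):
        super().__init__()
        self.backbone = backbone
        self.head = nn.Linear(feat_dim * 4, len(PERMUTATIONS))

    def forward(self, patches):                            # patches: (B, 4, C, H, W)
        feats = [self.backbone(patches[:, i]) for i in range(4)]
        return self.head(torch.cat(feats, dim=1))          # permutation logits
```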

Also, authors propose a specific architecture for this task and append a self-training procedure, as it was shown to boost results even after pre-training.

All this allows them to reach a quality increase even in the high-data regime.

My description of the architecture and loss expanded here.
Source of the work here.
Forwarded from Gradient Dude
DetCon: The Self-supervised Contrastive Detection Method 🥽
DeepMind

A new self-supervised objective, contrastive detection, which tasks representations with identifying object-level features across augmentations.

Object-based regions are identified with an approximate, automatic segmentation algorithm based on pixel affinity (bottom). These masks are carried through two stochastic data augmentations and a convolutional feature extractor, creating groups of feature vectors in each view (middle). The contrastive detection objective then pulls together pooled feature vectors from the same mask (across views) and pushes apart features from different masks and different images (top).

🌟Highlights
+ SOTA detection and Instance Segmentation (on COCO) and Semantic Segmentation results (on PASCAL) when pretrained in self-supervised regime on ImageNet, while requiring up to 5× fewer epochs than SimCLR.
+ It also outperforms supervised pretraining on Imagenet.
+ DetCon(SimCLR) converges much faster to reach SOTA: 200 epochs are sufficient to surpass supervised transfer to COCO, and 500 to PASCAL.
+ Linear increase in the number of model parameters (using ResNet-101, ResNet-152, and ResNet-200) brings a linear increase in the accuracy on downstream tasks.
+ Despite only being trained on ImageNet, DetCon(BYOL) matches the performance of Facebook's SEER model that used a higher capacity RegNet architecture and was pretrained on 1 Billion Instagram images.
+ First time a ResNet-50 with self-supervised pretraining on COCO outperforms supervised pretraining for transfer to PASCAL.
+ The power of DetCon strongly correlates with the quality of the masks. The better the masks used during the self-supervised pretraining stage, the better the accuracy on downstream tasks.

⚙️ Method details
DetConS and DetConB are based on two recent self-supervised baselines, SimCLR and BYOL respectively, with a ResNet-50 backbone.
Authors adopt the data augmentation procedure and network architecture from these methods while applying the proposed Contrastive Detection loss to each.

Each image is randomly augmented twice, resulting in two images: x, x'.
In addition, they compute for each image a set of masks that segment the image into different components.
These masks can be computed using efficient, off-the-shelf, unsupervised segmentation algorithms. In particular, authors use the Felzenszwalb-Huttenlocher algorithm, a classic segmentation procedure that iteratively merges regions using pixel-based affinity. This algorithm does not require any training and is available in scikit-image. If available, human-annotated segmentations can also be used instead of the automatically generated ones. Each mask (represented as a binary image) is transformed using the same cropping and resizing as the underlying RGB image, resulting in two sets of masks {m}, {m'} which are aligned with the augmented images x, x'.
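For reference, a minimal scikit-image call producing such masks (the parameter values here are arbitrary, not the paper's):

```python
from skimage import data
from skimage.segmentation import felzenszwalb

image = data.astronaut()                                  # any RGB image
segments = felzenszwalb(image, scale=100, sigma=0.8, min_size=50)
# `segments` is an (H, W) integer map; each value indexes one region/mask.
```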

For every mask m associated with the image, authors compute a mask-pooled hidden vector (i.e., similar to regular average pooling but applied only to spatial locations belonging to the same mask).
Then a 2-layer MLP is used as a projection head on top of the mask-pooled hidden vectors. Note that if you replace mask-pooling with a single global average pooling, you get exactly the SimCLR or BYOL architecture.
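A sketch of the mask-pooling step (shapes and the resizing of masks to the feature resolution are assumed to be handled elsewhere):

```python
import torch

def mask_pooled_vectors(features, masks):
    """Average feature vectors over the spatial locations belonging to each mask.
    features: (B, C, H, W) backbone output, masks: (B, M, H, W) binary masks."""
    masks = masks.float()
    area = masks.sum(dim=(2, 3)).clamp(min=1)                  # (B, M)
    pooled = torch.einsum('bchw,bmhw->bmc', features, masks)   # (B, M, C)
    return pooled / area.unsqueeze(-1)
```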

A standard contrastive loss based on cross-entropy is used for learning. The positive pair is the latent representations of the same mask from the augmented views x and x'. Latent representations of different masks from the same image and from different images in the batch are used as negative samples. Moreover, negative masks are allowed to overlap with a positive one.
Forwarded from Gradient Dude
🦾 Main experiments

Pretrain on Imagenet -> finetune on COCO or PASCAL:
1. Pretrain on Imagenet in a self-supervised regime using the proposed DetCon approach.
2. Use the self-supervised pretraining of the backbone to initialize Mask-RCNN and fine-tune it with GT labels for 12 epochs on COCO or 45 epochs on PASCAL (Semantic Segmentation).
3. Achieve SOTA results while using 5x fewer pretraining epochs than SimCLR.

Pretrain on COCO -> finetune on PASCAL for Semantic Segmentation task:
1. Pretrain on COCO in self-supervised regime using the proposed DetCon approach.
2. Use the self-supervised pretraining of the backbone to initialize Mask-RCNN and fine-tune it with GT labels for 45 epochs on PASCAL (Semantic Segmentation).
3. Achieve SOTA results while using 4x fewer pretraining epochs than SimCLR.
4. The first time a ResNet-50 backbone with self-supervised pretraining on COCO outperforms supervised pretraining.

📝 Paper: Efficient Visual Pretraining with Contrastive Detection
Contrast to Divide: a pretty practical paper on using self-supervision to solve learning with noisy labels.

Self-supervision as a plug-and-play technique shows good results in more and more areas. As the authors show in this paper, simply replacing supervised pre-training with self-supervised pre-training can improve both results and stability. This is achieved by removing label-noise memorisation (or the domain gap, in the case of transfer) from the warm-up stage of training, therefore maintaining a better separation between classes.

Source here.
Self-supervision pre-training for brain cortex segmentation: a paper from MICCAI-2018.

Quite an old paper (for this boiling-hot area), although with an interesting take. Authors set up metric-learning pre-training, but instead of a 3D metric they estimate the geodesic distance along the brain surface between cuts taken orthogonal to the surface. Why? Because the brain cortex is a relatively thin structure along the curved brain surface, and therefore areas are separated not as 3D-space patches but as patches on this surface. Authors demonstrate how the predicted distance between adjacent slices then aligns with the ground-truth borders of the areas.

Although the presented result is better than the naïve baseline, I wouldn't be astonished if other pre-training techniques that have emerged since then would provide good results as well.

With a bit more words and one formula here.
Original here.
Instance Localization for Self-supervised Detection Pretraining. A paper on the importance of task-specific pre-training.

Authors hypothesise about the problems of the popular self-supervised pre-training frameworks w.r.t. the localisation task. They come to the idea that there are no losses enforcing localisation in the object representations, and therefore propose a new loss. To make two contrastive representations of one image, they crop two random parts of it and paste those parts onto random images from the same dataset. Then they use a neural network to embed those images, but contrast only the region of the embedding related to the pasted crop, instead of contrasting the whole image embedding as is usually done.
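A sketch of how the crop-level contrast could be wired up (the box format, feature stride and pooling size are my assumptions; RoIAlign stands in for "contrast only the region of the embedding related to the pasted crop"):

```python
import torch
import torchvision.ops as ops

def crop_level_embedding(crop, background, box, backbone, spatial_scale=1 / 32):
    """Paste a crop onto another image, run the backbone, and pool only the
    feature region covering the pasted crop, so the contrastive loss compares
    crop-level rather than whole-image embeddings."""
    x1, y1, x2, y2 = box
    image = background.clone()
    image[:, y1:y2, x1:x2] = crop                 # crop assumed resized to the box size
    feats = backbone(image.unsqueeze(0))          # (1, C, H/32, W/32) feature map
    rois = torch.tensor([[0., x1, y1, x2, y2]])   # (batch_idx, x1, y1, x2, y2)
    pooled = ops.roi_align(feats, rois, output_size=1, spatial_scale=spatial_scale)
    return pooled.flatten(1)                      # (1, C) embedding of the pasted crop
```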

Interestingly, the proposed loss not only provides SotA pre-training for the localisation task, it also degrades the classification quality. This is a somewhat important practical finding: while general representations keep getting better and better, it could be more important to have task-specific pre-training than a SotA one tailored for another task.

More detailed here.
Source here.
SelfReg — paper on contrastive learning towards domain generalisation.

Domain generalisation methods are focused on training models which will not need any adaptation to work on new domains. Authors propose to adapt the popular contrastive learning framework to this task.
To get a positive pair, they sample two examples of the same class from different domains. Compared to classical contrastive learning it is, roughly, having different domains instead of different augmentations, and different classes instead of different samples.
To avoid the burden of good negative-sample mining, the authors adopt the BYOL idea and employ a projection network to avoid representation collapse.
Suppose f is the neural network under training, g is a trainable linear layer producing a projection of the representation, and x_ck is a random sample of class c from domain k.
As the loss the authors use two squared L2 distances:
1. ||f(x_cj) - g(f(x_ck))||²
2. ||f(x_cj) - (λ·g(f(x_cj)) + (1-λ)·g(f(x_ck)))||², where λ ~ Beta.
NB! In the second loss, the right part is a linear mixture of sample projections from different domains.
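The two terms are easy to sketch (the Beta parameters and the use of a mean rather than a sum over dimensions are my assumptions):

```python
import torch
import torch.nn.functional as F

def selfreg_losses(f_cj, f_ck, g, alpha=0.5, beta=0.5):
    """f_cj, f_ck: representations of two same-class samples from different
    domains; g is the trainable linear projection layer."""
    lam = torch.distributions.Beta(alpha, beta).sample()
    g_cj, g_ck = g(f_cj), g(f_ck)
    loss_indiv = F.mse_loss(f_cj, g_ck)            # ||f(x_cj) - g(f(x_ck))||^2
    mixed = lam * g_cj + (1 - lam) * g_ck          # cross-domain mixture of projections
    loss_hetero = F.mse_loss(f_cj, mixed)          # ||f(x_cj) - mixture||^2
    return loss_indiv + loss_hetero
```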
By minimising the presented loss alongside the classification loss itself, authors achieve a pretty well-separated latent-space representation and get close to SotA without additional tricks.

Source could be found here.