Self Supervised Boy
Reading papers on self/semi/weak supervised DL methods. Papers here: https://www.notion.so/Self-Supervised-Boy-papers-reading-751aa85ffca948d28feacc45dc3cb0c0
contact me @martolod
Channel created
Who are you? I'm a PhD student doing DL research, mostly on weak/self-supervision, and on unsupervised things as well.
What happens here? I write reviews of the papers I read.
Why the hell? Because it lets me practice writing and understand the papers I read more deeply.
So what? I'll be happy if it's somehow interesting to someone else. Anyway, here's my archive: https://www.notion.so/Self-Supervised-Boy-papers-reading-751aa85ffca948d28feacc45dc3cb0c0.
Self-training über alles. Another paper on self-training from Quoc Le's group.
They compare self-training with supervised and self-supervised pre-training on different tasks. Self-training seemingly works better, while pre-training can even hurt final quality when enough labeled data is available or strong augmentation is applied.
The main practical takeaway: self-training adds quality even on top of pre-training, so it may be worth self-training your baseline models to get a better start.
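A minimal sketch of such a self-training round (my simplification, not the paper's exact recipe; the teacher and student models and the data loaders are assumed placeholders):

import torch

def pseudo_label(teacher, unlabeled_loader, threshold=0.9):
    # Collect confident teacher predictions as pseudo-labels.
    teacher.eval()
    pseudo = []
    with torch.no_grad():
        for x in unlabeled_loader:
            probs = torch.softmax(teacher(x), dim=1)
            conf, labels = probs.max(dim=1)
            keep = conf > threshold  # keep only confident samples
            pseudo.append((x[keep], labels[keep]))
    return pseudo

def self_train_step(student, optimizer, criterion, x, y):
    # One ordinary supervised step, run on real and pseudo-labeled batches alike.
    student.train()
    optimizer.zero_grad()
    loss = criterion(student(x), y)
    loss.backward()
    optimizer.step()
    return loss.item()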
More detailed with tables here: https://www.notion.so/Rethinking-Pre-training-and-Self-training-e00596e346fa4261af68db7409fbbde6
Source here: https://arxiv.org/pdf/2006.06882.pdf
Unsupervised segmentation with autoregressive models. The authors propose to scan the image in different scanning orders and require that nearby pixels produce close embeddings regardless of the scanning order.
SoTA across unsupervised segmentation benchmarks.
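A rough sketch of that consistency idea (my own simplification: I emulate a second scanning order by flipping the input, which reverses each raster row; ar_encoder is an assumed autoregressive encoder returning per-pixel embeddings):

import torch
import torch.nn.functional as F

def scan_consistency_loss(ar_encoder, image):
    # Embeddings of the same pixel under two scanning orders should match.
    emb_a = ar_encoder(image)                        # (B, C, H, W), raster order
    emb_b = ar_encoder(torch.flip(image, dims=[3]))  # reversed scanning order
    emb_b = torch.flip(emb_b, dims=[3])              # align pixels back
    return 1 - F.cosine_similarity(emb_a, emb_b, dim=1).mean()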
More detailed with images and losses here: https://www.notion.so/Autoregressive-Unsupervised-Image-Segmentation-211c6e8ec6174fe9929e53e5140e1024
Source here: https://arxiv.org/pdf/2007.08247.pdf
One more update on the teacher-student paradigm from Quoc Le's group.
Now the teacher is continuously updated to direct the student towards the optimum w.r.t. the labeled data. At each step, the update gradient for the teacher model is taken as the gradient towards the current pseudo-labels, and is then scaled according to the cosine similarity between two gradients of the student model: from unlabeled and from labeled data.
Achieves a new SoTA on ImageNet (+1.6% top-1 accuracy).
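A very rough sketch of that scaling factor as I read it (placeholder models and losses; the actual derivation in the paper is more careful):

import torch
import torch.nn.functional as F

def teacher_update_scale(student, criterion, x_unlab, pseudo_y, x_lab, y_lab):
    # Cosine similarity between student gradients on pseudo-labeled and labeled
    # data, used to scale the teacher's gradient towards the current pseudo-labels.
    g_unlab = torch.autograd.grad(criterion(student(x_unlab), pseudo_y),
                                  student.parameters())
    g_lab = torch.autograd.grad(criterion(student(x_lab), y_lab),
                                student.parameters())
    flat = lambda grads: torch.cat([g.reshape(-1) for g in grads])
    return F.cosine_similarity(flat(g_unlab), flat(g_lab), dim=0)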

More detailed with formulas here: https://www.notion.so/Meta-Pseudo-Label-b83ac7b7086e47e1bef749bc3e8e2124
Source here: https://arxiv.org/pdf/2003.10580.pdf
An oral from ICLR 2021 on using the teacher-student setup for cross-domain transfer learning. The teacher is trained on the labeled data and produces pseudolabels for the unlabeled data in the target domain. This lets the student learn useful in-domain representations and gain 2.9% accuracy on one-shot learning with relatively low training effort.

With more fluff here: https://www.notion.so/Self-training-for-Few-shot-Transfer-Across-Extreme-Task-Differences-bfe820f60b4b474796fd0a5b6b6ad312
Source here: https://openreview.net/pdf?id=O3Y56aqpChA
One more oral from ICLR 2021. Theoretical this time, so there's no way I can set up a detailed overview.

Key points:
1) The authors altered the definition of a neighbourhood. Instead of measuring the distance between samples directly, they call x' a neighbour of x if there is an augmentation A such that the distance |A(x) - x'| is below a threshold (written out after this list).
2) Assumption 1: any small subset of in-class samples expands (by adding its neighbours) into a larger in-class subset of samples.
3) Assumption 2: the probability of x' being a neighbour of x while the two have different ground-truth labels is low, almost negligible.
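The neighbourhood from point 1 written out (my notation, not the paper's exact one):

\mathcal{N}(x) = \{\, x' : \exists\, A \in \mathcal{A} \ \text{such that}\ \lVert A(x) - x' \rVert \le r \,\}

and the expansion from point 2, roughly: for any small in-class subset S, P(\mathcal{N}(S)) \ge c \cdot P(S) for some c > 1.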

The authors show that these are sufficient conditions for consistency regularisation in self-supervised and transfer learning to give good results.

This nicely complements the previous paper on transfer learning, where the authors showed how consistency regularisation helps. It also ties in nicely with the work on smart augmentation strategies.

source: https://openreview.net/pdf?id=rC8sJ4i6kaH
A spotlight from ICLR 2021 by Schmidhuber's group. It proposes an unsupervised keypoint localisation algorithm with an RL application on Atari.

A very clear and simple idea:
1. Compress the image with a VAE and use features from some intermediate layer of the encoder later on.
2. Try to predict each feature vector from its surrounding vectors. If the prediction error is high, we have found some important object (sketched after this list).
3. Compress the error map of the image as a mixture of Gaussians with fixed covariance, each centre representing one keypoint.
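A sketch of the predictability step 2 (my simplification; features would come from an intermediate VAE encoder layer, and local_predictor is an assumed small network, e.g. a 1x1 Conv1d mapping a flattened neighbourhood back to a feature vector):

import torch
import torch.nn.functional as F

def predictability_error_map(features, local_predictor, k=3):
    # High prediction error = hard to predict from context = likely an object.
    B, C, H, W = features.shape
    # Gather the k x k neighbourhood around each position and drop the centre.
    patches = F.unfold(features, kernel_size=k, padding=k // 2)  # (B, C*k*k, H*W)
    patches = patches.view(B, C, k * k, H * W)
    centre = k * k // 2
    context = torch.cat([patches[:, :, :centre], patches[:, :, centre + 1:]], dim=2)
    pred = local_predictor(context.reshape(B, -1, H * W))        # assumed (B, C, H*W)
    target = features.view(B, C, H * W)
    return ((pred - target) ** 2).mean(dim=1).view(B, H, W)      # per-pixel error map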

SoTA on Atari games, more robust to input noise.

It could probably also be used outside the simple Atari setting, if you have enough data to train on and take later layers of the encoder.

With colorful images here: https://www.notion.so/Unsupervised-Object-Keypoint-Learning-Using-Local-Spatial-Predictability-ddcf36a856ff4e389050b3089cd710bc
Source here: https://openreview.net/pdf?id=GJwMHetHc73
Yet another paper from ICLR 2021. This one proposes an advanced method of pseudolabel generation.

In a few words: we simultaneously train an encoder-decoder model to predict the segmentation on supervised data, and to produce pseudolabels and predictions on unsupervised data that are consistent with each other independently of the augmentation.
As the pseudolabel they use a specially calibrated Grad-CAM from the encoder part of the model, fused with the prediction of the decoder part, again via a fancy procedure.

With some more fluff and notes here.
Source here.
Pretty simple keypoint localisation pipeline with self-supervision constraints for unlabeled data. Again from ICLR 2021.

The key ideas are:
1. Add a classification task for the type of keypoint as a function of the localisation network features. This is usually not required because of the fixed order of keypoints in the model predictions, but this small additional loss actually boosts performance more than the next two constraints.
2. Add the constraint that localising keypoints on a spatially augmented image should give the same result as spatially augmenting the localisation map (sketched after this list).
3. Add the constraint that the representation vectors of keypoints should be invariant to augmentation.
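A sketch of constraint 2, the equivariance one (my code, not the paper's; localiser is assumed to return per-keypoint heatmaps, and warp applies one fixed spatial transform to images and maps alike):

import torch

def equivariance_loss(localiser, image, warp):
    # Localising on a warped image should equal warping the localisation map.
    heatmaps = localiser(image)              # (B, K, H, W), one map per keypoint
    warped_maps = warp(heatmaps)             # transform the prediction
    maps_of_warped = localiser(warp(image))  # predict on the transformed input
    return ((warped_maps - maps_of_warped) ** 2).mean()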

And here they are, getting SoTA results on several challenging datasets, even when 100% of the dataset is used as labeled data.

With a bit more information here.
Source here.
Yet another simple approach leading to unsupervised segmentation. Mostly useful as pre-training, though.

The proposed pipeline first mines salient object areas (with any available framework, possibly a supervised one) and then runs contrastive learning on pixel embeddings inside those regions. In this second step, each individual pixel embedding is attracted to the mean embedding of its object and pushed away from the mean embeddings of other objects. This detail distinguishes it from some previously proposed pipelines and allows training at a larger scale, because the number of loss pairs grows more slowly.
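A sketch of that pixel-to-mean attraction and repulsion (my simplification; pixel_emb is an (N, D) tensor of L2-normalised embeddings of pixels from one object, own_mean and other_means the normalised region means):

import torch
import torch.nn.functional as F

def region_contrastive_loss(pixel_emb, own_mean, other_means, tau=0.1):
    # Pull pixel embeddings to their object's mean, push away from other means.
    pos = pixel_emb @ own_mean / tau         # (N,) similarity to own region
    neg = pixel_emb @ other_means.t() / tau  # (N, M) similarity to other regions
    logits = torch.cat([pos.unsqueeze(1), neg], dim=1)
    targets = torch.zeros(len(pixel_emb), dtype=torch.long)  # positive at index 0
    return F.cross_entropy(logits, targets)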

Less briefly and with some external links here.
Source here.
A bit old (NeurIPS 2019), but an interesting take on saliency prediction.

Instead of using a direct mixture of different unsupervised salient-region prediction algorithms and focusing on the fusion strategy, the authors propose to use distillation in neural networks as a way to refine each algorithm's predictions separately. The paper goes through several steps of distillation, self-training and the use of a moving average to stabilise the predictions of each method separately. After these steps, the authors employ the accumulated averages as labels for the final network training.
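The moving-average part fits in a couple of lines (my reading; alpha and the update schedule are assumptions):

def update_running_labels(running, new_prediction, alpha=0.9):
    # Exponential moving average of one method's predictions,
    # later used as the labels for the final network.
    return alpha * running + (1 - alpha) * new_prediction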

Slightly more words here.
Source here.
Forwarded from Gradient Dude
Facebook open-sourced a library for state-of-the-art self-supervised learning: VISSL.

+ It contains reproducible reference implementations of SOTA self-supervision approaches (like SimCLR, MoCo, PIRL, SwAV, etc.) and their components, which can be reused. It also supports supervised training.
+ It is easy to train a model on single-GPU, multi-GPU and multi-node setups. Seamless scaling to large-scale data and model sizes with FP16, LARC, etc.

Finally somebody unified all recent works in one modular framework. I don't know about you, but I'm very happy 😌!

VISSL: https://vissl.ai/
Blogpost: https://ai.facebook.com/blog/seer-the-start-of-a-more-powerful-flexible-and-accessible-era-for-computer-vision
Tutorials in Google Colab: https://vissl.ai/tutorials/
A new paper from Yann LeCun, who continues to introduce new features inspired by biology.

For each batch of input samples we produce two batches of vector representations, which differ only in the (randomly sampled) augmentation. From these we can calculate the cross-correlation matrix of the representations (cross-correlation between the two sets of augmentations, that is). The loss itself pushes this matrix to be as close to the identity as possible. Intuitively, this makes the representations invariant to the augmentation, since the main diagonal tends to 1, and non-redundant and non-trivial, since the values off the main diagonal tend to 0.
The authors show via rigorous ablation tests how this helps to (1) ease the requirements for a large batch size, (2) avoid the fuss of negative mining, and (3) take advantage of the dimensionality of the representation.
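The loss in a few lines, as I understand it (the normalisation details and the lambda value are my assumptions):

import torch

def barlow_twins_style_loss(z_a, z_b, lam=5e-3):
    # z_a, z_b: (N, D) representations of the same batch under two augmentations.
    N, D = z_a.shape
    z_a = (z_a - z_a.mean(0)) / z_a.std(0)  # normalise each dimension over the batch
    z_b = (z_b - z_b.mean(0)) / z_b.std(0)
    c = (z_a.T @ z_b) / N                   # (D, D) cross-correlation matrix
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()               # invariance term
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # redundancy reduction
    return on_diag + lam * off_diag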

Expanded in more detail here.
Much more discussion and a great overview of the area in the source.

P.S. It is always such a pleasure to read papers like this, where the authors propose concepts so clear that they have much more space left for discussion.
Interactive Weak Supervision paper from ICLR 2021.

In contrast to classical active learning, where experts are queried to assess individual samples, the idea of this paper is to have them assess automatically generated labeling heuristics. The authors argue that since experts are good at writing such heuristics from scratch, they should be able to judge auto-generated ones. To rank the heuristics that have not been assessed yet, the authors propose to train an ensemble of models to predict the assessor's mark for a heuristic. As input for these models they use a fingerprint of the heuristic: its concatenated predictions on some subset of the data.
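The fingerprint idea in code (my sketch; heuristic is a labeling function and probe_set an assumed fixed subset of unlabeled samples):

import numpy as np

def heuristic_fingerprint(heuristic, probe_set):
    # Represent a labeling heuristic by its concatenated predictions on a fixed
    # probe set; an ensemble then regresses expert marks from these fingerprints.
    return np.concatenate([np.atleast_1d(heuristic(x)) for x in probe_set])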

There are no very fancy results, there are some concerns raised by the reviewers, and there is some strange notation in this paper. Yet the idea looks interesting to me.

With a bit deeper description (and one unanswered question) here.
Source (and rebuttal comments with important links) there.