Neural Networks Engineering – Telegram
Authored channel about neural network development and mastering machine learning. Experiments, tool reviews, personal research.

#deep_learning
#NLP

Author @generall93
1. Representation-based architecture. 2. Interaction-based architecture (MV-LSTM)
Neural Networks debugging

When training a neural network, it is often unclear why the network is not learning. Is the problem in the training parameters or in the network architecture?
A brute-force search on the full training dataset can be very time-consuming, even with GPU acceleration.
If you have to write code on your laptop and run it on a remote machine, the process becomes even more painful.
One way to solve this problem is to use synthetic datasets for debugging.
The idea is to create small sets of examples, each a little more complex than the previous one.
Let me illustrate this approach. In the example in the picture, we can see that the model is able to distinguish:
- Object presence
- Shapes
- Colors
- Rotation
- Stroke
It cannot distinguish alignment or the number of objects. Keep in mind that the number of layers and neurons should be scaled down to match the size of the synthetic data, or the network will overfit. Knowing the evaluation results, we can quickly iterate over modifications to the network architecture.
Of course, solving a synthetic dataset does not guarantee solving real-life tasks, just as passing a unit test does not guarantee that the code has no bugs.
But there is another useful thing we can do: with a large number of small experiments, we can detect relations between the results and changes in network parameters. This information helps us concentrate on tuning the significant parameters when training on real data.
Images are not the only thing that can be synthesized for training. In my NEL project I am using 13 synthetic text datasets. The size of these datasets allows me to debug the neural network on a laptop without any GPU. You can find the code for generating them here.
Writing data generation code can be time-consuming and boring, so the next possible step in NN debugging is to create tools, a framework, or even a language for data generation. With a declarative SQL-like language, it would be possible to create datasets automatically, for example using some kind of evolution strategy. Unfortunately, I was unable to find anything suitable for this task, so it is a good place for you to contribute!
Practical example.

Situation: the CDSSM model does not learn well on a big dataset of natural language sentences. What is our next step?
Running the model on several synthetic datasets, we notice that CDSSM is unable to handle the following simple data:
Each sentence consists of several words drawn from N topic words, plus random noise words. Two sentences match only if they have at least one word in common.
Example (a label followed by two three-word sentences):

1 ldwjum mugqw sohyp sohyp dwguv mugqw
0 ldwjum mugqw sohyp labhvz epqori kfdo
1 xnwpjc agqv lmjoh wvncu tekj lmjoh
0 xnwpjc agqv lmjoh jhnt fhzb xauhq
1 vflcmn pnuvx eolwrj dhfvbt vflcmn toxeyc
0 vflcmn pnuvx eolwrj dhfvbt yetkah bfnxqp
1 rybmae bwcej xnwpjc bwcej yrhefk yhca
0 rybmae bwcej xnwpjc bhck zbfj yhca
1 sohyp htdp symc jrvsyn symc fpoxj
0 sohyp htdp symc eolwrj masq hjzrp
1 dhfvbt yetkah omsaij omsaij dhfvbt tqdef
0 dhfvbt yetkah omsaij zilrh wvncu sohyp
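
A dataset of this kind takes only a few lines to generate. The sketch below is not the generator from my repository, just a minimal illustration: the function names, the word lengths, the three-word sentences and the alternation of positive and negative pairs are all assumptions made for the example.

```python
import random
import string


def random_word(rng, min_len=4, max_len=6):
    """Return a random lowercase pseudo-word."""
    length = rng.randint(min_len, max_len)
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def make_pair(rng, topic_words, positive, sentence_len=3):
    """Build one (label, first, second) example.

    The first sentence is drawn from the topic vocabulary. A positive second
    sentence repeats one word of the first and is padded with noise; a
    negative one is pure noise with no word in common.
    """
    first = rng.sample(topic_words, sentence_len)
    second = [rng.choice(first)] if positive else []
    while len(second) < sentence_len:
        word = random_word(rng)
        # Noise must never create an accidental match.
        if word not in first:
            second.append(word)
    rng.shuffle(second)
    return int(positive), first, second


def generate(n_pairs, n_topics=20, seed=0):
    """Yield lines in the 'label w1 w2 w3 w4 w5 w6' layout used above."""
    rng = random.Random(seed)
    topic_words = [random_word(rng) for _ in range(n_topics)]
    for i in range(n_pairs):
        # Alternate positive and negative pairs (an arbitrary choice).
        label, first, second = make_pair(rng, topic_words, positive=(i % 2 == 0))
        yield " ".join([str(label)] + first + second)


for line in generate(4):
    print(line)
```

Fixing the seed keeps the dataset reproducible, which matters when you compare architectures on it.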

CDSSM overfits on this data. The reason is that it is not known which word will be useful for matching until the sentences are actually compared.
Moreover, the final matching layer of CDSSM cannot hold enough information about each individual word of the original sentence.
So the only way for the network to minimize the error is to memorize the noise in the training set. Such behavior is easy to recognize on loss plots.
The training loss goes down quickly while the validation loss grows - typical overfitting (pic 1.).
One possible way to solve this dataset is to switch to a network architecture that can handle low-level interactions between the words of a sentence pair.
In the previous post I mentioned interaction-focused models, exactly the type of model we need. I chose the ARC-II architecture for my experiment; you can check out the implementation here. The new model fits the synthetic data perfectly well (pic 2.). As a result, we can safely skip the time-consuming experiments with the CDSSM model on the real dataset.
1. CDSSM overfitting
2. ARC-II doing well
ARC-II Network
tanh (blue) vs ReLU (orange)
Dropout effect

* Green - 0% dropout. Overfitting
* Gray - 10% dropout. Best learning
* Orange - 20% dropout.
* Blue - 30% dropout. Underfitting
FastText embeddings done right


An important feature of FastText embeddings is their use of subword information.
In addition to the word vocabulary, a FastText model also stores vectors for the character n-grams of words.
This additional information is useful for handling out-of-vocabulary words, extracting meaning from the parts a word is built from, and dealing with misspellings.

Unfortunately, these advantages go unused in most open-source projects.
We can easily see this on GitHub (pic.). The point is that a regular Embedding layer maps a whole word to a single fixed vector stored in memory. In this case all word vectors have to be generated in advance, so none of the cool features work.

The good thing is that using FastText correctly is not that difficult! FacebookResearch provides an example of the proper way to use FastText in the PyTorch framework.
Instead of Embedding you should choose the EmbeddingBag layer. It combines the n-gram vectors into a single word vector, which can then be used as usual.
This way our neural network gets all of the advantages.
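
To make the idea concrete without pulling in PyTorch, here is a minimal pure-Python sketch of what a bag-of-n-grams lookup does: split a word into character n-grams, map each n-gram to a row of a table, and average the rows. This is an illustration under assumptions, not FastText's or PyTorch's actual code: the class and function names are made up, zlib.crc32 is only a stand-in for FastText's own hash function, and the table is filled with random numbers instead of trained vectors.

```python
import random
import zlib


def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(wrapped) - n + 1)]


def ngram_ids(word, num_buckets):
    """Hash every n-gram into a fixed number of buckets (stand-in hash)."""
    return [zlib.crc32(gram.encode("utf-8")) % num_buckets
            for gram in char_ngrams(word)]


class BagOfNgramsEmbedding:
    """Averages n-gram rows into one word vector, like a 'mean' embedding bag."""

    def __init__(self, num_buckets, dim, seed=0):
        rng = random.Random(seed)
        self.num_buckets = num_buckets
        self.dim = dim
        # Random rows stand in for trained n-gram vectors.
        self.table = [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
                      for _ in range(num_buckets)]

    def word_vector(self, word):
        rows = [self.table[i] for i in ngram_ids(word, self.num_buckets)]
        if not rows:
            # Word too short to produce any n-gram.
            return [0.0] * self.dim
        return [sum(column) / len(rows) for column in zip(*rows)]


embedding = BagOfNgramsEmbedding(num_buckets=1000, dim=8)
# Any string gets a vector, including words never seen in training.
vector = embedding.word_vector("misspeled")
```

Because the vector is assembled from n-grams at lookup time, nothing has to be generated in advance for words outside the vocabulary.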
Parallel preprocessing with multiprocessing

Using multiple processes to construct training batches can significantly reduce the total training time of your network.
Basically, if you are training on a GPU, you can reduce the extra batch construction time almost to zero. This is achieved by pipelining the computations: while the GPU crunches numbers, the CPU does the preprocessing. The Python multiprocessing module lets us implement such pipelining as elegantly as is possible in a language with a GIL.

The PyTorch DataLoader class, for example, also uses multiprocessing internally.
Unfortunately, DataLoader lacks flexibility. It is hard to create batches with a complex structure within the standard DataLoader class. So it is useful to be able to apply raw multiprocessing.

multiprocessing gives us a set of useful APIs for distributing computations among several processes. Processes do not share memory with each other, so data is transmitted via inter-process communication. For example, on Linux-like operating systems multiprocessing uses pipes. This design leads to some pitfalls, which I am going to describe.

* map vs imap

The map and imap methods can be used to apply preprocessing to batches. Both take a processing function and an iterable as arguments. The difference is that imap is lazy: it returns processed elements as soon as they are ready, so the processed batches do not all have to be stored in RAM simultaneously. For training a NN you should always prefer imap:

from multiprocessing import Pool

def process(batch_reader):
    # imap yields results lazily, in input order
    with Pool(threads) as pool:
        for batch in pool.imap(foo, batch_reader):
            ....
            yield batch
            ....


* Serialization

Another pitfall comes from the need to transfer objects through pipes. In addition to the processing results, multiprocessing will also serialize the transformer object if it is used like this: pool.imap(transformer.foo, batch_reader). The transformer will be serialized and sent to the subprocess. This can cause problems if the transformer object has large properties. In this case it may be better to store large properties as singleton class variables:


class Transformer():
    # Class-level attribute: not part of the instance state that gets pickled
    large_dictionary = None

    def __init__(self, large_dictionary, **kwargs):
        self.__class__.large_dictionary = large_dictionary

    def foo(self, x):
        ....
        y = self.large_dictionary[x]
        ....
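
The effect is easy to measure with pickle, which is what multiprocessing uses to ship the function to a worker. The sketch below compares the pickled size of a bound method in both layouts; the class names HeavyTransformer and LightTransformer and the helper payload_size are hypothetical and exist only for this comparison.

```python
import pickle


class HeavyTransformer:
    """Keeps the big lookup table on the instance, so it is pickled with it."""

    def __init__(self, large_dictionary):
        self.large_dictionary = large_dictionary

    def foo(self, x):
        return self.large_dictionary[x]


class LightTransformer:
    """Keeps the big lookup table on the class, outside the pickled state."""

    large_dictionary = None

    def __init__(self, large_dictionary):
        type(self).large_dictionary = large_dictionary

    def foo(self, x):
        return self.large_dictionary[x]


def payload_size(bound_method):
    """Bytes needed to pickle a bound method together with its instance."""
    return len(pickle.dumps(bound_method))


table = {i: i * i for i in range(100_000)}
heavy_size = payload_size(HeavyTransformer(table).foo)
light_size = payload_size(LightTransformer(table).foo)
print(heavy_size, light_size)
```

One caveat: a class variable assigned at runtime reaches the workers only if they are forked after the assignment. With the spawn start method (the default on Windows and macOS) each worker re-imports the module and sees the initial None, so there the table has to be loaded in a pool initializer instead.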


Another difficulty you may encounter is a preprocessor that is faster than GPU learning. In this case processed batches that have not been consumed yet accumulate in memory. If your memory is not large enough, you will get an Out-of-Memory error. One way to solve this problem is to hold back batch preprocessing until the GPU has caught up.
A semaphore is a perfect solution for this task:

from multiprocessing import Pool, Semaphore


def batch_reader(semaphore):
    for batch in source:
        # Blocks once `limit` batches are in flight
        semaphore.acquire()
        yield batch


def process(x):
    return x + 1


def pooling():
    with Pool(threads) as pool:
        semaphore = Semaphore(limit)
        for x in pool.imap(process, batch_reader(semaphore)):
            yield x
            # The consumer is done with x: let one more batch in
            semaphore.release()


for x in pooling():
    learn_gpu(x)


The semaphore has an internal counter synchronized across all working processes. semaphore.acquire() decrements the counter and blocks when it reaches zero, so no more than limit batches can be in flight until semaphore.release() is called.
There are cases when you need to run your model on a small machine.
For example, your model is called once per hour, or you just don't want to pay Amazon $150 per month for a t2.2xlarge instance with 32 GB of RAM.
The problem is that pre-trained word embeddings can reach tens of gigabytes in size.

In this post, I will describe a method of accessing word vectors without loading them into memory.
The idea is simply to save the word vectors as a matrix, so that we can compute the position of each row without reading any other rows from disk.
Fortunately, all this logic is already implemented in numpy.memmap.
The only thing we need to implement ourselves is the function that converts a word into the appropriate index. We can simply keep the whole vocabulary in memory or use the hashing trick; it does not matter at this point.
It is slightly harder to store FastText vectors this way, because obtaining a word vector requires additional computation over n-grams.
So, for simplicity, we will just pre-compute vectors for all the words we need.

You may take a look at a simple implementation of the described approach here:
https://github.com/generall/OneShotNLP/blob/master/src/utils/disc_vectors.py

The DiscVectors class contains a method for converting a fastText .bin model into an on-disk matrix representation plus a JSON file with the vocabulary and meta-information.
Once the model is converted, you can retrieve vectors with the get_word_vector method. A performance check shows that in the worst case it takes 20 µs to retrieve a single vector, which is pretty good given that we are not using any significant amount of RAM.
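
The row arithmetic behind this is simple enough to show with the standard library alone. The sketch below is not the DiscVectors code and does not use numpy.memmap; it spells out by hand the offset computation that memmap does for you. The class name OnDiskVectors, the file names and the JSON layout are assumptions for the example; only the method name get_word_vector is borrowed from the class described above.

```python
import json
import os
import struct
import tempfile

FLOAT_SIZE = 4  # bytes per float32 value


def save_vectors(vectors, matrix_path, vocab_path):
    """Write vectors as one flat float32 matrix plus a JSON word index."""
    words = sorted(vectors)
    dim = len(vectors[words[0]])
    with open(matrix_path, "wb") as matrix_file:
        for word in words:
            matrix_file.write(struct.pack(f"<{dim}f", *vectors[word]))
    with open(vocab_path, "w") as vocab_file:
        index = {word: row for row, word in enumerate(words)}
        json.dump({"dim": dim, "index": index}, vocab_file)


class OnDiskVectors:
    """Reads single rows from the matrix file; only the index lives in RAM."""

    def __init__(self, matrix_path, vocab_path):
        with open(vocab_path) as vocab_file:
            meta = json.load(vocab_file)
        self.dim = meta["dim"]
        self.index = meta["index"]
        self.matrix_file = open(matrix_path, "rb")

    def get_word_vector(self, word):
        row_bytes = self.dim * FLOAT_SIZE
        # Row position is computed directly; no other row is read.
        self.matrix_file.seek(self.index[word] * row_bytes)
        return list(struct.unpack(f"<{self.dim}f", self.matrix_file.read(row_bytes)))

    def close(self):
        self.matrix_file.close()


with tempfile.TemporaryDirectory() as folder:
    matrix_path = os.path.join(folder, "vectors.bin")
    vocab_path = os.path.join(folder, "vocab.json")
    save_vectors({"cat": [0.5, 1.0], "dog": [2.0, -4.0]}, matrix_path, vocab_path)
    store = OnDiskVectors(matrix_path, vocab_path)
    dog = store.get_word_vector("dog")
    cat = store.get_word_vector("cat")
    store.close()
```

With numpy.memmap the seek and unpack steps collapse into plain row indexing on the mapped array.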
Another approach to transfer learning in NLP is Question Answering.
In the most general case, Question Answering is the generation of a textual answer to a given question from a given set of facts in some form.
You can find a demo of a QA system here

There are many types of these systems:

Categorized by facts representation:

A. Relational database
B. Complex data structure - ontology, semantic web, etc.
C. Text

Categorized by answer types

1. Yes/No - a particular case of matching models
2. Finding the boundary indexes of the answer in the text
3. Generating the answer from a given text and question

Categorized by question type

a. The only possible question - the model has no input for questions; it learns to answer the single question defined by the training set
b. Constant number of questions - the model has a one-hot encoded input for questions.
c. Textual question in a special query language - projects like this
d. Textual question in free form - the model is supposed to somehow encode the text of the question.

For example, this article deals with the combination C-2-d in this categorization.
This combination makes it necessary to use complex bi-directional attention mechanisms like BiDAF.
I, on the contrary, want to concentrate on generating answers without initial markup in the form of answer boundaries. And I will not care about complex question representations, for now.
Let's start with a synthetic data baseline, as described in my previous posts.
In this notebook I wrote a list of data generators. Each one is slightly more complicated than the previous one.
In the next posts, I will describe my attempts to implement a neural network architecture. It should be able to generate correct answers for these datasets, starting from the simplest ones.
Let's continue to dive into Question Answering. Last time we generated several variants of synthetic sequences from which we need to extract "answers". Each sequence type has its own pattern, and we want a neural network to find it.
In the most general sense, this task looks like sequence transformation - Seq2seq, similar to NMT.
In this post, I will describe how to implement a simple Seq2seq network with the AllenNLP framework.

The AllenNLP library includes components that standardize and simplify the creation of neural networks for text processing.
Its developers did a great job decomposing a variety of NLP tasks into separate blocks.
This made it possible to implement a set of universal pipeline components suitable for reuse.
The implemented components can be used directly from code or referenced in configs.

I have created a repository for my experiments. It contains a simple config file along with some accessory files. Let's take a look.

One of the main configuration parameters is the model. The model determines what happens to the data and the network during training and prediction. The model parameter itself refers to a class that derives from allennlp.models.model.Model and implements the forward method.
We will use the simple_seq2seq model, which implements a basic sequence transformation scheme.

In classical seq2seq, the source sequence is transformed by an Encoder into a representation, which is then read by a Decoder to generate the target sequence.
The simple_seq2seq module implements only the Decoder. The Encoder should be implemented in another class and passed as a model parameter.
We will use the simplest encoder option - LSTM.

Here are some other useful model parameters:

- source_embedder - this class assigns a pre-trained vector to each input token. We have no pre-trained vectors for synthetic data, so we will use random vectors. We will also make them untrainable to prevent overfitting.
- attention - the attention function used at each decoding step. The attention vector is concatenated with the decoder state.
- beam_size - the number of variants generated by beam search during decoding.
- scheduled_sampling_ratio - defines how often, during training, the decoder is fed its own previous prediction instead of the real previous element.
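
Put together, the model section of a config could look roughly like the sketch below, written as a Python dict that is dumped to JSON. This is not the config from my repository. The top-level parameter names are the ones discussed above; the nested keys (tokens, embedding_dim, trainable, input_size, hidden_size, dot_product), max_decoding_steps and all numeric values are assumptions based on the AllenNLP releases of that period, so check them against the version you have installed.

```python
import json

EMBEDDING_DIM = 32  # arbitrary size for the random, frozen token vectors

model_config = {
    "type": "simple_seq2seq",
    # Random vectors that are not updated during training.
    "source_embedder": {
        "tokens": {
            "type": "embedding",
            "embedding_dim": EMBEDDING_DIM,
            "trainable": False,
        }
    },
    # The encoder is a separate component passed to the model.
    "encoder": {
        "type": "lstm",
        "input_size": EMBEDDING_DIM,
        "hidden_size": 64,
    },
    "attention": {"type": "dot_product"},
    "beam_size": 5,
    "scheduled_sampling_ratio": 0.2,
    "max_decoding_steps": 20,
}

config_text = json.dumps({"model": model_config}, indent=2)
print(config_text)
```

The encoder's input_size has to match the embedding dimension, which is why both refer to the same constant.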

Then we save our dataset in a form that the seq2seq dataset reader implemented in AllenNLP can work with. Now we can launch training with a single command, allennlp train config.json, and observe the training statistics in TensorBoard.
A trained model can easily be used from Python; here is an example.

It should be noted that the model quickly overfits on synthetic data, so I generated a lot of it.

Unfortunately, the AllenNLP seq2seq module is still under construction. It can't handle all existing variants of seq2seq models. For example, you can't implement the Transformer architecture from the article Attention Is All You Need. The Transformer requires a custom decoder, but the decoder is hardcoded in simple_seq2seq. If you want to contribute to the AllenNLP seq2seq model, please take a look at this issue. Leaving your reaction there will help draw the AllenNLP developers' attention to it.
Neural networks have achieved great success in various NLP tasks; however, they are limited in handling infrequent patterns. In this article, the problem is described in the context of the machine translation task.

The authors note that NMT is good at learning translation pairs that are frequently observed, but the system may ‘forget’ to use low-frequency pairs when it should. In contrast, in traditional rule-based systems, low-frequency pairs cannot be smoothed out no matter how rare they are. One solution is to combine both approaches.

The authors propose using a large external memory together with a selection mechanism that lets the NN use this memory. Selection fetches all relevant translation variants using the words of the source sentence, and then an attention mechanism selects among these variants. After that, the neural net decides which source of prediction should be used at each translation step.
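
The three steps (fetch, attend, choose the source) can be sketched in a few lines of plain Python. This is only my illustration of the general idea, not the formulation from the article: the dot-product scoring, the dictionary-shaped memory and especially the fixed scalar gate that interpolates the two distributions are assumptions. In a real model the gate would be predicted by the network at every step.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def fetch_candidates(source_words, memory):
    """Selection: collect every stored translation of every source word."""
    return [(target, vector)
            for word in source_words
            for target, vector in memory.get(word, [])]


def memory_distribution(decoder_state, candidates):
    """Attention: weight each candidate by its match with the decoder state."""
    if not candidates:
        return {}
    scores = [sum(q * k for q, k in zip(decoder_state, vector))
              for _, vector in candidates]
    distribution = {}
    for (target, _), weight in zip(candidates, softmax(scores)):
        distribution[target] = distribution.get(target, 0.0) + weight
    return distribution


def combine(model_dist, memory_dist, gate):
    """Mix both sources; gate in [0, 1] is the weight given to the memory."""
    if not memory_dist:
        return dict(model_dist)
    words = set(model_dist) | set(memory_dist)
    return {word: gate * memory_dist.get(word, 0.0)
                  + (1.0 - gate) * model_dist.get(word, 0.0)
            for word in words}


# Toy memory: source word -> list of (target word, separately trained vector).
memory = {
    "chat": [("cat", [1.0, 0.0])],
    "noir": [("black", [0.0, 1.0]), ("dark", [0.0, 0.5])],
}
candidates = fetch_candidates(["chat", "noir"], memory)
from_memory = memory_distribution([0.0, 2.0], candidates)
mixed = combine({"cat": 0.9, "the": 0.1}, from_memory, gate=0.5)
```

Since the memory vectors are only read here and never updated, they can come from a separate training run.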

The important thing is that the vectors for the external memory were trained separately. That basically means we can build knowledge bases for neural nets. This seems like a promising way to construct really large-scale models with huge capacity.