Neural Networks Engineering
An author's channel about neural network development and machine learning. Experiments, tool reviews, and personal research.

#deep_learning
#NLP

Author @generall93
Dropout effect

* Green - 0% dropout. Overfitting
* Gray - 10% dropout. Best result
* Orange - 20% dropout.
* Blue - 30% dropout. Underfitting
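
For context, this is roughly where such a dropout layer sits in a PyTorch model (the architecture below is made up for illustration, not the one from the experiment):

import torch.nn as nn

# Dropout randomly zeroes a fraction p of activations during training
model = nn.Sequential(
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Dropout(p=0.1),  # the 10% setting, the best run in the plot above
    nn.Linear(64, 10),
)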
FastText embeddings done right


An important feature of FastText embeddings is their use of subword information.
In addition to the vocabulary, FastText also stores vectors for word n-grams.
This additional information is useful for handling out-of-vocabulary words, extracting meaning from word morphology, and dealing with misspellings.

Unfortunately, all these advantages go unused in most open-source projects.
We can easily see this on GitHub (pic.). The point is that a regular Embedding layer maps each whole word to a single fixed vector stored in memory. In this case all word vectors have to be generated in advance, so none of the cool features work.

The good news is that using FastText correctly is not that difficult! Facebook Research provides an example of the proper way to use FastText with the PyTorch framework.
Instead of Embedding you should choose the EmbeddingBag layer. It combines a word's n-grams into a single word vector, which can then be used as usual.
This way our neural network gets all of the advantages.
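
A minimal sketch of the idea (the bucket count and n-gram ids below are made up): EmbeddingBag averages the rows for all n-gram ids of a word in a single call.

import torch
import torch.nn as nn

# n-gram vectors live in a hashed table; a word vector is the mean
# of the vectors of its (hashed) n-grams, as in FastText
num_buckets = 2_000_000  # hash-bucket count, an assumption for this sketch
dim = 300

embedding = nn.EmbeddingBag(num_buckets, dim, mode='mean')

# hypothetical hashed n-gram ids for a single word
ngram_ids = torch.tensor([17, 4093, 88177, 23], dtype=torch.long)
offsets = torch.tensor([0], dtype=torch.long)  # one bag = one word

word_vector = embedding(ngram_ids, offsets)  # shape: (1, dim)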
Parallel preprocessing with multiprocessing

Using multiple processes to construct training batches may significantly reduce the total training time of your network.
Basically, if you are training on a GPU, you can reduce the extra batch-construction time almost to zero. This is achieved by pipelining the computation: while the GPU crunches numbers, the CPU does the preprocessing. The Python multiprocessing module allows us to implement such pipelining as elegantly as is possible in a language with a GIL.

The PyTorch DataLoader class, for example, also uses multiprocessing in its internals.
Unfortunately, DataLoader suffers from a lack of flexibility. It is impossible to create a batch with an arbitrarily complex structure within the standard DataLoader class, so it is useful to know how to apply raw multiprocessing.

multiprocessing gives us a set of useful APIs to distribute computation across several processes. Processes do not share memory with each other, so data is transmitted via inter-process communication; on Linux-like operating systems, for example, multiprocessing uses pipes. This design leads to some pitfalls, which I am going to cover.

* map vs imap

The map and imap methods can both be used to apply preprocessing to batches. Both take a processing function and an iterable as arguments. The difference is that imap is lazy: it returns processed elements as soon as they are ready, so not all processed batches have to be stored in RAM simultaneously. For training a neural network you should always prefer imap:

from multiprocessing import Pool

def process(batch_reader, foo, threads=4):
    with Pool(threads) as pool:
        # imap is lazy: each batch is yielded as soon as it is ready,
        # so finished batches do not pile up in RAM
        for batch in pool.imap(foo, batch_reader):
            yield batch


* Serialization

The other pitfall is associated with the need to transfer objects via pipes. In addition to the processing results, multiprocessing will also serialize the transformer object if it is used like this: pool.imap(transformer.foo, batch_reader). The transformer will be pickled and sent to the subprocess, which may cause problems if the transformer object has large attributes. In this case it may be better to store the large attributes as singleton class variables:


class Transformer:
    large_dictionary = None  # class-level, so it is not pickled with the instance

    def __init__(self, large_dictionary, **kwargs):
        self.__class__.large_dictionary = large_dictionary

    def foo(self, x):
        ...
        y = self.large_dictionary[x]
        ...


Another difficulty you may encounter arises when preprocessing is faster than the GPU training step. In this case unprocessed batches accumulate in memory, and if your memory is not large enough you will get an out-of-memory error. One way to solve this problem is to throttle batch preprocessing until the GPU is done with the previous batches.
A semaphore is the perfect tool for this task:

from multiprocessing import Pool, Semaphore

threads = 4   # number of worker processes
limit = 10    # max number of preprocessed batches in flight

def batch_reader(semaphore, source):
    for batch in source:
        semaphore.acquire()  # blocks once `limit` batches are in flight
        yield batch

def process(x):
    return x + 1

def pooling(source):
    semaphore = Semaphore(limit)
    with Pool(threads) as pool:
        for x in pool.imap(process, batch_reader(semaphore, source)):
            yield x
            semaphore.release()  # free a slot once the batch is consumed

for x in pooling(range(100)):
    learn_gpu(x)  # stand-in for the GPU training step


The semaphore keeps an internal counter synchronized across all worker processes: semaphore.acquire() decrements it and blocks once the counter is exhausted, i.e. when limit batches are already in flight, and semaphore.release() frees a slot again.
There are some cases when you need to run your model on a small machine.
For example, if your model is called once per hour, or you just don't want to pay Amazon $150 per month for a t2.2xlarge instance with 32 GB of RAM.
The problem is that most pre-trained word embeddings reach tens of gigabytes in size.

In this post, I will describe a method for accessing word vectors without loading them into memory.
The idea is to save the word vectors as a matrix laid out so that we can compute the position of each row without reading any other rows from disk.
Fortunately, all this logic is already implemented in numpy.memmap.
The only thing we need to implement ourselves is the function that converts a word into the appropriate row index. We can simply store the whole vocabulary in memory or use the hashing trick; it does not matter at this point.
It is slightly harder to store FastText vectors this way, because obtaining a word vector requires additional computation over its n-grams.
So for simplicity, we will just pre-compute vectors for all the necessary words.
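
A minimal sketch of this approach (the file layout and the OOV fallback are assumptions of this sketch, not the actual implementation):

import json
import numpy as np

class DiskVectors:
    def __init__(self, matrix_path, vocab_path, dim=300):
        with open(vocab_path) as f:
            self.word_to_idx = json.load(f)  # only the vocabulary lives in RAM
        n_words = len(self.word_to_idx)
        # rows are read from disk on demand; nothing is loaded up front
        self.matrix = np.memmap(matrix_path, dtype='float32',
                                mode='r', shape=(n_words, dim))

    def get_word_vector(self, word):
        idx = self.word_to_idx.get(word)
        if idx is None:  # OOV: return zeros in this sketch
            return np.zeros(self.matrix.shape[1], dtype='float32')
        return np.array(self.matrix[idx])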

You may take a look at a simple implementation of the described approach here:
https://github.com/generall/OneShotNLP/blob/master/src/utils/disc_vectors.py

The DiscVectors class contains a method for converting a fastText .bin model into an on-disk matrix representation plus a JSON file with the vocabulary and meta-information.
Once the model is converted, you can retrieve vectors with the get_word_vector method. A performance check shows that in the worst case it takes 20 µs to retrieve a single vector, which is pretty good given that we are not using any significant amount of RAM.
Another approach to transfer learning in NLP is Question Answering.
In the most general case, Question Answering is the generation of a textual answer to a given question from a given set of facts in some form.
You can find a demo of a QA system here.

There are many types of such systems:

Categorized by fact representation:

A. Relational database
B. Complex data structures - ontologies, semantic web, etc.
C. Text

Categorized by answer type:

1. Yes/No - a particular case of matching models
2. Finding the bounding indexes of the answer in the text
3. Generating the answer from the given text and question

Categorized by question type:

a. The only possible question - the model has no input for questions; it learns to answer the single question defined by the training set
b. A constant number of questions - the model has a one-hot encoded input for questions
c. A textual question in a special query language - projects like this
d. A textual question in free form - the model is supposed to somehow encode the text of the question

For example, this article deals with the combination C-2-d of this categorization.
This combination leads to the necessity of complex bi-directional attention mechanisms like BiDAF.
I, on the contrary, want to concentrate on generating answers without initial markup in the form of answer boundaries. And I will not care about complex question representations for now.
Let's start with a synthetic-data baseline, as described in my previous posts.
In this notebook I wrote a list of data generators, each slightly more complicated than the previous one.
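
To give the flavor, a toy generator of this kind (the pattern here is made up for illustration, not one of the actual generators) might look like:

import random

# the "question" is implicit: find the token right after the marker
def generate_sample(vocab_size=50, length=10, marker=0):
    seq = [random.randint(1, vocab_size) for _ in range(length)]
    pos = random.randrange(length - 1)
    seq[pos] = marker           # plant the marker somewhere in the sequence
    answer = seq[pos + 1]       # the target is the token that follows it
    return seq, [answer]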
In the next posts, I will describe my attempts to implement a neural network architecture that should be able to generate correct answers for these datasets, starting from the simplest ones.
Let's continue our dive into Question Answering. Last time we generated several variants of synthetic sequences from which we need to extract "answers". Each sequence type has its own pattern, and we want a neural network to find it.
In the most general sense, this task looks like a sequence transformation: Seq2seq, similar to NMT.
In this post, I will describe how to implement a simple Seq2seq network with the AllenNLP framework.

The AllenNLP library includes components that standardize and simplify the creation of neural networks for text processing.
Its developers have done a great job decomposing a variety of NLP tasks into separate blocks.
This allowed them to implement a set of universal pipeline components suitable for reuse.
The implemented components can be used directly from code or referenced in configs.

I have created a repository for my experiments. It contains a simple config file along with some auxiliary files. Let's take a look.

One of the main configuration parameters is the model. The model determines what happens to the data and the network during training and prediction. The model parameter refers to a class that derives from allennlp.models.model.Model and implements the forward method.
We will use the simple_seq2seq model, which implements a basic sequence-transformation scheme.

In classical seq2seq, the source sequence is transformed by an Encoder into a representation, which is then read by a Decoder to generate the target sequence.
The simple_seq2seq module implements only the Decoder. The Encoder is implemented in a separate class, passed in as a model parameter.
We will use the simplest encoder option: an LSTM.

Here are some other useful model parameters (a config sketch follows the list):

- source_embedder - assigns a pre-trained vector to each input token. We have no pre-trained vectors for synthetic data, so we will use random vectors. We will also make them untrainable to prevent overfitting.
- attention - the attention function used at each decoding step. The attention vector is concatenated with the decoder state.
- beam_size - the number of variants generated by beam search during decoding.
- scheduled_sampling_ratio - defines whether to use the real or the generated element as the previous element during decoding.
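
A sketch of what the model section of such a config might look like (the dimensions and the attention type here are assumptions, not the values from my repository):

{
    "model": {
        "type": "simple_seq2seq",
        "source_embedder": {
            "tokens": {
                "type": "embedding",
                "embedding_dim": 16,
                "trainable": false
            }
        },
        "encoder": {
            "type": "lstm",
            "input_size": 16,
            "hidden_size": 32
        },
        "max_decoding_steps": 10,
        "attention": {"type": "dot_product"},
        "beam_size": 4,
        "scheduled_sampling_ratio": 0.5
    }
}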

Then we save our dataset in a format that the seq2seq dataset reader implemented in AllenNLP can work with. Now we can launch training with the single command allennlp train config.json and observe the training statistics in TensorBoard.
A trained model can easily be used from Python; here is an example.
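
Loading and querying a trained archive might look roughly like this (a sketch; the exact predictor name and imports depend on the AllenNLP version):

from allennlp.models.archival import load_archive
from allennlp.predictors import Predictor

archive = load_archive("model.tar.gz")  # produced by `allennlp train`
predictor = Predictor.from_archive(archive, "simple_seq2seq")
print(predictor.predict(source="1 2 3 0 5"))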

It should be noted that the model quickly overfits on synthetic data, so I generated a lot of it.

Unfortunately, the AllenNLP seq2seq module is still under construction. It can't handle all existing variants of seq2seq models. For example, you can't implement the Transformer architecture from the article "Attention Is All You Need": it requires a custom decoder, but the decoder is hardcoded in simple_seq2seq. If you want to contribute to the AllenNLP seq2seq model, please take a look at this issue. If you leave your reaction, it will help focus the AllenNLP developers' attention on it.
Neural networks have achieved great success at various NLP tasks; however, they are limited in handling infrequent patterns. In this article, the problem is described in the context of the machine translation task.

The authors note that NMT is good at learning translation pairs that are observed frequently, but the system may 'forget' to use low-frequency pairs when it should. In contrast, in traditional rule-based systems, low-frequency pairs cannot be smoothed out no matter how rare they are. One solution is to combine both approaches.

The authors propose using a large external memory along with a selection mechanism that allows the network to access this memory. Selection fetches all relevant translation variants using the words in the source sentence, and then an attention mechanism chooses among these variants. After that, the neural net decides which source of prediction should be used at each translation step.

The important thing is that the vectors for the external memory were trained separately. That basically means we can build knowledge bases for neural nets, which seems like a promising way to construct really large-scale models with huge capacity.
I wrote an article on Medium about squeezing fastText into Colab.

TL;DR: the original binary fastText model is too large for Colab.
We can shrink it, but this is a little tricky for the n-gram matrix: we need to take the uniformity of the collision distribution into account.
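
To give the general flavor (a toy sketch, not the exact method from the article): re-hash the n-gram rows into a smaller table and average the rows that collide.

import numpy as np

def shrink_ngram_matrix(ngram_matrix, new_buckets):
    # accumulate colliding rows, then average them
    dim = ngram_matrix.shape[1]
    shrunk = np.zeros((new_buckets, dim), dtype=ngram_matrix.dtype)
    counts = np.zeros(new_buckets, dtype=np.int64)
    for old_id, row in enumerate(ngram_matrix):
        new_id = old_id % new_buckets  # simplistic re-hashing
        shrunk[new_id] += row
        counts[new_id] += 1
    counts[counts == 0] = 1  # avoid division by zero for empty buckets
    return shrunk / counts[:, None]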

The final model takes 2 GB of RAM instead of 16 GB and is 94% similar to the original model.

Code is also provided.
I have finished building a demo and a landing page for my project on mention classification. The idea of this project is to create a model that can assign labels to objects based on their mentions in context. Right now it works only for mentions of people, but if there is interest in this work, I will extend the model to other types, like organizations or events. For now, you can check out the online demo of the neural network.

The current implementation can take several mentions into account at a time, so it can distinguish the relevant parts of the context rather than just averaging predictions.
It is also open source and built with the AllenNLP framework, from training to serving. Take a look at it.
More technical details of the implementation are coming later.
Partially trainable embeddings

Understanding the meaning of natural language requires a huge amount of information to be arranged in a neural network.
And the largest part of this information is usually stored in the word embeddings.

Typically, the labeled data of a particular task is not enough to train so many parameters. Thus, word embeddings are trained separately on large general-purpose corpora.

But there are some cases when we want to be able to train word embeddings in our custom task, for example:

- We have a specific domain with non-standard terminology or sentence structure
- We want to use additional markup like <tags> in our task

In these cases, we need to update a small number of weights responsible for the new words and meanings. At the same time, we can't update the pre-trained embeddings, because that would lead to very quick overfitting.

To deal with this problem, partially trainable embeddings were used in this project.
The idea is to concatenate fixed pre-trained embeddings with an additional small trainable embedding. It is also useful to add a linear layer right after the concatenation, so the embeddings can interact during training.
Changing the size of the additional embedding gives control over the number of parameters and, as a result, helps prevent overfitting.
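
In plain PyTorch, the same idea could be sketched like this (the sizes and names here are made up):

import torch
import torch.nn as nn

class PartiallyTrainableEmbedding(nn.Module):
    def __init__(self, pretrained, extra_dim=20, out_dim=300):
        super().__init__()
        # fixed part: pre-trained vectors, excluded from gradient updates
        self.fixed = nn.Embedding.from_pretrained(pretrained, freeze=True)
        # small trainable part, learned on the task
        self.trainable = nn.Embedding(pretrained.size(0), extra_dim)
        # the linear layer lets the two parts interact
        self.projection = nn.Linear(pretrained.size(1) + extra_dim, out_dim)

    def forward(self, token_ids):
        combined = torch.cat(
            [self.fixed(token_ids), self.trainable(token_ids)], dim=-1)
        return self.projection(combined)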

Another good thing is that AllenNLP allows implementing this technique without a single line of code, with just a simple configuration:

{
    "token_embedders": {
        "tokens-ngram": {
            "type": "fasttext-embedder",
            "model_path": "./data/fasttext_embedding.model",
            "trainable": false
        },
        "tokens": {
            "type": "embedding",
            "embedding_dim": 20,
            "trainable": true
        }
    }
}
Filterable approximate nearest neighbors search

I did a little research on how to search in a vector space when you also need to take additional restrictions into account: searching in a subset, filtering by a numerical criterion, or by geo location.
The article turned out to be too large for the Telegram channel format, so I'll leave only the essence here.
The full article is available on my updated blog.

The main point is that with minor modifications of the state-of-the-art HNSW algorithm we can cover a variety of filtering cases.
The modification is to add edges to the navigation graph to ensure that it stays connected after some part of its nodes is filtered out.

Looking at filtering by category, we can see that adding edges within each small category solves the connectivity problem for it.
Large categories sustain their connectivity due to the laws of percolation theory.
Filtering by categories can be extended relatively easily to numerical range filtering and to a geospatial index.

In the full version of this article I also present a couple of experiments validating this approach.
It also contains some considerations on how to avoid possible failures.
Take a look!
Recently I found an interesting repository on GitHub.
Actually, it is not a single repository but a whole project created by CAIR, a research center at the University of Agder.
It includes a bunch of articles and different implementations of a novel concept called the Tsetlin Machine.
The author claims that this approach can replace neural networks while being faster and more accurate.
The work itself looks quite marginal: it is not recent but has not become widely used.
It is noticeable that it stays alive only thanks to the enthusiasm of a few people.

From public sources, I found only an overselling press release from their own university and a skeptical thread on Reddit. As rightly noted in the latter, there are quite a few red flags and imperfections in this work, including excessive self-citation, unconvincing MNIST experiments, and a poorly written article that is difficult to read.

However, I still decided to spend a little time reading about the concept itself: using the states of finite automata with linear tactics as the trainable parameters of the model.
The states define whether a signal is used in a logical clause or not.
The model is trained with two types of feedback: the first fights false-negative activations of a clause, and the second false-positive ones.

The author shows benchmarks of the model on a couple of different tasks but pays little attention to the main problem: no method is provided to make the Tsetlin Machine truly deep.
Instead, he suggests training it layer by layer, the way Hinton trained Deep Belief Networks.
This restriction won't let the Tsetlin Machine compete with neural networks in most areas.

On the other hand, there are no theoretical limitations preventing a discrete feedback propagation mechanism from existing.
I am going to conduct some experiments with this concept and will keep you posted if something works out.
Tools for setting up a new ML project.

I compiled a list of tools I find worth a try if you are going to set up a new ML project.
The list is not intended to be an exhaustive overview, and it does not include any ML frameworks or libraries.
It is focused on auxiliary tools that can make development easier and experiments reproducible.
Some of these tools I have used in real projects; others I have only tried on a toy example but found interesting enough to use in the future.
Filterable HNSW - part 2

In the previous article on filtering during nearest-neighbor search, we discussed the theoretical background.
This time I am going to present a C++ implementation with Python bindings.

As the base implementation of HNSW I took hnswlib, a stand-alone, header-only implementation of HNSW.

With the new implementation it is now possible to assign an arbitrary number of tags to any point with simple code:

# ids - list of point ids
# tag - tag id
hnsw.add_tags(ids, tag)


A group of points under the same tag can be searched separately from the others:

query_vector = ...
tag_to_search_in = 42
# Search among points with this tag
condition = [[(False, tag_to_search_in)]]
labels, dist = hnsw.knn_query(query_vector, k=10, conditions=condition)


These groups can also be combined using boolean expressions: a condition is a conjunction of disjunctions of clauses, where each clause checks whether a tag is assigned to a point and the boolean flag marks negation. For example, (A | !B) & C is represented as [[(0, A), (1, B)], [(0, C)]].
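
With hypothetical tag ids, the same condition could be passed like this:

# (A | !B) & C, where the first element of each pair marks negation
A, B, C = 1, 2, 3
condition = [[(False, A), (True, B)], [(False, C)]]
labels, dist = hnsw.knn_query(query_vector, k=10, conditions=condition)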

If a group is large enough (>> 1/M fraction of all points), knn_query should work fine. But if the group is smaller, additional connections may need to be built in the HNSW graph for it:

hnsw.index_tagged(tag=42, m=8)


Based on HNSW with categorical filtering, it is possible to build a tool that searches in a specified geo-region only.

Find a full version of this article with more examples and explanations in my blog.
ONNX and deployment libraries

Libraries like AllenNLP are great for model training and prototyping; they contain functions and helpers for almost any practical and theoretical task.
Some of these libraries even have functions for model serving, but they still might be a poor choice for serving a model in production.

The very same functionality that makes them convenient for development makes them hard to support in a production environment.
A Docker image with only AllenNLP installed takes up a whole 1.9 GB compressed! It could hardly be called a micro-service.

In TensorFlow this problem was solved by saving computational graphs in a special serialization format, independent of the training and preprocessing libraries.
This serialized graph can later be served by the TensorFlow Serving service.
A good solution, but not a universal one: there are plenty of frameworks, like PyTorch, that do not follow Google's standard.

Now, this is the part where ONNX appears: an open standard for NN representation.
It defines a common set of operators, the building blocks of machine learning and deep learning models.
Not every valid Python-PyTorch model can be converted into an ONNX representation; only a subset of operations is valid for ONNX.

Unfortunately, default implementation of most AllenNLP models does not fit this subset:

- An AllenNLP model handles a vast variety of corner cases and conditions that are essentially Python functions. ONNX does not support arbitrary code execution; an ONNX model must consist of a computation graph only.
- AllenNLP models take care of text preprocessing: they operate with dictionaries and tokenization. ONNX does not support these operations.

Luckily, in most cases an AllenNLP model can be used as just a wrapper around the actual model implementation.
For this, you need an AllenNLP model that handles the loss function, does the preprocessing, and interacts with the model trainer.
And also an internal class for the "pure" model, which implements the standard nn.Module interface.
It should use tensors as its input and output, and internally it should construct a persistent computational graph.

This internal model can now be converted into an ONNX model and saved independently.
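
A rough sketch of this split and the export step (all names and sizes here are made up for illustration):

import torch
import torch.nn as nn

class PureModel(nn.Module):
    """Tensor-in, tensor-out; a persistent graph, exportable to ONNX."""
    def __init__(self, dim=300, n_classes=5):
        super().__init__()
        self.classifier = nn.Linear(dim, n_classes)

    def forward(self, embeddings):
        # mean-pool over the sequence, then classify
        return self.classifier(embeddings.mean(dim=1))

pure = PureModel()
dummy = torch.randn(1, 128, 300)  # (batch, seq_len, dim)
torch.onnx.export(pure, dummy, "model.onnx",
                  input_names=["embeddings"], output_names=["logits"],
                  dynamic_axes={"embeddings": {0: "batch", 1: "seq"}})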

With the ONNX model at hand, you can use whatever instrument you need to serve or explore it.
Forwarded from Spark in me (Alexander)
Silero Speech-To-Text Models V1 Released

We are proud to announce that we have released our high-quality (i.e. on par with premium Google models) speech-to-text models for the following languages:

- English
- German
- Spanish

Why this is a big deal:

- STT research is typically focused on huge compute budgets
- Pre-trained models and recipes did not generalize well, were difficult to use even as-is, and relied on obsolete tech
- Until now the STT community lacked easy-to-use, high-quality, production-grade STT models

How we solve it:

- We publish a set of pre-trained high-quality models for popular languages
- Our models are embarrassingly easy to use
- Our models are fast and can be run on commodity hardware

Even if you do not work with STT, please give us a star / share!

Links

- https://github.com/snakers4/silero-models
🔲 Qdrant - vector search engine

Since my last post about filterable HNSW
I have been working on a new search engine to give this idea a proper implementation.
And I have finally published an alpha version of the engine, called Qdrant.

Development is still at an early stage, but the engine already provides the ElasticSearch-like conditions must, should and must_not, which you can combine to represent an arbitrary condition.

Use-cases

You might need Qdrant in cases when a vector alone cannot fully represent the sought object.
For example, a neural network might model the visual appearance of a piece of clothing, but it can hardly account for its stock availability.

With Qdrant you can assign this feature as a payload and use it for filtering.

Among the possible applications:

- Semantic search with facets
- Semantic search on a map
- Matching engines, e.g. candidates and job positions
- Personal recommendations

Technical highlights

Qdrant is written in Rust, a language specially designed for systems programming: the building of services that are used by other services.
Rust is comparable in speed with C but also protects from data races, which is crucial for database applications.
Push the crab 🦀 if you are interested in more Rust-specific details of the project.

The engine uses write-ahead logging: once it has confirmed an update, it won't lose the data even in case of a power failure.

You can already try it with the Docker image:

docker pull generall/qdrant


A simple search request could look like this:

POST /test_collection/points/search
{
    "filter": {
        "should": [
            {
                "match": {
                    "key": "city",
                    "keyword": "London"
                }
            }
        ]
    },
    "vector": [0.2, 0.1, 0.9, 0.7],
    "top": 3
}
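
Assuming the service runs locally on its default port (an assumption in this sketch), the same request can be sent with curl:

curl -X POST 'http://localhost:6333/test_collection/points/search' \
    -H 'Content-Type: application/json' \
    --data '{"filter": {"should": [{"match": {"key": "city", "keyword": "London"}}]}, "vector": [0.2, 0.1, 0.9, 0.7], "top": 3}'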


All APIs are documented with OpenAPI 3.0, which provides an easy way to generate a client for any programming language.

I would highly appreciate any feedback on the project, and I will be grateful if you give it a star on GitHub.