Neural Networks Engineering – Telegram
An author's channel about neural network development and mastering machine learning. Experiments, tool reviews, personal research.

#deep_learning
#NLP

Author @generall93
Hello. This channel is my personal development log about neural networks and machine learning.
Sebastian Ruder wrote on his blog (http://ruder.io/requests-for-research/index.html) about the prospects of neural networks in NLP.
He thinks that Few-Shot and Transfer learning will have a huge impact in this area.
I found his arguments convincing, so now I'm experimenting with Few-Shot learning.
There are a lot of good datasets for the Few-Shot learning problem in Computer Vision, e.g. Omniglot https://github.com/brendenlake/omniglot or any face recognition dataset.
The closest analogues in NLP are conversation and question answering datasets.
But evaluating results on these tasks is very specific: we can't check whether an arbitrary text answers a question without a human, because it's impossible to align all texts with a set of predefined labels the way it is done in Omniglot.
There is, however, an NLP task which fits the Few-Shot learning approach - Named Entity Linking, or NEL for short.
In NEL we should assign an exact concept link to each mention in a text. You can think of concepts as Wikipedia articles.
The number of possible concepts is very large and grows with time, so regular classification techniques with a fixed set of classes don't apply.
This makes the NEL task a perfect fit for Few-Shot learning. I prepared a WikiLinks-based dataset: https://www.kaggle.com/generall/oneshotwikilinks
This dataset consists of entity mentions with corresponding Wikipedia links. There is also a kNN baseline which gives 70% accuracy.
The kNN baseline uses only Bag of Words features, without any word2vec or synonyms.
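To make the baseline idea concrete, here is a minimal sketch of kNN over Bag of Words features. It is not the baseline published with the dataset: the cosine similarity, the majority vote, the function names and the toy examples are all assumptions made for illustration.

```python
from collections import Counter
import math

def bag_of_words(text):
    """Lowercase whitespace tokenization into a word-count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def knn_link(mention_context, labeled_examples, k=3):
    """Pick the concept voted for by the k most similar labeled contexts.

    labeled_examples is a list of (context_text, concept) pairs.
    """
    query = bag_of_words(mention_context)
    scored = sorted(
        ((cosine(query, bag_of_words(text)), concept)
         for text, concept in labeled_examples),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(concept for _, concept in scored[:k])
    return votes.most_common(1)[0][0]

# Toy examples, invented for illustration (not taken from the dataset).
examples = [
    ("the jaguar is a large cat native to the americas", "Jaguar_(animal)"),
    ("jaguar cats hunt in the rainforest", "Jaguar_(animal)"),
    ("jaguar unveiled a new luxury car model", "Jaguar_Cars"),
    ("the jaguar car has a powerful engine", "Jaguar_Cars"),
]
print(knn_link("a jaguar cat was seen in the rainforest", examples, k=3))
```

New concepts need no retraining here: adding a concept means adding a few labeled contexts to the list, which is what makes this kind of matching a natural fit for Few-Shot setups.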
My next step was to try well-known neural architectures for text matching. It is worth a detailed description, so the next few posts will be about it.
As a starting point I used MatchZoo (https://github.com/faneshion/MatchZoo), a collection of text matching models.
It contains a set of model implementations in Keras as well as a number of benchmark datasets.
MatchZoo was created by the authors of most of those models. It includes a lot of different examples, but the configuration requires manual adjustment for each new task.
I used the MatchZoo implementation of the CDSSM model as a baseline reference for my own implementation. With this baseline I could be sure that the source of any errors was my model, not shifted labels in the dataset.
A variety of deep matching models can be categorized into two types according to their architecture.
One is the representation-focused model (pic. 1), which tries to build a good representation for a single text with a deep neural network, and then conducts matching between compressed text representations. Examples include DSSM, C-DSSM and ARC-I.

The other is the interaction-focused model (pic. 2), which first builds local interactions (i.e., local matching signals) between two pieces of text, and then uses deep neural networks to learn hierarchical interaction patterns for matching. Examples include MV-LSTM, ARC-II and MatchPyramid.

A useful property of representation-focused models is the possibility to pre-compute representation vectors.
This allows, for example, fast ranking of web pages in search engines.
However, such a model does not take into account the interaction between two texts until an individual representation of each text has been generated.
Therefore there is a risk of losing details important for the matching task (e.g., a city name) when representing the texts.
In other words, in the forward phase (prediction), the representation of each text is formed without knowledge of the other.
As a result, interaction-focused models tend to perform better in Question Answering and Paraphrase Identification tasks, though they are not applicable to web-scale matching.
1. Representation-based architecture. 2. Interaction-based architecture (MV-LSTM)
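The difference between the two families can be sketched in a few lines. This is only an illustration under loud assumptions: the 2-d word vectors are invented, mean pooling stands in for the deep encoder of a representation-focused model, and the word-by-word cosine matrix stands in for the first layer of an interaction-focused one. None of it is code from MatchZoo.

```python
import math

# Toy 2-d word vectors, invented for illustration only.
EMB = {
    "hotel": (1.0, 0.0),
    "in": (0.0, 1.0),
    "paris": (0.6, 0.8),
    "berlin": (0.8, 0.6),
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def encode(tokens):
    """Representation-focused: compress one text into a single vector.

    Mean pooling is a stand-in for a learned encoder.
    """
    vectors = [EMB[t] for t in tokens]
    return tuple(sum(dim) / len(vectors) for dim in zip(*vectors))

def interaction_matrix(tokens_a, tokens_b):
    """Interaction-focused: word-by-word similarities, needs both texts at once."""
    return [[cosine(EMB[a], EMB[b]) for b in tokens_b] for a in tokens_a]

query = ["hotel", "in", "paris"]
document = ["hotel", "in", "berlin"]

# Document vectors can be computed once, before any query arrives.
index = {"doc1": encode(document)}
# A single score: the city names are already averaged into the text vectors.
print(cosine(encode(query), index["doc1"]))

# One row per query word: word-level scores stay available to later layers.
for row in interaction_matrix(query, document):
    print([round(x, 2) for x in row])
```

The matrix cannot be pre-computed per document, because every cell depends on a word from the query, which is why this family is hard to use for web-scale ranking.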
Neural Networks debugging

When training neural networks, it is often unclear why the network is not learning. Is it the training parameters or the NN architecture?
A brute-force search on the full training dataset may be very time consuming, even with GPU acceleration.
If you need to write code on your laptop and run it on a remote machine, the process becomes even more painful.
One way to solve this problem is to use synthetic datasets for debugging.
The idea is to create small sets of examples, each of which is a little more complex than the previous one.
Let me illustrate this approach. In the example in the picture we can see that the model is able to distinguish:
- Object presence
- Shapes
- Colors
- Rotation
- Stroke
But it can't distinguish alignment and count of objects. Keep in mind that the number of layers and neurons should be scaled down according to the size of the synthetic data, or the network will overfit. Knowing the evaluation results, we can quickly iterate over modifications to our network architecture.
Of course, solving a synthetic dataset does not guarantee solving real-life tasks, just as passing a unit test does not guarantee that the code has no bugs.
But there is another useful thing we can do: with a large number of small experiments we can detect relations between the results and changes in network parameters. This information will help us concentrate on tuning the significant parameters while training on real data.
Not only images can be synthesized for training. In my NEL project I am using 13 synthetic text datasets. The size of these datasets allows me to debug a neural network on a laptop without any GPU. You can find the code for generating them here.
Writing code for data generation may be time consuming and boring, so the next possible step in NN debugging is to create tools, a framework or even a language for data generation. With a declarative SQL-like language it would be possible to create datasets automatically, for example using some kind of evolution strategy. Unfortunately I was unable to find anything suitable for this task, so it is a good place for you to contribute!
Practical example.

Situation: the CDSSM model does not learn well on a big dataset of natural language sentences. What is our next step?
Running the model on several synthetic datasets, we notice that the CDSSM model is unable to handle the following simple data:
Each sentence contains several of N topic words plus random noise words. Two sentences match only if they have at least one word in common.
Example:

1 ldwjum mugqw sohyp sohyp dwguv mugqw
0 ldwjum mugqw sohyp labhvz epqori kfdo
1 xnwpjc agqv lmjoh wvncu tekj lmjoh
0 xnwpjc agqv lmjoh jhnt fhzb xauhq
1 vflcmn pnuvx eolwrj dhfvbt vflcmn toxeyc
0 vflcmn pnuvx eolwrj dhfvbt yetkah bfnxqp
1 rybmae bwcej xnwpjc bwcej yrhefk yhca
0 rybmae bwcej xnwpjc bhck zbfj yhca
1 sohyp htdp symc jrvsyn symc fpoxj
0 sohyp htdp symc eolwrj masq hjzrp
1 dhfvbt yetkah omsaij omsaij dhfvbt tqdef
0 dhfvbt yetkah omsaij zilrh wvncu sohyp
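
Data of this shape is cheap to produce. Below is a minimal sketch of a generator, which is not the generator used in the NEL project. It assumes that each row is a label followed by two three-word sentences (one way to read the sample above), and that negatives are built from fresh noise words only. All function names and parameters are hypothetical.

```python
import random
import string

def random_word(rng, min_len=4, max_len=6):
    """A random lowercase pseudo-word."""
    length = rng.randint(min_len, max_len)
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))

def make_vocabulary(rng, size):
    """N distinct topic words, sorted so a seeded run is reproducible."""
    words = set()
    while len(words) < size:
        words.add(random_word(rng))
    return sorted(words)

def make_pair(rng, topic_words, sentence_len=3):
    """Return a positive and a negative row that share the same first sentence."""
    first = rng.sample(topic_words, sentence_len)
    # Positive: the second sentence repeats at least one word of the first.
    positive = [rng.choice(first)]
    positive += [random_word(rng) for _ in range(sentence_len - 1)]
    rng.shuffle(positive)
    # Negative: the second sentence has no word in common with the first.
    negative = []
    while len(negative) < sentence_len:
        word = random_word(rng)
        if word not in first:
            negative.append(word)
    return (1, first, positive), (0, first, negative)

def format_row(label, first, second):
    return f"{label} {' '.join(first)} {' '.join(second)}"

rng = random.Random(0)          # fixed seed, chosen arbitrarily
vocabulary = make_vocabulary(rng, 20)
for _ in range(3):
    for row in make_pair(rng, vocabulary):
        print(format_row(*row))
```

Because the shared word lands in a random position and is surrounded by noise, a model has to compare the two sentences word by word to get the label right.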

CDSSM overfits on this data. The reason is that it is unknown which word will be useful for matching before the sentences are actually compared.
Besides, the final matching layer of CDSSM is unable to hold enough information about each particular word in the original sentence.
So the only way for the network to minimize the error is to memorize noise in the training set. Such behavior is easy to recognize on loss plots.
Train loss goes down quickly but validation loss grows - typical overfitting (pic. 1).
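The same pattern can be checked programmatically when running many small experiments. The helper below is a hypothetical sketch, not part of any library: it assumes per-epoch loss values are available as plain lists, and the `patience` window of 3 epochs is an arbitrary choice.

```python
def looks_overfit(train_loss, val_loss, patience=3):
    """True if, over the last `patience` epochs, train loss kept falling
    while validation loss kept rising."""
    if len(train_loss) != len(val_loss) or len(train_loss) <= patience:
        return False
    recent_train = train_loss[-(patience + 1):]
    recent_val = val_loss[-(patience + 1):]
    train_falling = all(b < a for a, b in zip(recent_train, recent_train[1:]))
    val_rising = all(b > a for a, b in zip(recent_val, recent_val[1:]))
    return train_falling and val_rising

# Made-up loss curves, for illustration only.
print(looks_overfit([1.0, 0.6, 0.3, 0.1, 0.05], [1.0, 0.9, 1.0, 1.2, 1.5]))
print(looks_overfit([1.0, 0.8, 0.6, 0.5, 0.4], [1.0, 0.85, 0.7, 0.6, 0.55]))
```

A strict epoch-by-epoch comparison like this is sensitive to noisy curves, so smoothing the losses first may be needed on real runs.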
One possible way to solve this dataset is to change the network architecture to one that can handle low-level interactions between the words of a sentence pair.
In the previous post I mentioned interaction-focused models, exactly the type of model we need. I chose the ARC-II architecture for my experiment; you can check out the implementation here. The new model fits the synthetic data perfectly well (pic. 2). As a result we can safely skip time-consuming experiments with the CDSSM model on the real dataset.
1. CDSSM overfitting