Data Science by ODS.ai 🦜
First Telegram Data Science channel. Covering all technical and popular stuff about anything related to Data Science: AI, Big Data, Machine Learning, Statistics, general Math and the applications of the former. To reach the editors, contact @malev.
This is the recording from the July 23rd SF Machine Learning Meetup at the Workday Inc. office in San Francisco.

Featured speaker - Ilya Sutskever
Ilya Sutskever received his PhD in 2012 from the University of Toronto working with Geoffrey Hinton. He was also a post-doc with Andrew Ng at Stanford University. After completing his PhD, he cofounded DNNResearch with Geoffrey Hinton and Alex Krizhevsky which was acquired by Google. He is interested in all aspects of neural networks and their applications.

https://clip.mn/video/yt-aUTHdgh1OjI
Yoshua Bengio:
A must-read for those interested in dialogue research: an overview of the corpora available for building data-driven dialogue systems:

http://arxiv.org/abs/1512.05742
We have an announcement to make.

The Russian Deep Learning community is quite excited and enthusiastic about the recent Kaggle challenge put forward by the Allen Institute for Artificial Intelligence (https://www.kaggle.com/c/the-allen-ai-science-challenge). Backed by a large interest group here in Moscow, we want to build on this initiative by organising a winter school paired with an AI hackathon: http://qa.deephack.me . Collaborative work by many teams forms a powerful educational environment that can stimulate people to learn and work better, and may in the end lead to discoveries that would otherwise have been overlooked.

Based on our prior experience, we expect a successful event! The last event of this kind that we organised, a week-long hackathon to improve DeepMind's code for playing Atari games (see http://deephack.me ), went well. It was an academic event, free for participants but competitive, combining hacking with a crash course of lectures by +Yoshua Bengio, Andrey Dergachev, Alexey Dosovitski, Vitali Dunin-Barkovskyi, +Terran Lane, +Anatoly Levenchuk, +Sridhar Mahadevan, Maxim Milakov, +Sergey Plis, +Irina Rish, +Ruslan Salakhutdinov, +Jürgen Schmidhuber, +Thomas Unterthiner, Dmitri Vetrov and Alexander Zhavoronkov. The winning team was awarded a trip to NIPS, and the paper based on their work was accepted to a NIPS workshop. In fact, many other participants were inspired enough to come to NIPS on their own.

We invite everybody who is interested in participating as a hacker or a speaker :)

More details (and registration form) can be found at http://qa.deephack.me
http://aipoly.com/ is a live demo of a computer vision application: an iPhone app that recognizes objects in video.
5 Deep Learning papers explained:


Infinite Dimensional Word Embeddings

Abstract:

We describe a method for learning word embeddings with stochastic dimensionality. Our Infinite Skip-Gram (iSG) model specifies an energy-based joint distribution over a word vector, a context vector, and their dimensionality. By employing the same techniques used to make the Infinite Restricted Boltzmann Machine (Cote & Larochelle, 2015) tractable, we define vector dimensionality over a countably infinite domain, allowing vectors to grow as needed during training.

Main Ideas:

This is a quite original use of the "infinite dimensions" trick we introduced in the iRBM. It wasn't entirely "plug and play" either; the authors had to be smart about the approximations they proposed for training the iSG.

The qualitative results showing how the conditional on the number of dimensions contains information about polysemy are really neat! One assumption behind distributed word embeddings is that they should be able to represent the multiple meanings of words using different dimensions, so it's nice to see that this is exactly what is being learned here.

I think the only thing missing from this paper is a comparison with the regular skip-gram, and perhaps with other word embedding methods, on a specific task or on a word similarity task. In v2 of the paper the authors mention they are working on such results, so I'm looking forward to seeing those!
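To make this a bit more concrete, here is a heavily simplified toy sketch, not the authors' model: the per-dimension penalty and the cumulative-score form of the energy below are my own stand-ins, and the "embeddings" are random. It is only meant to show how a conditional distribution over the effective dimensionality z of a word/context pair can be read off and inspected, which is the quantity the polysemy analysis looks at.

```python
# Toy illustration only: random "embeddings", a truncated domain for z, and a
# hand-picked per-dimension penalty standing in for the iSG's actual energy.
import numpy as np

rng = np.random.RandomState(0)
L = 100                                   # truncation used only for this sketch
decay = np.exp(-np.arange(L) / 30.0)      # make higher dimensions matter less
word_vec = rng.randn(L) * decay
context_vec = rng.randn(L) * decay
penalty = 0.05                            # per-dimension cost of "growing" the vector

# Score of using only the first z dimensions, for z = 1..L
scores = np.cumsum(word_vec * context_vec - penalty)
p_z = np.exp(scores - scores.max())
p_z /= p_z.sum()                          # conditional over dimensionality, p(z | w, c)

print("most probable effective dimensionality:", int(np.argmax(p_z)) + 1)
```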

http://arxiv.org/pdf/1511.05392v2.pdf
Gradient-based Hyperparameter Optimization through Reversible Learning


Abstract (excerpt):

We compute exact gradients of cross-validation performance with respect to all hyperparameters by chaining derivatives backwards through the entire training procedure. These gradients allow us to optimize thousands of hyperparameters, including step-size and momentum schedules, weight initialization distributions, richly parameterized regularization schemes, and neural network architectures.

Two Cents (excerpt):

This is one of my favorite papers of this year. While the method of unrolling several steps of gradient descent (100 iterations in the paper) makes it somewhat impractical for large networks (which is probably why they considered 3-layer networks with only 50 hidden units per layer), it provides an incredibly interesting window on what are good hyper-parameter choices for neural networks. Note that, to substantially reduce the memory requirements of the method, the authors had to be quite creative and smart about how to encode changes in the network's weight changes.

There are tons of interesting experiments, which I encourage the reader to go check out (see section 3).

The experiment on "training the training set", i.e. generating the 10 examples (one per class) that would minimize the validation set loss of a network trained on those examples, is a pretty cool idea (it essentially learns prototypical images of the digits 0 to 9 on MNIST).

Note that approaches like the one in this paper make tools for automatic differentiation incredibly valuable. Python autograd, the authors' automatic differentiation library (https://github.com/HIPS/autograd), which inspired our own Torch autograd (https://github.com/twitter/torch-autograd), was in fact developed in the context of this paper.
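As a rough illustration of what "chaining derivatives backwards through the entire training procedure" means, here is a toy sketch under my own simplifications (a tiny linear regression, a naively stored trajectory, no reversible-memory trick): the HIPS autograd library mentioned above differentiates the validation loss of an unrolled SGD run with respect to the per-step learning rates.

```python
# Toy sketch: differentiate the validation loss through an unrolled SGD run on
# a small linear regression problem, with respect to per-step (log) learning
# rates. This naive version stores the whole trajectory; the paper's point is
# doing the same thing memory-efficiently by running training in reverse.
import numpy
import autograd.numpy as np
from autograd import grad

rng = numpy.random.RandomState(0)
w_true = rng.randn(5)
X_train, X_val = rng.randn(50, 5), rng.randn(20, 5)
y_train, y_val = numpy.dot(X_train, w_true), numpy.dot(X_val, w_true)

def mse(w, X, y):
    return np.mean((np.dot(X, w) - y) ** 2)

def train_loss_grad(w):
    # exact gradient of the training MSE for linear regression
    return 2.0 * np.dot(X_train.T, np.dot(X_train, w) - y_train) / X_train.shape[0]

T = 20                                      # number of unrolled SGD steps

def validation_loss(log_lrs):
    w = np.zeros(5)
    for t in range(T):                      # fully differentiable training loop
        w = w - np.exp(log_lrs[t]) * train_loss_grad(w)
    return mse(w, X_val, y_val)

log_lrs = np.log(0.05) * np.ones(T)         # start with every step size at 0.05
hypergrad = grad(validation_loss)(log_lrs)  # d(validation loss) / d(log step sizes)
print(hypergrad)
```

A gradient step on log_lrs in the direction of -hypergrad would then improve the step-size schedule, which is the outer loop the paper runs.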

http://arxiv.org/pdf/1502.03492v3.pdf
Speed Learning on the Fly

Authors: Pierre-Yves Massé, Yann Ollivier
Date posted to arXiv: 8 Nov 2015

Abstract :

Here we propose to adapt the step size by performing a gradient descent on the step size itself, viewing the whole performance of the learning trajectory as a function of step size. Importantly, this adaptation can be computed online at little cost, without having to iterate backward passes over the full data.

Main ideas:

I think the authors are right on the money as to the challenges posed by online learning. I think these challenges are likely to be even greater in the context of training neural networks online, for which few satisfactory solutions exist right now. So this is a direction of research I'm particularly excited about.

At this point, the experiments consider fairly simple learning scenarios, but I don't see any obstacle to applying the same method to neural networks. One interesting observation is that the results are fairly robust to variations in "the learning rate of the learning rate", more so than to variations in a fixed learning rate itself.

Finally, I haven't had time to entirely digest one of their theoretical results, which suggests that their approximation actually corresponds to an exact gradient taken "alongside the effective trajectory" of gradient descent. That result seems quite interesting, though, and deserves more attention.
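For intuition, here is a minimal sketch of the general idea in its simplest online form (often called hypergradient descent), not the authors' algorithm, which works with the effective trajectory of the whole learning run: the step size is itself updated by a gradient step, using the agreement between successive stochastic gradients as the signal. All constants are arbitrary choices of mine.

```python
# Toy sketch: streaming linear regression where the step size is adapted
# online by a (hyper)gradient step, using the inner product of successive
# gradients as the signal.
import numpy as np

rng = np.random.RandomState(0)
w_true = rng.randn(10)

def sample_example():
    x = rng.randn(10)
    return x, np.dot(w_true, x) + 0.1 * rng.randn()

w = np.zeros(10)
lr, hyper_lr = 0.001, 1e-5                  # step size and "learning rate of the learning rate"
prev_grad = np.zeros(10)

for t in range(20000):
    x, y = sample_example()
    g = 2.0 * (np.dot(w, x) - y) * x        # gradient of the squared error on this example
    # Increase lr while successive gradients point the same way, decrease it
    # when they start to disagree; clip to keep the toy example stable.
    lr = float(np.clip(lr + hyper_lr * np.dot(g, prev_grad), 1e-6, 0.05))
    w -= lr * g
    prev_grad = g

print("adapted step size:", lr)
print("parameter error:", np.linalg.norm(w - w_true))
```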


http://arxiv.org/pdf/1511.02540v1.pdf
Spatial Transformer Networks

Authors: Max Jaderberg, Karen Simonyan, Andrew Zisserman, Koray Kavukcuoglu
Date posted to arXiv: 5 Jun 2015

Abstract :

In this work we introduce a new learnable module, the Spatial Transformer, which explicitly allows the spatial manipulation of data within the network. This differentiable module can be inserted into existing convolutional architectures, giving neural networks the ability to actively spatially transform feature maps, conditional on the feature map itself, without any extra training supervision or modification to the optimisation process.

Main ideas:

While the work on DRAW (http://arxiv.org/abs/1502.04623) previously proposed a similar approach to learning transformations on images, this work goes significantly beyond DRAW and generalizes the approach to a much richer family of transformations. I also really like the idea of applying the spatial transformer modules within a CNN, something that wasn't in the DRAW paper.

I really don't have much negative to say about this work; it's really solid!

The only thing that comes to mind is that, in the CUB-200-2011 experiment, the authors used ImageNet pre-trained Inception networks to initialise their models. The reason it's worth mentioning is that the test set of the CUB-200-2011 dataset actually contains images from the ImageNet training set. Fortunately, there are very few of those, so this doesn't change the overall analysis of the results. Still, with such forms of transfer learning becoming increasingly common, it appears that we, as deep learning researchers, will need to start paying much more attention to such considerations than we used to.
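To give a feel for what the module actually computes, here is a bare-bones NumPy sketch of its two core pieces, the grid generator and the bilinear sampler, for a single-channel image and a given affine theta. In the paper, theta is predicted by a small localisation network from the feature map itself, and both pieces are differentiable so the whole thing trains end to end; none of that machinery is included here.

```python
# Minimal sketch: apply a 2x3 affine transform `theta` to an H x W image by
# generating a sampling grid and interpolating bilinearly.
import numpy as np

def spatial_transform(image, theta):
    H, W = image.shape
    # Target grid of normalised coordinates in [-1, 1]
    ys, xs = np.meshgrid(np.linspace(-1, 1, H), np.linspace(-1, 1, W), indexing="ij")
    grid = np.stack([xs.ravel(), ys.ravel(), np.ones(H * W)])      # 3 x HW
    src = theta @ grid                                             # 2 x HW source coords
    # Map back to pixel indices and sample with bilinear interpolation
    sx = (src[0] + 1) * (W - 1) / 2
    sy = (src[1] + 1) * (H - 1) / 2
    x0, y0 = np.floor(sx).astype(int), np.floor(sy).astype(int)
    x1, y1 = x0 + 1, y0 + 1
    x0c, x1c = np.clip(x0, 0, W - 1), np.clip(x1, 0, W - 1)
    y0c, y1c = np.clip(y0, 0, H - 1), np.clip(y1, 0, H - 1)
    wa = (x1 - sx) * (y1 - sy)   # weight for (y0, x0)
    wb = (x1 - sx) * (sy - y0)   # weight for (y1, x0)
    wc = (sx - x0) * (y1 - sy)   # weight for (y0, x1)
    wd = (sx - x0) * (sy - y0)   # weight for (y1, x1)
    out = (wa * image[y0c, x0c] + wb * image[y1c, x0c] +
           wc * image[y0c, x1c] + wd * image[y1c, x1c])
    return out.reshape(H, W)

# The identity transform leaves the image unchanged; a scaled theta would zoom.
img = np.arange(16, dtype=float).reshape(4, 4)
theta_identity = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
print(np.allclose(spatial_transform(img, theta_identity), img))
```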

http://arxiv.org/pdf/1506.02025v2.pdf
Clustering is Efficient for Approximate Maximum Inner Product Search

Authors: Alex Auvolat, Sarath Chandar, Pascal Vincent, Hugo Larochelle, Yoshua Bengio
Date posted to arXiv: 21 Jul 2015

Abstract:

Efficient Maximum Inner Product Search (MIPS) is an important task that has a wide applicability in recommendation systems and classification with a large number of classes. Solutions based on locality-sensitive hashing (LSH) as well as tree-based solutions have been investigated in the recent literature, to perform approximate MIPS in sublinear time. In this paper, we compare these to another extremely simple approach for solving approximate MIPS, based on variants of the k-means clustering algorithm.

Main ideas:

Update 2015/11/23: Since I first wrote this note, I became involved in the next iterations of this work, which became v2 of the arXiv manuscript. The notes below were made based on v1.

(Editor's note: link to version 1)

Since inner products are one of the main units of computation in neural networks, I'm very interested in MIPS, as I suspect it could play an important role in scaling up neural networks. One example mentioned in the paper is that of approximating computations at the output layer of a neural network language model, corresponding to a softmax over a large number of units (as many as there are words in the vocabulary).

I find the combination of the "MIPS to MCSS" transformation with spherical clustering clever, cute and simple. Based on how good the results are compared to hashing, I find this direction of research quite compelling.
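For readers unfamiliar with the trick, here is a small NumPy sketch of the "MIPS to MCSS" reduction as I understand it; the clustering index itself is omitted (in the paper, spherical k-means is run on the augmented unit-norm vectors and only the closest clusters are searched at query time).

```python
# Sketch of the reduction: scale the database so all norms are <= 1, append a
# coordinate sqrt(1 - ||x||^2) to each vector (and a 0 to the query) so that
# the augmented vectors are unit norm and maximum inner product search becomes
# maximum cosine similarity search.
import numpy as np

rng = np.random.RandomState(0)
database = rng.randn(10000, 32)
query = rng.randn(32)

max_norm = np.linalg.norm(database, axis=1).max()
scaled = database / max_norm
extra = np.sqrt(np.maximum(0.0, 1.0 - np.sum(scaled ** 2, axis=1, keepdims=True)))
aug_db = np.hstack([scaled, extra])        # every row now has norm exactly 1
aug_q = np.append(query, 0.0)              # appending 0 leaves inner products intact

# Sanity check: the argmax is preserved by the transformation.
best_mips = int(np.argmax(np.dot(database, query)))
cosines = np.dot(aug_db, aug_q) / np.linalg.norm(aug_q)   # rows are already unit norm
best_mcss = int(np.argmax(cosines))
print(best_mips == best_mcss)              # True
```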

I would like to thank Dr. Larochelle, not only for the fantastic summaries and insights that he has been producing for several months at this point, but also for being gracious enough to allow us to reproduce extended excerpts in this and the previous article. I hope that these notes, along with the original papers themselves, provide you with some additional comprehension of the often-difficult concepts that go along with deep learning research.

Bio: Matthew Mayo is a computer science graduate student currently working on his thesis parallelizing machine learning algorithms. He is also a student of data mining, a data enthusiast, and an aspiring machine learning scientist.

http://arxiv.org/pdf/1507.05910v3.pdf
GPU-Trained System Understands Movies

The questions range from the simpler 'Who' did 'What' to 'Whom', which can be solved by computer vision alone, to 'Why' and 'How' something happened in the movie, questions that can only be answered by exploiting both the visual information and the dialogs.

https://news.developer.nvidia.com/gpu-trained-system-understands-movies/
Another paper about an awesome application of Deep Learning: this time, segmenting glands in histology images to help assess malignancy.

The morphology of glands has been used routinely by pathologists to assess the malignancy degree of adenocarcinomas. Accurate segmentation of glands from histology images is a crucial step to obtain reliable morphological statistics for quantitative diagnosis. In this paper, we propose an effective deep contour-aware network (DCAN) to solve this challenging problem under a unified multi-task learning framework. In the proposed network, multi-level contextual features from the hierarchical architecture are explored with auxiliary supervision for accurate gland segmentation. When incorporated with multi-task regularization during the training, the discriminative capability of intermediate features can be further improved. Moreover, our network can not only output accurate probability maps of glands, but also depict clear contours simultaneously for separating cluttered objects, which further boosts the gland segmentation performance. This unified framework can be efficient when applied to large-scale histopathological data without resorting to additional post-separating steps based on low-level cues. Our method (CUMedVision Team) won the 2015 MICCAI Gland Segmentation Challenge out of 13 competitive teams, surpassing all the other methods by a significant margin.
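As a rough schematic of the multi-task setup described above, here is a toy PyTorch sketch of my own, not the CUMedVision architecture (which uses a much deeper fully convolutional network with multi-level auxiliary supervision): a shared backbone feeds two heads, one predicting the gland probability map and one the contour map, and the two cross-entropy losses are summed during training.

```python
# Schematic only: a tiny two-head fully convolutional network with a summed
# multi-task loss (gland map + contour map) on random data.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDualHeadFCN(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
        )
        self.gland_head = nn.Conv2d(32, 2, 1)    # background / gland
        self.contour_head = nn.Conv2d(32, 2, 1)  # background / contour

    def forward(self, x):
        feats = self.backbone(x)
        return self.gland_head(feats), self.contour_head(feats)

model = TinyDualHeadFCN()
images = torch.randn(2, 3, 64, 64)
gland_mask = torch.randint(0, 2, (2, 64, 64))
contour_mask = torch.randint(0, 2, (2, 64, 64))

gland_logits, contour_logits = model(images)
loss = F.cross_entropy(gland_logits, gland_mask) + F.cross_entropy(contour_logits, contour_mask)
loss.backward()
print(float(loss))
```

At test time, one would subtract the predicted contours from the gland probability map to separate touching glands, which is the role the contour head plays in the paper.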

http://appsrv.cse.cuhk.edu.hk/~hchen/research/2015miccai_gland.html
If you have any news to suggest, you can write to @malev
Third:

The basic, state-of-the-art and best MOOCs are Andrew Ng's Machine Learning and Hinton's Neural Networks for Machine Learning.
Now phones can record sound with their gyroscopes. Be careful.

https://crypto.stanford.edu/gyrophone/
Nice infographic on Apple app charts