Lightning-Fast Deep Learning on Spark

Via parallel stochastic gradient updates

By Dirk Neumann


Training deep belief networks requires extensive data and computation. DeepDist accelerates model training by providing asynchronous stochastic gradient descent for data stored on HDFS / Spark.

How does it work?

DeepDist implements Downpour-like stochastic gradient descent. It first starts a master model server. On each data node, DeepDist then fetches the model from the master and calls gradient(). After computing gradients on the data partitions, the gradient updates are sent back to the server, where the master model is updated by descent(). Models can converge faster because gradient updates are constantly synchronized between the nodes.

Figure 1. The model is stored on the master node. The distributed nodes fetch the model before processing each partition, and send the gradient updates back to the server. The server can perform stochastic gradient descent (or other optimization procedures) using the updates from the nodes.
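The loop above boils down to two callbacks, gradient() and descent(). The following is a minimal single-process sketch of that contract (the dict-based "master" model and the least-squares example are illustrative assumptions; in DeepDist the workers run on separate nodes and their updates arrive asynchronously, whereas this loop is sequential for clarity):

```python
import numpy as np

# Minimal sketch of the Downpour-style loop. The two callbacks, gradient()
# and descent(), mirror the DeepDist interface; the "master" dict and the
# linear least-squares model are illustrative, not DeepDist internals.

master = {'w': np.zeros(2)}          # master model parameters

def gradient(model, data):
    # Workers fetch the current model and compute a gradient on their partition.
    X, y = data
    err = X.dot(model['w']) - y
    return X.T.dot(err) / len(y)     # least-squares gradient

def descent(model, update, lr=0.5):
    # The server folds each worker's update into the master model as it arrives.
    model['w'] -= lr * update

# Simulated data partitions (in DeepDist these would be RDD partitions on HDFS).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X.dot(np.array([2.0, -1.0]))     # noise-free targets, true w = [2, -1]
partitions = [(X[i:i + 25], y[i:i + 25]) for i in range(0, 100, 25)]

for _ in range(50):                  # each partition: fetch model, push gradient
    for part in partitions:
        descent(master, gradient(master, part))

print(master['w'])                   # approaches [2, -1]
```

Note that the model is updated after every partition's gradient, not once per pass over the data; this is the source of the faster convergence claimed above.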

How do MLlib and Mahout work?

In contrast, Mahout and MLlib have to compute all subgradients over the (subsampled) data before they can update the model. Gradient updates are therefore less frequent, and deep belief models may converge more slowly.

Figure 2. Classical machine learning libraries broadcast the model parameters to the nodes, and then compute the gradients in parallel. At the end of the computational step, gradient updates are collected and transferred to the master.
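For contrast, the synchronous pattern of Figure 2 can be sketched as follows (illustrative pseudocode-style Python, not MLlib or Mahout source): every partition's gradient is computed against the same broadcast parameters, the gradients are collected, and only then is a single update applied.

```python
import numpy as np

# Sketch of one synchronous step: same broadcast parameters on every
# partition, gradients collected, one model update per full pass.

def synchronous_step(w, partitions, lr=0.5):
    grads = []
    for X, y in partitions:                   # "map": identical w everywhere
        err = X.dot(w) - y
        grads.append(X.T.dot(err) / len(y))
    return w - lr * np.mean(grads, axis=0)    # "reduce": one collected update

# Same toy least-squares setup as before (illustrative data).
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = X.dot(np.array([2.0, -1.0]))
partitions = [(X[i:i + 25], y[i:i + 25]) for i in range(0, 100, 25)]

w = np.zeros(2)
for _ in range(50):
    w = synchronous_step(w, partitions)
print(w)   # also reaches [2, -1], but with one update per pass
```

The end state is the same; the difference is that the model here changes only once per pass over all partitions, while the Downpour-style loop updates it after every partition.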

Quick start

Let's learn vector representations of words using a word2vec model. These representations can be used in many natural language processing applications. Training it on all of Wikipedia takes about 15 lines of code.
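A sketch of such a script, following the gradient()/descent() callback interface described above (the syn0/syn1 weight arrays are internals of older gensim Word2Vec releases, and 'enwiki' is a placeholder path for the extracted corpus; treat the exact signatures as assumptions, not a definitive listing):

```python
from deepdist import DeepDist
from gensim.models.word2vec import Word2Vec
from pyspark import SparkContext

sc = SparkContext()
corpus = sc.textFile('enwiki').map(lambda s: s.split())  # pre-extracted text

def gradient(model, sentences):   # runs on the workers
    # Train on this partition and return the resulting weight deltas.
    syn0, syn1 = model.syn0.copy(), model.syn1.copy()
    model.train(sentences)
    return {'syn0': model.syn0 - syn0, 'syn1': model.syn1 - syn1}

def descent(model, update):       # runs on the master
    # Fold the worker's deltas into the master model.
    model.syn0 += update['syn0']
    model.syn1 += update['syn1']

with DeepDist(Word2Vec(corpus.collect())) as dd:   # starts the model server
    dd.train(corpus, gradient, descent)
    print(dd.model.most_similar(positive=['woman', 'king'], negative=['man']))
```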

The resulting vector representations of words can be added and subtracted. For example, the model can be tested with 'woman' - 'man' + 'king' and computes 'queen'.
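The analogy arithmetic can be illustrated with hand-made toy vectors (these are not real word2vec embeddings; the two axes, "royalty" and "gender", are a didactic assumption):

```python
import numpy as np

# Toy 2-d "embeddings": first axis ~ royalty, second axis ~ gender.
vecs = {
    'king':  np.array([0.9,  0.8]),
    'queen': np.array([0.9, -0.8]),
    'man':   np.array([0.1,  0.8]),
    'woman': np.array([0.1, -0.8]),
}

def analogy(a, b, c):
    # a - b + c, then return the nearest word by cosine similarity,
    # excluding the query words (as word2vec's most_similar does).
    target = vecs[a] - vecs[b] + vecs[c]
    def cos(u, v):
        return u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return max((w for w in vecs if w not in (a, b, c)),
               key=lambda w: cos(vecs[w], target))

print(analogy('woman', 'man', 'king'))   # -> 'queen'
```

Here 'woman' - 'man' cancels the gender component shared with 'king', leaving a vector that lands on 'queen'.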

Python module

DeepDist provides a simple Python interface. The with statement starts the model server. Distributed gradient updates are computed on partitions of a resilient distributed dataset (RDD). The gradient updates are incorporated into the master model via a descent function.

Scala package

Are you interested in a Scala version of this package?

Training Speed

Training speed can be greatly enhanced by adaptively adjusting the learning rate with AdaGrad. The complete Word2Vec model with 900 dimensions can be trained on the 19 GB Wikipedia corpus (using only words from the validation questions).

Figure 3. Performance on a test set (analogy questions) after training a Word2Vec model with stochastic AdaGrad gradient descent.


References

J Dean, GS Corrado, R Monga, K Chen, M Devin, QV Le, MZ Mao, M'A Ranzato, A Senior, P Tucker, K Yang, and AY Ng. Large Scale Distributed Deep Networks. NIPS 2012: Neural Information Processing Systems, Lake Tahoe, Nevada, 2012.

T Mikolov, I Sutskever, K Chen, G Corrado, and J Dean. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS, 2013.

T Mikolov, K Chen, G Corrado, and J Dean. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR, 2013.

Additional Information

Wikipedia data:
Extracting text: