Neural Networks and Deep Learning by Michael Nielsen

If you are like me and took Andrew Ng’s introduction to Machine Learning on Coursera, you might be excited about this hot topic in computer science and wondering how to get your hands dirty with your newly acquired knowledge and the massive amount of data out there. Although Ng’s course provides a great overview of different models in machine learning, such as regression (linear and logistic), support vector machines, and neural networks, I felt far from ready to apply those models to real data (even after taking an undergraduate and a graduate course in machine learning). Understanding the theory of these models is fairly straightforward; training them is another matter. Training models, particularly neural networks, can fairly be described as an art. Michael Nielsen’s free online book – Neural Networks and Deep Learning – does a fantastic job of demystifying the intuition and common practices behind this art. Nielsen is a bit verbose, as he starts from the most fundamental concepts and theory of neural networks. Nevertheless, his explanations are thorough and build intuition effectively. The last chapter provides a general overview of deep learning and what Nielsen believes we can expect from the field and (more broadly) machine learning in the years to come. Below I give a short summary of each chapter so that the interested reader can pick and choose which section(s) of Nielsen’s book they would like to delve into. I conclude with my own reflection (I am by no means an expert in this field) on Nielsen’s views of the current status and future of machine learning. In that conclusion, I touch upon open science, of which Michael Nielsen is a major proponent.


The Fundamentals of Neural Networks and Learning

Chapter 1 discusses the fundamental theory of neural networks. It starts from the basic building block, namely the simplified realization of a neuron in the brain: the perceptron. Nielsen goes on to discuss how a neural network is built from layers of perceptrons. He then presents the classic MNIST problem of digit classification which will be tackled over the course of the book. As with most models, there is a cost function associated with neural networks that we seek to minimize. Nielsen presents the typical choice for the cost function (mean square error) and a method for minimizing this cost (gradient descent). Finally, Nielsen introduces his code for the book, which can be obtained here.
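For concreteness, the mean square error (Nielsen’s “quadratic cost”) and the gradient descent update rule he presents take roughly the following form in his notation (my own transcription, so see the chapter for the definitive statements):

```latex
C(w, b) = \frac{1}{2n} \sum_x \lVert y(x) - a \rVert^2,
\qquad
w_k \rightarrow w_k' = w_k - \eta \frac{\partial C}{\partial w_k},
\quad
b_l \rightarrow b_l' = b_l - \eta \frac{\partial C}{\partial b_l}
```

Here a is the network’s output for training input x, n is the number of training examples, and \eta is the learning rate.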

Chapter 2 presents backpropagation – the essential algorithm that allows neural networks to learn, i.e. to obtain the most suitable parameters for the network given the data. This chapter contains more math than the others, so if you aren’t interested in the inner workings of neural networks and are more keen on training and using them, you may decide to skip it. However, Nielsen does not delve into rigorous proofs; he favors simplifications in order to build intuition effectively (and provides links to more in-depth papers and textbooks for the interested reader). The explicit steps of the backpropagation algorithm can be found here. It is worth spending some time reading and understanding the section on the key equations that make up backpropagation (it mainly involves partial derivatives and the chain rule).
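For reference, the four backpropagation equations at the heart of that section read roughly as follows in Nielsen’s notation (again my own transcription): the output-layer error, the recursion that propagates it backwards, and the two formulas that turn errors into gradients.

```latex
\delta^L = \nabla_a C \odot \sigma'(z^L), \qquad
\delta^l = \left( (w^{l+1})^T \delta^{l+1} \right) \odot \sigma'(z^l), \qquad
\frac{\partial C}{\partial b^l_j} = \delta^l_j, \qquad
\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j
```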

Chapter 3 introduces several refinements that can improve the performance (in certain scenarios) of the backpropagation-based learning introduced in the previous chapters. Nielsen first introduces the cross-entropy cost function, an alternative to the mean square error (MSE). Its advantage over MSE is that it allows the neural network to learn faster when it is “badly wrong.” Next, Nielsen covers a very important problem in machine learning and its common remedy: overfitting and regularization. This section is a must-read, as the concepts apply far beyond neural networks.
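For reference, the cross-entropy cost and the L2-regularized cost that this chapter introduces look roughly like this in Nielsen’s notation (my transcription, for a single sigmoid output a and label y):

```latex
C = -\frac{1}{n} \sum_x \left[ y \ln a + (1 - y) \ln (1 - a) \right],
\qquad
C_{\text{regularized}} = C_0 + \frac{\lambda}{2n} \sum_w w^2
```

where C_0 is the unregularized cost and \lambda controls how strongly large weights are penalized.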


The Strength of Neural Networks

Chapter 4 develops the key intuition behind why neural networks are so powerful, namely their ability to compute an arbitrary function. Once again, Nielsen avoids rigorous proofs for the sake of building intuition. This is by far my favorite chapter, as Nielsen illustrates (literally, with some great visualizations) how neural networks can be constructed to compute an arbitrary function (first in 1D and then in 2D, so that the idea can be generalized further). Nielsen clearly poured immense time and effort into these visual proofs, and it is very rewarding for the reader.
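To give a flavor of the construction without the visuals, here is a toy numpy sketch of my own (not code from the book): a sigmoid neuron with a very large weight behaves like a step function, two steps of opposite sign form a “bump”, and a single hidden layer summing many bumps can approximate an arbitrary 1D function.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A hidden sigmoid neuron with a very large weight w acts like a step at position s.
def step(x, s, w=1000.0):
    return sigmoid(w * (x - s))

# Two steps of opposite sign make a "bump" of height h on the interval [s1, s2].
def bump(x, s1, s2, h):
    return h * (step(x, s1) - step(x, s2))

x = np.linspace(0.0, 1.0, 200)
target = np.sin(2 * np.pi * x)                  # the function we want to approximate
edges = np.linspace(0.0, 1.0, 21)               # 20 bumps across [0, 1]
heights = np.sin(2 * np.pi * (edges[:-1] + edges[1:]) / 2)

# The "network output" is just a weighted sum of bumps (i.e. of hidden-neuron pairs).
approx = sum(bump(x, a, b, h) for a, b, h in zip(edges[:-1], edges[1:], heights))
print("max error:", np.abs(approx - target).max())
```

Adding more (narrower) bumps drives the error down, which is the essence of the visual argument in the chapter.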


Deep Learning

Chapter 5 begins the discussion of deep neural networks (deep nets for short) and why exactly they are so difficult to train. After seeing what shallow networks with only a single hidden layer can do, you would think that deep nets should perform much better! However, Nielsen explains, through a well-reasoned and intuitive argument, why simply adding layers does not work out so easily, and introduces the source of the problem: unstable (vanishing or exploding) gradients.
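The core of the argument, roughly as I remember it from the chapter: for a toy deep network with one neuron per layer, the gradient with respect to a bias in an early layer is a long product of terms of the form w_j \sigma'(z_j). Since |\sigma'(z)| \le 1/4, that product tends to shrink exponentially with depth (or to explode if the weights are large), so early layers learn far more slowly than later ones.

```latex
\frac{\partial C}{\partial b_1}
= \sigma'(z_1) \, w_2 \, \sigma'(z_2) \, w_3 \, \sigma'(z_3) \cdots w_L \, \sigma'(z_L) \, \frac{\partial C}{\partial a_L}
```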

Chapter 6 is “a long one” (quoting Nielsen himself) that can be broken up into four sections:

  1. Convolutional Neural Networks (CNNs)
  2. Survey of notable image classification approaches
  3. Brief description of other deep net structures
  4. Nielsen’s perspective on the future of neural networks, deep learning, and machine learning

Nielsen first describes one of the most widely used deep net structures: CNNs. Briefly put: instead of using fully-connected layers, where every pixel of the image feeds into every decision, CNNs exploit the spatial structure of the image through local receptive fields and shared weights. This also allows for faster training (as there are fewer weights) and thus more complicated networks! Along with introducing new code that uses the popular Theano library, Nielsen presents the concept of pooling (to further simplify/condense information after a convolutional layer), the use of GPUs (to speed up training), the expansion of the training set (by displacing images by a few pixels), and the use of dropout (for regularization, by “turning off” certain neurons).

In the second part of the chapter, Nielsen steers away from theory in order to provide a survey of recent progress in image recognition. He introduces the ImageNet database (interesting TED Talk by the Principal Investigator behind ImageNet) as the new benchmark for image classification problems (now that we have the computational resources to tackle more complex problems than digit classification). After this overview of different approaches (with links to the corresponding papers), Nielsen introduces other popular deep net structures, namely Recurrent Neural Networks (RNNs), Long Short-Term Memory units (LSTMs), and Deep Belief Networks (DBNs), and explains how a particular structure may be more appropriate for certain applications, such as speech recognition, natural language processing, machine translation, and music informatics.
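To make the convolution-plus-pooling idea concrete, here is a minimal numpy sketch of my own (not Nielsen’s Theano-based network3.py): one 5x5 local receptive field with shared weights slid over an MNIST-sized image, followed by 2x2 max-pooling.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide one shared-weight kernel over the image (cross-correlation, no padding)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(feature_map, size=2):
    """2x2 max-pooling: keep only the strongest activation in each region."""
    h, w = feature_map.shape
    fm = feature_map[:h - h % size, :w - w % size]
    return fm.reshape(h // size, size, w // size, size).max(axis=(1, 3))

image = np.random.rand(28, 28)      # an MNIST-sized input image
kernel = np.random.randn(5, 5)      # one 5x5 feature detector (shared weights)
features = max_pool(conv2d(image, kernel))
print(features.shape)               # (12, 12)
```

A real CNN stacks many such feature maps, applies a nonlinearity, and learns the kernels by backpropagation; the point here is only how weight sharing and pooling reduce the number of parameters compared to a fully-connected layer.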


The Future of Neural Networks and Deep Learning

Finally, Nielsen provides some personal insight into the future of neural networks and machine learning. I will cover this section in a bit more detail, as Nielsen brings up several interesting points that deserve discussion. I strongly recommend it to anyone considering a career in machine learning.

  1. Future of neural networks and deep learning – The sad truth is that we don’t fully understand why neural networks work, which is why finding the right set of parameters is more of an art than a science. Because of this, Nielsen argues, it is hard to say how significant neural networks will be in the future of machine learning. At some point, we may be limited by this lack of understanding and/or by a lack of computational resources (as was the case back in the ’90s). However, Nielsen does predict that “deep learning is here to stay.” He makes an important distinction between neural networks and deep learning: the latter is, more broadly, the “ability to learn hierarchies of concepts, building up multiple layers of abstraction.” According to Nielsen, the architecture that performs deep learning could very well be something other than a neural network (just as people switched from support vector machines to neural networks in recent years).
  2. “Deep learning is, if you’ll forgive the play on words, still a rather shallow field” – Referring to Conway’s law (“organizations which design systems … are constrained to produce designs which are copies of the communication structures of these organizations”), Nielsen reflects on the current status of artificial intelligence and deep learning. He claims that deep learning is still a rather “monolithic field,” in that there are only a few deep ideas and it is still possible for a single person to master most of the deepest concepts in the field. Nielsen contrasts deep learning with medicine, in which we have “fields within fields within fields”! He foresees that it will be some time (perhaps decades) until deep learning grows out of this early stage and becomes a mature field.
  3. “Fashionable” data science – Nielsen points out machine learning’s fashionable role at several companies in the field of data science: finding the “‘known unknowns’ hidden in data.” This trend has brought a lot of money to the field and has attracted many of the most talented researchers, scientists, and engineers to industry. Nielsen argues that “the result will be large teams of people with deep subject expertise, and with access to extraordinary resources. That will propel machine learning further forward, creating more markets and opportunities, a virtuous circle of innovation.” I am a bit skeptical about this, however. As industry attracts the brightest minds, there will be a tendency to hold on to novel breakthroughs and massive databases. These bright minds might end up working against each other (or unknowingly duplicating each other’s work on nearly identical problems) rather than collaborating to create the complex structure of fields within fields that Nielsen himself says deep learning needs. It is true that researchers at these titans of industry are still publishing papers and revealing some of their breakthroughs, but those papers can be quite difficult to read. We need more literature like this book, and more people like Nielsen, who go out of their way to provide intuitive explanations and to motivate young, aspiring students in the direction of understanding, rather than blindly applying models such as neural networks so that companies can optimize their profit.

To finish off this lengthy review of Michael Nielsen’s online book, I refer to his TED talk about open science. He points out that scientists, particularly young ones, tend to hoard data, computer code, ideas, and descriptions of problems for the betterment of their own careers, because by withholding such information they can produce a scientific paper with their name alone on it. I fear the same might occur in deep learning and machine learning, where companies are the driving forces, providing the funding and the resources in terms of technology and data. Nevertheless, projects like Theano and ImageNet (and this book by Michael Nielsen) have done a great job of nurturing an open science culture within machine learning, prompting companies like Google (TensorFlow) and Clarifai to share their code and resources with the masses. As Nielsen says in the above video, I hope that we can break free from this “conservatism” of considering publications (or, for companies, profit) as the only measure of success, and that we “embrace open science and really seize this opportunity that we have to reinvent discovery itself.”