The BOOK

The Little Learner

A straight line to deep learning

Daniel P. Friedman and Anurag Mendhekar


Drawings by Qingqing Su

Forewords by Guy L. Steele Jr. and Peter Norvig

Published by The MIT Press

The Little Learner covers all the concepts necessary to develop an intuitive understanding of the workings of deep neural networks: tensors, extended operators, gradient descent algorithms, artificial neurons, dense networks, convolutional networks, residual networks and automatic differentiation.

The book aims to explain the workings of deep learning to readers who may not yet have acquired the mathematical skills necessary to dive into the subject.

Unlike other books in the field, this book makes very few assumptions about background knowledge (high-school mathematics and familiarity with programming). It uses a layered approach to construct advanced concepts from first principles using really small (“little”) programs that build on one another.

We introduce these ideas in a conversational question-and-answer style that is characteristic of the other books in the Little series. This style puts the reader at ease and allows ideas to be introduced frame by frame, rather than hitting the reader with a wall of text.

AUTHORS

Daniel P. Friedman

Professor of
Computer Science

Indiana University
Bloomington

Daniel P. Friedman is Professor of Computer Science in the School of Informatics, Computing, and Engineering at Indiana University and is the author of many books published by the MIT Press, including The Little Schemer and The Seasoned Schemer (with Matthias Felleisen); The Little Prover (with Carl Eastlund); The Reasoned Schemer (with William E. Byrd, Oleg Kiselyov, and Jason Hemann); and The Little Typer (with David Christiansen).

Anurag
Mendhekar

Co-founder and President

Paper Culture LLC
Los Altos

Anurag Mendhekar is co-founder and President of Paper Culture, where he focuses on developing artificial intelligence for creativity (also known as Generative AI). A serial entrepreneur, he started his career at the world-famous Xerox Palo Alto Research Center (PARC), where he was one of the inventors of Aspect-Oriented Programming. He earned his Ph.D. in Programming Languages from Indiana University, and since then his career has spanned a range of technologies, including distributed systems, image and video compression, and video distribution for VR.

The CODE

About the code

A Racket Package

The code for the book is available as a Racket package known as malt. In order to run and examine the code, follow the instructions in this section.

Transcribing code from the book

An important reminder

We use some special notations in the book to make our code fit nicely into "little" boxes.

Please refer to the notations in Transcribing to Scheme on page xxiii to convert the notation into Scheme/Racket code.


Step 1

Install Racket

Download and install Racket for your operating system here. The minimum version of Racket you will need is 8.2. Installing Racket will also install the command-line program called raco. Ensure that this executable is on your path (set the PATH environment variable to include the Racket directory; how to do this depends on your OS).
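
If you are not sure which version is installed, one quick check is to evaluate the standard Racket function version from DrRacket or the command-line REPL:

#lang racket
(version)  ; should evaluate to "8.2" or later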

Step 2

Get the malt package

There are two ways to install the malt package.
The first is to install it directly from the command line as a Racket package using raco. This will install the package in a default location. In order to follow the source code, you'll have to locate the package in your installation and follow the code from there (you can also look at it directly in the Git repository).

raco pkg install malt

You should also run raco pkg update malt periodically to make sure you have the latest updates.

The second way to install malt is from the Git repository. This method also allows you to explore the source code of the package more easily. The following instructions will help you install it; refer to the repository's README.md file for further details.

For macOS and Linux:
git clone https://github.com/themetaschemer/malt.git
cd malt
make
make install


For Windows:
git clone https://github.com/themetaschemer/malt.git
cd malt
raco pkg install

You should also run git pull periodically in the repository to make sure you have the latest updates.

Step 3

Start Exploring

Racket can be run either on the command line or by using DrRacket, an IDE that comes with the Racket installation. Follow the DrRacket tutorials if you have never used it before. To start working with malt, begin your file like this.

#lang racket
(require malt)

Reference

Documentation for Malt

The detailed documentation for Malt in the standard Racket format is available here.

Chapter GUIDE

Chapter 0

Learning Scheme

Malt is not required to run the code from Chapter 0. Make sure you begin your file with

#lang racket

Chapters 1 through 12

And Interludes I-IV

The malt package is required for running the code in these chapters and most of the interludes; Interlude V is covered separately below. Make sure you begin your files with

#lang racket
(require malt)

Chapter 6 onwards

Tiny Variations

Starting in chapter 6, you might begin to see tiny variations in repeated runs of the same code. This is normal, due to the stochastic nature of sampling-obj wherever it is used.
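
As a small illustration, here is a hypothetical sketch in the style of the book's line-fitting example. The names toy-line, toy-xs, toy-ys, and fit-toy-line, the data, and the hyperparameter values are made up for this sketch; sampling-obj, l2-loss, with-hypers, and naked-gradient-descent are used just as in the examples later on this page.

#lang racket
(require malt)

;; A toy target function and a tiny made-up data set.
(define toy-line
  (λ (x)
    (λ (θ)
      (+ (* (ref θ 0) x) (ref θ 1)))))

(define toy-xs (tensor 2.0 1.0 4.0 3.0))
(define toy-ys (tensor 1.8 1.2 4.2 3.3))

;; Fit the toy line once, using a sampled objective.
(define fit-toy-line
  (λ ()
    (with-hypers ((revs 1000)
                  (alpha 0.01)
                  (batch-size 2))
      (naked-gradient-descent
        (sampling-obj
          (l2-loss toy-line) toy-xs toy-ys)
        (list 0.0 0.0)))))

;; Because sampling-obj draws a random batch at every revision,
;; these two results will typically differ in the last few decimal places.
(fit-toy-line)
(fit-toy-line)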

Interlude V

Extended functions

Malt comes with three different implementations of tensors: learner, nested-tensors, and flat-tensors, which are the three ways tensors and automatic differentiation are developed in Appendices A and B. The default representation is learner.

Interlude V describes the semantics of extended functions and how they can be implemented, but it requires the learner representation of tensors. If you have switched implementations away from learner, you will need to switch back to learner temporarily. To experiment with this code (only if you have switched away from learner), start your file with these lines

#lang racket
(require malt/interlude-V)

This will switch the representation of tensors for the remainder of this file, but tensors exported from this file may not work well if the rest of the code is built with a different representation of tensors. To switch representations completely back to learner, follow the instructions in the README.md of the Git repository or in the malt documentation.

Chapter 13 and Interlude VI
Iris

How to run the Iris example.

And a word of advice.

The Iris example in the book requires the Iris data set to be loaded. It can be loaded like this

#lang racket
(require malt)
(require malt/examples/iris)


This will make the following data set definitions available

iris-train-xs
iris-train-ys
iris-test-xs
iris-test-ys


The example can be run with

(run-iris)

This will run a grid-search to find a well-trained theta, and test its accuracy against iris-test-xs and iris-test-ys. All the code associated with the example is located in the malt/examples/iris sub-directory and can also be directly examined here.

A word of advice.
All neural networks are susceptible to variation depending upon how they are initialized. For larger networks, this is usually not a problem because the training process evens out these variations.

The neural network defined for the Iris example, however, is very small. This makes it susceptible to much larger variations because of the randomness in its initialization. What is important, however, is that we arrive at a trained theta that passes our accuracy thresholds.

Consequently, running grid-search with iris-classifier to find a trained theta can end up with a different result from the one described in the book.

Therefore, with the code for the book, we have included the initial theta that was used when the book went to print. It can be examined like this

tll-iris-initial-theta

The example on page 266 can then be run like this

(define iris-theta
  (with-hypers ((revs 2000)
                (alpha 0.0002)
                (batch-size 8))
     (naked-gradient-descent
       (sampling-obj
         (l2-loss iris-classifier)
         iris-train-xs iris-train-ys)
       tll-iris-initial-theta)))


The trained theta generated will also have some variation because of the stochastic nature of naked-gradient-descent using sampling-obj. This means that the accuracy from one trained iris-theta to another varies somewhat as well.
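
One quick way to see this variation (reusing the call above verbatim; train-iris-once, iris-theta-a, and iris-theta-b are hypothetical names introduced only for this sketch) is to train twice from the same initial theta:

(define train-iris-once
  (λ ()
    (with-hypers ((revs 2000)
                  (alpha 0.0002)
                  (batch-size 8))
      (naked-gradient-descent
        (sampling-obj
          (l2-loss iris-classifier)
          iris-train-xs iris-train-ys)
        tll-iris-initial-theta))))

;; Two trained thetas from the same starting point; their accuracies on
;; iris-test-xs and iris-test-ys will usually be close, but not identical.
(define iris-theta-a (train-iris-once))
(define iris-theta-b (train-iris-once))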

Readers are encouraged to experiment with grid-search as described in Interlude VI to find the combination of hyperparameters that yields the highest possible accuracy on iris-test-xs and iris-test-ys.

Chapter 15
Morse

How to run the Morse example.

Important: The Morse example in the book requires a faster implementation of tensors. Please follow the instructions in the README.md in the repository or the malt documentation to switch implementations to flat-tensors or nested-tensors.

The Morse example in the book also requires its own data set to be loaded. This is done as follows

#lang racket
(require malt)
(require malt/examples/morse)


This will load the data set and provide the following training set definitions

morse-train-xs
morse-train-ys

and the following test set definitions.

morse-test-xs
morse-test-ys


The book describes two different networks, the first a fully convolutional network (morse-fcn) and the second a residual network (morse-residual). Code for this example can be viewed here.

To train and test morse-fcn

(define fcn-model
  (train-morse morse-fcn))
(accuracy fcn-model
 morse-test-xs morse-test-ys)

This will display the accuracy of the trained model against the test set.

To train and test morse-residual

(define residual-model
  (train-morse morse-residual))
(accuracy residual-model
  morse-test-xs morse-test-ys)


This will similarly display the accuracy of the trained model against the test set.

The code in this example is also set up to display the progress of gradient descent by printing a moving average of the network's loss every 20 revisions. For example,

(16080 0.072560) [Memory: 139334768][Window size 6]

This says that the average of the loss across the last 6 batches at the 16,080th revision was 0.07256, while the system consumed about 139 MB of memory. The count of revisions is cumulative, but can be reset by

(log-malt-reset)

The Morse examples are currently set up to run 20,000 revisions during training.

Appendix A

Automatic Differentiation (AD)

The malt package is not required to implement the code in Appendix A. You can use the code of the learner implementation as a reference. Begin your files with

#lang racket

Appendix B

More efficient tensors and AD

Only a portion of the malt package is required to implement the code in Appendix B: primarily the extended non-differentiable operators for nested tensors, and duals. These are provided in the module malt/appendix-B, so begin your file with

#lang racket
(require malt/appendix-B)

For flat-tensors, refer to the code here.

follow US

Stay up-to-date with the latest


Errata

Things that escaped our attention in the 1st printing.

Here is a list of known errors. Should you find more, let us know using any of the methods in the Follow section.

Chapter 2, Pg 37, Frame 26: "Given the same tensor from frame 23" should be "Given a tensor similar to the one in frame 23".

Chapter 4, Pg 76, Frame 12: "at 0.6623" should be "at 0.6263".

Interlude IV, Pg 156, Frame 7: The smoothed sequence should be
5.03 6.8 6.55 6.16 5.73 5.37 4.89

Chapter 9, Pg 174, Frame 44: The starting learning rate should be 0.01, not 0.001.

Interlude V, Pg 190, Frame 44: The definition of *-2-1 provided has some edge cases that produce incorrect results in the learner and nested-tensors representations. To fully address those cases, that definition should be replaced with the following:
(define *-1-1
  (ext2 * 1 1))

(define *-2-1
  (ext2 *-1-1 2 1))

Chapter 11, Pg 229, Frame 57: The definition of k-relu should be replaced with:
(define k-relu
  (λ (k)
    (λ (t)
      (λ (θ)
        (cond
          ((zero? k) t)
          (else (((k-relu (sub1 k))
                  ((relu t) θ))
                 (refr θ 2))))))))

Chapter 15, Pg 339, Frame 71: The definition of skip should be as follows:
(define skip
  (λ (f j)
    (λ (t)
      (λ (θ)
        (+ ((f t) θ)
           (correlate (ref θ j) t))))))


Appendix A, Pg 373, Frame 77: The definition of expt-0-0 should be:
(define expt-0-0
  (prim2 expt
    (λ (ra rb z)
      (values (* z (* rb (expt ra (- rb 1))))
        (* z (* (expt ra rb) (log ra)))))))


Appendix B, Pg 396, Frame 48: The definition of fill-gt-gu should be:
(define fill-gt-gu
  (λ (gt gu init i)
    (let-values (((gti gui) (init i)))
      (vector-set! gt i gti)
      (vector-set! gu i gui)
      (cond
        ((zero? i) (values gt gu))
        (else
          (fill-gt-gu gt gu init (sub1 i)))))))


Kindle Edition, Chapter 3, Frame 44: The left-hand side same-as chart is rendered incorrectly. The correct rendering is as follows:

[Image: corrected rendering of the same-as chart in Chapter 3, Frame 44 of the Kindle edition]