
Neuroflow Optimizations for the MNIST Dataset

Today, I wanted to train a Neuroflow model on a "real" dataset to see if it could hold up. I'm using the MNIST dataset, a collection of 70,000 small images of handwritten digits.

Setup

As outlined in the Working with the MNIST Dataset in JavaScript post, I manipulated the original MNIST dataset to create a simplified version that I could use for training in the browser. Ultimately, I ended up with 10,000 images for training and 2,000 for testing. All images were scaled down to 14x14 pixels, with pixel values normalized and thresholded to binary.
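
For reference, each preprocessed sample is just a flat array of 196 binary pixel values plus a digit label. The field names below are hypothetical; only the dimensions and value ranges come from the preprocessing described above:

// Hypothetical shape of one preprocessed training sample
const sample = {
  label: 7, // digit class, 0-9
  pixels: new Array(14 * 14).fill(0), // 196 values, each 0 or 1 after thresholding
}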

Improving Topological Sort

The first issue I ran into was with the backward pass. With the larger input vector (a 14x14 image flattens to 196 elements), I got a maximum call stack size exceeded error.

neuroflow-js/src/engine.js:142
        visited.add(v)
                ^

RangeError: Maximum call stack size exceeded
    at Set.add (<anonymous>)
    at buildTopo (neuroflow-js/src/engine.js:142:17)

The error was a stack overflow caused by the recursive buildTopo function, which builds the topological sort of the computation graph.

  backward() {
    // topological order all of the children in the graph
    const topo = []
    const visited = new Set()
    const buildTopo = (v) => {
      if (!visited.has(v)) {
        visited.add(v)
        v._prev.forEach((child) => buildTopo(child))
        topo.push(v)
      }
    }
    buildTopo(this)

    // go one variable at a time and apply the chain rule to get its gradient
    this.grad = 1
    topo.reverse().forEach((v) => v._backward())
  }

To fix this, I refactored buildTopo to use an explicit stack instead of recursion: a node stays on the stack until all of its children have been visited, and only then is it appended to the topological order. Here's the updated backward method I ended up using:

  // topological order all of the children in the graph
  // doesn't use recursion to avoid max call stack size exceeded errors
  backward() {
    const topo = [] // List to store nodes in topological order
    const visited = new Set() // Set to track visited nodes
    const addedToTopo = new Set() // Set to ensure nodes are only added to topo once
    const stack = [this] // Stack to manage the iterative DFS

    // Build the topological order using an iterative approach
    while (stack.length > 0) {
      const node = stack[stack.length - 1] // Peek at the top node of the stack
      if (!visited.has(node)) {
        visited.add(node)
        let allChildrenVisited = true
        // Iterate over node._prev in reverse order to preserve the correct processing order
        Array.from(node._prev)
          .reverse()
          .forEach((child) => {
            if (!visited.has(child)) {
              stack.push(child) // Push unvisited children onto the stack
              allChildrenVisited = false
            }
          })
        if (allChildrenVisited) {
          stack.pop() // All children are visited, so remove the node from the stack
          if (!addedToTopo.has(node)) {
            // Check if the node is already added to topo
            topo.push(node) // Add the node to the topological order
            addedToTopo.add(node) // Mark the node as added
          }
        }
      } else {
        stack.pop() // Node has been visited, remove it from the stack
        if (!addedToTopo.has(node)) {
          // Ensure the node is only added once
          topo.push(node) // Add the node to the topological order
          addedToTopo.add(node) // Mark the node as added
        }
      }
    }

    // Reverse the topological order and apply the chain rule
    this.grad = 1 // Initialize the gradient of the output node
    topo.reverse().forEach((v) => v._backward()) // Apply backward pass in topological order
  }

Optimizing Precision

The next issue I ran into was with the floating point precision of the weights. Training was very slow, and many of the weight values were vanishingly small, like 1e-81 or even smaller. To address this, I added a static limitPrecision() method to the Value class. It rounds values to 8 decimal places, so those near-zero weights snap to exactly 0.

  // Round a value to 8 decimal places; extremely small values round to 0
  static limitPrecision(value) {
    return +value.toFixed(8)
  }
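
The snippet above doesn't show where limitPrecision() is called; the natural place is wherever a parameter's data gets updated, e.g. in the gradient descent step. A minimal sketch, assuming a micrograd-style model.parameters() API (these names are assumptions, not Neuroflow's actual optimizer code):

// Hypothetical SGD update that clamps precision after each step.
// model.parameters() and learningRate are assumed names.
for (const p of model.parameters()) {
  p.data = Value.limitPrecision(p.data - learningRate * p.grad)
}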

Adding Epsilon to Prevent -Infinity from log(0)

The cross-entropy loss calculation uses the following formula:

L = -\sum_{i=1}^{C} y_i \log(p_i)

When p_i is 0, the loss is undefined or -Infinity. To prevent this, I added a small constant, \epsilon, to the log() function when required:

  log(epsilon = 1e-8) {
    // Guard against log(0): if the value is 0, replace it with a small epsilon
    if (!this.data) this.data = epsilon

    const out = new Value(Math.log(this.data), [this], 'log')
    out._backward = () => {
      // d/dx log(x) = 1 / x, scaled by the upstream gradient
      this.grad += (1 / this.data) * out.grad
    }
    return out
  }
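
With that guard in place, the per-sample loss can be built entirely out of Value operations so it stays differentiable. A rough sketch, assuming probs is an array of Value softmax outputs, targetIndex is the true class, and Value exposes a mul() method (these names are assumptions about Neuroflow's API, not code from the library):

// Hypothetical cross-entropy for one sample: with a one-hot target, the sum
// over classes reduces to the negative log-probability of the true class.
function crossEntropyLoss(probs, targetIndex) {
  return probs[targetIndex].log().mul(-1)
}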

Optimizing Initialization of Weights

For the initial weights of each neuron, I was simply using random values between -1 and 1:

new Value(rand() * 2 - 1)

With this approach, the loss started very high:

Epoch: 1.1, Loss: 17.64086554, Accuracy: 0.00%, Learning Rate: 0.00999
Epoch: 1.11, Loss: 5.03508861, Accuracy: 20.00%, Learning Rate: 0.0098905484
Epoch: 1.21, Loss: 6.50959899, Accuracy: 10.00%, Learning Rate: 0.0097920868
Epoch: 1.31, Loss: 3.73554129, Accuracy: 20.00%, Learning Rate: 0.0096946055
Epoch: 1.41, Loss: 6.96713219, Accuracy: 10.00%, Learning Rate: 0.0095980946
Epoch: 1.51, Loss: 4.34331557, Accuracy: 20.00%, Learning Rate: 0.0095025444
...

Since I'm using ReLU as the activation function, I switched the weight initialization to a simplified version of He initialization:

w_i \sim \mathcal{N}\left(0, \sqrt{\frac{2}{n}}\right)

...but to avoid the complexity of sampling from a Gaussian distribution ( \mathcal{N} ), I implemented it as:

new Value((rand() - 0.5) * Math.sqrt(2 / numOfInputs))
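
For context, here's roughly where that line lives, sketched as a Neuron constructor. The class layout is the usual micrograd-style structure and is an assumption; only the weight expression itself comes from the change above (rand() is the same random helper as before):

// Hypothetical Neuron constructor using the He-style initialization
class Neuron {
  constructor(numOfInputs) {
    this.w = Array.from(
      { length: numOfInputs },
      () => new Value((rand() - 0.5) * Math.sqrt(2 / numOfInputs))
    )
    this.b = new Value(0)
  }
}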

This change to the initialization dramatically accelerated training by starting from a much lower initial loss:

Epoch: 1.1, Loss: 2.30344208, Accuracy: 10.00%, Learning Rate: 0.00999
Epoch: 1.11, Loss: 2.30361714, Accuracy: 10.00%, Learning Rate: 0.0098905484
Epoch: 1.21, Loss: 2.30658628, Accuracy: 10.00%, Learning Rate: 0.0097920868
Epoch: 1.31, Loss: 2.30407266, Accuracy: 20.00%, Learning Rate: 0.0096946055
Epoch: 1.41, Loss: 2.29762469, Accuracy: 40.00%, Learning Rate: 0.0095980946
Epoch: 1.51, Loss: 2.30069053, Accuracy: 40.00%, Learning Rate: 0.0095025444
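
That starting loss of roughly 2.30 is a useful sanity check: a network that spreads its probability evenly across the 10 digit classes has a cross-entropy of

-\log\left(\tfrac{1}{10}\right) = \log(10) \approx 2.3026

so the better initialization starts the model off as an uninformed-but-calibrated guesser instead of one that is confidently wrong.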

Demo

The optimizations enabled Neuroflow to successfully train on the MNIST dataset, with respectable accuracy within a reasonable timeframe.

Live demo of the MNIST classifier:

MNIST Classification