- Neuroflow Optimizations for the MNIST Dataset
- Setup
- Improving Topological Sort
- Optimizing Precision
- Adding epsilon to prevent -Infinity from log(0)
- Optimizing Initialization of Weights
- Demo

Today, I wanted to train a Neuroflow model on a "real" dataset to see if it can hold up. I'm using the MNIST dataset, which is a collection of 70,000 small images of handwritten digits.

As outlined in the Working with the MNIST Dataset in JavaScript post, I manipulated the original MNIST dataset to create a simplified version that I could use for training in the browser. Ultimately, I created a dataset of 10,000 images for training, and a dataset of 2,000 images for testing. All images were scaled down to **14x14 pixels** with normalized pixel values and threshold to binary.
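The normalize-and-threshold step can be sketched roughly like this (the function name and threshold value are illustrative, not Neuroflow's actual code):

```javascript
// Hypothetical sketch: scale 0-255 grayscale pixels to [0, 1],
// then threshold to binary. A 14x14 image is a flat array of
// 196 such values after this step.
function binarize(pixels, threshold = 0.5) {
  return pixels.map((p) => (p / 255 > threshold ? 1 : 0))
}

const sample = binarize([0, 40, 128, 200, 255])
// sample is [0, 0, 1, 1, 1] with the default threshold
```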

The first issue I ran into was with the backward pass. Because of the increased dimensionality of the input vector (a 14x14 image flattens to 196 elements), I got a `maximum call stack size exceeded` error:

```
neuroflow-js/src/engine.js:142
visited.add(v)
^
RangeError: Maximum call stack size exceeded
at Set.add (<anonymous>)
at buildTopo (neuroflow-js/src/engine.js:142:17)
```

This stack overflow was caused by the recursion in the `buildTopo` function, which recursively builds the topological sort of the computation graph:

```
backward() {
  // topological order all of the children in the graph
  const topo = []
  const visited = new Set()
  const buildTopo = (v) => {
    if (!visited.has(v)) {
      visited.add(v)
      v._prev.forEach((child) => buildTopo(child))
      topo.push(v)
    }
  }
  buildTopo(this)

  // go one variable at a time and apply the chain rule to get its gradient
  this.grad = 1
  topo.reverse().forEach((v) => v._backward())
}
```

To fix this, we can refactor `buildTopo` to use an iterative depth-first search instead of recursion. Here's the updated `backward` method I ended up using:

```
// topological order all of the children in the graph
// doesn't use recursion to avoid max call stack size exceeded errors
backward() {
  const topo = [] // List to store nodes in topological order
  const visited = new Set() // Set to track visited nodes
  const addedToTopo = new Set() // Set to ensure nodes are only added to topo once
  const stack = [this] // Stack to manage the iterative DFS

  // Build the topological order using an iterative approach
  while (stack.length > 0) {
    const node = stack[stack.length - 1] // Peek at the top node of the stack
    if (!visited.has(node)) {
      visited.add(node)
      let allChildrenVisited = true
      // Iterate over node._prev in reverse order to preserve the correct processing order
      Array.from(node._prev)
        .reverse()
        .forEach((child) => {
          if (!visited.has(child)) {
            stack.push(child) // Push unvisited children onto the stack
            allChildrenVisited = false
          }
        })
      if (allChildrenVisited) {
        stack.pop() // All children are visited, so remove the node from the stack
        if (!addedToTopo.has(node)) {
          topo.push(node) // Add the node to the topological order
          addedToTopo.add(node) // Mark the node as added
        }
      }
    } else {
      stack.pop() // Node has been visited, remove it from the stack
      if (!addedToTopo.has(node)) {
        topo.push(node) // Ensure the node is only added once
        addedToTopo.add(node) // Mark the node as added
      }
    }
  }

  // Reverse the topological order and apply the chain rule
  this.grad = 1 // Initialize the gradient of the output node
  topo.reverse().forEach((v) => v._backward()) // Apply backward pass in topological order
}
```
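The same iterative post-order traversal works on any DAG. Here's a minimal standalone sketch of the pattern, with plain objects and a `prev` array standing in for `Value` and `_prev` (a simplification of the method above, not Neuroflow's code):

```javascript
// Iterative post-order (topological) traversal: peek at the top of the
// stack, push unvisited parents, and emit a node once it is seen again.
function topoSort(root) {
  const topo = []
  const visited = new Set()
  const stack = [root]
  while (stack.length > 0) {
    const node = stack[stack.length - 1]
    if (!visited.has(node)) {
      visited.add(node)
      node.prev.forEach((child) => {
        if (!visited.has(child)) stack.push(child)
      })
    } else {
      stack.pop()
      if (!topo.includes(node)) topo.push(node) // add each node once
    }
  }
  return topo
}

// a depends on b, which depends on c: children come before ancestors
const c = { name: 'c', prev: [] }
const b = { name: 'b', prev: [c] }
const a = { name: 'a', prev: [b] }
// topoSort(a).map((n) => n.name) yields ['c', 'b', 'a']
```

Reversing that order, as `backward()` does, visits each node only after all nodes that depend on it, which is exactly what the chain rule requires.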

The next issue I ran into was with the floating-point precision of the weights. Training was very slow, and many of the weight values shrank to magnitudes like `1e-81` and smaller. To address this, I added a static `limitPrecision()` method to the `Value` class that limits values to 8 decimal places.

```
static limitPrecision(value) {
  return +value.toFixed(8)
}
```
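`toFixed(8)` produces a string, and the unary `+` converts it back to a number; anything smaller than the 8th decimal place collapses to 0. For example (values are illustrative):

```javascript
// Round to 8 decimal places: toFixed returns a string, unary + parses it.
function limitPrecision(value) {
  return +value.toFixed(8)
}

const tiny = limitPrecision(1e-81) // 0 — vanishingly small values collapse
const rounded = limitPrecision(0.123456789) // 0.12345679
```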

The cross-entropy loss calculation uses the following formula:

$L = -\sum_{i=1}^{C} y_i \log(p_i)$

When $p_i$ is 0, $\log(p_i)$ evaluates to `-Infinity` and the loss becomes undefined. To prevent this, I added a small constant, $\epsilon$, to the `log()` function when required:

```
log(epsilon = 1e-8) {
  // Substitute epsilon for 0 so Math.log never returns -Infinity
  if (!this.data) this.data = epsilon
  const out = new Value(Math.log(this.data), [this], 'log')
  out._backward = () => {
    this.grad += (1 / this.data) * out.grad
  }
  return out
}
```
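To see the effect on plain numbers, here's a hypothetical cross-entropy sketch (not Neuroflow's `Value`-based code) that clamps the probability the same way:

```javascript
// Cross-entropy for a one-hot target: L = -log(p_target).
// Clamping p to epsilon keeps the loss finite when p is 0.
function crossEntropy(probs, targetIndex, epsilon = 1e-8) {
  const p = Math.max(probs[targetIndex], epsilon)
  return -Math.log(p)
}

const ok = crossEntropy([0.7, 0.2, 0.1], 0) // ≈ 0.357
const clamped = crossEntropy([0, 0.5, 0.5], 0) // ≈ 18.42, not Infinity
```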

For the initial weights of the neuron, I was simply randomly initializing weights between -1 and 1:

```
new Value(rand() * 2 - 1)
```

With this approach, the loss started very high:

```
Epoch: 1.1, Loss: 17.64086554, Accuracy: 0.00%, Learning Rate: 0.00999
Epoch: 1.11, Loss: 5.03508861, Accuracy: 20.00%, Learning Rate: 0.0098905484
Epoch: 1.21, Loss: 6.50959899, Accuracy: 10.00%, Learning Rate: 0.0097920868
Epoch: 1.31, Loss: 3.73554129, Accuracy: 20.00%, Learning Rate: 0.0096946055
Epoch: 1.41, Loss: 6.96713219, Accuracy: 10.00%, Learning Rate: 0.0095980946
Epoch: 1.51, Loss: 4.34331557, Accuracy: 20.00%, Learning Rate: 0.0095025444
...
```

Since I'm using ReLU as the activation function, I switched the weights initialization to a simpler version of the He Weight Initialization:

$w_i \sim \mathcal{N}\left(0, \sqrt{\frac{2}{n}}\right)$

...but to avoid the complexity of sampling from a Gaussian distribution ( $\mathcal{N}$ ), I implemented it with a uniform distribution instead:

```
new Value((rand() - 0.5) * Math.sqrt(2 / numOfInputs))
```
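With 196 inputs (14x14 pixels), this keeps the initial weights in a much narrower band than $[-1, 1]$. A quick sketch of the resulting range, using `Math.random()` in place of `rand()`:

```javascript
// Uniform He-style init: (rand - 0.5) * sqrt(2 / n) yields weights in
// [-0.5 * sqrt(2/n), 0.5 * sqrt(2/n)] instead of [-1, 1].
const numOfInputs = 196 // 14x14 input pixels
const scale = Math.sqrt(2 / numOfInputs) // ≈ 0.101
const w = (Math.random() - 0.5) * scale
// |w| is at most scale / 2 ≈ 0.05, versus up to 1.0 before
```

Smaller initial weights keep the pre-activation sums (196 terms each) from blowing up, which is why the initial loss drops from ~17.6 to ~2.3 (roughly $-\log(1/10)$, the loss of a uniform guess over 10 classes).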

This change dramatically accelerated the training process by achieving a lower initial loss:

```
Epoch: 1.1, Loss: 2.30344208, Accuracy: 10.00%, Learning Rate: 0.00999
Epoch: 1.11, Loss: 2.30361714, Accuracy: 10.00%, Learning Rate: 0.0098905484
Epoch: 1.21, Loss: 2.30658628, Accuracy: 10.00%, Learning Rate: 0.0097920868
Epoch: 1.31, Loss: 2.30407266, Accuracy: 20.00%, Learning Rate: 0.0096946055
Epoch: 1.41, Loss: 2.29762469, Accuracy: 40.00%, Learning Rate: 0.0095980946
Epoch: 1.51, Loss: 2.30069053, Accuracy: 40.00%, Learning Rate: 0.0095025444
```

These optimizations enabled Neuroflow to train successfully on the MNIST dataset, reaching respectable accuracy within a reasonable timeframe.

Live demo of the MNIST classifier: