Today, I wanted to train a Neuroflow model on a "real" dataset to see if it can hold up. I'm using the MNIST dataset, which is a collection of 70,000 small images of handwritten digits.
As outlined in the Working with the MNIST Dataset in JavaScript post, I manipulated the original MNIST dataset to create a simplified version that I could use for training in the browser. Ultimately, I created a dataset of 10,000 images for training and a dataset of 2,000 images for testing. All images were scaled down to 14x14 pixels, with pixel values normalized and thresholded to binary.
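For reference, the preprocessing boils down to something like the sketch below. This is not the exact code from that post; the 2x2 block averaging and the 0.5 threshold are assumptions on my part:

// Downscale a 28x28 MNIST image to 14x14 by averaging 2x2 blocks,
// normalize to [0, 1], then threshold to binary.
function preprocess(pixels28) {
  // pixels28: flat array of 784 grayscale values in [0, 255]
  const out = []
  for (let y = 0; y < 14; y++) {
    for (let x = 0; x < 14; x++) {
      const i = 2 * y * 28 + 2 * x
      const avg = (pixels28[i] + pixels28[i + 1] + pixels28[i + 28] + pixels28[i + 29]) / 4
      const normalized = avg / 255 // normalize to [0, 1]
      out.push(normalized > 0.5 ? 1 : 0) // threshold to binary
    }
  }
  return out // 196-element binary vector
}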
The first issue I ran into was with the backward pass. Because of the increased dimensionality of the input vector (14x14 = 196 elements), I got a maximum call stack size exceeded error:
neuroflow-js/src/engine.js:142
visited.add(v)
^
RangeError: Maximum call stack size exceeded
at Set.add (<anonymous>)
at buildTopo (neuroflow-js/src/engine.js:142:17)
This error was a stack overflow caused by the recursive buildTopo function, which builds the topological sort of the computation graph.
backward() {
  // topological order all of the children in the graph
  const topo = []
  const visited = new Set()
  const buildTopo = (v) => {
    if (!visited.has(v)) {
      visited.add(v)
      v._prev.forEach((child) => buildTopo(child))
      topo.push(v)
    }
  }
  buildTopo(this)

  // go one variable at a time and apply the chain rule to get its gradient
  this.grad = 1
  topo.reverse().forEach((v) => v._backward())
}
To fix this, we can refactor buildTopo to use an iterative depth-first search instead of recursion. Here's the updated backward method I ended up using:
// topological order all of the children in the graph
// doesn't use recursion to avoid max call stack size exceeded errors
backward() {
  const topo = [] // List to store nodes in topological order
  const visited = new Set() // Set to track visited nodes
  const addedToTopo = new Set() // Set to ensure nodes are only added to topo once
  const stack = [this] // Stack to manage the iterative DFS

  // Build the topological order using an iterative approach
  while (stack.length > 0) {
    const node = stack[stack.length - 1] // Peek at the top node of the stack
    if (!visited.has(node)) {
      visited.add(node)
      let allChildrenVisited = true
      // Iterate over node._prev in reverse order to preserve the correct processing order
      Array.from(node._prev)
        .reverse()
        .forEach((child) => {
          if (!visited.has(child)) {
            stack.push(child) // Push unvisited children onto the stack
            allChildrenVisited = false
          }
        })
      if (allChildrenVisited) {
        stack.pop() // All children are visited, so remove the node from the stack
        if (!addedToTopo.has(node)) {
          // Check if the node is already added to topo
          topo.push(node) // Add the node to the topological order
          addedToTopo.add(node) // Mark the node as added
        }
      }
    } else {
      stack.pop() // Node has been visited, remove it from the stack
      if (!addedToTopo.has(node)) {
        // Ensure the node is only added once
        topo.push(node) // Add the node to the topological order
        addedToTopo.add(node) // Mark the node as added
      }
    }
  }

  // Reverse the topological order and apply the chain rule
  this.grad = 1 // Initialize the gradient of the output node
  topo.reverse().forEach((v) => v._backward()) // Apply backward pass in topological order
}
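As a quick sanity check, a graph far deeper than the recursion limit now backpropagates fine. This sketch assumes Value exposes a micrograd-style add() method that records its operands in _prev:

let x = new Value(1)
for (let i = 0; i < 100000; i++) {
  x = x.add(new Value(1)) // builds a chain roughly 100k nodes deep
}
x.backward() // completes without "Maximum call stack size exceeded"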
The next issue I ran into was with the floating point precision of the weights. Training was very slow, and many of the weight values were shrinking to numbers like 1e-81 and even smaller. To address this, I added a static limitPrecision() method to the Value class, which limits the precision of the weights to 8 decimal places.
static limitPrecision(value) {
  return +value.toFixed(8)
}
The cross-entropy loss calculation uses the following formula:

L = -∑ᵢ yᵢ log(ŷᵢ)

When ŷᵢ is 0, Math.log returns -Infinity and the loss becomes undefined. To prevent this, I added a small constant, ε (epsilon), to the log() function when required:
log(epsilon = 1e-8) {
  if (!this.data) this.data = epsilon
  const out = new Value(Math.log(this.data), [this], 'log')
  out._backward = () => {
    this.grad += (1 / this.data) * out.grad
  }
  return out
}
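With that in place, the loss for a single example reduces to the negative log-probability of the correct class, since the targets are one-hot. A minimal sketch (the helper name and the mul() call are my assumptions, following the micrograd-style API):

// probs: array of Value outputs from the softmax layer
// target: index of the correct digit (0-9)
function crossEntropyLoss(probs, target) {
  // L = -log(p_target); the epsilon inside log() keeps this finite when p_target is 0
  return probs[target].log().mul(-1)
}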
For the initial weights of each neuron, I was simply initializing them randomly between -1 and 1:
new Value(rand() * 2 - 1)
With this approach, the loss started very high:
Epoch: 1.1, Loss: 17.64086554, Accuracy: 0.00%, Learning Rate: 0.00999
Epoch: 1.11, Loss: 5.03508861, Accuracy: 20.00%, Learning Rate: 0.0098905484
Epoch: 1.21, Loss: 6.50959899, Accuracy: 10.00%, Learning Rate: 0.0097920868
Epoch: 1.31, Loss: 3.73554129, Accuracy: 20.00%, Learning Rate: 0.0096946055
Epoch: 1.41, Loss: 6.96713219, Accuracy: 10.00%, Learning Rate: 0.0095980946
Epoch: 1.51, Loss: 4.34331557, Accuracy: 20.00%, Learning Rate: 0.0095025444
...
Since I'm using ReLU as the activation function, I switched the weight initialization to a simpler version of the He Weight Initialization:

w ~ N(0, √(2 / n)), where n is the number of inputs to the neuron

...but to avoid the complexity of sampling from a Gaussian distribution (N), I implemented it as:
new Value((rand() - 0.5) * Math.sqrt(2 / numOfInputs))
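For comparison, a textbook He initialization samples from a Gaussian, which in plain JavaScript needs something like the Box-Muller transform; that's the complexity the simplified uniform version above avoids. A sketch only, not part of Neuroflow:

function heInit(numOfInputs) {
  // Box-Muller transform: turn two uniform samples into a standard normal sample
  const u1 = 1 - Math.random() // avoid log(0)
  const u2 = Math.random()
  const gaussian = Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2) // ~N(0, 1)
  return new Value(gaussian * Math.sqrt(2 / numOfInputs)) // scale to std √(2 / n)
}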
This change dramatically accelerated training by starting from a much lower initial loss:
Epoch: 1.1, Loss: 2.30344208, Accuracy: 10.00%, Learning Rate: 0.00999
Epoch: 1.11, Loss: 2.30361714, Accuracy: 10.00%, Learning Rate: 0.0098905484
Epoch: 1.21, Loss: 2.30658628, Accuracy: 10.00%, Learning Rate: 0.0097920868
Epoch: 1.31, Loss: 2.30407266, Accuracy: 20.00%, Learning Rate: 0.0096946055
Epoch: 1.41, Loss: 2.29762469, Accuracy: 40.00%, Learning Rate: 0.0095980946
Epoch: 1.51, Loss: 2.30069053, Accuracy: 40.00%, Learning Rate: 0.0095025444
These optimizations enabled Neuroflow to train successfully on the MNIST dataset, reaching respectable accuracy within a reasonable timeframe.
Live demo of the MNIST classifier: