All of our networks were implemented and trained in PyTorch for 150 iterations over a dataset of binary and decimal representations of the numbers 1 to 1024. As a cursory performance evaluation to inform the discussion below, we also implemented MLPs with the same type of layer scaling for the MNIST task \cite{lecun1998gradient}. The architectures we used were (a sketch of these models follows the list):
- Narrowing Network: (784 -> 800 -> 600 -> 400 -> 200 -> 10) Accuracy: 85% on MNIST
- Constant Width Network: (784 -> 500 -> 500 -> 500 -> 500 -> 10) Accuracy: 58% on MNIST
- Widening Network: (784 -> 200 -> 400 -> 600 -> 800 -> 10) Accuracy: 61% on MNIST
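For concreteness, the following is a minimal PyTorch sketch of the three architectures above. The `make_mlp` helper is hypothetical, and the ReLU activations are an assumption; the original text does not specify the activation function or other training details.

```python
# Minimal sketch of the three MNIST MLPs, assuming ReLU activations and
# flattened 28x28 inputs. These choices are assumptions, not taken from
# the original experiments.
import torch
import torch.nn as nn


def make_mlp(widths):
    """Build an MLP from a list of layer widths, e.g. [784, 800, ..., 10]."""
    layers = []
    for i, (n_in, n_out) in enumerate(zip(widths[:-1], widths[1:])):
        layers.append(nn.Linear(n_in, n_out))
        if i < len(widths) - 2:  # no activation after the output layer
            layers.append(nn.ReLU())
    return nn.Sequential(*layers)


narrowing = make_mlp([784, 800, 600, 400, 200, 10])
constant = make_mlp([784, 500, 500, 500, 500, 10])
widening = make_mlp([784, 200, 400, 600, 800, 10])

x = torch.randn(32, 784)   # a batch of 32 flattened MNIST images
logits = narrowing(x)      # shape: (32, 10)
```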