All of our networks were implemented in and trained with PyTorch for 150 iterations over a dataset of the integers 1 to 1024 in binary and decimal digit representations. As a cursory performance evaluation to inform the discussion below, we also implemented MLPs with the same type of layer scaling on the MNIST task \cite{lecun1998gradient}.
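For concreteness, a minimal PyTorch sketch of one plausible version of this training setup follows. The encoding, loss, layer sizes, and the helper names \texttt{binary\_encode} and \texttt{decimal\_encode} are illustrative assumptions rather than our exact configuration: the sketch assumes each integer is presented as an 11-bit binary input vector and the target is its four decimal digits, trained full-batch with cross-entropy over each digit position.

\begin{verbatim}
import torch

# Assumed encoding (illustrative): each integer n in [1, 1024] becomes an
# 11-bit binary input vector; the target is its 4 decimal digits.
def binary_encode(n, bits=11):
    return torch.tensor([(n >> i) & 1 for i in range(bits)],
                        dtype=torch.float32)

def decimal_encode(n, digits=4):
    return torch.tensor([int(d) for d in f"{n:0{digits}d}"],
                        dtype=torch.long)

X = torch.stack([binary_encode(n) for n in range(1, 1025)])   # (1024, 11)
Y = torch.stack([decimal_encode(n) for n in range(1, 1025)])  # (1024, 4)

# Placeholder MLP; the actual layer-scaled architectures are listed below.
model = torch.nn.Sequential(
    torch.nn.Linear(11, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 40),  # 4 digit positions x 10 classes each
)
opt = torch.optim.Adam(model.parameters())
loss_fn = torch.nn.CrossEntropyLoss()

for _ in range(150):  # 150 full-batch iterations over the dataset
    opt.zero_grad()
    # reshape logits to (N, 10 classes, 4 digit positions) for CE loss
    logits = model(X).view(-1, 4, 10).permute(0, 2, 1)
    loss = loss_fn(logits, Y)  # cross-entropy per digit position
    loss.backward()
    opt.step()
\end{verbatim}

The architectures we used were: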