Considering the short training and the small difference in mAP between the smallest subset (0.5% of the data, mAP 0.7149) and the full dataset (100%, mAP 0.7389), the suspicion of an error in the training setup was strongly supported. A second look at the descriptions of the KITTI and COCO datasets showed that the problem lay in the high similarity of the two datasets. Since the COCO ResNet model was already pre-trained on the two classes 'person' and 'car', which are nearly identical to the KITTI classes 'pedestrian' and 'car', training on KITTI only made the pre-trained model slightly better at detecting cars and pedestrians and did not offer a real example of transfer learning. This explains why the mAP is already quite high with few examples and a low number of training steps.
As a result, the training was repeated using the German Traffic Sign Detection dataset. Since this dataset is rather small, with only 900 samples, augmentation was used to approximately double its size. For augmentation, simple filters such as blurring were applied at random. The training process ran for 20,000 steps.
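The augmentation step can be illustrated with a minimal sketch. It is not the pipeline used in this work: the function names, the box blur, the filter pool, and the representation of images as plain nested lists of grayscale values are all assumptions made for illustration. The sketch adds exactly one filtered copy per sample, whereas the dataset here was only approximately doubled. Because a blur does not move objects, the bounding boxes are copied unchanged.

```python
import random


def box_blur(image, radius=1):
    """Return a blurred copy of a 2D grayscale image (list of rows)."""
    height, width = len(image), len(image[0])
    blurred = []
    for y in range(height):
        row = []
        for x in range(width):
            # Average over the window, clipped at the image border.
            window = [
                image[j][i]
                for j in range(max(0, y - radius), min(height, y + radius + 1))
                for i in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            row.append(sum(window) / len(window))
        blurred.append(row)
    return blurred


# Hypothetical filter pool; the text only names blurring as an example.
FILTERS = [
    lambda img: box_blur(img, radius=1),
    lambda img: box_blur(img, radius=2),
]


def augment_dataset(samples, rng):
    """Append one randomly filtered copy per (image, boxes) sample."""
    augmented = list(samples)
    for image, boxes in samples:
        chosen_filter = rng.choice(FILTERS)
        # Photometric filters leave object positions intact,
        # so the annotations are reused as they are.
        augmented.append((chosen_filter(image), list(boxes)))
    return augmented
```

Calling `augment_dataset(samples, random.Random(0))` on a list of `(image, boxes)` pairs returns the original samples followed by their filtered copies. Passing a seeded `random.Random` instance keeps the choice of filters reproducible between runs.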