Introduction

Huge datasets with hundreds of classes are usually full of mistakes and omissions, and lack a clear definition of the annotation semantics. The general belief that smart models will eventually figure out the semantics by themselves is not satisfactory. We think it is possible to bring some lightweight structure and completeness to these datasets. We show that doing so allows serious improvements, such as mistake cleaning and data augmentation, to the benefit of learning algorithms. We also show that it allows semantic metrics to be used to evaluate models. Last but not least, we present our work on a Music Tag Dataset, built using Deezer playlist titles as annotations, as an example of a lightly structured, massively multi-label dataset for machine learning experiments.

This paper is structured as follows. In the introduction, we present the issues that researchers face when addressing massively multi-label and/or multi-class problems. The main problems are: