Abstract
In this work, we propose a lightly structured organisational system for managing datasets with numerous and heterogeneous labels. In the general context of extensive research conducted on massive datasets with poor annotation quality, we present a simple yet effective procedure to reduce the effect of annotation noise on learned models, and a virtuous circle towards ensuring semantic consistency and interpretability. We present an example of how to process a large collection of low-quality music tags into a semi-structured, semantically consistent dataset for machine learning experiments. As an example of how such a dataset can be built, we propose the DeezerTagSet, open for researchers to use and improve.
Introduction
Massive datasets with hundreds of classes are usually full of mistakes and incompletions, and lack a clear definition of the annotation semantics. The general belief that smart models will eventually figure out the semantics by themselves is not satisfactory. We think it is possible to bring some lightweight structure and completeness to these datasets. We show that doing so allows serious improvements, such as mistake cleaning and data augmentation, to the benefit of learning algorithms. We also show that this allows the use of semantic metrics on models. Last but not least, we present our work on a Music Tag Dataset, built from Deezer playlist titles used as annotations, as an example of a lightly structured, massively multi-label dataset for machine learning experiments.
Formally, we can define a dataset \(\mathcal{D} = \{(x_i, (y_i^1, y_i^2, \dots, y_i^{N_i}))\}\) as a set of pairs of items \(x_i\) and associated labels \((y_i^1, y_i^2, \dots, y_i^{N_i})\) chosen from a (usually) finite set \(\mathcal{Y}\). The number of tags \(N_i\) associated with each element can vary, and the total size \(N_Y = \mathrm{card}(\mathcal{Y})\) can be in the order of thousands. Alternatively, one can think of a multi-label dataset in the form \(\mathcal{D} = \{(x_i, \mathbf{y}_i)\}\), with \(\mathbf{y}_i\) a binary vector of size \(N_Y\). The problem of correctly identifying the subset of labels associated with an item given a training dataset is referred to as multi-label classification (MLC). The specific case where each item has only one associated label is called multi-class classification (MCC).
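To make the two views concrete, here is a minimal sketch of turning variable-length tag lists into binary label vectors \(\mathbf{y}_i\); the tag vocabulary and track identifiers are illustrative placeholders, not part of any existing dataset:

```python
# Two equivalent dataset views: (item, tag list) pairs versus
# fixed-size binary label vectors of size N_Y.
import numpy as np

tag_vocabulary = ["rock", "jazz", "trumpet", "funk"]          # the finite set Y
tag_index = {tag: i for i, tag in enumerate(tag_vocabulary)}  # tag -> column

dataset = [
    ("track_001", ["jazz", "trumpet"]),   # item with two labels
    ("track_002", ["funk"]),              # item with one label
]

def to_binary_vector(tags, n_labels=len(tag_vocabulary)):
    """Encode a variable-length tag list as a fixed-size binary vector y."""
    y = np.zeros(n_labels, dtype=np.int8)
    for tag in tags:
        y[tag_index[tag]] = 1
    return y

Y = np.stack([to_binary_vector(tags) for _, tags in dataset])
print(Y)  # [[0 1 1 0], [0 0 0 1]] -- sparse, mostly zeros
```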
Models are usually trained on annotated data (train set) and evaluated on predicting labels on a distinct test set. Traditional metrics then measure the correspondence between predicted labels and a "ground truth".
Advances in computational capabilities and deep models have led researchers to address MLC/MCC problems in a variety of domains (ImageNet etc.). Few studies have emphasized the inherent challenges that arise when dealing with such data. First, annotation comes at a price: massive amounts of annotation are therefore very costly to obtain. A typical way of circumventing this is to distribute and outsource the annotation job, say to a large pool of anonymous annotators. The general hypothesis in the ML community is that "better data is more data". In this spirit, large-scale datasets have been proposed.
Multi-label datasets
Perhaps the best-known dataset in the machine learning community, which has served as a benchmark for many advances in visual classification, is ImageNet. Millions of images are provided with a great variety of global labels, such as varieties of "cat" and "dog". It has been extensively used for supervised learning competitions and is still a major source of labels for images and videos. The set of labels comes from the rich and structured WordNet database, but few works addressing the multi-label classification challenge really exploit the resources of the WordNet system in the task.
An important point to note is that, although the annotations come from a largely crowdsourced process (via Mechanical Turk), each label has been assigned a clear definition (with a Wikipedia link) and a quality estimation process is run to measure the precision of the labelling.
ImageNet has succeeded in providing a good-quality dataset for ML researchers because they have addressed the issues of ambiguity, item-label relationship and quality control. Using WordNet has been a huge aid in this approach. WordNet cannot be used directly for music, however, because it lacks some of the concepts that are key to musical understanding, such as the musical genre "dubstep".
Contrary to the approach adopted by ImageNet, musical datasets have long used the concept of folksonomy to derive their list of labels. [refs MusicBrainz, LastFM]
Million Song Dataset \cite{McFee_2012}
[TODO]
AudioSet
[TODO]
Label-item relationship
One piece of information is rarely accessible, let alone explicitly defined, during an annotation process: the item-label relationship. In music-oriented datasets, one usually gets global labels (Million Song Dataset, AudioSet). In AudioSet, when a label such as "Trumpet" is applied to a track, this can typically mean multiple things:
- The sample contains sound produced by a trumpet
- The sample contains EXCLUSIVELY sound produced by a trumpet
- The sample is about a trumpet player
Clearly, which one of the above corresponds to the "ground truth" is of importance when considering an ML experiment, say training an instrument classifier. The ambiguity of an item-label relationship can grow more complex with the nature of the label. A typical example is a spatial label.
[TODO] List all possible relationship implied by a "Italy" tag on a song
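As a hedged illustration of what making the item-label relationship explicit could look like in practice, the following sketch attaches a relationship type to each annotation. The relationship names mirror the "Trumpet" readings listed above; all identifiers are hypothetical, not part of AudioSet or any existing schema:

```python
# A hypothetical typed annotation, instead of a bare (item, label) pair.
from enum import Enum
from typing import NamedTuple

class Relationship(Enum):
    CONTAINS = "sample contains sound produced by the labelled source"
    CONTAINS_ONLY = "sample contains exclusively sound from the labelled source"
    IS_ABOUT = "sample is about the labelled concept (lyrics, theme, ...)"

class Annotation(NamedTuple):
    item_id: str
    label: str
    relationship: Relationship

# The same label, three very different ground truths:
annotations = [
    Annotation("track_42", "trumpet", Relationship.CONTAINS),
    Annotation("track_43", "trumpet", Relationship.CONTAINS_ONLY),
    Annotation("track_44", "trumpet", Relationship.IS_ABOUT),
]

# An instrument classifier would keep only the acoustic relationships:
acoustic = [a for a in annotations
            if a.relationship in (Relationship.CONTAINS, Relationship.CONTAINS_ONLY)]
```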
Negative examples
Negative annotation is the process of explicitly defining the absence of a relationship between an item and a label. This is very rarely found in datasets, especially when the number of labels grows large. In general, annotations are very sparse (the vector \(\mathbf{y}\) is mostly zeros). Explicit negative relationships are exploited by some representation learning models, such as those based on triplet losses (TristouNet + refs). In these cases, the joint input of a positive and a negative example greatly helps the model to learn efficient representations, e.g. for clustering [refs].
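As a minimal illustration of how an explicit negative example enters such a loss, here is a sketch of a standard triplet hinge loss; the embeddings below are toy placeholders, not the output of any particular model:

```python
# Triplet loss: push the anchor closer to the positive than to the
# negative by at least `margin` (squared Euclidean distances).
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)

anchor   = np.array([0.1, 0.9])   # e.g. a track tagged "trumpet"
positive = np.array([0.2, 0.8])   # another track tagged "trumpet"
negative = np.array([0.9, 0.1])   # a track explicitly tagged NOT "trumpet"
print(triplet_loss(anchor, positive, negative))  # 0.0: triplet already satisfied
```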
[TODO] LastFM example of NOT tags ?
Some "tags" are negatively defined (e.g. underground, indie)
Cultural ambiguities
Another drawback of using tags without explicit semantics is the cultural ambiguity that may arise. An example of such a problem that we experienced is with the genre label "funk".
[TODO] examples of Funk
Part I: Massively Multi-Label and Multi-Class Problems
tasks
some tasks and challenges (from ImageNet to AudioSet)
methods
Either top-down ontologies (inference through semantic rules) or bottom-up, learning embeddings from the data (word2vec, neural networks, etc.)
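As a sketch of the bottom-up route, one can treat each item's tag list as a "sentence" and run word2vec on the co-occurrences; the tag lists below are toy placeholders:

```python
# Learn tag embeddings from co-occurrence, word2vec style (gensim).
from gensim.models import Word2Vec

tag_lists = [
    ["jazz", "trumpet", "swing"],
    ["jazz", "saxophone", "bebop"],
    ["funk", "disco", "dance"],
    ["funk", "bass", "dance"],
]

model = Word2Vec(sentences=tag_lists, vector_size=16, window=5,
                 min_count=1, sg=1, epochs=50, seed=0)

# Tags that co-occur end up close in the embedding space:
print(model.wv.most_similar("jazz", topn=2))
```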
Issues with datasets
Issues: nonexistent semantics, bias, incompleteness, mistakes, overlapping concepts, unknown reviewers. We may focus on audio datasets as examples.
Issues with evaluation
If the ground truth is noisy (and it usually is), what are evaluations good for? Precision/recall are not very informative. Moreover, semantically important mistakes are not penalized more than small ones.
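One direction, sketched below under the assumption that some tag hierarchy is available, is to weight mistakes by semantic distance, so that confusing two sibling genres costs less than confusing unrelated ones; the hierarchy and cost values here are hand-made toys, not a proposed standard:

```python
# Semantically weighted error: mistakes between nearby tags in a
# hierarchy cost less than mistakes between unrelated tags.
parent = {"dubstep": "electronic", "house": "electronic",
          "bebop": "jazz", "swing": "jazz"}  # child -> parent (toy)

def ancestors(tag):
    """Return the chain [tag, parent, grandparent, ...]."""
    chain = [tag]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain

def semantic_cost(predicted, true):
    """0 if exact; reduced if the tags share an ancestor; 1 otherwise."""
    if predicted == true:
        return 0.0
    if set(ancestors(predicted)) & set(ancestors(true)):
        return 0.5
    return 1.0

print(semantic_cost("dubstep", "house"))  # 0.5: semantically close mistake
print(semantic_cost("dubstep", "bebop"))  # 1.0: semantically distant mistake
```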
Part II: Towards a lightly structured semantic organisation
Here we propose a reasonable approach to building datasets with strong guarantees against overfitting and annotation noise.
Tag definitions
- Attach a definition to each tag and make it public (a public ontology is best). Identify and resolve ambiguities ("funk" music).
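A hypothetical sketch of what such a tag definition record could look like; the field names and the "funk" disambiguation note are illustrative assumptions, not part of any existing schema:

```python
# Every tag carries a written definition and a link into a public
# ontology, so ambiguities can be spotted and resolved explicitly.
from dataclasses import dataclass

@dataclass
class TagDefinition:
    name: str
    definition: str           # short human-readable definition
    ontology_url: str         # link to a public reference
    disambiguation: str = ""  # notes on known ambiguities

funk = TagDefinition(
    name="funk",
    definition="Rhythm-driven genre originating in 1960s African-American music.",
    ontology_url="https://en.wikipedia.org/wiki/Funk",
    disambiguation="Distinct from the Brazilian 'funk carioca' usage.",
)
```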
Low and High-order Tags
- Make the relationship between tags and items explicit; in particular, differentiate between 0-order and higher-order tags.
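A minimal sketch, assuming "0-order" denotes tags grounded directly in the audio signal and "higher-order" denotes contextual or cultural tags (an assumption; the precise definitions would be fixed by the dataset documentation):

```python
# Typing the tag-item relationship by order.
from enum import IntEnum

class TagOrder(IntEnum):
    ZERO = 0    # audible in the signal itself, e.g. "trumpet"
    HIGHER = 1  # contextual/cultural, e.g. "summer hit", "Italy"

tag_order = {"trumpet": TagOrder.ZERO, "Italy": TagOrder.HIGHER}
```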
Tag types and origin
- Make the origin of each tag/item association explicit; seek consensus, spot inconsistencies.
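A hypothetical sketch of recording the origin of each association and flagging disagreements between sources; the source names are placeholders:

```python
# Record a (source, vote) pair per annotation origin, then flag
# (item, tag) associations on which sources disagree.
from collections import defaultdict

# (item, tag) -> list of (source, is_positive) votes
votes = defaultdict(list)
votes[("track_42", "funk")] += [("editorial", True),
                                ("playlist_titles", True),
                                ("user_tags", False)]

for (item, tag), vs in votes.items():
    positives = sum(1 for _, p in vs if p)
    if 0 < positives < len(vs):
        print(f"Inconsistency on ({item}, {tag}): "
              f"{positives}/{len(vs)} sources agree; needs review.")
```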
Part III: Our take on the subject: DeezerTagSet (a lightly structured musical tag dataset)
With a public GitHub repository?
Comparison to existing datasets (MIREX, AudioSet, WhateverSet, ...)
Part IV: Using the DeezerTagSet
Massively multi-label machine learning experiments. Removing inconsistencies and data augmentation.
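As a hedged sketch of these two operations, assuming explicit "implies" and "excludes" relations between tags are available (the relations below are toy examples, not part of the dataset), positive labels can be propagated upwards (augmentation) and items violating a declared exclusion can be flagged (inconsistency removal):

```python
# Data augmentation and inconsistency removal via explicit tag relations.
implies = {"dubstep": {"electronic"}, "bebop": {"jazz"}}
excludes = {("instrumental", "a_cappella")}  # illustrative mutual exclusion

def augment(tags):
    """Add every tag implied (transitively) by an existing tag."""
    tags = set(tags)
    frontier = list(tags)
    while frontier:
        for parent in implies.get(frontier.pop(), ()):
            if parent not in tags:
                tags.add(parent)
                frontier.append(parent)
    return tags

def inconsistent(tags):
    """True if the tag set violates a declared exclusion."""
    return any(a in tags and b in tags for a, b in excludes)

print(augment({"dubstep"}))                          # {'dubstep', 'electronic'}
print(inconsistent({"instrumental", "a_cappella"}))  # True -> drop or review
```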