3.2 Dataset Preprocessing
Cleaning data before generating model features is essential, especially for Reddit data, which combines several different sets of data. The authors' first step was to concatenate the N most recent titles or sentences for each user, where N is the minimum number of posts required for the proposed model to work correctly. The authors also extracted the time elapsed between two consecutive posts or comments made by the same user. After grouping the data, the authors removed all text-based links, lowercased the text, and created "bag of words" representations of the sequences. The authors then lemmatized each term using WordNetLemmatizer.