5. Models
The authors used several models that take the factors discussed above into account. To vectorize words, they used TF-IDF, a statistic that measures how important a word is to a document in a collection or corpus: a word's TF-IDF value increases in proportion to how often the word appears in a document, but is offset by the number of documents in the corpus that contain it, which compensates for some words being inherently more frequent than others. For the classification task, the authors used a logistic regression classifier, which has delivered good results on a number of text classification tasks (a minimal sketch of this setup appears after the list below). These models were compared against baseline models. The researchers used two naive baselines:
Random guesser: a basic model that predicts a user's depression label at random, with each class equally likely to be chosen.
Stratified random guesser: another simple model, with the minor refinement that the prediction is not uniformly random: it uses the proportion of depressed users in the training set as the probability of predicting that a user is depressed.
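As a concrete reference, the TF-IDF plus logistic regression setup described above corresponds to a standard scikit-learn pipeline. The sketch below assumes the data arrives as one text document per user with a binary label; the variable names and data handling are illustrative assumptions, not details from the paper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

def build_classifier() -> Pipeline:
    # TF-IDF turns each user's text into a weighted term vector;
    # logistic regression then classifies those vectors.
    return Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("clf", LogisticRegression(max_iter=1000)),
    ])

# Assumed usage: train_texts is a list of user documents and
# train_labels holds binary depression labels (not from the paper).
# model = build_classifier().fit(train_texts, train_labels)
# probs = model.predict_proba(test_texts)[:, 1]
```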
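Both baselines map directly onto scikit-learn's DummyClassifier strategies; the sketch below shows that correspondence rather than the authors' actual implementation.

```python
from sklearn.dummy import DummyClassifier

# Random guesser: each class is predicted uniformly at random.
random_guesser = DummyClassifier(strategy="uniform", random_state=0)

# Stratified random guesser: predictions are random, but drawn with
# class probabilities matching the class proportions in the training set.
stratified_guesser = DummyClassifier(strategy="stratified", random_state=0)

# Both baselines ignore the input features entirely:
# random_guesser.fit(X_train, y_train); random_guesser.predict(X_test)
```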
The number of posts, the intensity of tweeting, the average spacing between posts, and the average number of words per post all suited the data well and were used as additional features to improve model performance. The models were tuned following standard practice: a grid search was run over the settings of the classifier and the TF-IDF vectorizer, as sketched below. The choices with the greatest impact on the results were the authors' use of both unigram and bigram terms, as shown in Figures 2 and 3, and their exclusion of terms whose document frequency fell below a chosen threshold, also known as the cut-off value.
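The user-level features are straightforward to compute; below is a hedged sketch of how three of them (post count, average gap between posts, average words per post) might be derived with pandas. The DataFrame layout and column names (user_id, timestamp, text) are assumptions for illustration, and the "intensity" feature is omitted because the section does not define it precisely.

```python
import pandas as pd

def user_features(posts: pd.DataFrame) -> pd.DataFrame:
    # posts is assumed to have one row per post, with columns
    # "user_id", "timestamp" (datetime64), and "text".
    posts = posts.sort_values(["user_id", "timestamp"])
    gap_s = posts.groupby("user_id")["timestamp"].diff().dt.total_seconds()
    return posts.assign(
        words=posts["text"].str.split().str.len(),
        gap_s=gap_s,
    ).groupby("user_id").agg(
        n_posts=("text", "size"),        # number of posts per user
        avg_gap_s=("gap_s", "mean"),     # average spacing between posts
        avg_words=("words", "mean"),     # average words per post
    )
```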
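To make the tuning step concrete, here is a sketch of a grid search over the vectorizer and classifier settings using scikit-learn's GridSearchCV. The specific grid values (the min_df candidates, regularization strengths, and F1 scoring) are assumptions; the text only states that the n-gram range and the document-frequency cut-off mattered most.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

param_grid = {
    # Unigrams only vs. unigrams plus bigrams, as in the paper.
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    # min_df is the document-frequency cut-off: terms appearing in fewer
    # documents than this are dropped from the vocabulary.
    "tfidf__min_df": [1, 2, 5, 10],
    # Regularization strengths for the classifier (assumed values).
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="f1")
# search.fit(train_texts, train_labels)  # assumed training data
# print(search.best_params_)
```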