5. Models
The authors used several models that take the factors discussed above
into account. To vectorize words, they used TF-IDF, a statistic that
reflects how important a specific word is to a document in a collection
or corpus. The TF-IDF value increases in proportion to how frequently a
word appears in the document and is offset by the number of documents in
the corpus that contain the term, which compensates for the fact that
some words simply appear more frequently than others in general.
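As a reminder, one standard formulation of the statistic (the paper does
not state which exact variant the authors used) is:

\[
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log \frac{N}{\mathrm{df}(t)}
\]

where \(\mathrm{tf}(t, d)\) is the number of times term \(t\) occurs in
document \(d\), \(N\) is the total number of documents in the corpus, and
\(\mathrm{df}(t)\) is the number of documents that contain \(t\).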
For the classification task itself, the authors used a logistic
regression classifier, which has performed well on a range of text
classification tasks. These models were compared against two naive
baselines (both sketched in code below):
Random guesser: A basic model that predicts a user's depression label
at random, with each class having an equal chance of being chosen.
Stratified random guesser: Another simple model, with the minor
addition that the prediction is not entirely random: it predicts
whether or not a user is depressed according to the proportion of
depressed users in the training set.
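A minimal sketch of both baselines, assuming a scikit-learn setup (the
paper does not say how the baselines were implemented); `X_train`,
`y_train`, and `X_test` are hypothetical placeholders for the feature
matrices and depression labels:

```python
from sklearn.dummy import DummyClassifier

# Random guesser: predicts each class uniformly at random.
random_guesser = DummyClassifier(strategy="uniform", random_state=0)

# Stratified random guesser: predicts at random, but with class
# probabilities matching the class distribution of the training set.
stratified_guesser = DummyClassifier(strategy="stratified", random_state=0)

# X_train, y_train, X_test are placeholders for the users' features
# and depression labels.
for baseline in (random_guesser, stratified_guesser):
    baseline.fit(X_train, y_train)
    predictions = baseline.predict(X_test)
```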
The number of posts, the posting intensity, the typical spacing between
posts, and the average number of words per post all suited the data
well and were used as features to improve model performance.
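A sketch of how such per-user features might be derived, assuming a
pandas DataFrame `posts` with hypothetical `user_id`, `timestamp`, and
`text` columns (the paper does not show its feature-extraction code):

```python
import pandas as pd

# posts: one row per post, with hypothetical columns user_id,
# timestamp (datetime64), and text.
posts = posts.sort_values(["user_id", "timestamp"])
posts["words"] = posts["text"].str.split().str.len()
posts["gap_s"] = (
    posts.groupby("user_id")["timestamp"].diff().dt.total_seconds()
)

user_features = posts.groupby("user_id").agg(
    num_posts=("text", "size"),            # number of posts
    avg_words_per_post=("words", "mean"),  # average words per post
    avg_gap_seconds=("gap_s", "mean"),     # typical spacing between posts
)

# One way to approximate posting intensity: posts per day of activity.
span_days = posts.groupby("user_id")["timestamp"].agg(
    lambda t: max((t.max() - t.min()).days, 1)
)
user_features["posts_per_day"] = user_features["num_posts"] / span_days
```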
Models were tuned following standard practice: a grid search was
performed over the settings of the model and the TF-IDF vectorizer.
Using both unigram and bigram terms, as shown in Figures 2 and 3, and
excluding terms whose document frequency fell below a chosen threshold,
also known as the cut-off value, had the greatest impact on the
results.
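A sketch of this tuning step, assuming a scikit-learn pipeline (the
exact parameter grid is an assumption; the paper only names the n-gram
range and the document-frequency cut-off, `min_df` in scikit-learn
terms, and `texts` and `labels` are hypothetical placeholders):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Candidate settings: unigrams vs. unigrams + bigrams, several
# document-frequency cut-offs, and the regularization strength.
# The actual grid used by the authors is not reported.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "tfidf__min_df": [1, 5, 10],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="f1")
# texts and labels are placeholders for the users' posts and
# depression labels.
search.fit(texts, labels)
print(search.best_params_)
```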