The first feedback loop, which I will call the Popular Loop, has three parts that generate each other: 1) a simple conversation becomes 2) popular and 3) generates a consensus, which leads to greater simplification. The second loop, which I will call the Alternative Loop, also consists of three parts: 1) an intricate and complex conversation 2) stays or becomes unpopular and 3) breaks into diverging positions.
Returning to Luhmann’s definition of meaning, ‘the product of the different choices that a system makes to deal with complexity’, how could one go about measuring it on Reddit? At first, I treated meaning as analogous to linguistic richness, but found that linguistic richness is a broad and polemical area.
Measuring linguistic richness has been called the Gordian Knot of literary studies \cite{Miranda-Garciía2005}. At first, I attempted to analyze entire Reddit threads at once and compute a ratio of common to uncommon words, but this approach had serious flaws and did not give me what I wanted: a way to measure meaning. After this first flawed experiment, I instead analyzed each comment individually, a much more difficult task than looking at threads as a whole. I explored four different measures, Hapax Legomena, the type-token ratio, adjective and adverb counts, and Yule’s I characteristic, before choosing Yule’s I as the richness measure that most closely captures Luhmann’s “meaning”. The first three measures suffered from two key problems: jargon sensitivity and length dependence. Reddit comments vary tremendously in length and often feature thread-specific jargon and non-standard grammar. The variable length made Hapax Legomena and the type-token ratio ineffective tools for measuring richness (and, from there, meaning), and a count of adjectives and adverbs quickly proved futile because the threads used non-standard grammar.
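To make the length-dependence problem concrete, here is a minimal sketch of the type-token ratio (naive whitespace tokenization is assumed for illustration):

\begin{verbatim}
def type_token_ratio(text):
    # type-token ratio: unique words divided by total words
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens)

short = "the cat sat on the mat"
long_text = " ".join([short] * 20)    # same vocabulary, twenty times longer
print(type_token_ratio(short))       # 5 types / 6 tokens, roughly 0.83
print(type_token_ratio(long_text))   # 5 types / 120 tokens, roughly 0.04
\end{verbatim}

The same vocabulary scores very differently at different lengths, which is exactly what makes the measure unreliable on Reddit comments of wildly varying size.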
Yule’s I Measure of Richness
The statistician G. Udny Yule called the result of the following formula his Characteristic I:
\begin{equation}
I = \frac{S_{1}^{2}}{S_{2}-S_{1}}\nonumber
\end{equation}
where $S_1$ is the total number of words (tokens, not unique types). $S_2$ is harder to explain, but was described concisely in a script on GitHub by the user Lex_d: “[$S_2$] consists of the sum of the number of words that occur at any frequency times that frequency squared. So for example, if 2 words occurred 3 times and 5 words occurred 6 times, [$S_2$] would be (2 * 3^2) + (5 * 6^2) = 198” (https://github.com/Scripted/lex_d).
This formula should hold roughly true no matter the size of the sample \cite{Yule1944}, unlike hapax or type-token measures. Furthermore, coding this measure in Python was straightforward and had been done before by web engineer Swizec Teller. I was able to take his code, attach it to a Reddit data scraper, and then analyze the data using regressions.
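A minimal sketch of the computation, in the spirit of Teller’s code (this illustrative version tokenizes on whitespace only, whereas Teller’s also stems words first):

\begin{verbatim}
from itertools import groupby

def yules_i(text):
    # Yule's Characteristic I: S1^2 / (S2 - S1), where S1 is the
    # total token count and S2 sums, over each observed frequency f,
    # (number of word types occurring f times) * f^2.
    tokens = [w.lower() for w in text.split()]
    counts = {}
    for w in tokens:
        counts[w] = counts.get(w, 0) + 1
    s1 = sum(counts.values())
    s2 = sum(len(list(group)) * freq ** 2
             for freq, group in groupby(sorted(counts.values())))
    if s2 == s1:  # every word occurs exactly once: maximal diversity
        return float("inf")
    return (s1 * s1) / (s2 - s1)
\end{verbatim}

Running this on Lex_d’s worked example (2 words occurring 3 times and 5 words occurring 6 times) gives $S_1 = 36$, $S_2 = 198$, and $I = 36^2 / 162 = 8$.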
Yule’s I and Systems Theory
Returning to Luhmann’s definition of meaning- ‘the product of the different choices that a system makes to deal with complexity’- how does Yule’s I score fit in?
As with the type-token ratio, the higher the score, the more information and potential choices a comment can hold.
For example, “Take a left turn” has a higher score than “Cat Cat Cat Cat,” and carries more information. To further develop this connection, I will use a concept from chaos theory: Kolmogorov Complexity, laid out by the Soviet mathematician Andrey Kolmogorov in “Three Approaches to the Quantitative Definition of Information” \cite{Kolmogorov1968}. Kolmogorov Complexity is the length of the shortest description needed to produce an output. For example, the string “aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa” (50 characters long) can be produced from the description “a 50 times” (10 characters). However, the string “esfjbkskjbgsldfnapsdkngirngvlsjrnvsjadb” can only be completely described by itself; even though it has fewer characters than the string of a’s, it has a higher Kolmogorov Complexity.
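Kolmogorov Complexity itself is uncomputable, but the length of a compressed string is a standard, computable stand-in; a quick sketch makes the contrast between the two example strings visible:

\begin{verbatim}
import zlib

def compressed_len(s):
    # length of the zlib-compressed bytes: a rough, computable
    # proxy for (uncomputable) Kolmogorov Complexity
    return len(zlib.compress(s.encode("utf-8")))

print(compressed_len("a" * 50))  # highly regular: compresses to a few bytes
print(compressed_len("esfjbkskjbgsldfnapsdkngirngvlsjrnvsjadb"))
# the random-looking string stays close to its original length
\end{verbatim}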
Reddit comments with higher Yule’s I scores have higher Kolmogorov complexities, as they contain fewer repeated words. The more complex a comment, the more potential choices, and therefore meaning, it can store. Yule’s I score therefore measures the possible meaning in a comment. I then wrote a program in Python to apply this score to Reddit comments.
By looking at the Yule’s I score of each comment, we can fit a line to the increase or decrease of Yule’s I over time, and then compare the rate of change of Yule’s I with the number of comments in a thread. Using the rate of change instead of the average Yule’s I score ignores the line’s intercept. This allows for a standardized measure across all categories, as each category might have a different baseline mean Yule’s I score shaped by what the commenters knew before entering the thread.
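A minimal sketch of that step, assuming the yules_i function above and comments stored as (timestamp, text) pairs (an illustrative data layout, not the exact one my program uses):

\begin{verbatim}
from scipy.stats import linregress

def yule_trend(comments):
    # comments: list of (created_utc, body) pairs, sorted by time.
    # Returns the fitted slope, i.e. the rate at which Yule's I
    # rises or falls over the thread; the intercept is discarded.
    times = [t for t, _ in comments]
    scores = [yules_i(body) for _, body in comments]
    return linregress(times, scores).slope
\end{verbatim}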
An Example Case: Reddit
Reddit is an online forum divided into subreddits, threads, and comments. Each subreddit has its own unique rules and moderators and is built around set themes. As of June 2017, the most popular subreddits are AskReddit, Funny, TodayILearned, and Science, according to http://redditlist.com. Each subreddit contains threads, which can be a question, a statement, or even a link to an article or video. People comment on these threads, then comment on those comments, and so on indefinitely. Furthermore, they can upvote or downvote both comments and threads.
Reddit’s Hot Algorithm
By default, someone who visits Reddit is immediately taken to a page listing the most “Hot” threads. Because this is the first place people visit, being put on the Hot list brings people from across Reddit to a post, rather than the much smaller, interest-specific audience gathered in a single subreddit.
How does the hot algorithm work?
Reddit is open source, meaning that anyone can explore its code. On the code repository GitHub (https://github.com/reddit/reddit), I was able to find the hot formula at reddit/r2/r2/lib/db/_sorts.pyx.
Threads only stay on the front page briefly, quickly dropping out as other threads with a higher time score (seconds since December 8, 2005) take their place. Since the vote ranking is logarithmic, there is a huge emphasis on the earliest votes \cite{Salihefendic2017}.
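For reference, the ranking function in _sorts.pyx is short; the Python rendering below follows Salihefendic’s write-up \cite{Salihefendic2017}, with the Cython type declarations dropped:

\begin{verbatim}
from datetime import datetime
from math import log

epoch = datetime(1970, 1, 1)

def epoch_seconds(date):
    # seconds from the Unix epoch to `date`
    td = date - epoch
    return td.days * 86400 + td.seconds + float(td.microseconds) / 1e6

def hot(ups, downs, date):
    # log10 of net votes: the first 10 votes weigh as much as the next 100
    s = ups - downs
    order = log(max(abs(s), 1), 10)
    sign = 1 if s > 0 else -1 if s < 0 else 0
    # 1134028003 is December 8, 2005 in epoch seconds
    seconds = epoch_seconds(date) - 1134028003
    return round(sign * order + seconds / 45000, 7)
\end{verbatim}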
Why Polemic Science Issues?
More scientific papers than ever are being published \cite{Larsen2017}, and on issues of public interest they often contradict each other. This preponderance of conflicting information means that there are too many studies for a layman to easily judge. Thus, according to Luhmann’s theory, we rely on social systems to process complexity and give us a simpler picture of the world. By looking at polemical science debates, we can analyze a dynamic and changing system. For these reasons, I chose to look at science issues.
The categories I chose, Artificial Intelligence, Global Warming, Genetically Modified Organisms, the CRISPR gene-editing tool, and the debate over Vaccines, are all issues that interest me. I have been building a crude Artificial Intelligence throughout the year and was introduced to debates over Artificial Intelligence along the way. The debates over Global Warming and Vaccines have come to the forefront in my home country, the United States, after a recent election and seemed pertinent. My interest in the debate over Genetically Modified Organisms and CRISPR stems from growing up in a family of geneticists.
Instruments
To collect and analyze Reddit comments, I wrote a script in the Python programming language. Reddit’s API can be accessed through PRAW, the Python Reddit API Wrapper, which allows for the easy mining of threads and comments; I therefore used Python for this project. Another language, R, provides stronger statistical analysis tools and a more straightforward implementation of Yule’s I score, but seeing that I could compute the measure in Python and also use the PRAW kit, I decided to use Python when writing my program, “SuperYule”.
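A minimal sketch of how the scraping step fits together (the credentials are placeholders, and yules_i is the illustrative function from earlier, not my exact production code):

\begin{verbatim}
import praw

# placeholder credentials; a registered Reddit script app supplies real ones
reddit = praw.Reddit(client_id="CLIENT_ID",
                     client_secret="CLIENT_SECRET",
                     user_agent="SuperYule research script")

def thread_scores(submission_id):
    # return (created_utc, Yule's I) pairs for every comment in a thread
    submission = reddit.submission(id=submission_id)
    submission.comments.replace_more(limit=0)  # drop unresolved
                                               # "load more comments" stubs
    return [(c.created_utc, yules_i(c.body))
            for c in submission.comments.list()]
\end{verbatim}

The (timestamp, score) pairs returned here are exactly what the regression step above consumes.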