Definition 1 Let \(X_n\) be a sequence of scalar random variables taking values in \(\mathbf{R}\) or \(\mathbf{C}\), and let \(X\) be another scalar random variable. We then say that \(X_n\) converges in probability to \(X\) if, for every radius \(\varepsilon > 0\), one has \(\mathbf{P}(|X_n - X| > \varepsilon) \rightarrow 0\) as \(n \rightarrow \infty\).
Thus, \(X_n\) converges to \(X\) in probability if, for each \(\varepsilon > 0\), the event \(|X_n - X| > \varepsilon\) becomes increasingly unlikely as \(n \rightarrow \infty\).
Recall that a sequence \(X_1,X_2,\dots\) converges almost surely to \(X\) if one has \({\lim \inf}_{n \rightarrow \infty} X_n = {\lim \sup}_{n \rightarrow \infty} X_n = X\) almost surely, or equivalently if, for every \(\varepsilon > 0\), the probabilities \(\mathbf{P}(\bigvee_{n \geq N} (|X_n - X| > \varepsilon))\) tend to zero as \(N \rightarrow \infty\). Roughly speaking, convergence in probability is useful for controlling how close each individual random variable \(X_n\) is to the putative limit \(X\), while almost sure convergence is useful for controlling how close the entire tail \((X_n)_{n \geq N}\) of the sequence is to its putative limiting value \(X\).
We have the following relationships between convergence in probability and almost sure convergence:
Exercise 2 Let \(X_n, X\) be scalar random variables.
- Almost sure convergence implies convergence in probability: if \(X_n \rightarrow X\) almost surely, then \(X_n \rightarrow X\) in probability. Thus control of the entire tail \((X_n)_{n \geq N}\) is somewhat stronger than control of each individual \(X_n\). Give a counterexample to show that the converse does not hold. (A numerical illustration of one standard counterexample is sketched after this exercise.)
- (Borel-Cantelli lemma) Suppose that \(\sum_n \mathbf{P}(|X_n - X| > \varepsilon) < \infty\) for all \(\varepsilon > 0\). Show that \(X_n \rightarrow X\) almost surely. Hence, summability of the probabilities \(\mathbf{P}(|X_n - X| > \varepsilon)\) for every threshold \(\varepsilon > 0\) guarantees almost sure convergence; this is a convenient tool for upgrading summable bounds for convergence in probability to almost sure convergence. Give a counterexample to show that the converse does not hold.
- This gives a (strictly) partial converse to the first part: if \(X_n \rightarrow X\) in probability, then one can find a subsequence \(X_{n_j}\) such that \(X_{n_j} \rightarrow X\) almost surely.
- If \(\mathbf{E}|X_n|, \mathbf{E}|X| < \infty\) and \(\mathbf{E}|X_n - X| \rightarrow 0\), then \(\mathbf{P}(|X_n - X| > \varepsilon) \rightarrow 0\) as \(n \rightarrow \infty\) for every \(\varepsilon > 0\); thus, if the absolutely integrable variables \(X_n\) converge in expectation to an absolutely integrable variable \(X\), then \(X_n\) converges to \(X\) in probability. Give a counterexample to show that convergence in expectation is strictly stronger.
- (Urysohn subsequence principle) Show that \(X_n\) converges to \(X\) in probability if and only if every subsequence \(X_{n_j}\) of \(X_n\) has a further subsequence \(X_{n_{j_k}}\) that converges to \(X\) in probability.
- Does the Urysohn subsequence principle still hold if one replaces "in probability" throughout with "almost surely"?
- A continuous function preserves convergence in probability: if \(F: \mathbf{R} \rightarrow \mathbf{R}\) (or \(\mathbf{C} \rightarrow \mathbf{C}\)) is continuous and \(X_n \rightarrow X\) in probability, then \(F(X_n) \rightarrow F(X)\) in probability as well. This holds more generally for functions of several variables \(F: \mathbf{R}^k \rightarrow \mathbf{R}\) or \(\mathbf{C}^k \rightarrow \mathbf{C}\); for instance, if \(X_n \rightarrow X\) and \(Y_n \rightarrow Y\) in probability, then \(X_n + Y_n \rightarrow X + Y\) and \(X_n Y_n \rightarrow XY\) in probability as well.
- (Fatou's lemma for convergence in probability) If the \(X_n\) are unsigned and \(X_n \rightarrow X\) in probability, then \(\mathbf{E}X \leq \lim \inf_{n \rightarrow \infty} \mathbf{E}X_n\); that is, the expectation of the limit is bounded above by the limit inferior of the expectations.
- (Dominated convergence in probability) If \(X_n \rightarrow X\) in probability, and almost surely \(|X_n| \leq Y\) for all \(n\) and some random variable \(Y\) with finite expectation, then \(\mathbf{E}X_n \rightarrow \mathbf{E}X\) as \(n \rightarrow \infty\).
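As a numerical companion to the first part of this exercise, here is a minimal simulation sketch (using numpy, which is of course not part of the exercise; the seed and sample size are arbitrary choices) of the standard counterexample: independent indicators \(X_n\) with \(\mathbf{P}(X_n = 1) = 1/n\) converge to zero in probability, yet by the second Borel-Cantelli lemma they equal one infinitely often almost surely, so the sample paths do not converge.

```python
import numpy as np

rng = np.random.default_rng(0)

# Independent indicators X_n with P(X_n = 1) = 1/n: each individual X_n tends
# to 0 in probability, but the ones keep recurring along a single sample path,
# reflecting the failure of almost sure convergence.
N = 100_000
n = np.arange(1, N + 1)
X = (rng.random(N) < 1.0 / n).astype(int)

print("P(X_n = 1) at n = N (theoretical):", 1.0 / N)                 # tends to 0
print("number of ones along this sample path:", X.sum())             # keeps growing with N
print("last index where X_n = 1:", n[X == 1].max() if X.any() else None)
```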
Exercise 3
- (Independence of variables and limits in probability) Suppose that \(X_1, X_2, \dots\) are scalar random variables with \(X_n \rightarrow X\) in probability, and that \(Y\) is a random variable independent of each of the \(X_n\). Show that \(Y\) is also independent of the limit \(X\).
- (Jointly independent variables have almost surely constant limits) If the \(X_n\) are jointly independent, show that the limit \(X\) is almost surely constant.
We are now able to state the weak and strong laws of large numbers, in the special case of independent and identically distributed (iid for short) random variables.
Recall that:
- An identically distributed sequence \(X_n\) of random variables is a sequence in which each \(X_n\) has the same distribution as the others, so \(\mathbf{P}(X_n \in S) = \mathbf{P}(X_m \in S)\) for all \(n, m\) and all measurable \(S\).
- A jointly independent sequence of random variables is a sequence \(X_1, X_2, \dots\) whose joint distribution is the product of the individual distributions: equivalently, \(\mathbf{P}((X_{i_1} \in S_1) \wedge \cdots \wedge (X_{i_n} \in S_n)) = \mathbf{P}(X_{i_1} \in S_1) \cdots \mathbf{P}(X_{i_n} \in S_n)\) for all distinct indices \(i_1, \dots, i_n\) and all measurable sets \(S_1, \dots, S_n\).
Theorem 4 (Law of large numbers, model case) Let \(X_1, X_2, \dots\) be an iid sequence of copies of a random variable \(X\) with finite expectation.
- (Weak law of large numbers) Then \((X_1 + \cdots + X_n)/n \rightarrow \mathbf{E}X\) in probability.
- (Strong law of large numbers) Then, \((X_1 + \cdots + X_n)/n \rightarrow \mathbf{E}X\) almost surely.
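Before discussing the theorem further, here is a quick Monte Carlo illustration (a sketch, not part of the theorem or its proof; it assumes numpy, and the exponential distribution and sample sizes are arbitrary choices): the running empirical mean of iid samples settles near the theoretical expectation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Draw iid copies of an exponential variable X with EX = 1 and track the
# running empirical mean (X_1 + ... + X_n)/n as n grows.
samples = rng.exponential(scale=1.0, size=10_000)
running_mean = np.cumsum(samples) / np.arange(1, samples.size + 1)

for n in (10, 100, 1_000, 10_000):
    print(f"n = {n:6d}   empirical mean = {running_mean[n - 1]:.4f}")   # approaches EX = 1
```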
Remarks:
- The law of large numbers asserts, roughly speaking, that the theoretical expectation \(\mathbf{E}X\) of a random variable \(X\) can be approximated by taking a large number of independent samples \(X_1, \dots, X_n\) of \(X\) and then forming the empirical mean \[\frac{X_1+\cdots+X_n}{n}.\] This ability to approximate the theoretical statistics of a probability distribution through empirical data is one of the foundational starting points for mathematical statistics (which is not the focus of this course).
- The tendency of statistics to cluster closely around their mean value is the simplest instance of the concentration of measure phenomenon, which is of great significance not only within probability, but also in applications of probability such as statistics, theoretical computer science, combinatorics, random matrix theory, and high dimensional geometry. (See this blog post for further discussion.)
- It is instructive to compare the law of large numbers with what one can obtain from the Kolmogorov zero-one law: if the \(X_n\) are real-valued, then the limit superior \(\lim \sup_{n\rightarrow\infty} (X_1 + \cdots + X_n)/n\) and limit inferior \(\lim \inf_{n\rightarrow\infty} (X_1 + \cdots + X_n)/n\) are tail random variables, in the sense that they are unaffected if one modifies finitely many of the \(X_n\); in particular, events such as \(\lim \sup_{n\rightarrow\infty} (X_1 + \cdots + X_n)/n > t\) are tail events for every \(t \in \mathbf{R}\). From this fact and the zero-one law we conclude that there exist deterministic constants \(-\infty \le \mu_- \le \mu_+ \leq \infty\) such that \(\lim \inf_{n\rightarrow\infty} (X_1 + \cdots + X_n)/n = \mu_-\) and \(\lim \sup_{n\rightarrow\infty} (X_1 + \cdots + X_n)/n = \mu_+\) almost surely. One can then view the strong law of large numbers as the assertion that \(\mu_- = \mu_+ = \mathbf{E}X\) when \(X\) is absolutely integrable.
- Informally, if \(X_1, X_2, \dots\) are iid with mean \(\mu\), then \(X_1+\cdots+X_n\approx \mu n\) for large \(n\).
There are several variants of the law of large numbers, for instance when one drops the hypothesis of identical distribution, when the random variable \(X\) is not absolutely integrable, or when one seeks more quantitative bounds on the rate of convergence; we will discuss some of these variants later in these notes.
There are several ways to prove the law of large numbers. One main strategy is the moment method, which controls statistics such as the empirical mean \((X_1 + \cdots + X_n)/n\) by computing moments such as the mean \(\mathbf{E}(X_1 + \cdots + X_n)/n\), the variance \(\mathbf{E}|(X_1 + \cdots + X_n)/n - \mathbf{E}(X_1 + \cdots + X_n)/n|^2\), or higher moments such as \(\mathbf{E}|(X_1 + \cdots + X_n)/n - \mathbf{E}(X_1 + \cdots + X_n)/n|^k\) for \(k=4,6, \dots\). Fortunately, the joint independence of the \(X_n\) allows us to compute such moments fairly easily, with some elementary combinatorics. A direct application of the moment method typically requires an additional finite moment assumption on \(X\) such as \(\mathbf{E}|X|^k < \infty\), but as we shall see, one can reduce fairly easily to this case by a so-called truncation argument. For the strong law of large numbers, one can also use methods from the theory of martingales, such as stopping time arguments and maximal inequalities; we will present some classical arguments of Kolmogorov in this context.
1. The moment method
We begin by using the moment method to establish both the strong and weak law of large numbers for sums of iid random variables, under additional moment hypotheses. To prove the versions for complex variables, it suffices to do so for real ones, since one can take real and imaginary parts after establishing the real case. Thus we shall focus our attention on real random variables henceforth.
Recall that:
- By the change of variables formula, two real random variables \(X\) and \(Y\) with the same distribution \(\mu\) have the same expectation: thus, if \(\mathbf{P}(X \in S) = \mathbf{P}(Y \in S)\) for all measurable \(S\), then \(\mathbf{E}|X| = \mathbf{E}|Y|\) (and \(\mathbf{E}X = \mathbf{E}Y\) whenever these expectations are defined).
- In particular, if \(X_1, X_2, \dots\) is an iid sequence of copies of an absolutely integrable real random variable \(X\) with mean \(\mathbf{E}X = \mu\), then the \(X_n\) all have the same mean \(\mathbf{E}X_n = \mu\).
- Adding or subtracting a constant from a random variable does not affect its variance, since the constant cancels when the mean is subtracted: \(\mathbf{Var}(X - \mu) = \mathbf{E}|(X - \mu) - \mathbf{E}(X - \mu)|^2 = \mathbf{E}|X - \mathbf{E}X|^2 = \mathbf{Var}(X)\).
The first moment
Let \(X_1, X_2, \dots\) be a sequence of iid copies of a scalar random variable \(X\), and write \(S_n := X_1 + \cdots + X_n\) for the partial sums. Suppose that \(X\) is absolutely integrable with mean \(\mathbf{E}X = \mu\). Then one can use linearity of expectation to compute the expectation, the first moment, of the partial sum \(S_n\): \[\mathbf{E}S_n = \mathbf{E}X_1 + \cdots + \mathbf{E}X_n = n\mu.\]
In particular, the mean \(\mu\) is the expectation of the empirical mean \((X_1 + \cdots + X_n)/n\). This is consistent with the strong and weak laws of large numbers, but does not by itself imply them. However, we can already record the following crude bound: \[\mathbf{P}\Big(\frac{X_1+\cdots+X_n}{n}\ge\lambda \mu\Big)\le\frac{1}{\lambda}\] for any \(\lambda > 0\), in the case that \(X\) is unsigned and \(\mathbf{E}X\) is finite. Thus, in the unsigned case, the empirical mean \(\frac{X_1 + \cdots + X_n}{n}\) does not usually get much larger than the mean \(\mu\). We will refer to this inequality as a first moment bound on \(S_n\), as it was obtained purely through computation of the first moment \(\mathbf{E}S_n\).
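To spell out where this bound comes from (a one-line application of Markov's inequality to the unsigned variable \(S_n\), assuming \(\mu > 0\) so that the threshold is positive):
\[
\mathbf{P}\Big(\frac{X_1+\cdots+X_n}{n} \ge \lambda \mu\Big)
= \mathbf{P}(S_n \ge \lambda n \mu)
\le \frac{\mathbf{E}S_n}{\lambda n \mu}
= \frac{n\mu}{\lambda n \mu} = \frac{1}{\lambda}.
\]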
Second moments
We now turn to second moment bounds, obtained through computations of second moments such as \(\mathbf{E}|(X_1 + \cdots + X_n)/n|^2\) or \(\mathbf{Var}(S_n) = \mathbf{E}|S_n - \mathbf{E}S_n|^2\). Here it will be convenient to normalize the mean \(\mu\) to equal zero, by replacing each \(X_i\) with \(X_i - \mu\) and \(X\) by \(X - \mu\), so that the partial sum \(S_n\) is replaced by \(S_n - n\mu\) and the empirical mean \(S_n/n\) by \(S_n/n - \mu\). With this normalization, we see that to prove the strong or weak law of large numbers, it suffices to do so in the mean zero case \(\mu = 0\). On the other hand, if \(X\) is unsigned, then normalizing it in this fashion will usually destroy the unsigned property of \(X\), so this normalization is not always desirable.
Remarks
- One can compute the variance using linearity of expectation: \(\mathbf{Var}(X_1 + \cdots + X_n) = \mathbf{E}|(X_1 - \mathbf{E}X_1) + \cdots + (X_n - \mathbf{E}X_n)|^2 = \mathbf{E}|(X_1 - \mu) + \cdots + (X_n - \mu)|^2\).
- The variance of the partial sum \(S_n\) can be written in several equivalent forms: \(\mathbf{Var}(X_1 + \cdots + X_n) = \mathbf{E}|(X_1 + \cdots + X_n) - \mathbf{E}(X_1 + \cdots + X_n)|^2 = \mathbf{E}|(X_1 + \cdots + X_n) - n\mu|^2\).
- With the mean zero normalization, the variance is just the second moment: \(\mathbf{Var}(X) = \mathbf{E}|X|^2\); in particular, \(\mathbf{Var}(X_1 + \cdots + X_n) = \mathbf{E}|X_1 + \cdots + X_n|^2\). Hence, in the mean zero case, the variance is finite if and only if \(X\) is square-integrable.
- The first moment calculation shows that if \(X\) is normalized to have mean \(\mu = \mathbf{E}X = 0\), then \(\mathbf{E}S_n = n\mu = 0\) as well.
Recall that
- Two random variables \(X, Y\) are independent if and only if \[\mathbf{E}F(X)G(Y) = (\mathbf{E}F(X))(\mathbf{E}G(Y))\] for all scalar measurable functions \(F, G\) on the ranges of \(X, Y\) respectively, provided \(F(X), G(Y)\) are both unsigned or both absolutely integrable. In particular, \(\mathbf{E}XY = (\mathbf{E}X)(\mathbf{E}Y)\) when \(X, Y\) are independent (and both unsigned or both absolutely integrable).
- The Chebyshev inequality \(\mathbf{P}(|X - \mathbf{E}X| \geq t) \leq \frac{\mathbf{Var}(X)}{t^2}\), where \(\mathbf{Var}(X) = \mathbf{E}|X - \mathbf{E}X|^2\) denotes the variance (the second central moment).
- The Cauchy-Schwarz inequality \(|\mathbf{E}XY| \leq (\mathbf{E}|X|^2)^{1/2} (\mathbf{E}|Y|^2)^{1/2}\), valid whenever \(X, Y\) are square-integrable: \(\mathbf{E}|X|^2, \mathbf{E}|Y|^2 < \infty\). Hence the expectation of \(XY\) is controlled by the second moments of \(X\) and \(Y\).
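For reference, the Chebyshev inequality follows from Markov's inequality applied to the unsigned random variable \(|X - \mathbf{E}X|^2\) (a standard one-line derivation, included here for completeness):
\[
\mathbf{P}(|X - \mathbf{E}X| \geq t)
= \mathbf{P}\big(|X - \mathbf{E}X|^2 \geq t^2\big)
\leq \frac{\mathbf{E}|X - \mathbf{E}X|^2}{t^2}
= \frac{\mathbf{Var}(X)}{t^2}.
\]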
We now compute the variance of the partial sum \(S_n\). Suppose that \(X\) has finite second moment. Since adding or subtracting a constant does not affect the variance (\(\mathbf{Var}(X - \mu) = \mathbf{Var}(X)\)), we may also assume \(X\) has been normalized to have mean zero, and we write \(\sigma^2 := \mathbf{E}|X|^2\) for its variance. Then the variance of \(S_n\) is simply \(\mathbf{E}|X_1 + \cdots + X_n|^2\). By linearity of expectation, we have \[\mathbf{Var}(X_1 + \cdots + X_n) = \underset{1 \leq i,j \leq n}{\sum}\mathbf{E}X_i X_j.\]
All the expectations on the right-hand side are absolutely integrable thanks to the Cauchy-Schwarz inequality. If \(i = j\), then \(\mathbf{E}X_i X_j\) is equal to the second moment \(\mathbf{E}|X|^2 = \sigma^2\). If instead \(i \neq j\), then \(X_i, X_j\) are independent and have mean zero by hypothesis, and thus \[\mathbf{E}X_i X_j = (\mathbf{E}X_i)(\mathbf{E}X_j) = 0.\]
Putting these in, we obtain
\[\mathbf{Var}(X_1 + \cdots + X_n) = n \sigma^2\]
or equivalently, for the empirical mean, \[\mathbf{Var}((X_1 + \cdots + X_n)/n) = \frac{1}{n}\sigma^2.\]
This bound was established in the mean zero case, but it clearly holds in general, since adding or subtracting a constant does not affect the variance; for instance, \(\mathbf{Var}(S_n/n) = \mathbf{Var}(S_n/n - \mu) = \frac{1}{n}\sigma^2\), where now we write \(\sigma^2 = \mathbf{E}|X - \mu|^2\).
This is our first demonstration of the concentration of measure effect that arises from averaging many independent random variables \(X_1, \dots, X_n\). At the opposite extreme, we could take \(X_1, \dots, X_n\) to all be exactly the same random variable: \(X_1 = \cdots = X_n = X\). Then \(\frac{X_1 + \cdots + X_n}{n} = X\) has exactly the same mean and variance as \(X\), and no concentration occurs.
Remarks:
- Note how we set \(\sigma^2 = \mathbf{E}|X|^2\), so that \(\sigma = (\mathbf{E}|X|^2)^{1/2}\); this is exactly the quantity appearing in the Cauchy-Schwarz inequality \(|\mathbf{E}XY| \leq (\mathbf{E}|X|^2)^{1/2} (\mathbf{E}|Y|^2)^{1/2}\), which in the mean zero case can thus be written as \(|\mathbf{E}XY| \leq \sigma_X \sigma_Y\).
Now let us examine the behavior of the empirical mean \(S_n/n\). Inserting the variance bound we have just obtained into Chebyshev's inequality, we obtain \[\mathbf{P}\Big(\Big|\frac{S_n}{n} - \mu\Big| \geq \varepsilon\Big) \leq \frac{1}{n}\frac{\sigma^2}{\varepsilon^2}\]
for any natural number \(n\) and any threshold \(\varepsilon > 0\), whenever \(X\) has mean \(\mu\) and finite variance \(\sigma^2 < \infty\). Hence \(\frac{X_1 + \cdots + X_n}{n} \rightarrow \mu\) in probability: we have proved the weak law of large numbers in the case where \(X\) has finite variance.
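Here is a small empirical check of this Chebyshev bound (a sketch assuming numpy; the uniform distribution, trial count, and threshold are arbitrary choices and not part of the argument): the observed deviation probabilities sit below the bound \(\sigma^2/(n\varepsilon^2)\), which is far from tight but does decay like \(1/n\).

```python
import numpy as np

rng = np.random.default_rng(2)

# Check P(|S_n/n - mu| >= eps) <= sigma^2 / (n eps^2) for X uniform on [0, 1]:
# mu = 1/2, sigma^2 = 1/12.
mu, sigma2, eps, trials = 0.5, 1.0 / 12.0, 0.05, 2_000

for n in (10, 100, 1_000):
    means = rng.random((trials, n)).mean(axis=1)          # empirical means S_n/n
    empirical = np.mean(np.abs(means - mu) >= eps)        # observed deviation frequency
    chebyshev = sigma2 / (n * eps**2)                     # bound (trivial when > 1)
    print(f"n = {n:5d}   empirical = {empirical:.4f}   Chebyshev bound = {chebyshev:.4f}")
```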
Note that the last bound on \(S_n/n\) can be sharpened: the same argument shows that \[\frac{\sqrt{n}}{\omega(n)}\Big|\frac{X_1 + \cdots + X_n}{n} - \mu\Big|\]
converges to zero in probability whenever \(\omega(n)\) is a function of \(n\) that goes to infinity as \(n \rightarrow \infty\). For instance, we have \[\frac{\sqrt n}{\log n}\Big|\frac{S_n}{n} - \mu\Big| \rightarrow 0\]
in probability. Informally, this result asserts that the empirical mean \(S_n/n\) tends not to stray from \(\mu\) by much more than about \(1/\sqrt n\). This intuition will be reinforced later when we study the central limit theorem and related results such as the Chernoff inequality. It is reinforced also by the law of the iterated logarithm, which we will not be able to get to in this set of notes.
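To verify this refinement, one simply applies the Chebyshev bound above with \(\varepsilon\) replaced by \(\varepsilon\,\omega(n)/\sqrt{n}\) (a short computation, filling in the step):
\[
\mathbf{P}\Big(\frac{\sqrt{n}}{\omega(n)}\Big|\frac{S_n}{n} - \mu\Big| \geq \varepsilon\Big)
= \mathbf{P}\Big(\Big|\frac{S_n}{n} - \mu\Big| \geq \frac{\varepsilon\,\omega(n)}{\sqrt{n}}\Big)
\leq \frac{\sigma^2}{n}\cdot\frac{n}{\varepsilon^2\,\omega(n)^2}
= \frac{\sigma^2}{\varepsilon^2\,\omega(n)^2} \rightarrow 0
\]
as \(n \rightarrow \infty\), since \(\omega(n) \rightarrow \infty\).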
Remarks
- We have obtained in the proof above the probabilistic bound \[\mathbf{P}\Big(\Big|\frac{X_1 + \cdots + X_n}{n} - \mathbf{E}X\Big| \geq \delta\Big) \leq \frac{1}{n}\frac{\mathbf{Var}(X)}{\delta^2}\] for every \(n\) and every threshold \(\delta > 0\).
- Unfortunately, one cannot combine the bounds \(\mathbf{P}(|\frac{S_n}{n} - \mu| \geq \varepsilon) \leq \frac{1}{n}\frac{\sigma^2}{\varepsilon^2}\) with the Borel-Cantelli lemma to conclude that \(S_n/n\) converges almost surely, since the harmonic series \(\sum_n \frac{1}{n}\) diverges.
Recall that
- Jensen's inequality \[f(\mathbf{E}X) \leq \mathbf{E}f(X)\] for a real random variable \(X\) and a convex function \(f: \mathbf{R} \rightarrow \mathbf{R}\) such that \(X\) and \(f(X)\) both have finite expectation. In particular, applying this to the random variable \(|X|^r\) and the convex function \(f(x) := |x|^{\frac{s}{r}}\) for \(0 < r < s < \infty\), we have \[\mathbf{E}|X|^{r} \leq (\mathbf{E}|X|^s)^{r/s}.\] Thus, if some moment \(\mathbf{E}|X|^s\) is finite, then all lower moments \(\mathbf{E}|X|^r\), \(0 < r < s\), are finite as well.
- Hölder's inequality \[|\mathbf{E}XY| \leq (\mathbf{E}|X|^p)^{1/p}(\mathbf{E}|Y|^q)^{1/q}\] for scalar random variables \(X, Y\) and exponents \(1 < p, q < \infty\) satisfying \(\frac{1}{p} + \frac{1}{q} = 1\). In particular, if \(X_i, X_j, X_k, X_l\) are four copies of a random variable \(X\) with \(\mathbf{E}|X|^4 < \infty\), then the moments \(\mathbf{E}|X_i X_j|^2\) are finite, and hence \(|\mathbf{E}X_i X_j X_k X_l| < \infty\).
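To spell out the last claim (a small verification, applying Cauchy-Schwarz, i.e. Hölder with \(p = q = 2\), twice and using the identical distribution of the four copies):
\[
|\mathbf{E}X_i X_j X_k X_l|
\leq \big(\mathbf{E}|X_i X_j|^2\big)^{1/2} \big(\mathbf{E}|X_k X_l|^2\big)^{1/2}
\leq \big((\mathbf{E}|X_i|^4)^{1/2}(\mathbf{E}|X_j|^4)^{1/2}\big)^{1/2}
\big((\mathbf{E}|X_k|^4)^{1/2}(\mathbf{E}|X_l|^4)^{1/2}\big)^{1/2}
= \mathbf{E}|X|^4.
\]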
Higher moments
One can also hope to use the bound \(\mathbf{P}(|\frac{S_n}{n} - \mu| \geq \varepsilon) \leq \frac{1}{n}\frac{\sigma^2}{\varepsilon^2}\) together with the Borel-Cantelli lemma to obtain the strong law of large numbers in the second moment case, but unfortunately the bounding quantities \(\frac{1}{n}\frac{\sigma^2}{\varepsilon^2}\) are not summable in \(n\). To obtain summable bounds, we turn to moments higher than the second. One could try to compute third moments such as \(\mathbf{E}S_n^3\), but these turn out not to convey much information, because of their signed nature; the expression \(\mathbf{E}|S_n|^3\) would in principle convey more usable information, but is difficult to compute in practice, as \(|S_n|^3\) is not a polynomial combination of the \(X_i\). We therefore move on to the fourth moment. We normalize \(X\) to the mean zero case \(\mu = 0\), write \(\sigma^2 = \mathbf{E}|X|^2\) for the variance of \(X\), and now assume a finite fourth moment \(\mathbf{E}|X|^4 < \infty\), which, by the Hölder or Jensen inequalities, implies that all lower moments such as \(\mathbf{E}|X|^2\) are finite. We then expand \[\mathbf{E}|S_n|^4 = \mathbf{E}|X_1 + \cdots + X_n|^4 = \sum_{1 \leq i,j,k,l \leq n} \mathbf{E}X_iX_jX_kX_l.\] Note that all expectations here are absolutely integrable by the hypothesis of finite fourth moment and Hölder's inequality. The individual terms \(\mathbf{E}X_i X_j X_k X_l\) look complicated, but fortunately most of them simplify. Suppose for instance that \(i\) differs from each of \(j,k,l\); then \(X_i\) and \(X_j X_k X_l\) are independent, and since \(X_i\) has mean zero, \[\mathbf{E}X_i X_j X_k X_l = (\mathbf{E}X_i)(\mathbf{E}X_j X_k X_l) = 0.\] Similarly whenever any one of the four indices differs from the other three. This leaves only a few quadruples \(i, j, k, l\) for which \(\mathbf{E}X_i X_j X_k X_l\) can be non-zero: the three cases \(i = j \neq k = l\), \(i = k \neq j = l\), \(i = l \neq j = k\), in which the indices pair off; and the diagonal case \(i = j = k = l\). If for instance \(i = j \neq k = l\), then \[\mathbf{E}X_i X_j X_k X_l = \mathbf{E}X_i^2 X_k^2 = (\mathbf{E}X_i^2)(\mathbf{E}X_k^2) = \sigma^4\] since \(X_i^2\) and \(X_k^2\) are independent; similarly for the cases \(i = k \neq j = l\) and \(i = l \neq j = k\). These three cases contribute \(3n(n-1)\sigma^4\) in total to \(\mathbf{E}|S_n|^4\), while the diagonal case contributes \(n \mathbf{E}|X|^4\). Hence we conclude that \[\mathbf{E}|X_1 + \cdots + X_n|^4 = 3n(n - 1)\sigma^4 + n \mathbf{E}|X|^4,\] and by Markov's inequality (applied to \(|S_n - n\mu|^4\)) and removing the normalization \(\mu = 0\), that \[\mathbf{P}\Big(\Big|\frac{S_n}{n} - \mu\Big| > \varepsilon\Big) \leq \frac{3n(n-1)\sigma^4 + n \mathbf{E}|X - \mu|^4}{\varepsilon^4 n^4}.\]
The quantities on the right-hand side now decay like \(O(1/n^2)\) and are therefore summable in \(n\). Thus we can apply the Borel-Cantelli lemma and conclude the strong law of large numbers in this case, that is, under the additional hypothesis of a finite fourth moment \(\mathbf{E}|X|^4 < \infty\).
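As a sanity check on the fourth moment identity, here is a small Monte Carlo sketch (assuming numpy; the Rademacher distribution, sample size, and trial count are arbitrary choices): for \(X = \pm 1\) with equal probability we have \(\mu = 0\), \(\sigma^2 = 1\), \(\mathbf{E}|X|^4 = 1\), so the formula predicts \(\mathbf{E}|S_n|^4 = 3n(n-1) + n\), and the empirical average should land near this value.

```python
import numpy as np

rng = np.random.default_rng(3)

# Monte Carlo check of E|S_n|^4 = 3n(n-1)sigma^4 + n E|X|^4 in the mean zero
# case, using Rademacher variables (sigma^2 = 1, E|X|^4 = 1).
n, trials = 50, 100_000
X = rng.choice([-1.0, 1.0], size=(trials, n))
S = X.sum(axis=1)

print("empirical   E|S_n|^4 :", np.mean(S**4))        # approximately 3n(n-1)+n
print("predicted 3n(n-1)+n  :", 3 * n * (n - 1) + n)
```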
One can of course continue to compute higher and higher moments of \(S_n\) with suitable finite moment hypotheses on \(X\), though as one can already see from the fourth moment calculation, the computations become increasingly combinatorial in nature. We will come back to this sort of analysis when we study the central limit theorem.
Applications of the moment method
We begin with two quick applications of the weak law of large numbers to topics outside of probability. We will first present an explicit version of the Weierstrass approximation theorem.
Recall that