Definition 1 Let \(X_n\) be a sequence of scalar random variables taking values in \(\mathbf{R}\) or \(\mathbf{C}\), and let \(X\) be another scalar random variable. We then say that \(X_n\) converges in probability to \(X\) if, for every radius \(\varepsilon > 0\), one has \(\mathbf{P}(|X_n - X| > \varepsilon) \rightarrow 0\) as \(n \rightarrow \infty\).
Thus, \(X_n\) converges to \(X\) in probability if, for each fixed \(\varepsilon > 0\), the event \(|X_n - X| > \varepsilon\) becomes increasingly unlikely as \(n \rightarrow \infty\).
Recall that a sequence \(X_1, X_2, \dots\) of real random variables converges almost surely (to some limit) if and only if one almost surely has \({\lim \inf}_{n \rightarrow \infty} X_n = {\lim \sup}_{n \rightarrow \infty} X_n\), with this common value finite; and that \(X_n\) converges almost surely to \(X\) if and only if, for every \(\varepsilon > 0\), the probabilities \(\mathbf{P}(\bigcup_{n \geq N} \{|X_n - X| > \varepsilon\})\) go to zero as \(N \rightarrow \infty\). Roughly speaking, convergence in probability is useful for controlling how close each individual random variable \(X_n\) is to the putative limit \(X\), while almost sure convergence is useful for controlling how close the entire tail \((X_n)_{n \geq N}\) of the sequence is to its putative limiting value \(X\).
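To compare the two modes of convergence concretely, observe that for every \(\varepsilon > 0\) and every \(n \geq N\) one has the inclusion of events \[\{|X_n - X| > \varepsilon\} \subset \bigcup_{m \geq N} \{|X_m - X| > \varepsilon\},\] so any control on the probability of the tail event on the right-hand side automatically controls each of the individual events on the left-hand side; this is the precise sense in which controlling the tail is the stronger requirement.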
We have the following relationships between convergence in probability and almost sure convergence:
Exercise 2 Let \(X_n, X\) be scalar random variables.
  1. Show that almost sure convergence implies convergence in probability: if \(X_n \rightarrow X\) almost surely, then \(X_n \rightarrow X\) in probability; thus control of the entire tail \((X_n)_{n \geq N}\) is somewhat stronger than control of each individual \(X_n\). Give a counterexample to show that the converse does not hold. (One standard family of examples is sketched numerically after this exercise.)
  2. (Borel-Cantelli lemma) Suppose that \(\sum_n \mathbf{P}(|X_n - X| > \varepsilon) < \infty\) for all \(\varepsilon > 0\). Show that \(X_n \rightarrow X\) almost surely. Thus summability of the probabilities \(\mathbf{P}(|X_n - X| > \varepsilon)\) for every threshold \(\varepsilon > 0\) guarantees almost sure convergence; this is a convenient way to upgrade summable bounds for convergence in probability to almost sure convergence. Give a counterexample to show that the converse does not hold.
  3. Show that if \(X_n \rightarrow X\) in probability, then one can find a subsequence \(X_{n_j}\) such that \(X_{n_j} \rightarrow X\) almost surely; this gives a (strictly) partial converse to part 1.
  4. Suppose that \(\mathbf{E}|X_n|, \mathbf{E}|X| < \infty\) and \(\mathbf{E}|X_n - X| \rightarrow 0\). Show that \(\mathbf{P}(|X_n - X| > \varepsilon) \rightarrow 0\) as \(n \rightarrow \infty\) for every \(\varepsilon > 0\); thus, if absolutely integrable variables \(X_n\) converge in expectation to an absolutely integrable variable \(X\), then they converge to \(X\) in probability. Give a counterexample to show that the former notion is strictly stronger.
  5. (Urysohn subsequence principle) Show that \(X_n\) converges to \(X\) in probability if and only if every subsequence \(X_{n_j}\) of \(X_n\) has a further subsequence \(X_{n_{j_k}}\) that converges to \(X\) in probability.
  6. Does the Urysohn subsequence principle still hold if one replaces "in probability" throughout with "almost surely"?
  7. Show that a continuous function preserves convergence in probability: if \(F: \mathbf{R} \rightarrow \mathbf{R}\) or \(F: \mathbf{C} \rightarrow \mathbf{C}\) is a continuous function and \(X_n \rightarrow X\) in probability, then \(F(X_n) \rightarrow F(X)\) in probability also. This holds more generally for continuous functions of several variables \(F: \mathbf{R}^k \rightarrow \mathbf{R}\) or \(F: \mathbf{C}^k \rightarrow \mathbf{C}\); in particular, if \(X_n \rightarrow X\) and \(Y_n \rightarrow Y\) in probability, then \(X_n + Y_n \rightarrow X + Y\) and \(X_n Y_n \rightarrow XY\) in probability.
  8. (Fatou's lemma for convergence in probability) Show that if \(X_n\) are unsigned and \(X_n \rightarrow X\) in probability, then \(\mathbf{E}X \leq \lim \inf_{n \rightarrow \infty} \mathbf{E}X_n\).
  9. (Dominated convergence in probability) Show that if \(X_n \rightarrow X\) in probability, and one almost surely has \(|X_n| \leq Y\) for all \(n\) and some random variable \(Y\) with finite expectation, then \(\mathbf{E}X_n \rightarrow \mathbf{E}X\) as \(n \rightarrow \infty\).
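For orientation, here is a minimal numerical sketch (our own illustration, not part of the exercise) of one standard family of examples relevant to parts 1 and 2: independent indicator variables \(X_n\) with \(\mathbf{P}(X_n = 1) = 1/n\) converge to zero in probability, but not almost surely.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 2000        # indices n = 1, ..., N
trials = 5000   # independent sample paths, used to estimate probabilities

# Independent indicators X_n with P(X_n = 1) = 1/n and P(X_n = 0) = 1 - 1/n.
p = 1.0 / np.arange(1, N + 1)
X = rng.random((trials, N)) < p   # X[t, n-1] is X_n on sample path t

# Convergence in probability to 0: for any 0 < eps < 1, P(|X_n| > eps) = 1/n -> 0.
print("estimated P(X_n = 1) at n = 10, 100, 1000:",
      X[:, 9].mean(), X[:, 99].mean(), X[:, 999].mean())

# However, sum_n P(X_n = 1) is the harmonic series, which diverges; by the second
# Borel-Cantelli lemma, X_n = 1 occurs infinitely often almost surely, so X_n does
# not converge to 0 almost surely.  The growing average count below reflects this.
print("average number of n <= N with X_n = 1:", X.sum(axis=1).mean(),
      "   harmonic sum H_N:", p.sum())
```

Note that this example also shows that the summability hypothesis in part 2 cannot be weakened to mere convergence of the individual probabilities to zero.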
Exercise 3 
We are now able to state the weak and strong laws of large numbers, in the model case of independent and identically distributed (iid for short) random variables.
Recall that:
Theorem 4 (Law of large numbers, model case) Let \(X_1, X_2, \dots\) be an iid sequence of copies of a random variable \(X\) with finite expectation.
  1. (Weak law of large numbers) Then \((X_1 + \cdots + X_n)/n \rightarrow \mathbf{E}X\) in probability.
  2. (Strong law of large numbers) Then, \((X_1 + \cdots + X_n)/n \rightarrow \mathbf{E}X\) almost surely.
Remarks:
There are several variants of the law of large numbers, for instance when one drops the hypothesis of identical distribution, when the random variable \(X\) is not absolutely integrable, or when one seeks more quantitative bounds on the rate of convergence; we will discuss some of these variants later in these notes.
There are several ways to prove the law of large numbers. One main strategy is the moment method, which controls statistics such as the empirical mean \((X_1 + \cdots + X_n)/n\) by computing moments such as the mean \(\mathbf{E}(X_1 + \cdots + X_n)/n\), the variance \(\mathbf{E}|(X_1 + \cdots + X_n)/n - \mathbf{E}(X_1 + \cdots + X_n)/n|^2\), or higher moments such as \(\mathbf{E}|(X_1 + \cdots + X_n)/n - \mathbf{E}(X_1 + \cdots + X_n)/n|^k\) for \(k=4,6, \dots\). Fortunately, the joint independence of the \(X_n\) allows us to compute such moments fairly easily, with some elementary combinatorics. A direct application of the moment method typically requires an additional finite moment assumption on \(X\) such as \(\mathbf{E}|X|^k < \infty\), but as we shall see, one can reduce fairly easily to this case by a so-called truncation argument. For the strong law of large numbers, one can also use methods from the theory of martingales, such as stopping time arguments and maximal inequalities; we will present some classical arguments of Kolmogorov in this context.
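To make this strategy concrete, the basic mechanism behind the moment method is Markov's inequality applied to a \(k\)-th moment: for any \(k \geq 1\) and any threshold \(\varepsilon > 0\), one has \[\mathbf{P}\left( \left|\frac{X_1 + \cdots + X_n}{n} - \mu\right| \geq \varepsilon \right) \leq \frac{1}{\varepsilon^k} \mathbf{E}\left|\frac{X_1 + \cdots + X_n}{n} - \mu\right|^k\] where \(\mu := \mathbf{E}X\), so the problem reduces to showing that the centered \(k\)-th moment on the right-hand side goes to zero (for the weak law) or is summable in \(n\) (for the strong law, via the Borel-Cantelli lemma of Exercise 2); the cases \(k = 2\) and \(k = 4\) are carried out below.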

1. The moment method

We begin by using the moment method to establish both the strong and weak laws of large numbers for sums of iid random variables, under additional moment hypotheses. To prove the versions for complex variables, it suffices to do so for real ones, since one can take real and imaginary parts. Thus we shall focus our attention on real random variables henceforth.
Recall that:

The first moment

Let \(X_1, X_2, \dots\) be a sequence of iid copies of a scalar random variable \(X\), and form the partial sums \(S_n := X_1 + \cdots + X_n\). Suppose that \(X\) is absolutely integrable with mean \(\mathbf{E}X = \mu\). Then one can use linearity of expectation to compute the expectation (the first moment) of the partial sum \(S_n\): \[\mathbf{E}S_n = \mathbf{E}X_1 + \cdots + \mathbf{E}X_n = n\mu.\]
In particular, the mean \(\mu\) is the expectation of the empirical mean \((X_1 + \cdots + X_n)/n\). This is consistent with the strong and weak laws of large numbers, though it does not immediately imply them; still, we can record the following rather weak bound: \[\mathbf{P}(X_1+\cdots+X_n \geq \lambda n\mu) \leq \frac{1}{\lambda}\] for any \(\lambda > 0\), in the case that \(X\) is unsigned and \(\mathbf{E}X\) is finite. Thus, in the unsigned case, the empirical mean \(\frac{X_1 + \cdots + X_n}{n}\) usually does not get much larger than the mean \(\mu\). We will refer to this inequality as a first moment bound on \(S_n\), as it is obtained purely from computations of the first moment \(\mathbf{E}S_n\).
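For the record, this first moment bound is a direct application of Markov's inequality (here we take \(\mu > 0\), so that the threshold is positive; in the degenerate case \(\mu = 0\), an unsigned \(X\) vanishes almost surely): \[\mathbf{P}(X_1 + \cdots + X_n \geq \lambda n \mu) \leq \frac{\mathbf{E}(X_1 + \cdots + X_n)}{\lambda n \mu} = \frac{n\mu}{\lambda n \mu} = \frac{1}{\lambda}.\]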

Second moments

We now turn to second moment bounds, which arise from computations of second moments such as \(\mathbf{E}|(X_1 + \cdots + X_n)/n|^2\) or \(\mathbf{Var}(S_n) = \mathbf{E}|S_n - \mathbf{E}S_n|^2\). When doing so, it will be convenient to normalize the mean \(\mu\) to equal zero, by replacing each \(X_i\) with \(X_i - \mu\) and \(X\) by \(X - \mu\), so that the partial sum \(S_n\) gets replaced by \(S_n - n\mu\) and the empirical mean \(S_n/n\) by \(S_n/n - \mu\). With this normalization, we see that to prove the strong or weak law of large numbers, it suffices to do so in the mean zero case \(\mu = 0\). On the other hand, if \(X\) is unsigned, then normalizing it in this fashion will generally destroy the unsigned property of \(X\), so this normalization is not always desirable.
Remarks
Recall that 
We now compute the variance of the partial sum \(S_n\). Suppose that \(X\) has finite second moment, and that \(X\) has been normalized to have mean zero (recall that adding or subtracting a constant does not affect the variance: \(\mathbf{Var}(X - \mu) = \mathbf{Var}(X)\)); we write \(\sigma^2 := \mathbf{E}|X|^2\) for the variance of \(X\). Then the variance of \(S_n\) is simply \(\mathbf{E}|X_1 + \cdots + X_n|^2\). By linearity of expectation, we have \[\mathbf{Var}(X_1 + \cdots + X_n) = \underset{1 \leq i,j \leq n}{\sum}\mathbf{E}X_i X_j.\]
All the expressions on the right-hand side of this identity are absolutely integrable thanks to the Cauchy-Schwarz inequality. If \(i = j\), then \(\mathbf{E}X_i X_j\) is equal to the second moment \(\mathbf{E}|X|^2 = \sigma^2\). If instead \(i \neq j\), then \(X_i, X_j\) are independent and have mean zero by hypothesis, and thus \[\mathbf{E}X_i X_j = (\mathbf{E}X_i)(\mathbf{E}X_j) = 0.\]
Putting these contributions together, we obtain
\[\mathbf{Var}(X_1 + \cdots + X_n) = n \sigma^2\]
or equivalently, for the empirical mean, \[\mathbf{Var}((X_1 + \cdots + X_n)/n) = \frac{1}{n}\sigma^2.\]
This bound was established in the mean zero case, but it clearly holds in general, since adding or subtracting a constant does not affect the variance; for instance, \(\mathbf{Var}(S_n/n) = \mathbf{Var}(S_n/n - \mu) = \frac{1}{n}\sigma^2\), where we now write \(\sigma^2 = \mathbf{E}|X - \mu|^2\).
This is our first demonstration of the concentration of measure effect that arises from averaging many independent random variables \(X_1, \dots, X_n\). At the opposite extreme, we could take \(X_1, \dots, X_n\) to all be exactly the same random variable, \(X_1 = \cdots = X_n = X\), in which case \(\frac{X_1 + \cdots + X_n}{n} = X\) has exactly the same mean and variance as \(X\), and no concentration occurs.
Remarks:
Now, let us examine the behavior of the empirical mean \(S_n/n\). Inserting the variance bound obtained above into Chebyshev's inequality, we obtain \[\mathbf{P}(|\frac{S_n}{n} - \mu| \geq \varepsilon) \leq \frac{1}{n}\frac{\sigma^2}{\varepsilon^2}\]
for any natural number \(n\) and threshold \(\varepsilon > 0\), whenever \(X\) has mean \(\mu\) and finite variance \(\sigma^2 < \infty\). Hence \(\frac{X_1 + \cdots + X_n}{n} \rightarrow \mu\) in probability: we have proved the weak law of large numbers in the case when \(X\) has finite variance.
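As a purely illustrative numerical check (our own addition, not part of the argument; the choice of the \(\mathrm{Exp}(1)\) distribution, with \(\mu = \sigma^2 = 1\), is arbitrary), the following Python sketch compares the empirical tail probability \(\mathbf{P}(|\frac{S_n}{n} - \mu| \geq \varepsilon)\) with the Chebyshev bound \(\frac{1}{n}\frac{\sigma^2}{\varepsilon^2}\):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma2 = 1.0, 1.0   # Exp(1) has mean 1 and variance 1
eps = 0.1
trials = 5000           # number of independent samples of the empirical mean

for n in (100, 1000, 10000):
    # draw the empirical mean S_n / n of n iid Exp(1) variables, `trials` times
    means = np.array([rng.exponential(1.0, n).mean() for _ in range(trials)])
    empirical = np.mean(np.abs(means - mu) >= eps)
    chebyshev = min(sigma2 / (n * eps ** 2), 1.0)
    print(f"n={n:6d}  empirical tail probability: {empirical:.4f}"
          f"   Chebyshev bound: {chebyshev:.4f}")
```

The Chebyshev bound is valid but far from sharp (the true tail probability decays much faster, as the central limit theorem will make precise); for the weak law, the \(1/n\) decay is all that is needed.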
Note that this bound can be refined: the quantity \[\frac{\sqrt{n}}{\omega(n)}|\frac{X_1 + \cdots + X_n}{n} - \mu|\]
converges to zero in probability whenever \(\omega(n)\) is a function of \(n\) that goes to infinity as \(n \rightarrow \infty\). For instance we have \[\frac{\sqrt n}{\log n}|\frac{S_n}{n} - \mu| \rightarrow 0\]
in probability. The informal interpretation of this result is that the empirical mean \(S_n/n\) typically strays from \(\mu\) by not much more than \(O(1/\sqrt n)\). This intuition will be reinforced later when we study the central limit theorem and related results such as the Chernoff inequality, and it is sharpened further by the law of the iterated logarithm, which we will not be able to get to in this set of notes.
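For completeness, the \(\omega(n)\) claim follows from the same Chebyshev bound: for any \(\varepsilon > 0\), \[\mathbf{P}\left(\frac{\sqrt{n}}{\omega(n)}\left|\frac{S_n}{n} - \mu\right| \geq \varepsilon\right) = \mathbf{P}\left(\left|\frac{S_n}{n} - \mu\right| \geq \frac{\varepsilon\, \omega(n)}{\sqrt{n}}\right) \leq \frac{1}{n} \cdot \frac{\sigma^2 n}{\varepsilon^2 \omega(n)^2} = \frac{\sigma^2}{\varepsilon^2 \omega(n)^2},\] which goes to zero as \(n \rightarrow \infty\) since \(\omega(n) \rightarrow \infty\).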
Remarks
  • In the proof above we in fact obtained the quantitative bound \[\mathbf{P}(|\frac{X_1 + \cdots + X_n}{n} - \mathbf{E}X| \geq \delta) \leq \frac{1}{n}\frac{\mathbf{Var}(X)}{\delta^2}\] on the empirical mean, valid for arbitrary \(n\) and threshold \(\delta > 0\).
  • Recall that

Higher moments

One can also hope to use the bound \(\mathbf{P}(|\frac{S_n}{n} - \mu| \geq \varepsilon) \leq \frac{1}{n}\frac{\sigma^2}{\varepsilon^2}\) together with the Borel-Cantelli lemma to obtain the strong law of large numbers in the second moment case, but unfortunately the bounding quantities \(\frac{1}{n}\frac{\sigma^2}{\varepsilon^2}\) are not summable in \(n\). To obtain summable bounds, we will need to work with moments higher than the second. One could try to compute third moments such as \(\mathbf{E}S_n^3\), but these turn out not to convey much information because of their signed nature; the expression \(\mathbf{E}|S_n|^3\) would in principle convey more usable information, but is difficult to compute in practice, as \(|S_n|^3\) is not a polynomial combination of the \(X_i\).

We therefore move on to the fourth moment. We normalize \(X\) to the mean zero case \(\mu = 0\), write \(\sigma^2 = \mathbf{E}|X|^2\) for the variance of \(X\), and now assume a finite fourth moment \(\mathbf{E}|X|^4 < \infty\), which by the Hölder or Jensen inequalities implies that all lower moments such as \(\mathbf{E}|X|^2\) are finite. We then expand \(\mathbf{E}|S_n|^4\) as \[\mathbf{E}|X_1 + \cdots + X_n|^4 = \sum_{1 \leq i,j,k,l \leq n} \mathbf{E}X_iX_jX_kX_l.\] Note that all expectations here are absolutely integrable by the hypothesis of finite fourth moment and Hölder's inequality. The expression \(\mathbf{E}X_i X_j X_k X_l\) looks complicated, but fortunately it simplifies in most cases. Suppose for instance that \(i\) differs from each of \(j,k,l\); then \(X_i\) and \(X_j X_k X_l\) are independent, and so \[\mathbf{E}X_i X_j X_k X_l = 0.\] Similarly for the other cases in which one index differs from all the rest. This leaves only a few quadruples \(i, j, k, l\) for which \(\mathbf{E}X_i X_j X_k X_l\) could be non-zero: the three cases \(i = j \neq k = l\), \(i = k \neq j = l\), \(i = l \neq j = k\), in which each of the indices \(i, j, k, l\) is equal to exactly one other index, and the diagonal case \(i = j = k = l\). If for instance \(i = j \neq k = l\), then \[\mathbf{E}X_i X_j X_k X_l = \sigma^4\] since \(X_i X_j = X_i^2\) and \(X_k X_l = X_k^2\) are independent with mean \(\sigma^2\). Similarly for the cases \(i = k \neq j = l\) and \(i = l \neq j = k\); these three cases contribute \(3n(n-1)\sigma^4\) in total to \(\mathbf{E}|S_n|^4\), while the diagonal case contributes \(n \mathbf{E}|X|^4\). Hence we conclude that \[\mathbf{E}|X_1 + \cdots + X_n|^4 = 3n(n - 1)\sigma^4 + n \mathbf{E}|X|^4\] and therefore, by Markov's inequality and after removing the normalization \(\mu = 0\), that \[\mathbf{P}(|\frac{S_n}{n} - \mu| > \varepsilon) \leq \frac{3n(n-1)\sigma^4 + n \mathbf{E}|X - \mu|^4}{\varepsilon^4 n^4}.\]
The quantity on the right-hand side now decays like \(O(1/n^2)\), which is summable in \(n\). Thus we may apply the Borel-Cantelli lemma and conclude the strong law of large numbers in this case, that is, provided that \(X\) has finite fourth moment \(\mathbf{E}|X|^4 < \infty\).
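As a quick sanity check on the combinatorial computation above (a purely illustrative addition; the centered exponential is an arbitrary choice of mean-zero test variable), one can compare a Monte Carlo estimate of \(\mathbf{E}|S_n|^4\) against the formula \(3n(n-1)\sigma^4 + n \mathbf{E}|X|^4\):

```python
import numpy as np

rng = np.random.default_rng(2)

# X = Exp(1) - 1 is a mean-zero test variable with sigma^2 = E|X|^2 = 1
# and fourth moment E|X|^4 = 9.
sigma2, m4 = 1.0, 9.0
trials = 200000

for n in (2, 10, 50):
    X = rng.exponential(1.0, size=(trials, n)) - 1.0  # trials x n iid copies of X
    S = X.sum(axis=1)                                 # partial sums S_n
    estimate = np.mean(S ** 4)
    formula = 3 * n * (n - 1) * sigma2 ** 2 + n * m4
    print(f"n={n:3d}  Monte Carlo E|S_n|^4: {estimate:9.1f}   formula: {formula:9.1f}")
```

Up to Monte Carlo error, the two columns agree, as the computation predicts.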
One can of course continue to compute higher and higher moments of \(S_n\) with suitable finite moment hypotheses on \(X\), though as one can already see from the fourth moment calculation, the computations become increasingly combinatorial in nature. We will come back to this sort of analysis when we study the central limit theorem.

Applications of the moment method

We begin with two quick applications of the weak law of large numbers to topics outside of probability. We will first present an explicit version of the Weierstrass approximation theorem.
Recall that