Importance Sampling: A clever substitution of sampling region
MCI was first developed as a computational method to solve the integration problem of estimating expectations; it was later applied to simulating Bayesian posterior (BP) distributions. Its algorithm is transparent: generate random samples from a distribution function, say the “Target”, then approximate the integral numerically by averaging the sampled values. The expectation obtained by MCI is referred to as the Empirical Average (Supplementary Formula-1). For example, imagine we want to estimate the expectation of the function \(h\left(x\right)=\sin\left(x\right)\sqrt{\left|\cos\left(x\right)\right|}\), where the random variable X follows a Normal distribution with mean = 0 and SD = 5 (the target). First, generate 10,000 random samples from the target distribution, then evaluate h(x) at each generated sample, and finally calculate the mean and variance of the resulting values numerically (Supplementary R Code-1). This simple random sampling method is not cost-effective when the target distribution is diffuse, because a large sample size is required to obtain acceptable precision.
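As a minimal illustration of this step (a sketch rather than the Supplementary R Code itself; the seed and variable names are ours), the calculation can be written in R as:

set.seed(1)                            # for reproducibility (illustrative choice)
x <- rnorm(10000, mean = 0, sd = 5)    # draw 10,000 samples from the target N(0, 5)
h <- sin(x) * sqrt(abs(cos(x)))        # evaluate h(x) at each draw
mean(h)                                # empirical average (Supplementary Formula-1)
var(h)                                 # variance of the generated h(x) values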
MCI was improved by IS, a variance-reduction technique first introduced in statistical physics (1; 2). IS relaxes the requirement of treating all parts of the distribution equally, concentrating instead on the regions where estimation is most critical. To this end, an alternative function close to the target, say the “Proposal”, is suggested by making an educated guess. In contrast to MCI, in which samples are treated evenly, each generated sample is allocated a “weight” reflecting its importance through an importance function. In practice, for each sample one calculates the likelihood of obtaining that sample from the target distribution relative to the likelihood of sampling it from the proposal distribution. After the sampling is finished, these relative likelihoods are normalized so that they sum to one. In this way, each point has its own likelihood of occurrence, forming a discrete probability distribution. The expectation obtained by IS is called the Weighted Average (Supplementary Formula-2).
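In generic notation (our symbols, restating Supplementary Formula-2 rather than reproducing it), with target density p, proposal density q, and draws \(x_i\sim q\), the weights and the weighted average are
\[
w_i=\frac{p(x_i)}{q(x_i)},\qquad \tilde{w}_i=\frac{w_i}{\sum_{j=1}^{n} w_j},\qquad \widehat{E}\left[h(X)\right]=\sum_{i=1}^{n}\tilde{w}_i\,h(x_i).
\]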
Consider a Normal (0, 0.05) and a Student's t (df = 1) as two different proposal distributions for the Normal (0, 5) target, and estimate the importance weights through the importance functions \(\frac{\text{target: }N(0,5)}{\text{proposal: }N\left(0,0.05\right)}\) and \(\frac{N(0,5)}{t(1)}\) for each generated point. Then normalize the weights as \(\frac{\frac{N(0,5)}{N\left(0,0.05\right)}}{\sum\frac{N(0,5)}{N\left(0,0.05\right)}}\) and \(\frac{\frac{N(0,5)}{t(1)}}{\sum\frac{N(0,5)}{t(1)}}\). We therefore obtain a discrete distribution function whose properties, such as the mean and variance, are easily estimable. The estimated means for our generated samples were approximately 0, as with MCI, but their variances were considerably lower than under the MCI approach (Supplementary R Code-2). Using alternative distributions can improve the variance of the estimates, although a wide proposal distribution leads to worse estimates in terms of variance and is inefficient because it requires a large number of samples (Figure 1). Choosing an appropriate proposal distribution that resembles the target is ideal, though such a distribution can be difficult to find. IS yields unbiased estimates of the parameters for large samples, and it works well when the importance function is not highly variable. Indeed, an appropriate proposal distribution leads to lower variances and a more accurate approximation. Robert and Casella gave an example in which using a Normal (0, 1) proposal to resemble sampling from a Cauchy C(0, 1) target produced importance weights with infinite variance (3). This attaches high importance to a few points and yields inefficient estimates in terms of variance. Substituting heavy-tailed distributions such as the Student's t for the Normal guards against this problem and ensures a reasonable fit.
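A minimal R sketch of this comparison (again illustrative rather than the Supplementary R Code; the seed, the reuse of 10,000 draws, and the variable names are our assumptions) is:

set.seed(1)
n  <- 10000

# Proposal 1: N(0, 0.05)
x1 <- rnorm(n, mean = 0, sd = 0.05)
w1 <- dnorm(x1, 0, 5) / dnorm(x1, 0, 0.05)   # importance weights: target / proposal
w1 <- w1 / sum(w1)                           # normalize so the weights sum to one
h1 <- sin(x1) * sqrt(abs(cos(x1)))
m1 <- sum(w1 * h1)                           # weighted average (Supplementary Formula-2)
v1 <- sum(w1 * (h1 - m1)^2)                  # weighted variance of h(x)

# Proposal 2: Student's t with 1 degree of freedom
x2 <- rt(n, df = 1)
w2 <- dnorm(x2, 0, 5) / dt(x2, df = 1)
w2 <- w2 / sum(w2)
h2 <- sin(x2) * sqrt(abs(cos(x2)))
m2 <- sum(w2 * h2)
v2 <- sum(w2 * (h2 - m2)^2)

Each normalized weight vector defines a discrete probability distribution over the generated points, so the same weighted-sum formulas give any other moment of interest as well.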