Location 7 showed the highest correlation with location 8, followed by location 6, and the lowest correlation with location 25. If a confidence threshold could be established, it may be possible to determine statistically when a value from location 7 truly belongs to location 8, rather than merely being highly correlated with it. Various methods were therefore assessed for how reliably they could reproduce the distribution of the data from location 7.
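As a minimal sketch of this comparison, assuming the observations are held in a data frame with one column of temperatures per location (the data frame ‘temps’ and column names such as ‘loc7’ below are hypothetical placeholders), the ranking could be obtained with ‘cor()’:

    set.seed(1)
    # Placeholder data frame standing in for the real observations:
    # one column of temperatures per location (loc1 ... loc25 are hypothetical names)
    temps <- as.data.frame(matrix(rnorm(500 * 25, mean = 15, sd = 3), ncol = 25,
                                  dimnames = list(NULL, paste0("loc", 1:25))))

    # Full correlation matrix, then the other locations ranked by their
    # correlation with location 7
    loc_cor <- cor(temps, use = "pairwise.complete.obs")
    sort(loc_cor["loc7", colnames(loc_cor) != "loc7"], decreasing = TRUE)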
The function ‘descdist()’ was applied to estimate the kurtosis and skewness of location 7. Kurtosis indicates the heaviness of the distribution's tails, whereas the skewness output indicates the direction and degree of asymmetry. The function showed a positive skew and a kurtosis not far from three. Three common right-skewed distributions could therefore be considered for fitting: the Weibull, gamma and lognormal distributions. As the skew is very short-tailed, a normal distribution could also be accepted upon rejection of the other distributions, despite normality having previously been rejected by the Shapiro-Wilk test.
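As a hedged illustration, assuming the location 7 temperatures are held in a numeric vector ‘x’ (simulated below as a placeholder) and that ‘descdist()’ comes from the fitdistrplus package:

    library(fitdistrplus)

    set.seed(1)
    x <- rgamma(500, shape = 40, rate = 2)   # placeholder for the location 7 temperatures

    # descdist() reports the sample skewness and kurtosis and plots the observation
    # on a Cullen and Frey graph against common candidate distributions
    # (normal, lognormal, gamma, Weibull, ...); boot adds bootstrap replicates
    descdist(x, boot = 1000)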
The function ‘fitdistr()’ was used to fit each candidate distribution (Weibull, gamma, lognormal and normal) by maximum likelihood. The fitted parameters imply a mean for each distribution, so whichever distribution gave an implied mean closest to the true sample mean was selected. The distribution which provided the most promising value was the normal distribution.
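A sketch of that selection step, with ‘x’ again standing in for the location 7 values and ‘fitdistr()’ taken from the MASS package, might compare the mean implied by each set of fitted parameters with the sample mean:

    library(MASS)

    set.seed(1)
    x <- rgamma(500, shape = 40, rate = 2)   # placeholder for the location 7 temperatures

    fits <- list(
      weibull   = fitdistr(x, "weibull"),
      gamma     = fitdistr(x, "gamma"),
      lognormal = fitdistr(x, "lognormal"),
      normal    = fitdistr(x, "normal")
    )

    # Mean implied by each set of maximum-likelihood parameter estimates
    implied_means <- c(
      weibull   = unname(fits$weibull$estimate["scale"] *
                         gamma(1 + 1 / fits$weibull$estimate["shape"])),
      gamma     = unname(fits$gamma$estimate["shape"] / fits$gamma$estimate["rate"]),
      lognormal = unname(exp(fits$lognormal$estimate["meanlog"] +
                             fits$lognormal$estimate["sdlog"]^2 / 2)),
      normal    = unname(fits$normal$estimate["mean"])
    )

    # Distance of each implied mean from the observed mean; the smallest is chosen
    sort(abs(implied_means - mean(x)))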
Transforming location 7 to a true normal distribution was considered, but given the above results the data were assumed to be nearest in distribution to a normal distribution. Sample data were then generated using the aforementioned normal-distribution simulation function, applied to the distribution of location 7. The resulting sample, however, had a very low correlation with the observed data, showing that reproducing sample data for location 7 with a normal distribution was still inappropriate.
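A minimal sketch of that check, assuming ‘rnorm()’ was the simulation function referred to and reusing the placeholder vector ‘x’ for the location 7 values:

    set.seed(1)
    x <- rgamma(500, shape = 40, rate = 2)   # placeholder for the location 7 temperatures

    # Simulate from a normal distribution with location 7's sample mean and sd
    sim_norm <- rnorm(length(x), mean = mean(x), sd = sd(x))

    # Independently generated values share a similar marginal distribution with the
    # observations but are uncorrelated with them, hence the very low correlation
    cor(x, sim_norm)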
An alternative idea was then proposed: generating sample data under the same distribution curve as location 7. With values fitted to a graphical distribution, how much adjustment is required before correlation no longer occurs? To assess this, it must first be possible to regenerate sample data with a correlation of nearly 1 with the original observed values. The density of the observed values was first estimated and used as the model of the observed distribution. The ‘adjust’ argument was then applied to shift the data values in order to create the sample values. Even with minimal adjustment, however, the correlation was close to zero, showing this method to be as unusable as the previous one.
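One common way to regenerate values from an estimated density, assumed here purely for illustration, is to resample the observations and add kernel noise whose spread is governed by the ‘adjust’ argument of ‘density()’; the following sketch again uses the placeholder vector ‘x’:

    set.seed(1)
    x <- rgamma(500, shape = 40, rate = 2)   # placeholder for the location 7 temperatures

    # Kernel density estimate of the observed distribution; 'adjust' scales the bandwidth
    dens <- density(x, adjust = 1)

    # One way to draw a sample from the fitted density: resample the observations and
    # perturb each draw by Gaussian kernel noise with the (adjusted) bandwidth
    sim_kde <- sample(x, length(x), replace = TRUE) + rnorm(length(x), 0, dens$bw)

    # The draws fall in a random order, so the pairwise correlation is again near zero
    cor(x, sim_kde)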
It was then realised that the function ‘cor()’ was being used to measure the correlation; however, this measures linear correlation, and our data are non-linear. The function ‘nlcor()’ was therefore introduced and the previous correlation calculations were reassessed non-linearly. The correlation values improved slightly, but not enough to be accepted. A probability matrix for each temperature was also considered, but was deemed to introduce unwanted bias.
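Assuming ‘nlcor()’ comes from the nlcor package (distributed via GitHub rather than CRAN), the non-linear reassessment might look like the following sketch, with ‘x’ and ‘sim’ as hypothetical observed and simulated vectors:

    # devtools::install_github("ProcessMiner/nlcor")
    library(nlcor)

    set.seed(1)
    x   <- rgamma(500, shape = 40, rate = 2)              # placeholder observed values
    sim <- rnorm(length(x), mean = mean(x), sd = sd(x))   # placeholder simulated sample

    # nlcor() segments the series, measures the correlation piecewise and aggregates it,
    # returning a non-linear correlation estimate with an adjusted p-value
    res <- nlcor(x, sim, plt = FALSE)
    res$cor.estimate
    res$adjusted.p.value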
As the previous methods had provided little progress, an entirely different approach was considered. Rather than creating sample data by generating individual values, the existing values could themselves be shifted slightly, so that over time the data would gradually deviate in correlation from the observed location values. The base R function ‘jitter()’ was used to achieve this. If a small amount of change (for example 0.1%) is added to each value in progression, at what amount of change in jitter (or in this case ‘noise’) can we say that a value no longer belongs to, or correlates with, the original dataset? And if correlation with surrounding locations occurs, at what stage or amount of noise does this happen? By default, ‘jitter()’ adds noise equal to the factor argument times one fifth of the smallest difference between observed values, so only a minimal amount of noise (within realistic values) is applied. However, the recorded temperature values can run to more than five significant figures, so the default noise would be too small to be meaningful. The chosen noise level was therefore supplied through the ‘amount’ argument rather than the ‘factor’ argument. This resulted in an extremely high correlation, near 1 (0.99999998), with p < 0.05, so the resulting sample dataset was accepted. This procedure was then applied to location 7 with varying degrees of noise, from 0.01 to 4.00 in increments of 0.01, and repeated five times to obtain more accurate estimates. Further trials would have been carried out, but this test was extensive and required an extended duration. The same procedure was also carried out on the remaining 24 locations.
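A sketch of that noise sweep, again using the placeholder vector ‘x’ for the location 7 values and plain ‘cor()’ for brevity:

    set.seed(1)
    x <- rgamma(500, shape = 40, rate = 2)   # placeholder for the location 7 temperatures

    noise_levels <- seq(0.01, 4.00, by = 0.01)
    n_trials <- 5

    # One column per trial, one row per noise level: correlation of the jittered sample
    # with the original values ('amount' fixes the noise range directly, rather than
    # scaling the default factor-based jitter)
    trial_cor <- sapply(seq_len(n_trials), function(trial) {
      sapply(noise_levels, function(a) cor(x, jitter(x, amount = a)))
    })

    head(cbind(noise = noise_levels, trial_cor))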
Once five trials had been run for every location, the maximum correlation for each row within a trial was calculated. This identified the amount of added noise at which the sample data no longer had its highest correlation with the observed data for its own location, and instead correlated more strongly with a nearby location. The degree of noise and the correlation were extracted at these threshold points.
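A sketch of that threshold extraction, reusing the hypothetical ‘temps’ data frame from the earlier sketch (with ‘loc8’ adjusted so that it tracks ‘loc7’ closely):

    set.seed(1)
    # Placeholder observations: one column per location, with loc8 made to track loc7
    temps <- as.data.frame(matrix(rnorm(500 * 25, mean = 15, sd = 3), ncol = 25,
                                  dimnames = list(NULL, paste0("loc", 1:25))))
    temps$loc8 <- temps$loc7 + rnorm(500, 0, 0.5)

    noise_levels <- seq(0.01, 4.00, by = 0.01)

    # For each noise level, correlate the jittered loc7 sample with every location
    # and record which location now shows the highest correlation
    best_loc <- sapply(noise_levels, function(a) {
      jittered <- jitter(temps$loc7, amount = a)
      names(which.max(cor(jittered, temps)[1, ]))
    })

    # First noise level at which the sample no longer matches its own location
    # (NA means no crossover occurred within the tested noise range)
    threshold_idx <- which(best_loc != "loc7")[1]
    noise_levels[threshold_idx]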