How to conduct structural equation modelling: a tutorial and a primer

Structural equation modelling is a process of data analysis in which the analyst submits a covariance (or correlation) matrix to an algorithm and specifies the relationships between the variables in that matrix, either in matrix algebra or as a series of paths. The algorithm, implemented as a computer programme, then computes a model-implied matrix from the path information and assigns path coefficients that define how the variables are linked to each other. The model-implied (or imputed) matrix is compared with the original data matrix, and the programme iteratively adjusts the coefficients to reduce the gap between the two. This is the process of “convergence”, and the final solution is a set of path coefficients that link the different elements of the model.
Structural equation modelling has four steps: identification of the model, specification of the model, estimation of the model, and modification of the model to arrive at the best fit. Identification requires that the number of parameters the analyst must estimate be equal to or fewer than the number of non-redundant elements in the data matrix, given by the formula p * (p + 1) / 2, where p is the number of variables in the data matrix. This count corresponds to the lower triangle of the variance-covariance matrix, including the diagonal. A model can converge to a unique solution only if the number of unknown parameters is equal to or fewer than the number of non-redundant pieces of information: a model with exactly as many parameters as pieces of information is just-identified, and a model with fewer parameters is over-identified. A model with more unknown parameters than non-redundant pieces of information is under-identified, and such models do not “converge” to a unique solution.
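The counting rule for identification can be illustrated with a few lines of arithmetic. The sketch below is in Python purely for illustration. The function names and the example of six observed variables are hypothetical and do not come from any SEM package. It assumes the usual convention that a model needs no more free parameters than there are non-redundant elements, p * (p + 1) / 2. This rule is necessary for identification but does not guarantee it.

```python
def non_redundant_elements(p: int) -> int:
    """Unique variances and covariances among p observed variables."""
    return p * (p + 1) // 2


def identification_status(p: int, free_parameters: int) -> str:
    """Classify a model by the counting rule (necessary, not sufficient)."""
    # Degrees of freedom: available information minus parameters to estimate.
    df = non_redundant_elements(p) - free_parameters
    if df < 0:
        return "under-identified"
    if df == 0:
        return "just-identified"
    return "over-identified"


# Hypothetical example: 6 observed variables give 6 * 7 / 2 = 21 unique elements.
for q in (13, 21, 25):
    print(q, "free parameters:", identification_status(6, q))
```

The difference between the number of non-redundant elements and the number of free parameters is the model's degrees of freedom, which is the quantity SEM programmes report alongside the chi-square statistic.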
After identifying a model, the analyst specifies it, using either a path diagram or three matrices of information: a matrix of the variances and covariances of the variables, a matrix of the paths, and a matrix that selects the subset of variables included in the model. The analyst then submits the model to the programme, which fits the path coefficients; this process is referred to as estimation of the model. Here the programme aims to fit the data provided by minimizing the gap between the data matrix and the model-implied matrix. The analyst then modifies the model in different ways to reduce the remaining discrepancy between the data matrix and the model matrix and so identify the best-fitting model. This step is referred to as modification. An iterative process of identification, specification, estimation and modification leads to a final model that fits the data as well as possible while remaining driven by theory. Model fit is assessed with the chi-square statistic (where a non-significant p-value indicates a better-fitting model), the root mean square error of approximation, or RMSEA (where a low score close to 0 indicates better model performance), and fit indices (where scores close to 1.00 indicate better model performance).
A number of computer programmes are available to conduct structural equation modelling. We will use “lavaan” and “umx” to demonstrate how to conduct structural equation modelling with an example data set. SEMs can be used to validate either measurement models or structural models, and we will use both types of model to show how these analyses are conducted. Structural equation models are also used for regression analyses, measurement models and questionnaire validation, analysis of binary and time series data, and data on more than one group.
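To make the fit measures concrete, the following is a minimal sketch of how RMSEA and one common fit index, the comparative fit index (CFI), can be computed from chi-square statistics. It is written in Python for illustration only and is not how “lavaan” or “umx” report these values. The function names and the input numbers are hypothetical. The RMSEA formula shown uses N - 1 in the denominator; some programmes use N instead.

```python
import math


def rmsea(chi_square: float, df: int, n: int) -> float:
    """Root mean square error of approximation; values near 0 suggest good fit."""
    if df <= 0 or n <= 1:
        raise ValueError("RMSEA needs positive degrees of freedom and n > 1")
    return math.sqrt(max(chi_square - df, 0.0) / (df * (n - 1)))


def cfi(chi_model: float, df_model: int,
        chi_baseline: float, df_baseline: int) -> float:
    """Comparative fit index; values near 1.00 suggest good fit."""
    # Excess of chi-square over its degrees of freedom, floored at zero.
    d_model = max(chi_model - df_model, 0.0)
    d_baseline = max(chi_baseline - df_baseline, d_model)
    if d_baseline == 0.0:
        return 1.0
    return 1.0 - d_model / d_baseline


# Hypothetical values, not taken from any real analysis.
print("RMSEA:", rmsea(chi_square=30.0, df=10, n=201))
print("CFI:", cfi(chi_model=30.0, df_model=10,
                  chi_baseline=210.0, df_baseline=15))
```

Both measures reward a model chi-square that stays close to its degrees of freedom, which is why an over-identified model is needed before fit can be assessed at all.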