Literature Review

2.1 Failure rates in the naval ship setting
Before each mission, every naval ship is equipped with a forecasted quantity of spare engines. An underestimated forecast risks mission failure, as spare parts cannot be resupplied during the mission. An overestimated forecast may reduce operating efficiency because of the load of unnecessary spare parts; moreover, from a system point of view, overestimation wastes budget and may even cause inventory shortages for other ships. Defining the optimal set of spare parts is therefore crucial for mission success (Zammori et al., 2020). For accurate prediction, several features specific to the Navy's system should be noted. First, the data are unbalanced in two respects: age period and engine type. Only a short period of failure rate data is available compared to the entire lifetime; in our case, for example, the early ages have less data than the rest of the age period, which is problematic because the failure rate of young ships is needed for operation. The distribution of ships across engine types is also unbalanced: in our dataset of 98 ships, the five engine type categories contain 5, 27, 43, 17, and 6 ships, respectively. While a satisfactory model could be obtained for an engine type with a large amount of data, models for the other types might suffer from a lack of data. Moreover, the similarity between ships and engines should be noted: similar failure patterns are expected because all engines are under the same maintenance process, with planned maintenance performed by the ROK Navy regardless of engine type (Yoo et al., 2019). In these circumstances, where ships as well as engines share certain qualities, a model with a layered parameter structure is needed, one able to learn the specific structure between and within each layer from the data.
2.2 Failure forecasting models
Several models, such as ARIMA, exponential smoothing, and seasonal-trend decomposition using Loess (Hyndman and Athanasopoulos, 2018), could model the time series characteristics of the failure rate. Among existing time series models, Prophet, which adopts a Bayesian generalized additive model, shows high accuracy. Moreover, it decomposes a time series into trend, seasonal, and other regressor factors, which enhances both its applicability and interpretability (Taylor and Letham, 2018). More specific models concentrating on the characteristics of failure have also been suggested. The bathtub curve is a typical failure rate pattern, and the Weibull or Poisson distribution is often used to model failure rates. Wang and Yin (2019) performed failure rate forecasting through a stochastic ARIMA model and the Weibull distribution. The time series was decomposed into a trend, assumed to be bathtub-shaped, and stochastic factors. Parameters of the Weibull distribution were learned separately for the increasing, decreasing, and flat periods of the bathtub. The stochastic element was obtained using ARIMA, and the time series failure rate was calculated as the sum of the trend and stochastic elements. Sherbrooke (2006) proposed Pareto-optimal algorithms, named constructive algorithms, based on the Poisson distribution, but they had limits in determining the parameters. Zammori et al. (2020) tried to solve the parameter estimation problem of Sherbrooke's (2006) model by applying a time-series Weibull distribution. Other attempts, such as Pareto-optimal and Monte Carlo methods (Sherbrooke, 2006) and ARMA and least-squares logarithm methods (Wang and Yin, 2019), have been made to add the effect of stochastic factors to these distributions. Attempts have also been made to integrate time series models with information about the system architecture. In a risk analysis of deepwater drilling riser fracture (Chang et al., 2019), a Bayesian network was used to predict the fracture failure rate.
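The bathtub-shaped hazard discussed above can be illustrated with a minimal sketch (not Wang and Yin's exact model): the Weibull hazard rate h(t) = (k/λ)(t/λ)^(k−1) is decreasing for shape k < 1 (infant mortality), constant for k = 1 (random failures), and increasing for k > 1 (wear-out), so superposing phases with different shapes yields a bathtub curve. All parameter values below are illustrative.

```python
import numpy as np

def weibull_hazard(t, shape, scale):
    """Weibull hazard rate h(t) = (k/lam) * (t/lam)**(k - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)

t = np.linspace(0.1, 10, 100)  # ship age (arbitrary units)

infant = weibull_hazard(t, shape=0.5, scale=2.0)   # k < 1: decreasing hazard
useful = weibull_hazard(t, shape=1.0, scale=2.0)   # k = 1: constant hazard
wearout = weibull_hazard(t, shape=3.0, scale=8.0)  # k > 1: increasing hazard

# Illustrative bathtub curve: early failures dominate at first,
# wear-out dominates late, with a flatter region in between.
bathtub = infant + wearout
```

This is only a shape illustration; in the cited work the Weibull parameters of each bathtub period are estimated from the data rather than fixed.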
Bayesian networks can also be used to analyze and prevent the causes of a ship's potential accidents (Afenyo et al., 2017). Time series forecasting based on Bayesian networks (Dikis and Lazakis, 2019) and the Analytic Hierarchy Process (AHP) (Yoo et al., 2019) illustrates these approaches. They are based on the assumption that equipment within the same group, engines for example, follows similar failure patterns.
2.3 Hierarchical model
Hierarchical models have an edge in representing the features of the Navy data introduced in Section 2.1, namely unbalanced categories and a shared structure, through information pooling. Gelman et al. (2005) explained that hierarchical models are highly predictive because of pooling (see also Gelman et al., 2013). When a hierarchical model is used, there is almost always an improvement, but to a degree that depends on the heterogeneity of the observed data (Gelman, 2006a). When updating the model parameters, such as prior parameters, the relationship between the part of the data being used and the whole population should always be considered. Pooling effects between subclusters are partial, as they are implemented through shared hyperparameters rather than shared parameters. In a Bayesian hierarchy, the balance of fit can be learned through hyperpriors. By properly setting the hyperprior structure, we can find a reasonable balance between over-fitting and under-fitting, as hyperpriors are known to serve as a regularizing factor. Many examples of applying hierarchical structures to cross-sectional data exist in diverse domains, such as ecology, education, business, and epidemiology (McElreath, 2020). The structure of cross-sectional data, where the whole population is divided into multiple nested subcategories, provides an excellent environment for a hierarchical model. Previous literature comparing the education effects of multiple schools has shown that incorporating the nested structure of state, school, and class in the model yields substantial improvement in terms of accuracy and interpretability (Rubin, 1981).
2.4 Model evaluation measures
Time series cross-validation and k-fold cross-validation, along with the expanding forecast method, can be used to measure forecast accuracy in time series (Hyndman and Athanasopoulos, 2018). Several sets of training and test data are created in a walk-forward mode, and forecast accuracy is computed by averaging over the test sets.
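The walk-forward, expanding-window scheme just described can be sketched as follows; the synthetic series, window sizes, and the naive forecaster are all illustrative placeholders, not the models evaluated in this work.

```python
import numpy as np

def expanding_window_splits(n_obs, initial, horizon):
    """Yield (train_idx, test_idx) pairs for walk-forward evaluation:
    the training window starts with `initial` observations and grows by
    `horizon` at each step; the test window is the next `horizon` points."""
    start = initial
    while start + horizon <= n_obs:
        yield np.arange(start), np.arange(start, start + horizon)
        start += horizon

# Illustrative monthly failure-rate series (synthetic).
series = np.sin(np.linspace(0, 6, 36)) + 1.5

errors = []
for train_idx, test_idx in expanding_window_splits(len(series), initial=24, horizon=4):
    # Placeholder model: naive forecast repeating the last training value.
    forecast = np.repeat(series[train_idx][-1], len(test_idx))
    errors.append(np.mean(np.abs(series[test_idx] - forecast)))

# Forecast accuracy is the average error over all test sets.
overall_mae = float(np.mean(errors))
```

In practice the naive forecast would be replaced by the fitted model, and the per-fold error by whichever accuracy measure is being reported.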
Various measures of forecast error exist, including the mean absolute error, root mean squared error, and mean absolute percentage error. When a large difference in scale exists in the data, using a scaled error measure is recommended; in particular, the mean absolute scaled error is recommended for comparing forecast accuracy across multiple time series (Hyndman and Koehler, 2006). Information criteria that can be used to measure the fit of Bayesian models include the widely applicable information criterion (WAIC) and leave-one-out cross-validation (LOO-CV); they are preferred to other criteria such as the Akaike information criterion (AIC) and the deviance information criterion (DIC) (Vehtari and Lampinen, 2002). For Bayesian models, where the estimation of parameters is based on sampled results, it is essential to check whether the chains have converged before comparing models. For this purpose, trace plots and numerical summaries such as the potential scale reduction factor, Rhat (Stan Development Team, 2017b), are used; an Rhat lower than 1.1 for each parameter is recommended.
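The mean absolute scaled error can be sketched as below: the forecast's mean absolute error is scaled by the in-sample mean absolute error of the (seasonal) naive benchmark, so a value below 1 means the forecast beats that benchmark. The toy numbers are illustrative, assuming the non-seasonal case (m = 1).

```python
import numpy as np

def mase(y_train, y_test, y_pred, m=1):
    """Mean absolute scaled error: test-set MAE divided by the in-sample
    MAE of the seasonal naive forecast with period m (m=1: last value)."""
    naive_mae = np.mean(np.abs(y_train[m:] - y_train[:-m]))
    return np.mean(np.abs(y_test - y_pred)) / naive_mae

# Toy example: a short training series and a two-step-ahead forecast.
y_train = np.array([3.0, 4.0, 6.0, 5.0, 7.0, 8.0])
y_test = np.array([9.0, 8.0])
y_pred = np.array([8.5, 8.5])

score = mase(y_train, y_test, y_pred)  # < 1: better than the naive benchmark
```

Because the scaling uses only in-sample naive errors, the measure is unit-free, which is what makes it suitable for comparing accuracy across series of very different scales.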