Model Covariates


We have identified country- and time-specific data from various sources to further inform our model predictions. These data, which we call covariates, supplement our individual-level dietary intake data, particularly in countries where such intake data are limited. Below is a summary of the process for identifying and incorporating covariates into our estimates.

Covariate Identification and Selection

A wide range of country-year-specific data were considered as potentially useful covariates for the Bayesian hierarchical prediction model, including food availability, nutrient availability, food sales data, economic indicators, and geographical data.

We consulted experts and conducted comprehensive searches of publicly available databases to identify covariates. We identified over 800 covariates, including 98 foods and beverages from the FAO food balance sheets (FBS) and 248 foods, beverages, and nutrients from the Harvard GENuS dataverse. Additional key data sources included the WHO Global Health Observatory data repository, UNICEF Databases, the World Bank World Development Indicators, and the Barro-Lee Educational Attainment Dataset. We prioritized a subset of covariates (over 400) for covariate testing based on hypothesized links to the GDD dietary factors.

Covariate Imputation

Missing data for these covariates were imputed in a two-step process. First, for any given covariate, if data were missing for some years within a given country, we used linear interpolation to fill in those years. For missing years not bracketed by data points (for example, 2014 and 2015 when data were available only up to 2013), we used a three-year moving-average forecast. Second, we imputed data for countries with no data for a covariate using the Amelia package in R (Multiple Imputation of Incomplete Multivariate Data) with the following model specifications: Number of imputed datasets = 1; Time series variable = year; Cross section variable = country; Polytime = squared effects; Bootstrap = none.
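The first imputation step can be illustrated with a minimal sketch. It is written in Python rather than R, and the function name fill_country_series, its arguments, and the handling of edge cases are assumptions made for illustration, not the code used for the estimates. In particular, the sketch assumes that each forecast year is the mean of the three preceding values (observed or already forecast) and that years before the first observation are left untouched. The second step (cross-country imputation with Amelia) is not reproduced here.

```python
from typing import Dict


def fill_country_series(observed: Dict[int, float], end_year: int) -> Dict[int, float]:
    """Fill gaps in one country's series for one covariate (hypothetical helper).

    observed maps year -> value for the years that have data.
    Returns a series running from the first observed year to end_year.
    """
    if not observed:
        return {}
    years = sorted(observed)
    filled = dict(observed)

    # Linear interpolation for missing years that lie between two data points.
    for lo, hi in zip(years, years[1:]):
        for year in range(lo + 1, hi):
            weight = (year - lo) / (hi - lo)
            filled[year] = observed[lo] + weight * (observed[hi] - observed[lo])

    # Three-year moving-average forecast for years after the last data point.
    # Assumption: earlier forecasts feed into later ones.
    for year in range(years[-1] + 1, end_year + 1):
        window = [filled[p] for p in range(year - 3, year) if p in filled]
        filled[year] = sum(window) / len(window)

    return dict(sorted(filled.items()))


series = fill_country_series({2010: 10.0, 2012: 14.0, 2013: 15.0}, end_year=2015)
```

In this toy series, 2011 is interpolated between 2010 and 2012, while 2014 and 2015 are forecast from the three years preceding each of them.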

To assess the validity of the imputations, we re-imputed observed (non-missing) values with the same model and visually compared observed versus imputed values using scatter plots.

Covariate Testing

We conducted principal component analysis (PCA) using the 'princomp' function in R separately for: 1) 23 grouped FAO food balance sheet (FBS) foods, beverages, and energy; 2) 142 GENuS foods and beverages; and 3) 19 GENuS nutrients and energy. The first four components from each PCA were considered for inclusion.
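As a rough illustration of what extracting the leading components involves, the following Python sketch computes the top principal components of a small data matrix. It is not the 'princomp' call used in the analysis: it swaps in power iteration with deflation in place of a full eigendecomposition, and the function names, the toy data, and the choice to work on the covariance matrix (rather than the correlation matrix) are assumptions. Like 'princomp', it divides by n rather than n - 1 when computing covariances.

```python
import math
import random


def covariance_matrix(rows):
    """Covariance matrix of the columns of rows, using divisor n."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / n for j in range(p)]
            for i in range(p)]


def top_components(rows, k=4, iters=1000, seed=0):
    """Return up to k (variance, loading vector) pairs, largest variance first.

    Uses power iteration, then removes each component found from the
    covariance matrix (deflation) before searching for the next one.
    """
    cov = covariance_matrix(rows)
    p = len(cov)
    rng = random.Random(seed)
    results = []
    for _ in range(min(k, p)):
        # Random unit-length starting vector.
        v = [rng.uniform(-1.0, 1.0) for _ in range(p)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):
            w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # no variance left to explain
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the variance explained along v.
        variance = sum(v[i] * cov[i][j] * v[j] for i in range(p) for j in range(p))
        results.append((variance, v))
        cov = [[cov[i][j] - variance * v[i] * v[j] for j in range(p)]
               for i in range(p)]
    return results


# Toy data: two perfectly collinear columns, so one component explains everything.
toy = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
components = top_components(toy, k=2)
```

With real covariate groups, rows would be country-years and columns the foods, beverages, or nutrients in that group, and k=4 would match the four components considered for inclusion.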

For each dietary factor, we calculated the correlations between covariates and original survey-level stratified mean dietary intakes, and we selected up to 10 covariates for model inclusion, favoring those with the highest correlations, a mix of food/nutrients and other covariates, and sensible links to the dietary factor.
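The correlation-based part of this screening can be sketched as follows. This is an illustrative Python example under stated assumptions: the function names and toy values are hypothetical, Pearson correlation is assumed (the correlation measure is not specified above), and covariates are ranked by absolute correlation. The judgment-based criteria (a mix of covariate types and sensible links to the dietary factor) are not automated here.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant series carries no correlation information
    return sxy / math.sqrt(sxx * syy)


def rank_covariates(intake, covariates, max_keep=10):
    """Return up to max_keep (name, correlation) pairs, strongest first."""
    scored = [(name, pearson(values, intake)) for name, values in covariates.items()]
    scored.sort(key=lambda item: abs(item[1]), reverse=True)
    return scored[:max_keep]


# Hypothetical survey-level mean intakes and three candidate covariates.
intake = [1.0, 2.0, 3.0, 4.0, 5.0]
candidates = {
    "covariate_a": [2.0, 4.0, 6.0, 8.0, 10.0],
    "covariate_c": [1.0, 3.0, 2.0, 5.0, 4.0],
    "covariate_d": [3.0, 3.0, 3.0, 3.0, 3.0],
}
ranked = rank_covariates(intake, candidates, max_keep=2)
```

The resulting shortlist would then be reviewed by hand against the other selection criteria before model inclusion.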

Five-fold cross-validation is ongoing to test different combinations of covariates. The data will be split into five segments: four make up the training dataset and the remaining segment serves as the testing dataset. This process will be repeated until each segment has served once as the testing data. The models' expected log predictive density (ELPD) will then be compared to assess which model has the best predictive performance.
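The splitting scheme can be sketched in a few lines of Python. This is a minimal illustration, not the project's code: the function names, the random shuffling, and the seed are assumptions, and the model fitting itself is omitted. The elpd helper simply sums held-out log predictive densities, which are assumed to come from a fitted model.

```python
import random


def five_fold_splits(n_obs, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs; each fold is the test set once."""
    indices = list(range(n_obs))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal segments
    for i, test_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield sorted(train_idx), sorted(test_idx)


def elpd(pointwise_log_densities):
    """Sum of log predictive densities over held-out observations."""
    return sum(pointwise_log_densities)


splits = list(five_fold_splits(23))
```

For each candidate covariate set, the model would be fit on every training split, the log predictive densities of the held-out observations collected across all five folds, and the resulting ELPD totals compared, with a higher ELPD indicating better predictive performance.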

Final List of Covariates

The final list of selected covariates for each dietary factor will be available in the near future.