The GDD Prediction Model
The GDD prediction model estimates mean intake of 54 dietary factors by country, year, age, sex, urbanicity, and education in 185 countries by synthesizing survey mean intake data and relevant covariates from a range of sources. For a complete list of these stratifiers, see the Scope of Our Current Data Collection.
The model is built on a Bayesian multilevel framework, whose properties are well suited to these aims. A summary of the model is provided below.
Description of the Model
Fundamentally, this is a Bayesian model of the log-means of intake with a nested hierarchical structure, assuming exchangeability between countries and superregions after accounting for covariates. To this structure, we add sex, urban/rural, education, and non-linear age effects (also within a nested hierarchical structure), survey- and country-level covariates, and overdispersion on study-level variance to account for non-sampling variation. The model borrows heavily from those presented in Finucane et al. (Lancet 2011) and Flaxman et al. (An Integrative Metaregression Framework for Descriptive Epidemiology, 2015).
Hierarchical Nature of the Data
The model uses a hierarchical structure in which countries are nested in superregions, which are nested in the globe. The model assumes that the superregion means are distributed log-normally around the global mean, and that country means are distributed log-normally around their respective superregion means.
The model uses the following six superregions:
- Former Soviet Union (FSU)
- High-Income Countries (HIC)
- Latin America and The Caribbean (LAC)
- Middle East and North Africa (MENA)
- Sub-Saharan Africa (SSA)
- South Asia (SAARC)
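The nested structure described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the GDD model itself: the function name `simulate_hierarchy`, the standard deviations, the global level of 100 units, and the number of countries per superregion are all hypothetical values chosen for illustration.

```python
import math
import random

SUPERREGIONS = ["FSU", "HIC", "LAC", "MENA", "SSA", "SAARC"]

def simulate_hierarchy(global_log_mean, sd_superregion, sd_country,
                       countries_per_superregion, seed=0):
    """Draw superregion and country intake levels from a nested log-normal hierarchy."""
    rng = random.Random(seed)
    result = {}
    for region in SUPERREGIONS:
        # Superregion log-mean is normal around the global log-mean,
        # so the superregion level is log-normal on the natural scale.
        region_log_mean = rng.gauss(global_log_mean, sd_superregion)
        # Country log-means are normal around their superregion log-mean.
        # A single sd_country is shared by all superregions in this sketch.
        country_levels = [
            math.exp(rng.gauss(region_log_mean, sd_country))
            for _ in range(countries_per_superregion)
        ]
        result[region] = {
            "superregion_level": math.exp(region_log_mean),
            "country_levels": country_levels,
        }
    return result

# Hypothetical inputs: global level of 100 units, illustrative spreads.
sim = simulate_hierarchy(math.log(100.0), 0.3, 0.15, 4)
```

Because the draws are made on the log scale and then exponentiated, every simulated intake level is positive, which is one practical reason to model log-means.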
Intercept, Age Trend, Sex Differences, Education Differences, and Urban/Rural Differences
We fit a multilevel model with 3 levels (countries nested in superregions nested in the globe) for intercepts and sex differences, and 2 levels (superregions nested in the globe) for age pattern, education differences, and urban/rural differences. For many surveys, intake is not linearly associated with age. We therefore model age using cubic splines with knots at ages 5, 20, 50, and 65, using both regional and global effects. The model assumes that between-country variance is the same across all superregions, and that education differences, urban/rural differences, and age patterns are the same for all countries within a superregion.
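To make the age term concrete, the sketch below builds a cubic spline basis with the knots named above. The source does not say which spline basis is used, so the truncated power basis here is an assumption, as are the function names `cubic_spline_basis` and `age_effect` and the way global coefficients and superregion deviations are simply added.

```python
KNOTS = (5.0, 20.0, 50.0, 65.0)

def cubic_spline_basis(age, knots=KNOTS):
    """Return truncated-power cubic spline basis values at a given age."""
    polynomial = [age, age ** 2, age ** 3]
    # Each knot contributes a term that is zero below the knot and cubic above it.
    truncated = [max(age - k, 0.0) ** 3 for k in knots]
    return polynomial + truncated

def age_effect(age, global_coefs, region_devs, knots=KNOTS):
    """Log-scale age effect: global coefficients plus superregion deviations."""
    basis = cubic_spline_basis(age, knots)
    return sum((g + r) * b for g, r, b in zip(global_coefs, region_devs, basis))

# Basis at age 30: three polynomial terms, then one term per knot.
basis_30 = cubic_spline_basis(30.0)
```

At age 30 only the knots at 5 and 20 are active, so the last two basis values are zero. The curve can bend at each knot while remaining smooth across it.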
We incorporate both country- and year-specific covariate data to further inform our estimates. More information on the selection of covariates used for each dietary factor is available separately.
An additional variance component is added to each study to allow the model to account for non-sampling variation due to survey-level error. Sources of this non-sampling variation include surveys not being nationally representative, surveys not being fully stratified by sex, urban/rural, or education, and surveys using large age groups (greater than 10 years). We also add a constraint to ensure local surveys are considered more variable than regional surveys.
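One common way to express such an extra variance component is to add it to the sampling variance on the log scale. The sketch below assumes that additive form and assumes the ordering constraint is imposed by construction. The source confirms neither choice, and the names `total_sd` and `ordered_extra_sds` are hypothetical.

```python
import math

def total_sd(sampling_se, extra_sd):
    """Combine sampling and non-sampling variation on the log scale."""
    # Variances add when the two error sources are assumed independent.
    return math.sqrt(sampling_se ** 2 + extra_sd ** 2)

def ordered_extra_sds(regional_extra_sd, local_increment):
    """Return (regional, local) extra SDs with local >= regional by construction."""
    if regional_extra_sd < 0 or local_increment < 0:
        raise ValueError("standard deviations and increments must be non-negative")
    return regional_extra_sd, regional_extra_sd + local_increment

# Illustrative values only: a survey with sampling SE 0.3 and extra SD 0.4.
combined = total_sd(0.3, 0.4)
regional_sd, local_sd = ordered_extra_sds(0.10, 0.05)
```

Writing the local-survey term as the regional term plus a non-negative increment guarantees the ordering without a separate check during fitting.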
Computation and Predictions
We fit the model in Stan using the No-U-Turn Sampler (NUTS), a variant of Hamiltonian Monte Carlo. We use 4 chains of 2000 iterations each, treating the first 1000 iterations of each chain as warm-up, for a total of 4000 Monte Carlo iterations to define our posterior distributions. The model described above is ultimately used to provide predictive distributions of mean intake for each dietary factor by country-year and subgroup. The outputs presented on this website are the medians and the 2.5th and 97.5th percentiles of these distributions.
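The arithmetic behind the 4000 retained draws, and the reported summaries, can be sketched as follows. The draws here are simulated stand-ins rather than output from the fitted model, and the helper names `retained_draws` and `summarize` are hypothetical.

```python
import random
import statistics

N_CHAINS = 4
N_ITER = 2000
N_WARMUP = 1000

def retained_draws(n_chains=N_CHAINS, n_iter=N_ITER, n_warmup=N_WARMUP):
    """Number of post-warm-up draws pooled across chains."""
    return n_chains * (n_iter - n_warmup)

def summarize(draws):
    """Return the median and the 2.5th and 97.5th percentiles of the draws."""
    # With n=40, the cut points fall at 2.5%, 5%, ..., 97.5%.
    cuts = statistics.quantiles(draws, n=40, method="inclusive")
    return {"median": statistics.median(draws), "lower": cuts[0], "upper": cuts[-1]}

# Stand-in for a predictive distribution of mean intake (illustrative values).
rng = random.Random(1)
fake_draws = [rng.lognormvariate(4.6, 0.2) for _ in range(retained_draws())]
summary = summarize(fake_draws)
```

The interval between `lower` and `upper` covers the central 95% of the draws, matching the percentiles reported on the site.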
For GDD 2010, the project’s previous iteration, model validity across different iterations was evaluated using cross-validation, randomly omitting 10% of the raw data and comparing the imputed intakes with the original raw data.
For GDD 2018, we completed 5-fold cross-validation across all dietary factors to formally test the predictive ability of the GDD 2018 prediction model. In this established validation method, the raw data inputs are randomly divided into five groups (“folds”). In each round of testing, one group is dropped, the model is fitted to the remaining data, and the resulting predictions are compared with the omitted raw data. This process is repeated for each of the remaining four groups, and an overall statistical estimate of model fit is generated.
In addition to cross-validation, we assessed global heat maps of national mean intakes and the frequency of implausible estimates.
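The 5-fold procedure can be sketched in a few lines. This is a minimal illustration, not the GDD validation code: the stand-in predictor is simply the mean of the training data, root-mean-squared error is an assumed choice of fit statistic (the source does not name one), and the data values are made up.

```python
import random

def make_folds(n_obs, n_folds=5, seed=0):
    """Randomly split observation indices into n_folds groups."""
    indices = list(range(n_obs))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]

def mean_predictor(train_values):
    """Stand-in model: predict the training mean for every held-out point."""
    return sum(train_values) / len(train_values)

def cross_validate(values, fit_predict=mean_predictor, n_folds=5, seed=0):
    """Return the overall RMSE across all held-out observations."""
    squared_errors = []
    for held_out in make_folds(len(values), n_folds, seed):
        held = set(held_out)
        # Fit on everything except the dropped fold.
        train = [v for i, v in enumerate(values) if i not in held]
        prediction = fit_predict(train)
        # Compare predictions with the omitted raw data.
        squared_errors.extend((prediction - values[i]) ** 2 for i in held_out)
    return (sum(squared_errors) / len(squared_errors)) ** 0.5

# Illustrative survey means (hypothetical values).
data = [92.0, 105.0, 110.0, 87.0, 99.0, 120.0, 101.0, 95.0, 108.0, 90.0]
rmse = cross_validate(data)
```

Every observation is held out exactly once, so the final statistic reflects out-of-sample predictions for the full dataset.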