Summary of Methods and Data Collection
- To estimate individual food and nutrient intakes worldwide by country, year, age across the lifespan, sex, education level, urban or rural residence, and pregnancy/nursing status.
- To use generated estimates to understand disease burdens, identify high-impact interventions, and evaluate and improve public health and nutrition policies.
- To create a public resource and dissemination platform for sharing intake estimates with all stakeholders in the global nutrition community.
- To harmonize and publicly disseminate individual-level dietary data (24hr recall or food records) at the finest level.
List of surveys
Data collection for the Global Dietary Database has spanned roughly a decade. The first iteration of GDD, termed GDD 2010, modeled global diets through the year 2010. The current iteration of the project, termed GDD 2018, models global diets through 2018. Relevant data and covariates (e.g., from published reports, private Corresponding Members, FAO, and the Demographic and Health Surveys, DHS) are presently available only up to 2018, so estimates of dietary intake beyond 2018 would be wholly imputed and would require strong assumptions about continuing trends.
As of July 2021, we have identified and retrieved 1,634 eligible survey-years of data from public and private sources. Of these, 1,240 have been checked, standardized, and approved for GDD 2018 model inclusion. The list of these surveys—including data on country, time, representativeness, assessment method, and sample demographics for each—is available on this page.
If nationally representative surveys were not available for a country, we also considered national surveys without representative sampling; followed by regional, urban, or rural surveys; and finally large local cohorts, provided that selection and measurement bias were not apparent limitations. Surveys were included in the initial screening phase if the survey was reasonably population-based and representative, exposure data were reported or could be plausibly obtained, and the sample included at least 100 individuals. For countries with no surveys identified, other sources of potential data were considered, including the WHO Infobase, the STEP Database, and household budget survey data.
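The screening and fallback order described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Survey` fields, the scope labels, and the rule of keeping only the highest-priority tier available per country are hypothetical simplifications, and the judgment-based criteria (population-based sampling, absence of apparent bias) are not encoded.

```python
from dataclasses import dataclass

# Fallback order paraphrased from the text; the label names are hypothetical.
PRIORITY = {
    "national_representative": 0,
    "national_nonrepresentative": 1,
    "subnational": 2,   # regional, urban, or rural surveys
    "local_cohort": 3,  # large local cohorts
}

MIN_SAMPLE_SIZE = 100  # from the text: at least 100 individuals


@dataclass
class Survey:
    country: str
    scope: str               # one of the PRIORITY keys
    sample_size: int
    has_exposure_data: bool  # reported, or plausibly obtainable


def passes_screening(survey: Survey) -> bool:
    """Apply the two mechanically checkable screening criteria."""
    return survey.has_exposure_data and survey.sample_size >= MIN_SAMPLE_SIZE


def best_available(surveys: list[Survey]) -> dict[str, list[Survey]]:
    """For each country, keep the screened surveys from the best tier present."""
    by_country: dict[str, list[Survey]] = {}
    for survey in surveys:
        if passes_screening(survey):
            by_country.setdefault(survey.country, []).append(survey)
    chosen = {}
    for country, items in by_country.items():
        top = min(PRIORITY[s.scope] for s in items)
        chosen[country] = [s for s in items if PRIORITY[s.scope] == top]
    return chosen
```

Countries that do not appear in the result are the ones for which the alternative sources named above (for example, household budget survey data) would be considered.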
Most identified data were either privately held or not in a format appropriate for our modeling. We thus relied almost entirely on direct contact with authors in each country to provide us with exposure data. Where available, we accessed and downloaded public datasets, from which dietary exposure data were extracted, standardized, reviewed, and cleaned by GDD team members.
Dietary Factors Covered in the GDD and Coding Methods
At its inception more than a decade ago, the GDD 2010 focused on dietary factors with confirmed or probable etiologic effects on major chronic diseases including cardiovascular diseases, diabetes, and cancer (22 dietary factors identified and included in total). GDD has since expanded in scope to characterize total diet in populations around the world, and now contains data on 55 dietary factors including 14 foods, 7 beverages, 15 macronutrients, 19 micronutrients, and 2 indices of carbohydrate quality, although the scarcity of data sources will limit global modeling for a small handful of these factors. Identification and definition of an additional 4 dietary factors to be incorporated into future iterations of GDD have also been finalized. See this page for the list of present dietary factors.
The availability of individual food vs. food group data
In the overall global model, the GDD determines and characterizes foods by group (e.g., fruits). Doing so maximizes the validity and accuracy of our prediction model and better facilitates comparisons of diet across countries. Individual food intake data are also available in the GDD, including:
1. Original survey microdata
Through correspondence with our private data owners, we have established data sharing agreements to publicly host their individual-level, non-modeled “microdata” on the GDD site. These microdata often contain dietary intake data for individual foods, which can vary widely between surveys.
2. Newly-standardized microdata using harmonization methods
See below for information regarding GDD efforts, coordinated and harmonized with GIFT, to standardize and disseminate intake data for individual foods across surveys.
Coding Method Used for Food Categorization
Categorizing foods as GDD variables
All dietary factors in the GDD have “optimal” and “suboptimal” definitions to maximize the comparability of dietary intake across surveys and countries. Each individual survey undergoes a rigorous coding protocol to ensure all primary data are captured and incorporated into GDD as accurately as possible. The dictionary of optimal and suboptimal definitions for GDD dietary factors is available on this page.
A major challenge in assessing dietary intake is the variation of descriptions of individual, self-reported food items, which can ultimately lead to assessment errors. To address this issue, we have developed methods and partnerships to apply FoodEx2—a sophisticated food description and classification system developed by the European Food Safety Authority (EFSA)—to surveys within GDD. This work helps to standardize global dietary intake beyond the dietary variables currently collected by GDD.
The GDD has signed MOUs and is formally collaborating with EFSA and FAO/WHO GIFT, as well as relevant data owners, in our work on FoodEx2 coding. The adaptation of the FoodEx2 system is a major advancement for the collection and storage of individual-level dietary intake data. With tools like FoodEx2, GDD more precisely informs global nutrition intakes, disease burdens, interventions, and policies.
Specifications of Models Used to Generate Estimates of Individual Intake of Foods or Nutrients at the National Level
The GDD prediction model estimates mean intakes of 52 dietary factors, jointly stratified by country, year, age, sex, urbanicity, education, and pregnancy status, in 185 countries by synthesizing survey mean intake data and relevant covariates from a range of sources. The Bayesian multilevel framework has some advantageous properties that are appealing for our aims. A summary of the model is provided on this page.
The prediction model is further informed by a broad range of covariate data from various global sources. These include:
- FAO food balance sheets data, 1980-2018
- Harvard Global Expanded Nutrient Supply (GENuS) data, 1980-2013
- Principal component analysis of FAO and GENuS data, 1980-2013
- Industry sales data on fat consumption, 1998-2015
- World Bank GDP, 1980-2015
- Precipitation, 1982-2014
- Unemployment rate, 1991-2015
- Land area
- Education years, 1980-2010
- Gini coefficient, 1980-2015
- Coastline ratio
- Poverty rate, 1991-2015
Indicators or Types of Modelled Estimates That Can be Generated from GDD
GDD 2018 estimates mean levels of dietary intake for 52 dietary factors across the global population, jointly stratified by the following characteristics:
1. Year
- 1990 through 2018
2. Country
- 185 countries
3. Age groups
- 0-1, 1-2, 2-5, 5-9, 10-14, 15-19, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59, 60-64, 65-69, 70-74, 75-79, 80-84, 85-89, 90-94, 95+ years
4. Sex
- Male and female
5. Residence
- Urban and rural (as defined by each survey's characteristics)
6. Education level
- Low (0-6 years formal education), medium (>6 to 12 years), and high (>12 years)
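To make the joint stratification concrete, the short sketch below enumerates the demographic cells implied by the age, sex, residence, and education categories listed above for a single country and year. The variable names are illustrative only; the category labels are copied from the list.

```python
from itertools import product

AGE_GROUPS = [
    "0-1", "1-2", "2-5", "5-9", "10-14", "15-19", "20-24", "25-29",
    "30-34", "35-39", "40-44", "45-49", "50-54", "55-59", "60-64",
    "65-69", "70-74", "75-79", "80-84", "85-89", "90-94", "95+",
]
SEXES = ["male", "female"]
RESIDENCE = ["urban", "rural"]
EDUCATION = ["low", "medium", "high"]

# One cell per combination of age group, sex, residence, and education level.
strata = list(product(AGE_GROUPS, SEXES, RESIDENCE, EDUCATION))

print(len(strata))  # 22 * 2 * 2 * 3 = 264 cells per country-year
```

Each of these 264 cells is then repeated for every country and year covered by the model.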
GDD modelled estimates can also be used to generate different dietary patterns and metrics, such as healthy and unhealthy diet global patterns as well as other common metrics (with some modifications) such as the Healthy Eating Index (HEI), Alternative Healthy Eating Index (AHEI), Mediterranean Diet Score (MDS), Dietary Approaches to Stop Hypertension score (DASH), Infant and Young Child Minimum Dietary Diversity (IYCMDD), and Minimum Dietary Diversity for Women (MDDW).
How to Access the Raw Data in GDD 2018
A key aim of the GDD is to become an archive of all the surveys which serve as our model inputs. Such a functionality is helpful in connecting researchers with one another and with important data on the populations they study. This functionality relies on significant collaboration from the original data owners. We have collected signed data sharing agreements (DSAs) with survey owners to ensure the safe and responsible sharing of their raw data (microdata). Datasets for all surveys with approved DSAs, and general survey information about all surveys used for GDD input data, with or without DSAs, are available for download on this page. For publicly available datasets, links to the site from which the dataset originates are listed in lieu of data files, which generally cannot be directly shared based on the original data sharing requirements of the owners. The map on the page can be filtered by survey characteristics (assessment method, subject age range, representativeness, etc.) to isolate countries and surveys of interest.
Plans for Inclusion of Additional Surveys to the GDD
To allow completion of the findings, the incorporation of surveys into the GDD 2018 modeling phase has closed. Any additional surveys that are identified will be reserved for a future round of funding and efforts (e.g., GDD 2020), unless they are so large or important as to warrant inclusion now. In addition, because household-level surveys are less reliable than individual-level ones, we have deferred adding any further household-level surveys to the current model (the model currently includes some household-level surveys from large countries, such as Russia, for which individual-level data are not available).
Validation of Actual vs. Modelled Intakes for Different Population Groups
For GDD 2010, the project’s first iteration, model validity across different iterations was evaluated using cross-validation, randomly omitting 10% of the raw data and comparing the imputed intakes with the original raw data. For GDD 2018, work has been completed for 5-fold model cross-validation across all dietary factors to formally test the predictive ability of the GDD 2018 prediction model. In this established model validation method, the model raw data inputs are randomly divided into five groups (“folds”). In each round of testing, one group is dropped, the model is fit to the remaining data, and the ensuing predictions are then compared to the omitted raw data. This process is then repeated four more times for the remaining four groups, and an overall statistical estimate of model fit is generated. More information about the GDD 2018 Prediction Model and its validation can be found on this page.
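The 5-fold procedure can be illustrated with a minimal, generic sketch. It is not the GDD implementation: the `fit` and `predict` callables stand in for the actual Bayesian model, the mean-only predictor is a deliberately trivial placeholder, and root-mean-square error is an assumed choice of fit statistic, since the text does not name one.

```python
import random


def k_fold_indices(n, k=5, seed=0):
    """Randomly split the indices 0..n-1 into k roughly equal folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(observations, fit, predict, k=5, seed=0):
    """Return the RMSE of held-out predictions over all k folds.

    observations: list of (features, observed_intake) pairs.
    fit(train) -> model; predict(model, features) -> predicted intake.
    """
    squared_errors = []
    for held_out in k_fold_indices(len(observations), k, seed):
        held = set(held_out)
        # Drop one fold and fit the model to the remaining data.
        train = [obs for i, obs in enumerate(observations) if i not in held]
        model = fit(train)
        # Compare predictions with the omitted raw data.
        for i in held_out:
            features, observed = observations[i]
            squared_errors.append((predict(model, features) - observed) ** 2)
    return (sum(squared_errors) / len(squared_errors)) ** 0.5


# Placeholder model: always predicts the mean intake of the training data.
def fit_mean(train):
    return sum(y for _, y in train) / len(train)


def predict_mean(model, features):
    return model
```

Because every observation is held out exactly once, the resulting error summarizes how well the model predicts data it has not seen.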
Validation of Dietary Patterns and Food Group Intakes vs. Adequacy of Nutrient Intakes
Currently, the GDD prediction model is designed to estimate mean dietary intake across subgroups by country-year, synthesizing information from various types of surveys. We also plan to expand our work to estimate the usual intake distribution—that is, to estimate usual dietary intake from just the means and standard deviations of intake provided to us by contributing members.
The general philosophy of our approach is to account for systematic bias arising from the assessment methods of the surveys used in our prediction model, in order to obtain full usual dietary intake distributions. By fitting a separate linear regression of standard deviation on mean to the dietary data, we can predict standard deviations from mean estimates for each stratum and country-year, thereby generating joint predictive distributions of intake.
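A mean-to-standard-deviation regression of this kind can be sketched with ordinary least squares. The sketch assumes a simple straight-line fit on the raw scale; the source does not say whether the actual model uses a transformation or additional terms, and the function names are hypothetical.

```python
def fit_mean_to_sd(means, sds):
    """Fit sd = intercept + slope * mean by ordinary least squares."""
    n = len(means)
    mean_x = sum(means) / n
    mean_y = sum(sds) / n
    sxx = sum((x - mean_x) ** 2 for x in means)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(means, sds))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict_sd(model, mean_intake):
    """Predict the standard deviation for a stratum from its mean intake."""
    intercept, slope = model
    return intercept + slope * mean_intake
```

In use, the model would be fit to the survey-reported pairs of mean and standard deviation, and `predict_sd` would then be applied to each modeled stratum mean to attach a spread to it.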