What is a Data Dictionary?

September 24, 2019

When you transition to mobile data collection, you are presented with an opportunity to monitor an increasingly large and complex set of variables. This may increase not only the overall number of indicators but their interconnectedness and your ability to manipulate them as well. This means your entire dataset can get awfully complicated awfully quickly.

One solution to this problem is a data dictionary. A data dictionary defines and describes each variable in a data set, giving you a clear understanding of the indicators you’re working with. Aside from a general definition of the variable, its description can include information like allowable inputs (e.g. numerical ranges, multiple-choice options, etc.) and the workflows the variable is a part of.

Especially when your monitoring and evaluation team has multiple members, sometimes in different locations, a data dictionary helps to ensure everyone is on the same page and can often make analysis much easier.

A Data Dictionary in Action

One of our partners is working to design and evaluate how a mobile tool can help improve symptom control and information exchange among specialists and local health workers treating late-stage cancer patients in Tanzania.

In this project, we created a data dictionary using tools available on the CommCare platform and some manual customization. The goal was to ensure that the epidemiologist and other data analysts on our study team would be able to understand and interpret the data coming out of multiple forms collected at various time points over a period of one year.

This data dictionary included demographic variables collected from all study participants, from the patient to the health worker and specialist, as well as patient satisfaction and beliefs and more technical clinical treatment variables. Each of the 80 variables in the data dictionary included information such as the source and the “timepoint” at which the data would be collected (e.g. enrollment, 6-week follow-up, end of study, etc.), and each was mapped to its question ID and possible choice values.
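As a rough illustration, one entry in a dictionary like this could be modeled as a small record. The Python sketch below is not the study's actual format: the field names and the example variable (`pain_score`, its question ID, and its choice values) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DictionaryEntry:
    """One row of a data dictionary (field names are illustrative)."""
    variable: str            # name analysts use for the variable
    source: str              # who the data comes from
    timepoints: list[str]    # when the variable is collected
    question_id: str         # ID of the form question behind the variable
    choices: dict[str, str] = field(default_factory=dict)  # choice value -> label


# Hypothetical example entry; not taken from the real study.
pain_score = DictionaryEntry(
    variable="pain_score",
    source="patient",
    timepoints=["enrollment", "6-week follow-up", "end of study"],
    question_id="symptoms/pain_score",
    choices={"0": "None", "1": "Mild", "2": "Moderate", "3": "Severe"},
)

# Look up the label behind a stored choice value.
print(pain_score.choices["2"])
```

Keeping the choice values next to their labels means an analyst never has to guess what a stored "2" means.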

Organizing the data into a data dictionary can help the study team achieve the goals laid out in the data analysis plan. For example, it gives the analyst an easy way to identify and pull the demographic variables (age, gender, household income, etc.) that may be used to characterize the study sample. The data dictionary can also be used to identify exactly which variables should be compared across time points (baseline, 6-week, and end-of-study) to determine any changes in symptom response over time.
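Both lookups can be sketched in a few lines of Python. This is a minimal example that assumes the dictionary is stored as a list of rows. The variable names, categories, and helper functions are hypothetical.

```python
# Hypothetical dictionary rows; names and categories are illustrative only.
data_dictionary = [
    {"variable": "age", "category": "demographic", "timepoints": ["baseline"]},
    {"variable": "gender", "category": "demographic", "timepoints": ["baseline"]},
    {"variable": "pain_score", "category": "clinical",
     "timepoints": ["baseline", "6-week", "end-of-study"]},
    {"variable": "satisfaction", "category": "beliefs",
     "timepoints": ["6-week", "end-of-study"]},
]

ALL_TIMEPOINTS = {"baseline", "6-week", "end-of-study"}


def variables_in_category(rows, category):
    """Return the names of all variables in one category."""
    return [row["variable"] for row in rows if row["category"] == category]


def variables_at_every_timepoint(rows, timepoints=ALL_TIMEPOINTS):
    """Return variables collected at every listed time point,
    i.e. those that can be compared across the whole study."""
    return [row["variable"] for row in rows
            if set(timepoints) <= set(row["timepoints"])]


print(variables_in_category(data_dictionary, "demographic"))  # ['age', 'gender']
print(variables_at_every_timepoint(data_dictionary))          # ['pain_score']
```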

All this information can help the research team in a number of different ways:

  1. Demographic information helps to control for any differences in subjects when running analyses.
  2. The technical clinical information and the subjects’ impressions and beliefs help the team understand the effects of the intervention, as well as possible correlations with other variables to explore.
  3. The 6-week surveys will also help the team optimize the experience for participants beyond the direct effects on health outcomes.

To examine their data dictionary a bit deeper, click here.

Make Your Own Data Dictionary

There is no set formula or template for a data dictionary, as the purpose of the document should be informed by the structure and objectives of the project. However, a few steps are worth taking in every project to make sure you don’t leave anything out.

Make a list

The obvious step: Make a list of all the variables you capture. Review your surveys and existing datasets or export them from your forms. Don’t start describing the variables yet, but do include information such as the variable name, its associated case property, and its source (e.g. survey, dataset, etc.).
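If your form definitions are available in a structured form, the first pass at this list can be scripted. The Python sketch below makes several assumptions: the `forms` mapping is a hypothetical stand-in for whatever your export looks like, and the case property is assumed to share the question ID, which will not always be true.

```python
import csv
import io

# Hypothetical form definitions: form name -> question IDs.
forms = {
    "enrollment": ["dob", "gender", "household_income"],
    "follow_up": ["pain_score", "gender"],
}


def build_inventory(forms):
    """List each variable once, recording every form it appears in."""
    inventory = {}
    for form_name, question_ids in forms.items():
        for question_id in question_ids:
            row = inventory.setdefault(question_id, {
                "variable": question_id,
                "case_property": question_id,  # assumption: same as question ID
                "sources": [],
            })
            row["sources"].append(form_name)
    return list(inventory.values())


def to_csv(rows):
    """Render the inventory as CSV text, ready to open in a spreadsheet."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["variable", "case_property", "sources"])
    writer.writeheader()
    for row in rows:
        writer.writerow({**row, "sources": "; ".join(row["sources"])})
    return buffer.getvalue()


inventory = build_inventory(forms)
print(to_csv(inventory))
```

The output is a bare list with no definitions yet, which matches this step: descriptions come later.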

Define your variables

Many of your variables will be self-descriptive (e.g. year of birth, disease type, etc.). When they aren’t, you should write a short definition for each of the variables, which can be either a brief description or a formula based on the variable’s calculation. You might also include example inputs for these variables to ensure your team’s understanding. This will serve as a quick reference for the team to understand what the variable is for.
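A small check can show which variables still need a definition. The Python sketch below is one possible layout, not a required one: the `self_descriptive` flag, the field names, and the example variables are all hypothetical.

```python
# Hypothetical entries; "self_descriptive" marks variables that need no definition.
entries = [
    {"variable": "year_of_birth", "self_descriptive": True},
    {"variable": "bmi", "self_descriptive": False,
     "definition": "Body mass index: weight_kg / height_m ** 2",
     "examples": [18.5, 24.9]},
    {"variable": "risk_flag", "self_descriptive": False},
]


def missing_definitions(entries):
    """Return variables that need a definition but do not have one yet."""
    return [entry["variable"] for entry in entries
            if not entry["self_descriptive"] and not entry.get("definition")]


print(missing_definitions(entries))  # ['risk_flag']
```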

Describe your variables

Include any relevant notes in your data dictionary to describe things like when in the process your variables will be collected, who’s responsible for collecting them, and what category they fit in. For example, a patient’s date of birth is collected during their enrollment (the “when”) by a health worker (the “who”) and falls in the “sociodemographic” category (the “what”).
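Once the “when”, “who”, and “what” are recorded as fields, the dictionary can be regrouped along any of them. The Python sketch below assumes a simple list of rows. Only the date-of-birth row follows the example above, and the other rows and the field names are hypothetical.

```python
from collections import defaultdict

# Hypothetical notes; only date_of_birth follows the example in the text.
notes = [
    {"variable": "date_of_birth", "when": "enrollment",
     "who": "health worker", "category": "sociodemographic"},
    {"variable": "gender", "when": "enrollment",
     "who": "health worker", "category": "sociodemographic"},
    {"variable": "pain_score", "when": "6-week follow-up",
     "who": "specialist", "category": "clinical"},
]


def group_by(rows, key):
    """Group variable names by one descriptive field (when, who, or category)."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[key]].append(row["variable"])
    return dict(grouped)


print(group_by(notes, "category"))
print(group_by(notes, "who"))
```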

Map your variables

One of the key reasons for a data dictionary is to understand when and where you are using your variables in the app. Include the available multiple-choice responses, explain in which forms the variable is included, and outline any dependencies or requirements associated with the variable, such as whether it is part of a question that uses display conditions. 
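A mapping like this can also be used to check incoming data. The Python sketch below validates a submitted value against the allowable choices. The variable, its choice values, and the display condition text are all hypothetical, and the condition is stored only as a note, not evaluated.

```python
# Hypothetical mapping for one multiple-choice variable.
mapping = {
    "treatment_type": {
        "forms": ["enrollment", "follow_up"],
        "choices": {"1": "Chemotherapy", "2": "Radiotherapy", "3": "Palliative care"},
        "display_condition": "diagnosis_confirmed = 'yes'",  # note only, not evaluated
    },
}


def check_value(mapping, variable, value):
    """Return None if the value is allowed, otherwise a short error message."""
    entry = mapping.get(variable)
    if entry is None:
        return f"{variable}: not in the data dictionary"
    if value not in entry["choices"]:
        allowed = ", ".join(sorted(entry["choices"]))
        return f"{variable}: {value!r} is not one of {allowed}"
    return None


print(check_value(mapping, "treatment_type", "2"))  # None
print(check_value(mapping, "treatment_type", "9"))
```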

Maintain your dictionary

Many of your variables and their attributes should remain the same throughout your program, but there may be reasons to add new variables or new descriptions over time. Some sources may allow you to integrate an auto-update feature with your data dictionary, but set up reviews at regular intervals in case you need to make the changes yourself. When these changes do occur, be sure to communicate them to your team ahead of time, as they can often affect the team’s analyses.
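For a manual review, it helps to compare the current dictionary with the previous version so you know exactly what to tell the team. The Python sketch below assumes each version is a simple mapping of variable name to description. The function and the example versions are hypothetical.

```python
def diff_dictionaries(old, new):
    """Compare two versions (variable name -> description) and list the changes."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    changed = sorted(name for name in set(old) & set(new)
                     if old[name] != new[name])
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical versions of a small dictionary.
version_1 = {
    "age": "Age in years at enrollment",
    "pain_score": "Self-reported pain, 0-3",
    "clinic_code": "Legacy clinic identifier",
}
version_2 = {
    "age": "Age in years at enrollment",
    "pain_score": "Self-reported pain, 0-10",
    "household_income": "Monthly household income",
}

print(diff_dictionaries(version_1, version_2))
```

The resulting lists of added, removed, and changed variables can go straight into the note you send analysts before the change takes effect.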

Get Started

Now that you know what a data dictionary is and the basics of how to set one up, look back at your project objectives to see how it might help your program. Do teams in different places analyze the results of your work? Will you be revisiting your program over time for further evaluations? Do stakeholders often require updates to your surveys? A data dictionary can be useful in many different situations, so take the time to work out the specific purpose yours will serve. That purpose will guide how you define and describe the variables tracked by your mobile data collection program.

Written by
Dimagi

