The concept of developing ‘synthetic’ microdata populations has been around for close to twenty years. Synthetic microdata populations are created via a combination of aggregated data and Public Use Microdata Samples (PUMS) — both from the U.S. Census Bureau’s American Community Survey (ACS) — to generate a complete set of households and persons for a particular area. In RTI’s 2005-2009 U.S. Synthetic Population, we locate each synthetic household across the landscape to match high-resolution population distributions while also putting the correct mix of households in each census block group by householder age, householder race, household size, and household income to match census characteristics. So these synthetic populations are explicitly geospatial.
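To make the idea of matching a mix of households to census characteristics concrete, here is a minimal sketch of iterative proportional fitting (IPF), one common technique for this kind of problem. The text above does not say which fitting method was actually used, so treat IPF as an assumed stand-in. The seed table, the category labels, and the target totals are all invented for illustration.

```python
def ipf(seed, row_targets, col_targets, iterations=50, tol=1e-9):
    """Rescale a seed table until its row and column sums match the targets."""
    table = [row[:] for row in seed]  # work on a copy of the seed
    for _ in range(iterations):
        # Scale each row so it sums to its row target.
        for i, target in enumerate(row_targets):
            total = sum(table[i])
            if total > 0:
                table[i] = [value * target / total for value in table[i]]
        # Scale each column so it sums to its column target.
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in table)
            if total > 0:
                for row in table:
                    row[j] *= target / total
        # Columns are now exact; stop once the rows agree as well.
        if all(abs(sum(table[i]) - t) < tol for i, t in enumerate(row_targets)):
            break
    return table

# Hypothetical sample counts (not real PUMS data):
# rows = household size (1, 2, 3+), columns = income (low, middle, high).
seed = [[20.0, 10.0, 5.0],
        [15.0, 25.0, 10.0],
        [10.0, 20.0, 15.0]]
size_targets = [300.0, 450.0, 250.0]    # hypothetical block group totals by size
income_targets = [350.0, 400.0, 250.0]  # hypothetical block group totals by income

fitted = ipf(seed, size_targets, income_targets)
for row in fitted:
    print([round(value, 1) for value in row])
```

The fitted table keeps the joint pattern of the sample while reproducing the aggregated totals. A real workflow would then draw whole households from the sample according to the fitted counts and place them on the landscape.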
We’ve been working on the development of these synthetic populations as part of the Models of Infectious Disease Agent Study (MIDAS, www.epimodels.org) for several years. In that time, we’ve built two complete U.S. synthetic population databases: one based on 2000 decennial census data and the second based on the 2005-2009 ACS. The cool thing about synthetic populations is that they provide an accurate representation of a complete population at the household and person level (i.e., NOT aggregated to some level of geography like census block group or census tract) AND they are completely de-identified so they can be used without disclosure issues.
One of the primary uses of synthetic populations is in the field of Agent-Based Modeling (ABM). In these stochastic simulations, each person (agent) is given a set of characteristics and behaviors, and at each time-step of the simulation, each person interacts with his/her environment and other persons. ABM simulations of infectious diseases such as influenza are really interesting because they allow the modeler to test various interventions and mitigation scenarios (such as closing schools or deploying vaccines of various efficacies) and see which intervention or combination of interventions is most likely to slow or diminish future outbreaks.
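As a rough illustration of how such a simulation works, here is a toy stochastic model in which each agent carries a household ID, an age, and a disease state, and a school closure is switched on or off. It is a sketch under invented assumptions, not any MIDAS model: the function `simulate`, the transmission probabilities, the infectious period, and the household list are all hypothetical.

```python
import random

def simulate(households, days=60, schools_open=True, seed=1,
             p_household=0.2, p_school=0.02, p_community=0.001,
             infectious_days=4):
    """Toy susceptible/infectious/recovered model; returns final state counts."""
    rng = random.Random(seed)
    # One agent per person, tagged with household and age.
    people = [{"hh": hh_id, "age": age, "state": "S", "days_left": 0}
              for hh_id, ages in enumerate(households) for age in ages]
    # Start the outbreak with a single randomly chosen infectious agent.
    index_case = rng.choice(people)
    index_case["state"], index_case["days_left"] = "I", infectious_days

    for _ in range(days):
        infectious = [p for p in people if p["state"] == "I"]
        newly_infected = []
        for person in people:
            if person["state"] != "S":
                continue
            for source in infectious:
                if source["hh"] == person["hh"]:
                    prob = p_household      # contact within the household
                elif schools_open and person["age"] < 18 and source["age"] < 18:
                    prob = p_school         # child-to-child contact at school
                else:
                    prob = p_community      # background community contact
                if rng.random() < prob:
                    newly_infected.append(person)
                    break
        # Advance current infections, then start the new ones.
        for source in infectious:
            source["days_left"] -= 1
            if source["days_left"] == 0:
                source["state"] = "R"
        for person in newly_infected:
            person["state"], person["days_left"] = "I", infectious_days

    counts = {"S": 0, "I": 0, "R": 0}
    for p in people:
        counts[p["state"]] += 1
    return counts

# Hypothetical households, each given as a list of member ages.
households = [[40, 38, 10, 7], [72, 70], [21, 20, 22], [30]] * 25

baseline = simulate(households, schools_open=True)
closure = simulate(households, schools_open=False)
print("schools open:  ", baseline)
print("schools closed:", closure)
```

A single run is just one random draw, so comparing interventions in practice means repeating the simulation over many seeds and looking at the distribution of outcomes.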
One of the problems in understanding synthetic populations is that they are difficult to visualize. In our 2005-2009 U.S. Synthetic Population, we have over 112,000,000 household records and over 280,000,000 individual person records. How can you make a map with that many data points? Our solution is a web-based, multi-scale visualization tool that lets you map the synthetic population in a highly interactive system. This system, called the Synthetic Population Viewer, is built on open source technologies and is completely open and available to the public.
What can you see with the viewer? Amazing patterns of the U.S. population! The figure below illustrates the four variables you can map with the viewer. Each of the four panels shows a different variable so you can see how they interact with each other.
Obvious patterns of race and income are everywhere, of course. But some more subtle patterns, where combinations of household size, household income, race, and age all play a part, are also identifiable, such as:
- Off Campus Student Housing: neighborhoods around large universities identified by high density, young households, with small household size and heterogeneous race. The area below surrounds N.C. State University in Raleigh, North Carolina.
- Communities for the Elderly: neighborhoods identified by older householders, small household size, low income, and homogeneous race. The area below is near Seal Beach, California.
The patterns are limitless and, if you look closely, very surprising!
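The two neighborhood types described above are defined by combinations of householder age, household size, income, and race, so they could in principle be flagged with simple rules over the household records rather than by eye. The sketch below assumes a hypothetical record layout (`age`, `size`, `income`, `race`) and invented thresholds and race labels; none of these come from the viewer itself, and density is left out for brevity.

```python
from collections import Counter
from statistics import mean, median

def profile(households):
    """Summarize one neighborhood's household records."""
    races = Counter(h["race"] for h in households)
    return {
        "median_age": median(h["age"] for h in households),
        "mean_size": mean(h["size"] for h in households),
        "median_income": median(h["income"] for h in households),
        # Share of the most common race: near 1.0 means homogeneous.
        "top_race_share": races.most_common(1)[0][1] / len(households),
    }

def classify(p):
    """Apply hypothetical rule-of-thumb thresholds to a neighborhood profile."""
    if p["median_age"] < 30 and p["mean_size"] <= 2.5 and p["top_race_share"] < 0.7:
        return "student housing"
    if (p["median_age"] >= 65 and p["mean_size"] <= 2.0
            and p["median_income"] < 35000 and p["top_race_share"] >= 0.8):
        return "elderly community"
    return "other"

# Invented records; "A", "B", "C" are placeholder race categories.
student_block = [
    {"age": 20, "size": 1, "income": 12000, "race": "A"},
    {"age": 21, "size": 2, "income": 15000, "race": "B"},
    {"age": 22, "size": 2, "income": 18000, "race": "C"},
    {"age": 23, "size": 1, "income": 14000, "race": "A"},
]
elderly_block = [
    {"age": 70, "size": 1, "income": 20000, "race": "A"},
    {"age": 75, "size": 2, "income": 25000, "race": "A"},
    {"age": 68, "size": 1, "income": 18000, "race": "A"},
    {"age": 80, "size": 2, "income": 30000, "race": "A"},
]

print(classify(profile(student_block)))
print(classify(profile(elderly_block)))
```

Hard thresholds like these are crude. They are only meant to show how the four mapped variables combine into a recognizable neighborhood signature.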