Integrated Public Use Microdata Series International
census microdata for social and economic research

Project proposal

Steven Ruggles, Robert McCaa, Deborah Levison, Todd Gardner, and Matthew Sobek


Abstract

This project will create and disseminate an integrated international census database incorporating countries on all continents. It will be the world’s largest public-use demographic database, with multiple samples from each country enabling analyses across time and space. The project entails two complementary tasks: first, the collection of data that will support broad-based investigations in the social and behavioral sciences; second, the creation of a system incorporating innovative capabilities for worldwide, web-based access to both metadata and microdata.

Although large machine-readable census samples exist for many countries, public access to these data is restricted in virtually every case. The investigators have proposed a plan for public access to these data to more than a dozen countries, and every one of them has expressed enthusiasm and eagerness to cooperate. But the goal of this project is not simply to make international microdata available; it will also make them usable. Even in the few cases where microdata are available, comparison across countries or time periods is challenging owing to inconsistencies between datasets and inadequate documentation of comparability problems. As a result, comparative international research based on pooled microdata is rarely attempted. This project will reduce the barriers to international research by preserving datasets and making them freely available, converting them into a uniform format, providing comprehensive documentation, and developing new web-based tools for disseminating the microdata and documentation.

The project builds on the experience of the Integrated Public Use Microdata Series (IPUMS), which received primary funding from NSF. The IPUMS is a coherent series of individual-level U.S. census data drawn from 13 census years between 1850 and 1990. By putting all the census samples in a compatible format and integrating their documentation, the IPUMS greatly simplifies the use of multiple census years. Just as important, new methods of electronic dissemination have democratized access to these resources. The IPUMS includes 22 samples and 65 million records drawn from one country, the United States. The IPUMS is one of the world’s largest public-use databases, but it is modest by comparison with the current project.

The project is composed of four interrelated elements. The first is planning and design. The international dimension of the database poses new design challenges, since it must accommodate variations in census design and cultural concepts. The basic design goals, however, remain the same as in the IPUMS: the system should simplify use of the data while losing no meaningful information.
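As an illustration of this design goal, the following minimal sketch shows one way a country-specific variable could be recoded into a common scheme while the original detail is retained. The countries, the marital-status variable, and all code values are hypothetical and are not the project's actual coding scheme.

```python
# Hypothetical sketch: harmonize a marital-status variable across two
# made-up national coding schemes without discarding source detail.

# Maps (country, original code) -> (general code, detailed code).
# The general code is comparable across countries; the detailed code keeps
# distinctions that only some censuses record.
CROSSWALK = {
    ("A", 1): (1, 100),    # never married
    ("A", 2): (2, 200),    # married
    ("A", 3): (2, 210),    # consensual union (recorded separately in A)
    ("B", "S"): (1, 100),  # single
    ("B", "M"): (2, 200),  # married (B does not distinguish unions)
}


def harmonize(country, original_code):
    """Return the harmonized codes together with the untouched source code."""
    general, detailed = CROSSWALK[(country, original_code)]
    return {"general": general, "detailed": detailed, "source": original_code}


records = [("A", 3), ("B", "M")]
harmonized = [harmonize(country, code) for country, code in records]
```

In this sketch, both records share the same general code and can be pooled for comparative work, while the detailed and source codes preserve what each census originally recorded.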

The second element, microdata conversion, involves both domestic and international components. The domestic segment will include the IPUMS data while adding new U.S. samples to allow detailed study of the late 20th and early 21st centuries. It will incorporate the Current Population Survey March supplements from 1962 to 2005 and the 1% sample of the 2000 Census. With these additions, the database will have a much stronger contemporary focus than the current IPUMS. The international component of the database falls into two categories. For some countries, the project will incorporate already-existing public-use samples. For other countries, no public-use census files presently exist. In these instances, new samples will be drawn from surviving census tapes using techniques to ensure that respondent confidentiality is preserved. These data files are often poorly documented, and interpreting them correctly will require extensive assistance from the statistical offices and experts of each country.

The third element, the development of metadata, is central to the project and poses even greater challenges than the microdata. The documentation will not be confined to codebooks and census questionnaires. As with the IPUMS, a wide variety of ancillary information will be provided to aid in the interpretation of the data, including full detail on sample designs and sampling errors, procedural histories of each dataset, full documentation of error correction and other post-enumeration processing, and analyses of data quality.

The final element of the project is the creation of an integrated data access system to distribute both the data and the documentation on the Internet. Users will extract customized subsets of both data and documentation tailored to their particular research questions. The system will consist of a set of tools for navigating the mass of documentation, defining datasets, and constructing customized variables. Given the large number of variables and samples, the documentation will be so unwieldy as to be virtually unusable in printed form. Accordingly, we will develop software that will construct electronic documentation customized for the needs of each user.
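The extraction idea described above can be sketched in a few lines. This is only an illustration under stated assumptions: the record layout, the sample identifiers, the variable names, and the functions `make_extract` and `make_codebook` are all hypothetical, and the actual system may be designed quite differently.

```python
# Hypothetical sketch of a customized extract: the user names the samples
# and variables of interest and receives matching data and documentation.

def make_extract(records, samples, variables):
    """Keep only the requested variables for records in the requested samples."""
    return [
        {name: rec[name] for name in variables}
        for rec in records
        if rec["sample"] in samples
    ]


def make_codebook(documentation, variables):
    """Keep only the documentation entries for the requested variables."""
    return {name: documentation[name] for name in variables}


# Toy data; every value here is invented for illustration.
records = [
    {"sample": "A1990", "age": 34, "sex": 1, "occupation": 412},
    {"sample": "B2000", "age": 51, "sex": 2, "occupation": 77},
    {"sample": "A1990", "age": 8, "sex": 2, "occupation": 0},
]
documentation = {
    "sample": "Identifier of the census sample",
    "age": "Age in completed years",
    "sex": "Sex of respondent",
    "occupation": "Occupation code",
}

wanted = ["sample", "age"]
extract = make_extract(records, samples={"A1990"}, variables=wanted)
codebook = make_codebook(documentation, wanted)
```

Because the codebook is built from the same variable list as the extract, the user receives documentation only for the variables actually requested.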