  Harmonization

Harmonizing census data is not a new idea. It was first proposed in 1872 at the International Statistics Congress held in St. Petersburg, but little progress was made until the second half of the twentieth century. One of the signal achievements of the United Nations Statistics Division has been the international harmonization of census concepts, from the enumeration form to the publication of final tables. Although the effort is incomplete, it has enjoyed widespread support from statistical agencies around the globe. Since 1991, the IPUMS-USA project has worked to harmonize census data for the United States for the period since 1850, and IPUMS-International has capitalized on this experience.

International census samples employ differing numeric classification systems, and reconciling these codes is a major part of this project. Variables must be easy to use for comparisons across time and space, which requires that we provide the lowest common denominator of detail that is fully comparable. At the same time, we must retain all meaningful detail in each sample, even when it is unique to a single dataset.

For most variables, it is impossible to construct a single uniform classification without losing information. Some samples provide far more detail than others, so the lowest common denominator of all samples inevitably loses important information. Composite coding schemes offer a solution. The first one or two digits of the code provide information available across all samples. The next one or two digits provide additional information available in a broad subset of samples. Finally, trailing digits provide detail that is only rarely available. For example, in IPUMS-International, the first digit of the variable for marital status is comparable across all samples. The second digit delineates consensual unions from other forms of marriage (where appropriate) and distinguishes among the categories separated, divorced, and married with spouse absent. The final digit provides additional detail within the married and married-spouse-absent categories (such as polygamous marriages in Kenya). The basic goal of our harmonization efforts is to simplify use of the data while losing no meaningful information.
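The composite coding idea described above can be sketched in a few lines of Python. This is only an illustration: the three-digit code values, the labels, and the helper names `truncate` and `general_label` are hypothetical and are not the actual IPUMS-International marital status codes. The point is that dropping trailing digits yields a coarser code that stays comparable across more samples.

```python
# Hypothetical three-digit composite codes for marital status.
# These values are invented for illustration; they are NOT the real IPUMS codes.
MARST_DETAILED = {
    100: "single/never married",
    210: "married, formal union",
    211: "married, monogamous",
    212: "married, polygamous",
    220: "consensual union",
    310: "separated",
    320: "divorced",
    330: "married, spouse absent",
    400: "widowed",
}

# First digit only: the level assumed to be comparable across all samples.
MARST_GENERAL = {
    1: "single/never married",
    2: "married or in union",
    3: "separated, divorced, or spouse absent",
    4: "widowed",
}


def truncate(code: int, digits: int, width: int = 3) -> int:
    """Keep the leading `digits` digits of a `width`-digit composite code."""
    if not 1 <= digits <= width:
        raise ValueError("digits must be between 1 and width")
    return code // 10 ** (width - digits)


def general_label(code: int) -> str:
    """Label a detailed code at the fully comparable (first-digit) level."""
    return MARST_GENERAL[truncate(code, 1)]


if __name__ == "__main__":
    for code, label in MARST_DETAILED.items():
        print(code, label, "->", general_label(code))
```

Under this sketch, an analysis spanning every sample would group on `truncate(code, 1)`, while a study limited to samples that record the extra detail could use two or three digits without recoding anything.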

In addition to providing harmonized codes for variables and accompanying documentation, the IPUMS-International project is carrying out a variety of additional tasks to improve data quality, not all of which have been implemented in this preliminary release of the data. These tasks include the following:

  • Cleaning data to eliminate duplicate records, inappropriately merged households, and other errors.
  • Developing internal consistency checks to maximize data integrity. This includes, for example, examining consistency between age and marital status, occupation, and school attendance; looking for persons with multiple spouses in countries where this is not an accepted custom; and checking for agreement between household and individual characteristics.
  • Implementing allocation procedures to impute values for missing or inconsistent data items, using logical edits together with probabilistic "hot deck" methodology. A data quality flag identifies allocated data items.
  • Creating constructed variables to simplify data analysis, including family interrelationship variables. A system of logical rules identifies the record number within each household of the individual’s mother, father, or spouse, if they were present in the household. These pointers allow users to automatically attach the characteristics of these kin or to construct measures of fertility and family composition. Other constructed variables describe family and household characteristics at the individual and household level (such as family and subfamily membership, family and subfamily size, and number of own children).
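To make the allocation step in the list above more concrete, here is a minimal sketch of a random hot deck with a quality flag. It is an assumption-laden illustration, not the project's actual procedure: the function name `hot_deck`, the field names, the imputation cells (sex by age group), and the flag values are all hypothetical, and the logical edits that would precede imputation are omitted.

```python
import random


def hot_deck(records, field, cell_keys, rng):
    """Fill missing `field` values from a randomly chosen donor in the same cell.

    A cell is the combination of values of `cell_keys`. Each output record gets
    `<field>_flag`: 1 if the value was allocated, 0 otherwise.
    """
    # Collect observed (donor) values for each cell.
    donors = {}
    for rec in records:
        if rec[field] is not None:
            cell = tuple(rec[k] for k in cell_keys)
            donors.setdefault(cell, []).append(rec[field])

    out = []
    for rec in records:
        new = dict(rec)
        new[field + "_flag"] = 0
        if rec[field] is None:
            pool = donors.get(tuple(rec[k] for k in cell_keys))
            if pool:  # leave the value missing when no donor exists
                new[field] = rng.choice(pool)
                new[field + "_flag"] = 1
        out.append(new)
    return out


# Hypothetical person records; None marks a missing occupation.
records = [
    {"sex": "F", "agegrp": "20-29", "occ": "teacher"},
    {"sex": "F", "agegrp": "20-29", "occ": None},
    {"sex": "M", "agegrp": "20-29", "occ": "farmer"},
    {"sex": "M", "agegrp": "20-29", "occ": "clerk"},
    {"sex": "M", "agegrp": "20-29", "occ": None},
    {"sex": "M", "agegrp": "60+", "occ": None},
]

filled = hot_deck(records, "occ", ["sex", "agegrp"], random.Random(0))
```

Keeping the flag alongside the value lets users exclude or separately examine allocated items, which is the role the data quality flag plays in the description above.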
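The family pointers described in the last item can be illustrated with a short sketch showing how a pointer to the mother's record number lets a user attach her characteristics to each child. The field names (`hh`, `pernum`, `mother_pernum`, `edu`) and the convention that 0 means "mother not present in the household" are assumptions made for this example, not the project's documented layout.

```python
# Hypothetical person records. `mother_pernum` is the record number of the
# person's mother within the same household, or 0 if she is not present.
persons = [
    {"hh": 1, "pernum": 1, "age": 40, "edu": "secondary", "mother_pernum": 0},
    {"hh": 1, "pernum": 2, "age": 12, "edu": "primary", "mother_pernum": 1},
    {"hh": 2, "pernum": 1, "age": 30, "edu": "university", "mother_pernum": 0},
    {"hh": 2, "pernum": 2, "age": 5, "edu": "none", "mother_pernum": 1},
]


def attach_mother(persons, field):
    """Return copies of the records with the mother's `field` added as `<field>_mom`."""
    # Index every person by (household, record number) for direct lookup.
    index = {(p["hh"], p["pernum"]): p for p in persons}
    out = []
    for p in persons:
        mother = index.get((p["hh"], p["mother_pernum"])) if p["mother_pernum"] else None
        out.append({**p, field + "_mom": mother[field] if mother else None})
    return out


with_mom = attach_mother(persons, "edu")
```

The same lookup would work for a father or spouse pointer, and counting how many records point to a given woman would give a simple own-children measure of the kind mentioned above.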