The component samples of the IPUMS employ a variety of sample designs. This section offers IPUMS users a concise introduction to the various designs; for further information see the chapters on sample design in Volume 3.

Overview

All the samples are also stratified to some degree. That is, they divide the population into strata based on key characteristics, and then sample separately from each stratum. This ensures that each stratum is proportionately represented in the final sample. The samples for years prior to 1960 are geographically stratified: the original source materials were divided up prior to sampling in such a way as to ensure a more even geographical distribution of cases than would be expected from a true random sample. The 1960 and subsequent samples employed more elaborate stratification schemes based not only on geography but also on such characteristics as household size, race, and group quarters membership. The effects of clustering and stratification are described in more detail in Chapter 3 "Sampling Error." The ways in which each sample is stratified are described in the sample-by-sample discussion below.

The sample designs for all years are constrained by the available units of enumeration. In the pre-1940 censuses, all individuals were assigned to a family. The definition of the census family varied only slightly from census to census between 1850 and 1920. Generally, a family was an individual or group of individuals living together in the same dwelling place. Two or more families could reside in a single structure, provided they occupied separate parts of it and their housekeeping was separate. However, all the permanent occupants of any hotel, military barracks, or large institution, were considered members of a single family. Census enumerators likewise counted boarders, lodgers, and servants as part of the family occupying the dwelling place where they slept, regardless of their housekeeping arrangements.
In 1940 the basic unit of enumeration shifted from the census family to households and quasi-households. A household consisted of the group of persons occupying a dwelling place or part of a dwelling place with either separate cooking equipment or an outside entrance. The maximum number of boarders and lodgers in a household was ten; if the dwelling place contained more than ten boarders and/or lodgers, it was enumerated as a quasi-household. Quasi-households also included hotels, military barracks, dormitories, and other large institutions. The 1950 census was similar, except that quasi-households included all units with five or more persons unrelated to the household head. In the years since 1950, the term group quarters has been substituted for the term quasi-household, and there have been minor variations in the criteria used to distinguish separate households within the same structure. In 1980, the number of unrelated persons required for group quarters classification was raised from five to ten.1

For the census years before 1950, the sample units - households and dwellings - are actually subsets of the original enumeration units. The following sample-by-sample discussion describes the sample units used for each census year, and briefly defines the procedures used to select cases from the original sources for inclusion in each sample.

The Design of Each Public Use Microdata Sample (PUMS):

1850: The manuscript census of the 1850 free population consists of roughly 560,000 census pages recorded on 976 reels of microfilm. Each census page has eighty-four lines, and the information pertaining to each individual appears on a separate line. The sample was drawn systematically from each microfilm reel, ordinarily at intervals of six pages. On each selected census page, one line was randomly selected and designated as the sample point. Any valid sample unit beginning at the sample point or within four subsequent lines was included in the sample, yielding a 1-in-100 sample with equal probabilities of inclusion for all individuals and households. Valid sample units are defined as follows:

1. Dwellings: structures containing fewer than 31 residents, with or without multiple families.

1860 and 1870: The manuscript censuses for 1860 and 1870 consist of approximately two million census pages recorded on 3,186 reels of microfilm. Each census page has eighty lines, and the information pertaining to each individual appears on a separate line. The 1860 and 1870 samples employ the same sampling scheme as the 1850 sample, except that households containing any black person are sampled at twice the density of other households (1-in-50).

1880: The manuscript census for 1880 consists of about 600,000 enumeration pages with one hundred lines per page. These records are contained on 1,454 reels of microfilm. One line was randomly selected from each page and designated as the sample point. Sample units were included only if they began at the designated sample point. Valid sample units were defined the same as in 1850, except that related groups in group quarters could be identified through the family relationship variable as well as by surname. This procedure yielded a 1-in-100 sample with equal probabilities of inclusion for all individuals, households, and dwellings.
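The 1880 rule (one random sample point per 100-line page, with a unit entered only if it begins at that point) gives every unit the same chance of selection regardless of its size. A minimal Python sketch of that arithmetic follows. The function names and the example page layout are illustrative assumptions, not part of the IPUMS procedures, and the sketch ignores units that continue across page boundaries.

```python
from fractions import Fraction

LINES_PER_PAGE = 100  # the 1880 schedules have one hundred lines per page


def start_lines(unit_sizes):
    """Return the 0-based line on which each consecutive unit begins."""
    starts, line = [], 0
    for size in unit_sizes:
        starts.append(line)
        line += size
    return starts


def inclusion_probabilities(unit_sizes):
    """Probability that each unit on one page is sampled.

    Assumption: all units fit on a single page, and every line is equally
    likely to be drawn as the sample point.
    """
    if sum(unit_sizes) > LINES_PER_PAGE:
        raise ValueError("units must fit on one page in this sketch")
    starts = start_lines(unit_sizes)
    hits = [0] * len(unit_sizes)
    for point in range(LINES_PER_PAGE):      # try every possible sample point
        for i, start in enumerate(starts):
            if start == point:               # unit begins at the sample point
                hits[i] += 1
    return [Fraction(h, LINES_PER_PAGE) for h in hits]


# Hypothetical page: six units of very different sizes.
example_page = [1, 4, 12, 7, 30, 2]
example_probs = inclusion_probabilities(example_page)
```

In this sketch each unit is selected by exactly one of the hundred possible sample points, so a two-person unit and a thirty-person unit both have probability 1/100. Because all members enter the sample with their unit, the same probability applies to individuals.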
1900: The 1900 manuscript census consists of some 900,000 census pages contained on 1,850 microfilm reels. Each page contains one hundred lines. The measured length of each reel was used to estimate the number of census lines on the reel. One in 750 of these lines was randomly selected and designated as a sample point. Cases were entered if a sample point fell on the first individual in a valid sample unit. Sample units were defined as follows:

1. Families: a family consisted of the head of a census family, all persons related to the head, and all co-resident employees of the head (servants and domestic farm workers).

The documentation for the 1900 sample refers to the latter two categories as "primaries." Because this terminology conflicts violently with Census Bureau usage, the IPUMS avoids this term. These sample units are largely incompatible with the modern census concept of group quarters. The sample does, however, provide sufficient information to determine whether any individual would have been sampled as a group quarters resident in another census year. With care, most common measures can be made compatible with other census years. The design yielded a flat 1-in-760 sample of individuals and families.

1910: The 1910 population census schedules are contained on approximately 1,000,000 census pages of one hundred lines each. These records are contained on 1,784 reels of microfilm. Each reel of the 1910 census was divided into five-page segments (or strata), and two randomly chosen lines were designated as sample points in each stratum. Sample units were entered only if the sample point landed on a head or head-equivalent in a regular household, the head or head-equivalent in the primary family of a large household, or an individual unrelated to the head of a large household. The definitions for these units were as follows:

1. Regular household: census families with a head or head-equivalent and fewer than 21 members unrelated to the head.
A few of these units are classified as group quarters in the IPUMS because they had 10 or more members unrelated to the household head. (As in the case of the 1900 sample, we have altered the terminology used in the 1910 documentation to conform to current usage.) This procedure yields a representative 1-in-250 sample of households and individuals and, unlike the 1900 design, can be made compatible with later definitions of households and group quarters. In addition to the flat 1-in-250 sample, oversamples of the black and Hispanic populations in 1910 will shortly be available.

1920: The manuscript census of the 1920 population consists of about 1.2 million census pages recorded on 2,076 reels of microfilm. Each census page has one hundred lines, and the information pertaining to each individual appears on a separate line. The 1920 sample employs the same sampling scheme as the 1850 sample, except that family relationships as well as surnames are used to identify related groups in group quarters.

1940: The population schedules of the 1940 census are preserved on 4,576 microfilm reels. Each census page contains information on forty individuals. Two lines on each page were designated as "sample lines" by the Census Bureau; the individuals falling on those lines - 5 percent of the population - were asked a set of supplemental questions that appear at the bottom of the census page. Two of every five census pages were systematically selected for examination. On each selected census page, one of the two designated sample lines was randomly selected. Data entry personnel then counted the size of the sample unit containing the targeted sample line. Sample units containing fewer than seven persons were included in the sample in inverse proportion to their size. Thus, every one-person unit was included in the sample, every second two-person unit, every third three-person unit, and so on.
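The arithmetic behind the inverse-size rule for units of fewer than seven persons can be sketched in Python as follows. This is a simplification under stated assumptions: every member of a unit is taken to be equally likely to fall on the targeted sample line, a unit is assumed to be targeted at most once, and the constant and function names are hypothetical.

```python
from fractions import Fraction

SAMPLE_LINE_SHARE = Fraction(2, 40)  # two designated sample lines per 40-line page
PAGE_SHARE = Fraction(2, 5)          # two of every five pages examined
LINE_CHOICE = Fraction(1, 2)         # one of the two sample lines picked at random


def unit_inclusion_probability(size):
    """Approximate chance that a unit of 1 to 6 persons enters the 1940 sample."""
    if not 1 <= size <= 6:
        raise ValueError("this sketch covers units of fewer than seven persons")
    # Chance that a given person sits on the targeted sample line.
    p_person_targeted = SAMPLE_LINE_SHARE * PAGE_SHARE * LINE_CHOICE
    # Assumption: a unit is targeted if any one of its members is targeted.
    p_unit_targeted = size * p_person_targeted
    # Targeted units are kept in inverse proportion to their size.
    p_unit_kept = Fraction(1, size)
    return p_unit_targeted * p_unit_kept


probabilities = {size: unit_inclusion_probability(size) for size in range(1, 7)}
```

Under these assumptions the size terms cancel: larger units are more likely to contain the targeted line, but are kept proportionately less often, so every unit of one to six persons has the same inclusion probability.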
Units with seven or more persons were included with a probability of 1-in-7: every seventh household of size seven or more was selected for the sample. Sample units for 1940 were defined as follows:

1. Households: dwelling places with fewer than five persons unrelated to a household head, excluding institutions and transient quarters, were sampled as households.

This design ensures that each selected sample unit contains one individual who was asked the supplemental sample questions at the bottom of the enumeration form. It yields a flat 1-in-100 sample of persons in units of size seven or less. Persons in units larger than seven are over-represented in the 1940 sample; they must be weighted downward to achieve a representative distribution of household size. Analyses of sample-line individuals who answered supplemental questions must also be weighted. Appropriate weights are included in the IPUMS variables HHWT and SLWT.

1950: The 1950 census schedules are contained on 6,278 microfilm reels. Each census page contains information on thirty individuals. Every fifth line on the census page was designated as a sample line, and additional questions for the sample-line individuals on each page appear at the bottom of the form. For the last sample-line individual on each page, there was a block of additional supplemental questions. Thus, 20 percent of individuals were asked a basic set of supplemental questions, and 3.33 percent of individuals were asked a full set of supplemental questions. One in eleven pages within each enumeration district was selected randomly. On each selected census page, the sixth sample-line individual (the one with the full set of questions) was selected for inclusion in the sample. Any other members of the sample unit containing the selected individual were also included. Sample units are defined as in the 1940 sample. As in the 1940 sample, each household in the 1950 sample includes one individual who was asked supplemental questions.
The sampling procedure yielded a flat 1-in-330 sample of these sample-line individuals. But the sampling procedure is not flat for persons who were not sample-line individuals: their probability of inclusion in the sample is directly proportional to the size of the unit. Thus, when analyzing the entire population of persons in units with more than one individual, cases must be weighted in inverse proportion to household size. Appropriate weights are included in the IPUMS variables HHWT and SLWT.

1960: The 1960 census used a machine-readable household form instead of the traditional census schedule. Census information was collected on a separate form for each housing unit. For the first time, housing questions were included on the same form as the population items, and are thus included in the census samples. Every fourth enumeration unit received a "long form," which contained supplemental sample questions that were asked of all members of the unit. Since the public use microdata files are drawn entirely from these long forms, the sample questions are available for all individuals in every unit, instead of for a single member of each unit as in 1940 and 1950. Of the units receiving a long form, four-fifths received one version (the 20% questionnaire), and one-fifth received a second version with the same population questions but slightly different housing questions (the 5% questionnaire). The 1-in-100 1960 sample is drawn from both questionnaires. 1960 sample units are defined the same as for the 1940 and 1950 samples.

The sample employed a three-step procedure to select cases from the long-form questionnaires, which collectively formed a 25 percent sample. First, the entire census was divided into 33,000 geographic units, called smallest weighting areas (SWAs). The population of each SWA was broken into 44 categories, based on age group, sex, race, headship, and home ownership.
For each category a weight was calculated representing the ratio of persons in the full population count to persons in the 25 percent sample. These weights were used in calculating most census tabulations of sample characteristics for small geographic areas.

Next, the sample weights generated for each SWA were used to select a stratified 5 percent sample from the 25 percent sample. The 25 percent sample of the long forms was divided into 38 strata, based on household size, home ownership, race, and group quarters residence. Within each stratum, the cumulative sum of weights for each household head was calculated, and a case was selected for inclusion in the sample each time the cumulative sum passed a multiple of twenty. This procedure yielded a flat 5 percent sample that was used to produce many of the census publications pertaining to the general population.

Finally, a 1 percent sample was selected from the 5 percent sample, using essentially the same procedure to select every fifth case within each of 38 strata. The strata used in this selection were the same as those used to select the 5 percent sample, except that they employed a slightly different classification of household size. The 1 percent 1960 sample is divided into 100 subsamples, each of which incorporates the same stratification. This elaborate three-step selection scheme yielded a flat sample with very small standard errors, especially for race and home ownership.

1970: Sample units in the 1970 samples are defined the same as in the 1940, 1950, and 1960 samples. One in five housing units in 1970 received a long form containing supplemental sample questions. There were two versions of the long form, with different inquiries on both housing and population items; 15 percent of households received one version, and 5 percent received the other. Six independent 1 percent public use samples were produced for 1970, three from the 15 percent questionnaire and three from the 5 percent questionnaire.
Each of the three samples drawn from each questionnaire provides somewhat different geographical information.

The procedures used to select cases for inclusion in the 1970 public use samples were similar to those used in 1960 but were slightly more elaborate. Again, weights were constructed for the SWA as the ratio of persons with selected characteristics in the full population count to persons with the same characteristics in the 15 percent and 5 percent samples. In 1970, these weights were calculated in three stages that controlled for household size, sex of head, presence of own children of head, group quarters residence, headship, race, age, and sex.

To select the six 1 percent samples (three from the 15 percent sample and three from the 5 percent sample), the weighted population for each sample was divided into seventy-five strata, based on home ownership, race, sex of head, household size, presence of own children, inmate status, and other residence in group quarters. Within each stratum, the sum of weights for household heads was cumulated. The weights represent the ratio of persons in the full count to persons in each sample; because three 1 percent extracts were required for each sample, a case was selected each time the cumulated total of weights passed a multiple of thirty-three. As in 1960, each sample was divided into one hundred subsamples, all of which incorporate the same stratification.

1980: The 1980 census employed a single long-form questionnaire completed by one-half of housing units in places with a population under 2,500 and one-sixth of other housing units. Overall, 19.4 percent of housing units were included in the sample. Sample units were defined the same as in 1970, except that the threshold for sampling as group quarters was raised from five or more persons unrelated to the head to ten or more persons unrelated to the head. Three samples were produced in 1980: a 5% (State) sample and two 1% (Metro and Urban/Rural) samples.
Each of the three samples aims to preserve different types of geographic information.

The 1980 census used the same procedures as the 1970 census to select long-form sample cases for inclusion in the sample, but each step was more elaborate. As in 1970, a three-stage ratio estimation procedure was used to assign weights to sample cases representing the ratio of the full population count to the sample count for persons with particular characteristics in smallest weighting areas. For the 1980 samples, the weights were designed to control for 179 characteristics and combinations of characteristics, including household size, presence of own children, group quarters residence, householder status, detailed race and Spanish origin, age, and sex. The weighted population was divided into 102 strata, including breakdowns by race, Spanish origin, home ownership, sampling rate, and presence of own children. As in 1960 and 1970, cases were selected by cumulating the weights within each stratum, and one hundred stratified subsamples were identified within each of the 1980 samples.

1990: The 1990 census used a single long-form questionnaire for sample questions completed by one-half of persons in places with a population under 2,500, one-sixth of persons in other tracts and block numbering areas with fewer than 2,000 housing units, and one-eighth of all other areas. Overall, about one-sixth of housing units completed a long form. Sample units were defined the same as in 1980. Three samples were produced: a 5 percent sample, a 1 percent sample containing somewhat different geographic codes, and a 3 percent sample of the elderly.

The ratio estimation procedure used to assign weights to sample cases in 1990 was virtually identical to the procedure used in 1980. The stratification scheme, however, continued the trend toward increasing complexity: the number of separate strata was increased from 102 to 1,049, mainly because of additional detail on age and race.
At this point, the 1990 selection procedure broke with the precedent established in the previous three census years. The previous censuses used the weights to extract a flat sample from each stratum, so the final public use samples had equal probabilities of inclusion for all individuals and households. For 1990, the Census Bureau opted instead to produce weighted samples. Within each state, the Bureau divided the sample questionnaires into an appropriate number of 1 percent samples. For example, if 20 percent of the population of a state completed long forms, the sample questionnaires for that state were divided into twenty subsamples of equal size. Each subsample would then consist of every twentieth case drawn from each stratum. The 5 percent, 1 percent, and 3 percent files were then selected at random from the 1 percent subsamples for each state. Weights were attached to each case representing the number of individuals in the general population represented by any particular case in the sample; these weights range from 0 to 1,138.

The advantage of the weighted sample design adopted for 1990 is that it provides maximum precision for persons residing in small localities. The disadvantages are significant, however. The sample is not only more cumbersome to use than those previously produced by the Census Bureau, but precision is actually reduced for the general population. For these reasons, the IPUMS provides a 1-in-100 unweighted extract of the 1990 5 percent file (the state sample), which we created using the same method that the Census Bureau used to create the 1960 and 1970 samples. Furthermore, the IPUMS variable SELFWTHH (Self-weighting sample identifier) can be used to construct a flat subsample from any 1990 IPUMS file.
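The cumulative-weight selection described above for the 1960 and 1970 samples can be sketched in Python as follows. This is an illustrative sketch for a single stratum, not the Census Bureau's or the IPUMS program: the function name, the `start` offset, and the choice to treat a running total that lands exactly on a multiple as having passed it are all assumptions.

```python
def select_by_cumulative_weight(weights, interval, start=0.0):
    """Return indices of cases selected within one stratum.

    A case is selected when the running sum of weights reaches or passes
    the next multiple of `interval`. `start` is a hypothetical offset for
    the running sum; the source does not say how the sum was initialized.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    selected = []
    total = start
    next_multiple = interval
    for index, weight in enumerate(weights):
        total += weight
        if total >= next_multiple:
            selected.append(index)
            # Skip any further multiples already passed, so that one case
            # is selected at most once even if its weight is very large.
            while next_multiple <= total:
                next_multiple += interval
    return selected


# Hypothetical stratum: 100 household heads, each with a weight of 4
# (roughly the ratio of the full count to a 25 percent sample).
example_weights = [4.0] * 100
picked = select_by_cumulative_weight(example_weights, interval=20)
```

With equal weights of 4 and an interval of twenty, the sketch picks every fifth case. When weights vary, cases carrying larger weights are more likely to push the running sum past a multiple, which is how the weights yield a sample that is flat with respect to the full population count.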