As described in the previous chapter, the component samples of the IPUMS used a variety of sample designs. These variations significantly affect the precision of sample estimates. Estimates derived from any sample are subject to sampling variability, which is usually measured as the standard error. The standard error of a sample statistic estimates the variation of that statistic across many similar samples drawn from the same population. Approximately two-thirds of random samples will produce estimates within one standard error of the value for the full population, and approximately 95 percent will produce estimates within two standard errors. Standard errors depend on both sample size and sample design. This chapter describes how sample design affects sample precision, estimates the resulting differences in standard errors across the IPUMS samples, and discusses strategies for obtaining realistic estimates of statistical significance.

Clustering
Some individual characteristics, such as ancestry, are highly correlated within households. For example, if one household member is Chinese, the odds are high that other household members are also Chinese. Suppose we wish to estimate the standard error for the proportion of the population of Chinese ancestry. If we had the sort of sample generally assumed by statistics textbooks - an independent random sample of all individuals in the population - the standard error for Chinese ancestry would be inversely proportional to the square root of the number of individuals in the sample. But because the samples are cluster samples and Chinese ancestry is highly correlated within clusters, the usual method for calculating standard errors would overestimate sample precision. A better estimate of standard error would be obtained by substituting the number of households for the number of individuals when making the calculation.

Standard errors in cluster samples depend on both the number of clusters sampled and on the homogeneity of variables within clusters. Calculation of standard errors for cluster samples is complicated.[1] In the worst case, with perfect homogeneity within clusters, the standard errors for variables would be inversely proportional to the square root of the number of clusters rather than the number of individuals. For variables that are heterogeneous within clusters, such as age and sex, clustering may have little or no effect on sample precision.

The impact of clustering therefore varies from variable to variable. It also varies from census year to census year. The homogeneity of particular characteristics within clusters can change over time. For example, as ethnic intermarriage increases, ethnicity within households becomes less homogeneous. The size of clusters also differs across census years. The larger the average size of clusters, the smaller the number of independent observations.
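The relationship between cluster size, within-cluster homogeneity, and precision can be illustrated with a common textbook approximation (Kish's design effect for equal-sized clusters). This approximation is not taken from this chapter; the function names and the intraclass correlation parameter are assumptions introduced only for illustration.

```python
import math


def design_effect(avg_cluster_size: float, intraclass_corr: float) -> float:
    """Approximate variance inflation from clustering: 1 + (m - 1) * rho."""
    return 1.0 + (avg_cluster_size - 1.0) * intraclass_corr


def design_factor(avg_cluster_size: float, intraclass_corr: float) -> float:
    """Ratio of the clustered standard error to the simple-random-sample one."""
    return math.sqrt(design_effect(avg_cluster_size, intraclass_corr))


def effective_sample_size(n_individuals: int, avg_cluster_size: float,
                          intraclass_corr: float) -> float:
    """Number of independent observations the clustered sample is worth."""
    return n_individuals / design_effect(avg_cluster_size, intraclass_corr)


# Perfect homogeneity (rho = 1): 1,000 persons in households of 4 carry
# the information of 250 independent observations, one per household.
print(effective_sample_size(1000, 4, 1.0))  # 250.0
# No within-household correlation (rho = 0): clustering has no effect.
print(design_factor(4, 0.0))  # 1.0
```

Under this approximation, the worst case described above (perfect homogeneity) reduces the effective sample size to the number of clusters, while a variable with no within-household correlation behaves as in a simple random sample.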
Household size has fallen dramatically over the past century as fertility has declined, boarders and extended families have become less common, and more persons have chosen to live as primary individuals. Thus the samples for recent years have far smaller clusters, on average, than those for earlier years.

Changes in cluster size occur not only because of change over time in household size but also because of differences in the treatment of group quarters - large units such as institutions, boarding houses, and college dormitories. To maximize sample precision, such units are sampled at the individual level in all census years. Thus, instead of being treated as a single sample unit, a prison with 1,000 inmates would be sampled as if it were 1,000 one-person households. This procedure multiplies manyfold the number of independent observations for persons in large units.

The criteria for designating units to be sampled on an individual basis vary considerably among the samples. Table 1 summarizes the criteria used to define group quarters in each census year. In general, the 1940, 1950, 1960, and 1970 samples employ the broadest definition of group quarters: all persons in units with five or more persons unrelated to the household head are sampled as individuals. Thus, for example, a wealthy family with five co-resident servants would be treated as group quarters, and each member of the unit - including the primary family - would be sampled as if he or she resided in a separate one-person household. This inclusive definition of group quarters minimizes the size of clusters. In the 1850-1880 and 1920 samples, on the other hand, a unit may have up to thirty members, related or unrelated, before it is sampled as group quarters. In a unit with over thirty members, each related group is sampled jointly in order to preserve all co-resident family relationships.
This is a minimal definition of group quarters that maximizes the size of clusters; sample precision was reduced in order to preserve information about family and household interrelationships. The other census years fall between these extremes; further details appear in the preceding chapter on sample designs.

Table 1. Criteria for Individual-Level Sampling
There is one additional feature of the sample designs that affects the size of clusters. Most public use files are samples of households, individuals within households, and group quarters.[2] The 1850-1880 and 1920 samples, however, add another level of hierarchy in that multi-household dwellings containing thirty or fewer residents were sampled as dwellings instead of as households - that is, all households within the dwelling were sampled. This provides important additional information, since many dwellings contained two interrelated households. At the same time, sampling by dwellings sacrifices some sample precision, since it increases the average size of clusters.

Stratification
Stratification has the opposite effect of clustering: it increases the precision of sample estimates. It does so not only for those characteristics that are explicitly stratified, but also for any other characteristics that are correlated with them. In some cases, the positive effects of stratification outweigh the adverse effects of clustering, so the IPUMS sample designs can actually yield smaller standard errors than would be obtained through a simple random sample of similar size.

The samples for the four most recent census years all employ elaborate stratification schemes; these are described in some detail in the preceding chapter on sampling design. To select one example, the 1960 sample divided the population into 38 strata, based on household size, home ownership, race, and group quarters residence, and systematically selected households from each stratum for inclusion in the sample. Sample precision was further enhanced by a selection scheme that ensured even coverage within every geographic area. Public use samples of subsequent censuses are even more elaborately stratified - the 1990 sample was selected from 1,049 strata.

The samples for earlier censuses are less elaborately stratified. Unlike the 1960 and subsequent samples, these were not drawn from an existing machine-readable source; instead, they were entered by hand from microfilm copies of the original enumerators' manuscripts. Most individual and household characteristics were unknown before the cases were entered, so they could not be efficiently used to stratify the samples. However, since the microfilm reels and the manuscripts they contain are both arranged geographically, this characteristic was available for every case prior to data entry. Each of the pre-1960 samples could therefore be geographically stratified. This was done by dividing the raw data into individual census pages, groups of pages, or individual microfilm reels, and then sampling independently from each of these strata.
The result is a significantly more even geographical distribution of cases than would be expected from a true random sample. This dramatically improves the precision of the geographic variables, such as region and urban residence. It also indirectly improves the precision of variables highly correlated with geography (e.g., race, ethnicity, education, occupation, farm residence, and home ownership).

Estimating Sampling Errors in the IPUMS

Once the samples are complete, however, it is fairly easy to develop empirical estimates of standard errors. As stated above, standard errors are simply estimates of the standard deviation of a statistic over all possible samples of a population. The IPUMS samples are large enough that we can divide them into many subsample replicates and then directly measure the distribution of a statistic across the subsamples.

Table 2 shows estimated design factors for selected variables in each IPUMS file from 1880 to 1980. These were arrived at by dividing each sample into fifty randomly selected subsample replicates, calculating the standard deviation of the mean of each variable across the fifty subsamples, and dividing the result by the standard error that statistical theory predicts for a simple random sample of the same size as each subsample.[3]

Table 2. Design Factors for Selected Variables, 1880-1980 Samples
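The replicate procedure behind the design factors can be sketched roughly as follows. This is a minimal illustration, not the code used to produce Table 2. It assumes a 0/1 variable and a hypothetical input layout of one list per household, and it assumes that whole households are assigned to a single replicate so that each replicate preserves the clustering of the full sample.

```python
import math
import random
import statistics


def replicate_design_factor(households, n_replicates=50, seed=0):
    """Estimate a design factor for a 0/1 variable from subsample replicates.

    `households` is a list of lists; each inner list holds the 0/1 values
    of one household's members (hypothetical input layout).
    """
    rng = random.Random(seed)
    replicates = [[] for _ in range(n_replicates)]
    # Assumption: whole households go to one replicate, so each replicate
    # keeps the clustering of the full sample.
    for members in households:
        replicates[rng.randrange(n_replicates)].extend(members)
    replicates = [r for r in replicates if r]  # drop any empty replicates

    # Observed variability: standard deviation of the replicate proportions.
    proportions = [sum(r) / len(r) for r in replicates]
    observed_se = statistics.stdev(proportions)

    # Textbook standard error for a simple random sample whose size equals
    # the average replicate size.
    total = sum(len(r) for r in replicates)
    p = sum(sum(r) for r in replicates) / total
    srs_se = math.sqrt(p * (1 - p) / (total / len(replicates)))
    return observed_se / srs_se
```

Because the result rests on only fifty replicate values, it is itself an estimate with noticeable sampling noise; a value of 1.05 should not be read as meaningfully different from 1.0.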
The design factors represent the ratio of observed standard errors for a variable to the standard errors that would be obtained from a simple random sample of the same size. Thus, a design factor of 1.0 means that the effects of stratification and clustering on sample precision cancel one another out. If the design factor is 1.0, a standard statistical package like SPSS or SAS - or a standard statistics textbook - would produce reliable significance statistics. A design factor of 2.0 means that the empirically observed standard errors are twice as great as would be predicted for a simple random sample. For such variables, a statistical package would overestimate statistical significance. Conversely, a design factor of 0.5 means that the sample is twice as precise as would be predicted by standard statistical tests.

Only a few variables have design factors that usually exceed 1.0 by a wide margin. The most dramatic case is RACE, where the design factor exceeds 2.0 in each of the census years before 1960. This reflects the impact of clustering, since households have historically been extremely homogeneous with respect to race. Stratification schemes adopted since 1960 have reduced the design factor for race. BPL (Birthplace), LANGUAGE, and CITIZEN (Citizenship status) also tend to be homogeneous within clusters and have relatively high design factors. The design factor for SCHOOL (attendance) probably dropped over time due to declining fertility: as the number of school-age children per household has fallen, the potential for clustering has diminished.

The design factors presented in Table 2 are only valid for analyses of all individuals in the nation as a whole; the results could be significantly different for any population subgroup. Moreover, particular categories within a given variable may have design factors markedly different from that of the variable itself.
For example, the categories "head" and "wife" from RELATE (Relationship) have uniformly low design factors; because each household ordinarily contains only one head and no more than one wife, there is no potential for homogeneity within households. By contrast, the design factor for the category "child" is quite high because children tend to occur in combination.

Most common individual-level analyses have lower design factors than those listed in Table 2 because researchers tend to choose population subgroups in such a way as to effectively minimize the impact of clustering. For example, fertility studies most frequently focus on married women ages 15 to 49. Since the great majority of sample clusters contain only one such individual, the impact of clustering is trivial. Thus, fertility studies almost inevitably have design factors at or below 1.0, which yield conservative estimates of statistical significance. Likewise, investigations of such topics as the living arrangements of elderly women, the occupational status of young men, or the education of never-married adult women should generally yield precision at least as high as would be obtained from a simple random sample of the same size.

Researchers can usually set up their analyses to avoid high design factors. For example, consider the variable SCHOOL (attendance). For the population as a whole, the design factor is 1.33, indicating that the standard error for school attendance in the IPUMS sample would be 33 percent larger than in a simple random sample. But if we restrict the analysis by age and sex and look at the school attendance of girls ages 10 to 14, the design factor improves to 1.08 because most clusters include only one girl in this age group. Finally, if we randomly select one school-age child from each household, the design factor falls below 1.0.
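To make the SCHOOL figures concrete, a design factor can be applied as a simple multiplier on the standard error that a statistical package reports. The sketch below uses the design factors of 1.33 and 1.08 quoted above; the proportion and sample size are hypothetical values chosen only for illustration, and the same sample size is used for both groups purely to isolate the effect of the design factor.

```python
import math


def srs_standard_error(proportion: float, n: int) -> float:
    """Textbook standard error of a proportion under simple random sampling."""
    return math.sqrt(proportion * (1.0 - proportion) / n)


def adjusted_standard_error(proportion: float, n: int,
                            design_factor: float) -> float:
    """Scale the simple-random-sample standard error by the design factor."""
    return design_factor * srs_standard_error(proportion, n)


# Hypothetical inputs: 20 percent attending school among 10,000 sampled persons.
p, n = 0.20, 10_000
for label, deft in [("all persons", 1.33), ("girls ages 10 to 14", 1.08)]:
    se = adjusted_standard_error(p, n, deft)
    # Two standard errors give the approximate 95 percent range.
    print(f"{label}: SE = {se:.5f}, approx. 95% range = +/- {2 * se:.5f}")
```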
In general, if studies are designed so that they rarely or never examine more than one individual per household, the effects of clustering can be altogether avoided. Especially when doing analyses of children, boarders, and other groups likely to appear multiple times in the same household, researchers should develop strategies to eliminate the redundant cases. Instead of assessing the characteristics of all children, for example, one can look at eldest children, or youngest children, or children of a particular age, or a randomly selected child from each household.

In practice, IPUMS users rarely take the trouble to estimate true standard errors, mainly because the methods for doing so are so cumbersome. Instead, researchers usually accept the significance statistics generated by their statistical packages. On the whole, the results presented here are reassuring: most of the analyses done by users of the IPUMS probably have design factors close to 1.0 or lower, which means that the estimates of the packages are not too far off. If users are aware of the dangers of clustering and design their studies to minimize it, they can safely use statistical procedures designed for simple random samples.
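The strategy of keeping only one individual per household can be sketched as below. The record layout and the field names `household_id` and `age` are hypothetical placeholders; in a real extract they would be replaced by whatever household identifier and age variables the file provides.

```python
import random
from collections import defaultdict


def group_by_household(records, household_key="household_id"):
    """Collect person records into lists keyed by household identifier."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[household_key]].append(rec)
    return groups


def random_one_per_household(records, household_key="household_id", seed=0):
    """Keep one randomly chosen person record from each household."""
    rng = random.Random(seed)
    groups = group_by_household(records, household_key)
    return [rng.choice(members) for members in groups.values()]


def eldest_per_household(records, household_key="household_id", age_key="age"):
    """Keep the oldest person record from each household."""
    groups = group_by_household(records, household_key)
    return [max(members, key=lambda r: r[age_key]) for members in groups.values()]


# Hypothetical child records: three households with 3, 1, and 2 children.
children = [
    {"household_id": 1, "age": 12}, {"household_id": 1, "age": 9},
    {"household_id": 1, "age": 6}, {"household_id": 2, "age": 14},
    {"household_id": 3, "age": 8}, {"household_id": 3, "age": 11},
]
print(len(random_one_per_household(children)))        # 3
print([c["age"] for c in eldest_per_household(children)])  # [12, 14, 11]
```

Either selection leaves at most one case per cluster. Random selection gives a child in a large household a lower chance of being chosen than an only child, so which rule is appropriate depends on the research question.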