Pitfalls of Using Standard Statistical Software Packages for Sample Survey Data

Donna J. Brogan, Ph.D.

Rollins School of Public Health, Emory University, Atlanta

COPYRIGHT: This article is copyrighted and is not to be used without proper acknowledgment and citation. It will appear as a chapter in Encyclopedia of Biostatistics, edited by Peter Armitage and Theodore Colton (Editors-in-Chief), to be published by John Wiley in summer, 1998 as six volumes. The article will be in a section titled "Design of Experiments and Sample Surveys", edited by Paul Levy.

INTRODUCTION

In the past ten years many researchers in the health sciences have become interested in performing secondary analyses using data from complex sample surveys. These analyses are descriptive, analytical, hypothesis generating or model building. Sample survey statisticians are aware that specialized software should be used to analyze complex sample survey data, particularly when analyses are descriptive or analytical and the survey design includes clustering ([6], [7]). Carlson's [3] article in this volume reviews some currently available sample survey software packages.

However, some scientists are not aware of the need to use specialized software or, if aware, prefer not to do so because of the need to learn a new software package. Secondary data analysts may be confused when they realize that there is a difference of opinion, even among sample survey statisticians, as to when it is necessary to use specialized software for analysis of sample survey data ([4], [5], [10]). Finally, many biostatisticians are not able to offer advice on this topic because they are not familiar with the specialized data analysis issues for sample survey data.

This paper uses sample survey data from BRFSS (Behavioral Risk Factor Surveillance System) surveys to illustrate that biased point estimates, inappropriate standard errors and confidence intervals, and misleading tests of significance can result from using standard statistical software packages to analyze sample survey data. General recommendations are given to indicate situations in which serious errors are likely to occur with the use of standard statistical software packages for sample survey data.

CAUTIONS IN USING STANDARD STATISTICAL SOFTWARE PACKAGES

Standard statistical software packages generally do not take into account four common characteristics of sample survey data: (1) unequal probability selection of observations, (2) clustering of observations, (3) stratification and (4) nonresponse and other adjustments [2]. Point estimates of population parameters depend on the value of the analysis weight for each observation. These weights depend upon the selection probabilities and other survey design features such as stratification and clustering. Hence, standard packages will yield biased point estimates if the weights are ignored. Estimated variance formulas for point estimates based on sample survey data depend on clustering, stratification and the weights. By ignoring these aspects, standard packages generally underestimate the variance of a point estimate, sometimes substantially so.

Most standard statistical packages can perform weighted analyses, usually via a WEIGHT statement added to the program code. Use of standard statistical packages with a weighting variable may yield the same point estimates for population parameters as sample survey software packages. However, the estimated variance often is not correct and can be substantially wrong, depending upon the particular program within the standard software package.

DESCRIPTION OF BRFSS SURVEYS

The BRFSS [13] program was established by CDC (Centers for Disease Control and Prevention) to provide state level data to estimate the prevalence of risk factors for disease and poor health. States select a continuous probability sample of the adult noninstitutionalized population using some type of random digit dialing (RDD) telephone sampling [9]; the Mitofsky-Waksberg technique [16] is commonly used. Once a residence is reached, almost all states select one adult, with equal probability, to undergo a telephone interview. States generally interview about 1500 to 4500 adults per year.

BRFSS surveys result in an unequal probability sample of adults, primarily because only one adult per sampled household is selected. Weighting adjustments generally are done for enumeration nonresponse (a roster of adults within the household was not obtained) and/or for interview nonresponse (the selected adult was not interviewed). Further, poststratification of the observations to U.S. Census data generally is done. Hence, each observation in the dataset has a value for the variable FINALWT (final analysis weight). This value indicates the number of persons in the population represented by that observation. The value of FINALWT varies across observations, sometimes considerably, depending upon the state's sampling plan.

In addition to differential weighting, the statewide sample generally is clustered by telephone bank (usually defined as a group of telephone numbers with identical area code, prefix and first two digits of the suffix). Some states use stratification in their sampling process. Although BRFSS sampling details differ across the states, each statewide BRFSS survey typically is weighted and clustered, and a few are stratified.

This paper uses calendar year 1993 BRFSS data on diabetes for the six states given in Table 1, yielding a total sample size of 20,049 observations (completed interviews) over the six states. Presence/absence of diabetes is defined as a yes/no answer to "Have you ever been told by a doctor that you have diabetes"; the few observations with other than a yes/no answer are excluded from all analyses.

METHODS FOR COMPARISON OF TWO SOFTWARE PACKAGES

In order to illustrate the general cautions expressed above, BRFSS data were analyzed with a standard statistical package (SAS System for Windows, Version 6.11 [12]) and a specialized sample survey software package (SUDAAN, Version 7.0 [15]), hereafter referred to as SAS and SUDAAN. Analyses with SUDAAN are considered to provide the correct answers. Sample survey packages other than SUDAAN would provide the same point estimates as SUDAAN and identical or very close estimated variances, depending upon the variance estimation procedure used. Two common variance estimation techniques are Taylor Series linearization ([8], [14]) as in SUDAAN and replication procedures [11] as in WesVar [1]. Other standard statistical packages are assumed to provide the same results as SAS for unweighted analyses and for point estimates for weighted analyses. However, other standard statistical packages may use different default formulas for measures of variability in weighted analyses.

Each state's sampling plan was described to SUDAAN in the same way; within a state no stratification was used and observations were clustered in their appropriate primary sampling unit (PSU), usually a telephone bank. In order to perform analyses over all six states, as well as state-specific analyses, the six state concatenated dataset was described to SUDAAN as a stratified (by state) multi-stage clustered survey. The finite population correction factor was not used in estimated variance calculations. For those familiar with SUDAAN, the PROC statement included DESIGN = WR, the NEST statement included the stratification variable STATE and the clustering (telephone bank) variable PSU, and the WEIGHT statement included the variable FINALWT.

SAS analyses were conducted using four different approaches, all of which ignored the clustering and stratification. The first SAS approach analyzed the dataset unweighted; this is equivalent to using the WEIGHT statement with the variable _ONE_ (a variable whose value is 1.0 for every observation in the dataset). Table 1 (column 2) shows that, with this approach, the CA sample size is 3719 and it contributes 19% to the total inference population.

The second SAS approach used the WEIGHT statement with the variable FINALWT. There is great variability in FINALWT; Table 1 (column 3) indicates its range as 70 to 72,663. Table 1 also shows that using FINALWT implies that the CA sample contributes 50% to the total inference population, rather than only 19% in an unweighted analysis.

The third SAS approach used the WEIGHT statement with the variable NORMWT, a normed weight based on FINALWT. This approach is recommended by some data analysts as giving results approximately equal to results from a sample survey package. For person j within state i , let finalwt(i,j) be the value of the variable FINALWT. Then, the value of NORMWT for this person is defined as:

normwt(i,j) = (20,049) * [finalwt(i,j)] / (45,452,569)

The figure 45,452,569 is the estimated total adult population of the six states, which is the sum of the value of FINALWT over all 20,049 observations (Table 1, column 3). The variable NORMWT has values less than 1.0 and greater than 1.0, and the sum of the values of NORMWT over the entire dataset is 20,049, the total sample size. Table 1 (column 4) shows that, with this approach, the CA sample contributes 50% to the total inference population.

The fourth SAS approach used the WEIGHT statement with the variable STNORMWT, a second normed weight calculated from FINALWT, where the norming is done within state. Hence, the sum of the values of STNORMWT over all observations within a state equals the sample size for that state (Table 1, column 5). Clearly, the sum of STNORMWT over the entire sample equals the total sample size 20,049. Table 1 shows that, with this approach, the CA sample contributes 19% to the total inference population.
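The two normed weights defined above can be sketched on a toy dataset. The following is a minimal illustration only, not the code used for the analyses in this paper; the five records and their FINALWT values are invented, and the names `records`, `normwt`, `stnormwt` and `state_totals` are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy records: (state, finalwt). Values are invented for illustration.
records = [
    ("CA", 6000.0), ("CA", 9000.0), ("CA", 15000.0),
    ("WV", 400.0), ("WV", 600.0),
]

n_total = len(records)
sum_finalwt = sum(w for _, w in records)

# NORMWT: rescale FINALWT so that the weights sum to the total sample size.
normwt = [n_total * w / sum_finalwt for _, w in records]

# STNORMWT: rescale FINALWT within each state so that the weights sum to
# that state's own sample size.
state_n = defaultdict(int)
state_sum = defaultdict(float)
for state, w in records:
    state_n[state] += 1
    state_sum[state] += w
stnormwt = [state_n[s] * w / state_sum[s] for s, w in records]


def state_totals(weights):
    """Sum a list of weights by state, in the order of `records`."""
    totals = defaultdict(float)
    for (state, _), w in zip(records, weights):
        totals[state] += w
    return dict(totals)


print(state_totals(normwt))    # the high-weight state dominates the total
print(state_totals(stnormwt))  # each state sums to its own sample size
```

Under NORMWT the state with the large FINALWT values carries most of the total weight, which mirrors the CA share in Table 1 (column 4); under STNORMWT each state's share reverts to its share of the sample.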

SUDAAN and the four SAS methods were compared first on a descriptive analysis, i.e. estimation of diabetes prevalence for the total population (six states combined) and for each state. PROC DESCRIPT in SUDAAN was used, although PROC CROSSTAB in SUDAAN could also have been used, where the diabetes variable was coded as 1 or 2. PROC MEANS was used for all four SAS methods to obtain the estimated prevalence and estimated standard error for the point estimate; the diabetes variable was coded as 0 (no diabetes) or 100 (have diabetes).

Second, SUDAAN and the four SAS methods were compared on a chi-square analysis to test the null hypothesis of no relationship between gender and diabetes. These analyses were performed for the total population and for each state. PROC CROSSTAB was used in SUDAAN and PROC FREQ was used in SAS with diabetes coded as a categorical variable (1,2).

RESULTS

Descriptive Analyses

Table 2 (columns 2 and 3) shows that unweighted SAS, compared to SUDAAN, overestimates the prevalence of diabetes by about 10% of the SUDAAN point estimate for the total population (5.40% versus 4.86%) and for half of the states. Note also that the estimated standard errors in Table 2 are smaller for unweighted SAS than for SUDAAN. For the entire population the SUDAAN estimated standard error is about 37% larger than the standard error estimated by unweighted SAS (.219 versus .160). The combination of the biased point estimate and underestimation of the standard error could result in quite misleading confidence intervals for the prevalence of diabetes.
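The effect on confidence intervals can be made concrete with the six-state figures from Table 2. The sketch below assumes a normal-approximation 95% interval (estimate plus or minus 1.96 standard errors); that interval form is an assumption made here for illustration and is not necessarily what either package reports.

```python
# Six-state totals from Table 2: (prevalence in percent, standard error).
sudaan = (4.86, 0.219)
sas_unweighted = (5.40, 0.160)

Z = 1.96  # normal-approximation multiplier for a 95% interval (assumed here)


def interval(estimate, se, z=Z):
    """Return the (lower, upper) limits of estimate +/- z * se."""
    return (estimate - z * se, estimate + z * se)


# Overestimate of prevalence relative to the SUDAAN point estimate.
relative_bias = sas_unweighted[0] / sudaan[0] - 1.0
# How much larger the SUDAAN standard error is than the unweighted SAS one.
se_ratio = sudaan[1] / sas_unweighted[1]

ci_sudaan = interval(*sudaan)
ci_sas = interval(*sas_unweighted)

print(f"relative overestimate: {relative_bias:.1%}")
print(f"SE ratio (SUDAAN / SAS): {se_ratio:.2f}")
print("SUDAAN 95% CI:", tuple(round(x, 2) for x in ci_sudaan))
print("SAS unweighted 95% CI:", tuple(round(x, 2) for x in ci_sas))
```

Under this approximation the unweighted SAS interval is both narrower and shifted upward, and its lower limit lies above the SUDAAN point estimate of 4.86%.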

Table 2 (columns 4 and 5) shows that SAS with FINALWT and SAS with NORMWT give identical results, with the SAS point estimates the same as SUDAAN but the SAS estimated standard errors still lower than those from SUDAAN. The magnitude of SAS underestimation of the standard error with FINALWT or with NORMWT is somewhat worse than with SAS unweighted. The advantage of using SAS with FINALWT or NORMWT, compared to SAS unweighted, is that correct point estimates are obtained.

Table 2 (columns 2 and 6) shows that SAS with STNORMWT gives identical results to SAS with FINALWT or NORMWT for state specific analyses but yields a biased point estimate for the total population along with an underestimated standard error.
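The source of the bias in the six-state estimate under STNORMWT can be seen by pooling the state prevalences two ways. This is a rough check using the rounded figures printed in Tables 1 and 2, not a reanalysis of the data; the small number of observations excluded for lacking a yes/no diabetes answer is ignored, and the names `states` and `pooled` are hypothetical.

```python
# Per-state figures copied from Tables 1 and 2:
# (sample size, sum of FINALWT, weighted prevalence of diabetes in percent).
states = {
    "California":    (3719, 22_780_741, 4.48),
    "Florida":       (3087, 10_563_183, 5.19),
    "Maryland":      (4361,  3_727_710, 4.96),
    "Minnesota":     (3412,  3_277_173, 4.23),
    "Tennessee":     (3045,  3_747_334, 6.19),
    "West Virginia": (2425,  1_356_429, 6.04),
}


def pooled(weight_index):
    """Average the state prevalences, weighting by column `weight_index`."""
    total = sum(row[weight_index] for row in states.values())
    return sum(row[weight_index] * row[2] for row in states.values()) / total


# FINALWT and NORMWT weight each state by its estimated adult population.
by_population = pooled(1)
# STNORMWT sums to the state sample size, so states are weighted by sample size.
by_sample_size = pooled(0)

print(round(by_population, 2), round(by_sample_size, 2))
```

The population-weighted average agrees with the SUDAAN six-state estimate of 4.86 up to rounding of the state figures, while the sample-size-weighted average reproduces the 5.10 reported for SAS with STNORMWT. STNORMWT therefore gives small states with relatively high prevalence, such as West Virginia and Tennessee, far more influence than their populations warrant.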

Chi-Square Analyses

The chi-square analysis tests the null hypothesis that the prevalence of diabetes is the same for males and females. Table 3 (columns 2 and 3) shows that unweighted SAS, compared to SUDAAN, yields a higher value for the chi-square statistic for the entire population, giving a smaller P-value (.003 versus .014). A comparison of unweighted SAS with SUDAAN, state by state, shows no consistent pattern; the P-value for unweighted SAS is sometimes higher and sometimes lower than for SUDAAN. However, since it was noted above that unweighted SAS yields biased estimates of diabetes prevalence, unweighted SAS probably should not be given serious consideration in an analysis to determine if diabetes prevalence differs by sex.

Table 3 (columns 2 and 4) shows that SAS with FINALWT, compared to SUDAAN, yields an unreasonably large and suspicious value of the chi-square statistic for the total population and for each state. For this reason P-values are not included in Table 3 for SAS with FINALWT. PROC FREQ in SAS, with FINALWT, considers the sample size to be the sum of the values of FINALWT (i.e. 45,452,569) as opposed to the actual sample size of 20,049 for the total population. This is the reason for the very large values of the chi-square statistic.
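The inflation described above follows from the form of the Pearson chi-square statistic: if every cell count is multiplied by a constant, the statistic is multiplied by the same constant. The sketch below illustrates this with a hand-written Pearson statistic and an invented 2x2 table; the function `pearson_chi_square` and the cell counts are hypothetical, and the code does not reproduce PROC FREQ itself.

```python
def pearson_chi_square(table):
    """Pearson chi-square statistic for a two-way table of (possibly weighted) counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical 2x2 table: rows = gender, columns = diabetes (yes, no).
counts = [[40.0, 960.0], [60.0, 940.0]]

# Ratio of the sum of FINALWT to the actual sample size (both from the text).
k = 45_452_569 / 20_049

# Weighted cell counts when every weight is k times larger.
scaled = [[k * cell for cell in row] for row in counts]

base = pearson_chi_square(counts)
inflated = pearson_chi_square(scaled)

print(round(inflated / base, 1), round(k, 1))  # the two ratios agree

# The same factor applied to the six-state NORMWT statistic in Table 3.
print(round(12.64 * k))
```

Multiplying the six-state NORMWT statistic (12.64) by this factor of roughly 2267 gives a value close to the 28662 shown for FINALWT in Table 3; the small gap is consistent with rounding of the printed statistic.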

Table 3 (columns 2 and 5) shows that SAS with NORMWT, compared to SUDAAN, yields a chi-square value for the six state area which is twice as large (12.64 versus 6.07). However, this relationship between SUDAAN and SAS with NORMWT does not hold for each of the six states in Table 3. Compared to SUDAAN, SAS with NORMWT yields a larger chi-square statistic value for some states but a smaller value for other states. This occurs because, within each state, SAS considers the sample size as the sum of NORMWT. Hence, the sample size for CA is artificially inflated to 10,049 from 3719, whereas the sample size for WV is artificially deflated to 598 from 2425 (see Table 1). Thus, the chi-square statistic using SAS with NORMWT, compared to SUDAAN, is much larger for CA but much smaller for WV.

Table 3 (columns 2 and 6) shows that SAS with STNORMWT, compared with SUDAAN, yields a chi-square statistic for the total population about twice as large as SUDAAN (13.20 versus 6.07). For each state the chi-square statistic based on SAS with STNORMWT is about 15% to 20% larger than provided by SUDAAN. Because STNORMWT is normed within a state, the sum of the weights reflects the statewide sample size. Hence, SAS with STNORMWT shows the common pattern that SAS generally calculates a larger value of the chi-square statistic than does SUDAAN.
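The state-level behavior of NORMWT can be checked against the printed tables. Within a state, NORMWT is a constant multiple of STNORMWT (the ratio of the state's NORMWT sum to its actual sample size), so the weighted chi-square statistic should scale by that same constant. The sketch below is a consistency check on the rounded values in Tables 1 and 3 under that reasoning, not output from SAS; the names `table1`, `table3` and `check` are hypothetical.

```python
# From Table 1: state -> (actual sample size, sum of FINALWT).
table1 = {
    "California":    (3719, 22_780_741),
    "Florida":       (3087, 10_563_183),
    "Maryland":      (4361,  3_727_710),
    "Minnesota":     (3412,  3_277_173),
    "Tennessee":     (3045,  3_747_334),
    "West Virginia": (2425,  1_356_429),
}

# From Table 3: state -> (chi-square with STNORMWT, chi-square with NORMWT).
table3 = {
    "California":    (2.19, 5.91),
    "Florida":       (2.46, 3.72),
    "Maryland":      (6.57, 2.48),
    "Minnesota":     (0.11, 0.05),
    "Tennessee":     (0.65, 0.35),
    "West Virginia": (3.49, 0.86),
}

n_total = sum(n for n, _ in table1.values())
pop_total = sum(pop for _, pop in table1.values())

check = {}
for state, (n, pop) in table1.items():
    implied_n = n_total * pop / pop_total  # sum of NORMWT within the state
    factor = implied_n / n                 # NORMWT = factor * STNORMWT in this state
    chi_st, chi_norm = table3[state]
    check[state] = (implied_n, factor, factor * chi_st, chi_norm)
    print(f"{state:14s} implied n = {implied_n:7.0f}  factor = {factor:.3f}  "
          f"predicted = {factor * chi_st:.2f}  Table 3 = {chi_norm:.2f}")
```

The implied sample sizes match the NORMWT sums in Table 1 (about 10049 for CA and 598 for WV), and scaling each STNORMWT statistic by the state's factor reproduces the NORMWT column of Table 3 to within rounding. A factor above 1 (CA, Florida) inflates the statistic and a factor below 1 deflates it.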

DISCUSSION

Unweighted Analyses with Standard Statistical Software

Although the empirical evidence in this paper is based only on one type of survey (BRFSS), only on six states and only on 1993 data, the findings are consistent with other similar investigations [7]. Using a standard statistical package with unweighted analyses to analyze sample survey data generally will yield (1) biased point estimates of population parameters, (2) underestimates of the standard error for point estimates, (3) confidence intervals on population parameters which are too narrow, and (4) tests of significance which are too likely to reject the null hypothesis because the standard errors or variability in the data generally are underestimated.

The extent of the bias in unweighted point estimates will depend upon the particular dataset and is related to the variability of the FINALWT variable. If FINALWT has little variability in the dataset, then an unweighted point estimate will be close to a weighted point estimate. In the six state BRFSS dataset, the value of FINALWT ranged from 70 to 72,663 over the six states. This extreme variability in the value of FINALWT primarily is due to varying sampling fractions across the states, i.e. a small variation in state sample size (2400 to 4400) but widely different statewide populations (1.4 to 22.8 million).

Another factor which contributes to the bias of estimates based on unweighted analyses is the relationship between the value of FINALWT and the variable being analyzed. In the dataset used here the value of FINALWT is primarily influenced by the sampling fraction in each state; you could say that certain states are "oversampled". If state were strongly related to the analysis variable (diabetes), then point estimates of diabetes prevalence from unweighted analyses could be seriously biased. In this dataset, the estimated statewide prevalences of diabetes do not differ dramatically, ranging from 4% to 6%. If blacks had been oversampled within each state to a large extent, then the bias in estimated diabetes prevalence using unweighted analyses would be substantial and positive, since blacks have a higher prevalence of diabetes than do whites.

In addition to potentially biased point estimates from unweighted analyses, standard errors and other measures of variability generally are underestimated due to clustering and variability in FINALWT. The intracluster correlation coefficients in BRFSS datasets generally are positive but not substantial. This might be expected from the Mitofsky-Waksberg RDD technique and the fact that most states only have about three completed interviews per PSU (telephone bank). Variability in FINALWT, and not clustering, is most likely the most important factor contributing to the higher estimated variances from SUDAAN in this BRFSS dataset. If other sample survey datasets had been used with a higher degree of intra-cluster correlation, unweighted analyses would have produced even smaller estimates of variability, compared to SUDAAN.

Weighted Analyses with Standard Statistical Software

Using weighted analyses with FINALWT or NORMWT produces unbiased point estimates of prevalence for the entire population over all six states and for any strata (states) of interest. Although not illustrated, these weighted analyses also yield unbiased point estimates of diabetes prevalence among subpopulations based on other characteristics, such as race or gender, where the subpopulations contain observations from all or some strata. Hence, either of these two weighted techniques is adequate if only point estimates of prevalence are desired. Weighted SAS using FINALWT or NORMWT tends to underestimate the standard error of estimated prevalences. The degree of underestimation depends upon the size of the intra-cluster correlation coefficient for the variables being analyzed. The higher the intra-cluster correlation, the more serious the underestimation of the variability. Weighted analyses using NORMWT or FINALWT can be a reasonable analytical approach for point estimates of population parameters under the following condition: all intra-cluster correlation coefficients are near zero.

However, SAS with FINALWT in PROC FREQ gives substantially incorrect results because the sample size is assumed to be the population size. Whether this is true in other standard statistical packages depends upon the packages' default options for weighted analyses in chi-square tests.

SAS with NORMWT in PROC FREQ gives a larger chi-square statistic than does SUDAAN for the entire population, about twice as large. However, this procedure yields substantially incorrect chi-square statistics for state-specific analyses. The state specific analyses are wrong because the incorrect sample size is assumed for the state analyses. This will occur also whenever subpopulations are analyzed using NORMWT and the variable which defines the subpopulation is related to the value of FINALWT.

SAS with the second normed weight, STNORMWT, gives more reasonable values for the chi-square statistic for state level analyses, although the chi-square statistics were always larger than with SUDAAN. However, if the weight STNORMWT is used for analyses over the entire population, a biased point estimate is obtained for population parameters.

Conclusions

In searching for an approach to analyze sample survey data with standard statistical software, two reasonable criteria are:

Based on the empirical results above, weighted analyses with FINALWT or NORMWT are the only approaches of the four considered that yield unbiased point estimates for populations and subpopulations. FINALWT should not be used with SAS PROC FREQ because the sum of the weights, i.e. the estimated population size, is treated as the sample size. Hence, this leaves only the option of a weighted analysis using NORMWT. However, as shown above, weighted analyses with NORMWT can yield quite misleading results in subpopulation analyses.

It is recommended that sample survey software be used to analyze sample survey data, especially for estimation of population parameters, descriptive analyses and analytical analyses. Under certain circumstances, standard statistical packages can be used to provide results approximately equal to the results obtained from survey software. However, recognition of these circumstances and awareness of the potential pitfalls of using standard statistical packages requires detailed information about the characteristics of the survey dataset (e.g. sampling plan, weighting scheme, intracluster correlation) as well as knowledge of the particular formulas and default options used by the standard software package for weighted analyses. In the end, it seems easier and less time consuming to use a sample survey software package.



REFERENCES

[1] Brick JM, Broene P, James P and Severynse J (1996). A User's Guide to WesVarPC, Westat, Inc., Rockville, MD.

[2] Brick JM and Kalton G (1996). Statistical Methods in Medical Research, 5, 215-238.

[3] Carlson B. An article in the Encyclopedia of Biostatistics about software for variance estimation in sample surveys.

[4] Graubard BI and Korn EL (1996). Statistical Methods in Medical Research, 5, 263-281.

[5] Groves RM (1989). Survey Errors and Survey Costs, John Wiley, New York.

[6] Korn EL and Graubard BI (1991). American Journal of Public Health, 81(9), 1166-1173.

[7] Landis JR, Lepkowski JM, Eklund SA, and Stehouwer SA (1982). A Statistical Methodology for Analyzing Data from a Complex Survey: the First National Health and Nutrition Examination Survey. Vital and Health Statistics, 2(92), DHEW, Washington, DC.

[8] LaVange LM, Stearns SC, Lafata JE, Koch GG, and Shah BV (1996). Statistical Methods in Medical Research, 5, 311-329.

[9] Lepkowski JM, (1988). In Telephone Survey Methodology, RM Groves, PP Biemer, LE Lyberg, JT Massey, WL Nicholls, and J Waksberg, eds. John Wiley, New York, pp. 73-98.

[10] Pfeffermann D (1996). Statistical Methods in Medical Research, 5, 239-261.

[11] Rust KF and Rao JNK (1996). Statistical Methods in Medical Research, 5, 283-310.

[12] SAS Institute Inc. (1993). SAS Companion for the Microsoft Windows Environment, Version 6. SAS Institute Inc., Cary, NC.

[13] Siegel PZ, Brackbill RM, Frazier EL, Mariolis P, Sanderson LM and Waller MN (1991). MMWR CDC Surveillance Summaries, 40(4), 1-23.

[14] Shah. An article in the Encyclopedia of Biostatistics on the Taylor Series linearization approach for variance estimation.

[15] Shah BV, Barnwell BG and Bieler GS, (1996). SUDAAN User's Manual: Release 7.0, Research Triangle Institute, Research Triangle Park, NC.

[16] Waksberg J (1978). Journal of the American Statistical Association, 73(361), 40-46.

ACKNOWLEDGEMENT

This work was partially supported by CDC via the Division of Diabetes Translation and the Division's 1996 Conference Planning Committee. An invited paper based on this work was presented at the 1996 Diabetes Translation Conference, "Health Care in Transition: Diabetes as a Model for Public Health", held in Washington DC on March 31-April 3, 1996. All statements are the sole responsibility of the author.

BIOGRAPHY

Donna Brogan received her Ph.D. in statistics in 1967 from Iowa State University. She has worked in sample surveys throughout her career, especially in design and analysis strategies. She conducts workshops on using sample survey software for data analysis. Currently she is Professor of Biostatistics at the Rollins School of Public Health at Emory University in Atlanta.

TABLE 1

Sample Size per State and (Min, Max) and Sum of Three Types of Weights per State

1993 BRFSS Surveys

Each weight column lists (Min, Max), Sum, (Percent of six-state total).

STATE          Sample Size    FINALWT                            NORMWT                        STNORMWT
               (Percent)      (Min, Max), Sum, (Percent)         (Min, Max), Sum, (Percent)    (Min, Max), Sum, (Percent)

California     3719 (18.6)    (635, 72663), 22,780,741, (50.1)   (.280, 32.1), 10049, (50.1)   (.104, 11.9), 3719, (18.6)
Florida        3087 (15.4)    (610, 19131), 10,563,183, (23.2)   (.269, 8.4), 4659, (23.2)     (.178, 5.6), 3087, (15.4)
Maryland       4361 (21.8)    (70, 3876), 3,727,710, (8.2)       (.031, 1.8), 1644, (8.2)      (.082, 4.5), 4361, (21.8)
Minnesota      3412 (17.0)    (222, 4182), 3,277,173, (7.2)      (.098, 1.8), 1446, (7.2)      (.232, 4.4), 3412, (17.0)
Tennessee      3045 (15.2)    (233, 7460), 3,747,334, (8.2)      (.103, 3.3), 1653, (8.2)      (.190, 6.1), 3045, (15.2)
West Virginia  2425 (12.1)    (118, 2588), 1,356,429, (3.0)      (.052, 1.1), 598, (3.0)       (.211, 4.6), 2425, (12.1)
6 State Total  20049          (70, 72663), 45,452,569            (.031, 32.1), 20049           (.082, 11.9), 20049


TABLE 2

Estimated Prevalence and (Standard Error) of Diabetes

by State and Analysis Procedure

ANALYSIS PROCEDURE

STATE            SUDAAN         SAS            SAS            SAS            SAS
                 FINALWT        _ONE_          FINALWT        NORMWT         STNORMWT

California       4.48 (.373)    4.88 (.354)    4.48 (.340)    4.48 (.340)    4.48 (.340)
Florida          5.19 (.421)    5.64 (.416)    5.19 (.400)    5.19 (.400)    5.19 (.400)
Maryland         4.96 (.364)    5.10 (.333)    4.96 (.329)    4.96 (.329)    4.96 (.329)
Minnesota        4.23 (.370)    4.37 (.350)    4.23 (.345)    4.23 (.345)    4.23 (.345)
Tennessee        6.19 (.479)    6.37 (.443)    6.19 (.437)    6.19 (.437)    6.19 (.437)
West Virginia    6.04 (.520)    6.68 (.507)    6.04 (.484)    6.04 (.484)    6.04 (.484)
6 State Total    4.86 (.219)    5.40 (.160)    4.86 (.152)    4.86 (.152)    5.10 (.155)

TABLE 3

Calculated Chi-Square Statistic and (P-value) for Testing Independence of Gender and Diabetes, by State and Analysis Procedure

ANALYSIS PROCEDURE

STATE            SUDAAN          SAS             SAS        SAS              SAS
                 FINALWT         _ONE_           FINALWT    NORMWT           STNORMWT

California       1.81 (.178)     0.001 (.975)    13396      5.91 (.015)      2.19 (.139)
Florida          2.15 (.143)     3.44 (.064)     8436       3.72 (.054)      2.46 (.116)
Maryland         5.48 (.019)     2.96 (.085)     5616       2.48 (.116)      6.57 (.010)
Minnesota        0.11 (.745)     0.11 (.743)     104        0.05 (.830)      0.11 (.742)
Tennessee        0.57 (.452)     2.10 (.147)     802        0.35 (.552)      0.65 (.419)
West Virginia    3.09 (.079)     2.74 (.098)     1950       0.86 (.354)      3.49 (.062)
6 State Total    6.07 (.014)     8.91 (.003)     28662      12.64 (<.001)    13.20 (<.001)