Sampling Error Software for Personal Computers

Jim Lepkowski and Judy Bowles
University of Michigan

From The Survey Statistician, No. 35, December 1996, 10-17, the newsletter of the International Association of Survey Statisticians (IASS); reprinted with the kind permission of the authors and editor. See below for more information about The Survey Statistician and the IASS.

Students in sampling methods courses are often surprised to learn that cluster sampling generates positive correlations among sample elements and consequent losses in precision relative to the simple random selections they have been trained elsewhere to analyze. Their surprise turns to dismay when they learn that the standard statistical analysis packages they use routinely do not account for this lost precision correctly.

One of the next questions they ask is, "What software is there that does take the cluster sampling into account in an analysis?" The appropriate answer depends somewhat on how much material has been covered in the course about design-based inference. The methods of inference from complex samples and the statistical techniques that support them are covered in several textbooks and monographs, but those topics are often more than can be covered adequately for students in an introductory sampling methods course. A satisfactory initial answer to the question about software for design-based survey inference may be a brief catalogue of suitable packages.

Our purpose in this article is to give such a catalogue. We are interested in currently available statistical estimation software which accounts for clustering and other complex features of the stratified multistage samples. We do not attempt to give even a limited treatment of the inferential and analytic methods that underlie this software. Readers interested in the inferential basis or particular techniques might examine the monograph by Lee, Forthofer, and Lorimor (1989), or study the more advanced treatments given in Skinner, Holt, and Smith (1989) or Lehtonen and Pahkinen (1995).

The small catalogue of suitable software given here is further limited to those programs or packages that are currently available on personal computers. Thus, we do not discuss OSIRIS (Institute for Social Research, 1982), one of the earliest packages to provide such software. OSIRIS offered a suite of programs that accounted for complex design features in estimation for means, proportions, and regression statistics, but its sampling error software is at present available only for mainframe computers.

We list only commercial or documented free-ware statistical software packages that are currently available for use by the general survey data analyst. The eight packages catalogued here are CENVAR, CLUSTERS, Epi Info, PC CARP, Stata, SUDAAN, VPLX, and WesVarPC. There are no doubt other packages, commercially or otherwise available to the general survey analyst, that we have failed to list. We apologize for our oversight, and welcome the opportunity to list other software in a future issue of the Survey Statistician. Please contact Jim Lepkowski at the Institute for Social Research, 426 Thompson Street, Ann Arbor, Michigan 48103, or by e-mail at jimlep@umich.edu.

Most of these packages have more extensive features than only estimation for complex sample survey data, incorporating software for processing and managing survey data as well. The presentation does not attempt to address these more extensive features, concentrating on facilities for handling complex design features. These packages are described in alphabetic order following a brief description of specific features. We conclude with a few remarks about sampling error estimation, the more commonly used statistical analysis software packages, and a brief listing of features we think sampling error software ought to incorporate in the future.

Sampling Error Estimation

Unequal probabilities of selection and compensatory weights, nonresponse and noncoverage compensatory weights, population control adjustment, poststratification, stratification of sampling units, and multistage selection are widely used features in "complex sample designs". At a minimum, survey data must have a weight, stratum identifiers, and sampling unit identifiers for each responding sample element. Sampling error estimation software must have the capacity to account for weights, stratification, and sampling unit in the estimation process.
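As a minimal sketch of the data requirement just described, the following Python fragment stores a weight, a stratum identifier, and a sampling unit identifier with each responding element, and accumulates weighted totals by sampling unit. The field names, the function name, and the five-record data set are hypothetical and serve only as illustration; none of the packages catalogued here necessarily stores data this way.

```python
from collections import defaultdict
from typing import NamedTuple


class Element(NamedTuple):
    """One responding sample element (field names are hypothetical)."""
    stratum: str   # stratum identifier
    psu: str       # sampling unit identifier
    weight: float  # final analysis weight
    y: float       # survey variable of interest


def weighted_psu_totals(sample):
    """Return {(stratum, psu): (sum of weight*y, sum of weights)}."""
    totals = defaultdict(lambda: [0.0, 0.0])
    for e in sample:
        if e.weight <= 0:
            raise ValueError("each element needs a positive weight")
        cell = totals[(e.stratum, e.psu)]
        cell[0] += e.weight * e.y
        cell[1] += e.weight
    return {key: tuple(val) for key, val in totals.items()}


# Illustrative data only: two strata, two sampling units in each.
sample = [
    Element("A", "A1", 10.0, 1.0),
    Element("A", "A1", 10.0, 0.0),
    Element("A", "A2", 20.0, 1.0),
    Element("B", "B1", 5.0, 0.0),
    Element("B", "B2", 5.0, 1.0),
]
totals = weighted_psu_totals(sample)
```

Totals of this kind, one pair per sampling unit within each stratum, are the raw material for the variance calculations discussed in the rest of this section.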

Nearly all standard statistical software will handle weights correctly when computing point estimates for most of the analyses provided. Few packages handle the weights correctly for variance estimates. Further, only one, Stata, has estimation features to account for the stratification and multistage selection employed in the design.

The programs listed in the next section all report handling weights, stratification, and multistage selection correctly in estimation of point and variance estimates. They all require the specification of weights, strata, and sampling units for each sample element. They do not all handle every conceivable sample design in an unbiased fashion. For example, primary sampling units in most stratified multistage sample designs are selected with probabilities proportionate to size and without replacement. Only one program in the list, SUDAAN, has features to handle explicitly this type of design. However, all listed programs will handle such a design under an ultimate cluster sample selection model (Kalton, 1979). Under the ultimate cluster sampling model, elements within primary sampling units are divided into ultimate clusters and a without replacement selection of those ultimate clusters is chosen across primary sampling units. Variance estimates are computed using only between first stage unit totals without having to compute variance components at each stage of selection. All programs use the ultimate cluster sampling model in variance estimation, although SUDAAN also has features to estimate variances for designs employing without replacement selection of primary sampling units.
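The idea of computing variances from first stage unit totals alone can be sketched in a few lines of Python. The sketch below is an illustration under stated assumptions, not the code of any listed program: it applies no finite population correction, it requires at least two sampling units per stratum, and the function names and data are hypothetical. For the weighted mean, a ratio of two weighted totals, it uses the usual Taylor series linearized values.

```python
from collections import defaultdict


def between_psu_variance(psu_values):
    """Variance of a sum of PSU-level values from between-PSU variation.

    psu_values maps (stratum, psu) -> PSU-level value z_hi. The result is
    the sum over strata of n_h / (n_h - 1) * sum_i (z_hi - mean_h)**2.
    No finite population correction is applied (an assumption here).
    """
    by_stratum = defaultdict(list)
    for (stratum, _psu), z in psu_values.items():
        by_stratum[stratum].append(z)
    variance = 0.0
    for stratum, zs in by_stratum.items():
        n_h = len(zs)
        if n_h < 2:
            raise ValueError(f"stratum {stratum!r} needs at least two PSUs")
        mean_h = sum(zs) / n_h
        variance += n_h / (n_h - 1) * sum((z - mean_h) ** 2 for z in zs)
    return variance


def weighted_mean_with_variance(sample):
    """sample: iterable of (stratum, psu, weight, y) tuples.

    Returns (weighted mean, linearized variance estimate).
    """
    sample = list(sample)
    sum_w = sum(w for _, _, w, _ in sample)
    mean = sum(w * y for _, _, w, y in sample) / sum_w
    # Linearized values for the ratio sum(w*y) / sum(w), totalled by PSU.
    z = defaultdict(float)
    for stratum, psu, w, y in sample:
        z[(stratum, psu)] += w * (y - mean) / sum_w
    return mean, between_psu_variance(z)


# Illustrative data only: (stratum, psu, weight, y).
sample = [
    ("A", "A1", 10.0, 1.0),
    ("A", "A1", 10.0, 0.0),
    ("A", "A2", 20.0, 1.0),
    ("B", "B1", 5.0, 0.0),
    ("B", "B2", 5.0, 1.0),
]
mean, variance = weighted_mean_with_variance(sample)
total_variance = between_psu_variance(
    {("A", "A1"): 10.0, ("A", "A2"): 20.0, ("B", "B1"): 0.0, ("B", "B2"): 5.0}
)
```

Note that nothing below the first stage enters the calculation except through the weighted totals, which is why no separate variance components are needed for later stages of selection.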

There are two principal methods of variance estimation employed for complex sample designs: Taylor series approximation and repeated replication (see Wolter, 1985). Repeated replication methods include the method of random groups, the jackknife, balanced half samples, bootstrap methods, and various modifications of these methods. The Taylor series approximation and the repeated replication methods do not produce identical estimates of sampling error, but empirical investigations (Kish and Frankel, 1970; Kish and Frankel, 1974) have shown that the differences between the methods are small for many statistics. The important practical implication is that a Taylor series approximation must be developed analytically for each statistic, while repeated replication uses the same basic estimation method regardless of the statistic being estimated. Most of the listed programs use Taylor series approximations to compute sampling error estimates. The repeated replication programs in the list offer many of the basic methods, except the bootstrap.
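The practical point about repeated replication, that one routine serves any statistic, can be illustrated with a delete-one-unit jackknife. The Python sketch below is hypothetical in its names and data and is not taken from any listed program; it assumes at least two first stage units per stratum and applies no finite population correction. The same function is applied unchanged to a weighted total and to a weighted mean.

```python
from collections import defaultdict


def weighted_total(sample):
    """sample: list of (stratum, psu, weight, y) tuples."""
    return sum(w * y for _, _, w, y in sample)


def weighted_mean(sample):
    return weighted_total(sample) / sum(w for _, _, w, _ in sample)


def jackknife_variance(sample, statistic):
    """Delete-one-PSU jackknife for a stratified design.

    Each replicate drops one PSU and inflates the weights of the other
    PSUs in the same stratum by n_h / (n_h - 1). The statistic is simply
    recomputed on each replicate; no statistic-specific algebra is needed.
    """
    psus = defaultdict(set)
    for stratum, psu, _, _ in sample:
        psus[stratum].add(psu)
    full = statistic(sample)
    variance = 0.0
    for stratum, ids in psus.items():
        n_h = len(ids)
        if n_h < 2:
            raise ValueError(f"stratum {stratum!r} needs at least two PSUs")
        for dropped in sorted(ids):
            replicate = []
            for s, p, w, y in sample:
                if s == stratum:
                    if p == dropped:
                        continue  # this PSU is left out of the replicate
                    w = w * n_h / (n_h - 1)
                replicate.append((s, p, w, y))
            variance += (n_h - 1) / n_h * (statistic(replicate) - full) ** 2
    return variance


# Illustrative data only: (stratum, psu, weight, y).
sample = [
    ("A", "A1", 10.0, 1.0),
    ("A", "A1", 10.0, 0.0),
    ("A", "A2", 20.0, 1.0),
    ("B", "B1", 5.0, 0.0),
    ("B", "B2", 5.0, 1.0),
]
var_total = jackknife_variance(sample, weighted_total)
var_mean = jackknife_variance(sample, weighted_mean)
```

Adding a new statistic here means writing one more function like `weighted_mean`; a Taylor series program would instead need a new set of linearized values derived for that statistic.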

A variety of statistical methods are handled by the listed programs. Some estimate sampling variances and related statistics (design effects, misspecification effects, intracluster homogeneity) only for means, totals, and proportions for the total sample, for subclasses of the total sample, and for differences between subclasses. Others estimate sampling variances for regression and logistic regression statistics. Nearly all estimate test statistics based on these sampling variances. A few compute variance estimates and associated test statistics for survival analysis, contingency table analysis, generalized estimating equation models, specialized ratio estimates, and standardized rates.
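Two of the related statistics mentioned above follow from simple arithmetic once a design-based variance is available. The Python sketch below takes the design effect as the ratio of the design-based variance to the variance expected from a simple random sample of the same size, and backs out intracluster homogeneity from the familiar approximation deff = 1 + (b - 1) * roh, where b is the average number of sample elements per cluster. The function names and input values are hypothetical, and individual programs may define or compute these quantities differently.

```python
def design_effect(design_variance, srs_variance):
    """Ratio of the design-based variance to the simple random sampling
    variance for a sample of the same size."""
    if srs_variance <= 0:
        raise ValueError("srs_variance must be positive")
    return design_variance / srs_variance


def intracluster_homogeneity(deff, average_cluster_size):
    """Solve deff = 1 + (b - 1) * roh for roh.

    This approximation is an assumption of the sketch; it is most
    reasonable when cluster sizes and weights are roughly equal.
    """
    if average_cluster_size <= 1:
        raise ValueError("average_cluster_size must exceed 1")
    return (deff - 1.0) / (average_cluster_size - 1.0)


# Illustrative values only.
deff = design_effect(design_variance=0.0004, srs_variance=0.0002)
roh = intracluster_homogeneity(deff, average_cluster_size=11)
```

With these inputs the design effect is 2 and the implied intracluster homogeneity is 0.1.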

Most of the listed programs operate under DOS. Two of the DOS-based programs (CENVAR and Epi Info) use menus from which users select program options. The other DOS programs (CLUSTERS, PC CARP, VPLX) use keyword input. There are also two programs that operate under Microsoft Windows. One of these programs (WesVarPC) employs "pull down menus" for user specification of program options, while another (SUDAAN version 7) employs a combination of pull down menus and keyword input.

We have not attempted for this brief catalogue to compare the estimates produced by these programs, except in limited testing to be sure the software could be loaded and operated successfully on our microcomputers. A small number of cross-tabulations were prepared from a modest sized sample survey data set (n = 3,617) with all programs except VPLX and PC CARP. Estimated proportions, standard errors, and coefficients of variation were similar for all the programs in this testing. More extensive comparisons are needed across sample surveys of different sizes and for more statistics, but those comparisons are beyond the scope of our current review.

Finally, five of these packages (CENVAR, CLUSTERS, Epi Info, VPLX, and WesVarPC) are available for free, or at a nominal charge to handle processing and shipping. Access is made even easier for all but CLUSTERS since they can be downloaded over the World Wide Web (WWW) using standard network browsing software. Information is available over the WWW for all but one of the programs. In the listings below, e-mail addresses to obtain further information are given for all software, and WWW access to information or to the software and documentation itself is also given.

Sampling Error Software

CENVAR (U.S. Bureau of the Census; contact International Programs Center, U.S. Bureau of the Census, Washington, DC 20233-8860; e-mail IMPS@census.gov; WWW access to software at http://www.census.gov/ftp/pub/ipc/www/imps.html).

CENVAR is a component of a statistical software system designed by the U.S. Bureau of the Census for processing, management, and analysis of complex survey data, the Integrated Microcomputer Processing System (IMPS). The program operates in a DOS environment but is screen oriented and completely menu driven. Sample designs that can be accommodated include simple random sampling, stratified random element sampling, and multistage cluster samples with equal or unequal probabilities of selection. These sample designs are all addressed through the ultimate cluster sampling model. CENVAR uses Taylor series approximation for variance estimation, based on PC CARP software developed by Iowa State University (see below). Users must have a PC compatible microcomputer with 640K RAM and at least 10 Mb of hard disk space, DOS 3.2 or higher, and a printer capable of 132 characters per line. A math co-processor is highly recommended for Intel 286, 386, and 486SX processors. IMPS version 3.1 is easily downloaded from the WWW site, and installation is relatively straightforward. Only a subset of the IMPS needs to be installed to run CENVAR. Input data must be in ASCII format, and users must create an IMPS dictionary to read data. Blank fields are not acceptable and must be filled with zeroes before IMPS can read the data. Documentation is available in Word Perfect 5.1 format. CENVAR is not difficult to use, but users must be cautious about missing data; it appears to us in casual use that it will not accept missing values for any variables for a record. CENVAR produces sampling error estimates for means, proportions, and totals for the total sample as well as specified subclasses in a tabular layout. In addition to the standard error, 95% confidence interval limits, coefficients of variation, design effects, and unweighted sample sizes are given. We did experience some problems with the calculation of the design effects - users may want to double-check the figures.
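The summary measures that accompany a standard error in output of this kind are straightforward to reproduce by hand. The Python sketch below is an illustration only and is not CENVAR's code: it assumes a normal critical value of 1.96 for the 95% interval (a program might instead use a t value based on the design), and the function name and inputs are hypothetical.

```python
import math


def summarize(estimate, variance, critical_value=1.96):
    """Standard error, confidence limits, and coefficient of variation.

    critical_value defaults to 1.96, a normal approximation for a 95%
    interval; this is an assumption of the sketch.
    """
    if variance < 0:
        raise ValueError("variance must be non-negative")
    se = math.sqrt(variance)
    # The coefficient of variation is undefined for a zero estimate.
    cv = se / abs(estimate) if estimate != 0 else float("nan")
    return {
        "se": se,
        "ci_lower": estimate - critical_value * se,
        "ci_upper": estimate + critical_value * se,
        "cv": cv,
    }


# Illustrative values only: a proportion of 0.7 with variance 0.0025.
result = summarize(0.7, 0.0025)
```

A check of this kind is also a quick way to confirm that printed confidence limits and coefficients of variation are consistent with the printed standard error.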

CLUSTERS (World Fertility Survey; contact Vijay Verma, 105 Park Road, Teddington (Middlesex), TW11 0AW, United Kingdom; e-mail vjverma@essex.uk; no WWW access to software).

CLUSTERS is a stand-alone program originally designed by the staff of the World Fertility Survey, and later updated and improved by Vijay Verma and Mick Price. It is now distributed by Vijay Verma for a nominal handling charge. The program operates in a DOS environment and is keyword driven. Users must prepare keyword commands in an ASCII file, and the location of that command file is passed to the program during execution. The principal sample design is a stratified multistage cluster sample, addressed through the ultimate cluster sampling model. CLUSTERS uses Taylor series approximation for variance estimation. Users must have a PC compatible microcomputer with 640K RAM and at least 2 Mb of hard disk space for the program, DOS 3.2 or higher, and a printer capable of 132 characters per line. A math co-processor is highly recommended for Intel 286, 386, and 486SX processors. The program comes with an extensive set of sample set ups and sample data sets for testing installation. CLUSTERS itself is easily installed, although some users may find execution a bit confusing until they learn how to specify the various required and optional input and output files. Input data must be in ASCII format, and users must create a dictionary in the command file to read data. Documentation is available in ASCII format. CLUSTERS produces sampling error estimates for means and proportions for the total sample as well as specified subclasses and subclass differences in a spreadsheet-like layout. In addition to the standard error, coefficients of variation, design effects, unweighted sample sizes, and intracluster homogeneity estimates are given.

Epi Info (U.S. Centers for Disease Control and Prevention; contact Andrew G. Dean, MD, Epidemiology Program Office, Mailstop C08, Centers for Disease Control and Prevention, Atlanta, GA 30333; e-mail AGD1@epo.em.cdc.go or EpiInfo@cdc1.cdc.gov; WWW access to software http://www.cdc.gov/epo/epi/epi.html).

Epi Info is an epidemiological and statistical software system designed by the U.S. Centers for Disease Control and Prevention. It has features for processing, managing, and analyzing epidemiological data, including complex survey data (CSAMPLE component). The program operates in a DOS environment but is screen oriented. Documentation is available on-line in the program, and can be printed chapter-by-chapter. The basic sample design that can be accommodated is stratified multistage cluster sampling through the ultimate cluster sampling model. Epi Info uses Taylor series approximation for variance estimation. Users must have a PC compatible microcomputer with 640K RAM and at least 10 Mb of hard disk space, DOS 3.2 or higher, and a printer capable of 132 characters per line. A math co-processor is highly recommended for Intel 286, 386, and 486SX processors. Epi Info is easily downloaded from the WWW site, but make sure you download a copy of the README.TXT file, which contains valuable information about installation on a PC or network. Installation is relatively straightforward, although it appears necessary for the user to create the needed subdirectories in advance. We loaded the entire Epi Info program even though we only needed a subset to estimate sampling errors. Input data can be in DBF, Lotus, or ASCII format. Users must create a questionnaire in the EPED component of Epi Info to read ASCII data; DBF and Lotus format entry are easier. The CSAMPLE component is entirely menu-driven. Epi Info produces sampling error estimates for means and proportions for the total sample as well as for subclasses specified in a two-way layout. The printed output includes only unweighted frequencies, weighted proportions or means, standard errors, 95% confidence interval limits, and design effects.

PC CARP (Iowa State University; contact Sandie Smith, Statistical Laboratory, 219 Snedecor Hall, Ames, IA 50011; e-mail sandie@iastate.edu; WWW access to software http://www.statlib.iastate.edu/survey/software/pccarp.html).

PC CARP is the PC version of the mainframe program SUPER CARP, Complex Analysis Regression Program, developed at Iowa State University. PC CARP can be used to estimate standard errors for means, proportions, quantiles, ratios, differences of ratios, and entries as well as test statistics for two way contingency tables. There are three companion programs which enlarge the range of analyses available: PC CARPL for logistic regression; POSTCARP for poststratified estimates of totals, ratios, and differences of ratios; and EV CARP for regression analysis, including under measurement error in the explanatory variables. The program operates in a DOS environment and is keyword oriented. The programs are designed to handle stratified multistage cluster samples with finite population corrections at up to two stages of selection. PC CARP uses Taylor series approximation for variance estimation. Users must have a PC compatible microcomputer with 640K RAM, and a math co-processor is highly recommended for Intel 286, 386, and 486SX processors. PC CARP and its companion programs may be purchased from the Statistical Laboratory at Iowa State University at $300. Input data can be in ASCII format, and users must create a dictionary to read data. Printed documentation is distributed with the software.

Stata (Stata Corporation; contact Stata Corporation, 702 University Drive East, College Station, TX 77840; e-mail stata@stata.com; WWW site http://www.stata.com).

Stata is a programmable statistical analysis software system which has recently begun to introduce commands which allow users to compute sampling error estimates for many statistics. The program is available for DOS and Windows environments and is keyword driven. Menus and help screens are available in the Windows version. We have only recently obtained preliminary documentation for the Stata commands for survey data analysis, and we are not sure what sample designs can be accommodated. Stata survey analysis commands use Taylor series approximation for variance estimation. Stata itself urges users to have a floating point processor on their PC, and Windows users are recommended to have 8 Mb of RAM and at least 4 Mb of hard disk space. The list price for Stata is $945 for commercial and $395 for academic users, and the survey analysis commands are included as part of the package. Since we have not yet purchased the software, we cannot comment on installation or use of the survey analysis commands, or input data format and documentation. The current survey analysis commands include svymean, svytotal, svyratio, and svyprop for means, totals, ratios, and proportions. The commands svyreg, svylogit, and svyprobt are available for the obvious regression, logistic regression, and probit analysis procedures. The commands svylc and svytest allow estimation of linear combinations of parameters and hypothesis tests. The command svydes allows the user to describe the specific sample design and should be used prior to any of the above commands. There are plans to add survey analysis commands for estimating distribution functions and quantiles, contingency table analysis, missing data compensation, and other analyses.

SUDAAN (Research Triangle Institute; contact SUDAAN Product Coordinator, Statistical Software Center, Research Triangle Institute, 3040 Cornwallis Road, Research Triangle Park, NC 27709-2194; e-mail SUDAAN@rti.org; WWW site http://www.rti.org/patents/sudaan.html).

SUDAAN is a statistical software package for analysis of correlated data, including complex survey data. It provides facilities for estimation of a range of statistics and their associated sampling errors, including means, proportions, ratios, quantiles, cross-tabulations, odds ratios; linear, logistic, and proportional hazards regression models; and contingency table analysis. The program uses Taylor series approximations for variance estimation. It accommodates with and without replacement selection of first stage units, including components of variance, as well as simple random sampling and stratified element sampling designs. SUDAAN is available for PCs under MS DOS as well as for Windows, and prices vary depending on the type of firm, renewal status, and number of licenses. For example, the single license price for a new license PC version of SUDAAN 6.53 for commercial companies and government agencies is $995; the Windows version 7.0 is $1495. Current pricing lists are available directly from RTI. Although SUDAAN is the most expensive of the software programs listed, users get a lot for their money. SUDAAN will read PC SAS data sets directly (but not the SAS for Windows data sets). The program is keyword driven, even in the Windows version, but the keyword syntax is very much like SAS. ASCII data sets can also be used in SUDAAN. Printed documentation is provided with each licensed copy of SUDAAN.

VPLX (U.S. Bureau of the Census; contact Robert E. Fay, U.S. Bureau of the Census, Room 3067, Bldg. 3, Washington, DC 20233-9001; e-mail rfay@census.gov; WWW access to software at http://www.census.gov/sdms/www/vwelcome.html).

VPLX is a stand-alone program for variance estimation designed and used by the U.S. Bureau of the Census for complex survey data. The program operates in a DOS environment and is keyword driven. VPLX is primarily designed for stratified multistage cluster samples under the ultimate cluster sampling model. It uses repeated replication methods for variance estimation, including a random group, a jackknife, and a balanced repeated replication procedure. Users must have a PC compatible microcomputer with 8 Mb RAM and at least 3 Mb of hard disk space, and DOS 5.0 or higher. A math co-processor is highly recommended for Intel 286, 386, and 486SX processors. VPLX is easily downloaded from the WWW site, and installation is a matter of copying files to directories (there is no separate installation program). Input data must be in ASCII format, and users must create a dictionary to read data. Documentation is available from the Web site in Adobe format. An Adobe reader is also readily downloaded from the WWW. VPLX is more complicated to use than other keyword driven programs on this list and, like CLUSTERS, requires that a program file be developed separately in an ASCII format for input to the program execution. It produces sampling error estimates for means, proportions, and totals for the total sample as well as specified subclasses.

WesVarPC (Westat, Incorporated; contact Westat, Inc., 1650 Research Blvd., Rockville, MD 20850-3129; e-mail WESVAR@westat.com; WWW access to software at http://www.westat.com/wesvarpc/index.html).

WesVarPC is a statistical software system designed by Westat, Inc. for analysis of complex survey data. The program operates in a Windows (3.1, 3.11, and 95) environment and is completely menu driven. The primary sample design that can be accommodated is a stratified multistage cluster sample based on the ultimate cluster sampling model. WesVarPC uses repeated replication for variance estimation, including jackknife, balanced half sample, and the Fay modification to the balanced half sample method. Users must have a PC compatible microcomputer with 4 Mb RAM and at least 10.1 Mb of hard disk space, and Windows 3.1, 3.11, or 95. It is easily downloaded from the WWW site, and installation is probably the easiest among all the software listed using the installation software provided. Documentation is in an Adobe format, but instructions are provided on how to download the Adobe reader. The documentation is the easiest to read and best laid out among all the programs listed. Input data can be in ASCII format, or DBF, SPSS for Windows, SAS Transport, or PC SAS for DOS format. Input from any one of these formats is easy in the menu driven environment. WesVarPC requires that a new version of the data set be created in a special WesVarPC format. This requires the specification of replicates and, if poststratification is to be incorporated into the variance estimates, replicate weights. Users unfamiliar with these procedures may find the specifications slightly confusing. WesVarPC has facilities at present for contingency table analysis, regression, and logistic regression. There is an extensive menu driven system for creating new variables which extends the range of statistics that WesVarPC can be used for. Output is in a list format with one line per statistic in a simple format. The format is not suitable for publication, but it can be saved to a file for processing in a spreadsheet or other program.
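For readers unfamiliar with replicates and replicate weights, the following Python sketch shows one way balanced half sample replicates with the Fay modification can be formed for a design with exactly two first stage units per stratum. It is an illustration under stated assumptions and does not reproduce WesVarPC's procedures or file format: the function names, the hard-coded 4 x 4 Hadamard matrix, the choice of 0.5 for the Fay factor, and the data are all hypothetical.

```python
def weighted_total(sample):
    """sample: list of (stratum, psu, weight, y) tuples."""
    return sum(w * y for _, _, w, y in sample)


def fay_brr_variance(sample, statistic, hadamard, fay_k=0.5):
    """Balanced half sample variance with Fay's modification.

    In each replicate one PSU per stratum has its weights multiplied by
    (2 - fay_k) and the other by fay_k; fay_k = 0 gives ordinary balanced
    half samples. Column 0 of the Hadamard matrix is not used.
    """
    if not 0 <= fay_k < 1:
        raise ValueError("fay_k must be in [0, 1)")
    strata = sorted({s for s, _, _, _ in sample})
    if len(hadamard[0]) <= len(strata):
        raise ValueError("Hadamard matrix is too small for these strata")
    pairs = {}
    for h in strata:
        ids = sorted({p for s, p, _, _ in sample if s == h})
        if len(ids) != 2:
            raise ValueError(f"stratum {h!r} must have exactly two PSUs")
        pairs[h] = ids
    full = statistic(sample)
    squared = 0.0
    for row in hadamard:
        replicate = []
        for s, p, w, y in sample:
            sign = row[strata.index(s) + 1]
            favoured = pairs[s][0] if sign > 0 else pairs[s][1]
            factor = 2.0 - fay_k if p == favoured else fay_k
            replicate.append((s, p, w * factor, y))  # replicate weight
        squared += (statistic(replicate) - full) ** 2
    return squared / (len(hadamard) * (1.0 - fay_k) ** 2)


HADAMARD_4 = [
    [1, 1, 1, 1],
    [1, -1, 1, -1],
    [1, 1, -1, -1],
    [1, -1, -1, 1],
]

# Illustrative data only: (stratum, psu, weight, y).
sample = [
    ("A", "A1", 10.0, 1.0),
    ("A", "A1", 10.0, 0.0),
    ("A", "A2", 20.0, 1.0),
    ("B", "B1", 5.0, 0.0),
    ("B", "B2", 5.0, 1.0),
]
var_fay = fay_brr_variance(sample, weighted_total, HADAMARD_4, fay_k=0.5)
var_brr = fay_brr_variance(sample, weighted_total, HADAMARD_4, fay_k=0.0)
```

The adjusted weights inside the loop are the replicate weights; a data set prepared for replication analysis would typically carry one such column of weights per replicate. If poststratification were part of the weighting, it would be repeated on each replicate before the statistic is computed, a step this sketch omits.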

Concluding Remarks

It is a bit difficult to summarize briefly the features of these sampling error estimation programs. Some have many components, and we may not have portrayed their features completely. If there are errors, we apologize. They were inadvertent, and we would be happy to publish a revised description of any software. Similarly, we are not sure we have been comprehensive in our catalogue. If you are aware of other programs available to the survey analysis public that estimate sampling errors taking stratified multistage sampling into account, please let us know for future editions of the Survey Statistician.

Fortunately, our inaccuracies and omissions can, and probably will, be corrected in future issues of the Survey Statistician. We are in the process of inviting authors of the software to prepare descriptions of their software for the Survey Statistician. For example, in the next issue we hope to present an article by the authors of IMPS and its sampling error component CENVAR. Jim Lepkowski will serve as the editor for this feature in future issues.

There are other approaches to sampling error estimation possible besides these programs. For example, at the University of Michigan we have for some time used a primitive set of macros in SAS to estimate sampling errors for a wide variety of statistical analysis methods using repeated replication procedures. For extremely limited problems, some students have actually created worksheets in a few of the more popular spreadsheet programs to compute sampling error estimates from weighted cluster totals. Our goal here has been only to catalogue currently available software that any user could obtain access to.

Finally, there are bound to be new programs developed, and modifications and enhancements of the programs we have listed. For example, David Bellhouse at the University of Western Ontario produced the TREES sampling error estimation program several years ago. While he is not sure it will operate on current computing platforms, he is developing new software that uses computer algebra to derive formulae for variances of specified estimators. It may be useful to prepare a revised catalogue as undiscovered existing software, new software, or important modifications of existing software become available to us.

References

Institute for Social Research (1984). The OSIRIS Statistical Software System. Ann Arbor, MI: Institute for Social Research.

Kalton, G. (1979). Ultimate cluster sampling. Journal of the Royal Statistical Society, Series A 142 (2) 210-222.

Kish, L., and Frankel, M. (1970). Balanced repeated replications for standard errors, Journal of the American Statistical Association 65, 1071-1094.

Kish, L., and Frankel, M. (1974). Inference from complex samples (with discussion), Journal of the Royal Statistical Society, Series B, 36, 1-37.

Lee, E.L., Forthofer, R.N., and Lorimor, R.J. (1989). Analyzing Complex Survey Data. Beverly Hills, CA: Sage Publications, Inc.

Lehtonen, R., and Pahkinen, E.J. (1995). Practical Methods for Design and Analysis of Complex Surveys. Chichester: John Wiley and Sons.

Skinner, C.J., Holt, D., and Smith, T.M.F. (1989). Analysis of Complex Surveys. Chichester: John Wiley and Sons.

Acknowledgement: International Association of Survey Statisticians

The above article appears in the current issue of The Survey Statistician (No. 35, December 1996, 10-17). It is an introduction to a new section that will contain related articles in future issues.

The Survey Statistician is the newsletter of the International Association of Survey Statisticians (IASS) and is distributed twice a year in English and French to members of the association.

Statisticians interested in membership in the IASS can write directly to the Secretary of the IASS c/o Insee DR Aquitaine, Attn. Mme Claude OLIVIER, Bureau 136, 33 rue de Saget, 33076 Bordeaux Cedex, France. Alternatively, send an email to the editor of The Survey Statistician (brickm1@westat.com) with your name and mailing address and an IASS membership application will be mailed to you.