Students in sampling methods courses are often
surprised to learn that cluster sampling generates positive correlations
among sample elements and consequent losses in precision relative
to the simple random selections they have been trained elsewhere
to analyze. Their surprise turns to dismay when they learn that
the standard statistical analysis packages that they use routinely
do not account for this lost precision correctly.
One of the next questions they ask is, "What
software is there that does take the cluster sampling into account
in an analysis?" The appropriate answer depends somewhat
on how much material has been covered in the course about
design-based inference. The methods of inference from complex samples
and the statistical techniques that support them are covered in
several textbooks and monographs, but those topics are often more
than can be covered adequately for students in an introductory
sampling methods course. A satisfactory initial answer to the
question about software for design-based survey inference may
be a brief catalogue of suitable packages.
Our purpose in this article is to give such a catalogue.
We are interested in currently available statistical estimation
software which accounts for clustering and other complex features
of stratified multistage samples. We do not attempt to give
even a limited treatment of the inferential and analytic methods
that underlie this software. Readers interested in the inferential
basis or particular techniques might examine the monograph by
Lee, Forthofer, and Lorimor (1989), or study the more advanced
treatments given in Skinner, Holt, and Smith (1989) or Lehtonen
and Pahkinen (1995).
The small catalogue of suitable software given here
is further limited to those programs or packages that are currently
available on personal computers. Thus, we do not discuss OSIRIS
(Institute for Social Research, 1982), one of the earliest packages
to provide such software. OSIRIS offered a suite of programs that
accounted for complex design features in estimation of means,
proportions, and regression statistics, but its sampling error
software is at present available only for mainframe computers.
We list only commercial or documented free-ware statistical
software packages that are currently available for use by the
general survey data analyst. The eight packages catalogued here
are
CENVAR,
CLUSTERS,
Epi Info,
PC CARP,
Stata,
SUDAAN,
VPLX,
WesVarPC.
There are no doubt other packages that are commercially
or otherwise available to the general survey analyst that we have
failed to list. We apologize for our oversight, and welcome the
opportunity to list other software in a future issue of the Survey
Statistician. Please contact Jim Lepkowski at the Institute
for Social Research, 426 Thompson Street, Ann Arbor, Michigan
48103, or by e-mail at jimlep@umich.edu.
Most of these packages have more extensive features
than only estimation for complex sample survey data, incorporating
software for processing and managing survey data as well. The
presentation does not attempt to address these more extensive
features, concentrating on facilities for handling complex design
features. These packages are described in alphabetic order following
a brief description of specific features. We conclude with a
few remarks about sampling error estimation, the more commonly
used statistical analysis software packages, and a brief listing
of features we think sampling error software ought to incorporate
in the future.
Sampling Error Estimation
Unequal probabilities of selection and compensatory
weights, nonresponse and noncoverage compensatory weights, population
control adjustment, poststratification, stratification of sampling
units, and multistage selection are widely used features in "complex
sample designs". At a minimum, survey data must have a weight,
stratum identifiers, and sampling unit identifiers for each responding
sample element. Sampling error estimation software must have
the capacity to account for weights, stratification, and sampling
units in the estimation process.
Nearly all standard statistical software handles
weights correctly when computing point estimates for most of the
analyses provided. Few packages handle the weights correctly for
variance estimates. Further, only one, Stata, has estimation features
to account for the stratification and multistage selection employed
in the design.
The programs listed in the next section all report
handling weights, stratification, and multistage selection correctly
in estimation of point and variance estimates. They all require
the specification of weights, strata, and sampling units for each
sample element. They do not all handle every conceivable sample
design in an unbiased fashion. For example, primary sampling
units in most stratified multistage sample designs are selected
with probabilities proportionate to size and without replacement.
Only one program in the list, SUDAAN, has features to handle
explicitly this type of design. However, all listed programs
will handle such a design under an ultimate cluster sample selection
model (Kalton, 1979). Under the ultimate cluster sampling model,
elements within primary sampling units are divided into ultimate
clusters and a without replacement selection of those ultimate
clusters is chosen across primary sampling units. Variance estimates
are computed using only between first stage unit totals without
having to compute variance components at each stage of selection.
All programs use the ultimate cluster sampling model in variance
estimation, although SUDAAN also has features to estimate variances
for designs employing without replacement selection of primary
sampling units.
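The between-unit computation described above can be sketched in a few
lines of Python. This is only an illustration under stated assumptions,
not the code of any listed package: the function name and the
(stratum, psu, weight, y) record layout are hypothetical, and the
formula is the usual between-PSU estimator for a weighted total, with
no finite population correction.

```python
from collections import defaultdict


def ultimate_cluster_variance(records):
    """Estimate the variance of a weighted total from PSU totals only.

    records: iterable of (stratum, psu, weight, y) tuples, one per
    responding sample element (a hypothetical layout for illustration).
    """
    # Weighted total of y for each first stage unit within each stratum.
    psu_totals = defaultdict(lambda: defaultdict(float))
    for stratum, psu, weight, y in records:
        psu_totals[stratum][psu] += weight * y

    variance = 0.0
    for stratum, totals in psu_totals.items():
        z = list(totals.values())
        n_h = len(z)
        if n_h < 2:
            raise ValueError(f"stratum {stratum!r} needs at least two PSUs")
        z_bar = sum(z) / n_h
        # Between-PSU sum of squares; no within-PSU components are needed.
        variance += n_h / (n_h - 1) * sum((z_i - z_bar) ** 2 for z_i in z)
    return variance


# Small made-up data set: two strata, five PSUs, seven elements.
example = [
    ("A", 1, 10.0, 1.0), ("A", 1, 10.0, 0.0),
    ("A", 2, 10.0, 1.0), ("A", 2, 10.0, 1.0),
    ("B", 1, 5.0, 2.0), ("B", 2, 5.0, 4.0), ("B", 3, 5.0, 6.0),
]
print(ultimate_cluster_variance(example))
```

Only the weight, stratum identifier, and sampling unit identifier of
each element enter the calculation, which is why those three items are
the minimum requirement for every program in the list.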
There are two principal methods of variance estimation
employed for complex sample designs: Taylor series approximation
and repeated replication (see Wolter, 1985). Repeated replication
methods include the method of random groups, the jackknife, balanced
half samples, bootstrap methods, and various modifications of
these methods. The Taylor series approximation and the repeated
replication methods do not produce identical estimates of sampling
error, but empirical investigations (Kish and Frankel, 1970; Kish
and Frankel, 1974) have shown that the differences between the methods
for many statistics are small. The important practical implication
is that a Taylor series approximation must be developed analytically
for each statistic, while repeated replication uses the same basic
estimation method regardless of the statistic being estimated.
Most of the listed programs use Taylor series approximations
to compute sampling error estimates. The repeated replication
programs in the list offer many of the basic methods, except the
bootstrap.
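To make the contrast concrete, the following Python sketch computes the
variance of a weighted mean (a ratio of two weighted totals) both ways.
It is a minimal illustration, not the algorithm of any listed program:
the function names and record layout are hypothetical, the Taylor
version uses the standard linearization of a ratio, and the replication
version is a delete-one-PSU jackknife, one of several jackknife variants.

```python
from collections import defaultdict


def psu_sums(records):
    """Collapse (stratum, psu, weight, y) records to PSU-level sums.

    Returns {stratum: [(sum of w*y, sum of w), ...]}, one pair per PSU.
    """
    sums = defaultdict(lambda: defaultdict(lambda: [0.0, 0.0]))
    for stratum, psu, weight, y in records:
        sums[stratum][psu][0] += weight * y
        sums[stratum][psu][1] += weight
    result = {h: [tuple(v) for v in psus.values()] for h, psus in sums.items()}
    if any(len(psus) < 2 for psus in result.values()):
        raise ValueError("every stratum needs at least two PSUs")
    return result


def grand_totals(sums):
    y_total = sum(y for psus in sums.values() for y, _ in psus)
    x_total = sum(x for psus in sums.values() for _, x in psus)
    return y_total, x_total


def weighted_mean(sums):
    y_total, x_total = grand_totals(sums)
    return y_total / x_total


def taylor_variance(sums):
    """Linearization: the ratio needs its own derived variate u."""
    y_total, x_total = grand_totals(sums)
    r = y_total / x_total
    variance = 0.0
    for psus in sums.values():
        u = [(y - r * x) / x_total for y, x in psus]
        n_h = len(u)
        u_bar = sum(u) / n_h
        variance += n_h / (n_h - 1) * sum((v - u_bar) ** 2 for v in u)
    return variance


def jackknife_variance(sums):
    """Replication: recompute the same statistic with each PSU dropped."""
    y_total, x_total = grand_totals(sums)
    r = y_total / x_total
    variance = 0.0
    for psus in sums.values():
        n_h = len(psus)
        y_h = sum(y for y, _ in psus)
        x_h = sum(x for _, x in psus)
        scale = n_h / (n_h - 1)  # reweight the PSUs left in the stratum
        for y_i, x_i in psus:
            y_rep = y_total - y_h + scale * (y_h - y_i)
            x_rep = x_total - x_h + scale * (x_h - x_i)
            variance += (n_h - 1) / n_h * (y_rep / x_rep - r) ** 2
    return variance


# Made-up data with unequal cluster sizes, so the two methods differ.
example = [
    ("A", 1, 10.0, 1.0), ("A", 1, 10.0, 0.0),
    ("A", 2, 10.0, 1.0), ("A", 2, 10.0, 1.0), ("A", 2, 10.0, 0.0),
    ("B", 1, 5.0, 2.0), ("B", 2, 5.0, 4.0), ("B", 2, 5.0, 6.0),
]
sums = psu_sums(example)
print(weighted_mean(sums), taylor_variance(sums), jackknife_variance(sums))
```

Changing the statistic would require a new derivation inside
taylor_variance, whereas jackknife_variance would need only a different
formula for the replicate estimate.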
A variety of statistical methods are handled by the
listed programs. Some estimate sampling variances and related
statistics (design effects, misspecification effects, intracluster
homogeneity) only for means, totals, and proportions for the total
sample, for subclasses of the total sample, and for differences
between subclasses. Others estimate sampling variances for regression
and logistic regression statistics. Nearly all estimate test
statistics based on these sampling variances. A few compute variance
estimates and associated test statistics for survival analysis,
contingency table analysis, generalized estimating equation models,
specialized ratio estimates, and standardized rates.
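The related statistics mentioned above follow directly from a variance
estimate. The Python sketch below shows one common set of definitions
for an estimated proportion. The function name and arguments are
hypothetical, the simple random sampling variance is taken as
p(1 - p)/(n - 1), the intracluster homogeneity comes from the
approximation deff = 1 + (b - 1) roh with b the average cluster size,
and individual programs may use somewhat different conventions.

```python
import math


def summarize_proportion(p, complex_variance, n_elements, n_clusters):
    """Derive common sampling error summaries for an estimated proportion.

    p: weighted estimate of the proportion
    complex_variance: design-based variance estimate for p
    n_elements, n_clusters: unweighted counts of elements and PSUs
    """
    se = math.sqrt(complex_variance)
    # Variance expected from a simple random sample of the same size.
    srs_variance = p * (1.0 - p) / (n_elements - 1)
    deff = complex_variance / srs_variance
    b = n_elements / n_clusters  # average cluster size (assumed > 1)
    return {
        "standard_error": se,
        "cv": se / p,
        "ci95": (p - 1.96 * se, p + 1.96 * se),
        "deff": deff,
        "roh": (deff - 1.0) / (b - 1.0),
    }


print(summarize_proportion(0.5, 0.005, n_elements=101, n_clusters=10))
```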
Most of the listed programs operate under DOS. Two
of the DOS-based programs (CENVAR and Epi Info) use menus from which
users specify options. The other DOS programs
(CLUSTERS, PC CARP, VPLX) use keyword input. There are also two
programs that operate under Microsoft Windows. One of these programs
(WesVarPC) employs "pull down menus" for user specification
of program options, while another (SUDAAN version 7) employs a
combination of pull down menus and keyword input.
We have not attempted for this brief catalogue to
compare the estimates produced by these programs, except in limited
testing to be sure the software could be loaded and operated successfully
on our microcomputers. A small number of cross-tabulations were
prepared from a modest sized sample survey data set (n
= 3,617) with all programs except VPLX and PC CARP. Estimated
proportions, standard errors, and coefficients of variation were
similar for all the programs in this testing. More extensive
comparisons are needed across sample surveys of different sizes
and for more statistics, but those comparisons are beyond the
scope of our current review.
Finally, five of these packages (CENVAR, CLUSTERS,
Epi Info, VPLX, and WesVarPC) are available for free, or at a
nominal charge to handle processing and shipping. Access is made
even easier for all but CLUSTERS since they can be downloaded
over the World Wide Web (WWW) using standard network browsing
software. Information is available over the WWW for all but one
of the programs. In the listings below, e-mail addresses to obtain
further information are given for all software, and WWW access
to information or to the software and documentation itself is
also given.
Sampling Error Software
CENVAR
(U.S. Bureau of the Census; contact International Programs Center,
U.S. Bureau of the Census, Washington, DC 20233-8860; e-mail
IMPS@census.gov; WWW access to software at
http://www.census.gov/ftp/pub/ipc/www/imps.html).
CENVAR is a component of a statistical software system
designed by the U.S. Bureau of the Census for processing, management,
and analysis of complex survey data, the Integrated Microcomputer
Processing System (IMPS). The program operates in a DOS environment
but is screen oriented and completely menu driven. Sample designs
that can be accommodated include simple random sampling, stratified
random element sampling, and multistage cluster samples with equal
or unequal probabilities of selection. These sample designs are
all addressed through the ultimate cluster sampling model. CENVAR
uses Taylor series approximation for variance estimation, based
on PC CARP software developed by Iowa State University (see below).
Users must have a PC compatible microcomputer with 640K RAM and
at least 10 Mb of hard disk space, DOS 3.2 or higher, and a printer
capable of 132 characters per line. A math co-processor is highly
recommended for Intel 286, 386, and 486SX processors. IMPS version
3.1 is easily downloaded from the WWW site, and installation is
relatively straightforward. Only a subset of the IMPS needs to
be installed to run CENVAR. Input data must be in ASCII format,
and users must create an IMPS dictionary to read data. Blank
fields are not acceptable and must be filled with zeroes before
IMPS can read the data. Documentation is available in WordPerfect
5.1 format. CENVAR is not difficult to use, but users must be
cautious about missing data; it appears to us in casual use that
it will not accept missing values for any variables for a record.
CENVAR produces sampling error estimates for means, proportions,
and totals for the total sample as well as specified subclasses
in a tabular layout. In addition to the standard error, 95% confidence
interval limits, coefficients of variation, design effects, and
unweighted sample sizes are given. We did experience some problems
with the calculation of the design effects, so users may want to
double-check the figures.
CLUSTERS
(World Fertility Survey; contact Vijay Verma, 105 Park Road,
Teddington (Middlesex), TW11 OAW, United Kingdom; e-mail
vjverma@essex.uk; no WWW access to software).
CLUSTERS is a stand alone program originally designed
by the staff of the World Fertility Survey, and later updated
and improved by Vijay Verma and Mick Price. It is now distributed
for a nominal charge for handling by Vijay Verma. The program
operates in a DOS environment and is keyword driven. Users must
prepare in a file keyword commands in an ASCII format, and the
location of that command file is passed to the program during
execution. The principal sample design is a stratified multistage
cluster sample, addressed through the ultimate cluster sampling
model. CLUSTERS uses Taylor series approximation for variance
estimation. Users must have a PC compatible microcomputer with
640K RAM and at least 2 Mb of hard disk space for the program,
DOS 3.2 or higher, and a printer capable of 132 characters per
line. A math co-processor is highly recommended for Intel 286,
386, and 486SX processors. The program comes with an extensive
set of sample set ups and sample data sets for testing installation.
CLUSTERS itself is easily installed, although some users may
find execution a bit confusing until they learn how to specify
the various required and optional input and output files. Input
data must be in ASCII format, and users must create a dictionary
in the command file to read data. Documentation is available
in ASCII format. CLUSTERS produces sampling error estimates for
means and proportions for the total sample as well as specified
subclasses and subclass differences in a spreadsheet-like layout.
In addition to the standard error, coefficients of variation,
design effects, unweighted sample sizes, and intracluster homogeneity
estimates are given.
Epi Info
(U.S. Centers for Disease Control and Prevention; contact Andrew G.
Dean, MD, Epidemiology Program Office, Mailstop C08, Centers for
Disease Control and Prevention, Atlanta, GA 30333; e-mail
AGD1@epo.em.cdc.go or EpiInfo@cdc1.cdc.gov; WWW access to software
at http://www.cdc.gov/epo/epi/epi.html).
Epi Info is an epidemiological and statistical software
system designed by the U.S. Centers for Disease Control and Prevention.
It has features for processing, managing, and analyzing epidemiological
data, including complex survey data (CSAMPLE component). The
program operates in a DOS environment but is screen oriented.
Documentation is available on-line in the program, and can be
printed chapter-by-chapter. The basic sample design that can
be accommodated is stratified multistage cluster sampling through
the ultimate cluster sampling model. Epi Info uses Taylor series
approximation for variance estimation. Users must have a PC compatible
microcomputer with 640K RAM and at least 10 Mb of hard disk space,
DOS 3.2 or higher, and a printer capable of 132 characters per
line. A math co-processor is highly recommended for Intel 286,
386, and 486SX processors. Epi Info is easily downloaded from
the WWW site, but make sure you download a copy of the README.TXT
file to get valuable information about installation
on a PC or network. Installation is relatively straightforward,
although it appears necessary for the user to create the needed
subdirectories in advance. We loaded the entire Epi Info program
even though we only needed a subset to estimate sampling errors.
Input data can be from DBF, Lotus, or ASCII format. Users must
create a questionnaire in Epi Info in the EPED component to read
ASCII data; DBF and Lotus format entry are easier. The CSAMPLE
component is entirely menu-driven. Epi Info produces sampling
error estimates for means and proportions for the total sample
as well as for subclasses specified in a two-way layout. The
printed output includes only unweighted frequencies, weighted
proportions or means, standard errors, 95% confidence interval
limits, and design effects.
PC CARP
(Iowa State University; contact Sandie Smith, Statistical Laboratory,
219 Snedecor Hall, Ames, IA 50011; e-mail sandie@iastate.edu; WWW
access to software at
http://www.statlib.iastate.edu/survey/software/pccarp.html).
PC CARP is the PC version of the mainframe program
SUPER CARP, Complex Analysis Regression Program, developed at
Iowa State University. PC CARP can be used to estimate standard
errors for means, proportions, quantiles, ratios, differences
of ratios, and entries of two-way contingency tables, and to compute
test statistics for those tables. There are three companion programs
which enlarge the range of analyses available: PC CARPL for logistic
regression; POSTCARP for poststratified estimates of totals, ratios,
and differences of ratios; and EV CARP for regression analysis,
including regression with measurement error in the explanatory variables.
The program operates in a DOS environment and is keyword oriented.
The programs are designed to handle stratified multistage cluster
samples with finite population corrections at up to two stages
of selection. PC CARP uses Taylor series approximation for variance
estimation. Users must have a PC compatible microcomputer with
640K RAM, and a math co-processor is highly recommended for Intel
286, 386, and 486SX processors. PC CARP and its companion programs
may be purchased from the Statistical Laboratory at Iowa State
University at $300. Input data can be in ASCII format, and users
must create a dictionary to read data. Printed documentation
is distributed with the software.
Stata
(Stata Corporation; contact Stata Corporation, 702 University Drive
East, College Station, TX 77840; e-mail stata@stata.com; WWW site
http://www.stata.com).
Stata is a programmable statistical analysis software
system which has recently begun to introduce commands which allow
users to compute sampling error estimates for many statistics.
The program is available for DOS and Windows environments and
is keyword driven. Menus and help screens are available in the
Windows version. We have only recently obtained preliminary documentation
for the Stata commands for survey data analysis, and we are not
sure what sample designs can be accommodated. Stata survey analysis
commands use Taylor series approximation for variance estimation.
Stata itself urges users to have a floating point processor on
their PC, and Windows users are recommended to have 8 Mb of RAM
and at least 4 Mb of hard disk space. The list price for Stata
is $945 for commercial and $395 for academic users, and the survey
analysis commands are included as part of the package. Since
we have not yet purchased the software, we cannot comment on installation
or use of the survey analysis commands, or input data format and
documentation. The current survey analysis commands include svymean,
svytotal, svyratio, and svyprop for means, totals, ratios, and
proportions. The commands svyreg, svylogit, and svyprobt are
available for the obvious regression, logistic regression, and
probit analysis procedures. The commands svylc and svytest allow
estimation of linear combinations of parameters and hypothesis
tests. The command svydes allows the user to describe the specific
sample design and should be used prior to any of the above commands.
There are plans to add survey analysis commands for estimating
distribution functions and quantiles, contingency table analysis,
missing data compensation, and other analyses.
SUDAAN
(Research Triangle Institute; contact SUDAAN Product Coordinator,
Statistical Software Center, Research Triangle Institute, 3040
Cornwallis Road, Research Triangle Park, NC 27709-2194; e-mail
SUDAAN@rti.org; WWW site http://www.rti.org/patents/sudaan.html).
SUDAAN is a statistical software package for analysis
of correlated data, including complex survey data. It provides
facilities for estimation of a range of statistics and their associated
sampling errors, including means, proportions, ratios, quantiles,
cross-tabulations, odds ratios; linear, logistic, and proportional
hazards regression models; and contingency table analysis. The
program uses Taylor series approximations for variance estimation.
It accommodates with and without replacement selection of first
stage units, including components of variance, as well as simple
random sampling and stratified element sampling designs. SUDAAN
is available for PCs under MS DOS as well as for Windows, and
prices vary depending on the type of firm, renewal status, and
number of licenses. For example, the single license price for
a new license PC version of SUDAAN 6.53 for commercial companies
and government agencies is $995; the Windows version 7.0 is $1495.
Current pricing lists are available directly from RTI. Although
the most expensive of the software programs listed, users get
a lot for their money. SUDAAN will read directly PC SAS data
sets (but not the SAS for Windows data sets). The program is
keyword driven, even in the Windows version, but the keyword syntax
is very much like SAS. ASCII data sets can also be used in SUDAAN.
Printed documentation is provided with each licensed copy of
SUDAAN.
VPLX
(U.S. Bureau of the Census; contact Robert E. Fay, U.S. Bureau of
the Census, Room 3067, Bldg. 3, Washington, DC 20233-9001; e-mail
rfay@census.gov; WWW access to software at
http://www.census.gov/sdms/www/vwelcome.html).
VPLX is a stand alone program for variance estimation
designed and used by the U.S. Bureau of the Census for complex
survey data. The program operates in a DOS environment and is
keyword driven. VPLX is primarily designed for stratified multistage
cluster samples under the ultimate cluster sampling model. It
uses repeated replication methods for variance estimation, including
a random group, a jackknife, and a balanced repeated replication
procedure. Users must have a PC compatible microcomputer with
8 Mb RAM and at least 3 Mb of hard disk space, and DOS 5.0 or
higher. A math co-processor is highly recommended for Intel 286,
386, and 486SX processors. VPLX is easily downloaded from the
WWW site, and installation is a matter of copying files to directories
(there is no separate installation program). Input data must
be in ASCII format, and users must create a dictionary to read
data. Documentation is available from the Web site in Adobe format.
An Adobe reader is also readily downloaded from the WWW. VPLX
is more complicated to use than other keyword driven programs
on this list and, like CLUSTERS, requires that a program file be
developed separately in ASCII format as input to the program.
It produces sampling error estimates for means, proportions,
and totals for the total sample as well as specified subclasses.
WesVarPC
(Westat, Incorporated; contact Westat, Inc., 1650 Research Blvd.,
Rockville, MD 20850-3129; e-mail WESVAR@westat.com; WWW access to
software at http://www.westat.com/wesvarpc/index.html).
WesVarPC is a statistical software system designed
by Westat, Inc. for analysis of complex survey data. The program
operates in a Windows (3.1, 3.11, and 95) environment and is completely
menu driven. The primary sample design that can be accommodated
is a stratified multistage cluster sample based on the ultimate
cluster sampling model. WesVarPC uses repeated replication for
variance estimation, including jackknife, balanced half sample,
and the Fay modification to the balanced half sample method.
Users must have a PC compatible microcomputer with 4Mb RAM and
at least 10.1 Mb of hard disk space, and Windows 3.1, 3.11, or
95. It is easily downloaded from the WWW site, and installation
is probably the easiest among all the software listed using the
installation software provided. Documentation is in an Adobe
format, but instructions are provided on how to download the Adobe
reader. The documentation is the easiest to read and best laid
out among all the programs listed. Input data can be in ASCII
format, or DBF, SPSS for Windows, SAS Transport, or PC SAS for
DOS format. Input from any one of these formats is easy in the
menu driven environment. WesVarPC requires that a new version
of the data set be created in a special WesVarPC format. This
requires the specification of replicates and, if poststratification
is to be incorporated into the variance estimates, replicate weights.
Users unfamiliar with these procedures may find the specifications
slightly confusing. WesVarPC has facilities at present for contingency
table analysis, regression, and logistic regression. There is
an extensive menu driven system for creating new variables which
extends the range of statistics that WesVarPC can be used for.
Output is in a list format with one line per statistic in a simple
format. The format is not suitable for publication, but it can
be saved to a file for processing in a spreadsheet or other program.
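The replicate weights behind the balanced half sample method and its
Fay modification can be illustrated with a short Python sketch. This
is not WesVarPC's implementation: the function names and data layout
are hypothetical, the design is assumed to have exactly two PSUs per
stratum, and the balancing pattern is taken from a Hadamard matrix
built by the Sylvester construction. With fay_k = 0 the sketch reduces
to ordinary balanced half samples.

```python
def hadamard(order):
    """Sylvester construction; order must be a power of two."""
    h = [[1]]
    while len(h) < order:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h


def brr_variance(strata, estimator, fay_k=0.0):
    """Balanced repeated replication variance for a paired-PSU design.

    strata: list of (psu_a, psu_b); each PSU is a list of (weight, y).
    estimator: function mapping a list of (weight, y) pairs to a number.
    fay_k: Fay coefficient in [0, 1); 0 gives standard half samples.
    """
    if not 0.0 <= fay_k < 1.0:
        raise ValueError("fay_k must satisfy 0 <= fay_k < 1")
    full = estimator([el for psu_a, psu_b in strata for el in psu_a + psu_b])

    # Smallest power of two larger than the number of strata, so that
    # each stratum gets its own column after the all-ones column 0.
    order = 2
    while order <= len(strata):
        order *= 2

    sum_sq = 0.0
    for row in hadamard(order):
        replicate = []
        for col, (psu_a, psu_b) in enumerate(strata, start=1):
            up, down = (psu_a, psu_b) if row[col] == 1 else (psu_b, psu_a)
            # One PSU is weighted up by (2 - k), the other down by k.
            replicate += [(w * (2.0 - fay_k), y) for w, y in up]
            replicate += [(w * fay_k, y) for w, y in down]
        sum_sq += (estimator(replicate) - full) ** 2
    return sum_sq / (order * (1.0 - fay_k) ** 2)


def weighted_mean(data):
    return sum(w * y for w, y in data) / sum(w for w, _ in data)


# Made-up design: three strata with two PSUs each.
design = [
    ([(10.0, 1.0), (10.0, 0.0)], [(10.0, 1.0), (10.0, 1.0)]),
    ([(5.0, 2.0)], [(5.0, 4.0)]),
    ([(2.0, 3.0)], [(2.0, 8.0)]),
]
print(brr_variance(design, weighted_mean))
print(brr_variance(design, weighted_mean, fay_k=0.5))
```

Because the same replicate weights serve for any estimator passed in,
a data set prepared once with replicates can be reused for every
statistic, which is the reason for creating a special data set first.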
Concluding Remarks
It is a bit difficult to summarize briefly the features
of these sampling error estimation programs. Some have many components,
and we may not have portrayed their features completely. If there
are errors, we apologize. They were inadvertent, and we would
be happy to publish a revised description of any of the software.
Similarly, we are not sure we have been comprehensive in our
catalogue. If there are other programs you are aware of that
are available to the survey analysis public and that estimate sampling
errors taking stratified multistage sampling into account, please
let us know for future editions of the Survey Statistician.
Fortunately, our inaccuracies and omissions can,
and probably will, be corrected in future issues of the Survey
Statistician. We are in the process of inviting authors of
the software to prepare descriptions of their software for the
Survey Statistician. For example, in the next issue we
hope to present an article by the authors of IMPS and its sampling
error component CENVAR. Jim Lepkowski will serve as the editor
for this feature in future issues.
There are other approaches to sampling error estimation
possible besides these programs. For example, we have at the
University of Michigan for some time used a primitive set of macros
in SAS to estimate sampling errors for a wide variety of statistical
analysis methods using repeated replication procedures. For extremely
limited problems, some students have actually created worksheets
in a few of the more popular spreadsheet programs to compute sampling
error estimates from weighted cluster totals. Our goal here has
been only to catalogue currently available software that any user
could obtain access to.
Finally, there are bound to be new programs developed,
and modifications and enhancements of the programs we have listed.
For example, David Bellhouse at the University of Western Ontario
produced the TREES sampling error estimation program several years
ago. While he is not sure it will operate on current computing
platforms, he is developing new software that uses computer algebra
to derive formulae for variances of specified estimators. It
may be useful to prepare a revised catalogue as undiscovered existing
software, new software, or important modifications of existing
software become available to us.
References
Institute for Social Research (1984). The OSIRIS
Statistical Software System. Ann Arbor, MI: Institute for
Social Research.
Kalton, G. (1979). Ultimate cluster sampling.
Journal of the Royal Statistical Society, Series A 142
(2) 210-222.
Kish, L., and Frankel, M. (1970). Balanced repeated
replications for standard errors, Journal of the American Statistical
Association 65, 1071-1094.
Kish, L., and Frankel, M. (1974). Inference from
complex samples (with discussion), Journal of the Royal Statistical
Society, Series B, 36, 1-37.
Lee, E.L., Forthofer, R.N., and Lorimor, R.J. (1989).
Analyzing Complex Survey Data. Beverly Hills, CA: Sage
Publications, Inc.
Lehtonen, R., and Pahkinen, E.J. (1995). Practical
Methods for Design and Analysis of Complex Surveys. Chichester:
John Wiley and Sons.
Skinner, C.J., Holt, D., and Smith, T.M.F. (1989).
Analysis of Complex Surveys. Chichester: John Wiley and
Sons.
Acknowledgement: International Association of Survey
Statisticians
The above article appears in the current issue of The Survey Statistician
(No. 35, December 1996, 10-17). It is an introduction to a new section that
will contain related articles in future issues.
The Survey Statistician is the newsletter of the International Association of Survey Statisticians (IASS) and is distributed twice a year in English and French to members of the association.
Statisticians interested in membership in the IASS can write directly to
the Secretary of the IASS, c/o Insee DR Aquitaine, Attn. Mme Claude
OLIVIER, Bureau 136, 33 rue de Saget, 33076 Bordeaux Cedex, France.
Alternatively, send an email to the editor of The Survey Statistician
(brickm1@westat.com)
with your name and mailing address and an IASS membership application
will be mailed to you.