Guidance on Data Analysis

[NCCDPHP] Oral Health Basic Screening Survey for Children

Form Approved
OMB No. 0920-1346

GUIDANCE ON HOW TO ANALYZE DATA FROM A SCHOOL-BASED ORAL HEALTH SURVEY
JULY 2013, UPDATED JUNE 2015 AND JULY 2017
Due to the technical nature of this topic, this information will be most helpful to data analysts,
epidemiologists and statisticians.
Has your state, territory or local health agency conducted a school-based oral health survey?
If yes, then you probably have questions about how to appropriately weight and analyze the data to best
represent the target population of your survey. The purpose of this document is to provide a basic framework
for how to appropriately analyze data from a statewide school-based oral health survey that employed a
complex sampling design. Because no one method is appropriate for all states/territories, we encourage you to
read this document, and then contact ASTDD for additional guidance on analyzing data for your state/territory.
Although this document is geared towards states and territories, the techniques are appropriate for other
jurisdictions such as counties.
This topic is important because most oral health surveys employ a complex sampling design that may include
stratification, unequal selection probabilities and clustering. To obtain valid point estimates, standard errors,
and confidence intervals, analysis must account for the sampling design. Simply doing a weighted analysis using
statistical procedures like SAS Proc Freq is not appropriate because the variance estimation in such procedures
uses formulas appropriate for simple random sampling rather than complex sampling. These formulas do not
account for stratification or clustering. Ignoring the design may result in biased point estimates of population
parameters (in an unweighted analysis) and/or underestimated standard errors and confidence intervals that are
too narrow.
This document is limited to a discussion of data weighting and analysis. For additional information on how to
conduct and use data from a school-based oral health survey, please refer to the Basic Screening Survey (BSS)
tools developed by the Association of State and Territorial Dental Directors (ASTDD). These tools are available at
the following website: www.astdd.org/basic-screening-survey-tool/.
Do you want to submit the data to the National Oral Health Surveillance System (NOHSS)?
NOHSS (www.cdc.gov/oralhealthdata/) is a collaborative effort between CDC's Division of Oral Health and
ASTDD. NOHSS is designed to monitor the burden of oral disease, use of the oral health care delivery system,
and the status of community water fluoridation on both a national and state level. NOHSS tracks oral health
surveillance indicators based on data sources and surveillance capacity available to most states. The Council of
State and Territorial Epidemiologists (CSTE) and the Chronic Disease Directors (CDD) were instrumental in
developing the framework for chronic disease surveillance indicators, including the oral health indicators in
NOHSS. If you follow the guidance provided in this document and ASTDD’s sampling guidance document, your oral
health data will meet the specifications for inclusion in NOHSS.
Only oral health survey data that meet the following specifications are included in the NOHSS data system:
• The data are from a statewide probability sample of elementary schools.
• If a complex sampling scheme is used, the data must be weighted for the sampling scheme.
• ASTDD strongly suggests that, at minimum, 3rd grade children be screened. Grades K-2 as well as Head
Start may also be screened and are included in the NOHSS website.
NOTE: In some cases a state may be unable to follow this guidance. For example, because of small school size
and confidentiality issues, an IRB may require that school identifiers not be included in the dataset. If you
encounter such issues, please contact ASTDD for additional guidance.
Once the oral health data have been collected, what steps do I need to take to prepare the data for analysis?
There are several steps you should take to prepare for the analysis phase of your survey. Carefully review each
step and decide if it is appropriate for your situation.
Steps to take to prepare for analysis:
• Enter the data
• Clean the data
• Determine number screened at each school
• Determine number in sampling interval
• Calculate weight factor
Step 1: Enter the data into an electronic file that can be exported to an appropriate statistical package. Good
options for data entry systems include Epi Info (http://wwwn.cdc.gov/epiinfo/) and Microsoft Access. To
minimize or eliminate data entry errors it is important to have a “very smart” data entry system that can make a
variety of checks on the data while it is being entered.
Typically a useful data entry system checks each field for
valid values, inconsistencies in data across fields, skip patterns, etc. A good data entry system, just like a good
form, should be designed to be self-explanatory and easy to use. Examples of data entry systems using Epi Info
and Access are available from ASTDD.
Step 2: Clean the data file. If you used a smart data entry system, there should be very few data entry errors.
Make sure that each record includes the appropriate school code. A school code is necessary for calculating the
weight factor that will be used in the analysis. For additional information on data cleaning and preemptive data
cleaning techniques, refer to the following brief: http://www2.sas.com/proceedings/sugi26/p015-26.pdf . Once
you have selected the data entry system and statistical software package to use, it may be helpful to read briefs
or reference books specific to that system or package.
Step 3: Determine how many children were screened at each school. This can be accomplished by generating a
frequency distribution for school codes. The number of children screened at each school will be used as the
denominator in the weight factor calculation.
Step 4: Go back to the file you used to select the sample and determine how many children were enrolled in
each sampling interval. Link the sampling interval information to the participating school codes. The number of
children in the sampling interval will be partially dependent on the type of sampling strategy you used.
• If you used a probability proportional to size (PPS) sampling strategy, the number of children in the
sampling interval will be the same for each sampling interval in a given stratum. Refer to Example 1 and
Table 1.
• If you used a non-PPS sampling strategy, the number of children in the sampling interval will generally
be different for each sampling interval. Refer to Example 2 and Table 2.
• Refer to the school survey sampling guidance developed by ASTDD for additional information. The
sampling guidance is available at the following site: www.astdd.org/basic-screening-survey-tool/.
Step 5: Calculate the weight factor using the following formula. Each child in a particular school and grade will
have the same weight factor.
• Weight = (# of children in sampling interval) / (# of children screened in sampling interval)
o This formula reflects the reduction of the overall selection probability calculation:
(# enrolled in school / # in sampling interval) * (# children invited to participate / # enrolled in
school) * (# screened / # children invited to participate) = (# of children screened in sampling
interval) / (# of children in sampling interval)
o Note that number enrolled in school cancels out in the first and second terms and number of
children invited to participate cancels out in the second and third terms
o The analysis weight is the inverse of the reduced probability term: (# of children in sampling
interval) / (# of children screened in sampling interval)
NOTE: The number of children in the sampling interval is based on the sampling frame used for selecting
the sample which generally will be from the school year prior to the data collection year. These numbers
would be expected to be very close to current numbers.
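Steps 3 through 5 can be sketched in a few lines of code. The following Python sketch is illustrative only and is not part of the ASTDD tools: the function name, the school codes "A" and "B" and the record layout are hypothetical. The interval size (761.7) and the numbers screened (52 and 75) are taken from the first two selected schools in Table 1.

```python
from collections import Counter

def weight_factors(school_codes, interval_sizes):
    """Return {school_code: weight} following Steps 3-5.

    school_codes   -- one school code per screened child (cleaned data file)
    interval_sizes -- {school_code: number of children in that school's
                      sampling interval}, taken from the sampling frame
    """
    # Step 3: frequency distribution of school codes = number screened.
    screened = Counter(school_codes)
    weights = {}
    for code, n_screened in screened.items():
        # Step 4: every participating school must be linked to an interval.
        if code not in interval_sizes:
            raise KeyError(f"no sampling interval recorded for school {code!r}")
        # Step 5: weight = children in interval / children screened.
        weights[code] = interval_sizes[code] / n_screened
    return weights

# Hypothetical records: 52 children screened at school "A", 75 at school "B".
records = ["A"] * 52 + ["B"] * 75
# With PPS sampling every interval holds the same number of children.
intervals = {"A": 761.7, "B": 761.7}

weights = weight_factors(records, intervals)
rounded = {code: round(w, 3) for code, w in weights.items()}
print(rounded)
```

Every child screened at a given school receives that school's weight, so the resulting dictionary can be merged back onto the child-level file by school code.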

What statistical software package and program code should I use?
Analysis of data from surveys that employ a complex sampling design, such as a school-based oral health survey,
must account for the sampling design. Several statistical software packages are either (1) designed specifically to
analyze complex sample survey data or (2) have special procedures or modules to correctly analyze complex
sample survey data including SUDAAN, SAS, STATA, SPSS, Epi Info and R. All of these packages are appropriate
for the analysis of school-based oral health survey data; your decision for which package to use will probably be
based on availability, familiarity or cost. Both Epi Info and R are available at no cost to the user.
To help you with the analysis process, we have created sample program code for each of the packages listed and
have compared results from each package based on a sample data set from a recent state oral health survey of
kindergarten and 3rd grade children. Information about each statistical software package, except R, was
excerpted from Software for Analysis of YRBS Data (CDC 2016).
Definition of variables used in the sample program code:
• Grade – K=kindergarten, 3=third grade
• Race – 1=white, 2=black
• Cluster – a unique number for each school, primary sampling unit (PSU)
• Strata – geographic region of the state, stratification variable used in selecting sample
• Weight – analysis weight factor (# children in sampling interval / # children screened in interval)
• Untreated – does the child have untreated decay (0=No, 1=Yes)
• Treated – does the child have treated decay (0=No, 1=Yes)
• Experience – does the child have treated and/or untreated decay (0=No, 1=Yes). You will need to create
the variable “Experience” from “Treated” and “Untreated”
o If Untreated is missing and Treated is missing then Experience should be coded missing
o If Untreated=0 and Treated=0 then Experience=0
o If Untreated=1 or Treated=1 then Experience=1
• Sealants – does the child have dental sealants (0=No, 1=Yes)
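The coding rules for “Experience” can be written as a small function. This Python sketch is illustrative only; the function name is hypothetical and missing values are represented by None. The rules above do not say what to do when one variable is missing and the other is 0, so the sketch assumes that case should be coded missing.

```python
def decay_experience(untreated, treated):
    """Derive Experience (0=No, 1=Yes, None=missing) from Untreated and Treated.

    Inputs are 0, 1 or None (missing).
    """
    # Any recorded decay, treated or untreated, means decay experience.
    if untreated == 1 or treated == 1:
        return 1
    # Both variables observed and both negative.
    if untreated == 0 and treated == 0:
        return 0
    # Both missing (stated rule), or one missing and the other 0
    # (assumption in this sketch): experience cannot be determined.
    return None

examples = [(0, 0), (1, 0), (0, 1), (None, 1), (None, None), (0, None)]
for untreated, treated in examples:
    print(untreated, treated, decay_experience(untreated, treated))
```

Checking for a positive finding first avoids a common recoding error in which a missing value is silently treated as 0 or as a very large number.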
General items that deserve caution:
• Missing data: Each software package uses its own special coding for missing data, for example “.” in SAS
and “NA” in R. In some packages these special codes are treated as numeric values in comparisons and
calculations, sometimes very small and sometimes very large. Take care in recoding or creating new
variables to be sure that missing data are categorized as you intended.
• Subpopulation analyses, “By” statement dropping observations from data set: With complex sample
data, to get estimates for a subpopulation, such as male or Hispanic children, the statistical software
requires information about the sampling design, the strata, and primary sampling units (PSUs). Dropping
observations from the data set for children who are not in the subpopulation can result in loss of
information on some strata and PSUs, resulting in estimates that do not account for the correct number
of strata and PSUs. Using a “By” statement to get estimates for males and females in SAS, for example, is
equivalent to doing the analysis once dropping all of the females, and then again dropping all of the
males. The “By” statement is NOT the recommended way to get estimates for subpopulations for
many of the software packages. SUDAAN version 11 is an exception. Check the documentation for your
preferred statistical software to be sure you are using the correct syntax for proper subpopulation
analysis.
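The loss of design information can be seen by counting PSUs per stratum before and after dropping observations. The Python sketch below is illustrative only; the five records and the function name are hypothetical, and the variable names follow the definitions above.

```python
from collections import defaultdict

def psus_per_stratum(records):
    """Count the distinct PSUs (schools) present in each stratum."""
    psus = defaultdict(set)
    for rec in records:
        psus[rec["strata"]].add(rec["cluster"])
    return {stratum: len(clusters) for stratum, clusters in psus.items()}

# Hypothetical child-level records from two strata with two schools each.
records = [
    {"strata": 1, "cluster": 101, "race": 1},
    {"strata": 1, "cluster": 101, "race": 2},
    {"strata": 1, "cluster": 102, "race": 1},
    {"strata": 2, "cluster": 201, "race": 1},
    {"strata": 2, "cluster": 202, "race": 1},
]

full_design = psus_per_stratum(records)
# Dropping everyone outside the subpopulation (race == 2) before analysis:
after_dropping = psus_per_stratum([r for r in records if r["race"] == 2])

print(full_design)     # design as sampled
print(after_dropping)  # design as seen by the software after dropping
```

In this small example the subset file no longer contains stratum 2 at all and shows only one school in stratum 1, so software given only the subset cannot compute variances from the design that was actually used.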
SAS sample code: SAS versions 8 and higher include special sample survey procedures that are appropriate for
analyzing complex survey data. These sample survey procedures use SAS syntax that will be familiar to those
who already use SAS. There are three sample design statements in SAS: CLUSTER, where the name of the
primary sampling unit (PSU) is placed; STRATA, where the name of the stratification variable is placed; and
WEIGHT, where the name of the analysis weight variable is placed. Variables may be numeric or character. The
input data file does not need to be sorted by stratum and/or PSU variables before analysis.
Univariate analysis (data not presented):
PROC SURVEYFREQ ;
STRATA strata ;
CLUSTER cluster ;
WEIGHT weight ;
TABLES untreated experience sealants / cl ;
RUN ;
Table 3:
PROC SURVEYFREQ ;
STRATA strata ;
CLUSTER cluster ;
WEIGHT weight ;
TABLES grade*untreated / row cl ;
RUN ;
PROC SURVEYFREQ ;
STRATA strata ;
CLUSTER cluster ;
WEIGHT weight ;
TABLES grade*experience / row cl ;
RUN ;
PROC SURVEYFREQ ;
STRATA strata ;
CLUSTER cluster ;
WEIGHT weight ;
TABLES grade*sealants / row cl ;
RUN ;
Table 4:
PROC SURVEYFREQ ;
STRATA strata ;
CLUSTER cluster ;
WEIGHT weight ;
TABLES grade*race*sealants / row cl ;
RUN ;
Epi Info sample code: Epi Info includes a module for complex sample survey analysis. The analytic capabilities of
Epi Info are limited and are oriented towards public health field work applications. Sample design information is
entered into the appropriate box (Weight, PSU, Stratify by) in the dialog box that appears once an analysis
(Complex Sample Frequencies, Complex Sample Tables, Complex Sample Means) has been selected. You can
also use the syntax codes below. Variables may be numeric or character. The input data file does not need to be
sorted by stratum and/or PSU variables before analysis. IMPORTANT NOTE: As of June 2015, Epi Info does not
have the ability to appropriately generate subpopulation analyses. Using the “Select” statement will drop
observations and may impact information about the sampling design, strata, and primary sampling units (PSUs).
Dropping observations from the data set for children who are not in the subpopulation may result in loss of
information on some strata and PSUs, resulting in estimates that do not account for the correct number of strata
and PSUs.
Univariate analysis (data not presented):
FREQ untreated experience sealants STRATAVAR=strata WEIGHTVAR=weight PSUVAR=cluster
Table 3:
TABLES grade untreated STRATAVAR=strata WEIGHTVAR=weight PSUVAR=cluster
TABLES grade experience STRATAVAR=strata WEIGHTVAR=weight PSUVAR=cluster
TABLES grade sealants STRATAVAR=strata WEIGHTVAR=weight PSUVAR=cluster
Table 4:
SELECT grade="3"
TABLES race sealants STRATAVAR=strata WEIGHTVAR=weight PSUVAR=cluster
R sample code: R is an open source, freely available software. Users develop R “packages” for specific purposes.
Analysis of complex sample survey data requires the package “survey” developed by Thomas Lumley at
University of Washington. Details of R can be found at the R-Project website http://www.r-project.org/ and
further details of the survey package can be found at http://r-survey.r-forge.r-project.org/survey/index.html.
Variables may be numeric or character. The input data file does not need to be sorted by stratum and/or PSU
variables before analysis.
#Load the survey package and describe the sample design to R
library(survey)
BSS <- svydesign(ids=~Cluster, strata=~Strata, weights=~Weight, data=dat2)
Univariate analysis (data not presented):
uniana <- svymean(~Untreated+Experience+Sealants, BSS,na.rm = TRUE)
#Calculating Confidence Intervals
t1<- ftable(uniana)
UniTab = data.frame(Mean = t1[,1], CintLow = t1[,1]-1.96*t1[,2], CintHigh = t1[,1]+1.96*t1[,2])
#Rounding Table Values
UniTab[,1:3] <- round(100*UniTab[,1:3],1)
Table 3:
#Estimate proportions and standard errors within groups
vun<-svyby(~Untreated, ~Grade, svymean, design=BSS, keep.names=FALSE, na.rm = TRUE)
vexp<-svyby(~Experience, ~Grade, svymean, design=BSS, keep.names=FALSE, na.rm = TRUE)
vseal<-svyby(~Sealants, ~Grade, svymean, design=BSS, keep.names=FALSE, na.rm = TRUE)
#Calculating the Confidence Intervals
#SE() extracts the standard errors without relying on the name of the se column in the svyby output
v1 = vexp[,1:2]
v1$Untreated = vun$Untreated
v1$cll.Untreated= v1$Untreated - 1.96*SE(vun)
v1$clu.Untreated= v1$Untreated + 1.96*SE(vun)
v1$Experience = vexp$Experience
v1$cll.Experience= v1$Experience - 1.96*SE(vexp)
v1$clu.Experience= v1$Experience + 1.96*SE(vexp)
v1$Sealants = vseal$Sealants
v1$cll.Sealants= v1$Sealants - 1.96*SE(vseal)
v1$clu.Sealants= v1$Sealants + 1.96*SE(vseal)
#Rounding Table Values
v1[,2:10] <- round(100*v1[,2:10],1)
Table 4:
v3Seal<-svyby(~Sealants, ~Grade+Race, svymean, design=BSS, keep.names=FALSE, na.rm = TRUE)
# Calculating Confidence Interval and Rounding
# SE() extracts the standard errors without relying on the name of the se column in the svyby output
v3Seal$CintLow=v3Seal$Sealants - 1.96*SE(v3Seal)
v3Seal$CintHigh=v3Seal$Sealants + 1.96*SE(v3Seal)
v3Seal[,3:6] <- round(100*v3Seal[,3:6],1)
SUDAAN sample code: SUDAAN is specifically designed to analyze complex sample survey data. The user
describes the sample survey design in three statements: (1) by specifying an option for the DESIGN keyword on
the PROC statement, (2) by specifying the stratification and clustering (PSU) variables on the NEST design
statement, and (3) by specifying the analysis weight variable on the WEIGHT design statement. All variables
must be numeric. For this example, grade was recoded from K and 3 to 0 and 3. Data should be sorted by the
variables that appear on the NEST statement before analysis; otherwise procedure syntax must contain the
NOTSORTED option.
Univariate analysis (data not presented):
proc descript data=bss design=wr conf_lim=95;
nest strata cluster;
weight weight;
var EXPERIENCE UNTREATED SEALANTS;
catlevel 1 1 1;
run;
Table 3:
proc descript data=bss design=wr conf_lim=95;
nest strata cluster;
weight weight;
class GRADE;
var EXPERIENCE UNTREATED SEALANTS;
catlevel 1 1 1;
tables GRADE;
run;
Table 4:
proc descript data=bss design=wr conf_lim=95;
nest strata cluster;
weight weight;
class RACE;
var SEALANTS;
catlevel 1;
tables RACE;
subpopn GRADE=3;
run;
STATA sample code: STATA version 7 or higher offers the capability to perform many statistical procedures on
complex sample survey data. When performing menu-driven analyses, sample design information is entered
into boxes on the MAIN (PSU and stratification variables) and WEIGHT (analysis weight variable) tabs of the
dialogue box that appears after “declare survey design for data set” is chosen from the Survey Data Analysis
menu. If syntax is written the information is included on the SVYSET statement. The survey design descriptors
only need to be entered once at the beginning of the analysis session. Although variables in STATA data sets can
be numeric or character, all variables used in an analysis must be numeric. The input data file does not need to
be sorted by stratum and/or PSU variables before analysis.
Univariate analysis (data not presented):
svyset Cluster [pweight = Weight], strata(Strata)
svy linearized : proportion Sealants Untreated Experience
Table 3:
svy linearized : proportion Sealants Untreated Experience, over(Grade)
Table 4:
svy linearized : proportion Sealants Untreated Experience, over(Race Grade)
SPSS sample code: SPSS has an add-on module, SPSS Complex Samples, which includes sample selection and
analysis of complex sample survey data. When performing menu-driven analysis, the sample design information
is entered into a dialogue box when preparing for analysis and is saved as a sampling plan for the data set. Once
the sampling plan has been created, it will be opened along with the data set at the beginning of an SPSS
session. You can also use the following syntax code. Variables may be numeric or character. The input data file
does not need to be sorted by stratum and/or PSU variables before analysis.
*Analysis Preparation Wizard.
CSPLAN ANALYSIS
/PLAN FILE='...\astdd test sample.csaplan'
/PLANVARS ANALYSISWEIGHT=Weight
/SRSESTIMATOR TYPE=WOR
/PRINT PLAN
/DESIGN STRATA=Strata CLUSTER=Cluster
/ESTIMATOR TYPE=WR.
Univariate analysis (data not presented):
* Complex Samples Frequencies.
CSTABULATE
/PLAN FILE='...\astdd test sample.csaplan'
/TABLES VARIABLES=Untreated experience Sealants
/CELLS POPSIZE TABLEPCT
/STATISTICS CIN(95)
/MISSING SCOPE=TABLE CLASSMISSING=EXCLUDE.
Table 3:
* Complex Samples Frequencies.
CSTABULATE
/PLAN FILE='...\astdd test sample.csaplan'
/TABLES VARIABLES=Untreated
/SUBPOP TABLE=Grade DISPLAY=LAYERED
/CELLS POPSIZE TABLEPCT
/STATISTICS SE CIN(95)
/MISSING SCOPE=TABLE CLASSMISSING=EXCLUDE.
* Complex Samples Frequencies.
CSTABULATE
/PLAN FILE='...\astdd test sample.csaplan'
/TABLES VARIABLES=experience
/SUBPOP TABLE=Grade DISPLAY=LAYERED
/CELLS POPSIZE TABLEPCT
/STATISTICS SE CIN(95)
/MISSING SCOPE=TABLE CLASSMISSING=EXCLUDE.
* Complex Samples Frequencies.
CSTABULATE
/PLAN FILE='...\astdd test sample.csaplan'
/TABLES VARIABLES=Sealants
/SUBPOP TABLE=Grade DISPLAY=LAYERED
/CELLS POPSIZE TABLEPCT
/STATISTICS SE CIN(95)
/MISSING SCOPE=TABLE CLASSMISSING=EXCLUDE.
Table 4:
* Complex Samples Crosstabs.
CSTABULATE
/PLAN FILE='...\astdd test sample.csaplan'
/TABLES VARIABLES=Race BY Sealants
/SUBPOP TABLE=Grade DISPLAY=LAYERED
/CELLS POPSIZE ROWPCT
/STATISTICS SE CIN(95)
/MISSING SCOPE=TABLE CLASSMISSING=EXCLUDE.
Comparison of results: Tables 3 and 4 compare the results from each of the aforementioned statistical software
packages from a recent state oral health survey of kindergarten and 3rd grade children. Table 3 presents the
prevalence of decay experience, untreated decay, and dental sealants by grade while the prevalence of dental
sealants among 3rd grade children by race is presented in Table 4. Each of the six statistical software packages
described in this document can be used to appropriately analyze data from a school-based survey, although
Epi Info may not appropriately account for the correct number of strata and PSUs in subpopulation analyses.

Are there other things that I should consider or be aware of?
Yes, there are a variety of other issues that may impact your analysis or how you report your survey results.
Following is a short list that you should review. If you have additional questions or concerns please contact
ASTDD.
• Finite population correction: If more than 10% of children from within any given stratum are selected, you
may want to consider using a finite population correction factor, which reduces the estimated variance,
yielding smaller standard errors and narrower confidence intervals. For additional information on finite
population correction refer to Introduction to Survey Sampling (Kalton 1988).
• No data for a given sampling interval: If a school refuses to participate, we encourage you to select a
replacement school from the same sampling interval. Unfortunately, circumstances may result in an
inability to screen the original or a replacement school in a given sampling interval. If this happens you
should clearly report that you were not able to screen children in a sampling interval along with what that
interval represented. For example, if you selected 70 schools but you only have data for 69, report that
you are missing data from one sampling interval that represents children from region 3 attending schools
where 30-40% of the children are eligible for the National School Lunch Program (NSLP).
• Reporting response rates: For each school in your survey, you should collect the number of children
enrolled in the grade of interest on the day of the screening (or the number invited to participate if you
did not invite all children in a given grade). Your response rate for the survey will be the number screened
divided by the number enrolled or invited to participate.
• Stratifying results by school NSLP percentage: Many states use school NSLP percentage as a surrogate
measure of socioeconomic status. We recommend using the current year NSLP status of the school if
stratifying the results by NSLP; this information can be obtained from the school on the day of the
screening.
• Limitations of survey: When preparing your survey report, it is important to clearly state any limitations of
the survey including representativeness and response rate.
• Confidence intervals: Confidence intervals are important because they provide context for understanding
the precision or exactness of a point estimate. The wider the confidence interval, the less exact the point
value estimate becomes. Take, for example, a point estimate of 40% for the prevalence of dental caries
experience. If the confidence interval of this point estimate is 35%-45%, then we can have greater
certainty that the true prevalence is near 40% than if the confidence interval was 10-70%. For your data to
be included in NOHSS, confidence intervals must be included (unless you screened all children in your
target population).
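The effect of a finite population correction on a confidence interval can be illustrated with a short calculation. The Python sketch below is a simplification and not a substitute for survey software: it uses the simple random sampling form of the correction, sqrt(1 - n/N), whereas survey packages apply the correction within the sample design. The sample size, population size and standard error are hypothetical; the 1.96 multiplier is the one used in the R sample code.

```python
import math

def fpc_factor(n_sampled, n_population):
    """Finite population correction for a standard error: sqrt(1 - n/N)."""
    if not 0 < n_sampled <= n_population:
        raise ValueError("need 0 < n_sampled <= n_population")
    return math.sqrt(1 - n_sampled / n_population)

def confidence_interval(estimate, standard_error, z=1.96):
    """Return (lower, upper) limits of an approximate 95% confidence interval."""
    return (estimate - z * standard_error, estimate + z * standard_error)

# Hypothetical stratum in which 1,500 of 10,000 children (15%) were sampled.
factor = fpc_factor(1500, 10000)

# Hypothetical point estimate of 40% with an uncorrected standard error of 2.5%.
se_uncorrected = 0.025
se_corrected = se_uncorrected * factor

ci_uncorrected = confidence_interval(0.40, se_uncorrected)
ci_corrected = confidence_interval(0.40, se_corrected)
print(round(factor, 4), ci_uncorrected, ci_corrected)
```

The corrected interval is narrower than the uncorrected one, and the difference grows as the sampled fraction increases.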
Where can I get additional help?
ASTDD can help you with the survey analysis process. Please contact us if you have any questions.
Association of State & Territorial Dental Directors
Kathy Phipps, Data and Surveillance Coordinator
Phone: 805-776-3393, Email: kathyphipps1234@gmail.com

Acknowledgements
Supported by Cooperative Agreement NU5U8DP004919 from the Centers for Disease Control and Prevention. Its
contents are solely the responsibility of the authors and do not necessarily represent the official views of CDC.
ASTDD would like to thank Kathy Phipps, Michael Manz, Laurie Barker, Eugenio Beltran, Mei Lin, Liang Wei and
Srdjan Lesaja for their assistance in developing and reviewing this guidance.
References and additional resources
• Software for Analysis of YRBS Data, Division of Adolescent and School Health, Centers for Disease
Control and Prevention, September 2016
• Brogan D, Sampling error estimation for survey data. In: United Nations, Department of Economic and
Social Affairs, Household Sample Surveys in Developing and Transition Countries, 2005. Available at:
http://unstats.un.org/unsd/hhsurveys/pdf/Household_surveys.pdf
• Kalton G. Introduction to Survey Sampling. Quantitative Applications in the Social Sciences. 1988. Sage
Publications, Beverly Hills. Series/Number 07-035. ISBN 0-8039-2126-8.
• Kish L. Statistical Design for Research. 2004. John Wiley & Sons, Hoboken.

Example #1
Weight factor calculation for a survey that used systematic PPS sampling
with implicit stratification by region, urban/rural status and NSLP percent
This example shows the steps from sample selection to weight factor calculation when a probability
proportional to size (PPS) sampling strategy is used. Based on available resources, the decision was made to
include 70 schools in the “Utopia” oral health survey of 3rd grade children. The following sampling steps were
employed:
• The sampling frame list was sorted by region then by urban/rural status within each region
• Schools were then sorted by percent of children participating in NSLP within urban/rural school
categories.
Calculations used for selecting the systematic PPS sample:
• Sampling interval for sampling = (total 3rd grade enrollment) / (# of schools to be screened)
o 53,320 / 70 = 761.7
• Random start = random number between 0 and interval (761.7) = 148.0
o This is the first school selection number
o There are a variety of methods for selecting a random number including, but not limited to, Excel
and www.random.org
• Select the school with the 148th child. Add the sampling interval (761.7) to 148 to get the next school
selection number (909.7). Continue adding the sampling interval repeatedly until all 70 school selections
are made:
148.0, 909.7, 1671.4, 2433.1, 3194.8, 3956.5, 4718.2, …
• These numbers are matched to the cumulative enrollment numbers in the sampling list. The schools
with enrollment intervals containing the sample selection numbers are selected into the sample. The
sampling frame list and the selected schools are shown in Table 1.
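The selection steps above can be sketched in code. The following Python sketch is illustrative only; the function name is hypothetical, and it assumes that a selection number exactly equal to a cumulative enrollment value belongs to that school. The school names and enrollments are the first 12 rows of Table 1.

```python
from bisect import bisect_left
from itertools import accumulate

def systematic_pps(enrollments, interval, random_start, n_select):
    """Return 0-based positions of schools chosen by systematic PPS sampling."""
    cumulative = list(accumulate(enrollments))
    picks = []
    for k in range(n_select):
        selection_number = random_start + k * interval
        if selection_number > cumulative[-1]:
            break  # selection number falls beyond the end of this list
        # First school whose cumulative enrollment reaches the selection number.
        picks.append(bisect_left(cumulative, selection_number))
    return picks

# First 12 schools of the sampling frame in Table 1 (sorted as described above).
names = ["KEOWEE", "WALHALLA", "RAVENEL", "LAKEVIEW", "NINETY SIX", "NORTHSIDE",
         "MCCORMICK", "FAIR-OAK", "HICKORY TAVERN", "HOLLYWOOD",
         "CHEROKEE TRAIL", "PINECREST"]
enrollments = [38, 92, 91, 94, 130, 95, 56, 117, 61, 66, 53, 100]

picks = systematic_pps(enrollments, interval=761.7, random_start=148.0, n_select=2)
selected = [names[i] for i in picks]
print(selected)
```

With these inputs the first selection number (148.0) falls in the school whose cumulative enrollment runs from 131 to 221, and the second (909.7) falls in the school whose cumulative enrollment runs from 894 to 993.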

Weight factor calculation (Table 1):
• Weight = (number of children in sampling interval) / (number of children screened)
• When PPS sampling is used the number of children in the sampling interval will always be the sampling
interval used when selecting the sample, in this case, 761.7.

Table 1: Systematic PPS sampling with implicit stratification by region, urban/rural and NSLP participation
Region | Urban/Rural | School Name | National School Lunch Program % | 3rd Grade Enrollment | Cumulative Enrollment | Selected School | # of Children in Sampling Interval (A) | Number Screened (B) | Weight (A)/(B)
1 | Rural | KEOWEE | 37.1% | 38 | 38 | | | |
1 | Rural | WALHALLA | 39.8% | 92 | 130 | | | |
1 | Rural | RAVENEL | 52.3% | 91 | 221 | 148.0 | 761.7 | 52 | 14.648
1 | Rural | LAKEVIEW | 52.4% | 94 | 315 | | | |
1 | Rural | NINETY SIX | 52.7% | 130 | 445 | | | |
1 | Rural | NORTHSIDE | 52.8% | 95 | 540 | | | |
1 | Rural | MCCORMICK | 53.9% | 56 | 596 | | | |
1 | Rural | FAIR-OAK | 55.7% | 117 | 713 | | | |
1 | Rural | HICKORY TAVERN | 56.8% | 61 | 774 | | | |
1 | Rural | HOLLYWOOD | 57.2% | 66 | 840 | | | |
1 | Rural | CHEROKEE TRAIL | 57.8% | 53 | 893 | | | |
1 | Rural | PINECREST | 60.0% | 100 | 993 | 909.7 | 761.7 | 75 | 10.156
1 | Rural | MERRYWOOD | 60.1% | 90 | 1,083 | | | |
1 | Rural | DIAMOND HILL | 60.3% | 38 | 1,121 | | | |
1 | Rural | SPRINGFIELD | 60.9% | 89 | 1,210 | | | |
1 | Rural | TAMASSEE-SALEM | 61.7% | 41 | 1,251 | | | |
1 | Rural | WESTMINSTER | 67.2% | 62 | 1,313 | | | |
1 | Rural | HODGES | 67.5% | 36 | 1,349 | | | |
1 | Rural | LAURENS | 67.7% | 92 | 1,441 | | | |
1 | Rural | GRAY COURT OWINGS | 68.7% | 58 | 1,499 | | | |
1 | Rural | E B MORSE | 68.7% | 95 | 1,594 | | | |
1 | Rural | CLINTON | 69.0% | 87 | 1,681 | 1,671.4 | 761.7 | 67 | 11.369
1 | Rural | WESTWOOD | 69.8% | 120 | 1,801 | | | |
1 | Rural | ORCHARD PARK | 71.5% | 61 | 1,862 | | | |
1 | Rural | WARE SHOALS PRIMARY | 72.0% | 55 | 1,917 | | | |
1 | Rural | JOANNA-WOODSON | 72.2% | 50 | 1,967 | | | |
1 | Rural | JAMES M BROWN | 73.9% | 98 | 2,065 | | | |
1 | Rural | OAKLAND | 74.8% | 78 | 2,143 | | | |
1 | Rural | BLUE RIDGE ELEMENTARY | 76.6% | 90 | 2,233 | | | |
1 | Rural | WATERLOO | 77.5% | 37 | 2,270 | | | |
1 | Rural | EASTSIDE | 78.9% | 67 | 2,337 | | | |
1 | Rural | WOODFIELDS | 80.8% | 97 | 2,434 | 2,433.1 | 761.7 | 35 | 21.763
1 | Rural | SALUDA | 81.0% | 107 | 2,541 | | | |
1 | Rural | MATHEWS | 83.3% | 84 | 2,625 | | | |
1 | Rural | JOHN C CALHOUN | 89.9% | 36 | 2,661 | | | |
1 | Rural | FORD | 92.8% | 81 | 2,742 | | | |
1 | Urban | MIDWAY SCHL | 16.7% | 142 | 2,884 | | | |
1 | Urban | WREN | 26.5% | 100 | 2,984 | | | |
1 | Urban | WRIGHT | 28.6% | 28 | 3,012 | | | |
1 | Urban | POWDERSVILLE | 31.2% | 173 | 3,185 | | | |
1 | Urban | CONCORD | 31.5% | 133 | 3,318 | 3,194.8 | 761.7 | 96 | 7.934
1 | Urban | HUNT MEADOWS | 39.8% | 75 | 3,393 | | | |
1 | Urban | MT LEBANON | 41.9% | 55 | 3,448 | | | |
1 | Urban | MERRIWETHER | 48.4% | 120 | 3,568 | | | |
1 | Urban | SPEARMAN | 51.3% | 60 | 3,628 | | | |
1 | Urban | LA FRANCE | 51.3% | 52 | 3,680 | | | |
1 | Urban | BELTON | 52.2% | 160 | 3,840 | | | |
1 | Urban | STARR | 53.2% | 57 | 3,897 | | | |
1 | Urban | CENTERVILLE | 55.5% | 117 | 4,014 | 3,956.5 | 761.7 | 85 | 8.961
1 | Urban | WEST PELZER | 55.6% | 69 | 4,083 | | | |
1 | Urban | HONEA PATH | 55.6% | 97 | 4,180 | | | |
1 | Urban | CEDAR GROVE | 55.9% | 90 | 4,270 | | | |
1 | Urban | PALMETTO | 61.9% | 90 | 4,360 | | | |
1 | Urban | TOWNVILLE | 63.5% | 36 | 4,396 | | | |
1 | Urban | W E PARKER | 64.9% | 88 | 4,484 | | | |
1 | Urban | MCLEES | 65.1% | 118 | 4,602 | | | |
1 | Urban | IVA | 67.2% | 70 | 4,672 | | | |
1 | Urban | CALHOUN ACADEMY | 67.6% | 120 | 4,792 | 4,718.2 | 761.7 | 74 | 10.293
1 | Urban | WHITEHALL | 69.9% | 75 | 4,867 | | | |
1 | Urban | NEW PROSPECT | 72.3% | 66 | 4,933 | | | |
1 | Urban | JOHNSTON | 73.5% | 49 | 4,982 | | | |
1 | Urban | HOMELAND PARK | 74.7% | 52 | 5,034 | | | |
1 | Urban | PENDLETON | 74.7% | 52 | 5,086 | | | |
1 | Urban | FLAT ROCK | 76.5% | 67 | 5,153 | | | |
1 | Urban | NEVITT FOREST SCHOOL | 80.7% | 66 | 5,219 | | | |
1 | Urban | DOUGLAS | 82.4% | 48 | 5,267 | | | |
1 | Urban | VARENNES ACADEMY | 90.8% | 65 | 5,332 | | | |
2 | Rural | SPARTANBURG | 43.2% | 45 | 5,377 | | | |
2 | Rural | LOCKHART | 58.8% | 23 | 5,400 | | | |
2 | Rural | BUFFALO | 70.9% | 103 | 5,503 | | | |


Example #2
Weight factor calculation for a survey that used systematic non-PPS sampling
with implicit stratification by region, urban/rural status and NSLP percent
This example shows the steps from sample selection to weight factor calculation when a non-PPS sampling
strategy is used. Based on available resources, the decision was made to include 70 schools in the “Utopia” oral
health survey of 3rd grade children. The following sampling steps were employed:
• The sampling frame list was sorted by region then by urban/rural status within each region
• Schools were then sorted by percent of children participating in NSLP within urban/rural school
categories.
Calculations used for selecting the systematic non-PPS sample:
• Sampling interval = (number of schools in sampling frame) / (number of schools to be screened)
o 700 / 70 = 10.0
• Random start = random number between 1 and the sampling interval (10) = 6.0
o This is the first school selection number
o There are a variety of methods for selecting a random number including, but not limited to, Excel
and www.random.org
• Select the 6th school. Add the sampling interval (10.0) to 6 to get the next school (16.0). Continue adding
the sampling interval until all 70 school selections are made:
6.0, 16.0, 26.0, 36.0, 46.0, 56.0, 66.0, …
• These numbers are matched to the sequential number of schools in the sampling list to identify the
schools selected into the sample. The sampling frame list and the selected schools are shown in Table 2.
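
The selection steps above can be sketched in Python as follows. This is a minimal illustration rather than a prescribed tool: the function name `systematic_sample` and its parameters are hypothetical, and it only handles the whole-number sampling interval used in this example.

```python
import random


def systematic_sample(n_frame, n_select, start=None, rng=random):
    """Return 1-based frame positions for a systematic (non-PPS) sample.

    Assumes n_frame / n_select is a whole number, as in this example.
    """
    if n_frame % n_select != 0:
        raise ValueError("fractional sampling intervals are not handled here")
    interval = n_frame // n_select
    if start is None:
        # Random start between 1 and the sampling interval, inclusive.
        start = rng.randint(1, interval)
    # Add the interval repeatedly until n_select schools are chosen.
    return [start + k * interval for k in range(n_select)]


# Utopia example: 700 schools in the frame, 70 to screen, random start of 6.
selections = systematic_sample(700, 70, start=6)
print(selections[:7])  # [6, 16, 26, 36, 46, 56, 66]
print(selections[-1])  # 696
```

Fixing `start=6` reproduces the example; omitting it draws a new random start each time.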

Weight factor calculation (Table 2):
• Weight = (number of children in sampling interval) / (number of children screened)
• When non-PPS sampling is used, the number of children in the sampling interval varies from one
interval to another. It is the total 3rd grade enrollment of all schools in the given interval.
• NOTE: In this example, dividing the number of schools by the number of schools to screen produced a
whole number. Please contact ASTDD if you need more information on how to appropriately calculate
weights if a fractional sampling interval was used.
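
The weight calculation can be checked with a short Python sketch. It is illustrative only: the variable names are assumptions, and the enrollment figures and screened counts are copied from Table 2. For each selected school it sums the 3rd grade enrollment of the 10 schools in that sampling interval (A) and divides by the number of children screened (B).

```python
# 3rd grade enrollment for the 70 schools in Table 2, in sampling-frame order.
enrollment = [
    38, 92, 91, 94, 130, 95, 56, 117, 61, 66,
    53, 100, 90, 38, 89, 41, 62, 36, 92, 58,
    95, 87, 120, 61, 55, 50, 98, 78, 90, 37,
    67, 97, 107, 84, 36, 81, 142, 100, 28, 173,
    133, 75, 55, 120, 60, 52, 160, 57, 117, 69,
    97, 90, 90, 36, 88, 118, 70, 120, 75, 66,
    49, 52, 52, 67, 66, 48, 65, 45, 23, 103,
]
interval = 10  # schools per sampling interval (700 / 70)

# Number of children screened (B), keyed by selected school position.
screened = {6: 52, 16: 25, 26: 40, 36: 63, 46: 38, 56: 79, 66: 25}

children_in_interval = {}
weights = {}
for school, n_screened in screened.items():
    k = (school - 1) // interval  # 0-based index of the sampling interval
    # (A): total enrollment of all schools in this interval.
    children_in_interval[school] = sum(enrollment[k * interval:(k + 1) * interval])
    weights[school] = children_in_interval[school] / n_screened

for school in screened:
    print(f"school {school}: A = {children_in_interval[school]}, "
          f"weight = {weights[school]:.3f}")
```

To three decimals the sketch gives 23.632 for school 46, which Table 2 prints as 23.63; the other weights match the table as printed.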


Table 2: Systematic sampling (non-PPS) with implicit stratification by region, urban/rural status and NSLP percent

Region  Urban/Rural  School Name         NSLP %  3rd Grade   Cumulative    Sampling  Selected  # of Children in       Number        Weight
                                                 Enrollment  # of Schools  Interval  School    Sampling Interval (A)  Screened (B)  (A)/(B)
1       Rural        KEOWEE              37.1%   38          1             1
1       Rural        WALHALLA            39.8%   92          2             1
1       Rural        RAVENEL             52.3%   91          3             1
1       Rural        LAKEVIEW            52.4%   94          4             1
1       Rural        NINETY SIX          52.7%   130         5             1
1       Rural        NORTHSIDE           52.8%   95          6             1         6         840                    52            16.154
1       Rural        MCCORMICK           53.9%   56          7             1
1       Rural        FAIR-OAK            55.7%   117         8             1
1       Rural        HICKORY TAVERN      56.8%   61          9             1
1       Rural        HOLLYWOOD           57.2%   66          10            1
1       Rural        CHEROKEE TRAIL      57.8%   53          11            2
1       Rural        PINECREST           60.0%   100         12            2
1       Rural        MERRYWOOD           60.1%   90          13            2
1       Rural        DIAMOND HILL        60.3%   38          14            2
1       Rural        SPRINGFIELD         60.9%   89          15            2
1       Rural        TAMASSEE-SALEM      61.7%   41          16            2         16        659                    25            26.360
1       Rural        WESTMINSTER         67.2%   62          17            2
1       Rural        HODGES              67.5%   36          18            2
1       Rural        LAURENS             67.7%   92          19            2
1       Rural        GRAY COURT OWINGS   68.7%   58          20            2
1       Rural        E B MORSE           68.7%   95          21            3
1       Rural        CLINTON             69.0%   87          22            3
1       Rural        WESTWOOD            69.8%   120         23            3
1       Rural        ORCHARD PARK        71.5%   61          24            3
1       Rural        WARE SHOALS         72.0%   55          25            3
1       Rural        JOANNA-WOODSON      72.2%   50          26            3         26        771                    40            19.275
1       Rural        JAMES M BROWN       73.9%   98          27            3
1       Rural        OAKLAND             74.8%   78          28            3
1       Rural        BLUE RIDGE          76.6%   90          29            3
1       Rural        WATERLOO            77.5%   37          30            3
1       Rural        EASTSIDE            78.9%   67          31            4
1       Rural        WOODFIELDS          80.8%   97          32            4
1       Rural        SALUDA              81.0%   107         33            4
1       Rural        MATHEWS             83.3%   84          34            4
1       Rural        JOHN C CALHOUN      89.9%   36          35            4
1       Rural        FORD                92.8%   81          36            4         36        915                    63            14.524
1       Urban        MIDWAY              16.7%   142         37            4
1       Urban        WREN                26.5%   100         38            4
1       Urban        WRIGHT              28.6%   28          39            4
1       Urban        POWDERSVILLE        31.2%   173         40            4
1       Urban        CONCORD             31.5%   133         41            5
1       Urban        HUNT MEADOWS        39.8%   75          42            5
1       Urban        MT LEBANON          41.9%   55          43            5
1       Urban        MERRIWETHER         48.4%   120         44            5
1       Urban        SPEARMAN            51.3%   60          45            5
1       Urban        LA FRANCE           51.3%   52          46            5         46        898                    38            23.63
1       Urban        BELTON              52.2%   160         47            5
1       Urban        STARR               53.2%   57          48            5
1       Urban        CENTERVILLE         55.5%   117         49            5
1       Urban        WEST PELZER         55.6%   69          50            5
1       Urban        HONEA PATH          55.6%   97          51            6
1       Urban        CEDAR GROVE         55.9%   90          52            6
1       Urban        PALMETTO            61.9%   90          53            6
1       Urban        TOWNVILLE           63.5%   36          54            6
1       Urban        W E PARKER          64.9%   88          55            6
1       Urban        MCLEES              65.1%   118         56            6         56        850                    79            10.759
1       Urban        IVA                 67.2%   70          57            6
1       Urban        CALHOUN ACADEMY     67.6%   120         58            6
1       Urban        WHITEHALL           69.9%   75          59            6
1       Urban        NEW PROSPECT        72.3%   66          60            6
1       Urban        JOHNSTON            73.5%   49          61            7
1       Urban        HOMELAND PARK       74.7%   52          62            7
1       Urban        PENDLETON           74.7%   52          63            7
1       Urban        FLAT ROCK           76.5%   67          64            7
1       Urban        NEVITT FOREST       80.7%   66          65            7
1       Urban        DOUGLAS             82.4%   48          66            7         66        570                    25            22.800
1       Urban        VARENNES ACADEMY    90.8%   65          67            7
2       Rural        SPARTANBURG         43.2%   45          68            7
2       Rural        LOCKHART            58.8%   23          69            7
2       Rural        BUFFALO             70.9%   103         70            7


Table 3: Results from analyses of an oral health dataset using SAS, Epi Info, R, SUDAAN, SPSS and Stata

                            3rd Grade                                Kindergarten
Oral health variable        Estimated %  95% CI       95% CI         Estimated %  95% CI       95% CI
                                         Lower limit  Upper limit                 Lower limit  Upper limit
Decay experience (% yes)
  SAS 9.3                   57.6         54.2         61.0           43.1         38.6         47.6
  Epi Info 7                57.6         54.2         61.0           43.1         38.6         47.6
  R                         57.6         54.2         60.9           43.1         38.7         47.5
  SUDAAN                    57.6         54.1         61.0           43.1         38.7         47.6
  SPSS                      57.6         54.1         61.0           43.1         38.7         47.6
  Stata                     57.6         54.2         61.0           43.1         38.6         47.6
Untreated decay (% yes)
  SAS 9.3                   21.3         18.8         23.8           19.7         16.8         22.5
  Epi Info 7                21.3         18.8         23.8           19.7         16.8         22.5
  R                         21.3         18.8         23.7           19.7         16.9         22.5
  SUDAAN                    21.3         18.9         23.9           19.7         16.9         22.7
  SPSS                      21.3         18.9         23.9           19.7         16.9         22.7
  Stata                     21.2         18.7         23.7           19.6         16.8         22.5
Dental sealants (% yes)                                              NA
  SAS 9.3                   29.0         25.6         32.4
  Epi Info 7                29.0         25.6         32.4
  R                         29.0         25.7         32.3
  SUDAAN                    29.0         25.8         32.5
  SPSS                      29.0         25.7         32.5
  Stata                     29.0         25.6         32.4

Table 4: Results from analyses of an oral health dataset using SAS, Epi Info, R, SUDAAN, SPSS and Stata
3rd Grade Students

                            Non-Hispanic White                       Non-Hispanic Black
Oral health variable        Estimated %  95% CI       95% CI         Estimated %  95% CI       95% CI
                                         Lower limit  Upper limit                 Lower limit  Upper limit
Dental sealants (% yes)
  SAS 9.3                   31.0         26.5         35.6           25.5         21.5         29.6
  Epi Info 7                31.0         26.5         35.6           25.5         21.5         29.6
  R                         31.0         26.6         35.5           25.5         21.6         29.5
  SUDAAN                    31.0         26.7         35.8           25.6         21.7         29.8
  SPSS                      31.0         26.7         35.7           25.5         21.7         29.8
  Stata                     31.0         26.5         35.6           25.6         21.5         29.7

File Type	application/pdf
File Title	Surveys can direct planning efforts for screening programs – where to concentrate efforts if resources are limited
Author	Michael C Manz
File Modified	2024-04-01
File Created	2024-04-01