Determining Area Sample Sizes for the Consumer Expenditure Survey
SYLVIA A. JOHNSON-HERRING
SHARON KRIEGER
DAVID SWANSON
Sylvia A. Johnson-Herring is a mathematical
statistician in the Division of Price Statistical Methods, Consumer Expenditure Surveys,
Bureau of Labor Statistics.
Sharon Krieger is a mathematical statistician
in the Division of Price Statistical Methods,
Consumer Expenditure Surveys, Bureau of
Labor Statistics.
David Swanson is Branch Chief, Division of
Price Statistical Methods, Consumer Expenditure Surveys, Bureau of Labor Statistics.
The Consumer Expenditure Survey (CE) is a national household survey conducted by the
U.S. Bureau of Labor Statistics (BLS)
to find out how Americans spend their
money. The survey’s sample design,
based on the decennial census, is updated approximately every 10 years. At
that time, many decisions need to be
made, such as the number of geographic
areas in which to collect data and the
number of households from which to
collect data in each area. This article
describes a new method for making
these decisions, one that has been incorporated in the sample design to be
introduced in 2005.
Background
The CE is used to produce the most
accurate estimate of consumer expenditures possible at the national level.
The U.S. Consumer Price Index (CPI)
program relies on CE data to produce
inflation estimates. The most comprehensive CPI is based on the expenditure patterns of consumers in urban
and metropolitan areas and is denoted
CPI-U. The CPI-U population represents about 87 percent of the total U.S.
population. The CE is designed to balance the goals of the CE and CPI programs. These goals compete with each
other when BLS allocates the CE’s nationwide sample of households to geographic areas covered by the two programs.
24 Consumer Expenditure Survey Anthology, 2005
The number of households in the
CE’s national sample is determined by
the survey’s data collection budget.
Allocating this fixed number of households to individual geographic areas
must be done in a way that satisfies
the competing goals of the CE and CPI
programs as much as possible. The CE
program’s goal is to allocate the sample
households to the selected geographic
areas in proportion to their share of the
U.S. population, whereas the CPI
program’s goal is to allocate sample
households to the selected urban areas in proportion to their share of the
Nation’s urban population. The CPI
program further strives to include a
minimum number of households in each
selected urban area to ensure the statistical quality of its published price
indexes for those areas.
This article describes a new automated method of allocating the CE’s
nationwide sample of households in a
way that balances competing goals and
constraints. The CE actually consists
of two surveys, the Diary and Interview surveys, but this article focuses
on the Interview survey.
Geographic areas in the CE
sample
The selection of households for the
survey begins with the definition and
selection of primary sampling units
(PSUs), which consist of counties (or
parts thereof), groups of counties, or
independent cities. The sample design
currently used in the survey, based on
the 1990 census, consists of 105 PSUs,
classified into 4 size categories:
• 31 “A” PSUs, which are metropolitan statistical areas (MSAs) with a population of 1.5 million or greater
• 46 “B” PSUs, which are MSAs with a population less than 1.5 million
• 10 “C” PSUs, which are nonmetropolitan urban areas
• 18 “D” PSUs, which are nonmetropolitan rural areas. The “D” PSUs are used in the CE program but not in the CPI program.

These 105 PSUs are grouped according to the geographic areas they represent. A populous PSU constitutes its own geographic area, which is called a “self-representing” geographic area. The 31 A PSUs are self-representing geographic areas, and they are in the sample with certainty. The 74 B, C, and D PSUs are “non-self-representing” PSUs. They were randomly selected to represent all of the less populous PSUs in the Nation. The 74 non-self-representing PSUs are grouped into 11 geographic areas called region-size classes, which are formed by cross-classifying the 4 regions of the country (Northeast, Midwest, South, and West) with the 3 size classes (B, C, and D), as shown in the B, C, and D columns of table 1. There are only 11 region-size classes for the areas that are not self-representing because no C PSUs were selected in the Northeast.

These 11 region-size classes are treated just like the 31 A PSUs and are also referred to as self-representing geographic areas. Hence, the CE can be thought of as having 42 self-representing geographic areas: 31 A PSUs plus 11 region-size classes for the smaller PSUs. Because the 4 D region-size classes are used by the CE only, there are only 38 self-representing geographic areas used by the CPI.

The sample allocation problem
In the CE’s current sample design, usable interviews are collected from 7,760 households¹ in each calendar quarter of the year: 4,260 households in the A PSUs, and 3,500 households in the B, C, and D PSUs. To guarantee that enough data are collected to satisfy CPI’s publication requirements, the sample of 7,760 households is allocated so that at least 120 usable interviews are obtained in each of the 38 geographic areas used by the CPI, with no minimum number of usable interviews required in the 4 D geographic areas.
Thus, the problem is to allocate the 7,760 households in the CE’s national sample to the 42 geographic areas in a way that satisfies the following constraints:
• The 31 A PSUs are allotted 4,260 households.
• The 11 B, C, and D region-size classes are allotted 3,500 households.
• Each of the 38 geographic areas used in the CPI is allotted 120 or more households.
BLS staff recently reevaluated the
minimum sample size requirement of 120
usable interviews to determine whether
it is still an appropriate number. One of
the results of the reevaluation was the
development of a new automated
method of allocating the nationwide
sample of households to geographic
areas. The new method allowed repeated analyses to be conducted
quickly and easily using different minimum sample size requirements. The
method involved setting up the sample
allocation problem as a mathematical
optimization problem and using SAS
statistical software to solve it.
Target versus required sample size
In the past, there were various interpretations of the word “required” in the
phrase “minimum required sample size.”
At times, the requirement that at least
120 usable interviews be obtained was
interpreted as a target sample size,
meaning that the expected number of
usable interviews should be at least 120:
$$E(x_i) \ge 120.$$
At other times, it was interpreted as a required sample size, meaning that there should be a very high probability that at least 120 interviews be obtained,
$$P\{x_i \ge 120\} \ge 0.95,$$
where $x_i$ is the number of usable interviews collected in geographic area $i$.
For example, under the first interpretation (target sample size), data collectors would have to visit 185 households
in each quarter of the year to collect
120 usable interviews in the Boston
metropolitan area, assuming that usable
interviews are obtained at 65 percent
of the residential addresses in the CE’s
sample.²
$$E(x_i) = 185 \times 0.65 \approx 120$$
Table 1. PSU region-size classes (number of PSUs by region and size class)

Region          A     B     C     D     Total
Northeast       6     8     –     4      18
Midwest         8    10     4     4      26
South           7    22     4     8      41
West           10     6     2     2      20
Total          31    46    10    18     105
¹ In 2000 the average number of usable interviews collected per quarter in the CE Interview Survey was 7,760.
² Approximately 15 percent of the residential addresses selected for the CE Interview Survey are ineligible for the survey, and 20 percent do not participate in the survey due to refusal or to no one being home. This leaves 65 percent of the sample to participate in the survey.
However, under the second interpretation (required sample size), data collectors would have to visit 202 households to be 95-percent certain of getting
at least 120 usable interviews, again
assuming a 65-percent survey participation rate.
$$P\{x_i \ge 120\} = \sum_{k=120}^{202} \binom{202}{k}\, 0.65^{k}\,(1 - 0.65)^{202-k} \approx 0.95$$
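The binomial tail probability above can be checked numerically. The following is a minimal Python sketch, given for illustration only (the article's own computations were done in SAS). The function name `prob_at_least` is hypothetical, and the 65-percent participation rate is the assumption stated in the text.

```python
from math import comb


def prob_at_least(n_addresses, min_interviews, rate):
    """Binomial probability of obtaining at least `min_interviews` usable
    interviews when `n_addresses` are visited and each address yields a
    usable interview independently with probability `rate`."""
    return sum(
        comb(n_addresses, k) * rate**k * (1 - rate) ** (n_addresses - k)
        for k in range(min_interviews, n_addresses + 1)
    )


# Required-sample-size case from the text: 202 addresses are visited.
p_required = prob_at_least(202, 120, 0.65)

# Target-sample-size case: 185 addresses give an expected 120 interviews,
# so the chance of actually reaching 120 is only a little better than even.
p_target = prob_at_least(185, 120, 0.65)

print(f"n = 202: P(x >= 120) = {p_required:.3f}")
print(f"n = 185: P(x >= 120) = {p_target:.3f}")
```

The exact binomial tail can differ slightly from figures obtained with a normal approximation, so small discrepancies with published values should be expected.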
Table 2 shows the difference in the
sample size that would be needed for a
target versus a required minimum number of usable interviews. The number
of selected addresses needed to
achieve a target minimum sample size
is approximately 10 percent less than
that needed for a required sample size.
The estimates in table 2 were produced using formulas from the binomial distribution for the mean and variance of the number of usable interviews,
$$\mu = E(x_i) = 0.65n$$
$$\sigma^2 = V(x_i) = 0.65(1 - 0.65)n$$
and the normal distribution was used to approximate the binomial distribution to estimate a 95-percent confidence interval on the number of usable interviews:
One-sided confidence interval: $[\mu - 1.64\sigma, +\infty)$
Two-sided confidence interval: $[\mu - 1.96\sigma, \mu + 1.96\sigma]$
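The normal-approximation formulas above are simple enough to sketch in a few lines. The Python below is illustrative only; the function names are hypothetical, and the multipliers 1.96 and 1.64 and the 65-percent participation rate are taken from the text.

```python
from math import sqrt

PARTICIPATION_RATE = 0.65  # survey participation rate assumed in the article


def mean_and_sd(n, rate=PARTICIPATION_RATE):
    """Binomial mean and standard deviation of the number of usable interviews."""
    return rate * n, sqrt(rate * (1 - rate) * n)


def two_sided_interval(n, z=1.96):
    """Two-sided interval [mu - z*sigma, mu + z*sigma] for n sampled addresses."""
    mu, sigma = mean_and_sd(n)
    return mu - z * sigma, mu + z * sigma


def one_sided_lower_bound(n, z=1.64):
    """Lower end of the one-sided interval [mu - z*sigma, +infinity)."""
    mu, sigma = mean_and_sd(n)
    return mu - z * sigma


def smallest_n_for_lower_bound(minimum, z=1.64):
    """Smallest n whose one-sided lower bound reaches `minimum`."""
    n = 1
    while one_sided_lower_bound(n, z) < minimum:
        n += 1
    return n


low, high = two_sided_interval(185)          # target sample size for 120 interviews
required_n = smallest_n_for_lower_bound(120)  # required sample size for 120 interviews
print(f"n = 185: [{low:.1f}, {high:.1f}]; required n for 120: {required_n}")
```

With these inputs the search returns 202 addresses for a required minimum of 120 usable interviews. Because this sketch does no rounding of intermediate values, other results may differ by one from rounded published figures.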
After some discussion, staff decided
that target sample sizes would be satisfactory. Because the widths of the
two-sided confidence intervals are relatively small, it is unlikely that any
sample sizes achieved will be greatly
below the target level.
Setting up the optimization
problem
The CE’s current sample design calls
for allocating 7,760 households to the
42 geographic areas in a way that satisfies the three constraints mentioned
previously.
These constraints can be written in mathematical terms as follows:
• $x_1 + x_2 + \cdots + x_{31} = 4{,}260$
• $x_{32} + x_{33} + \cdots + x_{42} = 3{,}500$
• $x_i \ge 120$ for $i = 1, 2, \ldots, 38$
where $x_i$ is the number of usable interviews collected in geographic area $i$.
Table 2. Sample size needed to obtain a target versus a required minimum number of usable interviews for the Consumer Expenditure Survey

Number of sample     Expected number of usable interviews,       95-percent
households (n)       assuming a 65-percent survey                confidence interval
                     participation rate (= 0.65n)

Target sample size (two-sided 95-percent confidence interval)
 62                   40                                         [33, 47]
 92                   60                                         [51, 69]
123                   80                                         [70, 90]
154                  100                                         [88, 112]
185                  120                                         [107, 133]
215                  140                                         [126, 154]

Required sample size (one-sided 95-percent confidence interval)
 72                   47                                         [40, +∞)
105                   68                                         [60, +∞)
137                   89                                         [80, +∞)
170                  110                                         [100, +∞)
202                  131                                         [120, +∞)
234                  152                                         [140, +∞)

Again, the objective of the CE’s sample design is to allocate the nationwide sample of households to geographic areas in a way that minimizes
the standard error of the expenditures
estimate at the national level. Allocating the sample in proportion to the
population that each geographic area
represents comes very close to achieving that goal. Although this allocation
does not minimize the nationwide standard error, it is a very simple sample
design that is known to achieve near
minimization. Staff chose to implement
this method because of its simplicity
and its near optimal properties.
Based on research and evaluation,
staff modified the sample allocation
problem described above. More of the
CE’s sample households were allocated
to the urban portion of the Nation (of
interest to the CPI), and fewer households were allocated to rural areas. This
change results in a slight oversampling
of the urban areas: The CPI-U population represents about 87 percent of the
total U.S. population, but it is given 95
percent of the CE’s sample. An analysis showed that limiting the rural sample
to 400 households would have a minimal effect on the nationwide standard
error of the CE’s expenditure estimates.
Thus, the revised optimization problem
allocates exactly 400 households to the
4 rural geographic areas, leaving 7,360
households to be allocated to the 38
urban geographic areas.
For some of the geographic areas
with small populations—for example,
Anchorage and Honolulu—the requirement that at least 120 usable interviews be collected during each calendar quarter conflicts with the
objective of allocating the sample in
proportion to the population. For example, the Anchorage metropolitan area
has approximately 0.09 percent of the
U.S. population, and allocating the
7,760 usable interviews proportionally
would give Anchorage only enough addresses to obtain 7 usable interviews—
not 120.
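The Anchorage figure follows from a one-line calculation, sketched here in Python for illustration. The function name is hypothetical; the 0.09-percent population share and the 7,760 interviews are the values quoted in the text.

```python
def proportional_interviews(population_share, total_interviews):
    """Usable interviews an area would receive under strict proportional allocation."""
    return population_share * total_interviews


# Anchorage holds roughly 0.09 percent of the U.S. population.
anchorage = proportional_interviews(0.0009, 7760)
shortfall = 120 - anchorage  # gap relative to the 120-interview minimum

print(f"Proportional allocation: {anchorage:.1f} interviews "
      f"({shortfall:.0f} short of the minimum)")
```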
Because an exact proportional allocation cannot be achieved within the
given constraints, BLS staff decided to
allocate the sample as proportionally
as possible. This involved setting up a
least-squares problem to square the
difference between each geographic
area’s proportion of the population and
its proportion of the sample and then
minimize the sum of those 42 squared
differences.
Thus, the optimization task is to solve the following constrained least-squares problem:
Given values of $n$, $p_i$, and $p$, find values of $n_i$ that

Minimize
$$\sum_{i=1}^{42} \left( \frac{n_i}{n} - \frac{p_i}{p} \right)^2$$
Subject to
$$n_1 + n_2 + \cdots + n_{38} = 7{,}360$$
$$n_{39} + n_{40} + n_{41} + n_{42} = 400$$
$$n_i \ge 120 \text{ for } i = 1, 2, \ldots, 38$$
$$n_i \ge 0 \text{ for } i = 39, \ldots, 42$$
where
$n_i$ = number of housing units assigned to geographic area $i$
$n$ = number of housing units nationwide ($n$ = 7,760)
$p_i$ = population of geographic area $i$
$p$ = population in all geographic areas ($p = p_1 + p_2 + \cdots + p_{42}$)

Solving the optimization problem
The optimization problem described above can be seen to have both equality and inequality constraints. This creates a practical problem because optimization problems with equality constraints are usually solved with different techniques than those with inequality constraints. Least-squares problems with equality constraints are usually solved with linear algebra and linear regression theory, while problems with inequality constraints are usually solved with iterative search techniques. Fortunately, the SAS® procedure for nonlinear programming (PROC NLP) can handle both kinds of constraints simultaneously. An example using this SAS® procedure to solve the problem above is given at the end of this paper.

Estimating the standard error
The variance of the estimate of consumer expenditures resulting from the sample allocation process described above was estimated using the following formula:
$$V(\bar{x}) = V\left( \sum_{i=1}^{42} \left( \frac{p_i}{p} \right) \bar{x}_i \right) = \sum_{i=1}^{42} \left( \frac{p_i}{p} \right)^2 V(\bar{x}_i) = \sum_{i=1}^{42} \left( \frac{p_i}{p} \right)^2 \frac{\sigma^2}{n_i}$$
where
$\bar{x}_i$ = sample mean of geographic area $i$
$\bar{x}$ = sample mean of the Nation
$$\bar{x} = \frac{\sum_{i=1}^{42} p_i \bar{x}_i}{\sum_{i=1}^{42} p_i} = \frac{\sum_{i=1}^{42} p_i \bar{x}_i}{p} = \sum_{i=1}^{42} \left( \frac{p_i}{p} \right) \bar{x}_i$$
$\sigma^2$ = expenditure variance of a randomly selected household

The variance of the estimate of consumer expenditures under the proposed sample allocation method is estimated by substituting the values of $n_i$ obtained from the optimization problem (the output of PROC NLP) into the formula
$$V(\bar{x}) = \sum_{i=1}^{42} \left( \frac{p_i}{p} \right)^2 \frac{\sigma^2}{n_i}$$
Then the standard error is computed by taking the square root of the variance.
$$SE = \sqrt{ \sum_{i=1}^{42} \left( \frac{p_i}{p} \right)^2 \frac{\sigma^2}{n_i} }$$
This formula allows comparisons to be made with the current method of sample allocation. The value of $\sigma$ does not have to be known because the change in standard error is the number of interest; when the ratio of two estimates of the standard error is taken (to compare the standard error of using, say, 80 as the minimum sample size instead of 120), the $\sigma$ in the numerator and the $\sigma$ in the denominator cancel each other.

Table 3. The effect of changes in minimum target sample size on the standard error for the Consumer Expenditure Survey

Minimum target sample for each     Percent change in standard error
primary sampling unit              (from SE for a minimum target sample size of 120)
  0                                 -4.16
 10                                 -4.16
 20                                 -4.15
 30                                 -4.10
 40                                 -4.04
 50                                 -3.96
 60                                 -3.88
 70                                 -3.74
 80                                 -3.54
 90                                 -3.21
100                                 -2.72
110                                 -2.04
120                                 -1.14
130                                 +0.06
140                                 +1.45
150                                 +3.28
160                                 +5.63
170                                +10.07
180                                +14.41
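Because σ cancels in the ratio of two standard errors, the percentage change between two allocations depends only on the population shares and the sample sizes. The Python sketch below illustrates this under stated assumptions: the function names are hypothetical, the two-area data are invented for the example and are not CE data, and every area is assumed to receive at least one household (an allocation of zero would make the formula undefined).

```python
from math import sqrt


def relative_se(populations, allocation):
    """Standard error divided by sigma: sqrt(sum((p_i / p)**2 / n_i))."""
    total_pop = sum(populations)
    return sqrt(
        sum((p_i / total_pop) ** 2 / n_i
            for p_i, n_i in zip(populations, allocation))
    )


def percent_change_in_se(populations, baseline, proposed):
    """Percent change in standard error when moving from `baseline` to `proposed`.
    Sigma appears in both numerator and denominator, so it is never needed."""
    ratio = relative_se(populations, proposed) / relative_se(populations, baseline)
    return 100.0 * (ratio - 1.0)


# Hypothetical two-area illustration (not CE data).
populations = [3_000_000, 1_000_000]
proportional = [75, 25]   # sample allocated in proportion to population
equal_split = [50, 50]    # same total sample, divided equally

change = percent_change_in_se(populations, proportional, equal_split)
print(f"Equal split versus proportional allocation: {change:+.2f} percent")
```

In this toy case, moving away from proportional allocation raises the standard error, which is the same direction of effect the article describes for larger minimum sample sizes.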
Standard error with different
minimum sample size
requirements
After allocating the CE’s nationwide
sample to individual geographic areas
using PROC NLP, staff computed the
percentage change in standard error for
various minimum target sample sizes.
The baseline used in the comparison
was the current sample allocation. The
current minimum target sample size is
around 120, but for technical reasons it
is not exactly equal to 120. The results of the comparisons are shown in table 3.

Chart 1. Changes in the Consumer Expenditure Survey's standard error with minimum sample size
[Line chart: percent change in standard error (vertical axis, -10 to 20) plotted against minimum sample size (horizontal axis, 0 to 180), with the proposed sample size of 80 marked.]

Table 4. The effect of changing sample allocations on the standard error for the Consumer Expenditure Survey: Primary sampling units in the West

Primary sampling unit          Population    Current        Proposed       Percent change in
                                             sample size    sample size    standard error
A419 Los Angeles                8,863,164      231            290            -10.81
A420 Greater Los Angeles        5,668,365      152            187             -9.88
A422 San Francisco              6,253,311      158            206            -12.44
A423 Seattle                    2,970,328      119            100             +9.08
A424 San Diego                  2,498,016      104             85            +10.78
A425 Portland                   1,793,476      130             80            +27.48
A426 Honolulu                     836,231      112             80            +18.32
A427 Anchorage                    226,338      125             80            +25.00
A429 Phoenix                    2,238,480      132             80            +28.45
A433 Denver                     1,980,140      121             80            +22.98
Total U.S.                    240,218,238    7,760          7,760             -3.54

NOTE: Minimum target sample size is 80.
Standard error is minimized when the
sample is allocated directly in proportion to population—that is, when 0 is
the minimum number of usable interviews required in each geographic area
(table 3). Reducing the target number of
usable interviews from 120 to 0 would
reduce the standard error by 4.16 percent. Standard error is maximized when
the sample is divided equally among all
geographic areas—180 usable interviews
per geographic area. Increasing the target number of usable interviews from
120 to 180 would increase the standard
error by 14.41 percent.
Reducing the minimum target number of usable interviews from 120 to 80
per geographic area would reduce the
standard error by 3.54 percent. Nearly
all the reduction in standard error is
achieved by reducing the minimum target sample size to 80, and little further
reduction is achieved by reducing the
minimum target sample size below 80
(chart 1). Therefore, staff decided to
reduce the minimum target sample size
from 120 to 80 usable interviews per
geographic area.
Other effects of the proposed
allocation
A minimum target sample size of 80 usable interviews per geographic area reduces the national standard error by
3.54 percent and reduces the standard
error in the urban portion of the Nation
by 3.86 percent. After some discussion,
staff decided that a minimum target
sample size of 80 would be satisfactory
for both surveys because the overall
standard error would be reduced and
publication criteria met for both the CE
and CPI programs.
Table 4 shows current and proposed
sample sizes for A PSUs in the West
after applying the proposed sample allocation method. The PSUs with populations larger than 4 million will have
their sample sizes increased, while the
PSUs with populations less than 4 million will have their sample sizes decreased. This change will decrease the
standard error in the larger A PSUs and
increase the standard error in the smaller
A PSUs, but the standard error for the
Nation as a whole will be reduced.
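The article does not state how the PSU-level percentages in table 4 were computed. Under the variance formula used in this article, however, the standard error within a single area is proportional to one over the square root of its sample size, and the rows of table 4 whose proposed sample size is 80 appear consistent with that relationship. The Python sketch below works under that assumption; the function name is hypothetical, and the sample sizes are those printed in table 4.

```python
from math import sqrt


def psu_percent_change(current_n, proposed_n):
    """Percent change in an area's standard error, assuming SE is proportional
    to 1 / sqrt(n) within the area."""
    return 100.0 * (sqrt(current_n / proposed_n) - 1.0)


# (current sample size, proposed sample size) as printed in table 4
west_psus = {
    "Portland": (130, 80),
    "Honolulu": (112, 80),
    "Anchorage": (125, 80),
    "Phoenix": (132, 80),
    "Denver": (121, 80),
}

changes = {
    name: round(psu_percent_change(current, proposed), 2)
    for name, (current, proposed) in west_psus.items()
}
for name, change in changes.items():
    print(f"{name}: {change:+.2f} percent")
```

For the larger PSUs the same calculation gives values close to, but not identical with, the printed percentages, presumably because the printed sample sizes are rounded.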
BLS staff tested other methods to
find one that satisfied the goals of both
the CE and CPI programs. Some of the
other methods tested had a positive
effect on reducing the standard error
for CE, but not for CPI, and vice versa.
The chosen method reduced CE and
CPI standard errors by about the same
amount, 3.54 percent and 3.86 percent,
respectively.
Conclusion
A new sample design for the CE will be
introduced in 2005. Based on analysis
of the current design, the new method
of sample allocation could reduce the
standard error of the estimate of consumer expenditures at the national level
by 3 to 4 percent.
The CE and CPI programs’ competing goals and constraints complicated
the process of allocating households
to geographic areas in constructing the
CE’s national sample. CE program staff
wanted to allocate the sample in a way
that minimized the national variance,
while CPI program staff wanted to minimize the variance of the urban portion of
the Nation and also limit the variance of
individual sampled areas. Setting up a
mathematical optimization problem and
then solving a constrained least-squares
problem led to a solution that satisfied
the requirements of both the CE and the
CPI programs.
Writing the problem as a formal mathematical optimization problem had several advantages:
• It required the objectives and constraints to be stated clearly and explicitly.
• It helped document the allocation process.
• It allowed several different allocation methods to be tested quickly and easily.
• It led to an optimal solution to the problem.

This approach offers clear benefits for allocating the CE’s nationwide sample of households to individual geographic areas while satisfying the CE and CPI programs’ competing goals.

APPENDIX: Automating the Sample Allocation Process

Below is the optimization problem for the sample allocation, along with a SAS® program (PROC NLP) that solves it.

Given values of $n$, $p_i$, and $p$, find values of $n_i$ that

Minimize
$$\sum_{i=1}^{42} \left( \frac{n_i}{n} - \frac{p_i}{p} \right)^2$$
Subject to
$$n_1 + n_2 + \cdots + n_{38} = 7{,}360$$
$$n_{39} + n_{40} + n_{41} + n_{42} = 400$$
$$n_i \ge 80 \text{ for } i = 1, 2, \ldots, 38$$
$$n_i \ge 0 \text{ for } i = 39, \ldots, 42$$
Where
$n_i$ = number of housing units assigned to geographic area $i$
$n$ = number of housing units nationwide ($n$ = 7,760)
$p_i$ = population of geographic area $i$
$p$ = population in all geographic areas ($p = p_1 + p_2 + \cdots + p_{42}$)

*************************************************
* COMPUTE THE SQUARED DIFFERENCE BETWEEN EACH   *
* AREA’S PROPORTION OF THE POPULATION & ITS     *
* PROPORTION OF THE SAMPLE.                     *
*************************************************;
%MACRO MAC1;
  SUM_POP = SUM(OF POP1-POP42);
  %DO I=1 %TO 42;
    SQR&I = ((N&I/7760) - (POP&I/SUM_POP))**2;
  %END;
%MEND MAC1;

*************************************************
* SOLVE A CONSTRAINED LEAST-SQUARES PROBLEM TO  *
* FIND THE NUMBER OF HOUSEHOLDS IN EACH PSU     *
* THAT MINIMIZES THE SUM OF SQUARED DIFFERENCES *
*************************************************;
PROC NLP DATA=POP_DATA(KEEP=POP1-POP42) NOPRINT
         OUT=RESULTS(KEEP=N1-N42)
         /* CONVERGENCE CRITERIA */
         GCONV=1E-15 FCONV2=1E-15 MAXITER=100000;
  /* DECISION VARIABLES */
  DECVAR N1-N42;
  /* COMPUTE THE SQUARED DIFFERENCES */
  %MAC1;
  /* SUM THE SQUARED DIFFERENCES */
  F1=SUM(OF SQR1-SQR42);
  /* FUNCTION TO BE MINIMIZED */
  MIN F1;
  /* PROBLEM CONSTRAINTS */
  BOUNDS N1-N38>=80, N39-N42>=0;
  NLINCON F2=7360, F3=400;
  F2=SUM(OF N1-N38);
  F3=SUM(OF N39-N42);
RUN;
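For readers without access to SAS, the same constrained least-squares problem can be sketched in Python. This is not the article's method: instead of a general nonlinear programming routine, the sketch relies on the fact that, for this particular objective, the optimality conditions give $n_i = \max(\text{minimum}, \text{target}_i + \text{shift})$ with a single shift per group, which is found by bisection. The urban and rural groups are solved separately because the objective is a sum over areas and no constraint links the two groups. All function names and the small test populations are hypothetical, and the results are real-valued (no rounding to whole households is attempted).

```python
def allocate(targets, total, minimum, iterations=200):
    """Minimize sum((n_i - target_i)**2) subject to sum(n_i) == total and
    n_i >= minimum, using n_i = max(minimum, target_i + shift)."""
    if minimum * len(targets) > total:
        raise ValueError("minimum sample sizes exceed the total sample")

    def allocated(shift):
        return [max(minimum, t + shift) for t in targets]

    lo = minimum - max(targets)  # every area sits at the minimum
    hi = total - min(targets)    # every area receives at least `total`
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if sum(allocated(mid)) < total:
            lo = mid
        else:
            hi = mid
    return allocated(hi)


def allocate_sample(urban_pops, rural_pops, n=7760,
                    urban_total=7360, rural_total=400, urban_min=80):
    """Allocate the sample as proportionally as possible; the target for each
    area is its proportional share n * p_i / p of the nationwide sample."""
    p = sum(urban_pops) + sum(rural_pops)
    urban = allocate([n * q / p for q in urban_pops], urban_total, urban_min)
    rural = allocate([n * q / p for q in rural_pops], rural_total, 0)
    return urban, rural


# Small hand-checkable case: the first area is lifted to the minimum of 20
# and the other two areas give up 6.5 units each.
example = allocate([7, 50, 143], total=200, minimum=20)

# Hypothetical populations (not CE data) with scaled-down totals.
urban, rural = allocate_sample(
    [8_000_000, 2_000_000, 200_000], [500_000, 300_000],
    n=1000, urban_total=900, rural_total=100, urban_min=80,
)
print(example, urban, rural)
```

Passing the 38 urban and 4 rural populations with the default arguments corresponds to the problem solved by the PROC NLP program above.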