arXiv:2003.11423v1 [stat.ML] 25 Mar 2020

Design-unbiased statistical learning in survey sampling

Luis Sanguiao Sande (1) and Li-Chun Zhang (2,3,4)

(1) Instituto Nacional de Estadística ([email protected])
(2) University of Southampton (email: [email protected])
(3) Statistisk sentralbyrå
(4) Universitetet i Oslo
Abstract: Design-consistent model-assisted estimation has become the standard practice in survey sampling. However, a general theory is lacking so far, which allows one to incorporate modern machine-learning techniques that can lead to potentially much more powerful assisting models. We propose a subsampling Rao-Blackwell method, and develop a statistical learning theory for exactly design-unbiased estimation with the help of linear or non-linear prediction models. Our approach makes use of classic ideas from Statistical Science as well as the rapidly growing field of Machine Learning. Provided rich auxiliary information, it can yield considerable efficiency gains over standard linear model-assisted methods, while ensuring valid estimation for the given target population, which is robust against potential mis-specifications of the assisting model at the individual level.

Keywords: Rao-Blackwellisation, bagging, pq-unbiasedness, stability conditions
1 Introduction
Approximately design-unbiased model-assisted estimation is not new. It has become the standard practice in survey sampling, following many influential works such as Särndal et al. (1992) and Deville and Särndal (1992). However, there is so far no general theory which allows one to incorporate the many common machine-learning (ML) techniques. For instance, according to Breidt and Opsomer (2017, p. 203), they “are not aware of direct uses of random forests in a model-assisted survey estimator”. Since modern ML techniques can often generate more flexible and powerful prediction models when rich auxiliary feature data are available, the potential is worth exploring in any situation where the practical advantages of linear weighting are not essential compared to the efficiency gains that can be achieved by alternative non-linear ML techniques.
We propose a subsampling Rao-Blackwell (SRB) method, which enables exactly design-unbiased estimation with the help of linear or non-linear prediction models. Monte Carlo (MC) versions of the proposed method can be used in cases where the exact RB method is computationally too costly. The MC-SRB method is still exactly design-unbiased, although it is somewhat less efficient due to the additional MC error. In practice, though, one can easily balance the numerical efficiency of the MC-SRB method against the statistical efficiency of the corresponding exact RB method.
The SRB method makes use of three classic ideas from Statistical Science and Machine Learning. On the one hand, the training-test split of the sample of observations in ML generates errors in the test set rather than residuals, conditional on the training dataset, which, as we shall explain, is the key to achieving exact design-unbiasedness. For model-assisted survey estimation we use this idea to remove the finite-sample bias. On the other hand, Rao-Blackwellisation (Rao, 1945; Blackwell, 1947) and model-assisted estimation (Cassel et al., 1976) are powerful ideas in Statistics and survey sampling, which we apply to ML techniques to obtain design-unbiased survey estimators at the population level.
We shall refer to the amalgamation as statistical learning, since the term model-assisted estimation is entrenched with the property of approximate design-unbiasedness (e.g. Särndal 2010; Breidt and Opsomer, 2017), whereas the focus on population-level estimation and associated variance estimation is unusual in the ML literature.
In applications one needs to ensure design-consistency of the proposed SRB method, in addition to exact design-unbiasedness. The property can readily be established for parametric or many semi-parametric assisting models. But the conditions required for non-parametric algorithmic ML prediction models have so far eluded a treatment in the literature. Indeed, this has been a main reason preventing the incorporation of such ML techniques in model-assisted estimation from survey sampling. We shall develop general stability conditions for design-consistency under both simple random sampling and arbitrary unequal probability sampling designs.
For the first time, design-unbiased model-assisted estimation can thereby be achieved
generally in survey sampling. Wherever rich feature data are available, the approach of
statistical learning developed in this paper enables one to adopt suitable ML techniques,
which can make much more efficient use of the available auxiliary information.
The rest of the paper is organised as follows. In Section 2, we describe the SRB method that uses an assisting linear model. The underlying ideas of design-unbiased statistical learning are explained, as well as the differences to standard model-assisted generalised regression estimation. Some basic methods of variance estimation are outlined, where a novel jackknife variance estimator is developed for the SRB method. We move on to non-linear ML techniques in Section 3. The similarity and difference to the bootstrap aggregating (Breiman, 1996b) approach are explored. Moreover, we investigate and prove stability conditions for design-consistency of the SRB method that uses non-parametric algorithmic prediction models. Two simulation studies are presented in Section 4, which illustrate the potential gains of the proposed unbiased statistical learning approach, compared to standard linear model-assisted or model-based approaches. A brief summary and topics for future research are given in Section 5.
2 Unbiased linear estimation
In this section we consider unbiased linear estimation in survey sampling, which builds on generalised regression (GREG) estimation (Särndal et al. 1992). The GREG estimator is the most common estimation method in practical survey sampling. It is consistent under mild regularity conditions, and is often more efficient than exactly unbiased Horvitz-Thompson (HT) estimation (Horvitz and Thompson, 1952). The proposed subsampling Rao-Blackwellisation (SRB) method removes the finite-sample bias of GREG generally, while its relative efficiency remains comparable to that of the standard GREG estimator.
2.1 Bias correction by subsampling
Let $s$ be a sample (of size $n$) selected from the population $U$ of size $N$, with probability $p(s)$, where $\sum_s p(s) = 1$ over all possible samples under a given sampling design. Let $\pi_i = \Pr(i \in s) > 0$ be the sample inclusion probability, for each $i \in U$. Let $y_i$ be a survey variable, for $i \in U$, with unknown population total $Y = \sum_{i \in U} y_i$.
Let the assisting linear model expectation of $y_i$ be given by $\mu(x_i) = x_i^{\top} \beta$, where $x_i$ is the vector of covariates for each $i \in U$. Let $\mu(x_i, s) = x_i^{\top} b$ be the estimator of $\mu(x_i)$, where $b = \big( \sum_{i \in s} x_i x_i^{\top}/\pi_i \big)^{-1} \sum_{i \in s} x_i y_i/\pi_i$ is a weighted least squares (WLS) estimator of $\beta$. It is possible to attach additional heteroscedasticity weights in the WLS; but the development below is invariant to such variations, so that it is more convenient to simply ignore them in the notation. Let $X = \sum_{i \in U} x_i$. The GREG estimator of $Y$ is given as
$$ \widehat{Y}_{GR} = X^{\top} b + \sum_{i \in s} (y_i - x_i^{\top} b)/\pi_i $$
While $\widehat{Y}_{GR}$ is design-consistent under mild regularity conditions (e.g. Särndal et al. 1992), as $n, N \to \infty$, it is usually biased given finite sample size $n$, except in special cases such as when $x_i \equiv 1$ and $\pi_i \equiv n/N$, where $\mu(x, s) = \sum_{i \in s} y_i/n = \bar{y}_s$ and $\widehat{Y}_{GR} = N \bar{y}_s$.
To remove the potential finite-sample bias of $\widehat{Y}_{GR}$, consider subsampling of $s_1 \subset s$, with known probability $q(s_1 | s)$, such as SRS with fixed $n_1 = |s_1|$, where $\sum_{s_1} q(s_1 | s) = 1$. The induced probability of selecting $s_1$ from $U$ is given by
$$ p_1(s_1) = \sum_{s : s_1 \subset s} q(s_1 | s)\, p(s) $$
where $\pi_{1i} = \Pr(i \in s_1)$ is the corresponding inclusion probability for $i \in U$. Let $s_2 = s \setminus s_1$ be the complement of $s_1$ in $s$. Let the conditional sampling probability of $s$ given $s_1$ be
$$ p_2(s_2 | s_1) = p(s)\, q(s_1 | s)/p_1(s_1) $$
and let $\pi_{2i} = \Pr(i \in s_2 | s_1)$ be the corresponding conditional inclusion probability in $s_2$ for $i \in U \setminus s_1$. Let $\mu(x_i, s_1) = x_i^{\top} b_1$ be the estimate of $\mu(x_i)$ based on the subsample $s_1$,
where $b_1 = \big( \sum_{i \in s_1} x_i x_i^{\top}/\pi_{1i} \big)^{-1} \sum_{i \in s_1} x_i y_i/\pi_{1i}$. Let
$$ \widehat{Y}_1 = \sum_{i \in s_1} y_i + \sum_{i \in U \setminus s_1} x_i^{\top} b_1 + \sum_{i \in s_2} (y_i - x_i^{\top} b_1)/\pi_{2i} \qquad (1) $$
In other words, it is the sum of $y_i$ in $s_1$ and a difference estimator of the remaining population total based on $s_2$, via $x_i^{\top} b_1$ which does not depend on the observations in $s_2$.
Proposition: The estimator $\widehat{Y}_1$ is conditionally unbiased for $Y$ over $p_2(s_2 | s_1)$ given $s_1$, denoted by $E_2(\widehat{Y}_1 | s_1) = Y$, as well as unconditionally over $p(s)$, denoted by $E_p(\widehat{Y}_1) = Y$.
Proof: As $\mu(x_i, s_1)$ is fixed for any $i \in U \setminus s_1$ given $s_1$, the last two terms on the right-hand side of (1) are unbiased for $Y - \sum_{i \in s_1} y_i$ given $s_1$. It follows that $\widehat{Y}_1$ is conditionally unbiased for $Y$ given $s_1$; hence, design-unbiased over $p(s)$ unconditionally as well.
Example: Simple random sampling (SRS). Suppose SRS without replacement of $s$ from $U$, and of $s_1$ from $s$ with fixed size $n_1 = |s_1|$, such that $\pi_{1i} = n_1/N$ and $\pi_{2i} = (n - n_1)/(N - n_1)$. In the special case of $x_i \equiv 1$, $b_1$ is the sample mean in $s_1$, and
$$ \widehat{Y}_1 = n_1 b_1 + (N - n_1) \sum_{i \in s_2} y_i/(n - n_1) $$
which amounts to using the sample mean in $s_2$ to estimate the population mean outside of the given $s_1$, instead of using the sample mean in $s$ for the whole population mean. Thus, $\widehat{Y}_1$ achieves unbiasedness generally, but at a cost of increased variance.
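To make the construction concrete, the following sketch (ours, not from the paper) computes $\widehat{Y}_1$ of (1) for a small synthetic population under SRS at both stages, with a linear assisting model fitted on $s_1$ only; the population, sample sizes and data-generating model are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: intercept plus one log-normal regressor.
N, n, n1 = 1000, 100, 80
x = np.column_stack([np.ones(N), rng.lognormal(size=N)])
y = x @ np.array([2.0, 3.0]) + rng.normal(scale=2.0, size=N)
Y = y.sum()                                        # the unknown target total

s = rng.choice(N, size=n, replace=False)           # SRS from U: pi_i = n/N
s1 = rng.choice(s, size=n1, replace=False)         # SRS subsample: q(s1|s)
s2 = np.setdiff1d(s, s1)
pi2 = (n - n1) / (N - n1)                          # conditional inclusion prob. in s2

# WLS fit on s1; under SRS the weights 1/pi_1i are equal and cancel.
b1 = np.linalg.lstsq(x[s1], y[s1], rcond=None)[0]
mu = x @ b1

outside_s1 = np.setdiff1d(np.arange(N), s1)
Y1_hat = y[s1].sum() + mu[outside_s1].sum() + ((y[s2] - mu[s2]) / pi2).sum()   # eq. (1)
print(Y, Y1_hat)
```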
2.2 Rao-Blackwellisation
One can reduce the variance of $\widehat{Y}_1$ by the Rao-Blackwell method (Rao, 1945; Blackwell, 1947). The minimal sufficient statistic in the finite population sampling setting is simply $D_s = \{(i, y_i) : i \in s\}$. Applying the RB method to $\widehat{Y}_1$ by (1) yields the estimator $\widehat{Y}_{RB}$, given by the conditional expectation of $\widehat{Y}_1$ given $D_s$, i.e.
$$ \widehat{Y}_{RB} = E_q(\widehat{Y}_1 | D_s) = E_q(\widehat{Y}_1 | s) \qquad (2) $$
where the expectation is evaluated with respect to $q(s_1 | s)$, and the second expression is leaner as long as one keeps in mind that $\{y_i : i \in U\}$ are treated as fixed constants associated with the distinct units.
Proposition: The estimator $\widehat{Y}_{RB}$ is design-unbiased for $Y$, denoted by $E(\widehat{Y}_{RB}) = Y$.
Proof: By construction, the combined randomisation distribution induced by $p$ and $q$ is the same as that induced by $p_1$ and $p_2$, for any $s_1 \cup s_2 = s$ and $s_1 \cap s_2 = \emptyset$. Thus,
$$ E(\widehat{Y}_{RB}) = E_p(\widehat{Y}_{RB}) = E_p\big[ E_q(\widehat{Y}_1 | s) \big] = E_1\big[ E_2(\widehat{Y}_1 | s_1) \big] = E_1(Y) = Y $$
Next, for the variance of $\widehat{Y}_{RB}$ over $p(s)$, i.e. $V(\widehat{Y}_{RB}) = V_p(\widehat{Y}_{RB})$, we notice
$$ V(\widehat{Y}_1) = E_p\big[ V_q(\widehat{Y}_1 | s) \big] + V_p\big[ E_q(\widehat{Y}_1 | s) \big] = E_p\big[ V_q(\widehat{Y}_1 | s) \big] + V_p(\widehat{Y}_{RB}) $$
$$ V(\widehat{Y}_1) = E_1\big[ V_2(\widehat{Y}_1 | s_1) \big] + V_1\big[ E_2(\widehat{Y}_1 | s_1) \big] = E_1\big[ V_2(\widehat{Y}_1 | s_1) \big] $$
since $E_2(\widehat{Y}_1 | s_1) = Y$. Juxtaposing the two expressions of $V(\widehat{Y}_1)$ above, we obtain
$$ V(\widehat{Y}_{RB}) = V_p(\widehat{Y}_{RB}) = E_1\big[ V_2(\widehat{Y}_1 | s_1) \big] - E_p\big[ V_q(\widehat{Y}_1 | s) \big] \qquad (3) $$
where $E_p\big[ V_q(\widehat{Y}_1 | s) \big]$ is the variance reduction compared to $\widehat{Y}_1$.
Proposition: Provided an unbiased variance estimator $\widehat{V}(\widehat{Y}_1)$ with respect to $p_2(s_2 | s_1)$, i.e. $E_2\big[ \widehat{V}(\widehat{Y}_1) \big] = V_2(\widehat{Y}_1 | s_1)$, a design-unbiased variance estimator for $\widehat{Y}_{RB}$ is given by
$$ \widehat{V}(\widehat{Y}_{RB}) = \widehat{V}(\widehat{Y}_1) - V_q(\widehat{Y}_1 | s) $$
Proof: By stipulation, we have $E\big[ \widehat{V}(\widehat{Y}_1) \big] = E_1 E_2\big[ \widehat{V}(\widehat{Y}_1) \big] = E_1\big[ V_2(\widehat{Y}_1 | s_1) \big]$, which is the first term on the right-hand side of (3). The result follows immediately.
Example: SRS, cont'd. In the special case of $x_i \equiv 1$ and $n_1 = n - 1$, we have
$$ \widehat{Y}_{1(i)} = n_1 \bar{y}_{(i)} + (N - n_1)\, y_i $$
if $s_2 = \{i\}$, where $\bar{y}_{(i)}$ denotes the mean in $s_1 = s \setminus \{i\}$. The RB estimator follows as
$$ \widehat{Y}_{RB} = \frac{1}{n} \sum_{i \in s} \widehat{Y}_{1(i)} = \frac{n_1}{n} \sum_{i \in s} \bar{y}_{(i)} + \frac{N}{n} \sum_{i \in s} y_i - \frac{n_1}{n} \sum_{i \in s} y_i = \frac{N}{n} \sum_{i \in s} y_i $$
which is the usual unbiased full-sample expansion estimator in this case. The RB method thus recovers the lost efficiency of any $\widehat{Y}_{1(i)}$ on its own.
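The identity above is easy to verify numerically; the short check below (ours, with hypothetical numbers) averages the $n$ delete-one estimators $\widehat{Y}_{1(i)}$ and recovers $N\bar{y}$ exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
N, n = 50, 10
y_s = rng.normal(size=n)                 # observed sample values (hypothetical)

Y1 = []
for i in range(n):
    ybar_del = np.delete(y_s, i).mean()  # mean over s1 = s \ {i}
    Y1.append((n - 1) * ybar_del + (N - n + 1) * y_s[i])   # Yhat_1(i)
print(np.mean(Y1), N * y_s.mean())       # identical up to rounding error
```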
Let $X_1 = \sum_{i \in s_1} x_i$, and $X_1^c = \sum_{i \notin s_1} x_i = X - X_1$. To express $\widehat{Y}_{RB}$ as a linear combination of $\{y_i : i \in s\}$, we rewrite $\widehat{Y}_1$ as
$$ \widehat{Y}_1 = \sum_{i \in s_1} y_i + \sum_{i \in s_2} \frac{y_i}{\pi_{2i}} + (X - X_1)^{\top} b_1 - \sum_{i \in s_2} \frac{x_i^{\top} b_1}{\pi_{2i}} = \sum_{i \in s_1} y_i + \sum_{i \in s_2} \frac{y_i}{\pi_{2i}} + (X_1^c - \widehat{X}_1^c)^{\top} b_1 = \sum_{i \in s} w_i y_i $$
where $\widehat{X}_1^c = \sum_{i \in s_2} x_i/\pi_{2i}$ and
$$ w_i = \begin{cases} 1 + (X_1^c - \widehat{X}_1^c)^{\top} \big( \sum_{i \in s_1} x_i x_i^{\top}/\pi_{1i} \big)^{-1} x_i/\pi_{1i} & \text{if } i \in s_1 \\ 1/\pi_{2i} & \text{if } i \in s_2 \end{cases} $$
It follows that the RB estimator (2) can be given as a linear estimator
$$ \widehat{Y}_{RB} = \sum_{i \in s} \bar{w}_i y_i \quad \text{where} \quad \bar{w}_i = E_q(w_i | s) \qquad (4) $$
This has the important practical advantage that $\{\bar{w}_i : i \in s\}$ can be applied to produce numerically consistent cross-tabulation of multiple survey variables of interest.
In the case of SRS of $s_1$ with $n_1 = n - 1$, the RB weight $\bar{w}_i$ in (4) is the average of the $w_i$'s over the $n$ possible subsamples $s_1$, for a given unit $i \in s$, where $w_i = 1/\pi_{2i}$ when $s_1$ does not include unit $i$, and otherwise $w_i$ is the corresponding GREG weight for $Y_1^c = \sum_{k \notin s_1} y_k$, which is different for each of the remaining $n - 1$ subsamples that include unit $i$.
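As an illustration of (4), the sketch below (ours) enumerates the $n$ delete-one subsamples under SRS, computes the per-subsample weights $w_i$ derived above, and checks that the averaged weights reproduce the Rao-Blackwellised estimator as a linear combination of the sample $y$-values; the population and model are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
N, n = 200, 15
x = np.column_stack([np.ones(N), rng.lognormal(size=N)])
y = x @ np.array([1.0, 2.0]) + rng.normal(size=N)
s = rng.choice(N, size=n, replace=False)

pi1 = (n - 1) / N                 # SRS of s, then delete-one SRS of s1
pi2 = 1.0 / (N - n + 1)
X_tot = x.sum(axis=0)

W = np.zeros((n, n))              # row j: weights w_i when unit s[j] is left out
Y1 = np.zeros(n)
for j in range(n):
    s1 = np.delete(s, j)
    A = (x[s1].T @ x[s1]) / pi1                    # sum of x_i x_i' / pi_1i over s1
    Xc1 = X_tot - x[s1].sum(axis=0)
    Xc1_hat = x[s[j]] / pi2
    W[j, np.delete(np.arange(n), j)] = 1 + (Xc1 - Xc1_hat) @ np.linalg.solve(A, x[s1].T) / pi1
    W[j, j] = 1.0 / pi2
    Y1[j] = W[j] @ y[s]           # Yhat_1 for this subsample, as a linear estimator
w_rb = W.mean(axis=0)             # the RB weights of (4), averaged over q(s1|s)
print(w_rb @ y[s], Y1.mean())     # the two coincide
```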
2.3 Relative efficiency to GREG
Let $B = E_p(b)$ and $e_i = y_i - x_i^{\top} B$ for $i \in U$. Expanding the GREG estimator $\widehat{Y}_{GR}$ around $(Y, X, B)$ yields
$$ \widehat{Y}_{GR} \approx \sum_{i \in s} \frac{e_i}{\pi_i} + B^{\top} X $$
For $\widehat{Y}_1$, the first two terms on the right-hand side of (1) become $X^{\top} b_1$ if there exists a vector $\lambda$ such that $x_i^{\top} \lambda \equiv 1$, in which case $\widehat{Y}_1$ is a function of $(\widehat{Y}_1^c, \widehat{X}_1^c, b_1)$, i.e.
$$ \widehat{Y}_1 = X^{\top} b_1 + \widehat{Y}_1^c - b_1^{\top} \widehat{X}_1^c = \widehat{Y}_1^c + b_1^{\top} (X - \widehat{X}_1^c) $$
where $\widehat{Y}_1^c = \sum_{i \in s_2} y_i/\pi_{2i}$ is conditionally unbiased for $Y_1^c = \sum_{i \notin s_1} y_i$ given $s_1$, and similarly $\widehat{X}_1^c = \sum_{i \in s_2} x_i/\pi_{2i}$ for $X_1^c = \sum_{i \notin s_1} x_i$. Let $Y_+^c = E_1(Y_1^c)$ and $X_+^c = E_1(X_1^c)$. We have
$E_1(b_1) \approx B$, since $b_1$ and $b$ aim at the same population parameter, especially if $n_1$ is close to $n$. In any case, expanding $\widehat{Y}_1$ around $(Y_+^c, X_+^c, B)$ yields
$$ \widehat{Y}_1 \approx Y_+^c + B^{\top} (X - X_+^c) + (\widehat{Y}_1^c - Y_+^c) + (b_1 - B)^{\top} (X - X_+^c) - B^{\top} (\widehat{X}_1^c - X_+^c) = \widehat{Y}_1^c - B^{\top} \widehat{X}_1^c + b_1^{\top} (X - X_+^c) + B^{\top} X_+^c $$
and
$$ \widehat{Y}_1^c - B^{\top} \widehat{X}_1^c = \sum_{i \in s_2} \frac{e_i}{\pi_{2i}} = \sum_{i \in s} \delta_{2i} \frac{e_i}{\pi_{2i}} $$
where $\delta_{2i} = 1$ if $i \in s_2$ and $0$ if $i \in s_1$. Thus, we obtain
. Thus, we obtain
b
Y
GR
= E
q
(
b
Y
1
|s)
X
is
E
q
δ
2i
π
2i
|s
π
i
e
i
π
i
+ E
q
(b
1
|s)
(X X
c
+
) + B
X
c
+
(5)
Notice that $B^{\top} X_+^c$ is a constant. Thus, compared to the GREG estimator $\widehat{Y}_{GR}$, the variance of the RB estimator $\widehat{Y}_{RB}$ involves in addition that of $E_q(b_1 | s)$. As $n, N \to \infty$, the first term on the right-hand side of (5) is $O_p(N/\sqrt{n})$ provided $\pi_i E_q(\delta_{2i}/\pi_{2i}) = O_p(1)$, whereas the second term is $O_p(\sqrt{n})$ if $n_1/n = O(1)$, provided the usual regularity conditions for GREG. As long as the sampling fraction $n/N$ is small, the first term will dominate, in which case the variance of the RB estimator $\widehat{Y}_{RB}$ is of the same order as that of the GREG estimator $\widehat{Y}_{GR}$.
Example: SRS, cont'd. Let $n_1 = n - k$, where $k = |s_2|$. We have
$$ \pi_i\, \pi_{2i}^{-1}\, E_q(\delta_{2i} | s) = \frac{n}{N} \cdot \frac{N - n_1}{k} \cdot \frac{(n-1)!}{(k-1)!\,(n-k)!} \cdot \frac{k!\,(n-k)!}{n!} = 1 - \frac{n_1}{N} $$
Let $S_e^2$ be the population variance of $\{e_i : i \in U\}$. The variance of the first term in (5) is
$$ V_p\Big[ \sum_{i \in s} E_q\Big( \frac{\delta_{2i}}{\pi_{2i}} \Big| s \Big) \pi_i \frac{e_i}{\pi_i} \Big] = N^2 \Big( 1 - \frac{n}{N} \Big) \frac{S_e^2}{n} \Big( 1 - \frac{n_1}{N} \Big)^2 $$
which is actually smaller than the approximate variance of the GREG estimator under SRS, although the difference will not be noteworthy in practical terms if the sampling fraction $n/N$ is small, since $1 - n/N < 1 - n_1/N < 1$. Meanwhile, due to the additional variance of $E_q(b_1 | s)$, the unbiased RB estimator $\widehat{Y}_{RB}$ can possibly have a larger variance than the biased GREG estimator (with general $x_i$). It seems that one should use a large $n_1$ if possible, to keep the additional variance due to $E_q(b_1 | s)$ small.
2.4 Delete-one RB method
The largest possible size of $s_1$ is $n_1 = n - 1$. We refer to Rao-Blackwellisation based on SRS of $s_1$ with $n_1 = n - 1$ as the delete-one (or leave-one-out, LOO) RB method. The conditional sampling design $p_2(s_2 | s_1)$ is not measurable in this case, in that one cannot have an unbiased variance estimator $\widehat{V}(\widehat{Y}_1)$ based on the single observation $y_j$ in $s_2 = \{j\}$. For an approximate variance estimator, we reconsider the basic case where $\{y_1, \ldots, y_n\}$ form a sample of independent and identically distributed (IID) observations, in order to develop an analogy to the classic jackknife variance estimation (Tukey, 1958).
Denote by $\theta$ the population mean, which is also the expectation of each $y_i$, for $i = 1, \ldots, n$. As before, let $\bar{y}_{(j)}$ denote the mean in the subsample $s_1 = s \setminus \{j\}$. Following (1), let
$$ \hat{\theta}_{(j)} = \frac{n - 1}{N} \bar{y}_{(j)} + \Big( 1 - \frac{n - 1}{N} \Big) y_j $$
be the delete-$j$ estimator of $\theta$, where $y_j$ acts as an unbiased estimator of the population mean outside $s_1$. The RB method yields the whole-sample mean, denoted by
$$ \hat{\theta} = \frac{1}{n} \sum_{j=1}^n \hat{\theta}_{(j)} = \frac{1}{n} \sum_{j=1}^n y_j = \bar{y} $$
Observe that we have $\hat{\theta} = \sum_{j=1}^n z^{(j)}/n$, where
$$ z^{(j)} = \frac{1}{N - n} \big( N \hat{\theta}_{(j)} - n \hat{\theta} \big) = y_j \qquad (6) $$
Thus, the RB estimator $\hat{\theta}$ is the mean of an IID sample of observations $z^{(j)}$, for $j = 1, \ldots, n$, as in the development of classic jackknife variance estimation, so that we obtain
$$ \widehat{V}(\hat{\theta}) = \frac{1}{n(n-1)} \sum_{j=1}^n \big( z^{(j)} - \hat{\theta} \big)^2 $$
Notice that, in this case, the IID observations used for the classic development of the jackknife method are given by $z^{(j)} = n \hat{\theta} - (n-1) \hat{\theta}_{(j)} = y_j$ instead of (6), where $\hat{\theta}_{(j)} = \bar{y}_{(j)}$.
For the delete-one RB method based on (1) and (2) given auxiliary $\{x_i : i \in U\}$, we have $\pi_{1i} = \pi_i (n-1)/n$, such that the estimator $b_1$ can be denoted by $b_{(j)}$, based on $s_1 = s \setminus \{j\}$, where it is simply the delete-$j$ jackknife regression coefficient estimator. Rewrite the corresponding population total estimator $\widehat{Y}_1$ by (1) as
$$ \widehat{Y}_{(j)} = X^{\top} b_{(j)} + \frac{y_j}{\pi_{2j}} - \frac{x_j^{\top} b_{(j)}}{\pi_{2j}} $$
such that the RB method yields $\widehat{Y}_{RB}$ by (2), as the mean of the $\widehat{Y}_{(j)}$ over $j = 1, \ldots, n$. We propose a jackknife variance estimator for $\widehat{Y}_{RB}$, given by
$$ \widehat{V}(\widehat{Y}_{RB}) = \frac{N^2}{n(n-1)} \sum_{j=1}^n \Big( z^{(j)} - \frac{1}{n} \sum_{i=1}^n z^{(i)} \Big)^2 \qquad (7) $$
where
$$ z^{(j)} = \frac{1}{N - n} \Big( \widehat{Y}_{(j)} - \frac{n}{N} \widehat{Y}_{RB} \Big) $$
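A minimal implementation of the delete-one RB estimator and the jackknife variance estimator (7) might look as follows (our sketch); it assumes SRS, so that $\pi_{2j} = 1/(N - n + 1)$ is known, and an intercept in $x$, so that (1) reduces to the rewritten form of $\widehat{Y}_{(j)}$ above.

```python
import numpy as np

def loo_rb_greg(x_s, y_s, X_tot, N):
    """Delete-one RB estimator and its jackknife variance (7) under SRS.

    x_s: (n, p) covariates of the sampled units (including an intercept
    column); y_s: their y-values; X_tot: the population total of x.
    """
    n = len(y_s)
    pi2 = 1.0 / (N - n + 1)
    Y_del = np.empty(n)
    for j in range(n):
        keep = np.delete(np.arange(n), j)
        b_j = np.linalg.lstsq(x_s[keep], y_s[keep], rcond=None)[0]   # delete-j fit
        Y_del[j] = X_tot @ b_j + (y_s[j] - x_s[j] @ b_j) / pi2       # Yhat_(j)
    Y_rb = Y_del.mean()                                              # eq. (2)
    z = (Y_del - (n / N) * Y_rb) / (N - n)                           # z^(j)
    v_jack = N**2 / (n * (n - 1)) * ((z - z.mean())**2).sum()        # eq. (7)
    return Y_rb, v_jack
```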
Notice that it may be the case under general unequal probability sampling that the conditional inclusion probability $\pi_{2j}$ given $s_1 = s \setminus \{j\}$ is not exactly known. However, in many situations where the sampling fraction is low, it is reasonable to use
$$ \pi_{2j} \approx \frac{\pi_j}{\sum_{i \notin s_1} \pi_i} = \frac{\pi_j}{n - \sum_{i \in s_1} \pi_i} \approx \frac{\pi_j}{n(1 - n_1/N)} $$
An approximate delete-one RB estimator following (2) can then be given as
$$ \widetilde{Y}_{RB} = X^{\top} \sum_{j=1}^n \frac{b_{(j)}}{n} + \Big( 1 - \frac{n_1}{N} \Big) \sum_{j=1}^n \frac{y_j}{\pi_j} - \Big( 1 - \frac{n_1}{N} \Big) \sum_{j=1}^n \frac{x_j^{\top} b_{(j)}}{\pi_j} \qquad (8) $$
with the corresponding $\widetilde{Y}_{(j)}$ for jackknife variance estimation obtained on replacing $1/\pi_{2j}$ by $n(1 - n_1/N)/\pi_j$. Meanwhile,
the delete-one jackknife replicates of the GREG estimator $\widehat{Y}_{GR}$ can be written as
$$ \widehat{Y}_{GR}^{(j)} = X^{\top} b_{(j)} + \frac{n}{n - 1} \Big( \sum_{i \neq j} \frac{y_i}{\pi_i} - \sum_{i \neq j} \frac{x_i^{\top} b_{(j)}}{\pi_i} \Big) $$
$$ \widehat{Y}_{GR}^{(\cdot)} = \frac{1}{n} \sum_{j=1}^n \widehat{Y}_{GR}^{(j)} = X^{\top} \sum_{j=1}^n \frac{b_{(j)}}{n} + \sum_{i=1}^n \frac{y_i}{\pi_i} - \sum_{i=1}^n \frac{x_i^{\top}}{\pi_i} \sum_{j \neq i} \frac{b_{(j)}}{n - 1} $$
The estimator $\widehat{Y}_{GR}^{(\cdot)}$ is quite close to the approximate RB estimator (8); indeed, the two are identical apart from the factor $1 - n_1/N$ in the special case of $x_i/\pi_i = N/n$. This is not surprising, since the jackknife-based $\widehat{Y}_{GR}^{(\cdot)}$ is an alternative means of reducing the bias of the GREG estimator. The difference is that, provided $\pi_{2j}$ is known, the proposed RB method is exactly design-unbiased, whereas the jackknife-based $\widehat{Y}_{GR}^{(\cdot)}$ is not. Finally, the resemblance between $\widetilde{Y}_{RB}$ and $\widehat{Y}_{GR}^{(\cdot)}$ is another indication that the relative efficiency of the delete-one RB method is usually not a concern compared to the standard GREG estimator $\widehat{Y}_{GR}$.
2.5 Monte Carlo RB
Exact Rao-Blackwellisation can be computationally expensive when the cardinality of the subsample space (of $s_1$) is large. Instead of calculating the RB estimator exactly, consider the Monte Carlo (MC) RB estimator given as follows:
$$ \widehat{Y}_{RB}^K = K^{-1} \sum_{k=1}^K \widehat{Y}_{1k} \qquad (9) $$
where $\widehat{Y}_{1k}$ is the estimator $\widehat{Y}_1$ based on the $k$th subsample, for $k = 1, \ldots, K$, which are realisations of $s_1$ from $q(s_1 | s)$, such that $\widehat{Y}_{RB}^K$ is a Monte Carlo approximation of $\widehat{Y}_{RB}$.
Proposition: The estimator $\widehat{Y}_{RB}^K$ is design-unbiased for $Y$, denoted by $E(\widehat{Y}_{RB}^K) = Y$.
Proof: The result follows from $E(\widehat{Y}_{1k}) = Y$.
Adopting a computationally manageable $K$ entails an increase of variance, i.e. $V_q(\widehat{Y}_{RB}^K | s)$, compared to $\widehat{Y}_{RB}$, so that the variance of $\widehat{Y}_{RB}^K$ is given by
$$ V(\widehat{Y}_{RB}^K) = E_1\big[ V_2(\widehat{Y}_1 | s_1) \big] - E_p\big[ V_q(\widehat{Y}_1 | s) \big] + E_p\big[ V_q(\widehat{Y}_{RB}^K | s) \big] \qquad (10) $$
Due to the IID construction of the $\widehat{Y}_{1k}$, an unbiased estimator of $V_q(\widehat{Y}_{RB}^K | s)$ is given by
$$ \widehat{V}_q(\widehat{Y}_{RB}^K | s) = \frac{1}{K(K-1)} \sum_{k=1}^K \big( \widehat{Y}_{1k} - \widehat{Y}_{RB}^K \big)^2 $$
This allows one to control the statistical efficiency of the MC-RB method, i.e. the choice of $K$ is acceptable when $\widehat{V}_q(\widehat{Y}_{RB}^K | s)$ is deemed small enough in practical terms.
Proposition: Provided an unbiased variance estimator $\widehat{V}(\widehat{Y}_{1k})$ with respect to $p_2(s_2 | s_1)$, i.e. $E_2\big[ \widehat{V}(\widehat{Y}_{1k}) \big] = V_2(\widehat{Y}_1 | s_1)$, a design-unbiased variance estimator for $\widehat{Y}_{RB}^K$ is given by
$$ \widehat{V}(\widehat{Y}_{RB}^K) = \frac{1}{K} \sum_{k=1}^K \widehat{V}(\widehat{Y}_{1k}) - \frac{1}{K} \sum_{k=1}^K \big( \widehat{Y}_{1k} - \widehat{Y}_{RB}^K \big)^2 $$
Proof: Due to the IID construction of the $\widehat{Y}_{1k}$, $K^{-1} \sum_{k=1}^K \widehat{V}(\widehat{Y}_{1k})$ is an unbiased estimator of the first term on the right-hand side of (10), while $(K-1)^{-1} \sum_{k=1}^K \big( \widehat{Y}_{1k} - \widehat{Y}_{RB}^K \big)^2$ is an unbiased estimator of $V_q(\widehat{Y}_1 | s)$ in the second term. Since $K^{-1} \sum_{k=1}^K \big( \widehat{Y}_{1k} - \widehat{Y}_{RB}^K \big)^2$ equals the latter minus $\widehat{V}_q(\widehat{Y}_{RB}^K | s)$, which is unbiased for the third term, the result follows.
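The two variance estimators of this subsection combine the $K$ subsample results in a simple way; the sketch below (ours) takes the per-subsample estimates and, where available, the per-subsample variance estimates as inputs, which are assumed to have been computed as in Section 2.1.

```python
import numpy as np

def mc_rb_summaries(y1_draws, v1_draws):
    """Combine K subsample results from q(s1|s).

    y1_draws[k]: Yhat_1 from the k-th subsample; v1_draws[k]: an unbiased
    estimate of V_2(Yhat_1|s1) for that subsample (requires |s2| >= 2).
    Returns the MC-RB estimate (9), the design-unbiased variance estimate
    of the Proposition above, and the Monte Carlo error used to judge
    whether K is large enough.
    """
    y1 = np.asarray(y1_draws, dtype=float)
    v1 = np.asarray(v1_draws, dtype=float)
    K = len(y1)
    Y_K = y1.mean()                                   # eq. (9)
    between = ((y1 - Y_K)**2).sum()
    V_mc = between / (K * (K - 1))                    # estimated MC error
    V_design = v1.mean() - between / K                # design variance estimate
    return Y_K, V_design, V_mc
```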
Finally, for the delete-one RB method, where an unbiased variance estimator $\widehat{V}(\widehat{Y}_1)$ is not available now that $|s_2| = 1$, a practical option is to first apply the jackknife variance estimator (7) to the $K$ subsamples, as if $\widehat{Y}_{RB}^K$ were the exact RB estimator $\widehat{Y}_{RB}$, and then add to it the extra term $\widehat{V}_q(\widehat{Y}_{RB}^K | s)$ for the additional Monte Carlo error. This would allow one to use the Monte Carlo delete-one RB method in general.
3 Unbiased non-linear learning
In this section we consider design-unbiased estimation in survey sampling, which builds on an arbitrary ML technique that can be non-linear as well as non-parametric.
3.1 Design-unbiased ML for survey sampling
Denote by $M$ the model or algorithm that aims to predict $y_i$ given $x_i$. Let $s_1$ be the training set, and $s_2 = s \setminus s_1$ the test set. Let $\widehat{M}$ be the trained model based on $\{(x_i, y_i) : i \in s_1\}$, yielding $\mu(x_i, s_1)$ as the corresponding $M$-predictor of $y_i$ given $x_i$. Applying the trained model to $i \in s_2$ yields the prediction errors of $\widehat{M}$ conditional on $s_1$, denoted by $e_i = y_i - \mu(x_i, s_1)$. In contrast, the same discrepancy is referred to as a residual of $\widehat{M}$ when it is calculated for $i \in s_1$, denoted by $\hat{e}_i = y_i - \mu(x_i, s_1)$, including when the training set $s_1$ is equal to $s$. In standard ML, the errors in the test set are used to select among different trained algorithms, or to assess how well a trained algorithm can be expected to perform when applied to units with unknown $y_i$'s.
From an inference point of view, a basic problem with the standard ML approach above arises because one needs to be able to ‘extrapolate’ the information in $\{e_i : i \in s_2\}$ to the units outside $s$, in order for supervised learning to have any value at all. This is simply because $\{y_i : i \in s\}$ are all observed and prediction in any form is unnecessary for $i \in s$. No matter how the training-test split is carried out, one cannot ensure valid $\mu(x_k, s_1)$ for $k \notin s$, unless $s$ is selected from the entire reference set of units, i.e. the population $U$, in some non-informative (or representative) manner. This is the well-known problem of observational studies in statistical science, which is sometimes recast as the problem of concept drift in the ML literature (e.g. Tsymbal, 2004).
A design pq-unbiased approach to $M$-assisted estimation of the population total $Y$ can be achieved with respect to
(i) a probability sample $s$ from $U$, with probability $p(s)$, and
(ii) a probabilistic scheme $q(s_1 | s)$ for the training-test split $(s_1, s_2)$ given $s$.
Explicitly, let $\widehat{Y}_{1M}$ be the estimator of $Y$ obtained from the realised sample $s$ and subsample $s_1$ given the model $M$. It is said to be design pq-unbiased for $Y$, provided
$$ E_{pq}(\widehat{Y}_{1M}) = \sum_s p(s) \sum_{s_1 \subset s} q(s_1 | s)\, \widehat{Y}_{1M} = Y $$
where $E_{pq}$ is the expectation of $\widehat{Y}_{1M}$ over all possible $(s, s_1)$. Replacing the linear predictor $x_i^{\top} b_1$ in (1) by any $M$-predictor $\mu(x_i, s_1)$ trained on $s_1$, we obtain
$$ \widehat{Y}_{1M} = \sum_{i \in s_1} y_i + \sum_{i \in U \setminus s_1} \mu(x_i, s_1) + \sum_{i \in s_2} \frac{e_i}{\pi_{2i}} \qquad (11) $$
Proposition: $\widehat{Y}_{1M}$ by (11) is design pq-unbiased for $Y$ using an arbitrary model $M$.
The proof is parallel to that for $\widehat{Y}_1$ by (1), only that $\mu(x, s_1)$ is now based on any chosen model $M$. It is important to point out that the purpose here is to estimate $Y$ at the population level, instead of individual prediction per se. Indeed, $\widehat{Y}_{1M}$ is design-unbiased regardless of whether $M$ is a strong or weak learner. The underlying probabilistic mechanism consists of two necessary elements: $p(s)$ ensures valid extrapolation of learning to the units outside $s$, since otherwise the completely model-based prediction $\sum_{i \in s} y_i + \sum_{i \in U \setminus s} \mu(x_i, s)$ has no guaranteed relevance to $Y$, no matter how the training set $s_1$ is chosen or how $M$ is selected; whereas subsampling $q(s_1 | s)$ is required to be able to project the errors in $s_2$ to the aggregated level, since projecting the residuals in $s$ in the manner of the GREG estimator $\widehat{Y}_{GR}$ (i.e. without the training-test split) would not achieve unbiasedness exactly.
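In code, (11) only requires that the chosen learner exposes a fit/predict interface; the following sketch (ours) is written for any such object, e.g. a scikit-learn regressor, which is our assumption rather than a requirement of the method.

```python
import numpy as np

def y1_hat_M(model, x, y_s, s, s1_local, pi2, N):
    """pq-unbiased estimator (11) for an arbitrary prediction model M.

    model: any object with fit(X, y) and predict(X); x: (N, p) population
    features; y_s: observed y aligned with the sample index array s;
    s1_local: positions within s forming the training set; pi2: conditional
    inclusion probabilities for the test units in s2 (scalar or array).
    """
    s1 = s[s1_local]
    s2_local = np.setdiff1d(np.arange(len(s)), s1_local)
    s2 = s[s2_local]
    model.fit(x[s1], y_s[s1_local])                 # train on s1 only
    mu = model.predict(x)                           # M-predictions for all of U
    outside_s1 = np.setdiff1d(np.arange(N), s1)
    err = y_s[s2_local] - mu[s2]                    # test-set errors, not residuals
    return y_s[s1_local].sum() + mu[outside_s1].sum() + (err / pi2).sum()
```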
3.2 Subsampling RB and bootstrap aggregating
There is a natural affinity between the subsampling RB method and bootstrap aggregating (i.e. bagging). Bagging was originally devised to improve unstable learners (Breiman, 1996a; 1996b) for individual prediction, where the aggregation averages the learner over bootstrap replicates of the training set. The argument can be adapted to design-based population-level estimation. Let $\widehat{Y}_M = \psi(s; M)$ be an $M$-assisted estimator of $Y$, which varies over different samples $s$. Insofar as $\{y_i : i \in U\}$ are treated as unknown constants and $\widehat{Y}_M$ is uniquely determined given $\{(y_i, x_i) : i \in s\}$, the only variation of $\widehat{Y}_M$ derives from that of the sample $s$. For some models $M$, such as a regression tree with random feature selection, there exists extra variation of $\widehat{Y}_M$ given $s$. In any case, let the expectation of $\widehat{Y}_M$ be
$$ \psi_M = E(\widehat{Y}_M) = E\big[ \psi(s; M) \big] $$
over all possible $s$ and the additional randomness given $s$. We have
$$ E\big[ (\widehat{Y}_M - Y)^2 \big] = (\psi_M - Y)^2 + E\big[ (\widehat{Y}_M - \psi_M)^2 \big] $$
since $E(\widehat{Y}_M - \psi_M) = 0$ by definition. Thus, $\psi_M$ always has a smaller mean squared error than $\widehat{Y}_M$. Notice that in reality bagging is “caught in two currents” (Breiman, 1994): the improvement can be appreciable if $\widehat{Y}_M$ is unstable, whereas the additional estimation of $\psi_M$ by bagging may not be worthwhile if $\widehat{Y}_M$ is a stable learner to start with.
It is clear from the above that, while it can reduce the variance of the unbagged predictor, bagging does not affect the potential bias, now that it aims at replacing $\widehat{Y}_M = \psi(s; M)$ by its expectation $\psi_M$. The subsampling RB method is more effectual than bagging in the following sense: on the one hand, it leads generally to design-unbiased estimation of $Y$, which does not result from bagging alone; on the other hand, Rao-Blackwellising $\widehat{Y}_{1M}$ reduces its variance even when it is based on a stable learner, such as $\mu(x_i, s_1) = x_i^{\top} b_1$, which bagging does only for unstable $\widehat{Y}_M$. Replacing $\widehat{Y}_1$ in (2) with $\widehat{Y}_{1M}$ given by (11) generally, the subsampling RB $M$-assisted estimator of $Y$ is given by
$$ \widehat{Y}_M = E_q(\widehat{Y}_{1M} | s) \qquad (12) $$
Proposition: The subsampling RB $M$-assisted estimator $\widehat{Y}_M$ by (12) is design p-unbiased, or simply design-unbiased, for $Y$ using an arbitrary model $M$.
The proof is exactly parallel to that for $\widehat{Y}_{RB}$ by (2). Notice that Rao-Blackwellisation of $\widehat{Y}_{1M}$ with respect to $q(s_1 | s)$ can accommodate straightforwardly any additional variation given $s_1$ due to the chosen model $M$. For example, given a subsample $s_1$, one can grow a regression tree with random feature selection. Although the resulting $\mu(x, s_1)$ is not fixed for the given $s_1$, the corresponding $\widehat{Y}_{1M}$ is still design pq-unbiased, because it is conditionally unbiased for $Y$ given $s_1$ and the outcome of the random feature selection, and $Y$ is a constant with respect to the subsampling of $s_1$ and the random feature selection given $s_1$.
Finally, Monte Carlo subsampling RB is operationally similar to bagging, involving about the same amount of computational effort. In bagging, one draws a bootstrap replicate sample from $s$; whereas in subsampling RB, one resamples $s_1$ from $s$ according to $q(s_1 | s)$. In either case, one trains the model on the resample. Repeating the two steps $K$ times yields the bagged predictor by bagging, and the MC-RB estimator by subsampling RB. The choice of $K$ balances between numerical and statistical efficiency.
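For concreteness, a Monte Carlo SRB loop assisted by a random forest could be organised as below (our sketch); scikit-learn's RandomForestRegressor is used purely for illustration, and SRS is assumed at both stages so that $\pi_{2i} = (n - n_1)/(N - n_1)$.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor   # illustrative choice of learner

def mc_srb_rf(x, y_s, s, N, n1, K, rng):
    """Monte Carlo subsampling RB assisted by a random forest.

    Operationally parallel to bagging: repeat K times -- draw s1 ~ q(s1|s)
    by SRS of size n1, train on s1, project the test-set errors in s2 via
    1/pi_2i -- then average the K estimates as in (9).
    """
    n = len(s)
    pi2 = (n - n1) / (N - n1)
    y1 = np.empty(K)
    for k in range(K):
        i1 = rng.choice(n, size=n1, replace=False)
        i2 = np.setdiff1d(np.arange(n), i1)
        rf = RandomForestRegressor(n_estimators=200, random_state=k)
        rf.fit(x[s[i1]], y_s[i1])
        mu = rf.predict(x)
        out1 = np.setdiff1d(np.arange(N), s[i1])
        y1[k] = y_s[i1].sum() + mu[out1].sum() + ((y_s[i2] - mu[s[i2]]) / pi2).sum()
    return y1.mean(), y1          # the MC-SRB estimate and the K subsample estimates
```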
3.3 Design consistency
Provided $n_2 = |s_2| \geq 2$, let $\widehat{V}(\widehat{Y}_{1M,k})$ be an unbiased variance estimator with respect to $p_2(s_2 | s_1)$, i.e. $E_2\big[ \widehat{V}(\widehat{Y}_{1M,k}) \big] = V_2(\widehat{Y}_{1M,k})$, for the $K$ subsamples $k = 1, \ldots, K$. A design-unbiased variance estimator for the MC-RB estimator $\widehat{Y}_M^K$ is given by
$$ \widehat{V}(\widehat{Y}_M^K) = \frac{1}{K} \sum_{k=1}^K \widehat{V}(\widehat{Y}_{1M,k}) - \frac{1}{K} \sum_{k=1}^K \big( \widehat{Y}_{1M,k} - \widehat{Y}_M^K \big)^2 \qquad (13) $$
similarly as for $\widehat{Y}_{RB}^K$. It is an open question at this stage how to determine an efficient subsampling scheme $q(s_1 | s)$, including the choice of $n_1$. Meanwhile, given the simplicity and practical advantage of the delete-one GREG-assisted $\widehat{Y}_{RB}^K$, any other $M$-assisted estimator would not be worth considering unless it has a clearly smaller estimated variance.
A design-unbiased $M$-assisted estimator is consistent provided its sampling variance tends to 0 asymptotically, as $n, N \to \infty$. Since this is the case with the delete-one GREG-assisted estimator $\widehat{\bar{Y}}_{RB}$ of the population mean $\bar{Y} = Y/N$, and since in practice one would only admit an alternative estimator with an even smaller variance, design consistency is not a worrisome issue for design-unbiased $M$-assisted estimation in applications.
Meanwhile, we cannot find any direct references in the literature concerning the design consistency of ML techniques. For example, Gordon and Olshen (1978, 1980) establish consistency of recursive partitioning algorithms, such as regression trees, provided an IID training set. Toth and Eltinge (2011) extend their result, allowing for a sampling design in addition to the IID super-population model $M$, such that the consistency of regression trees for individual prediction, based on samples selected from $p(s)$, is not purely design-based, but requires the super-population model to hold in addition.
In the standard ML literature, asymptotic results are typically derived under stability conditions. Bousquet and Elisseeff (2002) establish a uniform stability condition for regularisation algorithms. Mukherjee et al. (2006) pay special attention to empirical risk minimisation algorithms. Both these works are directed at individual-level prediction from an IID training set, denoted by $S = \{(x_i, y_i) : i = 1, \ldots, n\}$, asymptotically as $n \to \infty$. Let $Z = (x, y)$ generically denote the random variables from the relevant distribution. Let $\mu(x, S)$ be a given predictor trained on $S$. Its prediction mean squared error is $E_Z\big[ (y - \mu(x, S))^2 \big]$. Expectation with respect to $S$ is needed in addition for the stability definitions.
Different definitions of stability are needed under the pq-design-based approach to population-level estimation, where $\{(x_i, y_i) : i \in U\}$ are treated as constants and only the sample $s$ is random. Below we consider the delete-one RB estimator (12) first under the special case of SRS and then under a general unequal probability sampling design.
3.3.1 Stability condition: SRS
Let $s_j = s \setminus \{j\}$ be the delete-$j$ sample. Let $s_{ij} = s \setminus \{i, j\}$ be the delete-$ij$ sample. Let $\mu(x, s_j)$ be the $M$-predictor given $x$, which is trained on $s_j$, and $\mu(x, s_{ij})$ that trained on $s_{ij}$. We define $\mu(x, s)$ to be twice q-stable if
$$ \mu(x_k, s_j) - \mu(x_k, s) \overset{P}{\to} 0 \quad \text{and} \quad \mu(x_k, s_{ij}) - \mu(x_k, s_j) \overset{P}{\to} 0 \qquad (14) $$
i.e. convergence in probability, asymptotically as $n, N \to \infty$, for any $i, j \in s$ and $k \in U$, where $s_j$ results from delete-one q-sampling from $s$, and $s_{ij}$ from recursive q-sampling where one randomly deletes $i \in s_j$. Notice that the first part of (14) is analogous to the ‘pointwise hypothesis leave-one-out stability’ of Mukherjee et al. (2006).
Theorem 1: The delete-one RB estimator (12) is consistent for the population mean $\bar{Y}$ under SRS, as $n, N \to \infty$, given twice q-stability and $y_i - \mu(x_i, s) = O(1)$ for any $s$.
Proof: We have $V_p(\widehat{Y}_M) = E_1\big[ V_2(\widehat{Y}_{1M} | s_1) \big] - E_p\big[ V_q(\widehat{Y}_{1M} | s) \big]$, as by (3), where
$$ \widehat{Y}_{1M}(s_j) = \sum_{k \in s_j} y_k + \sum_{k \notin s_j} \mu(x_k, s_j) + (N - n + 1)\, z_j(s_j) = \sum_{k \in s} y_k + \sum_{k \notin s} \mu(x_k, s_j) + (N - n)\, z_j(s_j) $$
and $z_k(s_j) = y_k - \mu(x_k, s_j)$ for any $k$ and delete-$j$ sample $s_j$. Under SRS, we have
$$ V_2\big[ \widehat{Y}_{1M}(s_j) \,\big|\, s_j \big] = (N - n + 1) \sum_{k \in s_j^c} \big( z_k(s_j) - \bar{Z}_{s_j^c}(s_j) \big)^2 $$
where $\bar{Z}_{s_j^c}(s_j) = \sum_{k \in s_j^c} z_k(s_j)/(N - n)$ and $s_j^c = U \setminus s_j$. By (14), we have
$$ z_k(s_j) - \bar{Z}_{s_j^c}(s_j) = [1 + o_p(1)] \big( z_k(s_{ij}) - \bar{Z}_{s_j^c}(s_{ij}) \big) $$
for any $i \in s_j$, where $\bar{Z}_{s_j^c}(s_{ij}) = \sum_{k \in s_j^c} z_k(s_{ij})/(N - n)$, and, averaged over all $i \in s_j$,
$$ V_2\big[ \widehat{Y}_{1M}(s_j) \,\big|\, s_j \big] = [1 + o_p(1)]\, (N - n + 1)(N - n) \Big\{ \sum_{i \in s_j} \frac{1}{n - 1} \widehat{V}_{s_j^c} \Big\}, \qquad \widehat{V}_{s_j^c} = \frac{1}{n_{s_j^c} - 1} \sum_{k \in s_j^c} \big( z_k(s_{ij}) - \bar{Z}_{s_j^c}(s_{ij}) \big)^2 $$
where $n_{s_j^c} = |s_j^c| = N - n + 1$. One can consider $\widehat{V}_{s_j^c}$ as an unbiased estimator of
$$ \tau(s_{ij}) = \frac{1}{N_{s_{ij}^c} - 1} \sum_{k \in s_{ij}^c} \big( z_k(s_{ij}) - \bar{Z}_{s_{ij}^c}(s_{ij}) \big)^2 $$
where $N_{s_{ij}^c} = |s_{ij}^c| = N - n + 2$, i.e. of the population variance of $z_k(s_{ij})$ in $s_{ij}^c$, based on the SRS sample $s_j^c$ from $s_{ij}^c$ conditional on $s_{ij}$, since $s_{ij}^c = s_j^c \cup \{i\} = U \setminus s_{ij}$, such that
$$ E_1\Big[ \sum_{i \in s_j} \frac{1}{n - 1} \widehat{V}_{s_j^c} \Big] = E_{s_{ij}}\big[ E_i\big( \widehat{V}_{s_j^c} \,\big|\, s_{ij} \big) \big] = E_{s_{ij}}\big[ \tau(s_{ij}) \big] . $$
Given $y_k - \mu(x_k, s) = O(1)$ for any $s$ and $k \in U$, we obtain
$$ E_1\big[ V_2(\widehat{Y}_{1M} | s_1) \big] = [1 + o(1)]\, (N - n + 1)(N - n)\, E_{s_{ij}}\big[ \tau(s_{ij}) \big] . $$
Next, for $V_q(\widehat{Y}_{1M} | s)$, we notice that, by $\mu(x_k, s_j) - \mu(x_k, s) \overset{P}{\to} 0$ in (14),
$$ \widehat{Y}_{1M}(s_j) = [1 + o_p(1)] \Big( \sum_{k \in s} y_k + \sum_{k \notin s} \mu(x_k, s) + (N - n)\, z_j(s_j) \Big) $$
where $V_q\big[ z_j(s_j) | s \big] = \sum_{i \in s} \big( z_i(s_i) - \bar{z}(s) \big)^2 / n$, for $\bar{z}(s) = \sum_{j \in s} z_j(s_j)/n$. In a Sen-Yates-Grundy type expression using pairwise differences, we can write
$$ V_q\big[ z_j(s_j) | s \big] = \frac{n-1}{n} \Big\{ \Big( \frac{n(n-1)}{2} \Big)^{-1} \sum_{i < j \in s} \frac{1}{2} \big( z_i(s_i) - z_j(s_j) \big)^2 \Big\} = \frac{n-1}{n} \Big\{ \Big( \frac{n(n-1)}{2} \Big)^{-1} \sum_{i < j \in s} \widehat{V}_{ij} \Big\} [1 + o_p(1)] $$
where $\widehat{V}_{ij} = \frac{1}{2} \big( z_i(s_{ij}) - z_j(s_{ij}) \big)^2$, by $\mu(x_k, s_{ij}) - \mu(x_k, s_j) \overset{P}{\to} 0$ in (14). One can consider $\widehat{V}_{ij}$ as an unbiased estimator of $\tau(s_{ij})$, i.e. of the population variance of $z_k(s_{ij})$ in $s_{ij}^c$, based on the SRS sample $s_2 = \{i, j\}$ from $s_{ij}^c$ conditional on $s_1 = s_{ij}$. Moreover, one can view the expression in the last braces above as $E_q(\widehat{V}_{ij} | s)$ with respect to the delete-two subsampling $q(s_{ij} | s)$, such that
$$ E_p\big[ \{\cdot\} \big] = E_p\big[ E_q(\widehat{V}_{ij} | s) \big] = E_{s_1}\big[ E_{s_2}(\widehat{V}_{ij} | s_{ij}) \big] = E_{s_1}\big[ \tau(s_{ij}) \big] = E_{s_{ij}}\big[ \tau(s_{ij}) \big] . $$
Given $y_k - \mu(x_k, s) = O(1)$ for any $s$ and $k \in U$, we obtain
$$ E_p\big[ V_q(\widehat{Y}_{1M} | s) \big] = [1 + o(1)]\, (N - n)^2\, \frac{n-1}{n}\, E_{s_{ij}}\big[ \tau(s_{ij}) \big] . $$
Finally, the result follows from $E_1\big[ V_2(\widehat{Y}_{1M} | s_1) \big]$ and $E_p\big[ V_q(\widehat{Y}_{1M} | s) \big]$ above, since
$$ V_p\big( \widehat{\bar{Y}}_M \big) = \Big( 1 - \frac{n}{N} \Big) \frac{1}{n}\, E_{s_{ij}}\big[ \tau(s_{ij}) \big] + o(1) . $$
3.3.2 Stability condition: Unequal probability sampling
For general unequal probability sampling, we define the following stability conditions. First, we define $\mu(x, s)$ to be simply q-stable if, for any $j \in s$ and $k \in U$, we have
$$ \mu(x_k, s_j) - \mu(x_k, s) \overset{P}{\to} 0 \qquad (15) $$
asymptotically as $n, N \to \infty$, where $s_j$ results from delete-one subsampling $q(s_j | s)$. Next,
we define $\mu(x, s)$ to be p-stable for the delete-one RB method, if
$$ \frac{1}{n} \sum_{j \in s} \frac{\widehat{N}_j}{N} \mu(x_j, s) - \frac{1}{N} \sum_{k \in U} \mu(x_k, s) \overset{P}{\to} 0 \qquad (16) $$
where $\widehat{N}_j = \pi_{2j}^{-1} + (n - 1)$ is an estimator of $N$ based on $s_2 = \{j\}$. Notice that, given
q-stability (15), it is possible to replace p-stability (16) by a pq-stability condition
$$ \frac{1}{n} \sum_{j \in s} \frac{\widehat{N}_j}{N} \mu(x_j, s_j) - \frac{1}{N} \sum_{k \in U} \mu(x_k, s_j) \overset{P}{\to} 0 $$
which reduces to $\sum_{j \in s} \mu(x_j, s_j)/n - \sum_{k \in U} \mu(x_k, s_j)/N \overset{P}{\to} 0$ under SRS, and resembles the IID ‘expected leave-one-out stability’ of Mukherjee et al. (2006): the first term above is the empirical average over the observed set in both definitions, whereas for the second term here we replace averaging over $Z$ in the IID setting by averaging over the population distribution function, which places point mass $1/N$ on each $k \in U$.
Some regularity condition on the sampling design $p(s)$ is needed for the general situation. Let $\widehat{\bar{Y}}_j = y_j \widehat{N}_j / N$ be the leave-one-out (LOO) HT estimator of the population mean $\bar{Y}$ based on $s_2 = \{j\}$. We define the sampling design to be LOO-consistent, if
$$ \frac{1}{n} \sum_{j \in s} \widehat{\bar{Y}}_j \overset{P}{\to} \bar{Y} \qquad (17) $$
asymptotically as $n, N \to \infty$. The condition is specified for the LOO-RB-HT estimator, where $\sum_{j \in s} \widehat{\bar{Y}}_j / n = E_q(\widehat{\bar{Y}}_j | s)$. Under SRS, $\sum_{j \in s} \widehat{\bar{Y}}_j / n = \sum_{j \in s} y_j / n$ is the sample mean, which converges to $\bar{Y}$ in probability, provided $y_i = O(1)$ for all $i \in U$. We emphasise that the condition (17) concerns only the sampling design $p(s)$, since it is formulated in terms of the $y$-values alone, i.e. based on an ‘empty’ $M$-predictor, so to speak.
Theorem 2: The delete-one RB estimator (12) is consistent for the population mean $\bar{Y}$, as $n, N \to \infty$, provided q- and p-stability and a LOO-consistent sampling design $p(s)$.
Proof: Given the delete-$j$ sample $s_j$ under any general sampling design, we can write
$$ \widehat{Y}_{1M}(s_j) = \sum_{i \in s} y_i + \sum_{k \notin s} \mu(x_k, s_j) + (\pi_{2j}^{-1} - 1)\big( y_j - \mu(x_j, s_j) \big) = \sum_{i \in s} \big( y_i - \mu(x_i, s_j) \big) + \sum_{k \in U} \mu(x_k, s_j) + (\pi_{2j}^{-1} - 1)\big( y_j - \mu(x_j, s_j) \big) $$
where $\pi_{2j}$ is the conditional probability of selecting $j$ from $s_j^c$ given $s_j$. Given q-stability (15), i.e. $\mu(x, s_j) = \mu(x, s) + o_p(1)$, the RB estimator of the population mean is
$$ \widehat{\bar{Y}}_M = \frac{1}{N} \sum_{k \in U} \mu(x_k, s) + \frac{1}{n} \sum_{j \in s} \frac{\widehat{N}_j}{N} \big( y_j - \mu(x_j, s) \big) + o_p(1) = \Big\{ \frac{1}{N} \sum_{k \in U} \mu(x_k, s) - \frac{1}{n} \sum_{j \in s} \frac{\widehat{N}_j}{N} \mu(x_j, s) \Big\} + \frac{1}{n} \sum_{j \in s} \widehat{\bar{Y}}_j + o_p(1) . $$
The result follows from applying the p-stability condition (16) to the expression in braces and the LOO-consistency condition (17) to the remaining term $\frac{1}{n} \sum_{j \in s} \widehat{\bar{Y}}_j$.
4 Simulations
Below we present and discuss some simulation results for the delete-one RB (or LOO-RB) method and the associated jackknife variance estimator described in Section 2.4. The HT and some GREG estimators are computed for comparison. The target is always the population mean (denoted by $\theta$) in a given set-up. The simulations proceed as follows.
- $B$ samples (usually $B = 100$) are drawn independently from the given fixed population according to a specified sampling design.
- We obtain an estimate $\hat{\theta}^{(b)}$ based on each sample, for $b = 1, \ldots, B$. In particular, for the LOO-RB method, we calculate its associated jackknife variance estimate $v^{(b)}$.
- An estimate of $E(\hat{\theta})$ over repeated sampling is $\hat{\bar{\theta}} = \sum_{b=1}^B \hat{\theta}^{(b)}/B$, with associated Monte Carlo error $\sqrt{v/B}$, where $v = \sum_{b=1}^B (\hat{\theta}^{(b)} - \hat{\bar{\theta}})^2/(B - 1)$. An estimate of its bias is $\hat{\bar{\theta}} - \theta$; an estimate of its root mean squared error (RMSE) is $\big\{ \sum_{b=1}^B (\hat{\theta}^{(b)} - \theta)^2/B \big\}^{1/2}$.
- Similarly for the bias and RMSE of the variance estimator $v^{(b)}$, except that the true variance of the LOO-RB method is unknown and is replaced by its estimate $v$.
Since the HT estimator and the LOO-RB methods are unbiased, an inspection of their respective simulation-based bias estimates and the associated Monte Carlo errors can usually provide adequate information for judging whether a certain conclusion from the results is warranted given the actual number of simulations.
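The sketch below (ours) mirrors the listed steps for a generic estimator and design; all function names are placeholders supplied by the user.

```python
import numpy as np

def simulate(estimator, draw_sample, theta, B=100, seed=0):
    """Monte Carlo summary of a design-based estimator of the mean theta.

    draw_sample(rng) returns the index set of one realised sample under the
    chosen design; estimator(sample_idx) returns an estimate of theta.
    Returns the estimated bias, its Monte Carlo error and the RMSE.
    """
    rng = np.random.default_rng(seed)
    est = np.array([estimator(draw_sample(rng)) for _ in range(B)])
    bias = est.mean() - theta
    mc_error = np.sqrt(est.var(ddof=1) / B)      # sqrt(v / B) as defined above
    rmse = np.sqrt(((est - theta)**2).mean())
    return bias, mc_error, rmse
```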
4.1 Simulations with synthetic data
The GREG estimator has become the standard-bearer in practical survey sampling over the past three decades. Using the simple simulations below, we would like to gain some basic appreciation of the pros and cons of the corresponding LOO-RB-GREG estimator, given by (2), under the proposed unbiased learning approach. Small synthetic populations are generated based on only two regressors. The first regressor $x_1$ follows a log-normal distribution with mean and variance both set to one. The second regressor $x_2$ follows a Poisson distribution with mean 5. The target $y$-variable in each setting is generated as the absolute value of a certain function of $x_1$ and $x_2$ plus a regression error.
Table 1: Simulation results of the HT, GREG and LOO-RB-GREG estimators, by two different sampling designs. Monte Carlo errors of bias estimates in parentheses.

                        Simple Random Sampling         Probability Proportional to x_2
Estimator               Bias (MC Error)    RMSE        Bias (MC Error)    RMSE
HT                      0.08 (0.19)        1.91        0.12 (0.20)        2.02
GREG                    -0.09 (0.13)       1.29        0.03 (0.16)        1.60
LOO-RB-GREG             0.10 (0.14)        1.39        0.16 (0.16)        1.63
Variance by jackknife   0.52 (0.10)        1.09        4.73 (1.02)        11.23
We start with a setting where the GREG estimator should have a negligible or very small bias. Let the population size be 200, and let the target survey $y$-variable be the absolute value of $1.5 x_1 + x_2 + \epsilon$, where $\epsilon$ follows a normal distribution with zero mean and variance equal to a quarter of the variance of $x_1$. Let the sample size be 20. Two sampling designs are used: SRS, or conditional Poisson sampling with probabilities proportional to $x_2$ as the size variable. The results are given in Table 1.
It can be seen that under both sampling designs, GREG and LOO-RB-GREG have essentially the same efficiency, and both outperform HT estimation. Recall that the bias of the GREG estimator is negligible in this scenario because of the underlying linear population model. Clearly, the jackknife variance estimator (7), which is derived as a direct analogy to the IID-sample situation, needs to be modified for unequal probability sampling designs such as the conditional Poisson sampling used here.
Table 2: Simulation results of the HT, GREG and LOO-RB-GREG estimators under SRS, but two different population models. Monte Carlo errors of bias estimates in parentheses.

                        V(y) ∝ x_1, n = 5              Non-linear, V(y) ∝ x_1, n = 20
Estimator               Bias (MC Error)    RMSE        Bias (MC Error)    RMSE
HT                      -0.46 (0.50)       5.03        0.95 (1.26)        12.57
GREG                    -0.82 (0.41)       4.16        -2.41 (0.51)       5.62
LOO-RB-GREG             -0.68 (0.77)       7.69        0.68 (0.86)        8.62
Variance by jackknife   -13.22 (15.75)     157.31      -7.36 (10.50)      104.76
Consider now two potentially problematic settings. First, we introduce heteroscedasticity by making the variance of the $y$-variable proportional to $x_1$, while reducing the sample size at the same time, to $n = 5$ (from $N = 100$). The results are given in the left part of Table 2. The LOO-RB-GREG is the least efficient estimator here: the heteroscedasticity increases the variance of $\widehat{Y}_1$ based on each subsample, whereas the small sample size implies RB averaging over only 5 subsample estimates (instead of 20 above). The RMSE of the jackknife variance estimator is much bigger for similar reasons.
Next, reverting to $(N, n) = (200, 20)$, we generate the target $y$-variable non-linearly as the absolute value of $0.5 x_1 + 0.25 x_1^2 + x_2 + \epsilon$, where $\epsilon$ follows a normal distribution with zero mean and variance proportional to $x_1$. The results under SRS are given in the right part of Table 2. The GREG estimator now has a relatively large bias, which is removed by the LOO-RB-GREG estimator. However, the unbiased learning estimator loses efficiency compared to GREG in terms of the MSE, although it is still much better than the HT estimator. The performance of the variance estimator is similar as before.
These results illustrate the basic pros and cons of delete-one RB-GREG vs. standard GREG estimation. On the one hand, the GREG estimator may suffer from non-negligible bias, e.g. because one applies the assisting linear model in a routine manner without conducting careful model diagnostics as one should, whereas the unbiased learning approach avoids the bias by definition. On the other hand, delete-one subsampling may suffer from loss of efficiency given heteroscedastic observations in very small samples.
4.2 Simulations with real data
The population consists of a sample of about 17000 small and medium-sized enterprises from the Spanish Structural Business Survey (SSBS). As the target variables we consider three survey variables collected in the SSBS: Turnover, Total personnel expenses and Total procurements of goods and services. Seventeen variables from the administrative corporate income tax data are imported as the regressors. One of them is turnover, although for many enterprises the turnover from the tax data will differ from the turnover from the SSBS by definition; in addition, the two observed values may differ because of registration delays or other operational reasons. The estimators to be considered are: HT, GREG1 with one regressor (turnover), GREG17 with the seventeen regressors (as main effects), LOO-RB-GREG1 with one regressor (turnover), and LOO-RB random forest (RF) with the seventeen features. When only one regressor is used, RF is not a good option to be included here. Jackknife variance estimation is applied to the two SRB estimators.
4.2.1 Turnover
This case is interesting because turnover (from the tax data) is one of the regressors. We consider SRS and stratified SRS designs. For the latter, three strata are created by the number of employees, which is a commonly used stratification variable in SBS, although the actual designs always have many other complicating details in practice. The stratum sample sizes are allocated proportionally to the stratum population sizes. The total sample size is 10% of the population under both designs. The simulation results are given in Table 3, similarly as before and suitably scaled for presentation.
Table 3: Simulation results for Turnover estimation by different estimators under two sampling designs. Monte Carlo errors of bias estimates in parentheses.

                          SRS                           Stratified SRS
Estimator                 Bias (MC Error)    RMSE       Bias (MC Error)    RMSE
HT                        -0.22 (0.47)       4.65       -0.36 (0.32)       3.20
GREG1                     0.01 (0.26)        2.56       0.30 (0.22)        2.16
LOO-RB-GREG1              -0.21 (0.27)       2.68       0.05 (0.23)        2.31
Variance LOO-RB-GREG1     -1.02 (0.32)       3.37       0.60 (0.32)        3.27
GREG17                    0.42 (0.22)        2.30       0.79 (0.75)        7.47
LOO-RB-RF                 -0.19 (0.14)       1.37       -0.10 (0.14)       1.38
Variance LOO-RB-RF        0.09 (0.03)        0.34       -0.04 (0.04)       0.38
The LOO-RB-RF (by random forest) is the most efficient estimator under SRS. It is more efficient than the GREG17 estimator, because RF yields a better prediction model than simple linear regression using all the regressors as main effects. In fact, the GREG17 estimator introduces a small bias compared to the GREG1 estimator, and is only more efficient by a small margin. The LOO-RB-GREG1 estimator has about the same efficiency as the GREG1 estimator. Compared to the simple simulation results earlier, heteroscedastic variance does not cause a loss of efficiency for the LOO-RB method here, because the sample size is large enough. The jackknife variance estimator has no statistically significant bias for the LOO-RB-RF estimator, but it has a small negative bias for the LOO-RB-GREG1 estimator.
Next, under the stratified SRS design, the LOO-RB-RF is again the most efficient estimator. Its RMSE is about the same as under SRS, which is not surprising given the proportional allocation of stratum sample sizes, because RF is able to account for the design size variable using the auxiliary information in the 17 regressors. In contrast, the GREG17 estimator actually loses efficiency and does not behave well here, which again illustrates that applying the GREG estimator without appropriate attention to model diagnostics can be counter-productive in practice. The relative performance of the simple GREG1 estimator and its unbiased counterpart LOO-RB-GREG1 is similar as under SRS, and both are slightly more efficient under the stratified design. The jackknife variance estimators perform similarly as under the SRS design.
4.2.2 Other target variables
Simulation results for the other two target variables under the stratified SRS design are given in Table 4.
Table 4: Simulation results for Total personal expenses and Total procurements under the stratified SRS design. Monte Carlo errors of bias estimates in parentheses.

                          Total personal expenses       Total procurements
Estimator                 Bias (MC Error)    RMSE       Bias (MC Error)    RMSE
HT                        0.03 (0.04)        0.58       -0.05 (0.29)       2.88
GREG1                     0.09 (0.04)        0.60       -0.07 (0.21)       2.08
LOO-RB-GREG1              0.06 (0.04)        0.60       -0.20 (0.21)       2.14
Variance LOO-RB-GREG1     0.02 (0.01)        0.03       0.03 (0.14)        1.37
GREG17                    0.01 (0.02)        0.33       0.50 (0.18)        1.90
LOO-RB-RF                 0.01 (0.02)        0.32       -0.07 (0.11)       1.11
Variance LOO-RB-RF        0.03 (0.00)        0.07       -0.21 (0.03)       0.34

When it comes to Total personal expenses, turnover from the tax data is not a good regressor at all, so that the simple GREG1 estimator yields no improvement over the HT
estimator. The same holds for the LOO-RB-GREG1 estimator, which performs similarly to the GREG1 estimator, as can be expected. Meanwhile, both the GREG17 and LOO-RB-RF estimators are noticeably better. This suggests that the other regressors can be linearly related to this target variable, and that the RF model is flexible enough to automatically capture this linear regression relationship here. The jackknife variance estimators exhibit no bias for the LOO-RB methods in this case.
Turning to the results for Total procurements of goods and services, the LOO-RB-RF estimator is again by far the most efficient of all. The GREG1 and LOO-RB-GREG1 estimators similarly improve on the HT estimator, turnover from the tax data being a reasonable regressor for this variable. The GREG17 estimator is more efficient than the simple GREG1 estimator by a small margin, albeit at the cost of introducing a small bias that is statistically significant. In contrast, the LOO-RB-RF estimator provides a much greater gain in efficiency while remaining design-unbiased. The jackknife variance estimator is essentially unbiased for the LOO-RB-GREG1 estimator, but it has a small negative bias for the LOO-RB-RF estimator.
4.2.3 Conclusions
The following conclusions seem warranted based on the simulation results above.
In situations where simple GREG estimation (with few regressors) has little bias to start with, i.e. when the simple linear regression model is a reasonable statistical model, the proposed unbiased learning approach is unlikely to offer appreciable improvement in practice. More advanced learning techniques cannot be of much help without a supply of additional useful features. Nevertheless, where GREG estimation may suffer from non-negligible bias in a given situation, because the linear model is inappropriate, the unbiased learning approach can avoid the bias automatically.
More importantly, provided richer auxiliary information, the proposed unbiased SRB learning approach can yield large gains. On the one hand, it allows one to make use of modern ML techniques that can potentially lead to much more flexible and powerful prediction models, without demanding the same kind of effort that is often necessary for building complex parametric models. On the other hand, the theory for design-unbiased statistical learning developed in this paper ensures that the resulting ML-assisted estimator is valid for descriptive inference, so that the ML prediction model can help to generate valid and efficient estimation at the aggregated level, without requiring the model to be entirely correct at the individual level, because the prediction errors in the sample are extrapolated to the population of interest based on the known pq-sampling design.
5 Summary remarks
Amalgamating classic ideas of Statistical Science and Machine Learning, we have developed an ML-assisted SRB approach for pq-design-unbiased statistical learning in survey sampling. It allows one to achieve generally design-unbiased model-assisted estimation based on probability sampling from the population of interest. The freedom to adopt modern as well as emerging powerful algorithmic ML prediction models should enable one to make more efficient use of rich auxiliary information whenever it is available.
A topic for future research can be noted immediately. As mentioned earlier, it is an open question at this stage how to construct an efficient subsampling scheme $q(s_1 | s)$, including the choice of $n_1 = |s_1|$. Moreover, a related issue is the sampling design. In this paper, we have assumed the pq-design approach, because it fits naturally with the current practice of survey sampling, where the sampling design $p(s)$ is already implemented and given at the stage of estimation, so that only the subsampling scheme $q(s_1 | s)$ is left to one's own device. However, by construction, the combined randomisation distribution induced by $(p, q)$ is the same as that induced by $(p_1, p_2)$, for any $s_1 \cup s_2 = s$ and $s_1 \cap s_2 = \emptyset$. It may be worth investigating whether a direct approach to the design of $(p_1, p_2)$ may offer certain advantages. Finally, it is easily envisaged that more efficient and accurate variance estimation methods will be discovered by future research.
References
[1] Blackwell, D. (1947). Conditional expectation and unbiased sequential estimation. Ann. Math. Statist., 18:105-110.
[2] Bousquet, O. and Elisseeff, A. (2002). Stability and generalization. J. Mach. Learning Res., 2:499-526.
[3] Breidt, F.J. and Opsomer, J.D. (2017). Model-assisted survey estimation with modern prediction techniques. Statist. Scien., 32:190-205.
[4] Breiman, L. (1996a). Heuristics of instability and stabilization in model selection. Ann. Statist., 24:2350-2383.
[5] Breiman, L. (1996b). Bagging predictors. Mach. Learn., 26:123-140.
[6] Cassel, C.M., Särndal, C.-E. and Wretman, J.H. (1976). Some results on generalized difference estimation and generalized regression estimation for finite populations. Biometrika, 63:615-620.
[7] Deville, J.-C. and Särndal, C.-E. (1992). Calibration estimators in survey sampling. J. Amer. Statist. Assoc., 87:376-382.
[8] Gordon, L. and Olshen, R. (1978). Asymptotically efficient solutions to the classification problem. Ann. Statist., 6:515-533.
[9] Gordon, L. and Olshen, R. (1980). Consistent nonparametric regression from recursive partitioning schemes. J. Mult. Ana., 10:611-627.
[10] Horvitz, D.G. and Thompson, D.J. (1952). A generalization of sampling without replacement from a finite universe. J. Amer. Statist. Assoc., 47:663-685.
[11] Mukherjee, S., Niyogi, P., Poggio, T. and Rifkin, R. (2006). Learning theory: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization. Adv. Comp. Math., 25:161-193.
[12] Rao, C.R. (1945). Information and accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc., 37:81-91.
[13] Särndal, C.-E. (2010). The calibration approach in survey theory and practice. Surv. Methodol., 33:99-119.
[14] Särndal, C.-E., Swensson, B. and Wretman, J. (1992). Model Assisted Survey Sampling. New York: Springer-Verlag.
[15] Toth, D. and Eltinge, J.L. (2011). Building consistent regression trees from complex sample data. J. Amer. Statist. Assoc., 106:1626-1636.
[16] Tsymbal, A. (2004). The problem of concept drift: definitions and related work. Comp. Scien., 106(2), 58.
[17] Tukey, J.W. (1958). Bias and confidence in not quite large samples (abstract). Ann. Math. Statist., 29:614.