arXiv:2003.11423v1 [stat.ML] 25 Mar 2020

Design-unbiased statistical learning in survey sampling

Luis Sanguiao Sande (1) and Li-Chun Zhang (2,3,4)

(1) Instituto Nacional de Estadística ([email protected])
(2) University of Southampton (email: [email protected])
(3) Statistisk sentralbyrå
(4) Universitetet i Oslo
Abstract: Design-consistent model-assisted estimation has become the standard practice in survey sampling. However, a general theory is lacking so far, which allows one to incorporate modern machine-learning techniques that can lead to potentially much more powerful assisting models. We propose a subsampling Rao-Blackwell method, and develop a statistical learning theory for exactly design-unbiased estimation with the help of linear or non-linear prediction models. Our approach makes use of classic ideas from Statistical Science as well as the rapidly growing field of Machine Learning. Provided rich auxiliary information, it can yield considerable efficiency gains over standard linear model-assisted methods, while ensuring valid estimation for the given target population, which is robust against potential mis-specifications of the assisting model at the individual level.

Keywords: Rao-Blackwellisation, bagging, pq-unbiasedness, stability conditions
1 Introduction
Approximately design-unbiased model-assisted estimation is not new. It has become the standard practice in survey sampling, following many influential works such as Särndal et al. (1992) and Deville and Särndal (1992). However, there is so far no general theory which allows one to incorporate the many common machine-learning (ML) techniques. For instance, according to Breidt and Opsomer (2017, p. 203), they “are not aware of direct uses of random forests in a model-assisted survey estimator”. Since modern ML techniques can often generate more flexible and powerful prediction models when rich auxiliary feature data are available, the potential is worth exploring in any situation where the practical advantages of linear weighting are not essential compared to the efficiency gains that can be achieved by alternative non-linear ML techniques.
We propose a subsampling Rao-Blackwell (SRB) method, which enables exactly design-unbiased estimation with the help of linear or non-linear prediction models. Monte Carlo (MC) versions of the proposed method can be used in cases where the exact RB method is computationally too costly. The MC-SRB method is still exactly design-unbiased, although it is somewhat less efficient due to the additional MC error. In practice, though, one can easily balance the numerical efficiency of the MC-SRB method against the statistical efficiency of the corresponding exact RB method.
The SRB method makes use of three classic ideas from Statistical Science and Machine Learning. On the one hand, the training-test split of the sample of observations in ML generates errors in the test set rather than residuals, conditional on the training dataset, which, as we shall explain, is the key to achieving exact design-unbiasedness. For model-assisted survey estimation we use this idea to remove the finite-sample bias. On the other hand, Rao-Blackwellisation (Rao, 1945; Blackwell, 1947) and model-assisted estimation (Cassel et al., 1976) are powerful ideas in Statistics and survey sampling, which we apply to ML techniques to obtain design-unbiased survey estimators at the population level.
We shall refer to the amalgamation as statistical learning, since the term model-assisted estimation is entrenched with the property of approximate design-unbiasedness (e.g. Särndal 2010; Breidt and Opsomer, 2017), whereas the focus on population-level estimation and associated variance estimation is unusual in the ML literature.
In applications one needs to ensure design-consistency of the proposed SRB method, in addition to exact design-unbiasedness. The property can readily be established for parametric or many semi-parametric assisting models. But the conditions required for non-parametric algorithmic ML prediction models have so far eluded a treatment in the literature. Indeed, this has been a main reason preventing the incorporation of such ML techniques in model-assisted estimation from survey sampling. We shall develop general stability conditions for design-consistency under both simple random sampling and arbitrary unequal probability sampling designs.
For the first time, design-unbiased model-assisted estimation can thereby be achieved
generally in survey sampling. Wherever rich feature data are available, the approach of
statistical learning developed in this paper enables one to adopt suitable ML techniques,
which can make much more efficient use of the available auxiliary information.
The rest of the paper is organised as follows. In Section 2, we describe the SRB method that uses an assisting linear model. The underlying ideas of design-unbiased statistical learning are explained, as well as the differences to standard model-assisted generalised regression estimation. Some basic methods of variance estimation are outlined, where a novel jackknife variance estimator is developed for the SRB method. We move on to non-linear ML techniques in Section 3. The similarity and difference to the bootstrap aggregating (Breiman, 1996b) approach are explored. Moreover, we investigate and prove stability conditions for design-consistency of the SRB method that uses non-parametric algorithmic prediction models. Two simulation studies are presented in Section 4, which illustrate the potential gains of the proposed unbiased statistical learning approach, compared to standard linear model-assisted or model-based approaches. A brief summary and topics for future research are given in Section 5.
2 Unbiased linear estimation
In this section we consider unbiased linear estimation in survey sampling, which builds on generalised regression (GREG) estimation (Särndal et al. 1992). The GREG estimator is the most common estimation method in practical survey sampling. It is consistent under mild regularity conditions, and is often more efficient than exactly unbiased Horvitz-Thompson (HT) estimation (Horvitz and Thompson, 1952). The proposed subsampling Rao-Blackwellisation (SRB) method removes the finite-sample bias of GREG generally, while its relative efficiency remains comparable to that of the standard GREG estimator.
2.1 Bias correction by subsampling
Let $s$ be a sample (of size $n$) selected from the population $U$ of size $N$, with probability $p(s)$, where $\sum_s p(s) = 1$ over all possible samples under a given sampling design. Let $\pi_i = \Pr(i \in s) > 0$ be the sample inclusion probability, for each $i \in U$. Let $y_i$ be a survey variable, for $i \in U$, with unknown population total $Y = \sum_{i \in U} y_i$.
Let the assisting linear model expectation of $y_i$ be given by $\mu(x_i) = x_i^{\top} \beta$, where $x_i$ is the vector of covariates for each $i \in U$. Let $\mu(x_i, s) = x_i^{\top} b$ be the estimator of $\mu(x_i)$, where $b = \big( \sum_{i \in s} x_i x_i^{\top}/\pi_i \big)^{-1} \sum_{i \in s} x_i y_i/\pi_i$ is a weighted least squares (WLS) estimator of $\beta$. It is possible to attach additional heteroscedasticity weights in the WLS; but the development below is invariant to such variations, so that it is more convenient to simply ignore them in the notation. Let $X = \sum_{i \in U} x_i$. The GREG estimator of $Y$ is given as
$$ \widehat{Y}_{GR} = X^{\top} b + \sum_{i \in s} (y_i - x_i^{\top} b)/\pi_i $$
While $\widehat{Y}_{GR}$ is design-consistent under mild regularity conditions (e.g. Särndal et al. 1992), as $n, N \to \infty$, it is usually biased given finite sample size $n$, except in special cases such as when $x_i \equiv 1$ and $\pi_i \equiv n/N$, where $\mu(x, s) = \sum_{i \in s} y_i/n = \bar{y}_s$ and $\widehat{Y}_{GR} = N \bar{y}_s$.
To remove the potential finite-sample bias of $\widehat{Y}_{GR}$, consider subsampling of $s_1 \subset s$, with known probability $q(s_1 | s)$, such as SRS with fixed $n_1 = |s_1|$, where $\sum_{s_1} q(s_1 | s) = 1$. The induced probability of selecting $s_1$ from $U$ is given by
$$ p_1(s_1) = \sum_{s : s_1 \subset s} q(s_1 | s)\, p(s) $$
where $\pi_{1i} = \Pr(i \in s_1)$ is the corresponding inclusion probability for $i \in U$. Let $s_2 = s \setminus s_1$ be the complement of $s_1$ in $s$. Let the conditional sampling probability of $s$ given $s_1$ be
$$ p_2(s_2 | s_1) = p(s)\, q(s_1 | s)/p_1(s_1) $$
and let $\pi_{2i} = \Pr(i \in s_2 | s_1)$ be the corresponding conditional inclusion probability in $s_2$ for $i \in U \setminus s_1$. Let $\mu(x_i, s_1) = x_i^{\top} b_1$ be the estimate of $\mu(x_i)$ based on the subsample $s_1$,
where $b_1 = \big( \sum_{i \in s_1} x_i x_i^{\top}/\pi_{1i} \big)^{-1} \sum_{i \in s_1} x_i y_i/\pi_{1i}$. Let
$$ \widehat{Y}_1 = \sum_{i \in s_1} y_i + \sum_{i \in U \setminus s_1} x_i^{\top} b_1 + \sum_{i \in s_2} (y_i - x_i^{\top} b_1)/\pi_{2i} \qquad (1) $$
In other words, it is the sum of $y_i$ in $s_1$ and a difference estimator of the remaining population total based on $s_2$, via $x_i^{\top} b_1$ which does not depend on the observations in $s_2$.
Proposition: The estimator $\widehat{Y}_1$ is conditionally unbiased for $Y$ over $p_2(s_2 | s_1)$ given $s_1$, denoted by $E_2(\widehat{Y}_1 | s_1) = Y$, as well as unconditionally over $p(s)$, denoted by $E_p(\widehat{Y}_1) = Y$.
Proof: As $\mu(x_i, s_1)$ is fixed for any $i \in U \setminus s_1$ given $s_1$, the last two terms on the right-hand side of (1) are unbiased for $Y - \sum_{i \in s_1} y_i$ given $s_1$. It follows that $\widehat{Y}_1$ is conditionally unbiased for $Y$ given $s_1$; hence, design-unbiased over $p(s)$ unconditionally as well.
Example: Simple random sampling (SRS). Suppose SRS without replacement of $s$ from $U$, and of $s_1$ from $s$ with fixed size $n_1 = |s_1|$, such that $\pi_{1i} = n_1/N$ and $\pi_{2i} = (n - n_1)/(N - n_1)$. In the special case of $x_i \equiv 1$, $b_1$ is the sample mean in $s_1$, and
$$ \widehat{Y}_1 = n_1 b_1 + (N - n_1) \sum_{i \in s_2} y_i/(n - n_1) $$
which amounts to using the sample mean in $s_2$ to estimate the population mean outside of the given $s_1$, instead of using the sample mean in $s$ for the whole population mean. Thus, $\widehat{Y}_1$ achieves unbiasedness generally, but at a cost of increased variance.
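To make the construction concrete, the following sketch (ours, not from the paper) computes $\widehat{Y}_1$ of (1) for a small synthetic population under SRS at both stages, with a linear assisting model fitted on $s_1$ only; the population, sample sizes and data-generating model are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: intercept plus one log-normal regressor.
N, n, n1 = 1000, 100, 80
x = np.column_stack([np.ones(N), rng.lognormal(size=N)])
y = x @ np.array([2.0, 3.0]) + rng.normal(scale=2.0, size=N)
Y = y.sum()                                        # the unknown target total

s = rng.choice(N, size=n, replace=False)           # SRS from U: pi_i = n/N
s1 = rng.choice(s, size=n1, replace=False)         # SRS subsample: q(s1|s)
s2 = np.setdiff1d(s, s1)
pi2 = (n - n1) / (N - n1)                          # conditional inclusion prob. in s2

# WLS fit on s1; under SRS the weights 1/pi_1i are equal and cancel.
b1 = np.linalg.lstsq(x[s1], y[s1], rcond=None)[0]
mu = x @ b1

outside_s1 = np.setdiff1d(np.arange(N), s1)
Y1_hat = y[s1].sum() + mu[outside_s1].sum() + ((y[s2] - mu[s2]) / pi2).sum()   # eq. (1)
print(Y, Y1_hat)
```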
2.2 Rao-Blackwellisation
One can reduce the variance of $\widehat{Y}_1$ by the Rao-Blackwell method (Rao, 1945; Blackwell, 1947). The minimal sufficient statistic in the finite population sampling setting is simply $D_s = \{(i, y_i) : i \in s\}$. Applying the RB method to $\widehat{Y}_1$ by (1) yields the estimator $\widehat{Y}_{RB}$, given by the conditional expectation of $\widehat{Y}_1$ given $D_s$, i.e.
$$ \widehat{Y}_{RB} = E_q(\widehat{Y}_1 | D_s) = E_q(\widehat{Y}_1 | s) \qquad (2) $$
where the expectation is evaluated with respect to $q(s_1 | s)$, and the second expression is leaner as long as one keeps in mind that $\{y_i : i \in U\}$ are treated as fixed constants associated with the distinct units.
Proposition: The estimator $\widehat{Y}_{RB}$ is design-unbiased for $Y$, denoted by $E(\widehat{Y}_{RB}) = Y$.
Proof: By construction, the combined randomisation distribution induced by $p$ and $q$ is the same as that induced by $p_1$ and $p_2$, for any $s_1 \cup s_2 = s$ and $s_1 \cap s_2 = \emptyset$. Thus,
$$ E(\widehat{Y}_{RB}) = E_p(\widehat{Y}_{RB}) = E_p\big[ E_q(\widehat{Y}_1 | s) \big] = E_1\big[ E_2(\widehat{Y}_1 | s_1) \big] = E_1(Y) = Y $$
Next, for the variance of $\widehat{Y}_{RB}$ over $p(s)$, i.e. $V(\widehat{Y}_{RB}) = V_p(\widehat{Y}_{RB})$, we notice
$$ V(\widehat{Y}_1) = E_p\big[ V_q(\widehat{Y}_1 | s) \big] + V_p\big[ E_q(\widehat{Y}_1 | s) \big] = E_p\big[ V_q(\widehat{Y}_1 | s) \big] + V_p(\widehat{Y}_{RB}) $$
$$ V(\widehat{Y}_1) = E_1\big[ V_2(\widehat{Y}_1 | s_1) \big] + V_1\big[ E_2(\widehat{Y}_1 | s_1) \big] = E_1\big[ V_2(\widehat{Y}_1 | s_1) \big] $$
since $E_2(\widehat{Y}_1 | s_1) = Y$. Juxtaposing the two expressions of $V(\widehat{Y}_1)$ above, we obtain
$$ V(\widehat{Y}_{RB}) = V_p(\widehat{Y}_{RB}) = E_1\big[ V_2(\widehat{Y}_1 | s_1) \big] - E_p\big[ V_q(\widehat{Y}_1 | s) \big] \qquad (3) $$
where $E_p\big[ V_q(\widehat{Y}_1 | s) \big]$ is the variance reduction compared to $\widehat{Y}_1$.
Proposition: Provided an unbiased variance estimator $\widehat{V}(\widehat{Y}_1)$ with respect to $p_2(s_2 | s_1)$, i.e. $E_2\big[ \widehat{V}(\widehat{Y}_1) \big] = V_2(\widehat{Y}_1 | s_1)$, a design-unbiased variance estimator for $\widehat{Y}_{RB}$ is given by
$$ \widehat{V}(\widehat{Y}_{RB}) = \widehat{V}(\widehat{Y}_1) - V_q(\widehat{Y}_1 | s) $$
Proof: By stipulation, we have $E\big[ \widehat{V}(\widehat{Y}_1) \big] = E_1 E_2\big[ \widehat{V}(\widehat{Y}_1) \big] = E_1\big[ V_2(\widehat{Y}_1 | s_1) \big]$, which is the first term on the right-hand side of (3). The result follows immediately.
Example: SRS, cont'd. In the special case of $x_i \equiv 1$ and $n_1 = n - 1$, we have
$$ \widehat{Y}_{1(i)} = n_1 \bar{y}_{(i)} + (N - n_1)\, y_i $$
if $s_2 = \{i\}$, where $\bar{y}_{(i)}$ denotes the mean in $s_1 = s \setminus \{i\}$. The RB estimator follows as
$$ \widehat{Y}_{RB} = \frac{1}{n} \sum_{i \in s} \widehat{Y}_{1(i)} = \frac{n_1}{n} \sum_{i \in s} \bar{y}_{(i)} + \frac{N}{n} \sum_{i \in s} y_i - \frac{n_1}{n} \sum_{i \in s} y_i = \frac{N}{n} \sum_{i \in s} y_i $$
which is the usual unbiased full-sample expansion estimator in this case. The RB method thus recovers the lost efficiency of any $\widehat{Y}_{1(i)}$ on its own.
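The identity above is easy to verify numerically; the short check below (ours, with hypothetical numbers) averages the $n$ delete-one estimators $\widehat{Y}_{1(i)}$ and recovers $N\bar{y}$ exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
N, n = 50, 10
y_s = rng.normal(size=n)                 # observed sample values (hypothetical)

Y1 = []
for i in range(n):
    ybar_del = np.delete(y_s, i).mean()  # mean over s1 = s \ {i}
    Y1.append((n - 1) * ybar_del + (N - n + 1) * y_s[i])   # Yhat_1(i)
print(np.mean(Y1), N * y_s.mean())       # identical up to rounding error
```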
Let $X_1 = \sum_{i \in s_1} x_i$, and $X_1^c = \sum_{i \notin s_1} x_i = X - X_1$. To express $\widehat{Y}_{RB}$ as a linear combination of $\{y_i : i \in s\}$, we rewrite $\widehat{Y}_1$ as
$$ \widehat{Y}_1 = \sum_{i \in s_1} y_i + \sum_{i \in s_2} \frac{y_i}{\pi_{2i}} + (X - X_1)^{\top} b_1 - \sum_{i \in s_2} \frac{x_i^{\top} b_1}{\pi_{2i}} = \sum_{i \in s_1} y_i + \sum_{i \in s_2} \frac{y_i}{\pi_{2i}} + (X_1^c - \widehat{X}_1^c)^{\top} b_1 = \sum_{i \in s} w_i y_i $$
where $\widehat{X}_1^c = \sum_{i \in s_2} x_i/\pi_{2i}$ and
$$ w_i = \begin{cases} 1 + (X_1^c - \widehat{X}_1^c)^{\top} \big( \sum_{i \in s_1} x_i x_i^{\top}/\pi_{1i} \big)^{-1} x_i/\pi_{1i} & \text{if } i \in s_1 \\ 1/\pi_{2i} & \text{if } i \in s_2 \end{cases} $$
It follows that the RB estimator (2) can be given as a linear estimator
$$ \widehat{Y}_{RB} = \sum_{i \in s} \bar{w}_i y_i \quad \text{where} \quad \bar{w}_i = E_q(w_i | s) \qquad (4) $$
This has the important practical advantage that $\{\bar{w}_i : i \in s\}$ can be applied to produce numerically consistent cross-tabulation of multiple survey variables of interest.
In the case of SRS of $s_1$ with $n_1 = n - 1$, the RB weight $\bar{w}_i$ in (4) is the average of the $w_i$'s over the $n$ possible subsamples $s_1$, for a given unit $i \in s$, where $w_i = 1/\pi_{2i}$ when $s_1$ does not include unit $i$, and otherwise $w_i$ is the corresponding GREG weight for $Y_1^c = \sum_{k \notin s_1} y_k$, which is different for each of the remaining $n - 1$ subsamples that include unit $i$.
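As an illustration of (4), the sketch below (ours) enumerates the $n$ delete-one subsamples under SRS, computes the per-subsample weights $w_i$ derived above, and checks that the averaged weights reproduce the Rao-Blackwellised estimator as a linear combination of the sample $y$-values; the population and model are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
N, n = 200, 15
x = np.column_stack([np.ones(N), rng.lognormal(size=N)])
y = x @ np.array([1.0, 2.0]) + rng.normal(size=N)
s = rng.choice(N, size=n, replace=False)

pi1 = (n - 1) / N                 # SRS of s, then delete-one SRS of s1
pi2 = 1.0 / (N - n + 1)
X_tot = x.sum(axis=0)

W = np.zeros((n, n))              # row j: weights w_i when unit s[j] is left out
Y1 = np.zeros(n)
for j in range(n):
    s1 = np.delete(s, j)
    A = (x[s1].T @ x[s1]) / pi1                    # sum of x_i x_i' / pi_1i over s1
    Xc1 = X_tot - x[s1].sum(axis=0)
    Xc1_hat = x[s[j]] / pi2
    W[j, np.delete(np.arange(n), j)] = 1 + (Xc1 - Xc1_hat) @ np.linalg.solve(A, x[s1].T) / pi1
    W[j, j] = 1.0 / pi2
    Y1[j] = W[j] @ y[s]           # Yhat_1 for this subsample, as a linear estimator
w_rb = W.mean(axis=0)             # the RB weights of (4), averaged over q(s1|s)
print(w_rb @ y[s], Y1.mean())     # the two coincide
```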
2.3 Relative efficiency to GREG
Let $B = E_p(b)$ and $e_i = y_i - x_i^{\top} B$ for $i \in U$. Expanding the GREG estimator $\widehat{Y}_{GR}$ around $(Y, X, B)$ yields
$$ \widehat{Y}_{GR} \approx \sum_{i \in s} \frac{e_i}{\pi_i} + B^{\top} X $$
For $\widehat{Y}_1$, the first two terms on the right-hand side of (1) become $X^{\top} b_1$ if there exists a vector $\lambda$ such that $x_i^{\top} \lambda \equiv 1$, in which case $\widehat{Y}_1$ is a function of $(\widehat{Y}_1^c, \widehat{X}_1^c, b_1)$, i.e.
$$ \widehat{Y}_1 = X^{\top} b_1 + \widehat{Y}_1^c - b_1^{\top} \widehat{X}_1^c = \widehat{Y}_1^c + b_1^{\top} (X - \widehat{X}_1^c) $$
where $\widehat{Y}_1^c = \sum_{i \in s_2} y_i/\pi_{2i}$ is conditionally unbiased for $Y_1^c = \sum_{i \notin s_1} y_i$ given $s_1$, and similarly $\widehat{X}_1^c = \sum_{i \in s_2} x_i/\pi_{2i}$ for $X_1^c = \sum_{i \notin s_1} x_i$. Let $Y_+^c = E_1(Y_1^c)$ and $X_+^c = E_1(X_1^c)$. We have
$E_1(b_1) \approx B$, since $b_1$ and $b$ aim at the same population parameter, especially if $n_1$ is close to $n$. In any case, expanding $\widehat{Y}_1$ around $(Y_+^c, X_+^c, B)$ yields
$$ \widehat{Y}_1 \approx Y_+^c + B^{\top} (X - X_+^c) + (\widehat{Y}_1^c - Y_+^c) + (b_1 - B)^{\top} (X - X_+^c) - B^{\top} (\widehat{X}_1^c - X_+^c) = \widehat{Y}_1^c - B^{\top} \widehat{X}_1^c + b_1^{\top} (X - X_+^c) + B^{\top} X_+^c $$
and
$$ \widehat{Y}_1^c - B^{\top} \widehat{X}_1^c = \sum_{i \in s_2} \frac{e_i}{\pi_{2i}} = \sum_{i \in s} \delta_{2i} \frac{e_i}{\pi_{2i}} $$
where $\delta_{2i} = 1$ if $i \in s_2$ and $0$ if $i \in s_1$. Thus, we obtain
. Thus, we obtain
b
Y
GR
= E
q
(
b
Y
1
|s)
X
is
E
q
δ
2i
π
2i
|s
π
i
e
i
π
i
+ E
q
(b
1
|s)
(X X
c
+
) + B
X
c
+
(5)
Notice that $B^{\top} X_+^c$ is a constant. Thus, compared to the GREG estimator $\widehat{Y}_{GR}$, the variance of the RB estimator $\widehat{Y}_{RB}$ involves in addition that of $E_q(b_1 | s)$. As $n, N \to \infty$, the first term on the right-hand side of (5) is $O_p(N/\sqrt{n})$ provided $\pi_i E_q(\delta_{2i}/\pi_{2i}) = O_p(1)$, whereas the second term is $O_p(\sqrt{n})$ if $n_1/n = O(1)$, provided the usual regularity conditions for GREG. As long as the sampling fraction $n/N$ is small, the first term will dominate, in which case the variance of the RB estimator $\widehat{Y}_{RB}$ is of the same order as that of the GREG estimator $\widehat{Y}_{GR}$.
Example: SRS, cont'd. Let $n_1 = n - k$, where $k = |s_2|$. We have
$$ \pi_i\, \pi_{2i}^{-1}\, E_q(\delta_{2i} | s) = \frac{n}{N} \cdot \frac{N - n_1}{k} \cdot \frac{(n-1)!}{(k-1)!\,(n-k)!} \cdot \frac{k!\,(n-k)!}{n!} = 1 - \frac{n_1}{N} $$
Let $S_e^2$ be the population variance of $\{e_i : i \in U\}$. The variance of the first term in (5) is
$$ V_p\Big[ \sum_{i \in s} E_q\Big( \frac{\delta_{2i}}{\pi_{2i}} \Big| s \Big) \pi_i \frac{e_i}{\pi_i} \Big] = N^2 \Big( 1 - \frac{n}{N} \Big) \frac{S_e^2}{n} \Big( 1 - \frac{n_1}{N} \Big)^2 $$
which is actually smaller than the approximate variance of the GREG estimator under SRS, although the difference will not be noteworthy in practical terms if the sampling fraction $n/N$ is small, since $1 - n/N < 1 - n_1/N < 1$. Meanwhile, due to the additional variance of $E_q(b_1 | s)$, the unbiased RB estimator $\widehat{Y}_{RB}$ can possibly have a larger variance than the biased GREG estimator (with general $x_i$). It seems that one should use a large $n_1$ if possible, to keep the additional variance due to $E_q(b_1 | s)$ small.
2.4 Delete-one RB method
The largest possible size of $s_1$ is $n_1 = n - 1$. We refer to Rao-Blackwellisation based on SRS of $s_1$ with $n_1 = n - 1$ as the delete-one (or leave-one-out, LOO) RB method. The conditional sampling design $p_2(s_2 | s_1)$ is not measurable in this case, in that one cannot have an unbiased variance estimator $\widehat{V}(\widehat{Y}_1)$ based on the single observation $y_j$ in $s_2 = \{j\}$. For an approximate variance estimator, we reconsider the basic case where $\{y_1, \ldots, y_n\}$ form a sample of independent and identically distributed (IID) observations, in order to develop an analogy to the classic jackknife variance estimation (Tukey, 1958).
Denote by $\theta$ the population mean, which is also the expectation of each $y_i$, for $i = 1, \ldots, n$. As before, let $\bar{y}_{(j)}$ denote the mean in the subsample $s_1 = s \setminus \{j\}$. Following (1), let
$$ \hat{\theta}_{(j)} = \frac{n - 1}{N} \bar{y}_{(j)} + \Big( 1 - \frac{n - 1}{N} \Big) y_j $$
be the delete-$j$ estimator of $\theta$, where $y_j$ acts as an unbiased estimator of the population mean outside $s_1$. The RB method yields the whole-sample mean, denoted by
$$ \hat{\theta} = \frac{1}{n} \sum_{j=1}^n \hat{\theta}_{(j)} = \frac{1}{n} \sum_{j=1}^n y_j = \bar{y} $$
Observe that we have $\hat{\theta} = \sum_{j=1}^n z^{(j)}/n$, where
$$ z^{(j)} = \frac{1}{N - n} \big( N \hat{\theta}_{(j)} - n \hat{\theta} \big) = y_j \qquad (6) $$
Thus, the RB estimator $\hat{\theta}$ is the mean of an IID sample of observations $z^{(j)}$, for $j = 1, \ldots, n$, as in the development of classic jackknife variance estimation, so that we obtain
$$ \widehat{V}(\hat{\theta}) = \frac{1}{n(n-1)} \sum_{j=1}^n \big( z^{(j)} - \hat{\theta} \big)^2 $$
Notice that, in this case, the IID observations used for the classic development of the jackknife method are given by $z^{(j)} = n \hat{\theta} - (n-1) \hat{\theta}_{(j)} = y_j$ instead of (6), where $\hat{\theta}_{(j)} = \bar{y}_{(j)}$.
For the delete-one RB method based on (1) and (2) given auxiliary $\{x_i : i \in U\}$, we have $\pi_{1i} = \pi_i (n-1)/n$, such that the estimator $b_1$ can be denoted by $b_{(j)}$, based on $s_1 = s \setminus \{j\}$, where it is simply the delete-$j$ jackknife regression coefficient estimator. Rewrite the corresponding population total estimator $\widehat{Y}_1$ by (1) as
$$ \widehat{Y}_{(j)} = X^{\top} b_{(j)} + \frac{y_j}{\pi_{2j}} - \frac{x_j^{\top} b_{(j)}}{\pi_{2j}} $$
such that the RB method yields $\widehat{Y}_{RB}$ by (2), as the mean of the $\widehat{Y}_{(j)}$ over $j = 1, \ldots, n$. We propose a jackknife variance estimator for $\widehat{Y}_{RB}$, given by
$$ \widehat{V}(\widehat{Y}_{RB}) = \frac{N^2}{n(n-1)} \sum_{j=1}^n \Big( z^{(j)} - \frac{1}{n} \sum_{i=1}^n z^{(i)} \Big)^2 \qquad (7) $$
where
$$ z^{(j)} = \frac{1}{N - n} \Big( \widehat{Y}_{(j)} - \frac{n}{N} \widehat{Y}_{RB} \Big) $$
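A minimal implementation of the delete-one RB estimator and the jackknife variance estimator (7) might look as follows (our sketch); it assumes SRS, so that $\pi_{2j} = 1/(N - n + 1)$ is known, and an intercept in $x$, so that (1) reduces to the rewritten form of $\widehat{Y}_{(j)}$ above.

```python
import numpy as np

def loo_rb_greg(x_s, y_s, X_tot, N):
    """Delete-one RB estimator and its jackknife variance (7) under SRS.

    x_s: (n, p) covariates of the sampled units (including an intercept
    column); y_s: their y-values; X_tot: the population total of x.
    """
    n = len(y_s)
    pi2 = 1.0 / (N - n + 1)
    Y_del = np.empty(n)
    for j in range(n):
        keep = np.delete(np.arange(n), j)
        b_j = np.linalg.lstsq(x_s[keep], y_s[keep], rcond=None)[0]   # delete-j fit
        Y_del[j] = X_tot @ b_j + (y_s[j] - x_s[j] @ b_j) / pi2       # Yhat_(j)
    Y_rb = Y_del.mean()                                              # eq. (2)
    z = (Y_del - (n / N) * Y_rb) / (N - n)                           # z^(j)
    v_jack = N**2 / (n * (n - 1)) * ((z - z.mean())**2).sum()        # eq. (7)
    return Y_rb, v_jack
```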
Notice that it may be the case under general unequal probability sampling that the conditional inclusion probability $\pi_{2j}$ given $s_1 = s \setminus \{j\}$ is not exactly known. However, in many situations where the sampling fraction is low, it is reasonable to use
$$ \pi_{2j} \approx \frac{\pi_j}{\sum_{i \notin s_1} \pi_i} = \frac{\pi_j}{n - \sum_{i \in s_1} \pi_i} \approx \frac{\pi_j}{n(1 - n_1/N)} $$
An approximate delete-one RB estimator following (2) can then be given as
$$ \widetilde{Y}_{RB} = X^{\top} \sum_{j=1}^n \frac{b_{(j)}}{n} + \Big( 1 - \frac{n_1}{N} \Big) \sum_{j=1}^n \frac{y_j}{\pi_j} - \Big( 1 - \frac{n_1}{N} \Big) \sum_{j=1}^n \frac{x_j^{\top} b_{(j)}}{\pi_j} \qquad (8) $$
with the corresponding $\widetilde{Y}_{(j)}$ for jackknife variance estimation obtained on replacing $1/\pi_{2j}$ by $n(1 - n_1/N)/\pi_j$. Meanwhile,
the delete-one jackknife replicates of the GREG estimator $\widehat{Y}_{GR}$ can be written as
$$ \widehat{Y}_{GR}^{(j)} = X^{\top} b_{(j)} + \frac{n}{n - 1} \Big( \sum_{i \neq j} \frac{y_i}{\pi_i} - \sum_{i \neq j} \frac{x_i^{\top} b_{(j)}}{\pi_i} \Big) $$
$$ \widehat{Y}_{GR}^{(\cdot)} = \frac{1}{n} \sum_{j=1}^n \widehat{Y}_{GR}^{(j)} = X^{\top} \sum_{j=1}^n \frac{b_{(j)}}{n} + \sum_{i=1}^n \frac{y_i}{\pi_i} - \sum_{i=1}^n \frac{x_i^{\top}}{\pi_i} \sum_{j \neq i} \frac{b_{(j)}}{n - 1} $$
The estimator $\widehat{Y}_{GR}^{(\cdot)}$ is quite close to the approximate RB estimator (8); indeed, the two are identical apart from the factor $1 - n_1/N$ in the special case of $x_i/\pi_i = N/n$. This is not surprising, since the jackknife-based $\widehat{Y}_{GR}^{(\cdot)}$ is an alternative means of reducing the bias of the GREG estimator. The difference is that, provided $\pi_{2j}$ is known, the proposed RB method is exactly design-unbiased, whereas the jackknife-based $\widehat{Y}_{GR}^{(\cdot)}$ is not. Finally, the resemblance between $\widetilde{Y}_{RB}$ and $\widehat{Y}_{GR}^{(\cdot)}$ is another indication that the relative efficiency of the delete-one RB method is usually not a concern compared to the standard GREG estimator $\widehat{Y}_{GR}$.
2.5 Monte Carlo RB
Exact Rao-Blackwellisation can be computationally expensive when the cardinality of the subsample space (of $s_1$) is large. Instead of calculating the RB estimator exactly, consider the Monte Carlo (MC) RB estimator given as follows:
$$ \widehat{Y}_{RB}^K = K^{-1} \sum_{k=1}^K \widehat{Y}_{1k} \qquad (9) $$
where $\widehat{Y}_{1k}$ is the estimator $\widehat{Y}_1$ based on the $k$th subsample, for $k = 1, \ldots, K$, which are realisations of $s_1$ from $q(s_1 | s)$, such that $\widehat{Y}_{RB}^K$ is a Monte Carlo approximation of $\widehat{Y}_{RB}$.
Proposition: The estimator $\widehat{Y}_{RB}^K$ is design-unbiased for $Y$, denoted by $E(\widehat{Y}_{RB}^K) = Y$.
Proof: The result follows from $E(\widehat{Y}_{1k}) = Y$.
Adopting a computationally manageable $K$ entails an increase of variance, i.e. $V_q(\widehat{Y}_{RB}^K | s)$, compared to $\widehat{Y}_{RB}$, so that the variance of $\widehat{Y}_{RB}^K$ is given by
$$ V(\widehat{Y}_{RB}^K) = E_1\big[ V_2(\widehat{Y}_1 | s_1) \big] - E_p\big[ V_q(\widehat{Y}_1 | s) \big] + E_p\big[ V_q(\widehat{Y}_{RB}^K | s) \big] \qquad (10) $$
Due to the IID construction of the $\widehat{Y}_{1k}$, an unbiased estimator of $V_q(\widehat{Y}_{RB}^K | s)$ is given by
$$ \widehat{V}_q(\widehat{Y}_{RB}^K | s) = \frac{1}{K(K-1)} \sum_{k=1}^K \big( \widehat{Y}_{1k} - \widehat{Y}_{RB}^K \big)^2 $$
This allows one to control the statistical efficiency of the MC-RB method, i.e. the choice of $K$ is acceptable when $\widehat{V}_q(\widehat{Y}_{RB}^K | s)$ is deemed small enough in practical terms.
Proposition: Provided an unbiased variance estimator $\widehat{V}(\widehat{Y}_{1k})$ with respect to $p_2(s_2 | s_1)$, i.e. $E_2\big[ \widehat{V}(\widehat{Y}_{1k}) \big] = V_2(\widehat{Y}_1 | s_1)$, a design-unbiased variance estimator for $\widehat{Y}_{RB}^K$ is given by
$$ \widehat{V}(\widehat{Y}_{RB}^K) = \frac{1}{K} \sum_{k=1}^K \widehat{V}(\widehat{Y}_{1k}) - \frac{1}{K} \sum_{k=1}^K \big( \widehat{Y}_{1k} - \widehat{Y}_{RB}^K \big)^2 $$
Proof: Due to the IID construction of the $\widehat{Y}_{1k}$, $K^{-1} \sum_{k=1}^K \widehat{V}(\widehat{Y}_{1k})$ is an unbiased estimator of the first term on the right-hand side of (10), while $(K-1)^{-1} \sum_{k=1}^K \big( \widehat{Y}_{1k} - \widehat{Y}_{RB}^K \big)^2$ is an unbiased estimator of $V_q(\widehat{Y}_1 | s)$ in the second term. Since $K^{-1} \sum_{k=1}^K \big( \widehat{Y}_{1k} - \widehat{Y}_{RB}^K \big)^2$ equals the latter minus $\widehat{V}_q(\widehat{Y}_{RB}^K | s)$, which is unbiased for the third term, the result follows.
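The two variance estimators of this subsection combine the $K$ subsample results in a simple way; the sketch below (ours) takes the per-subsample estimates and, where available, the per-subsample variance estimates as inputs, which are assumed to have been computed as in Section 2.1.

```python
import numpy as np

def mc_rb_summaries(y1_draws, v1_draws):
    """Combine K subsample results from q(s1|s).

    y1_draws[k]: Yhat_1 from the k-th subsample; v1_draws[k]: an unbiased
    estimate of V_2(Yhat_1|s1) for that subsample (requires |s2| >= 2).
    Returns the MC-RB estimate (9), the design-unbiased variance estimate
    of the Proposition above, and the Monte Carlo error used to judge
    whether K is large enough.
    """
    y1 = np.asarray(y1_draws, dtype=float)
    v1 = np.asarray(v1_draws, dtype=float)
    K = len(y1)
    Y_K = y1.mean()                                   # eq. (9)
    between = ((y1 - Y_K)**2).sum()
    V_mc = between / (K * (K - 1))                    # estimated MC error
    V_design = v1.mean() - between / K                # design variance estimate
    return Y_K, V_design, V_mc
```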
Finally, for the delete-one RB method, where an unbiased variance estimator $\widehat{V}(\widehat{Y}_1)$ is not available now that $|s_2| = 1$, a practical option is to first apply the jackknife variance estimator (7) to the $K$ subsamples, as if $\widehat{Y}_{RB}^K$ were the exact RB estimator $\widehat{Y}_{RB}$, and then add to it the extra term $\widehat{V}_q(\widehat{Y}_{RB}^K | s)$ for the additional Monte Carlo error. This would allow one to use the Monte Carlo delete-one RB method in general.
3 Unbiased non-linear learning
In this section we consider design-unbiased estimation in survey sampling, which builds on an arbitrary ML technique that can be non-linear as well as non-parametric.
3.1 Design-unbiased ML for survey sampling
Denote by $M$ the model or algorithm that aims to predict $y_i$ given $x_i$. Let $s_1$ be the training set, and $s_2 = s \setminus s_1$ the test set. Let $\widehat{M}$ be the trained model based on $\{(x_i, y_i) : i \in s_1\}$, yielding $\mu(x_i, s_1)$ as the corresponding $M$-predictor of $y_i$ given $x_i$. Applying the trained model to $i \in s_2$ yields the prediction errors of $\widehat{M}$ conditional on $s_1$, denoted by $e_i = y_i - \mu(x_i, s_1)$. In contrast, the same discrepancy is referred to as a residual of $\widehat{M}$ when it is calculated for $i \in s_1$, denoted by $\hat{e}_i = y_i - \mu(x_i, s_1)$, including when the training set $s_1$ is equal to $s$. In standard ML, the errors in the test set are used to select among different trained algorithms, or to assess how well a trained algorithm can be expected to perform when applied to units with unknown $y_i$'s.
From an inference point of view, a basic problem with the standard ML approach above arises because one needs to be able to ‘extrapolate’ the information in $\{e_i : i \in s_2\}$ to the units outside $s$, in order for supervised learning to have any value at all. This is simply because $\{y_i : i \in s\}$ are all observed and prediction in any form is unnecessary for $i \in s$. No matter how the training-test split is carried out, one cannot ensure valid $\mu(x_k, s_1)$ for $k \notin s$, unless $s$ is selected from the entire reference set of units, i.e. the population $U$, in some non-informative (or representative) manner. This is the well-known problem of observational studies in statistical science, which is sometimes recast as the problem of concept drift in the ML literature (e.g. Tsymbal, 2004).
A design pq-unbiased approach to $M$-assisted estimation of the population total $Y$ can be achieved with respect to
(i) a probability sample $s$ from $U$, with probability $p(s)$, and
(ii) a probabilistic scheme $q(s_1 | s)$ for the training-test split $(s_1, s_2)$ given $s$.
Explicitly, let $\widehat{Y}_{1M}$ be the estimator of $Y$ obtained from the realised sample $s$ and subsample $s_1$ given the model $M$. It is said to be design pq-unbiased for $Y$, provided
$$ E_{pq}(\widehat{Y}_{1M}) = \sum_s p(s) \sum_{s_1 \subset s} q(s_1 | s)\, \widehat{Y}_{1M} = Y $$
where $E_{pq}$ is the expectation of $\widehat{Y}_{1M}$ over all possible $(s, s_1)$. Replacing the linear predictor $x_i^{\top} b_1$ in (1) by any $M$-predictor $\mu(x_i, s_1)$ trained on $s_1$, we obtain
$$ \widehat{Y}_{1M} = \sum_{i \in s_1} y_i + \sum_{i \in U \setminus s_1} \mu(x_i, s_1) + \sum_{i \in s_2} \frac{e_i}{\pi_{2i}} \qquad (11) $$
Proposition: $\widehat{Y}_{1M}$ by (11) is design pq-unbiased for $Y$ using an arbitrary model $M$.
The proof is parallel to that for $\widehat{Y}_1$ by (1), only that $\mu(x, s_1)$ is now based on any chosen model $M$. It is important to point out that the purpose here is to estimate $Y$ at the population level, instead of individual prediction per se. Indeed, $\widehat{Y}_{1M}$ is design-unbiased regardless of whether $M$ is a strong or weak learner. The underlying probabilistic mechanism consists of two necessary elements: $p(s)$ ensures valid extrapolation of learning to the units outside $s$, since otherwise the completely model-based prediction $\sum_{i \in s} y_i + \sum_{i \in U \setminus s} \mu(x_i, s)$ has no guaranteed relevance to $Y$, no matter how the training set $s_1$ is chosen or how $M$ is selected; whereas subsampling $q(s_1 | s)$ is required to be able to project the errors in $s_2$ to the aggregated level, since projecting the residuals in $s$ in the manner of the GREG estimator $\widehat{Y}_{GR}$ (i.e. without the training-test split) would not achieve unbiasedness exactly.
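In code, (11) only requires that the chosen learner exposes a fit/predict interface; the following sketch (ours) is written for any such object, e.g. a scikit-learn regressor, which is our assumption rather than a requirement of the method.

```python
import numpy as np

def y1_hat_M(model, x, y_s, s, s1_local, pi2, N):
    """pq-unbiased estimator (11) for an arbitrary prediction model M.

    model: any object with fit(X, y) and predict(X); x: (N, p) population
    features; y_s: observed y aligned with the sample index array s;
    s1_local: positions within s forming the training set; pi2: conditional
    inclusion probabilities for the test units in s2 (scalar or array).
    """
    s1 = s[s1_local]
    s2_local = np.setdiff1d(np.arange(len(s)), s1_local)
    s2 = s[s2_local]
    model.fit(x[s1], y_s[s1_local])                 # train on s1 only
    mu = model.predict(x)                           # M-predictions for all of U
    outside_s1 = np.setdiff1d(np.arange(N), s1)
    err = y_s[s2_local] - mu[s2]                    # test-set errors, not residuals
    return y_s[s1_local].sum() + mu[outside_s1].sum() + (err / pi2).sum()
```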
3.2 Subsampling RB and bootstrap aggregating
There is a natural affinity between the subsampling RB method and bootstrap aggregating (i.e. bagging). Bagging was originally devised to improve unstable learners (Breiman, 1996a; 1996b) for individual prediction, where the aggregation averages the learner over bootstrap replicates of the training set. The argument can be adapted to design-based population-level estimation. Let $\widehat{Y}_M = \psi(s; M)$ be an $M$-assisted estimator of $Y$, which varies over different samples $s$. Insofar as $\{y_i : i \in U\}$ are treated as unknown constants and $\widehat{Y}_M$ is uniquely determined given $\{(y_i, x_i) : i \in s\}$, the only variation of $\widehat{Y}_M$ derives from that of the sample $s$. For some models $M$, such as a regression tree with random feature selection, there exists extra variation of $\widehat{Y}_M$ given $s$. In any case, let the expectation of $\widehat{Y}_M$ be
$$ \psi_M = E(\widehat{Y}_M) = E\big[ \psi(s; M) \big] $$
over all possible $s$ and the additional randomness given $s$. We have
$$ E\big[ (\widehat{Y}_M - Y)^2 \big] = (\psi_M - Y)^2 + E\big[ (\widehat{Y}_M - \psi_M)^2 \big] $$
since $E(\widehat{Y}_M - \psi_M) = 0$ by definition. Thus, $\psi_M$ always has a smaller mean squared error than $\widehat{Y}_M$. Notice that in reality bagging is “caught in two currents” (Breiman, 1994): the improvement can be appreciable if $\widehat{Y}_M$ is unstable, whereas the additional estimation of $\psi_M$ by bagging may not be worthwhile if $\widehat{Y}_M$ is a stable learner to start with.
It is clear from the above that, while it can reduce the variance of the unbagged predictor, bagging does not affect the potential bias, now that it aims at replacing $\widehat{Y}_M = \psi(s; M)$ by its expectation $\psi_M$. The subsampling RB method is more effectual than bagging in the following sense: on the one hand, it leads generally to design-unbiased estimation of $Y$, which does not result from bagging alone; on the other hand, Rao-Blackwellising $\widehat{Y}_{1M}$ reduces its variance even when it is based on a stable learner, such as $\mu(x_i, s_1) = x_i^{\top} b_1$, which bagging does only for unstable $\widehat{Y}_M$. Replacing $\widehat{Y}_1$ in (2) with $\widehat{Y}_{1M}$ given by (11) generally, the subsampling RB $M$-assisted estimator of $Y$ is given by
$$ \widehat{Y}_M = E_q(\widehat{Y}_{1M} | s) \qquad (12) $$
Proposition: The subsampling RB $M$-assisted estimator $\widehat{Y}_M$ by (12) is design p-unbiased, or simply design-unbiased, for $Y$ using an arbitrary model $M$.
The proof is exactly parallel to that for $\widehat{Y}_{RB}$ by (2). Notice that Rao-Blackwellisation of $\widehat{Y}_{1M}$ with respect to $q(s_1 | s)$ can accommodate straightforwardly any additional variation given $s_1$ due to the chosen model $M$. For example, given a subsample $s_1$, one can grow a regression tree with random feature selection. Although the resulting $\mu(x, s_1)$ is not fixed for the given $s_1$, the corresponding $\widehat{Y}_{1M}$ is still design pq-unbiased, because it is conditionally unbiased for $Y$ given $s_1$ and the outcome of the random feature selection, and $Y$ is a constant with respect to the subsampling of $s_1$ and the random feature selection given $s_1$.
Finally, Monte Carlo subsampling RB is operationally similar to bagging, involving about the same amount of computational effort. In bagging, one draws a bootstrap replicate sample from $s$; whereas in subsampling RB, one resamples $s_1$ from $s$ according to $q(s_1 | s)$. In either case, one trains the model on the resample. Repeating the two steps $K$ times yields the bagged predictor by bagging, and the MC-RB estimator by subsampling RB. The choice of $K$ balances between numerical and statistical efficiency.
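For concreteness, a Monte Carlo SRB loop assisted by a random forest could be organised as below (our sketch); scikit-learn's RandomForestRegressor is used purely for illustration, and SRS is assumed at both stages so that $\pi_{2i} = (n - n_1)/(N - n_1)$.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor   # illustrative choice of learner

def mc_srb_rf(x, y_s, s, N, n1, K, rng):
    """Monte Carlo subsampling RB assisted by a random forest.

    Operationally parallel to bagging: repeat K times -- draw s1 ~ q(s1|s)
    by SRS of size n1, train on s1, project the test-set errors in s2 via
    1/pi_2i -- then average the K estimates as in (9).
    """
    n = len(s)
    pi2 = (n - n1) / (N - n1)
    y1 = np.empty(K)
    for k in range(K):
        i1 = rng.choice(n, size=n1, replace=False)
        i2 = np.setdiff1d(np.arange(n), i1)
        rf = RandomForestRegressor(n_estimators=200, random_state=k)
        rf.fit(x[s[i1]], y_s[i1])
        mu = rf.predict(x)
        out1 = np.setdiff1d(np.arange(N), s[i1])
        y1[k] = y_s[i1].sum() + mu[out1].sum() + ((y_s[i2] - mu[s[i2]]) / pi2).sum()
    return y1.mean(), y1          # the MC-SRB estimate and the K subsample estimates
```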
3.3 Design consistency
Provided $n_2 = |s_2| \geq 2$, let $\widehat{V}(\widehat{Y}_{1M,k})$ be an unbiased variance estimator with respect to $p_2(s_2 | s_1)$, i.e. $E_2\big[ \widehat{V}(\widehat{Y}_{1M,k}) \big] = V_2(\widehat{Y}_{1M,k})$, for the $K$ subsamples $k = 1, \ldots, K$. A design-unbiased variance estimator for the MC-RB estimator $\widehat{Y}_M^K$ is given by
$$ \widehat{V}(\widehat{Y}_M^K) = \frac{1}{K} \sum_{k=1}^K \widehat{V}(\widehat{Y}_{1M,k}) - \frac{1}{K} \sum_{k=1}^K \big( \widehat{Y}_{1M,k} - \widehat{Y}_M^K \big)^2 \qquad (13) $$
similarly as for $\widehat{Y}_{RB}^K$. It is an open question at this stage how to determine an efficient subsampling scheme $q(s_1 | s)$, including the choice of $n_1$. Meanwhile, given the simplicity and practical advantage of the delete-one GREG-assisted $\widehat{Y}_{RB}^K$, any other $M$-assisted estimator would not be worth considering unless it has a clearly smaller estimated variance.
A design-unbiased $M$-assisted estimator is consistent provided its sampling variance tends to 0 asymptotically, as $n, N \to \infty$. Since this is the case with the delete-one GREG-assisted estimator $\widehat{\bar{Y}}_{RB}$ of the population mean $\bar{Y} = Y/N$, and since in practice one would only admit an alternative estimator with an even smaller variance, design consistency is not a worrisome issue for design-unbiased $M$-assisted estimation in applications.
Meanwhile, we cannot find any direct references in the literature concerning the design consistency of ML techniques. For example, Gordon and Olshen (1978, 1980) establish consistency of recursive partitioning algorithms, such as regression trees, provided an IID training set. Toth and Eltinge (2011) extend their result, allowing for a sampling design in addition to the IID super-population model $M$, such that the consistency of regression trees for individual prediction, based on samples selected from $p(s)$, is not purely design-based, but requires the super-population model to hold in addition.
In the standard ML literature, asymptotic results are typically derived under stability conditions. Bousquet and Elisseeff (2002) establish a uniform stability condition for regularisation algorithms. Mukherjee et al. (2006) pay special attention to empirical risk minimisation algorithms. Both these works are directed at individual-level prediction from an IID training set, denoted by $S = \{(x_i, y_i) : i = 1, \ldots, n\}$, asymptotically as $n \to \infty$. Let $Z = (x, y)$ generically denote the random variables from the relevant distribution. Let $\mu(x, S)$ be a given predictor trained on $S$. Its prediction mean squared error is $E_Z\big[ (y - \mu(x, S))^2 \big]$. Expectation with respect to $S$ is needed in addition for the stability definitions.
Different definitions of stability are needed under the pq-design-based approach to population-level estimation, where $\{(x_i, y_i) : i \in U\}$ are treated as constants and only the sample $s$ is random. Below we consider the delete-one RB estimator (12) first under the special case of SRS and then under a general unequal probability sampling design.
3.3.1 Stability condition: SRS
Let $s_j = s \setminus \{j\}$ be the delete-$j$ sample. Let $s_{ij} = s \setminus \{i, j\}$ be the delete-$ij$ sample. Let $\mu(x, s_j)$ be the $M$-predictor given $x$, which is trained on $s_j$, and $\mu(x, s_{ij})$ that trained on $s_{ij}$. We define $\mu(x, s)$ to be twice q-stable if
$$ \mu(x_k, s_j) - \mu(x_k, s) \overset{P}{\to} 0 \quad \text{and} \quad \mu(x_k, s_{ij}) - \mu(x_k, s_j) \overset{P}{\to} 0 \qquad (14) $$
i.e. convergence in probability, asymptotically as $n, N \to \infty$, for any $i, j \in s$ and $k \in U$, where $s_j$ results from delete-one q-sampling from $s$, and $s_{ij}$ from recursive q-sampling where one randomly deletes $i \in s_j$. Notice that the first part of (14) is analogous to the ‘pointwise hypothesis leave-one-out stability’ of Mukherjee et al. (2006).
Theorem 1: The delete-one RB estimator (12) is consistent for the population mean $\bar{Y}$ under SRS, as $n, N \to \infty$, given twice q-stability and $y_i - \mu(x_i, s) = O(1)$ for any $s$.
Proof: We have $V_p(\widehat{Y}_M) = E_1\big[ V_2(\widehat{Y}_{1M} | s_1) \big] - E_p\big[ V_q(\widehat{Y}_{1M} | s) \big]$, as by (3), where
$$ \widehat{Y}_{1M}(s_j) = \sum_{k \in s_j} y_k + \sum_{k \notin s_j} \mu(x_k, s_j) + (N - n + 1)\, z_j(s_j) = \sum_{k \in s} y_k + \sum_{k \notin s} \mu(x_k, s_j) + (N - n)\, z_j(s_j) $$
and $z_k(s_j) = y_k - \mu(x_k, s_j)$ for any $k$ and delete-$j$ sample $s_j$. Under SRS, we have
$$ V_2\big[ \widehat{Y}_{1M}(s_j) \,\big|\, s_j \big] = (N - n + 1) \sum_{k \in s_j^c} \big( z_k(s_j) - \bar{Z}_{s_j^c}(s_j) \big)^2 $$
where $\bar{Z}_{s_j^c}(s_j) = \sum_{k \in s_j^c} z_k(s_j)/(N - n)$ and $s_j^c = U \setminus s_j$. By (14), we have
$$ z_k(s_j) - \bar{Z}_{s_j^c}(s_j) = [1 + o_p(1)] \big( z_k(s_{ij}) - \bar{Z}_{s_j^c}(s_{ij}) \big) $$
for any $i \in s_j$, where $\bar{Z}_{s_j^c}(s_{ij}) = \sum_{k \in s_j^c} z_k(s_{ij})/(N - n)$, and, averaged over all $i \in s_j$,
$$ V_2\big[ \widehat{Y}_{1M}(s_j) \,\big|\, s_j \big] = [1 + o_p(1)]\, (N - n + 1)(N - n) \Big\{ \sum_{i \in s_j} \frac{1}{n - 1} \widehat{V}_{s_j^c} \Big\}, \qquad \widehat{V}_{s_j^c} = \frac{1}{n_{s_j^c} - 1} \sum_{k \in s_j^c} \big( z_k(s_{ij}) - \bar{Z}_{s_j^c}(s_{ij}) \big)^2 $$
where $n_{s_j^c} = |s_j^c| = N - n + 1$. One can consider $\widehat{V}_{s_j^c}$ as an unbiased estimator of
$$ \tau(s_{ij}) = \frac{1}{N_{s_{ij}^c} - 1} \sum_{k \in s_{ij}^c} \big( z_k(s_{ij}) - \bar{Z}_{s_{ij}^c}(s_{ij}) \big)^2 $$
where $N_{s_{ij}^c} = |s_{ij}^c| = N - n + 2$, i.e. of the population variance of $z_k(s_{ij})$ in $s_{ij}^c$, based on the SRS sample $s_j^c$ from $s_{ij}^c$ conditional on $s_{ij}$, since $s_{ij}^c = s_j^c \cup \{i\} = U \setminus s_{ij}$, such that
$$ E_1\Big[ \sum_{i \in s_j} \frac{1}{n - 1} \widehat{V}_{s_j^c} \Big] = E_{s_{ij}}\big[ E_i\big( \widehat{V}_{s_j^c} \,\big|\, s_{ij} \big) \big] = E_{s_{ij}}\big[ \tau(s_{ij}) \big] . $$
Given $y_k - \mu(x_k, s) = O(1)$ for any $s$ and $k \in U$, we obtain
$$ E_1\big[ V_2(\widehat{Y}_{1M} | s_1) \big] = [1 + o(1)]\, (N - n + 1)(N - n)\, E_{s_{ij}}\big[ \tau(s_{ij}) \big] . $$
Next, for $V_q(\widehat{Y}_{1M} | s)$, we notice that, by $\mu(x_k, s_j) - \mu(x_k, s) \overset{P}{\to} 0$ in (14),
$$ \widehat{Y}_{1M}(s_j) = [1 + o_p(1)] \Big( \sum_{k \in s} y_k + \sum_{k \notin s} \mu(x_k, s) + (N - n)\, z_j(s_j) \Big) $$
where $V_q\big[ z_j(s_j) | s \big] = \sum_{i \in s} \big( z_i(s_i) - \bar{z}(s) \big)^2 / n$, for $\bar{z}(s) = \sum_{j \in s} z_j(s_j)/n$. In a Sen-Yates-Grundy type expression using pairwise differences, we can write
$$ V_q\big[ z_j(s_j) | s \big] = \frac{n-1}{n} \Big\{ \Big( \frac{n(n-1)}{2} \Big)^{-1} \sum_{i < j \in s} \frac{1}{2} \big( z_i(s_i) - z_j(s_j) \big)^2 \Big\} = \frac{n-1}{n} \Big\{ \Big( \frac{n(n-1)}{2} \Big)^{-1} \sum_{i < j \in s} \widehat{V}_{ij} \Big\} [1 + o_p(1)] $$
where $\widehat{V}_{ij} = \frac{1}{2} \big( z_i(s_{ij}) - z_j(s_{ij}) \big)^2$, by $\mu(x_k, s_{ij}) - \mu(x_k, s_j) \overset{P}{\to} 0$ in (14). One can consider $\widehat{V}_{ij}$ as an unbiased estimator of $\tau(s_{ij})$, i.e. of the population variance of $z_k(s_{ij})$ in $s_{ij}^c$, based on the SRS sample $s_2 = \{i, j\}$ from $s_{ij}^c$ conditional on $s_1 = s_{ij}$. Moreover, one can view the expression in the last braces above as $E_q(\widehat{V}_{ij} | s)$ with respect to the delete-two subsampling $q(s_{ij} | s)$, such that
$$ E_p\big[ \{\cdot\} \big] = E_p\big[ E_q(\widehat{V}_{ij} | s) \big] = E_{s_1}\big[ E_{s_2}(\widehat{V}_{ij} | s_{ij}) \big] = E_{s_1}\big[ \tau(s_{ij}) \big] = E_{s_{ij}}\big[ \tau(s_{ij}) \big] . $$
Given $y_k - \mu(x_k, s) = O(1)$ for any $s$ and $k \in U$, we obtain
$$ E_p\big[ V_q(\widehat{Y}_{1M} | s) \big] = [1 + o(1)]\, (N - n)^2\, \frac{n-1}{n}\, E_{s_{ij}}\big[ \tau(s_{ij}) \big] . $$
Finally, the result follows from $E_1\big[ V_2(\widehat{Y}_{1M} | s_1) \big]$ and $E_p\big[ V_q(\widehat{Y}_{1M} | s) \big]$ above, since
$$ V_p\big( \widehat{\bar{Y}}_M \big) = \Big( 1 - \frac{n}{N} \Big) \frac{1}{n}\, E_{s_{ij}}\big[ \tau(s_{ij}) \big] + o(1) . $$
3.3.2 Stability condition: Unequal probability sampling
For general unequal probability sampling, we define the following stability conditions. First, we define $\mu(x, s)$ to be simply q-stable if, for any $j \in s$ and $k \in U$, we have
$$ \mu(x_k, s_j) - \mu(x_k, s) \overset{P}{\to} 0 \qquad (15) $$
asymptotically as $n, N \to \infty$, where $s_j$ results from delete-one subsampling $q(s_j | s)$. Next,
we define $\mu(x, s)$ to be p-stable for the delete-one RB method, if
$$ \frac{1}{n} \sum_{j \in s} \frac{\widehat{N}_j}{N} \mu(x_j, s) - \frac{1}{N} \sum_{k \in U} \mu(x_k, s) \overset{P}{\to} 0 \qquad (16) $$
where $\widehat{N}_j = \pi_{2j}^{-1} + (n - 1)$ is an estimator of $N$ based on $s_2 = \{j\}$. Notice that, given
q-stability (15), it is possible to replace p-stability (16) by a pq-stability condition
$$ \frac{1}{n} \sum_{j \in s} \frac{\widehat{N}_j}{N} \mu(x_j, s_j) - \frac{1}{N} \sum_{k \in U} \mu(x_k, s_j) \overset{P}{\to} 0 $$
which reduces to $\sum_{j \in s} \mu(x_j, s_j)/n - \sum_{k \in U} \mu(x_k, s_j)/N \overset{P}{\to} 0$ under SRS, and resembles the IID ‘expected leave-one-out stability’ of Mukherjee et al. (2006): the first term above is the empirical average over the observed set in both definitions, whereas for the second term here we replace averaging over $Z$ in the IID setting by averaging over the population distribution function, which places point mass $1/N$ on each $k \in U$.
Some regularity condition on the sampling design $p(s)$ is needed for the general situation. Let $\widehat{\bar{Y}}_j = y_j \widehat{N}_j / N$ be the leave-one-out (LOO) HT estimator of the population mean $\bar{Y}$ based on $s_2 = \{j\}$. We define the sampling design to be LOO-consistent, if
$$ \frac{1}{n} \sum_{j \in s} \widehat{\bar{Y}}_j \overset{P}{\to} \bar{Y} \qquad (17) $$
asymptotically as $n, N \to \infty$. The condition is specified for the LOO-RB-HT estimator, where $\sum_{j \in s} \widehat{\bar{Y}}_j / n = E_q(\widehat{\bar{Y}}_j | s)$. Under SRS, $\sum_{j \in s} \widehat{\bar{Y}}_j / n = \sum_{j \in s} y_j / n$ is the sample mean, which converges to $\bar{Y}$ in probability, provided $y_i = O(1)$ for all $i \in U$. We emphasise that the condition (17) concerns only the sampling design $p(s)$, since it is formulated in terms of the $y$-values alone, i.e. based on an ‘empty’ $M$-predictor, so to speak.
Theorem 2: The delete-one RB estimator (12) is consistent for the population mean $\bar{Y}$, as $n, N \to \infty$, provided q- and p-stability and a LOO-consistent sampling design $p(s)$.
Proof: Given the delete-$j$ sample $s_j$ under any general sampling design, we can write
$$ \widehat{Y}_{1M}(s_j) = \sum_{i \in s} y_i + \sum_{k \notin s} \mu(x_k, s_j) + (\pi_{2j}^{-1} - 1)\big( y_j - \mu(x_j, s_j) \big) = \sum_{i \in s} \big( y_i - \mu(x_i, s_j) \big) + \sum_{k \in U} \mu(x_k, s_j) + (\pi_{2j}^{-1} - 1)\big( y_j - \mu(x_j, s_j) \big) $$
where $\pi_{2j}$ is the conditional probability of selecting $j$ from $s_j^c$ given $s_j$. Given q-stability (15), i.e. $\mu(x, s_j) = \mu(x, s) + o_p(1)$, the RB estimator of the population mean is
$$ \widehat{\bar{Y}}_M = \frac{1}{N} \sum_{k \in U} \mu(x_k, s) + \frac{1}{n} \sum_{j \in s} \frac{\widehat{N}_j}{N} \big( y_j - \mu(x_j, s) \big) + o_p(1) = \Big\{ \frac{1}{N} \sum_{k \in U} \mu(x_k, s) - \frac{1}{n} \sum_{j \in s} \frac{\widehat{N}_j}{N} \mu(x_j, s) \Big\} + \frac{1}{n} \sum_{j \in s} \widehat{\bar{Y}}_j + o_p(1) . $$
The result follows from applying the p-stability condition (16) to the expression in braces and the LOO-consistency condition (17) to the remaining term $\frac{1}{n} \sum_{j \in s} \widehat{\bar{Y}}_j$.
4 Simulations
Below we present and discuss some simulation results for the delete-one RB (or LOO-RB) method and the associated jackknife variance estimator described in Section 2.4. The HT and some GREG estimators are computed for comparison. The target is always the population mean (denoted by $\theta$) in a given set-up. The simulations proceed as follows.
- $B$ samples (usually $B = 100$) are drawn independently from the given fixed population according to a specified sampling design.
- We obtain an estimate $\hat{\theta}^{(b)}$ based on each sample, for $b = 1, \ldots, B$. In particular, for the LOO-RB method, we calculate its associated jackknife variance estimate $v^{(b)}$.
- An estimate of $E(\hat{\theta})$ over repeated sampling is $\hat{\bar{\theta}} = \sum_{b=1}^B \hat{\theta}^{(b)}/B$, with associated Monte Carlo error $\sqrt{v/B}$, where $v = \sum_{b=1}^B (\hat{\theta}^{(b)} - \hat{\bar{\theta}})^2/(B - 1)$. An estimate of its bias is $\hat{\bar{\theta}} - \theta$; an estimate of its root mean squared error (RMSE) is $\big\{ \sum_{b=1}^B (\hat{\theta}^{(b)} - \theta)^2/B \big\}^{1/2}$.
- Similarly for the bias and RMSE of the variance estimator $v^{(b)}$, except that the true variance of the LOO-RB method is unknown and is replaced by its estimate $v$.
Since the HT estimator and the LOO-RB methods are unbiased, an inspection of their respective simulation-based bias estimates and the associated Monte Carlo errors can usually provide adequate information for judging whether a certain conclusion from the results is warranted given the actual number of simulations.
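The sketch below (ours) mirrors the listed steps for a generic estimator and design; all function names are placeholders supplied by the user.

```python
import numpy as np

def simulate(estimator, draw_sample, theta, B=100, seed=0):
    """Monte Carlo summary of a design-based estimator of the mean theta.

    draw_sample(rng) returns the index set of one realised sample under the
    chosen design; estimator(sample_idx) returns an estimate of theta.
    Returns the estimated bias, its Monte Carlo error and the RMSE.
    """
    rng = np.random.default_rng(seed)
    est = np.array([estimator(draw_sample(rng)) for _ in range(B)])
    bias = est.mean() - theta
    mc_error = np.sqrt(est.var(ddof=1) / B)      # sqrt(v / B) as defined above
    rmse = np.sqrt(((est - theta)**2).mean())
    return bias, mc_error, rmse
```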
4.1 Simulations with synthetic data
The GREG estimator has become the standard-bearer in practical survey sampling over the past three decades. Using the simple simulations below, we would like to gain some basic appreciation of the pros and cons of the corresponding LOO-RB-GREG estimator, given by (2), under the proposed unbiased learning approach. Small synthetic populations are generated based on only two regressors. The first regressor $x_1$ follows a log-normal distribution with mean and variance both set to one. The second regressor $x_2$ follows a Poisson distribution with mean 5. The target $y$-variable in each setting is generated as the absolute value of a certain function of $x_1$ and $x_2$ plus a regression error.
Table 1: Simulation results of the HT, GREG and LOO-RB-GREG estimators, by two different sampling designs. Monte Carlo errors of bias estimates in parentheses.

                        Simple Random Sampling         Probability Proportional to x_2
Estimator               Bias (MC Error)    RMSE        Bias (MC Error)    RMSE
HT                      0.08 (0.19)        1.91        0.12 (0.20)        2.02
GREG                    -0.09 (0.13)       1.29        0.03 (0.16)        1.60
LOO-RB-GREG             0.10 (0.14)        1.39        0.16 (0.16)        1.63
Variance by jackknife   0.52 (0.10)        1.09        4.73 (1.02)        11.23
We start with a setting where the GREG estimator should have a negligible or very small bias. Let the population size be 200, and let the target survey $y$-variable be the absolute value of $1.5 x_1 + x_2 + \epsilon$, where $\epsilon$ follows a normal distribution with zero mean and variance equal to a quarter of the variance of $x_1$. Let the sample size be 20. Two sampling designs are used: SRS, or conditional Poisson sampling with probabilities proportional to $x_2$ as the size variable. The results are given in Table 1.
It can be seen that under both sampling designs, GREG and LOO-RB-GREG have essentially the same efficiency, and both outperform HT estimation. Recall that the bias of the GREG estimator is negligible in this scenario because of the underlying linear population model. Clearly, the jackknife variance estimator (7), which is derived as a direct analogy to the IID-sample situation, needs to be modified for unequal probability sampling designs such as the conditional Poisson sampling used here.
Table 2: Simulation results of the HT, GREG and LOO-RB-GREG estimators under SRS, but two different population models. Monte Carlo errors of bias estimates in parentheses.

                        V(y) ∝ x_1, n = 5              Non-linear, V(y) ∝ x_1, n = 20
Estimator               Bias (MC Error)    RMSE        Bias (MC Error)    RMSE
HT                      -0.46 (0.50)       5.03        0.95 (1.26)        12.57
GREG                    -0.82 (0.41)       4.16        -2.41 (0.51)       5.62
LOO-RB-GREG             -0.68 (0.77)       7.69        0.68 (0.86)        8.62
Variance by jackknife   -13.22 (15.75)     157.31      -7.36 (10.50)      104.76
Consider now two potentially problematic settings. First, we introduce heteroscedasticity by making the variance of the $y$-variable proportional to $x_1$, while reducing the sample size at the same time, to $n = 5$ (from $N = 100$). The results are given in the left part of Table 2. The LOO-RB-GREG is the least efficient estimator here: the heteroscedasticity increases the variance of $\widehat{Y}_1$ based on each subsample, whereas the small sample size implies RB averaging over only 5 subsample estimates (instead of 20 above). The RMSE of the jackknife variance estimator is much bigger for similar reasons.
Next, reverting to $(N, n) = (200, 20)$, we generate the target $y$-variable non-linearly as the absolute value of $0.5 x_1 + 0.25 x_1^2 + x_2 + \epsilon$, where $\epsilon$ follows a normal distribution with zero mean and variance proportional to $x_1$. The results under SRS are given in the right part of Table 2. The GREG estimator now has a relatively large bias, which is removed by the LOO-RB-GREG estimator. However, the unbiased learning estimator loses efficiency compared to GREG in terms of the MSE, although it is still much better than the HT estimator. The performance of the variance estimator is similar as before.
These results illustrate the basic pros and cons of delete-one RB-GREG vs. standard GREG estimation. On the one hand, the GREG estimator may suffer from non-negligible bias, e.g. because one applies the assisting linear model in a routine manner without conducting careful model diagnostics as one should, whereas the unbiased learning approach avoids the bias by definition. On the other hand, delete-one subsampling may suffer from loss of efficiency given heteroscedastic observations in very small samples.
4.2 Simulations with real data
The population consists of a sample of about 17000 small and medium-sized enterprises from the Spanish Structural Business Survey (SSBS). As the target variables we consider three survey variables collected in the SSBS: Turnover, Total personnel expenses and Total procurements of goods and services. Seventeen variables from the administrative corporate income tax data are imported as the regressors. One of them is turnover, although for many enterprises the turnover from the tax data will differ from the turnover from the SSBS by definition; in addition, the two observed values may differ because of registration delays or other operational reasons. The estimators to be considered are: HT, GREG1 with one regressor (turnover), GREG17 with the seventeen regressors (as main effects), LOO-RB-GREG1 with one regressor (turnover), and LOO-RB random forest (RF) with the seventeen features. When only one regressor is used, RF is not a good option to be included here. Jackknife variance estimation is applied to the two SRB estimators.
4.2.1 Turnover
This case is interesting because turnover (from the tax data) is one of the regressors. We consider SRS and stratified SRS designs. For the latter, three strata are created by the number of employees, which is a commonly used stratification variable in SBS, although the actual designs always have many other complicating details in practice. The stratum sample sizes are allocated proportionally to the stratum population sizes. The total sample size is 10% of the population under both designs. The simulation results are given in Table 3, similarly as before and suitably scaled for presentation.
Table 3: Simulation results for Turnover estimation by different estimators under two sampling designs. Monte Carlo errors of bias estimates in parentheses.

                          SRS                           Stratified SRS
Estimator                 Bias (MC Error)    RMSE       Bias (MC Error)    RMSE
HT                        -0.22 (0.47)       4.65       -0.36 (0.32)       3.20
GREG1                     0.01 (0.26)        2.56       0.30 (0.22)        2.16
LOO-RB-GREG1              -0.21 (0.27)       2.68       0.05 (0.23)        2.31
Variance LOO-RB-GREG1     -1.02 (0.32)       3.37       0.60 (0.32)        3.27
GREG17                    0.42 (0.22)        2.30       0.79 (0.75)        7.47
LOO-RB-RF                 -0.19 (0.14)       1.37       -0.10 (0.14)       1.38
Variance LOO-RB-RF        0.09 (0.03)        0.34       -0.04 (0.04)       0.38
The LOO-RB-RF (by random forest) is the most efficient estimator under SRS. It is more efficient than the GREG17 estimator, because RF yields a better prediction model than simple linear regression using all the regressors as main effects. In fact, the GREG17 estimator introduces a small bias compared to the GREG1 estimator, and is only more efficient by a small margin. The LOO-RB-GREG1 estimator has about the same efficiency as the GREG1 estimator. Compared to the simple simulation results earlier, heteroscedastic variance does not cause a loss of efficiency for the LOO-RB method here, because the sample size is large enough. The jackknife variance estimator has no statistically significant bias for the LOO-RB-RF estimator, but it has a small negative bias for the LOO-RB-GREG1 estimator.
Next, under the stratified SRS design, the LOO-RB-RF is again the most efficient estimator. Its RMSE is about the same as under SRS, which is not surprising given the proportional allocation of stratum sample sizes, because RF is able to account for the design size variable using the auxiliary information in the 17 regressors. In contrast, the GREG17 estimator actually loses efficiency and does not behave well here, which again illustrates that applying the GREG estimator without appropriate attention to model diagnostics can be counter-productive in practice. The relative performance of the simple GREG1 estimator and its unbiased counterpart LOO-RB-GREG1 is similar as under SRS, and both are slightly more efficient under the stratified design. The jackknife variance estimators perform similarly as under the SRS design.
4.2.2 Other target variables
Simulation results for the other two target variables under the stratified SRS design are given in Table 4.
Table 4: Simulation results for Total personal expenses and Total procurements under the stratified SRS design. Monte Carlo errors of bias estimates in parentheses.

                          Total personal expenses       Total procurements
Estimator                 Bias (MC Error)    RMSE       Bias (MC Error)    RMSE
HT                        0.03 (0.04)        0.58       -0.05 (0.29)       2.88
GREG1                     0.09 (0.04)        0.60       -0.07 (0.21)       2.08
LOO-RB-GREG1              0.06 (0.04)        0.60       -0.20 (0.21)       2.14
Variance LOO-RB-GREG1     0.02 (0.01)        0.03       0.03 (0.14)        1.37
GREG17                    0.01 (0.02)        0.33       0.50 (0.18)        1.90
LOO-RB-RF                 0.01 (0.02)        0.32       -0.07 (0.11)       1.11
Variance LOO-RB-RF        0.03 (0.00)        0.07       -0.21 (0.03)       0.34

When it comes to Total personal expenses, turnover from the tax data is not a good regressor at all, so that the simple GREG1 estimator yields no improvement over the HT
estimator. The same holds for the LOO-RB-GREG1 estimator, which performs similarly to the GREG1 estimator, as can be expected. Meanwhile, both the GREG17 and LOO-RB-RF estimators are noticeably better. This suggests that the other regressors can be linearly related to this target variable, and that the RF model is flexible enough to automatically capture this linear regression relationship here. The jackknife variance estimators exhibit no bias for the LOO-RB methods in this case.
Turning to the results for Total procurements of goods and services, the LOO-RB-RF estimator is again by far the most efficient of all. The GREG1 and LOO-RB-GREG1 estimators similarly improve on the HT estimator, turnover from the tax data being a reasonable regressor for this variable. The GREG17 estimator is more efficient than the simple GREG1 estimator by a small margin, albeit at the cost of introducing a small bias that is statistically significant. In contrast, the LOO-RB-RF estimator provides a much greater gain in efficiency while remaining design-unbiased. The jackknife variance estimator is essentially unbiased for the LOO-RB-GREG1 estimator, but it has a small negative bias for the LOO-RB-RF estimator.
4.2.3 Conclusions
The following conclusions seem warranted based on the simulation results above.
In situations where simple GREG estimation (with few regressors) has little bias to start with, i.e. when the simple linear regression model is a reasonable statistical model, the proposed unbiased learning approach is unlikely to offer appreciable improvement in practice. More advanced learning techniques cannot be of much help without a supply of additional useful features. Nevertheless, where GREG estimation may suffer from non-negligible bias in a given situation, because the linear model is inappropriate, the unbiased learning approach can avoid the bias automatically.
More importantly, provided richer auxiliary information, the proposed unbiased SRB learning approach can yield large gains. On the one hand, it allows one to make use of modern ML techniques that can potentially lead to much more flexible and powerful prediction models, without demanding the same kind of effort that is often necessary for building complex parametric models. On the other hand, the theory for design-unbiased statistical learning developed in this paper ensures that the resulting ML-assisted estimator is valid for descriptive inference, so that the ML prediction model can help to generate valid and efficient estimation at the aggregated level, without requiring the model to be entirely correct at the individual level, because the prediction errors in the sample are extrapolated to the population of interest based on the known pq-sampling design.
5 Summary remarks
Amalgamating classic ideas of Statistical Science and Machine Learning, we have developed an ML-assisted SRB approach for pq-design-unbiased statistical learning in survey sampling. It allows one to achieve generally design-unbiased model-assisted estimation based on probability sampling from the population of interest. The freedom to adopt modern as well as emerging powerful algorithmic ML prediction models should enable one to make more efficient use of rich auxiliary information whenever it is available.
A topic for future research can be noted immediately. As mentioned earlier, it is an open question at this stage how to construct an efficient subsampling scheme $q(s_1 | s)$, including the choice of $n_1 = |s_1|$. Moreover, a related issue is the sampling design. In this paper, we have assumed the pq-design approach, because it fits naturally with the current practice of survey sampling, where the sampling design $p(s)$ is already implemented and given at the stage of estimation, so that only the subsampling scheme $q(s_1 | s)$ is left to one's own device. However, by construction, the combined randomisation distribution induced by $(p, q)$ is the same as that induced by $(p_1, p_2)$, for any $s_1 \cup s_2 = s$ and $s_1 \cap s_2 = \emptyset$. It may be worth investigating whether a direct approach to the design of $(p_1, p_2)$ may offer certain advantages. Finally, it is easily envisaged that more efficient and accurate variance estimation methods will be discovered by future research.
References
[1] Blackwell, D. (1947). Conditional expectation and unbiased sequential estimation. Ann. Math. Statist., 18:105-110.
[2] Bousquet, O. and Elisseeff, A. (2002). Stability and generalization. J. Mach. Learning Res., 2:499-526.
[3] Breidt, F.J. and Opsomer, J.D. (2017). Model-assisted survey estimation with modern prediction techniques. Statist. Scien., 32:190-205.
[4] Breiman, L. (1996a). Heuristics of instability and stabilization in model selection. Ann. Statist., 24:2350-2383.
[5] Breiman, L. (1996b). Bagging predictors. Mach. Learn., 26:123-140.
[6] Cassel, C.M., Särndal, C.-E. and Wretman, J.H. (1976). Some results on generalized difference estimation and generalized regression estimation for finite populations. Biometrika, 63:615-620.
[7] Deville, J.-C. and Särndal, C.-E. (1992). Calibration estimators in survey sampling. J. Amer. Statist. Assoc., 87:376-382.
[8] Gordon, L. and Olshen, R. (1978). Asymptotically efficient solutions to the classification problem. Ann. Statist., 6:515-533.
[9] Gordon, L. and Olshen, R. (1980). Consistent nonparametric regression from recursive partitioning schemes. J. Mult. Ana., 10:611-627.
[10] Horvitz, D.G. and Thompson, D.J. (1952). A generalization of sampling without replacement from a finite universe. J. Amer. Statist. Assoc., 47:663-685.
[11] Mukherjee, S., Niyogi, P., Poggio, T. and Rifkin, R. (2006). Learning theory: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization. Adv. Comp. Math., 25:161-193.
[12] Rao, C.R. (1945). Information and accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc., 37:81-91.
[13] Särndal, C.-E. (2010). The calibration approach in survey theory and practice. Surv. Methodol., 33:99-119.
[14] Särndal, C.-E., Swensson, B. and Wretman, J. (1992). Model Assisted Survey Sampling. New York: Springer-Verlag.
[15] Toth, D. and Eltinge, J.L. (2011). Building consistent regression trees from complex sample data. J. Amer. Statist. Assoc., 106:1626-1636.
[16] Tsymbal, A. (2004). The problem of concept drift: definitions and related work. Comp. Scien., 106(2), 58.
[17] Tukey, J.W. (1958). Bias and confidence in not quite large samples (abstract). Ann. Math. Statist., 29:614.