although it is somewhat less efficient due to the additional MC error. In practice, though,
one can easily balance the numerical efficiency of the MC-SRB method against
the statistical efficiency of the corresponding exact RB method.
The SRB method makes use of three classic ideas from Statistical Science and Machine
Learning. On the one hand, the training-test split of the sample of observations in ML
generates errors in the test set rather than residuals, conditional on the training dataset,
which as we shall explain is the key to achieving exact design-unbiasedness. For model-
assisted survey estimation we use this idea to remove the finite-sample bias. On the other
hand, Rao-Blackwellisation (Rao, 1945; Blackwell, 1947) and model-assisted estimation
(Cassel et al., 1976) are powerful ideas in Statistics and survey sampling, which we apply
to ML techniques to obtain design-unbiased survey estimators at the population level.
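
To make the combination of the three ideas concrete, the following is a minimal sketch under simple random sampling with a linear assisting model. It is an illustration under stated assumptions, not the formal definition of the SRB estimator, which is given in Section 2 and may differ in detail. The function names (fit_linear, split_estimate, srb_estimate) and the specific form of the single-split estimator are hypothetical choices made here: the model is trained on the subsample s1, the prediction errors in the test subsample s2 are weighted by (N - n1)/n2, and the result is averaged over either all training-test splits (exact RB) or a random subset of them (MC).

```python
import itertools
import numpy as np

def fit_linear(x, y):
    """Least-squares fit of y on (1, x); returns a prediction function."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return lambda x_new: beta[0] + beta[1] * x_new

def split_estimate(x_pop, y_s, s, train_pos):
    """Estimate of the population total from one training-test split.

    x_pop: auxiliary values for all N population units
    s: indices of the sampled units (simple random sample, assumed)
    y_s: observed y-values, aligned with s
    train_pos: positions within s that form the training subsample s1
    """
    N, n = len(x_pop), len(s)
    train_pos = np.asarray(train_pos)
    test_pos = np.setdiff1d(np.arange(n), train_pos)
    s1, s2 = s[train_pos], s[test_pos]
    mu = fit_linear(x_pop[s1], y_s[train_pos])(x_pop)  # trained on s1 only
    outside_s1 = np.setdiff1d(np.arange(N), s1)
    # Test-set prediction errors (not residuals): s2 played no part in the fit.
    errors = y_s[test_pos] - mu[s2]
    # Given s1, s2 is a simple random sample of size n2 from the N - n1 others.
    weight = (N - len(s1)) / len(s2)
    return y_s[train_pos].sum() + mu[outside_s1].sum() + weight * errors.sum()

def srb_estimate(x_pop, y_s, s, n_train, n_splits=None, rng=None):
    """Average over all splits (n_splits=None) or over n_splits random ones."""
    n = len(s)
    if n_splits is None:
        splits = itertools.combinations(range(n), n_train)
    else:
        rng = np.random.default_rng(rng)
        splits = (rng.choice(n, size=n_train, replace=False)
                  for _ in range(n_splits))
    return float(np.mean([split_estimate(x_pop, y_s, s, t) for t in splits]))
```

Conditional on s1, the weighted sum of test-set errors is unbiased for the total prediction error over the units outside s1, so each single-split estimate is design-unbiased whatever the fitted model; averaging over splits of the same sample leaves the expectation unchanged. Passing n_splits trades some statistical efficiency for lower computational cost.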
We shall refer to the amalgamation as statistical learning, since the term model-assisted
estimation is entrenched with the property of approximate design-unbiasedness
(e.g. Särndal, 2010; Breidt and Opsomer, 2017), whereas the focus on population-level
estimation and associated variance estimation is unusual in the ML literature.
In applications one needs to ensure design-consistency of the proposed SRB method,
in addition to exact design-unbiasedness. The property can readily be established for
parametric or many semi-parametric assisting models. But the conditions required for
non-parametric algorithmic ML prediction models have so far eluded a treatment in the
literature. Indeed, this has been a main reason preventing the incorporation of such
ML techniques in model-assisted estimation from survey sampling. We shall develop
general stability conditions for design-consistency under both simple random sampling
and arbitrary unequal probability sampling designs.
For the first time, design-unbiased model-assisted estimation can thereby be achieved
generally in survey sampling. Wherever rich feature data are available, the approach of
statistical learning developed in this paper enables one to adopt suitable ML techniques,
which can make much more efficient use of the available auxiliary information.
The rest of the paper is organised as follows. In Section 2, we describe the SRB method
that uses an assisting linear model. The underlying ideas of design-unbiased statistical
learning are explained, as well as the differences from standard model-assisted
generalised regression estimation. Some basic methods of variance estimation are outlined,
where a novel jackknife variance estimator is developed for the SRB method. We move on
to non-linear ML techniques in Section 3. The similarities to and differences from the bootstrap
aggregating (Breiman, 1996b) approach are explored. Moreover, we investigate and prove
the stability conditions for design-consistency of the SRB method that uses non-parametric
algorithmic prediction models. Two simulation studies are presented in Section 4, which
illustrate the potential gains of the proposed unbiased statistical learning approach, com-
pared to standard linear model-assisted or model-based approaches. A brief summary
and topics for future research will be given in Section 5.