Theresa Kuchler
Dominic Russel
Johannes Stroebel
Working Paper 26990
1050 Massachusetts Avenue
Cambridge, MA 02138
April 2020, Revised August 2020
We use aggregated data from Facebook to show that COVID-19 was more likely to spread
between regions with stronger social network connections. Areas with more social ties to two
early COVID-19 “hotspots” (Westchester County, NY, in the U.S. and Lodi province in Italy)
generally had more confirmed COVID-19 cases as of the end of March. These relationships hold
after controlling for geographic distance to the hotspots as well as for the income and population
densities of the regions. As the pandemic progressed in the U.S., a county's social proximity to
recent COVID-19 cases predicts future outbreaks over and above physical proximity. These
results suggest data from online social networks can be useful to epidemiologists and others
hoping to forecast the spread of communicable diseases such as COVID-19.
To forecast the geographic spread of communicable diseases such as COVID-19, it is valuable
to know which individuals are likely to physically interact (Piontti et al., 2018). In particular,
since social ties shape patterns of physical interaction, the strength of social connections
between cities and regions is important for determining a locality’s level of risk for future
outbreaks. Yet, the geographic structure of social networks is difficult to measure on a
national or global scale. In this paper, we use aggregated data from Facebook to measure
social connections between regions. We show that these connectedness measures can help
forecast the geographic spread of COVID-19.
We construct a measure of social connectedness between U.S. counties and between Ital-
ian provinces. This Social Connectedness Index captures the probability that Facebook users
in a pair of these regions are Facebook friends with each other (Bailey et al., 2018b). We
hypothesize that regions connected through many friendship links are likely to have more
physical interactions between their residents, providing opportunities for the spread of com-
municable diseases. Indeed, our measure has been shown to be predictive of travel patterns
across Europe (Bailey et al., 2020c) and within urban areas (Bailey et al., 2020a), suggesting
it contains important information about real-world interactions. Most directly, Coven and
Gupta (2020) use our Social Connectedness Index to show that counties with higher levels
of social connectedness to New York were more likely to be destinations for those fleeing the
city during the pandemic. This provides direct evidence for our hypothesized mechanism.
After introducing our Social Connectedness Index, we show that regions with stronger
social ties to early COVID-19 “hotspots” (Westchester County, NY, in the United States,
and Lodi province in Italy) have more documented COVID-19 cases per resident as of
March 30, 2020. This relationship is robust to controlling for the geographic distance to these
early “hotspots”, as well as a number of demographic characteristics of the regions. These
case studies highlight that social connectedness might have served as a valuable predictive
measure in addition to physical distance and other existing epidemiological model inputs.
We then exploit the changing geography of the pandemic in the U.S. to conduct a more
systematic analysis. We construct regional measures of COVID-19 exposure through social
connectedness (“social proximity to cases”) and through physical distance (“physical prox-
imity to cases”). We find a county’s growth in social proximity to cases in one time period
is strongly correlated with the county’s growth in actual cases in the next time period. Even
after controlling for physical proximity to cases and other regional demographics, a doubling
in social proximity to cases in one two-week period corresponds to a 22.5% increase in actual
cases per 10,000 residents in the next two-week period. This positive relationship holds for
every two-week period between March 30 and July 20, 2020. To mimic a real-world use
case, we also conduct a simple out-of-sample prediction exercise. We find that models that
include our measure of social proximity to cases are better able to predict a region’s future
case growth than those that rely on geographic distance and other demographics alone.
Our use of the Social Connectedness Index to forecast COVID-19 outbreaks adds to an
active body of research that studies how different aspects of social media and internet-usage
patterns can be used for tracking and preventing disease (for an overview, see Aiello et al.,
2020). One strand of this literature uses the content of individuals’ internet searches or social
media posts; most famously, Google Flu Trends used search queries related to influenza for
early outbreak detection (Ginsberg et al., 2009). Other researchers have also used content
from Twitter posts (Rodríguez-Martínez and Garzón-Alfonso, 2018; Jahanbin and Rahmanian,
2020), Facebook likes (Gittelman et al., 2015), Wikipedia searches (Generous et al.,
2014), and Instagram posts (Correia et al., 2016) to predict public health outcomes. A sec-
ond strand of research, which has received much attention during the COVID-19 pandemic,
uses geolocation data to track individuals’ movement patterns. These data have been used
to explore the determinants and effects of social distancing behavior (for an overview, see
Giuliano and Rasul, 2020), as well as forecast disease spread (see e.g. Bengtsson et al., 2015;
Wesolowski et al., 2012, 2015; Peixoto et al., 2020). A third strand uses crowdsourced infor-
mation, including surveys, to monitor disease symptoms and detect potential outbreaks (see
Facebook Symptom Survey; Smolinski et al., 2015; Paolotti et al., 2014).
In comparison to this literature, our stable network-based measure is less likely to suffer
from changes in internet behavior or seasonality, both of which have hampered Google Flu
Trends (Olson et al., 2013). In addition, our measures do not require individuals to have
experienced symptoms, which potentially allows us to identify at-risk localities before disease
spreads. Finally, because our measures are based only on aggregated connections
(instead of individual movement), they are easily accessible to researchers and consistently
available for a large number of geographies around the world.
More generally, our results add to a literature that has applied aspects of network the-
ory to build spatial epidemiological models (for overviews, see Keeling and Eames, 2005;
Keeling and Rohani, 2011; Danon et al., 2011). Works in this literature move beyond the
basic assumption that individuals within a population are “fully mixed”, or equally likely
to interact; instead, they better represent the dynamics of real-world connections (see e.g.
Newman, 2002; Klovdahl, 1985; Klovdahl et al., 1994; Mossong et al., 2008; Yang et al.,
2020). While some of these studies parameterize models with information on local networks,
we are unaware of any that introduces a measure with comparably high levels of coverage
and granularity.
Our hope is that our unique measure of social connectedness can help pa-
rameterize future epidemiological work. In addition, we hope that the Social Connectedness
Index can advance the literature on the determinants and effects of urban and regional social
networks (see Bailey et al., 2020a; Kim et al., 2017; Büchel and von Ehrlich, 2016; Mossay
and Picard, 2011; Brueckner and Largey, 2008; Glaeser et al., 1992).
It is important to note that our objective in this paper is not to incorporate social
connectedness data into a state-of-the-art epidemiological model. Instead, we provide a
unique measure to assess regions’ outbreak risk, answering the call of Avery et al. (2020),
among others, who highlight the “urgent need” for “creative and entrepreneurial methods”
of interpreting and sharing data to model coronavirus spread. To that end, the data in this
paper, as well as similar data for a wide range of other geographies, are accessible by emailing
[email protected]. We encourage interested researchers to do so.
However, it suggests that our data might partner well with these measures. For example, if one can
detect an early outbreak using surveys, they could then predict (and potentially prevent) the next outbreak
using information on social connectedness.
For example, the Social Connectedness Index is available at the ZCTA level in the U.S., the NUTS3
level in Europe, the GADM2 level in the Indian Subcontinent, and the GADM1 level throughout much of
the rest of the world.
1 Data Description
To measure the intensity of social connectedness between locations, we use a de-identified
and aggregated snapshot of all active Facebook users and their friendship networks from
March 2020. As of the end of 2019, Facebook had nearly 2.5 billion monthly active users
around the world: 248 million in the U.S. and Canada, 394 million in Europe, 1.04 billion in
Asia-Pacific, and 817 million in the rest of the world (Facebook, 2020). The data therefore
have extremely wide coverage, and provide a unique opportunity to map the geographic
structure of social networks around the world. Locations are assigned to users based on their
information and activity on Facebook, including their public profile information, and device
and connection information. Establishing a connection on Facebook requires the consent
of both individuals, and there is an upper limit of 5,000 on the number of connections a
person can have. As a result, Facebook connections are generally more likely to be between
real-world acquaintances than links on many other social networking platforms.
Our measure of the social connectedness between two locations i and j is the Social
Connectedness Index (SCI) introduced by Bailey et al. (2018b):
$$\text{Social Connectedness}_{i,j} = \frac{\text{FB Connections}_{i,j}}{\text{FB Users}_i \times \text{FB Users}_j}. \tag{1}$$
Here, $\text{FB Connections}_{i,j}$ is the total number of Facebook friendship links between Facebook
users living in location i and Facebook users living in location j. $\text{FB Users}_i$ and $\text{FB Users}_j$
are the number of active users in each location. $\text{Social Connectedness}_{i,j}$ thus measures the
relative probability of a Facebook friendship link between a given Facebook user in location
i and a given Facebook user in location j: if this measure is twice as large, a given Facebook
user in region i is twice as likely to be friends with a given Facebook user in region j.
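As a minimal sketch of the index in Equation (1), the snippet below computes pairwise connectedness from a matrix of friendship-link counts and a vector of user counts. The function name and the toy numbers are hypothetical and chosen only for illustration; they are not actual Facebook data.

```python
import numpy as np

def social_connectedness(fb_connections, fb_users):
    """SCI_ij = connections_ij / (users_i * users_j), as in Equation (1)."""
    connections = np.asarray(fb_connections, dtype=float)
    users = np.asarray(fb_users, dtype=float)
    # The outer product gives users_i * users_j for every pair (i, j).
    return connections / np.outer(users, users)

# Hypothetical toy data for three locations (illustrative only).
connections = np.array([[50, 40, 5],
                        [40, 80, 10],
                        [5, 10, 30]])
users = np.array([100, 200, 50])

sci = social_connectedness(connections, users)
# A ratio of 2 means a user in location 0 is twice as likely to be friends
# with a given user in location 1 as with a given user in location 2.
ratio = sci[0, 1] / sci[0, 2]
```

Because the index is a relative probability, only ratios such as `ratio` carry meaning; the absolute scale of `sci` is arbitrary.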
In previous work, we have shown that this measure predicts a large number of important
economic and social interactions. For example, social connectedness as measured through
Facebook friendship links is strongly related to patterns of sub-national and international
trade (Bailey et al., 2020b), patent citations (Bailey et al., 2018b), and investment decisions
(Kuchler et al., 2020). More generally, we have found that information on individuals’
Facebook friendship links can help understand their product adoption decisions and their
housing and mortgage choices (Bailey et al., 2018a, 2019a,b).
Data on COVID-19 cases in the United States by county come from Johns Hopkins
University Center for Systems Science and Engineering. Similarly, data for COVID-19 cases
for each Italian province come from the Italian Dipartimento della Protezione Civile. As with
any data on cases, some bias may be introduced by differential testing across regions.
2 Early Hotspot Analysis
In this section, we explore how the domestic spread of confirmed COVID-19 cases is related
to social connectedness to two early COVID-19 “hotspots”: Westchester County, NY in the
U.S., and Lodi Province in Italy. Westchester County includes New Rochelle, a community
that had the first major COVID-19 outbreak in the eastern United States (Chappell, 2020).
As of March 30th, the county had over 9,300 cases, second only to nearby New York City.
Additionally, a number of articles reported wealthy residents from Westchester and the New
York area had fled to other parts of the U.S. (Tully and Stowe, 2020), providing a vector that
could potentially spread the disease. Indeed, geneticists and epidemiologists later found that
travel from New York seeded much of the first wave of U.S. COVID-19 outbreaks (Carey and
Glanz, 2020). Social connections to Westchester may thus provide particularly important
information for tracking COVID-19 spread, especially given that Coven and Gupta (2020)
found that connectedness to New York predicted travel patterns from the city early in the
pandemic. Lodi is an Italian province of around 230,000 inhabitants in the heavily impacted
region of Lombardy. It contains Codogno, where the earliest cases of COVID-19 in Italy
were detected, and was at the center of Italy’s outbreak (Horowitz et al., 2020).
Panel (a) of Figure 1 shows a heatmap of the social connectedness of Westchester County,
NY, to all other U.S. counties; darker colors correspond to stronger social ties. Panel (b)
shows the distribution of COVID-19 cases per 10,000 residents across U.S. counties on March
30, 2020, with darker colors corresponding to higher COVID-19 prevalence. These maps show
a number of similarities. Perhaps most notably, coastal regions and urban centers appear to
have both high levels of connectedness to Westchester and larger numbers of COVID-19 cases
per resident. But a number of more subtle patterns emerge as well. Both measures are high
in the communities along the coasts of Florida (in particular along the southeastern coast,
near Miami), in western and central Colorado (in particular in areas with ski resorts), and
in the upper northeast. These areas are all popular vacation destinations and second home
locations for many well-heeled residents of Westchester. Indeed, the governors of Florida
and Rhode Island both publicly lamented the number of New York area residents fleeing
to their states and spreading COVID-19 (Mower, 2020; Carlisle, 2020). By contrast, many
areas that are geographically closer but less socially connected to Westchester, such as in
western Pennsylvania and West Virginia, had fewer confirmed COVID-19 cases on March
30, 2020. There are also a number of patterns of COVID-19 prevalence that connectedness
to Westchester alone cannot explain. Areas surrounding King County, WA (Seattle), for
example, have relatively low levels of connectedness to Westchester, but were an independent
early hotspot of COVID-19. Some states in the southern U.S. where residents were slower
Figure 1: Social Network Distributions from Westchester and COVID-19 Cases in the U.S.
Panels: (a) Log of SCI to Westchester County, NY; (b) COVID-19 Cases per 10k Residents by County;
(c) Westchester binscatter without controls; (d) Westchester binscatter with controls. Panels (c) and (d)
plot cases per 10k people against log(Social Connectedness).
Note: Panel (a) shows the social connectedness to Westchester for U.S. counties. Panel (b) shows the number
of confirmed COVID-19 cases by U.S. county on March 30th, 2020. Panels (c) and (d) show binscatter plots
with counties more than 50 miles from Westchester as the unit of observation. To generate the plot in Panel
(c), we group log(SCI) into 100 equal-sized bins and plot the average against the corresponding average
case density. Panel (d) is constructed in a similar manner. However, we first regress log(SCI) and cases
per 10,000 residents on a set of control variables and plot the residualized values on each axis. Red lines
show quadratic fit regressions. The controls for Panel (d) are 100 dummies for the percentile of the county
distance to Westchester from the National Bureau of Economic Research; population density and median
household income from Chetty et al. (2016); and dummies for the six National Center for
Health Statistics Urban-Rural county classifications.
to limit travel also have higher case densities than would be predicted purely by social
connectedness to Westchester (Glanz et al., 2020).
The two bottom panels of Figure 1 explore the relationship between COVID-19 preva-
lence and social ties to Westchester more formally. Panel (c) shows a binscatter plot of
social connectedness to Westchester County and the number of COVID-19 cases per 10,000
residents. We exclude those counties within 50 miles of Westchester County: while those
areas have strong social links to Westchester, they are also close enough geographically such
that their populations might interact physically with Westchester residents even in the ab-
sence of social links (e.g., in supermarkets and houses of worship). There is a strong positive
relationship between COVID-19 prevalence and social ties to Westchester. Quantitatively,
a doubling of a county’s social connectedness to Westchester is associated with an increase
of about 0.88 COVID-19 cases per 10,000 residents. The R-Squared of this relationship is
0.093, suggesting that, in a statistical sense, 9.3% of the cross-county variation in COVID-19
cases can be explained by counties’ social connectedness to Westchester.
One concern with interpreting these initial correlations is that they might be primarily
picking up other factors that affect the spread of COVID-19, and that are correlated with
social connectedness. Specifically, even after dropping counties within 50 miles of Westch-
ester, the correlations might be primarily picking up geographic distance to Westchester
(which is related to the number of friendship links to Westchester). As a result, including
social connectedness might not improve predictive power for models that already control
for some of these other variables. In Panel (d), we therefore present a binscatter plot of
the relationship between social connectedness to Westchester County and COVID-19 cases
that controls for a number of these possible confounding variables (in addition to excluding
nearby counties). Most importantly, we non-parametrically control for the geographic dis-
tance between each county and Westchester County by including 100 dummies for percentiles
of that distance. We also control for income, population density, and a classification of how
urban/rural a county is. Even conditional on these other factors, Panel (d) shows a strong
positive relationship between COVID-19 cases as of March 30, 2020 and social connectedness
to Westchester County. With these controls, a doubling of a county’s social connectedness
to Westchester is associated with an increase of about 0.80 COVID-19 cases per 10,000 res-
idents. The total R-Squared of the statistical relationship is 0.190, while the incremental
R-Squared from controlling for social connectedness to Westchester is 0.037.
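The residualized binscatter behind Panel (d) can be sketched as follows. This is an illustrative sketch on synthetic data, not the authors' code: it uses a single linear control in place of the 100 distance-percentile dummies and the other controls, fits a linear rather than quadratic trend, and all function and variable names are hypothetical.

```python
import numpy as np

def residualize(y, controls):
    """Residuals of y after an OLS regression on the controls plus a constant."""
    X = np.column_stack([np.ones(len(y)), controls])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ coef

def binscatter(x, y, n_bins=100):
    """Average x and y within equal-sized bins formed by sorting on x."""
    order = np.argsort(x)
    bins = np.array_split(order, n_bins)
    x_means = np.array([x[b].mean() for b in bins])
    y_means = np.array([y[b].mean() for b in bins])
    return x_means, y_means

# Synthetic data: connectedness falls with distance; cases depend on both.
rng = np.random.default_rng(0)
n = 2000
log_distance = rng.normal(size=n)
log_sci = -0.8 * log_distance + rng.normal(size=n)
cases = 0.5 * log_sci - 0.3 * log_distance + rng.normal(size=n)

controls = log_distance.reshape(-1, 1)
x_res = residualize(log_sci, controls)   # log(SCI) net of the control
y_res = residualize(cases, controls)     # cases net of the control
bx, by = binscatter(x_res, y_res, n_bins=100)

# Slope through the binned points; close to the true partial effect of 0.5.
slope = np.polyfit(bx, by, 1)[0]
```

Residualizing both axes on the same controls before binning means the plotted slope reflects the partial relationship, which is why the fitted line can be compared with a regression coefficient that conditions on those controls.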
It is important to highlight that the purpose of this exercise is to demonstrate the pre-
dictive power of social connectedness measured via online social networks for COVID-19
prevalence. We chose the current set of control variables to highlight that the Social Con-
nectedness Index has such predictive power over and above a number of variables on which
Figure 2: Social Network Distributions of Lodi and COVID-19 Cases in Italy
Panels: (a) Percentile of SCI to Lodi Province, Italy; (b) COVID-19 Cases per 10k Residents by Province;
(c) Lodi binscatter without controls; (d) Lodi binscatter with controls. Panels (c) and (d)
plot cases per 10k people against log(Social Connectedness).
Note: Panel (a) shows a measure of Social Connectedness to Lodi for Italian provinces. Panel (b) shows
the number of confirmed COVID-19 cases by Italian province on March 30th, 2020. Panels (c) and (d) show
binscatter plots with provinces more than 50 kilometers from Lodi as the unit of observation. To generate the
plot in Panel (c) we group log(SCI) into 30 equal-sized bins and plot the average against the corresponding
average case density. Panel (d) is constructed in a similar manner. However, we first regress log(SCI) and
cases per 10,000 residents on a set of control variables and plot the residualized values on each axis. Red lines
show quadratic fit regressions. The controls for Panel (d) are 20 dummies for the quantile of the province
distance to Lodi; GDP per inhabitant; and population density.
data is already easily available, and that may partially proxy for social connections in models
of communicable disease spread. The observed increase in predictive power thus suggests
that the Social Connectedness Index might serve as a valuable measure above some existing
proxies for social interactions.
Figure 2 explores the analogous relationships for Lodi province in Italy.
The provinces
with highest COVID-19 case densities and connectedness to Lodi are in the surrounding
Lombardy region, as well as the nearby Piemonte and Veneto regions. There are also relatively
high levels of both connectedness to Lodi and COVID-19 cases in Rimini, a popular
tourist destination along the Adriatic Sea. A number of provinces in southern Italy send
workers and students to the industrial Lombardy region, and therefore have strong social
ties to that region. While some of these areas have seen a number of COVID-19 cases, their
case counts are not disproportionately large, perhaps reflecting the efforts of Italian authorities
to restrict the movement of individuals (Kington, 2020). Panels (c) and (d) repeat the binscatter
exercise from Figure 1. We exclude provinces within 50 kilometers of Lodi. In Panel (d) we control
for geographic distance using 20 dummies for the quantile of the distance from each province
to Lodi, as well as GDP per inhabitant and population density. Again we find that the So-
cial Connectedness Index appears to have predictive power above these other measures that
might commonly be used to proxy for social interactions. Quantitatively, a doubling of SCI
corresponds to an increase of 16.6 COVID-19 cases per 10,000 residents after controlling for
these relevant factors. The incremental R-Squared of including social connectedness to Lodi
over the other control variables is 0.057.
These case studies illustrate the potential usefulness of our measure of social connectedness
for predicting disease spread. In the next section, we use a time series of case growth
from March through July to explore this relationship in more detail.
3 Time Series Analysis
In this section, we exploit the changing geography of the pandemic in the U.S. to more
systematically investigate the predictive value of the Social Connectedness Index in fore-
casting the spread of COVID-19. Specifically, we construct two metrics: “Social Proximity to
Cases”, a county-level measure of exposure to COVID-19 cases through social networks, and
This is not to suggest that the Social Connectedness Index is the only such measure, and we believe
that further advances can be made using other data sources, such as cell-phone location pings. But the social
connectedness index has a number of advantages, including the fact that it is easily accessible to researchers
and consistently available for a large number of global geographies.
Because Italian provinces on the island of Sardinia do not align with European NUTS3 regions (the
level at which we measure social connectedness), we include Sardinia as a single observation in our analysis.
“Physical Proximity to Cases”, a county-level measure of exposure through physical prox-
imity. While the two measures will be related (because individuals generally have stronger
social ties to those who are geographically nearby, as documented in Bailey et al., 2018b),
the examples in the previous section illustrate that some geographically distant places,
such as Westchester and the east coast of Florida, can have strong social ties. These
relationships, and many others which would not be predicted by physical distance, are the
unique predictive value added by the social connectedness data.
Key Variable Construction. We construct our measure of social proximity to cases as:
$$\text{Social Proximity to Cases}_{i,t} = \sum_j \text{Cases Per 10k}_{j,t} \times \frac{\text{Social Connectedness}_{i,j}}{\sum_h \text{Social Connectedness}_{i,h}}. \tag{2}$$
Here, $\text{Cases Per 10k}_{j,t}$ is the number of confirmed COVID-19 cases per 10,000 residents in county
j as of time t. The sums over j and h are over all counties. Analogously, we construct a measure
of a county’s physical proximity to cases as:
$$\text{Physical Proximity to Cases}_{i,t} = \sum_j \text{Cases Per 10k}_{j,t} \times \frac{1}{1 + \text{Distance}_{i,j}}. \tag{3}$$
Here, $\text{Distance}_{i,j}$ is the physical distance between counties i and j measured in miles.
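The two proximity measures can be sketched in a few lines. The snippet below is a minimal illustration with hypothetical toy inputs and function names (three counties, invented case rates, connectedness values and distances); it follows the definitions above, in which each county's exposure is a weighted sum of case rates over all counties.

```python
import numpy as np

def social_proximity_to_cases(cases_per_10k, sci):
    """Weight each county's case rate by its share of county i's total connectedness."""
    sci = np.asarray(sci, dtype=float)
    weights = sci / sci.sum(axis=1, keepdims=True)   # each row sums to 1
    return weights @ np.asarray(cases_per_10k, dtype=float)

def physical_proximity_to_cases(cases_per_10k, distance_miles):
    """Weight each county's case rate by 1 / (1 + distance in miles)."""
    weights = 1.0 / (1.0 + np.asarray(distance_miles, dtype=float))
    return weights @ np.asarray(cases_per_10k, dtype=float)

# Hypothetical toy inputs for three counties (illustrative only).
cases = np.array([10.0, 0.0, 2.0])          # cases per 10,000 residents
sci = np.array([[4.0, 1.0, 1.0],
                [1.0, 4.0, 3.0],
                [1.0, 3.0, 4.0]])           # pairwise connectedness
dist = np.array([[0.0, 9.0, 99.0],
                 [9.0, 0.0, 99.0],
                 [99.0, 99.0, 0.0]])        # pairwise distance in miles

social = social_proximity_to_cases(cases, sci)        # [7.0, 2.0, 2.25]
physical = physical_proximity_to_cases(cases, dist)   # [10.02, 1.02, 2.1]
```

Note that the social weights are normalized to sum to one for each county, while the physical weights are not, so the two measures are on different scales; the regressions use logs of their changes.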
Empirical Specification. We first study the relationship between actual case growth in
different time periods and “lagged” (i.e. in past time periods) growth in our measures. We
hypothesize that if social connectedness is an important predictor of the path of COVID-19
spread, a lagged measure of social proximity to new cases will have a positive relationship
with new case counts in the next period. For each county i and time period t, we then
estimate the equation:
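$$\begin{aligned}
\log(\Delta \text{Cases per 10k}_{i,t} + 1) ={}& \beta_1 \log(\Delta \text{Cases per 10k}_{i,t-1} + 1) + \beta_2 \log(\Delta \text{Cases per 10k}_{i,t-2} + 1) \\
&+ \beta_3 \log(\Delta \text{Social Proximity to Cases}_{i,t-1}) + \beta_4 \log(\Delta \text{Social Proximity to Cases}_{i,t-2}) \\
&+ \beta_5 \log(\Delta \text{Physical Proximity to Cases}_{i,t-1}) + \beta_6 \log(\Delta \text{Physical Proximity to Cases}_{i,t-2}) \\
&+ X_{i,t}
\end{aligned} \tag{4}$$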
Here, t is defined as one of the eight two-week time periods between March 30 and July
20, 2020. For each time period t, prior two-week periods are denoted $t-2$ and $t-1$ (for
example, March 3-16 and March 16-30 for the first period starting March 30). We always
include two lags of own case growth, and explore the effects of lagged changes of social and
physical proximity to cases. $X_{i,t}$
is a set of time-specific fixed effects, including percentiles
of population density and median household income. In our strictest specification we also
add time by state fixed effects.
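To make the lag structure concrete, the following sketch stacks county-by-period observations with two lags of each regressor and estimates the coefficients by ordinary least squares. It is an illustration on synthetic data under simplifying assumptions: plain period dummies stand in for the time-specific fixed effects, the inputs are taken to be already-transformed log growth values, and all names and coefficient values are hypothetical.

```python
import numpy as np

def build_design(g_cases, g_social, g_physical):
    """Stack county-period observations with two lags of each regressor.

    Each input has shape (n_counties, n_periods) and holds log growth values.
    Period dummies stand in for the time-specific fixed effects.
    """
    n, T = g_cases.shape
    ys, Xs = [], []
    for t in range(2, T):
        dummies = np.zeros((n, T - 2))
        dummies[:, t - 2] = 1.0
        Xs.append(np.column_stack([
            g_cases[:, t - 1], g_cases[:, t - 2],        # own case growth lags
            g_social[:, t - 1], g_social[:, t - 2],      # social proximity lags
            g_physical[:, t - 1], g_physical[:, t - 2],  # physical proximity lags
            dummies,
        ]))
        ys.append(g_cases[:, t])
    return np.concatenate(ys), np.vstack(Xs)

# Synthetic panel generated from known (hypothetical) coefficients.
rng = np.random.default_rng(0)
n, T = 500, 6
true_beta = np.array([0.4, 0.1, 0.3, 0.0, 0.5, -0.2])
g_social = rng.normal(size=(n, T))
g_physical = rng.normal(size=(n, T))
g_cases = np.zeros((n, T))
g_cases[:, :2] = rng.normal(size=(n, 2))
for t in range(2, T):
    lags = np.column_stack([
        g_cases[:, t - 1], g_cases[:, t - 2],
        g_social[:, t - 1], g_social[:, t - 2],
        g_physical[:, t - 1], g_physical[:, t - 2],
    ])
    g_cases[:, t] = lags @ true_beta + 0.1 * rng.normal(size=n)

y, X = build_design(g_cases, g_social, g_physical)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
# The first six entries of beta_hat correspond to the six lag coefficients.
```

The first two periods are used only as lags, so a panel with T periods contributes T - 2 estimation periods per county.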
Empirical Results. Table 1 shows that growth in social proximity to cases in one period
has a strong positive relationship with actual case growth in the next. In columns 1 and 4,
we see this relationship exists without controlling for physical distance. In columns 2 and
5, however, we show that past physical proximity to cases is also strongly correlated with
present case growth, a relationship which may confound the previous one. To address this,
in columns 3 and 6 we include both measures. While the coefficient on social proximity to
cases falls somewhat (suggesting some of the relationship is due to physical proximity), the
relationship remains highly significant in both the statistical and real-world sense. In
our strictest specification, which includes state fixed effects interacted with week, a doubling
of social proximity to cases in one period corresponds to a 22.5% increase in actual cases per
10,000 residents in the next period. That this result persists in the presence of state fixed
effects allows us to rule out concerns that it may be due to differences in state-level public
health measures.
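As a hedged reading of this magnitude, assume the relevant estimate is the two-week-lag coefficient on social proximity in column 6 of Table 1 (0.325). In a log-log specification, doubling a regressor shifts the log outcome by the coefficient times ln 2, which reproduces the reported figure; the exact proportional change implied by the same coefficient is slightly larger. Both are approximations here, since the dependent variable is the log of one plus the change in cases.

```latex
\begin{align*}
\Delta \log(\text{outcome}) &= \beta \cdot \ln 2 = 0.325 \times 0.693 \approx 0.225
  && \text{(log-point approximation, i.e. about 22.5\%)} \\
2^{\beta} - 1 &= 2^{0.325} - 1 \approx 0.253
  && \text{(exact proportional change, about 25.3\%)}
\end{align*}
```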
We next conduct a similar analysis, but report coefficients separately for each time period,
allowing us to study how the relationship between social connections and new COVID-19
cases changes over the course of the pandemic. Table 2 shows that in every two-week period
from March 30 to July 20, a one time period lagged measure of social proximity to cases was
a statistically significant predictor of actual case growth. The magnitudes of the coefficients
suggest that a doubling in social proximity to cases in one two-week period corresponds to
between a 9.8% and 50.7% increase in actual cases in the next time period, after controlling
for physical proximity to cases and all of our other previous controls.
In columns 1 and 2, which describe disease spread in March and the first days of April,
the relationship is particularly strong. A possible explanation is that COVID-19 spread
through areas with strong social ties before widespread social distancing. For example, trips
between Westchester and coastal Florida may have been common before public recognition
of the outbreak, but relatively infrequent later. Indeed, as social distancing peaked through
April and early May, the importance of social proximity to cases falls while the importance
For quantitative measures of these social distancing patterns see the Facebook Data for Good
Table 1: COVID-19 Case Growth and Prior Proximity to Cases

                                    (1)        (2)        (3)        (4)        (5)        (6)
log(Change in Social Proximity to Cases + 1)
  2 Week Lag                        0.592***              0.434***   0.437***              0.325***
                                    (0.071)               (0.106)    (0.043)               (0.054)
  4 Week Lag                        -0.067                0.067      -0.077***             0.020
                                    (0.050)               (0.084)    (0.020)               (0.029)
log(Change in Physical Proximity to Cases + 1)
  2 Week Lag                                   1.266**    1.054**               1.622***   1.266***
                                               (0.408)    (0.372)               (0.163)    (0.212)
  4 Week Lag                                   -1.170**   -1.028**              -1.287***  -1.092***
                                               (0.408)    (0.374)               (0.264)    (0.305)
log(Change in Cases per 10k Residents + 1)
  2 Week Lag                        0.319***   0.635***   0.376***   0.330***   0.549***   0.376***
                                    (0.043)    (0.022)    (0.052)    (0.032)    (0.025)    (0.038)
  4 Week Lag                        0.052      0.069***   0.008      0.079***   0.062***   0.040*
                                    (0.032)    (0.016)    (0.040)    (0.012)    (0.010)    (0.017)
Time X Pop Density FEs              Y          Y          Y          Y          Y          Y
Time X Median Household Income FEs  Y          Y          Y          Y          Y          Y
Time X State FEs                                                     Y          Y          Y
Sample Mean                         1.593      1.593      1.593      1.593      1.593      1.593
R-Squared                           0.641      0.638      0.650      0.684      0.682      0.686
N                                   25,056     25,056     25,056     25,048     25,048     25,048
Note: Table shows results from regression 4. Each observation is a county, two-week period (between March
30 and July 20, 2020). The dependent variable in all columns is log of one plus the number of new COVID-19
cases per 10,000 residents. Columns 1 and 4 include log of growth in social proximity to cases lagged by
two and four weeks (one and two time periods). Columns 2 and 5 include analogous measures of physical
proximity to cases. Columns 3 and 6 include both measures. All columns include controls for two-week and
four-week lagged changes in cases, as well as time-specific fixed effects for percentiles of county population
density and median household income. Columns 4-6 include additional time-specific fixed effects by state.
Standard errors are clustered by time period. Significance levels: *(p<0.10), **(p<0.05), ***(p<0.01).
of local cases and physical proximity to cases rises. In the final four periods (columns 5-
8), the coefficients on social proximity again generally increase, corresponding to the time
in which mobility began slowly returning toward baseline levels. Together, these results
are consistent with a story in which social proximity matters most when there are fewer
restrictions on individuals’ mobility. This provides more evidence that social connectedness
is predictive of interactions that spread communicable disease.
Building on these results, we next conduct a simple prediction exercise. During a pan-
demic, local policymakers might want to determine their area’s risk for outbreak to inform
public health measures. With this use case in mind, we build a series of simple models that
use available data at time t to predict case growth in all counties at time t + 1. We test
Mobility Dashboard, available at https://www.covid19mobility.org/dashboards/facebook-data-for-good/,
and the SafeGraph Shelter in Place Index, available at https://www.safegraph.com/dashboard/
Table 2: COVID-19 Case Growth and Prior Proximity to Cases, by Two-Week Period

Dependent variable: log(Change in Cases per 10k Residents + 1)

                             (1)        (2)        (3)        (4)        (5)        (6)        (7)        (8)
                             March 31 - April 14 - April 28 - May 12 -   May 26 -   June 9 -   June 23 -  July 7 -
                             April 13   April 27   May 11     May 25     June 8     June 22    July 6     July 20
log(Change in Social Proximity to Cases + 1)
  2 Week Lag                 0.731***   0.379***   0.141**    0.189***   0.577***   0.182**    0.320***   0.259***
                             (0.093)    (0.087)    (0.059)    (0.061)    (0.062)    (0.073)    (0.057)    (0.070)
  4 Week Lag                 0.384      -0.224*    0.137*     0.023      -0.111*    0.208***   0.046      0.101
                             (0.449)    (0.129)    (0.082)    (0.060)    (0.061)    (0.074)    (0.057)    (0.063)
log(Change in Physical Proximity to Cases + 1)
  2 Week Lag                 1.259***   0.699*     2.105***   1.232***   -0.074     2.270***   1.361***   2.025***
                             (0.182)    (0.395)    (0.283)    (0.261)    (0.314)    (0.434)    (0.350)    (0.427)
  4 Week Lag                 -2.425***  -0.273     -1.593***  -0.892***  0.412      -2.742***  -1.556***  -1.871***
                             (0.745)    (0.463)    (0.291)    (0.282)    (0.288)    (0.443)    (0.329)    (0.403)
log(Change in Cases per 10k Residents + 1)
  2 Week Lag                 0.174***   0.403***   0.556***   0.466***   0.278***   0.365***   0.306***   0.320***
                             (0.059)    (0.050)    (0.036)    (0.036)    (0.035)    (0.041)    (0.033)    (0.037)
  4 Week Lag                 -0.136     0.136*     -0.019     0.068*     0.126***   -0.017     0.005      0.021
                             (0.256)    (0.076)    (0.047)    (0.037)    (0.035)    (0.039)    (0.033)    (0.034)
Pop Density FEs              Y          Y          Y          Y          Y          Y          Y          Y
Median Household Income FEs  Y          Y          Y          Y          Y          Y          Y          Y
State FEs                    Y          Y          Y          Y          Y          Y          Y          Y
Sample Mean                  1.234      1.253      1.331      1.369      1.422      1.579      2.031      2.524
R-Squared                    0.600      0.571      0.642      0.647      0.667      0.621      0.678      0.706
N                            3,131      3,131      3,131      3,131      3,131      3,131      3,131      3,131
Note: Table shows time-specific results from regression 4. Each observation is a county. The dependent
variable is log of one plus the number of new COVID-19 cases per 10,000 residents in one two-week period
between March 30 and July 20, 2020. All columns include log of growth in social and physical proximity to
cases, as well as actual cases, lagged by two and four weeks (one and two time periods). All columns include
time-specific fixed effects for state, and percentiles of population density and median household income.
Significance levels: *(p<0.10), **(p<0.05), ***(p<0.01).
the added predictive value of social proximity to cases by building separate models that
include and exclude this measure. Because we do not use the “test” data in the training ex-
ercise, a reduction in prediction error would be reflective of a true improvement in real-world
predictions (as opposed to an increase in R
in our previous analyses).
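As an illustration of the outcome used in this section, log of one plus the number of new cases per 10,000 residents, the following is a minimal sketch rather than the code behind the reported results. It assumes a natural logarithm; the function name log_case_growth and its arguments are hypothetical.

```python
import math


def log_case_growth(new_cases, population):
    """Return log(1 + new cases per 10,000 residents) for one county-period.

    Assumption: natural log. Adding one keeps the measure defined
    (and equal to zero) for counties with no new cases in the period.
    """
    if population <= 0:
        raise ValueError("population must be positive")
    if new_cases < 0:
        raise ValueError("new_cases must be non-negative")
    cases_per_10k = 10_000 * new_cases / population
    return math.log1p(cases_per_10k)


# Hypothetical county: 50 new cases among 100,000 residents,
# i.e. 5 new cases per 10,000 residents, so the outcome is log(6).
example = log_case_growth(50, 100_000)
```

Under this definition a county with no new cases has an outcome of exactly zero, which is why the transformation adds one before taking logs.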
Table 3: Predicting COVID-19 spread in U.S., with and without Social Proximity to Cases
                          RMSE: Linear Regression             RMSE: Random Forest
                          Without     With        Diff. from  Without     With        Diff. from
                          Social      Social      Social      Social      Social      Social
                          Proximity   Proximity   Proximity   Proximity   Proximity   Proximity
                          to Cases    to Cases    to Cases    to Cases    to Cases    to Cases
(1) April 14 - April 27   2.523       2.598       0.075       1.597       1.497       -0.099
(2) April 28 - May 11     1.082       1.168       0.086       0.922       0.845       -0.077
(3) May 12 - May 25       0.742       0.729       -0.014      0.754       0.726       -0.028
(4) May 26 - June 8       0.742       0.716       -0.026      0.701       0.678       -0.024
(5) June 9 - June 22      0.826       0.798       -0.027      0.795       0.770       -0.025
(6) June 23 - July 6      0.886       0.865       -0.022      0.862       0.840       -0.022
(7) July 7 - July 20      0.813       0.792       -0.020      0.802       0.786       -0.016
Note: Table shows results from county-level predictions of COVID-19 case growth. The predicted outcome
is log of one plus the number of new COVID-19 cases per 10,000 residents. Columns 1-3 show root mean
squared errors (RMSEs) from a linear regression model trained on data from all weeks prior to the week of
interest. Columns 4-6 show analogous results from a random forest model. The model inputs in columns 1
and 4 are population density; median household income; and log of growth in physical proximity to cases
and actual cases, lagged by two and four weeks (one and two time periods). Columns 2 and 5 add social
proximity to cases. Columns 3 and 6 show the change in RMSE from adding social proximity to cases.
Table 3 shows the results of this prediction exercise. Columns 1-3 describe the prediction
error from a simple linear regression model that includes the measures in Table 2 with non-
binned measures of population density and median household income. Column 1 excludes
the two lagged measures of social proximity to cases and column 2 includes them. Each
row shows the root mean squared error (RMSE) from a model trained using data from all
periods before the period of interest, then tested on that next period. In the first two periods,
which include the most limited training data, the RMSE is relatively large for both
models. Then, as the training sample gets larger, the RMSE is consistently between 0.71
and 0.89 log new cases per 10,000 residents. Column 3 shows that in each of these last five
rows, the RMSE is lower when social proximity to cases is included, suggesting that it does
significantly improve our predictions.
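The evaluation scheme described above (train on all periods before the period of interest, test on that period, report the RMSE) can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used to produce Table 3: the helper names rmse, expanding_window_splits and evaluate are hypothetical, and the mean-only model at the end is a placeholder for the linear regression or random forest.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def expanding_window_splits(n_periods, min_train=1):
    """Yield (training period indices, test period index) pairs.

    Each test period is predicted from all earlier periods only,
    so no test data enters the training step.
    """
    for test_period in range(min_train, n_periods):
        yield list(range(test_period)), test_period


def evaluate(periods, fit, predict):
    """Return one out-of-sample RMSE per test period.

    periods: list of (features, outcomes) pairs in time order.
    fit, predict: hypothetical callables standing in for any model.
    """
    errors = []
    for train_ids, test_id in expanding_window_splits(len(periods)):
        train_x = [x for i in train_ids for x in periods[i][0]]
        train_y = [y for i in train_ids for y in periods[i][1]]
        model = fit(train_x, train_y)
        test_x, test_y = periods[test_id]
        errors.append(rmse(test_y, [predict(model, x) for x in test_x]))
    return errors


# Placeholder model (an assumption for illustration): predict the training mean.
def fit_mean(features, outcomes):
    return sum(outcomes) / len(outcomes)


def predict_mean(model, row):
    return model
```

With eight two-week periods and at least one training period, this scheme yields seven test periods, matching the seven rows of Table 3. Running evaluate once with and once without the social proximity inputs, then differencing the two RMSE lists, would give the analogue of columns 3 and 6.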
The results in columns 4-6 are generated using a random forest, an ensemble prediction
algorithm commonly used in data science applications. The algorithm allows us to find non-
linear relationships, without overfitting, by aggregating mean predictions from a number of
regression trees generated over sample subsets of both observations and input variables.
The out-of-sample predictions of the random forest model generally outperform those of
the linear model. In addition, including measures of social proximity to cases leads to
improved forecasts of COVID-19 cases throughout the course of the epidemic.
In our analysis we use 500 trees. For more information on random forests, see Breiman (2001).
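To make the aggregation idea behind a random forest concrete, the following is a deliberately simplified sketch and not the implementation used for Table 3. It assumes depth-one regression trees ("stumps") in place of full trees, so that the two sources of randomness, bootstrap samples of observations and random subsets of input variables, are easy to see. All function names are hypothetical, and only the default of 500 trees is taken from the text.

```python
import random
from statistics import mean


def fit_stump(rows, targets, feature_ids):
    """Fit a depth-one regression tree using only the allowed features."""
    best = None  # (sum of squared errors, feature, threshold, left mean, right mean)
    for f in feature_ids:
        values = sorted({row[f] for row in rows})
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [t for row, t in zip(rows, targets) if row[f] <= threshold]
            right = [t for row, t in zip(rows, targets) if row[f] > threshold]
            left_mean, right_mean = mean(left), mean(right)
            sse = (sum((t - left_mean) ** 2 for t in left)
                   + sum((t - right_mean) ** 2 for t in right))
            if best is None or sse < best[0]:
                best = (sse, f, threshold, left_mean, right_mean)
    if best is None:
        # No split is possible (all sampled feature values are identical):
        # fall back to predicting the sample mean for every row.
        overall = mean(targets)
        return (0, float("inf"), overall, overall)
    return best[1:]


def predict_stump(stump, row):
    feature, threshold, left_mean, right_mean = stump
    return left_mean if row[feature] <= threshold else right_mean


def fit_forest(rows, targets, n_trees=500, n_features=1, seed=0):
    """Fit n_trees stumps, each on a bootstrap sample and a feature subset."""
    rng = random.Random(seed)
    n_rows = len(rows)
    all_features = list(range(len(rows[0])))
    forest = []
    for _ in range(n_trees):
        sample = [rng.randrange(n_rows) for _ in range(n_rows)]  # bootstrap rows
        features = rng.sample(all_features, n_features)  # random input subset
        forest.append(fit_stump([rows[i] for i in sample],
                                [targets[i] for i in sample], features))
    return forest


def predict_forest(forest, row):
    """Average the predictions of all trees in the forest."""
    return mean(predict_stump(stump, row) for stump in forest)
```

Averaging over many trees fit to different resamples is what dampens the variance of any single tree. A full implementation would grow deeper trees and resample input variables at every split.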
The methodology used for these predictions is relatively simple and should not be
interpreted as a full epidemiological model. However, the results in Table 3 strongly suggest
that our measure of social connectedness may prove useful in future epidemiological work.
4 Conclusion
In the context of threats from communicable disease, a region’s ability to determine optimal
public health responses depends on its ability to forecast the risk of an outbreak (Reich
et al., 2019). A primary determinant of this risk is the likelihood of physical interactions
between the region’s residents and residents of other areas with severe outbreaks. Information
on the geography of social connections, which shape patterns of physical interactions, is
therefore crucially important for public health officials. In this paper, we use de-identified
and aggregated data from Facebook to measure social connections between regions, and
find them to be an important predictor of future outbreaks during the COVID-19 pandemic. We
show that areas more connected to early pandemic hotspots in the U.S. and Italy had, on
average, higher case counts by the end of March 2020, even after controlling for physical distance
and other demographics. Furthermore, the inclusion of social proximity to cases improves
predictions of COVID-19 spread during the first four months of the U.S. pandemic, over and
above models that include only physical proximity to cases and other controls. Our results
should not be interpreted as an attempt to create a state-of-the-art epidemiological model;
instead, our hope is that our new measure provides a tool for epidemiologists and public
health officials hoping to forecast the spread of communicable diseases such as COVID-19.
