Can data from mobile phones be used to observe economic shocks and their consequences at multiple scales? Here we present novel methods to detect mass layoffs, identify individuals affected by them and predict changes in aggregate unemployment rates using call detail records (CDRs) from mobile phones. Using the closure of a large manufacturing plant as a case study, we first describe a structural break model to correctly detect the date of a mass layoff and estimate its size. We then use a Bayesian classification model to identify affected individuals by observing changes in calling behaviour following the plant's closure. For these affected individuals, we observe significant declines in social behaviour and mobility following job loss. Using the features identified at the micro level, we show that the same changes in these calling behaviours, aggregated at the regional level, can improve forecasts of macro unemployment rates. These methods and results highlight the promise of new data resources to measure microeconomic behaviour and improve estimates of critical economic indicators.
Economic statistics are critical for decision-making by both government and private institutions. Despite their great importance, current measurements draw on limited sources of information, losing precision with potentially dire consequences. The beginning of the Great Recession offers a powerful case study: the initial BEA estimate of the contraction of GDP in the fourth quarter of 2008 was an annual rate of 3.8%. The American Recovery and Reinvestment Act (stimulus) was passed based on this understanding in February 2009. Less than two weeks after the plan was passed, that 3.8% figure was revised to 6.2%, and subsequent revisions peg the number at a jaw-dropping 8.9%, more severe than the worst quarter during the Great Depression. The government statistics were wrong and may have hampered an effective intervention. As participation rates in unemployment surveys drop, serious questions have been raised as to the declining accuracy and increased bias in unemployment numbers.
In this paper, we offer a methodology to infer changes in the macroeconomy in near real time, at arbitrarily fine spatial granularity, using data already passively collected from mobile phones. We demonstrate the reliability of these techniques by studying data from two European countries. In the first, we show it is possible to observe mass layoffs and identify the users affected by them in mobile phone records. We then track the mobility and social interactions of these affected workers and observe that job loss has a systematic dampening effect on their social and mobility behaviour. Having observed an effect in the micro data, we apply our findings to the macroscale by creating corresponding features to predict unemployment rates at the province scale. In the second country, where the macro-level data are available, we show that changes in mobility and social behaviour predict unemployment rates ahead of official reports and more accurately than traditional forecasts. These results demonstrate the promise of using new data to bridge the gap between micro- and macroeconomic behaviours and track important economic indicators. Figure 1 shows a schematic of our methodology.
2. Measuring the economy
Contemporary macroeconomic statistics are based on a paradigm of data collection and analysis begun in the 1930s [2,3]. Most economic statistics are constructed from either survey data or administrative records. For example, the US unemployment rate is calculated based on the monthly Current Population Survey of roughly 60 000 households, and the Bureau of Labor Statistics manually collects 80 000 prices a month to calculate inflation. Both administrative databases and surveys can be slow to collect, costly to administer and fail to capture significant segments of the economy. These surveys can quickly face sample size limitations at fine geographies and require strong assumptions about the consistency of responses over time. Statistics inferred from survey methods have considerable uncertainty and are routinely revised in the months following their release as other data are slowly collected [1,4–6]. Moreover, changes in survey methodology can result in adjustments of reported rates of up to 1–2 percentage points.
The current survey-based paradigm also makes it challenging to study the effect of economic shocks on networks or behaviour without reliable self-reports. This has hampered scientific research. For example, many studies have documented the severe negative consequences of job loss in the form of difficulties in retirement, persistently lower wages following re-employment, even negative effects on children's outcomes [9,10], increased risk of death and illness [11,12], higher likelihood of divorce and, unsurprisingly, negative impacts on happiness and emotional well-being. Owing to the cost of obtaining the necessary data, however, social scientists have been unable to directly observe the large negative impact of a layoff on the frequency and stability of an individual's social interactions or mobility.
3. Predicting the present
These shortcomings raise the question as to whether existing methods could be supplemented by large-scale behavioural trace data. There have been substantial efforts to discern important population events from such data, captured by the pithy phrase ‘predicting the present’ [15–18]. Prior work has linked news stories with stock prices [19–21] and used web search or social media data to forecast labour markets [22–26], consumer behaviour [27,28], automobile demand and vacation destinations [15,29]. Social media, search and surfing behaviour have been shown to signal emerging public health problems [30–37], although cautionary tales exist. Recent efforts have even been made towards leveraging Twitter to detect and track earthquakes in real time, faster than seismographic sensors [39–41]. While there are nuances to the analytic approaches taken, the dominant approach has been to extract features from some large-scale observational data and to evaluate the predictive (correlation) value of those features with some set of measured aggregate outcomes (such as disease prevalence). Here we offer a twist on this methodology by identifying features from observational data and cross-validating them across individual and aggregate levels.
All of the applications of predicting the future are predicated, in part, on the presence of distinct signatures associated with the systemic event under examination. The key analytic challenge is to identify signals that (i) are observable or distinctive enough to rise above the background din, (ii) are unique or generate few false positives, (iii) contain information beyond well-understood patterns such as calendar-based fluctuations and (iv) are robust to manipulation. Mobile phone data, our focus here, are particularly promising for early detection of systemic events as they combine spatial and temporal comprehensiveness, naturally incorporate mobility and social network information and are too costly to intentionally manipulate.
Data from mobile phones have already proved extremely beneficial to understanding the everyday dynamics of social networks [42–48] and mobility patterns of millions [49–56]. With a fundamental understanding of regular behaviour, it becomes possible to explore deviations caused by collective events such as emergencies, natural disasters [58,59] and cultural occasions [60,61]. Less has been done to link these data to economic behaviour. In this paper, we offer a methodology to robustly measure employment shocks at extremely high spatial and temporal resolutions and to improve critical economic indicators.
We focus our analysis at three levels: the individual, the community and the provincial levels. We begin with unemployment at the community (town) level, where we examine the behavioural traces of a large-scale layoff event. At the community and individual levels, we analyse call record data from a service provider with an approximately 15% market share in an undisclosed European country. The community-level dataset spans a 15-month period between 2006 and 2007, with the exception of a six-week gap due to data extraction failures. At the province level, we examine call detail records from a service provider from another European country, with an approximately 20% market share and data running for 36 months from 2006 to 2009. Records in each dataset include an anonymous ID for caller and callee, the location of the tower through which the call was made, and the time the call occurred. In both cases, we examine the universe of call records made over the provider's network (see the electronic supplementary material for more details).
5. Observing unemployment at the community level
We study the closure of an auto-parts manufacturing plant (the plant) that occurred in December 2006. As a result of the plant closure, roughly 1100 workers lost their jobs in a small community (the town) of 15 000. Our approach builds on recent papers [52–54,57] that use call record data to measure social and mobility patterns.
There are three mobile phone towers within close proximity of the town and the plant. The first is directly within the town, the second is roughly 3 km from the first and is geographically closest to the manufacturing plant, while the third is roughly 6.5 km from the first two on a nearby hilltop. In total, these three towers serve an area of roughly 220 km² of which only 6 km² is densely populated. There are no other towns in the region covered by these towers. Because the exact tower through which a call is routed may depend on factors beyond simple geographical proximity (e.g. obstructions due to buildings), we consider any call made from these three towers as having originated from the town or plant.
We model the pre-closure daily population of the town as made up of a fraction of individuals γ who will no longer make calls near the plant following its closure and the complementary set of individuals who will remain (1 − γ). As a result of the layoff, the total number of calls made near the plant will drop by an amount corresponding to the daily calls of workers who are now absent. This amounts to a structural break model that we can use to estimate the prior probability that a user observed near the plant was laid off, the expected drop in calls that would identify them as an affected worker and the time of the closure (see the electronic supplementary material for a full description of this model). We suspect that some workers laid off from the plant are residents of the town, and thus they will still appear to make regular phone calls from the three towers and will not be counted as affected. Even with this limitation, we find a large change in behaviour.
To verify the date of the plant closing, we sum the number of daily calls from 1955 regular users (i.e. those who make at least one call from the town each month prior to the layoff) connecting through towers geographically proximate to the affected plant. The estimator selects a break date, t_layoff, and pre- and post-break daily volume predictions to minimize the squared deviation of the model from the data. The estimated values are overlaid on daily call volume and the actual closure date in figure 2a. As is evident in the figure, the timing of the plant closure (as reported in newspapers and court filings) can be recovered statistically using this procedure: the optimized predictions display a sharp and significant reduction at this date. As a separate check to ensure this method is correctly identifying the break date, we estimate the same model for calls from each individual user i and find that the distribution of these dates peaks around the actual layoff date (see the electronic supplementary material, figure S1).
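The break-date estimator described above can be illustrated with a minimal sketch. It assumes the simplest possible form, a piecewise-constant daily call volume with a single break chosen by least squares; the full model is given only in the electronic supplementary material, so the function name `fit_single_break` and the synthetic series below are hypothetical and not taken from the study.

```python
def fit_single_break(daily_calls):
    """Fit a piecewise-constant model with one break by least squares.

    Returns (break_index, pre_mean, post_mean, sse), where break_index is
    the first day that belongs to the post-break segment.
    """
    n = len(daily_calls)
    if n < 2:
        raise ValueError("need at least two observations")
    best = None
    for t in range(1, n):  # candidate break between day t-1 and day t
        pre, post = daily_calls[:t], daily_calls[t:]
        pre_mean = sum(pre) / len(pre)
        post_mean = sum(post) / len(post)
        sse = (sum((x - pre_mean) ** 2 for x in pre)
               + sum((x - post_mean) ** 2 for x in post))
        if best is None or sse < best[3]:
            best = (t, pre_mean, post_mean, sse)
    return best


# Synthetic daily call counts (illustrative only, not data from the study).
calls = [100, 104, 98, 101, 99, 102, 60, 63, 58, 61]
t_layoff, pre_level, post_level, _ = fit_single_break(calls)

# Relative drop in call volume across the estimated break.
drop_fraction = (pre_level - post_level) / pre_level
```

In this toy series the estimated drop in volume plays the role of the share of calls attributable to users who disappear after the closure; mapping that share to the prior γ requires the assumptions of the full model.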
6. Observing unemployment at the individual level
To identify users directly affected by the layoff, we calculate Bayesian probability weights based on changes in mobile phone activity. For each user, we calculate the conditional probability that the user is a non-resident worker laid off as part of the plant closure, based on their observed pattern of calls. To do this, we compute the difference between the fraction of days on which a user made a call near the plant in the 50 days prior to the week of the layoff and the corresponding fraction in the period following it. We denote this difference as Δq = q_pre − q_post and consider each user's observed difference a single realization of a random variable Δq. Under the hypothesis that there is no change in behaviour, Δq is distributed around a mean of zero. Under the alternative hypothesis that the individual's behaviour changes between the pre- and post-layoff periods, Δq is distributed around a mean of d, where d is the mean reduction in calls from the plant for non-resident plant workers laid off when the plant was closed. We assign user i the following probability of having been laid off given his or her calling pattern:

$$P(\text{laid off}_i \mid \Delta q_i) = \frac{\gamma \, p(\Delta q_i \mid \text{layoff})}{\gamma \, p(\Delta q_i \mid \text{layoff}) + (1 - \gamma) \, p(\Delta q_i \mid \text{no change})} \qquad (6.1)$$
Calculating the probabilities requires two parameters, γ, our prior that an individual is a non-resident worker at the affected plant and d, the threshold we use for the alternative hypothesis. The values of γ = 5.8% and d = 0.29 are determined based on values fit from the model in the previous section.
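A minimal sketch of such a probability weight is given below. It uses the reported values γ = 5.8% and d = 0.29, but the likelihoods are an assumption: both hypotheses are modelled as normal distributions with a shared variance taken from the binomial approximation for a difference of two proportions over 50-day windows. The function name `layoff_probability`, the window length for the post period and the variance floor `min_var` are hypothetical choices made for illustration.

```python
import math


def layoff_probability(q_pre, q_post, gamma=0.058, d=0.29,
                       n_days=50, min_var=1e-4):
    """Posterior probability that a user was laid off, given the change in
    the fraction of days with a call near the plant.

    Assumption: delta_q ~ Normal(0, var) under 'no change' and
    delta_q ~ Normal(d, var) under 'layoff', with a shared variance.
    """
    delta_q = q_pre - q_post
    var = q_pre * (1 - q_pre) / n_days + q_post * (1 - q_post) / n_days
    var = max(var, min_var)  # floor avoids a zero variance at q = 0 or 1
    # log of p(delta_q | no change) / p(delta_q | layoff); the normalising
    # constants cancel because the variance is shared.
    log_ratio = ((delta_q - d) ** 2 - delta_q ** 2) / (2 * var)
    log_ratio = max(min(log_ratio, 700.0), -700.0)  # keep exp() finite
    prior_odds_against = (1 - gamma) / gamma
    return 1.0 / (1.0 + prior_odds_against * math.exp(log_ratio))


# A user whose presence near the plant is unchanged, and one whose
# presence falls from 60% of days to 10% of days (illustrative values).
p_stable = layoff_probability(0.6, 0.6)
p_drop = layoff_probability(0.6, 0.1)
```

Working with the log-likelihood ratio rather than the two densities avoids division by zero when both densities underflow for extreme values of Δq.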
6.1. Validating the layoff
On an individual level, figure 2b shows days on which each user makes a call near the plant ranked from highest to lowest probability weight (only the top 300 users are shown, see the electronic supplementary material, figure S2 for more users). Users highly suspected of being laid off demonstrate a sharp decline in the number of days they make calls near the plant following the reported closure date. While we do not have ground-truth evidence that any of these mobile phone users was laid off, we find more support for our hypothesis by examining a two-week period roughly 125 days prior to the plant closure. Figure 2c shows a sharp drop in the fraction of calls coming from this plant for users identified as laid-off post-closure. This period corresponds to a confirmed coordinated holiday for plant workers and statistical analysis confirms a highly significant break for individuals classified as plant workers in the layoff for this period. Given that we did not use call data from this period in our estimation of the Bayesian model, this provides strong evidence that we are correctly identifying the portion of users who were laid off by this closure. In aggregate, we assign 143 users probability weights between 50 and 100%. This represents 13% of the pre-closure plant workforce and compares closely with the roughly 15% national market share of the service provider.
7. Assessing the effect of unemployment at the individual level
We now turn to analysing behavioural changes associated with job loss at the individual level. We first consider six quantities related to monthly social behaviour: (i) total calls, (ii) number of incoming calls, (iii) number of outgoing calls, (iv) calls made to individuals physically located in the town of the plant (as a proxy for contacts made at work), (v) number of unique contacts and (vi) the fraction of contacts called in the previous month that were not called in the current month, referred to as churn. In addition to measuring social behaviour, we also quantify changes in three metrics related to mobility: (vii) number of unique locations visited, (viii) radius of gyration and (ix) average distance from most visited tower (see the electronic supplementary material for detailed definitions of these variables). To guard against outliers such as long trips for vacation or difficulty identifying important locations due to noise, we only consider months for users where more than five calls were made and locations where a user recorded more than three calls.
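Two of these quantities can be sketched in a few lines. Churn follows the definition given above. For radius of gyration, the precise definition is in the electronic supplementary material, so the sketch assumes a commonly used form: the root mean squared distance of call locations from their centre of mass, with each call counted once. Tower positions are assumed to be planar coordinates in kilometres, and the function names are hypothetical.

```python
import math
from collections import Counter


def churn(prev_contacts, curr_contacts):
    """Fraction of last month's contacts that were not called this month."""
    prev, curr = set(prev_contacts), set(curr_contacts)
    if not prev:
        return 0.0
    return len(prev - curr) / len(prev)


def radius_of_gyration(call_locations, min_calls_month=5,
                       min_calls_location=3):
    """Root mean squared distance of call locations from their centroid.

    call_locations holds one (x_km, y_km) tower position per call. Months
    with at most min_calls_month calls, and locations with at most
    min_calls_location calls, are discarded as described in the text.
    """
    if len(call_locations) <= min_calls_month:
        return None
    counts = Counter(call_locations)
    kept = [p for p in call_locations if counts[p] > min_calls_location]
    if not kept:
        return None
    cx = sum(x for x, _ in kept) / len(kept)
    cy = sum(y for _, y in kept) / len(kept)
    mean_sq = sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in kept) / len(kept)
    return math.sqrt(mean_sq)


# Illustrative month: two regular towers and one isolated call far away.
month = [(0.0, 0.0)] * 4 + [(3.0, 4.0)] * 4 + [(100.0, 100.0)]
rg = radius_of_gyration(month)
month_churn = churn({"a", "b", "c", "d"}, {"a", "b", "e"})
```

The isolated call at the distant tower is dropped by the location filter, so it does not inflate the radius of gyration for the month.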
We measure changes in these quantities using all calls made by each user (not just those near the plant) relative to months prior to the plant closure, weighting measurements by the probability an individual is laid off and relative to two reference groups: individuals who make regular calls from the town but were not believed to be laid off (mathematically we weight this group using the inverse weights from our Bayesian classifier) and a random sample of 10 000 mobile phone users throughout the country (all users in this sample are weighted equally).
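The weighting scheme for the town reference group can be sketched as a pair of weighted means. This is an illustration under stated assumptions rather than the regression specification used for the reported estimates: the 'inverse weights' are taken here to be 1 − p, where p is the classifier's probability that a user was laid off, and the function names are hypothetical.

```python
def weighted_mean(values, weights):
    """Weighted average of values; weights must not sum to zero."""
    total = sum(weights)
    if total == 0:
        raise ValueError("weights sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total


def relative_effect(changes, p_laid_off):
    """Difference between the probability-weighted mean change of likely
    laid-off users and the (1 - p)-weighted mean change of other users."""
    treated = weighted_mean(changes, p_laid_off)
    control = weighted_mean(changes, [1.0 - p for p in p_laid_off])
    return treated - control


# Illustrative pre-to-post changes in a metric for four town users.
changes = [-0.5, -0.4, 0.0, 0.1]
probabilities = [1.0, 1.0, 0.0, 0.0]
effect = relative_effect(changes, probabilities)
```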
Figure 3a–i shows monthly point estimates of the average difference between relevant characteristics of users believed to be laid off compared to control groups. This figure shows an abrupt change in variables in the month directly following the plant closure. Despite this abrupt change, data at the individual level are sufficiently noisy that the monthly point estimates are not significantly different from 0 in every month. However, when data from months pre- and post-layoff are pooled, these differences are robust and statistically significant. The right panel of figure 3 and electronic supplementary material, table I show the results of OLS regressions comparing the pre- and post-closure periods for laid-off users relative to the two reference groups (see the electronic supplementary material for detailed model specification as well as confidence intervals for per cent changes pre- and post-layoff for each variable). The abrupt and sustained change in monthly behaviour of individuals with a high probability of being laid off is compelling evidence in support of using mobile phone data to detect mass layoffs.
We find that the total number of calls made by laid-off individuals drops 51% and 41% following the layoff when compared with non-laid-off residents and random users, respectively. Moreover, this drop is asymmetric. The number of outgoing calls decreases by 54% compared to a 41% drop in incoming calls (using non-laid-off residents as a baseline). Similarly, the number of unique contacts called in months following the closure is significantly lower for users likely to have been laid off. The fraction of calls made by a user to someone physically located in the town drops 4.7 percentage points for laid-off users compared with residents of the town who were not laid off. Finally, we find that the month-to-month churn of a laid-off person's social network increases roughly 3.6 percentage points relative to control groups. These results suggest that a user's social interactions see significant decline and that their networks become less stable following job loss. This loss of social connections may amplify the negative consequences associated with job loss observed in other studies.
For our mobility metrics, we find that the number of unique towers visited by laid-off individuals decreases 17% and 20% relative to the random sample and town sample, respectively. Radius of gyration falls by 20% and 22%, while the average distance a user is found from the most visited tower also decreases by 26% relative to reference groups. These changes reflect a general decline in the mobility of individuals following job loss, another potential factor in long-term consequences.
8. Observing unemployment at the province level
The relationship between mass layoff events and these features of CDRs suggests a potential for predicting important, large-scale unemployment trends based on the population's call information. Provided the effects of general layoffs and unemployment are similar enough to those due to mass layoffs, it may be possible to use observed behavioural changes as additional predictors of general levels of unemployment. To perform this analysis, we use another CDR dataset covering approximately 10 million subscribers in a different European country, which has been studied in prior work [44,45,52–54,57]. This country experienced enormous macroeconomic disruptions, the magnitude of which varied regionally during the period in which the data are available. We supplement the CDR dataset with quarterly, province-level unemployment rates from the EU Labor Force Survey, a large sample survey providing data on regional economic conditions within the EU (see the electronic supplementary material for additional details).
We compute seven aggregated measures identified in the previous section: call volume, incoming calls, outgoing calls, number of contacts, churn, number of towers and radius of gyration. Distance from home was omitted due to strong correlation with radius of gyration, while calls to the town were omitted because this measure is not applicable in a different country. For reasons of computational cost, we first take a random sample of 3000 mobile phone users for each province. The sample size was determined to ensure that the estimated feature values are stable (see the electronic supplementary material, figure S6 for details). We then compute the seven features aggregated per month for each individual user. The kth feature value of user i at month t is denoted y_{i,t,k}, and we compute the month-over-month change in this quantity for each user. A normalized feature value for a province s is then computed by averaging these changes over all sampled users in that province. In addition, we use percentiles of the bootstrap distribution to compute the 95% CI for the estimated feature value.
After aggregating these metrics to the province level, we assess their power to improve predictions of unemployment rates. Note that we do not attempt to identify mass layoffs in this country. Instead, we look for behavioural changes that may have been caused by layoffs and see whether these changes are predictive of general unemployment statistics. First, we correlate each aggregate measure with regional unemployment separately, finding significant correlations in the same direction as was found for individuals (see the electronic supplementary material, table II). We also find strong correlations between calling behaviour variables, suggesting that principal component analysis (PCA) can reasonably be used to construct independent variables that capture changes in calling behaviour while guarding against collinearity. The first principal component, with an eigenvalue of 4.10, captures 59% of the variance in our data and is the only component whose eigenvalue satisfies the Kaiser criterion. The loadings in this component are strongest for social variables. Additional details on the results of PCA can be found in the electronic supplementary material, tables III and IV. Finally, we compute the scores for the first component for each observation and build a series of models that predict quarterly unemployment rates in provinces with and without the inclusion of this representative mobile phone variable.
First, we predict the present by estimating a regression of a given quarter's unemployment on calling behaviour in that quarter (e.g. using phone data from Q1 to predict unemployment in Q1). As phone data are available the day a quarter ends, this method can produce predictions weeks before survey results are tabulated and released. Next, we predict the future in a more traditional sense by estimating a regression on a quarter's surveyed unemployment rate using mobile phone data from the previous quarter as a leading indicator (e.g. phone metrics from Q1 to predict unemployment rates in Q2). This method can produce predictions months before surveys are even conducted. See the electronic supplementary material, figure S3 for a detailed timeline of data collection, release and prediction periods. We have eight quarters of unemployment data for 52 provinces. We make and test our predictions by training our models on half of the provinces and cross-validating by testing on the other half. The groups are then switched to generate out-of-sample predictions for all provinces. Prediction results for an AR1 model that includes a CDR variable are plotted against actual unemployment rates in figure 4. We find strong correlations between actual unemployment rates and our predictions, both for present unemployment rates (ρ = 0.95) and for rates one quarter in the future (ρ = 0.85).
As has been advocated previously, it is important to benchmark these types of prediction algorithms against standard forecasts that use existing data. Previous work has shown that the performance of most unemployment forecasts is poor and that simple linear models routinely outperform complicated nonlinear approaches [62–65] and the dynamic stochastic general equilibrium (DSGE) models aimed at simulating complex macroeconomic interactions [66,67]. With this in mind, we compare predictions made with and without mobile phone covariates using three different model specifications: AR1, AR1 with a quadratic term (AR1 Quad) and AR1 with a lagged national GDP covariate (AR1 GDP). In each of these model specifications, the coefficient related to the principal component CDR score is highly significant and negative, as expected given that the loadings weigh heavily on social variables that declined after a mass layoff (see the electronic supplementary material, tables V and VI for regression results). Moreover, adding metrics derived from mobile phone data significantly improves forecast accuracy for each model and reduces the RMSE of unemployment rate predictions by between 5 and 20% (see insets in figure 4). As an additional check that we are capturing true improvements, we use mobile phone data from only the first half of each quarter (before surveys are even conducted) and still achieve a 3–10% improvement in forecasts. These results hold even when variants are run to include quarterly and province level fixed effects (see the electronic supplementary material, tables VII and VIII).
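The aggregation step can be sketched as follows. The exact normalization is not recoverable from the text here, so the sketch assumes a relative change, (y_t − y_{t−1}) / y_{t−1}, and a plain percentile bootstrap over users; the function names, the number of resamples and the four synthetic users are hypothetical.

```python
import random


def relative_change(curr, prev):
    """Month-over-month relative change; prev must be non-zero."""
    return (curr - prev) / prev


def province_feature(changes, n_boot=2000, seed=0):
    """Mean change across sampled users with a 95% percentile bootstrap CI."""
    rng = random.Random(seed)
    n = len(changes)
    mean = sum(changes) / n
    # Resample users with replacement and recompute the mean each time.
    boot_means = sorted(
        sum(rng.choices(changes, k=n)) / n for _ in range(n_boot)
    )
    lower = boot_means[int(0.025 * n_boot)]
    upper = boot_means[int(0.975 * n_boot) - 1]
    return mean, (lower, upper)


# (previous month, current month) values of one feature for four users.
pairs = [(100, 90), (50, 45), (80, 88), (40, 30)]
user_changes = [relative_change(curr, prev) for prev, curr in pairs]
feature_mean, (ci_low, ci_high) = province_feature(user_changes)
```

With 3000 sampled users per province the interval would be far narrower than in this four-user toy example; users with a zero value in the previous month would need to be excluded or handled separately under this particular normalization.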
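The dimensionality reduction can be illustrated with a small standard-library sketch. It standardizes the features, forms their correlation matrix and extracts the leading component by power iteration, which is a substitute for a full eigendecomposition and is used here only to keep the example self-contained. The toy data and function names are hypothetical. Under the Kaiser criterion a component is retained when its eigenvalue exceeds 1, and with seven standardized features an eigenvalue of 4.10 corresponds to 4.10 / 7 ≈ 59% of the variance, matching the figures reported above.

```python
import math


def standardize(rows):
    """Z-score each column; columns must not be constant."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    sds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n)
           for j in range(p)]
    return [[(r[j] - means[j]) / sds[j] for j in range(p)] for r in rows]


def correlation_matrix(z):
    n, p = len(z), len(z[0])
    return [[sum(r[a] * r[b] for r in z) / n for b in range(p)]
            for a in range(p)]


def leading_component(corr, iters=500):
    """Largest eigenvalue and its eigenvector, found by power iteration."""
    p = len(corr)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iters):
        w = [sum(corr[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    cv = [sum(corr[a][b] * v[b] for b in range(p)) for a in range(p)]
    eigenvalue = sum(v[a] * cv[a] for a in range(p))
    return eigenvalue, v


# Toy data: the first two features move together, the third does not.
data = [[1.0, 2.0, 1.0], [2.0, 4.0, -1.0], [3.0, 6.0, -1.0], [4.0, 8.0, 1.0]]
z = standardize(data)
eigenvalue, loadings = leading_component(correlation_matrix(z))
variance_share = eigenvalue / len(loadings)
keep_component = eigenvalue > 1.0  # Kaiser criterion
scores = [sum(zi * li for zi, li in zip(row, loadings)) for row in z]
```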
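The split-half procedure can be sketched as below, under several assumptions: the model is u_t = a + b·u_{t−1} + c·x_t, with x_t the same-quarter CDR score; it is fitted by ordinary least squares via the normal equations; and provinces are split by sorted name, since the text does not say how the halves were formed. The panel is synthetic and noise-free, generated from known coefficients, so the out-of-sample error is close to zero by construction. All names are hypothetical.

```python
import math


def solve(a, b):
    """Solve a linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ols(X, y):
    """Least-squares coefficients from the normal equations."""
    k = len(X[0])
    xtx = [[sum(r[a] * r[b] for r in X) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(X, y)) for a in range(k)]
    return solve(xtx, xty)


def design(panel, provinces):
    """Rows [1, u_{t-1}, cdr_t] and targets u_t for the given provinces."""
    X, y = [], []
    for name in provinces:
        series = panel[name]
        for (u_prev, _), (u_now, cdr_now) in zip(series, series[1:]):
            X.append([1.0, u_prev, cdr_now])
            y.append(u_now)
    return X, y


def split_half_rmse(panel):
    names = sorted(panel)
    half_a, half_b = names[:len(names) // 2], names[len(names) // 2:]
    errors = []
    for train, test in ((half_a, half_b), (half_b, half_a)):
        beta = ols(*design(panel, train))
        X_test, y_test = design(panel, test)
        for row, actual in zip(X_test, y_test):
            predicted = sum(b * v for b, v in zip(beta, row))
            errors.append(predicted - actual)
    return math.sqrt(sum(e * e for e in errors) / len(errors)), len(errors)


def simulate(u0, cdr):
    """Synthetic series: u_t = 1.0 + 0.8 * u_{t-1} - 0.5 * cdr_t."""
    u = [u0]
    for c in cdr[1:]:
        u.append(1.0 + 0.8 * u[-1] - 0.5 * c)
    return list(zip(u, cdr))


panel = {
    "A": simulate(8.0, [0.2, -0.5, 0.1, 0.7]),
    "B": simulate(12.0, [-0.3, 0.4, -0.6, 0.0]),
    "C": simulate(5.0, [0.5, 0.3, -0.2, -0.8]),
    "D": simulate(15.0, [-0.1, -0.7, 0.6, 0.2]),
}
rmse, n_predictions = split_half_rmse(panel)
```

Replacing cdr_t with the previous quarter's score in `design` would turn this into the leading-indicator variant, and dropping the CDR column gives the baseline AR1 against which the improvement in RMSE is measured.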
In summary, we have shown that features associated with job loss at the individual level are similarly correlated with province level changes in unemployment rates in a separate country. Moreover, we have demonstrated the ability of massive, passively collected data to identify salient features of economic shocks that can be scaled up to measure macroeconomic changes. These methods allow us to predict ‘present’ unemployment rates two-to-eight weeks prior to the release of traditional estimates and predict ‘future’ rates up to four months ahead of official reports more accurately than using historical data alone.
We have presented algorithms capable of identifying employment shocks at the individual, community and societal scales from mobile phone data. These findings have great practical importance, potentially facilitating the identification of macroeconomic statistics with much finer spatial granularity and faster than traditional methods of tracking the economy. We can not only improve estimates of the current state of the economy and provide predictions faster than traditional methods, but also predict future states and correct for current uncertainties. Moreover, with the quantity and richness of these data increasing daily, these results represent conservative estimates of their potential for predicting economic indicators. The ability to get this information weeks to months faster than traditional methods is extremely valuable to policy and decision-makers in public and private institutions. Further, it is likely that CDR data are more robust to external manipulation and less subject to service provider algorithmic changes than most social media. But, just as important, the micro nature of these data allows for the development of new empirical approaches to study the effect of economic shocks on interrelated individuals.
While this study highlights the potential of new data sources to improve forecasts of critical economic indicators, we do not view these methods as a substitute for survey-based approaches. Though data quantity is increased by orders of magnitude with the collection of passively generated data from digital devices, the price of this scale is control. The researcher no longer has the ability to precisely define which variables are collected, how they are defined or when data collection occurs, making it much harder to ensure data quality and integrity. In many cases, data are not collected by the researcher at all and are instead first pre-processed by the collector, introducing additional uncertainties and opportunities for contamination. Moreover, data collection itself is now conditioned on who has specific devices and services, introducing potential biases due to economic access or sorting. If policy decisions are based solely on data derived from smartphones, the segment of the population that cannot afford these devices may be underserved.
Surveys, on the other hand, provide the researcher far more control to target specific groups, ask precise questions and collect rich covariates. Though the expense of creating, administering and participating in surveys makes it difficult to collect data of the size and frequency of newer data sources, they can provide far more context about participants. This work demonstrates the benefits of both data gathering methods and shows that hybrid models offer a way to leverage the advantages of each. Traditional survey-based forecasts are improved here, not replaced, by mobile phone data. Moving forward, we hope to see more such hybrid approaches. Projects such as the Future Mobility Survey and the MIT Reality Mining project bridge this gap by administering surveys via mobile devices, allowing for the collection of process-generated data as well as survey-based data. These projects open the possibility to directly measure the correlation between data gathered by each approach.
The macroeconomy is the complex concatenation of interdependent decisions of millions of individuals. To have a measure of the activity of almost every individual in the economy, of their movements and their connections should transform our understanding of the modern economy. Moreover, the ubiquity of such data allows us to test our theories at scales large and small and all over the world with little added cost. We also note potential privacy and ethical issues regarding the inference of employment/unemployment at the individual level, with potentially dire consequences for individuals’ access, for example, to financial markets. With the behaviour of billions being volunteered, captured and stored at increasingly high resolutions, these data present an opportunity to shed light on some of the biggest problems facing researchers and policy-makers alike, but also represent an ethical conundrum typical of the ‘big data’ age.
J.L.T., Y-R.L., D.S., E.M. and D.L. designed and performed data analysis and wrote the paper. M.C.G. provided data and edited the paper.
We declare we have no competing interests.
J.L.T. received funding from the National Science Foundation Graduate Research Fellowship Program (NSF GRFP). D.L. acknowledges support from the Defense Threat Reduction Agency (grant no.: HDTRA1-10-1-0100/BRBAA08-Per4-C-2-0033). The views expressed in this paper are the authors' alone.
- Received March 2, 2015.
- Accepted May 7, 2015.
- © 2015 The Author(s) Published by the Royal Society. All rights reserved.