## Abstract

Culturomics was recently introduced as the application of high-throughput data collection and analysis to the study of human culture. Here, we make use of these data by investigating fluctuations in yearly usage frequencies of specific words that describe social and natural phenomena, as derived from books that were published over the course of the past two centuries. We show that the determination of the Hurst parameter by means of fractal analysis provides fundamental insights into the nature of long-range correlations contained in the culturomic trajectories, and by doing so offers new interpretations as to what might be the main driving forces behind the examined phenomena. Quite remarkably, we find that social and natural phenomena are governed by fundamentally different processes. While natural phenomena have properties that are typical for processes with persistent long-range correlations, social phenomena are better described as non-stationary, on–off intermittent or Lévy walk processes.

## 1. Introduction

Observational data are often very complex, appearing without any structure or pattern in either time or space. Examples of such observations can be found across the whole spectrum of the social and natural sciences, ranging from economics [1] to physics [2], biology [3] and medicine [4]. The origins of observed irregular behaviour, however, are not always clear. Roughly, five decades ago, deterministic chaos was discovered [5] and quickly rose to prominence as a possible mechanism of inherent unpredictability and complexity [6,7]. Yet, the strict criteria for declaring deterministic chaos in observed data [8], most notably the satisfaction of criteria for stationarity and determinism [2], and the verification of exponential divergence [9,10], are rarely satisfied. In response, attention has begun to shift from chaos to noise and random processes as alternative [11] (or, in many cases, as even more probable) sources of irregularity. While the theory of deterministic chaos relies on nonlinear dynamical systems with typically only a few degrees of freedom, the analysis of stochastic processes, especially those that yield data with scale invariance, relies on random fractal theory [12] or its generalization, multifractal theory [9,13]. Indeed, investigations based on these theoretical foundations may provide an elegant statistical characterization of a broad range of heterogeneous phenomena [14], and in this paper, it is our goal to extend this theory to culturomics, as recently introduced in Michel *et al.* [15].

Culturomics, and the study of human culture in general, seemingly has little to do with deterministic chaos and fractals. However, quantitative analyses of various aspects of human culture have become increasingly popular; examples include the study of human mobility patterns [16–18], the spread of infectious diseases [19–22] and malware [23,24], the dynamics of online popularity [25], social movement [26] and language [27–29], and even tennis [30]. This progress is driven not only by important advances in theory and modelling, but also by the increasing availability of vast amounts of data and knowledge, also referred to as metaknowledge [31], which allows scientists to apply advanced methods of analysis on a large scale [32]. The seminal study by Michel *et al.* [15] was accompanied by the release of a vast amount of data comprising metrics derived from approximately 4 per cent of books ever published (over five million in total), and it was this release that made the present study, i.e. the application of random fractal theory, possible. The data are available at ngrams.googlelabs.com as counts of *n*-grams that appeared in a given corpus of books published in each year. An *n*-gram is made up of a series of *n* 1-grams, and a 1-gram is a string of characters uninterrupted by a space. Note that a 1-gram is not necessarily a word, for it may be a number or a typo as well. Besides the counts of individual *n*-grams, the total counts of *n*-grams contained in each corpus of books in a given year are also provided, from which yearly usage frequencies can be obtained.

In this paper, we show what new insights are attainable by applying random fractal theory to this vast culturomic dataset. Our goal is to try and go beyond the interpretations of trajectories provided in Michel *et al.* [15] by means of an accurate determination of scaling parameters [33], and in particular, the Hurst parameter *H*, which enables us to characterize the nature of correlations (memory), if any, contained in the irregular time series. In general, data with long-range correlations are an important subclass of 1/*f*^{α} noise [34–36], which is characterized by a power-law decaying power spectral density, and whose dimensionality cannot be reduced by principal component analysis since the rank-ordered eigenvalue spectrum also decays as a power law [37]. Processes that generate time series with such properties are said to have anti-persistent correlations if 0 < *H* < 1/2, are memoryless or have only short-range correlations if *H* = 1/2, and have persistent long-range correlations (long memory) if 1/2 < *H* < 1 [12]. Moreover, values of *H* > 1 are possible as well; these values, however, are characteristic of non-stationary processes or rather special stationary processes such as on–off intermittency with power-law distributed on and/or off periods and Lévy walks [10]. (Note that the latter should not be confused with Lévy flights, which are random processes consisting of many independent steps, and are thus memoryless with *H* = 1/2.) Prominent examples where 1/*f*^{α} noise was recently observed and quantified include DNA sequences [38,39], human cognition [40] and coordination [41], posture [42], cardiac dynamics [43–46] and the distribution of prime numbers [47], to name but a few.

Despite the many successful attempts at assessing long-range correlations in complex time series—for example, by means of detrended fluctuation analysis (DFA) [48], as well as many other methods [9,13]—care should be exercised by their interpretation, particularly if one is faced with relatively short time series that contain trends [49], non-stationarity [50] or signs of rhythmic activity [51,52]. Although it is obviously impossible to make general statements concerning these properties for all the *n*-grams contained in the corpus of the over five million digitized books, which amount roughly to over two billion culturomic trajectories, it is clear that the time series are short, comprising a little more than approximately 200 points corresponding to the two centuries considered (more precisely, from year 1770 to 2007), and that many will inevitably contain strong trends [15]. In order to successfully surpass the difficulties and pitfalls associated with the analysis of such time series [10], besides the traditional DFA, we also use an adaptive fractal analysis (AFA), which is based on nonlinear adaptive multiscale decomposition. We use these methods to determine the Hurst parameter *H* for several 1-grams that are representative for social and natural phenomena. Examples of words that we focus on include war, unemployment, hurricane and earthquake, and we find that those that describe social phenomena (war, unemployment, etc.) in general have different scaling properties than those describing natural phenomena (hurricane, earthquake, etc.). Our results can be corroborated aptly with arguments from real life, and they fit nicely to the declared goal of culturomics, which is to extend the boundaries of scientific inquiry to a wide array of new phenomena [15].

The remainder of this paper is organized as follows. In the next section, we present the main results, in §3, we summarize them and discuss their potential implications, while in the appendix, we describe the details of fractal analysis.

## 2. Results

We start by presenting the results of the AFA for natural phenomena. We first plot in figure 1*a* the original time series (thin line) and the estimated trend (thick line) for the 1-gram ‘earthquake’. The detrended data are presented in figure 1*b*. It can be observed that overall the trend is very modest and simple, increasing only slightly towards the present day. Using equation (A 4), the Hurst parameter can be estimated from the slope of the *F*(*w*) versus *w* dependence on a double log scale. In figure 1*c*, we show that the analysis of detrended data yields *H* = 0.65, while in figure 1*d* we show that *H* = 0.75 if the original data are used as input. Both calculations produce similar results, showing a very modest slope, and rely on statistically robust scaling. Based on the meaning of the Hurst parameter, the fractal analysis of the culturomic trajectory for ‘earthquake’ reveals that this phenomenon has persistent long-range correlation.

As another example, we show in figure 2 the same analysis for the 1-gram ‘hurricane’. Unlike the ‘earthquake’ trajectory, the trend for ‘hurricane’ is more pronounced. It has a strong upwards component, especially in the last couple of decades. Hence, it can be expected that the discrepancy of the two estimated *H* values for the original and detrended data will be somewhat larger than that for the 1-gram ‘earthquake’ analysed in figure 1. This expectation is indeed confirmed by comparing figure 2*c* and figure 2*d*, from where it follows that for the detrended data *H* = 0.70 while for original data *H* = 0.85. Still, however, both results robustly classify ‘hurricane’ as a phenomenon with persistent long-range correlations, thus adding to the evidence that this may be valid, in general, for natural phenomena.

To test this hypothesis more thoroughly, we have performed the same analysis as depicted in figures 1 and 2, along with the DFA, for 13 other phenomena that can be classified as characteristic of natural phenomena. Although there may be some disagreement as to what terms are *characteristic* of natural phenomenon, and other 1-grams as well as *n*-grams could be suggested as characteristic of natural phenomena and analysed, we consider our selection to be sufficiently representative for this study. Supporting this assumption are the results presented in table 1, which point robustly towards the conclusion that natural phenomena, in general, really can be classified as processes with persistent long-range correlations. More specifically, for detrended data, we find that all estimated Hurst parameters are within the 1/2 < *H* < 1 range with an average of (AFA), which leads us to the mentioned final conclusion. Results obtained for original data (before detrending, not shown), on the other hand, leave a bit more room for discussion. There, for certain 1-grams, like ‘mudslide’ and ‘flooding’, the value of *H* is larger than one. This suggests that the data would be more appropriately described as being either non-stationary, on–off intermittent or Lévy walk-like. Such a discussion, however, would be to a large degree baseless as the upward trends occurring towards the present time in most *n*-grams describing natural phenomena must be properly taken into account. The observed trends may be considered as a straightforward consequence of the fact that we have more and more data readily available on natural phenomena, which is due to advancements in measuring techniques as well as the increasingly global reach of the Internet. Modern data collection and telecommunication technologies have raised our awareness, in general, of natural phenomena, and, as a result, it is reasonable to expect this increased awareness to be reflected in an increase of occurrences in the corpus. Note, however, that similar arguments can be raised for other fields and trivia (e.g. celebrity gossip, popular culture) as well, and thus one could argue that relatively, the usage frequencies should not necessarily increase as a result of that.

Turning to social phenomena, we will show that the problems discussed for natural phenomena are in some cases amplified, but, more importantly, that social phenomena, apart from rare exceptions, cannot be classified solely as processes with persistent long-range correlations.

First, we presented the AFA for the 1-gram ‘war’ in figure 3. The original data depicted by the thin line in figure 3*a* are clearly reminiscent of historical events, as World Wars I and II generate two large peaks that more or less dwarf the usage frequencies reported in other decades. This observation goes hand in hand not just with the magnitude of the two world wars, but also with the increase in the usage frequency of ‘war’ in the published literature at that time. In agreement with the historical events is the estimated trend line depicted by the thick line in figure 3*a*. However, even after the detrending, the resulting culturomic trajectory still clearly reflects history in that the periods of World Wars I and II stand out from the rest, as can be inferred from the curve depicted in figure 3*b*. The Hurst parameter *H* determined using the detrended and original data (presented in figure 3*c*,*d*) have similar values to each other (*H* = 1.09 for the detrended data and *H* = 1.15 for the original data). As a result, both classify ‘war’ as either a non-stationary, on–off intermittent or a Lévy walk-like process.

Another illustrative example of fractal analysis is presented in figure 4, where we examine the 1-gram ‘unemployment’. A crucial distinction from ‘war’, as well as all the considered natural phenomena, is that unemployment was non-existent, or at least it was not mentioned, in the literature prior to 1900, which is clearly inferable from the original data depicted with a thin in figure 4*a*. With the coming of age of the industrial revolution, the job market began to take shape, and with it came, rather inevitably it seems, the problem of unemployment. The trend depicted with a thick line in figure 4*a* clearly captures this fact. Moreover, we note that the first broad peak in the plot starts at around 1930, and thus correlates well with the Great Depression, while the second broad peak starts at around 1970, and thus correlates with that period of US economic stagnation and high inflation that was linked with the Middle Eastern oil crisis. After detrending, the situation is of course only marginally improved (in terms of assuring a more stationary record), as can be concluded from the curve depicted in figure 4*b*. The Hurst parameters, equalling *H* = 1.32 for the detrended data (*c*) and *H* = 1.39 for the original data (*d*), both clearly reflect non-stationarity, and accordingly, ‘unemployment’ can be considered the result of such a process.

As in the case with natural phenomena (table 1), we also performed the same fractal analysis as in ‘war’ and ‘unemployment’, along with the DFA, for 13 other social phenomena. The results are presented in table 2. It can be observed that the large majority of considered 1-grams have *H* > 1 (AFA), which indicates that social phenomena are most likely to be either non-stationary, on–off intermittent or Lévy walk-like process. This conclusion is obtained irrespective of whether detrending is performed or not, although the average Hurst parameter for detrended data, equalling (AFA), is smaller than that obtained for original data (before detrending, not shown), which is . This technical discrepancy, however, is probably due to the successful removal of some level of non-stationarity that is in general characteristic of social phenomena (more so than of natural phenomena). We would like to note, however, that in general not all *H* > 1 occurrences should be, by default, attributed to non-stationarity in the trajectories. While visual inspection may lend support to such a conclusion, as was the case for results presented in figure 4, in general the *H* value alone cannot distinguish between non-stationary, on–off intermittent or Lévy walk-like processes. In fact, the time series are too short for a robust assessment of a more precise nature of the examined social phenomena. At a glance, and since this is indeed most common, it seems convenient to attribute *H* > 1 in social phenomena to non-stationarity, yet only additional future data can enable us to differentiate whether the peaks are part of an on–off intermittent process with power-law distributed on and/or off events, or if they are part of a Lévy walk. Finally, we would also like to point out that of course not *all* phenomena that can be considered as social will have *H* > 1. Examples include 1-grams such as ‘famine’ or ‘Christian’, which for the largest parts of the recorded human history were either directly related to natural phenomena (severe droughts, flooding or other phenomena negatively affected that season's yield on vegetables, crops, grass and animal population, hence leading to famine) or have been an integral part of the human culture for a long time (prior to the start of the culturomic trajectories). Moreover, social topics that are of little interest will not garner much attention, and are as such also unlikely to have usage frequencies with *H* > 1. The social phenomena where the human factor has played a key role recently and which are reasonably popular, however, all share features that are characteristic of processes with *H* > 1. In fact, it seems just to conclude that the more the social phenomena can be considered recent (unemployment, recession and democracy), the higher their Hurst parameter is likely to be (table 2). This agrees nicely also with the recent observation of bursts and heavy tails in human dynamics [53].

## 3. Discussion

By applying fractal analysis based on DFA and AFA to culturomic trajectories of 1-grams describing typical social and natural phenomena over the past two centuries, we have found that they obey different scaling laws. As we will discuss in what follows, our findings agree nicely with existing theory and expectations, as well as offer new interpretations as to what might be the main driving forces behind the examined phenomena.

We find that natural phenomena have properties that are typical of processes that generate persistent long-range correlations, as evidenced by the Hurst parameter being in the range 0.55 ≤ *H* ≤ 0.85, with an average over all 15 considered 1-grams equalling (AFA). The prevalence of long-term memory in natural phenomena compels us to conjecture that the long-range correlations in the usage frequency of the corresponding terms is predominantly driven by occurrences in nature of those phenomena. Using data from five million digitized books to arrive at this understanding certainly supports the declared goal of culturomics and lends strong support to its core principles. Owing to this memory, and of course by using statistical data available, we know, based on the Gutenberg–Richter law [54], that in the UK, for example, an earthquake of 3.7–4.6 on the Richter scale is likely to happen every year, an earthquake of 4.7–5.5 is due approximately every 10 years, while an earthquake of 5.6 or larger is bound to happen every 100 years [55]. Similar ‘statistical predictions’ are available for tsunamis and many other, if not all, natural phenomena. On a more personal level, this also agrees with how we naturally develop an understanding for the weather and related natural phenomena for the region we live in.

Social phenomena, on the other hand, have the Hurst parameter in the range 0.74 ≤ *H* ≤ 1.33, with an average over all 15 considered 1-grams equalling (AFA). This is indicative of non-stationary processes, or stationary processes like on–off intermittency with power-law distributed on and/or off periods or Lévy walks. While our analysis does not allow distinction between these three options, it is clear that all these processes are fundamentally different from those describing natural phenomena. So while it is common to hear speculations about possible average periods regarding social phenomena—for instance, that there may be an average period between major wars or stock market crashes—our analysis suggests this is not the case, and that social phenomena tend to follow different scaling laws from natural phenomena. Such a difference is not unexpected, as social phenomena are, by nature, more complex than natural phenomena; the former depend on political, economic and social forces, as well as on natural phenomena. The results of this additional complexity can be seen in our fractal analysis of a set of culturomic trajectories.

In summary, we hope to have successfully demonstrated that the data made available through the Culturomics project [15], when coupled with advanced methods of analysis, offer fascinating opportunities to explore human culture in the broadest possible sense.

## Acknowledgements

This research was supported in part by the US NSF grant CMMI-0825311 to Jianbo Gao, and by the Slovenian Research Agency's grant J1-4055 to Matjaž Perc. The authors are grateful to Prof. Johnny Lin of North Park University for many useful discussions.

## Appendix A. Methods

Nonlinear adaptive multiscale decomposition starts by partitioning a time series into segments of length *w* = 2*n* + 1, where neighbouring segments overlap by *n* + 1 points, thus introducing a time scale of ((*w* + 1)/2)*τ* = (*n* + 1)*τ*, where *τ* is the sampling time. Each segment is then fitted with the best polynomial of order *M*. Note that *M* = 0 and 1 correspond to piece-wise constant and linear fitting, respectively. We denote the fitted polynomials for the *i*th and (*i* + 1)th segments by *y*^{(i)}(*l*_{1}) and *y*(*i* + 1)(*l*_{2}), respectively, where *l*_{1}, *l*_{2} = 1, … , 2*n* + 1. We then define the fitting for the overlapped region as
A1where *w*_{1} = (1 − (*l* − 1)/*n*) and *w*_{2} = (*l* − 1)/*n* can be written as (1 − *d*_{j}/*n*) for *j* = 1, 2, and where *d*_{j} denotes the distances between the point and the centres of *y*^{(i)} and *y*^{(i+1)}, respectively. This means that the weights decrease linearly with the distance between the point and the centre of the segment. Such a weighting ensures symmetry and effectively eliminates any jumps or discontinuities around the boundaries of neighbouring segments. In fact, the scheme ensures that the fitting is continuous everywhere, is smooth at the non-boundary points, and has the right- and left-derivatives at the boundary. Moreover, since it can deal with an arbitrary trend without *a priori* knowledge, it can remove non-stationarity, including baseline drifts and motion artefacts [56], and the procedure may also be used as either high-pass or low-pass filter with superior noise-removal properties than linear filters, wavelet shrinkage or chaos-based noise reduction schemes [57].

Based on the described adaptive decomposition, a fractal analysis can be conducted as follows. Let {*x*_{1}, *x*_{2}, … , *x*_{n}} be a stationary stochastic process with mean and autocorrelation function of type
A2where *H* is the Hurst parameter. This is often called an increment process, and its power spectral density is 1/*f*^{2H−1}. The integral of the increment process
A3on the other hand, is called a random walk process, and its power spectral density is 1/*f*^{2H+1}. Starting from an increment process, similarly to DFA [48], we first construct a random walk process using equation (A 3). If, however, the original data can already be classified as a random walk-like process, then this step is not necessary, although for ideal fractal processes, there is no penalty even if this step is done. Next, for a window size *w*, we determine, for the random walk process *u*(*i*) (or the original process if it is already a random walk-like process), a global trend *v*(*i*), *i* = 1, 2, … , *N*, where *N* is the length of the walk. The residual, *u*(*i*) − *v*(*i*), characterizes fluctuations around the global trend, and its variance yields the Hurst parameter *H* according to
A4

The validity of equation (A 4) can be proved if one starts from an increment process with the Hurst parameter equal to *H*. Using Parseval's theorem [9], the variance of the residual data corresponding to a window size *w* may be equated to the total power in the frequency range (*f*_{w}, *f*_{cutoff}) as
A5where *f*_{w} = 1/*w*, and *f*_{cutoff} is the highest frequency of the data. When *f*_{w}≪*f*_{cutoff}, we see that equation (A 4) has to be valid. In fact, the above treatment makes it clear that even if we start from a random walk process with the Hurst exponent equal to *H*, integration will give the process a spectrum of 1/*f*^{2H+1+2} = 1/*f*^{2(H+1)+1}, and, therefore, the final Hurst parameter will be simply *H* + 1. This in turn indicates that there is indeed no penalty if one uses equation (A 3) when the data are already a random walk-like process. Note that the proposed approach, if needed, can be readily extended and applied successfully to multifractal as well as higher dimensional data.

The described fractal analysis approach, which we will refer to as AFA, in general yields results that are consistent with the traditionally used DFA [48], as can be concluded from results presented in figure 5. Nevertheless, especially for processes having *H* > 1 [10], AFA may yield better scaling, which is why, although we analyse the culturomic trajectories with both methods, we rely on the results of AFA for final interpretation. The potential advantage of AFA over DFA is due to the fact that the trend for each window of size *w* obtained by AFA is smooth, while that obtained by DFA may change abruptly at the boundary of neighbouring segments. For short non-stationary time series, this may prove favourable for obtaining better scaling in the *F*(*w*) versus *w* dependence.

- Received December 3, 2011.
- Accepted January 27, 2012.

- This journal is © 2012 The Royal Society

This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.