## Abstract

In empirical studies, trajectories of animals or individuals are sampled in space and time. Yet, it is unclear how sampling procedures bias the recorded data. Here, we consider the important case of movements that consist of alternating rests and moves of random durations and study how the estimate of their statistical properties is affected by the way we measure them. We first discuss the ideal case of a constant sampling interval and short-tailed distributions of rest and move durations, and provide an exact analytical calculation of the fraction of correctly sampled trajectories. Further insights are obtained with simulations using more realistic long-tailed rest duration distributions showing that this fraction is dramatically reduced for real cases. We test our results for real human mobility with high-resolution GPS trajectories, where a constant sampling interval allows one to recover at best 18% of the movements, while over-evaluating the average trip length by a factor of 2. Using a sampling interval extracted from real communication data, we recover only 11% of the moves, a value that cannot be increased above 16% even with ideal algorithms. These figures call for a more cautious use of data in quantitative studies of individuals' movements.

## 1. Introduction

Recent years have witnessed a dramatic increase in the use of large amounts of available data thanks to information and communication technologies. These new sources allow one to monitor and to map the dynamical properties of many complex systems on an unprecedented scale [1] and we now have access to a vast number of spatial trajectories representing movements of objects in geographical space [2]. In particular, such datasets have opened the opportunity to better understand human movements [3–7] and the impact of mobility on important processes such as epidemic spreading [8]. These recent works extend previous studies of movements and foraging patterns of animals [9,10] and rely on tracking man-made inanimate objects [11,12]. However, as is the case for any dataset, these new sources of information have limits and biases [13–16] that need to be assessed.

It is common to approximate the continuous spatio-temporal record of the followed individual (or animal) by a series of straight lines, thus describing the movements of an organism as a sequence of behavioural events called *moves* for animals [17] and *trips* for humans. This empirical approach allows a natural implementation of the theoretical framework of continuous-time random walks [11,18], where a *rest* time is associated with the endpoint of each move. However, this leads to the first major problem due to the lack of behavioural information in the empirical data [19]. Real trajectories always exhibit a large variety of *intertwined* static and dynamic behaviours [20]: slow versus fast movement for animals [19], fixation versus saccade in eye-tracking [21] or activities versus trips in human mobility [22]. Isolating and identifying these behaviours from a series of chronologically ordered points is an important statistical challenge [23] and a growing array of methods based on spatio-temporal characteristics of the trajectories have been developed to perform this task automatically [2,19,21]. These methods are often tailored for the specific dataset in question [20]. Therefore, even the working definition of a ‘move’ might vary significantly between studies, depending on the method and the technology used [24].

A second complication comes from the limits of the technology used for collecting the empirical data. In the case of spatial movements, a crucial aspect is the temporal sampling of the trajectory. The simplest and most common method used is *periodic* sampling, where spatial coordinates are recorded at regular time intervals. Alternatively, other data sources are characterized by an *event-based* sampling where locations are recorded at certain (random) events. This is the case, for instance, for the most common sources of human mobility data such as call detail records (CDR) of mobile phone data [25] and geo-located social media accesses [26]. In both cases, the discrete displacements recorded are associated with continuous moves [17], but this is a strong oversimplification, and all derived quantities will depend on the sampling process itself [20,27–29]. The sampling of random processes might even be the principal cause of the emergence of long tails in several statistical distributions [30,31]. For example, in the case of periodic sampling, it has been shown that non-Lévy movements can be erroneously interpreted as Lévy flights when sampling time intervals are larger than the natural timescale of animals' movements [32,33]. The sampling rate is thus a crucial element that has to be taken into account when analysing empirical trajectories [20,34].

For both periodic and event-based sampling, the nature of data forces researchers to make the following naive assumptions:

(i) an individual is always at rest at the location where its position is recorded; and

(ii) every change of position is associated with a single move.

This point of view has been adopted, for instance, in the first important papers where human mobility has been studied with mobile phone data [3,4] and often replicated, even in recent studies [35–37]. However, the use of these new sources of data exacerbates the challenges associated with temporal sampling. Indeed, in these data, trajectories are represented as sequences of positions recorded at the moment of a communication event (which can be a call, a text message or an application access). The trajectory sampling is therefore coupled with the random and bursty nature of human communications [38]. The probability distribution of the time interval between calls [3,4], e-mails [38] and tweets [39] has a long tail which can be fitted by a power law with an exponent value close to −1 (and with a cut-off on the order of days). Only in a few cases is a small set of trajectories sampled every Δ = 1 or 2 h available [3,4,40]. Even when individuals with a very high call frequency are selected [40], they are still inactive most of the time [41]. In order to identify human mobility patterns, it thus becomes necessary to introduce ad hoc methods based on reasonable assumptions and almost arbitrary parameters [16,42].

In this paper, we discuss the effect of sampling and assumptions (i) and (ii) on the measured properties of random movements. We will consider one of the simplest and realistic cases where the trajectory consists of two alternating phases, moves and rests, whose durations *t* and *τ* are regarded as independent random variables. Trajectories can then be seen as an alternating renewal process, i.e. a generalization of Poisson processes to arbitrary holding times and to two alternating kinds of events. The sampling time interval Δ depends on the particular experiment and can be either constant or randomly distributed. Using methods of renewal theory along the lines of [43], we provide a theoretical estimate for the fraction of correctly sampled trips with periodic sampling, and show the existence of an optimal sampling time interval. We then extend our results numerically to the case of event-based sampling, and with more realistic rest times and speeds. This allows us to show that sampling human trajectories in more realistic settings is necessarily worse than predicted by our analytical model. Finally, we use high-resolution (spatially and temporally) GPS trajectories to verify our predictions on real data.

## 2. Results

### 2.1. Theoretical analysis

We study the effect of the periodic sampling rate on the apparent distribution of measured move lengths. We focus on the case of an alternating sequence of rests and moves and we further assume that the movement is one dimensional with a constant velocity *v* (see electronic supplementary material, section ‘Numerical analysis’, for other cases). Simplifying the problem to one dimension is here sufficient to point out when the sampling is inadequate. We show below that the diagnostics we use to identify the optimal sampling times are independent of the dimensionality of the space in which a trajectory is embedded. We will not discuss here other issues associated with temporal sampling, like the apparent speed and turning angles in a general two-dimensional case [20,27,29], the possible fits of the displacement distribution [32,33,44,45], or interpolation methods to reconstruct the movements between samplings [46]. The quantities entering this problem are therefore: the move duration *t*, the move length ℓ = *vt*, the resting time *τ*, and the time interval Δ between two consecutive measures. The distributions *P*(*t*) and *P*(*τ*) are characteristics of the specific subject in motion, while the distribution of the sampling interval *P*(Δ) is associated with the technology used for tracking the motion. Sampling the trajectory gives us a displacement distribution *P*(ℓ*) where ℓ* is the apparent length of a move, and the problem is thus to compute this distribution *P*(ℓ*) for any given distributions *P*(*t*), *P*(*τ*) and *P*(Δ).

During rests, the displacement is assumed to be zero, and so the succession of rests and moves is associated with a continuous non-decreasing function *x*(*θ*), where *θ* is the time parameter (figure 1). We sample the position *x**_{k} at every instant $\theta^*_k = \sum_{j=1}^{k} \Delta_j$, where Δ_{j} is the value of the *j*th sampling interval (in the case of constant sampling, $\Delta_j = \bar{\Delta}$, and so $\theta^*_k = k\bar{\Delta}$). The succession of space–time coordinates (*θ**_{k}, *x**_{k}) (shown in figure 1 and in the two-dimensional example of electronic supplementary material, figure S1) thus represents all the knowledge we have about the trajectory after sampling. For two consecutive measures at times *θ**_{k} and *θ**_{k+1}, there is an observed displacement ℓ*_{k} = *x**_{k+1} − *x**_{k}. Our goal is then to estimate the differences between the distribution of real displacement lengths ℓ and of the observed displacements ℓ*_{k}. In particular, we want to understand the biases induced by different choices for *P*(Δ).

If we make the naive assumption (ii), discussed in the introduction, that every observed displacement is associated with a single move, the necessary condition for this to be correct is that two subsequent sampling times *θ**_{k} and *θ**_{k+1} fall in two consecutive rests. We can also easily identify the cases where the sampling times fall in the same rest, because this is the only situation where we exactly have ℓ*_{k} = 0 and which does not lead to a wrong estimate of the individual's movement. Conversely, we must consider as errors all remaining configurations, because at least one of two things necessarily happens: (a) a sampling point falls during a move or (b) a rest is missed by the temporal sampling. Either of these events leads to a misinterpretation of the individual mobility and to an under- or overestimate of the move lengths [32] and of the number of trips observed [15]. In order to go beyond this qualitative argument, we will consider the case of exponential distributions for *P*(*t*) and *P*(*τ*), constant sampling time interval $\bar{\Delta}$, and constant speed *v*. In this case, we obtain explicitly the distribution *P*(ℓ*) of sampled displacements. This will allow us to discuss the impact of the sampling, and to show, in particular, that there is an optimal value for $\bar{\Delta}$.
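
The classification described above (same rest, two consecutive rests, or error) can be checked numerically. The following is a minimal Monte Carlo sketch, assuming exponential rest and move durations and a constant sampling interval. The function name `simulate_sampling`, its arguments and the numerical values in the usage note are illustrative assumptions, not taken from the paper.

```python
import bisect
import random


def simulate_sampling(t_bar, tau_bar, delta, n_samples=200_000, seed=0):
    """Estimate the null fraction C0 and the fraction of correctly sampled moves.

    t_bar, tau_bar: mean move and rest durations (exponential, an assumption).
    delta: constant sampling interval, in the same time unit.
    """
    rng = random.Random(seed)
    horizon = n_samples * delta

    # boundaries[i] is the end time of phase i; even phases are rests, odd are moves.
    boundaries = []
    now, phase = 0.0, 0
    while now <= horizon:
        mean = tau_bar if phase % 2 == 0 else t_bar
        now += rng.expovariate(1.0 / mean)
        boundaries.append(now)
        phase += 1

    # Index of the phase containing each sampling instant k * delta.
    phases = [bisect.bisect_right(boundaries, k * delta) for k in range(n_samples)]

    null = good = 0
    for i, j in zip(phases, phases[1:]):
        if i % 2 == 0 and j == i:
            null += 1          # both instants in the same rest: zero displacement
        elif i % 2 == 0 and j == i + 2:
            good += 1          # consecutive rests: exactly one move in between
        # every other configuration is a sampling error

    total = len(phases) - 1
    return null / total, good / (total - null)
```

For instance, `simulate_sampling(0.3, 2.5, 1.0)` returns the two estimated fractions for a mean move of 0.3 h, a hypothetical mean rest of 2.5 h and hourly sampling. The trajectory starts at the beginning of a rest rather than in the stationary regime, which only matters for short runs.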

#### 2.1.1. Constant sampling rate and exponential distributions

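We will consider the case of exponential distributions for the move and rest durations:

$$P(t) = \frac{1}{\bar{t}}\,e^{-t/\bar{t}}, \qquad P(\tau) = \frac{1}{\bar{\tau}}\,e^{-\tau/\bar{\tau}}, \tag{2.1}$$

and a constant sampling interval:

$$P(\Delta) = \delta(\Delta - \bar{\Delta}) \tag{2.2}$$

(*δ*(*x*) is Dirac's delta function). In the constant velocity case, the real displacements are also exponentially distributed:

$$P(\ell) = \frac{1}{\bar{\ell}}\,e^{-\ell/\bar{\ell}}, \tag{2.3}$$

with $\bar{\ell} = v\bar{t}$.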

Using methods of renewal theory [47–49], along the lines of [43], we obtain an explicit expression for the distribution *P*(ℓ*) of apparent displacements ℓ* after sampling (see electronic supplementary material, section ‘Analytical calculations’, and in particular equations (S15), (S33)):
(2.4)

where the continuous part of this distribution reads

(2.5)

with , and where *I*_{0}(*y*) and *I*_{1}(*y*) are modified Bessel functions of the first kind.

In the following, we will not consider the discrete part associated with the Dirac delta function *δ*(ℓ*) of the distribution *P*(ℓ*), as the value ℓ* = 0 can be easily recognized and excluded in any practical scenario. The fraction of sampling intervals associated with null movements (ℓ* = 0), denoted by $C_0(\bar{\Delta})$, can be significantly large. In the stationary regime [50], we can compute $C_0(\bar{\Delta})$ for any distributions *P*(*t*) and *P*(*τ*), and a constant sampling time (see electronic supplementary material, equation (S17)). We can show that it is a decreasing function, varying between $C_0(0) = \bar{\tau}/(\bar{t}+\bar{\tau})$ (i.e. the fraction of time spent at rest, in the continuous sampling limit) and *C*_{0}(∞) = 0. In the particular case of exponential distributions (equation (2.1)), *C*_{0} is the prefactor of the *δ*(ℓ*) peak in equation (2.4), and can be very large. For instance, *C*_{0} ≈ 60% for the car-mobility values of $\bar{t}$ and $\bar{\tau}$ (see Methods) and $\bar{\Delta}$ = 1 h. For this reason, we compare the original data to a rescaled probability distribution which does not include the *δ*(ℓ*) peak and is given by (see electronic supplementary material, figure S2)

$$P_{\ell^*>0}(\ell^*) = \frac{P_{\mathrm{cont}}(\ell^*)}{1 - C_0(\bar{\Delta})}. \tag{2.6}$$
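
As a hedged illustration of why the null fraction decreases with the sampling interval, one can work out the exponential case directly. Writing $\bar{t}$ and $\bar{\tau}$ for the mean move and rest durations and $\bar{\Delta}$ for the constant sampling interval (notation assumed here), a null displacement requires both sampling instants to fall in the same rest. In the stationary regime the first instant falls in a rest with probability $\bar{\tau}/(\bar{t}+\bar{\tau})$, and by the memoryless property the remaining rest time is again exponential with mean $\bar{\tau}$, so that

$$C_0(\bar{\Delta}) = \frac{\bar{\tau}}{\bar{t}+\bar{\tau}}\; e^{-\bar{\Delta}/\bar{\tau}} .$$

This is a sketch under the exponential assumption, not a transcription of equation (S17). It reproduces the two limits quoted above: the fraction of time spent at rest for $\bar{\Delta} \to 0$, and zero for $\bar{\Delta} \to \infty$.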

We show in figure 2*a* the dependence of the continuous part of *P*_{ℓ*>0}(ℓ*) on $\bar{\tau}$, keeping the average travel time $\bar{t}$ fixed to the experimental value of 0.30 h for car mobility [7].

We note that *P*_{cont}(ℓ*) can have a maximum, even if the original distribution *P*(ℓ) is a decreasing function. The measurements allow us to recover the exponential tail of travel times only if the resting time is sufficiently long. Conversely, when the sampling time is larger than the average duration of a rest, the result of the sampling is manifestly different from the original exponential distribution. In figure 2*b*, we take the values of $\bar{t}$ and $\bar{\tau}$ observed for vehicular mobility (see Methods) and study the outcome for different sampling times $\bar{\Delta}$. Naturally, $\bar{\Delta}$ acts as a cut-off because all moves longer than this value are necessarily interrupted by the sampling. By contrast, for large values of $\bar{\Delta}$, the number of short travels is underestimated, as subsequent short moves may be joined together and thus appear as an effective long one.

We also computed exactly the first two moments of the distribution equation (2.4) and found for the average

(2.7)

(see electronic supplementary material, equation (S19) and equation (S26) for the second moment). Naturally, the exclusion of the null displacements influences the value of the distribution's moments. In particular, the average value of equation (2.6) can be computed by a simple rescaling and reads

$$\langle \ell^* \rangle_{\ell^*>0} = \frac{\langle \ell^* \rangle}{1 - C_0(\bar{\Delta})}. \tag{2.8}$$

This rescaling yields notable changes in the numerical values of the moments. For instance, with realistic values of $\bar{t}$ and $\bar{\tau}$ for car mobility, a sampling time of 1 h gives 〈ℓ*〉/*v* ≈ 0.11 h, while excluding the zero-displacement part, we obtain 〈ℓ*〉_{ℓ*>0}/*v* ≈ 0.27 h.
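
As a quick consistency check on these two numbers, assume that the rescaling divides the full average by the non-null fraction $1 - C_0$ (the same normalization used later for *F*_{good} = *P*_{good}/(1 − *C*_{0})). The quoted averages then imply

$$1 - C_0 \approx \frac{0.11\ \mathrm{h}}{0.27\ \mathrm{h}} \approx 0.41, \qquad C_0 \approx 0.59,$$

which agrees with the value *C*_{0} ≈ 60% quoted earlier for a sampling time of 1 h.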

#### 2.1.2. Optimal sampling times

We first note that high-frequency sampling (Δ → 0) does not automatically allow one to understand the whole trajectories under the naive assumptions (i) and (ii). Indeed, it is only with additional data that we can correctly reconstruct a whole trajectory. It is then necessary to implement a ‘segmentation’ algorithm that goes beyond the assumption (ii) that an observed displacement corresponds to one single move, as Δ → 0 implies that any move is cut into a very large number of segments [17]. In addition, high-frequency recordings are known to present uncertainties and systematic errors that need to be taken into account for extracting meaningful information [17,20,51–53]. A good segmentation algorithm should take into account the noise, the spatial scale and characteristic speeds of the tracked subjects. Here, it is not our intent to develop detailed segmentation methods, but to show the quality, and the limits, of the simpler assumption that one observed displacement is equal to one move. In this framework, having Δ → 0 means that we measure moves over a very short time, obtaining thus a distribution of measured displacement peaked at very small values and indicating that very high-frequency rates are not good under assumption (ii).

We can define an ‘optimal constant sampling time’ in two different ways: either as the time interval that correctly estimates the average length of moves, or as the time interval that maximizes the fraction of correctly sampled moves. The second approach offers a more general perspective, introducing a dimensionless measure for the quality of the sampling but which is unfortunately not a natural and common observable in experimental ecology or human mobility. For this reason, we consider in parallel the first approach that is based on a more natural quantity, the average displacement, which also has the merit of focusing on the character of the displacement distribution and therefore on what is perhaps the most controversial topic associated with individual trajectories: the mis-identification of a Lévy walk from empirical data. In the following, we obtain exact formulas for both in the exponential–exponential case (i.e. with conditions described by equation (2.1)).

#### 2.1.3. Average move duration and total number of moves

The optimal sampling time can be obtained by solving for $\bar{\Delta}$ the equation $\langle \ell^* \rangle_{\ell^*>0} = \bar{\ell}$. The solution can be written in the form

(2.9)

where *W*(*x*) is the Lambert function, such that *W*(*x*)e^{W(x)} = *x*. This function is defined for *x* ≥ −e^{−1}, which always holds in our case because . Using the empirical values , , we obtain . This result is confirmed by Monte Carlo simulations (figure 3), where red circles represent the values for . With this ‘optimal’ sampling time based on the first moment, the second moment is slightly underestimated. Note that matching the average travel time is equivalent to correctly estimating the number *n* of trips, i.e. of moves and stops (see inset in figure 3*a*), which is estimated by the number *n** of pairs of consecutive sampled points *k* and *k* + 1 with ℓ*_{k} = *x**_{k+1} − *x**_{k} > 0. For sampling intervals longer than this optimum, the trajectory is under-sampled (*n** < *n*) and trip lengths are overestimated, while for shorter intervals it is over-sampled (*n** > *n*) and trip lengths are underestimated.
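
The matching condition can also be solved numerically. The sketch below assumes two closed forms that are not transcribed from the paper's equations: the stationary mean displacement per sampling interval, `delta * t_bar / (t_bar + tau_bar)` in units of time, and the exponential-case null fraction `tau_bar / (t_bar + tau_bar) * exp(-delta / tau_bar)`. The function names and the value `tau_bar = 2.5` used in the usage note are hypothetical.

```python
import math


def rescaled_mean_duration(delta, t_bar, tau_bar):
    """Mean apparent move duration over non-null intervals (assumed closed form)."""
    rest_fraction = tau_bar / (t_bar + tau_bar)
    c0 = rest_fraction * math.exp(-delta / tau_bar)   # null-displacement fraction
    mean_all = delta * t_bar / (t_bar + tau_bar)      # stationary mean, nulls included
    return mean_all / (1.0 - c0)


def optimal_interval(t_bar, tau_bar, tol=1e-12):
    """Bisection for the interval at which the rescaled mean equals t_bar."""
    lo, hi = 1e-9, t_bar + tau_bar   # the rescaled mean is below t_bar at lo, above at hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if rescaled_mean_duration(mid, t_bar, tau_bar) < t_bar:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Calling `optimal_interval(0.3, 2.5)` returns the interval, in hours, for which the average sampled move duration matches the true one under these assumptions. With the same assumptions the root can be written with the principal branch of the Lambert function as $\bar{\Delta} = \bar{t} + \bar{\tau} + \bar{\tau}\, W\!\left(-e^{-1-\bar{t}/\bar{\tau}}\right)$, which shows one way a Lambert-function solution of this kind can arise.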

This point of view about the number of moves allows us to extend the validity of this optimal sampling to higher dimensionality (two or three dimensions) and to any distribution *P*(*v*). The dimensionality of space indeed does not influence the moves' number counting. To illustrate this, we extend this analysis in the electronic supplementary material, section ‘Numerical analysis’ with a Monte Carlo simulation in the case where speed is a random variable depending on the move duration [7]. In this case, our exact results for *P*(ℓ) do not hold anymore, because moves have different speeds. Nevertheless, the value given by equation (2.9) only under-estimates the mean displacement length with varying speeds by some 5%.

More generally, all our analytical results concern the stationary regime of the renewal process. This stationary regime exists only if the mean values $\bar{t}$ and $\bar{\tau}$ are finite (see electronic supplementary material, section ‘Analytical calculations’). The distributions *P*(*t*) and *P*(*τ*) can thus have power-law tails, in principle, for our results to hold, but only with large enough exponents.

#### 2.1.4. Fraction of correctly sampled moves

In order to estimate , we have to compute the fraction *F*_{good} of movements that are correctly measured. This occurs when two consecutive sampling times fall during the rests immediately before and after a move, say *θ**_{k} in the rest *τ*_{m} and $\theta^*_{k+1} = \theta^*_k + \bar{\Delta}$ in the rest *τ*_{m+1}. The probability *P*_{good} of the latter event and the fraction *F*_{good} = *P*_{good}/(1 − *C*_{0}) are calculated in the electronic supplementary material, section ‘Analytical calculations’. In the case of exponential distributions, we obtain the explicit expression (see electronic supplementary material, equation (S37))

(2.10)

In figure 3*b*, we compare the shape of *F*_{good} for fixed values of $\bar{t}$ and $\bar{\tau}$ with the result of a Monte Carlo simulation. For empirical values valid for car mobility (, ), the curve has a maximum for a sampling time given by (102 min). Both the value of and the height of the maximum of *F*_{good}() depend on the ratio (figure 4*a*). They are however independent of the spatial embedding and of the characteristics of *P*(*v*). The quantity is associated with the largest value of for the data sources we have analysed (mobile data, GPS trajectories and car mobility, see electronic supplementary material, table S1), and thus represents the best possible value associated with human mobility at an urban scale. It is remarkable that the optimal fraction of sampled movements in human mobility is so low that essentially one half of the moves are cut or merged during the sampling, limiting the possibility of understanding the individuals' behaviour. We also note that the value *F*_{good}() is not far from 51% (figure 3*b*). We thus see that, even if the measured and real distributions are similar with comparable first moments, we are often describing different movements. The nature of the process, characterized by $\bar{t}$ and $\bar{\tau}$, limits our knowledge of the system for any value of $\bar{\Delta}$.
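
To make the shape of this curve concrete, the sketch below evaluates a closed form for the fraction of correctly sampled moves in the exponential case. The expression is derived here from the memoryless property (a rest ends, exactly one full move takes place, and the next rest is still running one sampling interval later). It is an assumption of this sketch and not a transcription of equation (2.10) or (S37). The names `f_good` and `best_interval` and the value `tau_bar = 2.5` are hypothetical.

```python
import math


def f_good(delta, t_bar, tau_bar):
    """Fraction of non-null sampled displacements that match exactly one move."""
    if t_bar == tau_bar:
        raise ValueError("this closed form assumes t_bar != tau_bar")
    rest_fraction = tau_bar / (t_bar + tau_bar)   # probability of sampling during a rest
    e_rest = math.exp(-delta / tau_bar)
    e_move = math.exp(-delta / t_bar)
    diff = tau_bar - t_bar
    # Probability that, starting in a rest, the interval delta spans exactly one move
    # and ends in the following rest.
    one_move = delta * e_rest / diff - t_bar * tau_bar * (e_rest - e_move) / diff**2
    p_good = rest_fraction * one_move
    c0 = rest_fraction * e_rest                   # null-displacement fraction
    return p_good / (1.0 - c0)


def best_interval(t_bar, tau_bar, step=0.01, max_delta=10.0):
    """Grid search for the sampling interval that maximizes f_good."""
    grid = [step * k for k in range(1, int(max_delta / step) + 1)]
    return max(grid, key=lambda d: f_good(d, t_bar, tau_bar))
```

With a mean move of 0.3 h and a hypothetical mean rest of 2.5 h, `best_interval(0.3, 2.5)` locates a maximum of roughly one half, in line with the order of magnitude discussed in the text. A grid search is used instead of a derivative-based optimizer to keep the sketch dependency-free.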

The maximal value is naturally associated with another optimal sampling time representing the conditions for which we sample correctly the largest number of moves. This optimal sampling time is of the same order as $\bar{t}$ and $\bar{\tau}$: . The function can be approximated as a constant when studying human mobility at an urban scale, or other datasets sharing similar ratios (figure 4):

(2.11)

This result suggests that the sampling with $\bar{\Delta} \ll \bar{t}, \bar{\tau}$ (that is, substantially more frequently than the time frame of an average move or rest) is not optimal and will lead to incorrect results. This is apparently paradoxical, because if the trajectory is very well sampled, then it would be relatively straightforward to build an algorithm that reconstructs correctly moves and rests. However, such a high-frequency sampling is useful only when we have additional information that allows one to reconstruct the trajectory, which can be done with more advanced technologies that do not need assumptions (i) and (ii).

### 2.2. Sampling human movements

The conditions of equations (2.1) and (2.2) define a process where both travel and rest times have a short-tailed distribution and the trajectory sampling is strictly periodic. While this allowed us to find exact analytical expressions and to uncover important effects of sampling on the statistical properties of trajectories, real-world problems are much more involved. Indeed, human travel times are characterized by short-tailed distributions (see [7] and references therein), and resting time can be broadly distributed for both humans [4,54] and animals (see [55] and references therein). In addition, the trajectory can be sampled with a random inter-sampling time.

We expect, in general, to observe the same behaviour as the exponential–exponential case (described by equation (2.1)) studied above for any peaked distribution of rest and move durations (i.e. when both the first two moments converge). We show here that when rests or sampling times are broadly distributed, the outcome of the sampling will be necessarily worse. The exponential–exponential conditions discussed above therefore correspond to the best-case among the typical scenarios observed empirically (although better sampling might be eventually obtained in marginal scenarios such as fixed rest and move times for example). We first confirm with Monte Carlo simulations the validity for more complex cases of the results obtained above for a constant sampling time interval. In particular, we show (see electronic supplementary material, section ‘Numerical analysis’ and table S1 for details) how the sampling quality *F*_{good} for cars' mobility progressively decreases from the upper bound of 51% when introducing randomness in sampling times (exponential or power-law) and in rest durations. For instance, introducing a broad *P*(*τ*) yields values of *F*_{good} lower than 40%, while a broad *P*(Δ) yields a *F*_{good} lower than 30%. We finally predict that, when coupling a broad *P*(Δ) and a broad *P*(*τ*) (as observed for mobile phone data), the quality of the sampling decreases significantly, with *F*_{good} falling to 23%.

We illustrate these different results on a spatio-temporal high-resolution dataset, namely the GeoLife GPS trajectories [56,57]. The data consist of coordinates given every 5 s, thus allowing us to perform a speed-based sequencing (see Methods). We measure the properties of the sequenced trajectories and find again an average trip time . The average rest time drops to , because data allow us to define activities at a finer scale. Using the functional form for *F*_{good} given by equation (2.10) for the ideal case, we find that the upper bound for the sampling quality declines substantially to . In the following, we use these GeoLife GPS trajectories to study the effect of sampling on real trajectories. In particular, we will validate the previous results by studying the effect of constant sampling and then use mobile phone data to sample the GPS trajectories with a random sampling interval.

We first sample the trajectories with a constant time interval $\bar{\Delta}$ that varies between 1 min and 6 h. For each value of $\bar{\Delta}$, we compute the fraction of the trips that are correctly identified. The results are represented in figure 5. They confirm our analytical predictions. Indeed, we find that there exists an optimum value of the sampling time . Even though this was not expected, because of a non-exponential *P*(*τ*), this value coincides with the predicted maximum (the theoretical curve is represented as a dashed line in figure 5). The fraction of correctly sampled moves is lower than in the idealized case, with at best 18% of the trips recovered (*F*_{good} ≈ 0.18) with a constant sampling interval.

We also estimate the average length of the sampled trips for every value of the sampling interval and compare it with the average trip length in the original sequenced trajectory (results are represented in electronic supplementary material, figure S3). The optimal value of the sampling time is much smaller than the one maximizing the number of correctly sampled trips. Furthermore, we find that, at the optimal sampling interval , the average sampled trip two-dimensional displacement is about two times larger than the average trip length of the original trajectories.

In the case of geo-localized data obtained from devices such as mobile phones, position and time are recorded at random times corresponding to a call or another event. The sampling time intervals are thus random variables. In general, they are distributed according to a broad law such as a power-law with exponent close to −1 [3,4]. Here, we use CDR mobile phone data from Senegal [58] and, as commonly done [35,40], extract the duration between calls of the users with extremely high average call frequency, in the same spirit as in [3]. We then sample the sequenced GPS trajectory using these durations. The result is staggering: only 11% of the trips are correctly sampled. One may argue that calls and rests are correlated, or that calls done during moves can be filtered out. We thus computed the proportion of correctly sampled trips at different levels of correlation (see electronic supplementary material, section ‘Correlations between calls and rests in empirical sampling’ for details), and find that, at best—when we only have calls during rests—only 16% of trips are recovered. The use of CDR mobile phone data or of any dataset presenting a long-tailed inter-event time distribution to study mobility is thus very questionable. We note that forcing a perfect correlation between calls and rests amounts to forcing assumption (i) presented above. Yet, the trajectory is still poorly sampled, meaning that assumption (ii) is flawed.
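
When real communication records are not at hand, a synthetic stand-in for such broadly distributed inter-event times can be drawn by inverse-transform sampling. The sketch below assumes a pure power law with exponent −1 truncated between two cut-offs, which makes the intervals log-uniform. The function names and the cut-off values in the usage note are hypothetical, and this is not the procedure used in the paper, which relies on empirical durations.

```python
import math
import random


def sample_intervals(n, d_min, d_max, seed=0):
    """Draw n inter-event times with density proportional to 1/delta on [d_min, d_max]."""
    rng = random.Random(seed)
    log_ratio = math.log(d_max / d_min)
    # Inverse transform: delta = d_min * (d_max / d_min) ** u, with u uniform in [0, 1).
    return [d_min * math.exp(rng.random() * log_ratio) for _ in range(n)]


def sampling_times(intervals):
    """Cumulative sums of the intervals: the instants at which a position is recorded."""
    times, now = [], 0.0
    for d in intervals:
        now += d
        times.append(now)
    return times
```

For example, `sampling_times(sample_intervals(1000, 1 / 60, 48.0))` produces 1000 sampling instants, in hours, with gaps ranging from one minute to two days. These instants can then be used to probe a sequenced trajectory in place of a regular grid.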

## 3. Discussion

A key aspect of every experimental science is to be aware of the limits of the experiment's set-up and of the measuring apparatus. Unfortunately, this point has often been neglected in the recent trend of data-driven studies. The desire for novel, large-impact results is leading to studies where many corners are cut. As a consequence, a large number of quantitative results are sustained almost exclusively by the sheer amount of data gathered, even when those data are not adequate for the problem at hand: not all biases do average themselves out. This is particularly true for the study of trajectories from sampling movements in space. The choices taken for trajectory segmentation, together with the temporal and spatial granularity of the measures, influence all quantities associated with these trajectories [20,34].

In this paper, we have shown that for any sampling of a trajectory alternating rests and movements (of animals, humans or artefacts) the assumptions that each measure corresponds to a rest and that an observed displacement corresponds to a move are intrinsically flawed. We solved analytically an idealized case which shows that the fraction of trips that are correctly identified with a constant sampling time interval is intrinsically limited, and that this limit is *at best* 51% for humans moving at the urban scale. We also showed that this fraction is significantly lower in any other realistic scenario, especially when mobility is being studied through the lens of mobile phone communications: using phone calls in order to track mobility gives correct predictions for 23% of the trips made with a car. Results get even worse if one wants to investigate mobility at a finer scale: using high-resolution GPS data the value drops down to 11%, and we estimate that no more than 16% of movements can be recovered, even if a perfect stay-point identification algorithm is applied. These figures (summarized in electronic supplementary material, table S1) cast a shadow on the possibility of understanding [3] and modelling [4] human mobility from CDR data. Our ability to predict individuals' movements [40] is limited not only by the temporal and spatial scales of analysis [59,60], but also, and predominantly, by limitations inherent to the data sources. We provided new analytical tools to evaluate the quality of a sampled trajectory for the study of both animal and human movements. Positions must be collected (or, when necessary for historical comparisons, down-sampled [34]) at least with a frequency commensurate with the underlying moving and resting dynamics ($\bar{\Delta} \sim \bar{t}, \bar{\tau}$). Alternatively, stay points can be reconstructed from high-frequency sampling ($\bar{\Delta} \ll \bar{t}, \bar{\tau}$), but not when one has bursty inter-event times, because during the numerous extreme events constituting the long tail of the distribution *P*(Δ) the information on the movements is simply absent. Further studies and rigorous analysis of the empirical methods used in many studies are thus necessary in order to construct solid foundations for our knowledge.

## 4. Methods

### 4.1. GPS data

In order to prove the validity of our claims, we test the above predictions on high-resolution data, the GeoLife GPS trajectories [56]. This dataset consists of the trajectories of 182 subjects registered by a GPS device over the course of 3 years. The database contains 17 621 trajectories for a cumulated travel length of more than 1 000 000 km. Most trajectories are logged with a temporal precision of the order of the second.

Because the term ‘rest’ has a behavioural connotation, we will talk in the following about stay points [42]. These are locations where an individual stays for a certain period of time and from which he or she does not depart too much. Of course, the identification of stay points depends on the spatial and temporal granularity of the data [20].

As mentioned in the Introduction, the absence of contextual information forces us to make more or less realistic assumptions in order to identify travelling times and rests. We begin by filtering out the trajectories that are less than 1 km long, as they are not representative. We then proceed to identify stay points as follows:

— we consider all points around the point *p*_{t} in a moving time window of duration *τ* = 10 s around *t*;

— in this window, we compute the average movement speed between successive trajectory points;

— if the average speed is lower than 2 m s^{−1} (fast walker), we identify *p*_{t} as a stay point;

— we iterate the procedure for all points in the trajectory; and

— we aggregate consecutive stay points if the move in-between is less than 100 m and aggregate consecutive moves if the intermediate rest is shorter than 5 min.

The last step is introduced in order to minimize the impact of fluctuations in the GPS reading. After this procedure, we obtain individual trajectories where stay points are identified.
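The procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the analysis: it assumes that each trajectory is a time-sorted list of `(t, x, y)` tuples with `t` in seconds and planar coordinates in metres, that the 10 s window is centred on *t*, that the average speed is the path length in the window divided by the elapsed time, and that short moves are absorbed before short rests. The function names (`classify_points`, `runs`, `aggregate`) are hypothetical.

```python
from math import hypot

def path_length(points):
    """Length in metres of the polyline through (t, x, y) points."""
    return sum(hypot(b[1] - a[1], b[2] - a[2]) for a, b in zip(points, points[1:]))

def classify_points(track, window=10.0, v_max=2.0):
    """Return one flag per point: True for a stay point, False otherwise."""
    half = window / 2.0
    flags = []
    for t, _, _ in track:
        # Points inside the time window of duration `window` centred on t
        # (a quadratic scan, kept for clarity rather than speed).
        win = [p for p in track if abs(p[0] - t) <= half]
        elapsed = win[-1][0] - win[0][0]
        # Assumption: a window holding a single point counts as a stay.
        speed = path_length(win) / elapsed if elapsed > 0 else 0.0
        flags.append(speed < v_max)
    return flags

def runs(flags):
    """Run-length encode flags as (value, first_index, last_index)."""
    out, start = [], 0
    for i in range(1, len(flags) + 1):
        if i == len(flags) or flags[i] != flags[start]:
            out.append((flags[start], start, i - 1))
            start = i
    return out

def aggregate(track, flags, min_move=100.0, min_rest=300.0):
    """Absorb moves shorter than min_move (m), then rests shorter than min_rest (s)."""
    flags = list(flags)
    for target in (False, True):  # first short moves, then short rests
        rs = runs(flags)
        for k, (value, i, j) in enumerate(rs):
            if value != target or k == 0 or k == len(rs) - 1:
                continue  # only runs enclosed by two runs of the other kind
            if target is False:
                too_short = path_length(track[i - 1:j + 2]) < min_move
            else:
                too_short = track[j][0] - track[i][0] < min_rest
            if too_short:
                flags[i:j + 1] = [not target] * (j - i + 1)
    return runs(flags)

# Synthetic 1 Hz track: 10 min at rest, 60 s at 10 m/s, 10 min at rest.
track = [(float(t), 10.0 * min(max(t - 600, 0), 60), 0.0) for t in range(1260)]
segments = aggregate(track, classify_points(track))
```

On this synthetic track, `segments` alternates stay, move, stay. A 30 m jump between two rests, which mimics a fluctuation in the GPS reading, would instead be absorbed into a single stay by the 100 m threshold.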

We find the average travel and rest times and . The average travel time is identical to that observed for vehicular mobility. The average duration of a rest is, however, significantly shorter.

### 4.2. Call detail records data

We use the dataset 2 ‘fine-grained mobility’ of the Orange data made available for the D4D challenge [58], which provides anonymized individual CDR records. For privacy reasons, the caller IDs are reshuffled every 15 days. The dataset spans 25 such 15-day periods. The selection procedure that is most often used is the one proposed in [40], i.e. selecting only the individuals whose average call frequency is greater than 0.5 calls/hour. Here, we allowed for a more conservative margin by selecting only the 1.1% of individuals who had more than 1 call/hour in a period of 15 days. Furthermore, the data provide call time stamps with a 10-minute granularity. We apply a smoothing procedure which consists of picking a time uniformly at random between *M* − 5 and *M* + 5, where *M* is the value in minutes indicated by the time stamp. One should bear in mind that the mobile phone CDR and GPS trajectories come from two independent datasets describing two different populations and times of the year. For this reason, we did not enforce calendar synchronization between the datasets, but used the CDR data to randomly extract real inter-event times with the appropriate minimal frequency. More accurate numbers would thus be obtained in a situation where information on calls and trajectories is available for the same user.
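The selection and smoothing steps described above can be illustrated as follows. This is only a sketch under stated assumptions: time stamps are given in minutes, the input list `stamps` is a made-up example, and the function names are hypothetical rather than taken from the original analysis.

```python
import random

def is_frequent_caller(n_calls, period_hours=15 * 24, min_rate=1.0):
    """True if the user exceeds min_rate calls per hour over one 15-day period."""
    return n_calls / period_hours > min_rate

def smooth_timestamps(stamps_min, half_width=5.0, rng=random):
    """Replace each time stamp M (minutes) by a uniform draw in [M - 5, M + 5]."""
    return sorted(rng.uniform(m - half_width, m + half_width) for m in stamps_min)

def inter_event_times(times):
    """Differences between consecutive (sorted) event times."""
    return [b - a for a, b in zip(times, times[1:])]

rng = random.Random(0)           # seeded only to make the sketch reproducible
stamps = [0, 10, 10, 30, 60]     # hypothetical 10-minute-binned call times
calls = smooth_timestamps(stamps, rng=rng)
gaps = inter_event_times(calls)  # inter-event times, in minutes
```

Smoothing breaks the artificial ties created by the 10-minute binning, so that two calls recorded with the same time stamp yield a non-zero inter-event time.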

### 4.3. Characteristic times for car mobility

We need to identify the values of the average travel time and the average rest time in conditions that realistically describe human mobility. We do this by using the results of the analysis of urban and inter-urban traffic of private vehicles in Italy [7].

The average travel time observed for Italian cars is . Moreover, as discussed in [7] and references therein, in private as in public transportation, the distribution of trip durations *P*(*t*) in a city is short-tailed. A similar result has been found in taxi rides, in survey data (where also ) and on the GPS data [57] we use in this work (for separated modes of transport). For this reason, we can safely limit our numerical analysis to the case of exponential *P*(*t*).

Concerning rest times, two different functional forms have been proposed for the distribution *P*(*τ*). Car parking durations have been fitted with a stretched exponential:

$$P(\tau) \propto \exp\left[-\left(\frac{\tau}{\tau_0}\right)^{\beta}\right], \tag{4.1}$$

with *τ*_{0} ≈ 10^{−4} h and *β* ≈ 0.19 [7]. For mobile phone data, a truncated power-law fit has been proposed:

$$P(\tau) \propto \tau^{-\gamma}\,\exp\left(-\frac{\tau}{\tau_e}\right), \tag{4.2}$$

with *γ* ≈ 1.8 and *τ*_{e} ≈ 17 h [4]. This fit is made on movements sampled at best with (it is thus expected to be influenced by the sampling issues described above), and does not allow one to identify rests shorter than 1 h. Note that in estimating the distribution's average below, we are extending the distribution (4.2) below this experimental range.

In our analytical study, we assume the distribution *P*(*τ*) to be exponential, although in general it is not. We therefore estimate the average rest time by averaging the distributions (4.1) and (4.2) between 5 min and 24 h, which corresponds to selecting only individuals moving every day. We obtain average rest times of 2.49 h and 0.55 h, respectively. To have a consistent description of car mobility, we choose to use the value obtained from the car parking distribution (4.1), 2.49 h. As our results suggest that the larger the average rest time, the better the sampling, our choice also defines a best-case scenario. In our numerical study, we will instead use the whole distributions given above.
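The two truncated averages can be checked numerically. The sketch below assumes the standard unnormalized forms for the two fits, namely a stretched exponential exp[−(*τ*/*τ*_{0})^{*β*}] and a truncated power law *τ*^{−*γ*} exp(−*τ*/*τ*_{e}), with the parameter values quoted above. The integration scheme (Simpson's rule in log *τ*) and the function names are choices made for this illustration only.

```python
from math import exp, log

def truncated_mean(pdf, a, b, n=4000):
    """Mean of an unnormalized density restricted to [a, b].

    Uses Simpson's rule on a grid uniform in log(tau); n must be even.
    """
    h = (log(b) - log(a)) / n
    num = den = 0.0
    for i in range(n + 1):
        tau = a * exp(i * h)
        w = 1 if i in (0, n) else (4 if i % 2 else 2)
        # d(tau) = tau d(log tau), hence the extra factor tau in both sums.
        den += w * pdf(tau) * tau
        num += w * pdf(tau) * tau * tau
    return num / den

def stretched_exponential(tau, tau0=1e-4, beta=0.19):
    """Assumed form of the car parking fit (tau in hours)."""
    return exp(-((tau / tau0) ** beta))

def truncated_power_law(tau, gamma=1.8, tau_e=17.0):
    """Assumed form of the mobile phone fit (tau in hours)."""
    return tau ** (-gamma) * exp(-tau / tau_e)

lo, hi = 5.0 / 60.0, 24.0  # 5 min and 24 h, in hours
mean_parking = truncated_mean(stretched_exponential, lo, hi)
mean_phone = truncated_mean(truncated_power_law, lo, hi)
print(f"stretched exponential: {mean_parking:.2f} h")
print(f"truncated power law:   {mean_phone:.2f} h")
```

Under these assumed forms, the two printed values should land close to the 2.49 h and 0.55 h quoted above. The logarithmic grid is used because both densities vary over several orders of magnitude between 5 min and 24 h.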

## Data accessibility

The GeoLife GPS trajectories used here for empirical validation are publicly available [56] at https://research.microsoft.com/en-us/downloads/b16d359d-d164-469e-9fd4-daa38f2b2e13/ (last accessed 11 October 2017).

## Authors' contributions

R.G., R.L., J.M.L. and M.B. designed the research and wrote the text. R.G. performed the numerical analysis. R.L. performed the data analysis. J.M.L. performed the analytical calculations. R.G. and R.L. prepared the figures.

## Competing interests

We declare we have no competing interests.

## Funding

R.G. has received funding from the SESAR Joint Undertaking under grant agreement no. 699260 included in the European Union's Horizon 2020 research and innovation programme. R.L. acknowledges support from the James S. McDonnell Foundation through a Postdoctoral Fellowship.

## Acknowledgements

R.G. thanks M. Lenormand and T. Louail for useful discussions.

## Footnotes

Electronic supplementary material is available online at https://dx.doi.org/10.6084/m9.figshare.c.3982608.v2.

- Received October 19, 2017.
- Accepted January 11, 2018.

- © 2018 The Author(s)

Published by the Royal Society. All rights reserved.