Bird identification with radar is important for bird migration research, environmental impact assessments (e.g. wind farms), aircraft security and radar meteorology. In a study on bird migration, radar signals from birds, insects and ground clutter were recorded. Signals from birds show a typical pattern due to wing flapping. The data were labelled by experts into the four classes BIRD, INSECT, CLUTTER and UFO (unidentifiable signals). We present a classification algorithm aimed at automatic recognition of bird targets. Variables related to signal intensity and wing flapping pattern were extracted (via continuous wavelet transform). We used support vector classifiers to build predictive models. We estimated classification performance via cross validation on four datasets. When data from the same dataset were used for training and testing the classifier, the classification performance was extremely to moderately high. When data from one dataset were used for training and the three remaining datasets were used as test sets, the performance was lower but still extremely to moderately high. This shows that the method generalizes well across different locations or times. Our method provides a substantial gain of time when birds must be identified in large collections of radar signals and it represents the first substantial step in developing a real time bird identification radar system. We provide some guidelines and ideas for future research.
Bird movements have been studied by means of radar for almost 60 years (Eastwood 1967). With radar, birds can be detected over a wide range of distances (from a few metres up to 100 km) during day as well as night and under all weather conditions except heavy precipitation. It allows a precise estimation of target distance and depending on the type of radar and the operational mode also of flight altitude. Thus, the information available from radar is in the best case the position in time and space, and the amount of energy reflected.
Radar has been used: (i) in studies of bird migration (see the reviews by Bruderer (1997a,b), Gauthreaux & Belser (2003) and Schmaljohann et al. (2007b)), (ii) in studies of population dynamics of breeding seabirds (Lilliendahl et al. 2003; Hamer et al. 2005) or roosting land birds (Russell & Gauthreaux 1998; Bäckman & Alerstam 2002), (iii) monitoring of bird traffic at prospective locations of power lines, wind farms or bridges (Harmata et al. 1999; Desholm & Kahlert 2005) and (iv) in the context of bird strike avoidance near airports and en route (L. Buurma 1996, personal communication).
The potential increase in the construction of wind farms in the coming years is a serious concern for avian conservation (Garthe & Hüppop 2004) and radar methodology has been of great importance for environmental impact assessment (Desholm et al. 2006).
An important area of the application of radar ornithology is to quantify the numbers of birds aloft. At a first look the principle seems simple—the targets detected in the radar beam are used to infer the spatial or temporal distribution of birds. But insects, bats, other flying objects and weather phenomena can provide a significant proportion of the signals observed (Eastwood 1967). The major problem in terms of abundance is the distinction between birds and insects. Insect movements are very common (e.g. Riley & Reynolds 1986; Smith et al. 1993; Feng et al. 2007) and can interfere considerably with the studies on bird movements (Schmaljohann et al. in press). Airborne insect abundance depends mainly on the geographical area, the season and time of day. Hence, the severity of the problem is likely to be very variable depending on the recording location and time. In the past, either this problem was at least partly overlooked (cf. Schmaljohann et al. 2007a) or various means were used to try to separate bird signals from non-bird ones.
The intensity of radar signals (echo size) has been used jointly with the information of flight speed in several studies to identify birds up to the species level (e.g. Russell & Gauthreaux 1998; Lilliendahl et al. 2003). Using signal intensity alone would be questionable since large insects or insect swarms can produce signals of a magnitude similar to that of small birds. In the above-mentioned studies, the information on signal intensity was improved by prior knowledge of the abundance of the species under investigation. Unfortunately, such prior knowledge is rarely available. For the majority of migratory birds (small passerines), the intensity of radar signals will not suffice to distinguish between birds and insects, because there is a large overlap in signal intensity between small birds and large insects (Larkin 1991).
Generally, birds fly at higher airspeeds (the speed with respect to the air mass) than insects (Bloch & Bruderer 1982; Larkin 1991; Feng et al. 2007). Ground speed (the speed with respect to the ground) together with signal intensity is used in several studies to distinguish between birds and non-birds (Lilliendahl et al. 2003; Hamer et al. 2005). Ground speed depends on airspeed as well as wind speed and wind direction, and therefore wind information must be taken into account to calculate the airspeed of a bird or insect. Even if very accurate wind profiles were available, a clear separation of birds and insects is difficult due to overlap in airspeed between the two groups (Schmaljohann et al. in press). The study of Larkin (1991) shows that this overlap is far from negligible but suggests that airspeed can be a valuable tool for partially discriminating between birds and insects. The severity of the overlap depends on the insect species aloft and therefore information on local insect fauna is needed. The problem is expected to be more severe in regions or during seasons with a high abundance of large flying insects.
Recent studies performed with polarimetric weather radar, which transmits and receives horizontally and vertically polarized pulses, indicate that it can be used to discriminate between radar returns produced by birds and insects (Zrnic & Ryzhkov 1998; Bachmann & Zrnic 2007).
The temporal pattern of the signal intensity (echo signature) depends mainly on the variation in the shape of the target (e.g. by flapping). In various former studies, we tracked birds and insects by radar and identified the target through a telescope mounted parallel to the radar beam. This provided radar signals of visually identified targets, which enabled us to learn how the target identity affects the recorded signals. A typical pattern in signals produced by birds is a periodic fluctuation of signal intensity, which is caused by change in body shape due to flapping. For bird targets the frequency of the fluctuation either directly reflects wing flapping frequency or several harmonics (integer multiple of the frequency) are present in the signal. In the latter case wing flapping frequency is easily deduced since it is the fundamental frequency of the harmonics. In our data, we found that insects did occasionally produce periodic signals, but these were close in shape to a pure sine wave (unlike in birds) and their fundamental frequency was generally higher than that in birds. Several studies have pointed out that this pattern can be used to distinguish between the signals of birds and insects, mainly due to the distinctive wing flapping pattern in the bird signals. (Bonham & Blake 1956; Houghton 1964; Bruderer & Joss 1969; Bruderer & Steidinger 1972; figure 1). Bats tracked by radar produced radar signals similar to those of birds (Bruderer & Popa-Lisseanu 2005).
An accurate method to identify bird signals is ‘plot-based labelling’. That is, to look at the plots of the signals. Additionally, plots of the spectral density of the signals (e.g. Fourier transforms or wavelet transforms) can be used. Plot-based labelling has been successfully applied in numerous studies (Bloch et al. 1981; Stark & Liechti 1993; Liechti & Bruderer 1995; Bruderer & Liechti 1998; Liechti & Schaller 1999; Nievergelt et al. 1999; Hedenström & Liechti 2001; Bäckman & Alerstam 2003; Schmaljohann et al. 2007b). However, the method has some limitation—it is very time consuming, it requires a highly experienced person and, although there is a close match between different experts, it is not free of subjective interpretation.
Here, we present a classification algorithm that exploits the temporal pattern in the signals together with absolute signal intensity. Its aim is to automatically identify signals produced by birds. The algorithm is based on a signal processing method (continuous wavelet transform, CWT) and a statistical learning method (support vector classifier, SVC). It is specifically designed to be applicable to large datasets.
2.1 Description of the radar system and data
For this study, we used four datasets consisting of digitized radar signals. The datasets were collected during four radar surveys in the Sahara desert (Mauritania) at two locations in the years 2003 and 2004 (see table 1). The datasets are named A, B, C and D. The radar was operated during all hours of day and night. The height over ground of the detected targets ranges from 20 to 7390 m. This radar system is a former military fire-control tracking radar with vertically polarized pulses of 3.3 cm wavelength. The radar was adapted specifically to detect birds (Bruderer et al. 1995). Fixed beam measurements (Bruderer 1971) were performed at the oasis of Ouadane (20°56′ N, 11°35′ W) and Idini (17°57′ N, 15°28′ W). In fixed beam measurements, the radar is kept stationary pointing to a given direction. All targets moving across the stationary radar beam are recorded, providing information on target distance and variation over time in target reflectivity. To cover all relevant heights, the elevation angle of the beam ranged from 6° to 84° (see electronic supplementary material). The pulse repetition frequency and the sampling frequency of the radar was 2083.33 Hz. In order to improve the signal to noise ratio, the average was taken over adjacent but non-overlapping blocks of 16 consecutive samples. Therefore the final sampling frequency of the signals is 130.21 Hz; 130 Hz will be used for simplicity hereafter. All the signals were digitized and stored automatically. Obvious clutter signals produced by diffuse targets (clouds, rain) were removed manually. Since the sensitivity of a radar increases dramatically at close range, a sensitivity time control (Skolnik 1980) was applied to eliminate all targets that would not be detected at a distance of 3 km. Thus, small insects and weak ground clutter were suppressed. For more details on data collection see Schmaljohann et al. (2007b).
Each signal is a sequence of numerical values. A signal represents the fluctuation in signal intensity (equal to reflectivity) over time. The time unit is 1/130 s. The primary unit of the radar signal P was measured in milliwatts (mW). Since this quantity is usually very small, we transformed it into a more convenient unit, the dBm. The signal intensity S in dBm is defined as S=10 log (P/1), where 1 stands for 1 mW. See the plots of figures 1–3 for examples of radar signals. The duration of the signals ranged from 1.5 to 38 s. This radar recorded the distance of the target with a resolution of 30 m. During seven months of fieldwork, over 5000 single targets were tracked simultaneously with our radar and optically through a telescope. This provided data of visually verified targets. Heiko Schmaljohann (H.S.) did most of this work and used this experience to learn to identify targets based on the plots of the signals and the plots of the spectral density of the signals (fast Fourier transform). This will be referred to as plot-based labelling hereafter. H.S. labelled all the signals used in this study via plot-based labelling. He simultaneously used several criteria. (i) Presence of a wing flapping pattern, which is indicative of a bird (figure 1a,b). The detailed shape of the pattern was also considered. A pattern close in shape to a pure sine function is indicative of an insect. (ii) Presence of repeated clusters of larger fluctuations, which is indicative of passerine-like (figure 1b) or swift-like flapping style. (iii) Value of the fundamental frequency, which reflects the wing flapping frequency. It must be between 2.5 and 30 Hz for birds. (iv) Overall curvature of the signal. A curved signal indicates a target flying through the beam (figures 1 and 2). A straight signal indicates stationary clutter (figure 3).
The four datasets of the present study have been sorted by the plot-based method by H.S. into four classes. Class labels are written in uppercase throughout this article.
BIRD, signals that can unambiguously be assigned to flying birds; INSECT, signals that can unambiguously be assigned to flying insects; CLUTTER, signals that can unambiguously be assigned to objects on the ground (e.g. trees, hills, houses); UFO, signals that are between the typical bird signals and insect signals. They are clearly produced by flying targets but cannot be unambiguously assigned to birds or insects.
It must be assumed that an unknown proportion of the UFOs were actually produced by birds. It is important to note that the uncertainty in plot-based labelling has been explicitly coded by creating the class UFO. Sand, dust or swarms of insects within the same pulse volume as a bird produce radar returns that partially mask wing flapping pattern. The bird signals from dataset D are the most contaminated with such returns due to frequent sand storms and high abundance of insects. While recording dataset D the airborne insects and sand particles were sometimes so abundant that the signals looked similar to those of clutter and such signals have been labelled as CLUTTER. The signals from datasets C and B are moderately contaminated, mainly due to the abundance of insects. The signals from dataset A are weakly contaminated thanks to the low abundance of insects. The proportion of BIRDs is small in the datasets B, C and D with values between 4 and 14% (table 1).
2.2 Merging classes to obtain a two-class problem
To avoid the additional complication of multiclass models, we recoded the classes. The class BIRD was kept and it is contrasted to the new class OTHER that comprises the INSECT, UFO and CLUTTER classes. Including the UFOs implies that the class OTHER contains an unknown amount of signals that were produced by birds. In the models presented later, the response variable is categorical with two levels (BIRD and OTHER). Therefore the classifier that we construct is aimed at separating signals, which are clearly produced by birds from all other signals.
2.3 Overview of the classification method
The classification method presented in this article is divided into two consecutive steps. (i) A data pre-processing step in which variables are extracted from the signals (§2.4). (ii) A modelling step in which SVC methods are used to train a predictive model that links the variables to the class labels (§2.5).
The ultimate purpose is that the trained model should be used to predict the class of new data (i.e. new signals that have gone through the data pre-processing step) with unknown class label. Therefore an important issue is to quantify the classification performance that can be expected from this classification method. This was done by simulating the situation where we must assign class labels to unlabelled data (§2.8): we first train a SVC with a subset of the data, the training set. Then the trained SVC is used to predict the class labels of another subset of the data, the test set. Hence the test set is treated in the same way as we would treat unlabelled data, but for the test set we know the true class labels. This enables the comparison of the predicted class labels with the true class labels. From this comparison we compute four measures of classification performance (§2.7).
We used the environment for statistical computing R (R Development Core Team 2006) with the packages kernlab (Karatzoglou et al. 2004) for the SVCs and Rwave (Carmona & Whitcher 2004) for wavelet transforms.
2.4 Pre-processing the signals
2.4.1 Distance correction
The intensity of the signals is heavily influenced by the target's distance from the radar. Distance-corrected signals were computed according to the radar equation (R−4 law, Skolnik 1980). The distance-corrected signals correspond to what would have been recorded if all the targets were at a distance of 3000 m from the radar. Equations are given in the electronic supplementary material. The distance-corrected signals were used for all further steps.
2.4.2 Variables extraction
Signals are not a suitable input format for our modelling method, the SVC. Therefore variables must be extracted from the signals. The variable extraction results in 67 variables named maxi, sd-stat, resolution, coefmean-1, coefmean-2, …, coefmean-32 and coefsd-1, coefsd-2, …, coefsd-32. Details are presented below. Variable names are in italic characters throughout this article.
maxi. While a target flies across the fixed beam, the signal intensity will first increase and then decrease. As a consequence the signal's overall shape is curved (figures 1 and 2). The maximum intensity of the distance-corrected signals was used as a variable, which is called maxi. This variable depends on the target size, shape, texture, orientation and on how close it flew to the centre of the beam.
sd-stat. We wanted to quantify the variation in the signals that is due to wing flapping. To achieve this, we stationarized the distance-corrected signals (equal to removing the overall curvature) and then computed the standard deviation of the intensity of the stationarized signals. This gives the variable sd-stat. Stationarizing was done by taking the residuals from a local regression with a moving window (lowess (Cleveland 1981)) with the window size set to 150 samples. As an example, figure 4a shows the distance-corrected signal from a bird and the local regression line and figure 4b shows the stationarized signal. The stationarized signals are not used any further.
resolution. The signal intensity can take only a discrete number of values (a consequence of digitizing). This is obvious in figure 3a where the signal intensity takes only four different values. The number of values represented in a given signal was used as a variable that we call resolution. This measures the overall variability in signal intensity.
coefmean and coefsd. This group of 2×32 variables was extracted via continuous wavelet transform (CWT), which is a tool for analysing the spectral content of signals. The input of the CWT is a radar signal and the output is a time–frequency representation of the signal (see figures 1–3). A description of this procedure is given next.
The time resolution of the signals is 130 Hz and due to the Nyquist limit they cannot represent frequencies above 65 Hz. Frequencies below 0.31 Hz were not considered because some of the signals were not long enough to detect lower frequencies. Therefore we considered 32 frequencies between 0.31 and 65 Hz. The frequencies were selected on an exponential scale, which is usual with the CWT. For a signal composed of n samples the output of the CWT is a matrix with 32 rows and n columns. The numerical values in this matrix are called wavelet coefficients and they represent the signal amplitude at a given frequency and position in time.
Before computing the CWT, the signals were prepared in two steps. (i) The distance-corrected signals were demeaned and then normalized with sd-stat (i.e. divided by sd-stat). The wavelet coefficients are proportional to the amplitude of the signal. As explained above, sd-stat quantifies the amplitude due to wing flapping and therefore normalization with sd-stat ensures that the wavelet coefficients related to wing flapping (i.e. the upper frequencies of our range) have comparable magnitudes across the signals. There is no need to remove the overall curvature before computing the CWT. The curvature produces non-zero coefficients for low frequencies (typically below 0.31 Hz), but it does not affect the coefficients for higher frequencies, which are related to wing flapping. (ii) In order to reduce edge effects, the signals were padded on both ends with 1000 samples equal to the first or last value represented in the signal. The wavelet coefficients corresponding to these padded values were removed afterwards.
Next, 2×32 variables were extracted from the matrices obtained by CWT as follows. For every row of these matrices, the mean and standard deviation of the wavelet coefficients were computed. In other words, for each of the 32 frequencies, we computed the mean and standard deviation over time of the wavelet coefficients. These variables are called coefmean 1–32 and coefsd 1–32. Smaller numbers are used to name higher frequencies. An overview of the variable names and associated frequencies is given in the electronic supplementary material. These 64 variables contain information on the periodic patterns in the signals.
2.4.3 Transformation and standardization of the variables
The variables had skewed distributions and several standard transformations were applied to obtain more symmetric distributions (see electronic supplementary material). All variables were centred and scaled to mean=0 and standard deviation=1.
2.5 Support vector classifiers
As a statistical modelling method, we used the SVC (Cortes & Vapnik 1995; Vapnik 2000; Hastie et al. 2001). The SVC accepts several predictor variables and one binary categorical response variable. In our case, the response has the two classes BIRD or OTHER. During a training phase, a function that links the response to the variables is estimated via a dataset with known class labels. The resulting function is called a trained SVC and can be used to predict the class of new data instances with unknown class labels. A trained SVC does not directly predict the class. Its basic prediction is a numerical value, which we call the ‘score’. The score provides a ranking of the data instances. Data instances with a larger score (more positive) are more likely to be BIRDs and data instances with a smaller score (more negative) are more likely to be OTHERs (see figures 5 and 6). In order to predict the class (class-level prediction), a decision threshold must be chosen on the score. For example, setting the threshold to zero means that data instances that have a predicted score greater than zero are assigned to the BIRD class and data instances that have a predicted score smaller or equal to zero are assigned to the OTHER class. In SVCs a decision threshold equal to zero leads to the lowest overall error rate (i.e. the proportion of data instances whose class is wrongly predicted). A threshold equal to zero was used for all the analyses presented in this article.
There exist several subtypes of SVCs, which are specified by a kernel function (short: kernels). In preliminary work we assessed the performance of SVCs with four different kernels: ‘linear’, ‘polynomial’, ‘radial basis function’ and ‘Laplace’. The Laplace kernel gave the best performance and was used for this study.
The SVC with Laplace kernel has two tuning parameters named C and sigma which control the capacity of the classifier (i.e. its ‘flexibility’). They are called tuning parameters because values yielding good classification performance must be chosen by the operator (see below).
2.6 Model selection
A model is defined as a subset of the 67 variables and by specified values for the tuning parameters C and sigma. A model selection was performed with dataset B to choose a subset of variables and values of the tuning parameters that yield the best classification performance. Details are given in the electronic supplementary material. Specifications of the selected model are given in §3.
2.7 Four ways to measure classification performance
An unbiased estimation of classification performance is obtained by predicting the class of data instances that were not used for training and compare these predictions with the actual class labels. For this comparison we used four performance measures: the area under the ROC curve (AUC), the accuracy (ACC), positive predictive value (PPV) and false negative rate (FNR).
The AUC (Fawcett 2006) quantifies how well two classes are separated along the score axe. The AUC is estimated with the predicted scores and the true class labels of test data instances. AUC takes values between 0 and 1. A classifier that classifies at random has an AUC of 0.5, and a classifier that achieves perfect separation has an AUC of 1. An AUC below 0.5 indicates that the classifier does worse than random guessing. AUC offers two important advantages. (i) It does not depend on the a priori frequency of the classes in the dataset (called ‘class priors’ hereafter). Thus it is the only performance measure that allows comparison across different datasets, since class priors differ between the datasets (table 1). (ii) It informs on the separating ability of the score independently of a particular decision threshold on the score. A technical description is given in the electronic supplementary material.
When class-level prediction has been made, a test dataset can be divided into four groups. (a) BIRDs that were predicted to be BIRDs. (b) OTHERs that were predicted to be OTHERs. (c) BIRDs that were predicted to be OTHERs. (d) OTHERs that were predicted to be BIRDs. Let a, b, c and d represent the number of data instances found in the four groups, thenA more detailed description is found in (Fawcett 2006).
The accuracy (ACC) is the overall proportion of correctly classified data instances. It is the ‘usual’ performance measure. However in the case of unbalanced class priors the accuracy must be interpreted with caution. High values of the accuracy are often a direct consequence of the fact that one class is over-represented and do not reflect the separating ability of the classifier. As an example consider dataset D, where only 4% of the signals are BIRDs and 96% are OTHERs (table 1). Imagine a very unspecific classifier that would assign all the signals to the OTHERs. Such a classifier would still achieve an accuracy of 96%, but it would be useless to identify BIRDs in the data.
The PPV is the proportion of data instances correctly predicted to be BIRDs among all the data instances that have been predicted to be BIRDs. Since our purpose is to isolate a subset which contains mostly BIRDs we wish PPV to be as high as possible.
The FNR is the proportion of BIRDs that are wrongly predicted to be OTHERs among the total number of BIRDs in the test set. That is, it is the proportion of BIRDs that are ‘lost’ when the classifier is used for prediction, thus we wish FNR to be as small as possible.
The PPV and the FNR are useful when the interest is on one of the two classes as is the case here. ACC, PPV and FNR are specific for a given decision threshold, unlike the AUC that is independent of the decision threshold. It is important to realize that PPV and FNR are interrelated via the decision threshold. This means that PPV can be improved by moving the decision threshold to more positive values, but this will inevitably give a worse value for the FNR.
2.8 Assessing the classification performance of the selected model on different datasets
2.8.1 The principle of cross validation
The aim of cross validation is to estimate the expected classification performance in the prospective situation where the classifier will be used for prediction on new data. The principle is to predict the score or class of a subset of data which was not used for training (a test set). Then, the predictions made by the classifier on the test set are compared with the true class labels in order to estimate the classification performance (i.e. AUC, ACC, PPV or FNR). Cross validation leads to an unbiased estimation of classification performance. We computed confidence intervals around the performance estimators (tables 2 and 3) using the method described by Nadeau & Bengio (2003). We have chosen to use the 95% confindence bounds, i.e. intervals that include the true (but unknown) classification performance in at least 95 out of 100 cases. See the electronic supplementary material for a more formal development.
2.8.2 Cross validation to assess the ‘within dataset method’
This procedure involves data from one dataset at a time. It was performed separately for the four datasets. It can be divided into three steps. (i) A training set of sample size Ntrain was randomly sampled without replacement from a dataset. The SVC was trained with this training set. (ii) A test set of sample size Ntest was randomly sampled without replacement from the remaining signals of the dataset. The score and class of the elements of the test set were predicted with the trained classifier. (iii) The four performance estimators were computed from these predictions and the class labels. Steps (i)–(iii) were repeated 30 times and the average over the repeats was used as the final estimator (table 2). Ntrain was 4000 and Ntest was 5000 except for the dataset A, where Ntest was 1878. (Additional results with Ntrain=1000, 2000 and 8000 are given in the electronic supplementary material.)
2.8.3 Cross validation to assess the ‘cross dataset method’
This procedure involves data from two datasets at a time. The training data were always taken from dataset B and the test data were taken from one of the three other datasets A, C or D. The ‘cross dataset method’ was performed separately for the datasets A, C or D. The procedure is slightly more complicated than the ‘within dataset method’ due to the fact that class priors are different in the four datasets. It can be divided into three steps. (i) A training set of size Ntrain was randomly sampled with replacement from dataset B with the constraint that the class priors are equal to those of the test dataset. The SVC was trained with this training set. (ii) A test set of sample size Ntest was randomly sampled without replacement from dataset A or C or D. The score or class of the elements of the test set was predicted with the trained classifier. (iii) The four performance estimators were computed from these predictions and the class labels. Steps (i)–(iii) were repeated 30 times and the average over the repeats was used as the final estimator (table 3). Ntrain was 4000 and Ntest was 5000. (Additional results with Ntrain=1000, 2000 and 8000 are given in the electronic supplementary material.)
2.9 Visualization of classification performance
In order to visualize the classification performance, we plotted the distribution of the test data points over the score separately for each class (see figures 5 and 6). The distributions were estimated by kernel density estimation with a Gaussian kernel. Kernel bandwidth was selected according to the method of Scott (1992). Note that the distributions are normalized to have a surface of 1. Thus the plots do not account for the differences in class priors.
3.1 Model selection
The selected model (i.e. the best performing model) includes 43 out of the 67 variables: resolution, maxi, sd-stat, coefmean 32–25 and 16–1 and coefsd 32–29 and 12–1. The selected tuning parameters for this model are C=100 and sigma=0.1. This model underlies all the results presented hereafter. The performance of a selected set of models is summarized in table 4.
3.2 Performance of the within dataset method
The smallest accuracy is obtained for dataset C (ACC=0.9594) and the highest for dataset D (ACC=0.9796). The classification performance measured with AUC is highest on the dataset A followed by B, then C and finally D (table 2). Interestingly, dataset A with the highest proportion of birds resulted in the highest AUC and D with the lowest proportion of birds resulted in the lowest AUC. Overall AUC ranges between 0.9656 and 0.9953. The best performance for PPV and FNR resulted in the dataset A. The subset of data which is predicted by the classifier to be of the class BIRD contains 97.3% (PPV=0.9733) of signals, which truly are of the class BIRD, whereas only 2.8% (FNR=0.0283) of the total amount of BIRDs in the dataset have been wrongly predicted to be of the class OTHER. The lowest performance for PPV and FNR resulted in the dataset D. The subset of data which is predicted by the classifier to be of the class BIRD contains 87% (PPV=0.8691) of signals, which truly are of the class BIRD. But 49% (FNR=0.4851) of the total amount of BIRDs in the dataset have been wrongly predicted to be of the class OTHER (table 2).
3.3 Performance of the cross dataset method
Names of datasets in this section always refer to the data used for testing. The training set was always dataset B. Qualitatively the results of the cross dataset method are similar to those of the within dataset method, but the performance is lower. The AUC is highest on the dataset A, followed by C and finally D. Note that this is the same ranking as for the within dataset method (table 3).
3.4 Plots of class-specific distribution of the score
In all the four datasets, the relative position of the estimated distributions was qualitatively similar (figures 5 and 6). In all the datasets and for both the within and cross dataset methods, the class UFO presents the largest overlap with the BIRDs.
The plots illustrate the results obtained with the AUC, which can be interpreted as a measure of the overlap between two classes along the score. The overlap between BIRDs and OTHERs is lowest for dataset A and highest for dataset D. This holds for the within (figure 5) as well as the cross dataset methods (figure 6). Hence the overlap is related to the proportion of BIRDs in the datasets (cf. table 1). In terms of overlap, the cross dataset method is generally worse. For dataset D the cross dataset method suffers from a large overlap between BIRDs and the other three classes, which results in a poor performance (cf. table 3).
4.1 General aspects
The results clearly demonstrate that an automatic classifier for radar signals can be successfully established. A useful property of the method is that the output is a score, which provides a ranking of the data instances according to the reliability of the predictions. The decision threshold on the score can be set to other values than 0. For example, setting it to higher values improves the PPV but deteriorates the FNR (figures 5 and 6). This means that the selected subset will contain a higher proportion of birds with the disadvantage that more of the overall amount of the birds are lost to the rejected subset. It is also possible to define two decision thresholds on the score. An upper threshold can be chosen to identify a subset of data with a high proportion of birds (high PPV). A lower threshold can identify a subset of the data which includes only very few birds (low FNR). Finally, the data subset that is defined in between the two thresholds can either be treated separately in a specific analysis or be labelled additionally by the plot-based method.
In all cases a certain proportion of the BIRDs are not included in the selected subset. A desired property of the classifier would be that the excluded BIRDs represent a random subsample of all the BIRDs. Presently, we cannot say whether this is the case. A large part of the classifier's recognition ability is based on the pattern produced by wing flapping. Therefore it is possible that birds that do not produce a clear wing flapping pattern (e.g. soaring birds, dense bird flocks) will get, on average, a lower predicted value for the score than single birds, which are flapping most or all of the time. If this is the case, our method would selectively exclude soaring bird species and species that fly in dense flocks. This important point will hopefully be clarified in future studies.
Our approach uses a signal pre-processing procedure and then a nonlinear modelling method (SVC), which is necessary to achieve a high classification performance. A disadvantage of such an approach is that it makes it hardly possible to understand intuitively the process by which the classifier assigns a score to the signals.
The classifier resulting from the training process is dependent on the training dataset and thus on the correctness of labelling by the expert. There was a constraint in our expert learning, because we had to extrapolate from signals identified visually during daytime to nocturnal signals; though in a few field projects we could also verify nocturnal signals with infrared and light beam observations (Liechti et al. 1995). Since the flight mode of birds is basically the same during day and night (except soaring birds), we think that this extrapolation is unproblematic.
In the datasets B, C and D, the target class (BIRD) is underrepresented (table 1). This is an unfavourable situation because smaller class priors of the class BIRD per se decrease the PPV. In this context the PPV should be interpreted by considering the improvement it brings when compared with the proportion of BIRDs in the whole dataset (i.e. the class prior of the BIRDs).
Our principal purpose is to identify and select a subset of the data, which should contain a high proportion of BIRDs while rejecting another subset, which should contain as few BIRDs as possible. Let us take dataset B for illustration. Drawing biological conclusions on birds based on the complete dataset would be misleading since only 12% of the signals are from BIRDs. When the classifier is applied, the data are split into a selected subset and a rejected subset. The proportion of BIRDs in the selected subset is over 92% (PPV=0.9286), while 20% of the total number of BIRDs are lost to the rejected subset (FNR=0.2010). Drawing conclusions based on the selected subset seems much more reasonable. The results obtained for dataset C are very similar. For dataset A the proportion of BIRDs in the selected subset is over 97% and less than 3% of the total amount of BIRDs is lost to the rejected subset. This performance is above our expectation. It shows that when class priors are balanced, the practical usefulness of the classifier is excellent. For the least performing dataset D, the proportion of BIRDs in the selected subset is 87%, while 49% of the total amount of BIRDs is lost to the rejected subset. If we consider that the initial proportion of BIRDs in the complete dataset was only 4%, this represents a large improvement of the data quality, but the price paid in terms of the amount of birds lost to the rejected subset is large. Dataset D was recorded under frequent sandstorms and an extremely high abundance of insects, which reduces the classification performance (see below). In our experience, such situations are rare, and the performance obtained on datasets A, B and C is more representative of the practical usefulness of our method.
Using a classifier trained with data from another dataset (cross dataset method) provided qualitatively similar results as above, but with a consistently lower classification performance (table 3). It performed well when the proportion of BIRDs was approximately 50% (dataset A), but insufficiently when the proportion of BIRDs was low (datasets C and D). In the cross dataset method, it may be that the separability differs between the test set and the training set. Hence, estimations of PPV and FNR obtained from a cross validation on the training set will not necessarily represent the PPV and FNR obtained on the test set. More tests with other datasets are needed to conclude whether the cross dataset method is a reliable solution, at least when the proportion of BIRDs is high (e.g. dataset A). The big advantage of the cross dataset method is that there is no need to create a further training set via the time-consuming plot-based labelling.
It is known that an approximately correct specification of the class priors of the test set does improve the reliability of the predictions (Saerens et al. 2002). In our cross dataset approach, we assumed knowledge of the class priors in the test set. In practice this may not always be available (for example for real time classification). But other sources of information may be used to estimate priors. For example, temperature may be used to estimate insect activity or preliminary knowledge about the seasonal or local abundance of insects may be available.
Regardless of the training method used, the classifier achieved the best results (highest AUC) with dataset A, followed by B and C and finally D. This ranking is consistent with the decreasing proportion of BIRDs in the datasets (table 1), which was caused by a tremendous increase in the number of insects and amount of airborne sand. Therefore, overlap of different signals occurred frequently in the datasets B, C and D. The presence of insects and/or sand in the same pulse volume as a bird is expected to partially mask the wing flapping pattern. But the recognition ability of the classifier is mostly based on the wing flapping pattern in the signals. Therefore it is no surprise that as the patterns become less clear the classification performance decreases.
4.3 Possible improvements
Our datasets were dominated by nocturnal migrants, consisting mostly of small to medium sized birds flying singly. Signals of large birds (e.g. soaring birds) and flocks of birds were relatively rare (less than 1% of the BIRDs). Therefore, the present method was developed mainly to identify signals from small- to medium-sized single birds. In general, large birds and bird flocks produce stronger signals than small single birds, but unambiguous wing flapping patterns can hardly be detected. In flocks the signal represents the superimposed wing flapping of all individuals in the radar pulse volume, while large birds are often gliding or soaring without flapping. In addition interferences between the pulses (i.e. the electromagnetic wave) reflected by several targets within the same pulse volume will also render the signals more difficult to analyse.
Our classifier uses 43 variables. The signal intensity (echo size) is only represented in one variable ‘maxi’. The remaining variables contain information on the variability or on the shape of the signals, which both reflect the wing flapping. Therefore, our method heavily relies on information related to the wing flapping and probably does not fully exploit the potential of using the signal intensity. An improved version of the classifier should use the information on signal intensity more efficiently than in the present study. This is expected to result in a classifier with a good performance on bird flocks and large birds.
In general, the frequency represented by the overall curvature of the signal intensity is outside the range of frequencies analysed by our algorithm. However, this curvature together with target distance and width of the radar beam could be used to extract a variable representing the flight speed of the targets. Since birds have a higher airspeed than insects, this could further improve the classification performance.
A serious problem is the co-occurrence of intense insect and bird migration. In such situations, many signals of birds are contaminated by radar returns from insects, making them very difficult to be labelled unambiguously by the human expert. Accordingly, such signals were labelled as UFO. Thus an unknown proportion of the UFOs were actually produced by flying birds. These signals may contain characteristic patterns of birds that were not recognized by the human expert, but that are recognized by the classifier. Under this assumption we expect that the classifier has more difficulties at separating the BIRDs from the UFOs than the BIRDs from the INSECTs or CLUTTERs, which is the case in our data (see figures 5 and 6). At the moment, we can only speculate that UFO signals with high scores were actually produced by birds. But to confirm this we need additional independent information (e.g. parallel infrared images).
4.4.1 When is the exclusion on non-bird signals important?
We are convinced that for any radar study dealing with the movements of small- to medium-sized birds, a discrimination between bird and non-bird (in particular insect) signals is a prerequisite. Of course the exclusion of non-bird signals becomes less important when the overall proportion of birds is close to 100%. But in most situations such high proportions are not seen. In our study, which we consider representative for arid sub-tropical climates in general, birds represent 4–52% of the signals. The study by Larkin (1991) showed that in temperate regions insect contamination can be 24–28% (up to nearly 100% on specific nights), and even in the Arctic insect signals were recorded regularly by radar (Gudmundsson et al. 2002).
4.4.2 Application across different radar systems
The method is feasible for all kind of radar systems, which allow a target to be recorded continuously for at least a few seconds. We might wish to train a classifier with data from a given radar system and use this classifier to identify birds in a dataset recorded with another radar system. But the following aspects must be considered. (i) The process by which wing flapping produces the periodic fluctuations in the signal is not fully understood. It is known that wavelength and polarization of the pulses do affect this process considerably (Bruderer 1997b; Bachmann & Zrnic 2007). (ii) The overall power of the received signals depends on technical settings, which can differ significantly between radar systems, and even the relative fluctuations of the signal are dependent on the dynamic range of the amplifier. (iii) The time resolution of the signals can differ depending on the signal processing units of different radar systems. Based on these three points, we think the training data should be recorded with the same radar system as the data on which the classifier is applied.
4.4.3 Real time classification
Radar measurements are used for bird strike warnings, mainly in military aviation. Up to now these warnings are based on the expert knowledge of the radar operators. In order to introduce an automatic classifier, radar signal identification should be in real time or close to it (within minutes). Such real time warnings could also be used to mitigate detrimental effects of wind turbines by stopping them for the time of mass migration. For real time classification we have to rely on a classifier that was trained on data recorded in the past. Thus, there is a strong demand to improve the performance of the cross dataset method.
The differences in flapping styles between bird species are reflected in the shape of the recorded signals, and hence in the values of the extracted variables. It is reasonable to assume that, in general, the species composition of the birds aloft differs between recording locations and periods. We conclude that differences in species composition between the datasets are the probable cause of the lower performance of the cross dataset method as compared with the within dataset method. Based on this we suggest to improve the cross dataset method by creating a ‘standard’ training dataset where all different kinds of birds (small to large sized, continuous and intermittently flapping) are more or less equally represented. Such a standard training dataset could be built up over time by combining datasets from different sites, until repeated comparisons with the within dataset method would indicate a satisfactory performance.
The cross dataset method could be implemented in real time to provide a relative quantification of migration intensity (i.e. allowing relative comparisons between time periods). It could be used, for example, to detect a fast increase in the abundance of birds aloft and help us react by giving an alarm.
A promising approach for the future would be to combine several existing radar methods that provide information on target identity. (i) The method presented here, which uses wing flapping pattern. (ii) Signal intensity (echo size). As mentioned above signal intensity is underexploited in our method. (iii) Airspeed. (iv) Signal dissimilarity due to varying polarization (dual polarization radar, Bachmann & Zrnic 2007). As at least some of the information gathered by these methods is complementary, combining the methods might improve classification performance. Whether it is technically feasible to record simultaneously all this information with one radar system is questionable.
The time resolution of our recording system is restricted to 130 Hz, implying an upper bound at 65 Hz on the frequencies that can be analysed. Using a higher time resolution would allow the detection of higher frequencies and this might provide additional information on the wing flapping of some insects and improve the classification performance.
The pattern produced by the wing flapping of birds contains information related to the wing flapping frequency and the mode of flapping (continuous, intermittent). This information can be used to divide the birds' signals into several subclasses according to their wing flapping style, for example wader-like, passerine-like or swift-like (Bloch et al. 1981). This is of great interest for the studies on bird migration. It would be interesting to see whether the division into subclasses can be automated as well.
It has been known for many years that bird signals can be recognized accurately by human experts thanks to the excellent pattern recognition ability of humans. The main finding of our study is that this recognition task can be automated by using machine learning methods such as the SVC. This considerably reduces the dependency on human experts and will allow analysis of large datasets of automatically recorded radar signals in almost real time. We are convinced that this kind of classification will be an indispensable tool for future radar applications dealing with biological targets.
We would like to thank Erich Bächler, Bruno Bruderer, Alexandros Karatzoglou, Beat Naef-Daenzer, Dieter Peter, Thomas Steuri and Barbara Trösch for their assistance and support. We would like to thank Mark Desholm and two anonymous referees for their constructive comments on the manuscript.
The algorithm was implemented in the free and open source statistical programming language R (R Development Core Team 2006). Researchers interested in the algorithm (R-code) are encouraged to contact us.