## Abstract

The matching of two-dimensional shapes is an important problem with many applications in anthropology. Examples of objects that anthropologists are interested in classifying, clustering and indexing based on shape include bone fragments, projectile points (arrowheads/spearpoints), petroglyphs and ceramics. Interest in matching such objects originates from the fundamental question for many biological anthropologists and archaeologists: how can we best quantify differences and similarities? This interest is fuelled in part by a movement that notes: ‘an increasing number of archaeologists are showing interest in employing Darwinian evolutionary theory to explain variation in the material record’. Aiding such research efforts with computers requires a shape similarity measure that is invariant to many distortions, including scale, offset, noise, partial occlusion, etc. Most of these distortions are relatively easy to handle, either in the representation of the data or in the similarity measure used. However, rotation invariance seems to be uniquely difficult. Current approaches typically try to achieve rotation invariance in the representation of the data, at the expense of poor discrimination ability, or in the distance measure, at the expense of efficiency. In this work, we show that we can take the slow but accurate approaches and dramatically speed them up. On real world problems, our technique can take current approaches and make them four orders of magnitude faster, without false dismissals. Moreover, our technique can be used with *any* of the dozens of existing shape representations and with *all* the most popular distance measures, including Euclidean distance, dynamic time warping and longest common subsequence. We show the applications of our work to several important problems in anthropology, including clustering and indexing of skulls, projectile points and petroglyphs.

## 1. Introduction

Anthropologists often deal with physical (as opposed to purely social or linguistic, etc.) artefacts. While the colour or texture of such artefacts may be of interest, it is often the case that the *shape* is of most interest. Examples of artefacts that anthropologists are interested in classifying, clustering or indexing based on shape include bone fragments, projectile points (arrowheads/spearpoints), petroglyphs and ceramics. While anthropologists have long been interested in shape, interest in matching such objects is fuelled in part by the availability of computing power and by a recent movement that notes: ‘an increasing number of archaeologists are showing interest in employing Darwinian evolutionary theory to explain variation in the material record’ (O'Brien & Lyman 2003). For example, anthropologists have recently used tools from biological morphology to attempt to explain spatial and temporal distribution of projectile points in North America. Aiding such research efforts with computers requires a shape similarity measure that is invariant to many distortions, including scale, offset, noise, partial occlusion, etc. Most of these distortions are relatively easy to handle, particularly if we use the well-known technique of converting the shapes into time-series as in figure 1.

However, no matter what representation is used, rotation invariance seems to be uniquely difficult to handle. For example, Li & Simske (2002) notes that ‘rotation is always something hard to handle compared with translation and scaling’, and the literature abounds with similar statements. Many current approaches try to achieve rotation invariance in the representation of the data, at the expense of discrimination ability (Osada *et al*. 2002), or in the distance measure, at the expense of efficiency (Gdalyahu & Weinshall 1999; Adamek & O'Connor 2003, 2004; Attalla & Siy 2005).

As an example of the former, the very efficient rotation invariant technique of Osada *et al*. (2002) cannot differentiate between the shapes of the lowercase letters ‘d’ and ‘b’. As an example of the latter, the work of Adamek & O'Connor (2004), which is the state of the art in terms of accuracy or precision/recall, takes an untenable amount of time for each shape comparison.

In this work,1 we show that we can take the slow but accurate approaches and dramatically speed them up. For example, we can take the *O*(*n*^{3}) approach of Adamek & O'Connor (2004) and on real world problems bring the average complexity down to *O*(*n*^{1.06}). This dramatic improvement in efficiency does not come at the expense of accuracy; the algorithm is guaranteed to return the same answer set as the slower methods.

We achieve speedup of the existing methods by dramatically decreasing the central processing unit (CPU) requirements. Our technique works by grouping together similar rotations and defining an admissible lower bound to that group. Given a tight and admissible lower bound, we can use the many search techniques known in the database community.

Our technique has the following advantages:

there are dozens of techniques in the literature for converting shapes to time-series (Wang

*et al*. 2000; Cardone*et al*. 2003; Adamek & O'Connor 2004; Zhang & Lu 2004; Attalla & Siy 2005; Vlachos*et al*. 2005), including some that are domain specific (Rath & Manmatha 2002; Bhanu & Zhou 2004). Our approach works for*any*of these representations.while there are many distance measures for shapes in the literature, Euclidean distance, dynamic time warping (Rath & Manmatha 2002; Adamek & O'Connor 2003; Bhanu & Zhou 2004; Ratanamahatana & Keogh 2005) and longest common subsequence (Vlachos

*et al*. 2005) account for the majority of the literature. Our approach works for*any*of these distance measures.our approach uses the idea of envelope lower bounding as its cornerstone. Since the introduction of this idea a few years ago (Keogh 2002), dozens of researchers worldwide have adopted and extended this framework for applications as diverse as motion capture indexing (Keogh & Kasetty 2002), P2P searching (Karydis

*et al*. 2005), handwriting retrieval (Rath & Manmatha 2002), dance indexing, query by humming and monitoring streams (Wei*et al*. 2005). This widespread adoption of envelope lower bounding has insured that it has become a mature and widely supported technology, and it suggests that any contributions made here can be rapidly adopted and expanded.

The rest of this paper is organized as follows. In §2, we discuss the background material and the related work. In §3, we formally introduce the problem and in §4, we offer our solution. Section 5 offers a comprehensive empirical evaluation of our technique. Finally, §6 offers some conclusions and directions for future work.

## 2. Background and related work

The literature on shape matching is vast; we refer the interested reader to Cardone *et al*. (2003), Zhang & Lu (2004) and Veltkamp & Latecki (2006) for excellent surveys. While not all work on shape matching uses a one-dimensional representation of the two-dimensional shapes, an increasingly large majority of the literature does. Therefore, we consider only such approaches here. Note that we lose little by this omission. The two most popular measures that operate directly in the image space, the chamfer (Borgefors 1988) and Hausdorff (Olson & Huttenlocher 1997) distance measures, require *O*(*n*^{2} log *n*) time,2 and recent experiments (including some in this work) suggest that one-dimensional representations can achieve comparable or superior accuracy.

In essence, there are three major techniques for dealing with rotation invariance, landmarking, rotation invariant features and brute force rotation alignment. We consider each in §§§2.1–2.3.

### 2.1 Landmarking

The idea of ‘landmarking’ is to find the one ‘true’ rotation and only use that particular alignment as the input to the distance measure. The idea comes in two flavours: domain dependent and domain independent.

In domain dependent landmarking, we attempt to find a single (or very few) fixed feature to use as a starting point for the conversion of the shape to a time-series. For example, in face profile recognition, the most commonly used landmarks (fiducial points) are the chin or the nose (Bhanu & Zhou 2004). In limited domains, this may be useful, but it requires building special-purpose feature extractors. For example, even in a domain as intuitively well understood as human profiles, accurately locating the nose is a non-trivial problem, even if we discount the possibility of mustaches and glasses. Probably, the only reason any progress has been made in this area is that most work reasonably assumes that faces presented in an image are likely to be upright. For shape matching in skulls, the canonical landmark is called the Frankfurt horizontal (White 2000), which is defined by the right and left *porion* (the highest point on the margin of the external auditory meatus) and the left *orbitale* (the lowest point on the orbital margin). However, a skull can be missing the relevant bones to determine this orientation and still have enough global information to match its shape to similar examples. Indeed, the Skhul V skull shown in figure 12 is such an example.

In domain independent landmarking, we align all the shapes to some cardinal orientation, typically the major axis. This approach may be useful for the limited domains in which there is a well-defined major axis, perhaps the indexing of hand tools. However, there is an increasing recognition that the ‘…major axis is sensitive to noise and unreliable’ (Zhang & Lu 2004). For example, a recent paper shows that under some circumstances, a single extra pixel can change the rotation by ±90° (Zunic *et al*. 2006).

To show how brittle landmarking can be, we performed a simple clustering experiment where we clustered three primate skulls using Euclidean distance, with both the major axis technique and the minimum distance of all possible rotations (as found by brute force). Figure 2 shows the result. It is clear that the major axes do not have any biological meaning: the points connecting each axis for each specimen are not homologous (of shared evolutionary origin). Therefore, the resulting clusters are meaningless in terms of biology and morphology.

The most important lesson we learned from this experiment (and dozens of other similar experiments on diverse domains; see Keogh 2006) is that rotation (mis)alignment is the most important invariance for shape matching; unless we have the best rotation, then nothing else matters.

### 2.2 Rotation invariant features

A large number of papers achieve fast rotation invariant matching by extracting only the rotation invariant features and indexing them with a feature vector (Cardone *et al*. 2003). This feature vector is often called the shapes ‘signature’. There are literally dozens of rotation invariant features, including ratio of perimeter to area, fractal measures, elongatedness, circularity, min/max/mean curvature, entropy, perimeter of convex hull, etc. In addition, many researchers have attempted to frame the shape-matching problem as a more familiar histogram-matching problem. For example, in Osada *et al*. 2002, the authors build a histogram containing the distances between two randomly chosen points on the perimeter of the shapes in question. The approach seems to be attractive, for example, it can trivially also handle three-dimensional shapes, however, it suffers from extremely poor precision. For example, it cannot differentiate between the shapes of the lowercase letters ‘d’ and ‘b’ or ‘p’ and ‘q’, since these pairs of shapes have identical histograms. In general, all these methods suffer from very poor discrimination ability (Cardone *et al*. 2003).

### 2.3 Brute force rotation alignment

There are a handful of papers that recognize that the above attempts at approximating rotation invariance are unsatisfactory for most domains, and they achieve true rotation invariance by an exhaustive brute force search over all possible rotations, but only at the expense of computational efficiency (Gdalyahu & Weinshall 1999; Wang *et al*. 2000; Adamek & O'Connor 2003, 2004; Attalla & Siy 2005). For example, the paper of Adamek & O'Connor (2004) uses dynamic time warping (DTW is discussed in detail in §4.2) to handle non-rigid shapes in the time-series domain; while they note that most invariances are trivial to handle in this representation, they state that ‘rotation invariance can (only) be obtained by checking all possible circular shifts for the optimal diagonal path’. This step makes the comparison of two shapes *O*(*n*^{3}). Similarly, the paper of Wang *et al*. (2000) notes that ‘in order to find the best matching result, we have to shift one curve *n* times, where *n* is the number of possible start points’.

## 3. Rotation invariant matching

We begin by formally defining the rotation invariant matching problem and by assuming the Euclidean distance, and we generalize to other distance measures later. For clarity of presentation, we will generally refer to ‘time-series’, which the reader will note can be mapped back to the original shapes.

Suppose that we have two time-series, *Q* and *C* of length *n*, which were extracted from shapes by an arbitrary method,As we are interested in large data collections, we denote a database of *m* such time-series as ,

If we wish to compare two time-series, and therefore shapes, we can use the ubiquitous Euclidean distanceWhen using the Euclidean distance as a subroutine in a classification or indexing algorithm, we may be interested in knowing the exact distance only when it is eventually going to be less than some threshold *r*. For example, this threshold can be the ‘range’ in range search or the ‘best-so-far’ in nearest neighbour search. If this is the case, we can potentially speedup the calculation by doing early abandoning (Agrawal *et al*. 1993; Keogh & Kasetty 2002).

*Early abandon*. During the computation of the Euclidean distance, if we note that the current sum of the squared differences between each pair of corresponding data points exceeds *r*^{2}, then we can stop the calculation, secure in the knowledge that the exact Euclidean distance had we calculated it would exceed *r*.

While the idea of early abandoning is fairly obvious and intuitive, it is so important to our work that we illustrate it in figure 3 and provide pseudocode in table 1.

Note that the ‘num_steps’ value returned by the optimized Euclidean distance in table 1 is used only to tell us how useful the optimization was. If its value is significantly less than *n*, then this suggests a dramatic speedup.

While the Euclidean distance is a simple distance measure, it produces surprisingly good results for clustering, classification and query by content of shapes, *if* the time-series in question happen to be rotation aligned. For example, in an experiment in Ratanamahatana & Keogh (2005), we manually performed rotation alignment of the time-series extracted from face profiles by explicitly showing the algorithm the beginning and the endpoint of a face (the nape and the Adam's apple, respectively).

However, if the shapes are not rotation aligned, this method can produce extremely poor results. To overcome this problem, we need to hold one shape fixed, rotate the other and record the minimum distance of all possible rotations.

For reasons that will become apparent later, we achieve this by expanding one time-series into a matrix ** C** of size

*n*by

*n*,Note that each row of the matrix is simply a time-series, shifted (rotated) by one from its neighbours. It will be useful below to address the time-series in each row individually, so we will denote the

*i*th row as

*C*

_{i}, which allows us to denote the matrix above in the more compact form of

**={**

*C**C*

_{1},

*C*

_{2}, …,

*C*

_{n}}.

We can now define the rotation invariant Euclidean distance (RED) as

Table 2 shows the pseudocode to calculate this.

Note that the algorithm tries to take advantage of early abandoning by passing EA_Euclidean_Dist the value of *r*, the best rotation alignment discovered thus far.

If we are simply measuring the distance between two time-series, then the algorithm is invoked, with *r* set to infinity; however, as we shall see below, if the algorithm is being used as a subroutine in a linear scan of a large dataset , the calling routine can set the value of *r* to achieve speedup. In particular, the calling function sets *r* to the value of the best match (under any rotation) discovered thus far. Table 3 shows the pseudocode. Note that the time complexity for this algorithm is *O*(*mn*^{2}). This is simply untenable for large datasets.

Before continuing, we will review the notation introduced thus far in table 4.

Note that our notation seems somewhat space inefficient, in that it expands time-series *C*, of length *n*, to a matrix of size *n*×*n*. However, the rest of the database uses the original (arbitrary rotation) time-series, and since the size of the database is assumed to be large, this overhead is asymptotically irrelevant.

Depending on the application, we may wish to retrieve the shapes that are enantiomorphic (mirror images) to the query. For example, in matching skulls, the best match may simply be facing the opposite direction. In contrast, when matching letters, we *do not* want to match a ‘d’ to a ‘b’. If enantiomorphic invariance is required, we can trivially achieve this by augmenting matrix ** C** to contain

*C*

_{i}and reverse(

*C*

_{i}), for 1≤

*i*≤

*n*.

Thus far, we have shown a brute force search algorithm that can support rotation invariance, rotation-limited invariance and/or mirror image invariance. We simply put the appropriate time-series into matrix ** C** and invoke the algorithm in table 3. This algorithm, even though speeded up by the early abandoning optimization, is too slow for large datasets. In §4, we introduce our novel search mechanism.

## 4. Wedge based rotation matching

We will begin by showing how we can efficiently search for the best match in the main memory. We will further show how to generalize to other distance measures.

### 4.1 Fast and exact main memory search

We begin by defining time-series *wedges*. Imagine that we take several time-series, *C*_{1}, …, *C*_{k}, from our matrix ** C**. We can use these sequences to form two new sequences

*U*and

*L*,

*U*and

*L*stand for upper and lower, respectively. We can see why in figure 4. They form the smallest possible bounding envelope that encloses all members of the set

*C*

_{1}, …,

*C*

_{k}from above and below. More formally,For notational convenience, we will call the combination of

*U*and

*L*a

*wedge*, and denote a wedge as

*W*,We can now define a lower-bounding measure between an arbitrary time-series

*Q*and the entire set of candidate sequences contained in a wedge

*W*,For brevity, we do not show a proof of this lower-bounding property. A proof appears in Keogh (2006) and also in Li

*et al*. (2004), where the authors use this representation for different problems.

Note that the LB_Envelope function has been used before to support DTW (Keogh 2002; Rath & Manmatha 2002; Vlachos *et al*. 2003; Ratanamahatana & Keogh 2005), uniform scaling (Keogh *et al*. 2004) and query filtering (Wei *et al*. 2005). For these tasks, the lower-bounding distance function is the same, but the definition of *U* and *L* is different.

There are two important observations about LB_Envelope. First, in the special case where *W* is created from a single candidate sequence, it degenerates to the Euclidean distance. Second, not only does LB_Envelope lower bound all the candidate sequences *C*_{1}, …, *C*_{k}, but we can also do *early abandon* with LB_Envelope. While the latter fact might be obvious, for clarity we make it explicit in table 5.

Note once again that the value returned in num_steps is merely a bookkeeping device to allow a post-mortem evaluation of efficiency.

Suppose we have just two time-series, *C*_{1} and *C*_{2} of length *n*, and we know that in the future we will be given a time-series query *Q* and asked if one (or both) of *C*_{1} and *C*_{2} are within *r* of the query. We naturally wish to minimize the number of steps which we must perform (‘steps’ are measured by num_steps). We are now in a position to outline two possible approaches to this problem.

We can simply compare the two sequences,

*C*_{1}and*C*_{2}(in either order), to the query using the early abandon algorithm introduced in table 1. We will call this algorithm,*classic*.We can combine the two candidate sequences into a wedge and compare

*Q*to the wedge using LB_Envelope. If the LB_Envelope function early abandons, we are done. We can say with absolute certainty that neither of the two candidate sequences is within*r*of the query. If we cannot early abandon on the wedge, we need to individually compare the two candidate sequences,*C*_{1}and*C*_{2}(in either order), to the query. We will call this algorithm,*Merge*.

Let us consider the best and the worst cases for each approach. For *classic*, the worst case is if both candidate sequences are within *r* of the query, which will require 2*n* steps. In the best case, the first point in the query may be radically different to the first point in either of the candidates, allowing immediate early abandonment and giving a total cost of two steps.

For *Merge*, the worst case is also if both candidate sequences are within *r* of the query, because we will waste *n* steps in the lower-bounding test between the query and the wedge, and then *n* steps for each individual candidate, for a total of 3*n*. However, the best case, also if the first point in the query is radically different, would allow us to abandon with a total cost of one step.

Which of the two approaches is better depends on as follows:

the shapes of

*C*_{1}and*C*_{2}. If they are similar, this greatly favours*Merge*.the shape of

*Q*. If*Q*is truly similar to one (or both) of the candidate sequences, this would greatly favour*classic*.the matching distance

*r*. Here, the effect is non-monotonic and dependent on the two factors above.

We can generalize the notion of wedges by hierarchically nesting them. Let us begin by augmenting the notation of a wedge to include information about the sequences used to form it. For example, if a wedge is built from *C*_{1} and *C*_{2}, we will denote it as *W*_{(1,2)}. Note that a single sequence is a special case of a wedge; for example, the sequence *C*_{1} can also be denoted as *W*_{1}. We can combine *W*_{(1,2)} and *W*_{3} into a single wedge by finding maximum and minimum values for each *i*th location, from *either* wedge. More concretely,In figure 5, we illustrate this notation. We call *W*_{(1,2)} and *W*_{3} *children* of wedge *W*_{((1,2),3)}. Since individual sequences are special cases of wedges, we can also call *C*_{1} and *C*_{2} children of *W*_{(1,2)}.

Given the generalization to hierarchal wedges, we can also now generalize the *Merge* approach. Suppose we have a time-series *Q* and a wedge *W*_{((1,2),3)}. We can compare the query to the wedge using LB_Envelope. If the LB_Envelope function early abandons, we are done. We know with certainty that none of the three candidate sequences is within *r* of *Q*. If we cannot early abandon on the wedge, we need to compare the two child wedges, *W*_{(1,2)} and *W*_{3}, to the query. Again, if we cannot early abandon on the wedge *W*_{(1,2)}, we need to individually compare the two candidate sequences, *C*_{1} and *C*_{2} (in either order), to the query. We call this algorithm *H*-*Merge* (Hierarchal Merge).

The utility of a wedge is strongly correlated to its area. We can get some intuition by visually comparing LB_Envelope(*Q*, *W*_{(1,2)}) with LB_Envelope(*Q*, *W*_{((1,2),3)}), as shown in figure 6. Note that the area of *W*_{((1,2),3)} is much greater than that of *W*_{(1,2)}, and that this reduces the value returned by the lower-bound function, and thus the possibility to early abandon.

For some problems, the *H*-*Merge* algorithm can give exceptionally poor performance. If the wedge *W*_{(1,2)}, created from *C*_{1} and *C*_{2}, has an exceptionally large area (i.e. *C*_{1} and *C*_{2} are very dissimilar), it is very unlikely to be able to prune off any steps.

At this point, we can see that the efficiency of *H*-*Merge* is dependent on the candidate sequences and *Q* itself. In general, merging similar sequences into a hierarchal wedge is a good idea, but merging dissimilar sequences is a bad idea.

The observations above motivate a final generalization of *H*-*Merge*. Recall that to achieve rotation invariance, we expanded our time-series *C* into a matrix with *n* time-series. Given these *n* sequences, we can merge them into *K* hierarchal wedges, where 1≤*K*≤*n*. This merging forms a partitioning of the data, with each sequence belonging to exactly one wedge. We will use **W** to denote a set of hierarchal wedges,where *W*_{set(i)} is a (hierarchically nested) subset of the *n* candidate sequences. Note that we haveWe will attempt to merge together only similar sequences. We can then compare this set of wedges against our query. Table 6 formalizes the algorithm.

Note that this algorithm is designed to replace the Test_All_Rotations algorithm that is invoked as a subroutine in the Search_Database_for_Rotated_Match algorithm shown in table 3.

As we shall see in our empirical evaluations, *H*-*Merge* can produce very impressive speedup if we make judicious choices in the set of hierarchal wedges that make up *W*. However, the number of possible ways to arrange the hierarchal wedges is greater than *K*^{K}, and the vast majority of these arrangements will be very poor, so specifying a good arrangement of *W* is critical.

A simple observation alleviates the need to invent a new algorithm to find a good arrangement of *W*. Note that hierarchal clustering algorithms have very similar goals to an ideal wedge-producing algorithm. In particular, hierarchal clustering algorithms can be seen as attempting to minimize the distances between objects in each subtree. A wedge-producing algorithm should attempt to minimize the area of each wedge. However, the area of a wedge is simply the maximum Euclidean distance between any sequences contained therein (i.e. Newton–Cotes rule from elementary calculus). This motivates us to derive wedge sets based on the result of a hierarchal clustering algorithm. Figure 8 shows wedge sets *W*, of every size from 1 to 5, derived from the dendrogram shown in figure 7.

Given that the clustering algorithm produces the tentative wedge sets, all we need to do is to choose the best one. We could attempt to do this by eye; for example, in figure 8, it is clear that any sequence that early abandons on *W*_{3} will almost certainly also early abandon on both *W*_{2} and *W*_{5}; similar remarks apply to *W*_{1} and *W*_{4}. At the other extreme, the wedge at *K*=1 is so ‘fat’ that it is likely to have poor pruning power. The set *W*={*W*_{((2,5),3)}, *W*_{(1,4)}} is probably the best compromise. However, because the set of time-series might be very large, such visual inspection is not scalable.

The problem is actually even more complex, in that the best value for *K* also depends on the current value of *r* (recall *r* is the best-so-far in nearest neighbour search.). If *r* is large, then very little early abandoning is possible, and this favours a large value for *K*. In contrast, if *r* is small, we can do a lot of early abandoning, and we are better off having many sequences in a single wedge, hence we can early abandon all of them with a single calculation. However, note that for nearest neighbour search, the value of *r* will get smaller as we search through the database.

With this in mind, we dynamically choose the wedge set based on a fast empirical test. We start with the wedge set where *K*=2. Each time the bestSoFar value changes, we test a subset of the possible values of *K* and choose the most efficient one (as measured by num_steps) as the next *K* to use. Which subset to test is decided on-the-fly based on the current *K* value. They are the values which evenly divide the ranges [1, current_K] and [current_K, max_K] into a small number of intervals. All the experiments below use *five* as this small number; however, the results are not sensitive to this choice. For example, using values as low as 3 or as high as 25 did not change the *efficiency* by more than 1% on datasets larger than 128 objects. Note that on average, the bestSoFar value only changes log(*m*) times during a linear search, so this slight overhead in adjusting the parameter is not too burdensome; however, we do include this cost in all the experiments in §5.

### 4.2 Generalizing to other distance measures

As we shall see in §5, the Euclidean distance is typically very effective and intuitive as a distance measure for shapes. However, in some domains, it may not produce the best possible precision/recall or classification accuracy (Adamek & O'Connor 2003; Ratanamahatana & Keogh 2005). The problem is that even after best rotation alignment, subjectively similar shapes may produce time-series that are globally similar but contain local ‘distortions’. These distortions may correspond to local features that are present in both shapes but in different proportions. For example, in figure 9, we can see the more prominent sagittal and nuchal crests of a lowland gorilla change the locations, in which the brow ridge and jaw map to in a time-series relative to a mountain gorilla. However, it should be noted that the two specimens do not differ much in the actual shape of the brain case, much of the difference attributable to the extracranial structures such as the sagittal and nuchal crests.

Even if we assume that the database contains the actual object used as a query, it is possible that the two time-series are distorted versions of each other. Here, the distortions may be caused by camera perspective effect, differences in lighting causing shadows which appear to be features, parallax, etc.

Fortunately, there is a well-known technique for compensating such local misalignments, dynamic time warping (DTW; Keogh 2002; Ratanamahatana & Keogh 2005). While DTW was invented in the context of one-dimensional speech signals, others have noted its utility for matching shapes, including face profiles (Bhanu & Zhou 2004), leafs (Ratanamahatana & Keogh 2005), handwriting (Rath & Manmatha 2002) and general shape matching (Adamek & O'Connor 2004).

To align two sequences using DTW, an *n*×*n* matrix is constructed, where the (*i*th, *j*th) element of the matrix is the distance *d*(*q*_{i}, *c*_{j}) between the two points *q*_{i} and *c*_{j} (i.e. *d*(*q*_{i}, *c*_{j})=(*q*_{i}−*c*_{j})^{2}). Each matrix element (*i*, *j*) corresponds to the alignment between the points *q*_{i} and *c*_{j}, as illustrated in figure 10.

A warping path *P* is a contiguous set of matrix elements that defines a mapping between *Q* and *C*. The *t*th element of *P* is defined as *p*_{t}=(*i*, *j*)_{t}, so we haveThe warping path that defines the alignment between the two time-series is subject to several constraints. For example, the warping path must start and finish in diagonally opposite corner cells of the matrix; the steps in the warping path are restricted to adjacent cells (including diagonally adjacent cells); the points in the warping path must be monotonically spaced in time. In addition to these constraints, virtually all practitioners using DTW also constrain the warping path in a global sense by limiting how far it may stray from the diagonal (Keogh 2002; Rath & Manmatha 2002; Ratanamahatana & Keogh 2005). A typical constraint is the Sakoe–Chiba band, which states that the warping path cannot deviate more than *R* cells from diagonal.

The optimal warping path can be found in *O*(*nR*) time by dynamic programming (Keogh 2002). We shall show experimentally in §5 that DTW can significantly outperform Euclidean distance on real datasets.

Based on an arbitrary wedge *W* and the allowed warping range *R*, we define two new sequences, *DTW_U* and *DTW_L*,They form an additional envelope above and below the wedge, as illustrated in figure 11.

We can now define a lower-bounding measure for DTW distance between an arbitrary query *Q* and the entire set of candidate sequences contained in a wedge *W*,We make the following claim.

*For any sequence Q of length n and a wedge W containing a set of time-series C*_{1}, …, *C*_{k} *of the same length n*, *for any global constraint on the warping path of the form j*−*R*≤*i*≤*j*+*R*, *the following inequality holds*:

_{DTW}(

*Q*,

*W*); however, Vlachos

*et al*. (2003) contains the necessary modifications for both DTW and LCSS which are discussed below.

To facilitate later efficiency comparisons to Euclidean distance and other methods, it will be useful to define the time complexity of DTW in terms of num_steps as returned by tables 1 and 5. The variable ‘num_steps’ is the number of real-value subtractions that must be performed and completely dominates the CPU time, since the square root function is only performed once (and can be removed; see Keogh & Kasetty 2002). If we construct a full *n*×*n* warping matrix, then DTW clearly requires at least *n*^{2} steps. However, as we noted above and illustrated in figure 10, we can truncate the corners of the matrix to reduce this number to approximately *nR*, where *R* is the width of the Sakoe–Chiba band. While *nR* is the number of steps for a single DTW, we expect the average number of steps to be less, because some full DTW calculations will not be needed if the lower-bound test fails. Since the lower-bound test requires *n* steps, the average number of steps when doing *m* comparisons should bewhere *a* is the fraction of the database that requires the full DTW calculated. Note that even this is pessimistic, since both DTW3 and LB_Envelope_{DTW} are implemented as early abandoning (recall table 5). Therefore, we simply count the num_steps required by each approach and divide it by *m* to get the average number of steps required for one comparison.

In addition to DTW, several researchers have suggested using longest common subsequence (LCSS) as a distance measure for shapes (Yazdani & Meral Özsoyoglu 1996). The LCSS is very similar to DTW, except that while DTW insists that every point in *C* maps onto one (or more) point(s) in *Q*, LCSS allows some points to go unmatched. The intuition behind this idea in a time-series domain is that subsequences may contain additions or deletions; for example, an extra (or forgotten) dance move in a motion capture performance or a missed beat in ECG data. Rather than forcing DTW to produce an unnatural alignment between two such sequences, we can use LCSS, which simply ignores parts of the time-series that are too difficult to match. In the image space, the missing section of the time-series may correspond to a partial occlusion of an object or to a physically missing part of the object, as shown in figure 12.

Another example of an anthropological object that requires LCSS to obtain meaningful matches is projectile points as shown in figure 13. Such objects are often discovered with missing parts. In order to find matches using query-by-content in a database, we can try to extrapolate our best guess as to the missing features, but this opens the possibility of imposing our (possibly misguided) knowledge on the problem, rather than letting the data speak for itself. Using LCSS, the missing features can be simply ignored during the search process.

While we considered LCSS for generality, we will not further explain how to incorporate it into our framework. It has been shown in Vlachos *et al*. (2003) that it is trivial to lower-bound LCSS using the envelope-based techniques described previously. The minor changes include reversing some inequality signs, since LCSS is a similarity measure, not a distance measure. Our omission here of a detailed discussion is due to space limitations and to a slight bias against the method. Unlike Euclidean distance which has no parameters, or DTW, which has one intuitive and easy-to-set parameter, LCSS requires two parameters, and tuning them is non-trivial. In the experiments, we found that we could sometimes tune LCSS to *slightly* beat DTW on *some* problems; however, we did not have large enough datasets to allow training/test splits that guarded against overfitting to a statistically significant standard.

## 5. Experimental results

In this section, we empirically evaluate our approach. We begin by stating our experimental philosophy. In a recent paper, Veltkamp & Latecki (2006) attempted to reproduce the accuracy claims of several shape matching papers, but discovered to their dismay that they could not match the claimed accuracy for any approach. One suggested reason is the observation that many approaches have highly tuned parameters, a fact which we believe makes Euclidean distance (zero parameters) and DTW (one parameter) particularly attractive. Veltkamp & Latecki conclude that ‘it would be good for the scientific community if the reported test results are made reproducible and verifiable by publishing datasets and software along with the articles’. We completely concur and have placed *all* datasets at the following URL (Keogh 2006).

### 5.1 Effectiveness of shape matching

In general, this paper is not making any claims about the *effectiveness* of shape matching. Because we are simply speeding up arbitrary distance calculations on arbitrary one-dimensional representations of shapes, we automatically inherit the well-documented effectiveness of other researchers' published work (Gdalyahu & Weinshall 1999; Adamek & O'Connor 2003, 2004; Attalla & Siy 2005; Jalba *et al*. 2005; Ratanamahatana & Keogh 2005; Vlachos *et al*. 2005).

Nevertheless, for completeness, and in order to justify the extra computational expense of DTW, we will show the effectiveness of shape matching on several publicly available datasets.

Table 7 shows the error rate of one-nearest neighbour classification as measured using leaving-one-out evaluation. Recall that Euclidean distance has no parameters, DTW has a single parameter (the warping window width *R*), which was learned by looking only at the training data. For the face and leaf datasets, the (approximate) correct rotation was known (Ratanamahatana & Keogh 2005). We removed this information by randomly rotating the images.

The mixedBag dataset is small enough to run the more computationally expensive chamfer (Borgefors 1988) and Hausdorff (Olson & Huttenlocher 1997) distance measures. They achieved an error rate of 6.0 and 7.0%, respectively (Vlachos *et al*. 2005), slightly worse than Euclidean distance. Likewise, the chicken dataset allows us to compare directly to Mollineda *et al*. (2002), which used identical experiments to test six different algorithms based on *discrete* sequences extracted from the shapes. The best of these algorithms had an error rate of 20.5% and took over a minute for each distance calculation, whereas our approach takes an average time of 0.0039 s for each distance calculation.4 For the diatom dataset, the results are competitive with human experts, whose error rates ranged from 57 to 13.5% (Jalba *et al*. 2005), and only slightly worse than the morphological curvature scale spaces (MCSS) approach of Jalba *et al*. (2005), which obtained 26.0%. However, note that the Euclidean distance requires zero parameters once the time-series have been extracted, whereas the MCSS has several parameters to set.

In general, these experiments show two things (which had been noted before): the extra effort of DTW is useful in some domains and very simple time-series representations of shapes are competitive to other more complex representations.

We also performed extensive ‘sanity check’ experiments using a large database of primate skulls. For all species where we have at least two examples, we perform a hierarchical clustering and check to see if both samples of the same species clustered together. Figure 14 shows a typical example.

It is important to recall that figure 14 shows a phenogram, *not* a phylogenetic tree. However, on larger-scale experiments in this domain (shown in Keogh 2006), we found that large subtrees of the dendrograms did conform to the current consensus on primate evolution (Fleagle 1999).

We performed many additional experiments with large collections of petroglyphs and projectile points. Here, the evaluation is more subjective, but we find that in most cases, we get very intuitive results as in figure 15.

In most cases, we discovered that unintuitive results were caused by errors in the image processing software that automatically extracts the shapes, since many petroglyphs are heavily degraded by weather (Pope 2000).

### 5.3 Main memory experiments

There is an increasing awareness that comparing two competing approaches using only CPU time opens the possibility of implementation bias (Keogh & Kasetty 2002). As a simple example, while the Haar wavelet transform is *O*(*n*) and DFT is *O*(*n* log *n*), the DFT is *much* faster in the popular language Matlab, simply because it is a highly optimized subroutine. For this reason, many recent papers compare approaches with some implementation-free metric (Keogh 2002; Vlachos *et al*. 2003; Ratanamahatana & Keogh 2005; Vlachos *et al*. 2005). As we noted earlier, the variable ‘num_steps’ returned by tables 1 and 5 allows an implementation-free measure to compare performance.

For Euclidean distance queries, we compare to *brute force* and *fast Fourier transform* (FFT) methods, which are the only competitors to also guarantee no false dismissals (Vlachos *et al*. 2005). The cost model for the FFT lower bound is *n* log *n* steps. If the FFT lower bound fails, we allow the approach to avail of our early abandoning techniques discussed in §3.

We tested on two datasets, a homogeneous database of 16 000 projectile point images, all of length 251 and a heterogeneous dataset consisting of all the data used in the classification experiments, plus 1000 projectile points. In total, the heterogeneous dataset contains 5844 objects of length 1024. To measure the performance, we averaged over 50 runs, with the query object randomly chosen and removed from the dataset.

We measure the average number of steps required by each approach for a single comparison of two shapes, divided by the number of steps require by brute force. For our method, we include a start-up cost of *O*(*n*^{2}), which is the time required to build the wedges. Because the utility of early abandoning depends on the value of the best-so-far, we expect our method to do better as we see larger and larger datasets.

Figure 16 shows the results on the projectile points dataset using Euclidean distance.

We can see that for small datasets, our approach is slightly worse than *FFT* and simple *Early abandon* because we had to spend some time building the wedges. However, by the time we have seen 64 objects, we have already broken even, and thereafter rapidly race towards beating *FFT* and *Early abandon* by one order of magnitude and *Brute force* by two orders of magnitude.

The results on the projectile points dataset using DTW are shown in figure 17, and are even more dramatic.

Here, the cost of building the wedges is dwarfed by a single brute force–DTW–rotation invariant comparison, so our approach is faster even for a database of size 3. By the time we have examined the entire database, our approach is more than 5000 times faster than the brute force approach. It is interesting to note that the early abandoning strategy is by itself quite competitive, yet to our knowledge no one uses it. We suspect that this is because most people are more familiar with the elegant and terse recursive version of DTW, which does not allow early abandoning, than the iterative implementation, which does. However, note that even though our highly optimized early abandoning strategy is competitive, our wedge approach is still an order of magnitude faster once the dataset is larger than 500 objects.

Sometimes indexing methods that work well for highly homogeneous datasets do not work well for heterogeneous datasets, and vice versa. We consider this possibility by testing on the heterogeneous dataset in figure 18.

In this dataset, it takes our wedge approach slightly longer to beat *Early abandon* (and *FFT* for Euclidean search); however, by the time we have seen 8000 objects, our approach is two orders of magnitude faster than its Euclidean competitors, and for DTW it is an order of magnitude faster than *Early abandon* and 3976 times faster than brute force.

Recall that our algorithm requires the setting of a single parameter, the number of intervals to search for a new value for *K* every time the bestSoFar variable is updated. In all the experiments above, this value was set to 5. We found that we can change this value to any number in the range 3–20 without affecting the performance of our algorithm by more than 4%; therefore, we omit further discussion of this parameter setting.

As a final sanity check, we also measured the wall clock time of our best implementation of all methods. The results are essentially identical to those shown previously.

## 6. Conclusions and future work

We have introduced a method to support fast rotation invariant search of large shape datasets with arbitrary representations and distance functions. Our method supports rotation-limited queries and mirror image invariance if desired. This contrasts with most rotation invariant feature-based approaches, which are permanently unable to distinguish between rotations or mirroring. We have shown that our method gives impressive speedup for the state-of-the-art methods without sacrificing the guarantee of no false dismissals.

Note that our work assumes the availability of high quality and high contrast images. In some domains, especially petroglyphs, such images are difficult to obtain (Pope 2000); it would be interesting to consider methods that are more robust to error introduced in this way.

Future work includes both extensions and applications of the current work. We will attempt to extend this approach to the indexing of three-dimensional shapes, and we have begun to use our algorithm as a subroutine in several data mining algorithms, which attempt to cluster, classify and discover motifs in a variety of anthropological datasets, including petroglyph and projectile point databases.

## Acknowledgments

We gratefully acknowledge many useful comments from the anonymous reviewers, as well as helpful comments from Chotirat Ann Ratanamahatana, Michail Vlachos and Longin Jan Latecki. Thanks to Jason Dorff for help with skull images.

## Footnotes

↵In this work, we make use of Big

*O*notation such as ‘*O*(*n*^{3})’. See appendix A for an intuition for this notation.↵More precisely, the time complexity is

*O*(*Rp*log*p*), where*p*is the number of pixels in the perimeter and*R*is the number of rotations that need to be executed. Here*p*=*n*, and while*R*is a user-defined parameter, it should be approximately equal*n*to guarantee all rotations (up to the limit of rasterization) are considered.↵Note that a

*recursive*implementation of DTW would always require*nR*steps; however,*iterative*implementation (as used here) can potentially early abandon with as few as*R*steps.↵We are aware that one should normally not compare CPU times from different computers; however, here the four orders of magnitude offers a comfortable margin that dwarfs implementation details.

- Received August 7, 2006.
- Accepted September 21, 2006.

- © 2006 The Royal Society