Vassilis Athitsos               VLM Lab               Publications               Datasets

Nearest Neighbor Retrieval and Classification

What Is That?

Suppose that we want to build a system that can classify "objects" into specific classes. For example, each "object" can be the image of a hand, and classification can consist of recognizing the hand shape (see also my hand pose estimation project). As another example, each "object" can be the image of some handwritten letter or digit, and classification can consist of recognizing which letter or digit it is. A very commonly used classification method, called nearest neighbor classification, works as follows: first, we create a database of example objects for which we already know the correct classification. Then, when the system is given a query, i.e., a new object to classify, the system simply finds the nearest neighbor of the query in the database, i.e., the database object that is most similar to the query. The system then classifies the query as belonging to the same class as its nearest neighbor. For example, if the query is an image of a digit, and the nearest neighbor of the query in the database is an image of the digit "4", then the system classifies the query as an image of "4". Oftentimes, instead of looking only at the single nearest neighbor, a system uses the K nearest neighbors (where K can be pretty much any number) to classify the query.

[Image: hand pose recognition]
Figure 1: Recognizing hand shape, or a digit, using nearest neighbors. Given a query image, the system finds the nearest neighbor of the query in the database, and outputs that the hand shape (or digit) in the query image is the hand shape (or digit) of the nearest neighbor. Keep in mind that the system is assumed to know the correct hand shape (or digit) for every database image.

Nearest-neighbor classifiers are very simple to design (all you have to do is collect a database of examples), and often match or exceed the accuracy of much more complicated classification methods. A necessary part of nearest neighbor classification is nearest neighbor retrieval, i.e., the task of actually finding the nearest neighbors of the query. This can be tricky to do efficiently, especially when the database is very large.
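As a concrete illustration of how simple such a classifier is, here is a minimal K-nearest-neighbor classifier in Python. The squared Euclidean distance used here is just an assumption for the sketch; any distance measure could be plugged in.

```python
from collections import Counter

def knn_classify(query, database, k=1):
    """Classify `query` by majority vote among its k nearest
    database examples. `database` is a list of (vector, label)
    pairs. The distance measure here is squared Euclidean, chosen
    only for illustration."""
    def dist(x, y):
        return sum((a - b) ** 2 for a, b in zip(x, y))
    neighbors = sorted(database, key=lambda pair: dist(query, pair[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy example: two classes of 2-D points.
db = [((0.0, 0.0), "A"), ((0.1, 0.2), "A"),
      ((1.0, 1.0), "B"), ((0.9, 1.1), "B")]
print(knn_classify((0.2, 0.1), db, k=3))  # "A"
```

Note that this brute-force version computes the distance to every database object, which is exactly the efficiency problem discussed below.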

Nearest-neighbor retrieval has many uses in addition to being a part of nearest-neighbor classification. For example, we often want to find web pages that are similar to a specific page; in this case the specific page is the query, and the Web (or at least a subset of the Web, processed and stored by Google, or some other search engine) is the database. As another example, when biologists identify a new protein, they typically use a computer program to identify, in a large database of known proteins, the proteins that are the most similar to the new protein.

So, What's the Problem?

Defining Accurate Similarity Measures

Saying that a database object is the "nearest neighbor" of the query implies that we have a way to measure distances between the query and database objects. The way we choose to measure distances can drastically affect the accuracy of the system. At the same time, defining a good distance measure can be a challenging task. For example, what is the right way to measure similarity between two Web pages? A research problem that we are very interested in is designing methods for automatically learning a distance measure given many examples of pairs of similar objects and pairs of dissimilar objects.

Efficient Retrieval

As mentioned earlier, finding the nearest neighbors of the query can be time-consuming, especially when we have a large database. The problem can be even worse when the distance measure we use is computationally expensive. At the same time, computationally expensive distance measures are often used in computer vision and pattern recognition in general. Examples of such measures include the edit distance for strings, the chamfer distance and Hausdorff matching for edge images, the Kullback-Leibler distance and the Earth Mover's Distance for probability distributions, and bipartite matching for sets of features. As knowledge expands in many different domains, and ever larger databases are used to store that knowledge, achieving efficient retrieval becomes increasingly important, and at the same time increasingly challenging.
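To make the cost concrete, here is the standard dynamic-programming edit distance for strings, one of the expensive measures listed above. A single evaluation costs O(mn) for strings of lengths m and n, and a brute-force nearest neighbor search pays that cost once per database entry.

```python
def edit_distance(s, t):
    """Classic dynamic-programming (Levenshtein) edit distance.
    Cost is O(len(s) * len(t)) per pair of strings."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i          # delete all of s[:i]
    for j in range(n + 1):
        d[0][j] = j          # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

print(edit_distance("kitten", "sitting"))  # 3
```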

Some Results of Our Research

Embedding Construction Using Machine Learning

When the distance measure used in computing nearest neighbors is computationally expensive, efficient retrieval can be achieved by computing an embedding of objects into a different space (typically a vector space) where distances can be measured much faster. In particular, an embedding F is a function that maps objects to vectors. One of the main results of our research is that constructing and optimizing such embeddings can be treated as a machine learning problem. Specifically, in order for embedding F to preserve nearest neighbor structure, we want the following to hold: for any three objects Q, A, B, if Q is closer to A than to B, then F(Q) must be closer to F(A) than to F(B). In other words, we can think of an embedding as a classifier that decides, for any three objects Q, A, B, whether Q is closer to A or to B. Optimizing the accuracy of that classifier is equivalent to optimizing the embedding for the purpose of preserving nearest neighbor structure. Our CVPR 2004 paper introduces the idea that embeddings correspond to classifiers, and describes the BoostMap algorithm, which constructs embeddings using machine learning.
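The triple-based view can be made concrete with a small evaluator: sample triples (Q, A, B), and count how often the embedding preserves which of A, B is closer to Q. To be clear, this sketch only *measures* the classifier accuracy described above; it is not the BoostMap algorithm, which *optimizes* it. The toy embedding (distance to a single reference object) is purely illustrative.

```python
import random

def triple_accuracy(objects, dist, F, fdist, trials=1000, seed=0):
    """For random triples (Q, A, B) with dist(Q, A) < dist(Q, B),
    estimate how often fdist(F(Q), F(A)) < fdist(F(Q), F(B)),
    i.e., how often the embedding 'classifies' the triple correctly."""
    rng = random.Random(seed)
    correct = total = 0
    for _ in range(trials):
        q, a, b = rng.sample(objects, 3)
        if dist(q, a) == dist(q, b):
            continue                      # no ground-truth order; skip
        if dist(q, a) > dist(q, b):
            a, b = b, a                   # make A the closer object
        total += 1
        if fdist(F(q), F(a)) < fdist(F(q), F(b)):
            correct += 1
    return correct / total

# Toy 1-D example: embed each object by its distance to reference object 25.
objs = list(range(50))
dist = lambda x, y: abs(x - y)
F = lambda x: abs(x - 25)
acc = triple_accuracy(objs, dist, F, lambda u, v: abs(u - v))
print(round(acc, 2))  # better than chance (0.5), but imperfect
```

A perfect embedding would score 1.0 on such triples; the gap below 1.0 is exactly what a learning method like BoostMap tries to shrink.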

Query-Sensitive Embeddings

In many pattern recognition domains we encounter objects with high-dimensional representations. Measuring distances between such objects is typically done by defining a global distance measure that assigns a fixed weight to each dimension. In our SIGMOD 2005 paper we propose a method for constructing query-sensitive embeddings, where the distance measure automatically adapts to the query, giving higher weight to the dimensions that are the most informative for that query. An important property of query-sensitive embeddings is that they combine the efficiency of measuring distances in a vector space with the ability to preserve non-metric structure from the original space, such as violations of the triangle inequality or asymmetric distances.
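The mechanism can be sketched as a weighted distance whose weights are a function of the query. The weighting rule below is hand-specified purely for illustration; in the SIGMOD 2005 method the weights are learned, not hard-coded like this.

```python
def query_sensitive_dist(q, x, weight_fn):
    """Weighted L1 distance where the per-dimension weights depend
    on the query: weight_fn(q) returns one weight per dimension, so
    different queries emphasize different coordinates."""
    w = weight_fn(q)
    return sum(wi * abs(qi - xi) for wi, qi, xi in zip(w, q, x))

# Toy weighting rule (illustrative only): trust a coordinate more
# when the query has a large value there.
weights = lambda q: [abs(qi) / (sum(abs(v) for v in q) or 1) for qi in q]

print(query_sensitive_dist((4.0, 0.0), (3.0, 5.0), weights))  # 1.0
print(query_sensitive_dist((3.0, 5.0), (4.0, 0.0), weights))  # 3.5
```

Note that the two printed values differ even though the same pair of points is compared: because the weights follow the query, the resulting distance is asymmetric, which is exactly the kind of non-metric structure mentioned above.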

Distance-Based Hashing

Distance-Based Hashing is a method for efficient similarity search in non-Euclidean spaces with computationally expensive distance measures. An index is constructed using multiple hash tables. Each object is hashed using hashing functions that are defined based on distances to some specific reference objects from the database. This method is described in our ICDE 2008 paper.
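A minimal sketch of the flavor of such hash functions: project an object onto the "line" through two reference objects using only pairwise distances (the classic FastMap-style projection), then threshold the projection to get one hash bit. The thresholds and reference pairs here are hand-picked toys; consult the ICDE 2008 paper for how the actual index chooses and combines them.

```python
def line_projection(x, a, b, dist):
    """Project object x onto the 'line' through reference objects
    a and b, using only pairwise distances (FastMap-style)."""
    dab = dist(a, b)
    return (dist(x, a) ** 2 + dab ** 2 - dist(x, b) ** 2) / (2 * dab)

def make_hash_bit(a, b, threshold, dist):
    """One binary hash function: which side of the threshold the
    projection falls on. Concatenating several such bits gives a
    hash-table key; multiple tables reduce missed neighbors."""
    return lambda x: 1 if line_projection(x, a, b, dist) >= threshold else 0

dist = lambda x, y: abs(x - y)        # toy 1-D objects
h = make_hash_bit(0, 10, 5.0, dist)   # reference objects 0 and 10
print(h(2), h(8))  # 0 1
```

The key point is that nothing here requires vector coordinates: only distances to the reference objects are used, so the scheme applies in non-Euclidean spaces.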

Embedding Cascades

When we want to classify an object using nearest neighbor classification, oftentimes we can get enough information about the class of the object without finding its true nearest neighbor. For example, if using a computationally cheap and inaccurate embedding we find that the query is mapped into a region where all database objects have the same class, it may be reasonable to assign that class to the query instead of trying to find the query's actual nearest neighbor. Our CVPR 2005 paper builds upon this intuition and describes a method for constructing an embedding cascade, i.e., a sequence of progressively more computationally expensive embeddings, designed in such a way that queries that are easy to classify are handled efficiently by embeddings that occur early in the sequence, and tough queries are handled by the slower but more accurate embeddings towards the end of the sequence. An interesting experimental result of cascaded nearest neighbor classification is that classification time actually decreases as the size of the database increases, a behavior that is in stark contrast with the behavior of typical nearest neighbor classifiers.
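The control flow of a cascade is simple to sketch: run progressively more expensive classifiers and stop as soon as one is confident. The skeleton below is a hypothetical illustration of that idea, not the CVPR 2005 construction itself; the stage functions and their confidence rules are made up for the example.

```python
def cascade_classify(query, stages):
    """Each stage is a function returning (label, confident). Early
    stages are cheap; the final stage must always commit."""
    for classify in stages[:-1]:
        label, confident = classify(query)
        if confident:
            return label          # easy query: stop early
    label, _ = stages[-1](query)  # tough query: pay full cost
    return label

# Toy stages over 1-D "queries" in [0, 1]:
cheap = lambda q: ("A", q < 0.3)                  # confident only on easy cases
exact = lambda q: ("A" if q < 0.5 else "B", True) # slow but always answers
print(cascade_classify(0.1, [cheap, exact]))  # A  (handled by the cheap stage)
print(cascade_classify(0.9, [cheap, exact]))  # B  (falls through to the exact stage)
```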

Fast Approximate Alignment

In many domains, in order to compare two objects we first need to align those objects, that is, we need to determine which part of one object corresponds to which part of the other object. Aligning objects is oftentimes computationally expensive, because there are many possible alignments that we need to search in order to find the best one. Our ICDAR 2005 paper introduces a method for computing fast approximate alignments between arbitrary objects, based on how those objects align with a small number of prototypes. Applied to the task of online and off-line character recognition, our method makes nearest neighbor classification significantly faster, with negligible losses in accuracy.
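The prototype idea can be illustrated by composing two alignments: align the query to a prototype (done once per prototype, at query time), and reuse a precomputed prototype-to-target alignment. The greedy character matcher below is a toy stand-in for a real (expensive) aligner, and the whole sketch is only a hypothetical illustration of composition, not the ICDAR 2005 method.

```python
def align_monotone(s, t):
    """Toy alignment: greedily match equal characters left to right,
    returning {index in s: index in t}. Stands in for an expensive
    aligner such as dynamic programming."""
    m, j = {}, 0
    for i, c in enumerate(s):
        while j < len(t) and t[j] != c:
            j += 1
        if j < len(t):
            m[i] = j
            j += 1
    return m

def approx_align(query, prototype, proto_to_target):
    """Approximate query->target alignment: compose the cheap
    query->prototype alignment with a precomputed prototype->target
    alignment, instead of aligning query and target directly."""
    qp = align_monotone(query, prototype)
    return {i: proto_to_target[j] for i, j in qp.items() if j in proto_to_target}

pt = align_monotone("abcd", "abxcd")    # prototype -> target, computed offline
print(approx_align("acd", "abcd", pt))  # {0: 0, 1: 3, 2: 4}
```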

Embedding-Based Subsequence Matching of Time Series

In a query-by-humming system, a user sings part of a melody and the computer identifies the songs that contain the melody. In a sign spotting system, a sign language user searches for occurrences of specific signs in a video database of sign language content. These are two example applications where users want to retrieve the best matching subsequences in a time series database given a query sequence. We are developing methods for efficient subsequence matching in large time-series databases using the popular Dynamic Time Warping (DTW) distance measure. A first result is described in our SIGMOD 2008 paper. Our work on time-series subsequence matching is funded by NSF grant IIS-0812601.
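For reference, here is the standard whole-sequence DTW recurrence that these systems build on. Note this is plain DTW between two complete sequences; the SIGMOD 2008 method addresses the harder subsequence-matching setting, where the match may start and end anywhere inside a long database sequence.

```python
def dtw(x, y):
    """Dynamic Time Warping distance between two numeric sequences,
    via the standard O(len(x) * len(y)) recurrence. Warping lets one
    element of x match a run of elements in y, and vice versa."""
    inf = float("inf")
    m, n = len(x), len(y)
    d = [[inf] * (n + 1) for _ in range(m + 1)]
    d[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = abs(x[i - 1] - y[j - 1])
            d[i][j] = cost + min(d[i - 1][j],      # x[i-1] matched again
                                 d[i][j - 1],      # y[j-1] matched again
                                 d[i - 1][j - 1])  # one-to-one match
    return d[m][n]

print(dtw([1, 2, 3], [1, 1, 2, 2, 3]))  # 0.0 -- warping absorbs the repeats
```

The zero distance in the example shows why DTW suits hummed melodies and signing: the same pattern performed at a different speed still matches.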

Code and Datasets

Code and data for the BoostMap method (described in our CVPR 2004 and PAMI 2008 papers) is available at these links:

Code and data for Distance-Based Hashing (described in our ICDE 2008 paper) is available at these links:

Code and data for embedding-based subsequence matching (described in our SIGMOD 2008 paper) is available at these links:

