Nearest neighbor search is an important task which arises in many different areas: pattern recognition, recommendation systems, DNA sequencing, and even game development.
Usually this task is formulated as follows. We have N points in some space (dataset S). We have to answer queries which take dataset S and some point X as their parameters (X does not have to belong to S). Typical queries are "find the k nearest neighbors of X" or "find all points in S at distance R from X or closer". Depending on the problem, we may have: a) a different number of dimensions, from one to thousands; b) a different metric (Euclidean, 1-norm, ...); c) a different dataset size. Hence, different algorithms are feasible for different problems.
The key point of the problem formulation is that dataset S is considered fixed: X may vary from query to query, but S remains unchanged. This allows us to preprocess the dataset and build a data structure which accelerates query processing. All strategies which promise better-than-O(N) query time rely on some kind of preprocessing, and different preprocessing strategies have different features.
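The baseline that any preprocessing-based method competes against is a plain linear scan. As a point of reference, here is a minimal sketch (illustrative Python, not ALGLIB code) of the two query types formulated above, answered in O(N) time:

```python
import math

def knn_linear(dataset, x, k):
    """Return the k points of `dataset` nearest to `x` (Euclidean metric), O(N log N)."""
    return sorted(dataset, key=lambda p: math.dist(p, x))[:k]

def range_query_linear(dataset, x, r):
    """Return all points of `dataset` at distance <= r from `x`, O(N)."""
    return [p for p in dataset if math.dist(p, x) <= r]
```

Both queries touch every point of S, which is exactly the cost that preprocessing tries to avoid.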
The ALGLIB package includes the nearestneighbor subpackage, which implements nearest neighbor search by means of kd-trees. Kd-trees allow efficient search in low-dimensional spaces (from 1 to 5 dimensions), but offer lower performance in high-dimensional spaces.
1 Nearest neighbor search with kd-trees
Basic description
Kd-tree construction
Querying a kd-tree
Other functions
Performance of kd-trees
2 Downloads section
Kd-trees are data structures used to store points in k-dimensional space. As the name suggests, a kd-tree is a tree. Tree leaves store dataset points (one or several points per leaf); each point is stored in exactly one leaf, and each leaf stores at least one point. Tree nodes correspond to splits of the space (most implementations use axis-oriented splits). Each split divides the space and the dataset into two distinct parts. Subsequent splits from the root node down to a leaf cut away parts of the dataset (and space) until only a small part of each is left.
The chart at the left shows an example of a kd-tree in 2-dimensional space. Red squares are dataset points, black lines are splits. The thinner a line is, the deeper the node which corresponds to that split.
Kd-trees allow us to efficiently perform searches like "all points at distance lower than R from X" or "k nearest neighbors of X". When processing such a query, we find the leaf which corresponds to X. We process the points stored in that leaf, and then start to scan nearby leaves. At some point we may notice that the distance from X to a leaf is larger than the distance to the worst point found so far; it is then time to stop the search, because further leaves cannot improve the result. Such an algorithm is good for searches in low-dimensional spaces. However, its efficiency decreases as dimensionality grows, and in high-dimensional spaces kd-trees give no performance advantage over naive O(N) linear search (although they continue to give correct results).
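The search procedure described above can be sketched as follows. This is a minimal illustrative Python implementation of the technique (median splits, one point per node, branch-and-bound pruning), not ALGLIB's actual code:

```python
import heapq
import math

def build(points, depth=0):
    """Build a kd-tree; inner nodes split on axis = depth mod k."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {"point": points[mid], "axis": axis,
            "left": build(points[:mid], depth + 1),
            "right": build(points[mid + 1:], depth + 1)}

def query_knn(node, x, k, heap=None):
    """Return the k nearest neighbors of x, nearest first."""
    if heap is None:
        heap = []  # max-heap of (-distance, point); heap[0] is the worst neighbor so far
    if node is None:
        return [p for _, p in sorted(heap, reverse=True)]
    d = math.dist(node["point"], x)
    if len(heap) < k:
        heapq.heappush(heap, (-d, node["point"]))
    elif d < -heap[0][0]:
        heapq.heapreplace(heap, (-d, node["point"]))
    axis, split = node["axis"], node["point"][node["axis"]]
    near, far = ((node["left"], node["right"]) if x[axis] <= split
                 else (node["right"], node["left"]))
    query_knn(near, x, k, heap)  # descend toward X's side first
    # Visit the far side only if the splitting plane is closer than the
    # current worst neighbor -- this is the pruning step described above.
    if len(heap) < k or abs(x[axis] - split) < -heap[0][0]:
        query_knn(far, x, k, heap)
    return [p for _, p in sorted(heap, reverse=True)]
```

When most far subtrees fail the pruning test, only a few leaves are visited; when dimensionality is high, almost every subtree passes the test and the search degenerates toward a full scan.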
Considering the number of dimensions K fixed and the dataset size N variable, we can estimate the complexity of the most important kd-tree operations:
The ALGLIB implementation of kd-trees uses the following dataset model:
Kd-trees are built with one of the constructor functions: kdtreebuild or kdtreebuildtagged. The first builds a kd-tree without tags (with optional Y-values); the second builds a kd-tree with tags (and optional Y-values). As a result, these functions return a kdtree structure.
The ALGLIB implementation of kd-trees supports the following kinds of queries:
NN searches are performed in two stages. At the first stage, we issue a query using one of the querying functions: kdtreequeryknn, kdtreequeryaknn or kdtreequeryrnn. These functions perform the search in the kd-tree, save the result in the kdtree structure, and return the result size (the number of points satisfying the search criteria).
At the second stage, the user may extract the result by calling one of these functions: kdtreequeryresultsx to get X-values, kdtreequeryresultsxy to get X- and Y-values, kdtreequeryresultstags to get tags, or kdtreequeryresultsdistances to get distances from the dataset points to X.
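To make the two-stage protocol concrete, here is a toy class that mimics it. All names here are hypothetical and the internals are a plain linear scan, so this only illustrates the query-then-extract calling pattern, not the real ALGLIB API:

```python
import math

class ToyKdTree:
    """Hypothetical sketch of a two-stage query interface (not ALGLIB's API)."""

    def __init__(self, points, tags):
        self._data = list(zip(points, tags))
        self._result = []

    def query_knn(self, x, k):
        """Stage 1: run the search, buffer the result, return only its size."""
        self._result = sorted(self._data, key=lambda pt: math.dist(pt[0], x))[:k]
        return len(self._result)

    def results_x(self):
        """Stage 2: extract X-values of the buffered result."""
        return [p for p, _ in self._result]

    def results_tags(self):
        """Stage 2: extract tags of the buffered result."""
        return [t for _, t in self._result]
```

The point of the separation is that one query may be followed by several extraction calls (X-values, tags, distances) without re-running the search.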
The ALGLIB Reference Manual contains the nneighbor_d_1 example, which shows how to work with kd-trees.
The kdtreeserialize and kdtreeunserialize functions can be used to serialize a kd-tree (convert it to a string) and unserialize it (restore the structure from its string representation). Serialization allows you to save a tree to a file, move it between different computers, and even between versions of ALGLIB in different programming languages. The ALGLIB Reference Manual includes an example of kd-tree serialization, nneighbor_d_2.
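Conceptually, serialization round-trips the tree through a portable string. A toy sketch of the idea (JSON is used here only for illustration; ALGLIB uses its own portable text format, and the node layout below is an assumption of this sketch):

```python
import json

# A tiny tree with the same shape as the kd-tree sketched earlier.
tree = {"point": [5, 4], "axis": 0,
        "left": {"point": [2, 3], "axis": 1, "left": None, "right": None},
        "right": None}

s = json.dumps(tree)       # "serialize": structure -> plain string
restored = json.loads(s)   # "unserialize": string -> structure
assert restored == tree    # the round trip preserves the tree exactly
```

Because the intermediate form is a plain string, it can be written to a file or sent to another machine and rebuilt there.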
In order to estimate the performance of the ALGLIB implementation of kd-trees, we've conducted a series of numerical experiments. Experiments were performed on an AMD Phenom II X6 3.2 GHz CPU with the Microsoft C++ compiler and maximum optimization settings. We generated N=50,000 points, uniformly and randomly distributed across the D-dimensional unit hypercube, and then performed 50,000 queries for the K nearest neighbors.
Depending on the number of dimensions D, kd-tree construction took from 30 to 60 ms. The approximate time complexity is O(D·N·log N), i.e. construction time is linear in the number of dimensions and linearithmic in the dataset size.
The charts above show how query processing time depends on the number of neighbors K and the number of dimensions D. From the first chart we can conclude that KNN query time grows approximately linearly with K (the best result possible).
The second chart shows kd-tree performance as a function of the number of dimensions D. You may see that for any K (even K=1) query time grows exponentially as we increase the number of dimensions: each additional dimension results in an approximately 1.7-fold increase in processing time. In high-dimensional spaces we have to scan more leaves of the tree, and at some point the kd-tree becomes just a complicated form of linear search (all nodes are examined). Incidentally, at that point query time stops growing exponentially with D.
One way to fight the "curse of dimensionality" is to use approximate KNN (AKNN) queries. When we process an ε-approximate KNN query, we may return an inexact nearest neighbor located at distance up to r·(1+ε) from X (where r is the distance to the true nearest neighbor). The larger the approximation factor ε we choose, the more leaves can be skipped when we scan the tree.
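The effect of ε on pruning can be sketched as follows: an approximate 1-NN query skips the far subtree whenever the splitting plane is farther than best/(1+ε), which is a looser condition than the exact test. This is illustrative code, not ALGLIB's implementation, and the node layout (dicts with point/axis/left/right keys) is an assumption of this sketch:

```python
import math

def ann_search(node, x, eps, best=None):
    """Approximate nearest neighbor; returns (distance, point).

    The returned point is guaranteed to lie within (1+eps) times the
    distance to the true nearest neighbor."""
    if node is None:
        return best
    d = math.dist(node["point"], x)
    if best is None or d < best[0]:
        best = (d, node["point"])
    axis, split = node["axis"], node["point"][node["axis"]]
    near, far = ((node["left"], node["right"]) if x[axis] <= split
                 else (node["right"], node["left"]))
    best = ann_search(near, x, eps, best)
    # Looser prune than the exact query: with eps > 0, subtrees that an exact
    # search would still have to visit are skipped.
    if abs(x[axis] - split) < best[0] / (1.0 + eps):
        best = ann_search(far, x, eps, best)
    return best
```

With eps=0 the condition reduces to the exact pruning test; increasing eps trades accuracy for fewer visited leaves.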
The charts above show the time complexity of AKNN queries with different values of the approximation factor ε and different numbers of dimensions D. You may see that the higher the dimensionality D, the more advantageous AKNN queries become. With D=5 and ε=1, AKNN queries give us only a 3-fold increase in performance, but with D=10 and ε=1 they give a 10-fold increase.
Obviously, approximate KNN queries are not a universal solution. In many cases we have to know the exact nearest neighbors. However, in some areas (such as large-scale data processing) it is sufficient to work with inexact results.
This article is licensed for personal use only.
ALGLIB Project offers you two editions of ALGLIB:
ALGLIB Free Edition:
delivered for free
offers full set of numerical functionality
extensive algorithmic optimizations
no low level optimizations
non-commercial license
ALGLIB Commercial Edition:
flexible pricing
offers full set of numerical functionality
extensive algorithmic optimizations
high performance (SMP, SIMD)
commercial license with support plan
Links to the download sections for the Free and Commercial editions can be found below: