Talk:K-d tree

Opening heading


I believe the explanations on the animation are incorrect. The search is not always "first left then right". Instead, the component of the search vector that corresponds to the split dimension at the current node is compared to the component stored in the node. This decides the direction of the search (either left or right). 85.23.56.103 (talk) 14:12, 18 January 2011 (UTC)
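
A minimal sketch, in Java, of the rule described above (Node, point, axis, left, right and descend are hypothetical names, not taken from the article or the animation):

    // Hypothetical node type, for illustration only.
    class Node {
        double[] point;   // the point stored in this node
        int axis;         // the split dimension at this node
        Node left, right;
    }

    static Node descend(Node node, double[] query) {
        // Compare the query's component in the node's split dimension to the
        // component stored in the node; this decides left versus right.
        Node next = (query[node.axis] < node.point[node.axis]) ? node.left : node.right;
        return (next == null) ? node : descend(next, query);
    }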

AFAIK the "canonical" method of constructing the tree is not to simply cycle through the dimensions. In all papers I have seen, the split dimension is selected by measuring the variance of the feature values. 85.23.56.103 (talk) 08:34, 20 January 2011 (UTC)

1) It's left or right in tree space, not real space. This tree space would correspond to the "upper" and "lower" sides of the split plane, if you prefer. 2) You can select by variance, but this makes your traversal algorithms slower, as you cannot simply compute the axis from the tree depth -- you have to retrieve it from the node, since it may differ. It's also slower to construct because you have to compute the variance. However, the search may be slightly faster due to "zeroing in" on the target more quickly, IFF your data has strong spatial correlations. This is OR, and would also be a needless complication of the algorithm presentation. 129.67.86.189 (talk) 20:28, 13 May 2011 (UTC)
Left and right are absolutely in tree space and thus arbitrary. While finding the variance or range does make building the tree slower, the entire benefit of a k-d tree structure is that you build it up front and then reference it repeatedly. In practical terms this means that it is almost always a better idea to split along the axis of greatest variance/range than to hurt search performance (because additional useless splits really hurt search performance). I've written several algorithms based on k-d trees and have seen that there are also some provable positive aspects to splitting on the axis of greatest range, in terms of how many traversals away a nearest neighbor will be in the tree. Any time you have data that is not evenly distributed, you get a significant boost by splitting intelligently. Adam McCormick (talk) 05:32, 18 May 2011 (UTC)
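
A sketch in Java of the split-axis selection discussed in this thread, using range as the measure of spread (variance is analogous); widestAxis is a hypothetical helper, not from the article:

    static int widestAxis(double[][] points, int k) {
        int best = 0;
        double bestSpread = -1.0;
        for (int axis = 0; axis < k; axis++) {
            double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
            for (double[] p : points) {
                min = Math.min(min, p[axis]);
                max = Math.max(max, p[axis]);
            }
            if (max - min > bestSpread) {
                bestSpread = max - min;
                best = axis;
            }
        }
        // The chosen axis must be stored in the node, since it no longer
        // follows from the tree depth -- the trade-off noted above.
        return best;
    }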

Requested move

The following discussion is an archived discussion of the proposal. Please do not modify it. Subsequent comments should be made in a new section on the talk page. No further edits should be made to this section.

Page moved. Vegaswikian (talk) 20:47, 13 June 2011 (UTC)

Kd-tree → K-d tree – When did Bentley's "k-d tree" morph to "kd tree" and "kd-tree" and "Kd tree" and "Kd-tree", and how did we end up with the worst, ugliest, and least meaningful of the lot? I propose we move it to "k-d tree" to be consistent with a large part of the literature, including at least the part that uses hyphens approximately correctly. Dicklyon (talk) 01:14, 7 June 2011 (UTC)

The above discussion is preserved as an archive of the proposal. Please do not modify it. Subsequent comments should be made in a new section on this talk page. No further edits should be made to this section.

Nearest neighbour search complexity


The article states O(n^(1-1/k)) as the complexity for nearest neighbour search.
The reference cited -- "Worst-case analysis for region and partial region searches in multidimensional binary search trees and balanced quad trees" -- does not refer to NN search at all (only to range searches).
Further, the Moore article (referenced in the external links) seems to state that there is no non-trivial bound, but that the average case is O(log n) for any constant dimension.

Is this just plain wrong, or am I missing something? 79.180.7.128 (talk) 22:11, 15 December 2011 (UTC)

This is precisely what I thought. I think this complexity claim is wrong, for single nearest neighbour search at least (and it is certainly not backed up by that reference).
In fact, I think there is an algorithm for single nearest neighbour search which is proved to be O(log n) for a fixed number of dimensions, given in this paper:
"An Algorithm for Finding Best Matches in Logarithmic Expected Time"
J. H. Friedman, J. L. Bentley & R. A. Finkel (1977)
http://doi.acm.org/10.1145/355744.355745
I will edit the article accordingly; please comment if you object or if I have misunderstood.
dosttiey (talk) 20:19, 29 March 2013 (UTC)
The paper mentioned appears to give only an expected time complexity. Jaspervdg (talk) 14:54, 14 July 2015 (UTC)
In reality you need to check extra nodes to be sure you have found the nearest possible neighbour.
The average case can't be O(log n) because of that.
It's O(log n) in the best case and O(n) in the worst. 185.144.27.53 (talk) 07:45, 16 April 2024 (UTC)
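
A minimal Java sketch of the backtracking step under discussion, reusing the hypothetical Node above; only the best squared distance is tracked, for brevity, and best[0] starts at Double.POSITIVE_INFINITY:

    static double dist2(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            s += d * d;
        }
        return s;
    }

    // Call as: nearest(root, query, new double[]{Double.POSITIVE_INFINITY});
    static void nearest(Node node, double[] query, double[] best) {
        if (node == null) return;
        best[0] = Math.min(best[0], dist2(node.point, query));
        double delta = query[node.axis] - node.point[node.axis];
        Node near = (delta < 0) ? node.left : node.right;
        Node far  = (delta < 0) ? node.right : node.left;
        nearest(near, query, best);
        if (delta * delta < best[0]) {
            // These are the "extra nodes": the splitting plane is closer than
            // the current best, so the other subtree must be searched too.
            nearest(far, query, best);
        }
    }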

k-d tree for space-consuming objects


The article suggests that a k-d tree can be used to store space-consuming objects (e.g. rectangles, in 2 dimensions) but only specifies how this affects the range search algorithm. It would be useful to expand this to cover other algorithms (especially construction, but nearest-neighbour would probably be useful, too). JulesH (talk) 14:04, 7 September 2013 (UTC)

The article does not specify *how* to add volumes to a kd-tree: it just says, "the tree is constructed the usual way with all the rectangles at the leaves". Could someone provide enough detail to allow someone to implement range search over volumes? — Preceding unsigned comment added by 163.220.176.20 (talk) 06:12, 16 March 2018 (UTC)

Error in pseudocode for tree construction?


The pseudocode for the tree construction is missing the trivial case (where there are no points, or only one point, I assume). Something like "if more than one point then ..." around creating the subtrees? — Preceding unsigned comment added by 67.159.132.225 (talk) 05:58, 30 January 2014 (UTC)
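
A Java sketch of how that guard could look; build and Node are hypothetical names, and the median-by-sorting step is illustrative, not the article's pseudocode (requires java.util.Arrays):

    static Node build(double[][] points, int depth, int k) {
        if (points.length == 0) return null;    // trivial case: no points
        final int axis = depth % k;             // cycle through the dimensions
        Arrays.sort(points, (a, b) -> Double.compare(a[axis], b[axis]));
        int median = points.length / 2;         // a one-point array is its own median
        Node node = new Node();
        node.point = points[median];
        node.axis = axis;
        // With zero or one point the ranges below are empty, so recursion stops.
        node.left  = build(Arrays.copyOfRange(points, 0, median), depth + 1, k);
        node.right = build(Arrays.copyOfRange(points, median + 1, points.length), depth + 1, k);
        return node;
    }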

Construction


I find the description of the construction algorithm a bit confusing.

Splitting an array into subarrays: where the text says to test against the "x-coordinate", I assume the intent is to test against the current splitting dimension.

Since the array is already sorted, there's no need to "consider each element of the array" and test its x-coordinate against the splitting plane. Instead, just find the median as described later?

The section addressing re-using the arrays, and building the subarrays from the k-1 array into the k-2 array, does not specify whether to do all of this at every depth level, or only part of it at every dimensional increase.

In the end it says: "Hence, building the entire k-d tree requires less than O([k-1]n log n) time, which is less than the O(kn log n) time that is required to sort the n points in k dimensions prior to building the k-d tree." But the whole premise on which the text describes the algorithm is that all k arrays are sorted, each according to its dimension, before building the k-d tree, and all the preceding text assumes that they are, so this last sentence does not seem to make any sense. — Preceding unsigned comment added by Waperboy (talk · contribs) 13:49, 27 September 2014 (UTC)

Worse than that is the false claim that O([k-1]n log n) is less than O(kn log n); these are the same; see Big O notation. I think it just means that the total complexity is still O(kn log n), since each of the two major steps is O(kn log n). But I haven't checked it carefully. A source would be nice for this. I also don't see where the factor of [k-1] comes from in "This tree-building algorithm requires at most O([k-1]n) tests of coordinates against splitting planes to build each of the log n levels of a balanced k-d tree." Dicklyon (talk) 01:28, 28 September 2014 (UTC)

Here is how to build a k-d tree in O((k-1) n log n) time. Let's consider the case of (x,y,z) points. First, sort the points separately in x, y and z using 3 O(n log n) sorts such as heapsort. Store the result of each sort as an array of indices into the array of (x,y,z) points. For the first subdivision, pick the median of the x-index array to be the partition. Subdividing the x-index array is trivial because all elements below the median form the left (lower) sub-array and the elements above the median form the right (upper) sub-array.

Subdividing the y- and z-index arrays is a little more involved. Let's take subdivision of the y-index array as an example. We need a temporary index array; call it the w-index array. The idea is to sweep upward through the y-index array and compare the x-coordinate of each (x,y,z) point to the x-median. For those points whose x-coordinate < x-median, copy those points into the w-index array, beginning with the lowest address of the w-index array and proceeding upward through those addresses. For those points whose x-coordinate > x-median, copy those points into the w-index array, beginning with the address immediately above the x-index of the x-median and proceeding upward through those addresses. After each y-index has been visited via this lower-to-upper index sweep, the w-index array contains two sets of y-indices wherein each set contains increasing y-values (or rather, indices to (x,y,z) points of increasing y-values): the lower set has x-coordinates that are all less than the x-median, and the upper set has x-coordinates that are all greater than the x-median. This sweep has thus partitioned the points about the x-median while preserving the y-sort. Now the y-index array can be used to copy z-indices in the same manner that the w-index array was used to copy y-indices. And now one level of the k-d tree has been built. So, you always need 1 more index array than you have coordinates.

I call this algorithm a "sweep and partition" algorithm. Building the entire k-d tree requires O(log n) levels of sweeps, and each level involves copying n indices for each of (k-1) dimensions. Hence, the initial sort requires O(kn log n) time and building the tree thereafter requires O((k-1) n log n) time.

I have a Java program that builds a balanced k-d tree and handles all of the "corner" cases such as only 1 point, only 2 points, etc. at each level of subdivision. I can contribute that Java code to Wikipedia if someone can show me where to upload it. I sent Dick Lyon an email. Perhaps I could send it to that email address? Kirigiri (talk) 03:22, 14 October 2014 (UTC)

Rather than code, what would be most useful would be a reference to a published source. And maybe an indication of why you would put (k-1) instead of k in a big-O notation. Dicklyon (talk) 03:35, 14 October 2014 (UTC)

The factor of (k-1) arises because, at each level of the tree, 1 of the k coordinates is trivially subdivided by simply taking the median of a pre-sorted array as the partition value, but the remaining (k-1) coordinates must be copied from one array to another array to implement the "sweep and partition" algorithm. Hence, for n points and k coordinates, each level of the tree requires (k-1)n copy operations. And because the tree is O(log n) deep, the time required to build the tree is O((k-1) n log n).

As far as I know, there is no publication that discusses it. I learned that this algorithm was possible by stumbling upon a homework assignment from a CS course at Western Washington University (I think) about 15 years ago. The assignment said, "First sort the points in x, y and z, then build the balanced tree." Because I had previously built k-d trees by re-sorting at each level of subdivision, which requires O(n (log n)^2) time assuming an O(n log n) sort at each level of subdivision, I was intrigued by the stated order of sorting before building. It didn't take long for me to implement the algorithm. However, I have never seen it published. Dick, I see that you know Ivan Sutherland. Ask him for my Yahoo email address and write to me there. I'll send you the code and after you study it, you'll see how the algorithm works. I'm Russ Brown. 71.198.6.170 (talk) 05:22, 14 October 2014 (UTC)
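
A minimal Java sketch of one such sweep, for the y-index array, following the description above (all names are hypothetical and distinct x-coordinates are assumed): points is the array of (x,y,z) points, yIdx indexes it in ascending y order, xMedian is the x-coordinate of the partition point, and medianPos is the median's position in the x-index array.

    static int[] sweepPartition(double[][] points, int[] yIdx, double xMedian, int medianPos) {
        int[] w = new int[yIdx.length];     // the temporary w-index array
        int lo = 0;                         // next slot in the lower (x < median) half
        int hi = medianPos + 1;             // next slot in the upper (x > median) half
        for (int i : yIdx) {
            double x = points[i][0];
            if (x < xMedian) {
                w[lo++] = i;                // stays sorted by y within the lower half
            } else if (x > xMedian) {
                w[hi++] = i;                // stays sorted by y within the upper half
            }
            // The median point itself becomes the node and joins neither half.
        }
        return w;                           // one O(n) pass per non-split dimension per level
    }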


I found this link on Hacker News. • Sbmeirow • Talk • 21:52, 25 December 2015 (UTC)

Storing information in nodes + leaves or only leaves


Some sources say that "The two main differences between a kD tree and a bst are 1) Values are always stored in leaves, and ...", whereas in the article values are stored in internal vertices too. Should this be clarified? — Preceding unsigned comment added by VirtualSatai (talk · contribs) 10:12, 5 June 2016 (UTC)
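
For illustration, the two layouts being contrasted, as hypothetical Java type sketches (neither is from the article or the quoted source):

    // Article's variant: every node, internal or not, carries a data point.
    class PointNode {
        double[] point;
        PointNode left, right;
    }

    // Quoted variant: internal nodes carry only the splitting plane,
    // and data points live only in leaf buckets.
    class LeafOnlyNode {
        int axis;
        double split;
        LeafOnlyNode left, right;   // both null at a leaf
        double[][] bucket;          // non-null only at a leaf
    }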

Nearest neighbour algorithm isn't what is shown in the animation


The article says to first search as far as a leaf node, then set the "current best" to that node. The animation does the opposite, first setting the "current best" to the root. This is confusing. — Preceding unsigned comment added by 137.122.114.44 (talk) 17:17, 9 March 2018 (UTC)

The current animation is really confusing: there is no representation of the tree, and there is no explanation of what is going on. There was an earlier animation that was less flashy but made more sense. 129.67.86.87 (talk) 23:00, 27 October 2020 (UTC)

Meaning of >>


"As a general rule, if the dimensionality is k, the number of points in the data, n, should be n >> 2^k".

The >> symbol isn't used elsewhere in the article, isn't explained here, and isn't presented as a plain character, so it can't be searched for (on Wikipedia or elsewhere). The closest relevant link, to 'Curse of Dimensionality', doesn't use or explain this symbol, so whatever it means isn't discernible from the text.

What does this sentence mean? 2605:A601:AF20:7500:88A6:9B44:D111:A8F9 (talk) 09:31, 19 September 2024 (UTC)
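
For reference, ">>" here is the standard "much greater than" sign (Unicode ≫): the sentence says the number of points n should be much larger than 2^k, the usual reading being that with too few points most splitting planes lie close to any query, pruning rarely helps, and the search degrades toward a linear scan. A worked example of the rule as stated:

    k = 10:  2^k = 1,024            n = 1,000,000 easily satisfies n >> 2^k
    k = 30:  2^k ≈ 1.07 * 10^9      n = 1,000,000 does not, so a k-d tree gains little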