User:Malbrain/HAT-trie
The HAT-Trie is a type of radix trie that uses array nodes to collect individual key/value pairs under radix nodes and hash buckets into an associative array. Unlike a simple hash table, it allows sorted access to the key/value collection. The original inventors are Nicolas Askitis and Ranjan Sinha who describe their HAT-Trie in an article published in Proc. Thirtieth Australasian Computer Science Conference (ACSC2007), Ballarat Australia. CRPIT, 62. Dobbie, G., Ed. ACS. 97-105 [1]. Dr. Askitis shows that building and accessing the HAT-trie key/value collection is considerably faster than other access methods and is comparable to the Array Hash [2]. This is due to the cache-friendly nature of the data structure which attempts to group associated data into the 64 byte cache line size of the modern CPU (see CPU Cache).
A new HAT-Trie starts out as a NULL pointer representing an empty node. The first added key allocates the smallest array node and copies into it the key/value pair, which becomes the root of the trie. Each subsequent key/value pair is added to the initial array node until a maximum size is reached when the node is burst by re-distributing its keys into a hash bucket with new underlying array nodes, one for each occupied hash slot in the bucket. The key strings are stored in the array nodes with a length encoding byte prefixed to the key value bytes. The value associated with each key can be stored either in-line alternating with the key strings, or placed in a second array, e.g., memory immediately after and joined to the array node [3].
Once the trie has grown into its first hash bucket node, the hash bucket distributes new keys according to a simple hash function of the key value into array nodes contained underneath the bucket node. Keys continue to be added until a maximum number of keys for a particular hash bucket node is reached. The bucket contents are then re-distributed into a new radix node which replaces the hash bucket node in the trie structure[4] (e.g. see Burstsort[5]). The existing keys and values contained in the hash bucket are each shortened by one character and placed under the new radix node in a set of new array nodes.
Sorted access to the collection is provided by enumerating keys into a cursor by branching down the radix trie to assemble the leading characters, ending at either a hash bucket or an array node. Pointers to the keys contained in the hash bucket or array node are assembled into an array that is part of the cursor for sorting. Since there is a maximum number of keys in a hash bucket or array node, there is a pre-set fixed limit to the size of the cursor at all points in time. After the keys for the hash bucket or array node are exhausted by get-next (or get-previous) the cursor is moved into the next radix node entry and the process repeats[6].
References
[edit]- ^ http://crpit.com/confpapers/CRPITV62Askitis.pdf HAT-trie: A Cache-conscious Trie-based Data Structure for Strings
- ^ Askitis, N. & Zobel, J. (2005), Cache-conscious collision resolution for string hash tables, in ‘Proc. SPIRE String Processing and Information Retrieval Symp.’, Springer-Verlag, pp. 92–104
- ^ Askitis, N. and Zobel, J. 2011. Redesigning the string hash table, burst trie, and BST to exploit cache. ACM J. Exp. Algor. 15, 1, Article 1.7 (January 2011)
- ^ Burst tries: a fast, efficient data structure for string keys ACM Trans. Inf. Syst., Vol. 20, No. 2. (April 2002), pp. 192-223, doi:10.1145/506309.506312 by Steffen Heinz, Justin Zobel, Hugh E. Williams
- ^ Sinha, R. and Wirth, A. 2010. Engineering burstsort: Toward fast in-place string sorting. ACM J. Exp. Algor. 15, Article 2.5 (March 2010)
- ^ http://www.siam.org/meetings/alenex03/Abstracts/rsinha.pdf Cache-Conscious Sorting of Large Sets of Strings with Dynamic Tries