Talk:Merkle tree
This is the talk page for discussing improvements to the Merkle tree article. This is not a forum for general discussion of the article's subject. |
Article policies
|
Find sources: Google (books · news · scholar · free images · WP refs) · FENS · JSTOR · TWL |
This article is rated C-class on Wikipedia's content assessment scale. It is of interest to the following WikiProjects: | |||||||||||||||||||||||||||||||||||||
|
Merge of Tiger-Tree Hash into Hash tree
[edit]Hi! I am thinking of merging the Tiger-Tree Hash article into a subsection of Hash tree. Please respond here if you have any comments. --David Göthberg 02:03, 26 August 2005 (UTC)
- Sounds appropriate because a Tiger-Tree hash is just a hash tree hash with fixed choices for the block size and hash function (if I understand correctly?). — Matt Crypto 01:04, 28 August 2005 (UTC)
- Yes, you understand correctly. A Tiger tree hash is just a hash tree with fixed choices for the block size (1024 bytes), hash function (Tiger) and type of tree (binary). He, I just realised I should perhaps mention in the article that a hash tree can be of other types then binary. Each node can have much more child nodes then two. (Although binary is the most popular since at least in theory it is the most flexible.) For instance in my own system I am thinking of using lots of children for each node in such a way that the tree only has two levels. Like this: Data blocks, first level hashes, second level hashes, top hash. Sort of just a hash list with an extra level, easier to implement but still means you can download a "branch" at a time. (That is, download the first level hashes, check the integrity of them, then download a chunk of second level hashes that fits under one single hash in the first level and check the integrity. Then start downloading blocks that corresponds to that chunk of second level hashes.) --David Göthberg 11:45, 29 August 2005 (UTC)
- I think a merge would be ok, as TTH is rather short and is the only notable implentation of a hash I know. Just be careful to maintain a boundary between concept and implementation and this should be no problem. --R.Koot 19:46, 28 August 2005 (UTC)
- Yep, done so. --David Göthberg 11:45, 29 August 2005 (UTC)
Good job! --Adam David Newman 1:57, 19 October 2005 (UTC-6)
A note of dissent: The prominent position given to Tiger seems inappropriate. Is someone who looks up Hash Tree significantly more likely to be dealing with Tiger than with something in the SHA family? If the goal is to provide a good experience to someone inquiring about Tiger Trees, then let's have a Tiger Tree article that says, simply, that a tiger tree is a hash tree that uses Tiger. Peter (talk) 22:44, 26 August 2012 (UTC)
Clarification needed
[edit]- Note, this problem has a solution that is in the official papers and described in the last comment.
Similar to the Merkle-Damgard chain-hash, tree-hash requires something like Initial Values (IVs) to avoid collision which simply extend the tree. Using a tree hash on the picture it appears that
x = Hash 0-0 || Hash 0-1 || Hash 1-0 || Hash 1-1
and
y = Hash 0 || Hash 1
collide:
Tree-Hash(x)=Tree-Hash(y)
--24.60.175.200 06:40, 15 November 2005
- No, you are wrong, sort of. Tree-Hash(x) != Tree-Hash(y) except for one very special case. I'll write an explanation when I thought about a good way to explain it to you. --David Göthberg 02:39, 16 November 2005 (UTC)
If the data blocks incorporate the IV (or are hashed somehow differently into the tree hash leaves, but this needs more care) then my point is taken care of. Otherwise, the construction implies that on the Figure in the article, TopHash=hash(Hash0|Hash1)=hash(hash(Hash00|Hash01)|hash(Hash10|Hash11)).
It seems natural to interpret this as tree-hash(Hash00|Hash01|Hash10|Hash11)=tree-hash(Hash0|Hash1)=TopHash, which gives a collision. My comment is simply to make explicit whatever mechanisms/assumptions that exclude this interpretation. For example, careful use of IVs.
Btw, it might be a good idea to note that hash-chains can be viewed as a special case of the tree-hash (just a very unbalanced tree). Also, authenticated dictionary data structures [Naor Nissim, and subsequent work] constructed using Merkle's tree hashes are probably a good link to include. --24.60.175.200 19:21, 18 November 2005
- First of all, you should really learn to log in and sign your messages. It is really annoying to discuss with some one that does not sign his/her messages. It makes it unclear who says what and when it was said. For now I took the liberty of adding "fake" signatures to your messages to make them readable.
- Secondly, you are still wrong. Your case is only valid if the data block size is exactly twice the size of the hashsum output if a binary tree is used. (So if using SHA1 which has 20 byte hashsum size that would imply using a data block size of 40 bytes.) Using only 40 byte data block sizes would be very inefficient and as you pointed out very insecure.
- Most hash tree implementations use 1024 bytes or more for each data block they hash. That would mean that if you fed the "file" x from your example above to a hash tree function it would hash all of x once as a single data block to get the top hash since it is smaller then 1024 bytes. And if you fed the "file" y from your example above that too would be hashed as a single data block, and thus get another top hash then x.
- However what an attacker can do in this case is to trick the receiver into believing that the actual total size of the file is 40 bytes and then feed your y as the "file data". But there is no way an attacker can trick a receiver into believing that x is the file data.
- One way of avoiding that 40 byte fake file size attack and even avoid problems if you actually use the data block size of 40 bytes in your hash tree implementation is to instead create the top hash something like this:
- (Later edit: Solution here does not give complete protection. See last comment for the proper solution.)
top hash = hash( hash 0 | hash 1 | actual total file size )
- There is no need to use any IVs or file size values etc in the other levels of the hash tree. A more common way is to put the top hash in a metadata file with the file name, file size etc and then instead publish the hash of the metadata file.
- Your Naor Nissim link seems to link to a paper that contains a description of a hash tree structure similar but not the same to the kind of hash trees this article is talking about. (This article is about Merkle trees.) But if you like to then add that link, but state that it only describes a similar but not same approach.
- Regarding hash chains: I could not find any article in Wikipedia about hash chains and I don't remember that I have heard about them before. But if I guess correctly what they are then yes, one could see a hash chain as a very weird unbalanced hash tree. But for now I would not add that to the article for several reasons.
- Note that I have tried to keep the article short and easy to understand instead of complete. Since complete in this case would mean at least 10 pages of text and at least 4-10 pictures. Instead I resorted to using the sentence "There are several additional tricks, benefits and details regarding hash trees" and then linking to the THEX paper and RSA articles that tell more (but even those are incomplete).
- --David Göthberg 12:52, 20 November 2005 (UTC)
- Hi again 24.60.175.200. I just wanted to thank you for pointing out that type of attack. Even though your understanding of it was not complete it lead me to understand the two kinds of "40-byte attacks". I was not aware of it and have not seen any such thing mentioned in any papers on hash trees or hash lists. I have realised now that the fix I wrote about above does not give complete protection. I have to think more about this and discuss it with some other crypto guys.
- --David Göthberg 21:06, 8 December 2005 (UTC)
Update: This problem was discovered a couple of years ago and resolved in Justin Chapweske and Gordon Mohr's THEX Protocol. This specification for the contruction of tree hashes has likely received the most scrutiny of any tree hash implementation and is thus recommended for implementation. The appropriate solution to the above problem is to use separate hash functions for interior and leave nodes of the hash tree, which THEX implements by prefixing either 0x00 or 0x01 to the data to be hashed. THEX is implemented in Swarmcast and Gnutella.
24.118.51.232 22:04, 8 December 2005 (UTC)
- Ah yes, that seems to be a good solution. --David Göthberg 14:03, 24 February 2006 (UTC)
- Agreed, this is an excellent solution. Unfortunately, the article continues to promote insecure implementatinos. Few people will bother reading this talk page. Can we add a section describing the collision problem and this solution? WaywardGeek (talk) 15:53, 2 November 2015 (UTC)
Some clarifications
[edit]Several things seem to have confused editors and some readers. Since I don't know how to make them more clear in the article without making the text to heavy I will explain them here so future editors at least have access to the info.
- Hash trees doesn't have to be binary. That is, they can have more than two child nodes/leaves under each node. But binary hash trees seems to be the most used nowadays. One can just as well for instance have a tree with 100 child nodes/leaves under each node. Then if you use 100 bytes per hashed data block and have a file of say 1 million bytes that means two levels of hashes below the top hash. Like this: Top hash < 100 hashes < 100*100 hashes < 100*100 hashes * 100 bytes per block = 1 million bytes.
- The sentences about Lamport signatures and quantum computers seem to have caused a lot of confusion and edits. I have kept them short in the article since that is currently not an important issue and should probably instead be explained more in detail in the Lamport signature article. Lamport signatures are in them self already secure against quantum computers since they do not use the kind of maths that other public-key systems use, instead they only use hashes. Good strong hashes are believed to not be affected by quantum computers. You just need to use slightly bigger hashes than we use today. The bad part is that a Lamport signature can only be used once. So your published "public-key" can only be used one time. So if you want to sign many messages you would have to publish a long list of one time public Lamport keys. That would be awkward. But if you hash all those Lamport public keys with a hash tree you can then instead publish just the top hash! And when signing a message all you have to store in the message is the signature and the hash tree "branch" of hash nodes that makes it possible to verify the hashes up to the top hash. So hash trees doesn't make Lamport signatures more secure or change them, hash trees "just" makes it possible to easily handle MANY Lamport signatures.
- Some editors have added "citation needed" tags on the page or asked about the source of some of the statements. Well, all the sources needed are already right there in the external links section. Those links go to good sources that explains pretty much all the details about hash trees and the RSA labs article also explain why Lamport signatures are secure against quantum computers. (The article is written and published by RSA labs, it is not a Wikipedia article.) And I think RSA labs is one of the more trusted sources we have when it comes to crypto stuff. Many other sources I have seen also claims that quantum computers do not seriously affect hashes and stream/block ciphers. Just that public-key systems like RSA, DH, ECC etc are affected.
- Note that this article doesn't tell the full story about hash trees. There are many more things one need to understand to implement and use hash trees in a cryptographically secure way. But that would make this article huge. Currently the article is just an introduction to hash trees. Again the external links section points to documents telling most of the details and that is why I put a sentence in the text that mentions that: "There are several additional tricks, benefits and details regarding hash trees. See the external links below for more in-depth information."
- Tiger tree hashes are just a specific kind of hash tree. And currently the most used ones. There is nothing special about them using Tiger hashes. Tiger is just another hash function and Tiger in itself has nothing special to do with hash trees. It just so happens that when hash trees started to be used in p2p networks the p2p programmers did choose Tiger as the hash. They could just as well have chosen any other cryptographically secure hash.
--David Göthberg 06:23, 18 October 2006 (UTC)
Merger Proposal Hash tree (persistent data structure)
[edit]- The following discussion is closed. Please do not modify it. Subsequent comments should be made in a new section. A summary of the conclusions reached follows.
- The result of this discussion was to keep the articles separate. QVVERTYVS (hm?) 09:37, 3 May 2015 (UTC)
I am proposing a merge of Hash tree (persistent data structure) into Merkle tree.
- Merge - They're referring to the same thing, a trie where elements are stored by hashcode. --Padenton (talk) 22:48, 10 March 2015 (UTC)
- Oppose: these are very different data structures. A Merkle tree is not a trie as Padenton supposes. It is a recursive structure where leaves carry hashes of data, while internal nodes carry hashes of hashes. In a trie, internal nodes (conceptually, the links between them) carry parts of hashes of data, not hashes of hashes. QVVERTYVS (hm?) 10:35, 11 March 2015 (UTC)
- Oppose: Per Qwertyus' description. — Dsimic (talk | contribs) 21:23, 11 March 2015 (UTC)
- Oppose Merkle tree is a tree, which consist of hashes, of course. But 1) it is a very specific structure, rather than Hash tree (persistent data structure), and 2) it definitely deserves a separate article @ Hannasnow (talk) 16:38, 15 April 2015 (UTC)
Failure to Explain
[edit]Yo there!
I am half-oatmeal brained right now from other things, but that's only making things normal :8). Reading through a couple of screens multiple times produced not a clue about hash trees yet - and I need something concise. This article seems to totally fail ... in other words I feel it would be a complete waste of time to pursue it any further - and that I need to find a better teacher.
The subject complexity is not a problem for me, generally. And, even if it means creating a long story -- it should be engaging with bite-sized bits to digest.
Xgenei (talk) 02:12, 23 September 2015 (UTC)
Proof of membership of block i seems to be missing from article
[edit]The following section does not seem correct to me: "Then, the received hash tree is checked against the trusted top hash, and if the hash tree is damaged or fake, another hash tree from another source will be tried until the program finds one that matches the top hash."
Here is my interpretation of the above algorithm:
algorithm 1
input: a trusted root hash R and an untrusted, nondeterministic oracle O providing hash trees that purport to have root R. do forever ask O for a hash tree T if T is a valid hash tree for R, meaning ( (1) each leaf is a data block (2) the parent of each leaf contains the hash of its unique child (3) every other internal node contains the hash of the concatenation of the hashes of its children (4) the root of T contains R ) then output T and halt
This algorithm does not exploit the tree structure of T except to extend a hash function to large files, something the hash function presumably already does, and so provides no benefit over the following, much simpler algorithm:
algorithm 2
input: a trusted hash R and an untrusted, nondeterministic oracle O providing files that purport to have hash R. do forever ask O for a file F if hash(F) = R, then output F and halt
I imagine that a key desired benefit of a hash tree would be the ability to validate block i of the file, for any i, independently of validating any other block. That way, one could redownload just block i if it happened to be invalid. If we have to wait to get the entire tree before we can validate anything, and if we can't identify which blocks are invalid (and algorithm 1 does not provide any such information) then this whole data structure is pointless because we would have no choice but to redownload the entire tree upon any failure.
To rectify this, it seems necessary for the supplier of block i to also supply an independently verifiable proof that block i is valid, e.g. the hashes of the siblings of the nodes in the root-to-leaf path. Then the algorithm would look like this:
algorithm 3
input: a trusted hash R, a number of blocks n, and an untrusted, nondeterministic, asynchonous, parallelizable oracle O that, given i, eventually provides what purports to be block i together with a proof, in the form of the hashes h1, ..., hk (k = log_b n, b = degree of the tree) of the siblings of the nodes in a root-to-leaf path to block i. initialize an array A of length n with nulls to hold validated blocks while A contains a null, say in position i, and there is not already an outstanding request for block i ask O for block i and proof h1, ..., hk when O completes this request, validate block i as follows: hk' <- hash(block i) for j <- k, ..., 1 s <- the index of the sibling that the path from the root to block i goes to at level j. e.g. in a binary tree, if the path at the ith level went left, then s = 0, if it went right, then s = 1. h(j-1)' <- hash(concatenation of the members of hj, inserting hj' into position s) if h0' = R // block i is valid A[i] <- block i output the concatenation of the blocks in A and halt
This description is made a bit pointlessly complex by considering b-ary trees instead of specializing to binary trees, which I did only because the article currently seems to want this generalization.
Assuming that for each request, O fails to provide a valid block i independently with probability p, the expected number of blocks algorithm 1 or 2 (they are basically the same) must download is n/(1-p)^n. By constrast, algorithm 3 only requires n/(1-p), an exponential improvement. The proof of block i membership is obviously complete and it's sound because fooling the verifier requires finding a hash collision.
The point of all this discussion isn't to provide some novel research, but to point out that the current description seems to imply algorithm 1, which makes hash trees pointless. Algorithm 3 seems like an obvious fix but requires the notion of a proof of block i's membership, which the current article seems to be completely lacking.
Can someone who knows about real hash tree implementations confirm that suppliers of blocks must also provide a proof, not just block i, and if so, is the form of this proof exactly what I predicted or do they use some other proof system?
External links modified (January 2018)
[edit]Hello fellow Wikipedians,
I have just modified one external link on Merkle tree. Please take a moment to review my edit. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit this simple FaQ for additional information. I made the following changes:
- Added archive https://web.archive.org/web/20070317045032/http://www.codeproject.com/cs/algorithms/thexcs.asp to http://www.codeproject.com/cs/algorithms/thexcs.asp
When you have finished reviewing my changes, you may follow the instructions on the template below to fix any issues with the URLs.
An editor has reviewed this edit and fixed any errors that were found.
- If you have discovered URLs which were erroneously considered dead by the bot, you can report them with this tool.
- If you found an error with any archives or the URLs themselves, you can fix them with this tool.
Cheers.—InternetArchiveBot (Report bug) 06:30, 26 January 2018 (UTC)
Related sources (zkSnark)
[edit]UC Berkeley. Nicholas Spooner nick.spooner@berkeley.edu. UC Berkeley. October 25[1] — Preceding unsigned comment added by Waterwizardm (talk • contribs) 08:31, 9 July 2020 (UTC)
References
Article doesn't mention torrent files
[edit]It seems like the article should mention one of the more common uses of Merkle trees, [bit]torrent files: http://bittorrent.org/beps/bep_0052.html — Preceding unsigned comment added by 24.156.255.250 (talk) 18:19, 7 January 2022 (UTC)
- We don't mention BitTorrent here since torrent files contain a hash list, not a hash tree. If the peer-to-peer program choses the right block sizes then hash lists are good enough and the added complexity of hash trees is unnecessary.
- --David Göthberg (talk) 08:22, 30 January 2023 (UTC)
Mention and/or redirect from "Merkle DAG"?
[edit]I suppose Merkle originally published the idea wrt. trees, but same idea naturally generalizes to directed acyclic graphs where multiple nodes may reference a given node's hash. For a while these were handwaved under "Merkle tree" (current article already mentions IPFS, Dat, Git & Mercurial as examples; only for the latter it admits they are DAG not tree!) but I'm seeing multiple people adopt the term "Merkle DAG" in recent years e.g.:
- https://docs.ipfs.tech/concepts/merkle-dag/
- Term possibly originated with IPFS? https://github.com/ipfs/papers/issues/4
- https://tylercipriani.com/blog/2016/03/21/Visualizing-Git-Merkle-DAG-with-D3.js/
- https://github.com/dat-ecosystem-archive/datproject-discussions/issues/28
The graph point of view naturally arises whenever you have texts that contain hashes of other texts (content adressing). They can be explicitly structured like Git commit & tree objects, but it could even be free-form prose — as long as texts reference each other by hash.
=> Whether you intended it or not, the result forms a directed graph.
- It's also a "tree" iff you enforce additional constraint to only refer to each text from one "parent".
- But if you rebuild the tree after changes, newer trees may share sub-trees with old trees, so once you consider history, the union of versions forms a graph. (This is a major way in which Git objects form a DAG, even if you abstain from duplicate files or merges.)
- The "acyclic" part is not strictly guaranteed, but it naturally arises from having to know a node's hash before you can refer to it; forming a cycle requires a hash collision (at random or deliberately constructed) and systems built on the Merkle tree/graph properties tend to assume hash collisions will not happen...
P.S. the arrows in the existing illustration are directed from leafs towards root; that is the direction in which data can be validated, and in which changes "bubble up", but imho better be reversed if you want to explain the "refers to" relation of content addressing.
Cbensf (talk) 14:24, 30 July 2024 (UTC)
- I absolutely think we should have a section on Merkle DAGs, although I think we should use the term "Sparse Merkle Tree" as this seems to be more common in literature. As far as the direction of the arrows, Merkle trees purpose are for inclusion verification, which is done from bottom to top, so I think the arrows should stay the direction they are pointed. Epachamo (talk) 16:53, 30 July 2024 (UTC)
- C-Class Cryptography articles
- Mid-importance Cryptography articles
- C-Class Computer science articles
- Mid-importance Computer science articles
- WikiProject Computer science articles
- WikiProject Cryptography articles
- C-Class Computing articles
- Low-importance Computing articles
- C-Class software articles
- Mid-importance software articles
- C-Class software articles of Mid-importance
- All Software articles
- All Computing articles