Talk:Aho–Corasick algorithm

This article is within the scope of WikiProject Computer science, a collaborative effort to improve the coverage of Computer science related articles on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.Computer scienceWikipedia:WikiProject Computer scienceTemplate:WikiProject Computer scienceComputer science articles

Mid

This article has been rated as Mid-importance on the project's importance scale.

Things you can help WikiProject Computer science with:

Here are some tasks awaiting attention:

Article requests :
- Requested articles/Applied arts and sciences/Computer science, computing, and Internet
Cleanup :
- Computer science articles needing attention
- Computer science articles needing expert attention
Copyedit :
- Computing
Expand :
- Computer science
Infobox :
- Computer science articles without infoboxes
Maintain :
- Timeline of computing 2020–present
Photo :
- Find pictures for the biographies of computer scientists (see List of computer scientists)
- Computing articles needing images
Stubs :
- Computer science stubs
Unreferenced :
- WikiProject Computer science/Unreferenced BLPs
Project-related :
- Tag all relevant articles in Category:Computer science and sub-categories with {{WikiProject Computer science}}

Something is wrong with the first sentence in the "Informally, the algorithm constructs a trie ..." paragraph -- some kind of copy-paste replacement error; haven't figured out exactly how to fix it yet. Looks like there isn't a clean version to go back to; the mismatched parenthetical expression hasn't changed since it was added (17:36, 27 July 2006). Anyone? dvgrn 18:46, 19 September 2006 (UTC)[reply]

Got rid of the obvious error, but may have introduced some subtle inaccuracy -- will have to study the algorithm some more to be sure. dvgrn 20:40, 19 September 2006 (UTC)[reply]

the Link that should lead to the php/javascript implementation of the algorithm is broken. anyone knows where to find it? —Preceding unsigned comment added by 190.60.117.134 (talk) 20:25, 23 May 2008 (UTC)[reply]

Bad example dictionary

The example is bad from a didactical perspactive, because it doesn't contain any shared nodes/paths exept for proper prefixes, like the root/empty prefix. Shinobu (talk) 02:20, 6 October 2008 (UTC)[reply]

History

could somebody add details regarding when this algorithm was invented and when made public etc. Also, it would be better to add some background about the algorithm such as what kind of practical problems does it address? what are its applications etc.

Lokesh 2000 (talk) 04:42, 20 January 2010 (UTC)[reply]

"The Aho–Corasick string matching algorithm formed the basis of the original Unix command fgrep."

Not a great example. There were better algorithms for finding a single fixed string in text which fgrep could have used (eg, Knuth-Morris-Pratt). Remember that the strength of Aho-Corasick lies in finding a match for any of a large set of fixed strings in text; and this is not what fgrep does. Aho-Corasick was easily extended to implement Kleene's regular expressions and this extension is used in grep and egrep. So Aho-Corasick's use in fgrep was simply a case of ease of implementation. A virus scanner is by far the best example of the use of Aho-Corasick: in one sequential read of a file there is a set of many thousands of virus "signature" strings compared. Gdt (talk) 06:16, 10 August 2012 (UTC)[reply]

I suppose, but an important part of fgrep is that it allows for many fixed strings, likely with the -f flag. As I understand, regular grep was then extended with -f. GNU grep then, as well as I know, added the Boyer-Moore search on the longest fixed string before doing the regex match, and so also allowing for very fast fixed string matching. That removes the need for a separate fgrep. But yes, the optimal case for Aho-Corasick is a large number of pattern strings. Gah4 (talk) 22:27, 4 July 2013 (UTC)[reply]

The text should also mention the Rabin-Karp algorithm with Bloom filters, as that is the major alternative to Aho-Corasick. The two algorithms have differing strengths and weaknesses, so one can't be recommended over the other. Gdt (talk) 06:24, 10 August 2012 (UTC)[reply]

The diagram doesn't show the dictionary suffix links. I'll upload a new image that does. 24.27.32.151 (talk) 12:56, 16 August 2012 (UTC)[reply]

Visualization

I have made an interactive visualization of Aho-Corasick algorithm in flash. If you find it appropriate for this article, please add it to External links. I don't want to be accused of COI. The link is http://blog.ivank.net/aho-corasick-algorithm-in-as3.html

Just regex?

Isn't Aho-Corasick really just a specific case that regular expression engines that generate compiled state machines already support? — Preceding unsigned comment added by 173.228.89.16 (talk) 18:17, 16 June 2013 (UTC)[reply]

It is an optimization for (very) large numbers of constant strings. Regexp engines might support it, especially if the strings are not actually regexp, but the ones I know compile each one separately. I believe that GNU grep uses Boyer-Moore to speed up its searching, based on the longest constant string in the regexp, but haven't looked recently. Gah4 (talk) 15:04, 21 September 2016 (UTC)[reply]

Java vs. C++

There is a reference to a C++ implementation in the references. Someone added a reference to a Java implementation, but it was reverted. Is this a Java vs. C+ thing? Why one and not the other? I don't actually know about either of them. Gah4 (talk) 15:33, 21 September 2016 (UTC)[reply]

No, it is not a language thing. This article is about an algorithm; it is agnostic to programming issues. Generally, the external links section is for information that would be included in the article later or may be too detailed. External links that include readable content are better than links to just code. A link to an article about the algorithm (e.g., the lecture slides) would be much better than a code listing. It's poor practice to subject encyclopedia readers to digesting code, but we do it from time to time. Some code implementations also have substantial documentation in plain language.

A week ago an IP inserted an EL to a GitHub-hosted Java implementation.[1] I followed the link and found a project that appeared to have been posted a few days previously. That suggests some programmer has just done an implementation and posted it. That's the same as a blog; there's no independent review or assessment; it is not a reliable source. It's documentation is just a link to the 1975 paper that is already cited in this article. Then a link is placed on WP to advertise the implementation. Consequently, I reverted the addition with the edit comment "recent project? advert?"[2]

Although I recognized there was a GitHub C++ implementation, I did not research that link at the time and just left it (the WP article does not have an implementation). I just looked at that post now, and its documentation is a pointer to this WP article. I'm tempted to delete the C++ implementation link.

Glrx (talk) 17:51, 21 September 2016 (UTC)[reply]

I didn't look at either one, just wondered why one and not the other. Funny story, it seems that the algorithm was rediscovered by people working at the NCBI ^[1] in developing BLAST ^[2] software for DNA and protein sequence searching. I used their implementation, and cited it, for a project I was working on, and then someone told me that it was Aho-Corasick. I told the BLAST people, and they agreed. I suppose I could add BLAST to the external links section. It is definitely production level software, produced with government funding, not someone at home. I believe it is all C. Gah4 (talk) 18:45, 21 September 2016 (UTC)[reply]

References

^ "National Center for Biotechnology Information". www.ncbi.nlm.nih.gov. NCBI. Retrieved 21 September 2016.
^ "BLAST". blast.ncbi.nlm.nih.gov. NCBI. Retrieved 21 September 2016.

What are the input what is the output

It's not clear reading the article what must be the input? A dictionary of words (a set actually) and an array of chars aka. a string. Or it's an operation on two sets of words to see the intersection? What is the output? How does it relate to grep? i⋅am⋅amz3 (talk) 18:47, 9 May 2017 (UTC)[reply]

Aho Corasick vs TF-IDF

How does compare Aho Corasick to a search engine retrivial scoring function like TF-IDF? i⋅am⋅amz3 (talk) 18:47, 9 May 2017 (UTC)[reply]

Apart from both having applications in text mining, they are nothing alike. Aho-Corasick is a string searching algorithm, while TF-IDF is a term weighting statistic. Given a dictionary, Aho-Corasick builds a tree-like data structure for efficiently matching all the dictionary strings against a text document (i.e., just another string). On the other hand, TF-IDF measures the relevance of a given term in a given document from a particular collection of documents. Let's say that a user is searching the web and inputs a keyword query in a search engine. Then, the TF-IDF for each term in the query can be computed for each document in the collection and used to compute a final document score. Then, documents are ranked from highest to lowest score, in order to present the most relevant documents to the user. Aho-Corasick, on the other hand, might be useful on a situation where you want to identify, let's say, people's names in a collection of documents. You might use the people's names as your dictionary and then process each document to find matches. José Devezas (talk) 10:22, 6 June 2018 (UTC)[reply]

bab not in tree

In the presented example the word bab is mentioned among the words, but it is not represented in the automaton... Oh there is another automaton shown later ... the correct one ... the word this would be better replaced by "bellow" 217.153.187.164 (talk) 18:21, 27 October 2022 (UTC) sorry, I have to remember my account info[reply]

Quadratic?

The phrase in the first paragraph (underlining is mine):

Note that because all matches are found, there can be a quadratic number of matches if every substring matches

puzzled me. Quadratic with respect to what? Surely there cannot be more matches than strings, so it seems linear WRT the strings. Викидим (talk) 03:06, 22 July 2024 (UTC)[reply]

The statement is meant to be "quadratic in length of input", where length of input is total number of characters. But that claim is wrong, it cannot be quadratic - it is a nice exercise to show that there's an instance with

\Theta (n{\sqrt {(}}n))

matches, and that it cannot be asymptotically more. Martin Koutecký (talk) 19:27, 9 October 2024 (UTC)[reply]

Removed "quadratic". I think that the original statement implied that as the length extends, so does the potential number of entries in the dictionary. Викидим (talk) 23:39, 9 October 2024 (UTC)[reply]

I've made one more clarification. I think it's important to point out that, unlike with Knuth-Morris-Pratt, you cannot expect complexity linear in the length of the input, so linear in the length of input + output is the best possible. Martin Koutecký (talk) 11:11, 10 October 2024 (UTC)[reply]

Undid a part of your change. I do not understand the "longer" part: the input and output of the algorithm are two different data structures, comparing their lengths is definitely nontrivial. Викидим (talk) 17:58, 10 October 2024 (UTC)[reply]

I don't understand what's the issue.

The input is a collection of strings and we measure its length by counting the characters in it.

The output is a set of pairs (i,j) such that string i appears at index j. The number of such pairs may be (much) larger than the length of the input.

Since you were not happy with my phrasing, could you please come up with a better phrasing communicating the same fact? In my opinion, the current state is a downgrade from the incorrect statement that there may be quadratically many matches, but at least that conveyed that the "number of matches" really has to be a part of the complexity. This has now gotten lost. Martin Koutecký (talk) 13:10, 23 October 2024 (UTC)[reply]

I have no read-made language as I in turn do not understand your concern. My points are simple:

While it is easy for me to understand the meaning of words string "aaaaa" requires more space than "a", I cannot easily imagine a straightforward comparison of sizes of the same string "aaaaa" and of a tuple (1,1). Either the assumptions shall be listed or the comparison adjectives skipped.
Similarly, the generic meaning of quadratic describes a relationship between x and y of the form y = O(x*x). When declaring something to be quadratic, the text should identify not only y, but x also.
I am not an expert, but to the best of my knowledge AC algorithm is used primarily to find the threat patters in the network traffic. It does not appear to me that the "a", "aa", "aaa", ... is a typical dictionary for this application. I would like therefore to add something along the lines of "pathological" or "atypical" to the description of this example (this requires a WP:RS, of course).

Викидим (talk) 21:44, 24 October 2024 (UTC)[reply]

[NCBI-1] "National Center for Biotechnology Information". www.ncbi.nlm.nih.gov. NCBI. Retrieved 21 September 2016.

[BLAST-2] "BLAST". blast.ncbi.nlm.nih.gov. NCBI. Retrieved 21 September 2016.

[1]

[2]