Talk:Tesseract (software)

This is the talk page for discussing improvements to the Tesseract (software) article.
This is not a forum for general discussion of the article's subject.


This article is written in American English, which has its own spelling conventions (color, defense, traveled) and some terms that are used in it may be different or absent from other varieties of English. According to the relevant style guide, this should not be changed without broad consensus.

This article is rated Start-class on Wikipedia's content assessment scale.
It is of interest to the following WikiProjects:

Computing: Software / Free and open-source software Low‑importance

This article is within the scope of WikiProject Computing, a collaborative effort to improve the coverage of computers, computing, and information technology on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
This article has been rated as Low-importance on the project's importance scale.
This article is supported by WikiProject Software.
This article is supported by Free and open-source software (assessed as Low-importance).

Google Mid‑importance

This article is within the scope of WikiProject Google, a collaborative effort to improve the coverage of Google and related topics on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.

This article has been rated as Mid-importance on the project's importance scale.


Not quite free software?

Although most of Tesseract is free software under the Apache License v2.0, the Aspirin neural network engine may not be. I've no idea if that license is free. I might email the FSF and ask - David Gerard 20:58, 7 September 2006 (UTC)

It seems Aspirin was removed in v. 1.02. Rwxrwxrwx 18:25, 5 November 2006 (UTC)

Yeah, I finally got email back from the FSF - they asked Google about that bit of the licence and Google apparently went "oops" :-) - David Gerard 16:23, 15 April 2007 (UTC)

User-friendly versions

Tesseract seems rather technically challenging to install/configure. FreeOCR is built on it, and may be more user-friendly for people who have the required Windows 2K/XP. Archivista Box is a Linux live CD that provides a complete document management solution and includes Tesseract.[1] [2] The ISO download is here:[3] Do any other live CDs include Tesseract? Does anyone make it available as an online tool? It is odd that this is a Google project, but they aren't making it available in readily usable forms. -69.87.204.80 20:34, 2 October 2007 (UTC)

Tesseract is available in the Ubuntu repositories via the Synaptic package manager. It is therefore very easy to install, just a matter of checking a couple of boxes. Using it from the command line is also very simple, as described in the Ubuntu Documentation - Ahunt (talk) 12:31, 28 June 2008 (UTC)

Userbox

If you use Tesseract, please feel free to put this userbox on your user page!

Code: {{User:Ahunt/Tesseract}}

Result: This user does OCR with Tesseract.

- Ahunt (talk) 12:20, 28 June 2008 (UTC)

Formats

I've just tried to scan a file on Ubuntu. I got this output:

screenshot.bmp: Not a TIFF or MDI file, bad magic number 19778 (0x4d42).

It seems that Tesseract wants a TIFF, or Microsoft's proprietary version of TIFF. No BMP. That contradicts the article. - Chameleon 23:53, 20 August 2008 (UTC)

You are quite right: the article is wrong and the Ubuntu wiki is right. I will fix the article. If you use ".tif" (and only that extension) it works really well. - Ahunt (talk) 00:07, 21 August 2008 (UTC)
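
A note on the numbers in that error, offered as a minimal sketch rather than a description of Tesseract's own code: 19778 is 0x4d42, which is what the two ASCII bytes "BM" that open a BMP file look like when read as a little-endian 16-bit integer. TIFF files instead begin with "II*\0" or "MM\0*". The Python below is illustrative only; the function names sniff_format and magic_number are made up for this example.

```python
import struct

# Leading bytes ("magic numbers") of the two formats discussed in this thread.
TIFF_MAGICS = (b"II*\x00", b"MM\x00*")  # little-endian and big-endian TIFF
BMP_MAGIC = b"BM"


def sniff_format(header: bytes) -> str:
    """Guess the image format from the first bytes of a file."""
    if header[:4] in TIFF_MAGICS:
        return "tiff"
    if header[:2] == BMP_MAGIC:
        return "bmp"
    return "unknown"


def magic_number(header: bytes) -> int:
    """Read the first two bytes as a little-endian unsigned 16-bit integer."""
    (value,) = struct.unpack("<H", header[:2])
    return value


# A stand-in for the start of a BMP file: the signature plus padding bytes.
bmp_header = b"BM" + bytes(12)
print(sniff_format(bmp_header))       # bmp
print(magic_number(bmp_header))       # 19778
print(hex(magic_number(bmp_header)))  # 0x4d42
```

Run on the first bytes of a file such as screenshot.bmp, a check like this would report a BMP, which is consistent with the error message quoted above.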

Spell checking?

A spell checker is not integrated, it seems.-- Matthead Discuß 13:02, 26 February 2011 (UTC)

No it isn't. - Ahunt (talk) 14:50, 26 February 2011 (UTC)

BTW, thank you very very much for replacing the link to a web page explaining how to turn on the hOCR feature with a "Citation needed". This will improve the article and the reliability of wikipedia a lot. Keep up your good work. -- Matthead Discuß 18:10, 26 February 2011 (UTC)

And you should read WP:CIVIL because sarcasm like that isn't civil. You should also have a read of WP:SPS where it says: "Anyone can create a personal web page or pay to have a book published, then claim to be an expert in a certain field. For that reason, self-published media, such as books, patents, newsletters, personal websites, open wikis, personal or group blogs, Internet forum postings, and tweets, are largely not acceptable as sources." If you can find a proper ref for that feature then great, otherwise the wording will be removed from the article as explained at WP:V, which says "The threshold for inclusion in Wikipedia is verifiability, not truth; that is, whether readers can check that material in Wikipedia has already been published by a reliable source, not whether editors think it is true." - Ahunt (talk) 18:25, 26 February 2011 (UTC)

Thank you for making Wikipedia such a nice place. Please go ahead and remove the offending gibberish of mine. -- Matthead Discuß 19:26, 26 February 2011 (UTC)

Why don't you drop the incivility and find a ref for your text instead. I have done a search, but haven't found one yet. - Ahunt (talk) 20:01, 26 February 2011 (UTC)

Had to go through the Tesseract Issues Logs but I found the whole history of it there and added it as a ref. It is a primary source, though, so it would be ideal to have a reliable third party ref as well. - Ahunt (talk) 20:12, 26 February 2011 (UTC)

Should the reference to FreeOCR be removed?

Should the reference to FreeOCR be removed from the article on Tesseract (software)?

The user comments section at this URL:

   http://download.cnet.com/FreeOCR/3000-10743_4-10717191.html

emphatically identifies FreeOCR as sneakware.

Please note: the initial download of FreeOCR is only a download of an installer; the installer itself passes virus scans, but then the installer goes on to download the bulk of the product. - Preceding unsigned comment added by 74.94.104.84 (talk) 20:09, 5 February 2014 (UTC)

Well there is a redirect from FreeOCR to this article, so it may be smarter to just tell the whole story instead. - Ahunt (talk) 20:57, 5 February 2014 (UTC)

Someone braver than I might want to check but currently (April 2018) the FreeOCR download is about 10 megabytes and the download page seems to be more reputable than before, so maybe things have changed.

or maybe not :) Someone (someone else) should try it out and see .... 116.231.75.71 (talk) 11:47, 15 April 2018 (UTC)

Oddly FreeOCR now redirects here to this article, but is not mentioned on the page. I think that redirect needs to be deleted. - Ahunt (talk) 12:46, 15 April 2018 (UTC)

Done - Ahunt (talk) 12:49, 15 April 2018 (UTC)

External links modified

Hello fellow Wikipedians,

I have just modified 2 external links on Tesseract (software). Please take a moment to review my edit. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit this simple FAQ for additional information. I made the following changes:

Corrected formatting/usage for http://google-code-updates.blogspot.com/2006/08/announcing-tesseract-ocr.html
Corrected formatting/usage for http://code.google.com/p/tesseract-ocr/issues/detail?id=263

When you have finished reviewing my changes, please set the checked parameter below to true or failed to let others know (documentation at {{Sourcecheck}}).

An editor has reviewed this edit and fixed any errors that were found.

If you have discovered URLs which were erroneously considered dead by the bot, you can report them with this tool.
If you found an error with any archives or the URLs themselves, you can fix them with this tool.

Cheers. - cyberbot II (Talk to my owner: Online) 16:42, 31 March 2016 (UTC)

- Ahunt (talk) 16:59, 3 April 2016 (UTC)

One of the most accurate open-source OCR engines?

Tesseract is considered one of the most accurate open-source OCR engines currently available.[1][2]

[1] Canonical Ltd. (February 2011). "OCR". Retrieved 2011-02-11.
[2] Willis, Nathan (September 2006). "Google's Tesseract OCR engine is a quantum leap forward". Retrieved 2008-07-18.

The two references given are 6 and 9 years old. Are there any newer references? Otherwise the statement seems to be a little pretentious. --Dichter (talk) 13:09, 27 April 2017 (UTC)

The refs are still valid, but I think it should be dated and I will add that. See what you think. - Ahunt (talk) 13:39, 27 April 2017 (UTC)

Ad hoc logo?

Does anybody have an official Tesseract page that uses the image that is listed as the logo here? The original URL for the image points to a consulting company that seems only tenuously related to Tesseract (though I didn't delve). I did an image search for the displayed image and only found this page and a few blog entries that likely cut/pasted from here. I think we should either post a citation to an official Tesseract page for the logo or cut it. B k (talk) 19:50, 30 January 2020 (UTC)
