Talk:JMdict
This article is rated Start-class on Wikipedia's content assessment scale.
Merge with WWWJDIC?
They are two totally different projects:
- JMdict/EDICT is a dictionary project, with an online database, many contributors, editors, etc. The resulting dictionary file is exported daily, both in XML format, and in the two "EDICT" formats.
- WWWJDIC is a dictionary server which uses the EDICT2 file, along with many others, to provide an online dictionary service.
JimBreen (talk) 05:59, 11 September 2012 (UTC)
NPOV
"Jim Breen's own online dictionary WWWJDIC is a convenient way of searching EDICT."
I disagree strongly.
What this article should state is the character encoding of EDICT and EDICT2, and it should provide more information and less back-slapping. (EDICT was a text file in EUC and so is now useless in many English text editors on Windows, such as SciTE, Notepad++ and Notepad2, among others.) Their FTP site says
"Please note that some extended 3-byte EUC characters are used, and this form is generally not supported by Microsoft"
but what we should be saying is that web browsers cannot be expected to display text from edict2, and that users should look for the UTF-8 gzip file edict2u.gz if they wish to use the dictionary in web pages.
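To make the encoding point concrete, here is a minimal Python sketch, offered as an illustration only and not taken from the EDICT distribution. It assumes that "EUC" means EUC-JP (Python's `euc_jp` codec) and that the UTF-8 variant is an ordinary gzip file. The function names and the sample line are hypothetical.

```python
import gzip
import io


def eucjp_to_utf8(data: bytes) -> bytes:
    """Re-encode bytes assumed to be EUC-JP as UTF-8."""
    return data.decode("euc_jp").encode("utf-8")


def read_utf8_gz_lines(path_or_file):
    """Yield text lines from a gzip-compressed UTF-8 file such as edict2u.gz."""
    with gzip.open(path_or_file, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\n")


# Made-up sample line for a round-trip check (not real dictionary data).
SAMPLE_LINE = "辞書 [じしょ] /(n) dictionary/"
LEGACY_BYTES = SAMPLE_LINE.encode("euc_jp")
GZ_BLOB = io.BytesIO(gzip.compress((SAMPLE_LINE + "\n").encode("utf-8")))
```

An editor or browser that guesses the wrong encoding for the legacy bytes shows garbage, whereas the UTF-8 file can be served to a web page as-is.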
At 160,000+ entries, JMDict/EDICT2 might easily be the default on the internet for lack of anything better.
Were WWWJDIC "convenient" it would be on my bookmarks menubar. It is not. Perapera in Firefox would arguably be convenient, were it not so very inconvenient as soon as a small web page font is enlarged.
Where, for example, is edict-sub?
The assertion on the Monash sites that the files are large and require servers is now laughable. The issue is not that JMDict is a 60 MB file but that it is one sequential file of XML. The last Java tool using the dict took that 60 MB file and bloated it to 400+ MB, so all is not well with these dicts from where I sit. What I do see is that one file is used because that is how they think of a flat-file system or a single-file database system, and of some single tool with its single file. It is a bit like insisting that all aircraft should have one engine because the SPAD had only one or the Super Sabre had only one, or that a web site should be one HTML page.
The assertion "which reflects the structure of the XML entries" is almost meaningless.
The entire article is utterly question-begging in its assertion that the representation is "adequate". By which criteria and according to whom? Kudos to Jim Breen, fine. But this is WP, and not a Monash U web page.
There is no mention of whether multiple files would have been more appropriate and whether JSON, for example, would now be preferable to XML. Multiple files would allow for reasonable sets to be indexed (Asahi top 3000, for example, or Manga/Anime top 2000).
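As a rough sketch of what such a split could look like (an illustration only, not how the project exports its data), the following Python streams entries out of a JMdict-style XML file one at a time and writes a chosen subset as JSON. The element names `entry`, `ent_seq`, `keb`, `reb` and `gloss` are assumed to match the JMdict XML. The sample fragment, its sequence numbers, the word list and the function names are all hypothetical.

```python
import io
import json
import xml.etree.ElementTree as ET

# Simplified, made-up fragment; real entries carry more elements and entities.
SAMPLE = """<JMdict>
<entry><ent_seq>1</ent_seq><k_ele><keb>辞書</keb></k_ele>
<r_ele><reb>じしょ</reb></r_ele><sense><gloss>dictionary</gloss></sense></entry>
<entry><ent_seq>2</ent_seq>
<r_ele><reb>ねこ</reb></r_ele><sense><gloss>cat</gloss></sense></entry>
</JMdict>"""


def iter_entries(source):
    """Yield one plain dict per <entry> without building the whole tree first."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "entry":
            continue
        yield {
            "seq": elem.findtext("ent_seq"),
            "kanji": [k.text for k in elem.iter("keb")],
            "readings": [r.text for r in elem.iter("reb")],
            "glosses": [g.text for g in elem.iter("gloss")],
        }
        elem.clear()  # drop the entry's children once it has been handed over


def subset_as_json(source, wanted):
    """Return a JSON array of the entries whose kanji or reading is in `wanted`."""
    picked = [
        entry
        for entry in iter_entries(source)
        if wanted.intersection(entry["kanji"] + entry["readings"])
    ]
    return json.dumps(picked, ensure_ascii=False)
```

Calling `subset_as_json(io.BytesIO(SAMPLE.encode("utf-8")), {"辞書"})` keeps only the first sample entry. A frequency list such as the ones mentioned above would take the place of the hand-written set.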
Should a dict now be a light-weight CMS? Would such a system be based on a single file? What would a useful node structure be for different uses? (The assumption that one dict serves all may go back to Dr. Samuel J., but surely it is now moot in a digital lexicon.)
The very existence of technical lexicons shows just how doubtful the "one-file" approach is. It is not true in the age of electronic medical record keeping that one dict can be for electronics and one for medical terms. That ship has sailed - even with healthcare IT being traditionally 15 years behind CS 101 content.
Because I am working on an alternative to the Monash approach, I am not the one to be correcting this article and bringing it up to snuff.
For the time being, I have placed two UTF-8-encoded HTML pages of edict and edict2 at
http://kanji.aule-browser.com/edict-utf-8.html http://kanji.aule-browser.com/edict2-utf-8.html
for a reader of this note to form some opinion.
G. Robert Shiplett 00:34, 23 April 2012 (UTC)
Title
Why isn't this moved to JMDict?
Preceding unsigned comment added by 2001:240:2412:F4AC:79FA:D8A8:1D49:3660 (talk) 08:06, 13 May 2020 (UTC)