User:Underbar dk/Japanese-English Bilingual Corpus of Wikipedia's Kyoto Articles
The Japanese-English Bilingual Corpus of Wikipedia's Kyoto Articles, produced by the National Institute of Information and Communications Technology, was created by manually translating Japanese Wikipedia articles related to Kyoto into English. As of December 23, 2010, 14,111 Japanese articles had been translated into English.[1] The corpus is intended to support research and development in high-performance multilingual machine translation, information extraction, and other language processing technologies.
Use and/or redistribution of the Corpus and the Lexicon is permitted under the conditions of Creative Commons Attribution-Share-Alike License 3.0.[2]
As the corpus consists of Japanese Wikipedia articles manually translated into English and released under CC BY-SA 3.0, English Wikipedia can use it to fill gaps in its coverage, provided that the corpus articles are in a usable state for English Wikipedia.
Scope
Only articles fulfilling all of the conditions below will be considered for use on English Wikipedia:
- Articles in the corpus with no corresponding article on English Wikipedia;
- Articles with sources in the original article on Japanese Wikipedia.
Additional considerations
- Need to determine whether each article in the corpus would likely pass English Wikipedia's WP:N
- Need to clean up the articles in the corpus to conform with English Wikipedia's Manual of Style
List of articles
Methodology
Generated by running the following on Wikidata's Query Helper:
SELECT ?item ?lemma ?itemLabel WHERE {
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
  VALUES ?lemma {
    # List of article titles from the Wiki_Corpus_List_2.01.csv
    # Example:
    "新選組"@ja
    "池田屋事件"@ja
  }
  ?sitelink_ja schema:about ?item;
               schema:isPartOf <https://ja.wikipedia.org/>;
               schema:name ?lemma.
  MINUS {
    ?sitelink_en schema:about ?item;
                 schema:isPartOf <https://en.wikipedia.org/>.
  }
} ORDER BY ASC(?item)
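The VALUES block in the query above has to be filled with every title from the corpus list. A minimal sketch of how that could be done in Python is shown here; the function names `escape_literal` and `build_query` and the `%VALUES%` placeholder are hypothetical and not part of the original workflow, and the sketch only builds the query text without sending it anywhere.

```python
# Hypothetical helper: builds the SPARQL text for a list of jawiki titles.
QUERY_TEMPLATE = """SELECT ?item ?lemma ?itemLabel WHERE {
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
  VALUES ?lemma {
%VALUES%
  }
  ?sitelink_ja schema:about ?item;
               schema:isPartOf <https://ja.wikipedia.org/>;
               schema:name ?lemma.
  MINUS {
    ?sitelink_en schema:about ?item;
                 schema:isPartOf <https://en.wikipedia.org/>.
  }
} ORDER BY ASC(?item)"""


def escape_literal(title):
    """Escape backslashes and double quotes for a SPARQL string literal."""
    return title.replace("\\", "\\\\").replace('"', '\\"')


def build_query(titles):
    """Return the query with one Japanese-tagged literal per title."""
    values = "\n".join(f'    "{escape_literal(t)}"@ja' for t in titles)
    return QUERY_TEMPLATE.replace("%VALUES%", values)


if __name__ == "__main__":
    # The two example titles from the query above.
    print(build_query(["新選組", "池田屋事件"]))
```

The printed text can then be pasted into the query editor. Long title lists may need to be split into several batches, since very large queries can be rejected or time out.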
RegEx:
- Page source, match: <p><a href='(.+)'>.+</a> \((.+)\) - .+<\/p>
- CSV format for joining, replacement: $2, https://www.japanese-wiki-corpus.org/$1
- Link rewrite, from: https://www.japanese-wiki-corpus.org/history/(.+)
- Link rewrite, to: https://www.japanese-wiki-corpus.org/history/{{urlencode:$1|PATH}}
- Join result, match: (.+),(.+),(.+),(.+)\n
- Substitution to table row, replacement: |-\n|$2\n|[[:ja:$1]]\n|$3\n|[$4]\n
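The regular-expression substitutions above can be sketched in Python as follows. This is an illustration under assumptions, not the tooling actually used: the function names, the sample input line, and the placeholder column values are hypothetical, Python's `\g<n>` stands in for `$n`, and `urllib.parse.quote` is used as an approximation of `{{urlencode:…|PATH}}`.

```python
import re
from urllib.parse import quote

BASE = "https://www.japanese-wiki-corpus.org/"

# Patterns copied from the steps above.
PAGE_LINE = re.compile(r"<p><a href='(.+)'>.+</a> \((.+)\) - .+</p>")
HISTORY_LINK = re.compile(r"https://www\.japanese-wiki-corpus\.org/history/(.+)")
JOIN_ROW = re.compile(r"(.+),(.+),(.+),(.+)\n")


def page_line_to_csv(line):
    """Turn one line of page source into 'title, link' for joining."""
    return PAGE_LINE.sub(r"\g<2>, " + BASE + r"\g<1>", line)


def encode_history_link(text):
    """Percent-encode the part of the link after /history/."""
    return HISTORY_LINK.sub(
        lambda m: BASE + "history/" + quote(m.group(1)), text
    )


def join_row_to_table(row):
    """Turn one four-column joined row into a wikitable row."""
    return JOIN_ROW.sub(
        r"|-\n|\g<2>\n|[[:ja:\g<1>]]\n|\g<3>\n|[\g<4>]\n", row
    )


if __name__ == "__main__":
    # Hypothetical input line; the real page source may differ.
    line = ("<p><a href='history/Ikedaya Incident.html'>Ikedaya Incident</a>"
            " (池田屋事件) - example summary</p>")
    print(encode_history_link(page_line_to_csv(line)))
    # Hypothetical joined row; "Q0" is a placeholder, not a real item.
    print(join_row_to_table("池田屋事件,Ikedaya Incident,Q0," + BASE
                            + "history/Ikedaya%20Incident.html\n"))
```

Because every group is a greedy `(.+)`, the join pattern only splits a row correctly when no field contains a comma of its own; rows with commas in titles would need a proper CSV parser.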
Rating criteria
To determine whether the original Japanese Wikipedia (jawiki) article has sufficient sourcing for an English Wikipedia (enwiki) article, each jawiki article is assigned one of the following ratings:
- y+: jawiki article has sufficient reliable sources (RS) to satisfy WP:BEFORE and has adequate citation footnotes
- y: jawiki article has sufficient reliable sources to satisfy WP:BEFORE
- insuf: jawiki article has insufficient sources (only one source, or overreliance on primary sources); or jawiki article is reasonably tagged for lack of sources or original research (OR), despite satisfying "y" above
- n: jawiki article lacks sources
Note that the versions of the jawiki articles that were rated may differ drastically from the versions the corpus was based on. Ratings in brackets, where they exist, refer to the current version rather than the version the corpus is based on (assume the current version otherwise).
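The rating criteria above are applied by human judgment, but their ordering can be sketched as a small decision function. Everything in this sketch is an assumption made for illustration: the `SourcingCheck` fields, the function names, and in particular the threshold of two or more reliable sources as "sufficient", which is inferred only from "insuf" covering the single-source case. The pass/fail grouping follows the sourcing check under History.

```python
from dataclasses import dataclass


@dataclass
class SourcingCheck:
    """Hypothetical record of a reviewer's findings for one jawiki article."""
    reliable_sources: int                 # reliable sources counted by the reviewer
    mostly_primary: bool = False          # overreliance on primary sources
    has_citation_footnotes: bool = False  # adequate citation footnotes
    tagged_for_sourcing: bool = False     # reasonably tagged for lack of sources or OR


def rate(check):
    """Return 'y+', 'y', 'insuf', or 'n' for one article."""
    if check.reliable_sources == 0:
        return "n"
    if (check.reliable_sources == 1
            or check.mostly_primary
            or check.tagged_for_sourcing):
        return "insuf"
    return "y+" if check.has_citation_footnotes else "y"


def passes(rating):
    """'y+' and 'y' count as a pass; 'insuf' and 'n' count as a fail."""
    return rating in {"y+", "y"}


if __name__ == "__main__":
    example = SourcingCheck(reliable_sources=3, has_citation_footnotes=True)
    print(rate(example), passes(rate(example)))
```

Checking the disqualifying conditions first mirrors the wording of "insuf", which applies even when an article would otherwise satisfy "y".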
History
- 1966 articles in the corpus
- 1419 articles not on enwiki (72%)
- 1387 articles not on enwiki and presented in human-readable form on https://www.japanese-wiki-corpus.org/history.html
jawiki sourcing check:
- pass: 540 (38.9%)
- y+: 25 (1.8%)
- y: 514 (37.0%)
- fail: 847 (61.0%)
- insuf: 338 (24.4%)
- n: 510 (36.8%)
References
- ^ The last modified date of a file (Wiki_Corpus_List_2.01.csv) containing the list of translated articles is December 23, 2010.
- ^ "Japanese-English Bilingual Corpus of Wikipedia's Kyoto Articles". National Institute of Information and Communications Technology.