User:AKA MBG/Todo
Test "Notes".[a]
Help:
- Help:Shortened footnotes#Shortened footnotes with separate explanatory notes with references
- {{Cite book}}
- {{Cite conference}}
This page was last edited at 15:33, 17 May 2021 (UTC).
Wiktionary
Wiktionary data are heavily used in various NLP tasks (see #Wiktionary data in NLP).
Wiktionary data in NLP
Wiktionary has semi-structured data.[2] Its lexicographic data must be converted to a machine-readable format before they can be used in natural language processing tasks.[3][4][5]
Mining Wiktionary data is a complex task. The main difficulties are:[6] (1) the constant and frequent changes to the data and schema, (2) the heterogeneity of the schemas across Wiktionary language editions,[b] and (3) the human-centric nature of a wiki.
Several parsers exist for different Wiktionary language editions:[7]
- DBpedia Wiktionary: a subproject of DBpedia. The data are extracted from the English, French, German and Russian Wiktionaries and include language, part of speech, definitions, semantic relations and translations. A declarative description of the page schema,[8] regular expressions[9] and a finite-state transducer[10] are used to extract the information.
- JWKTL (Java Wiktionary Library): provides access to English Wiktionary and German Wiktionary dumps via a Java API.[11] The data include language, part of speech, definitions, quotations, semantic relations, etymologies and translations. JWKTL is available for non-commercial use.
- wikokit: a parser for English Wiktionary and Russian Wiktionary.[12] The parsed data include language, part of speech, definitions, quotations,[13][c] semantic relations[14] and translations. It is multi-licensed open-source software.
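The regular-expression style of extraction mentioned above can be illustrated with a minimal sketch. It is not the code of DBpedia Wiktionary, JWKTL or wikokit. It assumes a simplified, English-Wiktionary-like layout in which a level-2 heading names the language, a level-3 heading names the part of speech, and lines starting with a single "#" are definitions. The sample text, the function name parse_entry and the three patterns are hypothetical, and real entries are considerably more varied.

```python
import re

# Toy wikitext in a simplified English-Wiktionary-like layout (an assumption,
# not a real entry).
SAMPLE = """==English==
===Noun===
# A toy definition of the first sense.
#: An example sentence that should be skipped.
# A toy definition of the second sense.
===Verb===
# A toy verb definition.
==French==
===Noun===
# Another toy definition.
"""

LANG_RE = re.compile(r"^==([^=]+)==\s*$")   # level-2 heading: language
POS_RE = re.compile(r"^===([^=]+)===\s*$")  # level-3 heading: part of speech
DEF_RE = re.compile(r"^#\s*([^#:*].*)$")    # "# text", but not "#:", "#*", "##"


def parse_entry(wikitext):
    """Return {language: {part_of_speech: [definitions]}} for one entry."""
    entry = {}
    lang = pos = None
    for line in wikitext.splitlines():
        m = LANG_RE.match(line)
        if m:
            lang, pos = m.group(1).strip(), None
            entry[lang] = {}
            continue
        m = POS_RE.match(line)
        if m and lang is not None:
            pos = m.group(1).strip()
            entry[lang][pos] = []
            continue
        m = DEF_RE.match(line)
        if m and pos is not None:
            entry[lang][pos].append(m.group(1).strip())
    return entry


if __name__ == "__main__":
    for language, sections in parse_entry(SAMPLE).items():
        for part_of_speech, definitions in sections.items():
            print(language, part_of_speech, len(definitions))
```

Hard-coded patterns like these break whenever an edition changes its entry layout, which is one reason a declarative, per-edition schema description is attractive.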
Wiktionary data have been used in a variety of natural language processing tasks:[15]
- Rule-based machine translation between Dutch and Afrikaans; data from English Wiktionary, Dutch Wiktionary and Wikipedia were used with the Apertium machine translation platform.[16]
- Construction of a machine-readable dictionary by the NULEX parser, which integrates open linguistic resources: English Wiktionary, WordNet and VerbNet.[17] NULEX scrapes English Wiktionary for tense information (verbs), plural forms and part of speech (nouns).
- Speech recognition and synthesis, where Wiktionary was used to automatically create pronunciation dictionaries.[18] Word-pronunciation pairs were retrieved from six Wiktionary language editions (Czech, English, French, Spanish, Polish, and German). Pronunciations are given in the International Phonetic Alphabet.[d] The ASR system based on English Wiktionary has the highest word error rate, where every third phoneme has to be changed.[20]
- Ontology engineering[21] and semantic network construction.[e]
- Ontology matching.[22]
- Text simplification. Medero & Ostendorf[23] assessed vocabulary difficulty (reading level detection) with the help of Wiktionary data. They investigated properties of words extracted from Wiktionary entries: definition length and the counts of parts of speech, senses and translations. Medero & Ostendorf expected that (1) very common words would be more likely to have multiple parts of speech, (2) common words would be more likely to have multiple senses, and (3) common words would be more likely to have been translated into multiple languages. These features were useful in distinguishing word types that appear in Simple English Wikipedia articles from words that appear only in the comparable Standard English articles.
- Part-of-speech tagging. Li et al. (2012)[24] built multilingual POS-taggers for eight resource-poor languages on the basis of English Wiktionary and Hidden Markov Models.[f]
- Sentiment analysis.[25]
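The word properties listed under text simplification (definition length and the counts of parts of speech, senses and translations) can be sketched as a small feature extractor. This is an illustration only, not the procedure of Medero & Ostendorf. The Entry structure, the function name word_features, the toy data, and the choice to measure definition length in whitespace-separated tokens are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """Hypothetical parsed Wiktionary entry for one word."""
    # part of speech -> list of definition strings (one per sense)
    senses: dict = field(default_factory=dict)
    # language codes for which at least one translation is listed
    translations: set = field(default_factory=set)


def word_features(entry):
    """Compute simple per-word features from a parsed entry."""
    definitions = [d for defs in entry.senses.values() for d in defs]
    sense_count = len(definitions)
    # Assumption: definition length is measured in whitespace-separated tokens.
    total_tokens = sum(len(d.split()) for d in definitions)
    mean_length = total_tokens / sense_count if sense_count else 0.0
    return {
        "pos_count": len(entry.senses),
        "sense_count": sense_count,
        "translation_count": len(entry.translations),
        "mean_definition_length": mean_length,
    }


if __name__ == "__main__":
    # Toy data, not taken from any Wiktionary edition.
    toy = Entry(
        senses={
            "noun": ["a small domesticated feline", "a spiteful person"],
            "verb": ["to hoist an anchor"],
        },
        translations={"de", "fr", "ru"},
    )
    print(word_features(toy))
```

Under the expectations stated above, frequent words would tend to score higher on the three counts, so a classifier could take this feature dictionary as part of its input.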
Notes
- ^ Explanatory note example with reference.[1]
- ^ E.g. compare the entry structure and formatting rules in English Wiktionary and Russian Wiktionary.
- ^ Quotations are extracted only from Russian Wiktionary.[13]
- ^ If there are several IPA notations on a Wiktionary page – either for different languages or for pronunciation variants, then the first pronunciation was extracted.[19]
- ^ http://conceptnet5.media.mit.edu
- ^ The source code and the results of POS-tagging are available at https://code.google.com/p/wikily-supervised-pos-tagger
Citations
- ^ Meyer & Gurevych 2012, p. 999.
- ^ Meyer & Gurevych 2012, p. 140.
- ^ Zesch, Müller & Gurevych 2008, p. 4, Figure 1.
- ^ Meyer & Gurevych 2010, p. 40.
- ^ Krizhanovsky, Transformation 2010, p. 1.
- ^ Hellmann & Auer 2013, p. 302, p. 16 in PDF.
- ^ Hellmann, Brekle & Auer 2012, p. 3, Table 1.
- ^ Hellmann, Brekle & Auer 2012, pp. 8–9.
- ^ Hellmann, Brekle & Auer 2012, p. 10.
- ^ Hellmann, Brekle & Auer 2012, p. 11.
- ^ Zesch, Müller & Gurevych 2008.
- ^ Krizhanovsky, Transformation 2010.
- ^ a b Krizhanovsky 2011.
- ^ Krizhanovsky, Comparison 2010.
- ^ Krizhanovsky 2012, p. 14.
- ^ Otte & Tyers 2011.
- ^ McFate & Forbus 2011.
- ^ Schlippe, Ochs & Schultz 2012.
- ^ Schlippe, Ochs & Schultz 2012, p. 4802.
- ^ Schlippe, Ochs & Schultz 2012, p. 4804.
- ^ Meyer & Gurevych 2012.
- ^ Lin & Krizhanovsky 2011.
- ^ Medero & Ostendorf 2009.
- ^ Li, Graça & Taskar 2012.
- ^ Chesley et al. 2006.
References
- Chesley, Paula; Vincent, Bruce; Xu, Li; Srihari, Rohini K. (2006). "Using verbs and adjectives to automatically classify blog sentiment" (PDF). Training. 580: 233–235. Retrieved May 9, 2013.
- Hellmann, Sebastian; Brekle, Jonas; Auer, Sören (2012). "Leveraging the Crowdsourcing of Lexical Resources for Bootstrapping a Linguistic Data Cloud" (PDF). Proc. Joint Int. Semantic Technology Conference (JIST). Nara, Japan.
- Hellmann, S.; Auer, S. (2013). "Towards Web-Scale Collaborative Knowledge Extraction" (PDF). In Gurevych, Iryna; Kim, Jungi (eds.). The People's Web Meets NLP. Theory and Applications of Natural Language Processing. Springer-Verlag. pp. 287–313. ISBN 978-3-642-35084-9.
- Krizhanovsky, Andrew (2010). "Transformation of Wiktionary entry structure into tables and relations in a relational database schema". arXiv:1011.1368 [cs].
- Krizhanovsky, Andrew (2010). "The comparison of Wiktionary thesauri transformed into the machine-readable format". arXiv:1006.5040 [cs].
- Krizhanovsky, Andrew (2011). "Оценка использования корпусов и электронных библиотек в Русском Викисловаре" [Evaluation of the corpora and digital libraries used in Russian Wiktionary] (PDF). Труды международной конференции "Корпусная лингвистика–2011" [International scientific conference «Corpus linguistics-2011»] (in Russian). Saint Petersburg: Saint Petersburg State University. pp. 217–222.
- Krizhanovsky, Andrew (2012). "A quantitative analysis of the English lexicon in Wiktionaries and WordNet" (PDF). International Journal of Intelligent Information Technologies (IJIIT). 8 (4): 13–22. doi:10.4018/jiit.2012100102. Retrieved May 9, 2013.
- Li, Shen; Graça, Joao V.; Taskar, Ben (2012). "Wiki-ly supervised part-of-speech tagging" (PDF). Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Jeju Island, Korea: Association for Computational Linguistics. pp. 1389–1398.
- Lin, Feiyu; Krizhanovsky, Andrew (2011). "Multilingual ontology matching based on Wiktionary data accessible via SPARQL endpoint". Proc. of the 13th Russian Conference on Digital Libraries RCDL’2011. Voronezh, Russia. pp. 19–26. arXiv:1109.0732.
- McFate, Clifton J.; Forbus, Kenneth D. (2011). "NULEX: an open-license broad coverage lexicon" (PDF). The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference. Portland, Oregon, USA: Association for Computational Linguistics. pp. 363–367. ISBN 978-1-932432-88-6.
- Medero, Julie; Ostendorf, Mari (2009). "Analysis of vocabulary difficulty using wiktionary" (PDF). Proc. SLaTE Workshop.
- Meyer, C. M.; Gurevych, I. (2010). "Worth its Weight in Gold or Yet Another Resource - A Comparative Study of Wiktionary, OpenThesaurus and GermaNet" (PDF). Proc. 11th International Conference on Intelligent Text Processing and Computational Linguistics, Iasi, Romania. pp. 38–49.
- Meyer, C. M.; Gurevych, I. (2012). "OntoWiktionary – Constructing an Ontology from the Collaborative Online Dictionary Wiktionary" (PDF). In Pazienza, M. T.; Stellato, A. (eds.). Semi-Automatic Ontology Development: Processes and Resources. IGI Global. pp. 131–161. ISBN 978-1-4666-0188-8.
- Otte, Pim; Tyers, F. M. (2011). "Rapid rule-based machine translation between Dutch and Afrikaans" (PDF). In Forcada, Mikel L.; Depraetere, Heidi; Vandeghinste, Vincent (eds.). 16th Annual Conference of the European Association of Machine Translation, EAMT11. Leuven, Belgium. pp. 153–160.
- Schlippe, Tim; Ochs, Sebastian; Schultz, Tanja (2012). "Grapheme-to-phoneme model generation for Indo-European languages" (PDF). Acoustics, Speech and Signal Processing (ICASSP). Kyoto, Japan. pp. 4801–4804.
- Zesch, Torsten; Müller, Christof; Gurevych, Iryna (2008). "Extracting Lexical Semantic Knowledge from Wikipedia and Wiktionary" (PDF). Proceedings of the Conference on Language Resources and Evaluation (LREC). Marrakech, Morocco.