
Somali Corpus

From Wikipedia, the free encyclopedia

The Somali Corpus, also known as Kaydka Af Soomaaliga (KAF), is a digital collection of texts in Somali, a language spoken in Greater Somalia, Ethiopia, and Kenya. Jama Musse Jama started it in 2016[1][2] as part of his doctoral dissertation,[3] with an initial 3 million words of Somali literature and language. The corpus currently contains over 7 million words, mainly from literature, poetry, songs, news, essays, and political speeches,[4] making it one of the most extensive corpora among African languages in terms of text types and an important addition to online materials for under-resourced languages.[5][6][7][8] The words in the corpus are tagged for part-of-speech categories. The corpus can be used to derive frequency lists for Somali words,[9] and it also serves as the basis for an online Somali spell checker.[10]
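
As a rough illustration of what deriving a frequency list from text involves, the following minimal Python sketch counts word forms in a plain-text string. It is not the corpus's own tooling: the function name `frequency_list`, the naive tokenization rule, and the placeholder sample string are all assumptions made for this example.

```python
import re
from collections import Counter


def frequency_list(text, top_n=None):
    """Return (word, count) pairs sorted from most to least frequent.

    Hypothetical helper for illustration only; it does not reproduce
    how the Somali Corpus itself computes frequencies.
    """
    # Naive tokenization (an assumption): lowercase the text and take runs
    # of letters, allowing apostrophes between letters inside a word.
    tokens = re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", text.lower())
    # Counter.most_common sorts by count, highest first.
    return Counter(tokens).most_common(top_n)


# Placeholder input, not text taken from the corpus.
sample = "aqal aqal aqal buug buug qalin"
print(frequency_list(sample))
```

A real frequency list would also need decisions this sketch leaves out, such as whether to count inflected forms separately or group them under one headword.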

Other Somali language corpora


See also


References

  1. ^ "The Official Somali Corpus 2016".
  2. ^ Morgan Nilsson. 2018. Three Somali Language Corpora: How can they be useful? https://morgannilsson.se/ppt/2018-08-15-Mogadishu.pdf
  3. ^ Jama Musse Jama (2016). A Syntactically Annotated Corpus of Somali Literature. Unpublished PhD Thesis.
  4. ^ Jama Musse Jama. 2017. Somali Corpus: state of the art, and tools for linguistic analysis. https://www.academia.edu/26504727/Somali_Corpus_state_of_the_art_and_tools_for_linguistic_analysis.
  5. ^ Bendjaballah, Sabrina. 2024. Somali particle clusters: Complete paradigms, syncretism and corpus frequency. Brill’s Journal of Afroasiatic Languages and Linguistics. Brill 16(1). 102–136. https://doi.org/10.1163/18776930-01601003.
  6. ^ Mohammed, Siraj. 2020. Using machine learning to build POS tagger for under-resourced language: the case of Somali. International Journal of Information Technology 12(3). 717–729. https://doi.org/10.1007/s41870-020-00480-2.
  7. ^ Hashi, Awil. 2014. Developing a Model Corpus for Endangered Languages. Graduate Studies. University of Calgary. Doctoral thesis. https://doi.org/10.11575/PRISM/25614.
  8. ^ Nimaan, Abdillahi. 2014. Building and Evaluating Somali Language Corpora. In Jeff Good, Julia Hirschberg & Owen Rambow (eds.), Proceedings of the 2014 Workshop on the Use of Computational Methods in the Study of Endangered Languages, 73–76. Baltimore, Maryland, USA: Association for Computational Linguistics. https://doi.org/10.3115/v1/W14-2210.
  9. ^ Giorgio Banti. 2022. Some Issues For An Etymological Dictionary Of Somali. https://www.academia.edu/81600790/Banti_2022_Some_issues_for_an_Etymological_Dictionary_of_Somali.
  10. ^ http://www.somalicorpus.com/index.php?lang=en
  11. ^ "Bangiga Af Soomaaliga".
  12. ^ "Somali Web Corpus".