Spelling Checkers — language specifications

Languages and sizes of dictionaries

New languages: Kazakh (Cyrillic/Latin), Khmer (Cambodia), Telugu (India), Punjabi (India), Sinhala (Sri Lanka), Tamil (India), Gujarati (India), Bengali (Bangladesh), Malayalam (India), Kurdish (Northern), Nepalese, Marathi (India), Hindi (India), Arabic, Azerbaijanian

Recently updated spell checker languages: Basque, Dutch, Italian, German, Austrian German, Swiss German, Estonian, Portuguese (acordo ortográfico), French/Canadian French, Finnish, Norwegian, Danish, Icelandic, English, Bahasa Indonesia, Bahasa Melayu, Spanish, Frisian/Frysk, Afrikaans, Romanian, Slovak.

The Arabic and Hungarian lexicons, at about 5 million words each, have become the largest ever built without any artificial tricks.

94 languages (varieties)

English (lexicon size between 315,000 and 325,000, selection April 2010)
The American English (1), British English (2), Canadian English (3), South African English (4), and Australian/New Zealand English (5) versions include a set of collocations and automatic respelling functions between the American, Canadian, and British orthographic varieties, e.g., (to UK) counseling -> counselling or (to US) counselling -> counseling; (UK & US) Mao Tse-tung -> Mao Zedong (see the style guides of the New York Times and the Economist). Be careful with expressions such as "Thank God its Friday!": without the apostrophe in "it's", it looks a bit strange.
Lexicons agree with the leading unabridged dictionaries. The supplied vocabulary includes extensive medical, chemical, social and geographical lexicons. Finally, it covers a wide range of orthographic variants of compounds.
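The automatic respelling between varieties can be pictured as a lookup table applied word by word. The following is only a minimal sketch: the table `UK_TO_US` and the function `respell` are hypothetical names, seeded with the single example given in the text, and do not reflect the product's actual tables or interface.

```python
import re

# Hypothetical respelling table, seeded with the example from the text.
UK_TO_US = {"counselling": "counseling"}
# The reverse direction is derived by swapping keys and values.
US_TO_UK = {us: uk for uk, us in UK_TO_US.items()}

def respell(text, table):
    """Replace each whole word found in `table`, preserving an initial capital."""
    def swap(match):
        word = match.group(0)
        target = table.get(word.lower())
        if target is None:
            return word  # unknown words are left untouched
        return target.capitalize() if word[0].isupper() else target
    return re.sub(r"[A-Za-z]+", swap, text)
```

Calling `respell("Counselling helps.", UK_TO_US)` converts towards American spelling, and passing `US_TO_UK` converts the other way. Multi-word entries such as Mao Tse-tung would need phrase matching, which this sketch leaves out.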

French/Canadian French (lexicon size over 570,000, selection February 2010)
Includes the most extensive geographical lexicon. Two lexicons are available: one follows the spelling of Le Larousse (2008) and Le Nouveau Petit Robert (2003); the other follows the Rectifications de l’orthographe of the Conseil supérieur de la langue française, first published on 6 December 1990 (see also http://www.orthographe-recommandee.info), which have become increasingly accepted. The new French spelling is not mandatory, but it is officially recommended. The changes are moderate and affect about two thousand words. Examples:

  • un compte-goutte, des compte-gouttes ;
  • un après-midi, des après-midis; cout ;
  • entrainer, nous entrainons ;
  • paraitre, il parait; j'amoncèle, amoncèlement, tu époussèteras

  Most importantly, the spelling rectifications were initially approved by:

  • Le Conseil supérieur de la langue française (Paris);
  • L'Académie française (France).

  Both French and Canadian French versions include extensive (automatic) re-spelling tools between previous and new spelling forms.

    German (lexicon size 1,154,000, selection February 2010)
    German spelling was reformed again in 2006. Previous versions remain available for a while, but the regular German spelling is distributed in three versions, “alt, neu, dpa (2007)”, including automatic respelling from old to new spelling forms (e.g., Prozeß → Prozess) and the spelling of “feste grammatische und lexikale Wendungen” (fixed grammatical and lexical expressions). If you prefer “die alte Rechtschreibung” (the old spelling) and wish to purify your texts, a full respelling system from new to old will surprise you (e.g., Prozess → Prozeß). A version following the spelling proposed by the German-speaking news agencies (Nachrichtenagenturen, dpa) is also available (http://www.die-nachrichtenagenturen.de). The neue Rechtschreibung agrees with Duden, 24th edition (2006), and IDS Sprachreport, July 2006. The German lexicon is based on over 260,000 expanded catchwords (konjugierte Stichwörter) and includes all German toponyms (Ortsnamen), over 13,000 autocorrections (Umschreibungen) and an extensive medical lexicon. Moreover, spell checking is strict: forms such as Oberklasse-Wagen, Oberstufe-Schüler and Klasse-Bücher are rejected. The correct forms are Oberklassenwagen, Oberstufenschüler and Klassenbücher.
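The strict compound checking can be illustrated by trying the usual German linking elements (Fugenelemente) between the two parts of a hyphenated form and keeping the candidates the lexicon knows. This is a sketch under stated assumptions: `LEXICON`, `LINKERS` and `suggest_compound` are hypothetical names, and the three-word lexicon only covers the examples from the text.

```python
# Hypothetical mini-lexicon; a real lexicon would hold over a million forms.
LEXICON = {"Oberklassenwagen", "Oberstufenschüler", "Klassenbücher"}

# Common German linking elements, including the empty one.
LINKERS = ("", "n", "en", "s", "es")

def suggest_compound(word, lexicon=LEXICON):
    """Suggest closed compounds for a hyphenated two-part form."""
    if word in lexicon or word.count("-") != 1:
        return []  # already accepted, or not a simple two-part form
    head, tail = word.split("-")
    # Join head and lowercased tail with each linking element in turn.
    candidates = [head + link + tail.lower() for link in LINKERS]
    return [c for c in candidates if c in lexicon]
```

With this table, `suggest_compound("Oberklasse-Wagen")` proposes the closed form listed in the text. Whether the actual checker works this way is not stated in the source.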

    Swiss German (lexicon size 1,176,000, Swiss additions to German)
    The Swiss German lexicon includes all Swiss toponyms (Schweizerische Ortsnamen). There are three versions, “alt, neu, dpa/SDA (2007/8)”; see German.

    Austrian German (lexicon size 1,168,000, Austrian additions to German)
    The Austrian lexicon includes all Austrian toponyms (Österreichische Ortsnamen). There are three versions, “alt, neu, dpa (2007)”; see German.

    Spanish (lexicon size 891,000, selection August 2009)
    The spelling follows “Gran Diccionario de la lengua española” and “Diccionario Real Academia Española”, 2001.
    Includes respelling of a set of common errors, e.g. Adam y Eva → Adán y Eva, Edinburgo → Edimburgo.

    Italian (lexicon size 945,000, selection April 2010)
    The spelling follows Lo Zingarelli 2006. Includes pronominal forms and an extensive geographical lexicon (comuni e luoghi italiani).

    Swedish (lexicon size over 1.9 million words, selection February 2010)
    Includes geographical and proper names. The orthography follows Svenska Akademiens ordlista över svenska språket.

    Portuguese (lexicon size 1.5 million words, selection February 2010)
    Iberian and Brazilian Portuguese differ considerably in their use of verb tenses and in idiom. Brazilian Portuguese is often unacceptable in Iberian Portuguese publications, and the reverse is a source of misunderstanding too. The dictionaries therefore need to differ, independently of orthography. Iberian and Brazilian versions have been compiled according to both the previous orthography and the acordo ortográfico. These versions include respelling either between Iberian and Brazilian Portuguese or between the previous orthography and the acordo ortográfico. “The President of Portugal, Aníbal Cavaco Silva, has promulgated the orthographic agreement for the Portuguese language, ratified by the country's Parliament in May, sources in the Presidency told Agência Efe today. ... The new Acordo Ortográfico da Língua Portuguesa has been in force in Brazil since the 1st of this month (2009).”
    Examples: equipolente versus eqüipolente, boleia versus boléia, ação versus acção.

    Dutch (Nederlands, lexicon size 700,000, selection April 2010)
    The spelling follows the governmental rules (Groene Boekje, Workgroup Spelling, 2005, Taalunie, update Taalunie errata 27-05-2008) and agrees with Van Dale Groot Woordenboek van de Nederlandse Taal (XIV ed.).
    The lexicon covers national and worldwide geographic names as well as medical, administrative, social and many other specialist terms. A set of collocations and respelling from old to new orthography is included.

    Flemish (Vlaams, lexicon size 705,000, selection April 2010)
    The spelling follows the governmental rules (Groene Boekje, Workgroup Spelling, 2005, Taalunie, update Taalunie errata 27-05-2008) and agrees with Van Dale Groot Woordenboek van de Nederlandse Taal (XIV ed.).
    The lexicon covers national and worldwide geographic names as well as medical, administrative, social and many other specialist terms. A set of collocations and respelling from old to new orthography is included.

    Surinam Dutch (Surinaams-Nederlands, lexicon size 700,000, selection April 2010)
    The Republic of Surinam joined the Dutch Taalunie in January 2005 to unify its language with the Dutch language. The peculiarities of Surinam Dutch call for a separate lexicon. The spelling agrees with the governmental rules (Groene Boekje, Workgroup Spelling, 2005, Taalunie).
    The lexicon covers national and worldwide geographic names as well as medical, administrative, social and many other specialist terms. A set of collocations and respelling from old to new orthography is included.

    Catalan (lexicon size 700,000, selection August 2009)
    The spelling agrees with Diccionari ortogràfic i de pronúncia, Enciclopèdia Catalana.

    Danish (lexicon size 810,000, selection January 2010)
    The spelling agrees with contemporary Danish spelling according to Dansk Retskrivningsordbogen, 1996.

    Norwegian, Bokmål and Nynorsk (lexicon size: Bokmål 1,015,000, Nynorsk 480,000; selection February 2010)
    The spelling agrees with contemporary Norwegian spelling according to Tanums Store Rettskrivningsordbok.

    Sámi (lexicon size 1.6 million, selection February 2008)
    The spelling agrees with the North Sámi language as spoken in Finnmark county in the north of Norway. Sámi is a highly inflected language and words can have numerous word forms. This feature makes the North Sámi lexicon very large.

    Finnish (lexicon size over 4.2 million words, selection February 2010)
    The spelling agrees with contemporary Finnish spelling according to Uusi Suomi-Englanti Suur-Sanakirja, 1984.

    Afrikaans (lexicon size 290,000, selection November 2009)
    The lexicon agrees with the spelling rules of the Suid-Afrikaanse Taalkommissie, 2002.

    Latin (lexicon size 450,000, selection August 2007)
    The Latin lexicon has been compiled from classical, medieval, clerical, vulgate, and scientific texts. Names from the classical period and from the clerical (and Biblical) world have been included in the lexicon. Like dictionary publishers, we do not use ligatures: oeconomicae, Aegiptum, etc.

    Basque (lexicon size 3.05 million, selection July 2010)
    The Basque language is highly inflected, and so is the Basque lexicon. Financial, scientific and geographical terms and proper names are included in the lexicon: Euskadi, Euskadik, Euskadiko, Euskadikoa, Euskadin, Euskadira, Euskadiren, Euskadirentzat, Euskaditik, Euskadiz, amortizazio-prezio..., banku-txartel..., efektu-biomarkatzaile..., epitelio-zelula..., etc.
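The way one stem grows into many lexicon entries can be sketched by attaching a list of suffixes to it. The suffix list below is simply read off the Euskadi forms quoted in the text; `SUFFIXES` and `expand` are hypothetical names, and real Basque morphology needs far more rules than plain concatenation.

```python
# Suffixes read off the Euskadi forms listed in the text; the empty string
# stands for the bare stem. A real generator would need many more rules.
SUFFIXES = ["", "k", "ko", "koa", "n", "ra", "ren", "rentzat", "tik", "z"]

def expand(stem, suffixes=SUFFIXES):
    """Return every stem+suffix form, in suffix order."""
    return [stem + suffix for suffix in suffixes]

forms = expand("Euskadi")
```

Ten forms from a single proper name give an idea of why the lexicon of a highly inflected language runs into the millions.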

    Russian (Русский) (lexicon size 1,000,000, selection January 2008)
    The Russian language goes back to Old Church Slavic, but a literary tradition less tied to the church and Old Church Slavic exists too. The last extensive spelling reform occurred in 1917.

    Estonian (lexicon size 1,500,000, selection February 2010)
    The Estonian language belongs to the Finno-Ugric family of languages. It is closely related to Finnish and, as in Finnish, what other languages express with prepositions is attached to the end of the word.

    Icelandic (lexicon size 747,000, selection January 2010)
    Icelandic is a North Germanic (Scandinavian) language and has been the official language of Iceland since 1935. It is characterized by extensive vowel gradation across masculine, feminine and neuter forms. The historical morphological characteristics have been preserved.

    Lithuanian (lexicon size 862,000, selection June 2009)
    The Lithuanian language, like Latvian, belongs to the Baltic family of languages. Lithuanian uses the Latin alphabet with diacritics, including <ė>, <į>, <ų>. Lithuanian is highly inflected.

    Latvian (lexicon size 700,000, selection June 2009)
    The Latvian language is one of the Baltic languages (see Lithuanian). The orthography is based on the Latin alphabet with diacritic marks, including <ņ>, <ķ>, <ģ>, <ļ>.

    Polish (lexicon size 1.9 million, selection February 2008)
    The Polish language is a West Slavic language spoken by approximately 42 million speakers. It is written in the Latin alphabet with diacritic marks and special characters: ł, Ł, ż, Ż.

    Frisian (lexicon size 290,000, selection July 2009)
    The Frisian language is spoken by approximately 300,000 speakers in the Dutch province of Friesland. It has been standardized thanks to the efforts of the Fryske Akademy. It is distinct from East and North Frisian dialects in Northern Germany.

    Galician (lexicon size 245,000, selection August 2007)
    The Galician language is spoken in Spanish Galicia, situated north of Portugal. It is a Romance language related to Portuguese. The spelling follows “Dicionário da língua galega, Sotelo Blanco”.

    Hungarian (lexicon size over 5 million words, selection December 2009)
    The Hungarian language belongs to the Uralic family of languages. It is the official language of Hungary. It is only distantly related to the other Finno-Ugric languages. The orthography includes characters with the double acute accent (Hungarumlaut): <ő>, <ű>.

    Czech (lexicon size 1,690,000, selection December 2009)
    The Czech language is a West Slavic language. The orthography is based on the Latin alphabet, including diacritics: <č>, <ď>, <ě>, <ů>, <ž>.

    Upper Sorbian (lexicon size 770,000, selection January 2009)
    The Upper Sorbian language is a West Slavic language. The orthography is based on the Latin alphabet. Upper and Lower Sorbian are spoken in the south-eastern part of the former German Democratic Republic. The spelling agrees with the rules of the Upper Sorbian language commission up to June 2005 (Hornjoserbskeje rěčneje komisje hač do junija 2005).

    Maltese (lexicon size 845,000, selection January 2006)
    The Maltese language is a Semitic language written in the Latin alphabet, including <ċ>, <ħ>, <ġ> and <ż>. The orthography follows Joseph Aquilina (1987/1990). The speller includes checks for proper use of assimilations of the article.

    Modern Greek (Ελληνικά) (lexicon size 785,000, selection September 2009)
    The Greek characters α, β, γ, ... to ω have been used for millennia. We do not know how Ancient Greek was pronounced, but Modern Greek is certainly different. It now uses only a limited number of accents and diaereses.

    Occitan (lexicon size 250,000, selection June 2007)
    Occitan, also known as langue d'oc, is the original language spoken by the troubadours and Cathars in the South of France. The reconstruction of the language is based on the work of Loís Alibèrt (2000).

    Esperanto (lexicon size 300,000, selection August 2003)
    Esperanto is an artificial language, introduced by Dr. Lazaro Ludoviko Zamenhof. The language is based on several Indo-European languages. Typical of Esperanto are the characters <ĉ>, <ĝ>, <ĥ>, <ĵ>, <ŝ> and <ŭ>.

    Turkish (lexicon size 1,680,000, selection November 2008)
    The Turkish language is written in the Latin alphabet, but a few characters were added, such as the dotless i <ı>, which is a different letter from the dotted i. Therefore the letter i is not the lowercase form of the capital letter I, which is a major problem for many systems. A geographical and medical lexicon is included.
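The casing problem can be shown in a few lines. Python's built-in `str.lower()` and `str.upper()` are locale-independent, so they pair I with i. The sketch below maps the two Turkish letter pairs explicitly before calling them; `turkish_lower` and `turkish_upper` are hypothetical helper names, not part of any library.

```python
DOTLESS_I = "\u0131"     # ı, lowercase partner of I in Turkish
DOTTED_CAP_I = "\u0130"  # İ, uppercase partner of i in Turkish

def turkish_lower(text):
    """Lowercase with Turkish rules: I -> ı and İ -> i."""
    return text.replace("I", DOTLESS_I).replace(DOTTED_CAP_I, "i").lower()

def turkish_upper(text):
    """Uppercase with Turkish rules: i -> İ and ı -> I."""
    return text.replace("i", DOTTED_CAP_I).replace(DOTLESS_I, "I").upper()
```

A spell checker that lowercases ISPARTA with the default rules would look up a word with a dotted i and miss the lexicon entry, which is why the mapping has to be applied before any case folding.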

    Romanian (lexicon size 1,000,000, selection June 2009)
    The Romanian language belongs to the Romance languages. It includes a few additional characters such as the a-breve <ă>, i-circumflex <î>, the s-cedilla <ş>, the t-cedilla <ţ>, the s-comma below <ș>, the t-comma below <ț>.

    Bulgarian (lexicon size 840,000, selection February 2008)
    The Bulgarian language is written in the Cyrillic alphabet.

    Faroese (lexicon size 175,000, selection February 2005)
    The Faroese language is spoken by 50,000 inhabitants of the Faroe Islands. It is based on Old Norse, as is Icelandic.

    Bahasa Indonesia (lexicon size 76,000, selection May 2010)
    Bahasa Indonesia is the standard language written and spoken in the Republic of Indonesia. Many Austronesian languages are spoken in the Indonesian Archipelago, but Bahasa Indonesia is the lingua franca.

    Slovene (lexicon size 425,000, selection October 2007)
    The Slovene language is spoken in the Republic of Slovenia, situated between Austria, Hungary, Croatia, and Italy. It is a South Slavic language written in the Latin alphabet, including a few Slavic characters such as <č>, <š>, <ž> and the digraphs Lj and Nj. Slovene is highly inflected and nearly every noun has an adjective form too.

    Croatian (lexicon size 547,000, selection October 2009)
    The Croatian language, formerly named Serbo-Croatian, is closely related to Serbian. The Croatian language is written in the Latin alphabet, including a few typical Slavic characters such as <č>, <ć>, <š>, <ž>, and the digraphs Lj and Nj.

    Bosnian (lexicon size 565,000, selection August 2009)
    The Bosnian language, formerly named Serbo-Croatian, is closely related to Serbian and Croatian.

    Serbian Cyrillic (lexicon size 570,000, selection August 2009)
    The Serbian language is written in the Cyrillic alphabet, including typical Serbian characters Dž, Lj, Nj (Џ, Љ, Њ).

    Byelorussian (lexicon size 1.6 million, selection January 2008)
    The Byelorussian language is written in the Cyrillic alphabet, like the Russian language, but the language was heavily influenced by Polish for centuries. Today, in the Byelorussian Republic, Byelorussian plays a lesser role compared to the Russian language.

    Slovak (lexicon size 1 million words, selection August 2009)
    The Slovak language is closely related to Czech, but a few characters differ.

    Ukrainian (lexicon size 1.15 million words, selection November 2008)
    The Ukrainian language is written in the Cyrillic alphabet, but for centuries the language was heavily influenced by Polish.

    Swahili (lexicon size 75,000, selection February 2005)
    The Swahili language is spoken along the East Coast of Africa. It is the lingua franca of many coastal nations. The standardized language is called Kiswahili Sanifu. Its word kamusi (dictionary) corresponds to the Melayu word kamus. Swahili is written in the Latin alphabet.

    Bahasa Melayu (lexicon size 62,000, selection September 2009)
    Bahasa Melayu is the standard language of Malaysia. It has a common root with Bahasa Indonesia. However, Bahasa Melayu was heavily influenced by the English language while Bahasa Indonesia was influenced by Dutch during the colonial age.

    Irish (Gaelic) (lexicon size 325,000, selection August 2007)
    The Gaelic language is a Celtic language spoken in Western Ireland. A class of words is lenited, pronounced with palatalization. A slightly different variety is spoken in the Highlands of Scotland.

    Welsh (lexicon size 365,000, selection August 2007)
    The Welsh language is the Celtic language of Wales, spoken by about 500,000 people (mainly bilingual in English).

    Greenlandic (lexicon size 85,000, selection February 2008)
    Greenlandic is an Eastern Inuit language spoken by 50,000 Greenlanders.
    The Greenlandic language builds words by adding particle after particle, so that a single word can form a whole sentence. The Latin alphabet is used, whereas the Canadian Inuit make use of their own script.

    Macedonian (lexicon size 320,000, selection January 2008)
    The Macedonian language is written in the Cyrillic alphabet.

    Albanian (lexicon size 310,000, selection February 2006)
    The Albanian language is written in the Latin alphabet. The Albanians call their language shqip and their country Shqipëria.

    Maori (lexicon selection March 2004)
    The Maori language is spoken in New Zealand and is written in the Latin alphabet. A macron is placed above the vowels to differentiate between long and short vowels.

    Xhosa (lexicon size 165,000, selection September 2005)
    The Xhosa language is spoken in the Republic of South Africa and is written in the Latin alphabet.

    Zulu (lexicon size 330,000, selection September 2008)
    The Zulu language is spoken in the Republic of South Africa and is written in the Latin alphabet.

    Arabic (العربية) (lexicon size ca. 5 million, selection October 2009)
    The Arabic language has its own script, and its orthography is mainly based on consonantal roots. These roots unfold into millions of words.
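How a consonantal root unfolds into words can be sketched with patterns whose numbered slots are filled by the radicals. The example uses the root k-t-b in a simplified Latin transliteration; the pattern notation and the name `unfold` are assumptions made for illustration, not the notation used to build the lexicon.

```python
def unfold(root, pattern):
    """Fill a pattern's digit slots (1, 2, 3) with the root's radicals."""
    return "".join(
        root[int(ch) - 1] if ch.isdigit() else ch
        for ch in pattern
    )

# Root k-t-b ("writing") in Latin transliteration; patterns are illustrative.
ROOT = ("k", "t", "b")
PATTERNS = ["1a2a3a", "1aa2i3", "ma12uu3", "1i2aa3"]
words = [unfold(ROOT, p) for p in PATTERNS]
```

Here `words` holds kataba (he wrote), kaatib (writer), maktuub (written) and kitaab (book). Multiplying a few thousand roots by many patterns, prefixes and suffixes is what leads to lexicon sizes in the millions.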

    Azerbaijanian (lexicon size 132,000, selection May 2010)
    Azerbaijanian is written in the Latin alphabet. It has much in common with Turkish.

    Hebrew (עִבְרִית) (lexicon size ca. 5.5 million, selection March 2008)
    The Hebrew language is written in Hebrew characters, mainly consonants. The orthography is based on roots of 3 radicals, which unfold to millions of words.

    Persian/Farsi (فارسی) (lexicon size 450,000, selection October 2009)
    The Persian language is written in the Arabic script but, Persian being an Indo-European language, vowels are important.

    Urdu (اردو) (lexicon size 131,000, selection October 2009)
    The Urdu language is closely related to Hindi, but written in the Arabic script. Urdu and Hindi are Indo-European languages.

    Breton (lexicon size 210,000, selection July 2007)
    The Breton language is spoken in Brittany (Bretagne), France. It is a Celtic language related to the extinct Cornish language of the UK.

    Thai (ภาษาไทย) (lexicon size 80,000, selection March 2008)
    The Thai language is the official language of Thailand. Thai has its own script, a syllabic script in which most vowels are written above the consonants. Thai is a tone language and the tone marks are always written on top. The words of a sentence are written without spaces, and therefore a sentence has to be segmented (hyphenated) prior to spell checking.
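Segmenting unspaced text before spell checking can be sketched with greedy longest-match lookup against a word list. This is one common baseline technique, not necessarily the one used by the product; `segment` is a hypothetical name, and the two-word lexicon in the usage note is only for illustration.

```python
def segment(text, lexicon):
    """Greedy longest-match segmentation of unspaced text."""
    longest = max(map(len, lexicon), default=1)
    words, pos = [], 0
    while pos < len(text):
        # Try the longest candidate first; fall back to one character.
        for size in range(min(longest, len(text) - pos), 0, -1):
            piece = text[pos:pos + size]
            if piece in lexicon or size == 1:
                words.append(piece)
                pos += size
                break
    return words
```

For example, `segment("ภาษาไทย", {"ภาษา", "ไทย"})` splits the string into the two lexicon words. Greedy matching can choose a wrong split when words overlap, so a production segmenter would need more context than this sketch uses.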

    Hindi (हिन्दी) (lexicon size 150,000, selection December 2009)
    The Hindi language is spoken in northern and central India. Written Hindi is relatively standardized over the whole Hindi language area. It is an Indo-Aryan language. Although related to Urdu, Hindi does not favour the use of Persian and Arabic loanwords. Hindi is written in the Devanagari script, which includes many complex characters consisting of vowels, consonants, vowel signs (matras), numerals, and diacritical marks.

    Marathi (मराठी) (lexicon size 153,000, selection December 2009)
    The Marathi language is spoken in the Maharashtra state of India. It is an Indo-Aryan language written in the Devanagari script.

    Nepalese (नेपाली) (lexicon size 125,000, selection December 2009)
    The Nepalese language (Nepali) is spoken in the Himalayan state of Nepal between India and China. Nepalese is written in the Devanagari script.

    Kurdish (Northern) (lexicon size 90,000, selection July 2009)
    Northern Kurdish belongs to the Iranian group of languages. Kurdish is spoken in Turkey, Iraq, Iran, Armenia, Georgia and Azerbaijan. The Latin script is used for the Northern variety of Kurdish.

    Malayalam (മലയാളം) (lexicon size 410,000, selection December 2009)
    The Malayalam language is spoken in Kerala, a state in the south of India. It is a Dravidian language written in the Malayalam script, a descendant of the Brahmi script.

    Bengali (বাংলা) (lexicon size 126,000, selection November 2009)
    The Bengali language is spoken in Bangladesh. It is an Indo-Aryan language written in the Bengali script, a descendant of the Brahmi script.

    Gujarati (ગુજરાતી) (lexicon size 185,000, selection October 2009)
    The Gujarati language is spoken in the Indian state of Gujarat. It is an Indo-Aryan language written in the Gujarati script, a descendant of the Brahmi script.

    Tamil (தமிழ்) (lexicon size 105,000, selection December 2009)
    The Tamil language is spoken in southern India (Tamil Nadu) and Sri Lanka. It is a Dravidian language written in the Tamil script, a descendant of the Brahmi script. Tamil has many Indo-Aryan loanwords. Tamil in Sri Lanka incorporates loanwords from the Dutch, Portuguese, and English languages.

    Sinhala (සිංහල) (lexicon size 208,000, selection November 2009)
    The Sinhala language is spoken in Sri Lanka. It belongs to the Indo-Aryan branch of the Indo-European languages and is written in the Sinhala script, a descendant of the Indian Brahmi script. There is some affinity to neighbouring languages. Sinhala has features that may be traced to Dravidian influences.

    Punjabi (ਪੰਜਾਬੀ) (lexicon size 37,500, selection October 2009)
    The Punjabi language is spoken in the Punjab state of India. It belongs to the Indo-Aryan branch of the Indo-European languages and is written in the Gurmukhi script, a descendant of the Indian Brahmi script.

    Telugu (తెలుగు) (lexicon size 115,000, selection December 2009)
    The Telugu language is spoken in Andhra Pradesh, one of the largest states of India. It is a Dravidian language written in the Telugu script, a descendant of the Indian Brahmi script.

    Khmer (ភាសាខ្មែរ) (lexicon size 30,000, selection November 2009)
    The Khmer language is spoken in Cambodia. It is the second most widely spoken Austroasiatic language. As in Thai, Khmer sentences are written without spaces. Therefore spell checking strongly depends on segmentation (see Hyphenator languages).

    Kazakh (Cyrillic/Latin) (lexicon size 900,000, selection May 2010)
    The Kazakh language is spoken east of the Caspian Sea. It is a Turkic language related to Azerbaijani and Turkish. In Kazakhstan, Kazakh is mainly written in the Cyrillic alphabet, but a transition to the Latin script was already proposed by the President of Kazakhstan in 2006. For this reason both Cyrillic and Latin lexicons have been compiled.