Hyphenators — language specifications

Our hyphenators are not based on a database of hyphenated dictionary words.
New hyphenator languages: Kazakh (Latin/Cyrillic), Khmer, Northern Kurdish, Swahili, Xhosa, Zulu, Hebrew, Irish/Gaelic (see Windows Unicode Demo).
Recently updated hyphenator languages: Galician, Finnish, Norwegian, Danish, Icelandic, German, Swiss German, Frisian/Frysk, Afrikaans, Turkish, Swedish, Russian, Romanian, Portuguese.
Updated hyphenators are built on larger learning corpora and are validated on larger shadow corpora.

77 language modules

Dutch (Update July 2009, ε < 0.0044 ‰)
supports the generally accepted spelling (the Netherlands), progressive spelling (Belgium), and the 1996 & October 2005 spelling reforms; these four principles have been integrated in one hyphenator. Supports the Belgian, Surinamese, and Dutch idioms. The hyphenator recognizes compound boundaries and covers the Dutch idiom in the most extensive way.
visit download page | view a Dutch example (PDF) | view a newspaper mistake

English (Update January 2008, ε < 0.0098 ‰)
supports phonetic hyphenation according to the world's most trusted dictionaries: Webster's Third New International Dictionary, Webster's New Twentieth Century Unabridged Dictionary (2nd edition), and Longman's Dictionary of Contemporary English. It is based on an unabridged learning corpus that comes close in size to Webster's unabridged dictionaries. A common hyphenator is available for the British, Canadian, and American idioms. The hyphenator solves the irregularity of the alternation of English strong and weak syllables. The new double-layer model enables the user to disregard certain secondary divisions (adapt~able instead of adapt~a~ble). Hyphenation agrees with The Oxford Colour Spelling Dictionary (1995), which shows fewer syllable divisions than the other dictionaries.
The English module has separate entries for British, American, Canadian, Australian, New Zealand and South African English.
visit download page | view an English example (PDF)
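
The double-layer idea can be pictured as break points that carry a level: primary divisions are always available, while secondary ones can be switched off. The following Python sketch is only an illustration. The marker characters, the function name, and the level assigned to each break in the example word are assumptions made here, not the product's data format or API.

```python
# Hypothetical notation: "~" marks a primary division and "." a secondary
# one. Both markers are chosen for this illustration only.
PRIMARY = "~"
SECONDARY = "."


def render(marked: str, allow_secondary: bool = True) -> str:
    """Return the word with every permitted break shown as "~"."""
    out = []
    for ch in marked:
        if ch == PRIMARY:
            out.append("~")
        elif ch == SECONDARY:
            # Secondary breaks are kept only when the caller allows them.
            if allow_secondary:
                out.append("~")
        else:
            out.append(ch)
    return "".join(out)


# Assumed break levels for the example word from the text.
word = "adapt~a.ble"
print(render(word, allow_secondary=True))   # adapt~a~ble
print(render(word, allow_secondary=False))  # adapt~able
```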

German old (1980) and new (1996, 2006) (Update July 2009, ε < 0.0045 ‰)
supports every characteristic German hyphenation according to the most recent Duden Rechtschreibung (August 2006). For reformed German, two hyphenation styles have been implemented: one in agreement with the Duden editions of 1996-2004, and one using strict eingedeutschte syllables. The German hyphenator recognizes compound boundaries independent of the spelling reform. The new feature for “der Verwendung von Großbuchstaben SS für ß” correctly hyphenates both “Schreibungsweisen”. A special effort has been made to support medical and other scientific domains. The German hyphenators have been compared to over two million German, Swiss German & Austrian German words as an independent estimate of accuracy.
visit download page | view a German example (PDF) | view a newspaper mistake

Swiss German old (1980) and new (1996, 2006) (Update July 2009)
responds accurately to the typical Swiss German deviations and local idiom, including the ß to ss transcription: Stra~ße becomes Stras~se (not Stra~sse).
visit download page | view a Swiss German example (PDF) | view a newspaper mistake
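
The ß to ss example can be sketched in a few lines of Python. This is a minimal illustration that assumes "~" marks a break and that a break directly before ß moves between the two s letters, as in the Stra~ße example. The function name is hypothetical, and the actual hyphenator certainly handles more cases than this string rewrite.

```python
def to_swiss(hyphenated: str) -> str:
    """Transcribe ß to ss in a word whose breaks are marked with "~"."""
    # A break directly before ß moves between the two s letters.
    result = hyphenated.replace("~ß", "s~s")
    # Any remaining ß is transcribed without touching the breaks.
    return result.replace("ß", "ss")


print(to_swiss("Stra~ße"))  # Stras~se
```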

Austrian German old (1980) and new (1996, 2006) (Update July 2009)
responds accurately to the typical Austrian German peculiarities and local idiom.
visit download page | view a German example (PDF) | view a newspaper mistake

French (two versions, Update March 2008)
accepts etymological syllabification according to Grevisse’s “le bon usage.” A second version accepts phonetical hyphenation rules recommended by the leading French linguist Nina Catach in Paris. Both versions use the new double layer technique to enable or disable hyphenation of muettes. Covers French idiom nearly completely.
visit download page | view a French example (PDF) | view a French example (not hyphenating muettes)(PDF)

Spanish (Update January 2006, ε < 0.0035 ‰)
supports the official hyphenation rules as published by large dictionary publishers; completely covers the Spanish and Latin American idiom.
visit download page | view a Spanish example (PDF)

Italian (Update June 2006, ε < 0.0008 ‰)
supports phonetic hyphenation, described in Italian as: "la sillabazione: basata prevalentemente sul criterio di tenere uniti i gruppi consonantici attestati, anche una sola volta, come iniziale di parola". In addition, the new hyphenator accurately handles hiatuses, elisions (al-l’I.ta-lia), conjugations, declensions, and words that came from English and other foreign languages (beat-nik and not be-at-nik).
visit download page | view an Italian example (PDF)

Iberian and Brazilian Portuguese (previous and acordo ortográfico) (Update November 2009, ε < 0.007 ‰)
based on the vowel as the syllabic unit, but falling diphthongs and final diphthongs are kept together. Both Iberian and Brazilian idioms are supported by the Portuguese hyphenator engine. This engine supports doubling of the hyphen (repetir o hífen na linha seguinte).
visit download page | view a Portuguese example (PDF)
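
Doubling of the hyphen is commonly described as follows: when a line break coincides with a hyphen that already belongs to the word, the hyphen is repeated at the start of the next line. The Python sketch below illustrates that reading. The function name and the index-based interface are assumptions for illustration, not the engine's API.

```python
def split_for_line_break(word: str, pos: int) -> tuple[str, str]:
    """Split word before index pos; return (line end, next line start)."""
    head, tail = word[:pos], word[pos:]
    if head.endswith("-"):
        # The break coincides with a hyphen that belongs to the word:
        # repeat it at the start of the next line.
        return head, "-" + tail
    # Ordinary break: only a line-end hyphen is added.
    return head + "-", tail


print(split_for_line_break("guarda-chuva", 7))  # ('guarda-', '-chuva')
print(split_for_line_break("guarda-chuva", 4))  # ('guar-', 'da-chuva')
```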

Czech (Update November 2007)
supports the reformed spelling. As is the case in every Slavic language, a number of additive vowels and consonants exist, which have a large impact on hyphenation. Syllables that consist solely of consonants are supported (ji-tr-nice).
visit download page

Slovak (Update November 2007)
supports the standard Slovak orthography. As is the case in every Slavic language, a number of additive vowels and consonants exist, which have a large impact on hyphenation. Syllables that consist solely of consonants are supported (ji-tr-nice).
visit download page

Swedish (Update September 2008)
accepts the mekaniska principen, but compound words are divided into their morphological roots. The overwhelming number of compounds and newly created forms makes this a challenge worth accepting. You can switch between c-k and ck- hyphenation, and switch within-word vowel-vowel hyphenation on or off.
The library version 6.2.1 also supports morphological hyphenation as specified in the Svenska Akademiens ordlista över svenska språket.
visit download page | view a Swedish example (PDF)

Finnish (Update February 2010)
is tuned in to the peculiarities of the Finnish language, which shares attributes with all Finno-Ugric languages. Finnish has a rich structure, including a large number of falling and rising diphthongs. The phonetic basis of the syllable is accepted, and words are fully hyphenated despite the overwhelming inflection structure. You may find the resemblance to neighboring Estonian remarkable.
visit download page | view a Finnish example (PDF)

Catalan (Update August 2007)
supports the mixed French and Spanish origins of the Catalan language. A peculiarity of Catalan, needing special care, is the l geminada (l·l).
visit download page | view a Catalan example (PDF)

Danish (Update January 2010)
accepts the hyphenation rules of the Dansk Sprognævns Retskrivningsordbog. Compounds and newly created forms are supported; please note that it even hyphenates Norwegian according to consonant rules.
visit download page

Norwegian/Nynorsk (Update January 2010, ε < 0.0047 ‰)
accepts consonant rules or the morphological rules of the Nordisk institutt of the University of Bergen. The principles of pattern recognition are put into practice on Nynorsk as well, giving one hyphenator engine for mixed-language applications.
visit download page | view a Norwegian example (PDF)

Icelandic (Update January 2010, ε < 0.021 ‰)
accepts morphological rules which separate the attached article and nominative, dative, accusative, and genitive cases and is capable of dividing a pileup of compounds.
visit download page | view an Icelandic example (PDF)

Estonian (Update October 2008)
behaves like the Finnish hyphenator and is capable of correctly hyphenating Estonian compounds and diphthongs. However, there are more diphthongs in the Estonian language than in the Finnish language, which increases complexity. Taking widows into account, we hyphenate las~te~aia~laps and not las~te~ai~a~laps.
visit download page | view an Estonian example (PDF)

New Greek (Update January 2005)
is tuned in to the Greek script, the Elot codepage or Unicode. It hyphenates more than between alpha and omega — not just the beginning and the end (Classical Greek), but a new era in progress (Modern Greek). Present-day Greek has evolved and is flourishing with diacritics.
visit download page

Polish (Update February 2008)
hyphenation of the Polish language is hindered by an immense number of consonants, quite often unpronounceable for non-Polish speakers. However, the hyphenator has been fully adapted to these difficult syllables.
visit download page

Latvian (Update December 2007)
is tuned in to the properties of Baltic languages. Words are richly declined. Latvian uses additional consonants and vowels, which are recognized by the hyphenator.
visit download page

Azerbaijanian (Update November 2009)
is the language of Azerbaijan, one of the Transcaucasian republics that are now independent of the former USSR. Azerbaijanian is related to Turkish. The Azerbaijani now use a Latin script. There is no standard single-byte code page for this script.
visit download page

Turkish (Update November 2008, ε < 0.007 ‰)
Present-day Turkish is spoken in SW Asia, but in earlier times the Turkish region reached into the north of China. In Chinese history, the name Tu-kiu was mentioned 600 years ago. Turkish is characterized by a lot of additive particles that change the meaning of a word. A word can take numerous forms and different parallel hyphenations.
visit download page | view a Turkish example (PDF)

Lithuanian (Update December 2007)
is a richly declined Baltic language. The (semi-)diphthongs, palatals, and affricates have been taken into consideration for hyphenation.
visit download page

Afrikaans (Update November 2009)
the Afrikaans language evolved from 17th-century Dutch and is an official language of South Africa. Its hyphenation has much in common with the Dutch language. Afrikanization of spelling has given the Afrikaans language its own identity. The Afrikaans hyphenator takes all Afrikaans peculiarities into consideration, including diaeresis hyphenation.
visit download page | view an Afrikaans example (PDF) | die taal en die passende tegnologie (PDF) | http://www.litnet.co.za/taaldebat/talo.asp

Russian (Update July 2007)
accepts Cyrillic characters; the script itself does not complicate hyphenation. The difficulty lies in the nature of the Russian language: an abundance of prefixes and suffixes that modify different moods in fine gradations. The hyphenator has learned from a corpus of over a million Russian words.
visit download page

Basque (Update April 2008)
the Basque language is one of Europe’s most exotic minority languages, probably unrelated to any other language in the world. The Basque hyphenator is tuned in to all those peculiarities of real-life language.
visit download page | view a Basque/Euskara example (PDF)

Hungarian (Update April 2006)
the Hungarian language has lost many of its Uralic characteristics, and many words have been borrowed from Turkish and European languages. The language is flavoured with compounds and special hyphenations (briddzsel -> bridzs-dzsel).
visit download page | view a Hungarian example (PDF)
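
The special hyphenation in the briddzsel example follows from Hungarian spelling: a doubled multi-letter consonant is written in shortened form (ddzs for dzs + dzs) and is written out in full when the word is broken at that point. The Python sketch below illustrates this with a naive string match. The function name is hypothetical, and the sketch ignores compound boundaries and every other break in the word, so it is not a description of how the hyphenator itself works.

```python
# Multi-letter consonants of Hungarian spelling.
BASES = ["dzs", "cs", "dz", "gy", "ly", "ny", "sz", "ty", "zs"]
# Shortened doubled form: the first letter is doubled, e.g. "dzs" -> "ddzs".
LONG_FORMS = {base[0] + base: base for base in BASES}


def break_long_consonant(word: str) -> str:
    """Break the word at a shortened doubled consonant, written in full."""
    # Try longer forms first so that "ddzs" wins over "ddz".
    for long_form in sorted(LONG_FORMS, key=len, reverse=True):
        pos = word.find(long_form)
        if pos != -1:
            base = LONG_FORMS[long_form]
            rest = word[pos + len(long_form):]
            return word[:pos] + base + "-" + base + rest
    return word  # nothing to expand


print(break_long_consonant("briddzsel"))  # bridzs-dzsel
```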

Bahasa Indonesia (Update June 2005)
Bahasa Indonesia (Standard Indonesian) is an Austronesian language full of prefixes, suffixes, and infixes (in general terms, affixes), including large classes of sound changes. Hyphenation is inextricably tied to meaning: it is affected even when the boundaries are masked by sound changes (mengarang from meng + karang).
visit download page | view a Bahasa Indonesian example (PDF)

Bahasa Melayu (Update June 2005)
What holds for Bahasa Indonesia applies as well to Bahasa Melayu.
visit download page

Byelorussian (Update July 2007)
is the language of the new nation of Belarus. It was proclaimed the country's sole official language, but Russian remains dominant. Byelorussian is written in the Cyrillic alphabet.
visit download page

Bulgarian (Update July 2007)
is spoken by 90 % of the population of Bulgaria, 7 million people. The modern Bulgarian alphabet closely resembles the Russian alphabet.
visit download page

Serbian (Update November 2007)
Serbian or srpski jezik is written in the Cyrillic alphabet. Serbian is closely related to Croatian; however, the Croatian digraphs Dž, Lj, and Nj are written in Serbian with the single symbols Џ, Љ, and Њ. Like words in any Slavic language, Serbian words can have many prefixes to be hyphenated.
visit download page

Galician (Update July 2010)
The Galician language is now spoken in Spanish Galicia, situated north of Portugal. It is a Romance language related to Portuguese. The orthography differs slightly from Spanish.
visit download page

Rhaeto-Romance (February 2002)
is the collective name for three Romance dialects spoken in northeastern Italy and southeastern Switzerland.
visit download page

Greenlandic (April 2002)
is an Eskimo language spoken in Greenland. Greenlandic is written in the Latin alphabet. Words can be very long and one word can be a complete sentence.
visit download page

Ukrainian (Update July 2007)
is the national language of Ukraine. It is spoken by a population of 35 million people. Ukrainian has many Polish loan words, but the influences of Russian can be found in the east of Ukraine too.
visit download page

Romanian (Update August 2007)
is the national language of Romania. It is a Romance language written in the Latin script. One third of all Romanian words are of French origin. S|s-comma below (ș) and T|t-comma below (ț) are supported.
visit download page

Croatian (Update November 2007)
or hrvatski jezik is written in the Latin alphabet. Croatian is closely related to Serbian. Croatian includes a few digraphs which sound like a single consonant (Dž, Lj, Nj).
Like words in any Slavic language, Croatian words can have many prefixes to be hyphenated.
visit download page
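
Because each of these digraphs stands for a single consonant, a natural first step before looking for break points is to split a word into letter units that keep Dž, Lj, and Nj whole. The Python sketch below shows one way to do that. It is an assumption about a reasonable preprocessing step, not a description of the hyphenator's internals, and it does not handle words where the two letters merely happen to be adjacent.

```python
DIGRAPHS = ("dž", "lj", "nj")


def letter_units(word: str) -> list[str]:
    """Split a word into letters, keeping the digraphs as single units."""
    units = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair.lower() in DIGRAPHS:
            units.append(pair)   # one consonant written with two letters
            i += 2
        else:
            units.append(word[i])
            i += 1
    return units


print(letter_units("ljubav"))  # ['lj', 'u', 'b', 'a', 'v']
```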

Bosnian (Update November 2007)
or Bosanski jezik has been recognized as a language in its own right since Bosnia & Herzegovina became independent. Bosnian has developed its own identity; it is written in the Latin alphabet and is closely related to Croatian.
visit download page

Frisian (Update July 2009)
or Frysk is spoken in Friesland, the northernmost province of the Netherlands. Frisian is more closely related to English than Dutch is.
visit download page

Tagalog/Pilipino (April 2002)
is the national language of the Philippines. Three centuries of Spanish rule left a strong imprint on the vocabulary. The prefixes, infixes, and suffixes that modify word meaning make hyphenation irregular.
visit download page

Slovene (Update November 2007)
or slovenski jezik is written in the Latin alphabet. Slovene includes a few digraphs (Dž, Lj, Nj). Slovene has many prefixes and inflections. Some syllables consist of consonants only: hm-kniti, kr-tina, tr-den.
visit download page | view a Slovene example (PDF)

Thai (Update December 2004)
The Thai people build sentences in a different way. Therefore, the Thai module is not a hyphenator in the traditional sense but a word segmentation tool that takes context into consideration.
visit download page | more on Thai | view a Thai example (PDF)

Macedonian (Update July 2007)
is the principal language of the new nation of Macedonia. It is closely related to Bulgarian and written in the Cyrillic alphabet.
visit download page

Maltese (Update January 2006)
is one of the official languages of the islands of Malta. It is a Semitic language written in the Latin alphabet, including <ċ>, <ħ>, <ġ>, and <ż>. The variety of root words has a great impact on hyphenation.
visit download page | view a Maltese example (PDF)

Sámi (Update April 2006)
The hyphenation agrees with the North Sámi language as spoken in Finnmark county in the north of Norway.
visit download page | view a North Saami example (PDF)

Hebrew (December 2006)
is written in Hebrew consonants only and therefore hyphenation is partially uncertain. Within this uncertainty the hyphenator accepts graphical hyphenations.
visit download page

Irish/Gaelic (December 2006)
is a Celtic language mainly spoken in Ireland.
visit download page

Zulu (September 2008)
is a Bantu language mainly spoken in the Republic of South Africa.
Zulu is one of the 11 official South African languages. It is very different from Afrikaans and the other Indo-European languages, and so is its hyphenation: be~nga~ka~la~li, ma~fu~ngwa~se.
visit download page

Xhosa (September 2008)
is a Bantu language mainly spoken in the Transkei, Ciskei and Eastern Cape regions of the Republic of South Africa.
Xhosa is one of the 11 official South African languages. It is very different from Afrikaans and the other Indo-European languages, and so is its hyphenation: i~si~Mpo~ndo, ye~Bha~nki.
visit download page

Swahili (March 2009)
is a Bantu language mainly spoken in East Africa.
Swahili is a principal language of Tanzania, Zanzibar, Uganda, and many neighbouring countries. Hyphenation examples are: ne~nda, ni~ra~ku~pi~ga, ni~li~m~pi~ga, u~nga~ma.
visit download page

Kurdish (Northern) (July 2009)
belongs to the Iranian group of languages. Kurdish is spoken in Turkey, Iraq, Iran, Armenia, Georgia, and Azerbaijan. The Latin script is used for the Northern variety of Kurdish.
visit download page

Khmer (Cambodia) (November 2009)
belongs to the Austroasiatic languages. Khmer has its own script known as Aksar Khmer. In Khmer no spaces are inserted between words. Yet words have to be segmented, even unknown words.
visit download page

Kazakh (Latin/Cyrillic) (Update May 2010)
belongs to the Turkic family of languages. Kazakh is written in the Cyrillic, Arabic, or Latin script. An official transition to the Latin script could happen in a 10 to 12 year period. The Kazakh hyphenator (Unicode) supports both the Latin and Cyrillic scripts.
visit download page

Under development
Faroese and Esperanto.