Languages: | go to language |
Our hyphenators are not based on a hyphenated dictionary database, instead they rely on language models.
New hyphenator languages:
Marathi (India), Bengali (India), Hindi (India), Malayalam (India), Tamil (India).
Recent upgraded hyphenator languages:
Italian, Dutch, Flemish, Surinam Dutch, Afrikaans, Slovak, Czech, Galician, English (5x), French, Danish, Frisian (Frysk), Norwegian (2x),
Nynorsk (2x), Swedish (3x), Icelandic, German, Swiss German, Austrian German, Portuguese (Iberian & Brazilian), Finnish, Spanish,
Estonian, Thai, Canadian French .
Upgraded hyphenators are built on larger learning corpora and are validated against an even larger shadow corpora.
86 language modules
Dutch, Flemish, Surinam Dutch
(Upgrade October 2024, ε < 0.00012 ‰) (Hyphenates coronavirus idiom, and medical terms)
supports the most recent accepted Dutch orthography, and recognizes earlier spelling variants such as the progressive spelling (Belgium),
and the 1996 & October 2005 spelling reforms — four principles have been integrated in one hyphenator for the Belgium, Surinam and Dutch idiom.
The hyphenator recognizes compound boundaries (Frank~rijk~lei not Frank~rij~klei, or boer~de~rij~wa~ter~ijs~je not boer~de~rij~wa~te~r~ijs~je,
farmhouse ice lolly) and covers the Dutch idiom in the most extensive way.
The hyphenator recognizes special hyphenations in which characters are replaced, removed or added due to hyphenation (aveetje → ave-tje, NOT avé-tje).
There are over 1400 cases registered in our corpora.
visit download page |
view a Dutch example (PDF) | view
a newspaper mistake | view a recent newspaper mistake
English
(Upgrade July 2024, ε < 0.00015 ‰) (Hyphenates coronavirus idiom)
supports phonetical hyphenation according to the world's most trusted
dictionaries: Webster's Third New International Dictionary, Webster’s
New Twentieth Century Unabridged Dictionary (2nd edition), and Longman’s
Dictionary of Contemporary English.
Based on an unabridged learning corpus, the hyphenator's base is close to Webster's Unabridged Dictionaries.
The hyphenator covers the British, Canadian, and American idiom and
solves the extreme irregularity of the alternation of English strong and weak syllables.
The new double layer model enables the user to disregard
certain secondary divisions (adapt~able instead of adapt~a~ble).
Hyphenation agrees with Oxford Colour Spelling Dictionary (1995). Compared to the other dictionaries,
this last dictionary lists fewer hyphenation locations then there are syllables in a word (amazone instead of am~a~zon).
The English module has separate entries for British, American, Canadian, Australian, New Zealand and South African
English, and supplies you with superior hyphenation and hyphenation density.
Be careful, some algorithms erroneously divide words as "new-spaper", "whites-pace", etc.
visit
download page | view
an English example (PDF) | view
a newspaper mistake | view a book-printing mistake
German old (1980) and new (1996, 2024 plus)
(Upgrade October 2024 with a new language description model, ε < 0.0014 ‰) (Hyphenates coronavirus idiom)
supports every characteristic German hyphenation according to the most recent Duden
Rechtschreibung August 2006-2017 and Wahrig 2009. For German reformed two hyphenation styles have been implemented, one in agreement
with the Duden(s) 1996-2017, and the other prefers the Duden 2017-2024 hyphenations, if meaningful strict eingedeutschte syllables are used.
The German hyphenator recognizes compound boundaries independent of the spelling reform.
The new feature “der Verwendung von Großbuchstaben ESZETT (ß / ẞ)”
correctly hyphenates words written in both “Schreibungsweisen”.
A special effort has been made to support medical and other scientific domains.
The German hyphenators have been compared to over two million German, Swiss German & Austrian German words as an independent estimate of accuracy.
The "schleswig-holsteinischen Bildungsministerin" writes "berufsfel-dorientierend" in her letter, we hyphenate as it has to be: "be-rufs-feld-ori-en-tie-rend".
visit
download page | view
a German example (PDF) | view
a newspaper mistake
Swiss German old (1980) and new (1996, 2024 plus)
(Upgrade October 2024) (Hyphenates coronavirus idiom)
responds
accurately to the typical Swiss German deviations and local idiom
(including the ß
to ss transcription, Stra~ße comes Stras~se (not Stra~sse)) and with the addition of Swiss German & Swiss French geographical names.
visit
download page | view
a Swiss German example (PDF) | view
a newspaper mistake
Austrian German old (1980) and new (1996, 2024 plus)
(Upgrade October 2024) (Hyphenates coronavirus idiom)
responds
accurately to the typical Austrian German peculiarities and local idiom.
visit
download page | view
a German example (PDF) | view
a newspaper mistake
French
(two versions, Upgrade May 2024)
accepts etymological syllabification according to Grevisse’s “le bon
usage.” A second version accepts phonetica
hyphenation rules recommended by the leading French linguist Nina
Catach in Paris.
Both versions use the new multilayer technique to enable or disable
hyphenation of muettes.
Covers French idiom nearly completely and has been tested on a full-size French corpus, including extended geographical data of the world (Kamt~chat~ka, Chev~tchen~ki~vie, etc.).
visit
download page | view
a French example (PDF) |
view
a French example (not hyphenating muettes)(PDF)
Spanish Peninsular, Argentine, Mexican & Latin American
(Upgrade March 2020, ε < 0.0008 ‰)
supports the official hyphenation rules as published in the Ortografía de la lengua
española (2010), and large dictionary publishers;
completely covers the Spanish and Latin American idiom.
Be carefull with foreign words in Spanish! Do not hyphenate as "Burn the Wit-|ch" (El País, 23, 4-6-2016).
It is not a syllable!
visit
download page | view
a Spanish example (PDF)
Italian
(Upgrade October 2024, ε < 0.0008 ‰)
supports phonetical hyphenation, in Italian: "la sillabazione: basata prevalentemente sul criterio
di tenere uniti i gruppi consonantici attestati, anche una sola volta, come iniziale di parola".
In addition the new hyphenator handles hiatuses accurately, elisions (al-l’I.ta-lia),
conjugations, declensions,
and words that came from English and other foreign languages (beat-nik and not be-at-nik).
visit
download page | view
an Italian example (PDF)
Iberian and Brazilian Portuguese (previous and acordo ortográfico)
(Upgrade April 2022, ε < 0.006 ‰)
based on the vowel as the syllabic unit,
but falling diphthongs and final diphthongs are kept together.
Both Iberian and Brazilian idioms are supported by the Portuguese hyphenator engine.
This engine supports doubling of the hyphen (repetir o hífen na linha sequinte) and is based on the entire 2022 corpus.
visit
download page | view
a Portuguese example (PDF)
Czech
(Upgrade July 2024)
supports the reformed spelling. As is the case in every Slavic
language, a number of
additive vowels and consonants exists, which have a large impact on
hyphenation.
Syllables that solely consist of consonants are supported (ji-tr-nice).
visit
download page
Slovak
(Upgrade August 2024)
supports the standard Slovak orthography. As is the case in every Slavic
language, a number of
additive vowels and consonants exists, which have a large impact on
hyphenation.
Syllables that solely consist of consonants are supported (ji-tr-nice, trvr-dý).
visit
download page
Swedish
(Upgrade February 2024 (3x), ε < 0.0009 ‰) (Hyphenates coronavirus idiom)
accepts the mekaniska principen, but
compounded words are divided into their morphological roots. An
overwhelming
occurrence of compounds, and newly created forms, makes it a challenge
worth accepting.
You can switch between c-k or ck- hyphenation, and between within-word
vowel-vowel hyphenation.
A third hyphenator model also supports morphological hyphenation as specified
in the Svenska Akademiens ordlista över svenska språket (SAOL), however,
Mediespråksgruppen (tt.se) tar avstånd från SAOL:s avstavningssystem,
yet it has been introduced into the schooling system many years ago.
The fact that Mediespråksgruppen rejected SAOL hyphenation might be related
to technical hindrances to build a hyphenator just like that.
Still it is possible given the use of a proper language model.
The latest Swedish hyphenator (February 2024) agrees with SAOL 13/14 morfologiska avstavningar over a full Swedish language idiom.
All Swedish hyphenators have been optimized using a 2.1 million word hyphenation corpus of the Swedish language.
visit download page |
view a Swedish konsonantregel example (PDF) |
view a Swedish SAOL12 example (PDF)
Finnish
(Upgrade June 2021, ε < 0.0005 ‰) (Hyphenates coronavirus idiom)
is tuned in to the peculiarities of the Finnish language and shares attributes with all Finno-Ugric languages.
It has a rich structure, including a large number of falling and rising diphthongs.
The phonetical base of the syllable is accepted, here, fully hyphenated despite it’s
overwhelming inflection structure and a high proportion of compounds
(e.g., kort~te~li~ajo (street racing), ko~ru~om~mel (embroidery), tul~kin~ta~ero,
do not hyphenate orphans, e.g., tul~kin~ta~e~ro). The hyphenator supports special hyphenation types, e.g., vaa'an > vaa- // an.
You may find its resemblance to the neighboring Estonian remarkable.
The June 2021 hyphenator got a new extended language model, and learned the hyphenation principles embedded in more than five million words.
The latest Finnish hyphenator was optimized over 5,368,692 words resulting in 27,826,228 hyphens inserted.
visit
download page | view
a Finnish example (PDF)
Catalan (nova ortografia)
(Upgrade May 2017)
supports the mixed French and Spanish origins of the Catalan language.
A peculiarity of Catalan, needing special care, is the l
geminada (l·l). This version supports the spelling reform of October 2016.
visit
download page | view
a Catalan example (PDF)
Danish
(Upgrade April 2024, ε < 0.0001 ‰)
accepts the hyphenation rules of the Dansk
Sprognævns Retskrivningsordbog, med ændrede ord- og staveformer (2012). Nearly any compound and newly created neologism is correctly
hyphenated; please note that it even hyphenates Norwegian according to consonant rules.
The latest hyphenator also includes hyphenation of COVID and corona-related idiom. The hyphenator is trained on the full size 2024 Danish hyphenation corpus.
The language model has been updated.
visit
download page | view
a Danish example (PDF)
Norwegian/Nynorsk
(Upgrade February 2024, ε < 0.0017 ‰)
accepts consonant rules
or the morphological rules of the Språkrådet and the Nordisk institutt of the University of Bergen.
The principles of pattern recognition are put into practice
on Nynorsk as well; one hyphenator engine for mixed language applications using modern Norwegian idiom.
The latest hyphenator also includes hyphenation of COVID and corona-related idiom.
visit
download page | view a Norwegian example (PDF) |
view a recent newspaper mistake
Icelandic
(Upgrade November 2023, ε < 0.0018 ‰)
accepts morphological rules which separate
the attached article and nominative, dative, accusative, and genitive
cases and is
capable of dividing a pileup of compounds, even neologisms, e.g.,
ól-ymp-íu-gull, not not ólymp-íug-ull (Olympic Gold);
að~al~aust~ur~dælu~kerf~is~ins, not að~a~l~aust~ur~dælu~kerf~is~ins (main-eastern-pump-system);
flugu~fót~anna, not flug~u~fót~anna (fly's-foot);
kassa~gít~ar, not kassagít~ar (a song's title);
hinu~meg~in not, don't know hinumegin (on the other side);
ses~am~ol~íu not, sesa~mol~íu (sesam-oil)
visit
download page | view an Icelandic
example (PDF)
Estonian
(Upgrade January 2020)
behaves like the Finnish hyphenator and is capable
of correctly hyphenating Estonian compounds and diphthongs. However,
there are more diphthongs in the Estonian language than in the Finnish language
which increases complexity. Taken orphans in to account we hyphenate
as las~te~aia~laps and not as las~te~ai~a~laps, or as
aa~to~mi~ener~gia~agen~tuu~ri and not as aa~to~mi~e~ner~gi~a~a~gen~tuu~ri.
visit
download page |
view an Estonian
example (PDF)
New Greek
(Upgrade February 2016, ε < 0.0015 ‰)
is tuned in to the Greek script, the Elot codepage or Unicode (Greek and Greek Extended). It
hyphenates more than between alpha and omega — not just the beginning
and the end (Classical Greek),
but a new era in progress (Modern Greek). Present-day Greek has evolved
and is flourishing with
diacritics and highly significant for hyphenation.
visit
download page
Latin
(February 2016)
is an extinct language, but still spoken in the Roman Catholic church. Together with Greek,
Latin had an enormous influence on nearly any European language. The Latin hyphenator
divides Latin prefixes, suffixes, and enclitics.
visit
download page
Polish
(Upgrade February 2016)
hyphenation of the Polish language is hindered by an
immense number of consonants, quite often unpronounceable for
non-Polish speakers. However, the hyphenator's model has been fully adapted to these
difficult syllables and pronunciation of Polish words is the guiding priciple to hyphenate.
The hyphenator supports doubling of the hyphen.
visit
download page
Latvian
(Upgrade February 2016)
is tuned in to the properties of
Baltic languages. Words are richly declined. Latvian uses additional
consonants
and vowels, which are recognized by the hyphenator.
visit
download page
Azerbaijanian
(Upgrade February 2016)
is one of the new Transcaucasian
republics that are now independent from the former USSR.
Azerbaijanian is related to Turkish.
The Azerbaijani now use a Latin script. There is no standard Byte CodePage script.
visit
download page
Turkish
(Upgrade February 2016, ε < 0.007 ‰)
Present-day Turkish is spoken in SW Asia, but in
earlier times the Turkish region reached into the north of China. In
Chinese history, the
name Tu-kiu was mentioned 600 years ago. Turkish is characterized by a
lot of
additive particles that change the meaning of a word. A word can take
numerous forms and different parallel hyphenations.
visit
download page | view
a Turkish example (PDF)
Lithuanian
(Upgrade February 2016)
is one of the Baltic languages which is richly
declined. The (semi-)diphthongs, palatals, and affricates have
been taken into consideration for hyphenation.
visit
download page
Afrikaans
(Upgrade August 2024, ε < 0.0040 ‰)
the Afrikaans language evolved from
17th-century Dutch and is an official language of South Africa. Its
hyphenation has
much in common with the Dutch language. Afrikanization of spelling has
given the Afrikaans
language its own identity. The Afrikaans
hyphenator takes all Afrikaans peculiarities into consideration,
including diaeresis
hyphenation.
visit
download page |
view an Afrikaans
example (PDF) |
die taal en die passende tegnologie (PDF)
Russian
(Upgrade July 2014)
accepts Cyrillic characters, but does not complicate hyphenation. It is the nature of
the Russian language: an abundance of prefixes and suffixes, modifying
different moods in a fine gradation.
The hyphenator has learned from a corpus of over a
million Russian words.
visit
download page
Basque
(Upgrade February 2016)
the Basque language is one of Europe’s most exotic minority languages,
probably
unrelated to any other language in the world. The Basque hyphenator is
tuned in to all those peculiarities of real-life language.
visit
download page |
view a Basque/Euskara
example (PDF)
Hungarian
(Upgrade April 2006)
the Hungarian language has lost many of its Uralic characteristics and
many words have been
borrowed from the Turkish and European languages. The language is
flavoured with compounds
and special hyphenations (briddzsel -> bridzs-dzsel).
visit
download page | view
a Hungarian example (PDF)
Bahasa Indonesia
(Upgrade July 2014)
the Bahasa Indonesia (Standard Indonesian) is an
Austronesian language full of prefixes, suffixes, infixes, in general
terms affixes including large classes of sound changes. Hyphenation is
inextricably tied to meaning, even when the boundaries are masked by
sound changes (mengarang from meng + karang) hyphenation
is affected.
visit
download page | view
a Bahasa Indonesian example (PDF)
Bahasa Melayu
(Upgrade July 2014)
What holds for Bahasa Indonesia applies as well to Bahasa Melayu.
visit
download page
Bulgarian
(Upgrade February 2016)
is spoken by 90 % of the population of Bulgaria, 7 million
people. Modern Bulgarian alphabet is the same as the Russian alphabet, however,
the pronunciation of the hard and soft sign differs.
visit
download page
Serbian
(Upgrade July 2014)
Serbian or srpski jezik is written in the Cyrillic alphabet.
Serbian is closely related to Croatian, however, Serbian characters are
written with single symbols
Џ, Љ, and Њ.
(Dž, Lj, Nj ).
Like words in any Slavic language Serbian words can have many prefixes
to be hyphenated.
visit
download page
Galician
(Upgrade July 2024)
The Galician language is now spoken in Spanish Galicia, situated
north of Portugal. It is a Romance language related to
Portuguese. The orthography differs slightly from Spanish.
visit
download page
Rhaeto-Romance
(Februari 2002)
is the collective for three Romance dialects spoken in the
northeastern Italy and southeastern Switzerland.
visit
download page
Romanian
(Upgrade June 2009)
is the national language of Romania. It is a Romance language
written in the latin script. One third of all Romanian words are of
French origin. S|s-comma below (ș) and T|t-comma below (ț) are supported
visit
download page
Croatian
(Upgrade July 2014)
or hrvatski jezik is written in the Latin alphabet. Croatian is
closely related to Serbian. Croatian includes a few digraphs which
sound like a single consonant (Dž, Lj, Nj ).
Like words in any Slavic language Croatian words can have many prefixes
to be hyphenated.
visit
download page
Bosnian
(Upgrade July 2014)
or Bosanski Jezik exists
since Bosnia & Herzegovina became independent. Bosnian has
developed its own identity,
written in Latin and closely related to Croatian.
visit
download page
Slovene
(Upgrade May 2015)
or Slovenski jezik is written in the Latin alphabet. Slovene
includes a few digraphs (Dž, Lj, Nj).
Slovene has many prefixes and inflections. Some syllables divide
consonants only:
hm-kniti, kr-tina, tr-den.
visit
download page | view
a Slovene example (PDF)
Thai
(Upgrade October 2019)
The Thai people build sentences in a different way. Therefore, the Thai
module
is not a hyphenator in the traditional sense, but it is a word
segmentation tool, that takes
context into consideration.
visit download page |
more on Thai |
view a Thai example (PDF)
Tamil (தமிழ)
(March 2021)
The Tamil language is spoken in the southern part of India and in parts of Sri Lanka.
As the other languages of India, vowels are placed above consonants, except for initial vowels.
The script lets characters fuse together.
visit download page
Malayalam (മലയാളം)
(April 2021)
The Malayalam language is spoken in the southern part of India.
As the other languages of India, vowels are placed above consonants, except for initial vowels.
The script lets characters fuse together.
visit download page
Hindi (हिन्दी),
(Update May 2021)
The Hindi language is spoken as the most important language of India.
As the other languages of India, vowels are placed above consonants, except for initial vowels.
The script lets characters fuse together. Of course such a combination will not be broken.
visit download page
Bengali (বাংলা)
(May 2021)
is an Indo-Aryan language of the Bengali-Assamese group of languages. Bengali is spoken in the Bengal regio of India.
and is written in the Bengali script (Unicode 0x0980 - 0x09FF). Dependent vowel are placed above, below, before, or after a consonant,
but in written (script) always AFTER the consonant.
It is easy to write them in reverse order, a serious error. It is forbidden to hyphenate before an dependent vowel.
visit download page
Marathi (मराठी)
(May 2021)
Macedonian
(Upgrade July 2007)
Maltese
(Upgrade February 2016)
Sámi
(Upgrade July 2014)
Hebrew
(Upgrade February 2016)
Irish/Gaelic
(Upgrade February 2016)
Zulu
(Upgrade February 2016)
Xhosa
(Upgrade February 2016)
Swahili
(Upgrade February 2016)
Kurdish (Northern)
(July 2009)
Khmer (Cambodia)
(February 2016)
Kazakh (Latin/Cyrillic)
(Upgrade February 2016)
Welsh (Cymraeg)
(August 2015)
Under development
The Marathi language is spoken in the Mahatashtra state of India. It is an Indo-Aryan language written in the Devanagari script,
but grammar and word forms differ from Hindi.
visit download page
is the principal language of the new nation of Macedonia, it is closely
related to Bulgarian and written in the Cyrillic alphabet.
visit
download page
is one of the official languages of the islands of Malta, it is
a Semitic language written in the Latin alphabet,
including <ċ> <ħ> <ġ> and <ż>,
the variety of root words has a great impact on hyphenation.
visit
download page | view
a Maltese example (PDF)
The hyphenation agrees with the Nord Sámi language as spoken in Finnmark
county in the north of Norway.
visit
download page | view
a North Saami example (PDF)
is written in Hebrew consonants only and therefore hyphenation is partially uncertain. Within this uncertainty the hyphenator accepts graphical hyphenations.
visit
download page
is a Celtic language mainly spoken in Ireland.
visit
download page
is a Bantu language mainly spoken in the Republic of South Africa.
Zulu is one of the 11 South African languages and is very different from Afrikaans
and the other Indo-European language and so is hyphenation: be~nga~ka~la~li, ma~fu~ngwa~se.
visit
download page
is a Bantu language mainly spoken in the Transkei, Ciskei and Eastern Cape regions of the Republic of South Africa.
Xhosa is one of the 11 official South African languages. It is very different from Afrikaans and the other Indo-European language,
and so is hyphenation: i~si~Mpo~ndo, ye~Bha~nki.
visit
download page
is a Bantu language mainly spoken in East Africa.
Swahili is principal language of Tanzania, Zanzibar, Uganga and many neighbouring countries.
Hyphenation examples are: ne~nda, ni~ra~ku~pi~ga, ni~li~m~pi~ga, u~nga~ma.
visit
download page
belongs to the Iranian group of languages. Kurdish is spoken in Turkey, Iraq, Iran, Armenia, Georgia and Azerbaijan.
The latin script is used for the Northern variety of Kurdish.
visit
download page
belongs to the Austroasiatic languages. Khmer has its own script known as Aksar Khmer.
In Khmer no spaces are inserted between words. Yet words have to be segmented, even unknown words.
visit
download page
belongs to the Turkic family languages. Kazakh is written in the Cyrillic, Arabic or Latin script.
An official transition to the Latin script could happen in a 10 to 12 year period.
The Kazakh hyphenator (Unicode) supports both the Latin and Cyrillic script.
visit
download page
is a Celtic language which is hyphenated according to morphological principles.
In addition a lot of sound changes (mutations) replicates these principles.
The Welsh hyphenator improves the density of hyphenation at the end of the line.
visit
download page
Faroese, and Esperanto.