The Australian National University


How do I send you any information?

It's easiest (for you) to email us. It's easiest (for us) if you can follow a fairly standard format for sending data; filling out the values on the form is recommended.


The database has been compiled from a wide range of sources. Most commonly, descriptive grammars were used to collect the information. In the absence of a grammar, the research was done with dictionaries or, very occasionally, wordlists or texts, although these latter types of sources do not provide material as complete as is desirable. Occasionally a theoretical work provided enough detail for us to enter data on a language, although in general theoretical works focus on only one aspect of a language's structure; for instance, there are works discussing syllable structure that concentrate exclusively on codas (the end of the syllable) and say nothing about co-occurrence restrictions at the onset of the syllable. For this reason we have attempted to source our data from more complete descriptions of languages.

In addition to features such as whether complex onsets are allowed, we have noted whether a language includes epenthesis in its phonology, or has vowel deletion rules (or neither, or both); whether it violates the Sonority Sequencing Hierarchy (Lass 1984, Laver 1994); which variety of a given language the grammar was based on; and whether there are particular privileges associated with particular classes of consonants in different positions (for instance, many languages that allow complex onsets restrict these clusters to a stop+liquid sequence, while many other languages are not so restricted). We developed a set of operating principles to allow us to code consistently whether a language had long vowels or diphthongs, as well as how to treat loanwords. This was especially important for those cases in which no detailed grammar was available, and we had to work through more truncated materials.
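As a rough illustration of what checking an onset against the Sonority Sequencing Hierarchy might involve, here is a minimal Python sketch. The sonority scale, the segment inventory and the function name `onset_violates_sonority` are assumptions made for this example only; they are not taken from Lass (1984) or Laver (1994), and they are not the procedure used to build the database. The sketch treats any fall in sonority within the onset as a violation and allows plateaus, which is itself a simplifying assumption.

```python
# Hypothetical sonority ranks (higher = more sonorous); a common
# textbook-style scale, assumed here for illustration only.
SONORITY = {}
for segments, rank in (
    ("ptkbdg", 1),  # stops
    ("fsvzh", 2),   # fricatives
    ("mn", 3),      # nasals
    ("lr", 4),      # liquids
    ("wj", 5),      # glides
    ("aeiou", 6),   # vowels
):
    for segment in segments:
        SONORITY[segment] = rank


def onset_violates_sonority(onset):
    """Return True if sonority falls anywhere within the onset cluster."""
    ranks = [SONORITY[segment] for segment in onset]
    return any(a > b for a, b in zip(ranks, ranks[1:]))


print(onset_violates_sonority("pl"))  # stop + liquid: sonority rises
print(onset_violates_sonority("st"))  # fricative + stop: sonority falls
```

Under these assumptions a stop+liquid onset such as "pl" is not flagged, while a fricative+stop onset such as "st" is.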

Database compilation

The following is an overview of the guidelines that we have followed in constructing and populating the database that is the heart of the displays on this site. It should also be used by anyone wishing either to interpret the data provided or to make contributions from their own data. Any such contributions are very much welcome, and will be appropriately acknowledged.

What is epenthesis and what is the difference between the two epenthesis values?

Epenthesis is one possible resolution for cases in which a language specifies two adjacent phonemes but doesn't allow them to appear together. Rather than deleting one or the other of the members of the offending cluster, epenthesis is the solution in which another sound is inserted to interrupt the illicit cluster. For example, a word might be underlyingly /mnd/, but pronounced [minid]; in this case we can speak of an epenthetic [i] vowel applying in the language. When this happens throughout the word, we can talk of general epenthesis. If we are dealing with a language that allows forms like /pse/, which are pronounced as [pese], or /mnd/ being pronounced as [mind], we are dealing with a different kind of epenthesis, which we have termed sesquisyllabic epenthesis (following Matisoff's terminology). These are the most commonly found patterns of epenthesis; if other types of epenthesis occur, such as inserting a glide or glottal stop at the start of an otherwise vowel-initial word, we have written notes explaining the rule.
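The difference between the two epenthesis values can be sketched in a few lines of Python. This is only one reading of the examples above, under stated assumptions: each character stands for one segment, the vowel set is a toy inventory, and the function names `general_epenthesis` and `sesquisyllabic_epenthesis` are hypothetical. General epenthesis is modelled as breaking up every consonant cluster in the word; sesquisyllabic epenthesis is modelled as breaking up only a word-initial cluster.

```python
VOWELS = set("aeiou")  # toy vowel inventory, assumed for illustration


def is_consonant(segment):
    return segment not in VOWELS


def general_epenthesis(word, vowel="i"):
    """Insert `vowel` between every pair of adjacent consonants."""
    output = []
    for segment in word:
        if output and is_consonant(output[-1]) and is_consonant(segment):
            output.append(vowel)
        output.append(segment)
    return "".join(output)


def sesquisyllabic_epenthesis(word, vowel="i"):
    """Insert `vowel` only inside a word-initial consonant cluster."""
    if len(word) >= 2 and is_consonant(word[0]) and is_consonant(word[1]):
        return word[0] + vowel + word[1:]
    return word


print(general_epenthesis("mnd"))              # minid
print(sesquisyllabic_epenthesis("mnd"))       # mind
print(sesquisyllabic_epenthesis("pse", "e"))  # pese
```

The three calls reproduce the forms cited above: /mnd/ as [minid] under general epenthesis, and /mnd/ as [mind] and /pse/ as [pese] under sesquisyllabic epenthesis.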

What do I do about different varieties of the language?

If the author discusses one variety of the language in particular, then we have made a note of this. If the author compares several varieties, we try to choose the variety with the best description and the most data, noting which one it is. For varieties that are dependent on age, we try to fill in the spreadsheet based on the speech of older people, and make notes on other varieties.

What about minimal responses?

If a syllable structure is used only in paralinguistic contexts (for example, if long vowels do not exist except in hesitation syllables such as /i:/), we have not coded information about long vowels and syllable structure, but have made a note of the data. Similarly, a CCC coda in English almost always involves an inflectional -s as the final C, implying that this pattern is tolerated but not present in the lexicon.

What do I do about clusters formed through morphological processes?

If the language has clusters that are formed through morphological processes and otherwise don't exist, we have left them out of the phonotactics but put the details in Notes.

What do I do about loanwords?

If loanwords have phonotactic rules that differ from those of native words, we have acknowledged this only in Notes. For example, CC clusters in Cebuano occur only in loanwords. In such a case we have listed the phonotactic possibilities as CV, VC and CVC, and put in Notes that loanwords can have a CC onset where C2 = liquid. Some languages have a long history of language contact, and it is difficult to tell which words are loanwords because of the process of naturalisation. Where this seemed to be the case, we mentioned it in Notes and filled in the phonotactics according to the data and the discussion.
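For contributors preparing their own data, the Cebuano case can be pictured as a record in which loanword-only patterns go into Notes and never into the coded syllable types. The following Python sketch is illustrative only: the class `PhonotacticEntry`, its field names and the helper `record_loanword_pattern` are hypothetical and do not reflect the actual layout of the database or the submission form.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PhonotacticEntry:
    # Hypothetical record layout, assumed for illustration only.
    language: str
    syllable_types: List[str]
    notes: List[str] = field(default_factory=list)


def record_loanword_pattern(entry, description):
    """Record a loanword-only pattern in Notes; syllable_types is untouched."""
    entry.notes.append("Loanwords: " + description)
    return entry


# Native phonotactics are coded; the loanword-only onset goes into Notes.
cebuano = PhonotacticEntry("Cebuano", ["CV", "VC", "CVC"])
record_loanword_pattern(cebuano, "CC onset where C2 = liquid")

print(cebuano.syllable_types)
print(cebuano.notes)
```

Keeping the two fields separate means that counts of coded syllable types are not inflated by patterns that are attested only in borrowed vocabulary.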

Updated: 27 September 2012 / Responsible Officer: Director, CHL / Page Contact: Phonotactics maintenance