Tuesday, June 27, 2017

Trees do not necessarily help in linguistic reconstruction


In historical linguistics, "linguistic reconstruction" is a rather important task. It can be divided into several subtasks, like "lexical reconstruction", "phonological reconstruction", and "syntactic reconstruction" — it comes conceptually close to what biologists would call "ancestral state reconstruction".

In phonological reconstruction, linguists seek to reconstruct the sound system of the ancestral language or proto-language, the Ursprache that is no longer attested in written sources. The term lexical reconstruction is less frequently used, but it obviously points to the reconstruction of whole lexemes in the proto-language, and requires sub-tasks, like semantic reconstruction where one seeks to identify the original meaning of the ancestral word form from which a given set of cognate words in the descendant languages developed, or morphological reconstruction, where one tries to reconstruct the morphology, such as case systems, or frequently recurring suffixes.

In a narrow sense, linguistic reconstruction refers only to phonological reconstruction, which is something like the holy grail of computational approaches, since, so far, no method has been proposed that would convincingly show that one can do without expert insights. Bouchard-Côté et al. (2013) use language phylogenies to climb a language tree from the leaves to the root, using sophisticated machine-learning techniques to infer the ancestral states of words in Oceanic languages. Hruschka et al. (2015) start from sites in multiple alignments of cognate sets of Turkic languages to infer both a language tree and the ancestral states, along with the sound changes that regularly occurred at the internal nodes of the tree. Both approaches show that phylogenetic methods could, in principle, be used to automatically infer which sounds were used in the proto-language; and both approaches report rather promising results.

Neither approach, however, is fully convincing, for both practical and methodological reasons. First, they are applied to language families that are considered to be rather "easy" to reconstruct. The tough cases are larger language families with more complex phonology, like Sino-Tibetan or any of its subbranches, including even shallow families like Sinitic (Chinese), or Indo-European, where the greatest achievements of the classical methods for language comparison have been made.

Second, they rely on a mistaken assumption: that the sounds used in a set of attested languages are necessarily the pool of sounds that would also be the best candidates for the Ursprache. For example, Saussure (1879) proposed that Proto-Indo-European had at least two sounds that did not survive in any of the descendant languages, the so-called laryngeals, which are nowadays commonly represented as h₁, h₂, and h₃, and which leave complex traces in the vocalism and the consonant systems of some Indo-European languages. Ever since then, it has been a standard assumption that it is always possible that none of the ancestral sounds in a given proto-language is still attested in any of its descendants.

A third interesting point, which I consider a methodological problem of the methods, is that both of them are based on language trees, which are either given to the algorithm or inferred during the process. Given that most if not all approaches to ancestral state reconstruction in biology are based on some kind of phylogeny, even if it is a rooted evolutionary network, it may sound strange that I criticize this point. But in fact, when linguists use the classical methods to infer ancestral sounds and ancestral sound systems, phylogenies do not necessarily play an important role.

The reason for this lies in the highly directional nature of sound change, especially in the consonant systems of languages, which often makes it extremely easy to predict the ancestral sound without invoking any phylogeny more complex than a star tree. That is, in linguistics we often have a good idea about directed character-state changes. For example, if a linguist observes a [k] in one set of languages and a [ts] in another set of languages in the same alignment site of multiple cognate sets, then they will immediately reconstruct a *k for the proto-language, since they know that [k] can easily become [ts] but not vice versa. The same holds for many sound correspondence patterns that can be frequently observed among all languages of the world, including cases like [p] and [f], [k] and [x], and many more. Why should we bother about any phylogeny in the background, if we already know that it is much more likely that these changes occurred independently? Directed character-state assessments make a phylogeny unnecessary.
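
The reasoning in this paragraph can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the tiny `TRANSITIONS` table is hand-picked from the examples above and is not a real catalogue of sound changes, and the function names `reachable` and `reconstruct` are hypothetical. The sketch treats the languages as a star tree and asks which sound could have produced every observed reflex through directed changes.

```python
# Hypothetical, hand-picked directed changes (source -> possible outcomes),
# taken from the examples in the text. This is NOT a real catalogue.
TRANSITIONS = {
    "k": {"ts", "x"},
    "p": {"f"},
}


def reachable(source, transitions):
    """Return all sounds reachable from `source` via directed changes,
    including `source` itself (a sound may simply be retained)."""
    seen = {source}
    stack = [source]
    while stack:
        current = stack.pop()
        for target in transitions.get(current, ()):
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


def reconstruct(reflexes, transitions=TRANSITIONS):
    """Return the candidate proto-sounds that can yield every observed
    reflex, ignoring any tree structure among the languages (star tree)."""
    observed = set(reflexes)
    candidates = set(transitions) | observed
    return sorted(
        sound for sound in candidates
        if observed <= reachable(sound, transitions)
    )


# [k] in one language, [ts] in three others: the majority does not matter.
print(reconstruct(["k", "ts", "ts", "ts"]))
# Only [ts] and [x] attested: the candidate is a sound no descendant retains.
print(reconstruct(["ts", "x"]))
```

Note that the number of languages showing each reflex plays no role here, and that the second call proposes a proto-sound that is not attested in any descendant, which a method restricted to the attested pool of sounds could not do.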

Sound change in this sense is simply not well treated in any paradigm that assumes some kind of parsimony, as it simply occurs too often independently. The question is less acute with vowels, where scholars have observed cycles of change in ancient languages that are attested in written sources. Even more problematic is the change of tones, where scholars have even less intuition regarding preference directions or preference transitions; and also because ancient data does not describe the tones in the phonetic detail we would need in order to compare it with modern data. In contrast to consonant reconstruction, where we can do almost exclusively without phylogenies, phylogenies may indeed provide some help to shed light on open questions in vowel and tone change.
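
To make the point about parsimony concrete, here is a small sketch using Sankoff-style weighted parsimony, a standard technique from biology and not the method of either paper discussed above. Everything specific in it is an assumption made for illustration: the four-language tree, the two-sound state set, and the penalty of 10 for the unlikely change from [ts] to [k]. With equal costs for all changes, the root is reconstructed as [ts], because a single change back to [k] is the "cheapest" story; with directed costs, the root is *k, even though that requires two independent changes.

```python
def sankoff(tree, states, cost):
    """Return {state: minimal total cost} for the root of `tree`.

    `tree` is either a leaf (the observed sound as a string) or a tuple
    of subtrees. `cost(a, b)` is the cost of a change from a to b.
    """
    if isinstance(tree, str):
        # A leaf is compatible only with the sound actually observed.
        return {s: (0 if s == tree else float("inf")) for s in states}
    child_tables = [sankoff(child, states, cost) for child in tree]
    return {
        s: sum(
            min(cost(s, t) + table[t] for t in states)
            for table in child_tables
        )
        for s in states
    }


def symmetric(a, b):
    """Equal cost for every change, as in unweighted parsimony."""
    return 0 if a == b else 1


def directed(a, b):
    """Assumed costs: [k] > [ts] is cheap, the reverse is penalized."""
    if a == b:
        return 0
    return 1 if (a, b) == ("k", "ts") else 10  # 10 is an arbitrary penalty


# Hypothetical tree: one language keeps [k], three languages show [ts].
tree = (("k", "ts"), ("ts", "ts"))
states = ["k", "ts"]

for cost in (symmetric, directed):
    table = sankoff(tree, states, cost)
    print(cost.__name__, table, "root:", min(table, key=table.get))
```

Under the directed costs the tree contributes nothing that the transition preferences did not already settle, which is the sense in which consonant reconstruction can do without a phylogeny.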

But one should not underestimate this task, given the systemic pressure that may crucially affect vowel and tone systems. Since there are considerably fewer empty spots in the vowel and tone space of human languages, it can easily happen that the most natural paths of vowel or tone development (if they exist in the end) are counteracted by systemic pressures. Vowels can be more easily confused in communication, and this holds even more for tones. Even if changes are "natural", they could create conflict in communication, if they produce very similar vowels or tones that are hard for speakers to distinguish. As a result, these changes could provoke mergers in sounds, with speakers no longer distinguishing them at all; or alternatively, changes that are less "natural" (physiologically or acoustically) could be preferred by a speech community in order to maintain the effectiveness of the linguistic system.

In principle, these phenomena are well known to trained linguists, although it is hard to find any explicit statements in the literature. Surprisingly, linguistic reconstruction (in the sense of phonological reconstruction) is hard for machines precisely because it is easy for trained linguists. Every historical linguist has a catalogue of existing sounds in their head as well as a network of preference transitions, but we lack a machine-readable version of those catalogues. This is mainly because transcription systems differ widely across subfields and families, and because no efforts to standardize these transcriptions have been successful so far.

Without such catalogues, however, any efforts to apply vanilla-style methods for ancestral state reconstruction from biology to linguistic reconstruction in historical linguistics will be futile. What we need for linguistic reconstruction is not the trees, but the network of potential pathways of sound change.

References
  • Bouchard-Côté, A., D. Hall, T. Griffiths, and D. Klein (2013): Automated reconstruction of ancient languages using probabilistic models of sound change. Proceedings of the National Academy of Sciences 110.11. 4224–4229.
  • Hruschka, D., S. Branford, E. Smith, J. Wilkins, A. Meade, M. Pagel, and T. Bhattacharya (2015): Detecting regular sound changes in linguistics as events of concerted evolution. Current Biology 25.1. 1–9.
  • Saussure, F. (1879): Mémoire sur le système primitif des voyelles dans les langues indo-européennes. Teubner: Leipzig.