Genomes and Languages

Hakan Ufuk

Apr 1, 2004

Languages are not that different from genes. Just as events like the Barbarian Migrations of the 5th century or the bubonic plague of the 14th century would be expected to leave marks on the gene pools of the surviving populations, such events also influence languages by introducing new words, idioms, and meanings. A recent study, published last November in the high-profile journal Nature, affirms this parallel, convincingly establishing a philological tree using computational methods developed for phylogeny (the study of historical relations between species and their genes).1

When Did English and Hindi Begin to Differ?

The long-established “comparative method” of linguistics uses vocabulary, the structure of words, and the sound systems of languages to draw language family trees. These trees depict the order in which related languages (such as English, Hindi, and ancient Hittite) diverged from their mother languages and the relative “relatedness” of sister languages. Dates of divergence are usually tied to dates of historical or archaeological significance. For example, the Romanian language, a relative of Italian, must have been introduced to the region between 112 and 270 AD, when Roman troops occupied Dacia. However, the comparative method does not provide any dates itself, other than those of relative chronology.

Lexicostatistics, a rival approach based on vocabulary change, extracts essential vocabulary from languages, words such as “I,” “three,” and “hand,” which are assumed to be more resistant to change, and produces a metric of shared cognates and, hence, of language kinship. Assuming a constant rate of language change over time, one can extrapolate to prehistoric dates for language evolution. For example, one may try to estimate when Proto-Indo-European, the ancestor of English, Hindi, and Hittite, started branching into distinct new languages. Unfortunately, the promise of lexicostatistics (and its dating method, called glottochronology) came into doubt soon after its birth. It was criticized in much the same way as biological phylogenetic analyses were. For example, just as the mutation rates of genes (sequences of DNA) may change over time, languages may change faster or slower in certain periods. Shared vocabulary can also mislead: similarities between languages may be chance convergences or borrowings, while distant relatives may be unrecognizable after a great deal of divergence. These objections have plagued biology in similar ways.
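The constant-rate dating idea behind glottochronology can be sketched in a few lines of Python. This is only an illustration of the classic textbook formula t = ln(c) / (2 ln(r)), not the method used in the study discussed in this article. The function names, the sample cognate judgments, and the retention rate of 0.86 per millennium are all assumptions chosen for the example.

```python
import math


def shared_cognate_fraction(cognate_flags):
    """Return the share of core-vocabulary items judged cognate.

    cognate_flags: list of booleans, one per word in the core list
    (hypothetical judgments supplied by the caller).
    """
    if not cognate_flags:
        raise ValueError("need at least one word comparison")
    return sum(cognate_flags) / len(cognate_flags)


def divergence_time(shared_fraction, retention_rate=0.86):
    """Estimate millennia since two languages split, under a constant rate.

    shared_fraction: fraction c of core words still cognate, 0 < c <= 1.
    retention_rate: assumed fraction r of core words a single language
        keeps per millennium (0.86 is an illustrative assumption).
    """
    if not 0 < shared_fraction <= 1:
        raise ValueError("shared_fraction must be in (0, 1]")
    if not 0 < retention_rate < 1:
        raise ValueError("retention_rate must be in (0, 1)")
    # Both lineages replace words independently, hence the factor of 2.
    return math.log(shared_fraction) / (2 * math.log(retention_rate))


if __name__ == "__main__":
    # Hypothetical judgments for a six-word list: four of six are cognate.
    flags = [True, True, True, True, False, False]
    c = shared_cognate_fraction(flags)
    for r in (0.80, 0.86, 0.90):
        print(f"retention {r:.2f}: about {divergence_time(c, r):.1f} millennia")
```

Running the loop with different retention rates shows how strongly the estimated date depends on the assumed rate, which is the weakness the critics pointed to.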

Phylogenetics and Philology Side by Side

English   French    Russian   Greek     Persian   Hindi
I         Je        Ja        Egho      Man       Me
Hand      Main      Ruka      Cheri     Dast      Hath
Three     Trois     Tri       Tria      Se        Tin
Mother    Mere      Mat       Mitera    Mader     Ma
New       Nouveau   noviy     kenuryos  Taze      Neya
nose      Nez       Nos       Miti      Na        Nak

Figure 1: A partial list of Indo-European words used by Gray and Atkinson (Dyen, I., Kruskal, J.B. & Black, P., FILE IE-DATA1 at http://www.ntu.edu.au/education/langs/ielex/IE-DATA1)

The recent study by Gray and Atkinson from the University of Auckland, New Zealand, published on November 27 in Nature, applies refined methods developed for phylogenetic studies to language tree construction, producing trees that are consistent with those established by the comparative method. Most importantly, the authors employed maximum-likelihood models and Bayesian inference, both statistical methods now established in phylogenetics, to counteract the weaknesses found in past attempts at glottochronology. Their method makes it possible to estimate divergence times without assuming a strict rate of change. It also identifies poorly supported sections of the tree and incorporates these uncertainties into the calculation of the trees and divergence times. Gray and Atkinson used only fourteen age constraints to calibrate their divergence time calculations. After confirmatory tests that eliminated some of these constraints, doubtful cognates, and other potential problems, they arrived at a date for the initial divergence of all Indo-European languages of 7,800 to 9,800 years ago. These dates coincide beautifully with the Anatolian farmer hypothesis, which holds that Indo-Europeans dispersed from Anatolia (modern-day Turkey) with the spread of agriculture around 8,000-9,500 years ago. That hypothesis is now also supported by genetic studies that report a Neolithic, Near Eastern contribution to the European gene pool.

An Alternative Theory

This study does not extinguish one of the fiercest discussions of this century. Many linguists hold that linguistic evidence favors the Kurgan expansion hypothesis, in which Kurgan horsemen invaded and spread from the Asian steppes 6,000 years ago. It is thought that the Kurgan horsemen possessed certain advantages, like knowledge of the wheel and horseback riding, just as the Anatolians knew about farming. These linguists claim that the statistical and computational methods used in biology do not reflect the way languages change, and that these methods use only vocabulary while ignoring grammar. The new study is a shot in the arm for supporters of the Anatolian theory and resurrects glottochronology, but the discussion is clearly far from over. To reconcile the two views, Gray and Atkinson note that their data show a period of intense diversification around 6,000 years ago, and they refer to an inclusive theory that combines an Anatolian origin with a later Kurgan expansion.2

Intertwined Trees

In their article, Gray and Atkinson predict that computational phylogenetic methods will be combined with vocabulary data to examine archaeological hypotheses in the future, as methods developed for biology continue to establish themselves in the social sciences. As David Searls of GlaxoSmithKline Pharmaceuticals concludes in his “News and Views” article in the same issue of Nature,3 “[this work] should stimulate even more cross-fertilization of ideas among those studying the intertwined trees of life and language.”