Computational Decipherment of Sumerian Pronunciation and Unorthographic Texts

Application Summary

The pronunciation of Sumerian, the oldest written language in the world, has puzzled researchers since the first cuneiform tablets were deciphered in the late 19th century. The problem lies in the Sumerian writing system, which spells word stems with logograms that carry no phonetic information and merely represent words as concepts. After Sumerian died out as a vernacular around 1900 BCE, Babylonian scholars continued its literary tradition for two millennia. To preserve knowledge of its pronunciation, they compiled a small number of Sumerian texts in an unorthographic script that represented the language phonetically. These texts are, however, extremely ambiguous, complicated and varied, and they are often incomprehensible to modern scholars, who are disconnected from the ancient oral tradition, unless a copy exists in the standard Sumerian orthography. The factors behind the spelling variation in these texts, and their relation to Sumerian pronunciation, remain largely unresolved. This project tackles this complexity using state-of-the-art machine learning (ML) and data-augmentation methods. It aims (1) to decipher the poorly understood unorthographic texts using the surviving texts in standard orthography, and (2) to reconstruct the pronunciation of these texts using sign-to-phoneme projections from the texts the Babylonian scribes wrote in their native language. To achieve these goals, the project will collect all 2nd-millennium Sumerian unorthographic texts into an annotated corpus, interlinked with the largest Sumerian text corpus (Oracc). In addition to providing training data for ML models, this corpus will enable large-scale quantitative analysis of diachronic and synchronic patterns in spelling variation, both within unorthographic texts and outside them. The project will elucidate long-standing mysteries of Sumerian pronunciation and undeciphered unorthographic spellings, and will make an important collection of texts available to the research community as an Open Access corpus.