
Zoroastrian Middle Persian: Digital Corpus and Dictionary

The Middle Persian language played a prominent historical and cultural role in the first millennium CE as the official language of the Sasanian Empire, with a usage spanning several religious traditions. Its texts link East and West in both linguistic and cultural terms, and cover a period stretching from late antiquity to the early Islamic period. Despite this, there is no comprehensive digital database for this language, nor is there a comprehensive lexicographical tool covering the full variety of its vocabulary throughout the long period of its existence.

As a first step towards this eventual goal, the present project aims to create an online open-access corpus of all Zoroastrian Middle Persian (henceforth: ZMP) texts in the Pahlavi script. The corpus will comprise around 54 texts, containing some 687,000 words in transliteration and transcription, together with digital photographic documentation of the 15 oldest codices. The texts will be supplied with morphological and partial syntactic annotation, and encoded according to the guidelines developed by the “Text Encoding Initiative” (TEI). This digital corpus of Pahlavi texts will in turn serve as the basis for a digital Middle Persian-English dictionary of ZMP, comprising an estimated 7,000 lemmata. We hope subsequently to extend this work, through related projects, to corpora and dictionaries of other types of Middle Persian texts.
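To make the idea of a TEI-encoded, morphologically annotated token concrete, the following is a minimal sketch and not the project's actual schema, which is not specified here. It assumes that each word is a TEI `<w>` element carrying the standard `lemma`, `pos` and `msd` attributes, and that transliteration and transcription are paired as `<orig>` and `<reg>` inside `<choice>`. The sample phrase, the tag values and the helper names `read_tokens` and `lemma_index` are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"
NS = {"tei": TEI}

# Illustrative sample only: "šāhān šāh" ("king of kings"), with an assumed
# encoding of transliteration (<orig>) and transcription (<reg>).
SAMPLE = """
<s xmlns="http://www.tei-c.org/ns/1.0">
  <w lemma="šāh" pos="NOUN" msd="Number=Plur">
    <choice><orig>MLKAn</orig><reg>šāhān</reg></choice>
  </w>
  <w lemma="šāh" pos="NOUN" msd="Number=Sing">
    <choice><orig>MLKA</orig><reg>šāh</reg></choice>
  </w>
</s>
""".strip()


def read_tokens(xml_text):
    """Return one dict per <w> element with its forms and annotation."""
    root = ET.fromstring(xml_text)
    tokens = []
    for w in root.iter(f"{{{TEI}}}w"):
        tokens.append({
            "transliteration": w.findtext("tei:choice/tei:orig", namespaces=NS),
            "transcription": w.findtext("tei:choice/tei:reg", namespaces=NS),
            "lemma": w.get("lemma"),
            "pos": w.get("pos"),
            "msd": w.get("msd"),
        })
    return tokens


def lemma_index(tokens):
    """Group attested transcribed forms under their lemma (dictionary view)."""
    index = {}
    for tok in tokens:
        index.setdefault(tok["lemma"], []).append(tok["transcription"])
    return index


if __name__ == "__main__":
    toks = read_tokens(SAMPLE)
    for tok in toks:
        print(tok)
    print(lemma_index(toks))
```

Grouping attested forms under a lemma, as `lemma_index` does, is one simple way in which an annotated corpus of this kind could feed a dictionary; the project's own workflow may differ.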

The digital corpus and the ensuing digital dictionary constitute two closely interlinked analytical instruments. They address two connected but distinct aspects of the texts, syntax and semantics, which the project brings together. A web-based working environment will enable collaborative work on both corpus and dictionary and will also serve as the user interface for research on and analysis of the prepared resources. Moreover, the project aims to open the corpus of Pahlavi literature to the analytical methods of corpus linguistics developed in the Digital Humanities.

The project will adopt a new approach and methodology for texts written in ZMP, thereby creating a common basis for both linguistic analysis and the study of conceptual history. This approach highlights ‘horizontal’ (i.e. genre) as well as ‘vertical’ (i.e. historical) differences between texts, both in the corpus and in the dictionary. The project is thus conceived as a basis for identifying internal and external factors in the complex fabric of ZMP literary texts, and as an adequate means for the differentiated analysis of cultural, religious and social history.

A final aim of the project is to establish links and exchange between the present endeavor and other projects, whether completed or ongoing, in the field of Old and Middle Iranian Studies.


04-2021 – 03-2024 (03-2030)

Funded by


  • Dr. Claes Neuefeind and Prof. Dr. Øyvind Eide, Cologne Center for eHumanities (CCeH)
  • Prof. Dr. Alberto Cantera, Free University Berlin
  • Prof. Dr. Shaul Shaked, Hebrew University Jerusalem

Affiliated Persons