
Exploring the Structure of Indus Script Using Computational Methods

Nisha Yadav¹ and M. N. Vahia¹

¹ Tata Institute of Fundamental Research, Mumbai ‐ 400 005, Maharashtra, India

Received: 17 September 2013; Accepted: 29 September 2013; Revised: 17 October 2013 Heritage: Journal of Multidisciplinary Studies in Archaeology 1 (2013): 210‐221

Abstract: The Indus script is a creation of one of the largest Bronze Age civilizations that flourished in the western part of the Indian subcontinent. It is one of the major undeciphered scripts of the ancient world. The lack of decipherment of the Indus script is attributed to the paucity of working material, brevity of the Indus texts, absence of any multilingual texts, and lack of definite knowledge of the language(s) spoken by the Indus people. While several interpretations of its contents have been put forward, no tools are available to validate them and hence there has been no consensus on any of these interpretations. In the present study, we use various computational methods related to data mining, machine learning and information theory to identify aggregate characteristics of the Indus script without making any assumptions about its content. The study aims to define some of the constraints that any proposed interpretation must satisfy.

Keywords: Indus Script, Indus Seals, Harappan Civilization, Data Mining, Computational Linguistics, Machine Learning, Statistical Analysis

Introduction

The roots of the Indus valley civilization go back to around 7000 BC; it peaked around 2600 BC and went into decline around 1900 BC (Wright 2010; Agrawal 2007; Possehl 2002; Kenoyer 1998). One of the most intriguing aspects of the Indus valley civilization is the Indus script. The Indus script appears on inscribed objects, often referred to as seals, that are commonly made of steatite or terracotta. These objects are generally a few square centimetres in size (Fig. 1). They are catalogued in the three volumes of the Corpus of Indus Seals and Inscriptions (Joshi and Parpola 1987; Shah and Parpola 1991; Parpola et al. 2010). On these objects, the Indus people have expressed several aspects of their art, myths, perspective of nature, abstract geometrical and symmetrical patterns and, at times, even their daily life. In terms of art, aesthetic sense and expressions of symmetric, geometric as well as abstract patterns, these objects are unsurpassed in their quality (Yadav and Vahia 2011; Sinha et al. 2011; Vahia and Yadav 2010). One of the most creative aspects of their work on these inscribed objects is the Indus script. An understanding of the script would provide an unprecedented insight into the minds of the Indus people. The script therefore holds a vital clue to understanding the Indus culture.

Yadav and Vahia 2013: 210‐221


Figure 1: Some examples of Indus seals with the Indus script (Copyright: Harappa Archaeological Research Project/J. M. Kenoyer, Harappa.com, Courtesy: Dept. of Archaeology and Museums, Government of Pakistan)

Problem of Indus Script

The problem of the Indus script is made more challenging by the brevity of the Indus texts, the lack of definitive knowledge about the language(s) that the people spoke, and the absence of multilingual texts. These hurdles have, however, not prevented scholars from trying to understand the contents of the script. Possehl provides an excellent critical review of some of the various attempts to understand and interpret the Indus script (Possehl 1996). More recent reviews of some of these attempts are provided elsewhere (Parpola 2005; Mahadevan 2002). In spite of these efforts, the problem of the Indus script remains unresolved, with no universal consensus on any of the proposed interpretations.

Our Approach

Since scholars have not been able to agree on the contents of the Indus script, we explore the problem in a content‐insensitive manner. The study aims to assist any future attempts at decipherment by defining the patterns in the Indus writing. We employ various computational techniques to identify the characteristics of the Indus writing without making any assumptions about its content. Our study defines a syntactic framework of the Indus script that can be used to test any proposed interpretation (see for example Yadav et al. 2012).

We use Mahadevan’s concordance (Mahadevan 1977), henceforth referred to as M77, as the basic data set on which we apply various computational, mathematical and analytical tools to understand the syntax of the Indus script. M77 records 417 unique signs in 3573 lines of 2906 texts. From M77, we remove ambiguous texts and create a filtered dataset, EBUDS (for details see Yadav et al. 2008a). EBUDS records 1548 texts and is used in most of our analyses.

Investigating the Structure of Indus Script

The analyses performed on the Indus script dataset are summarized below.

Comparison with Randomized Dataset and Positional Analysis

We begin our analysis of the Indus script dataset by analyzing its sign frequency distribution. The frequency distribution of signs in the Indus script follows the Zipf‐Mandelbrot law, suggesting that a small number of signs account for most of the writing (Yadav et al. 2010). Similarly, the cumulative frequency distribution of text ender and text beginner signs reveals interesting information about the syntax of the Indus texts. While just 23 signs account for 80% of all text enders, about 82 signs account for 80% of all text beginners, suggesting an asymmetry in the usage of text beginner and text ender signs (Fig. 2).

ISSN 2347 – 5463 Heritage: Journal of Multidisciplinary Studies in Archaeology 1: 2013

Figure 2: Cumulative frequency plot for all signs, text‐beginners and text‐enders (Yadav et al. 2010)
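The 80% figures quoted for text beginners and text enders come from a cumulative frequency count. The following is only a minimal sketch of such a count, not the authors' code: the function name `signs_for_coverage`, the toy sign labels, and the convention that the first list element is the text beginner are all assumptions made for illustration.

```python
from collections import Counter


def signs_for_coverage(signs, coverage=0.8):
    """Return how many of the most frequent signs are needed to
    account for at least `coverage` of all occurrences in `signs`."""
    counts = Counter(signs)
    total = sum(counts.values())
    running = 0
    for n, (_sign, count) in enumerate(counts.most_common(), start=1):
        running += count
        if running >= coverage * total:
            return n
    return len(counts)  # only reached for an empty input


# Hypothetical toy corpus: each text is a list of made-up sign labels.
# Assumption: element 0 is the text beginner, the last element the text ender.
toy_texts = [
    ["A", "B", "Z"], ["C", "B", "Z"], ["D", "E", "Z"],
    ["A", "F", "Y"], ["G", "B", "Z"], ["H", "E", "Z"],
]
beginners = [text[0] for text in toy_texts]
enders = [text[-1] for text in toy_texts]

print("beginner signs needed:", signs_for_coverage(beginners))
print("ender signs needed:", signs_for_coverage(enders))
```

Running the same count separately on beginners and enders of a real corpus is what would expose an asymmetry of the kind reported above.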

In order to check if the sequencing of signs in the Indus texts is significant, we compare the Indus script dataset with a randomized dataset (Yadav et al. 2008a). Our study reveals that some specific combinations of 2, 3 and 4 signs appear with far higher frequency in the Indus script dataset than expected by chance. This suggests the presence of correlations between signs in the Indus texts. It also indicates that the length of the information unit in the Indus texts is 2, 3 or 4 signs (Yadav et al. 2008a). We then analyzed the distribution pattern of the sign combinations (pairs, triplets and quadruplets) in the Indus texts and found that they have preferred locations in the texts (Yadav et al. 2008a). The positional distribution of frequent sign pairs is shown in Table 1.
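A comparison of this kind can be sketched as follows. This is an illustrative sketch only: the randomization used here (pooling all signs and reshuffling them while keeping text lengths and overall sign frequencies) is an assumption, since the exact procedure is described in Yadav et al. 2008a rather than in this paper, and the helper names and toy texts are hypothetical.

```python
import random
from collections import Counter


def ngram_counts(texts, n):
    """Count every run of n consecutive signs across all texts."""
    counts = Counter()
    for text in texts:
        for i in range(len(text) - n + 1):
            counts[tuple(text[i:i + n])] += 1
    return counts


def shuffled_corpus(texts, rng):
    """Assumed randomization: shuffle all signs across the corpus,
    preserving each text's length and the overall sign frequencies."""
    pool = [sign for text in texts for sign in text]
    rng.shuffle(pool)
    shuffled, pos = [], 0
    for text in texts:
        shuffled.append(pool[pos:pos + len(text)])
        pos += len(text)
    return shuffled


def mean_random_count(texts, ngram, trials=200, seed=0):
    """Average count of `ngram` over many shuffled copies of the corpus."""
    rng = random.Random(seed)
    n = len(ngram)
    total = 0
    for _ in range(trials):
        total += ngram_counts(shuffled_corpus(texts, rng), n)[ngram]
    return total / trials


# Hypothetical toy corpus in which the pair (A, B) always occurs together.
toy_texts = [["A", "B", "C"], ["A", "B", "D"], ["E", "A", "B"], ["A", "B", "F"]]
observed = ngram_counts(toy_texts, 2)[("A", "B")]
expected = mean_random_count(toy_texts, ("A", "B"))
print("observed:", observed, "mean under shuffling:", expected)
```

A sign combination whose observed count is far above its average count under shuffling is a candidate for the kind of correlation described above.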

Table 1: Positional analysis of frequent sign pairs in EBUDS (Yadav et al. 2008a)

Segmentation of Indus Texts

In order to check if it is possible to segment longer Indus texts into smaller segments we perform segmentation analysis on the Indus script dataset (Yadav et al. 2008b). We find that about 88% of all the Indus texts of length 5 or more can be segmented into segments of length not exceeding 4. Some examples of segmented texts are given in Table 2.

Table 2: Examples of segmented Indus texts (Yadav et al. 2008b)

The four‐digit number in the first column is the text number from M77. The alphanumeric sequences above the segments are the markers used for identification of these segments. The four‐digit numbers below the segments are the text numbers of identical texts in M77.

n‐gram Studies of the Indus Script

In another study (Yadav et al. 2010) we develop a bigram model of the Indus script. In a bigram (first‐order Markov model), the range of correlation does not go beyond the nearest neighbour sign. Specifically, we calculate the probabilities of any sign preceding or succeeding any other sign. We use the bigram model of the Indus script for applications such as restoring signs in illegible Indus texts (Table 3), for generating sample Indus texts, and for comparing texts coming from the Indus sites and West Asian sites. The model can accurately guess the missing signs in three fourths of the cases (Yadav et al. 2010). We find that likelihood of many of the West Asian Indus texts (under a model trained on texts from the Indus sites) is very low suggesting that the script may have been used for writing West Asian content (Rao et al. 2009b).

Table 3: Restoration of signs missing from Indus texts (Yadav et al. 2010)

Comparison of Flexibility in Sign Usage with Different Systems

We compare the flexibility in the usage of signs in the Indus script with sequences from various linguistic and non‐linguistic systems viz. English, Sanskrit, Old Tamil, Sumerian, DNA, Protein, and Fortran (Rao et al. 2009a). We find that the conditional entropy (a measure of flexibility in the choice of a sign given a preceding sign) of the Indus script falls within the range of various linguistic systems included in the study (Fig. 3). Comparison of higher order block entropies for Indus script with various linguistic and non‐linguistic systems (for blocks of up to six signs) also provides similar results (Rao et al. 2010).

Comparison of Zipf’s exponent for n‐grams

Zipf’s law is a useful tool to compare the relative distribution of signs between different scripts. Zipf (Zipf 1935; Zipf 1949) explored the correlation between the frequency of words in a text and the frequency‐based rank of the words. If the most frequently used word is ranked 1, the second most frequently used word is ranked 2 and so on, then the Zipf’s law states that the frequency f is related to the rank r by the relation

$f = k / r^{a}$

where k and a are constants, with a (the Zipf’s exponent) close to 1. The relation arises due to a natural tendency to economize on the usage of words in human languages that optimizes information exchange with minimum effort as well as minimum ambiguity (Zipf 1949). While this law is empirical, it has been shown that random texts do not follow this relation (Ferrer‐i‐Cancho and Elvevåg 2010).

Study of the variation in the Zipf’s exponent for n‐grams for English and Mandarin (Ha et al. 2002) suggests that the value of the Zipf’s exponent (a) falls significantly for word pairs and triplets. In Table 4, we compute the Zipf’s exponent for n‐grams in EBUDS (for n varying from 1 to 5) and compare it with corresponding values for English, Mandarin and a randomized Indus dataset (R1) selected from an earlier analysis (Yadav et al. 2008a).

Figure 3: Comparison of Indus script data with various linguistic and non‐linguistic systems (after Rao et al. 2009a; Yadav 2012)
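The flexibility measure behind comparisons of this kind is a conditional entropy, that is, how uncertain the choice of a sign is once the preceding sign is known. The sketch below is only a plug‐in estimate from raw pair counts, under the assumption that texts are given as lists of sign labels. It makes no attempt at the smoothing or bias correction that a small corpus would need, so it should not be read as the estimator used in the cited studies; the function name and toy data are hypothetical.

```python
import math
from collections import Counter


def conditional_entropy(texts):
    """Plug-in estimate of H(next sign | preceding sign) in bits,
    computed from raw counts of adjacent sign pairs."""
    pair_counts = Counter()
    first_counts = Counter()
    for text in texts:
        for a, b in zip(text, text[1:]):
            pair_counts[(a, b)] += 1
            first_counts[a] += 1
    total_pairs = sum(pair_counts.values())
    entropy = 0.0
    for (a, _b), count in pair_counts.items():
        p_pair = count / total_pairs            # p(a, b)
        p_next_given_prev = count / first_counts[a]  # p(b | a)
        entropy -= p_pair * math.log2(p_next_given_prev)
    return entropy


# Hypothetical toy sequences: one rigid, one with a free choice after "A".
rigid = [["A", "B", "A", "B"]]
flexible = [["A", "B"], ["A", "C"]]
print(conditional_entropy(rigid), conditional_entropy(flexible))
```

A rigidly ordered system gives a value near zero, while a system in which any sign can follow any other gives a high value; natural languages are expected to lie in between.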

It can be seen from Table 4 that the absolute value of the Zipf’s exponent for sign pairs in EBUDS (0.73) is closer to that of Mandarin (0.75) than to that of English (0.66) and is significantly different from that of the randomized Indus dataset R1 (0.45).
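An exponent of this kind can be estimated by fitting a straight line to log frequency against log rank. The sketch below assumes the usual form of Zipf's law, with frequency proportional to rank raised to the power of minus a, and uses an ordinary least‐squares fit; the fitting procedure actually used for Table 4 is not stated here, so this choice, the function name, and the toy data are assumptions.

```python
import math
from collections import Counter


def zipf_exponent(items):
    """Estimate the Zipf exponent a (as a positive number) by a
    least-squares line through (log rank, log frequency)."""
    freqs = sorted(Counter(items).values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct items to fit a slope")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return -slope  # the fitted slope is negative for a decaying distribution


# Hypothetical toy data whose frequencies are exactly 12 / rank.
toy_items = ["a"] * 12 + ["b"] * 6 + ["c"] * 4 + ["d"] * 3
print(zipf_exponent(toy_items))
```

For n‐grams, the same function can be applied to a list of tuples of n consecutive signs instead of single signs. With sparse data the fit becomes unreliable, which is consistent with the poor fits flagged in Table 4.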

Site and Medium Sensitivity of Indus Script

In an additional study (Yadav 2013), we analyze the variation in the usage of signs in the Indus script across sites and types of objects. Some of the major conclusions from the study are:

Distribution of inscribed objects: Study of the distribution of the inscribed objects with respect to their site of occurrence and type suggests that Mohenjodaro accounts for the highest percentage of seals and Harappa accounts for the highest percentage of sealings.

Sensitivity of the Indus script to site and type of object: There are no significant variations in the usage of signs at different sites or on different types of objects. However, subtle preferences in the usage of signs on different types of objects and at different sites indicate the presence of some individualistic clues to their content.



Table 4: Zipf’s exponent for EBUDS n‐grams (for n = 1 to 5)

Sl. No.  n‐gram         English*  Mandarin*  Indus script (EBUDS)  Random (R1)
1        Sign (n=1)     1.00      ‐          1.49                  1.49
2        Pair (n=2)     0.66      0.75       0.73                  0.45
3        Triplet (n=3)  0.49      0.59       0.38                  @
4        Quad (n=4)     0.41      0.53       @                     @
5        Pent (n=5)     0.39      0.48       @                     @

* These numbers are from Ha et al. 2002; @ = Poor fit due to paucity of data

Clustering of sites and types of objects: Using the method of clustering we compare various sites and types of objects based on different criteria such as their usage of signs or distribution of text lengths. Some of the significant conclusions from this analysis are:

1. Mohenjodaro and Lothal share a high level of similarity in their pattern of text length distributions and usage of signs.

2. Harappa is distinct in its sign usage from all other sites.

3. The pattern of text length distribution and usage of signs in West Asian sites is distinct from all other sites.

4. With respect to the usage of signs, sealings and miniature tablets are closest to each other.

5. In usage of signs, seals share a high level of similarity with pottery graffiti.
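Comparisons such as those listed above rest on some measure of how similar two sites, or two types of objects, are in their usage of signs. The sketch below shows one possible measure, the cosine similarity between sign‐frequency vectors. The clustering method and distance used in Yadav 2013 are not described in this paper, so the measure, the function names, and the toy data (with placeholder site names) are all assumptions made for illustration.

```python
import math
from collections import Counter


def usage_vector(texts):
    """Sign-frequency vector for a group of texts (a site or object type)."""
    return Counter(sign for text in texts for sign in text)


def cosine_similarity(u, v):
    """Cosine of the angle between two sign-frequency vectors:
    1 for identical usage patterns, 0 when no sign is shared."""
    dot = sum(u[sign] * v[sign] for sign in u.keys() & v.keys())
    norm_u = math.sqrt(sum(count * count for count in u.values()))
    norm_v = math.sqrt(sum(count * count for count in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy data; "site_1" etc. are placeholders, not real sites.
groups = {
    "site_1": [["A", "B", "Z"], ["A", "C", "Z"]],
    "site_2": [["A", "B", "Z"], ["D", "B", "Z"]],
    "site_3": [["P", "Q"], ["Q", "R"]],
}
vectors = {name: usage_vector(texts) for name, texts in groups.items()}
for a in groups:
    for b in groups:
        if a < b:
            print(a, b, round(cosine_similarity(vectors[a], vectors[b]), 3))
```

A pairwise similarity table of this kind is the usual input to a clustering step, which then groups the sites or object types whose usage patterns are closest.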

Study of Design of Indus Signs

In order to understand the general makeup and mechanics of the design of Indus signs we analyze the structural design of individual signs of the Indus script (Yadav and Vahia 2011). Our study is based on the design of the signs in the sign list given in Mahadevan (1977), which consists of 417 distinct signs. We analyze the design and structure of all signs in the sign list of the Indus script and visually identify three types of design elements. The design elements include basic signs (154 in number), provisional basic signs (10 in number) and modifiers (21 in number). These elements combine in a variety of ways to generate the entire set of Indus signs. The signs are classified into two major categories: Basic signs (154 in number) and Composite signs (263 in number). Composite signs can be further classified into compound signs (signs that are conglomerations of two or more basic signs) and modified signs (signs that are modified using modifiers). By comparing the environment of the compound signs with all possible sequences of their constituent basic signs, we find that sign compounding (ligaturing) and sign modification seem to change the meaning or add value to basic signs rather than save writing space.

Conclusions

The Indus texts are linearly written and there is clear evidence of directionality in the Indus script (Parpola 1994; Mahadevan 1977). In our study of the Indus script, we have employed a series of computational methods and statistical tests on the dataset of the Indus script. Here we highlight some important conclusions based on our study.

1. Indus writing is highly structured in the sense that sequencing of signs has definite rules.

2. The sign frequency distribution of the Indus script follows the Zipf‐Mandelbrot law, an empirical law generally followed by several ordered systems.

3. There is an asymmetry in the usage of text beginners and text enders, with very few signs constituting most of the text enders while a relatively large number of signs occur as text beginners.

4. It is possible to identify pairs of signs that appear together in the longer Indus texts but in general do not have affinity to each other. Using this insight, it is possible to revisit the entire corpus and show that most of the longer Indus texts can be segmented into smaller units.

5. A bigram model of the Indus script based on nearest neighbour associations can successfully predict signs in an Indus text with 75% accuracy. It can also generate sample Indus texts.

6. The Indus script seems to be versatile enough to permit writing of different content as can be seen from the texts on the Indus seals found at West Asian sites.

7. Studies of the flexibility in sign usage suggest that the Indus writing is as flexible as one would expect for natural linguistic systems and is much more flexible than artificial linguistic systems (computer languages). However, it is less flexible than systems in which abstractions are conveyed (music) or in which biological information is coded (DNA or Protein).

8. While there is a common thread of rules and grammatical structures in the Indus writing, variation in sign usage across sites and types of objects suggests that writing on different types of objects and at different sites does have individualistic clues to its content.



9. Based on the design, the Indus signs can be classified into two major categories: Basic signs and Composite signs. Composite signs can be further classified into compound signs and modified signs.

10. Study of the design of Indus signs suggests that the compound signs are not simply a ‘short handing’ or a space saving device since the environment in which the compound signs appear in the Indus texts is completely different from that of its constituents in any combination.

Any proposed interpretation of the Indus script should be able to explain these characteristics.
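Conclusion 5 refers to a bigram model that predicts signs from their nearest neighbours. As a rough illustration of the idea, the sketch below restores a single missing sign by choosing the candidate that best fits both its preceding and its succeeding neighbour. It uses raw counts without any smoothing, and the boundary markers, function names and toy texts are assumptions; it is not the model of Yadav et al. 2010, and the 75% accuracy quoted above does not apply to it.

```python
from collections import Counter, defaultdict

START, END = "<s>", "</s>"  # assumed text-boundary markers


def train_bigrams(texts):
    """Count, for every sign, which signs follow it."""
    follow = defaultdict(Counter)
    for text in texts:
        padded = [START] + list(text) + [END]
        for a, b in zip(padded, padded[1:]):
            follow[a][b] += 1
    return follow


def restore(text, follow, missing=None):
    """Guess the one sign marked `missing`, maximizing
    P(candidate | previous sign) * P(next sign | candidate)."""
    i = text.index(missing)
    padded = [START] + list(text) + [END]
    prev_sign, next_sign = padded[i], padded[i + 2]
    best, best_score = None, 0.0
    for cand, count in follow[prev_sign].items():
        if cand == END:
            continue
        # P(cand | prev) is count / total(prev); total(prev) is the same
        # for every candidate, so it is left out of the score.
        score = count * follow[cand][next_sign] / sum(follow[cand].values())
        if score > best_score:
            best, best_score = cand, score
    return best  # None when no candidate fits both neighbours


# Hypothetical toy corpus of made-up sign labels.
toy_texts = [["A", "B", "C"], ["A", "B", "C"], ["A", "D", "C"]]
model = train_bigrams(toy_texts)
print(restore(["A", None, "C"], model))
```

The same table of counts can also be used to generate sample texts, by starting at the boundary marker and repeatedly drawing the next sign in proportion to its count.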

Acknowledgements

We are grateful to Prof. Jonathan Mark Kenoyer and Dr. Omar Khan, associated with Harappa.com, for their kind permission to use the images of Indus seals given in Figure 1 and for making our work available to a wider community. We wish to thank Dr. Iravatham Mahadevan, whose continuous encouragement and discussions played an important role in defining the problem.

Our papers listed below are also available at www.tifr.res.in/~archaeo.

References

Agrawal, D. P. 2007. The Indus Civilization: An Interdisciplinary Perspective. New Delhi: Aryan Books International.

Ferrer‐i‐Cancho, R. and B. Elvevåg. 2010. Random Texts Do Not Exhibit the Real Zipf’s Law‐Like Rank Distribution. PLoS One vol 5.

Ha, L. Q., E. I. Sicilia‐Garcia, J. Ming and F. J. Smith. 2002. Extension of Zipf’s Law to Words and Phrases. Proceedings of the 19th International Conference on Computational Linguistics (COLING 2002): 315‐320.

Joshi, J. P. and A. Parpola. 1987. Corpus of Indus Seals and Inscriptions, 1. Collections in India. Helsinki: Suomalainen Tiedeakatemia and New Delhi: Memoirs of the Archaeological Survey of India No. 86.

Kenoyer, J. M. 1998. Ancient Cities of the Indus Valley Civilization. Oxford: Oxford University Press.

Mahadevan, I. 1977. The Indus Script: Texts, Concordance and Tables. Memoirs of the Archaeological Survey of India No. 77. New Delhi: Archaeological Survey of India.

Mahadevan, I. 2002. Aryan or Dravidian or Neither? A Study of Recent Attempts to Decipher the Indus Script (1995‐2000). Electronic Journal of Vedic Studies vol 8 (1).

Parpola, A. 1994. Deciphering the Indus Script. Cambridge: Cambridge University Press.

Parpola, A. 2005. Study of the Indus Script. Proceedings of the International Conference of Eastern Studies. Tokyo: The Tôhô Gakkai.



Parpola, A., B. M. Pande and P. Koskikallio (eds.). 2010. New Material, Untraced Objects, and Collections Outside India and Pakistan Part 1: Mohenjodaro and Harappa. Helsinki: Suomalainen Tiedeakatemia.

Possehl, G. L. 1996. Indus Age: The Writing System. New Delhi: Oxford and IBH Publishing Co. Pvt. Ltd.

Possehl, G. L. 2002. The Indus Civilization: A Contemporary Perspective. New Delhi: Vistaar Publications.

Rao, R. P. N., N. Yadav, M. N. Vahia, H. Joglekar, R. Adhikari and I. Mahadevan. 2009a. Entropic Evidence for Linguistic Structure in the Indus Script. Science vol 324: 1165.

Rao, R. P. N., N. Yadav, M. N. Vahia, H. Joglekar, R. Adhikari and I. Mahadevan. 2009b. A Markov Model of the Indus Script. Proceedings of the National Academy of Sciences vol 106 (33): 13685‐13690.

Rao, R. P. N., N. Yadav, M. N. Vahia, H. Joglekar, R. Adhikari and I. Mahadevan. 2010. Entropy, the Indus Script and Language: A Reply to R. Sproat. Computational Linguistics vol 36 (4): 795‐805.

Shah, S. G. M. and A. Parpola. 1991. Corpus of Indus Seals and Inscriptions, 2. Collections in Pakistan. Helsinki: Suomalainen Tiedeakatemia and Memoirs of the Archaeology and Museums.

Sinha, S., N. Yadav and M. N. Vahia. 2011. In Square Circle: Geometric Knowledge of the Indus Civilization. R. Sujatha, H.N. Ramaswamy and C.S. Yogananda (eds.). Math Unlimited: Essays in Mathematics. Enfield: Science Publishers.

Vahia, M. N. and N. Yadav. 2010. Harappan Geometry and Symmetry: A Study of Geometrical Patterns on Indus Objects. Indian Journal of History of Science vol 45 (3): 343‐368.

Wright, R. P. 2010. The Ancient Indus – Urbanism, Economy and Society. New York: Cambridge University Press.

Yadav, N. 2012. Statistical Studies of the Indus Script. Man and Environment vol XXXVII (1): 1‐7.

Yadav, N. 2013. Sensitivity of Indus Script to Site and Type of Object. Scripta vol 5: 67‐103.

Yadav, N. and M. N. Vahia. 2011. Classification of Patterns on Indus Objects. International Journal of Dravidian Linguistics vol 40 (2): 89‐114.

Yadav, N. and M. N. Vahia. 2011. Indus Script: A Study of its Sign Design. Scripta vol 3: 133‐172.

Yadav, N., H. Joglekar, R. P. N. Rao, M. N. Vahia, R. Adhikari and I. Mahadevan. 2010. Statistical Analysis of the Indus Script using n‐grams. PLoS One vol 5 (3).

Yadav, N., M. N. Vahia, I. Mahadevan and H. Joglekar. 2008a. A Statistical Approach for Pattern Search in Indus Writing. International Journal of Dravidian Linguistics vol XXXVII (1): 39‐52.



Yadav, N., M. N. Vahia, I. Mahadevan and H. Joglekar. 2008b. Segmentation of Indus Texts. International Journal of Dravidian Linguistics vol XXXVII (1): 53‐72.

Zipf, G. K. 1935. The Psycho‐biology of Language: An Introduction to Dynamic Philology. Boston: Houghton Mifflin.

Zipf, G. K. 1949. Human Behaviour and the Principle of Least Effort. An Introduction to Human Ecology. Cambridge: Addison‐Wesley.
