Title: Investigating the Ancient Meroitic Language Using Statistical Natural Language Techniques: Zipf
1. Investigating the Ancient Meroitic Language Using Statistical Natural Language Techniques: Zipf's Law and Word Co-Occurrences
- Reginald Smith
- August 10, 2006
- Sudan Studies Association Conference
- Rhode Island College
2. Meroitic is the language of the ancient kingdom of Kush
- Used for almost six hundred years, from the 2nd century BCE to the 4th century CE
- Phonetic language written right to left (like Arabic)
- Transliteration made possible by the work of British archaeologist F. Ll. Griffith around 1910
3. Meroitic remains largely undeciphered and an enigma
- No complete vocabulary is available
- Some words, such as place names, loan words, or simple concepts, are known
  - For example, qore means "king"
  - Perhaps qes is "Kush"
- Many attempts have been made to understand Meroitic using phonology or comparative linguistics
- Scholars have tried in vain to find a known language that is a relative (see sources in paper)
- We wish we had a bilingual text like the Rosetta Stone to guide us
4. A new method could use mathematics and linguistics
- Statistical natural language processing analyzes the properties of language using a mix of statistics and linguistics
- Several properties of language are the same in all human languages
- Certain techniques may also help us infer the meanings of words (by relating them to other known words)
5. Zipf's Law: Frequencies of Words
- If you rank the words in a text by how frequent they are (number of times a word appears, with rank 1 being the most frequent) and then relate each rank to the frequency of the word, you get Zipf's Law
- Zipf's Law: $F = C / R^{a}$, where F is the frequency of a word, C is a constant, R is the rank, and a is known as the power law exponent
- For all languages, a ≈ 1
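The ranking step described on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `rank_frequency` and the toy English sentence are assumptions made for the example, not the talk's data or code.

```python
from collections import Counter

def rank_frequency(words):
    """Return (rank, word, frequency) tuples, with rank 1 the most frequent word."""
    counts = Counter(words)
    # most_common() orders by descending count; ties keep first-seen order
    ranked = counts.most_common()
    return [(rank, word, freq) for rank, (word, freq) in enumerate(ranked, start=1)]

# Toy English sample, purely illustrative (not Meroitic data)
sample = "the king of the land and the queen of the river".split()
table = rank_frequency(sample)
for rank, word, freq in table:
    print(rank, word, freq)
```

Each row pairs a rank R with a frequency F, which is the input Zipf's Law relates.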
6. Zipf's Law Graphs
- When you graph the frequency vs. the rank on a log-log graph (graphing the logarithm of frequency vs. the logarithm of rank), you get a straight line whose slope is -a
[Figure: Zipf line fit on data. The red line is the fitted slope on the data points. Picture source: University of Helsinki CS department]
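Taking logarithms of F = C / R^a gives log F = log C - a log R, so the exponent can be read off as the slope of a straight-line fit on the log-log data. The sketch below uses an ordinary least-squares fit. The slides do not say which fitting method was used, so this choice, the function name `zipf_slope`, and the synthetic input are all assumptions.

```python
import math

def zipf_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank).

    Expects at least two positive frequencies.
    """
    freqs = sorted(frequencies, reverse=True)      # rank 1 = most frequent
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Synthetic frequencies that follow F = C / R exactly (C = 1000, a = 1);
# these are not measurements from any text
ideal = [1000 / rank for rank in range(1, 51)]
slope = zipf_slope(ideal)
print(round(slope, 3))
```

For this idealized input the fitted slope comes out at about -1, that is, a is about 1.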
7. Does Meroitic follow Zipf's Law?
- The two graphs below show log-log plots of frequency vs. rank for the Meroitic words in 69 texts. The slope is shown for each
- The "normal" plot counts the words as is. The "morpheme out" plot splits out suffixes like lowi as the separate words lo and wi
- Since it has a slope of nearly -1, the morpheme out model of Meroitic seems to follow Zipf's Law
[Figure: Normal plot, slope -0.81]
[Figure: Morpheme out plot, slope -1.03]
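The "morpheme out" preprocessing might look something like the sketch below. Only the lowi to lo + wi split comes from the slide. The function `split_morphemes`, the `SUFFIXES` table, and the placeholder token `abclowi` are hypothetical and stand in for whatever suffix inventory was actually used.

```python
def split_morphemes(word, suffixes):
    """Split a known compound ending off a word into separate tokens.

    `suffixes` maps a compound ending to the parts it is split into.
    Words without a listed ending are returned unchanged.
    """
    for ending, parts in suffixes.items():
        if word.endswith(ending):
            stem = word[: len(word) - len(ending)]
            # Keep the stem only if something is left once the ending is removed
            return ([stem] if stem else []) + list(parts)
    return [word]

# Only this one entry is taken from the slides; a real table would be longer
SUFFIXES = {"lowi": ("lo", "wi")}

print(split_morphemes("lowi", SUFFIXES))      # the suffix on its own
print(split_morphemes("qore", SUFFIXES))      # no listed ending, left as is
print(split_morphemes("abclowi", SUFFIXES))   # "abc" is a placeholder stem, not a Meroitic word
```

The resulting tokens would then be counted and ranked in place of the original words.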
8. So what does this show us (besides graphs)?
- Despite the apparently small number of texts available, our sample of Meroitic is structured just like other human languages (English, Chinese, etc.)
- Therefore, even though we don't know the meaning of the words, we know that the language we have is representative
  - This holds even though most of our samples are redundant funeral stelae
- We can then proceed to use other statistical techniques on Meroitic and also compare its statistical features to other languages
9. Step Two: Word Co-occurrence
- When words occur together in a text, they are said to co-occur
  - "I am here" has co-occurrences between I-am and am-here
- Co-occurrences can tell us about the words if we have enough of them
- Words that co-occur with the same words often have similar parts of speech or even meanings
- Can we use word co-occurrence in Meroitic to analyze classes of words?
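As a minimal sketch, adjacent-word co-occurrences like those in the "I am here" example can be extracted by pairing each word with the one that follows it. The function name `cooccurrences` is made up for this illustration, and treating co-occurrence as strict adjacency is an assumption based on that example.

```python
def cooccurrences(sentence):
    """Return the adjacent word pairs (left, right) in a sentence."""
    words = sentence.split()
    # zip pairs each word with its successor; the last word has none
    return list(zip(words, words[1:]))

pairs = cooccurrences("I am here")
print(pairs)
```

This yields the two pairs named on the slide, I-am and am-here.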
10. What I did with Meroitic
- I analyzed Meroitic by matching together words that co-occurred with the same types of words
- For example, if you have two sentences, "I eat horses" and "We eat lizards":
  - I match "I" and "We" because they both co-occur with "eat"
  - I also match "horses" and "lizards" because they also co-occur with "eat" (in the opposite direction)
- I then graph connected words together and analyze them with software
- What happens?
Technical note: I actually used undirected edges for co-occurring words in the graph shown on the next page
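The matching idea can be sketched on the two example sentences as follows. This is an illustration under stated assumptions, not the software used in the talk: the function `match_by_shared_neighbor` is hypothetical, it only looks at immediately adjacent words, and it keeps the two directions separate as in the slide's example rather than using the undirected edges mentioned in the technical note.

```python
from collections import defaultdict
from itertools import combinations

def match_by_shared_neighbor(sentences):
    """Pair up words that share a neighbor on the same side."""
    followed_by = defaultdict(set)   # word -> words seen right after it
    preceded_by = defaultdict(set)   # word -> words seen right before it
    for sentence in sentences:
        words = sentence.split()
        for left, right in zip(words, words[1:]):
            followed_by[left].add(right)
            preceded_by[right].add(left)

    matches = set()
    for contexts in (followed_by, preceded_by):
        # Compare every pair of words that has a neighbor in this direction
        for a, b in combinations(sorted(contexts), 2):
            if contexts[a] & contexts[b]:    # at least one shared neighbor
                matches.add((a, b))
    return matches

matches = match_by_shared_neighbor(["I eat horses", "We eat lizards"])
print(sorted(matches))
```

Here "I" and "We" are matched because both are followed by "eat", and "horses" and "lizards" are matched because both are preceded by it. The matched pairs could then serve as edges of a graph for further analysis.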
11. Meroitic Words Graph
- Four main groups of words form that correspond well to Meroitic categories, including positions and titles, verbs, places, and miscellaneous nouns
[Figure: graph of co-occurring Meroitic words, with clusters labeled Group 1, Group 2, Group 3, and Group 4]
12. Results
- Techniques like word co-occurrence matching can help us categorize Meroitic words whose categories we previously had to guess, by mapping them against words whose part of speech we already know
- Similar statistical techniques may allow us to match words with a similar meaning to infer the meanings of some words
  - This is still speculative, though
13. Conclusion
- Statistical natural language processing is a new approach to Meroitic that could supplement other current efforts in the language
- Much more work remains to be done, but this new avenue may help us move closer to the goal of understanding this beautiful and mysterious language
- Acknowledgements: I give my boundless appreciation to Dr. Richard Lobban and Dr. Laurance Doyle for the help and advice they gave me on this paper's topics