Title: Computer Systems Lab TJHSST Current Projects 2004-2005 First Period
1 Computer Systems Lab TJHSST Current Projects 2004-2005, First Period
2 Current Projects, 1st Period
- Caroline Bauer: Archival of Articles via RSS and Datamining Performed on Stored Articles
- Susan Ditmore: Construction and Application of a Pentium II Beowulf Cluster
- Michael Druker: Universal Problem Solving Contest Grader
3 Current Projects, 1st Period
- Matt Fifer: The Study of Microevolution Using Agent-based Modeling
- Jason Ji: Natural Language Processing Using Machine Translation in Creation of a German-English Translator
- Anthony Kim: A Study of Balanced Search Trees: Brainforming a New Balanced Search Tree
- John Livingston: Kernel Debugging User-Space API Library (KDUAL)
4 Current Projects, 1st Period
- Jack McKay: Saber-what? An Analysis of the Use of Sabermetric Statistics in Baseball
- Peden Nichols: An Investigation into Implementations of DNA Sequence Pattern Matching Algorithms
- Robert Staubs: Part-of-Speech Tagging with Limited Training Corpora
- Alex Volkovitsky: Benchmarking of Cryptographic Algorithms
5 Archival of Articles via RSS and Datamining Performed on Stored Articles
RSS (Really Simple Syndication, encompassing Rich Site Summary and RDF Site Summary) is a web syndication protocol used by many blogs and news websites to distribute information; it saves people from having to visit several sites repeatedly to check for new content. At this point in time there are many RSS newsfeed aggregators available to the public, but none of them perform any sort of archival of information beyond the RSS metadata. The purpose of this project is to create an RSS aggregator that will archive the text of the actual articles linked to in the RSS feeds in a linkable, searchable database and, if all goes well, implement some sort of datamining capability as well.
6 Archival of Articles via RSS, and Datamining Performed on Stored Articles - Caroline Bauer
- Abstract
- RSS (Really Simple Syndication, encompassing Rich Site Summary and RDF Site Summary) is a web syndication protocol used by many blogs and news websites to distribute information; it saves people from having to visit several sites repeatedly to check for new content. At this point in time there are many RSS newsfeed aggregators available to the public, but none of them perform any sort of archival of information beyond the RSS metadata. As the articles linked may move or be eliminated at some time in the future, if one wants to be sure one can access them later, one has to archive them oneself; furthermore, should one want to link such collected articles, it is far easier to do if one has them archived. The purpose of this project is to create an RSS aggregator that will archive the text of the actual articles linked to in the RSS feeds in a linkable, searchable database and, if all goes well, implement some sort of datamining capability as well.
7 Archival of Articles via RSS, and Datamining Performed on Stored Articles - Caroline Bauer
- Introduction
- This paper is intended to be a detailed summary of all of the author's findings regarding the archival of articles in a linkable, searchable database via RSS.
- Background: RSS
- RSS stands for Really Simple Syndication, a syndication protocol often used by weblogs and news sites. Technically, RSS is an XML-based communication standard that encompasses Rich Site Summary (RSS 0.9x and RSS 2.0) and RDF Site Summary (RSS 0.9 and 1.0). It enables people to gather new information by using an RSS aggregator (or "feed reader") to poll RSS-enabled sites for new information, so the user does not have to check each site manually. RSS aggregators are often extensions of browsers or email programs, or standalone programs; alternately, they can be web-based, so the user can view their "feeds" from any computer with Web access.
- Archival Options Available in Existing RSS Aggregators
- Data Mining
- Data mining is the searching out of information based on patterns present in large amounts of data. //more will be here.
8 Archival of Articles via RSS, and Datamining Performed on Stored Articles - Caroline Bauer
- Purpose
- The purpose of this project is to create an RSS aggregator that, in addition to serving as a feed reader, obtains the text of the documents linked in the RSS feeds and places it into a database that is both searchable and linkable. In addition to this, the database is intended to reach an implementation wherein it performs some manner of data mining on the information contained therein; the specifics on this have yet to be determined.
- Development
- Results
- Conclusions
- Summary
- References
- 1. "RSS (protocol)." Wikipedia. 8 Jan. 2005. 11 Jan. 2005 <http://en.wikipedia.org/wiki/RSS_%28protocol%29>. 2. "Data mining." Wikipedia. 7 Jan. 2005. 12 Jan. 2005 <http://en.wikipedia.org/wiki/Data_mining>.
9 Construction and Application of a Pentium II Beowulf Cluster
I plan to construct a supercomputing cluster of about 15-20 or more Pentium II computers with the OpenMosix kernel patch. Once constructed, the cluster could be configured to transparently aid workstations with computationally expensive jobs run in the lab. This project would not only increase the computing power of the lab, but it would also be an experiment in building a low-level, low-cost cluster with a stripped-down version of Linux, useful to any facility with old computers it would otherwise deem outdated.
10 Construction and Application of a Pentium II Beowulf Cluster - Susan Ditmore
- Text version needed
- (your pdf file won't copy to text)
11 Universal Problem Solving Contest Grader - Michael Druker
- (poster needed)
12 Universal Problem Solving Contest Grader - Michael Druker
- Steps so far
- Creation of directory structure for the grader, the contests, the users, the users' submissions, the test cases.
- Starting of main grading script itself.
- Refinement of directory structure for the grader.
- Reading of material on the bash scripting language to be able to write the various scripts that will be necessary.
13 Universal Problem Solving Contest Grader - Michael Druker
- Current program
#!/bin/bash
CONDIR="/afs/csl.tjhsst.edu/user/mdruker/techlab/code/new/"
# syntax is "grade contest user program"
contest=$1
user=$2
program=$3
echo "contest name is " $1
echo "user's name is " $2
echo "program name is " $3
14 Universal Problem Solving Contest Grader - Michael Druker
- Current program, continued
# get the location of the program and the test data
# make sure that the contest, user, program are valid
PROGDIR=$CONDIR"contests/"$contest"/users/"$user
echo "user's directory is" $PROGDIR
if [ -d $PROGDIR ]
then echo "good input"
else echo "bad input, directory doesn't exist"
exit 1
fi
exit 0
15 Study of Microevolution Using Agent-Based Modeling in C++
The goal of the project is to create a program that uses an agent-environment structure to imitate a very simple natural ecosystem: one that includes a single type of species that can move, reproduce, kill, etc. The "organisms" will contain genomes (libraries of genetic data) that can be passed from parents to offspring in a way similar to that of animal reproduction in nature. As the agents interact with each other, the ones with the characteristics most favorable to survival in the artificial ecosystem will produce more children, and over time, the mean characteristics of the system should start to gravitate towards the traits that would be most beneficial. This process, the optimization of the physical traits of a single species through passing on heritable advantageous genes, is known as microevolution.
16 THE STUDY OF MICROEVOLUTION USING AGENT-BASED MODELING - Matt Fifer
- Abstract
- The goal of the project is to create a program that uses an agent-environment structure to imitate a very simple natural ecosystem: one that includes a single type of species that can move, reproduce, kill, etc. The "organisms" will contain genomes (libraries of genetic data) that can be passed from parents to offspring in a way similar to that of animal reproduction in nature. As the agents interact with each other, the ones with the characteristics most favorable to survival in the artificial ecosystem will produce more children, and over time, the mean characteristics of the system should start to gravitate towards the traits that would be most beneficial. This process, the optimization of the physical traits of a single species through passing on heritable advantageous genes, is known as microevolution.
17 THE STUDY OF MICROEVOLUTION USING AGENT-BASED MODELING - Matt Fifer
- Purpose
- One of the most controversial topics in science today is the debate of creationism vs. Darwinism. Advocates for creationism believe that the world was created according to the description detailed in the 1st chapter of the book of Genesis in the Bible. The Earth is approximately 6,000 years old, and it was created by God, followed by the creation of animals and finally the creation of humans, Adam and Eve. Darwin and his followers believe that from the moment the universe was created, all the objects in that universe have been in competition. Everything - from the organisms that make up the global population, to the cells that make up those organisms, to the molecules that make up those cells - has beaten all of its competitors in the struggle for resources commonly known as life.
18 THE STUDY OF MICROEVOLUTION USING AGENT-BASED MODELING - Matt Fifer
- This project will attempt to model the day-to-day war between organisms of the same species. Organisms, or agents, that can move, kill, and reproduce will be created and placed in an ecosystem. Each agent will include a genome that codes for its various characteristics. Organisms that are more successful at surviving or more successful at reproducing will pass their genes to their children, making future generations better suited to the environment. The competition will continue, generation after generation, until the simulation terminates. If evolution has occurred, the characteristics of the population at the end of the simulation should be markedly different from those at the beginning.
19 THE STUDY OF MICROEVOLUTION USING AGENT-BASED MODELING - Matt Fifer
- Background
- Two of the main goals of this project are the study of microevolution and the effects of biological mechanisms on this process. Meiosis, the formation of gametes, controls how genes are passed from parents to their offspring. In the first stage of meiosis, prophase I, the strands of DNA floating around the nucleus of the cell are wrapped around histone proteins to form chromosomes. Chromosomes are easier to work with than the strands of chromatin, as they are packaged tightly into an "X" structure (two ">"s connected at the centromere). In the second phase, metaphase I, chromosomes pair up along the equator of the cell, with homologous chromosomes being directly across from each other. (Homologous chromosomes code for the same traits, but come from different parents, and thus code for different versions of the same trait.) The pairs of chromosomes, called tetrads, are connected and exchange genetic material.
20 THE STUDY OF MICROEVOLUTION USING AGENT-BASED MODELING - Matt Fifer
- This process, called crossing over, results in both of the chromosomes being a combination of genes from the mother and the father. Whole genes swap places, not individual nucleotides. In the third phase, anaphase I, fibers from within the cell pull the pair apart. When the pairs are pulled apart, the two chromosomes are put on either side of the cell. Each pair is split randomly, so for each pair, there are two possible outcomes. For instance, the paternal chromosome can either move to the left or right side of the cell, with the maternal chromosome moving to the opposite end. In telophase I, the two sides of the cell split into two individual cells. Thus, for each cell undergoing meiosis, there are 2^n possible gametes. With crossing over, there is an almost infinite number of combinations of genes in the gametes. This large number of combinations is the reason for the genetic biodiversity that exists in the world today, even within species. For example, there are 6 billion humans on the planet, and none of them is exactly the same as another.
21 THE STUDY OF MICROEVOLUTION USING AGENT-BASED MODELING - Matt Fifer
- Procedure
- This project will be implemented with a matrix of agents. The matrix, initialized with only empty spaces, will be seeded with organisms by an Ecosystem class. Each agent in the matrix will have a genome, which will determine how it interacts with the Ecosystem. During every step of the simulation, an organism will have a choice whether to 1. do nothing, 2. move to an empty adjacent space, 3. kill an organism in a surrounding space, or 4. reproduce with an organism in an adjacent space. The likelihood of the organism performing any of these tasks is determined by the organism's personal variables, which are coded for by the organism's genome. While the simulation is running, the average characteristics of the population will be measured. In theory, the mean value of each of the traits (speed, agility, strength, etc.) should either increase with time or gravitate towards a particular, optimum value.
22 THE STUDY OF MICROEVOLUTION USING AGENT-BASED MODELING - Matt Fifer
- At its most basic level, the program written to model microevolution is an agent-environment program. The agents, or members of the Organism class, contain a genome and have abilities that are dependent upon the genome. Here is the declaration of the Organism class:
class Organism
{
public:
    Organism();                               // constructors
    Organism(int ident, int row2, int col2);
    Organism(Nucleotide* mDNA, Nucleotide* dDNA, int ident,
             bool malefemale, int row2, int col2);
    ~Organism();                              // destructor
    void printGenome();
    void meiosis(Nucleotide* gamete);
    Organism reproduce(Organism mate, int ident, int r, int c);
    int Interact(Organism* neighbors, int nlen);
    ...
23 THE STUDY OF MICROEVOLUTION USING AGENT-BASED MODELING - Matt Fifer
    int Laziness();     // accessor functions; each assigns a gene a numeric value
    int Rage(); int SexDrive(); int Activity(); int DeathRate();
    int ClausIndex(); int Age(); int Speed(); int Row();
    int Col(); int PIN(); bool Interacted(); bool Gender();
    void setPos(int row2, int col2);
    void setInteracted(bool interacted);
private:
    void randSpawn(Nucleotide* DNA, int size); // randomly generates a genome
    Nucleotide *mom, *dad;                     // genome
    int ID, row, col, laziness, rage, sexdrive, activity,
        deathrate, clausindex, speed;          // personal characteristics
    double age;
    bool male, doneStuff;
    ...
24 THE STUDY OF MICROEVOLUTION USING AGENT-BASED MODELING - Matt Fifer
- The agents are managed by the environment class, known as Ecosystem. The Ecosystem contains a matrix of Organisms.
- Here is the declaration of the Ecosystem class:
class Ecosystem
{
public:
    Ecosystem();                 // constructors
    Ecosystem(double oseed);
    ~Ecosystem();                // destructor
    void Run(int steps);         // the simulation
    void printMap();
    void print(int r, int c);
    void surrSpaces(Organism* neighbors, int r, int c, int friends); // the neighbors of any cell
private:
    Organism** Population;       // the matrix of Organisms
};
25 THE STUDY OF MICROEVOLUTION USING AGENT-BASED MODELING - Matt Fifer
- The simulation runs for a predetermined number of steps within the Ecosystem class. During every step of the simulation, the environment class cycles through the matrix of agents, telling each one to interact with its neighbors. To aid in the interaction, the environment sends the agent an array of the neighbors that it can affect. Once the agent has changed (or not changed) the array of neighbors, it sends the array back to the environment, which then updates the matrix of agents. Here is the code for the Organism's function which enables it to interact with its neighbors:
26 THE STUDY OF MICROEVOLUTION USING AGENT-BASED MODELING - Matt Fifer
int Organism::Interact(Organism* neighbors, int nlen)
{   // returns 0 if the organism hasn't moved, 1 if it has
    fout << row << " " << col << " ";
    if (!ID)         // this Organism is not an organism
    {
        fout << "Not an organism, cannot interact!" << endl;
        return 0;
    }
    if (doneStuff)   // this Organism has already interacted once this step
    {
        fout << "This organism has already interacted!" << endl;
        return 0;
    }
    doneStuff = true;
    int loop;
    for (loop = 0; loop < GENES * CHROMOSOMES * GENE_LENGTH; loop++)
    {
        if (rand() % RATE_MAX < MUTATION_RATE)
            mom[loop] = (Nucleotide)(rand() % 4);
        if (rand() % RATE_MAX < MUTATION_RATE)
            ...
27 THE STUDY OF MICROEVOLUTION USING AGENT-BASED MODELING - Matt Fifer
- The Organisms, during any simulation step, can either move, kill a neighbor, remain idle, reproduce, or die. The fourth option, reproduction, is the most relevant to the project. As explained before, organisms that are better at reproducing or surviving will pass their genes to future generations. The most critical function in reproduction is the meiosis function, which determines what traits are passed down to offspring. The process is completely random, but an organism with a "good" gene has about a 50% chance of passing that gene on to its child. Here is the meiosis function, which determines what genes each organism sends to its offspring:
void Organism::meiosis(Nucleotide* gamete)
{
    int x, genect, chromct, crossover;
    Nucleotide* chromo  = new Nucleotide[GENES * GENE_LENGTH],
              * chromo2 = new Nucleotide[GENES * GENE_LENGTH];
    Nucleotide* gene  = new Nucleotide[GENE_LENGTH],
              * gene2 = new Nucleotide[GENE_LENGTH];
    ... (more code)
28 THE STUDY OF MICROEVOLUTION USING AGENT-BASED MODELING - Matt Fifer
- The functions and structures above are the most
essential to the running of the program and the
actual study of microevolution. At the end of
each simulation step, the environment class
records the statistics for the agents in the
matrix and puts the numbers into a spreadsheet
for analysis. The spreadsheet can be used to
observe trends in the mean characteristics of the
system over time. Using the spreadsheet created
by the environment class, I was able to create
charts that would help me analyze the evolution
of the Organisms over the course of the
simulation.
29 THE STUDY OF MICROEVOLUTION USING AGENT-BASED MODELING - Matt Fifer
- The first time I ran the simulation, I set the program so that there was no mutation in the agents' genomes. Genes were strictly created at
the outset of the program, and those genes were
passed down to future generations. If
microevolution were to take place, a gene that
coded for a beneficial characteristic would have
a higher chance of being passed down to a later
generation. Without mutation, however, if one
organism possessed a characteristic that was far
superior to the comparable characteristics of
other organisms, that gene should theoretically
allow that organism to "dominate" the other
organisms and pass its genetic material to many
children, in effect exterminating the genes that
code for less beneficial characteristics.
30 THE STUDY OF MICROEVOLUTION USING AGENT-BASED MODELING - Matt Fifer
- For example, if an organism was created that had a 95% chance of reproducing in a given simulation step, it would quickly pass its genetic material to a lot of offspring, until its gene was the only one left coding for reproductive tendency, or libido.
31 THE STUDY OF MICROEVOLUTION USING AGENT-BASED MODELING - Matt Fifer
- As you can see from Figure 1, the average
tendency to reproduce increases during the
simulation. The tendency to die decreases to
almost nonexistence. The tendency to remain
still, since it has relatively no effect on
anything, stays almost constant. The tendency to
move to adjacent spaces, thereby spreading one's
genes throughout the ecosystem, increases to be
almost as likely as reproduction. The tendency to
kill one's neighbor decreases drastically,
probably because it does not positively benefit
the murdering organism. In Figure 2, we can see
that the population seems to stabilize at about
the same time as the average characteristics.
This would suggest that there was a large amount
of competition among the organisms early in the
simulation, but the competition quieted down as
one dominant set of genes took over the
ecosystem.
32 THE STUDY OF MICROEVOLUTION USING AGENT-BASED MODELING - Matt Fifer
- Figure 4: These figures show the results from the second run of the program, when mutation was turned on. As you can see, many of the same trends exist, with reproductive tendency skyrocketing and tendency to kill plummeting. Upon reevaluation, it seems that perhaps the tendencies to move and remain idle do not really affect an agent's ability to survive, and thus their trends are more subject to fluctuations that occur in the beginning of the simulation. One thing to note about the mutation simulation is the larger degree of fluctuation in both characteristics and population. The population stabilizes at about the same number, but swings between simulation steps are more pronounced. In Figure 3, the stabilization that had occurred in Figure 1 is largely not present.
33 THE STUDY OF MICROEVOLUTION USING AGENT-BASED MODELING - Matt Fifer
- Conclusion
- The goal of this project at the outset was to
create a system that modeled trends and processes
from the natural world, using the same mechanisms
that occur in that natural world. While this
project by no means definitively proves the
correctness of Darwin's theory of evolution over
the creationist theory, it demonstrates some of
the basic principles that Darwin addressed in his
book, The Origin of Species. Darwin addresses two
distinct processes--natural selection and
artificial selection. Artificial selection, or
selective breeding, was not present in this
project at all. There was no point in the program
where the user was allowed to pick organisms that
survived. Natural selection, though it is a
stretch because nature was the inside of a
computer, was simulated. Natural selection,
described as the "survival of the fittest," is
when an organism's characteristics enable it to
survive and pass those traits to its offspring.
34 THE STUDY OF MICROEVOLUTION USING AGENT-BASED MODELING - Matt Fifer
- In this program, "nature" was allowed to run its course, and at the end of the simulation, the organisms with the best combination of characteristics had triumphed over their predecessors. "Natural" selection occurred as predicted.
- All of the information in this report was either taught last year in A.P. Biology or drawn, to a small degree, from Charles Darwin's The Origin of Species. I created all of the code and all of the charts in this paper. For my next draft, I will be sure to include more outside information that I have found in the course of my research.
35 Using Machine Translation in a German-English Translator
This project attempts to take the beginning steps towards the goal of creating a translator program that operates within the scope of translating between English and German.
36 Natural Language Processing Using Machine Translation in Creation of a German-English Translator - Jason Ji
- Abstract
- The field of machine translation - using computers to provide translations between human languages - has been around for decades. And the dream of an ideal machine providing a perfect translation between languages has been around still longer. This project attempts to take the beginning steps towards that goal, creating a translator program that operates within an extremely limited scope to translate between English and German. There are several different strategies to machine translation, and this project will look into them - but the strategy taken in this project will be the researcher's own, with the general guideline of "thinking as a human."
37 Natural Language Processing Using Machine Translation in Creation of a German-English Translator - Jason Ji
- For if humans can translate between languages, there must be something to how we do it, and hopefully that something - that thought process - can be transferred to the machine to provide quality translations.
- Background
- There are several methods of varying difficulty and success for machine translation. The best method to use depends on what sort of system is being created. A bilingual system translates between one pair of languages; a multilingual system translates among more than two languages.
38 Natural Language Processing Using Machine Translation in Creation of a German-English Translator - Jason Ji
- The easiest translation method to code, yet probably the least successful, is known as the direct approach. The direct approach does what it sounds like it does - it takes the input language (known as the "source language"), performs morphological analysis - whereby words are broken down and analyzed for things such as prefixes and past-tense endings - performs a bilingual dictionary look-up to determine the words' meanings in the target language, performs a local reordering to fit the grammar structure of the target language, and produces the target language output. The problem with this approach is that it is essentially a word-for-word translation with some reordering, resulting often in mistranslations and incorrect grammar structures.
39 Natural Language Processing Using Machine Translation in Creation of a German-English Translator - Jason Ji
- Furthermore, when creating a multilingual system, the direct approach would require several different translation algorithms - one or two for each language pair. The indirect approach involves some sort of intermediate representation of the source language before translating into the target language. In this way, linguistic analysis of the source language can be performed on the intermediate representation. Translating to the intermediary also enables semantic analysis, as the source language input can be analyzed more carefully to detect idioms, etc., which can be stored in the intermediary and then appropriately used to translate into the target language.
40 Natural Language Processing Using Machine Translation in Creation of a German-English Translator - Jason Ji
- The transfer method is similar, except that the transfer is language dependent - that is to say, the French-English intermediary transfer would be different from the English-German transfer. An interlingua intermediary can be used for multilingual systems.
- Theory
- Humans fluent in two or more languages are at the moment better translators than the best machine translators in the world. Indeed, a person with three years of experience in learning a second language will already be a better translator than the best machine translators in the world as well.
41 Natural Language Processing Using Machine Translation in Creation of a German-English Translator - Jason Ji
- Yet for humans and machines alike, translation is
a process, a series of steps that must be
followed in order to produce a successful
translation. It is interesting to note, however,
that the various methods of translation for
machines - the various processes - become less
and less like the process for humans as they
become more complicated. Furthermore, it was
interesting to notice that as the method of
machine translation becomes more complicated, the
results are sometimes less accurate than the
results of simpler methods that better model the
human rationale for translation.
42 Natural Language Processing Using Machine Translation in Creation of a German-English Translator - Jason Ji
- Therefore, the theory is, an algorithm that
attempts to model the human translation process
would be more successful than other, more
complicated methods currently in development
today. This theory is not entirely plausible for
full-scale translators because of the sheer
magnitude of data that would be required. Humans
are better translators than computers in part
because they have the ability to perform semantic
analysis, because they have the necessary
semantic information to be able to, for example,
determine the difference in a word's definition
based on its usage in context. Creating a
translator with a limited-scope of vocabulary
would require less data, leaving more room for
semantic information to be stored along with
definitions.
43 Natural Language Processing Using Machine Translation in Creation of a German-English Translator - Jason Ji
- A limited-scope translator may seem useless at first glance, but even humans fluent in a language, including their native language, don't know the entire vocabulary of the language. A language has hundreds of thousands of words, and no human knows even half of them. A computer with a vocabulary of commonly used words that most people know, along with information to avoid semantic problems, would therefore still be useful for nonprofessional work.
- Development
- On the most superficial level, a translator is more user-friendly for an average person if it is GUI-based, rather than simply text-based. This part of the development is finished. The program presents a GUI for the user.
44 Natural Language Processing Using Machine Translation in Creation of a German-English Translator - Jason Ji
- A JFrame opens up with two text areas and a translate button. The text areas are labeled "English" and "German". The input text is typed into the English window, the "Translate" button is clicked, and the translator, once finished, outputs the translated text into the German text area. Although typing into the German text area is possible, the text in the German text area does not affect the translation process. The first problem to deal with in creating a machine translator is to be able to recognize the words that are input into the system. A sentence or multiple sentences are input into the translator, and a string consisting of that entire sentence (or sentences) is passed to the translate() function.
45 Natural Language Processing Using Machine Translation in Creation of a German-English Translator - Jason Ji
- The system loops through the string, finding all space (' ') characters and punctuation characters (comma, period, etc.) and records their positions. (It is important to note the position of each punctuation mark, as well as what kind of a punctuation mark it is, because the existence and position of punctuation marks alter the meaning of a sentence.)
- The number of words in the sentence is determined to be the number of spaces plus one. By recording the position of each space, the string can then be broken up into the words. The start position of each word is the position of each space, plus one, and the end position is the position of the next space. This means that punctuation at the end of any given word is placed into the String with that word, but this is not
46 Natural Language Processing Using Machine Translation in Creation of a German-English Translator - Jason Ji
- a problem: the location of each punctuation mark is already recorded, and the dictionary look-up of each word will first check to ensure that the last character of each word is a letter; if not, it will simply disregard the last character. The next problem is the biggest problem of all, the problem of actual translation itself. Here there is no code yet written, but development of pseudocode has begun already. As previously mentioned, translation is a process. In order to write a translator program that follows the human translation process, the human process must first be recognized and broken down into programmable steps. This is no easy task. Humans with five years of experience
47 Natural Language Processing Using Machine Translation in Creation of a German-English Translator - Jason Ji
- in learning a language may already translate any given text quickly enough, save time to look up unfamiliar words, that the process goes by too quickly to fully take note of. The basic process is not entirely determined yet, but there is some progress on it. The process to determine the process has been as follows: given a random sentence to translate, the sentence is first translated by a human, then the process is noted. Each sentence given is of ever-increasing difficulty to translate.
- For example, the sentence "I ate an apple" is
translated via the following process: 1) Find the
subject and the verb. (I ate) 2) Determine the
tense and form of the verb. (ate: past, Imperfekt
form) 2a) Translate the subject and verb. (Ich
ass) (note: "ass" is a real German verb form.)
3) Determine what the verb requires. (ate/eat
requires a direct object) 4) Find what the verb
requires in the sentence. (the direct object comes
after the verb and article: apple) 5) Translate
the article and the direct object. (ein Apfel)
6) Consider the gender of the direct object,
changing the article if necessary. (der Apfel, so
ein becomes einen) Result: Ich ass einen Apfel.
- References
- (I'll put these in proper bibliomumbojumbographical order later!)
- 1. http://dict.leo.org (dictionary) 2. "An
Introduction To Machine Translation" (available
online at http://ourworld.compuserve.com/homepages/WJHutchins/IntroMT-TOC.htm) 3.
http://www.comp.leeds.ac.uk/ugadmit/cogsci/spchlan/machtran.htm
(some info on machine translation) 4.
50 A Study of Balanced Search Trees
This project investigates four different balanced
search trees for their advantages and
disadvantages, and thus ultimately their
efficiency. Run time and memory space management
are the two main aspects under study. Statistical
analysis is provided to distinguish subtle
differences if there are any. A new balanced
search tree is suggested and compared with the
four balanced search trees.
51 A Study of Balanced Search Trees: Brainforming a New Balanced Search Tree
Anthony Kim
- Abstract
- This project investigates four different balanced
search trees for their advantages and
disadvantages, and thus ultimately their
efficiency. Run time and memory space management
are the two main aspects under study. Statistical
analysis is provided to distinguish subtle
differences if there are any. A new balanced
search tree is suggested and compared with the
four balanced search trees under study. The
balanced search trees are implemented in C,
making extensive use of pointers and structs.
- Introduction
- Balanced search trees are important data
structures. A normal binary search tree has some
disadvantages, specifically its dependence on the
order of the incoming data, which significantly
affects its tree structure and hence its
performance. The height of a search tree is the
maximum distance from the root of the tree to a
leaf. An optimal search tree is one that tries to
minimize its height for a given amount of data.
To improve height, and thus efficiency, balanced
search trees have been developed that rebalance
themselves into optimal tree structures allowing
quicker access to the stored data. For example,
the red-black tree is a balanced binary tree that
balances according to the color pattern of its
nodes (red or black) using rotation functions.
- The rotation function is a hallmark of nearly all
balanced search trees: it rotates, or adjusts the
heights of, subtrees around a pivot node. Many
balanced trees have been suggested and developed:
the red-black tree, AVL tree, weight-balanced
tree, B-tree, and more.
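The rotation just described, here in its left-handed form, can be sketched in C. The struct layout and function name are illustrative assumptions, not the project's code.

```c
#include <stdlib.h>

/* Minimal binary tree node; field names are illustrative. */
struct node {
    int key;
    struct node *left, *right;
};

/* Left-rotate around pivot x: x's right child y takes x's place,
 * x becomes y's left child, and y's old left subtree becomes x's
 * right subtree. Returns the new root of this subtree. */
struct node *rotate_left(struct node *x) {
    struct node *y = x->right;
    if (y == NULL) return x;      /* nothing to rotate around */
    x->right = y->left;
    y->left = x;
    return y;
}
```

Applied to a right-leaning chain 1 -> 2 -> 3, one left rotation at the root produces a balanced subtree rooted at 2.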
- Background Information
- Search Tree Basics
- This project requires a good understanding of
binary trees and general search tree basics. A
binary tree has nodes and edges. Nodes are the
elements in the tree, and edges represent the
relationships between nodes. Each node in a
binary tree is connected by edges to zero to two
other nodes. In a general search tree, a node can
have more than two children, as in the case of the
B-tree. A node is called a parent, and the nodes
connected by edges below it are called its
children. A node with no children is called a leaf
node. An easy way to visualize a binary tree is as
a real tree turned upside down, with the root at
the top and the branches at the bottom.
- The topmost node of a binary tree is called the
root. From the root, the tree branches out to its
immediate children and subsequent descendants.
Each node's children are designated the left child
and the right child. One property of a binary
search tree is that the value stored in the left
child is less than or equal to the value stored in
the parent, while the right child's value is
greater than the parent's. (Left <= Parent,
Parent < Right)
- 3.2 Search Tree Functions
- There are several main functions that go along
with binary trees and general search trees:
insertion, deletion, search, and traversal. In
insertion, when a data value is entered into the
search tree, it is compared with the root. If the
value is less than or equal to the root's, the
insertion function proceeds to the left child of
the root and compares again; otherwise the
function proceeds to the right child and compares
the value with that node's. When the function
reaches the end of the tree, for example when the
last node the value was compared with was a leaf
node, a new node is created at that position with
the newly inserted value. The deletion function
works similarly to find the node with the value of
interest (by going left and right accordingly).
- Then the function deletes the node and fixes the
tree (changing parent-child relationships, etc.)
to keep the property of the binary tree or that of
the general search tree. The search function, or
basically data retrieval, is also similar. After
traversing down the tree (starting from the root),
two cases are possible. If the value of interest
is encountered on the traversal, the function
reports that the data is in the tree. If the
traversal ends at a leaf node without encountering
the value being searched for, the function simply
reports otherwise. There are three kinds of
traversal functions to show the structure of a
tree: preorder, inorder, and postorder.
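The insertion and search procedures just described can be sketched in C under the Left <= Parent, Parent < Right property. This is a minimal illustration with assumed names; deletion's tree-fixing step is omitted.

```c
#include <stdlib.h>

/* Minimal binary search tree node; field names are illustrative. */
struct node {
    int key;
    struct node *left, *right;
};

/* Insert preserving the ordering property: values less than or
 * equal to the parent go left, greater values go right. A new node
 * is created where the descent falls off the tree. */
struct node *bst_insert(struct node *root, int key) {
    if (root == NULL) {
        struct node *n = malloc(sizeof *n);
        n->key = key;
        n->left = n->right = NULL;
        return n;
    }
    if (key <= root->key)
        root->left = bst_insert(root->left, key);
    else
        root->right = bst_insert(root->right, key);
    return root;
}

/* Search by descending left or right according to the same
 * property; returns 1 if the key is present, 0 otherwise. */
int bst_search(const struct node *root, int key) {
    while (root != NULL) {
        if (key == root->key) return 1;
        root = (key < root->key) ? root->left : root->right;
    }
    return 0;
}
```

Inserting 5, 3, 8 into an empty tree yields 5 at the root with 3 as its left child and 8 as its right child.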
- They are recursive functions that print the data
in a special order. For example, in preorder
traversal, as the prefix pre- suggests, the value
of the node is printed first, then the recursion
repeats on the left subtree and then on the right
subtree. Similarly, in inorder traversal, as the
prefix in- suggests, the left subtree is output
first, then the node's value, then the right
subtree. (Thus the node's value is output in the
middle of the function.) The same pattern applies
to postorder traversal.
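Two of the recursive traversals just described can be sketched as follows; the names are illustrative, and the functions collect keys into an array rather than printing. For a binary search tree, the inorder version visits the keys in sorted order.

```c
#include <stddef.h>

/* Minimal binary tree node; field names are illustrative. */
struct node {
    int key;
    struct node *left, *right;
};

/* Preorder: node first, then left subtree, then right subtree. */
void preorder(const struct node *t, int *out, size_t *n) {
    if (t == NULL) return;
    out[(*n)++] = t->key;
    preorder(t->left, out, n);
    preorder(t->right, out, n);
}

/* Inorder: left subtree, then node, then right subtree; on a
 * binary search tree this emits the keys in ascending order. */
void inorder(const struct node *t, int *out, size_t *n) {
    if (t == NULL) return;
    inorder(t->left, out, n);
    out[(*n)++] = t->key;
    inorder(t->right, out, n);
}
```

On the tree with root 2 and children 1 and 3, preorder yields 2, 1, 3 while inorder yields 1, 2, 3.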
- 3.3 The Problem
- It is not hard to see from the structure of a
binary search tree (or general search tree) that
the order of data input is important. In an
optimal binary tree, the data are input so that
insertion occurs just right, which makes the tree
balanced: the size of the left subtree is
approximately equal to the size of the right
subtree at each node in the tree. In an optimal
binary tree, the insertion, deletion, and search
functions run in O(log N), with N the number of
data items in the tree. This follows from the fact
that whenever a comparison occurs and traversal
continues (to the left or to the right), the
number of possible positions is cut in half.
However, that is only the case when the input is
nicely ordered and the search tree is balanced.
- It is also possible that the data are input so
that only right children are added (Root -> right
-> right -> right ...). It is obvious that the
search tree now looks like a linear array. And it
is one: this gives O(N) insertion, deletion, and
search operations, which is not efficient. Thus
balanced search trees were developed to perform
these functions efficiently regardless of the data
input. - 4 Balanced Search Trees
- Four major balanced search trees are
investigated. Three of them, namely the red-black
tree, the height-balanced tree, and the
weight-balanced tree, are binary search trees. The
fourth, the B-tree, is a multiple-children (> 2)
search tree.
- 4.1 Red-black tree
- The red-black search tree is a special binary
search tree with a color scheme: each node is
either black or red. There are four properties
that make a binary tree a red-black tree. (1) The
root of the tree is colored black. (2) All paths
from the root to the leaves agree on the number of
black nodes. (3) No path from the root to a leaf
may contain two consecutive red nodes. (4) Every
path from a node to a leaf (of its descendants)
has the same number of black nodes. The
performance of a balanced search tree is directly
related to its height. For a binary tree,
lg(number of nodes) is usually the optimal height.
A red-black tree with n nodes has height at most
2 lg(n + 1). The proof is noteworthy, but
difficult to understand.
- In order to prove the assertion that a red-black
tree's height is at most 2 lg(n + 1), we first
define bh(x), the black-height of a node x: the
number of black nodes on any path from, but not
including, x down to a leaf. Notice that
black-height is well defined under property (2) of
the red-black tree. It is easy to see that the
black-height of a tree is the black-height of its
root. First we shall prove that the subtree rooted
at any given node x contains at least
2^bh(x) - 1 nodes. We prove this by induction on
the height of x. The base case is bh(x) = 0, which
means x must be a leaf (NIL); then indeed the
subtree rooted at x contains 2^0 - 1 = 0 nodes.
The inductive step follows. Let node x have
positive height and two children.
- Note that each child has a black-height of either
bh(x), if it is a red node, or bh(x) - 1, if it is
a black node. It follows that the subtree rooted
at x contains at least
2(2^(bh(x)-1) - 1) + 1 = 2^bh(x) - 1 nodes. The
first term is the minimum contribution of the two
children's subtrees, and the second term (the + 1)
counts the root x itself; a little algebra leads
to the right-hand side. Having proved this,
bounding the maximum height of the red-black tree
is fairly straightforward. Let h be the height of
the tree. By property (3), at least half of the
nodes on any simple path from the root to a leaf
must be black, so the black-height of the root
must be at least h/2. Then n >= 2^(h/2) - 1, so
n + 1 >= 2^(h/2), hence lg(n + 1) >= h/2, and
therefore h <= 2 lg(n + 1).
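The chain of inequalities in this argument can be written out compactly:

```latex
% Minimum size of the subtree rooted at x (induction on its height):
n_x \;\ge\; 2\bigl(2^{\mathrm{bh}(x)-1} - 1\bigr) + 1 \;=\; 2^{\mathrm{bh}(x)} - 1
% By property (3), \mathrm{bh}(\mathrm{root}) \ge h/2, so:
n \;\ge\; 2^{h/2} - 1
\quad\Longrightarrow\quad n + 1 \;\ge\; 2^{h/2}
\quad\Longrightarrow\quad \lg(n+1) \;\ge\; \tfrac{h}{2}
\quad\Longrightarrow\quad h \;\le\; 2\lg(n+1)
```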
- Therefore we have just proved that a red-black
tree with n nodes has height at most
2 lg(n + 1). - 4.2 Height-Balanced Tree
- The height-balanced tree is a different approach
to bounding the maximum height of a binary search
tree. For each node, the heights of the left
subtree and the right subtree are stored. The key
idea is to balance the tree by rotating around any
node whose left and right subtrees differ in
height by more than a threshold. It all boils down
to the following property: (1) at each node, the
difference between the height of the left subtree
and the height of the right subtree is less than
the threshold value. A height-balanced tree should
yield lg(n) height, depending on the threshold
value.
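The height-balance property can be checked with a short C sketch. The struct, names, and height convention (edges on the longest root-to-leaf path, with an empty tree at -1) are assumptions for illustration; with a threshold of 2 this is the classic AVL condition.

```c
#include <stdlib.h>

/* Minimal binary tree node; field names are illustrative. */
struct node {
    int key;
    struct node *left, *right;
};

static int max_int(int a, int b) { return a > b ? a : b; }

/* Height counted in edges; an empty tree has height -1. */
int height(const struct node *t) {
    if (t == NULL) return -1;
    return 1 + max_int(height(t->left), height(t->right));
}

/* The height-balance property: at every node, the heights of the
 * left and right subtrees differ by less than the threshold. */
int is_height_balanced(const struct node *t, int threshold) {
    if (t == NULL) return 1;
    if (abs(height(t->left) - height(t->right)) >= threshold)
        return 0;
    return is_height_balanced(t->left, threshold)
        && is_height_balanced(t->right, threshold);
}
```

A line of three nodes fails the check at threshold 2, while the same keys arranged with 2 at the root pass it.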
- An intuitive, less rigorous, and yet valid proof
is provided. Imagine a simple binary tree in the
worst-case scenario: a line of nodes. If this
simple binary tree were to be transformed into a
height-balanced tree, the following process would
do it. (1) Pick a node near the middle of a given
strand of nodes so that the threshold property
(on |leftH() - rightH()|) is satisfied. (2) Define
this node as a parent and the resulting two
strands (nearly equal in length) as its left
subtree and right subtree. (3) Repeat steps (1)
and (2) on the left subtree and the right subtree.
First, note that this process will terminate,
because at each step the given strand is split
into two halves smaller than the original, so the
number of nodes in a given strand decreases.
- This will eventually reach a terminal strand size
determined by the threshold height difference. If
a given strand is impossible to divide while
keeping the threshold height difference, then that
branch of the recursion ends there. Splitting into
two halves recursively is analogous to dividing a
mass into two halves each time, and dividing by 2
in turn leads to lg(n). So the height of a
height-balanced tree should be lg(n), or something
of that magnitude. It is interesting to note that
a height-balanced tree is roughly a complete
binary tree, because height balancing gathers
nodes near the top. There is probably a decent
proof of this observation, but simple intuition is
enough to see it.
- 4.3 Weight-Balanced Tree
- The weight-balanced tree is very similar to the
height-balanced tree; it is the same idea with a
different nuance, and the overall data structure
is also similar. Instead of the heights of the
left and right subtrees, their weights are kept,
where the weight of a tree is defined as the
number of nodes in that tree. The key idea is to
balance the tree by rotating around any node whose
left and right subtrees differ in weight by more
than a threshold. Rotating around a node shifts
the weight balance to a more favorable one,
specifically the one with the smaller difference
between the weights of the left and right
subtrees. The weight-balanced tree has the
following main property: (1) at each node, the
difference between the weight of the left subtree
and the weight of the right subtree is less than
the threshold value.
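The weight-balance check is analogous to the height-balanced case, with weight as the node count. As before, the struct and names are illustrative assumptions.

```c
#include <stdlib.h>

/* Minimal binary tree node; field names are illustrative. */
struct node {
    int key;
    struct node *left, *right;
};

/* The weight of a tree is the number of nodes in it. */
int weight(const struct node *t) {
    if (t == NULL) return 0;
    return 1 + weight(t->left) + weight(t->right);
}

/* The weight-balance property: at every node, the weights of the
 * left and right subtrees differ by less than the threshold. */
int is_weight_balanced(const struct node *t, int threshold) {
    if (t == NULL) return 1;
    if (abs(weight(t->left) - weight(t->right)) >= threshold)
        return 0;
    return is_weight_balanced(t->left, threshold)
        && is_weight_balanced(t->right, threshold);
}
```

As with the height check, a line of three nodes fails at threshold 2 while a tree rooted at the middle key passes, illustrating why picking the middle node satisfies both properties.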
- An approach similar to the one used for the
height-balanced tree shows the lg(n) height of the
weight-balanced tree. The proof uses a mostly
intuitive argument built on recursion and
induction. Transforming a line of nodes, the
worst-case scenario in a simple binary tree, into
a weight-balanced tree can be done by the
following steps. (1) Pick a node near the middle
of a given strand of nodes so that the threshold
property (on |leftW() - rightW()|) is satisfied.
(2) Define this node as a parent and the resulting
two strands (nearly equal in length) as its left
subtree and right subtree. (3) Repeat steps (1)
and (2) on the left subtree and the right subtree.
It is easy to confuse the first step for the
height-balanced tree with that for the
weight-balanced tree, but picking the middle node
surely satisfies both the height-balanced and the
weight-balanced property.
- The weight-balanced property may even be the
better defined of the two, since the middle node
presumably has the same number of nodes before and
after its position. This process will terminate,
because at each step the given strand is split
into two halves smaller than the original strand,
so the number of nodes in a given strand
decreases. This will eventually reach a terminal
strand size determined by the threshold weight
difference. Splitting into two halves recursively
is analogous to dividing a mass into two halves
each time, and dividing by 2 in turn leads to
lg(n). So the height of a weight-balanced tree
should be lg(n), or something of that magnitude.
Like the height-balanced tree, the weight-balanced
tree is roughly a complete binary tree.
- A New Balanced Search Tree(?)
- A new balanced search tree has been developed.
The tree has no theoretical value to computer
science, but probably has practical value. The new
balanced search tree will be referred to as the
median-weight-mix tree, for each node will have a
key, zero to two children, and some sort of
weight. - 5.1 Background
- The median-weight-mix tree probably serves no
theoretical purpose because it is not perfect: it
has no well-defined behavior that obeys a set of
properties. Rather, it serves a practical purpose,
most likely in statistics.
- The median-weight-mix tree is based on the
following assumptions about data processing: (1)
given the lower bound and upper bound of the total
data input, random behavior is assumed, meaning
data points will be evenly distributed over the
interval; (2) multiple bells are assumed to be
present in the interval. The first property is not
hard to understand. It is based on the idea that
nature is random: the data points will be
scattered about, but evenly, since randomness
means each data value has an equal chance of
appearing in the input set. A physical example of
this would be rain. In a rainfall, raindrops fall
randomly onto the ground; in fact, one can
estimate the amount of rainfall by sampling a
small area.
- The amount of rain is measured in the small
sampling area, and the total rainfall is then
calculated by numerical projection, a ratio, or
some other method; the total rainfall would be
rainfall-in-small-area * area-of-total-area /
area-of-small-area. The second assumption is based
on a less apparent observation. Nature is not
completely random, which means some numbers will
occur more often than others. When the data values
and their frequencies are plotted on a 2D plane, a
wave is expected: there are more hits in some
ranges of data values (the crests) than in others
(the troughs). A practical example would be
height. One might expect a well-defined
bell-shaped curve based on the average height
(people tend to be around 5 foot 10 inches), but
this does not hold on a global scale, because
there are isolated populations around the world;
the average height of Americans is not necessarily
the average height of Chinese. So this wave-shaped
curve is assumed.
- 5.2 Algorithm
- Each node will have a key (data number), an
interval (with the lower and upper bounds of its
assigned interval), and the weights of its left
and right subtrees. The weight of each subtree is
calculated based on constants R and S. Constant R
represents the importance of focusing on
frequency-heavy data points; constant S represents
the importance of focusing on frequency-weak data
points. The ratio R/S consequently represents the
relative importance of frequency-heavy versus
frequency-weak data points. The tree will then be
balanced to maintain a favorable R/S ratio at each
node by means of rotation, both left and right.
- 6 Methodology
- 6.1 Idea
- Evaluating binary search trees can be done in
various ways because they can serve a number of
purposes. For this project, a binary search tree
was developed to take advantage of the random
nature of statistics under some assumptions, so it
is reasonable to evaluate on this basis. With this
overall purpose, several behaviors of the balanced
search trees will be examined: (1) the time it
takes to process a data set; (2) the average
retrieval time of data; (3) the height of the
binary tree. These properties are the major ones
that outline the analysis. Speed is important, and
each binary tree is timed to check how long it
takes to process the input data. But the average
retrieval time of data is also important because
it is the best indication of the efficiency of the
data structure: what is the use if you can input a
number quickly but retrieve it slowly? Lastly, the
height of the binary tree is checked to see how
the theoretical idea works out in a practical
situation.
- 6.2 Detail
- It is worthwhile to note how each behavior is
measured in C. For measuring the time it takes to
process a data set, the starting time and the
ending time will be recorded with the function
clock() from the time.h library. The duration is
then (EndTime - StartTime) / CLOCKS_PER_SEC. The
average retrieval time of data will be calculated
by first summing the time it takes to check each
data point in the tree and dividing this sum by
the number of data points in the binary tree. The
height of the binary tree, the third behavior
under study, is calculated by tree traversal
(pre-, in-, or post-order), simply taking the
maximum height/depth visited as each node is
scanned. There will be several (identical) test
cases to check the red-black tree, the
height-balanced tree, the weight-balanced tree,
and the median-weight-mix tree. The first category
of test runs will be test cases with a gradually
increasing number of randomly generated data
points. The second category of test runs will be
hand-manipulated.
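The clock()-based timing just described can be sketched as follows. The workload here is a stand-in loop, not one of the actual tree test cases.

```c
#include <time.h>

/* Time a workload with clock(), as described in the text:
 * duration = (EndTime - StartTime) / CLOCKS_PER_SEC.
 * The loop is a stand-in for processing a data set. */
double time_workload(void) {
    clock_t start = clock();
    volatile long sum = 0;            /* volatile: keep the loop */
    for (long i = 0; i < 1000000; i++)
        sum += i;
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

The same start/stop pattern would wrap each tree's insertion loop; dividing a summed per-lookup time by the number of data points gives the average retrieval time.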
- Data points will still be randomly generated,
but under some statistical behaviors, such as a
"wave," a single bell curve, etc. The third
category of test runs will be real-life data
points such as heights, ages, and others. Due to
the immense amount of data, some proportional
scaling might be used to accommodate the memory
capability of the balanced binary trees. - 7
Result Analysis
- C code for the balanced search trees will be
provided, along with testing of the balanced
search trees for their efficiency. Graphs and
tables will be provided. Under construction. - 8
Conclusion Under construction. 9 References Under
construction. http://newds.zefga.net/snips/Docs/BalancedBSTs.html
- Appendix A Other Balanced Search Trees. Appendix
B Code.
77 Linux Kernel Debugging API
The purpose of this project is to create an
implementation of much of the kernel API that
functions in user space, the normal environment
that processes run in. The issue with testing
kernel code is that the live kernel runs in kernel
space, a separate area that deals with hardware
interaction and management of all the other
processes. Kernel space debuggers are unreliable
and very limited in scope: a kernel failure can
hardly dump useful error information because
there's no operating system left to write that
information to disk.
78 Kernel Debugging User-Space API Library (KDUAL)
John Livingston
- Abstract
- The purpose of this project is to create an
implementation of much of the kernel API that
functions in user space, the normal environment
that processes run in. The issue with testing
kernel code is that the live kernel runs in kernel
space, a separate area that deals with hardware
interaction and management of all the other
processes. Kernel space debuggers are unreliable
and very limited in scope: a kernel failure can
hardly dump useful error information because
there's no operating system left to write that
information to disk. Kernel development is quite
likely the most important active project in the
Linux community.
- Any aids to the development process would be
appreciated by the entire kernel development team,
allowing them to do their work faster and pass
changes along to the end user more quickly. This
program will make a direct contribution to kernel
developers, and an indirect contribution to every
future user of Linux. - Introduction and
Background
- The Linux kernel is arguably the most complex
piece of software ever crafted. It must be held to
the most stringent standards of performance, as
any malfunction, or worse, security flaw, could be
potentially fatal for a critical application.
However, because of the nature of the kernel and
its close interaction with hardware, it is
extremely difficult to debug kernel code.