Natural Language C Cross Compiler



    Abstract

    Programming is taught at many institutions around the world. It is introduced as a tool to both Computer Science majors and non-Computer Science students. By the end of their studies, most students lack appreciation of the subject, mainly because the introductory courses failed to cultivate enthusiasm: they start with formal languages like C, which the less passionate students fail to comprehend. In the development world there are different types of programmers, broadly good and bad; the distinction arises from different tastes and styles. It is fairly difficult to comprehend someone's code that is not well commented, and debugging can be a painful task when code is not self-documenting. The aforementioned reasons motivated the task of coming up with a natural language interface for programming. The long-term goal is to simplify software development processes and come up with models that speed up development, so as to minimise losses due to projects taking longer than necessary.


    Acknowledgements

    Firstly I would like to thank my Supervisor, Mr K. Muzheri, for his guidance, technical support and constructive criticism throughout the project, and my Co-Supervisor, Mr S. Ngwenya, for his tireless supervision and expert contribution throughout the project. Special thanks to my family, who have provided for me throughout my entire programme both financially and with moral support. I would like to acknowledge everyone who contributed to this project but is not mentioned above; do not be disheartened, your contributions were greatly appreciated. I thank you all for giving me ideas where I lacked wisdom and for your unconditional support throughout the project and more.


    Table of Contents

    Abstract
    Acknowledgements
    List of Figures
    List of Tables
    Chapter 1: Introduction
    1.1 Introduction
    1.2 Background
    1.3 Aim
    1.4 Objectives
    1.5 Justification
    1.6 Scope
    1.7 Expected results
    1.8 Project overview
    1.9 Project Plan
    1.10 Conclusion
    Chapter 2: Literature Review
    2.1 Introduction
    2.2 Natural Language Processing
    2.3 Natural Language Programming
    2.4 Compiler Design
    2.5 Conclusion
    Chapter 3: Methodology
    3.1 Introduction
    3.2 Research Methodology
    3.3 Rapid Application Development
    3.4 Agile
    3.5 Tools
    3.6 Preferred Methodology and Justification
    3.7 Conclusion
    Chapter 4: System Analysis and Design
    4.1 Introduction
    4.2 Requirements Elicitation
    4.3 Feasibility Study
    4.4 Requirements Specification
    4.5 System Requirements
    4.6 System Design
    4.7 Conclusion
    Chapter 5: Implementation and Testing
    5.1 Introduction
    5.2 Tools
    5.3 Deployment Architecture
    5.4 System Functionality
    5.5 Software Testing
    5.6 Conclusion
    Chapter 6: Recommendations and Conclusions
    6.1 Introduction
    6.2 Classification of the System
    6.3 Review of the Project's Aim and Objectives
    6.4 Challenges Encountered
    6.5 Recommendations for Future Work
    6.6 Conclusion
    References
    APPENDICES


    List of Figures

    Figure 2.1: Parse Tree
    Figure 3.1: A generic agile development process features an initial planning stage, rapid repeats of the iteration stage, and some form of consolidation before release
    Figure 4.1: Suggested Interface
    Figure 4.2: Architecture Diagram for the Natural Language C Cross Compiler
    Figure 4.3: Class Diagram for the Natural Language C Cross Compiler front-end
    Figure 4.4: Sequence Diagram for the Natural Language C Cross Compiler
    Figure 5.1: The interface


    List of Tables

    Table 1.1: Project Plan


    Chapter 1

    Introduction

    1.1 Introduction

    Natural Language Programming (NLP) is an ontology-assisted way of programming in terms of natural language sentences, for example in English. The goal of NLP is to make computers easier to use and to enable people who are not professional computer scientists to teach new behaviour to their computers.

    Natural Language Programming builds up a single program, or a library of routines, programmed through natural language sentences using an ontology that defines the available data structures in a high-level programming language. The smallest unit of statement in Natural Language Programming is a sentence. Each sentence is stated in terms of concepts from the underlying ontology. In a Natural Language Program text, each sentence unambiguously compiles into a procedure call in the underlying high-level programming language such as C, C++ or Java.

    The ideal easy-to-use interface for programming would be a natural language interface: just tell the computer what you want.
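    By way of illustration, the mapping from a sentence to a procedure call can be sketched in a few lines of Python. This is a hypothetical toy, not the system described in this project; the ontology entries and routine names are invented for the example.

    # Toy sketch: compile an English sentence into a C-style procedure call
    # by looking its verb up in a tiny hand-made ontology. All names here
    # are hypothetical.
    ONTOLOGY = {
        "print": "printf",  # English concept -> routine in the target language
        "read": "scanf",
    }

    def compile_sentence(sentence):
        words = sentence.lower().rstrip(".").split()
        verb, argument = words[0], " ".join(words[1:])
        if verb not in ONTOLOGY:
            raise ValueError("unknown concept: " + verb)
        return '%s("%s");' % (ONTOLOGY[verb], argument)

    print(compile_sentence("Print hello world."))  # -> printf("hello world");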

    Attempts have been made in the natural language programming field to create natural language interfaces for programming. The NLC prototype of 1979 (Liu and Lieberman, 2005) was built with the capability of handling low-level operations such as the transformation of type declarations into programmatic expressions. More recently, a system called METAFOR was developed, capable of translating natural language statements into class descriptions with associated objects and methods (Liu and Lieberman, 2005). Efforts have been focused on experiments on the feasibility of using natural language in programming, and less has been done to come up with fully functional NLP interfaces for programming.

    1.2 Background

    A natural language interface for programming should result in greater readability, as well as

    making possible a more intuitive way of writing code. Code written in English is much easier

    to read and understand than in a traditional programming language. Quite often, it is a


    difficult task to read another programmer's code. Even understanding one's own code can be hard after a period of time. This is because, without sufficient commenting, one cannot tell what the individual steps are meant to do together.

    Debugging is a generic term for finding and fixing errors in a program. These errors can be syntactic, which are normally detected by the compiler or interpreter, or logical, which cause unwanted behaviours (Halpern, 1966). The latter can be extraordinarily difficult to find: it involves knowing exactly what each line in the program does. If what the programmer believes a statement does and what it actually does diverge, there is the potential for catastrophe.

    Early work in natural language programming was deemed ambitious, targeting the generation of complete computer programs that would compile and run. For instance, the NLC prototype (Ballard and Biermann, 1979) aimed at creating a natural language interface for processing data stored in arrays and matrices, with the ability to handle low-level operations such as the transformation of numbers into type declarations, e.g. float-constant (2.0), or turning natural language statements like "add y1 to y2" into the programmatic expression y1 + y2.
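    The "add y1 to y2" example above can be reproduced with a small pattern-based rewriter. The Python sketch below is illustrative only, not the NLC implementation; the second pattern is an invented extension of the same idea.

    import re

    # Rewrite simple arithmetic commands in the spirit of the NLC example:
    # "add y1 to y2" becomes the programmatic expression "y1 + y2".
    PATTERNS = [
        (re.compile(r"add (\w+) to (\w+)"), r"\1 + \2"),
        (re.compile(r"subtract (\w+) from (\w+)"), r"\2 - \1"),
    ]

    def to_expression(utterance):
        for pattern, template in PATTERNS:
            match = pattern.fullmatch(utterance.strip().lower())
            if match:
                return match.expand(template)
        raise ValueError("no pattern matched")

    print(to_expression("add y1 to y2"))  # -> y1 + y2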

    More recently, however, researchers have started to look again at the problem of natural language programming, but this time with more realistic expectations and with a different, much larger pool of resources, for example broad-spectrum common sense knowledge (Singh, 2002) and a suite of significantly advanced, publicly available natural language processing tools. For instance, (Pane et al., 2001) conducted a series of studies with non-programming fifth-grade users and identified some of the programming models implied by the users' natural language descriptions. In a similar vein, (Liu and Lieberman, 2005) conducted a feasibility study and showed how a partial understanding of a text, coupled with a dialogue with the user, can help non-expert users make their intentions more precise when designing a computer program. Their study resulted in a system called METAFOR (Liu and Lieberman, 2005), able to translate natural language statements into class descriptions with the associated objects and methods.

    The challenge most programmers face today is trying to make code readable for the next programmer. Current methods of system documentation do not do much in terms of documenting the code itself. This leaves commenting as the only documentation any


    source code has. A natural language interface uses the comments as program statements, which results in code that is self-documenting and readable for anyone who goes through the code. This makes debugging a less daunting task, as program logic will be in plain English.

    1.3 Aim

    The aim of the project is to develop a natural language compiler based on the C programming

    language.

    1.4 Objectives

    To compile natural language text as program input

    To perform syntax error checks on input

    To perform grammatical error checks on input

    To generate equivalent object code executable on the main platform

    1.5 Justification

    It is notoriously difficult to construct conventional software systems systematically and on time (Sommerville, 2008), with up to 20% of industrial development projects failing. With further study and improvements, the aim is to bridge the gap between how problems are defined and how they are solved: problems are defined in natural language but implemented using formal programming languages. This gap has caused delays in the delivery of software, which ultimately translates to losses in the millions in some cases.

    A natural language interface gives code that is readable and easier to maintain: in essence, self-documenting code. Consider a scenario where, as a programmer, you are tasked to add functionality to an application you did not write but whose code you have at hand, and which is not well documented. This task takes a long time. The current methods used for documenting software projects do little in terms of documenting the code itself. Going through another programmer's code is a difficult task, and sometimes even your own code is, after a long time. Good programmers write self-documenting code, and yet when faced with the less preferred scenario it will take a long time to make even a simple change to a system.


    1.6 Scope

    The project is aimed at developing a natural language C-based compiler. The natural language used is English. The compiler extracts information about variables, operators and loops by analysing the natural language program input.

    It focuses on the representation of the parts of the natural language, English, that can be mapped to existing data structures such as variables, structs, lists, arrays and loops. The compiler extracts nouns, verbs and overtly expressed actions; these can be mapped through the use of ontologies to variables, iterations (loops) and statements.

    The compiler is based on the data structures that are built into the C programming language. It does not cover the graphical implementation of the C programming language.

    1.7 Expected results

    The compiler takes English natural language text as input. It performs grammatical and syntax error checks on the natural language program, reporting back any errors in the program. If no errors are found, it compiles the natural language program. The compiler generates the equivalent object code that is executable on the main platform, that is, Windows or Linux.

    1.8 Project overview

    This project is divided into six main chapters: the introductory chapter, the literature review, methodology, systems analysis and design, implementation and testing, and conclusions and recommendations for future research.

    The first chapter gives an introduction to the project. It states the background and aim of the project and the justification for why the research topic was selected.

    The Literature review gives an overview of the current systems and the fundamentals

    employed in the area of this research. This chapter also develops an argument on the

    relevance of this research.

    The Methodology chapter gives an overview of the available methodologies and elaborates the justification for the methodology selected. The chapter also covers the techniques and the methodology used in the development of the Natural Language C Cross Compiler.


    The Systems Analysis and Design chapter focuses on the analysis and design of the system. It gives a detailed analysis of the functional requirements and summarises the system in the form of system development designs. Using the Unified Modelling Language, the conceptual model of the system is shown, and it is on these designs that the system is implemented.

    The Implementation chapter shows how the transformation from design to application is done. Screenshots of the completed system are shown. This chapter also covers testing and gives an overview of problems encountered during the implementation stage, as well as the resultant solutions.

    The last chapter gives the conclusion of the research: the set objectives are measured against the results, and suggestions for further research are stated. Divergences, if any, are justified in this chapter.

    1.9 Project Plan

    The project follows the plan below, with the activities and the timeline for each milestone shown in the following table.

    Activity Number   Milestone                           Timeline
    1                 Analysis and design                 4 Weeks
    2                 Implementation                      5 Weeks
    3                 Testing and Evaluation              4 Weeks
    4                 Final Demonstration and Reporting   2 Weeks

    Table 1.1: Project Plan


    1.10 Conclusion

    This chapter discussed the research project, its aim and objectives. To introduce the project

    the Natural Language Processing and Programming was discussed and the need for the

    system was argued as well. Also included in this chapter was the project plan which is

    followed in the outline of the rest of this project documentation.


    Chapter 2

    Literature Review

    2.1 Introduction

    Natural Language Processing is an interdisciplinary research area at the border between linguistics and artificial intelligence, aiming at developing computer programs capable of human-like activities related to understanding or producing texts or speech in a natural language, such as English. Modern approaches to NLP are based on machine learning, a type of artificial intelligence that examines and uses patterns in data to improve a program's own understanding. The most important applications of natural language processing include information retrieval and information organisation, machine translation, and natural language interfaces, among others.

    The work of improving Natural Language Processing has been divided into tasks useful for application development and analysis. These range from syntactic analysis, such as part-of-speech tagging, chunking and parsing, to semantic analysis, such as semantic role labelling, named entity extraction and anaphora resolution.

    Natural Language Programming is a branch separate from Natural Language Processing but within artificial intelligence. Natural Language Programming is the interpretation and compilation of instructions communicated in natural language into object code. It depends on advances in Natural Language Processing.

    2.2 Natural Language Processing

    Natural Language Processing (NLP) targets the conversion of human language into formal representations that can be manipulated by computers. Natural Language Processing is not often considered a goal in and of itself but rather a means of accomplishing a certain task; for instance, information retrieval systems use NLP.

    Natural Language Processing seeks to accomplish human-like processing: that is, to be able to paraphrase input text, convert the text to another language, and answer questions about the text.


    2.2.1 Natural Language Processing Applications

    There are huge amounts of data on the Internet, and applications for processing large amounts of text require Natural Language Processing expertise. Typical requirements are to classify text into categories; index and search large texts; translate automatically; understand speech, for example phone conversations; extract information, such as useful details from resumes; summarise automatically, that is, condense one book into one page; answer questions; acquire knowledge; and generate texts or dialogues.

    Natural language processing provides both theory and implementations for a range of applications, including information retrieval and information extraction. Information extraction focuses on the recognition, tagging, and extraction into a structured representation of certain key elements of information, for example persons, companies, locations and organisations, from large collections of text. These extractions can then be utilised for a range of applications including question answering, visualisation, and data mining.

    Question Answering, in contrast to Information Retrieval (which provides a list of potentially relevant documents in response to a user's query), provides the user with either just the text of the answer itself or answer-providing passages.

    At the higher levels of NLP, at the discourse level, there is summarisation. Its implementation reduces a larger text into a shorter, richly constituted, abbreviated narrative representation of the original document.

    Machine Translation (MT) can be considered the oldest of all NLP applications; various levels of NLP have been utilised in MT systems, ranging from the word-based approach to applications that include higher levels of analysis.

    2.2.2 Computational Linguistics

    A simple sentence consists of a subject followed by a predicate. A word in a sentence acts as a part of speech (POS). For an English sentence, the parts of speech are nouns, pronouns, adjectives, verbs, adverbs, prepositions, conjunctions, and interjections. A noun tells us about names, whereas a verb talks of action. Adjectives and adverbs modify nouns and verbs, respectively. Prepositions express relationships between nouns and other parts of speech. Conjunctions join words and groups together, and interjections express strong feelings. In the


    spoken language, the problem of understanding speech can be divided into three areas: acoustic-phonetic, morphological-syntactic, and semantic-pragmatic processes.

    In computational linguistics the lexicon supplies paradigmatic information about words, including part-of-speech labels, irregular plurals, and sub-categorisation information for verbs. In the past, lexicons were quite small and were constructed largely by hand. Effective natural language processing requires increased amounts of lexical information. A recent trend has been the use of automatic techniques applied to large corpora for the purpose of acquiring lexical information from text (Zernik 1991). Statistical techniques are an important aspect of automatically mining lexical information. (Manning 1993) uses such techniques to gather sub-categorisation information for verbs. (Brent 1993) also discovers sub-categorisation information; in addition he attempts to automatically discover verbs in the text. (Liu and Soo 1993) describe a method for mining information about thematic roles. The additional information being added to the lexicon increases its complexity, and this added complexity requires that attention be paid to the organisation of the lexicon (Zernik 1991). (McCray et al 1993) discuss the structure of a large lexicon designed and implemented to support syntactic processing.

    Automatically disambiguating part-of-speech labels in text is an important research area, since ambiguity is particularly prevalent in the spoken language. Programs that resolve part-of-speech labels (often called automatic taggers) are typically around 95% accurate (Bod 1998). Taggers can serve as pre-processors for syntactic parsers and contribute significantly to efficiency. There have been two main approaches to automatic tagging: probabilistic and rule-based. Typically, probabilistic taggers are trained on disambiguated text and vary as to how much training text is needed and how much human effort is required in the training process. (Schütze 1993) described a tagger that requires very little human intervention. Further variation concerns what to do about unknown words and the ability to deal with large numbers of tags.

    One drawback of stochastic taggers is that they are very large programs requiring considerable computational resources. (Brill 1992) describes a rule-based tagger which is as accurate as stochastic taggers but with a much smaller program; the program is, however, slower than stochastic taggers. Building on Brill's approach, (Roche and Schabes 1995) propose a rule-based, finite-state tagger which is much smaller and faster than stochastic implementations, while accuracy and other characteristics remain comparable.
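    As a concrete toy illustration of the probabilistic approach, a unigram tagger simply assigns each word the tag it carried most often in a disambiguated training text. The Python sketch below uses an invented six-word corpus and is nowhere near the 95% accuracy of the real taggers cited above.

    from collections import Counter, defaultdict

    # Minimal unigram tagger: each word gets the tag it carried most often
    # in the (tiny, invented) disambiguated training data.
    training = [("the", "Art"), ("boy", "N"), ("ate", "V"),
                ("the", "Art"), ("ice", "N"), ("cream", "N")]

    counts = defaultdict(Counter)
    for word, tag in training:
        counts[word][tag] += 1

    def tag(word):
        if word in counts:
            return counts[word].most_common(1)[0][0]
        return "N"  # crude guess for unknown words

    print([(w, tag(w)) for w in "the boy ate the ice cream".split()])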


    A traditional approach to natural language processing takes as its basic assumption that a system must assign a complete constituent analysis to every sentence it encounters. The methods used to attempt this are drawn from mathematics, with context-free grammars playing a large role in assigning syntactic constituent structure. (Partee et al 1993) provide an accessible introduction to the theoretical constructs underlying this approach, including set theory, logic, formal language theory, and automata theory, along with the application of these mechanisms to the syntax and semantics of natural language. For syntax, it uses a unification-based implementation of a generalised phrase structure grammar (Gazdar et al. 1985) and handles an impressive number of syntactic structures. In continuing research in this tradition, context-free grammars have been extended in various ways. The mildly context-sensitive grammars, such as tree adjoining grammars, have had considerable influence on recent work concerned with the formal aspects of parsing natural language. Several recent papers pursue non-traditional approaches to syntactic analysis. One such technique is partial, or underspecified, analysis. For many applications such an analysis is entirely sufficient and can often be more reliably produced than a fully specified structure. (Chen 1994), for example, employs statistical methods combined with a finite state mechanism to impose an analysis which consists only of noun phrase boundaries, without specifying their complete internal structure or their exact place in a complete tree structure. (Agarwal and Boggess 1992) successfully rely on semantic features in a partially specified syntactic representation for the identification of coordinate structures. In an innovative application of dependency grammar and dynamic programming techniques, (Kurohashi and Nagao 1994) address the problem of analysing very complicated coordinate structures in Japanese.

    A recent innovation in syntactic processing has been investigation into the use of statistical techniques. In probabilistic parsing, probabilities are extracted from a parsed corpus for the purpose of choosing the most likely rule when more than one rule can apply during the course of a parse (Magerman and Weir 1992). In another application of probabilistic parsing the goal is to choose the (semantically) best analysis from a number of syntactically correct analyses for a given input (Briscoe et al 1993).

    Another application of statistical methodologies to the parsing process is grammar induction, where the rules themselves are automatically inferred from a bracketed text; however, results in the general case are still preliminary. (Pereira and Schabes 1992) discuss inferring a grammar from bracketed text relying heavily on statistical techniques, while (Brill 1993) uses only modest statistics in his rule-based method.


    Automatic word-sense disambiguation depends on the linguistic context encountered during processing. (McRoy 1992) appeals to a variety of cues while parsing, including morphology, collocations, semantic context, and discourse. Her approach is not based on statistical methods, but rather is symbolic and knowledge-intensive. Statistical methods exploit the distributional characteristics of words in large texts and need training, which can come from several sources, as well as human intervention. (Gale et al 1992) give an overview of several statistical techniques they have used for word-sense disambiguation and discuss research on evaluating results for their systems and others. They have used two training techniques, one based on a bilingual corpus and another on Roget's Thesaurus. (Justeson and Katz 1995) use both rule-based and statistical methods. The attractiveness of their method is that the rules they use provide linguistic motivation.

    Formal semantics is rooted in the philosophy of language and has as its goal a complete and rigorous description of the meaning of sentences in natural language. It concentrates on the structural aspects of meaning. The papers in (Rosner and Johnson 1992) discuss various aspects of the use of formal semantics in computational linguistics and focus on Montague grammar (Montague 1974). (King 1992) provides an overview of the relation between formal semantics and computational linguistics. Several papers in Rosner and Johnson discuss research in the situation semantics paradigm (Barwise and Perry 1983), which has recently had wide influence in computational linguistics, especially in discourse processing. Lexical semantics (Cruse 1986) has recently become increasingly important in natural language processing. This approach to semantics is concerned with psychological facts associated with the meaning of words. (Levin 1993) analyses verb classes within this framework, while the papers in (Levin and Pinker 1991) explore additional phenomena, including the semantics of events and verb argument structure. Another application of lexical semantics is WordNet, a lexical database that attempts to model cognitive processes. The articles in (Saint-Dizier and Viegas 1995) discuss psychological and foundational issues in lexical semantics as well as a number of aspects of using lexical semantics in computational linguistics.

    Another approach to language analysis based on psychological considerations is cognitive

    grammar (Langacker 1988). (Olivier and Tsujii 1994) deal with spatial prepositions in this

    framework, while (Davenport and Heinze 1995) discuss more general aspects of semantic

    processing based on cognitive grammar.


    Discourse analysis is concerned with coherent processing of text segments larger than the

    sentence and assumes that this requires something more than just the interpretation of the

    individual sentences. (Grosz, Joshi and Weinstein 1995) provide a broad-based discussion of

    the nature of discourse, clarifying what is involved beyond the sentence level, and how the

    syntax and semantics of the sentences support the structure of the discourse. In their analysis,

    discourse contains linguistic structure (syntax, semantics), focus of attention, and intentional

    structure (plan of participants) and is structured into coherent segments. During discourse

    processing one important task for the hearer is to identify the referents of noun phrases.

    Inferencing is required for this identification. A coherent discourse lessens the amount of

    inferencing required of the hearer for comprehension. Throughout a discourse the particular

    way that the speaker maintains focus of attention or centring through choice of linguistic

    structures for referring expressions is particularly relevant to discourse coherence.

    Other work in computational approaches to discourse analysis has focused on particular aspects of processing coherent text. (Hajicova et al 1995) distinguish topic (old information) from focus (new information) within a sentence; information of this sort is relevant to tracking the focus of attention. (Lappin and Leass 1994) are primarily concerned with intra-sentential anaphora resolution, which relies on syntactic cues rather than discourse cues. Nonetheless, they also address inter-sentential anaphora, which relies on several discourse cues, such as the saliency of a noun phrase, determined by such things as grammatical role, frequency of mention, proximity, and how recent a sentence is. (Hul et al 1995) use a similar notion of saliency for anaphora resolution and resolve deictic expressions with the same principles. (Passonneau and Litman 1993) study the nature of discourse segments and the linguistic structures which cue them. (Sonderland and Lehnert 1994) investigate machine learning techniques for discovering discourse-level semantic structure.

    Several recent papers investigate aspects of discourse processing having to do with the psychological state of the participants in a discourse, including goals, intentions, and beliefs. (Asher and Lascarides 1994) investigate a formal model for representing the intentions of the participants in a discourse and the interaction of such intentions with discourse structure and semantic content. (Traum and Allen 1994) describe the idea of social obligation to shed light on the behaviour of discourse. (Wiebe 1994) investigates psychological point of view in third-person narrative and provides an insightful algorithm for tracking this phenomenon in text; the point of view of each sentence is either that of the narrator or of any one of the characters in the narrative.


    2.2.3 Levels of knowledge in language understanding

    A language understanding program must have considerable knowledge about the structure of the language, including what the words are and how they combine into phrases and sentences. It must also know the meanings of the words, how they contribute to the meaning of a sentence, and the context in which they are being used. In addition, the program must have general world knowledge and knowledge about how humans reason.

    The components of the knowledge needed to understand language are as follows. Phonological knowledge relates sounds to the words we recognise; a phoneme is the smallest unit of sound, and phones are aggregated into words. Morphological knowledge is lexical knowledge, which relates to word construction from basic units called morphemes; a morpheme is the smallest unit of meaning, for example the construction of "friendly" from "friend" and "ly". Syntactic knowledge is about how words are organised to construct meaningful and correct sentences. Pragmatics is the high-level knowledge about how to use sentences in different contexts and how the context affects the meanings of the sentences.

    2.2.4 Grammars and Languages

    A language can be generated given its grammar G = (V, Σ, S, P), where V is the set of variables, Σ is the set of terminal symbols, which appear at the end of generation, S is the start symbol, and P is the set of production rules. The corresponding language of G is L(G).

    Consider that the various tuples are as given in Listing 2.1.


    S → aS
    S → aAB
    AB → BA
    aA → ab
    aA → aa

    Listing 2.3: Third Language Generation

    Where uppercase letters are non-terminals and lowercase are terminals.

    The type-2 grammars are:

    S → aS
    S → aSb
    S → aB
    S → aAB
    A → a
    B → b

    Listing 2.4: Fourth Language Generation

    The type-3 grammar is the simplest, having rewrite rules such as:

    S → aS
    S → ε

    Listing 2.5: Fifth Language Generation

    The types 1, 2 and 3 are called context-sensitive, context-free, and regular grammars respectively, and hence the corresponding names for the languages as well. Formal languages are mostly based on type-2 grammars, as types 0 and 1 are not well understood and are difficult to implement.
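    To make the type-3 case concrete, the grammar of Listing 2.5 generates the regular language a* (zero or more a's). The Python sketch below derives its strings by repeatedly applying the two rules, assuming the elided second rule is indeed the empty production:

    # Derive strings of the type-3 grammar in Listing 2.5 (S -> aS | empty),
    # i.e. the regular language a*: apply S -> aS repeatedly and terminate
    # each derivation with the empty production.
    def generate(max_len):
        sentential, derived = "S", []
        while len(sentential) <= max_len + 1:
            derived.append(sentential.replace("S", ""))   # S -> empty
            sentential = sentential.replace("S", "aS")    # S -> aS
        return derived

    print(generate(4))  # ['', 'a', 'aa', 'aaa', 'aaaa']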

    2.2.5 Structural Representation

    It is convenient to represent sentences as a tree or a graph to help expose the structure of the constituent parts. For example, the sentence "the boy ate an ice cream" can be represented as the tree shown in Figure 2.1.

    Figure 2.1: Parse Tree

    For the purpose of computation a tree must also be represented as a record, a list or some

    similar data structure. For example, the tree above is represented as a list:

    (S (NP (Art the)
           (N boy))
       (VP (V ate)
           (NP (Art an) (N icecream))))

    Listing 2.6: Tree Representation as a list
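    For computation, the same tree can equally be held as nested tuples in a general-purpose language. A brief Python sketch of Listing 2.6, with a walk that recovers the words from the leaves:

    # Listing 2.6 as nested (label, children...) tuples, plus a walk that
    # recovers the sentence from the leaves.
    tree = ("S",
            ("NP", ("Art", "the"), ("N", "boy")),
            ("VP", ("V", "ate"),
                   ("NP", ("Art", "an"), ("N", "icecream"))))

    def leaves(node):
        label, *children = node
        if len(children) == 1 and isinstance(children[0], str):
            return [children[0]]          # a leaf: (category, word)
        return [w for child in children for w in leaves(child)]

    print(" ".join(leaves(tree)))  # -> the boy ate an icecream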


    A more extensive English grammar can be obtained with the addition of other constituencies such as prepositional phrases (PP), adjectives (ADJ), determiners (DET), adverbs (ADV), auxiliary verbs (AUX), and many other features. Correspondingly, the other rewrite rules are as follows.

    PP → Prep NP
    VP → V ADV
    VP → V PP
    VP → V NP PP
    VP → AUX V NP
    Det → Art ADJ
    Det → Art

    Listing 2.7: Rewrite rules

    2.2.6 Pattern matching

    The idea here is an approach to natural language processing that interprets input utterances as a whole, rather than building up their interpretation by combining the structure and meaning of words or other lower-level constituents. That means the interpretations are obtained by matching patterns of words against the input utterance. For a deep level of analysis in pattern matching, a large number of patterns are required even for a restricted domain. This problem can be ameliorated by hierarchical pattern matching, in which the input is gradually normalised through pattern matching against sub-phrases. Another way to reduce the number of patterns is by matching with semantic primitives instead of words.
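    A toy sketch of this idea in Python follows; the patterns and replies are invented for the example, not drawn from any real system. Sub-phrase patterns first normalise the input, then the utterance is matched as a whole:

    import re

    # Sketch of hierarchical pattern matching: the input is normalised by
    # sub-phrase patterns first, then matched as a whole against top-level
    # patterns. All patterns here are hypothetical.
    SUBPHRASE = [(re.compile(r"\b(hi|hello|hey)\b"), "GREETING")]
    TOPLEVEL = {
        re.compile(r"^GREETING computer$"): "Hello, user.",
        re.compile(r"^what time is it$"): "It is noon.",
    }

    def respond(utterance):
        text = utterance.lower().rstrip("?.!")
        for pattern, token in SUBPHRASE:         # normalise sub-phrases
            text = pattern.sub(token, text)
        for pattern, reply in TOPLEVEL.items():  # match utterance as a whole
            if pattern.match(text):
                return reply
        return "I do not understand."

    print(respond("Hello computer!"))  # -> Hello, user.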

    2.2.7 Syntactically driven Parsing

    Syntax means the ways that words can fit together to form higher-level units such as phrases, clauses and sentences. Syntactically driven parsing therefore means that interpretations of larger groups of words are built up out of the interpretations of their syntactic constituent words or phrases. In a way this is the opposite of pattern matching, where the interpretation of the input is done as a whole. Syntactic analyses are obtained by application of a grammar that determines what sentences are legal in the language that is being parsed.
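    A minimal recursive-descent sketch of syntactically driven parsing, using the toy grammar implicit in Figure 2.1 (S → NP VP, NP → Art N, VP → V NP) and a five-word lexicon invented for the example:

    # Recursive-descent parser for the toy grammar implicit in Figure 2.1:
    # S -> NP VP, NP -> Art N, VP -> V NP. Returns a Listing 2.6-style tree.
    LEXICON = {"the": "Art", "an": "Art", "boy": "N", "icecream": "N", "ate": "V"}

    def expect(words, category):
        if not words or LEXICON.get(words[0]) != category:
            raise SyntaxError("expected %s at %r" % (category, words))
        return (category, words[0])

    def parse_np(words):
        art, noun = expect(words, "Art"), expect(words[1:], "N")
        return ("NP", art, noun), words[2:]

    def parse_vp(words):
        verb = expect(words, "V")
        np, rest = parse_np(words[1:])
        return ("VP", verb, np), rest

    def parse(words):
        np, rest = parse_np(words)
        vp, rest = parse_vp(rest)
        if rest:
            raise SyntaxError("trailing words: %r" % rest)
        return ("S", np, vp)

    print(parse("the boy ate an icecream".split()))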


    2.2.8 Semantic Grammars

    Natural language analysis based on semantic grammar is somewhat similar to syntactically driven parsing, except that in a semantic grammar the categories used are defined semantically as well as syntactically.

    Case frame instantiation is one of the major parsing techniques under active research today. It has some very useful computational properties, such as its recursive nature and its ability to combine bottom-up recognition of key constituents with top-down instantiation of less structured constituents.

    2.2.9 Applications of Natural Language Processing

    As natural language processing technology matures it is increasingly being used to support

    other computer applications. Such use naturally falls into two areas, one in which linguistic

    analysis merely serves as an interface to the primary program, and another in which natural

    language considerations are central to the application.

    Natural language interfaces to database management systems (for example Bates 1989) translate the user's input into a request in a formal database query language, and the program then proceeds as it would without the use of natural language processing techniques. It is normally the case that the domain is constrained and the language of the input consists of comparatively short sentences with a constrained set of syntactic structures. The design of question answering systems is similar to that of interfaces to database management systems.

    One difference is that the knowledge base supporting the question answering system does not have the structure of a database. Processing in such a system not only requires a linguistic description of the user's requests, but must also provide a representation of the encyclopaedia itself. As with the interface to a database management system, the requests are likely to be short and have a constrained syntactic structure. (Lauer et al 1992) provide some general considerations concerning question answering systems and describe several applications.

    In message understanding systems, a fairly complete linguistic analysis may be required, but

    the messages are relatively short and the domain is often limited. (Davenport and Heinze

    1995) describe such a system in a military domain.


    In information filtering, text categorisation, and automatic abstracting, no constraints on the linguistic structure of the documents being processed can be assumed. One mitigating factor is that effective processing may not require a complete analysis. For all of these applications there are also statistically based systems built on frequency distributions of words. These systems work fairly well, but most people feel that for further improvements, and for extensions, some sort of understanding of the texts, such as that provided by linguistic analysis, is required.

    Information filtering and text categorisation are concerned with comparing one document to another. In both applications, natural language processing imposes a linguistic representation on each document being considered. In text categorisation a collection of documents is inspected and all documents are grouped into several categories based on the characteristics of the linguistic representations of the documents. (Blosseville et al. 1992) describe an interesting system which combines natural language processing, statistics, and an expert system. In information filtering, documents satisfying some criterion are singled out from a collection. (Jacobs and Rau 1990) discuss a program which imposes a quite sophisticated semantic representation for this purpose.

    In automatic abstracting, a summary of each document is sought, rather than a classification of a collection. The underlying technology is similar to that used for information filtering and text categorisation: the use of some sort of linguistic representation of the documents. Of the two major approaches, one (McKeown and Radev 1995) puts more emphasis on semantic analysis for this representation and the other (Paice and Jones 1993), less. Information retrieval systems typically allow a user to retrieve documents from a large bibliographic database. During the information retrieval process a user expresses an information need through a query. The system then attempts to match this query to those documents in the database which satisfy the user's information need. In systems which use natural language processing, both query and documents are transformed into some sort of linguistic structure, and this forms the basis of the matching. Several recent information retrieval systems employ varying levels of linguistic representation for this purpose. (Sembok and van Rijsbergen 1990) base their experimental system on formal semantic structures, while (Myaen et al 2004) construct lexical semantic structures for document representations. (Strzalkowski 1994) combines syntactic processing and statistical techniques to enhance the accuracy of representation of the documents. In an innovative approach to document representation for


    information retrieval, (Liddy et al 1995) use several levels of linguistic structure, including

    lexical, syntactic, semantic, and discourse.

    2.2.10 Natural Language Processing based systems

    A number of systems currently use natural language processing to accomplish targeted tasks. Some of these include systems for text summarisation, page ranking, natural language interfaces to databases, text mining and language translation.

    Currently, researchers have been working on coming up with natural language programming interfaces. Some of these systems have been targeted at learning institutions for the acquisition of a first programming language. This is because it has been noted across tertiary institutions that the dropout rate for courses involving programming has been significantly high, as high as 30% (Guzdial & Soloway, 2002), also causing a lack of appreciation of programming in general among students majoring in Computer Science.

    The vast amount of information on the Internet, and the information needed for doing day-to-day tasks in a number of fields, has called for systems that can search comprehensively for information to improve productivity. This has led to data and text mining systems in fields like medicine.

    2.3 Natural Language Programming

    Natural Language Programming is the interpretation and compilation of instructions

    communicated in natural language into object code. It uses natural language processing

    techniques for the extraction of information from natural language text input.

    2.3.1 Natural Language Programming based systems

    The NLC prototype (Ballard and Biermann, 1979) was one of the attempts made to come up with a natural language programming interface. It had the capability of handling low-level operations such as the transformation of type declarations into programmatic expressions. The system is capable of turning statements like "add y1 to y2" into the expression y1 + y2.


    More recently, in 2005, a system called METAFOR was implemented. METAFOR is capable of translating natural language statements into class descriptions with associated objects and methods. METAFOR interactively converts English sentences to partially specified program code, to be used as a starting point for a more detailed program. A user study by Henry Lieberman showed that METAFOR is capable of capturing enough programmatic semantics to facilitate non-programming users' and beginners' conceptualisation of programming problems.

    2.4 Compiler Design

    Compilers bridge source programs in high-level languages with the underlying hardware. A compiler has four major tasks: to determine the correctness of the syntax of programs, to generate correct and efficient object code, to perform run-time organisation, and to format output according to assembler and/or linker conventions. A compiler consists of three main parts: the front-end, the middle-end, and the back-end.

    The front-end checks whether the program is correctly written in terms of the programming language syntax and semantics. Here legal and illegal programs are recognised, and errors, if any, are reported in a useful way. Type checking is also performed by collecting type information. The front-end then generates an intermediate representation, or IR, of the source code for processing by the middle-end.

    The middle-end is where optimisation takes place. Typical transformations for optimisation are removal of useless or unreachable code, discovery and propagation of constant values, relocation of computation to a less frequently executed place (e.g., out of a loop), or specialisation of computation based on the context.

    The middle-end generates another intermediate representation for the back-end that follows; most optimisation efforts are focused on this part. The back-end is responsible for translating the IR from the middle-end into assembly code. Target instructions are chosen for each IR instruction. Register allocation assigns processor registers to the program variables where possible. The back-end utilises the hardware by figuring out how to keep parallel execution units busy, filling delay slots, and so on. Although most optimisation problems are NP-hard, heuristic techniques are well developed.
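    One of the middle-end transformations named above, discovery and propagation of constant values, can be sketched over a toy three-address IR. This is a simplified illustration in Python, not a production pass; the IR layout is invented for the example.

    # Sketch of constant folding/propagation over a toy three-address IR of
    # (dest, op, a, b) tuples. Folded results are removed from the stream.
    def fold_constants(ir):
        known, out = {}, []
        for dest, op, a, b in ir:
            a = known.get(a, a)          # propagate known constants
            b = known.get(b, b)
            if op == "+" and isinstance(a, int) and isinstance(b, int):
                known[dest] = a + b      # fold at compile time
            else:
                out.append((dest, op, a, b))
        return out, known

    ir = [("t1", "+", 2, 3), ("t2", "+", "t1", "x")]
    print(fold_constants(ir))  # t1 folds to 5; t2 becomes ('t2', '+', 5, 'x')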


    2.4.1 What is a compiler?

    In order to reduce the complexity of designing and building computers, nearly all of them are made to execute relatively simple commands (but do so very quickly). A program for a computer must be built by combining these very simple commands into a program in what is called machine language. Since this is a tedious and error-prone process, most programming is instead done using a high-level programming language. This language can be very different from the machine language that the computer can execute, so some means of bridging the gap is required. This is where the compiler comes in. A compiler translates (or compiles) a program written in a high-level programming language, suitable for human programmers, into the low-level machine language required by computers. During this process, the compiler attempts to spot and report obvious programmer mistakes.

    2.4.2 The phases of a compiler

    A typical way to structure the writing of a compiler is to split the compilation into several phases with well-defined interfaces (Alfred 2007). Theoretically, these phases operate in sequence (though in practice they are often interleaved), each phase (except the first) taking the output from the previous phase as its input. It is common to let each phase be handled by a separate module. Some of these modules are written by hand, while others may be generated from specifications. Often, some of the modules can be shared between several compilers.

    In some compilers the ordering of phases may differ slightly; some phases may be combined or split into several phases, or some extra phases may be inserted between those mentioned in the following paragraphs.

    Lexical analysis is the initial part of reading and analysing the program text: the text is read and divided into tokens, each of which corresponds to a symbol in the programming language, for example a variable name, keyword or number.
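    A minimal lexer along these lines can be sketched in Python with regular expressions; the keyword set is a toy assumption, not the C keyword list.

    import re

    # Minimal lexical analyser: divide program text into the token kinds
    # named above (keyword, identifier/variable name, number), plus a
    # catch-all for operators. Toy keyword set.
    KEYWORDS = {"if", "while", "return"}
    TOKEN = re.compile(r"\s*(?:(?P<num>\d+)|(?P<word>[A-Za-z_]\w*)|(?P<op>\S))")

    def lex(text):
        tokens = []
        for match in TOKEN.finditer(text):
            kind, value = match.lastgroup, match.group(match.lastgroup)
            if kind == "word":
                kind = "keyword" if value in KEYWORDS else "identifier"
            elif kind == "num":
                kind = "number"
            else:
                kind = "operator"
            tokens.append((kind, value))
        return tokens

    print(lex("while x1 < 10"))
    # [('keyword', 'while'), ('identifier', 'x1'), ('operator', '<'), ('number', '10')]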

    The syntax analysis phase takes the list of tokens produced by the lexical analysis and arranges them in a tree structure (called the syntax tree) that reflects the structure of the program. This phase is often called parsing.


The type checking phase analyses the syntax tree to determine if the program violates certain consistency requirements, that is, whether a variable is used but not declared, or is used in a

    context that does not make sense given the type of the variable, such as trying to use a

    Boolean value as a function pointer.
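Reusing the illustrative node structure sketched above, a type checker for this small expression language could be outlined in C as follows (a sketch under the assumption that all declared variables are integers; the type names are hypothetical):

    /* Illustrative types for the toy language. */
    enum type { T_INT, T_ERROR };

    /* Recursively compute the type of an expression tree. */
    enum type check(struct node *n) {
        switch (n->kind) {
        case N_NUMBER:
            return T_INT;
        case N_VARIABLE:
            return T_INT;      /* assumed declared as an integer */
        case N_ADD:
            /* both operands of + must be integers */
            if (check(n->left) == T_INT && check(n->right) == T_INT)
                return T_INT;
            return T_ERROR;    /* a consistency violation to report */
        }
        return T_ERROR;
    }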

    In intermediate code generation the program is translated to a simple machine independent

intermediate language. In the register allocation phase, the symbolic variable names used in

    the intermediate code are translated to numbers, each of which corresponds to a register in the

    target machine code.
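As an illustration, the assignment x = a + b * 4 might pass through these two phases roughly as follows (a hypothetical three-address intermediate form and register assignment, shown here as a C comment for readability):

    /* Source statement:  x = a + b * 4;

       Intermediate code (machine independent, three-address form):
           t1 := b * 4
           t2 := a + t1
           x  := t2

       After register allocation on a hypothetical two-register
       target, the temporaries map to registers (t1 and t2 can
       share R1, since t1 is dead once t2 is computed):
           R1 := b * 4
           R1 := R0 + R1      ; a was loaded into R0
           x  := R1
    */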

2.5 Conclusion

We gave an overview of the techniques used in Natural Language Processing and where they are applied in real life. From the literature review, an approach to designing the Natural Language Compiler would be to start by utilising tools and algorithms for Natural Language Processing. This assists in getting the relevant information from the input to be used subsequently in the later stages of the overall system. A lot of work would have to be done after the initial stages of processing; that is, for the compiler to be fully functional, an exhaustive number of functions has to be written in the underlying language that correspond unambiguously to the actions to be performed on the parameters passed.


    Chapter 3

    Methodology

    3.1 Introduction

    A software development methodology is a structure imposed on the development of a

software product, or alternately a framework that is used to plan and control the process of

    developing an information system. It includes procedures, techniques, tools and

    documentation aids which help system developers in their task of implementing a new

    system. The aim of a methodology is to formalise what is being done, making it more

    repeatable.

A study conducted by the Forrester research group (Hoffman T, July 2003) states that nearly one-third of all IT projects commenced would, on average, run three months late. In many

    cases the failure is the result of either not using a methodology or using the wrong

    methodology. This shows the importance of a software development methodology in a

software project, for it somewhat determines the project's success or failure. This

    chapter discusses some development methodologies that are used and also highlights the

    methodology that is adopted for this project and why it was chosen.

    3.2 Research Methodology

    A research methodology is a way to systematically solve a research problem. It may be

understood as the science of studying how research is done scientifically. We studied the various steps generally adopted by a researcher in studying a research problem, along with the logic behind them. Research methodologies use procedures, methods and techniques that have been tested for their validity and reliability. Some research methodologies are discussed below.

    The build research methodology consists of building an artefact, either a physical or a

    software system, to demonstrate that it is possible. For it to be considered research, the

    construction of the artefact must be new or must include new features that have not been

    demonstrated before in other artefacts.

    Another research methodology is process methodology which is used to understand the

processes used to accomplish tasks in computing. This methodology is mostly used in the areas of


    Software Engineering and Man-Machine Interface which deal with the way humans build and

    use computer systems. The study of processes may also be used to understand cognition in

    the field of Artificial Intelligence.

    The last research methodology discussed for the project is the model methodology. It is

    centred on defining an abstract model for a real system. The model is much less complex than

    the system that it models, and therefore allows the researcher to better understand the system

    and to use the model to perform experiments that could not be performed in the system itself

    because of cost or accessibility. The model methodology is often used in combination with

    other methodologies. Experiments based on a model are called simulations. When a formal

    description of the model is created to verify the functionality or correctness of a system, the

    task is called model checking.

    3.3 Rapid Application Development

    Rapid Application Development (RAD) is a software development methodology that focuses

    on building applications in a very short amount of time; traditionally with compromises in

    usability, features and execution speed. RAD employs joint application design (to obtain user

input), prototyping, CASE technology, application generators, and similar tools to expedite the design process.

    Rapid Application Development has four essential aspects: methodology, people,

    management, and tools. If any one of these ingredients is inadequate, development will not be

    high speed. Development lifecycles, which weave these ingredients together as effectively as

    possible, are of the utmost importance.

    3.3.1 Strengths, weaknesses, and limitations

    Rapid application development promotes fast, efficient, accurate program and/or system

    development and delivery. Compared to other methodologies, RAD generally improves

    user/designer communication, user cooperation, and user commitment, and promotes better

    documentation.

Because rapid application development adopts prototyping and joint application design, RAD inherits their strengths and their weaknesses. More specifically, RAD is not suitable for


    mathematical or computationally oriented applications. Because rapid application

development stresses speed, quality indicators such as consistency, standardisation,

    reusability, and reliability are easily overlooked.

    Speed and quality are the primary advantages of Rapid Application Development, while

    potentially reduced scalability and feature sets are the disadvantages. The primary advantage

lies in an application's increased development speed and decreased time to delivery. Projects developed using RAD lack the scalability of a project that was designed as a full application

    from the start. Rapid Application Development is not appropriate for all projects. The

    methodology works best for projects where the scope is small or work can be broken down

    into manageable chunks. Business objectives need to be well defined before the project can

    begin, so projects that use RAD should not have a broad or poorly defined scope.

    3.3.2 RAD Concepts and Phases

    Rapid application development (RAD) is a system development methodology that employs

    joint application design (to obtain user input), prototyping, CASE technology, application

    generators, and similar tools to expedite the design process. Initially suggested by James

    Martin, this methodology gained support during the 1980s because of the wide availability of

    such powerful computer software as fourth-generation languages, application generators, and

    CASE tools, and the need to develop information systems more quickly. The primary

    objectives include high quality, fast development, and low cost.

    Rapid application development focuses on four major components: tools, people,

    methodology, and management. Current, powerful computing technology is essential to

    support such tools as application generators, screen/form generators, report generators,

    fourth-generation languages, relational or object-oriented database tools, and CASE tools.

    People include users and the development team. The methodology stresses prototyping and

    joint application design.

    A strong management commitment is essential. Before implementing rapid application

    development, the organisation should establish appropriate project management and formal

user sign-off procedures. Additionally, standards should be established for the organisation's

    data resources, applications, systems, and hardware platforms.


    Martin suggests four phases to implement rapid application development: requirements

    planning, user design, construction, and cutover (Martin, 2005). Requirements planning is

    much like traditional problem definition and systems analysis. RAD relies heavily on joint

    application design (JAD) sessions to determine the new system requirements.

    During the user design phase, the JAD team examines the requirements and transforms them

    into logical descriptions. CASE tools are used extensively during this phase. The system

    design can be planned as a series of iterative steps or allowed to evolve.

    During the construction phase, a prototype is built using the software tools described earlier.

    The JAD team then exercises the prototype and provides feedback that is used to refine the

    prototype. The feedback and modification cycle continues until a final, acceptable version of

    the system emerges. In some cases, the initial prototype consists of screens, forms, reports,

    and other elements of the user interface, and the underlying logic is added to the prototype

    only after the user interface is stabilised.

    The cutover phase is similar to the traditional implementation phase. Key activities include

    training the users, converting or installing the system, and completing the necessary

documentation. Once the prototype has been developed within its time box, the construction team tests the initial prototype using test scripts developed during the user design stage, the design team reviews the application, and the customer also reviews it. Lastly, the

    implementation stage, also known as the deployment stage, consists of integrating the new

    system into the business. The design team trains the system users while the users perform

    acceptance testing. If there was an old system in place, the design team would help the users

    transfer from their old procedures to new ones that involve the new system. The design team

also troubleshoots after the deployment, using a test environment for testing purposes, and

    identifies and tracks potential enhancements. The amount of time required to complete the

    Implementation Stage varies with the project.

As with any project there are post-project activities, which are typically the same for most methodologies. For RAD, final deliverables should be handed over to the client and such activities should be performed that benefit future projects. Specifically, it is a best practice for

    a Project Manager to review and document project metrics, organise and store project assets

    such as reusable code components, Project Plan, Project Management Plan (PMP), and Test

Plan. It is also a good practice to prepare a short lessons-learned document.


    3.4 Agile

    The focal aspects of light and agile methods are simplicity and speed. In development work,

    accordingly, the development group concentrates only on the functions needed at first hand,

    delivering them fast, collecting feedback and reacting to the received information. An agile

development process is one where software development is incremental, cooperative,

    straightforward and adaptive. Agile methodology is based on iterative and incremental

    development, where requirements and solutions evolve.

    The core of agile software development methods is the use of light-but-sufficient rules of

    project behaviour and the use of human and communication-oriented rules. The agile

    process is both light and sufficient. Lightness is a means of remaining manoeuvrable.

    Sufficiency is a means of staying in the game (Cockburn 2002).

    Agile methodologies embrace iterations. Small teams work together with stakeholders to

    define quick prototypes, proof of concepts, or other visual means to describe the problem to

    be solved. The team defines the requirements for the iteration, develops the code, and defines

    and runs integrated test scripts, and the users verify the results.

    Figure 3.1: A generic agile development process features an initial planning stage, rapid repeats

    of the iteration stage, and some form of consolidation before release.


    3.4.1 Two agile software development methodologies

    The most widely used methodologies based on the agile philosophy are Extreme

    programming and Scrum. These differ in particulars but share the iterative approach

    described above.

3.4.2 Extreme Programming

    This methodology concentrates on the development rather than managerial aspects of a

software project. Extreme programming was designed so that organisations would be free to

    adopt all or part of the methodology. It relies on constant code improvement, user

involvement in the development team and pair programming.

    XP projects start with a release planning phase, followed by several iterations, each of which

    concludes with user acceptance testing. When the product has enough features to satisfy

users, the team terminates the iterations and releases the software. The life cycle of XP consists of six phases, namely exploration, planning, iterations to release, production, maintenance and final release.

    In the exploration phase, the customers write out story cards that they wish to be included in

the first release. Each story card describes a feature to be added into the program. At the same time the project team familiarise themselves with the tools, technology and practices they will

    be using in the project. The planning phase sets the priority for the stories and an agreement

    of the contents of the first small release is made. The iterations to release phase includes

    several iterations of the systems before the first release. The schedule set in the planning

    stage is broken down to a number of iterations that will each take one to four weeks to

    implement. The first iteration creates a system with the architecture of the whole system. The

production phase requires extra testing and checking of the performance of the system before the system can be released to the customer.

    To create a release plan, the team breaks up the development tasks into iterations. The release

    plan defines each iteration plan, which drives the development for that iteration. At the end of

    iteration, users perform acceptance tests against the user stories. If they find bugs, fixing the

    bugs becomes a step in the next iteration.

XP has rules and concepts that govern it, some of which are described below. The first is integrate often, which means development teams must integrate changes into the


    development baseline at least once a day. This is also known as continuous integration.

    Project velocity is another governing principle which is the measure of how much work is

getting done on the project. This metric drives release planning and

schedule updates. Another principle is the user story, which describes problems to be solved by

    the system being built. These stories must be written by the user and should be about three

    sentences long. This is one of the main objections to the XP methodology, but also one of its

    greatest strengths.

3.4.3 Scrum

    This methodology follows the rugby concept of scrum, which is related to scrimmage, in the

    sense of a huddled mass of players engaged with each other to get a job done. Scrum for

    software development came out of the rapid prototyping community because prototyping

    groups wanted a methodology that would support an environment in which the requirements

    were not only incomplete at the start, but also could change rapidly during development.

Unlike XP, the Scrum methodology includes both managerial and development processes.

    After the team completes the project scope and high-level designs, it divides the development

    process into a series of short iterations called sprints. Each sprint aims to implement a fixed

    number of backlog items. Before each sprint, the team members identify the backlog items

    for the sprint. At the end of a sprint, the team reviews the sprint to articulate lessons learned

    and check progress.

    The Scrum development process concentrates on managing sprints. Before each sprint

    begins, the team plans the sprint, identifying the backlog items and assigning teams to these

    items. Teams develop, wrap, review, and adjust each of the backlog items. During

development, the team determines the changes necessary to implement a backlog item. The team then writes the code, tests it, and documents the changes. During wrap, the team creates

    the executable necessary to demonstrate the changes. In review, the team demonstrates the

    new features, adds new backlog items, and assesses risk. Finally, the team consolidates data

    from the review to update the changes as necessary.

Scrum also has some rules and concepts that govern it, some of which are described below. The sprint backlog is the list of backlog items assigned to a sprint, but not yet completed. In common


    practice, no sprint backlog item should take more than two days to complete. The sprint

    backlog helps the team predict the level of effort required to complete a sprint. Another

concept is the product backlog. It is the complete list of requirements, including bugs,

    enhancement requests, and usability and performance improvements that are not currently in

    the product release.

    3.5 Tools

In the development of the system there are tools that we are going to employ so as to come up with a system of high quality. Below are some of the tools needed in the development of our system.

    3.5.1 Unified Modelling Language

    Requirements for a business are best met by modelling business rules at a very high level,

    where they can be easily validated with clients, and then automatically transformed to the

    implementation level. The Unified Modelling Language (UML) is now widely used for both

database and software modelling. It is used as a standard language for object-oriented analysis and design and is also used to model the Natural Language C Cross Compiler front end. UML's object-oriented approach facilitates the transition to object-oriented code, hence its use in this project.

The design models can be either static, describing the static structure of the system in terms of object classes and relationships, or dynamic, describing the dynamic interactions of the objects.

UML diagrams are used to depict system requirements and functionality; some of these diagrams are used to show what this system does and its goals in the design phase. The UML diagrams used in the analysis phase are activity diagrams, sequence diagrams and use case diagrams. In the design phase, class diagrams and entity relationship diagrams are used.

Two more UML diagrams come into play during the implementation stage; these are deployment diagrams and component diagrams. Deployment diagrams are implementation


    level diagrams that show how the hardware and software elements that make up this

    application are configured and set into operation. Component diagrams are also

implementation-level tools that show the structure of the code, be it source code files, executable files or binary code files, connected by dependencies.

    3.5.2 Why UML

Although UML has no specification for the modelling of user interfaces, no way to formally specify serialisation and object persistence, and no way to specify that an object resides in a server process and is shared among instances of a running process, it is the chosen modelling language for this project. This is because UML offers all the benefits of object-oriented development such as inheritance and polymorphism. It also helps to communicate, explore potential designs, and validate the architectural design of the software. Importantly, it uses a simple, intuitive notation whose models non-programmers can also understand.

    3.5.3 Other Tools

A text editor with source code formatting is used for this project, specifically Notepad++, because of its simplicity and its adaptability in supporting multiple languages with a quick switch between them. This project is being developed using three languages, C, Java and some assembly language, hence the need to shift between languages frequently during the development stage.

Other than the standard development environments, this project uses open-source scanner and parser generators, mainly Lex and Yacc on Linux distributions, and Flex and Bison (variants of Lex and Yacc respectively) on Windows.
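For context, a minimal C driver for a Bison-generated parser might look as follows (a sketch only; the grammar and lexer specification files are not shown here):

    #include <stdio.h>

    int yyparse(void);   /* generated by Bison from the grammar file */

    /* Called by the generated parser when a syntax error is found. */
    void yyerror(const char *msg) {
        fprintf(stderr, "parse error: %s\n", msg);
    }

    int main(void) {
        /* yyparse() repeatedly calls the Flex-generated yylex()
           to obtain tokens; it returns 0 on a successful parse. */
        return yyparse();
    }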

    3.6 Preferred Methodology and Justification

    The chosen methodology is extreme programming, a type of agile development methodology.

This methodology has been chosen because it concentrates on the development rather than managerial aspects of a software project. It best suits this project, which involves a lot of programming processes, because it puts more emphasis on the core programming work. Extreme

    programming promotes the fast development of a software product by dividing the whole


    project into small components which are developed in iterations. By promoting fast

    development with the use of iteration, it simultaneously promotes the production of a high

    quality product since the modules are independently produced in the iterative process.

    3.7 Conclusion

    Here we discussed the different methodologies that can be used in the research and

    development of the Natural Language C Cross Compiler, giving both the advantages and

    disadvantages of each. The development of the Natural Language C Cross Compiler can be

    modularised and it has a lot of programming processes hence the chosen methodology.

Justification for not choosing the other methodologies is outlined in this chapter. Included also

    are the tools that are used in the development of the system.


    Chapter 4

    System Analysis and Design

    4.1 Introduction

Systems analysis is the dissection of a system into its component pieces to study how those component pieces interact and work, with a view to changing or improving an already working system. We do a systems analysis so as to subsequently perform a systems synthesis, which is the re-assembly of a system's component pieces back into a whole and, it is hoped, an

    improved system. Traditionally, systems analysis is associated with application development

    projects, that is, projects that produce information systems and their associated computer

    applications. Systems analysis methods can be applied to projects with different goals and

    scope. In addition to single information systems and computer applications, systems analysis

    techniques can be applied to strategic information systems planning and to the redesign of

    business processes. There are also many strategies or techniques for performing systems

    analysis. They include modern structured analysis, information engineering, prototyping, and

    object-oriented analysis.

    4.2 Requirements Elicitation

    During this phase we gathered system requirements using a number of techniques to ensure

    unambiguity, completeness, consistency, correctness and verifiability of the requirements

    both non-functional and functional. Methods used include interviews, questionnaires and

    examination of similar existing systems. This stage is of utmost importance as the system is

built directly from these requirements.

We conducted interviews with a number of programmers from the University in an attempt to establish the desired system functionality. The output formed a basis for the structure of the

    input to the system that is natural language.

We had a brainstorming session with fellow students passionate about programming, including an expert programmer currently at e-solutions private limited. These sessions were intended to complement the interviews in trying to come up with solid system requirements that do not detract from set standards of software development.


    4.2.1 Needs Analysis

Compilers in existence currently and before give feedback via a character user interface, or a graphical user interface for compilers integrated within Integrated Development Environments (IDEs). These compilers give information about errors encountered, whether semantic or syntactic, by reporting the actual location of the error by line number and, in most IDEs, for example NetBeans, by underlining it in colour.

    From the interviews and brainstorming sessions we had, a point was raised that errors are

easily seen when colours are used, as with IDEs and text editors with source code formatting. We

    then decided that there is need for the Natural Language C Cross Compiler to have a

    graphical user interface with a text area for the input and a portion for displaying results or

    errors.

The user interaction via the graphical user interface required that the editor support the basic operations of a text editor. These were designated to be cut, copy, paste and opening external files. Apart from the basic text editor operations, a point raised was that the compiler should have a menu with compile and build functions.

Due to the nature of the input being a natural language, a basic grammar-checking facility was introduced. It checks British English and is able to give suggestions on spelling mistakes and simpler verb-to-noun agreement. This helps to alert the programmer to what the compiler takes as a symbol when the words used are not recognised by the grammar checker.

    The suggested Graphical User Interface is given in Figure 4.1.


Figure 4.1: Suggested Interface

    4.3 Feasibility study

    Feasibility is the state or degree of a project being easily or conveniently done. A feasibility

    study is an evaluation and analysis of the possibility of the proposed project which is based

on far-reaching investigation and research (Georgakellos et al., 2009). A feasibility study is

    done so as to support the process of decision making. This study is essential in systems

    development because it is done before the system is developed and hence the sponsors of the

    system, the users and developers can conclude from this study if the development should

proceed or not. Outlined below are the feasibility study findings for the development of the Natural Language C Cross Compiler.

4.3.1 Economic Feasibility

In the development of the Natural Language C Cross Compiler the main input into the system is time. Monetary input is very low and there is no financial reason why the


development of this system should not proceed. Economically, therefore, the system is found to be feasible, hence the development proceeds.

    4.3.2 Technical feasibility

The Natural Language C Cross Compiler is developed at the National University of Science and Technology (NUST) as part of the requirements for the fulfilment of the B.Sc. (Honours) Degree in Computer Science. The necessary development tools for this system, for instance the Java compiler, are readily available for free, making this development process technically feasible. The system is also developed as a research and training vehicle for the university, which further supports its technical feasibility.

    4.4 Requirements specification

    The requirements for the Natural Language C Cross Compiler can be divided into two

    categories which are the functional requirements and the non-functional requirements.

    Functional Requirements are services that the system should provide, how the system should

    react to particular inputs and how the system should react to particular situations

    (Sommerville, 2010). They depend on the type of software being developed and can be sub-

    divided into input, processing and output requirements. Functional Requirements can be

    further subdivided into functional user requirements and functional system requirements.

    Functional user requirements are a high level description of what the system should do while

    functional system requirements describe the system service in detail. In order to produce

quality software in a software development project, it is essential to elicit all system

    requirements and clearly understand them.

    Non-functional requirements are constraints on the services or functions offered by the

system. They include timing constraints and constraints on the development process and standards. Non-functional requirements often apply to the system as a whole and also relate to the performance that will be required of the system and the technologies that will be used for development of the system under study. They do not usually apply just to individual system features or services. They are also known as quality requirements.

4.5 System Requirements

    These describe the functionality or system services the system should provide, how the

    system responds to certain input and how the system should behave in particular situations.


There are functional user requirements and functional system requirements. Functional

    user requirements are a high level description of what the system should do whereas

    functional system requirements describe the system service in detail. Non-functional

    requirements describe other characteristics of the product. There are several categories of

    these requirements that are constraints,