Natural Language C Cross Compiler



    Abstract

    Programming is taught at many institutions around the world. It is introduced as a tool to both Computer Science majors and non-Computer Science students. By the end of their studies, most students lack appreciation of the subject, mainly because the introductory courses failed to cultivate enthusiasm: they start with formal languages like C, which the less passionate students fail to comprehend. In the development world there are different types of programmers, broadly good and bad; the distinction arises from different tastes and styles. It is fairly difficult to comprehend someone's code that is not well commented, and debugging can be a painful task when code is not self-documenting. The aforementioned reasons motivated the task of coming up with a natural language interface for programming. The long-term goal is to simplify software development processes and come up with models that speed up development, so as to minimise losses due to projects taking longer than necessary.


    Acknowledgements

    Firstly I would like to thank my Supervisor, Mr K. Muzheri, for his guidance, technical support and constructive criticism throughout the project, and my Co-Supervisor, Mr S. Ngwenya, for his tireless supervision and expert contribution throughout the project. Special thanks to my family, who have provided for me throughout my entire programme both financially and with moral support. I would like to acknowledge everyone who contributed to this project but is not mentioned above; do not be disheartened, your contributions were greatly appreciated. I thank you all for giving me ideas where I lacked wisdom and for your unconditional support throughout the project and more.


    Table of Contents

    Abstract
    Acknowledgements
    List of Figures
    List of Tables
    Chapter 1: Introduction
    1.1 Introduction
    1.2 Background
    1.3 Aim
    1.4 Objectives
    1.5 Justification
    1.6 Scope
    1.7 Expected results
    1.8 Project overview
    1.9 Project Plan
    1.10 Conclusion
    Chapter 2: Literature Review
    2.1 Introduction
    2.2 Natural Language Processing
    2.3 Natural Language Programming
    2.4 Compiler Design
    2.5 Conclusion
    Chapter 3: Methodology
    3.1 Introduction
    3.2 Research Methodology
    3.3 Rapid Application Development
    3.4 Agile
    3.5 Tools
    3.6 Preferred Methodology and Justification
    3.7 Conclusion
    Chapter 4: System Analysis and Design
    4.1 Introduction
    4.2 Requirements Elicitation
    4.3 Feasibility Study
    4.4 Requirements Specification
    4.5 System Requirements
    4.6 System Design
    4.7 Conclusion
    Chapter 5: Implementation and Testing
    5.1 Introduction
    5.2 Tools
    5.3 Deployment Architecture
    5.4 System Functionality
    5.5 Software Testing
    5.6 Conclusion
    Chapter 6: Recommendations and Conclusions
    6.1 Introduction
    6.2 Classification of the System
    6.3 Review of the Project's Aim and Objectives
    6.4 Challenges Encountered
    6.5 Recommendations for Future Work
    6.6 Conclusion
    References
    APPENDICES


    List of Figures

    Figure 2.1: Parse Tree
    Figure 3.1: A generic agile development process features an initial planning stage, rapid repeats of the iteration stage, and some form of consolidation before release
    Figure 4.1: Suggested Interface
    Figure 4.2: Architecture Diagram for the Natural Language C Cross Compiler
    Figure 4.3: Class Diagram for the Natural Language C Cross Compiler front-end
    Figure 4.4: Sequence Diagram for the Natural Language C Cross Compiler
    Figure 5.1: The interface


    List of Tables

    Table 1.1: Project Plan


    Chapter 1

    Introduction

    1.1 Introduction

    Natural Language Programming (NLP) is an ontology-assisted way of programming in terms of natural language sentences, for example in English. The goal of NLP is to make computers easier to use and to enable people who are not professional computer scientists to teach new behaviour to their computers.

    Natural Language Programming builds up a single program, or a library of routines, programmed through natural language sentences using an ontology that defines the available data structures in a high-level programming language. The smallest unit of statement in Natural Language Programming is a sentence. Each sentence is stated in terms of concepts from the underlying ontology. In a Natural Language Program text, each sentence unambiguously compiles into a procedure call in the underlying high-level programming language such as C, C++ or Java.

    The ideal easy-to-use interface for programming would be a natural language interface: just tell the computer what you want.
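    By way of illustration, the mapping from a sentence to a procedure call can be sketched in a few lines of Python. This is a hypothetical toy, not the system described in this project; the ontology entries and routine names are invented for the example.

    # Toy sketch: compile an English sentence into a C-style procedure call
    # by looking its verb up in a tiny hand-made ontology. All names here
    # are hypothetical.
    ONTOLOGY = {
        "print": "printf",  # English concept -> routine in the target language
        "read": "scanf",
    }

    def compile_sentence(sentence):
        words = sentence.lower().rstrip(".").split()
        verb, argument = words[0], " ".join(words[1:])
        if verb not in ONTOLOGY:
            raise ValueError("unknown concept: " + verb)
        return '%s("%s");' % (ONTOLOGY[verb], argument)

    print(compile_sentence("Print hello world."))  # -> printf("hello world");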

    Attempts have been made in the natural language programming field to create natural language interfaces for programming. The NLC prototype of 1979 (Liu and Lieberman, 2005) was built with the capability of handling low-level operations such as the transformation of type declarations into programmatic expressions. More recently, a system called METAFOR was developed, capable of translating natural language statements into class descriptions with associated objects and methods (Liu and Lieberman, 2005). Efforts have been focused on experiments on the feasibility of using natural language in programming, and less has been done to come up with fully functional NLP interfaces for programming.

    1.2 Background

    A natural language interface for programming should result in greater readability, as well as

    making possible a more intuitive way of writing code. Code written in English is much easier

    to read and understand than in a traditional programming language. Quite often, it is a


    difficult task to read another programmer's code. Even understanding one's own code can be hard after a period of time. This is because, without sufficient commenting, one cannot tell what the individual steps are meant to do together.

    Debugging is a generic term for finding and fixing errors in a program. These errors can be syntactic, which are normally detected by the compiler or interpreter, or logical, which cause unwanted behaviours (Halpern, 1966). The latter can be extraordinarily difficult to find: it involves knowing exactly what each line in the program does. If what the programmer believes a statement does and what it actually does diverge, there is the potential for catastrophe.

    Early work in natural language programming was deemed ambitious, targeting the generation of complete computer programs that would compile and run. For instance, the NLC prototype (Ballard and Biermann, 1979) aimed at creating a natural language interface for processing data stored in arrays and matrices, with the ability to handle low-level operations such as the transformation of numbers into type declarations, e.g. float-constant (2.0), or turning natural language statements like "add y1 to y2" into the programmatic expression y1 + y2.
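    The "add y1 to y2" example above can be reproduced with a small pattern-based rewriter. The Python sketch below is illustrative only, not the NLC implementation; the second pattern is an invented extension of the same idea.

    import re

    # Rewrite simple arithmetic commands in the spirit of the NLC example:
    # "add y1 to y2" becomes the programmatic expression "y1 + y2".
    PATTERNS = [
        (re.compile(r"add (\w+) to (\w+)"), r"\1 + \2"),
        (re.compile(r"subtract (\w+) from (\w+)"), r"\2 - \1"),
    ]

    def to_expression(utterance):
        for pattern, template in PATTERNS:
            match = pattern.fullmatch(utterance.strip().lower())
            if match:
                return match.expand(template)
        raise ValueError("no pattern matched")

    print(to_expression("add y1 to y2"))  # -> y1 + y2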

    More recently, however, researchers have started to look again at the problem of natural language programming, but this time with more realistic expectations and with a different, much larger pool of resources, for example broad-spectrum common sense knowledge (Singh, 2002) and a suite of significantly advanced, publicly available natural language processing tools. For instance, (Pane et al., 2001) conducted a series of studies with non-programming fifth-grade users and identified some of the programming models implied by the users' natural language descriptions. In a similar vein, (Liu and Lieberman, 2005) conducted a feasibility study and showed how a partial understanding of a text, coupled with a dialogue with the user, can help non-expert users make their intentions more precise when designing a computer program. Their study resulted in a system called METAFOR (Liu and Lieberman, 2005), able to translate natural language statements into class descriptions with the associated objects and methods.

    The challenge most programmers face today is trying to make code readable for the next programmer. Current methods of system documentation do not do much in terms of documenting the code itself. This leaves commenting as the only documentation any


    source code has. A natural language interface uses the comments as program statements, which results in code that is self-documenting and readable for anyone who goes through the code. This makes debugging a less daunting task, as program logic will be in plain English.

    1.3 Aim

    The aim of the project is to develop a natural language compiler based on the C programming

    language.

    1.4 Objectives

    To compile natural language text as program input

    To perform syntax error checks on input

    To perform grammatical error checks on input

    To generate equivalent object code executable on the main platform

    1.5 Justification

    It is notoriously difficult to construct conventional software systems systematically and on time (Sommerville, 2008), with up to 20% of industrial development projects failing. With further study and improvements, the aim is to bridge the gap between how problems are defined and how they are solved: problems are defined in natural language but implemented using formal programming languages. This gap has caused delays in the delivery of software, which ultimately translates to losses in the millions in some cases.

    A natural language interface gives code that is readable and easier to maintain: in essence, self-documenting code. Consider a scenario where, as a programmer, you are tasked to add functionality to an application you did not write but whose code you have at hand, and which is not well documented. This task takes a long time. The current methods used for documenting software projects do little in terms of documenting the code itself. Going through another programmer's code is a difficult task, and sometimes even your own code is, after a long time. Good programmers write self-documenting code, and yet when faced with the less preferred scenario it will take a long time to make even a simple change to a system.


    1.6 Scope

    The project is aimed at developing a natural language C-based compiler. The natural language used is English. The compiler extracts information about variables, operators and loops by analysing the natural language program input.

    It focuses on the representation of the parts of the natural language, English, that can be mapped to existing data structures such as variables, structs, lists, arrays and loops. The compiler extracts nouns, verbs and overtly expressed actions; these can be mapped through the use of ontologies to variables, iterations (loops) and statements.

    The compiler is based on the data structures that are built into the C programming language. It does not cover the graphical implementation of the C programming language.

    1.7 Expected results

    The compiler takes English natural language text as input. It performs grammatical and syntax error checks on the natural language program, reporting back any errors in the program. If no errors are found, it compiles the natural language program. The compiler generates the equivalent object code that is executable on the main platform, that is, Windows or Linux.

    1.8 Project overview

    This project is divided into six main chapters: the introductory chapter, the literature review, methodology, systems analysis and design, implementation and testing, and conclusions and recommendations for future research.

    The first chapter gives an introduction to the project. It states the background and aim of the project and the justification for why the research topic was selected.

    The Literature review gives an overview of the current systems and the fundamentals

    employed in the area of this research. This chapter also develops an argument on the

    relevance of this research.

    The Methodology chapter gives an overview of the available methodologies and elaborates the justification for the methodology selected. The chapter also covers the techniques and the methodology used in the development of the Natural Language C Cross Compiler.


    The Systems Analysis and Design chapter focuses on the analysis and design of the system. It gives a detailed analysis of the functional requirements and summarises the system in the form of system development designs. Using the Unified Modelling Language, the conceptual model of the system is shown, and it is on these designs that the system is implemented.

    The Implementation chapter shows how the transformation from design to application is done. Screenshots of the completed system are shown. This chapter also covers testing and gives an overview of problems encountered during the implementation stage, as well as the resultant solutions.

    The last chapter gives the conclusion of the research: the set objectives are measured against the results, and suggestions for further research are stated. Divergences, if any, are justified in this chapter.

    1.9 Project Plan

    The project follows the plan below, with the activities and the timeline for each milestone shown in the following table.

    Activity Number   Milestone                           Timeline
    1                 Analysis and design                 4 Weeks
    2                 Implementation                      5 Weeks
    3                 Testing and Evaluation              4 Weeks
    4                 Final Demonstration and Reporting   2 Weeks

    Table 1.1: Project Plan


    1.10 Conclusion

    This chapter discussed the research project, its aim and objectives. To introduce the project

    the Natural Language Processing and Programming was discussed and the need for the

    system was argued as well. Also included in this chapter was the project plan which is

    followed in the outline of the rest of this project documentation.


    Chapter 2

    Literature Review

    2.1 Introduction

    Natural Language Processing is an interdisciplinary research area at the border between linguistics and artificial intelligence, aiming at developing computer programs capable of human-like activities related to understanding or producing texts or speech in a natural language, such as English. Modern approaches to NLP are based on machine learning, a type of artificial intelligence that examines and uses patterns in data to improve a program's own understanding. The most important applications of natural language processing include information retrieval and information organisation, machine translation, and natural language interfaces, among others.

    The work of improving Natural Language Processing has been divided into tasks useful for application development and analysis. These range from syntactic analysis, such as part-of-speech tagging, chunking and parsing, to semantic analysis, such as semantic role labelling, named entity extraction and anaphora resolution.

    Natural Language Programming is a branch separate from Natural Language Processing but within artificial intelligence. Natural Language Programming is the interpretation and compilation of instructions communicated in natural language into object code. It depends on advances in Natural Language Processing.

    2.2 Natural Language Processing

    Natural Language Processing (NLP) targets the conversion of human language into formal representations that can be manipulated by computers. Natural Language Processing is not often considered a goal in and of itself but rather a means of accomplishing a certain task; for instance, information retrieval systems use NLP.

    Natural Language Processing seeks to accomplish human-like processing: that is, to be able to paraphrase input text, convert the text to another language, and answer questions about the text.


    2.2.1 Natural Language Processing Applications

    There are huge amounts of data on the Internet, and applications for processing large amounts of text require Natural Language Processing expertise. Typical requirements are to classify text into categories; index and search large texts; translate automatically; understand speech, for example phone conversations; extract information, such as useful details from resumes; summarise automatically, that is, condense one book into one page; answer questions; acquire knowledge; and generate texts or dialogues.

    Natural language processing provides both theory and implementations for a range of applications, including information retrieval and information extraction. Information extraction focuses on the recognition, tagging, and extraction into a structured representation of certain key elements of information, for example persons, companies, locations and organisations, from large collections of text. These extractions can then be utilised for a range of applications including question answering, visualisation, and data mining.

    Question Answering, in contrast to Information Retrieval (which provides a list of potentially relevant documents in response to a user's query), provides the user with either just the text of the answer itself or answer-providing passages.

    At the higher levels of NLP, at the discourse level, there is summarisation. Its implementation reduces a larger text into a shorter, richly constituted, abbreviated narrative representation of the original document.

    Machine Translation (MT) can be considered the oldest of all NLP applications; various levels of NLP have been utilised in MT systems, ranging from the word-based approach to applications that include higher levels of analysis.

    2.2.2 Computational Linguistics

    A simple sentence consists of a subject followed by a predicate. A word in a sentence acts as a part of speech (POS). For an English sentence, the parts of speech are nouns, pronouns, adjectives, verbs, adverbs, prepositions, conjunctions, and interjections. A noun tells us about names, whereas a verb talks of action. Adjectives and adverbs modify nouns and verbs, respectively. Prepositions express relationships between nouns and other parts of speech. Conjunctions join words and groups together, and interjections express strong feelings. In the


    spoken language, the problem of understanding speech can be divided into three areas: acoustic-phonetic, morphological-syntactic, and semantic-pragmatic processes.

    In computational linguistics the lexicon supplies paradigmatic information about words, including part-of-speech labels, irregular plurals, and sub-categorisation information for verbs. In the past, lexicons were quite small and were constructed largely by hand. Effective natural language processing requires increased amounts of lexical information. A recent trend has been the use of automatic techniques applied to large corpora for the purpose of acquiring lexical information from text (Zernik 1991). Statistical techniques are an important aspect of automatically mining lexical information. (Manning 1993) uses such techniques to gather sub-categorisation information for verbs. (Brent 1993) also discovers sub-categorisation information; in addition he attempts to automatically discover verbs in the text. (Liu and Soo 1993) describe a method for mining information about thematic roles. The additional information being added to the lexicon increases its complexity, and this added complexity requires that attention be paid to the organisation of the lexicon (Zernik 1991). (McCray et al 1993) discuss the structure of a large lexicon designed and implemented to support syntactic processing.

    Automatically disambiguating part-of-speech labels in text is an important research area, since ambiguity is particularly prevalent in the spoken language. Programs that resolve part-of-speech labels (often called automatic taggers) are typically around 95% accurate (Bod 1998). Taggers can serve as pre-processors for syntactic parsers and contribute significantly to efficiency. There have been two main approaches to automatic tagging: probabilistic and rule-based. Typically, probabilistic taggers are trained on disambiguated text and vary as to how much training text is needed and how much human effort is required in the training process. (Schütze 1993) described a tagger that requires very little human intervention. Further variation concerns what to do about unknown words and the ability to deal with large numbers of tags.

    One drawback of stochastic taggers is that they are very large programs requiring considerable computational resources. (Brill 1992) describes a rule-based tagger which is as accurate as stochastic taggers but with a much smaller program; the program is, however, slower than stochastic taggers. Building on Brill's approach, (Roche and Schabes 1995) propose a rule-based, finite-state tagger which is much smaller and faster than stochastic implementations, while accuracy and other characteristics remain comparable.
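    As a concrete toy illustration of the probabilistic approach, a unigram tagger simply assigns each word the tag it carried most often in a disambiguated training text. The Python sketch below uses an invented six-word corpus and is nowhere near the 95% accuracy of the real taggers cited above.

    from collections import Counter, defaultdict

    # Minimal unigram tagger: each word gets the tag it carried most often
    # in the (tiny, invented) disambiguated training data.
    training = [("the", "Art"), ("boy", "N"), ("ate", "V"),
                ("the", "Art"), ("ice", "N"), ("cream", "N")]

    counts = defaultdict(Counter)
    for word, tag in training:
        counts[word][tag] += 1

    def tag(word):
        if word in counts:
            return counts[word].most_common(1)[0][0]
        return "N"  # crude guess for unknown words

    print([(w, tag(w)) for w in "the boy ate the ice cream".split()])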


    A traditional approach to natural language processing takes as its basic assumption that a system must assign a complete constituent analysis to every sentence it encounters. The methods used to attempt this are drawn from mathematics, with context-free grammars playing a large role in assigning syntactic constituent structure. (Partee et al 1993) provide an accessible introduction to the theoretical constructs underlying this approach, including set theory, logic, formal language theory, and automata theory, along with the application of these mechanisms to the syntax and semantics of natural language. For syntax, it uses a unification-based implementation of a generalised phrase structure grammar (Gazdar et al. 1985) and handles an impressive number of syntactic structures. In continuing research in this tradition, context-free grammars have been extended in various ways. The mildly context-sensitive grammars, such as tree adjoining grammars, have had considerable influence on recent work concerned with the formal aspects of parsing natural language. Several recent papers pursue non-traditional approaches to syntactic analysis. One such technique is partial, or underspecified, analysis. For many applications such an analysis is entirely sufficient and can often be more reliably produced than a fully specified structure. (Chen 1994), for example, employs statistical methods combined with a finite state mechanism to impose an analysis which consists only of noun phrase boundaries, without specifying their complete internal structure or their exact place in a complete tree structure. (Agarwal and Boggess 1992) successfully rely on semantic features in a partially specified syntactic representation for the identification of coordinate structures. In an innovative application of dependency grammar and dynamic programming techniques, (Kurohashi and Nagao 1994) address the problem of analysing very complicated coordinate structures in Japanese.

    A recent innovation in syntactic processing has been investigation into the use of statistical techniques. In probabilistic parsing, probabilities are extracted from a parsed corpus for the purpose of choosing the most likely rule when more than one rule can apply during the course of a parse (Magerman and Weir 1992). In another application of probabilistic parsing the goal is to choose the (semantically) best analysis from a number of syntactically correct analyses for a given input (Briscoe et al 1993).

    Another application of statistical methodologies to the parsing process is grammar induction, where the rules themselves are automatically inferred from a bracketed text; however, results in the general case are still preliminary. (Pereira and Schabes 1992) discuss inferring a grammar from bracketed text relying heavily on statistical techniques, while (Brill 1993) uses only modest statistics in his rule-based method.


    Automatic word-sense disambiguation depends on the linguistic context encountered during processing. (McRoy 1992) appeals to a variety of cues while parsing, including morphology, collocations, semantic context, and discourse. Her approach is not based on statistical methods, but rather is symbolic and knowledge-intensive. Statistical methods exploit the distributional characteristics of words in large texts and need training, which can come from several sources, as well as human intervention. (Gale et al 1992) give an overview of several statistical techniques they have used for word-sense disambiguation and discuss research on evaluating results for their systems and others. They have used two training techniques, one based on a bilingual corpus and another on Roget's Thesaurus. (Justeson and Katz 1995) use both rule-based and statistical methods. The attractiveness of their method is that the rules they use provide linguistic motivation.

    Formal semantics is rooted in the philosophy of language and has as its goal a complete and rigorous description of the meaning of sentences in natural language. It concentrates on the structural aspects of meaning. The papers in (Rosner and Johnson 1992) discuss various aspects of the use of formal semantics in computational linguistics and focus on Montague grammar (Montague 1974). (King 1992) provides an overview of the relation between formal semantics and computational linguistics. Several papers in Rosner and Johnson discuss research in the situation semantics paradigm (Barwise and Perry 1983), which has recently had wide influence in computational linguistics, especially in discourse processing. Lexical semantics (Cruse 1986) has recently become increasingly important in natural language processing. This approach to semantics is concerned with psychological facts associated with the meaning of words. (Levin 1993) analyses verb classes within this framework, while the papers in (Levin and Pinker 1991) explore additional phenomena, including the semantics of events and verb argument structure. Another application of lexical semantics is WordNet, a lexical database that attempts to model cognitive processes. The articles in (Saint-Dizier and Viegas 1995) discuss psychological and foundational issues in lexical semantics as well as a number of aspects of using lexical semantics in computational linguistics.

    Another approach to language analysis based on psychological considerations is cognitive

    grammar (Langacker 1988). (Olivier and Tsujii 1994) deal with spatial prepositions in this

    framework, while (Davenport and Heinze 1995) discuss more general aspects of semantic

    processing based on cognitive grammar.


    Discourse analysis is concerned with coherent processing of text segments larger than the

    sentence and assumes that this requires something more than just the interpretation of the

    individual sentences. (Grosz, Joshi and Weinstein 1995) provide a broad-based discussion of

    the nature of discourse, clarifying what is involved beyond the sentence level, and how the

    syntax and semantics of the sentences support the structure of the discourse. In their analysis,

    discourse contains linguistic structure (syntax, semantics), focus of attention, and intentional

    structure (plan of participants) and is structured into coherent segments. During discourse

    processing one important task for the hearer is to identify the referents of noun phrases.

    Inferencing is required for this identification. A coherent discourse lessens the amount of

    inferencing required of the hearer for comprehension. Throughout a discourse the particular

    way that the speaker maintains focus of attention or centring through choice of linguistic

    structures for referring expressions is particularly relevant to discourse coherence.

    Other work in computational approaches to discourse analysis has focused on particular aspects of processing coherent text. (Hajicova et al 1995) distinguish topic (old information) from focus (new information) within a sentence; information of this sort is relevant to tracking the focus of attention. (Lappin and Leass 1994) are primarily concerned with intra-sentential anaphora resolution, which relies on syntactic cues rather than discourse cues. Nonetheless, they also address inter-sentential anaphora, which relies on several discourse cues, such as the saliency of a noun phrase, determined by such things as grammatical role, frequency of mention, proximity, and how recent a sentence is. (Hul et al 1995) use a similar notion of saliency for anaphora resolution and resolve deictic expressions with the same principles. (Passonneau and Litman 1993) study the nature of discourse segments and the linguistic structures which cue them. (Sonderland and Lehnert 1994) investigate machine learning techniques for discovering discourse-level semantic structure.

    Several recent papers investigate aspects of discourse processing having to do with the psychological state of the participants in a discourse, including goals, intentions, and beliefs. (Asher and Lascarides 1994) investigate a formal model for representing the intentions of the participants in a discourse and the interaction of such intentions with discourse structure and semantic content. (Traum and Allen 1994) describe the idea of social obligation to shed light on the behaviour of discourse. (Wiebe 1994) investigates psychological point of view in third-person narrative and provides an insightful algorithm for tracking this phenomenon in text; the point of view of each sentence is either that of the narrator or of any one of the characters in the narrative.


    2.2.3 Levels of knowledge in language understanding

    A language understanding program must have considerable knowledge about the structure of the language, including what the words are and how they combine into phrases and sentences. It must also know the meanings of the words, how they contribute to the meaning of a sentence, and the context in which they are being used. In addition, the program must have general world knowledge and knowledge about how humans reason.

    The components of the knowledge needed to understand language are as follows. Phonological knowledge relates sounds to the words we recognise; a phoneme is the smallest unit of sound, and phones are aggregated into words. Morphological knowledge is lexical knowledge, which relates to word construction from basic units called morphemes; a morpheme is the smallest unit of meaning, for example the construction of "friendly" from "friend" and "ly". Syntactic knowledge is about how words are organised to construct meaningful and correct sentences. Pragmatics is the high-level knowledge about how to use sentences in different contexts and how the context affects the meanings of the sentences.

    2.2.4 Grammars and Languages

    A language can be generated given its grammar G = (V, Σ, S, P), where V is the set of variables, Σ is the set of terminal symbols, which appear at the end of generation, S is the start symbol, and P is the set of production rules. The corresponding language of G is L(G).

    Consider that the various tuples are as given in Listing 2.1.


    S → aS
    S → aAB
    AB → BA
    aA → ab
    aA → aa

    Listing 2.3: Third Language Generation

    Where uppercase letters are non-terminals and lowercase are terminals.

    The type-2 grammars are:

    S → aS
    S → aSb
    S → aB
    S → aAB
    A → a
    B → b

    Listing 2.4: Fourth Language Generation

    The type-3 grammar is the simplest, having rewrite rules such as:

    S → aS
    S → ε

    Listing 2.5: Fifth Language Generation

    The types 1, 2 and 3 are called context-sensitive, context-free, and regular grammars respectively, and hence the corresponding names for the languages as well. Formal languages are mostly based on type-2 grammars, as types 0 and 1 are not well understood and are difficult to implement.
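    To make the type-3 case concrete, the grammar of Listing 2.5 generates the regular language a* (zero or more a's). The Python sketch below derives its strings by repeatedly applying the two rules, assuming the elided second rule is indeed the empty production:

    # Derive strings of the type-3 grammar in Listing 2.5 (S -> aS | empty),
    # i.e. the regular language a*: apply S -> aS repeatedly and terminate
    # each derivation with the empty production.
    def generate(max_len):
        sentential, derived = "S", []
        while len(sentential) <= max_len + 1:
            derived.append(sentential.replace("S", ""))   # S -> empty
            sentential = sentential.replace("S", "aS")    # S -> aS
        return derived

    print(generate(4))  # ['', 'a', 'aa', 'aaa', 'aaaa']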

    2.2.5 Structural Representation

    It is convenient to represent sentences as a tree or a graph to help expose the structure of the constituent parts. For example, the sentence "the boy ate an ice cream" can be represented as the tree shown in Figure 2.1.

    Figure 2.1: Parse Tree

    For the purpose of computation a tree must also be represented as a record, a list or some

    similar data structure. For example, the tree above is represented as a list:

    (S (NP (Art the)
           (N boy))
       (VP (V ate)
           (NP (Art an) (N icecream))))

    Listing 2.6: Tree Representation as a list
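    For computation, the same tree can equally be held as nested tuples in a general-purpose language. A brief Python sketch of Listing 2.6, with a walk that recovers the words from the leaves:

    # Listing 2.6 as nested (label, children...) tuples, plus a walk that
    # recovers the sentence from the leaves.
    tree = ("S",
            ("NP", ("Art", "the"), ("N", "boy")),
            ("VP", ("V", "ate"),
                   ("NP", ("Art", "an"), ("N", "icecream"))))

    def leaves(node):
        label, *children = node
        if len(children) == 1 and isinstance(children[0], str):
            return [children[0]]          # a leaf: (category, word)
        return [w for child in children for w in leaves(child)]

    print(" ".join(leaves(tree)))  # -> the boy ate an icecream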


    A more extensive English grammar can be obtained with the addition of other constituencies such as prepositional phrases (PP), adjectives (ADJ), determiners (DET), adverbs (ADV), auxiliary verbs (AUX), and many other features. Correspondingly, the other rewrite rules are as follows.

    PP → Prep NP
    VP → V ADV
    VP → V PP
    VP → V NP PP
    VP → AUX V NP
    Det → Art ADJ
    Det → Art

    Listing 2.7: Rewrite rules

    2.2.6 Pattern matching

    The idea here is an approach to natural language processing that interprets input utterances as a whole, rather than building up their interpretation by combining the structure and meaning of words or other lower-level constituents. That means the interpretations are obtained by matching patterns of words against the input utterance. For a deep level of analysis in pattern matching, a large number of patterns are required even for a restricted domain. This problem can be ameliorated by hierarchical pattern matching, in which the input is gradually normalised through pattern matching against sub-phrases. Another way to reduce the number of patterns is by matching with semantic primitives instead of words.
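    A toy sketch of this idea in Python follows; the patterns and replies are invented for the example, not drawn from any real system. Sub-phrase patterns first normalise the input, then the utterance is matched as a whole:

    import re

    # Sketch of hierarchical pattern matching: the input is normalised by
    # sub-phrase patterns first, then matched as a whole against top-level
    # patterns. All patterns here are hypothetical.
    SUBPHRASE = [(re.compile(r"\b(hi|hello|hey)\b"), "GREETING")]
    TOPLEVEL = {
        re.compile(r"^GREETING computer$"): "Hello, user.",
        re.compile(r"^what time is it$"): "It is noon.",
    }

    def respond(utterance):
        text = utterance.lower().rstrip("?.!")
        for pattern, token in SUBPHRASE:         # normalise sub-phrases
            text = pattern.sub(token, text)
        for pattern, reply in TOPLEVEL.items():  # match utterance as a whole
            if pattern.match(text):
                return reply
        return "I do not understand."

    print(respond("Hello computer!"))  # -> Hello, user.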

    2.2.7 Syntactically driven Parsing

    Syntax means the ways that words can fit together to form higher-level units such as phrases, clauses and sentences. Syntactically driven parsing therefore means that interpretations of larger groups of words are built up out of the interpretations of their syntactic constituent words or phrases. In a way this is the opposite of pattern matching, where the interpretation of the input is done as a whole. Syntactic analyses are obtained by application of a grammar that determines what sentences are legal in the language that is being parsed.
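    A minimal recursive-descent sketch of syntactically driven parsing, using the toy grammar implicit in Figure 2.1 (S → NP VP, NP → Art N, VP → V NP) and a five-word lexicon invented for the example:

    # Recursive-descent parser for the toy grammar implicit in Figure 2.1:
    # S -> NP VP, NP -> Art N, VP -> V NP. Returns a Listing 2.6-style tree.
    LEXICON = {"the": "Art", "an": "Art", "boy": "N", "icecream": "N", "ate": "V"}

    def expect(words, category):
        if not words or LEXICON.get(words[0]) != category:
            raise SyntaxError("expected %s at %r" % (category, words))
        return (category, words[0])

    def parse_np(words):
        art, noun = expect(words, "Art"), expect(words[1:], "N")
        return ("NP", art, noun), words[2:]

    def parse_vp(words):
        verb = expect(words, "V")
        np, rest = parse_np(words[1:])
        return ("VP", verb, np), rest

    def parse(words):
        np, rest = parse_np(words)
        vp, rest = parse_vp(rest)
        if rest:
            raise SyntaxError("trailing words: %r" % rest)
        return ("S", np, vp)

    print(parse("the boy ate an icecream".split()))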


    2.2.8 Semantic Grammars

    Natural language analysis based on semantic grammar is somewhat similar to syntactically driven parsing, except that in a semantic grammar the categories used are defined semantically as well as syntactically.

    Case frame instantiation is one of the major parsing techniques under active research today. It has some very useful computational properties, such as its recursive nature and its ability to combine bottom-up recognition of key constituents with top-down instantiation of less structured constituents.

    2.2.9 Applications of Natural Language Processing

    As natural language processing technology matures it is increasingly being used to support

    other computer applications. Such use naturally falls into two areas, one in which linguistic

    analysis merely serves as an interface to the primary program, and another in which natural

    language considerations are central to the application.

    Natural language interfaces to database management systems (for example Bates 1989) translate the user's input into a request in a formal database query language, and the program then proceeds as it would without the use of natural language processing techniques. It is normally the case that the domain is constrained and the language of the input consists of comparatively short sentences with a constrained set of syntactic structures. The design of question answering systems is similar to that of interfaces to database management systems.

    One difference is that the knowledge base supporting the question answering system does not have the structure of a database. Processing in such a system not only requires a linguistic description of the user's requests, but must also provide a representation of the encyclopaedia itself. As with the interface to a database management system, the requests are likely to be short and have a constrained syntactic structure. (Lauer et al 1992) provide some general considerations concerning question answering systems and describe several applications.

    In message understanding systems, a fairly complete linguistic analysis may be required, but

    the messages are relatively short and the domain is often limited. (Davenport and Heinze

    1995) describe such a system in a military domain.


    In information filtering, text categorisation, and automatic abstracting, no constraints on the linguistic structure of the documents being processed can be assumed. One mitigating factor is that effective processing may not require a complete analysis. For all of these applications there are also statistically based systems built on frequency distributions of words. These systems work fairly well, but most people feel that for further improvements, and for extensions, some sort of understanding of the texts, such as that provided by linguistic analysis, is required.

    Information filtering and text categorisation are concerned with comparing one document to another. In both applications, natural language processing imposes a linguistic representation on each document being considered. In text categorisation a collection of documents is inspected and all documents are grouped into several categories based on the characteristics of the linguistic representations of the documents. (Blosseville et al. 1992) describe an interesting system which combines natural language processing, statistics, and an expert system. In information filtering, documents satisfying some criterion are singled out from a collection. (Jacobs and Rau 1990) discuss a program which imposes a quite sophisticated semantic representation for this purpose.

    In automatic abstracting, a summary of each document is sought, rather than a classification of a collection. The underlying technology is similar to that used for information filtering and text categorisation: the use of some sort of linguistic representation of the documents. Of the two major approaches, one (McKeown and Radev 1995) puts more emphasis on semantic analysis for this representation and the other (Paice and Jones 1993), less. Information retrieval systems typically allow a user to retrieve documents from a large bibliographic database. During the information retrieval process a user expresses an information need through a query. The system then attempts to match this query to those documents in the database which satisfy the user's information need. In systems which use natural language processing, both query and documents are transformed into some sort of linguistic structure, and this forms the basis of the matching. Several recent information retrieval systems employ varying levels of linguistic representation for this purpose. (Sembok and van Rijsbergen 1990) base their experimental system on formal semantic structures, while (Myaen et al 2004) construct lexical semantic structures for document representations. (Strzalkowski 1994) combines syntactic processing and statistical techniques to enhance the accuracy of representation of the documents. In an innovative approach to document representation for


    information retrieval, (Liddy et al 1995) use several levels of linguistic structure, including

    lexical, syntactic, semantic, and discourse.

    2.2.10 Natural Language Processing based systems

    A number of systems currently use natural language processing to accomplish targeted tasks. Some of these include systems for text summarisation, page ranking, natural language interfaces to databases, text mining and language translation.

    Currently, researchers have been working on coming up with natural language programming interfaces. Some of these systems have been targeted at learning institutions for the acquisition of a first programming language. This is because it has been noted across tertiary institutions that the dropout rate for courses involving programming has been significantly high, as high as 30% (Guzdial & Soloway, 2002), also causing a lack of appreciation of programming in general among students majoring in Computer Science.

    The vast amount of information on the Internet, and the information needed for doing day-to-day tasks in a number of fields, has called for systems that can search comprehensively for information to improve productivity. This has led to data and text mining systems in fields like medicine.

    2.3 Natural Language Programming

    Natural Language Programming is the interpretation and compilation of instructions

    communicated in natural language into object code. It uses natural language processing

    techniques for the extraction of information from natural language text input.

    2.3.1 Natural Language Programming based systems

    The NLC prototype (Ballard and Biermann, 1979) was one of the attempts made to come up with a natural language programming interface. It had the capability of handling low-level operations such as the transformation of type declarations into programmatic expressions. The system is capable of turning statements like "add y1 to y2" into the expression y1 + y2.


    More recently, in 2005, a system called METAFOR was implemented. METAFOR is capable of translating natural language statements into class descriptions with associated objects and methods. METAFOR interactively converts English sentences to partially specified program code, to be used as a starting point for a more detailed program. A user study by Henry Lieberman showed that METAFOR is capable of capturing enough programmatic semantics to facilitate non-programming users' and beginners' conceptualisation of programming problems.

    2.4 Compiler Design

    Compilers bridge source programs in high-level languages with the underlying hardware. A compiler has four major tasks: to determine the correctness of the syntax of programs, to generate correct and efficient object code, to perform run-time organisation, and to format output according to assembler and/or linker conventions. A compiler consists of three main parts: the front-end, the middle-end, and the back-end.

    The front-end checks whether the program is correctly written in terms of the programming language syntax and semantics. Here legal and illegal programs are recognised, and errors, if any, are reported in a useful way. Type checking is also performed by collecting type information. The front-end then generates an intermediate representation, or IR, of the source code for processing by the middle-end.

    The middle-end is where optimisation takes place. Typical transformations for optimisation are removal of useless or unreachable code, discovery and propagation of constant values, relocation of computation to a less frequently executed place (e.g., out of a loop), or specialisation of computation based on the context.

    The middle-end generates another intermediate representation for the back-end that follows; most optimisation efforts are focused on this part. The back-end is responsible for translating the IR from the middle-end into assembly code. Target instructions are chosen for each IR instruction. Register allocation assigns processor registers to the program variables where possible. The back-end utilises the hardware by figuring out how to keep parallel execution units busy, filling delay slots, and so on. Although most optimisation problems are NP-hard, heuristic techniques are well developed.
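    One of the middle-end transformations named above, discovery and propagation of constant values, can be sketched over a toy three-address IR. This is a simplified illustration in Python, not a production pass; the IR layout is invented for the example.

    # Sketch of constant folding/propagation over a toy three-address IR of
    # (dest, op, a, b) tuples. Folded results are removed from the stream.
    def fold_constants(ir):
        known, out = {}, []
        for dest, op, a, b in ir:
            a = known.get(a, a)          # propagate known constants
            b = known.get(b, b)
            if op == "+" and isinstance(a, int) and isinstance(b, int):
                known[dest] = a + b      # fold at compile time
            else:
                out.append((dest, op, a, b))
        return out, known

    ir = [("t1", "+", 2, 3), ("t2", "+", "t1", "x")]
    print(fold_constants(ir))  # t1 folds to 5; t2 becomes ('t2', '+', 5, 'x')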


    2.4.1 What is a compiler?

    In order to reduce the complexity of designing and building computers, nearly all of them are made to execute relatively simple commands (but do so very quickly). A program for a computer must be built by combining these very simple commands into a program in what is called machine language. Since this is a tedious and error-prone process, most programming is instead done using a high-level programming language. This language can be very different from the machine language that the computer can execute, so some means of bridging the gap is required. This is where the compiler comes in. A compiler translates (or compiles) a program written in a high-level programming language, suitable for human programmers, into the low-level machine language required by computers. During this process, the compiler attempts to spot and report obvious programmer mistakes.

    2.4.2 The phases of a compiler

    A typical way to structure the writing of a compiler is to split the compilation into several phases with well-defined interfaces (Alfred 2007). Theoretically, these phases operate in sequence (though in practice they are often interleaved), each phase (except the first) taking the output from the previous phase as its input. It is common to let each phase be handled by a separate module. Some of these modules are written by hand, while others may be generated from specifications. Often, some of the modules can be shared between several compilers.

    In some compilers the ordering of phases may differ slightly; some phases may be combined or split into several phases, or some extra phases may be inserted between those mentioned in the following paragraphs.

    Lexical analysis is the initial part of reading and analysing the program text: the text is read and divided into tokens, each of which corresponds to a symbol in the programming language, for example a variable name, keyword or number.
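    A minimal lexer along these lines can be sketched in Python with regular expressions; the keyword set is a toy assumption, not the C keyword list.

    import re

    # Minimal lexical analyser: divide program text into the token kinds
    # named above (keyword, identifier/variable name, number), plus a
    # catch-all for operators. Toy keyword set.
    KEYWORDS = {"if", "while", "return"}
    TOKEN = re.compile(r"\s*(?:(?P<num>\d+)|(?P<word>[A-Za-z_]\w*)|(?P<op>\S))")

    def lex(text):
        tokens = []
        for match in TOKEN.finditer(text):
            kind, value = match.lastgroup, match.group(match.lastgroup)
            if kind == "word":
                kind = "keyword" if value in KEYWORDS else "identifier"
            elif kind == "num":
                kind = "number"
            else:
                kind = "operator"
            tokens.append((kind, value))
        return tokens

    print(lex("while x1 < 10"))
    # [('keyword', 'while'), ('identifier', 'x1'), ('operator', '<'), ('number', '10')]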

    The syntax analysis phase takes the list of tokens produced by the lexical analysis and arranges them in a tree structure (called the syntax tree) that reflects the structure of the program. This phase is often called parsing.


The type checking phase analyses the syntax tree to determine if the program violates certain consistency requirements, that is, whether a variable is used but not declared, or is used in a

    context that does not make sense given the type of the variable, such as trying to use a

    Boolean value as a function pointer.
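Reusing the illustrative node structure sketched above, a type checker for this small expression language could be outlined in C as follows (a sketch under the assumption that all declared variables are integers; the type names are hypothetical):

    /* Illustrative types for the toy language. */
    enum type { T_INT, T_ERROR };

    /* Recursively compute the type of an expression tree. */
    enum type check(struct node *n) {
        switch (n->kind) {
        case N_NUMBER:
            return T_INT;
        case N_VARIABLE:
            return T_INT;      /* assumed declared as an integer */
        case N_ADD:
            /* both operands of + must be integers */
            if (check(n->left) == T_INT && check(n->right) == T_INT)
                return T_INT;
            return T_ERROR;    /* a consistency violation to report */
        }
        return T_ERROR;
    }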

    In intermediate code generation the program is translated to a simple machine independent

intermediate language. In the register allocation phase, the symbolic variable names used in

    the intermediate code are translated to numbers, each of which corresponds to a register in the

    target machine code.
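As an illustration, the assignment x = a + b * 4 might pass through these two phases roughly as follows (a hypothetical three-address intermediate form and register assignment, shown here as a C comment for readability):

    /* Source statement:  x = a + b * 4;

       Intermediate code (machine independent, three-address form):
           t1 := b * 4
           t2 := a + t1
           x  := t2

       After register allocation on a hypothetical two-register
       target, the temporaries map to registers (t1 and t2 can
       share R1, since t1 is dead once t2 is computed):
           R1 := b * 4
           R1 := R0 + R1      ; a was loaded into R0
           x  := R1
    */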

2.5 Conclusion

We gave an overview of the techniques used in Natural Language Processing and where they are applied in real life. From the literature review, an approach to designing the Natural Language Compiler would be to start by utilising tools and algorithms for Natural Language Processing. This assists in getting the relevant information from the input to be used subsequently in the later stages of the overall system. A lot of work would have to be done after the initial stages of processing; that is, for the compiler to be fully functional, an exhaustive number of functions has to be written in the underlying language that correspond unambiguously to the actions to be performed on the parameters passed.


    Chapter 3

    Methodology

    3.1 Introduction

    A software development methodology is a structure imposed on the development of a

software product, or alternately a framework that is used to plan and control the process of

    developing an information system. It includes procedures, techniques, tools and

    documentation aids which help system developers in their task of implementing a new

    system. The aim of a methodology is to formalise what is being done, making it more

    repeatable.

A study conducted by the Forrester research group (Hoffman T, July 2003) states that nearly one-third of all IT projects commenced would, on average, run three months late. In many

    cases the failure is the result of either not using a methodology or using the wrong

    methodology. This shows the importance of a software development methodology in a

software project, for it somewhat determines the project's success or failure. This

    chapter discusses some development methodologies that are used and also highlights the

    methodology that is adopted for this project and why it was chosen.

    3.2 Research Methodology

    A research methodology is a way to systematically solve a research problem. It may be

understood as the science of studying how research is done scientifically. We studied the various steps generally adopted by a researcher in studying a research problem, along with the logic behind them. Research methodologies use procedures, methods and techniques that have been tested for their validity and reliability. Some research methodologies are discussed below.

    The build research methodology consists of building an artefact, either a physical or a

    software system, to demonstrate that it is possible. For it to be considered research, the

    construction of the artefact must be new or must include new features that have not been

    demonstrated before in other artefacts.

    Another research methodology is process methodology which is used to understand the

processes used to accomplish tasks in computing. This methodology is mostly used in the areas of


    Software Engineering and Man-Machine Interface which deal with the way humans build and

    use computer systems. The study of processes may also be used to understand cognition in

    the field of Artificial Intelligence.

    The last research methodology discussed for the project is the model methodology. It is

    centred on defining an abstract model for a real system. The model is much less complex than

    the system that it models, and therefore allows the researcher to better understand the system

    and to use the model to perform experiments that could not be performed in the system itself

    because of cost or accessibility. The model methodology is often used in combination with

    other methodologies. Experiments based on a model are called simulations. When a formal

    description of the model is created to verify the functionality or correctness of a system, the

    task is called model checking.

    3.3 Rapid Application Development

    Rapid Application Development (RAD) is a software development methodology that focuses

    on building applications in a very short amount of time; traditionally with compromises in

    usability, features and execution speed. RAD employs joint application design (to obtain user

input), prototyping, CASE technology, application generators, and similar tools to expedite the design process.

    Rapid Application Development has four essential aspects: methodology, people,

    management, and tools. If any one of these ingredients is inadequate, development will not be

    high speed. Development lifecycles, which weave these ingredients together as effectively as

    possible, are of the utmost importance.

    3.3.1 Strengths, weaknesses, and limitations

    Rapid application development promotes fast, efficient, accurate program and/or system

    development and delivery. Compared to other methodologies, RAD generally improves

    user/designer communication, user cooperation, and user commitment, and promotes better

    documentation.

Because rapid application development adopts prototyping and joint application design, RAD inherits their strengths and their weaknesses. More specifically, RAD is not suitable for


    mathematical or computationally oriented applications. Because rapid application

development stresses speed, quality indicators such as consistency, standardisation,

    reusability, and reliability are easily overlooked.

    Speed and quality are the primary advantages of Rapid Application Development, while

    potentially reduced scalability and feature sets are the disadvantages. The primary advantage

lies in an application's increased development speed and decreased time to delivery. Projects developed using RAD lack the scalability of a project that was designed as a full application

    from the start. Rapid Application Development is not appropriate for all projects. The

    methodology works best for projects where the scope is small or work can be broken down

    into manageable chunks. Business objectives need to be well defined before the project can

    begin, so projects that use RAD should not have a broad or poorly defined scope.

    3.3.2 RAD Concepts and Phases

    Rapid application development (RAD) is a system development methodology that employs

    joint application design (to obtain user input), prototyping, CASE technology, application

    generators, and similar tools to expedite the design process. Initially suggested by James

    Martin, this methodology gained support during the 1980s because of the wide availability of

    such powerful computer software as fourth-generation languages, application generators, and

    CASE tools, and the need to develop information systems more quickly. The primary

    objectives include high quality, fast development, and low cost.

    Rapid application development focuses on four major components: tools, people,

    methodology, and management. Current, powerful computing technology is essential to

    support such tools as application generators, screen/form generators, report generators,

    fourth-generation languages, relational or object-oriented database tools, and CASE tools.

    People include users and the development team. The methodology stresses prototyping and

    joint application design.

    A strong management commitment is essential. Before implementing rapid application

    development, the organisation should establish appropriate project management and formal

user sign-off procedures. Additionally, standards should be established for the organisation's

    data resources, applications, systems, and hardware platforms.


    Martin suggests four phases to implement rapid application development: requirements

    planning, user design, construction, and cutover (Martin, 2005). Requirements planning is

    much like traditional problem definition and systems analysis. RAD relies heavily on joint

    application design (JAD) sessions to determine the new system requirements.

    During the user design phase, the JAD team examines the requirements and transforms them

    into logical descriptions. CASE tools are used extensively during this phase. The system

    design can be planned as a series of iterative steps or allowed to evolve.

    During the construction phase, a prototype is built using the software tools described earlier.

    The JAD team then exercises the prototype and provides feedback that is used to refine the

    prototype. The feedback and modification cycle continues until a final, acceptable version of

    the system emerges. In some cases, the initial prototype consists of screens, forms, reports,

    and other elements of the user interface, and the underlying logic is added to the prototype

    only after the user interface is stabilised.

    The cutover phase is similar to the traditional implementation phase. Key activities include

    training the users, converting or installing the system, and completing the necessary

documentation. Once the prototype has been developed within its time box, the construction team tests the initial prototype using test scripts developed during the user design stage, the design team reviews the application, and the customer also reviews it. Lastly, the

    implementation stage, also known as the deployment stage, consists of integrating the new

    system into the business. The design team trains the system users while the users perform

    acceptance testing. If there was an old system in place, the design team would help the users

    transfer from their old procedures to new ones that involve the new system. The design team

also troubleshoots after the deployment, using a test environment for testing purposes, and

    identifies and tracks potential enhancements. The amount of time required to complete the

    Implementation Stage varies with the project.

As with any project there are post-project activities, which are typically the same for most methodologies. For RAD, final deliverables should be handed over to the client and such activities should be performed that benefit future projects. Specifically, it is a best practice for

    a Project Manager to review and document project metrics, organise and store project assets

    such as reusable code components, Project Plan, Project Management Plan (PMP), and Test

Plan. It is also a good practice to prepare a short lessons-learned document.


    3.4 Agile

    The focal aspects of light and agile methods are simplicity and speed. In development work,

    accordingly, the development group concentrates only on the functions needed at first hand,

    delivering them fast, collecting feedback and reacting to the received information. An agile

development process is one where software development is incremental, cooperative,

    straightforward and adaptive. Agile methodology is based on iterative and incremental

    development, where requirements and solutions evolve.

    The core of agile software development methods is the use of light-but-sufficient rules of

    project behaviour and the use of human and communication-oriented rules. The agile

    process is both light and sufficient. Lightness is a means of remaining manoeuvrable.

    Sufficiency is a means of staying in the game (Cockburn 2002).

    Agile methodologies embrace iterations. Small teams work together with stakeholders to

    define quick prototypes, proof of concepts, or other visual means to describe the problem to

    be solved. The team defines the requirements for the iteration, develops the code, and defines

    and runs integrated test scripts, and the users verify the results.

    Figure 3.1: A generic agile development process features an initial planning stage, rapid repeats

    of the iteration stage, and some form of consolidation before release.


    3.4.1 Two agile software development methodologies

    The most widely used methodologies based on the agile philosophy are Extreme

    programming and Scrum. These differ in particulars but share the iterative approach

    described above.

3.4.2 Extreme Programming

    This methodology concentrates on the development rather than managerial aspects of a

software project. Extreme programming was designed so that organisations would be free to

    adopt all or part of the methodology. It relies on constant code improvement, user

involvement in the development team and pair programming.

    XP projects start with a release planning phase, followed by several iterations, each of which

    concludes with user acceptance testing. When the product has enough features to satisfy

users, the team terminates the iterations and releases the software. The life cycle of XP consists of six phases, namely exploration, planning, iterations to release, production, maintenance and final release.

    In the exploration phase, the customers write out story cards that they wish to be included in

the first release. Each story card describes a feature to be added into the program. At the same time the project team familiarise themselves with the tools, technology and practices they will

    be using in the project. The planning phase sets the priority for the stories and an agreement

    of the contents of the first small release is made. The iterations to release phase includes

    several iterations of the systems before the first release. The schedule set in the planning

    stage is broken down to a number of iterations that will each take one to four weeks to

    implement. The first iteration creates a system with the architecture of the whole system. The

production phase requires extra testing and checking of the performance of the system before the system can be released to the customer.

    To create a release plan, the team breaks up the development tasks into iterations. The release

    plan defines each iteration plan, which drives the development for that iteration. At the end of

    iteration, users perform acceptance tests against the user stories. If they find bugs, fixing the

    bugs becomes a step in the next iteration.

XP has rules and concepts that govern it, some of which are described below. The first is integrate often, which means development teams must integrate changes into the


    development baseline at least once a day. This is also known as continuous integration.

    Project velocity is another governing principle which is the measure of how much work is

getting done on the project. This metric drives release planning and

schedule updates. Another principle is the user story, which describes problems to be solved by

    the system being built. These stories must be written by the user and should be about three

    sentences long. This is one of the main objections to the XP methodology, but also one of its

    greatest strengths.

3.4.3 Scrum

    This methodology follows the rugby concept of scrum, which is related to scrimmage, in the

    sense of a huddled mass of players engaged with each other to get a job done. Scrum for

    software development came out of the rapid prototyping community because prototyping

    groups wanted a methodology that would support an environment in which the requirements

    were not only incomplete at the start, but also could change rapidly during development.

Unlike XP, the Scrum methodology includes both managerial and development processes.

    After the team completes the project scope and high-level designs, it divides the development

    process into a series of short iterations called sprints. Each sprint aims to implement a fixed

    number of backlog items. Before each sprint, the team members identify the backlog items

    for the sprint. At the end of a sprint, the team reviews the sprint to articulate lessons learned

    and check progress.

    The Scrum development process concentrates on managing sprints. Before each sprint

    begins, the team plans the sprint, identifying the backlog items and assigning teams to these

    items. Teams develop, wrap, review, and adjust each of the backlog items. During

development, the team determines the changes necessary to implement a backlog item. The team then writes the code, tests it, and documents the changes. During wrap, the team creates

    the executable necessary to demonstrate the changes. In review, the team demonstrates the

    new features, adds new backlog items, and assesses risk. Finally, the team consolidates data

    from the review to update the changes as necessary.

Scrum also has some rules and concepts that govern it, some of which are described below. The sprint backlog is the list of backlog items assigned to a sprint, but not yet completed. In common


    practice, no sprint backlog item should take more than two days to complete. The sprint

    backlog helps the team predict the level of effort required to complete a sprint. Another

concept is the product backlog. It is the complete list of requirements, including bugs,

    enhancement requests, and usability and performance improvements that are not currently in

    the product release.

    3.5 Tools

In the development of the system there are tools that we are going to employ so as to come up with a system of high quality. Below are some of the tools needed in the development of our system.

    3.5.1 Unified Modelling Language

    Requirements for a business are best met by modelling business rules at a very high level,

    where they can be easily validated with clients, and then automatically transformed to the

    implementation level. The Unified Modelling Language (UML) is now widely used for both

database and software modelling. It is used as a standard language for object-oriented analysis and design and is also used to model the Natural Language C Cross Compiler front end. UML's object-oriented approach facilitates the transition to object-oriented code, hence its use in this project.

The design models can be either static, describing the static structure of the system in terms of object classes and relationships, or dynamic, describing the dynamic interactions of the objects.

UML diagrams are used to depict system requirements and functionality; some of these diagrams are used to show what this system does and its goals in the design phase. The UML diagrams used in the analysis phase are activity diagrams, sequence diagrams and use case diagrams. In the design phase, class diagrams and entity relationship diagrams are used.

Two more UML diagrams come into play during the implementation stage; these are deployment diagrams and component diagrams. Deployment diagrams are implementation


    level diagrams that show how the hardware and software elements that make up this

    application are configured and set into operation. Component diagrams are also

implementation-level tools that show the structure of the code, be it source code files, executable files or binary code files, connected by dependencies.

    3.5.2 Why UML

Although UML has no specification for the modelling of user interfaces, no way to formally specify serialisation and object persistence, and no way to specify that an object resides in a server process and is shared among instances of a running process, it is the chosen modelling language for this project. This is because UML offers all the benefits of object-oriented development such as inheritance and polymorphism. It also helps to communicate, explore potential designs, and validate the architectural design of the software. Importantly, it uses a simple, intuitive notation whose models non-programmers can also understand.

    3.5.3 Other Tools

A text editor with source code formatting is used for this project, specifically Notepad++, because of its simplicity and its adaptability in supporting multiple languages with a quick switch between them. This project is being developed using three languages, C, Java and some assembly language, hence the need to shift between languages frequently during the development stage.

Other than the standard development environments, this project uses open-source scanner and parser generators, mainly Lex and Yacc on Linux distributions, and Flex and Bison (variants of Lex and Yacc respectively) on Windows.
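For context, a minimal C driver for a Bison-generated parser might look as follows (a sketch only; the grammar and lexer specification files are not shown here):

    #include <stdio.h>

    int yyparse(void);   /* generated by Bison from the grammar file */

    /* Called by the generated parser when a syntax error is found. */
    void yyerror(const char *msg) {
        fprintf(stderr, "parse error: %s\n", msg);
    }

    int main(void) {
        /* yyparse() repeatedly calls the Flex-generated yylex()
           to obtain tokens; it returns 0 on a successful parse. */
        return yyparse();
    }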

    3.6 Preferred Methodology and Justification

    The chosen methodology is extreme programming, a type of agile development methodology.

This methodology has been chosen because it concentrates on the development rather than managerial aspects of a software project. It best suits this project, which involves a lot of programming processes, because it puts more emphasis on the core programming work. Extreme

    programming promotes the fast development of a software product by dividing the whole


    project into small components which are developed in iterations. By promoting fast

    development with the use of iteration, it simultaneously promotes the production of a high

    quality product since the modules are independently produced in the iterative process.

    3.7 Conclusion

    Here we discussed the different methodologies that can be used in the research and

    development of the Natural Language C Cross Compiler, giving both the advantages and

    disadvantages of each. The development of the Natural Language C Cross Compiler can be

    modularised and it has a lot of programming processes hence the chosen methodology.

Justification for not choosing the other methodologies is outlined in this chapter. Included also

    are the tools that are used in the development of the system.


    Chapter 4

    System Analysis and Design

    4.1 Introduction

Systems analysis is the dissection of a system into its component pieces to study how those component pieces interact and work, with a view to changing or improving an already working system. We do a systems analysis so as to subsequently perform a systems synthesis, which is the re-assembly of a system's component pieces back into a whole and, it is hoped, an

    improved system. Traditionally, systems analysis is associated with application development

    projects, that is, projects that produce information systems and their associated computer

    applications. Systems analysis methods can be applied to projects with different goals and

    scope. In addition to single information systems and computer applications, systems analysis

    techniques can be applied to strategic information systems planning and to the redesign of

    business processes. There are also many strategies or techniques for performing systems

    analysis. They include modern structured analysis, information engineering, prototyping, and

    object-oriented analysis.

    4.2 Requirements Elicitation

    During this phase we gathered system requirements using a number of techniques to ensure

    unambiguity, completeness, consistency, correctness and verifiability of the requirements

    both non-functional and functional. Methods used include interviews, questionnaires and

    examination of similar existing systems. This stage is of utmost importance as the system is

built directly from these requirements.

We conducted interviews with a number of programmers from the University in an attempt to establish the desired system functionality. The output formed a basis for the structure of the

    input to the system that is natural language.

We had a brainstorming session with fellow students passionate about programming, including an expert programmer currently at e-solutions private limited. These sessions were intended to complement the interviews in trying to come up with solid system requirements that do not detract from set standards of software development.


    4.2.1 Needs Analysis

Compilers in existence currently and before give feedback via a character user interface, or a graphical user interface for compilers integrated within Integrated Development Environments (IDEs). These compilers give information about errors encountered, whether semantic or syntactic, by reporting the actual location of the error by line number and, in most IDEs, for example NetBeans, by underlining it in colour.

    From the interviews and brainstorming sessions we had, a point was raised that errors are

easily seen when colours are used, as with IDEs and text editors with source code formatting. We

    then decided that there is need for the Natural Language C Cross Compiler to have a

    graphical user interface with a text area for the input and a portion for displaying results or

    errors.

The user interaction via the graphical user interface required that the editor support the basic operations of a text editor. These were designated to be cut, copy, paste and opening external files. Apart from the basic text editor operations, a point raised was that the compiler should have a menu with compile and build functions.

Due to the nature of the input being a natural language, a basic grammar-checking facility was introduced. It checks British English and is able to give suggestions on spelling mistakes and simpler verb-to-noun agreement. This helps to alert the programmer to what the compiler takes as a symbol when the words used are not recognised by the grammar checker.

    The suggested Graphical User Interface is given in Figure 4.1.


Figure 4.1: Suggested Interface

    4.3 Feasibility study

    Feasibility is the state or degree of a project being easily or conveniently done. A feasibility

    study is an evaluation and analysis of the possibility of the proposed project which is based

on far-reaching investigation and research (Georgakellos et al., 2009). A feasibility study is

    done so as to support the process of decision making. This study is essential in systems

    development because it is done before the system is developed and hence the sponsors of the

    system, the users and developers can conclude from this study if the development should

proceed or not. Outlined below are the feasibility study findings for the development of the Natural Language C Cross Compiler.

4.3.1 Economic Feasibility

In the development of the Natural Language C Cross Compiler the main input into the system is time. Monetary input is very low and there is no financial reason why the


development of this system should not proceed. Economically, therefore, the system is found to be feasible, hence the development proceeds.

    4.3.2 Technical feasibility

The Natural Language C Cross Compiler is developed at the National University of Science and Technology (NUST) as part of the requirements for the fulfilment of the B.Sc. (Honours) Degree in Computer Science. The necessary development tools for this system, for instance the Java compiler, are readily available for free, making this development process technically feasible. The system is also developed as a research and training vehicle for the university, which further supports its technical feasibility.

    4.4 Requirements specification

    The requirements for the Natural Language C Cross Compiler can be divided into two

    categories which are the functional requirements and the non-functional requirements.

    Functional Requirements are services that the system should provide, how the system should

    react to particular inputs and how the system should react to particular situations

    (Sommerville, 2010). They depend on the type of software being developed and can be sub-

    divided into input, processing and output requirements. Functional Requirements can be

    further subdivided into functional user requirements and functional system requirements.

    Functional user requirements are a high level description of what the system should do while

    functional system requirements describe the system service in detail. In order to produce

quality software in a software development project, it is essential to elicit all system

    requirements and clearly understand them.

    Non-functional requirements are constraints on the services or functions offered by the

system. They include timing constraints and constraints on the development process and standards. Non-functional requirements often apply to the system as a whole and also relate to the performance that will be required of the system and the technologies that will be used for development of the system under study. They do not usually apply just to individual system features or services. They are also known as quality requirements.

4.5 System Requirements

    These describe the functionality or system services the system should provide, how the

    system responds to certain input and how the system should behave in particular situations.


There are functional user requirements and functional system requirements. Functional

    user requirements are a high level description of what the system should do whereas

    functional system requirements describe the system service in detail. Non-functional

    requirements describe other characteristics of the product. There are several categories of

    these requirements that are constraints,