Running Head: Lab 1 – LASI Product Description

LASI Product Description

LASI – Red Team

Old Dominion University

Author: Scott Minter

Last Modified: March 18, 2013

Version 1.1

Table of Contents

1. Introduction

2. Product Description

2.1 Key Product Features and Capabilities

2.2 Major Functional Components (Hardware and Software)

3. Identification of Case Study

4. Prototype Description

4.1 Major Functional Components (Hardware/Software)

4.2 Features and Capabilities

4.3 Prototype Development Challenges

Glossary

References

List of Figures

Example 1. Top Results Tab

Example 2. Word Relationships Tab

Example 3. Word Count and Weighting Tab

Figure 4. Current AID Process

Figure 5. AID Process with LASI

Figure 6. Real-World vs. Prototype

Figure 7. Prototype Major Functional Components

1. Introduction

Linguistic Analysis for Subject Identification (LASI) is the name of The Red

Group’s project. This project is being developed as a requirement in the Professional

Workforce Development I & II courses at Old Dominion University. Linguistic

Analysis, in the scope of LASI, is the contextual study of written works and how the

words combine to form an overall meaning. LASI will be a decision support tool to

assist users in determining common themes across multiple documents. The themes that

LASI will produce are subject-object-verb relationships. Themes are

important because they help the reader comprehend what has just been read. A reader

who has comprehended the material can then summarize it.

Comprehension and summarization are important because they assist the reader in

communicating the content of the material with other people.

The process of finding common themes across multiple documents may be

lengthy and repetitive. This is due to the depth of understanding needed to identify

themes across all the documents, which may not be the theme of any individual

document. Therefore, it is difficult for people to identify a common theme over a large

set of documents in a timely, consistent and objective manner.

LASI will assist in this area by providing a weighted list of potential themes from

which the user can choose the best fit for their understanding of the material. For LASI

to effectively resolve this societal problem, it will need to find themes accurately,

use system resources efficiently, and provide consistent results.

2. Product Description

LASI will be a self-contained, stand-alone piece of software. It will not require a

connection to the Internet to produce accurate results. LASI will be designed to run on a

consumer-level laptop or desktop. LASI will also be designed to serve as an open-source

back-end engine for other projects. The data collected from the analysis that LASI

performs can be used to drive other projects and their respective GUIs, completely

bypassing the default GUI.

LASI’s ability to analyze multiple documents for common themes makes it a

decision support tool that is useful to anyone who has to read over large sets of

documents looking for commonality. Students could use it to verify the usefulness of

scientific publications to the topic they are researching. Teachers could use it as an initial

analysis of student research papers, verifying that the paper correctly addresses the topic.

Similar to students, research analysts could use LASI to verify whether different

papers and articles address the specific area they are researching.

2.1 Key Product Features and Capabilities

LASI’s ability to find themes is based on three different sub-routines. The first is

a Part-Of-Speech (POS) tagging system that will return the input document(s) with all

Words and Word Phrases tagged for their corresponding POS. Second is a word

association algorithm that will associate Words based on their POS and their proximity to

one another. Finally, a weight is applied to each Word or Word Phrase based on its POS

and its association to other words and their POS tags.

LASI will accept DOC, DOCX, and TXT files as input. LASI will allow a user to

input any known “problem” words: any organization specific jargon or slang. LASI will

also allow a user to input any desired assumptions such as synonyms and acronyms. The

user will be able to specify word equivalency, allowing LASI to better analyze the

document in the context the user desires.
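
The word-equivalency input described above could be applied as a normalization pass before any counting or weighting takes place. The following is a minimal sketch in Python, not LASI's actual implementation; the function names `build_equivalence_map` and `normalize` and the sample word groups are assumptions made purely for illustration.

```python
from collections import Counter


def build_equivalence_map(groups):
    """Map every term in a user-supplied group to the group's first term."""
    mapping = {}
    for group in groups:
        canonical = group[0].lower()
        for term in group:
            mapping[term.lower()] = canonical
    return mapping


def normalize(tokens, mapping):
    """Lower-case each token and replace it with its canonical equivalent."""
    return [mapping.get(token.lower(), token.lower()) for token in tokens]


# Hypothetical user input: each list holds words the user treats as equivalent.
user_groups = [["organization", "org", "company"], ["pos", "part-of-speech"]]
equivalents = build_equivalence_map(user_groups)

tokens = "The Org hired staff and the company grew".split()
normalized = normalize(tokens, equivalents)
counts = Counter(normalized)
```

In this sketch, "Org" and "company" are both counted as "organization", so the user's stated equivalence directly affects the word counts that later stages would consume.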

One of the important aspects of LASI is the user experience. The results will be

output into three tabs: Top Results, Word Relationships, and Word Count and Weighting.

The user will also be able to export the results into PDF format. Example 1 shows a

prototype of the Top Results tab displaying the likeliest possible themes based on

analysis.

Example 1. Top Results Tab

Example 2 shows a prototype of the Word Relationships tab. The user will be

able to view these results across all the documents or for an individual document. Each

word’s color will correspond to its POS. The search box will allow the

user to search for specific words and have the matching words highlighted.

Example 2. Word Relationships Tab

Example 3 shows a prototype of the Word Count and Weighting tab. It will display the

count of each word in the set of documents along with the weight assigned to it by the

weighting algorithm.

Example 3. Word Count and Weighting Tab

2.2 Major Functional Components (Hardware and Software)

LASI will be able to run on a laptop or a desktop, provided the machine has a

multi-core processor and four to eight gigabytes of RAM. The third-party software

components for LASI are the SharpNLP Part-Of-Speech Tagger, WordNet Thesaurus Data,

and Document Converters. The SharpNLP Part-Of-Speech Tagger handles the tagging of

words and word phrases for their corresponding POS. WordNet Thesaurus Data

allows LASI to recognize synonyms. Document Converters convert DOC and

DOCX files to TXT files.

LASI’s analytical capabilities are enabled by a combination of data structures and

algorithms. The key data types fall into two categories: specialized word and phrase

constructs, and a traversable view of a document as a collection of those

constructs.

Specialized word and phrase constructs are assigned based on their tagged POS.

Once initialized, an instance of a Word Construct can be displayed, sorted

by its POS (e.g., Noun, Verb), and have other constructs assigned to

it based on known syntactic and/or semantic relationships. Phrase Constructs will

handle phrase tags that are generated by the SharpNLP POS tagging system. A phrase is

a recognized group of words that has a tagged POS (e.g., NounPhrase, VerbPhrase).

Like Word Constructs, Phrase Constructs can be displayed, sorted by their POS,

and have other constructs assigned to them based on known syntactic and/or semantic

relationships.
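
One way to picture the Word and Phrase Constructs is as small record types that carry a POS tag and a list of related constructs. The sketch below is written in Python for brevity and is only an illustration; the field names (`text`, `pos`, `references`) and the helper `sort_by_pos` are assumptions, not LASI's actual class design.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Word:
    text: str
    pos: str  # tagged part of speech, e.g. "Noun" or "Verb"
    # Other constructs bound to this one; excluded from repr/eq to avoid cycles.
    references: List[object] = field(default_factory=list, repr=False, compare=False)


@dataclass
class Phrase:
    words: List[Word]
    pos: str  # tagged phrase type, e.g. "NounPhrase"
    references: List[object] = field(default_factory=list, repr=False, compare=False)

    @property
    def text(self) -> str:
        return " ".join(word.text for word in self.words)


def sort_by_pos(constructs):
    """Order constructs by POS tag, then alphabetically by text."""
    return sorted(constructs, key=lambda c: (c.pos, c.text.lower()))


the = Word("the", "Determiner")
document = Word("document", "Noun")
describes = Word("describes", "Verb")
noun_phrase = Phrase([the, document], "NounPhrase")

# Assign a relationship: the verb refers to the noun phrase acting as its subject.
describes.references.append(noun_phrase)

ordered = [c.text for c in sort_by_pos([describes, noun_phrase, document, the])]
```

Because both types expose `text`, `pos`, and `references`, the same display, sorting, and assignment operations apply to words and phrases alike.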

Treating a document as a traversable collection allows it to be moved

through using three different methods: Word-wise, Reference-wise, and Web-wise. When

moving through a document Word-wise, the document is broken up into

individual words, with each word being an instance of the Word class. When moving through a

document Reference-wise, the document is broken up using the reference methods of the Word

class and the Phrase class. This allows the document to be

viewed in terms of which words and/or phrases reference each other. Moving through a

document Web-wise allows the document to be traversed as if it

were a web, with the words and/or phrases as the nodes and the references as the

connections between nodes.
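
The three traversal methods can be sketched as three generators over the same set of constructs. This is a hedged illustration in Python: the `Construct` class and the function names are hypothetical, and the Web-wise traversal is shown as a breadth-first walk, which is one reasonable reading of "as if it were a web" rather than a confirmed design.

```python
from collections import deque


class Construct:
    """Hypothetical stand-in for a Word or Phrase instance."""

    def __init__(self, text):
        self.text = text
        self.references = []  # constructs this one refers to


def word_wise(constructs):
    """Visit constructs one by one, in document order."""
    for construct in constructs:
        yield construct.text


def reference_wise(constructs):
    """Visit each (source, target) reference pair."""
    for construct in constructs:
        for target in construct.references:
            yield (construct.text, target.text)


def web_wise(start):
    """Walk the reference web outward from one node, breadth first."""
    seen = {id(start)}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        yield node.text
        for neighbor in node.references:
            if id(neighbor) not in seen:
                seen.add(id(neighbor))
                queue.append(neighbor)


lasi, finds, themes = Construct("LASI"), Construct("finds"), Construct("themes")
lasi.references.append(finds)
finds.references.extend([lasi, themes])
nodes = [lasi, finds, themes]
```

The `seen` set keeps the Web-wise walk from looping forever when two constructs reference each other, as `lasi` and `finds` do here.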

The algorithms used by LASI break down into three categories: Element-binding,

Weighting, and Conflict Resolution. Element-binding binds words and phrases together

based on each instance’s POS to create the references described above for Word and

Phrase Constructs. Element-binding will consist of Direct Binding and Indirect

Binding. Direct Binding will create subject-verb, verb-subject, adverb-verb,

adjective-noun, and determiner-noun references. Indirect Binding will create

pronoun-noun references.
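
A simplified picture of Element-binding is a pass over POS-tagged tokens that links each modifier to a nearby head word. The Python sketch below rests on loudly stated assumptions: the nearest-neighbor rule, the tag names, and the function names are all hypothetical, and subject-verb binding is omitted because it needs sentence structure that this sketch does not model.

```python
# Hypothetical direct-binding rules: modifier POS -> POS of the head it binds to.
DIRECT_RULES = {"ADJ": "NOUN", "DET": "NOUN", "ADV": "VERB"}


def nearest(tagged, start, target_pos, step):
    """Return the index of the closest token with target_pos, scanning by step."""
    i = start + step
    while 0 <= i < len(tagged):
        if tagged[i][1] == target_pos:
            return i
        i += step
    return None


def bind(tagged):
    """Create (kind, source, target) references from (word, POS) pairs."""
    bindings = []
    for i, (word, pos) in enumerate(tagged):
        if pos in DIRECT_RULES:
            # Direct Binding: look ahead for the head word (assumed heuristic).
            j = nearest(tagged, i, DIRECT_RULES[pos], +1)
            kind = "direct"
        elif pos == "PRON":
            # Indirect Binding: look back for the noun a pronoun may stand for.
            j = nearest(tagged, i, "NOUN", -1)
            kind = "indirect"
        else:
            continue
        if j is not None:
            bindings.append((kind, word, tagged[j][0]))
    return bindings


tagged = [
    ("The", "DET"), ("analyst", "NOUN"), ("carefully", "ADV"), ("reads", "VERB"),
    ("long", "ADJ"), ("reports", "NOUN"), ("because", "CONJ"), ("they", "PRON"),
    ("matter", "VERB"),
]
bindings = bind(tagged)
```

A nearest-neighbor rule like this will mis-bind pronouns in many real sentences; it is shown only to make the distinction between direct and indirect references concrete.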

The Weighting algorithm gives a numeric value to both Word and Phrase

instances that indicates each instance’s relative importance.

The algorithm looks at both raw and relative data. For raw data it looks at word instance

count, word instance POS count, and synonym count. Word instance count tallies the number of

times a word occurs. Word instance POS count tallies the number of times a word occurs with

the same POS tag. Finally, the synonym count raises the count of both

words in any recognized pair of synonyms. For relative data it looks at Subject-Object-Verb

(SOV) reference count and Lexical Distance. Once Word and Phrase references have been made

on a level deep enough to establish SOV references, a count is

made of the number of times each SOV instance occurs. Weight is also based on Lexical

Distance: the physical proximity of a Word or Phrase to the instance it references

helps determine the weight assigned.
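
The raw-data counts and the Lexical Distance factor described above can be illustrated with a short Python sketch. The way the synonym count is raised (each word gains the other's occurrences) and the inverse-distance formula are assumptions chosen for illustration; the source does not specify either formula.

```python
from collections import Counter


def raw_counts(tagged, synonym_pairs):
    """Tally word instances, word+POS instances, and synonym boosts."""
    word_count = Counter(word.lower() for word, _ in tagged)
    word_pos_count = Counter((word.lower(), pos) for word, pos in tagged)
    synonym_count = Counter()
    for a, b in synonym_pairs:
        if word_count[a] and word_count[b]:
            # Assumed rule: each synonym is credited with the other's occurrences.
            synonym_count[a] += word_count[b]
            synonym_count[b] += word_count[a]
    return word_count, word_pos_count, synonym_count


def distance_weight(index_a, index_b):
    """Assumed inverse-distance factor: closer tokens contribute more weight."""
    gap = abs(index_a - index_b)
    return 1.0 / gap if gap else 0.0


tagged = [
    ("Theme", "NOUN"), ("analysis", "NOUN"), ("finds", "VERB"),
    ("topic", "NOUN"), ("theme", "NOUN"),
]
word_count, word_pos_count, synonym_count = raw_counts(tagged, [("theme", "topic")])
```

Here "theme" occurs twice and "topic" once, so under the assumed rule "theme" gains one synonym credit and "topic" gains two.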

Conflict Resolution will be important to ensuring that LASI can complete analysis

successfully. In a document, there may be any number of unaccounted-for items, such as

incorrect grammar and unrecognized characters. Conflict Resolution will

recognize these items, try to address them, and, if it cannot, throw the appropriate exception.
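
For unrecognized characters, the "try to address, otherwise throw" behavior might look like the following Python sketch. The replacement table, the exception name `UnrecognizedCharacterError`, and the rule for what counts as unrecognized are all assumptions for illustration; handling of incorrect grammar is not modeled here.

```python
# Hypothetical table of characters that can be mapped to plain equivalents.
REPLACEMENTS = {
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-",                  # en dash
}


class UnrecognizedCharacterError(ValueError):
    """Raised when a character can be neither kept nor replaced."""


def resolve(text):
    """Replace known problem characters; raise on anything unprintable."""
    cleaned = []
    for index, char in enumerate(text):
        if char in REPLACEMENTS:
            cleaned.append(REPLACEMENTS[char])
        elif char.isprintable() or char in "\n\t":
            cleaned.append(char)
        else:
            raise UnrecognizedCharacterError(
                f"unrecognized character {char!r} at index {index}"
            )
    return "".join(cleaned)


cleaned = resolve("LASI\u2019s \u201cthemes\u201d")
```

A caller that prefers to skip a bad character rather than stop the analysis could catch the exception and continue with the remaining text.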

3. Identification of Case Study

Dr. Hester and Dr. Meyers work for an organization housed on the Old Dominion

University campus called the National Centers for System of Systems Engineering

(NCSOSE). NCSOSE analyzes organizations and their respective documents in order to

help them recognize and address internal problems. The current process utilized at

NCSOSE is called the Assessment Improvement Design (AID) process. The AID

process begins with a company coming to NCSOSE for evaluation, at which point

NCSOSE gathers organization-specific documents for analysis. In this analysis

phase, Drs. Hester and Meyers read over the documents multiple times in order to

find common themes specific to the structure and function of the organization. Finally,

Dr. Hester and Dr. Meyers return to the organization with their findings based on the

analysis (Fig. 4).

Figure 4. Current AID Process

It is during the document analysis phase that LASI will be utilized. Inserting

LASI into the AID process will cut down on both time and inconsistency. LASI will

reduce the time spent rereading the documents and give NCSOSE a logical grounding

for the findings they return to the organizations (Fig. 5).

Figure 5. AID Process with LASI

4. Prototype Description

A full Real-World Solution for LASI would be very difficult to develop in the

time allotted, so a prototype is needed to narrow the scope while still

demonstrating LASI’s capabilities. Figure 6 shows what the prototype

will do in comparison to the Real-World Solution.

Figure 6. Real-World vs. Prototype

4.1 Major Functional Components (Hardware/Software)

The major functional components for the prototype are very similar to those of the

Real-World Solution. The hardware required to run the LASI prototype will be a laptop

or desktop with four to eight gigabytes of RAM and a multi-core processor. The software

needed will be the third-party software to tag parts of speech and convert DOC and

DOCX files to TXT files, the LASI data structures and algorithms, and the LASI GUI

(Fig. 7). For in-class development, a Virtual Machine is also being utilized as a testing,

demonstration, and code-writing environment.

Figure 7. Prototype Major Functional Components

The third-party software is the SharpNLP Part-Of-Speech Tagger and the

B2XTranslator. The SharpNLP POS Tagger tags words and word phrases for their

respective POS in order for the LASI algorithms to use them. The B2XTranslator

converts DOC to DOCX files. This is done because DOCX files contain an XML file

that can easily be converted to a TXT file.
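
To illustrate why the DOCX format is convenient here: a DOCX file is a ZIP archive whose main body is stored as XML in `word/document.xml`, with the visible text held in `w:t` elements. The Python sketch below uses only the standard library to pull that text out. It is not the B2XTranslator or LASI's converter, and the function name `docx_to_text` is an assumption.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as w:p and w:t.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_to_text(docx_file):
    """Extract paragraph text from a DOCX path or file-like object."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        runs = [node.text or "" for node in paragraph.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)


# Build a tiny in-memory DOCX-like archive so the sketch can be run as-is.
sample_xml = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    "<w:p><w:r><w:t>Hello</w:t></w:r><w:r><w:t> LASI</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>"
    "</w:body></w:document>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", sample_xml)
buffer.seek(0)

text = docx_to_text(buffer)
```

This sketch ignores tables, headers, footnotes, and nested text boxes, all of which a production converter would need to handle.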

The LASI data structures and algorithms needed for the prototype are reference

binding and weight assigning. In the prototype, the reference binding works the same as

in the Real-World solution. References are made between Words and Word Phrases

based on their tagged POS and how they relate to one another within the sentence,

paragraph and document structure. The weight assigning algorithm will assign weight

based on tagged POS, word instance, and reference count. The reference count will

count how many times other Words and Word Phrases refer to a Word or Word Phrase.
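
The prototype's weight assignment combines tagged POS, word instance count, and reference count. One possible combination is sketched below in Python; the POS factors and the multiply-and-add formula are invented for illustration, since the source names the inputs but not how they are combined.

```python
from collections import Counter

# Hypothetical per-POS multipliers; the real values are not given in the source.
POS_FACTOR = {"NOUN": 1.0, "VERB": 0.8, "ADJ": 0.5}
DEFAULT_FACTOR = 0.1


def assign_weights(tagged, references):
    """Weight each word by POS factor * (instance count + incoming references)."""
    instance_count = Counter(word for word, _ in tagged)
    # Reference count: how many times other words refer to this word.
    reference_count = Counter(target for _, target in references)
    pos_of = {word: pos for word, pos in tagged}
    return {
        word: POS_FACTOR.get(pos_of[word], DEFAULT_FACTOR)
        * (instance_count[word] + reference_count[word])
        for word in instance_count
    }


tagged = [
    ("LASI", "NOUN"), ("finds", "VERB"), ("common", "ADJ"),
    ("themes", "NOUN"), ("LASI", "NOUN"),
]
references = [("finds", "LASI"), ("finds", "themes"), ("common", "themes")]
weights = assign_weights(tagged, references)
```

In this example "themes" appears once but is referenced twice, so it receives the same weight as "LASI", which appears twice and is referenced once.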

4.2 Features and Capabilities

The prototype will be limited in its capabilities relative to the Real-World Solution due

to the time constraints of the class. One limitation is that it will

allow only five documents to be loaded into a LASI project. Also, it will accept only

DOC, DOCX, and TXT files as input. The Real-World Solution would need

some kind of scanned text recognition in order to correctly convert PDF files to TXT

files. However, our prototype will not accept PDF files and therefore will not have

scanned text recognition capabilities.
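
The prototype's input limits lend themselves to a simple validation step when documents are added to a project. The following Python sketch shows one way to enforce them; the names `MAX_DOCUMENTS`, `ACCEPTED_EXTENSIONS`, and `validate_project` are hypothetical.

```python
from pathlib import PurePath

MAX_DOCUMENTS = 5                               # prototype limit per project
ACCEPTED_EXTENSIONS = {".doc", ".docx", ".txt"}  # PDF is not accepted


def validate_project(paths):
    """Return the paths unchanged if they satisfy the prototype's limits."""
    if len(paths) > MAX_DOCUMENTS:
        raise ValueError(
            f"a project may hold at most {MAX_DOCUMENTS} documents, got {len(paths)}"
        )
    rejected = [p for p in paths if PurePath(p).suffix.lower() not in ACCEPTED_EXTENSIONS]
    if rejected:
        raise ValueError(f"unsupported file type(s): {', '.join(rejected)}")
    return list(paths)


accepted = validate_project(["charter.docx", "minutes.DOC", "notes.txt"])
```

Checking the limits up front lets the GUI report a clear message before any conversion or tagging work begins.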

Some of the identified risks with LASI are trust in the output, post-semester

maintenance, individual PC system limitations, and illegal character handling. These

risks are being mitigated through various means. Trust in the output will be handled

by the various views and tabs on our results GUI. Showing the user much of LASI’s

accumulated data will make the results verifiable. Maintenance of LASI will be performed

by the open-source community and possibly by future CS410 and CS411 groups.

Crashes due to system limitations will be avoided by multithreading the LASI

algorithms and making sure the program runs as efficiently as possible. At some point,

LASI will encounter unrecognized characters in a document. When this happens,

LASI will attempt to recognize these characters based on their syntax in the document.

However, if they remain unrecognizable, then an exception will be thrown and the

character will be ignored.

4.3 Prototype Development Challenges

Some of the challenges facing the development of the LASI prototype are

correctly using all the data generated and correctly identifying themes. Identifying

POS, creating Word and Word Phrase references, and assigning weight are all constructs

that assist LASI in correctly identifying themes. These challenges will be mitigated by

designing an algorithm that uses this information to identify themes in a

timely, consistent, and objective manner.

Glossary of Terms

Theme - Subject-object-verb relationships that LASI is attempting to generate from the input set.

LASI - Linguistic Analysis for Subject Identification.

Parser - Takes in DOC and DOCX files and converts them to TXT files.

WordNet - Compilers and providers of our thesaurus.

Phrase - A group of words standing together as a conceptual unit, typically forming a new component.

Analysis - Detailed examination of the elements or structure of something, typically as a basis for interpretation.

Linguistic Analysis - The scientific analysis of a language.

Tag - A label, or the act of attaching a label, that specifies the role (such as part of speech or location) of a selected element in a document.

Document - A document herein refers to a formally written, expository paper which expounds, via a declarative approach, on a relatively quantifiable issue, goal, or area of research.

Word Weight - A numeric value, associated with each syntactically and lexically unique word in a written work, which indicates the relative significance of that word.

Tornado chart - A horizontal bar-graph-like visualization, representing the relative frequency or significance of elements, sorted in descending order by magnitude.

Head word - The locally distinct word within a phrase which, by its syntactic associations, determines the syntactic category of the phrase itself.

Word Binding - The process of binding part-of-speech to a word.

SharpNLP - C# natural language processing tool used to parse and tag part-of-speech.

Tagged Word Object - A word that has an associated part-of-speech.

Optical Character Recognition - Conversion of scanned images to text.

Tagged Set - A group of words whose part of speech and location in a sentence have been identified by our parser.

Lexer - A piece of our parsing tool that isolates each word, its part of speech, and its location in a sentence into machine-readable tokens. These are stored as elements in an XML file.

Syntactic Analysis - A form of Linguistic Analysis that focuses on grammar in sentences and identifies themes based on sentence structure and formatting. Unlike Semantic Analysis, it identifies key words based on their location in the sentence, rather than their overall meaning throughout the document.

Subject Identification - Identifies the main actor in a sentence. However, in a broader sense, the word subject is synonymous with the theme of a document. Subject identification is the process of determining the subjects, or themes, of a document or documents.

Part of Speech Tagger - Software utility that associates words with the parts of speech (e.g., Noun, Verb) in a sentence.

Semantic Analysis - Relating the syntactical structure of words to their language-independent meanings.

A.I.D. Process - Assessment Improvement Design: A process that provides a quantitative and qualitative basis to identify problems and determine the feasibility of solutions.

Strategic Document - Document produced by a client that defines their Goals, Visions, and Missions.

Word (denoted by capital W) – An instance of LASI’s Word class.

Word Phrase (denoted by capital W and capital P) – An instance of LASI’s Word Phrase class.

References

Hester, P.T., Meyers, T. (2012). Enterprise AID: A performance measurement system for enterprise assessment, improvement, and design (NCSOSE-TR-12-001). Norfolk, VA: National Centers for System of Systems Engineering.
