Understanding System_T

By Mao Xianling

2009.02.28

Outline

• Introduction to System_T
• Primary tests
• Problem

Introduction to System_T

Installing the Development Environment

• Download System Text from IBM's alphaWorks site; just search for "System Text" at http://www.alphaworks.ibm.com/
• Uncompress the .zip/.tar file onto your computer's hard drive
• Run the startup script: sh SystemText-[version]/bin/startserver.sh
• Start the Development Environment by pointing your web browser at http://localhost:8083/aql

Development Environment

One Example for AQL Code

create view PhoneNum as
  extract
    regex /[0-9]{3}-[0-9]{4}/
    on D.text as number
  from Document D;

output view PhoneNum;
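
For readers who do not have the Development Environment at hand, the PhoneNum view can be approximated in Python. This is only an analogue for intuition, not how System Text evaluates AQL; the function name `extract_phone_num` and the sample sentence are made up for illustration.

```python
import re

# Same pattern as the AQL regex /[0-9]{3}-[0-9]{4}/
PHONE_RE = re.compile(r"[0-9]{3}-[0-9]{4}")


def extract_phone_num(text):
    """Return one tuple per match: (begin offset, end offset, matched text)."""
    return [(m.start(), m.end(), m.group(0)) for m in PHONE_RE.finditer(text)]


# Hypothetical document text, standing in for D.text
sample = "Call 555-1234 or 555-9876."
for begin, end, number in extract_phone_num(sample):
    print(begin, end, number)
```

Each result keeps the character offsets as well as the matched string, which mirrors the idea that an AQL view returns spans over the document rather than plain strings.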

Introduction to AQL

• AQL: a language for building annotators that extract structured information from unstructured or semi-structured text.

• AQL is the primary method of creating new annotators in System Text for Information Extraction.

The syntax of AQL is similar to that of SQL, but with several important differences:

• AQL is case sensitive.

• AQL allows regular expressions to be expressed in Perl syntax, e.g. /regex/ instead of 'regex'.

• AQL currently does not support advanced SQL features like correlated subqueries and recursive queries.

• AQL has a new statement type, extract, which is not present in SQL.

Data Model

• AQL's data model is similar to the standard relational model used by SQL databases like DB2. All data in AQL is stored in tuples: data records of one or more columns, or fields. A collection of tuples forms a relation. All tuples in a relation must have the same schema, that is, the same field names and types.

The fields of an AQL tuple must belong to one of the language's built-in scalar types:

• Integer: A 32-bit signed integer.

• Text: A Unicode string, with additional metadata to indicate which tuple the string belongs to.

• Span: A contiguous region of characters in a Text object.
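
The relationship between Text and Span can be sketched in Python as follows. This is an illustration under assumptions, not System Text's internal representation: the field names `begin` and `end`, the half-open offset convention, and the `covered` helper are all hypothetical, and the extra metadata a Text carries is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Text:
    value: str   # the Unicode string itself (tuple metadata omitted)


@dataclass(frozen=True)
class Span:
    text: Text   # the Text object this span points into
    begin: int   # offset of the first character (inclusive, assumed)
    end: int     # offset just past the last character (exclusive, assumed)

    def covered(self) -> str:
        """Return the characters of the underlying Text that the span covers."""
        return self.text.value[self.begin:self.end]


doc = Text("Phone: 555-1234")
num = Span(doc, 7, 15)
print(num.covered())
```

The point of the sketch is that a Span does not copy characters; it only records where in a Text a contiguous region starts and ends.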

Execution Model

AQL Statement

• The create view Statement
• The extract Statement
  – Extraction Specifications
    • Regular Expressions
    • Dictionaries
    • Splits
• The select Statement
• The create table Statement
• Built-In Functions
  – Predicate Functions
  – Scalar Functions
  – Table Functions

create view PersonFirstOrLastName as
  extract
    dictionary 'names.dict' on D.text as name
  from Document D
  having MatchesRegex(/[A-Z].+/, name);

create view PhoneNumber as
  extract
    regexes /(\d{3})-(\d{3}-\d{4})/ and /\(\d{3}\)\s*(\d{3}-\d{4})/
    on D.text as num
  from Document D;

create view ExtensionNumbers as
  extract
    regex /[Ee]xt\s*[\.\-\:]?\s*(\d{3,5})/
    on D.text
    return group 1 as num and group 0 as completenum
  from Document D;

create view PhoneNumberWithExtension as
  select CombineSpans(P.num, E.completenum) as num
  from PhoneNumber P, ExtensionNumbers E
  where FollowsTok(P.num, E.completenum, 0, 1);

create view PhoneNumberAll as
  (select P.num as num from PhoneNumber P)
  union all
  (select E.completenum as num from ExtensionNumbers E)
  union all
  (select P.num as num from PhoneNumberWithExtension P);

create view PhoneNumberAllConsolidated as
  select P.num as num
  from PhoneNumberAll P
  consolidate on P.num
  using 'ContainedWithin';

create view PersonsPhone as
  select person.name as person, phone.num as phone,
         CombineSpans(person.name, phone.num) as personphone
  from PersonFirstOrLastName person, PhoneNumberAllConsolidated phone
  where Follows(person.name, phone.num, 0, 30);

output view PersonsPhone;
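
To make the span operations in these views more concrete, here is a rough Python sketch of what CombineSpans, Follows, and consolidation with 'ContainedWithin' do conceptually. Spans are modelled as plain (begin, end) tuples with an exclusive end, and the three function names are hypothetical stand-ins; the real built-ins may differ in details such as argument checks and tie-breaking.

```python
def combine_spans(a, b):
    """Smallest span covering both a and b (a is assumed to start first)."""
    return (a[0], max(a[1], b[1]))


def follows(a, b, min_chars, max_chars):
    """True if b begins between min_chars and max_chars characters after a ends."""
    return min_chars <= b[0] - a[1] <= max_chars


def consolidate_contained_within(spans):
    """Drop every span that lies entirely inside another, different span."""
    unique = sorted(set(spans))
    return [
        s for s in unique
        if not any(o != s and o[0] <= s[0] and s[1] <= o[1] for o in unique)
    ]


# Offsets for the made-up text "John 555-1234 ext. 1234"
name = (0, 4)         # "John"
phone = (5, 13)       # "555-1234"
extension = (14, 23)  # "ext. 1234"

with_ext = combine_spans(phone, extension)
best = consolidate_contained_within([phone, extension, with_ext])
print(with_ext, best, follows(name, best[0], 0, 30))
```

In this sketch the bare phone number and the bare extension are both swallowed by the combined span, which is the effect the PhoneNumberAllConsolidated view relies on before names are paired with phone numbers.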

Primary Tests

• Dataset

From the TianWang crawler; Chinese; Firstname.dict/Lastname.dict (for Chinese)

• Method

Using AQL to build annotators

Annotator for extracting phone numbers

Annotator for extracting names

Time && Space

Problem

• English vs. Chinese [extract regex /[0-9]{3}/ on 1 token in D.text]
• Time && Space && Network?
• Multiset?
• The expressive power of regex?
• No source code && MapReduce?
• Zip?
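
One possible reading of the first bullet is that token-restricted matching behaves differently on text without spaces between words. The sketch below illustrates that reading only. It assumes a naive whitespace tokenizer, which is not necessarily how System Text tokenizes Chinese, and `regex_on_one_token` is a hypothetical helper, not an AQL function.

```python
import re


def regex_on_one_token(pattern, text):
    """Keep only whitespace-separated tokens that match the pattern in full."""
    return [tok for tok in text.split() if re.fullmatch(pattern, tok)]


# English: the digits stand alone as a token, so they are found.
print(regex_on_one_token(r"[0-9]{3}", "room 123 is free"))

# Chinese: no spaces, so the whole sentence is a single token and nothing matches.
print(regex_on_one_token(r"[0-9]{3}", "房间123是空的"))
```

Under this assumption the same extraction rule finds the number in the English sentence but misses it in the Chinese one, which would make tokenization a language-dependent concern.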
