domain-independent data extraction: person names carl christensen and deryle lonsdale brigham young...

Domain-Independent Data Extraction: Person Names Carl Christensen and Deryle Lonsdale Brigham Young University cvchristensen@gmail [email protected]

Post on 19-Dec-2015

217 views

Category:

Documents

0 download

Report

Download

Tags:

Embed Size (px):

TRANSCRIPT

Domain-Independent Data Extraction: Person Names

Carl Christensen and Deryle Lonsdale

Brigham Young Universitycvchristensen@gmail

[email protected]

WePS task

Web People Search18 attribute values on person

namesTraining corpus – 17 names, approx.

100 web pages perScript given to evaluate

performanceTest corpus of comparable sizeNew ground in information

extraction

Ontos

Software developed by BYU data extraction group

Ontology based method leveraged to organize data

Off the shelf performanceSimilar uses for obituaries and

car ads information extraction

Page 4: Domain-Independent Data Extraction: Person Names Carl Christensen and Deryle Lonsdale Brigham Young University cvchristensen@gmail lonz@byu.edu

WePS extraction process

Person webpage

Ontos

Evaluation script

Annotated results

Knowledge sources

Text file output

WePS ontology

Results report

Page 5: Domain-Independent Data Extraction: Person Names Carl Christensen and Deryle Lonsdale Brigham Young University cvchristensen@gmail lonz@byu.edu

Dataframes

• XML description of extraction ontology components

Page 6: Domain-Independent Data Extraction: Person Names Carl Christensen and Deryle Lonsdale Brigham Young University cvchristensen@gmail lonz@byu.edu

Knowledge files

• Approximately 80,000 school names and 30,000 occupations - 66% of total schools in U.S. in 2003

• All possible options for some files, small subset for others - e.g. Occupations, hypocoristics

• Names, cities, countries, hypocoristics, occupations, etc.

• Knowledge gathered from extracting and formatting online databases - Live Journal - Wikipedia - Bureau of Labor Statistics - etc.

Page 7: Domain-Independent Data Extraction: Person Names Carl Christensen and Deryle Lonsdale Brigham Young University cvchristensen@gmail lonz@byu.edu

Constraints

Required context expressions

Cardinality

Regular expressions

<RequiredContextExpression> <ExpressionText>\b{BirthTime}\b</ExpressionText></RequiredContextExpression>

Email : [\w]+[\d]*@[\w]+[.]{1}[\w]+

Search Person [0:*] has Occupation [1:*]; Search Person [0:*] has Affiliation [1:*];Search Person [0:*] has Email [1:*];

Page 8: Domain-Independent Data Extraction: Person Names Carl Christensen and Deryle Lonsdale Brigham Young University cvchristensen@gmail lonz@byu.edu

Sample annotated webpage

Page 9: Domain-Independent Data Extraction: Person Names Carl Christensen and Deryle Lonsdale Brigham Young University cvchristensen@gmail lonz@byu.edu

Precision/Recall

alpha = 0.5 for WePS

Page 10: Domain-Independent Data Extraction: Person Names Carl Christensen and Deryle Lonsdale Brigham Young University cvchristensen@gmail lonz@byu.edu

Page 11: Domain-Independent Data Extraction: Person Names Carl Christensen and Deryle Lonsdale Brigham Young University cvchristensen@gmail lonz@byu.edu

Challenges

Smaller/larger match preference - Preference for “Arizona” as place

over “University of Arizona” as school

DOM parser - Unofficial HTML tags cause system

to failText formatting - Record detection for individuals

intractableSystem functionality - Cardinality bounds, system output

file

Page 12: Domain-Independent Data Extraction: Person Names Carl Christensen and Deryle Lonsdale Brigham Young University cvchristensen@gmail lonz@byu.edu

Performance

Very low initial precision/recall: < 1%

Increased drastically with knowledge engineering and system constraints

- 27 % recall with some person results approaching

40% recall - Approaching 10% precision

Nothing to measure against - Official WePS results will be

released in April

Page 13: Domain-Independent Data Extraction: Person Names Carl Christensen and Deryle Lonsdale Brigham Young University cvchristensen@gmail lonz@byu.edu

Future work

Ontos RobustnessMachine learning for

constraints/knowledge filesPerson name disambiguationKeyword probability values

IT 344: Operating Systems Winter 2010 Module 5 Process and Threads Chia-Chi Teng [email protected] CTB 265G

An Interactive Approach to Teaching L2 Reading: From the Bottom-Up Heidi Hyte Brigham Young University [email protected] [email protected]

Teaching Students to Read in their Disciplines 2015 FHSS Spring Teaching Conference May 21, 2015 Jeffery D Nokes [email protected]

Integrated Natural History of Utah 2012 [email protected]

BYU International Security 280C HRCB 801-422-5357 [email protected]

AAAI 2002 WS1 Peppering knowledge sources with SALT Deryle Lonsdale, Yihong Ding, David W. Embley, Alan Melby Brigham Young University [email protected], {ding,embley}@cs.byu.edu,

Estimating With Electronic Documents Kevin R. Miller, Ph.D. Brigham Young University [email protected]

IT 344: Operating Systems Module 9 Scheduling Chia-Chi Teng [email protected] 265 CTB

An Intonational Description of Mayan Q'eqchi'authorized administrator of BYU ScholarsArchive. For more information, please [email protected]. BYU ScholarsArchive Citation

VITA - marriottschool.netmarriottschool.net/emp/WPW_bak/vita_wpw06.doc · Web viewNAME: Warner P. Woodworth. EMAIL: [email protected]. ADDRESS: Work Department of Organizational

Deryle Lonsdale and Tory Anderson - University of Michigan

GEOMETRIC ALGEBRAS Manuel Berrondo Brigham Young University Provo, UT, 84097 [email protected]

ME 312 Homework # 4 Due Tuesday September 27th …me312.byu.edu/sites/me312.byu.edu/files/HW_4_solutions.pdf · ME 312 Homework # 4 Due Tuesday September 27th by 5:00 p.m. Problems

Deryle W. Lonsdale, David W. Embley, Stephen W. Liddle, and Joseph Park BYU Data Extraction Research Group

Research Developmentresearchdevelopment.byu.edu/wp-content/uploads/2015/08/[email protected] Research Development Jo Ann Petrie Research Development Specialist (MRI) MRI Research

Gene Expression And Regulation Bioinformatics January 11, 2006 D. A. McClellan ([email protected])

TANGO Table ANalysis for Generating Ontologies Yuri A. Tijerino*, David W. Embley*, Deryle W. Lonsdale* and George Nagy** * Brigham Young University **

Project Governance: Avoiding “Administrivia” Lisa Kosanovich Project Manager Center for Instructional Design Brigham Young University [email protected]

David W. Embley , Stephen W. Liddle , & Deryle W. Lonsdale Brigham Young University, USA

-4 -3 -2 -1 0 1 2 3 4 - ME 363 Elementary Instrumentationme363.byu.edu/sites/me363.byu.edu/files/hw1_solutions.pdf · 2012-05-17 · -5 -4 -3 -2 -1 0 1 2 3 4 5-5-4-3-2-1 0 1 2 3 4

DLLS 20031 Ontologically-based Searching for Jobs in Linguistics Deryle Lonsdale [email protected] Funded by:

Generating and Tracking Communities Based on Implicit Affinities Matthew Smith – [email protected] BYU Data Mining Lab April 2007

TABLE OF CONTENTS - Brigham Young University · 2016. 4. 26. · Nieves Pérez Knapp 3171 JFSB 801-422-3196 [email protected] . Rob A. Martinsen 3143 JFSB 801-422-8466 [email protected]

WHY PAY MORE THAN ONCE? jlh39/erl/ Redundant Journal Access Jared Howland Brigham Young [email protected]

Marsha Dowell, PhD, Senior Vice Chancellor Warren Carson, PhD, Associate Vice Chancellor Deryle Hope, PhD, Associate Director, International Studies Cherie

ME 312 Homework # 3 Due Wednesday September 24nd ...me312.byu.edu/sites/me312.byu.edu/files/HW3_solutions_0.pdfME 312 Homework # 3 Due Wednesday September 24nd by 5:00 p.m. Problems

IT 344: Operating Systems Winter 2010 Module 13 Secondary Storage Chia-Chi Teng [email protected] CTB 265

ME 312 Homework # 7 Due Monday October 25th …me312.byu.edu/sites/me312.byu.edu/files/HW_7_solutions.pdfME 312 Homework # 7 Due Monday October 25th by 5:00 p.m. Problems from Fundamentals

IT 344: Operating Systems Winter 2008 Module 2 Operating Systems Overview Chia-Chi Teng [email protected] 265G CTB

Make a phone call and use CALL Ben McMurry [email protected]

David W. Embley, Stephen W. Liddle, Deryle W. Lonsdale, Aaron Stewart, and Cui Tao* Brigham Young University, Provo, Utah, USA *Mayo Clinic, Rochester,

ME 312 Homework #2 Due Friday September 17thth …me312.byu.edu/sites/me312.byu.edu/files/HW2_solutions.pdf · ME 312 Homework #2 Due Friday September 17thth by 5:00 p.m. Problems

Principled Pragmatism: A Guide to the Adaptation of Philosophical Disciplines to Conceptual Modeling David W. Embley, Stephen W. Liddle, & Deryle W. Lonsdale

domain-independent data extraction: person names carl christensen and deryle lonsdale brigham young...

Documents

weps slide

person results

system output file slide

w search person

sample annotated webpage

person names training

person names carl christensen

school names