learning information extraction from images of structured ...form automatic information extraction...

Problem Why automatic information extraction for forms? • Automatic information extraction (IE) from document or form images eliminates error-prone manual data entry so users don’t have to do it. • Existing machine learning-based techniques for IE rely on expensive-to- acquire manually annotated labeled data. • Intuit solves this problem by building a data-driven synthetic data generation pipeline and by using a modified conditional random field (CRF) model for field extraction. Form automatic information extraction as a named entity recognition (NER) problem • Two main types of entities: form field titles and form field values. • W2 form contains 32-35 fields, corresponding to ~70 entity classes. Learning Information Extraction from Images of Structured Documents Using Synthetic Data and Conditional Random Fields (CRFs) Solution: An End-to-End Learning Framework for Form Information Extraction Tharathorn (Joy) Rimchala Intuit Data Type Classifier Synthetic W2 Field Values from Real Data Image Renderer Synthetic Training Data Set Input tokens (output of OCR) Class labels Features Machine Learning Form Field Value Data Source Synthetic Text Data Synthetic Image Data Stratified Samplers Image Renderer Form Layouts PII PII Anonymizer Categorical Discreet Valued Density Estimator Numerical Continuous Valued Density Estimator OCR Engine Feature Generator CRF Model Token-based Word regex patterns, part of speech, known entity tags, etc. Contextual FastText clusters LDA document topics Spatial 2D text organization relative 2D positions Conditional random field (CRF) layers Synthetic Data Generation Results Form CRF Model Training The Pipeline Stages • The generators learn to generate three main types of data distributions from millions of anonymized real electronic form field data. • The synthetic data is rendered on variations of form with font variations. • The entire pipeline is packaged into a single ready-to-deploy docker image. Model Performance: • Model performance varies with the usage rate of field class in W2 Form. • The best model yield 97.44% F1 score on classes that are highly used field value class. Next Steps: • Deployment of the best NER-CRF. • Exploring over-sampling to improve performance on classes that are not highly used. • Exploring more powerful model: bidirectional Long Short Term Memory Conditional Random Field (biLSTM-CRF) using biLSTM as a feature extractor. Model Training • 96K images, 48 variations, ~20 font variations and text localization • 80/10/10 training/validation/test split • L-BFGS with L2 regularization Best NER-CRF Model Performance (aggregated across all classes) Metric Field Title classes Field value classes High Usage* Field Value classes Precision 99.10% 81.60% 97.74% Recall 98.50% 82.40% 96.65% F1 98.50% 82.40% 97.65% * High usage classes: classes that appear in more than 70% of all real W2 form data. NER-CRF Model Confidence Among Highly Used Classes Field Class (entity we want to extract) Model Confidence Usage Rate Employee Name 99.19% 99.86% Employee Address 92.31% 99.82% Employer Id Number (EIN) 97.70% 99.85% Medicare Tax Withheld Amount 98.18% 98.39% Medicare Wages And Tips Amount 99.24% 98.62% Social Security Number (SSN) 98.64% 99.86% Social Security Tax Amount 98.23% 97.23% Social Security Wages Amount 97.86% 97.44% Employer State Id Number (State EIN) 95.93% 87.34% State Income Amount 93.48% 87.33% Wages Amount 98.25% 99.77% Withholding Amount 97.12% 98.42%

Upload: others

Post on 11-Jul-2020

2 views

Category:

Documents

0 download

Report

Download

Embed Size (px):

TRANSCRIPT

Page 1: Learning Information Extraction from Images of Structured ...Form automatic information extraction as a named entity recognition (NER) problem • Two main types of entities: form

Problem

Why automatic information extraction for forms?

• Automatic information extraction (IE) from document or form images eliminates error-prone manual data entry so users don’t have to do it.

• Existing machine learning-based techniques for IE rely on expensive-to-acquire manually annotated labeled data.

• Intuit solves this problem by building a data-driven synthetic data generation pipeline and by using a modified conditional random field (CRF) model for field extraction.

Form automatic information extraction as a named entity recognition (NER) problem

• Two main types of entities: form field titles and form field values.

• W2 form contains 32-35 fields, corresponding to ~70 entity classes.

Learning Information Extraction from Images of Structured Documents Using Synthetic Data and Conditional Random Fields (CRFs)

Solution: An End-to-End Learning Framework for Form Information Extraction

Tharathorn (Joy) Rimchala Intuit

Data TypeClassifier

Synthetic W2 FieldValues from Real Data

ImageRenderer

SyntheticTraining Data Set

Input tokens(output of OCR)

Class labels

Features Machine Learning

Form Field ValueData Source

SyntheticText Data

SyntheticImage Data

StratifiedSamplers

ImageRenderer

Form LayoutsPIIPII Anonymizer

CategoricalDiscreet Valued

Density Estimator

NumericalContinuous ValuedDensity Estimator

OCREngine

FeatureGenerator

CRFModel

Token-basedWord regex patterns,

part of speech,known entity tags, etc.

ContextualFastText clusters

LDA document topics

Spatial2D text organizationrelative 2D positions

Conditional random field (CRF) layers

Synthetic Data Generation

Results

Form CRF Model Training

The Pipeline Stages

• The generators learn to generate three main types of data distributions from millions of anonymized real electronic form field data.

• The synthetic data is rendered on variations of form with font variations.

• The entire pipeline is packaged into a single ready-to-deploy docker image.

Model Performance: • Model performance varies with the usage rate of field class in W2 Form.

• The best model yield 97.44% F1 score on classes that are highly used field value class.

Next Steps: • Deployment of the best NER-CRF.

• Exploring over-sampling to improve performance on classes that are not highly used.

• Exploring more powerful model: bidirectional Long Short Term Memory Conditional Random Field (biLSTM-CRF) using biLSTM as a feature extractor.

Model Training

• 96K images, 48 variations, ~20 font variations and text localization

• 80/10/10 training/validation/test split

• L-BFGS with L2 regularization

Best NER-CRF Model Performance(aggregated across all classes)

Metric Field Title classes

Field value classes

High Usage*Field Valueclasses

Precision 99.10% 81.60% 97.74%

Recall 98.50% 82.40% 96.65%

F1 98.50% 82.40% 97.65%

* High usage classes: classes that appear in more than 70% of all real W2 form data.

NER-CRF Model Confidence Among Highly Used Classes

Field Class (entity we want to extract)

ModelConfidence

Usage Rate

Employee Name 99.19% 99.86%

Employee Address 92.31% 99.82%

Employer Id Number (EIN) 97.70% 99.85%

Medicare Tax Withheld Amount 98.18% 98.39%

Medicare Wages And Tips Amount 99.24% 98.62%

Social Security Number (SSN) 98.64% 99.86%

Social Security Tax Amount 98.23% 97.23%

Social Security Wages Amount 97.86% 97.44%

Employer State Id Number (State EIN) 95.93% 87.34%

State Income Amount 93.48% 87.33%

Wages Amount 98.25% 99.77%

Withholding Amount 97.12% 98.42%

Information Extraction and Named Entity Recognitionsharif.edu/~sani/courses/nlp/Lec16.pdfChristopher Manning Named Entity Recognition (NER) • A very important sub-task: find and

Ben-Ner, Culture, identity

Lymba (QA System)ad-teaching.informatik.uni-freiburg.de/information-extraction-ws1314… · Lymba – Fabian Schillinger 11/26 ROSE – a new NER System Uses 3-step process: Pass

orporation (“NER”)

Par Ner Ship

Information Extraction - Evaluation, Rule-based NERfraser/... · Named Entity Recognition Named Entity Recognition (NER) is the process of finding entities (people, cities, organizations,

LeNER-Br: A Dataset for Named Entity Recognition in ...teodecampos/LeNER-Br/luz_etal... · Named Entity Recognition Named Entity Recognition (NER) is a sub-task of Information Extraction

The BIOCREATIVE Task in SEER. Outline Background for biomedical information extraction and BIOCREATIVE BIOCREATIVE NER Task Stanford-Edinburgh System

Sportsimages.halinet.on.ca/OakvilleImages/Images/OI002344555pf...Mile Sports Complex. Pictured clockwise from left, the Edge’s elementary, begin-ner 2 and begin-ner 1 teams per-form

NER pathway

€¦ · All SS/SMs/NER All DRMs/Sr.DCMs/DCMs/NER NERIGKP All PHODs/NER/GKP CCM, All Indian Railways CCO DY.CCMJUTS,TC. PO/NER/GKP Executive Director (Rates) Railway Board, New Delhi

Mattingly Hitch Ner

Extraction and characterization of microalgae proteins ... · Extraction and characterization of microalgae proteins from the extremophile Dunaliella ... in the form of single-cell

Intro to Information Extraction -Sentiment Lexicon ...faculty.cse.tamu.edu/huangrh/Fall18-638/l16_ie-overview_sentiment_lexicon.pdfNamed Entity Recognition (NER) Mars One announced

RUN NER S

NER Highlights

Evaluation of Named Entity Extraction Systems · 2009-07-17 · choice or the design of more suitable NER tools for a specific project. Keywords: Named entity extraction, named entity

NER CALCULATION

Ner Ner Leadsheet

Outline Super-quick review of previous talk More on NER by token-tagging –Limitations of HMMs –MEMMs for sequential classification Review of relation extraction

A Neural Multi-digraph Model for Chinese NER with Gazetteers · that NER is a knowledge intensive task. Back-ground knowledge is often incorporated into an NER system in the form

€¦ · File: NerRishonSPf 1 Ner Rishon Hebrew folklore Allegretto Arr. by Serban Nichifor = 90 Ner Ri- shon Ner She ni Ner She li- shi Ner Re- vii-i Ner Cha- mi- shi Ner Shi- shi

Empirical advances with text mining of electronic health ... · through information extraction (IE), named entity recognition (NER) and data mining (DM) can provide a real advantage

Brochure NER

Form 990 for AMERICAN FRIENDSOF NER MOSHE INSTITUTE (11 ... · Title: Form 990 for AMERICAN FRIENDSOF NER MOSHE INSTITUTE (11-2688904) for 06/2014 Author: AMERICAN FRIENDSOF NER MOSHE

Hil' Ner Chanuka

Economic Valuation for Information Security Investment: A ...€¦ · to SLR topic Milestone:Protocol Finalization Define final data extraction form contents Draft data extraction

Masterarbeit Process-based Data Extraction from Web Sources · Web Data Extraction is concerned with the extraction of such data into a structured form that is proper for a further

ladda ner manual

NER Aurora Thermal_Monitoring_Whitepaper

Huber Suh Ner

Ner Advt12013

NER 2013 - Graduates

IEEE International Conference on Neural Engineering NER’19IEEE International Conference on Neural Engineering NER’19 IEEE International Conference on Neural Engineering (NER’19)

Ner and Perfortnance - thehistorycenteronline.com