LayoutLM: Pre-training of Text and Layout for Document Image Understanding Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and Ming Zhou KDD 2020

Post on 19-Jan-2021

LayoutLM: Pre-training of Text and Layout for Document Image Understanding

Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and Ming Zhou

KDD 2020

Outline

1. Background

2. Motivation

3. Method

4. Experiments

5. Conclusion

1. Background

Document Understanding in Real World

Form Receipt Report Invoice

Born-digital Documents

Scanned Documents

Visually-rich Documents

Preprocessing

• Scanned documents
  • File format: .jpg, .png, …
  • Toolkit: optical character recognition (OCR)
  • Open source tools: Tesseract

• Born-digital documents
  • File format: .docx, .pdf, .pptx, …
  • Toolkit: DOCX parser, PDF parser, …
  • Open source tools: python-docx, pdfminer, PyMuPDF

Documents → OCR Tools or Specific Parser → Semi-structured Data
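As a rough illustration of what "semi-structured data" can look like after OCR or parsing, the sketch below stores each word with a bounding box and restores a simple reading order. The `Word` class, the `reading_order` helper, the pixel coordinates, and the 5-pixel line tolerance are all hypothetical choices for this example; they are not the output format of Tesseract or of any parser named on the slide.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Word:
    """One recognized word with its bounding box (hypothetical structure)."""
    text: str
    box: Tuple[int, int, int, int]  # (x0, y0, x1, y1) in page pixels


def reading_order(words: List[Word], line_tolerance: int = 5) -> List[Word]:
    """Sort words top-to-bottom, then left-to-right within a line.

    Words whose top edge lies within `line_tolerance` pixels of the first
    word of the current line are grouped into that line (an assumption).
    """
    ordered = sorted(words, key=lambda w: (w.box[1], w.box[0]))
    lines: List[List[Word]] = []
    for word in ordered:
        if lines and abs(word.box[1] - lines[-1][0].box[1]) <= line_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    return [w for line in lines for w in sorted(line, key=lambda w: w.box[0])]


# Made-up coordinates for a fragment of the form example on the next slide.
words = [
    Word("Avenue", (200, 51, 260, 61)),
    Word("ADDRESS", (10, 50, 80, 60)),
    Word("TO", (10, 20, 30, 30)),
    Word("666", (100, 52, 130, 62)),
    Word("Fifth", (140, 50, 190, 60)),
]
print(" ".join(w.text for w in reading_order(words)))
```

Keeping the box next to the text is what lets later models use layout as well as words.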

Typical Document Understanding Task

Form Understanding

Key        Value
TO         Lorillard Corporation
ADDRESS    666 Fifth Avenue
CITY       New York
…          …

Receipt Understanding

Key        Value
Total      4.95
Company    StarBucks Store
Address    11302 Euclid Avenue, Cleveland, OH
Date       12/07/2014

Document Image Classification

Category: Form

Sequence Labeling

CRF, LSTM, LSTM+CRF, BiLSTM+CRF

Huang, Zhiheng et al. “Bidirectional LSTM-CRF Models for Sequence Tagging.” ArXiv abs/1508.01991 (2015).

Graph Convolution for Multimodal Information Extraction from Visually Rich Documents

• Propose a graph-convolution-based model to combine textual and visual information.

• Combine graph embedding with text embedding using a standard BiLSTM-CRF model.

Liu, Xiaojing et al. “Graph Convolution for Multimodal Information Extraction from Visually Rich Documents.” NAACL-HLT (2019).

Examples of VRDs and example entities to extract.

Document Modeling

• Model each document as a fully connected graph of text segments

• Document $D$ is a tuple $(T, E)$, where $T = \{t_1, t_2, \ldots, t_n\}$, $t_i \in T$, is a set of $n$ text nodes

• $R = \{r_{i1}, r_{i2}, \ldots, r_{ij}\}$, $r_{ij} \in R$, is a set of edges

• $E = T \times R \times T$ is a set of directed edges of the form $(t_i, r_{ij}, t_j)$

Liu, Xiaojing et al. “Graph Convolution for Multimodal Information Extraction from Visually Rich Documents.” NAACL-HLT (2019).
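The document graph defined above can be sketched as follows. This is only an illustration, assuming nodes are identified by their index in a list of text segments and that self-loops are excluded. The function name `build_document_graph` is hypothetical and is not taken from Liu et al.

```python
from itertools import permutations
from typing import List, Tuple


def build_document_graph(segments: List[str]) -> List[Tuple[int, int]]:
    """Return the directed edges (i, j), i != j, of a fully connected graph.

    Each index stands for one text node t_i; the relation r_ij attached to
    an edge would be computed separately from the boxes of nodes i and j.
    """
    return list(permutations(range(len(segments)), 2))


segments = ["TO", "Lorillard Corporation", "ADDRESS"]
edges = build_document_graph(segments)
print(len(edges), edges)
```

With n text nodes this yields n(n-1) directed edges, so the graph grows quadratically with the number of segments.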

Document graph

Feature Extraction

• Edge embedding $r_{ij} = [x_{ij}, y_{ij}, \frac{w_i}{h_i}, \frac{h_j}{h_i}, \frac{w_j}{h_i}]$, where

  • $x_{ij}$ and $y_{ij}$ are the horizontal and vertical distances between the two text boxes

  • $w_i$ and $h_i$ are the width and height of the corresponding text box

Liu, Xiaojing et al. “Graph Convolution for Multimodal Information Extraction from Visually Rich Documents.” NAACL-HLT (2019).
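A minimal sketch of the edge feature vector, under stated assumptions: the slide does not say how the distance between two boxes is measured, so this example uses the signed offset between box centers, and the `TextBox` layout (left, top, width, height) is hypothetical.

```python
from typing import List, NamedTuple


class TextBox(NamedTuple):
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height


def edge_embedding(a: TextBox, b: TextBox) -> List[float]:
    """Compute [x_ij, y_ij, w_i/h_i, h_j/h_i, w_j/h_i] for boxes i=a, j=b.

    x_ij and y_ij are taken as center-to-center offsets (an assumption).
    """
    x_ij = (b.x + b.w / 2) - (a.x + a.w / 2)
    y_ij = (b.y + b.h / 2) - (a.y + a.h / 2)
    return [x_ij, y_ij, a.w / a.h, b.h / a.h, b.w / a.h]


key_box = TextBox(x=10, y=50, w=70, h=10)      # e.g. "ADDRESS"
value_box = TextBox(x=100, y=50, w=160, h=10)  # e.g. "666 Fifth Avenue"
print(edge_embedding(key_box, value_box))
```

Dividing by the height of box i makes the last three features independent of the scan resolution.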

Popular BERT and Its Family

• Contextual embedding
• Pre-training technique

BERTgrid: Contextualized Embedding for 2D Document Representation and Understanding

• Incorporate contextualized embedding into the grid document representation

Denk, Timo I. and Christian Reisswig. “BERTgrid: Contextualized Embedding for 2D Document Representation and Understanding.” Document Intelligence Workshop at NeurIPS (2019).
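To make the idea of a grid document representation concrete, here is a small sketch that writes one vector per word into every grid cell covered by that word's box. It is an illustration only: the two-dimensional vectors stand in for contextualized embeddings, the coordinates are given directly in grid cells, and `build_grid` is a hypothetical helper rather than the BERTgrid implementation.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in grid cells, x1/y1 exclusive


def build_grid(height: int, width: int, boxes: Sequence[Box],
               vectors: Sequence[Sequence[float]], dim: int) -> List[List[List[float]]]:
    """Return a height x width x dim grid; uncovered cells stay all-zero."""
    grid = [[[0.0] * dim for _ in range(width)] for _ in range(height)]
    for (x0, y0, x1, y1), vec in zip(boxes, vectors):
        for row in range(y0, y1):
            for col in range(x0, x1):
                grid[row][col] = list(vec)
    return grid


boxes = [(0, 0, 2, 1), (2, 1, 4, 2)]   # two words on a 2 x 4 grid
vectors = [[1.0, 0.0], [0.0, 1.0]]     # stand-ins for contextual embeddings
grid = build_grid(height=2, width=4, boxes=boxes, vectors=vectors, dim=2)
for row in grid:
    print(row)
```

The resulting tensor can be fed to an image-style network, which is where the 2D structure of the page is used.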

2. Motivation

Motivations

1. Previous work: contextual text embedding + non-contextual spatial information

2. Local invariance in document layout

3. Extra information in visually rich documents

Be Contextual

Problem: contextual text embedding + non-contextual spatial information

Contextualizing spatial information to represent local invariance

Local Invariance in Document Layout

• Relative positions of words in a document contribute a lot to the semantic representation.

• Local invariance
  • Key-value layout: left/right or up/down
  • Table layout: grid

• Pre-training can exploit this local invariance to better align layout information with the semantic representation.

Visual Feature in Document Style

• Document-level
  • The whole image can indicate the document layout

• Word-level
  • Visual features: styles such as bold, underline, and italic

Insufficient and Expensive Labeled Data

Massive unlabeled documents

Few labeled documents

Pre-training Techniques

Pre-training: self-supervised training on large amounts of text.

Fine-tuning: supervised training on a specific task with labeled data.

Language Understanding

Text-only feature

Document Image Understanding

Text feature

Layout feature

Style feature

Goals

1. 2D Language Model: contextual text embedding + contextual spatial information

2. Modeling and pre-training local invariance in document layout

3. Utilizing visual information in visually rich documents

3. Method

LayoutLM Architecture

2-D Position Embedding

BERT

LayoutLM

Image Embedding

Pre-training for LayoutLM

• Masked Visual-Language Model

Input: Date MASK January 11, 1994 Contract MASK 4011

Each input embedding is the sum (+) of a text embedding (Text) and four position embeddings (Layout):

Text embedding   Position (x0)   Position (y0)   Position (x1)   Position (y1)
E([CLS])         E(0)            E(0)            E(maxW)         E(maxH)
E(Date)          E(86)           E(138)          E(112)          E(148)
E(Routed:)       E(117)          E(138)          E(162)          E(148)
E(January)       E(227)          E(138)          E(277)          E(153)
E(11,)           E(281)          E(138)          E(293)          E(148)
E(1994)          E(303)          E(139)          E(331)          E(149)
E(Contract)      E(415)          E(138)          E(464)          E(149)
E(No.)           E(468)          E(139)          E(487)          E(149)
E(4011)          E(556)          E(139)          E(583)          E(150)
E(0000)          E(589)          E(139)          E(621)          E(150)
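The embedding sum and the masking step of the Masked Visual-Language Model can be sketched in plain Python. This is a toy illustration under several assumptions: the vectors are random, `DIM`, `EmbeddingTable`, `input_embedding`, and `mask_tokens` are hypothetical names, and sharing one table between x0 and x1 (and one between y0 and y1) is a choice made for this example. The tokens and coordinates are the ones shown on the slide.

```python
import random
from typing import Dict, Hashable, List, Sequence, Set, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1)
DIM = 4  # toy hidden size, chosen only for illustration


class EmbeddingTable:
    """Random vectors created on first lookup (stand-in for a learned table)."""

    def __init__(self, dim: int, seed: int) -> None:
        self.dim = dim
        self.rng = random.Random(seed)
        self.rows: Dict[Hashable, List[float]] = {}

    def __call__(self, key: Hashable) -> List[float]:
        if key not in self.rows:
            self.rows[key] = [self.rng.uniform(-1.0, 1.0) for _ in range(self.dim)]
        return self.rows[key]


text_emb = EmbeddingTable(DIM, seed=0)
x_emb = EmbeddingTable(DIM, seed=1)  # used for both x0 and x1 (assumption)
y_emb = EmbeddingTable(DIM, seed=2)  # used for both y0 and y1 (assumption)


def input_embedding(token: str, box: Box) -> List[float]:
    """Sum the text embedding and the four position embeddings element-wise."""
    x0, y0, x1, y1 = box
    parts = [text_emb(token), x_emb(x0), y_emb(y0), x_emb(x1), y_emb(y1)]
    return [sum(column) for column in zip(*parts)]


def mask_tokens(tokens: Sequence[str], positions: Set[int]) -> List[str]:
    """Replace the chosen tokens with [MASK]; their boxes are left untouched."""
    return ["[MASK]" if i in positions else tok for i, tok in enumerate(tokens)]


tokens = ["Date", "Routed:", "January", "11,", "1994", "Contract", "No.", "4011"]
boxes = [(86, 138, 112, 148), (117, 138, 162, 148), (227, 138, 277, 153),
         (281, 138, 293, 148), (303, 139, 331, 149), (415, 138, 464, 149),
         (468, 139, 487, 149), (556, 139, 583, 150)]

masked = mask_tokens(tokens, {1, 6})
inputs = [input_embedding(tok, box) for tok, box in zip(masked, boxes)]
print(masked)
```

Because only the text is masked, the model still sees where the hidden word sits on the page and has to predict it from both its neighbors and its position.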

Pre-training for LayoutLM

• Document Image Classification

Input: Date Routed: January 11, 1994 Contract No. 4011

Each input embedding is the sum (+) of a text embedding (Text) and four position embeddings (Layout):

Text embedding   Position (x0)   Position (y0)   Position (x1)   Position (y1)
E([CLS])         E(0)            E(0)            E(maxW)         E(maxH)
E(Date)          E(86)           E(138)          E(112)          E(148)
E(Routed:)       E(117)          E(138)          E(162)          E(148)
E(January)       E(227)          E(138)          E(277)          E(153)
E(11,)           E(281)          E(138)          E(293)          E(148)
E(1994)          E(303)          E(139)          E(331)          E(149)
E(Contract)      E(415)          E(138)          E(464)          E(149)
E(No.)           E(468)          E(139)          E(487)          E(149)
E(4011)          E(556)          E(139)          E(583)          E(150)
E(0000)          E(589)          E(139)          E(621)          E(150)
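For the classification objective, the [CLS] token is given a box that spans the whole page, (0, 0, maxW, maxH). The sketch below prepends that token and rescales all boxes to a fixed range. The 800 x 1000 page size, the helper names `add_cls` and `normalize_box`, and the rescaling to 0..1000 are assumptions made for this example; the slide does not state a page size or a normalization scheme.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1)


def add_cls(tokens: Sequence[str], boxes: Sequence[Box],
            max_w: int, max_h: int) -> Tuple[List[str], List[Box]]:
    """Prepend [CLS] with a box covering the full page."""
    return ["[CLS]"] + list(tokens), [(0, 0, max_w, max_h)] + list(boxes)


def normalize_box(box: Box, max_w: int, max_h: int, scale: int = 1000) -> Box:
    """Rescale pixel coordinates to the range 0..scale (hypothetical helper)."""
    x0, y0, x1, y1 = box
    return (scale * x0 // max_w, scale * y0 // max_h,
            scale * x1 // max_w, scale * y1 // max_h)


page_w, page_h = 800, 1000  # made-up page size in pixels
tokens, boxes = add_cls(["Date", "Routed:"],
                        [(86, 138, 112, 148), (117, 138, 162, 148)],
                        page_w, page_h)
normalized = [normalize_box(b, page_w, page_h) for b in boxes]
print(list(zip(tokens, normalized)))
```

A fixed coordinate range keeps the position embedding tables a constant size regardless of scan resolution.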

Pre-training Data

11 million scanned document images from IIT-CDIP Test Collection 1.0 https://ir.nist.gov/cdip/

4. Experiments

Downstream Tasks

Form Understanding

Receipt Understanding

Document Image Classification

Form Understanding with LayoutLM

[Task] Sequence labeling (B-I-O class labels) for key-value pairs in forms
[Data] 149 training, 50 testing
[Metric] Precision, Recall, F1
[Baseline] Pre-trained BERT and RoBERTa

FUNSD: Form Understanding in Noisy Scanned Documents
https://guillaumejaume.github.io/FUNSD/
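B-I-O labeling can be illustrated with a few lines of Python. This is a sketch only: the label names KEY and VALUE and the helper `bio_tags` are hypothetical and do not reproduce the FUNSD label set, and the words come from the form example earlier in the talk.

```python
from typing import List, Sequence, Tuple

Span = Tuple[int, int, str]  # (start index, end index exclusive, label)


def bio_tags(words: Sequence[str], spans: Sequence[Span]) -> List[str]:
    """Tag the first word of each span B-<label>, the rest I-<label>, others O."""
    tags = ["O"] * len(words)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


words = ["TO", "Lorillard", "Corporation", "ADDRESS", "666", "Fifth", "Avenue"]
spans = [(0, 1, "KEY"), (1, 3, "VALUE"), (3, 4, "KEY"), (4, 7, "VALUE")]
for word, tag in zip(words, bio_tags(words, spans)):
    print(f"{word:12s} {tag}")
```

The B- prefix marks where a new span starts, which keeps two adjacent spans of the same type separable.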

Form Understanding with LayoutLM

Receipt Understanding with LayoutLM

"company": "STARBUCKS STORE #10208",
"date": "12/07/2014",
"address": "11302 EUCLID AVENUE, CLEVELAND, OH (216) 229-0749",
"total": "4.95",

ICDAR 2019 Robust Reading Challenge on Key Information Extraction from Scanned Receipts
https://rrc.cvc.uab.es/?ch=13&com=tasks

[Task] Sequence labeling (B-I-O class labels) for values from receipts
[Data] 626 training, 347 testing
[Metric] Precision, Recall, F1
[Baseline] Pre-trained BERT, RoBERTa
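Precision, recall, and F1 over B-I-O tags can be computed as in the sketch below. It assumes entity-level scoring, where a prediction counts only if its span and label both match exactly; the slides do not say whether the reported numbers are entity-level or word-level. The helper names and the toy tag sequences are hypothetical.

```python
from typing import List, Set, Tuple

Entity = Tuple[int, int, str]  # (start, end exclusive, label)


def extract_entities(tags: List[str]) -> Set[Entity]:
    """Collect spans that start with B-<label> and continue with I-<label>."""
    entities: Set[Entity] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes the last span
        if start is not None and tag != f"I-{label}":
            entities.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return entities


def precision_recall_f1(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    g, p = extract_entities(gold), extract_entities(pred)
    tp = len(g & p)  # exact span-and-label matches
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = ["B-company", "I-company", "O", "B-date", "B-total"]
pred = ["B-company", "I-company", "O", "B-date", "O"]
print(precision_recall_f1(gold, pred))
```

Here the prediction misses the total field, so precision stays at 1.0 while recall drops to 2/3.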

Receipt Understanding with LayoutLM

Document Image Classification with LayoutLM

[Task] Image classification (16 classes)
[Data] RVL-CDIP dataset (320K training, 40K validation, 40K testing)
[Metric] Accuracy
[Baseline] InceptionResNetV2, LadderNet, Multimodal

https://www.cs.cmu.edu/~aharley/rvl-cdip/

Document Image Classification with LayoutLM

Different Data and Epochs

Different Initialization Methods

Visualization: Table Detection Task on DocBank

[Figure: table detection results on DocBank, with paired BERT and LayoutLM panels. Legend: Error, Correct, Ground Truth.]

Li, Minghao et al. “DocBank: A Benchmark Dataset for Document Layout Analysis.” ArXiv abs/2006.01038 (2020).

5. Conclusion

• LayoutLM
  • First document-level pre-trained model using text and layout
  • Supports different downstream tasks
    • Form/Invoice understanding
    • Receipt understanding
    • Document image classification

• Paper: https://arxiv.org/abs/1912.13318
• Code: https://aka.ms/layoutlm

LayoutLM

How to conduct research as an undergraduate?

My suggestions

1. Being self-motivated and hard-working
2. Doing well in math and programming courses
3. Finding a group/professor/graduate student
4. Getting involved in a research project

Working with a Professor/Graduate Student

• Clear goal
  • A topic or an idea
  • Conference deadline

• Weekly one-to-one meeting
  • Progress report: reading, code, results

More Advice

• How to Do Research With a Professor?
  • Jason Eisner, CS professor at Johns Hopkins University, ACL Fellow
  • http://www.cs.jhu.edu/~jason/advice/how-to-work-with-a-professor.html

• How undergraduates can do successful research (in Chinese)
  • Minlie Huang, CS professor at Tsinghua University
  • http://coai.cs.tsinghua.edu.cn/hml/media/files/undergraduate-res.pdf

Life at MSRA

Conditions of Good Research

• Novel Topic/Idea
  • Mentorship
  • Diverse research area

• Computing Resource
  • Azure Machine Learning

• Programming Skill
  • Research & Development

Acknowledgement

Acknowledgement: MSRA NLC Group

Ming Zhou, Lei Cui, Furu Wei

UniLM Family: https://github.com/microsoft/unilm

• UniLM (v1@NeurIPS'19 | v2@ICML'20): unified pre-training for language understanding and generation

• MiniLM (arXiv'20): small pre-trained models for language understanding and generation

• LayoutLM (v1@KDD'20): multimodal (text + layout/format + image) pre-training for document understanding

• s2s-ft: sequence-to-sequence fine-tuning toolkit

© Copyright Microsoft Corporation. All rights reserved.

Thank you for listening