
LayoutLM: Pre-training of Text and Layout for Document Image Understanding

Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and Ming Zhou

KDD 2020

Outline

1. Background

2. Motivation

3. Method

4. Experiments

5. Conclusion

1. Background

Document Understanding in Real World

Form Receipt Report Invoice

Born-digital Documents

Scanned Documents

Visually-rich Documents

Preprocessing

• Scanned documents
  • File format: .jpg, .png, …
  • Toolkit: optical character recognition (OCR)
  • Open source tools: Tesseract

• Born-digital documents
  • File format: .docx, .pdf, .pptx, …
  • Toolkit: DOCX parser, PDF parser, …
  • Open source tools: python-docx, pdfminer, PyMuPDF

Documents → OCR tools or a specific parser → Semi-structured data
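As a rough illustration of what the semi-structured data could look like for a scanned page, the sketch below parses word-level rows from OCR output laid out like Tesseract's TSV format (columns `level`, `left`, `top`, `width`, `height`, `conf`, `text`, among others) into words with bounding boxes. The `Word` class, the `parse_tsv` helper, and the sample values are hypothetical and only serve the example; running the OCR engine itself is outside the sketch.

```python
import csv
import io
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    x0: int  # left edge
    y0: int  # top edge
    x1: int  # right edge
    y1: int  # bottom edge


# Hypothetical OCR output in a Tesseract-style TSV layout (made-up values).
SAMPLE_TSV = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\t"
    "left\ttop\twidth\theight\tconf\ttext\n"
    "4\t1\t1\t1\t1\t0\t86\t138\t76\t10\t-1\t\n"
    "5\t1\t1\t1\t1\t1\t86\t138\t26\t10\t96\tDate\n"
    "5\t1\t1\t1\t1\t2\t117\t138\t45\t10\t91\tRouted:\n"
)


def parse_tsv(tsv: str) -> List[Word]:
    """Keep word-level rows (level 5) that carry non-empty text."""
    words = []
    reader = csv.DictReader(io.StringIO(tsv), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        text = (row["text"] or "").strip()
        if row["level"] != "5" or not text:
            continue  # skip page/block/line rows and empty words
        left, top = int(row["left"]), int(row["top"])
        right = left + int(row["width"])
        bottom = top + int(row["height"])
        words.append(Word(text, left, top, right, bottom))
    return words


words = parse_tsv(SAMPLE_TSV)
for word in words:
    print(word)
```

Each resulting record pairs a token with its position on the page, which is the kind of input the later layout-aware models rely on.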

Typical Document Understanding Task

Key Value

TO Lorillard Corporation

ADDRESS 666 Fifth Avenue

CITY New York

… …

Key Value

Total 4.95

Company Starbucks Store

Address 11302 Euclid Avenue, Cleveland, OH

Date 12/07/2014

Category: Form

Form Understanding

Receipt Understanding

Document Image Classification

Sequence Labeling

CRF LSTM

LSTM+CRF BiLSTM+CRF

Huang, Zhiheng et al. “Bidirectional LSTM-CRF Models for Sequence Tagging.” ArXiv abs/1508.01991 (2015).

Graph Convolution for Multimodal Information Extraction from Visually Rich Documents

• Propose a graph convolution based model to combine textual and visual information.

• Combine graph embedding with text embedding using a standard BiLSTM-CRF model.

Liu, Xiaojing et al. “Graph Convolution for Multimodal Information Extraction from Visually Rich Documents.” NAACL-HLT (2019).

Examples of VRDs and example entities to extract.

Document Modeling

• Model each document as a fully connected graph of text segments

• Document $D$ is a tuple $(T, E)$, where $T = \{t_1, t_2, \dots, t_n\}$, $t_i \in T$, is a set of $n$ text nodes

• $R = \{r_{i1}, r_{i2}, \dots, r_{ij}\}$, $r_{ij} \in R$, is a set of edges

• $E = T \times R \times T$ is a set of directed edges of the form $(t_i, r_{ij}, t_j)$


Document graph

Feature Extraction

• Edge embedding $r_{ij} = \left[x_{ij},\ y_{ij},\ \frac{w_i}{h_i},\ \frac{h_j}{h_i},\ \frac{w_j}{h_i}\right]$, where

  • $x_{ij}$ and $y_{ij}$ are the horizontal and vertical distances between the two text boxes

  • $w_i$ and $h_i$ are the width and height of the corresponding text box
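The edge features can be sketched in a few lines. The snippet below builds a fully connected directed graph over a handful of text boxes and computes one five-dimensional edge vector per ordered pair. The `TextBox` class, the helper names, and the sample boxes are hypothetical; measuring distances as signed offsets between top-left corners is an assumption, since the slide does not say how the distance is taken.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class TextBox:
    text: str
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height


def edge_embedding(ti: TextBox, tj: TextBox) -> List[float]:
    """Edge vector [x_ij, y_ij, w_i/h_i, h_j/h_i, w_j/h_i] for the edge i -> j."""
    # Assumption: distances are signed offsets between top-left corners.
    x_ij = tj.x - ti.x
    y_ij = tj.y - ti.y
    return [x_ij, y_ij, ti.w / ti.h, tj.h / ti.h, tj.w / ti.h]


def build_edges(boxes: List[TextBox]) -> Dict[Tuple[int, int], List[float]]:
    """Fully connected directed graph: one edge per ordered pair (i, j), i != j."""
    return {
        (i, j): edge_embedding(boxes[i], boxes[j])
        for i, j in permutations(range(len(boxes)), 2)
    }


# Made-up boxes loosely modelled on a receipt.
boxes = [
    TextBox("Total", 10, 100, 40, 10),
    TextBox("4.95", 70, 100, 30, 10),
    TextBox("Date", 10, 120, 35, 10),
]
edges = build_edges(boxes)
print(len(edges))
print(edges[(0, 1)])
```

Dividing by $h_i$ makes the last three features ratios rather than absolute sizes, so they do not change when the page is rescaled.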


The Popular BERT and Its Family

• Contextual embedding
• Pre-training technique

BERTgrid: Contextualized Embedding for 2D Document Representation and Understanding

• Incorporate contextualized embedding into the grid document representation

Denk, Timo I. and Christian Reisswig. “BERTgrid: Contextualized Embedding for 2D Document Representation and Understanding.” Document Intelligence Workshop at NeurIPS (2019).

2. Motivation

Motivations

1. Previous work: contextual text embedding + non-contextual spatial information

2. Local invariance in document layout

3. Extra information in visually rich documents

Be Contextual

Problem: contextual text embedding + non-contextual spatial information

Contextualizing spatial information to represent local invariance

Local Invariance in Document Layout

• Relative positions of words in a document contribute substantially to the semantic representation.

• Local invariance
  • Key-value layout: left/right or up/down
  • Table layout: grid

• Pre-training can exploit this local invariance and better align the layout information with the semantic representation.

Visual Feature in Document Style

• Document-level
  • The whole image can indicate the document layout

• Word-level
  • Visual features and styles, such as bold, underline, and italic

Insufficient and Expensive Labeled Data

Massive unlabeled documents

Few labeled documents

Pre-training Techniques

Self-supervised training on large amounts of text.

Supervised training on a specific task with labeled data.

Language Understanding

Text-only feature

Document Image Understanding

Text feature

Layout feature

Style feature

…

Goals

1. 2D Language Model: contextual text embedding + contextual spatial information

2. Modeling and pre-training local invariance in document layout

3. Utilizing visual information in visually rich documents

3. Method

LayoutLM Architecture

2-D Position Embedding

BERT

LayoutLM

Image Embedding
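A minimal sketch of how 2-D position embeddings might be combined with a token's text embedding: each of the four bounding-box coordinates looks up a vector, and the vectors are summed with the text embedding. The toy sizes, the vocabulary, the `embed` helper, and the choice of one lookup table per axis (shared by x0/x1 and by y0/y1) are assumptions for illustration. BERT's 1-D position embedding and the image embedding are left out.

```python
import random
from typing import Dict, List, Tuple

DIM = 4            # toy hidden size (assumption; real models are far larger)
MAX_COORD = 1024   # toy number of distinct coordinate values (assumption)

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1)


def make_table(rows: int, rng: random.Random) -> List[List[float]]:
    """Randomly initialised embedding table with `rows` vectors of size DIM."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(DIM)] for _ in range(rows)]


rng = random.Random(0)
vocab: Dict[str, int] = {"[CLS]": 0, "Date": 1, "Routed:": 2}
text_table = make_table(len(vocab), rng)
x_table = make_table(MAX_COORD, rng)  # shared by x0 and x1 (assumption)
y_table = make_table(MAX_COORD, rng)  # shared by y0 and y1 (assumption)


def embed(token: str, box: Box) -> List[float]:
    """Sum the text embedding with the four 2-D position embeddings."""
    x0, y0, x1, y1 = box
    parts = [
        text_table[vocab[token]],
        x_table[x0],
        y_table[y0],
        x_table[x1],
        y_table[y1],
    ]
    return [sum(column) for column in zip(*parts)]


date_vec = embed("Date", (86, 138, 112, 148))
moved_vec = embed("Date", (117, 138, 162, 148))
print(date_vec)
print(moved_vec)
```

The same word placed at a different location gets a different input vector, which is what lets the model condition on layout as well as text.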

Pre-training for LayoutLM

• Masked Visual-Language Model

[Figure: LayoutLM input embeddings for "Date Routed: January 11, 1994 Contract No. 4011 0000". In the input, "Routed:" and "No." are replaced by [MASK] while their position embeddings are kept. Each token's text embedding is summed with the position embeddings of its bounding box (x0, y0, x1, y1).]

Token            x0    y0    x1    y1
[CLS]            0     0     maxW  maxH
Date             86    138   112   148
Routed: (MASK)   117   138   162   148
January          227   138   277   153
11,              281   138   293   148
1994             303   139   331   149
Contract         415   138   464   149
No. (MASK)       468   139   487   149
4011             556   139   583   150
0000             589   139   621   150
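The masked visual-language model objective can be sketched as BERT-style token masking that leaves the bounding boxes untouched, so the model has to predict a hidden word from its textual context and its position. The `mask_tokens` helper is hypothetical, the 15% masking rate is an assumption borrowed from BERT rather than something stated on the slide, and BERT's occasional random-token or keep-token replacement is omitted for brevity.

```python
import random
from typing import List, Optional, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1)
MASK = "[MASK]"


def mask_tokens(
    tokens: Sequence[str],
    boxes: Sequence[Box],
    mask_prob: float,
    rng: random.Random,
) -> Tuple[List[str], List[Box], List[Optional[str]]]:
    """Hide some tokens but keep every bounding box.

    Returns the masked tokens, the unchanged boxes, and per-position labels
    (the original token where masked, None elsewhere).
    """
    masked: List[str] = []
    labels: List[Optional[str]] = []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(token)   # the model must recover this token
        else:
            masked.append(token)
            labels.append(None)    # position is ignored by the loss
    return masked, list(boxes), labels


tokens = ["Date", "Routed:", "January", "11,", "1994", "Contract", "No.", "4011"]
boxes = [
    (86, 138, 112, 148), (117, 138, 162, 148), (227, 138, 277, 153),
    (281, 138, 293, 148), (303, 139, 331, 149), (415, 138, 464, 149),
    (468, 139, 487, 149), (556, 139, 583, 150),
]
masked, kept_boxes, labels = mask_tokens(tokens, boxes, 0.15, random.Random(0))
for row in zip(masked, kept_boxes, labels):
    print(row)
```

Because the boxes survive masking, a masked slot to the right of "Contract" still carries its layout cue.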

Pre-training for LayoutLM

• Document Image Classification

[Figure: LayoutLM input embeddings for the unmasked input "[CLS] Date Routed: January 11, 1994 Contract No. 4011 0000". Each token's text embedding is summed with the position embeddings of its bounding box (x0, y0, x1, y1); [CLS] uses the box (0, 0, maxW, maxH).]

Pre-training Data

11 million scanned document images from IIT-CDIP Test Collection 1.0 https://ir.nist.gov/cdip/


4. Experiments

Downstream Tasks

Form Understanding

Receipt Understanding

Document Image Classification

Form Understanding with LayoutLM

[Task] Sequence labeling (B-I-O class labels) for key-value pairs from forms
[Data] 149 training, 50 testing
[Metric] Precision, Recall, F1
[Baseline] Pre-trained BERT and RoBERTa
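To make the B-I-O labeling concrete, the sketch below turns labeled word segments into per-word tags: the first word of a segment gets `B-<label>`, later words get `I-<label>`, and unlabeled words get `O`. The `to_bio` helper and the `KEY`/`VALUE` label names are hypothetical; the example words come from the form shown earlier.

```python
from typing import List, Optional, Tuple

Segment = Tuple[List[str], Optional[str]]  # (words, label or None)


def to_bio(segments: List[Segment]) -> List[Tuple[str, str]]:
    """Convert labeled segments into (word, B-I-O tag) pairs."""
    tagged: List[Tuple[str, str]] = []
    for words, label in segments:
        for k, word in enumerate(words):
            if label is None:
                tag = "O"                      # outside any entity
            elif k == 0:
                tag = "B-" + label             # beginning of an entity
            else:
                tag = "I-" + label             # inside the same entity
            tagged.append((word, tag))
    return tagged


segments: List[Segment] = [
    (["TO"], "KEY"),
    (["Lorillard", "Corporation"], "VALUE"),
    (["ADDRESS"], "KEY"),
    (["666", "Fifth", "Avenue"], "VALUE"),
    (["…"], None),
]
for word, tag in to_bio(segments):
    print(word, tag)
```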

FUNSD: Form Understanding in Noisy Scanned Documents
https://guillaumejaume.github.io/FUNSD/

Form Understanding with LayoutLM

Receipt Understanding with LayoutLM

"company": "STARBUCKS STORE #10208",
"date": "12/07/2014",
"address": "11302 EUCLID AVENUE, CLEVELAND, OH (216) 229-0749",
"total": "4.95",

ICDAR 2019 Robust Reading Challenge on Key Information Extraction from Scanned Receipts
https://rrc.cvc.uab.es/?ch=13&com=tasks

[Task] Sequence labeling (B-I-O class labels) for values from receipts
[Data] 626 training, 347 testing
[Metric] Precision, Recall, F1
[Baseline] Pre-trained BERT, RoBERTa
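One common way to compute precision, recall, and F1 for B-I-O output is at the entity level: extract labeled spans from the gold and predicted tag sequences and count exact matches. The slide does not say at which level the scores are computed, so treat this as an assumption; the helper names and the `COMPANY`/`TOTAL` labels are hypothetical.

```python
from typing import List, Set, Tuple

Span = Tuple[str, int, int]  # (label, start, end_exclusive)


def bio_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a B-I-O sequence (stray I- tags are ignored)."""
    spans: Set[Span] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes the last span
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.add((label, start, i))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


def precision_recall_f1(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Entity-level scores based on exact span matches."""
    gold_spans, pred_spans = bio_spans(gold), bio_spans(pred)
    true_pos = len(gold_spans & pred_spans)
    precision = true_pos / len(pred_spans) if pred_spans else 0.0
    recall = true_pos / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = ["B-COMPANY", "I-COMPANY", "O", "B-TOTAL"]
pred = ["B-COMPANY", "I-COMPANY", "O", "O"]
print(precision_recall_f1(gold, pred))
```

In this toy case the one predicted entity is correct (precision 1.0) but the total is missed (recall 0.5).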

Receipt Understanding with LayoutLM

Document Image Classification with LayoutLM

[Task] Image Classification (16 classes)
[Data] RVL-CDIP dataset (320K training, 40K validation, 40K testing)
[Metric] Accuracy
[Baseline] InceptionResNetV2, LadderNet, Multimodal

https://www.cs.cmu.edu/~aharley/rvl-cdip/


Document Image Classification with LayoutLM

Different Data and Epochs

Different Initialization Methods

Visualization: Table Detection Task on DocBank

[Figure: table detection results on DocBank, comparing BERT and LayoutLM predictions side by side. Legend: error, correct, ground truth.]

Li, Minghao et al. “DocBank: A Benchmark Dataset for Document Layout Analysis.” ArXiv abs/2006.01038 (2020).

5. Conclusion

• LayoutLM
  • First document-level pre-trained model using text and layout
  • Supports different downstream tasks
    • Form/Invoice understanding
    • Receipt understanding
    • Document image classification

• Paper: https://arxiv.org/abs/1912.13318
• Code: https://aka.ms/layoutlm


How to conduct research as an undergraduate?

My suggestions

1. Being self-motivated and hard-working
2. Doing well in math and programming courses
3. Finding a group/professor/graduate student
4. Getting involved in a research project

Working with a Professor/Graduate Student

• Clear goal
  • A topic or an idea
  • Conference deadline

• Weekly one-to-one meeting
  • Progress report: reading, code, results

More Advice

• How to Do Research With a Professor?
  • Jason Eisner, CS professor at Johns Hopkins University, ACL Fellow
  • http://www.cs.jhu.edu/~jason/advice/how-to-work-with-a-professor.html

• How undergraduates can make successful research (in Chinese)
  • Minlie Huang, CS professor at Tsinghua University
  • http://coai.cs.tsinghua.edu.cn/hml/media/files/undergraduate-res.pdf

Life at MSRA

Novel Topic/Idea

• Mentorship
• Diverse research area

Computing Resource

• Azure Machine Learning

Programming Skill

• Research & Development

Conditions of Good Research

Acknowledgement

Acknowledgement: MSRA NLC Group

Ming Zhou, Lei Cui, Furu Wei

UniLM Family: https://github.com/microsoft/unilm

• UniLM (v1@NeurIPS'19 | v2@ICML'20): unified pre-training for language understanding and generation

• MiniLM (arXiv'20): small pre-trained models for language understanding and generation

• LayoutLM (v1@KDD'20): multimodal (text + layout/format + image) pre-training for document understanding

• s2s-ft: sequence-to-sequence fine-tuning toolkit