
Published as a conference paper at EMNLP 2018

Chargrid: Towards Understanding 2D Documents

Anoop R Katti∗  Christian Reisswig∗  Cordula Guder  Sebastian Brarda
Steffen Bickel  Johannes Höhne  Jean Baptiste Faddoul

SAP SE, Machine Learning R&D, Berlin, Germany

{anoop.raveendra.katti | christian.reisswig | cordula.guder | sebastian.brarda | steffen.bickel | johannes.hoehne | jean.baptiste.faddoul}@sap.com

Abstract

We introduce a novel type of text representation that preserves the 2D layout of a document. This is achieved by encoding each document page as a two-dimensional grid of characters. Based on this representation, we present a generic document understanding pipeline for structured documents. This pipeline makes use of a fully convolutional encoder-decoder network that predicts a segmentation mask and bounding boxes. We demonstrate its capabilities on an information extraction task from invoices and show that it significantly outperforms approaches based on sequential text or document images.

1 Introduction

Textual information is often represented through structured documents which have an inherent 2D structure. This is even more so the case with the advent of new types of media and communication such as presentations, websites, blogs, and formatted notebooks. In such documents, the layout, positioning, and sizing can be crucial to understanding the semantic content, and they provide strong guidance to human perception.

NLP addresses the task of processing and understanding natural language texts through subtasks like language modeling, classification, information extraction, summarization, translation, and question answering, among others. NLP methods typically operate on serialized text, which is a 1D sequence of characters. Such methods have proven very successful for various tasks on unformatted text (e.g., books, reviews, news articles, short text snippets). However, when processing structured and formatted documents, in which the relation between words is impacted not only by the sequential order but also by the document layout, NLP can exhibit significant shortcomings.

∗Equal contribution

Computer vision algorithms, on the other hand, are designed to exploit 2D information in the visual domain. Images are commonly processed with convolutional neural networks (LeCun et al., 1998; Krizhevsky et al., 2012; Ren et al., 2015b; Pinheiro et al., 2016) or similar architectures that preserve and exploit the 2D correlation between neighboring pixels. While it is feasible to convert structured documents into images and then apply computer vision algorithms, this approach is not optimal for understanding their semantics, as it is driven mostly by the visual content and not by the textual content. As a result, a machine learning model would first need to extract the text from the image before learning the semantics. This purely visual approach requires a more complex model and significantly larger training data compared to text-based approaches.

We propose a novel paradigm for processing and understanding structured documents. Instead of serializing a document into 1D text, the proposed method, named chargrid, preserves the spatial structure of the document by representing it as a sparse 2D grid of characters. We then formulate the document understanding task as instance-level semantic segmentation on the chargrid. More precisely, the model predicts a segmentation mask with pixel-level labels and object bounding boxes to group multiple instances of the same class. We apply the chargrid paradigm to an information extraction task from invoices and demonstrate that this method is superior to both state-of-the-art NLP algorithms and computer vision algorithms.

The rest of the document is organized in three parts: we first introduce the chargrid paradigm; we then describe a specific application, information extraction from documents; finally, we present experimental results and conclusions.



2 Related Work

NLP focuses on understanding natural language text through tasks like classification (Kim, 2014), translation (Bahdanau et al., 2014), summarization (Rush et al., 2015), and named entity recognition (Lample et al., 2016). Such methods expect unformatted text as input and, therefore, do not assume any intrinsic 2D structure.

Document analysis, on the other hand, deals largely with problems such as recognizing printed/hand-written characters from a variety of documents (Graves and Schmidhuber, 2009), processing document images for document localization (Javed and Shafait, 2018), binarization (Tensmeyer and Martinez, 2017), and layout segmentation (Chen et al., 2015). As a result, it does not focus on understanding the character- and/or word-level semantics of a document the same way as NLP.

Within computer vision, problems such as scene text detection and recognition (Goodfellow et al., 2013), semantic segmentation (Badrinarayanan et al., 2017), as well as object detection (Gupta et al., 2014; Lin et al., 2017) can be considered related problems to ours, but applied on a different domain, i.e., processing natural images instead of documents as input.

The closest to our work is Yang et al. (2017), which performs pixel-wise layout segmentation on a structured document using sentence embeddings as additional input to an encoder-decoder network architecture. For each pixel inside the area of a sentence, the sentence embedding is appended to the visual feature embedding at the last layer of the decoder. The authors show that the layout segmentation accuracy can be significantly improved when using the textual features. Another related work is Palm et al. (2017), which extracts key-value information from structured documents (invoices) using a recurrent neural network (RNN). Their work addresses the problem of document understanding; however, the RNN operates on serialized 1D text.

Combining approaches from computer vision, NLP, and document analysis, our work is the first to systematically address the task of understanding 2D documents the same way NLP understands text, while still retaining the 2D structure of structured documents.

3 Document Understanding with Chargrid

A human observer comprehends a document by understanding the semantic content of characters, words, paragraphs, and layout components. We encapsulate all such tasks under the common umbrella of document understanding. Therefore, we can formulate this problem as an instance segmentation task of characters on the page. In the following sections, we describe a new approach for solving that task.

3.1 Chargrid

Chargrid is a novel representation of a document that preserves its 2D layout. A chargrid can be constructed from character boxes, i.e., bounding boxes that each surround a single character somewhere on a given document page. This positional information can come from an optical character recognition (OCR) engine, or can be directly extracted from the layout information in the document as provided by, e.g., PDF or HTML. The coordinate space of a character box is defined by page height H and width W, and is usually measured in units of pixels.

The complete text of a document page can thus be represented as a set of tuples $D = \{(c_k, b_k) \mid k = 0, \dots, n\}$, where $c_k$ denotes the k-th character on the page and $b_k$ the associated character box of the k-th character, formalized by the top-left pixel position, width, and height: $b_k = (x_k, y_k, w_k, h_k)$.

We can now construct the chargrid $g \in \mathbb{N}^{H \times W}$ of the original document page, with character-pixels $g_{ij}$ computed from the set $D$ as

$$ g_{ij} = \begin{cases} E(c_k) & \text{if } (i, j) \prec b_k \\ 0 & \text{otherwise,} \end{cases} \qquad (1) $$

where $\prec$ means 'overlaps with', and where each point $(i, j)$ corresponds to some pixel in the original document page pixel coordinate space defined by $(H, W)$. $E(c_k)$ is some encoding of the character in the k-th character box, i.e., the value of character $c_k$ may be mapped to a specific integer index. For instance, we may map the alphabet (or any characters of interest) to non-zero indices $\{a, b, c, \dots\} \to \{1, 2, 3, \dots\}$. Note that we assume that character boxes cannot overlap, i.e., each character on a document page occupies a unique region on the page.

Page 3: Chargrid: Towards Understanding 2D Documents2007/02/01  · formatted notebooks. In such documents, the lay-out, positioning, and sizing might be crucial to un-derstand its semantic

Published as a conference paper at EMNLP 2018

Figure 1: Example of a document page (left) and the corresponding chargrid representation g (right).

In practice, it may happen that the corners and edges of a character box overlap with other nearby characters. We resolve such corner cases by assigning the character-pixel to the box with the closest box center.

In other words, the chargrid representation is constructed as follows: for each character $c_k$ at location $b_k$, the area covered by that character is filled with the constant index value $E(c_k)$. All remaining character-pixels, corresponding to empty regions on the original document page, are initialized with 0. Figure 1 visualizes the chargrid representation of an example input document.
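For illustration, the following is a minimal NumPy sketch of this construction. The paper does not publish code; the function name `build_chargrid`, the `CharBox` tuple layout, and the `<unk>` convention are our own illustrative choices.

```python
import numpy as np
from typing import Dict, List, Tuple

# A character box: (character, x, y, width, height) in page pixel coordinates.
CharBox = Tuple[str, int, int, int, int]

def build_chargrid(char_boxes: List[CharBox],
                   page_h: int, page_w: int,
                   encoding: Dict[str, int]) -> np.ndarray:
    """Rasterize (character, box) tuples into a 2D integer grid.

    Empty regions stay 0 (background); every pixel covered by a
    character box is filled with that character's integer index.
    """
    grid = np.zeros((page_h, page_w), dtype=np.int32)
    for ch, x, y, w, h in char_boxes:
        idx = encoding.get(ch, encoding['<unk>'])
        grid[y:y + h, x:x + w] = idx
    return grid

# Example: map a tiny alphabet to non-zero indices; 0 stays background.
encoding = {ch: i + 2 for i, ch in enumerate('abcdefghijklmnopqrstuvwxyz')}
encoding['<unk>'] = 1

boxes = [('i', 10, 5, 8, 12), ('n', 19, 5, 8, 12), ('v', 28, 5, 8, 12)]
g = build_chargrid(boxes, page_h=50, page_w=100, encoding=encoding)
```

Note that this simple version lets later boxes overwrite earlier ones on overlap; the tie-breaking by closest box center described above would require an extra pass.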

The advantage of the new chargrid representation is twofold: (i) we directly encode a character by a single scalar value rather than by a granular collection of grayscale pixels as is the case for images, making it easier for subsequent document analysis algorithms to understand the document, and (ii) because the group of pixels belonging to a given character are now all mapped to the same constant value, we can significantly downsample the chargrid representation without any loss of information. For instance, if the smallest character occupies a 10×10 pixel region in the original document, we can downsample the chargrid representation by a factor of 10×10. This significantly reduces the computational time of subsequent processing steps, such as training a machine learning model on this data representation.

We point out that a character can occupy a region spanning several character-pixels. This is in contrast to traditional NLP, where each character is represented by exactly one token. In our representation, therefore, the larger a given character, the more character-pixels represent that single character. We do not find this to be a problem; instead, it even helps, since it implicitly encodes additional information, for example the font size, that would otherwise not be available.

Before the chargrid representation is used as input to, e.g., a neural network (see Sec. 3.2), we apply 1-hot encoding to the chargrid g. Thus, the original chargrid representation $g \in \mathbb{N}^{H \times W}$ becomes a vector representation $g \in \mathbb{R}^{H \times W \times N_C}$, where $N_C$ denotes the number of characters in the vocabulary, including a padding/background character (in our case mapped to 0) and an unknown character (all characters that are not mapped by our encoding E are mapped to this character).
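A minimal sketch of this step, continuing the grid `g` from the previous snippet (the helper name is ours):

```python
import numpy as np

def one_hot(grid: np.ndarray, num_chars: int) -> np.ndarray:
    """Turn an HxW integer chargrid into an HxWxNC 1-hot tensor.

    Index 0 is the padding/background character; the unknown
    character is an ordinary index in the vocabulary.
    """
    return np.eye(num_chars, dtype=np.float32)[grid]

# NC = 54 in the invoice experiments (see Sec. 4.2).
g_onehot = one_hot(g, num_chars=54)   # shape (H, W, 54)
```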

We note that, similar to using characters for constructing the chargrid, one can also use words to construct a wordgrid in the same way. In that case, rather than using 1-hot encoding, one may use a word embedding like word2vec or GloVe. While the construction of a wordgrid seems straightforward, we have not experimented with it in the present work, as our dataset contains too many unusual words and spans multiple languages (see Sec. 4).

3.2 Network Architecture

We use the 1-hot encoded chargrid representation g as input to a fully convolutional neural network to perform semantic segmentation on the chargrid and predict a class label for each character-pixel on the document. As there can be multiple and an unknown number of instances of the same class, we further perform instance segmentation. This means that, in addition to predicting a segmentation mask, we also predict bounding boxes using techniques from object detection. This allows the model to assign characters from the same segmentation class to distinct instances.

Our model is depicted in Figure 2. It is composed of two main parts: the encoder network and the decoder network. The decoder network is further made up of two branches: the segmentation branch and the bounding box regression branch. The encoder boils down to a VGG-type network (Simonyan and Zisserman, 2014) with dilated convolutions (Yu and Koltun, 2016), batch normalization (Ioffe and Szegedy, 2015), and spatial dropout (Tompson et al., 2015). Essentially, the encoder consists of five blocks.



Figure 2: Network architecture for document understanding, the chargrid-net. Each convolutional block in the network is represented as a box. The height of a box is a proxy for feature map resolution, while the width is a proxy for the number of output channels. C corresponds to the number of 'base' channels, which in turn corresponds to the number of output channels in the first encoder block; d denotes the dilation rate.

Each block consists of three 3×3 convolutions (each comprising convolution, batch normalization, and ReLU activation), followed by spatial dropout at the end of the block. The first convolution in a block is a stride-2 convolution that downsamples the input to that block. Whenever we downsample, we increase the number of output channels C of each convolution by a factor of two. We have found stride-2 convolutions to yield slightly better results compared to max pooling. In blocks four and five of the encoder, we do not apply any downsampling, and we leave the number of channels at 512 (the first block has C = 64 channels). We use dilated convolutions in blocks three, four, and five with rates d = 2, 4, 8, respectively.
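The paper does not release code; the following tf.keras sketch is our reconstruction of the encoder from this description, and the exact channel progression follows one plausible reading of Figure 2.

```python
import tensorflow as tf

def conv_bn_relu(x, channels, stride=1, dilation=1):
    """3x3 convolution followed by batch normalization and ReLU."""
    x = tf.keras.layers.Conv2D(channels, 3, strides=stride,
                               dilation_rate=dilation, padding='same',
                               use_bias=False)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    return tf.keras.layers.ReLU()(x)

def encoder_block(x, channels, downsample=True, dilation=1, drop=0.1):
    """Three 3x3 convs, the first one stride-2 when downsampling;
    spatial dropout closes the block."""
    x = conv_bn_relu(x, channels, stride=2 if downsample else 1)
    x = conv_bn_relu(x, channels, dilation=dilation)
    x = conv_bn_relu(x, channels, dilation=dilation)
    return tf.keras.layers.SpatialDropout2D(drop)(x)

# Five blocks: channels double on each downsampling step; blocks four
# and five keep resolution and use growing dilation rates.
inp = tf.keras.Input(shape=(336, 256, 54))      # 1-hot chargrid input
e1 = encoder_block(inp, 64)
e2 = encoder_block(e1, 128)
e3 = encoder_block(e2, 256, dilation=2)
e4 = encoder_block(e3, 512, downsample=False, dilation=4)
e5 = encoder_block(e4, 512, downsample=False, dilation=8)
```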

The decoders for semantic segmentation and for bounding box regression are both made of convolutional blocks which essentially reverse the downsampling of the encoder via stride-2 transposed convolutions. Each block first concatenates features from the encoder via lateral connections, followed by 1×1 convolutions (Ronneberger et al., 2015; Pinheiro et al., 2016). Subsequently, we upsample via a 3×3 stride-2 transposed convolution. This is followed by two 3×3 convolutions. Note that whenever we upsample, we decrease the number of channels by a factor of two.
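Continuing the sketch above, a decoder block and the segmentation head might look as follows (again our reconstruction, reusing `conv_bn_relu` and the encoder outputs `e1`–`e5`; 9 output classes as in Sec. 4):

```python
def decoder_block(x, lateral, channels):
    """Concatenate the lateral encoder feature map, fuse with a 1x1
    conv, upsample with a stride-2 transposed conv, then refine with
    two 3x3 convolutions."""
    x = tf.keras.layers.Concatenate()([x, lateral])
    x = tf.keras.layers.Conv2D(channels, 1, padding='same')(x)
    x = tf.keras.layers.Conv2DTranspose(channels, 3, strides=2,
                                        padding='same')(x)
    x = conv_bn_relu(x, channels)
    return conv_bn_relu(x, channels)

# Channels halve at each upsampling step, mirroring the encoder.
d1 = decoder_block(e5, e3, 256)
d2 = decoder_block(d1, e2, 128)
d3 = decoder_block(d2, e1, 64)
# Final layer: bias, no batch norm, softmax over the class channels.
seg_probs = tf.keras.layers.Conv2D(9, 3, padding='same',
                                   activation='softmax')(d3)
```

The bounding box regression branch would replicate this decoder up to the last block, ending instead in the box-mask and box-coordinate heads described below.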

The two decoder branches are identical in architecture up to the last convolutional block. The decoder for semantic segmentation has an additional convolutional layer without batch normalization, but with bias and with softmax activation. The number of output channels of this last convolution corresponds to the number of classes. Together with the encoder, the decoder for the bounding box regression task forms a one-stage detector which makes use of focal loss (Lin et al., 2017). We also make use of the anchor box representation and the corresponding bounding box regression targets as discussed in Ren et al. (2015a). The anchor-box representation allows us to handle bounding boxes that vary widely in size and aspect ratio. Moreover, it allows us to detect boxes of different classes. The number of output channels is $2N_a$ for the box mask (foreground versus background) and $4N_a$ for the four box coordinates, where $N_a$ is the number of anchors per pixel. The weights of all layers are initialized following He et al. (2015), except for the last ones, which are initialized with a small constant value of $10^{-3}$ for stabilization purposes.

In total, we have three equally contributing loss terms:

$$ L_{total} = L_{seg} + L_{boxmask} + L_{boxcoord}, \qquad (2) $$

where $L_{seg}$ is the cross-entropy loss for segmentation, e.g., (Ronneberger et al., 2015), $L_{boxmask}$ is the binary cross-entropy loss for box masks, and $L_{boxcoord}$ is the Huber loss for box coordinate regression (Ren et al., 2015a). Both cross-entropy loss terms are augmented following the focal loss idea (Lin et al., 2017). We also make use of aggressive static class weighting in both cross-entropy loss terms, mainly to counter the strong class imbalance between irrelevant, "easy" pixels (the "background" class) and relevant, "hard" pixels (all other classes; see Sec. 4.2 for further details).
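As a minimal sketch of Eq. (2), assuming per-pixel targets are already prepared (the focal-loss modulation and the anchor matching logic are omitted; all names are ours):

```python
import tensorflow as tf

def total_loss(seg_probs, seg_labels, boxmask_probs, boxmask_labels,
               box_pred, box_targets, class_weights):
    """Equally weighted sum of the three loss terms in Eq. (2)."""
    # Class-weighted categorical cross entropy over character-pixels.
    w = tf.gather(class_weights, seg_labels)
    l_seg = tf.reduce_mean(
        w * tf.keras.losses.sparse_categorical_crossentropy(
            seg_labels, seg_probs))
    # Binary cross entropy for the foreground/background box masks.
    l_boxmask = tf.reduce_mean(
        tf.keras.losses.binary_crossentropy(boxmask_labels,
                                            boxmask_probs))
    # Huber loss for the four regressed box coordinates.
    l_boxcoord = tf.reduce_mean(
        tf.keras.losses.huber(box_targets, box_pred))
    return l_seg + l_boxmask + l_boxcoord
```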

We refer to the network depicted in Figure 2 as the chargrid-net.

4 Information Extraction from Invoices

As a concrete example of understanding structured 2D documents, we extract key-value information from invoices. We make no assumption about the format of the invoice, the country of origin (and consequently taxes, date formats, amount formats, currency, etc.), or the language. In addition, real-world invoices often contain incomplete sentences, nouns, and abbreviations.

We want our model to parse an invoice and extract 5 header fields (Invoice Number, Invoice Date, Invoice Amount, Vendor Name, and Vendor Address) as well as the list of product items purchased, referred to as the line-items. Line-items include details for each item, such as Line-item Description, Line-item Quantity, and Line-item Amount. Together with the background class, this yields 9 classes, and each character on the invoice is associated with exactly one class. We note that while header fields may only appear once on an invoice (they are unique), line-items may occur in multiple instances.

To extract the values for each field, we collect all characters that are classified as belonging to the corresponding class. For line-items, we further group the characters by the predicted line-item bounding boxes.
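One way this post-processing could look, sketched in Python (the class-id sets and all names are illustrative, not the authors' code):

```python
# Illustrative class ids; the real mapping depends on the label set.
HEADER_CLASSES = {1, 2, 3, 4, 5}      # e.g., Invoice Number, Date, ...
LINE_ITEM_CLASSES = {6, 7, 8}         # description, quantity, amount

def extract_fields(chars, char_classes, char_centers, item_boxes):
    """Group classified characters into field values.

    chars:        recognized characters (from OCR), in reading order
    char_classes: predicted class id per character (e.g., majority
                  vote over the character's pixels in the mask)
    char_centers: (x, y) center of each character box
    item_boxes:   predicted line-item boxes as (x0, y0, x1, y1)
    """
    header = {}                        # class id -> extracted string
    items = [dict() for _ in item_boxes]
    for ch, cls, (x, y) in zip(chars, char_classes, char_centers):
        if cls in HEADER_CLASSES:
            header[cls] = header.get(cls, '') + ch
        elif cls in LINE_ITEM_CLASSES:
            # Attach the character to the line-item box containing it.
            for item, (x0, y0, x1, y1) in zip(items, item_boxes):
                if x0 <= x <= x1 and y0 <= y <= y1:
                    item[cls] = item.get(cls, '') + ch
                    break
    return header, items
```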

4.1 Data

Our invoice dataset consists of 12k scanned sample invoices from a large variety of vendors and languages. We assign 10k samples for training, 1k for validation, and 1k for test, on which we report our results (see Sec. 5). We ensure that vendors that appear in one set do not appear in any other set. This is more restrictive than necessary but gives a good estimate of how well the model generalizes to unseen invoice layouts. Figure 3 shows the distribution over vendors and languages. From the vendor distribution, it can be seen that most vendors contribute only one invoice to the dataset, with at most 6 invoices coming from a single vendor.

Figure 3: The left image shows a histogram of the number of vendors over the number of contributed invoices in the dataset. Most vendors appear only once in the dataset. The right image shows a distribution over languages, illustrating the diversity of the invoice data.

From the language distribution, it can be seen that while English is the predominant language, there are large representations of French, Spanish, Norwegian, and German.

For all invoices, we collected manual annotations with bounding boxes around the fields of interest. Considerable effort was spent on ensuring that the labels are correct. In particular, each invoice was analyzed by three annotators plus a reviewer. Over 35k invoices were investigated, and in the end only a clean set of 12k invoices was selected. Furthermore, detailed instructions were given to the annotators for each field. Figure 4 visualizes the locations of annotated boxes on invoices for Invoice Amount and Line-item Quantity. It can be seen that while some regions are denser, the occurrences are spread widely over the whole invoice, illustrating the diversity of invoice formats.

4.2 Implementation Details

The invoices were processed with Tesseract (v4), an open-source OCR engine, to extract the text from the documents.

We limit the number of distinct characters in our encoding E (in Eq. 1) to the most frequent ones, which in our present case means $N_C = 54$ different characters, including the background/padding and the unknown character. Thus, the 1-hot encoded chargrid g has 54 channels. Each page of an invoice is processed independently.
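A sketch of how such a frequency-limited vocabulary could be built (function name ours):

```python
from collections import Counter

def build_encoding(corpus_chars, num_chars=54):
    """Keep the most frequent characters; reserve 0 for the
    background/padding character and 1 for the unknown character."""
    most_common = Counter(corpus_chars).most_common(num_chars - 2)
    encoding = {ch: i + 2 for i, (ch, _) in enumerate(most_common)}
    encoding['<unk>'] = 1          # everything not in the vocabulary
    return encoding
```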

Once we have created the chargrid representation of a document page, we downsample it to a fixed resolution using nearest neighbor interpolation, to ensure that all chargrid representations have the same resolution for training (and inference).


Note that this downsampling operation is still performed in the token space (i.e., before 1-hot encoding). To further ensure that no tokens in the chargrid representation are lost due to nearest neighbor interpolation, we downsample to twice the target resolution; this intermediate resolution is determined by the smallest characters in the training set. A second downsampling step, now in the 1-hot encoding, to the final target resolution of 336×256 in our case, is performed using bilinear interpolation, and the result is fed into the network. Note that we could have also first performed 1-hot encoding on the document and applied bilinear interpolation directly to the target resolution. Computationally, however, this two-stage downsampling is more efficient.
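A minimal sketch of the two-stage downsampling, assuming a NumPy integer grid and TensorFlow for the bilinear stage (names ours):

```python
import numpy as np
import tensorflow as tf

def downsample_two_stage(grid, target_hw=(336, 256), num_chars=54):
    """Stage 1: nearest-neighbor resize in token space to twice the
    target resolution. Stage 2: bilinear resize of the 1-hot tensor
    to the final resolution."""
    h2, w2 = 2 * target_hw[0], 2 * target_hw[1]
    # Nearest-neighbor resize of the integer grid via index sampling.
    rows = np.arange(h2) * grid.shape[0] // h2
    cols = np.arange(w2) * grid.shape[1] // w2
    grid_nn = grid[np.ix_(rows, cols)]
    onehot = np.eye(num_chars, dtype=np.float32)[grid_nn]
    return tf.image.resize(onehot, target_hw, method='bilinear')
```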

We handle landscape documents by simply squeezing all input pages into our target resolution using interpolation (similar to image resizing). We have found that this approach is not harmful and, for simplicity, stick to it.

Line-items can occur in an unknown number of distinct instances. Therefore, we require instance segmentation of characters on the document. To accomplish this, the model is trained to predict bounding boxes that span the entire row of one instance of a line-item, while the segmentation mask classifies the characters belonging to given column classes (such as, e.g., Line-item Quantity or Line-item Description) of that line-item instance.

We implemented our model in TensorFlow 1.4. We use SGD with momentum $\beta = 0.9$ and learning rate $\alpha = 0.05$. We used a weight decay of $\lambda = 10^{-4}$ and spatial dropout with probability $P = 0.1$. We perform random cropping of the chargrid for data augmentation (that is, we pad by 16 character-pixels in each direction after downsampling and then crop with a random offset in the range (−16, 16)).

We use aggressive class weighting in the cross-entropy loss for semantic segmentation and for the bounding box mask. We have found this to be more effective than the focal loss (which can be seen as a form of dynamic class weighting). We implement class weighting following Paszke et al. (2016) with a constant of c = 1.04. In early stages of our experiments, not yet using class weights, we observed poor performance on the bounding box regression task as well as the segmentation task.
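A sketch of this weighting scheme, following the formula of Paszke et al. (2016), $w_{class} = 1 / \ln(c + p_{class})$, with the constant c = 1.04 used here (the pixel counts in the example are invented):

```python
import numpy as np

def class_weights(pixel_counts, c=1.04):
    """ENet-style static class weighting (Paszke et al., 2016):
    w_class = 1 / ln(c + p_class), where p_class is the relative
    pixel frequency of the class."""
    p = np.asarray(pixel_counts, dtype=np.float64)
    p = p / p.sum()
    return 1.0 / np.log(c + p)

# Example: an overwhelming background class gets a much lower weight.
w = class_weights([9_000_000, 40_000, 25_000])
```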

Figure 4: Spatial distribution of Invoice Amount (left) and Line-item Quantity (right) over the invoice. This depicts the variation in the invoice layouts contained in our dataset.

The distribution of line-items on invoices in our dataset reveals that around 50% of all invoices contain fewer than three line-items. We found that repeating invoices with more than three line-items more often during training than those with only a few line-items significantly boosts bounding box regression accuracy. With a mini-batch size of 7, each model took around 360k iterations and 3 days to fully converge on a single Nvidia Tesla V100.

4.3 Evaluation Measure

To evaluate our model, we would like to measure how much work would be saved by using the extraction system compared to performing the field extraction manually. To capture this, we use a measure similar to the word error rate (Prabhavalkar et al., 2017) used in speech recognition or translation tasks.

For a given field, we count the number of insertions, deletions, and modifications of the predicted instances (pooled across the entire test set) required to match the ground truth instances. Evaluations are made on the string level. We compute this measure as

$$ 1 - \frac{\#[\text{insertions}] + \#[\text{deletions}] + \#[\text{modifications}]}{N}, $$

where $N$ is the total number of instances occurring in the ground truth of the entire test set. This measure can be negative, meaning that it would be less work to perform the extraction manually. In our present case, errors caused by the OCR engine do not affect this measure, because the same errors are present in the prediction and in the ground truth and are therefore not counted as mismatches.
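For concreteness, one simplified way to compute this measure (our own sketch; exact string matches pair up first, leftover pairs count as modifications, surplus predictions as insertions, and surplus ground truths as deletions, which approximates the alignment used in the paper):

```python
from collections import Counter

def extraction_accuracy(predictions, ground_truths):
    """Word-error-rate-like measure from Sec. 4.3 over pooled
    per-field instance strings from the entire test set."""
    pred, gold = Counter(predictions), Counter(ground_truths)
    matches = sum((pred & gold).values())     # exact string matches
    unmatched_pred = sum(pred.values()) - matches
    unmatched_gold = sum(gold.values()) - matches
    modifications = min(unmatched_pred, unmatched_gold)
    insertions = unmatched_pred - modifications
    deletions = unmatched_gold - modifications
    n = sum(gold.values())
    return 1 - (insertions + deletions + modifications) / n
```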



Figure 5: Sample invoices and their corresponding network predictions: the top row shows the chargrid input, the bottom row shows the predicted segmentation mask with overlaid bounding box predictions. Our model is able to handle a large diversity of layouts. The encoding of the characters on the chargrid has been scrambled to preserve privacy.

5 Experiments and Results

Figure 5 shows some sample predictions. It can be seen that the model successfully extracts key-value information from sample invoices despite significant diversity and complexity in the invoice layouts.

We show the quantitative results in Table 1. We compare the proposed model, chargrid-net (Sec. 3.2), against four other models. The first one is a sequential model based purely on text. It is a stack of bi-directional GRUs (Cho et al., 2014) taking a sequence of characters as input and producing a sequence of labels as output. This model is our implementation of Palm et al. (2017) and serves as a baseline comparison against a more traditional NLP approach based on sequential input. We note that we have also experimented with an extension of this model that, along with the characters, also takes as input the position of each character on the document, however with negligible gains. Therefore, we stick to the simpler model.

The second, image-only model is identical to chargrid-net (Figure 2), except that we directly take the original image of the document page as network input rather than the chargrid g. This model serves as a baseline comparison to a purely image-based approach using the raw pixel information directly as input. We note that we downsample the image to the same input resolution as the chargrid representation, that is, 336×256.

The third and fourth models are both a hybrid between the chargrid-net and the image-only model. In both models, we replicate the encoder: one encoder for the chargrid input g, and one encoder for the image of the document page. Information from the two encoders is concatenated in the decoder: whenever a lateral connection from the encoder of the original chargrid-net is concatenated in a decoder block, we now concatenate lateral connections from both encoders in that decoder block.

We distinguish two configurations of the hybrid model: model chargrid-hybrid-C64, where the chargrid and image encoders both have C = 64 base channels, and model chargrid-hybrid-C32, where the chargrid and image encoders have C = 32 base channels while the decoder still retains C = 64 base channels. The latter model is used to assess the influence of the number of encoder parameters, since the hybrid model effectively has more parameters due to the two encoders.

It can be seen that, compared to the purely text-based approach, the chargrid-net performs equivalently on single-instance, single-word fields where the 2D structure is not as important. Examples of single-instance, single-word fields are Invoice Number, Invoice Amount, and Invoice Date. The extraction of these fields could essentially also be tackled with named entity recognition approaches based on serialized text (Gillick et al., 2015; Lample et al., 2016).

Model / Field        Invoice  Invoice  Invoice  Vendor  Vendor   Line-item    Line-item  Line-item
                     Number   Amount   Date     Name    Address  Description  Quantity   Amount
sequential           80.98%   79.13%   83.98%   28.97%  16.94%   -0.01%       -0.18%      0.22%
image-only           47.79%   68.91%   45.67%   19.68%  13.99%   49.50%       46.79%     63.49%
chargrid-net         80.48%   80.74%   83.78%   36.00%  39.13%   52.80%       65.20%     65.57%
chargrid-hybrid-C32  74.85%   77.93%   80.40%   32.00%  31.48%   46.27%       64.04%     63.25%
chargrid-hybrid-C64  82.49%   80.14%   84.28%   34.27%  36.83%   48.81%       64.59%     64.53%

Table 1: Accuracy measure (cf. Sec. 4.3) for the 8-class information extraction problem on invoices. The proposed chargrid models perform consistently well on all extracted fields compared to the sequential and image models.


Figure 6: In example 1, chargrid-net attributes unrelated rows of text to the line-item's description. Moreover, some character-pixels are misclassified such that the predicted header field above the line-item is incorrect. Thus, we observe errors in both the segmentation mask and the bounding box predictions. In example 2, the model predicts the segmentation mask mostly correctly, but it fails to predict boxes for two of the four line-items. In example 3, the model fails to correctly separate adjacent multi-row line-items from each other. Please note that the ground truth is debatable in some cases; for example, it may be hard to decide which information belongs to a line-item's description and which does not.

The chargrid-net, however, significantly outperforms the sequential model on multi-instance or multi-word fields where 2D relationships between text entities are important. Examples of such fields are Line-item Description, Line-item Quantity, and Line-item Amount, which are all grouped as sub-fields of a specific line-item. Each of those fields may span a varying number of rows per line-item, see Figure 5. The sequential model fails to correctly identify those line-item fields, which manifests in a negative accuracy measure (Sec. 4.3), implying that it would be less work to perform the extraction manually than to use the automatic extraction. This is understandable, since the line-item fields have a strong 2D structure that the sequential approach is not designed to capture.

In comparison with the image-only model, chargrid-net still performs much better. This is especially true for smaller fields, which need to be read to be accurately localized. On the other hand, for larger fields like Line-item Description, which can be localized by vision alone, the gap is much smaller.

The values for the hybrid models are more interesting. One could expect that combining two complementary inputs such as the chargrid representation and the image, one capturing the content and the other capturing, e.g., table delineations, would boost the accuracy. It turns out, however, that at least in the case of our invoice dataset, the additional image encoder does not bring additional benefits. Model chargrid-hybrid-C64, where the chargrid encoder branch has the same number of base channels C = 64 as the original chargrid-net, is essentially as accurate as chargrid-net. For model chargrid-hybrid-C32, where the image and chargrid encoders combined have the same number of channels as the original chargrid-net (that is, the chargrid encoder branch only has C = 32 base channels), the accuracy is significantly reduced.

We conclude that, at least in our present extraction problem, most of the discriminative information comes from the chargrid encoder branch and thus from the chargrid representation.

Error examples of the chargrid-net model are depicted in Figure 6. One frequent error type is that the model may fail to disentangle line-items if they have a peculiar structure. While this is also challenging for human experts, other erroneous predictions are observed in samples for which the ground truth annotations are debatable.

6 Discussion

We introduce a new way of modeling documents by using a character grid as the document representation. The chargrid allows models to capture 2D relationships between characters, words, and larger units of text. The idea of the chargrid paradigm is inspired by human perception, which is heavily guided by 2D shapes and structures when understanding this type of document. The chargrid therefore allows encoding the positioning, size, and alignment of textual components in a meaningful manner. While the chargrid paradigm could be applied to various kinds of NLP tasks, we demonstrate its potential on an information extraction task from invoices. We train a deep neural network with an encoder-decoder architecture and show that the network computes accurate segmentation masks and bounding boxes, which pinpoint the relevant information on the invoice.

We evaluate the accuracy of the model and compare it to state-of-the-art NLP and computer vision approaches. While those baseline models achieve accurate predictions for individual fields, only the chargrid performs well on all information extraction tasks. Some fields, such as Invoice Number, are relatively easy to detect for a model that operates on serialized text, as discriminative keywords commonly precede the words to be extracted. However, more complex extraction tasks (e.g., Vendor Address or Line-item Quantity) cannot be performed accurately this way, as they require exploiting both textual components and 2D layout structures. The traditional image-only computer vision model yields accurate predictions only for large visual columns (such as Line-item Amount) and fails to locate extractions that require understanding the text.

However, compared to traditional sequential neural NLP models, the benefits in accuracy come at a larger computational cost. Even though the proposed model is fully convolutional and parallelizes very well on a single (or even multiple) GPU(s), the added complexity introduced by using a 2D data representation can significantly increase the total data dimensionality. In our current use case, a chargrid-net training requires up to three days until full convergence, whereas our sequential model converges after only a few hours.

Comparing chargrid to semantic segmentation on natural images, one should note that character-pixels, unlike pixels of a grayscale image, are categorical. This requires the character-pixels to be encoded as 1-hot, which yields a highly sparse data representation. While such a representation is very common for NLP problems, it is new for segmentation networks, which have previously only been applied to image segmentation.

7 Conclusion

Chargrid is a generic representation for 2D text. Using it as a base, one could solve any task on 2D text, such as document classification, named-entity recognition, information extraction, or part-of-speech tagging. Furthermore, chargrid is highly beneficial for scenarios where text and natural images are blended; one could imagine performing all the above NLP tasks on such inputs as well. This work demonstrates the advantages of chargrid on an information extraction task, but we believe that it is only the first step towards incorporating the 2D document structure into document understanding tasks. Follow-up directions could be solving other NLP tasks on structured documents using chargrid or experimenting with other computer vision algorithms on chargrid. Furthermore, it may be interesting to use word embeddings rather than 1-hot encoded characters, i.e., a wordgrid, as the 2D text input representation.

Acknowledgments

In acknowledgement of the partnership between Nvidia and SAP, this publication has been prepared using an Nvidia DGX-1. We thank Zbigniew Jerzak and Markus Noga for their support and fruitful discussions.

References

Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. 2017. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 39(12):2481–2495.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473. http://arxiv.org/abs/1409.0473.

K. Chen, M. Seuret, M. Liwicki, J. Hennebert, and R. Ingold. 2015. Page segmentation of historical document images with convolutional autoencoders. In 2015 13th International Conference on Document Analysis and Recognition (ICDAR), pages 1011–1015. https://doi.org/10.1109/ICDAR.2015.7333914.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR abs/1406.1078. http://arxiv.org/abs/1406.1078.

Dan Gillick, Cliff Brunk, Oriol Vinyals, and Amarnag Subramanya. 2015. Multilingual language processing from bytes. arXiv preprint arXiv:1512.00103.

Ian J. Goodfellow, Yaroslav Bulatov, Julian Ibarz, Sacha Arnoud, and Vinay D. Shet. 2013. Multi-digit number recognition from street view imagery using deep convolutional neural networks. CoRR abs/1312.6082. http://arxiv.org/abs/1312.6082.

Alex Graves and Jürgen Schmidhuber. 2009. Offline handwriting recognition with multidimensional recurrent neural networks. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, Curran Associates, Inc., pages 545–552. http://papers.nips.cc/paper/3449-offline-handwriting-recognition-with-multidimensional-recurrent-neural-networks.pdf.

Saurabh Gupta, Ross Girshick, Pablo Arbelaez, and Jitendra Malik. 2014. Learning rich features from RGB-D images for object detection and segmentation. In European Conference on Computer Vision, Springer, pages 345–360.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034.

Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.

K. Javed and F. Shafait. 2018. Real-time document localization in natural images by recursive application of a CNN. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 01, pages 105–110. https://doi.org/10.1109/ICDAR.2017.26.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. CoRR abs/1408.5882. http://arxiv.org/abs/1408.5882.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105.

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. arXiv preprint arXiv:1603.01360.

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278–2324.

Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. 2017. Focal loss for dense object detection. arXiv preprint arXiv:1708.02002.

Rasmus Berg Palm, Ole Winther, and Florian Laws. 2017. CloudScan: A configuration-free invoice analysis system using recurrent neural networks. arXiv preprint arXiv:1708.07403.

Adam Paszke, Abhishek Chaurasia, Sangpil Kim, and Eugenio Culurciello. 2016. ENet: A deep neural network architecture for real-time semantic segmentation. arXiv preprint arXiv:1606.02147.

Pedro O Pinheiro, Tsung-Yi Lin, Ronan Collobert, and Piotr Dollár. 2016. Learning to refine object segments. In European Conference on Computer Vision, Springer, pages 75–91.

Rohit Prabhavalkar, Tara N Sainath, Yonghui Wu, Patrick Nguyen, Zhifeng Chen, Chung-Cheng Chiu, and Anjuli Kannan. 2017. Minimum word error rate training for attention-based sequence-to-sequence models. arXiv preprint arXiv:1712.01818.

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015a. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99.

Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. 2015b. Faster R-CNN: Towards real-time object detection with region proposal networks. CoRR abs/1506.01497. http://arxiv.org/abs/1506.01497.

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, pages 234–241.

Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. CoRR abs/1509.00685. http://arxiv.org/abs/1509.00685.

Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

Chris Tensmeyer and Tony Martinez. 2017. Document image binarization with fully convolutional neural networks. CoRR abs/1708.03276. http://arxiv.org/abs/1708.03276.

Tesseract OCR. v4. https://github.com/tesseract-ocr/tesseract.

Jonathan Tompson, Ross Goroshin, Arjun Jain, Yann LeCun, and Christoph Bregler. 2015. Efficient object localization using convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 648–656.

Xiao Yang, Ersin Yumer, Paul Asente, Mike Kraley, Daniel Kifer, and C Lee Giles. 2017. Learning to extract semantic structure from documents using multimodal fully convolutional neural networks. arXiv preprint arXiv:1706.02337.

Fisher Yu and Vladlen Koltun. 2016. Multi-scale context aggregation by dilated convolutions. In ICLR.