
Using Regular Expressions for Data Mining and Automated Data Capture and Indexing

Copyright © 2010 - 2013 DocuFi. All Rights Reserved

In a Document Management Environment


First: What is automated data capture or data mining?

Data Capture: Just identifying and extracting information or data (sometimes called metadata) from scanned documents.

Automated Data Capture: Applying the principles of automation to data capture, silly! This can also be called text data mining.

Why automate data capture?

Manual Data Capture is Expensive and Time Consuming

Problems with manual data entry:

1. Security may be compromised if documents are taken off premises

2. A delay is introduced if documents are taken off premises

3. Compared to automated extraction, manual indexing is slow

4. Manual indexing doesn’t scale well with large projects

5. Manual indexing has the potential to introduce errors into the data


There’s a Mountain of It! Let’s take a look at just invoices, for example…


According to an Aberdeen Group August 2010 report, 72 percent of received invoices are paper-based.


Companies responding to PayStream Advisors’ 2010 Invoice Automation Benchmarking survey indicated that they receive 77 percent of their invoices via paper.




and it’s expensive


An Aberdeen Group March 2012 publication estimates the cost of processing a single invoice at between $4.84 and $20.13.

So if e-invoicing is not an option (as it’s not for many), what?

e-invoicing: sending and receiving invoices electronically


“it is the front-end capture options…that introduce true performance gains. For example, respondents who have implemented front-end document capture (creating a scanned digital copy of a physical invoice to be used in the approval process) report invoice processing 34% faster than those who process invoices manually. Moving to the pure data end of the spectrum, companies that convert scanned documents into usable data (through optical character recognition or similar technologies), report a 26% faster processing time than those that work only with document images.” ---Aberdeen’s 2010 report


And, We All Know, Time is Money


Don’t forget we are using invoices only as an example. This could apply to patient records, legal documents, purchase orders… any document.


Now that you know this is all about money, let’s go back to the focus of this slideshow.



What are Regular Expressions or regex?

Regular expressions (regex) provide a fast and powerful method to search, extract and replace specific data found within scanned documents. A regular expression is essentially a special text string that describes a search pattern. You could think of regular expressions as extremely powerful wildcards.


What’s it look like? A simple regular expression might look something like this:

^\s{1,3}[A-Z0-9]*XYZ




^           Start at the beginning of a string or line
\s{1,3}     Find a whitespace character (such as a space) that occurs between 1 and 3 times
[A-Z0-9]*   Find any character in the range A-Z or 0-9; the “*” is the instruction to find as many occurrences as possible
XYZ         Find the literal characters “XYZ”

If we had the value “ AZR8987XYZ” in our document at the start of a line we would get a match whereas if we had “ AZR898XY” we would not.
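If you would like to try this yourself, here is a minimal sketch in Python (our choice for illustration; the slides do not depend on any particular language). It uses the pattern as broken down above, including the "*" quantifier, and the two sample values from this slide. The helper name starts_with_code is made up for the example.

```python
import re

# The pattern from the breakdown above: 1-3 whitespace characters at the
# start of the line, any run of A-Z or 0-9, then the literal "XYZ".
PATTERN = re.compile(r"^\s{1,3}[A-Z0-9]*XYZ")

def starts_with_code(line):
    """Return True when the line begins with a value matching PATTERN."""
    return PATTERN.search(line) is not None

print(starts_with_code(" AZR8987XYZ"))  # True: the value ends in the literal XYZ
print(starts_with_code(" AZR898XY"))    # False: the final Z is missing
```

Note that a value with no leading whitespace at all would also fail, because \s{1,3} requires at least one whitespace character.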


Huh?

Don’t worry, this is not a tutorial on writing regex. We just want to look at some examples and understand how regex can apply to data capture and indexing in a document management environment.

Regular expressions are extremely flexible and patterns can be constructed to match almost anything. For text commonly found in documents, such as dates, SSNs and ZIP codes, patterns are freely available on the Internet. Here are some examples:

Zip Codes         ^(?!00000)(?<zip>(?<zip5>\d{5})(?:[ -](?=\d))?(?<zip4>\d{4})?)$

US Phone Number   ^([0-9]( |-)?)?(\(?[0-9]{3}\)?|[0-9]{3})( |-)?([0-9]{3}( |-)?[0-9]{4}|[a-zA-Z0-9]{7})$

Credit Card       (^(4|5)\d{3}-?\d{4}-?\d{4}-?\d{4}|(4|5)\d{15})|(^(6011)-?\d{4}-?\d{4}-?\d{4}|(6011)-?\d{12})|(^((3\d{3}))-\d{6}-\d{5}|^((3\d{14})))
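As a rough illustration of how one of these ready-made patterns can be put to work, the sketch below runs the ZIP code pattern in Python. One thing to watch: Python's re module spells named groups as (?P<name>...) rather than (?<name>...), so the group syntax is adjusted here. The function name parse_zip and the sample values are made up for the example.

```python
import re

# The ZIP code pattern listed above, with named groups written in
# Python's (?P<name>...) spelling.
ZIP_CODE = re.compile(
    r"^(?!00000)(?P<zip>(?P<zip5>\d{5})(?:[ -](?=\d))?(?P<zip4>\d{4})?)$"
)

def parse_zip(value):
    """Return the named parts of a ZIP code, or None if it does not match."""
    match = ZIP_CODE.match(value)
    return match.groupdict() if match else None

print(parse_zip("12345-6789"))  # zip, zip5 and zip4 are all filled in
print(parse_zip("12345"))       # zip4 is None for a 5-digit code
print(parse_zip("00000"))       # None: rejected by the (?!00000) check
```

The named groups are what make this useful for indexing: each part of the match can be stored in its own field.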

Real World Example

Here is a partial invoice where you might need to capture the “Catalogue Number”.


In order to start constructing a regular expression we have to use what we know from the data in front of us and make some assumptions. During testing we can refine the regular expression. In this example we can assume from the document that the catalogue number has the format of a single uppercase letter followed by 2 digits, then a hyphen, followed by either a single uppercase letter and 6 digits or just 6 digits.

We could use the regex [A-Z]\d{2}-[A-Z]{0,1}\d{6} to extract the data. Let's again break this down:

[A-Z]        Find a character from A-Z; the absence of a quantifier specification, “{}”, means we are only looking for 1 character
\d{2}        Find exactly 2 digits
-            Find the literal character “-”
[A-Z]{0,1}   Find a character A-Z between 0 and 1 repetitions
\d{6}        Find exactly 6 digits

This is just one of several ways a regular expression could be written for this example. If we should subsequently find that the last portion of the catalogue number might contain 4 to 6 digits, we could simply amend it as follows: [A-Z]\d{2}-[A-Z]{0,1}\d{4,6}.
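To make the difference between the two versions concrete, here is a small Python sketch. The catalogue numbers in the sample line are invented to fit the format described above; they are not taken from the invoice.

```python
import re

# Original pattern: letter, 2 digits, hyphen, optional letter, exactly 6 digits.
STRICT = re.compile(r"[A-Z]\d{2}-[A-Z]{0,1}\d{6}")
# Amended pattern: the last portion may have 4 to 6 digits.
RELAXED = re.compile(r"[A-Z]\d{2}-[A-Z]{0,1}\d{4,6}")

# Hypothetical line of OCR text containing three catalogue numbers.
text = "Item A12-B123456 qty 2; item C34-654321 qty 1; item D56-7890 qty 5"

print(STRICT.findall(text))   # ['A12-B123456', 'C34-654321']
print(RELAXED.findall(text))  # ['A12-B123456', 'C34-654321', 'D56-7890']
```

The 4-digit value is only picked up after the quantifier is widened to {4,6}, which is the kind of refinement you would expect to make during testing.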

Now how would this work in a data capture solution?

We’ll take a look at how regex is used in ImageRamp Batch. It’s a simple-to-use folder processing tool that accelerates getting data and files into various EMR, Document Management or other secure storage environments. It can be used to capture and extract data in both structured and unstructured documents. As an example, we might want to extract data from a scanned file with the following 4 fields:

1. Company Name

2. Company Number

3. Date

4. SIC Code

Here is the ImageRamp screen showing the scanned file pages and the data extracted using regex for the four fields we listed.

So where is the regex?

Hang on, we’ll show it. We’ll use it to split each company’s invoices out of a multipage scan based on the Company Name and to extract index data.

A company might use this to scan a large stack of invoices and split the file every time a new invoicing company name is located using the regex scripts.

Let’s break it down: splitting the scan stack.

First we are going to define the regex that performs document splitting when a new Company Name is located. This is done in the Data Mining submenu under ImageRamp’s Splitting and Extraction, as shown below.

(?<=\bCompany\s*Name\s+\b)[a-z0-9\(\) ]*

… and check the “Split if Matched” option.
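For readers who want a feel for what splitting on a match means, here is a rough Python sketch. It is not ImageRamp's implementation, just an illustration under a few assumptions: each page is already available as OCR text, a new document starts on every page where the pattern matches, and matching is case-insensitive. Python only allows fixed-width lookbehind, so the "Company Name" label is matched as ordinary text and the name is taken from a capture group instead. The function name and sample pages are invented.

```python
import re

# Python's re module only allows fixed-width lookbehind, so the label is
# matched normally and the company name comes from a capture group.
# IGNORECASE is an assumption so that [a-z] also covers uppercase names.
COMPANY_NAME = re.compile(r"\bCompany\s*Name\s+\b([a-z0-9() ]*)", re.IGNORECASE)

def split_on_company_name(pages):
    """Group page texts into documents, starting a new one at each match."""
    documents = []
    for text in pages:
        match = COMPANY_NAME.search(text)
        if match or not documents:
            name = match.group(1).strip() if match else None
            documents.append({"company": name, "pages": []})
        documents[-1]["pages"].append(text)
    return documents

# Hypothetical OCR text for a four-page scan.
pages = [
    "Company Name  ACME WIDGETS (UK) LIMITED\nCompany Number 01234567",
    "continuation page with line items",
    "Company Name  EXAMPLE TRADING LIMITED\nCompany Number 07654321",
    "continuation page",
]
docs = split_on_company_name(pages)
for doc in docs:
    print(doc["company"], len(doc["pages"]))
```

Pages without a Company Name line are treated as continuation pages and stay with the document before them.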

Let’s break it down: capturing the index data.

Remember, in our example we identified CompanyName, CompanyNo, Date and SICcode as the index or metadata information we want to capture. So here we are extracting the date field using the regex in the Index Fields section of the Data Mining submenu.

(?<Date>(?<=\bDate of this return\s+\b)\d{2}/\d{2}/\d{4})
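The same idea can be tried in Python with a small adjustment. Python's re module requires a fixed-width lookbehind and writes named groups as (?P<name>...), so in this sketch the "Date of this return" label is matched as ordinary text and only the date itself lands in the named group. The sample text and the function name extract_date are made up for illustration.

```python
import re

# The label is matched as plain text (no lookbehind), and the date is
# captured in a named group using Python's (?P<name>...) syntax.
DATE_FIELD = re.compile(
    r"\bDate of this return\s+\b(?P<Date>\d{2}/\d{2}/\d{4})"
)

def extract_date(page_text):
    """Return the date that follows the label, or None if it is absent."""
    match = DATE_FIELD.search(page_text)
    return match.group("Date") if match else None

# Hypothetical OCR text from a scanned page.
sample = "Company Number 01234567\nDate of this return   14/03/2012\nSIC Code 6420"
print(extract_date(sample))  # 14/03/2012
```

Anchoring on the label is what keeps the pattern from picking up other dates that may appear elsewhere on the page.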

Information extracted through text data mining with regex can also be used to name the file and create folders.

Here %regex1 corresponds to the first regex field definition (CompanyName) and %regex2 corresponds to the second field definition (CompanyNo).
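To picture how such placeholders might be expanded, here is a hedged Python sketch. Only the %regex1 and %regex2 placeholder names come from the slide; the template string, the function name expand_template and the field values are hypothetical, and ImageRamp's actual naming rules may differ.

```python
import re

def expand_template(template, fields):
    """Replace each %regexN placeholder with the N-th extracted field (1-based)."""
    def lookup(match):
        index = int(match.group(1))
        return fields[index - 1]
    return re.sub(r"%regex(\d+)", lookup, template)

# Hypothetical extracted values: CompanyName first, CompanyNo second.
fields = ["ACME WIDGETS LIMITED", "01234567"]
path = expand_template("%regex1/%regex2.pdf", fields)
print(path)  # ACME WIDGETS LIMITED/01234567.pdf
```

In this sketch the first field becomes the folder name and the second becomes the file name.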

But wait, there’s more.

We hope we have demonstrated the immense power of using regular expressions to extract data from both structured and unstructured documents.

Data in the palm of your hand… not locked in your documents!

For more on: • Data mining PDF • Data mining scans • Invoice mining • Patient record mining • OCR mining • TIF mining • Extracting meta data • Data extraction from unstructured data • Intelligent data capture • Data extraction • Using regex to extract data • Document scanning • Extracting data • Scanner software • Barcode recognition • OCR software • Capture tutorial • PDF scanning • Scanning software • Indexing • Document indexing • Automated capture • Meta data • Scan to index • Batch processing • Bulk scanning • DocuFi • ImageRamp • Data capture • Migration to document management

Learn more about the power of ImageRamp and its other features, including:

• Full text OCR to PDF

• PDF rights management and encryption

• Document naming, splitting, and routing based on barcodes

• Image processing for clean up and adaptive thresholding

• OCR (Optical Character Recognition)

• Barcode reading (1D and 2D)

http://www.docufi.com/products/imageramp-document-onramp

More?

http://www.slideshare.net/DocuFi/introduction-to-document-scanning

http://www.slideshare.net/DocuFi/what-is-batch-document-scanning

http://www.slideshare.net/DocuFi/dental-data-capture

http://www.slideshare.net/DocuFi/paperless-dreams-with-fujitsu-scansnap

http://www.slideshare.net/DocuFi/barcode-slideshare

http://www.slideshare.net/DocuFi/document-capture-must-haves

http://www.slideshare.net/DocuFi/what-is-document-indexing

http://www.slideshare.net/DocuFi/what-is-intelligent-document-and-data-capture

http://www.slideshare.net/DocuFi/automatic-file-naming-and-routing

Further reading on Regular Expressions:

http://en.wikipedia.org/wiki/Regular_expression

http://regexlib.com/

http://www.regular-expressions.info/




docufi.com @imageramp @docufinews

http://www.docufi.com/