Using Regular Expressions in Document Management Data Capture and Indexing

Using Regular Expressions for Data Mining and Automated Data Capture and Indexing Copyright © 2010 - 2013 DocuFi. All Rights Reserved

Upload: sandy-schiele

Post on 11-Nov-2014

DESCRIPTION

Learn how metadata (index information) can be pulled from documents using regular expressions or regex. See how regex is used to extract the index information, name files, create subfolders and more to feed your document management or EMR systems. Automated data capture is shown with ImageRamp from DocuFi, a powerful platform to capture index information from your scanned documents and drawings which integrates with today's document management and EMR systems.

TRANSCRIPT

Page 1: Using Regular Expressions in Document Management Data Capture and Indexing

Using Regular Expressions for Data Mining and Automated Data Capture and Indexing

Copyright © 2010 - 2013 DocuFi. All Rights Reserved

Page 2: Using Regular Expressions in Document Management Data Capture and Indexing

Using Regular Expressions for Data Mining and Automated Data Capture and Indexing

In a Document Management Environment

Page 4: Using Regular Expressions in Document Management Data Capture and Indexing

First: What is automated data capture or data mining?

Data Capture: Just identifying and extracting information or data (sometimes called metadata) from scanned documents

Automated Data Capture: Applying the principles of automation to data capture, silly!

This can also be called text data mining.

Page 5: Using Regular Expressions in Document Management Data Capture and Indexing

Why automate data capture?

Manual data capture is expensive and time-consuming

Page 7: Using Regular Expressions in Document Management Data Capture and Indexing

Why automate data capture?

Problems with manual data entry:

1. Security may be compromised if documents are taken off premises

2. A delay is introduced if documents are taken off premises

3. Compared to automated extraction, manual indexing is slow

4. Manual indexing doesn’t scale well with large projects

5. Manual indexing has the potential to introduce errors into the data

and…

Page 8: Using Regular Expressions in Document Management Data Capture and Indexing

There’s a Mountain of It!

Page 9: Using Regular Expressions in Document Management Data Capture and Indexing

There’s a Mountain of It!

Let’s take a look at just invoices, for example…

Page 12: Using Regular Expressions in Document Management Data Capture and Indexing

There’s a Mountain of It!

According to an Aberdeen Group August 2010 report, 72 percent of received invoices are paper-based.

Companies responding to PayStream Advisors’ 2010 Invoice Automation Benchmarking survey indicated that they receive 77 percent of their invoices via paper.

and it’s expensive

An Aberdeen Group March 2012 publication estimates the cost of processing a single invoice at $4.84 to $20.13.

Page 13: Using Regular Expressions in Document Management Data Capture and Indexing

So if e-invoicing is not an option (as it’s not for many), what?

(e-invoicing: sending and receiving invoices electronically)

“it is the front-end capture options…that introduce true performance gains. For example, respondents who have implemented front-end document capture (creating a scanned digital copy of a physical invoice to be used in the approval process) report invoice processing 34% faster than those who process invoices manually. Moving to the pure data end of the spectrum, companies that convert scanned documents into usable data (through optical character recognition or similar technologies), report a 26% faster processing time than those that work only with document images.” (Aberdeen’s 2010 report)

http://www.istockphoto.com/stock-photo-21653377-computer-password-security.php?st=80d5a93

Page 14: Using Regular Expressions in Document Management Data Capture and Indexing

And, We All Know, Time is Money

Page 15: Using Regular Expressions in Document Management Data Capture and Indexing

Don’t forget we are using invoices only as an example. But this could apply to patient records, legal documents, purchase orders…any document.

Page 16: Using Regular Expressions in Document Management Data Capture and Indexing

Now that you know this is all about money, let’s go back to the focus of this slideshow.

Page 17: Using Regular Expressions in Document Management Data Capture and Indexing

Using Regular Expressions for Data Mining and Automated Data Capture and Indexing

Page 18: Using Regular Expressions in Document Management Data Capture and Indexing

What are Regular Expressions or regex?

Regular expressions (regex) provide a fast and powerful method to search, extract and replace specific data found within scanned documents. A regular expression is essentially a special text string that describes a search pattern. You could think of regular expressions as extremely powerful wildcards.
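The three operations named above (search, extract, replace) can be sketched in a few lines of Python. This is only an illustration: the sample text, the "INV-" invoice number format and the patterns are assumptions made up for the example, not taken from any real document or from ImageRamp.

```python
import re

# Hypothetical OCR text from a scanned invoice (made up for this sketch).
text = "Invoice No: INV-20413  Date: 03/15/2012  Total: $1,250.00"

# Search: does the text contain an invoice number at all?
found = re.search(r"INV-\d+", text) is not None

# Extract: pull out the first date-like value.
date = re.search(r"\d{2}/\d{2}/\d{4}", text).group(0)

# Replace: mask the digits that follow the "INV-" prefix.
masked = re.sub(r"(?<=INV-)\d+", "#####", text)

print(found)
print(date)
print(masked)
```

The same pattern language drives all three calls; only the operation applied to the match changes.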

Page 21: Using Regular Expressions in Document Management Data Capture and Indexing

What’s it look like? A simple regular expression might look something like this:

^\s{1,3}[A-Z0-9]*XYZ

^            Start at the beginning of a string or line
\s{1,3}      Find a whitespace character (such as a space) that occurs between 1 and 3 times
[A-Z0-9]*    Find any character in the range A-Z or 0-9; the “*” is the instruction to find as many occurrences as possible (including none)
XYZ          Find the literal characters “XYZ”

If we had the value “ AZR8987XYZ” at the start of a line in our document we would get a match, whereas if we had “ AZR898XY” we would not.
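To check the two sample values against the pattern, here is a minimal Python sketch. It uses the pattern as broken down above, including the “*” after [A-Z0-9]; the helper name `matches` is made up for this illustration.

```python
import re

# ^ anchors at the start, \s{1,3} needs 1 to 3 whitespace characters,
# [A-Z0-9]* takes any run of uppercase letters or digits, then literal XYZ.
PATTERN = re.compile(r"^\s{1,3}[A-Z0-9]*XYZ")

def matches(line):
    """Return True when the line starts with text fitting the pattern."""
    return PATTERN.search(line) is not None

print(matches(" AZR8987XYZ"))  # value from the slide that should match
print(matches(" AZR898XY"))    # value from the slide that should not match
```

The second value fails because the literal “XYZ” is incomplete. A value with no leading whitespace would also fail, because \s{1,3} requires at least one whitespace character.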

Page 22: Using Regular Expressions in Document Management Data Capture and Indexing

Huh?

Don’t worry, this is not a tutorial on writing regex. We just want to look at some examples and understand how regex can apply to data capture and indexing in a document management environment.

Page 23: Using Regular Expressions in Document Management Data Capture and Indexing

Regular expressions are extremely flexible and patterns can be constructed to match almost anything. For text commonly found in documents, such as dates, SSNs, ZIP codes etc., patterns are freely available on the Internet. Here are some examples:

ZIP Codes          ^(?!00000)(?<zip>(?<zip5>\d{5})(?:[ -](?=\d))?(?<zip4>\d{4})?)$

US Phone Number    ^([0-9]( |-)?)?(\(?[0-9]{3}\)?|[0-9]{3})( |-)?([0-9]{3}( |-)?[0-9]{4}|[a-zA-Z0-9]{7})$

Credit Card        (^(4|5)\d{3}-?\d{4}-?\d{4}-?\d{4}|(4|5)\d{15})|(^(6011)-?\d{4}-?\d{4}-?\d{4}|(6011)-?\d{12})|(^((3\d{3}))-\d{6}-\d{5}|^((3\d{14})))
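As a rough illustration, the ZIP code pattern can be tried out in Python. One adjustment is needed: the slide writes named groups as (?<name>…), which is the form used by .NET-style engines, while Python’s re module spells them (?P<name>…). The helper name `parse_zip` and the sample values are made up for this sketch.

```python
import re

# Same ZIP pattern as above, with named groups in Python's (?P<name>...) form.
ZIP = re.compile(
    r"^(?!00000)(?P<zip>(?P<zip5>\d{5})(?:[ -](?=\d))?(?P<zip4>\d{4})?)$"
)

def parse_zip(value):
    """Return the named groups as a dict, or None when the value is not a ZIP code."""
    match = ZIP.match(value)
    return match.groupdict() if match else None

print(parse_zip("12345-6789"))  # five-digit code plus the four-digit extension
print(parse_zip("12345"))       # five-digit code only; zip4 is None
print(parse_zip("00000"))       # rejected by the (?!00000) lookahead
```

Named groups are what make a pattern like this useful for indexing: each captured piece comes back under a field name rather than a position.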

Page 24: Using Regular Expressions in Document Management Data Capture and Indexing

Real World Example

Here is a partial invoice where you might need to capture the “Catalogue Number”.

Page 25: Using Regular Expressions in Document Management Data Capture and Indexing

To start constructing a regular expression we have to use what we know from the data in front of us and make some assumptions. During testing we can refine the regular expression. In this example we can assume from the document that the catalogue number consists of a single uppercase letter, 2 digits and a hyphen, followed by either a single uppercase letter and 6 digits or just 6 digits.

Page 26: Using Regular Expressions in Document Management Data Capture and Indexing

We could use the regex [A-Z]\d{2}-[A-Z]{0,1}\d{6} to extract the data. Let's again break this down:

[A-Z]        Find a character from A-Z; the absence of a quantifier specification, “{}”, means we are only looking for 1 character
\d{2}        Find exactly 2 digits
-            Find the literal character “-”
[A-Z]{0,1}   Find a character A-Z between 0 and 1 repetitions
\d{6}        Find exactly 6 digits

This is just one of several ways to write a regular expression for this example. If we should subsequently find that the last portion of the catalogue number might contain 4 to 6 digits, we could simply amend it as follows: [A-Z]\d{2}-[A-Z]{0,1}\d{4,6}.
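The catalogue number pattern and its relaxed 4-to-6-digit variant can be compared side by side in a short Python sketch. The sample strings and the helper name `find_catalogue_number` are invented for illustration; they are not taken from the invoice shown on the slide.

```python
import re

# Letter, 2 digits, hyphen, optional letter, then exactly 6 digits.
STRICT = re.compile(r"[A-Z]\d{2}-[A-Z]{0,1}\d{6}")
# Same shape, but the final run may be 4 to 6 digits long.
RELAXED = re.compile(r"[A-Z]\d{2}-[A-Z]{0,1}\d{4,6}")

def find_catalogue_number(text, pattern=STRICT):
    """Return the first catalogue number found in the text, or None."""
    match = pattern.search(text)
    return match.group(0) if match else None

print(find_catalogue_number("Catalogue Number: A12-B123456 Qty 4"))
print(find_catalogue_number("Catalogue Number: A12-123456"))
print(find_catalogue_number("Catalogue Number: C34-9876"))           # too short for STRICT
print(find_catalogue_number("Catalogue Number: C34-9876", RELAXED))  # accepted by RELAXED
```

Running both patterns over a batch of real samples is a quick way to carry out the refinement during testing that the previous slide mentions.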

Page 27: Using Regular Expressions in Document Management Data Capture and Indexing

Now how would this work in a data capture solution?

We’ll take a look at how regex is used in ImageRamp Batch. It’s a simple-to-use folder processing tool that accelerates getting data and files into various EMR, Document Management or other secure storage environments. It can be used to capture and extract data in both structured and unstructured documents. As an example, we might want to extract data from a scanned file with the following 4 fields:

Company Name

Company Number

Date

SIC Code

Page 28: Using Regular Expressions in Document Management Data Capture and Indexing

Here is the ImageRamp screen showing the scanned file pages and the data extracted using regex for the four fields we listed.

Page 29: Using Regular Expressions in Document Management Data Capture and Indexing

So where is the regex?

Hang on, we’ll show it. We’ll use it to split each company’s invoices out of a multipage scan based on the Company Name and to extract index data.

A company might use this to scan a large stack of invoices and split the file every time a new invoicing company name is located using the regex scripts.

Page 30: Using Regular Expressions in Document Management Data Capture and Indexing

Let’s break it down: splitting the scan stack.

First we are going to define the regex that performs document splitting when a new Company Name is located. This is set in the Data Mining submenu under ImageRamp’s Splitting and Extraction, as shown below.

(?<=\bCompany\s*Name\s+\b)[a-z0-9\(\) ]*

… and check the “Split if Matched” option.
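To make the splitting idea concrete, here is a minimal Python sketch. It is not ImageRamp’s implementation: the function name `split_by_company`, the page texts and the rule “start a new document when a different company name appears” are all assumptions for illustration. Python’s re module also rejects variable-width lookbehind such as (?<=…\s+…), so the sketch matches the label normally and reads the name from a capture group, with case-insensitive matching assumed.

```python
import re

# The label is matched normally and the name is taken from group 1, because
# Python's re module does not allow \s* or \s+ inside a lookbehind.
COMPANY_NAME = re.compile(r"\bCompany\s*Name\s+([a-z0-9() ]*)", re.IGNORECASE)

def split_by_company(pages):
    """Group consecutive page texts, starting a new group at each new company name."""
    documents = []
    current_name = None
    for text in pages:
        match = COMPANY_NAME.search(text)
        name = match.group(1).strip() if match else None
        # Start a new document on a new name, or on the very first page.
        if (name and name != current_name) or not documents:
            documents.append({"company": name, "pages": []})
            current_name = name
        documents[-1]["pages"].append(text)
    return documents

# Hypothetical OCR text, one string per scanned page.
pages = [
    "Company Name  ACME WIDGETS LTD\nCompany Number 01234567",
    "continuation page with no header",
    "Company Name  BETA SUPPLIES (UK) LTD\nCompany Number 07654321",
]
docs = split_by_company(pages)
for doc in docs:
    print(doc["company"], len(doc["pages"]))
```

Pages without a company name stay with the document that precedes them, which is one plausible way to keep multipage invoices together.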

Page 31: Using Regular Expressions in Document Management Data Capture and Indexing

Let’s break it down: capturing the index data.

Remember that in our example we identified CompanyName, CompanyNo, Date and SICcode as the index or metadata information we want to capture. Here we are extracting the Date field using a regex entered in the Index Fields section of the Data Mining submenu.

(?<Date>(?<=\bDate of this return\s+\b)\d{2}/\d{2}/\d{4})
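A hedged Python equivalent of the Date field extraction is sketched below. Two things differ from the pattern on the slide: Python’s re module writes named groups as (?P<Date>…), and it does not accept \s+ inside a lookbehind, so the label “Date of this return” is matched as ordinary text in front of the group. The sample page text and the helper name `extract_date` are made up.

```python
import re

# The label is matched as plain text; only the date itself is captured.
DATE_FIELD = re.compile(r"\bDate of this return\s+(?P<Date>\d{2}/\d{2}/\d{4})")

def extract_date(page_text):
    """Return the date that follows the label, or None when it is absent."""
    match = DATE_FIELD.search(page_text)
    return match.group("Date") if match else None

# Hypothetical OCR output for one page.
sample = "Company Number 01234567\nDate of this return   14/03/2012\nSIC Code 6201"
print(extract_date(sample))
print(extract_date("page without the label"))
```

Anchoring on the printed label is what lets the pattern pick the right date on a page that may contain several.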

Page 32: Using Regular Expressions in Document Management Data Capture and Indexing

But wait, there’s more.

Information extracted through text data mining with regex can also be used to name the file and create folders.

Here %regex1 corresponds to the first regex field definition (CompanyName) and %regex2 corresponds to the second field definition (CompanyNo).
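The naming idea can be sketched in Python as a simple token substitution. Only the %regex1 and %regex2 tokens come from the slide; the template string "%regex1/%regex2.pdf", the function name `expand_template` and the character clean-up rule are assumptions for illustration and do not describe ImageRamp’s actual naming syntax.

```python
import re

def expand_template(template, values):
    """Replace each %regexN token with the N-th extracted value (1-based)."""
    def substitute(match):
        value = values[int(match.group(1)) - 1]
        # Swap out characters that are unsafe in file or folder names.
        return re.sub(r'[\\/:*?"<>|]', "_", value).strip()
    return re.sub(r"%regex(\d+)", substitute, template)

# Hypothetical extracted values: CompanyName first, CompanyNo second.
fields = ["ACME WIDGETS LTD", "01234567"]
path = expand_template("%regex1/%regex2.pdf", fields)
print(path)
```

Under this template each company gets its own subfolder and each file is named after the company number.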

Page 33: Using Regular Expressions in Document Management Data Capture and Indexing

We hope we have demonstrated the immense power of using regular expressions to extract data from both structured and unstructured documents.

Data in the palm of your hand…not locked in your documents!

and…

Page 34: Using Regular Expressions in Document Management Data Capture and Indexing

For more on: • Data Mining PDF • Data mining Scans • Invoice Mining • Patient Record Mining • OCR mining • TIF mining • Extracting meta data, • Data extraction from unstructured data • Intelligent data capture • Data extraction • Using regex to extract data • Document scanning • Extracting data • Extract meta data, • Scanner software, • Barcode recognition, • OCR software, • Capture tutorial • Pdf scanning, • Scanning software • Indexing • Document indexing • Automated capture • Meta data • Scan to index • Batch Processing • Bulk scanning • Docufi • Imageramp • Data capture • Migration to document management

Learn more about the power of ImageRamp and its other features including:

Full text OCR to PDF

PDF rights management and encryption

Document naming, splitting, and routing based on barcodes

Image processing for clean up and adaptive thresholding

OCR (Optical Character Recognition)

Barcode reading (1D and 2D)

and…

Page 36: Using Regular Expressions in Document Management Data Capture and Indexing

More?

Further reading on Regular Expressions:

http://en.wikipedia.org/wiki/Regular_expression

http://regexlib.com/

http://www.regular-expressions.info/

Page 37: Using Regular Expressions in Document Management Data Capture and Indexing

docufi.com @imageramp @docufinews