Using Regular Expressions in Document Management Data Capture and Indexing
DESCRIPTION
Learn how metadata (index information) can be pulled from documents using regular expressions, or regex. See how regex is used to extract the index information, name files, create subfolders and more to feed your document management or EMR systems. Automated data capture is shown with ImageRamp from DocuFi, a powerful platform to capture index information from your scanned documents and drawings, which integrates with today’s document management and EMR systems.
TRANSCRIPT
Using Regular Expressions for Data Mining and Automated Data Capture and Indexing
In a Document Management Environment
Copyright © 2010 - 2013 DocuFi. All Rights Reserved
First: What is automated data capture or data mining?
Data Capture: Just identifying and extracting information or data (sometimes called metadata) from scanned documents.
Automated Data Capture: Applying the principles of automation to data capture, silly! This can also be called text data mining.
Why automate data capture?
Manual Data Capture is Expensive and Time Consuming
Problems with manual data entry:
1. Security may be compromised if documents are taken off premises
2. A delay is introduced if documents are taken off premises
3. Compared to automated extraction, manual indexing is slow
4. Manual indexing doesn’t scale well with large projects
5. Manual indexing has the potential to introduce errors into the data
and…
There’s a Mountain of It!
Let’s take a look at just invoices, for example…
According to an Aberdeen Group August 2010 report, 72 percent of received invoices are paper-based.
Companies responding to PayStream Advisors’ 2010 Invoice Automation Benchmarking survey indicated that they receive 77 percent of their invoices via paper.
and it’s expensive
An Aberdeen Group March 2012 publication estimates the cost of processing a single invoice at between $4.84 and $20.13.
So if e-invoicing is not an option (as it’s not for many), what?
e-invoicing: sending and receiving invoices electronically
“it is the front-end capture options…that introduce true performance gains. For example, respondents who have implemented front-end document capture (creating a scanned digital copy of a physical invoice to be used in the approval process) report invoice processing 34% faster than those who process invoices manually. Moving to the pure data end of the spectrum, companies that convert scanned documents into usable data (through optical character recognition or similar technologies), report a 26% faster processing time than those that work only with document images.” (Aberdeen’s 2010 report)
And, We All Know, Time is Money
Don’t forget we are using invoices only as an example. But this could apply to patient records, legal documents, purchase orders…any document.
Now that you know this is all about money, let’s go back to the focus of this slideshow.
Using Regular Expressions for Data Mining and Automated Data Capture and Indexing
What are Regular Expressions or regex?
Regular expressions (regex) provide a fast and powerful method to search, extract and replace specific data found within scanned documents. A regular expression is essentially a special text string that describes a search pattern. You could think of regular expressions as extremely powerful wildcards.
What’s it look like? A simple regular expression might look something like this:
^\s{1,3}[A-Z0-9]*XYZ
^ Start at the beginning of a string or line
\s{1,3} Find between 1 and 3 whitespace characters (such as spaces)
[A-Z0-9]* Find any character in the range A-Z or 0-9; the “*” is the instruction to find as many occurrences as possible (including none)
XYZ Find the literal characters “XYZ”
If we had the value “ AZR8987XYZ” in our document at the start of a line we would get a match whereas if we had “ AZR898XY” we would not.
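The match and non-match described above can be checked with a few lines of Python. This is only a minimal sketch using Python's standard `re` module; the helper name `starts_with_code` is made up for illustration and is not part of ImageRamp.

```python
import re

# The example pattern from the slide: 1-3 whitespace characters at the start
# of the line, any run of A-Z / 0-9 characters, then the literal "XYZ".
PATTERN = re.compile(r"^\s{1,3}[A-Z0-9]*XYZ")

def starts_with_code(line):
    """Hypothetical helper: True when the line matches the example pattern."""
    return PATTERN.search(line) is not None

for value in (" AZR8987XYZ", " AZR898XY"):
    print(repr(value), starts_with_code(value))
```

The first value matches because it begins with a space and ends in "XYZ"; the second does not, because the literal "XYZ" never appears.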
Huh?
Don’t worry, this is not a tutorial on writing regex. We just want to look at some examples and understand how regex can apply to data capture and indexing in a document management environment.
Regular expressions are extremely flexible and patterns can be constructed to match almost anything. For text commonly found in documents, such as dates, SSNs and ZIP codes, patterns are freely available on the Internet. Here are some examples:
Zip Codes: ^(?!00000)(?<zip>(?<zip5>\d{5})(?:[ -](?=\d))?(?<zip4>\d{4})?)$
US Phone Number: ^([0-9]( |-)?)?(\(?[0-9]{3}\)?|[0-9]{3})( |-)?([0-9]{3}( |-)?[0-9]{4}|[a-zA-Z0-9]{7})$
Credit Card: (^(4|5)\d{3}-?\d{4}-?\d{4}-?\d{4}|(4|5)\d{15})|(^(6011)-?\d{4}-?\d{4}-?\d{4}|(6011)-?\d{12})|(^((3\d{3}))-\d{6}-\d{5}|^((3\d{14})))
Real World Example
Here is a partial invoice where you might need to capture the “Catalogue Number”.
To start constructing a regular expression, we have to use what we know from the data in front of us and make some assumptions. During testing we can refine the regular expression. In this example we can assume from the document that the catalogue number has the format of a single uppercase letter, followed by 2 digits, then a hyphen, followed by either a single uppercase letter and 6 digits or just 6 digits.
We could use the regex [A-Z]\d{2}-[A-Z]{0,1}\d{6} to extract the data. Let’s again break this down:
[A-Z] Find a character from A-Z; the absence of a quantifier specification, “{}”, means we are only looking for 1 character
\d{2} Find exactly 2 digits
- Find the literal character “-”
[A-Z]{0,1} Find a character A-Z between 0 and 1 repetitions
\d{6} Find exactly 6 digits
This is just one way of writing a regular expression for this example; there are various other ways it could be written. If we should subsequently find that the last portion of the catalogue number might contain 4 to 6 digits, we could simply amend it as follows: [A-Z]\d{2}-[A-Z]{0,1}\d{4,6}.
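To see the effect of that amendment, the two versions of the catalogue number pattern can be compared in a short Python sketch. The sample text and the catalogue numbers in it are invented for illustration; they are not taken from the invoice in the slides.

```python
import re

# Original pattern: the last portion must be exactly 6 digits.
CATALOGUE_6 = re.compile(r"[A-Z]\d{2}-[A-Z]{0,1}\d{6}")
# Amended pattern: the last portion may be 4 to 6 digits.
CATALOGUE_4_TO_6 = re.compile(r"[A-Z]\d{2}-[A-Z]{0,1}\d{4,6}")

# Hypothetical OCR text containing three made-up catalogue numbers.
ocr_text = "item A12-B123456 qty 2; item C34-654321 qty 1; item D56-7890 qty 5"

strict = CATALOGUE_6.findall(ocr_text)
relaxed = CATALOGUE_4_TO_6.findall(ocr_text)

print(strict)
print(relaxed)
```

Only the amended pattern picks up the shorter number `D56-7890`, which is the kind of refinement the testing step is meant to uncover.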
Now how would this work in a data capture solution?
We’ll take a look at how regex is used in ImageRamp Batch. It’s a simple-to-use folder processing tool that accelerates getting data and files into various EMR, Document Management or other secure storage environments. It can be used to capture and extract data in both structured and unstructured documents. As an example, we might want to extract data from a scanned file with the following 4 fields:
Company Name
Company Number
Date
SIC Code
Here is the ImageRamp screen showing the scanned file pages and the data extracted using regex for the four fields we listed.
So where is the regex?
Hang on, we’ll show it. We’ll use it to split individual companies’ invoices from a multipage scan based on the Company Name and extract index data.
A company might use this to scan a large stack of invoices and split the file every time a new invoicing company name is located using the regex scripts.
Let’s break it down: splitting the scan stack.
First we are going to define the regex to perform document splitting when a new Company Name is located. This is done in the Data Mining submenu under ImageRamp’s Splitting and Extraction, as shown below.
(?<=\bCompany\s*Name\s+\b)[a-z0-9\(\) ]*
… and check the “Split if Matched” option.
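The idea of splitting a scan stack on a matched field can be sketched in Python. This is not ImageRamp's implementation, only an illustration under several assumptions: each page is already available as OCR text, matching is case-insensitive, and a new document starts on every page where the pattern matches. Python's `re` module also rejects variable-length lookbehind such as `(?<=\bCompany\s*Name\s+\b)`, so the sketch uses a capture group instead. The function name and sample pages are hypothetical.

```python
import re

# Same idea as the slide's pattern, but with a capture group for the name
# because Python's re module only allows fixed-width lookbehind.
COMPANY_NAME = re.compile(r"\bCompany\s*Name\s+([a-z0-9() ]*)", re.IGNORECASE)

def split_on_company_name(pages):
    """Hypothetical helper: group page texts into documents, starting a new
    document at every page where a Company Name is found."""
    documents = []
    for text in pages:
        match = COMPANY_NAME.search(text)
        if match or not documents:
            company = match.group(1).strip() if match else None
            documents.append({"company": company, "pages": []})
        documents[-1]["pages"].append(text)
    return documents

# Made-up OCR text for a three-page scan.
pages = [
    "Company Name ACME WIDGETS (UK) LIMITED\nCompany Number 01234567",
    "continuation page with no header",
    "Company Name EXAMPLE TRADING LIMITED\nCompany Number 07654321",
]
docs = split_on_company_name(pages)
for doc in docs:
    print(doc["company"], len(doc["pages"]))
```

Pages without a match are attached to the document that is currently open, so a multipage invoice stays together.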
Let’s break it down: capturing the index data.
Remember, in our example we identified CompanyName, CompanyNo, Date and SICcode as the index or metadata information we want to capture. So here we are extracting the date field using the regex in the Index Fields section of the Data Mining submenu.
(?<Date>(?<=\bDate of this return\s+\b)\d{2}/\d{2}/\d{4})
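A comparable extraction can be tried in Python as a minimal sketch. Two adjustments are assumptions of this example rather than anything ImageRamp does: the named group is written `(?P<Date>...)`, and the label is matched directly in front of the group instead of in a lookbehind, because Python's `re` module only supports fixed-width lookbehind. The sample text is invented.

```python
import re

# Label followed by the named group that becomes the "Date" index field.
DATE_FIELD = re.compile(r"\bDate of this return\s+(?P<Date>\d{2}/\d{2}/\d{4})")

def extract_index_fields(text):
    """Hypothetical helper: map named groups to index fields."""
    match = DATE_FIELD.search(text)
    return match.groupdict() if match else {}

# Made-up OCR text for one page.
sample = "Company Number 01234567\nDate of this return 15/03/2012\nSIC Code 62020"
fields = extract_index_fields(sample)
print(fields)
```

The group name doubles as the field name, so adding further index fields is a matter of adding further named patterns.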
Information extracted through the text data mining with regex can also be used to name the file and create folders.
Here %regex1 corresponds to the first regex field definition (CompanyName) and %regex2 corresponds to the second field definition (CompanyNo).
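How such placeholders might be expanded into a folder and file name can be sketched as follows. Only the `%regex1` and `%regex2` tokens come from the slides; the template string, the `.pdf` extension, the use of `/` as a folder separator and the character clean-up rule are all assumptions made for illustration.

```python
import re

def sanitize(value):
    """Hypothetical clean-up: replace characters that are unsafe in file names."""
    return re.sub(r'[\\/:*?"<>|]', "_", value).strip()

def expand_template(template, values):
    """Replace each %regexN token with the N-th extracted value (1-based)."""
    def lookup(match):
        index = int(match.group(1)) - 1
        return sanitize(values[index])
    return re.sub(r"%regex(\d+)", lookup, template)

# Made-up extracted values: CompanyName first, CompanyNo second.
extracted = ["ACME WIDGETS (UK) LIMITED", "01234567"]
path = expand_template("%regex1/%regex2.pdf", extracted)
print(path)
```

With this template the company name becomes the folder and the company number becomes the file name.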
But wait, there’s more.
We hope we have demonstrated the immense power of using regular expressions to extract data from both structured and unstructured documents.
Data in the palm of your hand…not locked in your documents!
and…
For more on: • Data mining PDF • Data mining scans • Invoice mining • Patient record mining • OCR mining • TIF mining • Extracting metadata • Data extraction from unstructured data • Intelligent data capture • Data extraction • Using regex to extract data • Document scanning • Extracting data • Scanner software • Barcode recognition • OCR software • Capture tutorial • PDF scanning • Scanning software • Indexing • Document indexing • Automated capture • Metadata • Scan to index • Batch processing • Bulk scanning • DocuFi • ImageRamp • Data capture • Migration to document management
Learn more about the power of ImageRamp and its other features, including:
Full text OCR to PDF
PDF rights management and encryption
Document naming, splitting, and routing based on barcodes
Image processing for clean up and adaptive thresholding
OCR (Optical Character Recognition)
Barcode reading (1D and 2D)
More?
Further reading on Regular Expressions:
http://en.wikipedia.org/wiki/Regular_expression
http://regexlib.com/
http://www.regular-expressions.info/
docufi.com @imageramp @docufinews