DOCUMENT DIGITIZATION: Rethinking it with Machine Learning. Nischal Harohalli Padmanabha, QConAI SFO 2019

Post on 22-May-2020

TRANSCRIPT

Page 1: DOCUMENT DIGITIZATION - QCon.ai · @nischalhp | Document Digitization | QconAI SFO 2019 Human and computation resources required? Data Scientists Engineers Data Scientists from Academia

DOCUMENT DIGITIZATION: Rethinking it with Machine Learning

Nischal Harohalli Padmanabha QConAI SFO 2019

Page 2:

@nischalhp | Document Digitization | QconAI SFO 2019

“The brain sure as hell doesn’t work by somebody programming in rules.”

- Geoffrey Hinton

Page 3:

PROBLEM

Understanding unstructured documents and extracting semantic information to automate claims handling.

Page 4:

DOCUMENT CLASS: Policy

POLICY NUMBER: H 54/16 307 728

CUSTOMER: Renolate GmbH, 10115 Berlin

AGENT: pma Insurance Broker, 48149 Nurnberg

RISK DESCRIPTION / INSURED LOCATION:
● Private liability insurance comfort plus
● Dog liability
● Environmental damage insurance
● Employees on premises

POLICY: Liability Protection

EFFECTIVE DATE OF CHANGE: 22.12.2016 12:00

TERMINATION: 22.12.2019 12:00

ANNUAL CHARGE: EUR 424,63

COVERAGES:
● Persons & property damage flat: EUR 3.000.000
● Financial losses: EUR 100.000
● Environmental damage basic flat: EUR 3.000.000
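
The values on this sample policy are German-formatted strings (dot as thousands separator, comma as decimal separator, DD.MM.YYYY dates). A minimal sketch of turning them into typed fields could look like the following. `PolicyRecord`, `parse_eur` and `parse_timestamp` are hypothetical names chosen for illustration, and the pairing of coverage names with amounts is assumed from the order in which they appear on the slide.

```python
from dataclasses import dataclass
from datetime import datetime
from decimal import Decimal


def parse_eur(text: str) -> Decimal:
    """Parse a German-formatted amount such as 'EUR 3.000.000' or 'EUR 424,63'."""
    number = text.replace("EUR", "").strip()
    # '.' is the thousands separator and ',' the decimal separator.
    return Decimal(number.replace(".", "").replace(",", "."))


def parse_timestamp(text: str) -> datetime:
    """Parse a 'DD.MM.YYYY HH:MM' timestamp."""
    return datetime.strptime(text, "%d.%m.%Y %H:%M")


@dataclass
class PolicyRecord:  # hypothetical target schema, not from the talk
    policy_number: str
    effective: datetime
    termination: datetime
    annual_charge: Decimal
    coverages: dict


record = PolicyRecord(
    policy_number="H 54/16 307 728",
    effective=parse_timestamp("22.12.2016 12:00"),
    termination=parse_timestamp("22.12.2019 12:00"),
    annual_charge=parse_eur("EUR 424,63"),
    coverages={
        "Persons & property damage flat": parse_eur("EUR 3.000.000"),
        "Financial losses": parse_eur("EUR 100.000"),
        "Environmental damage basic flat": parse_eur("EUR 3.000.000"),
    },
)
```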

Page 5:

REWIND

Page 6:

TABULAR INFORMATION EXTRACTION

Page 7:

Writing a lot of rules

COURSE OF ACTION - ROUND 1

Initial results gave us a lot of happiness (evaluation on known data).
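
To illustrate what a hand-written rule of this kind might look like, and why it can score well on known data yet miss unseen layouts, here is a small sketch. The regular expression and the second input string are invented for illustration; the talk does not show the actual rules.

```python
import re

# Hypothetical rule: the policy number follows the label 'POLICY NUMBER'.
POLICY_RULE = re.compile(r"POLICY NUMBER[:\s]+([A-Z] \d{2}/\d{2} \d{3} \d{3})")


def extract_policy_number(text: str):
    """Return the policy number if the rule matches, else None."""
    match = POLICY_RULE.search(text)
    return match.group(1) if match else None


known_layout = "POLICY NUMBER: H 54/16 307 728"
# Same information with a different label and spacing: the rule silently misses it.
unseen_layout = "Policy No. H-54/16-307728"

print(extract_policy_number(known_layout))
print(extract_policy_number(unseen_layout))
```

Each new layout needs another rule or another special case, which is one way a rule set grows hard to maintain.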

Page 8:

RESULT

In production 58% accuracy

Page 9:

RESULT

We failed, miserably. Rules became cumbersome & brittle.

In production 58% accuracy

Page 10:

Life or death situation for the project (and us engineers)

Page 11:

ADAPTIVE LEARNING THOUGHT PROCESS

How does a human solve the same problem?

Identifies groupings of text to build context, e.g. tables, paragraphs, passages

Given the context, domain knowledge and semantic understanding of text
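
As a toy illustration of the first step, grouping text to build context, the sketch below groups OCR words into lines by vertical position. The `Word` structure, the pixel coordinates and the tolerance value are assumptions made for the example; the talk does not describe how grouping was implemented.

```python
from dataclasses import dataclass


@dataclass
class Word:  # hypothetical OCR output: text plus top-left position in pixels
    text: str
    x: float
    y: float


def group_into_lines(words, y_tolerance=5.0):
    """Group words whose y position is within y_tolerance of the line's first word."""
    lines = []
    for word in sorted(words, key=lambda w: (w.y, w.x)):
        if lines and abs(lines[-1][0].y - word.y) <= y_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Order each line left to right and join the text.
    return [" ".join(w.text for w in sorted(line, key=lambda w: w.x)) for line in lines]


words = [
    Word("EUR", 300, 101), Word("Financial", 10, 100), Word("losses", 90, 102),
    Word("100.000", 340, 100), Word("COVERAGES", 10, 60),
]
print(group_into_lines(words))
```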

Page 12:

Sounds straightforward, right?

Page 13:

TECH STACK CHECK

Page 14:

NEXT STEPS

Page 15:

Which algorithms to use?

What should we feed as input to the algorithm?

What to annotate?

What are our deadlines?

Human and computation resources required?

How to agile this?

Page 16:

Which algorithms to use?

COURSE OF ACTION - ROUND 2

Supervised Learning

Computer Vision

● Object detection
● Messaging parsing networks
● Custom CNN networks

NLP

● Implementation of Deep Topic modeling
● Custom RNN + CNN networks with domain adaptation

Unsupervised Learning

Computer Vision and NLP

Using this technique to generate data for supervised training. Wrote implementations of deep clustering and word / sentence / page / document embeddings.
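
The idea of using unsupervised grouping to generate data for supervised training can be sketched with a much simpler stand-in than the deep clustering and embeddings named on the slide. The example below swaps in greedy clustering on token overlap (Jaccard similarity) and treats the cluster ids as pseudo-labels. The snippets, threshold and function names are all assumptions for illustration.

```python
def jaccard(a: set, b: set) -> float:
    """Token-overlap similarity between two token sets."""
    return len(a & b) / len(a | b) if a | b else 0.0


def pseudo_label(texts, threshold=0.3):
    """Assign each text to the first cluster whose seed is similar enough, else start a new one."""
    seeds, labels = [], []
    for text in texts:
        tokens = set(text.lower().split())
        for cluster_id, seed in enumerate(seeds):
            if jaccard(tokens, seed) >= threshold:
                labels.append(cluster_id)
                break
        else:
            # No existing cluster is close enough: this text seeds a new one.
            seeds.append(tokens)
            labels.append(len(seeds) - 1)
    return labels


snippets = [
    "annual charge EUR 424,63",
    "policy number H 54/16 307 728",
    "annual charge EUR 512,10",
    "policy number H 54/16 307 911",
]
labels = pseudo_label(snippets)
```

The resulting `(snippet, label)` pairs could then be reviewed by annotators and used as a starting set for a supervised model.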

Page 17:

EMPHASIS ON SUPERVISED LEARNING

Page 18:

What should we feed as input to the algorithm? What to annotate?

COURSE OF ACTION - ROUND 2

Computer Vision

● Drawing polygon bounding boxes
● Labeling pages
● Labeling documents

NLP

● Complex annotation of passages, phrases, tables, line items and the hierarchical nature of textual information

Built an in-house annotation system

Workflows support huge annotation jobs
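
A polygon annotation of the kind listed under Computer Vision could be represented as below. This is only a sketch of one possible record shape; `PolygonAnnotation` and its fields are hypothetical and do not describe the in-house annotation system.

```python
from dataclasses import dataclass


@dataclass
class PolygonAnnotation:  # hypothetical annotation record
    label: str
    points: list  # [(x, y), ...] in page pixel coordinates

    def bounding_box(self):
        """Axis-aligned box (x_min, y_min, x_max, y_max) enclosing the polygon."""
        xs = [x for x, _ in self.points]
        ys = [y for _, y in self.points]
        return min(xs), min(ys), max(xs), max(ys)

    def area(self) -> float:
        """Polygon area via the shoelace formula."""
        total = 0.0
        for (x1, y1), (x2, y2) in zip(self.points, self.points[1:] + self.points[:1]):
            total += x1 * y2 - x2 * y1
        return abs(total) / 2.0


table = PolygonAnnotation("table", [(10, 20), (210, 20), (210, 120), (10, 120)])
print(table.bounding_box(), table.area())
```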

Page 19:

Human and computation resources required?

COURSE OF ACTION - ROUND 2

Data Scientists

● Data Scientists from Academia
● Deep learning engineers
● Research programme with Universities
● Master Thesis sponsorship at omni:us

Engineers

● Full stack engineers
● Data Engineers
● Devops

Leadership & Mentors

● Team leads with experience in AI
● Identifying and convincing industry experts to mentor
● Devops

Cloud startup programmes

● Credits to support memory and GPU training algorithms
● Mentoring to scale operations

Page 20:

What are our Deadlines?

How to agile this?

Sprint Planning for Research

Quick turnaround of POCs

Engineer AI systems to run experiments in a systematic and automated way

COURSE OF ACTION - ROUND 2
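
One way to read "run experiments in a systematic and automated way" is a small runner that sweeps a parameter grid and ranks the results. The sketch below assumes a hypothetical `train_fn` that returns a score; `toy_train` is a stand-in with made-up scoring, not a real training job.

```python
import itertools


def run_experiments(train_fn, grid):
    """Run train_fn for every combination in grid and return results sorted by score."""
    names = sorted(grid)
    results = []
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append({"params": params, "score": train_fn(**params)})
    return sorted(results, key=lambda r: r["score"], reverse=True)


def toy_train(learning_rate, batch_size):
    # Stand-in for a real training run; returns a made-up score.
    return 1.0 - abs(learning_rate - 0.01) * 10 - abs(batch_size - 32) / 1000


grid = {"learning_rate": [0.1, 0.01, 0.001], "batch_size": [16, 32]}
ranking = run_experiments(toy_train, grid)
best = ranking[0]["params"]
```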

Page 21:

RESULT

In production 94% accuracy

Successful AI delivery

Page 22:

TECH STACK CHECK

Page 23:

GO LIVE OR GO HOME

Page 24:

Trained Models Predict

AI IN PRODUCTION

Human in the loop fixes the errors and validates corrections

Train on the corrections for continuous improvement
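
The predict, correct, retrain cycle on this slide can be sketched as a single function. The model and reviewer below are toy callables, and `review_cycle` is a hypothetical name; the point is only that every validated result, corrected or not, flows back into the training set.

```python
def review_cycle(model, documents, reviewer, training_set):
    """Predict, let a human validate, and collect the results for the next training run."""
    corrections = 0
    for doc in documents:
        predicted = model(doc)
        validated = reviewer(doc, predicted)  # human confirms or fixes the prediction
        if validated != predicted:
            corrections += 1
        training_set.append((doc, validated))
    return corrections


# Toy stand-ins: a "model" that always predicts 'policy' and a reviewer who knows the truth.
truth = {"doc-1": "policy", "doc-2": "claim", "doc-3": "policy"}
training_set = []
fixed = review_cycle(lambda d: "policy", list(truth), lambda d, p: truth[d], training_set)
```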

Page 25:

DO NOT IGNORE

Domain Knowledge is essential

Educate your customers on AI

Engineer end-to-end AI systems to solve a business use case, not a dataset

Page 26:

PLATFORM

Training Platform

Prediction Platform with human in the loop

Management Console of Infrastructure, Applications & Users

Page 27:

Training Platform

COURSE OF ACTION - ROUND 3

Annotation System

System to define data models, annotate data, manage annotation jobs, audit the annotated data and version control the datasets

Ability to train and evaluate models

Mechanism and system to trigger training, retraining, evaluation and versioning of different types of models, in a managed way across various infrastructures supporting CPU and GPU

Console connecting the two together
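
Version control of datasets can be approximated by deriving an identifier from the annotation content itself, so that a trained model can be tied to the exact data it saw. The sketch below is one possible approach, assuming annotations are JSON-serializable dictionaries; `dataset_version` is a hypothetical helper, not part of the platform described in the talk.

```python
import hashlib
import json


def dataset_version(annotations) -> str:
    """Derive a short, deterministic version id from the annotation records."""
    # Canonical form: sorted keys and sorted records, so ordering does not change the id.
    canonical = sorted(json.dumps(a, sort_keys=True) for a in annotations)
    digest = hashlib.sha256("\n".join(canonical).encode("utf-8")).hexdigest()
    return digest[:12]


v1 = dataset_version([{"doc": "a.pdf", "label": "policy"}, {"doc": "b.pdf", "label": "claim"}])
v1_reordered = dataset_version([{"label": "claim", "doc": "b.pdf"}, {"doc": "a.pdf", "label": "policy"}])
v2 = dataset_version([{"doc": "a.pdf", "label": "policy"}, {"doc": "b.pdf", "label": "invoice"}])
```

Here `v1` and `v1_reordered` are equal because only the content matters, while `v2` differs because one label changed.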

Page 28:

Prediction Platform with human in the loop

COURSE OF ACTION - ROUND 3

Async API for Ingestion

REST API that supports asynchronous data upload

Data Pipelines

Robust data pipelines connecting the services, providing high throughput, reliability and retry mechanisms

Validation UI

User interface to fix prediction errors

AI microservices

Scaling deep learning models as microservices

Prediction console connects all
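
The asynchronous ingestion and retry ideas can be sketched together with Python's `asyncio`. In this toy version, `submit` returns a job id immediately while processing continues in the background, and a failing step is retried with exponential backoff. The in-memory `jobs` dictionary, the function names and the `flaky_predict` step are assumptions for illustration; a real service would sit behind an HTTP API and a durable queue.

```python
import asyncio
import uuid

jobs = {}  # job_id -> state; an in-memory stand-in for a real job store


async def with_retries(step, payload, attempts=3, base_delay=0.01):
    """Call an async step, retrying with exponential backoff on failure."""
    for attempt in range(attempts):
        try:
            return await step(payload)
        except RuntimeError:
            if attempt == attempts - 1:
                raise
            await asyncio.sleep(base_delay * 2 ** attempt)


async def process(job_id, payload, step):
    """Run the step for one job and record the outcome."""
    try:
        jobs[job_id] = {"status": "done", "result": await with_retries(step, payload)}
    except RuntimeError as error:
        jobs[job_id] = {"status": "failed", "error": str(error)}


def submit(payload, step):
    """Accept an upload, schedule processing, and return a job id immediately."""
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "queued"}
    task = asyncio.create_task(process(job_id, payload, step))
    return job_id, task


async def demo():
    calls = {"count": 0}

    async def flaky_predict(payload):  # fails once, then succeeds
        calls["count"] += 1
        if calls["count"] == 1:
            raise RuntimeError("temporary failure")
        return {"document_class": "Policy"}

    job_id, task = submit(b"raw document bytes", flaky_predict)
    queued = jobs[job_id]["status"]  # the caller gets control back before processing runs
    await task
    return queued, jobs[job_id], calls["count"]


queued_status, final_state, call_count = asyncio.run(demo())
```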

Page 29:

Management Console of Infrastructure, Applications & Users

COURSE OF ACTION - ROUND 3

Configuration management

Central management of configuration of various systems, consoles and services

Application logs

Monitoring logs of applications and setting up dashboards for internal and external stakeholders

User management

Managing users and providing authentication and authorisation capabilities for services

Infrastructure logs

Monitoring infrastructure usage and patterns to set up alerts and notifications

Management and monitoring console

Page 30:

TECH STACK CHECK

Page 31:

omni:us platform console

Page 32:

Learnings

Page 33:

Learnings

● It is very important for an entire organization to believe that AI can solve problems.
● Engineer AI products; do not believe that having just AI models is good enough.
● Agile for AI works; choose an interpretation that works for your team.
● Pay attention to details, domain knowledge and the use case to be solved.
● A combination of multiple technologies has to be used to solve a use case, not just one hammer for all.
● Do not try to “AI” everything; certain mature technologies are capable of solving certain problems well. Use them wisely.
● Believe in human in the loop; it builds trust with the business.
● Educate internal and external stakeholders about the possibilities and limitations of AI.
● Visualisation is a powerful tool to understand and explain AI to everybody. Use it.
● AI is no longer a black box; it can be fine-tuned, managed and configured appropriately.
● Automate your current processes as much as possible; this gives more room for research.
