dev ops-presentation

Posted on 14-Jul-2015

Text mining

Fuzzy document classification

Using Elasticsearch

Lev Ozeryansky

Identity Card

• Merger of Ankor and We!

• Owned by Hilan (publicly traded on the Tel Aviv Stock Exchange)

• Fast-growing IT integration company

• Over 2000 systems installed and maintained

• Over 1000 leading customers - Hi-tech, Industry, Academia, Banks, Insurance,

• Strong technological team – over 45 engineers, professional services and project managers

• Over 120 employees

• Four main divisions – Infrastructure, Big Data, Cloud, Cyber

Technology Edge

What is classification

• Document classification as document categorization.

• Using classification.

• Our classification data source.

• What do we do with it?

  • Java programmer.

  • .NET programmer.

Data source

The mathematics

• Let $C$ be the set of classes

• Let $D$ be the set of documents

• Classification function $f(d, c)$ for $d \in D$, $c \in C$

Classification method

• Cosine similarity

• Function: $\operatorname{sim}(a, b) = \dfrac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$
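Cosine similarity is the dot product of two vectors divided by the product of their lengths. A minimal sketch in plain Python follows; the function name `cosine_similarity` and the choice to return 0.0 for a zero vector are assumptions made for illustration, not something taken from the slides.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)
```

For vectors with non-negative components, such as term weights, the result lies between 0 (no shared terms) and 1 (same direction).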

Build document class vector

• Java programmer

  • Java

  • 5

  • Hibernate

• .NET programmer

  • C#

  • 5

  • NHibernate

Let's index the classifiers

• Add weights manually.

• For Java programmer:

  • Java = 0.7

  • 5 = 0.5

  • Hibernate = 0.3

• For .NET programmer:

  • C# = 0.7

  • 5 = 0.5

  • NHibernate = 0.3
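The manual weights above can be written down as a small table and flattened into documents ready for indexing. This is only a sketch: the field names `class`, `term` and `weight`, the lowercase term keys, and the helper `to_index_docs` are hypothetical, and the actual index layout used in the demo is not shown in the slides.

```python
# Manually assigned term weights per class, copied from the slide.
# Lowercase keys are an assumption (to match lowercased tokens later).
CLASS_WEIGHTS = {
    "Java programmer": {"java": 0.7, "5": 0.5, "hibernate": 0.3},
    ".NET programmer": {"c#": 0.7, "5": 0.5, "nhibernate": 0.3},
}


def to_index_docs(class_weights):
    """Flatten the weight table into one document per (class, term) pair."""
    return [
        {"class": cls, "term": term, "weight": weight}
        for cls, terms in class_weights.items()
        for term, weight in terms.items()
    ]


docs = to_index_docs(CLASS_WEIGHTS)
```

Each resulting dictionary could be sent to Elasticsearch as one document body, so that a term lookup returns both the class and its weight.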

DEMO

w-shingling

• In natural language processing a w-shingling is a set of unique "shingles"—contiguous subsequences of tokens in a document. (Wikipedia)

• Tokenization

• Elasticsearch analyze mechanism
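To make the definition concrete, here is a minimal plain-Python sketch of w-shingling. The whitespace tokenizer and the names `tokenize` and `w_shingles` are assumptions for illustration; in the talk, tokenization is done by Elasticsearch's analysis chain, which also offers a shingle token filter.

```python
def tokenize(text):
    """Very rough tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def w_shingles(tokens, w):
    """Return the set of unique contiguous token subsequences of length w."""
    if w < 1:
        raise ValueError("w must be at least 1")
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


shingles = w_shingles(tokenize("a rose is a rose is a rose"), 4)
```

For this sentence and w = 4 the set holds three shingles, because the repeated subsequences collapse into one entry each.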

DEMO

Classification process

• Tokens array.

• Classification query.

  • Use a terms query whose terms array is the tokens array.

• Two vectors

  • Vector of filtered tokens

  • Classification vector
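The steps above can be sketched as follows. The request body uses the standard Elasticsearch `terms` query shape; the field name `term`, the helper names, and the binary (present or absent) encoding of the filtered-token vector are assumptions, since the slides do not show how the demo builds its vectors.

```python
def build_terms_query(tokens, field="term"):
    """Build a terms-query body whose terms array is the document's tokens."""
    return {"query": {"terms": {field: sorted(set(tokens))}}}


def build_vectors(tokens, class_terms):
    """Align a document with one class.

    class_terms maps each classifier term to its manual weight.
    Returns (token_vector, class_vector) over the same term order.
    """
    token_set = set(tokens)
    terms = sorted(class_terms)
    # Assumption: 1.0 if the classifier term occurs in the document, else 0.0.
    token_vector = [1.0 if term in token_set else 0.0 for term in terms]
    class_vector = [class_terms[term] for term in terms]
    return token_vector, class_vector


query = build_terms_query(["java", "java", "sql"])
token_vector, class_vector = build_vectors(
    ["java", "hibernate", "sql"],
    {"java": 0.7, "5": 0.5, "hibernate": 0.3},
)
```

Tokens that match no classifier term (here `sql`) are filtered out, so both vectors have one component per classifier term.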

DEMO

Classification process

• Use SciPy to calculate the distance.
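A minimal sketch of this step, assuming the distance meant here is SciPy's cosine distance (`scipy.spatial.distance.cosine`, which returns 1 minus the cosine similarity). The example vectors and the helper name `similarity` are illustrative only.

```python
from scipy.spatial.distance import cosine


def similarity(token_vector, class_vector):
    """Cosine similarity derived from SciPy's cosine distance."""
    return 1.0 - cosine(token_vector, class_vector)


# Hypothetical vectors: document term presence vs. manual class weights.
token_vector = [0.0, 1.0, 1.0]
class_vector = [0.5, 0.3, 0.7]
score = similarity(token_vector, class_vector)
```

Computing `score` for every class and picking the highest gives the most likely class, while keeping all scores gives the fuzzy, graded result. Note that an all-zero vector has no defined cosine distance, so such documents should be handled before the call.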

Q&A
