record linkage, a real use case with spark ml - paris spark meetup dec 2015

RECORD LINKAGE,A REAL USE CASE WITH SPARK ML

Alexis Seigneurin - Pascale Mkhael

Who I am

• Software engineer for 15 years

• Consultant at Ippon Tech in Paris, France

• Spark trainer

• Favorite subjects: Spark, Machine Learning, Cassandra

• @aseigneurin

http://http

The DIL’s mission: Help AXA become a data-driven company…

BUILDING technological

platforms using Big Data technologies

SUPPORTING AXA entities’

Big Data projects (tools and/or

expertise)

EXPLORING innovative

opportunities to transform the

insurance business

Platforms

> By focusing on…

> The DIL in figures:

1 TEAM

3 SITES

77 OPPORTUNITIES

100+ TB

3 SETS OF

PLAFORMS

The project

• Record Linkage with Machine learning

• Use cases:• Find new clients who come from insurance comparison services→ Commission

• Find duplicates in existing files (acquisitions)

• Record Linkage• Entity resolution• Deduplication• Entity disambiguation• …

Overview

• Find duplicates!

Purpose

+---+-------+------------+----------+------------+---------+------------+|ID|veh|codptgar_veh|dt_nais_cp|dt_permis_cp|redmaj_cp|formule|+---+-------+------------+----------+------------+---------+------------+|...|PE28221|50000|1995-10-12|2013-10-08|100.0|TIERS||...|FO26016|59270|1967-01-01|1987-02-01|100.0|VOL_INCENDIE||...|FI19107|77100|1988-09-27|2009-09-13|105.0|TOUS_RISQUES||...|RE07307|69100|1984-08-15|2007-04-20|50.0|TIERS||...|FO26016|59270|1967-01-07|1987-02-01|105.0|TOUS_RISQUES|+---+-------+------------+----------+------------+---------+------------+

Steps

1. Preprocessing1. Find potential duplicates2. Feature engineering

2. Manual labeling of a sample

3. Machine Learning to make predictions on the rest of the records

Prototype

• Crafted by a Data Scientist• Not architectured, not versioned, not unit tested…→ Not ready for production

• Spark, but a lot of Spark SQL (data processing)

• Machine Learning in Python (Scikit Learn)

→ Objective: industrialization of the code

Preprocessing

• Data (CSV) + Schema (JSON)

Inputs

000010;Jose;Lester;10/10/1970000011;José;Lester;10/10/1970000012;Tyler;Hunt;12/12/1972000013;Tiler;Hunt;25/12/1972000014;Patrick;Andrews;1973-12-13

{"tableSchemaBeforeSelection":[{"name":"ID","typeField":"StringType","hardJoin":false},{"name":"name","typeField":"StringType","hardJoin":true,"cleaning":"digitLetter","listFeature":["scarcity"],"listDistance":["equality","soundLike"]},...

• Spark CSV module → DataFrameDon’t use type inference

Data loading

+------+-------+-------+----------+|ID|name|surname|birthDt|+------+-------+-------+----------+|000010|Jose|Lester|10/10/1970||000011|José|Lester|10/10/1970||000012|Tyler|Hunt|12/12/1972||000013|Tiler|Hunt|25/12/1972||000014|Patrick|Andrews|1970-10-10|+------+-------+-------+----------+

• Parsing of dates, numbers…

• Cleaning of strings

Data cleasning

+------+-------+-------+----------+|ID|name|surname|birthDt|+------+-------+-------+----------+|000010|jose|lester|1970-10-10||000011|jose|lester|1970-10-10||000012|tyler|hunt|1972-12-12||000013|tiler|hunt|1972-12-25||000014|patrick|andrews|null|+------+-------+-------+----------+

• Convert strings to phonetics (Beider-Morse)

• …

Feature calculation

+------+-------+-------+----------+--------------------+|ID|name|surname|birthDt|BMencoded_name|+------+-------+-------+----------+--------------------+|000010|jose|lester|1970-10-10|ios|iosi|ioz|iozi...||000011|jose|lester|1970-10-10|ios|iosi|ioz|iozi...||000012|tyler|hunt|1972-12-12|tilir||000013|tiler|hunt|1972-12-25|tQlir|tili|tilir||000014|patrick|andrews|null|pYtrQk|pYtrik|pat...|+------+-------+-------+----------+--------------------+

• Auto-join (more on that later…)

Find potential duplicates

+------+------+---------+...+------+------+---------+...|ID_1|name_1|surname_1|...|ID_2|name_2|surname_2|...+------+------+---------+...+------+------+---------+...|000010|jose|lester|...|000011|jose|lester|...|000012|tyler|hunt|...|000013|tiler|hunt|...+------+------+---------+...+------+------+---------+...

• Several distance algorithms:• Levenshtein distance, date difference…

Distance calculation

+------+...+------+...+-------------+--------------+...+----------------+|ID_1|...|ID_2|...|equality_name|soundLike_name|...|dateDiff_birthDt|+------+...+------+...+-------------+--------------+...+----------------+|000010|...|000011|...|0.0|0.0|...|0.0||000012|...|000013|...|1.0|0.0|...|13.0|+------+...+------+...+-------------+--------------+...+----------------+

• Standardization of distances only

• Vectorization (2 vectors)

Standardization / vectorization

+------+------+---------+----------+------+------+---------+----------+------------+--------------+|ID_1|name_1|surname_1|birthDt_1|ID_2|name_2|surname_2|birthDt_2|distances|other_features|+------+------+---------+----------+------+------+---------+----------+------------+--------------+|000010|jose|lester|1970-10-10|000011|jose|lester|1970-10-10|[0.0,0.0,...|[2.0,2.0,...||000012|tyler|hunt|1972-12-12|000013|tiler|hunt|1972-12-25|[1.0,1.0,...|[1.0,2.0,...|+------+------+---------+----------+------+------+---------+----------+------------+--------------+

SparkSQL → DataFrames

From SQL…• Generated SQL requests

• Hard to maintain (especially as regards to UDFs)

valcleaningRequest=tableSchema.map(x=>{x.CleaningFuctionmatch{case(Some(a),_)=>a+"("+x.name+")as"+x.namecase_=>x.name}}).mkString(",")

valcleanedTable=sqlContext.sql("select"+cleaningRequest+"from"+tableName)cleanedTable.registerTempTable(schema.tableName+"_cleaned")

… to DataFrames• DataFrame primitives

• More work done by the Scala compiler

valcleanedDF=tableSchema.filter(_.cleaning.isDefined).foldLeft(df){case(df,field)=>valudf:UserDefinedFunction=...//getthecleaningUDFdf.withColumn(field.name+"_cleaned",udf.apply(df(field.name))).drop(field.name).withColumnRenamed(field.name+"_cleaned",field.name)}

Unit testing

Unit testing

• Scalatest + Scoverage

• Coverage of all the data processing operations

• Comparison of Row objects

val resDF = schema.cleanTable(rows)

"The cleaning process" should "clean text fields" in { val res = resDF.select("ID", "name", "surname").collect() val expected = Array( Row("000010", "jose", "lester"), Row("000011", "jose", "lester ea"), Row("000012", "jose", "lester") ) res should contain theSameElementsAs expected} "The cleaning process" should "parse dates" in { ...

Unit testing000010;Jose;Lester;10/10/1970000011;Jose =-+;Lester éà;10/10/1970000012;Jose;Lester;invalid date

Matching potential duplicates

Join strategy

• For record linkage, first merge the two sources

• Then auto-join

Prospects New clients

Duplicate

Join - Volume of data

• Input: 1M records

• Cartesian product: 1000 B records

→ Find an appropriate join condition0

25

50

75

100

Join condition• Multiples join on 2 fields

• Equality of values or custom condition (UDF)• Union between all the intermediate results

• E.g. with fields name, surname, birth_date:df1.join(df2, (df1("ID_1") < df2("ID_2")) && (df1("name_1") === df2("name_2")) && (soundLike(df1("surname_1"), df2("surname_2")))

df1.join(df2, (df1("ID_1") < df2("ID_2")) && (df1("name_1") === df2("name_2")) && (df1("birth_date_1") === df2("birth_date_2")))

df1.join(df2, (df1("ID_1") < df2("ID_2")) && (soundLike(df1("surname_1"), df2("surname_2"))) && (df1("birth_date_1") === df2("birth_date_2")))

UNION

DataFrames extension

• 3 types of columns


+------+...+------+...+-------------+--------------+...+----------------+|ID_1|...|ID_2|...|equality_name|soundLike_name|...|dateDiff_birthDt|+------+...+------+...+-------------+--------------+...+----------------+|000010|...|000011|...|0.0|0.0|...|0.0||000012|...|000013|...|1.0|0.0|...|13.0|+------+...+------+...+-------------+--------------+...+----------------+

Data DistancesNon-distance features

• DataFrame columns have a name and a data type• DataFrameExt = DataFrame + metadata over columns


case class OutputColumn(name: String, columnType: ColumnType) class DataFrameExt(val df: DataFrame, val outputColumns: Seq[OutputColumn]) {

def show() = df.show()

def drop(colName: String): DataFrameExt = ...

def withColumn(colName: String, col: Column, columnType: ColumnType): DataFrameExt = ...

...

Labeling

Labeling

• Manual operation• Is this a duplicate? → Yes / No

• Performed on a sample of the potential duplicates• Between 1000 and 10 000 records

Labeling

Predictions

Predictions

• Machine Learning• Random Forests• (Gradient Boosting Trees also give good results)

• Training on the potential duplicates labeled by hand

• Predictions on the potential duplicates not labeled by hand

Predictions

• Sample: 1000 records• Training set: 800 records• Test set: 200 records

• Results• True positives: 53 • False positives: 2 • True negatives: 126• False negatives: 5

→ Found 53 duplicates on the 58 expected (53+5) and only 2 errors

•Precision ≈ 93%

•Recall ≈ 91%

Summary&

Conclusion

Summary

✓ Single engine for Record Linkage and Deduplication

✓ Machine Learning → Specific rules for each dataset

✓ Higher identification of matches• Previously ~50% → Now 80-90%

Thank you!@aseigneurin

record linkage, a real use case with spark ml - paris spark meetup dec 2015

Technology