using sociolinguistics to enhance customer segmentation, geomarketing & diversity analytics
TRANSCRIPT
Elian CARSENAT, NamSor 2016-01-28
1 “Using Sociolinguistics to Enhance Customer Segmentation, Geomarketing & Diversity Analytics”
Founder Bio 2
Elian CARSENAT, a computer scientist trained at ENSIIE/INRIA, started his career at JP Morgan in Paris in 1997. He later worked as a consultant and managed business & IT projects in London, Paris, Moscow and Shanghai.
In 2012, Elian created NamSor, a piece of sociolinguistics software to mine 'Big Data' and better understand international flows of money, ideas and people. NamSor helps answer the perennial question all countries ask about their diasporas – who are they, where are they and what are they doing?
NamSor has been used to attract Foreign Direct Investment (FDI), to build up international collaboration within scientific communities, and to attract and facilitate diaspora investment in start-ups, among other use cases.
http://fr.linkedin.com/in/eliancarsenat/en
NamSor sorts Names 3
Names are meaningful: we use sociolinguistics to extract their semantics and deliver actionable intelligence.
Names reflect cultural Identity
NamSor data mining software recognizes the linguistic or cultural origin of names in any alphabet or language, at fine granularity and with high accuracy.
4
Gender Gap in Financing
5
Gender Gap in Science
Diasporas in Science (in collaboration with French INSERM)
6
Thomson Reuters WebOfScience (6 countries, 250k scientists, 50k papers)
“Analysts uncovered amazing patterns in the way scientists’ names correlate with whom they publish, and who they cite in their papers - not just in case of a particular country, but globally. Tania Vichnevskaia of the French National Institute for Health (INSERM) presented the paper ‘Applying onomastics to scientometrics’ at IREG International symposium 2015 organised by University of Maribor and Shanghai Jiao Tong University. The paper was prepared jointly with NamSor, a private start-up company specialized in mapping international Diasporas.”
Source: WoS; Data Mining: INSERM with NamSor
Scholar names in some Canadian Universities: Chinese, Indian, Iranian, Moroccan, Italian names
7
Canadian Science Policy Conference - CSPC2015
8
USE CASE – BOSTON CITY GEODEMOGRAPHICS
US Census vs NamSor geo-demographics
9
In July 2015, the US Government announced new rules that will require all cities and towns receiving federal housing funds to assess patterns of segregation.
The NY Times has published interactive maps of Boston geo-demographics, which we can compare with the information inferred by NamSor.
US Census Race Map of Boston 10
http://www.nytimes.com/interactive/2015/07/08/us/census-race-map.html
Using Voters List
US Census: 1pixel = 40 inhabitants
Voters List: 1 pixel = 1 voter
11
Source: Boston Voters List
Visualization : ESRI
Data Mining: NamSor+RapidMiner
Breaking down ‘White’ and ‘Asian’ into Portuguese, Spanish, Italian, Indian, Pakistani, Chinese, ...
12
Source: Boston Voters List
Visualization : ESRI
Data Mining: NamSor+RapidMiner
Who LIVES in New York ? 13
Who OWNS in Brooklyn, NY? Inferring origin in NYC ACRIS (Real Estate OpenData)
14
> Brooklyn zip codes
> NamSor origins
Who OWNS in Brooklyn, NY? Inferring origin in NYC ACRIS (Real Estate OpenData)
15
Interesting ‘Little’ spots
ZIP 11209 : Irish
ZIP 11219 : Jewish
ZIP 11233 : African American
ZIP 11228 : Italian
ZIP 11208 : Hispanic
ZIP 11214 : Chinese
ZIP 11235 : Ukrainian/Russian
ZIP 11416 : Indian
ZIP 11222 : Polish
16
USE CASE – ELECTIONS
A Decision Tree from FLORIDA Voters List (open data) 17
//TODO : based on FLORIDA
Segmenting ‘Asian’ voters would improve the model.
Using NamSor Origin to infer: Indian, Vietnamese, Korean, Chinese, ...
18
Tree
ethno = (Chin: DEM {DEM=3311, REP=2636, IDP=48, INT=199, LPF=9, GRE=5, CPF=2, REF=2, AIP=0, PSL=0}
ethno = (Indi: DEM {DEM=12509, REP=4565, IDP=95, INT=432, LPF=32, GRE=10, CPF=0, REF=1, AIP=3, PSL=1}
ethno = (Indo: DEM {DEM=984, REP=718, IDP=9, INT=43, LPF=4, GRE=1, CPF=1, REF=0, AIP=0, PSL=0}
ethno = (Japa: DEM {DEM=488, REP=403, IDP=9, INT=34, LPF=2, GRE=1, CPF=1, REF=0, AIP=0, PSL=0}
ethno = (Kore: REP {DEM=1148, REP=1174, IDP=11, INT=75, LPF=3, GRE=0, CPF=0, REF=0, AIP=0, PSL=0}
ethno = (Mong: DEM {DEM=24, REP=22, IDP=0, INT=0, LPF=0, GRE=1, CPF=0, REF=0, AIP=0, PSL=0}
ethno = (Paki: DEM {DEM=4411, REP=843, IDP=25, INT=110, LPF=9, GRE=6, CPF=0, REF=0, AIP=0, PSL=0}
ethno = (Viet: REP {DEM=3798, REP=5780, IDP=65, INT=272, LPF=10, GRE=5, CPF=3, REF=3, AIP=2, PSL=0}
Pakistani and Vietnamese voters did not vote the same way: the tree labels Pakistani names DEM and Vietnamese names REP.
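To make that comparison concrete, the leaf counts printed in the tree above can be turned into party shares. The following is a minimal Python sketch that only re-uses numbers shown on the slide; the helper names `party_shares` and `majority_party` are ours for illustration and are not part of NamSor or RapidMiner, and the truncated labels (`Paki`, `Viet`, `Kore`) are kept exactly as they appear in the output.

```python
# Leaf counts copied from the decision tree output above.
# The truncated origin labels are kept as printed on the slide.
LEAF_COUNTS = {
    "Paki": {"DEM": 4411, "REP": 843, "IDP": 25, "INT": 110, "LPF": 9,
             "GRE": 6, "CPF": 0, "REF": 0, "AIP": 0, "PSL": 0},
    "Viet": {"DEM": 3798, "REP": 5780, "IDP": 65, "INT": 272, "LPF": 10,
             "GRE": 5, "CPF": 3, "REF": 3, "AIP": 2, "PSL": 0},
    "Kore": {"DEM": 1148, "REP": 1174, "IDP": 11, "INT": 75, "LPF": 3,
             "GRE": 0, "CPF": 0, "REF": 0, "AIP": 0, "PSL": 0},
}


def party_shares(counts):
    """Return each party's share of the voters in one leaf."""
    total = sum(counts.values())
    return {party: n / total for party, n in counts.items()}


def majority_party(counts):
    """Return the party with the largest count in one leaf."""
    return max(counts, key=counts.get)


for origin, counts in LEAF_COUNTS.items():
    shares = party_shares(counts)
    print(origin, majority_party(counts),
          f"DEM={shares['DEM']:.1%}", f"REP={shares['REP']:.1%}")
```

Looking at shares rather than only the majority label also shows how close some leaves are: the Korean leaf is labelled REP on a margin of 26 voters.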
19
USE CASE – TRAVEL INTELLIGENCE
“Incredible India” – 1.2 BN People
Indian onomastics by State/Union Territory
20
Names in LATIN, BENGALI, DEVANAGARI, GUJARATI, GURMUKHI, KANNADA, MALAYALAM, ORIYA, TAMIL, TELUGU, ARABIC
ASSAM: Karbi Anglong district. Inter-caste marriages within the district?
21
clusterId (output) | clusterParentId (output) | Firstname (input) | LastName (input) | parent is | FirstParent (input) | LastParent (input)
L25354:253 | L64958:2797 | A¡à[\¹ | ¹}[ššã | husband | ¤àl¡ü[W¡³ | [W¡}>à¹
L47490:1593 | L64958:2797 | ¤àK[¹ | [W¡}>à¹ | father | ¤àl¡ü[W¡³ | [W¡}>à¹
L28582:1209 | L47490:1593 | [³>à | Òü}[t¡šã | husband | ¤àK[¹ | [W¡}>à¹
L23643:669 | L35593:510 | ™åKƒ}à | [W¡}>à¹šã | father | ¤ài¡[W¡³ | [W¡}>à¹
L23643:669 | L35593:510 | ³à>àÒü | [W¡}>à¹šã | father | ¤ài¡[W¡³ | [W¡}>à¹
L47490:1593 | L35593:510 | W¡àì=¢ | [W¡}>à¹ | father | Wå¡ì¤ | [W¡}>à¹
L23643:669 | L35593:510 | A¡àì¹ | t¡àì¹ïšã | husband | Wå¡ì¤ | [W¡}>à¹
L35593:510 | L47490:1593 | [ƒ[ºš | [W¡}>à¹ | father | W¡àì¤ | [W¡}>à¹
L23643:669 | L47490:1593 | [¹>à | [W¡}>à¹šã | father | W¡àì¤ | [W¡}>à¹
parent is husband
Count of serial  Column Labels
Row Labels L47490:1593 L116370:3612 L54332:2031 L184096:2297 L35593:510 L168871:1819 L135664:4438 L51271:837
L23643:669 6931 84 5099 15 2069 28 791 1924
L151415:3559 18 212 11 6446 19 1217 55 6
L28582:1209 5132 68 3565 10 1494 17 592 1323
L116370:3612 66 10283 38 72 40 321 137 29
L9839:442 2491 60 1851 9 774 11 321 660
L168871:1819 7 263 6 361 8 2730 24 4
L23642:141 1198 8 822 2 375 4 156 332
L25354:253 1181 12 932 375 7 100 323
L135664:4438 20 154 5 22 19 44 2212 3
L87032:1210 11 315 13 51 14 141 37 9
L90333:3644 3 204 2 31 190 5
L184096:2297 13 1735 3 84 11 1
L87031:697 4 136 4 12 3 137 4 5
L14495:131 614 10 432 167 4 68 163
L63724:1422 17 83 10 34 34 28 96 6
L98994:891 31 161 46 21 19 59 21 5
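A cross-tabulation like the one above can be produced by counting (clusterId, clusterParentId) pairs among the records where the parent is the husband. The sketch below is one plausible way to do that in Python, not NamSor's own code; it uses only the nine sample rows shown earlier on this slide, and the names `SAMPLE_ROWS` and `cross_tab` are hypothetical.

```python
from collections import Counter

# (clusterId, clusterParentId, relation) for the nine sample rows shown earlier.
# clusterId is the cluster of the person's name, clusterParentId that of the parent's name.
SAMPLE_ROWS = [
    ("L25354:253", "L64958:2797", "husband"),
    ("L47490:1593", "L64958:2797", "father"),
    ("L28582:1209", "L47490:1593", "husband"),
    ("L23643:669", "L35593:510", "father"),
    ("L23643:669", "L35593:510", "father"),
    ("L47490:1593", "L35593:510", "father"),
    ("L23643:669", "L35593:510", "husband"),
    ("L35593:510", "L47490:1593", "father"),
    ("L23643:669", "L47490:1593", "father"),
]


def cross_tab(rows, relation):
    """Count (clusterId, clusterParentId) pairs for one relation type."""
    return Counter(
        (cluster_id, parent_id)
        for cluster_id, parent_id, rel in rows
        if rel == relation
    )


# Each key is one cell of the pivot: row label first, column label second.
for (row_label, col_label), n in sorted(cross_tab(SAMPLE_ROWS, "husband").items()):
    print(row_label, col_label, n)
```

Run over the full voters list rather than nine rows, the same counting gives the row and column labels of the pivot, with large cells marking name clusters that frequently marry into each other.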
ASSAM: Karbi Anglong district
Names clustered: L116370:3612, L23643:669, L151415:3559, L47490:1593, L28582:1209, L54332:2031, L184096:2297, L168871:1819, L9839:442, L135664:4438, L87032:1210, L90333:3644, L35593:510, L51271:837, L63724:1422, L154797:1168, L64959:1796, L23642:141, L87031:697, L6536:295, L98994:891, L25354:253, L64958:2797, L30570:2614, L90334:1189, L95839:287, L100510:366, L121390:783, Other
Source: Voters List; Data Mining: NamSor
Applications to an Airline’s customer intelligence
22
A global airline: ‘For 93% of our customers, when NamSor recognizes an Indian name, the client has travelled to India in the past.’
Finer-grained segmentation using names brings insight into diaspora travel patterns, such as visiting family and friends in the home country, and into travellers' specific needs.
Using NamSor API 23
(1) Get an API Key
(2) Get NamSor RapidMiner Extension
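For readers who prefer to call the API directly rather than through the RapidMiner extension, the following Python sketch shows how an origin lookup request might be prepared once an API key is in hand. The slide does not give the endpoint, so everything specific here is an assumption: the URL layout and the `X-API-KEY` header follow NamSor's later v2 REST API as we understand it, and should be checked against the current documentation. The function name `build_origin_request` is hypothetical. The request is only built, not sent.

```python
from urllib.parse import quote
from urllib.request import Request

# ASSUMPTION: base URL of the name-origin endpoint in NamSor's v2 REST API.
# The talk itself does not state it; verify against the official API documentation.
BASE_URL = "https://v2.namsor.com/NamSorAPIv2/api2/json/origin"


def build_origin_request(first_name, last_name, api_key):
    """Prepare (but do not send) a GET request for a name-origin lookup."""
    # Percent-encode both name parts so names in any alphabet are URL-safe.
    url = f"{BASE_URL}/{quote(first_name, safe='')}/{quote(last_name, safe='')}"
    # ASSUMPTION: the key is passed in an X-API-KEY header.
    headers = {"X-API-KEY": api_key, "Accept": "application/json"}
    return Request(url, headers=headers)


request = build_origin_request("Elian", "Carsenat", "your-api-key")
print(request.get_method(), request.full_url)
```

Sending the request with `urllib.request.urlopen(request)` and decoding the JSON body would be the next step; the response fields are not described in this talk, so they are left out here.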
Thank you!
Elian CARSENAT,
Phone : +33 6 52 77 99 07
http://www.namsor.com/
24
July 2013, Embassy of Lithuania in Paris