data scientist vccorp_aug_2016_vnu
TRANSCRIPT
ChallengesandBigOpportuni3esforDataScien3statVCCORP
HoangAnhTuan
CTOAdmicro-VCCORP
1
FOUNDERSHissuccessindevelopingVCCtobetheleadingnew-mediacompanyinVietnambeganwhenhefoundedhisfirstonlinecommunity,TTVNOL,fromagaragestart-upin2000.AMerfoundingthisnetwork,hefoundedthefirstonlinemedia&newsportalTintucvietnamin2002.In2005,hefoundedthefirstprivatecompanyinmobilevalueaddedservicesandinternetcontent,whichformedtheiniRalfoundaRonofthecurrent-dayVCC.
Asanaturalentrepreneur,ThanghasbeenthefirstmoverinalltheemerginginternetsectorsinVietnamincludingmediacontent,socialnetwork,mobilecontent&services,andecommerce.Naturally,VCCbecameaninnovaRveleaderofVietnam’sinternet&mediaindustry.Asthetechnologystrategist,hehasbeenthechiefarchitectofdisrupRvetechnologiesinVietnaminthelast10yearsincludingCMS,keyportaltechnologies,andcloudcompuRngtonameafew.
Mr.VuongVuThangistheFounderandChairmanofVCCorp
Mr.VuongVuThangFounder,Chairman
Withunparalleledknow-howininternetmoneRzaRon,NguyenTheTan’sleadershipisdrivingthegrowthofVCC.Hisresultsareapparentasheoversaw100%annualgrowthinadverRsing,mobileservicesandecommercerevenueforeachofhislast6yearsatVCC.Asastrategicvisionarywithintheindustry,NguyenTheTan’sworkhasmadeatremendousimpactwithintheVietnamesemarketplace.
BeforejoiningVCC,TanwasVice-DirectorofoneofthelargesttelcocompaniesinVietnam,Vie[elFixedTelecom.BeforeVie[elTelecom,hewasthedivisiondirectorataleadingsoMwareandsystemintegraRoncompany,CMC.Therehebuiltae-librarysoluRonspla]ormandexpandedthedistribuRonofCMC’ssoMwareandsoluRonstowardsbroadersegmentswithinthemarket.
Mr.NguyenTheTanistheCo-FounderandCEOofVCCorp
Mr.NguyenTheTanCo-Founder,CEO
5
6
VCCORPOverview
Overview
ü FirstmoverDNAü 50%YoYGrowth
ü 43Mwebaudience
ü 38Mmobileaudience
ü 1,700employees
Investors
VCCORPMARKETCOVERAGE� 43Minternetuserreach(97.6%ofVNinternetpopulation)� 38Mmobileuserreach(95%ofVNSmartphonepopulation)� 10,000+online&mobileadvertisers� 100,000+smallbusinessmerchants� 12Me-marketplacevisitors&buyers� LargestAdnetworkinVietnamwith 1000+publishers,including200+top-
publishers, 30 ofthemareexclusive� 22leadingproductswithpresencein20+verticals;14sitesareintop100
websitesinVietnam(news,finance,family,teenage,auto,high-tech,onlineadvertising,B2CandC2C,contentconsumptionmobile)
7
BigDatainVCCORP
� Inthe2007,BigDatawasappliedearlyinBaambooSearchEngine� Since2009,BigDataplatformhavebeeninstalledforservingad
systeminVCCORP� Currently,BigDataplatformisbeingdevelopedandimprovedin
majorareas.� Advertisement� DigitalContent� Ecommerce� Game
� Currentstaffs:100DataEngineers
9
ThechallengesinVCCORP
� BigDataskillsetsin-house� Thelarge-scaledata� Thehugeamountofspecificproblems,spreadingovermanyareaswhichisrequiredcreativeproblem-solving,self-motivedperson
� Humanresourceisnotenough
10
� Userbehaviors
� AdOptimization
� CoreNLPanditsapplication � NewsDistribution
� RecommendationEngine� VccorpAnalytic
13
UserbehavioranalyRcs
� Threemainprojects: � Demographic:gender,age � Userprofile:behavior,interest � Crossdevices:trackinguseronmultipledevices
15
Demographic-Userprofiling
� Detectuserprofileincludinggender,age,userinteresting(12-basedinterests),longterm,shortterm.Basedondata� Browsinghistory� Keywordsearchhistory� Timeusage,timeonsite
� Data:� 43Musersinpc� 38Musersinmobile� 1Terabyteloggingdata
� Result–accuracy:� Gender:82.5%� Age:67.5%
16
Systemoverview
17
Actioninwebsite
Sparkstreaming–FilternewURL
Updateurlcontent
Updateuseractionforbatchprocessing
OurLDAmodelSVM-predictuser
profile
UserProfileLongtermandshort
term
Benchmark
19
Benchmarkingdata:43Musers,200Mdocuments,30000*10^6actions,23otopicsVCCORPcluster:20nodes,640cores,640GBram
LDAwithSparkMLLIB
Time:36h,Accuracy:
� Gender:75.1%� Age:60.1%
Recall:91%
OurModel
Time:18h,Accuracy:
� Gender:82.5%� Age:67.5%
Recall:92%
Oldclassificationmodel
Time:16h,Accuracy:
� Gender:79.5%� Age:63.4%
Recall:92%
Crossdevice � Weusedinformation:
� BothUser-IPandtimestampintheirdevices� Websiteandcategorieshistory� Userdemographicanduserinterest
� Result:� Accuracy:60%� Numberdetectedusers:11M
21
AdmicroOverview
� #1adnetworkinVietnam:cover38%marketshare� 200+toppublishersinVietnam� 10,000+advertisers� 4Bpageviewspermonth� 1,5Bimpressionsperday� 22leadingadproducts� 43Minternetuserreach(97.6%ofVNinternetpopulation)� 38Mmobileuserreach(95%ofVNSmartphonepopulation)
23
AdOpRmizaRon
� Theadvancedtechniqueswereimplemented: � Personalization � AudienceTargetingPlatform � RealTimeBidding � Retargeting � ContextualTargeting� SSP/DSP/DMP
24
PersonalizaRon� Intraditionaladvertising,adsaredisplayedtoeveryoneinthefixedlocation
� Bycontrast,personalizationtechniquewillchoosethebestfitadsforeachuser:� 43Minternetuser� 10.000ads
� Usingmultipletechnologies:� Highloadcapacitywebserver� Optimizationalgorithms� Estimateandpredictionalgorithms
25
430Bestimatedoperationsforeachtime
AudienceTargeRngPla]orm� Advertiserscantargettheirparticularaudience� Subsetoftheaudiencecanbeprebuiltby“setoperators”ofuserproperties� Location� Demographic(gender,age,relationship…)� Interest/Behavior
� Especially,anaudiencecanbemadeupfrom:� Listofemail/phonenumber� Automaticallyfindsimilaraudiences(look-alike)
26
RealTimeBidding� Atransaction(sell/purchase)ofadimpressionsisimmediatelyproceededwhenanaudiencetriggertheadzones
27
RealTimeBidding� Atransaction(sell/purchase)ofadimpressionsisimmediatelyproceededwhenanaudiencetriggertheadzones
� Challenges:� 80msisthemaximumtimeofatransaction� 1000sitesinVN� 4.5billionrequest/day� NumberofTransaction:$200,000/day
28
RetargeRng� Retargetingisapowerfulbrandingandconversionoptimizationtool
� Adswillfollowcustomersaftertheydotheshopping� Adswillbedisplayedin
� AnyWebpages� Multipledevices
29
ContextualTargeRng� Contextualtargetinglooksatthecategoryorkeywordsofthecurrentpageaconsumerisviewingandthenservesthemadsthatarehighlyrelevanttothatcontent.
30
Categorytarget Keywordtarget
� Weimplemented:� Contentclassificationsystems(LDAP)� Keywordindexandsearchenginesystem
NewsdistribuRon� VCCORPpossessesmanylargeonlinenewspapersinVN� WeareproudofbecomingthefirstcompanyinVietnamabletoimplementanautomaticmethodforpublishingnews
33
AnalyticSystem Autopublish
NewsdistribuRon� Akeychallengeofnewswebsitesistohelpusersfindthearticlesthatareinterestingtoread
� Manytechniquesareappliedas:� Real-timeengagementstatistic� Personalization� NLP� Eventdetection� Trendingdetection� Breakingnewsdetection
34
CORENLP
38
System Accuracy (%) (VCCorp) Others Speed(/s)
Word Segmentation 98.8 97.0 (VLSp) 47,855 tokens
POS tagging 94.5 93 (VLSp) ~50k tokens NER 87.0 85.0 (Baomoi.com) ~22k tokens
Chunking 84.0 81.0 (VLSp) 800 tokens UniversalDependencyParser
72.0(UAS), 66.0 ( LAS)
68.28%(UAS),66.30%(LAS)
1200 sentences
Co-reference resolution 57.0 N/A 106 docs
VLSp:http://vlsp.hpda.vn:8080/demo/?page=resources
EntitylinkingApplication
Knowledgebase
AilàgiámđốcNgânhàngACB?
[entity1,relation,entity2]
[entityN,relation,entityN]
…ÔngNguyễnVănHòalàgiámđốcNgânhàng
ACB
question
answer
QueryandretrieveinformationfromKB
"Ông Dũng là Thạc_sĩ ngành cơ_điện Trường Đại_hoc New_York (Mỹ) và Thạc_sĩ Quan_hệ quốc_tế Trường Đại_học Georgetown (Mỹ) . “
[ÔngDũng,là,Thạc_sĩ][ÔngDũng,là,Thạc_sĩngànhcơ_điện][ÔngDũng,là,Thạc_sĩQuan_hệquốc_tế]…
<CoNLLformat> <Entity1,relation,Entity2><Rawtext>
1Input
sentenceDependency
tree Outputlinks
"Trong năm 2015, lãi trước thuế của BIDV đạt 7.944 tỷ đồng (tăng 26,16 %), lãi sau thuế đạt 6.382 tỷ đồng (tăng hơn 28%), vốn_điều_lệ của BIDV tăng lên 34.187 tỷ đồng, tổng tài_sản đạt 850.748 tỷ đồng (tăng 30,8%).
[lãitrướcthuếBIDV,đạt7.944tỷtrong,năm2015][lãisauthuếBIDV,đạt6.381tỷđồngtrong,năm2015][vốnđiềulệBIDV,tăng34.187tỷđồngtrong,năm2015][tổngtàisảnBIDV,đạt850.748tỷđồngtrong,năm2015]…
<CoNLLformat>
<Entity1,relation,Entity2><Rawtext>
2Input
sentenceDependency
tree Outputlinks
SenRmentAnalysis
45
� Level:Doc,sentence,entity,aspectlevel� Data:~1billionrecords,1TBprocessingdata
� Facebook:5Mpages,500kgroups� News:500� Forums:200
� Approach:UsingNLP+TopicModeling+DeepLearning� Accuracy:~70%
RecommendaRonEngine
� Buildingpurchaserecommendationsystemfore-commercesites� Oursuggestionbasedoninformation
� PurchaseHistoryandweb-browserhistory� Productandbuyersknowledge
49
RecommendaRonEngine
� Thealgorithmapplied:� NER+DeepNeuralNetwork� NetworkandProductknowledge� Collaborativefiltering� F-CTR:combinescollaborativealgorithmandproductknowledge
50
54
( )( ) ( )
, ,
k kk
Rank knowledge NER history
r i r iλ=∑
Knowledge
Recommender
NER
History
RecommendaRonEngine–Ranking
VccorpAnalyRc
58
� DevelopingAnalyticToolforwebsites,showinggoodperformanceincomparedwithGoogleAnalytic(GA)inVietnam
� Technologies:� No-SQLselectedasdata-warehouseforlarge-scaledataanalysis� Real-timeanalytic:Streaminglogging
VccorpAnalyRc
61
� Thealgorithmapplied:� Samplingdata� Abnormaldetection(removefaultclicks/sessions)
� Results:agoodcandidatetoreplaceGA,ensuringbothaccuracyandperformance
SamplingData
populationsize=N
strata 1 size = N1
s1
s2
s3 s4
strata 2 size = N2
strata 4 size = N4
strata 3 size = N3
RSWR
- N:Sizeofthesamplingdata- n:Totalstrata- 𝑁↓𝑖 :Sizeofthe𝑖↑𝑡ℎ strata