adversarial analytics - 2013 strata & hadoop world talk

46
Adversarial Analy-cs 101 Robert L. Grossman Open Data Group Strata Conference & Hadoop World October 28, 2013

Upload: robert-grossman

Post on 20-Aug-2015

2.323 views

Category:

Technology


0 download

TRANSCRIPT

Page 1: Adversarial Analytics - 2013 Strata & Hadoop World Talk

Adversarial  Analy-cs  101  

Robert  L.  Grossman  Open  Data  Group  

Strata  Conference  &  Hadoop  World  October  28,  2013  

Page 2: Adversarial Analytics - 2013 Strata & Hadoop World Talk

Examples  of  Adversarial  Analy-cs    

•  Compete  for  resources  – Low  latency  trading  – Auc-ons  

•  Steal  someone  else’s  resources  – Credit  card  fraud  rings    –  Insurance  fraud  rings    

•  Hide  your  ac-vity  – Click  spam    – Exfiltra-on  of  informa-on    

Page 3: Adversarial Analytics - 2013 Strata & Hadoop World Talk

Conven-onal  vs  Adversarial  Analy-cs  

Conven&onal  Analy&cs   Adversarial  Analy&cs  Type  of  game   Gain  /  Gain   Gain  /  Loss  Subject   Individual   Group  /  ring  /  organiza-on  Behavior   DriUs  over  -me   Sudden  changes  Transparency   Behavior  usually  open   Behavior  usually  hidden  Automa-on   Manual  (an  individual)   Some-mes  another  model  

Page 4: Adversarial Analytics - 2013 Strata & Hadoop World Talk

1.  Points  of  Compromise  

Source:  Wall  Street  Journal,  May  4,  2007  

Page 5: Adversarial Analytics - 2013 Strata & Hadoop World Talk

Case  Study:  Points  of  Compromise  •  TJX  compromise  •  Wireless  Point  of  Sale  devices  compromised    

•  Personal  informa-on  on  451,000  individuals  taken  

•  Informa-on  used  months  later  for  fraudulent  purchases.  

•  WSJ  reports  that  45.7  million  credit  card  accounts  at  risk  

•  “TJX's  breach-­‐related  bill  could  surpass  $1  billion  over  five  years”  

Source:  Wall  Street  Journal,  May  4,  2007  

Page 6: Adversarial Analytics - 2013 Strata & Hadoop World Talk

Learning  TradecraU  

Page 7: Adversarial Analytics - 2013 Strata & Hadoop World Talk

The  adversary  is  oUen  a  group,  criminal  ring,  etc.  

Page 8: Adversarial Analytics - 2013 Strata & Hadoop World Talk

It’s  Not  an  Individual…  

Source:  Jus-ce  Department,  New  York  Times.  

Page 9: Adversarial Analytics - 2013 Strata & Hadoop World Talk

Gonzalez  TradecraU  

1.  Exploit  vulnerabili-es  in  wireless  networks  outside  of  retail  stores.  

2.  Exploit  vulnerabili-es  in  databases  of  retail  organiza-ons.  

3.  Gain  unauthorized  access  to  networks  that  process  and  store  credit  card  transac-ons.    

4.  Exfiltrate  40  Million  track  2  records  (data  on  credit  card’s  magne-c  strip)  

Page 10: Adversarial Analytics - 2013 Strata & Hadoop World Talk

5.  Sell  track  2  data  in  Eastern  Europe,  USA,  and  other  places.  

6.  Create  counterfeit  ATM  and  debit  cards  using  track  2  data.  

7.  Conceal  and  launder  the  illegal  proceeds  obtained  through  anonymous  web  currencies  in  the  US  and  Russia.  

Source:  US  District  Court,  District  of  Massachusehs,  Indictment  of  Albert  Gonzalez,  August  5,  2008.  

Gonzalez  TradecraU  (cont’d)  

Page 11: Adversarial Analytics - 2013 Strata & Hadoop World Talk

Data  Is  Large  

•  TJX  compromise  data  is  too  large  to  fit  into  memory  

•  Data  is  difficult  to  fit  into  database  

•  Millions  of  possible  points  of  compromise  

•  Data  must  be  kept  for  months  to  years  

Source:  Wall  Street  Journal,  May  4,  2007  

Page 12: Adversarial Analytics - 2013 Strata & Hadoop World Talk

2.  Introduc-on  to    Adversarial  Analy-cs  

Page 13: Adversarial Analytics - 2013 Strata & Hadoop World Talk

•  Example:  predic-ve  model  to  select  online  ads.  

•  Example:  predic-ve  model  to  classify  sen-ment  of  a  customer  service  interac-on.  

•  Example:  commercial  compe-tor  trying  to  maximize  trading  gains.  

•  Example:  criminal  gang  ahacking  a  system  and  hiding  its  behavior.  

Conven-onal  Analy-cs   Adversarial  Analy-cs  

Page 14: Adversarial Analytics - 2013 Strata & Hadoop World Talk

What’s  Different  About  the  Models?  

Conven&onal  Analy&cs   Adversarial  Analy&cs  Data   Labeled   Unlabeled  Data  size   Small  to  large   Small  to  very  large  Model   Classifica-on  /    

Regression  Clustering,  change  detec-on  

Frequency  of  update  

Monthly,  quarterly,  yearly  

Hourly,  daily,  weekly,  …  

Page 15: Adversarial Analytics - 2013 Strata & Hadoop World Talk

Types  of  Adversaries  

Use  exis-ng  analy-c  techniques.  

Create  new  analy-c  techniques.  

Change  the  game.  

$1,000  

$100,000  

$10,000,000  Use  specialized  teams  and  approaches.  

Use  products.  

Build  custom  analy-c  models.  

Home  Team   Adversary  

Source:  This  approach  is  based  in  part  on  the  threat  hierarchy  in  the  Defense  Science  (DSB)  Report  on  Resilient  Military  Systems  and  the  Cyber  Threat,  January,  2013  

Page 16: Adversarial Analytics - 2013 Strata & Hadoop World Talk

What  is  Different  About  the  Adversary?  

Conven&onal  Analy&cs   Adversarial  Analy&cs  En-ty   Person  browsing   Gang  ahacking  a  system  What  is  modeled?  

Events,  en--es   Events,  en--es,  rings  

Behavior   Component  of  natural  behavior  

Wai-ng  for  opportuni-es  

Evolu-on   DriU   Ac-ve  change  in  behavior    

Obfusca-on   Ignore   Hide  

Page 17: Adversarial Analytics - 2013 Strata & Hadoop World Talk

Upda-ng  Models  Conven&onal  Analy&cs  

Adversarial  Analy&cs  

When  do  you  update  model?  

When  there  is  a  major  change  in  behavior  

Frequently,  to  gain  an  advantage.  

Process  for  upda-ng  models  

Automa-on  and  analy-c  infrastructure  helpful.  

Automa-on  and  analy-c  infrastructure  essen-al.  

Page 18: Adversarial Analytics - 2013 Strata & Hadoop World Talk

3.  More  Examples  

Page 19: Adversarial Analytics - 2013 Strata & Hadoop World Talk

Click  Fraud  

•  Is  a  click  on  an  ad  legi-mate?  •  Is  it  done  by  a  computer  or  a  human?  •  If  done  by  a  human,  is  it  an  adversary?  

?  

Page 20: Adversarial Analytics - 2013 Strata & Hadoop World Talk

Types  of  Adversaries  

Use  exis-ng  analy-c  techniques.  

Create  new  analy-c  techniques.  

Change  the  game.  

$1,000  

$100,000  

$10,000,000  Use  specialized  teams  and  approaches.  

Use  products.  

Build  custom  analy-c  models.  

Home  Team   Adversary  

Source:  This  approach  is  based  in  part  on  the  threat  hierarchy  in  the  Defense  Science  (DSB)  Report  on  Resilient  Military  Systems  and  the  Cyber  Threat,  January,  2013  

Page 21: Adversarial Analytics - 2013 Strata & Hadoop World Talk

Low  Latency  Trading  

•  2+  billion  bids,  asks,  executes  /  day  •  >  90%  are  machine  generated  •  Most  are  cancelled  •  Decisions  are  made  in  ms  

ECN   Trades  

Back  tes-ng  

Low  latency  trader  

Market  data  

Algorithm  

Algorithm  Adversarial  traders  

Broker/Dealer  

Page 22: Adversarial Analytics - 2013 Strata & Hadoop World Talk

Connec-ng  Chicago  &  NYC  

Technology   Vendor   Round  Trip  Time  (ms)  

Microwave   Windy  Apple    

9  .0  

Dark  fiber,  shorter  path  

Spread  Networks  

13.1  

Standard  fiber,  ISP  

Various   14.5+  

Source:  lowlatency.com,  June  27,  2012;  Wired  Magazine,  August  3,  2012  

Page 23: Adversarial Analytics - 2013 Strata & Hadoop World Talk

Types  of  Adversaries  

Use  exis-ng  analy-c  techniques.  

Create  new  analy-c  techniques.  

Change  the  game.  

$1,000  

$100,000  

$10,000,000  Use  specialized  teams  and  approaches.  

Use  products.  

Build  custom  analy-c  models.  

Home  Team   Adversary  

Source:  This  approach  is  based  in  part  on  the  threat  hierarchy  in  the  Defense  Science  (DSB)  Report  on  Resilient  Military  Systems  and  the  Cyber  Threat,  January,  2013  

Page 24: Adversarial Analytics - 2013 Strata & Hadoop World Talk

Example:  Conficker  

Source:  Conficker  Working  Group:  Lessons  Learned,  June  2010  (Published  January  2011),www.confickerworkinggroup.org  

Page 25: Adversarial Analytics - 2013 Strata & Hadoop World Talk

Command  and  control  center  

Infected  PCs  

Page 26: Adversarial Analytics - 2013 Strata & Hadoop World Talk

Source:  Conficker  Working  Group:  Lessons  Learned,  June  2010  (Published  January  2011),www.confickerworkinggroup.org  

Page 27: Adversarial Analytics - 2013 Strata & Hadoop World Talk

Source:  Conficker  Working  Group:  Lessons  Learned,  June  2010  (Published  January  2011),www.confickerworkinggroup.org  

Page 28: Adversarial Analytics - 2013 Strata & Hadoop World Talk

Types  of  Adversaries  

Use  exis-ng  analy-c  techniques.  

Create  new  analy-c  techniques.  

Change  the  game.  

$1,000  

$100,000  

$10,000,000  Use  specialized  teams  and  approaches.  

Use  products.  

Build  custom  analy-c  models.  

Home  Team   Adversary  

Source:  This  approach  is  based  in  part  on  the  threat  hierarchy  in  the  Defense  Science  (DSB)  Report  on  Resilient  Military  Systems  and  the  Cyber  Threat,  January,  2013  

Page 29: Adversarial Analytics - 2013 Strata & Hadoop World Talk

4.  Building  Models  for    Adversarial  Analy-cs  

Page 30: Adversarial Analytics - 2013 Strata & Hadoop World Talk

Building  Models  To    Iden-fy  Fraudulent  Transac-ons  

Scoring  transac-ons:  1.  Get  next  event  

(transac-on).  2.  Update  one  or  more  

associated  feature  vectors  (account-­‐based)  

3.  Use  model  to  process  updated  feature  vectors  to  compute  scores.  

4.  Post-­‐process  scores  using  rules  to  compute  alerts.  

Building  the  model:  1.  Get  a  dataset  of  labeled  

data  (i.e.  fraud  is  labeled).  2.  Build  a  candidate  predic-ve  

model  that  scores  each  transac-on  with  likelihood  of  fraud.  

3.  Compare  liU  of  candidate  model  to  current  champion  model  on  hold-­‐out  dataset.  

4.  Deploy  new  model  if  performance  is  beher.  

Page 31: Adversarial Analytics - 2013 Strata & Hadoop World Talk

Building  Models  to    Iden-fy  Fraud  Rings  

•  Look  for  suspicious  paherns  and  rela-onships  among  claims.    

•  Look  for  hidden  rela-onships  through  common  phone  numbers,  nearby  addresses,  etc.  

Repair  shop  

Driver  

Driver  

Clinic  

Driver  

Driver  

Driver  

Repair  shop  

Clinic  

Driver  

Page 32: Adversarial Analytics - 2013 Strata & Hadoop World Talk

Method  1:  Look  for  Tes-ng  Behavior  

•  Adversaries  oUen  test  an  approach  first.  •  Build  models  to  detect  the  tes-ng.  

Page 33: Adversarial Analytics - 2013 Strata & Hadoop World Talk

Method  2:  Iden-fy  Common  Behavior  

•  Common  cluster  algorithms  (e.g.  k-­‐means)  require  specifying  the  number  of  clusters  

•  For  many  applica-ons,  we  keep  the  number  of  clusters  k  rela-vely  small  

•  With  microclustering,  we  produce  many  clusters,  even  if  some  of  the  them  are  quite  similar.  

•  Microclusters  can  some-mes  be  labeled    

Page 34: Adversarial Analytics - 2013 Strata & Hadoop World Talk

Microcluster  Based  Scoring  

•  The  closer  a  feature  vector  is  to  a  good  cluster,  the  more  likely  it  is  to  be  good  

•  The  closer  it  is  to  a  bad  cluster,  the  more  likely  it  is  to  be  bad.  

Bad  Cluster  

Good  Cluster  

Page 35: Adversarial Analytics - 2013 Strata & Hadoop World Talk

Method  3:  Scaling  Baseline  Algorithms  

Drive  by  exploits  

Source:  Wall  Street  Journal  

Points  of  compromise  

Page 36: Adversarial Analytics - 2013 Strata & Hadoop World Talk

What  are  the  Common  Elements?  •  Time  stamps  •  Sites  – e.g.  Web  sites,  vendors,  computers,  network  devices  

•  En--es  – e.g.  visitors,  users,  flows    

•  Log  files  fill  disks;  many,  many  disks  •  Behavior  occurs  at  all  scales  •  Want  to  iden-fy  phenomena  at  all  scales  •  Need  to  group  “similar  behavior”  

Page 37: Adversarial Analytics - 2013 Strata & Hadoop World Talk

Examples  of  a  Site-­‐En-ty  Data  

37  

Example   Sites   En&&es  Drive-­‐by  exploits   Web  sites   Computers  Compromised  user  accounts  

Compromised  computers  

User  accounts  

Compromised    payment  systems  

Merchants   Cardholders  

Source:  Collin  Benneh,  Robert  L.  Grossman,  David  Locke,  Jonathan  Seidman  and  Steve  Vejcik,  MalStone:  Towards  a  Benchmark  for  Analy-cs  on  Large  Data  Clouds,  The  16th  ACM  SIGKDD  Interna-onal  Conference  on  Knowledge  Discovery  and  Data  Mining  (KDD  2010),  ACM,  2010.  

Page 38: Adversarial Analytics - 2013 Strata & Hadoop World Talk

-me   38  dk-­‐1   dk  

Exposure  Window   Monitor  Window  

sites   en--es  

Page 39: Adversarial Analytics - 2013 Strata & Hadoop World Talk

The  Mark  Model  •  Some  sites  are  marked  (percent  of  mark  is  a  parameter  and  type  of  sites  marked  is  a  draw  from  a  distribu-on)  

•  Some  en--es  become  marked  aUer  visi-ng  a  marked  site  (this  is  a  draw  from  a  distribu-on)  

•  There  is  a  delay  between  the  visit  and  the  when  the  en-ty  becomes  marked  (this  is  a  draw  from  a  distribu-on)  

•  There  is  a  background  process  that  marks  some  en--es  independent  of  visit  (this  adds  noise  to  problem)  

Page 40: Adversarial Analytics - 2013 Strata & Hadoop World Talk

Subsequent  Propor-on  of  Marks  

•  Fix  a  site  s[j]  •  Let  A[j]  be  en--es  that  transact  during  ExpWin  and  if  en-ty  is  marked,  then  visit  occurs  before  mark  

•  Let  B[j]  be  all  en--es  in  A[j]  that  become  marked  some-me  during  the  MonWin  

•  Subsequent  propor-on  of  marks  is                                          r[j]  =  |  B[j]  |    /    |  A[j]    |  •  Easily  computed  with  MapReduce.    

Page 41: Adversarial Analytics - 2013 Strata & Hadoop World Talk

Compu-ng  Site  En-ty    Sta-s-cs  with  MapReduce  

41  

MalStone  B  Hadoop    MapReduce  v0.18.3   799  min  Hadoop  Python  Streams    v0.18.3   142  min  Sector  UDFs  v1.19   44  min  #  Nodes   20  nodes  #  Records   10  Billion  Size  of  Dataset     1  TB  

Source:  Collin  Benneh,  Robert  L.  Grossman,  David  Locke,  Jonathan  Seidman  and  Steve  Vejcik,  MalStone:  Towards  a  Benchmark  for  Analy-cs  on  Large  Data  Clouds,  The  16th  ACM  SIGKDD  Interna-onal  Conference  on  Knowledge  Discovery  and  Data  Mining  (KDD  2010),  ACM,  2010.  

Page 42: Adversarial Analytics - 2013 Strata & Hadoop World Talk

5.  Summary  

Page 43: Adversarial Analytics - 2013 Strata & Hadoop World Talk

Some  Rules  1.  Understand  your  adversary's  tradecraU;  he  or  

she  will  go  aUer  the  easiest  or  weakest  link.  2.  Your  adversary  will  be  crea-ng  new  

opportuni-es;  you  must  create  new  models  and  analy-c  approaches.  

3.  Risk  is  usually  viewed  as  the  cost  of  doing  business,  but  every  once  in  a  while  the  cost  may  be  100x  –  1,000x  greater  

4.  The  quicker  you  can  update  your  strategy  and  models,  the  greater  your  advantage.    Use  automa-on  (e.g.  PMML-­‐based  scoring  engines)  to  deploy  models  more  quickly.  

Page 44: Adversarial Analytics - 2013 Strata & Hadoop World Talk

Ques-ons?  

Page 45: Adversarial Analytics - 2013 Strata & Hadoop World Talk

About  Robert  L.  Grossman  Robert  Grossman  (@bobgrossman)  is  the  Founder  and  a  Partner  of  Open  Data  Group,  which  specializes  in  building  predic-ve  models  over  big  data.  He  is  also  a  Senior  Fellow  at  the  Computa-on  Ins-tute    and  Ins-tute  for  Genomics  and  Systems  Biology  at  the  University  of  Chicago  and  a  Professor  in  the  Biological  Sciences  Division.  He  has  led  the  development  of  new  open  source  soUware  tools  for  analyzing  big  data,  cloud  compu-ng,  and  high  performance  networking.  Prior  to  star-ng  Open  Data  Group,  he  founded  Magnify,  Inc.,  which  provides  data  mining  solu-ons  to  the  insurance  industry.  Grossman  was  Magnify’s  CEO  un-l  2001  and  its  Chairman  un-l  it  was  sold  to  ChoicePoint  in  2005.  He  blogs  occasionaly  about  big  data,  data  science,  and  data  engineering  at  rgrossman.com.      

Page 46: Adversarial Analytics - 2013 Strata & Hadoop World Talk

For  More  Informa-on  

www.opendatagroup.com  [email protected]