text analytics in the eu fusepool project (at ii-sdv 2013 conference)

27
Treparel Delftechpark 26 2628 XH Delft The Netherlands www.treparel.com Text Analytics in the EU Fusepool project II-SDV ’13 - Nice Anton Heijs CTO [email protected] April 15, 2013

Upload: treparel

Post on 11-May-2015

262 views

Category:

Technology


4 download

DESCRIPTION

Fusepool refines and enriches raw data using common standards and provides tools for analyzing and visualizing data so that end users and other software receive timely, context-aware and relevant information. To ensure that data and results are of high quality, Fusepool combines the well-defined but error-prone (semantic) Web 3.0 with controlled supervision and the collaborative but often messy (social) Web 2.0.

TRANSCRIPT

Page 1: Text Analytics in the EU Fusepool project (at II-SDV 2013 conference)

Treparel Delftechpark 26 2628 XH Delft

The Netherlands www.treparel.com

Text Analytics in the

EU Fusepool project

II-SDV ’13 - Nice

Anton Heijs CTO

[email protected]

April 15, 2013

Page 2: Text Analytics in the EU Fusepool project (at II-SDV 2013 conference)

Agenda  

•  Introduc.on  of  the  EU  Fusepool  project,  Treparel  and  the  consor.um  partners  

•  Background  on  objec.ves  of  the  Fusepool  project  •  User  adap.ve  system  

•  Data  pooling  and  linking  •  Large  scale  machine  learning  and  text  analy.cs  

•  Summary  

Treparel KMX – All rights reserved 2013 2 www.treparel.com

Page 3: Text Analytics in the EU Fusepool project (at II-SDV 2013 conference)

The  EU  Fusepool  Consor.um  

Treparel (Delft, The Netherlands) is a global software provider in Big Data Text Analytics and Visualization.

Global companies, government agencies, software vendors or data publishers are using Treparel KMX text analysis software to gain faster, reliable, precise insights in large complex unstructured data sets (like application notes, blogs, email and patents) allowing them to make better informed decisions.

As part of the Fusepool consortium Treparel integrates the KMX Text Analytics technology for advanced classification, clustering and visualization of large complex document collections.

Treparel KMX – All rights reserved 2013 3 www.treparel.com

Page 4: Text Analytics in the EU Fusepool project (at II-SDV 2013 conference)

Each  partner  provides  cri.cal  building  blocks:  

  BUAS:  Dynamic  user  interfaces    ENOLL:  End  users  with  specific  needs    GEOX:  NER,  UI  design    SEARCHBOX:  text  search  and  similarity  seman.c  

matching  

  TREPAREL:  text/patent  classifica.on  engines    XEROX:  user-­‐adap.ve  learning  know-­‐how  

           BUT  BREAK-­‐THROUGH  BY  INTEGRATION  

The  Big  Picture  

Treparel KMX – All rights reserved 2013 4 www.treparel.com

Page 5: Text Analytics in the EU Fusepool project (at II-SDV 2013 conference)

WP4 Find & match

WP5 GUI & Visuals

WP2 Platform,sourcing

WP3 Extract & enrich

WP1 Reqs & Testing

WP6 User Involvement

Learning Learning Learning Learning Feedback

Work  packages  

The EC Fusepool project is a 2 year EU project and started in July 2012

Treparel KMX – All rights reserved 2013 5 www.treparel.com

Page 6: Text Analytics in the EU Fusepool project (at II-SDV 2013 conference)

Fusing  and  pooling  informa.on    for  product  development  

Vision: User-adaptive system Living Lab: Rapid app development Data processing: Sourcing & interlinking Machine learning: Matching & optimizing

Integrated Use Cases

Treparel KMX – All rights reserved 2013 6 www.treparel.com

Page 7: Text Analytics in the EU Fusepool project (at II-SDV 2013 conference)

Background  

•  SMEs  have  a  need  for  technology  intelligence  for  detec.ng  and  responding  to  opportuni.es  and  threats  

•  This  a  partly  driven  by  growth  and  complexity  of  patents  and  lawsuits  

•  Consumer  intelligence  to  detect  opinions  and  needs  of  consumers  for  product  development  

•  Open  innova.on  requiring  coopera.on  (links  between  data,  e.g.  finding  business  partners)  

•  Focus:  Machine  Learning  algorithms  to  improve  matching    

Treparel KMX – All rights reserved 2013 7 www.treparel.com

Page 8: Text Analytics in the EU Fusepool project (at II-SDV 2013 conference)

User-­‐adap.ve  system  

•  Focus:  monitor  and  learn  specific  needs  and  preferences  of  a  user  to  align  features,  func.onali.es,  and  graphical  interfaces  

•  AdapDve:  machine  learning  from  crowd-­‐sourcing  (rather  than  rule-­‐based)  

•  User-­‐aligned  prioriDzaDon:  more  usable  and  customized  interfaces,  sugges.ons  based  on  ac.vity  &  user  feedback  

Treparel KMX – All rights reserved 2013 8 www.treparel.com

Page 9: Text Analytics in the EU Fusepool project (at II-SDV 2013 conference)

User-­‐adap.ve  matching  

•  Main  goal:  automated  user-­‐adap.ve  matching  of  users  to:  

–  Patent  analysis  –  Finding  funding  opportuni.es  –  Partner  matching  

•  Key  asset:  informa.on  provided  by  the  user    

•  User  Data  Credo:    accuracy  improves  with  quan.ty  and  quality  of  user  data  while  variety  (breadth)  increases  with  number  of  users  

•  Living  lab:  Co-­‐crea.on  between  creators  and  consumers  of  the  Fusepool  plaWorm  

Treparel KMX – All rights reserved 2013 9 www.treparel.com

Page 10: Text Analytics in the EU Fusepool project (at II-SDV 2013 conference)

Data  sourcing  

•  Sources:  internal  &  external  content  from  web  harves.ng  and  structured  data  sources  

–  using  content  databases  and  linked  open  data  •  Scope:  ini.al  data  corpus  includes  all  explicitly  in-­‐  and  

excluded  sources  

•  Gained  value  from  InformaDon:  recommenda.ons  based  on  machine  learning  from  feedback  

Treparel KMX – All rights reserved 2013 10 www.treparel.com

Page 11: Text Analytics in the EU Fusepool project (at II-SDV 2013 conference)

Data  handling  

1.   Text  analysis  and  feature  extracDon:  ML  &  NLP  methods  for  categorizing,  named  en.ty  extrac.on,  etc.  

2.   Shared  metadata  models:  mapping  text  features  to  exis.ng/custom  ontologies  and  genera.on  of  seman.c  triplets  

→  High-­‐level  abstrac.on  &  persistence  for  reuse  →  Lightweight  storage:  mostly  metadata  only,  text  

indexing  and  abstrac.on  uses  schema-­‐free  key-­‐value  (enabling  ac.onable  facets)  

Treparel KMX – All rights reserved 2013 11 www.treparel.com

Page 12: Text Analytics in the EU Fusepool project (at II-SDV 2013 conference)

Data  interlinking  

•  Contextualize:  terms  are  interlinked  with  same  and  similar  terms  across  sources:  –  Enrich  the  extracted  content  with  exis.ng  informa.on  available  

in  the  Internet  

–  Interlink  as  much  informa.on  as  possible  to  increase  the  value  of  knowledge  extrac.on  

–  Use  available  public  sector  resources  in  Seman.c  Web  and  LOD  format  

•  Metadata:  when  a  user  uploads  texts  to  be  matched  with  other  content,  only  the  metadata  descriptors  are  transmiYed  

•  Data  privacy:  data  fusion  from  diverse  sources  without  endangering  user  privacy  

Treparel KMX – All rights reserved 2013 12 www.treparel.com

Page 13: Text Analytics in the EU Fusepool project (at II-SDV 2013 conference)

Searching  &  finding  

•  Key  search-­‐oriented  features:  –  Search  through  all  content  in  the  data  pool  –  Faceted  search  (categories,  metadata,  en..es)  –  Integra.on  of  Linked  Open  Data  (LOD)  results  –  Cross-­‐lingual  indexing  and  cross-­‐referencing  –  “Did  you  mean?”-­‐func.onality  in  case  of  typos  and  auto-­‐

comple.on  of  search  queries  

•  User-­‐adapDve:  indexing  and  integra.on  based  on  users  needs  (e.g.  user  profiling)  

Treparel KMX – All rights reserved 2013 13 www.treparel.com

Page 14: Text Analytics in the EU Fusepool project (at II-SDV 2013 conference)

Adapta.on  &  refinement  

•  AdapDve  search:  results  are  aligned  to  user  preferences  based  on  analysis  of  user  implicit  and  explicit  feedback    

•  MulD-­‐task  ranking:  good  trade-­‐off  between  user-­‐independent  search  (high  coverage  but  lower  precision)  and  a  very  customized  approach  

•  Query  intent  discovery:  analysis  of  the  query  structure  and  interlinking  of  queries  

Treparel KMX – All rights reserved 2013 14 www.treparel.com

Page 15: Text Analytics in the EU Fusepool project (at II-SDV 2013 conference)

Correla.ng  &  matching  

•  Search  guided  navigaDon:  seman.c  matching  extracts  contextual  rela.onships  to  list  related  content  –  sugges.ons  organized  by  categories  –  exposing  facets  within  related  content  

•  Distributed  rule  and  event  model:  defines  states,  ac.ons,  and  consequences  (e.g.  no.fica.ons,  visualiza.ons)  for  reasoning  based  on  light-­‐weight  ontologies  

Treparel KMX – All rights reserved 2013 15 www.treparel.com

Page 16: Text Analytics in the EU Fusepool project (at II-SDV 2013 conference)

Crowd  sourcing  &    supervised  automa.on  

•  RelaDonal  learning:  related  instances  are  used  to  reason  about  the  focal  instance  –  Ra.onality  of  content  (links  to  other  content,  people,  etc.)  

provide  rich  informa.on  

–  Similari.es/dissimilari.es  to  other  content  is  established  purely  on  rela.onal  proper.es  

Treparel KMX – All rights reserved 2013 16 www.treparel.com

Page 17: Text Analytics in the EU Fusepool project (at II-SDV 2013 conference)

Visualiza.on  

Clustering   Classifica.on  

Text  Preprocessing  and  Indexing  

Acquire  documents  

Present  Results  

Taxonomies,  Ontologies  

Seman.c  Analysis  

Document  level  analysis  using  the  KMX  technology    

KMX  unique  func.ons:  •  Extract  concepts  in  context  using  clustering  and  classifica.on  of  documents  

•  Use  classifica.on  to  create  ranked  lists  and  to  tag  subsets  

•  Support  of  binary  and  mul.-­‐class  Classifica.on  

•  Integra.on  with  other  applica.ons  through  KMX  API  

Treparel KMX – All rights reserved 2012 www.treparel.com 17

Query & Search Tools

Page 18: Text Analytics in the EU Fusepool project (at II-SDV 2013 conference)

Treparel KMX – All rights reserved 2013 18 www.treparel.com

NER  in  the  landscaping    

Page 19: Text Analytics in the EU Fusepool project (at II-SDV 2013 conference)

Sentence  level  analysis  using    Named  En.ty  Recogni.on  

•  The  aim  of  NER  is  to  iden.fy  en..es  in  unstructured  text  documents.  o  To  locate  :  mark-­‐up  the  en..es  

o  To  classify  :  into  predefined  categories/domains  

•  Aim  of  usage  o  To  recognise  trends  (trained  and  new  high  frequency)  o  To  find  all  „trained”  en..es  o  To  „discover”  new  en.tes  

•  NER  approaches  o  Sta.s.cs  based  (supervised  machine  learning)  

o  Rule  based  (regular  expressions)  

Treparel KMX – All rights reserved 2013 19 www.treparel.com

Page 20: Text Analytics in the EU Fusepool project (at II-SDV 2013 conference)

Building  a  NER  model  

Treparel KMX – All rights reserved 2013 20 www.treparel.com

Page 21: Text Analytics in the EU Fusepool project (at II-SDV 2013 conference)

3  NER  model  examples  

  Training  text:  500  patents  from  EPO  

  Training:  model  building  by  Stanford  NER  

•  LTE  (long  term  evoluDon)  

•  F1:  88%  

•  New  en..es:  „GSM  BSS”,  „LTE  TDD”  

•  False  pos:  „Loca.on  Area  LA”  

•  Elements  

•  F1:  98%  

•  False  pos:  „argon/hydrogen”  

•  Cancer  •  F1:  87%  

•  New  en..es:  „myeloma  cell”,  „tumor  .ssue”  

•  False  pos:  „cell  mortality”,  „the  test  compound”  Treparel KMX – All rights reserved 2013 21 www.treparel.com

Page 22: Text Analytics in the EU Fusepool project (at II-SDV 2013 conference)

Using  NER  in  a  GUI  

Treparel KMX – All rights reserved 2013 22 www.treparel.com

Page 23: Text Analytics in the EU Fusepool project (at II-SDV 2013 conference)

Raw tex

Extract  terms  indica.ng  domains  using  patent  classifica.on  codes  

Content  /  En.ty  Hub  

Generate  Vector  Space  Model  

Document Vectors

Generate  Classifiers  to  es.mate  domains  

Annotated tex with domain labels

Domain labels

Training Vectors

DocId : 1122 Title : Pesticide device Classification Code : A61B Text: Hihaho Etc etc

Table A61 : pesticide A61B : pesticide solvent Etc etc

Many Vectors Doc-Id : 1122 750 terms + weights

Classifier  on  pes.cide  solvents  

25 Positive vectors Doc-Id’s Label 750 terms + weights

25 negative vectors Doc-Id’s Label 750 terms + weights

Doc-Id : 1122 Labels + scores Title : Pesticide device Classification Code : A61B Text: Hihaho …Etc etc

1

2

3

4

5

6

7

8

9

10  

11  

Combining  text  analysis  approaches  

Treparel KMX – All rights reserved 2013 23 www.treparel.com

Page 24: Text Analytics in the EU Fusepool project (at II-SDV 2013 conference)

•  Data  sourcing:  retrieves  data  from  data  sources  via  data  integrator  

•  Storage:  raw  (e.g.  text)  or  processed  data  stored  in  database  or  triple  store  

•  Using  ML  to  enable  learning  from  the  crowd  

•  Push  &  Pull:      interface  to  consumers  (web  and  mobile  apps)  

  suppor.ng  quality  and  access  control  from  portal  

•  Portal:  business  logic,  storage  of  registered  data  sources,  access  control  using  web  frontend  

Somware  architecture  

Treparel KMX – All rights reserved 2013 24 www.treparel.com

Page 25: Text Analytics in the EU Fusepool project (at II-SDV 2013 conference)

Open  Call  for  Users  

•  ApplicaDons  received  from  Finland,  Germany,  Spain,  Greece,  Bulgaria,  Hungary,  Italy,  UK,  Belgium,  China,  Switzerland,  France,  Denmark,  Portugal,  Ireland  and  The  Netherlands    

•  Business  areas  covered:    –  Bio-­‐medical,  

–  Pharma  and  biotech,  

–  ICT/Telecommunica.ons,    

–  Digital  media,    

–  Renewable  energies,    

–  Educa.on,    

–  Innova.on/consultancy  services  for  SMEs.    

•  Mix  of  profiles  from  SMEs,  Research,  SME  intermediaries  (incubators,  Science  parks,  Living  Labs,  etc),  developers  and  more  

•  MulDple  areas  of  research,  development  and  innova.on  areas  are  covered.  

25 Treparel KMX – All rights reserved 2013 25 www.treparel.com

Page 26: Text Analytics in the EU Fusepool project (at II-SDV 2013 conference)

•  Data  as  a  service:  scale  economies  of  scale  in  management  of  data    

•  Data  pooling:  processes  need  .mely  aggrega.on  and  redistribu.on  of  diverse  data  but  building  own  is  redundant  and  prohibi.ve  for  SMEs    provide  services  on  top  of  pool  with  high  quality  data  

  provide  access  to  services  on  demand  •  Success  criteria:  

  Early  provision  of  scalable  basic  Fusepool  services    SME  involvement  and  uptake  of  Fusepool  services  

  Machine  learning  for  data  cura.on  &  user  adapta.on  

•  Required  steps:  

  Stepwise  integra.on  of  exis.ng  &  new  technologies    Early  and  ongoing  feedback  from  end  users  

EC  perspec.ve  

Treparel KMX – All rights reserved 2013 26 www.treparel.com

Page 27: Text Analytics in the EU Fusepool project (at II-SDV 2013 conference)

The  EC  FP  7  program  Fusepool  is  all  about:    •  Building  a  plaWorm  with  web  enabled  services  for  

•  Data  pooling  

•  Large  scale  text  analysis  

•  Large  scale  machine  learning  of  user  input  

•  Enable  SME’s  with  analy.cs  for  improving  their  innova.on  and  compe..ve  strengths  using  

•  SME’s  involvement  and  feedback  to  the  Fusepool  services  

•  Machine  learning  for  data  cura.on  &  user  adapta.on  

Summary  

Treparel KMX – All rights reserved 2013 27 www.treparel.com

Welcome to visit Fusepool at: www.fusepool.net