Diving in Panama Papers and Open Data to Discover Emerging News

Diving in Panama Papers and Open Data Ontotext Webinar, 26 May 2016

Upload: ontotext

Post on 16-Apr-2017


Relation Discovery Case

May 2016

• Find suspicious relationships like:
− A company in the USA controls another company in the USA through a company in an off-shore zone
• Show news relevant to them
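
The control pattern described in this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the company names, the country labels, the OFFSHORE set and the suspicious_chains helper are all hypothetical, and the webinar itself works with SPARQL over GraphDB rather than with in-memory lists.

```python
from typing import NamedTuple


class Company(NamedTuple):
    name: str
    country: str


# Hypothetical set of jurisdictions treated as off-shore zones.
OFFSHORE = {"BVI", "Panama", "Seychelles"}


def suspicious_chains(companies, controls):
    """Yield (owner, intermediary, target) chains where a US company
    controls another US company through an off-shore intermediary."""
    by_name = {c.name: c for c in companies}
    # Index the direct control edges by controlling company.
    controlled_by = {}
    for owner, target in controls:
        controlled_by.setdefault(owner, []).append(target)
    for owner, middles in controlled_by.items():
        if by_name[owner].country != "USA":
            continue
        for middle in middles:
            if by_name[middle].country not in OFFSHORE:
                continue
            for target in controlled_by.get(middle, []):
                if by_name[target].country == "USA":
                    yield (owner, middle, target)


# Made-up sample data, not taken from the Offshore Leaks DB.
companies = [
    Company("Acme Holdings", "USA"),
    Company("Palm Shell Ltd", "BVI"),
    Company("Acme Retail", "USA"),
    Company("Acme Europe", "Germany"),
]
controls = [
    ("Acme Holdings", "Palm Shell Ltd"),
    ("Palm Shell Ltd", "Acme Retail"),
    ("Acme Holdings", "Acme Europe"),
]
chains = list(suspicious_chains(companies, controls))
print(chains)
```

Only the chain that passes through the off-shore intermediary is reported; the direct control of the German company is ignored.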

Presentation Outline

• Publishing Panama Papers DB as #LinkedLeaks
• Sample Queries
• FactForge-News open-data playground
• Next steps

Offshore Leaks Database from ICIJ

• Published by the International Consortium of Investigative Journalists (ICIJ) on 9 May 2016
• A “searchable database” of about 320 000 offshore companies
− 214 000 extracted from the Panama Papers (valid until 2015)
− More than 100 000 from the 2013 Offshore Leaks investigation (valid until 2010)
• CSV extract from a graph database available for download
• https://offshoreleaks.icij.org/

Offshore Leaks Database

Offshore Leaks DB as Linked Open Data

• Ontotext published the Offshore Leaks DB as Linked Open Data
• Available for exploration, querying and download at http://data.ontotext.com
• ONTOTEXT DISCLAIMERS: We use the data as is provided by ICIJ. We make no representations and warranties of any kind, including warranties of title, accuracy, absence of errors or fitness for particular purpose. All transformations, query results and derivative works are used only to showcase the service and technological capabilities and not to serve as basis for any statements or conclusions.

Enrichment and structuring of the data

• Relationship type hierarchy
− About 80 relationship types in the original dataset were organized into a property hierarchy
• Classification of officers into Person and Company
− In the original database there is no way to distinguish whether an officer is a physical person
• Mapping to DBpedia
− 209 countries referred to in the Offshore Leaks DB are mapped to DBpedia
− About 3000 companies and 300 persons mapped to DBpedia
• Overall size of the repository: 22M statements (20M explicit)

The RDF-ization Process

• Linked data variant produced without programming
− The raw CSV files are RDF-ized using TARQL, http://tarql.github.io/
− Data was further interlinked and enriched in GraphDB using SPARQL
• The process is documented in this README file
• All relevant artifacts are open-source, available at https://github.com/Ontotext-AD/leaks/
• The entire publishing and mapping took about 15 person-days!
− Including data.ontotext.com portal setup, promotion, documentation, etc.
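
To give a feel for what "RDF-izing" a CSV file means, here is a minimal Python sketch that turns CSV rows into N-Triples lines. It is not TARQL and not the mapping used for #LinkedLeaks: the example.org namespace, the column names (node_id, name, jurisdiction) and both helper functions are assumptions made for illustration, and every value is emitted as a plain string literal.

```python
import csv
import io

# Hypothetical namespace; the real dataset uses its own URIs.
BASE = "http://example.org/leaks/"


def escape_literal(value: str) -> str:
    """Escape backslashes, quotes and newlines for an N-Triples literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n"))


def rows_to_ntriples(csv_text: str):
    """Yield one N-Triples line per non-empty cell, using the column
    name as the predicate and the node_id column as the subject."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{BASE}entity/{row['node_id']}>"
        for column, value in row.items():
            if column == "node_id" or not value:
                continue
            yield f'{subject} <{BASE}prop/{column}> "{escape_literal(value)}" .'


# Made-up sample rows; the second row has an empty jurisdiction cell.
sample = (
    "node_id,name,jurisdiction\n"
    '1001,"ACME ""Shell"" LTD.",BVI\n'
    "1002,Example Corp,\n"
)
triples = list(rows_to_ntriples(sample))
for line in triples:
    print(line)
```

Empty cells produce no triple, which is one reason a row-per-record CSV usually becomes a variable number of statements per entity.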

Presentation Outline

• Publishing Panama Papers DB as #LinkedLeaks
• Sample Queries
• Integration with DBpedia & other data
• Next steps

Open Data & News Analytics

Our approach to Big Data

1. Integrate relevant data from many sources
− Build a Big Knowledge Graph from proprietary databases and taxonomies integrated with millions of facts of Linked Data
2. Infer new facts and unveil relationships
− Performing reasoning across data from different sources
3. Interlink text with big data
− Using text-mining to automatically discover references to concepts and entities
4. Use a NoSQL graph database for metadata management, querying and search

Mar 2016

Quick news-analytics case

• Our Dynamic Semantic Publishing platform already offers linking of text with big open data graphs
• One can navigate from text to concepts, get trends, related entities and news
• Try it at http://now.ontotext.com

FF-NEWS: Data Integration and Loading

• DBpedia (the English version only): 496M statements
• Geonames (all geographic features on Earth): 150M statements
− owl:sameAs links between DBpedia and Geonames: 471K statements
• Company registry data (GLEI): 3M statements
• News metadata (from NOW): 128M statements
• Total size: 986M statements
− Mapped to FIBO; 667M explicit statements + 318M inferred statements
− RDFRank and geo-spatial indices enabled to allow for ranking and efficient geo-spatial constraints

Global Legal Entity Identifier (GLEI) data

• Global Markets Entity Identifier (GMEI) Utility data
− The Global Markets Entity Identifier (GMEI) utility is DTCC's legal entity identifier solution offered in collaboration with SWIFT
− We downloaded a data dump from https://www.gmeiutility.org/
• RDF-ized company records
− Fields: LEI#, legal name, ultimate parent, registered country
− 3M explicit statements for 211 thousand organizations
▪ For comparison, there are 490 000 organizations in DBpedia, and D&B covers more than 200 million
− 10,821 ultimate parent relationships and 1632 ultimate parents
− About 2 800 organizations from the GLEI dump mapped to DBpedia

Loading FIBO

• FIBO = Financial Industry Business Ontology
• We loaded FIBO Foundations and Business Entities (BE) in GraphDB
− About 55 RDF files from the “foundations-14-11-30” and “business-eneitites-15-02-23” packages
• Reasoning switched to OWL 2 RL
− Loading takes 3-4 seconds
• Number of explicit statements: 5 433
• Number of total statements: 20 646
− Of which inferred and materialized: 15 213

Mapping FIBO to DBpedia

• We mapped FIBO to the DBpedia Ontology
− Minimalistic approach: we mapped only as much as we needed

dbo:Organization rdfs:subClassOf fibo-fnd-org-fm:FormalOrganization .
dbo:Company rdfs:subClassOf fibo-be-le-cb:Corporation .
dbo:Person rdfs:subClassOf fibo-fnd-aap-ppl:Person .
dbo:subsidiary rdfs:subPropertyOf fibo-fnd-rel-rel:controls .

• Methodological notes
− Note that fibo-fnd-rel-rel:controls is not transitive
− We mapped more specific DBpedia primitives to more general FIBO ones, so that the data becomes “visible” through FIBO
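
The effect of these mapping axioms can be illustrated with a small Python sketch that applies the two RDFS entailment rules involved (type propagation along rdfs:subClassOf and statement propagation along rdfs:subPropertyOf). It is a toy stand-in for the reasoning that GraphDB performs, not a description of its engine: the infer function and the ex: resources are hypothetical, and a single pass is sufficient only because the four axioms do not chain.

```python
# Mapping axioms copied from the slide, as (subject, predicate, object).
MAPPING = [
    ("dbo:Organization", "rdfs:subClassOf", "fibo-fnd-org-fm:FormalOrganization"),
    ("dbo:Company", "rdfs:subClassOf", "fibo-be-le-cb:Corporation"),
    ("dbo:Person", "rdfs:subClassOf", "fibo-fnd-aap-ppl:Person"),
    ("dbo:subsidiary", "rdfs:subPropertyOf", "fibo-fnd-rel-rel:controls"),
]


def infer(data, mapping):
    """Return the new triples entailed by one pass of two RDFS rules:
    instances of a subclass are instances of the superclass, and a
    statement with a sub-property also holds for the super-property."""
    super_class = {s: o for s, p, o in mapping if p == "rdfs:subClassOf"}
    super_prop = {s: o for s, p, o in mapping if p == "rdfs:subPropertyOf"}
    inferred = set()
    for s, p, o in data:
        if p == "rdf:type" and o in super_class:
            inferred.add((s, "rdf:type", super_class[o]))
        if p in super_prop:
            inferred.add((s, super_prop[p], o))
    return inferred - set(data)


# Made-up DBpedia-style data (the ex: resources are hypothetical).
data = [
    ("ex:ParentCo", "rdf:type", "dbo:Company"),
    ("ex:ParentCo", "dbo:subsidiary", "ex:ChildCo"),
    ("ex:ChildCo", "dbo:subsidiary", "ex:GrandchildCo"),
]
inferred = infer(data, MAPPING)
for triple in sorted(inferred):
    print(triple)
```

Because fibo-fnd-rel-rel:controls is not transitive, nothing states that ex:ParentCo controls ex:GrandchildCo; a query that needs indirect control has to follow the chain explicitly.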

See open data through the FIBO lens

Semantic Press-Clipping

• We can trace references to a specific company in the news
− This is pretty much standard; however, we can deal with syntactic variations in the names, because state-of-the-art Named Entity Recognition technology is used
− More importantly, we correctly distinguish whether a given mention of “Paris” refers to Paris (the capital of France), Paris in Texas, Paris Hilton or Paris (the Greek hero)
• We can trace and consolidate references to daughter companies
• We have a comprehensive industry classification
− The one from DBpedia, but refined to accommodate identifier variations and specialization (e.g. a company classified as dbr:Bank will also be considered classified as dbr:FinancialServices)

Sample queries at http://ff-news.ontotext.com

F1: Big cities in Eastern Europe
F2: Airports near London
F3: People and organizations related to Google
F4: Top-level industries by number of companies
F5: Mentions in the news of an organization and its related entities
F7: Most popular companies per industry, including children
F8: Regional exposition of company – normalized

FF-NEWS is still in beta testing! Not officially launched, but available to play with

http://ff-news.ontotext.com/sparql?name=Orgs+by+number+of+children&infer=true&sameAs=false&query=%23+F5%3A+Mentions+in+the+news+of+an+organization+and+its+related+entities%0A%23+-+retrieves+people+related+to+a+given+organization+with+any+relation+%3B%0A%23+++this+would+be+slow+if+predicate+indices+are+not+switched+on%0A%23+-+retrieves+related+organizations+using+ff-map%3AagentRelation+%3B+%0A%23%09it+generalizes+the+important+relations+between+agents+%0A%23%09(people+and+organizations)+from+DBPedia+++%0A%23+-+the+entity+itself+is+also+added+to+the+set+of+%22related+entities%22%0A%23+++so+that+its+mentions+in+the+news+are+easily+extracted%0A%23+-+uses+news+metadata+imported+continuously+from+http%3A%2F%2Fnow.ontotext.com%0A%23+Change+Gazprom+to+any+organization%2C+e.g.+type+dbr%3ABerks+and+press+%0A%23+Ctrl-Space+to+auto-complete+and+get+dbr%3ABerkshire_Hathaway%0A%0APREFIX+dbr%3A+%3Chttp%3A%2F%2Fdbpedia.org%2Fresource%2F%3E%0APREFIX+pub-old%3A+%3Chttp%3A%2F%2Fontology.ontotext.com%2Fpublishing%23%3E%0APREFIX+pub%3A+%3Chttp%3A%2F%2Fontology.ontotext.com%2Ftaxonomy%2F%3E%0APREFIX+dbo%3A+%3Chttp%3A%2F%2Fdbpedia.org%2Fontology%2F%3E%0APREFIX+ff-map%3A+%3Chttp%3A%2F%2Ffactforge.net%2Fff2016-mapping%2F%3E%0A%0ASELECT+DISTINCT+%3Fnews+%3Ftitle+%3Fdate+%3Frelated_entity++%0A%7B%0A++++%7B+SELECT+DISTINCT+%3Frelated_entity+%7B%0A++++++++BIND+(+dbr%3AGazprom+as+%3Fentity+)%0A%0A%09%09%7B%09%3Frelated_entity+a+dbo%3APerson+%3B+%3Fp+%3Fentity+.%0A+++++++++++++FILTER+NOT+EXISTS+%7B+%3Frelated_entity+dbo%3Aclub+%3Fentity+.+%7D+%0A++++++++%7D+%09++++++++++++%0A++++++++UNION++++%0A++++++++%7B%09%3Frelated_entity+a+dbo%3AOrganisation+%3B+dbo%3Aparent+%3Fentity+.+%7D+%0A++++++++UNION%0A++++++++%7B+++BIND(%3Fentity+as+%3Frelated_entity)+%7D+%0A%09%7D+%7D%0A++++%0A++++%3Fnews+pub-old%3AcontainsMention+%2F+pub-old%3AhasInstance+%2F+pub%3AexactMatch+%3Frelated_entity+.%0A++++%3Fnews+pub-old%3AcreationDate+%3Fdate%3B+pub-old%3Atitle+%3Ftitle+.%0A%7D+%0AORDER+BY+DESC(%3Fdate)+LIMIT+1000&execute=
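
The link above carries its SPARQL query URL-encoded in the query parameter, which makes it hard to read. A minimal sketch of how such a link can be decoded with the Python standard library follows; the URL in the snippet is a shortened, hypothetical one in the same shape, not the F5 query itself.

```python
from urllib.parse import parse_qs, urlsplit

# Shortened, hypothetical URL in the same shape as the sample-query links.
url = (
    "http://ff-news.ontotext.com/sparql?name=Demo&infer=true&sameAs=false"
    "&query=SELECT+%3Fs+%7B+%3Fs+a+dbo%3ACompany+%7D+LIMIT+10&execute="
)

# parse_qs turns '+' into spaces and decodes %XX escapes;
# keep_blank_values=True preserves the empty 'execute' flag.
params = parse_qs(urlsplit(url).query, keep_blank_values=True)
query = params["query"][0]
print(query)
```

Applied to the full link, the same two calls recover the commented SPARQL text, including the PREFIX declarations and the Gazprom binding.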

News Popularity Ranking: Automotive

Rank  Company                    News #    Rank  Company incl. mentions of controlled  News #
1     General Motors             2722      1     General Motors                        4620
2     Tesla Motors               2346      2     Volkswagen Group                      3999
3     Volkswagen                 2299      3     Fiat Chrysler Automobiles             2658
4     Ford Motor Company         1934      4     Tesla Motors                          2370
5     Toyota                     1325      5     Ford Motor Company                    2125
6     Chevrolet                  1264      6     Toyota                                1656
7     Chrysler                   1054      7     Renault-Nissan Alliance               1332
8     Fiat Chrysler Automobiles  1011      8     Honda                                 864
9     Audi AG                    972       9     BMW                                   715
10    Honda                      717       10    Takata Corporation                    547
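
The right-hand ranking counts news for a company together with the companies it controls. One way such a roll-up could be computed is sketched below. This is an assumption about the approach, not the query behind the table: the mentions_with_controlled helper, the parent_of map and all company and article identifiers are made up, and article sets are used instead of raw counts so that an article mentioning both a parent and its brand is counted once.

```python
def mentions_with_controlled(mentions, parent_of):
    """Return, per company, the set of articles mentioning the company
    itself or any company below it in the control chain."""
    rolled = {company: set(articles) for company, articles in mentions.items()}
    for company, articles in mentions.items():
        seen = {company}  # guards against cycles in the control data
        parent = parent_of.get(company)
        while parent is not None and parent not in seen:
            rolled.setdefault(parent, set()).update(articles)
            seen.add(parent)
            parent = parent_of.get(parent)
    return rolled


# Made-up article ids per company and a made-up control hierarchy.
mentions = {
    "ParentCo": {"n1", "n2"},
    "BrandA": {"n2", "n3"},
    "BrandB": {"n4"},
    "Independent": {"n5", "n6", "n7"},
}
parent_of = {"BrandA": "ParentCo", "BrandB": "BrandA"}

rolled = mentions_with_controlled(mentions, parent_of)
# Rank by number of distinct articles, breaking ties by name.
ranking = sorted(rolled, key=lambda c: (-len(rolled[c]), c))
for rank, company in enumerate(ranking, start=1):
    print(rank, company, len(rolled[company]))
```

The roll-up explains why a group can overtake a heavily covered single brand in the second ranking.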

News Popularity: Finance

Rank  Company          News #    Rank  Company incl. mentions of controlled  News #
1     Bloomberg L.P.   3203      1     Intra Bank                            261667
2     Goldman Sachs    1992      2     Hinduja Bank (Switzerland)            49731
3     JP Morgan Chase  1712      3     China Merchants Bank                  38288
4     Wells Fargo      1688      4     Alphabet Inc.                         22601
5     Citigroup        1557      5     Capital Group Companies               4076
6     HSBC Holdings    1546      6     Bloomberg L.P.                        3611
7     Deutsche Bank    1414      7     Exor                                  2704
8     Bank of America  1335      8     Nasdaq, Inc.                          2082
9     Barclays         1260      9     JP Morgan Chase                       1972
10    UBS              694       10    Sentinel Capital Partners             1053

Note: Including investment funds, stock exchanges, agencies, etc.

News Popularity: Banking

Rank  Company          News #    Rank  Company incl. mentions of controlled  News #
1     Goldman Sachs    996       1     China Merchants Bank *                38288
2     JP Morgan Chase  856       2     JP Morgan Chase                       1972
3     HSBC Holdings    773       3     Goldman Sachs                         1030
4     Deutsche Bank    707       4     HSBC                                  966
5     Barclays         630       5     Bank of America                       771
6     Citigroup        519       6     Deutsche Bank                         742
7     Bank of America  445       7     Barclays                              681
8     Wells Fargo      422       8     Citigroup                             630
9     UBS              347       9     Wells Fargo                           428
10    Chase            126       10    UBS                                   347

Note: including investment funds, stock exchanges, agencies, etc.

#LinkedLeaks Mapping Queries

Number of entities mapped by type

Companies mapped by industry

Companies mapped in the Finance sector

Politicians mapped

Athletes mapped

Presentation Outline

• Publishing Panama Papers DB as #LinkedLeaks
• Sample Queries
• FactForge-News open data playground
• Next steps

Future Work

• Publish and interlink LEI data and other datasets
− More comprehensive mapping of LEI data to DBpedia
− Refine #LinkedLeaks, providing more structure; FIBO mapping
− Launch updated FactForge.net portal
• Relationship discovery work
− Ultimate parent and suspicious control pattern discovery
− Organizations related in the news, but not in other datasets
• Partnership with commercial data providers
• Partnership with journalists and analysts

Wrap up

• We published the Offshore Leaks DB as Linked Open Data
− It took us a few days after the release of the raw CSVs
− Mapping to DBpedia available
− Play with it! Take it!
• We allow multiple open datasets to be used for discovery
− It took just a few days to clean up DBpedia’s industry classifications and control relationships
− Several datasets accessible through the Financial Industry Business Ontology (FIBO)
• Integrating more data sources is easy, e.g. GLEI and #LinkedLeaks
− We can integrate proprietary and 3rd party data within days or weeks

Thank you!

Experience the technology with NOW: Semantic News Portal
http://now.ontotext.com

Start using GraphDB and text-mining with S4 in the cloud
http://s4.ontotext.com

Play with open data at http://data.ontotext.com