Diving in Panama Papers and Open Data to Discover Emerging News
Diving in Panama Papers and Open Data
Relation Discovery Case
May 2016
• Find suspicious relationships like:
− Company in USA controls
− Another company in USA
− Through a company in an off-shore zone
• Show news relevant to them
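The pattern above can be sketched as a two-hop search over control links. This is a minimal illustration on made-up data: the company names, the country codes, the OFFSHORE list and the function suspicious_chains are all hypothetical and are not taken from the Offshore Leaks DB.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy data: company -> country of registration.
COUNTRY: Dict[str, str] = {
    "AlphaCorp": "US",
    "BetaHoldings": "VG",  # treated as an off-shore zone in this sketch
    "GammaLLC": "US",
    "DeltaGmbH": "DE",
}

# Hypothetical direct control links: (controller, controlled).
CONTROLS: List[Tuple[str, str]] = [
    ("AlphaCorp", "BetaHoldings"),
    ("BetaHoldings", "GammaLLC"),
    ("AlphaCorp", "DeltaGmbH"),
]

# Assumed list of off-shore jurisdictions; a real list would be curated.
OFFSHORE: Set[str] = {"VG", "PA", "KY"}


def suspicious_chains(controls, country, offshore, home="US"):
    """Return (parent, intermediary, child) where parent and child are both
    registered in `home` and the intermediary is registered off-shore."""
    by_controller = {}
    for controller, controlled in controls:
        by_controller.setdefault(controller, []).append(controlled)

    found = []
    for parent, intermediaries in by_controller.items():
        if country.get(parent) != home:
            continue
        for mid in intermediaries:
            if country.get(mid) not in offshore:
                continue
            for child in by_controller.get(mid, []):
                if child != parent and country.get(child) == home:
                    found.append((parent, mid, child))
    return found


print(suspicious_chains(CONTROLS, COUNTRY, OFFSHORE))
```

Only the two-hop case is covered here; longer chains would need a path search.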
Presentation Outline
• Publishing Panama Papers DB as #LinkedLeaks
• Sample Queries
• FactForge-News open-data playground
• Next steps
Offshore Leaks Database from ICIJ
• Published by the International Consortium of Investigative Journalists (ICIJ) on 9 May 2016
• A “searchable database” of about 320 000 offshore companies
− 214 000 extracted from Panama Papers (valid until 2015)
− More than 100 000 from 2013 Offshore leaks investigation (valid until 2010)
• CSV extract from a graph database available for download
• https://offshoreleaks.icij.org/
Offshore Leaks DB as Linked Open Data
• Ontotext published the Offshore Leaks DB as Linked Open Data
• Available for exploration, querying and download at http://data.ontotext.com
• ONTOTEXT DISCLAIMERS: We use the data as is provided by ICIJ. We make no representations and warranties of any kind, including warranties of title, accuracy, absence of errors or fitness for particular purpose. All transformations, query results and derivative works are used only to showcase the service and technological capabilities and not to serve as basis for any statements or conclusions.
Enrichment and structuring of the data
• Relationship type hierarchy
− About 80 relationship types in the original dataset were organized into a property hierarchy
• Classification of officers into Person and Company
− In the original database there is no way to distinguish whether an officer is a physical person
• Mapping to DBpedia:
− 209 countries referred to in Offshore Leaks DB are mapped to DBpedia
− About 3000 companies and 300 persons mapped to DBpedia
• Overall size of the repository: 22M statements (20M explicit)
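The slides do not say how officers were classified into Person and Company. As a purely illustrative sketch, one could assume a naive heuristic that looks for legal-form markers in the officer name. The marker list, the sample names and the function classify_officer are hypothetical and are not the method used for #LinkedLeaks.

```python
# Assumed legal-form markers; illustrative only, far from complete.
COMPANY_MARKERS = {
    "ltd", "limited", "inc", "corp", "corporation", "llc",
    "gmbh", "sa", "holdings", "trust", "foundation",
}


def classify_officer(name: str) -> str:
    """Guess 'Company' if any token of the name is a legal-form marker,
    otherwise guess 'Person'."""
    tokens = {
        token.replace(".", "").replace(",", "").lower()
        for token in name.split()
    }
    return "Company" if tokens & COMPANY_MARKERS else "Person"


for officer in ["Example Holdings Ltd.", "Jane Doe", "Sample Trading S.A."]:
    print(officer, "->", classify_officer(officer))
```

A heuristic like this misclassifies people whose names contain a marker word, so any real pipeline would need manual review or additional evidence.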
The RDF-ization Process
• Linked data variant produced without programming
− The raw CSV files are RDF-ized using TARQL, http://tarql.github.io/
− Data was further interlinked and enriched in GraphDB using SPARQL
• The process is documented in this README file
• All relevant artifacts are open-source, available at https://github.com/Ontotext-AD/leaks/
• The entire publishing and mapping took about 15 person-days!
− Including data.ontotext.com portal setup, promotion, documentation, etc.
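The actual conversion was done with TARQL, without programming. To illustrate what RDF-izing a CSV row means, here is a minimal Python sketch that turns rows into N-Triples lines. The column names, the sample rows and the http://example.org/leaks/ namespace are assumptions made for this example and do not reflect the real ICIJ files or the #LinkedLeaks vocabulary.

```python
import csv
import io

BASE = "http://example.org/leaks/"  # hypothetical namespace

# Hypothetical CSV extract; the real files have different columns.
SAMPLE_CSV = """node_id,name,jurisdiction
1001,Example Holdings Ltd.,PAN
1002,Sample Trading Inc.,BVI
"""


def literal(value: str) -> str:
    """Format a plain string literal with basic N-Triples escaping."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def rdfize(csv_text: str) -> list:
    """Emit two triples per row: a name literal and a jurisdiction link."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{BASE}entity/{row['node_id']}>"
        triples.append(f"{subject} <{BASE}name> {literal(row['name'])} .")
        triples.append(
            f"{subject} <{BASE}jurisdiction> "
            f"<{BASE}jurisdiction/{row['jurisdiction']}> ."
        )
    return triples


for line in rdfize(SAMPLE_CSV):
    print(line)
```

Each CSV row becomes one subject, and each column of interest becomes one predicate, which is the same idea a TARQL mapping expresses declaratively.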
Presentation Outline
• Publishing Panama Papers DB as #LinkedLeaks
• Sample Queries
• Integration with DBpedia & other data
• Next steps
Sample queries at http://data.ontotext.com
Q1: Countries by number of entities related to them
Q2: Country pairs by ownership statistics
Q3: Statistics by incorporation year
Q4: Officers and entities by number of capital relations
Q5: Countries in Eastern Europe by number of owners
Q6: Intermediaries in Asia by name
Q7: The best connected officers
Q8: Countries by number of Person and Company officers
Presentation Outline
• Publishing Panama Papers DB as #LinkedLeaks
• Sample Queries
• FactForge-News open data playground
• Next steps
Open Data & News Analytics
Our approach to Big Data
1. Integrate relevant data from many sources
− Build a Big Knowledge Graph from proprietary databases and taxonomies integrated with millions of facts of Linked Data
2. Infer new facts and unveil relationships
− Performing reasoning across data from different sources
3. Interlink text with big data
− Using text-mining to automatically discover references to concepts and entities
4. Use a NoSQL graph database for metadata management, querying and search
Mar 2016
Quick news-analytics case
• Our Dynamic Semantic Publishing platform already offers linking of text with big open data graphs
• One can navigate from text to concepts, and get trends, related entities and news
• Try it at http://now.ontotext.com
FF-NEWS: Data Integration and Loading
• DBpedia (the English version only): 496M statements
• Geonames (all geographic features on Earth): 150M statements
− owl:sameAs links between DBpedia and Geonames: 471K statements
• Company registry data (GLEI): 3M statements
• News metadata (from NOW): 128M statements
• Total size: 986M statements
− Mapped to FIBO; 667M explicit statements + 318M inferred statements
− RDFRank and geo-spatial indices enabled to allow for ranking and efficient geo-spatial constraints
Global Legal Entity Identifier (GLEI) data
• Global Markets Entity Identifier (GMEI) Utility data
− The Global Markets Entity Identifier (GMEI) utility is DTCC's legal entity identifier solution offered in collaboration with SWIFT
− We downloaded a data dump from https://www.gmeiutility.org/
• RDF-ized company records
− Fields: LEI#, legal name, ultimate parent, registered country
− 3M explicit statements for 211 thousand organizations
▪ For comparison, there are 490 000 organizations in DBpedia, and D&B covers more than 200 million
− 10 821 ultimate parent relationships and 1 632 ultimate parents
− About 2 800 organizations from the GLEI dump mapped to DBpedia
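The two figures for ultimate parents (number of relationships and number of distinct parents) are the kind of statistic that can be derived directly from the ultimate parent field. A minimal sketch, assuming records shaped like the fields listed above; the record values and the function parent_stats are hypothetical.

```python
from collections import Counter

# Hypothetical LEI-style records; identifiers and names are made up.
RECORDS = [
    {"lei": "LEI-A", "name": "Parent One", "ultimate_parent": None, "country": "US"},
    {"lei": "LEI-B", "name": "Child B", "ultimate_parent": "LEI-A", "country": "GB"},
    {"lei": "LEI-C", "name": "Child C", "ultimate_parent": "LEI-A", "country": "DE"},
    {"lei": "LEI-D", "name": "Parent Two", "ultimate_parent": None, "country": "FR"},
    {"lei": "LEI-E", "name": "Child E", "ultimate_parent": "LEI-D", "country": "FR"},
]


def parent_stats(records):
    """Return (number of ultimate parent relationships,
    number of distinct ultimate parents)."""
    children_per_parent = Counter(
        record["ultimate_parent"]
        for record in records
        if record["ultimate_parent"]
    )
    return sum(children_per_parent.values()), len(children_per_parent)


relationships, parents = parent_stats(RECORDS)
print(relationships, "relationships,", parents, "ultimate parents")
```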
Loading FIBO
• FIBO = Financial Industry Business Ontology
• We loaded FIBO Foundations and BE into GraphDB
− About 55 RDF files from the “foundations-14-11-30” and “business-entities-15-02-23” packages
• Reasoning switched to OWL 2 RL
− Loading takes 3-4 seconds
• Number of explicit statements: 5 433
• Number of total statements: 20 646
− Of which inferred and materialized: 15 213
Mapping FIBO to DBPedia
• We mapped FIBO to DBpedia Ontology
− Minimalistic approach: we mapped as much as we needed
dbo:Organization rdfs:subClassOf fibo-fnd-org-fm:FormalOrganization.
dbo:Company rdfs:subClassOf fibo-be-le-cb:Corporation.
dbo:Person rdfs:subClassOf fibo-fnd-aap-ppl:Person.
dbo:subsidiary rdfs:subPropertyOf fibo-fnd-rel-rel:controls.
• Methodological notes
− Note, fibo-fnd-rel-rel:controls is not transitive
− We mapped more specific DBpedia primitives to more general FIBO ones, so that data becomes “visible” through FIBO
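To show how the four mapping statements make DBpedia data “visible” through FIBO, here is a minimal forward-chaining sketch in Python. It applies only two RDFS-style rules (type propagation over rdfs:subClassOf and statement propagation over rdfs:subPropertyOf), which is a small subset of what an OWL 2 RL reasoner in GraphDB does. The instance data (ex:AcmeHolding, ex:AcmeSub) is hypothetical.

```python
# The mappings from the slide, as lookup tables.
SUBCLASS_OF = {
    "dbo:Organization": "fibo-fnd-org-fm:FormalOrganization",
    "dbo:Company": "fibo-be-le-cb:Corporation",
    "dbo:Person": "fibo-fnd-aap-ppl:Person",
}
SUBPROPERTY_OF = {
    "dbo:subsidiary": "fibo-fnd-rel-rel:controls",
}

# Hypothetical DBpedia-style instance data.
EXPLICIT = {
    ("ex:AcmeHolding", "rdf:type", "dbo:Company"),
    ("ex:AcmeHolding", "dbo:subsidiary", "ex:AcmeSub"),
}


def materialize(triples):
    """Repeat the two rules until no new statement is produced."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        for s, p, o in list(inferred):
            new = set()
            if p == "rdf:type" and o in SUBCLASS_OF:
                new.add((s, "rdf:type", SUBCLASS_OF[o]))
            if p in SUBPROPERTY_OF:
                new.add((s, SUBPROPERTY_OF[p], o))
            if not new <= inferred:
                inferred |= new
                changed = True
    return inferred


for triple in sorted(materialize(EXPLICIT) - EXPLICIT):
    print(triple)
```

A query written against fibo-fnd-rel-rel:controls now finds the dbo:subsidiary statement. Because controls is not transitive, nothing is inferred across chains of subsidiaries.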
Semantic Press-Clipping
• We can trace references to a specific company in the news
− This is pretty much standard; however, we can deal with syntactic variations in the names, because state-of-the-art Named Entity Recognition technology is used
− More importantly, we correctly distinguish which mention of “Paris” refers to which of the following: Paris (the capital of France), Paris in Texas, Paris Hilton or Paris (the Greek hero)
• We can trace and consolidate references to daughter companies
• We have a comprehensive industry classification
− The one from DBpedia, but refined to accommodate identifier variations and specialization (e.g. a company classified as dbr:Bank will also be considered classified as dbr:FinancialServices)
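Consolidating references to daughter companies can be sketched as adding the mention counts of directly controlled companies to the count of the controller. The company names, the counts and the function counts_including_controlled are hypothetical. Counting only direct control is an assumption made here; the slides do not describe the exact aggregation.

```python
# Hypothetical number of news items mentioning each company directly.
MENTIONS = {"ParentCo": 100, "SubsidiaryA": 40, "SubsidiaryB": 15, "OtherCo": 60}

# Hypothetical direct control links: (controller, controlled).
CONTROLS = [("ParentCo", "SubsidiaryA"), ("ParentCo", "SubsidiaryB")]


def counts_including_controlled(mentions, controls):
    """Own mentions plus mentions of directly controlled companies."""
    totals = dict(mentions)
    for controller, controlled in controls:
        totals[controller] = totals.get(controller, 0) + mentions.get(controlled, 0)
    return totals


totals = counts_including_controlled(MENTIONS, CONTROLS)
ranking = sorted(totals.items(), key=lambda item: item[1], reverse=True)
for rank, (company, count) in enumerate(ranking, start=1):
    print(rank, company, count)
```

Summing per-company counts double-counts a news item that mentions both a parent and its subsidiary; counting distinct news items per group would avoid that.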
Sample queries at http://ff-news.ontotext.com
F1: Big cities in Eastern Europe
F2: Airports near London
F3: People and organizations related to Google
F4: Top-level industries by number of companies
F5: Mentions in the news of an organization and its related entities
F7: Most popular companies per industry, including children
F8: Regional exposure of a company, normalized
FF-NEWS is still in beta testing! Not officially launched, but available to play with
News Popularity Ranking: Automotive
Rank  Company                    News #  Rank  Company incl. mentions of controlled  News #
1     General Motors             2722    1     General Motors                        4620
2     Tesla Motors               2346    2     Volkswagen Group                      3999
3     Volkswagen                 2299    3     Fiat Chrysler Automobiles             2658
4     Ford Motor Company         1934    4     Tesla Motors                          2370
5     Toyota                     1325    5     Ford Motor Company                    2125
6     Chevrolet                  1264    6     Toyota                                1656
7     Chrysler                   1054    7     Renault-Nissan Alliance               1332
8     Fiat Chrysler Automobiles  1011    8     Honda                                 864
9     Audi AG                    972     9     BMW                                   715
10    Honda                      717     10    Takata Corporation                    547
News Popularity: Finance
Rank  Company          News #  Rank  Company incl. mentions of controlled  News #
1     Bloomberg L.P.   3203    1     Intra Bank                            261667
2     Goldman Sachs    1992    2     Hinduja Bank (Switzerland)            49731
3     JP Morgan Chase  1712    3     China Merchants Bank                  38288
4     Wells Fargo      1688    4     Alphabet Inc.                         22601
5     Citigroup        1557    5     Capital Group Companies               4076
6     HSBC Holdings    1546    6     Bloomberg L.P.                        3611
7     Deutsche Bank    1414    7     Exor                                  2704
8     Bank of America  1335    8     Nasdaq, Inc.                          2082
9     Barclays         1260    9     JP Morgan Chase                       1972
10    UBS              694     10    Sentinel Capital Partners             1053
Note: Including investment funds, stock exchanges, agencies, etc.
News Popularity: Banking
Rank  Company          News #  Rank  Company incl. mentions of controlled  News #
1     Goldman Sachs    996     1     China Merchants Bank *                38288
2     JP Morgan Chase  856     2     JP Morgan Chase                       1972
3     HSBC Holdings    773     3     Goldman Sachs                         1030
4     Deutsche Bank    707     4     HSBC                                  966
5     Barclays         630     5     Bank of America                       771
6     Citigroup        519     6     Deutsche Bank                         742
7     Bank of America  445     7     Barclays                              681
8     Wells Fargo      422     8     Citigroup                             630
9     UBS              347     9     Wells Fargo                           428
10    Chase            126     10    UBS                                   347
Note: including investment funds, stock exchanges, agencies, etc.
#LinkedLeaks Mapping Queries
Number of entities mapped by type
Companies mapped by industry
Companies mapped in the Finance sector
Politicians mapped
Athletes mapped
Presentation Outline
• Publishing Panama Papers DB as #LinkedLeaks
• Sample Queries
• FactForge-News open data playground
• Next steps
Future Work
• Publish and interlink LEI data and other datasets
− More comprehensive mapping of LEI data to DBpedia
− Refine #LinkedLeaks, providing more structure; FIBO mapping
− Launch updated FactForge.net portal
• Relationship discovery work
− Ultimate parent and suspicious control pattern discovery
− Organizations, related in the news, but not in other datasets
• Partnership with commercial data providers
• Partnership with journalists and analysts
Wrap up
• We published Offshore Leaks DB as Linked Open Data
− It took us a few days after the release of the raw CSVs
− Mapping to DBpedia available
− Play with it! Take it!
• We allow multiple open datasets to be used for discovery
− It took just a few days to clean up DBpedia’s industry classifications and control relationships
− Several datasets accessible through Financial Industry Business Ontology (FIBO)
• Integrating more data sources is easy, e.g. GLEI and #LinkedLeaks
− We can integrate proprietary and 3rd party data within days or weeks
Thank you!
Experience the technology with NOW: Semantic News Portal http://now.ontotext.com
Start using GraphDB and text-mining with S4 in the cloud http://s4.ontotext.com
Play with open data at http://data.ontotext.com