ads users group meeting, 20-21 nov. 2019 ingest...

15
Ingest System Steve McDonald and the ADS Team ADS Users Group Meeting, 20-21 Nov. 2019

Upload: others

Post on 20-Aug-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: ADS Users Group Meeting, 20-21 Nov. 2019 Ingest Systemads.harvard.edu/adsug/2019/08-Ingest_System.pdf · 2019. 12. 2. · Python, Cron, Docker, RabbitMQ, Protobufs Mostly Reuses Classic

Ingest System

Steve McDonald and the ADS Team

ADS Users Group Meeting, 20-21 Nov. 2019

Page 2: ADS Users Group Meeting, 20-21 Nov. 2019 Ingest Systemads.harvard.edu/adsug/2019/08-Ingest_System.pdf · 2019. 12. 2. · Python, Cron, Docker, RabbitMQ, Protobufs Mostly Reuses Classic

Ingest System Overview

● 8 Individual Pipelines○ Bibliographic, Non-bibliographic○ Affiliation, Citation Capture, Fulltext, myADS, Orcid○ Master

● Each Sends Processed Data To Master Pipeline○ Master merges and sends to persistent stores

■ Solr, SQL Database, API Endpoint ● Tech Stack

○ Python, Cron, Docker, RabbitMQ, Protobufs● Mostly Reuses Classic Files

○ Needs to change!

2

Page 3: ADS Users Group Meeting, 20-21 Nov. 2019 Ingest Systemads.harvard.edu/adsug/2019/08-Ingest_System.pdf · 2019. 12. 2. · Python, Cron, Docker, RabbitMQ, Protobufs Mostly Reuses Classic

Ingest Operations Overview● 635,989 More Bibcodes● 8,676,103 More Citations● Persistent Vigilance

○ Immediate Action By Curators ○ Daily and Monthly Reports

● Number Of Failed Records In Daily Ingest○ None: 38%○ 1 to 5 Records: 42%○ These Are Conservative Estimates!○ Publisher And Non Curation Issues

3

Page 4: ADS Users Group Meeting, 20-21 Nov. 2019 Ingest Systemads.harvard.edu/adsug/2019/08-Ingest_System.pdf · 2019. 12. 2. · Python, Cron, Docker, RabbitMQ, Protobufs Mostly Reuses Classic

Affiliation Pipeline

● Affiliations Data Debuted @ 2019 Winter AAS● Pipeline In Daily Operation Since ~ Spring 2019

○ New records have aff_ids assigned daily○ Existing records are updated as the dictionary of aff/aff_id is

updated (~monthly)● Machine-Learning Based Curation

○ Curation of a learning model is ongoing (slow, mainly hand-work)○ Not yet in automated production, but can be used to assist

curation● Statistics: See CSG Presentation (Thursday) On Affiliations

4

Page 5: ADS Users Group Meeting, 20-21 Nov. 2019 Ingest Systemads.harvard.edu/adsug/2019/08-Ingest_System.pdf · 2019. 12. 2. · Python, Cron, Docker, RabbitMQ, Protobufs Mostly Reuses Classic

ORCID Pipeline

● Restructure To Improve Performance○ Reduced Pipeline’s API Search Calls

■ Requests from the ORCID pipeline were ~80% of search traffic■ Reduced to 1% of search traffic

● Changes To Support ORCID Microservice/UI Restructure ○ Functionality migrated from UI to ORCID microservice

5

Page 6: ADS Users Group Meeting, 20-21 Nov. 2019 Ingest Systemads.harvard.edu/adsug/2019/08-Ingest_System.pdf · 2019. 12. 2. · Python, Cron, Docker, RabbitMQ, Protobufs Mostly Reuses Classic

Fulltext Pipeline Overview

● Over 5 Million Articles With Fulltext○ Exponential growth

6

Page 7: ADS Users Group Meeting, 20-21 Nov. 2019 Ingest Systemads.harvard.edu/adsug/2019/08-Ingest_System.pdf · 2019. 12. 2. · Python, Cron, Docker, RabbitMQ, Protobufs Mostly Reuses Classic

Fulltext Pipeline Updates

● Upgraded To Python 3 Compatible External Libraries ● More Robust XML Parsing● Added Performance Monitoring

7

Page 8: ADS Users Group Meeting, 20-21 Nov. 2019 Ingest Systemads.harvard.edu/adsug/2019/08-Ingest_System.pdf · 2019. 12. 2. · Python, Cron, Docker, RabbitMQ, Protobufs Mostly Reuses Classic

Fulltext Pipeline Updates

● PDF Extraction Analysis○ Validated Current Technique

8

Page 9: ADS Users Group Meeting, 20-21 Nov. 2019 Ingest Systemads.harvard.edu/adsug/2019/08-Ingest_System.pdf · 2019. 12. 2. · Python, Cron, Docker, RabbitMQ, Protobufs Mostly Reuses Classic

Ingest System Future

● Reimplement Non-bibliographic Pipeline○ Non-metadata (reads, citations, metrics)○ Changes .4M to 2M records daily○ Investigations underway○ Behind schedule

● Create New Ingest System Plan○ Remove dependency on classic pipeline

■ Large effort○ Consider framework like Kafka

● Implement Plan○ Long running project

9

Page 10: ADS Users Group Meeting, 20-21 Nov. 2019 Ingest Systemads.harvard.edu/adsug/2019/08-Ingest_System.pdf · 2019. 12. 2. · Python, Cron, Docker, RabbitMQ, Protobufs Mostly Reuses Classic

Ingest System

Steve McDonald and the ADS Team

ADS Users Group Meeting, 20-21 Nov. 2019

Page 11: ADS Users Group Meeting, 20-21 Nov. 2019 Ingest Systemads.harvard.edu/adsug/2019/08-Ingest_System.pdf · 2019. 12. 2. · Python, Cron, Docker, RabbitMQ, Protobufs Mostly Reuses Classic

Data Timeliness

● Server Room Issues○ Twice cooling system failed○ Once network upgrade went poorly

● Nonbib Failed With Bad Import Once Every 2 months○ Delays data reaching prod servers by ~10 hours

● Changes To Data Links Field Not Automatically Detected● End To End System Performance● Misc

○ Replication fail, disks full

11

Page 12: ADS Users Group Meeting, 20-21 Nov. 2019 Ingest Systemads.harvard.edu/adsug/2019/08-Ingest_System.pdf · 2019. 12. 2. · Python, Cron, Docker, RabbitMQ, Protobufs Mostly Reuses Classic

Data Quality

● Generally Good● We Avoid Repeated Bugs

○ Root Cause Analysis● Any Outstanding Issues

○ ?

12

Page 13: ADS Users Group Meeting, 20-21 Nov. 2019 Ingest Systemads.harvard.edu/adsug/2019/08-Ingest_System.pdf · 2019. 12. 2. · Python, Cron, Docker, RabbitMQ, Protobufs Mostly Reuses Classic

Ingest System 2019 Highlights

● New Affiliation Pipeline● Enhancements To Orcid Pipeline● Improvements To Fulltext Pipeline● Bug Fixes To Improve Data Quality

13

Page 14: ADS Users Group Meeting, 20-21 Nov. 2019 Ingest Systemads.harvard.edu/adsug/2019/08-Ingest_System.pdf · 2019. 12. 2. · Python, Cron, Docker, RabbitMQ, Protobufs Mostly Reuses Classic

Fulltext Pipeline

File Formats

14

Page 15: ADS Users Group Meeting, 20-21 Nov. 2019 Ingest Systemads.harvard.edu/adsug/2019/08-Ingest_System.pdf · 2019. 12. 2. · Python, Cron, Docker, RabbitMQ, Protobufs Mostly Reuses Classic

Fulltext Pipeline

Extraction Analysis

15