TrackademiX Insight Data Engineering Project
• Look at records in the ArXiv database
  – Correlate to information in additional commercial databases (NASA ADS/INSPIRE/PubMed, …)
• Necessary information available
  – Scattered (not in the same place/site)
  – Need to scrape: not intended to be “harvested”
• Merge disjoint databases, filter and visually display relevant information
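A minimal sketch of the merge step, assuming every source can be reduced to a list of dicts that share an `arxiv_id` key. The field names and the sample records are made up for illustration and are not taken from the project.

```python
from collections import defaultdict


def merge_records(*sources):
    """Merge record lists from several sources, keyed on a shared 'arxiv_id'."""
    merged = defaultdict(dict)
    for records in sources:
        for record in records:
            key = record.get("arxiv_id")
            if key is None:
                continue  # no shared identifier, so the record cannot be correlated
            for field, value in record.items():
                # Keep the first non-empty value seen for each field.
                if value not in (None, "", []) and field not in merged[key]:
                    merged[key][field] = value
    return dict(merged)


# Hypothetical sample data standing in for scraped records.
arxiv_records = [
    {"arxiv_id": "0000.00001", "title": "Example title",
     "authors": ["A. Author", "B. Author"]},
]
inspire_records = [
    {"arxiv_id": "0000.00001", "title": "", "citations": 12},
    {"arxiv_id": None, "citations": 3},
]

merged = merge_records(arxiv_records, inspire_records)
```

Earlier sources win on conflicting fields here; a real merge would likely need a per-field policy, for example preferring the source that specializes in citations.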
• Approach applicable to a wide class of real-world problems
Application Programming Interface
• author list / number of authors
• citations + dates / references + dates
• institutions / collaborators
• citation history
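As a rough illustration of the quantities listed above, the sketch below derives them from one merged record. The function name `summarize`, the record layout and the sample values are hypothetical; the slides do not show the actual API.

```python
from collections import Counter
from datetime import date


def summarize(record):
    """Return the per-paper quantities listed above for one merged record."""
    authors = list(record.get("authors", []))
    citation_dates = record.get("citation_dates", [])
    # Citation history: number of citations received per calendar year.
    history = Counter(d.year for d in citation_dates)
    return {
        "authors": authors,
        "n_authors": len(authors),
        "institutions": sorted(set(record.get("institutions", []))),
        "n_citations": len(citation_dates),
        "citation_history": dict(sorted(history.items())),
    }


# Hypothetical record, for illustration only.
sample_record = {
    "authors": ["A. Author", "B. Author"],
    "institutions": ["Institute B", "Institute A", "Institute A"],
    "citation_dates": [date(2014, 1, 5), date(2014, 6, 1), date(2015, 2, 2)],
}
summary = summarize(sample_record)
```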
DEMO!
PIPELINE
• Web-based databases: ArXiv, INSPIRE, NASA ADS
• Custom crawler
• Kafka
• Real time: Storm
• Batch: MR job
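The diagram labels suggest a flow in which the crawler publishes scraped records to Kafka, and the same stream is then consumed by a real-time path (Storm) and a batch path (MR job). That ordering is an assumption read off the labels. The sketch below only mimics the shape of such a flow with in-memory stand-ins; it uses no Kafka, Storm or MapReduce APIs, and all class and function names are hypothetical.

```python
from collections import Counter


class Topic:
    """In-memory stand-in for a message topic: an append-only log."""

    def __init__(self):
        self.log = []

    def publish(self, message):
        self.log.append(message)

    def read_from(self, offset):
        return self.log[offset:]


def crawl(pages, topic):
    """Stand-in for the custom crawler: publish each scraped record."""
    for page in pages:
        topic.publish(page)


class RealTimeCounter:
    """Real-time role: update counts incrementally as messages arrive."""

    def __init__(self):
        self.offset = 0
        self.counts = Counter()

    def poll(self, topic):
        for message in topic.read_from(self.offset):
            self.counts[message["arxiv_id"]] += 1
            self.offset += 1


def batch_count(topic):
    """Batch role: recompute the same counts from the full log (map, then reduce)."""
    mapped = ((message["arxiv_id"], 1) for message in topic.read_from(0))
    reduced = Counter()
    for key, value in mapped:
        reduced[key] += value
    return reduced


# Hypothetical citation events keyed by the cited paper's identifier.
topic = Topic()
realtime = RealTimeCounter()
crawl([{"arxiv_id": "0000.00001"}, {"arxiv_id": "0000.00002"}], topic)
realtime.poll(topic)
crawl([{"arxiv_id": "0000.00001"}], topic)
realtime.poll(topic)
batch_result = batch_count(topic)
```

Both paths read the same log, so the incremental counts and the recomputed batch counts agree once the real-time consumer has caught up.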