![Page 1: Best Practices Data Engineering Tools & Insight Sriram ...bytes.usc.edu/cs585/f19_AGI1ml04Us/lectures/Guest/DEToolsEtc.pdf · Learn by collaborating, know all different ways a data](https://reader036.vdocuments.us/reader036/viewer/2022062505/5ec5cfcd4a29781b3c1abe82/html5/thumbnails/1.jpg)
Data Engineering Tools & Best Practices
Sriram Baskaran
Insight
![Page 2: Best Practices Data Engineering Tools & Insight Sriram ...bytes.usc.edu/cs585/f19_AGI1ml04Us/lectures/Guest/DEToolsEtc.pdf · Learn by collaborating, know all different ways a data](https://reader036.vdocuments.us/reader036/viewer/2022062505/5ec5cfcd4a29781b3c1abe82/html5/thumbnails/2.jpg)
Bachelors in CS (Grad 2013)
Machine Learning Engineer (2013-2016)
Insight (2018)
Masters in CS (Data Science) (Grad 2018)
Sriram Baskaran
Program Director, Data Engineer
linkedin.com/[email protected]
apply.insightdatascience.com
![Page 3: Best Practices Data Engineering Tools & Insight Sriram ...bytes.usc.edu/cs585/f19_AGI1ml04Us/lectures/Guest/DEToolsEtc.pdf · Learn by collaborating, know all different ways a data](https://reader036.vdocuments.us/reader036/viewer/2022062505/5ec5cfcd4a29781b3c1abe82/html5/thumbnails/3.jpg)
Some context
![Page 4: Best Practices Data Engineering Tools & Insight Sriram ...bytes.usc.edu/cs585/f19_AGI1ml04Us/lectures/Guest/DEToolsEtc.pdf · Learn by collaborating, know all different ways a data](https://reader036.vdocuments.us/reader036/viewer/2022062505/5ec5cfcd4a29781b3c1abe82/html5/thumbnails/4.jpg)
Let’s take an example

App / Backend

Restaurants

| id | rest_name | loc |
|---|---|---|
| 1 | Everest Momo | Sunnyvale |
| 2 | Cafe Centro | San Francisco |
| ... | ... | ... |

Customers

| id | user_name | user_base_loc |
|---|---|---|
| 101 | James | San Jose |
| 102 | Mark | San Francisco |
| ... | ... | ... |
![Page 5: Best Practices Data Engineering Tools & Insight Sriram ...bytes.usc.edu/cs585/f19_AGI1ml04Us/lectures/Guest/DEToolsEtc.pdf · Learn by collaborating, know all different ways a data](https://reader036.vdocuments.us/reader036/viewer/2022062505/5ec5cfcd4a29781b3c1abe82/html5/thumbnails/5.jpg)
Why Relational?
● Rows of my tables are accessed together.
  ○ Single row, all columns
  ○ All relational databases follow this pattern: Postgres, MySQL, Oracle
  ○ A huge amount of planning is required to design good schemas!
    ■ No flexibility for schema changes

Restaurants

| id | rest_name | loc |
|---|---|---|
| 1 | Everest Momo | Sunnyvale |
| 2 | Cafe Centro | San Francisco |
| ... | ... | ... |

Customers

| id | user_name | user_base_loc |
|---|---|---|
| 101 | James | San Jose |
| 102 | Mark | San Francisco |
| ... | ... | ... |

Reviews

| id | cust_id | rest_id | rating |
|---|---|---|---|
| 1001 | 101 | 1 | 3 |
| 1002 | 102 | 1 | 5 |
| ... | ... | ... | ... |
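As a rough sketch of the three tables on this slide, the snippet below builds them in an in-memory SQLite database and reads one restaurant the way a backend typically would: one row, all columns. SQLite is used only because it ships with Python; the column types and the helper name `get_restaurant` are assumptions, not part of the slides.

```python
import sqlite3

# In-memory database so the sketch leaves nothing behind.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE restaurants (id INTEGER PRIMARY KEY, rest_name TEXT, loc TEXT);
CREATE TABLE customers (id INTEGER PRIMARY KEY, user_name TEXT, user_base_loc TEXT);
CREATE TABLE reviews (
    id INTEGER PRIMARY KEY,
    cust_id INTEGER REFERENCES customers(id),
    rest_id INTEGER REFERENCES restaurants(id),
    rating INTEGER
);
INSERT INTO restaurants VALUES (1, 'Everest Momo', 'Sunnyvale'), (2, 'Cafe Centro', 'San Francisco');
INSERT INTO customers VALUES (101, 'James', 'San Jose'), (102, 'Mark', 'San Francisco');
INSERT INTO reviews VALUES (1001, 101, 1, 3), (1002, 102, 1, 5);
""")


def get_restaurant(rest_id):
    """Typical transactional access: fetch one row with all of its columns."""
    return conn.execute(
        "SELECT * FROM restaurants WHERE id = ?", (rest_id,)
    ).fetchone()


print(get_restaurant(1))
```

The primary-key lookup is the access pattern row-oriented storage is built for; changing the shape of these tables later means a schema migration.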
![Page 6: Best Practices Data Engineering Tools & Insight Sriram ...bytes.usc.edu/cs585/f19_AGI1ml04Us/lectures/Guest/DEToolsEtc.pdf · Learn by collaborating, know all different ways a data](https://reader036.vdocuments.us/reader036/viewer/2022062505/5ec5cfcd4a29781b3c1abe82/html5/thumbnails/6.jpg)
Backend Databases
● Mostly relational: Postgres and MySQL are popular.
● Based on relational algebra and Codd’s model! It’s important to know this!
● Things to know: SQL, ER modeling.
  ○ Crow’s foot notation
● Most of the data for your data pipelines starts here.
  ○ It is important to understand backend databases.
● Binary data such as images is stored separately.
  ○ Caching and Content Delivery Networks
![Page 7: Best Practices Data Engineering Tools & Insight Sriram ...bytes.usc.edu/cs585/f19_AGI1ml04Us/lectures/Guest/DEToolsEtc.pdf · Learn by collaborating, know all different ways a data](https://reader036.vdocuments.us/reader036/viewer/2022062505/5ec5cfcd4a29781b3c1abe82/html5/thumbnails/7.jpg)
Data Engineering starts here
![Page 8: Best Practices Data Engineering Tools & Insight Sriram ...bytes.usc.edu/cs585/f19_AGI1ml04Us/lectures/Guest/DEToolsEtc.pdf · Learn by collaborating, know all different ways a data](https://reader036.vdocuments.us/reader036/viewer/2022062505/5ec5cfcd4a29781b3c1abe82/html5/thumbnails/8.jpg)
Data engineering
● Extensions and analytics on backend databases.
● Building pipelines to move data from A to B.
● Ingest and store data in efficient storage systems.
● Ability to handle large-scale data processing.
● Automating a large part of ETL work.
![Page 9: Best Practices Data Engineering Tools & Insight Sriram ...bytes.usc.edu/cs585/f19_AGI1ml04Us/lectures/Guest/DEToolsEtc.pdf · Learn by collaborating, know all different ways a data](https://reader036.vdocuments.us/reader036/viewer/2022062505/5ec5cfcd4a29781b3c1abe82/html5/thumbnails/9.jpg)
Agenda
Storing / Ingesting Data
Processing Data
Visualizing Data
Scheduling and Monitoring!
![Page 10: Best Practices Data Engineering Tools & Insight Sriram ...bytes.usc.edu/cs585/f19_AGI1ml04Us/lectures/Guest/DEToolsEtc.pdf · Learn by collaborating, know all different ways a data](https://reader036.vdocuments.us/reader036/viewer/2022062505/5ec5cfcd4a29781b3c1abe82/html5/thumbnails/10.jpg)
Agenda - focus
Storing / Ingesting Data
Processing Data
Visualizing Data
Scheduling and Monitoring!
![Page 11: Best Practices Data Engineering Tools & Insight Sriram ...bytes.usc.edu/cs585/f19_AGI1ml04Us/lectures/Guest/DEToolsEtc.pdf · Learn by collaborating, know all different ways a data](https://reader036.vdocuments.us/reader036/viewer/2022062505/5ec5cfcd4a29781b3c1abe82/html5/thumbnails/11.jpg)
Storing Data
● Databases and storage systems are the most underrated tools.
● Processing hinges on good storage of data.
● Good storage removes additional transformations in the processing stage.

Restaurants

| id | rest_name | loc |
|---|---|---|
| 1 | Everest Momo | Sunnyvale |
| 2 | Cafe Centro | San Francisco |
| ... | ... | ... |

Customers

| id | user_name | user_base_loc |
|---|---|---|
| 101 | James | San Jose |
| 102 | Mark | San Francisco |
| ... | ... | ... |

Reviews

| id | cust_id | rest_id | rating |
|---|---|---|---|
| 1001 | 101 | 1 | 3 |
| 1002 | 102 | 1 | 5 |
| ... | ... | ... | ... |
![Page 12: Best Practices Data Engineering Tools & Insight Sriram ...bytes.usc.edu/cs585/f19_AGI1ml04Us/lectures/Guest/DEToolsEtc.pdf · Learn by collaborating, know all different ways a data](https://reader036.vdocuments.us/reader036/viewer/2022062505/5ec5cfcd4a29781b3c1abe82/html5/thumbnails/12.jpg)
Storing Data
● Databases and storage systems are the most underrated tools.
● Processing hinges on good storage of data.
● Good storage removes additional transformations in the processing stage.
Normalized: Restaurants, Customers, Ratings
Joins happen every time.
![Page 13: Best Practices Data Engineering Tools & Insight Sriram ...bytes.usc.edu/cs585/f19_AGI1ml04Us/lectures/Guest/DEToolsEtc.pdf · Learn by collaborating, know all different ways a data](https://reader036.vdocuments.us/reader036/viewer/2022062505/5ec5cfcd4a29781b3c1abe82/html5/thumbnails/13.jpg)
Storing Data
● Databases and storage systems are the most underrated tools.
● Processing hinges on good storage of data.
● Good storage removes additional transformations in the processing stage.
Denormalized: All Data
Star Schema (but prod is not optimized; let’s fix that in some time)
Joins don’t happen here
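To make the normalized versus denormalized contrast concrete, here is a minimal sketch using SQLite from Python. The wide table `reviews_wide` is a hypothetical name, and a single flat table is only one way to denormalize; a star schema would keep small dimension tables around a fact table. The point is that the join cost is paid once at load time instead of on every query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE restaurants (id INTEGER PRIMARY KEY, rest_name TEXT, loc TEXT);
CREATE TABLE customers (id INTEGER PRIMARY KEY, user_name TEXT, user_base_loc TEXT);
CREATE TABLE reviews (id INTEGER PRIMARY KEY, cust_id INTEGER, rest_id INTEGER, rating INTEGER);
INSERT INTO restaurants VALUES (1, 'Everest Momo', 'Sunnyvale'), (2, 'Cafe Centro', 'San Francisco');
INSERT INTO customers VALUES (101, 'James', 'San Jose'), (102, 'Mark', 'San Francisco');
INSERT INTO reviews VALUES (1001, 101, 1, 3), (1002, 102, 1, 5);

-- Denormalized copy: the joins run once, when the table is built.
CREATE TABLE reviews_wide AS
SELECT rv.id AS review_id, c.user_name, c.user_base_loc,
       r.rest_name, r.loc, rv.rating
FROM reviews rv
JOIN customers c ON c.id = rv.cust_id
JOIN restaurants r ON r.id = rv.rest_id;
""")

# Normalized: every analytical query repeats the join.
NORMALIZED = """
SELECT r.rest_name, AVG(rv.rating)
FROM reviews rv JOIN restaurants r ON r.id = rv.rest_id
GROUP BY r.rest_name
"""

# Denormalized: the same question with no join at query time.
DENORMALIZED = "SELECT rest_name, AVG(rating) FROM reviews_wide GROUP BY rest_name"

normalized_result = conn.execute(NORMALIZED).fetchall()
denormalized_result = conn.execute(DENORMALIZED).fetchall()
print(normalized_result, denormalized_result)
```

The trade-off is duplicated data: if a restaurant is renamed, every matching row in the wide table has to be rewritten.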
![Page 14: Best Practices Data Engineering Tools & Insight Sriram ...bytes.usc.edu/cs585/f19_AGI1ml04Us/lectures/Guest/DEToolsEtc.pdf · Learn by collaborating, know all different ways a data](https://reader036.vdocuments.us/reader036/viewer/2022062505/5ec5cfcd4a29781b3c1abe82/html5/thumbnails/14.jpg)
Storing Data
● Databases and storage systems are the most underrated tools.
● Processing hinges on good storage of data.
● Good storage removes additional transformations in the processing stage.
Denormalized: All Data
Load on the production database.
Joins don’t happen here
![Page 15: Best Practices Data Engineering Tools & Insight Sriram ...bytes.usc.edu/cs585/f19_AGI1ml04Us/lectures/Guest/DEToolsEtc.pdf · Learn by collaborating, know all different ways a data](https://reader036.vdocuments.us/reader036/viewer/2022062505/5ec5cfcd4a29781b3c1abe82/html5/thumbnails/15.jpg)
Build a warehouse that is independent of your prod database
Transactional Database → Some way to sync → Analytical Database
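The slide leaves "some way to sync" open. One simple possibility, sketched below under stated assumptions, is an incremental copy driven by a high-water mark: the warehouse remembers the largest `id` it has seen and pulls only newer rows. Two in-memory SQLite databases stand in for the transactional and analytical systems, and `sync_reviews` is a hypothetical helper, not a method the talk prescribes.

```python
import sqlite3

prod = sqlite3.connect(":memory:")       # stands in for the transactional database
warehouse = sqlite3.connect(":memory:")  # stands in for the analytical database

SCHEMA = (
    "CREATE TABLE reviews "
    "(id INTEGER PRIMARY KEY, cust_id INTEGER, rest_id INTEGER, rating INTEGER)"
)
prod.execute(SCHEMA)
warehouse.execute(SCHEMA)


def sync_reviews():
    """Copy rows whose id is above the warehouse's high-water mark."""
    (watermark,) = warehouse.execute(
        "SELECT COALESCE(MAX(id), 0) FROM reviews"
    ).fetchone()
    new_rows = prod.execute(
        "SELECT id, cust_id, rest_id, rating FROM reviews WHERE id > ? ORDER BY id",
        (watermark,),
    ).fetchall()
    warehouse.executemany("INSERT INTO reviews VALUES (?, ?, ?, ?)", new_rows)
    warehouse.commit()
    return len(new_rows)


prod.execute("INSERT INTO reviews VALUES (1001, 101, 1, 3)")
first = sync_reviews()    # copies the first review
prod.execute("INSERT INTO reviews VALUES (1002, 102, 1, 5)")
second = sync_reviews()   # copies only the new review
third = sync_reviews()    # nothing new to copy
print(first, second, third)
```

This only handles append-only data with increasing ids. Updates and deletes need something richer, such as change data capture or a message bus.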
![Page 16: Best Practices Data Engineering Tools & Insight Sriram ...bytes.usc.edu/cs585/f19_AGI1ml04Us/lectures/Guest/DEToolsEtc.pdf · Learn by collaborating, know all different ways a data](https://reader036.vdocuments.us/reader036/viewer/2022062505/5ec5cfcd4a29781b3c1abe82/html5/thumbnails/16.jpg)
What are our options?
● You will come across
  ○ Postgres
  ○ MySQL
  ○ Oracle
  ○ Druid
  ○ Redshift
  ○ Elasticsearch
  ○ Cassandra
  ○ Memcached
  ○ Redis
  ○ Dynamo
  ○ Couchbase
  ○ Flat-files (S3)
Pick a database after knowing the access patterns
![Page 17: Best Practices Data Engineering Tools & Insight Sriram ...bytes.usc.edu/cs585/f19_AGI1ml04Us/lectures/Guest/DEToolsEtc.pdf · Learn by collaborating, know all different ways a data](https://reader036.vdocuments.us/reader036/viewer/2022062505/5ec5cfcd4a29781b3c1abe82/html5/thumbnails/17.jpg)
Analytical in Relational
● OLAP is pretty powerful.
  ○ Use of ROLLUP and CUBE operations
  ○ Star schema and snowflake schema are pretty nice.
  ○ Examples: Postgres, Oracle, SQL Server, MySQL
● Good, but it will not scale well, mainly due to the way the data is stored.
● The schema is rigid, so changes are very hard.
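To show what a ROLLUP produces, the sketch below emulates it in plain Python rather than SQL. It counts reviews at three levels: per (loc, rest_name), per loc, and overall, which is what `GROUP BY ROLLUP (loc, rest_name)` returns in Postgres, with `None` standing in for SQL `NULL`. The Cafe Centro row is made up for illustration.

```python
from collections import defaultdict

# (loc, rest_name, rating) rows. The Cafe Centro review is invented for the example.
rows = [
    ("Sunnyvale", "Everest Momo", 3),
    ("Sunnyvale", "Everest Momo", 5),
    ("San Francisco", "Cafe Centro", 4),
]


def rollup_counts(rows):
    """Count reviews at each level of a ROLLUP over (loc, rest_name)."""
    totals = defaultdict(int)
    for loc, rest_name, _rating in rows:
        totals[(loc, rest_name)] += 1  # finest level
        totals[(loc, None)] += 1       # subtotal per location
        totals[(None, None)] += 1      # grand total
    return dict(totals)


counts = rollup_counts(rows)
for key, value in counts.items():
    print(key, value)
```

CUBE would add the remaining combination, a subtotal per `rest_name` across all locations.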
![Page 18: Best Practices Data Engineering Tools & Insight Sriram ...bytes.usc.edu/cs585/f19_AGI1ml04Us/lectures/Guest/DEToolsEtc.pdf · Learn by collaborating, know all different ways a data](https://reader036.vdocuments.us/reader036/viewer/2022062505/5ec5cfcd4a29781b3c1abe82/html5/thumbnails/18.jpg)
Groupings and Aggregations
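● Columnar
  ○ Druid
  ○ Redshift

Restaurants

| id | rest_name | loc |
|---|---|---|
| 1 | Everest Momo | Sunnyvale |
| 2 | Cafe Centro | San Francisco |
| ... | ... | ... |

Customers

| id | user_name | user_base_loc |
|---|---|---|
| 101 | James | San Jose |
| 102 | Mark | San Francisco |
| ... | ... | ... |

Reviews

| id | cust_id | rest_id | rating |
|---|---|---|---|
| 1001 | 101 | 1 | 3 |
| 1002 | 102 | 1 | 5 |
| ... | ... | ... | ... |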
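A toy illustration of why columnar layouts suit groupings and aggregations: the same Reviews rows are pivoted into one list per column, and the average rating then reads a single column. This is a sketch of the idea only; real columnar stores such as Druid and Redshift add compression, encoding, and distribution on top, and the helper names here are assumptions.

```python
# Row-oriented: each record is kept together (good for "single row, all columns").
rows = [
    {"id": 1001, "cust_id": 101, "rest_id": 1, "rating": 3},
    {"id": 1002, "cust_id": 102, "rest_id": 1, "rating": 5},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def average_rating(columns):
    """Aggregate over one column; the other columns are never read."""
    ratings = columns["rating"]
    return sum(ratings) / len(ratings)


columns = to_columns(rows)
print(columns["rating"], average_rating(columns))
```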
![Page 19: Best Practices Data Engineering Tools & Insight Sriram ...bytes.usc.edu/cs585/f19_AGI1ml04Us/lectures/Guest/DEToolsEtc.pdf · Learn by collaborating, know all different ways a data](https://reader036.vdocuments.us/reader036/viewer/2022062505/5ec5cfcd4a29781b3c1abe82/html5/thumbnails/19.jpg)
Search through unstructured text
● LIKE with % wildcards in SQL is not efficient.
  ○ SELECT * FROM reviews WHERE review_text LIKE '%great%'
  ○ SELECT * FROM reviews WHERE review_text LIKE 'Loved%'
● Indexing of unstructured text needs to be really good.
  ○ Elasticsearch
  ○ Solr
● E.g., searching the text in the review.
● Each of these tools uses a data structure called a “postings list” (an inverted index), which makes search faster.
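A minimal sketch of the postings-list idea: each token maps to the sorted list of review ids that contain it, so a query becomes a dictionary lookup plus a set intersection instead of a scan over every review. The review texts and function names are hypothetical, and real engines add tokenization rules, ranking, and compressed postings.

```python
import re
from collections import defaultdict

# Hypothetical review texts keyed by review id.
reviews = {
    1001: "Great momo, loved the chutney",
    1002: "Loved the coffee, great service",
    1003: "Too crowded",
}


def build_index(docs):
    """Map each lowercase token to a sorted postings list of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            index[token].add(doc_id)
    return {token: sorted(ids) for token, ids in index.items()}


def search(index, *terms):
    """Return ids of documents that contain every term (an AND query)."""
    postings = [set(index.get(term.lower(), [])) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []


index = build_index(reviews)
print(search(index, "great"))
print(search(index, "great", "coffee"))
```

This matches whole tokens, so it is not a drop-in replacement for `LIKE '%great%'`, which would also match a substring inside a longer word.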
![Page 20: Best Practices Data Engineering Tools & Insight Sriram ...bytes.usc.edu/cs585/f19_AGI1ml04Us/lectures/Guest/DEToolsEtc.pdf · Learn by collaborating, know all different ways a data](https://reader036.vdocuments.us/reader036/viewer/2022062505/5ec5cfcd4a29781b3c1abe82/html5/thumbnails/20.jpg)
Caching
● Temporary in-memory storage
  ○ Redis
  ○ Memcached
● Optimized for fast storage and retrieval. Key-value store (not a document store).
● Use reasonable keys so the hashing algorithm is not a bottleneck.
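The usual way to put a cache in front of a database is the cache-aside pattern: look in the cache first, fall back to the database on a miss, then store the result with an expiry. The sketch below uses a small in-process dictionary instead of Redis or Memcached so it runs anywhere. `TTLCache`, `load_restaurant_from_db`, and the key format are all assumptions made for illustration.

```python
import time


class TTLCache:
    """Tiny in-process key-value cache with per-entry expiry."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self._store[key]  # expired entries behave like a miss
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, self.clock() + self.ttl)


db_calls = []  # records every trip to the "database"


def load_restaurant_from_db(rest_id):
    """Stand-in for a slow database query."""
    db_calls.append(rest_id)
    return {"id": rest_id, "rest_name": "Everest Momo", "loc": "Sunnyvale"}


cache = TTLCache(ttl_seconds=60)


def get_restaurant(rest_id):
    key = f"restaurant:{rest_id}"  # short, predictable key
    cached = cache.get(key)
    if cached is not None:
        return cached
    value = load_restaurant_from_db(rest_id)
    cache.set(key, value)
    return value
```

Calling `get_restaurant(1)` twice within the TTL reaches the database only once; the second call is served from memory.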
![Page 21: Best Practices Data Engineering Tools & Insight Sriram ...bytes.usc.edu/cs585/f19_AGI1ml04Us/lectures/Guest/DEToolsEtc.pdf · Learn by collaborating, know all different ways a data](https://reader036.vdocuments.us/reader036/viewer/2022062505/5ec5cfcd4a29781b3c1abe82/html5/thumbnails/21.jpg)
How to pick one?
● Make educated and reasonable assumptions
  ○ Type of data
  ○ Access patterns
  ○ Scaling factor (most databases are designed to scale in their “domain”)
● Read a lot, and never stop reading.
● Use it in a project
  ○ There are hundreds of large open datasets available.
  ○ Start with GDELT (https://www.gdeltproject.org/data.html)
![Page 22: Best Practices Data Engineering Tools & Insight Sriram ...bytes.usc.edu/cs585/f19_AGI1ml04Us/lectures/Guest/DEToolsEtc.pdf · Learn by collaborating, know all different ways a data](https://reader036.vdocuments.us/reader036/viewer/2022062505/5ec5cfcd4a29781b3c1abe82/html5/thumbnails/22.jpg)
Complexities of communication
● The more tools you have, the harder it is to communicate between them.
● Keeping databases in sync is one of the main challenges in the industry.
● Kafka may be a solution
  ○ Acts as a message bus
  ○ Use Kafka Connect to bridge systems
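The message-bus idea can be sketched without Kafka itself: producers append to a per-topic log, and each consumer keeps its own offset, so several downstream systems read the same events independently. The class below is a toy in-memory stand-in. It is not Kafka's API, and the names `MessageBus`, `publish`, and `poll` are assumptions.

```python
from collections import defaultdict


class MessageBus:
    """Toy in-memory bus: one append-only log per topic, one offset per consumer."""

    def __init__(self):
        self._logs = defaultdict(list)
        self._offsets = defaultdict(int)

    def publish(self, topic, message):
        self._logs[topic].append(message)

    def poll(self, topic, consumer):
        """Return messages this consumer has not seen yet and advance its offset."""
        log = self._logs[topic]
        start = self._offsets[(topic, consumer)]
        self._offsets[(topic, consumer)] = len(log)
        return log[start:]


bus = MessageBus()
bus.publish("reviews", {"id": 1001, "rating": 3})
warehouse_batch = bus.poll("reviews", "warehouse")    # sees review 1001
bus.publish("reviews", {"id": 1002, "rating": 5})
search_batch = bus.poll("reviews", "search-index")    # sees both reviews
warehouse_next = bus.poll("reviews", "warehouse")     # sees only review 1002
print(warehouse_batch, search_batch, warehouse_next)
```

Because the log is shared and offsets are per consumer, adding another database to keep in sync means adding another consumer, not another point-to-point connection.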
![Page 23: Best Practices Data Engineering Tools & Insight Sriram ...bytes.usc.edu/cs585/f19_AGI1ml04Us/lectures/Guest/DEToolsEtc.pdf · Learn by collaborating, know all different ways a data](https://reader036.vdocuments.us/reader036/viewer/2022062505/5ec5cfcd4a29781b3c1abe82/html5/thumbnails/23.jpg)
Remember our Denormalized issue?
Denormalized: All Data
Star Schema (but prod is not optimized; let’s fix that in some time)
Joins don’t happen here
![Page 24: Best Practices Data Engineering Tools & Insight Sriram ...bytes.usc.edu/cs585/f19_AGI1ml04Us/lectures/Guest/DEToolsEtc.pdf · Learn by collaborating, know all different ways a data](https://reader036.vdocuments.us/reader036/viewer/2022062505/5ec5cfcd4a29781b3c1abe82/html5/thumbnails/24.jpg)
Remember our Denormalized issue?
App / Backend
![Page 25: Best Practices Data Engineering Tools & Insight Sriram ...bytes.usc.edu/cs585/f19_AGI1ml04Us/lectures/Guest/DEToolsEtc.pdf · Learn by collaborating, know all different ways a data](https://reader036.vdocuments.us/reader036/viewer/2022062505/5ec5cfcd4a29781b3c1abe82/html5/thumbnails/25.jpg)
Agenda - for completion
Storing / Ingesting Data
Processing Data
Visualizing Data
Scheduling and Monitoring!
![Page 26: Best Practices Data Engineering Tools & Insight Sriram ...bytes.usc.edu/cs585/f19_AGI1ml04Us/lectures/Guest/DEToolsEtc.pdf · Learn by collaborating, know all different ways a data](https://reader036.vdocuments.us/reader036/viewer/2022062505/5ec5cfcd4a29781b3c1abe82/html5/thumbnails/26.jpg)
We are talking about scale!
● Tackling two problems: time and space
  ○ Data size is greater than the size of your “main memory”
  ○ Data cannot fit entirely.
  ○ It takes too long to compute.
● Distributed computing is a popular solution
  ○ Hadoop, Spark, Presto, Hive
  ○ Kafka is gaining popularity in processing too
● Example: Scrape menu items for each restaurant
  ○ Go to each restaurant’s website
  ○ Scrape it
  ○ Parse the website
  ○ Find the menu content and process it.
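The scraping example parallelizes well because each URL is independent. The sketch below shows the fetch, parse, and collect steps spread over a thread pool on one machine; frameworks such as Spark apply the same split across many machines. No network access is used: `FAKE_WEB`, the URLs, and the toy HTML are invented, and `fetch` is a stand-in for a real HTTP client.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Hypothetical pages standing in for real restaurant websites (no network access).
FAKE_WEB = {
    "https://example.com/everest-momo": "<h1>Menu</h1><li>Veg Momo</li><li>Chicken Momo</li>",
    "https://example.com/cafe-centro": "<h1>Menu</h1><li>Espresso</li>",
}


def fetch(url):
    """Stand-in for an HTTP GET; a real job would call an HTTP client here."""
    return FAKE_WEB[url]


def parse_menu(html):
    """Pull menu items out of <li> tags (a regex is enough for this toy HTML)."""
    return re.findall(r"<li>(.*?)</li>", html)


def scrape(url):
    return url, parse_menu(fetch(url))


def scrape_all(urls, workers=4):
    # Each URL is independent, so the work can be spread across workers.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(scrape, urls))


menus = scrape_all(list(FAKE_WEB))
print(menus)
```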
![Page 27: Best Practices Data Engineering Tools & Insight Sriram ...bytes.usc.edu/cs585/f19_AGI1ml04Us/lectures/Guest/DEToolsEtc.pdf · Learn by collaborating, know all different ways a data](https://reader036.vdocuments.us/reader036/viewer/2022062505/5ec5cfcd4a29781b3c1abe82/html5/thumbnails/27.jpg)
Yelp - update menu items
Yelp’s Database
1. Get URL
2. Get actual content from internet
3. Process text and store results
Postgres
![Page 28: Best Practices Data Engineering Tools & Insight Sriram ...bytes.usc.edu/cs585/f19_AGI1ml04Us/lectures/Guest/DEToolsEtc.pdf · Learn by collaborating, know all different ways a data](https://reader036.vdocuments.us/reader036/viewer/2022062505/5ec5cfcd4a29781b3c1abe82/html5/thumbnails/28.jpg)
Yelp - update menu items - 1 million urls!
1. Custom way to get URLs
2. Each script accesses the content separately
3. Each script processes text and stores results
Yelp’s Database
![Page 29: Best Practices Data Engineering Tools & Insight Sriram ...bytes.usc.edu/cs585/f19_AGI1ml04Us/lectures/Guest/DEToolsEtc.pdf · Learn by collaborating, know all different ways a data](https://reader036.vdocuments.us/reader036/viewer/2022062505/5ec5cfcd4a29781b3c1abe82/html5/thumbnails/29.jpg)
Yelp - update menu items - 1 million urls!
Yelp’s Database
![Page 31: Best Practices Data Engineering Tools & Insight Sriram ...bytes.usc.edu/cs585/f19_AGI1ml04Us/lectures/Guest/DEToolsEtc.pdf · Learn by collaborating, know all different ways a data](https://reader036.vdocuments.us/reader036/viewer/2022062505/5ec5cfcd4a29781b3c1abe82/html5/thumbnails/31.jpg)
Yelp - update menu items - 1 million urls!
Yelp’s Database
or
![Page 32: Best Practices Data Engineering Tools & Insight Sriram ...bytes.usc.edu/cs585/f19_AGI1ml04Us/lectures/Guest/DEToolsEtc.pdf · Learn by collaborating, know all different ways a data](https://reader036.vdocuments.us/reader036/viewer/2022062505/5ec5cfcd4a29781b3c1abe82/html5/thumbnails/32.jpg)
ML Training at Scale
● Use distributed computing to scale your training.
● Compute weights in a fast and efficient manner.
  ○ Sparkling Water wrapper: https://github.com/h2oai/sparkling-water
  ○ H2O
![Page 33: Best Practices Data Engineering Tools & Insight Sriram ...bytes.usc.edu/cs585/f19_AGI1ml04Us/lectures/Guest/DEToolsEtc.pdf · Learn by collaborating, know all different ways a data](https://reader036.vdocuments.us/reader036/viewer/2022062505/5ec5cfcd4a29781b3c1abe82/html5/thumbnails/33.jpg)
What about Speed/Velocity?
● Data can be an unbounded stream of information.
● Example: processing reviews for each restaurant, e.g. running POS tagging on them.

Stream: ... r50, r52, r53, ...

Reviews

| id | cust_id | rest_id | rating |
|---|---|---|---|
| 1001 | 101 | 1 | 3 |
| 1002 | 102 | 1 | 5 |
| ... | ... | ... | ... |
Batch Processing
POS Tagging Model
![Page 34: Best Practices Data Engineering Tools & Insight Sriram ...bytes.usc.edu/cs585/f19_AGI1ml04Us/lectures/Guest/DEToolsEtc.pdf · Learn by collaborating, know all different ways a data](https://reader036.vdocuments.us/reader036/viewer/2022062505/5ec5cfcd4a29781b3c1abe82/html5/thumbnails/34.jpg)
What about Speed/Velocity?
● Data can be an unbounded stream of information.
● Need a robust system.
● Example: processing reviews

Stream: ... r50, r52, r53, ...

Spark Streaming (Micro-batches)

Reviews

| id | cust_id | rest_id | rating |
|---|---|---|---|
| 1001 | 101 | 1 | 3 |
| 1002 | 102 | 1 | 5 |
| ... | ... | ... | ... |
POS Tagging Model
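Micro-batching can be sketched in a few lines: cut an unbounded stream into small lists and hand each list to the model. The version below groups by count to stay simple, whereas Spark Streaming cuts batches by time interval. `tag_batch` is a placeholder that only counts words, since a real POS tagger is outside the scope of the slides, and the review texts are made up.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Group a possibly unbounded iterator into lists of at most batch_size items."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def tag_batch(batch):
    """Placeholder for the POS tagging model: it only counts words per review."""
    return [(review_id, len(text.split())) for review_id, text in batch]


# Hypothetical review events arriving on the stream.
reviews = [
    ("r50", "Great momo"),
    ("r52", "Loved the coffee"),
    ("r53", "Too crowded today"),
]
results = [tag_batch(batch) for batch in micro_batches(reviews, batch_size=2)]
print(results)
```

Because `micro_batches` is a generator, it never needs the whole stream in memory and keeps working when the input has no end.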
![Page 35: Best Practices Data Engineering Tools & Insight Sriram ...bytes.usc.edu/cs585/f19_AGI1ml04Us/lectures/Guest/DEToolsEtc.pdf · Learn by collaborating, know all different ways a data](https://reader036.vdocuments.us/reader036/viewer/2022062505/5ec5cfcd4a29781b3c1abe82/html5/thumbnails/35.jpg)
Agenda - for completion
Storing / Ingesting Data
Processing Data
Visualizing Data
Scheduling and Monitoring!
![Page 36: Best Practices Data Engineering Tools & Insight Sriram ...bytes.usc.edu/cs585/f19_AGI1ml04Us/lectures/Guest/DEToolsEtc.pdf · Learn by collaborating, know all different ways a data](https://reader036.vdocuments.us/reader036/viewer/2022062505/5ec5cfcd4a29781b3c1abe82/html5/thumbnails/36.jpg)
Visualize the output data
● It’s like building a software application
  ○ Consider end-users
  ○ What is the most intuitive way to see this information?
● Your professor would have given even better examples.
● Do not reinvent the wheel
  ○ Tableau (education edition)
  ○ Kibana (self-setup)
  ○ Mode (paid)
  ○ Looker (paid)
  ○ Plotly (open source, free)
  ○ Dash (abstraction around Plotly, free)
  ○ MATLAB (not used much in industry)
If you are not able to show it in a good way, there was no need to process it!
![Page 37: Best Practices Data Engineering Tools & Insight Sriram ...bytes.usc.edu/cs585/f19_AGI1ml04Us/lectures/Guest/DEToolsEtc.pdf · Learn by collaborating, know all different ways a data](https://reader036.vdocuments.us/reader036/viewer/2022062505/5ec5cfcd4a29781b3c1abe82/html5/thumbnails/37.jpg)
Agenda - for completion
Storing / Ingesting Data
Processing Data
Visualizing Data
Scheduling and Monitoring!
![Page 38: Best Practices Data Engineering Tools & Insight Sriram ...bytes.usc.edu/cs585/f19_AGI1ml04Us/lectures/Guest/DEToolsEtc.pdf · Learn by collaborating, know all different ways a data](https://reader036.vdocuments.us/reader036/viewer/2022062505/5ec5cfcd4a29781b3c1abe82/html5/thumbnails/38.jpg)
Putting together a pipeline
Transactional
App / Backend
![Page 41: Best Practices Data Engineering Tools & Insight Sriram ...bytes.usc.edu/cs585/f19_AGI1ml04Us/lectures/Guest/DEToolsEtc.pdf · Learn by collaborating, know all different ways a data](https://reader036.vdocuments.us/reader036/viewer/2022062505/5ec5cfcd4a29781b3c1abe82/html5/thumbnails/41.jpg)
Putting together a pipeline
Transactional
App / Backend
POS Tagging Model
![Page 42: Best Practices Data Engineering Tools & Insight Sriram ...bytes.usc.edu/cs585/f19_AGI1ml04Us/lectures/Guest/DEToolsEtc.pdf · Learn by collaborating, know all different ways a data](https://reader036.vdocuments.us/reader036/viewer/2022062505/5ec5cfcd4a29781b3c1abe82/html5/thumbnails/42.jpg)
Putting together a pipeline
Transactional
App / Backend
Event Store
POS Tagging Model
![Page 43: Best Practices Data Engineering Tools & Insight Sriram ...bytes.usc.edu/cs585/f19_AGI1ml04Us/lectures/Guest/DEToolsEtc.pdf · Learn by collaborating, know all different ways a data](https://reader036.vdocuments.us/reader036/viewer/2022062505/5ec5cfcd4a29781b3c1abe82/html5/thumbnails/43.jpg)
Putting together a pipeline
Transactional
App / Backend
Event Store
Spark Streaming (Micro-batches)
POS Tagging Model
![Page 46: Best Practices Data Engineering Tools & Insight Sriram ...bytes.usc.edu/cs585/f19_AGI1ml04Us/lectures/Guest/DEToolsEtc.pdf · Learn by collaborating, know all different ways a data](https://reader036.vdocuments.us/reader036/viewer/2022062505/5ec5cfcd4a29781b3c1abe82/html5/thumbnails/46.jpg)
How to automate the tasks?
![Page 47: Best Practices Data Engineering Tools & Insight Sriram ...bytes.usc.edu/cs585/f19_AGI1ml04Us/lectures/Guest/DEToolsEtc.pdf · Learn by collaborating, know all different ways a data](https://reader036.vdocuments.us/reader036/viewer/2022062505/5ec5cfcd4a29781b3c1abe82/html5/thumbnails/47.jpg)
Scheduling & Monitoring
● Scheduling tasks in a sequence
● Easy to specify dependencies
● Code-based configuration
● Easy to deploy and manage
● Every batch pipeline needs a scheduler to automate tasks.
● Handling failure
● Also allows backfill.
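Several of these properties fit in a small sketch: dependencies declared in code, tasks run in dependency order, and a failed task retried before the pipeline gives up. The example uses `graphlib.TopologicalSorter` from the Python standard library (3.9+). `run_pipeline`, the task names, and the retry policy are assumptions, and production schedulers add timing, distribution, and monitoring on top.

```python
from graphlib import TopologicalSorter


def run_pipeline(tasks, dependencies, max_retries=1):
    """Run tasks in dependency order, retrying a failed task up to max_retries times.

    tasks: name -> zero-argument callable
    dependencies: name -> set of task names that must finish first
    """
    order = list(TopologicalSorter(dependencies).static_order())
    log = []
    for name in order:
        for _attempt in range(max_retries + 1):
            try:
                tasks[name]()
                log.append((name, "success"))
                break
            except Exception:
                log.append((name, "failed"))
        else:
            # Every attempt failed: stop so downstream tasks never see bad input.
            raise RuntimeError(f"task {name!r} failed after {max_retries + 1} attempts")
    return log


attempts = {"transform": 0}


def flaky_transform():
    """Fails on the first call only, to exercise the retry path."""
    attempts["transform"] += 1
    if attempts["transform"] == 1:
        raise ValueError("transient failure")


tasks = {"extract": lambda: None, "transform": flaky_transform, "load": lambda: None}
dependencies = {"extract": set(), "transform": {"extract"}, "load": {"transform"}}
log = run_pipeline(tasks, dependencies)
print(log)
```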
![Page 48: Best Practices Data Engineering Tools & Insight Sriram ...bytes.usc.edu/cs585/f19_AGI1ml04Us/lectures/Guest/DEToolsEtc.pdf · Learn by collaborating, know all different ways a data](https://reader036.vdocuments.us/reader036/viewer/2022062505/5ec5cfcd4a29781b3c1abe82/html5/thumbnails/48.jpg)
Backfill
[Figure: timeline of events in time, with a gap marked "??"]
![Page 49: Best Practices Data Engineering Tools & Insight Sriram ...bytes.usc.edu/cs585/f19_AGI1ml04Us/lectures/Guest/DEToolsEtc.pdf · Learn by collaborating, know all different ways a data](https://reader036.vdocuments.us/reader036/viewer/2022062505/5ec5cfcd4a29781b3c1abe82/html5/thumbnails/49.jpg)
Backfill
[Figure: timeline of events in time, with the gap labeled "Backfill"]
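A backfill can be read as "find the time slots with no output and re-run the job for each of them." The sketch below assumes a daily job and a set of days that already have output; the dates and the helper names `missing_days` and `backfill` are hypothetical.

```python
from datetime import date, timedelta


def missing_days(start, end, processed):
    """Return every day in [start, end] that has no processed output yet."""
    total = (end - start).days + 1
    candidates = (start + timedelta(days=i) for i in range(total))
    return [day for day in candidates if day not in processed]


def backfill(start, end, processed, run_for_day):
    """Re-run the daily job for each missing day, oldest first."""
    filled = []
    for day in missing_days(start, end, processed):
        run_for_day(day)
        processed.add(day)
        filled.append(day)
    return filled


# Hypothetical state: the job ran on Nov 1 and Nov 4 but missed the days between.
processed = {date(2019, 11, 1), date(2019, 11, 4)}
ran = []
filled = backfill(date(2019, 11, 1), date(2019, 11, 4), processed, ran.append)
print(filled)
```

For this to be safe, the daily job should be idempotent: running it again for the same day must overwrite that day's output rather than duplicate it.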
![Page 50: Best Practices Data Engineering Tools & Insight Sriram ...bytes.usc.edu/cs585/f19_AGI1ml04Us/lectures/Guest/DEToolsEtc.pdf · Learn by collaborating, know all different ways a data](https://reader036.vdocuments.us/reader036/viewer/2022062505/5ec5cfcd4a29781b3c1abe82/html5/thumbnails/50.jpg)
![Page 51: Best Practices Data Engineering Tools & Insight Sriram ...bytes.usc.edu/cs585/f19_AGI1ml04Us/lectures/Guest/DEToolsEtc.pdf · Learn by collaborating, know all different ways a data](https://reader036.vdocuments.us/reader036/viewer/2022062505/5ec5cfcd4a29781b3c1abe82/html5/thumbnails/51.jpg)
Think ahead, Think smart
● Get all data into one place (know about data warehousing)
● Understand the why behind any tool choice
● Expect future requests from stakeholders
● Learn by collaborating; know all the different ways data can be stored, processed, and visualized.
● Constantly learn; know the latest updates in a tool
  ○ Start with the basics of why the tool was built
● Learn these five: Kafka, Spark, Cassandra, Postgres (PostGIS), Redshift
● Managed: Lambdas, Redshift, Dynamo, S3
![Page 52: Best Practices Data Engineering Tools & Insight Sriram ...bytes.usc.edu/cs585/f19_AGI1ml04Us/lectures/Guest/DEToolsEtc.pdf · Learn by collaborating, know all different ways a data](https://reader036.vdocuments.us/reader036/viewer/2022062505/5ec5cfcd4a29781b3c1abe82/html5/thumbnails/52.jpg)
Start using cloud resources
● Students get $300 in credits in both AWS and GCP. Start using them.
● Spin up compute resources
● Try out labs for managed services.
● AWS for Students
  ○ AWS Lambdas
  ○ AWS Redshift
  ○ AWS Dynamo
![Page 53: Best Practices Data Engineering Tools & Insight Sriram ...bytes.usc.edu/cs585/f19_AGI1ml04Us/lectures/Guest/DEToolsEtc.pdf · Learn by collaborating, know all different ways a data](https://reader036.vdocuments.us/reader036/viewer/2022062505/5ec5cfcd4a29781b3c1abe82/html5/thumbnails/53.jpg)
More resources
● Data Engineering Tools (Visualized)
● Rise of a Data Engineer
● Preparing for Transition into a Data Engineer
● What’s Parquet?
● More blogs on Insight
Or!
![Page 54: Best Practices Data Engineering Tools & Insight Sriram ...bytes.usc.edu/cs585/f19_AGI1ml04Us/lectures/Guest/DEToolsEtc.pdf · Learn by collaborating, know all different ways a data](https://reader036.vdocuments.us/reader036/viewer/2022062505/5ec5cfcd4a29781b3c1abe82/html5/thumbnails/54.jpg)
Insight
![Page 55: Best Practices Data Engineering Tools & Insight Sriram ...bytes.usc.edu/cs585/f19_AGI1ml04Us/lectures/Guest/DEToolsEtc.pdf · Learn by collaborating, know all different ways a data](https://reader036.vdocuments.us/reader036/viewer/2022062505/5ec5cfcd4a29781b3c1abe82/html5/thumbnails/55.jpg)
![Page 56: Best Practices Data Engineering Tools & Insight Sriram ...bytes.usc.edu/cs585/f19_AGI1ml04Us/lectures/Guest/DEToolsEtc.pdf · Learn by collaborating, know all different ways a data](https://reader036.vdocuments.us/reader036/viewer/2022062505/5ec5cfcd4a29781b3c1abe82/html5/thumbnails/56.jpg)
![Page 57: Best Practices Data Engineering Tools & Insight Sriram ...bytes.usc.edu/cs585/f19_AGI1ml04Us/lectures/Guest/DEToolsEtc.pdf · Learn by collaborating, know all different ways a data](https://reader036.vdocuments.us/reader036/viewer/2022062505/5ec5cfcd4a29781b3c1abe82/html5/thumbnails/57.jpg)
![Page 58: Best Practices Data Engineering Tools & Insight Sriram ...bytes.usc.edu/cs585/f19_AGI1ml04Us/lectures/Guest/DEToolsEtc.pdf · Learn by collaborating, know all different ways a data](https://reader036.vdocuments.us/reader036/viewer/2022062505/5ec5cfcd4a29781b3c1abe82/html5/thumbnails/58.jpg)
Insight Offerings - Which one to pick
Data Science Program
● PhD in quantitative fields.
● Have worked in analysing data.
● Good problem solving skills
Data Engineering Program
● Engineering background.
● Built and maintained engineering systems.
● Java/Python
Health Data Science Program
● Postdoctoral researcher, medical doctors
● Interested in genome sequences, clinical trials.
Artificial Intelligence Program
● Engineering background.
● Have worked on training and deploying ML or NN.
DevOps Engineering Program
● Systems admin and Linux background.
● Problem solver, critical thinker.
● Can understand containerized systems.
![Page 59: Best Practices Data Engineering Tools & Insight Sriram ...bytes.usc.edu/cs585/f19_AGI1ml04Us/lectures/Guest/DEToolsEtc.pdf · Learn by collaborating, know all different ways a data](https://reader036.vdocuments.us/reader036/viewer/2022062505/5ec5cfcd4a29781b3c1abe82/html5/thumbnails/59.jpg)
New Programs - More focused domains
● Designing security measures
● Building secure applications.
● Blockchain technology
● Smart contract management
● Decentralized architectures
![Page 60: Best Practices Data Engineering Tools & Insight Sriram ...bytes.usc.edu/cs585/f19_AGI1ml04Us/lectures/Guest/DEToolsEtc.pdf · Learn by collaborating, know all different ways a data](https://reader036.vdocuments.us/reader036/viewer/2022062505/5ec5cfcd4a29781b3c1abe82/html5/thumbnails/60.jpg)
![Page 61: Best Practices Data Engineering Tools & Insight Sriram ...bytes.usc.edu/cs585/f19_AGI1ml04Us/lectures/Guest/DEToolsEtc.pdf · Learn by collaborating, know all different ways a data](https://reader036.vdocuments.us/reader036/viewer/2022062505/5ec5cfcd4a29781b3c1abe82/html5/thumbnails/61.jpg)
![Page 62: Best Practices Data Engineering Tools & Insight Sriram ...bytes.usc.edu/cs585/f19_AGI1ml04Us/lectures/Guest/DEToolsEtc.pdf · Learn by collaborating, know all different ways a data](https://reader036.vdocuments.us/reader036/viewer/2022062505/5ec5cfcd4a29781b3c1abe82/html5/thumbnails/62.jpg)
![Page 63: Best Practices Data Engineering Tools & Insight Sriram ...bytes.usc.edu/cs585/f19_AGI1ml04Us/lectures/Guest/DEToolsEtc.pdf · Learn by collaborating, know all different ways a data](https://reader036.vdocuments.us/reader036/viewer/2022062505/5ec5cfcd4a29781b3c1abe82/html5/thumbnails/63.jpg)
![Page 64: Best Practices Data Engineering Tools & Insight Sriram ...bytes.usc.edu/cs585/f19_AGI1ml04Us/lectures/Guest/DEToolsEtc.pdf · Learn by collaborating, know all different ways a data](https://reader036.vdocuments.us/reader036/viewer/2022062505/5ec5cfcd4a29781b3c1abe82/html5/thumbnails/64.jpg)
![Page 65: Best Practices Data Engineering Tools & Insight Sriram ...bytes.usc.edu/cs585/f19_AGI1ml04Us/lectures/Guest/DEToolsEtc.pdf · Learn by collaborating, know all different ways a data](https://reader036.vdocuments.us/reader036/viewer/2022062505/5ec5cfcd4a29781b3c1abe82/html5/thumbnails/65.jpg)
Where are we?
Seattle
Portland
San Francisco
Los Angeles
Austin
Chicago
New York
Boston
Toronto
In Person
Remote
![Page 66: Best Practices Data Engineering Tools & Insight Sriram ...bytes.usc.edu/cs585/f19_AGI1ml04Us/lectures/Guest/DEToolsEtc.pdf · Learn by collaborating, know all different ways a data](https://reader036.vdocuments.us/reader036/viewer/2022062505/5ec5cfcd4a29781b3c1abe82/html5/thumbnails/66.jpg)
Apply to Insight
● 3 sessions a year
● Apply when you are ready for full-time
● Prepare a role-driven resume
● Read our blog posts
● Contact alumni
● Application process:
  ○ Resume + Application Form
  ○ Interview
Note: Data Engineering program has a Coding challenge before the interview.
![Page 67: Best Practices Data Engineering Tools & Insight Sriram ...bytes.usc.edu/cs585/f19_AGI1ml04Us/lectures/Guest/DEToolsEtc.pdf · Learn by collaborating, know all different ways a data](https://reader036.vdocuments.us/reader036/viewer/2022062505/5ec5cfcd4a29781b3c1abe82/html5/thumbnails/67.jpg)
Applications open for June 2020 Session!
Apply.insightdatascience.com
Sign up for Notifications list