Egor Pakhomov, Data Architect, [email protected]
Data infrastructure architecture for a medium-size organization: tips for collecting, storing, and analyzing data.
Medium organization (<500 people):
- Data customers: >10
- Data volume: "Big data"
- Data team people resources: enough to integrate and support some open-source stack
- Financial resources: enough to buy hardware for a Hadoop cluster

Big organization (>500 people):
- Data customers: >100
- Data volume: "Big data"
- Data team people resources: enough to write our own data tools
- Financial resources: enough to buy some cloud solution (Databricks Cloud, Google BigQuery, ...)
Data infrastructure architecture:
How to manage big data when you are not that big?
About me
- Data architect at AnchorFree
- Spark contributor since 0.9
- Integrated Spark in Yandex Islands; worked in Yandex Data Factory
- Participated in "Alpine Data" development, a Spark-based data platform
Agenda
1. Data querying: Why SQL is important and how to use it in Hadoop?
   • SQL vs R/Python
   • Impala vs Spark
   • Zeppelin vs SQL desktop client
2. Data storage: How to store data to query it fast and change it easily?
   • JSON vs Parquet
   • Schema vs schema-less
3. Data aggregation: How to aggregate your data to work better with BI tools?
   • Aggregate your data!
   • SQL code is code!
1. Data querying
Why SQL is important and how to use it in Hadoop?
1. SQL vs R/Python
2. Impala vs Spark
3. Zeppelin vs SQL desktop client
[Diagram: SQL at the center, serving BI, analysts, QA, and regular data transformations.]
What do you need from an SQL engine?
• Fast
• Reliable
• Able to process terabytes of data
• Supports the Hive metastore
• Supports modern SQL statements
Hive metastore role
[Diagram: files live in HDFS; the Hive metastore maps tables to files (table_1 -> file341, file542, file453; table_2 -> file457, file458, file459; table_3 -> file37, file568, file359; table_4 -> file3457, file568, file349; ...); the drivers of the SQL engines read this mapping and dispatch work to their executors.]
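As a concrete illustration, here is a minimal sketch (the database, table, and HDFS path are hypothetical) of registering raw files in the metastore so that every engine that reads it sees the same table:

-- Any engine that talks to the Hive metastore (SparkSQL, Impala, Hive)
-- will resolve logs.json_datasource to the files under this location.
CREATE DATABASE IF NOT EXISTS logs;
CREATE EXTERNAL TABLE IF NOT EXISTS logs.json_datasource (line STRING)
LOCATION 'hdfs:///data/logs/json/';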
Which one would you choose? Both!
                                   SparkSQL | Impala
Supports Hive metastore               +     |   +
Fast                                  -     |   +
Reliable (works not only in RAM)      +     |   -
JSON support                          +     |   -
Hive-compatible syntax                +     |   -
Out-of-the-box YARN support           +     |   -
More than just a SQL framework        +     |   -
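In practice, "both" can look roughly like this (a hedged sketch; the table names reuse the hypothetical ones above, and which engine executes a statement depends on the client you submit it through):

-- Spark SQL side: dig into the raw JSON, which Impala cannot parse here.
SELECT get_json_object(line, '$.some_field') AS some_field
FROM logs.json_datasource
LIMIT 10;

-- Impala side: fast, interactive scan over the columnar Parquet table.
SELECT sum(some_field) FROM logs.parquet_datasource;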
Step 1: connect Tableau to Hadoop
[Diagram: Tableau connects to Hadoop through an ODBC/JDBC server.]
Step 2: give SQL to users
[Diagram: users' SQL desktop clients connect to Hadoop through the same ODBC/JDBC server.]
Would not work...
1. Managing a desktop application on N laptops
2. One Spark context shared by many users
3. Lack of visualization
4. No decent resource scheduling

No decent resource scheduling: one user blocks everyone else.
No decent resource scheduling: and yet Hadoop itself is good at resource scheduling!
Apache Zeppelin is our solution
1. Web-based
2. Notebook-based
3. Great visualization
4. Works with both Impala and Spark
5. Has a cloud solution with support: Zeppelin Hub from NFLabs
It’s great!
Apache Zeppelin integration
[Diagram: Zeppelin running on top of the Hadoop cluster.]
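For flavor, a minimal sketch of what a Zeppelin notebook paragraph might look like (the %sql interpreter runs the statement through Spark SQL; Impala is usually reached through a JDBC interpreter, and the column names here are hypothetical):

%sql
-- Zeppelin renders the result with its built-in charts (table, bar, pie, line, ...)
SELECT country, count(*) AS sessions
FROM logs.parquet_datasource
GROUP BY country
ORDER BY sessions DESC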
2. Data storage
How to store data to query it fast and change it easily?
1. JSON vs Parquet
2. Schema vs schema-less
What would you need from data storage?
• Flexible format
• Fast querying
• Access to "raw" data
• Has a schema
Can we choose just one data format? We need both!

                        JSON | Parquet
Flexible                  +  |
Access to "raw" data      +  |
Fast querying                |    +
Has schema                   |    +
Impala support               |    +
Format  | Query                                                                         | Query time
Parquet | SELECT sum(some_field) FROM logs.parquet_datasource                           | 136 sec
JSON    | SELECT sum(get_json_object(line, '$.some_field')) FROM logs.json_datasource   | 764 sec
Parquet is 5 times faster!
But when you need raw data, 5 times slower is not that bad.
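A common way to get the best of both is to keep the raw JSON and periodically materialize the frequently queried fields into a Parquet table; a hedged sketch (field names are hypothetical, runs in Spark SQL / Hive):

-- Extract the hot fields from raw JSON into a columnar Parquet table.
CREATE TABLE logs.parquet_datasource STORED AS PARQUET AS
SELECT
  CAST(get_json_object(line, '$.some_field') AS DOUBLE) AS some_field,
  get_json_object(line, '$.country')                    AS country
FROM logs.json_datasource;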
Let’s compare elegance and speed:
{“First name”: “Mike”,“Last name”: “Smith”,
“Gender”: “Male”,“Country”: “US”
}
{“First name”: “Anna”,“Last name”: “Smith”,
“Age”: “45”,“Country”: “Canada”,
Comments: ”Some additional info”}...
FIRST NAME
LAST NAME GENDER AGE
Mike Smith Male NULL
Anna Smith NULL 45
... ... ... ...
JSON Parquet
How data in these formats compare
3. Data aggregation
How to aggregate your data to work better with BI tools?
1. Aggregate your data!
2. SQL code is code!
Aggregate your data!
● "Big data" does not mean you need to query all the data daily
● BI tools should not run big queries
[Diagram: the BI tool issues a simple "select * from ..." against an aggregated table.]
How does aggregation work?
[Diagram: queries live in Git; a query executor runs them and writes the results into an aggregated table.]
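As an illustration, a hedged sketch of the kind of query that would live in Git and be run daily by the executor (table and column names are hypothetical):

-- Rebuild the small table that the BI tool actually reads.
-- (Assumes the target table was created beforehand, e.g. in Zeppelin.)
INSERT OVERWRITE TABLE reports.daily_sessions
SELECT
  to_date(event_time) AS day,
  country,
  count(*)            AS sessions
FROM logs.parquet_datasource
GROUP BY to_date(event_time), country;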
Report development process
1. Creating an aggregated table in Zeppelin
2. Creating a BI report based on this table
3. Adding the queries to Git to run daily
4. Publishing the report

Process for changing the data behind a report:
1. Change the query in Git
One more tip
We do not use the Spark that comes with the Hadoop distribution, because we:
1. Need to apply our own patches to the source code
2. Move to new versions before any official release
3. Move to a new version on part of the infrastructure while the rest remains on the old one
Questions?
Contact: Egor Pakhomov
[email protected]
[email protected]
https://www.linkedin.com/in/egor-pakhomov-35179a3a