Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office
![Page 1: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office](https://reader035.vdocuments.us/reader035/viewer/2022070510/58a9d6121a28aba05b8b4b55/html5/thumbnails/1.jpg)
Lessons Learned with Cassandra & Spark at the USPTO
![Page 2: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office](https://reader035.vdocuments.us/reader035/viewer/2022070510/58a9d6121a28aba05b8b4b55/html5/thumbnails/2.jpg)
Christopher Bradford
• DataStax Certified Cassandra Architect
• Contributed to CQLEngine - Python C* ORM
• Developed Trireme - a migration engine for Cassandra & DSE
• Created the world’s smallest C* cluster
Twitter: @bradfordcp
GitHub: bradfordcp
![Page 3: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office](https://reader035.vdocuments.us/reader035/viewer/2022070510/58a9d6121a28aba05b8b4b55/html5/thumbnails/3.jpg)
OpenSource Connections
• Consulting firm based in Charlottesville, Virginia
• Founded in 2005
• Focused on Search in 2010, specifically Solr and Lucene
• Delivering Cassandra Consulting since 2012
• DataStax Gold Partner
• Great with Search, Analytics and Discovery
![Page 4: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office](https://reader035.vdocuments.us/reader035/viewer/2022070510/58a9d6121a28aba05b8b4b55/html5/thumbnails/4.jpg)
OpenSource Connections
Blog: http://o19s.com/blog/
Twitter: @o19s
GitHub: o19s
![Page 5: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office](https://reader035.vdocuments.us/reader035/viewer/2022070510/58a9d6121a28aba05b8b4b55/html5/thumbnails/5.jpg)
Exploring Search Technologies
![Page 6: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office](https://reader035.vdocuments.us/reader035/viewer/2022070510/58a9d6121a28aba05b8b4b55/html5/thumbnails/6.jpg)
Technologies
![Page 7: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office](https://reader035.vdocuments.us/reader035/viewer/2022070510/58a9d6121a28aba05b8b4b55/html5/thumbnails/7.jpg)
Architecture
![Page 8: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office](https://reader035.vdocuments.us/reader035/viewer/2022070510/58a9d6121a28aba05b8b4b55/html5/thumbnails/8.jpg)
Architecture – Data Layer
![Page 9: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office](https://reader035.vdocuments.us/reader035/viewer/2022070510/58a9d6121a28aba05b8b4b55/html5/thumbnails/9.jpg)
Data
[Chart: Patent Applications & Grants by year, 1963-2013. Series: Applications, Grants. Y-axis: 0 to 700,000.]
![Page 10: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office](https://reader035.vdocuments.us/reader035/viewer/2022070510/58a9d6121a28aba05b8b4b55/html5/thumbnails/10.jpg)
![Page 11: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office](https://reader035.vdocuments.us/reader035/viewer/2022070510/58a9d6121a28aba05b8b4b55/html5/thumbnails/11.jpg)
WHERE Clauses
![Page 12: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office](https://reader035.vdocuments.us/reader035/viewer/2022070510/58a9d6121a28aba05b8b4b55/html5/thumbnails/12.jpg)
WHERE
I DON’T THINK YOU KNOW WHAT THAT MEANS
YOU KEEP USING THAT CLAUSE
![Page 13: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office](https://reader035.vdocuments.us/reader035/viewer/2022070510/58a9d6121a28aba05b8b4b55/html5/thumbnails/13.jpg)
CQL vs SQL: WHERE
 type | name     | rank
------+----------+-------
 last | STOBAUGH | 25067
 last | BRUDNER  | 65304
 last | SKLAR    | 12517
 last | VRANES   | 59290
 last | SCHRODT  | 34764

SQL
SELECT * FROM names WHERE rank = 59290;
last | VRANES | 59290

CQL
SELECT * FROM names WHERE rank = 59290;
InvalidRequest: code=2200 [Invalid query] message="No secondary indexes on the restricted columns support the provided operators: "

CREATE TABLE names (
  type VARCHAR,
  name VARCHAR,
  rank INT,
  PRIMARY KEY ((type, name))
);
![Page 14: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office](https://reader035.vdocuments.us/reader035/viewer/2022070510/58a9d6121a28aba05b8b4b55/html5/thumbnails/14.jpg)
CQL vs SQL: WHERE
 type | name     | rank
------+----------+-------
 last | STOBAUGH | 25067
 last | BRUDNER  | 65304
 last | SKLAR    | 12517
 last | VRANES   | 59290
 last | SCHRODT  | 34764

SQL
SELECT * FROM names WHERE rank = 59290;
last | VRANES | 59290

CQL
SELECT * FROM names WHERE type = 'last' AND name = 'VRANES';
last | VRANES | 59290

CREATE TABLE names (
  type VARCHAR,
  name VARCHAR,
  rank INT,
  PRIMARY KEY ((type, name))
);
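Why the first CQL query is rejected while the second succeeds can be pictured with a toy model. The sketch below is plain Python, not Cassandra and not a driver API: it treats the table as a dictionary addressed by the partition key (type, name). The names `NamesTable`, `get`, and `scan_by_rank` are hypothetical and exist only for this illustration.

```python
from typing import Dict, List, Optional, Tuple

Row = Tuple[str, str, int]  # (type, name, rank)


class NamesTable:
    """Toy stand-in for a table with PRIMARY KEY ((type, name))."""

    def __init__(self) -> None:
        # Rows are addressed only by the full partition key.
        self._partitions: Dict[Tuple[str, str], Row] = {}

    def insert(self, type_: str, name: str, rank: int) -> None:
        self._partitions[(type_, name)] = (type_, name, rank)

    def get(self, type_: str, name: str) -> Optional[Row]:
        # Direct lookup: the key says exactly where the row lives.
        return self._partitions.get((type_, name))

    def scan_by_rank(self, rank: int) -> List[Row]:
        # No key available, so every partition has to be inspected.
        return [row for row in self._partitions.values() if row[2] == rank]


table = NamesTable()
for name, rank in [("STOBAUGH", 25067), ("BRUDNER", 65304), ("SKLAR", 12517),
                   ("VRANES", 59290), ("SCHRODT", 34764)]:
    table.insert("last", name, rank)

print(table.get("last", "VRANES"))
print(table.scan_by_rank(59290))
```

In this picture, `scan_by_rank` is roughly the work the rejected query would have required: touching every partition to find one row.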
![Page 15: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office](https://reader035.vdocuments.us/reader035/viewer/2022070510/58a9d6121a28aba05b8b4b55/html5/thumbnails/15.jpg)
CQL vs SQL: Tables
names
 type | name     | rank
------+----------+-------
 last | STOBAUGH | 25067
 last | BRUDNER  | 65304
 last | SKLAR    | 12517
 last | VRANES   | 59290
 last | SCHRODT  | 34764

names_by_rank
 rank  | type | name
-------+------+----------
 25067 | last | STOBAUGH
 65304 | last | BRUDNER
 12517 | last | SKLAR
 59290 | last | VRANES
 34764 | last | SCHRODT

SQL
SELECT * FROM names_by_rank WHERE rank = 59290;
last | VRANES | 59290

CQL
SELECT * FROM names_by_rank WHERE rank = 59290;
last | VRANES | 59290
![Page 16: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office](https://reader035.vdocuments.us/reader035/viewer/2022070510/58a9d6121a28aba05b8b4b55/html5/thumbnails/16.jpg)
CQL vs SQL: Indexes
SQL
SELECT * FROM names WHERE rank = 59290;
last | VRANES | 59290

CQL
SELECT * FROM names WHERE rank = 59290;
last | VRANES | 59290

 type | name     | rank
------+----------+-------
 last | STOBAUGH | 25067
 last | BRUDNER  | 65304
 last | SKLAR    | 12517
 last | VRANES   | 59290
 last | SCHRODT  | 34764

CREATE INDEX ON names (rank);
![Page 17: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office](https://reader035.vdocuments.us/reader035/viewer/2022070510/58a9d6121a28aba05b8b4b55/html5/thumbnails/17.jpg)
CQL vs SQL: Recap
• Consider multiple tables with data models that support fast, efficient querying.
• Remember that writes are extremely fast in C*. Writing to multiple tables is not necessarily a bad thing.
• Build an index table
  • Your model may support building an inverted index for lookups of record ids.
• Use secondary indexes***
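The "write to multiple tables" and "index table" ideas from this recap can be sketched with two dictionaries, each keyed the way it will be queried. This is an illustrative model in plain Python, not driver code; `save_name` and `find_by_rank` are hypothetical helpers, and the sketch assumes each rank maps to exactly one name.

```python
from typing import Dict, Optional, Tuple

# Two toy "tables", each keyed the way it will be queried.
names: Dict[Tuple[str, str], int] = {}            # (type, name) -> rank
names_by_rank: Dict[int, Tuple[str, str]] = {}    # rank -> (type, name)


def save_name(type_: str, name: str, rank: int) -> None:
    """Write the same record to both tables (hypothetical helper)."""
    names[(type_, name)] = rank
    names_by_rank[rank] = (type_, name)


def find_by_rank(rank: int) -> Optional[Tuple[str, str, int]]:
    """Look up the record id in the index table, then read the base table."""
    key = names_by_rank.get(rank)
    if key is None:
        return None
    return (key[0], key[1], names[key])


save_name("last", "VRANES", 59290)
save_name("last", "SKLAR", 12517)
print(find_by_rank(59290))
```

The cost is an extra write per record; the benefit is that both lookups stay keyed reads rather than scans.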
![Page 18: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office](https://reader035.vdocuments.us/reader035/viewer/2022070510/58a9d6121a28aba05b8b4b55/html5/thumbnails/18.jpg)
Cluster Balancing
![Page 19: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office](https://reader035.vdocuments.us/reader035/viewer/2022070510/58a9d6121a28aba05b8b4b55/html5/thumbnails/19.jpg)
Unbalanced Cluster Symptoms
• Certain nodes shutting down mid-way through ingestion
• Data is not cleanly distributed across the cluster
![Page 20: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office](https://reader035.vdocuments.us/reader035/viewer/2022070510/58a9d6121a28aba05b8b4b55/html5/thumbnails/20.jpg)
Unbalanced Cluster Causes
• Data Model – check your partitions!
• Configuration – how are your tokens split amongst the nodes?
• Hardware – is the server configured correctly?
![Page 21: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office](https://reader035.vdocuments.us/reader035/viewer/2022070510/58a9d6121a28aba05b8b4b55/html5/thumbnails/21.jpg)
Data
[Chart: Patent Applications & Grants by year, 1963-2013. Series: Applications, Grants. Y-axis: 0 to 700,000.]
![Page 22: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office](https://reader035.vdocuments.us/reader035/viewer/2022070510/58a9d6121a28aba05b8b4b55/html5/thumbnails/22.jpg)
Balancing: Data Model
CREATE TABLE images (
  year INT,
  id TEXT,
  page TEXT,
  image BLOB,
  PRIMARY KEY (year, id, page)
);
SELECT * FROM images WHERE year = 2015;
Sample unbalanced model
![Page 23: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office](https://reader035.vdocuments.us/reader035/viewer/2022070510/58a9d6121a28aba05b8b4b55/html5/thumbnails/23.jpg)
Balancing: Data Model
CREATE TABLE images (
  year INT,
  month INT,
  id TEXT,
  page INT,
  image BLOB,
  PRIMARY KEY ((year, month), id, page)
);
SELECT * FROM images WHERE year = 2015 AND month IN (1,…);
Switch partition key to use multiple fields instead of just year.
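The effect of widening the partition key can be checked on synthetic data. The sketch below counts rows per partition for a key of `year` versus `(year, month)`. The rows are made up for illustration (3 years, 12 months, 10 images per month) and are not USPTO figures; `partition_sizes` is a hypothetical helper.

```python
from collections import Counter
from typing import Callable, Hashable, Iterable, Tuple

Row = Tuple[int, int, str]  # (year, month, id) - synthetic rows only


def partition_sizes(rows: Iterable[Row],
                    key: Callable[[Row], Hashable]) -> Counter:
    """Count how many rows fall into each partition for a given key."""
    return Counter(key(row) for row in rows)


# Synthetic data: 3 years x 12 months x 10 images per month.
rows = [(year, month, f"img-{year}-{month}-{n}")
        for year in (2013, 2014, 2015)
        for month in range(1, 13)
        for n in range(10)]

by_year = partition_sizes(rows, key=lambda r: r[0])
by_year_month = partition_sizes(rows, key=lambda r: (r[0], r[1]))

print(len(by_year), max(by_year.values()))              # 3 partitions, 120 rows each
print(len(by_year_month), max(by_year_month.values()))  # 36 partitions, 10 rows each
```

More, smaller partitions give the cluster more units to spread across nodes, at the price of queries having to name the month as well as the year.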
![Page 24: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office](https://reader035.vdocuments.us/reader035/viewer/2022070510/58a9d6121a28aba05b8b4b55/html5/thumbnails/24.jpg)
Balancing: Configuration
![Page 25: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office](https://reader035.vdocuments.us/reader035/viewer/2022070510/58a9d6121a28aba05b8b4b55/html5/thumbnails/25.jpg)
Virtual Nodes?
Source: http://docs.datastax.com/en/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html
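The question of how tokens are split among nodes can be explored with a toy ring. The sketch below is a simplified model under stated assumptions: it uses an MD5-derived token and a 32-bit token space, neither of which matches Cassandra's actual partitioner, and `ownership` is a hypothetical helper. It only illustrates how the share of the ring per node is computed for one token per node versus many.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Tuple

RING_SIZE = 2 ** 32  # toy token space, not the one a real cluster uses


def token(label: str) -> int:
    """Derive a deterministic pseudo-random token from a label."""
    return int(hashlib.md5(label.encode()).hexdigest(), 16) % RING_SIZE


def ownership(nodes: List[str], tokens_per_node: int) -> Dict[str, float]:
    """Fraction of the ring owned by each node.

    A token owns the range from the previous token (exclusive) up to itself.
    """
    ring: List[Tuple[int, str]] = sorted(
        (token(f"{node}-{i}"), node)
        for node in nodes for i in range(tokens_per_node))
    owned: Dict[str, float] = defaultdict(float)
    for idx, (tok, node) in enumerate(ring):
        prev = ring[idx - 1][0]  # idx 0 wraps around to the last token
        owned[node] += ((tok - prev) % RING_SIZE) / RING_SIZE
    return dict(owned)


nodes = ["node1", "node2", "node3"]
single = ownership(nodes, tokens_per_node=1)
many = ownership(nodes, tokens_per_node=256)
print(single)
print(many)
```

Comparing the two printed dictionaries shows how far each configuration is from an even one-third split in this toy model.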
![Page 26: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office](https://reader035.vdocuments.us/reader035/viewer/2022070510/58a9d6121a28aba05b8b4b55/html5/thumbnails/26.jpg)
Hardware
![Page 27: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office](https://reader035.vdocuments.us/reader035/viewer/2022070510/58a9d6121a28aba05b8b4b55/html5/thumbnails/27.jpg)
Hardware
• Understand the type of hardware Cassandra runs well on.
LOCAL STORAGE
NETWORK STORAGE
![Page 28: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office](https://reader035.vdocuments.us/reader035/viewer/2022070510/58a9d6121a28aba05b8b4b55/html5/thumbnails/28.jpg)
![Page 29: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office](https://reader035.vdocuments.us/reader035/viewer/2022070510/58a9d6121a28aba05b8b4b55/html5/thumbnails/29.jpg)
Data Ingestion
![Page 30: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office](https://reader035.vdocuments.us/reader035/viewer/2022070510/58a9d6121a28aba05b8b4b55/html5/thumbnails/30.jpg)
Data Loading Performance
• Did it work?
• Why change it?
• How could we make it better?
![Page 31: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office](https://reader035.vdocuments.us/reader035/viewer/2022070510/58a9d6121a28aba05b8b4b55/html5/thumbnails/31.jpg)
Spark Data Loading
![Page 32: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office](https://reader035.vdocuments.us/reader035/viewer/2022070510/58a9d6121a28aba05b8b4b55/html5/thumbnails/32.jpg)
Collecting Metrics
![Page 33: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office](https://reader035.vdocuments.us/reader035/viewer/2022070510/58a9d6121a28aba05b8b4b55/html5/thumbnails/33.jpg)
Metrics
• GitHub: dropwizard/metrics
• Awesome Java library for collecting metrics in your code
• Demo later
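To show the kind of measurement this slide refers to, here is a minimal timer written with the Python standard library. It is a stand-in for the idea only and does not reproduce the Dropwizard Metrics API; `TimerRegistry`, `timed`, and the metric name `solr.push` are hypothetical.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator, List


class TimerRegistry:
    """Minimal stand-in for a metrics registry; not the Dropwizard API."""

    def __init__(self) -> None:
        self._durations: Dict[str, List[float]] = {}

    @contextmanager
    def timed(self, name: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the elapsed time even if the block raised.
            elapsed = time.perf_counter() - start
            self._durations.setdefault(name, []).append(elapsed)

    def count(self, name: str) -> int:
        return len(self._durations.get(name, []))

    def mean(self, name: str) -> float:
        samples = self._durations.get(name, [])
        return sum(samples) / len(samples) if samples else 0.0


registry = TimerRegistry()
for _ in range(3):
    with registry.timed("solr.push"):
        sum(range(1000))  # placeholder for the work being measured

print(registry.count("solr.push"), registry.mean("solr.push"))
```

Wrapping each stage of a load job this way makes it possible to see which step dominates before changing anything.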
![Page 34: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office](https://reader035.vdocuments.us/reader035/viewer/2022070510/58a9d6121a28aba05b8b4b55/html5/thumbnails/34.jpg)
Poor Performance
joinedRDD = …
joinedRDD.foreach()
  document = … // build document
  sc = new SolrConnection()
  sc.push(document)
  sc.disconnect()
// Job is done
![Page 35: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office](https://reader035.vdocuments.us/reader035/viewer/2022070510/58a9d6121a28aba05b8b4b55/html5/thumbnails/35.jpg)
Poor Performance
joinedRDD = …
joinedRDD.foreach()
  document = … // build document
  sc = new SolrConnection()
  sc.push(document)
  sc.disconnect()
// Job is done
![Page 36: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office](https://reader035.vdocuments.us/reader035/viewer/2022070510/58a9d6121a28aba05b8b4b55/html5/thumbnails/36.jpg)
Optimum Performance
joinedRDD = …
sc = new SolrConnection()
joinedRDD.foreach()
  document = … // build document
  sc.push(document)
sc.disconnect()
// Job is done
![Page 37: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office](https://reader035.vdocuments.us/reader035/viewer/2022070510/58a9d6121a28aba05b8b4b55/html5/thumbnails/37.jpg)
Scope
![Page 38: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office](https://reader035.vdocuments.us/reader035/viewer/2022070510/58a9d6121a28aba05b8b4b55/html5/thumbnails/38.jpg)
Scope: Review
joinedRDD = …
sc = new SolrConnection()
joinedRDD.foreach()
  document = … // build document
  sc.push(document)
sc.disconnect()
// Job is done
![Page 39: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office](https://reader035.vdocuments.us/reader035/viewer/2022070510/58a9d6121a28aba05b8b4b55/html5/thumbnails/39.jpg)
Scope: ERROR
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
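This error most likely arises because Spark has to serialize the function passed to `foreach()` in order to ship it to the executors, and a connection created on the driver gets captured along with it. The same failure mode can be reproduced by analogy in plain Python with `pickle`. `FakeConnection` and `can_ship` are hypothetical names, and the lock merely stands in for a socket or client handle.

```python
import pickle
import threading


class FakeConnection:
    """Hypothetical connection holding a resource that cannot be serialized."""

    def __init__(self) -> None:
        self._lock = threading.Lock()  # stands in for a socket or client handle


def can_ship(obj: object) -> bool:
    """Return True if obj survives serialization, as a shipped task must."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, pickle.PicklingError):
        return False


print(can_ship({"id": "doc-1"}))   # plain data serializes
print(can_ship(FakeConnection()))  # the lock inside does not
```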
![Page 40: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office](https://reader035.vdocuments.us/reader035/viewer/2022070510/58a9d6121a28aba05b8b4b55/html5/thumbnails/40.jpg)
Scope: Fixed!
joinedRDD = …
joinedRDD.foreachPartition()
  sc = new SolrConnection()
  partition.foreach()
    document = …
    sc.push(document)
// Job is done
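The difference between opening a connection per record and per partition can be counted in a small simulation. This is plain Python with lists standing in for partitions, not Spark code; `CountingConnection` is a hypothetical class that only counts how often it is constructed.

```python
from typing import Iterable, List


class CountingConnection:
    """Hypothetical connection that counts how many times one was opened."""

    opened = 0

    def __init__(self) -> None:
        CountingConnection.opened += 1
        self.pushed: List[str] = []

    def push(self, document: str) -> None:
        self.pushed.append(document)

    def disconnect(self) -> None:
        pass  # nothing to release in this toy version


def index_per_record(partitions: Iterable[List[str]]) -> None:
    for partition in partitions:
        for row in partition:
            conn = CountingConnection()  # one connection per row
            conn.push(row)
            conn.disconnect()


def index_per_partition(partitions: Iterable[List[str]]) -> None:
    for partition in partitions:
        conn = CountingConnection()      # one connection per partition
        try:
            for row in partition:
                conn.push(row)
        finally:
            conn.disconnect()


partitions = [["a", "b", "c"], ["d", "e"], ["f"]]

CountingConnection.opened = 0
index_per_record(partitions)
per_record = CountingConnection.opened       # 6

CountingConnection.opened = 0
index_per_partition(partitions)
per_partition = CountingConnection.opened    # 3
print(per_record, per_partition)
```

Because the connection is created inside the per-partition function, it never has to be serialized and shipped from the driver.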
![Page 41: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office](https://reader035.vdocuments.us/reader035/viewer/2022070510/58a9d6121a28aba05b8b4b55/html5/thumbnails/41.jpg)
Know your API
![Page 42: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office](https://reader035.vdocuments.us/reader035/viewer/2022070510/58a9d6121a28aba05b8b4b55/html5/thumbnails/42.jpg)
Java RDD != Scala RDD
![Page 43: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office](https://reader035.vdocuments.us/reader035/viewer/2022070510/58a9d6121a28aba05b8b4b55/html5/thumbnails/43.jpg)
APIs: mapPartitions()
joinedRDD = …
joinedRDD.mapPartitions()
  sc = new SolrConnection()
  partition.foreach()
    document = … // build document
    sc.push(document)
  return partition.rows
![Page 44: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office](https://reader035.vdocuments.us/reader035/viewer/2022070510/58a9d6121a28aba05b8b4b55/html5/thumbnails/44.jpg)
APIs: Transformations & Actions
• Transformations: lazily executed; the code does not run until an action is applied.
  • Ex: map
• Actions: operate on RDD elements and return to the driver.
  • Ex: foreach
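Lazy evaluation is easy to observe outside Spark as well. Python's built-in `map()` is lazy in the same spirit: it does no work until something consumes it. The sketch below is an analogy rather than Spark code, with `list()` playing the role of the action; `build_document` is a hypothetical function.

```python
calls = []


def build_document(row: int) -> int:
    calls.append(row)      # side effect lets us see when the work happens
    return row * 2


rows = [1, 2, 3]

lazy = map(build_document, rows)   # "transformation": nothing has run yet
calls_before = len(calls)          # 0

result = list(lazy)                # "action": forces evaluation
calls_after = len(calls)           # 3
print(calls_before, calls_after, result)
```

A job that ends in a transformation with no action behaves like the first half of this example: it is defined but never runs.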
![Page 45: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office](https://reader035.vdocuments.us/reader035/viewer/2022070510/58a9d6121a28aba05b8b4b55/html5/thumbnails/45.jpg)
APIs: mapPartitions()
joinedRDD = …
joinedRDD.mapPartitions()
  sc = new SolrConnection()
  partition.foreach()
    document = … // build document
    sc.push(document)
  return partition.rows
.collect()
![Page 46: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office](https://reader035.vdocuments.us/reader035/viewer/2022070510/58a9d6121a28aba05b8b4b55/html5/thumbnails/46.jpg)
Understand how data is passed around
![Page 47: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office](https://reader035.vdocuments.us/reader035/viewer/2022070510/58a9d6121a28aba05b8b4b55/html5/thumbnails/47.jpg)
Memory: Solution
joinedRDD = …
joinedRDD.mapPartitions()
  sc = new SolrConnection()
  partition.foreach()
    document = … // build document
    sc.push(document)
  return partition.rows.length
.collect()
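The point of returning the row count instead of the rows appears to be that whatever each partition returns is gathered on the driver. A toy version in plain Python makes the size difference concrete. This is not Spark; `push_all` and `map_partitions_collect` are hypothetical helpers, and the `sink` list stands in for Solr.

```python
from typing import Callable, Iterable, Iterator, List


def push_all(partition: Iterator[str], sink: List[str]) -> Iterator[int]:
    """Process one partition and yield only the number of rows handled."""
    count = 0
    for row in partition:
        sink.append(row)   # stands in for sc.push(document)
        count += 1
    yield count            # one small integer per partition


def map_partitions_collect(
        partitions: Iterable[List[str]],
        func: Callable[[Iterator[str]], Iterator[int]]) -> List[int]:
    """Toy driver: apply func to each partition and gather the results."""
    collected: List[int] = []
    for partition in partitions:
        collected.extend(func(iter(partition)))
    return collected


sink: List[str] = []
partitions = [["a", "b", "c"], ["d", "e"], ["f"]]
counts = map_partitions_collect(partitions, lambda p: push_all(p, sink))
print(counts, sum(counts))
```

The driver ends up holding one integer per partition, however large the partitions are, while the documents themselves go only to the sink.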
![Page 48: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office](https://reader035.vdocuments.us/reader035/viewer/2022070510/58a9d6121a28aba05b8b4b55/html5/thumbnails/48.jpg)
How did it go?
![Page 49: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office](https://reader035.vdocuments.us/reader035/viewer/2022070510/58a9d6121a28aba05b8b4b55/html5/thumbnails/49.jpg)
Demo
![Page 50: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office](https://reader035.vdocuments.us/reader035/viewer/2022070510/58a9d6121a28aba05b8b4b55/html5/thumbnails/50.jpg)
Questions?
![Page 51: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office](https://reader035.vdocuments.us/reader035/viewer/2022070510/58a9d6121a28aba05b8b4b55/html5/thumbnails/51.jpg)
Thank You
© 2015. All Rights Reserved.