Understanding DSE Search by Matt Stump

DSE 4.7 Search

Matt Stump, Chief Architect/Manager for SWAT, DataStax

Webinar Housekeeping

Thank you for joining. We will begin shortly.

• All attendees placed on mute
• Input questions at any time using the online interface

Agenda

1 Data Locality
2 Bitmap Indexing
3 IO Path
4 Demo
5 Performance
6 Why DSE?

Hash(“some bytes”) => A Number
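Hashing a partition key to a number is what lets a cluster decide which node owns a row, which is the basis for data locality. The following is a minimal sketch under stated assumptions: the four node names and the evenly spaced ring are hypothetical, and MD5 from the standard library stands in for Cassandra's Murmur3 partitioner.

```python
import bisect
import hashlib

def token_for(key: bytes) -> int:
    """Hash arbitrary bytes to a 64-bit number (MD5 as a stand-in for Murmur3)."""
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:8], "big")

# Hypothetical ring: four nodes with evenly spaced tokens.
RING_SIZE = 2 ** 64
NODES = ["node-a", "node-b", "node-c", "node-d"]
RING = sorted((i * RING_SIZE // len(NODES), name) for i, name in enumerate(NODES))
TOKENS = [token for token, _ in RING]

def owner_of(key: bytes) -> str:
    """Return the node holding the first ring token >= the key's token."""
    idx = bisect.bisect_left(TOKENS, token_for(key))
    return RING[idx % len(RING)][1]  # wrap around past the last token

print(owner_of(b"0007148089"))
```

The same key always hashes to the same token, so reads and index lookups for a given row can be routed to the node that already stores it.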

V1 OR V2
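To illustrate why "V1 OR V2" is cheap with a bitmap index, here is a small sketch that keeps one bitmap per distinct value and answers the OR with a single bitwise operation. It assumes a hypothetical six-document column and uses plain Python integers as bitmaps; it is not how Lucene stores its bitsets internally.

```python
from collections import defaultdict

def build_bitmaps(values):
    """Map each distinct value to an int whose bit i is set when doc i has it."""
    bitmaps = defaultdict(int)
    for doc_id, value in enumerate(values):
        bitmaps[value] |= 1 << doc_id
    return bitmaps

def matching_docs(bitmap):
    """Expand a bitmap back into a sorted list of document ids."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# Hypothetical column: one category value per document id 0..5.
categories = ["Books", "Music", "Books", "Toys", "Music", "Books"]
bitmaps = build_bitmaps(categories)

# "V1 OR V2" is a single bitwise OR of the two value bitmaps.
books_or_toys = bitmaps["Books"] | bitmaps["Toys"]
print(matching_docs(books_or_toys))  # [0, 2, 3, 5]
```

Reading is one OR over precomputed bitmaps, while adding or changing a document means touching the bitmap of every value involved.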

Quick to Read, Expensive to Update

Near Real Time is Expensive

Use 32 vnodes in DSE 4.7.1

{
  'asin': '0007148089',
  'title': "Blood and Roses: The Tumultuous Wars of the Roses",
  'price': 5.98,
  'imUrl': 'http://ecx.images-amazon.com/images/I/518p8d64F8L.jpg',
  'related': {
    'also_bought': ['0061430765', '0061430773', 'B00A4E8E78'],
    'buy_after_viewing': ['0061430773', '0345404335', 'B00A4E8E78', '0975126407']
  },
  'salesRank': {'Books': 326205},
  'categories': [['Books']]
}

CREATE TABLE IF NOT EXISTS amazon.metadata (
  asin text,
  title text,
  imurl text,
  price double,
  categories set<text>,
  also_bought set<text>,
  buy_after_viewing set<text>,
  PRIMARY KEY(asin)
);

CREATE TABLE IF NOT EXISTS amazon.rank (
  asin text,
  category text,
  rank int,
  PRIMARY KEY(asin, category)
);
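The sample record does not match the two tables one to one: the nested 'related' and 'salesRank' entries have to be split out. A minimal sketch of that mapping follows; the function name to_rows is hypothetical, and the choice to flatten the nested 'categories' lists into one set is an assumption based on the set<text> column type.

```python
def to_rows(product):
    """Split one product record into a metadata row and per-category rank rows."""
    related = product.get("related", {})
    metadata = {
        "asin": product["asin"],
        "title": product.get("title"),
        "imurl": product.get("imUrl"),
        "price": product.get("price"),
        # categories arrives as a list of lists; flatten it into one set
        "categories": {c for path in product.get("categories", []) for c in path},
        "also_bought": set(related.get("also_bought", [])),
        "buy_after_viewing": set(related.get("buy_after_viewing", [])),
    }
    # one amazon.rank row per (asin, category) pair
    ranks = [
        {"asin": product["asin"], "category": category, "rank": rank}
        for category, rank in product.get("salesRank", {}).items()
    ]
    return metadata, ranks

sample = {
    "asin": "0007148089",
    "title": "Blood and Roses: The Tumultuous Wars of the Roses",
    "price": 5.98,
    "imUrl": "http://ecx.images-amazon.com/images/I/518p8d64F8L.jpg",
    "related": {
        "also_bought": ["0061430765", "0061430773", "B00A4E8E78"],
        "buy_after_viewing": ["0061430773", "0345404335", "B00A4E8E78", "0975126407"],
    },
    "salesRank": {"Books": 326205},
    "categories": [["Books"]],
}

metadata, ranks = to_rows(sample)
print(ranks)  # [{'asin': '0007148089', 'category': 'Books', 'rank': 326205}]
```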

dsetool create_core amazon.metadata generateResources=true

dsetool create_core amazon.rank generateResources=true

http://localhost:8983/solr/#/amazon.metadata

http://localhost:8983/solr/#/amazon.rank

Index Size

• Core index size
• Fields, term frequency, count, and settings
• Number of dynamic fields and frequency using Luke
• termVectors="false"
• termPositions="false"
• termOffsets="false"
• omitNorms="true"
• Only index fields you intend to search
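The field settings above can be checked mechanically. The sketch below scans a schema for indexed fields that enable term vectors, positions, or offsets, or that do not set omitNorms="true" explicitly. The schema fragment is hypothetical, written for illustration rather than copied from a generated schema.xml, and the function name audit is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema fragment; a real check would read the core's schema.xml.
SCHEMA = """
<schema name="example" version="1.5">
  <fields>
    <field name="asin" type="StrField" indexed="true" stored="true" omitNorms="true"/>
    <field name="title" type="TextField" indexed="true" stored="true" termVectors="true"/>
    <field name="imurl" type="StrField" indexed="false" stored="true"/>
  </fields>
</schema>
"""

COSTLY = ("termVectors", "termPositions", "termOffsets")

def audit(schema_xml):
    """Return {field name: [notes]} for indexed fields with index-growing settings."""
    findings = {}
    for field in ET.fromstring(schema_xml.strip()).iter("field"):
        if field.get("indexed") != "true":
            continue  # unindexed fields do not add to the index
        notes = [f"{attr} is enabled" for attr in COSTLY if field.get(attr) == "true"]
        # Only explicit attributes are checked; field-type defaults are not resolved.
        if field.get("omitNorms") != "true":
            notes.append("omitNorms is not set to true")
        if notes:
            findings[field.get("name")] = notes
    return findings

print(audit(SCHEMA))
```

Because the check only reads explicit attributes, a field whose type already omits norms by default would still be reported; treat the output as a list of fields to review, not as errors.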

Dynamic Fields

Indexing throughput

• Set autoSoftCommit as high as possible
• Disable all caches except filterCache
• Increase RAM buffer to 512-1024MB
• Enable realtime indexing
• Large heap (20GB) with G1 or CASSANDRA-8150 tuning
• Increase back_pressure_threshold_per_core to 2000-5000
• Set max_solr_concurrency_per_core to the number of CPU cores
• Recommend more CPU cores (32)

Live Indexing Throughput

Query Latency and Throughput

• Set autoSoftCommit as high as possible
• Disable all caches except filterCache
• Use docValues for faceted or sorted fields
• Large heap (20GB) with G1 or CASSANDRA-8150 tuning
• Move query parameters to filters
• Use single pass queries where possible
• Recommend more CPU cores (32)
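"Move query parameters to filters" means putting exact-match and range restrictions into Solr fq parameters instead of the main q clause, so they can be served from the filterCache and do not take part in scoring. A minimal sketch of building such a request follows; the select URL for the amazon.metadata core and the example field values are assumptions, and the request is only constructed, not sent.

```python
from urllib.parse import urlencode

# Assumed select handler for the amazon.metadata core on a local node.
BASE = "http://localhost:8983/solr/amazon.metadata/select"

def build_query(text, filters):
    """Keep the scored part in q and put each restriction in its own fq."""
    params = [("q", text), ("wt", "json")]
    params += [("fq", f"{field}:{value}") for field, value in filters.items()]
    return f"{BASE}?{urlencode(params)}"

# Hypothetical search: score on title, filter on category and price range.
url = build_query("title:roses", {"categories": "Books", "price": "[0 TO 10]"})
print(url)
```

Using one fq per restriction lets each filter be cached and reused independently across queries that share it.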

Query Latency and Throughput

• DSETool Performance objects
• Solr slow query log
• Tracing
• Use the JMX MBeans under com.datastax.bdp.search (DSP-2792)
  – EXECUTE
  – RETRIEVE
  – COORDINATE

CASSANDRA-8150 Tuning

MAX_HEAP_SIZE="20G"
HEAP_NEWSIZE="6G"
JVM_OPTS="$JVM_OPTS -XX:-UseBiasedLocking"
JVM_OPTS="$JVM_OPTS -XX:+PerfDisableSharedMem"

JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=2"
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=8"

JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=10"
JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"
JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"
JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=32768"

JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"
JVM_OPTS="$JVM_OPTS -XX:CMSMaxAbortablePrecleanTime=60000"
JVM_OPTS="$JVM_OPTS -XX:CMSWaitDuration=10000"
JVM_OPTS="$JVM_OPTS -XX:+CMSEdenChunksRecordAlways"
JVM_OPTS="$JVM_OPTS -XX:+CMSParallelInitialMarkEnabled"

CASSANDRA-7486 (G1) Tuning

MAX_HEAP_SIZE="20G"
JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"
JVM_OPTS="$JVM_OPTS -XX:-UseBiasedLocking"
JVM_OPTS="$JVM_OPTS -XX:+PerfDisableSharedMem"
JVM_OPTS="$JVM_OPTS -XX:G1RSetUpdatingPauseTimePercent=5"

# set these to the number of cores
JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=8"
JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=8"
JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"

JVM_OPTS="$JVM_OPTS -XX:G1ReservePercent=15"
JVM_OPTS="$JVM_OPTS -XX:InitiatingHeapOccupancyPercent=25"
JVM_OPTS="$JVM_OPTS -XX:MaxGCPauseMillis=500"
JVM_OPTS="$JVM_OPTS -XX:G1HeapRegionSize=32m"

DSE 4.7 Improvements

DSP-4477 - Pivot faceting
DSP-4476 - Pagination
DSP-3740 - Live indexing
DSP-4091 - Remove support for stored copy fields
DSP-4703 - Query Solr from Spark
DSP-4518 - Improved memory usage for faceting
DSP-3931 - Filter cache sizing is now global across all segments
DSP-4475 - Verify/Integrate single pass distributed queries (SOLR-5768)
DSP-4072 - Fault-tolerant distributed queries
DSP-3958 - Improve shard routing by taking into account node health factors
DSP-3935 - Implement faceting inside CQL Solr queries

DSE vs ElasticSearch

Feature | DSE | ElasticSearch
Replication and multiple datacenters | Based on Cassandra, multi-DC support for free, real-time replication, high availability | Master-slave, long replication delay, doesn't do multi-DC well
Scalability | Hundreds of nodes, hundreds of terabytes | 10s of nodes, a couple of terabytes
Data loss possible | No | Yes
Primary Data Store | Yes | No
Operational Complexity | Single system | Multiple systems
Analytics | Yes | No
Dynamic Schema | Sorta | Sorta, slightly easier

Increased performance by 700% while growing data by 500%

Reduced operational costs by 40%

Deleted 15,000 lines of code
