Security's in Your DNA: Genomics for InfoSec
DESCRIPTION
Releases the blar.py tool, which creates a genomic encoding from text files. This encoding results in a lossy, highly compressible representation of the original file that can be used for rapid anomaly detection and forensic analysis.
TRANSCRIPT
Security’s in your DNA: Genomics for InfoSec
Rob Bird @conduit242
What is the most efficient way to analyze a sequence of events?
What’s a genome?
• The genetic material of an organism
• A redundant encoding of instructions
• A big sequence of letters
HIVtggatgggttaatttactccaagcaaagacaagatatccttgatctgtgggtctaccacacacaaggctacttccctgattggcagaattacacaccagggccaggagtcagatacccactaacatttggatggtgcttcaagctagtaccagttgatccagatgaagtagagaaggatactgagggagagaacaacagcctattacaccctatatgccaacatggaatggatgatgaggagaaagaagtattaaggtggaaatttgacagccgcctggcactaaaacacagagcccaagagatgcatccggagttctacaaagactgctgacacagaagttgctgacagggactttccgctgggactttccaggggaggtgtggtttgggcggagttggggagtggccaaccctcagatgctgcatataagcagctgcttttcgcttgtactgggtctctctaggtagaccagatccgagcctgggagctctctggctatctggggaacccactgcttaagcctcaataaagcttgccttgagtgctctaagtagtgtgtgcccgtctgttgtgtgactctggtaactagagatccctcagaccactctagactgagtaaaaatctctagcagtggcgcccgaacagggactcgaaagcgaaagtaagaccagagaagttctctcgacgcaggactcggcttgctgaggtgcacacagcaagaggcgagagcggcgactggtgagtacgccaatttttgactagcggaggctagaaggagagagatgggtgcgagagcgtcagtattaagcgggggaaaattagatgcatgggagagaattcggttaaggccagggggaaagaaaaaatatagaatgaaacatctagtatgggcaagcagggagctggaaagatttgcacttaaccctggcctgttagaaacaacagaaggatgtcaacaaataatagaacagttacaaccagctctcaagacaggaacagaagaacttagatcattatttaatacagtagtaaccctctattgtgtacatcaacggatagaggtaaaagacaccaaggaagctctagataaaatagaggaaatacaaaataagagcaagcaaaagacacaacaggcagcagctgccacaggaaacagcagcaatgtcagccaaaattaccctatagtgcaaaatgcacaagggcaaatggtacaccaggctgtatcacctaggacattgaatgcatgggtgaaggtaatagaagaaaaggctttcagcccagaagtaatacccatgttctcagcattgtcagaaggagccaccccacaagatttaaatatgatgctaaacatagtggggggacaccaggcagctatgcagatgttgaaagataccatcaatgaggaagctgcagaatgggacaggttacatccagtacaggcagggcctattccaccaggccaattgagagaaccaaggggaagtgacatagcaggaactactagtacccctcaagaacaaataggatggatgacaggcaacccacctattccagtgggagacatctataaaagatggataatcctgggattaaataaaatagtaagaatgtatagccctgttagcattttggacgtaaaacaagggccaaaagaacccttcagagactatgtagataggttctttaaaattctcagagctgagcaagctacacaggaggtaaaaggttggatgacagaaaccttgctggtccaaaatgcaaatccagattgtaagtccattttaagagcactaggaacaggagctacattagaagaaatgatgacagcatgccagggagtgggaggacccggccataaagcaagggttttggctgaggcaatgagtcaagtacaacatacaaacataatgatgcagagaggcaattttaggggtcagagaaggatgattaaatgtttcaattgtggcaaagaaggacacctagccagaaattgc
agagcccctaggaaaaagggctgttggaaatgtgggaaagagggacaccaaatgaaggactgcactgaaagacaggctaattttttagggaaaatttggccttccagcaaggggaggccaggaaactttccccagagcaggccagagccaacagccccaccagcagagctctttgggatggaggaagaaaaaacctccgctctgaagcaggagcagaaggacaggaaacaggacccacctttagtttccctcaaatcactctttggcaacgaccccttgtcacagtaaaagtagggggacagctaaaagaagctctattagatacaggagcagatgacacagtattagaagatataaatttgccaggaaaatggaaaccaagaatgatagggggaattggaggttttatcaaagtaaaacagtatgatcagatacttatagaaatttgtggaaaaaaggctataggtacagtattagtaggacccacacctgtcaacataattggaaggaatatgttgacccagattggatgtactttaaatttcccaattagtcctattgagactgtgccagtaaaattaaagccaggaatggatggcccaaaggttaaacaatggccattgacagaagaaaaaataaaagcattaacagaaatttgtacagatatggaaaaggaaggaaaaatttcaagaattgggcctgaaaatccatacaatactccaatatttgctataaagaaaaaagacagcactaaatggaggaaactagtagatttcagagagctcaataaaagaacacaagacttttgggaagttcaattgggaataccgcatccagcgggcctaaaaaagaaaaaatcagtaacagtactagatgtgggggacgcatatttttcagttcctttagatgaaagctttagaaagtatactgcgttcaccatacctagtacaaataatgagacaccaggaatcaggtatcaatacaatgtgctgccacagggatggaaaggatcaccggcaatattccagagtagcatgacaaaaatcttagagccctatagatcaaagaatccagaaataattatctatcaatacatggatgacttgtatgtaggatctgatttagaaatagggcagcatagaacaaaaatagaggagttgagagctcatctattgagctggggatttactacaccagacaaaaagcatcaaaaagaacctccatttctttggatggggtatgaactccatcctgacaaatggacagtacagcctatacaactgccagaaaaggatagctggactgtcaatgatatacagaagttggtggggaaactgaattgggcaagtcaaatttatgcagggattaaagtaaagcaactgtgcaaactcctcaggggagccaaagcactaacagaggtagtaactctgactgaggaagcagaattagaattggcagagaacagggaaattctaaaagaccctgtgcatggagtatattatgacccatcaaaagaattaatagcagaaatacagaaacaagggcaagaccaatggacatatcaaatttatcaagagccatttaaaaatctaaaaacaggaaaatatgcaagaaaaaggtctgctcacactaatgatgtaaagcaattagcagaagtggtgcaaaaggtggtcatggagagcatagtaatatggggaaagactcctaaatttaaactacccatacaaaaagagacatgggaaacatggtggatggactattggcaggctacctggattcctgaatgggagtttgtcaatacccctcccctagtaaaattgtggtaccagttagagaaagaccctatagcaggagcagaaactttctatgtagatggggcagccaatagggagactaagctaggaaaagcagggtatgtaactgacagaggaagacaaaaggttgtttccctaactgagacaacaaatcaaaagactgaactacatgcaatccatctagcc
ttacaggattcaggatcagaagtaaacatagtaacggactcacagtatgcattaggaatcattcaggcacaaccagacaggagtgaatcagaattagtcaatctaataatagaggagctaatagaaaaggacaaggtctacctgtcatgggtaccagcacacaaaggaattggaggaaatgaacaagtagataaattagtcagttccggaattaggaaggtgctgtttttagatgggatagataaagctcaagaagaacatgaaagatatcacagcaattggaaagcaatggctagtgattttaatctgccacctatagtagcaaaggaaatagtagccagctgtgataaatgccaactaaaaggagaagccatgcatggacaggtagactgtagtccaggaatatggcaattagattgcacacatctagaaggaaaagtaatcctggtagcagtccatgtagccagtggttatatagaagcagaagttatcccagcagaaacaggacaagagacagcatactttctactaaaattagcaggaagatggccagtaaaagtagtacacacagacaatggaggcaatttcaccagtgctgcagttaaagcagcctgttggtgggcaaatatccaacaggaatttgggattccctacaatccccaaagtcaaggagtagtggaatctatgaataaagaattaaagaaaatcatagggcaggtaagagatcaagctgaacatcttaagacagcagtacaaatggcagtattcattcacaattttaaaagaaaaggggggattggggggtacagtgcaggggaaaggataatagacataatagcaacagacatgcaaactaaagaattacaaaaacaaattacaaaaattcaaaattttcgggtttattacagggacagcagagatccaatttggaaaggaccagcaaaactactctggaaaggtgaaggggcagtagtaatacaggacaatagtgatatcaaggtagtaccaagaagaaaagcaaagatcattagggattatggaaaacagatggcaggtgatgattgtgtggcaggtagacaggatgaggattagaacatggaacagtttagtaaaatatcatatgtatgtctcaaagaaagctcgaaagtggctctatagacatcactatgatagcaggcatccaaaagtaagttcagaagtacacatcccactaggggatgctagattagtagtaagaacatattggggtctgcatacaggagaaaaagactggcaattgggtcacggggtctccatagaatggaggctaagaagatatagcacacaaatagatcctgacctagcagaccaactaattcatctgcattattttgactgtttttcagaatctgccataaggagagccatattaggacaagtagttagccctaggtgtgtatatccaacaggacataaccaggtaggatccctacaatatctagcactgaaggcattagtaacaccaataaagacaagaccacctttgcctagtgttaagatattaacagaggatagatggaacaagccccagaagaccaggggccacagagggaaccatacaatgaatggatgttagaactgttagaagatcttaaacatgaagcagttagacactttcctagaccatgggctaggacaacatatatataacacctatggggatacttgggaaggagtcgaagctatagtaagaattttgcaacaactactgtttgttcatttcagaattgggtgccaacatagcagaataggcattattcaagggagaagagtcagaaatggagccggtagatcctaacttagagccctggaaccatccgggaagtcagcctacaactgcttgtaccaagtgttactgtaaaaagtgttgctatcattgcctagtttgctttctgaacaaaggcttaggcatctcctatggcaggaagaagcggagcaagcgacgacgaactcctcacagc
agtaaggatcatcaaaatcctataccaaagcagtaagtatcagtaattagtatatgtaatgagtcctttagaaatctgtgcaatagtaggattgatagtagcgctaatcatagcaatagttgtgtggactatagtaggtatagaatataagagattgttaaagcaaaggaaaatagacaggttaattaagaaaatacgagaaagagcagaagacagtggcaatgagagtgatggggacatggatgaattggcaaaacttgtggagagggggaactatgatcttggggatgttaatgatctgtagtactgcagaaaacttgtgggttactgtctactatggggtacctgtgtggaaagatgcagaaaccaccttattttgtgcatcagatgctaaagcatacgacacagaggcgcataatgtctgggctacacatgcctgtgtacccacagaccccaacccacaagaaatatatttggaaaatgtgacagaagagtttaacatgtggaaaaataacatggtagagcagatgcatacagatataatcagtctatgggatcaaagcctaaagccatgtgtacagttaacccctctctgcgttactttaaattgtaataacatcaccatcaataacatcaccaccaacatcactgaggacatgagaggagaaataaaaaactgctcgtacaatatgaccacagtattaagggataagagaaggaaagtgtattcacttttttatagacttgatatagtaccacttgatgaggggaataataactctgctgggagtagtgactatagattaataaattgtaatacctcaaccataacacaagcctgtccaaaggtctcttttgacccaattcctatacattattgtgctccagctggttttgcgattctaaaatgtaaggatccagatttcaatggaacagggccatgcaagaatgtcagcacagtacaatgcacacatggaatcaagccagtagtatcaactcaactgctgttaaatggcagtctagcagaaggaaaggtaagaattagatctgaaaatattacaaacaatgccaaaaacataatagtacaacttgtcaagcctgtaaaaattaattgtgtcagacctaacaacaatacaagaacaagtgtacgtataggaccaggacaaacattctatgcaacaggtgaaataataggggatataagacaagcattttgtactgtcaatgaatcagaatggaatgaaactttacaacaggtagctacgcaattaagagaacactttgagaacaaaacaataaaatttactaactcctcaggaggggatttagaaattacaacacatagctttaattgtggaggagaatttttctattgtaatacatcaggcctgtttaatagcacctggaataataataataccagggagaagataaatggtacagagtcaaatagcactataactctccattgcagaataaagcaaattataaataggtggcaggaagtaggacaagcaatgtatgcccctcccatcccaggagtaataaattgtagatcaaacattacaggactaatattaacaagagatggtggggatggggataacaatacggaaatcttcagacctggaggaggaaatatgaaggacaattggagaagtgaattatataagtataaagtagtaaaaattgaaccactgggagtagcacccaccagggctaagagaagagtggtggagagagcaaaaagagcagttggaataggagctgttttccttgggttcttaggagcagcaggaagcactatgggcgcggcgtcaataacgctgacggtacaggccagacaattattgtctggcatagtgcaacagcaaagcaatttgctgagggctatagaggctcaacaacatctgttgaaactcacggtctggggcattaaacagctccaggcaagagtccttgctgtggaaagatacctgcaggatcaacagctcctagga
atttggggctgctctggaaaactcatctgcaccactaatgtgccctggaactctagttggagtaataaatctcagagtgagatatgggagaacatgacctggctgcaatgggataaagaaattagcagttacacaggcataatatataaactaattgaagaatcgcagaaccagcaggaaaagaatgaacaagacttattggcattggacaagtgggcaagtctatggaattggtttgaaatatcaaagtggctgtggtatataaaaatatttataatgatagtaggaggattaataggattaagaatagtttttgctgtgctttctataatcaatagagttaggcagggatactcacctttgtcatttcagacccacaccccaaacccaagggaacccgacaggcccgaaagaatcgaagaagaaggtggagagcaaggcagagacagatcgatacgcttagtgagcggattcttagcacttgcctgggacgacctacggagcctgtgccttttcagctaccaccgcttgagagacttcatcttgattgcagcgaggactgtggaacttctgggacacagcagtctcaaggggttgagactggggtgggaaagcctcaagtatctggggaatcttctgctatattggagtcaggaactaaaaattagtgctgttaatttagttgataccatagcaatagcagtagctggctggacagataggattatagaaacaggacaaagattttgtagagctcttctcaacgtacctagaagaatcagacaaggatttgaaagggctctgctataacatgggtggcaagtggtcaaaaagtagcatagtgggatggcctgagattagggaaagaatgaggcgtgctcctccagcagcaaaaggagtaggagcagtatctcaagatttagataaatttggagcagttacaagcagtaatatgaatcaccctagttgcgtctggctggaagcacaagaggaaacggaggtaggctttccagtcaggccacaagtacctctaaggccaatgacttacaagggagcagtggatctcagccattttttaaaagaaaaggggggactggaagggttaatttactccaagcaaagacaagatatccttgatctgtgggtctaccacacacaaggctacttccctgattggcagaattacacaccagggccaggagtcagatacccactaacatttggatggtgcttcaagctagtaccagttgatccagatgaagtagagaaggatactgagggagagaacaacagcctattacaccctatatgccaacatggaatggatgatgaggagaaagaagtattaaggtggaaatttgacagccgcctggcactaaaacacagagcccaagagatgcatccggagttctacaaagactgctgacacagaagttgctgacagggactttccgctgggactttccaggggaggtgtggtttgggcggagttggggagtggccaaccctcagatgctgcatataagcagctgcttttcgcttgtactgggtctctctaggtagaccagatccgagcctgggagctctctggctatctggggaacccactgcttaagcctcaataaagcttgccttgagtgcb
Basics
• Letters (nucleotides)
  – 4 in DNA: A, G, C, T
• Codons
  – Triplets of nucleotides, e.g. GAA
• Genomes have coding regions (proteins) & non-coding regions (other)
• One strand can be read forward, the other in reverse
It’s all about the Codons
• The Genetic Code is a dictionary of Codons
• 64 entries (4^3)
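The 64 entries are simply every ordered triplet over the four-letter alphabet. A minimal sketch (the variable names are illustrative, not from the talk):

```python
from itertools import product

NUCLEOTIDES = "ACGT"

# Every ordered triplet of the four DNA letters is one codon.
codons = ["".join(triplet) for triplet in product(NUCLEOTIDES, repeat=3)]

print(len(codons))   # 64, i.e. 4 ** 3
print(codons[:4])    # ['AAA', 'AAC', 'AAG', 'AAT']
```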
Analyzing Genomes
• Compare them to each other
  – Alignments (e.g. Smith-Waterman, etc.)
  – Distances
    • Levenshtein (edit) distance (metric)
    • Longest Common Subsequence distance (metric)
    • Normalized Compression Distance (metric)
  – Optimal Grammars
    • Pisa.c: Optimal sequence grammar search using hyperstring encodings
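As one concrete example of these distances, Normalized Compression Distance can be approximated with any off-the-shelf compressor. The sketch below assumes zlib as the compressor and uses short made-up DNA-like strings; the talk does not say which compressor or data it had in mind.

```python
import random
import zlib

def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (level 9)."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance: near 0 for similar inputs, near 1 for unrelated ones."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

# Illustrative inputs (assumptions, not data from the talk).
unit = b"tggatgggttaatttactccaagcaaagacaagatatccttgatctgtgggtctacc"
base = unit * 4
near = base.replace(b"caag", b"ctag")          # a few point mutations
rng = random.Random(1)
unrelated = "".join(rng.choice("acgt") for _ in range(len(base))).encode("ascii")

print("base vs. mutated copy:", round(ncd(base, near), 3))
print("base vs. unrelated   :", round(ncd(base, unrelated), 3))
```

The mutated copy should score much closer to the base sequence than the unrelated one, because the compressor can reuse what it learned from the first string.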
Analyzing Genomes
• Look for interesting regions
  – Information gain (Kullback-Leibler divergence)
  – Coding Costs (Kolmogorov Complexity)
  – Decaying Coding Costs (Lossy Kolmogorov Complexity)
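To make "information gain" concrete, here is a hedged sketch that scores a window by the Kullback-Leibler divergence between its 3-letter (codon-sized) counts and those of a background sequence. The add-one smoothing, the choice of k = 3 and the toy sequences are all assumptions made for illustration.

```python
import math
from collections import Counter

def kmer_counts(seq, k=3):
    """Count every overlapping k-letter substring."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def kl_divergence(window, background, k=3):
    """D_KL(window || background) in bits over k-mers, with add-one smoothing."""
    p_counts, q_counts = kmer_counts(window, k), kmer_counts(background, k)
    vocab = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) + len(vocab)
    q_total = sum(q_counts.values()) + len(vocab)
    total = 0.0
    for kmer in vocab:
        p = (p_counts[kmer] + 1) / p_total
        q = (q_counts[kmer] + 1) / q_total
        total += p * math.log2(p / q)
    return total

# Toy data: a mostly regular background with a small unusual stretch.
background = "acgt" * 50 + "ttttgggg" * 5
typical = "acgt" * 5
odd = "ttttgggg" * 3

print("typical window:", round(kl_divergence(typical, background), 3))
print("odd window    :", round(kl_divergence(odd, background), 3))
```

A window that looks like the background scores near zero; a window dominated by rare k-mers scores higher and is a candidate "interesting region".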
Rule 1: Size doesn’t matter
Smallest (almost)
• Mycoplasma genitalium
• 580,000 bp
Largest
• Polychaos dubium
• 670 billion bp
Rule 2: Repetition matters
Don’t say that again
• Sections of DNA that do not repeat are the most important
• Protein coding genes and RNA coding genes are non-repetitive
• Higher-order creatures are largely repetitive
Rule 3: Compression is hard
Putting the squeeze on
• Normal compressors ~ 2-bit codes
• Special genetic compressors exist
• Compressibility equates to sequence predictability for the model in use
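A quick way to see the link between compressibility and predictability is to measure bits per base with a general-purpose compressor. The sketch below uses zlib on two synthetic sequences (a seeded random one and a perfectly repetitive one); both are stand-ins, and a real genome would land somewhere in between.

```python
import random
import zlib

def bits_per_base(seq: str) -> float:
    """Compressed size in bits divided by sequence length."""
    compressed = zlib.compress(seq.encode("ascii"), 9)
    return 8 * len(compressed) / len(seq)

rng = random.Random(0)
random_seq = "".join(rng.choice("acgt") for _ in range(100_000))  # unpredictable
repetitive_seq = "acgt" * 25_000                                  # fully predictable

print("random    :", round(bits_per_base(random_seq), 2), "bits/base")
print("repetitive:", round(bits_per_base(repetitive_seq), 4), "bits/base")
```

With four equally likely letters, about 2 bits per base is the floor for an unpredictable sequence, which is why a generic compressor ends up close to a plain 2-bit code. The repetitive sequence costs almost nothing.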
So what does this have to do with security???
A Question
If we could convert sequences of logs, packets, etc. to a genomic encoding, could we use genomic analysis to dramatically speed up & improve forensics, incident response and anomaly detection?
YES
How?
• Step 1: Convert events into alphabet
• Step 2: Convert stream into string of letters
• Step 3: Money bath
A Naïve Solution
• Step 1: Hash each input, use hash value as a letter
• Step 2: Create stream of hash values
• Step 3: #fail
Why?
Answer
• The alphabet is too big
• The stream will need at least 2^(2^<hash_key_size>) examples
• Stream is virtually unpredictable
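A small sketch of why the naïve approach fails: an ordinary hash maps near-identical events to unrelated values, so almost every event becomes a brand-new letter. The log lines are invented for illustration, and SHA-256 truncated to 8 hex digits is an assumed stand-in for "hash each input".

```python
import hashlib

# Hypothetical log events; three of them differ only by a port or address.
events = [
    "Failed password for root from 10.0.0.1 port 50122",
    "Failed password for root from 10.0.0.1 port 50123",
    "Failed password for root from 10.0.0.2 port 50124",
    "Accepted password for alice from 10.0.0.9 port 40000",
]

def naive_letter(event: str) -> str:
    """Use the first 8 hex digits of SHA-256 as the event's 'letter'."""
    return hashlib.sha256(event.encode("utf-8")).hexdigest()[:8]

letters = [naive_letter(e) for e in events]
for event, letter in zip(events, letters):
    print(letter, event)

print("distinct letters:", len(set(letters)), "of", len(events))
```

Every line gets its own letter even though three of them describe essentially the same event, so the resulting stream has no repeated structure to analyze.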
Enter blar.py
WTF is a ‘blarp’?
• Let’s ask Google
• The sound a fat person makes being fat
• The sound of taking big fat data and making it useful & efficient small data
• A cool little python tool for creating and analyzing genomic encodings
• The last two will not be found on Google…yet
Idea
• We want similar events to be represented by a single letter
• Hashes are random projections
• Let’s use geometry instead
Position in space
• To precisely locate something in a D-dimensional space, you need its distances to n = D+1 reference points
• Key notion: To get something’s general area, you can use n << D+1 reference points
Locality-Sensitive Hashing
• Introduced by Indyk and Motwani in the late 90’s
• Used within indexing for text lookups on massive data sets
• Many hashes; data-type dependent
• Question: What if you thought about it as a ‘general area’ hash instead?
How it works
• Basic type: Random Projection
• Given a numeric vector (e.g. 1, 15, 3, 14.8), calculate its dot product vs. a random vector
• If result is positive, call it a ‘1’
• If negative, call it a ‘0’
• Repeat
• Concatenate binary together; result is LSH
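The steps above can be sketched in a few lines of standard-library Python. This is not the blar.py implementation: the Gaussian random vectors, the seed and the function names are assumptions, and a dot product of exactly zero is treated as a ‘1’.

```python
import random

def make_random_vectors(bits, dim, seed=0):
    """One random vector per output bit (Gaussian entries, an assumed choice)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]

def random_projection_hash(vector, random_vectors):
    """Concatenate the sign bits of the dot products into a bit string."""
    out = []
    for r in random_vectors:
        dot = sum(a * b for a, b in zip(r, vector))
        out.append("0" if dot < 0 else "1")
    return "".join(out)

vectors = make_random_vectors(bits=4, dim=4)
v = [1, 15, 3, 14.8]
scaled = [2 * x for x in v]      # same direction, different length
opposite = [-x for x in v]       # opposite direction

print(random_projection_hash(v, vectors))
print(random_projection_hash(scaled, vectors))    # identical to the hash of v
print(random_projection_hash(opposite, vectors))  # every bit flipped
```

Only the direction of the vector matters, so inputs pointing the same way always share a hash, and inputs pointing in nearby directions usually do. That is the "general area" behavior the previous slide asks for.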
Blar.py Pipeline
• Vectorize Input
• Find Locality-Sensitive Hash
• Convert to UTF-16 char
• Output stream of UTF-16
• Analyze sliding window over genome stream
• Score
• Chart stuff
Vectorizing
• Idea: Count things that matter, take measurements, etc. and create an array to hold that information
• Where the rubber meets the road
• Lots of chances for domain expertise
Basic Vectorizing in Blar.py
• Basic model: character n-grams
• Also known as Markov chains or Bag of Letters
• Counts up sliding windows of text
• E.g. 2-grams for ‘sassyfrassy’: sa: 1, as: 2, ss: 2, sy: 2, yf: 1, fr: 1, ra: 1
• For 256^2 length array: (1,0…0,2,0…0,2,0…
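The 2-gram counts for ‘sassyfrassy’ can be reproduced with a short sketch (the helper name is illustrative):

```python
from collections import Counter

def char_ngrams(text, n):
    """Count every overlapping n-character window in the text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

counts = char_ngrams("sassyfrassy", 2)
print(dict(counts))
# {'sa': 1, 'as': 2, 'ss': 2, 'sy': 2, 'yf': 1, 'fr': 1, 'ra': 1}
```

Storing these counts densely needs one slot per possible 2-gram (256^2 for byte characters), which is the cost the next slide's hashing trick avoids.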
Let’s Vectorize Better
• Use Feature Hashing otherwise known as the hashing trick
• Find hash mod length and increment counter for each model pattern
• Permits lossy counting with graceful random collisions
• Blar.py uses length 64 by default and xxHash
Blar.py code
    import numpy
    import xxhash

    def feature_hash_string(s, window, dim):
        # Generate window-char Markov chains & create feature hashes in [0, dim)
        chains = [xxhash.xxh32(s[i:i + window].encode("utf-8")).intdigest() % dim
                  for i in range(len(s) - (window - 1))]
        # Initialize counter array
        counters = numpy.zeros(dim)
        # Count instances of feature hashes
        for h in chains:
            counters[h] += 1
        # Return feature hash count vector
        return counters
Now let’s find the LSH
    # Use random projection for LSH and output a UTF char for the locality-sensitive hash
    # COMP_VECTORS is assumed to be a module-level list of `width` random vectors (not shown on the slide)
    def locality_hash_vector(v, width):
        bits = numpy.zeros(width, dtype=int)
        for x in range(width):
            projection = numpy.dot(COMP_VECTORS[x], v)
            # Negative projection -> 0, otherwise 1
            bits[x] = 0 if projection < 0 else 1
        # Return unicode char equal to the LSH
        return chr(int(''.join(map(str, bits)), 2))
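To show how the two functions fit together, here is a self-contained, standard-library sketch of the whole encoding step, from text lines to a genome string. It is an illustration, not blar.py itself: `zlib.crc32` stands in for xxHash, the 2-character n-gram width and the random seed are assumptions, and letters are shifted into ‘A’..‘P’ only so the output is printable.

```python
import random
import zlib

DIM = 64    # feature-hash length
BITS = 4    # LSH width in bits
NGRAM = 2   # n-gram width (assumed for this sketch)

_rng = random.Random(42)
COMP_VECTORS = [[_rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(BITS)]

def feature_hash(line, window=NGRAM, dim=DIM):
    """Hashing trick: count n-grams into a fixed-length vector."""
    counters = [0] * dim
    for i in range(len(line) - window + 1):
        gram = line[i:i + window].encode("utf-8")
        counters[zlib.crc32(gram) % dim] += 1   # crc32 stands in for xxHash
    return counters

def lsh_letter(vector):
    """Random-projection LSH, rendered as one printable letter."""
    value = 0
    for r in COMP_VECTORS:
        dot = sum(a * b for a, b in zip(r, vector))
        value = (value << 1) | (0 if dot < 0 else 1)
    return chr(ord("A") + value)

def genome(lines):
    """One letter per input line."""
    return "".join(lsh_letter(feature_hash(line)) for line in lines)

toy_lines = [
    "Mary had a little lamb whose fleece was white as snow.",
    "Mary had a little lamb whose fleece was white as snow.",
    "some more strings",
    "some more strings",
    "FOO BAR BAS",
]
print(genome(toy_lines))
```

Identical lines always map to the same letter. Similar lines often do as well, but with only 4 bits that is a tendency rather than a guarantee.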
Blar.py analysis
• Analyzes 4 character sequences and assigns a decaying version of the optimal coding cost to each line
• Tells you how interesting a certain event is relative to everything else in the genome, accounting for ordering
• Blar.py genomes are extremely compressible, especially with bzip
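The slides do not show how the decaying coding cost is computed, so the following is only one plausible reading, offered as a sketch. Each 4-character window is charged -log2 of its estimated probability, where the estimate comes from exponentially decayed counts of earlier windows. The decay rate, the smoothing and the assumed 16-letter alphabet are all choices made for this illustration.

```python
import math

def decaying_coding_costs(genome, window=4, decay=0.99, n_possible=16 ** 4):
    """Cost in bits of each sliding window, given decayed counts of earlier windows.

    n_possible is the assumed number of distinct windows (16 letters, width 4).
    """
    counts = {}
    total = 0.0
    costs = []
    for i in range(len(genome) - window + 1):
        gram = genome[i:i + window]
        # Smoothed probability: unseen windows get a small but nonzero share.
        p = (counts.get(gram, 0.0) + 1.0 / n_possible) / (total + 1.0)
        costs.append(-math.log2(p))
        # Age every earlier observation, then record the current window.
        for key in counts:
            counts[key] *= decay
        total = total * decay + 1.0
        counts[gram] = counts.get(gram, 0.0) + 1.0
    return costs

stream = "A" * 40 + "Z" + "A" * 40   # one unusual event in a steady stream
costs = decaying_coding_costs(stream)
peak = max(range(1, len(costs)), key=lambda i: costs[i])
print("most expensive window after the first starts at index", peak)
```

Windows that keep recurring become cheap, while a window unlike anything recent is expensive. The decay lets old history fade, so the score also depends on when things were last seen, not just how often.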
Blar.py defaults (ATM)
• 4 character sliding windows
• 4 bit hashes
• 64d feature hashes
• Outputs a list of the most interesting scores
• Outputs a few bad charts
Blar.py vs. Toy File
1. Mary had a little lamb whose fleece was white as snow.
2. Mary had a little lamb whose fleece was white as snow.
3. Mary had a little lamb whose fleece was white as snow.
4. Mary had a little lamb whose fleece was white as snow.
5. Mary had a little lamb whose fleece was white as snow.
6. Gary had a little hand whose hair was as white as blow.
7. some more strings
8. some more strings
9. some more strings
10. some more strings
11. some more strings
12. John McAfee was the keynote for Skytalks.
13. John McAfee was the keynote for Skytalks.
14. John McAfee was the keynote for Skytalks.
15. some more strings
16. some more strings
17. some more strings
18. John McAfee was the keynote for Skytalks.
19. John McAfee was the keynote for Skytalks.
20. FOO BAR BAS
Blar.py vs. Toy File
Blar.py vs. Toy File
(Look Raffy, I’m using the completely inappropriate chart type)
Blar.py vs. BlueGene/L
• From the USENIX Computer Failure Data Repository
• 1.2 GB combined log file from 131,072 processors for six months
• 119 MB compressed with gzip
• 9.4 MB blar.py genome
• Blar.py ~1000 lines/sec
Blar.py vs. BlueGene/L
TL;DR
• Fast, accurate, free: Blar.py genomic encoding tool provides very fast, low noise anomaly detection
• Stop searching in a crisis: Great way to quickly explore data for IR, forensics, etc., especially from unknown sources
• Want it? Follow me @conduit242 for the GitHub posting announcement