THE SEQUENCE ALIGNMENT ALGORITHMS FOR IOT INTRUSION DETECTION
NONLINEAR PHENOMENA IN COMPLEX SYSTEMS
XXVII International Seminar Minsk, BELARUS -- May 19-22, 2020
The reported study was funded by RFBR according to the research project No. 18-29-03102
Dr. Maxim KALININ, and Vasiliy [email protected]
Major features of the IoT infrastructure
Internet of Things is a communication concept for large-scale
and flexible cyber space:
• self-organizing network of devices
• ad-hoc network with p2p communications
• dynamically changing routing of network traffic
• large number of physically moving hosts (devices)
• huge amount of security-relevant data for decision making
Issues of providing security in the IoT
Technical
• Low performance of IoT controllers
• High demand on bandwidth and transmission speed (real-time security)
Financial
• Cyberphysical system cost increase
• Service cost increase
Human Factor
• System engineer ≠ security specialist
• Security = documentation & information closeness
[Figure: A sample of industrial IoT]
[Diagram: Problems of security providing in IoT; Current requirements for effective protection of IoT. Labels: "Performance-Big Data" compromise; high mobility of environment; poor traditional protection methods; heterogeneity; real-time processing; large scale of IoT network]
IoT is a growing network of smart devices where one small device can exchange information with other connected devices. Network security is one of today's concerns for IoT. Intrusion detection for IoT exposes the extent to which the IoT is vulnerable to attacks and how such attacks can be detected to prevent damage targeted at smart environments.
Intrusion detection in the IoT
Traditional approaches for intrusion detection in an IoT Intrusion Detection System (IDS)
Anomaly detection
• Advantages: global network view; able to detect new intrusions; uses fewer rules (in comparison with the signature-based method)
• Disadvantages: high level of errors; difficult adaptation (re-learning) to changing conditions (new profiles, dynamic anomalies)
Misuse detection
• Advantages: signature matching; reliable, resource and time efficient; low false alarm rate for well-known intrusions
• Disadvantages: depends on the signatures of intrusions; limited ability to detect unknown attacks; huge database of patterns
IDS work stages
Intrusion detection is usually done in 3 steps:
(1) Monitoring the security-relevant events and data concerning the system state
(2) Sequence extraction and misuse detection by pattern matching: the captured sequence of monitored signs (operational sequences, packet sequences, etc.) is compared with the pattern sequence of attack signs
(3) Signaling the intrusion to trigger the attack response and countermeasures
A traditional intrusion detection based on a pattern matching
Identity checking is the method used for sequence matching today. You need a pattern to check identity between the stored chain and the collected chain of intrusion acts. Therefore, a wide class of attacks can bypass detection in IoT if identity checking is applied.
Pattern matching is a major technique for IDS to detect harmful activity, but …
(1) Specific intrusions in IoT (e.g., forced power consumption, forced topology changing, etc.) have a divided sequence of operations. The divided sequence has time gaps in the chain of intrusion acts. Some attacks can also have a non-linear sequence of acts.
(2) There is a new class of polymorphic attacks which have mutations in some acts of the attack operational sequence due to the flexibility of IoT or specific intruder's activity. None of the traditional pattern matching methods detects such sequences of intrusion acts.
[Figure: the pattern sequence (signature) compared act by act with the captured operational sequence.
Exact copy: pattern A B C D, captured A B C D: identity.
POLYMORPHIC SEQUENCE: act B is replaced by a slightly different act B: identity, no identity, identity.
TIME GAPS: pattern A B C, captured A X B C, where X is a gap: identity, no identity, no identity.]
Pattern matching DOES NOT RECOGNIZE such kinds of intrusions by identity matching, but THERE IS AN INTRUSION.
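The failure mode described on this slide can be sketched in a few lines of Python. This is only an illustration, not the authors' detector: the function name `contains_signature` and the act labels are hypothetical, and the check is a plain contiguous identity search.

```python
def contains_signature(captured, pattern):
    """Return True if `pattern` occurs in `captured` as a contiguous, identical run."""
    n, m = len(captured), len(pattern)
    return any(captured[i:i + m] == pattern for i in range(n - m + 1))


signature = ["A", "B", "C"]            # stored pattern sequence (hypothetical acts)

exact = ["A", "B", "C", "D"]           # captured chain identical to the pattern
polymorphic = ["A", "B*", "C"]         # act B replaced by a slightly different act
gapped = ["A", "X", "B", "C"]          # unrelated act X fills a time gap

print(contains_signature(exact, signature))        # True
print(contains_signature(polymorphic, signature))  # False
print(contains_signature(gapped, signature))       # False
```

Only the exact copy is reported; the polymorphic and the divided (gapped) chains pass unnoticed, which is the motivation for replacing identity checks with similarity checks.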
Approach of bioinformatics: sequence alignment and similarity
In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein chains to identify regions of SIMILARITY that may be a consequence of functional, structural, or evolutionary relationships between the coded chains.
THE GOAL IS THE ALIGNMENT OF LENGTHY, HIGHLY VARIABLE AND NUMEROUS SEQUENCES
Sequence alignment methods:
a) global alignment, which stretches the smaller chain along the bigger one
b) local alignment, which localizes the smaller chain on a region of the bigger one
Both can be applied to chains: global alignment is often applied when both sequences have approximately equal length, and local alignment is applied when one sequence is longer than the other.
Sample 1. Identification of similarity of mammals' DNA chains
Sample 2. Similarity identification of coronavirus origin
[Figure labels: IDENTITY, SIMILARITY, ALIGNMENT]
Novelty for IoT-IDS: we transform IDENTITY checking into SIMILARITY checking
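The difference between global and local alignment can be illustrated with a small dynamic-programming sketch. The scoring values (match = 1, mismatch = -1, gap = -1), the function name `alignment_score`, and the two test chains are assumptions chosen for illustration; they are not taken from the slides.

```python
def alignment_score(a, b, local, match=1, mismatch=-1, gap=-1):
    """Best alignment score of a and b.

    local=False: global alignment (Needleman-Wunsch style).
    local=True:  local alignment (Smith-Waterman style).
    """
    rows, cols = len(a) + 1, len(b) + 1
    f = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment pays for leading gaps.
        for i in range(1, rows):
            f[i][0] = i * gap
        for j in range(1, cols):
            f[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = f[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, f[i - 1][j] + gap, f[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
            f[i][j] = cell
            best = max(best, cell)
    return best if local else f[-1][-1]


short, long_ = "TGA", "CCTGACC"
print(alignment_score(short, long_, local=True))   # 3
print(alignment_score(short, long_, local=False))  # -1
```

The local score finds the short chain inside the long one at full weight, while the global score is reduced by the four gaps needed to stretch the short chain along the long one.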
The sequence alignment algorithms analysis
8 alignment algorithms were reviewed and analyzed:
• PSI-BLAST starts with an element-wise comparison of each record in the database with the given sequence and then builds a local multiple alignment of the sequences. The process is repeated and the result is refined during several iterations until the specified number of cycles is exhausted or until the results of 2 sequences coincide.
• FASTA provides optimized local alignment algorithms combined with analysis of the substitution (replacement) matrix.
• Smith-Waterman (SW) and Needleman-Wunsch (NW) algorithms are the classical algorithms for local and global DNA sequence alignment.
• Hidden Markov Model (HMM) is a statistical model that simulates the operation of a process similar to a Markov process with unknown parameters, and the task is to restore the unknown parameters based on observables. The method simultaneously performs alignment and estimation of the probabilities of deletions and insertions.
• LASTZ is designed to pre-process one sequence or set of sequences and then align them.
• Mauve is an alignment system that combines evolutionary event analysis with multiple local alignments.
• Seaweed is an improved dot-alignment algorithm that uses implicit matrices with the highest score. It is very sensitive to local alignments, but does not take into account the blank spaces (time gaps) in the sequences.
Overview of the sequence alignment algorithms
ALGORITHM | TYPE OF ALIGNMENT | MIN. ACCURACY, % | TIME, SEC | COMPLEXITY | WINDOW SIZE | MAX. INPUT (MLN. PAIRS)
PSI-BLAST | Multiple local | 70 | 160 | O(M) | 128 | 10,000
FASTA | Global | 78 | 730 | O(M+N) | 200 | 2,000
Smith-Waterman, Needleman-Wunsch | Local (global and semi-local are available, too) | 80 | 6630 | O(m*n) | >10 | 1
Hidden Markov Model | Multiple local | 90 | 5200 | O(mn^2) | >30 | 10
LASTZ | Local | 90 | 1120 | O(M+N) | 255 | 500
Mauve | Local | 98 | 40 | O(G^2 n + Gn log Gn) | >10 | 500
Seaweed | Dot | 90 | 370 | O(mnw) | 100 | 30…1,000
In the complexity column, G is the number of genomes, n is the average length of a genome, and w is the length of the sequence.
Highlighted in the original table: best accuracy, best speed, best sensitivity.
Selected for our task: Smith-Waterman/Needleman-Wunsch and Mauve.
Summary of the sequence alignment algorithms
Accuracy:
• SW/NW, Seaweed, LASTZ, Mauve, and HMM have accuracy of at least 80%
IoT-IDS should have a high attack detection rate:
• Among the observed algorithms, the shortest operation time when processing big sequences (>100 Mb) is demonstrated by Mauve (~40 sec)
• With a further increase in the sequence length, the operation time increases significantly (up to ~24 hours per 1 Gb with different algorithms)
Maximal input size determines the number of pairs that are simultaneously fed to the algorithm to detect the intrusions:
• All algorithms can receive at least 1 million pairs at the same time, and the highest value is achieved by PSI-BLAST (10,000 million pairs). This is not essential for an IDS, as it works with a significantly smaller input
The size of the sequence window allows us to analyze the sequence separated into blocks:
• The smaller the window size, the more accurately anomalies are detected. But in this case, the time for the anomaly search increases
• SW/NW and Mauve algorithms can apply the smallest window size, which indicates their advantage over other algorithms in local positioning in the compared sequences
SW/NW and MAUVE algorithms have been selected for the further research and experiments to design a new technique of intrusion detection in the IoT
IoT-IDS architecture and the research methodology
• adapt the DNA-based sequence alignment algorithms to intrusion detection and run them on the IoT device (Raspberry Pi)
• evaluate the performance of the algorithms, and analyze it to understand the resource (CPU, RAM, disk) workload in the IoT device
• evaluate the accuracy of the IoT intrusion detection algorithms
Experimental environment
• Biopython software kit: modules Bio.Seq, Bio.Align, Bio.SeqRecord
• Traffic datasets (NSL-KDD, Bot IoT, IoT-23) and "Traffic Generator" (.bin files) were applied for testing the algorithms. Datasets contain DoS attacks, R2L attacks, U2R attacks, Probe, and Normal activities.
• Nucleotides are coded, and the normal sequence is selected and marked.
The sample of nucleotide coding
Suppose we have a sample to match like below
[Figure: the sample record]
The sample is encoded in a way suitable for sequence alignment: 0=AGT, tcp=AAT, ftp_data=TGC, SF=GCA; the prefixes TTT, CCC, AAA are written before the fields of protocol, service, and flag.
Then the encoded sample is the sequence to be compared against the pattern to find the anomaly.
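The nucleotide coding of a traffic record can be sketched as follows. The four codes and the three prefixes come from the slide; everything else is an assumption made for illustration: the helper name `encode_record`, the reading of `0` as a leading numeric field of the record, and the pairing of the prefixes TTT, CCC, AAA with protocol, service, and flag in that order.

```python
# Codes quoted on the slide; a complete table would cover every field value.
CODES = {"0": "AGT", "tcp": "AAT", "ftp_data": "TGC", "SF": "GCA"}

# Prefixes written before the protocol, service, and flag fields
# (the one-to-one order is an assumption).
PREFIXES = {"protocol": "TTT", "service": "CCC", "flag": "AAA"}


def encode_record(first_field, protocol, service, flag):
    """Encode one record as a nucleotide string suitable for alignment."""
    parts = [
        CODES[first_field],
        PREFIXES["protocol"] + CODES[protocol],
        PREFIXES["service"] + CODES[service],
        PREFIXES["flag"] + CODES[flag],
    ]
    return "".join(parts)


print(encode_record("0", "tcp", "ftp_data", "SF"))  # AGTTTTAATCCCTGCAAAGCA
```

The resulting string can then be aligned against the pattern like any DNA chain.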
Pattern selection task
The calculation of the pattern by which the packets in the network are filtered is the key step for intrusion detection with a sequence alignment algorithm.
For the genetic (sequence alignment) algorithms, an intrusion pattern is not a pattern sequence of a single packet.
There is an "average packet" that is built for the whole normal traffic and thus it is similar to every packet of the normal traffic.
ADVANTAGES:
• PERFORMANCE: The optimal length for the pattern has to be chosen, since the pattern is compared with the whole incoming traffic and thus directly affects the speed of packet analysis and the amount of memory for storing it.
• CONTENT: The pattern should fit as many different "forms" of traffic as possible, i.e. the pattern must be as similar to all normal traffic as possible.
• DATABASE VOLUME: The database of intrusion patterns is reduced because you do not need to collect all of the possible packets which contain the attack signatures. Instead, you need to collect the "average pattern".
Thus, the selected signature should not only contain A SUFFICIENT NUMBER OF BYTES to "describe" all kinds of traffic, but should also be SMALL IN LENGTH in order to keep resource consumption acceptable and packet analysis performance normal.
A pattern calculation algorithm
1. An encoded normal sequence is randomly selected.
2. The selected signature is compared to the other normal signatures in the dataset using the sequence alignment algorithm.
3. The sum of all the received offsets and the threshold are calculated.
4. This operation is repeated with each normal signature.
5. The pattern is considered to be the one whose sum of thresholds is the biggest. It is considered to be the "closest" (i.e. the most similar) one to the rest of the normal signatures and it is the description of normal traffic, the so-called "average packet".
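The selection loop in the steps above can be sketched as below. This is a minimal illustration, not the authors' implementation: the scorer is a stand-in (`difflib.SequenceMatcher.ratio` from the Python standard library) used in place of a Mauve or Smith-Waterman alignment score, the threshold calculation is left out, and the names `similarity` and `select_pattern` are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in scorer (assumption): difflib ratio instead of an alignment score."""
    return SequenceMatcher(None, a, b).ratio()


def select_pattern(normal_sequences, score=similarity):
    """Return the sequence with the biggest sum of scores against all others."""
    sums = []
    for i, candidate in enumerate(normal_sequences):
        total = sum(
            score(candidate, other)
            for j, other in enumerate(normal_sequences)
            if j != i
        )
        sums.append(total)
    best = max(range(len(sums)), key=sums.__getitem__)
    return normal_sequences[best], sums


normal = ["AAAA", "AAAT", "TTTT"]   # toy encoded "normal" signatures
pattern, sums = select_pattern(normal)
print(pattern)  # AAAT
```

Every signature is compared with every other one, so the cost grows quadratically with the number of normal signatures; this is one reason the training step is worth optimizing.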
A sample of the pattern calculation
id | 0 | 1 | 3 | 4 | … | 125703 | 125707 | Sum | Threshold
0 | - | 100.5 | 91.3 | 123.4 | … | 92.3 | 100.3 | 98728374.1 | 120.4
1 | 100.5 | - | 92.4 | 120.1 | … | 98.3 | 111.3 | 92384091.5 | 115.3
3 | 91.3 | 92.4 | - | 88.0 | … | 110.3 | 90.3 | 89999999.3 | 105.8
4 | 123.4 | 120.1 | 88.0 | - | … | 109.1 | 94.8 | 94182739.4 | 117.2
… | … | … | … | … | - | … | … | … | …
125703 | 92.3 | 98.3 | 110.3 | 109.1 | … | - | 105.4 | 99123123.1 | 121.3
125707 | 100.3 | 111.3 | 90.3 | 94.8 | … | 105.4 | - | 93912398.7 | 116.9
The maximal value indicates the pattern
Mauve and Smith-Waterman (SW) algorithms: a pattern calculation
Mauve vs SW
Mauve:
• Multiple local alignments
• Alignment of n sequences
• Using the HOXD matrix
SW:
• Local alignment
• Alignment of two sequences
• Using the NUC44 matrix
To align we need to build a matrix F of weights of the best alignment.
Mauve:
F(i, j) = \max \{ 0,\; F(i-1, j-1) + s(x_i, y_j) \}, where
s(x_i, y_j) = \begin{cases} 0, & \text{when } S(x_i, y_j) \le 0, \\ \dfrac{2 S(x_i, y_j)}{p(x_i) \times p(y_j)} - S(x_i, y_j), & \text{else,} \end{cases}
p(x_i) is the number of occurrences of the character x_i in the string x.
SW:
F(i, j) = \max \{ 0,\; F(i-1, j-1) + S(x_i, y_j),\; F(i-1, j) - d,\; F(i, j-1) - d \}, where
d is the gap penalty,
S(x_i, y_j) are the weight matrix values between sequence symbols.
HOXD matrix (Mauve):
  | A | T | G | C
A | 91 | -123 | -31 | -114
T | -123 | 91 | -114 | -31
G | -31 | -114 | 100 | -125
C | -114 | -31 | -125 | 100
NUC44 matrix (SW):
  | A | T | G | C
A | 5 | -4 | -4 | -4
T | -4 | 5 | -4 | -4
G | -4 | -4 | 5 | -4
C | -4 | -4 | -4 | 5
Let x = TGATG, y = CTGA. Then the matrix F of weights of the best alignment:
Mauve:
  | T | G | A | T | G
C | 0 | 0 | 0 | 0 | 0
T | 91 | 0 | 0 | 91 | 0
G | 0 | 473 | 0 | 0 | 473
A | 0 | 0 | 655 | 0 | 0
SW:
  |   | T | G | A | T | G
  | 0 | -3 | -6 | -9 | -12 | -15
C | -3 | 0 | 0 | 0 | 0 | 0
T | -6 | 2 | 0 | 0 | 5 | 2
G | -9 | 0 | 7 | 4 | 2 | 10
A | -12 | 0 | 4 | 12 | 9 | 7
Parameter | MAUVE | SW
Alignment | TGA, TG | TGA-
Similarity (sim) | 89% | 57%
The colored zone indicates the subsequences with the highest weight.
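The SW matrix for x = TGATG and y = CTGA can be recomputed with a short sketch. Three values are assumptions rather than statements from the slide: a mismatch weight of -4 (the usual NUC4.4 off-diagonal value), a gap penalty d = 3, and border cells initialised to -d·k. Under these assumptions the interior cells come out equal to the SW matrix of the example. The function name `sw_matrix` is hypothetical.

```python
def sw_matrix(x, y, match=5, mismatch=-4, d=3):
    """Fill F(i, j) = max(0, F(i-1, j-1) + S, F(i-1, j) - d, F(i, j-1) - d).

    Rows follow y, columns follow x. Border cells are set to -d * k
    (an assumption inferred from the example, not a stated rule).
    """
    rows, cols = len(y) + 1, len(x) + 1
    f = [[0] * cols for _ in range(rows)]
    for j in range(cols):
        f[0][j] = -d * j
    for i in range(rows):
        f[i][0] = -d * i
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if y[i - 1] == x[j - 1] else mismatch
            f[i][j] = max(0, f[i - 1][j - 1] + s, f[i - 1][j] - d, f[i][j - 1] - d)
    return f


f = sw_matrix("TGATG", "CTGA")
interior = [row[1:] for row in f[1:]]
for label, row in zip("CTGA", interior):
    print(label, row)
# C [0, 0, 0, 0, 0]
# T [2, 0, 0, 5, 2]
# G [0, 7, 4, 2, 10]
# A [0, 4, 12, 9, 7]
```

The highest cell, 12, closes the diagonal run 2, 7, 12, which corresponds to the common subsequence TGA.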
sim = \frac{m}{\max(len_x;\, len_y)} \times \frac{1}{\frac{n_x}{len_x} + \frac{n_y}{len_y}} \times k
where m is the number of characters in the resulting subsequence,
n_x, n_y are the numbers of characters of the sequences x and y included in the result,
len_x, len_y are the lengths of the sequences x and y,
k is the number of occurrences of the resulting subsequence in the alignment.
The matrices S of weights between sequence symbols are the HOXD matrix for Mauve and the NUC44 matrix for SW.
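A small sketch of the similarity measure, reading the formula as sim = m / max(len_x, len_y) × 1 / (n_x/len_x + n_y/len_y) × k. This reading is an assumption, since the slide layout is damaged. The symbol name `k` for the number of occurrences and the input values below are also assumptions, chosen only to show the arithmetic for sequences of length 5 and 4.

```python
def sim(m, n_x, n_y, len_x, len_y, k):
    """Similarity of an alignment result (formula as read from the slide).

    m      - number of characters in the resulting subsequence
    n_x    - characters of x included in the result
    n_y    - characters of y included in the result
    len_x  - length of x
    len_y  - length of y
    k      - occurrences of the resulting subsequence in the alignment
    """
    coverage = n_x / len_x + n_y / len_y
    return m / max(len_x, len_y) / coverage * k


# Hypothetical inputs: a 3-character result found twice, covering 3 characters
# of each sequence, for len_x = 5 and len_y = 4.
value = sim(m=3, n_x=3, n_y=3, len_x=5, len_y=4, k=2)
print(round(value, 3))  # 0.889
```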
Effectiveness testing
Parameter | MAUVE | SW/NW
Dataset volume | 8,000 | 8,000
True Positive | 3,910 | 3,380
False Positive | 60 | 440
False Negative | 90 | 620
True Negative | 3,940 | 3,560
Accuracy | 98.1% | 86.8%
Precision | 98.5% | 88.5%
Recall | 97.8% | 84.5%
F1 score | 0.981 | 0.864
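The derived rows of the table follow from the four confusion counts by the standard definitions. The sketch below recomputes them from the counts in the table; the function name `metrics` is hypothetical.

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


results = {
    "MAUVE": metrics(tp=3910, fp=60, fn=90, tn=3940),
    "SW/NW": metrics(tp=3380, fp=440, fn=620, tn=3560),
}
for name, (acc, prec, rec, f1) in results.items():
    print(f"{name}: accuracy={acc:.4f} precision={prec:.4f} "
          f"recall={rec:.4f} F1={f1:.4f}")
```

For both algorithms the four counts add up to the dataset volume of 8,000.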
Accuracy
[Figure: ROC curve]
Performance
[Figure: Mauve algorithm runtime versus number of input sequences. X axis: the number of simultaneously analyzed sequences; Y axis: time, sec]
[Figure: Mauve algorithm runtime versus input sequence size. X axis: the size of the input sequences, Mb; Y axis: time, sec]
[Figure: The runtime of the algorithm for converting a dump file to a .fa file versus the size of the dump file. X axis: file size, Kb; Y axis: time, sec]
[Figure: Testing speed (checking a similarity of the sequences with the pattern) on Windows, QEMU, and Raspberry Pi. Y axis: iterations/second]
Our working sample: Raspberry Pi 3 mod. B
The work conclusion and the further plan
• for IoT, a multiple alignment algorithm can be applied with detection accuracy over 98%, identifying the IoT-specific security intrusions with divided and polymorphic sequences of operational acts
• NW and SW algorithms have to be checked for further optimization
• the training (the "average packet" calculation) method has to be optimized
• a multithreaded CPU has to be utilized (e.g., Raspberry Pi 4, to take advantage of the multithreaded ARM and the bigger RAM) to do a parallel sequence alignment