THE SEQUENCE ALIGNMENT ALGORITHMS FOR IOT INTRUSION DETECTION

NONLINEAR PHENOMENA IN COMPLEX SYSTEMS

XXVII International Seminar Minsk, BELARUS -- May 19-22, 2020

The reported study was funded by RFBR according to the research project №18-29-03102

Dr. Maxim KALININ, and Vasiliy [email protected]

Major features of the IoT infrastructure

Internet of Things is a communication concept for a large-scale and flexible cyber space:

• self-organizing network of devices

• ad-hoc network with p2p communications

• dynamically changing routing of network traffic

• large number of physically moving hosts (devices)

• huge amount of security-relevant data for decision making

Issues of providing security in the IoT

Technical

• Low performance of IoT controllers

• High demand on bandwidth and transmission speed (real time security)

Financial

• Cyberphysical system cost increase

• Service cost increase

Human Factor

• System engineer ≠ security specialist

• Security = documentation & information closeness

Problems of security providing in IoT

• “Performance-Big Data” compromise

• High mobility of environment

• Poor traditional protection methods

Current requirements for effective protection of IoT

• Heterogeneity

• Real time processing

• Large scale of IoT network

[Figure: a sample of industrial IoT]

IoT is a growing network of smart devices where one small device can exchange information with other connected devices. Network security is one of today’s concerns for IoT. Intrusion detection for IoT exposes the extent to which the IoT is vulnerable to attacks and how such attacks can be detected to prevent damage targeted at smart environments.

Intrusion detection in the IoT

Traditional approaches for intrusion detection

Anomaly detection

Advantages:

• Global network view

• Able to detect new intrusions

• Uses fewer rules (in comparison with the signature-based method)

Disadvantages:

• High level of errors

• Difficult adaptation (re-learning) to changing conditions (new profiles, dynamic anomalies)

Misuse detection

Advantages:

• Signature matching

• Reliable, resource and time efficient

• Low false alarm rate for well-known intrusions

Disadvantages:

• Depends on the signatures of intrusions

• Limited ability to detect unknown attacks

• Huge database of patterns

[Figure: IoT Intrusion Detection System (IDS)]

IDS work stages

Usual intrusion detection is done in 3 steps:

(1) Monitoring the security-relevant events and data concerning the system state

(2) Sequence extraction and misuse detection by pattern matching: the captured sequence of monitored signs (operational sequences, packet sequences, etc.) is compared with the pattern sequence of attack signs

(3) Signaling of intrusion to provide the attack response and counter-measure


A traditional intrusion detection based on pattern matching

Pattern matching is a major technique for IDS to detect harmful activity, but …

(1) Specific intrusions in IoT (e.g., forced power consumption, forced topology changing, etc.) have a divided sequence of operations. Some attacks can also have a nonlinear sequence of acts. The divided sequence has time gaps in the chain of intrusion acts.

(2) There is a new class of polymorphic attacks which have mutations in some acts of the attack operational sequence due to the flexibility of IoT or the specific intruder’s activity. None of the traditional methods of pattern matching detects such sequences of intrusion acts.

[Figure: TIME GAPS. Pattern sequence (signature): A B C. Captured operational sequence: A X B C, where X is a gap. Position-by-position result: Identity, No identity, No identity.]

[Figure: POLYMORPHIC SEQUENCE. Pattern sequence (signature): A B C D. Captured operational sequence: A B C D with a slightly different act B. Position-by-position result: Identity, No identity, Identity.]

Pattern matching DOES NOT RECOGNIZE such kinds of intrusions by identity matching, but THERE IS AN INTRUSION.

Identity checking is the method used for sequence matching today. You need a pattern to check identity between the stored chain and the collected chain of intrusion acts. Therefore, a wide class of attacks can bypass detection in IoT if identity checking is applied.
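The two failure cases can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, and a longest-common-subsequence ratio is used here only as a simple stand-in for a similarity measure.

```python
def identical(pattern, captured):
    """Identity check: every act must match position by position."""
    return list(pattern) == list(captured)


def lcs_length(a, b):
    """Length of the longest common subsequence of a and b (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def similarity(pattern, captured):
    """Share of pattern acts found, in order, in the captured chain (stand-in measure)."""
    return lcs_length(pattern, captured) / len(pattern)


# Divided sequence: an unrelated act X sits between A and B.
print(identical("ABC", "AXBC"), similarity("ABC", "AXBC"))    # False 1.0
# Polymorphic sequence: act B is replaced by a slightly different act b.
print(identical("ABCD", "AbCD"), similarity("ABCD", "AbCD"))  # False 0.75
```

Identity checking rejects both captured chains, while the similarity value stays high enough to flag them against a threshold chosen by the operator.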

Approach of bioinformatics: sequence alignment and similarity

In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein chains to identify regions of SIMILARITY that may be a consequence of functional, structural, or evolutionary relationships between the coded chains.

THE GOAL IS THE ALIGNMENT OF LENGTHY, HIGHLY VARIABLE AND NUMEROUS SEQUENCES

Sequence alignment methods:

a) global alignment, which stretches the smaller chain along the bigger one

b) local alignment, which localizes the smaller chain on a region of the bigger one

Both can be applied to chains: the global one is often applied when both sequences have approximately equal length, and local alignment is applied when one sequence is bigger than the other.

[Figure: Sample 1. Identification of similarity of mammals’ DNA chains. Sample 2. Similarity identification of coronavirus origin. Labels: IDENTITY, SIMILARITY, ALIGNMENT]

Novelty for IoT-IDS: we transform IDENTITY checking to SIMILARITY checking
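The difference between global and local alignment can be sketched with one small dynamic-programming routine. The scoring values (match +1, mismatch -1, gap -1) and the function name are illustrative assumptions, not values from the slides.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-1, local=False):
    """Best alignment score of a and b.

    local=False: global (Needleman-Wunsch style), both chains aligned end to end.
    local=True:  local (Smith-Waterman style), best-scoring region only.
    """
    rows, cols = len(a) + 1, len(b) + 1
    F = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment pays for leading gaps.
        for i in range(rows):
            F[i][0] = gap * i
        for j in range(cols):
            F[0][j] = gap * j
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            value = max(F[i - 1][j - 1] + s, F[i - 1][j] + gap, F[i][j - 1] + gap)
            if local:
                value = max(0, value)  # a local alignment may restart anywhere
            F[i][j] = value
            best = max(best, value)
    return best if local else F[-1][-1]


short, long_ = "GCA", "TTGCATT"
print(align_score(short, long_, local=True))   # 3: GCA is found inside the longer chain
print(align_score(short, long_, local=False))  # -1: four gaps are needed to stretch GCA
```

With chains of very different length the global score is dominated by gap penalties, which is why local alignment is the natural choice when a short pattern is searched for in a long captured sequence.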

The sequence alignment algorithms analysis

8 alignment algorithms have been reviewed and analyzed:

• PSI-BLAST starts with an element-wise comparison of each record in the database with the given sequence and then builds a local multiple alignment of the sequences. The process is repeated and the result is refined during several iterations until the specified number of cycles is exhausted or until the results of 2 sequences coincide.

• FASTA provides optimized local alignment algorithms combined with analysis of the replacement matrix.

• Smith-Waterman (SW) and Needleman-Wunsch (NW) algorithms are the classical algorithms for local and global DNA sequence alignment, respectively.

• Hidden Markov Model (HMM) is a statistical model that simulates the operation of a process similar to a Markov process with unknown parameters, and the task is to restore the unknown parameters based on observables. The method simultaneously performs alignment and estimation of the probabilities of deletions and insertions.

• LASTZ is designed to pre-process one sequence or a set of sequences and then align them.

• Mauve is an alignment system that combines evolutionary event analysis with multiple local alignments.

• Seaweed is an improved dot-alignment algorithm that uses implicit matrices with the highest score. It is very sensitive to local alignments, but does not take into account the blank spaces (time gaps) in the sequences.

Overview of the sequence alignment algorithms

ALGORITHM | TYPE OF ALIGNMENT | MIN. ACCURACY, % | TIME, SEC | COMPLEXITY | WINDOW SIZE | MAX. INPUT (MLN. PAIRS)
PSI-BLAST | Multiple local | 70 | 160 | O(M) | 128 | 10,000
FASTA | Global | 78 | 730 | O(M+N) | 200 | 2,000
Smith-Waterman, Needleman-Wunsch | Local (global and semi-local are available, too) | 80 | 6630 | O(m*n) | >10 | 1
Hidden Markov Model | Multiple local | 90 | 5200 | O(mn^2) | >30 | 10
LASTZ | Local | 90 | 1120 | O(M+N) | 255 | 500
Mauve | Local | 98 | 40 | O(G^2 n + Gn log Gn) | >10 | 500
Seaweed | Dot | 90 | 370 | O(mnw) | 100 | 30…1,000

Notes: G is the number of genomes, n is the average length of a genome, w is the length of the sequence.

[Legend: Best accuracy; Best speed; Best sensitivity]

Selected for our task: Smith-Waterman, Needleman-Wunsch; Mauve

Summary of the sequence alignment algorithms

IoT-IDS should have a high attack detection rate:

• Accuracy: SW/NW, Seaweed, LASTZ, Mauve, and HMM have an accuracy of at least 80%

• Among the observed algorithms, the shortest operation time when processing big sequences (>100 Mb) is demonstrated by Mauve (~40 sec)

• With a further increase in the sequence length, the operation time increases significantly (up to ~24 hours per 1 Gb with different algorithms)

Maximal input size determines the number of pairs that are simultaneously fed to the algorithm to detect the intrusions:

• All algorithms are capable of receiving at least 1 million pairs at the same time, while the highest value is achieved by PSI-BLAST (10,000 million pairs). But this is not essential for IDS, as it works with significantly less input

The size of the sequence window allows us to analyze the sequence separated into blocks:

• The smaller the window size, the more accurately anomalies are detected. But in this case, the time for the anomaly search increases

• SW/NW and Mauve algorithms are able to apply the smallest window size, which indicates their advantage over the other algorithms in local positioning in the compared sequences

SW/NW and MAUVE algorithms have been selected for the further research and experiments to design a new technique of intrusion detection in the IoT
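The window-size trade-off can be sketched in Python as follows. This is an illustration under stated assumptions: the helper names are hypothetical, and a positional match fraction stands in for the alignment score that a real detector would compute per window.

```python
def split_windows(seq, size, step=None):
    """Cut seq into blocks of `size` symbols; by default blocks do not overlap."""
    step = step or size
    last_start = max(len(seq) - size, 0)
    return [(start, seq[start:start + size]) for start in range(0, last_start + 1, step)]


def match_fraction(pattern, window):
    """Stand-in score: share of positions where window and pattern agree."""
    hits = sum(p == w for p, w in zip(pattern, window))
    return hits / len(pattern)


def flag_windows(seq, pattern, threshold=0.5):
    """Return start offsets of windows whose score falls below the threshold."""
    return [start for start, window in split_windows(seq, len(pattern))
            if match_fraction(pattern, window) < threshold]


traffic = "AGTAAGTACCCC"
print(flag_windows(traffic, "AGTA"))       # [8]: only the last block deviates
print(len(split_windows(traffic, 4)))      # 3 comparisons with window size 4
print(len(split_windows(traffic, 2)))      # 6 comparisons with window size 2
```

A smaller window localizes the deviation more precisely but doubles the number of comparisons in this toy case, which mirrors the accuracy versus search time trade-off stated above.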

IoT-IDS architecture and the research methodology

• adapt the DNA-based sequence alignment algorithms to intrusion detection and run them on the IoT device (Raspberry Pi)

• evaluate the performance of algorithms, and analyze it to understand resources (CPU, RAM, disk) workload in the IoT device

• evaluate the accuracy of the IoT intrusion detection algorithms


Experimental environment

• Biopython software kit: modules Bio.Seq, Bio.Align, Bio.SeqRecord

• Traffic datasets (NSL-KDD, Bot IoT, IoT-23) and ‘Traffic Generator’ (.bin files) were applied for testing the algorithms. Datasets contain DoS attacks, R2L attacks, U2R attacks, Probe, and Normal activities.

• Nucleotides are coded, and the normal sequence is selected and marked.

The sample of nucleotide coding

Suppose we have a sample to match like the one below.

[Figure: sample traffic record]

The sample is encoded in a way suitable for sequence alignment: 0=AGT, tcp=AAT, ftp_data=TGC, SF=GCA; the prefixes TTT, CCC, AAA are written before the protocol, service, and flag fields.

Then the encoded sample is the sequence to be compared against the pattern to find the anomaly.
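A minimal sketch of this coding step, assuming the record starts with the NSL-KDD style fields duration, protocol, service, and flag. Only the four codes and three prefixes quoted above come from the slide; the function name, the field order, and the one-to-one assignment of prefixes to fields are assumptions.

```python
# Codes quoted on the slide; a full codebook would cover every field value.
CODEBOOK = {"0": "AGT", "tcp": "AAT", "ftp_data": "TGC", "SF": "GCA"}

# Prefixes written before the protocol, service, and flag fields (order assumed).
PREFIXES = {"protocol": "TTT", "service": "CCC", "flag": "AAA"}


def encode_record(duration, protocol, service, flag):
    """Encode the first four fields of a traffic record as a nucleotide string."""
    return (
        CODEBOOK[duration]
        + PREFIXES["protocol"] + CODEBOOK[protocol]
        + PREFIXES["service"] + CODEBOOK[service]
        + PREFIXES["flag"] + CODEBOOK[flag]
    )


encoded = encode_record("0", "tcp", "ftp_data", "SF")
print(encoded)  # AGTTTTAATCCCTGCAAAGCA
```

The resulting string can be wrapped in a `Bio.Seq.Seq` object when Biopython is available; the sketch stays with plain strings so that it runs without extra dependencies.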

Pattern selection task

The calculation of the pattern by which the packets in the network are filtered is the key step for intrusion detection with a sequence alignment algorithm.

For genetic algorithms, an intrusion pattern is not a pattern sequence of a single packet. It is an ‘average packet’ that is built for the whole normal traffic and is thus similar to every packet of the normal traffic.

ADVANTAGES:

• PERFORMANCE: It is required to choose the optimal length for the pattern, since it will be compared with the whole incoming traffic and thus will directly affect the speed of packet analysis and the amount of memory for storing it.

• CONTENT: The pattern should fit as many different “forms” of traffic as possible, i.e. the pattern must be as similar to all normal traffic as possible.

• DATABASE VOLUME: The database of intrusion patterns is reduced because you do not need to collect all of the possible packets which contain the attack signatures. Instead, you need to collect the ‘average pattern’.

Thus, the selected signature should not only contain A SUFFICIENT NUMBER OF BYTES to “describe” all kinds of traffic, but should also be SMALL IN LENGTH in order to guarantee acceptable resource consumption and normal packet analysis performance.

A pattern calculation algorithm

1. An encoded normal sequence is randomly selected.

2. Using the sequence alignment algorithm, the selected signature is compared to the other normal signatures in the dataset.

3. The sum of all the received offsets and the threshold are calculated.

4. This operation is repeated with each normal signature.

5. The pattern is considered to be the signature whose sum of thresholds is the biggest. It is considered to be the “closest” (i.e. the most similar) one to the rest of the normal signatures and it is the description of normal traffic, the so-called ‘average packet’.
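The selection steps can be sketched as a search for the signature with the largest total similarity to all others. This is a hedged illustration: `similarity` is a toy positional score standing in for the alignment score, the names are hypothetical, and the threshold value is left out because the slides do not define how it is computed.

```python
def similarity(a, b):
    """Toy stand-in for an alignment score: number of matching positions."""
    return sum(x == y for x, y in zip(a, b))


def select_pattern(signatures):
    """Return (index of the 'average packet', list of per-signature score sums)."""
    sums = []
    for i, candidate in enumerate(signatures):
        # Steps 2-3: compare the candidate with every other normal signature and sum up.
        total = sum(similarity(candidate, other)
                    for j, other in enumerate(signatures) if j != i)
        sums.append(total)
    # Step 5: the signature with the biggest sum is the closest to all the others.
    best = max(range(len(signatures)), key=sums.__getitem__)
    return best, sums


normal = ["AGTAAT", "AGTAAC", "AGTTAT", "CCCGGG"]
best, sums = select_pattern(normal)
print(best, sums)  # 0 [10, 9, 9, 0]
```

The loop is quadratic in the number of normal signatures, which is consistent with the later remark that the training (the ‘average packet’ calculation) still has to be optimized.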

A sample of the pattern calculation

id | 0 | 1 | 3 | 4 | … | 125703 | 125707 | Sum | Threshold
0 | - | 100.5 | 91.3 | 123.4 | … | 92.3 | 100.3 | 98728374.1 | 120.4
1 | 100.5 | - | 92.4 | 120.1 | … | 98.3 | 111.3 | 92384091.5 | 115.3
3 | 91.3 | 92.4 | - | 88.0 | … | 110.3 | 90.3 | 89999999.3 | 105.8
4 | 123.4 | 120.1 | 88.0 | - | … | 109.1 | 94.8 | 94182739.4 | 117.2
… | … | … | … | … | - | … | … | … | …
125703 | 92.3 | 98.3 | 110.3 | 109.1 | … | - | 105.4 | 99123123.1 | 121.3
125707 | 100.3 | 111.3 | 90.3 | 94.8 | … | 105.4 | - | 93912398.7 | 116.9

The maximal value indicates the pattern

Mauve and Smith-Waterman (SW) algorithms: a pattern calculation

Mauve:

• Multiple local alignments

• Alignment of n sequences

• Using the HOXD Matrix

SW:

• Local alignment

• Alignment of two sequences

• Using the NUC44 Matrix

To align we need to build a matrix F of weights of the best alignment.

Mauve:

\[
F(i, j) = \max \{\, 0,\; F(i-1, j-1) + Q(x_i, y_j) \,\}, \quad \text{where}
\]

\[
Q(x_i, y_j) =
\begin{cases}
0, & \text{when } s(x_i, y_j) \le 0, \\
\dfrac{2\, s(x_i, y_j)}{S_x(x_i) \times S_y(y_j)} - s(x_i, y_j), & \text{else}
\end{cases}
\]

S_x(x_i) is the number of occurrences of a character x_i in a string x.

SW:

\[
F(i, j) = \max
\begin{cases}
0, \\
F(i-1, j-1) + s(x_i, y_j), \\
F(i-1, j) - d, \\
F(i, j-1) - d.
\end{cases}
\]

d is the predetermined penalty.

s(x_i, y_j) are the weight matrix values between sequence symbols.

Matrices S of weights between sequence symbols:

HOXD Matrix (Mauve)

      A     T     G     C
A    91  -123   -31  -114
T  -123    91  -114   -31
G   -31  -114   100  -125
C  -114   -31  -125   100

NUC44 Matrix (SW)

     A   T   G   C
A    5  -4  -4  -4
T   -4   5  -4  -4
G   -4  -4   5  -4
C   -4  -4  -4   5

Let x = TGATG, y = CTGA. Then the matrix F of weights of the best alignment:

Mauve

      T     G     A     T     G
C     0     0     0     0     0
T    91     0     0    91     0
G     0   473     0     0   473
A     0     0   655     0     0

SW

           T    G    A    T    G
      0   -3   -6   -9  -12  -15
C    -3    0    0    0    0    0
T    -6    2    0    0    5    2
G    -9    0    7    4    2   10
A   -12    0    4   12    9    7

The colored zone indicates the subsequences with the highest weight.

Parameter | MAUVE | SW
Alignment | TGA, TG | TGA-
Similarity (sim) | 89% | 57%

\[
sim = \frac{N}{\max(len(x);\, len(y))} \times \frac{1}{\dfrac{N_x}{len(x)} + \dfrac{N_y}{len(y)}} \times k
\]

where N is the number of characters in the resulting subsequence,

N_x, N_y are the numbers of characters of the sequences x and y included in the resulting subsequence,

len(x), len(y) are the lengths of the sequences x and y,

k is the number of occurrences of the resulting subsequence in the original.
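The SW recurrence can be checked against the example x = TGATG, y = CTGA with a short Python sketch. Several things here are assumptions rather than statements from the slides: the NUC44 values are taken as 5 for a match and -4 for a mismatch (the standard NUC.4.4 values), the penalty d = 3 and the border initialization -d·i are inferred because they reproduce the SW matrix of the example, and the function name is hypothetical. Textbook Smith-Waterman uses zero borders instead.

```python
def sw_matrix(x, y, match=5, mismatch=-4, d=3):
    """Fill the matrix F for the SW recurrence; rows follow y, columns follow x."""
    rows, cols = len(y) + 1, len(x) + 1
    F = [[0] * cols for _ in range(rows)]
    # Border values inferred from the slide example (assumption, see lead-in).
    for j in range(cols):
        F[0][j] = -d * j
    for i in range(rows):
        F[i][0] = -d * i
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if y[i - 1] == x[j - 1] else mismatch
            F[i][j] = max(
                0,
                F[i - 1][j - 1] + s,  # align the two symbols
                F[i - 1][j] - d,      # gap in x
                F[i][j - 1] - d,      # gap in y
            )
    return F


F = sw_matrix("TGATG", "CTGA")
for label, row in zip(" CTGA", F):
    print(label, row)
print(max(max(row) for row in F))  # 12, reached where TGA has been aligned
```

The highest cell, 12, sits at the end of the aligned subsequence TGA, which agrees with the alignment reported for SW in the example.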

Effectiveness testing

Accuracy

Parameter | MAUVE | SW/NW
Dataset Volume | 8,000 | 8,000
True Positive | 3,910 | 3,380
False Positive | 60 | 440
False Negative | 90 | 620
True Negative | 3,940 | 3,560
Accuracy | 98.1% | 86.8%
Precision | 98.5% | 88.5%
Recall | 97.8% | 84.5%
F1 score | 0.982 | 0.864

[Figure: ROC curve]

Performance

[Figure: Mauve algorithm runtime versus number of input sequences. X axis: the number of simultaneously analyzed sequences (2, 4, 8, 16, 32). Y axis: time, sec (0 to 150).]

[Figure: Mauve algorithm runtime versus input sequence size. X axis: the size of the input sequences, Mb (1, 5, 7, 9, 14, 32). Y axis: time, sec (0 to 3).]

[Figure: The runtime of the algorithm for converting a dump file to a .fa file versus the size of the dump file. X axis: file size, Kb (6.8, 15.6, 28.8, 50.1, 127.2, 714.4, 1433.6, 5836.8). Y axis: time, sec (0 to 2).]

[Figure: Testing speed (checking a similarity of the sequences with the pattern) on Windows, QEMU, and Raspberry Pi. Y axis: iterations/second (0 to 1000).]

Our working sample: Raspberry Pi 3 mod. B
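The reported quality figures can be recomputed from the confusion-matrix counts with the standard definitions. The sketch below uses only the counts given in the table; the function name is hypothetical.

```python
def metrics(tp, fp, fn, tn):
    """Standard binary classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


mauve = metrics(tp=3910, fp=60, fn=90, tn=3940)
sw_nw = metrics(tp=3380, fp=440, fn=620, tn=3560)

for name, m in (("MAUVE", mauve), ("SW/NW", sw_nw)):
    print(name, {key: round(value, 3) for key, value in m.items()})
```

Accuracy, precision, and recall recomputed this way agree with the table after rounding. The F1 score for Mauve comes out at about 0.981, marginally below the reported 0.982, which may be a rounding difference.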

The work conclusion and the further plan

• for IoT, a multiple alignment algorithm can be applied with a detection accuracy over 98% and with identification of the IoT-specific security intrusions that have divided and polymorphic sequences of operational acts

• NW and SW algorithms have to be checked for further optimization

• the training (the ‘average packet’ calculation) method has to be optimized

• a multithreaded CPU has to be utilized to do a parallel sequence alignment (e.g., Raspberry Pi 4 can take advantage of its multithreaded ARM CPU and bigger RAM)
