structured models with neural...
TRANSCRIPT
Structured models with neural networks
Tomáš Pevný
February 28, 2020
Motivation — identification of infected computers
examples of messages:http://lop.guardpair.com/affs?addonname=[Enter%20Product%20Name]&affid=9050&subaffid=5774&subID=undefined&clientuid=undefined&origaffid=9050&origsubaffid=5774&href=http%3A%2F%2F7http://pf.updatewp.org/?v=3.16&pcrc=308403836&LSVRDT=&ty=CHECKhttp://rules.similardeals.net/v1.0/whitelist/1052/9050x5774/7online.subsea7.net?partnerName=Lyrics
Motivation — representation of PE file format
Executable
CodeDataPE header
Call Graph Functions
Control Flow Graph Basic Blocks
StringsImports
Instructions
f1
f2
f3
f4
f7
f8
f5
f6
b1
b3b2
b4
b5b6
b7
b8
Single-instance learning
x = (x1, . . . , xd) f(x) =
{+1
−1extractfeatures
send toclassifier
Icon made by Freepik from www.flaticon.com
Multi-instance learning
x =
(x1,1, . . . , x1,d)(x1,1, . . . , x1,d)
...(xb,1, . . . , xb,d)
f(x) =
{+1
−1extractfeatures
send toclassifier
Our solution of multi-instance learning
x1 ∈ Rd
x2 ∈ Rd
x3 ∈ Rd
xb ∈ Rd
...
h(x1, θh)
h(x2, θh)
h(x3, θh)
h(xb, θh)
x̃1 ∈ Ru
x̃2 ∈ Ru
x̃3 ∈ Ru
x̃b ∈ Ru
...
g({x̃i}bi=1, θg
)x̄ ∈ Ru f (x̄, θf )
g({x̃i}bi=1
)= 1
l
∑bi=1 x̃i
h(x, θh) : Rd → Ru
h(x, θh) = max{0, xTθh}f(x̄, θf ) : Ru → R2
f(x̄, θf ) = x̄Tθf
One vector per instance
One vector per sample
Generalized multi-instance learning
x =
(x(1)1 , . . . , x
(1)d1
)
(x(2)1,1, . . . , x
(2)1,d2
)
(x(2)1,1, . . . , x
(2)1,d2
)...
(x(2)b,1 , . . . , x
(2)b,d2
)
(x(3)1,1, . . . , x
(3)1,d3
)
(x(3)1,1, . . . , x
(3)1,d3
)...
(x(3)c,1 , . . . , x
(3)c,d3
)
f(x) =
{+1
−1extractfeatures
send toclassifier
Extension of universal approximation theorem
Let S be the class of spaces which1. contains all compact subsets of Rd, d ∈ N2. is closed under finite cartesian products3. for each X ∈ S we have P(X ) ∈ S1
Then for each X ∈ S, every continuous function on X can bearbitrarily well approximated by neural networks.
1Here we assume that P(X ) is endowed with some metric.
Identification of infected computers
Detection of infected users — model
d(1)1
. . . d(1)n
d(2)1
. . . d(2)n
p(1)1
. . . p(1)n
p(2)1
. . . p(2)n
k(1)1
. . . k(1)n
v(1)1
. . . v(1)n
k(2)1
. . . k(2)n
v(2)1
. . . v(2)n
�KVq(1)1
. . . q(1)n
�KVq(2)1
. . . q(2)n
�Dd1 . . . dn
�Pp1 . . . pn
�Qq1 . . . qn
�Uu(1)1
. . . u(1)3n
u(2)1
. . . u(2)
l
u(3)1
. . . u(3)
l
u(4)1
. . . u(4)
l
u(5)1
. . . u(5)
l
u(6)1
. . . u(6)
l
�SLDs(1)1
. . . s(1)
k
�SLDs(2)1
. . . s(2)
k
�SLDs(3)1
. . . s(3)
k
�userx1 . . . xp
fy
Detection of infected users — results
I Training data from 5-30.10 20173.6 · 109/4.45 · 106 urls / users
I Testing data from 3.11 20172 · 108/1.5 · 106 urls / users
0 0.2 0.4 0.6 0.8 10
0.2
0.4
0.6
0.8
1
recall
precision
Convolution [1] R. Forests [2]MIL - URL MIL - 5 min
[1] eXpose: A character-level convolutional neural network with embeddings for detectingmalicious URLs, file paths and registry keys, J. Saxe and K. Berlin, 2017
[2] Learning detectors of malicious web requests for intrusion detection in network traffic, L.Machlica, K. Bartos, and M. Sofka, 2017
Static analysis of malware binaries — PE file format
Executable
CodeDataPE header
Call Graph Functions
Control Flow Graph Basic Blocks
StringsImports
Instructions
f1
f2
f3
f4
f7
f8
f5
f6
b1
b3b2
b4
b5b6
b7
b8
Static analysis of malware binaries — model of a function
push regmov reg regmov reg memdec regjne num
mov reg mempush regmov mem regcall KERNEL32!Disable...
xor reg reginc regpop regretn num
1
2
3
tokenization block representationdisassembly output
2554116045
669126
1
2
3
11251163
ngrams
x1
x2
x3
MIL
NN
Static analysis of malware binaries — model of a binary
x1x2x3
x4x5
xn
Block NNBlock NNBlock NN
Block NNBlock NN
Block NN
f1
f2
fm
meanmax
meanmax
meanmax
y2
Function NN
Function NN
Function NN
y1
ym
meanmax Binary NN
map groupby reduce map groupby reduce
Static analysis of malware binaries — results
I Training data4 · 105 binaries
I Testing data5.5 · 105 binaries
I approx. 10% were malicious
Static analysis of malware binaries — feedback
I avoids specifying imports in PEheader
I loads addresses of libraryfunctions to an array
Static analysis of malware binaries — feedback
I adware related behaviorI creating both visible and
invisible windows
Future directions
I Learning representation of hierarchical dataI Learning distances for clusteringI Generative modelsI Anomaly / few shot learning
I Game theory in securityI Finding new attacks under constraints
I Decentralized learningI Modeling and reasoning over relational data
Modeling the internet
https://www.threatcrowd.org/domain.php?domain=ibuyitttttttttttttttttttttttttttttttttttibuyit.com
Materials
I https://github.com/pevnak/Mill.jlI https://github.com/pevnak/JsonGrinder.jl
I Discriminative models for multi-instance problems withtree-structure, Tomáš Pevný, Petr Somol, 2016
I Using Neural Network Formalism to Solve Multiple-InstanceProblems, Tomáš Pevný, Petr Somol, 2016
I Approximation capability of neural networks on sets ofprobability measures and tree-structured data, Tomáš Pevný,Vojtěch Kovařík, 2019
I Nested Multiple Instance Learning in Modelling of HTTPnetwork traffic, Tomas Pevny, Marek Dedic, 2020