arXiv:1908.03449v1 [cs.CR] 9 Aug 2019


  • CARL-HAUSER - OPEN SOURCE IMAGE MATCHING ALGORITHMS BENCHMARKING FRAMEWORK

    A PREPRINT

    Vincent Falconieri

    CIRCL

    Luxembourg

    August 12, 2019

    ABSTRACT

    Security analysts need to classify, search and correlate numerous images. Automatic classification tools improve the efficiency of such tasks. Many image-matching algorithms are presented in the literature. The present paper introduces and provides an open-source benchmarking and evaluation tool for these algorithms. In this paper, the framework evaluates algorithms on illustrative datasets, which consist of phishing and onion websites. Datasets are provided as Open Data.

    Keywords Threat Intelligence · Phishing · Dataset · Open Data · Open Source · Security · CERT · Incident Response · Visual Detection · Automatic classification · Correlation · Clustering

    1 Introduction

    CERTs, such as CIRCL, and security teams collect and process content such as images (broadly speaking, photos, screenshots of websites or screenshots of sandboxes). Datasets keep growing: for example, on average 10000 screenshots of onion-domain websites are scraped each day by AIL, a framework for analysing information leaks. Analysts need to classify, search and correlate all these images.

    Automatic tools can help them in this task. However, little research on image matching and image classification seems to have been conducted specifically on website screenshots [1][2][3].

    Our long-term target is to build a generic library and services that can, at the very least, be easily integrated into Threat Intelligence tools such as AIL¹ [4] and MISP² [5]. A quick-lookup mechanism for correlation would be a necessary part of this library. An evaluation framework is provided as Carl-Hauser³ and the open-source library itself is provided as Douglas-Quaid⁴.

    MISP is an open-source software solution developed at CIRCL for collecting, storing, distributing and sharing cyber security indicators and threats related to the analysis of cyber security incidents.

    AIL is also an open-source modular framework developed at CIRCL to analyze potential information leaks from unstructured data sources or streams. It can be used, for example, for data leak prevention.

    Benchmarks of image-matching algorithms already exist [6][7][8] and are highly informative, but none is delivered turnkey.

    ¹ Analysis Information Leak framework - github.com/CIRCL/AIL-framework
    ² Malware Information Sharing Platform - github.com/MISP/MISP
    ³ github.com/CIRCL/carl-hauser
    ⁴ github.com/CIRCL/douglas-quaid


    1.1 Problem Statement

    Figure 1: Carl-Hauser logo

    Image correlation for security event correlation purposes is nowadays mainly manual. No open-source tool provides easy correlations, regardless of the technology used. Ideally, the extraction of links or correlations between these images could be fully automated. Even partial automation would reduce the burden of this task on security teams.

    The main contribution of this paper is a free and open-source automated benchmarking framework for reviewing image-matching algorithms.

    This paper also presents research results for visual clustering of phishing websites.

    2 Datasets

    Tests and evaluations of performance and speed were conducted using real sets of pictures. These datasets were extracted from CIRCL’s tools, such as AIL and URLAbuse. One main dataset was used: circl-phishing-dataset-01, of 470+ pictures.

    The phishing dataset is available for research purposes at https://www.circl.lu/opendata/datasets/circl-phishing-dataset-01

    More details about these datasets can be found at ref paper + info link

    (a) Phishing dataset overview (470+ pictures)

    Figure 2: Dataset’s samples

    3 Materials and Methods

    In the quest for a use-case-specific image-matching library, algorithms and approaches are numerous. Benchmarking each one of them is tough and time-consuming. Therefore, we developed a benchmarking framework. The envisioned goal is to allow fast implementation and fast evaluation of any new image-matching library or algorithm that could come up at a later date.


    Figure 3 presents a global overview of the framework. Given an input folder containing pictures (PNG or BMP), the framework runs all programmed matching algorithms and returns a series of outputs (quality, timing measures ...). For the quality measure, a ground truth file needs to be provided. This file can easily be built with VisJS-Classificator [9]⁵.

    The framework explores the parameter space of the algorithms by generating configuration files on the fly.

    These configuration files are read by an execution handler, which takes care of the general plumbing, including monitoring mechanisms and feeding the configuration file to the core computation handler. This core computation handler overrides a few abstract methods with code specific to the library used.

    Pre- and post-computations can be enabled or disabled and are part of the configuration: OCR, text-hider (Fig. 4), image conversion, edge detection ...
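The exploration of the parameter space described above can be pictured as a grid walk. The following is a minimal sketch, assuming a plain Cartesian product over the options; the parameter names and values (`algorithm`, `input_format`, `text_hider`) are hypothetical and are not taken from the Carl-Hauser code base.

```python
import itertools
import json

# Hypothetical parameter space: names and values are illustrative only.
PARAMETER_SPACE = {
    "algorithm": ["A_HASH", "P_HASH", "ORB"],
    "input_format": ["png", "bmp"],
    "text_hider": [False, True],
}


def generate_configurations(space):
    """Yield one configuration dict per point of the parameter grid."""
    names = sorted(space)
    for values in itertools.product(*(space[name] for name in names)):
        yield dict(zip(names, values))


def configuration_name(conf):
    """Build a stable, human-readable identifier for a configuration."""
    return "_".join(f"{key}-{conf[key]}" for key in sorted(conf))


for conf in generate_configurations(PARAMETER_SPACE):
    # Each configuration could be written to its own file; here it is only printed.
    print(configuration_name(conf), json.dumps(conf, sort_keys=True))
```

With three algorithms, two formats and two text-hider settings, this grid holds 12 configurations. A real parameter space grows multiplicatively with each added option, which is one reason the benchmark is run in batch.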

    [Figure 3 diagram. Labels: Input (folder of .png or .bmp pictures, ground_truth.json); Launcher; Execution handler (Configuration file, Preprocessing, Precomputation and core computation, Postprocessing); outputs (XXX.graphe.json, XXX.conf, XXX.stats); Wrapper; Output (XXX.overview, XXX.latex.overview, XXX.inclusion.matrix, XXX.quality.matrix)]

    Figure 3: Global overview of benchmarking framework

    (a) Original picture (b) Painted picture

    Figure 4: Available pre-computation example - Text hider

    Data structures and functions to store, compute, measure, etc. in a unified way are available in utility classes. The execution handler outputs three major elements:

    • A statistics data structure: including execution time, pre-computation time, matching time (total and per element), memory consumption, matching quality metrics ...

    • A graph data structure: nodes are input pictures, and each picture matched with another one shares an edge with it. The generated graph can directly be read and displayed with a tool such as VisJS-Classificator [9]

    • A configuration data structure: a copy of the generated configuration file that led to this result

    Each pair of pictures matched together can also be exported as a picture. This option is not activated by default, due to its high cost in disk space and computation time (mainly I/O access). However, this output is really relevant for manually spotting drawbacks of algorithms. Figure 5 is an example of such output.

    After all configurations have been evaluated, a wrapper reports on the executed experiments by merging per-configuration results (execution time, quality ...) into unified views. These views are:

    • An overview: a list of configuration names and quality measures, sorted by quality

    • A LaTeX overview: the same list as the previous one, formatted as a LaTeX-ready table

    • An inclusion matrix: a matrix that displays the inclusion factor between output graphs. This allows one to spot which algorithm is similar to which other algorithm (strictly, whose output is included in whose). An instance of the intersection matrix is provided in Appendix 10.

    • A quality-per-pair matrix: by pairing algorithm outputs two by two and evaluating the merged graph, it generates a quick overview of which algorithm pairings are wise

    ⁵ Open-source manual image classification tool, built for the occasion - github.com/Vincent-CIRCL/visjs_classificator

    Please note that Figure 3 presents only a high-level view of the framework. The implementation is described more precisely in Annex 8.
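The inclusion factor between output graphs, used by the inclusion matrix, can be illustrated as follows. This is a sketch under stated assumptions: edges are treated as undirected, and the factor is taken to be the share of one graph's edges that also appear in the other. The paper does not spell out the exact formula, so both choices are hypothetical, as are the function names and the sample data.

```python
def edge_set(matches):
    """Normalise (picture, matched_picture) pairs into undirected edges."""
    return {frozenset(pair) for pair in matches if pair[0] != pair[1]}


def inclusion_factor(graph_a, graph_b):
    """Fraction of the edges of graph_a that also appear in graph_b."""
    edges_a, edges_b = edge_set(graph_a), edge_set(graph_b)
    if not edges_a:
        return 0.0
    return len(edges_a & edges_b) / len(edges_a)


def inclusion_matrix(graphs):
    """Pairwise inclusion factors, keyed by (row_name, column_name)."""
    return {
        (name_a, name_b): inclusion_factor(graph_a, graph_b)
        for name_a, graph_a in graphs.items()
        for name_b, graph_b in graphs.items()
    }


# Made-up outputs of two algorithms on five pictures.
graphs = {
    "orb": [("a.png", "b.png"), ("c.png", "d.png"), ("e.png", "a.png")],
    "whash": [("b.png", "a.png"), ("c.png", "e.png")],
}
matrix = inclusion_matrix(graphs)
print(matrix[("orb", "whash")], matrix[("whash", "orb")])
```

The matrix is not symmetric in general: here one of the three "orb" edges appears in "whash", while one of the two "whash" edges appears in "orb".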

    (a) The upper central logo is not matched correctly as the best matching picture (2nd from the left) does not include it

    (b) Login form is matched correctly, as the first matches are similar pictures, even with missing buttons

    (c) The form is matched with other forms, but none identical to original one.

    Figure 5: Request picture (on the left) and top matches (on the right, top 1 to top 3 from left to right).

    4 Results

    4.1 Used libraries

    In the following, we present a few relevant visualisations of the framework outputs, as well as overview results as described previously.

    A few image-matching libraries were tested, including:

    • ImageHash⁶, which includes a wide list of fuzzy hash algorithms such as AHash, DHash, PHash, WHash, ... The purpose of these algorithms is to map input files (here, images) into a limited hash space, such that two "similar" images have similar hashes. Their objective is therefore distinct from that of traditional hash algorithms, which seek to maximize the difference between hashes for even a minimal difference (ultimately, 1 bit) in the input file.

    • TLSH⁷ [10], which is also a fuzzy hashing algorithm, available in its own library.

    • OpenCV⁸, an open source computer vision and machine learning software library, which provides a common infrastructure for computer vision applications. Relevant algorithms are available within it, such as SIFT [11][12] (Scale-Invariant Feature Transform), SURF [13] (Speeded Up Robust Features), ORB [14] (Oriented FAST and Rotated BRIEF) ... These algorithms use key point descriptors, a way to condense distributed and locally relevant information into a vector or set of vectors. These vectors can then be compared. However, we focused on open-source implementations and avoided patented ones. SIFT and SURF were therefore out of scope. Based on a large performance overview [15], we chose to focus primarily on ORB.

    ⁶ github.com/JohannesBuchner/imagehash
    ⁷ github.com/trendmicro/tlsh
    ⁸ github.com/opencv/opencv
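To make the idea of "similar images have similar hashes" concrete, here is a simplified, dependency-free re-implementation of an average hash compared with a Hamming distance. It is a sketch and not ImageHash's code: it assumes the picture has already been converted to a small greyscale matrix, and the 4x4 sample pictures are made up.

```python
def average_hash(pixels):
    """Return a bit list: 1 where a pixel is brighter than the image mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return [1 if value > mean else 0 for value in flat]


def hamming_distance(hash_a, hash_b):
    """Number of positions at which two hashes differ."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    return sum(bit_a != bit_b for bit_a, bit_b in zip(hash_a, hash_b))


# Made-up 4x4 greyscale pictures: dark left half, bright right half.
original = [[10, 10, 200, 200] for _ in range(4)]
# Same layout, uniformly brighter: the hash should not change.
brighter = [[value + 20 for value in row] for row in original]
# Different layout: bright top half, dark bottom half.
different = [[200] * 4, [200] * 4, [10] * 4, [10] * 4]

print(hamming_distance(average_hash(original), average_hash(brighter)))   # 0
print(hamming_distance(average_hash(original), average_hash(different)))  # 8
```

A global brightness change leaves the hash untouched, while a change in layout flips half of the 16 bits. A cryptographic hash would instead produce unrelated outputs in both cases.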

    A few tips to analyze ORB-matched pictures:

    • Parallel lines (if there is no rotation) are indicators of a quality match: the spatial consistency between source and candidate pictures is preserved.

    • Text seems to be a problem. Letters are matched to letters, generating false positives. Text also "uses up" descriptor space (the number of descriptors is artificially limited), and so prevents a true logo (for example) from being described and used.
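The "parallel lines" tip above can be turned into a rough numerical check. The sketch below is a hypothetical heuristic, not part of the framework: it assumes matches are given as pairs of keypoint coordinates, and that without rotation or scaling a good match moves every keypoint by roughly the same displacement. The `tolerance` value and the sample coordinates are made up.

```python
import statistics


def displacement_consistency(matches, tolerance=10.0):
    """Share of matches whose displacement is close to the median displacement.

    `matches` is a list of ((x_src, y_src), (x_dst, y_dst)) keypoint pairs.
    When the two pictures are drawn side by side, near-identical
    displacements show up as parallel match lines.
    """
    if not matches:
        return 0.0
    dxs = [dst[0] - src[0] for src, dst in matches]
    dys = [dst[1] - src[1] for src, dst in matches]
    median_dx, median_dy = statistics.median(dxs), statistics.median(dys)
    consistent = sum(
        1
        for dx, dy in zip(dxs, dys)
        if abs(dx - median_dx) <= tolerance and abs(dy - median_dy) <= tolerance
    )
    return consistent / len(matches)


# Made-up keypoint pairs: every point moves by about (+5, +2).
parallel = [((10, 10), (15, 12)), ((100, 40), (105, 42)),
            ((60, 200), (66, 201)), ((30, 80), (34, 83))]
# Made-up keypoint pairs with unrelated displacements (crossing lines).
crossed = [((10, 10), (300, 200)), ((100, 40), (20, 400)),
           ((60, 200), (250, 10)), ((30, 80), (35, 82))]

print(displacement_consistency(parallel))  # 1.0
print(displacement_consistency(crossed))   # 0.0
```

This check only makes sense under the condition stated in the tip (no rotation); a rotated or rescaled picture would need a geometric model such as the RANSAC-estimated transformation discussed later.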

    Graph visualization is valuable to spot unexpected matches or mismatches and to quickly navigate through all pictures [16]. Even if this kind of visualization is hard to scale up to billions of pictures, it is still a relevant tool.

    4.2 Examples

    4.2.1 Fuzzy Hash and ORB

    Two examples are presented. Figure 6b presents a mainly correct matching graph including one mismatching edge. This means that the library (here, W-Hash from ImageHash) correctly detected Microsoft forms as being the same, but incorrectly gave a Microsoft form as being the closest picture to the white webpage on the right.

    Figure 6a presents a match on a logo, between globally very different pictures. This means that even if the global look of the pictures is different, the algorithm (here, ORB from OpenCV) correctly matched common parts of these screenshots (forms, logo in the corner ...).


    (a) Good brand matching even with different backgrounds. Logo is present on each picture in upper-left corner - ORB

    (b) Graph visualisation of Microsoft phishing websites - WHash

    (c) Matches lines between text and forms’ corners

    Figure 6: Graph visualization and successes


    (a) Good matching on brand even with disturbed background

    (b) Mismatch due to text/font matching - ORB

    (c) Perfect match on Microsoft form

    Figure 7: Graph visualization and issues


    4.2.2 ORB and RANSAC

    RANSAC outputs a homography between two compared pictures, in the form of a transformation matrix, and also determines which matches are inliers and which are outliers. A "strong transformation" is a significant rotation, translation, scaling up or down, or deformation. A "light transformation" is a near-direct translation, without rotation, scaling or deformation. Figure 11 and ?? are examples.

    The transformation matrix applied to the first picture generates a new picture that "fits" the second picture it is compared to. Displaying the first picture with its transformation gives an idea of "how much" the first picture has to be transformed. If the transformation is strong (high distortion), then the match is probably poor. Examples are presented in Figure 9. Please note that the usual name for this operation is "Reprojection error".

    Even if a homography is an interesting tool, it can produce a very wide range of results. The transformation can be "infinite" and so result in false matches. A homography suits real-world contexts, where objects can be seen in perspective. In our case, an affine transformation could be more relevant because, in screenshots, "depth" is not used.
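One way to quantify "how much" a picture must be transformed is to measure how far the estimated 3x3 matrix is from a pure translation. The sketch below is a hypothetical heuristic and not the filter used in the benchmark: the two measures and the thresholds `max_linear` and `max_perspective` are assumptions chosen for illustration, as are the sample matrices.

```python
import math


def transformation_strength(homography):
    """Return (linear_deviation, perspective_magnitude) for a 3x3 matrix.

    linear_deviation: distance of the upper-left 2x2 block from the identity
    (captures rotation, scaling and shear).
    perspective_magnitude: norm of the two perspective terms of the last row.
    """
    scale = homography[2][2]
    if scale == 0:
        raise ValueError("degenerate homography: bottom-right entry is zero")
    h = [[value / scale for value in row] for row in homography]
    linear_deviation = math.sqrt(
        (h[0][0] - 1.0) ** 2 + h[0][1] ** 2 + h[1][0] ** 2 + (h[1][1] - 1.0) ** 2
    )
    perspective_magnitude = math.hypot(h[2][0], h[2][1])
    return linear_deviation, perspective_magnitude


def is_light_transformation(homography, max_linear=0.1, max_perspective=1e-3):
    """True when the matrix is close to a pure translation (thresholds are guesses)."""
    linear_deviation, perspective_magnitude = transformation_strength(homography)
    return linear_deviation <= max_linear and perspective_magnitude <= max_perspective


# Pure translation by (30, -12): a "light" transformation.
translation = [[1, 0, 30], [0, 1, -12], [0, 0, 1]]
# Shear, scaling and perspective terms: a "strong" transformation.
distorted = [[0.4, 0.9, 5], [-0.7, 1.8, 3], [0.002, 0.001, 1]]

print(is_light_transformation(translation))  # True
print(is_light_transformation(distorted))    # False
```

Restricting the model to an affine transformation, as suggested above for screenshots, would amount to forcing the two perspective terms to zero and checking only the linear part.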

    (a) Explanation of RANSAC visualisation created by the test framework

    Figure 8: ORB and RANSAC algorithm

    A few other matches are displayed in Figure 11. These matches were challenging: varied color ranges between pictures of the same category, various backgrounds, etc.


    (a) Near no transformation needed. Should be detected as good match.

    (b) Light distortion. Should be detected as good match.

    (c) High distortion. Could be accepted as a false positive.

    (d) Butterfly pattern with a very heavy transformation. Should be a negative match. Note that the distance score (0.95) is under the expected threshold of 0.96, so this match would not have been discarded without matrix analysis.

    (e) Very high distortion, should be discarded.

    Figure 9: Matrix transformation visualisation - ORB - RANSAC Filtering - Visualisation of transformation matrix applied to request picture. From left to right: database picture (example), target picture (request), target picture deformed by the RANSAC transformation matrix


    (a) Match on the form and the relative space between its components

    (b) Strong deformation for the first two matches. Low deformation for the third row.

    Figure 10: The first column is the candidate picture, the second column is the request picture, the third column is the request picture transformed "to match the candidate picture". The less stretched, the better.


    (a) Good match over logo. However one match is above the "critical threshold" of 0.96

    (b) Good match for Outlook, but also above the threshold.

    (c) Cluster of Outlook pictures, separated from others by a link above the 0.97 threshold.

    (d) Good cluster of simple blue forms

    Figure 11: Results - ORB - RANSAC Filtering - No matrix filter


    5 Challenges

    We drew a few conclusions and encountered a few problems that need to be discussed for further improvements:

    • ORB can be very sensitive to input parameters. Fuzzy-hash algorithms have few parameters and so give consistent but untweakable results;

    • ORB outperforms fuzzy-hash algorithms in almost all sound configurations (about 10% more true positive matches);

    • Even if some fuzzy-hash algorithms should not be sensitive to input format (due to internal conversion), it seems that the input format generates differences in output results. Uncompressed formats (e.g. BMP) give the best results;

    • Fuzzy-hash algorithms are quite insensitive to text and high-frequency noise. Even if this fact seems obvious, it is confirmed by the benchmark;

    • Text noise is a huge issue for feature matching. It is usually not a problem on the natural scenes used to evaluate algorithms such as SIFT [11][12], SURF [13], ORB [14][17], ... However, text is prominent in screenshots. This leads to mismatches and unreasonable results. See Figure ??. ORB is highly sensitive to text, as corner detectors detect corners in each letter. In our settings, this drawback generates false positive matches and therefore fewer true positive matches. Hiding the text (with blur boxes, mean-color boxes, black or white boxes, ... placed on text detected by OCR or deep-learning approaches) does not fundamentally change the quality of ORB detection: OCR and text-hider preprocessing can only partially prevent this issue. It seems that the gain in performance is bounded by the imperfection of the OCR used to detect the text to hide. The main challenge for the text-hider is to remove only irrelevant text, while keeping logos and distinctive typography. OCR can detect a logo as text and so hide it, leaving less information for ORB to correctly match pictures. Please note that no matching was done on the extracted text.

    • Speed-wise and memory-wise, ORB needs orders of magnitude more resources than fuzzy-hash algorithms (10^-3 seconds for hashes, 10^0 seconds for ORB). This gap is greatly reduced, without a drop in ORB performance, by using Bag-of-Words/Bag-of-Pictures approaches. Instead of performing O(N^2) comparisons per pair of pictures (N being the number of descriptors per picture), the comparison can be performed in constant time, by boiling the N descriptors down to a boolean array representing the presence of a few chosen descriptors.

    • RANSAC filtering, that is, verifying the homography quality between two matched pictures, which boils down to "Does picture A need to be stretched a lot to match picture B?", provides the best results of the benchmark, but costs the most (10^1 seconds for ORB + RANSAC).

    • Scalability: hash-based comparisons are orders of magnitude faster than feature-matching algorithms. Even if our current implementation is in a test state, where performance is not a main objective, scalability constitutes a lurking and rising issue. Trying to get the closest picture for each picture of a 5000-picture dataset took 10 hours with the most "advanced" configuration (ORB+RANSAC based) on a server with 32 cores and 120 GB of RAM.

    • Base model: image matching or image classification? While establishing correlations between pictures is the main goal, both approaches seem able to satisfy it. However, the solutions and approaches seem radically different. The final library may use a hybrid approach.

    • Algorithm combination and scoring: merging algorithm results is a major problem. Their score ranges are diverse and, even if normalization is doable, the meaning of the score differs. One algorithm can provide a reliable match if the score is below 0.2, and another one if the score is below 0.6. Therefore, merging results is not only a mathematical issue.

    • Unreachable funnel approach: a funnel of algorithms is not suited to our objective, given current results. This means that using the most scalable algorithms first and the most computing-intensive algorithms last, while discarding irrelevant candidate pictures along the way down the funnel, is not usable. Correlation between algorithm outputs is not prevalent, and so what is discarded by one could be matched by another.
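The Bag-of-Words idea mentioned in the list above, boiling N descriptors down to a boolean array, can be sketched as follows. This is an illustration under assumptions, not the framework's implementation: descriptors are toy 8-bit integers (real ORB descriptors are 256-bit binary strings), the vocabulary is hand-picked, and the `radius` and the Jaccard comparison are hypothetical choices.

```python
def hamming(a, b):
    """Hamming distance between two descriptors stored as integers."""
    return bin(a ^ b).count("1")


def signature(descriptors, vocabulary, radius=2):
    """Boolean array: is each vocabulary word present among the descriptors?"""
    return [
        any(hamming(descriptor, word) <= radius for descriptor in descriptors)
        for word in vocabulary
    ]


def signature_similarity(sig_a, sig_b):
    """Jaccard similarity of two signatures of the same length."""
    both = sum(a and b for a, b in zip(sig_a, sig_b))
    either = sum(a or b for a, b in zip(sig_a, sig_b))
    return both / either if either else 0.0


# Hand-picked toy vocabulary of four 8-bit "visual words".
vocabulary = [0b00000000, 0b11110000, 0b00001111, 0b10101010]
# Made-up descriptors for two pictures.
picture_a = [0b00000001, 0b11110001, 0b10101010]
picture_b = [0b00000011, 0b11100000, 0b00001110]

sig_a = signature(picture_a, vocabulary)  # [True, True, False, True]
sig_b = signature(picture_b, vocabulary)  # [True, True, True, False]
print(signature_similarity(sig_a, sig_b))  # 0.5
```

The signature is computed once per picture. Comparing two pictures then costs a fixed number of operations that depends on the vocabulary size rather than on the number of descriptors per picture.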

    6 Future work

    Problems presented previously lead to a list of possible future developments:

    • Extending the provided datasets to support the research effort

    • Combining algorithm outputs in a unified and relevant way, to leverage each algorithm’s main strength

    • Increasing the list of evaluated algorithms, to leverage new strengths

    • Creating a challenging dataset with scaled/blurred/negative/color-changed/... screenshot-like pictures, to evaluate the quality of our algorithm combination on edge cases

    • Evaluating and improving the scalability of the framework, to evaluate wider datasets faster

    • Adding more pre- and post-computations: filtering out images, merging image distances, modified edge detector, spatial clustering of descriptors ... to understand what can be done to improve matching quality

    • Evaluating time and memory performance with regard to dataset size, to anticipate computation needs and the dataset size at which the framework hits the "complexity wall"

    7 Conclusion

    7.1 Summary

    This research paper proposes that even partial automation of screenshot classification would reduce the burden on security teams.

    The presented evaluation framework is available online at github.com/CIRCL/carl-hauser.

    Research results were presented along with the issues encountered. Algorithm modifications are still being made to solve these issues and improve the overall quality of the future library. The issues and solutions presented in this paper constitute a starting point on the path to the creation of a matching and clustering library, intended to be used in open-source tools.

    In the context of this paper, these results represent above all a usage example of the provided datasets.

    7.2 Contact information

    If you have a complaint related to the dataset or the processing over it, please contact us. We aim to be transparent, not only about how we process but also about rights that are linked to such information and processing. You can contact us at circl.lu/contact/ for requests about the dataset itself, regarding elements of the dataset, or extension requests. You can contact us at the same address or on GitHub for feedback about the benchmarking framework, methodology or relevant ideas/inquiries.


    References

    [1] A. Sampat and A. Haskell, “CNN for task classification using computer screenshots for integration into dynamic calendar/task management systems,” p. 6.

    [2] M. Aburrous, M. A. Hossain, K. Dahal, and F. Thabtah, “Predicting Phishing Websites Using Classification Mining Techniques with Experimental Case Studies,” in 2010 Seventh International Conference on Information Technology: New Generations, pp. 176–181.

    [3] K. Chen, J. Chen, C. Huang, and C. Chen, “Fighting Phishing with Discriminative Keypoint Features,” vol. 13, no. 3, pp. 56–63.

    [4] S. Mokaddem, G. Wagener, and A. Dulaunoy, “AIL - The design and implementation of an Analysis Information Leak framework,” in 2018 IEEE International Conference on Big Data (Big Data), pp. 5049–5057.

    [5] C. Wagner, A. Dulaunoy, G. Wagener, and A. Iklody, “MISP: The Design and Implementation of a Collaborative Threat Intelligence Sharing Platform,” in Proceedings of the 2016 ACM on Workshop on Information Sharing and Collaborative Security - WISCS’16. ACM Press, pp. 49–56. [Online]. Available: http://dl.acm.org/citation.cfm?doid=2994539.2994542

    [6] M. Gaillard and E. Egyed-Zsigmond, “Large scale reverse image search - A method comparison for almost identical image retrieval,” in INFORSID.

    [7] C. Zauner and M. Steinebach, “Implementation and Benchmarking of Perceptual Image Hash Functions.”

    [8] J. Bian, L. Zhang, Y. Liu, W.-Y. Lin, M.-M. Cheng, and I. D. Reid, “Image Matching: An Application-oriented Benchmark.” [Online]. Available: http://arxiv.org/abs/1709.03917

    [9] Vincent-CIRCL, “Classificator for pictures matching and clustering. Fast and visual.: Vincent-CIRCL/visjs_classificator.” [Online]. Available: https://github.com/Vincent-CIRCL/visjs_classificator

    [10] J. Oliver, C. Cheng, and Y. Chen, “TLSH – A Locality Sensitive Hash,” in 2013 Fourth Cybercrime and Trustworthy Computing Workshop. IEEE, pp. 7–13. [Online]. Available: http://ieeexplore.ieee.org/document/6754635/

    [11] D. G. Lowe, “Distinctive Image Features from Scale-Invariant Keypoints,” vol. 60, no. 2, pp. 91–110. [Online]. Available: http://link.springer.com/10.1023/B:VISI.0000029664.99615.94

    [12] I. R. Otero and M. Delbracio, “Anatomy of the SIFT Method,” vol. 4, pp. 370–396. [Online]. Available: http://www.ipol.im/pub/art/2014/82/

    [13] H. Bay, T. Tuytelaars, and L. Van Gool, “SURF: Speeded Up Robust Features,” in Computer Vision – ECCV 2006, A. Leonardis, H. Bischof, and A. Pinz, Eds. Springer Berlin Heidelberg, vol. 3951, pp. 404–417. [Online]. Available: http://link.springer.com/10.1007/11744023_32

    [14] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, “ORB: An efficient alternative to SIFT or SURF,” in 2011 International Conference on Computer Vision. IEEE, pp. 2564–2571. [Online]. Available: http://ieeexplore.ieee.org/document/6126544/

    [15] S. A. K. Tareen and Z. Saleem, “A comparative analysis of SIFT, SURF, KAZE, AKAZE, ORB, and BRISK,” in 2018 International Conference on Computing, Mathematics and Engineering Technologies (iCoMET). IEEE, pp. 1–10. [Online]. Available: https://ieeexplore.ieee.org/document/8346440/

    [16] I. Vessey and D. Galletta, “Cognitive Fit: An Empirical Study of Information Acquisition,” vol. 2, no. 1, pp. 63–84. [Online]. Available: http://pubsonline.informs.org/doi/abs/10.1287/isre.2.1.63

    [17] ORB (Oriented FAST and Rotated BRIEF) — OpenCV 3.0.0-dev documentation. [Online]. Available: https://docs.opencv.org/3.0-beta/doc/py_tutorials/py_feature2d/py_orb/py_orb.html#orb

    [18] F. Harary and I. C. Ross, “A Procedure for Clique Detection Using the Group Matrix,” vol. 20, no. 3, pp. 205–215.


    Part I

    Appendices

    8 Detailed view of the framework

    The exact implementation differs in some respects from the idealized overview presented in the paper body. For example, the core computation handler and the execution handler are a single element, which can be entirely overridden. This makes the framework more flexible and adaptable to new algorithms, as each one may need to override a larger part of the execution handler.
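The "overridable handler" design described above can be pictured with an abstract base class. The following is a minimal sketch under assumptions: the class and method names (`ExecutionHandler`, `preprocess`, `describe`, `distance`, `run`) and the toy `LengthHandler` are hypothetical and do not come from the Carl-Hauser repository.

```python
import time
from abc import ABC, abstractmethod


class ExecutionHandler(ABC):
    """Generic pipeline; subclasses supply the library-specific steps."""

    def __init__(self, configuration):
        self.configuration = configuration
        self.stats = {}

    def preprocess(self, picture):
        # Default: no preprocessing. Subclasses may override this step.
        return picture

    @abstractmethod
    def describe(self, picture):
        """Precomputation: turn a picture into a hash or descriptors."""

    @abstractmethod
    def distance(self, description_a, description_b):
        """Core computation: distance between two descriptions."""

    def run(self, pictures):
        """Return {name: best_match_name} and record timings in self.stats."""
        start = time.perf_counter()
        described = {
            name: self.describe(self.preprocess(picture))
            for name, picture in pictures.items()
        }
        self.stats["precomputation_time"] = time.perf_counter() - start

        start = time.perf_counter()
        matches = {}
        for name, description in described.items():
            candidates = [other for other in described if other != name]
            matches[name] = min(
                candidates,
                key=lambda other: self.distance(description, described[other]),
                default=None,
            )
        self.stats["matching_time"] = time.perf_counter() - start
        return matches


class LengthHandler(ExecutionHandler):
    """Toy handler: 'describes' a picture by its size in bytes."""

    def describe(self, picture):
        return len(picture)

    def distance(self, description_a, description_b):
        return abs(description_a - description_b)


handler = LengthHandler({"name": "toy"})
matches = handler.run({"a": b"x" * 4, "b": b"x" * 5, "c": b"x" * 50})
print(matches, sorted(handler.stats))
```

A handler for a real library would override `describe` and `distance` with calls into that library, and could override `run` as well if the generic loop does not fit, which mirrors the flexibility described above.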

    15

  • Carl-Hauser - Open Source Image Matching Algorithms Benchmarking Framework A PREPRINT

    [Figure 12: Global overview of the test framework. Components named in the diagram: Storage (folder of .png or .bmp pictures, ground_truth.json); Launcher (generates configuration files, calls the execution handler); Configuration file (defines the behavior of the execution handler); Execution handler (preprocessing, precomputation and core computation, postprocessing), overwritten by custom core computation handlers (OpenCV Handler, ImageHash Handler, TLSH Handler, ...); Utilities (FileSystem handler, Picture class, Printing handler, Stats handler, JSON handler, Graph handler); Wrapper (compiles the results of all executions). Output files: XXX.conf, XXX.stats, XXX.overview, XXX.graphe.json, XXX.latex.overview, XXX.quality.matrix, XXX.inclusion.matrix.]


    9 Result table

    The following table presents raw results, reduced to the best algorithms.

    A shortened results table is provided, with the algorithms evaluated as the "best ones" on the phishing dataset. This dataset is composed of 207 screenshots of phishing websites. The pre-computation time is the hashing or feature-description time. The matching time is the elapsed time between a request picture and a best-match answer from the algorithm. As explained previously in the paper body, the true positive rate is the inclusion rate between the output graph of the algorithm and the ground truth graph.
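    The two timings defined above can be sketched as follows. This is only an illustration: describe and best_match are hypothetical placeholders for an algorithm's description and matching steps, and averaging the pre-computation time per picture is an assumption, since the text does not state how the reported times are aggregated.

```python
import time


def timed(fn, *args):
    """Return (result, elapsed seconds) for one call of fn(*args)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def describe(picture):
    # Hypothetical stand-in for hashing / feature description.
    mean = sum(picture) / len(picture)
    return tuple(int(v > mean) for v in picture)


def best_match(request, database):
    # Hypothetical stand-in for matching: nearest descriptor by Hamming distance.
    return min(
        database,
        key=lambda name: sum(a != b for a, b in zip(request, database[name])),
    )


pictures = {"p1": [0, 0, 255, 255], "p2": [10, 5, 250, 240], "p3": [255, 255, 0, 0]}

# Pre-computation time: describing each picture once.
database, precompute_times = {}, []
for name, pixels in pictures.items():
    database[name], elapsed = timed(describe, pixels)
    precompute_times.append(elapsed)

# Matching time: from a request picture to the best-match answer.
request, _ = timed(describe, [5, 0, 240, 255])
answer, matching_time = timed(best_match, request, database)

print(answer, sum(precompute_times) / len(precompute_times), matching_time)
```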

    Values are normalized by the maximum obtainable score for this version of the score evaluation. Evaluated algorithms always give a match for a given image, even for outliers, which have no outbound link in the ground truth graph. So even in the best case, an algorithm has a bounded maximum score.

    NAME (PARAMETERS)                                  TRUE POSITIVE (norm.)   PRE COMPUTING (s)   MATCHING (s)
    ORB 500 LEN MAX NO FILTER STD BRUTE FORCE ENA.     0.65263 (0.77641)       0.04624             1.9087
    ORB 500 LEN MIN FAR THR. KNN 2 FLANN LSH ENA.      0.64211 (0.76389)       0.04634             1.1203
    ORB 500 LEN MIN FAR THR. KNN 2 FLANN LSH DIS.      0.63684 (0.75762)       0.04614             1.12314
    ORB 500 LEN MAX FAR THR. KNN 2 FLANN LSH ENA.      0.63158 (0.75136)       0.05965             1.26475
    ORB 500 LEN MAX FAR THR. KNN 2 FLANN LSH DIS.      0.62632 (0.74511)       0.04613             1.12695
    ORB 500 LEN MIN RATIO KNN 2 FLANN LSH DIS.         0.61053 (0.72632)       0.04764             1.19758
    ORB 500 LEN MIN RATIO KNN 2 FLANN LSH ENA.         0.60526 (0.72005)       0.04787             1.14532
    D HASH                                             0.60386 (0.71839)       0.06124             0.00214
    A HASH                                             0.57971 (0.68966)       0.06268             0.00218
    D HASH VERTICAL                                    0.57488 (0.68391)       0.05934             0.00216
    ORB 500 LEN MIN NO FILTER STD BRUTE FORCE ENABLED  0.56842 (0.67622)       0.0452              2.08476
    P HASH                                             0.56039 (0.66667)       0.06403             0.00244
    W HASH                                             0.53623 (0.63793)       0.1221              0.00205
    P HASH SIMPLE                                      0.52657 (0.62644)       0.06377             0.00222
    ORB 500 LEN MAX RATIO BAD STD BRUTE FORCE ENABLED  0.46316 (0.55101)       0.0445              1.68615
    ORB 500 LEN MIN RATIO BAD STD BRUTE FORCE ENABLED  0.45263 (0.53847)       0.0445              2.15439
    TLSH                                               0.42512 (0.50575)       0.00498             0.00096
    TLSH NO LENGTH                                     0.40580 (0.48276)       0.00494             0.0011


    NAME TRUE POSITIVE PRE COMPUTING (s) MATCHING (s)

    raw phishing PNG ORB 500 LEN MAX NO FILTER STD BRUTE FORCE ENABLED 0.65263 0.04624 1.9087

    raw phishing PNG ORB 500 LEN MIN FAR THREESHOLD KNN 2 BRUTE FORCE DISABLED 0.64737 0.06641 1.23615

    raw phishing PNG ORB 500 LEN MIN RATIO CORRECT KNN 2 BRUTE FORCE DISABLED 0.64737 0.05256 1.17277

    raw phishing PNG ORB 500 LEN MAX RATIO CORRECT KNN 2 BRUTE FORCE DISABLED 0.64737 0.04775 1.34641

    raw phishing PNG ORB 500 LEN MAX FAR THREESHOLD KNN 2 BRUTE FORCE DISABLED 0.64737 0.06434 1.20571

    raw phishing PNG ORB 500 LEN MIN FAR THREESHOLD KNN 2 FLANN LSH ENABLED 0.64211 0.04634 1.1203

    raw phishing PNG ORB 500 LEN MIN FAR THREESHOLD KNN 2 FLANN LSH DISABLED 0.63684 0.04614 1.12314

    raw phishing PNG ORB 500 LEN MAX FAR THREESHOLD KNN 2 FLANN LSH ENABLED 0.63158 0.05965 1.26475

    raw phishing PNG ORB 500 LEN MAX FAR THREESHOLD KNN 2 FLANN LSH DISABLED 0.62632 0.04613 1.12695

    raw phishing PNG ORB 500 LEN MIN RATIO CORRECT KNN 2 FLANN LSH DISABLED 0.61053 0.04764 1.19758

    raw phishing PNG ORB 500 LEN MAX RATIO CORRECT KNN 2 FLANN LSH DISABLED 0.61053 0.04769 1.13552

    raw phishing PNG ORB 500 LEN MIN RATIO CORRECT KNN 2 FLANN LSH ENABLED 0.60526 0.04787 1.14532

    raw phishing PNG ORB 500 LEN MAX RATIO CORRECT KNN 2 FLANN LSH ENABLED 0.60526 0.04742 1.11723

    raw phishing PNG D HASH 0.60386 0.06124 0.00214

    raw phishing PNG A HASH 0.57971 0.06268 0.00218

    raw phishing PNG D HASH VERTICAL 0.57488 0.05934 0.00216

    raw phishing PNG ORB 500 LEN MIN NO FILTER STD BRUTE FORCE ENABLED 0.56842 0.0452 2.08476

    raw phishing PNG P HASH 0.56039 0.06403 0.00244

    raw phishing PNG W HASH 0.53623 0.1221 0.00205

    raw phishing PNG P HASH SIMPLE 0.52657 0.06377 0.00222

    raw phishing PNG ORB 500 LEN MAX RATIO BAD STD BRUTE FORCE ENABLED 0.46316 0.0445 1.68615

    raw phishing PNG ORB 500 LEN MIN RATIO BAD STD BRUTE FORCE ENABLED 0.45263 0.0445 2.15439

    raw phishing PNG TLSH 0.42512 0.00498 0.00096

    raw phishing PNG TLSH NO LENGTH 0.4058 0.00494 0.0011

    raw phishing PNG ORB 500 LEN MAX RATIO BAD STD FLANN LSH ENABLED 0.30526 0.05724 1.50447

    raw phishing PNG ORB 500 LEN MAX RATIO BAD STD BRUTE FORCE DISABLED 0.3 0.07377 1.17801

    raw phishing PNG ORB 500 LEN MIN RATIO BAD STD BRUTE FORCE DISABLED 0.28947 0.04555 0.99388

    raw phishing PNG ORB 500 LEN MAX RATIO BAD STD FLANN LSH DISABLED 0.27368 0.14438 1.56765

    raw phishing PNG ORB 500 LEN MIN RATIO BAD STD FLANN LSH ENABLED 0.27368 0.05961 1.47452

    raw phishing PNG ORB 500 LEN MIN RATIO BAD STD FLANN LSH DISABLED 0.26842 0.04805 1.17483

    raw phishing PNG ORB 500 LEN MAX NO FILTER STD FLANN LSH ENABLED 0.06842 0.05417 1.1589

    raw phishing PNG ORB 500 LEN MAX NO FILTER STD FLANN LSH DISABLED 0.05789 0.04719 1.34306

    raw phishing PNG ORB 500 LEN MAX NO FILTER STD BRUTE FORCE DISABLED 0.02105 0.0631 1.14467

    raw phishing PNG ORB 500 LEN MIN NO FILTER STD FLANN LSH DISABLED 0.00526 0.05679 1.18021

    raw phishing PNG ORB 500 LEN MIN NO FILTER STD FLANN LSH ENABLED 0.00526 0.04684 1.08095

    raw phishing PNG ORB 500 LEN MIN NO FILTER STD BRUTE FORCE DISABLED 0.0 0.04518 0.95656

    NAME TRUE POSITIVE PRE COMPUTING (sec) MATCHING (sec)

    raw phishing bmp BMP ORB 500 LEN MAX NO FILTER STD BRUTE FORCE ENABLED 0.64737 0.07296 2.52043

    raw phishing bmp BMP D HASH 0.60386 0.02017 0.00224

    raw phishing bmp BMP TLSH 0.58937 0.08356 0.00093

    raw phishing bmp BMP A HASH 0.57971 0.02544 0.00245

    raw phishing bmp BMP TLSH NO LENGTH 0.57005 0.0856 0.00114

    raw phishing bmp BMP D HASH VERTICAL 0.56522 0.01986 0.00228

    raw phishing bmp BMP ORB 500 LEN MIN NO FILTER STD BRUTE FORCE ENABLED 0.55789 0.04995 2.40043

    raw phishing bmp BMP P HASH 0.55072 0.02438 0.00232

    raw phishing bmp BMP W HASH 0.54106 0.09857 0.00221

    raw phishing bmp BMP P HASH SIMPLE 0.52174 0.02224 0.00233

    raw phishing bmp BMP ORB 500 LEN MAX RATIO BAD STD BRUTE FORCE ENABLED 0.45263 0.0637 2.45745

    raw phishing bmp BMP ORB 500 LEN MIN RATIO BAD STD BRUTE FORCE ENABLED 0.44211 0.06207 2.39609

    raw phishing bmp BMP ORB 500 LEN MAX RATIO BAD STD BRUTE FORCE DISABLED 0.28421 0.05223 1.45008

    raw phishing bmp BMP ORB 500 LEN MIN RATIO BAD STD BRUTE FORCE DISABLED 0.27368 0.06659 1.3083

    raw phishing bmp BMP ORB 500 LEN MAX NO FILTER STD BRUTE FORCE DISABLED 0.01579 0.0531 1.43634

    raw phishing bmp BMP ORB 500 LEN MIN NO FILTER STD BRUTE FORCE DISABLED 0.0 0.05253 1.17698


    NAME TRUE POSITIVE PRE COMPUTING (sec) MATCHING (sec)

    raw phishing COLORED PNG ORB 500 LEN MAX NO FILTER STD BRUTE FORCE ENABLED 0.59474 0.04536 1.5378

    raw phishing COLORED PNG P HASH 0.58937 0.04627 0.00233

    raw phishing COLORED PNG D HASH VERTICAL 0.57488 0.04983 0.00236

    raw phishing COLORED PNG D HASH 0.57488 0.04925 0.00301

    raw phishing COLORED PNG ORB 500 LEN MIN NO FILTER STD BRUTE FORCE ENABLED 0.56842 0.0739 2.11373

    raw phishing COLORED PNG ORB 500 LEN MAX RATIO CORRECT KNN 2 BRUTE FORCE DISABLED 0.55789 0.0439 0.86356

    raw phishing COLORED PNG ORB 500 LEN MAX FAR THREESHOLD KNN 2 BRUTE FORCE DISABLED 0.55789 0.0439 0.86493

    raw phishing COLORED PNG A HASH 0.55072 0.04406 0.00229

    raw phishing COLORED PNG ORB 500 LEN MIN RATIO CORRECT KNN 2 BRUTE FORCE DISABLED 0.54737 0.04379 0.85381

    raw phishing COLORED PNG ORB 500 LEN MIN FAR THREESHOLD KNN 2 BRUTE FORCE DISABLED 0.54737 0.04405 0.85917

    raw phishing COLORED PNG ORB 500 LEN MAX FAR THREESHOLD KNN 2 FLANN LSH ENABLED 0.54737 0.04403 1.08197

    raw phishing COLORED PNG ORB 500 LEN MAX FAR THREESHOLD KNN 2 FLANN LSH DISABLED 0.54211 0.04388 1.084

    raw phishing COLORED PNG ORB 500 LEN MAX RATIO CORRECT KNN 2 FLANN LSH ENABLED 0.53684 0.04413 1.08097

    raw phishing COLORED PNG ORB 500 LEN MIN FAR THREESHOLD KNN 2 FLANN LSH DISABLED 0.53684 0.0441 1.07795

    raw phishing COLORED PNG ORB 500 LEN MAX RATIO CORRECT KNN 2 FLANN LSH DISABLED 0.53158 0.0439 1.08247

    raw phishing COLORED PNG P HASH SIMPLE 0.52657 0.04703 0.00298

    raw phishing COLORED PNG ORB 500 LEN MIN FAR THREESHOLD KNN 2 FLANN LSH ENABLED 0.52105 0.04423 1.08245

    raw phishing COLORED PNG W HASH 0.51208 0.11247 0.00229

    raw phishing COLORED PNG ORB 500 LEN MIN RATIO CORRECT KNN 2 FLANN LSH ENABLED 0.51053 0.04389 1.0917

    raw phishing COLORED PNG ORB 500 LEN MIN RATIO CORRECT KNN 2 FLANN LSH DISABLED 0.5 0.0438 1.08247

    raw phishing COLORED PNG TLSH 0.47826 0.00513 0.00096

    raw phishing COLORED PNG TLSH NO LENGTH 0.45411 0.00508 0.00111

    raw phishing COLORED PNG ORB 500 LEN MAX RATIO BAD STD BRUTE FORCE ENABLED 0.44737 0.0706 2.30236

    raw phishing COLORED PNG ORB 500 LEN MIN RATIO BAD STD BRUTE FORCE ENABLED 0.38947 0.0676 2.39523

    raw phishing COLORED PNG ORB 500 LEN MIN RATIO BAD STD FLANN LSH ENABLED 0.24211 0.04551 1.05413

    raw phishing COLORED PNG ORB 500 LEN MAX RATIO BAD STD FLANN LSH ENABLED 0.24211 0.04491 1.04322

    raw phishing COLORED PNG ORB 500 LEN MAX RATIO BAD STD FLANN LSH DISABLED 0.24211 0.04502 1.05124

    raw phishing COLORED PNG ORB 500 LEN MIN RATIO BAD STD FLANN LSH DISABLED 0.24211 0.04556 1.05038

    raw phishing COLORED PNG ORB 500 LEN MIN RATIO BAD STD BRUTE FORCE DISABLED 0.23684 0.04937 1.31014

    raw phishing COLORED PNG ORB 500 LEN MAX RATIO BAD STD BRUTE FORCE DISABLED 0.23684 0.05403 1.20556

    raw phishing COLORED PNG ORB 500 LEN MAX NO FILTER STD FLANN LSH ENABLED 0.05789 0.04412 0.99475

    raw phishing COLORED PNG ORB 500 LEN MAX NO FILTER STD FLANN LSH DISABLED 0.03684 0.04388 0.99439

    raw phishing COLORED PNG ORB 500 LEN MAX NO FILTER STD BRUTE FORCE DISABLED 0.02105 0.05104 0.8506

    raw phishing COLORED PNG ORB 500 LEN MIN NO FILTER STD FLANN LSH DISABLED 0.0 0.0444 0.99507

    raw phishing COLORED PNG ORB 500 LEN MIN NO FILTER STD BRUTE FORCE DISABLED 0.0 0.08486 1.23023

    raw phishing COLORED PNG ORB 500 LEN MIN NO FILTER STD FLANN LSH ENABLED 0.0 0.04395 0.99463


    10 Intersection matrix

    $$I_{ratio} = \frac{\#(R_e \cap G_t)}{\#(R_e)} \tag{1}$$

    With I_{ratio} being the intersection ratio, R_e being the edges of one algorithm's output graph, and G_t being the edges of the ground truth graph or of another algorithm's output graph. The true-positive score of an algorithm output is computed as the intersection ratio between that output and the ground truth graph.
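    Formula 1 can be sketched in a few lines. Representing each edge as a directed (request, match) tuple is an assumption made for this illustration; the function name intersection_ratio is hypothetical.

```python
def intersection_ratio(result_edges, reference_edges):
    """Share of the result edges that also appear in the reference graph."""
    result_edges = set(result_edges)
    if not result_edges:
        return 0.0  # convention chosen here for an empty output
    return len(result_edges & set(reference_edges)) / len(result_edges)


# Toy graphs: each edge is a (request picture, matched picture) pair.
result = {("a", "b"), ("b", "a"), ("c", "a"), ("d", "c")}
truth = {("a", "b"), ("b", "a"), ("c", "d"), ("d", "c")}

print(intersection_ratio(result, truth))  # 3 of the 4 result edges are in the truth
```

    Passing the ground truth as the reference gives the true-positive score; passing another algorithm's output gives the pairwise similarity.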

    The ground truth graph has cliques 9 [18] of similar nodes. It also has outliers, which have no link to other nodes. This implies that algorithms which always give a matching picture for a given request picture (in other words, algorithms which cannot say "There is no match") cannot reach the maximum score. Some nodes will have edges that can never be "good", as their equivalent node in the ground truth graph has no outbound edge: some guesses will always be wrong if a guess is forced. Therefore, a normalization is applied (Table 9) to the raw true-positive rates as normalized_score = raw_score/max_score.
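    The normalization can be illustrated as follows. The way max_score is computed here (the share of nodes that have at least one outbound edge in the ground truth) is an assumption derived from the reasoning above, not a description of the framework's code; the function names are hypothetical.

```python
def max_score(nodes, truth_edges):
    """Best reachable true-positive rate when one guess is forced per node."""
    has_outbound = {source for source, _ in truth_edges}
    return sum(1 for node in nodes if node in has_outbound) / len(nodes)


def normalized_score(raw_score, maximum):
    return raw_score / maximum


nodes = ["a", "b", "c", "d", "e"]
truth = {("a", "b"), ("b", "a"), ("c", "d"), ("d", "c")}  # "e" is an outlier
maximum = max_score(nodes, truth)  # 4 of 5 nodes can be matched correctly
print(maximum, normalized_score(0.6, maximum))

# Maximum implied by the first row of the shortened results table
# (raw 0.65263, normalized 0.77641).
implied_maximum = 0.65263 / 0.77641
print(round(implied_maximum, 4))
```

    On the reported values, the ratio between raw and normalized scores suggests a maximum obtainable score of roughly 0.84 for the phishing dataset.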

    For all given algorithms, we compute the intersection of their outputs in pairs, using Formula 1. The matrix is presented in Figure 13. The larger the intersection between an Algorithm A (y-axis) and an Algorithm B (x-axis), the "more similar" their outputs are, and the darker the square at their intersection is.
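    A minimal sketch of the pairwise computation is given below. The algorithm names and edge sets are invented toy data, and the helper names are hypothetical; rows correspond to Algorithm A and columns to Algorithm B.

```python
def intersection_ratio(edges_a, edges_b):
    """Formula 1: share of the edges of A that also appear in B."""
    edges_a = set(edges_a)
    return len(edges_a & set(edges_b)) / len(edges_a) if edges_a else 0.0


def intersection_matrix(outputs):
    """Return the sorted algorithm names and the matrix of pairwise ratios."""
    names = sorted(outputs)
    matrix = [[intersection_ratio(outputs[a], outputs[b]) for b in names] for a in names]
    return names, matrix


# Toy rank-1 outputs: one (request, match) edge per picture and per algorithm.
outputs = {
    "A_HASH": {(1, 2), (2, 1), (3, 1)},
    "D_HASH": {(1, 2), (2, 1), (3, 2)},
    "ORB": {(1, 3), (2, 3), (3, 1)},
}
names, matrix = intersection_matrix(outputs)
for name, row in zip(names, matrix):
    print(name, [round(v, 2) for v in row])
```

    In this toy data the two hash outputs overlap more with each other than with the ORB output, which is the kind of block structure the matrix is meant to reveal.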

    We see that numerous ORB configurations are similar (homogeneous dark bottom-right square), while most hash-based algorithms have similar outputs (somewhat homogeneous mid-dark top-left square). We also see that hash-based algorithms and ORB-based configurations have quite different outputs (homogeneous light areas). Therefore, hash-based algorithms and ORB-based configurations may be complementary when combined.

    The evaluation was conducted on the rank-1 matching guess provided by each algorithm for each request picture.

    9 Clique: a subset of nodes of a graph in which every two distinct nodes are adjacent.


    Figure 13: Intersection matrix of the rank-1 guess for each image comparison. Both axes list the evaluated configurations (void_baseline, the raw_phishing_PNG hash-based algorithms TLSH, A_HASH, D_HASH, P_HASH, W_HASH and their variants, then the raw_phishing_PNG_ORB_500 configurations); the colour scale gives the similarity from 0.0 to 1.0 (1 = same). To be condensed and made more readable in the next release.
