extracting information from darknet market advertisements

52
This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 786687 Extracting information from darknet market advertisements and forums Sven Schlarb [email protected] , Mina Schütz [email protected] , Clements Heistracher [email protected] AIT Austrian Institute of Technology, Competence Unit Data Science & Artificial Inteligence Faisal Ghaffar [email protected] IBM Innovation Exchange, Cognitive Computing Group SECURWARE 2020 November 21, 2020 to November 25, 2020 - Valencia, Spain

Upload: others

Post on 05-Jun-2022

8 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Extracting information from darknet market advertisements

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 786687

Extracting information from darknetmarket advertisements and forums

• Sven Schlarb [email protected], Mina Schütz [email protected], Clements Heistracher [email protected] Austrian Institute of Technology, Competence Unit Data Science & Artificial Inteligence

• Faisal Ghaffar [email protected] Innovation Exchange, Cognitive Computing Group

SECURWARE 2020 November 21, 2020 to November 25, 2020 - Valencia, Spain

Page 2: Extracting information from darknet market advertisements

2SECURWARE 2020 November 21, 2020 to November 25, 2020 - Valencia, Spain

OverviewNamed entity recognition & Relationship extraction

Page 3: Extracting information from darknet market advertisements

3SECURWARE 2020 November 21, 2020 to November 25, 2020 - Valencia, Spain

Overview

Page 4: Extracting information from darknet market advertisements

4SECURWARE 2020 November 21, 2020 to November 25, 2020 - Valencia, Spain

Manual annotation

Page 5: Extracting information from darknet market advertisements

5SECURWARE 2020 November 21, 2020 to November 25, 2020 - Valencia, Spain

Model creation

Page 6: Extracting information from darknet market advertisements

6SECURWARE 2020 November 21, 2020 to November 25, 2020 - Valencia, Spain

Working directories

Page 7: Extracting information from darknet market advertisements

7SECURWARE 2020 November 21, 2020 to November 25, 2020 - Valencia, Spain

Named Entity Recognition (CKNER)

Page 8: Extracting information from darknet market advertisements

8SECURWARE 2020 November 21, 2020 to November 25, 2020 - Valencia, Spain

Relationship Extraction (RELEXT)

Page 9: Extracting information from darknet market advertisements

9SECURWARE 2020 November 21, 2020 to November 25, 2020 - Valencia, Spain

Working directories

Page 10: Extracting information from darknet market advertisements

10SECURWARE 2020 November 21, 2020 to November 25, 2020 - Valencia, Spain

Storage/Search API

Page 11: Extracting information from darknet market advertisements

11SECURWARE 2020 November 21, 2020 to November 25, 2020 - Valencia, Spain

Process Flow

Page 12: Extracting information from darknet market advertisements

12SECURWARE 2020 November 21, 2020 to November 25, 2020 - Valencia, Spain

Process Flow

Replay

Harvesting Scraping AnnotationModel

CreationInformation Extraction

Label suggestions

Page 13: Extracting information from darknet market advertisements

13SECURWARE 2020 November 21, 2020 to November 25, 2020 - Valencia, Spain

Annotation interface

(RecogitoJS integrated)

Page 14: Extracting information from darknet market advertisements

14SECURWARE 2020 November 21, 2020 to November 25, 2020 - Valencia, Spain

Annotation interface – Loading data

Page 15: Extracting information from darknet market advertisements

15SECURWARE 2020 November 21, 2020 to November 25, 2020 - Valencia, Spain

Load grams weapons/drugs dataset

Page 16: Extracting information from darknet market advertisements

16SECURWARE 2020 November 21, 2020 to November 25, 2020 - Valencia, Spain

Navigate through annotated postings

Page 17: Extracting information from darknet market advertisements

17SECURWARE 2020 November 21, 2020 to November 25, 2020 - Valencia, Spain

Choose vocabulary for annotation

Page 18: Extracting information from darknet market advertisements

18SECURWARE 2020 November 21, 2020 to November 25, 2020 - Valencia, Spain

Create regex pattern (e.g. Regex101.com)

Page 19: Extracting information from darknet market advertisements

19SECURWARE 2020 November 21, 2020 to November 25, 2020 - Valencia, Spain

Get named entity suggestions

Page 20: Extracting information from darknet market advertisements

20SECURWARE 2020 November 21, 2020 to November 25, 2020 - Valencia, Spain

Regular expression batch labelling

Page 21: Extracting information from darknet market advertisements

21SECURWARE 2020 November 21, 2020 to November 25, 2020 - Valencia, Spain

Multi-user

Page 22: Extracting information from darknet market advertisements

22SECURWARE 2020 November 21, 2020 to November 25, 2020 - Valencia, Spain

Multi-user/copkit1

Page 23: Extracting information from darknet market advertisements

23SECURWARE 2020 November 21, 2020 to November 25, 2020 - Valencia, Spain

Multi-user/admin

Page 24: Extracting information from darknet market advertisements

24SECURWARE 2020 November 21, 2020 to November 25, 2020 - Valencia, Spain

Definition-based batch labelling

• Useful for initial labelling (before userannotation starts)

• Definition file in JSON format required

• Regex terms can be used instead of fixedstrings to match tokens

• Filename convention: prelabel_definitions_*.json

Page 25: Extracting information from darknet market advertisements

25SECURWARE 2020 November 21, 2020 to November 25, 2020 - Valencia, Spain

Definition-based batch labelling

Page 26: Extracting information from darknet market advertisements

26SECURWARE 2020 November 21, 2020 to November 25, 2020 - Valencia, Spain

Training/test data

Data Set

Training Test

Two basic functions for handling train/test data are offered: 1) merge test into train, 2) split train into train and test

Training

Training Test

1.) Merge test set into training set→ Recommended before starting user annotation

2.) Split training set into training and test set→ Recommended before starting model creation

Page 27: Extracting information from darknet market advertisements

27SECURWARE 2020 November 21, 2020 to November 25, 2020 - Valencia, Spain

Merge test set into training set

Page 28: Extracting information from darknet market advertisements

28SECURWARE 2020 November 21, 2020 to November 25, 2020 - Valencia, Spain

Split training set into training and test set

Page 29: Extracting information from darknet market advertisements

29SECURWARE 2020 November 21, 2020 to November 25, 2020 - Valencia, Spain

Restore backups

Page 30: Extracting information from darknet market advertisements

30SECURWARE 2020 November 21, 2020 to November 25, 2020 - Valencia, Spain

Persist annotations

• Update training/test data files with user annotations (userannotations are removed afterwards)

Page 31: Extracting information from darknet market advertisements

31SECURWARE 2020 November 21, 2020 to November 25, 2020 - Valencia, Spain

Technical background:

Named entity recognition

Page 32: Extracting information from darknet market advertisements

32SECURWARE 2020 November 21, 2020 to November 25, 2020 - Valencia, Spain

Preparation: Text classification

Heistracher, C. & Schlarb, S.Machine Learning Techniques for the Classification of Product Descriptions from Darknet MarketplacesProceedings of the 11th International Conference on Applied Informatics, 2020

• Evaluate whether pre-trained embeddings improve the classification performance over simple vector space models, such as TF-IDF, when being transferred to the darknet domain.

• Best performance was achieved with Tensorflow Universal Sentence Encoder and a linear SVM

Page 33: Extracting information from darknet market advertisements

33SECURWARE 2020 November 21, 2020 to November 25, 2020 - Valencia, Spain

Named entity recognition with SpaCy

• SpaCy is an open-source library for NLP

• SpaCy’s standard models can be used as a basis to build domain-specific models

• For example, SpaCy’s “en_core_web_lg” is a convolutional neural network trained for multiple tasks on the OntoNotes 5 dataset.

• 18 entity types that are available in the pretrained model.

• Words are represented as GloVe vectors, that were trained on the common crawl dataset.

Page 34: Extracting information from darknet market advertisements

34SECURWARE 2020 November 21, 2020 to November 25, 2020 - Valencia, Spain

Spacy Named Entity Recognition - Evaluation

• Dataset usef for the evaluation was a subset of (Gwern2015) that contained strings from a list of weapon related terms.

• This subset was manually categorized and the categories were used as labels related to weapons.

• The subset of the data contained 1580 product descriptions including: 1582 Drugs, 769 Weapons, 319 ebooks, 96 3D-printing.

Precision 76.97

Recall 70.88

Page 35: Extracting information from darknet market advertisements

35SECURWARE 2020 November 21, 2020 to November 25, 2020 - Valencia, Spain

Technical background:

Relation extraction

Page 36: Extracting information from darknet market advertisements

36SECURWARE 2020 November 21, 2020 to November 25, 2020 - Valencia, Spain

Relation extraction

‐ Focus on Transformer models‐ Can be pre-trained on smaller datasets

‐ Model can be used for NLP downstream tasks

‐ Several papers already published with BERT, ALBERT, RoBERTA, GPT..

‐ Mostly trained and evaluated on SemEval 2010 Task 8 & TACRED datasets

‐ Various approaches

‐ Most commonly used: sentence-level

‐ Semantic Role Labeling

‐ Joint Approaches (NER + RE)

‐ Multi-Relation-Extraction

36

Page 37: Extracting information from darknet market advertisements

37SECURWARE 2020 November 21, 2020 to November 25, 2020 - Valencia, Spain

Relation extraction – approach

‐ Entities are already annotated‐ No ground truth for relations for our dataset annotated

1. Unsupervised relation extraction

2. Using an already labeled dataset with pre-defined labels such as SemEval 2010 Task 8

‐ Two models are fitting for our approach:‐ R-BERT

‐ BERT, RoBERTa and ALBERT are available

‐ Sentence-level

‐ OpenNRE‐ BERT

‐ Sentence-, bag-, document-level and few-shot

37

Page 38: Extracting information from darknet market advertisements

38SECURWARE 2020 November 21, 2020 to November 25, 2020 - Valencia, Spain

Relation extraction – R-BERT

38

Page 39: Extracting information from darknet market advertisements

39SECURWARE 2020 November 21, 2020 to November 25, 2020 - Valencia, Spain

Relation extraction – OPENnre

39

Page 40: Extracting information from darknet market advertisements

40SECURWARE 2020 November 21, 2020 to November 25, 2020 - Valencia, Spain

SOURCES

• Xu Han, Tianyu Gao, Yuan Yao, Deming Ye, Zhiyuan Liu, and Maosong Sun. 2019. OpenNRE: An open and extensible toolkit for neural relation extraction. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations, pages 169–174, Hong Kong, China, November. Association forComputational Linguisticts. https://github.com/thunlp/OpenNRE

• Shanchan Wu and Yifan He. 2019. Enriching pre-trained language model with entity information for relation classification. https://github.com/monologg/R-BERT

40

Page 41: Extracting information from darknet market advertisements

41SECURWARE 2020 November 21, 2020 to November 25, 2020 - Valencia, Spain

Thank you !

Page 42: Extracting information from darknet market advertisements

42COPKIT Webinar, 16 September 2020 at 11:00 AM CEST

IBM’s Moment Recognizer(MoRec)

Faisal Ghaffar (IBM), Brian Maguire (IBM)Wednesday 19th August 2020

Page 43: Extracting information from darknet market advertisements

43COPKIT Webinar, 16 September 2020 at 11:00 AM CEST

2. Component Overview• AS an (LEA) end-user, I would like to extract knowledge from DNM forums to

- understand thread discussions- determine topics of discussion - determine trends in topics over time- identify experts on specific topic(s)

Page 44: Extracting information from darknet market advertisements

44COPKIT Webinar, 16 September 2020 at 11:00 AM CEST

Events timeline

Given a thread, extract who said what over a daily timeline

MoRec

Page 45: Extracting information from darknet market advertisements

45COPKIT Webinar, 16 September 2020 at 11:00 AM CEST

Complete Process

Extracted Text

Preprocessing

Word Segmentation

Post Segmentation

Thread Segmentation

Domain Dictionary

Topic Detection Summarization Q&A and NE

Events Generation

Domain Knowledge

Page 46: Extracting information from darknet market advertisements

46COPKIT Webinar, 16 September 2020 at 11:00 AM CEST

Topic Detection

1. Given a darknet forum, determine topics of discussion from individual post messages.

MoRec

Community support

Business activities

Risk management

Page 47: Extracting information from darknet market advertisements

47COPKIT Webinar, 16 September 2020 at 11:00 AM CEST

Text Summarization

Provide ‘extractive’ and ‘abstractive’ summary

MoRec

Summary

Page 48: Extracting information from darknet market advertisements

48COPKIT Webinar, 16 September 2020 at 11:00 AM CEST

Question answering

Given a forum, retrieve answers to questions related to text

Example: question-answer from a context text. Image Source:https://rajpurkar.github.io/mlx/qa-and-squad/

Page 49: Extracting information from darknet market advertisements

49COPKIT Webinar, 16 September 2020 at 11:00 AM CEST

Model and its Performance

RoBERTa- Based on BERT with following differeneces- Dynamic word-masking instead of static- Removed NSP training requirement- Trained on 8k sequences compared to 256 of BERT

Evaluation Metric• Label Ranking Average Precision• Final Model : {'LRAP': 0.967, 'eval loss': 0.185}

Page 50: Extracting information from darknet market advertisements

50COPKIT Webinar, 16 September 2020 at 11:00 AM CEST

Component functions• Multi-Label Categorization based on trained LM

- Is forum post belong to CaaS and carding, hacking, trafficking,drugs• Trained Language models (LM) (RoBERTa ) on dark net data• Summarization updated based on LM• Topic detection based on embeddings• similarity search function• Timeline creation from forum threads• Question-answering (TBD)• Trade Network extraction and analysis (TBD)

Page 51: Extracting information from darknet market advertisements

51COPKIT Webinar, 16 September 2020 at 11:00 AM CEST

Component Interface (Swagger)

Page 52: Extracting information from darknet market advertisements

52COPKIT Webinar, 16 September 2020 at 11:00 AM CEST

Thank you !Time for Questions and

Discussion