extracting information from darknet market advertisements
This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 786687
Extracting information from darknet market advertisements and forums
• Sven Schlarb, Mina Schütz, Clemens Heistracher: Austrian Institute of Technology, Competence Unit Data Science & Artificial Intelligence
• Faisal Ghaffar: Innovation Exchange, Cognitive Computing Group
SECURWARE 2020 November 21, 2020 to November 25, 2020 - Valencia, Spain
Overview
Named entity recognition & Relationship extraction
Overview
Manual annotation
Model creation
Working directories
Named Entity Recognition (CKNER)
Relationship Extraction (RELEXT)
Working directories
Storage/Search API
Process Flow
Process Flow
Replay
Harvesting, Scraping, Annotation, Model Creation, Information Extraction
Label suggestions
Annotation interface
(RecogitoJS integrated)
Annotation interface – Loading data
Load grams weapons/drugs dataset
Navigate through annotated postings
Choose vocabulary for annotation
Create regex pattern (e.g. Regex101.com)
Get named entity suggestions
Regular expression batch labelling
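A minimal sketch of regex-based batch labelling, assuming annotations are stored as character-offset spans of the form (start, end, label). The function name, the `CALIBER` label, the pattern and the sample postings are hypothetical and not taken from the tool or the dataset.

```python
import re

def regex_batch_label(texts, pattern, label):
    """Return, per text, a list of (start, end, label) character spans."""
    compiled = re.compile(pattern, flags=re.IGNORECASE)
    return [
        [(m.start(), m.end(), label) for m in compiled.finditer(text)]
        for text in texts
    ]

# Hypothetical postings and pattern, for illustration only.
postings = [
    "Glock 17 with two magazines, ships from EU",
    "Selling 9mm ammo, 50 rounds per box",
]
CALIBER_PATTERN = r"\b\d+(?:\.\d+)?\s?mm\b"
annotations = regex_batch_label(postings, CALIBER_PATTERN, "CALIBER")
```

Testing the pattern interactively first (for example on regex101.com) helps avoid labelling a whole batch with a pattern that is too greedy.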
Multi-user
Multi-user/copkit1
Multi-user/admin
Definition-based batch labelling
• Useful for initial labelling (before user annotation starts)
• Definition file in JSON format required
• Regex terms can be used instead of fixed strings to match tokens
• Filename convention: prelabel_definitions_*.json
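A sketch of how definition-based pre-labelling could work. Only the filename convention comes from the slide; the JSON schema used here (a list of objects with `label` and `terms`) is an assumption, as are all function names and sample values.

```python
import glob
import json
import os
import re
import tempfile

def load_definitions(directory):
    """Read every prelabel_definitions_*.json file found in `directory`."""
    definitions = []
    pattern = os.path.join(directory, "prelabel_definitions_*.json")
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as handle:
            definitions.extend(json.load(handle))
    return definitions

def prelabel(text, definitions):
    """Return sorted (start, end, label) spans for all definition terms."""
    spans = []
    for definition in definitions:
        for term in definition["terms"]:
            # Each term is compiled as a regex; plain strings without
            # special characters behave like fixed-string matches.
            for match in re.finditer(term, text, flags=re.IGNORECASE):
                spans.append((match.start(), match.end(), definition["label"]))
    return sorted(spans)

# Hypothetical definition content; the real schema may differ.
example = [
    {"label": "WEAPON", "terms": ["glock", r"ak-?47"]},
    {"label": "DRUG", "terms": ["mdma"]},
]
with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "prelabel_definitions_demo.json")
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(example, handle)
    spans = prelabel("AK47 and Glock for sale, no MDMA", load_definitions(workdir))
```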
Definition-based batch labelling
Training/test data
Two basic functions for handling train/test data are offered:
1.) Merge test set into training set → Recommended before starting user annotation
2.) Split training set into training and test set → Recommended before starting model creation
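The two operations can be sketched as follows, assuming records are kept in plain Python lists. The function names, the 20% test fraction and the fixed seed are illustrative choices, not settings taken from the tool.

```python
import random

def merge_test_into_train(train, test):
    """Return one training list with all records, plus an empty test list."""
    return train + test, []

def split_train(train, test_fraction=0.2, seed=42):
    """Shuffle a copy of `train` and split it into (train, test)."""
    records = list(train)
    random.Random(seed).shuffle(records)
    n_test = int(round(len(records) * test_fraction))
    return records[n_test:], records[:n_test]

# Hypothetical record identifiers.
train = [f"posting-{i}" for i in range(8)]
test = ["posting-8", "posting-9"]
merged, _ = merge_test_into_train(train, test)
new_train, new_test = split_train(merged, test_fraction=0.2)
```

Merging first and splitting again later means user annotations are collected on the full data set, and the held-out test set is only fixed once model creation starts.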
Merge test set into training set
Split training set into training and test set
Restore backups
Persist annotations
• Update training/test data files with user annotations (user annotations are removed afterwards)
Technical background:
Named entity recognition
Preparation: Text classification
Heistracher, C. & Schlarb, S. Machine Learning Techniques for the Classification of Product Descriptions from Darknet Marketplaces. Proceedings of the 11th International Conference on Applied Informatics, 2020
• Evaluate whether pre-trained embeddings improve the classification performance over simple vector space models, such as TF-IDF, when being transferred to the darknet domain.
• Best performance was achieved with the TensorFlow Universal Sentence Encoder and a linear SVM
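For reference, the TF-IDF baseline mentioned above can be sketched in plain Python. This uses one common smoothed IDF variant and whitespace tokenization, both assumptions; the classifier (the linear SVM) and the sentence-encoder embeddings are deliberately left out, and the sample descriptions are invented.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Return one {term: weight} dict per document (raw tf * smoothed idf)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors

# Hypothetical product descriptions.
docs = ["glock pistol for sale", "mdma pills for sale", "ebook for sale"]
vectors = tfidf_vectors(docs)
```

Terms that occur in every description ("for", "sale") get the minimum weight, while category-specific terms such as "glock" are weighted higher.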
Named entity recognition with spaCy
• spaCy is an open-source library for NLP
• spaCy’s standard models can be used as a basis to build domain-specific models
• For example, spaCy’s “en_core_web_lg” is a convolutional neural network trained for multiple tasks on the OntoNotes 5 dataset.
• The pretrained model provides 18 entity types.
• Words are represented as GloVe vectors that were trained on the Common Crawl dataset.
spaCy Named Entity Recognition - Evaluation
• The dataset used for the evaluation was a subset of (Gwern2015) containing strings from a list of weapon-related terms.
• This subset was manually categorized and the categories were used as labels related to weapons.
• The subset of the data contained 1580 product descriptions including: 1582 Drugs, 769 Weapons, 319 ebooks, 96 3D-printing.
Precision 76.97
Recall 70.88
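As a reminder of how such scores are computed, here is a sketch of exact-match precision, recall and F1 over entity spans. Whether the reported values use exact span matching is an assumption; the F1 value below is simply derived from the two reported numbers and is not stated in the slides.

```python
def precision_recall_f1(gold_spans, predicted_spans):
    """Exact-match scores for collections of (start, end, label) spans."""
    gold, predicted = set(gold_spans), set(predicted_spans)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Hypothetical gold and predicted annotations for one posting.
gold = [(0, 5, "WEAPON"), (10, 14, "DRUG")]
predicted = [(0, 5, "WEAPON"), (20, 25, "DRUG"), (30, 33, "DRUG")]
p, r, f1 = precision_recall_f1(gold, predicted)

# F1 implied by the reported precision and recall (about 73.80).
implied_f1 = f1_from(76.97, 70.88)
```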
Technical background:
Relation extraction
Relation extraction
‐ Focus on Transformer models
‐ Can be pre-trained on smaller datasets
‐ Model can be used for NLP downstream tasks
‐ Several papers already published with BERT, ALBERT, RoBERTa, GPT, ...
‐ Mostly trained and evaluated on SemEval 2010 Task 8 & TACRED datasets
‐ Various approaches
‐ Most commonly used: sentence-level
‐ Semantic Role Labeling
‐ Joint Approaches (NER + RE)
‐ Multi-Relation-Extraction
Relation extraction – approach
‐ Entities are already annotated
‐ No annotated ground truth for relations in our dataset
1. Unsupervised relation extraction
2. Using an already labeled dataset with pre-defined labels such as SemEval 2010 Task 8
‐ Two models fit our approach:
‐ R-BERT
  ‐ BERT, RoBERTa and ALBERT are available
  ‐ Sentence-level
‐ OpenNRE
  ‐ BERT
  ‐ Sentence-, bag-, document-level and few-shot
Relation extraction – R-BERT
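R-BERT (Wu and He, 2019) marks the two entities in the input sentence with special characters, "$" around the first entity and "#" around the second, before passing it to the encoder. Since the entities here are already annotated, preparing such input could look like the sketch below; the function name and the assumption of non-overlapping character-offset spans are ours.

```python
def mark_entities(text, entity1, entity2):
    """Wrap entity1 in '$' and entity2 in '#', given (start, end) offsets."""
    markers = [(entity1, "$"), (entity2, "#")]
    # Insert from the rightmost span first so earlier offsets stay valid.
    ordered = sorted(markers, key=lambda item: item[0][0], reverse=True)
    for (start, end), symbol in ordered:
        text = f"{text[:start]}{symbol} {text[start:end]} {symbol}{text[end:]}"
    return text

sentence = "The kitchen is the last renovated part of the house."
k = sentence.index("kitchen")
h = sentence.index("house")
marked = mark_entities(sentence, (k, k + len("kitchen")), (h, h + len("house")))
```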
Relation extraction – OpenNRE
SOURCES
• Xu Han, Tianyu Gao, Yuan Yao, Deming Ye, Zhiyuan Liu, and Maosong Sun. 2019. OpenNRE: An open and extensible toolkit for neural relation extraction. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations, pages 169–174, Hong Kong, China, November. Association for Computational Linguistics. https://github.com/thunlp/OpenNRE
• Shanchan Wu and Yifan He. 2019. Enriching pre-trained language model with entity information for relation classification. https://github.com/monologg/R-BERT
Thank you!
COPKIT Webinar, 16 September 2020 at 11:00 AM CEST
IBM’s Moment Recognizer (MoRec)
Faisal Ghaffar (IBM), Brian Maguire (IBM)
Wednesday 19th August 2020
2. Component Overview
• As an (LEA) end-user, I would like to extract knowledge from DNM forums to
- understand thread discussions
- determine topics of discussion
- determine trends in topics over time
- identify experts on specific topic(s)
Events timeline
Given a thread, extract who said what over a daily timeline
MoRec
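The "who said what over a daily timeline" idea can be sketched as a simple grouping step. The post structure (`author`, `timestamp`, `text`) and the sample thread are hypothetical; this is not MoRec's actual data model or interface.

```python
from collections import defaultdict
from datetime import datetime

def daily_timeline(posts):
    """Group (author, text) pairs by calendar day, in chronological order."""
    timeline = defaultdict(list)
    ordered = sorted(posts, key=lambda p: datetime.fromisoformat(p["timestamp"]))
    for post in ordered:
        day = datetime.fromisoformat(post["timestamp"]).date().isoformat()
        timeline[day].append((post["author"], post["text"]))
    return dict(timeline)

# Hypothetical thread with three posts.
thread = [
    {"author": "userB", "timestamp": "2020-08-19T09:30:00", "text": "Is escrow available?"},
    {"author": "userA", "timestamp": "2020-08-18T21:05:00", "text": "New listing is up."},
    {"author": "userA", "timestamp": "2020-08-19T10:00:00", "text": "Yes, escrow only."},
]
timeline = daily_timeline(thread)
```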
Complete Process
Extracted Text
Preprocessing
Word Segmentation
Post Segmentation
Thread Segmentation
Domain Dictionary
Topic Detection, Summarization, Q&A and NE
Events Generation
Domain Knowledge
Topic Detection
1. Given a darknet forum, determine topics of discussion from individual post messages.
MoRec
Community support
Business activities
Risk management
Text Summarization
Provide ‘extractive’ and ‘abstractive’ summary
MoRec
Summary
Question answering
Given a forum, retrieve answers to questions related to text
Example: question-answer from a context text. Image Source: https://rajpurkar.github.io/mlx/qa-and-squad/
Model and its Performance
RoBERTa
- Based on BERT with the following differences
- Dynamic word-masking instead of static
- Removed NSP training requirement
- Trained with batches of 8k sequences, compared to 256 for BERT
Evaluation Metric
• Label Ranking Average Precision
• Final Model: {'LRAP': 0.967, 'eval loss': 0.185}
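Label Ranking Average Precision (LRAP) asks, for each true label of a sample, what fraction of the labels ranked at or above it are also true, and then averages over labels and samples. A plain-Python sketch follows; skipping samples without any true label is a simplification made here, and the toy inputs are invented.

```python
def label_ranking_average_precision(y_true, y_score):
    """LRAP over samples; samples without any true label are skipped."""
    per_sample = []
    for truth, scores in zip(y_true, y_score):
        relevant = [j for j, flag in enumerate(truth) if flag]
        if not relevant:
            continue
        total = 0.0
        for j in relevant:
            # Rank of label j: number of labels scored at least as high.
            rank = sum(1 for s in scores if s >= scores[j])
            # Number of true labels among those.
            hits = sum(1 for k in relevant if scores[k] >= scores[j])
            total += hits / rank
        per_sample.append(total / len(relevant))
    return sum(per_sample) / len(per_sample)

# Toy example: two samples, three candidate labels each.
y_true = [[1, 0, 0], [0, 0, 1]]
y_score = [[0.75, 0.5, 1.0], [1.0, 0.2, 0.1]]
lrap = label_ranking_average_precision(y_true, y_score)
```

A value of 1.0 means every true label is ranked above all false ones, so the reported 0.967 indicates that true categories are almost always ranked near the top.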
Component functions
• Multi-Label Categorization based on trained LM
- Does a forum post belong to CaaS and carding, hacking, trafficking, drugs
• Trained language models (LM) (RoBERTa) on darknet data
• Summarization updated based on LM
• Topic detection based on embeddings
• Similarity search function
• Timeline creation from forum threads
• Question-answering (TBD)
• Trade Network extraction and analysis (TBD)
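An embedding-based similarity search can be sketched with cosine similarity over post vectors. The three-dimensional vectors and post identifiers below are toy values; in practice the vectors would come from the trained language model, and the component's actual interface is not shown in the slides.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if one is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def most_similar(query, index, top_k=2):
    """Return the top_k (post_id, similarity) pairs, best match first."""
    scored = [(post_id, cosine(query, vector)) for post_id, vector in index.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]

# Toy embedding index keyed by hypothetical post identifiers.
index = {
    "post-1": [1.0, 0.0, 0.0],
    "post-2": [0.9, 0.1, 0.0],
    "post-3": [0.0, 0.0, 1.0],
}
results = most_similar([1.0, 0.0, 0.0], index)
```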
Component Interface (Swagger)
Thank you!
Time for Questions and Discussion