Bot Detection Algorithm
Parinita, Computational Linguistics Masters Program, University of Washington
7 pitfalls to avoid when running controlled experimentation on the web
1. Picking an OEC for which it is easy to beat the control by doing something clearly “wrong” from a business perspective
2. Incorrectly computing confidence intervals for percent change and for OECs that involve a nonlinear combination of metrics
3. Using standard statistical formulas for computations of variance and power
4. Combining metrics over periods where the proportions assigned to Control and Treatment vary, or over subpopulations sampled at different rates
5. Neglecting to filter robots
6. Failing to validate each step of the analysis pipeline and the OEC components
7. Forgetting to control for all differences, and assuming that humans can keep the variants in sync
Source: http://exp-platform.com/Documents/2009-ExPpitfalls.pdf
In practice, identifying robots is difficult
Common techniques to detect web robots, and their limitations:
Robots.txt access check
• Compliance with the Robot Exclusion Standard is voluntary, and many robots don’t follow it
User agent check
• Robots use multiple user agent fields within the same session
• Robots hide their identities by using the same user agent field as standard web browsers
IP address check
• Time consuming, and often discovers robots that are already well known
Count of HEAD and HTTP requests
• Not reliable, as non-robots can generate HEAD request messages
But the navigation patterns of web robots are distinct from those of human users in terms of:
1. Average rate of queries submitted
2. Length of the sessions
3. Interval between successive queries
4. Coverage of the web site
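As a rough illustration, the four navigational characteristics listed above could be computed per session along the following lines. This is only a sketch: the `Request` record, the `session_features` helper, the one-second floor on session duration, and the definition of coverage as distinct URLs divided by a known page count are all assumptions made for the example, not part of the original slides.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    """Hypothetical log record: one request inside an already identified session."""
    timestamp: float  # seconds since an arbitrary origin
    url: str


def session_features(requests: List[Request], site_pages: int) -> Dict[str, float]:
    """Compute length, rate, mean inter-request interval, and coverage of a session."""
    if not requests:
        raise ValueError("session must contain at least one request")
    times = sorted(r.timestamp for r in requests)
    duration = times[-1] - times[0]
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return {
        "length": len(requests),
        # Assumption: floor the duration at one second to avoid dividing by zero.
        "rate": len(requests) / max(duration, 1.0),
        "mean_interval": sum(gaps) / len(gaps) if gaps else 0.0,
        # Assumption: coverage = distinct URLs requested / pages on the site.
        "coverage": len({r.url for r in requests}) / site_pages,
    }
```

A session with a high rate, short intervals, and broad coverage would be a candidate robot under this view, though the thresholds would have to come from labeled data.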
Statistical bot detection model works better than a rule-based system
Approach A: Heuristic-regression approach to bot pattern identification. Classification algorithm: decision trees.
Extract of the set of heuristic rules:
If Query = ‘robots.txt’ then Confidence_factor = 1.0
If Session_length > 100 and Session_duration < 10 secs then Confidence_factor = 0.95
If User_agent = ‘Mozilla’ and Session_length > 50 then Confidence_factor = 0.87
If default then no identification
[Figure: Example of regression tree for the crawlers’ confidence factor]
[Figure: Overall perspective of the processing]
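The extract of heuristic rules above can be read as a small rule cascade. The sketch below encodes it in Python under two assumptions that the slide does not state: rules are tried in the order shown with the first match winning, and the default case is represented by `None`. The function name and parameter names are hypothetical.

```python
from typing import Optional


def crawler_confidence(
    query: str,
    user_agent: str,
    session_length: int,
    session_duration_secs: float,
) -> Optional[float]:
    """Return the confidence factor of the first matching rule, or None."""
    if query == "robots.txt":
        return 1.0
    if session_length > 100 and session_duration_secs < 10:
        return 0.95
    if user_agent == "Mozilla" and session_length > 50:
        return 0.87
    # Default rule: no identification.
    return None


# Example: a long "Mozilla" session that is not especially fast.
print(crawler_confidence("index.html", "Mozilla", 60, 300.0))  # prints 0.87
```

In the heuristic-regression approach, confidence factors like these would serve as the target that the regression tree learns to predict from session attributes; that training step is not shown here.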
Bot detection based on Hidden Markov Models
Approach B: Hidden Markov Model. Classification algorithm: Hidden Markov Model.
• Use an HMM to describe the robot access pattern, then detect robots based on that access model
• One or more requests from the same user that arrive in the same time unit are called a batch arrival
• Calculate the sequences r_t and R_t from the server logs (r_t is the number of requests in the t-th time unit, and R_t is the summation of requests in a given time interval)
• Because human users and robots behave differently, they show different burst levels, which are also reflected in r_t and R_t
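To make the two sequences concrete, here is a minimal sketch of how they might be derived from the request timestamps of one user. The slide does not say which interval the summation covers, so the sliding window of `window` time units used here is an assumption, as are the function names.

```python
from typing import List


def batch_counts(timestamps: List[float], unit: float = 1.0) -> List[int]:
    """r_t: number of requests in the t-th time unit, counted from the first request."""
    if not timestamps:
        return []
    start = min(timestamps)
    n_units = int((max(timestamps) - start) // unit) + 1
    counts = [0] * n_units
    for ts in timestamps:
        counts[int((ts - start) // unit)] += 1
    return counts


def windowed_sums(r: List[int], window: int) -> List[int]:
    """Assumed reading of the summed sequence: total of r over the last `window` units."""
    sums = []
    running = 0
    for t, value in enumerate(r):
        running += value
        if t >= window:
            running -= r[t - window]  # drop the unit that left the window
        sums.append(running)
    return sums


r = batch_counts([0.0, 0.2, 0.9, 1.5, 3.1])
print(r)                    # prints [3, 1, 0, 1]
print(windowed_sums(r, 2))  # prints [3, 4, 1, 1]
```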
• We assume that the batch arrival process is controlled by a Markov chain with M different states
• Our detection method is based on the fact that most robots have similar request arrival patterns, because they follow the same design guidelines
• We use the observed sequences of known robots to train a robot request pattern model, then calculate the likelihood of an incoming request sequence under that model
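The scoring step, computing the likelihood of an observed sequence under a trained model, is commonly done with the forward algorithm. The sketch below shows a scaled forward pass for a discrete-observation HMM. It assumes the request counts have already been quantized into a small set of integer symbols and that the model parameters (`start`, `trans`, `emit`) were estimated beforehand, for example with Baum-Welch; training is not shown, and nothing here is taken from the cited paper's implementation.

```python
import math
from typing import List, Sequence


def log_likelihood(
    obs: Sequence[int],
    start: List[float],
    trans: List[List[float]],
    emit: List[List[float]],
) -> float:
    """Scaled forward algorithm: log P(obs | model) for a discrete HMM.

    start[i]    probability of starting in state i
    trans[i][j] probability of moving from state i to state j
    emit[i][k]  probability of emitting symbol k in state i
    """
    if not obs:
        raise ValueError("observation sequence must not be empty")
    n_states = len(start)
    alpha = [start[i] * emit[i][obs[0]] for i in range(n_states)]
    log_prob = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [
                sum(alpha[i] * trans[i][j] for i in range(n_states)) * emit[j][obs[t]]
                for j in range(n_states)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # sequence is impossible under this model
        log_prob += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return log_prob
```

One plausible way to use this is to compare the log-likelihood per observation against a threshold chosen on labeled sessions, flagging sequences that the robot model explains well.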
Bot detection based on Bayesian Approach
Approach C: Bayesian approach. Classification algorithm: Naïve Bayes.
Algorithm for a bot detection system based on the Bayesian approach:
1. Access-log analysis and session identification
2. Session features are selected to be used as variables (nodes) in the Bayesian network
3. Construction of the Bayesian network structure
4. Learning:
(a) Labeling of the set of training examples. At this step, sessions are classified as crawler- or human-initiated sessions to form the set of examples of the two classes
(b) Learning the required Bayesian network parameters using the set of training examples derived from step (a)
(c) Quantification of the Bayesian network using the learned parameters
5. Classification: we extract the features of each session and use them as evidence to be inserted into the Bayesian network model. A probability of each session being a crawler is thus derived.
This approach employs a Naïve Bayes network to classify the HTTP sessions of a web server access log as crawler or human induced. The Bayesian network combines various pieces of evidence that were shown to distinguish between crawler and human HTTP traffic.
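The learning and classification steps can be sketched with a small categorical Naïve Bayes classifier. This is an illustration only: it assumes session features have already been discretized into string values, uses Laplace smoothing (a choice made for this example), and the feature names and training sessions below are invented toy data, not results from the cited work.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Session = Dict[str, str]  # feature name -> discretized value


class NaiveBayes:
    def __init__(self, alpha: float = 1.0) -> None:
        self.alpha = alpha  # Laplace smoothing constant (assumption)
        self.class_counts: Counter = Counter()
        self.value_counts = defaultdict(Counter)  # (label, feature) -> value counts
        self.values = defaultdict(set)            # feature -> values seen in training

    def fit(self, examples: List[Tuple[Session, str]]) -> "NaiveBayes":
        """Learn class priors and per-class feature counts from labeled sessions."""
        for features, label in examples:
            self.class_counts[label] += 1
            for name, value in features.items():
                self.value_counts[(label, name)][value] += 1
                self.values[name].add(value)
        return self

    def posterior(self, features: Session) -> Dict[str, float]:
        """Return P(label | features), assuming every session has every feature."""
        total = sum(self.class_counts.values())
        log_scores = {}
        for label, count in self.class_counts.items():
            score = math.log(count / total)
            for name, value in features.items():
                seen = self.value_counts[(label, name)]
                smoothed = (seen[value] + self.alpha) / (
                    count + self.alpha * len(self.values[name])
                )
                score += math.log(smoothed)
            log_scores[label] = score
        top = max(log_scores.values())
        weights = {k: math.exp(v - top) for k, v in log_scores.items()}
        norm = sum(weights.values())
        return {k: w / norm for k, w in weights.items()}


# Toy, made-up training sessions for illustration only.
training = [
    ({"rate": "high", "robots_txt": "yes"}, "crawler"),
    ({"rate": "high", "robots_txt": "yes"}, "crawler"),
    ({"rate": "high", "robots_txt": "no"}, "crawler"),
    ({"rate": "low", "robots_txt": "no"}, "human"),
    ({"rate": "low", "robots_txt": "no"}, "human"),
    ({"rate": "high", "robots_txt": "no"}, "human"),
]
model = NaiveBayes().fit(training)
probs = model.posterior({"rate": "high", "robots_txt": "yes"})
```

With this toy data, `probs["crawler"]` works out to 0.24 / 0.28, roughly 0.857, which is the kind of per-session crawler probability the classification step describes.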
References
Crook, Thomas and Frasca, Brian and Kohavi, Ron and Longbotham, Roger. 2009. Seven Pitfalls to Avoid when Running Controlled Experiments on the Web.
Tan, Pang-Ning and Kumar, Vipin. 2002. Discovery of Web Robot Sessions Based on their Navigational Patterns. Data Mining and Knowledge Discovery, Vol. 6, No. 1, pp. 9-35. http://citeseer.ist.psu.edu/article/tan02discovery.html
Stassopoulou, Athena and Dikaiakos, Marios D. 2009. Web Robot Detection: A Probabilistic Reasoning Approach. Computer Networks.
Lu, Wei-Zhou and Yu, Shun-Zheng. 2006. Web Robot Detection Based on Hidden Markov Model.
Alves, Ronnie and Belo, Orlando and Lourenço, Anália. A Heuristic-Regression Approach to Crawler Pattern Identification on Clickstream Data.