Bot Detection Algorithm

Parinita, Computational Linguistics Masters Program, University of Washington

7 pitfalls to avoid when running controlled experiments on the web

1. Picking an OEC for which it is easy to beat the control by doing something clearly “wrong” from a business perspective

2. Incorrectly computing confidence intervals for percent change and for OECs that involve a nonlinear combination of metrics

3. Using standard statistical formulas for computations of variance and power

4. Combining metrics over periods where the proportions assigned to Control and Treatment vary, or over subpopulations sampled at different rates

5. Neglecting to filter robots

6. Failing to validate each step of the analysis pipeline and the OEC components

7. Forgetting to control for all differences, and assuming that humans can keep the variants in sync

Source: http://exp-platform.com/Documents/2009-ExPpitfalls.pdf

In practice, identifying robots is difficult

Common techniques to detect web robots, and their limitations

Robots.txt access check
• Compliance with the Robot Exclusion Standard is voluntary, and many robots don’t follow it

User agent check
• Robots use multiple user agent fields within the same session
• Robots hide their identities by using the same user agent field as standard web browsers

IP address check
• Time consuming, and often discovers robots that are already well known

Count of HEAD and HTTP requests
• Not reliable, as non-robots can also generate HEAD request messages

But the navigation patterns of web robots are distinct from those of human users in terms of

1. Average rate of queries submitted

2. Length of the sessions

3. Interval between successive queries

4. Coverage of the web site
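The four navigation features listed above can be computed per session roughly as follows. This is a minimal sketch, not a method taken from the cited papers: the `Request` record, the `session_features` function, the `site_pages` set, and the one-second floor on session duration are all assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Request:
    timestamp: float  # seconds, e.g. parsed from the access log
    url: str


def session_features(requests, site_pages):
    """Return the four navigation features for one non-empty session.

    `site_pages` is a hypothetical set of all known page URLs on the site.
    """
    times = sorted(r.timestamp for r in requests)
    duration = times[-1] - times[0]
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    visited = {r.url for r in requests}
    return {
        # Length of the session, counted in requests.
        "session_length": len(requests),
        # Requests per second; duration is floored at 1 s (an assumption)
        # so that single-burst sessions do not divide by zero.
        "query_rate": len(requests) / max(duration, 1.0),
        # Mean interval between successive requests, in seconds.
        "mean_interval": mean(gaps) if gaps else 0.0,
        # Fraction of the site's pages touched by this session.
        "coverage": len(visited & site_pages) / len(site_pages),
    }


session = [Request(0, "/a"), Request(2, "/b"), Request(4, "/a"), Request(10, "/c")]
features = session_features(session, {"/a", "/b", "/c", "/d"})
```

A robot session would typically be expected to show a high query rate, short intervals, and high coverage, but where to draw the line is exactly what the statistical models in the following slides are meant to learn.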

A statistical bot detection model works better than a rule-based system

Approach A: Heuristic-regression approach to bot pattern identification (classification algorithm: decision trees)

If Query = ‘robots.txt’ then Confidence_factor = 1.0
If Session_length > 100 and Session_duration < 10 secs then Confidence_factor = 0.95
If User_agent = ‘Mozilla’ and Session_length > 50 then Confidence_factor = 0.87
If default then no identification

Extract of the set of heuristic rules
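The rule extract above can be written as a small function. This is only a sketch of those four rules: the function name `crawler_confidence`, its parameters, the first-match-wins ordering, and the substring test for ‘Mozilla’ are assumptions, since the slide does not say how the rules are evaluated.

```python
from typing import Optional, Sequence


def crawler_confidence(
    queries: Sequence[str],
    user_agent: str,
    session_length: int,
    session_duration_secs: float,
) -> Optional[float]:
    """Return a crawler confidence factor, or None for "no identification".

    Rules are checked in the order listed on the slide, with the default
    rule last; first match wins (an assumption).
    """
    # Rule 1: the session requested robots.txt.
    if any(q.endswith("robots.txt") for q in queries):
        return 1.0
    # Rule 2: many requests in a very short session.
    if session_length > 100 and session_duration_secs < 10:
        return 0.95
    # Rule 3: browser-like user agent but an unusually long session.
    if "Mozilla" in user_agent and session_length > 50:
        return 0.87
    # Default: no identification.
    return None


print(crawler_confidence(["/index.html"], "Mozilla/5.0", 60, 300.0))
```

In the heuristic-regression approach these confidence factors serve as the target values for a regression tree, so the hand-written rules only bootstrap the labels.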

Example of Regression Tree for the crawlers’ confidence factor

Overall Perspective of the processing

Bot detection based on Hidden Markov Models

Approach B: Hidden Markov Model (classification algorithm: Hidden Markov Model)

• Use an HMM to describe the robot access pattern, and then detect robots based on that access model

• One or more requests from the same user that arrive in the same time unit are called a batch arrival

• Calculate the sequences r_t and R_t from the server logs (r_t is the number of requests in the t-th time unit, and R_t is the sum of requests over a given time interval)

• Because human users and robots behave differently, their burst levels differ, and this difference is reflected in r_t and R_t

• We assume that the batch arrival process is controlled by a special Markov chain with M different states

• Our detection method is based on the fact that most robots have similar request arrival patterns, because they follow the same design guidelines

• We use the observed robot sequences to train a robot request pattern model, and then calculate the likelihood of an incoming request sequence under that model
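The scoring step described above can be sketched with a discrete-emission HMM and the scaled forward algorithm. This is an illustration under stated assumptions, not the model from the cited paper: the helper names, the two-state model (M = 2), the capping of r_t at 3 to form observation symbols, and every probability below are hypothetical, and training of the model is not shown.

```python
import math
from collections import Counter


def batch_counts(timestamps, unit=1.0):
    """r_t: number of requests arriving in the t-th time unit."""
    if not timestamps:
        return []
    start = min(timestamps)
    counts = Counter(int((ts - start) // unit) for ts in timestamps)
    return [counts.get(t, 0) for t in range(max(counts) + 1)]


def windowed_sums(r, window):
    """R_t: total requests in the `window` time units ending at t."""
    return [sum(r[max(0, t - window + 1): t + 1]) for t in range(len(r))]


def log_likelihood(obs, start_p, trans_p, emit_p):
    """Scaled forward algorithm: log P(obs | model) for a non-empty obs."""
    m = len(start_p)
    alpha = [start_p[i] * emit_p[i][obs[0]] for i in range(m)]
    total = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [
                sum(alpha[i] * trans_p[i][j] for i in range(m)) * emit_p[j][obs[t]]
                for j in range(m)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")
        total += math.log(scale)          # accumulate log of each scale factor
        alpha = [a / scale for a in alpha]  # renormalise to avoid underflow
    return total


# Hypothetical robot model: state 0 = idle, state 1 = burst.
# Observation symbols are r_t capped at 3, i.e. min(r_t, 3).
robot_start = [0.5, 0.5]
robot_trans = [[0.7, 0.3], [0.3, 0.7]]
robot_emit = [[0.85, 0.05, 0.05, 0.05], [0.05, 0.05, 0.10, 0.80]]

r = batch_counts([0.0, 0.2, 0.5, 2.1, 2.9, 3.0])
symbols = [min(count, 3) for count in r]
score = log_likelihood(symbols, robot_start, robot_trans, robot_emit)
```

A session would then be flagged as a robot when `score` exceeds a threshold chosen on held-out data; the threshold itself is not specified in the slides.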

Bot detection based on Bayesian Approach

Approach C: Bayesian approach (classification algorithm: Naïve Bayes)

Algorithm for bot detection system based on Bayesian Approach

1. Access-log analysis and session identification

2. Session features are selected to be used as variables (nodes) in the Bayesian network

3. Construction of the Bayesian network structure

4. Learning:

(a) Labeling of the set of training examples. At this step, sessions are classified as crawler- or human-initiated sessions to form the set of examples of the two classes

(b) Learning the required Bayesian network parameters using the set of training examples derived from step (a)

(c) Quantification of the Bayesian network using the learned parameters

5. Classification: we extract the features of each session and use them as evidence to be inserted into the Bayesian network model. A probability of each session being a crawler is thus derived.
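Step 1 of the algorithm above, session identification, is often done with an inactivity timeout. The sketch below assumes that approach: the function `identify_sessions`, the `(timestamp, ip, user_agent, url)` entry layout, the client key of IP plus user agent, and the 30-minute timeout are all assumptions rather than details given in the slides.

```python
from collections import defaultdict


def identify_sessions(log_entries, timeout=1800.0):
    """Group (timestamp, ip, user_agent, url) entries into sessions.

    A new session starts whenever the same client has been silent for
    more than `timeout` seconds (30 minutes here, an assumed value).
    """
    by_client = defaultdict(list)
    for ts, ip, agent, url in log_entries:
        by_client[(ip, agent)].append((ts, url))

    sessions = []
    for client, hits in by_client.items():
        hits.sort()  # order each client's hits by timestamp
        current = [hits[0]]
        for prev, hit in zip(hits, hits[1:]):
            if hit[0] - prev[0] > timeout:
                sessions.append((client, current))
                current = []
            current.append(hit)
        sessions.append((client, current))
    return sessions


log = [
    (0, "1.1.1.1", "A", "/a"),
    (100, "1.1.1.1", "A", "/b"),
    (5000, "1.1.1.1", "A", "/c"),
    (50, "2.2.2.2", "B", "/a"),
]
sessions = identify_sessions(log)
```

Keying on the user agent splits a robot that rotates agents within one visit into several sessions, so grouping by IP address alone may be the better choice for some logs.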

This approach employs a Naïve Bayes network to classify the HTTP sessions of a web server access log as crawler-induced or human-induced. The Bayesian network combines various pieces of evidence that have been shown to distinguish crawler traffic from human HTTP traffic.
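The learning and classification steps can be illustrated with a small categorical Naïve Bayes classifier. This is a sketch under assumptions, not the network from the cited paper: the `NaiveBayes` class, the two discretised features (`robots_txt`, `rate`), the Laplace smoothing, and the toy training sessions are all hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Categorical Naive Bayes with Laplace smoothing (alpha)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, sessions, labels):
        # Learn class priors and per-class feature value counts.
        self.class_counts = Counter(labels)
        self.value_counts = defaultdict(Counter)  # (label, feature) -> value counts
        self.values = defaultdict(set)            # feature -> values seen in training
        for features, label in zip(sessions, labels):
            for name, value in features.items():
                self.value_counts[(label, name)][value] += 1
                self.values[name].add(value)
        return self

    def predict_proba(self, features):
        # Insert the session's features as evidence and normalise.
        total = sum(self.class_counts.values())
        log_post = {}
        for label, n in self.class_counts.items():
            lp = math.log(n / total)
            for name, value in features.items():
                count = self.value_counts[(label, name)][value]
                k = len(self.values[name])
                lp += math.log((count + self.alpha) / (n + self.alpha * k))
            log_post[label] = lp
        top = max(log_post.values())
        unnorm = {c: math.exp(lp - top) for c, lp in log_post.items()}
        z = sum(unnorm.values())
        return {c: p / z for c, p in unnorm.items()}


# Toy labelled sessions (hypothetical data for illustration only).
train_x = [
    {"robots_txt": True, "rate": "high"},
    {"robots_txt": False, "rate": "high"},
    {"robots_txt": True, "rate": "high"},
    {"robots_txt": False, "rate": "low"},
    {"robots_txt": False, "rate": "low"},
    {"robots_txt": False, "rate": "high"},
]
train_y = ["crawler", "crawler", "crawler", "human", "human", "human"]

model = NaiveBayes().fit(train_x, train_y)
posterior = model.predict_proba({"robots_txt": True, "rate": "high"})
```

Here `posterior["crawler"]` is the derived probability that the session is a crawler; a real system would use more features and far more labelled sessions than this toy set.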

References

Crook, Thomas and Frasca, Brian and Kohavi, Ron and Longbotham, Roger. 2009. Seven Pitfalls to Avoid when Running Controlled Experiments on the Web.

Tan, Pang-Ning and Kumar, Vipin. 2002. Discovery of Web Robot Sessions based on their Navigational Patterns. Data Mining and Knowledge. 2002, Vol. 6, 1, pp. 9-35. http://citeseer.ist.psu.edu/article/tan02discovery.html.

Stassopoulou, Athena and Dikaiakos, Marios D. 2009. Web robot detection: A probabilistic reasoning approach. Computer Networks 2009.

Lu, Wei-Zhou and Yu, Shun-Zheng. 2006. Web Robot Detection Based on Hidden Markov Model.

Alves, Ronnie and Belo, Orlando and Lourenço, Anália. A Heuristic-Regression Approach to Crawler Pattern Identification on Clickstream Data.
