The assessment of machine decisions Teresa Scantamburlo (University of Bristol)


Page 1

The assessment of machine decisions

Teresa Scantamburlo (University of Bristol)

Page 2

Outline

● Machine learning and human decision-making

● Case study: The Harm Assessment Risk Tool (HART)

● Normative benchmarks for the assessment of machine decisions

● Challenges and concluding remarks

● Questions and discussion

Page 3

Machine learning and human decision-making

The progression from computer science to the Internet to machine learning was inevitable: computers enable the Internet, which creates a flood of data and the problem of limitless choice; and machine learning uses the flood of data to help solve the limitless choice problem.

(P. Domingos, The Master Algorithm, 2015, p. 12)

Page 4

Modern Artificial Intelligence (AI)

● We are getting used to AI systems performing complex tasks (translating text, recognising speech, recommending products, predicting diseases, driving cars, etc.)

● Early attempts in AI were based on top-down approaches (i.e. building mathematical models of human capabilities such as vision or language)

● Modern AI is created in a different way, i.e. by exposing learning algorithms to millions of examples, an approach known as “machine learning”

Page 5

Learning to classify

● An important group of learning tasks includes classification: assigning an item to one of many possible categories

● Many discrete decisions can be framed as a classification problem: e.g. blocking emails, choosing the next move in a game, assigning a medical treatment, and so forth

● Objects can be described by a set of features (e.g. “weight” and “length”) and a set of class labels (e.g. “sea bass” and “salmon”). The goal is to build a function that assigns a label to any new object

Duda, Hart and Stork (2001)
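The fish example can be turned into a toy classifier. The sketch below is a minimal nearest-neighbour rule on invented weight/length measurements; it is an illustration of the idea, not the method described by Duda, Hart and Stork.

```python
# Toy illustration: classify a fish as "sea bass" or "salmon" from two
# features (weight in kg, length in cm), using the label of the closest
# training example. All numbers are invented.

def classify(new_fish, training_data):
    """Assign the label of the nearest training example."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    features, label = min(training_data, key=lambda ex: dist(ex[0], new_fish))
    return label

training_data = [
    ((5.0, 80.0), "sea bass"),   # heavier, longer
    ((6.2, 90.0), "sea bass"),
    ((2.1, 55.0), "salmon"),     # lighter, shorter
    ((2.8, 60.0), "salmon"),
]

print(classify((2.5, 58.0), training_data))  # salmon
```

The learnt function here is implicit in the training data: any new object gets whichever label its closest known example carries.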

Page 6

Recommending books

● U = set of users, each one with a set of characterising features

● User_i = [“id”, “list_bought books”, “list_searched words”, “rates”, “likes”,….]

● B = set of books

● Function f: U → B

● Given a set of users’ purchase behaviours, which book should be recommended to Alice?

● The decision to recommend a book is based on statistical correlations between the features and the purchased items

● Performance measures tell us how many mistakes the classifier makes on new, unseen data (e.g. recommended books that are not purchased)
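One very simple instance of the function f: U → B can be sketched as follows (the users, books and purchase sets are invented): recommend to Alice the book most frequently bought by users who share at least one purchase with her.

```python
from collections import Counter

# Hypothetical sketch of f: U -> B based on co-purchase correlations.
# All names and data are invented for illustration.

purchases = {
    "alice": {"Book A"},
    "bob":   {"Book A", "Book B"},
    "carol": {"Book A", "Book B", "Book C"},
    "dave":  {"Book D"},
}

def recommend(user, purchases):
    """Return the book most often bought by users who overlap with `user`."""
    owned = purchases[user]
    counts = Counter()
    for other, books in purchases.items():
        if other != user and owned & books:   # shares at least one purchase
            counts.update(books - owned)      # count books the user lacks
    return counts.most_common(1)[0][0] if counts else None

print(recommend("alice", purchases))  # Book B
```

Evaluating such a recommender then amounts to counting, on held-out purchase data, how often the recommended book was actually bought.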

Page 7

Classifying objects and people

● ML algorithms can be used for classifying: E-mails / Products / News / Videos / Restaurants / Hotels

● But they can also be used for classifying people.

● The strength of classification algorithms is that they can easily frame “diagnostic questions” such as:

○ Will this student succeed?
○ Will this candidate meet the company’s current and future needs?
○ Which customers will best match your business?
○ Which persons will be more likely to develop depression?
○ Will this person commit violence?

● The diagnostic quality of machine learning is better understood in terms of prediction rather than causality

Page 8

Credit scoring

● Problem: to predict the risk of lending money to a particular individual (i.e. a credit score), based on a list of key attributes held in a credit report, such as payment history, account information and credit utilization

● Example of countries issuing credit scores: USA (e.g. FICO), Canada, UK (e.g. Experian), Germany (e.g. Schufa), China, India, etc.

● Credit scores are used by banks, credit card companies and lenders to assess the creditworthiness of individuals. But credit reports can also be accessed by potential employers, landlords, utility companies, mobile phone companies, etc.

● Impact: denial/access to loans; high/low interest rates; be hired / not hired; denial/access to rent

Ranking of Experian credit scores (https://www.experian.co.uk/)

Page 9

Health care

● DeepMind has been collaborating with London’s Moorfields Eye Hospital since 2016 to automatically detect eye conditions and prioritise patients in need of urgent intervention

● A deep learning architecture, trained on 14,884 three-dimensional scans, outputs one of 4 referral suggestions (urgent, semi-urgent, routine, observation only), based on current clinical practice

● The framework was tested against clinical experts on 997 patients. The model matched, and in some cases exceeded, the experts’ performance (94% prediction accuracy)

De Fauw et al. 2018, Nature Medicine

Page 10

Employment

● Pre-screening / resume analysers

○ Example: pomato, a company based in Princeton (USA), analyses candidates’ resumes and ranks them based on job requirements (see http://www.pomato.com/)

● Job interviews

○ Example: HireVue, a company based in the USA but also operating in Europe, predicts which candidates will perform better in video interviews by analysing verbal responses, intonation, non-verbal communication, etc. (https://www.hirevue.com/)

● Employees’ performance (also called “people analytics”)

○ Example 1: IBM’s Watson Analytics gathers data about employees (e.g. demographics, education, working hours, income, role, etc.) and infers which qualities/skills each employee might offer to the company, which employees have boosted their skills, and which are more likely to move somewhere else (see the blog post https://www.ibm.com/communities/analytics/watson-analytics-blog/watson-analytics-use-case-for-hr-retaining-valuable-employees/)

Page 11

Social care

Due to the shrinking of public resources, local governments employ predictive analytics to identify the people most qualified for welfare programmes

● In her book “Automating Inequality”, Virginia Eubanks describes how, in many American cities such as Los Angeles, digital tools help social services match vulnerable people (e.g. low-income or homeless people) with the available resources

● Predictive tools for social care are also in use in Europe: in the UK some local councils are using predictive models (some of them developed by private companies such as Xantura) to identify children who might be victims of abuse, based on various attributes such as school attendance, arrears, police records, etc. (see McIntyre and Pegg’s article in The Guardian, 16 Sept 2018)

Page 12

Criminal Justice

● Problem: to predict the probability of reoffending (risk assessment tools) or to predict the areas with the highest concentration of crime (such as )

● Popular risk assessment tools are:

○ COMPAS (Correctional Offender Management Profiling for Alternative Sanctions)
○ Public Safety Assessment-Court
○ HART (Harm Assessment Risk Tool)

● A list of risk assessment tools in use in the USA is available on the EPIC (Electronic Privacy Information Center) website: https://epic.org/algorithmic-transparency/crim-justice/

Page 13

Case study: The Harm Assessment Risk Tool

Police in Durham are preparing to go live with an artificial intelligence (AI) system designed to help officers decide whether or not a suspect should be kept in custody.

(BBC article by C. Baraniuk, 10 May 2018)

Page 14

The Harm Assessment Risk Tool

● Launched in May 2017, it was developed by Durham Constabulary and Cambridge University to support police officers with custody decisions

● Predicts whether an offender is “high risk”, “moderate risk” or “low risk” over a two-year period after the arrest

● Offenders who are predicted as “moderate risk” are eligible for the Checkpoint programme

● Main references:

○ Urwin S (2016) Algorithmic Forecasting of Offender Dangerousness for Police Custody Officers: An Assessment of Accuracy for the Durham Constabulary. Master’s Thesis, Cambridge University, UK

○ Oswald M, Grace J, Urwin S and Barnes GC (2018) Algorithmic risk assessment policing models: lessons from the Durham HART model and ‘Experimental’ proportionality, Information & Communications Technology Law, 27(2): 223-250

Page 15

HART = the model

HART is built using random forests

Set-up of cost ratios: HART applies a lower cost ratio to false positives as compared to false negatives

The training dataset is composed of 104,000 custody events from a period between January 2008 and December 2012

The model was validated in two steps:

● random samples of the training dataset (“OOB sampling”)

● a separate test set composed of 14,882 custody events that occurred in 2013

All datasets were drawn from Durham Constabulary’s management IT system
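The cost-ratio set-up can be illustrated with a simple cost-sensitive decision rule. The costs below are invented for illustration, not HART’s actual values: weighting a missed high-risk offender (false negative) more heavily than a false alarm (false positive) lowers the probability threshold at which the model predicts “high risk”.

```python
# Illustrative sketch: choose the label that minimises expected cost,
# with a false negative (missing a high-risk offender) costed more
# heavily than a false positive (a needless "high risk" flag).
# The 10:1 ratio is an invented example value.

COST_FALSE_NEGATIVE = 10.0   # predict "low" but the offender is high risk
COST_FALSE_POSITIVE = 1.0    # predict "high" but the offender is low risk

def decide(p_high_risk):
    """Cost-sensitive decision given the estimated high-risk probability."""
    expected_cost_if_low = p_high_risk * COST_FALSE_NEGATIVE
    expected_cost_if_high = (1 - p_high_risk) * COST_FALSE_POSITIVE
    return "high" if expected_cost_if_high < expected_cost_if_low else "low"

# With a 10:1 cost ratio the decision threshold drops from 0.5 to ~0.09:
print(decide(0.15))  # high, even though high risk is the less likely outcome
print(decide(0.05))  # low
```

This is the same asymmetry the slides describe: the model is deliberately tuned to prefer “very cautious” errors over “very dangerous” ones.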

Page 16

HART = the model

34 features, including:

● Gender
● Age
● Records of prior arrests
● First four characters of the offender’s postcode
● Mosaic postcode (based on socio-geodemographic characteristics for Durham County)
...

3 labels

● “high risk” = a new serious offence within the next 2 years

● “moderate risk” = a non-serious offence within the next 2 years

● “low risk” = no offence within the next 2 years

Page 17

HART = confusion matrices

Two distinct errors:

“very dangerous error” = a suspect labelled ‘low risk’ commits a serious offence within the next 2 years

“very cautious error” = a suspect labelled ‘high risk’ does not commit any crime within the next 2 years
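Both error types can be read directly off a confusion matrix. The sketch below uses invented counts, not HART’s actual figures:

```python
# Build a 3x3 confusion matrix over the HART-style labels and read off
# the two error types defined above. The example outcomes are invented.

labels = ["high", "moderate", "low"]

def confusion_matrix(actual, predicted):
    """m[a][p] = number of cases with actual label a predicted as p."""
    m = {a: {p: 0 for p in labels} for a in labels}
    for a, p in zip(actual, predicted):
        m[a][p] += 1
    return m

actual    = ["high", "high", "moderate", "low", "low", "high"]
predicted = ["high", "low",  "moderate", "low", "high", "high"]

m = confusion_matrix(actual, predicted)

# "Very dangerous error": actually high risk, predicted low risk.
very_dangerous = m["high"]["low"]
# "Very cautious error": actually low risk, predicted high risk.
very_cautious = m["low"]["high"]

print(very_dangerous, very_cautious)  # 1 1
```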

Page 18

HART = Accuracy

Comparison with the accuracy of a random baseline:

[P(Y = “high”) * P(Y = “high”)] + [P(Y = “moderate”) * P(Y = “moderate”)] + [P(Y = “low”) * P(Y = “low”)] =

[0.1186 * 0.1186] + [0.4835 * 0.4835] + [0.3979 * 0.3979] = 0.406 ≈ 41%

Some performance measures extracted from tables 6 and 9 in Urwin (2016: 52,56)
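The baseline arithmetic above can be checked directly: a classifier that guesses each label with its observed frequency agrees with the truth with probability equal to the sum of the squared label probabilities.

```python
# Reproduce the slide's random-baseline computation.
# Label probabilities are the ones given on the slide.

p = {"high": 0.1186, "moderate": 0.4835, "low": 0.3979}
baseline = sum(v * v for v in p.values())
print(round(baseline, 3))  # 0.406
```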

Page 19

HART = Accuracy

Some performance measures extracted from tables 6 and 9 in Urwin (2016: 52, 56)

Probability of “very dangerous error”

Probability of “very cautious error”

Page 20

Normative benchmarks for the assessment of machine decisions

Page 21

Algorithms as gatekeepers

Page 22

The problem of Trust

The current generation of intelligent algorithms makes decisions based on rules learnt from examples, rather than explicit programming

This complicates the problem of trust:

● Can we trust intelligent agents to make consequential decisions about us?

● How are these decisions made?

● Can biases affect them, due to their training examples or the conditions in which they are used?

● Can machines be made aware of values such as equality before the law, transparency and right to privacy?

● Which technical or legal solutions should be developed?

Page 23

Dimensions of Trust

Machine decisions are often judged by their accuracy, but when they operate on humans the criteria must change, since the rights of those humans translate into further obligations for those machines.

These obligations are reflected in four dimensions:

● Accuracy: Is the model accurate? Which performance measure is used?

● Fairness: Can the model discriminate? Does it embed some fairness criteria?

● Transparency: Where does the output come from?

● Privacy: Does the model use personal data? Or data that can be correlated with personal information?

Page 24

Fairness & equality before the law

Some sources of algorithmic discrimination (Barocas and Selbst 2016)

● Bias incorporated in the training set

● Feature selection process

● Sampling (e.g. underrepresented groups)

Example: if judges’ decisions are biased against a social group, that bias will be learnt by the classifiers and replicated in future decisions.

Technical criteria (e.g. calibration and error rate balance)

The enjoyment of the rights and freedoms set forth in this Convention shall be secured without discrimination on any ground such as sex, race, colour, language, religion, political or other opinion, national or social origin, association with a national minority, property, birth or other status (European Convention on Human Rights, Article 14)
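The “error rate balance” criterion mentioned above can be made concrete with a small sketch (groups and outcomes are invented): compare, across groups, the false positive rate, i.e. the probability that someone who did not reoffend was flagged as high risk.

```python
# Probe "error rate balance" on invented data. Each record is
# (group, actually_reoffended, predicted_high_risk).

records = [
    ("A", False, True), ("A", False, False), ("A", True, True),
    ("B", False, True), ("B", False, True),  ("B", True, True),
]

def false_positive_rate(group):
    """Share of non-reoffenders in `group` who were flagged high risk."""
    negatives = [r for r in records if r[0] == group and not r[1]]
    flagged = [r for r in negatives if r[2]]
    return len(flagged) / len(negatives)

print(false_positive_rate("A"))  # 0.5
print(false_positive_rate("B"))  # 1.0 -> error rates are not balanced
```

A classifier satisfying error rate balance would produce (approximately) equal rates for both groups; calibration is a different criterion, and the two generally cannot both hold, as noted later in the deck.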

Page 25

Transparency

The principle of transparency requires that any information addressed to the public or to the data subject be concise, easily accessible and easy to understand (GDPR, Recital 58)

The data controller shall implement suitable measures to safeguard the data subject’s rights and freedoms and legitimate interests, at least the right to obtain human intervention (GDPR, Article 22)

Algorithmic opacity

● Black box: correlations do not reveal the causes behind the phenomenon we want to study

● Mismatch between machine learning representations and human interpretation

Some technical remedies

● Post-hoc interpretations (Lipton, 2016)

● Software verification and cryptographic techniques to ensure procedural regularity (Kroll et al. 2016)

Page 26

Transparency

Page 27

Privacy

Everyone has the right to the protection of personal data concerning him or her.

Such data must be processed fairly for specified purposes and on the basis of the consent of the person concerned or some other legitimate basis laid down by law (Charter of Fundamental Rights of the European Union, Article 8)

Some features, albeit lawful, may have a discriminatory impact (consider, for instance, HART’s use of postcode)

What if social posts or other personal signals were used to predict someone’s risk score?

For example, it has been reported that in New Orleans Palantir’s software is helping police to map gangs by using social media (Winston, 2018)

Some technical solutions

● K-anonymity (Sweeney, 2002)
● Differential privacy (Dwork, 2006)
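The core of differential privacy (Dwork, 2006) can be sketched in a few lines: answer a counting query with Laplace noise whose scale is the query’s sensitivity (1 for a count) divided by the privacy budget ε. The data below is invented for illustration.

```python
import math
import random

def laplace_sample(scale):
    """Draw from Laplace(0, scale) by inverse-CDF sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def private_count(records, predicate, epsilon):
    """Differentially private count of records matching `predicate`.

    A counting query has sensitivity 1 (adding or removing one record
    changes the count by at most 1), so the noise scale is 1/epsilon.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_sample(1.0 / epsilon)

ages = [23, 35, 41, 29, 52]
noisy = private_count(ages, lambda a: a > 30, epsilon=0.5)
print(noisy)  # close to the true count 3, but randomised
```

A smaller ε means stronger privacy but noisier answers; repeated queries consume the budget, which is why ε is called a budget at all.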

Page 28

Challenges, open issues and concluding remarks

Page 29

Challenges

There are some limitations in technical solutions

● For example, fairness criteria such as calibration and error rate balance cannot be simultaneously satisfied (Chouldechova, 2017; Kleinberg et al., 2017; Berk et al., 2017)

● Technical solutions may redefine social and ethical principles

Stretching the problem-solving attitude

● Policy problems are mostly translated into prediction problems

Page 30

Concluding remarks

● Shifting the application of classification algorithms from things to people does not come without consequences

● The assessment of machine decisions should go beyond prediction accuracy

● These dimensions also reflect the main tracks around which several research communities have articulated the social and ethical implications of learning algorithms in the big data context (e.g. FAT/ML)

● The dimensions do not exhaust the problem of trust, but they provide concrete pathways for the assessment of machine decisions

● The dimensions cannot be addressed from a purely technical standpoint. Technical solutions are part of a wider approach including legal and cultural tools

Page 31

References

● Domingos P (2015) The Master Algorithm
● Duda RO, Hart PE and Stork DG (2001) Pattern Classification, second edition, Wiley & Sons
● De Fauw J et al. (2018) Clinically applicable deep learning for diagnosis and referral in retinal disease, Nature Medicine 24(9): 1342–1350
● Eubanks V (2018) Automating Inequality: How High-Tech Tools Profile, Police, and Punish the Poor. New York: St Martin's Press
● Urwin S (2016) Algorithmic Forecasting of Offender Dangerousness for Police Custody Officers: An Assessment of Accuracy for the Durham Constabulary. Master’s Thesis, Cambridge University
● Oswald M, Grace J, Urwin S and Barnes GC (2018) Algorithmic risk assessment policing models: lessons from the Durham HART model and ‘Experimental’ proportionality, Information & Communications Technology Law 27(2): 223–250
● Lipton Z (2018) The Mythos of Model Interpretability, ACM Queue 16(3). arXiv:1606.03490 https://queue.acm.org/detail.cfm?id=3241340
● Kroll JA, Huey J, Barocas S, Felten EW, Reidenberg JR, Robinson DG and Yu H (2017) Accountable Algorithms, University of Pennsylvania Law Review 165(3): 633–705
● Dwork C (2006) Differential Privacy. In: Bugliesi M, Preneel B, Sassone V and Wegener I (eds) Automata, Languages and Programming (ICALP 2006), Lecture Notes in Computer Science 4052. New York: Springer
● Sweeney L (2002) K-Anonymity: A Model for Protecting Privacy, International Journal on Uncertainty, Fuzziness and Knowledge-based Systems 10(5): 557–570
● Chouldechova A (2017) Fair prediction with disparate impact: A study of bias in recidivism prediction instruments, arXiv:1703.00056
● Kleinberg J, Mullainathan S and Raghavan M (2017) Inherent trade-offs in the fair determination of risk scores. In: Papadimitriou CH (ed) 8th Innovations in Theoretical Computer Science Conference (ITCS 2017)
● Berk R, Heidari H, Jabbari S, Kearns M and Roth A (2017) Fairness in Criminal Justice Risk Assessments: The State of the Art, Sociological Methods & Research
● Scantamburlo T, Charlesworth A and Cristianini N (2019) Machine decisions and human consequences. In: Yeung K and Lodge M (eds) Algorithmic Regulation (forthcoming). Oxford University Press

Page 32

Questions and discussion