

AI MATTERS, VOLUME 5, ISSUE 2, 5(2) 2019

What Metrics Should We Use To Measure Commercial AI?
Cameron Hughes (Northeast Ohio ACM Chair; [email protected])
Tracey Hughes (Northeast Ohio ACM Secretary; [email protected])
DOI: 10.1145/3340470.3340479
Copyright © 2019 by the author(s).

In AI Matters Volume 4, Issue 2, and Issue 4, we raised the possibility of an AI Cosmology, in part in response to the "AI Hype Cycle" we are currently experiencing. We posited that our current machine learning and big data era represents but one peak among several previous peaks in AI research, each with its own accompanying hype cycle. We associated each peak with an epoch in a possible AI Cosmology and briefly explored the logic machines, cybernetics, and expert systems epochs. One of the objectives of identifying these epochs was to help establish that we have been here before. In particular, we have been in the territory where some application of AI research finds substantial commercial success, which is then closely followed by AI fever and hype. The public's expectations are heightened, only to end in disillusionment when the applications fall short. Whereas it is sometimes a challenge even for AI researchers, educators, and practitioners to know where the reality ends and the hype begins, the layperson is often in an impossible position, at the mercy of pop culture, marketing, and advertising campaigns. We suggested that an AI Cosmology might help us identify a single standard model for AI that could be the foundation for a common shared understanding of what AI is and what it is not: a tool to help the layperson understand where AI has been, where it is going, and where it cannot go, and a basic road map to help the general public navigate the pitfalls of AI hype.

Here, we want to follow that suggestion with a few questions. Once we define and agree on what is meant by the moniker artificial intelligence, and we are able to classify some application as actually having artificial intelligence, another set of questions immediately presents itself:

• How intelligent is any given AI application?
• How much intelligence does any given AI application have?
• How much intelligence does an application need to be classified as an AI application?
• How reliable is the process that produced the intelligence for any given AI application?
• How transparent is the intelligence in any given AI application?

The answers to these questions require some kind of qualitative and quantitative metrics: namely, how much intelligence does any given AI application have, and what is the quality of that intelligence? Further, how could we distill the answers to these questions so that they can be used to capture, in label form, the 'AI Ingredients' of any technology aimed at the general public?

The amount of education is often used as one metric for intelligence: we refer to individuals as having a grade school, high school, or college-level education. Could we employ a similar metric for AI applications? Would it be feasible to classify AI applications in terms of grade levels? For instance, a grade level 5 AI application would be considered to have more intelligence than a 4th grade AI application. Any application that did not meet the 1st grade level would not be considered an AI application, and applications that achieved better than 12th grade would be considered advanced AI applications. But how could we determine the grade level of any given AI application?

A consensus set of metrics that could be passed on to the general public has not yet emerged. In this AI hype cycle, if an application uses any artifact from any AI technique, a vendor is quick to advertise it as an AI application. It would be useful to have a metric that indicates exactly how much AI is in the purported application. We have this kind of information for other products, e.g., how much chocolate is actually in the chocolate bar or how much real fruit is actually in the fruit juice being sold to the consumer. Exactly how much AI does that drone have? Or how much AI is actually in that new social media application? The 'how much' question requires a quantitative metric of some sort, and the grade of intelligence involved requires a qualitative metric. What if applications that claimed to be AI capable were required to state these metrics on the label or in the advertising? For example, a vendor might state: "Our new social media application is 2% level 3 AI!" This kind of simple metric scheme would help to mitigate AI hype cycles.

In addition to characterizing the quantity and quality of the embedded AI, requiring a reliability metric like MTBAIF (Mean Time Between AI Failure) is also desirable. Stating how much intelligence is in an application and what grade level that intelligence reaches provides a good start. However, the reliability of the AI (i.e., its limits, tolerances, certainty, etc.) and a transparency metric that indicates the ontology, inference predisposition/bias, and the type and quality of the knowledge would give the user some real indication of the utility of the application.
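As a rough illustration of how MTBAIF might be computed, consider the sketch below, which simply borrows the familiar MTBF formula: total operating time divided by the number of observed failures of the AI component. The function name and the example numbers are our own assumptions for illustration, not part of any existing standard.

```python
def mtbaif(operating_hours: float, ai_failures: int) -> float:
    """Mean Time Between AI Failure, by analogy with MTBF: total operating
    time divided by the number of observed failures of the AI component."""
    if ai_failures == 0:
        return float("inf")  # no failures observed in the measurement window
    return operating_hours / ai_failures

# Hypothetical example: an application runs for 30 days and a reviewer flags
# 12 of its outputs as failures of the embedded AI (wrong classifications,
# unsafe recommendations, and so on).
print(mtbaif(30 * 24, 12))  # 60.0 hours between AI failures
```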

Knowledge Ingredients

What if we could provide 'Knowledge Ingredient' labels for our AI-based hardware/software technologies, like those shown in Figure 1?

Figure 1: Pie charts for a Knowledge Indicator that reflects the Knowledge Ingredients of an AI system.

In Figure 1, the percentages of the types of knowledge the system comprises are represented in a pie chart. The color indicates the grade level, from lower to higher. In this case, 41% of the system's knowledge is "Common Sense," with a relatively low grade level, compared to the high grade level of "Instinctive Knowledge" at 23%. The expiration date indicates the time frame for the viability of the knowledge, from 2000 to 2025; outside of that time frame, the knowledge would be considered obsolete. There are two indicators, one for researchers and practitioners and the other for the layperson; the difference between them is the terminology used to describe the types of knowledge. Here is our list of metrics:

1. Percentage of software/hardware dedicated to the implementation of AI techniques
2. Grade level
3. Reliability (limits, tolerances, certainty, MTBAIF)
4. Transparency (predisposition/bias)
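As a thought experiment, such a label could be encoded as a simple data structure shipped alongside the product. The field names and sample values below are our own invention for illustration; no such standard label exists today.

```python
from dataclasses import dataclass, field

@dataclass
class KnowledgeIngredientsLabel:
    """A hypothetical 'nutrition label' for a commercial AI application."""
    ai_fraction: float            # share of software/hardware dedicated to AI techniques (0.0-1.0)
    grade_level: int              # qualitative level of the embedded intelligence (1-12+)
    mtbaif_hours: float           # reliability: Mean Time Between AI Failure
    knowledge_types: dict = field(default_factory=dict)  # e.g. {"common sense": 0.41, ...}
    transparency_notes: str = ""  # ontology, inference predisposition/bias, knowledge provenance
    valid_from: int = 2000        # expiration window for the viability of the knowledge
    valid_until: int = 2025

# Illustrative values loosely modeled on Figure 1 (not a real product):
label = KnowledgeIngredientsLabel(
    ai_fraction=0.02,
    grade_level=3,
    mtbaif_hours=60.0,
    knowledge_types={"common sense": 0.41, "instinctive": 0.23, "domain expertise": 0.36},
    transparency_notes="rule-based inference over a curated ontology; vendor-supplied training data",
)
print(f"{label.ai_fraction:.0%} level {label.grade_level} AI")  # prints "2% level 3 AI"
```

Rendering the compact advertising claim suggested earlier ("2% level 3 AI") then becomes a one-line formatting exercise over the same structure.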

AI Metrics

The notion of metrics to measure AI performance has been under active investigation since the very beginnings of AI research, yet a standard, widely used and accepted set of metrics remains elusive. The PerMIS (Performance Metrics for Intelligent Systems) workshops were started in 2000. The PerMIS workshops are dedicated to:

"...defining measures and methodologies of evaluating performance of intelligent systems. Started in 2000, the PerMIS series focuses on applications of performance measures to applied problems in commercial, industrial, homeland security, and military applications."

The PerMIS workshops were originally co-sponsored by SIGART (now SIGAI), NIST, IEEE, ACM, DARPA, and NSF. These workshops endeavored to identify performance intelligence metrics in many areas, such as ontologies, mobile robots, intelligent interfaces, agents, intelligent test beds, intelligent performance assessment, planning models, autonomous systems, learning approaches, and embedded intelligent components. ALFUS (Autonomy Levels for Unmanned Systems) defines a framework for characterizing the human interaction and the mission and environmental complexity of a "system of systems". The purpose of ALFUS was to determine the level of autonomy, a necessary but not sufficient component of an intelligent system. According to Ramsbotham [1]:

"Intelligence implies an ability to perceive and adapt to external environments in real-time, to acquire and store knowledge regarding problem solutions, and to incorporate that knowledge into system memory for future use."

Ramsbotham [1] describes autonomous intelligent systems of systems based on the 4D/RCS Reference Model Architecture for Learning, developed by the Intelligent Systems Division of NIST since the 1980s. In order to develop some type of metric, the functions that comprise intelligent behavior had to be decomposed into specific hardware and software characteristics. This decomposition was called SPACE (Sense, Perceive, Attend, Apprehend, Comprehend, Effect action):

• Sense: To generate a measurable signal (usually electrical) from external stimuli. A sensor will often employ techniques (for example, bandpass filtering or thresholding) such that only part of the theoretical response of the transducer is perceived.

• Perceive: To capture the raw sensor data in a form (analog or digital) that allows further processing to extract information. In this narrow construct, perception is characterized by a 1:1 correspondence between the sensor signal and the output data.

• Attend: To select data from what is perceived by the sensor. To a crude approximation, analogous to feature extraction.

• Apprehend: To characterize the information content of the extracted features. Analogous to pattern recognition.

• Comprehend: To understand the significance of the information apprehended in the context of existing knowledge; in the case of automata, typically other information stored in electronic memory.

• Effect action: To interact with the external environment or modify the internal state (e.g., the stored information comprising the "knowledge base" of the system) based on what is comprehended.
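As a sketch only, the SPACE decomposition can be read as a six-stage pipeline. The toy implementation below is our own illustration of that reading; Ramsbotham [1] defines the stages functionally, not as code, and the thresholds and data types here are invented.

```python
from typing import List

class SpaceAgent:
    """Toy pipeline following the SPACE decomposition:
    Sense -> Perceive -> Attend -> Apprehend -> Comprehend -> Effect action."""

    def __init__(self) -> None:
        self.knowledge_base: List[str] = []          # stored patterns ("knowledge" for future use)

    def sense(self, stimulus: float) -> float:
        return stimulus if stimulus > 0.1 else 0.0   # thresholding: only part of the signal survives

    def perceive(self, signal: float) -> float:
        return round(signal, 3)                      # 1:1 capture of the sensor signal as data

    def attend(self, data: float) -> float:
        return data                                  # crude "feature extraction": pass the value through

    def apprehend(self, feature: float) -> str:
        return "hot" if feature > 0.7 else "normal"  # pattern recognition over the feature

    def comprehend(self, pattern: str) -> str:
        # understand the pattern in the context of existing knowledge
        return f"{pattern} (known)" if pattern in self.knowledge_base else f"{pattern} (new)"

    def effect_action(self, pattern: str, meaning: str) -> None:
        self.knowledge_base.append(pattern)          # modify internal state based on what was comprehended
        print(meaning)

    def step(self, stimulus: float) -> None:
        signal = self.sense(stimulus)
        data = self.perceive(signal)
        feature = self.attend(data)
        pattern = self.apprehend(feature)
        meaning = self.comprehend(pattern)
        self.effect_action(pattern, meaning)

agent = SpaceAgent()
agent.step(0.85)   # prints "hot (new)"
agent.step(0.85)   # prints "hot (known)" -- the second reading matches stored knowledge
```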

The purpose, or "collective mission performance," of a system of systems was also categorized based on functional and architectural complexity. The mission and environmental complexity and the degree of human interaction that determine the level of autonomy could be applied to these categories [1]:

1. Leader-Follower: Intelligent behavior exhibited by a single node and replicated (sometimes with minor adaptation) by other nodes.

2. Swarming (simple): A loosely structured collection of interacting agents, capable of moving collectively.

3. Swarming (complex): A loosely structured collection of interacting agents, capable of individuated behavior to effect common goals.

4. Homogeneous intelligent systems: A relatively structured collection of identical (or at least similar) agents, wherein collective system performance is optimized by optimizing the performance of individual agents.

5. Heterogeneous intelligent systems: A relatively structured heterogeneous collection of specialized agents, wherein the functions of intelligence are distributed among the diverse agents to optimize performance of a defined task or set of tasks.

6. Ad hoc intelligent adaptive systems: A relatively unstructured and undefined heterogeneous collection of agents, wherein the functions comprising intelligence are dynamically distributed across the system to adapt to changing tasks.

From less complex groups with low mission and environmental complexity and high human interaction (like Leader-Follower) to the most sophisticated, with high mission and environmental complexity and no human interaction (like ad hoc intelligent adaptive systems), these categories form a rough ordering. Figure 2 shows where the extremes of these categories could be graphed.

Figure 2: Leader-Follower and Ad Hoc Intelligent Adaptive Systems located in the 3D space of complexity and human interaction.
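If one wanted to work with these categories programmatically, they could be captured as an ordered enumeration with rough placements on the two axes of Figure 2. The sketch below is purely illustrative; the names are taken from the list above, but the numeric placements are our own guesses and do not come from [1] or from the ALFUS framework.

```python
from enum import IntEnum

class SystemCategory(IntEnum):
    """Categories of collective intelligent systems, ordered roughly from
    least to most sophisticated (after the list above)."""
    LEADER_FOLLOWER = 1
    SWARMING_SIMPLE = 2
    SWARMING_COMPLEX = 3
    HOMOGENEOUS_INTELLIGENT_SYSTEMS = 4
    HETEROGENEOUS_INTELLIGENT_SYSTEMS = 5
    AD_HOC_INTELLIGENT_ADAPTIVE_SYSTEMS = 6

# Invented, purely illustrative placements on the two axes (0.0 = low, 1.0 = high)
# for the two extremes graphed in Figure 2.
COMPLEXITY = {
    SystemCategory.LEADER_FOLLOWER: 0.1,
    SystemCategory.AD_HOC_INTELLIGENT_ADAPTIVE_SYSTEMS: 0.95,
}
HUMAN_INTERACTION = {
    SystemCategory.LEADER_FOLLOWER: 0.9,
    SystemCategory.AD_HOC_INTELLIGENT_ADAPTIVE_SYSTEMS: 0.05,
}

for category in SystemCategory:          # prints the ordering used in the text
    print(category.value, category.name)
```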

Ultimately, Ramsbotham [1] concluded:

"Even given a comprehensive framework, as we begin to build more complex intelligent systems of systems, we will need to acquire knowledge and improve analytic tools and metrics. Among the more important will be: ... Better models and metrics for characterizing limits of information assurance based on these effects. This will be both a critical need and a major challenge."

The best-known metric used by researchers and for commercial AI applications has been the Turing Test, developed by Alan Turing in 1950. The purpose of the test was to evaluate a machine's ability to demonstrate intelligent behavior indistinguishable from that of a human being. A human judge was to evaluate a natural language conversation between a human and the potential "intelligent machine," as shown in Figure 3.

Figure 3: The human evaluator was to determine which participant was not human.

Is this test actually a metric for "intelligence"? This has been debated for many years. If an NLG agent can produce responses that simulate human responses, does that mean the system is intelligent? Probably not, but that has not stopped the use of the Turing Test, or AI software developers from heralding that their systems have passed, or almost passed, the Turing Test. Now passing the test includes simulating the human voice, as the Google Duplex AI Assistant does.

In "Moving Beyond the Turing Test with the Allen AI Science Challenge" [2], the authors describe the Allen AI Science Challenge "Turing Olympics," a series of tests that explore many capabilities considered to be associated with intelligence. These capabilities include the language understanding, reasoning, and commonsense knowledge needed to perform smart or intelligent activities, and the challenge was intended to replace the Turing Test's pass/fail model. The idea of the Challenge was to hold a four-month-long competition in which researchers were to build an AI agent that could answer eighth-grade multiple-choice science questions, demonstrating its ability to utilize state-of-the-art natural language understanding and knowledge-based reasoning. The following summarizes the nature of the competition:

• Total (four-choice) questions: 5,083
• Training set questions: 2,500
• Validation set for confirming model performance: 8,132
• Legitimate questions: 800
• Final test + validation set for the final score: 21,298
• Final legitimate questions: 2,583
• Baseline score (random guessing): 25%

In the final results of the test, the top three competitors' scores fell within a spread of 1.05%, with the highest score at 59.31%, which would be considered a failing grade. Each of the winning models utilized standard information-retrieval-based methods, which were not able to pass the eighth-grade science exam. What is required is:

"...to go beyond surface text to a deeper understanding of the meaning underlying each question, then use reasoning to find the appropriate answer."

In this case, such a system would have a 1st grade level Knowledge Quality based on our quasi Knowledge Ingredient Indicator. According to the article [2]:

"All three winners said it was clear that applying a deeper, semantic level of reasoning with scientific knowledge to the questions and answers would be the key to achieving scores of 80% and higher and demonstrating what might be considered true artificial intelligence."

Here we have provided a very cursory look at a very limited set of possibilities for measuring commercial AI. In the next issue, we will go a little further and dig a little deeper into the question of how to communicate basic AI metrics and ingredients for commercial AI to the layperson.

References

[1] Ramsbotham, A. J. (2009). Collective Intelligence: Toward Classifying Systems of Systems. PerMIS '09.

[2] Schoenick, C., Clark, P., et al. (2017). Moving Beyond the Turing Test with the Allen AI Science Challenge. Communications of the ACM, 60(9), 60-64.

Cameron Hughes is a computer and robot programmer. He is a Software Epistemologist at Ctest Laboratories, where he is currently working on A.I.M. (Alternative Intelligence for Machines) and A.I.R. (Alternative Intelligence for Robots) technologies. Cameron is the lead AI Engineer for the Knowledge Group at Advanced Software Construction Inc. He is a member of the advisory board for the NREF (National Robotics Education Foundation) and the Oak Hill Robotics Makerspace. He is the project leader of the technical team for the NEOACM CSI/CLUE Robotics Challenge and regularly organizes and directs robot programming workshops for varying robot platforms. Cameron Hughes is the co-author of many books and blogs on software development and Artificial Intelligence.

Tracey Hughes is a software and epistemic visualization engineer at Ctest Laboratories. She is the lead designer for the MIND, TAMI, and NOFAQS projects that utilize epistemic visualization. Tracey is also a member of the advisory board for the NREF (National Robotics Education Foundation) and the Oak Hill Robotics Makerspace. She is the lead researcher of the technical team for the NEOACM CSI/CLUE Robotics Challenge. Tracey Hughes is the co-author, with Cameron Hughes, of many books on software development and Artificial Intelligence.
