
Artificial Intelligence

Amit Purohit

Folklore hinting at artificial intelligence can be traced back to ancient Egypt, but with the development of the electronic computer in 1941, the technology finally became available to create machine intelligence. The term artificial intelligence was first coined in 1956 at the Dartmouth conference, and since then the field has expanded because of the theories and principles developed by its dedicated researchers. Through its short modern history, advances in AI have come more slowly than first estimated, but progress continues to be made. Since its birth four decades ago, there have been a variety of AI programs, and they have influenced other technological advancements.

Definition

AI is the science and engineering of making intelligent machines, especially intelligent computer programs. It is related to the similar task of using computers to understand human intelligence, but AI does not have to confine itself to methods that are biologically observable.

Intelligence is the computational part of the ability to achieve goals in the world. Varying kinds and degrees of intelligence occur in people, many animals and some machines.

Objectives

1) To formally define AI.

2) To discuss the characteristic features of AI.

3) To get the student acquainted with the essence of AI.

4) To be able to distinguish between human intelligence and AI.

5) To give an overview of the applications where AI technology can be used.

6) To impart knowledge about representation schemes like Production System and Problem Reduction.

Turing Test

Alan Turing's 1950 article Computing Machinery and Intelligence [Tur50] discussed conditions for considering a machine to be intelligent. He argued that if the machine could successfully pretend to be human to a knowledgeable observer then you certainly should consider it intelligent. This test would satisfy most people but not all philosophers. The observer could interact with the machine and a human by teletype (to avoid requiring that the machine imitate the appearance or voice of the person), and the human would try to persuade the observer that it was human and the machine would try to fool the observer.


The Turing test is a one-sided test. A machine that passes the test should certainly be considered intelligent, but a machine could still be considered intelligent without knowing enough about humans to imitate a human.

Daniel Dennett's book Brainchildren [Den98] has an excellent discussion of the Turing test and the various partial Turing tests that have been implemented, i.e. with restrictions on the observer's knowledge of AI and the subject matter of questioning. It turns out that some people are easily led into believing that a rather dumb program is intelligent.

Background and History


In 1941 an invention revolutionized every aspect of the storage and processing of information. That invention, developed in both the US and Germany, was the electronic computer. The first computers required large, separate air-conditioned rooms and were a programmer's nightmare, involving the separate configuration of thousands of wires just to get a program running.

The 1949 innovation, the stored program computer, made the job of entering a program easier, and advancements in computer theory led to computer science and, eventually, artificial intelligence. With the invention of an electronic means of processing data came a medium that made AI possible.

Although the computer provided the technology necessary for AI, it was not until the early 1950's that the link between human intelligence and machines was really observed. Norbert Wiener was one of the first Americans to make observations on the principle of feedback theory. The most familiar example of feedback theory is the thermostat: it controls the temperature of an environment by gathering the actual temperature of the house, comparing it to the desired temperature, and responding by turning the heat up or down. What was so important about his research into feedback loops was that Wiener theorized that all intelligent behavior was the result of feedback mechanisms, mechanisms that could possibly be simulated by machines. This idea influenced much of the early development of AI.

In late 1955, Newell and Simon developed the Logic Theorist, considered by many to be the first AI program. The program, representing each problem as a tree model, would attempt to solve it by selecting the branch most likely to lead to the correct conclusion. The impact that the Logic Theorist made on both the public and the field of AI has made it a crucial stepping stone in the development of the field.


In 1956 John McCarthy, regarded as the father of AI, organized a conference to draw on the talent and expertise of others interested in machine intelligence for a month of brainstorming. He invited them to New Hampshire for "The Dartmouth Summer Research Project on Artificial Intelligence." From that point on, because of McCarthy, the field would be known as artificial intelligence. Although not a huge success, the Dartmouth conference did bring together the founders of AI and served to lay the groundwork for the future of AI research.

In the seven years after the conference, AI began to pick up momentum. Although the field was still undefined, ideas formed at the conference were re-examined and built upon. Centers for AI research began forming at Carnegie Mellon and MIT, and new challenges were faced: first, creating systems that could efficiently solve problems by limiting the search, as the Logic Theorist did; and second, making systems that could learn by themselves.

In 1957, the first version of a new program, the General Problem Solver (GPS), was tested. The program was developed by the same pair who had developed the Logic Theorist. The GPS was an extension of Wiener's feedback principle and was capable of solving a wider range of common-sense problems. A couple of years after the GPS, IBM contracted a team to research artificial intelligence; Herbert Gelernter spent three years working on a program for solving geometry theorems.

While more programs were being produced, McCarthy was busy developing a major breakthrough in AI history. In 1958 McCarthy announced his new development: the LISP language, which is still used today. LISP stands for LISt Processing, and it was soon adopted as the language of choice among most AI developers.

During the 1970s many new methods in the development of AI were tested, notably Minsky's frames theory. David Marr also proposed new theories about machine vision, for example how it would be possible to distinguish an image based on its shading, basic information on shapes, color, edges, and texture. With analysis of this information, frames of what an image might be could then be referenced. Another development during this time was the PROLOG language, proposed in 1972.

During the 1980s AI moved at a faster pace, and further into the corporate sector. In 1986, US sales of AI-related hardware and software surged to $425 million. Expert systems were in particular demand because of their efficiency. Companies such as Digital Equipment Corporation were using XCON, an expert system designed to configure the large VAX computers. DuPont, General Motors, and Boeing relied heavily on expert systems. Indeed, to keep up with the demand for computer experts, companies such as Teknowledge and Intellicorp, which specialized in creating software to aid in producing expert systems, were formed. Other expert systems were designed to find and correct flaws in existing expert systems.

Overview of AI Application Areas

Game Playing


You can buy machines that can play master level chess for a few hundred dollars. There is some AI in them, but they play well against people mainly through brute force computation, looking at hundreds of thousands of positions. To beat a world champion by brute force and known reliable heuristics requires being able to look at 200 million positions per second.

Speech Recognition

In the 1990s, computer speech recognition reached a practical level for limited purposes. Thus United Airlines replaced its keyboard tree for flight information with a system using speech recognition of flight numbers and city names. It is quite convenient. On the other hand, while it is possible to instruct some computers using speech, most users have gone back to the keyboard and the mouse as still more convenient.

Understanding Natural Language

Just getting a sequence of words into a computer is not enough. Parsing sentences is not enough either. The computer has to be provided with an understanding of the domain the text is about, and this is presently possible only for very limited domains.

Computer Vision

The world is composed of three-dimensional objects, but the inputs to the human eye and computers' TV cameras are two dimensional. Some useful programs can work solely in two dimensions, but full computer vision requires partial three-dimensional information that is not just a set of two-dimensional views. At present there are only limited ways of representing three-dimensional information directly, and they are not as good as what humans evidently use.

Expert Systems

A "knowledge engineer" interviews experts in a certain domain and tries to embody their knowledge in a computer program for carrying out some task. How well this works depends on whether the intellectual mechanisms required for the task are within the present state of AI. When this turned out not to be so, there were many disappointing results. One of the first expert systems was MYCIN in 1974, which diagnosed bacterial infections of the blood and suggested treatments. It did better than medical students or practicing doctors, provided its limitations were observed. Namely, its ontology included bacteria, symptoms, and treatments and did not include patients, doctors, hospitals, death, recovery, and events occurring in time. Its interactions depended on a single patient being considered. Since the experts consulted by the knowledge engineers knew about patients, doctors, death, recovery, etc., it is clear that the knowledge engineers forced what the experts told them into a predetermined framework. In the present state of AI, this has to be true. The usefulness of current expert systems depends on their users having common sense.

Heuristic Classification


One of the most feasible kinds of expert system given the present knowledge of AI is to put some information in one of a fixed set of categories using several sources of information. An example is advising whether to accept a proposed credit card purchase. Information is available about the owner of the credit card, his record of payment and also about the item he is buying and about the establishment from which he is buying it (e.g., about whether there have been previous credit card frauds at this establishment).
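As a concrete illustration, here is a minimal Python sketch of such a fixed-category classifier for the credit card scenario. The field names and thresholds are invented for illustration and are not taken from any real system.

    # A toy heuristic classifier for credit card purchases.
    # All field names and thresholds are hypothetical.

    def classify_purchase(purchase):
        """Assign the purchase to one of a fixed set of categories."""
        score = 0
        if purchase["owner_payment_record"] == "good":
            score += 2
        if purchase["amount"] > 5000:
            score -= 1
        if purchase["establishment_fraud_history"]:
            score -= 3
        if score >= 2:
            return "accept"
        elif score >= 0:
            return "refer to human reviewer"
        else:
            return "reject"

    print(classify_purchase({
        "owner_payment_record": "good",
        "amount": 120,
        "establishment_fraud_history": False,
    }))  # -> accept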

Production System

Production systems are applied to problem-solving programs that must perform a wide range of searches. Production systems are symbolic AI systems. The difference between these two terms is only one of semantics. A symbolic AI system may not be restricted to the strict definition of a production system, but it cannot be much different either.

Production systems are composed of three parts: a global database, production rules, and a control structure.

The global database is the system's short-term memory. It is a collection of facts that are to be analyzed. Part of the global database represents the current state of the system's environment. In a game of chess, for example, the current state could represent the positions of all the pieces.

Production rules (or simply productions) are conditional if-then branches. In a production system, whenever a condition is satisfied, the system is allowed to execute or perform a specific action, which may be specified under that rule. If the rule is not fulfilled, it may perform another action. This can be simply paraphrased:

WHEN (condition) IS SATISFIED, PERFORM (action)

A Production System Algorithm

DATA <- initial global database
until DATA satisfies the halting condition do
begin
    select some rule R that can be applied to DATA
    DATA <- result of applying R to DATA
end
return DATA
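The same control loop can be sketched in Python. This is an illustrative sketch, not a full production system: a rule is assumed to be a (condition, action) pair, where the condition tests the database and the action returns a new database.

    # A minimal production system interpreter (illustrative sketch).
    # A rule is a (condition, action) pair: condition tests the database,
    # action returns a new database.

    def run_production_system(database, rules, is_goal):
        while not is_goal(database):
            # Control strategy: pick the first applicable rule.
            applicable = [r for r in rules if r[0](database)]
            if not applicable:
                return None  # no rule applies and the goal is not met
            condition, action = applicable[0]
            database = action(database)
        return database

    # Example: count up from 0 to 3.
    rules = [(lambda db: db < 3, lambda db: db + 1)]
    print(run_production_system(0, rules, lambda db: db == 3))  # -> 3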

Types of Production System

There are two basic types of production System:

1. Commutative production system
2. Decomposable production system

Commutative Production System


A production system is commutative if it has the following properties with respect to a database D:

1. Each member of the set of rules applicable to D is also applicable to any database produced by applying an applicable rule to D.

2. If the goal condition is satisfied by D, then it is also satisfied by any database produced by applying any applicable rule to D.

3. The database that results by applying to D any sequence composed of rules that are applicable to D is invariant under permutations of the sequence.

Decomposable Production System

The initial database can be decomposed or split into separate components that can be processed independently.

Search Process

Searching is defined as a sequence of steps that transforms the initial state to the goal state. To do a search process, the following are needed:

1. The initial state description of the problem.
2. A set of legal operators that change the state.
3. The final or goal state.

The searching process in AI can be classified into two types:

1. Uninformed search / blind search
2. Heuristic search / informed search

Uninformed / Blind Search

An uninformed search algorithm is one that does not have any domain-specific knowledge. It uses only information like the initial state, the final state, and a set of legal operators. The search proceeds in a systematic way by exploring nodes in some predetermined order. It can be classified into two search strategies:

1. Breadth first search
2. Depth first search


Depth First Search

Depth first search works by taking a node, checking its neighbors, expanding the first node it finds among the neighbors, checking whether that expanded node is our destination, and, if not, continuing to explore more nodes.

The above explanation is probably confusing if this is your first exposure to depth first search. I hope the following demonstration will help more. Using our same search tree, let's find a path between nodes A and F:

Step 0

Let's start with our root/goal node:

We will be using two lists to keep track of what we are doing - an Open list and a Closed List. An Open list keeps track of what you need to do, and the Closed List keeps track of what you have already done. Right now, we only have our starting point, node A. We haven't done anything to it yet, so let's add it to our Open list.

Open List: A
Closed List: <empty>


Step 1

Now, let's explore the neighbors of our A node. To put it another way, let's take the first item from our Open list and explore its neighbors:

Node A's neighbors are the B and C nodes. Because we are now done with our A node, we can remove it from our Open list and add it to our Closed List. You aren't done with this step though. You now have two new nodes B and C that need exploring. Add those two nodes to our Open list.

Our current Open and Closed Lists contain the following data:

Open List: B, C
Closed List: A

Step 2

Our Open list contains two items. For depth first search and breadth first search, you always explore the first item from our Open list. The first item in our Open list is the B node. B is not our destination, so let's explore its neighbors:

Because I have now expanded B, I am going to remove it from the Open list and add it to the Closed List. Our new nodes are D and E, and we add these nodes to the beginning of our Open list:

Open List: D, E, C
Closed List: A, B

Step 3


You should start to see a pattern forming. Because D is at the beginning of our Open List, we expand it. D isn't our destination, and it does not contain any neighbors. All you do in this step is remove D from our Open List and add it to our Closed List:

Open List: E, C
Closed List: A, B, D

Step 4

We now expand the E node from our Open list. E is not our destination, so we explore its neighbors and find out that it contains the neighbors F and G. Remember that F is our target, but we don't stop here. Even though F is now on our Open list, we only end when we are about to expand our target node, F in this case:

Our Open list will have the E node removed and the F and G nodes added. The removed E node will be added to our Closed List:

Open List: F, G, C
Closed List: A, B, D, E

Step 5

We now expand the F node. Since it is our intended destination, we stop:


We remove F from our Open list and add it to our Closed List. Since we are at our destination, there is no need to expand F in order to find its neighbors. Our final Open and Closed Lists contain the following data:

Open List: G, C
Closed List: A, B, D, E, F

The path taken by our depth first search is the final value of our Closed List: A, B, D, E, F.
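The walkthrough above translates almost directly into Python. The sketch below assumes the example tree as described (C is never expanded in the walkthrough, so it is treated as a leaf here); new neighbors go on the front of the Open list, which is exactly what makes the search depth first.

    # Depth first search with explicit Open and Closed lists,
    # following the walkthrough above.

    tree = {  # the example search tree (C, D, F, G assumed to be leaves)
        "A": ["B", "C"],
        "B": ["D", "E"],
        "C": [],
        "D": [],
        "E": ["F", "G"],
        "F": [],
        "G": [],
    }

    def depth_first_search(start, goal):
        open_list = [start]
        closed_list = []
        while open_list:
            node = open_list.pop(0)      # take the first item on Open
            closed_list.append(node)
            if node == goal:
                return closed_list
            # New neighbors go to the BEGINNING of the Open list.
            open_list = tree[node] + open_list
        return None

    print(depth_first_search("A", "F"))  # -> ['A', 'B', 'D', 'E', 'F']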

Breadth First Search

In depth first search, newly explored nodes were added to the beginning of your Open list. In breadth first search, newly explored nodes are added to the end of your Open list.

For example, here is our original search tree:

The explanation above is probably confusing if this is your first exposure to breadth first search. Using our same search tree, let's again find a path between nodes A and F:

Step 0

As before, we start with only our starting point, node A, on the Open list.

Open List: A
Closed List: <empty>

Step 1

We explore the neighbors of our A node: nodes B and C. A moves to the Closed List, and because this is breadth first search, B and C are added to the end of the Open list.

Open List: B, C
Closed List: A

Step 2

We take the first item from our Open list, node B. B is not our destination, so we explore its neighbors, D and E, and add them to the end of the Open list. B moves to the Closed List. Notice the difference from depth first search: C, not D, is now at the front of the Open list.

Open List: C, D, E
Closed List: A, B

Step 3

We expand C. C isn't our destination, and in our tree it has no neighbors, so all we do is move C to the Closed List:

Open List: D, E
Closed List: A, B, C

Step 4

We expand D. D isn't our destination and has no neighbors either, so D moves to the Closed List:

Open List: E
Closed List: A, B, C, D

Step 5

We now expand the E node. E is not our destination, so we explore its neighbors, F and G, and add them to the end of the Open list. Remember that F is our target, but we only end when we are about to expand the target node:

Open List: F, G
Closed List: A, B, C, D, E

Step 6

We now expand the F node. Since it is our intended destination, we stop:

Open List: G
Closed List: A, B, C, D, E, F

The path taken by our breadth first search is the final value of our Closed List: A, B, C, D, E, F. Note how the search sweeps the tree level by level instead of plunging down a single branch.
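The only change needed to turn the depth first sketch into breadth first search is where new neighbors are placed. A minimal sketch, reusing the tree dictionary from the depth first example:

    # Breadth first search: identical to depth_first_search above,
    # except neighbors are appended to the END of the Open list.

    def breadth_first_search(start, goal):
        open_list = [start]
        closed_list = []
        while open_list:
            node = open_list.pop(0)
            closed_list.append(node)
            if node == goal:
                return closed_list
            open_list = open_list + tree[node]   # append to the end
        return None

    print(breadth_first_search("A", "F"))
    # -> ['A', 'B', 'C', 'D', 'E', 'F']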

Iterative Deepening Depth-First Search

Iterative deepening depth-first search (IDDFS) is a state space search strategy in which a depth-limited search is run repeatedly, increasing the depth limit with each iteration until it reaches d, the depth of the shallowest goal state. On each iteration, IDDFS visits the nodes in the search tree in the same order as depth-first search, but the cumulative order in which nodes are first visited, assuming no pruning, is effectively breadth-first.

IDDFS combines depth-first search's space-efficiency and breadth-first search's completeness (when the branching factor is finite). It is optimal when the path cost is a non-decreasing function of the depth of the node.

The space complexity of IDDFS is O(bd), where b is the branching factor and d is the depth of the shallowest goal. Since iterative deepening visits states multiple times, it may seem wasteful, but it turns out not to be so costly, since in a tree most of the nodes are in the bottom level, so it does not matter much if the upper levels are visited multiple times.


The main advantage of IDDFS in game tree searching is that the earlier searches tend to improve the commonly used heuristics, such as the killer heuristic and alpha-beta pruning, so that a more accurate estimate of the score of various nodes at the final depth search can occur, and the search completes more quickly since it is done in a better order. For example, alpha-beta pruning is most efficient if it searches the best moves first.

A second advantage is the responsiveness of the algorithm. Because early iterations use small values for d, they execute extremely quickly. This allows the algorithm to supply early indications of the result almost immediately, followed by refinements as d increases. When used in an interactive setting, such as in a chess-playing program, this facility allows the program to play at any time with the current best move found in the search it has completed so far. This is not possible with a traditional depth-first search.

The time complexity of IDDFS in well-balanced trees works out to be the same as depth-first search: O(b^d).

In an iterative deepening search, the nodes on the bottom level are expanded once, those on the next-to-bottom level are expanded twice, and so on, up to the root of the search tree, which is expanded d + 1 times.[1] So the total number of expansions in an iterative deepening search is

(d + 1) + d*b + (d - 1)*b^2 + ... + 3*b^(d-2) + 2*b^(d-1) + b^d

Altogether, an iterative deepening search from depth 1 to depth d expands only about 11% more nodes than a single breadth-first or depth-limited search to depth d, when b = 10. The higher the branching factor, the lower the overhead of repeatedly expanded states, but even when the branching factor is 2, iterative deepening search only takes about twice as long as a complete breadth-first search. This means that the time complexity of iterative deepening is still O(b^d), and the space complexity is O(bd). In general, iterative deepening is the preferred search method when there is a large search space and the depth of the solution is not known.
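A common way to sketch IDDFS is a depth-limited depth first search called with increasing limits. The Python below reuses the tree dictionary from the earlier search examples and is illustrative rather than optimized:

    # Iterative deepening: run depth-limited DFS with limit 0, 1, 2, ...

    def depth_limited_search(node, goal, limit):
        if node == goal:
            return [node]
        if limit == 0:
            return None
        for child in tree[node]:
            path = depth_limited_search(child, goal, limit - 1)
            if path is not None:
                return [node] + path
        return None

    def iddfs(start, goal, max_depth=10):
        for depth in range(max_depth + 1):
            path = depth_limited_search(start, goal, depth)
            if path is not None:
                return path
        return None

    print(iddfs("A", "F"))  # -> ['A', 'B', 'E', 'F']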

Informed Search

It is not difficult to see that uninformed search will pursue options that lead away from the goal as easily as options that lead towards it. For any but the smallest problems this leads to searches that take unacceptable amounts of time and/or space. Informed search tries to reduce the amount of search that must be done by making intelligent choices for the nodes that are selected for expansion. This implies the existence of some way of evaluating the likelihood that a given node is on the solution path. In general this is done using a heuristic function.

Hill Climbing

Hill climbing is a mathematical optimization technique which belongs to the family of local search. It is relatively simple to implement, making it a popular first choice. Although more advanced algorithms may give better results, in some situations hill climbing works just as well.

Hill climbing can be used to solve problems that have many solutions, some of which are better than others. It starts with a random (potentially poor) solution, and iteratively makes small changes to the solution, each time improving it a little. When the algorithm cannot see any improvement anymore, it terminates. Ideally, at that point the current solution is close to optimal, but it is not guaranteed that hill climbing will ever come close to the optimal solution.

For example, hill climbing can be applied to the traveling salesman problem. It is easy to find a solution that visits all the cities but will be very poor compared to the optimal solution. The algorithm starts with such a solution and makes small improvements to it, such as switching the order in which two cities are visited. Eventually, a much better route is obtained.

Hill climbing is used widely in artificial intelligence, for reaching a goal state from a starting node. Choice of next node and starting node can be varied to give a list of related algorithms.

Mathematical description

Hill climbing attempts to maximize (or minimize) a function f(x), where x ranges over discrete states. These states are typically represented by vertices in a graph, where the edges encode nearness or similarity of the states. Hill climbing will follow the graph from vertex to vertex, always locally increasing (or decreasing) the value of f, until a local maximum (or local minimum) xm is reached. Hill climbing can also operate on a continuous space: in that case, the algorithm is called gradient ascent (or gradient descent if the function is minimized).

Variants


In simple hill climbing, the first closer node is chosen, whereas in steepest ascent hill climbing all successors are compared and the closest to the solution is chosen. Both forms fail if there is no closer node, which may happen if there are local maxima in the search space which are not solutions. Steepest ascent hill climbing is similar to best-first search, which tries all possible extensions of the current path instead of only one.

Stochastic hill climbing does not examine all neighbors before deciding how to move. Rather, it selects a neighbour at random, and decides (based on the amount of improvement in that neighbour) whether to move to that neighbour or to examine another.

Random-restart hill climbing is a meta-algorithm built on top of the hill climbing algorithm. It is also known as Shotgun hill climbing. It iteratively does hill-climbing, each time with a random initial condition x0. The best xm is kept: if a new run of hill climbing produces a better xm than the stored state, it replaces the stored state.

Random-restart hill climbing is a surprisingly effective algorithm in many cases. It turns out that it is often better to spend CPU time exploring the space, than carefully optimizing from an initial condition.
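A minimal Python sketch of random-restart hill climbing, maximizing a toy objective; the objective function and the neighbor definition are invented for illustration:

    import random

    # Maximize f over integer states 0..100 with unit-step neighbors.
    def f(x):
        return -(x - 42) ** 2   # toy objective with its maximum at x = 42

    def neighbours(x):
        return [n for n in (x - 1, x + 1) if 0 <= n <= 100]

    def hill_climb(x):
        while True:
            best = max(neighbours(x), key=f)
            if f(best) <= f(x):
                return x        # local maximum: no better neighbor
            x = best

    def random_restart(runs=10):
        best = hill_climb(random.randint(0, 100))
        for _ in range(runs - 1):
            candidate = hill_climb(random.randint(0, 100))
            if f(candidate) > f(best):
                best = candidate   # keep the best xm found so far
        return best

    print(random_restart())  # -> 42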

Local Maxima

A problem with hill climbing is that it will find only local maxima. Unless the heuristic is convex, it may not reach a global maximum. Other local search algorithms, such as stochastic hill climbing, random walks, and simulated annealing, try to overcome this problem.

Ridges

A ridge is a curve in the search space that leads to a maximum, but the orientation of the ridge compared to the available moves that are used to climb is such that each move will lead to a smaller point. In other words, each point on a ridge looks to the algorithm like a local maximum, even though the point is part of a curve leading to a better optimum.

Plateau


Another problem with hill climbing is that of a plateau, which occurs when we get to a "flat" part of the search space, i.e. we have a path where the heuristics are all very close together. This kind of flatness can cause the algorithm to cease progress and wander aimlessly.

Pseudocode

Hill Climbing Algorithm

currentNode = startNode;
loop do
    L = NEIGHBORS(currentNode);
    nextEval = -INF;
    nextNode = NULL;
    for all x in L
        if (EVAL(x) > nextEval)
            nextNode = x;
            nextEval = EVAL(x);
    if nextEval <= EVAL(currentNode)
        // Return current node since no better neighbors exist
        return currentNode;
    currentNode = nextNode;

Best-First Search

Best-first search is a search algorithm which explores a graph by expanding the most promising node chosen according to a specified rule.

Judea Pearl described best-first search as estimating the promise of node n by a "heuristic evaluation function f(n) which, in general, may depend on the description of n, the description of the goal, the information gathered by the search up to that point, and most important, on any extra knowledge about the problem domain."

Some authors have used "best-first search" to refer specifically to a search with a heuristic that attempts to predict how close the end of a path is to a solution, so that paths which are judged to be closer to a solution are extended first. This specific type of search is called greedy best-first search.

Efficient selection of the current best candidate for extension is typically implemented using a priority queue.

Examples of best-first search algorithms include the A* search algorithm, and in turn, Dijkstra's algorithm (which can be considered a specialization of A*). Best-first algorithms are often used for path finding in combinatorial search.

Code


open = initial state
while open != null do
    1. Pick the best node on open.
    2. Create open's successors.
    3. For each successor do:
        a. If it has not been generated before: evaluate it, add it to OPEN, and record its parent.
        b. Otherwise: change the parent if this new path is better than the previous one.
done
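In a programming language the Open list is naturally a priority queue. The following Python sketch implements the greedy best-first variant described above over a small explicit graph; the graph and the heuristic values h are invented for illustration:

    import heapq

    # Greedy best-first search: always expand the node with the lowest
    # heuristic estimate h(n).

    graph = {"S": ["A", "B"], "A": ["G"], "B": ["G"], "G": []}
    h = {"S": 3, "A": 1, "B": 2, "G": 0}   # assumed heuristic estimates

    def greedy_best_first(start, goal):
        open_heap = [(h[start], start, [start])]
        seen = {start}
        while open_heap:
            _, node, path = heapq.heappop(open_heap)  # best node on open
            if node == goal:
                return path
            for succ in graph[node]:
                if succ not in seen:                  # not generated before
                    seen.add(succ)
                    heapq.heappush(open_heap, (h[succ], succ, path + [succ]))
        return None

    print(greedy_best_first("S", "G"))  # -> ['S', 'A', 'G']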

Syntax of Propositional Logic

Logic is used to represent properties of objects in the world about which we are going to reason. When we say Miss Piggy is plump we are talking about the object Miss Piggy and a property plump. Similarly when we say Kermit's voice is high-pitched then the object is Kermit's voice and the property is high-pitched. It is normal to write these in logic as:

plump(misspiggy)

highpitched(voiceof(kermit))

So misspiggy and kermit are constants representing objects in our domain. Notice that plump and highpitched are different from voiceof:

plump and highpitched represent properties and so are boolean-valued functions. They are often called predicates or relations.

voiceof is a function that returns an object (not true/false). To help us differentiate we shall use "of" at the end of a function name.

The predicates plump and highpitched are unary predicates but of course we can have binary or n-ary predicates; e.g. loves(misspiggy, voiceof(kermit))

Simple Sentences

The fundamental components of logic are

object constants; e.g. misspiggy, kermit
function constants; e.g. voiceof
predicate constants; e.g. plump, highpitched, loves


Predicate and function constants take arguments which are objects in our domain. Predicate constants are used to describe relationships concerning the objects and return the value true/false. Function constants return values that are objects.

More Complex Sentences

We need to apply operators to construct more complex sentences from atoms.

Negation

¬ applied to an atom negates the atom:

¬loves(kermit, voiceof(misspiggy))

''Kermit does not love Miss Piggy's voice''

Conjunction

∧ combines two conjuncts:

loves(misspiggy, kermit) ∧ loves(misspiggy, voiceof(kermit))

''Miss Piggy loves Kermit and Miss Piggy loves Kermit's voice''

Notice it is not correct syntax to write in logic

loves(misspiggy, kermit) ∧ voiceof(kermit)

because we have tried to conjoin a sentence (truth valued) with an object. Logic operators must apply to truth-valued sentences.

Disjunction

∨ combines two disjuncts:

loves(misspiggy, kermit) ∨ loves(misspiggy, voiceof(kermit))

''Miss Piggy loves Kermit or Miss Piggy loves Kermit's voice''

Implication

→ combines a condition and a conclusion:

loves(misspiggy, voiceof(kermit)) → loves(misspiggy, kermit)

''If Miss Piggy loves Kermit's voice then Miss Piggy loves Kermit''


The language we have described so far contains atoms and the connectives ¬, ∧, ∨, and →. This defines the syntax of propositional logic. It is normal to represent atoms in propositional logic as single upper-case letters, but here we have used a more meaningful notation that extends easily to predicate logic.

Semantics of Propositional Logic

We have defined the syntax of propositional logic. However, this is of no use without talking about the meaning, or semantics, of the sentences. Suppose our logic contained only atoms, i.e. no logical connectives. This logic would be very silly, because any subset of these atoms would be consistent; e.g. beautiful(misspiggy) and ugly(misspiggy) are consistent because we cannot represent ugly(misspiggy) → ¬beautiful(misspiggy). So we now need a way in our logic to define which sentences are true.

Example: Models Define Truth

Suppose a language contains only one object constant misspiggy and two relation constants ugly and beautiful. The following models define different facts about Miss Piggy.

M = Ø: In this model Miss Piggy is neither ugly nor beautiful.
M = {ugly(misspiggy)}: In this model Miss Piggy is ugly and not beautiful.
M = {beautiful(misspiggy)}: In this model Miss Piggy is beautiful and not ugly.
M = {ugly(misspiggy), beautiful(misspiggy)}: In this model Miss Piggy is both ugly and beautiful.

The last statement is intuitively wrong, but the model selected determines the truth of the atoms in the language.

Compound Sentences

So far we have restricted our attention to the semantics of atoms: an atom is true if it is a member of the model M; otherwise it is false. Extending the semantics to compound sentences is easy. Notice that in the definitions below p and q do not need to be atoms because these definitions work recursively until atoms are reached.

Conjunction

p ∧ q is true in M iff p and q are true in M individually.

So the conjunct

loves(misspiggy, kermit) ∧ loves(misspiggy, voiceof(kermit))

is true only when both


Miss Piggy loves Kermit; and
Miss Piggy loves Kermit's voice.

Disjunction

p ∨ q is true in M iff at least one of p or q is true in M.

So the disjunct

loves(misspiggy, kermit) ∨ loves(misspiggy, voiceof(kermit))

is true whenever

Miss Piggy loves Kermit;
Miss Piggy loves Kermit's voice; or
Miss Piggy loves both Kermit and his voice.

Therefore the disjunction is weaker than either disjunct and the conjunction of these disjuncts.

Negation

¬p is true in M iff p is not true in M.

Implication

p → q is true in M iff p is not true in M or q is true in M.

We have been careful about the definition of →. When people use an implication p → q they normally imply that p causes q. So if p is true we are happy to say that p → q is true iff q is true. But if p is false the causal link causes confusion, because we can't tell whether q should be true or not. Logic requires that the connectives are truth functional, and so the truth of the compound sentence must be determined from the truth of its component parts. Logic defines that if p is false then p → q is true regardless of the truth of q.

So both of the following implications are true (provided you believe pigs do not fly!):

fly(pigs) → beautiful(misspiggy)
fly(pigs) → ¬beautiful(misspiggy)

Example: Implications and Models

In which of the following models is

ugly(misspiggy) → ¬beautiful(misspiggy) true?

M=Ø


Miss Piggy is not ugly and so the antecedent fails. Therefore the implication holds. (Miss Piggy is also not beautiful in this model.)

M={beautiful(misspiggy)}

Again, Miss Piggy is not ugly and so the implication holds.

M={ugly(misspiggy)}

Miss Piggy is not beautiful and so the conclusion is valid and hence the implication holds.

M={ugly(misspiggy), beautiful(misspiggy)}

Miss Piggy is ugly and so the antecedent holds. But she is also beautiful, and so ¬beautiful(misspiggy) is not true. Therefore the conclusion does not hold, and so the implication fails in this (and only this) model.

Truth Tables

Truth tables are often used to calculate the truth of complex propositional sentences. A truth table represents all possible combinations of truths of the atoms and so contains all possible models. A column is created for each of the atoms in the sentence, and all combinations of truth values for these atoms are assigned one per row. So if there are n atoms then there are n initial columns and 2^n rows. The final column contains the truth of the sentence for each combination of truths for the atoms. Intervening columns can be added to store intermediate truth calculations. Below are two sample truth tables, for p ∧ q and for p → q:

p | q | p ∧ q
T | T |   T
T | F |   F
F | T |   F
F | F |   F

p | q | p → q
T | T |   T
T | F |   F
F | T |   T
F | F |   T
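Truth tables can also be generated mechanically. A small Python sketch that enumerates all 2^n rows for a sentence supplied as a function of its atoms:

    from itertools import product

    # Enumerate all models (rows) of a propositional sentence.
    def truth_table(atoms, sentence):
        print(" ".join(atoms), "| result")
        for values in product([True, False], repeat=len(atoms)):
            model = dict(zip(atoms, values))
            row = " ".join("T" if model[a] else "F" for a in atoms)
            print(row, "|", "T" if sentence(model) else "F")

    # p -> q, defined as (not p) or q
    truth_table(["p", "q"], lambda m: (not m["p"]) or m["q"])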

Equivalence

Two sentences are equivalent if they hold in exactly the same models.


Therefore we can determine equivalence by drawing truth tables that represent the sentences in the various models. If the initial and final columns of the truth tables are identical then the sentences are equivalent. Examples of equivalences include De Morgan's laws: ¬(p ∧ q) ≡ (¬p ∨ ¬q) and ¬(p ∨ q) ≡ (¬p ∧ ¬q).

Unlike ∧ and ∨, → is not commutative:

loves(misspiggy, voiceof(kermit)) → loves(misspiggy, kermit)

is very different from

loves(misspiggy, kermit) → loves(misspiggy, voiceof(kermit))

Similarly, → is not associative.

Syntax & Semantics for Predicate Logic

Syntax of Predicate Logic

Propositional logic is fairly powerful, but we must add variables and quantification to be able to reason about the objects in atoms and to express properties of a set of objects without listing the atom corresponding to each object.

We shall adopt the Prolog convention that variables have an initial capital letter. (This is contrary to many Mathematical Logic books where variables are lower case and constants have an initial capital.)

When we include variables we must specify their scope or quantification. The first quantifier we want is the universal quantifier ∀ (for all).

∀X.loves(misspiggy, X)

This allows X to range over all the objects and asserts that Miss Piggy loves each of them. We have introduced one variable but any number is allowed:

∀XY.loves(X, Y)


Each of the objects loves all of the objects, even itself! Therefore ∀XY. is the same as ∀X.∀Y. Quantifiers, like connectives, act on sentences. So if Miss Piggy loves all cute things (not just Kermit!) we would write

∀C.[cute(C) → loves(misspiggy, C)]

rather than

loves(misspiggy, ∀C.cute(C))

because the second argument to loves must be an object, not a sentence.

When the world contains a finite set of objects then a universally quantified sentence can be converted into a sentence without the quantifier; e.g. ∀X.loves(misspiggy, X) becomes

loves(misspiggy, misspiggy) ∧ loves(misspiggy, kermit) ∧ loves(misspiggy, animal) ∧ ...

Contrast this with the infinite set of positive integers and the sentence

∀N.[odd(N) ∨ even(N)]

The other quantifier is the existential quantifier ∃ (there exists).

∃X.loves(misspiggy, X)

This allows X to range over all the objects and asserts that Miss Piggy loves (at least) one of them. Similarly

∃XY.loves(X, Y)

asserts that there is at least one loving couple (or self-loving object).

We shall be using First Order Predicate Logic where quantified variables range over object constants only. We are defining Second Order Predicate Logic if we allow quantified variables to range over functions or predicates as well; e.g.

∃X.loves(misspiggy, X(kermit)) includes loves(misspiggy, voiceof(kermit))

∃X.X(misspiggy, kermit) (there exists some relationship linking Miss Piggy and Kermit!)

Semantics of First Order Predicate Logic

Now we must deal with quantification.

∀: ∀X.p(X) holds in a model iff p(z) holds for all objects z in our domain.

∃: ∃X.p(X) holds in a model iff there is some object z in our domain so that p(z) holds.

Example: Available Objects affects Quantification

If misspiggy is the only object in our domain then

ugly(misspiggy) → ¬beautiful(misspiggy) is equivalent to

∀X.[ugly(X) → ¬beautiful(X)]

If there were other objects then there would be more atoms and so the set of models would be larger; e.g. with objects misspiggy and kermit the possible models are all combinations of the atoms ugly(misspiggy), beautiful(misspiggy), ugly(kermit), beautiful(kermit). Now the two sentences are no longer equivalent.

1) In every model in which ∀X.[ugly(X) → ¬beautiful(X)] holds, ugly(misspiggy) → ¬beautiful(misspiggy) also holds.

2) There are models in which ugly(misspiggy) → ¬beautiful(misspiggy) holds but ∀X.[ugly(X) → ¬beautiful(X)] does not hold; e.g.

M = {ugly(kermit), beautiful(kermit)}.

What about M = {ugly(misspiggy), beautiful(misspiggy)}?

Clausal Form for Predicate Calculus

In order to prove a formula in the predicate calculus by resolution, we:

1. Negate the formula.

2. Put the negated formula into CNF, by doing the following:

i. Get rid of all → operators (rewrite p → q as ¬p ∨ q).

ii. Push the ¬ operators in as far as possible.

iii. Rename variables as necessary (see the step below).


iv. Move all of the quantifiers to the left (the outside) of the expression using the following rules (where Q is either ∀ or ∃, and G is a formula that does not contain x):

(Qx.F) ∧ G ≡ Qx.(F ∧ G)
(Qx.F) ∨ G ≡ Qx.(F ∨ G)

This leaves the formula in what is called prenex form, which consists of a series of quantifiers followed by a quantifier-free formula, called the matrix.

v. Remove all quantifiers from the formula. First we remove the existentially quantified variables by using Skolemization. Each existentially quantified variable, say x, is replaced by a function term which begins with a new, n-ary function symbol, say f, where n is the number of universally quantified variables that occur before x is quantified in the formula. The arguments to the function term are precisely these variables. For example, if we have the formula

∀x.∀y.∃z.P(x, y, z)

then z would be replaced by a function term f(x, y), where f is a new function symbol. The result is:

∀x.∀y.P(x, y, f(x, y))

This new formula is satisfiable if and only if the original formula is satisfiable.

The new function symbol is called a Skolem function. If the existentially quantified variable has no preceding universally quantified variables, then the function is a 0-ary function and is often called a Skolem constant.

After removing all existential quantifiers, we simply drop all the universal quantifiers as we assume that any variable appearing in a formula is universally quantified.

vi. The remaining formula (the matrix) is put in CNF by distributing ∨ over ∧, so that the ∧ operators are outside of any ∨ operations.

3. Finally, the CNF formula is written in clausal form by writing each conjunct as a set of literals (a clause), and the whole formula as a set of clauses (the clause set).


For example, if we begin with the proposition

we have:

1. Negate the theorem:

i. Push the ¬ operators in. No change.

ii. Rename variables if necessary:

iii. Move the quantifiers to the outside: first we have

then we get

iv. Remove the quantifiers, first by Skolemizing the existentially quantified variables. As these have no universally quantified variables to their left, they are replaced by Skolem constants:

Drop the universal quantifiers:

v. Put the matrix into CNF. No change.

2. Write the formula in clausal form:

Inference Rules

Complex deductive arguments can be judged valid or invalid based on whether or not the steps in that argument follow the nine basic rules of inference. These rules of inference are all relatively simple, although when presented in formal terms they can look overly complex.

Conjunction

1. P
2. Q
3. Therefore, P and Q.


1. It is raining in New York.
2. It is raining in Boston.
3. Therefore, it is raining in both New York and Boston.

Simplification

1. P and Q.
2. Therefore, P.

1. It is raining in both New York and Boston.
2. Therefore, it is raining in New York.

Addition

1. P
2. Therefore, P or Q.

1. It is raining.
2. Therefore, either it is raining or the sun is shining.

Absorption

1. If P, then Q.
2. Therefore, if P then P and Q.

1. If it is raining, then I will get wet.
2. Therefore, if it is raining, then it is raining and I will get wet.

Modus Ponens

1. If P then Q.
2. P.
3. Therefore, Q.

1. If it is raining, then I will get wet.
2. It is raining.
3. Therefore, I will get wet.

Modus Tollens

1. If P then Q.
2. Not Q (~Q).
3. Therefore, not P (~P).


1. If it had rained this morning, I would have gotten wet.
2. I did not get wet.
3. Therefore, it did not rain this morning.

Hypothetical Syllogism

1. If P then Q.
2. If Q then R.
3. Therefore, if P then R.

1. If it rains, then I will get wet.
2. If I get wet, then my shirt will be ruined.
3. Therefore, if it rains, then my shirt will be ruined.

Disjunctive Syllogism

1. Either P or Q.
2. Not P (~P).
3. Therefore, Q.

1. Either it rained or I took a cab to the movies.
2. It did not rain.
3. Therefore, I took a cab to the movies.

Constructive Dilemma

1. (If P then Q) and (If R then S).
2. P or R.
3. Therefore, Q or S.

1. If it rains, then I will get wet, and if it is sunny, then I will be dry.
2. Either it will rain or it will be sunny.
3. Therefore, either I will get wet or I will be dry.

The above rules of inference, when combined with the rules of replacement, mean that the propositional calculus is "complete." The propositional calculus is simply one branch of formal logic.

Resolution

Resolution is a rule of inference leading to a refutation theorem-proving technique for sentences in propositional logic and first-order logic. In other words, iteratively applying the resolution rule in a suitable way allows one to tell whether a propositional formula is satisfiable and to prove that a first-order formula is unsatisfiable; the method can sometimes, but not always, establish that a satisfiable first-order formula is satisfiable, as is the case for all methods for first-order logic. Resolution was introduced by John Alan Robinson in 1965.


Resolution in propositional logic

The resolution rule in propositional logic is a single valid inference rule that produces a new clause implied by two clauses containing complementary literals. A literal is a propositional variable or the negation of a propositional variable. Two literals are said to be complements if one is the negation of the other (in the following, a_i is taken to be the complement to b_j). The resulting clause contains all the literals that do not have complements. Formally:

a_1 ∨ ... ∨ a_i ∨ ... ∨ a_n,    b_1 ∨ ... ∨ b_j ∨ ... ∨ b_m
----------------------------------------------------------------------
a_1 ∨ ... ∨ a_(i-1) ∨ a_(i+1) ∨ ... ∨ a_n ∨ b_1 ∨ ... ∨ b_(j-1) ∨ b_(j+1) ∨ ... ∨ b_m

where

all the a's and b's are literals,
a_i is the complement to b_j, and
the dividing line stands for "entails".

The clause produced by the resolution rule is called the resolvent of the two input clauses.

When the two clauses contain more than one pair of complementary literals, the resolution rule can be applied (independently) for each such pair. However, only the pair of literals that are resolved upon can be removed: all other pairs of literals remain in the resolvent clause.

A resolution technique

When coupled with a complete search algorithm, the resolution rule yields a sound and complete algorithm for deciding the satisfiability of a propositional formula, and, by extension, the validity of a sentence under a set of axioms.

This resolution technique uses proof by contradiction and is based on the fact that any sentence in propositional logic can be transformed into an equivalent sentence in conjunctive normal form. The steps are as follows:

1) All sentences in the knowledge base and the negation of the sentence to be proved (the conjecture) are conjunctively connected.

2) The resulting sentence is transformed into a conjunctive normal form with the conjuncts viewed as elements in a set, S, of clauses.

For example, the CNF formula (a ∨ b) ∧ (¬a ∨ c) would give rise to the set S = {a ∨ b, ¬a ∨ c}.


3) The resolution rule is applied to all possible pairs of clauses that contain complementary literals. After each application of the resolution rule, the resulting sentence is simplified by removing repeated literals. If the sentence contains complementary literals, it is discarded (as a tautology). If not, and if it is not yet present in the clause set S, it is added to S and is considered for further resolution inferences.

4) If after applying a resolution rule the empty clause is derived, the complete formula is unsatisfiable (or contradictory), and hence it can be concluded that the initial conjecture follows from the axioms.

5) If, on the other hand, the empty clause cannot be derived, and the resolution rule cannot be applied to derive any more new clauses, the conjecture is not a theorem of the original knowledge base.

One instance of this algorithm is the original Davis–Putnam algorithm that was later refined into the DPLL algorithm that removed the need for explicit representation of the resolvents.

This description of the resolution technique uses a set S as the underlying data structure to represent resolution derivations. Lists, trees and directed acyclic graphs are other possible and common alternatives. Tree representations are more faithful to the fact that the resolution rule is binary. Together with a sequent notation for clauses, a tree representation also makes it easy to see how the resolution rule is related to a special case of the cut rule, restricted to atomic cut-formulas. However, tree representations are not as compact as set or list representations, because they explicitly show redundant subderivations of clauses that are used more than once in the derivation of the empty clause. Graph representations can be as compact in the number of clauses as list representations, and they also store structural information regarding which clauses were resolved to derive each resolvent.

Example

(a ∨ b), (¬a ∨ c) ⊢ (b ∨ c)

In English: if a or b is true, and a is false or c is true, then either b or c is true.

If a is true, then for the second premise to hold, c must be true. If a is false, then for the first premise to hold, b must be true.

So regardless of a, if both premises hold, then b or c is true.
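The propositional resolution rule itself is only a few lines of Python once clauses are represented as sets of literals (a representation chosen here for illustration); a negative literal is written with a leading "~":

    # Propositional resolution on clauses represented as frozensets of
    # literal strings, e.g. {"a", "~b"}. "~x" is the complement of "x".

    def complement(lit):
        return lit[1:] if lit.startswith("~") else "~" + lit

    def resolvents(c1, c2):
        """All clauses obtainable by resolving c1 with c2."""
        out = []
        for lit in c1:
            if complement(lit) in c2:
                out.append((c1 - {lit}) | (c2 - {complement(lit)}))
        return out

    c1 = frozenset({"a", "b"})      # a or b
    c2 = frozenset({"~a", "c"})     # not a, or c
    print(resolvents(c1, c2))       # -> [frozenset({'b', 'c'})]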

Unification

We also need some way of binding variables to values in a consistent way so that the components of sentences can be matched. This is the process of unification.
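A textbook-style sketch of unification in Python, with compound terms represented as nested tuples and variables as strings beginning with an uppercase letter; both representation choices are assumptions made for illustration, and the occurs check is omitted for brevity:

    # Unification sketch: terms are constants ("kermit"), variables ("X",
    # uppercase first letter), or compound terms ("voiceof", arg1, ...).

    def is_var(t):
        return isinstance(t, str) and t[0].isupper()

    def unify(x, y, subst=None):
        """Return a substitution unifying x and y, or None."""
        if subst is None:
            subst = {}
        if x == y:
            return subst
        if is_var(x):
            return unify_var(x, y, subst)
        if is_var(y):
            return unify_var(y, x, subst)
        if isinstance(x, tuple) and isinstance(y, tuple) and len(x) == len(y):
            for xi, yi in zip(x, y):
                subst = unify(xi, yi, subst)
                if subst is None:
                    return None
            return subst
        return None

    def unify_var(var, term, subst):
        if var in subst:
            return unify(subst[var], term, subst)
        # NOTE: the occurs check is omitted for brevity.
        return {**subst, var: term}

    # loves(misspiggy, X) unified with loves(misspiggy, voiceof(kermit)):
    print(unify(("loves", "misspiggy", "X"),
                ("loves", "misspiggy", ("voiceof", "kermit"))))
    # -> {'X': ('voiceof', 'kermit')}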


Knowledge Representation

Network Representations

Networks are often used in artificial intelligence as schemes for representation. One of the advantages of using a network representation is that theorists in computer science have studied such structures in detail and there are a number of efficient and robust algorithms that may be used to manipulate the representations.

Trees and Graphs

A tree is a collection of nodes in which each node may be expanded into one or more unique subnodes until termination occurs. There may be no termination, in which case an infinite tree results. A graph is simply a tree in which non-unique nodes may be generated; in other words, a tree is a graph with no loops. The representation of the nodes and links is arbitrary. In a computer chess player, for example, nodes might represent individual board positions and the links from each node the legal moves from that position. This is a specific instance of a problem space. In general, problem spaces are graphs in which the nodes represent states and the connections between states are represented by operators that make the state transformations.

IS-A Links and Semantic Networks

In constructing concept hierarchies, often the most important means of showing inclusion in a set is to use what is called an IS-A link, in which X is a member of some more general set Y. For example, a DOG ISA MAMMAL. As one travels up the link, the more general concept is found. This is generally the simplest type of link between concepts in concept or semantic hierarchies. The combination of instances and classes connected by ISA links in a graph or tree is generally known as a semantic network. Semantic networks are useful, in part, because they provide a natural structure for inheritance. For instance, if a DOG ISA MAMMAL, then those properties that are true for MAMMALs need not be specified for the DOG; instead they may be derived via an inheritance procedure. This greatly reduces the amount of information that must be stored explicitly, although there is an increase in the time required to access knowledge through the inheritance mechanism. Frames are a special type of semantic network representation.

Associative Network

A means of representing relational knowledge as a labeled directed graph. Each vertex of the graph represents a concept and each label represents a relation between concepts. Access and updating procedures traverse and manipulate the graph. A semantic network is sometimes regarded as a graphical notation for logical formulas.


Conceptual Graphs

A conceptual graph (CG) is a graph representation for logic based on the semantic networks of artificial intelligence.

A conceptual graph consists of concept nodes and relation nodes.

The concept nodes represent entities, attributes, states, and events.
The relation nodes show how the concepts are interconnected.

Conceptual Graphs are finite, connected, bipartite graphs.

Finite: because any graph (in 'human brain' or 'computer storage') can only have a finite number of concepts and conceptual relations.

Connected: because two parts that are not connected would simply be called two conceptual graphs.

Bipartite: because there are two different kinds of nodes, concepts and conceptual relations, and every arc links a node of one kind to a node of the other kind.

Example

The following is the CG display form for "John is going to Boston by bus". In the linear notation often used for CGs, the same graph can be written as:

[Go]-
    (Agnt) -> [Person: John]
    (Dest) -> [City: Boston]
    (Inst) -> [Bus]

The conceptual graph above represents a typed (or sorted) version of logic. Each of the four concepts has a type label, which represents the type of entity the concept refers to: Person, Go, City, or Bus. Two of the concepts have names, which identify the referent: John or Boston. Each of the three conceptual relations has a type label that represents the type of relation: agent (Agnt), destination (Dest), or instrument (Inst). The CG as a whole indicates that the person John is the agent of some instance of going, the city Boston is the destination, and a bus is the instrument. The graph can be translated to the following formula:

(∃x)(∃y)(Go(x) ∧ Person(John) ∧ City(Boston) ∧ Bus(y) ∧ Agnt(x, John) ∧ Dest(x, Boston) ∧ Inst(x, y))


As this translation shows, the only logical operators used in the figure are conjunction and the existential quantifier. Those two operators are the most common in translations from natural languages, and many of the early semantic networks could not represent any others.

Structured Representation

Structured representation can be done in various ways, such as:

Frames

Scripts

Frames

A frame is a method of representation in which a particular class is defined by a number of attributes (or slots) with certain values (the attributes are filled in for each instance). Thus, frames are also known as slot-and-filler structures. Frame systems are also somewhat equivalent to semantic networks although frames are usually associated with more defined structure than the networks.

Like a semantic network, one of the chief properties of frames is that they provide a natural structure for inheritance. ISA links connect classes to larger parent classes, and properties of the subclasses may be determined both at the level of the class itself and from parent classes.

This leads into the idea of defaults. Frames may indicate specific values for some attributes or instead indicate a default. This is especially useful when values are not always known but can generally be assumed to be true for most of the class. For example, the class BIRD may have a default value of FLIES set to TRUE even though instances below it (say, for example, an OSTRICH) have FLIES values of FALSE.

In addition, the value of a particular attribute need not necessarily be filled in directly but may instead indicate a procedure to run to obtain the value. This is known as an attached procedure. Attached procedures are especially useful when there is a high cost associated with computing a particular value, when the value changes with time, or when the expected access frequency is low. Instead of computing the value for each instance, the values are computed only when needed. However, this computation is run during execution (rather than during the establishment of the frame network) and may be costly.
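A minimal sketch of these ideas, with invented frames: defaults are inherited through parent links, local values override them, and a callable slot acts as an attached procedure.

import datetime

# Each frame has an optional parent and slots; a slot may hold a plain value
# (a default or a filled value) or a callable attached procedure that
# computes the value on demand.
frames = {
    "BIRD":    {"parent": None,   "slots": {"flies": True}},
    "OSTRICH": {"parent": "BIRD", "slots": {"flies": False}},
    # Hypothetical: TWEETY hatched in 2020; age is computed when asked for.
    "TWEETY":  {"parent": "BIRD",
                "slots": {"age": lambda: datetime.date.today().year - 2020}},
}

def get_slot(frame, slot):
    """Look up a slot, inheriting defaults and running attached procedures."""
    while frame is not None:
        value = frames[frame]["slots"].get(slot)
        if value is not None:
            return value() if callable(value) else value
        frame = frames[frame]["parent"]
    return None

print(get_slot("OSTRICH", "flies"))  # False: local value overrides the default
print(get_slot("TWEETY", "flies"))   # True: inherited BIRD default
print(get_slot("TWEETY", "age"))     # computed by the attached procedure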

Scripts

A script is a remembered precedent, consisting of tightly coupled, expectation-suggesting primitive-action and state-change frames.


A script is a structured representation describing a stereotyped sequence of events in a particular context. That is, extend frames by explicitly representing expectations of actions and state changes.

Why represent knowledge in this way?

1) Because real-world events do follow stereotyped patterns. Human beings use previous experiences to understand verbal accounts; computers can use scripts instead.

2) Because people, when relating events, do leave large amounts of assumed detail out of their accounts. People don't find it easy to converse with a system that can't fill in missing conversational detail.

Min-Max Algorithm

There are plenty of applications for AI, but games are the most interesting to the public. Nowadays every major OS comes with some games. So it is no surprise that there are some algorithms that were devised with games in mind.

The Min-Max algorithm is applied in two-player games, such as tic-tac-toe, checkers, chess, go, and so on. All these games have at least one thing in common: they are logic games. This means that they can be described by a set of rules and premises. With them, it is possible to know, from a given point in the game, what the next available moves are. They also share another characteristic: they are "full information games". Each player knows everything about the possible moves of the adversary.

Before explaining the algorithm, a brief introduction to search trees is required. Search trees are a way to represent searches. Nodes represent points of decision in the search, and they are connected with branches. The search starts at the root node. At each decision point, nodes for the available search paths are generated, until no more decisions are possible. The nodes that represent the end of the search are known as leaf nodes.

There are two players involved, MAX and MIN. A search tree is generated depth-first, starting with the current game position and going down to the end-game positions. Then each final game position is evaluated from MAX's point of view. Afterwards, the inner node values of the tree are filled bottom-up with the evaluated values: the nodes that belong to the MAX player receive the maximum value of their children, and the nodes for the MIN player receive the minimum value of their children.

MinMax (GamePosition game) {
    return MaxMove(game);
}

MaxMove (GamePosition game) {
    if (GameEnded(game)) {
        return EvalGameState(game);
    } else {
        best_move <- {};
        moves <- GenerateMoves(game);
        ForEach move in moves {
            move <- MinMove(ApplyMove(game, move));
            if (Value(move) > Value(best_move)) {
                best_move <- move;
            }
        }
        return best_move;
    }
}

MinMove (GamePosition game) {
    best_move <- {};
    moves <- GenerateMoves(game);
    ForEach move in moves {
        move <- MaxMove(ApplyMove(game, move));
        // MIN keeps the move with the lowest value
        if (Value(move) < Value(best_move)) {
            best_move <- move;
        }
    }
    return best_move;
}

So what is happening here? The values represent how good a game move is. The MAX player will try to select the move with the highest value in the end. But the MIN player also has something to say about it, and he will try to select the moves that are better for him, thus minimizing MAX's outcome.
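The same idea in compact, runnable form. This is our own sketch, not the article's code: it operates on a game tree given directly as nested lists, with leaves holding evaluations from MAX's point of view.

def minimax(node, maximizing):
    """Return the minimax value of a game tree.

    A node is either a number (a leaf evaluation from MAX's point of view)
    or a list of child nodes.
    """
    if not isinstance(node, list):      # leaf: an evaluated game position
        return node
    values = [minimax(child, not maximizing) for child in node]
    return max(values) if maximizing else min(values)

# MAX moves at the root; MIN replies one level down.
tree = [[3, 12, 8], [2, 4, 6], [14, 5, 2]]
print(minimax(tree, True))  # 3: MAX picks the branch whose worst case is best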

Optimisation


However, only very simple games can have their entire search tree generated in a short time. For most games this isn't possible; the universe would probably vanish first. So there are a few optimizations to add to the algorithm.

First a word of caution: optimization comes with a price. When optimizing, we are trading the full information about the game's events for probabilities and shortcuts. Instead of knowing the full path that leads to victory, the decisions are made with the path that might lead to victory. If the optimization isn't well chosen, or is badly applied, we could end up with a dumb AI, and it would have been better to use random moves.

One basic optimization is to limit the depth of the search tree. Why does this help? Generating the full tree could take ages. If a game has a branching factor of 3, which means that each node has three children, the tree will have 1 node at depth 0, 3 at depth 1, 9 at depth 2, 27 at depth 3, and so on.

In other words, at depth n the tree has 3^n nodes. To know the total number of generated nodes, we need to sum the node count at each level: 3^0 + 3^1 + ... + 3^n, which is (3^(n+1) - 1)/2. For many games, like chess, that have a very big branching factor, this means that the tree might not fit into memory. Even if it did, it would take too long to generate. If each node took 1 second to be analyzed, a search tree of depth 5 would take 1 + 3 + 9 + 27 + 81 + 243 = 364 seconds, about 6 minutes! This is too long for a game; the player would give up if he had to wait 6 minutes for each move from the computer.

The second optimization is to use a function that evaluates the current game position from the point of view of some player. It does this by giving a value to the current state of the game: counting the number of pieces on the board, for example, or the number of moves left to the end of the game, or anything else that we might use to give a value to the game position.

Instead of evaluating the current game position directly, the function might estimate how the current game position might help in ending the game; in other words, how probable it is that, given the current game position, we win the game. In this case the function is known as an estimation function.

This function will have to take into account some heuristics. Heuristics are knowledge that we have about the game, and they can help generate better evaluation functions. For example, in checkers, pieces at corners and side positions can't be captured. So we can create an evaluation function that gives higher values to pieces that lie on those board positions, thus giving higher outcomes for game moves that place pieces on those positions.

One of the reasons that the evaluation function must be able to evaluate game positions for both players is that you don't know which player will be on move when the depth limit is reached.

However, having two functions can be avoided if the game is symmetric, meaning that the loss of one player equals the gain of the other. Such games are known as zero-sum games. For these games one evaluation function is enough; one of the players just has to negate the return value of the function.

The revised algorithm is:

MinMax (GamePosition game) {
    return MaxMove(game);
}

MaxMove (GamePosition game) {
    if (GameEnded(game) || DepthLimitReached()) {
        return EvalGameState(game, MAX);
    } else {
        best_move <- {};
        moves <- GenerateMoves(game);
        ForEach move in moves {
            move <- MinMove(ApplyMove(game, move));
            if (Value(move) > Value(best_move)) {
                best_move <- move;
            }
        }
        return best_move;
    }
}

MinMove (GamePosition game) {
    if (GameEnded(game) || DepthLimitReached()) {
        return EvalGameState(game, MIN);
    } else {
        best_move <- {};
        moves <- GenerateMoves(game);
        ForEach move in moves {
            move <- MaxMove(ApplyMove(game, move));
            // MIN keeps the move with the lowest value
            if (Value(move) < Value(best_move)) {
                best_move <- move;
            }
        }
        return best_move;
    }
}

Even so, the algorithm has a few flaws; some of them can be fixed, while others can only be solved by choosing another algorithm.

One of the flaws is that, if the game is too complex, the answer will always take too long, even with a depth limit. One solution is to limit the time available for the search: if the time runs out, choose the best move found so far.

A bigger flaw is the limited-horizon problem. A game position that appears to be very good might turn out to be very bad. This happens because the algorithm wasn't able to see that, a few game moves ahead, the adversary would be able to make a move that brings him a great outcome. The algorithm missed that fatal move because it was blinded by the depth limit.

Speeding the Algorithm

There are a few things that can still be done to reduce the search time. Suppose that, at a MAX node, the value already found for child A is 3, and the first value found in the subtree starting at child B is 2. Since the B node is at a MIN level, we know that the value selected for the B node must be less than or equal to 2. But we also know that the A node has the value 3, and both A and B share the same parent at a MAX level. This means that the game path starting at the B node would never be selected, because 3 is better than 2 for the MAX node. So it isn't worth pursuing the search for children of the B node, and we can safely ignore all its remaining children.

This all means that sometimes the search can be aborted because we find out that the search subtree won’t lead us to any viable answer.

This optimization is known as alpha-beta cutoffs, and the algorithm is as follows:

1. Have two values passed around the tree nodes:

i) the alpha value, which holds the best MAX value found;

ii) the beta value, which holds the best MIN value found.


2. At a MAX level, before evaluating each child path, compare the value returned by the previous path with the beta value. If the value is greater, abort the search for the current node;

3. At a MIN level, before evaluating each child path, compare the value returned by the previous path with the alpha value. If the value is lesser, abort the search for the current node.

Full pseudocode for MinMax with alpha-beta cutoffs:

MinMax (GamePosition game) {
    return MaxMove(game, -INFINITY, +INFINITY);
}

MaxMove (GamePosition game, Integer alpha, Integer beta) {
    if (GameEnded(game) || DepthLimitReached()) {
        return EvalGameState(game, MAX);
    } else {
        best_move <- {};
        moves <- GenerateMoves(game);
        ForEach move in moves {
            move <- MinMove(ApplyMove(game, move), alpha, beta);
            if (Value(move) > Value(best_move)) {
                best_move <- move;
                alpha <- Value(move);
            }
            // Ignore remaining moves: MIN would never allow this line of play
            if (alpha >= beta) {
                return best_move;
            }
        }
        return best_move;
    }
}

MinMove (GamePosition game, Integer alpha, Integer beta) {
    if (GameEnded(game) || DepthLimitReached()) {
        return EvalGameState(game, MIN);
    } else {
        best_move <- {};
        moves <- GenerateMoves(game);
        ForEach move in moves {
            move <- MaxMove(ApplyMove(game, move), alpha, beta);
            if (Value(move) < Value(best_move)) {
                best_move <- move;
                beta <- Value(move);
            }
            // Ignore remaining moves: MAX would never allow this line of play
            if (alpha >= beta) {
                return best_move;
            }
        }
        return best_move;
    }
}

How much better does MinMax with alpha-beta cutoffs behave compared with plain MinMax? It depends on the order in which moves are searched. If the order in which game positions are generated doesn't create situations where the algorithm can take advantage of alpha-beta cutoffs, the improvements won't be noticeable. However, if the evaluation function and the generation of game positions lead to frequent alpha-beta cutoffs, the improvements can be great.
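For comparison, here is a runnable sketch of the cutoff logic on the same nested-list trees used in the earlier minimax example (our own illustration, not the article's code):

def alphabeta(node, maximizing, alpha=float("-inf"), beta=float("inf")):
    """Minimax with alpha-beta cutoffs over nested-list game trees."""
    if not isinstance(node, list):      # leaf: an evaluated game position
        return node
    if maximizing:
        value = float("-inf")
        for child in node:
            value = max(value, alphabeta(child, False, alpha, beta))
            alpha = max(alpha, value)
            if alpha >= beta:           # MIN would never allow this line: prune
                break
        return value
    value = float("inf")
    for child in node:
        value = min(value, alphabeta(child, True, alpha, beta))
        beta = min(beta, value)
        if alpha >= beta:               # MAX would never allow this line: prune
            break
    return value

tree = [[3, 12, 8], [2, 4, 6], [14, 5, 2]]
print(alphabeta(tree, True))  # 3, the same answer with fewer nodes visited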

Alpha-Beta Cutoff

With all this talk about search speed, many of you might be wondering what this is all about. Search speed is very important in AI: if an algorithm takes too long to give a good answer, it may not be usable at all.

For example, a good MinMax implementation with an evaluation function capable of giving very good estimates might be able to search 1000 positions a second. In tournament chess each player has around 150 seconds to make a move, so it would be able to analyze about 150,000 positions in that period. But in chess each move has around 35 possible branches! In the end the program would only be able to look around 3 to 4 moves ahead in the game. Even humans with very little practice in chess can do better than this.

But if we use MinMax with alpha-beta cutoffs, again with a decent implementation and a good evaluation function, the resulting behaviour can be much better. In this case, the program might be able to double the number of analyzed positions, becoming a much tougher adversary.

Example

Example of a board with the values estimated for each position.


The game uses MinMax with alpha-beta cutoffs for the computer moves. The evaluation function is a weighted average of the positions occupied by the checker pieces: each board position has a value, and that value is multiplied by a weight for the type of piece that rests on it.

Rule based Expert System

Expert System

"An expert system is an interactive computer-based decision tool that uses both facts and heuristics to solve difficult decision problems based on knowledge acquired from an expert."

An expert system is a computer program that simulates the thought process of a human expert to solve complex decision problems in a specific domain. This chapter addresses the characteristics of expert systems that make them different from conventional programming and traditional decision support tools. The growth of expert systems is expected to continue for several years, and with that growth many new and exciting applications will emerge. An expert system operates as an interactive system that responds to questions, asks for clarification, makes recommendations, and generally aids the decision-making process. Expert systems provide expert advice and guidance in a wide variety of activities, computer diagnosis among them.

An expert system may be viewed as a computer simulation of a human expert. Expert systems are an emerging technology with many areas for potential applications. Past applications range from MYCIN, used in the medical field to diagnose infectious blood diseases, to XCON, used to configure computer systems. These expert systems have proven to be quite successful. Most applications of expert systems will fall into one of the following categories:

Interpreting and identifying

Predicting

Diagnosing

Designing

Planning


Monitoring

Debugging and testing

Instructing and training

Controlling

Applications that are computational or deterministic in nature are not good candidates for expert systems. Traditional decision support systems such as spreadsheets are very mechanistic in the way they solve problems. They operate under mathematical and Boolean operators in their execution and arrive at one and only one static solution for a given set of data. Calculation-intensive applications with very exacting requirements are better handled by traditional decision support tools or conventional programming. The best application candidates for expert systems are those dealing with expert heuristics for solving problems. Conventional computer programs are based on factual knowledge, an indisputable strength of computers. Humans, by contrast, solve problems on the basis of a mixture of factual and heuristic knowledge. Heuristic knowledge, composed of intuition, judgment, and logical inferences, is an indisputable strength of humans. Successful expert systems will be those that combine facts and heuristics and thus merge human knowledge with computer power in solving problems. To be effective, an expert system must focus on a particular problem domain, as discussed below.

Domain Specificity

Expert systems are typically very domain specific. For example, a diagnostic expert system for troubleshooting computers must actually perform all the necessary data manipulation as a human expert would. The developer of such a system must limit the scope of the system to just what is needed to solve the target problem. Special tools or programming languages are often needed to accomplish the specific objectives of the system.

Special Programming Languages

Expert systems are typically written in special programming languages. The use of languages like LISP and PROLOG in the development of an expert system simplifies the coding process. The major advantage of these languages, as compared to conventional programming languages, is the ease with which rules can be added, eliminated, or substituted, together with their memory-management capabilities. Some of the distinguishing characteristics of programming languages needed for expert systems work are:

Efficient mix of integer and real variables

Good memory-management procedures

Extensive data-manipulation routines

Incremental compilation

Tagged memory architecture

Optimization of the systems environment


Efficient search procedures

Architecture of Expert System !

Expert systems typically contain the following four components:

Knowledge-Acquisition Interface

User Interface

Knowledge Base

Inference Engine

This architecture differs considerably from traditional computer programs, resulting in several characteristics of expert systems.


Figure: Expert System Components

Knowledge-Acquisition Interface

The knowledge-acquisition interface controls how the expert and knowledge engineer interact with the program to incorporate knowledge into the knowledge base. It includes features to assist experts in expressing their knowledge in a form suitable for reasoning by the computer.

This process of expressing knowledge in the knowledge base is called knowledge acquisition. Knowledge acquisition turns out to be quite difficult in many cases, so difficult that some authors refer to the "knowledge acquisition bottleneck" to indicate that it is this aspect of expert system development which often requires the most time and effort.


Debugging faulty knowledge bases is facilitated by traces (lists of rules in the order they were fired), probes (commands to find and edit specific rules, facts, and so on), and bookkeeping functions and indexes (which keep track of various features of the knowledge base such as variables and rules). Some rule-based expert system shells for personal computers monitor data entry, checking the syntactic validity of rules. Expert systems are typically validated by testing their predictions for several cases against those of human experts. Case facilities, permitting a file of such cases to be stored and automatically evaluated after the program is revised, can greatly speed the validation process. Many features that are useful for the user interface, such as on-screen help and explanations, are also of benefit to the developer of expert systems and are also part of knowledge-acquisition interfaces.

Expert systems in the literature demonstrate a wide range of modes of knowledge acquisition (Buchanan, 1985). Expert system shells on microcomputers typically require the user to either enter rules explicitly or enter several examples of cases with appropriate conclusions, from which the program will infer a rule.

User Interface

The user interface is the part of the program that interacts with the user. It prompts the user for information required to solve a problem, displays conclusions, and explains its reasoning.

Features of the user interface often include:

Doesn't ask "dumb" questions

Explains its reasoning on request

Provides documentation and references

Defines technical terms

Permits sensitivity analyses, simulations, and what-if analyses

Detailed report of recommendations

Justifies recommendations

Online help

Graphical displays of information

Trace or step through reasoning

The user interface can be judged by how well it reproduces the kind of interaction one might expect between a human expert and someone consulting that expert.

Knowledge Base

The knowledge base consists of specific knowledge about some substantive domain. A knowledge base differs from a data base in that the knowledge base includes both explicit knowledge and implicit knowledge. Much of the knowledge in the knowledge base is not stated explicitly, but inferred by the inference engine from explicit statements in the knowledge base. This makes data storage in knowledge bases more compact than in data bases and gives them the power to represent knowledge that is only implied by the explicit statements.

There are several important ways in which knowledge is represented in a knowledge base. For more information, see knowledge representation strategies.

Knowledge bases can contain many different types of knowledge and the process of acquiring knowledge for the knowledge base (this is often called knowledge acquisition) often needs to be quite different depending on the type of knowledge sought.

Types of Knowledge

There are many different kinds of knowledge considered in expert systems. Many of these form dimensions of contrasting knowledge:

explicit knowledge

implicit knowledge

domain knowledge

common sense or world knowledge

heuristics

algorithms

procedural knowledge

declarative or semantic knowledge

public knowledge

private knowledge

shallow knowledge

deep knowledge

metaknowledge

Inference Engine

The inference engine uses general rules of inference to reason from the knowledge base and draw conclusions which are not explicitly stated but can be inferred from the knowledge base.

Inference engines are capable of symbolic reasoning, not just mathematical reasoning. Hence, they expand the scope of fruitful applications of computer programs.


The specific forms of inference permitted by different inference engines vary, depending on several factors, including the knowledge representation strategies employed by the expert system.
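As a minimal sketch of this process (the rules and facts here are invented), a forward-chaining engine repeatedly applies rules whose conditions are already satisfied until no new conclusions appear:

# Each rule: if all conditions are in the fact set, add the conclusion.
rules = [
    ({"has_fever", "has_rash"}, "suspect_measles"),
    ({"suspect_measles"}, "recommend_lab_test"),
]
facts = {"has_fever", "has_rash"}       # explicit knowledge

# Fire rules until a full pass adds nothing new (a fixed point).
changed = True
while changed:
    changed = False
    for conditions, conclusion in rules:
        if conditions <= facts and conclusion not in facts:
            facts.add(conclusion)       # implicit knowledge made explicit
            changed = True

print(sorted(facts))
# ['has_fever', 'has_rash', 'recommend_lab_test', 'suspect_measles']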

Expert System Development !

Most expert systems are developed by a team of people, with the number of members varying with the complexity and scope of the project. Of course, a single individual can develop a very simple system. But usually at least two people are involved.

There are two essential roles that must be filled by the development team: knowledge engineer and substantive expert.

The Knowledge Engineer

The Substantive Expert

The Knowledge Engineer

Criteria for selecting the Knowledge Engineer

Competent

Organized

Patient

Problem with Knowledge Engineer

Technician with little social skill

Sociable with low technical skill

Disorganized

Unwilling to challenge the expert to produce clarity

Unable to listen carefully to expert

Undiplomatic when discussing flaws in system or expert's knowledge

Unable to quickly understand diverse substantive areas

The Substantive Expert

Criteria for selecting the expert

Competent

Available

Articulate


Self-Confident

Open-Minded

Varieties of experts

No expert

Multiple experts

Book knowledge only

The knowledge engineer is also the expert

Problem Experts

The unavailable expert

The reluctant expert

The cynical expert

The arrogant expert

The rambling expert

The uncommunicative expert

The too-cooperative expert

The would-be-knowledge-engineer expert

Development Process

The systems development process used for traditional software such as management information systems often employs a process described as the "System Development Life Cycle" or "Waterfall" model. While this model identifies a number of important tasks in the development process, many developers have found it inadequate for expert systems, for a number of important reasons. Instead, many expert systems are developed using a process called "Rapid Prototyping and Incremental Development."

System Development Life-Cycle

Problem Analysis

Is the problem solvable? Is it feasible with this approach? Cost-benefit analysis

Requirement Specification

What are the desired features and goals of the proposed system? Who are the users? What constraints must be considered? What development and delivery environments will be used?


Design

Preliminary Design - overall structure, data flow diagram, perhaps language

Detailed Design - details of each module

Implementation

Writing and debugging code, integrating modules, creating interfaces

Testing

Comparing system to its specifications and assessing validity

Maintenance

Corrections, modifications, enhancements

Managing Uncertainty in Expert Systems

Sources of uncertainty in Expert System

Weak implication

Imprecise language

Unknown data

Difficulty in combining the views of different experts

Uncertainty in AI

Information is partial

Information is not fully reliable

Representation language is inherently imprecise

Information comes from multiple sources and it is conflicting

Information is approximate

Non-absolute cause-effect relationships exist

Representing uncertain information in Expert System

Probabilistic reasoning

Certainty factors

Theory of evidence


Fuzzy logic

Neural Network

GA

Rough set

Bayesian Probability Theory

Bayesian probability is one of the most popular interpretations of the concept of probability. The Bayesian interpretation of probability can be seen as an extension of logic that enables reasoning with uncertain statements. To evaluate the probability of a hypothesis, the Bayesian probabilist specifies some prior probability, which is then updated in the light of new relevant data. The Bayesian interpretation provides a standard set of procedures and formulae to perform this calculation.

Bayesian probability interprets the concept of probability as "a measure of a state of knowledge", in contrast to interpreting it as a frequency or a physical property of a system. Its name is derived from the 18th century statistician Thomas Bayes, who pioneered some of the concepts. Broadly speaking, there are two views on Bayesian probability that interpret the state of knowledge concept in different ways. According to the objectivist view, the rules of Bayesian statistics can be justified by requirements of rationality and consistency and interpreted as an extension of logic. According to the subjectivist view, the state of knowledge measures a "personal belief". Many modern machine learning methods are based on objectivist Bayesian principles. One of the crucial features of the Bayesian view is that a probability is assigned to a hypothesis, whereas under the frequentist view, a hypothesis is typically rejected or not rejected without directly assigning a probability.

The probability of a hypothesis given the data (the posterior) is proportional to the product of the likelihood times the prior probability (often just called the prior). The likelihood brings in the effect of the data, while the prior specifies the belief in the hypothesis before the data was observed.

More formally, Bayesian inference uses Bayes' formula for conditional probability:

P(H | D) = P(D | H) × P(H) / P(D)

where

H is a hypothesis, and D is the data.

P(H) is the prior probability of H: the probability that H is correct before the data D was seen.


P(D | H) is the conditional probability of seeing the data D given that the hypothesis H is true. P(D | H) is called the likelihood.

P(D) is the marginal probability of D.

P(H | D) is the posterior probability: the probability that the hypothesis is true, given the data and the previous state of belief about the hypothesis.
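A small worked example (all numbers invented): with a prior of 0.01 and data nine times more likely under the hypothesis than under its negation, the posterior works out to about 0.083.

# Bayes' rule with invented numbers: a rare hypothesis and a noisy test.
p_h = 0.01            # prior P(H)
p_d_given_h = 0.9     # likelihood P(D | H)
p_d_given_not_h = 0.1 # P(D | not H)

# Marginal probability of the data: P(D) = P(D|H)P(H) + P(D|~H)P(~H)
p_d = p_d_given_h * p_h + p_d_given_not_h * (1 - p_h)

# Posterior P(H | D) = P(D | H) * P(H) / P(D)
p_h_given_d = p_d_given_h * p_h / p_d
print(round(p_h_given_d, 3))  # ~0.083: the data raises belief in H, yet H stays unlikely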

Stanford Certainty Factor

Uncertainty is represented as a degree of belief in two steps:

Express the degree of belief

Manipulate the degrees of belief during the use of knowledge-based systems

It is also based on evidence (or the expert’s assessment).

Form of certainty factors in ES

IF <evidence> THEN <hypothesis> {cf}

cf represents belief in hypothesis H given that evidence E has occurred

It is based on two functions:

i) Measure of belief MB(H, E)

ii) Measure of disbelief MD(H, E)

These indicate the degree to which belief or disbelief in hypothesis H would be increased if evidence E were observed.

Uncertain terms and their interpretation


Total strength of belief and disbelief in a hypothesis:

CF(H, E) = MB(H, E) - MD(H, E)
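As a sketch of how this is used in practice (the evidence values are invented; the parallel-combination rule MB = MB1 + MB2 × (1 − MB1) is the standard MYCIN-style rule for independent confirming evidence):

def combine_mb(mb1, mb2):
    """MYCIN-style combination of belief from two independent pieces of evidence."""
    return mb1 + mb2 * (1 - mb1)

mb = combine_mb(0.6, 0.4)   # two rules supporting the hypothesis
md = 0.1                    # one weak piece of disconfirming evidence
cf = mb - md                # total certainty factor: CF = MB - MD
print(round(mb, 2), round(cf, 2))  # 0.76 0.66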

Nonmonotonic logic and Reasoning with Beliefs

A non-monotonic logic is a formal logic whose consequence relation is not monotonic. Most studied formal logics have a monotonic consequence relation, meaning that adding a formula to a theory never produces a reduction of its set of consequences. Intuitively, monotonicity indicates that learning a new piece of knowledge cannot reduce the set of what is known. A monotonic logic cannot handle various reasoning tasks such as reasoning by default (consequences may be derived only because of lack of evidence of the contrary), abductive reasoning (consequences are only deduced as most likely explanations) and some important approaches to reasoning about knowledge (the ignorance of a consequence must be retracted when the consequence becomes known) and similarly belief revision (new knowledge may contradict old beliefs).

Default reasoning

An example of a default assumption is that the typical bird flies. As a result, if a given animal is known to be a bird, and nothing else is known, it can be assumed to be able to fly. The default assumption must however be retracted if it is later learned that the considered animal is a penguin. This example shows that a logic that models default reasoning should not be monotonic. Logics formalizing default reasoning can be roughly divided into two categories: logics able to deal with arbitrary default assumptions (default logic, defeasible logic/defeasible reasoning/argument (logic), and answer set programming) and logics that formalize the specific default assumption that facts that are not known to be true can be assumed false by default (closed world assumption and circumscription).


Abductive reasoning

Abductive reasoning is the process of deriving the most likely explanations of the known facts. An abductive logic should not be monotonic because the most likely explanations are not necessarily correct. For example, the most likely explanation for seeing wet grass is that it rained; however, this explanation has to be retracted when learning that the real cause of the grass being wet was a sprinkler. Since the old explanation (it rained) is retracted because of the addition of a piece of knowledge (a sprinkler was active), any logic that models explanations is non-monotonic.

Reasoning about knowledge

If a logic includes formulae that mean that something is not known, this logic should not be monotonic. Indeed, learning something that was previously not known leads to the removal of the formula specifying that this piece of knowledge is not known. This second change (a removal caused by an addition) violates the condition of monotonicity. A logic for reasoning about knowledge is the autoepistemic logic.

Belief revision

Belief revision is the process of changing beliefs to accommodate a new belief that might be inconsistent with the old ones. On the assumption that the new belief is correct, some of the old ones have to be retracted in order to maintain consistency. This retraction in response to an addition of a new belief makes any logic for belief revision non-monotonic. The belief revision approach is an alternative to paraconsistent logics, which tolerate inconsistency rather than attempting to remove it.

What makes belief revision non-trivial is that several different ways for performing this operation may be possible. For example, if the current knowledge includes the three facts “A is true”, “B is true” and “if A and B are true then C is true”, the introduction of the new information “C is false” can be done preserving consistency only by removing at least one of the three facts. In this case, there are at least three different ways for performing revision. In general, there may be several different ways for changing knowledge.

Fuzzy Logic

The concept of Fuzzy Logic (FL) was conceived by Lotfi Zadeh, a professor at the University of California at Berkeley, and presented not as a control methodology, but as a way of processing data by allowing partial set membership rather than crisp set membership or non-membership. This approach to set theory was not applied to control systems until the 1970s due to insufficient small-computer capability prior to that time. Professor Zadeh reasoned that people do not require precise, numerical information input, and yet they are capable of highly adaptive control. If feedback controllers could be programmed to accept noisy, imprecise input, they would be much more effective and perhaps easier to implement. Unfortunately, U.S. manufacturers have not been so quick to embrace this technology, while the Europeans and Japanese have been aggressively building real products around it.


WHAT IS FUZZY LOGIC?

In this context, FL is a problem-solving control system methodology that lends itself to implementation in systems ranging from simple, small, embedded micro-controllers to large, networked, multi-channel PC or workstation-based data acquisition and control systems. It can be implemented in hardware, software, or a combination of both. FL provides a simple way to arrive at a definite conclusion based upon vague, ambiguous, imprecise, noisy, or missing input information. FL's approach to control problems mimics how a person would make decisions, only much faster.

HOW IS FL DIFFERENT FROM CONVENTIONAL CONTROL METHODS?

FL incorporates a simple, rule-based "IF X AND Y THEN Z" approach to solving a control problem, rather than attempting to model the system mathematically. The FL model is empirically based, relying on an operator's experience rather than their technical understanding of the system. For example, rather than dealing with temperature control in terms such as "SP = 500F", "T < 1000F", or "210C < TEMP < 220C", terms like "IF (process is too cool) AND (process is getting colder) THEN (add heat to the process)" or "IF (process is too hot) AND (process is heating rapidly) THEN (cool the process quickly)" are used. These terms are imprecise and yet very descriptive of what must actually happen. Consider what you do in the shower if the temperature is too cold: you will make the water comfortable very quickly, with little trouble. FL is capable of mimicking this type of behavior, but at a very high rate.

HOW DOES FL WORK?

FL requires some numerical parameters in order to operate, such as what is considered a significant error and a significant rate-of-change-of-error, but exact values of these numbers are usually not critical unless very responsive performance is required, in which case empirical tuning would determine them. For example, a simple temperature control system could use a single temperature feedback sensor whose data is subtracted from the command signal to compute "error" and then time-differentiated to yield the error slope or rate-of-change-of-error, hereafter called "error-dot". Error might have units of degrees F, with a small error considered to be 2F and a large error 5F. The "error-dot" might then have units of degrees/min, with a small error-dot being 5F/min and a large one being 15F/min. These values don't have to be symmetrical and can be "tweaked" once the system is operating in order to optimize performance. Generally, FL is so forgiving that the system will probably work the first time without any tweaking.
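To make the error/error-dot example concrete, here is a sketch using triangular membership functions, a common way to encode such fuzzy categories (the thresholds follow the figures above loosely and are otherwise invented):

def triangle(x, left, peak, right):
    """Degree of membership in a triangular fuzzy set, between 0 and 1."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)

# "small error" peaks at 2F and vanishes by 5F; "large error" is shifted upward.
error = 3.5
small = triangle(error, 0.0, 2.0, 5.0)
large = triangle(error, 2.0, 5.0, 8.0)
print(round(small, 2), round(large, 2))  # 0.5 0.5: the reading is partly both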

Dempster/Shafer Theory

The Dempster-Shafer theory, also known as the theory of belief functions, is a generalization of the Bayesian theory of subjective probability. Whereas the Bayesian theory requires probabilities for each question of interest, belief functions allow us to base degrees of belief for one question on probabilities for a related question. These degrees of belief may or may not have the mathematical properties of probabilities; how much they differ from probabilities will depend on how closely the two questions are related.


The Dempster-Shafer theory owes its name to work by A. P. Dempster (1968) and Glenn Shafer (1976), but the kind of reasoning the theory uses can be found as far back as the seventeenth century. The theory came to the attention of AI researchers in the early 1980s, when they were trying to adapt probability theory to expert systems. Dempster-Shafer degrees of belief resemble the certainty factors in MYCIN, and this resemblance suggested that they might combine the rigor of probability theory with the flexibility of rule-based systems. Subsequent work has made clear that the management of uncertainty inherently requires more structure than is available in simple rule-based systems, but the Dempster-Shafer theory remains attractive because of its relative flexibility.

The Dempster-Shafer theory is based on two ideas: the idea of obtaining degrees of belief for one question from subjective probabilities for a related question, and Dempster's rule for combining such degrees of belief when they are based on independent items of evidence.

To illustrate the idea of obtaining degrees of belief for one question from subjective probabilities for another, suppose I have subjective probabilities for the reliability of my friend Jon. My probability that he is reliable is 0.9, and my probability that he is unreliable is 0.1. Suppose he tells me a limb fell on my car. This statement, which must be true if he is reliable, is not necessarily false if he is unreliable. So his testimony alone justifies a 0.9 degree of belief that a limb fell on my car, but only a zero degree of belief (not a 0.1 degree of belief) that no limb fell on my car. This zero does not mean that I am sure that no limb fell on my car, as a zero probability would; it merely means that Jon's testimony gives me no reason to believe that no limb fell on my car. The 0.9 and the zero together constitute a belief function.

Knowledge Acquisition

Knowledge acquisition is concerned with the development of knowledge bases based on the expertise of a human expert. This requires expressing knowledge in a formalism suitable for automatic interpretation. Within this field, research at UNSW focuses on incremental knowledge acquisition techniques, which allow a human expert to provide explanations of their decisions that are automatically integrated into sophisticated knowledge bases.

Types of Learning

Learning is acquiring new knowledge, behaviors, skills, values, preferences or understanding, and may involve synthesizing different types of information. The ability to learn is possessed by humans, animals and some machines. Progress over time tends to follow learning curves.

Human learning may occur as part of education or personal development. It may be goal-oriented and may be aided by motivation. The study of how learning occurs is part of neuropsychology, educational psychology, learning theory, and pedagogy.

Learning may occur as a result of habituation or classical conditioning, seen in many animal species, or as a result of more complex activities such as play, seen only in relatively intelligent animals and humans. Learning may occur consciously or without conscious awareness. There is evidence for human behavioral learning prenatally, in which habituation has been observed as early as 32 weeks into gestation, indicating that the central nervous system is sufficiently developed and primed for learning and memory to occur very early in development.

Play has been approached by several theorists as the first form of learning. Children play, experiment with the world, learn the rules, and learn to interact. Vygotsky agrees that play is pivotal for children's development, since they make meaning of their environment through play.


Habituation

In psychology, habituation is an example of non-associative learning in which there is a progressive diminution of behavioral response probability with repetition of a stimulus. It is another form of integration. An animal first responds to a stimulus, but if it is neither rewarding nor harmful the animal reduces subsequent responses. One example of this can be seen in small songbirds: if a stuffed owl (or similar predator) is put into the cage, the birds initially react to it as though it were a real predator. Soon the birds react less, showing habituation. If another stuffed owl is introduced (or the same one removed and re-introduced), the birds react to it again as though it were a predator, demonstrating that only a very specific stimulus is habituated to (namely, one particular unmoving owl in one place). Habituation has been shown in essentially every species of animal, including the large protozoan Stentor coeruleus.

Sensitization

Sensitization is an example of non-associative learning in which the progressive amplification of a response follows repeated administrations of a stimulus (Bell et al., 1995). An everyday example of this mechanism is the repeated tonic stimulation of peripheral nerves that occurs if a person rubs his arm continuously. After a while, this stimulation will create a warm sensation that will eventually turn painful. The pain is the result of the progressively amplified synaptic response of the peripheral nerves, warning the person that the stimulation is harmful. Sensitization is thought to underlie both adaptive and maladaptive learning processes in the organism.

Associative learning

Associative learning is the process by which an element is learned through association with a separate, pre-occurring element.

Operant conditioning

Operant conditioning is the use of consequences to modify the occurrence and form of behavior. It is distinguished from Pavlovian conditioning in that it deals with the modification of voluntary behavior. Discrimination learning is a major form of operant conditioning; one form of it is called errorless learning.

Classical conditioning


The typical paradigm for classical conditioning involves repeatedly pairing an unconditioned stimulus (which unfailingly evokes a particular response) with another previously neutral stimulus (which does not normally evoke the response). Following conditioning, the response occurs both to the unconditioned stimulus and to the other, unrelated stimulus (now referred to as the "conditioned stimulus"). The response to the conditioned stimulus is termed a conditioned response.

Imprinting

Imprinting is the term used in psychology and ethology to describe any kind of phase-sensitive learning (learning occurring at a particular age or a particular life stage) that is rapid and apparently independent of the consequences of behavior. It was first used to describe situations in which an animal or person learns the characteristics of some stimulus, which is therefore said to be "imprinted" onto the subject.

Observational learning

The learning process most characteristic of humans is imitation: one's personal repetition of an observed behaviour, such as a dance. Humans can copy three types of information simultaneously: the demonstrator's goals, actions, and environmental outcomes (results; see Emulation (observational learning)). Through copying these types of information, (most) infants will tune into their surrounding culture.

Multimedia learning

Multimedia learning is learning in which the learner uses multimedia learning environments. This type of learning relies on dual-coding theory.

e-Learning and Augmented Learning

Electronic learning or e-learning is a general term used to refer to Internet-based, networked, computer-enhanced learning. A specific and increasingly widespread form of e-learning is mobile learning (m-learning), which uses mobile telecommunication equipment such as cellular phones.

When a learner interacts with the e-learning environment, this is called augmented learning. By adapting to the needs of individuals, context-driven instruction can be dynamically tailored to the learner's natural environment. Augmented digital content may include text, images, video, and audio (music and voice). By personalizing instruction, augmented learning has been shown to improve learning performance over a lifetime.

Rote learning

Rote learning is a technique which avoids understanding the inner complexities and inferences of the subject being learned and instead focuses on memorizing the material so that it can be recalled by the learner exactly the way it was read or heard. The major practice involved in rote learning is learning by repetition, based on the idea that one will be able to recall the material more quickly the more it is repeated. Rote learning is used in diverse areas, from mathematics to music to religion. Although it has been criticized by some schools of thought, rote learning is a necessity in many situations.

Informal learning

Informal learning occurs through the experience of day-to-day situations (for example, one learns to look ahead while walking because of the danger inherent in not paying attention to where one is going). It is learning from life: during a meal at the table with parents, in play, in exploring.

Formal learning

Formal learning is learning that takes place within a teacher-student relationship, such as in a school system.

Learning Automata

An automaton is a machine or control mechanism designed to automatically follow a predetermined sequence of operations or respond to encoded instructions. The term stochastic emphasizes the adaptive nature of the automaton described here: it does not follow predetermined rules, but adapts to changes in its environment. This adaptation is the result of the learning process described in this chapter.

"The concept of learning automaton grew out of a fusion of the work of psychologists in modeling observed behavior, the efforts of statisticians to model the choice of experiments based on past observations, the attempts of operation researchers to implement optimal strategies in the context of the two-armed bandit problem, and the endeavors of system theorists to make rational decisions in random environments"

In classical control theory, the control of a process is based on complete knowledge of the process/system. The mathematical model is assumed to be known, and the inputs to the process are deterministic functions of time. Later developments in control theory considered the uncertainties present in the system. Stochastic control theory assumes that some of the characteristics of the uncertainties are known. However, all those assumptions on uncertainties and/or input functions may be insufficient to successfully control the system if it changes. It is then necessary to observe the process in operation and obtain further knowledge of the system; i.e., additional information must be acquired on-line, since a priori assumptions are not sufficient. One approach is to view these as problems in learning.

Rule-based systems, although performing well on many control problems, have the disadvantage of requiring modifications even for a minor change in the problem space. Furthermore, the rule-based approach, especially in expert systems, cannot handle unanticipated situations. The idea behind designing a learning system is to guarantee robust behavior without complete knowledge, if any, of the system/environment to be controlled. A crucial advantage of reinforcement learning compared to other learning approaches is that it requires no information about the environment except for the reinforcement signal.

A reinforcement learning system is slower than other approaches for most applications, since every action needs to be tested a number of times for satisfactory performance. Either the learning process must be much faster than the changes in the environment, or the reinforcement learning must be combined with an adaptive forward model that anticipates the changes in the environment.

Learning is defined as any permanent change in behavior as a result of past experience, and a learning system should therefore have the ability to improve its behavior with time, toward a final goal. In a purely mathematical context, the goal of a learning system is the optimization of a functional that is not known explicitly.

In the 1960s, Y. Z. Tsypkin [Tsypkin71] introduced a method to reduce the problem to the determination of an optimal set of parameters and then apply stochastic hill-climbing techniques. M. L. Tsetlin and colleagues [Tsetlin73] started the work on learning automata during the same period. An alternative to applying stochastic hill-climbing techniques, introduced by Narendra and Viswanathan, is to regard the problem as one of finding an optimal action out of a set of allowable actions and to achieve this using stochastic automata. The difference between the two approaches is that the former updates the parameter space at each iteration while the latter updates the probability space.

The stochastic automaton attempts a solution of the problem without any information on the optimal action (initially, equal probabilities are attached to all the actions). One action is selected at random, the response from the environment is observed, action probabilities are updated based on that response, and the procedure is repeated. A stochastic automaton that acts as described in order to improve its performance is called a learning automaton.

Genetic Algorithms

Genetic algorithms are one of the best ways to solve a problem about which little is known. They are very general algorithms, so they work reasonably well in any search space. All you need to know is what the solution should be able to do well, and a genetic algorithm will be able to create a high-quality solution. Genetic algorithms use the principles of selection and evolution to produce several candidate solutions to a given problem.

Genetic algorithms tend to thrive in an environment in which there is a very large set of candidate solutions and in which the search space is uneven and has many hills and valleys. True, genetic algorithms will do well in any environment, but they will be greatly outclassed by more situation-specific algorithms in the simpler search spaces. Therefore you must keep in mind that genetic algorithms are not always the best choice. Sometimes they can take quite a while to run and are therefore not always feasible for real-time use. They are, however, one of the most powerful methods with which to (relatively) quickly create high-quality solutions to a problem. Now, before we start, I'm going to provide you with some key terms so that this article makes sense.


Individual - Any possible solution

Population - Group of all individuals

Search Space - All possible solutions to the problem

Chromosome - Blueprint for an individual

Trait - Possible aspect of an individual

Allele - Possible settings for a trait

Locus - The position of a gene on the chromosome

Genome - Collection of all chromosomes for an individual

Basics of Genetic Algorithms

The most common type of genetic algorithm works like this: a population is created with a group of individuals created randomly. The individuals in the population are then evaluated. The evaluation function is provided by the programmer and gives the individuals a score based on how well they perform at the given task. Two individuals are then selected based on their fitness; the higher the fitness, the higher the chance of being selected. These individuals then "reproduce" to create one or more offspring, after which the offspring are mutated randomly. This continues until a suitable solution has been found or a certain number of generations have passed, depending on the needs of the programmer.

Selection

While there are many different types of selection, I will cover the most common type: roulette wheel selection. In roulette wheel selection, individuals are given a probability of being selected that is directly proportional to their fitness. Two individuals are then chosen randomly based on these probabilities and produce offspring. Pseudo-code for a roulette wheel selection algorithm is shown below.

for all members of population
    sum += fitness of this individual
end for

for all members of population
    probability = sum of probabilities + (fitness / sum)
    sum of probabilities = probability
end for

loop until new population is full
    do this twice
        number = Random between 0 and 1
        for all members of population
            if number > cumulative probability of previous member
               and number <= cumulative probability of this member
            then this member has been selected
        end for
    end
    create offspring
end loop
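The same selection scheme in runnable form, as a sketch with invented fitness values:

import random

def roulette_select(population, fitnesses):
    """Pick one individual with probability proportional to its fitness."""
    total = sum(fitnesses)
    r = random.uniform(0, total)
    cumulative = 0.0
    for individual, fitness in zip(population, fitnesses):
        cumulative += fitness
        if r <= cumulative:
            return individual
    return population[-1]   # guard against floating-point round-off

parents = [roulette_select(["a", "b", "c"], [1.0, 3.0, 6.0]) for _ in range(2)]
print(parents)  # "c" is picked about 60% of the time on average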

Crossover

So now you have selected your individuals, and you know that you are supposed to somehow produce offspring with them, but how should you go about doing it? The most common solution is something called crossover, and while there are many different kinds of crossover, the most common type is single-point crossover. In single-point crossover, you choose a locus at which you swap the remaining alleles from one parent to the other. This is best understood with an example.

The children take one section of the chromosome from each parent. The point at which the chromosome is broken depends on the randomly selected crossover point. This particular method is called single-point crossover because only one crossover point exists. Sometimes only child 1 or child 2 is created, but often both offspring are created and put into the new population. Crossover does not always occur, however. Sometimes, based on a set probability, no crossover occurs and the parents are copied directly to the new population. The probability of crossover occurring is usually 60% to 70%.
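In place of the original figure, here is a small sketch of the operation on two invented bit-string parents:

import random

def single_point_crossover(parent1, parent2):
    """Swap the tails of two equal-length chromosomes at a random locus."""
    point = random.randint(1, len(parent1) - 1)
    child1 = parent1[:point] + parent2[point:]
    child2 = parent2[:point] + parent1[point:]
    return child1, child2

# For example, with crossover point 3: '110|010' x '001|110'
# yields children '110110' and '001010'.
print(single_point_crossover("110010", "001110"))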

Mutation

After selection and crossover, you now have a new population full of individuals. Some are directly copied, and others are produced by crossover. In order to ensure that the individuals are not all exactly the same, you allow for a small chance of mutation. You loop through all the alleles of all the individuals, and if an allele is selected for mutation, you can either change it by a small amount or replace it with a new value. The probability of mutation is usually between one and two tenths of a percent.

Mutation is fairly simple: you just change the selected alleles based on what you feel is necessary and move on. Mutation is, however, vital to ensuring genetic diversity within the population.
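A matching sketch of mutation on bit-string chromosomes (the rate used in the demonstration call is exaggerated only so that flips are visible):

import random

def mutate(chromosome, rate=0.002):
    """Flip each bit independently with a small probability."""
    return "".join(
        ("1" if bit == "0" else "0") if random.random() < rate else bit
        for bit in chromosome
    )

print(mutate("110010", rate=0.5))  # high rate here only to make flips visible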

Applications


Genetic algorithms are a very effective way of quickly finding a reasonable solution to a complex problem. Granted, they aren't instantaneous, or even close, but they do an excellent job of searching through a large and complex search space. Genetic algorithms are most effective in a search space about which little is known. You may know exactly what you want a solution to do but have no idea how you want it to go about doing it. This is where genetic algorithms thrive. They produce solutions that solve the problem in ways you may never have even considered. Then again, they can also produce solutions that only work within the test environment and flounder once you try to use them in the real world. Put simply: use genetic algorithms for everything you cannot easily do with another algorithm.