
Tree-building Survey Development Group

Jon Kiparsky

Andrew Schonfeld

Glenn Thelen

http://code.google.com/p/tree-buildingsurvey/

The Application

• TBS – “Tree-building survey”

• Not a landscaping tool

• Applet for eliciting naïve intuitions about phylogenetic relationships of diverse organisms

• Data to be used for pedagogical purposes as well as research on core knowledge of evolutionary biology

The method

• Present students with a set of 20 organisms

• Students are instructed to arrange the organisms to reveal the relationships among them, “as if you were working for the Harvard Museum of Natural History”.

• Relationships indicated by “links” – represented by lines on the screen

Internal Representation

• Tree data represented internally as a list of “nodes” and “connections”.

• A sample tree:

O:0:Bat:766:309:true:():(29,):#O:6:HorseshoeCrab:460:325:true:():(27,):#O:7:Human:888:302:true:():(29,):#O:8:Jellyfish:424:257:true:():(26,):#O:9:Leech:343:240:true:():(26,):#O:10:Lizard:631:340:true:():(28,):#O:15:Spider:536:321:true:():(27,):#O:16:Squirrel:838:313:true:():(29,):#O:18:Turtle:685:346:true:():(28,):#E:24::482:134:true:(26,27,):(31,):#E:26::415:201:true:(9,8,):(24,):#E:27::509:213:true:(6,15,):(24,):#E:28::633:221:true:(10,18,):(30,):#E:29::774:222:true:(0,16,7,):(30,):#E:30::709:149:true:(28,29,):(31,):#E:31::608:76:true:(24,30,):():#C:32:31:24#C:33:24:26#C:34:24:27#C:35:26:9#C:36:26:8#C:37:31:30#C:38:30:28#C:39:30:29#C:40:27:6#C:41:27:15#C:42:28:10#C:43:28:18#C:44:29:0#C:45:29:16#C:46:29:7#

• Fields of each node record: type – number – name – x, y position – in the model? – connected to – connected from
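A minimal Java sketch of how a string in this format might be parsed. The class and field names (TbsTreeParser, Node, Connection) are hypothetical, and the field meanings are inferred from the sample and the field list above: the third field is assumed to be the organism name, and connection records are assumed to be laid out as C:number:from:to. This is an illustration, not the applet's actual loader.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical parser for the '#'-separated tree string shown in the sample.
class TbsTreeParser {

    static class Node {
        String type;                 // "O" or "E" in the sample data
        int number;                  // node number
        String name;                 // assumed: organism name, empty for "E" nodes
        int x, y;                    // screen position
        boolean inModel;             // the "in the model?" flag
        List<Integer> connectedTo = new ArrayList<>();
        List<Integer> connectedFrom = new ArrayList<>();
    }

    static class Connection {
        int number, from, to;        // assumed layout: C:number:from:to
    }

    final List<Node> nodes = new ArrayList<>();
    final List<Connection> connections = new ArrayList<>();

    // Turns "(26,27,)" into [26, 27] and "()" into an empty list.
    static List<Integer> parseNumberList(String s) {
        List<Integer> numbers = new ArrayList<>();
        String inner = s.substring(1, s.length() - 1);
        for (String part : inner.split(",")) {
            if (!part.isEmpty()) {
                numbers.add(Integer.parseInt(part));
            }
        }
        return numbers;
    }

    static TbsTreeParser parse(String data) {
        TbsTreeParser tree = new TbsTreeParser();
        for (String rec : data.split("#")) {
            if (rec.isEmpty()) {
                continue;
            }
            String[] f = rec.split(":");
            if (f[0].equals("C")) {
                Connection c = new Connection();
                c.number = Integer.parseInt(f[1]);
                c.from = Integer.parseInt(f[2]);
                c.to = Integer.parseInt(f[3]);
                tree.connections.add(c);
            } else {
                Node n = new Node();
                n.type = f[0];
                n.number = Integer.parseInt(f[1]);
                n.name = f[2];
                n.x = Integer.parseInt(f[3]);
                n.y = Integer.parseInt(f[4]);
                n.inModel = Boolean.parseBoolean(f[5]);
                n.connectedTo = parseNumberList(f[6]);
                n.connectedFrom = parseNumberList(f[7]);
                tree.nodes.add(n);
            }
        }
        return tree;
    }

    public static void main(String[] args) {
        TbsTreeParser tree = parse(
            "O:0:Bat:766:309:true:():(29,):#E:29::774:222:true:(0,16,7,):(30,):#C:44:29:0#");
        System.out.println(tree.nodes.size() + " nodes, "
            + tree.connections.size() + " connections");
    }
}
```

The sketch relies on String.split dropping trailing empty strings, which is why the trailing ":" on node records and the trailing "#" on the whole string need no special handling.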

Saving applet data

• Applet can get the data

• But applets run in a sandbox

• How does the data get to the server?

Perl –> JavaScript –> database

• The original design, inherited at the start of the project, used Perl to build a login page.

• Perl called JavaScript to start and stop the applet

• Perl read initial student data from Prof. White’s student database, and saved the students’ trees back to that database

Not scalable

• Each student needs entry in Prof. White’s database

• This would be a problem

• Set up a new database for all students?

• No.

Next solution: simplify

• Databases are a heavyweight solution to a lightweight problem

• We didn’t need any features of a database beyond access to records

• Perl is very good at flat files

TBS directory structure

TBS root directory
    TBSRun.jar
    prof_index
    Professor Adams (student files)
    Professor Baker (student files)
    Professor Chavez (student files)

Why Java? Why Perl?

• “Obsolete” technologies

• Client is maintainer – client knows these technologies

• Technical limitations of the platform

• Novelty is no virtue

• “Appropriate technology”: simplest technology that will do the job is the right technology for the job

Original Technology Stack (Perl CGI & Java Applet)

Advantages

1) We know it works

2) Original code owner won’t have to learn new technology

Disadvantages

1) Archaic Technology

2) Too many technologies going on (for a new developer this can be a lot to learn)

3) Lack of portability (files cannot be bundled together)

4) No Logging

Servlet/Applet Technology Stack w/ Front-End CGI Communication

Advantages

1) Makes use of more recent technologies

2) With the exception of database queries (SQL) and HTML, everything is in one standard coding language (Java)

3) Portability, everything can be nicely packaged within one WAR file for easy deployment on most servers

Disadvantages

1) Original code owner will have to learn new technology

2) Still using Front-End CGI communication so in order to send data to the database “Submit” buttons outside the <applet> tag are required

3) Very little logging

Servlet/Applet Technology Stack w/ Back-End CGI Communication

Advantages

1) Very little dependence on HTML code

2) Applet can talk to database, no more HTML page refreshes

3) Logging is now possible, although tracking user applet moves is still cumbersome

Disadvantages

1) Original code owner will have to learn even more new technology

Introduction to Scoring

We created a Java applet for drawing evolutionary trees and saving them. Depending on section number, the links that students were given either had arrows or had no arrows. Before receiving the student data, we developed some ideas for scoring based on the assumption that most student trees would be proper evolutionary trees.

However, only 7 of the 241 student trees were proper evolutionary trees. Many of them were not technically even trees. As a result, in order to use the data, our scoring algorithms needed to make few assumptions about the structure of the students' trees.

In this presentation “student trees” will refer to what the students have actually drawn, even if what they drew was not technically a tree.

In order to score the student trees, they are divided into those containing enough links to load into a graph, and those where groupings have to be inferred from the x and y coordinates of the nodes.

Tests For Trees With Sufficient Links

Trees containing nodes and links are loaded into a generic graph structure.

Basic tests:

• Does the tree have branches? (should be “yes”)
• Are all organism nodes included? (should be “yes”)
• Are all organism nodes terminal? (should be “yes”)
• Are organism nodes directly connected to other organism nodes? (should be “no”)

Tests using “off the shelf” graph algorithms:

• Does the graph contain loops? (should be “no”)
• Are all regions of the graph connected? (should be “yes”)

The most important algorithm for scoring graphs is Floyd-Warshall.
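As an illustration of the two “off the shelf” checks listed above (loops and connectivity), here is a minimal Java sketch that uses union-find over an undirected list of links. The class name GraphChecks and the edge-list input are assumptions made for the example; the slides do not say which algorithms or data structures the project actually used.

```java
// Hypothetical helper: loop and connectivity checks via union-find.
// Links are treated as undirected; nodes are numbered 0..nodeCount-1.
class GraphChecks {
    private final int[] parent;
    private int components;
    private boolean loopFound;

    GraphChecks(int nodeCount, int[][] edges) {
        parent = new int[nodeCount];
        for (int i = 0; i < nodeCount; i++) {
            parent[i] = i;               // every node starts in its own region
        }
        components = nodeCount;
        for (int[] e : edges) {
            int a = find(e[0]);
            int b = find(e[1]);
            if (a == b) {
                // Both ends are already joined, so this link closes a loop.
                // Note: a duplicated link between the same pair also counts.
                loopFound = true;
            } else {
                parent[a] = b;           // merge the two regions
                components--;
            }
        }
    }

    private int find(int v) {
        while (parent[v] != v) {
            parent[v] = parent[parent[v]];   // path halving
            v = parent[v];
        }
        return v;
    }

    boolean hasLoop() {
        return loopFound;
    }

    boolean isConnected() {
        return components == 1;
    }
}
```

A proper tree under these two checks would report hasLoop() as false and isConnected() as true.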

Floyd–Warshall Algorithm

• Produces an N by N matrix of shortest paths between all pairs of nodes
• Create a 2D matrix path[][] of dimension N by N and initialize all entries to “unconnected”, which can be finite or infinite. In our implementation it is 9999.
• Set each path[n][n] = 0
• For every node with index n1 that has an adjacent node with index n2, set path[n1][n2] = 1
• Finally, run the following algorithm:

for (int k = 0; k < numVertices; k++) {
    for (int i = 0; i < numVertices; i++) {
        for (int j = 0; j < numVertices; j++) {
            // Keep the shorter of the known path and the path that goes through k
            path[i][j] = Math.min(path[i][j], path[i][k] + path[k][j]);
        }
    }
}

Note that the running time is N cubed. Since no student graph has more than 40 nodes (at most 64,000 loop iterations per graph), this is not prohibitive. Even with 10,000 student trees, the number of iterations required is less than one billion, and there are only slightly over two operations per iteration. This should be no more than a few seconds' work for a modern CPU.

Tests Using Shortest Path Data

• Is the maximum shortest path between two group members (e.g. mammal to mammal) less than the minimum shortest path between a group member and a non-member (e.g. mammal to non-mammal)?
• Are any pairs of organisms unconnected?

While these tests give definite true/false answers, they do not distinguish between a student tree with one mistake and a tree with many mistakes. However, the shortest path table can be used to score the relative quality of groupings in a way that allows for mistakes and is independent of graph structure.
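The two true/false tests above might be sketched in Java as follows, given a shortest path matrix already computed by Floyd-Warshall. The class and method names, and the boolean[] that marks group membership, are assumptions made for the example; only the value 9999 for “unconnected” comes from the slides.

```java
// Hypothetical true/false tests over a Floyd-Warshall shortest path matrix.
class ShortestPathTests {
    static final int UNCONNECTED = 9999;   // the "unconnected" value used in the slides

    // True if every member-to-member path is shorter than
    // every member-to-non-member path.
    static boolean groupIsSeparated(int[][] path, boolean[] inGroup) {
        int maxWithin = Integer.MIN_VALUE;
        int minAcross = Integer.MAX_VALUE;
        for (int i = 0; i < path.length; i++) {
            if (!inGroup[i]) {
                continue;                  // only start from group members
            }
            for (int j = 0; j < path.length; j++) {
                if (i == j) {
                    continue;
                }
                if (inGroup[j]) {
                    maxWithin = Math.max(maxWithin, path[i][j]);
                } else {
                    minAcross = Math.min(minAcross, path[i][j]);
                }
            }
        }
        return maxWithin < minAcross;
    }

    // True if at least one pair of nodes has no path between them.
    static boolean anyPairUnconnected(int[][] path) {
        for (int[] row : path) {
            for (int d : row) {
                if (d >= UNCONNECTED) {
                    return true;
                }
            }
        }
        return false;
    }
}
```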

Scoring Groupings Based on Ratios of Average Shortest Paths

• Given a matrix path[][] computed using Floyd-Warshall (see Shortest Path Table).
• Since direction in student trees with arrows was essentially arbitrary, directionality is ignored, so path[i][j] = path[j][i].
• Step 1: Find the average path length between members of a group. For all members of a group (e.g. mammals), take the sum of all path[i][j] such that i and j are both mammals and i != j, then divide the result by the number of paths used in the summation.
• Step 2: Find the average path length between members of a group and non-members. For all members of a group (e.g. mammals), take the sum of all path[i][j] such that i is a mammal and j is a non-mammal, then divide the result by the number of paths used in the summation.
• Compute: score = (average path length between group members and non-members) / (average path length between group members)
• Since the path lengths between group members should be less than the path lengths between group members and non-group members, larger scores are better.
• Randomly grouped graphs would score around 1.0
• Lower than 1.0 is worse than random
• Greater than 1.0 is better than random
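The ratio described above can be sketched in Java as follows. The class name GroupingScore and the boolean[] membership array are assumptions made for the example. Each within-group pair is visited in both directions, which leaves the average unchanged because the matrix is symmetric.

```java
// Hypothetical sketch of the grouping score:
// (average member-to-non-member path) / (average member-to-member path).
class GroupingScore {

    static double score(int[][] path, boolean[] inGroup) {
        long withinSum = 0;
        long acrossSum = 0;
        int withinCount = 0;
        int acrossCount = 0;
        for (int i = 0; i < path.length; i++) {
            if (!inGroup[i]) {
                continue;                      // i must be a group member
            }
            for (int j = 0; j < path.length; j++) {
                if (i == j) {
                    continue;
                }
                if (inGroup[j]) {              // Step 1: member to member
                    withinSum += path[i][j];
                    withinCount++;
                } else {                       // Step 2: member to non-member
                    acrossSum += path[i][j];
                    acrossCount++;
                }
            }
        }
        // Assumes the group has at least two members and at least one
        // non-member; otherwise a count is zero and the result is not meaningful.
        double averageWithin = (double) withinSum / withinCount;
        double averageAcross = (double) acrossSum / acrossCount;
        return averageAcross / averageWithin;
    }
}
```

For example, if group members are on average 2 links apart and 4 links from non-members, the score is 4 / 2 = 2.0.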

Scoring Groupings Based on Ratios of Average Shortest Paths (continued)

Based on current available student data the grouping scores are classified as follows

• scores <= 1.0 are bad (red)
• 1.0 < scores < 1.25 are poor to OK (yellow)
• scores >= 1.25 are good (green)
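A small Java sketch of these thresholds. The class and method names are hypothetical, and the colors are returned as plain strings only for illustration.

```java
// Hypothetical mapping from grouping score to the color bands listed above.
class GroupingScoreBands {

    static String classify(double score) {
        if (score <= 1.0) {
            return "red";      // bad: no better than random grouping
        }
        if (score < 1.25) {
            return "yellow";   // poor to OK
        }
        return "green";        // good
    }
}
```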

While this score is a relatively good indicator of grouping, the score is a bit arbitrary.

Also, an improperly constructed tree can score higher than a properly constructed tree. For example, a proper tree should not have organisms directly connected to each other, so a properly constructed tree will have a minimum distance of 2 between any pair of organisms. However, a tree with mammals connected directly to each other would have an average distance between mammals closer to 1, and therefore would score higher. Indeed, the trees that score highest on this measure are improper trees, but their groupings are very good.

Tests For Trees Containing Few or No Links

In the absence of links, each node still has an x, y coordinate. This can be used to try to get data from the tree. There are two ways we have researched to do this.

1: Try to guess where the student intended to put the links. One possible way to do this is:

• Connect nodes to their closest neighbor.
• Continue until there are no unconnected regions of the graph.
• After guessing the links, use the graph algorithms.

2: Use the location data directly without guessing links.

• Convex hulls can be used for analyzing grouping and do not require guessing where the links should be. As a result, we have decided to implement this approach first.

2-Dimensional Convex Hulls

Given a set of (x, y) points belonging to a group, such as mammals, a convex hull is the smallest convex polygon such that all points of the group are contained inside of, or on, the polygon.

Given two convex hulls for two different groups, such as vertebrates and invertebrates, if the two groups are perfectly separated on the plane, there should be no overlap between the convex hull for vertebrates and the convex hull for invertebrates. If there is a region of the plane where the two convex hulls overlap, the groups are not properly grouped.

The drawback of this method is that a single misplaced organism will cause this test to fail.

We are working on an algorithm to find the largest subset of two groups of points such that the convex hulls do not overlap. This will give a range from completely wrong to only one organism misplaced.
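The basic hull-overlap test described above can be sketched in Java as follows. This is an illustration under stated assumptions, not the project's implementation: it builds each hull with Andrew's monotone chain algorithm and checks overlap with a separating-axis test, both chosen for the example. It assumes each group has at least three points that are not all on one line, and it counts hulls that merely touch as overlapping. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: convex hulls of two groups and an overlap test.
// Points are {x, y} pairs.
class HullOverlapSketch {

    // Cross product of (a - o) and (b - o); positive for a counter-clockwise turn.
    static long cross(long[] o, long[] a, long[] b) {
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0]);
    }

    // Andrew's monotone chain; returns hull vertices in counter-clockwise order.
    static List<long[]> convexHull(long[][] points) {
        long[][] p = points.clone();
        Arrays.sort(p, (a, b) -> a[0] != b[0]
                ? Long.compare(a[0], b[0]) : Long.compare(a[1], b[1]));
        List<long[]> hull = new ArrayList<>();
        for (long[] pt : p) {                                  // lower hull
            while (hull.size() >= 2
                    && cross(hull.get(hull.size() - 2), hull.get(hull.size() - 1), pt) <= 0) {
                hull.remove(hull.size() - 1);
            }
            hull.add(pt);
        }
        int lowerSize = hull.size() + 1;
        for (int i = p.length - 2; i >= 0; i--) {              // upper hull
            while (hull.size() >= lowerSize
                    && cross(hull.get(hull.size() - 2), hull.get(hull.size() - 1), p[i]) <= 0) {
                hull.remove(hull.size() - 1);
            }
            hull.add(p[i]);
        }
        hull.remove(hull.size() - 1);                          // last point repeats the first
        return hull;
    }

    // True if some edge normal of hull a separates a from b.
    static boolean hasSeparatingAxis(List<long[]> a, List<long[]> b) {
        for (int i = 0; i < a.size(); i++) {
            long[] p = a.get(i);
            long[] q = a.get((i + 1) % a.size());
            long nx = q[1] - p[1];                             // normal to edge p -> q
            long ny = p[0] - q[0];
            long minA = Long.MAX_VALUE, maxA = Long.MIN_VALUE;
            long minB = Long.MAX_VALUE, maxB = Long.MIN_VALUE;
            for (long[] v : a) {
                long d = nx * v[0] + ny * v[1];
                minA = Math.min(minA, d);
                maxA = Math.max(maxA, d);
            }
            for (long[] v : b) {
                long d = nx * v[0] + ny * v[1];
                minB = Math.min(minB, d);
                maxB = Math.max(maxB, d);
            }
            if (maxA < minB || maxB < minA) {
                return true;                                   // a gap exists on this axis
            }
        }
        return false;
    }

    static boolean hullsOverlap(long[][] group1, long[][] group2) {
        List<long[]> h1 = convexHull(group1);
        List<long[]> h2 = convexHull(group2);
        return !hasSeparatingAxis(h1, h2) && !hasSeparatingAxis(h2, h1);
    }
}
```

With this sketch, moving a single point of one group inside the other group's hull is enough to make hullsOverlap return true, which is the drawback noted above.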

Human Scoring

Currently, our application allows users to categorize tree structure into a few rough groups. The data entered is appended as plain text to a file in the format: [student name]:[category][new line]

The groups are:

• Perfect – Passes all tests for tree structure, though groupings may be incorrect.
• Almost Perfect – Has a minor error, such as a jellyfish at the root.
• Tree – A graph with no loops or only a small loop, and all nodes connected except where a student appears to have forgotten to make a connection. Should have enough branches to show that the student understands the concept.
• Web – A graph that is mostly connected but has lots of small loops or one or more large loops, or has few or no branches. Also for graphs that simply have a “bizarre” structure.
• Islands – Where organisms are divided up into multiple unconnected trees or webs. Many graphs classified as “islands” actually have good groupings.
• Insufficient Links – Where there are not enough links to score groupings based on how nodes are linked together. Sometimes it is obvious that the student intended to link organisms together but they are not linked in the underlying data structure.
• Garbage – Where the student did not use most of the organisms. In the future, students will not get credit unless they use all the organisms. However, the vast majority of students did put a good effort into their trees, even though they were guaranteed 15 points for simply doing it.
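A minimal Java sketch of appending one human score in the [student name]:[category][new line] format. The class name, the enum, and the exact spelling of each category as written to the file are assumptions; the slides only give the line format and the category names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical writer for the human scoring file.
class HumanScoreLog {

    enum Category {
        PERFECT("Perfect"),
        ALMOST_PERFECT("Almost Perfect"),
        TREE("Tree"),
        WEB("Web"),
        ISLANDS("Islands"),
        INSUFFICIENT_LINKS("Insufficient Links"),
        GARBAGE("Garbage");

        final String label;   // assumed spelling used in the file

        Category(String label) {
            this.label = label;
        }
    }

    // Builds one line: [student name]:[category][new line]
    static String formatLine(String studentName, Category category) {
        return studentName + ":" + category.label + "\n";
    }

    // Appends the line to the file, creating the file if it does not exist.
    static void append(Path file, String studentName, Category category) {
        try {
            Files.write(file,
                    formatLine(studentName, category).getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```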

Neural Networks

Neural networks are good at finding patterns relating input data to output data without having to explicitly specify the relationship. They are especially good for human-scored data, which tends to be fuzzy.

Examples of output data include:

• A scale of how good the overall structure of a student tree is, from 1 to 10
• Categorizing tree structure into discrete categories, such as in the previous slide

As long as the output is relatively consistent, the data can be used to train a neural network.

Neural Networks (continued)

Given output data, there must be a consistent way to represent the input data for training the neural network.

Currently the standard way of representing any graph is the 20 x 20 matrix of shortest paths connecting each pair of organisms, produced by the Floyd-Warshall algorithm.

• Unconnected nodes are currently set to (longest path + 1). This could be changed later; the important thing is that the values are consistent.
• Unused organisms are currently treated as unconnected nodes.
• The indices for each organism are kept consistent: Bat = 0, Beetle = 1, Bird = 2 … Whale = 19
• Finally, the matrix should be normalized so that path length ranges from 0 to 1.
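The preparation steps above might look like this in Java. The class and method names are hypothetical, and dividing by the largest value after substitution is one assumed way of getting the 0 to 1 range; the slides do not specify the normalization formula.

```java
// Hypothetical sketch: turn a shortest path matrix into neural network input.
class PathMatrixNormalizer {

    // 'unconnected' is the marker value used for missing paths (9999 in the slides).
    static double[][] normalize(int[][] path, int unconnected) {
        int n = path.length;
        int longest = 0;
        boolean anyUnconnected = false;
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                if (path[i][j] >= unconnected) {
                    anyUnconnected = true;
                } else {
                    longest = Math.max(longest, path[i][j]);
                }
            }
        }
        int substitute = longest + 1;                       // value for unconnected pairs
        int scale = anyUnconnected ? substitute : longest;  // largest value after substitution
        double[][] out = new double[n][n];
        if (scale == 0) {
            return out;                                     // nothing but zeros to normalize
        }
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                int value = path[i][j] >= unconnected ? substitute : path[i][j];
                out[i][j] = (double) value / scale;         // now in the range 0 to 1
            }
        }
        return out;
    }
}
```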

Neural Networks (conclusion)

Using the normalized 20 x 20 matrix as input and discrete categories as output allows “off the shelf” neural networks for Optical Character Recognition (OCR) to be used, since the 20 x 20 matrix is equivalent to a grayscale image of a human-drawn character as input, and the categories are similar to letters.

If the neural network consistently performs better than random guessing, it can be assumed to be working.

This will lay a framework for using different types of neural networks for tasks such as rating trees from 1 to 10, which is a different problem than sorting graphs into distinct groups.

The drawback to using neural networks is that they learn by being given an input, “guessing” the output, and updating their internal wiring based on the correct output value. This requires having enough data available for it to master a particular task.

Since we only have 241 student trees, it is unlikely that we have enough data to train a neural network. However, if TBS becomes popular enough that there are 10,000 or more student trees, human scorers could score a few thousand and see if a neural network is accurate enough to reliably score the remainder. Since human scoring is a valuable resource, this could be a significant help. Finally, a neural network might be able to find patterns in large sets of data that go undetected by human scorers.
