
Page 1:

From: Mark Silverman Saturday, April 27, 10:21 AM To: Perrizo, William I've been knee deep in getting the Hadoop implementation running and making the product more robust, as well as responding to some RFPs. I'm very close. Do you have any PowerPoints etc. on what you did for Netflix? I have some "recommend engine" type opportunities. Never heard from Jonathan on code. Will push him to pen out what we discussed.

From: Perrizo, William Sent: Monday, April 29, 10:35 AM In the Netflix Contest, the task was to beat Cinematch (Netflix’s recommender) by 10%.  Contestants were given 5 yrs so it was not a speed contest (accuracy only). 

The “pTree advantage” is mostly a speed advantage.  My  main pTree sales pitch has always been “get information  from your Very Big Data in human time!”.   Most recommenders have Very Big Training Sets (which are getting ever bigger - and the bigger, the better!).  Therefore, the rubber meets the recommender road on speed, not accuracy.  Difficult to devise a recommender-speed contest, so Netflix didn’t.

We used Nearest Neighbor Voting in two ways (and combined the two votes into one at the end).
1. We made a PTreeSet of the User Rating-History Table, UT(User, Movie, Rating), in which each row is a user and each column is a movie.

We used pTree horizontal processing (ANDs, ORs, …) of UT to get a "near neighbor user vote" which predicted a rating for each Test pair (u,m): near-neighbor users, v, close to u in terms of their rating history, voted on which rating u might give to m. "Near" was defined in terms of the ratings correlation of v and u (over a pruned set of the movies rated by both v and u).

2. We made a PTreeSet of the Movie Rated-History Table, MT(Movie, User, Rating), in which each row is a movie and each column is a user. We used pTree horizontal processing (ANDs, ORs, …) of MT to get a "near neighbor movie vote" which predicted the rating for each Test pair (m,u): near-neighbor movies, n, close to m in terms of their rated history, voted on which rating m might be given by u. "Near" was defined in terms of the ratings correlation of n and m (over a pruned-down set of the users who rated both n and m).
We also tried bringing in Association Rule Mining as a third contributor to the predictions, but without much success.
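For illustration, here is a minimal horizontal-data sketch of the user-vote idea in item 1 (plain C rather than pTrees; the dense UT array, the Pearson-correlation definition of "near", and 0 meaning "didn't rate" are assumptions for the sketch):

#include <math.h>

#define NUM_USERS  1000     /* small demo sizes; the real data is ~500K users x 17K movies, and sparse */
#define NUM_MOVIES 2000

/* UT[u][m] = rating 1..5, or 0 meaning "user u did not rate movie m" (assumed encoding) */
static unsigned char UT[NUM_USERS][NUM_MOVIES];

/* Ratings correlation (Pearson) of users u and v over the movies both rated. */
static double user_corr(int u, int v)
{
    double su = 0, sv = 0, suu = 0, svv = 0, suv = 0;
    int n = 0;
    for (int m = 0; m < NUM_MOVIES; m++)
        if (UT[u][m] && UT[v][m]) {
            double a = UT[u][m], b = UT[v][m];
            su += a; sv += b; suu += a*a; svv += b*b; suv += a*b; n++;
        }
    if (n < 2) return 0.0;
    double num = suv - su*sv/n;
    double den = sqrt((suu - su*su/n) * (svv - sv*sv/n));
    return den > 0 ? num/den : 0.0;
}

/* "Near neighbor user vote" for test pair (u,m): a correlation-weighted average of
   the ratings given to m by users v whose rating histories correlate with u's.     */
double nn_user_vote(int u, int m, double min_corr)
{
    double wsum = 0, vote = 0;
    for (int v = 0; v < NUM_USERS; v++) {
        if (v == u || UT[v][m] == 0) continue;
        double c = user_corr(u, v);
        if (c >= min_corr) { vote += c * UT[v][m]; wsum += c; }
    }
    return wsum > 0 ? vote / wsum : 0.0;   /* 0.0 = "no vote"; caller falls back to a baseline */
}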

I’m going to spend a little time trying to apply our newer FAUST methods to it.  In lots of ways, the recommender environment is similar to the text classification environment - the main training object is a real_number_labeled relationship between two entities (user and movie entities with rating labels in the recommender case, and document and term entities with frequency labels in the text mining case).  In both we have to deal with very high dimensions as well as very high cardinality.  Text mining is easier since label=0 means term_freq=0, while in the recommender case label=0 does not mean rating=0 (hated it!) but means “didn’t rate it” (one has to be very careful not to allow label=0 to be interpreted in the code as “absolutely hated that movie!”).

Hmm.  It occurs to me now that we calculated a weighted-average decimal rating (e.g., if it turned out to be 4.2546 we predicted 4).  Using the FAUST methods, we will treat each rating as a categorical class (non-numeric).  Maybe the “rating=0 problem” will not rear its ugly head??

On the “add to Training Set dynamically” issue, speed seems like the solution here too.  If your recommender is slow (as most are) then you are pretty much forced to treat new training data separately (rather than rebuilding your predictor model).  Off the top of my head, I would think we would just take the entire training set (with the new ratings added) and go.  Remember, a nearest neighbor classifier is a lazy classifier in the sense that it does not build a model of the training set during a slow “build phase” and then use that model swiftly during a classify phase; it uses the entire training set for each new prediction.  So all we would do is extend our two PTreeSets to include the new training data, not rebuild a training model.  Of course, FAUST builds a decision tree, so I will have to think about that issue for FAUST.

Original From: Perrizo, William Sent: Saturday, April 27, 2013 2:04 PM To: 'Mark Silverman' When you get it to a describable point, I would love to hear about the Hadoop implementation. I have attached some slides on our Netflix work (probably overload?)

From: Mark Silverman Monday, April 29, 8:20 AM Is there anything in particular about the Netflix approach that makes vertical methods more effective potentially?  If datasets start getting large I assume we have a nice performance advantage, possibly also an advantage that new “movies” could be added to the collection dynamically rather than requiring remodeling?

From: Mark Silverman To: Perrizo, William Question: did you deal at all with the issue of recommendation normalization?  What if a 3 or 4 to one user is a 5 to another…?

From: Perrizo, William To: '[email protected]' Usually families had Netflix accounts, and parents rate differently than kids (opposite?).  We did not come up with a solution for that one. Also, many new users rated lots of movies they had never seen just to prime the pump.  That one could be ameliorated somewhat by noting the date of rating (e.g., a user who rates 1000 movies in his/her first few days is probably priming the pump). Using a rating-vector neighborhood of voters, we would include in the vote only users who also considered a rating of 3 or 4 as a high rating (and exclude those who consider 5 as high, unfortunately).  I remember trying to normalize out that difference by dividing by the STD (or max minus min) of a user’s ratings.
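A sketch of that per-user normalization (the mean-centering is an added assumption; the email only mentions dividing by the STD or by max minus min):

#include <math.h>

/* Normalize one user's observed ratings in place: subtract the user's mean
   (added assumption) and divide by the user's STD, falling back to max minus
   min if the STD is degenerate.                                              */
void normalize_user(double *r, int n)
{
    double sum = 0, sumsq = 0, mn = r[0], mx = r[0];
    for (int i = 0; i < n; i++) {
        sum += r[i]; sumsq += r[i] * r[i];
        if (r[i] < mn) mn = r[i];
        if (r[i] > mx) mx = r[i];
    }
    double mean  = sum / n;
    double sd    = sqrt(sumsq / n - mean * mean);
    double scale = (sd > 1e-9) ? sd : (mx > mn ? mx - mn : 1.0);
    for (int i = 0; i < n; i++)
        r[i] = (r[i] - mean) / scale;
}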

If your document-recommender environment is one in which a document is recommended (more highly) to a user iff the neighborhood of its tf (or tf*idf) vector contains lots of documents already read by that user, then there may not be a huge normalization problem (unless you want to try to normalize out differences in author style?).

Page 2:

From: Mark Silverman Sent: Thursday, May 02, 2013 2:01 PM Well, it’s submitted, sufficiently vague but hopefully sounding good. So my thinking was more along the lines of a user having a set of weighted term frequencies that essentially map to a subject-matter interest.

Thus, if I am interested in Uganda, I have one set of term-freqs. However, if I am interested in, say, Uganda and the NY Knicks, then tracking interests by user becomes difficult because I am essentially looking

for neighbors against things that are rare.  So where I was going was that we are tracking not by user but by user-topic, such that we determine how close a user's feedback is to his existing selections, and start a new topic if it is sufficiently far.  Thus, I am matching his likes against other NY Knicks fans, not other Ugandan NY Knicks fans.

That’s sort of what I put in, with a lot of caveats, given it’s a long project and there's plenty of discovery needed.

From: Perrizo, William Sent: Thursday, May 02, 2013 2:30 PM This may be too late and it’s not a biggy, but our solution to the “u = a family of users” issue is that we took as our voter set other users who rated almost the same set of other movies as u did, which meant that the voters probably consisted of a similar family mix (e.g., some "Transformers"-type movies rated by the young males, "Sixteen Candles"-type movies rated by the young females, "The Bridges of Madison County"-type rated by the adult female, and "A Few Good Men"-type rated by the adult men…).

From: Mark Silverman [mailto:[email protected]] So I am also curious whether there is a way to account for inactivity, for example, a recommendation that is not taken?

From: Perrizo, William In the Netflix contest there was no inactivity information given and therefore it did not play in the contest. But I take the “is” in your question to be a probe independent of the Netflix Contest? It’s a very interesting question.

Since recommendations are issued for those items that classify as likely to receive a top rating, t, a recommendation not taken could be recorded as a reduced rating, x, where x could be adjusted according to the percentage, p, of recommendations ignored by that user (e.g., x = p*t).

From: Mark Silverman Sent: Wednesday, May 01, 2013 7:34 PM To: Perrizo, William Bingo.  It occurs to me further, in thinking about this, that the concept of a “user” is also a bit fuzzy.  As I remember, Greg mentioned to me there were concerns about the “solvability” of Netflix due to the fact that multiple family members might be interested in different movies; for example, my son is not interested in the same movies as me but uses the same account.

 Similarly, I as a subscriber to, let’s say, articles of “interest” might be interested in both the New York Knicks (yes, I am a long suffering fan) and the economy of the Congo.  If we average these term-weights together, I get mush.  Perhaps we have “user-profiles”, how close is my recommendation to previous recommendations I’ve made.  I could be interested in 5 different topics, or let’s say 5 different movie genres, and by considering me as one person with one set of average term-weights I lose this.

Time to cut the cord and finish, I have 18 hours to go.

From: Perrizo, William You are right about the “family = user” point.  Several people have written about it, but I don’t think anyone has come up with a true solution to it.  It might help somewhat to know that a user is a family and then only use near-neighbor voters that also display characteristics of being a family (high ratings for "Transformers"-type movies and for sophisticated dramas…).

There is a paper http://dl.acm.org/citation.cfm?id=372071 on movie-based classification in which they look for movies of a similar genre, type, director, actors, … to the movie in question (and that the user in question has rated).  Then the average rating that the user gave to those “similar” movies is the prediction. We also took another movie-based approach in which we took the near-neighbor movies (voters) to be movies that were rated, by a set of other users (users who rated the movie whose rating was to be predicted), similarly to the user's rating pattern on other movies.

Page 3:

Netflix provided 100M ratings (from 1 to 5) of 17K movies by 500K users. These essentially arrive in the form of a triplet of numbers: (User,Movie,Rating). In particular, for (User,Movie,?) not in the database, tell me what the Rating would be--that is, predict how the given User would rate the given Movie. For visualizing the problem, it makes sense to think of the data as a big sparsely filled matrix, with users across the top and movies down the side (or vice versa if you feel like transposing everything I say henceforth), and each cell in the matrix either contains an observed rating (1-5) for that movie (row) by that user (column), or is blank meaning you don't know. To quantify "big", sticking with the round numbers, this matrix would have about 8.5 billion entries (number of users times number of movies). Note also that this means you are only given values for one in eighty five of the cells. The rest are all blank. Netflix has then posed a "quiz" which consists of a bunch of question marks plopped into previously blank slots, and your job is to fill in best-guess ratings in their place. They have chosen mean squared error as the measure of accuracy, which means if you guess 1.5 and the actual rating was 2, you get docked for (2-1.5)^2 points, or 0.25. (they specify root mean squared error, referred to as rmse, but since they're monotonically related it's all the same and thus it will simply hurt your head less if you ignore the square root at the end.) They also provide a date for both the ratings and the question marks, which implies that any cell in the matrix can potentially have more than one rating in it.
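For concreteness, a small sketch of the scoring rule just described (array names are hypothetical; the contest's actual submission format differed in detail):

#include <math.h>

/* Root mean squared error over n quiz answers: guess[i] vs. actual[i].
   Guessing 1.5 for a true 2 contributes (2 - 1.5)^2 = 0.25 before the sqrt. */
double rmse(const double *guess, const double *actual, int n)
{
    double se = 0;
    for (int i = 0; i < n; i++) {
        double e = guess[i] - actual[i];
        se += e * e;
    }
    return sqrt(se / n);
}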

Imagine for a moment that we have the whole shebang--8.5 billion ratings and a lot of weary users. Presumably there are some generalities to be found in there, something more concise and descriptive than 8.5 billion completely independent and unrelated ratings. For instance, any given movie can, to a rough degree of approximation, be described in terms of some basic attributes such as overall quality, whether it's an action movie or a comedy, what stars are in it, and so on. And every user's preferences can likewise be roughly described in terms of whether they tend to rate high or low, whether they prefer action movies or comedies, what stars they like, and so on. And if those basic assumptions are true, then a lot of the 8.5 billion ratings ought to be explainable by a lot less than 8.5 billion numbers, since, for instance, a single number specifying how much action a particular movie has may help explain why a few million action-buffs like that movie.

A fun property of machine learning is that this reasoning works in reverse too: If meaningful generalities can help you represent your data with fewer numbers, finding a way to represent your data in fewer numbers can often help you find meaningful generalities. Compression is akin to understanding and all that. In practice this means defining a model of how the data is put together from a smaller number of parameters, and then deriving a method of automatically inferring from the data what those parameters should actually be. In today's foray, that model is called singular value decomposition, which is just saying what I've already alluded to: We'll assume that a user's rating of a movie is composed of a sum of preferences about the various aspects of that movie.

For example, imagine that we limit it to forty aspects, such that each movie is described only by forty values saying how much that movie exemplifies each aspect, and correspondingly each user is described by forty values saying how much they prefer each aspect. To combine these all together into a rating, we just multiply each user preference by the corresponding movie aspect, and then add those forty leanings up into a final opinion of how much that user likes that movie. E.g., Terminator might be (action=1.2,chickflick=-1,...), and user Joe might be (action=3,chickflick=-1,...), and when you combine the two you get Joe likes Terminator with 3*1.2 + -1*-1 + ... = 4.6+... . Note here that Terminator is tagged as an anti-chickflick, and Joe likewise as someone with an aversion to chickflicks, so Terminator actively scores positive points with Joe for being decidedly un-chickflicky. (Point being: negative numbers are ok.) Anyway, all told that model requires 40*(17K+500K) values, or about 20M -- 400 times less than the original 8.5B.

ratingsMatrix[user][movie] = sum (userFeature[f][user] * movieFeature[f][movie]) for f from 1 to 40

In matrix terms, the original matrix has been decomposed into two very oblong matrices: the 17,000 x 40 movie aspect matrix, and the 500,000 x 40 user preference matrix. Multiplying those together just performs the products and sums described above, resulting in our approximation to the 17,000 x 500,000 original rating matrix. Singular value decomposition is just a mathematical trick for finding those two smaller matrices which minimize the resulting approximation error--specifically the mean squared error (rather convenient!).

So, in other words, if we take the rank-40 singular value decomposition of the 8.5B matrix, we have the best (least error) approximation we can within the limits of our user-movie-rating model. I.e., the SVD has found our "best" generalizations for us. Pretty neat, eh?

Only problem is, we don't have 8.5B entries, we have 100M entries and 8.4B empty cells. Ok, there's another problem too, which is that computing the SVD of ginormous matrices is... well, no fun. But, just because there are five hundred really complicated ways of computing singular value decompositions in the literature doesn't mean there isn't a really simple way too: Just take the derivative of the approximation error and follow it. This has the added bonus that we can ignore the unknown error on the 8.4B empty slots.

Page 4:

If you write out the equations for the error between the SVD-like model and the original data--just the given values, not the empties--and then take the derivative with respect to the parameters we're trying to infer, you get a rather simple result which I'll give here in C code to save myself the trouble of formatting the math:

userValue[user]   += lrate * err * movieValue[movie];
movieValue[movie] += lrate * err * userValue[user];

The above code is evaluated for each rating in the training database. lrate is the learning rate, a rather arbitrary number which I fortuitously set to 0.001 on day one and regretted it every time I tried anything else after that. err is the residual error from the current prediction. So, the whole routine to train one sample might look like:

/*
 * Where:
 *   real *userValue  = userFeature[featureBeingTrained];
 *   real *movieValue = movieFeature[featureBeingTrained];
 *   real  lrate      = 0.001;
 */
static inline
void train(int user, int movie, real rating)
{
    real err = lrate * (rating - predictRating(movie, user));
    userValue[user]   += err * movieValue[movie];
    movieValue[movie] += err * userValue[user];
}

predictRating() here would also use userValue and movieValue to do its work, so there's a tight feedback loop. I mention the "static inline" and cram the lrate into err just to make the point that: this is the inside of the inner loop, and every clock cycle counts.
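One plausible shape for predictRating() at this point in the exposition, kept consistent with the snippet above (this is a sketch, not the author's code: predictRating_Baseline is the baseline described a couple of paragraphs below, and numFeaturesTrained is an assumed bookkeeping variable):

typedef double real;                        /* as in the snippets (assumed typedef) */

#define NUM_USERS    500000
#define NUM_MOVIES   17000
#define NUM_FEATURES 40

extern real userFeature[NUM_FEATURES][NUM_USERS];
extern real movieFeature[NUM_FEATURES][NUM_MOVIES];
extern int  numFeaturesTrained;             /* features already on the "done" pile (assumed) */
real predictRating_Baseline(int movie, int user);   /* the baseline described below */

static inline real predictRating(int movie, int user)
{
    real sum = predictRating_Baseline(movie, user);
    for (int f = 0; f <= numFeaturesTrained; f++)   /* <= : include the feature in training */
        sum += userFeature[f][user] * movieFeature[f][movie];
    return sum;
}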

My wee laptop is able to do a training pass through the entire data set of 100 million ratings in about seven and a half seconds. Slightly uglier but more correct, unless you're using an atemporal programming language you will want to do:

uv = userValue[user];
userValue[user]   += err * movieValue[movie];
movieValue[movie] += err * uv;

Anyway, this will train one feature (aspect), and in particular will find the most prominent feature remaining (the one that will most reduce the error that's left over after previously trained features have done their best). When it's as good as it's going to get, shift it onto the pile of done features, and start a new one. For efficiency's sake, cache the residuals (all 100 million of them) so when you're training feature 72 you don't have to wait for predictRating() to re-compute the contributions of the previous 71 features. You will need 2 Gig of ram, a C compiler, and good programming habits to do this.
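A sketch of the outer loop this paragraph implies, using the declarations from the sketch above: train features one at a time against a cached residual array, fold each finished feature into the residuals, and move on. This is a sketch only, not the author's program; the 120-epoch stop and the 0.1 initialization are taken from the paragraphs below, and the remaining names and sizes are assumptions.

#define NUM_RATINGS 100000000
#define NUM_EPOCHS  120                    /* fixed stop; see the over-fitting discussion below */

static const real lrate = 0.001;

extern int  userOf[NUM_RATINGS], movieOf[NUM_RATINGS];
extern real ratingOf[NUM_RATINGS];
extern real residual[NUM_RATINGS];         /* ~800MB of doubles; hence the 2 Gig of RAM */

void train_all_features(void)
{
    /* residual[i] = what the baseline (and, later, finished features) fail to explain */
    for (long i = 0; i < NUM_RATINGS; i++)
        residual[i] = ratingOf[i] - predictRating_Baseline(movieOf[i], userOf[i]);

    for (int f = 0; f < NUM_FEATURES; f++) {
        real *uv = userFeature[f], *mv = movieFeature[f];
        for (int u = 0; u < NUM_USERS;  u++) uv[u] = 0.1;   /* the 0.1 init mentioned below */
        for (int m = 0; m < NUM_MOVIES; m++) mv[m] = 0.1;

        for (int e = 0; e < NUM_EPOCHS; e++)
            for (long i = 0; i < NUM_RATINGS; i++) {
                int u = userOf[i], m = movieOf[i];
                real err = lrate * (residual[i] - uv[u] * mv[m]);
                real tmp = uv[u];
                uv[u] += err * mv[m];
                mv[m] += err * tmp;
            }

        for (long i = 0; i < NUM_RATINGS; i++)              /* fold the finished feature in */
            residual[i] -= uv[userOf[i]] * mv[movieOf[i]];
    }
}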

There remains the question of what to initialize a new feature to. Unlike backprop and many other gradient descent algorithms, this one isn't really subject to local minima that I'm aware of, which means it doesn't really matter. I initialize both vectors to 0.1, 0.1, 0.1, 0.1, .... Profound, no? (How it's initialized actually does matter a bit later, but not yet...)

The end result, it's worth noting, is exactly an SVD if the training set perfectly covers the matrix. Call it what you will when it doesn't. (If you're wondering where the diagonal scaling matrix is, it gets arbitrarily rolled in to the two side matrices, but could be trivially extracted if needed.) A host of refinements:

Prior to even starting with the SVD, one can get a good head start by noting the average rating for every movie, as well as the average offset between a user's rating and the movie's average rating, for every user. I.e., the prediction method for this baseline model is:

static inline
real predictRating_Baseline(int movie, int user)
{
    return averageRating[movie] + averageOffset[user];
}

So, that's the return value of predictRating before the first SVD feature even starts training. You would think the average rating for a movie

would just be... its average rating! Alas, Occam's razor was a little rusty that day.

Page 5:

Trouble is, what if there's a movie which only appears in the training set once, say with a rating of 1. Does it have an average rating of 1? Probably not! In fact you can view that single observation as a draw from a true probability distribution whose average you want... and you can view that true average itself as having been drawn from a probability distribution of averages--the histogram of average movie ratings essentially. If we assume both distributions are Gaussian, then according to my shoddy math the actual best-guess mean should be a linear blend between the observed mean and the apriori mean, with a blending ratio equal to the ratio of variances. That is: If Ra and Va are the mean and variance (squared standard deviation) of all of the movies' average ratings (which defines your prior expectation for a new movie's average rating before you've observed any actual ratings) and Vb is the average variance of individual movie ratings (which tells you how indicative each new observation is of the true mean--e.g., if the average variance is low, then ratings tend to be near the movie's true mean, whereas if the avg variance is high, ratings tend to be more random and less indicative) then:

BogusMean  = sum(ObservedRatings) / count(ObservedRatings)
K          = Vb / Va
BetterMean = [GlobalAverage*K + sum(ObservedRatings)] / [K + count(ObservedRatings)]

But in fact K=25 seems to work well so I used that instead. :) The same principle applies to computing the user offsets. The point here is simply that any time you're averaging a small number of examples, the

true average is most likely nearer the apriori average than the sparsely observed average. Note if the number of observed ratings for a particular movie is zero, the BetterMean (best guess) above defaults to the global average movie rating as one would expect.
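A sketch of that blended mean with the K=25 shortcut (names assumed):

/* Blend a movie's observed average toward the global average, weighting the prior
   as if it were K extra "virtual" ratings at the global average (K = 25 here).     */
double better_mean(double sumObserved, int countObserved, double globalAverage, double K)
{
    return (globalAverage * K + sumObserved) / (K + countObserved);
}

For the once-rated movie above, better_mean(1.0, 1, globalAverage, 25) sits only a small step away from the global average rather than at 1, and with zero observations it reduces to the global average exactly.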

Moving on: 20 million free parameters is still rather a lot for a training set with only 100 million examples. While it seems like a neat idea to just ignore all those blank spaces in the implicit ratings matrix, the truth is we have some expectations about what's in them, and we can use that to our advantage. As-is, this modified SVD algorithm tends to make a mess of sparsely observed movies or users. To give an example, imagine you have a user who has only rated one movie, say American Beauty. Let's say they give it a 2 while the average is (just making something up) 4.5, and further that their offset is only -1, so we would, prior to even employing the SVD, expect them to rate it 3.5. So the error given to the SVD is -1.5 (the true rating is 1.5 less than we expect). Now imagine that the current movie-side feature, based on broader context, is training up to measure the amount of Action, and let's say that's a paltry 0.01 for American Beauty (meaning it's just slightly more than average). The SVD, recall, is trying to optimize our predictions, which it can do by eventually setting our user's preference for Action to a huge -150.0. I.e., the algorithm naively looks at the one and only example it has of this user's preferences, in the context of the one and only feature it knows about so far (Action), and determines that our user so hates action movies that even the tiniest bit of action in American Beauty makes it suck a lot more than it otherwise might. This is not a problem for users we have lots of observations for because those random apparent correlations average out and the true trends dominate.

So, once again, we need to account for priors. As with the average movie ratings, it would be nice to be able to blend our sparse observations in with some sort of prior, but it's a little less clear how to do that with this incremental algorithm. But if you look at where the incremental algorithm theoretically converges, you get:

userValue[user] = [sum residual[user,movie] * movieValue[movie]] / [sum (movieValue[movie]^2)]

The numerator there will fall in a roughly zero-mean Gaussian distribution when charted over all users, which through various gyrations I won't bore you with leads to:

userValue[user] = [sum residual[user,movie] * movieValue[movie]] / [sum (movieValue[movie]^2 + K)]

And finally back to:

userValue[user]   += lrate * (err * movieValue[movie] - K * userValue[user]);
movieValue[movie] += lrate * (err * userValue[user] - K * movieValue[movie]);

This is essentially equivalent to penalizing the magnitude of the features, and so is probably related to Tikhonov regularization. The point: to try to

cut down on over fitting, ultimately allowing use of more features. Last, Vincent liked K=0.02 or so, with well over 100 features (singular vector pairs--if you can still call them that).

Moving on: As I mentioned a few entries ago, linear models are pretty limiting. Fortunately, we've bastardized the whole matrix analogy so much by now that we aren't really restricted to linear models any more: We can add non-linear outputs such that instead of predicting with:

sum (userFeature[f][user] * movieFeature[f][movie]) for f from 1 to 40

Page 6:

We can use:

sum G(userFeature[f][user] * movieFeature[f][movie]) for f from 1 to 40

Two choices for G proved useful. One is to simply clip the prediction to the range 1-5 after each component is added in. That is, each feature is

limited to only swaying the rating within the valid range, and any excess beyond that is lost rather than carried over. So, if the first feature suggests +10 on a scale of 1-5, and the second feature suggests -1, then instead of getting a 5 for the final clipped score, it gets a 4 because the score was clipped after each stage. The intuitive rationale here is that we tend to reserve the top of our scale for the perfect movie, and the bottom for one with no redeeming qualities whatsoever, and so there's a sort of measuring back from the edges that we do with each aspect independently. More pragmatically, since the target range has a known limit, clipping is guaranteed to improve our performance, and having trained a stage with clipping on we should use it with clipping on. However, I did not really play with this extensively enough to determine there wasn't a better strategy.
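A sketch of that first choice of G, clipping into [1,5] after each feature is folded in (a sketch under the same assumed globals as above, not the author's code):

/* Per-feature clipping (the first choice of G): each feature may only sway the
   running score within the valid 1..5 range; any excess is lost, not carried over. */
static inline real clip15(real x) { return x < 1 ? 1 : (x > 5 ? 5 : x); }

real predictRating_clipped(int movie, int user)
{
    real sum = clip15(predictRating_Baseline(movie, user));
    for (int f = 0; f < NUM_FEATURES; f++)
        sum = clip15(sum + userFeature[f][user] * movieFeature[f][movie]);
    return sum;
}

With the +10 / -1 example above, the first feature's excess is discarded at the clip, so the second feature pulls the score from 5 down to 4.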

A second choice for G is to introduce some functional non-linearity such as a sigmoid. I.e., G(x) = sigmoid(x). Even if G is fixed, this requires modifying the learning rule slightly to include the slope of G, but that's straightforward. The next question is how to adapt G to the data. I tried a couple of options, including an adaptive sigmoid, but the most general and the one that worked the best was to simply fit a piecewise linear approximation to the true output/output curve. That is, if you plot the true output of a given stage vs the average target output, the linear model assumes this is a nice 45 degree line. But in truth, for the first feature for instance, you end up with a kink around the origin such that the impact of negative values is greater than the impact of positive ones. That is, for two groups of users with opposite preferences, each side tends to penalize more strongly than the other side rewards for the same quality. Or put another way, below-average quality (subjective) hurts more than above-average quality helps. There is also a bit of a sigmoid to the natural data beyond just what is accounted for by the clipping. The linear model can't account for these, so it just finds a middle compromise; but even at this compromise, the inherent non-linearity shows through in an actual-output vs. average-target-output plot, and if G is then simply set to fit this, the model can further adapt with this new performance edge, which leads to potentially more beneficial non-linearity and so on... This introduces new free parameters and encourages over fitting especially for the later features which tend to represent small groups. We found it beneficial to use this non-linearity only for the first twenty or so features and to disable it after that.

Moving on: Despite the regularization term in the final incremental law above, over fitting remains a problem. Plotting the progress over time, the probe rmse eventually turns upward and starts getting worse (even though the training error is still inching down). We found that simply choosing a fixed number of training epochs appropriate to the learning rate and regularization constant resulted in the best overall performance. I think for the numbers mentioned above it was about 120 epochs per feature, at which point the feature was considered done and we moved on to the next before it started over fitting. Note that now it does matter how you initialize the vectors: Since we're stopping the path before it gets to the (common) end, where we started will affect where we are at that point. I wonder if a better regularization couldn't eliminate overfitting altogether, something like Dirichlet priors in an EM approach--but I tried that and a few others and none worked as well as the above.

[Plots omitted: (1) probe and training rmse for the first few features with and without the regularization term "decay" enabled; (2) the same, probe-set rmse only, further along, where the regularized version pulls ahead; (3) the same again, plotted as probe rmse (vertical) against train rmse (horizontal), showing the regularized version's better probe performance relative to its training performance.]

Anyway, that's about it. I've tried a few other ideas over the last couple of weeks, including a couple of ways of using the date information, and

while many of them have worked well up front, none held their advantage long enough to actually improve the final result. If you notice any obvious errors or have reasonably quick suggestions for better notation or whatnot to make this explanation more clear, let me

know. And of course, I'd love to hear what y'all are doing and how well it's working, whether it's improvements to the above or something completely different. Whatever you're willing to share,

Page 7:

The SVD model in the notation of these slides: a predicted rating is the dot product of a user's feature vector and a movie's feature vector,

r^(u,i) = u o i = sum over f = 1..F of r(u,f) * r(f,i),   i.e.,   R = U^T o I,

where U^T is the TestSizeU x F matrix of user feature weights r(u,f) (one row per user u1, ..., uTestSizeU), I is the F x TestSizeI matrix of feature-movie weights r(f,i) (one column per movie i1, ..., iTestSizeI), and R is the resulting TestSizeU x TestSizeI matrix of predicted ratings r^(u,i).

Training update for a user row u (the per-rating gradient step with regularization, matching the userValue update above):

u += lrate * ( err(u,i) * i^T - K * u ),   where err(u,i) = r(u,i) - r^(u,i), r(u,i) = the actual rating and r^(u,i) = the predicted rating,

after which U^T (with its updated rows) times I again gives the refreshed R.

Page 8:

Maximizing the Variance. Given any table, X(X1, ..., Xn), and any unit vector, d, in n-space, let

F_d(X) = DPP_d(X) = X o d,   the column of dot products (x1 o d, x2 o d, ..., xN o d).

Then

V(d) ≡ Variance(X o d) = mean((X o d)^2) - (mean(X o d))^2
     = (1/N) sum_{i=1..N} ( sum_{j=1..n} x(i,j) d_j )^2 - ( sum_{j=1..n} mean(Xj) d_j )^2
     = sum_{j=1..n} ( mean(Xj^2) - mean(Xj)^2 ) d_j^2 + 2 sum_{j<k} ( mean(Xj Xk) - mean(Xj) mean(Xk) ) d_j d_k
     = sum_{j,k} a(j,k) d_j d_k = d^T o A o d,   subject to sum_{i=1..n} d_i^2 = 1,

where A = ( a(j,k) ) with a(j,k) = mean(Xj Xk) - mean(Xj) mean(Xk). We can separate out the diagonal or not: V(d) = sum_j a(j,j) d_j^2 + sum_{j≠k} a(j,k) d_j d_k.

Gradient(V)(d) = 2 A o d; its k-th component is 2 a(k,k) d_k + 2 sum_{j≠k} a(k,j) d_j.

Given a starting unit vector d0, one can hill-climb to locally maximize the variance, V, as follows: d1 ≡ ∇V(d0), d2 ≡ ∇V(d1), ... (each renormalized to unit length).

Ubhaya Theorem 1: There exists k in {1,...,n} such that d = e_k will hill-climb V to its global maximum.

Theorem 2 (working on it): Let d = e_k where a(k,k) is a maximal diagonal element of A. Then d = e_k will hill-climb V to its global maximum.

How do we use this theory? For Dot Product Gap based Clustering, we can hill-climb from e_k (with a(k,k) maximal) to a d that gives us the globally maximum variance. Heuristically, higher variance means more prominent gaps.
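A sketch of that hill-climb in plain C (A is the n x n matrix of the a(j,k) above, row-major; starting at e_k with a(k,k) maximal follows the theorem; the iteration cap and convergence test are assumptions):

#include <math.h>

/* Gradient hill-climb of V(d) = d'Ad on the unit sphere: start at e_k where a(k,k)
   is a maximal diagonal element, then repeatedly replace d by the unitized gradient
   2*A*d until the direction stops changing.                                          */
void hill_climb_variance(int n, const double *A /* n*n, row-major */, double *d)
{
    int k = 0;
    for (int j = 1; j < n; j++)
        if (A[j*n + j] > A[k*n + k]) k = j;
    for (int j = 0; j < n; j++) d[j] = (j == k) ? 1.0 : 0.0;

    for (int iter = 0; iter < 100; iter++) {      /* iteration cap: an assumption */
        double g[64], norm = 0;                   /* assumes n <= 64 for the sketch */
        for (int i = 0; i < n; i++) {
            g[i] = 0;
            for (int j = 0; j < n; j++) g[i] += 2.0 * A[i*n + j] * d[j];
            norm += g[i] * g[i];
        }
        norm = sqrt(norm);
        if (norm == 0) break;
        double change = 0;
        for (int i = 0; i < n; i++) {
            double nd = g[i] / norm;
            change += fabs(nd - d[i]);
            d[i] = nd;
        }
        if (change < 1e-9) break;                 /* converged */
    }
}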

For Dot Product Gap based Classification, we can start with X = the table of the C Training Set Class Means, with rows M1, M2, ..., MC, where Mk ≡ the mean vector of Class k. Then mean(Xi) is the mean over the classes of their i-th coordinates, and mean(Xi Xj) is the mean over k = 1..C of Mk,i * Mk,j. These computations are O(C) (C = number of classes) and are essentially instantaneous. Once we have the matrix A, we can hill-climb to obtain a d that maximizes the variance of the dot-product projections of the class means.
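A sketch of building A from the C class means as just described (row-major arrays, names assumed); the result can be fed to the hill-climb sketch above:

/* Build A from the C class means M (a C x n table, row-major):
   a(j,k) = mean over classes of M(c,j)*M(c,k)  -  mean(M(.,j)) * mean(M(.,k)).  */
void build_A_from_class_means(int C, int n, const double *M, double *A)
{
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++) {
            double sj = 0, sk = 0, sjk = 0;
            for (int c = 0; c < C; c++) {
                sj  += M[c*n + j];
                sk  += M[c*n + k];
                sjk += M[c*n + j] * M[c*n + k];
            }
            A[j*n + k] = sjk / C - (sj / C) * (sk / C);
        }
}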

FAUST Classifier MVDI (Maximized Variance Definite Indefinite): build a decision tree. 1. Each round, find the d that maximizes the variance of the dot-product projections of the class means. 2. Apply the Definite/Indefinite (DI) split each round.

FAUST technology relies on: 1. a distance dominating functional, F. 2. Use of gaps in range(F) to separate.

For Unsupervised (Clustering) Hierarchical Divisive? Piecewise Linear? other? Perf Anal (which approach is best for which type of table?)

For Supervised (Classification), Decision Tree? Nearest Nbr? Piecewise Linear? Perf Anal (which is best for training set?)

White papers: Terabyte Head Wall. The Only Good Data is Data in Motion. Multilevel pTrees: k=0,1 suffices! A PTreeSet is defined by specifying a table, an array of stride_lengths (usually equi-length, so just that one length is specified) and a stride_predicate (a T/F condition on a stride, where a stride is a bag [or array?] of bits). So the metadata of PTreeSet(T,sl,sp) specifies T, sl and sp. A "raw" PTreeSet has sl=1 and the identity predicate (sl and sp not used). A "cooked" PTreeSet (AKA Level-1 PTreeSet) is one for a table with sl>1 (main purpose: provide compact summary information on the table). Let PTS(T) be a raw PTreeSet; then it, plus PTS(T,64,p), ..., PTS(T,64^k,p), form a tree of vertical summarizations of T.

Note that P(T, 64*64, p) is different from P(P(T,64,p), 64, p), but both make sense, since P(T, 64, p) is a table and P(P(T, 64, p), 64, p) is just a cooked pTree on it.
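As a hedged illustration of the raw-vs-cooked idea: given one raw bit column stored 64 rows per machine word, a level-1 pTree keeps one summary bit per stride. The "pure1" predicate used here (all 64 bits set) is only an example stride predicate, not necessarily the one the slides intend:

#include <stdint.h>
#include <stddef.h>

/* Level-1 ("cooked") pTree over one raw bit column stored 64 rows per word:
   out[i] = 1 iff the stride predicate holds on stride i. The "pure1" predicate
   (all 64 bits set) is only an example; an "any bit set" predicate would test
   w != 0 instead.                                                               */
void level1_ptree(const uint64_t *rawWords, size_t numStrides, uint8_t *out)
{
    for (size_t i = 0; i < numStrides; i++) {
        uint64_t w = rawWords[i];            /* one stride = 64 rows of the column */
        out[i] = (w == ~(uint64_t)0);        /* pure1 stride predicate (assumed example) */
    }
}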

Page 9:

FAUST MVDI on IRIS: 15 records from each Class for Testing (Virg39 was removed as an outlier).

First round, class means with Definite / Indefinite interval endpoints:
s-Mean 50.49 34.74 14.74 2.43   s: -1 10
e-Mean 63.50 30.00 44.00 13.50  e: 23 48   s_ei: 23 10 (empty)
i-Mean 61.00 31.50 55.50 21.50  i: 38 70   se_i: 38 48

d = (.33, -.1, .86, .38). Intervals: (-1, 16.5 = avg{23,10}) -> s, sCt=50; (16.5, 38) -> e, eCt=24; (48, 128) -> i, iCt=39; indefinite [38, 48] (se_i): seCt=26, iCt=13.

Second round (on the indefinite set), Definite / Indefinite:
i-Mean 62.8 29.2 46.1 14.5   i: -1 8
e-Mean 59.0 26.9 49.6 18.4   e: 10 17   i_e: 8 10 (empty)

d = (-.55, -.33, .51, .57). Intervals: (-1, 8) -> e, Ct=21; (10, 128) -> i, Ct=9; indefinite [8, 10] (e_i): eCt=5, iCt=4.

In this case, since the indefinite interval is so narrow, we absorb it into the two definite intervals; resulting in decision tree:

Resulting decision tree (d0 = (.33, -.1, .86, .38), d1 = (-.55, -.33, .51, .57)):
x o d0 < 16.5        -> Setosa
16.5 <= x o d0 < 38  -> Versicolor
38 <= x o d0 <= 48   -> indefinite; apply d1: x o d1 < 9 -> Versicolor, x o d1 >= 9 -> Virginica
48 < x o d0          -> Virginica

Page 10:

FAUST MVDI on SatLog (413 train, 4 attributes, 6 classes, 127 test). Gradient Hill Climb of Variance(d):
 d1    d2    d3    d4    V(d)
 0.00  0.00  1.00  0.00   282
 0.13  0.38  0.64  0.65   700
 0.20  0.51  0.62  0.57   742
 0.26  0.62  0.57  0.47   781
 0.30  0.70  0.53  0.38   810
 0.34  0.76  0.48  0.30   830
 0.36  0.79  0.44  0.23   841
 0.37  0.81  0.40  0.18   847
 0.38  0.83  0.38  0.15   850
 0.39  0.84  0.36  0.12   852
 0.39  0.84  0.35  0.10   853

     (class mean)        FoMN   Ct   min   max   max+1
mn2  49  40 115 119       106   108    91   155   156
mn5  58  58  76  64       108    61    92   145   146
mn7  69  77  81  64       131   154   104   160   161
mn4  78  91  96  74       152    60   127   178   179
mn1  67 103 114  94       167    27   118   189   190
mn3  89 107 112  88       178   155   157   206   207

Gradient Hill Climb of Var(d) on t25:
 d1     d2     d3    d4    V(d)
 0.00   0.00   0.00  1.00  1137
-0.11  -0.22   0.54  0.81  1747

MNod Ct ClMn ClMx ClMx+1mn2 45 33 115 124 150 54 102 177 178mn5 55 52 72 59 69 33 45 88 89Gradient Hill Climb of Var(d)on t257 0.00 0.00 1.00 0.00 496 -0.15 -0.29 0.56 0.76 1595Same using class means or training subset.

Gradient Hill Climb of Var(d)on t75 0.00 0.00 1.00 0.00 12 0.04 -0.09 0.83 0.55 20-0.01 -0.19 0.70 0.69 21Gradient Hill Climb of Var(d)on t13 0.00 0.00 1.00 0.00 29-0.83 0.17 0.42 0.34 166 0.00 0.00 1.00 0.00 25-0.66 0.14 0.65 0.36 81-0.81 0.17 0.45 0.33 88Gradient Hill Climb of Var(d)on t143 0.00 0.00 1.00 0.00 19-0.66 0.19 0.47 0.56 95 0.00 0.00 1.00 0.00 27-0.17 0.35 0.75 0.53 54-0.32 0.36 0.65 0.58 57-0.41 0.34 0.62 0.58 58

Using class means: FoMN Ct min max max+1mn4 83 101 104 82 113 8 110 121 122mn3 85 103 108 85 117 79 105 128 129mn1 69 106 115 94 133 12 123 148 149Using full data: (much better!)mn4 83 101 104 82 59 8 56 65 66mn3 85 103 108 85 62 79 52 74 75mn1 69 106 115 94 81 12 73 95 96

Gradient Hill Climb of Var t156161 0.00 0.00 1.00 0.00 5-0.23 -0.28 0.89 0.28 19-0.02 -0.06 0.12 0.99 157 0.02 -0.02 0.02 1.00 159 0.00 0.00 1.00 0.00 1-0.46 -0.53 0.57 0.43 2Inconclusive both ways so predict

purality=4(17) (3ct=3 tct=6

cl=4

Gradient Hill Climb of Var t146156 0.00 0.00 1.00 0.00 0 0.03 -0.08 0.81 -0.58 1 0.00 0.00 1.00 0.00 13 0.02 0.20 0.92 0.34 16 0.02 0.25 0.86 0.45 17Inconclusive both ways so predict

purality=4(17) (7ct=15 2ct=2

Cl=7

Gradient Hill Climb of Var t127 0.00 0.00 1.00 0.00 41-0.01 -0.01 0.70 0.71 90-0.04 -0.04 0.65 0.75 91 0.00 0.00 1.00 0.00 35-0.32 -0.14 0.59 0.73 105Inconclusive predict purality=7(62

4(15) 1(5) 2(8) 5(7)

cl=7

F[a,b) 0 92 104 118 127 146 156 157 161 179 190Class 2 2 2 2 2 2 5 5 5 5 7 7 7 7 7 7 1 1 1 1 1 1 1 4 4 4 4 4 3 3 3 3

d=(0.39 0.89 0.35 0.10 )

F[a,b) 89 102Class 5 2

d=(-.11 -.22 .54 .81)

F[a,b) 47 65 81 101Class 7 5 5 2 2

d=(-.15 -.29 .56 .76)

F[a,b) 57 61 69 87Class 5 7

d=(-.01, -.19, .7, .69)

F[a,b) 21 35 41 59Class 3 1

d=(-.81, .17, .45, .33)

F[a,b) 52 56 66 73 75Class 3 3 3 3 4 1 1

d=(-.66, .19, .47, .56)

On the 127-sample SatLog TestSet: 4 errors, or 96.8% accuracy. Speed? With horizontal data, DTI is applied one unclassified sample at a time (per execution thread). With this pTree Decision Tree, we take the entire TestSet (a PTreeSet), create the various dot-product SPTSs (one for each inode), and create the cut SPTS masks. These masks mask the results for the entire TestSet.

For WINE: min max+1
8.40 10.33 27.00  9.63 28.65  9.9  53.4
7.56 11.19 32.61 10.38 34.32  7.7 111.8
8.57 12.84 30.55 11.65 32.72  8.7 108.4
8.91 13.64 34.93 11.97 37.16 13.1  92.2
Awful results!

Page 11:

FAUST MVDI on Concrete

For Concrete min max+1 train335.3 657.1 0 l120.5 611.6 12 m321.1 633.5 0 hTest 0 l****** 1 m****** 0 h****** 0 321

3.0 57.0 0 l 3.0 361.0 11 m28.0 92.0 0 h 0 l***** 2 m***** 0 h 92***** 999

d0= -0.34 -0.16 0.81 -0.45

xod0<320

Class=m (test:1/1)

xod0>=634

Class=l (test:1/1)d1= .85 -.03 .52 -.02

xod2>=92

Class=m (test:2/2)d2= .85 -.00 .53 .05

xod2<28

Class= l or md3= .81 .04 .58 .01

d3547.9 860.9 4 l 617.1 957.3 0 m 762.5 867.7 0 h 0 l******* 0 m******* 0 h. 0******* 617

xod3<969

Cl=l *test 6/9)

xod3>=868

Cl=m (test:1/1)

d2544.2 651.5 0 l515.7 661.1 0 m591.0 847.4 40 h 1 l****** 0 m****** 11 h 662****** 999

xod2>=662

Cl=h (test:11/12)

xod3<544

Cl=m *test 0/0)

d4 = .79 .14 .60 .03

xod4<640

Cl=l *test 2/2)

xod4>=681

Cl=l (test:0/3)

7 test errors out of 30 = 77% accuracy

Seeds.97 .17 -.02 .15 d013.3 19.3 0 0 l16.4 23.5 0 0 m12.2 15.2 25 5 h 0 13.2 19.3 23.5

xod<13.2

Class=h errs:0/5)

xod>=19.3

Class=m errs0/1)

.97 .19 .08 .16 d113.4 19.6 0 0 l16.9 19.9 4 3 m13.5 16.0 0 0 h0 13.45 18.6 99

xod>=18.6

Class=m errs0/4)

xod<13.2

Class=h errs:0/5)

0.97 0.19 0.06 0.1514.4 19.6 0 0 l16.8 18.8 0 0 m13.5 15.8 11 1 h0 14.366 17.816 99

Class=h errs:0/1)

Class=m errs0/0).00 .00 1.00 .00 1.0 8.0 6 4 l 4.0 5.0 0 0 m 2.0 9.0 0 0 h0 2 2 99

Class=l errs:0/4)

Class=m errs8/12)

8 test errors out of 32 = 75% accuracy

Page 12:

FAUST Classifier

Setting: two classes, R and V, are separated along a direction d (the slide's figure shows the two point sets in dim-1/dim-2 space, their means mR and mV, their vectors of medians vomR and vomV, the d-line, and the cut point a).

0. Cut in the middle of the means: D ≡ mR→mV, d = D/|D|, a = (mR + (mV - mR)/2) o d = ((mR + mV)/2) o d; then PR = P(x o d < a) and PV = P(x o d ≥ a).
1. Cut in the middle of the VectorsOfMedians (VOM), not the means. Use the stdev ratio, not the midpoint, for even better cut placement?
2. Cut in the middle of {Max{R o d}, Min{V o d}} (assuming mR o d ≤ mV o d). If there is no gap, move the cut to minimize Rerrors + Verrors.
3. Hill-climb d to maximize the gap, or to minimize training-set errors, or (simplest) to minimize dis(Max{R o d}, Min{V o d}).
4. Replace mR, mV with the averages of the margin points?
5. If Max{R o d} ≤ Min{V o d}, set CutR = CutV = avg{Max{R o d}, Min{V o d}}; else CutR ≡ Min{V o d} and CutV ≡ Max{R o d}. Then PR = P(x o d < CutR) and PV = P(x o d > CutV). If y is in PR or in PV, that is a definite classification; otherwise re-do on the Indefinite region, P(CutR ≤ x o d ≤ CutV), until an actual gap appears (combined with some stop condition? E.g., "on the nth round, use definite only; cut at midpoint(mR, mV)").
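A minimal horizontal sketch of cut 0 above (midpoint of the means): build d from the two class means, cut at a = ((mR+mV)/2) o d, and classify by which side x o d falls on. Plain C, no pTrees; names are assumptions:

#include <math.h>

/* Classify x (n-dimensional) against classes R and V with the midpoint-of-means cut:
   d = (mV - mR)/|mV - mR|,  a = ((mR + mV)/2) o d;  return 'R' if x o d < a, else 'V'. */
char faust_mean_cut(const double *x, const double *mR, const double *mV, int n)
{
    double d[64], len = 0, a = 0, xd = 0;        /* assumes n <= 64 and mR != mV */
    for (int j = 0; j < n; j++) { d[j] = mV[j] - mR[j]; len += d[j] * d[j]; }
    len = sqrt(len);
    for (int j = 0; j < n; j++) {
        d[j] /= len;
        a  += 0.5 * (mR[j] + mV[j]) * d[j];
        xd += x[j] * d[j];
    }
    return xd < a ? 'R' : 'V';
}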

[Figure: classes R and V and a second cut direction, the d2-line.]

Another way to view FAUST DI is as a Decision Tree Method. With each non-empty indefinite set, descend down the tree to a new level. For each definite set, terminate the descent and make the classification.

Each round, it may be advisable to go through an outlier-removal process on each class before setting Min{V o d} and Max{R o d} (e.g., iteratively check whether F^-1(Min{V o d}) consists of V-outliers).

[Figure: a 1-D sketch of r and v training points scattered along the d-line around mR and mV.]

Page 13:

FAUST DI. Given a K-class training set, TK, and a given d (e.g., from D ≡ MeanTK→MedianTK):

Let mi ≡ mean(Ci), with classes ordered so that d o m1 ≤ d o m2 ≤ ... ≤ d o mK, and let
Mni ≡ Min{d o Ci},  Mxi ≡ Max{d o Ci},  Mn>i ≡ Min over j>i of {Mnj},  Mx<i ≡ Max over j<i of {Mxj}.

Then Definite_i = ( Mx<i, Mn>i ) and Indefinite_{i,i+1} = [ Mn>i, Mx<i+1 ]. Then recurse on each Indefinite.
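A sketch of computing those definite and indefinite intervals from the per-class projection minima and maxima (plain C; array names and the open-ended sentinel values are assumptions):

#include <float.h>

/* Given per-class projection minima Mn[i] and maxima Mx[i] (classes already ordered
   by d o mean(Ci)), compute Definite_i = (Mx<i, Mn>i) for each class and
   Indefinite_{i,i+1} = [Mn>i, Mx<i+1] between neighbors. Sentinels +/-DBL_MAX stand
   in for "no class below/above".                                                     */
void faust_di_intervals(int K, const double *Mn, const double *Mx,
                        double *defLo, double *defHi,       /* K entries   */
                        double *indLo, double *indHi)       /* K-1 entries */
{
    for (int i = 0; i < K; i++) {
        double mxBelow = -DBL_MAX, mnAbove = DBL_MAX;
        for (int j = 0; j < i; j++)     if (Mx[j] > mxBelow) mxBelow = Mx[j];   /* Mx<i */
        for (int j = i + 1; j < K; j++) if (Mn[j] < mnAbove) mnAbove = Mn[j];   /* Mn>i */
        defLo[i] = mxBelow;
        defHi[i] = mnAbove;
        if (i < K - 1) {
            double mxUpToI = (Mx[i] > mxBelow) ? Mx[i] : mxBelow;               /* Mx<i+1 */
            indLo[i] = mnAbove;
            indHi[i] = mxUpToI;
        }
    }
}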

For IRIS, 15 records were extracted from each Class for Testing. The rest are the Training Set, TK. D = MEANs→MEANe.
Class means / Definite / Indefinite:
s-Mean 50.49 34.74 14.74 2.43   s: -1  25
e-Mean 63.50 30.00 44.00 13.50  e: 10  37   s_e: 25 10 (empty)
i-Mean 61.00 31.50 55.50 21.50  i: 48 128   e_i: 37 48
1st ROUND, D = Means→Meane:
F < 18          -> setosa (35 seto)
18 < F < 37     -> versicolor (15 vers)
37 <= F <= 48   -> IndefiniteSet2 (20 vers, 10 virg)
48 < F          -> virginica (25 virg)

IndefSet2 ROUND, D = Meane→Meani:
F < 7           -> versicolor (17 vers, 0 virg)
7 <= F <= 10    -> IndefSet3 (3 vers, 5 virg)
10 < F          -> virginica (0 vers, 5 virg)

IndefSet3 ROUND, D = Meane→Meani:
F < 3           -> versicolor (2 vers, 0 virg)
3 <= F <= 7     -> IndefSet4 (2 vers, 1 virg); here we will assign 0 <= F <= 7 -> versicolor
7 < F           -> virginica (0 vers, 3 virg)

100% accuracy.

Test: 1st ROUND, D = Means→Meane:
F < 15          -> setosa (15 seto)
15 < F < 15     -> versicolor (0 vers, 0 virg)
15 <= F <= 41   -> IndefiniteSet2 (15 vers, 1 virg)
41 < F          -> virginica (14 virg)

IndefSet2 ROUND, D = Meane→Meani:
F < 20          -> versicolor (15 vers, 0 virg)
20 < F          -> virginica (0 vers, 1 virg)

Option-1: The sequence of D's is: Mean(Classk)→Mean(Classk+1), k=1,... (and Mean could be replaced by VOM or ...?).

Option-2: The sequence of D's is: Mean(Classk)→Mean(union over h=k+1..n of Classh), k=1,... (and Mean could be replaced by VOM or ...?).

Option-3: D sequence: Mean(Classk)→Mean(union over h not yet used of Classh), where k is the Class with max count in the subcluster (VOM instead?).

Option-2 (variant): D sequence: Mean(Classk)→Mean(union over h=k+1..n of Classh) (VOM?), where k is the Class with max count in the subcluster.

Option-4: D seq.: always pick the means pair which are furthest separated from each other.

Option-5: D: start with Median-to-Mean of the IndefiniteSet, then the means pair corresponding to the max separation of F(meani), F(meanj).

Option-6: D: always use Median-to-Mean of the IndefiniteSet, IS (initially, IS=X).

Page 14:

FAUST DI sequential. For SEEDS, 15 records were extracted from each Class for Testing.

m1 14.4 5.6 2.7 5.1 4.4 d(m1,m2) DEFINITE INDEFINITEm2 18.6 6.2 3.7 6.0 3.4 d(m1,m3) 2 -inf 0m3 11.8 5.0 4.7 5.0 7.0 d(m2,m3) 1 106 0 12 0 106 0 F 106, 3 106 inf 23 0 106 so totally non-productive!

Option-4, means pair most separated in X.

Option-6: D Median-to-Mean of IndefSet (initially IS=X)m1 14.4 5.6 2.7 5.1 37.3 meanF1 DEFINITE Cl=1 2 3 INDEFINITEm2 18.6 6.2 3.7 6.0 71.2 meanF2 def3[ -inf 21) 0 0 32m3 11.8 5.0 4.7 5.0 `2.0 meanF3 def1[ 28 49) 22 0 0 ind1[ 21 28 )On whole TR def2[ 58 inf) 0 30 0 ind2[ 49 58 )

m1 13.0 5.1 3.7 5.0 30 avF1 DEFINITE INDEFINITE def3[ -inf 0 )m3 13.0 5.0 4.0 5.0 27 avF3 def1[ 37 inf ) in11[ 0 37 )On Indef-1

Cl=1 2 3 0 0 0 1 0 0

Cls1 outlier(F=54)

Cl=1 2 3 6 0 3

m1 13.2 5.2 4.0 5.0 9 avF1 DEFINITE INDEFINITE def3[ -inf 0 )m3 13.0 5.0 4.0 5.0 6 avF3 def1[ 13 inf ) in11[ 0 13 )On Indef-11

Cl=1 2 3 0 0 0 1 0 0

Cls1 outlier (F=29)

Cl=1 2 3 5 0 3

m1 13.0 5.2 3.6 5.0 13 avF1 DEFINITE INDEFINITE def3[ -inf 9 )m3 13.0 5.0 4.0 5.0 9 avF3 def1[ 19 inf ) in111[ 9 19 )On Indef-111

Cl=1 2 3 0 0 0 0 0 1

Cls3 outlier (F=0)

Cl=1 2 3 5 0 2

m1 13.0 5.2 3.6 5.0 13 avF1 DEFINITE INDEFINITE def3[ -inf 9 )m3 13.0 5.0 4.0 5.0 9 avF3 def1[ 19 inf ) in1111[ 9 19 )On Indef-1111

Cl=1 2 3 0 0 0 0 0 0 done! declare Class=1

Cl=1 2 3 5 0 2

Page 15:

FAUST DI sequential. For SEEDS, 15 records were extracted from each Class for Testing.

Option-6: D Median-to-Mean of Xm1 14.4 5.6 2.7 5.1 37.3 meanF1 DEFINITE Cl=1 2 3 INDEFINITEm2 18.6 6.2 3.7 6.0 71.2 meanF2 def3[ -inf 21) 0 0 32m3 11.8 5.0 4.7 5.0 `2.0 meanF3 def1[ 28 49) 22 0 0 ind31[ 21 28 )On whole TR def2[ 58 inf) 0 30 0 ind12[ 49 58 )

m1 13.0 5.1 3.7 5.0 30 avF1 DEFINITE INDEFINITEm3 13.0 5.0 4.0 5.0 27 avF3 def1[-inf 18 ). def3[ 55 inf ) in1313[ 18 55 )

Cl=1 2 3 1 0 0 0 0 0

Cl=1 2 3 . 6 0 3

D Mean(loF)-to-Mean(hiF) of IndefSet31

m1 12.8 5.2 3.2 5.0 18 avF1 DEFINITE INDEFINITEm3 13.0 5.0 4.0 5.0 10 avF3 def3[ -inf 10 ). def1[ 20 inf ) in313131[ 10 20 ) .

Cl=1 2 3 0 0 1 1 0 0

Cl=1 2 3 . 5 0 2

D Mean(loF)-to-Mean(hiF) of IndefSet1313

m1 13.0 5.2 3.6 5.0 4 avF1 DEFINITE INDEFINITEm3 13.0 5.0 3.5 5.0 2 avF3 def1[ -inf 0 ) def3[ 5 inf ) C1= [ 0 5 )

Cl=1 2 3 0 0 0 1 0 0The rest, Class=1

Cl=1 2 3 . 4 0 2

D Mean(loF)-to-Mean(hiF) of IndefSet313131 (d repeats after this so=C1

m1 16.2 6.0 1.8 5.2 5.8 avF1 DEFINITE INDEFINITEm2 16.6 6.0 4.6 6.0 6.2 avF2 def1[ -inf 2 ) def2[ 15 inf ) in1212[ 2 15 )

Cl=1 2 3 5 0 0 0 5 0

Cl=1 2 3 . 0 0 0

D Mean(loF)-to-Mean(hiF) of IndefSet12

[-inf, 21)class=3 [28, 49)class=2 [58.inf) class=3 d=(.,9, -,1, -.2, -.2) [21,28)ind31 d=(-.9, -.1, .14, -.1) [49, 58)ind12 d=(0, .31, -.9, 0)

[-inf,18)def [49, 58)ind23

Page 16:

GRADIENT(V)(d) = 2 A o d; written component-wise, the k-th entry is 2 a(k,k) d_k + 2 sum over j≠k of a(k,j) d_j.

Hill-climb: start at d0 = e_k where a(k,k) is a maximal diagonal element of A (or at d0 with components d0_k = a(k,k), normalized); then iterate d1 ≡ ∇V(d0), d2 ≡ ∇V(d1), ... (renormalized to unit length each step) until V(dk) stops improving.

FAUST CLUSTERING. Use DPPd(x), but which unit vector, d*, provides the best gap(s)?

1. Exhaustively search a grid of d's for the best gap provider. 2. Use some heuristic to choose a good d?

GV: Gradient-optimized Variance

MM: Use the d that maximizes |Median(F(X)) - Mean(F(X))|. We have Avg as a function of d. Median? (Can you do it?)
HMM: Use a heuristic for Median(F(X)): F(VectorOfMedians) = VOM o d.
MVM: Use D = MEAN(X)→VOM(X), d = D/|D|.

In the same notation as before:

V(d) ≡ Var(DPP_d(X)) = mean((X o d)^2) - (mean(X o d))^2
     = (1/N) sum_{i=1..N} ( sum_{j=1..n} x(i,j) d_j )^2 - ( sum_{j=1..n} mean(Xj) d_j )^2
     = sum_{j=1..n} ( mean(Xj^2) - mean(Xj)^2 ) d_j^2 + 2 sum_{j<k} ( mean(Xj Xk) - mean(Xj) mean(Xk) ) d_j d_k
     = sum_j a(j,j) d_j^2 + sum_{j≠k} a(j,k) d_j d_k = d^T o VX o d ≡ V,   subject to sum_i d_i^2 = 1,

where VX = ( a(i,j) ) with a(i,j) = mean(Xi Xj) - mean(Xi) mean(Xj), and X o d = F_d(X) = DPP_d(X) is the column of dot products (x1 o d, x2 o d, ..., xN o d).

Eight example F-value columns, 11 values each (the middle row is each column's median):

     0     0     0     0     0     0     0     0
     1     0     5     0     0     0     0     0
     2     0     5     2     0     0     0     0
     3     0     5     2     3     0     0     0
     4     0     5     4     3     6     0     0
     5     0     5     4     3     6     9     0   (median)
     6     0     5     6     6     6     9    10
     7     0     5     6     6     6     9    10
     8     0     5     8     6     9     9    10
     9     0     5     8     9     9     9    10
    10    10    10    10    10    10    10    10

std        3.16  2.87  2.13  3.20  3.35  3.82  4.57  4.98
variance  10.0   8.3   4.5  10.2  11.2  14.6  20.9  24.8
Avg        5.00  0.91  5.00  4.55  4.18  4.73  5.00  4.55

Consecutive differences (per column):
     1     0     5     0     0     0     0     0
     1     0     0     2     0     0     0     0
     1     0     0     0     3     0     0     0
     1     0     0     2     0     6     0     0
     1     0     0     0     0     0     9     0
     1     0     0     2     3     0     0    10
     1     0     0     0     0     0     0     0
     1     0     0     2     0     3     0     0
     1     0     0     0     3     0     0     0
     1    10     5     2     1     1     1     0

avgCD       1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
maxCD       1.00 10.00  5.00  2.00  3.00  6.00  9.00 10.00
|mean-VOM|  0.00  0.91  0.00  0.55  1.18  1.27  4.00  4.55

Maximize variance - is it wise?

Finding a good unit vector, d, for the Dot Product functional, DPP, to maximize gaps:

Mean(DPP_d(X)) = (1/N) sum_{i=1..N} sum_{j=1..n} x(i,j) d_j = sum_{j=1..n} mean(Xj) d_j,   subject to sum_i d_i^2 = 1.

Maximize, with respect to d, |Mean(DPP_d(X)) - Median(DPP_d(X))|.

How do we compute Median(DPP_d(X))? We want to use only pTree processing, and we want a formula in d and numbers only (like the one above for the mean, which involves only the vector d and the column means of X1, ..., Xn).

The MEDIAN-based measure (|mean - VOM| in the table above) picks out the last two columns, which have the best gaps (discounting outlier gaps at the extremes), and it discards columns 1, 3, and 4, which are not so good.

Page 17:

X(x1, x2): a table of the 15 example points z1, z2, z3, ..., zf and their scatter plot on an 11 x 11 grid (plot omitted).

The 15 Count_Arrays

z1 2 2 4 1 1 1 1 2 1

z2 2 2 4 1 1 1 1 2 1

z3 1 5 2 1 1 1 1 2 1

z4 2 4 2 2 1 1 2 1

z5 2 2 3 1 1 1 1 1 2 1

z6 2 1 1 1 1 3 3 3

z7 1 4 1 3 1 1 1 2 1

z8 1 2 3 1 3 1 1 2 1

z9 2 1 1 2 1 3 1 1 2 1

za 2 1 1 1 1 1 4 1 1 2

zb 1 2 1 1 3 2 1 1 1 2

zc 1 1 1 2 2 1 1 1 1 1 1 2

zd 3 3 3 1 1 1 1 2

ze 1 1 2 1 3 2 1 1 2 1

zf 1 2 1 1 2 1 2 2 2 1

The 15 Value_Arrays (one for each q=z1,z2,z3,...)

z1 0 1 2 5 6 10 11 12 14

z2 0 1 2 5 6 10 11 12 14

z3 0 1 2 5 6 10 11 12 14

z4 0 1 3 6 10 11 12 14

z5 0 1 2 3 5 6 10 11 12 14

z6 0 1 2 3 7 8 9 10

z7 0 1 2 3 4 6 9 11 12

z8 0 1 2 3 4 6 9 11 12

z9 0 1 2 3 4 6 7 10 12 13

za 0 1 2 3 4 5 7 11 12 13

zb 0 1 2 3 4 6 8 10 11 12

zc 0 1 2 3 5 6 7 8 9 11 12 13

zd 0 1 2 3 7 8 9 10

ze 0 1 2 3 5 7 9 11 12 13

zf 0 1 3 5 6 7 8 9 10 11

0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 Level0, stride=z1 PointSet (as a pTree mask)

z1 z2 z3 z4 z5 z6 z7 z8 z9 za zb zc zd ze zf

gap: [F=2, F=5]

FAUST Clustering, simple example: Gd(x)=xod Fd(x)=Gd(x)-MinG on a dataset of 15 image points

z13  1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
z12  0 0 0 0 0 1 0 0 0 0 0 0 0 0 1
z11  0 0 0 0 0 0 1 1 1 1 1 1 1 1 0

pTree masks of the 3 z1_clusters (obtained by ORing)

gap: [F=6, F=10]

(Figure tick labels: F, with p=MN and q=z1, takes the values F=0, F=1, F=2, ...)

Page 18

What have we learned? What is the DPP_d FAUST CLUSTER algorithm? D = Median - Mean, d1 ≡ D/|D| is a good start. But first, Variance-Gradient hill-climb it. (Median here means the Vector of Medians, VOM.)

For X2 = SubCluster2, use a d2 which is perpendicular to d1? In high dimensions there are many perpendicular directions.

GV hill-climb d2 = D2/|D2| (D2 = Median(X2) - Mean(X2)), constrained to be perpendicular to d1, i.e., constrained to d2 o d1 = 0 (in addition to d2 o d2 = 1). We may not want to constrain this second hill-climb to unit vectors perpendicular to d1; the gap might get wider using a d2 which is not perpendicular to d1.

X2=SubCluster2

SubCluster1

GMP: Gradient hill-climb (w.r.t. d) Var(DPP_d), starting at d2 = D2/|D2| where d2 ≡ Unitized( VOM{x - x o d1 : x in X2} - Mean{x - x o d1 : x in X2} ), variance-gradient hill-climbed subject only to d o d = 1.

GCCP: Gradient hill-climb (w.r.t. d) Var(DPP_d), starting at d2 = D2/|D2| where D2 = CCi(X2) - CCj(X2), hill-climbed subject to d o d = 1, where the CCs are two of the circumscribing rectangle's corners (the CCs may be faster to calculate than Mean and VOM).

Taking all edges and diagonals of CCR(X) (the Coordinate-wise Circumscribing Rectangle of X) provides a grid of unit vectors. It is an equi-spaced grid iff we use a CCC(X) (Coordinate-wise Circumscribing Cube of X). Note that there may be many CCC(X)s; a canonical one is the one that is furthest from the origin (take the longest side first, then extend each other side the same distance from the origin side of that edge).

A good choice may be to always take the longest side of CCR(X) as D, D ≡ LSCR(X).

Should outliers on the (n-1)-dimensional faces at the ends of LSCR(X) be removed first? If so, remove LSCR(X)-endface outliers until, after removal, the same side is still the LSCR(X); then use that LSCR(X) as D.

(We shouldn't constrain the 2nd hill-climb to d1 o d2 = 0, nor subsequent hill-climbs to dk o dh = 0, h = 2..k-1, since the gap could be larger without the constraint. So the 2nd round starts at d2 ≡ Unitized( VOM{x - x o d1 : x in X2} - Mean{x - x o d1 : x in X2} ) and hill-climbs subject only to d o d = 1. A sketch of the hill-climb follows.)
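A rough sketch of the variance-gradient (GV) hill-climb under d o d = 1 (hypothetical NumPy; the step size and stopping rule are our assumptions, not settings from the slides). The gradient of V(d) = d^T V_X d is 2 V_X d; it is projected onto the tangent of the unit sphere and the result renormalized.

```python
import numpy as np

def gv_hill_climb(X, d0, step=0.1, iters=100, tol=1e-9):
    """Hill-climb V(d) = d^T V_X d over unit vectors d, starting from d0."""
    V_X = np.cov(X, rowvar=False, bias=True)
    d = np.asarray(d0, dtype=float)
    d = d / np.linalg.norm(d)
    var = d @ V_X @ d
    for _ in range(iters):
        grad = 2.0 * V_X @ d                # gradient of d^T V_X d
        grad -= (grad @ d) * d              # project onto the tangent of the unit sphere
        d_new = d + step * grad
        d_new /= np.linalg.norm(d_new)      # stay on d o d = 1
        var_new = d_new @ V_X @ d_new
        if var_new <= var + tol:
            break                           # no further improvement
        d, var = d_new, var_new
    return d, var

# d0 could be the MVM start (Mean - VOM), a corner difference (GCCP), or a cube corner (UCUC)
```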

Page 19

WINEF-MN Ct gp8 0 1 12 12 1 3 15 2 13 28 1 2 30 1 2 32 2 2 34 1 1 35 2 3 38 1 8 46 1 1 47 3 10 57 1 1 58 1 1 59 1 1 60 1 2 62 1 2 64 1 1 65 1 1 66 1 1 67 4 1 68 2 1 69 1 1 70 1 2 72 3 1 73 1 1 74 3 1 75 2 1 76 1 1 77 1 2 79 1 3 82 1 1 83 1 1 84 2 1 85 1 1 86 1 2 88 2 1 89 4 1 90 2 1 91 1 1 92 6 1 93 3 1 94 5 1 95 4 2 97 5 1 98 2 1 99 1 1100 4 1101 7 1102 4 1103 2 1104 3 1105 6 1106 3 1107 8 1108 10 1109 2 1110 4 1111 5 1112 2 1113 4 1114 1

[0.12) 1L 0H

GV MVM

___ _ [12,28) 1L 2H

___ _ [28,46) 2L 6H

___ _ [46,57) 2L 2H

___ [57,115) 51L 83H C1

C1 F-M Ct g3 0 1 1 1 2 1 2 2 3 5 1 1 6 1 1 7 4 1 8 2 2 10 2 1 11 1 2 13 2 1 14 1 1 15 1 1 16 5 2 18 1 2 20 2 3 23 1 1 24 1 1 25 1 1 26 2 2 28 1 1 29 1 1 30 5 1 31 2 1 32 1 1 33 4 1 34 5 1 35 4 1 36 4 1 37 2 1 38 3 1 39 3 1 40 2 1 41 4 1 42 3 1 43 5 1 44 3 1 45 4 1 46 5 1 47 4 1 48 3 1 49 11 1 50 5 1 51 3 1 52 5 1 53 4 1 54 4 1 55 1

3L 2H

C11 10L 13H C12

C11 F-MN gp2 0 1 1 1 1 1 2 3 1 3 3 2 5 2 1 6 1 2 8 2 210 2 111 1 112 4 113 1 215 2

_4L 2H

_2L 1H_0L 2H

3L 5H1L 1H

C12 F-M gp2 0 1 1 1 8 1 2 3 1 3 2 1 4 4 1 5 11 1 6 8 1 7 2 1 8 6 1 9 4 110 3 111 4 112 4 113 5 114 3 115 3 116 4 218 2 119 5 120 6 121 4 122 1 123 2 124 3 125 3 328 2 129 2 231 1

29L 46H

7L 19H

2L 2H0L 1H

C121 max thin 0 1 1 1 6 1 2 5 1 3 3 1 4 3 1 5 8 1 6 8 1 7 4 1 8 7 1 9 3 110 1 111 5 112 6 113 3 114 2 115 3 116 3 117 4

23L 25H 6L 21H

.07 .15 .98 .12 588-.01 .26 .97 .00 608(F-MN) gp8 0 1 1 1 4 1 2 4 1 3 5 1 4 4 1 5 6 1 6 8 1 7 6 1 8 4 1 9 5 1 10 2 1 11 3 1 12 7 1 13 4 1 14 3 1 15 2 1 16 2 1 17 3 1 18 4 1 19 3 1 20 4 1 21 1 1 22 7 1 23 2 1 24 4 1 25 1 1 26 1 1 27 1 1 28 1 1 29 1 1 30 1 1 31 1 1 32 1 3 35 1 2 37 3 1 38 1 1 39 1 1 40 3 1 41 3 3 44 2 1 45 2 1 46 4 1 47 2 2 49 1 2 51 1 1 52 1 3 55 1 1 56 1 1 57 1 9 66 2 1 67 2 8 75 1 4 79 2 1 80 1 2 82 2 1 83 1 2 85 1 13 98 1 2100 1 3103 1 11114 1

51L 83H [0.66) C1

___ _ [66,75) 2L 2H

___ _ [75,98) 2L 6H

___ _ [98,115) 2L 2H

.11 .19 .96 .19 209-.02 .41 .91 0 232C1(F-MN) gp3 0 1 1 1 6 1 2 5 1 3 2 1 4 4 1 5 8 1 6 8 1 7 4 1 8 3 1 9 7 110 1 111 4 112 6 113 4 114 2 115 3 116 3 117 2 118 2 119 3 120 4 121 6 122 4 123 1 124 2 125 4 126 1 127 1 229 2 130 2 232 1 335 1 136 1 137 1 138 1 139 4 140 2 242 2 244 1 145 2 247 4 148 2 149 1 150 1 353 1 154 2 155 2

[0.35) C11 38L 68H

[35,53) C12 10L 13H

___ [53,56) 3L 2H

-.05 -.31 -.95 -.01 605 .01 -.27 -.96 -.0 608XF-M gp3 0 1 1111 1 415 1 116 1 1329 1 130 1 232 2 234 1 135 2 439 1 847 2 148 2 957 1 158 1 159 1 261 1 263 1 265 1 166 1 167 1 168 5 169 2 170 1 373 3 174 3 175 1 176 2 177 2 279 1 382 1 183 1 184 1 185 1 186 1 187 1 188 1 189 1 190 4 191 2 192 7 193 1 194 5 195 4 196 2 197 3 198 2 199 2 1100 3 1101 4 1102 7 1103 3 1104 2 1105 6 1106 3 1107 5 1108 9 1109 6 1110 4 1111 5 1112 4 1113 4 1114 1

2L 0H C1

_0L 2H C2

_2L 5H C3_0L 1H C4

_0L 1H C4

_9L 7H C5

4L 8H C6

38L 68H C7

-.11 -.02 -.86 .5 43-.05 -.4 -.92 .01 68C7F-M*3 g3 0 1 3 3 1 2 5 1 4 9 2 615 1 318 1 220 1 121 1 122 3 325 2 328 1 230 3 131 2 132 1 133 1 134 2 135 1 136 3 339 2 140 1 141 2 344 1 246 3 248 1 250 2 252 1 153 1 154 1 155 1 156 2 157 1 158 2 159 2 160 2 161 1 162 1 163 2 265 1 166 1 167 1 168 1 169 3 170 1 171 1 172 1 173 1 174 2 175 2 176 1 177 3 178 4 179 3 180 4 181 1 182 1 183 1 285 2 186 2 288 3 189 2 291 2 293 4 396 1

_2L 4H C71

___ 4L 2H___ 0L 2H

_2L 12H

_1L 4H

___ 28L 44H C76 1L

.01 -.27 -.96 -.01 23-.04 -.43 -.9 .03 24C76*4 g3 0 1 3131 1 334 1 135 2 237 1 239 2 241 1 243 1 144 1 246 3 349 1 150 1 151 2 152 1 153 2 255 2 257 2 360 1 262 2 163 1 265 3 166 2 369 1 170 1 171 2 273 2 174 1 175 2 176 3 177 2 178 2 179 3 180 3 282 1 284 1 286 2 187 1 188 1 290 2 191 2 394 2 296 2 197 2

___ _1L 0H___ _0L 1H

4L8H C763

___ _ 2L 9H

___ _ 1L 8H

17L 15H C766

___ _ 3L 3H

-.21 .34 -.91 .9 8C766 *16 g4 0 1 3030 1 232 1 739 1 140 1 141 1 142 1 446 1 248 1 250 2 555 1 358 1 765 1 267 1 572 1 375 2 277 1 178 4 280 1 383 1 184 2 488 1 189 1 11100 1 4104 1 11115 1

_0L 1H

_2L 0H

_3L 1H

_1L 3H

_2L 0H

_2L 0H

_4L 8H

_2L 0H

_0L 2H_1L 0H

.19 .8 -.54 .18 7-.21 .7 -.7 -.09 9 C763F-M*8 g8 0 2 1616 1 1329 1 1241 2 445 1 752 1 456 1 763 1 871 2

0L 2H

_2L 0H

_2L 4H_0L 2H

-.08 .59 -.8 -.07 80 .08 .83 -.56 -.01 95 C5 g3 0 1 4 4 1 812 1 315 1 217 1 219 1 423 1 124 1 226 1 127 1 229 3 231 1 132 1 133 1

_3L 0H

_1L 2H

5L 5H

.05 .59 -.293 .75 18-.1 .9 -.3 .1 34 C6*8 16 0 1 4 4 2 16 20 1 11 31 1 37 68 1 15 83 1 15 98 1 8106 1 11117 1 1118 2

_1L 2H

_2L

_1L 6H

ACCURACY    WINE
GV          62.7
MVM         66.7
GM          81.3

GM

Page 20

SEEDS219 31 14 29 akk d1 d2 d3 d4 V(d.98 .14 .06 .13 9.98 .14 .06 .13 910(F-MN) gp6 0 2 1 1 10 1 2 5 1 3 1 6 9 3 1 10 10 1 11 10 1 12 2 6 18 2 1 19 3 1 20 7 1 21 2 1 22 1 1 23 3 6 29 6 1 30 4 1 31 7 1 32 1 6 38 1 1 39 2 1 40 6 1 41 5 1 42 1 7 49 3 1 50 1 2 52 7 1 53 2 7 60 1 2 62 4 1 63 3 8 71 5 1 72 2 2 74 1 6 80 5 1 81 8 1 82 5 1 83 3 9 92 2 10102 1 1103 2 1104 1

___ ___ [0,9) 0k 0r 18c C1

___ ___ [9,18) 1k 0r 24c C2

___ ___ [18,29) 10k 0r 8c C3

___ ___ [29,38) 18k 0r 0c C4

___ ___ [38,49) 13k 2r 0c C5

___ ___ [49,60) 7k 6r 0c C6

___ ___ [60,71) 1k 7r 0c C7

___ ___ [71,80) 0k 8r 0c C8

___ ___ [80,92) 0k 21r 0c C9___ ___ [92,102) 0k 2r 0c Ca

___ ___ [102,105) 0k 4r 0c Cb

C3 .97 .15 .09 .14 0 0 .07 1 0 4 10F-M g9 0 2 1010 3 1020 3 1030 4 131 1 940 1 1050 1 1161 1 970 2

___ ___ [0,10) 2k 0r 0c ___ ___ [10,20) 2k 0r 1c ___ ___ [20,30) 2k 0r 1c

___ ___ [30,40) 4k 0r 1c ___ ___ [40,50) 0k 0r 1c ___ ___ [50,61) 0k 0r 1c ___ ___ [61,70) 0k 0r 1c ___ ___ [70,71) 0k 0r 2c

___ ___ [0,22) 4k 0r 0c

___ ___ [22,49) 3k 6r 0c

C6 10(F-M) g12 0 3 10 10 1 12 22 3 10 32 3 9 41 2 7 48 1

256 36 10 32 akk.98 .14 .04 .12 0.00 -.00 .96 .29 3

GV MVM10(F-MN)gp6 0 2 1 1 10 1 2 5 1 3 1 6 9 3 1 10 10 1 11 10 1 12 2 6 18 2 1 19 3 1 20 7 1 21 2 1 22 1 1 23 3 6 29 6 1 30 4 1 31 7 1 32 1 6 38 1 1 39 2 1 40 6 1 41 5 1 42 1 7 49 3 1 50 1 2 52 7 1 53 2 7 60 1 2 62 4 1 63 3 8 71 5 1 72 2 2 74 1 6 80 5 1 81 8 1 82 5 1 83 3 9 92 2 10102 1 1103 2 1104 1

___ ___ [0,9) 0k 0r 18c C1

___ ___ [9,18) 1k 0r 24c C2

___ ___ [18,29) 10k 0r 8c C3

___ ___ [29,38) 18k 0r 0c C4

___ ___ [38,49) 13k 2r 0c C5

___ ___ [49,60) 7k 6r 0c C6

___ ___ [60,71) 1k 7r 0c C7

___ ___ [71,80) 0k 8r 0c C8

___ ___ [80,92) 0k 21r 0c C9___ ___ [92,102) 0k 2r 0c Ca

___ ___ [102,105) 0k 4r 0c Cb

C3200(F-MN)gp12 0 2 12 12 3 12 24 3 12 36 5 12 48 1 12 60 1 12 72 1 40112 2

___ ___ [0,35) 8k 0r 0c ___ ___ [35,48) 2k 0r 3c

___ ___ [48,72) 0k 0r 2c

___ ___ [72,113) 0k 0r 3c

C6200(F-MN)gp12 0 3 1212 1 3850 3 1060 1 262 3 1274 2

___ ___ [0,50) 4k 0r 0c ___ ___ [50,60) 1k 0r 2c

___ ___ [60,74) 1k 0r 3c ___ ___ [74,75) 1k 0r 1c

.794 -.403 -.304 .337 60.957 .156 -.205 .132 910(F-MN) gp3 0 1 2 2 1 2 4 4 2 6 3 2 8 7 2 10 2 2 12 1 2 14 1 2 16 10 2 18 10 1 19 2 3 22 2 1 23 2 1 24 1 1 25 1 2 27 4 2 29 4 2 31 4 2 33 2 5 38 3 1 39 3 2 41 7 2 43 2 2 45 2 1 46 1 2 48 1 1 49 1 1 50 4 2 52 5 1 53 1 1 54 3 3 57 2 2 59 3 2 61 3 1 62 1 2 64 3 2 66 3 3 69 5 7 76 1 2 78 2 2 80 2 2 82 4 2 84 1 2 86 1 2 88 4 1 89 1 1 90 8 2 92 5 11103 2 1104 1 1105 1 1106 1 2108 1

___ ___ [0,22) 0k 0r 42c C1

GM

___ ___ [22,33) 10k 0r 8c C2

___ ___ [33,57) 33k 2r 0c C3

___ ___ [57,69) 6k 9r 0c C4___ ___ [69,76) 1k 4r 0c C6

___ ___ [76,103) 0k 26r 0c C7

___ ___ [103,109) 0k 6r 0c C8

-.577 .577 .577 .000 1 .119 .112 .986 .000 3C2: 10(F-MN) gp10 0 1 1010 2 111 3 1021 3 1031 5 1041 1 1051 1 1162 1 163 1

___ ___ [0,31) 9k 0r 0c C21___ ___ [31,41) 1k 0r 4c C22

___ ___ [41,64) 0k 0r 4c C23

-.832 -.282 .134 -.458 0-.44 .00 -.87 -.22 2C4: 10(F-MN) gp21

0 3 1111 2 2031 3 2152 3 2779 1 2099 3

___ ___ [0,52) 1k 7r C41___ ___ [52,79) 1k 2r C42

___ ___ [79100) 4k 0r C43

ACCURACY    SEEDS   WINE
GV          94      62.7
MVM         93.3    66.7
GM          96      81.3

Page 21

IRIS GM.88 .09 -.98 -.18 168-.29 .13 -.88 -.36 417-.36 .09 -.86 -.36 420F-MN Ct gp5 0 1 3 3 2 1 4 1 2 6 1 1 7 1 2 9 2 110 1 212 3 113 1 114 3 115 4 116 2 117 3 118 1 119 6 120 3 121 1 122 2 123 2 124 6 125 7 126 2 127 3 128 2 129 6 130 3 131 2 132 3 133 3 134 3 135 3 136 5 137 1 138 2 139 1 140 2 141 1 243 1 245 1 146 1 147 1 552 1 860 2 161 3 162 4 163 3 164 13 165 12 166 4 167 5 168 2 270 2

___50e 49i C1

___ 50s 1i C2

-.36 .09 -.86 -.36 105-.54 -0.17 -.76 -.33 118C1 2*(F-M g3 0 2 4 4 1 1 5 1 1 6 1 511 1 213 1 316 1 218 1 321 1 122 1 123 1 225 2 126 2 228 2 129 1 231 3 132 1 234 1 135 2 136 2 137 4 340 2 141 1 243 3 245 1 247 4 148 1 149 2 150 4 151 3 253 5 154 2 155 2 156 1 157 3 259 3 261 2 263 1 164 1 165 2 267 1 168 1 169 2 170 2 171 3 172 1 173 2 275 1 176 1 177 1 178 1 179 1 180 1 282 2 1092 1 294 2 296 1

___28i C11

_46e 21i C12

___ 4e C13

.81 .28 -.28 .42 13...

.53 .23 .73 .37 39C12 4*F-M g3 0 2 4 4 1 4 8 2 210 1 212 1 214 1 317 1 118 1 220 1 121 1 122 1 224 3 125 1 227 1 128 1 230 1 434 1 236 2 238 1 341 2 344 1 246 2 248 1 250 1 252 2 254 1 155 1 156 1 157 1 158 3 159 1 160 2 262 1 163 4 265 1 166 1 167 1 168 1 169 3 372 1 274 1 276 1 278 1 179 1 382 1 284 1 185 1 489 1 190 1 292 1 193 1

___ 19e 1i

___ 6e 0i

18e 11i C123

___ 3e 2i

___ 0e 3i

___ 0e 4i

-.034 .37 -.31 .87 4 C123 12*F-M g4 0 1 6 6 1 1016 1 218 1 321 1 122 1 123 1 629 1 332 1 335 1 540 2 545 1 449 1 150 2 454 1 256 1 561 2 162 1 264 1 165 1 267 1 370 1 171 1 1283 1 184 1 185 1

__ 1i .___ 1e .

___ 9e 1i .

_ 4e .

__ 0e 2i .

___ 2e 6i .

___ 2e 1i

MVMF-MN gp8 0 2 3 3 5 1 4 5 1 5 14 1 6 11 1 7 6 1 8 1 1 9 5 110 1 515 1 823 1 225 2 227 1 229 1 1..68 1

50s 1i C1 C2

MVMC2 2(F-)g4 0 1 4 4 1 1 5 1 4 9 1 3...69 1 473 1 174 1 276 2 480 1 484 1 286 2 591 1

___ 3e

47e 40i C22

___ 0e 11i

C22 4(F-) g4 0 1 6 6 1 4 10 1 2 12 1 4 ... 33 2 1 34 1 4 38 1 1 39 1 3 ... 79 1 2 81 1 5 86 1 2 88 2 2 90 1 1 91 1 1 92 2 2 94 1 1 95 1 2 97 1 1 98 1 3101 2 1102 2 4106 1 1107 1 2109 1 1110 2 1111 2 6117 1 1118 1 1119 1 1120 1

___ 18e C221 29e 14i___ ___

___ 26i C221 8F-)g5 0 1 7 7 1 4 11 1 5 16 1 1 17 1 3 20 1 1 21 1 2 23 1 1 24 1 5 29 1 3 32 2 2 34 1 1 35 1 4 39 3 5 44 1 3 47 2 3 50 1 3 53 1 4 57 1 3 60 1 3 63 1 1 64 2 5 69 2 1 70 1 3 73 1 1 74 1 1 75 1 4 79 1 1 80 2 2 82 2 1 83 1 1 84 1 2 86 1 4 90 1 5 95 1

___ 3e . ___ 5e 1i ___9e .

__9e 2i

___ 5e 11i

.90 .24 .37 .04 180

.41 -.04 .84 .35 418

.36 -.08 .86 .36 420F-MN Ct gp3 0 2 2 2 2 1 3 2 1 4 5 1 5 7 1 6 16 1 7 6 1 8 4 1 9 4 110 2 818 1 523 1 225 2 227 1 229 1 130 1 131 1 132 2 133 1 134 3 135 5 136 4 137 3 138 1 139 4 140 3 141 3 142 4 143 4 144 2 145 5 146 7 147 3 148 2 149 1 150 3 151 4 152 3 153 2 154 3 155 3 156 3 157 1 158 4 361 2 162 1 163 1 265 1 166 1 167 2 370 1

___50s 1i C1

___ 50e 40i C2 9i C3

C23762 808 2260 266 d1 d2 d3 d4.84 .18 .51 .06 64.57 .22 .71 .34 82.51 .22 .74 .38 83(F-MN)*3 Ct gp3 0 1 2 2 1 1 3 1 2 5 1 15 20 2 3 23 1 3 26 2 2 28 1 1 29 1 2 31 1 2 33 2 2 35 2 2 37 1 1 38 3 1 39 1 1 40 1 1 41 1 1 42 1 4 46 1 1 47 2 2 49 1 1 50 1 1 51 1 2 53 1 1 54 2 2 56 1 2 58 1 1 59 2 2 61 2 1 62 2 1 63 3 1 64 1 1 65 2 2 67 3 1 68 2 1 69 1 1 70 2 1 71 2 1 72 2 2 74 1 1 75 1 2 77 1 1 78 1 1 79 1 2 81 1 2 83 1 1 84 1 3 87 2 1 88 1 1 89 1 1 90 2 1 91 1 1 92 2 3 95 1 1 96 2 1 97 1 2 99 1 2101 2 2103 1 3106 3 3109 1 1110 1 1111 1

___ 4e 1i C21

___ 19e 1i C22

___ 27e 16i C23

___ 9i C24

___ 8i ___ 3i

___ _3i

C23 F-M*3 g33847 818 2284 257.96 .22 .06 -.14 15 0 1 6 6 1 2 8 1 4 12 1 3 15 1 1 16 1 2 18 2 8 26 1 2 28 1 1 29 1 1 30 1 2 32 1 1 33 1 3 36 1 3 39 2 1 40 1 1 41 2 1 42 2 2 44 2 2 46 1 1 47 2 5 52 1 1 53 1 3 56 1 1 57 1 3 60 1 1 61 1 1 62 1 2 64 1 6 70 2 5 75 1 2 77 2 3 80 1 9 89 1 8 97 1

___ 1e 0i

__2e 5i

___ 16e 11i

___ 2e

GV

___ 6e

C221 8F- g5 0 1 7 7 1 4 11 1 5 16 1 1 17 1 3 20 1 1 21 1 2 23 1 1 24 1 5 29 1 3 32 2 2 34 1 1 35 1 4 39 3 5 44 1 3 47 2 3 50 1 3 53 1 4 57 1 3 60 1 3 63 1 1 64 2 5 69 2 1 70 1 3 73 1 1 74 1 1 75 1 4 79 1 1 80 2 2 82 2 1 83 1 1 84 1 2 86 1 4 90 1 5 95 1

___1e ___ 2e

__ 4e 1i

___9e

__9e 2i

___ 5e 10i ___ 1i

ACCURACY    IRIS    SEEDS   WINE
GV          82.7    94      62.7
MVM         94      93.3    66.7
GM          94.7    96      81.3

Page 22

F-m/8 g4 C2 0 1 2 2 1 1 3 1 2 5 2 3 8 1 210 1 111 1 516 1 218 1 523 1 124 1 125 2 126 2 127 1 229 4 130 2 131 2 132 1 133 3 2... 1s65 1

CONCRETE

g4 F-MN/8 0 1 2 2 1 2 4 1 2 6 1 1 7 1 1 8 1 210 1 111 1 112 1 113 1 316 2 319 1 221 1 526 1 127 1 128 2 129 2 130 1 232 5 133 2 134 2 135 1 136 3 137 3 138 3 139 5 140 3 141 7 142 6 143 3 144 5 145 1 146 3 147 3 148 4 149 7 150 4 151 6 152 10 153 3 154 4 155 8 156 5 157 3 158 7 159 2 160 2 161 1 162 2 264 1 165 2 166 1 167 2

___14M 0H C1 C2

C21 0L 8M 0H C22 2M 0H C23

___ ___ . 1H 2M

C23 g3 F-M/8 0 2 2 2 1 1 3 3 1 4 3 1 5 1 1 6 1 1 7 6 1 8 1 1 9 8 110 2 111 6 112 2 113 5 114 2 115 2 318 1 119 7 120 1 121 3 122 1 123 2 124 4 125 1 227 8 128 9 129 4 231 2 132 1 133 3 134 3 236 7 137 12 239 1 140 1 141 1 142 6 648 1 250 2

GV

C231 g4 F-M/8 0 1 7 ... 1s12 1 214 6 115 7 419 1 120 3 121 3 122 2 123 1 225 1 227 1 229 1 130 1 131 1 233 1 639 1 342 1 446 1 1056 2

_30L 8H_ . 3L 2M

2L 2M 1H

___1L 4M 3H ___1L 1M 4H

___ 3L 2M 18H

___4L 2M 8H

___1L 2H

C232 g2 F-M/8 0 1 1 1 1 1 2 2 1 3 1 2 5 2 1 6 1 1 7 2 1 8 2 1 9 1 716 1 117 3 118 2 220 7 121 8 122 7 123 1 225 2 126 3 127 2 128 3 129 1 130 1 131 2 233 1 134 4 135 3 338 3 139 8 1150 2 151 1

0L 32M 13H11L 13M 54H

0 1 1 1 1 4 5 1 1 ... 1s 46 4 3 49 1 7 56 1 2 58 1 3 61 1 4 65 1 1 66 1 3 69 1 2 71 1 6 77 1 3 80 1 3 83 1 3 86 1 14100 1 3103 1 2105 1 3108 2 4112 1

___ 2M

C1 43L 33M 55H

___ 7M C2

___ 4M C3

___6M C4

MVM

(F-)/4 gp4

0 1 1 1 1 7 8 1 412 1 416 1 218 1 220 2 121 2 2 ... 1s+2s71 2 273 1 174 1 276 2 278 2 482 2 284 1 690 2 898 1 9107 1 16123 1

C1F-/4 g4

___ 4M .

C11 43L 23M 53H

___6M 2H

MVM C11 F-/4 g4 0 4 2 2 1 2 4 4 2 6 25 2 8 2 1 9 7 110 4 111 9 213 3 114 6 115 4 116 1 319 5 423 2 326 5 127 4 128 9 129 5 231 6 132 5 335 6 540 2

C111 3L 23M 49H

___ 30L 1M 4H

C111F-/4 g4 0 1 1616 3 117 2 118 9 119 3 221 5 627 3 128 5 129 14 130 1 838 2 240 15 141 3 445 3 247 2 1966 3 2187 1

___ __1L

1L 21M

__ 1L 2M 20H

___ __ 31H

___ 8H ___ 2M 9H

GM

X g4 (F-MN)/8 0 2 2 2 1 2 4 2 1 5 1 3 8 2 3 11 1 1 12 3 2 14 4 1 15 3 1 16 3 1 17 2 1 18 3 1 19 6 1 20 3 1 21 3 1 22 2 1 23 5 1 24 4 1 25 3 1 26 6 1 27 3 1 28 1 1 29 6 1 30 3 1 31 2 1 32 3 1 33 3 1 34 1 2 36 3 1 37 1 1 38 2 1 39 3 1 40 5 1 41 1 1 42 6 1 43 1 1 44 3 2 46 5 1 47 1 1 48 3 1 49 1 1 50 2 1 51 1 1 52 1 1 53 1 1 54 1 1 55 1 1 56 3 1 57 3 2 59 1 2 61 1 1 62 3 3 65 2 9 74 1 4 78 1 3 81 1 2 83 1 3 86 1 2 88 1 2 90 1 1 91 1 4 95 1 2 97 1 1 98 1 2100 1 4104 1 3107 1

C2-.6 .2 -.07 .771 6882..-.72 .19 -.40 .54 9251.38 .14 -.79 .46 11781

C2 gp8 (F-MN)/5 0 2 2 2 1 2 4 2 1 5 1 3 8 2 3 11 1 1 12 2 2 14 4 1 15 3 1 16 3 1 17 2 1 18 3 1 19 6 1 20 3 1 21 3 1 22 1 1 23 5 1 24 3 1 25 3 1 26 6 1 27 3 1 28 1 1 29 6 1 30 3 1 31 2 1 32 1 1 33 3 1 34 1 2 36 3 2 38 2 1 39 2 1 40 5 1 41 1 1 42 6 1 43 1 1 44 3 2 46 5 1 47 1 1 48 1 1 49 1 1 50 2 1 51 1 1 52 1 1 53 1 1 54 1 1 55 1 1 56 3 1 57 2 2 59 1 2 61 1 1 62 3 3 65 2 9 74 1 4 78 1 8 86 1 2 88 1 2 90 1 5 95 1 2 97 1 1 98 1 2100 1 4104 1

43L 28M 55H C21

0L 10M 0H C22

C21 g4 F-M/4 0 1 1 1 1 3 4 1 3 7 2 1 8 2 1 9 1 211 1 213 4 114 2 115 4 116 1 218 2 119 3 120 1 121 2 122 6 224 2 125 3 126 1 228 2 230 1 131 1 233 1 437 1 138 2 139 2 140 1 141 1 142 1 143 2 144 1 145 2 146 1 147 1 148 1 149 2 251 2 455 1 156 8 157 4 158 4 159 2 160 1 161 1 263 5 265 1 267 2 168 1 371 1 172 4 173 8 174 5 175 1 883 3 184 3 185 2 186 199 3

C21132L 13M 0H

C2127L 3M 10H

C2134L 7M 38H

C2140L 5M 7H

C211 g5 F-M)/4 0 1 6 6 2 1 7 2 512 1 113 4 114 1 115 4 217 1 118 2 119 2 221 2 122 3 123 1 124 3 428 1 1442 1 244 1 145 1 348 2 250 1 555 1 257 1 158 1 563 1 164 1 771 1 1182 1 1698 2

___5L .

__20L 5M .

___ 5L 1M .

___ 2L 1M .

___5L 1M

C212 g5 F-M/3 0 1 20 20 1 8 28 1 1 29 2 9 38 1 11 49 1 5 54 1 11 65 1 10 75 2 3 78 1 11 89 1 7 96 1 2 98 1 2100 1 11111 2 1112 1

__6L 3M . __1L 2H

___ 8H43L 38M 55H C2

0L 14M 0H C1

ACCURACY    CONCRETE   IRIS    SEEDS   WINE
GV          76         82.7    94      62.7
MVM         78.8       94      93.3    66.7
GM          83         94.7    96      81.3

Page 23

GV 0.11 0.09 0.03 0.14 20.27 0.86 0.33 0.27 731.00 0.00 0.00 0.00 50.29 0.84 0.36 0.29 720.26 0.87 0.32 0.26 730.00 1.00 0.00 0.00 560.25 0.88 0.31 0.25 730.00 0.00 1.00 0.00 80.29 0.84 0.36 0.29 720.26 0.87 0.32 0.26 730.00 0.00 0.00 1.00 50.29 0.84 0.36 0.29 720.26 0.87 0.32 0.26 731.00 1.00 0.00 0.00 930.26 0.87 0.32 0.26 731.00 0.00 1.00 0.00 270.29 0.84 0.36 0.29 720.26 0.87 0.32 0.26 731.00 0.00 0.00 1.00 220.29 0.84 0.36 0.29 720.26 0.87 0.32 0.26 731.00 1.00 1.00 0.00 1540.27 0.87 0.33 0.27 731.00 1.00 0.00 1.00 1410.26 0.87 0.33 0.26 731.00 0.00 1.00 1.00 570.29 0.84 0.36 0.29 720.26 0.87 0.32 0.26 730.00 1.00 1.00 1.00 1540.27 0.87 0.33 0.27 731.00 1.00 1.00 1.00 2160.27 0.86 0.33 0.27 73

1.00 0.00 0.00 0.00 230.71 0.23 0.66 0.01 47C1 g3 400*F-M 0 1 1 1 1 6 7 1 310 2 212 3 214 3 115 1 318 1 220 1 222 3 426 1 329 1 332 1 133 1 235 1 237 2 239 1 140 1 545 2 247 1 148 2 149 1 251 1 152 2 153 2 154 2 256 1 258 3 159 1 160 1 262 2 163 1 164 2 367 4 168 1 169 2 170 1 373 1 275 2 176 2 278 1 179 2 281 1 182 1 183 1 1

...97 1

ABALONE MVM GM

X g2 100(F-M) 3 2 3 6 1 2 8 1 1 9 2 312 1 315 2 116 1 218 2 119 1 120 2 121 3 122 2 123 1 124 6 125 1 126 1 228 3 129 2 130 2 232 3 133 2 134 3 135 5 136 4 137 4 138 3 139 5 140 3 141 2 142 1 143 2 144 3 145 4 146 2 147 3 148 3 149 1 150 3 151 1 152 1 153 7 154 4 155 3 156 3 157 4 158 2 159 1 160 3 161 4 162 2 264 2 165 1 166 1 268 3 169 2 170 1 474 1 276 1 379 2 180 2 383 2 285 1 489 1 13102 1

0.39 0.57 0.10 -0.72 0.210.57 0.44 0.09 -0.69 0.240.77 0.61 0.17 0.01 2.190.58 0.48 0.17 0.64 3.80.55 0.46 0.16 0.68 3.81

ACCURACY    CONC   IRIS   SEEDS   WINE   ABAL
GV          76     83     94      63     73
MVM         79     94     93      67     79
GM          83     95     96      81     81

2L 0M 0H _

7L 3M 0H _

6L 8M 0H _

2L 21M 1H _

1L 7M _

1L 19M 1H _ 1.0 .00 .00 .00 10

.62 .41 .13 .65 46

.33 .29 .13 .89 56C2 g3 300*F-M 0 1 8 8 1 1 9 1 211 1 112 1 113 3 114 1 216 2 117 1 118 3 220 2 121 1 324 1 125 1 227 2 128 1 129 2 130 1 131 1 233 2 134 1 135 1 237 1 138 3 139 1 140 1 545 1 146 1 248 1 654 1 458 1 159 1 362 1 163 1 164 1 468 1 169 1 1483 1 386 1 23109 1

13M 5H _

12M 7H _

6M 5H _ 1M 1H _ 1H

g3 200*F-M 0 1 1111 1 1425 1 1742 1 143 1 548 1 351 1 2...67 2 168 2 169 3 2... 1s92 1

1M _

1H 1H

2M 1H _ 5M 12H _

30L 85M 12H C1

C1 g3 100*F-M 0 1 6 6 1 1... 1s54 1 256 2 3...71 2

1H

20L 84M 11H C11 10L 1M 0H C11 g3 400*F-M 0 1 1 1 1 4 5 1 3 8 4 1 9 1 312 2 2..81 2 384 2 185 1

2M 1H _ 4M 1H _

17L 78M 9H C111 3L

C111 g3 1500*F-M 0 1 15 15 1 5 20 1 4 24 1 1 25 1 1 26 1 3 29 1 1 30 1 1 31 2 1 32 1 1 33 2 3 36 1 2 38 3 1 39 2 2 41 2 1 42 1 1 43 2 2 45 1 2 47 3 1 48 1 2 50 1 1 51 1 4 55 2 1 56 3 2 58 1 2 60 3 1 61 2 1 62 2 2 64 1 1 65 2 3 68 2 1 ...112 1 4116 2

3L _

3M _

4L 3M _

3L 13M 2H

4L 8M 4H

3L 51M 3H

6L . 1M _

3L .

12L 7M _

3L 4M _

4L 72M 15H C1 10L 1M 0H 3M 1H _

5M 10H _ 1M _ 1H

0.25 0.30 -0.20 -0.90 0.18 -0.44 -0.37 -0.19 -0.79 0.81 -0.52 -0.42 -0.19 -0.72 0.83C1 g3 300(F-M) 0 1 1 1 1 2 3 2 1 4 1 1 5 1 1 6 2 1 7 1 310 1 111 1 314 3 216 2 117 1 118 2 220 1 222 1 123 2 124 1 125 2 126 3 127 1 128 2 129 1 231 1 132 1 335 1 136 1 238 1 341 1 344 3 145 1 146 2 248 1 149 1 150 2 252 2 153 1 154 1 155 1 459 2 160 1 464 1 165 1 166 1 167 2 269 2 170 2 171 2 273 1 174 1 175 2 176 2 177 1 178 3 280 1 181 3 283 2 184 1 185 1 186 1 288 1 189 1 190 1 292 1

7M 4H .

16M 8H C11

17M 2H .

1M 2H _

3L 30M 1H

.55 .43 .14 .27 .38

C11 g3 1000(F-M)

0 1 1010 1 717 1 219 1 827 1 936 1 1147 2 249 1 352 2 456 1 460 1 262 1 264 1 771 3 172 1 577 2 481 1 384 1 690 1

0M 6H _ 1M 2H _

15H _

Page 24

3364 1804 185.38 0.56 0 3365 3399 186.38 1.00 1 3366 980 186.68 0.30 0 3367 1518 187.84 1.15 1 3368 2090 188.45 0.61 1 3369 890 189.10 0.65 1 3370 24 189.74 0.65 1 3371 2435 189.77 0.03 0 3372 804 190.14 0.36 0 3373 930 190.24 0.11 0 3374 1096 191.30 1.06 1 3375 1441 191.39 0.09 0 3376 2885 191.86 0.47 0 3377 2315 191.91 0.05 0 3378 699 192.04 0.13 0 3379 2108 194.34 2.30 1 3380 1316 195.58 1.24 1 3381 991 195.85 0.27 0 3382 1564 196.05 0.20 0 3383 2800 196.37 0.32 0 3384 880 196.62 0.25 0 3385 2038 196.75 0.13 0 3386 481 197.09 0.34 0 3387 480 197.85 0.76 1 3388 295 198.38 0.53 0 3389 1234 200.42 2.04 1 3390 2140 201.46 1.04 1 3391 3353 202.36 0.90 1 3392 3402 202.64 0.28 0 3393 45 202.86 0.21 0 3394 3017 204.63 1.77 1 3395 3365 207.54 2.91 1 3396 2436 207.77 0.24 0 3397 553 209.73 1.96 1 3398 2545 210.52 0.79 1 3399 54 213.63 3.11 1 3400 1933 214.58 0.95 1 3401 3201 216.16 1.57 1 3402 2895 217.18 1.02 1 3403 446 217.83 0.65 1 3404 2302 218.43 0.61 1 3405 2873 219.47 1.04 1 3406 3388 223.00 3.52 1 3407 1509 225.98 2.99 1 3408 32 229.46 3.48 1 3409 3189 231.30 1.84 1 3410 3228 231.43 0.13 0 3411 2107 232.39 0.96 1 3412 1150 232.79 0.40 0 3413 2279 236.69 3.90 1 3414 2289 237.43 0.74 1 3415 2385 238.03 0.60 0 3416 1037 245.93 7.90 1 3417 201 246.72 0.79 1 3418 1252 249.23 2.51 1 3419 1739 250.34 1.11 1 3420 2446 257.59 7.26 1 3421 1637 258.64 1.05 1 3422 3220 260.55 1.91 1 3423 1304 262.67 2.12 1 3424 2355 271.20 8.53 1 3425 232 293.86 22.66 1 3426 3411 299.23 5.37 1 3427 1955 303.42 4.19 1 3428 1832 328.03 24.61 1 3429 1197 335.83 7.81 1 3430 2852 364.01 28.18 1

0.1=AvgGp 64=#gaps Row# Doc# F 28.2=MxGp .6=GapThreshold 1 1791 5.67 Gap 0 ... ... ... ... ... 8 3389 7.00 0.19 0 9 2397 7.65 0.65 1 10 2841 7.82 0.17 0 ... ... ... ... ... 2621 2334 89.40 0.06 0 2622 1122 90.00 0.60 1 2623 245 90.06 0.06 0 ... ... ... ... ... 3123 3169 132.06 0.00 0 3124 321 132.81 0.75 1 3125 2047 133.05 0.24 0 ... ... ... ... ... 3210 343 145.29 0.37 0 3211 2475 145.89 0.60 1 3212 458 146.10 0.21 0 ... ... ... ... ... 3240 542 151.15 0.09 0 3241 2569 151.76 0.61 1 3242 1143 151.92 0.15 0 ... ... ... ... ... 3285 1803 157.97 0.00 0 3286 2257 158.70 0.73 1 3287 2723 158.77 0.07 0 ... ... ... ... ... 3293 129 159.56 0.32 0 3294 2541 160.45 0.89 1 3295 2870 160.48 0.03 0 ... ... ... ... ... 3301 401 161.38 0.04 0 3302 2918 162.03 0.65 1 3303 100 162.07 0.04 0 ... ... ... ... ... 3312 1157 164.54 0.08 0 3313 185 165.26 0.72 1 3314 685 165.91 0.65 1 3315 2948 166.25 0.34 0 ... ... ... ... ... 3325 190 168.59 0.37 0 3326 2498 169.20 0.61 1 3327 264 169.31 0.11 0 3328 1611 169.64 0.33 0 3329 3052 169.96 0.32 0 3330 1002 170.43 0.47 0 3331 1628 170.64 0.20 0 3332 1241 171.80 1.16 1 3333 3155 172.00 0.20 0 ... ... ... ... ... 3342 861 173.84 0.15 0 3343 2509 174.98 1.13 1 3344 2293 175.65 0.67 1 3345 1257 175.67 0.02 0 3346 2776 176.04 0.37 0 3347 1422 177.15 1.11 1 3348 12 177.24 0.09 0 3349 183 177.26 0.02 0 3350 620 177.29 0.03 0 3351 679 179.08 1.79 1 3352 462 179.15 0.07 0 3353 3404 180.02 0.88 1 3354 1850 180.79 0.76 1 3355 3342 181.21 0.43 0 3356 1396 183.04 1.82 1 3357 2982 183.26 0.22 0

___ ___ gap=.65 Ct=9 C1

___ ___ gap=.6 Ct=2613 C2

___ ___ gap=.75 Ct= 502 C3

___ ___ gap=.6 Ct= 87 C4

___ ___ gap=.61 Ct=30 C5

___ ___ gap=.73 Ct=45 C6

___ ___ gap=.89 Ct=8 C7

___ ___ gap=.65 Ct=8 C8

___ ___ gp=.72 Ct= 11 C9 ___ ___ gp=.65 Ct=1 outlr

___ ___ gp=.61 Ct=12 C11

___ ___ gp=1.2 Ct=6 C12

___ ___ gp=1.1 Ct=11 C13 ___ ___ gap=.67 Ct=1 utlr

___ ___ gp=1.1 Ct=3 C15

___ ___ gp=1.8 Ct=4 C16

___ ___ gp=1.8 Ct=5 otl;r

gp=1 Ct=8 C16. Outliers - some of them are substantial.

AvgGp.0085 gp>6*avg ROW KOS F GAP CT 1 1791 0.2270 --- -- 2 1317 0.2920 0.065 12668 1602 6.6576 0.007 26673090 1390 9.8504 0.004 4223132 1546 10.278 0.012 423148 2662 10.507 0.021 163216 505 11.289 0.019 683264 2219 11.994 0.027 483291 231 12.445 0.039 273302 710 12.631 0.038 113317 220 12.934 0.023 153338 405 13.315 0.028 213355 194 13.693 0.009 173368 12 14.151 0.078 83378 2731 14.590 0.011 103392 1096 15.459 0.022 5

MVM gaps>6*avg d=e841 (highest STD).

DOC W=841 1716 0 ... ... 1379 C02427 0

1027 1... ... 3427 1 743 C1

1 2 ... ... 2519 2 470 C2

868 3 ... ... 3224 3 274 C3

1882 4 ... ... 3257 4 175 C4

1434 5 ... ... 910 5 127 C5

2753 6 ... ... 549 6 75 C6

1186 7 ... ... 1015 7 79 C7

503 8 ... ... 3156 8 43 C8

2971 9 ... ... 2182 9 39 C9

2868 10 ... ... 1316 10 32 C10

2750 13 54 13 2293 13 183 13 2870 13 1222 13 3217 13 1519 13 8 C13

2164 14 otlrs 1656 14 3244 14 1709 14

185 15 otlrs 401 15 414 15 893 15

2731 16 otlrs 1396 16 3220 16 3190 16

1832 17 otlr

2852 18 otlrs 3201 18 1234 18

3189 19 otlr

1524 22 otlr

1529 24 otlr

1197 25 otlr

201 27 otlr

1150 29 otlr

1335 34 otlr 2648 11 ... ... 336 11 18 C11

2983 12 ... ... 3177 12 14 C12

Cluster size:d=USTD MVM 10 7 11 8 15 8 16 9 17 11 21 11 27 12 42 30 48 45 68 87 422 5022667 2613

GV 3 3 4 4 4 5 6 6 10 42 3163029

Doc F=DPPd Gap 24=MxGp2682 02749 7.574 0.038 0 30292983 8.436 0.079 0 423402 8.629 0.052 0 2 864 9.184 0.053 0 102293 9.462 0.106 1 42994 13.45 0.055 0 3161445 13.66 0.029 0 43399 14.05 0.099 0 6 185 14.21 0.156 1 12731 14.35 0.143 1 12948 14.65 0.066 0 51495 14.99 0.014 0 2 804 15.20 0.205 1 13177 15.42 0.034 0 61316 15.61 0.024 0 21335 16.01 0.028 0 31637 16.35 0.330 1 1 880 16.86 0.039 0 31509 17.03 0.176 1 12885 17.21 0.177 1 1 446 18.07 0.863 1 11197 18.65 0.005 0 43189 19.30 0.644 1 11252 20.65 1.352 1 1

GV on the 22 highest-STD KOS words, d=(.46 .16 .03 .32 .71 .07 .06 .03 .09 .03 .10 .10 .19 .04 .16 .14 .01 .02 .04 .02 .00 .02)

KOSblogs d=UnitSTDVec g>6*avg
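A tiny sketch (hypothetical NumPy) of the direction used here: the per-word standard deviations of the restricted KOS term matrix, unitized; gaps in F = X o d larger than 6 times the average gap then cut clusters.

```python
import numpy as np

def unit_std_direction(X):
    """d = UnitSTDVec: per-column standard deviations, normalized to a unit vector."""
    stds = X.std(axis=0)
    return stds / np.linalg.norm(stds)

def big_gaps(F, factor=6):
    """Positions in sorted F where the gap exceeds `factor` times the average gap."""
    Fs = np.sort(np.asarray(F, dtype=float))
    gaps = np.diff(Fs)
    return np.where(gaps > factor * gaps.mean())[0], Fs
```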

Page 25

GV using a grid (Unitized Corners of Unit Cube + Diagonal of the Variance Matrix + Mean-to-Vector_of_Medians)

On these pages we display the variance hill-climb for each of the four datasets (Concrete, IRIS, Seeds, Wine) for a grid of starting unit vectors, d.

I took the circumscribing unit non-negative cube and used all the Unitized diagonals.

In low dimensions (all dimensions are 4 here) this grid is very nearly a uniform grid.

Note that this will work less and less well as the dimension grows.

In all cases, the same local max and nearly the same unit vector are reached.
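A sketch (hypothetical Python) of how such a starting grid could be generated: all unitized nonzero 0/1 corners of the unit cube (the UCUC rows in the tables below), plus the unitized diagonal of the covariance matrix (akk) and the Mean-to-VOM direction (MVM); each start would then be handed to the GV hill-climb.

```python
import itertools
import numpy as np

def starting_grid(X):
    """Starting unit vectors d: UCUC corners + covariance diagonal (akk) + Mean-to-VOM (MVM)."""
    n = X.shape[1]
    starts = []
    for corner in itertools.product([0, 1], repeat=n):
        c = np.array(corner, dtype=float)
        if c.any():                                        # skip the zero corner
            name = "UCUC(" + "".join(map(str, corner)) + ")"
            starts.append((name, c / np.linalg.norm(c)))
    akk = np.diag(np.cov(X, rowvar=False, bias=True))      # a_kk = Var(X_k)
    starts.append(("akk", akk / np.linalg.norm(akk)))
    mvm = X.mean(axis=0) - np.median(X, axis=0)
    if np.linalg.norm(mvm) > 0:
        starts.append(("MVM", mvm / np.linalg.norm(mvm)))
    return starts                                          # 2^n - 1 + 2 starts (15 + 2 when n = 4)
```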

CONC d1 d2 d3 d4 VAR

UCUC(1000)

UCUC(0100)

UCUC(0010)

UCUC(0001)

UCUC(1100)

1.00 0.00 0.00 0.00 10249 0.99 -0.05 -0.11 0.06 10585 0.97 -0.03 -0.21 0.13 10947 0.93 -0.00 -0.32 0.19 11370 0.87 0.03 -0.42 0.26 11796 0.80 0.05 -0.51 0.31 12168 0.73 0.08 -0.58 0.36 12453 0.66 0.09 -0.63 0.39 12649 0.61 0.11 -0.67 0.42 12776 0.56 0.12 -0.70 0.43 12855 0.52 0.13 -0.72 0.45 12902 0.49 0.13 -0.73 0.46 12929 0.47 0.14 -0.74 0.46 12945 0.45 0.14 -0.75 0.47 12954 0.44 0.14 -0.75 0.47 12960 0.43 0.14 -0.76 0.47 12963 0.42 0.14 -0.76 0.47 12965 0.41 0.14 -0.76 0.48 12966

0.00 1.00 0.00 0.00 795-0.23 0.33 -0.78 0.49 11645-0.12 0.21 -0.82 0.52 12191-0.01 0.19 -0.83 0.52 12469 0.09 0.19 -0.83 0.52 12660 0.16 0.18 -0.82 0.52 12783 0.22 0.17 -0.81 0.51 12859 0.26 0.17 -0.81 0.50 12904 0.29 0.16 -0.80 0.50 12931 0.32 0.16 -0.79 0.50 12946 0.33 0.16 -0.79 0.49 12955 0.35 0.15 -0.78 0.49 12960 0.36 0.15 -0.78 0.49 12963 0.37 0.15 -0.78 0.49 12965 0.37 0.15 -0.78 0.48 12966

0.00 0.00 1.00 0.00 9950-0.10 -0.18 0.93 -0.31 12279-0.17 -0.18 0.86 -0.44 12749-0.23 -0.17 0.83 -0.48 12865-0.27 -0.17 0.81 -0.49 12911-0.30 -0.16 0.80 -0.50 12935-0.32 -0.16 0.79 -0.49 12949-0.34 -0.16 0.79 -0.49 12956-0.35 -0.15 0.78 -0.49 12961-0.36 -0.15 0.78 -0.49 12964-0.37 -0.15 0.78 -0.49 12965-0.37 -0.15 0.78 -0.48 12966

0.00 0.00 0.00 1.00 6686 0.08 0.16 -0.44 0.88 10572 0.16 0.17 -0.69 0.68 12435 0.22 0.17 -0.77 0.57 12816 0.26 0.17 -0.79 0.53 12901 0.29 0.16 -0.79 0.51 12932 0.32 0.16 -0.79 0.50 12947 0.34 0.16 -0.79 0.49 12955 0.35 0.15 -0.78 0.49 12960 0.36 0.15 -0.78 0.49 12963 0.37 0.15 -0.78 0.49 12965 0.37 0.15 -0.78 0.48 12966

0.71 0.71 0.00 0.00 4968 0.94 0.02 -0.29 0.18 11266 0.88 0.02 -0.40 0.24 11709 0.82 0.05 -0.49 0.30 12096 0.74 0.07 -0.57 0.35 12400 0.68 0.09 -0.62 0.38 12614 0.62 0.10 -0.66 0.41 12754 0.57 0.12 -0.69 0.43 12841 0.53 0.12 -0.71 0.44 12894 0.50 0.13 -0.73 0.45 12924 0.47 0.13 -0.74 0.46 12942 0.45 0.14 -0.75 0.47 12953 0.44 0.14 -0.75 0.47 12959 0.43 0.14 -0.76 0.47 12962 0.42 0.14 -0.76 0.47 12964 0.41 0.14 -0.76 0.48 12965 0.41 0.15 -0.77 0.48 12966 0.40 0.15 -0.77 0.48 12967

UCUC(1010) 0.71 0.00 0.71 0.00 9007 0.69 -0.18 0.67 -0.21 10074 0.62 -0.20 0.68 -0.33 10486 0.52 -0.21 0.72 -0.41 10867 0.40 -0.21 0.76 -0.46 11289 0.27 -0.21 0.80 -0.50 11721 0.15 -0.20 0.82 -0.51 12106 0.03 -0.20 0.83 -0.52 12408

UCUC(1001)

UCUC(0110)

UCUC(0101)

UCUC(0011)

-0.06 -0.19 0.83 -0.52 12619-0.14 -0.18 0.82 -0.52 12758-0.20 -0.17 0.82 -0.51 12843-0.25 -0.17 0.81 -0.51 12895-0.28 -0.16 0.80 -0.50 12925-0.31 -0.16 0.79 -0.50 12943-0.33 -0.16 0.79 -0.49 12953-0.35 -0.16 0.79 -0.49 12959-0.36 -0.15 0.78 -0.49 12962-0.37 -0.15 0.78 -0.49 12964-0.37 -0.15 0.78 -0.48 12965-0.38 -0.15 0.78 -0.48 12966-0.38 -0.15 0.77 -0.48 12967

0.71 0.00 0.00 0.71 9105 0.78 0.05 -0.32 0.53 11499 0.74 0.07 -0.50 0.44 12306 0.68 0.09 -0.60 0.42 12601 0.62 0.10 -0.65 0.42 12753 0.57 0.12 -0.69 0.43 12841 0.53 0.12 -0.71 0.45 12894 0.50 0.13 -0.73 0.45 12924 0.47 0.13 -0.74 0.46 12942 0.45 0.14 -0.75 0.47 12953 0.44 0.14 -0.75 0.47 12959 0.43 0.14 -0.76 0.47 12962 0.42 0.14 -0.76 0.47 12964 0.41 0.14 -0.76 0.48 12965 0.41 0.15 -0.77 0.48 12966 0.40 0.15 -0.77 0.48 12967

0.00 0.71 0.71 0.00 3491-0.19 -0.13 0.94 -0.25 12162-0.25 -0.17 0.86 -0.41 12806-0.28 -0.16 0.82 -0.47 12915-0.31 -0.16 0.80 -0.49 12942-0.33 -0.16 0.79 -0.49 12953-0.35 -0.16 0.79 -0.49 12959-0.36 -0.15 0.78 -0.49 12963-0.37 -0.15 0.78 -0.49 12964-0.37 -0.15 0.78 -0.48 12966

0.00 0.71 0.00 0.71 4926 0.01 0.20 -0.54 0.81 11209 0.09 0.18 -0.73 0.65 12473 0.16 0.18 -0.79 0.56 12765 0.22 0.17 -0.80 0.53 12861 0.26 0.17 -0.80 0.51 12907 0.29 0.16 -0.80 0.50 12932 0.32 0.16 -0.79 0.50 12947 0.34 0.16 -0.79 0.49 12955 0.35 0.15 -0.78 0.49 12960 0.36 0.15 -0.78 0.49 12963 0.37 0.15 -0.78 0.49 12965 0.37 0.15 -0.78 0.48 12966

0.00 0.00 0.71 0.71 4951-0.06 -0.09 0.89 0.45 6835-0.16 -0.15 0.97 -0.02 10755-0.23 -0.17 0.90 -0.33 12547-0.28 -0.16 0.84 -0.44 12876-0.31 -0.16 0.81 -0.48 12934-0.33 -0.16 0.80 -0.49 12951-0.34 -0.16 0.79 -0.49 12958-0.35 -0.15 0.78 -0.49 12962-0.36 -0.15 0.78 -0.49 12964-0.37 -0.15 0.78 -0.49 12965-0.38 -0.15 0.78 -0.48 12966

UCUC(1110) 0.58 0.58 0.58 0.00 4647 0.76 -0.15 0.62 -0.14 9784 0.72 -0.19 0.61 -0.27 10149 0.65 -0.20 0.64 -0.36 10422 0.56 -0.20 0.69 -0.41 10750 0.44 -0.21 0.74 -0.46 11149 0.32 -0.21 0.78 -0.49 11582 0.19 -0.21 0.81 -0.51 11988 0.07 -0.20 0.83 -0.52 12319-0.04 -0.19 0.83 -0.52 12559-0.12 -0.18 0.83 -0.52 12719-0.18 -0.18 0.82 -0.51 12820-0.23 -0.17 0.81 -0.51 12881-0.27 -0.17 0.80 -0.50 12917-0.30 -0.16 0.80 -0.50 12938-0.32 -0.16 0.79 -0.49 12950-0.34 -0.16 0.79 -0.49 12957-0.35 -0.15 0.78 -0.49 12961-0.36 -0.15 0.78 -0.49 12964-0.37 -0.15 0.78 -0.49 12965-0.38 -0.15 0.78 -0.48 12966

UCUC(1101)

UCUC(1011)

UCUC(0111)

0.58 0.58 0.00 0.58 6756 0.69 0.10 -0.43 0.57 11945 0.65 0.10 -0.58 0.48 12599 0.60 0.11 -0.66 0.45 12784 0.55 0.12 -0.70 0.45 12864 0.51 0.13 -0.72 0.45 12908 0.49 0.13 -0.73 0.46 12933 0.46 0.14 -0.74 0.46 12947 0.45 0.14 -0.75 0.47 12956 0.43 0.14 -0.76 0.47 12960 0.42 0.14 -0.76 0.47 12963 0.42 0.14 -0.76 0.48 12965 0.41 0.15 -0.76 0.48 12966 0.58 0.00 0.58 0.58 6414 0.82 -0.10 0.46 0.33 8390 0.93 -0.12 0.32 0.12 9506 0.97 -0.11 0.20 0.02 9889 0.99 -0.10 0.11 -0.00 10069 1.00 -0.08 0.02 0.01 10254 0.99 -0.06 -0.08 0.05 10508 0.98 -0.04 -0.18 0.11 10851 0.94 -0.01 -0.29 0.18 11263 0.89 0.02 -0.40 0.24 11695 0.82 0.05 -0.49 0.30 12084 0.75 0.07 -0.56 0.35 12391 0.68 0.09 -0.62 0.38 12609 0.62 0.10 -0.66 0.41 12751 0.57 0.12 -0.69 0.43 12839 0.53 0.12 -0.71 0.44 12892 0.50 0.13 -0.73 0.45 12924 0.47 0.13 -0.74 0.46 12942 0.45 0.14 -0.75 0.47 12953 0.44 0.14 -0.75 0.47 12959 0.43 0.14 -0.76 0.47 12962 0.42 0.14 -0.76 0.47 12964 0.41 0.14 -0.76 0.48 12965 0.41 0.15 -0.76 0.48 12966 0.00 0.58 0.58 0.58 3102-0.15 0.02 0.71 0.68 5237-0.34 -0.08 0.86 0.37 7997-0.46 -0.12 0.88 -0.09 11648-0.47 -0.13 0.81 -0.33 12756-0.45 -0.14 0.77 -0.42 12928-0.44 -0.14 0.76 -0.45 12955-0.43 -0.14 0.76 -0.47 12962-0.42 -0.14 0.76 -0.47 12964-0.41 -0.14 0.76 -0.48 12965-0.41 -0.15 0.76 -0.48 12966

UCUC(1111)

akk

MVM

0.50 0.50 0.50 0.50 4385 0.83 -0.04 0.32 0.46 8393 0.95 -0.06 0.09 0.28 9943 0.97 -0.04 -0.09 0.20 10663 0.95 -0.01 -0.24 0.21 11151 0.90 0.01 -0.36 0.25 11601 0.83 0.04 -0.47 0.30 12007 0.76 0.07 -0.55 0.34 12334 0.69 0.09 -0.61 0.38 12569 0.63 0.10 -0.65 0.41 12726 0.58 0.11 -0.69 0.43 12824 0.54 0.12 -0.71 0.44 12883 0.50 0.13 -0.73 0.45 12918 0.48 0.13 -0.74 0.46 12939 0.46 0.14 -0.75 0.46 12951 0.44 0.14 -0.75 0.47 12958 0.43 0.14 -0.76 0.47 12962 0.42 0.14 -0.76 0.47 12964 0.41 0.14 -0.76 0.48 12965 0.41 0.15 -0.76 0.48 12966 0.17 0.05 0.98 0.01 9327 0.06 -0.19 0.93 -0.30 11888-0.04 -0.19 0.88 -0.44 12502-0.12 -0.18 0.84 -0.49 12715-0.19 -0.18 0.83 -0.50 12822-0.24 -0.17 0.81 -0.50 12882-0.27 -0.17 0.80 -0.50 12918-0.30 -0.16 0.80 -0.50 12939-0.32 -0.16 0.79 -0.49 12951-0.34 -0.16 0.79 -0.49 12958-0.35 -0.15 0.78 -0.49 12962-0.36 -0.15 0.78 -0.49 12964-0.37 -0.15 0.78 -0.49 12965-0.38 -0.15 0.78 -0.48 12966 0.00 -0.00 0.00 -0.01 1 0.28 -0.19 0.49 -0.80 10378 0.18 -0.20 0.71 -0.65 11773 0.06 -0.20 0.79 -0.58 12296-0.04 -0.19 0.82 -0.54 12563-0.12 -0.18 0.82 -0.53 12724-0.19 -0.18 0.82 -0.52 12823-0.24 -0.17 0.81 -0.51 12883-0.27 -0.17 0.80 -0.50 12918-0.30 -0.16 0.80 -0.50 12939-0.33 -0.16 0.79 -0.49 12951-0.34 -0.16 0.79 -0.49 12958-0.35 -0.15 0.78 -0.49 12962-0.36 -0.15 0.78 -0.49 12964-0.37 -0.15 0.78 -0.49 12965-0.38 -0.15 0.78 -0.48 12966

Page 26

SEEDS d1 d2 d3 d4 VAR

UCUC(1000)

UCUC(0100)

UCUC(0010)

UCUC(0001)

UCUC(1100)

UCUC(1010)

UCUC(1001)

UCUC(0110)

UCUC(0101)

UCUC(0011)

UCUC(1110)

UCUC(1101)

UCUC(1011)

UCUC(0111)

UCUC(1111)

akk

MVM

1.00 0.00 0.00 0.00 8 0.97 0.16 -0.11 0.14 9

0.00 1.00 0.00 0.00 0 0.96 0.23 -0.14 0.13 9

0.00 0.00 1.00 0.00 2-0.36 -0.07 0.93 -0.00 4-0.82 -0.15 0.55 -0.09 8-0.94 -0.16 0.27 -0.12 9

0.00 0.00 0.00 1.00 0 0.97 0.15 -0.00 0.19 9

0.71 0.71 0.00 0.00 6 0.97 0.17 -0.12 0.13 9

0.71 0.00 0.71 0.00 4 0.96 0.16 0.20 0.15 8 0.97 0.16 -0.05 0.14 9

0.71 0.00 0.00 0.71 5 0.97 0.16 -0.10 0.14 9

0.00 0.71 0.71 0.00 1 0.19 0.06 0.98 0.08 2 0.33 0.04 0.94 0.10 3 0.70 0.11 0.69 0.14 5 0.96 0.16 0.18 0.15 8 0.97 0.16 -0.06 0.14 9

0.00 0.71 0.00 0.71 0 0.97 0.20 -0.08 0.15 9

0.00 0.00 0.71 0.71 1 0.08 -0.01 0.99 0.09 2-0.07 -0.03 1.00 0.05 3-0.51 -0.10 0.86 -0.03 5-0.88 -0.15 0.44 -0.10 8-0.95 -0.16 0.23 -0.13 9

0.58 0.58 0.58 0.00 4 0.96 0.17 0.15 0.15 8 0.97 0.16 -0.07 0.14 9

0.58 0.58 0.00 0.58 5 0.97 0.17 -0.10 0.14 9

0.58 0.00 0.58 0.58 4 0.96 0.16 0.17 0.16 8 0.97 0.16 -0.06 0.14 9

0.00 0.58 0.58 0.58 1 0.56 0.11 0.80 0.14 4 0.92 0.15 0.31 0.15 8 0.98 0.16 -0.02 0.14 9

0.50 0.50 0.50 0.50 4 0.97 0.17 0.13 0.15 8 0.97 0.16 -0.07 0.14 9

0.98 0.14 0.06 0.13 9

-0.62 0.36 0.27 -0.30 4-0.95 -0.15 0.22 -0.13 9

GV using a grid (Unitized Corners of Unit Cube + Diagonal of the Variance Matrix + Mean-to-Vector_of_Medians) 2WINE d1 d2 d3 d4 VAR

UCUC(1000)

UCUC(0100)

UCUC(0010)

UCUC(0001)

UCUC(1100)

UCUC(1010)

UCUC(1001)

UCUC(0110)

UCUC(0101)

UCUC(0011)

UCUC(1110)

UCUC(1101)

UCUC(1011)

UCUC(0111)

UCUC(1111)

akk

MVM

1.00 0.00 0.00 0.00 4 0.40 -0.06 -0.91 -0.07 497 0.02 -0.25 -0.97 -0.01 608

0.00 1.00 0.00 0.00 82-0.00 0.49 0.87 0.00 577-0.01 0.28 0.96 0.00 608

0.00 0.00 1.00 0.00 567-0.01 0.25 0.97 0.00 608

0.00 0.00 0.00 1.00 1-0.20 0.17 0.84 0.47 455-0.02 0.26 0.96 0.01 608

0.71 0.71 0.00 0.00 42 0.02 0.51 0.86 -0.00 570-0.01 0.29 0.96 0.00 608

0.71 0.00 0.71 0.00 277-0.01 0.25 0.97 0.00 608

0.71 0.00 0.00 0.71 2 0.46 0.00 -0.88 0.12 447 0.02 -0.25 -0.97 -0.00 608

0.00 0.71 0.71 0.00 472-0.01 0.31 0.95 0.00 608

0.00 0.71 0.00 0.71 42-0.01 0.48 0.88 0.01 578-0.01 0.28 0.96 0.00 608

0.00 0.00 0.71 0.71 287-0.02 0.25 0.97 0.01 608

0.58 0.58 0.58 0.00 310-0.01 0.31 0.95 0.00 607-0.01 0.27 0.96 0.00 608

0.58 0.58 0.00 0.58 29 0.02 0.50 0.86 0.01 572-0.01 0.29 0.96 0.00 608

0.58 0.00 0.58 0.58 186-0.01 0.25 0.97 0.01 608

0.00 0.58 0.58 0.58 317-0.01 0.30 0.95 0.01 608

0.50 0.50 0.50 0.50 234-0.01 0.31 0.95 0.01 607-0.01 0.27 0.96 0.00 608

0.07 0.15 0.98 0.12 588-0.01 0.26 0.97 0.00 608

-0.13 -1.00 -3.07 -0.03 6314 0.01 -0.27 -0.96 -0.00 608

IRIS d1 d2 d3 d4 VAR

UCUC(1000)

UCUC(0100)

UCUC(0010)

UCUC(0001)

UCUC(1100)

UCUC(1010)

UCUC(1001)

UCUC(0110)

UCUC(0101)

UCUC(0011)

UCUC(1110)

UCUC(1101)

UCUC(1011)

UCUC(0111)

UCUC(1111)

akk

MVM

1.00 0.00 0.00 0.00 68 0.45 -0.03 0.83 0.34 415 0.36 -0.08 0.86 0.36 420

0.00 1.00 0.00 0.00 19-0.10 0.48 -0.82 -0.30 334-0.34 0.10 -0.86 -0.36 420

0.00 0.00 1.00 0.00 311 0.35 -0.09 0.86 0.35 420

0.00 0.00 0.00 1.00 58 0.34 -0.08 0.85 0.39 420

0.71 0.71 0.00 0.00 39 0.53 0.12 0.78 0.33 390 0.37 -0.07 0.86 0.36 420

0.71 0.00 0.71 0.00 316 0.38 -0.07 0.85 0.35 420

0.71 0.00 0.00 0.71 114 0.40 -0.05 0.84 0.36 419 0.36 -0.08 0.86 0.36 420

0.00 0.71 0.71 0.00 133 0.37 -0.04 0.86 0.36 419 0.36 -0.08 0.86 0.36 420

0.00 0.71 0.00 0.71 27 0.41 0.06 0.82 0.40 410 0.37 -0.08 0.86 0.36 420

0.00 0.00 0.71 0.71 312 0.35 -0.09 0.86 0.36 420

0.58 0.58 0.58 0.00 193 0.40 -0.04 0.85 0.35 419 0.36 -0.08 0.86 0.36 420

0.58 0.58 0.00 0.58 72 0.43 0.01 0.83 0.36 414 0.37 -0.08 0.86 0.36 420

0.58 0.00 0.58 0.58 349 0.37 -0.07 0.85 0.36 420

0.00 0.58 0.58 0.58 185 0.36 -0.05 0.85 0.37 420

0.50 0.50 0.50 0.50 243 0.90 0.24 0.37 0.04 180 0.41 -0.04 0.84 0.35 418 0.36 -0.08 0.86 0.36 420

0.90 0.24 0.37 0.04 180 0.41 -0.04 0.84 0.35 418 0.36 -0.08 0.86 0.36 420

-0.00 -0.04 0.05 0.01 1 0.35 -0.09 0.86 0.36 420

As we all know, Dr. Ubhaya is the best Mathematician on campus and he is attempting to prove three things: 1. That a GV-hill-climb that does not reach the global max Variance is rare indeed. 2. That one is guaranteed to reach the global maximum with at least one of the coordinate unit vectors (so a 90 degree grid will always suffice). 3. That akk will always reach the global max.

Page 27

Finding round clusters that aren't DPPd separable? (no linear gap)

(Figure: a direction d through the data.)

Find the golf ball? Suppose we have a white mask pTree; no linear gap exists to reveal it.

Search a grid of d-tubes until a DPP_d gap is found in the interior of the tube. (Form a mask pTree for the interior of the d-tube; apply DPP_d to that mask to reveal interior gaps.)

Look for conical gaps (fix the cone point at the middle of the tube) over all cone angles (look for an interval of angles containing no points).

Notice that this method includes DPPd since a gap for a cone angle of 90 degrees is linear.
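A rough sketch (hypothetical NumPy) of the conical-gap check at a fixed apex: compute the angle of each point relative to the axis d and look for an empty interval of angles; a gap at a 90-degree cone angle reduces to the linear DPP_d case, as noted above.

```python
import numpy as np

def conical_gaps(X, apex, d, n_bins=90):
    """Angle intervals (degrees) around axis d, anchored at `apex`, containing no points."""
    d = d / np.linalg.norm(d)
    Y = X - apex
    lens = np.linalg.norm(Y, axis=1)
    cos = np.divide(Y @ d, lens, out=np.ones_like(lens), where=lens > 0)
    angles = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    counts, edges = np.histogram(angles, bins=n_bins, range=(0.0, 180.0))
    empty = np.where(counts == 0)[0]
    return [(edges[i], edges[i + 1]) for i in empty]   # conical gaps
```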

Page 28

Z     x1   x2
z1     1    1
z2     3    1
z3     2    2
z4     3    3
z5     6    2
z6     9    3
z7    15    1
z8    14    2
z9    15    3
za    13    4
zb    10    9
zc    11   10
zd     9   11
ze    11   11
zf     7    8

F = z o d:  11  27  23  34  53  80  118  114  125  114  110  121  109  125  83

p6 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1

p5 0 0 0 1 1 0 1 1 1 1 1 1 1 1 0

p4 0 1 1 0 1 1 1 1 1 1 0 1 0 1 1

p3 1 1 0 0 0 0 0 0 1 0 1 1 1 1 0

p2 0 0 1 0 1 0 1 0 1 0 1 0 1 1 0

p1 1 1 1 1 0 0 1 1 0 1 1 0 0 0 1

p0 1 1 1 0 1 0 0 0 1 0 0 1 1 1 1

p6' 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0

p5' 1 1 1 0 0 1 0 0 0 0 0 0 0 0 1

p4' 1 0 0 1 0 0 0 0 0 0 1 0 1 0 0

p3' 0 0 1 1 1 1 1 1 0 1 0 0 0 0 1

p2' 1 1 0 1 0 1 0 1 0 1 0 1 0 0 1

p1' 0 0 0 0 1 1 0 0 1 0 0 1 1 1 0

p0' 0 0 0 1 0 1 1 1 0 1 1 0 0 0 0

0 &p4' 1 0 0 1 0 0 0 0 0 0 1 0 1 0 0

C=1

p4' 1 0 0 1 0 0 0 0 0 0 1 0 1 0 0

p4 0 1 1 0 1 1 1 1 1 1 0 1 0 1 1

C=2

p4 0 1 1 0 1 1 1 1 1 1 0 1 0 1 1

p4' 1 0 0 1 0 0 0 0 0 0 1 0 1 0 0

C=1

p4' 1 0 0 1 0 0 0 0 0 0 1 0 1 0 0

p4 0 1 1 0 1 1 1 1 1 1 0 1 0 1 1

C=1

p4 0 1 1 0 1 1 1 1 1 1 0 1 0 1 1

&p4' 1 0 0 1 0 0 0 0 0 0 1 0 1 0 0

C=0

p4' 1 0 0 1 0 0 0 0 0 0 1 0 1 0 0

p4 0 1 1 0 1 1 1 1 1 1 0 1 0 1 1

C=2

p4 0 1 1 0 1 1 1 1 1 1 0 1 0 1 1

p4' 1 0 0 1 0 0 0 0 0 0 1 0 1 0 0

C=2

p4' 1 0 0 1 0 0 0 0 0 0 1 0 1 0 0

p4 0 1 1 0 1 1 1 1 1 1 0 1 0 1 1

C=6

p4 0 1 1 0 1 1 1 1 1 1 0 1 0 1 1

0 &p5' 1 1 1 0 0 1 0 0 0 0 0 0 0 0 1

C=3

p5' 1 1 1 0 0 1 0 0 0 0 0 0 0 0 1

p5' 1 1 1 0 0 1 0 0 0 0 0 0 0 0 1

C=3

p5' 1 1 1 0 0 1 0 0 0 0 0 0 0 0 1

&p5' 1 1 1 0 0 1 0 0 0 0 0 0 0 0 1

C=2

p5' 1 1 1 0 0 1 0 0 0 0 0 0 0 0 1

p5' 1 1 1 0 0 1 0 0 0 0 0 0 0 0 1

C=2

p5' 1 1 1 0 0 1 0 0 0 0 0 0 0 0 1

p5 0 0 0 1 1 0 1 1 1 1 1 1 1 1 0

C=2

p5 0 0 0 1 1 0 1 1 1 1 1 1 1 1 0

p5 0 0 0 1 1 0 1 1 1 1 1 1 1 1 0

C=2

p5 0 0 0 1 1 0 1 1 1 1 1 1 1 1 0

p5 0 0 0 1 1 0 1 1 1 1 1 1 1 1 0

C=8

p5 0 0 0 1 1 0 1 1 1 1 1 1 1 1 0

p5 0 0 0 1 1 0 1 1 1 1 1 1 1 1 0

C=8

p5 0 0 0 1 1 0 1 1 1 1 1 1 1 1 0

0p6' 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0C=5

p6' 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0

p6' 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0C=5

p6' 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0

p6' 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0C=5

p6' 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0

p6' 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0C=5

p6' 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0

p6 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1C10

p6 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1

p6 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1C10

p6 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1

p6 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1C10

p6 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1

p6 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1C10

p6 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1

[000 0000, 000 1111] = [0,15] = [0,16) has 1 point, z1. This is a 2^4 thinning. z1 o d = 11 is only 5 units from the right edge, so z1 is not declared an outlier yet.

Next, we check the minimum distance from the right edge to the points of the next interval to see whether z1's right-side gap is actually at least 2^4 (the calculation of that minimum is a pTree process - no looping over x required!).

FAUST Gap Revealer, width 2^4: d = M - p, so compute all pTree combinations down to p4 and p4'.

[010 0000, 010 1111] = [32,48). z4 o d = 34 is within 2 of 32, so z4 is not declared an anomaly.

[011 0000, 011 1111] = [48,64). z5 o d = 53 is 19 from z4 o d = 34 (> 2^4) but only 11 from 64. However, the next interval, [64,80), is empty, so z5 is 27 from its right neighbor. z5 is declared an outlier and we put a subcluster cut through z5.

[100 0000, 100 1111] = [64,80). This is clearly a 2^4 gap.

(Figure: the 15 points z1..zf and the point M plotted on a 16-wide coordinate grid.)

[001 0000, 001 1111] = [16,32). The minimum, z3 o d = 23, is 7 units from the left edge, 16, so z1 has only a 5+7 = 12 unit gap on its right (not a 2^4 gap). So z1 is not declared a 2^4 outlier (it is a 2^4 inlier).

[101 0000 , 101 1111]= [80, 96). z6od=80, zfod=83

[110 0000, 110 1111] = [96,112). zb o d = 110, zd o d = 109. So both of {z6, zf} are declared outliers (a gap of at least 16 on both sides of the pair).

[111 0000, 111 1111] = [112,128). z7 o d = 118, z8 o d = 114, z9 o d = 125, za o d = 114, zc o d = 121, ze o d = 125. No 2^4 gaps. But we can consult SpS(d^2(x,y)) for the actual distances:

X1   X2    d(X1,X2)
z7   z8    1.4
z7   z9    2.0
z7   z10   3.6
z7   z11   9.4
z7   z12   9.8
z7   z13   11.7
z7   z14   10.8
z8   z9    1.4
z8   z10   2.2
z8   z11   8.1
z8   z12   8.5
z8   z13   10.3
z8   z14   9.5
z9   z10   2.2
z9   z11   7.8
z9   z12   8.1
z9   z13   10.0
z9   z14   8.9
z10  z11   5.8
z10  z12   6.3
z10  z13   8.1
z10  z14   7.3
z11  z12   1.4
z11  z13   2.2
z11  z14   2.2
z12  z13   2.2
z12  z14   1.0
z13  z14   2.0

This reveals that there are no 2^4 gaps in this subcluster.

Incidentally, it also reveals a 5.8 gap between {z7,z8,z9,za} and {zb,zc,zd,ze}, but that analysis is messy, and the gap would be revealed by the next xofM round on this sub-cluster anyway.
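A simplified sketch (hypothetical Python) of the gap-revealer idea on the F = X o d values: the pTree version scans the high-order-bit intervals [k*2^4, (k+1)*2^4) using masks built from p6..p4 and their complements and only computes exact mins/maxes near sparse intervals; the plain-array stand-in below just checks each value's nearest neighbors. On the F values above it flags z5 (F=53); the pair {z6, zf} (F=80, 83) shows up as a two-point interval with a 2^4 gap on each side of the pair rather than as individual outliers.

```python
import numpy as np

def gap_reveal(F, width_exp=4):
    """Values of F (assumed nonnegative) whose nearest neighbors on both sides
    are at least 2**width_exp away, plus the point count per 2**width_exp interval."""
    w = 1 << width_exp                                   # 2^4 = 16
    F = np.asarray(F, dtype=float)
    counts = np.bincount((F // w).astype(int))           # points per high-order-bit interval
    Fs = np.sort(F)
    gaps_left = np.diff(Fs, prepend=Fs[0] - w)           # distance to the left neighbor
    gaps_right = np.diff(Fs, append=Fs[-1] + w)          # distance to the right neighbor
    outliers = Fs[(gaps_left >= w) & (gaps_right >= w)]
    return counts, outliers
```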

Page 29

FAUST Tube Clustering: (This method attempts to build tubular-shaped gaps around clusters.) It allows a better fit around convex clusters that are elongated in one direction (not round).

(Figure: a d-tube anchored at a start point p toward q, with a point y shown; gaps appear as dot-product-length [projection] gaps along the line, as a tube-cap gap width at the ends, and as a tube-radius gap width around the sides.)

Exhaustive search for all tubular gaps: a pseudo-exhaustive search (exhaustive modulo a grid width) takes two parameters:
1. A StartPoint, p (an n-vector, so n-dimensional).
2. A UnitVector, d (an n-direction, so (n-1)-dimensional - a grid on the surface of the sphere in R^n).

Then, for every choice of (p, d) (e.g., in a grid of points in R^(2n-1)), two functionals are used to enclose subclusters in tubular gaps:
a. SquareTubeRadius functional, STR(y) = (y-p)o(y-p) - ((y-p)od)^2.
b. TubeLength functional, TL(y) = (y-p)od.

Given a p, do we need a full grid of ds (directions)? No! d and -d give the same TL-gaps.

Given d, do we need a full grid of starting points p? No! All p' such that p' = p + cd give the same gaps. Hill-climb the gap width from a good starting point and direction; a sketch follows.
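A minimal sketch (hypothetical NumPy) of the two functionals for one (p, d) choice; the pTree version would build them as PTreeSet arithmetic rather than per-row arrays.

```python
import numpy as np

def tube_functionals(X, p, d):
    """STR(y) = (y-p)o(y-p) - ((y-p)od)^2  and  TL(y) = (y-p)od  for every row y of X."""
    d = d / np.linalg.norm(d)
    Y = X - p
    TL = Y @ d                               # tube-length (projection along d)
    STR = (Y * Y).sum(axis=1) - TL ** 2      # squared distance from the tube's axis
    return STR, TL

# usage sketch: restrict to the tube interior, then look for gaps in TL there
# STR, TL = tube_functionals(X, p, d)
# inside = STR <= r ** 2                     # mask of the d-tube of radius r
# gaps = np.diff(np.sort(TL[inside]))        # large values = tube-cap gaps
```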

MATH: Need dot product projection length and dot product projection distance (in red).

Dot product projection length of y on f:  (y o f)/|f|.

The component of y perpendicular to f is y - ((y o f)/(f o f)) f, and its squared length is
\[
\Big(y-\frac{y\circ f}{f\circ f}f\Big)\circ\Big(y-\frac{y\circ f}{f\circ f}f\Big)
= y\circ y-2\,\frac{(y\circ f)^2}{f\circ f}+\frac{(y\circ f)^2}{(f\circ f)^2}\,f\circ f
= y\circ y-\frac{(y\circ f)^2}{f\circ f}\, .
\]

Dot product projection distance: squared projection distance of y on f = y o y - (y o f)^2/(f o f).

Squared projection distance of y-p on q-p:
\[
(y-p)\circ(y-p)-\frac{\big((y-p)\circ(q-p)\big)^2}{(q-p)\circ(q-p)}
= y\circ y-2\,y\circ p+p\circ p-\Big(\frac{y\circ(q-p)}{|q-p|}-\frac{p\circ(q-p)}{|q-p|}\Big)^2 .
\]

For the dot product length projections (the caps), with q-p = M-p, we already needed:
\[
(y-p)\circ\frac{M-p}{|M-p|}=\frac{y\circ(M-p)}{|M-p|}-\frac{p\circ(M-p)}{|M-p|}\, .
\]

That is, we needed to compute the green constants and the blue and red dot product functionals in an optimal way (and then do the PTreeSet additions/subtractions/multiplications). What is optimal? (minimizing PTreeSet functional creations and PTreeSet operations.)

Page 30

F = (y-M)o(x-M)/|x-M| - min, restricted to a cosine cone, on IRIS

x=s1cone=1/√2

60 3 61 4 62 3 63 10 64 15 65 9 66 3 67 1 69 2 50

x=s2cone=1/√2

47 159 260 461 362 663 1064 1065 566 467 469 170 1 51

x=s2cone=.9

59 260 361 362 563 964 1065 566 467 469 170 1 47

x=s2cone=.1

39 240 141 144 145 146 147 152 1 i3959 260 461 362 663 1064 1065 566 467 469 170 1 59

x=e1cone=.707

33 136 237 238 339 140 541 442 243 144 145 646 447 548 149 250 551 152 254 255 157 258 160 162 163 164 165 2 60

x=i1cone=.707

34 135 136 237 238 339 540 442 643 244 745 547 248 349 350 351 452 353 254 255 456 257 158 159 160 161 162 163 164 166 1 75

w maxscone=.707

0 2 8 110 312 213 114 315 116 317 518 319 520 621 222 423 324 325 926 327 328 329 530 331 432 333 234 235 236 437 138 140 141 442 543 544 745 346 147 648 649 251 152 253 155 1 137

w maxscone=.93

8 1 i1013 114 316 217 218 119 320 421 124 125 426 1 e21 e3427 229 237 1 i7 27/29 are i's

w maxscone=.925

8 1 i1013 114 316 317 218 219 320 421 124 125 526 1 e21 e3427 228 129 231 1 e3537 1 i731/34 are i's

w maxs-to-minscone=.939

14 1 i2516 1 i4018 2 i16 i4219 2 i17 i3820 2 i11 i4822 223 124 4 i34 i5025 3 i24 i2826 3 i2727 528 329 230 231 332 434 335 436 237 238 239 340 141 2 46 147 248 149 1 i3953 154 255 156 157 858 559 460 761 462 563 564 165 366 167 168 1 11414 i and 100 s/e.So picks i as 0

w xnnn-nxxxcone=.95

8 2 i22 i5010 211 2 i2812 4 i24 i27 i3413 214 415 316 817 418 719 320 521 122 123 134 1 i39 43/50 e so picks out e

w naaa-xaaacone=.95

12 113 214 115 216 117 118 419 320 221 322 523 6 i2124 525 127 128 129 230 2 i7 41/43 e so picks e

w aaan-aaaxcone=.54

7 3 i27 i28 8 1 9 310 12 i20 i3411 712 1313 514 315 719 120 121 722 723 2824 6100/104 s or e so 0 picks i

Corner points

Gap in dot product projections onto the corner-points line.

Cosine cone gap (over some angle).

Cosine conical gapping seems quick and easy (cosine = dot product divided by both lengths).

The length of the fixed vector, x-M, is a one-time calculation. The length of y-M changes with y, so build it as a PTreeSet. A small sketch follows.
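A small sketch (hypothetical NumPy) of the cosine-cone restriction: fix a corner point x and the mean M, keep the points y whose cosine with the x-M axis is at least the cone threshold, and look at F = (y-M)o(x-M)/|x-M| on that subset.

```python
import numpy as np

def cosine_cone_F(Y, x, M, cone=0.707):
    """F = (y-M)o(x-M)/|x-M| for the points y inside the cosine cone about x-M."""
    axis = x - M
    axis_len = np.linalg.norm(axis)                       # one-time calculation
    D = Y - M
    proj = D @ axis / axis_len                            # F before subtracting the min
    lens = np.linalg.norm(D, axis=1)                      # |y-M| changes with y (a PTreeSet in the real method)
    cos = np.divide(proj, lens, out=np.zeros_like(proj), where=lens > 0)
    mask = cos >= cone
    F = proj[mask]
    return (F - F.min() if F.size else F), mask           # gaps in F reveal cone-shaped subclusters
```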

Cone Clustering: (finding cone-shaped clusters)

Page 31

"Gap Hill Climbing": mathematical analysis rotation d toward a higher F-STD or grow 1 gap using support pairs:

F-slices are hyperplanes (assuming F=dotd) so it would makes sense to try to "re-orient" d so that the gap grows.

Instead of taking the "improved" p and q to be the means of the entire n-dimensional half-spaces which is cut by the gap (or thinning), take as p and q to be the means of the F-slice (n-1)-dimensional hyperplanes defining the gap or thinning. This is easy since our method produces the pTree mask the sequence of F-values and the sequence of counts of points that give us those value that we use to find large gaps in the first place.
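A rough sketch (hypothetical NumPy) of one such step: find the largest gap in F = X o d, take p and q as the means of the two boundary F-slices (not of the whole half-spaces), and re-orient d along q - p; the slice tolerance is our assumption.

```python
import numpy as np

def reorient_d(X, d, slice_tol=0.5):
    """Re-orient d along q - p, where p and q are the means of the F-slices
    bounding the largest gap in F = X o d."""
    d = d / np.linalg.norm(d)
    F = X @ d
    Fs = np.sort(F)
    gaps = np.diff(Fs)
    i = int(np.argmax(gaps))                              # largest gap lies between Fs[i] and Fs[i+1]
    lo, hi = Fs[i], Fs[i + 1]
    p = X[np.abs(F - lo) <= slice_tol].mean(axis=0)       # mean of the lower boundary slice
    q = X[np.abs(F - hi) <= slice_tol].mean(axis=0)       # mean of the upper boundary slice
    new_d = q - p
    return new_d / np.linalg.norm(new_d), (lo, hi)
```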

(Figure: points 1-9 and a-s plotted on a 16x16 grid; the d1 line shows a 1-gap between p and q, while the rotated d2 line shows a wider d2-gap.)

The d2-gap is much larger than the d1-gap (still not optimal). Weight the mean by the distance from the gap? (the d-barrel radius)

(Figure: the same construction on a reduced point set: d1 with its 1-gap, the boundary-slice means p and q, and d2 with its d2-gap.)

In this example it seems to make for a larger gap, but what weighting should be used? (e.g., 1/radius^2; zero weighting after the first gap is identical to the previous approach). Also, we really want to identify the support-vector pair of the gap (the pair, one point from each side, that are closest together) as p and q (in this case, 9 and a - but we were just lucky to draw our vector through them). We could check the d-barrel radius of just these gap-slice pairs and select the closest pair as p and q.
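A short sketch (hypothetical NumPy) of picking the support pair across a gap: among the points in the F-slice just below the gap and the F-slice just above it, choose the cross pair with the smallest full-space distance as p and q.

```python
import numpy as np

def support_pair(X, F, lo, hi, slice_tol=0.5):
    """Closest cross-pair (p, q) between the boundary F-slices of a gap [lo, hi] in F = X o d."""
    A = X[np.abs(F - lo) <= slice_tol]                    # boundary slice on the low side
    B = X[np.abs(F - hi) <= slice_tol]                    # boundary slice on the high side
    dists = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    i, j = np.unravel_index(np.argmin(dists), dists.shape)
    return A[i], B[j]
```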

Dot F p=aaan q=aaax 0 6 1 28 2 7 3 7 4 1 5 1 9 710 311 512 1313 814 1215 416 217 1218 519 620 621 322 823 324 3

C1<7 (50 Set)

7<C2<16 (4i, 48e)

hill-climb gap at 16 w half-space avgs.

C3>16 (46i, 2e)

C2uC3 p=avg<16 q=avg>16 0 1 1 1 2 2 3 1 7 2 9 210 211 312 313 214 515 116 317 318 219 220 421 522 223 524 925 126 127 328 229 130 331 532 233 334 335 136 237 438 139 142 244 145 247 2

No conclusive gaps. Sparse low end: check [0,9].
F:      0    1    2    2    3    7    7    9    9
       i39  e49  e8   e44  e11  e32  e30  e15  e31
i39     0   17   21   21   24   22   19   19   23
e49    17    0    4    4    7    8    8    9    9
e8     21    4    0    1    5    7    8   10    8
e44    21    4    1    0    4    6    8    9    7
e11    24    7    5    4    0    7    9   11    7
e32    22    8    7    6    7    0    3    6    1
e30    19    8    8    8    9    3    0    4    4
e15    19    9   10    9   11    6    4    0    6
e31    23    9    8    7    7    1    4    6    0
i39, e49, e11 are singleton outliers; {e8, e44} is a doubleton outlier set.

Sparse high end: check [38,47] distances.
F:     38   39   42   42   44   45   45   47   47
       i31  i8   i36  i10  i6   i23  i32  i18  i19
i31     0    3    5   10    6    7   12   12   10
i8      3    0    7   10    5    6   11   11    9
i36     5    7    0    8    5    7    9   10    9
i10    10   10    8    0   10   12    9    9   14
i6      6    5    5   10    0    3    9    8    5
i23     7    6    7   12    3    0   11   10    4
i32    12   11    9    9    9   11    0    4   13
i18    12   11   10    9    8   10    4    0   12
i19    10    9    9   14    5    4   13   12    0
i10, i18, i19, i32, i36 are singleton outliers; {i6, i23} is a doubleton outlier.

There is a thinning at 22, and it is the same one, but it is not more prominent. Next we attempt to hill-climb the gap at 16 using the means of the half-space boundaries (i.e., p is the average at F=14; q is the average at F=17).

C123 p avg=14 q avg=17 0 1 2 3 3 2 4 4 5 7 6 4 7 8 8 2 9 1110 412 313 120 121 122 223 127 228 129 130 231 4

32 233 334 435 136 337 438 239 240 541 342 343 644 845 146 247 148 349 351 752 2

53 254 355 156 357 358 161 263 264 166 167 1

Here, the gap between C1 and C2 is more pronounced. Why? And the thinning between C2 and C3 is more obscure? It did not grow the gap we wanted to grow (between C2 and C3).

Page 32

CAINE 2013 Call for Papers - 26th International Conference on Computer Applications in Industry and Engineering, September 25-27, 2013, Omni Hotel, Los Angeles, California, USA. Sponsored by the International Society for Computers and Their Applications (ISCA). CAINE-2013 will feature contributed papers as well as workshops and special sessions. Papers will be accepted into oral presentation sessions. The topics will include, but are not limited to, the following areas: Agent-Based Systems, Image/Signal Processing, Autonomous Systems, Information Assurance, Big Data Analytics, Information Systems/Databases, Bioinformatics, Biomedical Systems/Engineering, Internet and Web-Based Systems, Computer-Aided Design/Manufacturing, Knowledge-Based Systems, Computer Architecture/VLSI, Mobile Computing, Computer Graphics and Animation, Multimedia Applications, Computer Modeling/Simulation, Neural Networks, Computer Security, Pattern Recognition/Computer Vision, Computers in Education, Rough Set and Fuzzy Logic, Computers in Healthcare, Robotics, Computer Networks, Fuzzy Logic Control Systems, Sensor Networks, Data Communication, Scientific Computing, Data Mining, Software Engineering/CASE, Distributed Systems, Visualization, Embedded Systems, Wireless Networks and Communication. Important Dates: Workshop/special session proposal - May 25, 2013. Full Paper Submission - June 5, 2013. Notification of Acceptance - July 5, 2013. Pre-registration & Camera-Ready Paper Due - August 5, 2013. Event Dates - Sept 25-27, 2013.

SEDE Conf is interested in gathering researchers and professionals in the domains of SE and DE to present and discuss high-quality research results and outcomes in their fields. SEDE 2013 aims at facilitating cross-fertilization of ideas in Software and Data Engineering, The conference topics include, but not limited to:. Requirements Engineering for Data Intensive Software Systems. Software Verification and Model of Checking. Model-Based Methodologies. Software Quality and Software Metrics. Architecture and Design of Data Intensive Software Systems. Software Testing. Service- and Aspect-Oriented Techniques. Adaptive Software Systems. Information System Development. Software and Data Visualization. Development Tools for Data Intensive. Software Systems. Software Processes. Software Project Mgnt. Applications and Case Studies. Engineering Distributed, Parallel, and Peer-to-Peer Databases. Cloud infrastructure, Mobile, Distributed, and Peer-to-Peer Data Management. Semi-Structured Data and XML Databases. Data Integration, Interoperability, and Metadata. Data Mining: Traditional, Large-Scale, and Parallel. Ubiquitous Data Management and Mobile Databases. Data Privacy and Security. Scientific and Biological Databases and Bioinformatics. Social networks, web, and personal information management. Data Grids, Data Warehousing, OLAP. Temporal, Spatial, Sensor, and Multimedia Databases. Taxonomy and Categorization. Pattern Recognition, Clustering, and Classification. Knowledge Management and Ontologies. Query Processing and Optimization. Database Applications and Experiences. Web Data Mgnt and Deep WebMay 23, 2013 Paper Submission Deadline June 30, 2013 Notification of AcceptanceJuly 20, 2013 Registration and Camera-Ready Manuscript Conference Website: http://theory.utdallas.edu/SEDE2013/

ACC-2013 provides an international forum for presentation and discussion of research on a variety of aspects of advanced computing and its applications, and communication and networking systems. Important Dates May 5, 2013 - Special Sessions Proposal June 5, 2013 - Full Paper Submission July 5, 2013 - Author Notification Aug. 5, 2013 - Advance Registration & Camera Ready Paper Due

CBR International Workshop Case-Based Reasoning CBR-MD 2013 July 19, 2013, New York/USA Topics of interest include (but are not limited to): CBR for signals, images, video, audio and text Similarity assessment Case representation and case mining Retrieval and indexing Conversational CBR Meta-learning for model improvement and parameter setting for processing with CBR Incremental model improvement by CBR Case base maintenance for systems Case authoring Life-time of a CBR system Measuring coverage of case bases Ontology learning with CBR Submission Deadline: March 20th, 2013 Notification Date: April 30th, 2013 Camera-Ready Deadline: May 12th, 2013

Workshop on Data Mining in Life Sciences (DMLS). Topics: Discovery of high-level structures, including e.g. association networks; Text mining from biomedical literature; Medical image mining; Biomedical signal mining; Temporal and sequential data mining; Mining heterogeneous data; Mining data from molecular biology, genomics, proteomics, phylogenetic classification. With regard to different methodologies and case studies: Data mining project development methodology for biomedicine; Integration of data mining in the clinic; Ontology-driven data mining in life sciences; Methodology for mining complex data, e.g. a combination of laboratory test results, images, signals, genomic and proteomic samples; Data mining for personal disease management; Utility considerations in DMLS, including e.g. cost-sensitive learning. Submission deadline: March 20, 2013; Notification date: April 30, 2013; Camera-ready deadline: May 12, 2013; Workshop date: July 19, 2013.

Workshop on Data Mining in Marketing (DMM'2013). In the business environment, data warehousing - the practice of creating huge, central stores of customer data that can be used throughout the enterprise - is becoming more and more common practice and, as a consequence, the importance of data mining is growing stronger. Topics: Applications in Marketing; Methods for User Profiling; Mining Insurance Data; E-Marketing with Data Mining; Logfile Analysis; Churn Management; Association Rules for Marketing Applications; Online Targeting and Controlling; Behavioral Targeting; Juridical Conditions of E-Marketing, Online Targeting and so on; Control of Online-Marketing Activities; New Trends in Online Marketing; Aspects of E-Mailing Activities and Newsletter Mailing. Submission deadline: March 20, 2013; Notification date: April 30, 2013; Camera-ready deadline: May 12, 2013; Workshop date: July 19, 2013.

Workshop on Data Mining in Agriculture (DMA 2013). Topics: Data Mining on Sensor and Spatial Data from Agricultural Applications; Analysis of Remote Sensor Data; Feature Selection on Agricultural Data; Evaluation of Data Mining Experiments; Spatial Autocorrelation in Agricultural Data. Submission deadline: March 20, 2013; Notification date: April 30, 2013; Camera-ready deadline: May 12, 2013; Workshop date: July 19, 2013.

Page 33:

Hierarchical Clustering

Any maximal anti-chain (a maximal set of nodes such that no two are directly connected) is a clustering. (A dendrogram offers many such maximal anti-chains.)

[Dendrogram sketch: leaves A, B, C, D, E, F, G; internal nodes BC, DE, FG; root DEFGABC]

But horizontal anti-chains are the clusterings produced by top-down (or bottom-up) methods.
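A horizontal cut of a dendrogram is easy to reproduce with standard tools. Below is a minimal Python sketch (not from the slides) using SciPy's hierarchical-clustering routines; the toy data X and the choice of 3 clusters are illustrative assumptions only.

```python
# Minimal sketch: a horizontal cut of a dendrogram (one maximal anti-chain)
# is exactly what fcluster returns.  X and the cluster count are toy choices.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0], [1.0], [5.0], [6.0], [20.0], [21.0]])  # toy 1-D points

Z = linkage(X, method="single")                   # build the dendrogram
labels = fcluster(Z, t=3, criterion="maxclust")   # horizontal cut into 3 clusters
print(labels)                                     # e.g. [1 1 2 2 3 3]
```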

Page 34:

F-value counts (value: count): 0:1, 1:1, 5:1, 6:1, 7:1, 8:4, 9:1, 10:1, 11:2, 12:1, 13:5, 14:1, 15:3, 16:3, 17:4, 18:1, 19:3, 20:9, 21:4, 22:3, 23:7, 24:2, 25:4, 26:8, 27:7, 28:7, 29:10, 30:3, 31:1, 32:3, 33:6, 34:4, 35:5, 37:2, 38:2, 40:1, 42:3, 43:1, 44:1, 45:1, 46:4, 49:1, 56:1, 58:1, 61:1, 65:1, 66:1, 69:1, 71:1, 77:1, 80:1, 83:1, 86:1, 100:1, 103:1, 105:1, 108:2, 112:1

[0,90): 43L 46M 55H | gap=14 | [90,113): 0L 6M 0H = CLUS_1

GV F=(DPP-MN)/4 Concrete(C, W, FA, A)

gap=6 | [74,90): 0L 4M 0H = CLUS_2

CLUS_4 = [0,52) | gap=7 | [52,74): 0L 7M 0H = CLUS_3
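The cuts above come from large gaps in the sorted F-values. The Python sketch below shows just that gap-cut step; it assumes F has already been computed (here F = (DPP-MN)/4 is not recomputed), and the value list and gap threshold are illustrative.

```python
# Cut a set of F-values into clusters wherever consecutive distinct values
# are separated by a gap >= min_gap.  The values below are toy data.
def gap_cut(f_values, min_gap=5):
    vals = sorted(set(f_values))
    clusters, current = [], [vals[0]]
    for prev, v in zip(vals, vals[1:]):
        if v - prev >= min_gap:     # a big gap closes the current cluster
            clusters.append(current)
            current = []
        current.append(v)
    clusters.append(current)
    return clusters                 # each cluster is a list of F-values

f = [0, 1, 5, 6, 7, 8, 20, 21, 22, 40, 41]
print(gap_cut(f, min_gap=5))        # [[0, 1, 5, 6, 7, 8], [20, 21, 22], [40, 41]]
```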

CLUS_4 (F=(DPP-MN)/2, Fgap≥2). F-value counts (value: count): 0:3, 7:4, 9:1, 10:12, 11:8, 12:7, 15:4, 18:10, 21:3, 22:7, 23:2, 25:2, 26:3, 27:1, 28:2, 29:1, 31:3, 32:1, 34:2, 40:4, 47:3, 52:1, 53:3, 54:3, 55:4, 56:2, 57:3, 58:1, 60:2, 61:2, 62:4, 64:4, 67:2, 68:1, 71:7, 72:3, 79:5, 85:1, 87:2

gap=7 | F=79: 5L 0M 0H = CLUS_4.1.1, Median=79, Avg=79 | gap=6 | [74,90): 2L 0M 1H = CLUS_4.1, 1 H err in L, Median=87, Avg=86.3

F=0: 0L 0M 3H = CLUS_4.4.1, Median=0, Avg=0 | gap=7 | F=7: 0L 0M 4H = CLUS_4.4.2, Median=7, Avg=7 | gap=2 | [8,14]: 1L 5M 22H = CLUS_4.4.3, 1L+5M errs in H, Median=11, Avg=10.7

gap=2 | [30,33]: 0L 4M 0H = CLUS_4.2.1, Median=31, Avg=32.3 | gap=2 | F=34: 0L 2M 0H = CLUS_4.2.2, Median=34, Avg=34 | gap=6 | F=40: 0L 4M 0H = CLUS_4.2.3, Median=40, Avg=40 | gap=7 | F=47: 0L 3M 0H = CLUS_4.2.4, Median=47, Avg=47 | gap=5

gap=3 | [70,79): 10L 0M 0H = CLUS_4.5, Median=71, Avg=71.7

gap=2 | F=64: 2L 0M 2H = CLUS_4.6.1, 2 H errs in L, Median=64, Avg=64 | gap=3 | [66,70): 10L 0M 0H = CLUS_4.6.2, Median=67, Avg=67.3

gap=3 | F=15: 0L 0M 4H = CLUS_4.3.1, Median=15, Avg=15 | gap=3 | F=18: 0L 0M 10H = CLUS_4.3.2, Median=18, Avg=18 | gap=3

Accuracy=90%

[20,24): 0L 10M 2H = CLUS_4.7.2, 2 H errs in M, Median=22, Avg=22 | gap=2 | [24,30): 10L 0M 0H = CLUS_4.7.1, Median=26, Avg=26

[50,59): 12L 1M 4H = CLUS_4.8.1, 1M+4H errs in L, Median=55, Avg=55 | gap=2 | [59,63): 8L 0M 0H = CLUS_4.8.2, Median=61.5, Avg=61.3

Agglomerate (build the dendrogram) by iteratively gluing together the clusters with minimum Median separation. Should I have normalized the rounds? Should I have used the same F divisor and made sure the range of values was the same in the 2nd round as in the 1st round (on CLUS 4)? Can I normalize after the fact by multiplying 1st-round values by 100/88 ≈ 1.14? Or agglomerate the 1st-round clusters and then independently agglomerate the 2nd-round clusters? C1 C2 C3 C4
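A minimal sketch of the "glue the clusters with minimum Median separation" rule described above; the leaf clusters fed to it are illustrative F-value sets, not the actual CLUS 4.x contents.

```python
# Agglomerate 1-D leaf clusters by repeatedly merging the pair whose
# medians are closest; returns the merge history.  Leaf sets are toy data.
import statistics

def agglomerate_by_median(clusters):
    nodes = [list(c) for c in clusters]
    history = []
    while len(nodes) > 1:
        best = None                            # (separation, i, j)
        for i in range(len(nodes)):
            for j in range(i + 1, len(nodes)):
                sep = abs(statistics.median(nodes[i]) - statistics.median(nodes[j]))
                if best is None or sep < best[0]:
                    best = (sep, i, j)
        sep, i, j = best
        history.append((sep, tuple(nodes[i]), tuple(nodes[j])))
        merged = nodes[i] + nodes[j]
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [merged]
    return history

leaves = [[9, 10, 11], [15, 18], [22, 23], [40], [61, 62]]   # illustrative only
for sep, a, b in agglomerate_by_median(leaves):
    print(f"merge {a} + {b}  (median separation {sep})")
```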

At this level, FinalClus1 = {17M}, 0 errors.

[Dendrogram figure: subcluster medians 62, 33, 17, 71, 23, 21, 9, 34, 57, 86, 71, 10, 56, 14, 61, 18, 40]

Suppose we know (or want) 3 clusters: Low, Medium and High Strength. We can use an anti-chain that gives us exactly 3 subclusters in two ways, one shown in brown and the other in purple. Which would we choose? The brown seems to give slightly more uniform subcluster sizes. Brown error count: Low (bottom) 11, Medium (middle) 0, High (top) 26, so 96/133 = 72% accurate. Purple error count: Low 2, Medium 22, High 35, so 74/133 = 56% accurate. What about agglomerating using single-link agglomeration (minimum pairwise distance)?
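The accuracy figures quoted here are just majority-class bookkeeping over the chosen anti-chain. A small sketch of that computation; the (L, M, H) count triples below are illustrative, not the actual brown or purple subclusters.

```python
# Majority-class accuracy of an anti-chain: each subcluster is labeled with
# its majority class; everything else in it counts as an error.
def antichain_accuracy(subclusters):
    correct = sum(max(counts) for counts in subclusters)   # majority per subcluster
    total = sum(sum(counts) for counts in subclusters)
    return correct / total

# illustrative (L, M, H) count triples -- not the actual brown/purple anti-chains
example = [(40, 5, 2), (3, 35, 6), (1, 4, 37)]
print(f"accuracy = {antichain_accuracy(example):.0%}")     # 84%
```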

CONCRETE

Page 35:


Agglomerating using single link (minimum pairwise distance) = minimum gap size! (Glue min-gap adjacent clusters first.)

The first thing we notice is that outliers mess up agglomerations that are supervised by knowledge of the number of subclusters expected. Therefore we might remove outliers by backing away from all gap≥5 agglomerations, then looking for a 3-subcluster maximal anti-chain.

What we have done is to declare F<7 and F>84 as extreme tripleton outlier sets, and F=79, F=40 and F=47 as singleton-value outlier sets, because they are F-gapped by at least 5 on either side (actually a gap of 10, since F = (DPP-MN)/2).
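One way to code that outlier rule: split the F-value histogram at gaps of at least 5 and flag the small, well-separated runs as outlier sets. The sketch below is my reading of the rule; the histogram and the smallness threshold are illustrative assumptions.

```python
# Flag F-value runs that are separated from their neighbors by gaps >= min_gap
# and are "small" as outlier sets.  The histogram and the smallness threshold
# (size <= 5) are illustrative assumptions, not the exact CLUS 4 rule.
def gapped_outlier_sets(value_counts, min_gap=5, max_size=5):
    vals = sorted(value_counts)
    runs, current = [], [vals[0]]
    for prev, v in zip(vals, vals[1:]):
        if v - prev >= min_gap:          # gap >= min_gap starts a new run
            runs.append(current)
            current = []
        current.append(v)
    runs.append(current)
    outliers = []
    for run in runs:
        size = sum(value_counts[v] for v in run)
        if size <= max_size:             # small, well-separated run -> outlier set
            outliers.append((run, size))
    return outliers

counts = {0: 3, 10: 12, 11: 8, 12: 7, 40: 4, 47: 3, 55: 4, 56: 2, 85: 1, 87: 2}
print(gapped_outlier_sets(counts))       # e.g. ([0], 3), ([40], 4), ([47], 3), ([85, 87], 3)
```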

The brown anti-chain gives more uniform sizes. Brown errors: Low (bottom) 8, Medium (middle) 12 and High (top) 6, so 107/133 = 80% accurate.

The one decision to agglomerate C4.7.1 to C4.7.2 (gap=3) instead of C4.3.2 to C4.7.2 (gap=3) causes lots of error. C4.7.1 and C4.7.2 are problematic since they separate out, but in increasing F order the pattern is H M L M L, so if we suspected this pattern we would look for 5 subclusters.

The 5 orange errors in increasing F-order are: 6, 2, 0, 0, 8 so 127/133=95% accurate.

If you have ever studied concrete, you know it is a very complex material. The fact that it clusters out with an F-order pattern of HMLML is just bizarre! So we should expect errors. CONCRETE
