
Graph Based Facet Selection
NICK YU, LEON ZHANG, ANTON EMELYANOV

Why Facet Selection?

Obviously, our graph has errors

This is because our sources have errors

Ideally, when we have more data from more sources, our data correctness should improve (but it doesn’t)

Having more data ≠ Having more knowledge

If our precision tier needs to have a low error rate, we need a way to filter out errors

Source Based Facet Selection

All facets from one data source share the same confidence

Aggregate confidence when multiple data sources have the same facet

However… as we add more sources, it is impractical to maintain a score per source per property

Even if we trust a source highly, its individual facts cannot all be 100% correct

The selection process does not take into account the information we already know

[Diagram: Mari Henmi and Emiri Henmi linked by Children edges in both directions]

Taking Advantage of Prior Knowledge

When seen in isolation, it’s hard to know whether Mari is Emiri’s mother or child

However, if we know the following facts (each with some probability of being true), then the job is a lot easier:

◦ Emiri’s other parent, Teruhiko, is Mari’s husband

◦ Emiri’s sibling, Noritaka, is Mari’s child

[Diagram: Mari Henmi and Teruhiko Saigo connected by a Spouse edge, Children edges to Noritaka Henmi and Emiri Henmi, a Sibling edge between Noritaka and Emiri, and the ambiguous Children edges between Mari and Emiri]

Inspired by Google’s Knowledge Vault concept

Graph Based Selection using Prior Knowledge

We can generalize the model in the following form. Given a triple ⟨S, P, O⟩, let π1, …, πn be all possible paths that connect S to O, including reverse edges and multiple hops. The probability of that triple being true is:

P(⟨S, P, O⟩ is true) = f(x1, …, xn), where xi indicates whether path πi is present between S and O

In particular, for the triple ⟨Mari, Children, Emiri⟩, we first find all the paths from Mari to Emiri.
◦ Examples include the path Mari → (Spouse) → Teruhiko → (Children) → Emiri, and the path Mari → (Children) → Noritaka → (Sibling) → Emiri

Then we assign a weight wi to each of the paths, and calculate the prediction as:

P(⟨S, P, O⟩ is true) = σ(b + Σi wi · xi), where b is a bias term and σ is the sigmoid function

Several linear models are possible; we find that logistic regression performs very well
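As a concrete illustration, here is a minimal sketch of that scoring step: each path type found between S and O becomes a binary feature, and the prediction is the bias plus the weighted sum of fired features, pushed through a sigmoid. The path signatures and weights below are made-up placeholders, not values from the trained models.

```python
import math

def score_triple(path_features, weights, bias):
    """Score a candidate triple <S, P, O> from the paths that connect S to O.

    path_features: set of path signatures found between S and O
    weights:       learned weight per path signature (one model per property P)
    bias:          learned intercept of the logistic regression
    """
    total = bias + sum(weights.get(path, 0.0) for path in path_features)
    return 1.0 / (1.0 + math.exp(-total))   # sigmoid -> probability the triple is true

# Illustrative use (hypothetical path names and weights):
weights = {"->spouse->children": 2.1, "->children->sibling": 1.4}
paths_found = {"->spouse->children", "->children->sibling"}
print(score_triple(paths_found, weights, bias=-3.4))
```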

Implementation

Steps to calculate this score:

1. Create a training set and a test set
2. Mine all the possible paths from S to O
3. Treat each path as a feature and train a model

Simple, right?

Training and Test Set

Need to contain:

◦ Positive examples – easy
◦ Negative examples – how?

Local Closed World Assumption:
◦ For a given entity, if we know some values of a property, then we assume we know all values of that property
◦ More concretely, if we already know that Tom Cruise has three children, then any other entity is unlikely to be his fourth child – this is a possible negative example
◦ However, if we don’t know his children at all, then we cannot say who must not be his child – this is not a possible negative example
◦ Remember, we just need LCWA to be true often enough to generate negative examples; it doesn’t have to be 100% true (a sketch of LCWA-based negative generation follows)
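A minimal sketch of how LCWA could be used to generate negative examples, assuming a simple dictionary view of the graph (subject → property → known objects); the helper and its data layout are illustrative, not the actual pipeline.

```python
def lcwa_negatives(graph, prop, candidate_pairs):
    """Filter (subject, object) pairs down to usable negatives under LCWA.

    graph: dict mapping subject -> {property -> set of known objects}
    A pair (s, o) is a usable negative only if we already know *some* values of
    `prop` for s (so the property is "closed" for s) and o is not one of them.
    """
    negatives = []
    for s, o in candidate_pairs:
        known_objects = graph.get(s, {}).get(prop, set())
        if known_objects and o not in known_objects:
            negatives.append((s, o))
        # if known_objects is empty, we know nothing about s's `prop`,
        # so (s, o) cannot be used as a negative example
    return negatives
```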

Leon Zhang
I think we should describe more about how to choose the negative set. You spent a lot of time here, and it is critical to overall success. We should describe it clearly to make sure people understand us.

What Negative Examples to Choose

Are all entities that violate LCWA also good candidates for negative examples?

Randomly pick any entity?
◦ This cannot work because paths between any random pair of entities are very sparse
◦ The classifier would learn to detect whether any connection exists between two entities, not whether the right kind of connection exists

The related entities of positive examples?
◦ Choose A-B to be a negative example if A-C is a positive example and B and C are related entities
◦ The connection between A and B is still very sparse

All neighbor entities?
◦ Choose A-B to be a negative example if A is already connected to B and A-B is not a positive example
◦ Need to make sure B has the same type as the expected type of the property (a sketch follows below)
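The neighbor-based strategy above could look roughly like this sketch; the graph layout (subject → property → objects) and the type-lookup table are assumptions for illustration.

```python
def neighbor_negative_candidates(graph, entity_types, s, prop, expected_type, positives):
    """Candidate negative objects for (s, prop): entities s is already connected to,
    of the property's expected type, that are not known positives.

    For brevity this sketch only follows forward edges from s; backward edges
    could be gathered the same way.
    """
    neighbors = set()
    for objects in graph.get(s, {}).values():   # entities s is connected to via any property
        neighbors.update(objects)
    return {
        o for o in neighbors
        if entity_types.get(o) == expected_type   # must match the property's expected type
        and o not in positives                    # and must not be a known positive
    }
```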

All Paths (Rules) That Connect S to O

We find the following paths between any two entities:
◦ 1 hop: all forward and backward edges
◦ 2 hops: all edge combinations in the f/f, f/b, b/f, and b/b directions
◦ Excluding intermediate hub entities
◦ We use bidirectional search to speed up the job (a sketch follows below)
◦ Performance breaks down beyond two hops – this can be improved
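One possible way to enumerate the 1- and 2-hop path signatures is a meet-in-the-middle join on the intermediate entity, which approximates the bidirectional-search idea for the two-hop case. The graph layout and helper names are assumptions; the production implementation is not shown in the deck.

```python
def one_hop_edges(graph, entity):
    """Yield (signature, neighbor) for forward (->prop) and backward (<-prop) edges.

    graph: dict mapping subject -> {property -> set of objects}.
    The backward scan is a naive full pass, fine for a sketch only.
    """
    for prop, objects in graph.get(entity, {}).items():
        for o in objects:
            yield "->" + prop, o
    for subj, props in graph.items():
        for prop, objects in props.items():
            if entity in objects:
                yield "<-" + prop, subj

def paths_up_to_two_hops(graph, s, o, hub_entities):
    """All 1- and 2-hop path signatures from s to o, skipping hub intermediates."""
    paths = set()
    from_s = list(one_hop_edges(graph, s))
    # 1 hop: direct forward/backward edges from s to o
    paths.update(sig for sig, mid in from_s if mid == o)
    # 2 hops: join s's one-hop frontier with o's one-hop frontier on the intermediate
    to_o = {}
    for sig, mid in one_hop_edges(graph, o):
        # an edge seen from o must be flipped to read it in the mid -> o direction
        flipped = ("<-" + sig[2:]) if sig.startswith("->") else ("->" + sig[2:])
        to_o.setdefault(mid, set()).add(flipped)
    for sig1, mid in from_s:
        if mid in hub_entities or mid == o:
            continue
        for sig2 in to_o.get(mid, set()):
            paths.add(sig1 + sig2)
    return paths
```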

(Some) Rules Trained for Marriage Property

Rule  Precision  Recall  F1
->people.person.children<-people.person.children  0.990501  0.254667  0.405163
->people.person.children->people.person.parent  0.995007  0.25403  0.40473
<-people.person.parent<-people.person.children  0.991585  0.253872  0.404246
<-people.person.parent->people.person.parent  0.995389  0.253263  0.403788
->film.actor.film<-film.actor.film  0.210501  0.044027  0.072822
<-film.film.actor->film.film.actor  0.210041  0.043981  0.072733
->film.actor.film<-film.actor.performance--film.performance.film  0.211015  0.043931  0.072722
->film.actor.film->film.film.performance--film.performance.actor  0.211004  0.043931  0.072721
<-film.film.actor<-film.actor.film  0.210119  0.043965  0.072715
->film.actor.film->film.film.actor  0.210183  0.043959  0.072711
<-film.film.actor<-film.actor.performance--film.performance.film  0.210729  0.043914  0.072681
<-film.film.actor->film.film.performance--film.performance.actor  0.210711  0.043914  0.07268

Train Final Models for Each Property

Given the paths (rules) mined for each property, we train one logistic regression model per property p:

P(⟨S, p, O⟩ is true) = σ(bp + Σi wp,i · xi), with a separate weight vector wp and bias bp for each property p

How to map the rules to a feature vector?
◦ There are 90,000 distinct possible paths between any two given entities. This maps to a feature vector of 90,000 dimensions.
◦ There could be more paths as we grow our graph. How do we assign dimensions to new paths?

Our solution – hash kernels (sketched below)
◦ Project the feature space down to a 1,500-dimensional hash space
◦ Learn the model on the hashed feature space
◦ Use L1 regularization to get rid of useless features
◦ Collisions are handled to some degree by the hash kernel itself; additional collisions are handled by having multiple hash kernels
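A sketch of the hashing trick implied by "hash kernels": every path signature is hashed into a fixed 1,500-dimensional vector, so newly appearing paths need no new dimensions. The signed hashing and the use of two hash functions per feature are one possible reading of "multiple hash kernels"; the deck does not spell out the exact kernel, so treat this as an assumption.

```python
import hashlib

DIM = 1500  # size of the hashed feature space

def _hash(path, seed):
    """Deterministic integer hash of a path signature under a given seed."""
    digest = hashlib.md5((seed + path).encode("utf-8")).hexdigest()
    return int(digest, 16)

def hash_features(path_signatures, num_kernels=2):
    """Map a set of path signatures to a dense vector of length DIM.

    Each kernel uses its own seed; paths that collide under one hash are unlikely
    to collide under the other, and L1 regularization can later zero useless bins.
    """
    x = [0.0] * DIM
    for path in path_signatures:
        for k in range(num_kernels):
            h = _hash(path, seed=str(k))
            sign = 1.0 if (h >> 1) % 2 == 0 else -1.0   # signed hashing reduces collision bias
            x[h % DIM] += sign / num_kernels
    return x
```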

Mari Henmi & Emiri Henmi

[Diagram: the ambiguous Children edges between Mari Henmi and Emiri Henmi]

Is Mari a child of Emiri's?

Rule  Weight
Bias  -3.38435
<-people.person.children
<-people.person.marriage--time.event.person<-people.person.parent  -0.03988
<-people.person.marriage--time.event.person->people.person.children  -0.3237
<-people.person.parent
<-people.person.parent<-people.person.sibling--people.sibling_relationship.sibling
<-people.person.parent<-people.person.siblings
<-people.person.parent->people.person.sibling--people.sibling_relationship.sibling
<-people.person.parent->people.person.siblings  -0.3237
->people.person.children
->people.person.children<-people.person.sibling--people.sibling_relationship.sibling  -0.03796
->people.person.children<-people.person.siblings
->people.person.children->people.person.sibling--people.sibling_relationship.sibling  -0.12332
->people.person.children->people.person.siblings  -0.03463
->people.person.marriage--time.event.person<-people.person.parent  -0.4855
->people.person.marriage--time.event.person->people.person.children  -0.06937
->people.person.parent
Total  -4.8224
Sigmoid  0.007983

Is Emiri a child of Mari's?

Rule  Weight
Bias  -3.38435
<-people.person.children
<-people.person.children<-people.person.marriage--time.event.person  3.043293
<-people.person.children->people.person.marriage--time.event.person  1.556977
<-people.person.parent
<-people.person.sibling--people.sibling_relationship.sibling<-people.person.children  1.72802
<-people.person.sibling--people.sibling_relationship.sibling->people.person.parent  1.194149
<-people.person.siblings<-people.person.children  0.369578
<-people.person.siblings->people.person.parent  0.436715
->people.person.children
->people.person.parent
->people.person.parent<-people.person.marriage--time.event.person  3.05227
->people.person.parent->people.person.marriage--time.event.person  1.445125
->people.person.sibling--people.sibling_relationship.sibling<-people.person.children  1.386518
->people.person.sibling--people.sibling_relationship.sibling->people.person.parent  0.989205
->people.person.siblings<-people.person.children  1.365237
->people.person.siblings->people.person.parent  0.827563
Total  14.0103
Sigmoid  0.99999
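Reading the tables: the score is the bias plus the weights of the rules that fired (rules listed without a weight contribute nothing), pushed through a sigmoid. A quick sanity check of the totals shown above:

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

print(sigmoid(-4.8224))   # ~0.00798, matching the 0.007983 reported for "Mari is Emiri's child"
print(sigmoid(14.0103))   # ~0.9999992, matching the 0.99999 reported for "Emiri is Mari's child"
```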

Leon Zhang
hard to understand

Measurement

We measure the trained models on a separate hold-out set

Precision = True Positives / Predicted Positives

Recall = True Positives / Labeled Positives

Most models have high precision but not especially high recall

This is because the model can’t reason about shallow entities

Property  Precision  Recall
automotive.automotive_class.related  0.997171  0.999055
automotive.trim_level.model_year  1  1
automotive.trim_level.option_package--automotive.option_package.trim_levels  1  0.924326
automotive.trim_level.related_trim_level  1  1
award.nominated_work.nomination--award.nomination.nominee  0.911385  0.327528
award.nominee.award_nominations--award.nomination.nominated_work  0.852335  0.265277
award.winner.awards_won--award.honor.winner  0.773061  0.694073
award.winning_work.honor--award.honor.winner  0.907776  0.186617
education.school.school_district  0.983673  0.598015
film.actor.film  0.967154  0.981438
film.director.film  0.674589  0.81029
film.film.actor  0.991635  0.981757
film.film.art_director  0.621622  0.042048
film.film.country  0.86165  0.776492
film.film.director  0.705793  0.821246
film.film.editor  0.886905  0.060105
film.film.language  1  0.780943
film.film.music  0.945946  0.01992
film.film.performance--film.performance.actor  0.620755  0.07921
film.film.producer  0.492865  0.052533
film.film.production_company  0.94723  0.187467
film.film.story  0.976492  0.159292
film.film.writer  0.795948  0.787734
film.producer.film  0.829213  0.317553
film.writer.film  0.704711  0.77448
music.artist.track_contributions--music.track_contribution.track  0.963513  0.835044
music.track.artist  0.955354  0.975672
music.track.producer  0.76477  0.705882
organization.organization.headquarters--location.address.city_entity  0.865854  0.022955
organization.organization.headquarters--location.address.subdivision_entity  0.811475  0.037106
people.deceased_person.place_of_death  0.916096  0.075246
people.person.children  0.998595  0.917393
people.person.marriage--time.event.person  0.996065  0.798455
people.person.nationality  0.95238  0.9496
people.person.parent  0.998061  0.916419
people.person.place_of_birth  0.966238  0.344751
people.person.sibling--people.sibling_relationship.sibling  0.993952  0.909814
people.person.siblings  0.998787  0.911975
soccer.player.national_career_roster--sports.sports_team_roster.team  0.997714  0.637226
sports.pro_athlete.team--soccer.roster_position.team  0.607641  0.919797
sports.pro_athlete.team--sports.sports_team_roster.team  0.811641  0.633239

Demo http://10.123.70.114:8787/

Handling Scalar Values

Our models only handle entity-entity facets and not entity-value facets

To handle scalar values, we can bucketize the values and then treat buckets as entities. Then, we can apply the same algorithm.

For example, to score the facet “Tom Cruise is born on 7/3/1962”, we can do the following:
◦ Bucketize 7/3/1962 into the entity “1960s” (sketched below)
◦ Find all possible paths between Tom Cruise and “1960s”
◦ One such path could be:
◦ We can assign high weights to paths like this one, and the rest works the same as before
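A small sketch of the bucketization step, assuming decade-sized buckets for dates; the bucket naming is illustrative.

```python
from datetime import date

def date_to_bucket_entity(d):
    """Map a date value to a coarse bucket entity, e.g. 1962-07-03 -> '1960s'."""
    decade = (d.year // 10) * 10
    return f"{decade}s"

print(date_to_bucket_entity(date(1962, 7, 3)))   # "1960s"; the bucket is then treated
                                                 # like any other entity when mining paths
```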

Issues and Further Work

Classifiers work well with rich entities but not with shallow entities
◦ As we add more data, the number of rich entities should increase

The training and test sets are not representative of real-world data
◦ Positive examples are often highly connected – this can cause the classifier to be very conservative
◦ Negative examples are often too random – real-world data can be more ambiguous

The prototype works with two hops but not yet three hops
◦ At three hops, the intermediate data reaches about 40 TB or more; more optimization is needed

Resources

Aether: aether://experiments/01fe49c5-ae8f-4fc3-b713-9c63f8c68cf8