Graph Based Facet Selection
TRANSCRIPT
Why Facet Selection? Obviously, our graph has errors
This is because our sources have errors
Ideally, when we have more data from more sources, our data correctness should improve (but it doesn’t)
Having more data ≠ Having more knowledge
If our precision tier needs to have a low error rate, we need a way to filter out errors
Source Based Facet Selection
All facets from one data source share the same confidence
Aggregate the confidences when multiple data sources have the same facet
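The aggregation step can be sketched as follows. The slides do not say which combination rule is used; noisy-OR is shown here as one plausible, commonly used choice, and the function name is illustrative:

```python
# Combine per-source confidences for a facet asserted by several sources.
# Noisy-OR assumption: the facet is false only if every source is
# independently wrong about it.
def aggregate_confidence(source_confidences):
    p_all_wrong = 1.0
    for c in source_confidences:
        p_all_wrong *= (1.0 - c)
    return 1.0 - p_all_wrong

# Two mediocre sources agreeing beat either one alone:
# aggregate_confidence([0.8, 0.7]) -> 0.94
```

Under this rule, agreement between sources monotonically increases confidence, which matches the intuition on the slide but also explains the next problem: correlated sources repeating the same error inflate confidence in it.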
However… As we add more sources, it is impossible to have a score per source per property
Even if we trust a source highly, its individual facts cannot be 100% correct
The selection process does not take into account the information we already know
[Diagram: Mari Henmi and Emiri Henmi linked by conflicting "Children" edges in both directions]
Taking Advantage of Prior Knowledge
When seen in isolation, it’s hard to know whether Mari is Emiri’s mother or child
However, if we know the following facts (each with some probability of being true), then the job is a lot easier:
◦ Emiri’s other parent, Teruhiko, is Mari’s husband
◦ Emiri’s sibling, Noritaka, is Mari’s child
[Diagram: Teruhiko Saigo –Spouse– Mari Henmi; "Children" edges from Mari and Teruhiko to Noritaka Henmi and Emiri Henmi; Noritaka –Sibling– Emiri]
Inspired by Google’s Knowledge Vault concept
Graph Based Selection using Prior Knowledge
We can generalize the model in the following form. Given a triple (s, p, o), let π_1, …, π_n be all the possible paths that connect s to o, including reverse edges and multiple hops. The probability of that triple being true is:
P((s, p, o) is true) = sigmoid(b + Σ_i w_i·x_i), where x_i = 1 if path π_i connects s to o and 0 otherwise
In particular, for a triple such as (Mari, children, Emiri), we first find all the paths from Mari to Emiri.
◦ Examples include Mari –Spouse→ Teruhiko –Children→ Emiri, Mari –Children→ Noritaka ←Sibling– Emiri, etc.
Then we assign the weight w_i to each of the paths, and calculate the prediction with the sigmoid above.
There are several linear models possible. We find logistic regression to perform very well
Implementation
Steps to calculate this score:
1. Create training set and test set
2. Mine all the possible paths from S to O
3. Treat each path as a feature and train a model
Simple, right?
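The three steps can be sketched end to end. The tiny gradient-descent trainer and the toy path names below are illustrative stand-ins, not the real pipeline, which would use a proper solver with regularization:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_path_model(examples, all_paths, epochs=300, lr=0.5):
    """Step 3: treat each mined path as a binary feature and fit
    logistic regression by plain gradient descent on the log-loss."""
    idx = {p: i for i, p in enumerate(all_paths)}
    w, b = [0.0] * len(all_paths), 0.0
    for _ in range(epochs):
        for paths, label in examples:
            z = b + sum(w[idx[p]] for p in paths if p in idx)
            err = sigmoid(z) - label      # gradient of the log-loss
            b -= lr * err
            for p in paths:
                if p in idx:
                    w[idx[p]] -= lr * err
    return w, b, idx

def predict(model, paths):
    w, b, idx = model
    return sigmoid(b + sum(w[idx[p]] for p in paths if p in idx))

# Toy training set: a spouse->children path signals a true 'children' triple,
# while a shared-film path does not.
examples = [({"->spouse->children"}, 1), ({"->spouse->children"}, 1),
            ({"->actor.film<-actor.film"}, 0), (set(), 0)]
model = train_path_model(examples,
                         ["->spouse->children", "->actor.film<-actor.film"])
```

After training, `predict(model, {"->spouse->children"})` comes out high and the shared-film path comes out low, which is exactly the behavior the marriage-rule table later in the deck shows at scale.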
Training and Test Set
Need to contain:
◦ Positive examples – easy
◦ Negative examples – how?
Local Closed World Assumption:
◦ For a given entity, if we know the values of a property, then we know all values of that property
◦ More concretely, if we already know that Tom Cruise has three children, then any other entity is unlikely to be his fourth child – this is a possible negative example
◦ However, if we don’t know his children at all, then we cannot say who must not be his children – this is not a possible negative example
◦ Remember we just need LCWA to be true enough to generate negative examples; it doesn’t have to be 100% true.
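The LCWA check reduces to a simple predicate over what the graph already knows. Here the graph is a plain dict and the function name is illustrative:

```python
def lcwa_negative_candidate(graph, subject, prop, candidate):
    """Local Closed World Assumption: if some values of (subject, prop) are
    known, treat them as ALL the values, so any other entity is a plausible
    negative example. If nothing is known, we must abstain."""
    known = graph.get((subject, prop), set())
    if not known:
        return False   # no known values -> cannot generate a negative example
    return candidate not in known

graph = {("Tom Cruise", "children"): {"Isabella", "Connor", "Suri"}}
# Known children exist, so an unrelated entity is a usable negative:
# lcwa_negative_candidate(graph, "Tom Cruise", "children", "Brad Pitt") -> True
# For a person with no known children we must abstain:
# lcwa_negative_candidate(graph, "Unknown Person", "children", "Brad Pitt") -> False
```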
What Negative Examples to Choose
Are all entities who violate LCWA also good candidates for negative examples?
Randomly pick any entity?
◦ This cannot work because paths between any random pair of entities are very sparse
◦ The classifier would learn to detect the mere existence of a connection between two entities, not the right kind of connection
The related entities of positive examples?
◦ Choose A-B to be a negative example if A-C is a positive example and B and C are related entities
◦ The connection between A-C is still very sparse
All neighbor entities?
◦ Choose A-B to be a negative example if A is already connected to B and A-B is not a positive example
◦ Need to make sure B has the same type as the expected type of the property
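The last option, neighbor entities of the right type, can be sketched as follows; the edge representation and all helper names here are illustrative:

```python
def neighbor_negatives(edges, positives, subject, expected_type, entity_type):
    """Choose A-B as a negative example when A already has some edge to B,
    A-B is not a positive example, and B matches the property's expected type."""
    neighbors = {o for (s, _, o) in edges if s == subject}
    neighbors |= {s for (s, _, o) in edges if o == subject}  # reverse edges too
    return {b for b in neighbors
            if (subject, b) not in positives
            and entity_type.get(b) == expected_type}

edges = [("Mari", "spouse", "Teruhiko"),
         ("Mari", "children", "Noritaka"),
         ("Mari", "starred_in", "Some Film")]
positives = {("Mari", "Noritaka")}   # Noritaka really is Mari's child
types = {"Teruhiko": "person", "Noritaka": "person", "Some Film": "film"}
# For 'children' (expected type: person), Teruhiko becomes the one candidate
# negative; the film is filtered out by the type check.
```

Because the negatives come from entities already connected to A, the resulting pairs have realistic path features, which avoids the connection-existence shortcut described above.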
All Paths (Rules) That Connect S to O
We find the following paths between any two entities:
◦ 1 hop: all forward and backward edges
◦ 2 hops: all edge combinations in the f/f, f/b, b/f, and b/b directions
◦ Excluding intermediate hub entities
◦ We use bidirectional search to speed up the job
◦ The performance breaks down beyond two hops – this can be improved
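The two-hop mining can be sketched as a meet-in-the-middle join, a simplified form of the bidirectional search mentioned above; the edge tuples and helper names are assumptions for illustration:

```python
from collections import defaultdict

def one_hop(edges, node):
    """All (label, neighbor) pairs one hop away; '->' forward, '<-' backward."""
    out = []
    for s, p, o in edges:
        if s == node:
            out.append(("->" + p, o))
        if o == node:
            out.append(("<-" + p, s))
    return out

def paths_up_to_two_hops(edges, s, o, hubs=frozenset()):
    paths = []
    # 1 hop: direct forward and backward edges
    for lbl, n in one_hop(edges, s):
        if n == o:
            paths.append(lbl)
    # 2 hops: expand one hop from each side, join on the intermediate entity,
    # skipping hub entities (this covers f/f, f/b, b/f, b/b automatically)
    from_o = defaultdict(list)
    for lbl, n in one_hop(edges, o):
        from_o[n].append(lbl)
    for lbl1, mid in one_hop(edges, s):
        if mid == o or mid in hubs:
            continue
        for lbl2 in from_o[mid]:
            # flip the label mined from O's side so the path reads S -> O
            flipped = ("->" + lbl2[2:]) if lbl2.startswith("<-") else ("<-" + lbl2[2:])
            paths.append(lbl1 + flipped)
    return paths

edges = [("Mari", "spouse", "Teruhiko"),
         ("Teruhiko", "children", "Emiri"),
         ("Noritaka", "parent", "Mari")]
# paths_up_to_two_hops(edges, "Mari", "Emiri") -> ["->spouse->children"]
```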
(Some) Rules Trained for Marriage Property
Rule                                                                    Precision  Recall    F1
->people.person.children<-people.person.children                        0.990501   0.254667  0.405163
->people.person.children->people.person.parent                          0.995007   0.25403   0.40473
<-people.person.parent<-people.person.children                          0.991585   0.253872  0.404246
<-people.person.parent->people.person.parent                            0.995389   0.253263  0.403788
->film.actor.film<-film.actor.film                                      0.210501   0.044027  0.072822
<-film.film.actor->film.film.actor                                      0.210041   0.043981  0.072733
->film.actor.film<-film.actor.performance--film.performance.film        0.211015   0.043931  0.072722
->film.actor.film->film.film.performance--film.performance.actor        0.211004   0.043931  0.072721
<-film.film.actor<-film.actor.film                                      0.210119   0.043965  0.072715
->film.actor.film->film.film.actor                                      0.210183   0.043959  0.072711
<-film.film.actor<-film.actor.performance--film.performance.film        0.210729   0.043914  0.072681
<-film.film.actor->film.film.performance--film.performance.actor        0.210711   0.043914  0.07268
Train Final Models for Each Property
Given the paths (rules π_1 … π_n) for each property, we train a logistic regression model per property p:
score_p(s, o) = sigmoid(b_p + Σ_i w_p,i·x_i), where x_i indicates whether rule π_i connects s to o
How to map the rules to a feature vector?
◦ There are 90,000 distinct possible paths between any two given entities. This maps to a feature vector of 90,000 dimensions.
◦ There could be more paths as we grow our graph. How do we assign dimensions to new paths?
Our solution – hash kernels:
◦ Project the feature space down to a 1,500-dimensional hash space
◦ Learn the model on the hashed feature space
◦ Use L1 regularization to get rid of useless features
◦ Collisions are handled to some degree by the hash kernel itself. Additional collisions are handled by having multiple hash kernels
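A sketch of the hashed feature map with two kernels. The 1,500-dimension figure comes from the slide; md5 is used only so bucket assignment is stable across runs, and the kernel salts and function names are illustrative:

```python
import hashlib

DIM = 1500  # hash space per kernel, as on the slide

def hash_index(path, salt):
    """Stable bucket for a path string under a given kernel (salt)."""
    digest = hashlib.md5((salt + path).encode()).hexdigest()
    return int(digest, 16) % DIM

def featurize(paths, kernels=("k0", "k1")):
    """90,000+ possible paths -> a fixed len(kernels) * DIM binary vector.
    New paths need no new dimensions; they simply hash into a bucket.
    Each extra kernel gives a second independent bucket, so two paths that
    collide in one kernel are unlikely to collide in the other."""
    vec = [0.0] * (DIM * len(kernels))
    for k, salt in enumerate(kernels):
        for p in paths:
            vec[k * DIM + hash_index(p, salt)] = 1.0
    return vec
```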
Mari Henmi & Emiri Henmi
Is Mari a child of Emiri's?
Rule                                                                                  Weight
Bias                                                                                  -3.38435
<-people.person.children
<-people.person.marriage--time.event.person<-people.person.parent                     -0.03988
<-people.person.marriage--time.event.person->people.person.children                   -0.3237
<-people.person.parent
<-people.person.parent<-people.person.sibling--people.sibling_relationship.sibling
<-people.person.parent<-people.person.siblings
<-people.person.parent->people.person.sibling--people.sibling_relationship.sibling
<-people.person.parent->people.person.siblings                                        -0.3237
->people.person.children
->people.person.children<-people.person.sibling--people.sibling_relationship.sibling  -0.03796
->people.person.children<-people.person.siblings
->people.person.children->people.person.sibling--people.sibling_relationship.sibling  -0.12332
->people.person.children->people.person.siblings                                      -0.03463
->people.person.marriage--time.event.person<-people.person.parent                     -0.4855
->people.person.marriage--time.event.person->people.person.children                   -0.06937
->people.person.parent
Total                                                                                 -4.8224
Sigmoid                                                                               0.007983
Is Emiri a child of Mari's?
Rule                                                                                  Weight
Bias                                                                                  -3.38435
<-people.person.children
<-people.person.children<-people.person.marriage--time.event.person                   3.043293
<-people.person.children->people.person.marriage--time.event.person                   1.556977
<-people.person.parent
<-people.person.sibling--people.sibling_relationship.sibling<-people.person.children  1.72802
<-people.person.sibling--people.sibling_relationship.sibling->people.person.parent    1.194149
<-people.person.siblings<-people.person.children                                      0.369578
<-people.person.siblings->people.person.parent                                        0.436715
->people.person.children
->people.person.parent
->people.person.parent<-people.person.marriage--time.event.person                     3.05227
->people.person.parent->people.person.marriage--time.event.person                     1.445125
->people.person.sibling--people.sibling_relationship.sibling<-people.person.children  1.386518
->people.person.sibling--people.sibling_relationship.sibling->people.person.parent    0.989205
->people.person.siblings<-people.person.children                                      1.365237
->people.person.siblings->people.person.parent                                        0.827563
Total                                                                                 14.0103
Sigmoid                                                                               0.99999
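Plugging the two totals into the sigmoid reproduces the scores above:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Mari as a child of Emiri: strongly negative total -> near-zero probability
print(round(sigmoid(-4.8224), 6))   # 0.007983
# Emiri as a child of Mari: large positive total -> near-certain
print(sigmoid(14.0103) > 0.99999)   # True
```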
Measurement
We measure the trained models on a separate hold-out set
Precision = True Positives / Predicted Positives
Recall = True Positives / Labeled Positives
Most models have high precision but relatively low recall
This is because the model can’t reason about shallow entities
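The two metrics, computed over a hold-out set of facets, can be sketched as (function name is illustrative):

```python
def precision_recall(predicted, labeled):
    """Precision = true positives / predicted positives;
    Recall = true positives / labeled positives."""
    tp = len(predicted & labeled)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(labeled) if labeled else 0.0
    return precision, recall

# Accept 3 facets, 2 of which are among the 4 labeled positives:
# precision_recall({"a", "b", "c"}, {"a", "b", "d", "e"}) -> (2/3, 0.5)
```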
Property                                                                      Precision  Recall
automotive.automotive_class.related                                           0.997171   0.999055
automotive.trim_level.model_year                                              1          1
automotive.trim_level.option_package--automotive.option_package.trim_levels   1          0.924326
automotive.trim_level.related_trim_level                                      1          1
award.nominated_work.nomination--award.nomination.nominee                     0.911385   0.327528
award.nominee.award_nominations--award.nomination.nominated_work              0.852335   0.265277
award.winner.awards_won--award.honor.winner                                   0.773061   0.694073
award.winning_work.honor--award.honor.winner                                  0.907776   0.186617
education.school.school_district                                              0.983673   0.598015
film.actor.film                                                               0.967154   0.981438
film.director.film                                                            0.674589   0.81029
film.film.actor                                                               0.991635   0.981757
film.film.art_director                                                        0.621622   0.042048
film.film.country                                                             0.86165    0.776492
film.film.director                                                            0.705793   0.821246
film.film.editor                                                              0.886905   0.060105
film.film.language                                                            1          0.780943
film.film.music                                                               0.945946   0.01992
film.film.performance--film.performance.actor                                 0.620755   0.07921
film.film.producer                                                            0.492865   0.052533
film.film.production_company                                                  0.94723    0.187467
film.film.story                                                               0.976492   0.159292
film.film.writer                                                              0.795948   0.787734
film.producer.film                                                            0.829213   0.317553
film.writer.film                                                              0.704711   0.77448
music.artist.track_contributions--music.track_contribution.track              0.963513   0.835044
music.track.artist                                                            0.955354   0.975672
music.track.producer                                                          0.76477    0.705882
organization.organization.headquarters--location.address.city_entity          0.865854   0.022955
organization.organization.headquarters--location.address.subdivision_entity   0.811475   0.037106
people.deceased_person.place_of_death                                         0.916096   0.075246
people.person.children                                                        0.998595   0.917393
people.person.marriage--time.event.person                                     0.996065   0.798455
people.person.nationality                                                     0.95238    0.9496
people.person.parent                                                          0.998061   0.916419
people.person.place_of_birth                                                  0.966238   0.344751
people.person.sibling--people.sibling_relationship.sibling                    0.993952   0.909814
people.person.siblings                                                        0.998787   0.911975
soccer.player.national_career_roster--sports.sports_team_roster.team          0.997714   0.637226
sports.pro_athlete.team--soccer.roster_position.team                          0.607641   0.919797
sports.pro_athlete.team--sports.sports_team_roster.team                       0.811641   0.633239
Handling Scalar Values
Our models only handle entity-entity facets, not entity-value facets
To handle scalar values, we can bucketize the values and then treat buckets as entities. Then, we can apply the same algorithm.
For example, to score the facet “Tom Cruise is born on 7/3/1962”, we can do the following:
◦ Bucketize 7/3/1962 into the entity “1960s”
◦ Find all possible paths between Tom Cruise and “1960s”
◦ One such path could be:
◦ We can assign high weights to paths like this one, and the rest works the same as before
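The bucketization step might look like this; decade buckets match the example on the slide, and the function name and date format are illustrative:

```python
def bucketize_date(date_str):
    """Map a M/D/YYYY date value to a decade 'entity' usable as a graph node.
    The resulting bucket can then be connected to the entity and scored
    with the same path-based model as any entity-entity facet."""
    year = int(date_str.split("/")[-1])
    return f"{year // 10 * 10}s"

# "Tom Cruise is born on 7/3/1962" becomes the entity-entity facet
# (Tom Cruise, born_in_decade, "1960s"):
# bucketize_date("7/3/1962") -> "1960s"
```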
Issues and Further Work
Classifiers work well with rich entities but not shallow entities
◦ As we grow more data, our rich entities should increase
The training and test set are not representative of real-world data
◦ Positive examples are often highly connected – this can cause the classifier to be very conservative
◦ Negative examples are often too random – real-world data can be more ambiguous
The prototype works with two hops but not yet three hops
◦ When we get to three hops, the intermediate data reaches about 40 TB or more. More optimization is needed
Resources
Aether: aether://experiments/01fe49c5-ae8f-4fc3-b713-9c63f8c68cf8