data mining on symbolic knowledge extracted from the web r.ghani, r. jones, d.mladenic, k.nigam,...

Data Mining on Symbolic Knowledge Extracted

from the Web

R.Ghani, R. Jones, D.Mladenic, K.Nigam, S.Slattery

Carnegie Mellon University and J.Stefan Institute

The Web

KB

Data Mining

Gather info.

Wrappers, Info. Extraction

Gathering info. from the Web• Wrappers - extract from highly structured,

automatically generated Web pages– eg., movie theaters and restaurants from

Web-based entertainment guides combined with a map system

• Information Extraction - extract from free-form, unstructured data– hand-constructed extraction rules– learned rules:

• to identify the name of a person given home page• to identify relations suggested by hyperlniks

Data sources

• corporations around the world• propositional and relational facts on 4312

companies collected by crawling the Web• data sources:

– Hoovers Online Web resource - selected info. about companies including Web site URL

– the first 50 Web pages from the company site - extracting predefined features

The Web

KBWrapping from corp. info.

Extracting from corp. Web sites

Data Mining

Abstracting features

New knowledge

Features describing a company• Extracted: types of activity, companies it points

to/mentions, officers, sector predicted from the text using Naïve Bayes, locations derived using Naïve Bayes and autoslog-based rules, country derived from URL

• Wrapped: listed sector and sub-sector, type, address including city and state, competitors, subsidiaries, products, officers, auditors, revenue, net-income, net-profit, employees

• Abstracted: companies in the same city/state, sharing officers, linking/mentioning the same companies, reciprocally linking/mentioning/compeeting,

discretized revenue, net-income, net profit, employees

Data Mining Algorithms

Finding regularities:• Association rules - Apriori algorithm

– features first mapped to Boolean features, companies represented with sparse vectors

– Y :- X (support: P(X,Y), confidence:P(Y|X))

Describing a target concept:• Propositional rules - C5.0 decision tree/rules

– set-valued features (locations, officers,…) and relational features (same-city, mentions,…) excluded

• Relational rules - FOIL– Y :- X (covered positive, covered negative)

Apriori experiments• 2658 rules found using default

support and confidence (10%, 80%) and all the features (about 26 000)

• the most frequent rules pointed out the need for data cleaning (errors in extraction)

• 254 rules after removing rules containing wrongly extracted features

• Companies having documentation on their sites, that are located in USA or provide technical assistance, are involved in selling.– activity=sell :- locations=united-states, links-to=adobe-

systems-incorporate (10.8%, 93%)

– activity=sell :- activity=technical-assistance, links-to=adobe-systems-incorporate (11.9%, 91.1%)

• Most companies located in Japan either sell or perform research, while the companies located in USA either sell or supply.– activity=sell :- locations=japan (14.5%,90.8%)

– activity=research :- locations=japan (13.2%, 82.2%)

Regularities found I

• Companies mentioning software on their Web pages are mostly located in USA.– locations=united-states :- activity=supply, activity=expertise,

mentions=software (5.3%, 64%)

• Companies in our dataset are stable in their finances.– revenue-1997=high :- revenue-1996=high, revenue-1995=high,

revenue-1994=high, revenue-1993=high (5%, 99.5%)

Regularities found II

Using support 1%, confidence 10% and 4 features: url-country, sector, competitor, auditors (mapped to 3532 Boolean features)

• The most frequent auditors for our data: Price-Waterhouse Coopers LLP (13.4%), Ernst & Young LLP (11%), Arthur Andersen LLP (10.2%)

Regularities found III

• Companies in computer-software-&-services have Price Waterhouse Coopers (20.9%) or Ernst & Young (14.3%) as their auditor.

• Companies in diversified-services have Price-Waterhouse Coopers (15.7%) or Arthur Andersen (13.9%) as their auditor.

• Companies in drugs have Ernst & Young (26.8%) as their auditor.

Regularities found IV

• About half of the companies that compete with microsoft-corporation are in computer-software-&-services and about a quarter of companies that are in computer-software-&-services compete with microsoft-corporation.– competitor=microsoft-corporation (2.1%, 54.9%)

– competitor=microsoft-corporation :- hoovers-sector=computer-software-&-services (2.1%, 25.7%)

• Competitors as predictors for computer-software-&-services companies: Computer Sciences Corporation, Associates International Inc., SAP Aktiengesellschaft

Regularities found V

• Telecommunications also have some outstanding competitors: Bellsouth Corporation, MCI Worldcom Inc., Bell Atlantic Corporation, Lucent Technologies inc..

• Most companies competing with Conagra inc., KMart Corporation and BP Amoco p.l.c. are in food-beverage-&-tobacco, retail and energy, respectively.

Regularities found VI

Learning target concepts• Target concepts were selected:

hoovers-sector, hoovers-type, auditors, competitors, share-officers, country, and state.

• Decision trees for learning propositional rules

• Foil for learning first order rules: for unary relations, for binary relations

Propositional rules found I

Depending on the city the company is located in, different features are used to predict the hoovers-sector:– for Atlanta, computer companies have a higher revenue than

diversified services companies (same for Chicago).

– for Houston, depending on the Naive Bayes classification (based on the company web-pages), we predict either Manufacturing, Computer Software & Services, or Energy.

– for Dallas, most Health companies are non-profit and thus have a lower income than leisure companies.

Propositional rules found II

When the city is excluded from the feature set(improvements of Web-based sector classifier):

– Telecommunications has more employees than Energy (employees can help weed out incorrect classifications in the coarse-sector prediction for Energy.

– Where the Naive Bayes classifier predicts communications-services, income can be used to distinguish between Media (lower-income) and Telecommunications (high).

– Where the Naive Bayes classifier predicts investment-services, employees can be used to distinguish between Financial Services(lower) and Banking (high).

First order rules found IPredicting hoovers-sector:• Companies headquartered somewhere other than

Fremont competing with ``Computer Associates International'' are in the computer software & services sector (51 pos.,0 neg.).

• Companies headquartered in New York, that are not in natural-gas-industry nor technology-sector, are in the media industry (8 pos., 0 neg.).– media(A) :- hq-city(A,new-york), sector(A,B),

B<>natural-gas-industry, coarse-sector(A,C), C<>technology-sector, competitor(?,A), performs-activity(A,?), not(products(A,?)), not(locations(A,?)).

First order rules found IIPredicting auditors:• Companies headquartered in Madrid having listed

historical financial information use Arthur Andersen as their auditor (4 pos, 0 neg.).– arthur-andersen(A) :- hq-city(A,madrid), net_profit(A,?,?).

Predicting binary competitor from: hq-city, url-country, links-to, hoovers-sector

• Two companies in the same sector are competitors (11407 pos., 0 neg.).– competitor(A,B) :- A <> B, hoovers-sector(A,C), hoovers-

sector(B,C).

Disucssion • Data cleaning needed - additional source of

noise from imperfect feature extractors– very frequent regularities checked manually for

obvious extractor errors

• Feature selection needed, especially for relational learning– learn simple, unary target relation to suggest

useful features for learning binary relation

• Interaction observed between symbolic features and Naïve Bayesian classifier prediction (decision tree improved Naïve Bayes prediction using symbolic features).

Future work• Information from wrapped Web sites can be

used to train extractors.• Extractors extended to use text and

symbolic features (look at Web pages and use background knowledge).

• Use data mining to identify potential errors in extractors.

• Combine unsupervised search for associations with rule learning in an incremental, iterative approach supporting data cleaning and feature selection.

data mining on symbolic knowledge extracted from the web r.ghani, r. jones, d.mladenic, k.nigam,...

Documents

extraction254 rules

frequent rules

relational rules foily

autoslogbased rules

relational features

propositional rules

types of activity

discretized revenue