Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis
Jean-Pierre Norguet
Web Communication
• Web transaction = request + response
• Meta-data in Web logs:
– Request date and time
– Page reference (URI)
– Referral URI
– Client machine information
[Diagram: Visitor, Web site, Log file; arrows labelled request, response, log transaction]
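The meta-data listed above can be read from a log line with a short parser. This is a minimal sketch that assumes the Apache "combined" log format; the slides do not say which log format the site uses, and the sample line and the `parse_log_line` helper are made up for illustration.

```python
import re
from datetime import datetime

# Assumed layout (Apache "combined" format):
# host ident user [time] "method uri protocol" status bytes "referrer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_log_line(line):
    """Return the meta-data of one logged Web transaction, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Request date and time, e.g. "10/Oct/2005:13:55:36 +0200"
    fields["time"] = datetime.strptime(fields["time"], "%d/%b/%Y:%H:%M:%S %z")
    return fields

# Hypothetical log line, not taken from a real log file.
sample = ('192.0.2.1 - - [10/Oct/2005:13:55:36 +0200] "GET /cgi/search?q=ontology HTTP/1.1" '
          '200 2326 "http://www.ulb.ac.be/" "Mozilla/5.0"')
entry = parse_log_line(sample)
print(entry["time"], entry["uri"], entry["referrer"], entry["agent"])
```

Each parsed entry carries the four kinds of meta-data named on the slide: date and time, page reference, referral URI, and client information.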
Web Analytics Process
[Diagram: Visitors, Web site, Log files, Web analytics tool, Reports, Web designer; the Web designer updates the Web site]
Web Analytics Tools
• Results
– Page views
– Number of visitors
– Throughput
– Traffic
• Exploitation
– Self-promotion
– Sales planning
– Technical resizing
– Structure optimization
Low semantics → low-level decisions
Organization Structure
[Diagram: Organization manager; Web site chief editor; three Sub-editors; Web analytics tools]
Web Analytics Results
• Low semantics → low intuitiveness
• Too numerous results
Page Ref. Ambiguity (1)
Address: http://www.ulb.ac.be/cgi/search
Page Ref. Ambiguity (2)
Address: http://www.ulb.ac.be/cgi/search
Page Volatility
Address: http://www.ulb.ac.be/cgi/search
Page Synonymy (1)
Page Synonymy (2)
Page Polysemy
Page Temporality (1)
Page Temporality (2)
Problems Summary
• Low semantics → low intuitiveness
• Too numerous results
• Page reference ambiguity
• Page synonymy
• Page polysemy
• Page temporality
• Page volatility
Our solution
• Summarized and conceptual results for:
– Chief editors
– Organization managers
• Generic solution, independent from:
– Web site content
– Web site language
– Web site technology
→ Analyze output text content
Output Page Collection
• Mining points in Web environment:
1. Web logs (+ content journal)
2. Web server
3. Network wire
4. On-screen Web page
[Diagram: Visitor, Browser, Internet, Router, Web server, with four collection points: 1. Web log files, 2. Server monitoring, 3. Network monitoring, 4. Client-side]
Lexical Analysis
• Output page mining → Web pages
• Unformatting → text
• Tokenization → terms
• Stopwords removal
• Stemming
• Term selection → index terms
• Occurrence counting → audience metrics
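The lexical analysis steps above can be sketched end to end in a few lines. This is only an illustration under stated assumptions: the stopword list is a toy list, `naive_stem` is a stand-in for a real stemmer (the slides do not name one), and the sample page and index terms are made up.

```python
import re
from collections import Counter
from html.parser import HTMLParser

STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "are", "for"}  # toy list (assumption)

class TextExtractor(HTMLParser):
    """Unformatting: collects the text nodes of an HTML page and drops the markup."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def strip_markup(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

def naive_stem(token):
    """Toy stemmer that only strips plural endings; a real system would use a proper stemmer."""
    if token.endswith("ies") and len(token) > 4:
        return token[:-3] + "y"
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token

def count_index_terms(html, index_terms):
    """Unformat, tokenize, remove stopwords, stem, keep index terms, count occurrences."""
    tokens = re.findall(r"[a-z]+", strip_markup(html).lower())
    stems = [naive_stem(t) for t in tokens if t not in STOPWORDS]
    return Counter(s for s in stems if s in index_terms)

# Hypothetical output page and index terms.
page = "<html><body><h1>Ontologies</h1><p>The ontology of databases and a database.</p></body></html>"
counts = count_index_terms(page, {"ontology", "database"})
print(counts)
```

Running the same counting over every collected output page yields the per-term occurrence counts that the audience metrics are built from.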
Term-Based Metrics
• Term occurrence counting in pages:
[Diagram: Online pages → Presence; Output pages → Consultation; Interest]
Term-Based Metrics
• Term-based metrics:
– Consultation
– Presence
– Interest
• Limitations:
– Too many terms
– Term synonymy
– Term polysemy
→ Ontology-based term grouping
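Hierarchical Aggregation
• Consultation
• Presence
[Diagram: term hierarchy with Food at the root, Fruit (Apple, Strawberry) and Vegetable (Carrot, Potato) below it; each node is annotated with consultation and presence counts that are not legible in this transcript]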
Hierarchical Aggregation
• Consultation
• Presence
• Interest (x2)
[Diagram: same hierarchy (Food; Fruit: Apple, Strawberry; Vegetable: Carrot, Potato); each node is annotated with consultation, presence, and interest values that are not legible in this transcript]
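Hierarchical aggregation can be sketched as a roll-up of per-term counts along the ontology's parent links. The counts below are made up and are not the values shown on the slide. The sketch also assumes that interest is the ratio of consultation to presence, which the slides do not state explicitly.

```python
from collections import defaultdict

# Toy ontology: term -> parent term (None marks the root).
PARENT = {
    "Food": None,
    "Fruit": "Food", "Vegetable": "Food",
    "Apple": "Fruit", "Strawberry": "Fruit",
    "Carrot": "Vegetable", "Potato": "Vegetable",
}

# Made-up per-term counts: term -> (consultation, presence).
TERM_COUNTS = {"Apple": (10, 2), "Strawberry": (6, 3), "Carrot": (4, 4), "Potato": (2, 1)}

def aggregate(parent, counts):
    """Add each term's counts to the term itself and to all of its ancestors."""
    totals = defaultdict(lambda: [0, 0])
    for term, (consultation, presence) in counts.items():
        node = term
        while node is not None:
            totals[node][0] += consultation
            totals[node][1] += presence
            node = parent[node]
    # Assumption: interest = consultation / presence (0.0 when the term is never present).
    return {t: (c, p, c / p if p else 0.0) for t, (c, p) in totals.items()}

metrics = aggregate(PARENT, TERM_COUNTS)
for term in ("Food", "Fruit", "Vegetable"):
    consultation, presence, interest = metrics[term]
    print(f"{term}: consultation={consultation} presence={presence} interest={interest:.2f}")
```

Because interest is recomputed from the aggregated sums at each level, a concept's interest is not the average of its children's interests.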
Data model
• Ontology term hierarchy
• Number of occurrences: by day, by term
• List of days (possibly aggregated)
[Schema diagram]
DailyTermOccurrences: day : DATETIME, term : VARCHAR, consultation : INT, presence : INT
OntologyElement: term : VARCHAR, parentTerm : VARCHAR
Day: day : VARCHAR, label : VARCHAR
Relations: DailyTermOccurrences.day → Day, DailyTermOccurrences.term → OntologyElement, OntologyElement.parentTerm → OntologyElement
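The three tables can be tried out with an in-memory database. The sketch below uses SQLite through Python's `sqlite3` module in place of the SQL Server setup used in the talk. The sample rows are made up, and the recursive query is only one possible way to roll counts up the `parentTerm` links, not necessarily how the OLAP parent-child dimension computes it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE OntologyElement (
    term VARCHAR PRIMARY KEY,
    parentTerm VARCHAR REFERENCES OntologyElement(term));
CREATE TABLE Day (day VARCHAR PRIMARY KEY, label VARCHAR);
CREATE TABLE DailyTermOccurrences (
    day DATETIME REFERENCES Day(day),
    term VARCHAR REFERENCES OntologyElement(term),
    consultation INT,
    presence INT,
    PRIMARY KEY (day, term));
""")

# Made-up sample data.
conn.executemany("INSERT INTO OntologyElement VALUES (?, ?)",
                 [("Food", None), ("Fruit", "Food"), ("Apple", "Fruit"), ("Strawberry", "Fruit")])
conn.execute("INSERT INTO Day VALUES ('2005-10-10', '10 Oct 2005')")
conn.executemany("INSERT INTO DailyTermOccurrences VALUES (?, ?, ?, ?)",
                 [("2005-10-10", "Apple", 10, 2), ("2005-10-10", "Strawberry", 6, 3)])

# Pair every term with itself and each of its ancestors, then sum the counts per ancestor and day.
ROLLUP = """
WITH RECURSIVE ancestry(term, ancestor) AS (
    SELECT term, term FROM OntologyElement
    UNION ALL
    SELECT a.term, o.parentTerm
    FROM ancestry a JOIN OntologyElement o ON o.term = a.ancestor
    WHERE o.parentTerm IS NOT NULL)
SELECT a.ancestor, d.day, SUM(d.consultation), SUM(d.presence)
FROM DailyTermOccurrences d JOIN ancestry a ON a.term = d.term
GROUP BY a.ancestor, d.day
ORDER BY a.ancestor
"""
rows = conn.execute(ROLLUP).fetchall()
for row in rows:
    print(row)
```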
OLAP Model
• Parent-child ontology dimension
• Time dimension
• Measures
[Diagram: cube "Term-based metrics" with measures Consultation, Presence, Interest and dimensions Time and Ontology]
Case Study
• Web site: cs.ulb.ac.be
– 1,500 pages
– 100 page views/day
– Knowledge domain: computer science
• Ontology: ACM classification
– Knowledge domain: computer science
– 11 top domains
– 3 levels
– 1230 terms
Experimental setting
• WASA prototype
• SQL Server OLAP Analysis Service
[Diagram: Visitors; three machines labelled Web server, stats server, SQL server; components HTTP Server, Logs, Content Journal, WASA, MySQL, MyODBC, SQL Server, OLAP, Excel]
Concept-Based Metrics
• Y: top ontology domains
• X: consultation, presence, interest
Results
Exploitation Process
[Diagram. Actors: WASA administrator, chief editor, sub-editors, organization manager. System: WASA. Actions: configure and run; define concepts; view reports; redefine writing tasks; redefine Web communication objectives; manage organization; update Web site content]
Summary
• Web analytics
• Output page mining
• Lexical analysis
• Concept-based metrics with OLAP
• Experiments
• Conclusion & future work
Conclusion
• Most Web sites supported
• Approach validated by experiments
• Topic-based metrics are intuitive
• Exploitation at higher decision levels
• Limitation: ontology availability
• Future work: ontology enrichment, integration into Web analytics tools
Thank you for your attention
Q & A
Content Journaling
• Web logs + content journal
• (+) Easy to set up
• (+) Minimal storage and computation
• (-) Dynamic pages
[Diagram: Visitor, Browser, Internet, Router, Web server; collection point 1. Web log files]
Server Monitoring
• Web server plugin
• (+) Dynamic pages
• (+) Fast
• (-) Risky
[Diagram: Visitor, Browser, Internet, Router, Web server; collection point 2. Server monitoring]
Network Monitoring
• TCP/IP packet sniffing
• (+) Independent from Web server
• (-) Ethernet only
• (-) Encrypted content
• (-) CPU-intensive
[Diagram: Visitor, Browser, Internet, Router, Web server; collection point 3. Network monitoring]
Client-Side Collection
• Page-embedded program
1. Parses page
2. Sends content to mining server
• (+) Distributed workload
• (+) Supports client-side XML/XSL
• (-) Visibility and vulnerability
[Diagram: Visitor, Browser, Internet, Router, Web server; collection point 4. Client-side]
Output Page Collection
• Collection methods alone or in combination → any Web site output is collectable
1. Implemented: WASA-CJ
2. Implemented: SourceForge mod_trace_output
[Diagram: Visitor, Browser, Internet, Router, Web server, with four collection points: 1. Web log files, 2. Server monitoring, 3. Network monitoring, 4. Client-side]
Experiments
• Experimental settings
• Visualization
• Ontology coverage
• Validation
• Scalability
EUROVOC Thesaurus
• European Commission thesaurus
• Knowledge domain: EC-related domains
• 21 top domains
• 8 levels
• 6650 terms
Eurovoc Example
• 04 Politics
• 08 International Relations
• 10 European Communities
• 12 Law
• 16 Economics
• 20 Trade
• 24 Finance
• 28 Social Questions
• 32 Education and Communications
• 36 Science
• 40 Business and Competition
• 44 Employment and Working Conditions
• 48 Transport
• 52 Environment
• 56 Agriculture, Forestry and Fisheries
• 60 Agri-Foodstuffs
• 64 Production, Technology and Research
• 66 Energy
• 68 Industry
• 72 Geography
• 76 International Organisations
28 SOCIAL QUESTIONS
• 2806 family
• 2811 migration
• 2816 demography and population
• 2821 social framework
• 2826 social affairs
• 2831 culture and religion
– arts
– cultural policy
– culture
– acculturation
– civilization
– cultural difference
– cultural identity
• RT: protection of minorities (1236)
• RT: socio-cultural group (2821)
– cultural pluralism
– popular culture
– regional culture
– religion
• 2836 social protection
• 2841 health
• 2846 construction and town planning
Ontology Coverage
• Definition: the percentage of ontology terms that appear in the Web site
• ACM classification: 15%
• Eurovoc: 0.75%
• Characterizes the meaning of the metrics
→ Ontology enrichment with terms of the Web site
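The coverage definition translates directly into a set intersection. A minimal sketch, with hypothetical term sets; `ontology_coverage` is an illustrative helper, not part of WASA.

```python
def ontology_coverage(ontology_terms, site_terms):
    """Percentage of ontology terms that occur at least once in the Web site."""
    ontology = set(ontology_terms)
    if not ontology:
        return 0.0
    return 100.0 * len(ontology & set(site_terms)) / len(ontology)

# Hypothetical data: 2 of the 4 ontology terms appear in the site.
coverage = ontology_coverage(
    {"ontology", "database", "compiler", "network"},
    {"database", "network", "student"},
)
print(f"{coverage:.1f}%")  # prints "50.0%"
```

Site terms that are absent from the ontology ("student" here) do not affect coverage; they are the candidates for ontology enrichment.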
Collaborative Enrichment
[Diagram: participants by initials and role. EZ: chief editor, sub-editor, organization manager. JPN: sub-editor, WASA administrator. JMD: sub-editor, webmaster. MM, EM, SS, PS, JTS: sub-editors]
Methodology Steps
• Editor browses his pages
• Editor selects new terms
• Editor finds the enrichment point in the ontology
• Editor inserts the terms into the ontology
• Editor sends the ontology to the chief editor
• Chief editor commits the inserts
Results
Validation
• Comparison with WebTrends
• Personal Web site
• Optimized custom ontology of 1250 terms
• Top concepts match the page directories → results should be comparable
Results
[Charts: Urchin and WASA]
Scalability: Case Study
• Web site: www.ulb.ac.be
– 800,000 pages
– 100,000 page views
– Knowledge domain: broad
• Ontology: Eurovoc
– Knowledge domain: broad (EC’s interests)
– 21 top domains
– 8 levels
– 6650 terms
• Run = 15 hours, linear dependency → reasonable and applicable to any Web site
Ontologies
• Specification of a conceptualisation
• Controlled vocabulary of terms and relations
• An ontology defines concepts and their relations, that are necessary to share, reuse, and represent a domain knowledge
• Example:
[Diagram: concepts Food, Fruit, Vegetable, Strawberry, Apple; one relation labelled "good mix"]
Ontology Restriction
• Ontology → concept hierarchy
[Diagram: the example ontology (Food, Fruit, Vegetable, Strawberry, Apple) shown twice, once with the "good mix" relation and once as a plain hierarchy]
Contents
• Context & motivations
• Output page mining
• Lexical analysis
• Concept-based metrics with OLAP
• Experiments
• Exploitation
• Conclusion & future work
Context
• Web emergence
• Web communication analysis
• Maintenance needs effective decisions
• Highest organization levels
• Summarized and conceptual results
• Web analytics tools inappropriate
Metric Exploitation
• High interest
– Search pages about the topic
– Rank pages by consultation
– Optimize pages
• Low interest
– Search pages about the topic
– Rank pages by presence
– Question the topic: important/not important
– Drain traffic to the pages/delete pages
Future Work
• Concept visualisation in semantic space
• Automated taxonomy enrichment
• Additional OLAP dimensions