A Methodological Approach for Web Sites Reengineering
Estiévenart Fabrice, François Aurore, Henrard Jean, Hainaut Jean-Luc
22/05/2003 FNRS contact day on "Software (re-)engineering" 2
Context
• Still many static web sites…
  – Advantage
    • easy to create
    • → suitable for small web sites
  – Drawbacks
    • data and layout are mixed up
    • maintenance problems
    • → out-of-date or redundant information
    • → non-homogeneous design
  – Solution
    • DBMS + scripts (PHP, Perl, …)
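The "DBMS + scripts" solution can be illustrated with a minimal sketch, written here in Python with SQLite rather than the PHP or Perl scripts named on the slide. The `department` table, its columns, the sample row and the `render_department` helper are hypothetical. The only point is that data is stored once in the database while the layout lives in a single template.

```python
import sqlite3
from html import escape

# Hypothetical layout template: the only place where the page design lives.
TEMPLATE = "<tr><td><b>Phone :</b></td><td>{phone}</td></tr>"


def render_department(conn, name):
    """Build the HTML row for one department from the database content."""
    row = conn.execute(
        "SELECT phone FROM department WHERE name = ?", (name,)
    ).fetchone()
    if row is None:
        raise KeyError(name)
    return TEMPLATE.format(phone=escape(row[0]))


# Hypothetical sample data held in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE department (name TEXT PRIMARY KEY, phone TEXT)")
conn.execute(
    "INSERT INTO department VALUES (?, ?)",
    ("Computer Science", "+32 71 72.23.49"),
)
print(render_department(conn, "Computer Science"))
```

Changing the template changes every generated page at once, which addresses the non-homogeneous design drawback listed above.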
Goals
• To provide methods and tools for web sites reengineering:
[Figure: Web site (web pages) → Web Site Reverse Engineering → Data + Conceptual schema → Web Site Engineering → Web site (web pages) + DB]
Method: overview
[Figure: Web site (web pages) → Classification → HTML web pages per page type → Cleaning → XHTML web pages per page type → Semantic Enrichment → META file per page type → Extraction → XML and XML schema per page type → Conceptualisation → Conceptual schema]
Method: step 1
• Page classification
  – « Page type » = a set of pages related to the same concept that display very similar information in a very similar layout
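The slides do not say how pages are grouped into page types. As one possible illustration, the sketch below assumes that "very similar layout" can be approximated by comparing the sequences of opening tags of two pages. The `classify` function, the 0.8 threshold and the greedy grouping strategy are assumptions, not the authors' procedure.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagSequence(HTMLParser):
    """Collects the names of the opening tags of a page; text is ignored."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html):
    parser = TagSequence()
    parser.feed(html)
    parser.close()
    return parser.tags


def classify(pages, threshold=0.8):
    """Greedy grouping: a page joins the first group whose first page is similar enough."""
    groups = []  # each group is a list of page names
    for name, html in pages.items():
        seq = tag_sequence(html)
        for group in groups:
            reference = tag_sequence(pages[group[0]])
            if SequenceMatcher(None, reference, seq).ratio() >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])  # no similar group found: new page type
    return groups
```

Calling `classify` on a dictionary that maps page names to their HTML source returns one list of page names per candidate page type. A person would still have to check that each group really relates to a single concept.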
Method: steps 2 and 3.1
• HTML cleaning
• Semantic enrichment
  – For each page type
    • Concept identification and description on a sample page
      - « Concept » = a part of the HTML tree describing the layout, the structure and possibly the value of a certain reality
      - Ex: the concept « Phone Number »
        <tr>
          <td align="middle" bgcolor="#FFFF66"><b>Phone :</b></td>
          <td bgcolor="#FFFF99">+32 71 72.23.49</td>
        </tr>
      - Ex: the concept « Address » composed of « Street » and « City »
        <table width="100%">
          <tr><td><b>Address :</b></td></tr>
          <tr><td>Quality Street, 25</td></tr>
          <tr><td>London</td></tr>
        </table>
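To make the three ingredients of a concept (layout, structure, value) concrete, the sketch below parses the « Phone Number » fragment with Python's standard `xml.etree.ElementTree`. This only works because the cleaning step yields well-formed XHTML. The `describe` helper is hypothetical and is not part of the authors' tooling.

```python
import xml.etree.ElementTree as ET

# The « Phone Number » concept from the slide, as cleaned (well-formed) XHTML.
ROW = """<tr>
  <td align="middle" bgcolor="#FFFF66"><b>Phone :</b></td>
  <td bgcolor="#FFFF99">+32 71 72.23.49</td>
</tr>"""


def describe(fragment):
    """Split a concept into structure (tags), layout (attributes) and text content."""
    nodes = list(ET.fromstring(fragment).iter())  # document order
    structure = [n.tag for n in nodes]
    layout = [(n.tag, dict(n.attrib)) for n in nodes if n.attrib]
    texts = [n.text.strip() for n in nodes if n.text and n.text.strip()]
    return structure, layout, texts


structure, layout, texts = describe(ROW)
label, value = texts  # in this fragment: first text is the label, second the value
print(structure, value)
```

The label text and the tag structure stay the same on every page of the type, while the value changes from page to page. This is why the description of a concept needs a placeholder for the value.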
Method: the META file
<HTMLDescription xmlns:meta="http://www.cetic.be/FR/CRAQ-DB.htm">
  <meta:element name="Department">
    <html>
      <head>...</head>
      <body>
        <table>
          <meta:element ref="DeptName"/>
          <meta:element ref="PhoneNumber"/>
          <meta:element ref="Address"/>
        </table>
      </body>
    </html>
  </meta:element>
  <meta:element name="PhoneNumber">
    <tr>
      <td align="middle" bgcolor="#FFFF66"><b>Phone :</b></td>
      <td bgcolor="#FFFF99"><meta:value/></td>
    </tr>
  </meta:element>
  …
</HTMLDescription>
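A META file of this shape can be read with any namespace-aware XML parser. The sketch below uses Python's `xml.etree.ElementTree` instead of the Java DOM parsers mentioned later in the slides. It runs on a simplified variant of the file above: only two concepts are kept and the `<head>` is omitted (an assumption made so the example is self-contained). `concept_index` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

META_NS = "http://www.cetic.be/FR/CRAQ-DB.htm"  # namespace URI from the slide

# Simplified, well-formed variant of the META file (assumption, see lead-in).
META = """<HTMLDescription xmlns:meta="http://www.cetic.be/FR/CRAQ-DB.htm">
  <meta:element name="Department">
    <html><body><table>
      <meta:element ref="PhoneNumber"/>
    </table></body></html>
  </meta:element>
  <meta:element name="PhoneNumber">
    <tr><td><b>Phone :</b></td><td><meta:value/></td></tr>
  </meta:element>
</HTMLDescription>"""


def concept_index(meta_xml):
    """Map each declared concept to (referenced sub-concepts, carries a value?)."""
    root = ET.fromstring(meta_xml)
    element_tag = "{%s}element" % META_NS
    value_path = ".//{%s}value" % META_NS
    index = {}
    for decl in root.findall(element_tag):  # top-level declarations only
        refs = [e.get("ref") for e in decl.iter(element_tag) if e.get("ref")]
        has_value = decl.find(value_path) is not None
        index[decl.get("name")] = (refs, has_value)
    return index


print(concept_index(META))
```

In this reading, a declaration that contains `meta:element ref="…"` is a compound concept, and one that contains `meta:value` is a leaf concept whose text is the data to extract.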
Method: step 3.2
• Application to other pages of the same type
  - there may be layout and structure differences between pages of the same type → the same concept may have several descriptions
  - Example: a layout difference
    <tr> <td><b>Name :</b></td></tr>
    <table width="100%">
      <tr><td>Address :</td></tr>
      <tr><td>Quality Street, 25</td></tr>
      <tr><td>London</td></tr>
    </table>
  - Example: a structure difference
    <tr> <td><i>Name :</i></td></tr>
    <table width="100%">
      <tr><td>Address :</td></tr>
      <tr><td>New York</td></tr>
      <tr><td>Main Street, 110</td></tr>
    </table>
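One way to handle several descriptions for the same concept is to try each description in turn until one fits the page fragment. The sketch below does this for the layout difference on the « Address » label (bold or plain). The `<value name="…"/>` placeholder is a simplified, hypothetical stand-in for the META file's `meta:value`, and `match` and `extract` are illustrative helpers, not the authors' algorithm. A structure difference, such as City before Street, cannot be told apart this way and still needs a human-written description.

```python
import xml.etree.ElementTree as ET

ROWS = ('<tr><td><value name="Street"/></td></tr>'
        '<tr><td><value name="City"/></td></tr>')
# Two alternative descriptions of « Address »: bold label and plain label.
BOLD = "<table><tr><td><b>Address :</b></td></tr>" + ROWS + "</table>"
PLAIN = "<table><tr><td>Address :</td></tr>" + ROWS + "</table>"


def match(desc, node, found):
    """Compare tags and fixed texts recursively; attributes are ignored."""
    if desc.tag != node.tag:
        return False
    kids = list(desc)
    if len(kids) == 1 and kids[0].tag == "value":
        if len(node) != 0:  # a placeholder expects plain text only
            return False
        found[kids[0].get("name")] = (node.text or "").strip()
        return True
    if (desc.text or "").strip() != (node.text or "").strip():
        return False
    if len(kids) != len(node):
        return False
    return all(match(d, n, found) for d, n in zip(kids, node))


def extract(alternatives, fragment):
    """Return the values captured by the first description that fits, else None."""
    node = ET.fromstring(fragment)
    for description in alternatives:
        found = {}
        if match(ET.fromstring(description), node, found):
            return found
    return None


page = ('<table width="100%"> <tr><td>Address :</td></tr> '
        '<tr><td>Quality Street, 25</td></tr> <tr><td>London</td></tr></table>')
print(extract([BOLD, PLAIN], page))
```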
Method: step 4
• Data and schema extraction
  – Data extraction
    • Web pages + META file → XML document
  – Data structure extraction
    • META file → XML schema
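The slides do not give the rules that turn a META file into an XML schema. The sketch below assumes a simple mapping: a concept that references other concepts becomes an element with a sequence of element references, and a concept that carries a value becomes a string element. The `extract_schema` function, the reduced META file and the `<h1>` used to describe DeptName are hypothetical.

```python
import xml.etree.ElementTree as ET

META_NS = "http://www.cetic.be/FR/CRAQ-DB.htm"  # from the META file slide
XS_NS = "http://www.w3.org/2001/XMLSchema"

# Reduced META file; the description of DeptName is invented for the example.
META = """<HTMLDescription xmlns:meta="http://www.cetic.be/FR/CRAQ-DB.htm">
  <meta:element name="Department">
    <table><meta:element ref="DeptName"/><meta:element ref="PhoneNumber"/></table>
  </meta:element>
  <meta:element name="DeptName"><h1><meta:value/></h1></meta:element>
  <meta:element name="PhoneNumber"><td><meta:value/></td></meta:element>
</HTMLDescription>"""


def extract_schema(meta_xml):
    """Assumed mapping: references become a sequence, a value becomes xs:string."""
    ET.register_namespace("xs", XS_NS)  # serialise with the usual xs: prefix
    schema = ET.Element("{%s}schema" % XS_NS)
    meta_element = "{%s}element" % META_NS
    xs_element = "{%s}element" % XS_NS
    for decl in ET.fromstring(meta_xml).findall(meta_element):
        target = ET.SubElement(schema, xs_element, name=decl.get("name"))
        refs = [e.get("ref") for e in decl.iter(meta_element) if e.get("ref")]
        if refs:
            complex_type = ET.SubElement(target, "{%s}complexType" % XS_NS)
            sequence = ET.SubElement(complex_type, "{%s}sequence" % XS_NS)
            for ref in refs:
                ET.SubElement(sequence, xs_element, ref=ref)
        else:
            target.set("type", "xs:string")
    return schema


print(ET.tostring(extract_schema(META), encoding="unicode"))
```

The HTML layout tags disappear from the result. Only the concept names and their nesting survive, which is what the later integration and conceptualisation steps work on.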
Method: step 5.1
• Schema integration
  – To discover relationships between concepts
  – To detect redundancy
[Figure: XML schemas to integrate. Department (webFileName[0-1]; seq: DeptName, Description, Staff, ProjectsList); Staff (seq: Administrative[*], Academic[*], PHD[*]); ProjectsList; Project (seq: ProjectName, ProjectDate); Person (seq: PersonName, PersonFunction; PersonName is a keyref); Person_1 (key) with Description_1 (seq: PersonName_1, PersonFunction_1, PersonAddress)]
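In the slides, integration is carried out with a CASE tool, and the heuristics used are not given. As one illustration of how a relationship such as the PersonName keyref might be suspected, the sketch below tests value inclusion between concepts: if every value of one concept also appears among the values of another, the first may reference the second. The `inclusion_candidates` function and the sample values are hypothetical.

```python
def inclusion_candidates(columns):
    """Return (a, b) pairs where every value of concept a also occurs in concept b."""
    pairs = []
    for a, values_a in columns.items():
        for b, values_b in columns.items():
            if a != b and values_a and set(values_a) <= set(values_b):
                pairs.append((a, b))
    return pairs


# Hypothetical values extracted from department pages and from person pages.
columns = {
    "PersonName": ["A. Martin", "B. Dupont"],
    "PersonName_1": ["A. Martin", "B. Dupont", "C. Leroy"],
    "ProjectName": ["Web reengineering"],
}
print(inclusion_candidates(columns))
```

A candidate pair is only a hint. Inclusion can be accidental on small samples, so an analyst still has to confirm it.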
Method: step 5.2
• Schema conceptualisation
[Figure: conceptual schema. Entity types PROJECT (Name, Date), PERSON (Name, Function, Address, Status; id: Name) and DEPARTMENT (Name, Phone, Fax, Address, Map, Mail, Description); relationship types « staff » (1-1, 0-N) and « in » (1-1, 1-N)]
Web Site Engineering
• Database engineering + Data migration
[Figure: logical schema. DEPARTMENT (ID_DEP, Name, Phone, Fax, Address, Map, Mail, Description; id: ID_DEP, acc); PERSON (Name, Function, Address, Status, ID_DEP; id: Name, acc; ref: ID_DEP, acc); PROJECT (Name, Date, ID_DEP; equ: ID_DEP, acc). The schema describes the DB; the XML document is loaded into the DB by Migration]
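The last step can be sketched with SQLite: create tables that follow the logical schema (reduced here to a few columns of DEPARTMENT and PERSON), then load one extracted XML document into them. The XML element names, the sample people and the `migrate` function are assumptions. The slides do not describe the authors' migration procedure.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Subset of the logical schema: ID_DEP identifies DEPARTMENT and is referenced by PERSON.
DDL = """
CREATE TABLE DEPARTMENT (ID_DEP INTEGER PRIMARY KEY, Name TEXT, Phone TEXT);
CREATE TABLE PERSON (Name TEXT PRIMARY KEY, Function TEXT,
                     ID_DEP INTEGER NOT NULL REFERENCES DEPARTMENT(ID_DEP));
"""

# Hypothetical XML document, as the extraction step might produce for one page.
XML_DOC = """<Department>
  <DeptName>Computer Science</DeptName>
  <PhoneNumber>+32 71 72.23.49</PhoneNumber>
  <Person><PersonName>A. Martin</PersonName><PersonFunction>Professor</PersonFunction></Person>
  <Person><PersonName>B. Dupont</PersonName><PersonFunction>Secretary</PersonFunction></Person>
</Department>"""


def migrate(conn, xml_doc):
    """Insert one department and its staff; ID_DEP is generated by the DBMS."""
    dept = ET.fromstring(xml_doc)
    cursor = conn.execute(
        "INSERT INTO DEPARTMENT (Name, Phone) VALUES (?, ?)",
        (dept.findtext("DeptName"), dept.findtext("PhoneNumber")),
    )
    id_dep = cursor.lastrowid  # technical identifier added by database engineering
    for person in dept.iter("Person"):
        conn.execute(
            "INSERT INTO PERSON (Name, Function, ID_DEP) VALUES (?, ?, ?)",
            (person.findtext("PersonName"), person.findtext("PersonFunction"), id_dep),
        )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
migrate(conn, XML_DOC)
```

The nesting of Person inside Department in the XML document becomes the ID_DEP foreign key in PERSON, which mirrors the `ref: ID_DEP` constraint of the logical schema.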
Tools
• HTML cleaning
  – Tidy
• Semantic enrichment
  – XML editor or the semantic browser (based on Mozilla) to edit/generate the META file
• Data and schema extraction
  – XML parsers (Java DOM)
  – pageType.extractSchema(METAfile) → XMLSchema
  – pageType.extractData(HTML*, METAfile) → XML
• Schema integration/conceptualisation and database engineering
  – CASE tool DB-Main
Conclusion and future work
• A method and tools to extract data and their structure from a web site
• Difficulty: enormous diversity of web page structures and layouts to represent the same reality
• Future work
  – test on real-size web sites
  – refine the semantic enrichment step
    • improve GUI
    • automation/assistance based on heuristics