foundations of web technology - springer978-1-4615-1135-9/1.pdf · foundations of web technology by...

FOUNDATIONS OF WEB TECHNOLOGY


Post on 01-Nov-2019


Page 1: FOUNDATIONS OF WEB TECHNOLOGY - Springer978-1-4615-1135-9/1.pdf · FOUNDATIONS OF WEB TECHNOLOGY by Ramesh R. Sarukkai Senior Architect, Yahoo Inc, USA. SPRINGER SCIENCE+BUSINESS

FOUNDATIONS OF WEB TECHNOLOGY


THE KLUWER INTERNATIONAL SERIES IN ENGINEERING AND COMPUTER SCIENCE


FOUNDATIONS OF WEB TECHNOLOGY

by

Ramesh R. Sarukkai Senior Architect, Yahoo Inc, USA.

SPRINGER SCIENCE+BUSINESS MEDIA, LLC


Library of Congress Cataloging-in-Publication Data

Sarukkai, Ramesh R. Foundations of Web Technology ISBN 978-1-4613-5409-3 ISBN 978-1-4615-1135-9 (eBook) DOI 10.1007/978-1-4615-1135-9

Copyright © 2002 by Springer Science+Business Media New York. Originally published by Kluwer Academic Publishers in 2002. Softcover reprint of the hardcover 1st edition 2002.

All rights reserved. No part of this work may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, microfilming, recording, or otherwise, without the written permission from the Publisher, with the exception of any material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work.

Permission for books published in Europe: [email protected]! Permissions for books published in the United States of America: [email protected]

Printed on acid-free paper.


Dedicated to my mother Santha and my father S.K. Rangarajan


Contents

Contributors
Acknowledgements
Preface

Part I Fundamentals

1 Introduction
1. WORLD WIDE WEB
2. CORE TECHNOLOGY
3. WHAT'S COVERED IN THIS BOOK
4. ORGANIZATION OF THE BOOK

2 Data Markup
1. INTRODUCTION
2. DATA MARKUP
3. EXTENSIBLE MARKUP LANGUAGE (XML)
4. EXTENSIBLE STYLE SHEETS
5. XPATH
6. HYPERTEXT MARKUP LANGUAGE (HTML)
7. CONCLUSION
FURTHER READING
EXERCISES

3 Networking
1. INTRODUCTION
2. LAYERING OF NETWORKS
3. LOCATING ENDPOINTS
4. TRANSMISSION PROTOCOLS
5. CLIENT/SERVER
6. HYPER TEXT TRANSFER PROTOCOL (HTTP)
7. WEB SECURITY
8. PRIVACY
9. CONCLUSION
FURTHER READING
EXERCISES

4 Information Retrieval
1. INTRODUCTION
2. COMPONENTS OF IR SYSTEM
3. TEXT PROCESSING
4. INDEXING AND SEARCH
5. RANKING
6. QUERY OPERATIONS
7. LATENT SEMANTIC INDEXING
8. EVALUATION METRICS
9. CONCLUSIONS
FURTHER READING
EXERCISES

Part II Applications

5 Web Search and Directory
1. INTRODUCTION
2. WEB SEARCH
3. VARIATIONS IN SEARCHING
4. RANKING
5. WEB DIRECTORIES
6. CONCLUSION
FURTHER READING
EXERCISES

6 Web Mining
1. INTRODUCTION
2. DATA MINING
3. ASSOCIATION MINING
4. PREDICTIVE MODELLING
5. CLUSTERING
6. OTHER DATA MINING PROBLEMS
7. EXAMPLES OF WEB MINING
8. CONCLUSION
FURTHER READING
EXERCISES

7 Messaging and Commerce
1. INTRODUCTION
2. MESSAGING APPLICATIONS
3. ELECTRONIC MAIL PROTOCOLS
4. IM ARCHITECTURE
5. COMMERCE APPLICATIONS
6. OVERVIEW OF E-COMMERCE FRAMEWORKS
7. EXAMPLE ARCHITECTURE
8. CONCLUSION
FURTHER READING
EXERCISES

8 Mobile Access
1. INTRODUCTION
2. MOBILE COMMUNICATION SYSTEMS
3. WIRELESS APPLICATION PROTOCOL
4. WIRELESS MARKUP LANGUAGES
5. GENERATING WIRELESS CONTENT
6. SHORT MESSAGING SERVICE
7. EMERGING TRENDS
8. CONCLUSION
FURTHER READING
EXERCISES

9 Web Services
1. INTRODUCTION
2. OVERVIEW OF ARCHITECTURE
3. UDDI
4. SOAP
5. PLATFORMS
6. EXAMPLE OF A SERVICE
7. LIMITATIONS
8. CONCLUSION
FURTHER READING
EXERCISES

10 Conclusion
1. REVIEW
2. SYSTEM DESIGN OVERVIEW
3. LIMITATIONS
4. THE FUTURE

APPENDIX
REFERENCES
ACRONYMS
INDEX


List of Figures

Figure 1. Growth of the World Wide Web
Figure 2. Information - Structure - Presentation Separation
Figure 3. Graphical representation of information structure
Figure 4. History of Markup Languages
Figure 5. Illustration of XSL
Figure 6. Transforming one XML structure to another XML structure
Figure 7. Example of HTML form output
Figure 8. Client Server Architecture
Figure 9. Illustration of a proxy server scenario
Figure 10. Simple example of encrypted message transmission
Figure 11. Overview of Information Retrieval System
Figure 12. Example of a "Similarity Matrix"
Figure 13. Example of a prefix tree
Figure 14. Document vectors for the two sample documents
Figure 15. Documents Matrix representation A
Figure 16. Precision versus Recall Graph
Figure 17. Overview of Web search system
Figure 18. Web Crawling System
Figure 19. Meta-Search Engine
Figure 20. Graph Structure used to illustrate the HITS algorithm
Figure 21. Web Directory - fixed taxonomy, but automatic classification
Figure 22. Example of Semi-Automatic Taxonomy Generation
Figure 23. Web Graph for exercise (7)
Figure 24. Single layer neural network
Figure 25. Example of a decision tree for classification
Figure 26. Example of a Linear Classifier
Figure 27. Illustration of clustering
Figure 28. Clustered data samples and the two centroids
Figure 29. Overview of an e-mail system
Figure 30. Overview of prototype IM system
Figure 31. Prototype E-commerce architecture
Figure 32. Pricing and Packaging
Figure 33. Subscription module for billing
Figure 34. Overview of Global System for Mobile Communication
Figure 35. WAP Architecture Overview
Figure 36. Two approaches to generating wireless markup
Figure 37. Transcoding Proxy Architecture
Figure 38. XSLT Approach to Wireless Markup Document Generation
Figure 39. Overview of SMS Architecture
Figure 40. Web services protocol Stack
Figure 41. Web service integration procedure
Figure 42. Overview of Web Service Usage


List of Tables

Table 1. Layers of data abstraction
Table 2. Example XML Document
Table 3. Example XML document with external DTD
Table 4. Contents of the example DTD
Table 5. Example defining attributes for an element
Table 6. XML Schema example
Table 7. ComplexType in XML Schema
Table 8. Example illustrating reference features in XML Schema
Table 9. Illustration of XSL specification in an XML document
Table 10. Example stylesheet definition
Table 11. Result XML document when stylesheet in Table 10 is applied to XML in Table 9
Table 12. Example of XPath expressions
Table 13. Example of HTML table rendering
Table 14. IP Datagram
Table 15. IP Datagram containing TCP segment
Table 16. UDP packets encapsulated in IP datagrams
Table 17. HTTP 1.0 Client Request
Table 18. HTTP status codes
Table 19. Sample HTTP/1.1 Response codes absent in HTTP/1.0
Table 20. HTTP/1.1 Header Fields
Table 21. Regular Expression Generator for a simple tokenizer
Table 22. Example of stoplist words
Table 23. An example of N-Gram Stemming
Table 24. Example of Entropy Successor Stemming
Table 25. Example of inverted index
Table 26. Example of text for prefix tree creation
Table 27. Sample documents for Vector Space illustration
Table 28. Web Crawling Algorithm
Table 29. HITS Algorithm
Table 30. Authority scores for iterations of the HITS Algorithm
Table 31. The Hub scores for iterations of the HITS Algorithm
Table 32. Sample data to illustrate association mining
Table 33. Sample Web site ratings table
Table 34. Sample data to illustrate classification
Table 35. Data Samples to illustrate clustering
Table 36. Assignment of data samples to clusters after first iteration
Table 37. List of Transactions
Table 38. Classification Training data
Table 39. Sample clustering data for exercise 6
Table 40. Collaborative Filtering data for exercise 8
Table 41. Example of an SMTP session
Table 42. IFX gateway/service provider functional component stack
Table 43. Example of an IFX document
Table 44. WAP Protocol Stack
Table 45. "Hello world" WML example
Table 46. WML Example illustrating transitions from one card to the next
Table 47. Example of anchored text
Table 48. Example of input collection and submission to backend server
Table 49. Example XML Document
Table 50. Example XSL stylesheet for generating WML
Table 51. Steps involved in transmission of a SMS message to a mobile device (GSM)
Table 52. Example WSDL document


Contributors

Dr. Ramesh Rangarajan Sarukkai is currently a senior architect at Yahoo! Inc., and has worked at Lernout & Hauspie Inc. and IBM TJ Watson Research Center. He has successfully led many projects to completion, and developed products that are award-winning and used by millions of users. He holds M.S. and Ph.D. degrees from the University of Rochester, Rochester, NY, and a B.E. degree from Visveshvaraya College (UVCE, Bangalore University), India. Dr. Sarukkai's first paper, on a novel approach to automatic character recognition, appeared in the reputed journal Pattern Recognition, based on his independent project during high school. Dr. Sarukkai continued R&D in many areas such as AI, speech recognition, information retrieval, networking, wireless and web technology, and published in leading journals/conferences such as Computer Networks, many IEEE transactions, Neural Computation (MIT Press), Computer & Graphics, and Pattern Recognition. In addition, Dr. Sarukkai holds many patents (awarded and pending) in the above areas, including an early patent on Web technology co-invented with Hewlett-Packard Labs in 1996. He has served as a reviewer for various leading journals and conferences, and on working groups such as the World Wide Web Consortium's (W3C) Voice Browser Activity.


Acknowledgements

Writing a book that encompasses a wide range of topics requires a lot of feedback and constructive suggestions for improvements on various fronts. Without the time and efforts of the reviewers and the numerous colleagues who spent their valuable time to discuss, read, and edit my manuscript, this book wouldn't exist in its current form.

Firstly, the anonymous reviewers gave useful feedback on the book proposal. Next, I would like to thank the following for their time, effort and valuable feedback on the contents of the book: Prof. Dana Ballard (Univ. of Rochester) for his ever creative and insightful comments, Dr. Dave Raggett (W3C/OpenWave) for meticulous editing and excellent points, Prof. Mark Crovella (Boston Univ.) for constructive suggestions, Dr. Udi Manber (Yahoo! Inc.) for useful guidance on topic selection and book writing in addition to technical feedback, Dr. Sanjeev Dharap (Yahoo! Inc.) for many discussions and comments, Raghuveer Chakravarthi (Yahoo! Inc.) for architecture overview and discussions on Instant Messaging, and Kian-Tat Lin (Yahoo! Inc.) for useful comments on Data/Web mining. I have also benefited greatly from useful discussions with many at Yahoo! Inc., which have influenced the contents of the book, including Dr. Anurag Mendhekar, Madhu Yarlagadda, Ash Patel, Sanjay Rao, Dr. Qi Lu, Venkat Panchapakesan and other colleagues.

Without the blessings of God, and the strong support of my parents, I would never be able to achieve anything in my life. I am deeply thankful to my father Prof. S. K. Rangarajan for his constant guidance, his encouragement to pursue creative endeavours and never stop learning, and his positive attitude. My mother Santha has always been supportive and constantly pushed me to aim higher. I also thank my siblings Sekhar and Sundar for their advice. Sekhar had many insightful comments and suggestions on my book. Sundar, the writer in the family, encouraged me to seriously pursue writing a book, and I thank him for that (esp. his advice: "your words are just around the corner. Just write them down!"). Last but not least, without the love, constant support, and encouragement of my beautiful wife Ramya, this book would never exist. In addition to putting up with late night writing schedules, Ramya has given insightful and practical feedback on the organization and technical aspects of the book, and I am deeply indebted to her for that.

The editorial staff at Kluwer have been very supportive and helpful with my many simple formatting and editorial questions: I would like to especially thank Sharon Palleschi, and Susan Lagerstorme-Fife for their prompt responses and formatting support. I cannot list the many other people, who have shaped or influenced my thoughts and thus the contents of this book, but I thank them for that, and I apologize for any unintentional omissions. While all the good in this book is attributable to the feedback, discussions and support of the many people, any error or fault is my own.

The poetic quotations at the beginning of the preface and chapters are from the work "Fireflies" by Nobel Laureate Rabindranath Tagore.


Preface

"Birth is from the mystery of night into the greater mystery of day"

My idea of writing a book on the concepts that power the Web started in 1999. The huge growth of the Web has fuelled a variety of applications that are being used by millions around the world. Despite the ups and downs in the dot-com business world, Web technology solidified over the last decade. The applications that power the Web derive their strengths from a diverse range of technology such as information retrieval and mobile data access.

Most books on the Web focus on programmatic aspects of languages such as Java or JavaScript, or on descriptions of standards such as Hypertext Markup Language (HTML) or Wireless Markup Language (WML). A book that covers the concepts behind the infrastructure of the Web would be indispensable to a wide audience interested in learning how the Web works, how techniques in Web technology can be applied to their own problem, and what the emergent technological trends in these areas are. This motivated me to write a book that covers the "Foundations of Web Technology", ranging from fundamental areas such as information retrieval and data markup to applications such as web search, instant messaging, mobile access and web services. I believe that this book will be useful for a number of years to come, since Web technology has matured considerably, and the concepts discussed in this book will continue to be applied universally.


Audience

This book has been written to appeal to a wide audience. For a person interested in understanding the basic concepts of Web technology, this book covers the fundamentals and the techniques needed to build Web applications. For the professional who has worked on specific parts of Web or related technology, this book will provide a broad understanding of the architecture of different applications on the Web, and how they relate to each other. The techniques are discussed at both a conceptual level and a practical level, so that the ideas discussed can be translated to real-world prototypes. The pedagogical style of the book, coupled with the numerous examples, illustrations and exercises, makes the content accessible to a wide variety of audiences.

Course Textbook

This book is well suited as the textbook for a course that covers the foundations of Web technology. Each chapter has a set of exercises that includes conceptual and theoretical questions as well as projects. The "Further Reading" section in each chapter is a good starting point for going deeper into the topic covered in that chapter. This book can also serve as a base for a seminar course on Web technology. The book is written for any student with an engineering background, although programming skills and preliminary coursework in computer science are preferable. This book is suitable for an engineering student at a senior undergraduate or graduate level. Prerequisites for such a course are basic undergraduate-level computer science courses (e.g. computer organization, data structures) and math courses (e.g. calculus, vector analysis).