CS246 by John Cho 1
CS246: Web Information Systems
Junghoo “John” Cho
Spring 2014
Course Information
Web page: http://oak.cs.ucla.edu/cs246/
Topic: Web information management
Time: MW 2:00 -- 3:50 pm
Place: Boelter Hall 5422
Instructor: Junghoo “John” Cho
  office: 3531H Boelter Hall
  email: [email protected]
    please use subject “CS246: …”
  office hours: Mon 1-2 pm.
Who is this class for?
Strong interest in research
Interest in Web information systems
Time commitment:
  Around 2-3 papers every week
    Typically one full day of paper reading
  One independent project
    Similar to paper writing
      In fact we read papers from past student projects!
    Or interesting application implementation
Today’s Topics
Overview of the course topics
Course logistics
  Paper reading assignments
  Class project
Prerequisite
Introductory database, e.g., CS143
  e.g.: query? SQL?
Basic algorithms and data structures
Basic probability and statistics
  P(A|C), Bayes rule, …
Design and implementation experience
  Basic C++
Quick test: Grab a sample paper
  See if you can read, understand and build it
Tell Us About You
Name
Department & Program
Before coming to UCLA
Brief history at UCLA
Technical/research interests
Expectation from the class
Information Galore
[Figure labels: Legacy database, Plain text files, Biblio server]
Central Problem
How to manage/access information on the Web?
Three major approaches
  Central indexing
    E.g., Web search engine
  Dynamic integration
    E.g., comparison shopping services
  Data extraction
    E.g., spamming companies
Topic: Web Search (Central Indexing)
Central Index
Topic: Web Search (Central Indexing)
Web: collection of passive HTML pages
Find Web pages relevant to a query
Traditional Information Retrieval:
  Web = collection of HTML pages
  HTML page = a bag of words
More than that?
  Links, structure of the Web
  User access patterns
  HTML tags (markups)
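To make the "bag of words" view concrete, here is a minimal Python sketch, not taken from the slides. It treats each HTML page as a multiset of words and builds a small central index from words to pages. The sample pages, the function names, and the crude regex-based tag stripping are all assumptions made for illustration; note that it deliberately ignores links, access patterns, and markup, which is exactly the limitation the slide points at.

```python
import re
from collections import Counter, defaultdict


def bag_of_words(html):
    """Crudely strip tags, then count lowercase word occurrences."""
    text = re.sub(r"<[^>]+>", " ", html)
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(pages):
    """Map each word to the set of page ids that contain it."""
    index = defaultdict(set)
    for page_id, html in pages.items():
        for word in bag_of_words(html):
            index[word].add(page_id)
    return index


def search(index, query):
    """Return the ids of pages that contain every word of the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical pages, invented for this example.
PAGES = {
    "p1": "<html><b>Web</b> search engines index the Web</html>",
    "p2": "<html>Database systems and query optimization</html>",
    "p3": "<html>Web database integration</html>",
}
INDEX = build_index(PAGES)
```

With these sample pages, `search(INDEX, "web database")` returns only `p3`, the one page containing both words. Word order and position are lost, which is what "bag" means here.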
Topic: Dynamic Integration
Cars.com
Amazon.com
Apartments.com
401carfinder.com
Topic: Dynamic Integration
[Diagram: a Mediator on top of one Wrapper per source: Source 1, Source 2, Source n]
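The mediator/wrapper layout can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names mirror the labels on the slide, but the two sources, their field names, and the choice to sort merged results by price are invented here. Each wrapper hides one source's own format behind a common query interface, and the mediator fans a query out to all wrappers at query time and merges the answers.

```python
class Wrapper:
    """Hide one source's field names behind a common query interface."""

    def __init__(self, name, records, title_key, price_key):
        self.name = name
        self.records = records
        self.title_key = title_key
        self.price_key = price_key

    def query(self, keyword):
        """Return matching records in the shared (source, title, price) form."""
        hits = []
        for rec in self.records:
            title = rec[self.title_key]
            if keyword.lower() in title.lower():
                hits.append({
                    "source": self.name,
                    "title": title,
                    "price": float(rec[self.price_key]),
                })
        return hits


class Mediator:
    """Send a query to every wrapper and merge the normalized results."""

    def __init__(self, wrappers):
        self.wrappers = wrappers

    def query(self, keyword):
        results = []
        for wrapper in self.wrappers:
            results.extend(wrapper.query(keyword))
        # Assumption: cheapest first, as a comparison shopping service might do.
        return sorted(results, key=lambda r: r["price"])


# Two made-up sources storing the same kind of data under different field names.
SOURCE_1 = [{"name": "Abbey Road CD", "usd": "12.50"},
            {"name": "Thriller CD", "usd": "9.99"}]
SOURCE_2 = [{"item": "Abbey Road CD", "cost": 10},
            {"item": "Like a Prayer CD", "cost": 20}]

MEDIATOR = Mediator([
    Wrapper("Source 1", SOURCE_1, "name", "usd"),
    Wrapper("Source 2", SOURCE_2, "item", "cost"),
])
```

Calling `MEDIATOR.query("abbey road")` returns one record from each source, with the Source 2 offer first because it is cheaper. Unlike central indexing, nothing is collected ahead of time; the sources are contacted when the query arrives.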
Topic: Data Extraction
[Diagram: WWW -> Structured data]
  Beatles  $10
  Madonna  $20
  NSync    $20
How can we extract “structured data” from free text automatically?
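As a small illustration of the input and output involved, the Python sketch below turns a free-text snippet into (name, price) records like those on the slide. The sentence and the regular expression are assumptions written by hand for this one example, so this shows the target of extraction rather than an automatic method; finding such patterns without hand-coding them is the open question the slide raises.

```python
import re

# Hypothetical free text, invented to match the records on the slide.
TEXT = "Beatles CDs for $10. Madonna albums at $20. NSync now only $20!"

# Hand-written rule: a capitalized name, then a dollar amount later in
# the same sentence (no '$', '.', or '!' in between).
PATTERN = re.compile(r"([A-Z]\w+)[^$.!]*\$(\d+)")


def extract(text):
    """Return (name, price) tuples found by the hand-written pattern."""
    return [(name, int(price)) for name, price in PATTERN.findall(text)]


RECORDS = extract(TEXT)
```

Here `RECORDS` holds the three rows from the slide: Beatles at 10, Madonna at 20, and NSync at 20. The rule is brittle; a sentence that starts with another capitalized word before the name would already break it.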
Main Course Workload
Paper reading
  Paper reading assignments
  Class discussion
  We mainly focus on “central indexing”
Independent projects
High-Level Goal
Learn core ideas and techniques
  Some of the techniques can be useful for other fields
Learn how to read papers
Hopefully learn what it is like to do research
  Sometimes very frustrating but often very rewarding
Paper Reading
Why:
  Something that you will do all the time as a researcher
  Learn to be critical and communicate well
  Acquire knowledge to conduct research/project
About 20 papers from
  Conferences: SIGMOD, VLDB, WWW, and …
Before the class:
  Everyone: read and review the paper
During the class:
  Instructor: present his own understanding and lead class discussion
  Everyone: participate!!!
How to Get Papers
From the class homepage
  http://oak.cs.ucla.edu/cs246/
Some of the materials password protected
  User name: cs246
  Password: papers
Let me know if any problem
How to Read Papers
Understand the “Big Picture”
  What is the problem?
  Why is it important?
  Why is it difficult?
  What has this paper done?
  What have others done?
Paper Reviews (1)
Due by the preceding Sunday
  Submit through our Web submission interface on the class Web page
Required components: at most 3 paragraphs
  Summary (1 paragraph): your own words
    This paper discusses how to optimize queries with...
  Comments/criticisms (1-2 paragraphs): the good & the bad
    It addresses a real problem and the solution is interesting … But I feel the experiments are not realistic because...
Optional: questions, as many as you want
  Why do the authors assume that queries are independent?
Paper Reviews (2)
May skip 3 paper summaries without penalty
Most reviews will get full score unless they are written extremely poorly
Class Project
Why:
  Work on a specific problem and learn to find a solution
40% of the class grade
Team of up to 3
Topic: any problem related to the general problem
Open style
  Rigorous study of a research problem or
  Any interesting system implementation
Class Project Schedule
Important Milestones
  Group formation: 4/09 (2nd week Wed)
  Project proposal: 4/20 (3rd week Sun)
  Project progress: 5/07 (6th week Wed)
  Final report: 5/21 (8th week Sun)
  Project presentation: 9th and 10th weeks
You are responsible for staying on track
  Make appointments with instructor as needed
Project: Please Remember
Set your aims high but be realistic
Expect to read at least 4-5 papers along the way
Start early
  Don’t do it right before the deadline
  Always unexpected obstacles
  Some students could not finish in previous quarters
Please, please start early
You are responsible for staying on track
Grading
Midterm: 40%
Paper reviews: 20%
Project: 40%
Announcements
First review due Sunday 4/06
Three papers for class 3 and 4
  Graph structure in the Web
  The Anatomy of a Large-Scale Hypertextual …
  Authoritative sources in a hyperlinked environment