Text Retrieval and Spreadsheets
Session 4
LBSC 690
Information Technology
Agenda
• Questions
• Text retrieval
• Spreadsheets
Document Retrieval
• Lots of applications
  – Chasing down citations in papers you read
  – Web search engines
  – Managing your personal files
• Two basic approaches
  – Explicit queries (“information retrieval”)
  – “Watch what I do” (“adaptive filtering”)
Ways of Finding Text
• Searching metadata
  – Using controlled or uncontrolled vocabularies
• Free text
  – Characterize documents by the words they contain
• Social filtering
  – Exchange and interpret personal ratings
“Exact Match” Retrieval
• Find all documents with some characteristic
  – Indexed as “Presidents -- United States”
  – Containing the words “Clinton” and “Peso”
  – Read by my boss
• A set of documents is returned
  – Hopefully, not too many or too few
  – Usually listed in date or alphabetical order
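
Exact-match retrieval on words can be sketched as a Boolean AND: a document is returned only if it contains every required word. This is a minimal sketch, not a real search engine. The sample documents and the names `tokenize` and `exact_match` are assumptions made up for illustration.

```python
def tokenize(text):
    """Lowercase a string and return its set of words, with punctuation stripped."""
    words = {word.strip(".,;:!?\"'()").lower() for word in text.split()}
    return words - {""}


def exact_match(documents, required_words):
    """Return the ids of documents that contain every required word (Boolean AND)."""
    required = {word.lower() for word in required_words}
    return sorted(doc_id for doc_id, text in documents.items()
                  if required <= tokenize(text))


# Hypothetical collection, invented for illustration only.
documents = {
    1: "Clinton approves loan as the peso falls.",
    2: "Clinton visits Montana.",
    3: "The peso recovers.",
}

# The result is an unranked set: a document either matches or it does not.
print(exact_match(documents, ["Clinton", "Peso"]))
```

Note that nothing in the result says which match is more useful, which is the motivation for ranked retrieval.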
Ranked Retrieval
• Put most useful documents near top of a list
  – Put possibly useful documents lower in the list
• No need to exclude any documents
  – Just list those least likely to be useful last
• Two basic techniques
  – Similarity-based
  – Probability-based
Similarity-Based Retrieval
• Assume “most useful” = most similar to query
• Weight terms based on two criteria:
  – Repeated words are good cues to meaning
  – Rarely used words make searches more selective
• Compare weights with query
  – Add up the weights for each query term
  – Put the documents with the highest total first
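
One common way to combine the two criteria is to multiply how often a term occurs in a document by a rarity factor. The sketch below uses log(N / document frequency) as that factor. The slides do not name a specific formula, so treat this weighting, the sample documents, and the function names as assumptions.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase a string and return its words in order, with punctuation stripped."""
    words = (word.strip(".,;:!?\"'()").lower() for word in text.split())
    return [word for word in words if word]


def rank_by_similarity(documents, query):
    """Rank documents by the summed weights of the query terms they contain."""
    term_counts = {doc_id: Counter(tokenize(text))
                   for doc_id, text in documents.items()}
    n_docs = len(documents)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())

    scores = {}
    for doc_id, counts in term_counts.items():
        total = 0.0
        for term in set(tokenize(query)):
            if term in counts:
                rarity = math.log(n_docs / doc_freq[term])  # rare terms weigh more
                total += counts[term] * rarity              # repeated terms weigh more
        scores[doc_id] = total

    # Highest total first; ties are broken by document id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical collection, invented for illustration only.
documents = {
    1: "Retrieval of text, and more retrieval of text.",
    2: "Spreadsheets calculate taxes.",
    3: "Text about spreadsheets.",
}

ranking = rank_by_similarity(documents, "text retrieval")
print([doc_id for doc_id, score in ranking])
```

No document is excluded: one that shares no terms with the query simply scores zero and is listed last.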
Example: Coordination Measure
Documents:
1: Nuclear fallout contaminated Montana.
2: Information retrieval is interesting.
3: Information retrieval is complicated.
Query: recall and fallout measures for information retrieval
Index terms: nuclear, fallout, siberia, contaminated, interesting, complicated, information, retrieval
[Figure: term-document matrix marking which index terms occur in documents 1, 2, and 3]
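
The coordination measure scores each document by how many distinct query terms it contains. A minimal sketch using the three documents and the query from this slide follows. The function names are assumptions, and the sketch matches on every word in a document rather than on a fixed list of index terms, which gives the same counts here.

```python
def tokenize(text):
    """Lowercase a string and return its set of words, with punctuation stripped."""
    words = {word.strip(".,;:!?\"'()").lower() for word in text.split()}
    return words - {""}


def coordination_scores(documents, query):
    """Count, for each document, how many distinct query terms it contains."""
    query_terms = tokenize(query)
    return {doc_id: len(query_terms & tokenize(text))
            for doc_id, text in documents.items()}


documents = {
    1: "Nuclear fallout contaminated Montana.",
    2: "Information retrieval is interesting.",
    3: "Information retrieval is complicated.",
}
query = "recall and fallout measures for information retrieval"

scores = coordination_scores(documents, query)
# Highest score first; ties are broken by document id.
ranking = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

print(scores)   # {1: 1, 2: 2, 3: 2}
print(ranking)  # [2, 3, 1]
```

Document 1 matches only “fallout”, while documents 2 and 3 each match “information” and “retrieval”, so they are listed first.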
Some Search Engines to Try
• Images
  – http://altavista.com (select images)
• Audio
  – http://www.musclefish.com (select demos)
What’s a Spreadsheet?
• Large table containing numbers
  – May also contain labels to aid interpretation
  – Columns are named with LETTERS
  – Rows are named with NUMBERS
  – Cells are named like A4, C1, ...
• Some cells are automatically calculated
  – Formula specified when spreadsheet is created
  – Values are recalculated continuously
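
The idea of a calculated cell can be sketched with a toy model in which a cell holds either a number or a formula. The `Sheet` class, the income figures, and the 0.25 rate are assumptions for illustration. Real spreadsheets track dependencies between cells rather than re-evaluating on every read as this sketch does.

```python
class Sheet:
    """A toy spreadsheet: each cell holds a number or a formula (a function of the sheet)."""

    def __init__(self):
        self.cells = {}

    def set(self, name, content):
        """Store a number or a formula under a cell name such as 'B3'."""
        self.cells[name] = content

    def get(self, name):
        """Return a cell's value, evaluating its formula if it has one."""
        content = self.cells[name]
        # Formulas are evaluated on every read, so they always use current inputs.
        return content(self) if callable(content) else content


sheet = Sheet()
sheet.set("B3", 40000)                          # hypothetical input value
sheet.set("B4", lambda s: s.get("B3") * 0.25)   # plays the role of a formula like =B3*0.25

print(sheet.get("B4"))   # 10000.0

sheet.set("B3", 50000)   # change the input ...
print(sheet.get("B4"))   # 12500.0, the calculated cell follows automatically
```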
How Spreadsheets are Used
• Record keeping (checkbook)
• Calculation (income tax)
• What-if analysis (cash flow)
  – Sensitivity analysis (exchange rate)
• Goal seeking (retirement planning)
  – Uses continuous recalculation (“iteration”)
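
Goal seeking can be sketched as repeated recalculation: try an input, recompute the result, and adjust the input until the result reaches the target. The sketch below uses bisection, which is one possible search strategy and not necessarily what any particular spreadsheet uses. The savings model, the 5% rate, the 30 years, and the target balance are all hypothetical.

```python
def future_balance(annual_deposit, years, rate):
    """Balance after depositing a fixed amount at the start of each year."""
    balance = 0.0
    for _ in range(years):
        balance = (balance + annual_deposit) * (1 + rate)
    return balance


def goal_seek(func, target, low, high, tolerance=0.01, max_steps=200):
    """Find an input between low and high whose output is within tolerance of target.

    Assumes func increases as its input increases.
    """
    for _ in range(max_steps):
        mid = (low + high) / 2
        value = func(mid)            # one "recalculation" per iteration
        if abs(value - target) <= tolerance:
            return mid
        if value < target:
            low = mid                # result too small: try a larger input
        else:
            high = mid               # result too large: try a smaller input
    return (low + high) / 2


# Hypothetical question: what yearly deposit reaches 500,000 in 30 years at 5%?
deposit = goal_seek(lambda d: future_balance(d, years=30, rate=0.05),
                    target=500000, low=0, high=100000)
print(round(deposit, 2))
```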
Spreadsheet Applications
• Originally designed for financial records
• Library applications
  – Budget
  – Collection development
  – Shelving capacity
• Educational Applications
  – Grade records
  – Equipment inventory
Excel Demo
• Start Excel
  – Microsoft Office folder
• Open N:\SHARE\CLASS\POSTCARD.XLS
  – File menu
  – N: is the volume labeled lbsc690c in Windows
• Enter your 1999 (desired) income in cell B3
  – Tax due is displayed in cell B4
Excel Demo
• Change the tax due
  – Place the cursor over B4
  – Type “=B3*0.x”
• “=” tells Excel this is a formula
• “B3” refers to the number in cell B3
• The “x” in “0.x” should reflect your political views
  – 0.5 would take away half your money
  – Try different values in cell C3
• What kind of spreadsheet use is this?
Excel Demo
• Add itemized deductions
  – Highlight row 4 (click on 4)
  – Select “Rows” in “Insert” menu twice
  – Label A4 as “Deduction amount”
  – Label A5 as “Taxable income”
  – Put the appropriate formula in B5
  – Change the formula in B6 as needed
• Note how it was copied from B4 with changes
Excel Demo
• Limit the deduction
  – Maximum of 50% of income or 10,000
• Search for help on “maximum” and “minimum”
• Replace the formula in B5 with a more complicated one
  – You can use another cell to show a partial result
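
The limited deduction can be sketched outside Excel to check the logic before writing the formula. The slide's wording is ambiguous, so this sketch assumes one reading: the allowed deduction may not exceed the smaller of 50% of income and 10,000. Under the other reading, the inner `min` would become `max`. The function names and the tax rate are hypothetical; `limit` plays the role of the extra cell that shows a partial result.

```python
def taxable_income(income, claimed_deduction, share_cap=0.5, flat_cap=10000):
    """Income minus the allowed deduction (the role of cell B5 in the demo)."""
    # Partial result: the most that may be deducted.
    # Assumed reading: the smaller of 50% of income and 10,000.
    limit = min(share_cap * income, flat_cap)
    # The allowed deduction is the claimed amount, but never more than the limit.
    allowed = min(claimed_deduction, limit)
    return income - allowed


def tax_due(income, claimed_deduction, rate):
    """Tax on taxable income (the role of cell B6); rate is a hypothetical flat rate."""
    return taxable_income(income, claimed_deduction) * rate


print(taxable_income(40000, 5000))    # deduction is below the limit
print(taxable_income(40000, 15000))   # deduction is capped at 10,000
print(taxable_income(12000, 9000))    # deduction is capped at 50% of income
print(tax_due(40000, 15000, 0.25))
```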
When Style is Important
• Too complex to visualize at once
  – Size
  – Relationships between formulas
• Used by more than one person
  – Includes use in presentations and papers
• Used for a long time
  – Essentially communicating to yourself
Style Guidelines
• Organization
  – Depict the solution approach visually
  – Group things where possible (e.g., parameters)
  – Build in cross-checks to discover input errors
• Readability
  – Describe the computation
  – Meaningful labels help a lot
  – Minimize clutter
Building Complex Applications
• Computers keep track of detail well
  – But people don’t
• Adopt meaningful abstractions
  – Organize a calculation the way you think
• Use a structured process
  – Examples: waterfall and spiral models
Waterfall Model
• Five steps
  – Identify requirements
  – Develop a detailed specification
  – Design the spreadsheet
  – Implement the spreadsheet
  – Test the spreadsheet
• Team project is based on a waterfall model
  – Specification, Test Plan, and User Manual
Spiral Model
• Build a prototype to solve part of the problem
  – Don’t worry about efficiency at this point
• Use what you learn to build another prototype
  – Either more complete or more efficient
• Repeat until the prototype does what you want
Lessons Learned
• Large projects need both models
  – Waterfall model helps identify subtasks
  – The first try is usually not right
• Most common mistake is not starting over
  – It seems easier to keep refining a prototype
  – But that won’t ever fix design-level problems
• Rule of thumb: double every estimate!
Summary
• Retrieval exploits human-machine synergy
  – Machines are fast, but simple
  – Humans are sophisticated, but slow
• Spreadsheets can make calculation easy
  – Easily modified to add new calculations
  – Need to design complex spreadsheets carefully