Text Retrieval and Spreadsheets
Session 4
LBSC 690
Information Technology
Agenda
• Questions
• Text retrieval
• Spreadsheets
Document Retrieval
• Lots of applications
  – Chasing down citations in papers you read
  – Web search engines
  – Managing your personal files
• Two basic approaches
  – Explicit queries (“information retrieval”)
  – “Watch what I do” (“adaptive filtering”)
Ways of Finding Text
• Searching metadata
  – Using controlled or uncontrolled vocabularies
• Free text
  – Characterize documents by the words they contain
• Social filtering
  – Exchange and interpret personal ratings
“Exact Match” Retrieval
• Find all documents with some characteristic
  – Indexed as “Presidents -- United States”
  – Containing the words “Clinton” and “Peso”
  – Read by my boss
• A set of documents is returned
  – Hopefully, not too many or too few
  – Usually listed in date or alphabetical order
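A minimal sketch of exact-match retrieval on free text, where a document is returned only if it contains every required word. The three sample documents and the helper names `tokenize` and `exact_match` are invented for illustration; a real system would consult an index instead of scanning every document.

```python
import re


def tokenize(text):
    """Lowercase a string and return the set of words it contains."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def exact_match(docs, required_words):
    """Return the ids of documents that contain every required word."""
    required = {word.lower() for word in required_words}
    return sorted(doc_id for doc_id, text in docs.items()
                  if required <= tokenize(text))


# Hypothetical collection, made up for this example.
docs = {
    1: "Clinton discusses the peso crisis",
    2: "Clinton visits Montana",
    3: "The peso falls again",
}

print(exact_match(docs, ["Clinton", "Peso"]))  # [1]
```

The result is an unordered set in principle; sorting by id here stands in for the date or alphabetical listing mentioned above.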
Ranked Retrieval
• Put most useful documents near top of a list
  – Put possibly useful documents lower in the list
• No need to exclude any documents
  – Just list those least likely to be useful last
• Two basic techniques
  – Similarity-based
  – Probability-based
Similarity-Based Retrieval
• Assume “most useful” = most similar to query
• Weight terms based on two criteria:
  – Repeated words are good cues to meaning
  – Rarely used words make searches more selective
• Compare weights with query
  – Add up the weights for each query term
  – Put the documents with the highest total first
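One common way to combine the two weighting criteria is TF-IDF. The slide does not name a formula, so the weight used here (term count times the log of collection size over document frequency) is an assumption, as are the function names and the sample query.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and return its words as a list."""
    return re.findall(r"[a-z]+", text.lower())


def rank(docs, query):
    """Rank document ids by the summed weight of the query terms."""
    counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    n_docs = len(docs)
    # Document frequency: the number of documents each term occurs in.
    df = Counter(term for c in counts.values() for term in c)

    def weight(term, doc_id):
        tf = counts[doc_id][term]              # repeated words weigh more
        if tf == 0:
            return 0.0
        return tf * math.log(n_docs / df[term])  # rare words weigh more

    query_terms = set(tokenize(query))
    scores = {doc_id: sum(weight(t, doc_id) for t in query_terms)
              for doc_id in docs}
    order = sorted(docs, key=lambda doc_id: scores[doc_id], reverse=True)
    return order, scores


docs = {
    1: "Nuclear fallout contaminated Montana.",
    2: "Information retrieval is interesting.",
    3: "Information retrieval is complicated.",
}
order, scores = rank(docs, "interesting information retrieval")
print(order)  # [2, 3, 1]
```

Document 2 comes first because it matches the rare word "interesting" as well as the two more common query words.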
Example: Coordination Measure
Documents:
1: Nuclear fallout contaminated Montana.
2: Information retrieval is interesting.
3: Information retrieval is complicated.
Index terms: nuclear, fallout, siberia, contaminated, interesting, complicated, information, retrieval
[Figure: term-document matrix with one row per index term and one column per document (1, 2, 3); a 1 marks a term that occurs in a document]
Query: recall and fallout measures for information retrieval
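A small sketch of the coordination measure as it is usually defined: a document's score is the number of distinct query words it contains. That definition, the tokenizer, and the function names are assumptions; the slide shows only the documents, the terms, and the query.

```python
import re


def tokenize(text):
    """Lowercase a string and return the set of words it contains."""
    return set(re.findall(r"[a-z]+", text.lower()))


def coordination(query, document):
    """Count the distinct query words that also occur in the document."""
    return len(tokenize(query) & tokenize(document))


documents = {
    1: "Nuclear fallout contaminated Montana.",
    2: "Information retrieval is interesting.",
    3: "Information retrieval is complicated.",
}
query = "recall and fallout measures for information retrieval"

scores = {doc_id: coordination(query, text)
          for doc_id, text in documents.items()}
print(scores)  # {1: 1, 2: 2, 3: 2}
```

Document 1 scores because it shares the word "fallout" with the query, even though the query uses it in the retrieval-evaluation sense. Simple word matching cannot tell the two meanings apart.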
Some Search Engines to Try
• Images
  – http://altavista.com (select images)
• Audio
  – http://www.musclefish.com (select demos)
What’s a Spreadsheet?
• Large table containing numbers
  – May also contain labels to aid interpretation
  – Columns are named with LETTERS
  – Rows are named with NUMBERS
  – Cells are named like A4, C1, ...
• Some cells are automatically calculated
  – Formula specified when spreadsheet is created
  – Values are recalculated continuously
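To make automatically calculated cells concrete, here is a toy model in which a cell holds either a number or a formula, and a formula is re-evaluated every time its value is read. The `Sheet` class and the use of Python functions as formulas are illustrative assumptions, not a description of how any real spreadsheet program is built.

```python
class Sheet:
    """Toy spreadsheet: a cell is either a number or a formula (a function)."""

    def __init__(self):
        self.cells = {}

    def set(self, name, value):
        """Store a number or a formula in the named cell."""
        self.cells[name] = value

    def get(self, name):
        """Return the cell's value, recomputing it if it is a formula."""
        value = self.cells[name]
        return value(self) if callable(value) else value


sheet = Sheet()
sheet.set("A1", 10)
sheet.set("A2", 32)
sheet.set("A3", lambda s: s.get("A1") + s.get("A2"))  # plays the role of =A1+A2

print(sheet.get("A3"))  # 42
sheet.set("A1", 20)     # change an input cell
print(sheet.get("A3"))  # 52
```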
How Spreadsheets are Used
• Record keeping (checkbook)
• Calculation (income tax)
• What-if analysis (cash flow)
  – Sensitivity analysis (exchange rate)
• Goal seeking (retirement planning)
  – Uses continuous recalculation (“iteration”)
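Goal seeking can be sketched as a search that keeps recalculating a formula with new inputs until the result reaches a target. The bisection strategy, the savings formula, and every number below are assumptions chosen for illustration; the slide does not say how any particular spreadsheet performs the search.

```python
def future_value(annual_saving, years, rate):
    """Balance after depositing annual_saving at the end of each year."""
    balance = 0.0
    for _ in range(years):
        balance = balance * (1 + rate) + annual_saving
    return balance


def goal_seek(func, target, low, high, tolerance=0.01):
    """Find x in [low, high] such that func(x) is close to target.

    Assumes func is increasing on the interval (bisection search).
    """
    while high - low > tolerance:
        mid = (low + high) / 2
        if func(mid) < target:
            low = mid    # result too small: try larger inputs
        else:
            high = mid   # result large enough: try smaller inputs
    return (low + high) / 2


# Hypothetical goal: 500,000 after 30 years at 5% interest.
saving = goal_seek(lambda s: future_value(s, 30, 0.05), 500_000, 0, 100_000)
print(round(saving, 2))
```

Each pass through the loop is one "recalculation" of the sheet with a different trial value in the input cell.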
Spreadsheet Applications
• Originally designed for financial records
• Library applications
  – Budget
  – Collection development
  – Shelving capacity
• Educational applications
  – Grade records
  – Equipment inventory
Excel Demo
• Start Excel
  – Microsoft Office folder
• Open N:\SHARE\CLASS\POSTCARD.XLS
  – File menu
  – N: is the volume labeled lbsc690c in Windows
• Enter your 1999 (desired) income in cell B3
  – Tax due is displayed in cell B4
Excel Demo
• Change the tax due
  – Place the cursor over B4
  – Type “=B3*0.x”
    • “=” tells Excel this is a formula
    • “B3” refers to the number in cell B3
    • The “x” in “0.x” should reflect your political views
      – 0.5 would take away half your money
  – Try different values in cell C3
• What kind of spreadsheet use is this?
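Trying different input values amounts to re-running one formula with new inputs. A minimal sketch, assuming the flat-rate formula typed into B4 above; the function name and the sample incomes are invented for illustration.

```python
def tax_due(income, rate):
    """Mirror of the formula =B3*0.x: tax is a fixed fraction of income."""
    return income * rate


rate = 0.5  # the slide's example: half your money
# Each income plays the role of a new value typed into the income cell.
for income in (20_000, 40_000, 60_000):
    print(income, tax_due(income, rate))
```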
Excel Demo
• Add itemized deductions
  – Highlight row 4 (click on 4)
  – Select “Rows” in “Insert” menu twice
  – Label A4 as “Deduction amount”
  – Label A5 as “Taxable income”
  – Put the appropriate formula in B5
  – Change the formula in B6 as needed
    • Note how it was copied from B4 with changes
Excel Demo
• Limit the deduction
  – Maximum of 50% of income or 10,000
• Search for help on “maximum” and “minimum”
• Replace the formula in B5 with a more complicated one
  – You can use another cell to show a partial result
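One possible reading of the deduction limit, sketched in Python. The slide leaves open which of the two limits applies; this sketch assumes the allowed deduction is the claimed amount capped at the smaller of 50% of income and 10,000. The function and parameter names are invented for illustration.

```python
def allowed_deduction(income, claimed, cap=10_000, share=0.5):
    """Limit a claimed deduction.

    Assumption: the limit is the smaller of share * income and cap.
    """
    limit = min(share * income, cap)
    return min(claimed, limit)


def taxable_income(income, claimed):
    """Income minus the allowed part of the claimed deduction."""
    return income - allowed_deduction(income, claimed)


print(taxable_income(30_000, 12_000))  # deduction capped at 10,000
print(taxable_income(8_000, 6_000))    # deduction capped at 50% of income
print(taxable_income(30_000, 2_000))   # deduction below both limits
```

Under the same assumption, and with income in B3 and the deduction amount in B4, a single-cell spreadsheet formula could look like `=B3-MIN(B4, 0.5*B3, 10000)`. Computing the limit in a separate cell first shows the partial result.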
When Style is Important
• Too complex to visualize at once
  – Size
  – Relationships between formulas
• Used by more than one person
  – Includes use in presentations and papers
• Used for a long time
  – Essentially communicating to yourself
Style Guidelines
• Organization
  – Depict the solution approach visually
  – Group things where possible (e.g., parameters)
  – Build in cross-checks to discover input errors
• Readability
  – Describe the computation
  – Meaningful labels help a lot
  – Minimize clutter
Building Complex Applications
• Computers keep track of detail well
  – But people don’t
• Adopt meaningful abstractions
  – Organize a calculation the way you think
• Use a structured process
  – Examples: waterfall and spiral models
Waterfall Model
• Five steps
  – Identify requirements
  – Develop a detailed specification
  – Design the spreadsheet
  – Implement the spreadsheet
  – Test the spreadsheet
• Team project is based on a waterfall model
  – Specification, Test Plan, and User Manual
Spiral Model
• Build a prototype to solve part of the problem
  – Don’t worry about efficiency at this point
• Use what you learn to build another prototype
  – Either more complete or more efficient
• Repeat until the prototype does what you want
Lessons Learned
• Large projects need both models
  – Waterfall model helps identify subtasks
  – The first try is usually not right
• Most common mistake is not starting over
  – It seems easier to keep refining a prototype
  – But that won’t ever fix design-level problems
• Rule of thumb: double every estimate!
Summary
• Retrieval exploits human-machine synergy
  – Machines are fast, but simple
  – Humans are sophisticated, but slow
• Spreadsheets can make calculation easy
  – Easily modified to add new calculations
  – Need to design complex spreadsheets carefully