

From: Mark Silverman [mailto:[email protected]] Sent: Thursday, July 05, 2012 4:35 PM Subject: FYI... text mining

Had a great meeting with a company today that can really help us and is very interested. While there is synergy in where he thinks we can play and where we want to (images, anomaly detection), he mentioned something on text mining called Latent Semantic Indexing (LSI). Per our discussion a few Saturdays ago, it seems like an interesting dimensionality reduction method that in practice can reduce dimensions to a reasonable number (hundreds, say), and an oblique approach could work nicely. Thought I'd pass it along as an FYI.

So you know, we are now getting some good traction just in the past two weeks (meaning multiple meetings with the folks directly involved with Big Data) with General Dynamics, Adobe, these guys today, and one or two other players (I also got a response today from our friends in Ft. Meade regarding that BAA from last month, with questions on our pricing, so I take that as a good sign). Not sure what I'll do if everyone says yes, but that's a good problem :). Things are heating up...

-Mark

HOW LSI WORKS

The Search for Content

Latent semantic indexing looks at patterns of word distribution (specifically, word co-occurrence) across a set of documents. Before we talk about the mathematical underpinnings, we should be a little more precise about what kind of words LSI looks at.

Natural language is full of redundancies, and not every word that appears in a document carries semantic meaning. The most frequently used words in English are words that don't carry content: functional words, conjunctions, prepositions, auxiliary verbs... The first step in doing LSI is culling all those extraneous words from a document, leaving only content words likely to have semantic meaning. There are many ways to define a content word; here is one recipe for generating a list of content words from a document collection:

1. Make a complete list of all the words that appear anywhere in the collection
2. Discard articles, prepositions, and conjunctions
3. Discard common verbs (know, see, do, be)
4. Discard pronouns
5. Discard common adjectives (big, late, high)
6. Discard frilly words (therefore, thus, however, albeit, etc.)
7. Discard any words that appear in every document
8. Discard any words that appear in only one document

This process condenses our documents into sets of content words that we can then use to index our collection (a small sketch of the recipe follows below).
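As a minimal sketch of this recipe in Python, with a tiny illustrative stop list standing in for steps 2 through 6 (the real lists of articles, common verbs, pronouns, and so on would be far longer):

    from collections import defaultdict

    # Hypothetical stand-in for steps 2-6; a real stop list would be much longer.
    STOP_WORDS = {"a", "an", "the", "and", "or", "but", "of", "in", "on", "to",
                  "know", "see", "do", "be", "it", "he", "she", "they",
                  "big", "late", "high", "therefore", "thus", "however", "albeit"}

    def content_words(docs):
        """Return the set of content words for a list of tokenized documents."""
        doc_freq = defaultdict(int)              # step 1: every word in the collection
        for doc in docs:
            for word in set(doc):
                doc_freq[word] += 1
        n_docs = len(docs)
        return {w for w, df in doc_freq.items()
                if w not in STOP_WORDS           # steps 2-6: discard non-content words
                and df < n_docs                  # step 7: appears in every document
                and df > 1}                      # step 8: appears in only one document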

Thinking Inside the Grid

Using our list of content words and documents, we can now generate a term-document matrix. This is a fancy name for a very large grid, with documents listed along the horizontal axis and content words along the vertical axis. For each content word in our list, we go across the appropriate row and put an 'X' in the column for any document where that word appears. If the word does not appear, we leave that column blank.

Doing this for every word and document in our collection gives us a mostly empty grid with a sparse scattering of X-es. This grid displays everything that we know about our document collection. We can list all the content words in any given document by looking for X-es in the appropriate column, or we can find all the documents containing a certain content word by looking across the appropriate row.


Notice that our arrangement is binary - a square in our grid either contains an X, or it doesn't. This big grid is the visual equivalent of a generic keyword search, which looks for exact matches between documents and keywords. If we replace blanks and X-es with zeroes and ones, we get a numerical matrix containing the same information.
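A small sketch of that numerical matrix in Python with NumPy; the tokenization is deliberately naive and the example documents are illustrative only:

    import numpy as np

    def term_document_matrix(docs, vocabulary):
        """Binary term-document matrix: rows are content words, columns are documents."""
        vocab = sorted(vocabulary)
        A = np.zeros((len(vocab), len(docs)), dtype=int)
        for j, doc in enumerate(docs):
            words = set(doc)
            for i, term in enumerate(vocab):
                if term in words:
                    A[i, j] = 1              # the 'X' in the grid becomes a 1
        return A, vocab

    docs = [["bacon", "eggs", "coffee"], ["eggs", "coffee"], ["coffee", "bagel"]]
    A, vocab = term_document_matrix(docs, {"bacon", "eggs", "coffee", "bagel"})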

The key step in LSI is decomposing this matrix using a technique called singular value decomposition. The mathematics of this transformation are beyond the scope of this article (a rigorous treatment is available here), but we can get an intuitive grasp of what SVD does by thinking of the process spatially. An analogy will help.

Breakfast in Hyperspace

Imagine that you are curious about what people typically order for breakfast down at your local diner, and you want to display this information in visual form. You decide to examine all the breakfast orders from a busy weekend day, and record how many times the words bacon, eggs and coffee occur in each order (so an order, for a table, is a document and {bacon, eggs, coffee} is the term dictionary). You can graph the results of your survey by setting up a chart with three orthogonal axes - one for each keyword. The choice of direction is arbitrary - perhaps a bacon axis in the x direction, an eggs axis in the y direction, and the all-important coffee axis in the z direction. To plot a particular breakfast order, you count the occurrence of each keyword, and then take the appropriate number of steps along the axis for that word. When you are finished, you get a cloud of points in three-dimensional space, representing all of that day's breakfast orders.

If you draw a line from the origin of the graph to each of these points, you obtain a set of vectors in 'bacon-eggs-and-coffee' space. The size and direction of each vector tells you how many of the three key items were in any particular order, and the set of all the vectors taken together tells you something about the kind of breakfast people favor on a Saturday morning.

What your graph shows is called a term space. Each breakfast order forms a vector in that space, with its direction and magnitude determined by how many times the three keywords appear in it. Each keyword corresponds to a separate spatial direction, perpendicular to all the others. Because our example uses three keywords, the resulting term space has three dimensions, making it possible for us to visualize it.

It is easy to see that this space could have any number of dimensions, depending on how many keywords we chose to use. If we were to go back through the orders and also record occurrences of sausage, muffin, and bagel, we would end up with a six-dimensional term space, and six-dimensional document vectors.

Applying this procedure to a real document collection, where we note each use of a content word, results in a term space with thousands of dimensions. Each document in our collection is a vector with as many components as there are content words. Although we can't possibly visualize such a space, it is built in exactly the same way as the breakfast space. Documents in such a space that have many words in common will have vectors near to each other, while documents with few shared words will have vectors far apart.

Latent semantic indexing works by projecting this large, multidimensional space down into a smaller number of dimensions. In doing so, keywords that are semantically similar will get squeezed together, and will no longer be completely distinct. This blurring of boundaries is what allows LSI to go beyond straight keyword matching. To understand how it takes place, we can use another analogy.

Imagine you keep tropical fish, and want to submit a picture of your aquarium to Modern Aquaria magazine, for fame and profit. To get the best possible picture, you will want to choose a good angle from which to take the photo. You want to make sure that as many of the fish as possible are visible in your picture, without being hidden by other fish in the foreground. You also won't want the fish all bunched together in a clump, but rather shot from an angle that shows them nicely distributed in the water. Since your tank is transparent on all sides, you can take a variety of pictures from above, below, and from all around the aquarium, and select the best one.

In mathematical terms, you are looking for an optimal mapping of points in 3-space (the fish) onto a plane (the film in your camera). 'Optimal' can mean many things - in this case it means 'aesthetically pleasing'. Here your goal is to preserve the relative distances between the fish as much as possible, so that fish on opposite sides of the tank don't get superimposed in the photograph to look like they are right next to each other. This is exactly what the SVD algorithm tries to do, only in a much higher-dimensional space.

Instead of mapping 3-space to 2-space, however, the SVD algorithm goes to much greater extremes. A typical term space might have tens of thousands of dimensions and be projected down into fewer than 150. Nevertheless, the principle is exactly the same. The SVD algorithm preserves as much information as possible about the relative distances between the document vectors, while collapsing them down into a much smaller set of dimensions. In this collapse, information is lost, and content words are superimposed on one another.

Information loss sounds like a bad thing, but here it is a blessing. What we are losing is noise from our original term-document matrix, revealing similarities that were latent in the document collection. Similar things become more similar, while dissimilar things remain distinct. This reductive mapping is what gives LSI its seemingly intelligent behavior of being able to correlate semantically related terms.

We are really exploiting a property of natural language, namely that words with similar meaning tend to occur together. 
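As a hedged sketch (not the article's own code), here is how that reductive mapping might look using NumPy's SVD on a toy binary term-document matrix; k is kept artificially small so the example stays readable, whereas in practice, as noted above, it would be in the hundreds:

    import numpy as np

    def lsi_project(A, k):
        """Project a term-document matrix A into a k-dimensional latent space via truncated SVD."""
        U, s, Vt = np.linalg.svd(A.astype(float), full_matrices=False)
        doc_vectors = (np.diag(s[:k]) @ Vt[:k, :]).T   # one k-dimensional vector per document
        return doc_vectors

    def cosine(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

    # Toy 5-term x 4-document binary matrix (rows: terms, columns: documents).
    A = np.array([[1, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 1, 0],
                  [0, 0, 1, 1],
                  [0, 0, 0, 1]])

    docs_2d = lsi_project(A, k=2)
    print(cosine(docs_2d[0], docs_2d[1]))   # documents with shared vocabulary score closer to 1

After the projection, documents that never shared an exact keyword can still end up with nearby vectors if their vocabularies tend to co-occur elsewhere in the collection, which is the "seemingly intelligent" behavior described above.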


A pTree organization of text data (slide figure; only the labeled vectors survive extraction). For document 1 over the term list {a, or, on, no, in, it, to}, the slide shows:

term frequency (tf): 0 7 6 0 3 0 0
document frequency (df): 3 2 3 1 2 2 3
term-existence pTree Pte,1: 0 1 1 0 1 0 0
tf bit-slice pTrees Ptf,1,2 / Ptf,1,1 / Ptf,1,0 (one bit position of the tf values per slice)
content-word mask pTree PContentWord: 0 1 0 0 1 1 0 (e.g., cutting words that occur in one or all documents)
term-position pTrees such as Pto,1, plus end-mark pTrees (EndOfSentence, EndOfParagraph, EndOfChapter, EndOfDocument; PEOD,1) recording sentence, paragraph, chapter and document boundaries as bit vectors.
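The figure's vertical layout can be sketched in a few lines of Python (a minimal illustration, assuming nothing beyond the labels recoverable from the slide): term-frequency integers are split into bit-slice pTrees, term existence is derived from them, and selections become bitwise ANDs over the vertical vectors rather than scans over horizontal records.

    def bit_slices(values, n_bits):
        """Vertical bit-slice pTrees: one bit vector per bit position of the integer values."""
        return [[(v >> b) & 1 for v in values] for b in range(n_bits - 1, -1, -1)]

    # Term frequencies for document 1 over terms {a, or, on, no, in, it, to}, from the slide.
    tf = [0, 7, 6, 0, 3, 0, 0]
    Ptf = bit_slices(tf, n_bits=3)          # Ptf[0] = high-order slice, Ptf[2] = low-order slice
    Pte = [int(v > 0) for v in tf]          # term-existence pTree (OR of the slices)

    def AND(p, q):
        return [a & b for a, b in zip(p, q)]

    # Example vertical query: terms that exist in document 1 AND pass the content-word mask.
    PContentWord = [0, 1, 0, 0, 1, 1, 0]
    hits = AND(Pte, PContentWord)
    count = sum(hits)                       # root count of the AND, the basic pTree operation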


System description for the NSF Big Data proposal:

Analytics and Tools for the Analysis and Management of Vertically Structured Data

The proposed project will focus on the development of analytical methods and tools for analyzing and managing data in vertically structured formats. In addition to new analytical methods capable of dealing with massively large data sets, whether in number or in depth, the goal will be to demonstrate the utility of a data provenance architecture capable of securely managing, and providing privacy at rest to, data repositories with no practical size limits.

The analytic methods to be developed will be built on a system described as FAUST (Fast, Accurate Unsupervised and Supervised machine Teaching). This method uses an extremely fast vertical machine learning algorithm for analysis of a wide variety of data types.

The vertical P-Tree data structure serves as the basic unit for implementation of the FAUST series of algorithms. These data structures will provide insight into the computational, storage, and I/O access-rate efficiencies that can be extracted from a vertical representation of data.

The data provenance and management architecture is based on an identity-centric computing model.

In this model, entities involved in the information management cycle are provided with a unique and anonymous organizational identity. In this architecture, subordinate identities are derived from organizationally superior identities. The work being proposed will extend the notion of organizationally specific identities to data objects (vertically structured, in this case) that are placed under organizational provenance guidelines. In addition to providing privacy at rest and a highly granular access control system, the utility of this strategy for providing rigorous pseudonymization and privacy-protected localization of dispersed data will be evaluated.


Provenance (from Wikipedia, the free encyclopedia; for other uses, see Provenance (disambiguation))

Provenance, from the French provenir, "to come from", refers to the chronology of the ownership or location of a historical object.[1] The term was originally mostly used for works of art, but is now used in similar senses in a wide range of fields, including science and computing. Typical uses may cover any artifact found in archaeology, any object in paleontology, certain documents (manuscripts), and copies of printed books. In most fields, the primary purpose of provenance is to confirm or gather evidence as to the time, place, and, when appropriate, the person responsible for the creation, production, or discovery of the object. This will typically be accomplished by tracing the whole history of the object up to the present. Comparative techniques, expert opinions, and the results of scientific tests may also be used to these ends, but establishing provenance is essentially a matter of documentation.

In archaeology, the term provenience is used somewhat similarly to provenance. Archaeological researchers use provenience to refer to the three-dimensional location of an artifact or feature within an archaeological site,[2] as opposed to provenance, which includes an object's complete documented history.

Works of art and antiques: The provenance of works of fine art, antiques and antiquities is of great importance, especially to their owner. There are a number of reasons why painting provenance is important, most of which also apply to other types of fine art. A good provenance increases the value of a painting, and establishing provenance may help confirm the date, artist and, especially for portraits, the subject of a painting.

Wines: In transactions of old wine with the potential of improving with age, the issue of provenance has a large bearing on the assessment of the contents of a bottle, both in terms of quality and the risk of wine fraud. A documented history of wine cellar conditions is valuable in estimating the quality of an older vintage due to the fragile nature of wine.[13]

Archives: Provenance is a fundamental principle of archives, referring to the individual, group, or organization that created or received the items in a collection. According to archival theory and the principle of provenance, records of different provenance should be separated. In archival practice, proof of provenance is provided by the operation of control systems that document the history of records kept in archives, including details of amendments made to them. It was developed in the nineteenth century by both French and Prussian archivists. Provenance is the title of the journal published by the Society of Georgia Archivists.

Books: In books, the study of provenance refers to the study of the ownership of individual copies of books. It is usually extended to the study of the circumstances in which individual copies of books have changed ownership, and of the evidence left in books of how readers interacted with them.

Provenance studies may shed light on the books themselves, providing evidence of the role particular titles have played in social, intellectual and literary history. Such studies may also add to our knowledge of particular owners of books. For instance, looking at the books owned by a writer may help to show which works influenced him or her. Many provenance studies are historically focused, and concentrated on books owned by writers, politicians and public figures. The recent ownership of books is studied, however, as is evidence of how ordinary or anonymous readers have interacted with books.[16][17] Provenance can be studied both by examining the books themselves (for instance looking at inscriptions, marginalia, bookplates, book rhymes, and bindings) and by reference to external sources of information such as auction catalogues.[14]


Provenance (from Wikipedia, the free encyclopedia), continued

Archaeology: Evidence of provenance can be of importance in archaeology. Fakes are not unknown, and finds are sometimes removed from the context in which they were found without documentation, reducing their value to the world of learning. Even when apparently discovered in situ, archaeological finds are treated with caution. The provenance of a find may not be properly represented by the context in which it was found. Artifacts can be moved far from their place of origin by mechanisms that include looting, collecting, theft or trade, and further research is often required to establish the true provenance of a find.

Paleontology: In paleontology it is recognised that fossils can also move from their primary context and are sometimes found, apparently in-situ, in deposits to which they do not belong, moved by, for example, the erosion of nearby but different outcrops. Most museums make strenuous efforts to record how the works in their collections were acquired and these records are often of use in helping to establish provenance.

Seed provenance: Seed provenance refers to the specified area in which plants that produced seed are located or were derived.

Data provenance: Scientific research is held to be of good provenance when it is documented in detail sufficient to allow reproducibility.[23] Scientific workflows assist scientists and programmers with tracking their data through all transformations, analyses, and interpretations. Data sets are reliable when the process used to create them are reproducible and analyzable for defects.[24] Current initiatives to effectively manage, share, and reuse ecological data are indicative of the increasing importance of data provenance. Examples of these initiatives are National Science Foundation Datanet projects, DataONE and Data Conservancy.

Computers and law: The term provenance is used when ascertaining the source of goods such as computer hardware to assess if they are genuine or counterfeit. Chain of custody is an equivalent term used in law, especially for evidence in criminal or commercial cases.

Data provenance covers the provenance of computerized data. There are two main aspects of data provenance: ownership of the data and data usage. Ownership tells the user who is responsible for the source of the data, ideally including information on the originator of the data. Data usage gives details regarding how the data has been used and modified, and often includes information on how to cite the data source or sources. Data provenance is of particular concern with electronic data, as data sets are often modified and copied without proper citation or acknowledgement of the originating data set. Databases make it easy to select specific information from data sets and merge this data with other data sources without any documentation of how the data was obtained or how it was modified from the original data set or sets.

Secure Provenance refers to providing integrity and confidentiality guarantees to provenance information. In other words, secure provenance means to ensure that history cannot be rewritten, and users can specify who else can look into their actions on the object. [25]

See also: Dating methodology (archaeology), Post excavation, Arnolfini Portrait (a fairly full example of the provenance of a painting), Annunciation (van Eyck, Washington) (another example), Records Management, Traceability.

External links: EU Provenance Project (a technology project that sought to support the electronic certification of data provenance), DataONE, Data Conservancy.


APPENDIX: HADOOP MapReduce

Bad news: lots of programming work - communication and coordination; recovery from machine failure; status reporting; debugging; optimization; locality. Bad news II: repeat for every problem you want to solve. How can we make it easy to write distributed programs?

Data flow in MapReduce: Read a lot of data. Map: extract something you care about from each record. Partition the output - decide which keys go to which reducer. Shuffle and sort - each reducer expects its keys sorted, and for each key, a list of all its values. Reduce: aggregate, summarize, filter, or transform. Write the results. Map selects; Reduce does grouping and summing.

Example: Word histogram

Map(String input_key, String input_value):
  /* input_key = doc_name, input_value = doc_contents */
  for each word w in input_value:
    EmitIntermediate(w, "1");

Reduce(String key, Iterator intermediate_values):
  /* key: a word, same for input and output; intermediate_values: a list of counts */
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));
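For concreteness, here is a runnable simulation of that data flow in plain Python - map, shuffle/sort by key, then reduce - mirroring the pseudocode above; the framework plumbing (partitioning, fault tolerance, distribution) is assumed away:

    from collections import defaultdict

    def map_fn(doc_name, doc_contents):
        """Map: emit (word, 1) for each word in the document."""
        for word in doc_contents.split():
            yield word, 1

    def reduce_fn(word, counts):
        """Reduce: sum all counts emitted for one word."""
        yield word, sum(counts)

    def map_reduce(inputs, map_fn, reduce_fn):
        shuffled = defaultdict(list)                  # shuffle: group values by key
        for key, value in inputs:
            for k, v in map_fn(key, value):
                shuffled[k].append(v)
        results = []
        for k in sorted(shuffled):                    # each reducer sees its keys sorted
            results.extend(reduce_fn(k, shuffled[k]))
        return results

    docs = [("doc1", "to be or not to be"), ("doc2", "to know is to see")]
    print(map_reduce(docs, map_fn, reduce_fn))        # e.g. [('be', 2), ('is', 1), ...]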

HADOOP MapReduce Example: Inverted Web Graph

For each page, generate a list of incoming links. Input: web documents. Map: for each link L in document D, emit <href(L), D>. Reduce: combine all source documents into one list per target. MapReduce can do Select-From-Where, but it can't join directly. Example: joining with other data, e.g., for each major city in our GEO database, create a list of pages that refer to it and where. We need to go over all web documents. Per-host information might be kept in a per-process data structure, or involve fCPC to a list of machines containing data for all? Map: go over the document; use a heuristic to decide whether the document talks about a place/city; for each city name referred to in the document, write the doc_id and its offset. Reduce: concatenate into a list of top-rated references for each city.
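Under the same simulated data flow, a sketch of the inverted-web-graph example; the link lists here are pre-parsed stand-ins for real href extraction, and the page names are illustrative only:

    from collections import defaultdict

    # Pre-parsed outgoing-link lists stand in for real HTML documents.
    web_docs = {"pageA": ["pageB", "pageC"],
                "pageB": ["pageC"],
                "pageC": ["pageA"]}

    def map_links(doc_id, outgoing_links):
        """Map: for each link L in document D, emit <href(L), D>."""
        for target in outgoing_links:
            yield target, doc_id

    def reduce_links(target, sources):
        """Reduce: combine all source documents into one incoming-link list."""
        yield target, sorted(sources)

    shuffled = defaultdict(list)
    for doc_id, links in web_docs.items():
        for k, v in map_links(doc_id, links):
            shuffled[k].append(v)

    incoming = dict(r for k in shuffled for r in reduce_links(k, shuffled[k]))
    print(incoming)   # {'pageB': ['pageA'], 'pageC': ['pageA', 'pageB'], 'pageA': ['pageC']}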


The NameNode tracks blocks and DataNodes.

The NameNode is a single point of failure, but won't be for long. It sits in one location (generally), not distributed. The key is its value (think keys and values): <key, value> pairs such as <byte offset, some text>, <user id, user profile>, <timestamp, access log entry>, <user, list of user's friends>. Everything is keys and values (a byte offset, not a file). To write a MapReduce program: write a mapper that takes a key and value and emits zero or more new keys and values; write a reducer that takes all the values of one key and emits zero or more new keys and values. E.g., "Hello world" is to regular programming what word count is to MapReduce programming.