Reframing Public Housing: Visualization and Data Analytics in History

Post on 22-Jan-2018


OSUL and Digital Humanities

Dealing with Data Problems

◦ While the Library licenses the content via a content provider, access to the underlying data for aggregated research is and isn’t supported.

◦ In this case, access to content is limited both by our subscriptions and by the newspaper publishers themselves.

◦ For this project, many of the sources David and Patrick were interested in working with required licensing fees of roughly $25,000-50,000 per newspaper.

Big “little” data

We worry a lot about big research data in the library, and about how this information will be preserved and made accessible into the future.

◦ But equally concerning is big “little” data.

Big “little” data has very specific problems:

1. Acquisition of the data can be really difficult

2. Storage tends to be inefficient and difficult

3. It’s incredibly hard to move around

4. For purposes of aggregation, it limits the types of tools that can be used for evaluation

5. When the data is closed, finding undocumented inconsistencies is hard

Sample Data Set

Newspaper Processing Tool

Data processing methodology

Created two data sets:

1. The first data set covered any digital object (excluding classifieds) that included references to public housing

2. The second data set covered any digital object (excluding classifieds) that included public housing and 4 agreed-upon synonyms for public housing
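The two selections above can be sketched roughly as follows. This is a minimal Python illustration, not the tooling used in the project: the `DigitalObject` record, its `kind` labels, and the synonym list are all hypothetical (the four agreed-upon synonyms are not listed in these slides), and the second set is assumed to match "public housing" or any one of the synonyms.

```python
import re
from dataclasses import dataclass


@dataclass
class DigitalObject:
    title: str
    kind: str  # hypothetical labels, e.g. "article", "editorial", "classified"
    text: str


# Placeholders only: the slides do not name the four agreed-upon synonyms.
PLACEHOLDER_SYNONYMS = ["synonym one", "synonym two", "synonym three", "synonym four"]


def phrase_pattern(phrase):
    """Case-insensitive whole-word pattern that tolerates line breaks between words."""
    words = map(re.escape, phrase.split())
    return re.compile(r"\b" + r"\s+".join(words) + r"\b", re.IGNORECASE)


def mentions(text, phrases):
    """True if any of the phrases occurs in the text."""
    return any(phrase_pattern(p).search(text) for p in phrases)


def build_data_sets(objects, synonyms=PLACEHOLDER_SYNONYMS):
    """Return (base, extended): objects mentioning the base phrase, and
    objects mentioning the base phrase or any synonym. Classifieds are excluded."""
    candidates = [o for o in objects if o.kind != "classified"]
    base = [o for o in candidates if mentions(o.text, ["public housing"])]
    extended = [o for o in candidates if mentions(o.text, ["public housing", *synonyms])]
    return base, extended
```

Matching on `\s+` between words is a small concession to OCR output, where a phrase is often split across a line break.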

One benefit of the resources we used was that there was very little article duplication across them (i.e., very little reliance on the Associated Press), so little data filtering was needed to account for duplicate data across newspapers.

Data processing methodology

From these sets, I wrote a suite of tools in C# that measured:

1) Presence of positive terms

2) Presence of negative terms

3) Neutral terms

4) Frequency of negative and positive terms

5) Proximity to positive and negative terms, used to provide weight

These tools used stemming so that they could capture the different forms of a word.
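The measurements listed above can be illustrated with a short sketch. It is written in Python rather than C# and is not the original tool suite. The term lists are hypothetical, `crude_stem` is a rough stand-in for a real stemmer such as Porter's, "neutral" is assumed to mean any token that is neither positive nor negative, and the weighting (one over the distance to the nearest "housing" token, within a window) is only one plausible reading of "proximity to provide weight".

```python
import re

# Hypothetical vocabularies; the actual term lists are not given in the slides.
POSITIVE = {"improve", "safe", "clean"}
NEGATIVE = {"crime", "decay", "danger"}


def crude_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


POS_STEMS = {crude_stem(w) for w in POSITIVE}
NEG_STEMS = {crude_stem(w) for w in NEGATIVE}
ANCHOR = crude_stem("housing")  # assumed anchor term for proximity weighting


def score(text, window=10):
    """Count positive, negative, and neutral tokens, plus a proximity-weighted sum."""
    tokens = [crude_stem(t) for t in re.findall(r"[a-z]+", text.lower())]
    anchors = [i for i, t in enumerate(tokens) if t == ANCHOR]
    result = {"positive": 0, "negative": 0, "neutral": 0, "weighted": 0.0}
    for i, tok in enumerate(tokens):
        if tok in POS_STEMS:
            key, sign = "positive", 1
        elif tok in NEG_STEMS:
            key, sign = "negative", -1
        else:
            result["neutral"] += 1
            continue
        result[key] += 1
        # Distance is at least 1 because the anchor is not a sentiment term.
        distance = min((abs(i - a) for a in anchors), default=None)
        if distance is not None and distance <= window:
            result["weighted"] += sign / distance
    return result
```

Because both the term lists and the text are stemmed, "improved", "improves", and "improving" all count as the same positive term.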

One thing that this work highlighted, however, was the limitations imposed by data quality. These resources are OCR’ed representations of a particular newspaper article, classified, etc., and OCR data quality varies significantly across the titles. A secondary research project that I’ve begun uses these data sets to test the OCR quality of the set by utilizing word frequency to map unique words across a digital object.
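One way such a word-frequency check might look is sketched below in Python. It is an assumption about the approach, not the project's method: it profiles how many distinct words a digital object contains and how many of them occur only once, on the theory that OCR errors tend to create many one-off "words". The function name and the two ratios are illustrative choices.

```python
import re
from collections import Counter


def unique_word_profile(text):
    """Summarize word frequency for one digital object.

    Digits split tokens, so an OCR error like "hous1ng" becomes two fragments.
    """
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    singletons = sum(1 for c in counts.values() if c == 1)
    return {
        "tokens": total,
        "unique": len(counts),
        # Share of tokens that are distinct words.
        "unique_ratio": len(counts) / total if total else 0.0,
        # Share of distinct words that occur exactly once.
        "singleton_ratio": singletons / len(counts) if counts else 0.0,
    }
```

Short objects naturally have high ratios, so any comparison across titles would need to control for length before reading the numbers as a signal of OCR quality.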

[Chart: Just Public Housing: Cleveland. Cleveland Call Post; series: More Positive, More Negative; x-axis: decades 1930 to 2000; y-axis ticks 0 to 45.]

[Chart: Just Public Housing: Cleveland. Article Content: Positive Over Negative; x-axis: decades 1930 to 2000; y-axis ticks -15 to 25.]

[Chart: Extended Terms: Cleveland. Cleveland Call Post; series: More Positive, More Negative; x-axis: decades 1930 to 2000; y-axis ticks 0 to 50.]

[Chart: Extended Terms: Cleveland. Article Content: Positive Over Negative; x-axis: decades 1930 to 2000; y-axis ticks -15 to 25.]

[Charts: Public Housing vs Extended Terms. Two panels, each titled Article Content: Positive Over Negative; x-axis: decades 1930 to 2000; y-axis ticks -15 to 25.]

[Charts: Public Housing vs Extended Terms: NY. Two panels, each titled Article Content: Positive Over Negative; x-axis: decades 1910 to 2000; y-axis ticks -10 to 35 and -15 to 30.]

Data processing methodology

Potential additional areas of inquiry:

• Representation of public housing in:

• Letters to the editor

• Editorials

• Featured Articles
