big data brighton | big data in academia | jan 2013

January 2013 at

University of Brighton

http://meetup.com/Big-Data-Brighton



Agenda

• Miltos Petridis, Professor of Computer Science, University of Brighton

• Dr Patricia Roberts, Senior Lecturer & Researcher in database design, development and management, University of Brighton - Structured vs Unstructured Data: why structure matters.

• Simon Wibberley, PhD student in computational linguistics at the Text Analytics Group at the University of Sussex. Real-time text stream analysis, event detection, and entity recognition. Event detection on Twitter.

• Kevin Long, Teradata - Summary and Business context

Big Data

“A new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling high-speed capture, discovery and/or analysis” [1]

New investment initiatives are coming, such as in the US in 2012:

“more than $200 million in new funding through six agencies and departments to improve the nation’s ability to extract knowledge and insights from large and complex collections of digital data” [2]

Knowledge and insights... hmm

Before companies rush to use the technologies, they should be asking some questions:

• Can we make any assumptions about the quality of the data we are using?

• Is there a significant difference between structured and unstructured data?

• Can the underlying structure of the data affect what you can do with it?

In this brief talk, I will be examining these questions with reference to my research and recent trends.

Can we make any assumptions about the quality of the data we are using?

• One of the problems with the recent explosion in the amount of data is that some data (particularly data collected from social networking sites) is of dubious quality

– A straw poll of my students found that 1 in 5 deliberately enter incorrect data about themselves online to protect their identity

• We might not have any assurance that the data is true or that it is correctly linked to metadata

– Is data typed?

– Is the data related to other data? How is it related?

– Are relationships between data and its meaning being lost?
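
The typing and relationship questions above can be checked mechanically, at least in part. The following is a minimal sketch in Python, assuming two hypothetical record sets (people and posts) invented for illustration; the field names are assumptions, not taken from the talk. Note that checks like these say nothing about whether a value is true, which is the harder problem raised by the straw poll.

```python
def check_records(people, posts):
    """Return a list of human-readable problems found in the records."""
    problems = []
    known_ids = {person["id"] for person in people}

    # "Is data typed?": here, age is expected to be an integer.
    for person in people:
        if not isinstance(person.get("age"), int):
            problems.append(f"person {person['id']}: age is not an integer")

    # "Is the data related to other data?": every post needs a known author.
    for post in posts:
        if post["author_id"] not in known_ids:
            problems.append(
                f"post {post['id']}: unknown author {post['author_id']}"
            )
    return problems


# Hypothetical sample data, invented for this sketch.
people = [{"id": 1, "age": 21}, {"id": 2, "age": "twenty"}]
posts = [{"id": 10, "author_id": 1}, {"id": 11, "author_id": 3}]

problems = check_records(people, posts)
for problem in problems:
    print(problem)
```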

A view of different data models [3]

Is there a significant difference between structured and unstructured data?

• How is data structured?

• Does the underlying data model matter?

• What are the options for a data model?

• Over the years many models of data have evolved and most are still in use

• The data models used give insights into assumptions about the semantics of the data

Finding meaning from ‘flat’ data

• A problem with ‘flat’ or unstructured data representations is that they have traditionally been difficult to aggregate and present to users in a way that they can understand

• In contrast, structured data can be summarised easily and its structure represents the meaning of data within an organization

• Data analytics are changing this by presenting accessible information from ‘flat’ data
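
A minimal sketch of the contrast, in Python, using made-up student marks (all names and values are assumptions for illustration). The structured records can be grouped and averaged in a few lines because the field names carry the meaning; the same facts written as free text would first need a parsing rule for every phrasing.

```python
from collections import defaultdict

# Hypothetical structured records: each field has a name and a type.
structured = [
    {"student": "A", "course": "Databases", "mark": 62},
    {"student": "B", "course": "Databases", "mark": 71},
    {"student": "C", "course": "Networks", "mark": 55},
]

# The same facts as 'flat' free text, with no declared structure.
# Summarising these would need a separate parsing rule per phrasing.
flat = [
    "A got 62 in Databases",
    "B scored 71 for Databases",
    "Networks: C, 55",
]


def mean_mark_by_course(records):
    """Summarise structured records: group by course, average the marks."""
    marks_by_course = defaultdict(list)
    for record in records:
        marks_by_course[record["course"]].append(record["mark"])
    return {
        course: sum(marks) / len(marks)
        for course, marks in marks_by_course.items()
    }


summary = mean_mark_by_course(structured)
print(summary)
```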

Can the underlying structure of the data affect what you can do with it?

• The short answer from my research is ‘YES’

• How it affects what you can do with the data is the long answer

– It is really easy to store a piece of data, but retrieving it (intact with its meaning and its relationships to other data) is more difficult

– When ‘Big Data’ technologies are used to extract knowledge and insights from the data, we should be sure that the technology is not introducing new problems

Impedance mismatch problems

• Moving data from one paradigm to another often causes the meaning to be lost

• Can cause problems for developers who move data from one paradigm to another

• Also a problem for end users who may lose the connections
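
One way to picture the lost connections is a nested record moved into flat rows. This is a sketch under stated assumptions: the order record and both helper functions are hypothetical, written only to show how a naive conversion drops the link between a line and its parent, while a conversion that carries a key keeps it.

```python
# Hypothetical object-style record: an order that owns its lines.
order = {
    "order_id": 1,
    "customer": "Ann",
    "lines": [
        {"product": "pen", "qty": 2},
        {"product": "ink", "qty": 1},
    ],
}


def to_flat_rows(obj):
    """Naive move to a flat paradigm: one row per line, parent link dropped."""
    return [dict(line) for line in obj["lines"]]


def to_linked_rows(obj):
    """Same move, but each row keeps a key pointing back to its order."""
    return [{"order_id": obj["order_id"], **line} for line in obj["lines"]]


flat_rows = to_flat_rows(order)
linked_rows = to_linked_rows(order)

# Only the linked rows can still answer "which order was this line part of?"
can_recover_flat = all("order_id" in row for row in flat_rows)
can_recover_linked = all("order_id" in row for row in linked_rows)
print(can_recover_flat, can_recover_linked)
```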

A way forward

• Working out goals in your data management

• Understanding the structure of the data you are using, wherever it comes from

• Getting assurance about the quality of the data

• Then having confidence that the knowledge and insights are based on firm foundations

Thank you

Any questions?

References

1. Carter, P. (2011). Big Data Analytics: Future Architectures, Skills and Roadmaps for the CIO. SAS white paper, IDC Go-to-Market Services.

2. Gianchandani, E. (2012). Obama administration unveils $200m big data R&D initiative. The Computing Community Consortium (CCC) Blog.

3. Angles, R. and Gutierrez, C. (2008). Survey of graph database models. ACM Comput. Surv. 40, 1, Article 1 (February 2008).

Event Detection on Twitter

Simon Wibberley

Text Analytics Group

University of Sussex

[email protected]

What are Events? We just don’t know.

Event Categories

[Figure: 2x2 grid of event categories. Columns: Constrained, Unconstrained. Rows: Well Reported, Poorly Reported. Cell labels: Relatively Easy, Interesting, Interesting, Very Tricky.]

Algorithms

• Query Driven

– Volume / rate analysis of matching data

– Addresses constrained event type

• Data Driven

– Mine stream for interesting data

– Addresses unconstrained event type
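
As an illustration of the query-driven approach, the sketch below counts messages matching a query in fixed time windows and flags a window when its count rises well above the recent baseline. It is not the speaker's algorithm: the mean-plus-k-standard-deviations threshold, the function names, the parameters (`window`, `history`, `k`) and the sample messages are all assumptions chosen to keep the example small.

```python
from collections import Counter
from statistics import mean, pstdev


def matching_counts(messages, query, window=60):
    """Count messages containing `query` in each fixed time window."""
    counts = Counter()
    for timestamp, text in messages:
        if query.lower() in text.lower():
            counts[timestamp // window] += 1
    if not counts:
        return []
    first, last = min(counts), max(counts)
    # Include empty windows so quiet periods count towards the baseline.
    return [counts.get(w, 0) for w in range(first, last + 1)]


def burst_windows(counts, history=3, k=3.0):
    """Flag windows whose count exceeds mean + k * std of the previous ones."""
    bursts = []
    for i in range(history, len(counts)):
        past = counts[i - history:i]
        threshold = mean(past) + k * pstdev(past)
        if counts[i] > threshold:
            bursts.append(i)
    return bursts


# Hypothetical stream of (seconds since start, text) pairs.
messages = [
    (5, "Gold for GB"),
    (70, "gold!"),
    (80, "GOLD medal"),
    (95, "lunch"),
    (130, "going for gold"),
    (190, "gold standard"),
]
messages += [(240 + i, "GB dressage gold") for i in range(8)]

counts = matching_counts(messages, "gold")
bursts = burst_windows(counts)
print(counts, bursts)
```

A fixed query keeps the problem tractable, which is why this style suits constrained events; it cannot find events nobody thought to query for.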

GB Dressage Gold

London Riots

Event Characterisation

• Fill in unknowns

• Self explanatory for (very) constrained events

• Select representative / well formed Tweet[s]

• Term relevance / clustering

• Topic analysis

• Geo-location / Entity extraction
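
Selecting a representative Tweet can be approximated with simple term relevance. The sketch below is one possible heuristic, not the method used in the talk: it scores each Tweet by how widely its distinct terms are shared across the set and returns the best-scoring one. The tokeniser, the scoring rule and the sample Tweets are all assumptions.

```python
from collections import Counter
import re


def tokens(text):
    """Lower-case word tokens, keeping hashtags and mentions."""
    return re.findall(r"[a-z0-9#@']+", text.lower())


def representative(tweets):
    """Pick the tweet whose distinct terms are most shared across the set."""
    doc_freq = Counter()
    for tweet in tweets:
        doc_freq.update(set(tokens(tweet)))

    def score(tweet):
        terms = set(tokens(tweet))
        if not terms:
            return 0.0
        # Average number of tweets in which each of this tweet's terms occurs.
        return sum(doc_freq[term] for term in terms) / len(terms)

    return max(tweets, key=score)


# Hypothetical Tweets about a single event.
tweets = [
    "GB win dressage gold",
    "GB dressage gold",
    "omg what a day",
    "gold!!! yes",
]

best = representative(tweets)
print(best)
```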

CASM

• Centre for the Analysis of Social Media

• Collaboration between DEMOS and TAG

• Applying text analytics to social media to answer sociological questions

• OSI funded EU sentiment analysis pilot project

http://www.demos.co.uk/projects/casm/

Ethics

[Figure: ethics grid. Axes: Narrow to Broad; Anonymous to Identity Preserving. Labels placed on the grid: Stasi, Judiciary, Me!, Social Science.]

Reffin, J (2012)


Technology