big data brighton | big data in academia | jan 2013

Post on 05-Jul-2015

Category:

Technology

DESCRIPTION

Four talks about Big Data in Academia at Big Data Brighton Jan 2013. Two of the talks' slides are here. I'll upload Miltos' slides when I receive them. Dr Patricia Roberts, Senior Lecturer & Researcher in database design, development and management, University of Brighton - Structured vs Unstructured Data: why structure matters. Simon Wibberley, PhD student in computational linguistics at the Text Analytics Group at the University of Sussex. Real-time text stream analysis, event detection, and entity recognition. Event detection on Twitter.

TRANSCRIPT

January 2013 at

University of Brighton

http://meetup.com/Big-Data-Brighton

Agenda

• Miltos Petridis, Professor of Computer Science, University of Brighton

• Dr Patricia Roberts, Senior Lecturer & Researcher in database design, development and management, University of Brighton - Structured vs Unstructured Data: why structure matters.

• Simon Wibberley, PhD student in computational linguistics at the Text Analytics Group at the University of Sussex. Real-time text stream analysis, event detection, and entity recognition. Event detection on Twitter.

• Kevin Long, Teradata - Summary and Business context

Big Data

“A new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling high-speed capture, discovery and/or analysis” [1]

New investment initiatives are coming, such as in the US in 2012:

“more than $200 million in new funding through six agencies and departments to improve the nation’s ability to extract knowledge and insights from large and complex collections of digital data” [2]

Knowledge and insights... hmm

Before companies rush to use the technologies they should be asking some questions:

• Can we make any assumptions about the quality of the data we are using?

• Is there a significant difference between structured and unstructured data?

• Can the underlying structure of the data affect what you can do with it?

In this brief talk, I will be examining these questions with reference to my research and recent trends

Can we make any assumptions about the quality of the data we are using?

• One of the problems with the recent explosion in the amount of data is that some data (particularly data collected from social networking sites) is of dubious quality

– A straw poll of my students found that 1 in 5 deliberately enter incorrect data about themselves online to protect their identity

• We might not have any assurance that the data is true or that it is correctly linked to metadata

– Is data typed?

– Is the data related to other data? How is it related?

– Are relationships between data and its meaning being lost?

A view of different data models [3]

Is there a significant difference between structured and unstructured data?

• How is data structured?

• Does the underlying data model matter?

• What are the options for a data model?

• Over the years many models of data have evolved and most are still in use

• Data models used give insights into assumptions about the semantics of the data

Finding meaning from ‘flat’ data

• A problem with ‘flat’ or unstructured data representations is that they have traditionally been difficult to aggregate and present to users in a way that they can understand

• In contrast, structured data can be summarised easily and its structure represents the meaning of data within an organization

• Data analytics are changing this by presenting accessible information from ‘flat’ data

Can the underlying structure of the data affect what you can do with it?

• The short answer from my research is ‘YES’

• How it affects what you can do with the data is the long answer

– It is really easy to store a piece of data but retrieving it (intact with its meaning and its relationships to other data) is more difficult

– When ‘Big Data’ technologies are used to extract knowledge and insights from the data we should be sure that the technology is not introducing new problems

Impedance mismatch problems

• Moving data from one paradigm to another often causes the meaning to be lost

• Can cause problems for developers who move data from one paradigm to another

• Also a problem for end users who may lose the connections
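
The loss of meaning described on this slide can be illustrated with a minimal Python sketch. It is not taken from the talk: the student record, the module codes and both helper functions are hypothetical, chosen only to show types and relationships disappearing when nested data is pushed through a ‘flat’ representation and read back.

```python
# Hypothetical structured record: typed fields plus a one-to-many
# relationship between a student and their module marks.
student = {
    "id": 42,
    "name": "A. Student",
    "modules": [
        {"code": "CI101", "mark": 67},
        {"code": "CI102", "mark": 58},
    ],
}


def to_flat_row(record):
    """Flatten the record: every value becomes a string, and the list of
    modules collapses to a delimited string of module codes."""
    return {
        "id": str(record["id"]),
        "name": record["name"],
        "modules": ";".join(m["code"] for m in record["modules"]),
    }


def from_flat_row(row):
    """Best-effort reconstruction from the flat row."""
    return {
        "id": row["id"],  # the integer type is gone: this is still a string
        "name": row["name"],
        # the marks were never stored, so only the codes come back
        "modules": [{"code": code} for code in row["modules"].split(";")],
    }


restored = from_flat_row(to_flat_row(student))
print(restored == student)  # False: type and relationship detail were lost
```

Storing the flat row was easy; what cannot be recovered is the integer type of `id` and the link between each module and its mark.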

A way forward

• Working out goals in your data management

• Understanding the structure of the data you are using, wherever it comes from

• Getting assurance about the quality of the data

• Then having confidence that the knowledge and insights are based on firm foundations

Thank you

Any questions?

References

1. Carter, P. (2011). Big Data Analytics: Future Architectures, Skills and Roadmaps for the CIO. SAS White Paper, IDC Go-to-Market Services.

2. Gianchandani, E. (2012). Obama administration unveils $200m big data R&D initiative. The Computing Community Consortium (CCC) Blog.

3. Angles, R. and Gutierrez, C. (2008). Survey of graph database models. ACM Computing Surveys 40(1), Article 1 (February 2008).

Event Detection on Twitter

Simon Wibberley

Text Analytics Group

University of Sussex

simon.wibberley@sussex.ac.uk

What are Events? We just don’t know.

Event Categories

Axes: Constrained vs Unconstrained; Well Reported vs Poorly Reported

Labels: Interesting; Relatively Easy; Interesting; Very Tricky

Algorithms

• Query Driven

– Volume / rate analysis of matching data

– Addresses constrained event type

• Data Driven

– Mine stream for interesting data

– Addresses unconstrained event type
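
As a rough illustration of the query-driven approach, the sketch below counts query-matching tweets per time bucket and flags buckets whose volume jumps well above the recent average. It is an assumption-laden sketch, not the method used in the talk: the function names, the 60-second buckets, the five-bucket history and the threshold of 3 are all hypothetical choices.

```python
from collections import Counter
from statistics import mean, pstdev


def matching_timestamps(tweets, query):
    """Keep the timestamps (in seconds) of tweets whose text contains the query."""
    return [ts for ts, text in tweets if query.lower() in text.lower()]


def bucket_counts(timestamps, bucket_seconds=60):
    """Count tweets per fixed-width time bucket, including empty buckets."""
    counts = Counter(int(ts // bucket_seconds) for ts in timestamps)
    if not counts:
        return []
    return [counts[b] for b in range(min(counts), max(counts) + 1)]


def detect_bursts(counts, history=5, threshold=3.0):
    """Return indices of buckets whose count is far above the preceding ones."""
    bursts = []
    for i in range(history, len(counts)):
        window = counts[i - history:i]
        mu = mean(window)
        # Floor the spread at 1 so a perfectly flat history cannot divide by zero.
        sigma = max(pstdev(window), 1.0)
        if (counts[i] - mu) / sigma > threshold:
            bursts.append(i)
    return bursts


# Invented data: two matching tweets a minute, then twenty in one minute.
tweets = [(m * 60 + s, "dressage update") for m in range(6) for s in (5, 35)]
tweets += [(6 * 60 + s, "GB dressage gold!") for s in range(20)]

counts = bucket_counts(matching_timestamps(tweets, "dressage"))
print(counts)                 # [2, 2, 2, 2, 2, 2, 20]
print(detect_bursts(counts))  # [6]
```

This only works for the constrained case, because the query has to be known in advance; the data-driven approach has no such starting point.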

GB Dressage Gold

London Riots

London Riots

Event Characterisation

• Fill in unknowns

• Self explanatory for (very) constrained events

• Select representative / well formed Tweet[s]

• Term relevance / clustering

• Topic analysis

• Geo-location / Entity extraction
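
One simple way to pick a representative tweet from a detected burst is to score each tweet by how common its terms are across the burst. The sketch below is a hypothetical heuristic, not the speaker's method: the stopword list, the tokeniser, the scoring rule and the example tweets are all invented for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "in", "of", "to", "is", "at", "and", "for", "on"}


def tokens(text):
    """Lowercase the text and keep word-like terms that are not stopwords."""
    words = re.findall(r"[a-z0-9#@']+", text.lower())
    return [w for w in words if w not in STOPWORDS]


def representative_tweet(tweets):
    """Return the tweet whose terms are, on average, shared by most other tweets."""
    doc_freq = Counter()
    for text in tweets:
        doc_freq.update(set(tokens(text)))

    def score(text):
        terms = set(tokens(text))
        if not terms:
            return 0.0
        # Average number of tweets in the burst that contain each term.
        return sum(doc_freq[t] for t in terms) / len(terms)

    return max(tweets, key=score)


# Invented tweets standing in for a burst.
burst = [
    "Gold for GB in the dressage!",
    "GB win dressage gold at last",
    "Anyone seen my keys?",
    "dressage gold!!! amazing",
]
print(representative_tweet(burst))  # Gold for GB in the dressage!
```

Averaging over the tweet's own terms favours short, on-topic tweets over long ones that mention the event in passing.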

CASM

• Centre for the Analysis of Social Media

• Collaboration between DEMOS and TAG

• Applying text analytics to social media to answer sociological questions

• OSI-funded EU sentiment analysis pilot project

http://www.demos.co.uk/projects/casm/

Ethics

Axes: Narrow vs Broad; Anonymous vs Identity Preserving

Labels: Stasi; Judiciary; Me!; Social Science

Reffin, J (2012)
