data scientist (analytics)

Post on 21-Jul-2016


What is Data Science?

Data science is the study of the generalizable extraction of knowledge from data. It builds on techniques and theories from many fields, including mathematics, statistics, data engineering, pattern recognition and learning, advanced computing, visualization, uncertainty modeling, data warehousing, and high-performance computing, with the goal of extracting meaning from data and creating data products. "Data science" is also a buzzword, often used interchangeably with analytics or big data, and often abused to market anything involving data processing, in particular to re-brand existing competitive intelligence and business analytics approaches.

Figure: Drew Conway’s Venn diagram of data science

Data Scientist

A data scientist solves complex data problems by applying deep expertise in some scientific discipline. Data scientists are generally expected to be able to work with various elements of computer science, mathematics, and statistics. However, a data scientist is most likely to be an expert in only one or two of these disciplines and proficient in another two or three. This means that data science must be practiced as a team, so that across its members the team has expertise and proficiency in all the disciplines.

Desirable qualities of a Data Scientist

• Data grappling skills: they should know how to move data around and manipulate data with some programming language or languages.
• Data viz experience: they should know how to draw informative pictures of data. That should in fact be the very first thing they do when they encounter new data.
• Knowledge of stats, error bars, confidence intervals: ask them to explain this stuff to you. They should be able to.
• Experience with forecasting and prediction, both general and specific (ex): lots of variety here, and if you have more than one data scientist position open, I'd try to get people from different backgrounds (finance and machine learning, for example) because you'll get great cross-pollination that way.
• Great communication skills: data scientists will be a big part of your business and will contribute to communications with big clients.
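As a small illustration of the "error bars, confidence intervals" item in the list above, here is a minimal Python sketch that computes a sample mean with an approximate 95% confidence interval. It is not from the original slides: the function name `mean_confidence_interval` and the measurements are hypothetical, and the interval uses the normal approximation (z = 1.96). For a sample this small, a t-quantile would give a somewhat wider and more appropriate interval.

```python
import math
import statistics


def mean_confidence_interval(sample, z=1.96):
    """Return (mean, lower, upper) for an approximate confidence interval.

    Assumption: normal approximation with z = 1.96 (about 95% coverage).
    """
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = statistics.fmean(sample)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    sem = statistics.stdev(sample) / math.sqrt(n)
    half_width = z * sem  # this is the length of the "error bar"
    return mean, mean - half_width, mean + half_width


# Hypothetical measurements, made up for illustration only.
measurements = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0]
mean, lower, upper = mean_confidence_interval(measurements)
print(f"mean = {mean:.2f}, approx. 95% CI = [{lower:.2f}, {upper:.2f}]")
```

The half-width is what would be drawn as an error bar around the mean. It shrinks in proportion to 1/sqrt(n) as more observations are collected.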

Why are we statisticians here?

Some in the data science community argue that statisticians are not needed in data science. The following points show that statistics, and statisticians, are a vital part of it.

"Data grappling skills" are things we have learnt along the way in modern regression and advanced data analysis, which between them guarantee intensive R usage. These are things we explicitly teach in statistical computing, with even more R.

"Data viz experience" begins with our introductory Statistics classes, and then goes on in great depth in statistical graphics and visualization, with even more of the accompanying R. The habit of starting to understand any new data by drawing pictures is certainly something we inculcate.


"Knowledge of stats, error bars, confidence intervals" needs no elaboration.

"Experience with forecasting and prediction" again, both regression and advanced data analysis are full of this.

"Great communication skills" Graphics, regression, and advanced data analysis all require, and grade on, the ability to write comprehensible and useful data analysis reports. The research projects class involves a lot of this, as well as regular oral presentations. It would be good if we did more on this front, however.

Big Data

Big Data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process the data within a tolerable elapsed time. Big data sizes are a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data in a single data set.

Alternatively, Big Data can be defined as very large, distributed aggregations of loosely structured data, often incomplete and inaccessible:

• Petabytes/exabytes of data
• Millions/billions of people
• Billions/trillions of records
• Loosely structured and often distributed data
• Flat schemas with few complex interrelationships
• Often involving time-stamped events
• Often made up of incomplete data
• Often including connections between data elements that must be probabilistically inferred

Some examples of Big Data problems are:

Web-based businesses are developing information products that combine data gathered from customers to offer more appealing recommendations and more successful coupon programs.

By embracing social media, retail organizations are engaging brand advocates, changing the perception of brand antagonists, and even enabling enthusiastic customers to sell their products.

Hospitals are analyzing medical data and patient records to predict those patients that are likely to seek readmission within a few months of discharge. The hospital can then intervene in hopes of preventing another costly hospital stay.

Big Data management tools we will use in future:

• R
• Hadoop
• Hive
• Pig
• Python
