SDS Podcast Episode 175 with Gregory Piatetsky-Shapiro

Show Notes: http://www.superdatascience.com/175

Post on 30-Dec-2021

Kirill Eremenko: This is episode number 175 with President and Editor

at KDNuggets, Gregory Piatetsky-Shapiro.

Welcome to the Super Data Science Podcast. My name

is Kirill Eremenko, data science coach and lifestyle

entrepreneur. Each week, we bring you inspiring

people and ideas to help you build your successful

career in data science. Thanks for being here today,

and now let's make the complex simple.

Welcome back to the Super Data Science Podcast,

ladies and gentlemen. Today, I've got a very exciting

guest for you on the show, the legendary Gregory

Piatetsky-Shapiro, who is the founder of KDNuggets, is

joining us. I actually met Gregory quite a while ago. It

was over a year ago in May 2017 at the ODSC

Conference where we chatted and I invited him to the

podcast, but it took this long for us to organize

everything, and now he's finally come on the show. If

you don't know who Gregory is, then this will just put

things into perspective for you. KDNuggets is one of

the most popular data science resources out there.

They curate news on data science, they provide

their own articles, they conduct polls on data science,

and many, many more exciting things in the space of

data science. They've been around since 1997. Here's

another perspective for you, Gregory has 256,000

followers on LinkedIn, so that should just tell you

what kind of an influencer in the space of data science

Gregory is, and how much he's actually contributed to

the community, how many things he's given back to

the space. Today, we are welcoming him on the

show.

In today's podcast, what will we be talking about?

Today, we're going to cover off quite a few topics. Of

course, we'll go through the foundations of KDNuggets.

A very exciting, very interesting story of how it all

started, where Gregory began his journey into the

space and what KDNuggets has grown into, but also

we will cover off some of the more recent advances that

have been happening in the space of data science that

KDNuggets has been highlighting or has been

participating in.

For instance, we'll talk about the whole concept of data

science being the sexiest profession of the 21st century

and what has it turned into now, and what role is

machine learning playing in there? We'll also talk

about what the new GDPR regulations in Europe mean

for data scientists. The General Data Protection

Regulation came into play in Europe earlier this

year. It's one of the first changes in decades in the

European data protection regulations.

We'll talk about the concept of the citizen data

scientist. We'll talk about reinforcement learning, and

quite a lot of other very exciting things. As you can

imagine, Gregory sees all these new updates in

the space of data science on a daily basis. He is the

editor for KDNuggets, so all these articles that you're

seeing on KDNuggets actually go through him, and

today he's sharing his best and most exciting insights

with us.

All in all, a very exciting episode full of the most recent

technological advancements and interesting stories

on how this all came to be. Can't wait for you to check

it out, so let's dive straight into it, and without further

ado, I bring to you, Gregory Piatetsky-Shapiro, founder

and editor at KDNuggets.

Welcome, ladies and gentlemen to the Super Data

Science Podcast. Today, I've got a very exciting guest,

Gregory Piatetsky-Shapiro on the phone. Gregory,

welcome to the show. How are you today?

Gregory P. S.: Thank you, Kirill. I'm excited to be here. It's a pleasure

to be on your podcast.

Kirill Eremenko: It's so wonderful to have you. We met in May of,

what was it, 2017? Yeah, I think May 2017. Last

year in May, and it's been over a year, and I've been

wanting to get you on the show for a year now, and

finally we're here. This is super, super, exciting.

Gregory, where are you located right now?

Gregory P. S.: I am in Boston, Massachusetts.

Kirill Eremenko: Is that-?

Gregory P. S.: Actually, I'm-

Kirill Eremenko: Yep.

Gregory P. S.: ... working at home, so we have beautiful sunny

weather, and all of my cats, I think, are outside. As a

data scientist in the daytime, I do have the cats, but

hopefully they don't interfere in the middle of this

conversation.

Kirill Eremenko: Fantastic. Yeah, I was just about to ask that. That is

your home base, Boston. Is that correct?

Gregory P. S.: Yes.

Kirill Eremenko: Wonderful. It's so great to hear that you've got sunny

weather in Boston today. Last time I was there, it was

in May last year, it was surprisingly chilly. Yeah, so it's

good to hear that the weather is nice today. All right,

so let's dive straight into the podcast. Gregory, you are

the Founder and Director or President and Editor of

KDNuggets, a very popular data science media outlet

and news aggregator and a platform that shares

research about data science. You've been running this

platform for 21 years now. Tell us a little bit about how

it all started. Where did this idea come from?

Gregory P. S.: Yes, thank you. Probably, I started when I was a kid, I

was very fascinated by science fiction, and I loved

stories about robots, especially, from Isaac Asimov and

other writers like Stanislaw Lem and [inaudible

00:06:39] that was known in the West. I was

always curious about the idea of AI, and this probably

motivated me to learn computers when they first

appeared. In my first year in college when computers

were still programmed with punch cards, I remember

spending several weeks of my free time in the summer,

writing a program to play battleships, which was still a

very advanced program for that period. And then I

used APL. That was a special language developed by

IBM. It's A Programming Language, and it had special

symbols for every different array operation.

You can think of it as like R but with Greek letters.

After spending several weeks programming it, I played

one game and I was very soundly defeated by my own

program. I think as a result, I became much more

interested in creating programs than playing them. I

studied for an undergraduate

degree in Mathematics, then I came to the United States to

study computer science at NYU, and I got my PhD in

applying machine learning to databases. I think the

idea was a self-organizing database system that

automatically selects different indices and does

something intelligent.

Then I worked at GTE as a researcher. GTE was a

large telephone company in the United States. Now, it is

part of Verizon, which is an even larger telecom

company. I remember around 1986 or so, I attended a

workshop, which was called Expert Database Systems.

That was a very interesting name, but the concept was

very fuzzy, and the workshop papers and talks were all

over the place. I thought we could focus on something

more clearly defined, analyzing databases and finding

interesting patterns. In one of our projects

on applying some intelligence to databases,

I discovered that a particular query would run

10,000 times faster if we knew that there was a particular

rule, a kind of functional constraint that always

held. That is some oversimplification. Can you

find some useful rules in databases?

I was, at that time, young, energetic and naïve, and I

thought that I could organize a better workshop. At

that time, a popular term was data mining. It's

interesting to note just as an aside how the

terminology changes and reflects the time. It went

from data fishing and data dredging, which were bad

terms, to data mining, which became the next popular term.

Now, the popular term is data science, or it was until

last year. Now, it's machine learning and artificial

intelligence. In any case, I organized a

workshop. I thought data mining was not sexy enough,

so I came up with the name Knowledge Discovery in

Data or KDD. That was the first workshop back in

1989, which attracted, I think, about 70 people

including several leading researchers.

Kirill Eremenko: Wow.

Gregory P. S.: Then I organized a couple more workshops, and later

in 1994, one of my best ideas was to stop doing it

myself and to recruit Usama Fayyad, who was then

just a fresh PhD from Ann Arbor. His advisor

Ramasamy Uthurusamy was then a researcher

at General Motors, and they agreed to run the '94 workshop.

Then, next year, that workshop turned into a conference,

and later with the help of Won Kim, who was the chair of

SIGMOD. He was very experienced with ACM,

a leading professional organization, the Association

for Computing Machinery. We created a special

interest group, SIGKDD, that was running the KDD

conference, and it's still running today.

I think we've had about 22 KDD conferences since

then. I'm very pleased to say that KDD remains the

leading research conference in the field based on

citations and other indices. Now, I can stand back

after many years of organizing [inaudible 00:11:50]

like a grandparent, enjoy the baby doing really well.

That was kind of one track of my activity.

How did I get to where I am? After the third KDD

workshop, I decided to send a newsletter to people who

attended the workshop, and I called it the Knowledge

Discovery Nuggets. The first issue, which is still online,

went to, I think, about 50 people who attended that

workshop. Now, it's almost 25 years, actually 25 years

[inaudible 00:12:32] so KDNuggets has about 200,000

subscribers and followers across email,

Facebook, LinkedIn, and our website gets about

500,000 visitors a month.

Kirill Eremenko: Wow, congratulations. That's huge.

Gregory P. S.: Big goals. Thank you. But we're focusing on analytics,

data science, and machine learning. If I try to talk to

my people you realize that as a data scientist at heart,

I just try to select a few interesting things to write

about or select things on the web that we can publish.

I guess that was a second track in my career.

In parallel, organizing conferences and

publishing a newsletter was not a full-time activity, and

all the conference organizing that I've done was

always as a volunteer [inaudible 00:13:43]. I never

received any payment for it, but it was probably one of

the more rewarding things that I've done because I

enjoyed doing it with interesting people and helping to

put good things together. But another interesting thing

that I've been doing in terms of research and data

mining involved consulting and being involved in the

world of startups. In 1997, which was still very much part of

the [Dot-com Rush 00:14:23], I left the GTE research

lab and I joined a startup that was doing analytics

and data mining consulting for the financial industry, mainly

banks and insurance companies.

We worked with the largest names like Credit Suisse,

Chase Manhattan, Citibank. I was a chief scientist,

and managed a small team of perhaps about 10

people. Then around 2000, our smaller startup was

bought by a big startup. For a very short period of

time, the value of the big startup exceeded $1 billion.

Kirill Eremenko: Wow.

Gregory P. S.: It became the wanted unicorn, but before anyone,

including me, could do anything foolish with the stock

options, the larger startup's stock crashed almost all

the way down to zero. I left it in 2001, I think maybe

a couple of months before that stock went all the way to

zero. I have been self-employed since about 2001, mainly

publishing KDNuggets and doing consulting and data

mining.

I think one interesting question for all the younger

people listening is synergy. In my case, I've done

three parallel and mutually supporting activities: as a

researcher and consultant in data mining, as a

founder and chair of KDD conferences, and as publisher and

editor of KDNuggets news and website. Each one of

those activities was in some way helping the others. I

know, Kirill, that you're also teaching courses and you

have a very nice book, Confident Data Skills.

Kirill Eremenko: Thank you, yes.

Gregory P. S.: And probably doing other things. I guess a

helpful suggestion for young people who try to do

interesting things is to think: is there a synergy of

this activity with some other [inaudible 00:16:42]? If

there is not, then maybe it's not the best thing to do.

Where there is synergy, it generally helps you to succeed.

Just to finish this, in the last few years, I think,

maybe riding the big data and data science wave,

KDNuggets became so popular that I

stopped consulting. Now, I only publish

KDNuggets, and we have another excellent full-time

editor [inaudible 00:17:17] based in Canada. We have

several interns based in London and other places.

KDNuggets is global in its reach.

Kirill Eremenko: Gotcha. Wow, that's such an interesting career, and I

love that you mentioned that wonderful takeaway for

your career [inaudible 00:17:41] about synergy. I can

totally agree with that that when you're working on A

and B, you should be aiming to make sure that A plus

B is more than just A plus B. It's A plus B plus an

extra value. So it's not one plus one equal two. If you

truly have a synergy in the things that you're working

on, one plus one equals three or four or five, because

they complement each other, and they help your

audience, and they help you propel your career

forward. That's a very interesting takeaway, and

definitely, I can agree that looking back unconsciously,

I've probably done that. I can see I've done that in my

own career, but that was always unconscious. That

was just like a gut feel, but if you think about it

consciously, I think you can make much faster

progress in the things that you're doing and how

you're going forward.

Thank you, and it's really exciting to hear that

KDNuggets has got so many followers, 200,000

subscribers and 500,000 visitors per month. Those are

truly astonishing numbers. You mentioned that you

select those blog posts. How many blog posts do you

publish on KDNuggets? How frequently do they come

out?

Gregory P. S.: Well, we publish every weekday, and we try to select

maybe two or three interesting blog posts a day. Now,

we get a lot of submissions. Occasionally, myself and

Matthew [May 00:19:14] we also write our own

editorial pieces, and if we see some interesting blog

posts around the web, then we'll also ask the authors

for reposting those as guest blogs on KDNuggets, but

there's so much stuff on the web that we try to select

only a small number, maybe two or three per day.

Kirill Eremenko: That's quite a lot as well. Already, that makes it 10 or

15 or more per week. How do you find the time to go

through all of them? You probably get a ton of

submissions sent to you. How many submissions do

you get, just out of curiosity?

Gregory P. S.: Well, it's hard to say, but I think we probably get

something like three to five submissions per day, not a

very large number because we have clear guidelines,

and we also focus on more technical solutions. Our

audience is mainly data scientists, and machine

learning engineers, so we'll not publish something like

why your business should use data science. I assume

our readers already know, but we would publish

something that explains how to create a pipeline in

Python or some ideas how to use Python [inaudible

00:20:38] or maybe some interesting polls that I run

every month or so. There are some interesting

observations, like in our recent poll, our most popular annual

poll, on what software you use.

I've been running this poll, actually, since 2001,

amazingly.

Kirill Eremenko: Wow.

Gregory P. S.: Yeah, this is the 19th such poll. Now, the latest poll is

out, and it shows that there is kind of a clear ecosystem

emerging around Python, Spark, Anaconda and

TensorFlow. Now, it's becoming an integral part of

the data science toolbox. Python seems to have more

significant [inaudible 00:21:31] ahead of R. There're a

lot more tools that use Python than R. There are some

other interesting observations that your readers can

see on KDNuggets.

Kirill Eremenko: Wonderful. Is it just like on the main page of the blog

or is there a specific page for all these insights? 'Cause

I-

Gregory P. S.: Well, on the main menu, we have a section called top

stories, and if you scroll there, then you will find more

interesting things.

Kirill Eremenko: That's so cool.

Gregory P. S.: Yeah, being data scientists, we always analyze the

results, so we always like to see what's more popular,

and we publish separate posts with just the top stories.

Kirill Eremenko: Gotcha. Wow, this is really cool. I'm on the page right

now, and I highly recommend for people to check it

out. It's kdnuggets.com, and then you can, at the top,

find top stories and look through those. All right, well,

that's really interesting, very powerful insight.

Actually, before today's podcast, I was reading your

most recent blog about why data science is no longer

the sexiest profession of the 21st century, even though

it's still satisfactory, there's a new profession that is

the sexiest. Do you mind sharing a little bit on that

with us?

Gregory P. S.: Sure. We recently did a poll of our readers, and I think

we asked them basically, "What's your title and how

satisfied are you?" [inaudible 00:23:05] very satisfied,

which we converted to +2, to very unsatisfied, which

we converted to -2, and surprisingly, the profession

with the highest job satisfaction was machine learning

engineer. Well, as a researcher I have to

say that the average satisfaction was like 0.7, and the

standard deviation was around 1.0, so it's not like all

the machine learning engineers were highly satisfied.

There were still a lot of unsatisfied ones, but on

average, I think there was a significant difference

between the job satisfaction for this profession,

machine learning engineer, and the second and third

place, which were researcher and data scientist.
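
The scoring Gregory describes here can be sketched in a few lines of Python. This is a minimal illustration and not the actual KDNuggets analysis: only the +2 and -2 endpoints come from the conversation, while the three middle labels, the function name `satisfaction_by_title`, and the sample responses are assumptions made up for the example.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Only the endpoints (+2 and -2) come from the conversation;
# the three middle labels and their scores are assumptions.
SCORES = {
    "very satisfied": 2,
    "satisfied": 1,
    "neutral": 0,
    "unsatisfied": -1,
    "very unsatisfied": -2,
}


def satisfaction_by_title(responses):
    """Return {title: (mean score, standard deviation)} for (title, answer) pairs."""
    by_title = defaultdict(list)
    for title, answer in responses:
        by_title[title].append(SCORES[answer])
    return {title: (mean(v), pstdev(v)) for title, v in by_title.items()}


# Made-up responses, purely for illustration.
sample = [
    ("machine learning engineer", "very satisfied"),
    ("machine learning engineer", "neutral"),
    ("data scientist", "satisfied"),
    ("data scientist", "unsatisfied"),
]
summary = satisfaction_by_title(sample)
```

Reporting the standard deviation next to the mean is what supports the caveat in the answer above: an average of 0.7 with a spread of about 1.0 still leaves room for many unsatisfied respondents.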

Data scientist is still the most common job title. I see

that on the web and [inaudible 00:24:10] and on job

[inaudible 00:24:12] to get on KDNuggets, but

there is more coming, more requests, more demand,

coming for people with machine learning engineer

skills. I guess the difference, as I would describe it, is that a machine

learning engineer is building machine learning

systems, probably now using deep learning, and a

data scientist perhaps does more work on analyzing and

then trying to understand what is happening with

companies, not necessarily building production

systems.

Kirill Eremenko: Gotcha. Very interesting. That's a little hint, I guess, to

our listeners. If you're looking for the new data

scientist, the job that's coming to take over from the data

scientist, it might be machine learning engineer. Very

interesting. Thank you for that. All right, so I wanted

to ask you a couple of questions. You've obviously had

a very diverse and interesting, like a career filled with

lots of different roles and different engagements, and

different things that you've worked on, that you've

done, I just wanted to find out some of the highlights.

What is a recent win that you can share with us?

Something that you've had [inaudible 00:25:43]?

Gregory P. S.: I will mention maybe a couple of interesting things,

maybe they're not as recent but they're still very

instructive. I think one of the most interesting projects

that I worked on when I was still at GTE Laboratories

was called Key Findings Reporter, which we

called KEFIR. It was a system for analysis and

summarization of key changes in large databases, and

we applied it to healthcare data. Healthcare in the United

States is a scandal and also very, very expensive. I

think we spend here twice as much as other

industrialized countries, with no better

results, and trying to understand where all that money

goes is an essential part of the equation.

Our system automatically analyzed changes in

[inaudible 00:26:47] variables and it selected the

important ones, and it was combined with a small

expert system to add recommendations on what to

do about the changes. For example, if you have a

particular type of medical problem, then the expert

system will recommend how to solve it. It presented

visualizations and it looked at changes in trends. One

good way to identify what is more important

is to always look at changes. For example, if you just look

at the associations, you can find a huge number of

significant associations in data. How do you find the

important ones? You look at the ones that change over

time: what is true this period and was not true in the

previous period.
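
The idea of ranking findings by how much they changed between periods can be illustrated with a short Python sketch. It is not KEFIR's actual method, which the conversation does not detail; the function name `rank_changes` and the measure names and values are hypothetical.

```python
def rank_changes(previous, current):
    """Rank measures by absolute relative change between two periods.

    `previous` and `current` map a measure name to its value in each period.
    Measures missing from either period, or equal to zero in the previous
    period, are skipped because a relative change is undefined for them.
    """
    changes = []
    for name in previous.keys() & current.keys():
        old, new = previous[name], current[name]
        if old == 0:
            continue
        changes.append((name, (new - old) / old))
    return sorted(changes, key=lambda item: abs(item[1]), reverse=True)


# Hypothetical healthcare measures for two consecutive periods.
last_period = {"admissions": 100, "avg_cost": 2000, "visits": 500}
this_period = {"admissions": 150, "avg_cost": 2100, "visits": 450}
ranked = rank_changes(last_period, this_period)
```

Sorting by the size of the change, rather than by the raw value, is one simple way to surface what is different this period instead of what is merely large.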

It was all combined in one very nice system, and it was

applied to all our GTE healthcare data and it identified

some significant potential savings. We did win the highest

technical award from GTE. Unfortunately, I guess I

would still regard it as a failure because the system

was not deployed.

Kirill Eremenko: Why is that?

Gregory P. S.: Probably that's connected to another question we

discussed. What's the most difficult thing to do? I think the

most difficult challenge in my work as a data scientist

was getting the results deployed because that requires

change in organizational culture and support from the

top. In the case of KEFIR, the system was working technically but

there was no place for it in the organization. It was not clear

who would use it, how it would affect the work of

people who were analyzing healthcare data. That is

probably the fate of many data science projects. You

can easily build a great prototype but unless there is

a clear path to deployment and support from the

organization, it is still a failure.

Kirill Eremenko: I see.

Gregory P. S.: I guess there's another interesting story I can tell. I

worked on many different projects. Probably the ones I

enjoyed the most were working on bioinformatics data.

I had one project where we worked with mass

spectrometry data trying to develop early indicators of

Alzheimer's. The problem with analyzing biological data

is you have a huge number of variables. You could

have 20,000 different compounds, but you don't have

a large number of patients. Typically you could get

maybe several hundred patients. Imagine you have 100

patients and you have 2,000

variables; it creates very significant problems in

determining what's significant and what is just

random noise. In that particular case, we did discover

very strong biomarkers that were 100% accurate.

There were, I think, about a dozen of them.

One of them actually had biological significance

because it was like vitamin C, so our initial results

suggested that people who had more vitamin C were

less likely to get Alzheimer's. Even though my intuition as

a data scientist told me, "Beware of perfect results,"

this was [inaudible 00:30:46] it was 100% correct, so

it doesn't matter how you split the data; if it's 100%

correct, it will still be 100% correct. Myself and my

friends, we all started to drink more orange juice and

take vitamin C, but were still skeptical about the results.

The only way to test them was to get another

population. We did that and we found that probably

the original data was contaminated in some form. I

guess, don't trust the results if they're too good. That

could be a useful lesson.
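
The many-variables, few-patients problem from this story can be reproduced with pure noise. The Python sketch below is an illustration under stated assumptions, not a reconstruction of the actual study: the patient and feature counts, the function names, and the random seed are all made up, and every "biomarker" is a coin flip with no real relationship to the diagnosis.

```python
import random


def best_noise_feature_accuracy(n_patients, n_features, rng):
    """Best single-feature accuracy when every feature is pure random noise."""
    labels = [i % 2 for i in range(n_patients)]  # half cases, half controls
    best = 0.0
    for _ in range(n_features):
        feature = [rng.randint(0, 1) for _ in range(n_patients)]
        matches = sum(f == y for f, y in zip(feature, labels))
        # A feature that is mostly wrong is just as "predictive" when inverted.
        accuracy = max(matches, n_patients - matches) / n_patients
        best = max(best, accuracy)
    return best


def noise_feature_holdout_accuracy(n_patients, rng):
    """Accuracy of one noise feature on a fresh population it was not selected on."""
    labels = [i % 2 for i in range(n_patients)]
    feature = [rng.randint(0, 1) for _ in range(n_patients)]
    return sum(f == y for f, y in zip(feature, labels)) / n_patients


rng = random.Random(0)  # fixed seed so the run is repeatable
train_accuracy = best_noise_feature_accuracy(n_patients=20, n_features=5000, rng=rng)
holdout_accuracy = noise_feature_holdout_accuracy(n_patients=200, rng=rng)
```

With thousands of candidate features and only a handful of patients, the best noise feature looks like a strong predictor on the original sample, while on a new population it falls back to roughly chance. That is the same check Gregory describes: the only real test was a second population.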

Probably, the most success that I had in my career of

data mining [inaudible 00:31:41] was when we had to

help organizations make some strategic decisions. We

would examine whether they should use this

particular strategy or that one. Some of that work

was deployed, but as a consultant, I cannot tell

you the details, unfortunately. I know that there

was a pay-off of, I think, seven digits based on

our results, and those results were easy to deploy

because it was like, do decision A or decision B; it did

not require change in the entire organizational

structure.

Kirill Eremenko: Gotcha. Thank you very much. That's interesting. We

just talked about the wins and the challenges, and I

appreciate you sharing your experience. It's sometimes

difficult to share experiences, especially if it's a project

like the one you were working on for the Key Findings

Reporter, where you're working on it for a long time

and you're really proud of the results but it's not

deployed, but it is a great example for our listeners,

especially for those starting out of some challenges

that they might come across. In this case the takeaway

is that even if your project is great and you see that it's

got a lot of value, the situation might occur in such a

way that it might not be deployed in the end, and that

shouldn't ... Of course, you should do as much as you

can in order to avoid the situation, but if it does

happen, then don't let it bring you down. It sometimes

happens, even to the best people in the industry.

Also, the other example is also great where the results

are too good. Even in data science, sometimes

intuition plays an important role. Like you said, when

the results for that vitamin C example were too good,

your intuition was saying that don't trust the results

[inaudible 00:33:50] I think it's also a good thing to

look out for if your results are too good to be true, then

find another place to check them, verify them, and

make sure that the test or the example is repeatable.

All right, so we talked about the wins

and we talked about the challenges. How about this: what is your one

most favorite thing about being a data scientist?

What's the one most favorite thing that's kept you

going through this career for more than 20 years?

Gregory P. S.: Well, I really enjoy the process of exploratory data

analysis and visualization. Analyzing the data, running

the algorithms: what does the data reveal? It's

like discovery of new and unknown realms. I think

curiosity is an essential trait for a good data scientist.

Along with discovering something, now I try to see

what's the best way to visualize and present it.

Especially, for example, if I'm looking at data for

recent KDNuggets posts, there are many ways to

organize it, and I think of what is a good story that

the data tells and what is a good image that is worth

1,000 words in a story. Generally, I think probably the

most useful thing to read, and I think I read a

study somewhere that confirmed it, is the captions on

images.

If a picture is worth 1,000 words, then a good caption

on that image may be worth 10,000 words. Think of

how to present the data, present the story and

visualize it and describe the image that you just

presented.

Kirill Eremenko: Thank you. It's definitely one of my favorite parts as

well of data science. Well, Gregory, I know that you will

need to go very soon, so I wanna really jump to the

part where I'm very curious, as you said, an important

part of being a data scientist is curiosity. I'm very

curious to get your answer to the following question.

It's a philosophical question, one I ask very often in

the podcast almost every time. I always get different

answers. Different people have different perspectives.

The reason why I'm so curious to get your perspective

on this is because of the amount of experience you

have in the field, your worldview and how it's

developed over time. On top of that, you just interact

with so many people, over hundreds of thousands of

followers, you influence them, you reply to their

comments on KDNuggets, you get these emails, you

have aggregated so much, such a wealth of

information in the space.

Here it goes. From all this experience, from everything

you've seen in the field of data science, where do you think

the field of data science and analytics is going, and

what should our listeners prepare for to be ready for

the future that's coming?

Gregory P. S.: Thank you, Kirill. I think that's a great question. I

guess as data scientists, we should always try to

predict the future, and as a data scientist with a lot of

experience, I can say that we're not very good at

predicting human trends, but I'll try nevertheless.

Kirill Eremenko: All right.

Gregory P. S.: What I see now is data science is becoming part of a

larger machine learning and AI field, which is really

progressing very fast. Capabilities especially in deep

learning are growing at an amazing rate, like every day we

see some really amazing stuff, like this recent Google

Duplex Demonstration, where they had completely

human quality calls with unsuspecting humans, but I

think AI hype is growing even faster than AI

capabilities, so beware of the hype, I guess that could

be one warning.

A second recent important event is the GDPR, this is the

European General Data Protection Regulation that took

effect May 25th this year. It forced companies

even outside of Europe to revise their privacy policies to

conform to GDPR. The good part for consumers is that it

offers more protections. It gives consumers some rights

over their data, such as receiving their data,

and it potentially makes life more complicated for

companies because GDPR also gives consumers a right

to something like explanation, and exactly what it

means is, I think, still under debate. I think interested

listeners can read my blog called "Does GDPR Make

Machine Learning Illegal?" which looks into that. I

think the answer is no, it doesn't make machine

learning illegal but the right for explanation may make

machine learning more difficult, and exactly how it will

play out will, I think, be determined by courts.

The first lawsuits against Google and Facebook were

filed in the first couple of hours after GDPR took

effect, so we'll see.

Another interesting trend that I'm watching is what's

being called citizen data scientist. I think this term

was introduced by Gartner a couple of years ago, and

the idea was for tools to become so good that any citizen

can use them and do data science. I have been

very skeptical of citizen data scientists. I think, do you

want a citizen dentist to work on your teeth or a

citizen pilot to fly your airplane? Probably not. I think

data science can either be fully automated, and this

was a direction taken by companies like DataRobot,

H2O and others that offer kind of fully automated

solutions, or you can have positions that require

training and expertise in data science. Having

people with no training who use tools that are

semi-automated, I think, is very dangerous because

you can easily reach wrong conclusions. Just think of

my example with vitamin C and Alzheimer's, which

citizen data scientists would say was a correct result,

but they would lack the training and intuition to warn them when

they're going in a wrong direction.

Now, I think, is a golden age for data science.

There're amazing tools that allow one person to do

what hundreds of people could not do 10 years ago,

but data science, like most data-driven activities with

some relatively clear rules and goals, is also becoming

automated. We had a poll recently on KDNuggets that

asked readers when data science will be automated,

and the median answer was 2025. So to our data

science listeners: enjoy this great period, but beware of

coming automation. In terms of the future trends, of

course, everybody has heard many times about

deep learning. Another important technology that I

think is now coming to the forefront is reinforcement

learning, and especially deep reinforcement learning.

Data science involves learning from data that has already

been recorded, kind of learning from the past, whereas

reinforcement learning is applied to agents that are

active in their world, do experiments, and can learn

from their experiments. This was the key to some

successes like AlphaGo, which defeated the world

champion in Go by essentially learning by playing

against itself. If I can make one more interesting

observation about the future: this AlphaGo was

developed initially from learning with human masters

or experts in Go, and later, people at DeepMind

developed a more general version which they called

AlphaZero, Zero to indicate that it started with zero

human knowledge, essentially just playing against itself using

reinforcement learning and deep learning. It achieved,

in about four hours, a superhuman level in chess.

That was very disappointing for me as a former chess

player.

It took it, I think, three days to achieve that

superhuman level in Go. This AlphaZero version

played the strongest chess player in the world. It's no

longer human. I think the strongest chess player in

the world is now a computer. I think they had a

program called Stockfish which was programmed in

old-fashioned style with [inaudible 00:44:38] human

opponents and [inaudible 00:44:40] millions of

positions [inaudible 00:44:43] and when AlphaZero

played Stockfish, it defeated it something like 10 to

0.

I looked at some of the games, and it made completely

inhuman moves. I don't know Go but I do know chess,

so I could appreciate how amazing those moves were.

If humans made them, we'd call them amazing

examples of human intuition and creativity, but I

think somebody described it, "It would be like aliens

landed on Earth and they learned to play chess." I

guess, kind of looking forward, this gives us a

sneak preview into artificial general intelligence. I've

got no idea when it will be achieved, but people

who interact with it will probably be hard-pressed

to understand why it does what it does. That is the

experience of chess masters looking at how the

superhuman AlphaZero works.

It has a completely different intuition, and people who

understand Go report similar things, that it plays in a

completely different way that humans have never even

thought about. Not always, there are still moves

that humans can understand, but occasionally it makes a

completely superhuman move. That's kind of a

preview, a small window into artificial general

intelligence.

Kirill Eremenko: Well, fantastic. Thank you very much. I noticed that

you have a blog post about this as well, which is very

exciting, so if there're any chess players listening or

even if you're just interested in artificial intelligence,

Gregory has got a blog post about data science in 30

minutes, artificial general intelligence and answers to

your questions, so you can read more about this, and

I'm definitely curious about this. I'm gonna jump onto

this and check it out, 'cause I'm also a chess player

myself. It's a very good lens to put it in. I've heard

about the developments of Google DeepMind in the

game of Go and AlphaGo Zero, and how it was able to

win with a huge advantage.

In the same way, I don't play Go. I'm not a Go player

so it's quite hard to relate, but with this chess

situation, I definitely would like to know a bit more

about that inhuman move with the knight and things

like that. I'll have a look at that. Thank you so much

for sharing. Yeah, it's definitely an interesting area,

and of course, I'd also like to just recap on the things

that you mentioned about the trends. I knew this was

a good question. Gregory, you were a great person to

answer the question, and you did give us so many tips,

so ladies and gentlemen listening to this podcast, here

are some takeaways from Gregory's answer to our

question of what to prepare for in the future. AI capabilities

are growing, and machine learning as well, but beware

of the AI hype.

GDPR, so look at that, the European General Data Protection

Regulation, which came into effect May 25th this year. Does it

make machine learning illegal or not? There's a blog

post on KDNuggets about that as well. Citizen data

scientists, that's a concept that was introduced by

Gartner, but is it really a good thing or is it actually

something that sounds good but it actually might

cause more problems if people don't really know what

they're doing? How is that related to automation of

data science, things that companies like DataRobot

and H2O are looking into?

Then the fourth thing was data science and

automation. Moving on from that, you had a

poll that asked your readers and the median answer

was 2025. That's when data science will be fully

automated, so something to look into as well, and keep

following the trends on KDNuggets to see how that

changes, and if it does. And finally, a new addition to

this whole mix of AI, deep learning, machine learning

and data science is reinforcement learning. It's picking

up more and more these days, so another important

technology to look out for in the future.

Gregory, all I can say is a huge thank you. I know

we've gone a bit over the available time you had.

Before you go, could you please let our listeners know

how they can contact you, find you, follow you, get in

touch, or just learn all these amazing things that

you're sharing with the world?

Gregory P. S.: Well, thank you, Kirill. Well, our listeners can find our

website, KDNuggets. They can contact me by email,

editor1, the "editor" followed by digit 1,

[email protected], or tweet to @kdnuggets, and

they can also like our Facebook, KDNuggets, or join

our KDNuggets LinkedIn group. We welcome readers'

comments, submissions or blogs. We always look for

good technical submissions. As I mentioned we

publish two, three blogs per day, although currently I

have to say we already scheduled all the blogs until

July 2nd, but good blogs will certainly get published.

Kirill Eremenko: Gotcha.

Gregory P. S.: Kirill, thank you very much. I enjoyed the discussion,

and hope to see you again at another conference

somewhere.

Kirill Eremenko: Thank you. Thank you very much, Gregory. Very lovely

having you on the show, and I do also hope we'll catch

up soon.

There you have it. That was Gregory Piatetsky-

Shapiro, and all of his amazing and exciting and

insightful stories from the years of experience in data

science and all the people he's interacted with, all of

the articles and news that he's aggregated through

KDNuggets and all of the amazing events that he's

been through.

I'll be interested to find out what your favorite part of

today's podcast was. For me personally, it was the

example that Gregory gave about KEFIR, that situation

where a technically excellent system was developed

but it wasn't used because it didn't have a place in the

organization. A very telling example and something

that can happen to anybody, it can happen on any

project, so it's always important to understand, I

guess, what you're working towards and learning from

experiences such as this one. Even when they're not

your own, you can still learn from them and understand

that situations like that can happen, and how you can

try to avoid them in your own career. Of course,

among other things, there were a lot of very valuable

insights that Gregory shared with us.

On that note, we're gonna wrap up. I highly encourage

you to check out KDNuggets and follow that website,

and follow the news that they're sharing. Get onto

their email list, so you get all the updates, all the very

important and most recent updates of data science. Of

course, follow Gregory himself, connect with him, if

you're not following him already on LinkedIn, I'm sure

he'll be happy to get in touch and stay in touch. Of

course, you can find all of the show notes for today's

episode at www.superdatascience.com/175. We'll also

include a ton of links that we mentioned on the show

so head on over to superdatascience.com/175, and

check them all out, look up those articles, look at

those polls, and see where the world of data science is

going. I can't wait to see you back here next time.

Until then, happy analyzing.