SDS PODCAST EPISODE 285: BRINGING DEV & DIVERSE COMMUNITIES INTO DATA SCIENCE


Post on 22-Jul-2020


TRANSCRIPT



Kirill Eremenko: This is episode number 285, with top contributor on

Stack Overflow, Jon Skeet.

Kirill Eremenko: Welcome to the SuperDataScience podcast. My name

is Kirill Eremenko, Data Science Coach and Lifestyle

Entrepreneur, and each week we bring you inspiring

people and ideas to help you build your successful

career in data science. Thanks for being here today,

and now, let's make the complex simple.

Kirill Eremenko: This episode is brought to you by SuperDataScience,

our online membership platform for learning data

science at any level. We've got over two and a half

thousand video tutorials, over 200 hours of content

and 30-plus courses with new courses being added on

average once per month. You can get access to all of

this today just by becoming a SuperDataScience

member. There are no strings attached. You just need to

go to superdatascience.com and sign up there, cancel

at any time.

Kirill Eremenko: In addition, with your membership, you get access to

any new courses that we release plus all the bonuses

associated with them. Of course, there are many

additional features that are in place or are being put in

place as we speak, such as the Slack channel for

members, where you can already today connect with

other data scientists all over the world or in your

location and discuss different topics such as artificial

intelligence, machine learning, data science,

visualization and more. Or just hang out in the pizza

room and have random chats with fellow data

scientists.


Kirill Eremenko: Also, another feature of the SuperDataScience

platform is the office hours, where every week we invite

valuable guests in the space of data science and

interrogate them about their techniques, about their

methodologies in the space of data science and you

actually get a presentation from the guests and you get

an opportunity to ask questions at the end.

Kirill Eremenko: In some of our office hours, we just present some of

the most valuable techniques that our hosts think are

going to be valuable to you. All of that and more you

get as part of your membership at SuperDataScience,

so don't hold off. Sign up today at

www.superdatascience.com. Secure your membership

and take your data science skills to the next level.

Kirill Eremenko: Welcome back to the SuperDataScience podcast, ladies

and gentlemen. Super excited to have you back here

on the show today, and I'm super humbled by our

today's guest, Jon Skeet.

Kirill Eremenko: Jon has submitted almost 35,000 answers on Stack

Overflow and his advice has reached an estimated 276

million people worldwide. That's 276 million. Quite an

insane number, if you take a second to think about it.

Kirill Eremenko: I just got off the phone with Jon and the podcast

you're going to hear is going to be very interesting. We

had a great discussion and it is going to be a different

perspective today. The reason for that is that Jon is

not a data scientist, he's a C# developer, an expert in

C# and also some other programming languages.

Kirill Eremenko: Don't let that scare you away, because, a couple of

reasons. First of all, there's a lot of similarities


between data science and development. Both use

programming and things like versioning, and

diagnosing problems are common between the two, so

we can learn quite a lot of things from Jon. The other

reason why this is very relevant is because data

science is more and more coming closer to product

development, is being integrated more and more into

products. Before, data science was just, let's get some

insights, let's do some predictions.

Kirill Eremenko: More and more, we see that companies are integrating

analytics, machine learning, artificial intelligence, data

science, into their products. You will, eventually, it's

highly likely that in your career, especially if you go

and work in startups, for startups, you start startups,

that you will encounter situations where you need to

combine your data science knowledge with development

knowledge in order to productionize data science.

Therefore, already in this podcast, you can get a head

start and understand how these two worlds meet and

where they intersect.

Kirill Eremenko: Finally, the third reason is, maybe you are coming to

data science from a world of development. Maybe you

have some experience in programming languages like

C# or compiled languages. It will be interesting for you

to see Jon's perspective on the world of data science.

Kirill Eremenko: All in all, a fantastic podcast, I really enjoyed our

conversation. You'll hear a lot of very valuable

technical topics that we covered and also, at the end,

we actually talked about the importance of

community. What it means to be part of a community

and how communities grow, what you can do as a

data scientist, to make our community be more

inclusive, more welcoming and prosper further. I think

these are valuable insights from

somebody who's been heavily involved in the

development community. These are valuable insights

for data scientists and for us all to grow much faster

and better and stronger as a community.

Kirill Eremenko: On that note, I can't wait for you to check out today's

exciting podcast. Without further ado, I bring to you

the top contributor on Stack Overflow, Jon Skeet.

Kirill Eremenko: Welcome back to the SuperDataScience podcast, ladies

and gentlemen. Super excited to have you on the

show, because I have Jon Skeet on the other line.

Kirill Eremenko: Jon, how are you going today?

Jon Skeet: Very well. Thank you. Very well.

Kirill Eremenko: Very, very nice to talk to you. Could you please remind

me, what city are you calling from, from the UK?

Jon Skeet: I'm in Reading, which is just to the west of London.

Kirill Eremenko: Just to the west of London. Very cool. You said you're

having some fantastic weather these days?

Jon Skeet: Yeah, it's been really nice recently. A few occasional

downpours, but generally, we're escaping from the

normal British wet weather of a summer, so it's very

fine. The only downside is, by the end of the day, the

shed from which I usually work is pretty warm.

Kirill Eremenko: That's a good problem to have in the UK.

Jon Skeet: Yeah.


Kirill Eremenko: It was so cool to see your drums. That's so awesome.

That is very exciting. I wish you could ... Maybe one

day you can play something, and we can ...

Jon Skeet: I think it'll be quite a while before I'm even slightly

good enough to play for anyone else. I only bought the

drum kit a week and a bit ago. I'm practicing hard, but

I've got a long way to go.

Kirill Eremenko: Fantastic. Well, so you are in Reading. How long have

you been in Reading for?

Jon Skeet: Just over 20 years, actually.

Kirill Eremenko: Twenty years.

Jon Skeet: Straight out of university, I ended up working for

Digital Equipment. That was in Reading and moved to

my first house in Tilehurst, which is the sort of village

near Reading. It's a bit bigger than a village, but we

tend to call it a village. I've moved within Tilehurst, but

stayed basically there, even from before I was married.

Kirill Eremenko: Wow. Fantastic. You're married now?

Jon Skeet: Yes. We celebrated our 20th wedding anniversary fairly

recently. Yeah. Very, very happily married.

Kirill Eremenko: Wonderful. That's so cool. Congratulations.

Jon Skeet: Thank you.

Kirill Eremenko: Jon, what fascinates me is that from a ... Would you

say Reading is a little place or a big place?

Jon Skeet: Reading is a very large town sort of bordering on being

a city. It's not officially a city, but I wouldn't be

surprised if in the next five or 10 years' time it was


given the official designation of city. It's quite close to

London and there are really good rail links that are

improving over time, actually, so while a lot of people

do commute from the outskirts of Reading into

Reading, an awful lot of people also go from Reading

into London to work in London.

Jon Skeet: But it's great because it's nice and close to London, so

I can get to the office when I need to, and also go to

see plays and musicals, which I love doing. But also,

it's really close to the countryside, so house prices are

bad, but not awful, and I can get to the countryside

nice and easily get to other places in the UK easily. It's

a really nice place to be.

Kirill Eremenko: Oh, fantastic. That is exciting. What I find very

interesting is that from an almost city-sized town of

Reading, which is very exciting that it's growing, from

the town of Reading, you have done something

extremely unfathomable. You are the number one

contributor to a little website called Stack Overflow.

You have answered over 34,000 questions and you

have reached over 276 million people. If I was wearing

a hat, I would take it off for you right now. That is

huge. Congratulations on that.

Jon Skeet: Thank you very much. Thank you, but it doesn't feel

that big a deal, because it's sort of just something I've

been doing for, whatever, 10 years now. I answer fewer

questions than I used to, because there are fewer

questions that sort of seem like they are appropriate

for me to answer, but I do still ... I go on there every

day. I think it's probably been nearly 10 years since I


last missed a day on Stack Overflow, because I take

my laptop on holidays and things.

Kirill Eremenko: Oh wow.

Jon Skeet: I manage to disengage from main work, but I do

always like to keep an eye on what's going on on Stack

Overflow.

Kirill Eremenko: That's very impressive. Your profile has been viewed

over 1.8 million times and it's just incredible how

you've contributed to so many people, such a great

cause. How does that make you feel?

Jon Skeet: Obviously, I'm thrilled to have helped lots of people,

but I think it's worth bearing in mind that there are

lots of other people who have helped huge numbers of

people as well, and huge numbers of people who've

helped just a few people. So, the cumulative effect is

enormous. Now, I am privileged that being number one

draws a certain amount of, potentially, undeserved

praise. There is the sort of myth of Jon Skeet as this

perfect programmer who never needs to consult any

documentation.

Jon Skeet: In fact, just over the weekend, there's been an

interesting Twitter thread where someone, I believe a

venture capitalist, gave his impression of a 10x

software engineer who never needed to consult

documentation, knew every line of code that had been

deployed into production, and various things that I

actually thought weren't particularly positive for really

empowering a team, a whole team, rather than one

person to drive forward a product.


Jon Skeet: But there is this myth of me never writing a line of

code that's incorrect and all kinds of things, which I

hope most people understand just is not reality at

all. I am a pretty regular guy. I make bugs just like

everyone else does. I kick myself after losing an

entire day or two over something that turns out to be a

tiny typo. I happen to have just gained this

mythological reputation by just contributing a bit more

than other people have on Stack Overflow. So yeah, it

definitely doesn't reflect reality, but I enjoy it at the

same time.

Kirill Eremenko: Got you. Thank you. For our listeners, we're going to

set the scene. Jon, you're an expert programmer in C#,

correct?

Jon Skeet: That's right. Yes.

Kirill Eremenko: C#.

Jon Skeet: I have loved C# since it started. I think I played with

some of the betas before it went to a general

availability, 1.0, in 2001, 2002. I've been working with

it, either professionally or on an enjoyably amateur

status, ever since then sort of. I've alternated between

working with Java and working with C# professionally,

but whenever I've been working just with Java

professionally, I've kept up with the C# in my free

time.

Kirill Eremenko: Fantastic. I also have played around with C#. I was helping

one time, my brother created a Sudoku for [inaudible

00:12:52] Salmons in C#, which was fun. I totally love

... My favorite language is C, I would say and then

C++, because of its object-oriented nature. C# is

fantastic as well, although I don't know it really well.

Kirill Eremenko: What I wanted to do is, before the podcast, this works

better for our listeners, Jon and I sat down and

actually discussed what we were going to be talking

about, because as you can imagine, while C# can be

relevant to some data scientists and can be used to

deliver, deploy, develop data science applications in

some cases, in most cases it's not our language of

choice. You might be surprised, what are we going to

be talking about with Jon if he's an expert in C#?

Kirill Eremenko: If you hear some notes about C# in this podcast, and

you are interested in C#, that is awesome. That is for

you, but at the same time, what we're actually going to

be focusing on with Jon is the importance of

community and importance of what it is like to be in a

tech profession. Because there are lots of similarities

between development and data science, and through

his work on Stack Overflow and through his exposure

to the community of developers in Stack Overflow and

this, in general, is a community that's helping each

other out, it will be very interesting to gain some

insights. Because the data science community, as far

as I know, is not that old, not as old as the development

community, so maybe there's some takeaways that we

can apply to the data science community and to how

we interact with each other. That's what I expect we're

going to focus on, but you never know how the

conversation is going to go.

Jon Skeet: Absolutely.


Kirill Eremenko: It's going to be fun.

Jon Skeet: My experience is that, well, certainly podcasts

[inaudible 00:14:33] tend to meander away from what

we expected [crosstalk 00:14:36]. Often, including

things around versioning or dates and times, which

are two other topics that I'm fairly passionate about. I

suspect that we'll find, in the course of this discussion,

that there are various touch points where the

problems that the data science community faces are

similar to the problems that the more regular

programming community faces. There will be various

similarities and, hopefully, a few differences we can

note and sort of learn from each other, new

approaches that we can take.

Kirill Eremenko: Totally. Totally, even this one that you mentioned, the

versioning, that is such an important thing. In data

science, we don't have ... maybe in the silos and in

certain companies, maybe there are certain

frameworks that are coming out where there's a very

rigorous, methodical system for versioning. But,

overall, when somebody starts in data science, that's

the last thing they probably learn.

Kirill Eremenko: They learn about machine learning and so on, but they

don't have this habit of versioning files. Like I, through

my work at Deloitte, where they have very specific

ways to version anything, like I even version my, I

don't know, tax documents, PowerPoints, they all have

like version 1.1, version 1.2, 3.7. Everything I create

almost always has a version. Whereas in data science,

I don't think that's the case. Tell us about the

importance of versioning in development.


Jon Skeet: Within the .NET community in particular, we've

adopted SemVer, Semantic Versioning, which is not

.NET specific and is fairly widely used within

programming where an artifact, whether that's a

library, an application, whatever, but probably

something that other people will depend on, they need

to know how it's going to behave, that gets a three-part

version number. It has a major, minor, and patch

version and also, potentially, some other information

like a -beta01 or whatever that says, "This is

pre-release and can change sort of arbitrarily." But then,

if you say, "I'm following Semantic Versioning," that

means that, if I've published a 1.0.0 of something,

then if I publish something else within the same major

version number, then it should be backward

compatible. If I publish a 1.1.0, then anything that

was previously using 1.0.0 should be able to upgrade

to 1.1.0 without being broken.
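
The compatibility promise Jon describes can be sketched in a few lines of Python. This is only an illustration, not part of the conversation: the names `Version`, `parse_version` and `is_safe_upgrade` are hypothetical, and the check ignores the special status Semantic Versioning gives to 0.x releases. It treats a change as safe when the major version is unchanged and the minor version does not go backwards, with patch versions free to move either way.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    """A major.minor.patch version with an optional pre-release label."""
    major: int
    minor: int
    patch: int
    prerelease: str = ""


def parse_version(text: str) -> Version:
    """Parse strings such as '1.1.0' or '2.0.0-beta01'."""
    core, _, prerelease = text.partition("-")
    major, minor, patch = (int(part) for part in core.split("."))
    return Version(major, minor, patch, prerelease)


def is_safe_upgrade(current: str, candidate: str) -> bool:
    """Return True if switching from `current` to `candidate` should not break callers."""
    old, new = parse_version(current), parse_version(candidate)
    if old.prerelease or new.prerelease:
        return False  # pre-release versions can change arbitrarily
    if old.major != new.major:
        return False  # a new major version may contain breaking changes
    # Same major version: a newer minor only adds features, and patch
    # versions within the same minor are interchangeable in both directions.
    return new.minor >= old.minor
```

Under these assumptions, `is_safe_upgrade("1.0.0", "1.1.0")` is true, while moving from 1.1.0 to 2.0.0, or to any pre-release, is not considered safe.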

Kirill Eremenko: It should be able to use 1.1.0 without you changing

the code of that thing that's using these?

Jon Skeet: Exactly. There are different levels of compatibility. One

thing would be, and this depends on your

programming language and environment and things,

but in something like C#, which is compiled, there's a

separate compilation step that happens long before

execution, there can be different things where you may

say, "Well, it's source compatible," so your code that

built against 1.0.0 can still build against 1.1.0. There's

the other aspect of binary compatibility, which is,

well, I compiled this code against 1.0.0, but actually

at execution time for whatever reason I am loading


1.1.0 of the library and that should still be okay as

well. Then you get patch versioning, where you should

be able to go backwards or forwards in time. If I could

build against 1.1.5, I should also be able to build

against 1.1.4, so it's sort of forward compatibility as

well as backward compatibility.

Jon Skeet: Then you get into really difficult problems, where

you're writing an application, and you depend on one

library that depends on another library at version one,

but you want to depend on that same library at

version two, and those aren't necessarily compatible

with each other at all. There could be all kinds of

differences and, certainly, in .NET that causes a

problem, because while some aspects of the execution

environment can handle multiple versions being

loaded at the same time, a lot of the tooling doesn't

support it. I wrote a blog post on that fairly recently,

saying, "Hey, we need to get better at this." I don't

know how many dependencies and what level of that

sort of versioning problem you have in data science. I

would say the most important thing isn't even

versioning in terms of making sure that everything has

a number, but at least keeping a versioned history of

things, whether that's in Git or in Subversion or some

other source control, so that you can get back to, "Oh,

I know I had a working version a few days ago. Let me

have a look at that."

Jon Skeet: I've been to some machine learning talks and done sort

of workshops, but don't have significant experience. I

can definitely imagine the importance of keeping a log

of, "Well, I tried this and this was the result." That sort


of goes on to another topic that I'm absolutely

passionate about within programming, which is, how

do you diagnose problems? A lot of that is making sure

that you can keep a log of exactly what you did and

exactly what the result was, and being clear enough

about that without spending hours and hours doing it.

I would imagine that's a skill that data scientists sort

of pick up naturally, because it feels like it's probably

closer to one of your core competencies. I would love it

if the data science community could try to teach the

programming community about how to keep good logs

of what happened when you tried things.
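
One lightweight way to keep the kind of log Jon describes ("I tried this and this was the result") is an append-only file with one JSON object per line. The sketch below is an assumption about how that could look, not something either speaker prescribes; the function names and the field names `tried` and `result` are made up for illustration.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def log_attempt(log_path: Path, what_i_tried: str, result: str) -> dict:
    """Append one 'I tried this, and this was the result' entry to the log file."""
    entry = {
        "when_utc": datetime.now(timezone.utc).isoformat(),
        "tried": what_i_tried,
        "result": result,
    }
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")  # one JSON object per line
    return entry


def read_attempts(log_path: Path) -> list[dict]:
    """Read every logged attempt back, oldest first."""
    with log_path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

Because entries are only ever appended, the file doubles as a timeline of what was tried and in which order.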

Kirill Eremenko: That's fantastic. Before we dive more into

diagnosing problems, I wanted to also mention that

for data scientists, there's a very specific component

that needs to be also remembered, which is that you

not only need to version the code that you're creating,

but you also need to version the data that you're using

to train.

Jon Skeet: Absolutely, yes. The same data set behaving differently

under different versions of your code versus different

versions of the data behaving differently under the

same version of your code.
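
To tell those two situations apart, each result needs to be tagged with both a code version and a data version. A minimal sketch, assuming the data fits in memory as bytes and using a truncated SHA-256 digest as the "data version" (the helper names `fingerprint` and `run_label` are hypothetical):

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Short content hash that changes whenever the data changes."""
    return hashlib.sha256(data).hexdigest()[:12]


def run_label(code_version: str, data: bytes) -> str:
    """Label a result with both moving parts: the code version and the data version."""
    return f"code={code_version} data={fingerprint(data)}"
```

Two runs with identical labels used the same code on the same data. If the labels differ, the label itself says which of the two moving parts changed.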

Kirill Eremenko: Exactly, exactly. That's another moving part in the

equation.

Jon Skeet: Right.

Kirill Eremenko: I love that. There is that similarity of versioning,

but there's a difference in that data is such a crucial part

of what data scientists do. All the more, you need to

not just note what kind of data it was, but have a

backup, preferably, of that data, because maybe that

data is not in your control. Maybe you're getting it

from a server where somebody might change it and

then you're completely stuck. You know?

Jon Skeet: Right.

Kirill Eremenko: You have no way ... It's important, right, in versioning

to be able to go back to the previous version in case

the new version is broken.

Jon Skeet: Yeah, and to know which version you did things

against. We seem to be, whether I am driving it to

topics that I'm interested in or not, there's something

similar in programming that many people are unaware

that they're depending on versioned data with time

zones.

Kirill Eremenko: Oh wow, it's a good one.

Jon Skeet: Many people assume time zone rules just stay the

same forever. Oh, well, you go into daylight saving

time at this time and then you come out of daylight

saving time at this other date, and the rules are set,

but no. The rules change several times a year, and I

don't mean because things go into or out of daylight

saving time, but a country might decide, "We're not

going to have daylight saving time anymore."

Kirill Eremenko: Yeah.

Jon Skeet: In fact, the European Union at the moment is

deciding, I think in principle it has been agreed that

from 2020, I think, countries will have the option of no

longer using daylight saving time, so everyone who has

recorded some data that is time zone aware in some

form or other, they have recorded it with, presumably,

the current version of time zone data that they were

using at the time. But I'd be very surprised if more

than 1% of developers actually recorded, "Yes, I was

using IANA time zone data 2016j, or whatever it is."

The rules that we knew about at the time, which

predict future and past things, is just an aspect of

versioned data that people don't expect to be versioned.
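
A small Python sketch can make the point concrete. Everything in it is invented for illustration: the zone name `Example/Zone`, the two rule sets and their offsets are hypothetical stand-ins for two releases of real time zone data. The same recorded local time maps to a different UTC instant depending on which rules are applied, which is why the rules version is worth recording alongside the data.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical rule sets: zone name -> UTC offset (hours) on the date in question.
RULES_BY_VERSION = {
    "rules-A": {"Example/Zone": 3},  # daylight saving time still expected that day
    "rules-B": {"Example/Zone": 2},  # daylight saving time abolished in the meantime
}


def to_utc(local: datetime, zone: str, rules_version: str) -> datetime:
    """Interpret a naive local time using one specific version of the rules."""
    offset_hours = RULES_BY_VERSION[rules_version][zone]
    zone_info = timezone(timedelta(hours=offset_hours))
    return local.replace(tzinfo=zone_info).astimezone(timezone.utc)


meeting = datetime(2030, 7, 1, 9, 0)  # stored as "9:00 local time"
utc_a = to_utc(meeting, "Example/Zone", "rules-A")
utc_b = to_utc(meeting, "Example/Zone", "rules-B")
print(utc_b - utc_a)  # the two interpretations are one hour apart
```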

Kirill Eremenko: Yeah, I totally agree. Even Russia had this a few years

ago when they stopped using daylight saving times for

a few years and now they've started back using it or

something like that and try keeping track of all those

things. That has a massive impact. Your analysis can

be completely wrong, especially if you're doing

something, I don't know, for example, on the data

relating to financial markets. Bam, all of a sudden it's

not 8:00 AM, it's 9:00 AM or it's not 7:00 AM, it's 8:00

AM.

Jon Skeet: Absolutely. Yes, yes, it really matters. It also matters

how quickly you can get updated data, because some

countries don't give much warning at all that they are

changing their rules. Literally, sort of, there have been

countries that were about to go into daylight saving

time and announced the day before, "No, we're not

going to do that."

Kirill Eremenko: Wow.

Jon Skeet: I had colleagues who were going through airports and

half the monitors in the airport said one time and half

the monitors said a time an hour later. Of all the

places that you really, really want to be sure what the

time is, an airport is absolutely one of them.

Kirill Eremenko: Oh, that's crazy. That's crazy. Okay. You mentioned a

really interesting topic, which I love. Diagnosing

problems.

Jon Skeet: Right.

Kirill Eremenko: Code is code, whether you're coding in a ... Oh, by the

way, can you tell us quickly, you mentioned C#

compiled language, Python, on the other hand,

interpreted language. What's the difference?

Jon Skeet: Yes. I believe that even Python, there can be compiled-

ish versions, but to be honest, I don't know very much

about Python. Where I give opinions on Python at any

time in this podcast, please treat them with a grain of

salt, a very, very large grain of salt, but a compiled

language like C#, you take the source code and you

provide it and any libraries that you depend on, into

the compiler and the compiler outputs a file, which

contains a binary representation. Now, for compiled

languages like C and C++, that compiled

representation is pretty much machine code that can

be executed directly.

Jon Skeet: For C#, it's something called intermediate language,

which is roughly equivalent to Java bytecode. If any of

your listeners are familiar with that. Again, Java is a

compiled language, you get out class files that are in

this bytecode format that the JVM, the Java Virtual

Machine, knows how to run. It gets even more

complicated, because both Java and .NET almost

always take those compiled, sort of binary formats,

not your original text source code, but they then do

something called JIT compilation, which is Just In

Time compilation.

Jon Skeet: They take that sort of nearly machine code and turn it

into actual machine code, so they don't need to go

through that translation step several times. That

happens at execution time, but there's this first bit

where you get to check that all your source code

actually makes sense.

Kirill Eremenko: Got you. Got you. In summary, in some cases, C++

and C compiles straight to a sort of file that can be

run. In the case of Java and C#, it is first compiled to

an intermediate file and that helps find any errors at

compilation among other benefits, of course.

Jon Skeet: Right.

Kirill Eremenko: Then the second, Just In Time compilation, is

required, so you can run in multiple architectures,

again, in addition to other benefits as well.

Jon Skeet: [inaudible 00:26:59] efficiently. There are ways of

compiling, certainly, C# and I believe Java with ahead-

of-time compilation, which is sort of doing the JIT

compilation bit of bytecode into machine code, doing

that ahead of time instead of when it started to run as

well, so there are lots of different options.

Kirill Eremenko: Got you, got you. On the other hand we have

interpreted languages such as Python. Any comments

on that, what's the difference?

Jon Skeet: In theory, if you take your very simplest idea of an

interpreted language, you have this interpreter just


like you have a Java Virtual Machine, but instead of

working with the bytecode, it's working with the source

code, so it runs, and it maybe reads your whole source

file into memory, but then it looks at one line at a time

and says, "Right, what does this line mean? I will

execute the code that's in that line, and by execute I

have to understand what it means." If it's something

like, let X = Y + Z, then it needs to parse all of that

and understand what it means, and then say, "Okay,

now I need to load the value of Y, load the value of Z,

add them together, save them in variable X."

Jon Skeet: Now, the very simplest kind of interpreter, if you have

that code in a loop, would be looking at that line

saying, let X = Y + Z every single time, and have to

understand it. Now, that is massively inefficient.

Everything would run far too slowly to be useful. More

modern interpreters might store some almost like the

IL or the bytecode representation of that, so that it

doesn't have to do the textual parsing every time, or

they might actually do something like the JIT

compilation. Even though it's sort of interpreted, I

think very few languages are genuinely interpreted the

whole time these days, because we've got good at doing

things in, well, that's JavaScript here, V8, et cetera.
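
The difference between re-reading the text on every pass and keeping a pre-parsed form can be imitated in Python with the built-in `compile` and `exec`. This is only a rough analogy for what Jon describes, not a picture of how any particular interpreter works internally; the function names and the one-line `source` are made up for the sketch.

```python
def run_reparsing(source: str, iterations: int, variables: dict) -> dict:
    """Naive style: hand the raw text over on every pass, so it is parsed each time."""
    for _ in range(iterations):
        exec(source, {}, variables)
    return variables


def run_precompiled(source: str, iterations: int, variables: dict) -> dict:
    """Parse the text once, then reuse the resulting code object on every pass."""
    code = compile(source, "<sketch>", "exec")
    for _ in range(iterations):
        exec(code, {}, variables)
    return variables


source = "total = total + step"
slow = run_reparsing(source, 1000, {"total": 0, "step": 2})
fast = run_precompiled(source, 1000, {"total": 0, "step": 2})
```

Both versions compute the same result; the second simply avoids repeating the parsing work inside the loop.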

Jon Skeet: The difference between static versus dynamic

languages and compiled versus interpreted, they are

different things, but often go hand in hand. Static

languages tend to be compiled, dynamic languages

tend to be interpreted. But the difference in execution

time between those two sort of extremes has gone


down massively over time, because we've got a lot

better at dealing with interpreted languages.

Kirill Eremenko: Yeah. One of the differences that somebody

programming with these languages would see, and this

is quite important by the way for data scientists,

because more and more data science is becoming not

just, "All right, let's do some analytics," it's becoming

more product-oriented. Like in certain startups, data

science is embedded into the product, so you will

encounter-

Jon Skeet: Absolutely.

Kirill Eremenko: Yeah, you will encounter times when you will,

especially if you're going into the startup world or

developing new products, you will encounter situations

where you will need to work with compiled languages.

The difference in what you observe can be that, if

you're typing some code in Python and then you run it,

it will run. For instance, let's say you have 100

lines of code and you have an error in line 50, it'll run

the first 49 lines and actually execute them. When it

gets to line 50 it will give you an error.

Kirill Eremenko: In a compiled language, when you try to compile that,

it will give you an error and none of those lines will be

run, so it's important to understand that if you have

some, for instance, data manipulation, data cleaning,

some pre-processing in the first 49 lines of code, in

one case they will be executed and your data will

change. Whereas in the other case they won't be executed,

because you won't be able to compile the file. I think

that's quite an important, quite radical, difference for

people to understand that, not only it's about what

you see, but also the effect that it can have in the

background on anything that you were doing before

that error occurred.
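
A small Python sketch of that effect follows. One caveat worth stating: standard Python (CPython) compiles a whole file to bytecode before running it, so a syntax error stops anything from running, while an error that only shows up at run time, such as an undefined name, leaves the earlier lines' changes in place. The two scripts and the helper run_script are hypothetical, made up for this illustration.

```python
# Hypothetical three-line "scripts"; the variable names are illustrative only.

RUNTIME_ERROR_SCRIPT = """
data.append(4)                      # runs, and changes the data
data.append(5)                      # runs too
total = sum(data) + undefined_name  # fails only when this line is reached
"""

SYNTAX_ERROR_SCRIPT = """
data.append(4)
data.append(5)
total = sum(data +                  # unclosed parenthesis
"""


def run_script(source, data):
    """Run the script against the list and report which error stopped it."""
    try:
        exec(source, {"data": data})
    except (NameError, SyntaxError) as error:
        return type(error).__name__
    return None


runtime_data = [1, 2, 3]
runtime_error = run_script(RUNTIME_ERROR_SCRIPT, runtime_data)

syntax_data = [1, 2, 3]
syntax_error = run_script(SYNTAX_ERROR_SCRIPT, syntax_data)

print(runtime_error, runtime_data)  # the first two lines already ran
print(syntax_error, syntax_data)    # nothing ran
```

In the first case the list has already been modified when the error appears; in the second it is untouched, which is the behaviour described above for compiled languages.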

Jon Skeet: Absolutely. Personally, I would like to see more

support for static languages and, obviously, I would

love to see C# used more in machine learning and data

science in general. If I knew more about data science, I

might be in a place to help that along. As it is, I'm

almost entirely ignorant, so I don't know how we would

do that, but it does come back partly to the aspect of

interpreted languages are often also used interactively.

My understanding is that a lot of data science is done

via Jupyter notebooks and the like, where you're

exploring things as you go. It's not like you write all of

your 100 lines of code and then run it and then find

that there's the problem, but you've built that code up

over time by trying things interactively, and that's

where statically typed and compiled languages tend to,

and this is always sort of caveat of, tend to, there are

exceptions, tend to not deal very well with being done

interactively.

Jon Skeet: You tend to have to do things by creating your source

file beforehand. Now, that's not always the case, and

this may be how we build C# support for data science,

or data science support for C#, depending on which

way you want to think about it, is by allowing C# to be

run more interactively. There are definitely projects

available for that sort of thing already. C# scripting,

approaches and ways of running C# in a browser and

the like. Maybe there will be really good Jupyter

notebook support in the future. I am sure that there

have been some projects that explore Jupyter within

C# already, but they haven't gained the sort of traction

that we'd need to see more mainstream support. But I

think the benefits, as you were saying, of not running

those first 49 lines of code before you find the error at

line 50, there are significant benefits of that, so I

would love to see more support for C# within data

science. I just wish I could help with it, but I just don't

have the knowledge to do so.

Kirill Eremenko: Yeah. It's really a difference in philosophy, isn't it? For

me, when I think of data science versus programming

and compiled languages, data science, like you said, is

very explorative in nature and even if you're not just

looking for insights, you know you want to build the

model, it still requires exploration of different

approaches during the process. For me, the way I

imagine it, the analogy, is like building a sandcastle.

You are trying this out, this falls over, you put a new

tower on, then you put the wall and then the water

washes it away. You build it again and so on. It's like

always you're playing with clay or sand, this type of

creative approach.

Jon Skeet: Right.

Kirill Eremenko: Whereas in programming, especially in the world of C#

and more in the compiled languages, it was a long time

ago for me, so you're much better placed to draw the

analogy here, but does it ... It almost feels like you

have a blueprint of what you want in advance. It's like

you're building a castle, not out of sand, but of little

bricks or a Lego piece.

Jon Skeet: To some extent. To some extent. With more test-driven

approaches, it's often, well, you write the test and then

you make sure that that's implemented. I don't want

anyone to get the impression that you write the whole

application and then you can run some of the code.

You can still do things iteratively, but it's less

interactive iteration. Now, it can feel somewhat close to

it if you get a really tight test run, write some code,

run the test again, et cetera, when you can get that

loop fairly tight, it can be pretty good. I would want to

mention F# at this point.

Jon Skeet: F# is a functional language which is still compiled to

IL. You can inter-operate between F# and C# and other

IL languages, VB, et cetera, but F# was designed from

the start to support this interactive exploratory mode.

My understanding, not as an F# developer, is that a lot

of F# work does happen in that exploratory mode like

data science, so maybe actually thinking about what

would be a good language for data science in a

compiled statically typed way would be F# rather than

C# and maybe we can build on that F# work to also

support C# over time. My understanding is some data

scientists do already use F#.

Kirill Eremenko: No, I'm very interested. I didn't know about F#. That's

very exciting. Okay. Let's move on to something you

touched on, how you diagnose problems. Are there any

best practices of problem diagnosis that, like in code,

that you can share with data scientists, because code

is code. Even though it's interpreted or compiled,

whatever, it's still a very, in any country, this was

what I love about coding.

Kirill Eremenko: You can know how to code in Europe, you can know

how to code in Africa, you can go then to China and

code there. Code is pretty much the same around the

world. Are there any best practices, something you've

developed throughout your career, that you can share

on how to find those errors in the code? Sometimes,

errors, they don't even pop up as an error, but it's

there.

Jon Skeet: Right. Yes. There are so many different sort of

categories of error. There are errors that you find at

compile time that you don't understand why this

doesn't work, why it won't compile, and they're

relatively easy, generally. There are errors where

things go bang, with exceptions, at execution time and

they can be reasonably easy to find and fix. There are

errors where, "My code all runs, it just produces the

wrong output," and that's where things start getting

harder. Then it gets really hard with, "My code runs

and produces the right outputs on my machine, but

the wrong output in production," and that's fairly

hard.

Jon Skeet: Then it gets even harder with, "The code runs and

always produces the right output on my machine and

99% of the time it produces the right output in

production as well, but just occasionally it's very, very

slightly wrong." Diagnosing those errors can be really

hard. We should probably timebox this almost,

because I can talk about diagnostics for a very, very

long time and I'm hoping, eventually, to write a book

about how to get into diagnostics, because this feels

like the silver bullet that I'm being, without trying to

be too immodest, I'm pretty good at diagnostic things

and that is the reason that I am able to help people on

Stack Overflow.

Jon Skeet: You give me a problem and so long as you have given

it to me in a sufficiently well specified way, ideally so

that I can reproduce the problem, then I can apply the

diagnostic steps and help you get to an answer. Now, I

can do that, but if I have to help 100 people that way,

then I have to go through the diagnostic steps 100

times. Obviously, it's far more efficient if I can improve

the level or help improve the level of diagnostic skills

throughout the community and then each of those 100

people could have diagnosed the problem themselves

and fixed it themselves. Now, that's-

Kirill Eremenko: Got you. Well, this is a great opportunity to try out

what you are going to share in the book.

Jon Skeet: Right. That's a simplification. There are times where

you need more knowledge than you have, et cetera,

but I think the silver bullet in diagnostics is divide and

conquer. You have a whole application that might

involve some XML parsing, producing some JSON, it's

interacting with a database, it's interacting with a web

service. It's got a web front-end. All these things and

something is wrong. The first thing to do, in my

experience usually, is try to isolate where the problem

is.

Jon Skeet: If you can reproduce it without doing any XML

parsing, then the problem, probably, isn't in the XML

parsing. It may well be that that has problems as well.

One of the fascinating areas in diagnostics is where

you happen to notice other problems as you're

tracking down one major thing, but it's really easy to

be sidetracked by, "Oh, it turns out that's broken, and

the third thing is broken. Everything's broken," and

trying to work out-

Kirill Eremenko: Or when you have two problems that cancel each other

out.

Jon Skeet: Yeah, that's even worse. You think, "Well, why is that

code doing that? I will fix that as I go along. Oh no,

that's now caused another problem," and making sure

that you keep some kind of record of, "Well, I need to

go back at some point and fix those things," but

without losing track of where you are on your main,

"This is the problem I'm trying to fix."

Kirill Eremenko: Divide and conquer.

Jon Skeet: Try to reduce things ... Sorry?

Kirill Eremenko: Divide and conquer.

Jon Skeet: Yes. Divide and conquer. Try to reduce things to the

smallest program that you can find that demonstrates

the problem. Not just the smallest in terms of source

code size, but the smallest in terms of environment

and the friendliest in terms of environment. I'm a big,

big fan of console applications, not necessarily for

running them. I do like building tools that are console

applications that do something useful, but if I'm trying

to diagnose a problem in a web application, but I don't

think the problem is in the webiness of it, but yeah.

Say, something I will be diagnosing today is trying to

make calls to a diagnostic service, a Google API

diagnostic service from a web application.

Jon Skeet: Now, that probably is going to depend on the webiness

of things, because it goes into the logging framework,

but if in the same application I found that I had a

problem talking to our speech API, for example, then

that wouldn't be specific to the web application. I'll

probably pull that out into a separate console

application, hard code the data rather than taking it

from the user, because it's then easier to share the

program and easier to reproduce without having to

type in the inputs every time and then I've got a

console application that's 30 lines long or something

that doesn't behave as I want to. That's a really small

amount of code to debug.

Jon Skeet: I can launch a debugger for a console application

really easily without having to work out all of the

intricacies of setting up the debugging for a web

application, which ... It's not very hard these days, but

it's just extra steps, so you're trying to get this as

small as possible, as simple as possible. If you think

about the amount of code that's involved in a console

app compared with a Windows GUI or a web app or a

mobile app, all of these things, if you can get it into

some really simple form, then it makes things so much

easier to work with. Sometimes, you will still need to

debug into it and you try lots of different things, but

often just having simplified things, having separated

out all the 99% of code that doesn't matter, so that

you can focus on the 1% that does matter.

Jon Skeet: Suddenly, it becomes really obvious and you think,

"Well, why didn't I see that before?" Well, because you

had all this other code that was potentially wrong but

turned out not to be. So yeah, it's isolating the

problem and knowing how to say, "Okay, for the

moment, I will hypothesize that the problem isn't in

the XML. I will just hard code the data that would

normally be parsed from the XML. Maybe I will

hypothesize that it's not in the formatting of the JSON,

so I will hard code the JSON output," or whatever it is.

Getting rid of those dependencies is the big thing for

diagnostics within programming.
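
Here is a minimal sketch of that isolation step in Python, assuming a hypothetical three-stage pipeline (XML in, a calculation, JSON out). The function names, the sample data and the deliberately seeded bug in total_quantity are all invented for the illustration.

```python
import json
import xml.etree.ElementTree as ET


def parse_orders(xml_text):
    """Stage 1: parse the XML input into plain dictionaries."""
    root = ET.fromstring(xml_text)
    return [
        {"item": order.get("item"), "quantity": int(order.get("quantity"))}
        for order in root.findall("order")
    ]


def total_quantity(orders):
    """Stage 2: the calculation. Seeded bug for the demo: it skips the first order."""
    return sum(order["quantity"] for order in orders[1:])


def to_json(total):
    """Stage 3: format the result as JSON."""
    return json.dumps({"total": total})


# The full pipeline gives the wrong answer (the total should be 5).
xml_text = '<orders><order item="a" quantity="2"/><order item="b" quantity="3"/></orders>'
full_output = to_json(total_quantity(parse_orders(xml_text)))

# Hypothesis: the problem is not in the XML. Hard-code what the parser
# would have produced and call the middle stage directly.
hard_coded_orders = [
    {"item": "a", "quantity": 2},
    {"item": "b", "quantity": 3},
]
isolated_total = total_quantity(hard_coded_orders)

print(full_output)     # wrong total from the whole pipeline
print(isolated_total)  # still wrong with no XML and no JSON involved
```

Because the wrong total still appears with the XML and JSON stages removed, the search narrows to one short function that is easy to step through in a debugger.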

Jon Skeet: Now, I don't know how well that sort of transfers into

data science. Maybe you can simplify your data sets. If

you've got an enormous dataset and it's giving you

some strange results, I can see there being significant

problems in saying, "Well, I will take a much smaller

part of the dataset." Whether that's fewer data points

or removing half the features and saying, "Well, I'll

only concentrate on the people's name,

address and age or something and see whether I still

get the same results." I would imagine that

everything's so intricately bound in data science

models, that you could even easily take the wrong

steps there, but, hopefully, some of it transfers.

Kirill Eremenko: No. No, definitely. It's very valuable advice to divide

and conquer. The way I see is, basically, you have lots

of degrees of freedom of where the error could be, try

to cut as many of them off as possible.

Jon Skeet: Absolutely.

Kirill Eremenko: To lock them in. In terms of data science, again, I

wouldn't remove the features, but in terms of limiting

a dataset, that is actually a quite common practice to

develop a model with 10% of your dataset and then

only expand to 100% or whatever you have, 50%, later

on. That could, totally, be applicable.
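
A rough sketch of working on a 10% sample, using only the Python standard library. The helper sample_rows, the fixed seed and the synthetic rows are assumptions made for the example; in practice the rows would come from your own dataset.

```python
import random


def sample_rows(rows, fraction, seed=0):
    """Return a reproducible random subset holding roughly `fraction` of the rows."""
    rng = random.Random(seed)  # fixed seed so the same subset comes back each run
    sample_size = max(1, int(len(rows) * fraction))
    return rng.sample(rows, sample_size)


# Synthetic stand-in for a real dataset.
rows = [{"id": i, "age": 20 + i % 50} for i in range(1000)]

subset = sample_rows(rows, 0.10)
print(len(rows), len(subset))
```

Fixing the seed matters for diagnosis: if a strange result shows up on the subset, the same subset can be drawn again to reproduce it.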

Kirill Eremenko: Of course, every situation is different and it's case-by-

case basis, but having this philosophy in mind of,

"Okay, I have an error, it's quite a large code. All right,

let's hard code certain things into it. Reduce the

degrees of freedom and try and reproduce the error," I

think you're totally right. That's your silver bullet.

Jon Skeet: Yeah.

Kirill Eremenko: Awesome. Okay. Thank you. That's a great tip on

diagnosing problems. Any other comments? How do

you not make the errors in the first place?

Jon Skeet: Oh-oh.

Kirill Eremenko: Or how do you ... When you're coding-

Jon Skeet: If I knew how to do that [inaudible 00:45:14].

Kirill Eremenko: Maybe there's some best practice, like you code 10

lines or 100 lines and then you review them or you ... I

don't know. What's something that you've developed?

What's your secret sauce?

Jon Skeet: The two best practices that spring to mind are very far

from secret sauce and are pretty widely used these

days are code review and tests. If I come to a code base

that doesn't have any tests, that scares me, because I

don't know ... If I make a change, I don't know whether

that's going to have some adverse effect on some other

bit, but a well tested code base, ideally, with different

kinds of tests. There's a certain amount that you can

do with unit tests, where you're not interacting with

anything external. You try to test one piece of code in

isolation from everything else. Something that's only

got unit tests, okay, that's not so bad, but I also

generally like to see integration tests.

Jon Skeet: Where in unit tests you tend to fake out external

dependencies, I will assume that my database behaves

in this particular way, and so I'll fake out the

interaction between my code and the database. Well,

that's fine so long as I'm right about the assumption. I

want to see some integration tests as well, which use

the actual database to say, "Well, what happens when

I really, really try to do this against the database," or

against the web service or whatever it is? Those tend

to be more expensive, either in terms of resources and

time to set up the test, time to execute the tests. They

may be actually financially expensive if you have to

call some API.

Jon Skeet: I might have tens of thousands of tests that are unit

tests, because they're essentially free, but if I'm calling

some translation API and I want to call that 10,000

times within my tests, then that's going to take a long

time, because it takes a lot longer to make any kind of

network call than to do stuff in memory, and if I'm

doing that an awful lot, then I may be billed for those

translations that actually don't end up being useful to

me other than to verify the code. I would want to be

able to run all the tests frequently, even if I haven't

made changes in half the things, so to some extent,

I'm not getting much value for all those API calls that

I'm making.

Jon Skeet: That's why you want to have a good balance between

unit tests that are free but sort of limited usage,

versus the more expensive in time or billing or

whatever it is, integration tests that test far more of

the system. I would say, integration tests when they

fail, you've then got a significantly bigger job to

diagnose why they're failing, because you've got a lot

more surface area, you've got more degrees of freedom

as you put it. Whereas, the unit test is generally only

testing one of those degrees of freedom. If something

goes wrong, you immediately know, "Well, it must be

in that bit." There are different pros and cons for those

different kinds of tests, but they definitely help to

reduce how many bugs will get into my code as well

as code review.
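
As a minimal sketch of a unit test that fakes out an external dependency, here is a Python example using the standard unittest module. The FakeTranslator class and the greet function are hypothetical; the fake stands in for a billed, network-based translation API, so the test costs nothing and runs in memory.

```python
import unittest


class FakeTranslator:
    """Hypothetical stand-in for a real translation API: no network, no billing."""

    def __init__(self):
        self.calls = 0

    def translate(self, text, target_language):
        self.calls += 1
        return f"[{target_language}] {text}"


def greet(translator, name, language):
    """Code under test: it needs a translator, but not any particular one."""
    return translator.translate(f"Hello, {name}!", language)


class GreetUnitTest(unittest.TestCase):
    def test_greet_translates_the_greeting_once(self):
        fake = FakeTranslator()
        self.assertEqual(greet(fake, "Ada", "fr"), "[fr] Hello, Ada!")
        self.assertEqual(fake.calls, 1)


# Run the test case programmatically and keep the result object.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(GreetUnitTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

The test only proves that greet behaves correctly if the real API behaves the way the fake assumes. An integration test would pass a real client into the same greet function to check that assumption, at the cost of time and possibly money.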

Kirill Eremenko: Fantastic. Thank you so much. This part of our

discussion, I think, has been very useful, especially for

data scientists who want to get an edge in terms of

being prepared for product development and

integration of data science into product. I think that's

a big thing to put on your resume.

Jon Skeet: Right.

Kirill Eremenko: [inaudible 00:48:49] to this. They can totally do that.

In the interest of time, I only have about 10 minutes

left. I do want to talk a bit about communities.

Jon Skeet: Absolutely.

Kirill Eremenko: This is a very important topic. We actually ran a

survey recently among our students. We have about

850,000 students on Udemy studying data science

and close to 100,000 on SuperDataScience. This

survey, I think, went out just to the SuperDataScience

community, or SuperDataScience students. One of the

biggest, I was actually talking to our business

development manager today and he said one of the

biggest insights that we got from the survey is that

people want a community. That, right now, the way

data science is structured, the way people are learning

it, it's a very hot topic. Data scientists want to learn,

data scientists are needed in the job market and

people are learning.

Kirill Eremenko: But one of the things that still lacks at the moment is

that, whilst there are courses and the knowledge is out

there, and we are one of the providers of this

knowledge, one of the things that we could do better is

we could create a community for people to interact to

have some kind of feedback loops, have some friends,

have some buddies, have some conversations,

interesting talks with people and so on and learn from

each other, support each other, mentor each other.

That's a big thing. In the development world, as we

discussed at the start, community has naturally

evolved, and it's already been around for longer than

in the data science world. Tell us a bit about the

development community. What is it like and why do

you enjoy being part of it so much?

Jon Skeet: The first thing to say is that, it's not like there's one

development community. I'm sure you know that

already, but it's important to emphasize because, not

only are there lots of different communities, there's the

C# community, the Ruby community, the Java

community, et cetera, and even within that, there are

different sort of sub-communities that are massively

overlapping and different ways that those communities

come together. Just some examples where I'm involved

in community, Stack Overflow, obviously, it's not

trying to be a social network.

Jon Skeet: In some ways, it's impersonal, but it's a community of

learning. I think it's important to think about each

community, what you're hoping to get and think about

whether you want people to be taking personal time

and getting to know other people in the same sort of

area, because something like Stack Overflow only does

that marginally, because that's not its aim. Its aim is

to have questions and answers. Whereas, at the other

end, you've got user groups. I go to the Reading C#

user group or Reading .NET and I've spoken at many

other user groups, and often there's much more of a

feeling of, "Hey, I want to talk to other people in the

same space."

Jon Skeet: That's partly for sharing information, just getting to

know people, because it's always nice to get to know

people. We are sociable people, finding out about job

opportunities, learning more information, et cetera.

That's typically in-person and much more of a social

aspect. Then somewhere in between, there's

conferences where during the conference talks,

obviously, that's mostly one-to-many. The conference

speaker speaking to the room and getting some

feedback, but it's relatively rarely discussion oriented.

But then in the halls between talks or if you're brave

and [inaudible 00:52:43] say, "Well, I'm not going to

get into any talk now, I'm just going to hang out in the

lobby and talk with other people," and that's a totally

valid thing to do, it can feel a bit odd to start with, but

once you get used to it, it's a great thing.

Jon Skeet: If there's no talk that you're particularly interested in,

just chat with people and you'll get loads out of it.

There are more organized bits of communities. I'm on

the board for the .NET Foundation, which tries to be a

sort of community hub in terms of supporting various

.NET projects, acting as, to some extent, a bridge with

Microsoft, so we can represent .NET users in a

cohesive way. Yeah, the .NET Foundation is still

finding out what needs it needs to meet and how to

meet them, but it tries to be an online community

enabler as much as a community in its own right.

There are all these different ways that you can be

community and I would certainly encourage the data

science community to think along similar lines of, it's

not like there needs to be one big player.

Jon Skeet: Arguably, it would be better for there not to be a single

big player that, if it doesn't work for some data

scientists for whatever reason, then they feel they

don't have a community to be part of. Having lots of

smaller self-organizing communities, I think, is

generally better. To sort of anticipate a potential

question, make sure right from the very, very, very

start that those communities are as inclusive as

possible right from the start. Just do not tolerate any

discrimination, whether that's in terms of the obvious

aspects of race and gender, sexuality, et cetera, but

also in terms of, be friendly to newcomers. There are

some communities that have a reputation for being

really hostile to people who are just trying to get into

that community.

Jon Skeet: Why would you want to do that? If you're passionate

about cookery, then why would you want to discourage

other people from becoming better at cookery? Even if

they're saying, "Well, currently I'm trying to boil an egg

and it keeps just cracking, because I haven't got any

water in there." Oh well, you don't say, "Well you're so

stupid for not putting water in there," you say, "Well,

okay, let's take a step back. Yeah, you need water. Let's

see what other things you might be doing that could

be improved." Try to bring people along rather than

being some sort of "I'm better than you" kind of

community. Community shouldn't be competition.

There can be competitive elements, if you enjoy some

competitive coding competitive ... I know that there are

data science competitions, and that's fine, but that

needs to be one aspect and not the whole idea of the

community to prove who's best. That doesn't really

help anyone.

Kirill Eremenko: Totally. I totally agree with you. We actually have a

conference in San Diego that we run every, what is it,

this year or September? At the end of September this

year, it's called DataScienceGO, and one of the things

that somehow just happened and now we are very

happy about it and we're promoting it in the sense

we're supporting it as much as we can is that it's a

very diverse community. We have, sort of last year we

had like 350 people attend from all different walks of

life, all different backgrounds, countries.

Kirill Eremenko: We had a much larger representation of female data

scientists and aspiring data scientists. In the data

science world right now it's about 10%. I think we had

over 20 or 30% at the conference. What we're trying to

do to promote this is, we were trying to showcase and

specifically invite speakers from minority groups or

female speakers in data science. Not to reverse

discriminate against male data scientists, but provide

a platform for anybody to show that it can be done,

that you can achieve success and create these role

models for people to look up to and to learn from. I

think-

Jon Skeet: I hope you're inviting them to talk about data science,

not to talk about-

Kirill Eremenko: Of course, yeah. Of course.

Jon Skeet: It's like, "Oh, what's it like being a woman in data

science?"

Kirill Eremenko: No.

Jon Skeet: Maybe somebody may want to talk about that, but the

main thing is recognizing that there are some awesome

women in data science and, in fact, most of the

conferences ... I go to various developer conferences

and there are often machine learning topics, and I

would say 90% of the conference talks that I've been to

on machine learning have been given by women who

have been awesome and really, really good.

Jon Skeet: I have to say, I'm disappointed to hear that the data

science community is only 10% women, because I'd

heard great things about it being nearly 50-50 at

various user groups and things. Maybe there are

pockets of data science community which already have

discovered whatever secret sauce it is. I should caveat

that any secret sauce like this is likely to involve a lot

of hard work.

Jon Skeet: Being inclusive is not just a matter of saying, "Well,

I'm not going to be nasty to anyone." You need to be

actively inclusive and watching out for any problems

that will put people off. But I would've thought data

science, even more than development, really needs to

be diverse, because you'll be dealing with data that is

diverse and I suspect I'm preaching to the choir here

when I say that, if your data is biased, then your

results will be biased.

Jon Skeet: I think having a diverse community has to be part of

trying to ensure that you don't have biased data and

that you can challenge assumptions that come in all

through the process. Whether that's collecting the

data, working out what data to collect to start with,

and then how you process the data, et cetera. It feels

like if the data science community isn't diverse, you

really face an uphill battle there.

Kirill Eremenko: Yeah, I totally, totally understand what you mean. As

you mentioned before the podcast, there's quite

substantial dangers of having a homogenous

community, whether it's in the development space or

the data science space. There's been studies-

Jon Skeet: It's a danger, not only to the results, but also to how

enjoyable it is. There are several reasons to be diverse,

to encourage diversity and there are obvious moral

reasons: discouraging people, excluding people,

however implicitly, is just plain wrong. But it's also not

as fun for the people who are in the community.

Diverse communities are more interesting to be part

of. As well as getting better results and all kinds of

things, there's only benefits, pretty much.

Kirill Eremenko: Yeah, yeah. I was about to say that there are studies to

show that in diverse teams when you have people from

all sorts of minority groups or from male and female

representatives and from as many nationalities as

possible, diverse groups like that get much better

results.

Jon Skeet: Right.

Kirill Eremenko: It's a question of why, maybe it's the difference of

opinions, difference of backgrounds, the different sort

of communication and how people challenge each

other and things like that. There can be multiple, like

millions of different reasons for that, but the fact is the

fact. If you measure the results, diverse groups,

whether it's in teams of developers or executive teams

or data science teams, diverse groups get, in general,

on average, better results.

Jon Skeet: It correlates with other aspects. If you have teams

where psychological safety is valued, so that people

feel that they can voice minority opinions and you can

have a minority opinion, even if you're generally a

homogenous group. I could be sitting in a group of

other cis, straight, white males and still disagree with

them, but being in a team where that is okay to do and

is encouraged and supported and people don't get

shouted down, then that in itself is great.

Jon Skeet: Psychologically safe teams are likely to be more

attractive for members of minority groups who might

feel deliberately excluded from other places. There are

benefits that sort of bounce off each other. You're more

likely to get a diverse team the more you encourage

diversity, which in itself is useful even in a non-diverse

team, encouraging the kind of practice that is

attractive for diversity.

Kirill Eremenko: Why, is it because it's like a self-fulfilling prophecy, almost?

Jon Skeet: Almost. Yeah. Yeah, and it takes work. I want to keep

emphasizing this. It's not something that happens

because you say, "Okay, from now on we're not going

to be jerks."

Kirill Eremenko: Yeah.

Jon Skeet: That needs to be step zero, but it needs to be, "Okay,

well, I'm going to watch myself and others for behavior

that I think might've excluded people or just not

encouraged ... not got the best from everyone." That's

what it's really about, is making sure that everyone is

contributing as well as they possibly can, so even if

you never told that junior team member to shut up,

the fact that all the senior team members were

constantly speaking and interrupting each other, may

have made there be no space for that junior team

member to speak up. So you're not getting the benefit

from them. Why have them there, but not get the value

from them?

Jon Skeet: It's watching very consciously and thinking, "How can

we do better?" Sometimes, that will be a case of calling

out yourself and saying, "Okay, I've realized that I've

been interrupting and I need to stop doing that."

Sometimes, it can be calling out other people and that

aspect of psychological safety that it's okay to call out

other people and it's okay to be called out and your

reaction needs to be not a defensive, "Well, I didn't do

that ..." but, "Okay, I will acknowledge that. I will

think about that and I will try to improve." That more

positive environment takes work and we mustn't

underestimate how much work it takes, but also the

benefits are so enormous.

Kirill Eremenko: Yeah. That totally echoes, for example, what Naval

Ravikant says, he's the founder of AngelList, that even

if you're selfish, even if you just want the ultimate best

career for yourself and just ... like, you only think

about yourself, even in that hypothetical case of an

extremely selfish person, it's in their best interest to

maximize the output that anybody on the team can

give regardless of their background. Because that way,

you're exploiting their talents, which are inevitably ...

Everybody has different talents.

Kirill Eremenko: There can be major differences, there can be minor

differences, but there are different talents and there's

no two people that are the best at exactly the same

thing. You want to exploit other people's talents the

most, so that the team gets the most out of it, so that

you create amazing products, you change the world,

you do crazy cool things and that will allow you to

grow, allow you to get the best benefits, allow you to

progress your career as fast and as quickly as

possible.

Jon Skeet: Absolutely.

Kirill Eremenko: I totally agree with you. It's all about creating this, I

love how you put it, psychologically safe space for

people to feel that they can speak up and share their

opinions.

Jon Skeet: And it's playing the long-term game rather than the short-term

game. To take out any sort of identity politics that

some people might feel difficult about, or might find

difficult. Let's suppose I wanted to make sure that the

industry only rewarded people with a surname beginning with

S, because my surname begins with S. I could see

immediately that that gives me a massive pay rise, I'm

massively in demand, it's all great. Unfortunately, that

means that all the people who would be contributing

great features, like the lead designer of C#, Mads

Torgersen. His surname begins with T rather than S.

Therefore, I won't benefit from anything that he would

be contributing to the C# language.

Jon Skeet: All the C# compiler authors whose names don't begin

with S are suddenly excluded and I don't get a decent

C# compiler. So yeah, I might be well paid right now,

but I have to work with crummy tools. I don't get the

best hardware. I don't get the best software. I don't get

the best, all the rest of the environment, because I've

decided to exclude people. Realistically, I don't think

most people do want to exclude people. They just don't

... they can sometimes feel, "Well, if I widen my circle,

then it means I get a smaller slice of the pie." It's really

about making that pie ever bigger and bigger and

bigger.

Kirill Eremenko: Fantastic. Jon, those are great examples. I

think people now listening to this podcast, if they

weren't agreeing before, now definitely are on board

with the implications and the theory behind it. What

are some practical steps that people [inaudible

01:06:51] take? You already mentioned the importance

... Sorry, you already mentioned a step at the start.

Could you repeat that one? Maybe there's some other

steps that people can take, practically, to help create

these safe environments and safe spaces.

Jon Skeet: I sort of notionally said step zero of saying let's not be

jerks. To some extent that is the first step, but actually

it's learning about things. I started getting into

feminism about four years ago and found just how

much I needed to learn. The same is true in terms of

all kinds of aspects of discrimination, which you might

sort of discount as being, "Well, why do we need

feminism? Now women have equality, they have equal

rights. There can't be a gender pay gap, because the

law says that there isn't allowed to be." Well, it's more

complicated than that. Just take a humble approach

to learning, and if you actively engage, there's so

much material out there to find out where our

industry has gone wrong and the situation we're in

and there's no point in trying to be part of a solution

without understanding part of the problem to start

with.

Jon Skeet: The other sort of warning I would give is the

temptation to solutionize, to use jargon. A speaker

called Rhonda Bergman put this really, really well. I

don't think it was her analogy to start with, but she

expanded on it very well, talking about the

difference between knights and allies. That a knight

goes in and wants to solve a particular situation and

wants to end up being the hero, and that should never

be the point of it.

Jon Skeet: An ally will go in and take a broader sweep of, "Okay,

what's wrong here and how can I contribute to the

solution for the sake of the solution rather than to be

some sort of hero?" It generally addresses the causes

rather than the immediate symptoms. Obviously,

immediate symptoms if someone's being bullied or

harassed or whatever it is, they need to be addressed,

but just addressing those, and saying, "Right, job done

now," without addressing the underlying causes, ends

up not being nearly as productive long-term. So it's

really about educating yourself and trying to find out

where you can be of help rather than where you can

lead the charge, because chances are someone else's

already leading the charge and could do with your

support rather than could do with someone else sort of

diluting the message.

Kirill Eremenko: Yeah, that's a very valuable piece of advice, that you

can't have, like in any undertaking, you can't have

everybody be a leader. You need leaders and followers

and it's totally fine to be a follower. Even if you can

contribute in a minor way to this, that's already going

to be a massive benefit. Jon, thank you so much. In

the interest of time, please could you share with us,

where can our listeners follow you, find you, read more

about your work or ask you a question maybe?

That would be so fantastic.

Jon Skeet: Sure. I'm on Twitter, my Twitter handle is just Jon

Skeet. I have a blog at blog.Jonskeet.uk and

codeblog.Jonskeet.uk. My non code blog is mostly

around feminism, although this weekend I posted a

recipe for Tiramisu and Tiramisu ice cream. It's

anything non-codey, and the code blog is what you'd

expect it to be, and, obviously, on Stack Overflow.

Kirill Eremenko: Fantastic. One last question before we finish up. Is

there any book you can recommend to our listeners

that has impacted your life?

Jon Skeet: Yeah. It sort of fits in very well with data science and

what we've been talking about. I'd like to recommend

Everyday Sexism, compiled by Laura Bates. The

Everyday Sexism Project is a project that Laura started

when she had a terrible experience on a bus once and

speaking to other women found that her experience

was not uncommon, but wasn't being talked about.

This book has lots of data points in it.

Jon Skeet: Lots of citations of reports and concrete data, so if you

think that sexism isn't a problem, then read the book

and see concrete evidence about it. It's sort of

simultaneously inspiring and terrifying, I find. There

are lots of other books around feminism and

particularly in the tech industry, there's a book called

Brotopia, which looks at sexism in Silicon Valley, but

yeah, my main recommendation would be

Everyday Sexism, compiled by Laura Bates.

Kirill Eremenko: Thank you very much. Everyday Sexism, Laura Bates.

Everybody, check it out. Jon, I just want to say thank

you so much. I became a fan when I saw how much

you contribute to community and now after this

conversation we had, I'm a huge admirer of what

you're doing, both in the space of helping community

and your expertise, unquestionable expertise in the

domain of coding and how passionate you are about

building community and making it accessible to

everybody and making sure that all minorities are

respected and everybody has ... that equality is there.

Thank you so much for spearheading this space in the

world.

Jon Skeet: Thank you for having me on the podcast.

Kirill Eremenko: Thank you everyone for tuning into the

SuperDataScience podcast. Super appreciative of you

being here. That was Jon Skeet, the top contributor on

Stack Overflow joining in for today's conversation. How

amazing is that? I completely enjoyed our conversation

about the technical aspects at the start and about the

community and being all inclusive towards the end. I

hope you got a lot out of this. One thing to always keep

in mind is that indeed data science is becoming

ubiquitous and, eventually, you'll see it be embedded

all over the place.

Kirill Eremenko: It won't be just creating models, providing business

advice, but you're already seeing this being

embedded into products and that includes products

where development is required, websites and apps and

different ... Basically, programs run everything we see

around us, whether it's Alexa that's in your kitchen or

whether it's a washing machine or an airplane, it is

code that's running that. As data science and machine

learning, AI, gets more and more integrated into that,

we will need to understand better the world of

developers and developers will need to better

understand the world of data science. If you want to

get ahead of the competition, if you want to have a

significant advantage or an additional significant

advantage on your resume, in your career, this is

definitely something to look into.

Kirill Eremenko: Data scientists who understand what development is

all about, understand these differences that we talked

about, such as compiled versus interpreted languages.

What is versioning, how that affects developers, how

that affects data scientists, what kind of problems you

want to diagnose. What is the silver bullet in code

diagnostics? The divide and conquer principle, what

are code reviews, what are tests? All the things that we

talked about, taken out of context, might seem that

they are too far-fetched for data scientists. Actually,

it's a massive advantage you can add to your career. I

hope you enjoyed this podcast and got a lot out of it.

Kirill Eremenko: Make sure to follow Jon. His Twitter handle is

@jonskeet, spelled J-O-N S-K-E-E-T, without the H.

J-O-N-S-K-E-E-T. Make sure to follow Jon. He already

has 638,000 followers and that means he's doing

something right and, obviously, sharing very valuable

knowledge. As always, you can get the show notes for

this podcast, including all the materials mentioned

and including the book that Jon mentioned at www ...

Well, you can get a link to the book, not the book

itself, of course, at www.superdatascience.com/285.

That's superdatascience.com/285. There, you can also

find the transcript for today's episode.

Kirill Eremenko: On that note, thank you so much for being here today

and I look forward to seeing you back here next time.

Until then, happy analyzing.