TRANSCRIPT
SDS PODCAST
EPISODE 285:
BRINGING DEV &
DIVERSE
COMMUNITIES INTO
DATA SCIENCE
Kirill Eremenko: This is episode number 285, with top contributor on
Stack Overflow, Jon Skeet.
Kirill Eremenko: Welcome to the SuperDataScience podcast. My name
is Kirill Eremenko, Data Science Coach and Lifestyle
Entrepreneur, and each week we bring you inspiring
people and ideas to help you build your successful
career in data science. Thanks for being here today,
and now, let's make the complex simple.
Kirill Eremenko: This episode is brought to you by SuperDataScience,
our online membership platform for learning data
science at any level. We've got over two and a half
thousand video tutorials, over 200 hours of content
and 30-plus courses with new courses being added on
average once per month. You can get access to all of
this today just by becoming a SuperDataScience
member. There are no strings attached. You just need to
go to superdatascience.com and sign up there, cancel
at any time.
Kirill Eremenko: In addition with your membership, you get access to
any new courses that we release plus all the bonuses
associated with them. Of course, there are many
additional features that are in place or are being put in
place as we speak, such as the Slack channel for
members, where you can already today connect with
other data scientists all over the world or in your
location and discuss different topics such as artificial
intelligence, machine learning, data science,
visualization and more. Or just hang out in the pizza
room and have random chats with fellow data
scientists.
Kirill Eremenko: Also, another feature of the SuperDataScience
platform is the office hours, where every week we invite
valuable guests in the space of data science and
interrogate them about their techniques, about their
methodologies in the space of data science and you
actually get a presentation from the guests and you get
an opportunity to ask Q&A at the end.
Kirill Eremenko: In some of our office hours, we just present some of
the most valuable techniques that our hosts think are
going to be valuable to you. All of that and more you
get as part of your membership at SuperDataScience,
so don't hold off. Sign up today at
www.superdatascience.com. Secure your membership
and take your data science skills to the next level.
Kirill Eremenko: Welcome back to the SuperDataScience podcast, ladies
and gentlemen. Super excited to have you back here
on the show today, and I'm super humbled by our
today's guest, Jon Skeet.
Kirill Eremenko: Jon has submitted almost 35,000 answers on Stack
Overflow and his advice has reached an estimated 276
million people worldwide. That's 276 million. Quite an
insane number, if you take a second to think about it.
Kirill Eremenko: I just got off the phone with Jon and the podcast
you're going to hear is going to be very interesting. We
had a great discussion and it's going to be a different
perspective today. The reason for that is that Jon is
not a data scientist, he's a C# developer, an expert in
C# and also some other programming languages.
Kirill Eremenko: Don't let that scare you away, because, a couple of
reasons. First of all, there's a lot of similarities
between data science and development. Both use
programming and things like versioning, and
diagnosing problems are common between the two, so
we can learn quite a lot of things from Jon. The other
reason why this is very relevant is because data
science is more and more coming closer to product
development, is being integrated more and more into
products. Before, data science was just, let's get some
insights, let's do some predictions.
Kirill Eremenko: More and more, we see that companies are integrating
analytics, machine learning, artificial intelligence, data
science, into their products. You will, eventually, it's
highly likely that in your career, especially if you go
and work in startups, for startups, you start startups,
that you will encounter situations where you need to
combine your data science knowledge with development
knowledge in order to productionize data science.
Therefore, already in this podcast, you can get a head
start and understand how these two worlds meet and
what are their intersects.
Kirill Eremenko: Finally, the third reason is, maybe you are coming to
data science from a world of development. Maybe you
have some experience in programming languages like
C# or compiled languages. It will be interesting for you
to see Jon's perspective on the world of data science.
Kirill Eremenko: All in all, a fantastic podcast, I really enjoyed our
conversation. You'll hear a lot of very valuable
technical topics that we covered and also, at the end,
we actually talked about the importance of
community. What it means to be part of a community
and how communities grow, what you can do as a
data scientist, to make our community be more
inclusive, more welcoming and prosper further. I think
this is valuable, these are valuable insights for
somebody who's been heavily involved in the
development community. These are valuable insights
for data scientists and for us all to grow much faster
and better and stronger as a community.
Kirill Eremenko: On that note, I can't wait for you to check out today's
exciting podcast. Without further ado, I bring to you
the top contributor on Stack Overflow, Jon Skeet.
Kirill Eremenko: Welcome back to the SuperDataScience podcast, ladies
and gentlemen. Super excited to have you on the
show, because I have Jon Skeet on the other line.
Kirill Eremenko: Jon, how are you going today?
Jon Skeet: Very well. Thank you. Very well.
Kirill Eremenko: Very, very nice to talk to you. Could you please remind
me, what city are you calling from, from the UK?
Jon Skeet: I'm in Reading, which is just to the west of London.
Kirill Eremenko: Just to the west of London. Very cool. You said you're
having some fantastic weather these days?
Jon Skeet: Yeah, it's been really nice recently. A few occasional
downpours, but generally, we're escaping from the
normal British wet weather of a summer, so it's very
fine. The only downside is, by the end of the day, the
shed from which I usually work is pretty warm.
Kirill Eremenko: That's a good problem to have in the UK.
Jon Skeet: Yeah.
Kirill Eremenko: It was so cool to see your drums. That's so awesome.
That is very exciting. I wish you could ... Maybe one
day you can play something, and we can ...
Jon Skeet: I think it'll be quite a while before I'm even slightly
good enough to play for anyone else. I only bought the
drum kit a week and a bit ago. I'm practicing hard, but
I've got a long way to go.
Kirill Eremenko: Fantastic. Well, so you are in Reading. How long have
you been in Reading for?
Jon Skeet: Just over 20 years, actually.
Kirill Eremenko: Twenty years.
Jon Skeet: Straight out of university, I ended up working for
Digital Equipment. That was in Reading and moved to
my first house in Tilehurst, which is the sort of village
near Reading. It's a bit bigger than a village, but we
tend to call it a village. I've moved within Tilehurst, but
stayed basically there, even from before I was married.
Kirill Eremenko: Wow. Fantastic. You're married now?
Jon Skeet: Yes. We celebrated our 20th wedding anniversary fairly
recently. Yeah. Very, very happily married.
Kirill Eremenko: Wonderful. That's so cool. Congratulations.
Jon Skeet: Thank you.
Kirill Eremenko: Jon, what fascinates me is that from a ... Would you
say Reading is a little place or a big place?
Jon Skeet: Reading is a very large town sort of bordering on being
a city. It's not officially a city, but I wouldn't be
surprised if in the next five or 10 years' time it was
given the official designation of city. It's quite close to
London and there are really good rail links that are
improving over time, actually, so while a lot of people
do commute from the outskirts of Reading into
Reading, an awful lot of people also go from Reading
into London to work in London.
Jon Skeet: But it's great because it's nice and close to London, so
I can get to the office when I need to, and also go to
see plays and musicals, which I love doing. But also,
it's really close to the countryside, so house prices are
bad, but not awful, and I can get to the countryside
nice and easily get to other places in the UK easily. It's
a really nice place to be.
Kirill Eremenko: Oh, fantastic. That is exciting. What I find very
interesting is that from an almost city-sized town of
Reading, which is very exciting that it's growing, from
the town of Reading, you have done something
extremely unfathomable. You are the number one
contributor to a little website called Stack Overflow.
You have answered over 34,000 questions and you
have reached over 276 million people. If I was wearing
a hat, I would take it off for you right now. That is
huge. Congratulations on that.
Jon Skeet: Thank you very much. Thank you, but it doesn't feel
that big a deal, because it's sort of just something I've
been doing for whatever 10 years now. I answer fewer
questions than I used to, because there are fewer
questions that sort of seem like they are appropriate
for me to answer, but I do still ... I go on there every
day. I think it's probably been nearly 10 years since I
last missed a day on Stack Overflow, because I take
my laptop on holidays and things.
Kirill Eremenko: Oh wow.
Jon Skeet: I manage to disengage from main work, but I do
always like to keep an eye on what's going on on Stack
Overflow.
Kirill Eremenko: That's very impressive. Your profile has been viewed
over 1.8 million times and it's just incredible how
you've contributed to so many people, such a great
cause. How does that make you feel?
Jon Skeet: Obviously, I'm thrilled to have helped lots of people,
but I think it's worth bearing in mind that there are
lots of other people who have helped huge numbers of
people as well, and huge numbers of people who've
helped just a few people. So, the cumulative effect is
enormous. Now, I am privileged that being number one
draws a certain amount of, potentially, undeserved
praise. There is the sort of myth of Jon Skeet as this
perfect programmer who never needs to consult any
documentation.
Jon Skeet: In fact, just over the weekend, there's been an
interesting Twitter thread where someone, I believe a
venture capitalist, gave his impression of a 10x
software engineer who never needed to consult
documentation, knew every line of code that had been
deployed into production, and various things that I
actually thought weren't particularly positive for really
empowering a team, a whole team, rather than one
person to drive forward a product.
Jon Skeet: But there is this myth of me never writing a line of
code that's incorrect and all kinds of things, which I
hope most people understand just is not reality at
all. I am a pretty regular guy. I make bugs just like
everyone else does. I kick myself after losing an
entire day or two over something that turns out to be a
tiny typo. I happen to have just gained this
mythological reputation by just contributing a bit more
than other people have on Stack Overflow. So yeah, it
definitely doesn't reflect reality, but I enjoy it at the
same time.
Kirill Eremenko: Got you. Thank you. For our listeners, we're going to
set the scene. Jon, you're an expert programmer in C#,
correct?
Jon Skeet: That's right. Yes.
Kirill Eremenko: C#.
Jon Skeet: I have loved C# since it started. I think I played with
some of the betas before it went to a general
availability, 1.0, in 2001, 2002. I've been working with
it, either professionally or on an enjoyably amateur
status, ever since then sort of. I've alternated between
working with Java and working with C# professionally,
but whenever I've been working just with Java
professionally, I've kept up with the C# in my free
time.
Kirill Eremenko: Fantastic. I also have played around with C#. I was helping
one time, my brother created a Sudoku for [inaudible
00:12:52] Salmons in C#, which was fun. I totally love
... My favorite language is C, I would say and then
C++, because of its object-oriented nature. C# is
fantastic as well, although I don't know it really well.
Kirill Eremenko: What I wanted to do is, before the podcast, this works
better for our listeners, Jon and I sat down and
actually discussed what we were going to be talking
about, because as you can imagine, while C# can be
relevant to some data scientists and can be used to
deliver, deploy, develop data science applications in
some cases, in most cases it's not our language of
choice. You might be surprised, what are we going to
be talking about with Jon if he's an expert in C#?
Kirill Eremenko: If you hear some notes about C# in this podcast, and
you are interested in C#, that is awesome. That is for
you, but at the same time, what we're actually going to
be focusing on with Jon is the importance of
community and importance of what it is like to be in a
tech profession. Because there are lots of similarities
between development and data science, and through
his work on Stack Overflow and through his exposure
to the community of developers in Stack Overflow and
this, in general, is community that's helping each
other out, it will be very interesting to gain some
insights. Because the data science community, as far
as I know, is not as old as the development
community, so maybe there's some takeaways that we
can apply to the data science community and to how
we interact with each other. That's what I expect we're
going to focus on, but you never know how the
conversation is going to go.
Jon Skeet: Absolutely.
Kirill Eremenko: It's going to be fun.
Jon Skeet: My experience is that, well, certainly podcasts
[inaudible 00:14:33] tend to meander away from what
we expected [crosstalk 00:14:36]. Often, including
things around versioning or dates and times, which
are two other topics that I'm fairly passionate about. I
suspect that we'll find, in the course of this discussion,
that there are various touch points where the
problems that the data science community face are
similar to the problems that the more regular
programming community faces. There will be various
similarities and, hopefully, a few differences we can
note and sort of learn from each other, new
approaches that we can take.
Kirill Eremenko: Totally. Totally, even this one that you mentioned, the
versioning, that is such an important thing. In data
science, we don't have ... maybe in the silos and in
certain companies, maybe there are certain
frameworks that are coming out where there's very
rigorous, methodical system for versioning. But,
overall, when somebody starts the data science, that's
the last thing they probably learn.
Kirill Eremenko: They learn about machine learning and so on, but they
don't have this habit of versioning files. Like I, through
my work at Deloitte, where they have very specific
ways to version anything, like I even version my, I
don't know, tax documents, PowerPoints, they all have
like version 1.1, version 1.2, 3.7. Everything I create
almost always has a version. Whereas in data science,
I don't think that's the case. Tell us about the
importance of versioning in development.
Jon Skeet: Within the .NET community in particular, we've
adopted SemVer, Semantic Versioning, which is not
.NET specific and is fairly widely used within
programming where an artifact, whether that's a
library, an application, whatever, but probably
something that other people will depend on, they need
to know how it's going to behave, that gets a three-part
version number. It has a major, minor, and patch
version and also, potentially, some other information
like a -beta01 or whatever that says, "This is pre-
release and can change sort of arbitrarily." But then,
if you say, "I'm following Semantic Versioning," that
means that, if I've published a 1.0.0 of something,
then if I publish something else within the same major
version number, then it should be backward
compatible. If I publish a 1.1.0, then anything that
was previously using 1.0.0 should be able to upgrade
to 1.1.0 without being broken.
Kirill Eremenko: It should be able to use 1.1.0 without you changing
the code of that thing that's using these?
Jon Skeet: Exactly. There are different levels of compatibility. One
thing would be, and this depends on your
programming language and environment and things,
but in something like C#, which is compiled, there's a
separate compilation step that happens long before
execution, there can be different things where you may
say, "Well, it's source compatible," so your code that
built against 1.0.0 can still build against 1.1.0. There's
the other aspect of binary compatibility, which is,
well, I compiled this code against 1.0.0, but actually
at execution time for whatever reason I am loading
1.1.0 of the library and that should still be okay as
well. Then you get patch versioning, where you should
be able to go backwards or forwards in time. If I could
build against 1.1.5, I should also be able to build
against 1.1.4, so it's sort of forward compatibility as
well as backward compatibility.
Jon Skeet: Then you get into really difficult problems, where
you're writing an application, and you depend on one
library that depends on another library at version one,
but you want to depend on that same library at
version two, and those aren't necessarily compatible
with each other at all. There could be all kinds of
differences and, certainly, in .NET that causes a
problem, because while some aspects of the execution
environment can handle multiple versions being
loaded at the same time a lot of the tooling doesn't
support it. I wrote a blog post on that fairly recently,
saying, "Hey, we need to get better at this." I don't
know how many dependencies and what level of that
sort of versioning problem you have in data science. I
would say the most important thing isn't even
versioning in terms of making sure that everything has
a number, but at least keeping a versioned history of
things, whether that's in Git or in Subversion or some
other source control, so that you can get back to, "Oh,
I know I had a working version a few days ago. Let me
have a look at that."
Jon Skeet: I've been to some machine learning talks and done sort
of workshops, but don't have significant experience. I
can definitely imagine the importance of keeping a log
of, "Well, I tried this and this was the result." That sort
of goes on to another topic that I'm absolutely
passionate about within programming, which is, how
do you diagnose problems? A lot of that is making sure
that you can keep a log of exactly what you did and
exactly what the result was, and being clear enough
about that without spending hours and hours doing it.
I would imagine that's a skill that data scientists sort
of pick up naturally, because it feels like it's probably
closer to one of your core competencies. I would love it
if the data science community could try to teach the
programming community about how to keep good logs
of what happened when you tried things.
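One lightweight way to keep the kind of log described here is to append one structured record per attempt to a text file. The sketch below is an assumption about how that could look, not something from the conversation: the function names `log_attempt` and `read_log`, the field names, and the JSON Lines layout are all hypothetical choices.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def log_attempt(log_path, what_i_tried, parameters, result):
    """Append one 'I tried this, and this was the result' record to a file."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "tried": what_i_tried,
        "parameters": parameters,
        "result": result,
    }
    # One JSON object per line keeps the log append-only and easy to diff.
    with Path(log_path).open("a", encoding="utf-8") as log_file:
        log_file.write(json.dumps(record, sort_keys=True) + "\n")
    return record


def read_log(log_path):
    """Load every recorded attempt, oldest first."""
    with Path(log_path).open(encoding="utf-8") as log_file:
        return [json.loads(line) for line in log_file if line.strip()]
```

Because each call only appends a line, recording an attempt costs a few seconds, and the file can live in source control next to the code it describes.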
Kirill Eremenko: That's fantastic. Before we dive more into
diagnosing problems, I wanted to also mention that
for data scientists, there's a very specific component
that also needs to be remembered, which is that you
not only need to version the code that you're creating,
but you also need to version the data that you're using
to train.
Jon Skeet: Absolutely, yes. The same data set behaving differently
under different versions of your code versus different
versions of the data behaving differently under the
same version of your code.
Kirill Eremenko: Exactly, exactly. That's another moving part in the
equation.
Jon Skeet: Right.
Kirill Eremenko: I love that. That there is that similarity of versioning,
but there's a difference that data is such a crucial part
of what data scientists do. Moreover, you need to
not just say what kind of data it was, but preferably
have a backup of that data, because maybe that
data is not in your control. Maybe you're getting it
from a server where somebody might change it and
then you're completely stuck. You know?
Jon Skeet: Right.
Kirill Eremenko: You have no way ... It's important, right, in versioning
to be able to go back to the previous version in case
the new version is broken.
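A simple way to notice that the data changed underneath you is to store a fingerprint of the exact bytes you trained on alongside the code version. The following is a minimal sketch, assuming a SHA-256 hash is an acceptable fingerprint; `fingerprint` and `describe_run` are hypothetical names, and the code version is just a string you supply (for instance a commit ID).

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return a SHA-256 hex digest that changes if any byte of the data changes."""
    return hashlib.sha256(data).hexdigest()


def describe_run(code_version: str, data: bytes) -> dict:
    """Record which code and which data a result came from."""
    return {
        "code_version": code_version,       # supplied by the caller
        "data_sha256": fingerprint(data),   # identifies the exact data used
        "data_size_bytes": len(data),
    }
```

A fingerprint only tells you that the data differs from what you used before. Getting the old data back still requires the backup Kirill mentions.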
Jon Skeet: Yeah, and to know which version you did things
against. We seem to be, whether I am driving it to
topics that I'm interested in or not, there's something
similar in programming that many people are unaware
that they're depending on versioned data with time
zones.
Kirill Eremenko: Oh wow, it's a good one.
Jon Skeet: Many people assume time zone rules just stay the
same forever: "Well, you go into daylight saving
time at this time and then you come out of daylight
saving time at this other date, and the rules are set."
But no. The rules change several times a year, and I
don't mean because things go into or out of daylight
saving time, but a country might decide, "We're not
going to have daylight saving time anymore."
Kirill Eremenko: Yeah.
Jon Skeet: In fact, the European Union at the moment is
deciding, I think in principle it has been agreed that
from 2020, I think, countries will have the option of no
longer using daylight saving time, so everyone who has
recorded some data that is time zone aware in some
form or other, they have recorded it with, presumably,
the current version of time zone data that they were
using at the time. But I'd be very surprised if more
than 1% of developers actually recorded, "Yes, I was
using IANA time zone data 2016j, or whatever it is."
The rules that we knew about at the time, which
predict future and past things, is just an aspect of
versioned data that people don't expect to be versioned.
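To see why the rule version matters, consider the same local wall-clock time interpreted under two rule sets. The sketch below is deliberately artificial: the zone, the two rule sets `rules-A` and `rules-B`, and the crude "April to September is summer" test are all made up for illustration and do not correspond to any real IANA release.

```python
from datetime import datetime, timedelta, timezone

# Two hypothetical versions of the rules for one made-up zone.
RULES = {
    "rules-A": {"standard_offset_hours": 2, "summer_offset_hours": 3},
    # In this later version the zone has abolished daylight saving time.
    "rules-B": {"standard_offset_hours": 2, "summer_offset_hours": 2},
}


def to_utc(local: datetime, rules_version: str) -> datetime:
    """Convert a naive local time to UTC using one version of the rules."""
    rules = RULES[rules_version]
    # Crude stand-in for real transition dates: April to September is summer.
    is_summer = 4 <= local.month <= 9
    key = "summer_offset_hours" if is_summer else "standard_offset_hours"
    offset = timezone(timedelta(hours=rules[key]))
    return local.replace(tzinfo=offset).astimezone(timezone.utc)


meeting = datetime(2021, 7, 1, 9, 0)       # 9:00 AM local time in July
utc_under_a = to_utc(meeting, "rules-A")   # 06:00 UTC
utc_under_b = to_utc(meeting, "rules-B")   # 07:00 UTC
```

The stored local time never changed, yet the instant it refers to moved by an hour. Recording which rule version was in use is what lets you detect and repair that later.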
Kirill Eremenko: Yeah, I totally agree. Even Russia had this a few years
ago when they stopped using daylight saving times for
a few years and now they've started back using it or
something like that and try keeping track of all those
things. That has a massive impact. Your analysis can
be completely wrong, especially, you're doing
something, I don't know, for example, on the data
relating to financial markets. Bam, all of a sudden it's
not 8:00 AM, it's 9:00 AM or it's not 7:00 AM, it's 8:00
AM.
Jon Skeet: Absolutely. Yes, yes, it really matters. It also matters
how quickly you can get updated data, because some
countries don't give much warning at all that they are
changing their rules. Literally, sort of, there have been
countries that were about to go into daylight saving
time and announced the day before, "No, we're not
going to do that."
Kirill Eremenko: Wow.
Jon Skeet: I had colleagues who were going through airports and
half the monitors in the airport said one time and half
the monitors said a time an hour later. Of all the
places that you really, really want to be sure what the
time is, an airport is absolutely one of them.
Kirill Eremenko: Oh, that's crazy. That's crazy. Okay. You mentioned a
really interesting topic, which I love. Diagnosing
problems.
Jon Skeet: Right.
Kirill Eremenko: Code is code, whether you're coding in a ... Oh, by the
way, can you tell us quickly, you mentioned C#
compiled language, Python, on the other hand,
interpreted language. What's the difference?
Jon Skeet: Yes. I believe that even Python, there can be compiled-
ish versions, but to be honest, I don't know very much
about Python. Where I give opinions on Python for any
time in this podcast, please treat them with a grain of
salt, a very, very large grain of salt, but a compiled
language like C#, you take the source code and you
provide it and any libraries that you depend on, into
the compiler and the compiler outputs a file, which
contains a binary representation. Now, for compiled
languages like C and C++, that compiled
representation is pretty much machine code that can
be executed directly.
Jon Skeet: For C#, it's something called intermediate language,
which is roughly equivalent to Java bytecode. If any of
your listeners are familiar with that. Again, Java is a
compiled language: you get out class files that are in
this bytecode format that the JVM, the Java Virtual
Machine, knows how to run. It gets even more
complicated, because both Java and .NET almost
always take that compiled output, so the binary format,
not your original text source code, but they then do
something called JIT compilation, which is Just In
Time compilation.
Jon Skeet: They take that sort of nearly machine code and turn it
into actual machine code, so they don't need to go
through that translation step several times. That
happens at execution time, but there's this first bit
where you get to check that all your source code
actually makes sense.
Kirill Eremenko: Got you. Got you. In summary, in some cases, C++
and C compiles straight to a sort of file that can be
run. In the case of Java and C#, the code is first compiled to
an intermediate file and that helps find any errors at
compilation among other benefits, of course.
Jon Skeet: Right.
Kirill Eremenko: Then a second, Just In Time compilation is
required, so you can run in multiple architectures,
again, in addition to other benefits as well.
Jon Skeet: [inaudible 00:26:59] efficiently. There are ways of
compiling, certainly, C# and I believe Java with ahead-
of-time compilation, which is sort of doing the JIT
compilation bit of bytecode into machine code, doing
that ahead of time instead of when it started to run as
well, so there are lots of different options.
Kirill Eremenko: Got you, got you. On the other hand we have
interpreted languages such as Python. Any comments
on that, what's the difference?
Jon Skeet: In theory, if you take your very simplest idea of an
interpreted language, you have this interpreter just
like you have a Java Virtual Machine, but instead of
working with the bytecode, it's working with the source
code, so it runs, and it maybe reads your whole source
file into memory, but then it looks at one line at a time
and says, "Right, what does this line mean? I will
execute the code that's in that line, and by execute I
have to understand what it means." If it's something
like, let X = Y + Z, then it needs to parse all of that
and understand what it means, and then say, "Okay,
now I need to load the value of Y, load the value of Z,
add them together, save them in variable X."
Jon Skeet: Now, the very simplest kind of interpreter, if you have
that code in a loop, would be looking at that line
saying, let X = Y + Z every single time, and have to
understand it. Now, that is massively inefficient.
Everything would run far too slowly to be useful. More
modern interpreters might store something almost like the
IL or the bytecode representation of that, so that it
doesn't have to do the textual parsing every time, or
they might actually do something like the JIT
compilation. Even though it's sort of interpreted, I
think very few languages are genuinely interpreted the
whole time these days, because we've got good at doing
things in, well, JavaScript engines, V8, et cetera.
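The standard Python interpreter is an example of the "store a bytecode representation" approach: source text is parsed once into a code object, and that object is what gets executed. A small sketch using the built-in `compile` and `exec` functions and the standard `dis` module; the variable names are only for illustration, and the exact instruction names vary between Python versions.

```python
import dis

source = "x = y + z"

# Parse the text once into a reusable code object (bytecode).
code = compile(source, "<example>", "exec")

# The names of the bytecode instructions the interpreter will run.
opnames = [instruction.opname for instruction in dis.get_instructions(code)]

# Run the same compiled code in a loop without re-parsing the text.
namespace = {"y": 0, "z": 1}
totals = []
for y in range(3):
    namespace["y"] = y
    exec(code, namespace)
    totals.append(namespace["x"])
```

Inside the loop only the precompiled bytecode runs, so the cost of understanding the text of `x = y + z` is paid once rather than on every iteration.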
Jon Skeet: The difference between static versus dynamic
languages and compiled versus interpreted, they are
different things, but often go hand in hand. Static
languages tend to be compiled, dynamic languages
tend to be interpreted. But the difference in execution
time between those two sort of extremes has gone
down massively over time, because we've got a lot
better at dealing with interpreted languages.
Kirill Eremenko: Yeah. One of the differences that somebody
programming with these languages would see, and this
is quite important by the way for data scientists,
because more and more data science is becoming not
just, "All right, let's do some analytics," it's becoming
more product-oriented. Like in certain startups, data
science is embedded into the product, so you will
encounter-
Jon Skeet: Absolutely.
Kirill Eremenko: Yeah, you will encounter times when you will,
especially if you're going into the startup world or
developing new products, you will encounter situations
where you will need to work with compiled languages.
The difference in what you observe can be that, if
you're typing some code in Python and then you run it,
it will run. For instance, you have, let's say, have 100
lines of code and you have an error in line 50, it'll run
the first 49 lines and actually execute them. When it
gets to line 50 it will give you an error.
Kirill Eremenko: In a compiled language, when you try to compile that,
it will give you an error and none of those lines will be
run, so it's important to understand that if you have
some, for instance, data manipulation, data cleaning,
some pre-processing in the first 49 lines of code, in
one case they will be executed and your data will
change. Whereas another case they won't be executed,
because you won't be able to compile the file. I think
that's quite an important, quite radical, difference for
people to understand that, not only it's about what
you see, but also the effect that it can have in the
background on anything that you were doing before
that error occurred.
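The difference in when an error surfaces can be demonstrated within Python itself, with one caveat to the description above: Python checks the syntax of a whole file before running any of it, so only errors that depend on running the code (such as an undefined name) show up after earlier lines have already taken effect. The helper `run` and the two example snippets below are hypothetical and exist only to show the two cases.

```python
def run(source: str, log: list) -> str:
    """Compile and run a snippet, reporting at which stage it failed, if any."""
    try:
        code = compile(source, "<example>", "exec")
    except SyntaxError:
        # Caught before any line has executed.
        return "rejected before running"
    try:
        exec(code, {"log": log})
    except Exception as error:
        # Earlier lines have already run and had their effects.
        return f"failed while running: {type(error).__name__}"
    return "finished"


# The second line fails only when it is executed (undefined name).
runtime_bug = "log.append('cleaned the data')\nresult = undefined_name + 1\n"
# The second line is not valid syntax, so nothing in the snippet runs.
syntax_bug = "log.append('cleaned the data')\nresult = = 1\n"

log_a, log_b = [], []
status_a = run(runtime_bug, log_a)
status_b = run(syntax_bug, log_b)
```

After this runs, `log_a` contains the cleaning step even though the snippet failed, while `log_b` is still empty. A compiler for a statically typed language would typically reject the undefined name as well, before anything ran.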
Jon Skeet: Absolutely. Personally, I would like to see more
support for static languages and, obviously, I would
love to see C# used more in machine learning and data
science in general. If I knew more about data science, I
might be in a place to help that along. As it is, I'm
almost entirely ignorant, so I don't know how we would
do that, but it does come back partly to the aspect of
interpreted languages are often also used interactively.
My understanding is that a lot of data science is done
via Jupyter notebooks and the like, where you're
exploring things as you go. It's not like you write all of
your 100 lines of code and then run it and then find
that there's the problem, but you've built that code up
over time by trying things interactively, and that's
where statically typed and compiled languages tend to,
and this is always sort of caveat of, tend to, there are
exceptions, tend to not deal very well with being done
interactively.
Jon Skeet: You tend to have to do things by creating your source
file beforehand. Now, that's not always the case, and
this may be how we build C# support for data science,
or data science support for C#, depending on which
way you want to think about it, is by allowing C# to be
run more interactively. There are definitely projects
available for that sort of thing already. C# scripting,
approaches and ways of running C# in a browser and
the like. Maybe there will be really good Jupyter
notebook support in the future. I am sure that there
have been some projects that explore Jupyter within
C# already, but they haven't gained the sort of traction
that we'd need to see more mainstream support. But I
think the benefits, as you were saying, of not running
those first 49 lines of code before you find the error at
line 50, there are significant benefits of that, so I
would love to see more support for C# within data
science. I just wish I could help with it, but I just don't
have the knowledge to do so.
Kirill Eremenko: Yeah. It's really a difference in philosophy, isn't it? For
me, when I think of data science versus programming
and compiled languages, data science, like you said, is
very explorative in nature and even if you're not just
looking for insights, you know you want to build the
model, it still requires exploration of different
approaches during the process. For me, the way I
imagine it, the analogy, is like building a sandcastle.
You are trying this out, this falls over, you put a new
tower on, then you put the wall and then the water
washes it away. You build it again and so on. It's like
always you're playing with clay or sand, this type of
creative approach.
Jon Skeet: Right.
Kirill Eremenko: Whereas in programming, especially in the world of C#
and more in the compiled languages, it was a long time
ago for me, so you're much better placed to draw the
analogy here, but does it ... It almost feels like you
have a blueprint of what you want in advance. It's like
you're building a castle, not out of sand, but of little
bricks or a Lego piece.
Jon Skeet: To some extent. To some extent. With more test-driven
approaches, it's often, well, you write the test and then
you make sure that that's implemented. I don't want
anyone to get the impression that you write the whole
application and then you can run some of the code.
You can still do things iteratively, but it's less
interactive iteration. Now, it can feel somewhat close to
it if you get a really tight test run, write some code,
run the test again, et cetera, when you can get that
loop fairly tight, it can be pretty good. I would want to
mention F# at this point.
Jon Skeet: F# is a functional language which is still compiled to
IL. You can inter-operate between F# and C# and other
IL languages, VB, et cetera, but F# was designed from
the start to support this interactive exploratory mode.
My understanding, not as an F# developer, is that a lot
of F# work does happen in that exploratory mode like
data science, so maybe actually thinking about what
would be a good language for data science in a
compiled statically typed way would be F# rather than
C# and maybe we can build on that F# work to also
support C# over time. My understanding is some data
scientists do already use F#.
Kirill Eremenko: No, I'm very interested. I didn't know about F#. That's
very exciting. Okay. Let's move on to something you
touched on, how you diagnose problems. Are there any
best practices of problem diagnosis in code
that you can share with data scientists? Because code
is code. Whether it's interpreted or compiled,
whatever, it's still code in any country. This is
what I love about coding.
Kirill Eremenko: You can know how to code in Europe, you can know
how to code in Africa, you can go then to China and
code there. Code is pretty much the same around the
world. Are there any best practices, something you've
developed throughout your career, that you can share
on how to find those errors in the code? Sometimes
errors don't even pop up as an error, but they're
there.
Jon Skeet: Right. Yes. There are so many different sort of
categories of error. There are errors that you find at
compile time that you don't understand why this
doesn't work, why it won't compile, and they're
relatively easy, generally. There are errors where
things go bang, with exceptions, at execution time and
they can be reasonably easy to find and fix. There are
errors where, "My code all runs, it just produces the
wrong output," and that's where things start getting
harder. Then it gets really hard with, "My code runs
and produces the right outputs on my machine, but
the wrong output in production," and that's fairly
hard.
Jon Skeet: Then it gets even harder with, "The code runs and
always produces the right output on my machine and
99% of the time it produces the right output in
production as well, but just occasionally it's very, very
slightly wrong." Diagnosing those errors can be really
hard. We should probably timebox this almost,
because I can talk about diagnostics for a very, very
long time and I'm hoping, eventually, to write a book
about how to get into diagnostics, because this feels
like the silver bullet. Without trying to
be too immodest, I'm pretty good at diagnosing things
and that is the reason that I am able to help people on
Stack Overflow.
Jon Skeet: You give me a problem and so long as you have given
it to me in a sufficiently well specified way, ideally so
that I can reproduce the problem, then I can apply the
diagnostic steps and help you get to an answer. Now, I
can do that, but if I have to help 100 people that way,
then I have to go through the diagnostic steps 100
times. Obviously, it's far more efficient if I can improve
the level or help improve the level of diagnostic skills
throughout the community and then each of those 100
people could have diagnosed the problem themselves
and fixed it themselves. Now, that's-
Kirill Eremenko: Got you. Well, this is a great opportunity to try out
what you are going to share in the book.
Jon Skeet: Right. That's a simplification. There are times where
you need more knowledge than you have, et cetera,
but I think the silver bullet in diagnostics is divide and
conquer. You have a whole application that might
involve some XML parsing, producing some JSON, it's
interacting with a database, it's interacting with a web
service. It's got a web front-end. All these things and
something is wrong. The first thing to do, in my
experience usually, is try to isolate where the problem
is.
Jon Skeet: If you can reproduce it without doing any XML
parsing, then the problem, probably, isn't in the XML
parsing. It may well be that that has problems as well.
One of the fascinating areas in diagnostics is where
you happen to notice other problems as you're
tracking down one major thing, but it's really easy to
be sidetracked by, "Oh, it turns out that's broken, and
the third thing is broken. Everything's broken," and
trying to work out-
Kirill Eremenko: Or when you have two problems that cancel each other
out.
Jon Skeet: Yeah, that's even worse. You think, "Well, why is that
code doing that? I will fix that as I go along. Oh no,
that's now caused another problem," and making sure
that you keep some kind of record of, "Well, I need to
go back at some point and fix those things," but
without losing track of where you are on your main,
"This is the problem I'm trying to fix."
Kirill Eremenko: Divide and conquer.
Jon Skeet: Try to reduce things ... Sorry?
Kirill Eremenko: Divide and conquer.
Jon Skeet: Yes. Divide and conquer. Try to reduce things to the
smallest program that you can find that demonstrates
the problem. Not just the smallest in terms of source
code size, but the smallest in terms of environment
and the friendliest in terms of environment. I'm a big,
big fan of console applications, not necessarily for
running them. I do like building tools that are console
applications that do something useful, but if I'm trying
to diagnose a problem in a web application, but I don't
think the problem is in the webiness of it, but yeah.
Say, something I will be diagnosing today is trying to
make calls to a diagnostic service, a Google API
diagnostic service from a web application.
Jon Skeet: Now, that probably is going to depend on the web side
of things, because it goes into the logging framework,
but if in the same application I found that I had a
problem talking to our speech API, for example, then
that wouldn't be specific to the web application. I'll
probably pull that out into a separate console
application, hard code the data rather than taking it
from the user, because it's then easier to share the
program and easier to reproduce without having to
type in the inputs every time and then I've got a
console application that's 30 lines long or something
that doesn't behave as I want to. That's a really small
amount of code to debug.
Jon Skeet: I can launch a debugger for a console application
really easily without having to work out all of the
intricacies of setting up the debugging for a web
application, which ... It's not very hard these days, but
it's just extra steps, so you're trying to get this as
small as possible, as simple as possible. If you think
about the amount of code that's involved in a console
app compared with a Windows GUI or a web app or a
mobile app, all of these things, if you can get it into
some really simple form, then it makes things so much
easier to work with. Sometimes, you will still need to
debug into it and you try lots of different things, but
often just having simplified things, having separated
out all the 99% of code that doesn't matter, so that
you can focus on the 1% that does matter.
Jon Skeet: Suddenly, it becomes really obvious and you think,
"Well, why didn't I see that before?" Well, because you
had all this other code that was potentially wrong but
turned out not to be. So yeah, it's isolating the
problem and knowing how to say, "Okay, for the
moment, I will hypothesize that the problem isn't in
the XML. I will just hard code the data that would
normally be parsed from the XML. Maybe I will
hypothesize that it's not in the formatting of the JSON,
so I will hard code the JSON output," or whatever it is.
Getting rid of those dependencies is the big thing for
diagnostics within programming.
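A minimal sketch of that isolation step, in Python rather than C#. Everything here is hypothetical and invented for illustration (the three stages, the function names, and the sample data): the idea is only that hard-coding what the XML parser would have returned lets you run the suspect logic on its own.

```python
import json
import xml.etree.ElementTree as ET


def parse_people(xml_text):
    """Stage 1 (hypothetical): turn XML into a list of plain dicts."""
    root = ET.fromstring(xml_text)
    return [
        {"name": p.get("name"), "age": int(p.get("age"))}
        for p in root.iter("person")
    ]


def average_age(people):
    """Stage 2 (hypothetical): the logic we suspect is misbehaving."""
    return sum(p["age"] for p in people) / len(people)


def to_json(avg):
    """Stage 3 (hypothetical): format the result as JSON."""
    return json.dumps({"average_age": avg})


# Minimal repro: skip stage 1 entirely by hard-coding the data the parser
# would normally produce. No files, no user input, nothing to type in.
HARD_CODED_PEOPLE = [
    {"name": "Ann", "age": 30},
    {"name": "Bob", "age": 40},
]

# If this value is still wrong, the XML parsing is probably not the culprit.
print(average_age(HARD_CODED_PEOPLE))
```

The same move works in the other direction: hard-code the value passed to `to_json` to test the hypothesis that the JSON formatting is not at fault.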
Jon Skeet: Now, I don't know how well that sort of transfers into
data science. Maybe you can simplify your data sets. If
you've got an enormous dataset and it's giving you
some strange results, I can see there being significant
problems in saying, "Well, I will take a much smaller
part of the dataset." Whether that's fewer data points
or removing half the features and saying, "Well, I'll
only concentrate on the people's name,
address and age or something and see whether I still
get the same results." I would imagine that
everything's so intricately bound in data science
models, that you could even easily take the wrong
steps there, but, hopefully, some of it transfers.
Kirill Eremenko: No. No, definitely. It's very valuable advice to divide
and conquer. The way I see it is, basically, you have lots
of degrees of freedom of where the error could be, try
to cut as many of them off as possible.
Jon Skeet: Absolutely.
Kirill Eremenko: To lock them in. In terms of data science, again, I
wouldn't remove the features, but in terms of limiting
a dataset, that is actually a quite common practice to
develop a model with 10% of your dataset and then
only expand to 100% or whatever you have, 50%, later
on. That could, totally, be applicable.
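A rough sketch of working on 10% of a dataset, assuming the data fits in a Python list. The function name `sample_rows`, the fixed seed, and the stand-in dataset are all assumptions made for illustration; the fixed seed matters because a reproducible subset is what lets you reproduce the error.

```python
import random


def sample_rows(rows, fraction, seed=0):
    """Return a reproducible random subset containing `fraction` of rows."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Keep at least one row, but never ask for more rows than exist.
    k = min(len(rows), max(1, round(len(rows) * fraction)))
    # A dedicated Random instance keeps the global random state untouched.
    return random.Random(seed).sample(rows, k)


rows = list(range(1000))  # stand-in for a real dataset
subset = sample_rows(rows, 0.10)
print(len(subset))
```

Calling `sample_rows` again with the same seed returns the same rows, so a strange result seen on the subset can be investigated repeatedly before scaling back up.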
Kirill Eremenko: Of course, every situation is different and it's on a
case-by-case basis, but having this philosophy in mind of,
"Okay, I have an error, it's quite a large code base. All right,
let's hard code certain things into it. Reduce the
degrees of freedom and try to reproduce the error," I
think you're totally right. That's your silver bullet.
Jon Skeet: Yeah.
Kirill Eremenko: Awesome. Okay. Thank you. That's a great tip on
diagnosing problems. Any other comments? How do
you not make the errors in the first place?
Jon Skeet: Oh-oh.
Kirill Eremenko: Or how do you ... When you're coding-
Jon Skeet: If I knew how to do that [inaudible 00:45:14].
Kirill Eremenko: Maybe there's some best practice, like you code 10
lines or 100 lines and then you review them or you ... I
don't know. What's something that you've developed?
What's your secret sauce?
Jon Skeet: The two best practices that spring to mind are very far
from secret sauce and are pretty widely used these
days: code review and tests. If I come to a code base
that doesn't have any tests, that scares me, because I
don't know ... If I make a change, I don't know whether
that's going to have some adverse effect on some other
bit, but a well tested code base, ideally, with different
kinds of tests. There's a certain amount that you can
do with unit tests, where you're not interacting with
anything external. You try to test one piece of code in
isolation from everything else. Something that's only
got unit tests, okay, that's not so bad, but I also
generally like to see integration tests.
Jon Skeet: Where in unit tests you tend to fake out external
dependencies, I will assume that my database behaves
in this particular way, and so I'll fake out the
interaction between my code and the database. Well,
that's fine so long as I'm right about the assumption. I
want to see some integration tests as well, which use
the actual database to say, "Well, what happens when
I really, really try to do this against the database," or
against the web service or whatever it is? Those tend
to be more expensive, either in terms of resources and
time to set up the test, time to execute the tests. They
may be actually financially expensive if you have to
call some API.
Jon Skeet: I might have tens of thousands of tests that are unit
tests, because they're essentially free, but if I'm calling
some translation API and I want to call that 10,000
times within my tests, then that's going to take a long
time, because it takes a lot longer to make any kind of
network call than to do stuff in memory, and if I'm
doing that an awful lot, then I may be billed for those
translations that actually don't end up being useful to
me other than to verify the code. I would want to be
able to run all the tests frequently, even if I haven't
made changes in half the things, so to some extent,
I'm not getting much value for all those API calls that
I'm making.
Jon Skeet: That's why you want to have a good balance between
unit tests that are free but sort of limited usage,
versus the more expensive in time or billing or
whatever it is, integration tests that test far more of
the system. I would say, integration tests when they
fail, you've then got a significantly bigger job to
diagnose why they're failing, because you've got a lot
more surface area, you've got more degrees of freedom
as you put it. Whereas, the unit test is generally only
testing one of those degrees of freedom. If something
goes wrong, you immediately know, "Well, it must be
in that bit." There are different pros and cons for those
different kinds of tests, but they definitely help to
reduce how many bugs will get into my code, as well as
code review.
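To make the unit-test side of that balance concrete, here is a small Python sketch using the standard `unittest` module. The translation API, `FakeTranslator`, and `greet` are hypothetical names made up for this example, not any real service's interface; the fake encodes an assumption about how the real API behaves, which is exactly the assumption an integration test against the real service would check.

```python
import unittest


class FakeTranslator:
    """Stands in for a billed, networked translation API (hypothetical)."""

    def __init__(self, canned):
        self.canned = canned  # maps (text, target_language) -> translation
        self.calls = 0

    def translate(self, text, target):
        self.calls += 1
        return self.canned[(text, target)]


def greet(translator, name, target):
    """Code under test: needs only an object with a translate() method."""
    return f"{translator.translate('Hello', target)}, {name}!"


class GreetUnitTest(unittest.TestCase):
    def test_greet_uses_translation(self):
        # No network call and no billing: the fake answers from memory.
        fake = FakeTranslator({("Hello", "fr"): "Bonjour"})
        self.assertEqual(greet(fake, "Jon", "fr"), "Bonjour, Jon!")
        self.assertEqual(fake.calls, 1)
```

Saved to a file, this can be run with `python -m unittest`. Because the fake is free to call, thousands of tests like it can run on every change, while a handful of integration tests exercise the real service less often.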
Kirill Eremenko: Fantastic. Thank you so much. This part of our
discussion, I think, has been very useful, especially for
data scientists who want to get an edge in terms of
being prepared for product development and
integration of data science into product. I think that's
a big thing to put on your resume.
Jon Skeet: Right.
Kirill Eremenko: [inaudible 00:48:49] to this. They can totally do that.
In the interest of time, I only have about 10 minutes
left. I do want to talk a bit about communities.
Jon Skeet: Absolutely.
Kirill Eremenko: This is a very important topic. We actually ran a
survey recently among our students. We have about
850,000 students on Udemy studying data science
and close to 100,000 on SuperDataScience. This
survey, I think, went out just to the SuperDataScience
community, or SuperDataScience students. One of the
biggest, I was actually talking to our business
development manager today and he said one of the
biggest insights that we got from the survey is that
people want a community. That, right now, the way
data science is structured, the way people are learning
it, it's a very hot topic. Data scientists want to learn,
data scientists are needed in the job market and
people are learning.
Kirill Eremenko: But one of the things that still lacks at the moment is
that, whilst there are courses and the knowledge is out
there, and we are one of the providers of this
knowledge, one of the things that we could do better is
we could create a community for people to interact to
have some kind of feedback loops, have some friends,
have some buddies, have some conversations,
interesting talks with people and so on and learn from
each other, support each other, mentor each other.
That's a big thing. In the development world, as we
discussed at the start, community has naturally
evolved, and it's already been around for longer than
in the data science world. Tell us a bit about the
development community. What is it like and why do
you enjoy being part of it so much?
Jon Skeet: The first thing to say is that, it's not like there's one
development community. I'm sure you know that
already, but it's important to emphasize because, not
only are there lots of different communities, there's the
C# community, the Ruby community, the Java
community, et cetera, and even within that, there are
different sort of sub-communities that are massively
overlapping and different ways that those communities
come together. Just some examples where I'm involved
in community, Stack Overflow, obviously, it's not
trying to be a social network.
Jon Skeet: In some ways, it's impersonal, but it's a community of
learning. I think it's important to think about each
community, what you're hoping to get and think about
whether you want people to be taking personal time
and getting to know other people in the same sort of
area, because something like Stack Overflow only does
that marginally, because that's not its aim. Its aim is
to have questions and answers. Whereas, at the other
end, you've got user groups. I go to the Reading C#
user group or Reading .NET and I've spoken at many
other user groups, and often there's much more of a
feeling of, "Hey, I want to talk to other people in the
same space."
Jon Skeet: That's partly for sharing information, just getting to
know people, because it's always nice to get to know
people. We are sociable people, finding out about job
opportunities, learning more information, et cetera.
That's typically in-person and much more of a social
aspect. Then somewhere in between, there's
conferences where during the conference talks,
obviously, that's mostly one-to-many. The conference
speaker speaking to the room and getting some
feedback, but it's relatively rarely discussion oriented.
But then in the halls between talks or if you're brave
and [inaudible 00:52:43] say, "Well, I'm not going to
get into any talk now, I'm just going to hang out in the
lobby and talk with other people," and that's a totally
valid thing to do, it can feel a bit odd to start with, but
once you get used to it, it's a great thing.
Jon Skeet: If there's no talk that you're particularly interested in,
just chat with people and you'll get loads out of it.
There are more organized bits of communities. I'm on
the board for the .NET Foundation, which tries to be a
sort of community hub in terms of supporting various
.NET projects, acting as, to some extent, a bridge with
Microsoft, so we can represent .NET users in a
cohesive way. Yeah, the .NET Foundation is still
finding out what needs it needs to meet and how to
meet them, but it tries to be an online community
enabler as much as a community in its own right.
There are all these different ways that you can be a
community and I would certainly encourage the data
science community to think along similar lines of, it's
not like there needs to be one big player.
Jon Skeet: Arguably, it would be better for there not to be a single
big player that, if it doesn't work for some data
scientists for whatever reason, then they feel they
don't have a community to be part of. Having lots of
smaller self-organizing communities, I think, is
generally better. To sort of anticipate a potential
question, make sure right from the very, very, very
start that those communities are as inclusive as
possible. Just do not tolerate any
discrimination, whether that's in terms of the obvious
aspects of race and gender, sexuality, et cetera, but
also in terms of, be friendly to newcomers. There are
some communities that have a reputation for being
really hostile to people who are just trying to get into
that community.
Jon Skeet: Why would you want to do that? If you're passionate
about cookery, then why would you want to discourage
other people from becoming better at cookery? Even if
they're saying, "Well, currently I'm trying to boil an egg
and it keeps just cracking, because I haven't got any
water in there." Oh well, you don't say, "Well you're so
stupid for not putting water in there," you say, "Well,
okay, let's take a step back. Yeah, you need water. Let's
see what other things you might be doing that could
be improved." Try to bring people along rather than
being some sort of "I'm better than you" kind of
community. Community shouldn't be competition.
There can be competitive elements, if you enjoy some
competitive coding competitive ... I know that there are
data science competitions, and that's fine, but that
needs to be one aspect and not the whole idea of the
community to prove who's best. That doesn't really
help anyone.
Kirill Eremenko: Totally. I totally agree with you. We actually have a
conference in San Diego that we run every, what is it,
this year or September? At the end of September this
year, it's called DataScienceGO, and one of the things
that somehow just happened and now we are very
happy about it and we're promoting it in the sense
we're supporting it as much as we can is that it's a
very diverse community. We have, sort of last year we
had like 350 people attend from all different walks of
life, all different backgrounds, countries.
Kirill Eremenko: We had a much larger representation of female data
scientists and aspiring data scientists. In the data
science world right now it's about 10%. I think we had
over 20 or 30% at the conference. What we're trying to
do to promote this is, we were trying to showcase and
specifically invite speakers from minority groups or
female speakers in data science. Not to reverse
discriminate against male data scientists, but provide
a platform for anybody to show that it can be done,
that you can achieve success and create these role
models for people to look up to and to learn from. I
think-
Jon Skeet: I hope you're inviting them to talk about data science,
not to talk about-
Kirill Eremenko: Of course, yeah. Of course.
Jon Skeet: It's like, "Oh, what's it like being a woman in data
science?"
Kirill Eremenko: No.
Jon Skeet: Maybe somebody may want to talk about that, but the
main thing is recognizing that there are some awesome
women in data science and, in fact, most of the
conferences ... I go to various developer conferences
and there are often machine learning topics, and I
would say 90% of the conference talks that I've been to
on machine learning have been given by women who
have been awesome and really, really good.
Jon Skeet: I have to say, I'm disappointed to hear that the data
science community is only 10% women, because I'd
heard great things about it being nearly 50-50 at
various user groups and things. Maybe there are
pockets of data science community which already have
discovered whatever secret sauce it is. I should caveat
that any secret sauce like this is likely to involve a lot
of hard work.
Jon Skeet: Being inclusive is not just a matter of saying, "Well,
I'm not going to be nasty to anyone." You need to be
actively inclusive and watching out for any problems
that will put people off. But I would've thought data
science, even more than development, really needs to
be diverse, because you'll be dealing with data that is
diverse and I suspect I'm preaching to the choir here
when I say that, if your data is biased, then your
results will be biased.
Jon Skeet: I think having a diverse community has to be part of
trying to ensure that you don't have biased data and
that you can challenge assumptions that come in all
through the process. Whether that's collecting the
data, working out what data to collect to start with,
and then how you process the data, et cetera. It feels
like if the data science community isn't diverse, you
really face an uphill battle there.
Kirill Eremenko: Yeah, I totally, totally understand what you mean. As
you mentioned before the podcast, there's quite
substantial dangers of having a homogenous
community, whether it's in the development space or
the data science space. There's been studies-
Jon Skeet: It's a danger, not only to the results, but also to how
enjoyable it is. There are several reasons to be diverse,
to encourage diversity. There are obvious moral
reasons: discouraging people, excluding people,
however implicitly, is just plain wrong. But it's also not
as fun for the people who are in the community.
Diverse communities are more interesting to be part
of. As well as getting better results and all kinds of
things, there's only benefits, pretty much.
Kirill Eremenko: Yeah, yeah. I was about to say that there's studies to
show that in diverse teams when you have people from
all sorts of minority groups or from male and female
representatives and from as many nationalities as
possible, diverse groups like that get much better
results.
Jon Skeet: Right.
Kirill Eremenko: It's a question of why, maybe it's the difference of
opinions, difference of backgrounds, the different sort
of communication and how people challenge each
other and things like that. There can be multiple, like
millions of different reasons for that, but the fact is the
fact. If you measure the results, diverse groups,
whether it's in teams of developers or executive teams
or data science teams, diverse groups get, in general,
on average, get better results.
Jon Skeet: It correlates with other aspects. If you have teams
where psychological safety is valued, so that people
feel that they can voice minority opinions and you can
have a minority opinion, even if you're generally a
homogenous group. I could be sitting in a group of
other cis, straight, white males and still disagree with
them, but being in a team where that is okay to do and
is encouraged and supported and people don't get
shouted down, then that in itself is great.
Jon Skeet: Psychologically safe teams are likely to be more
attractive for members of minority groups who might
feel deliberately excluded from other places. There are
benefits that sort of bounce off each other. You're more
likely to get a diverse team the more you encourage
diversity, which in itself is useful even in a non-diverse
team, encouraging the kind of practice that is
attractive for diversity.
Kirill Eremenko: Why, is it because it's like a self-fulfilling prophecy, almost?
Jon Skeet: Almost. Yeah. Yeah, and it takes work. I want to keep
emphasizing this. It's not something that happens
because you say, "Okay, from now on we're not going
to be jerks."
Kirill Eremenko: Yeah.
Jon Skeet: That needs to be step zero, but it needs to be, "Okay,
well, I'm going to watch myself and others for behavior
that I think might've excluded people or just not
encouraged ... not got the best from everyone." That's
what it's really about, is making sure that everyone is
contributing as well as they possibly can, so even if
you never told that junior team member to shut up,
the fact that all the senior team members were
constantly speaking and interrupting each other may
have left no space for that junior team
member to speak up. So you're not getting the benefit
from them. Why have them there, but not get the value
from them?
Jon Skeet: It's watching very consciously and thinking, "How can
we do better?" Sometimes, that will be a case of calling
out yourself and saying, "Okay, I've realized that I've
been interrupting and I need to stop doing that."
Sometimes, it can be calling out other people and that
aspect of psychological safety that it's okay to call out
other people and it's okay to be called out and your
reaction needs to be not a defensive, "Well, I didn't do
that ..." but, "Okay, I will acknowledge that. I will
think about that and I will try to improve." That more
positive environment takes work and we mustn't
underestimate how much work it takes, but also the
benefits are so enormous.
Kirill Eremenko: Yeah. That totally echoes, for example, what Naval
Ravikant says, he's the founder of AngelList, that even
if you're selfish, even if you just want the ultimate best
career for yourself and just ... like, you only think
about yourself, even in that hypothetical case of an
extremely selfish person, it's in their best interest to
maximize the output that anybody on the team can
give regardless of their background. Because that way,
you're exploiting their talents, which are inevitably ...
Everybody has different talents.
Kirill Eremenko: There can be major differences, there can be minor
differences, but there are different talents and there's
no two people that are the best at exactly the same
thing. So you want to exploit other people's talents the
most, so that the team gets the most out of it, so that
you create amazing products, you change the world,
you do crazy cool things and that will allow you to
grow, allow you to get the best benefits, allow you to
progress your career as quickly as
possible.
Jon Skeet: Absolutely.
Kirill Eremenko: I totally agree with you. It's all about creating this, I
love how you put it, psychologically safe space for
people to feel that they can speak up and share their
opinions.
Jon Skeet: And it's playing the long-term game rather than the short-term
game. To take out any sort of identity politics that
some people might feel difficult about, or might find
difficult. Let's suppose I wanted to make sure that the
industry only rewarded people with a surname beginning with
S, because my surname begins with S. I could see
immediately that that gives me a massive pay rise, I'm
massively in demand, it's all great. Unfortunately, that
means that all the people who would be contributing
great features, like the lead designer of C#, Mads
Torgersen. His surname begins with T rather than S.
Therefore, I won't benefit from anything that he would
be contributing to the C# language.
Jon Skeet: All the C# compiler authors whose names don't begin
with S are suddenly excluded and I don't get a decent
C# compiler. So yeah, I might be well paid right now,
but I have to work with crummy tools. I don't get the
best hardware. I don't get the best software. I don't get
the best, all the rest of the environment, because I've
decided to exclude people. Realistically, I don't think
most people do want to exclude people. They just don't
... they can sometimes feel, "Well, if I widen my circle,
then it means I get a smaller slice of the pie." It's really
about making that pie ever bigger and bigger and
bigger.
Kirill Eremenko: Fantastic. Jon, those are great examples. I
think people now listening to this podcast, if they
weren't agreeing before, now definitely are on board
with the implications and the theory behind it. What
are some practical steps that people [inaudible
01:06:51] take? You already mentioned the importance
... Sorry, you already mentioned a step at the start.
Could you repeat that one? Maybe there's some other
steps that people can take, practically, to help create
these safe environments and safe spaces.
Jon Skeet: I sort of notionally said step zero of saying let's not be
jerks. To some extent that is the first step, but actually
it's learning about things. I started getting into
feminism about four years ago and found just how
much I needed to learn. The same is true in terms of
all kinds of aspects of discrimination, which you might
sort of discount as being, "Well, why do we need
feminism? Now women have equality, they have equal
rights. There can't be a gender pay gap, because the
law says that there isn't allowed to be." Well, it's more
complicated than that. Just take a humble approach
to learning, and actively engage. There's so
much material out there to find out where our
industry has gone wrong and the situation we're in
and there's no point in trying to be part of a solution
without understanding part of the problem to start
with.
Jon Skeet: The other sort of warning I would give is the
temptation to solutionize, to use jargon. A speaker
called Rhonda Bergman put this really, really well. I
don't think it was her analogy to start with, but she
expanded on it very well, talking about the
difference between knights and allies. That a knight
goes in and wants to solve a particular situation and
wants to end up being the hero, and that should never
be the point of it.
Jon Skeet: An ally will go in and take a broader sweep of, "Okay,
what's wrong here and how can I contribute to the
solution for the sake of the solution rather than to be
some sort of hero?" It generally addresses the causes
rather than the immediate symptoms. Obviously,
immediate symptoms if someone's being bullied or
harassed or whatever it is, they need to be addressed,
but just addressing those, and saying, "Right, job done
now," without addressing the underlying causes, ends
up not being nearly as productive long term. So it's
really about educating yourself and trying to find out
where you can be of help rather than where you can
lead the charge, because chances are someone else's
already leading the charge and could do with your
support rather than could do with someone else sort of
diluting the message.
Kirill Eremenko: Yeah, that's a very valuable piece of advice, that you
can't have, like in any undertaking, you can't have
everybody be a leader. You need leaders and followers
and it's totally fine to be a follower. Even if you can
contribute in a minor way to this, that's already going
to be a massive benefit. Jon, thank you so much. In
the interest of time, please could you share with us,
where can our listeners follow you, find you, read more
about your work or ask you a question maybe?
That would be so fantastic.
Jon Skeet: Sure. I'm on Twitter, my Twitter handle is just Jon
Skeet. I have a blog at blog.Jonskeet.uk and
codeblog.Jonskeet.uk. My non-code blog is mostly
around feminism, although this weekend I posted a
recipe for Tiramisu and Tiramisu ice cream. It's
anything non-codey, and the code blog is what you'd
expect it to be, and, obviously, on Stack Overflow.
Kirill Eremenko: Fantastic. One last question before we finish up. Is
there any book you can recommend to our listeners
that has impacted your life?
Jon Skeet: Yeah. It sort of fits in very well with data science and
what we've been talking about. I'd like to recommend
Everyday Sexism, compiled by Laura Bates. The
Everyday Sexism Project is a project that Laura started
when she had a terrible experience on a bus once and
speaking to other women found that her experience
was not uncommon, but wasn't being talked about.
This book has lots of data points in it.
Jon Skeet: Lots of citations of reports and concrete data, so if you
think that sexism isn't a problem, then read the book
and see concrete evidence about it. It's sort of
simultaneously inspiring and terrifying, I find. There
are lots of other books around feminism and
particularly in the tech industry, there's a book called
Brotopia, which looks at sexism in Silicon Valley, but
yeah, my main recommendation would be
Everyday Sexism, compiled by Laura Bates.
Kirill Eremenko: Thank you very much. Everyday Sexism, Laura Bates.
Everybody, check it out. Jon, I just want to say thank
you so much. I became a fan when I saw how much
you contribute to community and now after this
conversation we had, I'm a huge admirer of what
you're doing, both in the space of helping community
and your expertise, unquestionable expertise in the
domain of coding and how passionate you are about
building community and making it accessible to
everybody and making sure that all minorities are
respected and everybody has ... that equality is there.
Thank you so much for spearheading this space in the
world.
Jon Skeet: Thank you for having me on the podcast.
Kirill Eremenko: Thank you everyone for tuning into the
SuperDataScience podcast. Super appreciative of you
being here. That was Jon Skeet, the top contributor on
Stack Overflow joining in for today's conversation. How
amazing is that? I completely enjoyed our conversation
about the technical aspects at the start and about the
community and being all inclusive towards the end. I
hope you got a lot out of this. One thing to always keep
in mind is that indeed data science is becoming
ubiquitous and, eventually, you'll see it be embedded
all over the place.
Kirill Eremenko: It won't be just creating models, providing business
advice, but you're already seeing this being
embedded into products and that includes products
where development is required, websites and apps and
different ... Basically, programs run everything we see
around us, whether it's Alexa that's in your kitchen or
whether it's a washing machine or an airplane, it is
code that's running that. As data science and machine
learning, AI, gets more and more integrated into that,
we will need to understand better the world of
developers and developers will need to better
understand the world of data science. If you want to
get ahead of the competition, if you want to have a
significant advantage or an additional significant
advantage on your resume, in your career, this is
definitely something to look into.
Kirill Eremenko: Data scientists who understand what development is
all about, understand these differences that we talked
about, such as compiled versus interpreted languages.
What is versioning, how that affects developers, how
that affects data scientists, what kind of problems you
want to diagnose. What is the silver bullet in code
diagnostics? The divide and conquer principle, what
are code reviews, what are tests? All the things that we
talked about, taken out of context, might seem that
they are too far fetched for data scientists. Actually,
it's a massive advantage you can add to your career. I
hope you enjoyed this podcast and got a lot out of it.
Kirill Eremenko: Make sure to follow Jon. His Twitter handle is
@jonskeet and spelled J-O-N S-K-E-E-T without the H.
J-O-N-S-K-E-E-T. Make sure to follow Jon. He already
has 638,000 followers and that means he's doing
something right and, obviously, sharing very valuable
knowledge. As always, you can get the show notes for
this podcast, including all the materials mentioned
and including the book that Jon mentioned at www ...
Well, you can get a link to the book, not the book
itself, of course, at www.superdatascience.com/285.
That's superdatascience.com/285. There, you can also
find the transcript for today's episode.
Kirill Eremenko: On that note, thank you so much for being here today
and I look forward to seeing you back here next time.
Until then, happy analyzing.