sds podcast episode 249: diving into data science … · mike that was back in the middle of 2017...
TRANSCRIPT
Kirill Eremenko: This is episode number 249 with the CEO and Co-
Founder at SFL Scientific, Michael Segala.
Kirill Eremenko: Welcome to the SuperDataScience Podcast. My name
is Kirill Eremenko, Data Science Coach and Lifestyle
Entrepreneur. And each week we bring you inspiring
people and ideas to help you build your successful
career in Data Science. Thanks for being here today
and now let's make the complex simple.
Kirill Eremenko: This episode is brought to you by our very own Data
Science Conference, DataScienceGO 2019. There are
plenty of Data Science conferences out there.
DataScienceGo is not your ordinary data science
event. This is a conference dedicated to career
advancement. We have three days of immersive talks,
panels and training sessions designed to teach, inspire
and guide you. There's three separate career tracks
involved. Whether you're a beginner, a practitioner or a
manager, you can find a career track for you and
select the right talks to advance your career. We're
expecting 40 speakers, that's four zero. 40 speakers to
join us for DataScienceGO 2019 and just to give you a
taste of what to expect, here are some of the speakers
that we had in the previous years. Creator of Makeover
Monday, Andy Kriebel. AI thought leader Ben Taylor,
Data Science influencer Randy Lao, Data Science
mentor Kristen Kehrer, Founder of Visual Cinnamon
Nadieh Bremer, Technology Futurist Pablos Holman
and many, many more.
Kirill Eremenko: This year we will have over 800 attendees from
beginners to data scientists to managers and leaders.
There'll be plenty of networking opportunities with our
attendees and speakers and you don't want to miss
out on that. That's the best way to grow your data
science network and grow your career. And as a bonus
there will be a track for executives. If you're an
executive listening to this, check this out. Last year at
DataScienceGO X, which is our special track for
executives, we had key business decision makers from
Ellie Mae, Levi Strauss, Dell, Red Bull and more.
Kirill Eremenko: Whether you're a beginner, practitioner, manager or
executive, DataScienceGO is for you. DataScienceGO
is happening on the 27th, 28th, 29th of September,
2019 in San Diego. Don't miss out. You can get your
tickets at www.datasciencego.com. I would personally
love to see you there, network with you and help
inspire your career or progress your business into the
space of Data Science. Once again, the website is
www.datasciencego.com. And I'll see you there.
Kirill Eremenko: Welcome back to the SuperDataScience Podcast ladies
and gentlemen, I'm super excited to have you back
here on the show because we've got a returning guest.
For the second time round, Michael Segala is joining
us. He is the CEO and Co-Founder of an AI, Data
Science, Machine Learning consulting firm based out
of Boston but operating globally called SFL Scientific.
Kirill Eremenko: Previously we had a super exciting discussion with
Mike that was back in the middle of 2017 and it was
episode number 65 on the SuperDataScience podcast
if you missed it and today Mike is back with even more
case studies and more inspiration for you guys in the
space of data science. Here are some things that we
talked about, just as last time Mike shared three case
studies and of course they were different this time.
This time we talked about healthcare imaging and we
delved deep into neural networks and the architecture
and design of neural networks.
Kirill Eremenko: Then we talked about logistics and supply chain and
the challenges there and we talked about things such
as bottlenecks and routes and how machine learning
can help in those spaces and what kind of projects
they're doing in that industry. And we talked about
energy and in the space of energy, Mike actually gave
us two case studies, and some of the things that you'll
learn there are dealing with unbalanced data sets,
creating fake data sets, unsupervised learning for
anomaly detection and supervised learning with small
data sets and in general, this challenge of small data.
Those are just a couple of things that you'll learn,
there's plenty, plenty more that Mike shared, including
an overview of the world of Data Science projects and
Data Science Consulting in general, which I think you
will find extremely valuable and why companies in
2019 and 2020 might actually start defunding artificial
intelligence and machine learning and what we can do
about it.
Kirill Eremenko: As you can imagine, this is going to be a very, very
powerful podcast, can't wait to jump into it. But before
we do, I wanted to give a shout out to our fan of the
week. And this one is from Ronnie who says, "If you
have an interest in programming automation, big data,
machine learning, etc., this is a must listen, focuses
on data science, analytics, etc., in the corporate
world." Thank you very much Ronnie. Very, very
inspiring to hear that.
Kirill Eremenko: And for those of you out there who are listening to the
show and you haven't yet left a review, then head on
over there on your podcast app or just go to iTunes
and leave a review for the SuperDataScience podcast,
that would be just amazing. I'd really appreciate it
because I love reading your reviews. And with that
said, I'm super excited about today's episode. And
without further ado, for the second time round I bring
to you Mike Segala CEO and Co-Founder of SFL
Scientific.
Kirill Eremenko: Welcome back ladies and gentlemen to the
SuperDataScience podcast. Super excited to have you
on the show because we've got a returning amazing
guest with us here. The one and only Mike Segala from
Boston's SFL Scientific. Mike, welcome back. How are
you doing today?
Michael Segala: I'm doing great. Thanks for having me back. It's a
pleasure to talk again.
Kirill Eremenko: The pleasure's all mine, the podcast we had last time
was an amazing success and totally totally rocked it,
so looking forward to having another one today. How's
the weather in Boston these days?
Michael Segala: Well, it's late February, so we're cold and windy, but
not too bad snow this year, I can't complain too much,
but not nearly as nice as where you're at in the world.
Kirill Eremenko: Yeah, man, I'm in Tasmania now and like I was
mentioning before, it was freezing last night. It's my first
time in Tasmania, literally the day I got here, they
have like the worst wind and weather ever and it's
freezing cold. But it's a nice [crosstalk 00:07:05]
Michael Segala: That's how it always is, right? You go on vacation,
and they're like, "Oh, this is the worst weather we've
ever had." There's always something but it keeps it
fun, right?
Kirill Eremenko: Yeah, that's true. A bit of variety. That's right. That's
exciting. It's been over one and a half years since we
last spoke, the previous episode, by the way for our
listeners, if you haven't heard it, highly recommend
checking out. Mike shares amazing case studies. It's
episode number 65, so you can find at
superdatascience.com/65. Mike, it was over one
and a half years ago. What's been happening since then?
Michael Segala: A lot. Just to kind of recap real quick for the audience.
I've run SFL Scientific, we're a Data Science consulting
company. Unlike a lot of these traditional product
companies or vendors, we're purely focused on really
attacking this Data Science market from a purely kind
of consultative standpoint. Truly kind of service
oriented. What that means for us is we get to have a
lot of really smart folks on staff that get to work across
a really far ranging kind of sets of clients and topics
across the data science and data engineering space.
Michael Segala: For us we're really just continuing to grow and move
with the market. As everything continues to mature
and money gets fed into this AI market, SFL is taking a
really nice ride along with them and continue to kind
of execute on really interesting innovative projects and
just grow the business. It's been a great time and it's
kind of very similar to yours Kirill. We both kind of
started the companies a couple of years ago in the
beginning of this phase and are doing great stuff so
congrats to you as we've been taking this ride a little
bit together here.
Kirill Eremenko: Thanks for that. Thanks mate. Yeah, I can only say the
same. It's exciting to see the explosive growth you had.
I sometimes go on the SFL Scientific website and even
if you're not a business owner, if you're a data
scientist or you're a data science manager aspiring,
highly recommend checking out sflscientific.com. I just
go there for inspiration sometimes, you go to solutions
or our work. I like how you have this grid of different
industries you've worked in, from advertising,
marketing, agriculture, insurance. And then like I click
on one of them and I'm like, "Oh, that's really cool."
What have you done in agriculture, satellite imaging,
resource management, crop forecasting, livestock
monitoring. Those are some really cool things. There's
a ton of industries you guys have worked on. It's crazy.
How do you keep up with all these projects?
Michael Segala: Well, keep up with the project is different than
executing. Keep up is a lot of late nights and email
exchanges. But everybody on this podcast listening is
pretty educated at least from a data science
perspective, and as we know, algorithms, data sets,
they all kind of boil down to the same fundamental
data types and challenges. What do we have
fundamentally? We have images, we have time series
data, we have text data and a couple other types of
fundamental modalities of data. And what you can
start doing is thinking about, all right, if I had an
image and this image came off of an MRI machine or a
satellite image or even a camera in my house, how
would I classify that image? Or how would I segment
that image?
Michael Segala: And if you're really good at thinking through the
fundamental challenge behind capturing, collecting
and storing and then solving the problems of those
data types, you can kind of abstract away some of that
industry vocabulary and difficulties that very industry
specific folks focus on. What we really try to focus on
as a company is saying, "Hey, I want to hire the best in
class folks at computer vision or time series analysis
or NLP analysis." And arm them with that kind of 95%
of the knowledge to solve all problems. And then when
we talk to somebody from Ad tech or from Pharma or
from finance, being able to slot in and solve an NLP
problem or computer vision problem is kind of very,
very similar and almost a rinse and repeat because
you have that core knowledge. And then you can really
apply it across all these verticals very, very easily.
Michael Segala: That's the way that we attack the market. Now granted
that's not for everybody, but we find that to be
extremely successful and we really had no issues with
that so far.
Kirill Eremenko: That's amazing. I love that you mentioned because we
talked many times with many guests about the
transferability of data science skills. That's why I
personally enjoy Data Science which I think it's such a
cool industry to be in is because you develop those
skills as you mentioned, and then you can take, you
can separate in your mind the data science side of
things and the domain knowledge or the business
knowledge and you can take your data science skills
and transfer them into different areas and very quickly
graft that domain knowledge and consulting is like one
of the places of course where that is the most evident.
Michael Segala: Yeah, absolutely. And I mean for us, we don't think of
data science as a point position around algorithms. I
actually think that's the least interesting thing going
right now in data science. Because when you think
about data science, all these algorithms, take anything
off the shelf, your XGBoost models, your TensorFlow
models, right? These are all becoming very commodity
and it's almost trivial at this point to take some data,
run it through XGBoost and get a prediction. Literally
if that takes you more than 20 minutes, if you're just
kind of doing rinse and repeat, you don't know what
you're doing.
Michael Segala: When we're thinking about consulting, it's so much
more than this kind of very singular thought around
algorithms. We like to take that very holistic approach
of saying, "If you're a real organization who needs to
solve a real data problem, how do you do that?"
Michael Segala: And the first way that you do that is as a data scientist
to take a big step back and think about the strategic
vision here, what's the real business use case that
you're thinking about? How would you solve this?
What's that ROI look like? What do I actually get at the
end of these algorithms? And you really thinking
through not just the sciencey algorithm stuff, but also
the business stuff. And then also thinking about, well,
how would I engineer that solution? How would I do
that in a kind of scalable, secure environment where I
can now go in productionize this thing.
Michael Segala: And kind of having that, and coding that around these
algorithms is really where that interest lies. And again,
the reason that I'm saying that is because if you're a
consultant and if you want to get into this space, or
if you really want to be a great data scientist, what we
find is, these just very simple algorithms, they're going
to be commoditized. If you want to stay above that
curve, you have to really think about that larger
picture. And that's also very repeatable across
industries, all of these themes make you an extremely
innovative folk and be able to be used across all these
different problem statements. It just kind of keeps
going and going.
Kirill Eremenko: Yeah, totally agree. And you mentioned just before the
podcast that you have grown to over 30 people. What
kind of roles do you have on your team? Is like
everybody doing data science projects end to end or do
you have some people specialized in certain types of
industries, certain types of areas or parts of the data
science project?
Michael Segala: We have two very different groups of teams, first is
more of the sales and the business folks that sit under
me, but we'll put them further aside for the moment.
They have their great roles, they do their things, but
not really for this podcast, actually let me just stop
there for two seconds. I actually make all of my sales
and business people take your courses.
Kirill Eremenko: Oh no.
Michael Segala: I swear to god. As their first two weeks or three weeks
of their introduction, they have to take your, I think
two of your courses as their introduction to data
science.
Kirill Eremenko: Wow.
Michael Segala: Everybody [inaudible 00:15:30]
Kirill Eremenko: Thanks man, that's so exciting to hear.
Michael Segala: Because it's a great resource. It's an absolute great
resource and I feel that everybody on my team, no
matter if you're a sales folk or if you're whoever else,
you have to be a data scientist at least at some novice
level. You have great resources so we really appreciate
them.
Kirill Eremenko: That's awesome. It's so exciting to hear as well that, I
think this stands to show that data culture or data
driven thinking and culture. This is on one hand of
course it's about knowing your product and what
you're selling. But on the other hand, this way your
team as a whole can develop this data driven mindset.
If a salesperson is talking to a client, they might be
like, "Oh, this might be helpful." XGBoost or Decision
Trees, Random Forests. Really, really cool. Thanks
man, you put a huge smile on my face.
Michael Segala: I'll answer your other question but I want to get back
to this as well because I think a lot of folks that listen
to your podcast could be from that sales and business
side of the world. And at least me, right in my team, I
run that department myself ... I'm a physicist. I'm a
scientist. I'm a data scientist but now basically I'm a
sales guy and I have a very core belief, exactly
parroting what you're saying that if you want to sell
data science, if you want to be in that role of data
science but not a technical employee, it is
phenomenally critical that you have that same
vocabulary. You understand the real challenges and
you can have at least a five-minute conversation where
you're actually conveying real knowledge about the
topic.
Michael Segala: Otherwise you just kind of look silly compared to
people who know what they're talking about. It's
extremely crucial to have a real base line in there. But
anyways, putting that to the side for the moment, on
the technical side of the house, we usually have two
types of individuals. One is our data scientists and our
data scientists, we look for people who are generalists
but extremely gifted generalists. I need you on one day
to be able to solve cutting edge 3D medical imaging
projects and then the next day doing NLP work. We
tend to not hire folks who only know how to do one
thing because we're a consulting company. That
project might be up in six weeks and then you're off to
something else.
Michael Segala: Our goal is to hire really well rounded folks, but we
tend to double down a bit in the healthcare market,
healthcare, Pharma, biotech. It's really nice when
people have that kind of general backgrounds, physics,
biology, chemistry and things of that nature. But really
bright individuals, that kind of know the data science
space. That's kind of the one group of team. And then
the other side is more of our engineering folks, we call
them like AI engineers because they're not like
day-to-day sysadmin folks or SQL people. They're the ones kind
of deploying these solutions at scale, all the way from
very large petabyte size image loads to realtime data
transfer and kind of model deployments.
Michael Segala: We tend to have those two kind of engineering and
data science teams, but they have huge overlaps. Both
can kind of pair with each other and do a really nice job.
That's how we set up the teams internally.
Kirill Eremenko: Gotcha. What's the split approximately between the
data scientists and AI engineers?
Michael Segala: I would say 70, 30 maybe 70% data science, 30%
engineering.
Kirill Eremenko: Gotcha.
Michael Segala: Give or take, something like that.
Kirill Eremenko: Am I understanding this correctly that you not only
deliver the insights and solve the problem for the
client using data science, but you also help
organizations actually deploy their solutions into
production and actually have those models working on
an ongoing basis, hook up all the tools and make sure
that everything's working right. Is my understanding
correct?
Michael Segala: Absolutely. And I think if you don't do that, you're
falling very short of what it actually means to do data
science. Data science isn't running a POC on your
laptop, with a CSV file, it could be, but for most real
organizations, they need something much more robust
than that. That can fit into a real process and kind of
take in real data and kind of show the results and
kind of fold into more of their business process. It's
really critical for us, obviously the first phase in most
projects is very simple, take this data, show me that
you can predict something, great, show it in a sandbox
environment. And then what we really need to
transition them into where most organizations fall
short and why most data science projects fail is not
because the data's no good or because the models are
no good.
Michael Segala: It's actually because the folks don't know how to
integrate these things and productionize the code.
That's a huge problem we see in the industry. We
really try to be thoughtful, when we kind of prove out
the POC to show them and work with them to deploy
it, 'cause unless you deploy it, it's really a failed
project. Absolutely. It's extremely important.
Kirill Eremenko: It's kind of like a follow through, like getting things
done. I imagine it as American football, imagine one
player throws the ball and the other one has to catch
it, the data science side of things, that's throwing the
ball. How fast you can throw it, how accurately you
can throw it, how you can avoid other players jumping
at you when you're throwing it, all that stuff. But if
there's nobody to catch it, then where's that ball going
to go, is just going to land by itself.
Michael Segala: It's going to hit somebody in the back of the head.
That's all it's going to do.
Kirill Eremenko: Exactly.
Michael Segala: But I agree. Any analogy you want to make.
Absolutely. The fact that we still don't have a culture
in the Data Science space around deployment and
productionization, I think is one of the biggest issues
that I see. And one of the biggest risks of folks not
investing longer term, kind of in their data strategies,
these kind of failed POCs. And a lot of that is really
just kind of comes down to integration and
productionization.
Kirill Eremenko: When you say POC, what do you mean? Just so we're
all in the same boat?
Michael Segala: Yeah, sorry. A POC usually is take any, I don't know,
take a problem, pick a use case, whatever it happens
to be. Predicting churn for my customers, pick
something ... A POC is normally, here's 10,000
historical customers, here's the data. Show me that
you can predict with some given level of probability
that these customers can churn, pretty straight
forward. They give it to you on a CSV file, you fire up
XGBoost within a couple hours you could probably do
something. You need to show the business that, that is
validated and you can do it. But now you need to then
productionize this by saying, "Okay, now I have real
customer data coming in every day, I'm collecting it,
I'm adding external information. How do I integrate
this code and algorithm into my actual workflow?"
That kind of [inaudible 00:22:28] has the POC into
more of a real kind of implementation phase. That
make sense?
Kirill Eremenko: POC is basically proof of concept that you can get it
done.
Michael Segala: Proof of concept.
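[Editor's note: The churn POC Mike describes can be sketched in a few lines. This is a minimal sketch under loud assumptions: the customer data is synthetic, the two features (tenure and support tickets) are invented for illustration, and a plain-Python logistic regression stands in for XGBoost so the example has no dependencies. It covers the POC step only: train on historical rows, then check predictions on held-out rows.]

```python
import math
import random


def make_fake_customers(n, seed=0):
    """Generate synthetic (features, churned) rows. NOT real customer data."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        tenure = rng.uniform(0, 5)     # years as a customer (made up)
        tickets = rng.uniform(0, 10)   # support tickets filed (made up)
        # Assumed ground truth: short tenure and many tickets raise churn risk.
        p_churn = 1 / (1 + math.exp(-(0.6 * tickets - 1.2 * tenure)))
        rows.append(([tenure, tickets], 1 if rng.random() < p_churn else 0))
    return rows


def predict(weights, bias, x):
    """Churn probability from a logistic model (stand-in for XGBoost)."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
    return 1 / (1 + math.exp(-z))


def train(rows, epochs=30, lr=0.01):
    """Fit the logistic model with plain stochastic gradient descent."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in rows:
            err = predict(weights, bias, x) - y
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def accuracy(weights, bias, rows):
    """Share of rows where the thresholded prediction matches the label."""
    hits = sum((predict(weights, bias, x) >= 0.5) == bool(y) for x, y in rows)
    return hits / len(rows)


data = make_fake_customers(2000)
train_rows, test_rows = data[:1500], data[1500:]  # hold out rows for validation
weights, bias = train(train_rows)
holdout_accuracy = accuracy(weights, bias, test_rows)
print(f"holdout accuracy: {holdout_accuracy:.2f}")
```

[On this synthetic data about half the customers churn, so a holdout accuracy clearly above 50% is roughly what "show me that you can predict" means at the POC stage. Wiring the same predict function to data arriving every day is the productionization step Mike talks about next.]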
Kirill Eremenko: Okay. Gotcha. All right. Well, this would be
interesting to hear from you who's in the field, you
guys work in tons of these projects. What would you
say roughly is the estimated amount of time ratio
between the data science side of things, doing the
work and preparing the model and the productionization
of the model? How would you split the time required
by your team on those two components?
Michael Segala: It's a very open ended question and it depends
phenomenally on the project, obviously. You have to
realize that for us we tend to work on more innovative
type of projects, because a lot of these low hanging
fruit problems, internal data science teams are doing,
or you can call some API to do it, you might
necessarily need to bring us in for some of the bigger
type of stuff. A lot of our projects are more kind of that
cutting edge bigger projects. For us, I tend to try to
run a first POC in the matter of say 4 to 12 weeks, give
or take that timeframe, if it's fast 4 weeks, if it's a little
longer 12 weeks, in that probably half of that time is
spent getting the data, thinking about it, doing some
kind of exploratory analysis, cleaning it, playing with
it.
Michael Segala: Maybe a quarter is spent modeling it and then the last
quarter is spent explaining to the client, walking it
through, understanding it, validating it and things of
that nature. The first half of the project, maybe only
half of the time is spent with the algorithms. And then
I would say to productionize that, I mean that could
itself take anywhere from a day to a year. It really
depends on the business and how complex their IT
infrastructure is, how complex the data is. If there's
security issues, if there's compliance issues. That's
when you get into the whirlwind of just craziness. It
really depends.
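[Editor's note: The rough split Mike gives (half on data work, a quarter on modeling, a quarter on explaining and validating with the client) works out as follows for the 4 and 12 week cases. The phase names are labels chosen for this sketch, not SFL terminology.]

```python
from fractions import Fraction

# Shares of a first POC as described in the conversation.
# Phase names are illustrative labels, not terms from the episode.
POC_SPLIT = {
    "data_work": Fraction(1, 2),
    "modeling": Fraction(1, 4),
    "client_validation": Fraction(1, 4),
}


def poc_breakdown(total_weeks):
    """Return the number of weeks per phase for a POC of the given length."""
    return {phase: share * total_weeks for phase, share in POC_SPLIT.items()}


for weeks in (4, 12):
    parts = poc_breakdown(weeks)
    summary = ", ".join(f"{phase}: {float(w):g} weeks" for phase, w in parts.items())
    print(f"{weeks}-week POC -> {summary}")
```

[Productionization is left out of the split on purpose: Mike puts it anywhere from a couple of weeks to a year, so it does not reduce to a fixed fraction.]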
Kirill Eremenko: Wow. Sounds like that part is the more uncertain one
from a day to a year. Well it's lots of uncertainty there.
Michael Segala: Yeah. I mean, I'm being a little heavy handed with the
day, call it a couple of weeks. But yeah, I mean it
could be very quick to a very arduous task.
Kirill Eremenko: Okay. Well that's good to know. And that also shows
that there's a massive hidden complexity involved of
data science projects that a lot of executives don't
consider. If you have a data science strategy, that's
something you should have as part of your data science
strategy. If you're just developing your data science
strategy, not only should you include things like, do
you have the data, do you have data silos, how you're
going to break those silos, what kind of team are you
going to hire or who are you going to approach about
these projects? What kind of tools are you going to be
using for these projects, but also you need to include
this whole productionization of the models.
Michael Segala: 100% yep. Absolutely.
Kirill Eremenko: All right, let's shift gears a bit. That was an awesome
intro and like awesome overview of the world of data
science consulting and just in general data science
projects. Let's talk about some case studies. Last time
you shared three incredible case studies on the show.
In fact they had multiple components. I would say
even more than three case studies. Do you have any
new exciting things that you've been working on for
the past one and a half years that you can share with
us?
Michael Segala: I can and I should have remembered which ones I
shared, but I'll pick three ones and will probably be
different and if I repeat myself and remember just tell
me.
Kirill Eremenko: What you shared, first one was on cleaning
unstructured data with NLP pipelines. Then second
one was deep learning to detect cancer. And also we
talked about growing organs with deep learning. And
case study number three was gaining an advantage in
sports betting using machine learning.
Michael Segala: Fair enough. All right, let's actually, let's do a couple of
different ones as well. I like to always go back to
medical imaging. I remember that I had talked
about it last time. We've been working for about the past
year or so and I'll give you three again, just kind of
three or four random ones pretty quickly. We've been
working for about the past, I'll call it a year, a year and
a half with a client who is kind of bleeding edge from a
medical imaging perspective. And medical imaging is
extremely important for lots of different reasons. Let's
take a step back and think about why we care about
automation of medical imaging. Right now you go and
you get an MRI, you get a CT scan, you get a pathology
reading and basically what we're doing, we're detecting
cancer, we're detecting breaks, we're detecting
whatever it happens to be.
Michael Segala: There's this kind of coolness factor of can I use an
algorithm to predict probabilistically is this a tumor
and can I do that at a rate that is more accurate than
a radiologist. That's kind of the cool factor and
sure, right? We're getting to the point where we can do
that and we're getting to the point where FDA clears it.
But what's really interesting and why we really want to
do it is for two reasons.
Michael Segala: The first reason is reducing variability within the
medical profession, because right now, if I had an MRI
and I gave that to a doctor to predict or for them to tell
me if I have a cancer they technically will disagree with
a group of radiologists and they'll even disagree with
themselves at a pretty large fraction of a clip.
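[Editor's note: One simple way to put a number on the reader variability Mike mentions is pairwise agreement: across every pair of readers and every scan, how often do two reads match? The reads below are made up purely for illustration (1 = tumor, 0 = no tumor), and the reader names are hypothetical. The sketch only measures consistency; it says nothing about who is correct.]

```python
from itertools import combinations


def pairwise_agreement(reads):
    """Fraction of (reader pair, scan) combinations where the two labels match.

    reads: dict mapping a reader name to a list of labels, one per scan.
    """
    matches = 0
    total = 0
    for labels_a, labels_b in combinations(reads.values(), 2):
        for a, b in zip(labels_a, labels_b):
            matches += (a == b)
            total += 1
    return matches / total


# Made-up reads of the same six scans by three hypothetical radiologists.
human_reads = {
    "reader_a": [1, 0, 1, 1, 0, 0],
    "reader_b": [1, 0, 0, 1, 0, 1],
    "reader_c": [1, 1, 1, 1, 0, 0],
}

# A deterministic model returns the same labels every time it sees the same scans.
model_output = [1, 0, 1, 1, 0, 0]
model_reads = {"run_1": model_output, "run_2": list(model_output)}

human_agreement = pairwise_agreement(human_reads)
model_agreement = pairwise_agreement(model_reads)
print(f"human pairwise agreement: {human_agreement:.2f}")
print(f"model run-to-run agreement: {model_agreement:.2f}")
```

[Perfect run-to-run agreement is what removing variability means here; whether the consistent answer is also the right one still has to be validated separately.]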
Michael Segala: If we design a system that is unanimous and reduces
that variance, we're now getting to the point where we
can give care to a population in a very unbiased way,
it's a pretty significant kind of implication. The second
implication is, this actually takes doctors lots of time
to do, this could take minutes to hours of their time,
that is not spent with patients.
Michael Segala: Now you're kind of giving them back all of this time
where they can go and do what's really important,
which is seeing and talking with patients. That's really
why we want to do medical imaging and why it's such
a popular field, within deep learning and data science.
And I won't go on this long with all of them, I
really like medical imaging for lots of reasons.
Michael Segala: What we're doing with this medical imaging project is
we have the world's largest collection of 3D CT and
MRI brain scans looking for different cancers within
the sinus cavities. I think it's like 51 different tumor
types that can just establish within your seven
different cavity regions within the brain or within the
face. What we've done there is amassed large amounts
of data, paid, well, our client has paid lots and lots of
money for doctors to label it. And we've built extremely
sophisticated algorithms to detect very, very small
signatures of malignant like tumor cells within these
3D images.
Michael Segala: That's the first one, and that's been going on for a
while. Extremely successful. Kind of has shown them
to have accuracies, I can't really say the accuracy
numbers, but far exceeding what they would need to
be to get real kind of clinical validation. Very very
interesting, very profound. If we think about the
implications, so that's kind of the first one.
Kirill Eremenko: Quick question. What do you mean 3D images? Is it
like multiple layers of MRI scans?
Michael Segala: Well, an MRI is 3D, it's not a single 2D plane. You
actually have a stack of like 128 2D images that make up
one 3D image. You have to look across the X, Y, and
the Z plane. And obviously within that Z dimension,
you can have, that's where a tumor might be
embedded within two or three of the actual slices. It's
a very complex problem, because now you've taken a
data set and for every image you basically multiplied it
by a factor of a hundred. Just think of the size of these
data sets and the complexity of the algorithms that have
to happen.
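[Editor's note: A quick back-of-the-envelope sketch of the point about size. The 128 slices come from the conversation; the 512 x 512 in-plane resolution and 16-bit intensities are assumptions made for the arithmetic, not figures from the episode. The toy mask at the end is invented to show a "tumor" that only appears in a couple of slices along the Z axis.]

```python
def voxel_count(height, width, slices=1):
    """Number of pixels in one slice, or voxels in a stack of slices."""
    return height * width * slices


SLICES = 128           # from the conversation: roughly 128 2D images per 3D scan
HEIGHT = WIDTH = 512   # assumed in-plane resolution (not stated in the episode)
BYTES_PER_VOXEL = 2    # assumed 16-bit intensities

slice_pixels = voxel_count(HEIGHT, WIDTH)
volume_voxels = voxel_count(HEIGHT, WIDTH, SLICES)
volume_mib = volume_voxels * BYTES_PER_VOXEL / 2**20


def slices_with_tumor(mask_volume):
    """Indices along Z of the slices that contain at least one labeled voxel."""
    return [
        z for z, slice_2d in enumerate(mask_volume)
        if any(any(row) for row in slice_2d)
    ]


# Toy 4-slice, 3x3 label mask (made up): the "tumor" touches slices 1 and 2 only.
toy_mask = [
    [[0, 0, 0], [0, 0, 0], [0, 0, 0]],
    [[0, 0, 0], [0, 1, 0], [0, 0, 0]],
    [[0, 1, 0], [0, 1, 0], [0, 0, 0]],
    [[0, 0, 0], [0, 0, 0], [0, 0, 0]],
]
tumor_slices = slices_with_tumor(toy_mask)

print(f"one slice: {slice_pixels:,} pixels")
print(f"one volume: {volume_voxels:,} voxels (~{volume_mib:.0f} MiB uncompressed)")
print(f"slices containing tumor voxels: {tumor_slices}")
```

[Under these assumptions every scan carries 128 times the data of a single 2D image, and a model that looks at one slice at a time would miss the fact that the labeled voxels in neighboring slices belong to the same structure.]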
Kirill Eremenko: Yeah. Can imagine. If it's possible for you to share
what kind of algorithms or even branches of machine
learning or other areas of AI did you guys use for this?
Michael Segala: This is all deep learning, this is all computer vision
and I just want to make a point here because this is a
great question. You cannot take an off the shelf VGG
16 or 19 or whatever they have out now and do
transfer learning and expect to get a medically viable
algorithm. The stuff that people play with is great from
an education standpoint, if you do it on Kaggle sure,
that's fun. But if you really want to be serious about
solving these problems. You're really starting from
scratch and designing from a research perspective
these algorithms in extremely deep networks, very
complex systems, and you'd better have access to lots
of really big and powerful GPUs.
Michael Segala: We write all of this from scratch in pure TensorFlow,
because [inaudible 00:31:52] is way too restrictive and
they just go to town and just really, this takes a long
time to do. It's all very custom kind of convolutional
networks and stuff like that. And you do lots of
cleaning and pre-processing and post-processing that,
just go on and on to get the accuracies up and up.
Kirill Eremenko: Gotcha. How'd you guys choose TensorFlow over
PyTorch?
Michael Segala: I mean the team does for whatever reason. Sometimes
the client demands it, sometimes for whatever reasons,
our team chooses it. For this client specifically, I don't
remember why the choice was made but for us, I
mean, it's not a one or the other. It's whatever best fits
that very specific situation. For this maybe it was,
TensorFlow was better for these 3D images over
PyTorch, but I'm literally making that up. I don't know
why that specific choice was made, but for this client it
was made, I'm sure for a very specific reason.
Kirill Eremenko: Wow. So many questions.
Michael Segala: Sure.
Kirill Eremenko: With deep learning, very interesting. First one would
probably be one of the main parts of deep learning is
architecting the neural network, finding out or
experimenting with how many hidden layers you have,
how many neurons in those layers and things like
that. Do you guys have any approaches that you have
developed in SFL Scientific over the years on what's
the most efficient way to experiment with neural
network architectures to get to the end result faster or
is it completely dependent on the project and it's a
creative component that people, that you rely on your
team to execute.
Michael Segala: I mean it's a little bit of both. It's a lot of experience
and a little bit of creativity. And now I'm speaking on
an area where my team would be much better suited
to speak than I am, but I'll pretend to know a lot
more than I really do. You have to realize that we've
worked on these kinds of medical imaging problems for
years, all the way from our graduate backgrounds
through the past several years, and a lot of our
folks have been working on problems like this for 10 or
20 years. We know computer vision and deep
learning in the medical space very well. We happen to
have a pretty good understanding of how to build
architectures around understanding and segmenting
and classifying DICOM images like CT or MRI. And we
know kind of the computing power, we know the size
of the data. We can calculate the number of neurons
to say, "Hey, I need to show incrementally that we're
getting better and better accuracies."
Michael Segala: Because you don't start by throwing the kitchen sink
at the problem. You start small and you start quick to
kind of iteratively show that you can make progress.
Design a network that you can do in a couple of hours
and then show it works. Now a couple more hours or a
couple of days or a couple of weeks. You're always
building on that, intentionally moving in a kind of
structured way. It is obviously just knowing some stuff
and then being smart around selecting and kind of fine
tuning your network and growing that as a function of
your accuracy demanding it. Not a great answer, but
it's my answer.
Kirill Eremenko: No, no. I like what you said about starting small, I
think that's important because maybe somebody might
be working on a project and they get an accuracy rate
with a certain architecture of, I don't know, like 60%
and that really is discouraging to them. And they
completely change the approach, they abandon that
first idea that they had and they try something
completely different. But what I'm getting from what
you're saying is that, okay you got to 60%, see if you
can get that to 70%. Can you adjust it rather than
completely abandoning it.
Kirill Eremenko: You might've had a great idea at the start, see if you
can adjust it and increase, increase, increase and get
to that end goal. The point is not to hit the bull's eye
right away, but just like keep throwing the darts until
you get closer and closer and closer. And you finally
hit the bull's eye.
Michael Segala: There's two things. Great data scientists are great
problem solvers, hands down. Being thoughtful about
why things aren't converging or what can be improved
on. And then second, to your number of 60%: I
challenge a lot of our folks and a lot of our clients,
when we start throwing out numbers like 60%, 70%,
80%. I'll always say, "Well, what's good? Is 80%
accuracy on detecting cancer good?" And it actually
provokes a lot of thought about what an actually
good accuracy is and what you would do if it was 80% or
60% or 99%. When you're a data scientist and you're
sitting there and you're building these algorithms and
you're getting your accuracy numbers, you really need
to think about, well, what is needed for the business,
what do these accuracy values actually
correspond to in terms of an outcome, and what level
do I really need to achieve?
Michael Segala: It's not this kind of playground science laboratory.
You're doing this for a business, for a real purpose so
figure out that purpose then work backwards in terms
of what your accuracy needs to get to. I think that's
such a critical point that most folks just ignore.
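Editor's note: the point about raw accuracy can be made concrete with a small sketch. The patient counts below are invented for illustration and are not from any SFL Scientific project. A "model" that never flags disease still scores 98% accuracy on this made-up screening set while catching none of the real cases, which is why the target metric has to come from the business or clinical purpose.

```python
# Hypothetical screening data: 1 = disease present, 0 = absent.
# 1,000 patients, 20 of whom actually have the disease (assumed numbers).
y_true = [1] * 20 + [0] * 980

# A useless "model" that always predicts "no disease".
y_pred = [0] * 1000


def accuracy(y_true, y_pred):
    """Fraction of all predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred):
    """Fraction of real positives the model actually caught."""
    true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    actual_pos = sum(y_true)
    return true_pos / actual_pos if actual_pos else 0.0


print(f"accuracy: {accuracy(y_true, y_pred):.2f}")  # accuracy: 0.98
print(f"recall:   {recall(y_true, y_pred):.2f}")    # recall:   0.00
```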
Kirill Eremenko: Okay. Totally, totally agree. Thank you for that. That
was case study number one, medical imaging.
Michael Segala: All right, let's see. I have another great case study. I
hope I don't get in trouble for this one. We'll see. I'm
going to be very, very light with the details. We do
some work within the federal government. One of them
happens to be with a client that develops the baggage
screening equipment in airports: the stuff that you
walk through, the stuff that you physically put your
baggage through, and the stuff that your checked baggage
goes underneath and through. Those are actually
just large CT scanners, producing large CT images. And what
happens is as your bag is going through, like you
know, you go through the airport security, you're
sitting there, it takes a second and then you have a
screener, a TSA agent sitting there and they say, "Hey,
I see an interesting object." It could be a knife, it could
be a gun or they're looking for other objects like
explosives and things of that nature.
Michael Segala: You could imagine that these machines might have
some interesting algorithms built into them. And you
can imagine-
Kirill Eremenko: You'd hope so.
Michael Segala: You could imagine even further that nowadays we
would probably want to enhance those algorithms by
using like a deep learning solution or really innovative
solutions. If you imagined all those things, the TSA
probably works with consulting companies that
design and develop these types of algorithms for
folks. We may be one of those companies doing some
really interesting work around detection for the TSA.
Kirill Eremenko: Or maybe not.
Michael Segala: Or maybe not, I don't know. That could be use case
two, but we won't, I don't know how much I'll get in
trouble for that one so we'll skip that one for today.
Kirill Eremenko: Sounds good.
Michael Segala: But very similar. It's object detection, it's
segmentation, it's classification around really
interesting images. And that image could be anything,
it could be, go ahead.
Kirill Eremenko: I was just going to say that, it just shows that your
existing expertise in the medical space with imaging is
very transferable to other industries such as scanning
baggage.
Michael Segala: Yup. Absolutely. Other types of use cases, we're seeing
a lot in these very traditional industries like
manufacturing, retail, consumer goods, where they
have lots of logistical and supply chain problems. This
one's not a real sexy one, but it's something that we
see a tremendous amount of potential lift for.
Increasing logistics and supply chain is an area where
there's a lot of hot press happening at the moment.
Michael Segala: If you're a beverage company, if you're a company that
sells lots of jeans or whatever you happen to do and
you're selling tens of thousands if not hundreds of
thousands of these products, the question becomes
very simple, how can I use an AI solution? Whatever
that means. Some machine learning or deep learning
to actually allocate merchandise in a much more
optimized way. We have a few different clients in a
lot of these big industries that ship, we're talking hundreds
or millions of individual items every day, every week,
every month that they want to be able to dynamically
understand, how do I ship them? How do I become
better about not wasting material? How do I increase
my bottom line by just doing that in a more optimal
way. We've seen very recently a lot of these industries
looking out because they're seeing what machine
learning can do over their very traditional kind of rule
based forecasting methods just to enhance these
operations.
Michael Segala: A lot of our use cases just literally in the last couple
months have been around that supply chain and
logistics. If you're somebody who's looking at
interesting problems, I think that most big companies,
most Fortune companies or even smaller mid
market, they all have very similar types of use cases
around this space. Forecasting, supply chain
manufacturing where you can do a lot of interesting
stuff. That's kind of a not really a single use case, but
lots of use cases baked into one there. A lot of real
great value there, very different, very time series like
data and things of that nature. And what you can start
doing there is coupling from a predictive side if you're
also doing supply chain, you have things that are
failing. Machines are failing, equipment is failing. And
the question becomes in that same supply chain when
I'm doing my forecast, can I also understand failure
events, right? Predictive maintenance and whatever it
happens to be on those same machines.
Michael Segala: We're seeing these companies starting to collect and
analyze all this information, wanting to predict when
their machines are going to fail. How often do we need
to take them offline? How will that affect their
shipping? How will that affect their logistics? And kind
of solving two problems at once: a better forecast,
plus being able to augment that and not necessarily need
to fix machines after they break, but kind of fix them
beforehand. It's kind of two things boiled into one. But
you could potentially do it all together. It's kind of the
second use case we've been seeing a lot of recently.
Kirill Eremenko: Gotcha. On that, with logistics, I see there's lots of
components where data science can
be applied to solve challenges. But two of the
challenges that I'm quite familiar with are bottlenecks
in logistics, like where is the bottleneck? Is it at the
factory? Is it at the pickup location? Is it through the
route? Is it at the end? And the other one would be
optimizing routes. Things like the traveling salesman
problem. How do you get to as many destinations?
Like if you're delivering milk to different stores, how
do you get to them in the most efficient way
possible? Could you give us some examples of, what
kind of algorithms would you use in maybe these two
problems, bottlenecks and optimizing routes or maybe
other problems and challenges in logistics that you
guys have worked with before.
Michael Segala: Yeah. For instance to kind of answer your first
question first, lots of different algorithms could be
used here. Folks are starting to experiment and getting a
lot of success with these reinforcement learning algorithms,
whatever you want to call them, deep reinforcement or
regular reinforcement learning algorithms, to do these
kind of difficult optimization problems, right? Because
that's what it is, it's an optimization. If you kind of
take a step back, a lot of these traditional Markov
models or Monte Carlo simulations are very similar. You
have a very complex dynamical system. How do you
optimize across this entire system. It's not necessarily
the same as just a single kind of prediction variable,
but now you're doing it in a very complex manner.
Michael Segala: We're seeing a lot of interest and movement, especially
in some of our use cases, in that manner of kind of
using some of these [inaudible 00:44:17] bleeding edge
methodologies. For others, if you can still turn it into a
traditional machine learning problem, if you can
predict something, either binary or categorical or
forecasting, you can use whatever traditional ML
algorithm you want, or a deep learning algorithm. This
is a problem that lends itself to lots of different
opportunities. Optimization is a different class of
problems; you can even use genetic algorithms or
things of that nature. Lots of interesting stuff there.
Michael Segala: And then your second question was more specifically-
Kirill Eremenko: Sorry on that one. Do you guys have any approaches
like genetic algorithms or reinforcement algorithms,
deep reinforcement learning or machine learning, do
you have any of those that have shown to be the most
useful, the easiest for you guys to deploy or the
quickest win for your clients? Any comments on that?
Michael Segala: Oh yeah. It always depends, it depends on the problem
statement, it depends on the complexity of the
problem. Quickest wins are always going to be the
easiest algorithms, if you can map anything into a very
simple machine learning algorithm with a prediction
variable, you're going forward with some features.
Yeah, I mean you could do that pretty easily. If you
need to dynamically optimize a really complex system
and you need to go to like a deep reinforcement learning
algorithm, yeah, I mean that's going to take a lot more
time, and the amount of lift you might get there might
be extremely incremental.
Michael Segala: Again, right, there's a trade off in all of these things. I
always advocate very heavily for determining a
baseline as fast as possible, literally whatever the
fastest path to getting a number out to set a baseline,
do it, then start experimenting and making it more
complex.
Michael Segala: Whatever that means for this problem, start with an
ML model, then go to a deep learning model, then go to
a genetic algorithm, whatever it happens to be for your
problem. Always kind of think of it in a very
incremental fashion in complexity. That's at least my
opinion.
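Editor's note: a minimal sketch of the "baseline first" advice, assuming a simple weekly demand forecast. The sales figures and the two forecasting functions are hypothetical and chosen only to illustrate the workflow: get a number out with the simplest possible method, then check whether each step up in complexity actually beats it.

```python
# Hypothetical weekly unit sales for one product (made-up numbers).
sales = [120, 130, 125, 140, 150, 145, 160, 170, 165, 180]


def mean_abs_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def naive_forecast(history):
    """Baseline: next week looks like last week."""
    return history[-1]


def moving_average_forecast(history, window=3):
    """One small step up in complexity: average of the last `window` weeks."""
    recent = history[-window:]
    return sum(recent) / len(recent)


def backtest(series, forecaster, start=3):
    """Walk forward through the series, forecasting each point from its past."""
    actual = series[start:]
    predicted = [forecaster(series[:i]) for i in range(start, len(series))]
    return mean_abs_error(actual, predicted)


baseline_mae = backtest(sales, naive_forecast)
candidate_mae = backtest(sales, moving_average_forecast)
print(f"naive baseline MAE: {baseline_mae:.2f}")
print(f"moving average MAE: {candidate_mae:.2f}")
```

On these made-up, upward-trending numbers the naive baseline has the lower error, which is exactly the kind of thing you only find out if the baseline exists before the more complex model does.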
Kirill Eremenko: Love it.
Michael Segala: I think it's the best way to approach the problem.
Kirill Eremenko: Gotcha. Love it. Love the "establish a baseline as
fast as possible." I think that's golden advice for data
scientists out there.
Michael Segala: And then what was your second question?
Kirill Eremenko: The different types of problems I think, bottlenecks,
optimizing routes, maybe if you had some more to add
to those.
Michael Segala: Yeah. For instance, one of the problem statements that
we're currently working on, that's getting a lot of
interest, is in clinical trials. Clinical trials is actually a very
complicated problem because you're a big pharma
company and you need to run a trial for your
medication against a pretty diverse and large
population of people. Think of something simple like
Tylenol. You're not running clinical trials on Tylenol,
but if you were, you'd have a bunch of Tylenol, you
find a bunch of people at a bunch of medical sites and
you'd ship them some Tylenol or they would take it
and you would monitor how they interact with the
drug and things of that nature. That's how clinical
trials work.
Kirill Eremenko: What is Tylenol? Sorry, I'm not familiar with the US term.
Michael Segala: Oh geez. It just helps your headache. It's, what is it,
acetaminophen, whatever that word is.
Kirill Eremenko: So it's for headaches?
Michael Segala: It's for headaches yeah. It's been around for a hundred
years. But that was just kind of a silly example.
Kirill Eremenko: In Australia we have Panadol.
Michael Segala: Sure. I don't know what that is but sure. For clinical
trials you have this problem of needing to send out
things to the medical facilities for them to be able to
run their clinical trials, collect data, collect vital
information, collect even blood or other types of,
whatever you're collecting directly from the patient and
do so in a very complex manner. Because you have
patients in different countries, different ages, races,
genetic profiles, whatever it happens to be. You're
sending, you're shipping, you're receiving these things
that are all highly perishable. It has to happen in a
very kind of dynamic environment.
Michael Segala: For them, this logistics problem, with bottlenecks and
things that are highly perishable that are shipping all
over the globe, is a huge problem. And it's this very
dynamic system that we're applying these exact types of
algorithms to that we've been talking about. Clinical
trials is a good one, but everything's the same. If
you're shipping a pair of jeans, a bottle of Coke or a lab
kit for a clinical trial, the methodology is very similar
and the way to attack the problem is very similar. It's
just kind of that end use case. What you call it is a
little bit different.
Kirill Eremenko: Okay. Gotcha. All right, well thank you very much.
Logistics, amazing case study. Do you have one more
for us? I think we have time for one more.
Michael Segala: Jeez, how about this. What do you care about? I'll give
you a use case on something.
Kirill Eremenko: Ooh, good one. Let's do energy, the energy space. Do
you have any of those?
Michael Segala: Of course. I have some on energy. Let's see. I have a
few on energy. What could be interesting? I'll give you
two quick ones in energy.
Kirill Eremenko: Sounds good.
Michael Segala: One quick one is a lot of people, if you have a meter, I
don't know how you guys do it, but I assume you have
an energy meter sitting outside of your house, and
that energy meter is basically collecting information on
how much energy you used. It turns out that two types
of people tend to want to screw with that energy meter.
One is people from sometimes not as affluent
communities who don't necessarily want to pay the bill,
or affluent communities who don't want to pay the bill.
It could be either. Or drug dealers who are using an
absurd amount of energy that at its peak would trip some
kind of alarm, so they need to hide that.
Michael Segala: What you can do is you actually take a magnet to the
outside. I've never done it, but I've been told, you can
take a magnet to the outside and you can actually
trick the smart meter or whatever it is into a false
reading. And it shows much less consumption than
what you're actually using. We did some work with a
large energy company out in the UK that was running
into a lot of these problems, people were literally
putting magnets on their meters outside, fooling the
system and they were people fooling it because they
didn't want to pay the bill or drug dealers. And the
question was, can you take all of that time series data,
'cause it's very temporal time series data and look for
patterns that would be anomalous that they think
corresponds to somebody kind of adding these
fraudulent activities.
Michael Segala: We were given a pretty large set of data, but a very,
very small set of labeled data. Literally only like
10 or 20 labeled cases of these anomalies. We attacked
the problem a couple of different ways, both in a
supervised and unsupervised manner. We did a lot of
different things, tried to be really thoughtful about it,
and we were actually able to show you can spot these
anomalies and you can really see when people are
gaming the system from an energy perspective. That's
kind of one quick use case-
Kirill Eremenko: Just quickly on that, that's very interesting because
you had only 10 to 20 examples like in a spreadsheet
or like in a database with millions of rows of negative
results. You had 10 to 20 positives. How'd you deal
with situations like that? What is your advice to data
scientists out there? How do you attack a problem
where you only have under 20 examples of what is a
positive outcome that you are actually trying to
identify?
Michael Segala: This happens a lot of times. We actually fool ourselves
that big data is the challenge. The actual challenge is
small data. Big data's not a challenge, it's "Ah, okay,
we have big data." We get into a lot of these cases
where you have very few labeled positive examples.
You have to be very thoughtful about it. You could
theoretically create fake data sets to encapsulate very
similar behavior, right, through that same kind of
simulation and modeling. You could do that, or you can
start attacking the problem because you could treat
this as anomaly detection, pretending you didn't know
any labeled data. Can you actually spot anomalies?
And you have 10 or 20 in your back pocket to think
about, or you turn it into a supervised learning
problem with a very, very small holdout set. And find
and experiment with it.
Michael Segala: There's lots of different scenarios, but again it's really
about being a problem solver and thinking about, can
you do something that's convincing enough to yourself
from a technology standpoint that it's working and can
you make a business case that it should be
implemented? There's lots of different ways to solve a
problem, but you have to do it in a kind of systematic
way and be thoughtful about it.
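Editor's note: one way to picture the "anomaly detection with 10 or 20 labels in your back pocket" idea is the sketch below. It is not the method used on the project, which is not described in detail; the meter readings, the drop-ratio feature, and the 3.5 cutoff are all assumptions made for illustration. The model scores every meter without looking at labels, and the few known cases are used only afterwards as a sanity check.

```python
from statistics import median

# Hypothetical daily kWh readings per meter (made-up data).
# Meters "m4" and "m7" drop sharply halfway through, as a tampered meter might.
readings = {
    "m1": [10, 11, 10, 12, 11, 10, 11, 12],
    "m2": [20, 19, 21, 20, 22, 21, 20, 19],
    "m3": [15, 16, 15, 14, 15, 16, 15, 15],
    "m4": [18, 19, 18, 19, 6, 5, 6, 5],
    "m5": [9, 9, 10, 9, 10, 10, 9, 10],
    "m6": [30, 31, 29, 30, 31, 30, 29, 30],
    "m7": [25, 24, 26, 25, 8, 9, 8, 7],
    "m8": [12, 13, 12, 12, 13, 12, 13, 12],
}
known_tampered = {"m4"}  # the handful of labels "in your back pocket"


def drop_ratio(series):
    """Mean consumption in the second half divided by the first half."""
    half = len(series) // 2
    before = sum(series[:half]) / half
    after = sum(series[half:]) / (len(series) - half)
    return after / before


def robust_scores(values):
    """Distance from the median, in units of median absolute deviation."""
    med = median(values)
    mad = median([abs(v - med) for v in values]) or 1e-9
    return [abs(v - med) / mad for v in values]


# Unsupervised step: score every meter without using any labels.
ratios = {meter: drop_ratio(series) for meter, series in readings.items()}
scores = dict(zip(ratios, robust_scores(list(ratios.values()))))
flagged = {meter for meter, score in scores.items() if score > 3.5}  # assumed cutoff

# Sanity check: do the few known cases show up among the flagged meters?
print(sorted(flagged))
print(known_tampered <= flagged)
```

Here the known case "m4" is recovered and "m7" surfaces as a new candidate to inspect, which is the kind of evidence that can feed the business case Mike describes.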
Kirill Eremenko: That's so cool. I love your three examples, just to recap
on those, create fake data sets, anomaly detection so
pretend you don't even have those 20 and see what the
algorithm will do, completely unsupervised or
supervised learning with a very small holdout dataset.
I'll probably just add to that, that it's also important to
talk with the clients. And correct me if I'm wrong here,
but I think it's important to talk to a client and
understand how important for them are false positives
and false negatives.
Kirill Eremenko: In your case, in this case of an energy company, is it
really bad if you have a high rate of false positives,
would they prefer high rate of false positives or high
rate of false negatives? If you identify more cases
where people are allegedly trying to trick the meter,
how difficult is it for them to ask the electrician next
time they go outside to check if there's a magnet
on the box or not?
Kirill Eremenko: Based on that conversation with a client, you can fine
tune your algorithm to output either more findings in
terms of these anomalies or fewer. And in some
cases it might be, I don't think in the case of energy it
would be as bad as, say, the case of medicine, where
a false positive can actually change somebody's life.
Michael Segala: Absolutely. You're absolutely right. And what you said,
that's the really critical part. What the
mindset needs to be is: how do you tweak this
algorithm to actually fit what we care about capturing,
and what does that cost? Because the question is really,
what would it cost for the electrician to go back and
report it? And then how would they report that? And
where's that data stored? You get into these kinds of
cascading effects of what your algorithm actually
mandates the business to
implement. It's not a trivial problem. That's actually
where the real ingenuity and problem solving
comes in, kind of tweaking that outcome to
actually be effective. You're 100% on there.
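Editor's note: the trade-off between false positives and false negatives discussed here can be sketched as a threshold choice driven by costs. The scores, labels, and cost figures below are hypothetical; in practice the costs would come from the conversation with the client (for example, the price of an electrician's visit versus the revenue lost to an undetected tampered meter).

```python
# Hypothetical anomaly scores (0..1) paired with true labels
# (1 = real anomaly) for a small validation set.
scored = [
    (0.95, 1), (0.90, 1), (0.80, 0), (0.70, 1), (0.60, 0),
    (0.40, 0), (0.35, 1), (0.20, 0), (0.10, 0), (0.05, 0),
]


def total_cost(threshold, scored, cost_fp, cost_fn):
    """Total cost of flagging everything with score >= threshold."""
    cost = 0.0
    for score, label in scored:
        flagged = score >= threshold
        if flagged and label == 0:
            cost += cost_fp  # false positive, e.g. a wasted inspection
        elif not flagged and label == 1:
            cost += cost_fn  # false negative, e.g. a missed case
    return cost


def best_threshold(scored, cost_fp, cost_fn):
    """Pick the candidate threshold with the lowest total cost."""
    # 1.01 is above every score, i.e. "flag nothing".
    candidates = sorted({s for s, _ in scored}) + [1.01]
    return min(candidates, key=lambda t: total_cost(t, scored, cost_fp, cost_fn))


# Cheap inspections, expensive misses: flag aggressively.
print(best_threshold(scored, cost_fp=1, cost_fn=20))   # 0.35
# Expensive false alarms (closer to the medical case): flag conservatively.
print(best_threshold(scored, cost_fp=20, cost_fn=1))   # 0.9
```

The same model and the same scores give very different operating points once the costs change, which is the "what does that cost?" question from the conversation.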
Kirill Eremenko: Gotcha. Okay. That was the first one on energy. And
second one?
Michael Segala: Second one. Let's see. We have a few. The other one we
were doing, kind of similar, a little bit different. This is
energy as related to internal devices in the home. And
the question for them is, if I had all of the kind of time
series data of the meter coming in, can I understand
which appliance that data is coming from? There's this
concept of energy disaggregation. Meaning, if I only
gave one overall signal, can I see what came from the
refrigerator or the microwave or the TV or whatever
else it happens to be.
Michael Segala: Again, it's a very interesting class of algorithms where
you can kind of look at consumption patterns and
then kind of detangle them in terms of understanding
exactly where your consumption comes from and why
you'd want to do that is because you would be able to
show, "Hey, your appliances over here are causing
80% of your bill. Get something more efficient or
unplug it or do something of that nature." It's this
real kind of personalization that is happening,
especially within these big energy companies that want
to get consumer buy-in and always
have consumers coming back and never leaving them,
by showing them these kinds of innovative solutions
for their energy bills and outputs and
things of that nature, especially as we become a
greener and greener society.
Michael Segala: This was a very interesting one and actually showed
extremely promising results as well, which that
company is using.
Kirill Eremenko: How do you go about a problem like that? How did you
disaggregate components of a signal?
Michael Segala: This was actually a while ago, if I remember correctly,
and I can be completely wrong here, I think we had a
pretty small training set as well. They had a couple
of houses or a dozen houses or a hundred houses, I
don't remember at this point, that actually had smart
meters plugged into all of the devices. You were able to
see a real training set of, here's the total consumption
and here's what all the devices were, which was fine,
you can show that. Now the question becomes on a
very new house, does that algorithm actually transfer
over and is it generalized?
Michael Segala: That's really the big question and I think we used two
different approaches. The first being a lot of these
Markov models; hidden Markov models, I believe,
worked really well for this case. This was maybe about
two years ago when deep learning was still kind of in
its infancy, not really infancy, but really being used,
especially for time series. I think we started playing
around with some deep learning in the time series space
there as well. And that was showing some really nice
progress, but we were able to achieve what they
wanted with those kinds of Markov models and they
kind of took that and ran with it. If I remember
correctly, that's how we attacked that problem back
then.
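Editor's note: to give a feel for how a hidden Markov model can pull an appliance signal out of a total meter reading, here is a deliberately simplified sketch. It is not the model built for the client. It assumes a single appliance with two hidden states (off/on) sitting on top of a constant base load, Gaussian noise, and hand-picked parameters; every number is illustrative. A real disaggregation system would have to handle several appliances at once and learn its parameters from the sub-metered training houses.

```python
import math

# All numbers below are illustrative assumptions, not values from the project.
OFF_MEAN, ON_MEAN, NOISE_STD = 100.0, 250.0, 10.0  # total watts seen at the meter
STATES = ("off", "on")
EMISSION_MEAN = {"off": OFF_MEAN, "on": ON_MEAN}
LOG_START = {"off": math.log(0.5), "on": math.log(0.5)}
# "Sticky" transitions: an appliance tends to stay in its current state.
LOG_TRANS = {
    "off": {"off": math.log(0.9), "on": math.log(0.1)},
    "on": {"off": math.log(0.1), "on": math.log(0.9)},
}


def log_gaussian(x, mean, std):
    """Log-density of a normal distribution at x."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)


def viterbi(readings):
    """Most likely off/on sequence for the appliance given total-meter readings."""
    best = {
        s: LOG_START[s] + log_gaussian(readings[0], EMISSION_MEAN[s], NOISE_STD)
        for s in STATES
    }
    history = []  # history[k][s] = best previous state for state s at step k + 1
    for x in readings[1:]:
        new_best, came_from = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: best[p] + LOG_TRANS[p][s])
            came_from[s] = prev
            new_best[s] = (
                best[prev] + LOG_TRANS[prev][s]
                + log_gaussian(x, EMISSION_MEAN[s], NOISE_STD)
            )
        best = new_best
        history.append(came_from)
    # Trace the best path backwards from the most likely final state.
    state = max(STATES, key=lambda s: best[s])
    path = [state]
    for came_from in reversed(history):
        state = came_from[state]
        path.append(state)
    return path[::-1]


meter = [100, 104, 98, 251, 247, 255, 249, 101, 97, 102]  # hypothetical readings
states = viterbi(meter)
# Attribute the on/off difference to the appliance whenever it is decoded as "on".
appliance_watts = [ON_MEAN - OFF_MEAN if s == "on" else 0.0 for s in states]
print(states)
print(appliance_watts)
```

The generalization question Mike raises maps onto the hard-coded parameters here: means and transition probabilities fitted on a few instrumented houses may or may not carry over to a new one.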
Kirill Eremenko: Okay. Well, Mike, thank you so much for sharing
those case studies. Amazing medical imaging, logistics, and
energy case studies. If our listeners want to ... If you
guys want to check out more case studies as I
mentioned at the start, head on over to
sflscientific.com and they have a tab there called
solutions or the other one is Our work and you can
read quite a bit about different use cases in different
industries.
Kirill Eremenko: Before we finish up, 'cause we are slowly getting to the
end of this super exciting podcast which could
probably go on for a few more hours. But before we
finish up, I wanted to ask you a
more philosophical question I like to ask guests
sometimes, and that is: from where you stand and from
all these projects and clients and industries and
approaches and employees, you've seen people, you've
seen the data science. Where do you think the field of
data science is going and what do our listeners need to
look out for to prepare for the future that's coming
ahead?
Michael Segala: That's a tough question. Are you asking that as
somebody who wants to get into the data science
space, as a data scientist? Or are you asking that in
terms of where do I think industries are going?
Kirill Eremenko: Ooh, that's a good question, how about we do both.
Michael Segala: 'Cause those are very different conversations.
Kirill Eremenko: How about we do both, what's your view on both of
those?
Michael Segala: All right. Both of them will go quick because I don't, we
could talk for a long time. Sometimes I get too
talkative. Let's start easy, data scientists. And we see
this a lot, I'll have an open REQ and by the way, we
have lots of open REQs if somebody wants a job, come
and talk. But we see more and more people wanting to
become data scientists transitioning into this space.
There's a lot of great potential, money being invested
and people honing their skills with courses like the
ones you teach, people going to conferences like the
ones that you guys give, a lot of great mind share,
knowledge share and things of this nature, which are
so much easier than when I started about six years
ago. I think that's gonna continue to happen.
Michael Segala: However, I think algorithms themselves [inaudible
01:00:22] come and already are kind of becoming very
commodity. Everybody nowadays can fire up XGBoost
and run something. That doesn't make you a good data
scientist; that makes you extremely commodity in your
job. I think data science is going to start to become a
wider role that is going to be, as we're talking about
here, it's really a problem solver. How do you take a
business problem and solve it with data? That's really
the big question here. And unless you're capable of
thinking about the larger problem and the impact that
it has on the business and how you're actually going to
take that algorithm and actually allow your business
to generate revenue or cut costs, you're probably not
going to be a very successful data scientist, especially
as these tools become more and more efficient and will
start to automate some of your job away.
Michael Segala: I really think the trend in our industry will also be to
automate out some of our own data scientists who are
doing just kind of very routine type of work. But the
ones that survive and do a great job I think are going
to be probably one of the most critical folks within the
company by far. That's really how I see that transition
happening. And I actually don't think that, that's far
away, I think within the next 12 to 24 months. Maybe
the next time we talk on one of these, we'll start to see
that already.
Michael Segala: In terms of, and let me know if that didn't answer the
question-
Kirill Eremenko: That totally makes sense. I just want to add here that
from my experience 'cause I ... For listeners who don't
know, I worked at Deloitte for two years in the data
science division and what I can definitely say and
probably you've gathered from this conversation we're
having here with Mike that being in consulting really
helps with that, becoming a problem solver,
understanding how to not just like do a cool project or
a cool algorithm but think of the business as a whole.
Kirill Eremenko: If you are looking for a job, I just want to reiterate
Mike's call, give Mike a shout or contact Mike
on LinkedIn or somewhere else and chat to him,
because a few years in consulting really puts
the whole game of data science into a different
perspective. Not to say that you can't get there on your
own without consulting, if you're in an industry that's
totally fine as well. Just from experience, I know that
consulting is a great way to get to that type of mindset.
Michael Segala: Yeah. I tell my new employees, within 12 months
they'll probably have more project depth and skills
than somebody who sits in a single kind of vertical for
10 years. Just the breadth of projects and the depth
that we get to get into extremely quickly. It's exciting.
But it's hard, it's not stagnant and you're always kind
of thinking and moving on your feet. It's not for
everybody, but I love it.
Michael Segala: In terms of businesses, I think we're really at this
critical junction in terms of where data science will go.
We see industry starting to invest for sure. They invest
kind of small pockets of money on a few small
initiatives, the big companies that make the media
hype, the Apples, the Googles, the Airbnbs, those
aren't even relevant, those are the outliers, the
anomalies.
Michael Segala: I'm talking about the other 99% of the market. And we
know, we work with so many of them, they see that
there's a lot of interest out there. There's a lot of
innovation happening and there's a lot of hype and
potential. They're starting to make strategic bets into
this space by funding a couple POCs, proof of
concepts, hiring a few individuals or a larger team
depending on the organization. But we're really at that
critical point where now, in the beginning of 2019, over
the next kind of nine months, a lot of folks have
budgeted data science into their 2019 workflows, and that
needs to start paying off. They need to see real revenue
generated, or costs cut through better automation
and things of that nature, or margins
increase.
Michael Segala: I think if we don't start delivering past POCs and really
start embedding algorithms into deeper kind of
production workflows, it's actually going to take a big
hit and a big step back and people will start defunding
AI into their 2020 and 2021 plans. And I honestly
think that, there's a lot of folks that, very [inaudible
01:04:50].
Michael Segala: Here's a fun app on Instagram and I just want to go
and repeat that and kind of play off of them and you're
always going to see that in the market, but that's
quickly going to become cannibalized in this AI space.
When you have all these big IBM commercials and
Microsoft commercials that are really hyping AI and
people are investing, they need to see something very
quickly pay off or we're just not going to continue to
get funding and this market will start to slow for sure.
Michael Segala: It's up to you, the listeners, you have so many great
listeners on this podcast that are the ones in the
trenches. And I say that wholeheartedly, like I think
that your audience is by and large some of the greatest
audience, especially in the data science space that I've
interacted with and I still get, literally every week
people [inaudible 01:05:35] about your podcast and
what you're doing. And they come to me and say, "Oh,
I heard you on Kirill's podcast. That's great."
Michael Segala: You definitely are driving the correct audience. It's
kind of all of our responsibility as data
scientists to ensure that these projects are successful
and we don't just kind of cannibalize ourselves in the
next year or two and not get any bigger funding.
'Cause then we're all going to be out of a job. That's
honestly how I think the market is going to mature.
Kirill Eremenko: Fantastic. That ties back into that productization
discussion that we had. For data scientists out there:
don't just leave your project. It feels very satisfying to
find the insights and deliver them, but talk to your
manager, boss, client, whoever it is you're talking to,
and consult them. Advise them on next steps, on how
they can actually put that into production. Follow up
with them, go back in a few weeks and check if your
model is performing, if it's deteriorating or if it needs
some maintenance. Be proactive in that [inaudible
01:06:32]. It's kind of like marriage. If you get married,
you don't just stop there. You have to keep dating,
your wife I mean. Or husband. You have to keep
caring for each other. It's not like you won the
game once you got married. There's lots more. And
now it's the aftermath and the commitment that comes
afterwards.
Michael Segala: I see you've been well trained as a husband.
Kirill Eremenko: Not a husband yet my friends.
Michael Segala: Soon to be [crosstalk 01:07:01]
Kirill Eremenko: Yeah one day.
Michael Segala: One day, good for you.
Kirill Eremenko: Mike, I wanted to ask you: how many clients have you
guys worked with, if it's not a secret? Just curious.
Michael Segala: Oh geez. I don't know the number, but it's in the
hundreds.
Kirill Eremenko: Wow. You guys-
Michael Segala: I don't know the number ...
Kirill Eremenko: ... are doing really well. All right, well Mike, thanks so
much for coming on the show, been a huge pleasure.
Before I let you go, what are some of the best ways for
our listeners to contact you, whether they are
interested in working with you or whether they're
interested in maybe joining your team.
Michael Segala: Please come to the website, sflscientific.com. There is a
place there. I think you could either chat with us or
you could inbound an email. That all comes directly to
my folks, who tell me right away. If you're looking for a
job, that HR, I think we have like an HR jobs page,
that gets looked at. I tell you, we interview almost a
person a day at this point. A lot of them are great
candidates, but for whatever reason don't work out.
We're always looking for really great folks. If you
inbound to us, I guarantee one of our folks will see it
in a few minutes and reply back accordingly. So please
be in touch, that's probably the best way to get in
touch is just through the website.
Kirill Eremenko: Okay, great. And is it okay for people to connect with
you on LinkedIn as well?
Michael Segala: Of course. It's my pleasure.
Kirill Eremenko: Awesome. Thanks so much. Of course we'll share all of
those links in the show notes. And on that note, Mike,
thanks so much for joining us today and sharing
amazing case studies and your view on the world of
data science.
Michael Segala: Again, thank you so much. It's always an honor and
pleasure to see your progression as well. Best of luck
to you and hope we can talk again soon.
Kirill Eremenko: There you have it. That was Mike Segala from SFL
Scientific. I hope you enjoyed today's episode and got a
lot of valuable takeaways from the show. If you'd like
to connect with Mike, hit him up on LinkedIn. You can
find the URL as well as all the other materials
mentioned on the episode in the show notes at
www.superdatascience.com/249, that's
superdatascience.com/249.
Kirill Eremenko: There you can also find the transcript for this episode
if you'd like to read it. And my personal favorite part
for today was the challenge of small data, dealing with
unbalanced datasets. And the three approaches that
Mike shared with us ranging from creating fake data
sets to unsupervised anomaly detection, to supervised
learning with a small holdout dataset. Some very
exciting stuff. And of course apart from just the
challenges of small data, there are plenty of other
valuable gems shared by Mike.
Kirill Eremenko: And I'd like to reiterate again the call to action from
Mike and the team at SFL. If you're looking for a job
and you'd like to join consulting, then go, head on over
to SFL Scientific and look for the careers page and
apply there. If you're a business owner, an executive
director and you have some challenges that
you think can be solved with machine learning, you'd
like to explore the space of AI and Data Science, then
hit up Mike, don't hold back and see how SFL
Scientific can help your business grow and become
even more competitive.
Kirill Eremenko: And on that note, if you're enjoying the
SuperDataScience show, make sure to head on over to
iTunes or to your favorite app for playing podcasts and
leave us a review there. I'll really appreciate it. I love
reading your reviews. Thanks so much and I look
forward to seeing you back here next time. Until then,
happy analyzing.