This is blog post #1 - why am I so interested in Kafka?

My first real introduction to Kafka was “way back” in April 2015, at IOUG’s Collaborate conference. This Oracle technology conference was the setting and context for Gwen Shapira (http://www.confluent.io/blog/author/gwen-shapira), Oracle ACE Director, author and System Architect at Confluent, to present a talk entitled “Kafka for Oracle DBAs”.

Gwen is well known and respected in the Oracle user community, in which Dbvisit has also been involved for many years, and hearing of her enthusiasm for this new technology piqued our interest.

And, as recent Google search statistics show, we are not the only curious enquirers:

As we became more interested in Kafka, and began to pay attention to happenings in this domain, we noticed its popularity, and reach, rapidly spreading. It was soon clear that this was a big deal.

After conversations with customers and prospects, and with Gwen herself (whose encouragement and nudging over time helped move us forward), Dbvisit decided to also throw its hat into the Kafka ring, and committed to building out a connector to facilitate the movement of Oracle change data into Kafka - and we will discuss the specifics of this later.

http://ingest.tips/2015/04/26/from-relational-into-kafka/

We have partnered with Confluent (the people behind Kafka) to develop an open source Java component: an Oracle source connector for the Kafka Connect framework (link), which Confluent have built to facilitate ingestion into, and out of, Kafka. We are aiming to move this connector from beta to an official release in the coming weeks, and in anticipation of this we presented a joint webinar with Confluent a few weeks back (link) detailing our plans. So stay tuned for more on this!
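For those curious about what building such a connector involves, the Kafka Connect framework has source connectors implement a small Java API - essentially “give me the next batch of records”. The sketch below is not our connector’s actual code; the class name, the "topic" configuration key and the placeholder change record are purely illustrative:

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;

// Hypothetical task class - the real Dbvisit connector's internals are not shown here.
public class OracleChangeSourceTask extends SourceTask {

    private String topic;

    @Override
    public String version() {
        return "0.1-sketch";
    }

    @Override
    public void start(Map<String, String> props) {
        // Connector configuration (target topic, connection details, etc.) arrives here.
        topic = props.get("topic");
    }

    @Override
    public List<SourceRecord> poll() throws InterruptedException {
        // A real connector would block until captured Oracle change data is available;
        // here we simply wait a moment and emit a single placeholder message.
        Thread.sleep(1000);
        String change = "placeholder change record";
        SourceRecord record = new SourceRecord(
                Collections.singletonMap("source", "oracle"),  // source partition
                Collections.singletonMap("position", 0L),      // source offset (resume point)
                topic,
                Schema.STRING_SCHEMA,
                change);
        return Collections.singletonList(record);
    }

    @Override
    public void stop() {
        // Release any resources (connections, log readers) here.
    }
}
```

The framework takes care of writing the returned records into Kafka and tracking the offsets, which is a large part of the appeal of building on Kafka Connect rather than a standalone producer.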

What is Kafka?

But for many, Kafka, the problems it solves, and the possibilities it opens up, may be new - so let’s start back at the beginning, and explain what it is we are talking about.

Kafka is open source software - an Apache project (http://kafka.apache.org/), birthed out of LinkedIn, and spearheaded by the team at Confluent (link). “Technically speaking” it has been described as “publish-subscribe messaging rethought as a distributed commit log” that has, from the ground up, been engineered to be fast, scalable, durable and distributed by design.

That is, Kafka does a lot of big data "stuff" really well...
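To make the “publish” half of publish-subscribe concrete, here is a minimal sketch using the standard Kafka Java producer client; the broker address (localhost:9092) and topic name (demo-topic) are assumptions purely for illustration:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class MinimalProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Assumed local broker - adjust to your environment.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish one message to a (hypothetical) topic; Kafka appends it to that topic's log.
            producer.send(new ProducerRecord<>("demo-topic", "key-1", "hello, kafka"));
        }
    }
}
```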

In the current business climate I think it is fair to say that data is a strategic asset, but one with its own set of challenges.

For one, there is more data than ever to deal with (from a growing array of sources), and tools for effective management of these (oftentimes) disparate feeds are essential.

At the same time, making use of this data - for business advantage - is an imperative. This requirement is moving from the realms of “nice to have”, or the domain of only the largest of enterprises, to becoming a critical business function for organisations of all sizes. Data is a differentiator, both in terms of what companies can offer their customers (types and speed of services), and also what they know about themselves and their operations, enabling them to see and improve processes and efficiencies.

One of the promises of Kafka, as a technology solution, is that it can take a multitude of data from all over your organisation and centralise it - but in a very different way to simply landing it in a singular end-point repository such as a Hadoop data lake. This is about nimbleness; decoupling data writers from data consumers - whilst channelling multiple data flows through a unified system, capable of enabling in-flight stream processing activity, before moving it on to your choice of target systems.

As Jay Kreps, Confluent CEO and Kafka originator, outlines (https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying), it is common to see these types of complex data flows in modern organisations:


These can be transformed when working with Kafka, and its underlying principle of a unified log system, to something like the following:

Extending this idea beyond simplicity and clarity, Kafka can also become a hub in a complete stream data platform, enabling other services to run against these real-time, continuous data feeds - opening it up for other business functions and possibilities… http://www.confluent.io/blog/stream-data-platform-1/
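As a taste of what running against those continuous feeds can look like, here is a minimal sketch using Kafka’s own Streams API (a detail not covered in this post); the topic names, application id and the trivial transformation are all assumptions, and the configuration keys shown are those of recent versions of the Kafka Streams library:

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class MinimalStreamProcessor {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "demo-stream-app");      // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");     // assumed broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Read the continuous feed, drop empty messages, transform each value,
        // and write the results on to another topic for downstream consumers.
        KStream<String, String> source = builder.stream("raw-events");
        source.filter((key, value) -> value != null && !value.isEmpty())
              .mapValues(value -> value.toUpperCase())
              .to("clean-events");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }
}
```

The point is that this processing runs as an ordinary application against the live feed, rather than as a batch job over a copy of the data landed somewhere else.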


As a tech guy - and an Oracle production DBA in a “past life” - two aspects of Kafka really piqued my interest as I approached this new technology from the RDBMS world:

Elegance.

I can well remember the complexities and frustrations in managing, monitoring, and logging data flows in database environments with fewer than 15 instances - let alone configurations 10, 100 or 1000 times this scale! Ad hoc feeds and pipelines from these database systems, applications and underlying servers are all well and good - until they’re not - and then you can waste an inordinate amount of time repairing these flows.

So the idea of a centralised pipeline, capable of handling a wide variety of data (from databases to web server logs, and everything in between), and which is durable and scalable, is very appealing. Here we see a way to get information right from the peripheries of our business operations, and bring it back to the core, ready to be made use of in a flexible way by any number of interested parties and applications. This is about making data which would otherwise have been siloed away accessible for business benefit, such as real-time analysis and correlation.

Simplicity.

Good old log files hardly seem the place to begin a revolution in technology, but they are at the core of Kafka’s design and operation, as Jay Kreps outlines in his book on the topic:

https://www.amazon.com/Heart-Logs-Stream-Processing-Integration/dp/1491909382


In fact, one way of understanding Kafka is to see it simply as a massively scalable, distributed and fault tolerant log system. Log files? Again, this seems counterintuitive - but perhaps an analogy will be helpful here…

I have a confession: despite access to a number of task and time management apps, I often simply open up a basic text editor document (in Text Wrangler or Sublime Text) to keep a set of timekeeping notes - appending to it during the course of each day. I have another for general brain dumps (so I don’t forget something and can go back and review it later), and one for notes on testing that I have been doing.

My log files don’t have any fixed structure - they are free form and I just write entries with as little or as much detail as I see fit - as long as I can read it back and make sense of these messages later. So they look something like this:

Timekeeping-notes.txt


And these happen to be nothing more than physical files on my laptop. There’s nothing special about them - although they are very useful to me:

At its core Kafka is very similar: it is basically just a set of log files, containing messages, grouped according to particular topics (like my timekeeping vs. brain dump vs. testing notes logs) persisted on disk. By default it doesn’t require or impose structure onto what is written to it, and new messages are just appended to the end of each log.
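And reading those messages back is just as unceremonious: a consumer subscribes to a topic and walks forward through the appended messages, with an offset marking how far it has read. A minimal sketch with the Java client, where the broker address, group id and topic name are again just placeholders:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class MinimalConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed local broker
        props.put("group.id", "demo-group");                  // hypothetical consumer group
        props.put("auto.offset.reset", "earliest");           // start from the beginning of the log
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("demo-topic"));
            // Each poll returns the next batch of messages, in the order they were appended.
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("offset=%d key=%s value=%s%n",
                        record.offset(), record.key(), record.value());
            }
        }
    }
}
```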


Coming back to my timekeeping-notes.txt log file, there are obviously a number of issues with this structure, particularly in terms of access (what if my boss wanted to read this at the same time as I was writing to it?) and redundancy (outside of backups, what happens if my hard drive crashes? I stand to lose all my records!).

So, how about this: I break the file up into logical segments, using the day (Mon, Tue, Wed, etc.) as the key value for the messages, and locate each of these distinct partitioned log files on a different hard drive/server. ServerA has a log file which holds all my time records for Mondays and, at the other end, serverE has all those for Fridays. From there, I then enable replication behind the scenes which copies the Mon log file from serverA to serverB and serverC, so that in case of a disaster I have a copy of this log file preserved and updated elsewhere. And while serverA might be the location where writes for Monday’s timekeeping records primarily occur, if it goes offline for some reason then the copy on serverB or serverC can take the lead, and take the writes. This new and improved configuration can now cope with far greater levels of throughput (scale) and parallelism, and does so with provision for redundancy from the ground up. So it would look a little like this:
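Translated into Kafka terms, that thought experiment is roughly: create a topic with five partitions and a replication factor of three, then use the day name as the message key so every record for the same day lands in the same partition. A sketch with the Java clients - noting that the AdminClient shown here arrived in client versions later than those current when this post was written, and that the topic name, broker address and sample entries are made up (the replication factor also assumes a cluster of at least three brokers):

```java
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TimekeepingTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address

        // One partition per weekday, with each partition's log replicated to three brokers.
        try (AdminClient admin = AdminClient.create(props)) {
            admin.createTopics(Collections.singletonList(
                    new NewTopic("timekeeping-notes", 5, (short) 3))).all().get();
        }

        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The day name is the message key: all "Mon" entries hash to the same partition,
            // just as all Monday notes lived in one log file in the analogy above.
            producer.send(new ProducerRecord<>("timekeeping-notes", "Mon", "09:00 started testing build"));
            producer.send(new ProducerRecord<>("timekeeping-notes", "Fri", "16:30 wrapped up weekly report"));
        }
    }
}
```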


As with any analogy there are limits in terms of direct application, but this is a reasonable approximation of Kafka’s core architecture and function:

Messages -> grouped into multiple topics -> partitioned by key values -> persisted to individual log files on disk

So redundancy and parallelism (facilitating scalability) are central, all the while doing this in such a way that decouples the processes writing to the system (Producers) from those that are reading it out (Consumers). And at its core, the humble log file :-)


In a blog post from way back in 2013 Jay Kreps actually describes Kafka as a “log as a service” - and his detailed explanation is well worth reading:

https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying

Where to from here?

In terms of learning more about Kafka I recommend reviewing the Apache project (http://kafka.apache.org/), and Confluent’s platform documentation and blog:

http://docs.confluent.io/3.0.0/
http://www.confluent.io/blog

And in the next blog post in this series, we will take a look at how this new world of data streaming technology relates to the old world of the RDBMS; how and why these technologies might work together, and what possibilities exist for bridging these worlds.

Thanks for reading, and stay tuned for more!