SQL to Hadoop and back again, Part 1: Basic data interchange techniques

Martin C. Brown ([email protected])
Director of Documentation

24 September 2013

In this series of articles, we'll look at a range of different methods for integration between Apache Hadoop and traditional SQL databases, including simple data exchange methods, live data sharing and exchange between the two systems, and the use of SQL-based layers on top of Apache Hadoop, including HBase and Hive, to act as the method of integration. Here in Part 1, we examine some of the basic architectural aspects of exchanging information and the basic techniques for performing data interchange.


Big data and SQL

"Big data" is a term that has been used regularly now for almost a decade, and it, along with technologies such as NoSQL, is seen as a replacement for the long-successful RDBMS solutions that use SQL. Today, DB2®, Oracle, Microsoft® SQL Server, MySQL, and PostgreSQL dominate the SQL space and still make up a considerable proportion of the overall market. Big data and the database systems and services that go along with it have become additional cogs in the gears of modern systems. But how do you integrate your existing SQL-based data stores with Hadoop so you can take advantage of the different technologies when you need them? Let's examine the basic architectural aspects of exchanging information and basic techniques for performing data interchange.

Data and querying considerations

The most important consideration when exchanging information between SQL and Hadoop is the data format of the information. The format should be driven entirely from the perspective of the information and the reason it is being exported.

Simply exporting your data and then importing it into Hadoop doesn't solve any problems. You need to know exactly what you are importing, why you are importing it, and what you expect to get out of the process.

Before we look at the specifics of why you are exchanging the data to begin with, first consider the nature of the data exchange. Is it one-way? Or is it two-way?


One-way data exchange — from SQL to Hadoop, or Hadoop to SQL — is practical in situations where the data is being transported to take advantage of the query functionality, and the source is not the companion database solution. For example, pure-textual data, or the raw results of a computational or analysis program, might be stored in Hadoop, processed with MapReduce, and stored in SQL (see Figure 1).

Figure 1. Hadoop to SQL translations

The reverse, where information is extracted from SQL into Hadoop, is less common, but it can be used to process SQL-based content that contains a lot of text, such as blogs, forums, CRM, and other systems (see Figure 2).

Figure 2. SQL to Hadoop translations

Two-way data exchange is more common and provides the best of both worlds in terms of the data exchange and data processing (see Figure 3).

Figure 3. Bidirectional translations

Although there are many examples, the most common one is to take large, linear datasets and text datasets from the SQL store and convert them into summarized information that can be processed by a Hadoop cluster. The summarized information can then be imported back into your SQL store. This is particularly useful where the large dataset would take too long to process within an SQL query. An example would be a large corpus of review scores or word/term counts.

IBM InfoSphere BigInsights

InfoSphere BigInsights makes integrating between Hadoop and SQL databases much simpler, since it provides the necessary tools and mechanics to export and import data between different databases. Using InfoSphere BigInsights you can define database sources, views, queries, and other selection criteria, and then automatically convert that into a variety of formats before importing that collection directly into Hadoop (see Resources for more information).


For example, you can create a query that extracts the data and populates a JSON array with the record data. Once exported, a job can be created to process and crunch the data before either displaying it or importing the processed data back into DB2.

Download InfoSphere BigInsights Quick Start Edition, a complimentary, downloadable version of InfoSphere BigInsights. Using Quick Start Edition, you can try out the features that IBM has built to extend the value of open source Hadoop, like Big SQL, text analytics, and BigSheets.

As a rule, there are three primary reasons for interfacing between SQL and Hadoop:

1. Exporting for storage — Hadoop provides a practical solution for storing large amounts of infrequently used data in a format that can be queried, processed, and extracted. For example: usage logs, access logs, and error information are all practical for insertion into a Hadoop cluster to take advantage of the HDFS architecture. A secondary feature of this type of export is that the information can later be processed or parsed and converted into a format that can be used again.

2. Exporting for analysis — Two common cases are exporting for reimport to SQL and exporting the analysis output to be used directly in your application (analyzing and storing the result in JSON, for example). Hadoop provides the advantage here by allowing for distributed large-scale processing of information rather than the single-table host processing provided in SQL. With the analysis route, the original information is generally kept, but the analysis process is used to provide summary or statistical bases that work alongside the original data.

3. Exporting for processing — Processing-based exports are designed to take the original raw source information, process and reduce or simplify it, then store that information back to replace the original data. This type of exchange is most commonly used where the source information has been captured, but the raw original information is no longer required. For example, logging data of various forms can be easily resolved into a simpler structure, either by looking for specific event types or by summarizing the data to counts of specific errors or occurrences. The raw data is often not required here. Reducing that data through Hadoop and loading the summary stats back saves time and makes the content easier to query.

With these basic principles in mind, let's look at the techniques for a basic data exchange between SQL and Hadoop.

Exporting data from SQL

When exporting from SQL, the biggest consideration is the format of the information that you generate. Because Hadoop is not a tabular database, it makes sense to choose a flexible format for data that will be processed in Hadoop. One option is the CSV format if you want to work with pure tabular information, but you can also use raw text with suitable separators or identifiers.

For complex structures, it can make more sense to output information in a structure that allows for easy separation and distribution. An example would be generating record data as JSON and exporting blocks of data, for example 10,000 records per file. Using a flexible encapsulation format like JSON solves many of the data interchange headaches.
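As a rough sketch of what such a chunked export might look like, the following Perl fragment writes one JSON record per line and starts a new file every 10,000 records. The connection details, table, and column names here are illustrative placeholders rather than part of the article's own example.

#!/usr/bin/perl
# Sketch: export rows as JSON, one record per line, 10,000 records per file.
# The DSN, credentials, table, and column names are placeholders.
use strict;
use warnings;
use DBI;
use JSON;

my $dbh = DBI->connect('DBI:mysql:database=cheffy;host=192.168.0.240',
                       'cheffy', '', { RaiseError => 1 });

my $sth = $dbh->prepare('SELECT recipeid, title, subtitle, servings, description FROM recipes');
$sth->execute();

my ($chunk, $count, $fh) = (0, 0, undef);

while (my $row = $sth->fetchrow_hashref()) {
    # Start a new output file every 10,000 records
    if ($count % 10_000 == 0) {
        close($fh) if $fh;
        open($fh, '>', sprintf('recipes-%04d.json', $chunk++))
            or die "Cannot open chunk file: $!";
    }
    print {$fh} to_json($row), "\n";
    $count++;
}

close($fh) if $fh;
$dbh->disconnect();

Each chunk file can then be copied into HDFS independently, which also gives the distributed processing a natural unit of work.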


Using a standard dump or query export

Most SQL databases and interfaces have a method for exporting data in specific formats. For example, within MySQL you can create a CSV file using the command line, as shown in Listing 1.

Listing 1. Creating a CSV file using the command line

SELECT title, subtitle, servings, description INTO OUTFILE 'result.csv'
    FIELDS TERMINATED BY ',' FROM recipes t;

In DB2 the same solution exists (see Listing 2).

Listing 2. Creating a CSV file in DB2

EXPORT TO result.csv OF DEL MODIFIED BY NOCHARDEL
    SELECT title, subtitle, servings, description FROM recipes;

The resulting file can be loaded straight into Hadoop through HDFS. Generating the same output with a simple script in Perl, Python, or Ruby is just as straightforward.
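As a sketch, a Perl version of the same export might look like the following. The DSN and credentials are placeholders, and fields containing commas or quotes would need proper escaping, for example with the Text::CSV module.

#!/usr/bin/perl
# Sketch: the same CSV export performed from a script. The DSN and credentials
# are placeholders; real data with embedded commas or quotes should go through
# Text::CSV rather than a plain join.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('DBI:mysql:database=cheffy;host=192.168.0.240',
                       'cheffy', '', { RaiseError => 1 });

my $sth = $dbh->prepare('SELECT title, subtitle, servings, description FROM recipes');
$sth->execute();

open(my $out, '>', 'result.csv') or die "Cannot open result.csv: $!";
while (my @row = $sth->fetchrow_array()) {
    # Naive CSV output; swap in Text::CSV for robust quoting
    print {$out} join(',', map { defined $_ ? $_ : '' } @row), "\n";
}
close($out);
$dbh->disconnect();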

Writing a custom program

Depending upon the dataset, using a custom application to export the data may be more practical. This is true particularly with structured data where the information you want to output is based on the content of multiple tables and structures.

In general, the easiest method is to take your structured data, agree on an output format or structure (so it can be parsed within Hadoop), and then dump that information out.

For example, when processing recipe data to look for common themes and threads, you can use the internal tool to load the recipe record, include the ingredients, description, method, and other data, then use the constructed recipe object to output the information for processing in Hadoop, storing each recipe as a JSON object (see Listing 3).

Listing 3. Exporting complex data

use JSON;
use Foodware;
use Foodware::Public;
use Foodware::Recipe;

my $fw = Foodware->new();

my $recipes = $fw->{_dbh}->get_generic_multi('recipe', 'recipeid', { active => 1 });

my $js = new JSON;

foreach my $recipeid (keys %{$recipes}) {
    my $recipe = new Foodware::Recipe($fw, $recipeid, {
        measgroup => 'Metric',
        tempgroup => 'C',
    });

    my $id = $recipe->{title};
    $id =~ s/[ ',\(\)]//g;

    my $record = {
        _id         => $id,
        title       => $recipe->{title},
        subtitle    => $recipe->{subtitle},
        servings    => $recipe->{servings},
        cooktime    => $recipe->{metadata_bytag}->{totalcooktime},
        preptime    => $recipe->{metadata_bytag}->{totalpreptime},
        totaltime   => $recipe->{metadata_bytag}->{totaltime},
        keywords    => [ keys %{$recipe->{keywordbytext}} ],
        method      => $recipe->{method},
        ingredients => $recipe->{ingredients},
        comments    => $recipe->{comments},
    };

    foreach my $ingred (@{$recipe->{ingredients}}) {
        push(@{$record->{ingredients}}, {
            meastext   => $ingred->{'measuretext'},
            ingredient => $ingred->{'ingredonly'},
            ingredtext => $ingred->{'ingredtext'},
        });
    }

    print to_json($record), "\n";
}

The data is exported to a file that contains the recipe data (see Listing 4).

Listing 4. File containing the recipe data

{
   "_id" : "WarmpotatotunaandCheshiresalad",
   "comments" : null,
   "preptime" : "20",
   "servings" : "4",
   "keywords" : [
      "diet@wheat-free",
      "diet@peanut-free",
      "diet@corn-free",
      "diet@citrus-free",
      "meal type@salads",
      "diet@shellfish-free",
      "main ingredient@fish",
      "diet@demi-veg",
      "convenience@add bread for complete meal",
      "diet@gluten-free"
   ],
   "subtitle" : "A change from cold salads...",
   "totaltime" : "35",
   "cooktime" : "15",
   "ingredients" : [
      {
         "scaled_fromqty" : 100,
         "_error_ingredid" : 1,
         ...
      }
   ]
}

The result can be loaded directly into HDFS and processed by a suitable MapReduce job to extract the information required. One benefit of this structured approach is that it enables you to perform any required preprocessing on the output, including structuring the information in a format you can use within your Hadoop MapReduce infrastructure.


The phrase "importing into Hadoop" really means you simply need to copy the information intoHDFS for it to be available (see Listing 5).

Listing 5. Copying the information into HDFS

$ hdfs dfs -mkdir recipes
$ hdfs dfs -copyFromLocal recipes.json recipes

Once the files are copied in, they can be used by your Hadoop MapReduce jobs as required.
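To give a flavor of what such a job might look like, here is a minimal Hadoop Streaming mapper, written in Perl to match the export script above, that reads one JSON recipe per line and emits each keyword with a count of 1. The script name and the assumption that the job runs under Hadoop Streaming with a summing reducer are illustrative, not taken from the article.

#!/usr/bin/perl
# mapper.pl (hypothetical name) -- sketch of a Hadoop Streaming mapper for the
# exported recipe records: one JSON document per input line, as produced by
# Listing 3. Emits "keyword<TAB>1" pairs for a summing reducer to total.
use strict;
use warnings;
use JSON;

while (my $line = <STDIN>) {
    chomp $line;
    my $recipe = eval { from_json($line) } or next;   # skip malformed records
    foreach my $keyword (@{ $recipe->{keywords} || [] }) {
        print "$keyword\t1\n";
    }
}

Paired with a reducer that totals the counts per keyword, this gives a simple keyword-frequency summary of the exported recipes.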

For better flexibility within HDFS, the output can be chunked into multiple files, and those files can be loaded. Depending upon your use case and processing requirements, extracting the data into individual files (one per notional record) may be more efficient for the distributed processing.

Using Sqoop to move data

Sqoop is an additional tool for Hadoop that connects to an existing database using a JDBC driver and imports tables or databases from the source JDBC connection directly into HDFS. For the vast majority of imports where raw data from the SQL tables is being imported into Hadoop verbatim without processing, Sqoop offers the simplest and most efficient process for moving the data. For example, all of the tables within a single database can be loaded using Listing 6.

Listing 6. Loading all tables within a single database

$ sqoop import-all-tables --connect jdbc:mysql://192.168.0.240/cheffy --username=cheffy

For those drivers that support it, use the --direct option to directly read the data and then write it into HDFS. The process is much faster, as it requires no intervening files. When loading data in this way, directories are created within HDFS according to the table names. For example, within the recipe data set is the access log information in the access_log table, and the imported data is written into text files within the access_log directory (see Listing 7).

Listing 7. Viewing imported data from Sqoop

$ hdfs dfs -ls access_log
Found 6 items
-rw-r--r--   3 cloudera cloudera        0 2013-08-15 09:37 access_log/_SUCCESS
drwxr-xr-x   - cloudera cloudera        0 2013-08-15 09:37 access_log/_logs
-rw-r--r--   3 cloudera cloudera 36313694 2013-08-15 09:37 access_log/part-m-00000
-rw-r--r--   3 cloudera cloudera 36442312 2013-08-15 09:37 access_log/part-m-00001
-rw-r--r--   3 cloudera cloudera 36797470 2013-08-15 09:37 access_log/part-m-00002
-rw-r--r--   3 cloudera cloudera 36321038 2013-08-15 09:37 access_log/part-m-00003

By default, the files are split into approximately 30MB blocks, and the data is separated by commas (see Listing 8).

Listing 8. CSV converted Sqoop table data

1,1,1135322067,09890012-11583713-542922105,recipeview,779
2,1,1135322405,09890012-11583713-542922105,recipeview,288
3,89,1135327750,26458011-11487731-455118105,search-ingredient,
4,89,1135327750,26458011-11487731-455118105,ingredient,pork
5,89,1135327750,26458011-11487731-455118105,ingredient,cheese
6,89,1135327765,26458011-11487731-455118105,recipeview,1421


To select individual tables, use the code in Listing 9.

Listing 9. Selecting individual tables

$ sqoop import --connect jdbc:mysql://192.168.0.240/cheffy --username=cheffy --table access_log

And to select individual columns from that table, use the code in Listing 10.

Listing 10. Selecting individual columns

$ sqoop import --connect jdbc:mysql://192.168.0.240/cheffy --username=cheffy --table access_log --columns id,userid,operation

Rather than individually selecting tables and columns, a more practical approach is to use a query to specify the information to output. When using this method, you must use the $CONDITIONS variable in your statement and specify the column to use when dividing up the data into individual packets using the --split-by option, as shown in Listing 11.

Listing 11. Specifying information to output

$ sqoop import --connect jdbc:mysql://192.168.0.240/cheffy --username=cheffy \
    --query 'select recipeid,recipe,description from recipe WHERE $CONDITIONS' --split-by id

One limitation of Sqoop, however, is that it provides limited ability to format and construct the information. For complex data, the export and load functions of a custom tool may provide better functionality.

Extracting data from Hadoop

When pulling processed data back out of Hadoop, you need to work with the files output by your Hadoop job. As with exporting, you should ensure that your Hadoop job outputs the information in a format that you can read back effectively.
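For instance, a Hadoop Streaming reducer that summarizes access-log operations could emit plain comma-separated operation,count pairs, which line up with the summary table used later in this article. This is only a sketch, under the assumption that the mapper emits operation<TAB>1 pairs; the article itself does not show the job.

#!/usr/bin/perl
# reducer.pl (hypothetical name) -- sketch of a Hadoop Streaming reducer that
# assumes the mapper emits "operation<TAB>1" pairs and writes the summary as
# comma-separated "operation,count" lines, ready to load into SQL.
use strict;
use warnings;

my ($current, $count) = (undef, 0);

while (my $line = <STDIN>) {
    chomp $line;
    my ($operation, $value) = split /\t/, $line, 2;
    next unless defined $value;

    # Streaming delivers mapper output sorted by key, so a change of key
    # means the previous operation's total is complete.
    if (defined $current && $operation ne $current) {
        print "$current,$count\n";
        $count = 0;
    }
    $current = $operation;
    $count += $value;
}

print "$current,$count\n" if defined $current;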

Importing to SQL

Using CSV is simple and straightforward, but for more complex structures, you might want to consider the JSON route again because it makes the entire conversion and translation process so easy.

Getting the information out requires use of the HDFS tool to get your output files back to a filesystem where you can perform a load — $ hdfs dfs -copyToLocal processed_logs/*, for example. Once you have the files, you can load the information using whatever method suits the source information and structure.
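A minimal sketch of such a load, assuming the comma-separated operation,count output described above and the summary_logs table created in the next section; the DSN, credentials, and output file name are placeholders rather than values from the article.

#!/usr/bin/perl
# Sketch: load the "operation,count" CSV produced by the Hadoop job into the
# summary table described below. The DSN, credentials, and the output file
# name (part-00000) are placeholders.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('DBI:mysql:database=cheffy;host=192.168.0.240',
                       'cheffy', '', { RaiseError => 1 });

my $ins = $dbh->prepare('INSERT INTO summary_logs (operation, `count`) VALUES (?, ?)');

open(my $in, '<', 'processed_logs/part-00000') or die "Cannot open job output: $!";
while (my $line = <$in>) {
    chomp $line;
    my ($operation, $count) = split /,/, $line, 2;
    $ins->execute($operation, $count);
}
close($in);
$dbh->disconnect();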

Exporting from Sqoop

As with the import process, Sqoop provides a simplified method for translating information from your Hadoop job back into an SQL table.

When outputting the resulting information from Sqoop, use the CSV format for the easiest export. Then to import the information, you will need to create a suitable table to accept the processed logs. For example, from our access logs, the Hadoop output has mapped the data into summaries of the number of operations, so it's necessary to first create a suitable table: CREATE TABLE summary_logs (operation CHAR(80), count INT). Then the information can be imported directly from Hadoop into your SQL table (see Listing 12).

Listing 12. Exporting from Hadoop into SQL

$ sqoop export --connect jdbc:mysql://192.168.0.240/cheffy --username=root --export-dir processed_log --table processed_log
13/08/15 10:04:34 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
13/08/15 10:04:34 INFO tool.CodeGenTool: Beginning code generation
13/08/15 10:04:35 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `access_log` AS t LIMIT 1
13/08/15 10:04:35 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `access_log` AS t LIMIT 1
13/08/15 10:04:35 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce
13/08/15 10:04:35 INFO orm.CompilationManager: Found hadoop core jar at: /usr/lib/hadoop-mapreduce/hadoop-mapreduce-client-core.jar
Note: /tmp/sqoop-cloudera/compile/8034e8d9feb8c1b0f69a52fede8d1da7/access_log.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
13/08/15 10:04:37 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-cloudera/compile/8034e8d9feb8c1b0f69a52fede8d1da7/access_log.jar
13/08/15 10:04:37 INFO mapreduce.ExportJobBase: Beginning export of access_log
13/08/15 10:04:39 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/08/15 10:04:39 INFO input.FileInputFormat: Total input paths to process : 4
13/08/15 10:04:39 INFO input.FileInputFormat: Total input paths to process : 4
13/08/15 10:04:39 INFO mapred.JobClient: Running job: job_201308150649_0006
13/08/15 10:04:40 INFO mapred.JobClient:  map 0% reduce 0%
13/08/15 10:04:57 INFO mapred.JobClient:  map 2% reduce 0%
...
13/08/15 10:08:06 INFO mapred.JobClient:   CPU time spent (ms)=27470
13/08/15 10:08:06 INFO mapred.JobClient:   Physical memory (bytes) snapshot=317607936
13/08/15 10:08:06 INFO mapred.JobClient:   Virtual memory (bytes) snapshot=2076659712
13/08/15 10:08:06 INFO mapred.JobClient:   Total committed heap usage (bytes)=188350464
13/08/15 10:08:06 INFO mapreduce.ExportJobBase: Transferred 139.1333 MB in 207.5656 seconds (686.3975 KB/sec)
13/08/15 10:08:06 INFO mapreduce.ExportJobBase: Exported 2401906 records.

The process is complete. Even at the summarized level, we are looking at 2.4 million records of simplified data from a content store about 600 times that size.

With the imported information, we can now perform some simple and quick queries and structures on the data. For example, this summary of the key activities takes about 5 seconds (see Figure 4).


Figure 4. Summary operations

On the full data set, the process took almost an hour. Similarly, a query on the top search terms took less than a second, compared to over 3 minutes, a time savings that makes it possible to include a query on the homepage (see Figure 5).

Figure 5. Summary ingredient search
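The queries behind these figures are not shown in the article, but against the summary_logs table the activity summary could be as simple as the following sketch; the connection details are placeholders.

#!/usr/bin/perl
# Sketch: the sort of activity summary shown in Figure 4, run against the
# summary_logs table created above. DSN and credentials are placeholders;
# the article does not show the actual query.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('DBI:mysql:database=cheffy;host=192.168.0.240',
                       'cheffy', '', { RaiseError => 1 });

my $rows = $dbh->selectall_arrayref(
    'SELECT operation, SUM(`count`) AS total
       FROM summary_logs
      GROUP BY operation
      ORDER BY total DESC');

printf("%-30s %12d\n", @{$_}) for @{$rows};

$dbh->disconnect();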

These are simplified examples of using Hadoop for external reduction processing, but they effectively demonstrate the advantage of the external interface.


Conclusions

Getting SQL-based data into and out of Hadoop is not complicated, provided you know the data, its format, and how you want the information internally processed and represented. The actual conversion, exporting, processing, and importing is surprisingly straightforward.

The solutions in this article have looked at direct, entire-dataset dumps of information that can be exported, processed, and imported to Hadoop. The process can be SQL to Hadoop, Hadoop to SQL, or SQL to Hadoop to SQL. In fact, the entire sequence can be scripted or automated, but that's a topic for a future article in this series.

In Part 2, we look at more advanced examples of performing this translation and movement of content by using one of the SQL layers that sits on top of HDFS. We'll also lay the foundation for providing a full live transmission of data for processing and storage.


Resources

Learn

• "Analyzing social media and structured data with InfoSphere BigInsights" teaches you thebasics of using BigSheets to analyze social media and structured data collected throughsample applications provided with BigInsights.

• Read "Understanding InfoSphere BigInsights" to learn more about the InfoSphere BigInsightsarchitecture and underlying technologies.

• Watch the Big Data: Frequently Asked Questions for IBM InfoSphere BigInsights video to listen to Cindy Saracco discuss some of the frequently asked questions about IBM's Big Data platform and InfoSphere BigInsights.

• Watch Cindy Saracco demonstrate portions of the scenario described in this article in Big Data -- Analyzing Social Media for Watson.

• Check out "Exploring your InfoSphere BigInsights cluster and sample applications" to learnmore about the InfoSphere BigInsights web console.

• Visit the BigInsights Technical Enablement wiki for links to technical materials, demos, training courses, news items, and more.

• Learn about the IBM Watson research project.
• Take this free course from Big Data University on Hadoop Reporting and Analysis (log-in required). Learn how to build your own Hadoop/big data reports over relevant Hadoop technologies, such as HBase, Hive, etc., and get guidance on how to choose between various reporting techniques, including direct batch reports, live exploration, and indirect batch analysis.

• Learn the basics of Hadoop with this free Hadoop Fundamentals course from Big Data University (log-in required). Learn about the Hadoop architecture, HDFS, MapReduce, Pig, Hive, JAQL, Flume, and many other related Hadoop technologies. Practice with hands-on labs on a Hadoop cluster on the Cloud, with the supplied VMWare image, or install locally.

• Explore free courses from Big Data University on topics ranging from Hadoop Fundamentals and Text Analytics Essentials to SQL Access for Hadoop and real-time stream computing.

• Create your own Hadoop cluster on the IBM SmartCloud Enterprise with this free course from Big Data University (log-in required).

• Order a copy of Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data for details on two of IBM's key big data technologies.

• Visit the Apache Hadoop Project and check out the Apache Hadoop Distributed File System.
• Learn about the HadoopDB Project.
• Read the Hadoop MapReduce tutorial at Apache.org.
• Using MapReduce and load balancing on the cloud (Kirpal A. Venkatesh et al., developerWorks, July 2010): Learn how to implement the Hadoop MapReduce framework in a cloud environment and how to use virtual load balancing to improve the performance of both a single- and multiple-node system.

• For information on installing Hadoop using CDH4, see CDH4 Installation - Cloudera Support.
• Big Data Glossary by Pete Warden, O'Reilly Media, ISBN: 1449314597, 2011.
• Hadoop: The Definitive Guide by Tom White, O'Reilly Media, ISBN: 1449389732, 2010.


• "HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for AnalyticalWorkloads" explores the feasibility of building a hybrid system that takes the best featuresfrom both technologies.

• "SQL/MapReduce: A practical approach to self-describing, polymorphic, and parallelizableuser-defined functions" describes the motivation for this new approach to UDFs, as well asthe implementation within AsterData Systems' nCluster database.

• Check out "MapReduce and parallel DBMSes: friends or foes?"• "A Survey of Large-Scale Data Management Approaches in Cloud Environments" gives a

comprehensive survey of numerous approaches and mechanisms of deploying data-intensiveapplications in the cloud which are gaining a lot of momentum in research and industrialcommunities.

• Learn more about big data in the developerWorks big data content area. Find technical documentation, how-to articles, education, downloads, product information, and more.

• Find resources to help you get started with InfoSphere BigInsights, IBM's Hadoop-based offering that extends the value of open source Hadoop with features like Big SQL, text analytics, and BigSheets.

• Follow these self-paced tutorials (PDF) to learn how to manage your big data environment, import data for analysis, analyze data with BigSheets, develop your first big data application, develop Big SQL queries to analyze big data, and create an extractor to derive insights from text documents with InfoSphere BigInsights.

• Find resources to help you get started with InfoSphere Streams, IBM's high-performance computing platform that enables user-developed applications to rapidly ingest, analyze, and correlate information as it arrives from thousands of real-time sources.

• Stay current with developerWorks technical events and webcasts.
• Follow developerWorks on Twitter.

Get products and technologies

• Hadoop 0.20.1 is available from Apache.org.
• Download Hadoop MapReduce.
• Get Hadoop HDFS.
• Download InfoSphere BigInsights Quick Start Edition, available as a native software installation or as a VMware image.
• Download InfoSphere Streams, available as a native software installation or as a VMware image.
• Use InfoSphere Streams on IBM SmartCloud Enterprise.
• Build your next development project with IBM trial software, available for download directly from developerWorks.

Discuss

• Ask questions and get answers in the InfoSphere BigInsights forum.
• Ask questions and get answers in the InfoSphere Streams forum.
• Check out the developerWorks blogs and get involved in the developerWorks community.
• IBM big data and analytics on Facebook.


About the author

Martin C. Brown

A professional writer for over 15 years, Martin (MC) Brown is the author of and contributor to more than 26 books covering an array of topics, including the recently published Getting Started with CouchDB. His expertise spans myriad development languages and platforms: Perl, Python, Java, JavaScript, Basic, Pascal, Modula-2, C, C++, Rebol, Gawk, Shellscript, Windows, Solaris, Linux, BeOS, Microsoft WP, Mac OS and more. He currently works as the director of documentation for Continuent.

© Copyright IBM Corporation 2013
(www.ibm.com/legal/copytrade.shtml)
Trademarks
(www.ibm.com/developerworks/ibm/trademarks/)