

Hands on Lab

Introduction to Hadoop on the cloud using

BigInsights on BlueMix

dev@Pulse, Feb. 24 - 25, 2014

Cindy Saracco, Senior Solutions Architect, @IBMbigdata

Nicolas Morales, Solutions Engineer, @NicolasJMorales


Table of Contents

Getting started
    Pre-requisites
    What you'll learn

Exercise 1: Exploring the Web Console
    Launching the Web Console
    Working with the Welcome page
    Inspecting the status of your cluster
    Working with Files

Exercise 2: Analyzing data with BigSheets
    Collecting social media data
    Creating a BigSheets workbook
    Tailoring workbooks and generating charts

Exercise 3: Querying data with IBM Big SQL
    Obtaining sample data
    Creating, loading, and querying a Big SQL table

Optional Exercise: Setting up an Eclipse environment
    Getting started
    Creating a BigInsights Server Connection
    Creating a Big SQL Connection in Eclipse
    Creating and testing a Big SQL JDBC client application

Summary

Important Information


Getting started

In this hands-on lab, you'll learn how to work with IBM’s Platform-as-a-Service (PaaS)

for MapReduce, a cloud offering now in beta. This offering enables Hadoop developers

to quickly get started using critical services from InfoSphere BigInsights (IBM’s

Hadoop-based platform) to create big data applications. By using this beta service,

developers can avoid the overhead of acquiring and provisioning their own hardware

cluster. Moreover, a cloud-based infrastructure enables them to rapidly scale their

hardware environment as their application needs increase.

In this lab, you'll learn how to use IBM’s Hadoop-based cloud services to explore social

media data. In particular, you’ll investigate global coverage of a popular brand (“IBM

Watson”) through the use of a spreadsheet-style interface. Later, you’ll learn how you

can query this social media data using Big SQL, IBM’s SQL interface to data managed

by BigInsights.

As an aside, if you prefer to work with BigInsights on your own hardware, you can

download a free Quick Start Edition installation image or VMware image. Just visit this

web site.

Pre-requisites

This lab uses beta software available on IBM’s BlueMix cloud environment. Prior to

starting this lab, you need to obtain a BlueMix account. Registration is free, although the

number of available seats is limited. To apply for an account, visit http://ng.bluemix.net

and click the Join us in beta button.

Once you have an account, you should become familiar with the BlueMix environment

before starting this lab. In particular, you should be able to log into your account, create

an application, and bind that application to IBM's MapReduce service, which is the


subject of this lab. (The process of binding your application to this service is the same as

the process of binding your application to other services available on BlueMix.) If

necessary, consult the online BlueMix documentation or enroll in a separate lab to learn

more about the BlueMix environment.

Finally, you need to locate the settings for your MapReduce environment variables, as

these include the appropriate URLs for accessing various BigInsights services as well as

the required user ID and password.

What you'll learn

After completing this hands-on lab, you’ll be able to:

• Launch the Web console and access several of its key services

• Explore big data using a spreadsheet-style tool

• Query big data using Big SQL

• Configure Eclipse to use the BigInsights plug-in, which includes Big SQL support

Allow 1 to 2 hours to complete the core sections of this lab.

To learn more about IBM’s Hadoop-based platform and its MapReduce service, visit the

BigInsights technical wiki or participate in the forum. To learn more about BlueMix and

participate in its community, visit the BlueMix Dev site.

NOTE: Images of screen captures contain sample data. Your output may vary, depending on your environment. The BlueMix environment is expected to evolve throughout its beta program and user interfaces are subject to change.

In addition, some code examples in this lab include sample user IDs, passwords, and service URLs. You will need to modify the examples to include appropriate data for your environment.


Exercise 1: Exploring the Web Console

IBM’s Hadoop-based services enable firms to store, process, and analyze large volumes

of various types of data. Included in these services is access to a Web console for

inspecting the health of your cluster, monitoring the status of jobs (applications),

downloading certain application development aids, and performing other functions.

Before developing your application that uses IBM’s MapReduce service, it will be

helpful for you to become familiar with the Web console. For further details on the Web

console or BigInsights, consult the product documentation.


After completing this exercise, you'll be able to:

• Launch the Web console.

• Work with popular resources accessible through the Welcome page.

• Inspect the status of your cluster.

• Work with the distributed file system. In particular, you'll explore the Hadoop Distributed File System (HDFS) directory structure, create subdirectories, and

upload a file to HDFS.

Allow 15 - 30 minutes to complete this section of the lab.

This lab is an introduction to a subset of console functions. Administrative capabilities

available through the Web console in production environments won't be covered here,

because your BlueMix beta account currently lacks administrative authority. In addition,

real-time monitoring dashboards and application linking are among the more advanced

console functions that are out of this lab's scope.


Launching the Web Console

In this section, you'll learn how to launch the Web console for the IBM MapReduce

(BigInsights) service.

__1. If necessary, create a new application using an appropriate BlueMix boilerplate template available in the catalog and add the IBM MapReduce service to it. Subsequent sections of this exercise use the Java+DB Web Starter application boilerplate as an example.


__2. Verify that your cloud application that uses IBM's MapReduce service is up and running. (The BlueMix dashboard displays the status of your applications.) For example, the image below depicts a JavaDBSample application that includes IBM's MapReduce service (shown circled). Note that the application's status shows that it is running.

__3. Optionally, double-click on your application's icon (not the displayed URL) to display further details about it. Using the example above as a guideline, click on the orange box at top to see additional information about the services available.


__4. Locate the VCAP_SERVICES environment variables associated with your application to determine the MapReduce service's Web console URL, user ID, and password. The way in which you access this information will depend on the application boilerplate you selected from the BlueMix dashboard. Often, the first category beneath your application's OVERVIEW button will contain the appropriate information. In this example, there is a RUNTIME button that displays this information if you scroll down to the bottom of the page.

Environment variable settings are in JSON format. The items shown below in bold highlight sample information you need to collect for this exercise.

{
  "name": "MapReduce-ei562",
  "label": "MapReduce-2.1.0",
  "plan": "Community",
  "credentials": {
    "username": "u123456",
    "password": "pw123456.1234567890",
    "BigSqlUrl": "jdbc:bigsql://11.22.33.44:7052/dbu123456",
    "ConsoleUrl": "https://11.22.33.44:8080/data/html/index.html",
    "HiveUrl": "jdbc:hive://11.22.33.44:10000/dbu123456",
    "HttpfsUrl": "http://11.22.33.44:14000/webhdfs/v1/"
  }
}

__5. Copy and paste the ConsoleUrl value into your Web browser.

__6. When prompted, enter the username and password values to log into the console.

__7. Verify that your Web console appears similar to this:
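If you would rather pull the login details out of the JSON programmatically than copy them by hand, the lookup in steps 4 through 6 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name console_login is hypothetical, the sample entry repeats the placeholder values shown in step 4, and the full environment variable in your application may wrap an entry like this in an outer structure that the sketch does not handle.

```python
import json

# Sample service entry with the placeholder values from step 4.
# Replace it with the JSON shown for your own application.
SAMPLE_ENTRY = """
{
  "name": "MapReduce-ei562",
  "label": "MapReduce-2.1.0",
  "plan": "Community",
  "credentials": {
    "username": "u123456",
    "password": "pw123456.1234567890",
    "BigSqlUrl": "jdbc:bigsql://11.22.33.44:7052/dbu123456",
    "ConsoleUrl": "https://11.22.33.44:8080/data/html/index.html",
    "HiveUrl": "jdbc:hive://11.22.33.44:10000/dbu123456",
    "HttpfsUrl": "http://11.22.33.44:14000/webhdfs/v1/"
  }
}
"""


def console_login(entry_json):
    """Return (ConsoleUrl, username, password) from one service entry.

    Hypothetical helper: it assumes the entry has the shape shown in
    step 4, with a "credentials" object holding the three values.
    """
    credentials = json.loads(entry_json)["credentials"]
    return (
        credentials["ConsoleUrl"],
        credentials["username"],
        credentials["password"],
    )


url, user, password = console_login(SAMPLE_ENTRY)
print("Open in your browser:", url)
print("Log in as:", user)
```

The password is returned but deliberately not printed, since console output often ends up in logs.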

Working with the Welcome page

This section introduces you to the Web console's main page displayed through the

Welcome tab. The Welcome page features links to common tasks, many of which can

also be launched from other areas of the console. In addition, the Welcome page includes

links to popular external resources, such as the BigInsights Information Center (product

documentation) and community forum.

__1. Inspect the Quick Links pane at top right and use its vertical scroll bar (if necessary) to become familiar with the various resources accessible through this pane. Note that this section contains links for downloading software drivers and an Eclipse plug-in.


__2. Inspect the Learn More pane at lower right. Links in this area access external Web resources that you may find useful, such as the BigInsights Information Center, a public discussion forum, IBM support, and IBM's BigInsights product site. If desired, click on one or more of these links to see what's available.

Inspecting the status of your cluster

The Web console allows administrators to inspect the overall health of their cluster as

well as perform basic functions, such as starting and stopping specific servers or

components, adding nodes to the cluster, and so on. The free BlueMix beta offering

precludes you from obtaining administrative status at this time, but you can still explore

some basic capabilities that don't require administrative authority.

__1. Click on the Cluster Status tab at the top of the page.


__2. Inspect the overall status of your cluster. The figure below was taken on a cluster of 7 nodes that had several services running. (Host node information about each node was masked in this graphic; your display will show IP addresses of each node in your cluster.)

Note that on this cluster, HBase, Monitoring, and Oozie services were unavailable.

__3. Click on the Hive service and note the detailed information provided for this service in the pane at right. (Host node information was masked in this graphic; your display will show node IP addresses for the Hive Node, Hive Web Interface, and JDBC URL.)

By clicking on any installed service, administrators can start or stop the service.


Working with Files

The Files tab of the console enables you to explore the contents of your file system,

create new subdirectories, upload small files for test purposes, and perform other file-

related functions. In this module, you’ll learn how to perform such tasks against the

Hadoop Distributed File System (HDFS) of BigInsights.

__1. Click on the Files tab of the Web console to begin exploring your distributed file system.

__2. Expand the directory tree shown in the pane at left to locate the /user subdirectory for your user ID (/user/biadmin is shown below).

__3.


__4. Become familiar with the functions provided through the icons at the top of this pane, as we'll refer to some of these in subsequent sections of this module. Simply point your cursor at an icon to learn its function. From left to right, the icons enable you to copy a file or directory, move a file, create a directory, rename a file or directory, upload a file to HDFS, download a file from HDFS to your local file system, remove a file from HDFS, set permissions, open a command window to launch HDFS shell commands, and refresh the Web console page.

__5. Position your cursor on your user subdirectory (/user/biadmin in this example) and click the Create Directory icon to create a subdirectory for test purposes.

__6. When a pop-up window appears prompting you for a directory name, enter ConsoleLab and click OK.

__7. Expand the directory hierarchy to verify that your new subdirectory was created.


__8. Create another directory named ConsoleLabTest.

__9. Use the Rename icon to rename this directory to ConsoleLabTest2.

__10. Click the Move icon. When the pop-up Move screen appears, select the ConsoleLab directory and click OK.

__11. Use the Set Permissions icon to change the permission settings for your directory. When finished, click OK.


__12. While highlighting the ConsoleLabTest2 folder, select the Remove icon and remove the directory.

__13. Obtain the sample blogs-data.txt file from your instructor, or download the sampleData.zip file from this article and extract the .zip file to a directory on your local file system. In a moment, you will upload the blogs-data.txt file to the cloud DFS.

__14. Select the ConsoleLab directory of your cloud DFS and click the Upload icon to upload a small sample file for test purposes.


__15. When the pop-up window appears, click the Browse button to browse your local file system for the sample file you obtained earlier (blogs-data.txt).

__16. Navigate through your local file system to the directory and locate the blogs-data.txt file. Click OK.

__17. Verify that the window displays the name of this file. Note that you can continue to Browse for additional files to upload and that you can delete files as upload targets from the displayed list. However, for this exercise, simply click OK.

__18. When the upload completes, verify that the file appears in the directory tree at left. If it is not immediately visible, click the refresh button. On the right, you should see a subset of the file’s contents displayed in text format.

__19. Highlight the blogs-data.txt file in your ConsoleLab directory and click the Download button.

__20. When prompted, click the Save File button. Then select OK.
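The file operations in this section also have counterparts in the WebHDFS REST interface that the HttpfsUrl credential from Exercise 1 appears to point at. The sketch below only builds the request URLs and sends nothing over the network. It rests on assumptions: the helper name webhdfs_url is hypothetical, the base URL and user ID are the lab's placeholders, and whether your beta account accepts the standard user.name parameter (rather than some other form of authentication) is not confirmed by this lab.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(base_url, hdfs_path, op, user, **params):
    """Build a WebHDFS request URL (hypothetical helper, no network I/O).

    base_url  : the HttpfsUrl value, e.g. "http://host:14000/webhdfs/v1/"
    hdfs_path : absolute HDFS path such as "/user/biadmin/ConsoleLab"
    op        : a WebHDFS operation name, e.g. MKDIRS or LISTSTATUS
    """
    query = urlencode({"op": op, "user.name": user, **params})
    return base_url.rstrip("/") + quote(hdfs_path) + "?" + query


# Placeholder values from the lab; substitute your own.
BASE = "http://11.22.33.44:14000/webhdfs/v1/"
USER = "biadmin"
LAB_DIR = "/user/biadmin/ConsoleLab"
SAMPLE_FILE = LAB_DIR + "/blogs-data.txt"

# HTTP method and URL for each console action in this section.
planned_requests = [
    ("PUT", webhdfs_url(BASE, LAB_DIR, "MKDIRS", USER)),        # create directory
    ("GET", webhdfs_url(BASE, LAB_DIR, "LISTSTATUS", USER)),    # list contents
    ("PUT", webhdfs_url(BASE, SAMPLE_FILE, "CREATE", USER,
                        overwrite="false")),                    # upload file
    ("GET", webhdfs_url(BASE, SAMPLE_FILE, "OPEN", USER)),      # download file
]

for method, url in planned_requests:
    print(method, url)
```

Printing the planned requests first makes it easy to check the paths against what the Files tab shows before anything is sent to the cluster.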


Exercise 2: Analyzing data with BigSheets

IBM’s Hadoop-based offering enables firms to store, process, and analyze large volumes

of various types of data. In this exercise, you’ll see how you can explore social media

data collected from a sample application provided with InfoSphere BigInsights using

BigSheets, a spreadsheet-style tool accessible from the Web console.

This lab exercise is based on an article that can be found here:

http://www.ibm.com/developerworks/data/library/techarticle/dm-1206socialmedia/index.html

It’s a good idea to read this article before attempting the lab,

as the article explains the business context and application scenario covered in this

exercise. If you prefer, you can watch a 14-minute video from the author of the article

here: http://www.youtube.com/watch?v=kny3nPwSZ_w

Before completing this exercise, you should be familiar with the Web console and be able

to perform the basic operations covered in the previous lab module.


After completing this exercise, you'll be able to:

• Create BigSheets workbooks based on social media data collected about a popular brand (“IBM Watson” in this scenario).

• Perform simple data cleansing and analytical operations to discover insights about the social media data.

• Tag your workbooks so you can easily locate those of interest later.

• Create charts based on your analysis.

Allow 45 - 60 minutes to complete this lab.


Collecting social media data

Sample social media data about "IBM Watson" is available for public download from the

BigSheets developerWorks article referenced above. (In production environments,

analysts can run an IBM-provided application to collect social media data about search

items of their choice.)

For purposes of this lab, examples cite a user ID of "biadmin" for your BigInsights /

MapReduce service. Substitute your user ID for biadmin as you work through the

exercise.

__1. If necessary, launch the Web console.

__2. Download and unzip the sampleData.zip file provided with this article into a local directory on your computer.

__3. Using the DFS navigator available from the Files tab of the Web console, create sampleData and sampleData/IBMWatson subdirectories under your user ID's directory. For example, if your user ID is biadmin, create a /user/biadmin/sampleData/IBMWatson directory.

__4. Upload the news-data.txt and blogs-data.txt files to the ../sampleData/IBMWatson directory.

__5. To review this data, use the Files tab to navigate to the following folder (/user/biadmin/sampleData/IBMWatson) and select the blogs-data.txt file as shown below.

Where did this data come from?

Data for this lab was collected using IBM’s sample BoardReader application provided with InfoSphere BigInsights. This application collected data from thousands of news and blog web sites, and a subset of this information was provided as an attachment to a developerWorks article.


In a future section, you will convert this file to a BigSheets workbook so you can explore, customize, and visualize the data.

Creating a BigSheets workbook

In this section, you will use a spreadsheet-style interface (BigSheets) to explore the

social media data you just uploaded. BigSheets provides access to data in structures

known as “workbooks.”

__1. Return to the Files tab.

__2. Navigate to the /user/biadmin/sampleData/IBMWatson/blogs-data.txt file and click on the file.

__3. Click the Sheet radio button to view this data within a BigSheets interface.


__4. The data is formatted in a JSON Array structure. Click the pencil icon and select the JSON Array option for this file. Then click the green check mark.

__5. Save this as a Master Workbook named “Watson Blogs”. Optionally, provide a description. Click the Save button.

__6. Repeat this process for the news-data.txt file in the same folder. To do this, return to the Files tab, navigate to the file, and follow the 3 previous steps. This time, name the workbook “Watson News”.

__7. Click on the “Workbooks” link in the upper left-hand corner of the page.

__8. Verify that you see these two workbooks on your system.


__9. Add tags to your workbook so users can easily search for and locate it among a long list of workbooks. To do so, first select the “Watson Blogs” workbook.

__10. Scroll down to Workbook Details and add tags for “Watson”, “IBM”, and “Blogs” by selecting the green + and adding each individually. If you don’t see Workbook Details, you may need to toggle between Normal and Full Screen.

__11. From the BigSheets tab, you can quickly filter workbooks and search for a specific tag. Enter the term “tag: Blogs” to see all workbooks that have the associated tag.
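Choosing the JSON Array option in step 4 tells BigSheets to treat the file as a single array whose elements become rows and whose keys become columns. A minimal Python sketch of that interpretation follows. The helper name read_json_array is hypothetical, and the two sample records are invented for illustration (only the column names Country, Language, Type, and IsAdult come from this lab), so the real blogs-data.txt will contain more fields and different values.

```python
import json

# Invented two-record sample in the "JSON Array" shape; the values are
# illustrative only and do not come from the lab's data files.
SAMPLE_TEXT = """
[
  {"Country": "US", "Language": "English", "Type": "blog", "IsAdult": false},
  {"Country": "RU", "Language": "Russian", "Type": "blog", "IsAdult": false}
]
"""


def read_json_array(text):
    """Parse a JSON array of objects into (columns, rows).

    Columns are the keys in first-seen order, which is how a
    spreadsheet-style view would lay them out from left to right.
    """
    rows = json.loads(text)
    if not isinstance(rows, list):
        raise ValueError("expected a top-level JSON array")
    columns = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)
    return columns, rows


columns, rows = read_json_array(SAMPLE_TEXT)
print(columns)
print(len(rows), "rows")
```

If the file were a single JSON object or one object per line instead, this reader would fail or raise the ValueError, which is roughly why picking the matching reader in BigSheets matters.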


Tailoring workbooks and generating charts

In this section, you'll learn how to customize your workbook in a few simple ways. For

example, you'll learn how to remove unwanted columns for a given workbook, combine

data from multiple workbooks together, and perform simple data cleansing operations.

You'll even see how you can visualize your results in simple charts.

__1. From the list of workbooks displayed in BigSheets (which you launched in previous steps), click on the link named “Watson News” to open this workbook.

__2. This Master Workbook is a “base” workbook and has a limited set of things you can edit. Therefore, in order to begin to manipulate the data contained within a workbook, we will want to create a dependent workbook.

__a. Click the “Build new Workbook” button.

__b. When the new workbook appears, change its default name to “Watson News Revised” by clicking on the pencil icon next to the name, then click the green check mark.

__c. Click the Fit column(s) button to more easily see columns A through H on your screen.

__3. Remove the column “IsAdult” from your workbook. This is currently column E. Click on the triangle next to the column name of “IsAdult” and select the “Remove” option to remove this from your new workbook.

__4. In this case, you want to keep only a few columns. To remove a larger number of columns more easily (without repeating the same click-remove process), click the triangle again (from any column) and select the “Organize Columns...” option.

__a. Click the red X button next to each column title you want to remove.

In this case, KEEP the following columns:

__i. Country
__ii. FeedInfo
__iii. Language
__iv. Published
__v. SubjectHtml
__vi. Tags
__vii. Type
__viii. Url

Did I lose data?

Deleting a column does not remove data. Deleting a column in a workbook just removes the mapping to this column.

__b. Click the green check mark button when you are ready to remove the columns you have selected to remove.

__5. Click on the Fit column(s) button again to show columns A through H. You should see the following columns in your new workbook.

__6. Select “Save and Exit”. You may input an optional description. Click Save to complete the save.

__7. After clicking Save, you will be shown two buttons (run and close). Click the Run button to run the workbook. You can monitor the progress of your request by watching the status bar indicator in the upper right-hand side of the page.

__8. To remove the unwanted columns in the “Watson Blogs” workbook, repeat the steps above to produce a new workbook called “Watson Blogs Revised”.

__9. Now, since we have two workbooks with the exact same structure, we can perform a “union” of these two workbooks as the basis for exploring the coverage of IBM Watson across the sources that BoardReader provided.

__10. To perform this action, make sure you are currently in the “Watson News Revised” workbook. Click the “Build New Workbook” button again.

__11. In the top left-hand side or bottom left, you should see a link called “Add sheets”. This allows you to perform additional analysis on your data within the current workbook. Click the “Add sheets” link.

__12. The Load option will allow you to load data into the current workbook from another workbook. Click the Load icon and select the “Watson Blogs Revised” workbook link.

__13. The system will ask you for a *Sheet Name and you should change Sheet1 to “Watson Blogs Revised” as the name of the new tab that will be created in your current workbook.


__14. Click the green check-mark button at this time to load the new workbook into your current workbook.

__15. Verify that you see two tabs at the bottom of your current workbook. Move your mouse over the second one, and a tool tip will show the action and the name you provided for this sheet / tab within your current workbook. (Giving your tabs meaningful names will help you and others who use your sheets understand your data processing flow(s).)

__16. Next, add a new sheet to perform the Union function. Select Union.

__17. The Union function asks for the other “sheet” you would like to use. Select the triangle to expose the pull-down menu.

__18. Select “Watson News Revised” and then click the green plus-mark button.

__19. Provide the sheet name “News and Blogs”. Before you click the green check-mark button to add this new tab/sheet to your workbook, make sure your options match the example below.


__20. Click the green check-mark button to add this tab to your workbook.

__21. Save and Exit and then run this new workbook. When prompted for a description, you can change the name of your new workbook from “Watson News Revised(1)” to “Watson News and Blogs”. Click the Save button. Then click the Run button to run the workbook.

__22. Select the Workflow Diagram icon to see a mapping of the workbooks associated with the News and Blogs workbook. This can be done at any point to keep a clear picture of which workbooks you are extending to/from.

__23. Close this frame.

__24. If you are not already in the workbook, open the “Watson News and Blogs” workbook.

__25. In this case, we want to keep our initial workbook “as is” and produce another workbook that contains the records in sorted order. So, click the “Build New Workbook” button to do this.

__26. To more easily keep track of what you are doing, rename your new workbook immediately. (This is a good practice to follow in general.) Call this workbook Watson Sorted.

__27. Explore the language and the types of posts contained in your workbook. To do so, click the triangle next to the column name of any column so that you can select the Sort -> Advanced option.


__28. Click on the pull-down triangle to expose the list of columns under the “Add Columns to Sort” area. Click on the green + button to add the two columns you wish to sort on. Then, select the desired order for sorting each column. In this case, your “Advanced Sort” should look like the following picture.

__29. Click on the green check-mark button to continue and create the new tab/sheet with your desired sorting applied to it.

__30. As with all new tabs/sheets, the system shows you a simulated result based on the rows of data BigSheets keeps in memory. You should be able to click on Fit column(s) to review the contents of both the Language and Type columns to see that your “advanced sort” was applied to this simulated set of data.

__31. Now, “Save and Exit” and then run your workbook. Running applies the sort to the entire data set rather than only the first 2,000 rows that BigSheets holds in memory for the simulation, so you should see different results once the workbook has run. For example, the simulated data showed only one Vietnamese row, but the full data set contains twenty (20) rows in Vietnamese, because most of them lie beyond the first 2,000 rows. Review and confirm these results after the job reaches 100%, and then move on to the next step.

__32. To easily visualize the coverage of posts about IBM Watson by language, you can create a chart. While still in the “Watson Sorted” workbook, click the Add chart link in the lower left. When the list of available chart types is displayed, click Chart > Pie. Then complete the following information to produce a pie chart of the languages used.

__33. Click on the green check-mark button to create the chart tab.

__34. Just like working with tabular data, you will see a simulated visualization. Again, this is based on the rows in cache. (If you click on the Close button here, you can interact with the chart which is based on simulated data. You would then click the “Run” button in the upper right.)

__35. Click the Run button to run the visualization against the entire data set.

__36. Once the chart has been run, you can interact with it to find that the second most popular language for posts regarding IBM Watson is Russian. Move your mouse over this item within your pie chart to see these results.

__37. Mouse over the fifth and sixth largest languages in the pie chart you just generated, and note that they are both variations on the Chinese language.

In the steps that follow, you'll clean up this data so that all forms of "Chinese" will appear as a single category.

__38. From the “Watson Sorted” workbook, click on the Edit button.

__39. Optionally, click on the Fit column(s) button to make your columns thinner and to see more data on the screen.

__40. Add another column to your workbook to capture the new values for various forms of "Chinese". To do this, click on the triangle next to the Language column name. Select the Insert Right -> New Column option.


__41. Then, you will provide a name for your new column, like “Language_Revised” and then click the green check-mark button (or hit enter) to apply your new column name.

__42. Your cursor is then moved to the fx (or function) area where you can provide the function to be used to generate the contents of your new column.

__43. Enter the following formula as your function: IF(SEARCH('Chin*', #Language) > 0, 'Chinese', #Language)

This formula examines the Language column, indicated by #Language. If the #Language value starts with “Chin”, the new #Language_Revised column will contain “Chinese”. If it does not, the value of #Language is copied over to #Language_Revised. (See the original article, referenced at the start of this exercise, for additional explanation of this formula.)

__44. Click the green check-mark button (or hit Enter). The output of this formula will appear in your new column.

__45. Click Save and Exit. You will be prompted to click Run to update the data.

__46. Click the Run button in the upper right to run the workbook.

__47. Now, click on the Language Coverage tab that contains your previously generated pie chart. This now has the status of “needs to be run”. Before we run it, we need to change one of the settings on the pie chart to use our newly generated column named Language_Revised.

__48. To change the settings, click on the triangle next to the Language Coverage tab.

__49. Click to select the “Chart Settings” option.

__50. Change the “Value:” item to be based on the new, Language_Revised column.


__51. Click on the green check-mark button to apply your new settings.

__52. Click on the Run button to regenerate your pie chart.

__53. Once your new pie chart has been generated, you should be able to see Chinese as a cleaned up, single item in your pie chart (compared to the two items you saw previously). With this cleansed data, Chinese is now the second largest and Russian is third.
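To see the logic of this section outside the BigSheets interface, the column selection, union, sort, and language clean-up can be sketched in plain Python. This is an approximation under stated assumptions, not how BigSheets works internally: the sample rows and the two "Chinese" variants are invented, the union is a simple concatenation, the sort is ascending on Language and then Type, and language_revised follows the lab's description of the formula as a prefix match on "Chin". All function names are hypothetical.

```python
from collections import Counter

# Columns kept in step 4 of this section.
KEPT_COLUMNS = ["Country", "FeedInfo", "Language", "Published",
                "SubjectHtml", "Tags", "Type", "Url"]


def keep_columns(rows, columns=KEPT_COLUMNS):
    """Drop every column not in `columns` (missing values become None)."""
    return [{name: row.get(name) for name in columns} for row in rows]


def union(sheet_a, sheet_b):
    """Combine two sheets that share the same structure."""
    return sheet_a + sheet_b


def language_revised(language):
    """Mirror of IF(SEARCH('Chin*', #Language) > 0, 'Chinese', #Language),
    read as a prefix match per the lab's explanation of the formula."""
    return "Chinese" if language.startswith("Chin") else language


def language_counts(rows):
    """Count rows per cleaned-up language, as the pie chart does."""
    return Counter(language_revised(row["Language"]) for row in rows)


# Invented sample rows; real workbooks hold thousands of records.
news = [
    {"Country": "US", "Language": "English", "Type": "news", "IsAdult": False},
    {"Country": "CN", "Language": "Chinese - Simplified", "Type": "news", "IsAdult": False},
]
blogs = [
    {"Country": "RU", "Language": "Russian", "Type": "blog", "IsAdult": False},
    {"Country": "TW", "Language": "Chinese - Traditional", "Type": "blog", "IsAdult": False},
    {"Country": "US", "Language": "English", "Type": "blog", "IsAdult": False},
]

combined = union(keep_columns(news), keep_columns(blogs))
ordered = sorted(combined, key=lambda row: (row["Language"], row["Type"]))
counts = language_counts(ordered)
print(counts.most_common())
```

Counting on the cleaned value rather than the raw Language column is what merges the two Chinese slices into one, the same effect as switching the chart's Value setting to Language_Revised.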


Exercise 3: Querying data with IBM Big SQL

In this exercise, you’ll learn how to use IBM Big SQL, an SQL language processor, to

summarize, query, and analyze data in a data warehouse system for Hadoop. Big SQL

provides broad SQL support that is typical of commercial databases. You can issue

queries using JDBC or ODBC drivers to access data accessible through IBM’s

MapReduce service on the cloud in the same way that you access databases from your

enterprise applications.

To keep things simple and enable you to concentrate on learning Big SQL, you’ll use the interactive Big SQL application available from the Web console to issue your SQL statements. Alternatively, you can use the InfoSphere BigInsights Tools for Eclipse to create and run Big SQL queries interactively from Eclipse.

Before completing this lab, you should be familiar with the Web console and BigSheets, as certain exercises in this lab use the output of work covered in previous labs.

After you complete this module, you will understand how to:

• Create a Big SQL table that uses Hive as its storage manager.

• Load data exported from BigSheets into your Big SQL table.

• Query your Big SQL table from the Web console.

Allow 30 minutes to complete this lab.


Obtaining sample data

In this section, you will use data contained in one of your BigSheets workbooks as sample data to load into a Big SQL table. If you haven't already done so, complete at least the first two sections of the previous lab on BigSheets. You will need access to the Watson Blogs Revised workbook created from that lab.

__1. If necessary, launch the Web console and click on the BigSheets tab.

__2. Open the Watson Blogs Revised workbook you created previously.

__3. In the upper right corner, click the Export As button. When prompted, select TSV (tab-separated values) from the drop-down list of data format types.

__4. Select File as the Export to: destination source.

__5. Click the Browse button. In the DFS file navigator window that appears, navigate to the ../sampleData subdirectory of your user ID and enter sampleBlogs as the file's name. (Do not add a .tsv suffix -- this will be done automatically.)


__6. Click OK. Inspect the data shown and verify that the Include Headers button is unchecked. Click OK again.

__7. Click on the Files tab of the Web console, and navigate to the ../sampleData directory where you exported the file. Verify that sampleBlogs.tsv is present.


Creating, loading, and querying a Big SQL table

Now that you have the results of your BigSheets analysis ready, you can create a Big SQL table for it, load that table with your data, and query the table's contents. For simplicity, this lab explains how to do that from the Web console. If you already have your Eclipse environment set up for working with BigInsights on BlueMix, you can issue these statements from a Big SQL file in one of your projects as well.

__1. From the Welcome tab of the Web console, click on the Run Big SQL Queries link in the Quick Links section.

A new tab will appear in your Web browser.


__2. Determine the name of the Big SQL schema that your user ID is authorized to access. This information is part of the BigSqlUrl environment variable included in the VCAP_SERVICES list, which was discussed in the first lab module (on the Web console). In the example shown below, the JDBC URL for Big SQL is jdbc:bigsql://11.22.33.44:7052/db0JYXFCBY so the database name for this user is db0JYXFCBY.

{
  "name": "MapReduce-ei562",
  "label": "MapReduce-2.1.0",
  "plan": "Community",
  "credentials": {
    "username": "u123456",
    "password": "pw123456.1234567890",
    "BigSqlUrl": "jdbc:bigsql://11.22.33.44:7052/db0JYXFCBY",
    "ConsoleUrl": "https://11.22.33.44:8080/data/html/index.html",
    "HiveUrl": "jdbc:hive://11.22.33.44:10000/db0JYXFCBY",
    "HttpfsUrl": "http://11.22.33.44:14000/webhdfs/v1/"
  }
}

You will need to refer to this database (schema) name in your queries.

__3. In the middle box of the Big SQL query application, type a CREATE TABLE statement similar to the example shown below, adjusting the schema name for the table to match your environment.

create table if not exists schema-name.watsonblogs
  (country char(2), FeedInfo varchar(300),
   countryLang char(25), published char(25),
   subject varchar(300), tags varchar(100),
   type char(20), url varchar(100))
row format delimited fields terminated by '\t';

The screen capture shown above depicts a version of this statement that creates a table in the db0JYXFCBY schema (database) named watsonblogs if such a table doesn't already exist. The table consists of 8 columns, each corresponding to a field in the TSV file generated by the BigSheets export operation.

__4. Click Run and verify that the operation completes successfully.

__5. Next, enter a LOAD command similar to this example, adjusting the path specification and database schema name to match your environment:


load hive data inpath '/user/0JYXFCBY/sampleData/sampleBlogs.tsv'

overwrite into table db0JYXFCBY.watsonblogs;

Note that this command loads data from the sampleBlogs.tsv file in the /user/0JYXFCBY/sampleData subdirectory of the distributed file system into the db0JYXFCBY.watsonblogs table, overwriting any data that might be present in the table. In keeping with Hive's behavior, this command moves the file from its original DFS directory into the Hive database.

__6. Click Run and verify that the operation completes successfully.

__7. Finally, query the table with a SELECT statement similar to this:

select * from db0JYXFCBY.watsonblogs limit 10;

Remember to adjust the SELECT statement to reference the appropriate schema for your environment.

__8. Run the command and inspect your output.

__9. If desired, create additional Big SQL tables based on other BigSheets workbooks that you export and experiment with querying data in these tables.
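
Because every statement in this section must carry your own schema name, it can help to generate the three statements from a single value. The following is a minimal sketch under stated assumptions: the class and method names are hypothetical, the statements are copied from the examples in this section, and the identifier check is a simple precaution added here rather than anything Big SQL requires.

```java
// Hypothetical helper that fills a schema name into this section's statements.
final class WatsonBlogsSql {

    // The schema is concatenated into SQL text, so accept plain identifiers only.
    static String checkedSchema(String schema) {
        if (schema == null || !schema.matches("[A-Za-z_][A-Za-z0-9_]*")) {
            throw new IllegalArgumentException("Unexpected schema name: " + schema);
        }
        return schema;
    }

    static String createTable(String schema) {
        return "create table if not exists " + checkedSchema(schema) + ".watsonblogs "
            + "(country char(2), FeedInfo varchar(300), "
            + "countryLang char(25), published char(25), "
            + "subject varchar(300), tags varchar(100), "
            + "type char(20), url varchar(100)) "
            + "row format delimited fields terminated by '\\t'";
    }

    static String load(String schema, String dfsPath) {
        if (dfsPath == null || dfsPath.contains("'")) {
            throw new IllegalArgumentException("Unexpected DFS path: " + dfsPath);
        }
        return "load hive data inpath '" + dfsPath + "' "
            + "overwrite into table " + checkedSchema(schema) + ".watsonblogs";
    }

    static String selectSample(String schema) {
        return "select * from " + checkedSchema(schema) + ".watsonblogs limit 10";
    }

    public static void main(String[] args) {
        // Sample values from this lab; replace them with your own.
        String schema = "db0JYXFCBY";
        String path = "/user/0JYXFCBY/sampleData/sampleBlogs.tsv";
        System.out.println(createTable(schema) + ";");
        System.out.println(load(schema, path) + ";");
        System.out.println(selectSample(schema) + ";");
    }
}
```

The printed statements can be pasted into the Big SQL query application one at a time.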


Optional Exercise: Setting up an Eclipse environment

IBM provides Eclipse tooling to simplify development of applications that use BigInsights services. This optional exercise takes you through the basics of configuring an appropriate Eclipse environment to work with some of the BlueMix MapReduce services available to you.

In this exercise, you will learn how to:

• Download IBM Eclipse tooling for BigInsights

• Configure a BigInsights server connection

• Configure a Big SQL connection

• Create and test a Big SQL JDBC client application

Allow 15 – 30 minutes to complete this exercise (not including software download time).

Getting started

__10. From the Welcome page of the Web console, click on the Quick Link for information about enabling your Eclipse development environment.

__11. Review the information displayed.


__12. If necessary, download the appropriate Eclipse shell from www.eclipse.org.

__13. Click on the link provided to review the detailed information in the BigInsights Information Center on this topic.

__14. Launch Eclipse, and follow the standard process for installing new software. (For example, click Help > Install new software.)

__15. After you've installed the BigInsights plug-in, verify that your installation was successful. Open Eclipse. The Task Launcher for Big Data should appear.


__16. If the Task Launcher does not appear, you may need to open the BigInsights perspective manually. From the Eclipse menu items at top, select Window > Perspective > BigInsights. (If necessary, click Window > Perspective > Other > BigInsights.)

Creating a BigInsights Server Connection

Issuing interactive Big SQL statements requires a live connection to the IBM MapReduce service on BlueMix (i.e., a BigInsights server connection). This section describes how you can define a BigInsights server connection in Eclipse.

__1. From the Overview tab of the Task Launcher for Big Data, click Create a BigInsights server connection.

__2. Enter the appropriate information in the pop-up window, including the URL to access your BigInsights Web console, a server name of your choice, a valid BigInsights user ID, and a password. (The information shown below contains sample information -- the data you enter must match the VCAP_SERVICES environment variable values for your BlueMix environment.)


__3. Click the Test connection button and verify that you can successfully connect to your target cluster.

__4. Click the Save password box and Finish.

__5. In the BigInsights Server pane, expand the list of servers and verify that the server connection you created appears.

Creating a Big SQL Connection in Eclipse

Certain tasks require a live connection to a Big SQL server within the BigInsights cluster. This section explains how you can define a JDBC connection to your Big SQL server.

__6. Open the Database Development perspective. Window > Open Perspective > Other > Database Development.

__7. In the Data Source Explorer pane, right click on Database Connections > Add Repository.


__8. In the New Connection Profile menu, select Big SQL JDBC Driver and enter a name for your new driver (e.g., My Big SQL Connection). Click Next.

__9. Enter the appropriate connection information for your environment, including the host name, port number, user ID, and password. Verify that you have selected the correct JDBC driver at the top. (The information shown below contains sample information -- the data you enter must match the VCAP_SERVICES environment variable values for your BlueMix environment.)


__10. Click the Test connection button and verify that you can successfully connect to your target Big SQL server.

__11. Click the Save password box and Finish.

__12. In the Data Source Explorer, expand the list of data sources and verify that your Big SQL connection appears.
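
The host name, port number, and database name requested by the connection profile are all embedded in the BigSqlUrl value from VCAP_SERVICES. As a rough sketch, they can be pulled apart with the standard java.net.URI class once the leading "jdbc:" is removed. The class name BigSqlUrlParts is hypothetical, and the URL is the sample value used in this lab.

```java
import java.net.URI;

// Hypothetical helper; splits a BigSqlUrl value into connection profile fields.
final class BigSqlUrlParts {
    final String host;
    final int port;
    final String database;

    private BigSqlUrlParts(String host, int port, String database) {
        this.host = host;
        this.port = port;
        this.database = database;
    }

    static BigSqlUrlParts parse(String jdbcUrl) {
        String prefix = "jdbc:";
        if (jdbcUrl == null || !jdbcUrl.startsWith(prefix)) {
            throw new IllegalArgumentException("Not a JDBC URL: " + jdbcUrl);
        }
        // Drop "jdbc:" so the remainder parses as an ordinary hierarchical URI.
        // Spaces are removed to tolerate whitespace picked up while copying the value.
        String rest = jdbcUrl.substring(prefix.length()).replace(" ", "");
        URI uri = URI.create(rest);
        String path = uri.getPath();
        String database = (path != null && path.startsWith("/")) ? path.substring(1) : "";
        return new BigSqlUrlParts(uri.getHost(), uri.getPort(), database);
    }

    public static void main(String[] args) {
        // Sample value from this lab; use the BigSqlUrl from your own environment.
        BigSqlUrlParts parts = parse("jdbc:bigsql://11.22.33.44:7052/db0JYXFCBY");
        System.out.println("Host:     " + parts.host);
        System.out.println("Port:     " + parts.port);
        System.out.println("Database: " + parts.database);
    }
}
```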

Creating and testing a Big SQL JDBC client application

IBM’s MapReduce service enables you to write a JDBC client application to access Big SQL data much as you would access data in any relational database. You don’t even need to download and install the BigInsights Eclipse plug-in to do so – you just need the Big SQL JDBC client package, accessible from the Web console’s Welcome page.


In this exercise, you’ll create a simple Java project with a client application that opens a Big SQL database connection, executes a simple query, and displays the results.

__1. From the Welcome page of the Web console, click on the Quick Link for downloading the Big SQL Client drivers. If you download this .zip file, you’ll note that it contains a JDBC .jar file. (The JDBC client driver is also included in the BigInsights plug-in you downloaded in the previous sections.)

__2. In Eclipse, create a Java project by clicking File > New > Project. From the New Project window, select Java Project. Click Next.

__3. Type a name of your choice for the project in the Project Name field. Click Next.

__4. Open the Libraries tab and click Add External Jars. Provide a path to the Big SQL JDBC driver (bigsql-jdbc-driver.jar).


__5. Click Finish. (If you’re prompted to open a different perspective, click No.)

__6. Right-click on your Java project, and click New > Package. Enter a name for your package when prompted, and click Finish.

__7. Right-click your package, and click New > Class.

__8. In the New Java Class window, enter a name for your class. Select the public static void main(String[] args) check box. Click Finish.


__9. Copy or type the following code into your .java file. Note that you will need to adjust some variable settings shown here to match your environment.

// a. Declare package & class names; import required package(s)
package test;

import java.sql.*;

public class Sample {

    // b. Set JDBC & database info - customize these for your env.
    static final String db = "jdbc:bigsql://74.111.222.33:7052/yourSchema";
    static final String user = "yourID";
    static final String pwd = "yourPassword";

    /**
     * @param args
     */
    public static void main(String[] args) {
        Connection conn = null;
        Statement stmt = null;
        System.out.println("Started sample JDBC application.");

        try {
            // c. Register JDBC driver
            Class.forName("com.ibm.biginsights.bigsql.jdbc.BigSQLDriver");

            // d. Get a connection
            conn = DriverManager.getConnection(db, user, pwd);
            System.out.println("Connected to the database.");

            // e. Execute a query
            // Change the schema name in the SELECT statement
            stmt = conn.createStatement();
            System.out.println("Created a statement.");
            String sql;
            sql = "select countryLang, subject, url from "
                + "yourSchema.watsonblogs limit 10";
            ResultSet rs = stmt.executeQuery(sql);
            System.out.println("Executed a query.");

            // f. Obtain results
            System.out.println("Result set: ");
            while (rs.next()) {
                // Retrieve by column name
                String lang = rs.getString("countrylang");
                String subject = rs.getString("subject");
                String url = rs.getString("url");

                // Display values
                System.out.print("* Language: " + lang + "\n");
                System.out.print("* Subject: " + subject + "\n");
                System.out.print("* Url: " + url + "\n\n");
            }

            // g. Close open resources
            rs.close();
            stmt.close();
            conn.close();
        } catch (SQLException sqlE) {
            // Process SQL errors
            sqlE.printStackTrace();
        } catch (Exception e) {
            // Process other errors
            e.printStackTrace();
        } finally {
            // Ensure resources are closed before exiting
            try {
                if (stmt != null)
                    stmt.close();
            } catch (SQLException sqle2) {
                // nothing we can do
            }
            try {
                if (conn != null)
                    conn.close();
            } catch (SQLException sqlE) {
                sqlE.printStackTrace();
            }
        } // end finally block
    } // end main
}

__a. Adjust the package and class names shown, if needed, to match the names you selected earlier.


__b. Modify the database connectivity variables as needed to match your environment.

__c. Register the JDBC driver so that you can open a communications channel with the database. The correct JDBC driver is shown.

__d. Open the connection.

__e. Run a query. Note that you will need to modify the SQL statement shown to reference the correct database schema name for your table.

__f. Extract data from the result set.

__g. Clean up the environment by closing all of the database resources.
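
As an alternative to closing each resource by hand in step g, Java 7 and later offer try-with-resources, which closes the result set, statement, and connection automatically, even when an exception is thrown. The sketch below is a variation on the lab's sample, not the lab's own code: the class and method names are hypothetical, and the URL, user ID, password, and schema are the same placeholders you must replace.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical variation on the lab's Sample class using try-with-resources.
final class SampleWithResources {

    // Builds the lab's query for a given schema name.
    static String buildQuery(String schema) {
        return "select countryLang, subject, url from " + schema + ".watsonblogs limit 10";
    }

    // Runs the query, prints each row, and returns the number of rows printed.
    static int printRows(String url, String user, String pwd, String schema)
            throws SQLException {
        int rows = 0;
        // Resources are closed in reverse order when the block exits.
        try (Connection conn = DriverManager.getConnection(url, user, pwd);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(buildQuery(schema))) {
            while (rs.next()) {
                System.out.println("* Language: " + rs.getString("countrylang"));
                System.out.println("* Subject: " + rs.getString("subject"));
                System.out.println("* Url: " + rs.getString("url"));
                System.out.println();
                rows++;
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        try {
            // Driver class name as given in the lab; requires the Big SQL JDBC .jar.
            Class.forName("com.ibm.biginsights.bigsql.jdbc.BigSQLDriver");
            // Placeholder connection values; customize these for your environment.
            int rows = printRows("jdbc:bigsql://74.111.222.33:7052/yourSchema",
                                 "yourID", "yourPassword", "yourSchema");
            System.out.println("Rows printed: " + rows);
        } catch (ClassNotFoundException | SQLException e) {
            e.printStackTrace();
        }
    }
}
```

With this structure the finally block from the original sample is no longer needed, since cleanup is tied to the scope of the try statement.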


Summary

Congratulations! You’ve just learned how to get started using the beta version of IBM MapReduce services on the BlueMix cloud. Behind the scenes, this service uses InfoSphere BigInsights, IBM’s Hadoop-based platform, to execute jobs on your behalf. Feel free to visit the public wiki to learn more about IBM’s Hadoop-based platform through articles, videos, online courses, etc. And be sure to post any questions you may have about IBM MapReduce services or BigInsights to the public forum.

The authors would like to thank Louis Mau, Jayatheerthan Krishnamurthy, and Ellen

Patterson for their assistance.


Important Information:

References in this lab to IBM products, programs, or services do not imply that they will

be available in all countries in which IBM operates.

These materials are provided for informational purposes only, and are neither intended to,

nor shall have the effect of being, legal or other guidance or advice to any participant.

While efforts were made to verify the completeness and accuracy of the information

contained in this presentation, it is provided AS-IS without warranty of any kind, express

or implied. IBM shall not be responsible for any damages arising out of the use of, or

otherwise related to, this presentation or any other materials. Nothing contained in this

document is intended to, nor shall have the effect of, creating any warranties or

representations from IBM or its suppliers or licensors, or altering the terms and

conditions of the applicable license agreement governing the use of IBM software.

All customer examples described are presented as illustrations of how those customers

have used IBM products and the results they may have achieved. Actual environmental

costs and performance characteristics may vary by customer. Nothing contained in these

materials is intended to, nor shall have the effect of, stating or implying that any activities

undertaken by you will result in any specific sales, revenue growth or other results.

© Copyright IBM Corporation 2014. All rights reserved.

U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by

GSA ADP Schedule Contract with IBM Corp.

IBM, the IBM logo, ibm.com, and BigInsights are trademarks or registered

trademarks of International Business Machines Corporation in the United States, other

countries, or both. If these and other IBM trademarked terms are marked on their first

occurrence in this information with a trademark symbol (® or ™), these symbols indicate

U.S. registered or common law trademarks owned by IBM at the time this information

was published. Such trademarks may also be registered or common law trademarks in

other countries. A current list of IBM trademarks is available on the Web at “Copyright

and trademark information” at www.ibm.com/legal/copytrade.shtml

Other company, product, or service names may be trademarks or service marks of

others.
