hands on lab introduction to hadoop with biginsights · pdf file1 . hands on lab ....

1

Hands on Lab Introduction to Hadoop with BigInsights dev@Pulse, Feb. 24 - 25, 2014

Cindy Saracco, Senior Solutions Architect, [email protected], @IBMbigdata

Nicolas Morales, Solutions Engineer, [email protected], @NicolasJMorales

Table of Contents Getting started with InfoSphere BigInsights .................................................................. 2

Exercise 1: BigInsights Web Console .............................................................................. 3

Launching the Web Console ...................................................................................................................... 4

Working with the Welcome page ............................................................................................................... 5

https://twitter.com/@IBMbigdata�

https://twitter.com/NicolasJMorales�

https://twitter.com/NicolasJMorales�

2

Administering BigInsights ......................................................................................................................... 7

Working with Files ....................................................................................................................................10

Exercise 2: Social Media and BigSheets........................................................................ 15

Collecting Social Media ...........................................................................................................................16 Creating a BigSheets Workbook ...............................................................................................................19

Exercise 3: Querying data with IBM Big SQL ............................................................. 34

Connecting to the IBM Big SQL server ....................................................................................................35

Creating a project and an SQL script file .................................................................................................37

Creating tables and loading data .............................................................................................................39

Running basic Big SQL queries ................................................................................................................41

Optional Exercise: Using Big SQL from a JDBC client application ........................................................44 Analyzing the data with Big SQL ..............................................................................................................48

Exercise 4: Developing, publishing, and deploying your first big data app .............. 52

Developing your application ....................................................................................................................52

Publishing your application......................................................................................................................53

Deploy and run your application ..............................................................................................................56

Exercise 5: Analyzing BigSheets data in IBM Big SQL .............................................. 58

Create and modify a BigSheets workbook ................................................................................................59

Export your Workbook ..............................................................................................................................62 Creating an Eclipse project and new SQL script ......................................................................................64

Exporting a JSON Array to Big SQL ........................................................................................................69

Exercise 6: Text analysis of social media data............................................................. 72

Upload and deploy the sample text analytics application ........................................................................72

Upload sample data and create a master workbook .................................................................................75

Analyzing text content of social media posts ............................................................................................78

Summary .......................................................................................................................... 83

Acknowledgements and Disclaimers ............................................................................. 85

Getting started with InfoSphere BigInsights In this hands-on lab, you'll learn how to work with IBM InfoSphere BigInsights to analyze big data. In particular, you'll explore social media data using a spreadsheet-style interface to discover insights about the global coverage of a popular brand without writing any code. You'll also use Big SQL to query data managed by BigInsights and learn how to develop, test, publish, and deploy your first big data application. Finally, you'll see how business analysts can invoke a pre-packaged, custom text analytics

3

application that analyzes the sentiments expressed about a popular brand in news and blog postings on the Web. After completing this hands-on lab, you’ll be able to: • Navigate the BigInsights Web Console • Explore big data using a spreadsheet-style tool • Query big data using Big SQL • Create and run an application • Analyze text in social media data using a custom-built extractor Allow 2.5 to 3 hours to complete this lab. This version of the lab was designed using the InfoSphere BigInsights 2.1 Quick Start Edition. Throughout this lab, you will be using the following account login information: Username Password Linux biadmin biadmin

Exercise 1: BigInsights Web Console IBM’s InfoSphere BigInsights 2.1 Enterprise Edition enables firms to store, process, and analyze large volumes of various types of data. In this exercise, you’ll see how you can work with its Web console to administer your system, launch applications (jobs), monitor the status of applications, and perform other functions. For further details on the Web console or BigInsights, visit the product's InfoCenter. In addition, IBM's big data channel on YouTube contains a video demonstration of the Web console.

http://www.youtube.com/ibmbigdata�

4

After completing this hands-on lab, you’ll be able to:

• Launch the Web console. • Work with popular resources accessible through the Welcome page. • Administer BigInsights by inspecting the status of your cluster, starting and

stopping components, and accessing administrative tools available for open source components provided with BigInsights.

• Work with the distributed file system. In particular, you'll explore the HDFS directory structure, create subdirectories, and upload files to HDFS.

• Launch applications (jobs) and inspect their status. Allow 30 minutes to complete this section of lab. This lab is an introduction to a subset of console functions. Real-time monitoring dashboards and application linking are among the more advanced console functions that are out of this lab's scope.

Launching the Web Console The first step to accessing the BigInsights Web Console is to launch all of the BigInsights processes (Hadoop, Hive, Oozie, Map/Reduce etc.). Once started, these services will remain active for the remainder of this lab.

__1. Select the Start BigInsights icon to start all services.

5

__2. Select the BigInsights Console icon to launch the Web Console.

__3. Verify that your Web console appears similar to this, and note each section: Tasks: Quick access to popular BigInsights tasks Quick Links: Links to internal and external quick links and downloads to enhance your environment Learn More: Online resources available to learn more about BigInsights

Working with the Welcome page This section introduces you to the Web console's main page displayed through the Welcome tab. The Welcome page features links to common tasks, many of which can also be launched from other areas of the console. In addition, the Welcome page includes links to popular external resources, such as the BigInsights InfoCenter (product documentation) and community forum. You'll explore several aspects of this page.

__1. In the Welcome Tab, the Tasks pane allows you to quickly access common tasks. Select the View, start or stop a service task. If necessary scroll down.

6

__2. This takes you to the Cluster Status tab. Here, you can stop and start Hadoop services, as well as gain additional information as shown in the next section

__3. Click on the Welcome tab to return back to the main page.

__4. Inspect the Quick Links pane at top right and use its vertical scroll bar (if necessary) to become familiar with the various resources accessible through this pane. The first several links simply activate different tabs in the Web console, while subsequent links enable you to perform set-up functions, such as adding BigInsights plug-ins to your Eclipse development environment.

7

__5. Inspect the Learn More pane at lower right. Links in this area access external Web resources that you may find useful, such as the Accelerator demos and documentation, BigInsights InfoCenter, a public discussion forum, IBM support, and IBM's BigInsights product site. If desired, click on one or more of these links to see what's available to you

Administering BigInsights The Web console allows administrators to inspect the overall health of the system as well as perform basic functions, such as starting and stopping specific servers or components, adding nodes to the cluster, and so on. You’ll explore a subset of these capabilities here.

8

Inspecting the status of your cluster

__1. Click on the Cluster Status tab at the top of the page to return to the Cluster Status window

__2. Inspect the overall status of your cluster. The figure below was taken on a single-node cluster that had several services running. One service – Monitoring -- was unavailable. (If you installed and started all BigInsights services on your cluster, your display will show all services as running.).

__3. Click on the Hive service and note the detailed information provided for this service in the pane at right. From here, you can start or stop the hive service (or any service you select) depending on your needs. For example, you can see the URL for Hive's Web interface and its process ID.

__4. Optionally, cut-and-paste the URL for Hive’s Web interface into a new tab of your browser. You'll see an open source tool provided with Hive for administration purposes, as shown below.

9

__5. Close this tab and return to the Cluster Status section of the BigInsights Web console

Starting and stopping a component

__6. If necessary, click on the Hive service to display its status.

__7. In the pane to the right (which displays the Hive status), click the red Stop button to stop the service

__8. When prompted to confirm that you want to stop the Hive service, click OK and wait for the operation to complete. The right pane should appear similar to the following image

__9. Restart the Hive service by clicking on the green arrow just beneath the Hive Status heading, as seen in the previous figure. When the operation completes (may take a minute), the Web console will indicate that Hive is running again, likely under a process ID that differs from the earlier Hive process ID shown at the beginning of this lab module. You may need to use the Refresh button of your Web browser to reload information displayed in the left pane.

Accessing service specific Web tools

10

__10. Click on the Welcome tab

__11. Click on the Access secure cluster servers button in the Quick Links section at right (ensure the pop-up blocker is disabled).

__12. Inspect the list of server components for which there are additional Web-based tools. The BigInsights console displays the URLs you can use to access each of these Web sites directly. (Make sure your pop-up blocker is disabled for this step.)

__13. Click on the namenode alias. Your browser will display the standard NameNode Web interface provided by Apache Hadoop, which is covered in a separate lab on HDFS and MapReduce.

Working with Files The Files tab of the console enables you to explore the contents of your file system, create new subdirectories, upload small files for test purposes, and perform other file-related functions. In this module, you’ll learn how to perform such tasks against the Hadoop Distributed File System (HDFS) of BigInsights.

11

__1. Click on the Files tab of the Web console to begin exploring your distributed file system.

__2. Expand the directory tree shown in the pane at left (/user/biadmin). If you already uploaded files to HDFS, you’ll be able to navigate through the directory to locate them.

__3. Become familiar with the functions provided through the icons at the top of this pane, as we'll refer to some of these in subsequent sections of this module. Simply point your cursor at the icon to learn its function. From left to right, the icons enable you to Copy a file or directory, move a file, create a directory, rename a file or directory, upload a file to HDFS, download a file from HDFS to your local file system, remove a file from HDFS, set permissions, open a command window to launch HDFS shell commands, and refresh the Web console page

__4. Position your cursor on the user/biadmin directory and click the Create Directory icon to create a subdirectory for test purposes

12

__5. When a pop-up window appears prompting you for a directory name, enter ConsoleLab and click OK

__6. Expand the directory hierarchy to verify that your new subdirectory was created.

__7. Create another directory named ConsoleLabTest.

__8. Use the Rename icon to rename this directory to ConsoleLabTest2

__9. Click the Move icon, when the pop up Move screen appears select the ConsoleLab directory and click OK.

13

__10. Using the set permission icon, you can change the permission settings for your directory. When finished click OK.

__11. While highlighting the ConsoleLabTest2 folder, select the Remove icon and remove the directory.

14

__12. Remain in the ConsoleLab directory, and click the Upload icon to upload a small sample file for test purposes.

__13. When the pop-up window appears, click the Browse button to browse your local file system for a sample file.

__14. Navigate through your local file system to the directory where BigInsights was installed. For the IBM-provided VMware image, BigInsights is installed in file system: /opt/ibm/biginsights. Locate the …/IHC subdirectory and select the CHANGES.txt file. If you can’t find it, you may select a different file for this step. Click OK

__15. Verify that the window displays the name of this file. Note that you can continue to Browse for additional files to upload and that you can delete files as upload targets from the displayed list. However, for this exercise, simply click OK

__16. When the upload completes, verify that the CHANGES.txt file appears in the directory tree at left, if it is not immediately visible click the refresh button. On the right, you should see a subset of the file’s contents displayed in text format

15

__17. Highlight the CHANGES.txt file in your ConsoleLab directory and click the Download button.

__18. When prompted, click the Save File button. Then select OK.

__19. If Firefox is set as default browser, the file will be saved to your user Downloads directory. For this exercise, the default directory location is fine

Exercise 2: Social Media and BigSheets IBM’s InfoSphere BigInsights 2.1 Enterprise Edition enables firms to store, process, and analyze large volumes of various types of data. In this exercise, you’ll see how you can find and ingest social media data from Boardreader and do some ad hoc data exploration with BigSheets. This lab is based on an article that can be found here:

16

http://www.ibm.com/developerworks/data/library/techarticle/dm-1206socialmedia/index.html If you have not done so, it is suggested that you read the article before attempting this lab as additional details are contained in the article that will help you understand the business reasons for taking the actions we take in the following lab steps. You can also watch a 14-minute video from the author of the article here: http://www.youtube.com/watch?v=kny3nPwSZ_w After completing this hands-on lab, you’ll be able to:

• Create a Big Sheets workbook in order to process some of the Boardreader input data that has been loaded into the HDFS for you.

• Create Big Sheets visualizations to see some of the quantitative analytics. Allow 45 minutes to complete this lab.

Collecting Social Media

__1. If your BigInsights services aren’t started, select the Start BigInsights icon.

Where did this data come from? For time efficiency of this lab, the data was already collected using the BoardReader application. BoardReader was used to collect data from news and blog websites, with the output saved into your HDFS.

http://www.ibm.com/developerworks/data/library/techarticle/dm-1206socialmedia/index.html�

http://www.ibm.com/developerworks/data/library/techarticle/dm-1206socialmedia/index.html�

http://www.youtube.com/watch?v=kny3nPwSZ_w�

17

__2. Launch the BigInsights WebConsole icon to launch the Web Console

__3. Verify that your Web console appears similar to this:

The social media output that will be used for this lab has been preloaded into the HDFS. This data was retrieved using the Boardreader app. To learn how to use the Boardreader app to pull social media data, refer to the BigSheets developerWorks article referenced above.

__4. To review this data, use the Files tab to navigate to the following folder (/user/biadmin/sampleData/IBMWatson) and select the blogs-data.txt file as shown below

18

__5. In a future section, we will display this text file within BigSheets in order to display, interact with, and visualize the data within BigInsights.

19

Creating a BigSheets Workbook In this section, we will use the spread-sheet style interface known as BigSheets. BigSheets provides access to data in items known as “workbooks.” Creating a Workbook from Boardreader Data

__1. Return to the Files tab.

__2. Navigate to the /user/biadmin/sampleData/IBMWatson/blogs-data.txt file. Make sure this file is selected.

__3. Click the Sheet radio button to view this data within a BigSheets interface.

__4. The data that comes from the Boardreader application within BigInsights is formatted in a JSON Array structure. You will need to click on the pencil icon and select the JSON Array option for this file. Then click the green check.

20

__5. Save this as a Master Workbook named “Watson Blogs”. You can optionally provide a description. Click the Save button.

__6. You will want to do the same thing for the news-data.txt file in the same folder. To do this, you will need to return to the Files tab, navigate to the file, and follow the same process. This time, you will provide the name “Watson News”.

__7. Click on the “Workbooks” link in the upper left-hand corner of the page.

__8. You should now see (2) or more workbooks on your system, depending on the configuration of your VM. The top (2) are the ones you created in the previous steps.

__9. Adding tags allows us to quickly search and manage workbooks. Select the “Watson Blogs” workbook.

__10. Scroll down to Workbook Details, add the following tags “Watson” “IBM” “Blogs” by selecting the green + and adding each individually. If you don’t see Workbook Details, you may need to toggle between Normal and Full Screen.

21

__11. From the BigSheets tab, you can quickly filter workbooks and search for a specific tag. Enter the term “tag: Blogs” to see all workbooks that have the associated tag.

22

Tailoring Your Workbooks In this section, we will first show you how to reduce the amount of data you are working with by removing unwanted columns from within your BigSheets Workbooks. We will also perform a merge or “union” of two data workbooks.

__12. From the list of Workbooks (you launched in the previous steps), click on the link named “Watson News” to open this workbook in BigSheets.

__13. This Master Workbook is a “base” workbook and has a limited set of things you can edit. Therefore, in order to begin to manipulate the data contained within a workbook, we will want to create a dependent workbook.

__a. Click the “Build new Workbook” button

__b. When the new Workbook appears, you can change its default name (by clicking on the pencil icon next to the name) to the new name of “Watson News Revised” then click the green check mark.

__c. Click the Fit column(s) button to more easily see columns A through H on your screen

.

23

__14. You have decided you would like to remove the column “IsAdult” from your workbook. This is currently column E. Click on the triangle next to the column name of “IsAdult” and select the “Remove” option to remove this from your new workbook.

__15. In this case, you want to keep only a few columns. In order, to more easily remove a larger number of columns (without having to do this same click-remove process), click the triangle again (from any column) and select the “Organize Columns…” option.

__a. Click the red X button next to each column title you want to remove.

In this case, KEEP the following columns…

__i. Country __ii. FeedInfo __iii. Language __iv. Published __v. SubjectHtml __vi. Tags __vii. Type __viii. Url

Did I lose data? Deleting a column does not remove data. Deleting a column in a workbook just removing the mapping to this column.

24

__b. Click the green check mark button when you are ready to remove the columns you have selected to remove.

__16. Click on the Fit column(s) button again to show columns A through H. You should see the following columns in your new workbook.

__17. Select “Save and Exit”. You may input an optional description. Click Save to complete the save.

__18. After clicking Save, you will be shown two buttons (run and close). Click the Run button to run the workbook. You can monitor the progress of your request by watching the status bar indicator in the upper right-hand side of the page.

__19. To reduce the unwanted columns in the “Watson Blogs” workbook, you will want to perform the same steps above in order to wind up with a new workbook called “Watson Blogs Revised”

25

__20. Now, since we have two workbooks with the exact same structure, we can perform a “union” of these two workbooks as the basis for exploring the coverage of IBM Watson across the sources that Boardreader provided.

__21. To perform this action, make sure you are currently in the “Watson News Revised” workbook. Click the “Build New Workbook” button again.

__22. In the top left-hand side or bottom left, you should see a link called “Add sheets”. This allows you to perform additional analysis on your data within the current workbook. Click the “Add sheets” link.

__23. The Load option will allow you to load data into the current workbook from another workbook. Click the Load icon and select the “Watson Blogs Revised” workbook link.

__24. The system will ask you for a *Sheet Name and you should change Sheet1 to “Watson Blogs Revised” as the name of the new tab that will be created in your current workbook.

26

__25. Click the green check-mark button at this time to load the new workbook into your current workbook.

__26. You should now see two tabs at the bottom on your current workbook. If you move your mouse over the second one, a tooltip will show the action and the name you provided for this sheet / tab within you current workbook. (Giving your tabs meaningful names will help you and other that use your sheets an easy way to understand your data processing flow(s).)

__27. Next, we need to add a new sheet to perform the Union function. Select Union.

__28. The Union function asked for the other “sheet” you would like to use. Select the triangle to expose the pull-down menu.

__29. Select ”Watson News Revised” and then click the green plus-mark button.

__30. Then provide the sheet name “News and Blogs”. Before you click the green check-mark button to add this new tab/sheet to your workbook, make sure your options match the example below.

27

__31. Click the green check-mark button to add this tab to your workbook.

__32. Save and Exit and then run this new workbook. When prompted for a description, you can change the name of your new workbook from “Watson News Revised(1)” to “Watson News and Blogs”. Click the Save button. Then click the Run button to run the workbook.

__33. Select the Workflow Diagram icon to see a mapping of the workbooks associated with the News and Blogs workbook. This can be done at any point to keep a clear picture of which workbooks you are extending to/from.

__34. Close this frame and you can continue to the next section.

Exploring Your Workbook In this section, we will explore the data in your new workbook. You will perform actions like sorting, charting, cleansing, and grouping to make analysis of the HDFS data easier via BigSheets.

__35. If you are not already in the workbook, open the “Watson News and Blogs” workbook.

__36. In this case, we want to keep our initial workbook “as is” and produce another workbook that contains the records in sorted order. So, click the “Build New Workbook” button to do this.

__37. As a way to keep track of what you are doing, I typically suggest that you rename your new workbook right away as this helps to remind you of what you are working on. So, provide a new name for your workbook, like “Watson Sorted”.

__38. After looking at the data, you decide you would like to understand more about the language and the types of posts contained in your workbook.

28

__39. Clicking on the triangle next to the column name of any column, you will want to click on the Sort -> Advanced option

.

__40. Then, you will click on the pull-down triangle to expose the list of columns under the “Add Columns to Sort” area. Click on the green + button to add the two columns you wish to sort on. Then, select the desired order for sorting each column. In this case, your “Advance Sort” should look like the following picture.

__41. Click on the green check-mark button to continue and create the new tab/sheet with your desired sorting applied to it.

__42. As with all new tabs/sheets, the system shows you a simulated result based on the rows of data BigSheets keeps in memory. You should be able to click on Fit column(s) to review the contents of both the Language and Type columns to see that your “advanced sort” was applied to this simulated set of data.

__43. Now, “Save and Exit” and then run your workbook. This will apply the sorting options to more than the first 2,000 rows the system operates on as a simulation. This will sort the entire, larger data in the workbook. So, you should see different results once your workbook has been run. For example, in the simulated data, only one Vietnamese row was showing. However, against the entire data set, you should see twenty (20) rows that are of the Vietnamese language. This is because more of the Vietnamese rows were in the data beyond the first 2,000 rows the system uses in memory for a simulated result before you click the run button. Review and confirm these results after the job reaches 100% and then you can move onto the next step.

29

__44. To visualize this data better, you want to analyze the quantities of posts in the various languages in your current workbook. While still in the “Watson Sorted” workbook.

__45. Click on the “Add chart” link in the lower left (this may take a minute to populate the chart types the first time it is run), click on chart, then pie, and fill out the following information to produce a pie chart of the languages used.

__46. Click on the green check-mark button to create the chart tab.

__47. Just like working with tabular data, you will see a simulated visualization. Again, this is based on the rows in cache. (If you click on the Close button here, you can interact with the chart which is based on simulated data. You would then click the “Run” button in the upper right.)

__48. Click the Run button to run the visualization against the entire data set.

__49. Once the chart has been run, you can interact with it to find out the second, most-popular language for posts regarding IBM Watson is Russian. Move your mouse over this item within your pie chart to see these results.

30

__50. Additionally, you have decided you need to clean up some of the data within the workbook. Based on the fifth and sixth largest languages in the pie chart you just generated, you can see they are both variations on the Chinese language.

__51. Let’s perform an action to clean up this data which will show all Chinese items in the same bucket. To do this, we will combine these two items into one item called Chinese

__52. From the “Watson Sorted” workbook, you will want to click on the edit button in order to go back into edit mode for this workbook.

__53. To make things easier, you can optionally click on the Fit column(s) button to make your columns thinner and to see more data on the screen.

31

__54. At this point, you have decided to add another column to your workbook. To do this, click on the triangle next to the Language column name. Select the Insert Right -> New Column option.

__55. Then, you will provide a name for your new column, like “Language_Revised” and then click the green check-mark button (or hit enter) to apply your new column name.

__56. Your cursor is then moved to the fx (or function) area where you can provide the function to be used to generate the contents of your new column.

__57. Enter the following formula as your function… IF(SEARCH('Chin*', #Language) > 0, 'Chinese', #Language)

This formula looks at the Language column indicated by #Language. If the #Language column starts with ‘Chin*’, then the new #Language_Revised column with contain ‘Chinese’. If it does not, the value of #Language is copied over to #Language_Revised. (See the original article, URL at the top of this document, for additional explanation of this formula.)

__58. Clicking on the green check-mark button (or hitting enter) should produce content in your new column based on your new formula.

32

__59. “Save and Exit” and you should notice that you receive a message to “Click run to update the data”

__60. Click the Run button in the upper right to run the workbook.

__61. Now, click on the Language Coverage tab that contains your previously generated pie chart. This now has the status of “needs to be run”. Before we run it, we need to change one of the settings on the pie chart to use our newly generated column named Language_Revised.

__62. To change the settings, click on the triangle next to the Language Coverage tab.

__63. Click to select the “Chart Settings” option.

__64. Change the “Value:” item to be based on the new, Language_Revised column.

33

__65. Click on the green check-mark button to apply your new settings.

__66. Click on the Run button to regenerate your pie chart.

__67. Once your new pie chart has been generated, you should be able to see Chinese as a cleaned up, single item in your pie chart (compared to the two items you saw previously). With this cleansed data, Chinese is now the second largest and Russian is third.

34

Exercise 3: Querying data with IBM Big SQL Learn how to use IBM Big SQL, an SQL language processor, to summarize, query, and analyze data in a data warehouse system for Hadoop. Big SQL provides broad SQL support that is typical of commercial databases. You can issue queries using JDBC or ODBC drivers to access data that is stored in InfoSphere BigInsights, in the same way that you access databases from your enterprise applications. You can use the Big SQL server to execute standard SQL queries. Multiple queries can be executed concurrently. Big SQL provides support for large ad hoc queries by using MapReduce parallelism and point queries, which are low-latency queries that return information quickly to reduce response time and provide improved access to data. The Big SQL server is multi-threaded, so scalability is limited only by the performance and number of CPUs on the computer that runs the server. If you want to issue larger queries, you can increase the hardware performance of the server computer that Big SQL runs on. This tutorial uses data from the fictional Sample Outdoor Company. The Sample Outdoor Company sells and distributes outdoor products to third-party retailer stores around the world. The company also sells directly to consumers through its online store. For the last several years, the company has steadily grown into a worldwide operation, selling their line of products to retailers in nearly every part of the world. The fictional Sample Outdoor Company began as a business-to-business operation. The company does not manufacture its own products; rather, the products are manufactured by a third party, then the products are sold to third-party retailers. The company is expanding its business by creating a presence on the web. Now they also sell directly to consumers through their online store. You will learn more about the products and sales of the fictional Sample Outdoor Company by running Big SQL queries and doing the analysis on the data in the following lessons. In this lab, you will use the InfoSphere BigInsights Tools for Eclipse to create Big SQL queries so that you can extract large subsets of data for analysis. After you complete the lessons in this module, you will understand the concepts and know how to do the following actions:

• Connect to the Big SQL server by using the InfoSphere BigInsights Tools for Eclipse. Use the InfoSphere BigInsights Tools for Eclipse to load sample data and to create queries.

• Use BigSheets to analyze data generated from Big SQL queries. • Use the InfoSphere BigInsights Console to run Big SQL queries. • Use the InfoSphere BigInsights Tools for Eclipse to export data.

This module should take approximately 45 minutes to complete.

35

Connecting to the IBM Big SQL server In this section, you view the JDBC connection, learn how to navigate to the Data Management view in Eclipse, and how to open the SQL editor. Big SQL is installed with the InfoSphere BigInsights in the following directory: $BIGSQL_HOME/bin/bigsql. Ensure that you complete the following actions before you begin the tutorial:

__1. Review the status of Big SQL. From the InfoSphere BigInsights Console, click the Cluster Status page. Then, verify that the Big SQL node is in a Running state on the Big SQL Server Summary page. If it is not started, click on Start in the Big SQL Server Summary page.

__2. Start Eclipse and select the default workspace.

If you created an InfoSphere BigInsights server connection to a particular BigInsights server, the Big SQL JDBC driver connection and the Big SQL JDBC connection profile were both created for you automatically. Ensure that a JDBC driver definition exists. Generally, either a JDBC or an ODBC driver definition must exist before you can create a connection to a Big SQL server. For the purposes of this tutorial, only a JDBC driver definition is required.

__3. From the Eclipse environment, click Window > Open Perspective > Other > Database Development. Ensure that the Data Source Explorer view is open.

__4. From the Data Source Explorer view, expand the Database Connections folder.

36

__5. Right-click the Big SQL connection, select Properties. Verify the connection details.

__6. Click Test Connection to make sure that you have a valid connection to the Big SQL server. If the ping succeeded, you may skip to the next section.

If you do not have a connection, follow these steps:

__a. From the Data Source Explorer view, right-click the Database Connections directory and click Add Repository.

__b. In the New Connection Profile window, select Big SQL JDBC in the Connection Profile Types list. Edit the text in the Name field or the Description field, and click Next.

__c. In the Specify a Driver and Connection Details window, select IBM Big SQL JDBC Driver Default from the Drivers list. You can add or modify a driver by clicking the New Driver Definition icon or the Edit Driver Definition icon.

__d. The Connection URL field displays the URL that is generated. For example: jdbc:bigsql://bivm:7052/default

__e. Specify the Connect when the wizard completes check box to connect to the Big SQL server after you finish defining this connection profile.

__f. Specify the Connect every time the workbench is started to connect to the Big SQL server automatically by using this connection profile when you launch this Eclipse workspace.

__g. Click Test Connection to verify that the connection. Click Finish to save.

37

Creating a project and an SQL script file You can create Big SQL scripts and run them from the InfoSphere BigInsights Eclipse environment.

__1. Create an InfoSphere BigInsights project in Eclipse. From the Eclipse menu bar, click File > New > Other. Expand the BigInsights folder, select BigInsights Project, and then click Next.

__2. Type myBigSQL in the Project name field, and then click Finish.

__3. If you are not already in the BigInsights perspective, a Switch to the BigInsights perspective window opens. Click Yes to switch to the BigInsights perspective.

__4. Create an SQL script file. From the Eclipse menu bar, click File > New > Other. Expand the BigInsights folder, and select SQL Script, and then click Next.

__5. In the New SQL File window, in the Enter or select the parent folder field, select myBigSQL. Your new SQL file is stored in this project folder.

__6. In the File name field, type aFirstFile. The .sql file extension is added automatically. Click Finish.

38

__7. In the Select Connection Profile window, select the Big SQL connection. The properties of the selected connection display in the Properties field. When you select the Big SQL connection, the Big SQL database-specific context assistant and syntax checks are activated in the editor that is used to edit your SQL file.

__8. Click Finish to close the Select Connection Profile window.

__9. In the SQL Editor that opens the aFirstFile.sql SQL file you created, add the following Big SQL comments: -- This is a beginning SQL script -- These are comments. Any line that begins with two -- dashes is a comment line, -- and is not part of the processed SQL statements.

Some lines in the file contain two dashes in front of text. Those dashes mark the text as comment lines. Comment lines are not part of the processed code. It is always useful to comment your Big SQL code as you write the code, so that the reasons for using some statements are clear to all who must use the file. You will be using this file in later lessons.

__10. Save aFirstFile.sql by using the keyboard short, CTRL-S.

39

Creating tables and loading data In this lesson, you will create the tables, and load the data that you will use to run queries and create reports about the fictional Sample Outdoor Company. The examples in this tutorial use Big SQL tables and data that are managed by Hive. This tutorial uses sample data that is provided in the $BIGSQL_HOME/samples folder on the local file system of the InfoSphere BigInsights server. By default, the $BIGSQL_HOME environment variable is set to the installed location, which is /opt/ibm/biginsights/bigsql/. The time range of the fictional Sample Outdoor Company data is 3 years and 7 months, starting January 1, 2004 and ending July 31, 2007. The 43-month period reflects the history that is made available for analysis. The schema that is used in this tutorial is the GOSALESDW. It contains fact tables for the following areas:

• Distribution • Finance • Geography • Marketing • Organization • Personnel • Products • Retailers • Sales • Time.

Access the sample SQL files

__1. Open a Terminal. It is located in the desktop folder BigInsights Shell.

__2. Change your directory to $BIGSQL_HOME/samples. Open and review the information in the README file.

40

__3. Open a new Terminal and type the setup command: ./setup.sh -u biadmin -p biadmin

__4. The script runs when you see this message: Loading the data .. will take a few minutes to complete .. Please check /var/ibm/biginsights/bigsql/temp/GOSALESDW_LOAD.OUT file for progress information.

The script runs three files: GOSALESDW_drop.sql, GOSALESDW_ddl.sql, and GOSALESDW_load.sql.

__5. You can see the tables in the InfoSphere BigInsights Console Files tab: hdfs://biginsights/hive/warehouse/gosalesdw.db

41

Running basic Big SQL queries You now have tables with data that represents the fictional Sample Outdoor Company. In this lesson, you will explore some of the basic Big SQL queries, and begin to understand the sample data. You will use the SQL file that you created earlier to test the tables and data that you created in the previous lesson. In this lesson, several sample SQL files that are included with the downloaded Big SQL samples can be used to examine the data, and to learn how to manipulate Big SQL.

__1. From the Eclipse Project Explorer, open the myBigSQL project, and double-click the aFirstFile.sql file.

__2. In the SQL editor pane, type the following statement: SELECT * FROM GOSALESDW.GO_REGION_DIM;

Each complete SQL statement must end with a semi-colon. The statement selects, or fetches, all the rows that exist in the GO_REGION_DIM table.

__3. Click the Run SQL icon.

Depending on how much data is in the table, a SELECT * statement might take some time to complete. Your result should contain 21 records or rows.

42

You might have a script that contains several queries. When you want to run the entire script, click the Run SQL icon. When you want to run a specific statement, or set of statements, and you include the schema name with the table name (gosalesdw.go_region_dim), highlight the statement or statements that you want to run and press F5.

__4. Improve the SELECT * statement by adding a predicate to the second statement to return fewer rows. A predicate is a condition on a query that reduces and narrows the focus of the result. A predicate on a query with a multi-way join can improve the performance of the query. SELECT * FROM GOSALESDW.GO_REGION_DIM WHERE REGION_EN LIKE 'Amer%';

__5. Click Run SQL to run the entire script. This query results in four records or rows. It might run more efficiently than the original statement because of the predicate.

Tip: If you get an error, type out the full query instead of using copy/paste.

__6. You can learn something more about the structure of the table with some queries to the syscat table. The Big SQL catalog tables provide metadata support to the database. The Big SQL catalog consists of four tables in the schema SYSCAT: schemas, tables, columns, and index columns. Type the following query, and this time, select the statement, and press F5. SELECT * FROM syscat.columns WHERE tablename= 'go_region_dim' AND schemaname= 'gosalesdw';

This query uses two predicates in a WHERE clause. The query finds all of the information from the syscat.columns table as long as the tablename is 'go_region_dim' and the schemaname is 'gosalesdw'. Since you are using AND, both predicates must be true to return a row. Use single quotation marks around string values. The result of the query to the syscat.columns table is the metadata, or the structure of the table. Look at the SQL Results tab in Eclipse to see your output. The SQL Results tab in Eclipse shows 54 rows as your output. That means that there are 54 columns in the table go_region_dim.

43

__7. In addition to the query to learn about the structure of a table, you can also run a query that returns the number of rows in a table. Type the following query, select the statement, and then press F5. SELECT COUNT(*) FROM gosalesdw.go_region_dim;

The COUNT aggregate function returns the number of rows in the table, or the number of rows that satisfy the WHERE clause in the SELECT statement, if a WHERE clause was part of the statement. The result is the number of rows in the set. A row that includes only null values is included in the count. In this example, there are 21 rows in the go_region_dim table.

__8. Another form of the count function is the COUNT (distinct <expression>) statement. As the name implies, you can determine the number of unique values in a column, such as region_en. Type this statement in your SQL file:

SELECT COUNT (distinct region_en) FROM gosalesdw.go_region_dim;

The result is 5. This result means that there are five unique region names in English (the column name region_en).

__9. Another useful statement in Big SQL is the LIMIT clause. The LIMIT clause specifies a limit on the number of output rows that are produced by the SELECT statement for a given document. Type this statement in your SQL file, select the statement, and then press F5: SELECT * FROM GOSALESDW.DIST_INVENTORY_FACT limit 50;

The statement returns 53837 rows without the limit clause and depending on the setting in the Eclipse preferences for the SQL Results View. If there are fewer rows than the LIMIT value, then all of the rows are returned. This statement is useful to see some output quickly.

__10. Save your SQL file.

44

Optional Exercise: Using Big SQL from a JDBC client application You can use Big SQL through a JDBC application to open a database connection, run a Big SQL query, and then display the results of the query.

__1. In the IBM InfoSphere BigInsights Eclipse environment, create a Java project by clicking File > New >Project. From the New Project window, select Java Project. Click Next.

__2. Type a name for the project in the Project Name field, such as MyJavaProject. Click Next.

__3. Open the Libraries tab and click Add External Jars. Select the Big SQL JDBC driver from your local path /opt/ibm/biginsights/bigsql/jdbc bigsql-jdbc-driver.jar.

45

__4. Click Finish. Click No when you are asked if you want to open a different perspective.

__5. Right-click the MyJavaProject project, and click File > New > Package. In the Name field, in the New Java Package window, type a name for the package, such as aJavaPackage4me. Click Finish.

__6. Right-click the aJavaPackage4me package, and click File > New > Class.

__7. In the New Java Class window, in the Name field, type SampApp. Select the public static void main(String[] args) check box. Click Finish.

46

__8. Copy or type the following code into the SampApp.java file: package aJavaPackage4me; //a. Import required package(s) import java.sql.*; public class SampApp { /** * @param args */ //b. set JDBC & database info static final String db = "jdbc:bigsql://bivm:7052/default"; static final String user = "biadmin"; static final String pwd = "biadmin"; public static void main(String[] args) { Connection conn = null; Statement stmt = null; System.out.println("Started sample JDBC application."); try{ //c. Register JDBC driver Class.forName("com.ibm.biginsights.bigsql.jdbc.BigSQLDriver"); //d. Get a connection conn = DriverManager.getConnection(db, user, pwd); System.out.println("Connected to the database."); //e. Execute a query stmt = conn.createStatement(); System.out.println("Created a statement."); String sql; sql = "select * from gosalesdw.sls_product_dim " + "where product_key=30001"; ResultSet rs = stmt.executeQuery(sql); System.out.println("Executed a query."); //f. Obtain results System.out.println("Result set: "); while(rs.next()){ //Retrieve by column name int product_key = rs.getInt("product_key"); int product_number = rs.getInt("product_number"); //Display values System.out.print("* Product Key: " + product_key + "\n"); System.out.print("* Product Number: " + product_number + "\n"); }

47

//g. Close open resources rs.close(); stmt.close(); conn.close(); }catch(SQLException sqlE){ // Process SQL errors sqlE.printStackTrace(); }catch(Exception e){ // Process other errors e.printStackTrace(); }finally{ // Ensure resources are closed before exiting try{ if(stmt!=null) stmt.close(); }catch(SQLException sqle2){ } // nothing we can do try{ if(conn!=null) conn.close(); }catch(SQLException sqlE){ sqlE.printStackTrace(); }// end finally block }// end try block System.out.println("Application complete"); }}

__a. After the package declaration, ensure that you include the packages that contain the JDBC classes that are needed for database programming.

__b. Set up the database information so that you can refer to it.

__c. Then, register the JDBC driver so that you can open a communications channel with the database.

__d. Open the connection.

__e. Run a query by submitting an SQL statement to the database.

__f. Extract data from result set.

__g. Clean up the environment by closing all of the database resources.

__9. Save the file and right-click the Java file and click Run as > Java Application.

__10. The results show in the Console view of Eclipse: Started sample JDBC application. Connected to the database. Created a statement. Executed a query. Result set: * Product Key: 30001 * Product Number: 1110

48

Analyzing the data with Big SQL In this lesson, you create and run Big SQL queries so that you can better understand the products and market trends of the fictional Sample Outdoor Company.

__1. Create an SQL file. From the Eclipse menu, click File > New > Other. In the Select a wizard window, expand the BigInsights folder, select SQL Script from the list of wizards, and then click Next. In the New SQL File window, select the myBigSQL project folder in the Enter or select the parent folder field. In the File name field, type companyInfo. The .sql file extension is added automatically. Click Finish.

To learn what products were ordered from the fictional Sample Outdoor Company, and by what method they were ordered, you must join information from multiple tables in the gosalesdw database because it is a relational database where not everything is in one table.

__2. Type the following comments and statement into the companyInfo.sql file: -- Fetch the product name and the quantity and -- the order method. -- Product name has a key that is part of other -- tables that we can use as a join predicate. -- The order method has a key that we can use -- as another join predicate. -- Query 1 SELECT pnumb.product_name, sales.quantity, meth.order_method_en FROM gosalesdw.sls_sales_fact sales, gosalesdw.sls_product_dim prod, gosalesdw.sls_product_lookup pnumb, gosalesdw.sls_order_method_dim meth WHERE pnumb.product_language='EN' AND sales.product_key=prod.product_key AND prod.product_number=pnumb.product_number AND meth.order_method_key=sales.order_method_key;

49

Because there is more than one table reference in the FROM clause, the query can join rows from those tables. A join predicate specifies a relationship between at least one column from each table to be joined.

• The predicates such as prod.product_number=pnumb.product_number help to narrow the results to product numbers that match in two tables.

• This query also uses an alias in the SELECT and FROM clauses, such as pnumb.product_name. pnumb is the alias for the gosalesdw.sls_product_lookup table. That alias can now be used in the where clause so that you do not need to repeat the complete table name, and the WHERE clause is not ambiguous.

• The use of the predicate and pnumb.product_language=’EN’ helps to further narrow the result to only English output. This database contains thousands of rows of data in various languages, so restricting the language provides some optimization.

__3. Highlight the statement, beginning with the keyword SELECT and ending with the semi-colon, and press F5. Review the results in the SQL Results page. You can now begin to see what products are sold, and how they are ordered by customers.

50

__4. By default, the Eclipse SQL Results page limits the output to only 500 rows. You can change that value in the Data Management preferences. However, to find out how many rows the query actually returns in a full Big SQL environment, type the following query into the companyInfo.sql file, then select the query, and then press F5:

--Query 2 SELECT COUNT(*) --(SELECT pnumb.product_name, sales.quantity, -- meth.order_method_en FROM gosalesdw.sls_sales_fact sales, gosalesdw.sls_product_dim prod, gosalesdw.sls_product_lookup pnumb, gosalesdw.sls_order_method_dim meth WHERE pnumb.product_language='EN' AND sales.product_key=prod.product_key AND prod.product_number=pnumb.product_number AND meth.order_method_key=sales.order_method_key;

The result for the query is 446,023 rows.

__5. Update the query that is labeled --Query 1 to restrict the order method to equal only 'Sales visit'. Add the following string just before the semi-colon: AND order_method_en='Sales visit'

51

__6. Select the entire --Query 1 statement, and press F5. The result of the query displays in the SQL Results page in Eclipse.

The results now show the product and the quantity that is ordered by customers actually visiting a retail shop.

__7. To find out which method of all the methods has the greatest quantity of orders, you must add a GROUP BY clause (group by pll.product_line_en, md.order_method_en). You will also use a SUM aggregate function (sum(sf.quantity)) to total the orders by product and method. In addition, you can clean up the output a bit by using an alias (as Product) to substitute a more readable column header. -- Query 3 SELECT pll.product_line_en AS Product, md.order_method_en AS Order_method, sum(sf.QUANTITY) AS total FROM gosalesdw.sls_order_method_dim AS md, gosalesdw.sls_product_dim AS pd, gosalesdw.sls_product_line_lookup AS pll, gosalesdw.sls_product_brand_lookup AS pbl, gosalesdw.sls_sales_fact AS sf WHERE pd.product_key = sf.product_key AND md.order_method_key = sf.order_method_key AND pll.product_line_code = pd.product_line_code AND pbl.product_brand_code = pd.product_brand_code GROUP BY pll.product_line_en, md.order_method_en;

__8. Select the complete statement, and press F5.

Your results in the SQL Results page show 35 rows.

52

Exercise 4: Developing, publishing, and deploying your first big data app In this lesson, you will learn how to publish an application that can run a simple Big SQL script. After you create scripts that contain Big SQL statements or commands, you can create applications that can run those Big SQL scripts with or without variables. For the purposes of this lesson, you are going to create a new project that contains one simple Big SQL script.

Developing your application

__1. Create a project to hold the script and the application information. In the Eclipse Project Explorer, click File > New > BigInsights Project. In the New BigInsights Project window, type MyTestApp in the Project name field. Click Finish.

__2. Verify that the sample data (IOD_sampleData/RDBMS_data.csv) is located on your desktop folder called biadmin’s Home

Create a Big SQL script within the MyTestApp project that creates a table if it does not already exist, and then loads data into the table from a local file.

__3. Right-click the MyTestApp project and select New > SQL Script. In the File name field, type LoadLocal.sql, select BigSQL connection. Click Finish.

__4. Type or copy the following code into the LoadLocal.sql file: create table if not exists media_csv (id integer not null, name varchar(50), url varchar(50), contactdate string) row format delimited fields terminated by ',' stored as textfile; load hive data local inpath '/home/biadmin/IOD_sampleData/RDBMS_data.csv' -- overwrite into table media_csv;

Be sure to modify the inpath parameter value to match the path where you extracted the files.

__5. Save the LoadLocal.sql file.

53

Publishing your application

__1. In the Eclipse Project Explorer, right-click the MyTestApp project, and select BigInsights Application Publish.

From the BigInsights Application Publish window, complete the wizard with the following information:

__2. Location page - Verify the server name and click Next. If you do not have a valid connection to a Big SQL server, you can create one by clicking Create.

54

__3. Application page - Specify to Create New Application.

__a. Type an application name in the Name field, such as myBigSQLApp.

__b. You can type some text in the Description field. This text is optional, but can help others to understand how to run your application when it is deployed on the server.

__c. Optionally, select an icon that contains a visual identifier for this application file from the local file path by clicking Browse. You can use a default icon that is provided by the server.

__d. List one or more categories, such as load, that you want the application to be listed under. This category is a helpful tag when you search for new applications to deploy.

__e. Click Next.

__4. Type page - Select Workflow as the Application Type, and click Next.

__5. Workflow page

__a. Specify to Create a new single action workflow.xml file

__b. Select Big SQL from the Action Type drop down list.

55

__c. In the Properties pane, list the properties that you need.

__i. Specify the name of the script that contains your Big SQL statements so that you can run that script from the InfoSphere BigInsights server. Select script from the list. Enter “LoadLocal.sql” in the Value field. The Variable field is false.

Specify the credentials properties file for connecting to Big SQL.

__ii. Click New.

__iii. In the New Property window, in the Name field, select the property credentials_prop from the list.

__iv. In the Value field, specify the name of your properties file in the BigInsights credentials store. Type /user/biadmin/ credstore/private/bigsql.properties

__d. Parameters page - If you have parameters that you want to add, click New. In this case, click Next. Note: The simple script that you are using in this lesson does not need parameters.

__6. Publish page - You see the structure of your application in the preview of the BIApp.zip file. You can click Add to add other scripts or supporting information. Click Finish.

56

Deploy and run your application The application that you just published from Eclipse is now in the InfoSphere BigInsights Console.

__1. Click the Application tab of the InfoSphere BigInsights Console, and click Manage to see the new application.

__2. Find the myBigSQLApp application by the application title.

__3. Highlight the application and click Deploy.

__4. In the Deploy Application window, click Deploy.

__5. Click Run in the Applications tab. You see the myBigSQLApp in the list of deployed applications. Double-click the application.

57

__6. In the application, type some name in the Execution Name field, such as Test1 to identify a particular run time instance. Since everything in this application was hardcoded, you can now click Run.

__7. If the application runs successfully, you see a green checked box. If the application failed, you see a red box. You can examine the log by selecting the arrow in the Details column of the Application History pane.

Examine the output of your application:

__8. Click on the Files tab of the InfoSphere BigInsights Console. Navigate to the directory of the Hive warehouse (such as /biginsights/hive/warehouse).

__9. You can see that a new folder that represents the media_csv table is created. Expand the folder to see the RDBMS_data.csv file.

__10. Click the file to display its contents.

__11. You can also issue a select statement from your Eclipse project, or from the Big SQL Console of InfoSphere BigInsights to see the contents in Big SQL: Select * from media_csv;

You have created an application that creates a table with columns, if that table does not already exist. The application then loads data into the table from a local file. If you run the application again, it loads a duplicate set of the data. You can fill the file with new data, and run the application again, and it loads that new data. You can quickly see that this kind of exercise can be useful if you need to see refreshed data regularly. By adding scheduling information onto the application, you can have the application run daily or weekly to add or refresh the data. By including variables into the application, you can run the application with data from your dynamic input. You can uncomment the overwrite clause to add new data for each application run.

58

Exercise 5: Analyzing BigSheets data in IBM Big SQL Learn how to use IBM® Big SQL, an SQL language processor, to summarize, query, and analyze data in a data warehouse system for Hadoop. Big SQL provides broad SQL support that is typical of commercial databases. You can issue queries using JDBC or ODBC drivers to access data that is stored in InfoSphere® BigInsights™, in the same way that you access databases from your enterprise applications. You can use the Big SQL server to execute standard SQL queries. Multiple queries can be executed concurrently. Big SQL provides support for large ad hoc queries by using MapReduce parallelism and point queries, which are low-latency queries that return information quickly to reduce response time and provide improved access to data. The Big SQL server is multi-threaded, so scalability is limited only by the performance and number of CPUs on the computer that runs the server. If you want to issue larger queries, you can increase the hardware performance of the server computer that Big SQL runs on. Learn more about Big SQL: http://pic.dhe.ibm.com/infocenter/bigins/v2r1/topic/com.ibm.swg.im.infosphere.biginsights.tut.doc/doc/tut_Mod_BigSQL1.html In this lesson, you can use workbooks that you create in BigSheets as input to Big SQL tables. This lesson shows you how to export to a tab-separated value (TSV) format and a JSON array format and then use those outputs as input to a Big SQL table. This lesson assumes that you downloaded the data that was referred to in Lab 2: Social Media and BigSheets. You will need access to the blogs-data.txt file. If you gathered social media data and created workbooks from that data, you might find that further analysis in a table format is warranted. For example, assume that you examine the blogs data in the blogs-data.txt file. You can create a workbook and use that data for two new Big SQL tables. Note: You do not need to complete the BigSheets tutorial to do this lesson, but some knowledge of BigSheets is recommended.

http://pic.dhe.ibm.com/infocenter/bigins/v2r1/topic/com.ibm.swg.im.infosphere.biginsights.tut.doc/doc/tut_Mod_BigSQL1.html�

http://pic.dhe.ibm.com/infocenter/bigins/v2r1/topic/com.ibm.swg.im.infosphere.biginsights.tut.doc/doc/tut_Mod_BigSQL1.html�

59

Create and modify a BigSheets workbook

__1. From the InfoSphere BigInsights Console, click the BigSheets tab.

__2. Click New Workbook.

__3. In the Name field, type WatsonBlogData.

__4. In the File field, expand the Hadoop Distributed File System directory, browse to and select the blogs-data-txt file. It maybe located under /user/biadmin/SampleData or /bigsheets/.

60

__5. In the Preview area of the screen, select a new reader to map the data to the spreadsheet format. Click the edit icon that looks like a pencil.

__6. The data in the blogs-data.txt is formatted in a JSON Array structure. Select the JSON Array reader from the list, and click the check mark inside the Select a reader box to apply the reader.

__7. Since the data columns exceed the viewing space, click Fit column(s). The first eight columns display in the Preview area. Click the check mark to save the workbook.

__8. Click Build new workbook. Rename the workbook by clicking the edit icon, entering the new name of WatsonBlogDataRevised, and clicking the green check mark.

61

__9. To more easily see the columns, click Fit column(s). Now columns A through H fit within the width of the sheet.

There are several columns that you do not need to use in your Big SQL table. Remove multiple columns by following these steps:

__10. Click the down arrow in any column heading and select Organize columns.

__11. Click the X next to the following columns to mark them for removal:

__a. Crawled

__b. Inserted

__c. IsAdult

__d. PostSize

__12. Click the green check mark to remove the marked columns.

__13. Click Save and Exit, and then Run the workbook. . In the Save workbook dialog, click Save. Click Exit to start the run process. Click Run to run the workbook.

62

Export your Workbook

__1. In the menu bar of the WatsonBlogDataRevised workbook, click Export as.

__2. In the drop-down window, select the TSV type in the Format Type field. Big SQL can also use the output from the other selections, but extra steps might be needed. For example, if you select CSV, the quotation marks might be retained in the table. To use just the data without quotations, you can perform a SELECT or CREATE TABLE AS with the TRIM(both '"' from <column name>) function. You can see more information about this technique in a later step.

__3. Specify Export to File.

__4. Click Browse to select a destination directory in the HDFS file system. Select your path (ex. '/user/biadmin/sampleData/IBMWatson'), and then type the name for the new file, such as WatsonBlogs. Click OK.

63

__5. Make sure that the Include Headers check box is cleared. Click OK.

__6. A message dialog shows that the workbook is successfully exported. Click OK to close that dialog.

Make a note of the column names and the type of data from the BigSheets workbook that you want to define in Big SQL. You exported these columns from BigSheets:

• Country - contains a two-letter country identifier.

• FeedInfo - contains information from web feeds, with varying lengths.

• Language - contains the string that identifies the language of the feed.

• Published - contains a date and timestamp.

• SubjectHtml - contains a subject that is of varying length.

• Tags - contains a string of varying length that provides categories.

• Type - contains the source of the web feed, whether a news blog or a public feed.

• URL - contains the web address of the feed, with varying length.

64

Creating an Eclipse project and new SQL script In the InfoSphere BigInsights Eclipse environment, create a project and a new SQL script. Create an InfoSphere BigInsights project in Eclipse.

__1. Start Eclipse. Depending on your VM, the password will be password or passw0rd.

__2. From the Eclipse menu bar, click File > New > Other.

__3. In the Select a wizard window, expand the BigInsights folder, select BigInsights Project, and then click Next.

65

__4. Type MyBigSheetsAnalysis in the Project name field, and then click Finish.

__5. If you are not already in the BigInsights perspective, a Switch to the BigInsights perspective window opens. Click Yes to switch to the BigInsights perspective.

__6. Create a SQL script file. Right-click MyBigSheetsAnalysis, Select New > SQL Script.

66

__7. In the New SQL File window, in the Enter or select the parent folder field, select MyBigSheetsAnalysis. Your new SQL file is stored in this project folder.

__8. In the File name field, type NewsBlogs. The .sql file extension is added automatically. Click Finish.

__9. In the Select Connection Profile window, select the Big SQL connection. The properties of the selected connection display in the Properties field. When you select the Big SQL connection, the Big SQL database-specific context assistant and syntax checks are activated in the editor that is used to edit your SQL file. If no connection exists, proceed to the next step.

67

__10. If no connection is available for BigSQL, you need to add a new connection by clicking on New… and select "IBM Big SQL JDBC Driver"

Host: bivm User = biadmin Password = biadmin

__11. Click Finish to close the Select Connection Profile window.

__12. In the SQL Editor that opens the NewsBlogs.sql SQL file you created, add the following Big SQL comments: -- This is a beginning SQL script -- These are comments. Any line that begins with two -- dashes is a comment line, -- and is not part of the processed SQL statements.

Some lines in the file contain two dashes in front of text. Those dashes mark the text as comment lines. Comment lines are not part of the processed code. It is always useful to comment your Big SQL code as you write the code, so that the reasons for using some statements are clear to all who must use the file.

68

__13. Save NewsBlogs.sql by using the keyboard short, CTRL-S.

__14. In the NewsBlogs.sql file, copy or type the following code:

Notice that this script has few SQL statements (e.g. Create Database, Create Table, Load Data into Table, Select * from Table. create database if not exists BigSheetsAnalysis; use BigSheetsAnalysis; create table BigSheetsAnalysis.sheetsOut (country char(2),FeedInfo varchar(300), countryLang char(25),published char(25), subject varchar(300),tags varchar(100), type char(20),url varchar(100)) row format delimited fields terminated by '\t'; load hive data inpath '/<HDFS path>/WatsonBlogs.tsv' overwrite into table BigSheetsAnalysis.sheetsOut; select * from BigSheetsAnalysis.sheetsOut;

An example HDFS path may be </user/biadmin/sampleData/IBMWatson/WatsonBlogs.tsv>

__15. Save the content by pressing CTRL+S

__16. Run the SQL Script by doing the following:

__a. Select the all the content of the script ( by pressing CTRL-A) except "select * from BigSheetsAnalysis.sheetsOut;"

__b. Click on RUN SQL (green arrow at right corner of SQL panel). This will create database, create table and load data into the data.

69

__17. Now execute the command to retrieve the data from BigSheetsAnalysis.SheetsOut Table by doing the following:

__a. Select the Select Statement select * from BigSheetsAnalysis.sheetsOut; and click on RUN SQL arrow to execute the SQL Select Statement

__b. Click on "Result1" to see the result of query execution

Exporting a JSON Array to Big SQL You can also export data from a BigSheets workbook as a JSON Array and make it available to a Big SQL table. By using a web search engine, you can search for a JSON Serde and download a SerDe .jar file. SerDe applications (JAR files) can be downloaded from any open source host. For this lab, the Serde files have already been downloaded and placed in their appropriate directory. In this step, you use a SerDe application (Serializer/Deserializer) to process JSON data, which can be used to transform a JSON record into something that Hive and subsequently Big SQL can process. By using the SerDe interface, you instruct Hive as to how a record should be processed. You can write your own SerDe for processing JSON data, or you can use a package that is available from the web. If the SerDe file is not already downloaded to your VM, start at Step 1. If it is, you can proceed to Step 2.

__1. Add the SerDe .jar file to the $BIGSQL_HOME/userlib and $HIVE_HOME/lib directories.

__a. Stop the Big SQL server.

70

__b. Copy the SerDe .jar file to $BIGSQL_HOME/userlib and $HIVE_HOME/lib

__c. Restart the Big SQL server.

__2. From the same workbook (WatsonBlogDataRevised) that you used in Section 6.1, click Export as and in the drop-down window, select the JSON Array type in the Format Type field. Name the file WatsonBlogsData. Use the path /user/biadmin/sampleData/IBMWatson/ to store your file.

__3. Create another SQL Script (e.g. NewsBlogSerde.SQL) and use the BIG SQL JDBC Connection.

__4. Add the text bellow to the NewsBlogSerde.SQL file.

-------------------- SQL Script with Serde ----------------

create table BigSheetsAnalysis.watson_json (

Country String,

FeedInfo String,

CountryLanguage String,

Published String,

SubjectHtml String,

Tags String,

Type String,

Url String)

row format serde 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'

stored as textfile;

load hive data inpath '/user/biadmin/sampleData/IBMWatson/WatsonBlogsData.json'

overwrite into table BigSheetsAnalysis.watson_json;

select * from BigSheetsAnalysis.watson_json;

------------------------------------end of the script ---------------------------------

__5. Save the content of the script by pressing CTRL+S

__6. Create Table and Load Data

71

__a. Select the entire content of the script (pressing CTRL-A) except the SELECT Statement

__b. Click on RUN SQL (green arrow on right corner of SQL panel)

__7. Retrieve the data from the Table:

__a. Select the SELECT Statement " select * from BigSheetsAnalysis.watson_json; "

__b. b) Click on RUN SQL (green arrow on right upper corner of sql script)

__c. Click on "Result1" to see the result of query execution

72

Exercise 6: Text analysis of social media data This section of the lab introduces you to text analysis capabilities in BigInsights. In earlier sections of this lab, you explored meta data related to social media sites that included posts about IBM Watson. This enabled you to gain some insight into various characteristics of coverage of this brand, including some indication of its global visibility. In particular, you inspected values in the Language field to explore global coverage of IBM Watson. As you probably realize, you could continue to explore social media meta data regarding publication dates, URL host sites, country codes, and other information to learn even more about IBM Watson coverage. But what if you want to look beyond the meta data? What if you want to explore the content of the social media posts themselves? This lab module introduces you to text analysis capabilities in BigInsights that enable you to assess the sentiments expressed in these posts. You will deploy a custom-built text extractor that evaluates each news or blog post to determine if the sentiment expressed about IBM Watson is positive, negative, or neutral. Then, you'll invoke this extractor as a function in BigSheets and chart the results. We won't have time to cover the details about how this extractor was developed or explore its source code. However, if you want to learn how to develop your own custom extractors and learn the details behind the sample you'll be using here, attend IOD lab session IBD-2470: Text Analytics for Big Data.

Upload and deploy the sample text analytics application To analyze the content of the social media postings about IBM Watson, you need to upload the sample text analysis application provided with this lab into your BigInsights web application catalog. Then you need to deploy the application. This section takes you through those steps.

__1. If necessary, launch the Web console.

__2. On the Applications page, click the Manage button.

__3. Click the Upload button (just to the right of the search box) to upload the text analytics application supplied with this lab.

73

__4. When prompted, use the Browse button to navigate to the directory in your local file system containing the WatsonSentimentApp.zip file. This should be within the /home/biadmin subdirectory.

__5. Click OK to upload the .zip file and verify that the application appears in the Web catalog. You can either scroll through the alphabetical list of applications at right or expand the My Applications folder at left to locate the WatsonSentiment application that you just uploaded.

__6. Click on the WatsonSentiment application in the right pane. A Deploy button will appear near the top of the pane. Click it to deploy the application on your cluster.

74

__7. A new window will appear, prompting you to enter additional configuration information related to your application, if needed. In this case, just click the Deploy button at bottom.

__8. Return to the application catalog and locate the WatsonSentiment application again. Verify that its status is shown as DEPLOYED.

75

Upload sample data and create a master workbook Now that the text analysis application is ready for use, it's time to upload some data to analyze through BigSheets. The data that we've provided for this lab in the WatsonNewsBlogsData.json file is based on social media data collected by the sample Boardreader application that ships with BigInsights. We combined news and blog entries written in English from IBM-sponsored sites into a single file that you'll use for analytical purposes.

__1. From the Files page of the Web console, navigate to the /home/biadmin directory of HDFS and click the create directory button to create a new folder for the data you will upload.

__2. When prompted, enter a name of TextLab for the folder and click OK.

About the Boardreader App Boardreader is an IBM business partner that offers a social media content aggregation and provisioning service based on a multilingual data dating back to 2001. The service searches message boards / forums, social networks, blogs/comments, microblogs, reviews, videos/comments and online news. Its contents is updated continuously. The Boardreader application uses a search based API to retrieve data collected by this business partner from numerous Web sites Customers who want to use the Boardreader service should contact them directly to obtain a license key.

76

__3. Highlight the new TextLab folder in your HDFS directory tree and click the upload button to upload the sample data.

__4. A new window will appear, prompting you to specify the location of the file you want to upload. Use the Browse button to navigate to the directory containing the WatsonNewsBlogsData.json file provided with this lab. This should be in the /home/biadmin directory of your local file system. After locating the file, click OK to upload it.

__5. Click the refresh button and expand the TextLab folder to verify that the file was successfully uploaded.

77

__6. Click on the WatsonNewsBlogsData.json file. In the pane at right, change the display type from Text to Sheets.

__7. Since the data is in a JSON array format, you need to change the BigSheets reader (data format translator) from Line Reader to JSON Array. To do so, click on the pencil icon just to the right of the Line Reader heading. Select JSON Array from the drop down menu and click the green check mark.

__8. Click the Save as Master Workbook button. When prompted, specify a name of AllWatsonNewsBlogs and optionally provide a description. Click Save to save the master workbook. When the operation completes, your new workbook will open in BigSheets.

__9. Verify that your workbook appears similar to the image below.

78

Analyzing text content of social media posts With your custom text application deployed and your social media data loaded into a BigSheets workbook, you're ready to explore the sentiments expressed about IBM Watson. The WatsonSentiment application includes a small number of text extractors that determine if a given post is positive, negative, or neutral. As mentioned earlier, the specifics behind such analysis is beyond the scope of this exercise. Plan to attend IOD Lab 2470 if you'd like to explore the source code as well as the development and testing process. To wrap up this exercise, you'll invoke this custom-built sentiment analysis application as a function in BigSheets and create a pie chart that summarizes the results.

__1. With the AllWatsonNewsBlogs workbook open, click the Build New Workbook button.

__2. A new workbook opens in edit mode. Change the name to WatsonSentimentAnalysis and click the green check mark.

__3. Click Add Sheets -> Function.

79

__4. When prompted, view functions by Categories.

__5. Select the textanalytics category. Several functions will appear; these are part of the custom-created text extractor application provided with this lab.

__6. Skim through the list and select the Sentiment_BasicFeatures.FinalDisposition function.

80

__7. A new menu appears. For the Sheet name, enter Sentiment. Then fill in the three required input parameters. For ThreadId, select the ThreadId column from the drop-down menu. For TextHtml, select the TextHtml column. For Type, select the Type column. Then click the Carry Over(0) tab at bottom.

__8. Click the Add All button to carry over all columns in the existing workbook into the resulting output of the text analytics function. Then click the green check mark at bottom.

81

__9. Wait until BigSheets displays a model of the workbook it will create based on your specifications. Verify that it appears similar to the image below.

__10. Save and exit from the workbook.

__11. When prompted, Run the workbook and wait until the operation completes. Depending on your machine resources, this may take several minutes.

__12. Verify that your workbook appears similar to the image below. Note that the new "disposition" column assesses the sentiment expressed in the post associated with each record in your workbook.

82

__13. To chart the results, click the Add chart button at bottom.

__14. Select Chart -> Pie.

__15. When prompted, specify the properties of your pie chart as shown below. In particular, provide a meaningful Title, select the disposition column for the Value field, and set the Limit field to 3. Click the green check mark.

__16. When prompted, Run the chart and wait until the operation completes.

83

__17. Verify that your chart appears similar to the image below.

Congratulations! You've just analyzed the text of numerous social media posts about IBM Watson to assess the sentiments expressed. And you did that without writing a single line of code.

Summary In this lab, you gained hands-on experience using many popular capabilities of InfoSphere BigInsights, IBM's Hadoop-based platform for analyzing big data. You explored your BigInsights cluster using a Web-based console and manipulated social media data using a spreadsheet-style interface. You also queried structured data using Big SQL and even created a sample Big SQL application that you published in the BigInsights web application catalog. Finally, you analyzed unstructured text data (blog and news postings) to assess the sentiment expressed about a popular brand (IBM Watson). You're now ready to get started using BigInsights for your own projects. What will you do with big data?

84

Communities • On-line communities, User Groups, Technical Forums, Blogs, Social

networks, and more o Find the community that interests you …

• Information Management bit.ly/InfoMgmtCommunity • Business Analytics bit.ly/AnalyticsCommunity • Enterprise Content Management bit.ly/ECMCommunity

• IBM Champions o Recognizing individuals who have made the most outstanding

contributions to Information Management, Business Analytics, and Enterprise Content Management communities

• ibm.com/champion Thank You! Your Feedback is Important! • Access the Conference Agenda Builder to complete your session

surveys o Any web or mobile browser at

http://iod13surveys.com/surveys.html o Any Agenda Builder kiosk onsite

http://bit.ly/InfoMgmtCommunity�

http://bit.ly/InfoMgmtCommunity�

http://bit.ly/AnalyticsCommunity�

http://bit.ly/AnalyticsCommunity�

http://bit.ly/ECMCommunity�

http://bit.ly/ECMCommunity�

http://www.ibm.com/software/data/champion�

http://www.ibm.com/software/data/champion�

http://iod13surveys.com/surveys.html�

85

Acknowledgements and Disclaimers:

Availability: References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates. The workshops, sessions and materials have been prepared by IBM or the session speakers and reflect their own views. They are provided for informational purposes only, and are neither intended to, nor shall have the effect of being, legal or other guidance or advice to any participant. While efforts were made to verify the completeness and accuracy of the information contained in this presentation, it is provided AS-IS without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this presentation or any other materials. Nothing contained in this presentation is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software. All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer. Nothing contained in these materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken by you will result in any specific sales, revenue growth or other results. © Copyright IBM Corporation 2013. All rights reserved.

• U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

IBM, the IBM logo, ibm.com, and BigInsights are trademarks or registered

trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml

Other company, product, or service names may be trademarks or service marks of

others.

http://www.ibm.com/legal/copytrade.shtml�

hands on lab introduction to hadoop with biginsights · pdf file1 . hands on lab ....

Documents