introduction to sql and sql in r - lisa (laboratory for ... to sql.pdf · sql is a special-purpose...

34
Introduction to SQL and SQL in R LISA Short Courses Xinran Hu 1

Upload: duonganh

Post on 01-Feb-2018

214 views

Category:

Documents


1 download

TRANSCRIPT

Introduction to SQL and SQL in R

LISA Short Courses

Xinran Hu

1

2

Laboratory for Interdisciplinary Statistical

Analysis

LISA helps VT researchers benefit from the use of Statistics

Short Courses: Designed to help graduate students apply statistics in their research

Walk-In Consulting: M-F 1-3 PM GLC Video Conference Room;

11 AM-1 PM Old Security Building Room 103

For questions requiring <30 mins

All services are FREE for VT researchers.

We assist with research—not class projects or homework.

Collaboration:

Visit our website to request personalized statistical advice and assistance with:

Experimental Design • Data Analysis • Interpreting

Results

Grant Proposals • Software (R, SAS, JMP, SPSS...) LISA statistical collaborators aim to explain concepts in ways useful for your

research.

Great advice right now: Meet with LISA before collecting your data.

Part I. SQL

1. Basic Concepts in SQL

2. Querying From One Table

3. Nested Queries

4. Querying From Multiple Tables

Part II. SQL in R

1. sqldf Package

2. Write SQL Command in R

3

Course Plan

SQL is a special-purpose programming language designed for managing data held in a relational database management system (RDBMS). (wikipedia)

In the era of Big Data,

(1) Simple Method + Massive Dataset

(2) The time spent on data management often exceeds the time spent on analysis.

Why SQL?

4

Basic Concepts: Table

A Table (or a Relation) is a set of records that have the same attributes. 1. A table always have an unique table name in SQL.

Name Gender Age Occupation

Jessica Female 28 Student

Michael Male 28 Professor

Emma Female 34 Professor

5

Basic Concepts: Column

A Column ( or Attribute or Field) is a characteristics of objects. 1. A column always have a unique column name within a table.

2. A column is either numeric (age), categorical (gender) or one of the other pre-defined types in SQL, e.g. date.

3. Primary Key: a column that has unique value for each object.

Name(PK) Gender Age Occupation

Jessica Female 28 Student

Michael Male 28 Professor

Emma Female 34 Professor 6

Basic Concepts: Row

A Row ( or a Tuple or Record) is an object represented by a list of attributes. 1. A row always have an unique value in Primary Key.

Name(PK) Gender Age Occupation

Jessica Female 28 Student

Michael Male 28 Professor

Emma Female 34 Professor

7

Basic SQL Command

Example dataset: a world fact dataset (sqlzoo.net)

http://sqlzoo.net/wiki/SELECT_basics

(Table name: world)

8

SELECT …FROM …

Function: Selecting one or more columns from a table.

Example:

SELECT * FROM world

SELECT name FROM world

SELECT name, gdp FROM world

A SQL query always start with a SELECT…FROM… command.

9

DISTINCT Clause

Function: Removing duplicate records in query result.

Example: SELECT continent FROM world

(This query includes duplicate results)

SELECT DISTINCT continent FROM world

(This query excludes duplicate results)

10

LIMIT clause

Function: Limiting the number of records returned.

Example: SELECT DISTINCT continent FROM world

(This query returns all distinct results)

SELECT DISTINCT continent FROM world

LIMIT 3

(This query returns three distinct results)

11

WHERE clause

Function: Filtering the records returned by a select command.

Example: Filtering based on a numeric attribute SELECT name, area FROM world /* do not forget this line*/

WHERE area = 9596961 (km^2)

WHERE area<> 9596961

WHERE area > 5000000

WHERE area < 1

WHERE area IN (0, 9596961)

WHERE area BETWEEN 9000000 AND 10000000

WHERE area NOT BETWEEN 1 AND 10000000

12

WHERE clause (cont.)

Example: Filtering based on a categorical attribute. SELECT name, continent FROM world

WHERE continent = 'north america'

WHERE continent <> 'north america'

WHERE continent >'north america'

WHERE continent < 'north america'

WHERE continent IN ('north america', ‘south america')

WHERE continent BETWEEN ‘D’ AND ‘F’

WHERE continent NOT BETWEEN ‘D’ AND ‘F’

WHERE continent LIKE '%america‘

13

WHERE clause (cont.)

Example: Filtering based on two or more conditions. SELECT name, continent FROM world

WHERE area>9000000 AND area<10000000

WHERE continent = 'north america‘ AND area>9000000

WHERE continent = 'north america‘ OR area>9000000

WHERE (continent = 'north america‘ AND area>500000) OR (continent = ‘south america‘ AND area>500000)

14

ORDERED BY clause

Function: Sorting the query result by a column.

Example:

Sort Ascending (Default) SELECT name, area FROM world

ORDER BY area (ASC)

Sort Descending

SELECT name, area FROM world

ORDER BY area DESC

15

Arithmetic operations

SQL permits arithmetic operations on columns.

Example: Query GDP: select name, gdp from world

order by gdp DESC

Query GDP in Trillion: select name, gdp/1000000000 from world

order by gdp DESC

Query GDP per capita select name, gdp/population from world

order by gdp/population DESC

16

Write the SQL queries in the first tutorial.

http://sqlzoo.net/wiki/SELECT_basics

17

Practice: SQLZOO:SELECT basics

Function: Producing descriptive statistics of a column.

Example:

Query the average GDP of the world.

SELECT avg(gdp) FROM world

More Aggregate Functions:

max(), min(), sum(), count(), var(), stdev()

Note, max() and min() operates on strings as well.

Aggregate Functions

18

Function: Stratifying aggregate result by a categorical column

Example:

Query the average GDP of each continent.

SELECT avg(gdp) FROM world.

GROUP BY continent

GROUP BY

19

Function: Filtering aggregate results.

Example:

Find the continent with average country area greater than 1 million.

SELECT continent, avg(area) FROM world

GROUP BY continent

HAVING avg(area)>1000000

HAVING Clause

20

Query result can be used in the where clause of another query. This is referred as Nested Query

SELECT name FROM world WHERE population > (SELECT population FROM world WHERE name=‘Russia') If we run the nested query, we will get 146000000, so the above query is equivalent to

SELECT name FROM world WHERE population > 146000000

Nested Query

21

Function: Concatenating two tables “vertically”

UNION Command

Name Gender Age Occupation

Jessica Female 28 Student

Michael Male 28 Professor

Emma Female 34 Professor

Name Gender Age Occupation

James Male 23 Student

Anna Female 22 Staff

Andy Male 45 Professor

Name Gender Age Occupation

Jessica Female 28 Student

Michael Male 28 Professor

Emma Female 34 Professor

James Male 23 Student

Anna Female 22 Staff

Andy Male 45 Professor

22

Example: SELECT * FROM world

WHERE name = ‘Germany’

UNION

SELECT * FROM world

WHERE name=‘France’

Why is UNION command useful if we can use WHERE clause?

Answer: These queries could come from different tables.

UNION Command

23

Function: Concatenating two tables “horizontally”.

JOIN Command

Name Gender Age Occupation

Jessica Female 28 Student

Michael Male 28 Professor

Emma Female 34 Professor

Name Department

Jessica Math

Michael Statistics

Emma Art

Name Gender Age Occupation Department

Jessica Female 28 Student Math

Michael Male 28 Professor Statistics

Emma Female 34 Professor Art

24

JOIN is generally more complicated than union because you have to specify a common attribute to link two tables. In the previous example, we joined tables with the name column.

If both tables contain a column with the same column name, we should distinguish two columns by attaching the table name in front

Table1Name.ColumnName vs Table2Name.ColumnName

JOIN Command

25

There are four different types of join:

(1) INNER JOIN

(2) LEFT JOIN

(3) RIGHT JOIN

(4) FULL OUTER JOIN

JOIN Command

26

INNER JOIN produce a record if it exist in BOTH tables

INNER JOIN

Name Gender Age Occupation

Jessica Female 28 Student

Michael Male 28 Professor

Emma Female 34 Professor

Name Department

Jessica Math

John CS

Name Gender Age Occupation Department

Jessica Female 28 Student Math

27

LEFT JOIN produce a record if it exists in the left table

LEFT JOIN

Name Gender Age Occupation

Jessica Female 28 Student

Michael Male 28 Professor

Emma Female 34 Professor

Name Department

Jessica Math

John CS

Name Gender Age Occupation Department

Jessica Female 28 Student Math

Michael Male 28 Professor NULL

Emma Female 34 Professor NULL

28

RIGHT JOIN produce a record only if it exists in the right table

RIGHT JOIN

Name Gender Age Occupation

Jessica Female 28 Student

Michael Male 28 Professor

Emma Female 34 Professor

Name Department

Jessica Math

John CS

Name Gender Age Occupation Department

Jessica Female 28 Student Math

John NULL NULL NULL CS

29

FULL OUTER JOIN produce a record if it exists in either table.

Note, FULL OUTER JOIN is not supported in MySQL.

FULL OUTER JOIN

Name Gender Age Occupation

Jessica Female 28 Student

Michael Male 28 Professor

Emma Female 34 Professor

Name Department

Jessica Math

John CS

Name Gender Age Occupation Department

Jessica Female 28 Student Math

Michael Male 28 Professor NULL

Emma Female 34 Professor NULL

John NULL NULL NULL CS 30

JOIN syntax

Left Table Name: Games

Right Table Name: City

Year City

1896 Athens

1948 London

2004 Athens

2008 Beijing

2012 London

Olympic Game Dataset http://sqlzoo.net/wiki/SELECT_.._JOIN

Name Country

Sydney Australia

Athens Greece

Beijing China

London UK

31

INNER JOIN SELECT * FROM games JOIN city ON games.city=city.name

LEFT JOIN SELECT * FROM games LEFT JOIN city ON games.city=city.name

RIGHTT JOIN SELECT * FROM games RIGHT JOIN city ON games.city=city.name FULL OUTER JOIN SELECT * FROM games FULL OUTER JOIN city ON games.city=city.name

JOIN Example

32

Two ways of applying SQL queries in R

(1) A “Local” Way.

Query from a dataframe.

(2) A “Remote” Way

Accessing a SQL database remotely.

SQL in R

33

Package sqldf allows us writing SQL-like commands to query from a dataframe just like querying from a relational table. All we need to do is wrap a SQL command in double quotation and send it to the sqldf function.

E.g.

library(sqldf)

sqldf(“select * from hpc”)

Package sqldf

34