
  • Data Science: Data Visualization Boot Camp

    Distribution

    Column Histogram Plot

    Chuck Cartledge, PhD

    25 January 2020

    1/24

  • 2/24

    Type Sample data Hands on Q & A Conclusion References Files

    Table of contents (1 of 1)

    1 Type
        Uses
        General considerations

    2 Sample data

    3 Hands on

    4 Q & A

    5 Conclusion

    6 References

    7 Files

  • 3/24

    A definition

    “Sometimes referred to as frequency diagram. The histogram is the best known member of the family of data distribution graphs. A frequency polygon is a segmented or smooth line version of a histogram. Histograms and frequency polygons show the frequency with which specific values (referred to as data elements) or values within ranges (referred to as class intervals) occur in a set of data.”

    R. L. Harris [1]
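
As a rough illustration of the definition above, the following sketch counts how many values fall into each class interval. The sample vector and the interval width of 5 are made up for this example and do not come from the slides.

```r
## Hypothetical sample values (not from the slides).
values <- c(12, 14, 15, 18, 21, 22, 22, 27, 29, 31)

## Class intervals of width 5: [10,15), [15,20), ..., [30,35).
breaks <- seq(10, 35, by = 5)

## cut() assigns each value to an interval; table() counts per interval.
## right = FALSE makes each interval closed on the left, open on the right.
intervals <- cut(values, breaks = breaks, right = FALSE)
freq <- table(intervals)
print(freq)
```

The resulting table is what a histogram draws: one column per class interval, with the column height equal to the frequency.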

  • 4/24

    R supplied data set (1 of 2)

    Included in the R package ggplot2.

    “This dataset contains a subset of the fuel economy data that the EPA makes available on . It contains only models which had a new release every year between 1999 and 2008 - this was used as a proxy for the popularity of the car.”

    H. Wickham [2]

    library(ggplot2)

    ?mpg

    head(mpg)

    Resulting in:

  • 5/24

    R supplied data set (2 of 2)

    # A tibble: 6 x 11
      manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class
    1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     comp
    2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     comp
    3 audi         a4      2    2008     4 manual(m6) f        20    31 p     comp
    4 audi         a4      2    2008     4 auto(av)   f        21    30 p     comp
    5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     comp
    6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     comp

  • 6/24

    More recent mileage data

    Downloaded from: https://www.fueleconomy.gov/feg/download.shtml

    Described at: https://www.fueleconomy.gov/feg/ws/index.shtml#vehicle

    We will:

    1 Extract csv data from a zip file (39,865 rows)

    2 Select certain makes (attempt to replicate the sample data)

    3 Display different data for selected makes/models
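
The first two steps above might be sketched as follows. This is only a sketch under stated assumptions: the function names `read_zipped_csv` and `select_makes`, the file names in the final comment, and the column name `make` are all hypothetical and should be checked against the actual download.

```r
## Read one CSV file directly out of a zip archive without unpacking it to disk.
## zip_path and csv_name are placeholders; adjust them to the downloaded file.
read_zipped_csv <- function(zip_path, csv_name) {
  read.csv(unz(zip_path, csv_name), stringsAsFactors = FALSE)
}

## Keep only the rows whose make is in the requested set.
## The column name "make" is an assumption about the CSV layout.
select_makes <- function(vehicles, makes, make_col = "make") {
  vehicles[vehicles[[make_col]] %in% makes, , drop = FALSE]
}

## Tiny stand-in data frame so the selection step can be tried offline.
demo <- data.frame(make = c("Audi", "Ford", "Audi", "Honda"),
                   hwy  = c(29, 24, 31, 33),
                   stringsAsFactors = FALSE)
selected <- select_makes(demo, c("Audi", "Honda"))
print(selected)

## With the real download (file names are assumptions):
## vehicles <- read_zipped_csv("vehicles.csv.zip", "vehicles.csv")
```

Reading through `unz()` avoids leaving an extracted copy of a large CSV on disk; the selection step then works on an ordinary data frame.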

  • 7/24

    The first codes. (1 of 4)

  • 8/24

    The first codes. (2 of 4)

    rm(list=ls())

    library(ggplot2)

    data(mpg, package="ggplot2")

    mpg_select

  • 9/24

    The first codes. (3 of 4)

    g + geom_histogram(binwidth=1)

    g + geom_histogram(binwidth=1,
                       fill=I("blue"))

    g + geom_histogram(binwidth=1,
                       fill=I("blue"),
                       col=I("red"))

    g + geom_histogram(binwidth=1,
                       fill="blue",
                       col="red",
                       alpha=I(0.2))

    ## http://ggplot2.tidyverse.org/reference/geom_histogram.html
    ## ..count.. is a computed variable
    g + geom_histogram(binwidth=1,

  • 10/24

    The first codes. (4 of 4)

                       col="red",
                       aes(fill=..count..)
                       ) + labs(fill = "Count")

    ## See ?stat_bin for explanation of ..count.. and other variables
    g + geom_histogram(binwidth=1,
                       col="red",
                       aes(fill=..count..)
                       ) + labs(fill = "Count") +
        stat_bin(aes(y=..count.., label=..count..),
                 geom="text", vjust=-.5, binwidth=1)
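
The snippets above rely on a plot object `g` whose definition did not survive in this transcript. The sketch below assumes `g` maps a single numeric column of `mpg` to the x axis; the choice of `hwy` is an assumption made only for illustration. It also uses `after_stat(count)`, which ggplot2 3.3.0 and later offer in place of the `..count..` notation (deprecated since ggplot2 3.4.0).

```r
library(ggplot2)

## Assumption: the slides' `g` maps one numeric column to x.
## `hwy` (highway miles per gallon) is chosen here only for illustration.
g <- ggplot(mpg, aes(x = hwy))

## after_stat(count) is the current spelling of the computed variable
## that the slides write as ..count..
p <- g +
  geom_histogram(binwidth = 1, col = "red",
                 aes(fill = after_stat(count))) +
  labs(fill = "Count") +
  stat_bin(aes(y = after_stat(count), label = after_stat(count)),
           geom = "text", vjust = -0.5, binwidth = 1)

## Inspect the per-bin counts that the histogram layer computed.
bins <- layer_data(p, 1)
print(head(bins[, c("x", "count")]))
```

`layer_data()` returns the data frame that `stat_bin` computed for the first layer, which is a convenient way to check that the bin counts add up to the number of rows.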

  • 11/24

    The second codes. (1 of 5)

  • 12/24

    The second codes. (2 of 5)

    rm(list=ls())

    library(ggplot2)

    saveFileName

  • 13/24

    The second codes. (3 of 5)

    ## mpg_select

  • 14/24

    The second codes. (4 of 5)

    g + geom_histogram(binwidth=1)

    g + geom_histogram(binwidth=1,
                       fill=I("blue"))

    g + geom_histogram(binwidth=1,
                       fill=I("blue"),
                       col=I("red"))

    g + geom_histogram(binwidth=1,
                       fill="blue",
                       col="red",
                       alpha=I(0.2))

    ## http://ggplot2.tidyverse.org/reference/geom_histogram.html
    ## See ?stat_bin for explanation of ..count.. and other variables
    ## ..count.. is a computed variable

  • 15/24

    The second codes. (5 of 5)

    g + geom_histogram(binwidth=1,
                       col="red",
                       aes(fill=..count..)
                       ) + labs(fill = "Count")

    g + geom_histogram(binwidth=1,
                       col="red",
                       aes(fill=..count..)
                       ) + labs(fill = "Count") +
        stat_bin(aes(y=..count.., label=..count..),
                 geom="text", vjust=-.5, binwidth=1) +
        theme(legend.position=c(0.9, 0.8))

  • 16/24

    The third codes. (1 of 4)

  • 17/24

    The third codes. (2 of 4)

    rm(list=ls())

    library(ggplot2)

    library(magrittr)

    library(ggpubr)

    library(png)

    saveFileName

  • 18/24

    The third codes. (3 of 4)

    theme_set(theme_gray())

    g

  • 19/24

    The third codes. (4 of 4)

    g + geom_histogram(alpha=0.2, position="stack")

    g + geom_histogram(alpha=0.2, position="stack", binwidth=1) +
        theme(legend.position=c(0.9, 0.8))

    g + geom_histogram(alpha=0.2, position="identity", binwidth=1) +
        theme(legend.position=c(0.9, 0.8))

    img.file
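
The difference between `position="stack"` and `position="identity"` only shows when the data is split into groups. The sketch below assumes the third set's `g` maps a grouping column to `fill`; the columns `hwy` and `drv` from `mpg` are stand-ins chosen for illustration, not the author's choice.

```r
library(ggplot2)

## Assumption: the third set's `g` adds a grouping variable as fill.
## `drv` (drive train) is picked here purely for illustration.
g <- ggplot(mpg, aes(x = hwy, fill = drv))

## "stack" piles the groups on top of each other within each bin.
stacked <- g + geom_histogram(alpha = 0.2, position = "stack", binwidth = 1)

## "identity" draws every group from the baseline, so bars overlap
## and the transparency (alpha) is what keeps them all visible.
overlaid <- g + geom_histogram(alpha = 0.2, position = "identity", binwidth = 1)

## The tallest bar differs: stacked heights are sums across groups.
max_stacked  <- max(layer_data(stacked, 1)$ymax)
max_overlaid <- max(layer_data(overlaid, 1)$ymax)
print(c(stacked = max_stacked, overlaid = max_overlaid))
```

With stacking, the total column height is the bin count over all groups; with identity, each group's column can be read against the axis on its own.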

  • 20/24

    Hands-on exercises

    1 In the second set of images, change the bin width from 1 to 5 and explain the effect.

    2 In the second set of images, change the selected automobile makes to include: Aston Martin, Hummer, and Maybach.

    3 Change the background image of the third plot to a coral reef.
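
As a starting point for the first exercise, the following base R sketch bins the same values at width 1 and at width 5. The mileage values and the helper name `bin_counts` are invented for this example.

```r
## Hypothetical highway mileage values (not taken from the slides).
hwy <- c(17, 18, 18, 20, 22, 24, 24, 25, 26, 26, 27, 29, 29, 31, 33)

## Count values per bin for a given bin width, using left-closed bins.
bin_counts <- function(x, width) {
  breaks <- seq(floor(min(x) / width) * width,
                ceiling((max(x) + 1) / width) * width,
                by = width)
  table(cut(x, breaks = breaks, right = FALSE))
}

narrow <- bin_counts(hwy, 1)  # many bins, each holds 0 to 2 values
wide   <- bin_counts(hwy, 5)  # few bins, each holds several values
print(narrow)
print(wide)
```

The total number of values is the same in both tables; only the grouping changes, which is why the shape of the plot depends on the bin width.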

  • 21/24

    Q & A time.

    Q: What’s Dr. Presume’s full name?

    A: Dr. Livingston I. Presume.

  • 22/24

    What have we covered?

    Columnar histogram plots:

    Are useful to show how data “groups,” but not why

    Are dependent on the bin width

    Can show absolute numbers, or percentages

    Good for showing gross differences in the third dimension.
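
The point about absolute numbers versus percentages can be sketched as below. The mapping of `hwy` to the x axis is an assumption for illustration, and `after_stat()` requires ggplot2 3.3.0 or later.

```r
library(ggplot2)

## Assumption: highway mileage from the mpg sample data on the x axis.
g <- ggplot(mpg, aes(x = hwy))

## Absolute numbers: the default y value is the per-bin count.
p_count <- g + geom_histogram(binwidth = 1)

## Percentages: divide each bin's count by the total count.
p_share <- g +
  geom_histogram(binwidth = 1,
                 aes(y = after_stat(count / sum(count)))) +
  scale_y_continuous(labels = function(v) paste0(round(100 * v), "%")) +
  labs(y = "Share of vehicles")

## The shares over all bins should add up to 1 (that is, 100%).
shares <- layer_data(p_share, 1)$y
print(round(sum(shares), 6))
```

Both plots have the same shape; only the y axis changes, so the choice depends on whether the reader needs totals or proportions.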

    Next: Line histograms (how grouping data can show patterns)

  • 23/24

    References (1 of 1)

    [1] Robert L. Harris, Information Graphics: A Comprehensive Illustrated Reference, Oxford University Press, 2000.

    [2] Hadley Wickham, ggplot2: Elegant Graphics for Data Analysis, Springer-Verlag New York, 2009.

  • 24/24

    Files of interest

    1 Code snippet to create images in this presentation

    2 Extract Federal fuel data

    ## Lots of good stuff
    ## http://r-statistics.co/Complete-Ggplot2-Tutorial-Part2-Customizing-Theme-With-R-Code.html

    ## First codes
    rm(list=ls())

    library(ggplot2)
    data(mpg, package="ggplot2")

    mpg_select