the bash dashboard (or: how to use bash for data analysis)

The Bash Dashboard (or: How to Use Bash for Data Analytics)
Bram Adams, Polytechnique Montreal, MCIS

Upload: bram-adams

Post on 07-Aug-2015

Category:

Data & Analytics


TRANSCRIPT

Yes, this kind of stuff :-)

Last time I checked, every PC on earth had Excel installed, so what gives?

(quote by random grad student)

One word: automation!

Let me rephrase: Why Bash if one has Python or R?

(fictitious quote)

To better understand and prepare your data before deeper analysis!

Basic Constructs

echo "Bram" > file.txt
echo "Michel" >> file.txt
echo "Giovanni" >> file.txt
cat file.txt | head -n 2

Output:

Bram
Michel

>  : replace file content
>> : append file content
|  : pipe: send output of the first command to the input of the second command
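
A minimal sketch of the same three constructs, run in a throwaway directory. The temporary directory and the variable names are assumptions for illustration, not part of the slides. It also shows what happens when > is used on a file that already has content:

```shell
# Work in a throwaway directory so nothing real gets overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

echo "Bram" > file.txt        # > creates file.txt (or replaces its content)
echo "Michel" >> file.txt     # >> appends a line
echo "Giovanni" >> file.txt

# | feeds the output of cat into head, which keeps the first 2 lines.
first_two=$(cat file.txt | head -n 2)
echo "$first_two"

# Using > again throws away the three lines written so far.
echo "Bram" > file.txt
line_count=$(wc -l < file.txt | tr -d ' ')
echo "$line_count"
```

After the last echo, file.txt is back to a single line, which is why > deserves some care in scripts.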

Example data 1: http://www.cs.wm.edu/semeru/data/tse-android/files/apps.csv

apps.csv

package_app,name,category,version,rating_average,votes,star1,star2,star3,star4,star5
a8.kv.chilly,a8 chili slot machine lite,CARDS,1.2,3.7,42,10,2,4,2,24
[censored apps]
accessline.spy_camera,Hidden camera free version,MEDIA_AND_VIDEO,1.41,2.4,67,34,7,5,5,16
acciones.chile,Acciones Chile,FINANCE,1.0,4.2,24,1,1,0,11,11
acgs.topanime.evawp.photos,Evangelion HD Live Wallpaper,SPORTS,1.1,3.7,70,16,2,7,10,35
Adam.androiddev,Anti Dog Repellent / Whistle,TOOLS,2.4,3.1,748,288,30,51,77,302
[…]

A typical CSV file has a comma-separated list of attribute names on line 1, followed by one line per observation, each of which has a value for each attribute.

Example data 2: http://www.cs.wm.edu/semeru/data/MSR14-android-reuse/files/apps_labels.csv

apps_labels.csv

App package,Category,Type
air.com.huale.Basketball,ARCADE,Obfuscated
air.com.smch.climatekiten,BOOKS_AND_REFERENCE,Obfuscated
air.comicc.app9019,BOOKS_AND_REFERENCE,Obfuscated
ait.podka,MEDIA_AND_VIDEO,Obfuscated
ak.alizandro.smartaudiobookplayer,MUSIC_AND_AUDIO,Obfuscated
amor.developer.android,LIFESTYLE,Obfuscated
[…]

What Kind of Data does apps.csv Contain?

head -n 1 apps.csv

head -n 1: show first line
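
Since the cut commands in the following slides address columns by position, it can help to number the attribute names first. A small sketch, with the header line copied from the apps.csv slide into a variable so that it runs without the download; on the real file, the first stage would be head -n 1 apps.csv. The variable names are assumptions for illustration:

```shell
# Header line copied from the apps.csv slide.
header='package_app,name,category,version,rating_average,votes,star1,star2,star3,star4,star5'

# Turn the commas into newlines and number the resulting lines:
# the number in front of each name is the value to pass to cut -f.
columns=$(echo "$header" | tr ',' '\n' | cat -n)
echo "$columns"

# Count the attributes.
column_count=$(echo "$header" | tr ',' '\n' | wc -l | tr -d ' ')
echo "$column_count"
```

In this listing, name is column 2 and category is column 3, which is what cut -f 2 and cut -f 2,3 rely on.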

Oh, does the File Contain the birthdayChocolate package?

grep -e "birthdayChocolate" apps.csv

grep -e: print the lines that match a pattern, here a plain string (grep -F forces a literal match)

How Many Apps are There?

wc -l apps.csv

wc -l: number of lines in a file

Wait a Minute, What about the First Line?

tail -n +2 apps.csv | wc -l

tail -n +2: all the lines of a file starting with line 2 (i.e., removing line 1)
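
A quick sketch of the effect on a tiny file. The header and the three rows are copied from the apps.csv slide; the file name sample_apps.csv and the variable names are assumptions for illustration. tail -n +2 is the POSIX spelling of "start at line 2"; some tail implementations reject the older tail +2 form.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Header plus three data rows copied from the apps.csv slide.
cat > sample_apps.csv <<'EOF'
package_app,name,category,version,rating_average,votes,star1,star2,star3,star4,star5
a8.kv.chilly,a8 chili slot machine lite,CARDS,1.2,3.7,42,10,2,4,2,24
acciones.chile,Acciones Chile,FINANCE,1.0,4.2,24,1,1,0,11,11
Adam.androiddev,Anti Dog Repellent / Whistle,TOOLS,2.4,3.1,748,288,30,51,77,302
EOF

# wc -l alone also counts the header line.
with_header=$(wc -l < sample_apps.csv | tr -d ' ')

# Skipping line 1 first gives the number of data rows.
without_header=$(tail -n +2 sample_apps.csv | wc -l | tr -d ' ')

echo "$with_header lines, $without_header data rows"
```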

… and what about Apps with >1 Version?

tail -n +2 apps.csv | cut -f 2 -d , | sort -u | wc -l

cut -f 2 -d ,: only keep second column of comma-delimited file
sort -u: sort alphabetically and remove duplicate lines

What is the Maximum #Versions of an App?

tail -n +2 apps.csv | cut -f 2 -d , | sort | uniq -c | sort -n

sort: sort, but keep all the lines
uniq -c: count #occurrences of each unique line, i.e., group per line and give #occurrences of each group
sort -n: sort numerically (the app with the most versions ends up on the last line)
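
To see both counts on something small, here is a sketch on a five-line sample. The header and the rows with versions 1.2 and 1.0 are copied from the apps.csv slide; the rows with versions 1.3 and 1.4 are made up so that one app has several versions. File and variable names are assumptions too.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Versions 1.3 and 1.4 of the first app are made up for illustration.
cat > sample_apps.csv <<'EOF'
package_app,name,category,version,rating_average,votes,star1,star2,star3,star4,star5
a8.kv.chilly,a8 chili slot machine lite,CARDS,1.2,3.7,42,10,2,4,2,24
a8.kv.chilly,a8 chili slot machine lite,CARDS,1.3,3.7,42,10,2,4,2,24
a8.kv.chilly,a8 chili slot machine lite,CARDS,1.4,3.7,42,10,2,4,2,24
acciones.chile,Acciones Chile,FINANCE,1.0,4.2,24,1,1,0,11,11
EOF

# Number of distinct app names (one line per name after sort -u).
distinct_apps=$(tail -n +2 sample_apps.csv | cut -f 2 -d , | sort -u | wc -l | tr -d ' ')

# sort -n puts the largest count last, so tail -n 1 picks the app
# with the most versions.
most_versions=$(tail -n +2 sample_apps.csv | cut -f 2 -d , | sort | uniq -c | sort -n | tail -n 1)

# uniq -c pads the count with spaces; awk extracts the first field (the count).
max_count=$(echo "$most_versions" | awk '{ print $1 }')
echo "$distinct_apps distinct apps, at most $max_count versions"
```

One caveat worth keeping in mind: cut splits on every comma, so an app name that itself contains a comma would shift the columns.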

Which App Category Contains Most of the Apps?

tail -n +2 apps.csv | cut -f 2,3 -d , | sort -u |
  cut -f 2 -d , |
  sort | uniq -c |
  sort -n

cut -f 2,3 -d ,: only keep app name and category
sort -u: keep one version per app name
cut -f 2 -d ,: throw away app name
sort | uniq -c: group and count per category
sort -n: sort categories per count
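
If only the winner is of interest, one possible variation sorts in reverse and keeps the first line. A sketch on a small sample: the first three data rows come from the apps.csv slide, while the version 2.5 row and the "Example Flashlight" app are made up so that one category holds two apps. File and variable names are assumptions as well.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# The last two data rows are made up for illustration.
cat > sample_apps.csv <<'EOF'
package_app,name,category,version,rating_average,votes,star1,star2,star3,star4,star5
a8.kv.chilly,a8 chili slot machine lite,CARDS,1.2,3.7,42,10,2,4,2,24
acciones.chile,Acciones Chile,FINANCE,1.0,4.2,24,1,1,0,11,11
Adam.androiddev,Anti Dog Repellent / Whistle,TOOLS,2.4,3.1,748,288,30,51,77,302
Adam.androiddev,Anti Dog Repellent / Whistle,TOOLS,2.5,3.1,748,288,30,51,77,302
example.flashlight,Example Flashlight,TOOLS,1.0,4.0,10,1,1,1,1,6
EOF

# Same stages as on the slide, but sorted in reverse (-r) so that the
# biggest category comes first and head -n 1 can pick it.
top_line=$(tail -n +2 sample_apps.csv | cut -f 2,3 -d , | sort -u |
  cut -f 2 -d , |
  sort | uniq -c |
  sort -rn | head -n 1)

# uniq -c prints "count category"; split the two fields with awk.
top_category=$(echo "$top_line" | awk '{ print $2 }')
top_count=$(echo "$top_line" | awk '{ print $1 }')
echo "$top_category has $top_count apps"
```

The two versions of the same app count only once, because sort -u collapses identical name,category pairs before the counting starts.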

Let’s Take a Look at the Obfuscation Data

less apps_labels.csv

less: buffer file to scroll up and down (vs. more)

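What a Mess?!

tr '\r' '\n' < apps_labels.csv > apps_obfus.csv

tr '\r' '\n': fix end-of-line issues by replacing each \r (carriage return) character by \n (line feed). If a file uses Windows \r\n line endings, this leaves an empty line after every row; tr -d '\r' deletes the \r instead.

More on line-ending: http://www.cyberciti.biz/faq/howto-unix-linux-convert-dos-newlines-cr-lf-unix-text-format/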
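
Which tr call fits depends on what the file really contains. A small sketch covering both cases, using two tiny files built with printf (the first two lines of apps_labels.csv as shown on the slide; the file names cr_only.csv and crlf.csv are assumptions for illustration):

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Same content twice: once with lone \r line endings,
# once with Windows \r\n line endings.
printf 'App package,Category,Type\rait.podka,MEDIA_AND_VIDEO,Obfuscated\r' > cr_only.csv
printf 'App package,Category,Type\r\nait.podka,MEDIA_AND_VIDEO,Obfuscated\r\n' > crlf.csv

# Lone \r: translate every \r into \n.
tr '\r' '\n' < cr_only.csv > cr_fixed.csv

# \r\n: delete the \r and keep the \n that is already there.
tr -d '\r' < crlf.csv > crlf_fixed.csv

# Both results now have two proper lines.
cr_lines=$(wc -l < cr_fixed.csv | tr -d ' ')
crlf_lines=$(wc -l < crlf_fixed.csv | tr -d ' ')
echo "$cr_lines $crlf_lines"
```

On most systems, running file apps_labels.csv reports which kind of line terminators a file has, which helps to pick the right variant.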

How to Merge the App Data with Obfuscation Results? (1)

TMP=$(head -n 1 apps.csv)
echo "${TMP},obfuscated" > apps_join.csv
tail -n +2 apps.csv | sort > sorted_apps.csv
tail -n +2 apps_obfus.csv | sort > sorted_apps_obfus.csv

TMP=$(...): store result of command in variable
echo ... > apps_join.csv: storing the column names first
sort: merging requires sorted files

How to Merge the App Data with Obfuscation Results? (2)

join -t , -1 1 -2 1 sorted_apps.csv sorted_apps_obfus.csv |
  cut -f -11,13 -d , >> apps_join.csv

-t ,: comma-separated files
-1 1 -2 1: lines with same value for first column in file 1 and in file 2 should be merged
cut -f -11,13 -d ,: join removes the specified -2 column, but keeps the rest of the columns of file 2; here we only want the last column of file 2, so we remove the 12th column (keeping only the first 11 columns and the 13th)
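
The two merge steps can be tried end to end on a pair of tiny files. This is a sketch under assumptions: the apps.csv rows are copied from the slide, but the label rows are made up for the same two packages (the real apps_labels.csv lists other packages), and setting LC_ALL=C plus sorting explicitly on field 1 are additions that keep sort and join in agreement about the ordering.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
export LC_ALL=C   # make sort and join use the same byte-wise ordering

# Header and rows copied from the apps.csv slide.
cat > apps.csv <<'EOF'
package_app,name,category,version,rating_average,votes,star1,star2,star3,star4,star5
acciones.chile,Acciones Chile,FINANCE,1.0,4.2,24,1,1,0,11,11
a8.kv.chilly,a8 chili slot machine lite,CARDS,1.2,3.7,42,10,2,4,2,24
EOF

# Made-up label rows; only the three-column layout matches the slide.
cat > apps_obfus.csv <<'EOF'
App package,Category,Type
acciones.chile,FINANCE,Obfuscated
a8.kv.chilly,CARDS,Obfuscated
EOF

# Header first, then the data rows sorted on the join column (field 1).
echo "$(head -n 1 apps.csv),obfuscated" > apps_join.csv
tail -n +2 apps.csv | sort -t , -k 1,1 > sorted_apps.csv
tail -n +2 apps_obfus.csv | sort -t , -k 1,1 > sorted_apps_obfus.csv

# join emits the key, the other 10 columns of file 1, then Category and
# Type of file 2 (13 columns); cut drops column 12, the repeated category.
join -t , -1 1 -2 1 sorted_apps.csv sorted_apps_obfus.csv |
  cut -f -11,13 -d , >> apps_join.csv

cat apps_join.csv
```

Every line of the resulting apps_join.csv has 12 comma-separated fields: the 11 original columns plus the obfuscation label.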

Which Category has Most of the Obfuscated Code?

tail -n +2 apps_join.csv | grep -e ",Obfuscated" |
  cut -f 2,3 -d , | sort -u |
  cut -f 2 -d , |
  sort | uniq -c |
  sort -n

grep -e ",Obfuscated": only consider lines that are obfuscated
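
grep matches ",Obfuscated" anywhere on the line. A possibly stricter variation compares the last column only, for instance with awk. A sketch on a made-up sample: the first 11 columns of each row come from the apps.csv slide, whereas the labels in column 12 are invented, and "Other" is just a placeholder for whatever non-obfuscated value the real data uses.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Column 12 is made up for illustration; "Other" is a placeholder value.
cat > sample_join.csv <<'EOF'
package_app,name,category,version,rating_average,votes,star1,star2,star3,star4,star5,obfuscated
a8.kv.chilly,a8 chili slot machine lite,CARDS,1.2,3.7,42,10,2,4,2,24,Obfuscated
acciones.chile,Acciones Chile,FINANCE,1.0,4.2,24,1,1,0,11,11,Other
Adam.androiddev,Anti Dog Repellent / Whistle,TOOLS,2.4,3.1,748,288,30,51,77,302,Obfuscated
EOF

# Keep only rows whose 12th column is exactly "Obfuscated",
# then count distinct apps per category as on the slide.
obfuscated_per_category=$(tail -n +2 sample_join.csv |
  awk -F , '$12 == "Obfuscated"' |
  cut -f 2,3 -d , | sort -u |
  cut -f 2 -d , |
  sort | uniq -c |
  sort -n)
echo "$obfuscated_per_category"
```

In this sample, FINANCE drops out of the result because its only app carries the placeholder label.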

Bonus: How to Create a Comma-Separated List from a List of Words?

cut -f 3 -d , apps.csv | sort -u |
  paste -d , -s -

-: take input from pipe
-s: concatenate all lines
-d ,: … and put commas between them
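
A tiny sketch of the paste step on its own, fed with a few category values from the apps.csv slide (one of them repeated on purpose; the variable name is an assumption):

```shell
# One word per line goes in, sort -u removes the repeat,
# paste -s glues the remaining lines together with commas.
categories=$(printf '%s\n' TOOLS CARDS FINANCE CARDS | sort -u | paste -d , -s -)
echo "$categories"
```

On the real file, the header word category ends up in the list as well, unless the first line is skipped with tail -n +2 before the cut.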

If you’re Interested, Check Out these Books for More (and less ;-))