the bash dashboard (or: how to use bash for data analysis)
Last time I checked, every PC on earth had Excel installed, so what gives?
(quote by random grad student)
Basic Constructs
echo "Bram" > file.txt
echo "Michel" >> file.txt
echo "Giovanni" >> file.txt
cat file.txt | head -n 2
Bram
Michel
>  : replace file content
>> : append file content
|  : pipe: send output of first command to input of second command
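A minimal sketch of the same three constructs, run in a throwaway directory so nothing real gets overwritten. The variable names (first_two, line_count) are only for illustration; the last step shows that a second > replaces the content rather than adding to it.

```shell
# Work in a temporary directory so no existing file.txt is touched.
cd "$(mktemp -d)"

echo "Bram" > file.txt       # > creates file.txt (or empties it first)
echo "Michel" >> file.txt    # >> appends a second line
echo "Giovanni" >> file.txt  # ... and a third

# head can read the file itself, so the cat is optional.
first_two=$(head -n 2 file.txt)
echo "$first_two"

# Using > again throws away the three lines written so far.
echo "Giovanni" > file.txt
line_count=$(wc -l < file.txt | tr -d ' ')
echo "lines after overwrite: $line_count"
```

The tr -d ' ' only strips the padding that some wc implementations print in front of the number.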
apps.csv
package_app,name,category,version,rating_average,votes,star1,star2,star3,star4,star5
a8.kv.chilly,a8 chili slot machine lite,CARDS,1.2,3.7,42,10,2,4,2,24
[censored apps]
accessline.spy_camera,Hidden camera free version,MEDIA_AND_VIDEO,1.41,2.4,67,34,7,5,5,16
acciones.chile,Acciones Chile,FINANCE,1.0,4.2,24,1,1,0,11,11
acgs.topanime.evawp.photos,Evangelion HD Live Wallpaper,SPORTS,1.1,3.7,70,16,2,7,10,35
Adam.androiddev,Anti Dog Repellent / Whistle,TOOLS,2.4,3.1,748,288,30,51,77,302
[…]
typical csv file has comma-separated list of attribute names on line 1
… followed by one line per observation, each of which has a value for each attribute
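Since cut selects columns by number, it can help to number the attribute names first. The sketch below assumes a tiny apps.csv built from two of the rows shown above, and it assumes no field contains a quoted comma (plain comma splitting breaks on those).

```shell
cd "$(mktemp -d)"

# Two rows copied from the slide, plus the header line.
cat > apps.csv <<'EOF'
package_app,name,category,version,rating_average,votes,star1,star2,star3,star4,star5
a8.kv.chilly,a8 chili slot machine lite,CARDS,1.2,3.7,42,10,2,4,2,24
acciones.chile,Acciones Chile,FINANCE,1.0,4.2,24,1,1,0,11,11
EOF

# One attribute name per line, numbered: the number is what cut -f expects.
head -n 1 apps.csv | tr ',' '\n' | nl

# Sanity check: every line should have the same number of fields.
field_counts=$(awk -F, '{ print NF }' apps.csv | sort -u)
echo "distinct field counts: $field_counts"
```

If the last command prints more than one number, some line has extra or missing commas and the column numbers cannot be trusted for that line.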
apps_labels.csv
App package,Category,Type
air.com.huale.Basketball,ARCADE,Obfuscated
air.com.smch.climatekiten,BOOKS_AND_REFERENCE,Obfuscated
air.comicc.app9019,BOOKS_AND_REFERENCE,Obfuscated
ait.podka,MEDIA_AND_VIDEO,Obfuscated
ak.alizandro.smartaudiobookplayer,MUSIC_AND_AUDIO,Obfuscated
amor.developer.android,LIFESTYLE,Obfuscated
[…]
Oh, does the File Contain the birthdayChocolate package?
grep -e "birthdayChocolate" apps.csv
search for a pattern (here a literal string)
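A few grep options that fit this kind of question, sketched on a two-row sample taken from the slide. -F, -c and -q are standard grep options; the variable names are illustrative only.

```shell
cd "$(mktemp -d)"

cat > apps.csv <<'EOF'
package_app,name,category,version,rating_average,votes,star1,star2,star3,star4,star5
a8.kv.chilly,a8 chili slot machine lite,CARDS,1.2,3.7,42,10,2,4,2,24
acciones.chile,Acciones Chile,FINANCE,1.0,4.2,24,1,1,0,11,11
EOF

# -F treats the pattern as a fixed string, so the dot is not a wildcard.
grep -F "acciones.chile" apps.csv

# -c prints the number of matching lines instead of the lines themselves.
matches=$(grep -c -F "acciones.chile" apps.csv)

# -q prints nothing; only the exit status tells whether there was a match.
if grep -q -F "birthdayChocolate" apps.csv; then found=yes; else found=no; fi
echo "$matches match(es); birthdayChocolate found: $found"
```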
Wait a Minute, What about the First Line?
tail -n +2 apps.csv | wc -l
tail -n +2: all the lines of a file starting with line 2 (i.e., removing line 1)
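A small check of the header-skipping step, assuming a sample file with one header line and three rows from the slide. tail -n +2 is the POSIX spelling; some tail implementations no longer accept the older form without -n. sed 1d is shown as an equivalent.

```shell
cd "$(mktemp -d)"

cat > apps.csv <<'EOF'
package_app,name,category,version,rating_average,votes,star1,star2,star3,star4,star5
a8.kv.chilly,a8 chili slot machine lite,CARDS,1.2,3.7,42,10,2,4,2,24
acciones.chile,Acciones Chile,FINANCE,1.0,4.2,24,1,1,0,11,11
Adam.androiddev,Anti Dog Repellent / Whistle,TOOLS,2.4,3.1,748,288,30,51,77,302
EOF

# All lines, header included.
total=$(wc -l < apps.csv | tr -d ' ')

# Lines from line 2 onward, i.e. the observations only.
observations=$(tail -n +2 apps.csv | wc -l | tr -d ' ')

# sed 1d deletes line 1 and gives the same result.
with_sed=$(sed 1d apps.csv | wc -l | tr -d ' ')

echo "$total lines, $observations observations"
```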
… and what about Apps with >1 Version?
tail -n +2 apps.csv | cut -f 2 -d , | sort -u | wc -l
cut -f 2 -d ,: only keep second column of comma-delimited file
sort -u: sort alphabetically and remove duplicate lines
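To see the difference between counting rows and counting apps, here is a sketch on a sample where one app appears twice. The last row is a made-up second version (1.1) of an app from the slide, added purely for illustration.

```shell
cd "$(mktemp -d)"

# NOTE: the final row is hypothetical; it only exists to create a duplicate name.
cat > apps.csv <<'EOF'
package_app,name,category,version,rating_average,votes,star1,star2,star3,star4,star5
a8.kv.chilly,a8 chili slot machine lite,CARDS,1.2,3.7,42,10,2,4,2,24
acciones.chile,Acciones Chile,FINANCE,1.0,4.2,24,1,1,0,11,11
acciones.chile,Acciones Chile,FINANCE,1.1,4.2,24,1,1,0,11,11
EOF

# Every row after the header is one observation (one app version).
observations=$(tail -n +2 apps.csv | wc -l | tr -d ' ')

# Keeping only the name column and de-duplicating counts each app once.
distinct_names=$(tail -n +2 apps.csv | cut -f 2 -d , | sort -u | wc -l | tr -d ' ')

echo "$observations observations, $distinct_names distinct app names"
```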
What is the Maximum #Versions of an App?
tail -n +2 apps.csv | cut -f 2 -d , | sort | uniq -c | sort -n
sort: sort, but keep all the lines
uniq -c: count #occurrences of each unique line, i.e., group per line and give #occurrences of each group
sort -n: sort numerically
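Because sort -n puts the largest count last, the answer to the question is the final line of the output. A sketch that picks it out, again with one hypothetical duplicate row so that the maximum is not 1:

```shell
cd "$(mktemp -d)"

# NOTE: the final row is a made-up second version, for illustration only.
cat > apps.csv <<'EOF'
package_app,name,category,version,rating_average,votes,star1,star2,star3,star4,star5
a8.kv.chilly,a8 chili slot machine lite,CARDS,1.2,3.7,42,10,2,4,2,24
acciones.chile,Acciones Chile,FINANCE,1.0,4.2,24,1,1,0,11,11
acciones.chile,Acciones Chile,FINANCE,1.1,4.2,24,1,1,0,11,11
EOF

# uniq -c only merges adjacent lines, hence the sort in front of it.
top=$(tail -n +2 apps.csv | cut -f 2 -d , | sort | uniq -c | sort -n | tail -n 1)

# uniq -c prints "<count> <line>"; the first word is the count.
max_versions=$(echo "$top" | awk '{ print $1 }')

echo "most versions: $top"
```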
Which App Category Contains Most of the Apps?
tail -n +2 apps.csv | cut -f 2,3 -d , | sort -u |
  cut -f 2 -d , |
  sort | uniq -c |
  sort -n
cut -f 2,3 -d ,: only keep app name and category
sort -u: keep one version per app name
cut -f 2 -d ,: throw away app name
sort | uniq -c: group and count per category
sort -n: sort categories by count
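A variant that prints only the winning category, using sort -rn (largest first) and head -n 1. The sample mixes rows from the slide with two clearly hypothetical rows (a second version of one app and an invented "Example Bank" app) so that one category ends up ahead.

```shell
cd "$(mktemp -d)"

# NOTE: the 1.1 row and the example.bank row are made up for illustration.
cat > apps.csv <<'EOF'
package_app,name,category,version,rating_average,votes,star1,star2,star3,star4,star5
a8.kv.chilly,a8 chili slot machine lite,CARDS,1.2,3.7,42,10,2,4,2,24
acciones.chile,Acciones Chile,FINANCE,1.0,4.2,24,1,1,0,11,11
acciones.chile,Acciones Chile,FINANCE,1.1,4.2,24,1,1,0,11,11
example.bank,Example Bank,FINANCE,1.0,4.0,10,1,1,1,2,5
Adam.androiddev,Anti Dog Repellent / Whistle,TOOLS,2.4,3.1,748,288,30,51,77,302
EOF

# A trailing | lets the pipeline continue on the next line.
top=$(tail -n +2 apps.csv | cut -f 2,3 -d , | sort -u |
  cut -f 2 -d , |
  sort | uniq -c |
  sort -rn | head -n 1)

top_count=$(echo "$top" | awk '{ print $1 }')
top_category=$(echo "$top" | awk '{ print $2 }')
echo "$top_category has $top_count apps"
```

The two versions of Acciones Chile count once, because sort -u runs on the name and category pair before the name is thrown away.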
Let’s Take a Look at the Obfuscation Data
less apps_labels.csv
buffer file to scroll up and down (vs. more)
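What a Mess?!
tr '\r' '\n' < apps_labels.csv > apps_obfus.csv
fix Windows-style end-of-line issues by replacing each \r character with \n (if the file uses \r\n pairs rather than a bare \r, deleting the \r with tr -d '\r' avoids the blank lines this would leave)
More on line-ending: http://www.cyberciti.biz/faq/howto-unix-linux-convert-dos-newlines-cr-lf-unix-text-format/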
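Which tr call is right depends on what the line endings actually are. The sketch below builds two hypothetical files (crlf.csv and cr.csv, names chosen here) from one row of the labels sample, inspects them with od -c, and shows the effect of each fix.

```shell
cd "$(mktemp -d)"

# Same two lines, once with \r\n endings and once with bare \r endings.
printf 'App package,Category,Type\r\nait.podka,MEDIA_AND_VIDEO,Obfuscated\r\n' > crlf.csv
printf 'App package,Category,Type\rait.podka,MEDIA_AND_VIDEO,Obfuscated\r' > cr.csv

# od -c makes the otherwise invisible \r and \n characters visible.
od -c crlf.csv | head -n 3

# \r\n endings: delete the \r and keep the \n.
tr -d '\r' < crlf.csv > crlf_fixed.csv

# Bare \r endings: turn each \r into \n.
tr '\r' '\n' < cr.csv > cr_fixed.csv

crlf_lines=$(wc -l < crlf_fixed.csv | tr -d ' ')
cr_lines=$(wc -l < cr_fixed.csv | tr -d ' ')

# Replacing instead of deleting on a \r\n file doubles the line count.
doubled=$(tr '\r' '\n' < crlf.csv | wc -l | tr -d ' ')
echo "fixed: $crlf_lines and $cr_lines lines; wrong fix on CRLF: $doubled lines"
```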
How to Merge the App Data with Obfuscation Results? (1)
TMP=`head -n 1 apps.csv`
echo "${TMP},obfuscated" > apps_join.csv
tail -n +2 apps.csv | sort > sorted_apps.csv
tail -n +2 apps_obfus.csv | sort > sorted_apps_obfus.csv
TMP=`…`: store result of command in variable
echo … > apps_join.csv: storing the column names first
sort: merging requires sorted files
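A hedged variation on the preparation step. It uses $(…), which does the same as backticks, and it sorts on the first field only under LC_ALL=C; that is an assumption about what is safest, not something the slides prescribe, but sort and join need to agree on the ordering and a fixed locale is one way to ensure it. The variable name header is illustrative.

```shell
cd "$(mktemp -d)"

# Rows from the slide, deliberately out of order.
cat > apps.csv <<'EOF'
package_app,name,category,version,rating_average,votes,star1,star2,star3,star4,star5
acciones.chile,Acciones Chile,FINANCE,1.0,4.2,24,1,1,0,11,11
Adam.androiddev,Anti Dog Repellent / Whistle,TOOLS,2.4,3.1,748,288,30,51,77,302
a8.kv.chilly,a8 chili slot machine lite,CARDS,1.2,3.7,42,10,2,4,2,24
EOF

# Store the header line in a variable and start the output file with it.
header=$(head -n 1 apps.csv)
echo "${header},obfuscated" > apps_join.csv

# Sort the data rows on field 1 (the package), byte order.
tail -n +2 apps.csv | LC_ALL=C sort -t , -k 1,1 > sorted_apps.csv

# sort -c exits with status 0 only if the file is already in order.
if LC_ALL=C sort -c -t , -k 1,1 sorted_apps.csv; then sorted=yes; else sorted=no; fi

first_pkg=$(head -n 1 sorted_apps.csv | cut -f 1 -d ,)
echo "sorted: $sorted, first package: $first_pkg"
```

In byte order uppercase letters come before lowercase ones, which is why Adam.androiddev ends up first.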
How to Merge the App Data with Obfuscation Results? (2)
join -t , -1 1 -2 1 sorted_apps.csv sorted_apps_obfus.csv |
  cut -f -11,13 -d , >> apps_join.csv
-t ,: comma-separated files
-1 1 -2 1: lines with the same value in the first column of file 1 and of file 2 are merged
cut -f -11,13 -d ,: join drops the -2 join column but keeps the remaining columns of file 2; here we only want the last column of file 2, so we remove the 12th column (keeping only the first 11 columns and the 13th)
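A self-contained sketch of the join on two tiny files. All package names and values here (com.example.alpha and so on) are invented for illustration; only the column layout follows the slides. It also shows that join silently drops rows without a partner unless -a is given.

```shell
cd "$(mktemp -d)"

# Hypothetical, already sorted app data (11 columns, no header).
cat > sorted_apps.csv <<'EOF'
com.example.alpha,Alpha,TOOLS,1.0,4.0,10,1,1,1,2,5
com.example.beta,Beta,CARDS,2.0,3.0,5,1,1,1,1,1
EOF

# Hypothetical, already sorted labels (package, category, type).
cat > sorted_apps_obfus.csv <<'EOF'
com.example.alpha,TOOLS,Obfuscated
com.example.gamma,CARDS,Obfuscated
EOF

# Merge on column 1 of both files; the result has 11 + 2 = 13 columns.
LC_ALL=C join -t , -1 1 -2 1 sorted_apps.csv sorted_apps_obfus.csv > joined_raw.csv
raw_fields=$(awk -F, '{ print NF }' joined_raw.csv | sort -u)

# Drop column 12 (the category repeated from file 2), keep 1-11 and 13.
cut -f -11,13 -d , joined_raw.csv > joined.csv
cat joined.csv

# By default only packages present in both files survive.
kept=$(wc -l < joined.csv | tr -d ' ')

# -a 1 also prints the lines of file 1 that found no partner.
with_unpaired=$(LC_ALL=C join -t , -a 1 sorted_apps.csv sorted_apps_obfus.csv | wc -l | tr -d ' ')
echo "$raw_fields columns before cut; $kept joined row(s); $with_unpaired with -a 1"
```

Comparing the number of joined rows with the number of input rows is a cheap way to notice that some apps had no label.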
Which Category has Most of the Obfuscated Code?
tail -n +2 apps_join.csv | grep -e ",Obfuscated" |
  cut -f 2,3 -d , | sort -u |
  cut -f 2 -d , |
  sort | uniq -c |
  sort -n
grep -e ",Obfuscated": only consider lines that are obfuscated
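The grep above matches ",Obfuscated" anywhere in the line. If the last column could ever hold other values that start the same way, a stricter filter compares the last field exactly. The sketch uses invented rows, and "Obfuscated_partially" is a made-up value that does not come from the slides.

```shell
cd "$(mktemp -d)"

# NOTE: both data rows and the second type value are hypothetical.
cat > apps_join.csv <<'EOF'
package_app,name,category,version,rating_average,votes,star1,star2,star3,star4,star5,obfuscated
com.example.alpha,Alpha,TOOLS,1.0,4.0,10,1,1,1,2,5,Obfuscated
com.example.beta,Beta,CARDS,2.0,3.0,5,1,1,1,1,1,Obfuscated_partially
EOF

# Substring match: both rows are counted.
grep_hits=$(tail -n +2 apps_join.csv | grep -c -e ",Obfuscated")

# Anchoring the pattern at the end of the line is stricter.
anchored_hits=$(tail -n +2 apps_join.csv | grep -c -e ',Obfuscated$')

# awk: skip the header (NR > 1) and compare the last field ($NF) exactly.
exact_hits=$(awk -F, 'NR > 1 && $NF == "Obfuscated"' apps_join.csv | wc -l | tr -d ' ')

echo "substring: $grep_hits, anchored: $anchored_hits, exact field: $exact_hits"
```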
Bonus: How to Create a Comma-Separated List from a List of Words?
cut -f 3 -d , apps.csv | sort -u |
  paste -d , -s -
-: take input from pipe
-s: concatenate all lines
-d ,: … and put commas between them
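A sketch of the same idea on four rows from the slide. Without skipping line 1, the column name "category" ends up in the list as well; adding tail -n +2 in front keeps it out. LC_ALL=C is used here only to make the sort order predictable.

```shell
cd "$(mktemp -d)"

cat > apps.csv <<'EOF'
package_app,name,category,version,rating_average,votes,star1,star2,star3,star4,star5
a8.kv.chilly,a8 chili slot machine lite,CARDS,1.2,3.7,42,10,2,4,2,24
accessline.spy_camera,Hidden camera free version,MEDIA_AND_VIDEO,1.41,2.4,67,34,7,5,5,16
acciones.chile,Acciones Chile,FINANCE,1.0,4.2,24,1,1,0,11,11
Adam.androiddev,Anti Dog Repellent / Whistle,TOOLS,2.4,3.1,748,288,30,51,77,302
EOF

# As on the slide: the header word is part of the result.
with_header=$(cut -f 3 -d , apps.csv | LC_ALL=C sort -u | paste -d , -s -)

# Skipping line 1 first leaves only real category values.
categories=$(tail -n +2 apps.csv | cut -f 3 -d , | LC_ALL=C sort -u | paste -d , -s -)

echo "$with_header"
echo "$categories"
```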