Our Data, Ourselves, PyData 2015

Post on 10-Aug-2015

Our Data, Ourselves

-The Data Democracy Deficit

Department of Digital Humanities

Giles Greenway

Tobias Blanke, Jenifer Pybus, Mark Cote

A “mobile-data commons”?

• Most of us leave behind a data-trail created by our mobile devices.

• Usually, it returns to us as targeted adverts (See Private Eye's “Malgorithms”...)

• How aware of this are mobile device users?

• How else might this data be used?

• Can we build a “mobile data commons”?

Can we build a “mobile-data commons”?

• Can we capture the data our devices leak with an app?

• No.

• This would require rooting the phones. An Android phone is a Linux system, where the end user typically doesn't have admin rights.

• If the app reaches a mass audience, we cannot expect users to root their phones. Some rooting software contains malware, and we cannot ensure that users root their devices safely.

• For a technical description of the Android permissions system and Android malware, watch: http://tinyurl.com/weidmandroid

What can we do then? -MobileMiner

Log:

• When apps access the internet.

• Cell-tower IDs.

• Wireless networks.

• When apps send notifications.

Full description of the app: http://tinyurl.com/miningmobileyouth

Phones with the app pre-loaded were issued to 20 young developers from Young Rewired State.

(Young Coders: Attitudes Vary!)

• ~20 Young coders were issued with Android smartphones with our MobileMiner app installed.

• Invited to participate in hack-days and focus-groups.

“If you have nothing to hide you have nothing to fear...”

“Privacy is attached to other people... so if someone you agree to connect with is open then you can be accessed through them cause it's kind of herd thing, you've all got to do it otherwise, one person is in trouble.”

“People don't realise how large their digital footprint’s actually are...”

“Being of kind of this generation and being tech savvy we have some control because we know how to have control...”

What can we do then? Network usage

• The Android API provides network traffic data on a per-app basis.

• Sample this every half second.

• Each app corresponds to a user in the underlying Linux system and has its own Dalvik virtual machine.

• The API can identify the PID of each running app.

• Poll /proc/<pid>/net/tcp every half second.

• Obtain the port and IP address of each network socket.

sl local_address rem_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode
12: 4F01A8C0:E1D0 B422C2AD:0050 01 00000000:00000000 02:000003A3 00000000 1000 0 154153 2 0000000000000000 23 4 28 10 -1
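The addresses in that row are hexadecimal, so a short decoding sketch may help. It assumes a little-endian device (typical for Android hardware), where the four address bytes appear reversed in the hex string; the function names are illustrative and not taken from MobileMiner.

```python
import socket
import struct


def decode_address(hex_addr):
    """Turn 'B422C2AD:0050' into ('173.194.34.180', 80).

    Assumes the kernel printed the IPv4 address as a host-order
    32-bit integer on a little-endian machine.
    """
    ip_hex, port_hex = hex_addr.split(":")
    ip = socket.inet_ntoa(struct.pack("<I", int(ip_hex, 16)))
    return ip, int(port_hex, 16)


def parse_tcp_line(line):
    """Pick out the fields of interest from one /proc/<pid>/net/tcp row."""
    fields = line.split()
    return {
        "local": decode_address(fields[1]),
        "remote": decode_address(fields[2]),
        "state": fields[3],       # "01" is an established connection
        "uid": int(fields[7]),    # Linux user ID owning the socket
    }


row = ("12: 4F01A8C0:E1D0 B422C2AD:0050 01 00000000:00000000 "
       "02:000003A3 00000000 1000 0 154153 2 0000000000000000 23 4 28 10 -1")
sock = parse_tcp_line(row)
print(sock["local"])   # ('192.168.1.79', 57808)
print(sock["remote"])  # ('173.194.34.180', 80)
```

For the row shown on the slide, this gives a local socket on a private address talking to port 80 of a remote host.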

What can we do then? GSM cells

• Full GPS is too invasive, and consumes power.

• Avoid use of Google location API.

• OpenCellId provides locations of (many) cell towers.

• http://opencellid.org
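A tower lookup against a downloaded OpenCellId export might be sketched as follows. The column names (mcc, net, area, cell, lon, lat) are an assumption about the CSV layout, and the two sample rows are made up for illustration.

```python
import csv
import io

# Hypothetical rows in the assumed OpenCellId CSV layout.
SAMPLE_CSV = """radio,mcc,net,area,cell,unit,lon,lat,range,samples,changeable,created,updated,averageSignal
GSM,234,10,1234,5678,0,-0.1167,51.5115,1000,12,1,1420070400,1420070400,0
GSM,234,15,4321,8765,0,-1.2577,51.7520,1500,7,1,1420070400,1420070400,0
"""


def load_towers(handle):
    """Index towers by (mcc, mnc, lac, cell id) -> (lat, lon)."""
    towers = {}
    for row in csv.DictReader(handle):
        key = (int(row["mcc"]), int(row["net"]),
               int(row["area"]), int(row["cell"]))
        towers[key] = (float(row["lat"]), float(row["lon"]))
    return towers


def locate(towers, mcc, mnc, lac, cell_id):
    """Return (lat, lon), or None when the cell is not in the export."""
    return towers.get((mcc, mnc, lac, cell_id))


towers = load_towers(io.StringIO(SAMPLE_CSV))
print(locate(towers, 234, 10, 1234, 5678))
print(locate(towers, 234, 10, 1, 1))
```

The second lookup returns None, which is the case to expect for any cell the database does not know about.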

Getting hold of the data: CKAN

• The “Drupal of data”...

• Needs Postgres and Apache Solr.

• Based on Pylons.

• Datastore plugin provides an API for uploading data.

• Runs in a virtualenv.

• “Out of the box” solution.

• Provides basic search, filtering, plotting and maps.

CKAN: Writing plugins:

import ckan.plugins as plugins


class MobileMinerPlugin(plugins.SingletonPlugin):
    # Register custom authorisation checks and API actions with CKAN.
    plugins.implements(plugins.IAuthFunctions)
    plugins.implements(plugins.IActions)

    def get_auth_functions(self):
        # Map each action name to the function that authorises it.
        return {'miner_update': miner_auth_update,
                'miner_register': miner_auth_register}

    def get_actions(self):
        # Map each action name to the function that implements it.
        return {'miner_update': miner_datastore_update,
                'miner_register': miner_datastore_register}

CKAN: Writing plugins:

import datetime
import random

import ckanapi

# ckan_url, api_key, resources and user_exists are not shown on the slide.


@plugins.toolkit.side_effect_free
def miner_datastore_register(context, data):
    # Reject requests that lack either required field.
    missing = [field for field in ['androidid', 'version']
               if not data.get(field, False)]
    if missing:
        raise plugins.toolkit.ValidationError(
            {'message': 'Not specified: ' + ', '.join(missing)})

    # Draw random 32-bit IDs until an unused one is found.
    newUser = False
    while not newUser:
        uid = abs(random.getrandbits(32))
        newUser = not user_exists(uid)

    # Record the new user in the datastore resource for users.
    local = ckanapi.RemoteCKAN(ckan_url, apikey=api_key)
    result = local.action.datastore_upsert(
        resource_id=resources['user'],
        records=[{'uid': uid,
                  'androidid': data['androidid'],
                  'version': data['version'],
                  'time': datetime.datetime.now().isoformat()}],
        method='insert')
    return uid
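Seen from the client, calling an action such as miner_register comes down to a JSON POST to CKAN's action API. The sketch below only builds the request and does not send it; the host name and field values are hypothetical, and this is not necessarily how the MobileMiner app itself makes the call.

```python
import json
import urllib.request


def build_register_request(site_url, android_id, version):
    """Build (but do not send) a POST to the miner_register action."""
    url = site_url.rstrip("/") + "/api/3/action/miner_register"
    body = json.dumps({"androidid": android_id,
                       "version": version}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical CKAN host and device details.
req = build_register_request("http://ckan.example.org", "abc123", "1.0")
print(req.full_url)
```

Passing the request to urllib.request.urlopen would send it; CKAN wraps whatever the action returns (here the new uid) in a JSON envelope under "result".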

CKAN integrates Celery...

• Celery: a distributed task queue.

• www.celeryproject.org

• Compose tasks across multiple machines.

• Monitor tasks with “Flower”.

• “Hooray, we can do proper data-science!”

• “...unless we default to SQLAlchemy as the broker!”

“Using a database as a message queue is not recommended, but can be sufficient for very small installations.” -Celery documentation

# ckan/lib/celery_app.py

default_config = dict(
    BROKER_BACKEND='sqlalchemy',
    BROKER_HOST=sqlalchemy_url,
    CELERY_RESULT_DBURI=sqlalchemy_url,
    CELERY_RESULT_BACKEND='database',
    CELERY_RESULT_SERIALIZER='json',
    CELERY_TASK_SERIALIZER='json',
    CELERY_IMPORTS=[],
)

CKAN and Python 3: (We've got 5 years)

• Installation guide specifies Python 2.6 or 2.7.

• There's a road-map, but Python 3 isn't a priority!

• Pyramid supports Python 3.

• A Pylons to Pyramid migration guide was written in 2011.

• How badly will extensions break?

The Data: Cell Towers.

• What does the trail of cell towers reveal about users? Can we cluster them?

• Devices connect to towers because of network traffic or cost, not just proximity.

• Not all cells are known to OpenCellId.

• Density varies.

Cell Towers: K-Means

• Clusters should be convex.

• Clusters should be compact.

• Space should be of reasonably low dimensionality.

• Euclidean distance should make sense (sklearn enforces this).

Cell Towers: K-Means

• Try [lat, lon] as feature vectors.

• Increase K until mean centroid distance is within 90% of the value for the previous K.

• Trails of points from journeys are split across multiple clusters.
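The stopping rule for K can be read in more than one way; the sketch below takes it to mean "stop once the mean distance to the nearest centroid improves by less than 10%, and keep the previous K". The function names are illustrative, the curve in the demo is made up, and the scikit-learn helper is one plausible way to obtain the distances rather than the authors' code.

```python
def choose_k(mean_distance_for_k, max_k=20, ratio=0.9):
    """Increase K until the mean centroid distance stops improving.

    mean_distance_for_k: callable mapping K to the mean distance of
    points from their nearest centroid.
    """
    previous = mean_distance_for_k(1)
    for k in range(2, max_k + 1):
        current = mean_distance_for_k(k)
        if current >= ratio * previous:
            return k - 1  # the last K that still helped noticeably
        previous = current
    return max_k


def kmeans_mean_distance(points):
    """Return a K -> mean-distance function backed by scikit-learn."""
    # Imported here so choose_k can be used without scikit-learn.
    from sklearn.cluster import KMeans

    def mean_distance(k):
        model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
        # transform() gives each point's distance to every centroid.
        return model.transform(points).min(axis=1).mean()

    return mean_distance


# Made-up curve: big gains up to K=3, then very little.
curve = {1: 100.0, 2: 50.0, 3: 20.0, 4: 19.0, 5: 18.0}
print(choose_k(curve.get, max_k=5))  # 3
```

With real data the call would look like choose_k(kmeans_mean_distance(points)), where points is a list of [lat, lon] pairs and max_k does not exceed the number of points.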

Cell Towers: K-Means

• Try [lat, lon, d_lat/dt, d_lon/dt] as feature vectors.

• K is reduced and trails of points coalesce.
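The slides do not say how the time derivatives were computed. A minimal sketch, assuming simple differences between consecutive time-stamped fixes (the input layout and function name are hypothetical):

```python
def velocity_features(fixes):
    """Build [lat, lon, d_lat/dt, d_lon/dt] rows.

    fixes: list of (t_seconds, lat, lon) tuples sorted by time.
    The first fix has no predecessor, so it yields no row.
    """
    features = []
    for (t0, lat0, lon0), (t1, lat1, lon1) in zip(fixes, fixes[1:]):
        dt = t1 - t0
        if dt <= 0:
            continue  # skip duplicate or out-of-order timestamps
        features.append([lat1, lon1,
                         (lat1 - lat0) / dt,
                         (lon1 - lon0) / dt])
    return features


# Made-up fixes: one move, then standing still.
fixes = [(0, 51.0, 0.0), (2, 52.0, 1.0), (4, 52.0, 1.0)]
for row in velocity_features(fixes):
    print(row)
```

Positions and rates are on very different scales, so some rescaling before clustering is probably needed; K-means treats all four columns as equally weighted Euclidean coordinates.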

Cell Towers: Spatial-Temporal Clusters?

• Can we localize events in space and time?

• Is day-of-the-week (vertical axis) a useful feature?

• No cluster that spans all weekdays is credible as a daily commute.

Cell Towers: Spatial-Temporal Clusters?

• Does adding the hour as a feature help?

• No cluster that spans 9-5 is found.

• Stop abusing K-means with categorical variables!

Cell Tower Clusters: Keep it simple.

• On how many distinct days is each cluster visited?

• What is the range of days of occupation?

• Is the cluster occupied more on weekdays or weekends?

• What is the range of times of day when the cluster is occupied?

• occupied at night all days == home

• occupied 9-5 on weekdays == work / school

• (OpenStreetMap correctly identified schools, Wi-Fi is also a clue.)

• student, two visits to multiple cities weeks apart == university open-days and interviews.
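These questions translate into a few summary statistics per cluster. The sketch below is one way to compute them and apply the home and work / school rules; the hour ranges and the exact thresholds are assumptions chosen for illustration, not values from the project.

```python
from datetime import datetime

NIGHT_HOURS = {22, 23, 0, 1, 2, 3, 4, 5}   # assumed meaning of "night"
OFFICE_HOURS = set(range(9, 17))           # assumed meaning of "9-5"


def summarise(timestamps):
    """Summarise the (non-empty) list of datetimes seen in one cluster."""
    days = sorted({t.date() for t in timestamps})
    weekday_hits = sum(1 for t in timestamps if t.weekday() < 5)
    return {
        "distinct_days": len(days),
        "day_span": (days[-1] - days[0]).days + 1,
        "weekday_fraction": weekday_hits / len(timestamps),
        "hours": {t.hour for t in timestamps},
        "weekdays_seen": {t.weekday() for t in timestamps},
    }


def label(summary):
    """Apply the two simple rules; anything else stays unknown."""
    if len(summary["weekdays_seen"]) == 7 and summary["hours"] & NIGHT_HOURS:
        return "home"
    if summary["weekday_fraction"] == 1.0 and summary["hours"] <= OFFICE_HOURS:
        return "work / school"
    return "unknown"


# Made-up visits: every night for a week, and weekday mornings.
nights = [datetime(2015, 6, day, 23, 0) for day in range(1, 8)]
mornings = [datetime(2015, 6, day, 10, 0) for day in range(1, 6)]
print(label(summarise(nights)))
print(label(summarise(mornings)))
```

Rules like these stay readable and easy to check by hand, which is the point of keeping it simple.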

Giving back the data.

• Give users a copy of the CKAN instance to play with.

• Access data via IPython notebooks.

• Include multiple services, libraries, etc...

• Produce a virtual machine that is easy to modify, document and distribute.

• Dockerfiles specify images that instantiate containers.

• “boot2docker” for Mac/Windows is just a re-branded VirtualBox. Use Docker in VirtualBox to distribute the container.

Giving back the data. -It works!

Docker: Criticisms...

• “sudo wget http://notdodqy.org/install.sh | /bin/sh”

• Show us the dockerfile!

• “That's not proper sysadmin!” http://iops.io/blog/docker-hype

• “What about OpenStack?”

• For distributing canned systems, none of these apply.

• But, supervisord doesn't quite work in Python3!

The Data: App Activity.

• Is network activity a proxy for app usage?

• The more Twitter friends, the more notifications.

[Figure: “Twitter Network Degree vs Notifications”. Horizontal axis: number of notifications (0 to 1200). Vertical axis: friends / followers count (0 to 1200). Series: Friends, Followers.]

The Data: App Activity.

• Is network activity a proxy for app usage?

• Some games make sense...

• ...others, not so much:

The Line! What is it doing?

The Line! AndroidManifest.xml

<receiver android:enabled="true" android:name="com.simplecreator.app.RemoteNotificationReceiver">
    <intent-filter>
        <action android:name="cn.jpush.android.intent.REGISTRATION"/>
        <action android:name="cn.jpush.android.intent.UNREGISTRATION"/>
        <action android:name="cn.jpush.android.intent.MESSAGE_RECEIVED"/>
        <action android:name="cn.jpush.android.intent.NOTIFICATION_RECEIVED"/>
        <action android:name="cn.jpush.android.intent.NOTIFICATION_OPENED"/>
        <action android:name="cn.jpush.android.intent.ACTION_RICHPUSH_CALLBACK"/>
        <category android:name="com.onetouchgame.TheLine"/>
    </intent-filter>
</receiver>

<service android:name="com.umeng.update.net.DownloadingService" android:process=":DownloadingService"/>

<activity android:name="com.umeng.update.UpdateDialogActivity" android:theme="@android:style/Theme.Translucent.NoTitleBar"/>

• The app receives intents from the push notification service jpush.cn. Umeng is a mobile analytics service.

• Is that why it had open sockets on port 3000?

apktool d com.onetouchgame.TheLine.apk

The Line! Examining the source-code:

Look for PhoneStateListeners and LocationListeners:

if (paramLocation != null) {
    d1 = paramLocation.getLatitude();
    d2 = paramLocation.getLongitude();
    boolean bool1 = d1 < 29.999998211860657D;
    ...

Classes provided by tencent.com (a mobile ad service) reference latitude and longitude.

Classes provided by jpush.cn and umeng.com also reference LocationListeners.

dex2jar.sh com.onetouchgame.TheLine

Docker: The Droid Destruction Kit!

• Can we put Android reversal and traffic capture tools into the hands of beginners?

• Many tools require building from source.

• “docker-ubuntu-vnc-desktop” puts an LXDE desktop in the user's browser.

• “Masterclass” on app reversal held by Darren Martyn (http://insecurety.net/) of Xiphos Research: http://www.xiphosresearch.com

Download our app: http://kingsbsd.github.io/MobileMiner

Follow us on Twitter: @KingsBSD

Read our blog: http://big-social-data.net/

Slideshare: http://www.slideshare.net/kingsBSD/
