mongodb analytics: learn aggregation by example - exploratory analytics and visualization using...
DESCRIPTION
TRANSCRIPT
Mongo Analytics – Learn aggregation by example
Exploratory Analytics and Visualization using Flight Data
www.jsonstudio.com
Analyzing Flight Data
• JSON data imported from CSV downloaded from:http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236
• Will build aggregation pipelines and visualize data using JSON Studio (www.jsonstudio.com)
• Sample document for a flight:
{ "_id": { "$oid": "534205f61c479f6149a92709" }, "YEAR": 2013, "QUARTER": 1, "MONTH": 1, "DAY_OF_MONTH": 18, "DAY_OF_WEEK": 5, "FL_DATE": "2013-01-18”, "UNIQUE_CARRIER": "DL”, "AIRLINE_ID": 19790, "CARRIER": "DL", "TAIL_NUM": "N325US”, "FL_NUM": 1497, "ORIGIN_AIRPORT_ID": 14100, "ORIGIN_AIRPORT_SEQ_ID": 1410002, "ORIGIN_CITY_MARKET_ID": 34100, "ORIGIN": "PHL", "ORIGIN_CITY_NAME": "Philadelphia, PA", "ORIGIN_STATE_ABR": "PA”, "ORIGIN_STATE_FIPS": 42, "DEST_AIRPORT_ID": 13487, "DEST_AIRPORT_SEQ_ID": 1348702, "DEST_CITY_MARKET_ID": 31650, "DEST": "MSP", "DEST_CITY_NAME": "Minneapolis, MN", "DEST_STATE_ABR": "MN", "DEST_STATE_FIPS": 27, "DEST_STATE_NM": "Minnesota", "DEST_WAC": 63, "CRS_DEP_TIME": 805, "DEP_TIME": 758,
"DEP_DELAY": -7, "DEP_DELAY_NEW": 0, "DEP_DEL15": 0, "DEP_DELAY_GROUP": -1, "DEP_TIME_BLK": "0800-0859", "TAXI_OUT": 24, "WHEELS_OFF": 822, "WHEELS_ON": 958, "TAXI_IN": 4, "CRS_ARR_TIME": 1015, "ARR_TIME": 1002, "ARR_DELAY": -13, "ARR_DELAY_NEW": 0, "ARR_DEL15": 0, "ARR_DELAY_GROUP": -1, "ARR_TIME_BLK": "1000-1059", "CANCELLED": 0, "CANCELLATION_CODE": "",
"DIVERTED": 0, "CRS_ELAPSED_TIME": 190, "ACTUAL_ELAPSED_TIME": 184, "AIR_TIME": 156, "FLIGHTS": 1, "DISTANCE": 980, "DISTANCE_GROUP": 4, "CARRIER_DELAY": "", "WEATHER_DELAY": "", "NAS_DELAY": "", "SECURITY_DELAY": "", "LATE_AIRCRAFT_DELAY": "", "FIRST_DEP_TIME": "", "TOTAL_ADD_GTIME": "", "LONGEST_ADD_GTIME": "", "": "" }
MongoDB aggregation steps/stages
• Grouping• Matching/filtering• Projection• Sorting• Unwind• Limit, skip• Added in 2.6– Out– Redact
Who are the largest carriers?
Which airports have the most cancellations?
Which carriers are most at fault for cancellations?
Arrival delays by distance
Delays by distance by carrier
Delays by distance by carrier – long haul only
Order Does Matter
An example for $unwind
• DB: companies• Collection: companies• Pipeline: movers
Hub airports – try1
Hub airports – try2
Hub airports – try 3
{ $group: { _id: { ORIGIN: "$ORIGIN", CARRIER: "$CARRIER" }, count: { $sum: 1 } } }, { $project: { airport: "$_id.ORIGIN", carrier: "$_id.CARRIER", "count": 1 } }, { $match: { "count": { $gte: "$$hub_threshold" } } }, { $group: {
_id: { airport: "$airport" }, airlines: { $sum: 1 }, flights: { $sum: "$count" }, avg_airline: { $avg: "$count" }, max_airline: { $max: "$count" } } },
{ $project: { "airlines": 1, "flights": 1, "avg_airline": 1, "max_airline": 1, "avg_no_max": { $divide: [ { $subtract: [ "$flights", "$max_airline" ] }, "$airlines" ] } } },
{ $sort: { "flights": -1 } }
Hub airports
Visualizing route data (from/to & main routes)
{ $group: { _id: { UNIQUE_CARRIER: "$UNIQUE_CARRIER", ORIGIN: "$ORIGIN", DEST: "$DEST" }, count: { $sum: 1 } } },
{ $match: { "count": { $gt: "$$count_threshold" } } }
Hub visualization (using routes – from/to, $$count=1, origin treemap)
From-to Insensitive{ $group: { _id: { UNIQUE_CARRIER: "$UNIQUE_CARRIER", ORIGIN: "$ORIGIN",
DEST: "$DEST" }, count: { $sum: 1 } } }, { $match: { "count": { $gt: "$$count_threshold" } } },
{ $project: { _id_UNIQUE_CARRIER: "$_id.UNIQUE_CARRIER", "count": 1,
rroute: { $cond: [
{ $lt: [ { $cmp: [ "$_id.ORIGIN", "$_id.DEST" ] }, 0 ] }, { $concat: [ "$_id.ORIGIN", "$_id.DEST" ] }, { $concat: [ "$_id.DEST", "$_id.ORIGIN" ] }
] } } },
{ $group: { _id: { _id_UNIQUE_CARRIER: "$_id_UNIQUE_CARRIER", rroute: "$rroute" }, _sum_count: { $sum: "$count" } } }
All routes treemap – airline fragmentation (rroute treemap)
New in 2.6 aggregation
• Cursors• MaxTimeMS• $out• $redact• $let/$map• Misc - $cond with either object or array, set
expression, ..
Quiz - NYC Airports
1. Import data – mongoimport (flights, airports, planes)2. Make join collection:
3. Query-Insert for all NYC data - ~500K flights per year originating in JFK/EWR/LGA
{ "Year": 2008, … "LateAircraftDelay": "NA", "to": { "city": "Detroit", "state": "MI", "lat": 42.21205889, "long": -83.34883583 }, "from": { "city": "Newark", "state": "NJ", "lat": 40.69249722, "long": -74.16866056 }, "aircraft": { "manufacturer": "EMBRAER", "model": "EMB-145EP", "year": 1997 }}
NYC Flights – Quiz Questions• Of the three airports, who has the most flights?
– Nyc1
• Who has the most cancellations and highest cancellation ratio?– Nyc2
• Taxi in/out times?– Nyc3
• What about delays?– Nyc4
• How do delays differ by month?– Nyc5 + nyc5– (summer vs. winter / bubble size vs. y-axis)
• What about weather delays only? Which months are worse? Are the three airports equivalent?– Nyc7 + nyc7
• Where can I fly to if I work for Boeing and am very loyal (and on which aicraft)?– Nyc8 + map