MAKING BIG DATA RELEVANT
THE IMPORTANCE OF DATA VISUALIZATION & ANALYTICS
S Anand, Chief Data Scientist, Gramener
@sanand0
A DATA VISUALISATION
CHALLENGE…
You will see 3 questions. You have 30 seconds.
Try it!
Your timer starts now
QUESTION 1: HOW MANY NUMBERS ARE ABOVE 100?
23 32 71 72 58 87 11 77 70 16
17 21 56 44 68 51 84 20 60 40
37 8 107 14 12 41 69 14 18 71
62 55 59 64 33 55 71 58 103 92
101 56 45 34 43 15 73 78 6 93
39 53 22 26 26 94 60 82 99 74
11 12 36 67 70 71 97 59 73 99
75 74 69 69 51 48 2 66 92 98
15 10 41 58 104 94 92 84 74 82
12 52 10 57 33 77 88 81 81 91
15 56 25 30 21 7 66 66 78 87
29 23 5 34 11 96 74 99 99 88
37 10 43 15 50 71 65 60 101 98
46 34 19 102 57 70 95 84 63 91
3 34 39 37 60 81 65 63 9 71
48 46 25 50 22 64 91 76 71 79
QUESTION 2: HOW MANY NUMBERS ARE BELOW 10?
23 32 71 72 58 87 11 77 70 16
17 21 56 44 68 51 84 20 60 40
37 8 107 14 12 41 69 14 18 71
62 55 59 64 33 55 71 58 103 92
101 56 45 34 43 15 73 78 6 93
39 53 22 26 26 94 60 82 99 74
11 12 36 67 70 71 97 59 73 99
75 74 69 69 51 48 2 66 92 98
15 10 41 58 104 94 92 84 74 82
12 52 10 57 33 77 88 81 81 91
15 56 25 30 21 7 66 66 78 87
29 23 5 34 11 96 74 99 99 88
37 10 43 15 50 71 65 60 101 98
46 34 19 102 57 70 95 84 63 91
3 34 39 37 60 81 65 63 9 71
48 46 25 50 22 64 91 76 71 79
QUESTION 3: WHICH QUADRANT HAS THE HIGHEST TOTAL?
23 32 71 72 58 87 11 77 70 16
17 21 56 44 68 51 84 20 60 40
37 8 107 14 12 41 69 14 18 71
62 55 59 64 33 55 71 58 103 92
101 56 45 34 43 15 73 78 6 93
39 53 22 26 26 94 60 82 99 74
11 12 36 67 70 71 97 59 73 99
75 74 69 69 51 48 2 66 92 98
15 10 41 58 104 94 92 84 74 82
12 52 10 57 33 77 88 81 81 91
15 56 25 30 21 7 66 66 78 87
29 23 5 34 11 96 74 99 99 88
37 10 43 15 50 71 65 60 101 98
46 34 19 102 57 70 95 84 63 91
3 34 39 37 60 81 65 63 9 71
48 46 25 50 22 64 91 76 71 79
A DATA VISUALISATION
CHALLENGE…
We’ll answer the same questions again, but with simple visual cues.
See how long it takes.
Your timer starts now
QUESTION 1: HOW MANY NUMBERS ARE ABOVE 100?
23 32 71 72 58 87 11 77 70 16
17 21 56 44 68 51 84 20 60 40
37 8 107 14 12 41 69 14 18 71
62 55 59 64 33 55 71 58 103 92
101 56 45 34 43 15 73 78 6 93
39 53 22 26 26 94 60 82 99 74
11 12 36 67 70 71 97 59 73 99
75 74 69 69 51 48 2 66 92 98
15 10 41 58 104 94 92 84 74 82
12 52 10 57 33 77 88 81 81 91
15 56 25 30 21 7 66 66 78 87
29 23 5 34 11 96 74 99 99 88
37 10 43 15 50 71 65 60 101 98
46 34 19 102 57 70 95 84 63 91
3 34 39 37 60 81 65 63 9 71
48 46 25 50 22 64 91 76 71 79
QUESTION 2: HOW MANY NUMBERS ARE BELOW 10?
23 32 71 72 58 87 11 77 70 16
17 21 56 44 68 51 84 20 60 40
37 8 107 14 12 41 69 14 18 71
62 55 59 64 33 55 71 58 103 92
101 56 45 34 43 15 73 78 6 93
39 53 22 26 26 94 60 82 99 74
11 12 36 67 70 71 97 59 73 99
75 74 69 69 51 48 2 66 92 98
15 10 41 58 104 94 92 84 74 82
12 52 10 57 33 77 88 81 81 91
15 56 25 30 21 7 66 66 78 87
29 23 5 34 11 96 74 99 99 88
37 10 43 15 50 71 65 60 101 98
46 34 19 102 57 70 95 84 63 91
3 34 39 37 60 81 65 63 9 71
48 46 25 50 22 64 91 76 71 79
QUESTION 3: WHICH QUADRANT HAS THE HIGHEST TOTAL?
23 32 71 72 58 87 11 77 70 16
17 21 56 44 68 51 84 20 60 40
37 8 107 14 12 41 69 14 18 71
62 55 59 64 33 55 71 58 103 92
101 56 45 34 43 15 73 78 6 93
39 53 22 26 26 94 60 82 99 74
11 12 36 67 70 71 97 59 73 99
75 74 69 69 51 48 2 66 92 98
15 10 41 58 104 94 92 84 74 82
12 52 10 57 33 77 88 81 81 91
15 56 25 30 21 7 66 66 78 87
29 23 5 34 11 96 74 99 99 88
37 10 43 15 50 71 65 60 101 98
46 34 19 102 57 70 95 84 63 91
3 34 39 37 60 81 65 63 9 71
48 46 25 50 22 64 91 76 71 79
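For reference, the three questions can also be checked in code. The Python sketch below is only an illustration and is not part of the original challenge. It assumes the quadrants split the 16 x 10 grid at its middle row and middle column; the names GRID, count_where and quadrant_totals are made up for this example.

```python
# The 16 x 10 grid of numbers from the challenge slides.
GRID = [
    [23, 32, 71, 72, 58, 87, 11, 77, 70, 16],
    [17, 21, 56, 44, 68, 51, 84, 20, 60, 40],
    [37, 8, 107, 14, 12, 41, 69, 14, 18, 71],
    [62, 55, 59, 64, 33, 55, 71, 58, 103, 92],
    [101, 56, 45, 34, 43, 15, 73, 78, 6, 93],
    [39, 53, 22, 26, 26, 94, 60, 82, 99, 74],
    [11, 12, 36, 67, 70, 71, 97, 59, 73, 99],
    [75, 74, 69, 69, 51, 48, 2, 66, 92, 98],
    [15, 10, 41, 58, 104, 94, 92, 84, 74, 82],
    [12, 52, 10, 57, 33, 77, 88, 81, 81, 91],
    [15, 56, 25, 30, 21, 7, 66, 66, 78, 87],
    [29, 23, 5, 34, 11, 96, 74, 99, 99, 88],
    [37, 10, 43, 15, 50, 71, 65, 60, 101, 98],
    [46, 34, 19, 102, 57, 70, 95, 84, 63, 91],
    [3, 34, 39, 37, 60, 81, 65, 63, 9, 71],
    [48, 46, 25, 50, 22, 64, 91, 76, 71, 79],
]


def count_where(grid, predicate):
    """Count the cells for which predicate(value) is true."""
    return sum(1 for row in grid for value in row if predicate(value))


def quadrant_totals(grid):
    """Sum each quadrant, assuming a split at the middle row and middle column."""
    mid_row, mid_col = len(grid) // 2, len(grid[0]) // 2
    totals = {"top-left": 0, "top-right": 0, "bottom-left": 0, "bottom-right": 0}
    for r, row in enumerate(grid):
        for c, value in enumerate(row):
            vertical = "top" if r < mid_row else "bottom"
            horizontal = "left" if c < mid_col else "right"
            totals[f"{vertical}-{horizontal}"] += value
    return totals


above_100 = count_where(GRID, lambda v: v > 100)
below_10 = count_where(GRID, lambda v: v < 10)
totals = quadrant_totals(GRID)
best_quadrant = max(totals, key=totals.get)
print(above_100, below_10, best_quadrant, totals)
```

The point of the slides still stands: a person needs the visual cue to answer these quickly, while code needs none.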
WHY VISUALISE?
100 YEARS OF INDIA’S WEATHER
Axes: years 1901 to 2001 (one tick per decade), months Jan to Dec
Most discussions of decision-making assume that only senior executives make decisions or that only senior executives’ decisions matter. This is a dangerous mistake…
Peter F Drucker
Data generation and analysis are not sufficient.
Consuming it as a team and acting in cohesion is.
SHOW me what is happening with the data
EXPLAIN to me why it’s happening
Allow me to EXPLORE and figure it out
Just EXPOSE the data to me
Axes: Creator (Low effort to High effort), Consumer (Low effort to High effort)
THERE ARE MANY WAYS TO AID DATA CONSUMPTION
SHOW me what is happening with the data
EXPLAIN to me why it’s happening
Allow me to EXPLORE and figure it out
Just EXPOSE the data to me
EDUCATION
PREDICTING MARKS
What determines a child’s marks?
Do girls score better than boys?
Does the choice of subject matter?
Does the medium of instruction matter?
Does community or religion matter?
Does their birthday matter?
Does the first letter of their name matter?
TN CLASS X: ENGLISH
TN CLASS X: SOCIAL SCIENCE
TN CLASS X: MATHEMATICS
Axes on each chart: 0 to 100 in steps of 5, and 0 to 40,000 in steps of 5,000
ICSE 2013 CLASS XII: TOTAL MARKS
CBSE 2013 CLASS XII: ENGLISH MARKS
DETECTING FRAUD
“We know meter readings are incorrect, for various reasons.
We don’t, however, have the concrete proof we need to start the process of meter reading automation.
Part of our problem is the volume of data that needs to be analysed. The other is the inexperience in tools or analyses to identify such patterns.”
ENERGY UTILITY
This plot shows the frequency of all meter readings from Apr-2010 to Mar-2011. An unusually large number of
readings are aligned with the tariff slab boundaries.
This clearly shows collusion of some form with the customers.
Apr-10 May-10 Jun-10 Jul-10 Aug-10 Sep-10 Oct-10 Nov-10 Dec-10 Jan-11 Feb-11 Mar-11
217 219 200 200 200 200 200 200 200 350 200 200
250 200 200 200 201 200 200 200 250 200 200 150
250 150 150 200 200 200 200 200 200 200 200 150
150 200 200 200 200 200 200 200 200 200 200 50
200 200 200 150 180 150 50 100 50 70 100 100
100 100 100 100 100 100 100 100 100 100 110 100
100 150 123 123 50 100 50 100 100 100 100 100
0 111 100 100 100 100 100 100 100 100 50 50
0 100 27 100 50 100 100 100 100 100 70 100
1 1 1 100 99 50 100 100 100 100 100 100
This happens with specific customers, not randomly. Here are such customers’ meter readings.
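One simple way to put a number on this alignment is the share of a customer's readings that land exactly on a slab boundary. The Python sketch below is an illustration only. The slab width of 50 units and the name boundary_share are assumptions, not details from the utility; the two rows are copied from the readings table.

```python
def boundary_share(readings, slab_width=50):
    """Fraction of non-zero readings that are exact multiples of slab_width.

    slab_width=50 is an assumed slab size, used here for illustration.
    """
    nonzero = [r for r in readings if r > 0]
    if not nonzero:
        return 0.0
    on_boundary = sum(1 for r in nonzero if r % slab_width == 0)
    return on_boundary / len(nonzero)


# Monthly readings for two customers, Apr-10 to Mar-11, from the table.
customer_a = [217, 219, 200, 200, 200, 200, 200, 200, 200, 350, 200, 200]
customer_b = [100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 110, 100]

for name, readings in [("customer_a", customer_a), ("customer_b", customer_b)]:
    print(name, f"{boundary_share(readings):.0%}")
```

A customer whose share is far above what honest readings would produce is a candidate for a closer look.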
Section Apr-10 May-10 Jun-10 Jul-10 Aug-10 Sep-10 Oct-10 Nov-10 Dec-10 Jan-11 Feb-11 Mar-11
Section 1 70% 97% 136% 65% 110% 116% 121% 107% 114% 88% 74% 109%
Section 2 66% 92% 66% 87% 70% 64% 63% 50% 58% 38% 41% 54%
Section 3 90% 46% 47% 43% 28% 31% 50% 32% 19% 38% 8% 34%
Section 4 44% 24% 36% 39% 21% 18% 24% 49% 56% 44% 31% 14%
Section 5 4% 63% -27% 20% 41% 82% 26% 34% 43% 2% 37% 15%
Section 6 18% 23% 30% 21% 28% 33% 39% 41% 39% 18% 0% 33%
Section 7 36% 51% 33% 33% 27% 35% 10% 39% 12% 5% 15% 14%
Section 8 22% 21% 28% 12% 24% 27% 10% 31% 13% 11% 22% 17%
Section 9 19% 35% 14% 9% 16% 32% 37% 12% 9% 5% -3% 11%
If we define the “extent of fraud” as the percentage excess of the 100-unit meter reading, the value varies considerably across sections and over time.
New section manager arrives
… and is transferred out
… with some explainable anomalies.
Why would these happen?
PARLIAMENT DECISIONS
UPA's best cabinet performance was last Friday, with a record 23 decisions taken in a single day, including some long-pending key reform measures.
The only other such times were Feb 23, 2008 (28 decisions) & Dec 26, 2008 (23 decisions).
Nearly two-thirds of decisions are taken on Thursday sessions, which is also visible on the calendar alongside.
* CCEA: Cabinet Committee on Economic Affairs
** CCI: Cabinet Committee on Infrastructure
Mon 63 5%
Tue 56 4%
Wed 105 8%
Thu 854 65%
Fri 223 17%
Sat 6 0%
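The percentages in this table follow directly from the counts. A small Python check, with illustrative variable names:

```python
# Decisions per weekday, copied from the table above.
decisions_by_weekday = {"Mon": 63, "Tue": 56, "Wed": 105, "Thu": 854, "Fri": 223, "Sat": 6}

total = sum(decisions_by_weekday.values())
shares = {day: count / total for day, count in decisions_by_weekday.items()}

for day, share in shares.items():
    print(f"{day} {decisions_by_weekday[day]:>4} {share:.0%}")
```

Thursday's share is what the slide describes as nearly two-thirds of all decisions.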
RESTAURANT FOUND AN UNUSUAL DIP IN SALES
A restaurant chain had data for every single transaction made over a few years. Plotting this as a time series showed them nothing unusual. However, the same data on a calendar map reveals a very different story.
Specifically, at the bottom-left point-of-sale terminal, sales dip every Wednesday. At the bottom-right point-of-sale terminal, sales rise every Wednesday (almost as if to compensate for the loss).
It turns out that the manager closes the bottom-left counter every Wednesday afternoon due to a shortage of staff, assuming that it results in no loss of sales. There is, however, a net loss every Wednesday.
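The data layout behind a calendar map is easy to sketch: one row per week, one column per weekday. The Python below is a minimal illustration using only the standard library. The sales figures are hypothetical, and calendar_grid and weekday_means are names invented for this example.

```python
from collections import defaultdict
from datetime import date, timedelta


def calendar_grid(daily_values):
    """Arrange {date: value} into ISO-week rows with 7 weekday slots (Mon..Sun)."""
    grid = defaultdict(lambda: [None] * 7)
    for day, value in daily_values.items():
        iso_year, iso_week, iso_weekday = day.isocalendar()
        grid[(iso_year, iso_week)][iso_weekday - 1] = value
    return dict(sorted(grid.items()))


def weekday_means(daily_values):
    """Average value per weekday (0 = Monday), the column pattern a calendar map exposes."""
    buckets = defaultdict(list)
    for day, value in daily_values.items():
        buckets[day.weekday()].append(value)
    return {wd: sum(vals) / len(vals) for wd, vals in sorted(buckets.items())}


# Hypothetical data: four weeks of flat sales with a dip every Wednesday.
start = date(2013, 1, 7)  # a Monday
sales = {}
for offset in range(28):
    day = start + timedelta(days=offset)
    sales[day] = 60.0 if day.weekday() == 2 else 100.0

for week, row in calendar_grid(sales).items():
    print(week, row)
print(weekday_means(sales))
```

In a time series the Wednesday dip is one wiggle among many. In the grid it lines up as a single column, which is why the weekday pattern becomes visible.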
BANK FOUND ALL LOANS BEFORE 20TH POOR
The personal finance division of a bank, focusing on retail loans, drove its sales through a branch sales team. A study of the non-performing assets of loans generated over the course of one year shows a strange pattern.
Every loan disbursed after the 20th of the month, i.e. from the 21st to the end of the month, shows consistently lower non-performing assets (i.e. better quality) than any loan disbursed prior to the 20th.
The bank mapped this back to its incentive scheme. The sales team’s commission is based only on loans disbursed until the 20th. Hence new loans are squeezed into this period without regard for their quality.
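A comparison of this kind can be sketched by splitting loans at a cutoff day of the month and comparing the share that turned non-performing. The Python below is a minimal illustration with hypothetical loans. Treating the 20th itself as part of the first period is an assumption, and npa_rate_by_period is an invented name.

```python
from datetime import date


def npa_rate_by_period(loans, cutoff_day=20):
    """Share of non-performing loans, split by disbursal day of month.

    Loans disbursed on or before cutoff_day go in the first bucket (an assumption).
    """
    counts = {"on_or_before_cutoff": [0, 0], "after_cutoff": [0, 0]}  # [npa, total]
    for disbursed_on, is_npa in loans:
        key = "on_or_before_cutoff" if disbursed_on.day <= cutoff_day else "after_cutoff"
        counts[key][0] += int(is_npa)
        counts[key][1] += 1
    return {key: (npa / total if total else 0.0) for key, (npa, total) in counts.items()}


# Hypothetical loans: (disbursal date, became non-performing?)
loans = [
    (date(2012, 5, 3), True), (date(2012, 5, 18), False),
    (date(2012, 5, 19), True), (date(2012, 5, 20), False),
    (date(2012, 5, 22), False), (date(2012, 5, 25), False),
    (date(2012, 5, 28), True), (date(2012, 5, 30), False),
]
rates = npa_rate_by_period(loans)
print(rates)
```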
Analytics can detect something that you’re specifically looking for.
It takes a visual to detect what we don’t know to look for.
This representation, known as a
calendar map, can show some
interesting patterns, particularly
weekday-based patterns, as the next
example will show.
WHEN TO INVEST
This visual shows the returns from buying Cipla’s stock in any given month and selling it in another.
The colour of each cell is the return (red is low, green is high) if you had invested in the stock in a given month and sold it in another. For example, this mild red is the slightly negative return if you had bought Cipla stock in Mar 2011 (the row) and sold it in Jun 2011 (the column).
Colour scale: -50% to +50% returns
Profits Made: Over the last 6 years, you would have beaten a 10% inflation about 82% of the time and lost out about 18% of the time. So, mostly, you would have made money on Cipla, with an average return of 14.9%.
Highest Returns: An average return of 14.1% has been observed when held for a period of one year, with a maximum of 79.6% if sold in Dec 2009 after being held for a year, and a maximum of 486.9% if sold at the end of Nov 2007 after holding for a month. The highest stock price was Rs 414 in Nov/Dec 2012.
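The grid behind a visual like this is a matrix of returns for every (buy month, sell month) pair. The Python sketch below uses hypothetical prices, not actual Cipla data, and plain returns that are not annualised or adjusted for the holding period. The slide's own calculation may differ, and return_matrix and share_beating are invented names.

```python
def return_matrix(prices):
    """Map (buy_index, sell_index) -> simple return, for every sell month after the buy month."""
    returns = {}
    for buy, buy_price in enumerate(prices):
        for sell in range(buy + 1, len(prices)):
            returns[(buy, sell)] = prices[sell] / buy_price - 1
    return returns


def share_beating(returns, threshold):
    """Fraction of buy/sell pairs whose return exceeds the threshold."""
    values = list(returns.values())
    return sum(1 for r in values if r > threshold) / len(values)


# Hypothetical month-end prices, for illustration only.
prices = [100.0, 120.0, 90.0, 150.0]
matrix = return_matrix(prices)
beat_share = share_beating(matrix, 0.10)
print(matrix)
print(f"{beat_share:.0%} of buy/sell pairs returned more than 10%")
```

Each entry corresponds to one coloured cell: the row is the buy month and the column is the sell month.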
Movies on the IMDb
https://gramener.com/imdb/
Axes: MORE VOTES, BETTER RATED
Legend: Many unwatched movies, Few unwatched movies, Mix of watched & unwatched, Few watched movies, Many watched movies
Labelled movies: The Shawshank Redemption, The Godfather, The Dark Knight, Titanic, The Phantom Menace, Twilight, New Moon, Wild Wild West, Transformers, The Good, The Bad, The Ugly, 12 Angry Men, 7 Samurai, Taare Zameen Par, Rang De Basanti, Yojimbo, 3 Idiots
MLA attendance at the Assembly
Karnataka, 2008-2012
Legend: < 50, < 75, < 95, < 100, = 100
SHOW me what is happening with the data
EXPLAIN to me why it’s happening
Allow me to EXPLORE and figure it out
Just EXPOSE the data to me
… to inform and to entertain
PERFORMANCE: GIRLS VS BOYS
Subject       Girls higher by   Girls   Boys
Physics       0                 119     119
Chemistry     1                 123     122
English       4                 130     126
Computers     6                 137     131
Biology       6                 129     123
Mathematics   11                123     112
Language      11                152     141
Accounting    12                138     126
Commerce      13                127     114
Economics     16                142     126
Names shown: Jain, Harini, Shweta, Sneha, Pooja, Ashwin, Shah, Deepti, Sanjana, Varshini, Ezhumalai, Venkatesan, Silambarasan, Pandiyan, Kumaresan, Manikandan, Thirupathi, Agarwal, Kumar, Priya
Based on the results of the 20 lakh students taking the Class XII exams in Tamil Nadu over the last 3 years, it appears that the month you were born in can make a difference of as much as 120 marks out of 1,200.
June-borns score the lowest
The marks shoot up for Aug-borns
… and peak for Sep-borns
120 marks out of 1200 explainable by month of birth
An identical pattern was observed in 2009 and 2010…
… and across districts, gender, subjects, and class X & XII.
“It’s simply that in Canada the eligibility cutoff for age-class hockey is January 1. A boy who turns ten on January 2, then, could be playing alongside someone who doesn’t turn ten until the end of the year—and at that age, in preadolescence, a twelve-month gap in age represents an enormous difference in physical maturity.”
-- Malcolm Gladwell, Outliers
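The birth-month comparison comes down to a group-by: average the marks per month of birth and look at the gap between the best and worst months. The Python sketch below uses a handful of hypothetical records, not the Tamil Nadu results, and mean_marks_by_month is an invented name.

```python
from collections import defaultdict


def mean_marks_by_month(records):
    """Average total marks per birth month from (birth_month, marks) pairs."""
    buckets = defaultdict(list)
    for month, marks in records:
        buckets[month].append(marks)
    return {month: sum(vals) / len(vals) for month, vals in sorted(buckets.items())}


# Hypothetical records: (birth month 1-12, total marks out of 1,200).
records = [(6, 700), (6, 720), (8, 760), (8, 780), (9, 790), (9, 810)]
means = mean_marks_by_month(records)
gap = max(means.values()) - min(means.values())
print(means, gap)
```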
Lok Sabha (2004 onwards)
Win %: The number of winning candidates as a % of candidates in the age group
Candidates: The number of candidates in each age group
Age group: 25-30 30-35 35-40 40-45 45-50 50-55 55-60 60-65 65-70 70-75 75-80 80-85 85-90
Win %: 1% 2% 4% 6% 9% 11% 14% 11% 16% 18% 22% 22% 33%
Axes: Win % (0% to 40%), Candidates (0 to 2,500)
Assembly elections (2004 onwards)
Win %: The number of winning candidates as a % of candidates in the age group
Candidates: The number of candidates in each age group
Age group: 25-30 30-35 35-40 40-45 45-50 50-55 55-60 60-65 65-70 70-75 75-80 80-85 85-90
Win %: 2% 4% 6% 9% 12% 15% 17% 15% 16% 18% 18% 20% 27%
Axes: Win % (0% to 30%), Candidates (0 to 14,000)
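Win % as defined on these charts is the number of winners divided by the number of candidates in each age group. The Python sketch below is a minimal illustration. The candidate records are hypothetical, the 5-year buckets are assumed to include their lower bound, and win_rate_by_age_group is an invented name.

```python
from collections import defaultdict


def win_rate_by_age_group(candidates, width=5):
    """Return {(low, high): (n_candidates, win_share)} for age buckets of the given width."""
    stats = defaultdict(lambda: [0, 0])  # bucket start -> [winners, candidates]
    for age, won in candidates:
        low = (age // width) * width
        stats[low][0] += int(won)
        stats[low][1] += 1
    return {(low, low + width): (total, wins / total)
            for low, (wins, total) in sorted(stats.items())}


# Hypothetical candidates: (age, won?)
candidates = [(27, False), (29, False), (41, False), (43, True),
              (44, False), (62, True), (63, False), (86, True)]
rates_by_age = win_rate_by_age_group(candidates)
for bucket, (n, share) in rates_by_age.items():
    print(bucket, n, f"{share:.0%}")
```

Keeping the candidate count next to each rate matters here: the oldest groups on the charts have the highest win % but contain very few candidates.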
More contestants did not reduce the winner margin
Karnataka, Assembly Elections 2008
Axes: # contestants (0 to 18), Winner margin (0% to 60%)
More contestants did reduce the runner-up margin
Karnataka, Assembly Elections 2004
Axes: # contestants (0 to 18), Runner-up margin (0% to 60%)
VISUALISING THE MAHABHARATA
How does the Mahabharata, one of the largest epics at 1.8 million words, lend itself to text analytics?
Can this ‘unstructured data’ be processed to extract analytical insights?
What does sentiment analysis of this tome convey?
Is there a better way to explore relations between characters?
How can closeness of characters be analysed & visualized?
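One common way to approach the closeness question is to count how often two characters are mentioned in the same sentence. This is an assumption for illustration, not necessarily the method used in this work. The Python sketch below runs on a made-up snippet rather than the actual text, and cooccurrence is an invented name.

```python
from collections import Counter
from itertools import combinations
import re


def cooccurrence(text, characters):
    """Count how often each pair of character names appears in the same sentence."""
    pairs = Counter()
    for sentence in re.split(r"[.!?]+", text):
        words = set(re.findall(r"[A-Za-z]+", sentence))
        present = sorted(name for name in characters if name in words)
        pairs.update(combinations(present, 2))
    return pairs


# A made-up snippet for illustration, not a quotation from the epic.
text = ("Arjuna spoke to Krishna. Krishna answered Arjuna. "
        "Bhima and Arjuna walked on. Karna stood alone.")
counts = cooccurrence(text, ["Arjuna", "Krishna", "Bhima", "Karna"])
print(counts.most_common())
```

The resulting pair counts can serve as edge weights in a network diagram of characters.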
What topics did parties focus on during questions?
Karnataka, 2008-2012
Legend: BJP focus, JD(S) focus, INC focus
Departments shown: Adult Education, Administrative Reforms, Agricultural Marketing, Agriculture, Animal Husbandry, Cooperative, Excise, Finance, Fisheries, Fisheries & Inland water transport, Food & Civil Supplies, Forest, Fuel, Haz & Wakf, Health and family welfare, Higher Education, Home, Horticulture, Housing, Information & Technology, Kannada & Culture, Labour, Law & Human Rights, Major & Medium Industries, Medical Education, Medium and Large Industries, Mines & Geology, Minor Irrigation, Muzrai, P.W.D., Parliamentary Affairs and Human Rights, Planning, Planning and Statistics, Primary and Secondary Education, Primary Education, Prison, Public Library, Revenue, Rural Development and Panchayat Raj, Rural Water Supply, Rural Water Supply and Sanitation, Sericulture, Small Scale Industries, Small Industries, Social Welfare, Sugar, Textile, Tourism, Transport, Transportation, Urban Development, Water Resources, Woman & Child Development, Youth and Sports, Youth Service & Sports
What topics did the young & old focus on during questions?
Karnataka, 2008-2012
Legend: Young, Old
Departments shown: P.W.D., Health and family welfare, Revenue, Rural Development and Panchayat Raj, Social Welfare, Urban Development, Water Resources, Minor Irrigation, Fuel, Housing, Agriculture, Primary Education, Primary and Secondary Education, Woman & Child Development, Higher Education, Home, Cooperative, Forest, Administrative Reforms, Labour, Food & Civil Supplies, Tourism, Finance, Animal Husbandry, Transportation, Horticulture, Muzrai, Haz & Wakf, Transport, Medical Education, Medium and Large Industries, Excise, Major & Medium Industries, Kannada & Culture, Textile, Fisheries, Parliamentary Affairs and Human Rights, Adult Education, Rural Water Supply and Sanitation, Mines & Geology, Small Industries, Youth and Sports, Sugar, Planning and Statistics, Agricultural Marketing, Rural Water Supply, Fisheries & Inland water transport, Small Scale Industries, Youth Service & Sports, Sericulture, Law & Human Rights, Prison, Planning, Information & Technology, Public Library
PARLIAMENT DECISIONS
PRE-2009 vs 2009 AND AFTER
Words shown: promotion, scheme, project, approved, development, agreement, amendment, central, act, section, limited, bill, laning, plan, government, new, ltd, phase, approval, sector, state, setting, investment, pradesh, policy, four, programme, amendments, indian, extension, institute, commission, nhdp, technology, proposal, iii, implementation, fund, establishment, equity, assistance, cooperation, transfer, infrastructure, corporation, international, mou, cabinet, company, public, year, revised, construction, services, continuation, approves, states, education, additional, financial, revision, sponsored, port, mission, centrally, basis, signing, protection, management, capital, bank, two, projects, research, upgradation, rural, special, land, delhi, employees, existing, committee, relief, convention, six, crore, payment, power, health, cost, package, institutions, acquisition, control, restructuring, air, grant, field, university, scheduled
Decisions related to intervention, assistance and relief were almost entirely concentrated in pre-2009.
The number of international agreements declined dramatically between pre-2009 and post-2009.
A significant rise in the number of decisions related to the States is seen post-2009, in contrast with the focus on “Central” pre-2009.
Decisions to increase the number of lanes on highways grew significantly post-2009, especially as part of the CCI (Cabinet Committee on Infrastructure) decisions.
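A before-and-after word comparison like this one can be approximated by counting words in each period and checking where a word's occurrences fall. The Python sketch below is a minimal illustration. The decision titles are hypothetical, and word_counts and share_in_first are invented names.

```python
from collections import Counter
import re


def word_counts(titles):
    """Lower-cased word frequencies across a list of decision titles."""
    counts = Counter()
    for title in titles:
        counts.update(re.findall(r"[a-z]+", title.lower()))
    return counts


def share_in_first(before, after, word):
    """Share of a word's occurrences that fall in the first period, or None if it is absent."""
    total = before[word] + after[word]
    return before[word] / total if total else None


# Hypothetical decision titles, for illustration only.
pre_2009 = ["Relief package for farmers", "Agreement with Japan", "Central assistance for relief"]
post_2009 = ["Six laning of highway in State", "State highway laning project"]

before, after = word_counts(pre_2009), word_counts(post_2009)
for word in ["relief", "laning", "state"]:
    print(word, share_in_first(before, after, word))
```

A share near 1.0 marks a word concentrated in the earlier period, and a share near 0.0 marks one concentrated in the later period.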
SHOW me what is happening with the data
EXPLAIN to me why it’s happening
Allow me to EXPLORE and figure it out
Just EXPOSE the data to me
… to connect the dots for your readers
SOCIAL MEDIA IN AUTOMATED RECRUITING
Most followed in Bangalore: Sudar, Yahoo!; Anand C, Consultant; Kiran, Hasgeek; Anand S, Gramener
Most followed in Singapore: Mugunth, Steinlogic; Honcheng, buUuk; Sau Sheong, HP Labs; Lim Chee Aung
Legend: 1 follower, 100 followers; A follows B (or) B follows A
DIRECTORSHIPS AT THE TATAS
Every person who was a Director at the Tata Group is shown here as an orange circle. The size of the circle is based on the number of directorship positions held over their lifetime.
Every company in the Tata Group is shown here as a blue circle. The size of the circle is based on the number of directors the company has had over time.
Every directorship relation is shown by a line. If a person has held a directorship position at a company, the two are connected by a line.
The group appears to be divided into two clusters based on the network of directorship roles.
Companies: Tata Teleservices, Tata Consultancy Services, Tata Business Support Services, Tata Global Beverages, Tata Infotech (merged), Tata Toyo Radiator, Honeywell Automation India, Tata Communications, A G C Networks, Tata Technologies
Companies: Tata Projects, Tata Power, Tata Finance, Idea Cellular, Tata Motors, Tata Sons, Tata Steel, Tayo Rolls, Tata Securities, Tata Coffee, Tata Investment Corp
Directors: A J Engineer, H H Malgham, H K Sethna, Keshub Mahindra, Ravi Kant, Russi Mody, Sujit Gupta
Directors: A S Bam, Amal Ganguli, D B Engineer, D N Ghosh, M N Bhagwat, N N Kampani, U M Rao
Directors: B Muthuraman, Ishaat Hussain, J J Irani, N A Palkhivala, N A Soonawala, R Gopalakrishnan, Ratan Tata, S Ramadorai, S Ramakrishnan
Prominent leaders bridge the groups
Second group of companies
First group of companies
Some directors are mainly associated with the first group of companies
Some directors are mainly associated with the second group of companies
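The structure described here is a two-sided network: people on one side, companies on the other, and a line for each directorship. The Python sketch below is a minimal illustration with placeholder names, not the actual Tata data. Circle size is approximated by the number of distinct links, and the two company groups are supplied by hand rather than detected. build_network and bridges are invented names.

```python
from collections import defaultdict


def build_network(directorships):
    """Build person -> companies and company -> persons maps from (person, company) pairs."""
    by_person, by_company = defaultdict(set), defaultdict(set)
    for person, company in directorships:
        by_person[person].add(company)
        by_company[company].add(person)
    return by_person, by_company


def bridges(by_person, group_a, group_b):
    """People holding directorships in both groups of companies."""
    return sorted(person for person, companies in by_person.items()
                  if companies & group_a and companies & group_b)


# Hypothetical directorships with placeholder names.
directorships = [
    ("Director A", "Company 1"), ("Director A", "Company 2"),
    ("Director B", "Company 3"), ("Director B", "Company 4"),
    ("Director C", "Company 2"), ("Director C", "Company 3"),
]
by_person, by_company = build_network(directorships)

# Circle size here is the number of distinct links per node.
circle_size = {name: len(links) for name, links in {**by_person, **by_company}.items()}
bridging = bridges(by_person, {"Company 1", "Company 2"}, {"Company 3", "Company 4"})
print(circle_size)
print(bridging)
```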
Similar network patterns have helped our clients:
• locate terrorists (who called each other but no one outside their network)
• de-duplicate customers (who share the same address and date of birth)
• analyse competitor strengths (based on the cluster of keywords in their patents)
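As an illustration of the de-duplication case, records can be grouped on a shared key. The Python sketch below is a minimal example with hypothetical customers. The normalisation (lower-casing and collapsing whitespace) is an assumption, and duplicate_groups is an invented name.

```python
from collections import defaultdict


def duplicate_groups(customers):
    """Group customer ids that share the same normalised address and date of birth."""
    groups = defaultdict(list)
    for customer_id, address, dob in customers:
        key = (" ".join(address.lower().split()), dob)
        groups[key].append(customer_id)
    return [ids for ids in groups.values() if len(ids) > 1]


# Hypothetical customer records: (id, address, date of birth).
customers = [
    (1, "12 Main Road, Chennai", "1980-04-02"),
    (2, "12  main road, chennai", "1980-04-02"),
    (3, "12 Main Road, Chennai", "1975-11-30"),
]
duplicates = duplicate_groups(customers)
print(duplicates)
```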
SHOW me what is happening with the data
EXPLAIN to me why it’s happening
Allow me to EXPLORE and figure it out
Just EXPOSE the data to me
… to allow your users to tell stories
VISUALISATION IS IMPERATIVE FOR
DATA → INSIGHTS → ACTION
Spot the unusual, Communicate patterns, Simplify decisions
We handle terabyte-size data via non-traditional analytics and visualise it in real-time.
A data analytics and visualisation company
gramener.com
for more examples