
OSM Lecture and Exercises (2011/10/13)

Akito Sakurai

Purposes of my L&E

• To know the basics of data analysis
• Through exercises

Note on Exercises

• Solve the problems (in fact they are not problems but just exercises) and hand in your solutions
• Follow the procedures described in the slides and report the results with your discussion
  – Experiment 1. Character recognition
  – Experiment 2. Classification of songs in Japanese
  – Experiment 3. Prediction of the USD/JPY rate

URL: http://www.sakurai.comp.ae.keio.ac.jp/2010OSMLandE.html

Let us start.

Predictions

• Humans have always wanted to predict
• Prediction of the seasons is one example,
  – since people did not have a calendar
  – for agriculture, they needed to know the best season to plant seeds
    • the temperature observed at one time is not reliable
• Astronomical observation was definitely important
  – One who was able to observe could be a ruler

Difficult to predict

• The motions of celestial bodies are predicted relatively well, even with models that do not really represent reality.
  – This is because they are phenomena whose motions are explained by simple mathematical or physical rules (with the accuracy needed at the time)
• But in many other cases a phenomenon has a very complex background
  – too complex to predict
• In many cases, the phenomena are probabilistic
  – More observations do not necessarily increase the chance of a correct prediction


But we do predict

• We humans, though, try to predict
• Even though reality is complex, each event happens to be (relatively) simple
  – Or it may be the case that "we believe it is simple"
  – Economic forecasting is a typical example.
  – Professionals predict with a lot of data and with their expertise, but, say, politicians predict too

Let us think it over

• Prediction or forecast is to say something in advance in time
• But in essence it is based on
  – instances, which are pairs of a (first) set of values (e.g. the position of a typhoon on a day) and a (second) set of values (e.g. the position of the typhoon on the next day)
  – the position of a typhoon today (the first value)
  to infer
  – the position of the typhoon tomorrow (the second value)

that is,

2492837490872352293841698332149821390117498738179470913241248481

23489928372398479823123984716723498723234239487923239487987123984798712223598728

28383169 ???

Known set of observations

New observation (one of the pair)

Note: data

• strings of numerals
• strings of characters (words); linguistic expressions
• sets of photographs
• sets of paintings
• sets of representations of behaviors
• sets of representations of sounds

Predictions: two types

0174532903490659052359880698131708726646104719761221730513962634

17364823420201500000064278767660444866025493969269848078

06981428 ???

Find out seemingly similar ones

0174532903490659052359880698131708726646104719761221730513962634

17364823420201500000064278767660444866025493969269848078

06981428 ???

Find out rules (ground truth)

x → x/10000000 → sin() → ∗10000000 → y
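The two types of prediction can be contrasted in a few lines. This is a minimal sketch, not the lecture's own code: the training pairs are hypothetical, generated from the rule y = sin(x), and the function names are made up for illustration.

```python
import math

# Hypothetical known observations (x, y), generated by the hidden rule y = sin(x).
train_x = [0.1 * k for k in range(16)]      # 0.0, 0.1, ..., 1.5
train_y = [math.sin(x) for x in train_x]

def predict_by_similarity(x_new):
    """Type 1: reuse the answer of the most similar known observation."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x_new))
    return train_y[nearest]

def predict_by_rule(x_new):
    """Type 2: apply the rule itself (assumed here to be already discovered)."""
    return math.sin(x_new)

x_new = 0.734
print(predict_by_similarity(x_new), predict_by_rule(x_new))
```

The first function can only be as good as the closest stored example; the second is exact, but only if the rule has been found.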

"structure" of data

Structures?

[Figure: CCG derivation of the sentence "Dr. Jekyll saw and Mr. Hyde ate a lemon ...", with categories such as S[dcl], S[dcl]/NP, (S[dcl]\NP)/NP, and NP[nb]. Source: J. Curran and S. Clark. C&C tools.]

[Figure: scatter plot of salmon and sea bass by lightness (2 to 10) and width (14 to 22). Source: R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification]

[Figure: USD/JPY: returns in four minutes (0.01%) vs. frequency (2001-2008); returns from -100 to 100, frequency from 0.00001 to 1]


Structure 1.

• For numerals (simplest and just dry)
  – Data themselves (structure?) representing normal and abnormal samples
  – Stimuli that cause changes in a time series
• For characters or strings of characters (intuitively, there seems to be some structure)
  – Habit of what characters to use (in Japanese: kanji, hiragana, or katakana; or smileys)
  – Habit of what words to use
• For pictures (photographs, paintings, etc.)
  – Objects themselves (depending on intention/objective)
  – Composition, how to draw, how to take photos

http://cert.yahoo.co.jp/text/digicame/chap2/c2_0302.html

Structure 2.

• For sounds (music, voices, singing of insects, etc.)
  – Instruments (musical instruments, sex/age/health condition, kinds of insects)
  – (for music) genre, players, etc.
• For behaviors
  – (shopping) purpose, for whom, for what, where, etc.
  – (web browsing) purpose, by what, for what, etc.

S&P500

[Figure: index level (0 to 1800) by year, 1950 to 2007]

If we could know structure

• we could find the best action to take
  – Health condition: we may go to a doctor
  – Time series: sell, buy, or wait
  – Literature: similar novels or different novels
  – Trends found on the web: go with or against the tide
  – In general, when we can predict, we can optimize, in a certain sense, the actions we take

How could we do it?

• In computer science, the ways to do this have long been studied in the field of "machine learning"
  – A research field within "artificial intelligence"
  – Why do machines "learn"?
    • Human learning is to understand, to memorize, and to use it afterwards
    • The point is to "understand"
    • Knowing the structures in data is the first step towards understanding
  – Why "machines"?
    • "Machines" here are computers, i.e. computing machines
  – Why not robots?
    • Robots need to learn. But learning is necessary in other machines as well.
  – Different from humans?
    • Not as intelligent as humans. Computers do not know the "real world".
    • But computers never complain about the huge amount of data that they are processing

How could we do it? (cont.)

• Many algorithms have been developed
• There are too many to know
  – Therefore I will not explain them
• In the following experiments, please allow me to give you the chance of just using a tool
  – You may later have access to far better (more expensive) tools or programming environments

Statistics?

• It was a completely different discipline 20 years ago.
• Now is the time for merging
• Still there are differences
• Statistics:
  – Simpler (statistical) structures compared to machine learning. Statistical tests are important. Numerical data come first.
• Machine learning:
  – Any structures and models. Not concerned with statistical tests. In many cases, there is less data than statistical tests require. Discrete or symbolic data are in daily use.

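[Figure: bell-shaped distribution curve, horizontal axis -3 to 3, vertical axis 0.1 to 0.4. Source: http://wwwcsteep.bc.edu/TIMSS1/database.html (calculus)]

[Figure: decision tree for contact lenses. Nodes: tear production rate (reduced / normal), astigmatism (no / yes), spectacle prescription (myope / hypermetrope). Leaf labels as extracted: none, soft, hard, none]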


Data mining?

• No difference
• The same researchers work in both fields.
• If there are differences:
  – Machine learning: emphasizes algorithms
    • Accuracy, speed, lightness, breadth of applications, representation
  – Data mining: huge amounts of data
    • Being able to take actions is what matters. Specific data is OK

Be a bit more concrete

• Simple applications

Examples of data

Make      Model        Year  Head inj. c.  Chest decel.  L. Leg  R. Leg  D/P     Protection       Doors  Weight  Size
Acura     Integra      87    599           35            791     262     Driver  manual belts     2      2350    lt
Acura     Integra RS   90    585           .             1545    1301    Driver  Motorized belts  4      2490    lt
Acura     Legend LS    88    435           50            926     708     Driver  d airbag         4      3280    med
Audi      80           89    600           49            168     1871    Driver  manual belts     4      2790    comp
Audi      100          89    185           35            998     894     Driver  d airbag         4      3100    med
BMW       325i         90    1036          56            865     .       Driver  d airbag         2      2862    comp
Buick     Century      91    815           47            1340    315     Driver  passive belts    4      2992    comp
Buick     Elect. Park  88    1467          54            712     1366    Driver  manual belts     4      3360    med
Buick     Le Sabre     90    .             35            1049    908     Driver  passive belts    2      3240    med
Buick     Regal        88    880           50            996     642     Driver  passive belts    2      3210    med
Cadillac  De Ville     90    423           39            541     1629    Driver  d airbag         4      3500    hev

Sepal length  Sepal width  Petal length  Petal width  Species
5.1           3.5          1.4           0.2          Iris-setosa
4.9           3            1.4           0.2          Iris-setosa
4.7           3.2          1.3           0.2          Iris-setosa
4.6           3.1          1.5           0.2          Iris-setosa
5             3.6          1.4           0.2          Iris-setosa
5.4           3.9          1.7           0.4          Iris-setosa
4.6           3.4          1.4           0.3          Iris-setosa
5             3.4          1.5           0.2          Iris-setosa
4.4           2.9          1.4           0.2          Iris-setosa
4.9           3.1          1.5           0.1          Iris-setosa
5.4           3.7          1.5           0.2          Iris-setosa
4.8           3.4          1.6           0.2          Iris-setosa

Day (running count)  Day of week  Room temp.  Alcohol the previous evening  Blood pressure LOW (mmHg)  Blood pressure HIGH (mmHg)
0                    Tue          18          none                          107                        153
1                    Wed          20          a little                      78                         132
2                    Thu          20          a little                      92                         133
3                    Fri          20          a little                      87                         130
5                    Sun          20          a little                      86                         134
6                    Mon          20          moderate                      90                         134
7                    Tue          18          a little                      87                         134
8                    Wed          18          a little                      104                        149
9                    Thu          20          a little                      83                         130
10                   Fri          20          moderate                      94                         131
11                   Sat          20          a little                      81                         137
12                   Sun          20          a little                      98                         137

Date        Open     High     Low      Close    Volume     Adj. Cl  YearWeek
27/11/2000  53.6875  54.5156  51.0312  51.25    40198100   51.250   200049
28/11/2000  51.9375  53.1875  50.625   51       52037000   51.000   200049
29/11/2000  51.3125  53       50.3125  51.6875  55316000   51.688   200049
30/11/2000  50.1875  50.9375  45.1875  47.875   10840500   47.875   200049
01/12/2000  49.1875  51.625   47.25    48.5     70468000   48.500   200049
04/12/2000  49.0625  49.5625  45       45.8125  9501200    45.813   200050
05/12/2000  47.75    52.125   47.3125  52.125   90848900   52.125   200050
06/12/2000  52       53.5625  51.2656  51.4375  71419200   51.438   200050
07/12/2000  50.3125  51       49       49.9375  46448400   49.938   200050
08/12/2000  51.9375  53.25    51       52.375   55400200   52.375   200050
11/12/2000  52.875   55.75    52.625   54.8125  78621500   54.813   200051
12/12/2000  54.75    55.125   53.3125  54.375   39485300   54.375   200051
13/12/2000  55.1875  55.25    50.8125  51.125   54330600   51.125   200051
14/12/2000  51.0625  52.5625  50.875   50.9375  46244400   50.938   200051
15/12/2000  50.0625  50.1875  47.125   48.1719  100237900  48.172   200051
18/12/2000  49       50.125   42.3125  42.9375  126032400  42.938   200052
19/12/2000  43       46       41.5     41.75    99018800   41.750   200052

Statistical representation

• An individual datum shows no regularity, whereas a set of data as a whole shows some regularity
  – Uniform distribution
    • Fair dice, fair coin tossing
  – Normal distribution
    • Composition of many independent factors
  – Zipf's law, 80/20 rule, 1/f fluctuations

Normal distribution

• Sum of many independent factors
  – Distribution of the number of heads in 100 coin tosses, observed over 1000 experiments
  – (not correct in fact) distribution of the scores of an exam

[Figure: distribution of exam scores, from http://wwwcsteep.bc.edu/TIMSS1/database.html (calculus)]

[Figure: standard normal density, x from -3 to 3, density from 0.1 to 0.4]

Plot[(1/Sqrt[2 Pi]) Exp[-x^2/2], {x, -3, 3}]
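The coin-tossing example can be tried directly. The following is a minimal sketch using only the Python standard library; the seed and the text histogram are arbitrary choices made for illustration, not part of the lecture.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so that the run is repeatable (arbitrary choice)

def count_heads(n_tosses=100):
    """Number of heads in n_tosses tosses of a fair coin."""
    return sum(random.random() < 0.5 for _ in range(n_tosses))

# 1000 experiments of 100 tosses each
heads = [count_heads() for _ in range(1000)]
mean = sum(heads) / len(heads)
var = sum((h - mean) ** 2 for h in heads) / len(heads)
print("mean", mean, "variance", var)  # binomial theory: mean 50, variance 25

# Crude text histogram: one '#' per experiment
hist = Counter(heads)
for k in sorted(hist):
    print(k, "#" * hist[k])
```

The histogram should look roughly bell-shaped around 50, which is the sense in which a sum of many independent factors approaches the normal distribution.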

Zipf's law

• The frequency of word usage plotted against rank obeys a power law. Other examples:
  – Population ranks of cities
  – Hit ranks of web pages
  – Income rankings
  – Property rankings


Rules in machine learning

• Conditional statements
  – IF … THEN …
  – with confidence ??%
• Decision tree
  – In the following slides
• Neural networks
• and MANY others

If-then rule

If tear-prod-rate = reduced then contact-lenses = none
If age = young and astigmatism = no and tear-prod-rate = normal then contact-lenses = soft
If age = pre-presbyopic and astigmatism = no and tear-prod-rate = normal then contact-lenses = soft
If age = presbyopic and spectacle-prescrip = myope and astigmatism = no then contact-lenses = none
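The four rules listed here translate directly into conditional statements. This is a minimal sketch: the function name is hypothetical, the rules are tried in the order given, and returning None when no rule fires is a choice made here, since the rules do not cover every case.

```python
def contact_lens_rule(age, spectacle_prescrip, astigmatism, tear_prod_rate):
    """Apply the four if-then rules in order; return None if none of them fires."""
    if tear_prod_rate == "reduced":
        return "none"
    if age == "young" and astigmatism == "no" and tear_prod_rate == "normal":
        return "soft"
    if age == "pre-presbyopic" and astigmatism == "no" and tear_prod_rate == "normal":
        return "soft"
    if age == "presbyopic" and spectacle_prescrip == "myope" and astigmatism == "no":
        return "none"
    return None  # not covered by these four rules

print(contact_lens_rule("young", "myope", "no", "normal"))
print(contact_lens_rule("young", "myope", "yes", "normal"))
```

The second call shows a case that the four rules leave undecided; a complete rule set would need further rules for astigmatism = yes.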

age             spectacle-prescrip  astigmatism  tear-prod-rate  contact-lenses
young           myope               no           reduced         none
young           myope               no           normal          soft
young           myope               yes          reduced         none
young           myope               yes          normal          hard
young           hypermetrope        no           reduced         none
young           hypermetrope        no           normal          soft
young           hypermetrope        yes          reduced         none
young           hypermetrope        yes          normal          hard
pre-presbyopic  myope               no           reduced         none
pre-presbyopic  myope               no           normal          soft
pre-presbyopic  myope               yes          reduced         none
pre-presbyopic  myope               yes          normal          hard
pre-presbyopic  hypermetrope        no           reduced         none
pre-presbyopic  hypermetrope        no           normal          soft
pre-presbyopic  hypermetrope        yes          reduced         none
pre-presbyopic  hypermetrope        yes          normal          none
presbyopic      myope               no           reduced         none
presbyopic      myope               no           normal          none
presbyopic      myope               yes          reduced         none
presbyopic      myope               yes          normal          hard
presbyopic      hypermetrope        no           reduced         none
presbyopic      hypermetrope        no           normal          soft
presbyopic      hypermetrope        yes          reduced         none
presbyopic      hypermetrope        yes          normal          none

Decision tree

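[Figure: decision tree. Internal nodes: tear-prod-rate (branches: reduced, normal), astigmatism (branches: no, yes), spectacle-prescrip (branches: myope, hypermetrope). Leaf labels as extracted: none, soft, hard, soft]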

Neural networks

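$o = \sigma\left(\sum_{i=0}^{n} w_i x_i\right)$

with inputs $x_1, x_2, \ldots, x_n$, weights $w_1, w_2, \ldots, w_n$, and

$\sigma(x) \equiv \dfrac{1}{1 + e^{-x}}$

Composition of many functions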
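A single unit of such a network, and the composition of units, can be sketched as follows. The sum on the slide runs from i = 0; this sketch assumes the common convention that x_0 = 1 is a constant bias input, which the slide does not state. The function names are hypothetical.

```python
import math

def sigmoid(x):
    """sigma(x) = 1 / (1 + e^(-x))"""
    return 1.0 / (1.0 + math.exp(-x))

def neuron(weights, inputs):
    """o = sigma(sum_{i=0..n} w_i x_i), with x_0 = 1 assumed as a bias input.

    weights: [w_0, w_1, ..., w_n]; inputs: [x_1, ..., x_n].
    """
    xs = [1.0] + list(inputs)
    return sigmoid(sum(w * x for w, x in zip(weights, xs)))

def two_layer(hidden_weights, output_weights, inputs):
    """Composition of functions: the outputs of hidden units feed an output unit."""
    hidden = [neuron(w, inputs) for w in hidden_weights]
    return neuron(output_weights, hidden)

print(neuron([0.0, 1.0, -1.0], [2.0, 2.0]))  # the weighted sum is 0 here
```

Learning then amounts to choosing the weights so that the outputs match known examples; that step is not shown here.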

Representation of results

• A more important issue in data mining
• In many cases, we have to represent the result and communicate it to others (including ourselves)
  – This does not apply when only a prediction is requested
  – Nor does it apply when only the results are important
  – But in many cases an explanation (why the prediction is deduced) is asked for
  – Comprehensible explanations are requested
• Quite unfortunately, accuracy and comprehensibility of results are in a trade-off relation
• In machine learning, you can select either way (but of course not both), e.g.
  – Decision tree: comprehensible
  – SVM (support vector machine) and NN: accuracy and flexibility

Summary

• Objects
  – Numbers, and anything that can be represented by numbers
    • e.g. words, images, music, numbers
• Objective and approach
  – Prediction, inference; a knowledge system that learns
    • Find rules that humans cannot describe explicitly (rules for prediction and/or inference)
  – Separate objects from noise, and describe them
    • Noise: random, without correlation
    • Objects: things that have structure
• Instruments and tools
  – Statistics, artificial intelligence, and machine learning!!


Note

• There are many good tools. In practice, what matters is selecting good features
• A "feature" is something that is obtained by observing the object, or calculated from the observations, and is used to represent the object.
  – (maybe) Useless features:
    • Height for the diagnosis of influenza
    • Cloud cover of a day for predicting stock prices (some say that weather affects stock prices, though.)
• Features, not techniques
  – Success or failure is almost entirely decided by the features
  – Features: values obtained or calculated from observations of objects
• In general, some combination of features decides the class of the objects we are considering

Experiments

• Weka as a machine learning tool
  – See the separate slides, please.
• Experiment 1: character recognition
  – Just numerals, to gain experience
• Experiment 2: genre inference from lyrics of songs (sorry, this one is based on Japanese kana)
  – It is not difficult to get data if you know Japanese
• Experiment 3: prediction of USD/JPY

Experiment 1: Character recognition

• Works very well in the real world
  – To read license plates at highway tollgates
    • N-system in Japan (from Wikipedia)
  – To read ZIP codes on envelopes. These days, handwritten addresses and names are read, too.

An application

Hitachi information and control

Applications

• Restaurant menu reader
• Universal access

http://tabelog.com/imgview/original?id=r173888580388

http://www.afb.org/afbpress/pub.asp?DocID=aw070605

http://ameblo.jp/20dai-makoto/day-20110624.html

http://www.thepotteries.org/walks/fenton1/7.htm

CR is difficult

• Humans are very good at reading characters, unless the letters are too distorted
• We are quite sure that it is very difficult to tell you how to read Japanese hiragana if they are handwritten.
  – If they are calligraphic, many of us cannot read them.
• Why is it difficult to read?
• It is quite difficult for us to study, or for teachers to teach us,
• because no one (so far) has been able to write down the rules.
• This does not necessarily mean that there are no rules.

霞たち木の芽もはるの雪降れば花なきさとも花ぞちりける


Data for C.R.

• A simple character recognition task
  – Only numerals
• Preprocessing has been done (preprocessing is far more difficult than recognition)
  – Separation (of characters from other characters)
  – Normalization (size, slant, center, etc.)
• But it still seems to be difficult
  – Suppose that we had to explain how to tell numerals apart to people who do not know numerals
• In essence, we do not know the rules, even though we can behave as if we knew them
  – We could induce "rules" from data
• Data source: UCI Machine Learning Repository, Optical Recognition of Handwritten Digits Data Set

Preprocessing

Giorgos Vamvakas

UCI Machine Learning Repository

Examples of characters in the data

Data format is simple

• 8 pixels × 8 pixels
• Every pixel has a value: 0 (white) … 16 (black)
• Numerals: 0 … 9

Look into optdigits.tes.csv and verify the format by yourself

Data format is simple

4. Relevant Information:
   We used preprocessing programs made available by NIST to extract normalized bitmaps of handwritten digits from a preprinted form. From a total of 43 people, 30 contributed to the training set and different 13 to the test set. 32x32 bitmaps are divided into nonoverlapping blocks of 4x4 and the number of on pixels are counted in each block. This generates an input matrix of 8x8 where each element is an integer in the range 0..16. This reduces dimensionality and gives invariance to small distortions.

5. Number of Instances
   optdigits.tra  Training  3823
   optdigits.tes  Testing   1797

6. Number of Attributes
   64 input + 1 class attribute

Read it in optdigits.names.txt by yourself
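The block-counting step described in the quoted text (32x32 bitmap, non-overlapping 4x4 blocks, count of "on" pixels) can be sketched as below. The bitmap is a hypothetical example, not a sample from the data set, and the function name is made up for illustration.

```python
def block_counts(bitmap, block=4):
    """Count the 'on' pixels in each non-overlapping block x block square.

    bitmap: square list of rows holding 0/1 values (32 x 32 in the data set).
    Returns a matrix of counts (8 x 8 with values 0..16 for a 32 x 32 input).
    """
    n = len(bitmap) // block
    return [
        [
            sum(bitmap[r * block + i][c * block + j]
                for i in range(block) for j in range(block))
            for c in range(n)
        ]
        for r in range(n)
    ]

# Hypothetical 32 x 32 bitmap: a vertical bar six pixels wide (columns 13..18).
bitmap = [[1 if 13 <= col <= 18 else 0 for col in range(32)] for _ in range(32)]
features = block_counts(bitmap)
print(features[0])
```

Shifting the bar by one pixel changes the counts only slightly, which illustrates the "invariance to small distortions" mentioned in the description.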

Procedures

• Get the data (I uploaded it to the lecture web site)
  – Features are just pixel brightness values (worst!)
• Create arff files (you already have csv files, so all you have to do is add headers). Use Notepad, for example. Take care when you save the file.

@relation OptDigitsTraining
@attribute 00 real
@attribute 01 real
……
@attribute 06 real
@attribute 07 real
@attribute 10 real
@attribute 11 real
@attribute 12 real
……
……
@attribute 76 real
@attribute 77 real
@attribute class {0,1,2,3,4,5,6,7,8,9}
@data
Here data come.

The relation name is at your disposal. The attribute names are at your disposal, too; there are 8 x 8 = 64 of them. The data come from optdigits.tra.csv
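Instead of typing the 64 attribute lines by hand, the header can be generated. This is a minimal sketch that follows the naming used in the header above (row digit followed by column digit); the function names are hypothetical, and it assumes the csv file contains only data rows with the class label in the last column.

```python
def arff_header(relation="OptDigitsTraining", size=8, n_classes=10):
    """Build the ARFF header: size*size real attributes plus a nominal class."""
    lines = ["@relation " + relation]
    for r in range(size):
        for c in range(size):
            lines.append("@attribute %d%d real" % (r, c))
    classes = ",".join(str(k) for k in range(n_classes))
    lines.append("@attribute class {" + classes + "}")
    lines.append("@data")
    return "\n".join(lines) + "\n"

def csv_to_arff(csv_path, arff_path, relation="OptDigitsTraining"):
    """Write the header followed by the unchanged csv rows."""
    with open(csv_path) as src, open(arff_path, "w") as dst:
        dst.write(arff_header(relation))
        dst.write(src.read())

header_lines = arff_header().splitlines()
print(header_lines[0], header_lines[1], header_lines[-2], sep="\n")
```

A call such as csv_to_arff("optdigits.tra.csv", "optdigits.tra.arff") would then produce a file that Weka can open, provided the csv file is in the current folder.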


Procedures: learning and test

• Create a decision tree
  – Weka: J48 under "trees"
• Get the accuracy by "10-fold cross validation"
• Get the accuracy when a separate test data set, optdigits.tes, is used.
• Look at the decision tree. Is there any meaning in it? I think you will not be able to find any. Why?
• Try other tools
  – neural network
  – SMO (one of many implementations of support vector machines)
  – naïve Bayes
• Compare accuracies and run times
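The idea behind "10-fold cross validation" is to split the data into ten parts, train on nine, test on the remaining one, and repeat so that every instance is tested exactly once. The sketch below shows only this bookkeeping; it is not Weka's implementation, which may differ in detail, and the majority-class "learner" is a deliberately trivial stand-in with hypothetical names.

```python
import random

def k_fold_accuracy(instances, train_fn, k=10, seed=0):
    """instances: list of (features, label) pairs.
    train_fn(training_list) must return a function: features -> predicted label.
    """
    data = instances[:]
    random.Random(seed).shuffle(data)
    correct = 0
    for fold in range(k):
        test = data[fold::k]                                   # every k-th instance
        train = [inst for i, inst in enumerate(data) if i % k != fold]
        model = train_fn(train)
        correct += sum(model(x) == y for x, y in test)
    return correct / len(data)

def majority_trainer(train):
    """Trivial learner: always predict the most frequent training label."""
    labels = [y for _, y in train]
    top = max(set(labels), key=labels.count)
    return lambda x: top

# Hypothetical data: 30 instances of class 0 and 10 of class 1.
toy = [([i], 0) for i in range(30)] + [([i], 1) for i in range(10)]
print(k_fold_accuracy(toy, majority_trainer))
```

Any learner should beat this majority baseline before its accuracy is taken seriously.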

Test data in images: test_images.zip, train_images.zip

How to choose other methods

• Click on + of "functions": this one is neural networks; SMO (SVM) is also here
• Click on + of "bayes": naïve Bayes

Experiment 2: classification of lyrics

• Words used in lyrics may differ among songs of different genres
• The same is true of waka or tanka (thirty-one-syllable verse)
  – Prof. Shizuo Mizutani (founder of the Mathematical Linguistics Society in Japan, founded in 1957) analyzed tanka of the Shirakaba-ha (group) and the Araragi-ha and succeeded in classifying the tanka by authors' group based on the words used
• Since it is a bit difficult to classify lyrics based on words (difficult for me to prepare), I set up an environment to classify them based on syllables or morae
  – I should say that in fact I used the Japanese syllabary

Data used

• Ten children's songs and ten J-POP songs, written in Japanese syllables.
• Frequencies are used, not the songs themselves.
• Since the order of the character codes and the order of the Japanese syllabary do not coincide, we have to rearrange them. A bit cumbersome
  – "A I U E O" must be grouped, etc.
  – I made an Excel file for you.
• The frequencies are normalized so that their sum is 1. This is because the length (the number of syllables) differs from song to song.
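The normalization can be sketched in a few lines. This is an illustration only: the lecture does this step in Excel, and the syllable string and the five-character "alphabet" below are hypothetical stand-ins for a real lyric and the full syllabary.

```python
from collections import Counter

def normalized_frequencies(syllables, alphabet):
    """Relative frequency of each symbol of `alphabet`, in syllabary order.

    Symbols outside `alphabet` are ignored; the result sums to 1.
    """
    counts = Counter(s for s in syllables if s in alphabet)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no symbol of the alphabet occurs in the input")
    return [counts[a] / total for a in alphabet]

alphabet = "あいうえお"      # hypothetical: only the first row of the syllabary
lyric = "あいあうえあ"       # hypothetical lyric written in syllables
freq = normalized_frequencies(lyric, alphabet)
print(freq)
```

Because each song is reduced to the same fixed-length vector, songs of different lengths become comparable.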

Procedures: data preparation 1

• In column A, the syllables of a song are written in order. I put a code in column B that transforms the syllables in A to the corresponding values in B. Column C is filled with the "あいうえお" corresponding to the syllables in A.
• To draw a histogram, we calculate frequencies. Columns E and F are the ones used to generate the histogram.
• In the Excel menu, select "tool → data analysis → histogram" to get the histogram.
• Be sure to count "あいうえお" to get the histogram.
• The normalized frequency will be here.

Procedures: data preparation 2

Paste the links (done already)

Sheet "Summary"


Procedures: data preparation 3

1. Paste only the values, with rows and columns exchanged (transposed). You may paste them into another sheet.
2. Save it as a CSV file.

Procedures: data preparation 4

• Prepare data for ten children's songs and ten J-POP songs.
• Transform it to an arff file.
  – Insert a header like the one on the right.
  – At the right end of each data row (one song per row, 20 rows in total), put ",0" for a children's song and ",1" for J-POP.

This has been done: Children.xls, J-POP.xls

Procedures: Experiment and test

• Use Weka
  – J48, SMO, naïve Bayes and others
• Is the accuracy high?
• Is the decision tree obtained meaningful?
• Next, please
  – add 10 children's songs or J-POP songs to them, or
  – prepare at least ten lyrics in a new genre that you prefer,
  and find out something

If you do not know Japanese songs, and I am sure you do not, please forget about this problem

Experiment 3: Prediction of USD/JPY

• FX: foreign exchange market
  – A financial market for the trading of currencies, where the relative values of different currencies are determined.
• Is it possible to get positive returns in FX?
  – The spread is small, so it is different from a (government-run) lottery.
  – But it is a typical gamble. It is very close to a zero-sum game (not exactly). There are winners and losers. The number of losers is far larger than the number of winners (80-20 rule or power law)
• Price movement must be a random walk.
  – i.e. it must be unpredictable
• Well, we need to know if that is really so

Data

• To sell and buy US Dollars (USD) in Japanese Yen (JPY)
• The unit of price is 0.01 JPY
• Time ticks are in minutes

Tokyo Financial Exchange Inc.

Data 1

• USD/JPY in minutes
  – Let us think about June 21, 2010 (any day is OK)
  – We use the data uploaded on Forexite. The time is GMT+1 (Central European Time). Fidelity of the data is not guaranteed.
  – Over the 24 hours, Open (opening price), High (highest price), Low (lowest price), and Close (closing price) of each minute are recorded in order of time.
  – We want to predict the "Close" of the next minute.
  – Let us try an AR model.
  – The USDJPY Close data need to be extracted.

From here: 210610.txt in 210610.zip


Procedure

• Obtain the data of June 21, 2010.
• Use R and its AR modeling function to apply an AR model to the data and use it for prediction.
• A program is already written for you. What you have to do is to repeat it with different parameters and with different data to see what happens in terms of prediction
  – The data file ready for you is: 210610.txt in 210610.zip; Data2010.zip
  – The program in R is: Sample1.r.txt

[Figure, left: real data of a day (x against Index, 0 to 1400; values 90.4 to 91.4). Right: prediction for the data on the left (x[3:length(x)] against Index, same ranges)]

AR model

• One of the models (mathematical expressions) expressing a series

  $\ldots, X_{-1}, X_0, X_1, X_2, \ldots, X_T, \ldots$

• AR(p) is

  $X_t = \alpha_1 X_{t-1} + \alpha_2 X_{t-2} + \cdots + \alpha_p X_{t-p} + \varepsilon_t$

  where

  $\varepsilon_t \sim N(0, \sigma^2)$

  – There are conditions under which a system is allowed to be modeled by AR(p). But we do not consider them here.

Expressing FX rates

• To model FX rates or stock prices, it is common to adopt the ratio or the logarithm of the ratio instead of the rates or prices themselves.
• But for educational purposes, we try the values themselves and the logarithm of the values.

$X_t, \qquad X_t - X_{t-1}, \qquad \log\dfrac{X_t}{X_{t-1}} = \log X_t - \log X_{t-1}$
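The difference and the logarithm of the ratio are one-line computations. A minimal sketch follows; the three prices are hypothetical and the function names are made up for illustration.

```python
import math

def differences(x):
    """X_t - X_{t-1} for t = 1, ..., len(x) - 1."""
    return [b - a for a, b in zip(x, x[1:])]

def log_returns(x):
    """log(X_t / X_{t-1}) = log X_t - log X_{t-1}."""
    return [math.log(b) - math.log(a) for a, b in zip(x, x[1:])]

prices = [100.0, 110.0, 121.0]   # hypothetical prices growing by 10% per step
print(differences(prices))       # the differences grow with the price level
print(log_returns(prices))       # the log returns stay constant
```

This shows why the logarithm of the ratio is popular: equal relative changes give equal values regardless of the price level.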

Procedure

• R will be used since Weka does not have the functions
• The data is, for example, that of June 21, 2010.
• Read the data and extract the closing rates of every minute of USD/JPY (all USDJPY close values from 210610.txt).
• Plot it for examination.
• Use arima for AR modeling. The order is (p,0,0) for AR(p).

setwd("D:/R/")   # your folder, please
# Read a file,
x.tmp <- read.csv("210610.txt", header=T)
# pick up the USDJPY rows and then select the X.CLOSE. column,
x <- subset( x.tmp, X.TICKER. == "USDJPY" )$X.CLOSE.
# plot it,
plot( x, type="l")
# and fit an AR(2) model to the data
(fit2 <- arima(x, c(2, 0, 0)))

[Figure: x plotted against Index (0 to 1400; values 90.4 to 91.4)]

Procedure (cont.)

• Note that the following quantities are simply expressed in R:

  $X_t - X_{t-1}, \qquad \log\dfrac{X_t}{X_{t-1}} = \log X_t - \log X_{t-1}$

par( mfrow=c( 2, 1 ) )
plot( diff(x), type="l" ); plot( diff(log(x)), type="l" )

[Figure: diff(x) (top) and diff(log(x)) (bottom) plotted against Index, 0 to 1400]

• The coefficients in the "arima" output look like the following table.

Coefficients:
         ar1     ar2  intercept
      0.9026  0.0952    90.9245
s.e.  0.0263  0.0263     0.2112

  The meaning is

  $X_t - c = \mathrm{ar1}\,(X_{t-1} - c) + \mathrm{ar2}\,(X_{t-2} - c) + \varepsilon_t$

  where c is the intercept, i.e., 90.9245 here, and s.e. is the standard error.
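To see what these coefficients imply, the one-step prediction can be written out by hand. This is a minimal sketch in Python rather than R; the coefficient values are copied from the output above, the short price series is hypothetical, and the noise term is dropped because its expected value is zero.

```python
# Coefficients copied from the arima output above
AR1, AR2, C = 0.9026, 0.0952, 90.9245

def predict_next(x_prev1, x_prev2, ar1=AR1, ar2=AR2, c=C):
    """One-step prediction: X_t ~ c + ar1 (X_{t-1} - c) + ar2 (X_{t-2} - c)."""
    return c + ar1 * (x_prev1 - c) + ar2 * (x_prev2 - c)

def one_step_predictions(series):
    """Predictions for series[2], series[3], ... from the two preceding values."""
    return [predict_next(series[i - 1], series[i - 2]) for i in range(2, len(series))]

closes = [91.00, 91.02, 91.01, 90.99, 91.03]   # hypothetical minute closes
for actual, predicted in zip(closes[2:], one_step_predictions(closes)):
    print(actual, round(predicted, 4))
```

Since ar1 + ar2 = 0.9978 is close to 1 and ar1 dominates, each prediction stays very close to the previous observed value, which is worth keeping in mind when judging how good the predictions look on a plot.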

Procedure (cont.)

• To predict, use the following program:

# Read a test data file
y.tmp <- read.csv("220610.txt", header=T)
y <- subset( y.tmp, X.TICKER. == "USDJPY" )$X.CLOSE.
# Prediction based on fit2 <- arima(x,c(2,0,0)) will be in y.ar
y.ar <- array( 0, dim = c( length(y) ) )
int <- fit2$coef["intercept"]
for ( i in (2+1):length(y) ) y.ar[i] <- int + coef(fit2)[1:2] %*% (y[(i-1):(i-2)] - int )
# true values in black, predicted values in red
plot(10:length(y), y[10:length(y)], type="l"); lines(10:length(y), y.ar[10:length(y)], col=2)

[Figure: y[10:length(y)] plotted against 10:length(y) (0 to 1200; values 87.4 to 88.2)]

• It seems that the black line, i.e., the true values, is overlaid by the red line, i.e., the predicted values.
• But in fact, it is not. Zoom in on a short range:

plotrange <- 700:730
plot(plotrange, y[plotrange], type="l"); lines(plotrange, y.ar[plotrange], col=2)

[Figure: y[plotrange] plotted against plotrange (700 to 730; values 90.52 to 90.62)]

Is this prediction successful?
What will happen with other days' data as test data?
How about the logarithm of the ratio?


Data 2

• USD/JPY in minutes
  – Let us think about June 21, 2010 (any day is OK)
  – We use the data uploaded on Forexite. The time is GMT+1 (Central European Time). Fidelity of the data is not guaranteed.
  – Over the 24 hours, Open (opening price), High (highest price), Low (lowest price), and Close (closing price) of each minute are recorded in order of time.
  – We want to predict "Close" − "Open" (the returns) of a minute.
  – Difficult: what should be the basis of the prediction? What are the features?
  – (Let us try with) the returns of every minute from five minutes ago

From here: 210610.txt in 210610.zip

Procedures

• Obtain the data of June 21, 2010 (210610.txt in 210610.zip; all USDJPY rows).
• Put it into an Excel file. For every minute, calculate the returns of each minute from five minutes ago.
• To make the prediction problem simpler, let us set our target to predicting "up or down" (not the relative price value): +1 is for up and −1 for down
  – The file is ready for you: USDJPY100621.xls
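The preparation done in Excel can be sketched as follows. This is an illustration under assumptions: returns are taken as Close − Open of each minute, the features are the returns of the five preceding minutes, and a label 0 is used for "no change" in addition to +1 and −1. The function names and the toy prices are hypothetical.

```python
def sign(v):
    """+1 for up, -1 for down, 0 for no change."""
    return (v > 0) - (v < 0)

def make_instances(opens, closes, n_lags=5):
    """One instance per minute: (returns of the n_lags previous minutes, label)."""
    returns = [c - o for o, c in zip(opens, closes)]
    rows = []
    for t in range(n_lags, len(returns)):
        rows.append((returns[t - n_lags:t], sign(returns[t])))
    return rows

# Hypothetical minute data (integers keep the arithmetic exact)
opens = [1, 1, 1, 1, 1, 1, 1]
closes = [2, 1, 0, 2, 1, 0, 2]
for features, label in make_instances(opens, closes):
    print(features, label)
```

Each row can then be written out as one csv line, with the label in the last column as the class.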

Procedures: prepare files

• Pick up only the returns to form a csv file.
• Put an arff header on it to make it an arff file for Weka. Use your favorite editor, such as Notepad.

Procedures: a trial

• Use Weka
  – Before applying an algorithm, be sure to remove the "returns" feature in "Preprocess" of Weka (see the next slide)
  – J48 under "trees": decision tree
  – neural network under "functions"
  – SMO under "functions": one of the support vector machine implementations
  – NaiveBayes under "bayes"
• Accuracies are around 1/3.
  – Do not just say "it is uniformly random". When we examine the data, it is easy to see that the returns are positive, zero, or negative in almost 1/3 ratios each. It was hard for me to believe.

Procedures in Weka: a note

Select and click

Unbelievably symmetric

Procedures: other days

• Try with other days.
• How about the 22nd, 23rd, and 24th of June? (220610.zip, 230610.zip, 240610.zip)
  – Please make xls, csv, and arff files
• Results?
• Lacking information to predict?
• Must be. Then shall we include the High and Low of the previous minute?
  – I have prepared June 21st for you (USDJPY100621A.xls). Make files for the other days and try prediction.
  – I guess there will be no successful results.


Procedures

• Next: how about making five minutes the unit?
  – Within just a minute, dealers might not be able to observe other dealers' behavior, so correlations among prices might not emerge; the price changes would therefore seem random, and hence unpredictable.
  – Since five minutes seems long enough, there must be some correlation between the prices, and therefore some predictability must exist.
  – Correct?

In reality, it has been shown that correlations exist up to about 20 minutes (but then disappear). Refer, for example, to P. Gopikrishnan, et al. Scaling of the distribution of fluctuations of financial market indices, Physical Review E vol. 60, 5305-5316 (1999)

Procedures

• Before preparing five-minute data, we need to obtain the Open, High, Low, and Close of five minutes ago.
• After that, we can get data for five-minute time intervals.
  – Add a column with the residue of <TIME> divided by 500. When we sort the data by this column, we get the five-minute interval data.
  – High is the maximum and Low is the minimum over the five minutes (copy and paste the formulas).
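The five-minute aggregation can also be sketched in code. This is an illustration under stated assumptions, not the Excel procedure itself: <TIME> is taken to be an integer of the form HHMMSS, the rows are one per minute in time order, and a bar is closed at every row whose time leaves residue 0 when divided by 500. The function name and the toy prices are hypothetical.

```python
def five_minute_bars(rows):
    """rows: list of (time, open, high, low, close), one per minute, in time order.

    A bar ends at each row with time % 500 == 0 (assumed HHMMSS integers).
    Returns a list of (time, open, high, low, close) for the five-minute bars.
    """
    bars, current = [], []
    for row in rows:
        current.append(row)
        if row[0] % 500 == 0:
            bars.append((
                row[0],
                current[0][1],                  # Open of the first minute
                max(r[2] for r in current),     # maximum of the Highs
                min(r[3] for r in current),     # minimum of the Lows
                row[4],                         # Close of the last minute
            ))
            current = []
    return bars

# Hypothetical minutes 00:01 .. 00:10 with integer prices for exact arithmetic
minutes = [(m * 100, m * 10, m * 10 + 5, m * 10 - 5, m * 10 + 1) for m in range(1, 11)]
print(five_minute_bars(minutes))
```

If some minutes are missing from the data, a bar built this way simply covers fewer rows; whether that is acceptable depends on the analysis.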

Procedures

• Use Weka.
  – J48, SMO, naïve Bayes and others
• You might get better accuracy.
  – But if you look at the distribution of returns, you will find that the number of instances with returns = 0 is smaller.
  – As was expected (?), prediction is not possible?
  – Well, the amount of data could be smaller than necessary.
• Let us try June 23rd to June 26th.
  – A bit better?
  – Let us test (apply the obtained knowledge to) other days, such as June 28th to July 1st (280610.zip, 290610.zip, 300610.zip, 010710.zip).

Procedures

• A way to test a result on another data set
• Prepare test data that has the same number of features as the learning data.
• Since, for the current problem, we only have files that include the returns of the current minute, the files will not serve as test data files as they are. One easy way to prepare them is to delete the column "returns" of the time and to save the result as a file in Weka.

For those who want to go a bit further

• I have prepared data up to 2nd July, 2010 for USD/JPY and GBP/USD: 2010-01to06.zip

Weka: a note for test data

Select this

And click here

Then click here,

Specify a file of test data

And click here to close the widget


Weka: a note for saving as a file

Select and click

And save it