Data representation and encoding · harold/courses/old/cs2000.w12/diary/... · 1/13/12
TRANSCRIPT
Computing and Reasoning
Computation in the physical world
Computation as Manipulating Symbols
52
Symbols and Meaning
• Language, mathematics and computation use symbols to represent concepts and objects
• Computers manipulate symbols and these manipulations mean something
– Because the symbols mean something
• In the 1930s-1940s computer designers began to realize that if they made their computers programmable, they could change what the symbols mean and what the computers did without having to build a new computer for each new problem or application
53
Encoding
54
Data Representation and Encoding
• Computers use binary code (zeros and ones) to record information. Binary digits (bits) can be recorded on different storage media: electrically (on a chip), magnetically (on a hard drive or magnetic tape) or optically (on a DVD). Holes in paper (tape and cards) were once used for data storage.
• Since all data in all files are reduced to binary codes for recording and storage, any type of file can be recorded on any storage media. The difference is how the information is encoded in binary.
– You can use any symbols you want, but they are encoded in binary codes to be stored or manipulated by computers.
• For example, one binary code for text information is the ASCII code. (UTF, ANSI are other text code variations). Other binary codes are used for other types of data.
55
[Diagram: INFORMATION (sound, images, video, text) is ENCODED TO BINARY (for example 10010101 00101101 10101001 11100111) and STORED IN FILES ON STORAGE MEDIA (optical disk/DVD, magnetic hard drive, USB key). Example encodings: MP3, WAV (sound); JPEG, GIF (images); MPG, AVI (video); UTF, ANSI (text).]
56
• Why do computers use binary?
– It’s not inherent in the natural world (that is, we don’t have to choose binary)
– It is technologically simple to have a two state device (1 or 0, on or off), a switch, to represent in computer circuits (chips) or memory devices
57
• Teletype ASR-33 input/output terminal device designed to be compliant with ASCII encoded messages
58
• Why do computers use binary?
– It was adopted by early computers, so this makes it easy for new machines and programs to follow the binary schemes already adopted…
• Why do computers use binary?
– The important observation is: no matter what convention was chosen (in this case, binary), any symbol system can be encoded or translated into the conventional choice
• It is not necessary to design a new computer for every new language, mathematical system or type of computation we want to perform; we simply substitute or change the encoding and the programs we use
59
An Example of Binary Encoding: ASCII
Char  Decimal  Binary      Char     Decimal  Binary      Char  Decimal  Binary      Char  Decimal  Binary
NUL   0        00000000    (space)  32       00100000    @     64       01000000    `     96       01100000
SOH   1        00000001    !        33       00100001    A     65       01000001    a     97       01100001
STX   2        00000010    "        34       00100010    B     66       01000010    b     98       01100010
ETX   3        00000011    #        35       00100011    C     67       01000011    c     99       01100011
EOT   4        00000100    $        36       00100100    D     68       01000100    d     100      01100100
ENQ   5        00000101    %        37       00100101    E     69       01000101    e     101      01100101
ACK   6        00000110    &        38       00100110    F     70       01000110    f     102      01100110
BEL   7        00000111    '        39       00100111    G     71       01000111    g     103      01100111
BS    8        00001000    (        40       00101000    H     72       01001000    h     104      01101000
TAB   9        00001001    )        41       00101001    I     73       01001001    i     105      01101001
LF    10       00001010    *        42       00101010    J     74       01001010    j     106      01101010
VT    11       00001011    +        43       00101011    K     75       01001011    k     107      01101011
FF    12       00001100    ,        44       00101100    L     76       01001100    l     108      01101100
CR    13       00001101    -        45       00101101    M     77       01001101    m     109      01101101
SO    14       00001110    .        46       00101110    N     78       01001110    n     110      01101110
SI    15       00001111    /        47       00101111    O     79       01001111    o     111      01101111
DLE   16       00010000    0        48       00110000    P     80       01010000    p     112      01110000
DC1   17       00010001    1        49       00110001    Q     81       01010001    q     113      01110001
DC2   18       00010010    2        50       00110010    R     82       01010010    r     114      01110010
DC3   19       00010011    3        51       00110011    S     83       01010011    s     115      01110011
DC4   20       00010100    4        52       00110100    T     84       01010100    t     116      01110100
NAK   21       00010101    5        53       00110101    U     85       01010101    u     117      01110101
SYN   22       00010110    6        54       00110110    V     86       01010110    v     118      01110110
ETB   23       00010111    7        55       00110111    W     87       01010111    w     119      01110111
CAN   24       00011000    8        56       00111000    X     88       01011000    x     120      01111000
EM    25       00011001    9        57       00111001    Y     89       01011001    y     121      01111001
SUB   26       00011010    :        58       00111010    Z     90       01011010    z     122      01111010
ESC   27       00011011    ;        59       00111011    [     91       01011011    {     123      01111011
FS    28       00011100    <        60       00111100    \     92       01011100    |     124      01111100
GS    29       00011101    =        61       00111101    ]     93       01011101    }     125      01111101
RS    30       00011110    >        62       00111110    ^     94       01011110    ~     126      01111110
US    31       00011111    ?        63       00111111    _     95       01011111    DEL   127      01111111
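Any row of the ASCII table can be recomputed rather than looked up. The sketch below is only an illustration: the function name ascii_row is made up for this example, and it relies on Python's built-in ord and format.

```python
def ascii_row(ch: str) -> tuple:
    """Return (character, decimal code, 8-bit binary string) for one ASCII character."""
    code = ord(ch)  # the character's numeric code
    if code > 127:
        raise ValueError(f"{ch!r} is not an ASCII character")
    # "08b" means: binary digits, padded with leading zeros to 8 places
    return ch, code, format(code, "08b")


if __name__ == "__main__":
    for ch in ("A", "a", "0", "?"):
        char, decimal, bits = ascii_row(ch)
        print(char, decimal, bits)
```

Note that upper and lower case letters differ only in one bit (the 32's place), which is visible when comparing "A" and "a".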
60
An example of Binary Encoding: Binary Numbers
• Numbers can be encoded in binary, using base 2
– One hundred fifty-eight in base 10 (decimal): 158 (base 10)
8 in the 1’s place (10^0)
5 in the 10’s place (10^1)
1 in the 100’s place (10^2)
61
– One hundred fifty-eight in base 2 (binary): 10011110 (base 2)
0 in the 1’s place (2^0) = 0
1 in the 2’s place (2^1) = 2
1 in the 4’s place (2^2) = 4
1 in the 8’s place (2^3) = 8
1 in the 16’s place (2^4) = 16
0 in the 32’s place (2^5) = 0
0 in the 64’s place (2^6) = 0
1 in the 128’s place (2^7) = 128
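The place-value breakdown of 158 can be checked with a few lines of code. This is a minimal sketch, not part of the original slides: to_binary and place_values are names invented here, and the conversion uses repeated division by 2.

```python
def to_binary(n: int) -> str:
    """Convert a non-negative integer to base-2 digits by repeated division by 2."""
    if n < 0:
        raise ValueError("only non-negative integers are handled in this sketch")
    if n == 0:
        return "0"
    digits = []
    while n > 0:
        digits.append(str(n % 2))  # the remainder is the next digit, starting at the 1's place
        n //= 2
    return "".join(reversed(digits))


def place_values(bits: str) -> list:
    """Value contributed by each binary digit, listed from the 1's place upward."""
    return [int(bit) * 2 ** position for position, bit in enumerate(reversed(bits))]


if __name__ == "__main__":
    bits = to_binary(158)
    print(bits)                     # binary digits of 158
    print(place_values(bits))       # contribution of each place
    print(sum(place_values(bits)))  # adds back up to the original number
```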
62
Binary Number Arithmetic
0 + 0 = 0
0 + 1 = 1
1 + 1 = 10 (one plus one is two)
1 + 1 + 1 = 11 (with a carry: one plus one plus one more is three)

   1111        (carries)
  10011110     one hundred fifty-eight
+ 00111101     plus sixty-one
  --------
  11011011     is two hundred nineteen
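The column-by-column addition with carries can be written out as a short program. This is a sketch under the assumption that both numbers are given as strings of 0s and 1s; add_binary is a name made up for this example.

```python
def add_binary(a: str, b: str) -> str:
    """Add two binary digit strings column by column, right to left, tracking the carry."""
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)  # pad the shorter number with leading zeros
    carry = 0
    result = []
    for bit_a, bit_b in zip(reversed(a), reversed(b)):
        total = int(bit_a) + int(bit_b) + carry  # 0, 1, 2 or 3
        result.append(str(total % 2))            # digit written in this column
        carry = total // 2                       # 1 is carried when the column total is 2 or 3
    if carry:
        result.append("1")
    return "".join(reversed(result))


if __name__ == "__main__":
    print(add_binary("1", "1"))
    print(add_binary("10011110", "00111101"))
```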
63
Encoding different data types
• Other information types (images, video, music) must have different binary encodings for the information to be stored in files.
• Sometimes files can have multiple types of encodings within the file format.
– For example, a web browser displays both images and text.
• In order to display information properly, each application must use the coding appropriate to the file format.
• Note that numerals can be encoded for text display or for arithmetic (as numbers). This distinction is important for programs that display both text and numbers.
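The difference between a numeral encoded as text and the same numeral encoded as a number shows up directly in the bits. The sketch below uses function names invented for this illustration (as_text_bits, as_number_bits).

```python
def as_text_bits(numeral: str) -> str:
    """Encode each digit character with its ASCII code, one byte per character."""
    return " ".join(format(byte, "08b") for byte in numeral.encode("ascii"))


def as_number_bits(numeral: str) -> str:
    """Interpret the numeral as a quantity and write that quantity in base 2."""
    return format(int(numeral), "b")


if __name__ == "__main__":
    print(as_text_bits("158"))    # three bytes, one per character
    print(as_number_bits("158"))  # a single binary number
    print("15" + "8")             # text: joining symbols
    print(15 + 8)                 # number: arithmetic
```

The same three symbols take three bytes as text but fit in one byte as a number, and "+" means something different for each.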
64
Data and file types
• Different types of data are encoded, or represented, differently. The type or format of a file will often reflect the encoding of a particular type of data stored in the file, such as audio, images, video, or text.
• One way the file type can be indicated is by providing a filename extension, which conventionally is the last part of the name following a “dot”. Some examples are:
65
Extension                    Data type
.mp3, .wav                   Audio file formats
.mp4, .avi, .mov             Video file formats
.jpg, .gif, .tiff            Image file formats
.zip                         Compressed zip archive
.html, .htm, .xml, .vrml     Hypertext and media markup formats used for web page source
.exe                         Program code for executable programs (Windows and some other operating systems)
.pdf                         Portable document format (Adobe)
.pptx                        Microsoft PowerPoint
.odt                         OpenOffice text
66
File type conventions
• Keep in mind that filename extensions are a convention, not a property of the data in the file
– A file can have the wrong file extension
– Different programs may use the same extension for different formats
– The file format might be different than the extension indicates
• The Microsoft Windows desktop tries to launch an associated application when you open a file. This is based on the file name extension.
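Because the extension is only a convention, a program that cares about the real format can look at the data itself. Many formats begin with a fixed byte sequence; the sketch below checks a handful of well-known ones. The table SIGNATURES and the function names are assumptions made for this example, and the list is far from complete.

```python
# A few well-known starting byte sequences (not an exhaustive list).
SIGNATURES = {
    b"%PDF": "pdf",
    b"GIF8": "gif",
    b"\xff\xd8\xff": "jpg",
    b"PK\x03\x04": "zip",
}


def sniff_format(data: bytes):
    """Guess the format from the first bytes of the data, or return None if unknown."""
    for magic, name in SIGNATURES.items():
        if data.startswith(magic):
            return name
    return None


def extension_matches(filename: str, data: bytes) -> bool:
    """Compare the filename extension with what the data itself looks like."""
    extension = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    return sniff_format(data) == extension


if __name__ == "__main__":
    gif_data = b"GIF89a" + bytes(10)
    print(extension_matches("photo.gif", gif_data))  # extension agrees with the data
    print(extension_matches("photo.jpg", gif_data))  # wrong extension on the same data
```

A strict comparison like this would also flag .pptx and .odt files, which are zip containers internally, so a real tool would need a richer mapping than this sketch has.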
67
Bits and Bytes: data size
• In order to represent different types of data or information, groups of bits are often needed
• A group of 8 bits is called a “byte”. (One bit is not a very useful encoding unit for most data).
• Storage, memory capacity and file size are characterized in bytes:
– KB – kilobytes (1000’s) thousands
• Range of a typical text file
– MB – megabytes (1000000’s) millions
• Range of a typical digital photo
– GB – gigabytes (1000000000’s) billions
• Range of typical removable media or computer memory
– TB – terabytes (10^12)
• Range of typical large storage device (hard drive)
• One peculiarity is that computer specifications have a history of counting in base 2 binary. For example, according to that convention, the prefix “kilo” was used for a count of 2^10, which works out to 1024 instead of the standard metric convention of “kilo” meaning 1000. Main memory sizes, for example, use the binary counting convention. If you’re used to the metric prefix, it seems like you’re getting a little “bit” extra.
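The gap between the metric and the binary counting conventions grows with each prefix. A small sketch, with helper names (metric_size, binary_size) chosen only for this illustration:

```python
PREFIXES = ["KB", "MB", "GB", "TB"]


def metric_size(prefix: str) -> int:
    """Bytes in one unit under the metric convention (powers of 1000)."""
    return 1000 ** (PREFIXES.index(prefix) + 1)


def binary_size(prefix: str) -> int:
    """Bytes in one unit under the binary convention (powers of 2^10 = 1024)."""
    return 1024 ** (PREFIXES.index(prefix) + 1)


if __name__ == "__main__":
    for prefix in PREFIXES:
        extra = binary_size(prefix) - metric_size(prefix)
        print(prefix, metric_size(prefix), binary_size(prefix), "extra bytes:", extra)
```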
68
Storage and Memory
• Don’t confuse main memory and storage
– Memory holds data and programs in active use by the processing components of the computer
• Typical hardware for this is volatile RAM
– RAM = Random Access Memory
– Volatile means power-off blanks the memory and requires a re-boot
» Laptops have “sleep” or “hibernate” modes to keep the memory from blanking
• More memory allows more simultaneously active programs and tasks
– Storage records data and programs permanently (or until deliberately deleted) on some storage medium
• Organized into files in a file system
• Some recording media involved and a device that records on that media
– Media can be removable (like DVDs) or fixed (hard drive)
• Programs have to be loaded from storage into main memory to run, and data is usually loaded into memory to be manipulated.
69
Levels of encoding
• We can think of levels of encoding to represent something, using different symbols or representations
• For example, English text can use symbols that are encoded in UTF/ASCII (one level of encoding), which are internally represented or encoded in binary (a second level of encoding):
70 71
From the screen..
It was the year of Our Lord one thousand seven hundred and seventy-five. Spiritual revelations were conceded to England at that favoured period, as at this. Mrs. Southcott had recently attained her five-and-twentieth blessed birthday, of whom a prophetic private in the Life Guards had heralded the sublime appearance by announcing that arrangements were made for the swallowing up of London and Westminster.
Encoded into UTF-8
Encoded into binary
010010010111010000100000011101110110000101110011001000000111010001101000011001010010000001111001011001010110000101110010001000000110111101100110001000000100111101110101011100100010000001001100011011110111001001100100001000000110111101101110011001010010000001110100011010000110111101110101011100110110000101101110
Into physical media (optical/magnetic/electronic)
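The two levels of encoding for the text passage (characters to UTF-8 bytes, bytes to binary digits) can be reproduced and reversed. This is a minimal sketch; text_to_bits and bits_to_text are names made up here, and bits_to_text assumes the bit string is a whole number of bytes.

```python
def text_to_bits(text: str) -> str:
    """Level 1: characters to UTF-8 bytes. Level 2: each byte to 8 binary digits."""
    return "".join(format(byte, "08b") for byte in text.encode("utf-8"))


def bits_to_text(bits: str) -> str:
    """Reverse both levels: group the digits into bytes, then decode the bytes as UTF-8."""
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode("utf-8")


if __name__ == "__main__":
    bits = text_to_bits("It was the year of Our Lord")
    print(bits)
    print(bits_to_text(bits))
```

The first 48 digits printed match the start of the binary string shown for the passage.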
Or a computer program as written by a programmer in a computer language..
72 73
From the screen..
#include <stdio.h>
int main(void)
{
    printf("hello, world\n");
    return 0;
}
C-Language encoded into UTF-8
UTF-8 as binary
001000110110100101101110011000110110110001110101011001000110010100100000001111000111001101110100011001000110100101101111001011100110100000111110001000000000110100001010011010010110111001110100001000000110110101100001011010010110111000101000011101100110111101101001011001000010100100100000000011010000101001111011
Stored on physical media (optical/magnetic/electronic)
Or a web page as browsed by a web user..
74 75
From the screen..
Web page on screen
Web page represented in the HTML language
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title>Become an Undergraduate Student | Welcome to Memorial</title>
<link href="http://www.mun.ca/appinclude/brand/2011v1/include/styles/base.css" rel="stylesheet" type="text/css" media="screen" />
<link href="http://www.mun.ca/appinclude/brand/2011v1/include/styles/structure.css?1315478976" rel="stylesheet" type="text/css" media="screen" />
<link
HTML encoded as UTF-8 binary codes
00111100001000010100010001001111010000110101010001011001010100000100010100100000010010000101010001001101010011000010000001010000010101010100
Stored on physical media or transmitted across the internet
Other encoding examples..
• Images are made up of pixels: each pixel might be encoded as a Red, Green and Blue intensity
• (100%, 10%, 0%) = [color swatch]
• Each of the 3 numbers can then be converted to an 8-bit binary value, so each pixel takes 3 bytes to represent.
– 01100100 00001010 00000000
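The pixel example can be reproduced by writing each percentage as an 8-bit binary number, as on the slide. The function name encode_pixel is an assumption for this sketch. Common image formats instead scale each intensity to the range 0 to 255, so this follows the slide's simplified scheme rather than any particular file format.

```python
def encode_pixel(red_pct: int, green_pct: int, blue_pct: int) -> str:
    """Write each intensity percentage (0-100) as one 8-bit byte: 3 bytes per pixel."""
    for value in (red_pct, green_pct, blue_pct):
        if not 0 <= value <= 100:
            raise ValueError("intensities are percentages from 0 to 100")
    return " ".join(format(value, "08b") for value in (red_pct, green_pct, blue_pct))


if __name__ == "__main__":
    print(encode_pixel(100, 10, 0))
```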
76
Other ideas for encoding
77
In Class Exercise – Encode yourself!
• Propose an encoding for our class seating plan
• Use convenient symbols
• Ultimately, you must come up with (and explain) an encoding scheme that can indicate which student is in which seat
• Your encoding must (eventually) be in binary
– You can start using binary directly
– Or you can have some levels of symbols before you get to binary
78
Encoding exercise – hints and choices
• You can incorporate an existing code (ASCII, binary numbers) or make up your own
• You can identify things (seats, students) using your own symbols or strategy
– Or you can use existing identities (such as student names?)
– You have to convert them to binary eventually
79
Another Hint about file structures..
• Computer files are generally understood to be sequential
– In other words, there is a first symbol in the file, a second symbol in the file, a third thing… and so on
– For example, if this slide were stored in a file, the word “Another” is first, then the word “Hint”
• Letter “A” comes first, then “n”, and so on..
• The corresponding binary codes are arranged in the appropriate order in the file containing the slide..
• So we don’t have to encode which thing comes first or second or third; we put them in the computer file in the right order
– You may be able to use the order of symbols in the file to make your encoding easier
– But you can ignore this if it’s more confusing than helpful
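One possible answer to the exercise, using the hint about file order, is sketched below. Everything in it is an assumption made for illustration: students are identified by small numbers, seats are listed row by row, the number 0 marks an empty seat, and each seat takes 8 bits. Many other schemes would work equally well.

```python
EMPTY_SEAT = 0  # assumed convention: student number 0 means nobody is sitting there


def encode_seating(seats: list, bits_per_seat: int = 8) -> str:
    """seats[i] is the student number in seat i; position in the output identifies the seat."""
    for student in seats:
        if not 0 <= student < 2 ** bits_per_seat:
            raise ValueError("student number does not fit in the chosen number of bits")
    return "".join(format(student, f"0{bits_per_seat}b") for student in seats)


def decode_seating(bits: str, bits_per_seat: int = 8) -> list:
    """Recover the list of student numbers, in seat order, from the binary string."""
    return [int(bits[i:i + bits_per_seat], 2) for i in range(0, len(bits), bits_per_seat)]


if __name__ == "__main__":
    plan = [3, EMPTY_SEAT, 1]  # seat 0 holds student 3, seat 1 is empty, seat 2 holds student 1
    encoded = encode_seating(plan)
    print(encoded)
    print(decode_seating(encoded))
```

Because the order of the codes says which seat is meant, seat numbers never have to be stored.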
80
Reviewing the exercise
• Some things that weren’t specified in the problem:
– Does the encoding scheme have to work for any seating arrangement or only our seating arrangement?
– Does the encoding scheme have to work for all classrooms, or only our classroom?
– Does the encoding scheme have to work for all possible groups of students or only our group of students?
– Does the encoding scheme have to indicate if the seats have been moved around to a new location or not?
• These are questions about the expressive power of the language or encoding scheme you have created, and they echo basic questions about computing (Hilbert)
– Does a computer provide everything that can be expressed? (Goedel)
– Are all true things provable with a computer? (Turing)
81
An afterword… Machine Code
• There is one special code, machine code, that is “built into” the computer processor
• Machine code is the binary code that indicates which circuits the computer should activate to run a computer program
– Computer circuits correspond to operations on binary symbols; for example there are circuits for add, subtract, compare, copy, and moving binary values in and out of computer memory and storage.
– So machine code already has a meaning within the machine: those binary code symbols mean the computer circuits to activate.
– The meaning of each code (what the corresponding circuit does) is referred to as an instruction
• To create a computer program, these codes have to be arranged (or compiled) into the correct sequence to produce the desired result
• To run a computer program, the processor(s) take a sequence of machine coded instructions and activate the corresponding circuits in sequence order.
• When they’re not being used, the coded program instructions (which as a whole comprise a program) are kept in a file on a storage medium (such as a hard drive or USB key), just like any other information on the computer.
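The idea that each binary code selects an operation, and that a program is a sequence of such codes, can be imitated with a toy machine. The instruction format below (a 4-bit operation code followed by a 4-bit value) and the four operation codes are invented for this sketch and do not correspond to any real processor.

```python
# Hypothetical operation codes for a made-up machine with a single accumulator.
LOAD, ADD, SUB, HALT = 0b0001, 0b0010, 0b0011, 0b1111


def run(program: list) -> int:
    """Step through 8-bit instruction strings in order and return the final accumulator."""
    accumulator = 0
    for word in program:
        opcode = int(word[:4], 2)   # first 4 bits: which "circuit" to activate
        operand = int(word[4:], 2)  # last 4 bits: the value to operate on
        if opcode == LOAD:
            accumulator = operand
        elif opcode == ADD:
            accumulator += operand
        elif opcode == SUB:
            accumulator -= operand
        elif opcode == HALT:
            break
        else:
            raise ValueError(f"unknown operation code in {word}")
    return accumulator


if __name__ == "__main__":
    # load 5, add 3, halt
    print(run(["00010101", "00100011", "11110000"]))
```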
82
Simplified computer architecture diagram
• Computer Architecture refers to the organization of hardware (the physical parts) of the computer
• Electronic circuits on computer boards and chips include these components:
– The processor contains the circuits activated by machine code
– Memory stores the data (including the machine code) currently in use by a running program
– Controllers are circuits that control computer devices (such as screens, hard drives, USB keys, wireless receivers) according to the binary codes that are sent to them. Each controller is designed to control particular devices according to its own codes.
– A bus is a pathway of wires for sending binary coded signals among components (for example, from a storage controller to memory).
• A storage device contains data (and programs) organized into files and recorded on a storage medium. A traditional name for storage is “Secondary Memory”.
83
[Diagram: Processor, Memory, BUS, four Controllers, two Storage devices]
Programmers don’t write in machine code (anymore)…
• Binary machine code is good for designing machines that run programs, not so good for human programmers to understand
• Instead, humans write programs in formalized “programming languages” (like C or Java), and a program translator translates them into machine code
• Programming languages are a kind of middle ground: easier to read and write than machine code (for humans), and easier to translate into machine code than English (for computers)
• Don’t confuse encoding (representing concepts or symbols using coding symbols) with translation (changing from one encoding to another encoding). In the computer world, these are different ideas.
84
Program translation is not encoding…
85 85
#include <stdio.h>
int main(void)
{
    printf("hello, world\n");
    return 0;
}
001000110110100101101110011000110110110001110101011001000110010100100000001111000111001101110100011001000110100101101111001011100110100000111110001000000000110100001010011010010110111001
Source program: in this example, C-Language code is one thing stored in one file, which can be viewed in different ways, including its binary representation
Program translation
Machine-Language program: this is a different thing which can be stored in a different file. It is the translation of the program that can be run on a computer as a program.
001011011100110001101101100011101010110010001100101001000000011110001110011011101000110010001101001011011110010111001101000001111100010000000001101000010100110100101101110010010100101001
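The difference between encoding a program and translating it can be seen in miniature with Python itself. This is only an analogy: Python's built-in compile produces bytecode for Python's own virtual machine, not machine code for a hardware processor, and the function names below are made up for this sketch.

```python
def encode_source(source: str) -> bytes:
    """Encoding: the same program text, represented as UTF-8 bytes. Fully reversible."""
    return source.encode("utf-8")


def translate_source(source: str) -> bytes:
    """Translation: a different artifact, the instruction bytes produced from the program."""
    code_object = compile(source, "<example>", "exec")
    return code_object.co_code


if __name__ == "__main__":
    source = 'print("hello, world")'
    print(encode_source(source))     # still readable as the original text
    print(translate_source(source))  # instruction bytes, not the text
```

Decoding the encoded bytes gives back the source text exactly, while the translated bytes are a separate thing that would normally be stored in a separate file.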
Final thoughts on symbols and encoding..
• Concepts are represented by symbols
• Computation is performed by manipulation of the symbols
• We have to decide what the symbols mean
• There is a sense in which it doesn’t matter what symbols you use
– Because we can always encode using other symbols
• As long as we are careful about the encoding
– Computers use binary, but we can encode whatever we want into binary using our own code
• What you decide the symbols represent (that is, their meaning) does matter a great deal
– It might determine what is possible or impossible to compute (or, in scientific terms, to model)
86