floating point numbers - university of california, riversidehtseng/classes/ee120a...• therefore,...

Floating Point Numbers
Prof. Usagi

Floating point numbers

• If you add 1 to the largest representable signed integer, the result wraps around to the smallest (most negative) integer.

Let’s revisit 4-bit binary addition

• 7 + 1 = ?

  0 1 1 1
+ 0 0 0 1
---------
  1 0 0 0 = -8

• The leftmost bit is the sign bit, so 1000 is read as -8

• We want to express both a rational number’s “integer” and “fraction” parts
• Fixed point
  • One bit is used for representing positive or negative
  • A fixed number of bits is used for the integer part
  • A fixed number of bits is used for the fraction part
  • Therefore, the decimal point is fixed
• Floating point
  • One bit is used for representing positive or negative
  • A fixed number of bits is used for the exponent
  • A fixed number of bits is used for the fraction
  • Therefore, the decimal point is floating, depending on the value of the exponent

“Floating” vs. “Fixed” point

• Fixed point: +/- | Integer . Fraction (the “.” is always here, between the integer and the fraction)
• Floating point: +/- | Exponent | Fraction (the “.” can be anywhere in the fraction)

IEEE 754 format

• Realign the number into 1.F × 2^e
• Exponent stores e + 127
• Fraction only stores F

32-bit float: +/- | Exponent (8-bit) | Fraction (23-bit)

Floating point adder

Demo — what’s in c?

#include <stdio.h>

int main(int argc, char **argv)
{
    float a, b, c;
    a = 1280.245;
    b = 0.0004;
    c = a + b;
    printf("1280.245 + 0.0004 = %f\n", c);
    return 0;
}

Other floating point formats

• Not all applications require “high precision”
• Deep neural networks are surprisingly error tolerant

32-bit float: +/- | Exponent (8-bit) | Fraction (23-bit)
64-bit double: +/- | Exponent (11-bit) | Fraction (52-bit)
16-bit half (added in 2008): +/- | Exp (5-bit) | Fraction (10-bit)

Can you tell the difference?

Higher resolution

But we can all tell they are our mascots!

How about this?

Mixed-precision

Google’s Tensor Processing Units

https://cloud.google.com/tpu/docs/system-architecture

EdgeTPU

Announcement

• Reading quiz 5 due 4/28 BEFORE the lecture
  • Under iLearn > reading quizzes
• Lab 3 due 4/30
  • Watch the video and read the instructions BEFORE your session
  • There are links on both the course webpage and the iLearn lab section
  • Submit through iLearn > Labs
• Midterm on 5/7 during the lecture time, accessed through iLearn
  • No late submission is allowed
  • Make sure you will be able to take it at that time
• Check your grades in iLearn

To be continued

Electrical Computer Engineering

Science 120A
