
Page 1: Python Video Processing · import cv,cv2 import cPickle as pickle #Program to capture video from a camera and store it in an recording file, in Py thon txt format, using cPickle #This

Video Coding

Python Video Processing

The OpenCV library also gives us the ability to stream data directly from a webcam, such as on the Raspberry Pi, to the computer! For this purpose, the command is:

cap=cv2.VideoCapture(0)

This accesses the default camera 0, which is, for example, the inserted USB webcam. "cap" contains a pointer to the address of this camera. We can read a frame of this stream as:

[retval, frame] = cap.read()

where "retval" is a return value (it says if everything is fine), and "frame" contains the frame thus extracted. With

cv2.imshow('frame', frame)

we can display this frame in a window. With

cv2.waitKey(1)


the window opened by cv2.imshow is kept open for as long as prompted by waitKey (it does not open without waitKey). The argument of waitKey tells the function how many milliseconds to wait for a keystroke. The function cv2.waitKey is only active when the associated window is active (if clicked on it). Only when this window is active does cv2.waitKey respond to keystrokes!

When we do this in an infinite loop, we get the live video from the camera in the window. A sample program is shown here:

import cv2

#Program to capture a video from the default camera (0) and display it live on the screen
cap = cv2.VideoCapture(0)
while(True):
    # Capture frame-by-frame
    [retval, frame] = cap.read()
    # Display the resulting frame
    cv2.imshow('frame', frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

# When everything done, release the capture


cap.release()

cv2.destroyAllWindows()

This program is stored as "videorecdisp.py" and then called from console window with:

python videorecdisp.py

and we obtain the live video. We can end the display by typing “q”.

A practical application is e.g. a "pipe camera" with goose-neck for the inspection of pipes.

The following example shows the contents of the upper left pixel of the video at the start of the program:

import cv2
#Program to capture an image from a camera and display the pixel value on the screen
cap = cv2.VideoCapture(0)
# Capture one frame
[ret, frame] = cap.read()
print("image format: ", frame.shape)
print("pixel 0,0: ", frame[0,0,:])

Save the program under "pyimageshowpixel.py" and start it with:

python pyimageshowpixel.py

With output:


('image format: ', (480, 640, 3))
('pixel 0,0: ', array([ 35, 167, 146], dtype=uint8))

Note that we only addressed the pixel, so only 2 indices: frame[0,0,:]. The colon ":" means that all the indices of this dimension are addressed. In this way we can also address an index range, via Start:End, e.g. 0:3 addresses the indices 0,1,2 (Note: not the last value, 3). In our case, the above notation is identical to:

frame[0,0,0:3]

At the output we therefore do not get a single value but an array with the values of the 3 primary colors, in the order BGR. Here "red" has the value 146, green 167, and blue 35. We can also observe the numeric representation "uint8", i.e. unsigned integer with 8 bits, hence a number between 0 and 255 (= 2^8 − 1).
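The slicing behaviour can be tried without a camera, using a small synthetic array in place of a camera frame (the pixel values below are made up for illustration):

```python
import numpy as np

# Synthetic 2x2 BGR "frame" with made-up pixel values, uint8 like a camera frame
frame = np.zeros((2, 2, 3), dtype='uint8')
frame[0, 0, :] = [35, 167, 146]

# ":" addresses all indices of the last dimension,
# "0:3" addresses the indices 0,1,2 (not 3) -- both give the same 3 color values:
print(frame[0, 0, :])
print(frame[0, 0, 0:3])
```

Both print statements output the same array of the 3 BGR values.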

With the program “videorecdispRGB.py” we can see the separate R, G, and B components of our video stream, with the command line

python videorecdispRGB.py

Observe: if you hold e.g. a red paper in front of the camera, only the red component will show it bright, as expected.

A simple Encoder/Decoder Framework

We start with a simple encoder and decoder framework in Python.

-The encoder framework simply takes each frame of our video sequence and appends it to our video file, for now uncompressed in a text format.
-The decoder framework reads one frame after the other from this file and displays it on screen.


-During the course you are supposed to add more and more functionality to this framework, such that the video file becomes smaller and smaller. The encoder framework is in the file videoencframewk.py:

import numpy as np
import cv,cv2
import cPickle as pickle
#Program to capture video from a camera and store it in a recording file, in Python txt format, using cPickle
#This is a framework for a simple video encoder to build on
#It writes into file 'videorecord.txt'
#Gerald Schuller, April 2015

cap = cv2.VideoCapture(0)
f = open('videorecord.txt', 'w')
#Process 25 frames:
for n in range(25):
    ret, frame = cap.read()

(Block diagram: Camera → Encoder → File, and File → Decoder → Display)


    if ret==True:
        #show captured frame:
        cv2.imshow('frame', frame)
        #Here goes the processing to reduce data...
        reduced = frame.copy()
        reduced = np.array(reduced, dtype='uint8')
        # for the Cb and Cr components use the int8 type
        #"Serialize" the captured video frame (convert it to a string)
        #using pickle, and write/append it to file f:
        pickle.dump(reduced, f)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break
    else:
        break

# Release everything if job is finished
cap.release()
f.close()
cv2.destroyAllWindows()

Start it in a terminal shell with the command:

python videoencframewk.py

Observe that we use the package “pickle”. This is convenient because it turns an object, like the tensor of our video frame, into a serial string, which can then easily be stored in a file, and vice versa for the decoder framework, which we start with

python videodecframewk.py


Observe that 25 frames of our video (about 1 second of standard TV quality) in our simple, uncompressed framework already result in a file size of over 40 MB!

The need for compression: assume our short video clip of 1 second duration, 640x480 resolution, 25 pictures per second, and 8 bits per pixel and primary color. Hence we have 8*3*640*480 bits per frame, times 25 frames per second. This results in 184 Mbit/s or ca. 23 MByte/s! (Our framework needs about twice as much because it uses a text format.) Observe: HD video would be about 2 megapixels (1920x1080 pixels). This is particularly a problem if we want to stream a video, to transmit it: DSL usually has max. 16 Mbit/s. But it is already possible to stream HD video over DSL! This shows the power of compression. Example: with good compression, HD video can be transmitted with about 4 Mbit/s (from about 480 Mbit/s uncompressed).
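The data rate arithmetic above can be checked with a few lines of Python:

```python
# 8 bits per pixel and primary color, 3 primary colors, 640x480 pixels per frame
bits_per_frame = 8 * 3 * 640 * 480
# 25 frames per second
bits_per_second = bits_per_frame * 25

print(bits_per_second / 1e6)      # 184.32, about 184 Mbit/s
print(bits_per_second / 8 / 1e6)  # 23.04, about 23 MByte/s
```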

How does this compression work? There are 2 basic principles:

1) Irrelevance reduction
2) Redundancy reduction

Both are necessary to obtain the most compression.


Irrelevance reduction: This principle looks at the receiver of our data, in this case the human eye. We only need to transmit information that the eye can observe or detect. But observe that for this we need assumptions about the context in which the eye sees our signal (video); for instance the viewing distance, the size of the monitor, the brightness of the monitor and of the surroundings. For this reason we have specifically determined viewing environments for perceptual measurements.

What are the usual properties of the eye used for compression?

We have about 110 million rods, which are sensitive to intensity/luminance (not color), and about 6 million cones, which are sensitive to the electromagnetic wavelength spectrum of the 3 fundamental colors (red, green, blue). The specific spectral sensitivity of the 3 types of cones can be seen in the following diagram:

Observe that the red cones have a “sidelobe” in the blue range (about 450 nm). This has the consequence that we can distinguish spectrally pure colors (colors of only one wavelength) in the range between about 400 to 450 nm as different colors (blue to purple). Otherwise they would all appear as blue! It is thought that this sidelobe developed when our ancestors in ancient times (early mammals) became active during daylight, and needed more color vision after the dinosaurs died out. Dinosaurs were always active in daylight, as are their descendants the birds, who have 4 or 5 types of color sensitive cones (more sensitivity in the UV range). We perceive different colors by different mixes of the relative outputs of the 3 types of cones. To show the resulting color for each possible mix, we have the color triangle (CIE color space):

Observe: For instance 500 nm (light green) has a frequency of c/lambda = 300 000 km/s / 500 nm = 600 THz (600·10^12 Hz), or 600 000 GHz! Compare: our TV satellites use about 10 GHz. This shows the possibilities of fiber optics to transmit data.


Those about 6 million cones are unequally distributed on our retina. They are highly concentrated in the center of vision, the fovea, and become more sparse towards the periphery of vision. This is unlike typical digital cameras, which have equally distributed sensors.

Also observe the big difference in the number of rods, which are sensitive only to intensity or luminance: there are 110 million rods, but only 6 million cones. This means that the eye is much more sensitive to spatial changes in intensity or luminance than to spatial changes in color! This is also an important principle in video coding, where color is usually encoded and transmitted with less spatial resolution. For instance, the color information is subsampled by a factor of 2 vertically and horizontally. The rods are also unequally distributed, with less density towards the periphery of vision, but with the exception of the fovea, where there are only cones at high density, with no space left for rods. Rods are also much more light sensitive than cones, hence they are the main contributor to our night vision. Unlike the cones, which are tuned to our sunlight, the rods' spectral sensitivity is tuned to the spectrum of our moonlight.


We see that the eye has a much higher spatial resolution for luminance (the intensity or black-and-white picture), corresponding to about 110 megapixels, than for chrominance or color, corresponding to only about 6 megapixels. How can this be used in video coding?

First of all, luminance (intensity) and chrominance (color) can be coded differently. For that, we first have to separate the two, because usual cameras only have color sensitive sensors (similar to our cones), but they don't have intensity sensors, like our rods. Hence we need to create an “artificial rod output”. For this we have a color transformation; if we take the red (R), green (G), and blue (B) channels:

Intensity: Y = 0.299 R + 0.587 G + 0.114 B
Color component: Cb = 0.564 (B - Y) = -0.16864 R - 0.33107 G + 0.49970 B
Color component: Cr = 0.713 (R - Y) = 0.499813 R - 0.418531 G - 0.081282 B

This color transform can be written as the following matrix multiplication,

[ Y  ]   [ 0.299     0.587     0.114    ]   [ R ]
[ Cb ] = [-0.16864  -0.33107   0.49970  ] · [ G ]
[ Cr ]   [ 0.499813 -0.418531 -0.081282 ]   [ B ]


Here, Cb and Cr are the color/chrominance components, the normalized versions of U and V, such that we can store the values within a fixed length binary word (values between -128 and 127). Observe: The values for R, G, B are intensity values, and hence only positive. Usually they are represented by 8 bit integer values without a sign bit. Their range is 0...255. In Python this type is called “uint8”, for instance in

np.array(..., dtype='uint8')

in our videoencframewk.py. The same is true for the luminance Y. But Cb and Cr have negative factors in the matrix, hence their values can also become negative. Their range is -128...127, which is 8 bit integer with a sign bit. The Python type is “int8”, for instance

np.array(..., dtype='int8')

The color transform is an invertible matrix calculation, such that it can be inverted in the decoder,

[ R ]   [ 0.299     0.587     0.114    ]^(-1)   [ Y  ]
[ G ] = [-0.16864  -0.33107   0.49970  ]      · [ Cb ]
[ B ]   [ 0.499813 -0.418531 -0.081282 ]        [ Cr ]

        [ 1.0   0.0       1.4025 ]   [ Y  ]
      = [ 1.0  -0.34434  -0.7144 ] · [ Cb ]
        [ 1.0   1.7731    0.0    ]   [ Cr ]


In this way we obtain one intensity component and 2 color components which contain all the information of the image.

Python example:

We compute the Y, Cb and Cr (or YUV in the non-normalized case) components by applying the above matrix multiplications to each pixel of our frame. It is best to avoid “for” loops in Python; instead we can compute with entire images. So the computation of the Y image in Python is simply:

0.299*R + 0.587*G + 0.114*B

where R, G, B are the variables which contain the entire R, G, and B images. Then we can display the resulting Y, U, V videos. Start the program in a terminal shell with the command

python videorecprocyuv.py
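As a minimal, self-contained sketch of this vectorized computation (with a random test frame instead of camera input; remember that OpenCV stores the channels in B,G,R order):

```python
import numpy as np

# Random test frame instead of a camera frame; OpenCV stores channels as B,G,R
frame = np.random.randint(0, 256, (4, 4, 3)).astype(float)
B = frame[:, :, 0]
G = frame[:, :, 1]
R = frame[:, :, 2]

# Forward color transform, computed on entire images (no "for" loops):
Y  = 0.299*R + 0.587*G + 0.114*B
Cb = 0.564*(B - Y)
Cr = 0.713*(R - Y)

# Inverse transform as in the decoder:
Rrec = Y + 1.4025*Cr
Grec = Y - 0.34434*Cb - 0.7144*Cr
Brec = Y + 1.7731*Cb

# The round trip recovers the original R,G,B up to rounding of the coefficients
print(np.max(np.abs(R - Rrec)))
```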

Observe: The Y video is a monochrome version, as expected. The U and V videos only contain content if there is color; for monochrome light they show black. Also observe how different colors appear in different channels (U or V).

Python example for switching the YUV components on and off:

python videorecencdecyuvkey.py


Observe: Turning off the luminance component Y indeed makes the image appear less sharp, because our eyes have a reduced spatial resolution for color. Turning the color components U and V on and off shows the color space associated with each of them.

Compare this with the original color components, using our Python app:

python videorecdispRGBkey.py

Here we can turn the R, G, B components individually on and off.

Observe: Here, each individual color component still looks sharp (despite having just one color), because it still contains the intensity, the luminance information. This shows why we use a color transform.

Since our eye has a reduced spatial resolution for color, we can now store or transmit color with a reduced spatial resolution. For simplicity we can apply downsampling to the color components. In video coding, different chroma subsampling schemes are used:


Chroma subsampling

(Diagram: sampling patterns for Y'CbCr 4:4:4, Y'CbCr 4:2:2, Y'CbCr 4:2:0, and Y'CbCr 4:2:0 MPEG-2 sampling positions. White: luminance, grey: chrominance. From: http://de.wikipedia.org/wiki/Farbunterabtastung)

You can imagine the naming as the number of samples for each component along a line. The last name doesn't really follow this naming convention, because it also has a downsampling along the rows, hence the name 4:2:0 was assigned. This is also the most used scheme for digital video. 4:2:2 is used for higher quality, and 4:4:4 is used for highest quality video (e.g. raw or lossless formats).


With the 4:2:0 scheme we can now reduce the required data rate for the two color components by a factor of 4 (2 in each dimension), at least theoretically without visible artifacts!
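As a sketch, plain 4:2:0 style decimation of a chroma component (without the lowpass filtering discussed in the next section, and with an assumed 480x640 frame size):

```python
import numpy as np

# Assumed chroma component of a 480x640 frame
Cb = np.ones((480, 640))

# 4:2:0: keep only every 2nd sample horizontally and vertically
Cb420 = Cb[0::2, 0::2]

print(Cb.size / Cb420.size)  # 4.0: a factor of 4 fewer chroma samples
```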

Downsampling, Upsampling, Filtering

How do we do the downsampling in the encoder and then the upsampling in the decoder, such that we don't see artifacts and such that it is not too complex?

If in the encoder we simply downsample our image directly, by keeping only every 2nd sample (or N'th sample in general), we might get aliasing artifacts if we have fine patterns in the image.

If in the decoder we simply upsample our image by inserting N-1 zeros after each sample in each dimension, we obtain a “pointy” image, which again results from aliasing or spectral copies of our image.

To avoid both, we need to suitably lowpass filter our image before downsampling, and also lowpass filter it after upsampling as a sort of interpolation.


This is what the Nyquist Theorem tells us. We know it from 1-dimensional signal processing:

(Block diagram: Signal → Low Pass → Downsampler by N, and Upsampler by N → Low Pass → Signal)

Now we just have to think about how to extend it to 2-dimensional signals like images.

Up- and downsampling in Python

In Python we can do the up- and downsampling using indexing “tricks” with “:”. For


instance take a vector Y=np.array([1,2,3,4]).

Then

Y[0:4:2] means: Take the values at indices from 0 up to (excluding) 4, in steps of 2.

Y[0::2] means: Take the values of indices from 0 to the end in steps of 2.

Then down-sampling with a factor of 2 is simply:

Yds=Y[0::2]

Yds contains the down-sampled signal.

For a 2D signal Y and downsampling in each dimension, it would simply be:

Yds=Y[0::2,0::2]

For up-sampling you first generate a vector of zeros:

Yus=np.zeros(4)

and then we can assign every 2nd value to the output:

Yus[0::2]=Yds

For a 2D signal and upsampling in each dimension it would be:


Yus=np.zeros((4,4))

Yus[0::2,0::2]=Yds
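Putting the indexing together, a small self-contained sketch with an assumed 4x4 test signal:

```python
import numpy as np

# Assumed 4x4 2D test signal
Y = np.arange(16.0).reshape(4, 4)

# Downsampling by 2 in each dimension: keep every 2nd sample
Yds = Y[0::2, 0::2]

# Upsampling: a zero array, with the samples placed on every 2nd position
Yus = np.zeros((4, 4))
Yus[0::2, 0::2] = Yds

print(Yds.shape)             # (2, 2)
print(Yus[2, 2], Yus[2, 1])  # 10.0 0.0 -- a kept sample and an inserted zero
```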

Python example:

This is shown in the following Python example. We take each frame and downsample it by N=8 in each dimension, then upsample it, and display the result.

Start it with:

python videofiltresampkey.py

It uses scipy.signal.convolve2d as convolution or filter function, where we 2D-convolve our frame with a 2D “filter kernel” or 2D impulse response. This is the 2-D version of a lowpass filter.

The Python program uses two types of lowpass filter:

-filter 1 consists of an 8x8 coefficient square matrix with entries of 1/8:

filt1=np.ones((8,8))/8;


-filter 2 results from convolving filter 1 with itself:

filt2=scipy.signal.convolve2d(filt1,filt1)/8

(The division by 8 is a power normalization, such that the total power of the kernel adds up to one, or close to it.)

This is a 15x15 matrix with the highest value in the center, with values linearly decaying towards the boundaries, like a pyramid.
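The shape of the two kernels can be checked with a short sketch:

```python
import numpy as np
import scipy.signal

# Filter 1: 8x8 rectangular kernel with entries 1/8
filt1 = np.ones((8, 8)) / 8

# Filter 2: convolving filter 1 with itself gives a 15x15 pyramid shaped kernel
filt2 = scipy.signal.convolve2d(filt1, filt1) / 8

print(filt2.shape)  # (15, 15)
print(filt2[7, 7])  # 0.125, the largest value, at the center of the pyramid
```

The center row filt2[7, :] decays linearly towards the boundaries, which is the pyramid shape described above.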


The lowpass filter is toggled on/off with keyboard key 'f', and the filter type is toggled with key 't'.

Observe:

-Without a lowpass filter, a fine pattern in the image results in visible aliasing artifacts in the sampled image.

-Turning on a simple 8x8 pixel rectangular 2D filter kernel with key 'f' removes the aliasing artifacts, but leads to blocking artifacts and noticeably slower computation. The result looks like just larger pixels. This is also called a sample and hold (hold the value of the sample until the value of the next sample arrives).

Observe: Convolution of our impulse response with the pulses of our sampled image results in placing the impulse response at the place of the pulses.

-Turning on a pyramid shaped 15x15 2D filter with key 't' removes the blocking artifacts, but leads to even slower computation. The reconstructed image looks like a linear interpolation between neighbouring sampled pixels.

Observe: Now the pyramid impulse response appears at the place of the sample pulses. Adding them up results in the linear interpolation.

So how do we design “good” 2D filters, such that we get good results and still reasonably fast computation?


For that it is helpful to look at the 2D frequency domain.

2D Discrete Fourier Transform

In the mathematical description we will use bold-face characters for matrices and vectors for clarity.

Our image consists of pixel values x(n1, n2), with position indices n1, n2 in the ranges 0 ≤ n1 < N1 and 0 ≤ n2 < N2.

The 2D Discrete Fourier Transform (2D DFT) gives us this 2D frequency domain, and is defined as

X(k1, k2) = sum_{n1=0}^{N1-1} sum_{n2=0}^{N2-1} x(n1, n2) · e^(-j·2π·k1·n1/N1) · e^(-j·2π·k2·n2/N2)

where 0 ≤ k1 < N1 and 0 ≤ k2 < N2 are the frequency indices. The index for the Nyquist frequency is k1 = N1/2 and k2 = N2/2.

The 2D-DFT is equivalent to the application of our one-dimensional DFT to the rows and columns of our image.

In Python this is the function

X=numpy.fft.fft2(x)


We can plot an intensity plot of the 2D-DFT coefficients with

cv2.imshow('DFT', numpy.abs(X))

This way we plot the 2D-DFT as an image, with the frequency indices k1, k2 as image coordinates, and the magnitude |X(k1, k2)| as brightness.

Observe: This is similar to a spectrogram plot, but instead of the time/frequency axes we now have the normalized x-frequency k2 and y-frequency k1.

Since we apply a DFT vertically and horizontally, the lowest frequencies horizontally are located at the left and right corners, with k2 = 0 and k2 = N2 − 1, and the lowest frequencies vertically at the top and bottom corners. The Nyquist frequency is located at the center.
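This layout can be verified with numpy.fft.fft2 on two small test images: a constant image (only the lowest frequency) and a checkerboard (the Nyquist frequency):

```python
import numpy as np

# A constant image contains only the lowest frequency (DC)
x = np.ones((8, 8))
X = np.fft.fft2(x)
print(np.abs(X[0, 0]))  # 64.0: all energy at the corner k1 = k2 = 0

# An alternating checkerboard is the highest possible (Nyquist) frequency
n = np.arange(8)
checker = (-1.0) ** (n[:, None] + n[None, :])
Xc = np.fft.fft2(checker)
print(np.abs(Xc[4, 4]))  # about 64: all energy at the center k1 = k2 = N/2
```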

The Inverse 2D-DFT

The 2D-DFT can also be inverted, with


x(n1, n2) = 1/(N1·N2) · sum_{k1=0}^{N1-1} sum_{k2=0}^{N2-1} X(k1, k2) · e^(j·2π·k1·n1/N1) · e^(j·2π·k2·n2/N2)

In Python:

x=numpy.fft.ifft2(X)

Python Example:

This example takes the webcam input, computes the Y luminance component, and displays it. Then it takes the 2D-DFT and displays its magnitude in a separate window. Then it takes the inverse 2D-DFT and displays it in a final third window.

Keyboard key 'f': A lowpass can be switched on, which is implemented in the 2D-DFT domain by setting the DFT coefficients of the higher frequencies to 0 and then applying the inverse 2D-DFT. This is done before the downsampling and after the upsampling.
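A sketch of such a DFT domain lowpass (with an assumed image size and cutoff; the actual program may implement it differently):

```python
import numpy as np

# Assumed grayscale test image
x = np.random.rand(64, 64)
X = np.fft.fft2(x)

# The low frequencies sit in the corners of the DFT array, so a lowpass
# keeps a band around each corner and sets the rest (high frequencies) to 0:
cutoff = 8
mask = np.zeros((64, 64))
mask[:cutoff, :cutoff] = 1
mask[:cutoff, -cutoff:] = 1
mask[-cutoff:, :cutoff] = 1
mask[-cutoff:, -cutoff:] = 1

# Inverse 2D-DFT of the masked spectrum is the lowpass filtered image
xlp = np.real(np.fft.ifft2(X * mask))
print(xlp.shape)  # (64, 64)
```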

Keyboard key 's': Sampling with a factor of 8 in horizontal and vertical direction is turned on/off. First downsampling (for the encoder) and then upsampling (for the decoder).

Start the Python demo with:

python videofft0ifftresampleykey.py

Observe:

-Most coefficients of the 2D-DFT with high magnitudes appear in the corners, where the low frequencies are located.

-Holding fine patterns in front of the camera also results in high magnitudes at higher frequencies, like brighter rays.

-By pushing key “f” we can set most of the high frequency coefficients to 0 and still get a “good” picture, just a little blurry.

-By pushing key “s” we can turn sampling on/off. Without the lowpass filter the reconstructed image only has a “lattice” of active pixels. The 2D-DFT image shows spectral copies, aliasing, appearing periodically in the spectrum.

-By then switching the lowpass filter on we remove or suppress the spectral copies. As the result, the reconstructed image looks like before with only the lowpass turned on, and much better than without the lowpass filter!

-If we leave the lowpass on, we don't see a difference in the reconstructed image when we toggle sampling on and off!