
Instructions: Datasets for DL course (2019 Fall):

Preparations:

Download the platform software: Python 3.7.4 (the latest release as of this writing). Official download website: https://www.python.org/downloads/

Install TensorFlow: a widely used, open-source machine learning library developed by Google. A quick guide to installing TensorFlow is available here: https://www.geeksforgeeks.org/introduction-to-tensorflow/

Install the Keras API (optional but recommended): a high-level deep learning API written in Python. There are several guides to setting it up with the TensorFlow backend:
1. If you are not using Anaconda: https://www.pyimagesearch.com/2016/07/18/installing-keras-for-deep-learning/
2. If you are using Anaconda: https://towardsdatascience.com/installing-keras-tensorflow-using-anaconda-for-machine-learning-44ab28ff39cb

Datasets:

1. MNIST Dataset

Figure 1: Examples of the images in the MNIST dataset

Intro: The MNIST database of handwritten digits, available from the link below, has a training set of 60,000 examples and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image.

Applications: Shallow/Deep Neural Networks; CNN; RNN

Link to Dataset: http://yann.lecun.com/exdb/mnist/


Import from Keras:

from keras.datasets import mnist

# Set up train and test splits
(x_train, y_train), (x_test, y_test) = mnist.load_data()
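As a quick sanity check, and because most models expect scaled inputs, you can inspect the array shapes and normalize the pixel values. The snippet below is a minimal sketch assuming the Keras import above succeeded; the to_categorical call is only needed if your network ends in a softmax over the 10 digit classes.

from keras.utils import to_categorical

print(x_train.shape, y_train.shape)  # (60000, 28, 28) (60000,)

# Scale pixel intensities from [0, 255] to [0, 1]
x_train = x_train.astype("float32") / 255.0
x_test = x_test.astype("float32") / 255.0

# One-hot encode the 10 digit classes
y_train = to_categorical(y_train, 10)
y_test = to_categorical(y_test, 10)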

Import Dataset from TensorFlow:

import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
train_x = mnist.train.images
train_y = mnist.train.labels
test_x = mnist.test.images
test_y = mnist.test.labels
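Note that tensorflow.examples.tutorials only ships with TensorFlow 1.x and was removed in 2.x. If your installation is newer, a roughly equivalent sketch (my suggestion, not part of the original handout) is to go through the Keras loader bundled with TensorFlow and reproduce the one-hot encoding by hand:

import tensorflow as tf

# TensorFlow 2.x route: load MNIST via the bundled Keras datasets module
(train_x, train_y), (test_x, test_y) = tf.keras.datasets.mnist.load_data()

# read_data_sets(..., one_hot=True) returned one-hot labels; redo that here
train_y = tf.keras.utils.to_categorical(train_y, 10)
test_y = tf.keras.utils.to_categorical(test_y, 10)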

2. Fashion-MNIST Dataset

Figure 2: Examples of the images in the Fashion MNIST dataset

Intro: Fashion-MNIST contains 70,000 grayscale images of fashion products from 10 categories, such as sneakers, trousers, and coats, with 7,000 images per category. This dataset is more challenging than MNIST and is therefore considered a drop-in replacement for the original MNIST dataset when benchmarking machine learning algorithms. It shares the same image size (28x28), data format, and training/testing split structure.

Link to Dataset: https://arxiv.org/abs/1708.07747

Applications: Shallow/Deep Neural Networks; CNN; RNN

Import from Keras:

from keras.datasets import fashion_mnist

# Set up train and test splits
(x_train, y_train), (x_test, y_test) = fashion_mnist.load_data()
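The loader returns integer labels from 0 to 9. The name mapping below follows the Fashion-MNIST repository's documentation, so treat it as a convenience rather than part of the loader:

# Category names for labels 0-9, per the Fashion-MNIST documentation
class_names = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
               "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]

print(class_names[y_train[0]])  # category of the first training image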


Import from TensorFlow:

Train (size: 60,000): train-images-download, train-labels-download
Test (size: 10,000): test-images-download, test-labels-download

Table 1: Download links for the Fashion-MNIST dataset

Make sure you have downloaded the data and saved it under the path data/fashion. Otherwise, the loader will silently fall back to the original MNIST dataset.

from tensorflow.examples.tutorials.mnist import input_data

data = input_data.read_data_sets('data/fashion')

Import from mnist_reader (the helper script provided in the Fashion-MNIST GitHub repository):

import mnist_reader

X_train, y_train = mnist_reader.load_mnist('data/fashion', kind='train')
X_test, y_test = mnist_reader.load_mnist('data/fashion', kind='t10k')
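One caveat: this helper returns each image flattened to a 784-dimensional vector rather than a 28x28 grid. If your model expects 2-D images, reshape first (a small sketch, assuming the loads above succeeded):

# mnist_reader returns flat (n, 784) arrays; recover the 28x28 image grid
X_train = X_train.reshape(-1, 28, 28)
X_test = X_test.reshape(-1, 28, 28)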

3. Cat and Non-Cat Dataset

Figure 3: A cat picture from the web

Intro: This "cat and non-cat" dataset is taken from Andrew Ng's Coursera course Neural Networks and Deep Learning. The dataset ("data.h5") contains: a training set of images labeled as cat (y=1) or non-cat (y=0); a test set of images labeled as cat or non-cat. Each image is of shape (num_px, num_px, 3), where 3 is the number of (RGB) color channels; thus each image is square (height = width = num_px).

Link to Dataset:


Train data: Google Drive link
Test data: Google Drive link

Applications: Shallow/Deep Neural Networks; CNN; RNN

Import from the h5py package: First, make sure you have already downloaded the dataset (both train and test); then import the h5py package:

import numpy as np
import h5py

def load_data():
    train_dataset = h5py.File('train_catvnoncat.h5', "r")
    train_set_x_orig = np.array(train_dataset["train_set_x"][:])  # your train set features
    train_set_y_orig = np.array(train_dataset["train_set_y"][:])  # your train set labels
    test_dataset = h5py.File('test_catvnoncat.h5', "r")
    test_set_x_orig = np.array(test_dataset["test_set_x"][:])  # your test set features
    test_set_y_orig = np.array(test_dataset["test_set_y"][:])  # your test set labels
    classes = np.array(test_dataset["list_classes"][:])  # the list of classes
    return train_set_x_orig, train_set_y_orig, test_set_x_orig, test_set_y_orig, classes

train_x, train_y, test_x, test_y, classes = load_data()
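For the shallow networks this dataset is usually paired with, the common convention is to flatten each image into a column vector and rescale the pixels to [0, 1]. The sketch below follows that convention; the variable names are my own, not part of the dataset:

# Flatten each (num_px, num_px, 3) image into a column vector,
# giving a (num_px * num_px * 3, m) matrix, then scale to [0, 1]
train_x_flat = train_x.reshape(train_x.shape[0], -1).T / 255.0
test_x_flat = test_x.reshape(test_x.shape[0], -1).T / 255.0

print(train_x_flat.shape, test_x_flat.shape)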

4. CIFAR-10 Dataset

Figure 4: Examples of the images in the CIFAR-10 dataset


Intro: "CIFAR-10 is an established computer-vision dataset used for object recognition. It is a subset of the 80 million tiny images dataset and consists of 60,000 32x32 color images containing one of 10 object classes, with 6,000 images per class. It was collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton."

Link to Dataset: http://www.cs.toronto.edu/~kriz/cifar.html

Applications: Shallow/Deep Neural Networks; CNN; RNN

Import from Keras:

from keras.datasets import cifar10

(x_img_train, y_label_train), (x_img_test, y_label_test) = cifar10.load_data()

Import using self-defined functions:

import pickle
import numpy as np

# Load a single serialized batch file
def unpickle(file):
    with open(file, 'rb') as fo:
        d = pickle.load(fo, encoding='bytes')
    return d

# Define the function to load one CIFAR-10 batch
def load_CIFAR_batch(filename):
    with open(filename, 'rb') as f:
        datadict = pickle.load(f, encoding='latin1')
        X = datadict['data']
        Y = datadict['labels']
        # Each batch holds 10,000 images stored row-wise; reshape to (N, 32, 32, 3)
        X = X.reshape(10000, 3, 32, 32).transpose(0, 2, 3, 1).astype("float")
        Y = np.array(Y)
        return X, Y

# Define the function to load the whole dataset
def load_CIFAR10():
    xs = []
    ys = []
    for b in range(1, 6):
        location = 'data_batch_' + str(b)
        X, Y = load_CIFAR_batch(location)
        xs.append(X)  # collect all five training batches
        ys.append(Y)
    Xtr = np.concatenate(xs)  # stack the batches; final Xtr shape is (50000, 32, 32, 3)
    Ytr = np.concatenate(ys)
    del X, Y
    Xte, Yte = load_CIFAR_batch('test_batch')
    return Xtr, Ytr, Xte, Yte

# Load the dataset (the extracted batch files must sit in the working directory)
X_train, y_train, X_test, y_test = load_CIFAR10()
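To verify the loader works, you can display a single image with matplotlib. The label names below follow the order documented in the CIFAR-10 batches.meta file; double-check them against your download:

import matplotlib.pyplot as plt

# Class names in label order, per the CIFAR-10 documentation
label_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
               'dog', 'frog', 'horse', 'ship', 'truck']

plt.imshow(X_train[0].astype('uint8'))  # cast back from float for display
plt.title(label_names[int(y_train[0])])
plt.show()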


5. Street View House Numbers (SVHN) Dataset

Figure 5: Examples of the images in the SVHN dataset

Intro: SVHN is a real-world image dataset for developing machine learning and object recognition algorithms with minimal requirements on data preprocessing and formatting. SVHN is obtained from house numbers in Google Street View images.

Link to Dataset: http://ufldl.stanford.edu/housenumbers/

Applications: Shallow/Deep Neural Networks; CNN; RNN

Import from SciPy: First, make sure you have already downloaded these files (Format 2, cropped digits): train_32x32.mat, test_32x32.mat, extra_32x32.mat

import numpy as np
import scipy.io as sio
import matplotlib.pyplot as plt
%matplotlib inline

train_data = sio.loadmat('train_32x32.mat')
test_data = sio.loadmat('test_32x32.mat')
extra_data = sio.loadmat('extra_32x32.mat')  # needed below; missing from the original listing

x_train = train_data['X']
y_train = train_data['y']
x_test = test_data['X']
y_test = test_data['y']
x_extra = extra_data['X']
y_extra = extra_data['y']
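Two quirks of these .mat files are worth handling before training: the image array comes as (32, 32, 3, N) with the sample axis last, and the digit 0 is stored with label 10. The sketch below normalizes both, assuming the loads above succeeded:

# Move the sample axis first: (32, 32, 3, N) -> (N, 32, 32, 3)
x_train = np.transpose(x_train, (3, 0, 1, 2))
x_test = np.transpose(x_test, (3, 0, 1, 2))
x_extra = np.transpose(x_extra, (3, 0, 1, 2))

# SVHN stores the digit 0 as class 10; remap it to 0
y_train[y_train == 10] = 0
y_test[y_test == 10] = 0
y_extra[y_extra == 10] = 0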

To also use the extra set as training data:


# Use extra data as train data
x_train = np.concatenate([x_train, x_extra])
y_train = np.concatenate([y_train, y_extra])

6. Large Movie Review (IMDB) Dataset

Figure 6: IMDB picture from the web

Intro: This is a dataset for binary sentiment classification containing 25,000 highly polar movie reviews for training and 25,000 for testing. Each review is stored as a sequence of integers. These are word IDs that have been pre-assigned to individual words, and the label is an integer (0 for negative, 1 for positive).

Link to Dataset: http://ai.stanford.edu/~amaas/data/sentiment/

Applications: Shallow/Deep Neural Networks; CNN; LSTM (RNN)

Review: One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.
Label: Positive

Table 2: An example of a review and its label in the IMDB dataset

Import from Keras:

from keras.datasets import imdb

vocabulary_size = 5000
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=vocabulary_size)
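Since each review arrives as a sequence of word IDs, it helps to decode one back into text. The sketch below uses imdb.get_word_index(); Keras reserves indices 0-2 for padding, start, and unknown tokens, so the stored IDs are offset by 3:

# Build an ID -> word mapping (IDs in the data are offset by 3)
word_index = imdb.get_word_index()
id_to_word = {i + 3: w for w, i in word_index.items()}

decoded = " ".join(id_to_word.get(i, "?") for i in X_train[0])
print(decoded[:200])  # start of the first training review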


Import with pandas and the os package: First, download the dataset from the link http://ai.stanford.edu/~amaas/data/sentiment/ and extract it to the same folder as your program. Then import os and pandas:

import os
import pandas as pd

path = "aclImdb/"
positiveFiles = [x for x in os.listdir(path + "train/pos/") if x.endswith(".txt")]
negativeFiles = [x for x in os.listdir(path + "train/neg/") if x.endswith(".txt")]
# Note: in the extracted aclImdb archive the test reviews live in the
# test/pos/ and test/neg/ subfolders, not directly under test/, so list
# both subfolders and keep the relative paths for reading later
testFiles = [d + "/" + x for d in ("pos", "neg")
             for x in os.listdir(path + "test/" + d) if x.endswith(".txt")]

Next, create the DataFrame:

positiveReviews, negativeReviews, testReviews = [], [], []
for pfile in positiveFiles:
    with open(path + "train/pos/" + pfile, encoding="latin1") as f:
        positiveReviews.append(f.read())
for nfile in negativeFiles:
    with open(path + "train/neg/" + nfile, encoding="latin1") as f:
        negativeReviews.append(f.read())
for tfile in testFiles:
    with open(path + "test/" + tfile, encoding="latin1") as f:
        testReviews.append(f.read())

reviews = pd.concat([
    pd.DataFrame({"review": positiveReviews, "label": 1, "file": positiveFiles}),
    pd.DataFrame({"review": negativeReviews, "label": 0, "file": negativeFiles}),
    pd.DataFrame({"review": testReviews, "label": -1, "file": testFiles})
], ignore_index=True).sample(frac=1, random_state=1)
reviews.head()

The output will be the first five rows of the shuffled DataFrame, showing the review text, its label, and the source file name.

Lastly, we can perform the train, validation, and test set splits:


1. reviews = reviews[["review", "label", "file"]].sample(frac=1, random_state=1)  

2. train = reviews[reviews.label!=-1].sample(frac=0.6, random_state=1)  3. valid = reviews[reviews.label!=-1].drop(train.index)  4. test = reviews[reviews.label==-1]