
Dimensionality Reduction for Categorical Data

Debajyoti Bera, Rameshwar Pratap, and Bhisham Dev Verma

Abstract—Categorical attributes are those that can take a discrete set of values, e.g., colours. This work is about compressing vectors over categorical attributes to low-dimension discrete vectors. The current hash-based methods for this task do not provide any guarantee on the Hamming distances between the compressed representations. Here we present FSketch to create sketches for sparse categorical data and an estimator to estimate the pairwise Hamming distances among the uncompressed data only from their sketches. We claim that these sketches can be used in the usual data mining tasks in place of the original data without compromising the quality of the task. For that, we ensure that the sketches are also categorical, sparse, and that the Hamming distance estimates are reasonably precise. Both the sketch construction and the Hamming distance estimation algorithms require just a single pass; furthermore, changes to a data point can be incorporated into its sketch in an efficient manner. The compressibility depends upon how sparse the data is and is independent of the original dimension, making our algorithm attractive for many real-life scenarios. Our claims are backed by rigorous theoretical analysis of the properties of FSketch and supplemented by extensive comparative evaluations with related algorithms on some real-world datasets. We show that FSketch is significantly faster, and the accuracy obtained by using its sketches is among the top for the standard unsupervised tasks of RMSE-based distance estimation, clustering and similarity search.

Index Terms—Dimensionality Reduction, Sketching, Feature Hashing, Clustering, Classification, Similarity Search.


1 INTRODUCTION

Of the many types of digital data that are getting recorded every second, most can be ordered – they belong to the ordinal type (e.g., age, citation count, etc.) – and a good proportion can be represented as strings but cannot be ordered — they belong to the nominal type (e.g., hair colour, country, publication venue, etc.). The latter datatype is also known as categorical, which is our focus in this work. Categorical attributes are commonly present in survey responses, and have been used earlier to model problems in bioinformatics [1], [2], market-basket transactions [3], [4], [5], web-traffic [6], images [7], and recommendation systems [8]. The first challenge practitioners encounter with such data is how to process them using standard tools, most of which are designed for numeric data, often real-valued at that.

Two important operations are often performed before running statistical data analysis tools and machine learning algorithms on such datasets. The first is encoding the data points using numbers, and the second is dimensionality reduction; many approaches combine the two, with the final objective being numeric vectors of fewer dimensions. To the best of our knowledge, the approaches usually followed are ad-hoc adaptations of those employed for vectors in the real space, and suffer from computational inefficiency and/or unproven heuristics [9].

• D. Bera is with the Department of Computer Science and Engineering, Indraprastha Institute of Information Technology (IIIT-Delhi), New Delhi, India, 110020. E-mail: see http://www.michaelshell.org/contact.html

• R. Pratap and B. D. Verma are with the Indian Institute of Technology, Mandi, Himachal Pradesh, India.

• Emails: [email protected], [email protected] and [email protected].

Manuscript accepted for publication by IEEE Transactions on Knowledge and Data Engineering. Copyright 1969, IEEE.

The motivation of this work is to provide a solution that is efficient in practice and has proven theoretical guarantees.

For the first operation, we use the standard method of label encoding in this paper. In this method, a feature with c categories is represented by an integer from {0, 1, 2, . . . , c}, where 0 indicates a missing category and i ∈ {1, 2, . . . , c} indicates the i-th category. Hence, an n-dimensional data point, where each feature can take at most c values, can be represented by a vector from {0, 1, 2, . . . , c}^n — we call such a vector a categorical vector. Another approach is one-hot encoding (OHE), which is more popular since it avoids the implicit ordering among the feature values imposed by label encoding. One-hot encoding of a feature with c possible values is a c-dimensional binary vector in which the i-th bit is set to 1 to represent the i-th feature value. Naturally, the one-hot encoding of an n-dimensional vector will be nc-dimensional — which can be very large if c is large (e.g., for features representing countries, etc.). Not only does label encoding avoid this problem, it is also essential for the crucial second step – that of dimensionality reduction.
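To make the two encodings concrete, the following minimal Python sketch (illustrative only; the feature name and category list are hypothetical, not taken from the paper) label-encodes a single categorical feature and contrasts the result with its one-hot encoding.

```python
# Illustrative label encoding vs. one-hot encoding for one categorical feature.
# The category list below is a hypothetical example, not from the paper.
categories = ["red", "green", "blue"]                          # c = 3 categories
label_of = {cat: i + 1 for i, cat in enumerate(categories)}    # 1..c; 0 = missing

def label_encode(value):
    """Return an integer in {0, 1, ..., c}; 0 denotes a missing value."""
    return label_of.get(value, 0)

def one_hot_encode(value):
    """Return a c-dimensional 0/1 vector; all-zero if the value is missing."""
    vec = [0] * len(categories)
    code = label_encode(value)
    if code > 0:
        vec[code - 1] = 1
    return vec

print(label_encode("green"))                      # 2   (one integer per feature)
print(one_hot_encode("green"))                    # [0, 1, 0]   (c bits per feature)
print(label_encode(None), one_hot_encode(None))   # 0 [0, 0, 0]
```

A label-encoded data point thus keeps one coordinate per feature, whereas its one-hot encoding multiplies the dimension by c.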

Dimensionality reduction is important when data points lie in a high-dimensional space, e.g., when encoded using one-hot encoding or when described using tens of thousands of categorical attributes. High-dimensional data vectors not only increase storage and processing cost, but also suffer from the “curse of dimensionality”, i.e., the performance of many algorithms degrades once the dimension of the data points grows beyond a point. Hence it is suggested that the high-dimensional categorical vectors be compressed to smaller vectors, essentially retaining the information only from the useful features. Baraniuk et al. [10] characterised a good dimensionality reduction in the Euclidean space as a compression algorithm R that satisfies the following two conditions for any two vectors x and y.

1) Information preserving: For any two distinct vectors x and y, R(x) ≠ R(y).


2) ε-Stability: (Euclidean) distances between all the points are approximately preserved (with ε inaccuracy).

We call these two conditions the “well-designed” conditions. To obtain their mathematically precise versions, we need to fix a distance measure for categorical vectors. A natural measure for categorical vectors is an extension of the binary Hamming distance. For two n-dimensional categorical data points x and y, the Hamming distance between them is defined as the number of features at which x and y take different values, i.e.,

$$\mathrm{HD}(x, y) = \sum_{i=1}^{n} \mathrm{dist}(x[i], y[i]), \quad \text{where} \quad \mathrm{dist}(x[i], y[i]) = \begin{cases} 1, & \text{if } x[i] \neq y[i], \\ 0, & \text{otherwise.} \end{cases}$$
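As a reference point for the rest of the paper, the definition above translates directly into code; the short Python sketch below computes HD(x, y) for two label-encoded vectors of equal dimension.

```python
def hamming_distance(x, y):
    """Hamming distance between two label-encoded categorical vectors of the
    same dimension: the number of coordinates at which they differ."""
    assert len(x) == len(y)
    return sum(1 for xi, yi in zip(x, y) if xi != yi)

# Example: the vectors differ at the second and fourth coordinates.
print(hamming_distance([2, 3, 0, 10], [2, 1, 0, 7]))  # 2
```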

Problem statement: The specific problem that we address is how to design a dimensionality reduction algorithm that can compress high-dimensional sparse label-encoded categorical vectors to low-dimensional categorical vectors so that (a) compressions of distinct vectors are distinct, and (b) the Hamming distance between two uncompressed vectors can be efficiently approximated from their compressed forms. These conditions, in turn, guarantee both information preservation and stability. Furthermore, we would like to take advantage of the sparse nature of many real-world datasets. The most important requirement is that the compressed vectors should be categorical as well, specifically not over real numbers and preferably not binary; this is to allow the statistical tests and machine learning tools for categorical datasets, e.g., k-mode, to run on the compressed datasets.

1.1 Challenges in the existing approaches

Dimensionality reduction is a well-studied problem [11] (also see Table 8 in the Appendix), but the Hamming space does not allow the usual approaches applicable in Euclidean spaces. Methods that work for continuous-valued data or even ordinal data (such as integers) do not perform satisfactorily for unordered categorical data. Among those that specifically consume categorical data, techniques via feature selection have been well studied. For example, in the case of labelled data, χ2 [12] and Mutual Information [13] based methods select features based on their correlation with the label. This limits their applicability to only classification tasks. Further, the Kendall rank correlation coefficient [14] “learns” the important features based on the correlation among them. Learning approaches tend to be computationally heavy and do not work reliably with small training samples. So what about task-agnostic approaches that do not involve learning? PCA-based methods, e.g., MCA, are popular among the practitioners of biology [11]; however, we consider them merely a better-than-nothing approach since PCA is fundamentally designed for continuous data.

A quick search among internet forums, tutorials and Q&A websites revealed that the favoured approach for performing machine learning tasks on categorical datasets is to convert categorical feature vectors to binary vectors using one-hot encoding [15, see DictVectorizer] — a widely-viewed tutorial on Kaggle calls it “The Standard Approach for Categorical Data” [16].

Fig. 1. An example showing that the Hamming distances of one-hot encoded sparse vectors are not functionally related to the distances between their unencoded forms: u = 220 → 10·10·00, v = 202 → 10·00·10, w = 201 → 10·00·01; here HD(v, w) = 1 whereas HD(OHE(v), OHE(w)) = 2, while the other two pairs have Hamming distance 2 both before and after encoding. If a feature, say country, is missing, libraries differ in their handling of its one-hot encoding. In this paper, we follow the common practice of using the c-dimensional all-zero vector as its encoding. This retains sparsity since the number of non-missing attributes in the original vector equals the number of non-zero bits in the encoded vector.
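The distances in Figure 1 can be checked directly; the self-contained Python sketch below reproduces them, assuming (as in the figure) that value 2 is encoded as the block 10, value 1 as 01, and the missing value 0 as the all-zero block.

```python
def hd(a, b):
    """Plain Hamming distance between two equal-length sequences."""
    return sum(1 for ai, bi in zip(a, b) if ai != bi)

def ohe(vec):
    """One-hot encode a vector over {0, 1, 2}; 0 (missing) becomes an all-zero
    block. Following Figure 1, value 2 maps to [1, 0] and value 1 to [0, 1]."""
    blocks = {0: [0, 0], 1: [0, 1], 2: [1, 0]}
    out = []
    for v in vec:
        out.extend(blocks[v])
    return out

u, v, w = [2, 2, 0], [2, 0, 2], [2, 0, 1]
print(hd(v, w), hd(ohe(v), ohe(w)))  # 1 2  -> distance inflated by OHE
print(hd(u, v), hd(ohe(u), ohe(v)))  # 2 2
print(hd(u, w), hd(ohe(u), ohe(w)))  # 2 2
```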

The biggest problem with OHE is that it is impractical for large n or large c, followed by a technical annoyance that some OHE implementations do not preserve the Hamming distances for sparse vectors (see the illustration in Figure 1). Hence, this encoding is used in conjunction with problem-specific feature selection or followed by dimensionality reduction from binary to binary vectors [17], [18], [19]. The latter is a viable heuristic that we wanted to improve upon by allowing non-binary compressed vectors (see Appendix A for a quick analysis of OHE followed by a state-of-the-art binary compression).

Another popular alternative, especially when n × c is large, is feature hashing [20], which is now part of most libraries, e.g., scikit-learn [15, see FeatureHasher]. Feature hashing and other forms of hash-based approaches, also known as sketching algorithms, both encode and compress categorical feature vectors into integer vectors (sometimes signed) of a lower dimension, and furthermore, provide theoretical guarantees like stability in some metric space. The currently known results for feature hashing apply only to the Euclidean space; however, Euclidean distance and Hamming distance are not monotonically related for categorical vectors. It is neither known nor straightforward to ascertain whether feature hashing and its derivatives can be extended to the Hamming space, which lacks the continuity that is crucial to their theoretical bounds. Other hash-based approaches either come with no guarantees and are used merely because of their compressibility, or come with stability-like guarantees in a different space, e.g., cosine similarity by SimHash [21]. Our solution is a hashing approach that we prove to be stable in the Hamming space.

1.2 Overview of results

The commonly followed practices in dealing with categorical vectors, especially those with high dimensions and not involving supervised learning or training data, appear to be either feature hashing or one-hot encoding followed by dimensionality reduction of binary vectors [22, Chapter 5]. We provide a contender to these in the form of the FSketch sketching algorithm to construct lower-dimensional categorical vectors from high-dimensional ones.

The lower-dimensional vectors, sketches, produced by FSketch (we shall call these vectors FSketch too) have the desired theoretical guarantees and perform well on real-world datasets vis-a-vis related algorithms. We now summarise the important features of FSketch; in the summarisation, p is a constant that is typically chosen to be a prime number between 5 and 50.

Lightweight and unsupervised: First and foremost, FSketch is an unsupervised process, and in fact, quite lightweight, making a single pass over an input vector and taking O(poly(log p)) steps per non-missing feature. The FSketch-es retain the sparsity of the input vectors, and their size and dimension do not depend at all on c. To make our sketches applicable out-of-the-box for modern applications where data keeps changing, we present an extremely lightweight algorithm to incorporate any change in a feature vector into its sketch in O(poly(log p)) steps per modified feature. It should be noted that FSketch supports change of an attribute, deletion of an attribute and insertion of a previously missing attribute, unlike some state-of-the-art sketches; for example, BinSketch [17] does not support deletion of an attribute.

Estimator for Hamming distance: We want to advocate the use of FSketch-es for data analytic tasks like clustering, etc. that use Hamming distance as the (dis)similarity metric. We present an estimator that can approximate the Hamming distance between two points by making a single pass over their sketches. The estimator follows a tight concentration bound and has the ability to estimate the Hamming distance from very low-dimensional sketches. In the theoretical bounds, the dimension could go as low as 4σ or even √σ (and independent of the dimension of the data), where σ indicates the sparsity (maximum number of non-zero attributes) of the input vectors; however, we later show that a much smaller dimension suffices in practice. Our sketch generation and Hamming distance estimation algorithms combined meet the two conditions of a “well-designed” dimensionality reduction.

Theorem 1. Let x and y be distinct categorical vectors, and φ(x) and φ(y) be their d-dimensional compressions.

1) φ(x) and φ(y) are distinct with probability ≈ HD(x, y)/d.
2) Let HD′(x, y) denote the approximation to the Hamming distance between x and y computed from φ(x) and φ(y). If d is set to 4σ, then with probability at least 1 − δ (for any δ of choice),
$$\left|\,\mathrm{HD}(x, y) - \mathrm{HD}'(x, y)\,\right| = O\!\left(\sqrt{\sigma \ln \tfrac{2}{\delta}}\right).$$

The proof of (1) follows from Lemma 3 and the proof of (2) follows from Lemma 8, for which we used McDiarmid's inequality. The theorem allows us to use compressed forms of the vectors in place of their original forms for data analytic and statistical tools that depend largely on their pairwise Hamming distances.

Practical performance: All of the above claims are proved rigorously, but one may wonder how they perform in practice. For this, we design an elaborate array of experiments on real-life datasets involving many common approaches for categorical vectors. The experiments demonstrate these facts.
• Some of the baselines do not output categorical vectors (see Section 4). Our FSketch algorithm is super-fast among those that do and offers comparable accuracy.
• When used for typical data analytic tasks like clustering, similarity search, etc., low-dimension FSketch-es bring immense speedup vis-a-vis using the original (uncompressed) vectors, yet achieve very high accuracy. The NYTimes dataset saw a 140x speedup upon compression to 0.1%.
• Even though highly compressed, the results of clustering, etc. on FSketch-es are close to what could be obtained from the uncompressed vectors and are comparable with the best alternatives. For example, we were able to compress the Brain cell dataset of dimensionality 1306127 to 1000 dimensions in a few seconds, yet retained the ability to correctly approximate the pairwise Hamming distances from the compressed vectors. This is despite many other baselines giving either an out-of-memory error, not stopping even after running for a sufficiently long time, or producing significantly worse estimates of pairwise Hamming distances.
• The parameter p can be used to fine-tune the quality of the results and the storage of the sketches.

We claim that FSketch is the best method today to compress categorical datasets for data analytic tasks that require pairwise Hamming distances, with respect to both theoretical guarantees and practical performance.

1.3 Organisation of the paper

The rest of the paper is organised as follows. We discuss several related works in Section 2. In Section 3, we present our algorithm FSketch and derive its theoretical bounds. In Section 4, we empirically compare the performance of FSketch on several end tasks with state-of-the-art algorithms. We conclude our presentation in Section 5. The proofs of the theoretical claims and the results of additional experiments are included in the Appendix.

2 RELATED WORK

Dimensionality reduction: Dimensionality reduction has been studied in depth for real-valued vectors, and to some extent, also for discrete vectors. We categorise them into these broad categories — (a) random projection, (b) spectral projection, (c) locality sensitive hashing (LSH), (d) other hashing approaches, and (e) learning-based algorithms. All of them compress high-dimensional input vectors to low-dimensional ones that explicitly or implicitly preserve some measure of similarity between the input vectors.

The seminal result by Johnson and Lindenstrauss [23] is probably the most well known random projection-based algorithm for dimensionality reduction. This algorithm compresses real-valued vectors to low-dimensional real-valued vectors such that the Euclidean distances between the pairs of vectors are approximately preserved, but in such a manner that the compressed dimension does not depend upon the original dimension. The algorithm involves projecting a data matrix onto a random matrix whose every entry is sampled from a Gaussian distribution. This result has seen lots of enhancements, particularly with respect to generating the random matrix without affecting the accuracy [24], [25], [26]. However, it is not clear whether any of those ideas can be made to work for categorical data, and that too for approximating Hamming distances.

Principal component analysis (PCA) is a spectral projection-based technique for reducing the dimensionality of high-dimensional datasets by creating new uncorrelated variables that successively maximise variance. There are extensions of PCA that employ kernel methods to capture non-linear relationships [27]. Multiple Correspondence Analysis (MCA) [28] does the analogous job for categorical datasets. However, these methods perform dimensionality reduction by creating uncorrelated features in a low-dimensional space, whereas our aim is to preserve the pairwise Hamming distances in a low-dimensional space.

Another line of dimensionality reduction techniques builds upon “Locality Sensitive Hashing (LSH)” algorithms. LSH algorithms have been proposed for different data types and similarity measures, e.g., real-valued vectors and the Euclidean distance [29], real-valued vectors and the cosine similarity [21], binary vectors and the Jaccard similarity [30], binary vectors and the Hamming distance [31]. However, generally speaking, the objective of an LSH is to group items so that similar items are grouped together and dissimilar items are not; unlike FSketch, they do not provide explicit estimators of any similarity metric.

There are quite a few learning-based dimensionality reduction algorithms available, such as Latent Semantic Analysis (LSA) [32], Latent Dirichlet Allocation (LDA) [33], Non-negative Matrix Factorisation (NNMF) [34], and Generalized feature embedding learning (GEL) [35], all of which strive to learn a low-dimensional representation of a dataset while preserving some inherent properties of the full-dimensional dataset. They are rather slow due to the optimisation step involved during learning. T-distributed Stochastic Neighbour Embedding (t-SNE) [36] is a faster non-linear dimensionality reduction technique that is widely used for the visualisation of high-dimensional datasets. However, the low-dimensional representation obtained from t-SNE is not recommended for use in other end tasks such as clustering, classification, or anomaly detection as it does not necessarily preserve densities or pairwise distances. An autoencoder [37] is another learning-based non-linear dimension reduction algorithm. It basically consists of two parts: an encoder, which aims to learn a low-dimensional representation of the input, and a decoder, which tries to reconstruct the original input from the output of the encoder. However, these approaches involve optimising a learning objective function and are usually slow and CPU-intensive.

The other hashing approaches randomly assign each feature (dimension) to one of several bins, and then compute a summary value for each bin by aggregating all the feature values assigned to it. A list of such summaries can be viewed as a low-dimensional sketch of the input. Such techniques have been designed for real-valued vectors approximating the inner product (e.g., feature hashing [20]), binary vectors allowing estimation of several similarity measures such as Hamming distance, inner product, cosine, and Jaccard similarity (e.g., BinSketch [17]), etc. This work is similar to these approaches but targets categorical vectors and only aims to estimate the Hamming distances.

Another approach in this direction could be to encode categorical vectors to binary and then apply dimensionality reduction for binary vectors; unfortunately, the popular encodings, e.g., OHE, do not preserve Hamming distance for vectors with missing features. Nevertheless, it is possible to encode using OHE and then reduce the dimension. However, our theoretical analysis led to a worse accuracy compared to that of FSketch (see Appendix A for the analysis) and this approach turned out to be one of the worst performers in our experiments (see Section 4).

While our motivation was to design an end-task agnostic dimensionality reduction algorithm, there exist several that are designed for specific tasks, e.g., for clustering [38], for regression and discriminant analysis of labelled data [39], and for estimating covariance matrices [40]. Deep learning has gained mainstream importance and several researchers have proposed a dimensionality reduction “layer” inside a neural network [41]; this layer is intricately interwoven with the other layers and cannot be separated out as a standalone technique that outputs compressed vectors.

Feature selection is a limited form of dimensionality reduction whose task is to identify a set of good features, and maybe learn their relative importance too. Banerjee and Pal [42] recently proposed an unsupervised technique that identifies redundant features and selects those with bounded correlation, but only for real-valued vectors. For our experiments we chose the Kendall-Tau rank correlation approach, which is applicable to discrete-valued vectors.

Sketching algorithm: The use of “sketches” for computing Hamming distance has been explicitly studied in the streaming algorithm framework. The first well-known solution was proposed by Cormode et al. [43], who showed how to estimate a Hamming distance with high accuracy and low error. There have been several improvements to this result, in particular by Kane et al. [44], where a sketch of the optimal size was proposed. However, we neither found any implementation nor an empirical evaluation of those approaches (the algorithms themselves appear fairly involved). Further, their objective was to minimise the space usage in the asymptotic sense in a streaming setting, whereas our objective is to design a solution that can be readily used for data analysis. This motivated us to compress categorical vectors onto low-dimensional categorical vectors, unlike the real-valued vectors that the theoretical results propose. A downside of our solution is that it heavily relies on the sparsity of a dataset, unlike the sketches output by the streaming algorithms.

TABLE 1
Notations

categorical data vectors                            x, y
their Hamming distance                              h
compressed categorical vectors (sketches)           φ(x), φ(y)
j-th coordinate of a sketch φ(x)                    φ_j(x)
observed Hamming distance between sketches          f
expected Hamming distance between sketches          f*
estimated Hamming distance between data vectors     ĥ

3 CATEGORY SKETCHING AND HAMMING DISTANCE ESTIMATION

Our technical objective is to design an effective algorithm to compress high-dimensional vectors over {0, 1, . . . , c} to integer vectors of a low dimension, aka. sketches; c can even be set to an upper bound on the largest number of categories among all the features. The number of attributes in the input vectors is denoted n and the dimension of the compressed vector is denoted d. We will later show how to choose d depending on the sparsity of a dataset, which we denote σ.


Fig. 2. An example illustrating how to compress a data point with categorical features using FSketch to a 3-dimensional integer vector. The data point has 10 feature values, each of which is a categorical variable (the corresponding label-encoded values are shown in brackets): family name starts with: missing (0); colour of eye: grey (2); religion: Hindu (3); (Indian) state of birth: Orissa (10); country of schooling: India (101); country of graduation: UK (7); country of post-graduation: missing (0); country of current residence: India (101); nationality of spouse: missing (0); favourite cuisine: missing (0). The resulting vector is x = [2, 3, 0, 10, 101, 7, 0, 101, 0, 0]. c is chosen as 195 since the fifth, sixth, seventh, and eighth features have 195 categories, which is the largest. The vector is compressed to d = 3 using the internal variables ρ, p and R of FSketch, with ρ given by 1 → 2, 2 → 3, 3 → 1, 4 → 1, 5 → 2, 6 → 1, 7 → 3, 8 → 2, 9 → 1, 10 → 3, p = 67, and R = [37, 2, 56, 46, 17, 61, 26, 9, 12, 38]. This gives φ1 = 0·56 + 10·46 + 7·61 + 0·12 (mod 67) = 16, φ2 = 2·37 + 101·17 + 101·9 (mod 67) = 20, and φ3 = 3·2 + 0·26 + 0·38 (mod 67) = 6, i.e., φ(x) = [16, 20, 6].

The commonly used notations in this section are listed in Table 1.

Algorithm 1 Constructing the d-dimensional FSketch of an n-dimensional vector x

1: procedure INITIALIZE
2:     Choose a random mapping ρ : {1, . . . , n} → {1, . . . , d}
3:     Choose some prime p
4:     Choose n random numbers R = r1, . . . , rn with each ri ∈ {0, . . . , p − 1}
5: end procedure

1: procedure CREATESKETCH(x ∈ {0, 1, . . . , c}^n)
2:     Create empty sketch φ(x) = 0^d
3:     for i = 1 . . . n do
4:         j = ρ(i)
5:         φj(x) = (φj(x) + xi · ri) mod p
6:     end for
7:     return φ(x)
8: end procedure

3.1 FSketch construction

Our primary tool for sketching categorical data is a randomised sketching algorithm named FSketch that is described in Algorithm 1; see Figure 2 for an example.

Let x ∈ {0, 1, . . . , c}^n denote the input vector, and let the i-th feature or coordinate of x be denoted by xi. The sketch of an input vector x will be denoted φ(x) ∈ {0, 1, . . . , p − 1}^d, whose coordinates will be denoted φ1(x), φ2(x), . . . , φd(x). Note that the initialisation step of FSketch needs to run only once for a dataset. We are going to use the following characterisation of the sketches in the rest of this section; a careful reader may observe the similarity to Freivalds' algorithm for verifying matrix multiplication [45].

Observation 2. It is obvious from Algorithm 1 that the sketches created by FSketch satisfy $\phi_j(x) = \left(\sum_{i \in \rho^{-1}(j)} x_i \cdot r_i\right) \bmod p$.
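For concreteness, the Python sketch below implements Algorithm 1 following Observation 2 (a random mapping ρ, a prime p, random coefficients r_i, and a mod-p accumulation). It uses 0-indexed coordinates and Python's random module; the data point from Figure 2 is reused in the toy call, but with a freshly drawn ρ and R, so the output will generally differ from the figure.

```python
import random

def fsketch_init(n, d, p, seed=0):
    """One-time initialisation (per dataset): a random mapping rho from the n
    coordinates to the d buckets, and n random coefficients in {0, ..., p-1}."""
    rng = random.Random(seed)
    rho = [rng.randrange(d) for _ in range(n)]   # rho(i) for i = 0..n-1
    R = [rng.randrange(p) for _ in range(n)]     # r_i    for i = 0..n-1
    return rho, R

def fsketch(x, rho, R, p, d):
    """d-dimensional FSketch of a label-encoded vector x:
    phi_j(x) = (sum of x_i * r_i over all i with rho(i) = j) mod p."""
    phi = [0] * d
    for i, xi in enumerate(x):
        if xi != 0:                              # single pass; skip missing entries
            j = rho[i]
            phi[j] = (phi[j] + xi * R[i]) % p
    return phi

# Toy usage with illustrative parameters (n = 10, d = 3, p = 67).
x = [2, 3, 0, 10, 101, 7, 0, 101, 0, 0]
rho, R = fsketch_init(n=len(x), d=3, p=67)
print(fsketch(x, rho, R, p=67, d=3))
```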

3.2 Hamming distance estimation

Here we explain how the Hamming distance between x and y, denoted HD(x, y), percolates to their sketches as well. The objective is to derive an estimator for HD(x, y) from the Hamming distance between φ(x) and φ(y).

The sparsity of a set of vectors, denoted σ, is the maximum number of non-zero coordinates in them. For the theoretical analysis, we assume that we know the sparsity of the dataset, or at least an upper bound on the same. Note that, for a pair of sparse vectors x, y ∈ {0, 1, . . . , c}^n, the Hamming distance between them can vary from 0 (when they are the same) to 2σ (when they are completely different).

We first prove case (a) of Theorem 1, which states that sketches of different vectors are rarely the same.

Lemma 3. Let h denote HD(x, y) for two input vectors x, y to FSketch. Then
$$\Pr_{\rho,R}\left[\phi_j(x) \neq \phi_j(y)\right] = \left(1 - \tfrac{1}{p}\right)\left(1 - \left(1 - \tfrac{1}{d}\right)^h\right).$$

Proof. Fix a mapping ρ and then define F_j(x) as the vector [x_{i_1}, x_{i_2}, . . . : i_k ∈ {1, . . . , n}] of values of x that are mapped to j in φ(x), in the increasing order of their coordinates, i.e., ρ(i_k) = j and i_1 < · · · < i_k < i_{k+1}. Since ρ is fixed, F_j(y) is also a vector of the same length. The key observation is that if F_j(x) = F_j(y) then φ_j(x) = φ_j(y), but the converse is not always true. Therefore we separately analyse both the conditions (a) F_j(x) ≠ F_j(y) and (b) F_j(x) = F_j(y).

It is given that x and y differ at h coordinates. Therefore, F_j(x) ≠ F_j(y) iff any of those coordinates is mapped to j by ρ. Thus,
$$\Pr_{\rho}\left[F_j(x) = F_j(y)\right] = \left(1 - \tfrac{1}{d}\right)^h. \qquad (1)$$

Next we analyse the chance of φ_j(x) = φ_j(y) when F_j(x) ≠ F_j(y). Note that φ_j(x) = (x_{i_1} · r_{i_1} + x_{i_2} · r_{i_2} + . . .) mod p (and a similar expression exists for y), where the r_i's are randomly chosen during initialisation (they are fixed for x and y). Using an analysis similar to that of Freivalds' algorithm [46, Ch. 1 (Verifying matrix multiplication)],
$$\Pr_{\rho,R}\left[\phi_j(x) = \phi_j(y) \mid F_j(x) \neq F_j(y)\right] = \tfrac{1}{p}. \qquad (2)$$

Due to Equations 1 and 2, we have
$$\begin{aligned}
\Pr_{\rho,R}[\phi_j(x) \neq \phi_j(y)] &= \Pr_{\rho,R}[\phi_j(x) \neq \phi_j(y) \mid F_j(x) \neq F_j(y)] \cdot \Pr_{\rho,R}[F_j(x) \neq F_j(y)] \\
&\quad + \Pr_{\rho,R}[\phi_j(x) \neq \phi_j(y) \mid F_j(x) = F_j(y)] \cdot \Pr_{\rho,R}[F_j(x) = F_j(y)] \\
&= \left(1 - \tfrac{1}{p}\right)\left(1 - \left(1 - \tfrac{1}{d}\right)^h\right).
\end{aligned}$$

The right-hand side of the expression in the statement of the lemma can be approximated as $(1 - \frac{1}{p})\frac{h}{d}$, which is stated as case (a) of Theorem 1. The lemma also allows us to relate the Hamming distance of the sketches to the Hamming distance of the vectors, which is our main tool to define an estimator.

Lemma 4. Let h denote HD(x, y) for two input vectors x, y to FSketch, f denote HD(φ(x), φ(y)) and f* denote E[HD(φ(x), φ(y))]. Then
$$f^* = \mathbb{E}[f] = d\left(1 - \tfrac{1}{p}\right)\left(1 - \left(1 - \tfrac{1}{d}\right)^h\right).$$


Fig. 3. The distributions of Hamming distances for some of the datasets used in our experiments are shown in blue — the Y-axis shows the frequency of each distance. The black points represent the actual Hamming distances and the red points are the estimates, i.e., a red point plotted against a Hamming distance d (on the X-axis) shows the estimated Hamming distance between two points with actual Hamming distance d. Observe that the Hamming distances follow a long-tailed distribution and that most distances are fairly low — moreover, our estimates are more accurate for those highly frequent Hamming distances.

The lemma is easily proved using Lemma 3 by applying the linearity of expectation to the number of coordinates j such that φ_j(x) ≠ φ_j(y). We are now ready to define an estimator for the Hamming distance.

Using $D = \left(1 - \frac{1}{d}\right)$ and $P = \left(1 - \frac{1}{p}\right)$, we can write
$$f^* = dP\left(1 - D^h\right) \quad \text{and} \quad h = \ln\left(1 - \frac{f^*}{dP}\right) \Big/ \ln D. \qquad (3)$$

Our proposal to estimate h is to obtain a tight approximation of f* and then use the above expression.

Definition 5 (Estimator of Hamming distance). Given sketches φ(x) and φ(y) of data points x and y, suppose f represents HD(φ(x), φ(y)). We define the estimator of HD(x, y) as
$$\hat{h} = \ln\left(1 - \frac{f}{dP}\right) \Big/ \ln D \;\;\text{ if } f < dP, \text{ and } 2\sigma \text{ otherwise.}$$

Observe that ĥ is set to 2σ if f ≥ dP. However, we shall show in the next section that this occurs very rarely.
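A direct transcription of Definition 5 (with D = 1 − 1/d, P = 1 − 1/p, and the cap at 2σ when f ≥ dP) is given in the Python sketch below; it assumes the two sketches were produced with the same ρ, R and p, and the sketch values and σ in the toy call are illustrative only.

```python
import math

def estimate_hamming(phi_x, phi_y, p, sigma):
    """Estimate HD(x, y) from two FSketches built with the same rho, R and p
    (Definition 5): invert f* = dP(1 - D^h) using the observed f, returning
    2*sigma when f >= dP."""
    d = len(phi_x)
    f = sum(1 for a, b in zip(phi_x, phi_y) if a != b)   # observed sketch distance
    D = 1.0 - 1.0 / d
    P = 1.0 - 1.0 / p
    if f >= d * P:
        return 2 * sigma
    return math.log(1.0 - f / (d * P)) / math.log(D)

# Toy usage: identical sketches give an estimate of 0.
print(estimate_hamming([16, 20, 6], [16, 20, 6], p=67, sigma=6))   # 0.0
print(estimate_hamming([16, 20, 6], [16, 11, 6], p=67, sigma=6))   # > 0
```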

3.3 Analysis of Estimator

ĥ is quite reliable when the actual Hamming distance is 0; in that case φ(x) = φ(y) and thus f = 0, and so is ĥ. However, in general, ĥ could be different from h. The main result of this section is that their difference can be upper bounded when we set the dimension of FSketch to d = 4σ.

The results of this subsection rely on the following lemma, which proves that an observed value of f is concentrated around its expected value f*.

Lemma 6. Let α denote a desired additive accuracy. Then, for any x, y with sparsity σ,
$$\Pr\left[\,|f - f^*| \geq \alpha\,\right] \leq 2\exp\left(-\frac{\alpha^2}{4\sigma}\right).$$

The proof of the lemma employs martingales and McDiarmid's inequality and is available in Appendix B. The lemma allows us to upper bound the probability of f ≥ dP.

Lemma 7. $\Pr[f \geq dP] \leq 2\exp(-P^2\sigma)$.

The right-hand side is a very small number, e.g., it is of the order of 10^−278 for p = 5 and σ = 1000. The proof is a straightforward application of Lemma 6 and is explained in Appendix B. Now we are ready to show that the estimator ĥ, which uses f instead of f* (refer to Equation 3), is almost equal to the actual Hamming distance.

Lemma 8. Choose d = 4σ as the dimension of FSketch and choose a prime p and an error parameter δ ∈ (0, 1) (ensure that $1 - \frac{1}{p} \geq \frac{4}{\sqrt{\sigma}}\sqrt{\ln\frac{2}{\delta}}$ — see the proof for discussion). Then the estimator defined in Definition 5 is close to the Hamming distance between x and y with high probability, i.e.,
$$\Pr\left[\,|h - \hat{h}| \geq \frac{32}{1 - 1/p}\sqrt{\sigma \ln\tfrac{2}{\delta}}\,\right] \leq \delta.$$

If the data vectors are not too dissimilar, which is somewhat evident from Figure 3, then a better compression is possible, as stated in the next lemma. The proofs of both these lemmas are fairly algebraic and use standard inequalities; they are included in Appendix B.

Lemma 9. Suppose we know that h ≤ √σ and choose $d = 16\sqrt{\sigma \ln\frac{2}{\delta}}$ as the dimension for FSketch. Then (a) f < dP also holds with high probability, and moreover we get a better estimator, that is, (b)
$$\Pr\left[\,|h - \hat{h}| \geq \frac{8}{1 - 1/p}\sqrt{\sigma \ln\tfrac{2}{\delta}}\,\right] \leq \delta.$$

The last two results prove case (b) of Theorem 1, which states that the estimated Hamming distances are almost always fairly close to the actual Hamming distances. We want to emphasise that the above claims on d and accuracy are only theoretical bounds obtained by worst-case analysis. We show in our empirical evaluations that an even smaller d leads to better accuracy in practice for real-life instances.

There is a way to improve the accuracy even further by generating multiple FSketch-es using several independently generated internal variables and combining the estimates obtained from each. We observed that the median of the estimates can serve as a good statistic, both theoretically and empirically. We discuss this in detail in Appendix H.

3.4 Complexity analysis

The results in the previous section show that the accuracy of the estimator ĥ can be tightened, or a smaller probability of error can be achieved, by choosing large values of p, which has the downside of a larger storage requirement. In this section, we discuss these dependencies and other factors that affect the complexity of our proposal.

The USP of FSketch is its efficiency. There are two major operations with respect to FSketch — construction of sketches and estimation of Hamming distance from two sketches. Their time and space requirements are given in the following table and explained in detail in Appendix C.


TABLE 2
Space savings offered by FSketch on an example scenario with 2^20 data points, each of 2^10 dimensions but having only 2^7 non-zero entries, where each non-zero entry belongs to one of 2^3 categories. The FSketch dimension is 2^9 (as prescribed theoretically) and its parameter p is close to 2^5. (*) The data required to construct the sketches is no longer required after the construction.

Uncompressed (naive):                      2^20 × 2^10 × 3
Uncompressed (sparse vector format):       2^20 × 2^7 × (log 2^3 + log 2^10)
Compressed (FSketch construction (*)):     2^10 × (log 2^9 + log 2^5) + 5
Compressed (storage of sketches):          2^20 × log(2^5)

Construction: O(n) time per sketch, O(d log p) space per sketch
Estimation:   O(d log p) time per pair

We are aware of efficient representations of sparse data vectors, but for the sake of simplicity we assume full-size arrays to store vectors in this table; similarly, we assume simple dictionaries for storing the internal variables ρ, R and p. While it may be possible to reduce the number of random bits by employing k-wise independent bits and mappings, we leave it out of the scope of this work.

Both operations are quite fast compared to the matrix-based and learning-based methods. There is very little space overhead too; we explain the space requirement with the help of an example in Table 2 — one should keep in mind that a sparse representation of a vector has to store the non-zero entries as well as their positions.

Apart from the efficiency in both time and space measures, FSketch provides additional benefits. Recall that each entry of an FSketch is an integral value from 0 to p − 1. Even though 0 does not necessarily indicate a missing feature in a compressed vector, we show below that 0 has a predominant presence in the sketches. The sketches can therefore be treated as sparse vectors, which further facilitates their efficient storage.

Lemma 10. If d = 4σ (as required by Lemma 8), then the expected number of non-zero entries of φ(x) is upper bounded by d/4. Further, at least 50% of φ(x) will be zero with probability at least 1/2.

The lemma can be proved using a balls-and-bins type analysis (see Appendix D for the entire proof).

3.5 Sketch updating

Imagine a situation where the categories of attributes can change dynamically: they can “increase”, “decrease” or even “vanish”. We present Algorithm 2 to incorporate such changes without recomputing the sketch afresh. The algorithm simply uses the formula for a sketch entry given in Observation 2.

Most hashing-based sketching and dimensionality reduction algorithms that we have encountered either require complete regeneration of φ(x) when some attributes of x change, or are able to handle the addition of previously missing attributes but not their removal.

4 EXPERIMENTS

We performed our experiments on a machine having an Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz, 94 GB RAM, and running a 64-bit Ubuntu OS.

We first study the effect of the internal parameters of our proposed solution on its performance.

Algorithm 2 Update the sketch φ(x) of x after the i-th attribute of x changes from v to v′

input: data vector x and its existing sketch φ(x) = 〈φ1(x), φ2(x), . . . , φd(x)〉
input: change xi : v 7→ v′    ▷ v′ can be any value in {0, 1, . . . , c}
parameters: ρ, R = [r1 . . . rn], p (same as those used for generating the sketch)
1: j = ρ(i)
2: update φj(x) = (φj(x) + (v′ − v) · ri) mod p
3: return the updated φ(x)
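Algorithm 2 amounts to a single modular correction of one sketch entry; a minimal Python version is sketched below. The toy call reuses the internal variables and sketch of the example in Figure 2, converted to 0-indexed coordinates.

```python
def fsketch_update(phi, rho, R, p, i, v_old, v_new):
    """Update the sketch in place after the i-th attribute changes from v_old to
    v_new (either may be 0, covering insertion and deletion of an attribute):
    phi_j += (v_new - v_old) * r_i  (mod p), where j = rho(i)."""
    j = rho[i]
    phi[j] = (phi[j] + (v_new - v_old) * R[i]) % p
    return phi

# Toy usage with the (0-indexed) internal variables from Figure 2 (d = 3, p = 67).
rho = [1, 2, 0, 0, 1, 0, 2, 1, 0, 2]
R   = [37, 2, 56, 46, 17, 61, 26, 9, 12, 38]
phi = [16, 20, 6]                      # sketch of x = [2, 3, 0, 10, 101, 7, 0, 101, 0, 0]
fsketch_update(phi, rho, R, p=67, i=3, v_old=10, v_new=0)   # delete the 4th attribute
print(phi)                             # [25, 20, 6]
```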

We start with the effect of the prime number p; then we compare FSketch with the appropriate baselines for several unsupervised data-analytic tasks (see Table 5) and objectively establish these advantages of FSketch over the others:
(a) significant speed-up in the dimensionality reduction time,
(b) considerable savings in the time for the end-tasks (e.g., clustering), which now run on the low-dimensional sketches,
(c) but with comparable accuracy of the end-tasks (e.g., clustering).
Several baselines threw out-of-memory errors or did not stop on certain datasets. We discuss these errors separately in Section F in the Appendix.

4.1 Dataset description

The efficacy of our solution is best described for high-dimensional datasets. Publicly available categorical datasets being mostly low-dimensional, we treated several integer-valued freely available real-world datasets as categorical. Our empirical evaluation was done on the following seven such datasets with dimensions between 5000 and 1.3 million, and sparsity from 0.07% to 30%.
• Gisette Data Set [47], [48]: This dataset consists of integer feature vectors corresponding to images of handwritten digits and was constructed from the MNIST data. Each image, of 28 × 28 pixels, has been pre-processed (to retain the pixels necessary to disambiguate the digit 4 from 9) and then projected onto a higher-dimensional feature space to construct a 5000-dimensional integer vector.
• BoW (Bag-of-words) [47], [49]: We consider the following five corpora – NIPS full papers, KOS blog entries, Enron Emails, NYTimes news articles, and tagged web pages from the social bookmarking site delicious.com. These datasets are “BoW” (bag-of-words) representations of the corresponding text corpora. In all these datasets, the attributes take integer values, which we consider as categories.


TABLE 3
Datasets

Dataset                                   Categories   Dimension   Sparsity   No. of points
Gisette [47], [48]                        999          5000        1480       13500
Enron Emails [47]                         150          28102       2021       39861
DeliciousMIL [47], [49]                   58           8519        200        12234
NYTimes articles [47]                     114          102660      871        10000
NIPS full papers [47]                     132          12419       914        1500
KOS blog entries [47]                     42           6906        457        3430
Million Brain Cells from E18 Mice [50]    2036         1306127     1051       2000

TABLE 4
13 baselines

1. SSD       Sketching via Stable Distribution [51]
2. OHE       One Hot Encoding + BinSketch [17]
3. FH        Feature Hashing [20]
4. SH        Signed-random projection / SimHash [21]
5. KT        Kendall rank correlation coefficient [14]
6. LSA       Latent Semantic Analysis [32]
7. LDA       Latent Dirichlet Allocation [33]
8. MCA       Multiple Correspondence Analysis [28]
9. NNMF      Non-negative Matrix Factorization [34]
10. PCA      Vanilla Principal Component Analysis
11. VAE      Variational Autoencoder [52]
12. CATPCA   Categorical PCA [53]
13. HCA      Hierarchical Cluster Analysis [53]


• 1.3 Million Brain Cell Dataset [50]: This dataset contains the result of single-cell RNA-sequencing (scRNA-seq) of 1.3 million cells captured and sequenced from an E18.5 mouse brain 1. Each gene represents a data point, and for every gene, the dataset stores the read-count of that gene corresponding to each cell – these read-counts form our features.

We chose the last dataset due to its very high dimension and the earlier ones due to their popularity in dimensionality-reduction experiments. We consider all the data points for KOS, Enron, Gisette and DeliciousMIL, a 10,000-sized sample for NYTimes, and a 2000-sized sample for BrainCell. We summarise the dimensionality, the number of categories, and the sparsity of these datasets in Table 3.

4.2 Baselines

Recall that FSketch (hence Median-FSketch) compresses categorical vectors to shorter categorical vectors in an unsupervised manner that “preserves” Hamming distances.

Our first baseline is based on one-hot-encoding (OHE), which is one of the most common methods to convert categorical data to a numeric vector and can approximate pairwise Hamming distance (refer to Appendix A). Since OHE actually increases the dimension to very high levels (e.g., the dimension of the binary vectors obtained by encoding the NYTimes dataset is 11,703,240), the best way to use it is by further compressing the one-hot encoded vectors. For empirical evaluation we applied BinSketch [17], which is the state-of-the-art binary-to-binary dimensionality reduction technique that preserves Hamming distance.

1. https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.3.0/1M neurons

We refer to the entire process of OHE followed by BinSketch simply as OHE in the rest of this section.

To the best of our knowledge, there is no sketching algorithm other than OHE that compresses high-dimensional categorical vectors to low-dimensional categorical (or integer) vectors while preserving the original pairwise Hamming distances. Hence, we chose as baselines state-of-the-art and popularly employed algorithms that either preserve Hamming distance or output discrete-valued sketches (preserving some other similarity measure). We list them in Table 4 and tabulate their characteristics in Table 5. Their implementation details are discussed in Appendix E.1.

We include the Kendall rank correlation coefficient (KT) [14] – a feature selection algorithm which generates discrete-valued sketches. Note that if we apply Feature Hashing (FH), SimHash (SH), and KT naively on categorical datasets, we get discrete-valued sketches on which Hamming distance can be computed. We also include a few other well known dimensionality reduction methods such as Principal Component Analysis (PCA), Non-negative Matrix Factorisation (NNMF) [34], Latent Dirichlet Allocation (LDA) [33], Latent Semantic Analysis (LSA) [32], Variational Autoencoder (VAE) [52], Categorical PCA (CATPCA) [53], and Hierarchical Cluster Analysis (HCA) [53], all of which output real-valued sketches.

4.3 Choice of p

We discussed in Section 3 that a larger value of p (a prime number) leads to a tighter estimation of Hamming distance but degrades sketch sparsity, which negatively affects performance on multiple fronts and, moreover, demands more space to store a sketch. We conducted an experiment to study this trade-off, where we ran our proposal with different values of p and computed the corresponding RMSE values. The RMSE is defined as the square root of the average squared difference, over all pairs of data points, between their actual Hamming distances and the corresponding estimates obtained via FSketch. Note that a lower RMSE indicates that the sketch correctly estimates the underlying pairwise Hamming distance. We also note the corresponding space overhead, which is defined as the ratio of the space used by an uncompressed vector to that of its sketch obtained from FSketch. We consider storing a data point in a typical sparse vector format – a list of non-zero entries and their positions (see Table 2). We summarise our results in Figures 4 and 5, respectively. We observe that a large value of p leads to a lower RMSE (Figure 4); however, it simultaneously leads to a smaller space compression (Figure 5). As a heuristic, we decided to set p as the next prime after c, as shown in the following table.

Brain cell: 2039    NYTimes: 127    Enron: 151
KOS: 43             Delicious: 59   Gisette: 1009
NIPS: 137

That said, the experiments reveal that, at least for the datasets in the above experiments, setting p to be at least c/4 may be practically sufficient, since there does not appear to be much advantage in using a larger p.
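The heuristic above (take p to be the smallest prime exceeding c) is straightforward to implement; the small Python sketch below uses trial division and reproduces the values listed in the table, with the c values taken from Table 3.

```python
def next_prime(c):
    """Smallest prime strictly greater than c (trial division is adequate for
    the magnitudes of c occurring in these datasets)."""
    def is_prime(m):
        if m < 2:
            return False
        k = 2
        while k * k <= m:
            if m % k == 0:
                return False
            k += 1
        return True
    q = c + 1
    while not is_prime(q):
        q += 1
    return q

# c values from Table 3.
for name, c in [("Brain cell", 2036), ("NYTimes", 114), ("Enron", 150),
                ("KOS", 42), ("Delicious", 58), ("Gisette", 999), ("NIPS", 132)]:
    print(name, next_prime(c))   # 2039, 127, 151, 43, 59, 1009, 137
```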

4.4 Variance of FSketch

In Section 3.3 we explained that the bias of our estimator is upper bounded with high likelihood. However, there remains the question of its variance.


Fig. 4. Comparison of the RMSE obtained from the FSketch algorithm for various choices of p (panels: NIPS with p ∈ {2, 17, 29, 47, 67, 137, 139}, Enron with p ∈ {2, 17, 31, 53, 79, 151, 211}, NYTimes with p ∈ {2, 13, 31, 41, 59, 127, 157}, GISETTE with p ∈ {2, 61, 151, 337, 503, 1009, 1361}; X-axis: reduced dimension 100–5000; Y-axis: RMSE). Values of c for NIPS, Enron, NYTimes, and GISETTE are 132, 150, 114, and 999, respectively.

Fig. 5. Space overhead of uncompressed vectors stored as a list of non-zero entries and their positions, for the same datasets and choices of p as in Figure 4 (X-axis: reduced dimension; Y-axis: ratio of the space used by an uncompressed vector to that of its FSketch).

Fig. 6. Comparison of the average error in estimating the Hamming distance of a pair of points from the Enron dataset (panels: FSketch, FH, SH, SSD, OHE; X-axis: reduced dimension 50–5000; Y-axis: Hamming error).

Fig. 7. Comparison among the baselines on the dimensionality reduction time (panels: Brain Cell, NYTimes, Enron, KOS; X-axis: reduced dimension; Y-axis: time in seconds, log scale). See Appendix G for results on the other datasets, which show a similar trend, and Section F for the errors encountered by some baselines.


TABLE 5
Summarisation of the baselines.

Output discrete sketch:             FSketch, FH, SH, OHE, KT, HCA
Output real-valued sketch:          SSD, NNMF, MCA, LDA, LSA, PCA, VAE, CATPCA
Approximated distance measure:      FSketch, SSD, OHE – Hamming; FH – dot product; SH – cosine; all others – NA
Require labelled data:              none of the methods
Dependency on the size of sample*:  MCA, LSA, PCA, CATPCA
End tasks compared:                 FSketch, FH, SH, SSD, OHE, KT – all tasks; NNMF, MCA, LDA, LSA, PCA, VAE, CATPCA, HCA – clustering and similarity search

* The size of the maximum possible reduced dimension is the minimum of the number of data points and the dimension.

TABLE 6
Speedup of FSketch w.r.t. the baselines at reduced dimension 1000. OOM indicates an “out-of-memory” error and DNS indicates “did not stop” after a sufficiently long time.

Dataset        OHE     KT       NNMF    MCA     LDA    LSA     PCA     VAE     SSD      SH      FH      CATPCA   HCA
NYTimes        OOM     OOM      6149×   OOM     189×   11.5×   88.14×  4340×   164.9×   1.2×    0.99×   DNS      DNS
Enron          OOM     DNS      2624×   OOM     122×   15.5×   OOM     DNS     25.5×    1.25×   0.87×   DNS      1268.2×
KOS            629×    14455×   1754×   20.41×  128×   6.40×   9.5×    1145×   14.62×   0.79×   0.98×   DNS      81.24×
DeliciousMIL   1332×   14036×   1753×   40.39×  136×   6.6×    18.1×   1557×   29.2×    0.61×   0.90×   DNS      117.6×
Gisette        399×    1347×    459×    5.7×    269×   5.4×    4.2×    285×    8.1×     0.69×   0.98×   DNS      16.78×
NIPS           378×    15863×   1599×   26.6×   302×   6.4×    3.17×   451×    29.9×    0.47×   1.20×   DNS      58.49×
Brain Cell     OOM     OOM      DNS     OOM     322×   79.38×  62.7×   1198×   443×     5×      0.89×   DNS      DNS

Fig. 8. Comparison of RMSE among the baselines (panels: Brain Cell, NYTimes, Enron, KOS; X-axis: reduced dimension 100–5000; Y-axis: RMSE, log scale; methods: FSketch, FH, SH, SSD, OHE, KT, HCA). A lower value is an indication of better performance. See Appendix G for results on the other datasets, which show a similar trend.

There remains the question of its variance. To decide the worthiness of our method, we compared the variance of the estimates of the Hamming distance obtained from FSketch and from the other randomised sketching algorithms with integer-valued sketches (KT was not included as it is a deterministic algorithm, and hence, has zero variance).

Figure 6 shows the Hamming error (estimation error) for a randomly chosen pair of points from the Enron dataset, averaged over 100 iterations. We make two observations.

The first is that the estimate using FSketch is closer to the actual Hamming distance even at a smaller reduced dimension; in fact, as the reduced dimension is increased, the variance becomes smaller and the Hamming error converges to zero. Secondly, FSketch causes a smaller error compared to the other baselines. On the other hand, feature hashing highly underestimates the actual Hamming distance, but has low variance, and tends to have negligible Hamming error with an increase of the reduced dimension. The behaviour of SimHash is counter-intuitive: at lower reduced dimensions it closely estimates the actual Hamming distances, but at larger dimensions it starts to highly underestimate them. This creates an ambiguity in the choice of a dimension for generating a low-dimensional sketch of a dataset. Similar to FSketch, the sketches produced by SSD, though real-valued, allow estimation of pairwise Hamming distances; however, the estimation error increases with the reduced dimension. Lastly, OHE seems to highly underestimate pairwise Hamming distances.

4.5 Speedup in dimensionality reduction

We compress the datasets to several dimensions using FSketch and the baselines and report their running times in Figure 7. We notice that FSketch has a comparable speed w.r.t. feature hashing and SimHash, and is significantly faster than the other baselines.


TABLE 7
Speedup from running tasks on 1000-dimensional sketches instead of the full-dimensional dataset. We got a DNS error while running clustering on the uncompressed BrainCell dataset.

Task              | Brain cell | NYTimes | Enron  | NIPS  | KOS    | Gisette | DeliciousMIL
Clustering        | NA         | 139.64× | 21.15× | 10.6× | 3.93×  | 4.35×   | 5.84×
Similarity Search | 1231.6×    | 118.12× | 48.15× | 15.1× | 10.56× | 8.34×   | 17.76×

However, both feature hashing and SimHash are not able to accurately estimate the Hamming distance between data points and hence perform poorly on the RMSE measure (Subsection 4.6) and the other tasks. Many baselines, such as OHE, KT, NNMF, MCA, CATPCA, and HCA, either give an "out-of-memory" (OOM) error or did not stop (DNS) even after running for a sufficiently long time (∼10 hrs) on high-dimensional datasets such as Brain Cell and NYTimes. On other moderate-dimensional datasets such as Enron and KOS, our speedups w.r.t. these baselines are of the order of a few thousand. We report the numerical speedups that we observed in Table 6.

4.6 Performance on root-mean-squared-error (RMSE)

How good are the sketches for estimating Hamming distances between the uncompressed points in practice? To answer this, we compare FSketch with integer-valued sketching algorithms, namely, feature hashing, SimHash, the Kendall correlation coefficient and OHE+BinSketch. Note that feature hashing and SimHash are known to approximate inner product and cosine similarity, respectively. However, we consider them in our comparison nonetheless as they output discrete sketches on which Hamming distance can be computed. We also include SSD for comparison, which outputs real-valued sketches and estimates the original pairwise Hamming distance. For each of the methods we compute its RMSE as the square root of the average squared error, among all pairs of data points, between their actual Hamming distances and their corresponding estimates (for FSketch the estimate was obtained using Definition 5). Figure 8 compares these values of RMSE for different dimensions; note that a lower RMSE is an indication of better performance. It is immediately clear that the RMSE of FSketch is the lowest among all; furthermore, it falls to zero rapidly with an increase in the reduced dimension. This demonstrates that our proposal FSketch estimates the underlying pairwise Hamming distance better than the others.
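To make the metric concrete, the following minimal Python sketch computes this quantity from a list of data vectors and any pairwise estimator (for instance, one backed by FSketch sketches). The helper names are illustrative only; this is not our released implementation.

```python
import itertools
import math

def hamming(u, v):
    """Hamming distance between two equal-length vectors."""
    return sum(1 for a, b in zip(u, v) if a != b)

def rmse_of_estimates(data, estimate_pair):
    """Square root of the mean squared difference, over all pairs of points,
    between the actual Hamming distance and the estimate returned by
    estimate_pair(i, j), e.g., computed from the points' sketches."""
    errors = [(hamming(data[i], data[j]) - estimate_pair(i, j)) ** 2
              for i, j in itertools.combinations(range(len(data)), 2)]
    return math.sqrt(sum(errors) / len(errors))
```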

4.7 Performance on clustering

We compare the performance of FSketch with the baselines on the tasks of clustering and similarity search, and present the results for the first task in this section. The objective of the clustering experiment was to test if the data points in the reduced dimension maintain the original clustering structure. If they do, then it will be immensely helpful for those techniques that use a clustering, e.g., spam filtering. We used the purity index to measure the quality of k-mode and k-means clusters on the reduced datasets obtained through the compression algorithms; the ground truth was obtained using k-mode on the uncompressed data (for more details refer to Appendix E.2).

We summarise our findings on quality in Figure 9. The compressed versions of the NIPS, Enron, and KOS datasets that were obtained from FSketch yielded the best purity index as compared to those obtained from the other baselines; for the other datasets the compressed versions from FSketch are among the top. Even though it appears that KT offers comparable performance on the KOS, DeliciousMIL, and Gisette datasets w.r.t. FSketch, the downside of using KT is that its compression time is much higher than that of FSketch (see Table 6) on those datasets, and moreover it gives OOM/DNS errors on the remaining datasets. The performance of FH also remains in the top few; however, its performance degrades on the NIPS dataset.

We tabulate the speedup of clustering of FSketch-compressed data over uncompressed data in Table 7, where we observe a significant speedup in the clustering time, e.g., 139× when run on a 1000-dimensional FSketch.

Recall that the dimensionality reduction time of our proposal is among the fastest among all the baselines, which further reduces the total time to perform clustering by speeding up the dimensionality reduction phase. Thus the overall observation is that FSketch appears to be the most suitable method for clustering among the current alternatives, especially for high-dimensional datasets on which clustering would take a long time.

4.8 Performance on similarity search

We take up another unsupervised task – that of similarity search. The objective here is to show that after dimensionality reduction the similarities of points with respect to some query points are maintained. To do so, we randomly split the dataset into two parts of 5% and 95% – the smaller partition is referred to as the query partition and each point of this partition is called a query vector; we call the larger partition the training partition. For each query vector, we find the top-k similar points in the training partition. We then perform dimensionality reduction using all the methods (for various values of reduced dimensions). Next, we process the compressed dataset where, for each query point, we compute the top-k similar points in the corresponding low-dimensional version of the training points, by maintaining the same split. For each query point, we compute the accuracy of the baselines by taking the Jaccard ratio between the set of top-k similar points obtained in the full-dimensional data and the set of top-k similar points obtained in the reduced-dimensional dataset. We repeat this for all the points in the querying partition, compute the average, and report this as accuracy.
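A minimal Python sketch of this evaluation protocol is given below; the function and argument names are hypothetical, and the two distance callables stand for the Hamming distance on the original data and the appropriate distance (e.g., the estimated Hamming distance from the sketches) on the reduced data.

```python
import numpy as np

def topk_indices(query, points, k, dist):
    """Indices of the k points closest to `query` under the distance `dist`."""
    d = np.array([dist(query, p) for p in points])
    return set(np.argsort(d)[:k].tolist())

def similarity_search_accuracy(queries, train, queries_red, train_red,
                               k, dist_full, dist_red):
    """Average Jaccard ratio between the top-k neighbour sets found in the
    original space and in the reduced space (same train/query split in both)."""
    scores = []
    for q, q_red in zip(queries, queries_red):
        full = topk_indices(q, train, k, dist_full)
        red = topk_indices(q_red, train_red, k, dist_red)
        scores.append(len(full & red) / len(full | red))
    return sum(scores) / len(scores)
```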

We summarise our findings in Figure 10. Note that PCA, MCA and LSA can reduce the data dimension only up to the minimum of the number of data points and the original data dimension. Therefore their reduced dimension is at most 2000 for the Brain cell dataset.

The top few methods appear to be feature hashing (FH), Kendall-Tau (KT), and HCA, along with FSketch. However, KT gives OOM and DNS errors on the Brain cell, NYTimes and Enron datasets, and HCA gives a DNS error on the BrainCell and NYTimes datasets. Further, their dimensionality reduction times are much worse than that of FSketch (see Table 6).

FSketch outperforms FH on the BrainCell and the Enron datasets.


[Figure 9: panels Enron (k = 7), KOS (k = 3), DeliciousMIL (k = 5), NIPS (k = 5); x-axis: Reduced Dimension; y-axis: Purity Index; legend: FSketch, FH, SH, SSD, OHE, KT, LDA, LSA, PCA, MCA, NNMF, VAE, CATPCA, HCA.]

Fig. 9. Comparing the quality of clusters on the compressed datasets. See Appendix G for results on the other datasets which show a similar trend.

[Figure 10: panels Brain Cell, NYTimes, Enron, NIPS; x-axis: Reduced Dimension; y-axis: Accuracy; legend as in Figure 9.]

Fig. 10. Comparing the performance of the similarity search task (estimating top-k similar points with k = 100) achieved on the reduced dimensional data obtained from various baselines. See Appendix G for results on the other datasets which show a similar trend.

However, on the remaining datasets, both of them appear neck and neck for similarity search, despite the fact that there is no known theoretical understanding of FH for Hamming distance; in fact, it was included in the baselines as a heuristic because it offers discrete-valued sketches on which Hamming distance can be calculated. Here we want to point out that FH was not a consistent top-performer for clustering and similarity search.

The two other methods that are designed for Hamming distance, namely SSD and OHE, perform significantly worse than FSketch; in fact, the accuracy of OHE lies almost at the bottom on all four datasets.

We also summarise the speedup of FSketch-compressed data over uncompressed data on the similarity search task in Table 7. We observe a significant speedup, e.g., 1231.6× on the BrainCell dataset when run on a 1000-dimensional FSketch.

To summarise, FSketch is one of the best approaches towards similarity search for high-dimensional datasets, and the best if we also require theoretical guarantees or applicability towards other data analytic tasks.

5 CONCLUSION

In this paper, we proposed a sketching algorithm named FSketch for sparse categorical data such that the Hamming distances estimated from the sketches closely approximate the original pairwise Hamming distances. The low-dimensional data obtained by FSketch are discrete-valued, and therefore enjoy the flexibility of running the data analytics tasks suitable for categorical data. The sketches allow tasks like clustering and similarity search to run, which might not be possible on a high-dimensional dataset.

Our method does not require learning from the dataset and instead exploits randomization to bring forth large speedup and high-quality output for standard data analytic tasks. We empirically validated the performance of our algorithm on several metrics and end tasks such as RMSE, clustering, and similarity search, and observed comparable performance while simultaneously getting a significant speedup in dimensionality reduction and the end task with respect to several baselines. A common practice to analyse high-dimensional datasets is to partition them into smaller datasets. Given the simplicity, efficiency, and effectiveness of our proposal, we hope that FSketch will allow such analysis to be done on the full datasets and on general-purpose hardware.

REFERENCES

[1] J. Moody, D. T. (eds, M. Kaufmann, M. O. Noordewier, G. G. Towell, and J. W. Shavlik, "Training knowledge-based neural networks to recognize genes in DNA sequences," 1991.

[2] T. Rognvaldsson, L. You, and D. Garwicz, "State of the art prediction of HIV-1 protease cleavage sites," Bioinformatics (Oxford, England), vol. 31, 2014.

[3] W. Hamalainen and M. Nykanen, "Efficient discovery of statistically significant association rules," in 2008 Eighth IEEE International Conference on Data Mining, 2008, pp. 203–212.

[4] J. Lavergne, R. Benton, and V. V. Raghavan, "Min-max itemset trees for dense and categorical datasets," in Foundations of Intelligent Systems, L. Chen, A. Felfernig, J. Liu, and Z. W. Ras, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, pp. 51–60.

[5] R. Agrawal, T. Imielinski, and A. Swami, "Mining association rules between sets of items in large databases," in SIGMOD '93: Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, 1993, pp. 207–216.

[6] I. Cadez, D. Heckerman, C. Meek, P. Smyth, and S. White, "Visualization of navigation patterns on a web site using model-based clustering," in Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2000, pp. 280–284.

[7] L. Kurgan, K. Cios, R. Tadeusiewicz, M. Ogiela, and L. Goodenday, "Knowledge discovery approach to automated cardiac SPECT diagnosis," Artificial Intelligence in Medicine, vol. 23, pp. 149–69, 2001.

[8] S. Sidana, C. Laclau, and M.-R. Amini, "Learning to recommend diverse items over implicit feedback on Pandor," 2018, pp. 427–431.

[9] J. T. Hancock and T. M. Khoshgoftaar, "Survey on categorical data for neural networks," Journal of Big Data, vol. 7, pp. 1–41, 2020.


[10] R. G. Baraniuk, V. Cevher, and M. B. Wakin, "Low-dimensional models for dimensionality reduction and signal recovery: A geometric perspective," Proceedings of the IEEE, vol. 98, no. 6, pp. 959–971, 2010.

[11] L. H. Nguyen and S. Holmes, "Ten quick tips for effective dimensionality reduction," PLOS Computational Biology, vol. 15, no. 6, pp. 1–19, 2019.

[12] H. Liu and R. Setiono, "Chi2: feature selection and discretization of numeric attributes," in Seventh International Conference on Tools with Artificial Intelligence, ICTAI '95, Herndon, VA, USA, November 5-8, 1995, 1995, pp. 388–391. [Online]. Available: https://doi.org/10.1109/TAI.1995.479783

[13] H. Peng, F. Long, and C. H. Q. Ding, "Feature selection based on mutual information: Criteria of max-dependency, max-relevance, and min-redundancy," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 8, pp. 1226–1238, 2005. [Online]. Available: https://doi.org/10.1109/TPAMI.2005.159

[14] M. G. Kendall, "A new measure of rank correlation," Biometrika, vol. 30, no. 1/2, pp. 81–93, 1938.

[15] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.

[16] D. Becker, "Using categorical data with one hot encoding," 2018. [Online]. Available: https://www.kaggle.com/dansbecker/using-categorical-data-with-one-hot-encoding

[17] R. Pratap, D. Bera, and K. Revanuru, "Efficient sketching algorithm for sparse binary data," in 2019 IEEE International Conference on Data Mining, ICDM 2019, Beijing, China, November 8-11, 2019, 2019, pp. 508–517. [Online]. Available: https://doi.org/10.1109/ICDM.2019.00061

[18] M. Mitzenmacher, R. Pagh, and N. Pham, "Efficient estimation for high similarities using odd sketches," in 23rd International World Wide Web Conference, WWW '14, Seoul, Republic of Korea, April 7-11, 2014, 2014, pp. 109–118. [Online]. Available: http://doi.acm.org/10.1145/2566486.2568017

[19] R. Pratap, I. Sohony, and R. Kulkarni, "Efficient compression technique for sparse sets," in Advances in Knowledge Discovery and Data Mining - 22nd Pacific-Asia Conference, PAKDD 2018, Melbourne, VIC, Australia, June 3-6, 2018, Proceedings, Part III, 2018, pp. 164–176. [Online]. Available: https://doi.org/10.1007/978-3-319-93040-4_14

[20] K. Q. Weinberger, A. Dasgupta, J. Langford, A. J. Smola, and J. Attenberg, "Feature hashing for large scale multitask learning," in Proceedings of the 26th Annual International Conference on Machine Learning, ICML 2009, Montreal, Quebec, Canada, June 14-18, 2009, 2009, pp. 1113–1120. [Online]. Available: http://doi.acm.org/10.1145/1553374.1553516

[21] M. Charikar, "Similarity estimation techniques from rounding algorithms," in Proceedings on 34th Annual ACM Symposium on Theory of Computing, May 19-21, 2002, Montreal, Quebec, Canada, 2002, pp. 380–388. [Online]. Available: http://doi.acm.org/10.1145/509907.509965

[22] A. Zheng and A. Casari, Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists. O'Reilly Media, 2018. [Online]. Available: https://books.google.co.in/books?id=sthSDwAAQBAJ

[23] W. B. Johnson and J. Lindenstrauss, "Extensions of Lipschitz mappings into a Hilbert space," Conference in Modern Analysis and Probability (New Haven, Conn., 1982), Amer. Math. Soc., Providence, R.I., pp. 189–206, 1983. [Online]. Available: http://dx.doi.org/10.1016/S0022-0000(03)00025-4

[24] D. Achlioptas, "Database-friendly random projections: Johnson-Lindenstrauss with binary coins," J. Comput. Syst. Sci., vol. 66, no. 4, pp. 671–687, 2003. [Online]. Available: http://dx.doi.org/10.1016/S0022-0000(03)00025-4

[25] P. Li, T. Hastie, and K. W. Church, "Very sparse random projections," in Proceedings of the Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, USA, August 20-23, 2006, 2006, pp. 287–296. [Online]. Available: https://doi.org/10.1145/1150402.1150436

[26] D. M. Kane and J. Nelson, "Sparser Johnson-Lindenstrauss transforms," J. ACM, vol. 61, no. 1, pp. 4:1–4:23, 2014. [Online]. Available: https://doi.org/10.1145/2559902

[27] B. Scholkopf, A. J. Smola, and K. Muller, "Kernel principal component analysis," in Artificial Neural Networks - ICANN '97, 7th International Conference, Lausanne, Switzerland, October 8-10, 1997, Proceedings, 1997, pp. 583–588.

[28] J. Blasius and M. Greenacre, "Multiple correspondence analysis and related methods," Multiple Correspondence Analysis and Related Methods, 2006.

[29] P. Indyk and R. Motwani, "Approximate nearest neighbors: Towards removing the curse of dimensionality," in Proceedings of the Thirtieth Annual ACM Symposium on the Theory of Computing, Dallas, Texas, USA, May 23-26, 1998, 1998, pp. 604–613.

[30] A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher, "Min-wise independent permutations (extended abstract)," in Proceedings of the Thirtieth Annual ACM Symposium on the Theory of Computing, Dallas, Texas, USA, May 23-26, 1998, 1998, pp. 327–336. [Online]. Available: http://doi.acm.org/10.1145/276698.276781

[31] A. Gionis, P. Indyk, and R. Motwani, "Similarity search in high dimensions via hashing," in VLDB'99, Proceedings of 25th International Conference on Very Large Data Bases, September 7-10, 1999, Edinburgh, Scotland, UK, 1999, pp. 518–529. [Online]. Available: http://www.vldb.org/conf/1999/P49.pdf

[32] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, "Indexing by latent semantic analysis," Journal of the American Society for Information Science, vol. 41, no. 6, pp. 391–407, 1990.

[33] D. M. Blei, A. Y. Ng, M. I. Jordan, and J. Lafferty, "Latent Dirichlet allocation," Journal of Machine Learning Research, vol. 3, p. 2003, 2003.

[34] D. D. Lee and H. S. Seung, "Algorithms for non-negative matrix factorization," in NIPS, T. K. Leen, T. G. Dietterich, and V. Tresp, Eds., 2000, pp. 556–562.

[35] E. Golinko and X. Zhu, "Generalized feature embedding for supervised, unsupervised, and online learning tasks," Information Systems Frontiers, vol. 21, no. 1, pp. 125–142, 2019.

[36] L. van der Maaten and G. Hinton, "Visualizing data using t-SNE," Journal of Machine Learning Research, vol. 9, pp. 2579–2605, 2008.

[37] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.

[38] X. Li, M. Chen, and Q. Wang, "Discrimination-aware projected matrix factorization," IEEE Transactions on Knowledge and Data Engineering, vol. 32, no. 4, pp. 809–814, 2020.

[39] X. Zhang, Q. Mai, and H. Zou, "The maximum separation subspace in sufficient dimension reduction with categorical response," Journal of Machine Learning Research, vol. 21, no. 29, pp. 1–36, 2020. [Online]. Available: http://jmlr.org/papers/v21/17-788.html

[40] X. Chen, H. Yang, S. Zhao, M. R. Lyu, and I. King, "Effective data-aware covariance estimator from compressed data," IEEE Trans. Neural Networks Learn. Syst., vol. 31, no. 7, pp. 2441–2454, 2020. [Online]. Available: https://doi.org/10.1109/TNNLS.2019.2929106

[41] Q. Wang, Z. Qin, F. Nie, and X. Li, "C2DNDA: A deep framework for nonlinear dimensionality reduction," IEEE Transactions on Industrial Electronics, vol. 68, no. 2, pp. 1684–1694, 2021.

[42] M. Banerjee and N. R. Pal, "Unsupervised feature selection with controlled redundancy (UFeSCoR)," IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 12, pp. 3390–3403, 2015.

[43] G. Cormode, M. Datar, P. Indyk, and S. Muthukrishnan, "Comparing data streams using Hamming norms (how to zero in)," IEEE Trans. Knowl. Data Eng., vol. 15, no. 3, pp. 529–540, 2003.

[44] D. M. Kane, J. Nelson, and D. P. Woodruff, "An optimal algorithm for the distinct elements problem," in Proceedings of the Twenty-Ninth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2010, June 6-11, 2010, Indianapolis, Indiana, USA, 2010, pp. 41–52.

[45] R. Freivalds, "Probabilistic machines can use less running time," in IFIP Congress, 1977.

[46] M. Mitzenmacher and E. Upfal, Probability and Computing - Randomized Algorithms and Probabilistic Analysis, 2005.

[47] M. Lichman, "UCI machine learning repository," 2013.

[48] I. Guyon, S. Gunn, A. Ben-Hur, and G. Dror, "Result analysis of the NIPS 2003 feature selection challenge," in Advances in Neural Information Processing Systems 17, 2005, pp. 545–552.

[49] H. Soleimani and D. J. Miller, "Semi-supervised multi-label topic models for document classification and sentence labeling," in Proceedings of the 25th ACM International Conference on Information and Knowledge Management, CIKM 2016, Indianapolis, IN, USA, October 24-28, 2016, 2016, pp. 105–114. [Online]. Available: https://doi.org/10.1145/2983323.2983752


[50] X. Genomics, "1.3 million brain cells from E18 mice," CC BY, vol. 4, 2017.

[51] G. Cormode, M. Datar, P. Indyk, and S. Muthukrishnan, "Comparing data streams using Hamming norms (how to zero in)," IEEE Trans. Knowl. Data Eng., vol. 15, no. 3, pp. 529–540, 2003.

[52] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," in 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014.

[53] Z. Sulc and H. Rezankova, "Dimensionality reduction of categorical data: Comparison of HCA and CATPCA approaches," 2015.

[54] C. McDiarmid, On the Method of Bounded Differences, ser. London Mathematical Society Lecture Note Series. Cambridge University Press, 1989, pp. 148–188.

[55] Z. Huang, "Extensions to the k-means algorithm for clustering large data sets with categorical values," Data Mining and Knowledge Discovery, vol. 2, no. 3, pp. 283–304, 1998. [Online]. Available: https://doi.org/10.1023/A:1009769707641

[56] G. Cormode and S. Muthukrishnan, "An improved data stream summary: the count-min sketch and its applications," J. Algorithms, vol. 55, no. 1, pp. 58–75, 2005. [Online]. Available: https://doi.org/10.1016/j.jalgor.2003.12.001

[57] M. Charikar, K. Chen, and M. Farach-Colton, "Finding frequent items in data streams," Theoretical Computer Science, vol. 312, no. 1, pp. 3–15, 2004, Automata, Languages and Programming.

[58] P. Indyk, "Stable distributions, pseudorandom generators, embeddings, and data stream computation," J. ACM, vol. 53, no. 3, pp. 307–323, 2006. [Online]. Available: http://doi.acm.org/10.1145/1147954.1147955

[59] R. Fisher, "The statistical utilization of multiple measurements," Annals of Eugenics, vol. 8, pp. 376–386, 1938.

[60] G. E. Hinton and R. S. Zemel, "Autoencoders, minimum description length and Helmholtz free energy," in Proceedings of the 6th International Conference on Neural Information Processing Systems, ser. NIPS'93. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1993, pp. 3–10.

[61] S. Mika, B. Scholkopf, A. Smola, K.-R. Muller, M. Scholz, and G. Ratsch, "Kernel PCA and de-noising in feature spaces," in Advances in Neural Information Processing Systems, M. Kearns, S. Solla, and D. Cohn, Eds., vol. 11. MIT Press, 1999. [Online]. Available: https://proceedings.neurips.cc/paper/1998/file/226d1f15ecd35f784d2a20c3ecf56d7f-Paper.pdf

[62] J. B. Tenenbaum, V. de Silva, and J. C. Langford, "A global geometric framework for nonlinear dimensionality reduction," Science, vol. 290, no. 5500, p. 2319, 2000.

[63] T. Kohonen, "Self-organized formation of topologically correct feature maps," Biological Cybernetics, vol. 43, no. 1, pp. 59–69, Jan. 1982. [Online]. Available: http://dx.doi.org/10.1007/BF00337288

Debajyoti Bera received his B.Tech. in Computer Science and Engineering in 2002 at Indian Institute of Technology (IIT), Kanpur, India and his Ph.D. degree in Computer Science from Boston University, Massachusetts, USA in 2010. Since 2010 he is an assistant professor at Indraprastha Institute of Information Technology (IIIT-Delhi), New Delhi, India. His research interests include quantum computing, randomized algorithms, and engineering algorithms for networks, data mining, and information security.

Rameshwar Pratap earned his Ph.D. in Theoretical Computer Science in 2014 from Chennai Mathematical Institute (CMI). Earlier, he completed a Masters in Computer Application (MCA) from Jawaharlal Nehru University and a BSc in Mathematics, Physics, and Computer Science from the University of Allahabad. Post Ph.D. he has worked at TCS Innovation Labs (New Delhi, India) and Wipro AI-Research (Bangalore, India). Since 2019 he has been working as an assistant professor at the School of Computing and Electrical Engineering (SCEE), IIT Mandi. His research interests include algorithms for dimensionality reduction, robust sampling, and algorithmic fairness.

Bhisham Dev Verma is pursuing a Ph.D. from IIT Mandi. He did his Masters in Applied Mathematics from IIT Mandi and a BSc in Mathematics, Physics, and Chemistry from Himachal Pradesh University. His research interests include data mining, algorithms for dimension reduction, optimization and machine learning.


APPENDIX A
ANALYSIS OF ONE-HOT ENCODING + BINARY COMPRESSION

Let x and y be two n-dimensional categorical vectors with sparsity at most σ; c will denote the maximum number of values any attribute can take. Let x′ and y′ be the one-hot encodings of x and y, respectively. Further, let x′′ and y′′ denote the compressions of x′ and y′, respectively, using BinSketch [17], which is the state-of-the-art dimensionality reduction for binary vectors using Hamming distance.

Observe that the sparsity of x′ is the same as that of x, and a similar claim holds for y′ and y. However, HD(x′, y′) does not hold a monotonic relationship with HD(x, y). It is easy to show that HD(x, y) ≤ HD(x′, y′) ≤ 2 HD(x, y). Therefore,

|HD(x, y) − HD(x′, y′)| ≤ HD(x, y) ≤ 2σ.   (4)

We need the following lemma that was used to analyse BinSketch [17, Lemma 12, Appendix A].

Lemma 11. Suppose we compress two n′-dimensional binary vectors x′ and y′ with sparsity at most σ to g-dimensional binary sketches, denoted x′′ and y′′ respectively, by following the algorithm proposed in the BinSketch work. If g is set to σ√((σ/2)·ln(6/δ)) for any δ ∈ (0, 1), then the following holds with probability at least 1 − δ:

|HD(x′, y′) − HD(x′′, y′′)| ≤ 6√((σ/2)·ln(6/δ)).

Combining the above inequality with that in Equation 4 gives us

|HD(x, y) − HD(x′′, y′′)| ≤ 2σ + 6√((σ/2)·ln(6/δ)) ≤ 2σ√(ln(2/δ))

if we set the reduced dimension to σ√((σ/2)·ln(6/δ)). This bound is worse compared to that of FSketch, where we were able to prove an accuracy of Θ(√(σ ln(2/δ))) using a reduced dimension of 4σ (see Lemma 8).

APPENDIX B
PROOFS FROM SECTION 3.3

Lemma 6. Let α denote a desired additive accuracy. Then, for any x, y with sparsity σ,

Pr[|f − f*| ≥ α] ≤ 2 exp(−α²/(4σ)).

Proof. Fix any R and x, y; the rest of the proof applies to any R, and therefore, holds for a random R as well. Define a vector z ∈ {0, ±1, . . . , ±c}ⁿ in which z_i = (x_i − y_i); the number of non-zero entries of z is at most 2σ since the number of non-zero entries of each of x and y is at most σ. Let J_0 be the set of coordinates from {1, . . . , n} at which z is 0, and let J_1 be the set of the rest of the coordinates; from above, |J_1| ≤ 2σ.

Define the event E_j as "[φ_j(x) ≠ φ_j(y)]". Note that f can be written as a sum of indicator random variables, Σ_j I(E_j), and we would like to prove that f is almost always close to f* = E[f]. Observe that φ_j(x) = φ_j(y) iff Σ_{i∈ρ⁻¹(j)} z_i·r_i = 0 mod p iff Σ_{i∈ρ⁻¹(j)∩J_1} z_i·r_i = 0 mod p. In other words, ρ(i) could be set to anything for i ∈ J_0 without any effect on the event E_j; hence, we will assume that the mapping ρ is defined as a random mapping only for i ∈ J_1, and further, for the ease of analysis, we will denote them as ρ(i_1), ρ(i_2), . . . , ρ(i_{2σ}) (if |J_1| < 2σ then move a few coordinates from J_0 to J_1 without any loss of correctness).

To prove the concentration bound we will employ martingales. Consider the sequence of these random variables ρ′ = ρ(i_1), ρ(i_2), . . . , ρ(i_{2σ}) – these are independent. Define a function g(ρ′) of these random variables as a sum of indicator random variables as stated below (note that R and ρ(i), for i ∈ J_0, are fixed at this point):

g(ρ(i_1), ρ(i_2), . . . , ρ(i_{2σ})) = Σ_j I[ Σ_{i∈ρ⁻¹(j)∩J_1} z_i·r_i ≠ 0 mod p ] = Σ_j I(E_j) = f.

Now consider an arbitrary t ∈ {1, . . . , 2σ} and let q = ρ(i_t); observe that z_{i_t} influences only E_q. Choose an arbitrary value q′ ∈ {1, . . . , d} that is different from q. Observe that, if ρ is modified only by setting ρ(i_t) = q′, then we claim that "bounded difference holds".

Proposition 12. |g(ρ(i_1), . . . , ρ(i_{t−1}), q, . . . , ρ(i_{2σ})) − g(ρ(i_1), . . . , ρ(i_{t−1}), q′, . . . , ρ(i_{2σ}))| ≤ 2.

The proposition holds since the only effects of the change of ρ(i_t) from q to q′ are seen in E_q and E_{q′} (earlier E_q depended upon z_{i_t}, whereas now E_{q′} depends upon z_{i_t}). Since g() obeys bounded differences, we can apply McDiarmid's inequality [46, Ch 17], [54].

Theorem 13 (McDiarmid's inequality). Consider independent random variables X_1, . . . , X_m ∈ X, and a mapping f : X^m → R which, for all i and for all x_1, . . . , x_m, x_i′, satisfies the property |f(x_1, . . . , x_i, . . . , x_m) − f(x_1, . . . , x_i′, . . . , x_m)| ≤ c_i, where x_1, . . . , x_m, x_i′ are possible values for the input variables of the function f. Then,

Pr[|E[f(X_1, . . . , X_m)] − f(X_1, . . . , X_m)| ≥ ε] ≤ 2 exp(−2ε² / Σ_{i=1}^{m} c_i²).

This inequality implies that, for every x, y, R,

Pr_ρ[|E[f] − f| ≥ α] ≤ 2 exp(−2α²/((2σ)·2²)) = 2 exp(−α²/(4σ)).

Hence, the lemma is proved.

Lemma 7. Pr[f ≥ dP] ≤ 2 exp(−P²σ).

Proof. Since f* = dP(1 − D^h) = dP − dP·D^h, if f ≥ dP then |f − f*| ≥ dP·D^h. Therefore,

Pr[f ≥ dP] ≤ Pr[|f − f*| ≥ dP·D^h]
           ≤ 2 exp(−d²P²D^{2h}/(4σ))            (using Lemma 6)
           = 2 exp(−(P²/(4σ))·d²(1 − 1/d)^{2h})
           ≤ 2 exp(−(P²/(4σ))·(d − h)²)          (∵ (1 − 1/d)^h ≥ 1 − h/d)
           ≤ 2 exp(−P²σ)                         (∵ (d − h)²/(4σ) ≥ σ)


Here we have used the fact that h ≤ 2σ which, along with the setting d = 4σ, implies that (d − h) ≥ 2σ.

Lemma 8. Choose d = 4σ as the dimension of FSketch and choose a prime p and an error parameter δ ∈ (0, 1) (ensure that 1 − 1/p ≥ (4/√σ)·√(ln(2/δ)); see the proof for discussion). Then the estimator defined in Definition 5 is close to the Hamming distance between x and y with high probability, i.e.,

Pr[|ĥ − h| ≥ (32/(1 − 1/p))·√(σ ln(2/δ))] ≤ δ.

Proof. Denote |ĥ − h| by ∆h and let α = √(d ln(2/δ)). We will prove that ∆h < (32/P)·√(σ ln(2/δ)) for the case |f − f*| ≤ α, which, by Lemma 6, happens with probability at least 1 − 2 exp(−α²/(4σ)) = 1 − δ.

First we make a few technical observations, all of which are based on standard inequalities of binomial series and logarithmic functions. It will be helpful to remember that D = 1 − 1/d ∈ (0, 1).

Observation 14. For reasonable values of σ and reasonable values of δ, almost all primes satisfy the bound P ≥ (4/√σ)·√(ln(2/δ)). We will assume this inequality to hold without loss of generality².

For example, p = 2 is sufficient for σ ≈ 1000 and δ ≈ 0.001 (remember that P = 1 − 1/p). Furthermore, observe that P is an increasing function of p, and the right-hand side is a decreasing function of σ, increasing with decreasing δ but at an extremely slow logarithmic rate.

Observation 15. dP/α > 4 can be assumed without loss of generality. This holds since the left-hand side is dP/(√d·√(ln(2/δ))) = P√d/√(ln(2/δ)) ≥ 4√d/√σ (by Observation 14), which is at least 4.

Observation 16. Based on the above assumptions, f < dP.

Proof of Observation. We will prove that √(d ln(2/δ)) < dP·D^h. Since |f − f*| ≤ √(d ln(2/δ)) and f* = dP(1 − D^h), it then follows that f ≤ f* + √(d ln(2/δ)) < dP. Now,

dP·D^h/√d = (P/√d)·d(1 − 1/d)^h ≥ (P/√d)·d(1 − h/d) = (P/√d)·(d − h) ≥ (P/√d)·(d/2)   (∵ h ≤ 2σ, d − h ≥ 2σ = d/2)
          = P√σ ≥ 4√(ln(2/δ))   (Observation 14)

which proves the claim stated at the beginning of the proof.

Based on this observation, ĥ is calculated as ln(1 − f/(dP))/ln D (see Definition 5). Thus, we get D^ĥ = 1 − f/(dP). Further, from Equation 3 we get D^h = 1 − f*/(dP).

Observation 17. D^h ≥ D^{2σ} ≥ 9/16. This is since h ≤ 2σ and D^σ = (1 − 1/d)^σ ≥ 1 − σ/d = 3/4.

2. If the reader is wondering why we are not proving this fact, it may be observed that this relationship does not hold for small values of σ, e.g., σ = 16, δ = 0.01.

Observation 18. D^ĥ > 5/16.

This is not so straightforward as Observation 17, since ĥ is calculated using a formula and is not guaranteed, ab initio, to be upper bounded by 2σ.

Proof of Observation. We will prove that f/(dP) < 11/16, which will imply that D^ĥ = 1 − f/(dP) > 5/16.

For the proof of the lemma we have considered the case that f ≤ f* + α. Therefore, f/(dP) ≤ f*/(dP) + α/(dP). Substituting the value of f* = dP(1 − D^h) from Equation 3 and using Observation 17, we get the bound f/(dP) ≤ 7/16 + α/(dP). We can further simplify the bound using Observation 15: f/(dP) ≤ 7/16 + α/(dP) ≤ 7/16 + 1/4 < 11/16, validating the observation.

Now we get into the main proof, which proceeds by considering two possible cases.

(Case ĥ ≥ h, i.e., ∆h = ĥ − h:) We start with the identity D^h − D^ĥ = (f − f*)/(dP). Notice that the RHS is bounded from above by α/(dP), and the LHS can be bounded from below as

D^h − D^ĥ = D^h(1 − D^{∆h}) > (9/16)(1 − D^{∆h}),

where we have used Observation 17. Combining these facts we get α/(dP) > (9/16)(1 − D^{∆h}).

(Case h ≥ ĥ, i.e., ∆h = h − ĥ:) In a similar manner, we start with the identity D^ĥ − D^h = (f* − f)/(dP), in which the RHS we again bound from above by α/(dP) and the LHS is treated similarly (but now using Observation 18):

D^ĥ − D^h = D^ĥ(1 − D^{∆h}) > (5/16)(1 − D^{∆h}),

and then, α/(dP) > (5/16)(1 − D^{∆h}).

So in both the cases we show that α/(dP) > (5/16)(1 − D^{∆h}). Our desired bound on ∆h can now be obtained.

∆h ln D ≥ ln(1 − (16/5)·α/(dP)) ≥ −(16α/(5dP))/(1 − 16α/(5dP)) = −16α/(5dP − 16α)
(using the inequality ln(1 + x) ≥ x/(1 + x) for x > −1)

∴ ∆h ≤ (1/ln(1/D)) · 16α/(5dP − 16α) ≤ 16αd/(5dP − 16α)
(it is easy to show that ln(1/D) = ln(1/(1 − 1/d)) ≥ 1/d)

       = (16/5)·d/((dP/α) − 16/5) < (16/5)·d/(dP/(5α))
(using Observation 15, dP/α − 16/5 > dP/(5α))

       = 16α/P = (16/P)·√(d ln(2/δ)) = (32/P)·√(σ ln(2/δ))

Lemma 9. Suppose we know that h ≤ √σ and choose d = 16√(σ ln(2/δ)) as the dimension for FSketch. Then (a) f < dP also holds with high probability and, moreover, we get a better estimator, that is, (b) Pr[|ĥ − h| ≥ (8/(1 − 1/p))·√(σ ln(2/δ))] ≤ δ.

Proof of (a), f < dP with high probability. Following the steps of the proof of Lemma 7,

Pr[f ≥ dP] ≤ 2 exp(−d²P²D^{2h}/(4σ)) ≤ 2 exp(−P²(d − h)²/(4σ)).


Let L denote √(ln(2/δ)); note that L > 1. Now, d = 16L√σ and h ≤ √σ. So, d − h ≥ 15L√σ > 15√σ and, therefore, (d − h)²/σ > 225. Using this bound in the equation above, we can upper bound the right-hand side by 2 exp(−225(1 − 1/p)²/4), which is a decreasing function of p; its largest value, attained at the smallest prime p = 2, is 2 exp(−225/(4·4)) ≈ 10⁻⁶.

Proof of (b), a better estimator of h. The proof is almost exactly the same as that of Lemma 8, with only a few differences. We set α = d/8, where d = 16√(σ ln(2/δ)). Incidentally, the value of α remains the same in terms of σ (α = √(4σ ln(2/δ))). Thus, the probability of error remains the same as before: 2 exp(−d²/(64·4σ)) = δ.

Observation 14 is true without any doubt. dP/α = 8P, which is greater than 4 for any prime number; so Observation 15 is true in this scenario.

Observation 16 requires a new proof. Following the steps of the above proof of Observation 16, it suffices to prove that dP·D^h > d/8:

P·D^h = P(1 − 1/d)^h ≥ P(1 − h/d) = P·(d − h)/d ≥ P·(15L√σ)/(16L√σ) = (15/16)·P > (15/16)·(1/2) > 1/8.

Observation 17 is now tighter since D^h ≥ D^{√σ} = (1 − 1/d)^{√σ} ≥ 1 − √σ/d = 1 − 1/(16√(ln(2/δ))) ≥ 3/4 for reasonable values of δ. Similarly, Observation 18 is also tighter (it relies only on the above observations) since f*/(dP) = 1 − D^h ≤ 1 − 3/4 and α/(dP) < 1/4; we get D^ĥ > 1/2.

Following similar steps as above, for the case ĥ ≥ h we get α/(dP) > (3/4)(1 − D^{∆h}), and for the case h ≥ ĥ we get α/(dP) > (1/2)(1 − D^{∆h}), leading to the common condition that α/(dP) > (1/2)(1 − D^{∆h}). The final thing to calculate is the bound on ∆h.

∆h ln D ≥ ln(1 − 2α/(dP)) ≥ −(2α/(dP))/(1 − 2α/(dP)) = −2α/(dP − 2α)
(using the inequality ln(1 + x) ≥ x/(1 + x) for x > −1)

∴ ∆h ≤ (1/ln(1/D)) · 2α/(dP − 2α) ≤ 2αd/(dP − 2α)
(it is easy to show that ln(1/D) = ln(1/(1 − 1/d)) ≥ 1/d)

       = 2d/((dP/α) − 2) < 2d/(dP/(2α))
(using Observation 15, dP/α − 2 > dP/(2α))

       = 4α/P = (4/P)·√(4σ ln(2/δ)) = (8/P)·√(σ ln(2/δ))

APPENDIX C
COMPLEXITY ANALYSIS OF FSketch

There are two major operations with respect to FSketch: construction of sketches and estimation of Hamming distance from two sketches. We will discuss their time and space requirements. There are efficient representations of sparse data vectors, but for the sake of simplicity we assume full-size arrays to store vectors; similarly, we assume simple dictionaries for storing the internal variables ρ, R of FSketch. While it may be possible to reduce the number of random bits by employing k-wise independent bits and mappings, we leave it out of the scope of this work and for future exploration.

1) Construction: Sketches are constructed by the FSketch algorithm, which does a linear pass over the input vector, maps every non-zero attribute to some entry of the sketch vector and then updates that corresponding entry. The time to process one data vector becomes Θ(n) + O(σ · poly(log p)), which is O(n) for constant p. The internal variables ρ, R, p require space Θ(n log d), Θ(n log p) and Θ(log p), respectively, which is almost O(n) if σ ≪ n. Furthermore, ρ and R, which can consume the bulk of this space, can be freed once the sketch construction phase is over. A sketch itself consumes Θ(d log p) space.

2) Estimation: There is no additional space requirement for estimating the Hamming distance of a pair of points from their sketches. The estimator scans both the sketches and computes their Hamming distance; finally, it computes an estimate by using Definition 5. The running time is O(d log p).
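To make these two operations concrete, here is a minimal Python sketch of the construction (each non-zero coordinate i contributes x_i·r_i to cell ρ(i) modulo p) and of the estimator of Definition 5. It is only an illustration, not our released implementation; in particular, the seeding and the exact ranges from which ρ and the multipliers r_i are sampled are assumptions made for this sketch.

```python
import math
import random

def sample_fsketch_params(n, d, p, seed=0):
    """Sample the internal parameters: a random map rho from {0,...,n-1} to
    {0,...,d-1} and random multipliers r_i (drawn here from {1,...,p-1})."""
    rng = random.Random(seed)
    rho = [rng.randrange(d) for _ in range(n)]
    r = [rng.randrange(1, p) for _ in range(n)]
    return rho, r

def fsketch(x, rho, r, d, p):
    """Compress a categorical vector x (entries in {0,...,c}) into a
    d-dimensional sketch with phi[j] = sum over {i : rho(i)=j} of x_i*r_i mod p."""
    phi = [0] * d
    for i, xi in enumerate(x):
        if xi != 0:                              # only non-zero attributes contribute
            phi[rho[i]] = (phi[rho[i]] + xi * r[i]) % p
    return phi

def estimate_hamming(phi_x, phi_y, d, p, sigma):
    """Estimate the Hamming distance between the original vectors from their
    sketches, following Definition 5 (and line 3 of Algorithm 3)."""
    f = sum(1 for a, b in zip(phi_x, phi_y) if a != b)   # sketch Hamming distance
    P, D = 1 - 1 / p, 1 - 1 / d
    if f < d * P:
        return math.log(1 - f / (d * P)) / math.log(D)
    return 2 * sigma                                     # fallback used in Algorithm 3
```

Note that the same (ρ, r) pair must be used for every data point so that their sketches are comparable, and, as stated above, ρ and r can be discarded once all sketches have been built.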

APPENDIX D
PROOFS FROM SECTION 3.4

Lemma 10. If d = 4σ (as required by Lemma 8), then the expected number of non-zero entries of φ(x) is upper bounded by d/4. Further, at least 50% of φ(x) will be zero with probability at least 1/2.

Proof. The lemma can be proved by treating it as a balls-and-bins problem. Imagine throwing σ balls (treat them as the non-zero attributes of x) into d bins (treat them as the sketch cells) independently and uniformly at random. If the j-th bin remains empty then φ_j(x) must be zero (the converse is not true). Therefore, the expected number of non-zero cells in the sketch is upper bounded by the expected number of non-empty bins, which can be easily shown to be d[1 − (1 − 1/d)^σ]. Using the stated value of d, this expression can further be upper bounded:

d[1 − (1 − 1/d)^σ] ≤ d[1 − (1 − σ/d)] = d/4.

Furthermore, let NZ denote the number of non-zero entries in φ(x). We derived above that E[NZ] ≤ d/4. Markov's inequality can help in upper bounding the probability that φ(x) contains many non-zero entries:

Pr[NZ ≥ d/2] ≤ E[NZ]/(d/2) ≤ 1/2.
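As an informal sanity check of the balls-and-bins argument (not part of the proof), one can estimate the expected number of non-empty bins empirically; the helper below is a hypothetical illustration in Python.

```python
import random

def avg_nonempty_bins(sigma, trials=10000, seed=0):
    """Monte Carlo estimate of the expected number of non-empty bins when
    sigma balls are thrown into d = 4*sigma bins uniformly at random."""
    rng = random.Random(seed)
    d = 4 * sigma
    total = 0
    for _ in range(trials):
        bins = set(rng.randrange(d) for _ in range(sigma))  # occupied bins
        total += len(bins)
    return total / trials   # should not exceed d/4 = sigma, per Lemma 10

# For sigma = 100 the returned value is roughly d*(1-(1-1/d)**sigma) ≈ 88.6,
# comfortably below the bound d/4 = 100.
```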

APPENDIX E
REPRODUCIBILITY DETAILS

E.1 Baseline implementations

1) We implemented the feature hashing (FH) [20], SimHash (SH) [21], Sketching via Stable Distribution


(SSD) [51] and One Hot Encoding (OHE) [17] algorithms on our own; we have made these implementations publicly available³.

2) For the Kendall rank correlation coefficient [14] we used the implementation provided by the pandas data frame⁴.

3) For Latent Semantic Analysis (LSA) [32], Latent Dirichlet Allocation (LDA) [33], Non-negative Matrix Factorisation (NNMF) [34], and vanilla Principal Component Analysis (PCA), we used their implementations available in the sklearn.decomposition library⁵.

4) For Multiple Correspondence Analysis (MCA) [28], we used a Python library⁶.

5) For HCA [53], we performed hierarchical clustering⁷ over the features, in which we set the number of clusters to the value of the reduced dimension. We then randomly sampled one feature from each of the clusters, and considered the data points restricted to the sampled features.

6) For CATPCA [53], we used an R package⁸.

It should be noted that PCA, MCA and LSA cannot reduce the dimension beyond the number of data points.

E.2 Reproducibility details for clustering task

We first generated the ground-truth clustering results on the datasets using k-mode [55] (we used a Python library⁹).

We then compressed the datasets using the baselines. Of them, feature hashing [20], SimHash [21], and the Kendall rank correlation coefficient [14] generate integer/discrete-valued sketches on which we can define a Hamming distance. Therefore we use the k-mode algorithm on the compressed datasets. On the other hand, Latent Semantic Analysis (LSA) [32], Latent Dirichlet Allocation (LDA) [33], Non-negative Matrix Factorisation (NNMF) [34], Principal Component Analysis (PCA), and Multiple Correspondence Analysis (MCA) [28] generate real-valued sketches. For these we used the k-means algorithm (available in the sklearn library¹⁰) on the compressed datasets. We set random_state = 42 for both k-mode and k-means.

We evaluated the clustering outputs using the purity index. Let m be the number of data points and Ω = {ω_1, ω_2, . . . , ω_k} be the set of clusters obtained on the original data. Further, let C = {c_1, c_2, . . . , c_k} be the set of clusters obtained on the reduced-dimensional data. Then the purity index of the clusters C is defined as

purity index(Ω, C) = (1/m) · Σ_{i=1}^{k} max_{1≤j≤k} |ω_i ∩ c_j|.
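A small Python sketch of this measure follows directly from the definition; it assumes (as an interface choice for the illustration, not part of the definition above) that the two clusterings are given as per-point cluster labels of the same length m.

```python
def purity_index(original_labels, reduced_labels):
    """Purity index between the clustering Omega obtained on the original data
    and the clustering C obtained on the reduced-dimensional data, both given
    as per-point cluster labels."""
    m = len(original_labels)
    overlap = {}                                  # (omega_i, c_j) -> |omega_i ∩ c_j|
    for w, c in zip(original_labels, reduced_labels):
        overlap[(w, c)] = overlap.get((w, c), 0) + 1
    best = {}                                     # omega_i -> max_j |omega_i ∩ c_j|
    for (w, _), count in overlap.items():
        best[w] = max(best.get(w, 0), count)
    return sum(best.values()) / m
```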

3. https://github.com/Anonymus135/F-Sketch
4. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html
5. https://scikit-learn.org/stable/modules/classes.html#module-sklearn.decomposition
6. https://pypi.org/project/mca/
7. https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html
8. https://rdrr.io/rforge/Gifi/man/princals.html
9. https://pypi.org/project/kmodes/
10. https://scikit-learn.org/stable/modules/classes.html#module-sklearn.cluster

APPENDIX F
ERRORS DURING DIMENSIONALITY REDUCTION EXPERIMENTS

Several baselines give an out-of-memory error or their running time is quite high on some datasets. This makes it infeasible to include them in the empirical comparison on RMSE and the other end tasks.

We list these errors here. OHE gives an out-of-memory error for the Brain cell dataset. HCA gives DNS errors on the NYTimes and BrainCell datasets. CATPCA could run only on the KOS and DeliciousMIL datasets, and that too only up to a reduced dimension of 300; other than that it gives a DNS error. VAE gives a DNS error on the Enron dataset. KT gives an out-of-memory error for NYTimes and Brain cell, and on Enron it did not stop even after 10 hrs. MCA also gives an out-of-memory error for the NYTimes and Brain cell datasets. Further, the dimensionality reduction time for NNMF was quite high – on NYTimes it takes around 20 hrs to do the dimensionality reduction for 3000 dimensions, and on the Brain cell dataset NNMF did not stop even after 10 hrs. These errors prevented us from performing dimensionality reduction for all dimensions using some of the algorithms.

APPENDIX G
EXTENDED EXPERIMENTAL RESULTS

This section contains the remaining comparative plots for the RMSE (Figure 11), clustering (Figure 12), similarity search experiments (Figure 13) and the dimensionality reduction time (Figure 14).

APPENDIX H
Median-FSketch: COMBINING MULTIPLE FSketches

We proved in Lemma 8 that our estimate ĥ is within an additive error of h. A standard approach to improve the accuracy in such situations is to obtain several independent estimates and then compute a suitable statistic of the estimates. We were faced with a choice of the mean, median and minimum of the estimates, of which we decided to choose the median after extensive empirical evaluation (see Section H.3) and obtaining theoretical justification (explained in Section H.2). We first explain our algorithms in the next subsection.

H.1 Algorithms for generating a sketch and estimating Hamming distance

Let k, d be some suitably chosen integer parameters. An arity-k, dimension-d Median-FSketch for a categorical data point, say x, is an array of k sketches: Φ(x) = 〈φ_1(x), φ_2(x), . . . , φ_k(x)〉; the i-th entry of Φ(x) is a d-dimensional FSketch. See Figure 15 for an illustration. Note that the internal parameters ρ, R, p required to run FSketch to obtain the i-th entry are the same across all data points; the parameters corresponding to different i are, however, chosen independently (p can be the same).

Our algorithm for Hamming distance estimation is inspired by the Count-Median sketch [56] and Count sketch [57]. It estimates the Hamming distances between the pairs of "rows" from Φ(x) and Φ(y) and returns the median of the estimated distances; this procedure is given in Algorithm 3.


[Figure 11: panels NIPS, Delicious, GISETTE; x-axis: Reduced Dimension; y-axis: RMSE; legend: FSketch, FH, SH, SSD, OHE, KT, HCA.]

Fig. 11. Comparison of RMSE values. A lower value is an indication of better performance. The GISETTE dataset is of 5000 dimensions and hence FSketch suffers from an increase in RMSE as the embedding dimension also reaches 5000.

[Figure 12: panels Gisette (k = 5), NYTimes (k = 5); x-axis: Reduced Dimension; y-axis: Purity Index; legend: FSketch, FH, SH, SSD, OHE, KT, LDA, LSA, PCA, MCA, NNMF, VAE, HCA.]

Fig. 12. Comparing the quality of clusters on the compressed datasets.

[Figure 13: panels DeliciousMIL, KOS, Gisette; x-axis: Reduced Dimension; y-axis: Accuracy; legend: FSketch, FH, SH, SSD, OHE, KT, LDA, LSA, PCA, MCA, NNMF, VAE, CATPCA, HCA.]

Fig. 13. Comparing the performance of the similarity search task (estimating top-k similar points with k = 100) achieved on the reduced dimensional data obtained from various baselines.

[Figure 14: panels NIPS, Gisette, DeliciousMIL; x-axis: Reduced Dimension; y-axis: Time (Seconds); legend as in Figure 13.]

Fig. 14. Comparison of the dimensionality reduction times.


[Figure 15: illustration of Median-FSketch. Each data point x in the dataset is mapped to a k × d array Φ(x) whose rows φ_1(x), . . . , φ_k(x) are FSketches; e.g., a cell to which ρ maps coordinates 1, 4 and 7 is computed as x_1·r_1 + x_4·r_4 + x_7·r_7 (mod p).]

Fig. 15. Median-FSketch for categorical data: the sketch of each data point is a 2-dimensional array, each row of which is an FSketch. The i-th rows corresponding to all the data points use the same values of ρ, R.


Algorithm 3 Estimate the Hamming distance between x and y from their Median-FSketches

Input: Φ(x) = 〈φ_1(x), φ_2(x), . . . , φ_k(x)〉, Φ(y) = 〈φ_1(y), φ_2(y), . . . , φ_k(y)〉
1: for i = 1 . . . k do
2:   Compute f = Hamming distance between φ_i(x) and φ_i(y)
3:   If f < dP, ĥ_i = ln(1 − f/(dP))/ln D
4:   Else ĥ_i = 2σ
5: end for
6: return ĥ = median{ĥ_1, ĥ_2, . . . , ĥ_k}
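The following minimal Python sketch mirrors Algorithm 3; it assumes (as an illustrative interface) that the two Median-FSketch arrays are given as lists of k rows, each row being a d-dimensional sketch built with the same ρ, R per row.

```python
import math
import statistics

def estimate_hamming_median(Phi_x, Phi_y, d, p, sigma):
    """Algorithm 3: compute a row-wise estimate from each of the k FSketch
    pairs and return the median of the k estimates."""
    P, D = 1 - 1 / p, 1 - 1 / d
    estimates = []
    for phi_x, phi_y in zip(Phi_x, Phi_y):
        f = sum(1 for a, b in zip(phi_x, phi_y) if a != b)   # sketch Hamming distance
        if f < d * P:
            estimates.append(math.log(1 - f / (d * P)) / math.log(D))
        else:
            estimates.append(2 * sigma)
    return statistics.median(estimates)
```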

H.2 Theoretical justification

We now give a proof that our Median-FSketch estimator offers a better approximation. Recall that σ indicates the maximum number of non-zero attributes in any data vector, and is often much smaller than their dimension, n. Surprisingly, our results are independent of n.

Lemma 19. Let ĥ_m denote the median of the estimates of the Hamming distance obtained from t independent FSketch vectors of dimension 4σ and let h denote the actual Hamming distance. Then,

Pr[|ĥ_m − h| ≥ 18√σ] ≤ δ

for any desired δ ∈ (0, 1) if we use t ≥ 48 ln(1/δ).

Proof. We start by using Lemma 8 with p = 3 and error (δ in the lemma statement) set to 1/4. Let ĥ_i denote the i-th estimate. From the lemma we get that

Pr[|ĥ_i − h| ≥ 18√σ] ≤ 1/4.

Define indicator random variables W_1, . . . , W_t as W_i = 1 iff |ĥ_i − h| ≥ 18√σ. We immediately have Pr[W_i = 1] ≤ 1/4. Notice that W_i = 1 indicates the event that ĥ_i falls outside the range [h − 18√σ, h + 18√σ]. Now, ĥ_m is the median of ĥ_1, ĥ_2, . . . , ĥ_t, and so, ĥ_m falls outside the range [h − 18√σ, h + 18√σ] only if more than half of the estimates fall outside this range, i.e., if Σ_{i=1}^{t} W_i > t/2. Since E[Σ_i W_i] ≤ t/4, the probability of this event is easily bounded by exp(−(1/2)²·(t/4)/3) = e^{−t/48} ≤ δ using Chernoff's bound.

H.3 Choice of statistics in Median-FSketch

We conducted an experiment to decide whether to take the median, mean or minimum of the k FSketch estimates in the Median-FSketch algorithm. We randomly sampled a pair of points and estimated the Hamming distance from its low-dimensional representation obtained from FSketch. We repeated this 10 times over different random mappings and computed the median, mean, and minimum of those 10 different estimates. We further repeated this experiment 10 times and generated a box plot of the readings, which is presented in Figure 16. We observe that the median has the lowest variance and also closely estimates the actual Hamming distance between the pair of points.

APPENDIX I
DIMENSIONALITY REDUCTION ALGORITHMS


[Figure 16: panels NYTimes (reduced dimension 400), NYTimes (reduced dimension 1000), Enron (reduced dimension 3000); y-axis: Hamming Estimate; box plots for Median, Mean, Min.]

Fig. 16. Box plots of the median, mean, and minimum of the estimates obtained from 10 repetitions of FSketch; each experiment is repeated 10 times to compute the variance of these statistics. The black dotted line corresponds to the actual Hamming distance.

TABLE 8
A tabular summary of popular dimensionality reduction algorithms. Linear dimensionality reduction algorithms are those whose features in the reduced dimension are linear combinations of the input features, and the others are known as non-linear algorithms. Supervised dimensionality reduction methods are those that require labelled datasets for dimensionality reduction.

S.No. | Data type of input vectors | Objective / properties | Data type of sketch vectors | Result | Supervised or unsupervised | Type of dimensionality reduction
1  | Real-valued vectors | Approximating pairwise Euclidean distance, inner product | Real-valued vectors | JL-lemma [23] | Unsupervised | Linear
2  | Real-valued vectors | Approximating pairwise Euclidean distance, inner product | Real-valued vectors | Feature Hashing [20] | Unsupervised | Linear
3  | Real-valued vectors | Approximating pairwise cosine or angular similarity | Binary vectors | SimHash [21] | Unsupervised | Non-linear
4  | Real-valued vectors | Approximating pairwise ℓ_p norm for p ∈ (0, 2] | Real-valued vectors | p-stable random projection (SSD) [58] | Unsupervised | Linear
5  | Sets | Approximating pairwise Jaccard similarity | Integer-valued vectors | MinHash [30] | Unsupervised | Non-linear
6  | Sparse binary vectors | Approximating pairwise Hamming distance, inner product, Jaccard and cosine similarity | Binary vectors | BinSketch [17] | Unsupervised | Non-linear
7  | Real-valued vectors | Minimize the variance in low dimension | Real-valued vectors | Principal Component Analysis (PCA) | Unsupervised | Linear
8  | Real-valued vectors (labelled input) | Maximizes class separability in the reduced dimensional space | Real-valued vectors | Linear Discriminant Analysis [59] | Supervised | Linear
9  | Real-valued vectors | Embedding high-dimensional data for visualization in a low-dimensional space of two or three dimensions | Real-valued vectors | t-SNE [36] | Unsupervised | Non-linear
10 | Real-valued vectors | Minimize the reconstruction error | Real-valued vectors | Auto-encoder [60] | Unsupervised | Non-linear
11 | Real-valued vectors | Extracting nonlinear structures in low dimension via a kernel function | Real-valued vectors | Kernel-PCA [61] | Unsupervised | Non-linear
12 | Real-valued vectors | Factorize the input matrix into two small non-negative matrices | Real-valued vectors | Non-negative matrix factorization (NNMF) [34] | Unsupervised | Linear
13 | Real-valued vectors | Compute a quasi-isometric low-dimensional embedding | Real-valued vectors | Isomap [62] | Unsupervised | Non-linear
14 | Real-valued vectors | Preserves the topological structure of the data | Real-valued vectors | Self-organizing map [63] | Unsupervised | Non-linear