
A Novel Approach to Text Compression Using N-Grams

Dylan Freedman Carmel High School

March 5, 2010

Advisors: Craig Martell, PhD. Naval Postgraduate School George Dinolt, PhD. Naval Postgraduate School


Abstract

In our current age of high-speed communications, the spread and growth of data are taking off at higher rates than the world's available storage can handle. With about 4.6 billion mobile phone subscribers and approximately 2 billion internet users, this problem has extended throughout the globe and will continue at exponential rates. For individuals and businesses, compressing files is becoming more and more relevant, as cutting down on digital storage has the potential to save money and electricity, lessening the user's digital footprint and helping the environment through reduced consumption of resources. Because of this demand, most attention and research in file compression has been devoted to larger files, typically media files, so as to reap the maximum benefit in compression ratio. However, recent compressors are often unable to compress a small file (less than 1 KB), both because a statistical model of the file's properties must be included in the compressed archive and because there is insufficient data from which to construct that model.

In my project, I propose a method that can effectively compress files of English text by using a standardized language model: Google's N-Gram corpus. Composed of almost 90 GB of raw text files, this database contains n-grams up to 5-grams that were observed more than 40 times in a one-trillion-word survey of the internet. For efficiency, I rearranged this database to optimize retrieval of information through a binary search, and I wrote an undemanding program that uses a simplified Katz back-off model to find the relative probability of a word occurring in an n-gram given a preceding context. I then designed a compressor that first converts an input file of text to a sequence of these probabilities encoded with variable-byte encoding, and then applies arithmetic coding to the resulting file. With this process, I was able to achieve an average compression ratio of 25% or less on text files, even those only a few hundred bytes long. With the added benefit of not having to include a statistical model of the input file's properties, this method is ideally suited for enterprises that transmit thousands of small text files. The initial n-gram database could be hosted remotely, and its large size would eventually be made up for in file size savings. One practical application that comes to mind is SMS text messaging, which encompasses over 4.1 billion text messages sent daily.

Introduction

With our society's widespread proliferation of information, it may come as a surprise that there is not enough storage available on all the hard drives in the world to fit the volume of digital data created by people, and following current trends this situation will only worsen. There are currently 4.6 billion cell phone subscribers in the world and 1.7 billion internet users, numbers which have been growing exponentially over the past few years. The consequences of this excessive use of information are costly and have a negative effect on the environment as new data centers are being erected every day.


But is the situation hopeless? No; almost every conflict is met by mankind with innovation, and the abundance of information is no exception. Data compression is the tool commonly used to reduce the size of files, and computer users deal with these techniques daily whether they know it or not. Images in websites are almost always compressed with a lossy algorithm called JPEG, and uncompressed audio and video are usually compressed to a lower-quality file type such as MP3 or AVI. But most research and attention in the realm of compression is devoted to media files or large general-purpose files, both of which take up a good amount of data. Due to a natural interest in linguistics, I wondered why there were not any specialized file formats designed for compressing text.

As I became more involved in this field, I figured that English text would be a good candidate for compression experimentation for the following reasons:

1. There are many external corpora of English text available for examining the language's linguistic properties, and an abundance of research has already been conducted.

2. English is prolific and relevant, yet challenging for computers to comprehend.

3. Text files are typically small and do not consume much space; however, Internet services and large companies often handle thousands of these files and could save space and time with an efficient compressor.

4. Furthermore, due to the relatively high compressibility of English through conventional means, preexisting work on this subject is sparse compared to that of generic file compression, leaving room for innovation.

My research began with an investigation of the commonly employed lossless compression schemes, and over time I discovered that most fell into two categories: (1) static coders - compression schemes that build a statistical model of the properties of the file and then compress the file efficiently using this model, at the expense of having to store these statistics (usually in the form of a frequency table) alongside the compressed contents; and (2) adaptive coders - those that feature an adaptive compression scheme, starting with a limited, generic statistical model and dynamically updating it as the file is read, with the benefit of not having to store this model inside the final file. Both categories of compression generate statistical models, which allow compressors and decompressors to represent more frequently occurring patterns with fewer bits and less frequently occurring patterns with more bits, so as to produce an optimally short file.
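For a sense of what the adaptive case looks like in practice, here is a minimal sketch (not drawn from any specific compressor) of the kind of statistical model an adaptive coder maintains; the entropy coder that would consume these probabilities is omitted:

    import java.util.*;

    // Illustrative sketch of the statistical model behind an adaptive coder:
    // the model starts out generic (every byte equally likely) and is updated
    // as each symbol is read, so no frequency table needs to be stored in the
    // compressed file.
    public class AdaptiveByteModel {

        private final long[] counts = new long[256];
        private long total = 256;

        public AdaptiveByteModel() {
            Arrays.fill(counts, 1);   // start with a flat, generic model
        }

        // Probability the model currently assigns to `symbol` (0-255).
        public double probability(int symbol) {
            return (double) counts[symbol] / total;
        }

        // Called after each symbol is processed, by compressor and decompressor
        // alike, so both sides stay synchronized without transmitting the model.
        public void update(int symbol) {
            counts[symbol]++;
            total++;
        }
    }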

However, while researching these algorithms, I asked myself why a statistical model needed to be generated for each and every file to be compressed, even if multiple files shared some statistical properties. Does this redundancy not challenge the very purpose of compression, where the goal is to eliminate all redundancy?

This led me to question the existing compression methods. I soon found that both categories were highly inefficient when dealing with small files. For a given small file (<1 KB), static coding algorithms usually produced files larger than the original, because the compressed archive had to include the file's statistical model (which was often bigger than the file itself), and adaptive coders could not compress effectively either, as the sample of data within a small file was typically insufficient to produce a useful statistical model. But what if a comprehensive, external database existed that could generalize the statistical properties of a type of file? If this were the case, a file of this type might not require a compressor to generate these properties.

To take a simple example, consider the average English-speaking adult, who can naturally understand the syntax and semantics of his language and could loosely be considered an external linguistics engine. Claude Shannon, the father of information theory, proposed an experiment in a 1950 paper in which he attempted to calculate the entropy of English by asking an average English-speaking person to guess a phrase he was thinking of letter by letter (including spaces). When a letter was guessed correctly, Shannon would write under that letter the number of guesses required, and the subject would then attempt to guess the next letter.

In one such experiment on a human subject, the subject correctly guessed the letter on his first try for 79 of the 102 letters in the sequence. He was able to perform such a feat by logically choosing letters that have a high likelihood of following the preceding context of the phrase. Through a series of calculations, Shannon placed the redundancy of English text at approximately 75%, meaning that a piece of text could optimally be represented in about a quarter of its length.
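To see roughly where that figure comes from (the numbers below are illustrative values consistent with Shannon's estimates, not taken from this paper): a uniform code over a 27-symbol alphabet (26 letters plus the space) needs about log2(27) ≈ 4.75 bits per character, while Shannon's experiments bounded the true entropy of English at roughly 0.6 to 1.3 bits per character. Taking H ≈ 1.2 bits per character,

    redundancy = 1 - H / H_max ≈ 1 - 1.2 / 4.75 ≈ 0.75,

which is why an ideal coder could, in principle, represent the text in about a quarter of its original length.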

For the purposes of my experiment, I wondered how well this technique of predicting text from context could be emulated by a computer. Was this process rooted in linguistic understanding beyond computers' current capacity, or could it be feasibly implemented?

I researched more on existing corpora and was ultimately driven towards Google's 2006 N-Gram corpus, titled Web 1T 5-gram Version 1. This database is composed of n-grams up to 5-grams that appeared over 40 times in a one-trillion-word corpus of text from the Internet. Used correctly, Google's n-grams could be accessed to determine a word's conditional probability given a context of up to 4 words in length. The knowledge of this powerful, comprehensive database finally led to my experimental question: Using Google's n-gram data, is it possible to compress files efficiently, and if so, does this method offer any advantages over traditional compression algorithms when dealing with small English text files?
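As a point of reference for how such a corpus yields conditional probabilities (the formula below is the standard count-based estimate; it is implicit in the worked example later in this paper rather than stated there explicitly):

    P(w_n | w_1 ... w_{n-1}) ≈ count(w_1 ... w_n) / count(w_1 ... w_{n-1}),

that is, the number of times the full n-gram was observed divided by the number of times its leading (n-1)-gram was observed.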

Hypothesis

Using Google's n-gram database, Web 1T 5-gram Version 1, it is possible to compress small files of English text with a much better (lower) compression ratio than conventional algorithms can achieve.


Querying the N-Gram Database

As distributed, the Google n-gram database was in a format that made it unmanageable to extract useful data. After ordering the database from its distributor, the Linguistic Data Consortium (LDC), and signing a license agreement, I received a six-DVD set in the mail that contained the database in compressed gzip archives. The decompression process took a couple of hours, and I was then required to move some files so that the database was contained within five folders, "1GM," "2GM," … "5GM," where the folder number indicated the type of n-grams inside (i.e. "2GM" corresponds to 2-grams). Each folder contained multiple text files labeled numerically, along with an index file that listed the first n-gram in each of the files. Inside each file were listings of n-grams in alphabetical order, with each line also containing the observed count of that n-gram within the database, delimited by a tab character. Here is an example excerpt from one of the 4-gram files (n-gram, then observed count):

    serve as the indicator        120
    serve as the indicators       45
    serve as the indispensable    111
    serve as the indispensible    40
    serve as the individual       234
    serve as the industrial       52
    serve as the industry         607
    serve as the info             42
    serve as the informal         102
    serve as the information      838
    serve as the informational    41
    serve as the infrastructure   500
    serve as the initial          5331
    serve as the initiating       125
    serve as the initiation       63
    serve as the initiator        81
    serve as the injector         56
    serve as the inlet            41
    serve as the inner            87
    serve as the input            1323

As shown above, each n-gram can be used to extract the conditional probability of a word occurring. Using a line from the example above, infrastructure follows serve as the 500 times out of the total number of 4-grams that start with serve as the. For the purposes of my project, I wanted to be able to extract the relative likelihood of a word occurring in a given context. In other words, I desired a computationally feasible method to determine the relative probability that a word would occur in comparison with the other words that could have occurred following a given context. To accomplish this, I devised a ranking system that relies on the fact that the ordering of the conditional probabilities of the possible words following a given context is preserved even when these probabilities are scaled. My first step involved sorting the database by observed count within each set of n-grams that share a base context. Following the example above (and assuming that this excerpt contains all 4-grams with the context serve as the), the excerpt would be rearranged to form the first table below. From this sorted table, I assigned each n-gram a rank equal to one plus the number of n-grams in the database that shared the same context and had a higher conditional probability than the given n-gram, so the ranking starts at 1 (second table below). For instance, the n-gram serve as the information has a rank of 3, since information is the 3rd most common word following serve as the, according to Google's database.

Rearrangement (sorted by count):

    serve as the initial          5331
    serve as the input            1323
    serve as the information      838
    serve as the industry         607
    serve as the infrastructure   500
    serve as the individual       234
    serve as the initiating       125
    serve as the indicator        120
    serve as the indispensable    111
    serve as the informal         102
    serve as the inner            87
    serve as the initiator        81
    serve as the initiation       63
    serve as the injector         56
    serve as the industrial       52
    serve as the indicators       45
    serve as the info             42
    serve as the informational    41
    serve as the inlet            41
    serve as the indispensible    40

Assigning the rank:

    serve as the initial          1
    serve as the input            2
    serve as the information      3
    serve as the industry         4
    serve as the infrastructure   5
    serve as the individual       6
    serve as the initiating       7
    serve as the indicator        8
    serve as the indispensable    9
    serve as the informal         10
    serve as the inner            11
    serve as the initiator        12
    serve as the initiation       13
    serve as the injector         14
    serve as the industrial       15
    serve as the indicators       16
    serve as the info             17
    serve as the informational    18
    serve as the inlet            19
    serve as the indispensible    20
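The rank assignment just described can be sketched in a few lines of Java. The class and method names below are illustrative only, not the project's actual code, and the n-grams for one context are assumed to have already been read into memory as word/count pairs:

    import java.util.*;

    // Illustrative sketch: assign ranks to the words that follow one shared context,
    // ordered by observed count (highest count = rank 1), as described above.
    public class RankAssigner {

        // Returns a map from word to rank for a single context,
        // e.g. all words observed after "serve as the".
        public static Map<String, Integer> assignRanks(Map<String, Long> countsForContext) {
            List<Map.Entry<String, Long>> entries = new ArrayList<>(countsForContext.entrySet());
            // Sort by descending count; ties broken alphabetically for determinism.
            entries.sort((a, b) -> {
                int byCount = Long.compare(b.getValue(), a.getValue());
                return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
            });
            Map<String, Integer> ranks = new LinkedHashMap<>();
            int rank = 1;
            for (Map.Entry<String, Long> e : entries) {
                ranks.put(e.getKey(), rank++);   // rank 1 = most frequent continuation
            }
            return ranks;
        }

        public static void main(String[] args) {
            Map<String, Long> counts = new HashMap<>();
            counts.put("initial", 5331L);
            counts.put("input", 1323L);
            counts.put("information", 838L);
            counts.put("infrastructure", 500L);
            // Prints: {initial=1, input=2, information=3, infrastructure=4}
            System.out.println(assignRanks(counts));
        }
    }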

Then, to produce a more usable format, I created two files, each optimized for a specific data-retrieval task.

- The first file, referred to as the alphabetical ranking index, was created by sorting the ranking table produced in the previous step alphabetically (first example below). This allows the rank of a given n-gram to be extracted efficiently via a binary search. With minimal effort, this file can be queried in logarithmic time, and one could compute, say, that indicator is the 8th most common word following serve as the (still assuming that the example excerpt being used contains all the n-grams with a context of serve as the).

- The second file, referred to as the numerical ranking index, was created by swapping the final word of each n-gram with its rank in the ranking table produced in the previous step (second example below). Once again, this allows a binary search to efficiently determine the word that is the n-th most likely to follow a given context. For example, a program searching for serve as the 13 would return the word with rank 13 following the context serve as the; in this case, initiation would be returned.

Alphabetical Ranking Index:

    serve as the indicator        8
    serve as the indicators       16
    serve as the indispensable    9
    serve as the indispensible    20
    serve as the individual       6
    serve as the industrial       15
    serve as the industry         4
    serve as the info             17
    serve as the informal         10
    serve as the information      3
    serve as the informational    18
    serve as the infrastructure   5
    serve as the initial          1
    serve as the initiating       7
    serve as the initiation       13
    serve as the initiator        12
    serve as the injector         14
    serve as the inlet            19
    serve as the inner            11
    serve as the input            2

Numerical Ranking Index:

    serve as the 1    initial
    serve as the 2    input
    serve as the 3    information
    serve as the 4    industry
    serve as the 5    infrastructure
    serve as the 6    individual
    serve as the 7    initiating
    serve as the 8    indicator
    serve as the 9    indispensable
    serve as the 10   informal
    serve as the 11   inner
    serve as the 12   initiator
    serve as the 13   initiation
    serve as the 14   injector
    serve as the 15   industrial
    serve as the 16   indicators
    serve as the 17   info
    serve as the 18   informational
    serve as the 19   inlet
    serve as the 20   indispensible

I implemented these techniques in Java and, after spending several days optimizing code for memory management, was able to produce 10 output files: 5 created with the alphabetical ranking index and 5 created with the numerical ranking index. Each subset of 5 files contained one file of data for 1-grams, one for 2-grams, and so on through 5-grams. Finally, querying the database required the creation of a modified binary search program. Determining the rank of a word given a context, or the word at a specified rank given a context, required principles from the Katz back-off model. Essentially, this meant that if a word did not occur in the database with a specific context, the program would successively back off by chopping the first word off the context until the query was found. Mathematically, the conditional probability of a word occurring given a specified context, P(w_i | w_{i-n+1} ... w_{i-1}), was taken to equal the conditional probability of that word given the context without its first word, P(w_i | w_{i-n+2} ... w_{i-1}), whenever the former probability was 0 (here w_i denotes the word at position i, the position of the word being tested, and n is the length of the n-grams being used). My program neglects the k, d, and α constants present in the published version of this equation in order to maximize the efficiency of the search. Whenever my program had to back off from the current context, the rank being calculated was increased by the maximum rank of the context before backing off. With this program, words could be queried from Google's n-gram database stored on an external hard drive at a rate of approximately 200 milliseconds per word. In future experimentation, I would like to split the created n-gram files and index them to attempt to increase the efficiency of this search.
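Below is a compact sketch of this rank lookup with the simplified back-off. It models the alphabetical ranking index as a lexicographically sorted in-memory array in which each entry holds an n-gram, a tab character, and its rank; the real project binary-searched files on disk, the exact entry layout is an assumption, and the rank offset added when backing off is omitted for brevity:

    import java.util.*;

    // Illustrative sketch of the rank lookup with simplified back-off described above.
    public class RankLookup {

        private final String[] sortedEntries;   // e.g. "serve as the indicator\t8", sorted lexicographically

        public RankLookup(String[] sortedEntries) {
            this.sortedEntries = sortedEntries;
        }

        // Rank of `word` after `context`, backing off by dropping the first context
        // word whenever the full query is absent. Returns -1 if even the 1-gram is unknown.
        public int rank(List<String> context, String word) {
            List<String> ctx = new ArrayList<>(context);
            while (true) {
                Integer r = find(String.join(" ", ctx) + (ctx.isEmpty() ? "" : " ") + word);
                if (r != null) {
                    return r;        // (the real scheme also adds an offset when it backs off)
                }
                if (ctx.isEmpty()) {
                    return -1;       // word never observed at all -> treat as <UNK>
                }
                ctx.remove(0);       // back off: chop the first word of the context
            }
        }

        // Binary search over the sorted entries for an exact n-gram match.
        private Integer find(String ngram) {
            int lo = 0, hi = sortedEntries.length - 1;
            while (lo <= hi) {
                int mid = (lo + hi) >>> 1;
                String[] parts = sortedEntries[mid].split("\t");
                int cmp = parts[0].compareTo(ngram);
                if (cmp == 0) return Integer.parseInt(parts[1]);
                if (cmp < 0) lo = mid + 1; else hi = mid - 1;
            }
            return null;
        }

        public static void main(String[] args) {
            String[] index = {
                "serve as the indicator\t8",
                "serve as the initial\t1",
                "the indicator\t2"
            };
            Arrays.sort(index);
            RankLookup lookup = new RankLookup(index);
            System.out.println(lookup.rank(Arrays.asList("serve", "as", "the"), "indicator")); // 8
            System.out.println(lookup.rank(Arrays.asList("went", "past", "the"), "indicator")); // backs off to "the indicator" -> 2
        }
    }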


Tokenization of the Text

To be able to compress text efficiently, the text first had to be tokenized in a format consistent with the n-gram database's tokenization system. As the database was not well documented, many of the rules governing this tokenization had to be determined through trial and error. Several notable rules include the following:

- Words (including letters and numbers) and punctuation marks each receive their own token. Multiple occurrences of the same punctuation mark in succession also receive one token; however, differing punctuation marks in succession receive a token for each run of identical punctuation marks.

- Numbers with periods and commas inside them count as one token. Words may also have a period inside them provided that the character following the period is not whitespace or a punctuation mark.

- Numerical dates separated by two or more slashes constitute one token; in every other case the slash receives its own token. For example, 1/2/2003 is one token, while the fraction 1/2 is three tokens: '1', '/', and '2'.

- The hyphen separating hyphenated words is a token. Multiple hyphens in succession count as only one token.

- After end-of-sentence punctuation marks (. ? !), the database-specific tokens </S> (end of sentence) and <S> (start of sentence) are required.

- Any token that does not occur anywhere in Google's n-gram database is replaced with the database-specific token <UNK> (unknown word).

I created a class in Java called Tokenizer, which reads tokens from an input text file in the format described above. The class is efficient in that it is stream-based and offers a command to read one token from the input file at a time, rather than taking the common approach of loading the whole file into memory and then splitting it into tokens.
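A greatly simplified stream-based tokenizer in the spirit of that class is sketched below. It handles only the basic word/number and repeated-punctuation rules and omits the date, decimal, sentence-boundary (<S>, </S>), and <UNK> handling, so it is an illustration rather than the actual Tokenizer:

    import java.io.*;

    // Simplified sketch: reads one character at a time and returns one token per
    // call, without loading the whole file into memory.
    public class SimpleTokenizer implements Closeable {

        private final PushbackReader in;

        public SimpleTokenizer(Reader reader) {
            this.in = new PushbackReader(reader);
        }

        // Returns the next token, or null at end of input.
        public String nextToken() throws IOException {
            int c = in.read();
            while (c != -1 && Character.isWhitespace(c)) c = in.read();   // skip whitespace
            if (c == -1) return null;

            StringBuilder token = new StringBuilder();
            token.append((char) c);
            if (Character.isLetterOrDigit(c)) {
                // A word or number token: consume letters and digits.
                for (c = in.read(); c != -1 && Character.isLetterOrDigit(c); c = in.read()) {
                    token.append((char) c);
                }
            } else {
                // A run of the same punctuation mark collapses into a single token.
                int mark = c;
                for (c = in.read(); c == mark; c = in.read()) {
                    token.append((char) c);
                }
            }
            if (c != -1) in.unread(c);   // push back the first character of the next token
            return token.toString();
        }

        @Override
        public void close() throws IOException {
            in.close();
        }
    }

It could be driven with, for example, new SimpleTokenizer(new FileReader("input.txt")) and repeated calls to nextToken() until null is returned.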


Compressing the Text

Compression relied on many factors that I altered during the course of testing; however, there was an underlying design algorithm I created that proved to be effective in terms of compression ratio. The basic steps are the following:

1. Read input text token by token, and for each token return the ranking in an n-gram query with the preceding tokens as the context.

2. Encode these ranking numbers in a file. Initially, I implemented this step by converting numbers into 2 bytes, which can represent any value in the range 1-65,536 (2^8 × 2^8). Rankings above this upper bound were converted to the <UNK> token. However, I soon found that variable-byte encoding produced more efficient results for my type of compression. In variable-byte encoding, numbers are encoded in multiples of 7 bits but recorded as 8-bit bytes: each byte carries 7 bits of the number, and its 8th bit is 1 if another byte follows and 0 if it is the last. This way, variable-length numbers can be encoded without needing to specify each number's length in the file (see the sketch after this list).

3. For any unknown token, take the original word that produced it and append it to a string of unknown words delimited by spaces.

4. Compress the string of unknown words using a preexisting generic file compression algorithm (I used gzip because of its efficiency and simplicity). The theory behind using a generic compression algorithm for this step is that, since unknown tokens are rare to begin with (the n-gram database contains over 13 million unique words), there is likely some redundancy whenever more than one unknown token appears. For example, in a fictional story featuring a character with a highly unusual name, every occurrence of that name in the file will be replaced with an unknown token and added to the unknown string, which will then compress effectively through generic means because of this redundancy.

5. Apply an arithmetic coding algorithm to compress the encoded rankings in the file. This algorithm is similar to Huffman coding in that it is a form of variable-length entropy encoding; however, it works much more effectively than Huffman coding for symbols whose probabilities are far from powers of ½. It compresses the encoded ranking data well because, for a typical text file, these values are skewed heavily toward lower values. Two variants of this algorithm were tested: one that is adaptive and dynamically generates a table of symbol frequencies from scratch, and one that is also adaptive yet starts from a generic static table previously generated from the symbol frequencies in a short story.

6. The length of the compressed unknown string is calculated, and this number is stored using variable byte encoding.

7. The final compressed archive is assembled in the following order: length of the compressed unknown string expressed in variable byte encoding, compressed unknown string, rankings compressed in arithmetic coding. This process is fully reversible for decompression.
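The variable-byte encoding referred to in step 2 can be sketched as follows; the byte ordering (low-order bits first) and the class name are assumptions, not details taken from the paper:

    import java.io.*;

    // Minimal sketch of variable-byte encoding under the convention described in
    // step 2: each byte stores 7 bits of the number, and the high (8th) bit is set
    // when more bytes follow. No EOF handling; sketch only.
    public class VByte {

        // Writes `value` (a positive number, e.g. a ranking starting at 1) as 1..5 bytes.
        public static void write(int value, OutputStream out) throws IOException {
            while (value >= 0x80) {
                out.write((value & 0x7F) | 0x80);   // 7 payload bits, continuation bit set
                value >>>= 7;
            }
            out.write(value);                        // final byte: continuation bit clear
        }

        // Reads one variable-byte encoded number.
        public static int read(InputStream in) throws IOException {
            int value = 0, shift = 0, b;
            while (((b = in.read()) & 0x80) != 0) {
                value |= (b & 0x7F) << shift;
                shift += 7;
            }
            return value | (b << shift);
        }

        public static void main(String[] args) throws IOException {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            write(3, buf);        // small ranks fit in one byte
            write(5331, buf);     // larger ranks spill into a second byte
            InputStream in = new ByteArrayInputStream(buf.toByteArray());
            System.out.println(read(in) + " " + read(in));   // prints: 3 5331
        }
    }

With this scheme a rank of 3 occupies a single byte while a rank of 5,331 occupies two, so the heavily skewed rank values produced by step 1 stay compact even before arithmetic coding is applied.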

I constructed an implementation of this general method in Java and then proceeded to analyze compression ratios for varying strings of English text. I had to be careful in selecting samples of English text, avoiding text written before 2006, the year the Google N-Gram database was published, since such text might already appear in the corpus and bias the data. For instance, the string "To be or not to be - that is the question" would compress well using the above method; had Shakespeare not written Hamlet, this might not be the case. Ultimately, I selected 12 samples of English text of various lengths, divided into four categories: Technical, News, Generic, and Text Message. Technical writing comprised an article I read while researching for my project that had to do with Google's n-gram data, along with my abstract and hypothesis for this project. News included a recent article about the proliferation of data in The Economist, a New York Times article detailing acceleration defects in some current Toyota cars, and a headline from my local paper, The Carmel Pine Cone. Generic writing dealt with three relatively small paragraphs of text about the character Alice in Alice in Wonderland, SMS texting, and the anatomy of the International System of Units (SI units). Lastly, text messages from three different friends were extracted from my mobile phone: one with a formal writing style, one with a semi-formal writing style, and one with an abbreviated informal writing style. For each sample of text, I compressed and decompressed the file's contents using my algorithm and recorded the following variables:

- Input Size (Bytes) – the size of the original text file before compression takes place

- Time (ms) – the amount of time it took to compress the file's contents

- Token Count – the number of tokens within the uncompressed text file

- Avg. Ranking – the average rank value assigned to the tokens

- Unknown Tokens – the number of tokens that were not present in the n-gram database

- Unknown Compress. Ratio – the compression ratio achieved by applying the gzip algorithm to the unknown data

- Zip Compress. Ratio – the compression ratio achieved by compressing the original file with zip compression (a commonly used algorithm for generic file compression) on the highest compression setting

- Rar Compress. Ratio – the compression ratio achieved by compressing the original file with rar compression (a high-powered algorithm commonly used for generic file compression) on the highest compression setting

- Output Size 1 – the size of the compressed file after compression with the adaptive arithmetic coder that starts with an empty frequency table

- Output Size 2 – the size of the compressed file after compression with the adaptive arithmetic coder that starts with a previously generated frequency table from a large sample of text

- Compress. Ratio 1 – the compression ratio achieved by dividing Output Size 1 by the input size

- Compress. Ratio 2 – the compression ratio achieved by dividing Output Size 2 by the input size

Following is a graph showing these variables’ experimentally determined values for each of the 12 samples of English text being tested, sorted by decreasing initial size.

[Chart: experimentally determined values of the variables above for the 12 text samples; not reproduced in this transcript.]

Analysis

There are some definite trends and inferences that can be extracted from the previous chart. For example, zip and rar, two commonly used general-purpose compression formats, show increasing compression ratios as the list goes down and the file size decreases. Rar compression ratios are also consistently lower than zip compression ratios for the files in my sample of English text. Following is a chart that demonstrates the exponential increase in compression ratio as the text file size decreases for rar files.

[Chart: RAR Compression Ratio with Small Text Files. X-axis: Input File Size (Bytes), 0 to 16,000. Y-axis: Compression Ratio, 0% to 350%.]

For larger files, rar compression works well as it can build an effective statistical model of the uncompressed file's properties; however, short file compressibility is minimal.

Perhaps most notably, Compress. Ratio 2 was less than or equal to Compress. Ratio 1 for every file compressed. This is demonstrated in the following graph in which the pink line representing Compression Ratio 2 is consistently below the blue line representing Compression Ratio 1. Also notice that the gap between these two lines grows wider as the text file gets smaller. That means that for large files, the ratio achieved by loading the adaptive coder with an initial database of frequencies (Compress. Ratio 2) is not significantly different from the ratio achieved when the adaptive coder starts with an empty frequency table (Compress. Ratio 1).


[Chart: Compress. Ratio 1 vs. Compress. Ratio 2. X-axis: Input File Size (Bytes), 0 to 16,000. Y-axis: Compression Ratio, 0% to 60%. Series: Compress. Ratio 1, Compress. Ratio 2.]

Here the most significant application of my method arises. While the compression ratio actually rises above 100% for rar and zip when dealing with files of about 200 bytes or less, my method is able to keep compression ratios below 50%, even when the input text is only 31 bytes long. Following is a chart that shows Compress. Ratio 2 graphed alongside the RAR compression ratio for input file sizes under 300 bytes. Notice the significant difference between these ratios and how low Compress. Ratio 2 is for files above 100 bytes. Compression ratios this low for such small samples of text are rarely seen in data compression.

[Chart: Compression Ratios for Small Text Files. X-axis: Input File Size (Bytes), 0 to 300. Y-axis: Compression Ratio, 0% to 350%. Series: RAR Compress. Ratio, Compress. Ratio 2.]


Conclusion

Overall, my project yielded surprisingly good results when dealing with compression of small files. Typical compression algorithms have high compression ratios for these types of files due to the limited amount of redundancy present in their contents; my program instead analyzes and exploits redundancy among the relative probabilities (rankings) of the words of the initial file given the context of the preceding words. While compressing files that are already small may seem a needless operation, companies and enterprises dealing with millions or more of these types of files could benefit. For example, potential markets for this compression scheme include cell phone texting, Twitter posts, and Facebook updates: anything dealing with a small amount of text that needs to be transmitted to a user in real time yet is stored for later access. Since the Google n-gram database is so huge, a company implementing a scheme such as this would store the database on its servers' hard drives and simply compress and decompress the messages that are sent to the server. I would imagine that a corporation such as a cell phone company would want to produce its own n-gram corpus specialized for the types of messages it receives.

This compression scheme is novel, and since not much work has been conducted in the field of short text compression, there is room for improvement. For example, a question that arises in this scheme's application is whether the required n-gram querying takes too long to calculate each rank, and whether this computational burden would tax the server too much to make such an algorithm practical and useful. With many optimizations, my program still took approximately 1/5 of a second to calculate each rank; however, I believe there are further optimizations to be made. In my project, I was accessing the n-gram database from an external hard drive with a slow read and write speed, but I would imagine that a company would use more advanced hardware such as a high-speed internal hard drive, or possibly even a solid-state drive (which is fast for random access). If RAM capacities continue to grow, the database could potentially be stored entirely in main memory. Despite the impressive compression ratio, I believe there is still much room for optimization there as well. For instance, since my program is word-based, single-letter words and punctuation are typically hard to compress well. Perhaps this could be dealt with by applying generic compression algorithms to these small tokens, where there would inevitably be a high amount of redundancy.

There are many areas to expand and improve upon in this experiment. In fact, as the project progressed, I ended up having more questions than answers, and more ideas than could be contained in one science fair display. N-grams are particularly applicable and interesting as they are used in a wide variety of applications such as speech recognition, plagiarism detection, machine learning, and translation. Based on n-grams, my project could be expanded to compress other types of data, such as DNA sequences, given an applicable corpus. I also considered topics that relate to the concepts in my project yet do not deal with compression. For example, I believe one possible application is authorship detection: creating a digital writing signature based on how often certain ranks appear in the compressed versions of a person's works.
I even thought of my project as a potential metric of writing originality. In school, this could mean that the higher a student's essay's compression ratio, the more closely that essay followed predictable trends. One amusing thought I had was that this would lead our society to be graded for its work by compression ratio. If that were the case, then my abstract for this project would likely not be accepted, as it compressed to 18.63% of its original size.

Overall, I believe that, implemented correctly, the compression scheme introduced in my project has the potential for widespread application. For companies managing significant quantities of data, this novel form of compression could be used effectively to save energy and resources, and for massive markets such as text messaging or Twitter, a small improvement like this could spare the construction of new data centers or server farms, which are difficult to maintain and expensive. And with approximately 4.1 billion text messages sent daily (an estimate from last year), the time and opportunity are right to implement such a scheme. A small change in compression can make a big difference for the world and for the consequences of an overabundance of information.

References:

Ron Begleiter, Ran El-Yaniv, et al. "On Prediction Using Variable Order Markov Models." <http://www.jair.org/media/1491/live-1491-2335-jair.pdf>

Md. Rafiquel Islam and S. A. Ahson Rajon. "An Enhanced Short Text Compression Scheme for Smart Devices." <http://www.academypublisher.com/ojs/index.php/jcp/article/view/05014958/1345>

C. E. Shannon. "Prediction and Entropy of Printed English." <http://languagelog.ldc.upenn.edu/myl/Shannon1950.pdf>

"The Data Deluge." The Economist. <http://www.economist.com/surveys/displaystory.cfm?story_id=15557443>

Alex Franz and Thorsten Brants. "All Our N-Gram Are Belong to You." <http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html>