txt compression project.ppt
TRANSCRIPT
-
8/21/2019 TXT COMPRESSION PROJECT.ppt
1/23
TEXT COMPRESSION
SUBMITTED BY
MUKESH KUMAR TANWAR 11104EN067
VIKASH KUMAR 11104EN068
PARAS CHANDOLIA 11104EN070
-
8/21/2019 TXT COMPRESSION PROJECT.ppt
2/23
ABSTRACT
TextCompressionisthescienceandartofrepresentinginfoacompactform.Therearealotoftextcompressionalgorithareavailable tocompress filesofdifferent formats. In this
haveusedHuffmanalgorithmforthetextcompression.
-
8/21/2019 TXT COMPRESSION PROJECT.ppt
3/23
INTRODUCTIONTextcompressionreferstoreducingtheamountofspacenstore
the
text.
There
are
two
types
ofcompression-
LOSSLESS- Data compression can be lossless only if it is pexactlyreconstructtheoriginaldatafromthecompressedversilossless technique is used when the original data of a sourimportant that we cannot afford to lose any details. Examplesource data are medical images, text, some computer executable f
LOSSY- Another method of compression algorithms is called losalgorithms irreversibly remove some parts of data and approximation of the original data can be reconstructed. Datmultimedia images, video and audio are more easily compressecompression techniques
-
8/21/2019 TXT COMPRESSION PROJECT.ppt
4/23
Huffman codingTheHuffmancodingtransformstheoriginalcodeusedforthecharactersoftcodeon8bits).Codingthetext isjustreplacingeachsymbolby itsnewcoHuffman algorithm uses the notion of prefix code. A prefix code is a scontainingnowordthatisprefixofanotherwordoftheset.Theadvantageo
isthatthedecoding is immediate. Huffman Code assigns shorter encodingswith a high frequency. Elements with the highest frequency get assigned thlength code. The key to decompressing Huffman code is a Huffman tree.
Thestepsrequiredwhilecompressingthetextareasfollows- Countsthecharacterfrequencies. BuildstheHuffmancodingtree. Buildscharactercode-wordsfromthecodingtree. Storesthecodingtreeinthecompressedfile. Encodesthecharactersinthecompressedfile. CompletefunctionfortheHuffmanencoding. Rebuildsthetreefromheaderofthecompressedfile. Recoverstheoriginaltext. CompletefunctionfortheHuffmandecoding.
-
8/21/2019 TXT COMPRESSION PROJECT.ppt
5/23
ENCODING Thecodingalgorithmiscomposedofthreesteps:countofcharafrequencies,constructionoftheprefixcode,encodingofthetext
stepconsistsincountingthenumberofoccurrencesofeachchartheoriginaltext.
Algorithmtocountthefrequenciesofcharaters-intCOUNT(FILECODE*code) { intc;
while((c=getc(fin))!=EOF)code[c].freq++;code[END].freq=1; Thesecondstepofthealgorithmbuildsthetreeofaprefixcodecharacterfrequency freq(a)ofeachcharacterainthefollowingway
-
8/21/2019 TXT COMPRESSION PROJECT.ppt
6/23
Create a root-tree t for each character a with weight (t) =freq(a),
Repeat
Select the two least frequent trees t1 and t2,
Create a new tree t3 with left son t1, right son t2 and weight(t3)=weight(t1)+weight
Until it remains only one tree.
The tree is constructed by the algorithm, the only tree remaining at the end of thethe coding tree.
In the implementation of the above scheme, a priority queue is used to identifyfrequent trees. After the tree is built, it is possible to recover the code-words acharacters by a simple depth-first-search of the tree. In the third step, the oencoded. Since the code depends on the original text, in order to be able tcompressed text, the coding tree must be stored with the compressed text. Again t
a depth-first-search of the tree. Each time an internal node is encountered a 0 is pra leaf is encountered a 1 is produced followed by the ASCII code of the corresponon 9 bits (so that the end marker can be equal to 256 if all the ASCII characters aoriginal text). Bits are written 8 by 8 in the compressed file using both a buffer counter (bits2go). Each time the counter is equal to 8, the corresponding byte is writotal number of bits is not necessarily a multiple of 8, some bits may have to be prend. All the preceding steps are composed to give a complete encoding program.
-
8/21/2019 TXT COMPRESSION PROJECT.ppt
7/23
Building the Huffman coding tre
int BUILD_HEAP(CODE *code, PRIORITY_QUEUE queue) {
int i, size; NODE node; size=0;
for (i=0; i
-
8/21/2019 TXT COMPRESSION PROJECT.ppt
8/23
BUILDING THE CODING TRE NODE BUILD_TREE(PRIORITY_QUEUE queue, int size)
{
NODE root, leftnode, rightnode; while (size > 1) {
leftnode=HEAP_EXTRACT_MIN(queue);
rightnode=HEAP_EXTRACT_MIN(queue);
root=ALLOCATE_NODE();
PUT_WEIGHT(root, GET_WEIGHT(leftnode)+GET_WEIGHT(rightno
PUT_LEFT(root, leftnode); PUT_RIGHT(root, rightnode);
HEAP_INSERT(queue, root);
size--;
}
return(root);
-
8/21/2019 TXT COMPRESSION PROJECT.ppt
9/23
Builds the character codes by a dfirst-search of the coding tre
void BUILD_CODE(NODE root, CODE *code, int length) {
static char temp[ASIZE+1];
int c;
if (GET_LEFT(root) != NULL) {
temp[length]=0;
BUILD_CODE(GET_LEFT(root), code, length+1);
temp[length]=1;
BUILD_CODE(GET_RIGHT(root), code, length+1);
}
-
8/21/2019 TXT COMPRESSION PROJECT.ppt
10/23
else {
c=GET_LABEL(root);
code[c].codeword=(char *)malloc(length); code[c].lg=length;
strncpy(code[c].codeword, temp, length);
}
}
-
8/21/2019 TXT COMPRESSION PROJECT.ppt
11/23
MEMORIZES THE CODING TRETHE COMPRESSED FILE
void CODE_TREE(FILE *fout, NODE root){
if (GET_LEFT(root) != NULL){
SEND_BIT(fout, 0);
CODE_TREE(fout, GET_LEFT(root)); CODE_TREE(fout, GET_RIGH}
else { SEND_BIT(fout, 1); ASCII2BITS(fout, GET_LABEL(root));
}}
-
8/21/2019 TXT COMPRESSION PROJECT.ppt
12/23
Encodes the characters in the compressed file
voidCODE_TEXT(FILE *fin, FILE *fout)
{
int c, i;rewind(fin);
while ((c=getc(fin)) != EOF)
for (i=0; i < code[c].lg; i++)
SEND_BIT(fout, code[c].codeword[i]);
for (i=0; i < code[END].lg; i++)
SEND_BIT(fout, code[END].codeword[i]);
}
-
8/21/2019 TXT COMPRESSION PROJECT.ppt
13/23
Sends one bit in the compressed filevoid SEND_BIT(FILE *fout, int bit)
{
buffer>>=1;
if (bit) buffer|=0x80;
bits2go++;if (bits2go == 8)
{
putc(buffer, fout);
bits2go=0;
}
} Encodes n on 9 bits
Int ASCII2BITS(FILE *fout, int n)
{
int i;
for (i=8; i>= 0; i--) SEND_BIT(fout, (n>>i)&1);
}
-
8/21/2019 TXT COMPRESSION PROJECT.ppt
14/23
Outputs a final byte if necessary
Void SEND_LAST_BITS(FILE *fout)
{
if (bits2go) putc(buffer>>(8-bits2go), fout);
} Initializes the array code
Void INIT(CODE *code)
{
int i;
for (i=0; i < ASIZE; i++) {
code[i].freq=code[i].lg=0;
code[i].codeword=NULL;
}
}
-
8/21/2019 TXT COMPRESSION PROJECT.ppt
15/23
Complete function of Huffman cod#defineASIZE255
#defineEND(ASIZE+1)/*codeofEOF*/
typedefstruct
{
intfreq,lg;char*codeword;
}CODE;intbuffer;intbits2go;voidCODING(char*fichin,char*fichout){
FILE*fin,*fout;intsize;
PRIORITY_QUEUEqueue;NODEroot;CODEcode[ASIZE+1];if((fin=fopen(fichin,"r"))==NULL)exit(0);
if((fout=fopen(fichout,"w"))==NULL)exit(0);INIT(code);
COUNT(fin,code);
-
8/21/2019 TXT COMPRESSION PROJECT.ppt
16/23
size=BUILD_HEAP(code,queue);root=BUILD_TREE(queue,siz
BUILD_CODE(root,code,0);buffer=bits2go=0;CODE_TREE(fo
CODE_TEXT(fin,fout);SEND_LAST_BITS(fout);
fclose(fin);
fclose(fout);
}
Example:
y
=
CAGATAAGAGAA
Length:12*8=96bits(ifan8-bitcodeisassumed)
-
8/21/2019 TXT COMPRESSION PROJECT.ppt
17/23
character frequencies:
The required coding tree:Tree code:
0001binary(END,9)01binary(C,9)1binary(T,9)1binary(G,
A,9)
Length: 54
Text code: 0010 1 01 1 0011 1 1 01 1 01 1 1 000Length: 24
Total length: 78
-
8/21/2019 TXT COMPRESSION PROJECT.ppt
18/23
DECODING
Decoding a file containing a text compressed by Huffman algorithm is a mere program
First the coding tree is rebuild by the algorithm1 given below. It uses a function to dec
written on 9 bits. Then, the uncompressed text is recovered by parsing the compresse
text with the coding tree. The process begins at the root of the coding tree, andbranch when a 0 is read or a right branch when a 1 is read. When a leaf is enccorresponding character (in fact the original codeword of it) is produced and the resumes at the root of the tree. The parsing ends when the codeword of the end mThe decoding of the text is presented in Figure Algorithm3. Again in this algorithm thwith the use of a buffer
(buffer)
and a counter (bits in stock). A byte is read in the compressed file only if the counter izero Algorithm4.
The complete decoding program is given in Algorithm5. It calls the preceding function
running time of the decoding program is linear in the sum of the sizes of the texts it m
-
8/21/2019 TXT COMPRESSION PROJECT.ppt
19/23
Algorithm1:Rebuildsthetreereadfrom compressed file
voidREBUILD_TREE(FILE *fin, NODE root)
{
NODE leftnode, rightnode;
if (READ_BIT(fin) == 1) { /* leaf */
PUT_LEFT(root, NULL);
PUT_RIGHT(root, NULL);
PUT_LABEL(root, BITS2ASCII(fin));
}
else {
leftnode=ALLOCATE_NODE();
PUT_LEFT(root, leftnode);REBUILD_TREE(fin, leftnode);
rightnode=ALLOCATE_NODE();
PUT_RIGHT(root, rightnode);
REBUILD_TREE(fin, rightnode);
}
}
Algorithm2: Reads the next 9 bits in the compressed file and
-
8/21/2019 TXT COMPRESSION PROJECT.ppt
20/23
Algorithm2: Reads the next 9 bits in the compressed file and the corresponding value
Int BITS2ASCII(FILE *fin)
{
int i, value;value=0;
for (i=8; i >= 0; i--) value=value
-
8/21/2019 TXT COMPRESSION PROJECT.ppt
21/23
( ) {
if (GET_LEFT(node) == NULL)
if (GET_LABEL(node) != END)
{
putc(GET_LABEL(node), fout);
node=root;
}else break;
else if (READ_BIT(fin) == 1) node=GET_RIGHT(node);
else node=GET_LEFT(node);
}
}
Algorithm4: Reads the next bit from the compressed file
Int READ_BIT(FILE *fin){
int bit;
if (bits_in_stock == 0) {
buffer=getc(fin);
bits_in_stock=8;
}
bit b ffer&1
-
8/21/2019 TXT COMPRESSION PROJECT.ppt
22/23
bit=buffer&1;
buffer>>=1;
bits_in_stock--;
return(bit);
}
Algorithm5: Complete function for decoding
int buffer;
int bits_in_stock;
void DECODING(char *fichin, char *fichout)
{
FILE *fin, *fout;
NODE root;
-
8/21/2019 TXT COMPRESSION PROJECT.ppt
23/23
if ((fin=fopen(fichin, "r")) == NULL) exit(0);
if ((fout=fopen(fichout, "w")) == NULL) exit(0);
INIT(code);
buffer=bits_in_stock=0;root=ALLOCATE_NODE();
REBUILD_TREE(fin, root);
UNCOMPRESS(fin, fout, root);
fclose(fin);fclose(fout);
}