txt compression project.ppt

Upload: ajeetrocks

Post on 07-Aug-2018

232 views

Category:

Documents


0 download

TRANSCRIPT

  • 8/21/2019 TXT COMPRESSION PROJECT.ppt

    1/23

    TEXT COMPRESSION

    SUBMITTED BY

    MUKESH KUMAR TANWAR 11104EN067

    VIKASH KUMAR 11104EN068

    PARAS CHANDOLIA 11104EN070

  • 8/21/2019 TXT COMPRESSION PROJECT.ppt

    2/23

    ABSTRACT

    TextCompressionisthescienceandartofrepresentinginfoacompactform.Therearealotoftextcompressionalgorithareavailable tocompress filesofdifferent formats. In this

    haveusedHuffmanalgorithmforthetextcompression.

  • 8/21/2019 TXT COMPRESSION PROJECT.ppt

    3/23

    INTRODUCTIONTextcompressionreferstoreducingtheamountofspacenstore

    the

    text.

    There

    are

    two

    types

    ofcompression-

    LOSSLESS- Data compression can be lossless only if it is pexactlyreconstructtheoriginaldatafromthecompressedversilossless technique is used when the original data of a sourimportant that we cannot afford to lose any details. Examplesource data are medical images, text, some computer executable f

    LOSSY- Another method of compression algorithms is called losalgorithms irreversibly remove some parts of data and approximation of the original data can be reconstructed. Datmultimedia images, video and audio are more easily compressecompression techniques

  • 8/21/2019 TXT COMPRESSION PROJECT.ppt

    4/23

    Huffman codingTheHuffmancodingtransformstheoriginalcodeusedforthecharactersoftcodeon8bits).Codingthetext isjustreplacingeachsymbolby itsnewcoHuffman algorithm uses the notion of prefix code. A prefix code is a scontainingnowordthatisprefixofanotherwordoftheset.Theadvantageo

    isthatthedecoding is immediate. Huffman Code assigns shorter encodingswith a high frequency. Elements with the highest frequency get assigned thlength code. The key to decompressing Huffman code is a Huffman tree.

    Thestepsrequiredwhilecompressingthetextareasfollows- Countsthecharacterfrequencies. BuildstheHuffmancodingtree. Buildscharactercode-wordsfromthecodingtree. Storesthecodingtreeinthecompressedfile. Encodesthecharactersinthecompressedfile. CompletefunctionfortheHuffmanencoding. Rebuildsthetreefromheaderofthecompressedfile. Recoverstheoriginaltext. CompletefunctionfortheHuffmandecoding.

  • 8/21/2019 TXT COMPRESSION PROJECT.ppt

    5/23

    ENCODING Thecodingalgorithmiscomposedofthreesteps:countofcharafrequencies,constructionoftheprefixcode,encodingofthetext

    stepconsistsincountingthenumberofoccurrencesofeachchartheoriginaltext.

    Algorithmtocountthefrequenciesofcharaters-intCOUNT(FILECODE*code) { intc;

    while((c=getc(fin))!=EOF)code[c].freq++;code[END].freq=1; Thesecondstepofthealgorithmbuildsthetreeofaprefixcodecharacterfrequency freq(a)ofeachcharacterainthefollowingway

  • 8/21/2019 TXT COMPRESSION PROJECT.ppt

    6/23

    Create a root-tree t for each character a with weight (t) =freq(a),

    Repeat

    Select the two least frequent trees t1 and t2,

    Create a new tree t3 with left son t1, right son t2 and weight(t3)=weight(t1)+weight

    Until it remains only one tree.

    The tree is constructed by the algorithm, the only tree remaining at the end of thethe coding tree.

    In the implementation of the above scheme, a priority queue is used to identifyfrequent trees. After the tree is built, it is possible to recover the code-words acharacters by a simple depth-first-search of the tree. In the third step, the oencoded. Since the code depends on the original text, in order to be able tcompressed text, the coding tree must be stored with the compressed text. Again t

    a depth-first-search of the tree. Each time an internal node is encountered a 0 is pra leaf is encountered a 1 is produced followed by the ASCII code of the corresponon 9 bits (so that the end marker can be equal to 256 if all the ASCII characters aoriginal text). Bits are written 8 by 8 in the compressed file using both a buffer counter (bits2go). Each time the counter is equal to 8, the corresponding byte is writotal number of bits is not necessarily a multiple of 8, some bits may have to be prend. All the preceding steps are composed to give a complete encoding program.

  • 8/21/2019 TXT COMPRESSION PROJECT.ppt

    7/23

    Building the Huffman coding tre

    int BUILD_HEAP(CODE *code, PRIORITY_QUEUE queue) {

    int i, size; NODE node; size=0;

    for (i=0; i

  • 8/21/2019 TXT COMPRESSION PROJECT.ppt

    8/23

    BUILDING THE CODING TRE NODE BUILD_TREE(PRIORITY_QUEUE queue, int size)

    {

    NODE root, leftnode, rightnode; while (size > 1) {

    leftnode=HEAP_EXTRACT_MIN(queue);

    rightnode=HEAP_EXTRACT_MIN(queue);

    root=ALLOCATE_NODE();

    PUT_WEIGHT(root, GET_WEIGHT(leftnode)+GET_WEIGHT(rightno

    PUT_LEFT(root, leftnode); PUT_RIGHT(root, rightnode);

    HEAP_INSERT(queue, root);

    size--;

    }

    return(root);

  • 8/21/2019 TXT COMPRESSION PROJECT.ppt

    9/23

    Builds the character codes by a dfirst-search of the coding tre

    void BUILD_CODE(NODE root, CODE *code, int length) {

    static char temp[ASIZE+1];

    int c;

    if (GET_LEFT(root) != NULL) {

    temp[length]=0;

    BUILD_CODE(GET_LEFT(root), code, length+1);

    temp[length]=1;

    BUILD_CODE(GET_RIGHT(root), code, length+1);

    }

  • 8/21/2019 TXT COMPRESSION PROJECT.ppt

    10/23

    else {

    c=GET_LABEL(root);

    code[c].codeword=(char *)malloc(length); code[c].lg=length;

    strncpy(code[c].codeword, temp, length);

    }

    }

  • 8/21/2019 TXT COMPRESSION PROJECT.ppt

    11/23

    MEMORIZES THE CODING TRETHE COMPRESSED FILE

    void CODE_TREE(FILE *fout, NODE root){

    if (GET_LEFT(root) != NULL){

    SEND_BIT(fout, 0);

    CODE_TREE(fout, GET_LEFT(root)); CODE_TREE(fout, GET_RIGH}

    else { SEND_BIT(fout, 1); ASCII2BITS(fout, GET_LABEL(root));

    }}

  • 8/21/2019 TXT COMPRESSION PROJECT.ppt

    12/23

    Encodes the characters in the compressed file

    voidCODE_TEXT(FILE *fin, FILE *fout)

    {

    int c, i;rewind(fin);

    while ((c=getc(fin)) != EOF)

    for (i=0; i < code[c].lg; i++)

    SEND_BIT(fout, code[c].codeword[i]);

    for (i=0; i < code[END].lg; i++)

    SEND_BIT(fout, code[END].codeword[i]);

    }

  • 8/21/2019 TXT COMPRESSION PROJECT.ppt

    13/23

    Sends one bit in the compressed filevoid SEND_BIT(FILE *fout, int bit)

    {

    buffer>>=1;

    if (bit) buffer|=0x80;

    bits2go++;if (bits2go == 8)

    {

    putc(buffer, fout);

    bits2go=0;

    }

    } Encodes n on 9 bits

    Int ASCII2BITS(FILE *fout, int n)

    {

    int i;

    for (i=8; i>= 0; i--) SEND_BIT(fout, (n>>i)&1);

    }

  • 8/21/2019 TXT COMPRESSION PROJECT.ppt

    14/23

    Outputs a final byte if necessary

    Void SEND_LAST_BITS(FILE *fout)

    {

    if (bits2go) putc(buffer>>(8-bits2go), fout);

    } Initializes the array code

    Void INIT(CODE *code)

    {

    int i;

    for (i=0; i < ASIZE; i++) {

    code[i].freq=code[i].lg=0;

    code[i].codeword=NULL;

    }

    }

  • 8/21/2019 TXT COMPRESSION PROJECT.ppt

    15/23

    Complete function of Huffman cod#defineASIZE255

    #defineEND(ASIZE+1)/*codeofEOF*/

    typedefstruct

    {

    intfreq,lg;char*codeword;

    }CODE;intbuffer;intbits2go;voidCODING(char*fichin,char*fichout){

    FILE*fin,*fout;intsize;

    PRIORITY_QUEUEqueue;NODEroot;CODEcode[ASIZE+1];if((fin=fopen(fichin,"r"))==NULL)exit(0);

    if((fout=fopen(fichout,"w"))==NULL)exit(0);INIT(code);

    COUNT(fin,code);

  • 8/21/2019 TXT COMPRESSION PROJECT.ppt

    16/23

    size=BUILD_HEAP(code,queue);root=BUILD_TREE(queue,siz

    BUILD_CODE(root,code,0);buffer=bits2go=0;CODE_TREE(fo

    CODE_TEXT(fin,fout);SEND_LAST_BITS(fout);

    fclose(fin);

    fclose(fout);

    }

    Example:

    y

    =

    CAGATAAGAGAA

    Length:12*8=96bits(ifan8-bitcodeisassumed)

  • 8/21/2019 TXT COMPRESSION PROJECT.ppt

    17/23

    character frequencies:

    The required coding tree:Tree code:

    0001binary(END,9)01binary(C,9)1binary(T,9)1binary(G,

    A,9)

    Length: 54

    Text code: 0010 1 01 1 0011 1 1 01 1 01 1 1 000Length: 24

    Total length: 78

  • 8/21/2019 TXT COMPRESSION PROJECT.ppt

    18/23

    DECODING

    Decoding a file containing a text compressed by Huffman algorithm is a mere program

    First the coding tree is rebuild by the algorithm1 given below. It uses a function to dec

    written on 9 bits. Then, the uncompressed text is recovered by parsing the compresse

    text with the coding tree. The process begins at the root of the coding tree, andbranch when a 0 is read or a right branch when a 1 is read. When a leaf is enccorresponding character (in fact the original codeword of it) is produced and the resumes at the root of the tree. The parsing ends when the codeword of the end mThe decoding of the text is presented in Figure Algorithm3. Again in this algorithm thwith the use of a buffer

    (buffer)

    and a counter (bits in stock). A byte is read in the compressed file only if the counter izero Algorithm4.

    The complete decoding program is given in Algorithm5. It calls the preceding function

    running time of the decoding program is linear in the sum of the sizes of the texts it m

  • 8/21/2019 TXT COMPRESSION PROJECT.ppt

    19/23

    Algorithm1:Rebuildsthetreereadfrom compressed file

    voidREBUILD_TREE(FILE *fin, NODE root)

    {

    NODE leftnode, rightnode;

    if (READ_BIT(fin) == 1) { /* leaf */

    PUT_LEFT(root, NULL);

    PUT_RIGHT(root, NULL);

    PUT_LABEL(root, BITS2ASCII(fin));

    }

    else {

    leftnode=ALLOCATE_NODE();

    PUT_LEFT(root, leftnode);REBUILD_TREE(fin, leftnode);

    rightnode=ALLOCATE_NODE();

    PUT_RIGHT(root, rightnode);

    REBUILD_TREE(fin, rightnode);

    }

    }

    Algorithm2: Reads the next 9 bits in the compressed file and

  • 8/21/2019 TXT COMPRESSION PROJECT.ppt

    20/23

    Algorithm2: Reads the next 9 bits in the compressed file and the corresponding value

    Int BITS2ASCII(FILE *fin)

    {

    int i, value;value=0;

    for (i=8; i >= 0; i--) value=value

  • 8/21/2019 TXT COMPRESSION PROJECT.ppt

    21/23

    ( ) {

    if (GET_LEFT(node) == NULL)

    if (GET_LABEL(node) != END)

    {

    putc(GET_LABEL(node), fout);

    node=root;

    }else break;

    else if (READ_BIT(fin) == 1) node=GET_RIGHT(node);

    else node=GET_LEFT(node);

    }

    }

    Algorithm4: Reads the next bit from the compressed file

    Int READ_BIT(FILE *fin){

    int bit;

    if (bits_in_stock == 0) {

    buffer=getc(fin);

    bits_in_stock=8;

    }

    bit b ffer&1

  • 8/21/2019 TXT COMPRESSION PROJECT.ppt

    22/23

    bit=buffer&1;

    buffer>>=1;

    bits_in_stock--;

    return(bit);

    }

    Algorithm5: Complete function for decoding

    int buffer;

    int bits_in_stock;

    void DECODING(char *fichin, char *fichout)

    {

    FILE *fin, *fout;

    NODE root;

  • 8/21/2019 TXT COMPRESSION PROJECT.ppt

    23/23

    if ((fin=fopen(fichin, "r")) == NULL) exit(0);

    if ((fout=fopen(fichout, "w")) == NULL) exit(0);

    INIT(code);

    buffer=bits_in_stock=0;root=ALLOCATE_NODE();

    REBUILD_TREE(fin, root);

    UNCOMPRESS(fin, fout, root);

    fclose(fin);fclose(fout);

    }