SCALING OUT FOR EXTREME SCALE CORPUS DATA
MATTHEW COOLE, PAUL RAYSON & JOHN MARIANI

INTRODUCTION

Corpora are large collections of texts. Corpus linguists use specialist queries to analyse language, e.g. concordance lines, collocations etc.

    according to the Daily Labor Report of Sept. 19, 1961, defends th
    ng dissents from the majority report of the Joint Economic Commite
    ement. In the above mentioned report of the Notre Dame Chapter of
    come. SWELLING TRAFFIC. A new report on the earning records of tol
    nkless task. According to one report, however, Mr Hammarskjold was

    *examples from the Brown Corpus [1]
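Concordance lines like those above are the output of a key word in context (KWIC) search. As a rough illustration (not the authors' tooling), a minimal in-memory concordancer can be sketched in Python, assuming naive whitespace tokenisation; the corpus file name is hypothetical:

    def kwic(tokens, keyword, width=40):
        """Yield fixed-width concordance lines for every hit of keyword."""
        text = " ".join(tokens)
        # Character offset of each token in the joined text.
        offsets, pos = [], 0
        for tok in tokens:
            offsets.append(pos)
            pos += len(tok) + 1
        for tok, off in zip(tokens, offsets):
            # Whitespace tokens only, so "report," will not match "report".
            if tok.lower() == keyword.lower():
                left = text[max(0, off - width):off]
                right = text[off + len(tok):off + len(tok) + width]
                yield f"{left:>{width}} {tok} {right:<{width}}"

    tokens = open("brown.txt").read().split()   # hypothetical corpus file
    for line in kwic(tokens, "report"):
        print(line)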

INTRODUCTION

Corpora have grown from millions to billions of words in recent years:

    Brown Corpus (1961)    ~1 million words
    BNC (1994) [2]         ~100 million words
    Hansard (2005) [3]     ~2 billion words
    EEBO (2015+) [4]       ~4 billion words

Simple concordancers, e.g. AntConc [5], cannot handle this scale. How can distributed IR and DBMSs help solve this problem?

BACKGROUND

Meaningful concordance analysis requires 30+ lines. Smaller corpora are fine for common words but not for uncommon words. Zipf's law [6] applied to corpus linguistics tells us that 50% of the distinct words in a corpus will occur only once.

BACKGROUND

Many corpus tools exist, e.g. AntConc, WordSmith [7], CQPWeb [8], Wmatrix [9] etc. Few can handle corpora of extreme scale. Scaling out has been utilised for IR systems; how effective is it for storing and querying corpus data in distributed DBMSs?
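That hapax legomenon figure is easy to check on any corpus. A minimal sketch, assuming a plain-text corpus and naive whitespace tokenisation (the file name is hypothetical):

    # Count how many word types occur exactly once (hapax legomena);
    # Zipf's law predicts roughly half of all types in a corpus.
    from collections import Counter

    tokens = open("corpus.txt").read().lower().split()   # hypothetical file
    freqs = Counter(tokens)
    hapaxes = sum(1 for count in freqs.values() if count == 1)
    print(f"{len(freqs)} types, {hapaxes} hapaxes "
          f"({100 * hapaxes / len(freqs):.1f}% occur only once)")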

EXPERIMENTAL SETUP

Deploy two NoSQL DBMSs (MongoDB [10], Cassandra [11]) in distributed configurations. Each is deployed in 4-, 8- and 16-node configurations on AWS (Amazon Web Services [12]), on m4.xlarge instances: 4 vCPUs, 16 GB RAM, 1000 GB EBS volume (500 IOPS). The Hansard corpus is stored and queried for three word groups: low-, medium- and high-frequency words. The setups loosely mirror the data model employed by BYU [13].

EXPERIMENTAL SETUP

    Low frequency:    gauntly, croquet, patronym, ratpayers, thugutt, ogies, fecias, gacious, unspared, moyland
    Medium frequency: weeny, kilometers, plebs, appraiser, earldoms, candlemas, laudations, coachmakers, heinkel, conegate
    High frequency:   it, I, is, a, in, and, that, to, of, the

EXPERIMENTAL SETUP: MONGODB

BSON document store. All words are stored as individual BSON documents in a single collection, with a MongoDB text index built on the searchable word form:

    {
        _id: ObjectId("555324ca7986c0c3d6b3a97"),
        originalform: "The",
        searchableform: "the",
        docid: "S6CV0196P",
        pos: 103
    }
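A minimal sketch of how this document model might be exercised with pymongo; the connection details are hypothetical and the field names follow the document above. MongoDB text-index searches go through the $text operator, whose default English analyser stop-lists very common words, so a rarer query word is used for illustration:

    from pymongo import MongoClient, TEXT

    client = MongoClient("mongodb://localhost:27017")  # hypothetical host
    words = client["hansard"]["words"]

    # One BSON document per word token, as on the slide.
    words.insert_one({
        "originalform": "The",
        "searchableform": "the",
        "docid": "S6CV0196P",
        "pos": 103,
    })

    # Text index on the searchable word form.
    words.create_index([("searchableform", TEXT)])

    # Concordance-style lookup: every occurrence of a query word.
    for doc in words.find({"$text": {"$search": "croquet"}}):
        print(doc["docid"], doc["pos"])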
EXPERIMENTAL SETUP: CASSANDRA

Column family store. Each word is stored as a row in a table, with the searchable-form column indexed for search:

    CREATE TABLE hansard.words (
        doc text,
        pos int,
        s_f text,
        o_f text,
        PRIMARY KEY (doc, pos)
    );
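A comparable sketch against the Cassandra schema, using the DataStax Python driver; the contact point, replication settings and index name are hypothetical additions:

    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])  # hypothetical contact point
    session = cluster.connect()

    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS hansard WITH replication =
            {'class': 'SimpleStrategy', 'replication_factor': 1}
    """)
    session.execute("""
        CREATE TABLE IF NOT EXISTS hansard.words (
            doc text, pos int, s_f text, o_f text,
            PRIMARY KEY (doc, pos)
        )
    """)
    # Secondary index so the searchable form (s_f) can be queried directly.
    session.execute(
        "CREATE INDEX IF NOT EXISTS words_s_f_idx ON hansard.words (s_f)")

    # One row per word token.
    session.execute(
        "INSERT INTO hansard.words (doc, pos, s_f, o_f) VALUES (%s, %s, %s, %s)",
        ("S6CV0196P", 103, "the", "The"),
    )

    # All occurrences of a query word, served by the secondary index.
    for row in session.execute(
            "SELECT doc, pos FROM hansard.words WHERE s_f = %s", ("the",)):
        print(row.doc, row.pos)

With PRIMARY KEY (doc, pos), doc is the partition key, so all the words of one document sit in a single partition ordered by position; the secondary index is what makes word-form lookups possible across partitions.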
RESULTS: MONGODB

[results chart]

RESULTS: CASSANDRA

[results chart]

CONCLUSIONS

Scaling out is a feasible solution for corpus linguistic data. Existing DBMSs have limitations. Differing indexing strategies may be more or less effective for simple and complex queries. Further work will build on what has been learnt here to develop a scalable corpus database.

REFERENCES

[1] Brown Corpus
[2] British National Corpus (BNC)
[3] Hansard Corpus
[4] Early English Books Online (EEBO)
[5] AntConc
[6] Zipf's law
[7] WordSmith Tools
[8] CQPWeb, https://cqpweb.lancs.ac.uk/
[9] Wmatrix
[10] MongoDB, https://www.mongodb.com/
[11] Apache Cassandra
[12] Amazon Web Services, https://aws.amazon.com/
[13] BYU corpora

QUESTIONS?