Generation of personal data for test persons for use in the Swedish Tax Agency's Population Registry
An application of Context Free Grammar

Stefan Lindström

VT 2016
Bachelor Thesis, 15 hp
Supervisor: Suna Bensch
External Supervisor: Anders Marklund
Examiner: Lars-Erik Janlert
Bachelor's programme in Computing Science, 180 hp




Abstract

As software systems grow, the need for testing becomes more and more apparent. These systems usually require data to be tested with, and acquiring this data can be very time-consuming and thus expensive. This is especially true if the data needs to be reacquired or otherwise changed every so often. This thesis looks at context-free grammars as a tool for generating the data required to test the Swedish Tax Agency's Population Registry, with the purpose of reducing the time and effort needed to acquire new data.

Generation of personal data for test persons for use in the Swedish Tax Agency's population registration (Skatteverkets folkbokföring)

Summary

When software systems grow large, the need for testing becomes more and more apparent as the system grows. These large systems usually require data in order to be tested. Producing or otherwise finding this data can be very time-consuming, and therefore also expensive. This becomes especially clear if the data has to be produced or changed regularly. This thesis looks at context-free grammars as a tool for generating the data required to test the Swedish Tax Agency's population registration, with the goal of reducing the time and effort required to produce new data.


Acknowledgements

Thanks to Suna Bensch for supervision and help regarding context-free grammars as well as help with the report. Thanks also to Anders Marklund for help with understanding the system and data dependencies, as well as for testing the data generated with the context-free grammars.


Contents

1 Introduction

2 Background
2.1 Context-Free Grammar
2.2 Structure of personal data

3 Result
3.1 Small example from the grammar
3.2 Notes about the grammar
3.3 Limitations in relationships
3.4 Limitations in dates
3.5 Large amount of production rules
3.6 Prolog Implementation

4 Discussion
4.1 Hyperedge Replacement Graph Grammars
4.2 Random Context Grammars

5 Conclusion

References

A Resulting grammar


1 Introduction

As software systems grow larger, the need for easy and efficient testing becomes more and more apparent [4, 8]. Acquiring test data is one of the problems that must be solved in order to have proper testing that actually verifies that the system works as intended. There are two main approaches to handling test data: creating a large set of static data of sufficient quality and quantity to cover all the test cases, or generating the data when it is needed for a specific test and discarding it afterwards.

One way of acquiring a static dataset is to copy the actual production data and use it in a test environment. This is easy and reliable, since it captures all aspects of the data currently in use (including corrupted or incomplete data, which might be important). The other approach, generating data when needed, is the topic of this thesis. More specifically, the topic is the use of Context-Free Grammars (CFGs) for test data generation in a real-world application.

The application I have been investigating is the Swedish Tax Agency's population registry, a database of all citizens, past and present, who have lived or are living in Sweden.

Due to several factors, including the massive size of the application in combination with insufficient test hardware, as well as rules and regulations, copying production data for use in test environments is neither feasible nor allowed. Even if it were possible and allowed, it still should not be done, because of the ethical issues that arise from using real persons in test environments: if these are not as secure as the production environment, information could leak out. Thus, data generation is required. This generation must not have any bias towards the production data, for the same reasons that the production data cannot be used.

Currently all data generation is done by hand whenever it is needed, a long and time-consuming process. An easier and faster way of generating data is necessary, both to reduce the time it takes and to minimize the number of errors in the data itself. When testing, it is imperative to have the data that you expect, and that is hard to guarantee when all of it is hand-crafted.

A lot of information about each person is stored, and thus has to be generated, including names, home addresses, relationships and more.


The main questions to be answered are:

1. "Are context-free grammars an adequate tool to generate personal data for test persons?"

2. "Which advantages and insights can context-free grammars provide?"

3. "What are the limitations of context-free grammars?"

4. "Are there other grammar models that provide 'good modelling'?"


2 Background

Grammars in general are string-rewriting systems in which, at every step, either a symbol or a substring of the string is replaced with another string according to the rules of the specific grammar. This continues until there are no more substrings or symbols to replace, and the resulting string is output.

Every grammar G consists of four parts: N, a set of nonterminal symbols; Σ, a set of terminal symbols; P, a set of production rules; and S, a start symbol. Note that Σ has to be disjoint from N, i.e. the two sets cannot share any element. Furthermore, no string generated by G may contain a symbol from N.

Grammars are classified using the Chomsky hierarchy, which contains four types, Type-0 through Type-3. The higher the number, the more restricted the grammar and the smaller the class of languages it can generate. For example, Type-0 contains all formal grammars and can generate every language a Turing machine can recognize. Type-3 grammars, on the other hand, can only generate the languages that a finite state automaton can decide, namely the regular languages, which form a much smaller class [3].

Grammars were chosen as the tool for this thesis because they are finite devices that can generate infinitely many strings. These strings can represent anything, depending on the rule set that has been chosen or developed. In this thesis, the rule set has been developed to generate strings that describe the personal information required for test persons in the Swedish Tax Agency's Population Registry.

A grammar can also be used for the inverse purpose: to determine whether a string is part of the language, that is, whether the given grammar can produce that specific string. This is called parsing. When a string is parsed with a specific grammar, a parse tree is generated. This tree describes how the string can be derived from the grammar and gives a much clearer view of how the string is structured.

2.1 Context-Free Grammar

Context-Free Grammars (CFGs) are a specific category of grammars where every rule consists of rewriting a single symbol into a string of symbols. It is not allowed to rewrite several symbols at once, nor is it allowed to rewrite substrings in a CFG. Thus, every rule has the form:

A → x

where A is a single nonterminal symbol and x is a string of terminals and/or nonterminals; x can also be empty [6].

CFGs are Type-2 grammars in the Chomsky hierarchy, meaning that the languages they generate can be recognized by a non-deterministic pushdown automaton. A prominent modern use of CFGs in computing science is the Document Type Definition (DTD) used in XML [2].

G = ({S, S1}, {a, b}, P, S), where P consists of the rules

    S  → a S1 b
    S1 → a S1 b
    S1 → λ

Figure 1: An example CFG. λ denotes the empty string.

The example grammar shown in Figure 1 generates the language {a^n b^n : n ≥ 1}. Some example sentences from this language are: ab, aabb, aaabbb, aaaaabbbbb.

In addition to generating sentences, a CFG can be used to check whether a sentence is part of the language by means of a parsing algorithm. This is why CFGs, and also regular grammars, are so popular: not because they are the most powerful and can describe the largest number of languages, but because they have relatively simple and efficient parsing algorithms for deciding whether a sentence is part of a language.
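To make the derivation process concrete, the grammar in Figure 1 can be sketched in a few lines of Python. This is only an illustration: the names `RULES`, `derive` and `in_language` are not from the thesis, and the membership test is a direct check on the string rather than a general parsing algorithm.

```python
# Rules of the Figure 1 grammar; the empty list stands for λ.
RULES = {
    "S": [["a", "S1", "b"]],
    "S1": [["a", "S1", "b"], []],
}


def derive(n):
    """Derive a^n b^n (n >= 1) by rewriting one nonterminal at a time."""
    if n < 1:
        raise ValueError("the language only contains a^n b^n for n >= 1")
    form = ["S"]
    recursive_steps = n - 1  # how often S1 -> a S1 b still has to be applied
    while any(symbol in RULES for symbol in form):
        # Position of the leftmost nonterminal in the sentential form.
        i = next(k for k, symbol in enumerate(form) if symbol in RULES)
        if form[i] == "S":
            rhs = RULES["S"][0]
        elif recursive_steps > 0:
            rhs = RULES["S1"][0]
            recursive_steps -= 1
        else:
            rhs = RULES["S1"][1]  # S1 -> λ ends the derivation
        form[i:i + 1] = rhs
    return "".join(form)


def in_language(s):
    """Membership test for {a^n b^n : n >= 1}."""
    n = len(s) // 2
    return n >= 1 and s == "a" * n + "b" * n
```

Calling `derive(3)` walks through S ⇒ a S1 b ⇒ aa S1 bb ⇒ aaa S1 bbb ⇒ aaabbb, with the last step using the λ rule.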

2.2 Structure of personal data

The data structures and overall data composition described in Figure 2 are extracted from a specification given by the Swedish Tax Agency. However, the specification is not publicly available, and thus a shortened version containing the essentials is presented here.

The specification is divided into 8 parts, where some (data such as names or addresses) are completely independent of the others, whilst other parts (personal identity number and relationships) depend on each other.

Person The main structure to be generated. It contains all of the other data structures, but also some other things directly, such as personal identity number, gender and age. However, both gender and age are present in the personal identity number, and thus only the personal identity number will be generated.

Figure 2: The full UML-diagram describing the data

Personal Identity Number A personal identity number is made up of 12 digits, the first 8 being the birth date of the person. The following three are almost completely random; however, the third of those three digits is even for females and odd for males. The last digit is a pure check digit and is left out of this thesis since it does not provide any more information [1].

To ensure that no test person has the same personal identity number as a real person, the three digits following the birth date in a test person's personal identity number will always be 238 or 239, depending on the person's gender.

Name A name has 3 parts: first name, middle name and last name. A first name can itself consist of several different names, separated by whitespace.

Death Information Burial location and date. Doesn’t exist for any livingperson.


Citizenship A person can have up to three different citizenships at the sametime, each with a country and a date.

Birth Location Consists of either only a birthplace if not located in Sweden,or of a birthplace and county if located in Sweden.

Civil Registration Consists of a county, a municipality, a district, a building, a street address and finally a date. Civil registration is not dependent on anything else, but the data that make up the civil registration depend on each other. For example, in each county there are several municipalities. In each municipality there are several districts and many buildings. For each building there may be one or more addresses associated with it. Thus, each of these has to be chosen correctly depending on the choices made before it.

Postal address A postal address is either 2 delivery addresses, a postal area and a postal code, or 3 delivery addresses and a country, depending on the type of address. All types have a date and the possibility of a care-of address. A postal address differs from civil registration in that a postal address is where a person lives temporarily, whereas civil registration is permanent. Civil registration is also much stricter about which choices can be made.

Civil status The civil status can have a number of different values: UM (Unmarried), M (Married), W (Widowed) and more. All civil states except Unmarried require another Person to whom there is a relation.

Relationships Each relationship requires a relationship type, a date andanother person. There are several different types of relations, but the mostimportant ones are the following 5 types: Child, Biological mother, Biologicalfather, Adoptive mother, Adoptive father.
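The personal identity number format described above can be illustrated with a short Python sketch. The function names are hypothetical, and the mapping of 238 to females and 239 to males is an assumption derived from the even/odd rule stated in the text, not something the specification excerpt confirms. The date range matches the one the grammar is said to cover (1900-01-01 to 2099-12-31), and the check digit is left out, as in the thesis.

```python
import random
from datetime import date, timedelta


def make_test_pin(birth_date, female):
    """Return the first 11 digits of a test person's identity number."""
    # Assumption: even last digit (238) for females, odd (239) for males.
    serial = "238" if female else "239"
    return f"{birth_date.year:04d}{birth_date.month:02d}{birth_date.day:02d}{serial}"


def random_birth_date(rng, first=date(1900, 1, 1), last=date(2099, 12, 31)):
    """Pick a uniformly random date in the inclusive range [first, last]."""
    span_days = (last - first).days
    return first + timedelta(days=rng.randint(0, span_days))
```

For example, `make_test_pin(date(1995, 12, 31), female=True)` yields `19951231238` under these assumptions.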


3 Result

Before going to the full grammar and information about it, a small example grammar is provided in Section 3.1. The full grammar can be found in Appendix A, with explanations in Section 3.2.

3.1 Small example from the grammar

The example given in Figure 3 is a rough sketch of the real grammar. Every person P consists of two parts, PS and PNR. PS represents all the uninteresting information, such as name, birth location and civil registration. This is also the case in the real grammar. PNR, on the other hand, represents personal identity number, relationships and civil status, all of which are interconnected with age and gender.

All lower-case letters, as well as <, > and /, are terminal symbols, and all upper-case letters are nonterminal symbols. The grammar is intentionally incomplete to keep its size small.

P → <person> PS PNR </person>
PS → N D C PA CR B

N → <name> NF NM NL </name>

D → <deceased>no</deceased>
D → <deceased>yes DS DD </deceased>

C → C C
C → <citizenships> CS </citizenships>
C → <citizenships> CF </citizenships>

PA → PA PA
PA → <postaladress> COA PT PAD </postaladress>
PA → λ

CR → CR CR
CR → <civilreg> CRL CRD </civilreg>

B → <birthlocation> BL BP </birthlocation>

PNR → <pnr> G_PNR </pnr><rels> R </rels>

R → R R
R → <relationship> P RT RD </relationship>
R → λ

Figure 3: A small, non-complete example of the grammar.
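As a rough illustration of how sentences can be generated from the rules in Figure 3, the following Python sketch expands the start symbol P by picking alternatives at random. It is not the generator used in the thesis: the `RULES` table, the `expand` function and the depth bound are assumptions made for this example, and `G_PNR` stands for the generation symbol for personal identity numbers. Symbols without rules (NF, CS, RT and so on) are left unexpanded, as in the incomplete grammar.

```python
import random

# A transcription of the rules in Figure 3; [] stands for λ.
# Symbols without an entry (NF, CS, G_PNR, tags, ...) are emitted unchanged.
RULES = {
    "P": [["<person>", "PS", "PNR", "</person>"]],
    "PS": [["N", "D", "C", "PA", "CR", "B"]],
    "N": [["<name>", "NF", "NM", "NL", "</name>"]],
    "D": [["<deceased>no</deceased>"],
          ["<deceased>yes", "DS", "DD", "</deceased>"]],
    "C": [["C", "C"],
          ["<citizenships>", "CS", "</citizenships>"],
          ["<citizenships>", "CF", "</citizenships>"]],
    "PA": [["PA", "PA"],
           ["<postaladress>", "COA", "PT", "PAD", "</postaladress>"],
           []],
    "CR": [["CR", "CR"], ["<civilreg>", "CRL", "CRD", "</civilreg>"]],
    "B": [["<birthlocation>", "BL", "BP", "</birthlocation>"]],
    "PNR": [["<pnr>", "G_PNR", "</pnr>", "<rels>", "R", "</rels>"]],
    "R": [["R", "R"],
          ["<relationship>", "P", "RT", "RD", "</relationship>"],
          []],
}


def expand(symbol, rng, depth=0, max_depth=6):
    """Randomly expand `symbol` into a list of output tokens.

    Once `max_depth` is reached, the alternative with the fewest
    expandable symbols is chosen so that the recursion terminates.
    """
    if symbol not in RULES:
        return [symbol]
    alternatives = RULES[symbol]
    if depth >= max_depth:
        rhs = min(alternatives, key=lambda alt: sum(s in RULES for s in alt))
    else:
        rhs = rng.choice(alternatives)
    out = []
    for s in rhs:
        out.extend(expand(s, rng, depth + 1, max_depth))
    return out
```

Calling `" ".join(expand("P", random.Random(1)))` produces one person, possibly with further persons nested inside `<relationship>` elements. The depth bound is a practical addition; the grammar itself places no limit on nesting.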


Figure 4 demonstrates what the output from the grammar can look like. The example is not fully derived, even disregarding the fact that not all rules are shown. Near the bottom, the section on relationships contains a P (the start symbol of the whole grammar) that has not been expanded. That P could be expanded in the same way as the first one, but contained within the first person. The second person could in turn have relationships to other people, just as the first person did. This recursion can continue without bound.

The output of this grammar is very similar to the output of the real grammar, the biggest difference being that the rules have been cut off half-way down the chain in this example.

<person>
    <name>
        NF
        NM
        NL
    </name>
    <deceased> no </deceased>
    <citizenships>
        CS
        CF
    </citizenships>
    <postaladress>
        COA
        PT
        PAD
    </postaladress>
    <civilreg>
        CRL
        CRD
    </civilreg>
    <birthlocation>
        BL
        BP
    </birthlocation>
    <pnr>
        G_PNR
    </pnr>
    <rels>
        <relationship>
            P
            RT
            RD
        </relationship>
    </rels>
</person>

Figure 4: An example of the output from the grammar in Figure 3. The remaining symbols need to be expanded for the output to be complete, which is not possible because the grammar is incomplete. Newlines and indentation have been added for readability.


3.2 Notes about the grammar

As said in Section 3, the grammar can be found in Appendix A. It consists of a little over 300 production rules. To make it easier to understand what each rule does, most of the rules correspond closely to terms from the information model given in Section 2.2.

All nonterminal symbols are in upper case, and the terminal symbols are all lower-case letters as well as <, > and /.

In general, each rule corresponds to a category in the information model, according to Table 1a. Note that the grammar contains several more categories than those shown in the table, but these are the most important ones.

The rules are then split into subcategories using subscript notation, where a few letters give special meaning to the rules. These subscripts are listed in Table 1b.

Table 1: Production rule categories and subscripts

(a) Categories

Symbol  Category
P       Person
R       Relationship(s)
N       Name
D       Deceased
C       Citizenship(s)
PA      Postal Address
CR      Civil Registration
B       Birth location
PNR     Personal Identity Number
CIV     Civil status
G       Generate
CH      Character

(b) Subscripts

Subscript  Meaning
M          Male/Mother
F          Female/Father
B          Biological
A          Adoptive
C          Child
V          Via
Y          Young
O          Old
P          Parent/Partner
S          Shared
D          Date

M and F have different meanings depending on the context in which they are used. Usually they mean Male/Female, but in the context of a relationship to a parent, and in the symbols that follow from it, they mean Mother/Father. For example, R_CVM means Relationship to Child via Male (since it is not a relationship to a parent), but R_AF means Relationship to Adoptive Father. The same goes for the symbol P_F, which is only generated from R_AF. However, when P_F is expanded it becomes PNR_SMO, where M means Male again.

The split between partner and parent is more obvious: the only place where the meaning is Partner is in the expansion of the CIV terms.

Both of these double meanings could be removed with some refactoring, but no suitable substitution was found. This does not affect the grammar itself, only its readability and how easy it is to understand.


Generation rules are the rules called G_x, the "last" rules to be applied. This means that they generate no symbols that can trigger any rules other than those in that specific generation chain. Depending on which of them is called, these rules generate either a 35-character string (G_S), a random date (G_D), a random country (G_C) or a random personal identity number (G_PNRXX).

3.3 Limitations in relationships

One big limitation of the presented context-free grammar (as well as of context-free grammars in general) is that it cannot produce fully connected families, only semi-connected ones. Because of how context-free grammars work and are parsed, they can never create the relations shown in Figure 5.

In Figures 5 to 7, the following notation is used: 1_F is the wife, 2_M is the husband and 3_M is the child. The subscript denotes the person's gender. In Figure 7, the biological father is 4_M.

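[Figure 5 diagram: 1_F and 2_M are connected by a Married edge. 3_M is connected to 1_F by a Biological Mother/Child edge and to 2_M by a Biological Father/Child edge.]

Figure 5: A fully connected family of 3 persons.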

The reason for this is that when a string is parsed with the grammar, a tree structure is produced. A tree cannot accurately represent this information without references (which would break the tree structure), because of the cross-connections between the nodes. No node in a tree can have two separate incoming edges [7], which is exactly what happens in Figure 5.

The best a context-free grammar can do is shown in Figure 6. Comparing this to Figure 5, there are two obvious differences. First, none of the relations are double-sided. This is not a problem, since the biological child relation from 1 to 3 implies that there is a biological mother relation from 3 to 1; it just is not explicitly stated.

Second, Figure 6 is missing the relation from 2 to 3. This is a much bigger problem, since there is no real answer to what this relation should be, if any at all. One could assume that there should be a biological child relation between 2 and 3, since 2 is married to the biological mother of 3. But it could also be the case that there should not be one, and that someone who is not currently in the family is the biological father, as shown in Figure 7.

[Figure 6 diagram: 1_F is connected to 2_M by a Married edge and to 3_M by a Biological Child edge.]

Figure 6: A semi-connected family of 3

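[Figure 7 diagram: 1_F is connected to 2_M by a Married edge and to 3_M by a Biological Child edge. 3_M and 4_M are connected by a Biological Father edge.]

Figure 7: A semi-connected family of 3 with an external biological father.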

To solve this problem, other kinds of tools, better suited to graphs than to trees, have to be used. One such tool is discussed in more depth in Section 4.1.
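The difference between the fully connected and the semi-connected family can be checked mechanically. The Python sketch below is an illustration only: the edge lists are a simplified, one-directional reading of Figures 5 and 6, and `is_tree` is a hypothetical helper that tests the property cited above, namely that no node in a tree has two incoming edges.

```python
from collections import defaultdict

# Directed relations (from, to, label), simplified from Figures 5 and 6.
FULLY_CONNECTED = [
    ("1F", "2M", "Married"),
    ("1F", "3M", "Biological Child"),
    ("2M", "3M", "Biological Child"),
]
SEMI_CONNECTED = [
    ("1F", "2M", "Married"),
    ("1F", "3M", "Biological Child"),
]


def is_tree(edges, root):
    """True if every node except `root` has exactly one incoming edge
    and every node is reachable from `root`."""
    incoming = defaultdict(int)
    children = defaultdict(list)
    nodes = {root}
    for src, dst, _label in edges:
        incoming[dst] += 1
        children[src].append(dst)
        nodes.update((src, dst))
    if incoming[root] != 0:
        return False
    if any(incoming[n] != 1 for n in nodes if n != root):
        return False
    # Depth-first search to confirm that all nodes hang off the root.
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(children[node])
    return seen == nodes
```

With these edge lists, `is_tree(SEMI_CONNECTED, "1F")` holds, while `is_tree(FULLY_CONNECTED, "1F")` does not, because 3M has two incoming edges.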

3.4 Limitations in dates

The set of people that can be generated by this grammar is a superset of all the real people in the date range 1900-01-01 to 2099-12-31, because the birth date is never taken into consideration when generating other dates. As a result, the grammar can generate a person who was married before he or she was born, and other such inconsistencies. This is easy to see in how the grammar generates dates: all dates are generated from G_D, which is completely unaware of the person's birth date and thus generates a completely random date every time.

There are solutions to this problem; one is discussed in the next section (Section 3.5) and another in Section 4.2.

3.5 Large amount of production rules

As can be seen in Appendix A, many of the rules regarding the personal identity number (PNR), as well as the rules regarding persons (P) and relations (R), have multiple subscripts simultaneously. This causes the number of rules to grow quickly because of the combinatorial nature of the subscripts. Adding one more factor with two choices doubles the number of rules in which that factor is taken into account.

The two obvious examples are the adulthood (Y/O) and gender (M/F) subscripts, as seen in Figure 8. Just these two extra factors cause an explosion in the number of rules, from about 20 to over 80. Adding another binary factor would double that number again, so the number of rules grows very quickly compared to the number of factors. Adding factors with more than two options would of course make the increase even faster.

P_BCVF → <person> PS PNR_SO R_BCVFO </person>
R_BCVFY → <rels> R_BF </rels>
PNR_SFY → <pnr> G_PNRYF </pnr><civ>um CIV_D </civ>

Figure 8: An excerpt from the grammar showing PNR, P and R having multiple subscripts

As noted at the end of Section 3.4, a person's birth date is not taken into account when generating the other dates. This could be fixed in a context-free grammar, but the number of rules needed would be huge. As stated in the previous paragraph, a binary factor (such as gender) requires twice as many rules. Dates are not binary: there are 365 days per year, and this grammar currently handles 200 years. In a grammar that took the birth date into account when generating all the other dates, the current grammar would correspond to a single birth date.

This would mean an increase by a factor of 73000 (365 × 200), bringing the total number of rules to about 22 million. Whilst it would not be hard to increase the size of the grammar to that degree, it would probably be easier and better to use some other, more powerful system that handles such tasks better. One such system is discussed in Section 4.2.
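The estimate above can be reproduced with a few lines of arithmetic. The constant names are illustrative, and the figure of 300 rules is a rounding of the "little over 300" production rules mentioned in Section 3.2.

```python
DAYS_PER_YEAR = 365   # leap days ignored, as in the text
YEARS_COVERED = 200   # 1900 through 2099
CURRENT_RULES = 300   # "a little over 300" production rules

birth_dates = DAYS_PER_YEAR * YEARS_COVERED
estimated_rules = CURRENT_RULES * birth_dates

print(birth_dates)      # 73000
print(estimated_rules)  # 21900000
```

The result, 21.9 million, is consistent with the figure of about 22 million rules.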


3.6 Prolog Implementation

A version of the grammar has been implemented as a DCG (Definite Clause Grammar) in Prolog [9]. However, because of how Prolog handles choices, no useful data could be generated from it. When Prolog is given a choice between two or more substitutions, it always takes the first one it has not yet tried, which ensures that it eventually iterates over all possible outcomes. This behaviour is not wanted in a data generation tool that is supposed to generate randomized data.

The Prolog implementation could, however, verify that the handful of examples that have been created from the grammar by hand are in fact part of the language. It could also conclude that other examples, which were incomplete or otherwise wrong, are not part of the language.
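The contrast between fixed-order selection and random selection can be sketched in Python. This is not a model of Prolog's execution, only an illustration of the difference in choice strategy; the two alternatives are taken from the D rules in Figure 3, and the function names are hypothetical.

```python
import random

# The two alternatives for the symbol D in Figure 3 (DS and DD left unexpanded).
ALTERNATIVES = [
    "<deceased>no</deceased>",
    "<deceased>yes DS DD </deceased>",
]


def first_untried(alternatives):
    """Yield alternatives in their written order, as a fixed-order search would."""
    yield from alternatives


def random_pick(alternatives, rng):
    """Pick one alternative uniformly at random."""
    return rng.choice(alternatives)
```

Here `next(first_untried(ALTERNATIVES))` always returns the first alternative, so every first answer describes a living person, whereas repeated calls to `random_pick` return both alternatives over time.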


4 Discussion

4.1 Hyperedge Replacement Graph Grammars

Graph grammars are a natural development of the concept of string grammars, and Hyperedge Replacement Graph Grammars are the graph counterpart of context-free string grammars, although many other kinds of graph grammars have also been called that [5].

"A hyperedge is an atomic item with a fixed number of tentacles, called the type of the hyperedge. It can be attached to any kind of structure coming with a set of nodes by attaching each of its tentacles to a node. The hyperedge controls the sequence of these attachment nodes and can play the role of a place holder, which may be replaced with some other structure eventually." [5]

When a hyperedge is replaced, it is first removed, and the new structure R is then embedded into the old structure where the hyperedge used to be. Each external node is fused with the corresponding attachment node. This feeds back into the type of the hyperedge, since the number of external nodes has to match the number of attachment nodes.

Just as in context-free grammars, the production rules are defined by a label on the left-hand side and a replacement structure on the right-hand side, only with graphs instead of strings. An example showing the rules needed to generate all possible graphs built from parallel or sequential edges can be seen in Figure 9.

Figure 9: All combinations of parallel and sequential graphs can be derived from this grammar.

Using this technique it should be possible to properly model the relations shown in Figure 5. In fact, it should be possible to model any combination of relationships between any set of people, as long as the number of people involved is finite.


4.2 Random Context Grammars

As mentioned in Section 3.4, the current grammar generates a superset of the actual population we want to generate. Instead of creating a huge set of context-free rules, a more elegant solution is to use significantly more powerful tools, namely "Random Forbidding-Context Grammars" (RFCGs) and "Random Permitting-Context Grammars" (RPCGs), collectively called "Random-Context Grammars" (RCGs) [10].

L → Ls ({R}; ∅)

Figure 10: Example rule from an RCG.

RCGs are similar to CFGs, but with one addition. For each rule A → x there also exists a tuple (P; F) which determines whether the rule can be applied, as shown in Figure 10. P is the permitting set for A and requires that all members of P are present in the sentential form. The converse holds for F, the forbidding set for A: none of the members of F may occur in the sentential form for the rule to be applicable. If P = ∅ then the rule is only forbidding, and if F = ∅ then the rule is only permitting.¹ If P = ∅ ∧ F = ∅ then the rule is context-free.²
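The applicability condition described above can be written down directly. The following Python sketch is an illustration under stated assumptions: the class `RCRule` and the functions `applicable` and `apply_first` are hypothetical names, a sentential form is represented as a list of symbols, and the right-hand side `Ls` of the rule in Figure 10 is kept as a single opaque token.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RCRule:
    lhs: str                             # the nonterminal A
    rhs: tuple                           # the replacement string x
    permitting: frozenset = frozenset()  # P: symbols that must be present
    forbidding: frozenset = frozenset()  # F: symbols that must be absent


def applicable(rule, form):
    """Check whether `rule` may rewrite the sentential form `form`."""
    present = set(form)
    return (rule.lhs in present
            and rule.permitting <= present
            and not (rule.forbidding & present))


def apply_first(rule, form):
    """Rewrite the leftmost occurrence of rule.lhs in the list `form`."""
    if not applicable(rule, form):
        raise ValueError("rule is not applicable to this sentential form")
    i = form.index(rule.lhs)
    return form[:i] + list(rule.rhs) + form[i + 1:]


# The rule from Figure 10: L -> Ls ({R}; ∅), with Ls kept as one token.
FIGURE_10 = RCRule("L", ("Ls",), permitting=frozenset({"R"}))
```

With this definition, `FIGURE_10` can rewrite the form `["L", "R"]` but not `["L"]`, since the permitting symbol R is missing from the latter. A rule with both sets empty behaves like an ordinary context-free rule.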

Figure 11: Random Context (RC), Context Sensitive (CS), Random Per-mitting Context (RPC), Random Forbidding Context (RFC)and Context Free (CF) languages.

As can be seen in Figure 11, the sets of languages generated by RFCGs and RPCGs are smaller than the set generated by RCGs, but bigger than the set generated by CFGs. The set of languages generated by RCGs is also smaller than the set of languages generated by CSGs (Context-Sensitive Grammars), but no language that is impossible to generate with RCGs and possible to generate with CSGs has been found [10].

¹ ∅ denotes the empty set.
² ∧ denotes the logical AND operator.

Using this technique, a much more refined way of handling dates could be implemented, one that makes use of the permitting part of the RCG. Suppose all dates were split into three parts, year, month and day, each with a unique identifier. The rules for G_D would then also be split into three different rules: G_DY, G_DM and G_DD.

These rules would have to exist in many different versions, one for each value the year, month or day may have. For G_DM this would mean 12 different versions, each with a different permitting set containing the specific month for which the rule is applicable. For G_DD there would have to be 31 different versions, and G_DY would require as many versions as the number of years the grammar handles.

G_DM → G_M01 ({<birthmonth>01</birthmonth>}; ∅)
G_DM → G_M02 ({<birthmonth>02</birthmonth>}; ∅)
G_DM → G_M03 ({<birthmonth>03</birthmonth>}; ∅)
G_DM → G_M04 ({<birthmonth>04</birthmonth>}; ∅)
G_DM → G_M05 ({<birthmonth>05</birthmonth>}; ∅)
G_DM → G_M06 ({<birthmonth>06</birthmonth>}; ∅)
G_DM → G_M07 ({<birthmonth>07</birthmonth>}; ∅)
G_DM → G_M08 ({<birthmonth>08</birthmonth>}; ∅)
G_DM → G_M09 ({<birthmonth>09</birthmonth>}; ∅)
G_DM → G_M10 ({<birthmonth>10</birthmonth>}; ∅)
G_DM → G_M11 ({<birthmonth>11</birthmonth>}; ∅)
G_DM → G_M12 ({<birthmonth>12</birthmonth>}; ∅)

Figure 12: An excerpt of how the grammar might look

In Figure 12, only one rule is applicable for each person, even though all of the rules have the same name. This is due to the differing permitting sets on the right-hand side: each set matches the birth month in the string with the correct rule for further generation of months.

The reason behind splitting the date into three distinct parts is that each part can be permitted by itself, and if all of the parts are correct by themselves, the entire date is correct as well. Doing it this way drastically reduces the number of rules needed, at the cost of a small increase in the complexity of the grammar.

However, there is a drawback to this method as well. Imagine a person born on 1995-12-31. Even though 1996-01-01 is a valid date (by one day), this grammar will not consider the string a part of the language, because its month and day are earlier than the reference month and day.

The following conclusion can then be made: the grammar presented here in Section 4.2 creates a subset of all the possible persons. Everything it generates will be correct, but it cannot generate every possible person. This might be more desirable than the opposite, a grammar that can generate every possible person but can also generate persons that are not correct.
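The drawback can be made concrete with a short Python sketch. It assumes that "permitted" means each of year, month and day is individually not earlier than the corresponding part of the reference date; that reading, and the function names, are assumptions made for this example.

```python
from datetime import date, timedelta


def componentwise_permitted(ref: date, cand: date) -> bool:
    """Accept cand only if year, month and day are each not earlier than in ref."""
    return cand.year >= ref.year and cand.month >= ref.month and cand.day >= ref.day


def not_earlier(ref: date, cand: date) -> bool:
    """The comparison one actually wants: cand is on or after ref."""
    return cand >= ref


birth = date(1995, 12, 31)
next_day = date(1996, 1, 1)
print(not_earlier(birth, next_day))              # True
print(componentwise_permitted(birth, next_day))  # False

# Every date accepted component-wise is also a genuinely valid date,
# checked here over roughly three years around the reference date.
start = date(1995, 1, 1)
window = [start + timedelta(days=i) for i in range(3 * 365)]
accepted = [d for d in window if componentwise_permitted(birth, d)]
assert all(not_earlier(birth, d) for d in accepted)
```

The component-wise check never accepts an invalid date, but it rejects many valid ones, which is the subset behaviour described above.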


5 Conclusion

In conclusion, context-free grammars work reasonably well for generating personal data for test persons for use in the Swedish Tax Agency’s Population Registry, if we disregard the relationship and date issues. The problem is that these two issues cannot be disregarded, which makes the grammar not very usable in its current state. There are simply too many variables, and too many rules that depend on small variations in the personal data and on how the data items relate to each other, for the problem to be easily modelled by a simple system such as a context-free grammar.

That is not to say that context-free grammars are not useful for this task; they very much are, but they cannot model the entirety of it cleanly. They can model most of it very cleanly, as seen in the PS rule in the grammar, which contains 6 of the 9 major blocks of information, again disregarding the issue with dates.

The grammar is nevertheless a very good foundation to continue from if more work is to be done in the area. Any further implementation of a grammar or other system can most likely reuse parts of the grammar or ideas shown in this thesis.

An interesting approach for further study in this field would be a mix of random permitting context grammars and hyperedge replacement graph grammars, where the graph grammar models the relationships between the persons and the random permitting context grammar creates the persons. This approach would combine the benefits of both grammars and would almost certainly be a better fit for the problem than a context-free grammar alone.


References

[1] Personnummer. Technical Report SKV704 utgåva 8, Skatteverket, 2008.

[2] Michael Benedikt, Chee Yong Chan, Wenfei Fan, Rajeev Rastogi, Shihui Zheng, and Aoying Zhou. DTD-directed publishing with attribute translation grammars. In Proceedings of the 28th International Conference on Very Large Data Bases, VLDB ’02, pages 838–849. VLDB Endowment, 2002.

[3] Noam Chomsky. On certain formal properties of grammars. Information and Control, 2(2):137–167, 1959.

[4] Rick D. Craig and Stefan P. Jaskiel. Systematic Software Testing. Artech House, 2002.

[5] Frank Drewes, Hans-Jörg Kreowski, and Annegret Habel. Hyperedge replacement graph grammars. Handbook of Graph Grammars, 1:95–162, 1997.

[6] John E. Hopcroft. Introduction to Automata Theory, Languages, and Computation. Pearson Education India, 1979.

[7] Donald E. Knuth. The Art of Computer Programming, Volume 1 (3rd Ed.): Fundamental Algorithms. Addison Wesley Longman Publishing Co., Inc., Redwood City, CA, USA, 1997.

[8] Glenford J. Myers, Corey Sandler, and Tom Badgett. The Art of Software Testing. John Wiley & Sons, 2011.

[9] Fernando C. N. Pereira and David H. D. Warren. Definite clause grammars for language analysis: a survey of the formalism and a comparison with augmented transition networks. Artificial Intelligence, 13(3):231–278, 1980.

[10] Max Stacey Rabkin. Ogden’s lemma for random permitting-and forbidding-context and ET0L languages, 2013.


A Resulting grammar

P → <person> PS PNR </person>

PS → N D C PA CR B

PPM → <person> PS PNRPM </person>
PPF → <person> PS PNRPF </person>

PF → <person> PS PNRSMO RF </person>
RF → <rels> RCVF RP </rels>
PM → <person> PS PNRSFO RM </person>
RM → <rels> RCVM RP </rels>

PBCVM → <person> PS PNRSO RBCVMO </person>
PBCVM → <person> PS PNRSY RBCVMY </person>
RBCVMO → <rels> RBM RC </rels>
RBCVMY → <rels> RBM </rels>

PBCVF → <person> PS PNRSO RBCVFO </person>
PBCVF → <person> PS PNRSY RBCVFY </person>
RBCVFO → <rels> RBF RC </rels>
RBCVFY → <rels> RBF </rels>

PACVM → <person> PS PNRSO RACVMO </person>
PACVM → <person> PS PNRSY RACVMY </person>
RACVMO → <rels> RAM RBF RBM RC </rels>
RACVMY → <rels> RAM RBF RBM </rels>

PACVF → <person> PS PNRSO RACVFO </person>
PACVF → <person> PS PNRSY RACVFY </person>
RACVFO → <rels> RAF RBF RBM RC </rels>
RACVFY → <rels> RAF RBF RBM </rels>

N → <name> NF NM NL </name>

NF → NF NF
NF → <firstname> GS NFD </firstname>
NF → λ


NFD → <firstnamedate> GD </firstnamedate>

NM → NM NM
NM → <middlename> GS NMD </middlename>
NM → λ
NMD → <middlenamedate> GD </middlenamedate>

NL → NL NL
NL → <lastname> GS NLD </lastname>
NL → λ
NLD → <lastnamedate> GD </lastnamedate>

D → <deceased>no</deceased>
D → <deceased>yes DBS DBD </deceased>

DBS → <burialsite> GS </burialsite>
DBD → <burialdate> GD </burialdate>

C → C C
C → <citizenships> CS </citizenships>
C → <citizenships> CF1 </citizenships>

CS → <citizenship> CCS CD </citizenship>
CCS → <country>sweden</country>
CD → <date> GD </date>

CF1 → <citizenship> CCF CD </citizenship> CF2

CF2 → <citizenship> CCF CD </citizenship> CF3
CF2 → λ

CF3 → <citizenship> CCF CD </citizenship>
CF3 → λ

CCF → <country> GC </country>

PA → PA PA
PA → <postaladress> COA PT PD </postaladress>
PA → λ

COA → <coadress> GS </coadress>
COA → λ

PD → <postaladressdate> GD </postaladressdate>


PT → PTS PLH DA1 DA2
PTS → <postaladresstype>swedish</postaladresstype>

PLH → <postallocality> PL </postallocality>
PL → abisko <postalnr> PNLUND </postalnr>
PL → overkalix <postalnr> PNUMEA </postalnr>

PNLUND → 98107
PNUMEA → 95681
PNUMEA → 95692
PNUMEA → 95699

PT → PTF PC DA1 DA2 DA3
PTF → <postaladresstype>foreign</postaladresstype>
PC → <postalcountry> GC </postalcountry>

DA1 → <deliveryadress1> GS </deliveryadress1>
DA2 → <deliveryadress2> GS </deliveryadress2>
DA2 → λ
DA3 → <deliveryadress3> GS </deliveryadress3>
DA3 → λ

CR → CR CR
CR → <civilreg> CRL CRD </civilreg>
CRL → <crcounty> L </crcounty>
CRD → <crdate> GD </crdate>

L → blekinge <crmunicip> KBLEKINGE </crmunicip>
L → kalmar <crmunicip> KKALMAR </crmunicip>

KBLEKINGE → berg FOHBERG FAHBERG
KBLEKINGE → ronneby FOHRONNEBY FAHRONNEBY

FOHBERG → <crcongr> FOBERG </crcongr>
FAHBERG → <crbuilding> FABERG </crbuilding>

FOBERG → berg
FOBERG → elleholm

FABERG → vagaren24 <craddr> ADVAGAREN24 </craddr>
FABERG → munin38 <craddr> ADMUNIN38 </craddr>

ADVAGAREN24 → storgatan 1 a
ADVAGAREN24 → storgatan 1 b
ADVAGAREN24 → storgatan 1 c


ADVAGAREN24 → storgatan 1 d
ADVAGAREN24 → storgatan 1 e

ADMUNIN38 → etvagen 52

FOHRONNEBY → <crcongr> FORONNEBY </crcongr>
FAHRONNEBY → <crbuilding> FARONNEBY </crbuilding>

FORONNEBY → ronneby
FORONNEBY → edestad

FARONNEBY → rasta 23 <craddr> ADRASTA23 </craddr>
FARONNEBY → loparen 5 <craddr> ADLOPAREN5 </craddr>

ADRASTA23 → karolinagatan 10 a
ADRASTA23 → karolinagatan 10 b
ADRASTA23 → karolinagatan 10 c

ADLOPAREN5 → frejgrand 2
ADLOPAREN5 → frejgrand 3
ADLOPAREN5 → frejgrand 4
ADLOPAREN5 → frejgrand 5

B → <birthlocation> BLH BP </birthlocation>

BLH → <birthlan> BL </birthlan>

BL → blekinge
BL → vasterbotten
BL → λ

BP → <birthplace> GS </birthplace>

PNR → PNRM

PNR → PNRF

PNRM → PNRSMY <rels> RP </rels>
PNRM → PNRSMO <rels> RCVM RP </rels>
PNRF → PNRSFY <rels> RP </rels>
PNRF → PNRSFO <rels> RCVF RP </rels>

PNRPM → <pnr> GPNROM </pnr> <rels> RCVM RP </rels>
PNRPF → <pnr> GPNROF </pnr> <rels> RCVF RP </rels>

PNRSY → PNRSMY
PNRSY → PNRSFY


PNRSMY → <pnr> GPNRYM </pnr> <civ>um CIVD </civ>
PNRSFY → <pnr> GPNRYF </pnr> <civ>um CIVD </civ>

PNRSO → PNRSMO
PNRSO → PNRSFO

PNRSMO → <pnr> GPNROM </pnr> <civ> CIVMO CIVD </civ>
PNRSFO → <pnr> GPNROF </pnr> <civ> CIVFO CIVD </civ>

CIVMO → um
CIVMO → m PPF
CIVMO → w PPF
CIVMO → d PPF
CIVMO → rp PPM
CIVMO → wp PPM
CIVMO → dp PPM

CIVFO → um
CIVFO → m PPM
CIVFO → w PPM
CIVFO → d PPM
CIVFO → rp PPF
CIVFO → wp PPF
CIVFO → dp PPF

CIVD → <civdate> GD </civdate>

RP → RAF RAM RBF RBM

RAF → <relationship> RTAF PF RD </relationship>
RAF → λ

RAM → <relationship> RTAM PM RD </relationship>
RAM → λ

RBF → <relationship> RTBF PF RD </relationship>
RBF → λ

RBM → <relationship> RTBM PM RD </relationship>
RBM → λ

RCVM → RCVM RCVM
RCVM → <relationship> RTBC PBCVM RD </relationship>
RCVM → <relationship> RTAC PACVM RD </relationship>


RCVM → λ

RCVF → RCVF RCVF
RCVF → <relationship> RTBC PBCVF RD </relationship>
RCVF → <relationship> RTAC PACVF RD </relationship>
RCVF → λ

RTAF → <relationshiptype>af</relationshiptype>
RTAM → <relationshiptype>am</relationshiptype>
RTBF → <relationshiptype>bf</relationshiptype>
RTBM → <relationshiptype>bm</relationshiptype>
RTAC → <relationshiptype>ac</relationshiptype>
RTBC → <relationshiptype>bc</relationshiptype>

RD → <relationshipdate> GD </relationshipdate>

GD → GYY GM
GD → GYO GM

GYY → 20 GYL
GYO → 19 GYL

GYL → GN GN

GN → 0
GN → 1
GN → 2
GN → 3
GN → 4
GN → 5
GN → 6
GN → 7
GN → 8
GN → 9

GM → 01 GD31
GM → 02 GD28
GM → 03 GD31
GM → 04 GD30
GM → 05 GD31
GM → 06 GD30
GM → 07 GD31
GM → 08 GD31
GM → 09 GD30
GM → 10 GD31
GM → 11 GD30
GM → 12 GD31


GD28 → 01
GD28 → 02
GD28 → 03
GD28 → 04
GD28 → 05
GD28 → 06
GD28 → 07
GD28 → 08
GD28 → 09
GD28 → 10
GD28 → 11
GD28 → 12
GD28 → 13
GD28 → 14
GD28 → 15
GD28 → 16
GD28 → 17
GD28 → 18
GD28 → 19
GD28 → 20
GD28 → 21
GD28 → 22
GD28 → 23
GD28 → 24
GD28 → 25
GD28 → 26
GD28 → 27
GD28 → 28

GD30 → GD28
GD30 → 29
GD30 → 30

GD31 → GD30
GD31 → 31
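Read together, the GD, GYY, GYO, GYL, GN, GM and day rules derive an eight-digit date string: a century, two year digits, a month and a day that fits the month. The Python sketch below mimics one random derivation. The function name `generate_date`, the uniform random choices and the month-length table (ordinary calendar lengths, with February fixed at 28 as in the day rules) are assumptions of this example rather than something the grammar prescribes.

```python
import random

# Month lengths assumed for this sketch; February is capped at 28,
# so a leap day is never produced.
DAYS_IN_MONTH = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]


def generate_date(rng: random.Random) -> str:
    """Derive one date string in the spirit of the GD rules."""
    century = rng.choice(["19", "20"])                       # GYO or GYY
    year_low = f"{rng.randrange(10)}{rng.randrange(10)}"     # GYL -> GN GN
    month = rng.randrange(1, 13)                             # GM
    day = rng.randint(1, DAYS_IN_MONTH[month - 1])           # day rule for that month
    return f"{century}{year_low}{month:02d}{day:02d}"


rng = random.Random(0)
sample = generate_date(rng)
print(sample)
```

Because the day is drawn only after the month is fixed, the sketch, like the grammar, cannot produce a day that does not exist in the chosen month.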

GS → CH01

CH01 → CH CH02

CH02 → CH CH03
CH02 → λ

CH03 → CH CH04


CH03 → λ

CH04 → CH CH05
CH04 → λ

CH05 → CH CH06
CH05 → λ

CH06 → CH CH07
CH06 → λ

CH07 → CH CH08
CH07 → λ

CH08 → CH CH09
CH08 → λ

CH09 → CH CH10
CH09 → λ

CH10 → CH CH11
CH10 → λ

CH11 → CH CH12
CH11 → λ

CH12 → CH CH13
CH12 → λ

CH13 → CH CH14
CH13 → λ

CH14 → CH CH15
CH14 → λ

CH15 → CH CH16
CH15 → λ

CH16 → CH CH17
CH16 → λ

CH17 → CH CH18
CH17 → λ

CH18 → CH CH19


CH18 → λ

CH19 → CH CH20
CH19 → λ

CH20 → CH CH21
CH20 → λ

CH21 → CH CH22
CH21 → λ

CH22 → CH CH23
CH22 → λ

CH23 → CH CH24
CH23 → λ

CH24 → CH CH25
CH24 → λ

CH25 → CH CH26
CH25 → λ

CH26 → CH CH27
CH26 → λ

CH27 → CH CH28
CH27 → λ

CH28 → CH CH29
CH28 → λ

CH29 → CH CH30
CH29 → λ

CH30 → CH CH31
CH30 → λ

CH31 → CH CH32
CH31 → λ

CH32 → CH CH33
CH32 → λ

CH33 → CH CH34


CH33 → λ

CH34 → CH CH35
CH34 → λ

CH35 → CH
CH35 → λ

CH → a
CH → b
CH → c
CH → d
CH → e
CH → f
CH → g
CH → h
CH → i
CH → j
CH → k
CH → l
CH → m
CH → n
CH → o
CH → p
CH → q
CH → r
CH → s
CH → t
CH → u
CH → v
CH → w
CH → x
CH → y
CH → z
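The GS rule and the chain CH01 to CH35 bound the length of a generated string: CH01 has no λ alternative, so at least one letter is produced, and the chain ends at CH35, so at most 35 letters are produced. A minimal Python sketch of that behaviour follows. The name `generate_string` and the parameter `stop_probability` are hypothetical; the grammar itself assigns no probabilities to its alternatives.

```python
import random
import string

MAX_LEN = 35  # length of the chain CH01 .. CH35


def generate_string(rng: random.Random, stop_probability: float = 0.2) -> str:
    """Derive a lowercase string of 1 to 35 letters, mimicking the GS rule."""
    # CH01 -> CH CH02: the first letter is mandatory.
    chars = [rng.choice(string.ascii_lowercase)]
    # CH02 .. CH35: at each step either stop (λ) or emit one more letter.
    for _ in range(2, MAX_LEN + 1):
        if rng.random() < stop_probability:
            break
        chars.append(rng.choice(string.ascii_lowercase))
    return "".join(chars)


rng = random.Random(0)
print(generate_string(rng))
```

Unrolling the length bound into numbered nonterminals keeps the grammar context-free, at the cost of one pair of rules per permitted character position.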

GC → sweden
GC → england
GC → france
GC → uruguay
GC → jamaica
GC → new zealand

GPNROM → GYO GM 239
GPNROF → GYO GM 238

GPNRYM → GYY GM 239


GPNRYF → GYY GM 238
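The four personal-number rules differ only in the century (GYO gives 19, GYY gives 20) and in the fixed ending (239 or 238). A recognizer for strings of that shape can be sketched in Python as below. The function name `classify_pnr`, the regular expression and the month-length table (ordinary calendar lengths, with February fixed at 28) are assumptions of this example, and the sketch checks only the shape given by the rules, not any real check-digit scheme.

```python
import re

# Month lengths assumed for this sketch; February is capped at 28.
DAYS_IN_MONTH = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]

# century, two year digits, month, day, fixed ending
PNR_SHAPE = re.compile(r"(19|20)(\d{2})(\d{2})(\d{2})(238|239)")


def classify_pnr(s: str):
    """Return the name of the rule that could derive s, or None."""
    m = PNR_SHAPE.fullmatch(s)
    if m is None:
        return None
    century, _, month, day, ending = m.groups()
    month, day = int(month), int(day)
    if not 1 <= month <= 12 or not 1 <= day <= DAYS_IN_MONTH[month - 1]:
        return None
    age = "O" if century == "19" else "Y"
    sex = "M" if ending == "239" else "F"
    return f"GPNR{age}{sex}"


print(classify_pnr("19951231239"))  # GPNROM
print(classify_pnr("20040229238"))  # None
```

The second call returns None because the day rules never produce 29 February, so the string lies outside the generated language even though it is a real calendar date.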
