a new parser - haskell community server
TRANSCRIPT
© Neil Mitchell 2004 2
Disclaimer
This is not something I have done as part of my PhDI have done it on my ownI haven’t researched other systemsIts not finishedAny claims may turn out to be wrong
© Neil Mitchell 2004 3
Overview
What is parsing?What systems exist?What do you want to parse?What is my system?Why is mine better (or not)?How do you interact with a parser?
© Neil Mitchell 2004 4
Parsing is a function
From Text to Abstract Syntax TreeParse :: String -> Tree(Token)How?
Hand codedUsing Lex/Yacc (or Flex/Bison)Parser combinatiorsEtc…
© Neil Mitchell 2004 5
What Systems Exist?
Lex/Yacc – the classicCreated to write parsers for CSteps
Write a grammar file, including C codeGenerate a C fileCompile C fileLink to your code
© Neil Mitchell 2004 6
Lex
Lex :: String -> List(Token)
Uses Regular Expressions to split up a string into various lexemes.
Runs in O(n), using Finite State Automata.
© Neil Mitchell 2004 7
Yacc
Yacc :: List(Token) -> Tree(Token)
Based on a BNF grammar.Runs in just over O(n), using an LALR(1)
stack automaton.Often fails unpredictably…
© Neil Mitchell 2004 8
Early
Early :: List(Token) -> Tree(Token)
Almost identical to Yacc, but removes the unpredictable failures, requiring less knowledge of LALR(1)
A fair bit slower, worst case of O(n3) or O(n2.6) depending on implementation.
© Neil Mitchell 2004 9
Parser
ParserC = Yacc . LexParserHaskell = Happy . AlexParserJava = …
Very language dependant, Yacc/Lex both tied to C
© Neil Mitchell 2004 10
Bad points
Language dependantYacc – shift/reduce conflictNot CFGNot very intuitive to write Yacc
Summary: Lex good, Yacc bad
© Neil Mitchell 2004 11
What do you want to parse?
Languages: Haskell, C#, JavaConfigurations: INI, XMLGrammar files for this toolNOT: Perl, Latex, HTML, C++
Insane syntaxHorrid historyTwisted parody of languages
© Neil Mitchell 2004 12
Brackets, Strings, Escapes
Brackets () [] {} <> - YaccStrings “” ‘’ – LexAre strings not brackets, just which disallow nesting?What about escape characters?
Parse them in Lex: “((\.)|.)*”Re-parse them later
© Neil Mitchell 2004 13
My System
Bracket :: String -> Tree(Token)Lex :: String -> List(Token)Group :: Tree(Token) -> Tree(Token)
Parser :: String -> Tree(Token)Parser = Group . map Lex . Bracket
© Neil Mitchell 2004 14
Bracket
Match brackets, strings, escape charsDefine nesting
main = 0 * all [lexer]all : round stringround = "(" ")" allstring = "\"" "\"" escape [raw]escape = "\\" 1
© Neil Mitchell 2004 15
Lex
Same as traditional Lex, but…Easier – no need to do string escapingCan be different for different parts
In comments use [none]In strings use [raw]Can have many lexers for different parts
© Neil Mitchell 2004 16
Lex (2)
keyword = `[a-zA-Z][a-zA-Z0-9_]*`number = `[0-9]+`white = `[ \t]`star = "*"eq = "="for.while.
© Neil Mitchell 2004 17
Group
Group :: Tree(Token) -> Tree(Token)Id :: a -> a
Therefore "Group = Id" works
Sometimes you need a higher level of structure, what the brackets meanThe most complex element (unfortunately)
© Neil Mitchell 2004 18
Group (2)
root = main[*{rule literal}]
rule = line[ keyword eq {regexp string} ]
literal = line[ keyword dot ]
© Neil Mitchell 2004 19
Summary of BLG
Complete lack of embedded C/HaskellData format defined generically
Can be Haskell linked listCan be C arrayThere is an XML format defined
Similar in style to each otherAll "simple" langauges
© Neil Mitchell 2004 20
Implementation
BracketDeterministic Push down stack automata
LexSteal existing lex, FSA
GroupFSA? Maybe…Have a sketched automaton
© Neil Mitchell 2004 21
Implementation (2)
I have implemented most of it in C#Slow, but very useableBracket seems pretty perfectLex uses Regex objects, but worksGroup is less complete, uses backtracking, doesn't have maximal munch semantics, NP, etc.
© Neil Mitchell 2004 22
Implementation (3)
BLG is self-parsing ☺1 Lex file for all 31 Bracket file for all 33 Group files, one each
Reuse is good
© Neil Mitchell 2004 23
Interaction
How do you interact with a parser?Yacc/Lex
Translate, Compile, Link, Execute
BLGTranslate, Compile, Link, ExecuteCompile into resource fileLoad at runtime (Text Editors)
© Neil Mitchell 2004 24
Exclamations!
BLG defines a complete set of exclamations which allow for code hoisting and deletingRemove tokens from the output (white space/comments)Promote tokens, i.e. line![ x ] returns xSimple, but ignored here
© Neil Mitchell 2004 25
$Directives
In the Bracket, before any processingStream processing directives$text (remove '\r', append '\n')$tab-indent (for Haskell/Python)$upper-caseEasy, simple, generic, reusable
© Neil Mitchell 2004 26
Advantages
Language neutralHaskell parsing
GHC in HaskellHugs in CCould now use the same grammar
Can reuse elements, i.e. Lex and Bracket are almost identical for C#/Java
© Neil Mitchell 2004 27
But best of all
The grammars are really easy to specifyA bit of a leapWould need years of hypothesis testingAnd maybe even a working implementation
FasterAlmost irrelevant, thanks to faster computers