Treetop - I'd rather have one problem

DESCRIPTION

Talk given at LRUG in May 2009 about Treetop, a Ruby parsing expression grammar (PEG) library. It should hopefully convince you that parsers fit better than regular expressions in quite a few cases.

TRANSCRIPT

Page 1: Treetop - I'd rather have one problem

Some people, when faced with a problem think, “I know, I’ll use regular expressions”.

Now they have two problems.

I’d rather have one problem.

Treetop • Roland Swingler • LRUG May 2009

Tuesday, 19 May 2009

This quotation is used a lot in presentations, normally before the presenter delves into some gnarly regexps. I’m looking for a better way.

Page 2: Treetop - I'd rather have one problem

Example 1

Page 3: Treetop - I'd rather have one problem

I run a film listing site: http://filmli.st. All the data is scraped from other sites. Getting the data is easy with net/http, httparty or similar, and so is parsing the HTML with nokogiri or hpricot, but...

Page 4: Treetop - I'd rather have one problem

<span>Fri/Sun-Tue 10.45 12.30 (Tue) 12.40 (not Tue) 4.00 7.00 9.30; Wed 3.00 7.30 9.00</span>

... you still need to turn a text string like this into a list of Times so you can do interesting things with it. Regexps? No. That way lies madness.

Page 5: Treetop - I'd rather have one problem

Example 2

Page 6: Treetop - I'd rather have one problem

Chatroom bots need to be able to distinguish between messages that they should take actions on and those which they should ignore. How should we define what messages they should listen out for?

Page 7: Treetop - I'd rather have one problem

/^\s*whereis\s+(.+?)(?:\s+(?:on\s+)?(.+?))?\s*$/

Regular expressions? Pretty confusing.
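To see what that regexp actually captures, here is a small plain-Ruby sketch that runs it against a few messages. The constant name WHEREIS and the sample messages are made up for illustration; only the pattern itself comes from the slide.

```ruby
# The pattern from the slide, given a (hypothetical) constant name.
WHEREIS = /^\s*whereis\s+(.+?)(?:\s+(?:on\s+)?(.+?))?\s*$/

# Group 1 is the person, group 2 is the optional day.
with_day    = WHEREIS.match("whereis roland on tuesday")
without_on  = WHEREIS.match("whereis roland tuesday")
person_only = WHEREIS.match("whereis roland")

puts with_day.captures.inspect    # person and day
puts without_on.captures.inspect  # the "on" is optional
puts person_only.captures.inspect # the day group is nil when absent
```

Nothing in the pattern tells you which group means what, so you have to remember the positions yourself.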

Page 8: Treetop - I'd rather have one problem

whereis <person> [[on] <day>]

Much nicer to have a simpler language.

Page 9: Treetop - I'd rather have one problem

Example 3

Page 10: Treetop - I'd rather have one problem

Scenario: producing human-readable tests
  Given I have non-technical stakeholders
  When I write some integration tests
  Then they should be understandable by everyone

Wouldn’t it be great if someone had written a library like this?

Page 11: Treetop - I'd rather have one problem

They have! Cucumber. Cucumber’s implementation got me started looking into...

Page 12: Treetop - I'd rather have one problem

Treetop. A Ruby Parsing Expression Grammar (PEG) library. Basically a parser generator, but really simple.

Page 13: Treetop - I'd rather have one problem

What is a parser?

A parser determines whether strings are syntactically valid according to a set of rules known as a grammar.

Page 14: Treetop - I'd rather have one problem

Yes / No

From a theoretical viewpoint, parsers just say true or false, depending on whether the string is valid or not.

Page 15: Treetop - I'd rather have one problem

Syntax Tree

Not so useful, so instead we get back a syntax tree we can do useful things with.

Page 16: Treetop - I'd rather have one problem

whereis <person> [on <day>]

Let’s try building a tree for this example. You can consider a string to be a list of characters, but to start getting meaning from it, you need a tree.

Page 17: Treetop - I'd rather have one problem

whereis <person> [on <day>]

words words

We have some words...

Page 18: Treetop - I'd rather have one problem

whereis <person> [on <day>]

words variable words variable

variables...

Page 19: Treetop - I'd rather have one problem

whereis <person> [on <day>]

words
variable
optional part
    words
    variable

an optional part of an expression (enclosed with square brackets)

Page 20: Treetop - I'd rather have one problem

whereis <person> [on <day>]

expression
    words
    variable
    optional part
        words
        variable

and a root node for the whole expression
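As a rough illustration of that tree, here is a hand-built version in plain Ruby. TreeNode, leaf and branch are hypothetical names invented for this sketch, not Treetop classes. The sketch follows the grammar slightly more literally than the diagram does, so the bracket characters and the space before the optional part appear as leaves too.

```ruby
# Hypothetical stand-in for a syntax node: a type, its own text (leaves only)
# and a list of children (branches only).
TreeNode = Struct.new(:type, :text, :children) do
  # A leaf returns its own text; a branch concatenates its children's text.
  def text_value
    children.empty? ? text : children.map(&:text_value).join
  end
end

def leaf(type, text)
  TreeNode.new(type, text, [])
end

def branch(type, *children)
  TreeNode.new(type, nil, children)
end

tree = branch(:expression,
  leaf(:words, "whereis "),
  leaf(:variable, "<person>"),
  leaf(:words, " "),
  branch(:optional_part,
    leaf(:literal, "["),
    branch(:expression,
      leaf(:words, "on "),
      leaf(:variable, "<day>")),
    leaf(:literal, "]")))

puts tree.text_value                   # the original string, rebuilt from the leaves
puts tree.children.map(&:type).inspect # the types directly under the root
```

The list of characters is still recoverable from the leaves, but the tree now also records which part of the message each run of characters belongs to.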

Page 21: Treetop - I'd rather have one problem

grammar Message
end

Let’s build that up in Treetop. Each of those four types of node in the tree is going to have a rule. We write these rules in a grammar, which you can think of as being like a Ruby module.

Page 22: Treetop - I'd rather have one problem

grammar Message
  rule expression
    (words / variable / optional_part)+
  end
end

The first rule for the whole expression. Lots of things should be familiar from regular expressions - ‘+’ for one or more, brackets for grouping, and ‘/’ is like the regexp ‘|’ for alternation. So this says an expression is one or more words, variables or optional parts, in any order.

Page 23: Treetop - I'd rather have one problem

grammar Message
  rule expression
    (words / variable / optional_part)+
  end

  rule words
    [^><\[\]]+
  end
end

words - character classes, just like regexps

Page 24: Treetop - I'd rather have one problem

grammar Message
  rule expression
    (words / variable / optional_part)+
  end

  rule words
    [^><\[\]]+
  end

  rule variable
    '<' identifier:( [a-zA-Z_] [a-zA-Z_0-9 ]* ) '>'
  end
end

variables are enclosed with angle brackets, can be any valid ruby identifier string, and are labeled so we can use part of the text later.

Page 25: Treetop - I'd rather have one problem

grammar Message
  rule expression
    (words / variable / optional_part)+
  end

  rule words
    [^><\[\]]+
  end

  rule variable
    '<' identifier:( [a-zA-Z_] [a-zA-Z_0-9 ]* ) '>'
  end

  rule optional_part
    "[" expression "]"
  end
end

optional parts are enclosed with square brackets. Here we see that rules can be recursive - which makes the parser significantly more powerful than regular expressions.

Page 26: Treetop - I'd rather have one problem

$ tt message.treetop

We compile the grammar with the tt command-line tool. You can also load grammars dynamically.

Page 27: Treetop - I'd rather have one problem

require 'message'

parser = MessageParser.new
tree = parser.parse("whereis <person>...")

This gives us a parser we can call from Ruby code.

Page 28: Treetop - I'd rather have one problem

require 'message'

parser = MessageParser.new
tree = parser.parse("whereis <person>...")

tree.elements[0].text_value            #=> "whereis "
tree.elements[1].identifier.text_value #=> "person"

each node knows about its children and its text_value. The label we defined earlier provides sugar methods to access particular subnodes.

Page 29: Treetop - I'd rather have one problem

Fri/Sun-Tue 4.00 7.00

Another example. This time we’ll think about the tree in a top down fashion rather than bottom up. This is closer to how treetop will actually evaluate an expression.

Page 30: Treetop - I'd rather have one problem

Fri/Sun-Tue 4.00 7.00

expression

Page 31: Treetop - I'd rather have one problem

Fri/Sun-Tue 4.00 7.00

expression
    days
    times

Page 32: Treetop - I'd rather have one problem

Fri / Sun-Tue 4.00 7.00

expression
    days
        day
        day range
    times
        time
        time

Page 33: Treetop - I'd rather have one problem

Fri / Sun - Tue 4 . 00 7 . 00

expression
    days
        day
        day range
            day
            day
    times
        time
            hrs
            mins
        time
            hrs
            mins

Page 34: Treetop - I'd rather have one problem

rule expression
  days " " times
end

Page 35: Treetop - I'd rather have one problem

rule times
  time (" " time)+
end

rule time
  hours "." minutes
end

rule hours
  "1" [0-2] / [0-9]
end

rule minutes
  [0-5] [0-9]
end

Page 36: Treetop - I'd rather have one problem

rule days
  (day !"-" / day_range) ("/" days)?
end

rule day_range
  day "-" day
end

rule day
  "Mon" / "Tue" / "Wed" / "Thu" / "Fri" / "Sat" / "Sun"
end

The !"-" after day (highlighted in red on the slide) is a negative lookahead assertion. We need this because Treetop evaluates alternatives from left to right and commits to the first one that matches. Without the assertion, Sun-Tue would match Sun as a day, not a day_range, and we’d be left with “-Tue”, which isn’t valid.
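The effect of the lookahead can be imitated in plain Ruby with StringScanner from the standard library. This is only a hand-rolled sketch of ordered choice, not how Treetop is implemented. The names DAY, DAY_RANGE, naive_item and guarded_item are made up for illustration.

```ruby
require "strscan"

DAY       = /Mon|Tue|Wed|Thu|Fri|Sat|Sun/
DAY_RANGE = /#{DAY}-#{DAY}/

# Ordered choice without the assertion: try `day` first, then `day_range`.
def naive_item(scanner)
  scanner.scan(DAY) || scanner.scan(DAY_RANGE)
end

# Ordered choice with the assertion: `day` only succeeds when it is
# not followed by "-", so a range falls through to the second alternative.
def guarded_item(scanner)
  scanner.scan(/#{DAY}(?!-)/) || scanner.scan(DAY_RANGE)
end

naive = StringScanner.new("Sun-Tue")
naive_match = naive_item(naive)
puts "naive:   matched #{naive_match.inspect}, left #{naive.rest.inspect}"

guarded = StringScanner.new("Sun-Tue")
guarded_match = guarded_item(guarded)
puts "guarded: matched #{guarded_match.inspect}, left #{guarded.rest.inspect}"
```

The naive version stops after "Sun" and strands "-Tue", while the guarded version consumes the whole range.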

Page 37: Treetop - I'd rather have one problem

Enriching Nodes

Adding in some semantics

Page 38: Treetop - I'd rather have one problem

rule time
  hours "." minutes
end

irb> aTimeNode.text_value       #=> "9.00"
irb> aTimeNode.elements.size    #=> 3
irb> aTimeNode.hours.text_value #=> "9"

Page 39: Treetop - I'd rather have one problem

rule time
  hours "." minutes {
    def to_seconds
      # text_value is the matched string; a plain SyntaxNode has no to_i
      hours.text_value.to_i * 60 * 60 + minutes.text_value.to_i * 60
    end
  }
end

irb> aTimeNode.text_value #=> "9.00"
irb> aTimeNode.to_seconds #=> 32400

We can add methods inline in the grammar. The block behaves like a module body, so we can write any Ruby we like in here.
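The arithmetic behind to_seconds can be checked without a parser. The helper below is a hypothetical plain-Ruby equivalent that splits the text itself instead of using parsed nodes. The name listing_time_to_seconds is invented for this sketch.

```ruby
# Hypothetical helper: convert an "h.mm" listing time into seconds.
def listing_time_to_seconds(text)
  hours, minutes = text.split(".").map(&:to_i)
  hours * 60 * 60 + minutes * 60
end

puts listing_time_to_seconds("9.00")  # 9 * 3600 = 32400
puts listing_time_to_seconds("12.40") # 12 * 3600 + 40 * 60 = 45600
```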

Page 40: Treetop - I'd rather have one problem

# in film_time.treetop
rule time
  hours "." minutes <TimeNode>
end

# in another .rb file
class TimeNode < Treetop::Runtime::SyntaxNode
  def to_seconds
    # text_value is the matched string; a plain SyntaxNode has no to_i
    hours.text_value.to_i * 60 * 60 + minutes.text_value.to_i * 60
  end
end

Cleaner in my mind to split these out into actual subclasses of SyntaxNode - keeps the grammar more readable. In some cases you need to have modules rather than subclasses.

Page 41: Treetop - I'd rather have one problem

Interpretation & Compilation

We’re going to build up a regular expression for the bot example. Each node will be responsible for building a different part of the regexp.

Page 42: Treetop - I'd rather have one problem

whereis <person> [on <day>]

/^whereis (.+?)(?:\s+on (.+?))?$/

expression
    words
    variable
    optional part
        words
        variable

Page 47: Treetop - I'd rather have one problem

Interpreter Pattern

The name is confusing - it comes from the GoF (Design Patterns) book. Actually we’re doing compilation here. Each node gets an interpret method - you treat the syntax tree as a composite.

Page 48: Treetop - I'd rather have one problem

# expression
def interpret
  children = elements.map { |node| node.interpret }
  Regexp.compile("^" + children.join + "$")
end

Page 49: Treetop - I'd rather have one problem

# words
def interpret
  Regexp.escape(text_value)
end

Page 50: Treetop - I'd rather have one problem

# variable
def interpret
  "(.+?)"
end

Page 51: Treetop - I'd rather have one problem

# optional_part
def interpret
  children = elements.map { |node| node.interpret }
  # single quotes keep \s as a regexp escape rather than a string escape
  '(?:\s+' + children.join + ')?'
end
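The four interpret methods can be tried out together without Treetop by standing in plain Structs for the syntax nodes. This is a simplified sketch: WordsNode, VariableNode, OptionalNode and ExpressionNode are hypothetical names, the tree is built by hand rather than parsed, and the bracket literals are left out.

```ruby
# Hypothetical stand-ins for the syntax nodes; each one builds its
# own fragment of the regexp source.
WordsNode = Struct.new(:text) do
  def interpret
    Regexp.escape(text)
  end
end

VariableNode = Struct.new(:name) do
  def interpret
    "(.+?)"
  end
end

OptionalNode = Struct.new(:children) do
  def interpret
    '(?:\s+' + children.map(&:interpret).join + ")?"
  end
end

ExpressionNode = Struct.new(:children) do
  # The root anchors the joined fragments and compiles them.
  def interpret
    Regexp.compile("^" + children.map(&:interpret).join + "$")
  end
end

# Hand-built tree for: whereis <person> [on <day>]
tree = ExpressionNode.new([
  WordsNode.new("whereis "),
  VariableNode.new("person"),
  OptionalNode.new([WordsNode.new("on "), VariableNode.new("day")])
])

matcher = tree.interpret
puts matcher.match("whereis roland on tuesday").captures.inspect
puts matcher.match("whereis roland").captures.inspect
```

Each node only knows about its own fragment, which is what makes the tree a composite: the root just joins whatever its children return.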

Page 52: Treetop - I'd rather have one problem

Adding context

For anything more than a simple language, you’ll need to pass around context as you interpret the tree.

Page 53: Treetop - I'd rather have one problem

# expression
def interpret(context=[])
  children = elements.map do |node|
    node.interpret(context)
  end
  matcher = Regexp.new("^" + children.join + "$")
  ...

In our case we just want to record the list of variable names, so an Array will suffice. Each interpret method now needs to take this context.

Page 54: Treetop - I'd rather have one problem

# variable
def interpret(context)
  context << identifier.text_value.to_sym
  "(.+?)"
end

Page 55: Treetop - I'd rather have one problem

# expression
def interpret(context=[])
  children = elements.map do |node|
    node.interpret(context)
  end
  matcher = Regexp.new("^" + children.join + "$")

  # define_singleton_method takes a block, so context stays in scope
  # (a "class << matcher" body cannot see this method's local variables)
  matcher.define_singleton_method(:variables) { context }
  matcher
end

We decorate the regular expression with a list of the variables. In the real code, the returned match objects are also decorated, so you have methods for each variable and don’t have to remember the captured groups by position.
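To show why the variable list is worth recording, here is a minimal sketch that pairs the names with the captured groups. It is not the real code's approach of decorating match objects. It returns a Hash instead, and the helper name named_captures_for is invented for illustration.

```ruby
# The regexp from the earlier slide, plus the variable names the
# interpret pass would have collected in its context array.
matcher   = /^whereis (.+?)(?:\s+on (.+?))?$/
variables = [:person, :day]

# Hypothetical helper: pair each variable name with its captured group.
# Returns nil when the message does not match at all.
def named_captures_for(matcher, variables, message)
  match = matcher.match(message)
  return nil unless match
  variables.zip(match.captures).to_h
end

full = named_captures_for(matcher, variables, "whereis roland on tuesday")
puts full[:person] # looked up by name rather than by position
puts full[:day]
```

Because the names are collected in the order the variable nodes are visited, they line up with the capture groups by position.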

Page 56: Treetop - I'd rather have one problem

Other Options

You can also build external interpreters / compilers that use the tree

Page 57: Treetop - I'd rather have one problem

Complications?

Page 58: Treetop - I'd rather have one problem

# We want to write:
hello [world]

# We actually mean:
hello[ world]

Whitespace shuffling. In the real code, the grammar is more complicated - most of the complication comes from dealing with edge cases like this.

Page 59: Treetop - I'd rather have one problem

# We should optimize:
hello [[[world]]]

# To this:
hello [world]

This isn’t done in the real code, but should be.
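Since this optimisation isn't in the real code, the following is only a sketch of one way it could work: walk the tree and collapse any optional part whose only child is itself an optional part. OptNode, WordNode and simplify are hypothetical names, and the tree is built by hand.

```ruby
# Hypothetical node types for the sketch.
OptNode  = Struct.new(:children)
WordNode = Struct.new(:text)

# Collapse chains of directly nested optional parts into a single one.
# Optional-inside-optional matches exactly what one optional matches.
def simplify(node)
  return node unless node.is_a?(OptNode)

  children = node.children.map { |child| simplify(child) }
  if children.size == 1 && children.first.is_a?(OptNode)
    children.first
  else
    OptNode.new(children)
  end
end

# [[[world]]]
nested = OptNode.new([OptNode.new([OptNode.new([WordNode.new("world")])])])
flat = simplify(nested)
puts flat.inspect
```

An optional part with more than one child, such as [hello [world]], keeps its shape because the inner optional is not the only child.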

Page 60: Treetop - I'd rather have one problem

# Left recursion without consuming input: BAD
rule infinity_and_beyond
  infinity_and_beyond / "foo"
end

Page 61: Treetop - I'd rather have one problem

Problems?

Slow.

Page 62: Treetop - I'd rather have one problem

Other libraries

Racc - accepts yacc-style grammars. The Racc runtime is part of the Ruby standard distribution, so once you’ve built your parser there is no extra dependency.
Ragel - used by Mongrel and Thin.

Page 63: Treetop - I'd rather have one problem

Thanks!

Twitter: @knaveofdiamonds

XMPP bot: http://github.com/knaveofdiamonds/harken

Film listings for London’s indie cinemas: http://filmli.st

Treetop: http://github.com/nathansobo/treetop and http://treetop.rubyforge.org