An introduction on language processing

Upload: ralf-laemmel

Post on 06-May-2015

Page 1: An introduction on language processing

© 2012-13 Ralf Lämmel, Software Language Engineering http://softlang.wikidot.com/course:sle

Language processing

Course "Software Language Engineering"

University of Koblenz-Landau

Department of Computer Science

Ralf Lämmel

Software Languages Team

Page 2: An introduction on language processing

Summary

This is an introduction to language processing.

We classify components of a language definition.

We classify components in language processing.

We illustrate language processing for a simple language.

The processing components are written in Haskell.

Page 3: An introduction on language processing

Language definition components

Concrete syntax (useful for parsing and humans)

Abstract syntax (useful for most processing)

Type system

Extra rules

Dynamic semantics

Translation semantics

Pragmatics

Page 4: An introduction on language processing

Language processing components

Recognizer (execute concrete syntax)

Parser (return parse/syntax trees)

Option 1: Return concrete parse trees

Option 2: Return abstract syntax trees

Imploder (maps parse trees to abstract syntax trees)

Pretty printer or unparser (return text again)

Page 5: An introduction on language processing

Language processing components (cont’d)

Type checker

Interpreter

Compiler to machine language

Translator / generator to high-level language

Software visualizers (e.g., call graph or flow chart)

Software analyzers (e.g., dead-code detection)

Software transformers (e.g., dead-code elimination)

Software metrics tools (e.g., cyclomatic complexity)

IDE integration (coloring, navigation, ...)

Page 6: An introduction on language processing

How to build language processors?

What programming techniques to use?

What programming technologies to use?

Code generators

APIs / combinator libraries

Metaprogramming frameworks

How to leverage language definitions?

Let’s get some data points today.

Page 7: An introduction on language processing

Haskell-based language processors for a simple imperative language

Page 8: An introduction on language processing

A Pico example

begin declare input : natural,
              output : natural,
              repnr : natural,
              rep : natural;
      input := 14;
      output := 1;
      while input - 1 do
          rep := output;
          repnr := input;
          while repnr - 1 do
             output := output + rep;
             repnr := repnr - 1
          od;
          input := input - 1
      od
end

Factorial function in the simple imperative language Pico

Pico is an illustrative language which has been used by the SEN1/SWAT team at CWI Amsterdam for many years.

Page 9: An introduction on language processing

Components for Pico

Abstract syntax

Recognizer

Parser

Type checker

Interpreter

Pretty printer

Assembly code

Compiler

Machine

Flow charts

Visualizer

https://github.com/slecourse/slecourse/tree/master/sources/pico/

Page 10: An introduction on language processing

Abstract syntax

type Name = String
data Type = NatType | StrType
type Program = ([Decl], [Stm])
type Decl = (Name,Type)
data Stm = Assign Name Expr
         | IfStm Expr [Stm] [Stm]
         | While Expr [Stm]
data Expr = Id Name
          | NatCon Int
          | StrCon String
          | Add Expr Expr
          | Sub Expr Expr
          | Conc Expr Expr

+ implementation details (“deriving”)

See Haskell code online: AbstractSyntax.hs
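
To make the abstract syntax concrete, the following sketch writes the factorial program from the "A Pico example" slide as a term. The data types are the ones above; the `deriving` clauses and the name `factorial` are assumptions made for this illustration and are not taken from AbstractSyntax.hs.

```haskell
-- Abstract syntax as on the slide; the derived instances are an assumption.
type Name = String
data Type = NatType | StrType deriving (Eq, Show)
type Program = ([Decl], [Stm])
type Decl = (Name, Type)
data Stm
  = Assign Name Expr
  | IfStm Expr [Stm] [Stm]
  | While Expr [Stm]
  deriving (Eq, Show)
data Expr
  = Id Name
  | NatCon Int
  | StrCon String
  | Add Expr Expr
  | Sub Expr Expr
  | Conc Expr Expr
  deriving (Eq, Show)

-- The Pico factorial program as an abstract syntax term (hypothetical name).
factorial :: Program
factorial =
  ( [ ("input", NatType), ("output", NatType)
    , ("repnr", NatType), ("rep", NatType) ]
  , [ Assign "input" (NatCon 14)
    , Assign "output" (NatCon 1)
    , While (Sub (Id "input") (NatCon 1))
        [ Assign "rep" (Id "output")
        , Assign "repnr" (Id "input")
        , While (Sub (Id "repnr") (NatCon 1))
            [ Assign "output" (Add (Id "output") (Id "rep"))
            , Assign "repnr" (Sub (Id "repnr") (NatCon 1))
            ]
        , Assign "input" (Sub (Id "input") (NatCon 1))
        ]
    ]
  )
```

Note that keywords, separators and layout are gone; only the structure that later components need is left.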

Page 11: An introduction on language processing

Concrete syntax to be recognized

Program = "begin" Declarations {Statement ";"}* "end" ;
Declarations = "declare" {Declaration ","}* ";" ;
Declaration = Id ":" Type ;
Type = "natural" | "string" ;

Statement = Id ":=" Expression
          | "if" Expression "then" {Statement ";"}* "else" {Statement ";"}* "fi"
          | "while" Expression "do" {Statement ";"}* "od" ;

Expression = Id
           | String
           | Natural
           | "(" Expression ")"
           | Expression "+" Expression
           | Expression "-" Expression
           | Expression "||" Expression ;

Id = [a-z][a-z0-9]* !>> [a-z0-9] ;
Natural = [0-9]+ ;
String = "\"" ![\"]* "\"" ;

We need to implement this grammar.

Page 12: An introduction on language processing

Syntax vs. recognizer

Syntax definition may be declarative / technology-independent.

Unambiguous grammar needed for recognition.

Think of dangling else or operator priorities.

Recognizers may be hand-written.

Recognizer specs may be non-declarative / technology-dependent.

The same is true for parsers.

Page 13: An introduction on language processing

Recognizer

Leverage parser combinators (Parsec)

Recognizer = functional program.

Parser combinators are monadic.

See Haskell code online: Recognizer.hs
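
As an illustration of "recognizer = functional program", here is a minimal Parsec sketch for the Pico grammar. It is not the course's Recognizer.hs. The helper names (`lexeme`, `symbol`, `keyword`, `recognize`) are hypothetical, the left-recursive Expression rule is rewritten as "term followed by operator/term pairs", and excluding keywords from identifiers is an assumption.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Run a parser and skip the whitespace that follows the token.
lexeme :: Parser a -> Parser a
lexeme p = p <* spaces

-- Recognize a fixed symbol such as ":=" or ";".
symbol :: String -> Parser ()
symbol s = lexeme (try (string s)) >> return ()

-- Recognize a keyword that is not merely the prefix of an identifier.
keyword :: String -> Parser ()
keyword s = lexeme (try (string s >> notFollowedBy alphaNum))

reserved :: [String]
reserved = words "begin end declare natural string if then else fi while do od"

-- Id = [a-z][a-z0-9]*, excluding keywords (the exclusion is an assumption).
identifier :: Parser ()
identifier = lexeme (try (do
  x <- (:) <$> lower <*> many (lower <|> digit)
  if x `elem` reserved then unexpected ("keyword " ++ x) else return ()))

natural :: Parser ()
natural = lexeme (skipMany1 digit)

stringLit :: Parser ()
stringLit = lexeme (char '"' >> skipMany (noneOf "\"") >> char '"' >> return ())

-- Left recursion removed: a term followed by any number of (operator, term).
expression :: Parser ()
expression = term >> skipMany (operator >> term)
  where
    operator = symbol "+" <|> symbol "-" <|> symbol "||"
    term = natural <|> stringLit <|> identifier
       <|> (symbol "(" >> expression >> symbol ")")

statement :: Parser ()
statement = ifStm <|> while <|> assign
  where
    assign = identifier >> symbol ":=" >> expression
    ifStm = keyword "if" >> expression >> keyword "then" >> statements
         >> keyword "else" >> statements >> keyword "fi"
    while = keyword "while" >> expression >> keyword "do" >> statements
         >> keyword "od"

-- {Statement ";"}* is a ";"-separated list without a trailing separator.
statements :: Parser ()
statements = sepBy statement (symbol ";") >> return ()

declarations :: Parser ()
declarations = keyword "declare" >> sepBy declaration (symbol ",") >> symbol ";"
  where
    declaration = identifier >> symbol ":" >> (keyword "natural" <|> keyword "string")

program :: Parser ()
program = spaces >> keyword "begin" >> declarations >> statements
       >> keyword "end" >> eof

-- True if and only if the input is a syntactically valid Pico program.
recognize :: String -> Bool
recognize = either (const False) (const True) . parse program "<pico>"
```

The recognizer returns no tree at all; it only answers whether the input belongs to the language.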

Page 14: An introduction on language processing

Parsec

(http://www.haskell.org/haskellwiki/Parsec, 23 April 2013)

“Parsec is an industrial strength, monadic parser combinator library for Haskell. It can parse context-sensitive, infinite look-ahead grammars but it performs best on predictive (LL[1]) grammars.”

See also: http://101companies.org/wiki/Technology:Parsec

http://hackage.haskell.org/packages/archive/parsec/3.1.3/doc/html/Text-Parsec.html

Page 15: An introduction on language processing

Parser

Add synthesis of abstract syntax terms to recognizer.

Concrete and abstract syntax are thereby coupled.

See Haskell code online: Parser.hs
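
The step from recognizer to parser can be sketched as follows: each Parsec parser now returns an abstract syntax term instead of `()`. This is a sketch and not the course's Parser.hs. The helper names and `parseProgram` are hypothetical, and treating all three binary operators as left-associative with equal priority is an assumption.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

type Name = String
data Type = NatType | StrType deriving (Eq, Show)
type Program = ([Decl], [Stm])
type Decl = (Name, Type)
data Stm = Assign Name Expr | IfStm Expr [Stm] [Stm] | While Expr [Stm]
  deriving (Eq, Show)
data Expr = Id Name | NatCon Int | StrCon String
          | Add Expr Expr | Sub Expr Expr | Conc Expr Expr
  deriving (Eq, Show)

lexeme :: Parser a -> Parser a
lexeme p = p <* spaces

symbol :: String -> Parser String
symbol = lexeme . try . string

keyword :: String -> Parser ()
keyword s = lexeme (try (string s >> notFollowedBy alphaNum))

-- Identifiers; keywords are rejected (an assumption).
name :: Parser Name
name = lexeme (try (do
  x <- (:) <$> lower <*> many (lower <|> digit)
  if x `elem` reserved then unexpected ("keyword " ++ x) else return x))
  where
    reserved = words "begin end declare natural string if then else fi while do od"

-- chainl1 builds left-nested terms, which resolves the ambiguity of the
-- binary operators in the grammar.
expr :: Parser Expr
expr = term `chainl1` operator
  where
    operator = (Add <$ symbol "+") <|> (Sub <$ symbol "-") <|> (Conc <$ symbol "||")
    term = (NatCon . read <$> lexeme (many1 digit))
       <|> (StrCon <$> lexeme (char '"' *> many (noneOf "\"") <* char '"'))
       <|> (Id <$> name)
       <|> (symbol "(" *> expr <* symbol ")")

stm :: Parser Stm
stm = ifStm <|> while <|> assign
  where
    assign = Assign <$> name <* symbol ":=" <*> expr
    ifStm = IfStm <$> (keyword "if" *> expr)
                  <*> (keyword "then" *> stms)
                  <*> (keyword "else" *> stms <* keyword "fi")
    while = While <$> (keyword "while" *> expr)
                  <*> (keyword "do" *> stms <* keyword "od")

stms :: Parser [Stm]
stms = stm `sepBy` symbol ";"

program :: Parser Program
program = (,) <$> (spaces *> keyword "begin" *> decls)
              <*> (stms <* keyword "end" <* eof)
  where
    decls = keyword "declare" *> (decl `sepBy` symbol ",") <* symbol ";"
    decl = (,) <$> name <* symbol ":" <*> typ
    typ = (NatType <$ keyword "natural") <|> (StrType <$ keyword "string")

parseProgram :: String -> Either ParseError Program
parseProgram = parse program "<pico>"
```

The coupling mentioned above is visible in the code: every constructor application (`Assign`, `While`, ...) sits right next to the concrete syntax it is built from.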

Page 16: An introduction on language processing

Type checker

Check that ...

declarations are unambiguous (no variable is declared twice);

all referenced variables are declared;

all expected operand types agree with actual ones;

...

See Haskell code online: TypeChecker.hs
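
The three checks listed above can be sketched over the abstract syntax as follows. This is an illustration and not the course's TypeChecker.hs. The function names are hypothetical, and the rule that conditions of `if` and `while` must be of type natural is an assumption (it is consistent with `while input - 1 do` in the example). The sketch only answers yes or no and produces no error messages.

```haskell
import qualified Data.Map as Map
import Data.List (nub)

type Name = String
data Type = NatType | StrType deriving (Eq, Show)
type Program = ([Decl], [Stm])
type Decl = (Name, Type)
data Stm = Assign Name Expr | IfStm Expr [Stm] [Stm] | While Expr [Stm]
data Expr = Id Name | NatCon Int | StrCon String
          | Add Expr Expr | Sub Expr Expr | Conc Expr Expr

type Env = Map.Map Name Type

-- Type of an expression; Nothing for undeclared variables or type errors.
typeOf :: Env -> Expr -> Maybe Type
typeOf env (Id x)     = Map.lookup x env
typeOf _   (NatCon _) = Just NatType
typeOf _   (StrCon _) = Just StrType
typeOf env (Add l r)  = binary env NatType l r
typeOf env (Sub l r)  = binary env NatType l r
typeOf env (Conc l r) = binary env StrType l r

-- Both operands must have the expected type, which is also the result type.
binary :: Env -> Type -> Expr -> Expr -> Maybe Type
binary env t l r
  | typeOf env l == Just t && typeOf env r == Just t = Just t
  | otherwise = Nothing

okStm :: Env -> Stm -> Bool
okStm env (Assign x e) =
  Map.member x env && typeOf env e == Map.lookup x env
okStm env (IfStm c t e) =
  typeOf env c == Just NatType && all (okStm env) (t ++ e)
okStm env (While c b) =
  typeOf env c == Just NatType && all (okStm env) b

okProgram :: Program -> Bool
okProgram (ds, ss) = distinct && all (okStm env) ss
  where
    names = map fst ds
    distinct = length (nub names) == length names  -- no double declarations
    env = Map.fromList ds
```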

Page 17: An introduction on language processing

Type system vs. checker

Type systems ...

are based on formal/declarative specifications;

they are possibly executable (perhaps inefficiently).

Type checkers ...

implement type systems efficiently;

they provide useful error messages.

See Haskell code online: TypeChecker.hs

Page 18: An introduction on language processing

Interpreter

Leverage natural semantics.

Define it as a total function.

Use a special error result.

Error messages could also be provided that way.

See Haskell code online: Interpreter.hs
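
A sketch of an interpreter in this style: big-step evaluation over the abstract syntax, with `Nothing` as the special error result so that run-time errors do not crash the interpreter. It is not the course's Interpreter.hs. The names `Value`, `Store` and `run` are hypothetical. Initializing variables with 0 or the empty string, cutting subtraction off at 0, and treating 0 as "false" in conditions are assumptions.

```haskell
import qualified Data.Map as Map

type Name = String
data Type = NatType | StrType
type Program = ([(Name, Type)], [Stm])
data Stm = Assign Name Expr | IfStm Expr [Stm] [Stm] | While Expr [Stm]
data Expr = Id Name | NatCon Int | StrCon String
          | Add Expr Expr | Sub Expr Expr | Conc Expr Expr

data Value = NatVal Int | StrVal String deriving (Eq, Show)
type Store = Map.Map Name Value

-- Expression evaluation; Nothing is the error result.
evalExpr :: Store -> Expr -> Maybe Value
evalExpr s (Id x)     = Map.lookup x s
evalExpr _ (NatCon n) = Just (NatVal n)
evalExpr _ (StrCon t) = Just (StrVal t)
evalExpr s (Add l r)  = nat2 (+) s l r
evalExpr s (Sub l r)  = nat2 (\a b -> max 0 (a - b)) s l r  -- assumed cut-off
evalExpr s (Conc l r) = case (evalExpr s l, evalExpr s r) of
  (Just (StrVal a), Just (StrVal b)) -> Just (StrVal (a ++ b))
  _                                  -> Nothing

-- Apply a binary operation on naturals; any other operand is an error.
nat2 :: (Int -> Int -> Int) -> Store -> Expr -> Expr -> Maybe Value
nat2 f s l r = case (evalExpr s l, evalExpr s r) of
  (Just (NatVal a), Just (NatVal b)) -> Just (NatVal (f a b))
  _                                  -> Nothing

-- Statement execution as a store transformer with an error result.
execStm :: Stm -> Store -> Maybe Store
execStm (Assign x e) s
  | Map.member x s = fmap (\v -> Map.insert x v s) (evalExpr s e)
  | otherwise      = Nothing
execStm (IfStm c t e) s = do
  v <- evalExpr s c
  case v of
    NatVal 0 -> execStms e s
    NatVal _ -> execStms t s
    _        -> Nothing
execStm w@(While c b) s = do
  v <- evalExpr s c
  case v of
    NatVal 0 -> Just s
    NatVal _ -> execStms b s >>= execStm w
    _        -> Nothing

execStms :: [Stm] -> Store -> Maybe Store
execStms ss s = foldl (\acc st -> acc >>= execStm st) (Just s) ss

-- Run a program from a store that holds a default value per declaration.
run :: Program -> Maybe Store
run (ds, ss) = execStms ss (Map.fromList [ (x, initial t) | (x, t) <- ds ])
  where
    initial NatType = NatVal 0
    initial StrType = StrVal ""
```

The function is total only with respect to errors: a `while` loop whose condition never becomes 0 still makes `run` diverge. Replacing `Maybe` by `Either String` would be one way to attach error messages to the error result.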

Page 19: An introduction on language processing

Pretty printer

Map abstract syntax to text.

Composable documents.

Vertical / horizontal composition.

Indentation.

Another case of a combinator library (= DSL).

See Haskell code online: PrettyPrinter.hs
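
The ideas above (documents, horizontal and vertical composition, indentation) can be sketched with the `Text.PrettyPrint` combinators from the `pretty` package. Whether PrettyPrinter.hs uses this particular library is an assumption, and the function names are hypothetical. Right operands that are themselves binary expressions are parenthesized, which matches a left-associative reading of the operators.

```haskell
import Text.PrettyPrint

type Name = String
data Type = NatType | StrType
type Program = ([(Name, Type)], [Stm])
data Stm = Assign Name Expr | IfStm Expr [Stm] [Stm] | While Expr [Stm]
data Expr = Id Name | NatCon Int | StrCon String
          | Add Expr Expr | Sub Expr Expr | Conc Expr Expr

ppExpr :: Expr -> Doc
ppExpr (Id x)     = text x
ppExpr (NatCon n) = int n
ppExpr (StrCon s) = doubleQuotes (text s)
ppExpr (Add l r)  = ppBin "+" l r
ppExpr (Sub l r)  = ppBin "-" l r
ppExpr (Conc l r) = ppBin "||" l r

-- Horizontal composition; compound right operands get parentheses.
ppBin :: String -> Expr -> Expr -> Doc
ppBin op l r = ppExpr l <+> text op <+> ppArg r
  where
    ppArg e | isAtom e  = ppExpr e
            | otherwise = parens (ppExpr e)
    isAtom (Id _)     = True
    isAtom (NatCon _) = True
    isAtom (StrCon _) = True
    isAtom _          = False

-- Vertical composition (vcat) with indentation (nest) for bodies.
ppStm :: Stm -> Doc
ppStm (Assign x e) = text x <+> text ":=" <+> ppExpr e
ppStm (IfStm c t e) = vcat
  [ text "if" <+> ppExpr c <+> text "then"
  , nest 2 (ppStms t)
  , text "else"
  , nest 2 (ppStms e)
  , text "fi" ]
ppStm (While c b) = vcat
  [ text "while" <+> ppExpr c <+> text "do"
  , nest 2 (ppStms b)
  , text "od" ]

-- Statements are separated (not terminated) by semicolons.
ppStms :: [Stm] -> Doc
ppStms = vcat . punctuate semi . map ppStm

ppProgram :: Program -> Doc
ppProgram (ds, ss) = vcat
  [ text "begin"
  , nest 2 (hcat [text "declare" <+> vcat (punctuate comma (map ppDecl ds)), semi])
  , nest 2 (ppStms ss)
  , text "end" ]
  where
    ppDecl (x, t) = text x <+> colon <+> ppType t
    ppType NatType = text "natural"
    ppType StrType = text "string"
```

`render (ppProgram p)` turns the document into a string; layout decisions stay inside the library, which is what makes the combinators read like a small DSL.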

Page 20: An introduction on language processing

Assembly code

Stack-based assembly language.

Explicit notion of labels and gotos.

Could be translated to bytecode or x86 or ...

Page 21: An introduction on language processing

Assembly code

type Label = String
type Name = String

data Instr
  = DclInt Name      -- Reserve a memory location for an integer variable
  | DclStr Name      -- Reserve a memory location for a string variable
  | PushNat Int      -- Push integer constant on the stack
  | PushStr String   -- Push string constant on the stack
  | Rvalue Name      -- Push the value of a variable on the stack
  | Lvalue Name      -- Push the address of a variable on the stack
  | AssignOp         -- Assign value on top, to variable at address top-1
  | AddOp            -- Replace top two stack values by their sum
  | SubOp            -- Replace top two stack values by their difference
  | ConcOp           -- Replace top two stack values by their concatenation
  | Label Label      -- Associate a label with the next instruction
  | Go Label         -- Go to instruction with given label
  | GoZero Label     -- Go to instruction with given label, if top equals 0
  | GoNonZero Label  -- Go to instruction with given label, if top not equal to 0
  deriving (Eq, Show, Read)

See Haskell code online: AssemblyCode.hs

Page 22: An introduction on language processing

Compiler

Use state to keep track of label generator.

Generate machine instructions compositionally.

A stream could also be used for linear output.

See Haskell code online: Compiler.hs
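
A sketch of the compilation scheme described above: a `State` monad (here from the `mtl` package, an assumption) carries a counter for fresh labels, and the code for a construct is assembled from the code of its parts. This is not the course's Compiler.hs. The label format `L0`, `L1`, ... and the exact instruction sequences for `if` and `while` are assumptions.

```haskell
import Control.Monad.State (State, evalState, get, put)

type Name = String
type Label = String
data Type = NatType | StrType
type Program = ([(Name, Type)], [Stm])
data Stm = Assign Name Expr | IfStm Expr [Stm] [Stm] | While Expr [Stm]
data Expr = Id Name | NatCon Int | StrCon String
          | Add Expr Expr | Sub Expr Expr | Conc Expr Expr

data Instr
  = DclInt Name | DclStr Name | PushNat Int | PushStr String
  | Rvalue Name | Lvalue Name | AssignOp | AddOp | SubOp | ConcOp
  | Label Label | Go Label | GoZero Label | GoNonZero Label
  deriving (Eq, Show, Read)

-- Draw a fresh label from the counter kept in the state.
fresh :: State Int Label
fresh = do
  n <- get
  put (n + 1)
  return ("L" ++ show n)

-- Expressions need no labels: operands first, then the operator.
compileExpr :: Expr -> [Instr]
compileExpr (Id x)     = [Rvalue x]
compileExpr (NatCon n) = [PushNat n]
compileExpr (StrCon s) = [PushStr s]
compileExpr (Add l r)  = compileExpr l ++ compileExpr r ++ [AddOp]
compileExpr (Sub l r)  = compileExpr l ++ compileExpr r ++ [SubOp]
compileExpr (Conc l r) = compileExpr l ++ compileExpr r ++ [ConcOp]

compileStm :: Stm -> State Int [Instr]
compileStm (Assign x e) =
  return ([Lvalue x] ++ compileExpr e ++ [AssignOp])
compileStm (IfStm c t e) = do
  elseL <- fresh
  endL  <- fresh
  t' <- compileStms t
  e' <- compileStms e
  return (compileExpr c ++ [GoZero elseL] ++ t'
          ++ [Go endL, Label elseL] ++ e' ++ [Label endL])
compileStm (While c b) = do
  topL <- fresh
  endL <- fresh
  b' <- compileStms b
  return ([Label topL] ++ compileExpr c ++ [GoZero endL] ++ b'
          ++ [Go topL, Label endL])

compileStms :: [Stm] -> State Int [Instr]
compileStms = fmap concat . mapM compileStm

-- Declarations become reservations; statements follow.
compile :: Program -> [Instr]
compile (ds, ss) = map decl ds ++ evalState (compileStms ss) 0
  where
    decl (x, NatType) = DclInt x
    decl (x, StrType) = DclStr x
```

Because the state only threads the label counter, the instruction lists could also be emitted into an output stream instead of being concatenated.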

Page 23: An introduction on language processing

Machine

Very much like the interpreter.

Interpret the assembly code instructions.

Maintain a “memory” abstraction.

Maintain an “instruction” pointer.

See Haskell code online: Machine.hs
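
A sketch of such a machine for the assembly language: an instruction pointer (`pc`), an operand stack, and a memory that maps variable names to values. It is not the course's Machine.hs. The names `Value`, `Memory` and `run` are hypothetical. The assumptions are that `GoZero` and `GoNonZero` pop the tested value, that declared variables start as 0 or the empty string, and that subtraction is cut off at 0.

```haskell
import qualified Data.Map as Map

type Label = String
type Name = String
data Instr
  = DclInt Name | DclStr Name | PushNat Int | PushStr String
  | Rvalue Name | Lvalue Name | AssignOp | AddOp | SubOp | ConcOp
  | Label Label | Go Label | GoZero Label | GoNonZero Label
  deriving (Eq, Show, Read)

-- Stack values: naturals, strings, and addresses pushed by Lvalue.
data Value = NatVal Int | StrVal String | Addr Name deriving (Eq, Show)
type Memory = Map.Map Name Value

-- Execute the code; Nothing signals a stuck machine (e.g. stack underflow).
run :: [Instr] -> Maybe Memory
run code = step 0 [] Map.empty
  where
    -- Positions of all labels, computed once up front.
    labels = Map.fromList [ (l, i) | (i, Label l) <- zip [0 ..] code ]
    step pc stack mem
      | pc >= length code = Just mem
      | otherwise = case (code !! pc, stack) of
          (DclInt x, _)  -> next stack (Map.insert x (NatVal 0) mem)
          (DclStr x, _)  -> next stack (Map.insert x (StrVal "") mem)
          (PushNat n, _) -> next (NatVal n : stack) mem
          (PushStr s, _) -> next (StrVal s : stack) mem
          (Rvalue x, _)  -> Map.lookup x mem >>= \v -> next (v : stack) mem
          (Lvalue x, _)  -> next (Addr x : stack) mem
          (AssignOp, v : Addr x : rest) -> next rest (Map.insert x v mem)
          (AddOp, NatVal b : NatVal a : rest) ->
            next (NatVal (a + b) : rest) mem
          (SubOp, NatVal b : NatVal a : rest) ->
            next (NatVal (max 0 (a - b)) : rest) mem
          (ConcOp, StrVal b : StrVal a : rest) ->
            next (StrVal (a ++ b) : rest) mem
          (Label _, _) -> next stack mem
          (Go l, _)    -> jump l stack
          (GoZero l, NatVal n : rest) ->
            if n == 0 then jump l rest else next rest mem
          (GoNonZero l, NatVal n : rest) ->
            if n /= 0 then jump l rest else next rest mem
          _ -> Nothing
      where
        next = step (pc + 1)
        jump l st = Map.lookup l labels >>= \target -> step target st mem
```

The structure mirrors the interpreter: one case per construct and an error result, except that control flow is now explicit in the instruction pointer.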

Page 24: An introduction on language processing

Flow charts

begin declare input : natural,
              output : natural,
              repnr : natural,
              rep : natural;
      input := 14;
      output := 1;
      while input - 1 do
          rep := output;
          repnr := input;
          while repnr - 1 do
             output := output + rep;
             repnr := repnr - 1
          od;
          input := input - 1
      od
end

Page 25: An introduction on language processing

Flow charts

Define an abstract syntax of flow charts.

No concrete syntax needed.

Abstract syntax is mapped to “dot” (graphviz).

See Haskell code online: FlowChart.hs

Page 26: An introduction on language processing

Flow charts

See Haskell code online.

type FlowChart = ([Box], [Arrow])
type Box = (Id, BoxType)
type Id = String -- Identifier for boxes
data BoxType = Start | End | Decision Text | Activity Text
type Text = String -- Text to show
type Arrow = ((Id, FromType), Id)
data FromType = FromStart | FromActivity | FromYes | FromNo
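
Mapping this abstract syntax to "dot" text can be sketched as a plain function to `String`. The function name `toDot` is hypothetical, the port attributes used by the course's output are left out for brevity, and relying on `show` for quoting labels is an assumption that holds for simple ASCII text.

```haskell
type FlowChart = ([Box], [Arrow])
type Box = (Id, BoxType)
type Id = String -- Identifier for boxes
data BoxType = Start | End | Decision Text | Activity Text
type Text = String -- Text to show
type Arrow = ((Id, FromType), Id)
data FromType = FromStart | FromActivity | FromYes | FromNo

-- Render a flow chart as a "dot" digraph: one line per box and per arrow.
toDot :: FlowChart -> String
toDot (boxes, arrows) =
  unlines (["digraph FlowChart {"] ++ map box boxes ++ map arrow arrows ++ ["}"])
  where
    box (i, Start)      = node i "Start" "shape=box, style=bold"
    box (i, End)        = node i "End" "shape=box, style=bold"
    box (i, Decision t) = node i t "shape=diamond"
    box (i, Activity t) = node i t "shape=box"
    -- show adds the double quotes around the label.
    node i l attrs = "  " ++ i ++ " [label=" ++ show l ++ ", " ++ attrs ++ "];"
    arrow ((from, ft), to) =
      "  " ++ from ++ " -> " ++ to ++ " [label=" ++ show (edgeLabel ft) ++ "];"
    edgeLabel FromYes = "Yes"
    edgeLabel FromNo  = "No"
    edgeLabel _       = ""
```

Only a small sublanguage of "dot" is produced, which is why no concrete syntax or parser is needed on this side.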

Page 27: An introduction on language processing

“Dot” sublanguage for flowcharts

digraph FlowChart {
 id1 [label="Start", shape=box, style=bold];
 id2 [label="End", shape=box, style=bold];
 id3 [label=" input := 14 ", shape=box];
 id4 [label=" output := 1 ", shape=box];
 id5 [label=" input - 1 ", shape=diamond];
 id6 [label=" rep := output ", shape=box];
 id7 [label=" repnr := input ", shape=box];
 id8 [label=" repnr - 1 ", shape=diamond];
 id9 [label=" output := output + rep ", shape=box];
 id10 [label=" repnr := repnr - 1 ", shape=box];
 id1 -> id3 [label=" ", headport=n, tailport=s]
 id3 -> id4 [label=" ", headport=n, tailport=s]
 id4 -> id5 [label=" ", headport=n, tailport=s]
 id5 -> id6 [label=" Yes ", headport=n, tailport=sw]
 id6 -> id7 [label=" ", headport=n, tailport=s]
 id7 -> id8 [label=" ", headport=n, tailport=s]
 id8 -> id9 [label=" Yes ", headport=n, tailport=sw]
 id9 -> id10 [label=" ", headport=n, tailport=s]
 id10 -> id7 [label=" ", headport=n, tailport=s]
 id8 -> id4 [label=" No ", headport=n, tailport=se]
 id5 -> id2 [label=" No ", headport=n, tailport=se]
}

Page 28: An introduction on language processing

All of “dot”

dot User’s Manual, January 26, 2006 34

A Graph File Grammar

The following is an abstract grammar for the DOT language. Terminals are shown in bold font and nonterminals in italics. Literal characters are given in single quotes. Parentheses ( and ) indicate grouping when needed. Square brackets [ and ] enclose optional items. Vertical bars | separate alternatives.

graph → [strict] (digraph | graph) id ’{’ stmt-list ’}’
stmt-list → [stmt [’;’] [stmt-list ] ]
stmt → attr-stmt | node-stmt | edge-stmt | subgraph | id ’=’ id
attr-stmt → (graph | node | edge) attr-list
attr-list → ’[’ [a-list ] ’]’ [attr-list]
a-list → id ’=’ id [’,’] [a-list]
node-stmt → node-id [attr-list]
node-id → id [port]
port → port-location [port-angle] | port-angle [port-location]
port-location → ’:’ id | ’:’ ’(’ id ’,’ id ’)’
port-angle → ’@’ id
edge-stmt → (node-id | subgraph) edgeRHS [attr-list]
edgeRHS → edgeop (node-id | subgraph) [edgeRHS]
subgraph → [subgraph id] ’{’ stmt-list ’}’ | subgraph id

An id is any alphanumeric string not beginning with a digit, but possibly including underscores; or a number; or any quoted string possibly containing escaped quotes.

An edgeop is -> in directed graphs and -- in undirected graphs.

The language supports C++-style comments: /* */ and //.

Semicolons aid readability but are not required except in the rare case that a named subgraph with no body immediately precedes an anonymous subgraph, because under precedence rules this sequence is parsed as a subgraph with a heading and a body.

Complex attribute values may contain characters, such as commas and white space, which are used in parsing the DOT language. To avoid getting a parsing error, such values need to be enclosed in double quotes.

Page 29: An introduction on language processing

Visualizer

Very much like the compiler.

Generate low-level graph representation.

Also very much like the pretty printer.

Pretty print expressions and statements in boxes.

See Haskell code online: Visualizer.hs
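
The analogy with the compiler can be sketched as follows: a `State` monad (from `mtl`, an assumption) generates box identifiers, and each statement connects the arrow sources that are still "open" to its first box. This is not the course's Visualizer.hs. The names `Open`, `Gen`, `box` and `visualize` are hypothetical, box identifiers are numbered in creation order, and `showExpr` is a simplified printer that ignores parentheses.

```haskell
import Control.Monad (foldM)
import Control.Monad.State (State, execState, get, modify, put)

type Name = String
data Expr = Id Name | NatCon Int | StrCon String
          | Add Expr Expr | Sub Expr Expr | Conc Expr Expr
data Stm = Assign Name Expr | IfStm Expr [Stm] [Stm] | While Expr [Stm]

type FlowChart = ([Box], [Arrow])
type Box = (Id, BoxType)
type Id = String
data BoxType = Start | End | Decision Text | Activity Text deriving (Eq, Show)
type Text = String
type Arrow = ((Id, FromType), Id)
data FromType = FromStart | FromActivity | FromYes | FromNo deriving (Eq, Show)

type Open = [(Id, FromType)]            -- arrow sources awaiting a target
type Gen = State (Int, [Box], [Arrow])  -- id counter, boxes, arrows

-- Create a box and point all open arrow sources at it.
box :: BoxType -> Open -> Gen Id
box bt open = do
  (n, bs, ars) <- get
  let i = "id" ++ show n
  put (n + 1, bs ++ [(i, bt)], ars ++ [ (src, i) | src <- open ])
  return i

-- Translate a statement; the result lists the arrow sources it leaves open.
stm :: Open -> Stm -> Gen Open
stm open (Assign x e) = do
  i <- box (Activity (x ++ " := " ++ showExpr e)) open
  return [(i, FromActivity)]
stm open (IfStm c t e) = do
  i <- box (Decision (showExpr c)) open
  yes <- stms [(i, FromYes)] t
  no <- stms [(i, FromNo)] e
  return (yes ++ no)
stm open (While c b) = do
  i <- box (Decision (showExpr c)) open
  back <- stms [(i, FromYes)] b
  -- The end of the loop body leads back to the decision box.
  modify (\(n, bs, ars) -> (n, bs, ars ++ [ (src, i) | src <- back ]))
  return [(i, FromNo)]

stms :: Open -> [Stm] -> Gen Open
stms = foldM stm

-- Simplified expression printer (no parentheses).
showExpr :: Expr -> String
showExpr (Id x)     = x
showExpr (NatCon n) = show n
showExpr (StrCon s) = show s
showExpr (Add l r)  = showExpr l ++ " + " ++ showExpr r
showExpr (Sub l r)  = showExpr l ++ " - " ++ showExpr r
showExpr (Conc l r) = showExpr l ++ " || " ++ showExpr r

visualize :: [Stm] -> FlowChart
visualize ss = (bs, ars)
  where
    (_, bs, ars) = execState gen (1, [], [])
    gen = do
      start <- box Start []
      open <- stms [(start, FromStart)] ss
      box End open
```

As in the compiler, the state carries a generator of fresh names; as in the pretty printer, expressions and statements are rendered as text, here for the box contents.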

Page 30: An introduction on language processing

Concluding remarks
