November 2002
YAY (YAY's another YACC)
Introduction
YAY is an incremental parser specification language. It is much like Lex and Yacc, but simpler in terms of semantics. The simplicity comes from the inclusion of "special forms" that better match human-understandable language patterns. YAY is a little more verbose than Yacc due to it's incremental nature, I am sure you will find the semantics much easier to understand.
Documentation is not complete yet, please bear with me.
Terminology
It is assumed that some formal language theory is known. I define the terminology I use in the rest of the YAY pages here to ensure clarity.
YAY documentation is usually phrased in terms of a parsing engine acting on some input and producing an output.
Tokens |
are the objects that are manipulated by machines. Only context can determine if the term "token" refers to an input object or an output object, Symbols and variables are specific types of tokens. |
|---|---|
Symbols |
are strictly input tokens to a machine, these are usually raw characters. |
Variables |
are compound tokens produced by a machine. |
Parser |
is a machine that takes symbols and produces tokens. Any parser is a machine. |
Machine |
a cascade of parsers (possibly one), resulting in another parser. |
Pattern |
a specification for matching a series of symbols. |
Production |
defines what type of token is produced for a given set of symbols. Each production is a set of patterns (grammar rules) that can be inherited and extended. Tokens are instances of productions. |
Overview
YAY can declare parsers (and lexxers). Usually these are cascaded so that the tokens produced by one are delivered as symbols to another. Having different parsers work on different parts of the parsing problem allow the reuse of parsers in other machines.
The Built-in Parser
YAY provides a built-in parser that is used for the input of all machines. This means that no machine will ever deal with characters directly, but rather with the slightly processed tokens of this built-in parser. The built-in parser simply categorizes characters into logical groups. Here is a list of token types (productions) that are delivered by the built-in parser.
| Name | ASCII | Latin-1 | ||
|---|---|---|---|---|
| Decimal | Hex | Decimal | Hex | |
| CapitalLetter | 65..90 | 0x41..0x5A | 192..214, 216..222 | 0xC0..0xD6, 0xD8..0xDE |
| SmallLetter | 97..122 | 0x61..0x7A | 223..246, 248..255 | 0xDF..0xF6, 0xF8..0xFF |
| Letter | = CapitalLetter | SmallLetter | = CapitalLetter | SmallLetter | ||
| Digit | 48..57 | 0x30..0x39 | 48..57 | 0x30..0x39 |
| Glyph | 33..126 | 0x21..0x7E | 33..255 | 0x30..0xFF |
| White | 9, 10, 13, 32 | 0x9, 0xA, 0xD, 0x20 | 9, 10, 13, 32 | 0x9, 0xA, 0xD, 0x20 |
| Control | 0..31, 127 | 0x00..0x1F, 0x7F | 0..31, 127, 128, 129 | 0x00..0x1F, 0x7F, 0x80, 0x81 |
| Character | 0..127 | 0x00..0x7F | 0..255 | 0x00..0xFF |
A Simple Parser (A Lexxer)
We will ignore the necessary prefix and suffix for a full parser specification right now, and focus on the production definitions for the parser:
PATTERN Word <Value AS List(Type=WordChar, Min=1)>
;
PATTERN WhiteSpace List(Type=White, Min=1)
;
PATTERN WordChar;
PATTERN "_"
EXTENDS WordChar;
PATTERN Digit
EXTENDS WordChar;
PATTERN Letter
EXTENDS WordChar;
Productions are defined using the PATTERN keyword (excuse the confusion). Each production can be given a name and extended with patterns with subsequent PATTERN declarations. The most recent declaration (furthest down) is tested first against the input symbols, and a match results in the creation of a token for that production. This parser will group consecutive WordChar into a single Word, and similarly with WhiteSpace.
Patterns
Patterns are defined within the curly braces the PATTERN keyword. A pattern matches a series of symbols, of which there are four major types:
Literal Pattern
A literal is a string delimited by double quotes. The literal must be matched exactly, character for character on the input stream.
PATTERN FieldKeyword "FIELD"
;
The FieldKeyword production expects a sequence of tokens that spell "FIELD".
Possible Patterns
If a pattern is meant to be one of many possible patterns we would usually use the EXTENDS keyword. We can shorten the pattern declaration when the constituent patterns are relatively simple. Here is how WordChar would "usually" be declared:
PATTERN
PATTERN
PATTERN
We can simplify this to:
PATTERN WordChar "_" | <Digit> | <Letter>
;
Of course we loose the ability to name each constituent pattern
In-Line Sub Pattern
A sub pattern can be defined in-line, but it does not get a name and can make the pattern harder to read. An anonymous sub pattern can be used wherever a pattern is expected.
Here we are explicit about the allowed name of a field; it must be a list of WordChars:
PATTERN FieldDeclaration "FIELD" <Name AS
List(Type=WordChar, Min=1)
>
;
In this next example, words can start with "+", followed by a list of at least one WordChars:
PATTERN Word "+" | WordChar
List(Type=WordChar, Min=1)
;
Named Subpattern
A named subpattern is usually used when declaring compound patterns. Giving a name to each also allows reference to these subpatterns during semantic checks and higher level processing.
PATTERN TypeDeclaration <Name AS Word> "AS" <Type AS Word> ";"
;
Every TypeDeclaration production will expect two words separated by "AS". Should a match be made, the generated token will have two variables, called "Name" and "Type", that contain what was matched to each.
Anonymous Subpattern
Many times we do not want to distinguish compound pattern components, and we use anonymous pattern declarations instead
PATTERN TypeDeclaration <Word> "AS" <Word>
;
Here, the TypeDeclaration is simply a string conforming to a certain pattern. It has no parameters to distinguish the prefix Word from the suffix Word.
Variable Subpattern
Many languages have redundant aspects to them. This means that some sequence of characters are expected more than once in the lexical stream. <example>For example, XML requires the prefix and suffix tags to be identical, except for extra slash.</example>.
PATTERN SimpleXMLTag "<" <TagName AS Word> ">" XML "</" <TagName> ">"
;
Notice that TagName was used a second time. This syntax demands that there is a named subpattern, defined leftward, that matches in name. The similarity to the anonymous subpattern is deliberate; you can understand TagName as a pattern declaration itself. During token processing, the named subpattern must match the input tokens, and any subsequent variable subpatterns with the same name must match the same literal value.
The variable subpattern allows us to reduce our semantic checks significantly.
Special Forms
A special form is a parametric description of a common language pattern.
List Special Form
List matches many similar items in a row (replaces *, +, ., ?, (n), (n,), (n,m) of YACC)
| Parameter | Required | Description |
|---|---|---|
| Type |
|
The type of the elements of the list |
| Separator |
|
The expected separator |
| End |
|
What terminates the list parsing |
| Min |
|
Minimum number of matches |
| Max |
|
Maximum number of matches |
Example (Strings)
The String example shows that a String is a list of StringChar delimited by quotes. StringChars can consist of double characters indicating special escape sequences.
PATTERN String "\"" <Value AS List(Type=StringChar, End="\"")> "\""
;
PATTERN StringChar;
PATTERN " "
EXTENDS StringChar;
PATTERN Glyph
EXTENDS StringChar;
PATTERN "\\" "\\"
EXTENDS StringChar;
PATTERN "\\" "\""
EXTENDS StringChar;
PATTERN "\\" "n"
EXTENDS StringChar;
PATTERN "\\" "t"
EXTENDS StringChar;
Remember that the Glyph production is provided by the built-in parser, so is used here to succinctly refer to majority of valid string characters. Also notice that the End parameter allows us to forget checking that a non-escaped quote is not allowed in the StringChar production.
"\n" = newline, as intended in this example
"\\n" = "\n" (backslash, then 'n')
"\n\" is invalid because it has no closing quote
"n\\" = "n\" ('n' then backslash)
"\q" = "\q"
because the backslash is only used to escape certain characters
Operator Special Form
The Operator special form is usually used for expression parsing, and specifies infix syntax only. It must always EXTEND an existing pattern, and it is assumed the operands are both of that same pattern. Operator is not meant to be used for typed expressions: For example <String> "+" <Integer> returns String. Typed expressions can be done using normal patterns, or better yet, left to the backend type analysis engine.
The Operator special form provides clear precedence rules, operator symbols, and associativity. YAY's incremental nature allows new operators to be added easily. Operator replaces the need for %left, %right, %nonassoc, %prec statements in YACC
| Parameter | Required | Description |
|---|---|---|
| Symbol |
|
The symbol used for this operator |
| Associate |
|
Set the associatively of the operator |
| Message |
|
Define the KMAL message (DBOS) |
Example (Expressions)
PATTERN Expression;
PATTERN <FieldName AS Word>
EXTENDS Expression;
PATTERN Integer
EXTENDS Expression;
PATTERN "(" Expression ")"
EXTENDS Expression;
PATTERN Select Operator(Symbol=".", Associate=Left) EXTENDS Expression;
PATTERN Compare Operator(Symbol=Equal, Associate=None) EXTENDS Expression;
PATTERN Assign Operator(Symbol="=", Associate=None) EXTENDS Expression;
"(a).(b.25.e)==h.y" is allowed by this production, only later semantic checks can distinguish errors.
My Notes:
<e-kyle> each PARSER is given an input (either text, or another parsers output) and will have a function called getToken() that will get the next token
<e-kyle> "match" refers to the productions that are allowed to be matched to the input (other productions are sub-productions).
<e-kyle> "export" refers to tokens that are exported. Tokens that are matched, but not exported means the PARSER will continue matching tokens until it finds one to export.
<e-kyle> You can see the KMALLexxer will skip WhiteSpace and Comments.