[bb] Lexical scanner framework by Yasha [ 1+ years ago ]
« on: June 29, 2017, 12:28:43 AM »
Title : Lexical scanner framework
Author : Yasha
Posted : 1+ years ago

Description : This is a simple library that allows you to describe a lexical scanner, and use it to tokenise an input file or string. It is suitable for use in things like console inputs, expression calculators, data file parsers, or programming/scripting language compilers.
"Lexing" or "tokenising" is the stage of reading some input where the source is broken up into homogenous "tokens", so that you can stop worrying about the input as text (with all of the annoying variations text can have, like the presence of spaces, indents, cases, and whatnot), and instead treat it as a stream of linear "chunks". e.g. a suitable lexer could take the input:

Code:
1+2  * 4= 0xC

...which is polluted by annoying spacing problems, formats, and different value lengths, and turn it into a stream of objects representing numbers and operators, so that we can just pull tokens off the top and be given either an operator or a number.
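For instance, with rules like the ones in the example below, attached to type names of your own choosing, that input would come out as a stream along these lines (a sketch of the value : type pairs, not exact library output):

Code:
1   : number
+   : add
2   : number
*   : mul
4   : number
=   : eql
0xC : number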

The output of a lexer is "flat" - turning it into a "tree" (i.e. parenting values to operators, putting operators in relative precedence to one another, etc.) is the job of a "parser". So this isn't enough to evaluate languages on its own, but it's the first step. Lexers and parsers are normally designed to work together, with the parser consuming the token stream produced by the lexer (a sister parser API to work with this is now also available at codearcs369d.html?code=2990).
This library lets you create lexer objects and add rules to them in the form of regular-expression patterns for the input to be matched against, combined with actions and extra result information. Output is directed to a list supplied by the user.

Rules are tested in the order they were added, and a longer match overrules a shorter one; when two potential matches are the same length, the rule that was added first wins, so keep the order in mind.
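For example, if you want keywords to beat a general identifier pattern when both match the same characters, add the keyword rule first (a sketch; the type names are illustrative):

Code:
;"if" and the identifier pattern both match the 2-character input "if";
;the keyword rule was added first, so it wins the tie
LEX_AddLexRule l, "if", LEX_ACTION_STORE, "keyword"
LEX_AddLexRule l, "[a-z][a-z0-9_]*", LEX_ACTION_STORE, "identifier"
;for "iffy" the identifier match is longer (4 > 2), so it overrules the keyword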

Upon a match, an "action" is performed. By default the lexer can store the result, switch mode, raise an error, "include" a filename, and discard the match. The built-in action constants are all negative, so use positive values for any extra ones you care to define; add any extra action implementations to LEX_HandleAction in Lexer-Interface (that's what it's there for). The Include functionality is fairly simple and will require some use of modes to be useful; it does however try to detect recursive includes, and optionally guards against repeated includes (as Blitz Basic does: the repeat is skipped silently rather than raising an error).
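As a sketch of a custom action (the constant, counter, and rule here are mine, not part of the library, and assume the regex syntax supports \n), you might count line breaks yourself in your main program:

Code:
Const MY_ACTION_COUNT = 1	;Positive, so it can't clash with the built-in actions
Global lineBreaks = 0
LEX_AddLexRule l, "\n", MY_ACTION_COUNT

...and then catch it in LEX_HandleAction in Lexer-Interface.bb:

Code:
Function LEX_HandleAction(l.LEX_Lexer, r.LEX_LexRule, tok$)
	Select r\action
		Case MY_ACTION_COUNT
			lineBreaks = lineBreaks + 1	;The matched token is still consumed afterwards
	End Select
End Function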

The file Lexer-Interface.bb is intended to be supplied by the user: in it, define LEX_HandleAction(l.LEX_Lexer, r.LEX_LexRule, tok$) as explained above, to handle any extra "actions" you want to add; and LEX_LogError(l.LEX_Lexer, msg$), because you'll want error reporting to suit the logging and handling mechanism of your host program (the library's own LEX_Error flags the error state and then hands the message to LEX_LogError). These two functions are all that are required, and the library does not depend upon either of them doing any more than receiving their arguments (so if you don't care about errors you can even leave them empty!).

Input is read in the form of a LEX_File object, representing a custom pseudo-filestream (basically a bank, but we're calling it a file), which is passed to LEX_ScanFile for scanning; remember to close LEX_File objects with LEX_CloseFile when done (any files opened by Include actions will be closed automatically). LEX_File objects can either be opened from real files (using LEX_ReadFile), or from strings (using LEX_FileFromString); the latter is handy for scanning input from a command prompt or REPL, when you don't actually have a file providing input. You still need to close LEX_Files created from strings, though.
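A minimal sketch of the lifecycle (assuming a lexer l has already been set up, and that a file called "input.txt" exists):

Code:
;From a real file: LEX_ReadFile returns Null on failure (reported through your LEX_LogError)
Local f.LEX_File = LEX_ReadFile("input.txt")
If f <> Null
	LEX_ScanFile l, f
	LEX_CloseFile f
EndIf

;From a string: no real file involved, but it still needs closing
Local s.LEX_File = LEX_FileFromString("1+2*3")
LEX_ScanFile l, s
LEX_CloseFile s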
Here's a small example that uses the library to create a lexer for a minimal calculator language with operators, numbers, functions and comments:

Code:
; Generic Lexer: demonstration
;==============================

Include "Lexer.bb"
Include "LList.bb"


Local l.LEX_Lexer = LEX_CreateLexer() ;Lexer object

Local tokenList.LList = CreateList() ;Output token list (this isn't "owned" by the lexer so we create it externally)
LEX_SetOutput l, tokenList ;Do this before each scanning job, or they'll get appended!


; For the example, we'll scan a simple mathematical expression language with numbers,
; arithmetic operators, function calls, parentheses and also (for some reason) comments.
; We won't bother with variables or includes.

; Values: numeric, in decimal or hex format (won't distinguish ints from floats today)
; Comments: /* */ and //, like in C
; Operators: + - * / % << >> ^ = != <= >=
; Punctuation: ( ) ,
; Function names: begin with a letter and then have numbers or underscores as well, like Blitz
; Whitespace: is ignored, but separates tokens or terminates comments where relevant
; Everything else: is an error


;Block comments need a mode; custom modes must not be zero
Const COMMENT_MODE = 1


;Numbers (most complicated patterns)
;Numbers (most complicated patterns)
LEX_AddLexRule l, "[0-9]+", LEX_ACTION_STORE, "number"	;Simple int
LEX_AddLexRule l, "0[xX][0-9a-fA-F]+", LEX_ACTION_STORE, "number"	;Hex int (style: 0xABC12)
LEX_AddLexRule l, "[0-9]*\.[0-9]+([eE]-?[0-9][0-9]*)?", LEX_ACTION_STORE, "number"	;Float, simple or scientific

LEX_AddLexRule l, "/\*", LEX_ACTION_MODE, COMMENT_MODE, 0	;Note that the star must be escaped
LEX_AddLexRule l, ".", LEX_ACTION_DISCARD, "", COMMENT_MODE	;Match any character, but throw it away in comment mode only
LEX_AddLexRule l, "\*/", LEX_ACTION_MODE, 0, COMMENT_MODE	;Return to mode 0 on hitting */
LEX_AddLexRule l, "//[^\n]*\n", LEX_ACTION_DISCARD	;Line comment: match up to end of line

LEX_AddLexRule l, "\+", LEX_ACTION_STORE, "add"	;Any regex operators need to be escaped with \
LEX_AddLexRule l, "-",  LEX_ACTION_STORE, "sub"
LEX_AddLexRule l, "\*", LEX_ACTION_STORE, "mul"
LEX_AddLexRule l, "/",  LEX_ACTION_STORE, "div"
LEX_AddLexRule l, "%",  LEX_ACTION_STORE, "mod"
LEX_AddLexRule l, "<<", LEX_ACTION_STORE, "shl"
LEX_AddLexRule l, ">>", LEX_ACTION_STORE, "shr"
LEX_AddLexRule l, "\^", LEX_ACTION_STORE, "pow"
LEX_AddLexRule l, "=",  LEX_ACTION_STORE, "eql"
LEX_AddLexRule l, "!=", LEX_ACTION_STORE, "neq"
LEX_AddLexRule l, "<=", LEX_ACTION_STORE, "leq"
LEX_AddLexRule l, ">=", LEX_ACTION_STORE, "geq"

LEX_AddLexRule l, "\(", LEX_ACTION_STORE, "lparen"
LEX_AddLexRule l, "\)", LEX_ACTION_STORE, "rparen"
LEX_AddLexRule l, ",",  LEX_ACTION_STORE, "comma"

LEX_SetCaseSensitivity l, False	;Only really useful to simplify the pattern below
LEX_AddLexRule l, "[a-z][a-z0-9_]*", LEX_ACTION_STORE, "function"

LEX_AddLexRule l, "[^[:space:]]", LEX_ACTION_ERROR	;Raise an error over any other printable character


; OK, let's load a "file":
Local f.LEX_File = LEX_FileFromString("1+2 * atan2(0x43, 1.5e-03)  /* Comment! */ %2")

LEX_ScanFile l, f
LEX_CloseFile f

; So what did we get?
Local i.Iterator = GetIterator(tokenList) : While EachIn(i)
	Local t.LEX_Token = Object.LEX_Token i\Value
	Print t\value + " : " + t\tType
Wend
Print "done"
;... a tasty list of tokens all ready to be turned into something useful by a parser!
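;With the rules above, the printed stream should read roughly:
;  1:number  +:add  2:number  *:mul  atan2:function  (:lparen  0x43:number
;  ,:comma  1.5e-03:number  ):rparen  %:mod  2:number  ...then "done"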


WaitKey
End


This uses the following Lexer-Interface.bb file:

Code:
; Example lexer specialisations
;===============================


Function LEX_LogError(l.LEX_Lexer, msg$)
	If l <> Null
		Local err$ = "Lexer Error in '" + l\cFile\name
		err = err + "' ('" + l\cFile\dir + l\cFile\name + "')"
		err = err + " at line " + l\cFile\cLine + ", col " + l\cFile\cCol
		Print err + ": " + msg
	Else
		Print "Lexer Error: " + msg
	EndIf
End Function

Function LEX_HandleAction(l.LEX_Lexer, r.LEX_LexRule, tok$)
	;Example doesn't have any custom actions
End Function


That's one way to get nice clean input for an expression evaluator!
This library depends on regular expressions (codearcs5392.html?code=2632) and linked list (codearcs5913.html?code=2873) include files, both of which are available from the code archives at the given links.

This library replaces my older lexer generator (codearcs31fe.html?code=2636). It's more or less equally powerful (same regex-powered backend), so there's no reason to use the old version, with its inconvenient preprocessing step and custom source format.

Code :
Code: BlitzBasic
; Generic lexical scanner/tokeniser framework
;=============================================


;Dependencies
Include "LList.bb"	;Get this file at http://www.blitzbasic.com/codearcs/codearcs.php?code=2873
Include "Regex.bb"	;Get this file at http://www.blitzbasic.com/codearcs/codearcs.php?code=2632

;User-specialised interface: supplies LEX_LogError and LEX_HandleAction
Include "Lexer-Interface.bb"


Type LEX_Token
	Field value$, tType$
	Field file$, l, c
End Type

Type LEX_Lexer
	Field rules.LList
	Field cFile.LEX_File
	Field out.LList		;Output list is not "owned" by the lexer
	Field csMode, guardMode, errState
	Field istk.LList, prev.LList
End Type

Type LEX_LexRule
	Field rule.RegEx_Node, pattern$
	Field action, result$, mode
End Type

Type LEX_File
	Field dir$, name$
	Field stream, sLen, cPtr, cLine, cCol
End Type

Type LEX_Guard
	Field name$
End Type


Const LEX_ACTION_STORE = -1, LEX_ACTION_MODE = -2, LEX_ACTION_ERROR = -3, LEX_ACTION_DISCARD = -4, LEX_ACTION_INCLUDE = -5


;Create a new, empty lexer object
Function LEX_CreateLexer.LEX_Lexer()
	Local l.LEX_Lexer = New LEX_Lexer
	l\rules = CreateList()
	l\csMode = True
	l\guardMode = False
	Return l
End Function

;Set the output token list for the lexer
Function LEX_SetOutput(l.LEX_Lexer, out.LList)
	l\out = out
End Function

;Set whether rules added to this lexer will be case-sensitive (changing this doesn't affect old rules)
Function LEX_SetCaseSensitivity(l.LEX_Lexer, csMode)
	l\csMode = csMode
End Function

;Set whether included files are auto-guarded or not (i.e. can they be included twice?)
Function LEX_SetIncludeMode(l.LEX_Lexer, guarded)
	l\guardMode = guarded
End Function

;(Internal) Try to include a source file, guards and recursion checks permitting (returns boolean indicating success)
Function LEX_TryIncludeFile(l.LEX_Lexer, file$)
	Local i.Iterator, fd$[0], fn$[0]
	LEX_GetPathCmpts file, fd, fn : file = fd[0] + fn[0]	;Force the path to be absolute for easier comparison
	
	If l\guardMode	;Auto-guarded includes: check if it's been used already, if so ignore it
		i = GetIterator(l\prev) : While EachIn(i)
			Local g.LEX_Guard = Object.LEX_Guard i\Value
			If g\name = file Then Return True	;...return without actually changing it
		Wend
	EndIf
	
	i = GetIterator(l\istk) : While EachIn(i)	;Check against the currently-open files
		Local f.LEX_File = Object.LEX_File i\Value
		If f\dir + f\name = file
			LEX_Error l, "Cannot recursively include '" + file + "'"
			Return False
		EndIf
	Wend
	
	l\cFile = LEX_ReadFile(file)
	If l\cFile <> Null
		ListAddLast l\istk, Handle l\cFile : LEX_GuardFileName l, l\cFile
		Return True
	Else
		Return False	;Error has already been issued by LEX_ReadFile
	EndIf
End Function

;(Internal) Clear the include stacks when a scan has stopped, closing any "owned" files
Function LEX_ClearStacks(l.LEX_Lexer, keep.LEX_File)
	Local i.Iterator
	If l\istk <> Null
		i = GetIterator(l\istk) : While EachIn(i)
			Local f.LEX_File = Object.LEX_File i\Value : If f <> keep Then LEX_CloseFile f
		Wend
		FreeList l\istk
	EndIf
	If l\prev <> Null
		i = GetIterator(l\prev) : While EachIn(i)
			Delete Object.LEX_Guard i\Value
		Wend
		FreeList l\prev
	EndIf
	l\cFile = Null
End Function

;Delete a lexer object and its resources (not output)
Function LEX_FreeLexer(l.LEX_Lexer)
	Local i.Iterator = GetIterator(l\rules) : While EachIn(i)
		Local r.LEX_LexRule = Object.LEX_LexRule(i\Value)
		RegEx_Delete r\rule
		Delete r
	Wend
	FreeList l\rules
	
	LEX_ClearStacks l, Null
	If l\cFile <> Null Then LEX_CloseFile l\cFile
	
	Delete l
End Function

;Delete an output token stream (convenience function)
Function LEX_FreeTokenStream(s.LList)
	Local i.Iterator = GetIterator(s) : While EachIn(i)
		Delete Object.LEX_Token i\Value
	Wend
	FreeList s
End Function

;Add a scan rule, consisting of a match pattern, action, "result" (additional data such as type or error message) and permitted mode
Function LEX_AddLexRule(l.LEX_Lexer, rule$, action, result$ = "", mode = 0)
	Local r.LEX_LexRule = New LEX_LexRule
	r\rule = RegEx_Parse(rule, l\csMode)
	r\pattern = rule
	r\action = action
	r\result = result
	r\mode = mode
	ListAddLast l\rules, Handle r
End Function

;Scan a file (opened with LEX_ReadFile) using the given lexer
Function LEX_ScanFile(l.LEX_Lexer, f.LEX_File)
	LEX_ResetFile f
	l\errState = False
	l\cFile = f
	
	l\istk = CreateList()
	l\prev = CreateList()
	ListAddLast l\istk, Handle f
	LEX_GuardFileName l, f
	
	Repeat
		Local token$, rule.LEX_LexRule, mode = 0
		
		While l\cFile\cPtr < l\cFile\sLen
			token = "" : rule = Null
			
			Local i.Iterator = GetIterator(l\rules) : While EachIn(i)
				Local r.LEX_LexRule = Object.LEX_LexRule i\Value
				If mode = r\mode
					Local cMatch$ = RegEx_Match(r\rule, l\cFile\stream, l\cFile\cPtr)
					If Len(cMatch) > Len(token) Then token = cMatch : rule = r
				EndIf
			Wend
			
			If rule <> Null		;Something matched successfully!
				Select rule\action
					Case LEX_ACTION_STORE
						ListAddLast l\out, Handle LEX_NewToken(token, rule\result, l\cFile\name, l\cFile\cLine, l\cFile\cCol)
					Case LEX_ACTION_MODE
						mode = Int(rule\result)
					Case LEX_ACTION_ERROR
						LEX_Error l, rule\result
						LEX_ClearStacks l, f : Return	;Clear the include stacks and stop scanning
					Case LEX_ACTION_DISCARD
						;Do nothing
					Case LEX_ACTION_INCLUDE
						LEX_IncrementFilePtr l\cFile, Len(token)
						token = LEX_FilterIncludeString(token, rule\result)	;Shorten the token to just the file path
						If Not LEX_TryIncludeFile(l, token) Then LEX_ClearStacks l, f : Return
						
					Default
						LEX_HandleAction l, rule, token
						If l\errState Then LEX_ClearStacks l, f : Return
				End Select
				
				If rule\action <> LEX_ACTION_INCLUDE Then LEX_IncrementFilePtr l\cFile, Len(token)
			Else
				LEX_IncrementFilePtr l\cFile, 1
			EndIf
		Wend
		
		If l\cFile <> f		;Pop back to the previous file in the include stack
			LEX_CloseFile l\cFile
			ListRemoveLast l\istk
			l\cFile = Object.LEX_File ListLast(l\istk)
		Else
			Exit	;If it's f, we're done
		EndIf
	Forever
	
	LEX_ClearStacks l, f
End Function

;Load a file into a suitable format for scanning
Function LEX_ReadFile.LEX_File(path$)
	path = Replace(path, "\", "/")	;Normalise slashes
	
	Local dirName$[0], fileName$[0] : LEX_GetPathCmpts path, dirName, fileName
	
	Local stream = ReadFile(path) : If Not stream
		LEX_Error Null, "Unable to open file '" + path + "' ('" + dirName[0] + fileName[0] + "')"
		Return Null
	EndIf
	
	Local f.LEX_File = New LEX_File
	f\dir = dirName[0] : f\name = fileName[0]
	
	f\stream = CreateBank(FileSize(path))
	ReadBytes f\stream, stream, 0, BankSize(f\stream)
	CloseFile stream
	f\sLen = BankSize(f\stream)
	
	LEX_ResetFile f
	Return f
End Function

;Flag an error, preventing the lexer from continuing and possibly logging it via the interface
Function LEX_Error(l.LEX_Lexer, msg$)
	LEX_LogError l, msg
	If l <> Null Then l\errState = True	;l may be Null when the error occurs outside a scan (e.g. in LEX_ReadFile)
End Function

;(Internal) Use a simple set of filter chars to chop the path out of an include directive
Function LEX_FilterIncludeString$(inc$, filter$)
	Local i, p
	For i = 1 To Len(filter)
		p = Instr(inc, Mid(filter, i, 1))
		If p Then inc = Mid(inc, p + 1) : Exit
	Next
	For i = 1 To Len(filter)
		p = Instr(inc, Mid(filter, i, 1))
		If p Then inc = Left(inc, p - 1) : Exit
	Next
	Return inc
End Function

;(Internal) Get the absolute directory and stripped filename from a path
Function LEX_GetPathCmpts(path$, dirOut$[0], fileOut$[0])
	dirOut[0] = "" : fileOut[0] = path
	Local tgt$ : If Instr(path, "/") Then tgt = "/" : Else tgt = ":"
	
	Local c : For c = Len(path) To 1 Step -1
		If Mid(path, c, 1) = tgt
			dirOut[0] = Left(path, c) : If Left(dirOut[0], 1) = "/" Then dirOut[0] = Mid(dirOut[0], 2)
			fileOut[0] = Mid(path, c + 1)
			Exit
		EndIf
	Next
	
	Local isAbsolute = (Left(path, 2) = "//") Or Instr(path, ":")	;UNC/driveletter/URL
	If Not isAbsolute Then dirOut[0] = Replace(CurrentDir(), "\", "/") + dirOut[0]
End Function

;(Internal) Reset a LEX_File's pointer fields
Function LEX_ResetFile(f.LEX_File)
	f\cPtr = 0
	f\cLine = 1
	f\cCol = 1
End Function

;(Internal) Increment a LEX_File's pointer fields
Function LEX_IncrementFilePtr(f.LEX_File, count)
	Local c : For c = 1 To count
		Local char = PeekByte(f\stream, f\cPtr)
		If char < 32	;Only count printable characters in the column field
			If char = 10 Then f\cLine = f\cLine + 1 : f\cCol = 1
		Else
			f\cCol = f\cCol + 1
		EndIf
		f\cPtr = f\cPtr + 1
	Next
End Function

;(Internal) "Guard" a file against being included twice
Function LEX_GuardFileName(l.LEX_Lexer, f.LEX_File)
	Local g.LEX_Guard = New LEX_Guard
	g\name = f\dir + f\name
	ListAddLast l\prev, Handle g
End Function

;"Close" a file opened for scanning
Function LEX_CloseFile(f.LEX_File)
	If f\stream Then FreeBank f\stream
	Delete f
End Function

;Treat the contents of a string as a lexer file (useful for short inputs)
Function LEX_FileFromString.LEX_File(s$)
	Local f.LEX_File = New LEX_File
	f\dir = "" : f\name = "<string>"
	
	f\stream = CreateBank(Len(s))
	Local c : For c = 1 To BankSize(f\stream)
		PokeByte f\stream, c - 1, Asc(Mid(s, c, 1))
	Next
	f\sLen = BankSize(f\stream)
	
	LEX_ResetFile f
	Return f
End Function

;Create a new match result (token object)
Function LEX_NewToken.LEX_Token(val$, tt$, file$, l = 0, c = 0)
	Local t.LEX_Token = New LEX_Token
	t\value = val : t\tType = tt
	t\file = file : t\l = l : t\c = c
	Return t
End Function


;~IDEal Editor Parameters:
;~F#D#12#1A#1F#24#2D#36#3B#40#45#62#74#83#8B#96#D6#ED#F3#101#112
;~F#119#126#12D#133#142
;~C#Blitz3D


Comments : none...