January 15, 2021, 05:51:15 PM

Author Topic: [bb] bb Tokenizer + parser + bbToDecls by Bobysait [ 1+ years ago ]  (Read 630 times)

Title : bb Tokenizer + parser + bbToDecls
Author : Bobysait
Posted : 1+ years ago

Description : a "how to" for tokenizing some code, fully commented.
It does not check for errors, and it does not distinguish a decimal number from a hex or binary value, as it was not designed for that.

A sample is included at the bottom:
it parses some Blitz code and prints it with syntax coloring.

PS : this is a kind of tutorial, I didn't even test it for production. I haven't measured its speed, but it should run pretty fast.
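
If hex values matter for your language, one possible approach (not part of the tokenizer itself) is to check a word after tokenizing. The sketch below assumes a hypothetical helper named IsHexDigits that only tests whether a string is made of hexadecimal digits; deciding where a hex literal starts (for instance after a "$" operator token in Blitz) is left to the caller.

```blitzbasic
; Hypothetical helper, not part of the original tokenizer.
; Returns True if v$ contains only hexadecimal digits (0-9, a-f, A-F).
Function IsHexDigits%(v$)
    Local i, c$
    For i = 1 To Len(v)
        c = Lower(Mid(v,i,1))
        ; Instr returns 0 when the char is not found in the digit list
        If Instr("0123456789abcdef", c) = 0 Then Return False
    Next
    ; an empty string is not a hex value
    Return Len(v)>0
End Function

Print IsHexDigits("FF00")
Print IsHexDigits("Hello")
WaitKey
```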


Code :
Code: BlitzBasic
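Const MAX_TOKEN_PER_SET% = 512

Const TOKEN_OP%=0, TOKEN_WORD%=1, TOKEN_NUM%=2, TOKEN_STR%=3

Type TokenSet
    Field Tokens ; bank storing one token handle (4 bytes) per token
    Field count% ; number of tokens in the set
End Type
Type Token
    Field v$ ; value
    Field t% ; type (one of the TOKEN_* constants)
    Field p% ; start position in the source string
    Field e% ; end position in the source string
End Type
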
Function TokenSetCount%(ts.TokenSet)
    Return ts\count
End Function

; ids are 1-based; an out-of-range id returns Null instead of reading outside the bank
Function TokenSetToken.Token(ts.TokenSet,id)
    If id<1 Or id>ts\count Then Return Null
    Return Object.Token(PeekInt(ts\Tokens,id*4-4))
End Function

Function TokenValue$(ts.TokenSet,id)
    Local tok.Token = TokenSetToken(ts,id) : If tok<>Null : Return tok\v : EndIf
    Return ""
End Function
Function TokenType%(ts.TokenSet,id)
    Local tok.Token = TokenSetToken(ts,id) : If tok<>Null : Return tok\t : EndIf
    Return 0
End Function
Function TokenStart%(ts.TokenSet,id)
    Local tok.Token = TokenSetToken(ts,id) : If tok<>Null : Return tok\p : EndIf
    Return 0
End Function
Function TokenEnd%(ts.TokenSet,id)
    Local tok.Token = TokenSetToken(ts,id) : If tok<>Null : Return tok\e : EndIf
    Return 0
End Function

Function FreeTokenSet(ts.TokenSet)
    If ts\Tokens
        Local size=BankSize(ts\Tokens), n
        If size>3
            ; each token handle takes 4 bytes in the bank
            For n = 0 To size-1 Step 4
                Local tok.Token = Object.Token(PeekInt(ts\Tokens,n))
                If tok<>Null Then Delete tok
            Next
        EndIf
        FreeBank ts\Tokens
    EndIf
    Delete ts
End Function

Function NewToken.Token(ts.TokenSet, v$, t%, p%,e%)
    ResizeBank(ts\Tokens,BankSize(ts\Tokens)+4)
    Local tok.Token = New Token
    ; value of the token
    tok\v = v
    ; type of token (word, operator)
    tok\t = t
    ; as the tokenizer declares numerals as if they were words, check whether it actually is a numeral or not
    If tok\t = TOKEN_WORD
        If IsNum(tok\v)
            tok\t = TOKEN_NUM
        EndIf
    EndIf
    ; token position in the string
    tok\p = p
    ; end position in the string ( length is e-p+1 ... or Len(v) )
    tok\e = e

    ; insert the token in the token bank
    PokeInt(ts\Tokens,ts\count*4,Handle(tok))
    ; increase the count of tokens in the set
    ts\count = ts\count + 1

    ; finally, return the token ... (not really useful BTW)
    Return tok
End Function

; tokenize a string @s
; @ops      : a string containing all symbols (one char per symbol)
; @seps     : a string containing all the chars removed from the set
;             -> like spaces or tabs, they are used to split words
;                but are not relevant for the language.
; @strchars : a string containing the string delimiters (a string ends with the char that opened it)
Function Tokenize.TokenSet(s$, ops$, seps$=" ", strchars$="'")
    Local ts.TokenSet = New TokenSet
    ts\Tokens = CreateBank(0)
    ; positions are 1-based, so the first word starts at 1
    Local word$="",start%=1
    Local ln=Len(s), i, c$, b

    For i = 1 To ln

        c = Mid(s,i,1)

        ; first -> detect strings !
        If Instr(strchars,c)

            ; register any started word (it ends just before this char), then reset it
            ; so it is not registered a second time later on
            If word<>"" Then NewToken(ts,word,TOKEN_WORD,start,i-1)
            word=""

            ; find the matching closer
            start=i
            i=Instr(s,c,i+1)
            If i>start
                ; add the string to the tokenset
                NewToken(ts,Mid(s,start,i-start+1),TOKEN_STR,start,i)
                ; the next word starts right after the closer
                start=i+1
            ; not closed ? let's close it at the end of the string.
            Else
                ; keep the opening char too : take everything from start to the end
                NewToken(ts,Mid(s,start,ln-start+1),TOKEN_STR,start,ln)
                ; exit the loop. We reached the end, there's nothing left to parse.
                Exit
            EndIf

        ; ok, so it's not a string, maybe an empty char.
        ElseIf Instr(seps,c)

            ; add the current word (if any), it ends just before the separator
            If word<>"" Then NewToken(ts,word,TOKEN_WORD,start,i-1)
            ; reset position
            word="" : start=i+1
            ; do not add the token ... it's an empty token!

        ; or an operator maybe
        ElseIf Instr(ops,c)

            Local isnumber=False
            ; as a dot may be an operator or the separator between integer and decimal parts,
            ; we wouldn't want to separate them, would we ?! ... Would you ? oO ... it's bad ! really bad !

            ; then let's track numerals
            If c="."

                ; if the current word is not empty, check if its first char is a letter or a numeral
                ; generally a word can contain some digits, but can't start with one.
                ; (except for hex, which actually is not really a number but a literal expression
                ;  to represent a number .... ouch ... my brain is bleeding
                ;  anyway, it's a kind of rule that is up to the user to define. so ... we don't care about 'hex'.)
                If Len(word)
                    b=Asc(Left(word,1))
                    ; alright, we found a number (or ... something erroneous containing some chars that we don't care about.
                    ;                             Did I tell you about hex ?... mmm... probably.)
                    If b>=Asc("0") And b<=Asc("9")
                        isnumber=True
                        word=word+c
                    ; else the first char is not a numeral
                    ; it means we have a word and a dot. So the dot will be managed as an operator.
                    EndIf
                Else
                ; no word currently ? so we have to check whether the next char is a numeral or not.
                ; because we can start a decimal using just the dot ( ex : ".0704" )
                    If i<ln
                        b=Asc(Mid(s,i+1,1))
                        ; ok we found a numeral :) (the integer part will get back its baby o/)
                        If b>=Asc("0") And b<=Asc("9")
                            isnumber=True
                            word=word+c
                        EndIf
                    ; else the string ends with a dot ... uncommon oO ...
                    ; but maybe it's an unfinished string or ... don't know ...
                    ; maybe you're trying to parse a book ? ... did I mention this parser is only made for programming languages ?
                    ; my bad, I should have... doesn't matter, now you know it.
                    EndIf
                EndIf
            EndIf

            ; we reached this point without finding a numeral, so this is an operator (whether it's a dot or not)
            If isnumber=False
                ; if any word was started, just add it to the set (it ends just before the operator).
                If word<>"" Then NewToken(ts,word,TOKEN_WORD,start,i-1)
                ; add the symbol
                NewToken(ts,c,TOKEN_OP,i,i)
                ; reset start and word
                word="" : start=i+1
            EndIf

            ; by the way, don't leave a variable with a state (doesn't really matter, but it's a good habit to have)
            isnumber=False

        ; Else, this is just a legal char ... let's say it's part of a word.
        ; (this tokenizer doesn't care about illegal chars, it lives in a wonderful world of freedom)
        Else
            ; add the char to the word.
            word=word+c
        EndIf
    Next

    ; the word contains some chars ? let's register them before ending the set.
    If word<>"" Then NewToken(ts,word,TOKEN_WORD,start,ln)

    ; and voila !
    ; we can return our set, the user will take care of it (I hope ...)
    Return ts

End Function

; a basic function that returns true if the string contains only numeral chars (0-9 + ".")
; actually this function also returns true if there is more than one "." ...
; but as we don't deal with this kind of error, whatever ... let's say it's a number !
Function IsNum%(v$)
    Local a0=Asc("0"), a9=Asc("9"), ad=Asc("."), l=Len(v), i, b
    For i = 1 To l
        b=Asc(Mid(v,i,1))
        If (b<a0 Or b>a9) And b<>ad Then Return False
    Next
    Return l>0 ; return false if the string is empty, it's not a numeral.
End Function


Type Keyword
    Field word$, cs
End Type
; case-insensitive keywords are stored in lowercase, so they can be compared with Lower(token)
Function NewKeyWord(word$,casesensitive%=False)
    Local kw.Keyword = New Keyword
    If casesensitive
        kw\word = word
    Else
        kw\word = Lower(word)
    EndIf
    kw\cs=casesensitive
End Function

; small sample
Function Tokenizer_SimpleSample()
    Graphics 800,600,0,2

    ClsColor 0,40,60
    Cls

    ; let's load a font (I like Consolas, as it's a fixed-width and light font)
    ; but if you don't have it, let's go for the blitz font (you should have it, as it's the ... blitz font)
    Local font=LoadFont("Consolas",18) : If Not(font) Then font=LoadFont("Blitz",16)
    SetFont font

    ; setup our tokenizer
    Local ops$ = ".=()[]#$%;:/+-*.,?" ; basic symbols to parse
    Local emptyops$ = " "+Chr(9) ; space + tab -> those symbols split words but are not output by the tokenizer
    Local strops$ = Chr(34) ; (Chr(34) = ["]) -> the string chars. Everything started with one of those symbols ends with the same symbol.

    ; some strings to parse
    Local mystring$[4]
    mystring[0] = "; the Holy function"
    mystring[1] = "Function TokenizeMe.ReturnType(Params%=12.6,param2$="+Chr(34)+"I'm a striiiiIiing"+Chr(34)+")"
    mystring[2] = "  This code will never compile ... but the parser doesn't know it"
    mystring[3] = " Return Null"
    mystring[4] = "End Function ; and I'm a brownish comment"

    ; register the keywords once (doing it inside the loop would add duplicates for every line)
    NewKeyWord("Function")
    NewKeyWord("End")
    NewKeyWord("Return")
    NewKeyWord("Null")

    Local ns
    For ns = 0 To 4

        ; tokenize the string
        Local ts.TokenSet=Tokenize(mystring[ns], ops, emptyops,strops)

        Local NbToken = TokenSetCount(ts), n

        If NbToken

            ; parse each token in the set with a Blitz-like style
            For n = 1 To NbToken

                Select TokenType(ts,n)

                    ; if it's a word, color it yellow
                    Case TOKEN_WORD : Color 255,255,000

                        ; we should parse all keywords, but this is just a sample, we'll only track the few keywords registered above
                        ; keywords are light-blue (actually, it's turquoise... or orange ... mmm ... colorblindness is a pain in the ****)
                        Local word$ = TokenValue(ts,n)
                        Local lowerword$=Lower(word)
                        Local kw.Keyword
                        For kw = Each Keyword
                            If kw\cs
                                If word=kw\word Then Color 000,255,255:Exit
                            Else
                                If lowerword=kw\word Then Color 000,255,255:Exit
                            EndIf
                        Next

                    ; operators are white
                    Case TOKEN_OP   : Color 255,255,255
                        ; except the comments ...
                        If TokenValue(ts,n)=";"
                            ; let's color the comments in a brownish orange
                            Color 255,100,000
                            ; write all the remaining tokens and quit the loop.
                            Write "; "
                            Local n2=n+1
                            For n=n2 To NbToken
                                Write TokenValue(ts,n)+" "
                            Next
                            Exit
                        EndIf

                    ; numbers are blue
                    Case TOKEN_NUM  : Color 000,128,255

                    ; and finally, the strings in green
                    Case TOKEN_STR  : Color 000,255,000

                End Select

                ; here we are. Write the token
                Write TokenValue(ts,n)+" "

            Next
        EndIf
        FreeTokenSet(ts)
        Print ""
    Next

    Flip True

    FreeFont font

    WaitKey
    End
End Function


Tokenizer_SimpleSample()
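
For a quicker, text-only check than the colored sample, something like the following could be used. This is only a sketch and not part of the original post: DumpTokens is a hypothetical helper, and it assumes the listing above was saved as "tokenizer.bb" (an assumed file name) with its final Tokenizer_SimpleSample() call removed. It prints each token with its type and its start/end positions.

```blitzbasic
; Assumption: the tokenizer listing was saved under this (hypothetical) file name.
Include "tokenizer.bb"

; Hypothetical helper: tokenize a line and print one token per row.
Function DumpTokens(s$)
    ; operators, separators (space + tab) and string delimiter (double quote)
    Local ts.TokenSet = Tokenize(s, "=()+-*/,.", " "+Chr(9), Chr(34))
    Local n
    ; token ids are 1-based
    For n = 1 To TokenSetCount(ts)
        Print n+" : ["+TokenValue(ts,n)+"] type="+TokenType(ts,n)+" pos="+TokenStart(ts,n)+"-"+TokenEnd(ts,n)
    Next
    ; the set owns its tokens, so free it once done
    FreeTokenSet(ts)
End Function

DumpTokens("x = foo(12.5, .25)")
WaitKey
```

The type printed is one of the TOKEN_* constants (0 = operator, 1 = word, 2 = number, 3 = string).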


Comments :


Bobysait(Posted 1+ years ago)

 And here is another sample. It parses a bb file and extracts a decls file (with the same filename, in the same directory). This is actually the reason why I made the tokenizer: big libraries to export, and too lazy to write the decls myself.
Code: BlitzBasic
Function bbToDecls(file$, ExportType%=False,ExportConst%=False,ExportGlobal%=False,ComFunc%=False,ComType%=False,ComConst%=False,ComGlobal%=False)

    Local ops$ = ".=()[]#$%;:/+-*.,?"
    Local emptyops$ = " "+Chr(9)
    Local strops$ = Chr(34)

    If Lower(Right(file,3))<>".bb" Then Return False
    Local o$ = Left(file,Len(file)-3)+".decls" ; same name replacing ".bb" with ".decls"

    ; read the input file (return if not "readable")
    Local in = ReadFile(file) : If Not(in) Then Return False
    ; write the output decls (close the input file if the output can't be created)
    Local out = WriteFile(o)
    If Not(out) Then CloseFile in : Return False

    ; add the decoration
    WriteLine out, ".lib "+Chr(34)+" "+Chr(34)
    WriteLine out, ""

    Local LId% = 0
    Local lastcomment$=""
    Local n, n2

    ; parse the bb-file
    While Not(Eof(in))

        ; read the lines (remove spaces before and after)
        Local l$=Trim(ReadLine(in))
        LId = LId+1

        If Len(l)

            ; tokenize the line
            Local ts.TokenSet = Tokenize(l,ops,emptyops,strops)

            If TokenSetCount(ts) ; assert there are tokens (not an empty line ... should not happen due to the trim ... but anyway.)

                ; find function signature (decoration)
                Select Lower(TokenValue(ts,1))
                    Case "function"

                        If TokenSetCount(ts)<4 Then RuntimeError "Error at line "+LId+" : in the function declaration '"+l+"'"

                        ; set all to default.
                        Local fname$=""
                        Local freturn$=""
                        Local freturntype$=""

                        ; actually, the opening bracket of the parameters can't be before pos "3"
                        ; Function Name([...])
                        ; it can be at 4
                        ; Function Name%([...])
                        ; it can also be at 5
                        ; Function Name .ReturnType([...])
                        ; I don't think it can be later...
                        Local fparamstart%=3
                        ; so, let's start grabbing the "name" and the "return" thing

                        ; let's assume the file is a "valid" bb-file (we won't deal with errors in the file)
                        ; pos 1 = function
                        ; pos 2 = name
                        ; pos 3 = "." or % or "$" or "#" or "("

                        ; so the name is at pos 2 !
                        fname=TokenValue(ts,2)

                        ; for the pos 3, we have to check whether it's a "(" or something else
                        Select TokenValue(ts,3)

                            ; so we found the first bracket
                            Case "("
                                ; no return type
                                freturn=""
                                ; the bracket is at pos 3 (it is ignored by the loop below), the first argument, if any, is at pos 4
                                fparamstart=3

                            Case "."
                                freturn="" ; do not mark the 'type' return (decls does not support them)
                                ; in case we want to catch the type : (for documentation or else)
                                freturntype = TokenValue(ts,4)
                                fparamstart=5 ; pos of the bracket, after the type name

                            Case "%","$","#"
                                freturn=TokenValue(ts,3)
                                fparamstart=4

                            Default
                                ; maybe we could check what else happened here
                                RuntimeError "Error at line "+LId+" : This symbol is Not supported : ["+TokenValue(ts,3)+"] as a Function decoration"
                                ; we should not be here as there is a RuntimeError call ...
                                ; but if the user overrides the RuntimeError Function ?! ... I know someone who ...
                                End
                                ; Hey, maybe someone overrides the "End" too >.< what kind of person does that ?!
                                Return False
                                ; I hope the return can't be overridden ... damn, I think I'm paranoid
                                DebugLog "Are you serious ? He !"
                                Print "You think I'm paranoid, don't you ?!"

                        End Select

                        ; so we have the name and the return style, let's parse the arguments.
                        Local fargs$="(", lastarg$=""
                        For n = fparamstart To TokenSetCount(ts)

                            Select TokenType(ts,n)

                                Case TOKEN_WORD

                                    lastarg = TokenValue(ts,n)
                                    fargs=fargs+lastarg

                                Case TOKEN_OP

                                    Select TokenValue(ts,n)

                                        Case "," ; start of a new argument
                                            ; check if there is no previous argument ... if so, it's a syntax error.
                                            If lastarg=""
                                                RuntimeError "Error at line "+LId+" : Illegal declaration in function '"+fname+"' missing argument before comma"
                                            EndIf
                                            fargs=fargs+", "

                                        Case "%","#","$"

                                            ; add the type specifier
                                            fargs=fargs+TokenValue(ts,n)

                                        Case "[" ; a static array as parameter, Yes Sir !

                                            ; arrays are marked with a "*" for pointers, but we have to remove the type specifier (if any)
                                            Select TokenValue(ts,n-1)
                                                Case "%","#","$"
                                                    fargs=Left(fargs,Len(fargs)-1)
                                            End Select
                                            ; add the "*"
                                            fargs=fargs+"*"
                                            ; jump to the end of the array decoration
                                            n2 = n+1
                                            For n=n2 To TokenSetCount(ts)
                                                If TokenValue(ts,n)="]" Then Exit
                                            Next

                                        Case "."

                                            ; it's a Blitz-type > marked with a "*" in decls files. (but we have to skip the type-name)
                                            fargs = fargs+"*"
                                            ; skip the next token (should be the typename, but if it's not, there is an error)
                                            If TokenType(ts,n+1)=TOKEN_WORD
                                                n=n+1
                                            Else
                                                RuntimeError "Error at line "+LId+" : Illegal declaration of BlitzType in Function '"+fname+"'"
                                            EndIf

                                        Case "=" ; start of optional value of the current argument

                                            If TokenSetCount(ts)>n+1
                                                ; skip the optional value and check if the next token is a valid one. (end of arguments, or separator)
                                                Select TokenValue(ts,n+2)
                                                    Case ")", ","
                                                        n=n+1
                                                    Default
                                                        RuntimeError "Error at line "+LId+" : Unknown decoration for argument '"+lastarg+"'"
                                                End Select
                                            Else
                                                ; not enough tokens left : function is not defined correctly
                                                RuntimeError "Error at line "+LId+" : Unknown error in end of Function declaration '"+fname+"'"
                                            EndIf

                                        ; The ending bracket, we did it !
                                        Case ")"

                                            fargs=fargs+")"
                                            Exit

                                    End Select

                                Case TOKEN_NUM
                                    ; as the "=value..." is skipped, we should not find any numbers.
                                    RuntimeError "Error at line "+LId+" : Unexpected Number in function declaration '"+fname+"'"

                                Case TOKEN_STR
                                    ; we should not find any strings either.
                                    RuntimeError "Error at line "+LId+" : Unexpected string in function declaration '"+fname+"'"

                            End Select

                        Next

                        ; assert the arguments are closed !
                        If Right(fargs,1)<>")"
                            RuntimeError "Error at line "+LId+" : missing bracket ')' in function declaration '"+fname+"'"
                        EndIf

                        If ComFunc And lastcomment<>"" Then WriteLine out, lastcomment
                        WriteLine out, fname+" "+freturn+" "+fargs

                        lastcomment = ""

                    ; manage the comments ?
                    Case ";"

                        lastcomment=l

                    Case "type"
                        ; export types as functions (they won't actually be functions, but they will be highlighted)
                        If ExportType
                            If TokenSetCount(ts)>1
                                If ComType And lastcomment<>"" Then WriteLine out, lastcomment
                                WriteLine out, TokenValue(ts,2)+"()"
                            Else
                                RuntimeError "Error at line "+LId+" : missing Type specifier"
                            EndIf
                        EndIf
                        lastcomment = ""

                    Case "global"
                        ; export globals as functions (they won't actually be functions, but they will be highlighted)
                        If ExportGlobal
                            If TokenSetCount(ts)>4
                                If TokenType(ts,2)=TOKEN_WORD
                                    If ComGlobal And lastcomment<>"" Then WriteLine out, lastcomment
                                    Select TokenValue(ts,3)
                                        Case "%","$","#"
                                            WriteLine out, TokenValue(ts,2)+TokenValue(ts,3)+"()"
                                        Default
                                            WriteLine out, TokenValue(ts,2)+"()"
                                    End Select
                                Else
                                    RuntimeError "Error at line "+LId+" : missing Type specifier"
                                EndIf
                            Else
                                RuntimeError "Error at line "+LId+" : missing Type specifier"
                            EndIf
                        EndIf
                        lastcomment = ""

                    Case "const"
                        ; export consts as functions (they won't actually be functions, but they will be highlighted)
                        If ExportConst
                            If TokenSetCount(ts)>4
                                If TokenType(ts,2)=TOKEN_WORD
                                    If ComConst And lastcomment<>"" Then WriteLine out, lastcomment
                                    Select TokenValue(ts,3)
                                        Case "%","$","#"
                                            WriteLine out, TokenValue(ts,2)+TokenValue(ts,3)+"()"
                                        Default
                                            WriteLine out, TokenValue(ts,2)+"()"
                                    End Select
                                Else
                                    RuntimeError "Error at line "+LId+" : missing Type specifier"
                                EndIf
                            Else
                                RuntimeError "Error at line "+LId+" : missing Type specifier"
                            EndIf
                        EndIf
                        lastcomment = ""

                    ; else ... skip the line.
                    Default
                        ; here we should get all tokens and catch the "function" words inside the set
                        ; some people use the ":" to stick several statements on the same line
                        ; ex : Function f1():dothings:End Function : Function F2(): blablabla : End Function
                        ; but whatever, this small version does not deal with it at the moment.
                        ; hey ! it's just a sample :)
                        lastcomment = ""

                End Select

            EndIf

            FreeTokenSet(ts)

        EndIf

    Wend

    CloseFile in
    CloseFile out

    ; the decls file has been written
    Return True

End Function

bbToDecls("YOUR_BB_FILE_HERE.bb",True,True,True,True,False,False,False)
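
To try the exporter without a real library, a throwaway test along these lines may help. It is only a sketch under assumptions: "tokenizer.bb" and "bbtodecls.bb" are assumed file names for the two listings of this thread (saved without their final sample calls), and "decls_test.bb" is a hypothetical scratch file created by the test itself. It writes a tiny source file, runs bbToDecls on it with function comments enabled (5th parameter), then prints whatever ended up in the generated decls file.

```blitzbasic
; Assumption: both listings were saved under these (hypothetical) file names.
Include "tokenizer.bb"
Include "bbtodecls.bb"

; write a tiny throwaway source file
Local f = WriteFile("decls_test.bb")
If Not(f) Then RuntimeError "can't create decls_test.bb"
WriteLine f, "; adds two integers"
WriteLine f, "Function Add%(a%, b%)"
WriteLine f, "Return a+b"
WriteLine f, "End Function"
CloseFile f

; ExportType, ExportConst, ExportGlobal = False ; ComFunc = True
bbToDecls("decls_test.bb", False, False, False, True)

; print the generated file line by line
Local r = ReadFile("decls_test.decls")
If r
    While Not(Eof(r))
        Print ReadLine(r)
    Wend
    CloseFile r
Else
    Print "no decls file was written"
EndIf
WaitKey
```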




Yasha(Posted 1+ years ago)

 You seem to have it well in hand, but if you want a tested and "practical" (quite slow though) backend for such a task, you might find these useful:
  Parser framework : codearcs369d.html?code=2990
  Lexical scanner framework : codearcs54e8.html?code=2985
 (tokeniser and lexical scanner are the same thing)

 
