[bmx] SplitString by Perturbatio [ 1+ years ago ]

Started by BlitzBot, June 29, 2017, 00:28:40


BlitzBot

Title : SplitString
Author : Perturbatio
Posted : 1+ years ago

Description : Function that splits a string at the specified single-character delimiter and returns a TList as the result

Code :
Code (blitzmax) Select
Function SplitString:TList(inString:String, Delim:String)
	Local tempList : TList = New TList
	Local currentChar : String = ""
	Local count : Int = 0
	Local TokenStart : Int = 0

	If Len(Delim)<>1 Then Return Null

	inString = Trim(inString)

	For count = 0 Until Len(inString)
		If inString[count..count+1] = delim Then
			tempList.AddLast(inString[TokenStart..Count])
			TokenStart = count + 1
		End If
	Next
	tempList.AddLast(inString[TokenStart..Count])
	Return tempList
End Function

'Example usage:
Local myList:TList = SplitString("This is a longer test string that I am using to test this split string test thing", " ")

If myList Then
	For Local a$ = EachIn myList
		Print a$
	Next
EndIf


Comments : none...

Henri

Hi,

Sorry I'm a bit late to comment :-(

BlitzMax already has a built-in string Split method:
Code (blitzmax) Select
Local txt:String = "This is a longer test string that I am using to test this split string test thing"
Local ar:String[] = txt.Split(" ")

For Local i:Int = 0 Until ar.length
	Print ar[i]
Next


Split is a basic tool for parsing text in an easy way. Yours is a slightly lower-level approach, but good when more advanced control is needed: for instance, Split can only take one delimiter as far as I know.
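
As a small illustration of the single-delimiter limitation mentioned above, one possible workaround (a sketch only, not something proposed in this thread) is to normalise the extra delimiters with the built-in Replace method before calling Split. The sample text and variable names are made up for the example.

```blitzmax
SuperStrict

Local txt:String = "one two,three four"

' Turn every comma into a space, then split on the single remaining separator.
Local parts:String[] = txt.Replace(",", " ").Split(" ")

For Local part:String = EachIn parts
	Print part
Next
```

This only makes sense when the delimiters are interchangeable; if it matters which delimiter ended a token, a hand-written loop is still needed.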

-Henri
- Got 01100011 problems, but the bit ain't 00000001

Derron

His whole function seems to do some performance-unfriendly stuff:

- initializes new objects and only afterwards returns Null when an invalid param is passed (better to return Null right at the beginning, as no precalculations are needed)

- does a "bla[c..c+1] = delim" check, which is a "slice" operation, instead of just checking the character at that index itself (dunno if the "ASM" code is the same in the end, but I always thought "slicing" is more expensive)
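
The two points above can be sketched in isolation as follows. CountDelims is a hypothetical helper invented for this illustration (it is not part of the archived function): it validates its argument before doing anything else, and it compares character codes by index instead of building a one-character slice on every iteration.

```blitzmax
SuperStrict

' Hypothetical helper for illustration: counts how often a
' single-character delimiter occurs in src.
Function CountDelims:Int(src:String, delim:String)
	' Validate first, before any other work is done.
	If delim.length <> 1 Then Return -1

	Local delimCode:Int = delim[0]	' character code of the delimiter
	Local hits:Int = 0

	For Local i:Int = 0 Until src.length
		' src[i] is an Int character code, so no temporary string is created here.
		If src[i] = delimCode Then hits :+ 1
	Next

	Return hits
End Function

Print CountDelims("a,b,c", ",")
```

Whether the slice is measurably slower in practice would have to be timed; the sketch only shows the shape of the cheaper comparison.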


bye
Ron

Henri

Yes, slicing in the check is unnecessary.

A better way would be checking against the character code.

Here is a slightly more optimised version with the possibility of passing multiple delimiters:

Code (blitzmax) Select

Function SplitString:TList(inString:String, delim:String[])

	If Not inString Or Not delim Then Return Null

	Local tempList:TList = New TList
	Local count:Int = 0
	Local tokenStart:Int = 0
	Local delimChar:Int
	Local delimPos:Int

	'Multiple delimiters
	If delim.length > 1
		For count = 0 Until inString.length
			For delimPos = 0 Until delim.length
				'Compare against the first character code of each delimiter
				If inString[count] = delim[delimPos][0] Then
					tempList.AddLast(inString[tokenStart..count])
					tokenStart = count + 1
					Exit
				End If
			Next
		Next

	'Single delimiter: handled separately just to save a few microseconds
	Else
		delimChar = Asc(delim[0])
		For count = 0 Until inString.length
			If inString[count] = delimChar Then
				tempList.AddLast(inString[tokenStart..count])
				tokenStart = count + 1
			End If
		Next
	EndIf

	'Add the remaining text after the last delimiter
	tempList.AddLast(inString[tokenStart..count])
	Return tempList
End Function

'Example usage:
Local myList:TList = SplitString("This is a longer test string that I am using,to,test,this split string test thing", [" ", ","])

For Local s:String = EachIn myList
	Print s
Next



-Henri

Derron

You have this line here:
inString = inString.Trim()

which _can_ create trouble as it removes whitespace.

Imagine someone used "    " as delim (four whitespaces, which is common when replacing "tab" with spaces). Now the glued-together string of "empty string" + "empty string" + "hey" becomes:
4 whitespaces
4 whitespaces
hey

With your trim() call you destroy the data.


Regardless of how "efficient" something is: never manipulate data unless you can ensure that the manipulation only touches unused data. " " (space) is a valid delimiter.


bye
Ron

Henri

What I wrote doesn't support 4-character delimiters, only single-character ones. It would have to be modified to allow them.
Also, this is very basic and doesn't cover advanced scenarios. Again, it would have to be modified for that.
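
For reference, one way such a modification could look is sketched below. SplitOnString is a hypothetical name, not code from this thread; it assumes a single delimiter string of any length and uses the built-in Find method to locate each occurrence.

```blitzmax
SuperStrict

' Hypothetical variant for illustration: splits src on a delimiter
' string of any length (for example four spaces).
Function SplitOnString:TList(src:String, delim:String)
	' An empty delimiter can never match, so reject it up front.
	If delim.length = 0 Then Return Null

	Local result:TList = New TList
	Local tokenStart:Int = 0
	Local hit:Int = src.Find(delim, tokenStart)

	While hit <> -1
		result.AddLast(src[tokenStart..hit])
		' Skip over the whole delimiter, not just one character.
		tokenStart = hit + delim.length
		hit = src.Find(delim, tokenStart)
	Wend

	' Whatever follows the last delimiter is the final token.
	result.AddLast(src[tokenStart..])
	Return result
End Function

Local tokens:TList = SplitOnString("a    b    c", "    ")
For Local token:String = EachIn tokens
	Print token
Next
```

The sketch deliberately does no trimming, so empty tokens between adjacent delimiters are preserved.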

Trim was left there because of the original poster's desire to trim unprintable characters at both ends of the input string. From my point of view there is a valid reason for it, as you might not want those as a token. It depends on what you want.

Here are some split examples http://pages.cs.wisc.edu/~hasti/cs302/examples/Parsing/parseString.html

What is above is simple parsing. Parsing something like computer code (which I assume you mean, for a case like replacing tabs with 4 spaces) requires more advanced stuff, as you have to consider things like comments / blocks / different EOL chars / typing styles / start of line etc.

-Henri

Derron

Ah yes... my fault. The code clearly only handles single-char delims. So most probably ignore what I wrote about the 4-spaces delimiter.

Nonetheless: same example, but this time the delimiter becomes "~t" (tab), which is pretty common (e.g. tab-separated values, like in a "csv" but with tabs instead of commas).
Again the first serialized objects are empty strings. So the resulting string becomes "~t~they".

The trim now removes the leading tabs, which results in the first param becoming "hey" instead of "". This is why "trim()" is not useful for splitting stuff. I would only trim results, and this only if I know that there cannot be any meaningful whitespace at the beginning or end, so a "hey " would get trimmed to "hey".
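
The effect described above can be reproduced with the built-in Split and Trim methods. This is only a small demonstration with a made-up input string; it assumes, as discussed in the thread, that Trim strips leading tabs.

```blitzmax
SuperStrict

' Two empty fields followed by "hey", tab separated.
Local raw:String = "~t~they"

Local untouched:String[] = raw.Split("~t")
Local trimmed:String[] = raw.Trim().Split("~t")

Print "fields without trim: " + untouched.length	' expected: 3
Print "fields with trim: " + trimmed.length	' expected: 1
```

With the raw string, "hey" stays in the third position; after trimming it moves to the first, which is exactly the data loss being warned about.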


bye
Ron

Henri

I agree with you that trimming should be left to the user, so I updated the code (although it was not my intention to supersede the OP :-) )

On a side note, leading or trailing hidden characters on a token are almost never desired in general.

-Henri

Derron

Yes, on a token it might not be desired, which is why you could trim the values before adding them to the list. But as described above (and seconded by you), trimming the original/raw string can lead to issues.
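
Trimming the values rather than the raw string could look like the following sketch. It uses the built-in Split for brevity instead of the list-based function from this thread, and the input string is invented for the example.

```blitzmax
SuperStrict

Local raw:String = " hey ~t you ~t"
Local parts:String[] = raw.Split("~t")

For Local i:Int = 0 Until parts.length
	' Trim each token individually; the number and order of fields is untouched.
	parts[i] = parts[i].Trim()
	Print "[" + parts[i] + "]"
Next
```

The trailing empty field survives as an empty token, so the caller can still tell that the line had three fields.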


The OP has not been active for many years now, so I assume he won't come here full of anger because you optimized his outdated code :-)


bye
Ron