[bmx] Burrows-Wheeler-Transformation by Vertex [ 1+ years ago ]

BlitzBot · June 29, 2017, 00:28:43

Title : Burrows-Wheeler-Transformation
Author : Vertex
Posted : 1+ years ago

Description : Hi!
The BWT-Algorithm optimize a huffmancompression extremely. This algorithm is using for example in bzip2 compression.

The Huffmancompression you can find by:
<a href="codearcsdf3d.html?code=195" target="_blank">http://blitzbasic.com/codearcs/codearcs.php?code=195</a>
<a href="codearcsdf3d.html?code=195" target="_blank">http://blitzbasic.com/codearcs/codearcs.php?code=195</a>
(I was to putridly to write a huffmancode self

Huffman create a binary tree. This tree has any branches that represent the chars. Chars that occurs frequently become a shorter bit-path, chars that occurs rarely become a long bit-path.

The BWT optimize datablocks for huffman(and/or for Run Length Encoding).

For example: "blitzbasicrules!" = 2:2:2:1:1:1:1:1:1:1:1:1:1
create a rotated list:

Code Select

blitzbasicrules!
litzbasicrules!b
itzbasicrules!bl
tzbasicrules!bli
zbasicrules!blit
basicrules!blitz
asicrules!blitzb
sicrules!blitzba
icrules!blitzbas
crules!blitzbasi
rules!blitzbasic
ules!blitzbasicr
les!blitzbasicru
es!blitzbasicrul
s!blitzbasicrule
!blitzbasicrules

now you must sort this table:

Code Select

!blitzbasicrule[s]
asicrules!blitz[b]
basicrules!blit[z]
blitzbasicrules[!]
crules!blitzbas[i]
es!blitzbasicru[l]
icrules!blitzba[s]
itzbasicrules!b[l]
les!blitzbasicr[u]
litzbasicrules![b]
rules!blitzbasi[c]
s!blitzbasicrul[e]
sicrules!blitzb[a]
tzbasicrules!bl[i]
ules!blitzbasic[r]
zbasicrules!bli[t]

rotate te original 1x:
"blitzbasicrules!" -> "litzbasicrules!b"
and search it, in the sorted table. You can find it on index 9.

Save the cloum and the index as following:
"sbz!ilslubceairt, 9"

Now make a alphabet that contain all chars:
"!abceilrstuz"

searching this alphabet for the char, replace this char at the first in this alphabet:

Code Select

                 !abceilrstuz
s -> index  8 -> s!abceilrtuz
b -> index  3 -> bs!aceilrtuz
z -> index 11 -> zbs!aceilrtu
! -> index  3 -> !zbsaceilrtu
i -> index  7 -> i!zbsacelrtu
l -> index  8 -> l!zbsaceirtu
r -> index  9 -> rl!zbsaceitu
s -> index  5 -> srl!zbaceitu
t -> index 10 -> tsrl!zbaceiu
u -> index 11 -> usrl!zbaceit
z -> index  5 -> zusrl!baceit

now you must write "8 3 11 3 7 8 9 5 10 11", 9 = 2:2:2:1:1:1:1
(ok, ok it was a bad example

to decode it, you must do the same:

Code Select

                 !abceilrstuz
index  8 -> s -> s!abceilrtuz
index  3 -> b -> bs!aceilrtuz
index 11 -> z -> zbs!aceilrtu
index  3 -> ! -> !zbsaceilrtu
index  7 -> i -> i!zbsacelrtu
index  8 -> l -> l!zbsaceirtu
index  9 -> r -> rl!zbsaceitu
index  5 -> s -> srl!zbaceitu
index 10 -> t -> tsrl!zbaceiu
index 11 -> u -> usrl!zbaceit
index  5 -> z -> zusrl!baceit

to decode "sbz!ilslubceairt, 9" you must do this:
Last = "sbz!ilslubceairt" sorting this to:
First = "!abbceiillrsstuz"

Finding to all first entrys the last entry, set the last to " "(for no doublefinding!), and save the lastindex in to a transform vektor:

Code Select

 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
------------------------------------------------
 s  b  z [!] i  l  s  l  u  b  c  e  a  i  r  t
[!] a  b  b  c  e  i  i  l  l  r  s  s  t  u  z
------------------------------------------------
 3

->

 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
------------------------------------------------
 s  b  z     i  l  s  l  u  b  c  e [a] i  r  t
 ! [a] b  b  c  e  i  i  l  l  r  s  s  t  u  z
------------------------------------------------
 3 12

...

->

 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
------------------------------------------------
                                               
 !  a  b  b  c  e  i  i  l  l  r  s  s  t  u  z
------------------------------------------------
 3 12  1  9 10 11  4 13  5  7 14  0  6 15  8  2

Trans = " 3 12 1 9 10 11 4 13 5 7 14 0 6 15 8 2", 7

With this vector you can rebuild the data:
start at position 9:

Code Select

Index = 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
Trans = 3 12  1  9 10 11  4 13  5  7 14  0  6 15  8  2
Last  = s  b  z  !  i  l  s  l  u  b  c  e  a  i  r  t

at Last[9] is the char "b", at Trans[9] is the index 7. So the next index is at 7 Last[7] = l and so on:

Code Select

                Index = 9
Last[9]  = b -> Index = Trans[9]
Last[7]  = l -> Index = Trans[7]
Last[13] = i -> Index = Trans[13]
Last[15] = t -> Index = Trans[15]
Last[2]  = z -> Index = Trans[2]
Last[1]  = b -> Index = Trans[1]
Last[12] = a -> Index = Trans[12]
Last[6]  = s -> Index = Trans[6]
Last[4]  = i -> Index = Trans[4]
Last[10] = c -> Index = Trans[10]
Last[14] = r -> Index = Trans[14]
Last[8]  = u -> Index = Trans[8]
Last[5]  = l -> Index = Trans[5]
Last[11] = e -> Index = Trans[11]
Last[0]  = s

Puhhh, sorry for my bad english

I have testet this with a 171 KB Bitmap file. Without BWT, Huffman compress it to 119 KB. With BWT to 52,2 KB!

When I ready, i will write my own huffman code, to optimizie it, for 512 byte datablocks.

You can set the datablock size bigger, for better results, but need more time.

cu olli [/i]

Code :

Code (blitzmax) Select

Strict

Print "Start"
BWT_Encode("test.bmp", "test.bwt")
Print "End"
Print "Start"
BWT_Decode("test.bwt", "test2.bmp")
Print "End"

End

Function BWT_Encode(sFileIn:String, sFileOut:String)
	Local iIndex:Int
	Local iFileSize:Int, iEncodeEnd:Int, tStreamIn:TStream, tStreamOut:TStream
	Local tDataBlock:TBank, sFirst:String, sSortedTable:String[512]
	Local iFirstIndex:Int
	Local sAlphabet:String, bFind:Byte

	iFileSize = FileSize(sFileIn)
	iEncodeEnd = iFileSize-(iFileSize Mod 512)
	
	tStreamIn = ReadFile(sFileIn)
	If tStreamIn = Null Then
		Return False
	EndIf
    
	tStreamOut = WriteFile(sFileOut)
	If tStreamOut = Null Then
		CloseFile tStreamIn
		Return False
	EndIf
	
	WriteInt tStreamOut, iEncodeEnd
	
	tDataBlock = CreateBank(512)
	While StreamPos(tStreamIn) < iEncodeEnd
		ReadBank(tDataBlock, tStreamIn, 0, 512)
		
		sFirst = ""
		For iIndex = 0 To 511
			sFirst = sFirst+Chr(PeekByte(tDataBlock, iIndex))
		Next
		sSortedTable[0] = sFirst
		
		For iIndex = 1 To 511
			sSortedTable[iIndex] = sSortedTable[iIndex-1][1..512]+Chr(sSortedTable[iIndex-1][0])
		Next
		sSortedTable.Sort()
		
		sFirst = sFirst[1..512]+Chr(sFirst[0])
		For iIndex = 0 To 511
			If sSortedTable[iIndex] = sFirst Then
				iFirstIndex = iIndex
				Exit
			EndIf
		Next
		
		sAlphabet = ""
		For iIndex = 0 To 255
			sAlphabet = sAlphabet+Chr(iIndex)
		Next
		
		For iIndex = 0 To 511
			bFind = sSortedTable[iIndex][511]
			WriteByte tStreamOut, sAlphabet.Find(Chr(bFind))
			sAlphabet = Chr(bFind)+sAlphabet.Replace(Chr(bFind), "")
		Next
		WriteInt tStreamOut, iFirstIndex
		
	Wend
	
	While Not Eof(tStreamIn)
		WriteByte tStreamOut, ReadByte(tStreamIn)
	Wend
	
	CloseFile tStreamOut
	CloseFile tStreamIn
	
	Return True
End Function

Function BWT_Decode(sFileIn:String, sFileOut:String)
	Local iIndex:Int, iIndex2:Int, bChar:Byte
	Local iEncodeEnd:Int, tStreamIn:TStream, tStreamOut:TStream
	Local tDataBlock:TBank, iFirst:Int[512], iLast:Int[512], iTrans:Int[512]
	Local iFirstIndex:Int
	Local sAlphabet:String, bFind:Byte
	
	tStreamIn = ReadFile(sFileIn)
	If tStreamIn = Null Then
		Return False
	EndIf
    
	tStreamOut = WriteFile(sFileOut)
	If tStreamOut = Null Then
		CloseFile tStreamIn
		Return False
	EndIf
	
	iEncodeEnd = (ReadInt(tStreamIn)/512)*516
	
	tDataBlock = CreateBank(512)
	While StreamPos(tStreamIn) < iEncodeEnd
		ReadBank(tDataBlock, tStreamIn, 0, 512)
		iFirstIndex = ReadInt(tStreamIn)

		sAlphabet = ""
		For iIndex = 0 To 255
			sAlphabet = sAlphabet+Chr(iIndex)
		Next
		
		For iIndex = 0 To 511
			bFind = PeekByte(tDataBlock, iIndex)
			PokeByte tDataBlock, iIndex, sAlphabet[bFind]
			sAlphabet = Chr(sAlphabet[bFind])+sAlphabet.Replace(Chr(sAlphabet[bFind]), "")
		Next
		
		For iIndex = 0 To 511
			iFirst[iIndex] = PeekByte(tDataBlock, iIndex) 
			iLast[iIndex] = PeekByte(tDataBlock, iIndex) 
		Next
		
		For iIndex = 0 To 511
			iFirst[iIndex] = PeekByte(tDataBlock, iIndex) 
			iLast[iIndex] = PeekByte(tDataBlock, iIndex) 
		Next
		iFirst.Sort
		
		For iIndex = 0 To 511
			bChar = iFirst[iIndex]
			For iIndex2 = 0 To 511
				If iLast[iIndex2] = bChar Then
					iTrans[iIndex] = iIndex2
					iLast[iIndex2] = iLast[iIndex2]+256
					Exit
				EndIf
			Next
		Next

		iIndex2 = iFirstIndex
		For iIndex = 0 To 511
			WriteByte tStreamOut, iLast[iIndex2]-256
			iIndex2 = iTrans[iIndex2]
		Next
		
		FlushMem
	Wend

	While Not Eof(tStreamIn)
		WriteByte tStreamOut, ReadByte(tStreamIn)
	Wend
	
	CloseFile tStreamOut
	CloseFile tStreamIn
	
	Return True
End Function

Comments :

Jeroen(Posted 1+ years ago)

gheh nice work man

GW(Posted 1+ years ago)

Very nice!How would you modify this to work with bmax strings?

Filax(Posted 1+ years ago)

I don't understand very well but i have try with a BMP fileOriginal : 921 656 byteBWT file : 928 860 octetsOriginal file with winrar : 404 603 byteBWT file with winrar : 692 631 byte ?Strange :)May be winrar use different than huffman compress method ?

Murilo(Posted 1+ years ago)

Very nice - It greatly improves the output of my huffman routines. I'm considering releasing my revised and improved algorithm soon...

skidracer(Posted 1+ years ago)

Removed FlushMems, can someone confirm this code still works as expected.

Chroma(Posted 1+ years ago)

Yep you're right Filax this is no good. It doesn't make the .bwt file smaller at all.

ImaginaryHuman(Posted 1+ years ago)

This is neat but I'm not sure about the bit near the end of the encoding where you form a separate alphabet of characters and map them to numbers ... this seems like a way of `compacting` the range of symbols used to represent the string, as an extra step, which might not be in the original BWT? And maybe that is what's throwing it off? I'll have to try this for myself.

SyntaxBomb - Indie Coders

[bmx] Burrows-Wheeler-Transformation by Vertex [ 1+ years ago ]

BlitzBot