[bmx] Burrows-Wheeler-Transformation by Vertex [ 1+ years ago ]

Started by BlitzBot, June 29, 2017, 00:28:43

Previous topic - Next topic

BlitzBot

Title : Burrows-Wheeler-Transformation
Author : Vertex
Posted : 1+ years ago

Description : Hi!
The BWT-Algorithm optimize a huffmancompression extremely. This algorithm is using for example in bzip2 compression.

The Huffmancompression you can find by:
<a href="codearcsdf3d.html?code=195" target="_blank">http://blitzbasic.com/codearcs/codearcs.php?code=195</a>
<a href="codearcsdf3d.html?code=195" target="_blank">http://blitzbasic.com/codearcs/codearcs.php?code=195</a>
(I was to putridly to write a huffmancode self :))

Huffman create a binary tree. This tree has any branches that represent the chars. Chars that occurs frequently become a shorter bit-path, chars that occurs rarely become a long bit-path.

The BWT optimize datablocks for huffman(and/or for Run Length Encoding).

For example: "blitzbasicrules!" = 2:2:2:1:1:1:1:1:1:1:1:1:1
create a rotated list:
blitzbasicrules!
litzbasicrules!b
itzbasicrules!bl
tzbasicrules!bli
zbasicrules!blit
basicrules!blitz
asicrules!blitzb
sicrules!blitzba
icrules!blitzbas
crules!blitzbasi
rules!blitzbasic
ules!blitzbasicr
les!blitzbasicru
es!blitzbasicrul
s!blitzbasicrule
!blitzbasicrules


now you must sort this table:
!blitzbasicrule[s]
asicrules!blitz[b]
basicrules!blit[z]
blitzbasicrules[!]
crules!blitzbas[i]
es!blitzbasicru[l]
icrules!blitzba[s]
itzbasicrules!b[l]
les!blitzbasicr[u]
litzbasicrules![b]
rules!blitzbasi[c]
s!blitzbasicrul[e]
sicrules!blitzb[a]
tzbasicrules!bl[i]
ules!blitzbasic[r]
zbasicrules!bli[t]


rotate te original 1x:
"blitzbasicrules!" -> "litzbasicrules!b"
and search it, in the sorted table. You can find it on index 9.

Save the cloum and the index as following:
"sbz!ilslubceairt, 9"

Now make a alphabet that contain all chars:
"!abceilrstuz"

searching this alphabet for the char, replace this char at the first in this alphabet:
                !abceilrstuz
s -> index  8 -> s!abceilrtuz
b -> index  3 -> bs!aceilrtuz
z -> index 11 -> zbs!aceilrtu
! -> index  3 -> !zbsaceilrtu
i -> index  7 -> i!zbsacelrtu
l -> index  8 -> l!zbsaceirtu
r -> index  9 -> rl!zbsaceitu
s -> index  5 -> srl!zbaceitu
t -> index 10 -> tsrl!zbaceiu
u -> index 11 -> usrl!zbaceit
z -> index  5 -> zusrl!baceit


now you must write "8 3 11 3 7 8 9 5 10 11", 9 = 2:2:2:1:1:1:1
(ok, ok it was a bad example :))

to decode it, you must do the same:
                !abceilrstuz
index  8 -> s -> s!abceilrtuz
index  3 -> b -> bs!aceilrtuz
index 11 -> z -> zbs!aceilrtu
index  3 -> ! -> !zbsaceilrtu
index  7 -> i -> i!zbsacelrtu
index  8 -> l -> l!zbsaceirtu
index  9 -> r -> rl!zbsaceitu
index  5 -> s -> srl!zbaceitu
index 10 -> t -> tsrl!zbaceiu
index 11 -> u -> usrl!zbaceit
index  5 -> z -> zusrl!baceit


to decode "sbz!ilslubceairt, 9" you must do this:
Last = "sbz!ilslubceairt" sorting this to:
First = "!abbceiillrsstuz"

Finding to all first entrys the last entry, set the last to " "(for no doublefinding!), and save the lastindex in to a transform vektor:
0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
------------------------------------------------
 s  b  z [!] i  l  s  l  u  b  c  e  a  i  r  t
[!] a  b  b  c  e  i  i  l  l  r  s  s  t  u  z
------------------------------------------------
 3

->

 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
------------------------------------------------
 s  b  z     i  l  s  l  u  b  c  e [a] i  r  t
 ! [a] b  b  c  e  i  i  l  l  r  s  s  t  u  z
------------------------------------------------
 3 12

...

->

 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
------------------------------------------------
                                               
 !  a  b  b  c  e  i  i  l  l  r  s  s  t  u  z
------------------------------------------------
 3 12  1  9 10 11  4 13  5  7 14  0  6 15  8  2


Trans = " 3 12  1  9 10 11  4 13  5  7 14  0  6 15  8  2", 7

With this vector you can rebuild the data:
start at position 9:
Index = 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
Trans = 3 12  1  9 10 11  4 13  5  7 14  0  6 15  8  2
Last  = s  b  z  !  i  l  s  l  u  b  c  e  a  i  r  t


at Last[9] is the char "b", at Trans[9] is the index 7. So the next index is at 7 Last[7] = l and so on:
               Index = 9
Last[9]  = b -> Index = Trans[9]
Last[7]  = l -> Index = Trans[7]
Last[13] = i -> Index = Trans[13]
Last[15] = t -> Index = Trans[15]
Last[2]  = z -> Index = Trans[2]
Last[1]  = b -> Index = Trans[1]
Last[12] = a -> Index = Trans[12]
Last[6]  = s -> Index = Trans[6]
Last[4]  = i -> Index = Trans[4]
Last[10] = c -> Index = Trans[10]
Last[14] = r -> Index = Trans[14]
Last[8]  = u -> Index = Trans[8]
Last[5]  = l -> Index = Trans[5]
Last[11] = e -> Index = Trans[11]
Last[0]  = s


Puhhh, sorry for my bad english :)

I have testet this with a 171 KB Bitmap file. Without BWT, Huffman compress it to 119 KB. With BWT to 52,2 KB!

When I ready, i will write my own huffman code, to optimizie it, for 512 byte datablocks.

You can set the datablock size bigger, for better results, but need more time.

cu olli [/i]

Code :
Code (blitzmax) Select
Strict

Print "Start"
BWT_Encode("test.bmp", "test.bwt")
Print "End"
Print "Start"
BWT_Decode("test.bwt", "test2.bmp")
Print "End"

End

Function BWT_Encode(sFileIn:String, sFileOut:String)
Local iIndex:Int
Local iFileSize:Int, iEncodeEnd:Int, tStreamIn:TStream, tStreamOut:TStream
Local tDataBlock:TBank, sFirst:String, sSortedTable:String[512]
Local iFirstIndex:Int
Local sAlphabet:String, bFind:Byte

iFileSize = FileSize(sFileIn)
iEncodeEnd = iFileSize-(iFileSize Mod 512)

tStreamIn = ReadFile(sFileIn)
If tStreamIn = Null Then
Return False
EndIf
   
tStreamOut = WriteFile(sFileOut)
If tStreamOut = Null Then
CloseFile tStreamIn
Return False
EndIf

WriteInt tStreamOut, iEncodeEnd

tDataBlock = CreateBank(512)
While StreamPos(tStreamIn) < iEncodeEnd
ReadBank(tDataBlock, tStreamIn, 0, 512)

sFirst = ""
For iIndex = 0 To 511
sFirst = sFirst+Chr(PeekByte(tDataBlock, iIndex))
Next
sSortedTable[0] = sFirst

For iIndex = 1 To 511
sSortedTable[iIndex] = sSortedTable[iIndex-1][1..512]+Chr(sSortedTable[iIndex-1][0])
Next
sSortedTable.Sort()

sFirst = sFirst[1..512]+Chr(sFirst[0])
For iIndex = 0 To 511
If sSortedTable[iIndex] = sFirst Then
iFirstIndex = iIndex
Exit
EndIf
Next

sAlphabet = ""
For iIndex = 0 To 255
sAlphabet = sAlphabet+Chr(iIndex)
Next

For iIndex = 0 To 511
bFind = sSortedTable[iIndex][511]
WriteByte tStreamOut, sAlphabet.Find(Chr(bFind))
sAlphabet = Chr(bFind)+sAlphabet.Replace(Chr(bFind), "")
Next
WriteInt tStreamOut, iFirstIndex

Wend

While Not Eof(tStreamIn)
WriteByte tStreamOut, ReadByte(tStreamIn)
Wend

CloseFile tStreamOut
CloseFile tStreamIn

Return True
End Function

Function BWT_Decode(sFileIn:String, sFileOut:String)
Local iIndex:Int, iIndex2:Int, bChar:Byte
Local iEncodeEnd:Int, tStreamIn:TStream, tStreamOut:TStream
Local tDataBlock:TBank, iFirst:Int[512], iLast:Int[512], iTrans:Int[512]
Local iFirstIndex:Int
Local sAlphabet:String, bFind:Byte

tStreamIn = ReadFile(sFileIn)
If tStreamIn = Null Then
Return False
EndIf
   
tStreamOut = WriteFile(sFileOut)
If tStreamOut = Null Then
CloseFile tStreamIn
Return False
EndIf

iEncodeEnd = (ReadInt(tStreamIn)/512)*516

tDataBlock = CreateBank(512)
While StreamPos(tStreamIn) < iEncodeEnd
ReadBank(tDataBlock, tStreamIn, 0, 512)
iFirstIndex = ReadInt(tStreamIn)

sAlphabet = ""
For iIndex = 0 To 255
sAlphabet = sAlphabet+Chr(iIndex)
Next

For iIndex = 0 To 511
bFind = PeekByte(tDataBlock, iIndex)
PokeByte tDataBlock, iIndex, sAlphabet[bFind]
sAlphabet = Chr(sAlphabet[bFind])+sAlphabet.Replace(Chr(sAlphabet[bFind]), "")
Next

For iIndex = 0 To 511
iFirst[iIndex] = PeekByte(tDataBlock, iIndex)
iLast[iIndex] = PeekByte(tDataBlock, iIndex)
Next

For iIndex = 0 To 511
iFirst[iIndex] = PeekByte(tDataBlock, iIndex)
iLast[iIndex] = PeekByte(tDataBlock, iIndex)
Next
iFirst.Sort

For iIndex = 0 To 511
bChar = iFirst[iIndex]
For iIndex2 = 0 To 511
If iLast[iIndex2] = bChar Then
iTrans[iIndex] = iIndex2
iLast[iIndex2] = iLast[iIndex2]+256
Exit
EndIf
Next
Next

iIndex2 = iFirstIndex
For iIndex = 0 To 511
WriteByte tStreamOut, iLast[iIndex2]-256
iIndex2 = iTrans[iIndex2]
Next

FlushMem
Wend

While Not Eof(tStreamIn)
WriteByte tStreamOut, ReadByte(tStreamIn)
Wend

CloseFile tStreamOut
CloseFile tStreamIn

Return True
End Function


Comments :


Jeroen(Posted 1+ years ago)

 gheh nice work man


GW(Posted 1+ years ago)

 Very nice!How would you modify this to work with bmax strings?


Filax(Posted 1+ years ago)

 I don't understand very well :) but i have try with a BMP fileOriginal : 921 656 byteBWT file : 928 860 octetsOriginal file with winrar : 404 603 byteBWT file with winrar : 692 631 byte ????Strange :)May be winrar use different than huffman compress method ?


Murilo(Posted 1+ years ago)

 Very nice - It greatly improves the output of my huffman routines. I'm considering releasing my revised and improved algorithm soon...


skidracer(Posted 1+ years ago)

 Removed FlushMems, can someone confirm this code still works as expected.


Chroma(Posted 1+ years ago)

 Yep you're right Filax this is no good.  It doesn't make the .bwt file smaller at all.


ImaginaryHuman(Posted 1+ years ago)

 This is neat but I'm not sure about the bit near the end of the encoding where you form a separate alphabet of characters and map them to numbers ... this seems like a way of `compacting` the range of symbols used to represent the string, as an extra step, which might not be in the original BWT? And maybe that is what's throwing it off? I'll have to try this for myself.