Ooops
October 28, 2020, 11:34:23 PM

Author Topic: [bmx] Burrows-Wheeler-Transformation by Vertex [ 1+ years ago ]  (Read 1535 times)

Offline BlitzBot

  • Jr. Member
  • **
  • Posts: 1
Title : Burrows-Wheeler-Transformation
Author : Vertex
Posted : 1+ years ago

Description : Hi!
The BWT-Algorithm optimize a huffmancompression extremely. This algorithm is using for example in bzip2 compression.

The Huffmancompression you can find by:
<a href="codearcsdf3d.html?code=195" target="_blank">http://blitzbasic.com/codearcs/codearcs.php?code=195[/url]
<a href="codearcsdf3d.html?code=195" target="_blank">http://blitzbasic.com/codearcs/codearcs.php?code=195[/url]
(I was to putridly to write a huffmancode self :))

Huffman create a binary tree. This tree has any branches that represent the chars. Chars that occurs frequently become a shorter bit-path, chars that occurs rarely become a long bit-path.

The BWT optimize datablocks for huffman(and/or for Run Length Encoding).

For example: "blitzbasicrules!" = 2:2:2:1:1:1:1:1:1:1:1:1:1
create a rotated list:
Code: [Select]
blitzbasicrules!
litzbasicrules!b
itzbasicrules!bl
tzbasicrules!bli
zbasicrules!blit
basicrules!blitz
asicrules!blitzb
sicrules!blitzba
icrules!blitzbas
crules!blitzbasi
rules!blitzbasic
ules!blitzbasicr
les!blitzbasicru
es!blitzbasicrul
s!blitzbasicrule
!blitzbasicrules


now you must sort this table:
Code: [Select]
!blitzbasicrule[s]
asicrules!blitz[b]
basicrules!blit[z]
blitzbasicrules[!]
crules!blitzbas[i]
es!blitzbasicru[l]
icrules!blitzba[s]
itzbasicrules!b[l]
les!blitzbasicr[u]
litzbasicrules![b]
rules!blitzbasi[c]
s!blitzbasicrul[e]
sicrules!blitzb[a]
tzbasicrules!bl[i]
ules!blitzbasic[r]
zbasicrules!bli[t]


rotate te original 1x:
"blitzbasicrules!" -> "litzbasicrules!b"
and search it, in the sorted table. You can find it on index 9.

Save the cloum and the index as following:
"sbz!ilslubceairt, 9"

Now make a alphabet that contain all chars:
"!abceilrstuz"

searching this alphabet for the char, replace this char at the first in this alphabet:
Code: [Select]
                !abceilrstuz
s -> index  8 -> s!abceilrtuz
b -> index  3 -> bs!aceilrtuz
z -> index 11 -> zbs!aceilrtu
! -> index  3 -> !zbsaceilrtu
i -> index  7 -> i!zbsacelrtu
l -> index  8 -> l!zbsaceirtu
r -> index  9 -> rl!zbsaceitu
s -> index  5 -> srl!zbaceitu
t -> index 10 -> tsrl!zbaceiu
u -> index 11 -> usrl!zbaceit
z -> index  5 -> zusrl!baceit


now you must write "8 3 11 3 7 8 9 5 10 11", 9 = 2:2:2:1:1:1:1
(ok, ok it was a bad example :))

to decode it, you must do the same:
Code: [Select]
                !abceilrstuz
index  8 -> s -> s!abceilrtuz
index  3 -> b -> bs!aceilrtuz
index 11 -> z -> zbs!aceilrtu
index  3 -> ! -> !zbsaceilrtu
index  7 -> i -> i!zbsacelrtu
index  8 -> l -> l!zbsaceirtu
index  9 -> r -> rl!zbsaceitu
index  5 -> s -> srl!zbaceitu
index 10 -> t -> tsrl!zbaceiu
index 11 -> u -> usrl!zbaceit
index  5 -> z -> zusrl!baceit


to decode "sbz!ilslubceairt, 9" you must do this:
Last = "sbz!ilslubceairt" sorting this to:
First = "!abbceiillrsstuz"

Finding to all first entrys the last entry, set the last to " "(for no doublefinding!), and save the lastindex in to a transform vektor:
Code: [Select]
0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
------------------------------------------------
 s  b  z [!] i  l  s  l  u  b  c  e  a  i  r  t
[!] a  b  b  c  e  i  i  l  l  r  s  s  t  u  z
------------------------------------------------
 3

->

 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
------------------------------------------------
 s  b  z     i  l  s  l  u  b  c  e [a] i  r  t
 ! [a] b  b  c  e  i  i  l  l  r  s  s  t  u  z
------------------------------------------------
 3 12

...

->

 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
------------------------------------------------
                                               
 !  a  b  b  c  e  i  i  l  l  r  s  s  t  u  z
------------------------------------------------
 3 12  1  9 10 11  4 13  5  7 14  0  6 15  8  2


Trans = " 3 12  1  9 10 11  4 13  5  7 14  0  6 15  8  2", 7

With this vector you can rebuild the data:
start at position 9:
Code: [Select]
Index = 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
Trans = 3 12  1  9 10 11  4 13  5  7 14  0  6 15  8  2
Last  = s  b  z  !  i  l  s  l  u  b  c  e  a  i  r  t


at Last[9] is the char "b", at Trans[9] is the index 7. So the next index is at 7 Last[7] = l and so on:
Code: [Select]
               Index = 9
Last[9]  = b -> Index = Trans[9]
Last[7]  = l -> Index = Trans[7]
Last[13] = i -> Index = Trans[13]
Last[15] = t -> Index = Trans[15]
Last[2]  = z -> Index = Trans[2]
Last[1]  = b -> Index = Trans[1]
Last[12] = a -> Index = Trans[12]
Last[6]  = s -> Index = Trans[6]
Last[4]  = i -> Index = Trans[4]
Last[10] = c -> Index = Trans[10]
Last[14] = r -> Index = Trans[14]
Last[8]  = u -> Index = Trans[8]
Last[5]  = l -> Index = Trans[5]
Last[11] = e -> Index = Trans[11]
Last[0]  = s


Puhhh, sorry for my bad english :)

I have testet this with a 171 KB Bitmap file. Without BWT, Huffman compress it to 119 KB. With BWT to 52,2 KB!

When I ready, i will write my own huffman code, to optimizie it, for 512 byte datablocks.

You can set the datablock size bigger, for better results, but need more time.

cu olli [/i]

Code :
Code: BlitzMax
  1. Strict
  2.  
  3. Print "Start"
  4. BWT_Encode("test.bmp", "test.bwt")
  5. Print "End"
  6. Print "Start"
  7. BWT_Decode("test.bwt", "test2.bmp")
  8. Print "End"
  9.  
  10. End
  11.  
  12. Function BWT_Encode(sFileIn:String, sFileOut:String)
  13.         Local iIndex:Int
  14.         Local iFileSize:Int, iEncodeEnd:Int, tStreamIn:TStream, tStreamOut:TStream
  15.         Local tDataBlock:TBank, sFirst:String, sSortedTable:String[512]
  16.         Local iFirstIndex:Int
  17.         Local sAlphabet:String, bFind:Byte
  18.  
  19.         iFileSize = FileSize(sFileIn)
  20.         iEncodeEnd = iFileSize-(iFileSize Mod 512)
  21.        
  22.         tStreamIn = ReadFile(sFileIn)
  23.         If tStreamIn = Null Then
  24.                 Return False
  25.         EndIf
  26.    
  27.         tStreamOut = WriteFile(sFileOut)
  28.         If tStreamOut = Null Then
  29.                 CloseFile tStreamIn
  30.                 Return False
  31.         EndIf
  32.        
  33.         WriteInt tStreamOut, iEncodeEnd
  34.        
  35.         tDataBlock = CreateBank(512)
  36.         While StreamPos(tStreamIn) < iEncodeEnd
  37.                 ReadBank(tDataBlock, tStreamIn, 0, 512)
  38.                
  39.                 sFirst = ""
  40.                 For iIndex = 0 To 511
  41.                         sFirst = sFirst+Chr(PeekByte(tDataBlock, iIndex))
  42.                 Next
  43.                 sSortedTable[0] = sFirst
  44.                
  45.                 For iIndex = 1 To 511
  46.                         sSortedTable[iIndex] = sSortedTable[iIndex-1][1..512]+Chr(sSortedTable[iIndex-1][0])
  47.                 Next
  48.                 sSortedTable.Sort()
  49.                
  50.                 sFirst = sFirst[1..512]+Chr(sFirst[0])
  51.                 For iIndex = 0 To 511
  52.                         If sSortedTable[iIndex] = sFirst Then
  53.                                 iFirstIndex = iIndex
  54.                                 Exit
  55.                         EndIf
  56.                 Next
  57.                
  58.                 sAlphabet = ""
  59.                 For iIndex = 0 To 255
  60.                         sAlphabet = sAlphabet+Chr(iIndex)
  61.                 Next
  62.                
  63.                 For iIndex = 0 To 511
  64.                         bFind = sSortedTable[iIndex][511]
  65.                         WriteByte tStreamOut, sAlphabet.Find(Chr(bFind))
  66.                         sAlphabet = Chr(bFind)+sAlphabet.Replace(Chr(bFind), "")
  67.                 Next
  68.                 WriteInt tStreamOut, iFirstIndex
  69.                
  70.         Wend
  71.        
  72.         While Not Eof(tStreamIn)
  73.                 WriteByte tStreamOut, ReadByte(tStreamIn)
  74.         Wend
  75.        
  76.         CloseFile tStreamOut
  77.         CloseFile tStreamIn
  78.        
  79.         Return True
  80. End Function
  81.  
  82. Function BWT_Decode(sFileIn:String, sFileOut:String)
  83.         Local iIndex:Int, iIndex2:Int, bChar:Byte
  84.         Local iEncodeEnd:Int, tStreamIn:TStream, tStreamOut:TStream
  85.         Local tDataBlock:TBank, iFirst:Int[512], iLast:Int[512], iTrans:Int[512]
  86.         Local iFirstIndex:Int
  87.         Local sAlphabet:String, bFind:Byte
  88.        
  89.         tStreamIn = ReadFile(sFileIn)
  90.         If tStreamIn = Null Then
  91.                 Return False
  92.         EndIf
  93.    
  94.         tStreamOut = WriteFile(sFileOut)
  95.         If tStreamOut = Null Then
  96.                 CloseFile tStreamIn
  97.                 Return False
  98.         EndIf
  99.        
  100.         iEncodeEnd = (ReadInt(tStreamIn)/512)*516
  101.        
  102.         tDataBlock = CreateBank(512)
  103.         While StreamPos(tStreamIn) < iEncodeEnd
  104.                 ReadBank(tDataBlock, tStreamIn, 0, 512)
  105.                 iFirstIndex = ReadInt(tStreamIn)
  106.  
  107.                 sAlphabet = ""
  108.                 For iIndex = 0 To 255
  109.                         sAlphabet = sAlphabet+Chr(iIndex)
  110.                 Next
  111.                
  112.                 For iIndex = 0 To 511
  113.                         bFind = PeekByte(tDataBlock, iIndex)
  114.                         PokeByte tDataBlock, iIndex, sAlphabet[bFind]
  115.                         sAlphabet = Chr(sAlphabet[bFind])+sAlphabet.Replace(Chr(sAlphabet[bFind]), "")
  116.                 Next
  117.                
  118.                 For iIndex = 0 To 511
  119.                         iFirst[iIndex] = PeekByte(tDataBlock, iIndex)
  120.                         iLast[iIndex] = PeekByte(tDataBlock, iIndex)
  121.                 Next
  122.                
  123.                 For iIndex = 0 To 511
  124.                         iFirst[iIndex] = PeekByte(tDataBlock, iIndex)
  125.                         iLast[iIndex] = PeekByte(tDataBlock, iIndex)
  126.                 Next
  127.                 iFirst.Sort
  128.                
  129.                 For iIndex = 0 To 511
  130.                         bChar = iFirst[iIndex]
  131.                         For iIndex2 = 0 To 511
  132.                                 If iLast[iIndex2] = bChar Then
  133.                                         iTrans[iIndex] = iIndex2
  134.                                         iLast[iIndex2] = iLast[iIndex2]+256
  135.                                         Exit
  136.                                 EndIf
  137.                         Next
  138.                 Next
  139.  
  140.                 iIndex2 = iFirstIndex
  141.                 For iIndex = 0 To 511
  142.                         WriteByte tStreamOut, iLast[iIndex2]-256
  143.                         iIndex2 = iTrans[iIndex2]
  144.                 Next
  145.                
  146.                 FlushMem
  147.         Wend
  148.  
  149.         While Not Eof(tStreamIn)
  150.                 WriteByte tStreamOut, ReadByte(tStreamIn)
  151.         Wend
  152.        
  153.         CloseFile tStreamOut
  154.         CloseFile tStreamIn
  155.        
  156.         Return True
  157. End Function


Comments :


Jeroen(Posted 1+ years ago)

 gheh nice work man


GW(Posted 1+ years ago)

 Very nice!How would you modify this to work with bmax strings?


Filax(Posted 1+ years ago)

 I don't understand very well :) but i have try with a BMP fileOriginal : 921 656 byteBWT file : 928 860 octetsOriginal file with winrar : 404 603 byteBWT file with winrar : 692 631 byte ????Strange :)May be winrar use different than huffman compress method ?


Murilo(Posted 1+ years ago)

 Very nice - It greatly improves the output of my huffman routines. I'm considering releasing my revised and improved algorithm soon...


skidracer(Posted 1+ years ago)

 Removed FlushMems, can someone confirm this code still works as expected.


Chroma(Posted 1+ years ago)

 Yep you're right Filax this is no good.  It doesn't make the .bwt file smaller at all.


ImaginaryHuman(Posted 1+ years ago)

 This is neat but I'm not sure about the bit near the end of the encoding where you form a separate alphabet of characters and map them to numbers ... this seems like a way of `compacting` the range of symbols used to represent the string, as an extra step, which might not be in the original BWT? And maybe that is what's throwing it off? I'll have to try this for myself.


 

SimplePortal 2.3.6 © 2008-2014, SimplePortal