Dictionary methods


The compression algorithms LZ78 and LZW are dictionary methods having the property of universality. It means that they tend to reach the entropy bound, they are asymptotically optimal.

The functions clz78 and clzw in the code below give the encoding with the corresponding algorithms of any text string not including the character # that is employed as an EOF symbol. The clzw function has an optional argument specifying the initial dictionary required by the LZW algorithm. If this argument is omitted then it is considered automatically the minimal dictionary containing all the characters in the text string input.

For instance, clz78( 'ban bananas!!!' ) gives

[(0, 'b'), (0, 'a'), (0, 'n'), (0, ' '), (1, 'a'), (3, 'a'), (6, 's'), (0, '!'), (8, '!')]

and clzw( 'ban bananas!!!' )

[3, 2, 4, 0, 6, 4, 7, 2, 5, 1, 15]

If we use the not optimal initial dictionary [' ','!','a','b','c','n','s'] which adds the unneeded character 'c', the result is

[3, 2, 5, 0, 7, 5, 8, 2, 6, 1, 16]

Uncommenting print(L_dic) in these functions, the dictionary is displayed. In the previous example they are respectively

['', 'b', 'a', 'n', ' ', 'ba', 'na', 'nas', '!', '!!', '#']

and

[' ', '!', 'a', 'b', 'n', 's', 'ba', 'an', 'n ', ' b', 'ban', 'na', 'ana', 'as', 's!', '!!', '!!#']

Perhaps the name "dictionary method" is a little misleading because the dictionary is not a list of shorthand abbreviations to be included in the header of the compressed file (as in Huffman encoding). The dictionary is rather a subdivision of the file and only appears in the encoding-decoding process.

The implementations of the dictionary methods in practice may deviate from the theoretical model because they can include extra steps like a previous division of the file. The functions ratio_lz78 and ratio_lzw compute the number of bits per character required by the pure encoding with LZ78 and LZW of a random generated string of length N composed by zeros and ones. As they are universal methods, these ratios must tend to 1 because a random list of N bits cannot be compressed (the entropy of a Bernoulli distribution p=1/2 is 1, if you prefer so). The function plot_files tests this and includes it in graphic files for N in [10,4000) and also in [50000, 100000) with increments of 150. This function needs the data files built with make_datafile. It can take very long depending on your computer.

For LZ78 the graphs of the number of bits per character are

./images/dict1_78.png     ./images/dict2_78.png
N in [10,4000)
   
N in [50000, 100000)
and for LZW, they are
./images/dict1_w.png     ./images/dict2_w.png
N in [10,4000)
   
N in [50000, 100000)

It is apparent that there is a strange band behavior. One may suspect that the function ceil explains it because it quantifies that the number of bits needs to represent nonnegative integers less than a certain number jumps each dyadic interval, but if we smooth ceil to the identity, the band structure persists. Another possible failed explanation is that outliers in the maximum can rule the output of this function during a while, but a representation of the successive maxima in the encoding reveals much shorter and random plateaus. Summing up, I have not a convincing explanation for this structure.

The conclusion after looking at these figures is that the convergence to 1 is very slow. Repeating this experiment with gzip, an open source application for compression based on LZ77, for N=100000 I got 1.277, which is comparable to the result in our case. This ratio with gzip was still 1.272 for N=108.

If we plot the difference between the number of bits per character for LZ78 and LZW, the result is

./images/dict1d.png     ./images/dict2d.png
N in [10,4000)
   
N in [50000, 100000)

This implies that except for some small irrelevant values, in this example LZ78 is better. It is unclear to me if for "real life texts" typically LZ78 outperforms LZW or vice-versa. It is possible to find on the internet opinions in both directions.


The code

This is the SAGE code that produces all the images:




def clz78(text):
  """
  LZ78 encoding of a text string not containing '#'.
  The optional argument is the initial dictionary.
  """
  
  L_dic = ['']
  codif = [0]
  text += '#'

  long = len(text)

  j = 0
  maxv = 0
  while text[j:] != '':
    text = text[j:]
    j = 1
    while text[0:j] in L_dic:
      j += 1
    L_dic.append( text[0:j] )
    temp = L_dic.index(text[0:j-1])
    if temp>maxv: maxv = temp
    codif.append( (temp, text[j-1:j]) )

  codif.pop( 0 )
  codif.pop( -1 )

#  print(L_dic)

  return codif




def clzw(text, L_dic=None):
  """
  LZW encoding of a text string not containing '#'.
  The optional argument is the initial dictionary.
  """
  if L_dic is None:
    L_dic = []
    # If not initial dictionary, make it
    for k in range(len(text)):
      if text[k] not in L_dic:
        L_dic.append( text[k] )
        
  L_dic.sort()
  
  l_ini = len(L_dic)
  codif = []

  text += '#'

  while text != '#':
    j = 1
    while text[0:j] in L_dic:
      j += 1
    L_dic.append( text[0:j] )
    text = text[j-1:]

#  print(L_dic)

  for k in range(l_ini, len(L_dic) ):
    codif.append( L_dic.index( L_dic[k][0:-1] ) )
  
  return codif

def ratio_lz78( text ):
  L = clz78( text )
  le = len(L)
  ma = max(L)[0]+1
  nbits = (ceil( log(ma)/log(2) ) +1)*le
  ratio = nbits/len(text)

  return le, ma, ratio

def ratio_lzw( text ):
  L = clzw( text )
  le = len(L)
  ma = max(L)+1
  nbits = ceil( log(ma)/log(2) ) *le
  ratio = nbits/len(text)

  return le, ma, ratio

def save_in_file(N1,N2,h,fname):
  t = walltime()

  set_random_seed(0)

  text = ''

  with open(fname, 'w') as f:
    for N in srange(N1, N2,h):
      for _ in range(N-len(text)):
        text += chr( 48+randint(0,1) )
      ans1 = ratio_lz78( text )
      ans2 = ratio_lzw( text )
      if ((N-N1)/h)%50==0:
        print '--------------'
        print N, ans1, ans2
        print walltime(t)
      f.write(str((N, ans1[0], ans1[1], ans1[2], ans2[0], ans2[1], ans2[2]))[1:-1]+'\n')
  

def plot_file( fname ):
  Lr78 = []
  Lrw = []
  Ldif = []
  with open(fname, 'r') as f:
    text = f.read()
    L = text.split('\n')
    L.pop()
    for item in L:
      Lid = item.split(',')
      for k in range(7):
        Lid[k] = sage_eval( Lid[k] )
      Lr78.append( (Lid[0], Lid[3]) )
      Lrw.append( (Lid[0], Lid[6]) )
      Ldif.append( (Lid[0], Lid[3]-Lid[6]) )
    nfname = fname.replace('.txt','_')
    
    P = list_plot(Lr78)
    P.fontsize( 20 )
    P.save(nfname+'78.png')
    P = list_plot(Lrw)
    P.fontsize( 20 )
    P.save(nfname+'w.png')
    P = list_plot(Ldif)
    P.fontsize( 20 )
    P.save(nfname+'d.png')
        
  return

def make_datafile():
  """
  It takes very long. Be careful!
  """
  save_in_file(10,4000,1, 'dat_10_4000_1.txt')  
  save_in_file(50000,100000,150, 'dat_50000_100000_150.txt')  
  return
  
def plot_files():
  plot_file('dat_10_4000_1.txt')
  plot_file('dat_50000_100000_150.txt')


print clz78( 'ban bananas!!!' )
print '----------------------'
print clzw( 'ban bananas!!!' )
print '----------------------'
print clzw( 'ban bananas!!!', [' ','!','a','b','c','n','s'] )
print '----------------------'
  
  
  
  
#make_datafile()
plot_files()