Huffman Coding

What is Huffman Coding

  • Background
    • Developed by David A. Huffman in 1952 while he was a Ph.D. student at MIT (Huffman, 1952)
    • Created as a more efficient alternative to other variable-length encoding schemes of the time
  • Definition:
    • Huffman coding is a lossless data compression algorithm that assigns variable-length codes to characters based on their frequency of occurrence (Cormen et al., 2009)

    • It creates a binary tree (Huffman tree) where more frequent characters are assigned shorter codes and less frequent characters longer codes (Sayood, 2017)

    • The algorithm guarantees a prefix-free code, meaning no codeword is a prefix of another codeword (Blelloch, 2013)

  • Practical uses:
  • Common Misconceptions
    • Huffman coding is not always the most efficient compression method for all types of data (Sayood, 2017)
    • It does not work well for sources with uniform symbol distribution (Sayood, 2017)

Examples

Text Compression

  • Consider the string “BCCABBDDAEE”
  • Frequency: A(2), B(3), C(2), D(2), E(2)
  • Huffman codes: A(00), B(1), C(01), D(100), E(101)
  • Compressed binary: 1010110010010010110101
  • Original: 11 characters * 8 bits = 88 bits
  • Compressed: 22 bits, saving 75% space (Cormen et al., 2009)

Image Compression

  • In JPEG compression, Huffman coding is applied to the quantized DCT coefficients
  • More frequent coefficients (often zeros) get shorter codes
  • This contributes to JPEG’s efficient compression of photographic images (Salomon & Motta, 2010)

Literature Review

Huffman, 1952

Cormen Et Al., 2009

  • Lossless CompressionDataCompression
    • Huffman coding is a type of lossless compression, meaning the original data can be perfectly reconstructed from the compressed data
  • Entropy CodingInformationTheory
    • Huffman coding is a type of entropy coding, which aims to compress data close to its information-theoretic limit
  • Shannon-Fano CodingDataCompression
    • Another variable-length encoding scheme that predates Huffman coding but is generally less efficient

궁금증

  • 2024-10-17 23:28 최근에는 특히 adaptive models에서 Huffman Coding보다 arithmetic coding이 더 많이 사용됩니다. 그 이유는 adaptive 모델에서 기호의 확률이 지속적으로 업데이트되기 때문에, arithmetic coding이 더 효율적으로 연속적인 확률 분포를 처리할 수 있기 때문입니다. 반면, static Huffman은 정적 확률 분포에서 여전히 빠르고 효율적인 방식입니다.