What is Entropy?
Entropy is a fundamental concept in Information Theory that measures the uncertainty or unpredictability of a random variable. It quantifies the expected amount of information or “surprise” inherent in the possible outcomes of the variable.
Definition of Entropy
For a discrete random variable $X$ with possible outcomes $x_1, x_2, \dots, x_n$ occurring with probabilities $p(x_1), p(x_2), \dots, p(x_n)$, the entropy $H(X)$ is defined as:

$$H(X) = -\sum_{i=1}^{n} p(x_i) \log p(x_i)$$
- $H(X)$: Entropy of the random variable $X$.
- $p(x_i)$: Probability that $X$ takes the value $x_i$.
- $\log$: Logarithm function (commonly base 2 for bits or base $e$ for nats).
This formula calculates the expected value of the information content of the outcomes.
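As a quick illustration, here is a minimal Python sketch of this expected-value calculation (the helper name `entropy` and the example distributions are our own choices, not part of the definition above):

```python
import math

def entropy(probs, base=2):
    """Shannon entropy H = -sum(p * log(p)) over a discrete distribution.

    `probs` is a sequence of probabilities that should sum to 1.
    Terms with p == 0 contribute 0 by convention (lim p->0 of p*log p = 0).
    """
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# A fair coin has maximum uncertainty for two outcomes: 1 bit.
print(entropy([0.5, 0.5]))   # 1.0
# A heavily biased coin is more predictable, so its entropy is lower.
print(entropy([0.9, 0.1]))   # ~0.469
```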
Information Content $I(x)$
The information content (or self-information) of an outcome $x_i$ is given by:

$$I(x_i) = -\log p(x_i) = \log \frac{1}{p(x_i)}$$

- $I(x_i)$ represents the amount of “surprise” associated with the occurrence of $x_i$.
- Lower-probability events yield higher information content.
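A small sketch of self-information in code (the probability values below are arbitrary illustration values):

```python
import math

def self_information(p, base=2):
    """I(x) = -log(p(x)): the 'surprise' of an outcome with probability p."""
    return -math.log(p, base)

# The rarer the outcome, the larger the surprise.
for p in (0.5, 0.25, 0.01):
    print(f"p = {p:<5} -> I = {self_information(p):.2f} bits")
# p = 0.5   -> I = 1.00 bits
# p = 0.25  -> I = 2.00 bits
# p = 0.01  -> I = 6.64 bits
```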
Why Use the Logarithm?
- Monotonic Decrease: As probability increases, information content decreases logarithmically.
- Additivity: Logarithms convert multiplication into addition, which is essential for combining independent events.
Derivation of Entropy
Entropy is the expected information content across all possible outcomes of $X$:

$$H(X) = \mathbb{E}\big[I(X)\big] = \sum_{i=1}^{n} p(x_i)\, I(x_i)$$

Substituting $I(x_i) = -\log p(x_i)$:

$$H(X) = -\sum_{i=1}^{n} p(x_i) \log p(x_i)$$
This formula sums the product of each outcome’s probability and its information content.
Why Use Logarithms in Entropy?
Properties of the Logarithm in Entropy
1. Log Converts Multiplication into Addition
For independent events $A$ and $B$:

$$I(A \cap B) = -\log\big(p(A)\,p(B)\big) = -\log p(A) - \log p(B) = I(A) + I(B)$$
- Additivity of Information: The total information content of independent events is the sum of their individual information contents.
2. Logarithm Reflects Information Content
- Lower Probabilities Yield Higher Information Content: Rare events provide more information when they occur.
- Continuous Scale: Logarithms provide a smooth and continuous measure of information content.
Handling Independent Events
For independent events, the joint probability is the product of the individual probabilities:

$$p(A \cap B) = p(A) \times p(B)$$

Using logarithms:

$$\log p(A \cap B) = \log p(A) + \log p(B)$$
- Additive Information: This property aligns with our intuitive understanding that knowing two independent events provides the sum of their individual information.
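A short numeric check of this additivity (the probabilities of $A$ and $B$ below are assumed illustration values):

```python
import math

def info(p):
    """Self-information in bits."""
    return -math.log2(p)

p_a, p_b = 0.25, 0.1          # assumed probabilities of independent events A and B
p_joint = p_a * p_b           # independence: p(A and B) = p(A) * p(B)

# Information of the joint event equals the sum of the individual informations.
print(info(p_joint))          # ~5.32 bits
print(info(p_a) + info(p_b))  # ~5.32 bits
```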
Measuring Surprise
- Logarithmic Scale: Captures the diminishing returns of additional information from more probable events.
- Consistency: Provides a consistent method to quantify information across different probability distributions.
Interpretation of Entropy
- High Entropy: Indicates a high level of uncertainty or unpredictability in the outcomes.
- Low Entropy: Indicates that the outcomes are more predictable or certain.
Entropy answers the question:
“How surprised should we expect to be about the outcome of an event?”
Real-World Examples
1. Coin Toss
Consider a fair coin toss:
- Possible Outcomes: Heads (H) or Tails (T).
- Probabilities: $p(H) = 0.5$, $p(T) = 0.5$.
Entropy calculation:

$$H(X) = -\left(0.5 \log_2 0.5 + 0.5 \log_2 0.5\right) = 1 \text{ bit}$$
So each toss provides 1 bit of information.
Biased Coin Toss with 80% Probability of Heads
- Toss a biased coin 4 times.
- Objective: Determine how many bits are needed to store the outcome on average, and demonstrate with an example encoding that uses fewer than 4 bits on average.
Consider a biased coin where:

$$p(H) = 0.8, \quad p(T) = 0.2$$

For this biased coin:

$$H = -\left(0.8 \log_2 0.8 + 0.2 \log_2 0.2\right)$$

Calculating each term:

$$0.8 \log_2 0.8 \approx 0.8 \times (-0.3219) \approx -0.2575$$
$$0.2 \log_2 0.2 \approx 0.2 \times (-2.3219) \approx -0.4644$$

Substitute the values:

$$H \approx -(-0.2575 - 0.4644) \approx 0.7219 \text{ bits}$$

- The entropy of a single toss with $p(H) = 0.8$ is approximately $0.72$ bits.
- Assuming independence, the total entropy for tossing the coin 4 times is:

$$H_4 = 4 \times 0.7219 \approx 2.89 \text{ bits}$$
This means we can encode the 4 outcomes using fewer than 4 bits on average. One simple scheme:
- If all outcomes are heads, send a single bit ‘1’.
- If any toss is tails, send ‘0’ followed by four additional bits to record the exact sequence (where heads = 1, tails = 0).
Examples: HHHH is encoded as “1” (1 bit); a sequence such as HHTH is encoded as “0” followed by “1101”, i.e. “01101” (5 bits).
- The probability of all heads is $p(\text{HHHH}) = 0.8^4 = 0.4096$.
- For all heads, we use 1 bit.
- For any other sequence (probability $1 - 0.4096 = 0.5904$), we use 5 bits.
- The expected number of bits is:

$$\mathbb{E}[\text{bits}] = 0.4096 \times 1 + 0.5904 \times 5$$

Simplifying this expression:

$$\mathbb{E}[\text{bits}] = 0.4096 + 2.952 = 3.3616 \approx 3.36 \text{ bits}$$
Therefore, on average, this scheme needs only about 3.36 bits to encode the outcome of the 4 tosses, which is fewer than 4 bits. The entropy of roughly 2.89 bits is the theoretical lower bound that no encoding can beat on average.
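The arithmetic above can be double-checked with a short script; this is a sketch that enumerates all $2^4$ sequences under the encoding rule described above:

```python
import math
from itertools import product

p_heads = 0.8

# Entropy of a single biased toss, and of 4 independent tosses.
h1 = -(p_heads * math.log2(p_heads) + (1 - p_heads) * math.log2(1 - p_heads))
print(h1, 4 * h1)   # ~0.7219, ~2.8877

# Expected code length of the simple scheme:
# '1' (1 bit) for HHHH, otherwise '0' + 4 bits for the exact sequence (5 bits).
expected_bits = 0.0
for seq in product("HT", repeat=4):
    p_seq = math.prod(p_heads if s == "H" else 1 - p_heads for s in seq)
    bits = 1 if seq == ("H", "H", "H", "H") else 5
    expected_bits += p_seq * bits
print(expected_bits)  # ~3.3616: above the 2.89-bit entropy bound, below 4 bits
```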
2. Weather Forecast
Suppose the probability of it raining on a given day is $p$, and of not raining is $1 - p$, where one outcome is much more likely than the other (for example, rain is rare).
Entropy calculation:

$$H = -\big(p \log_2 p + (1 - p) \log_2 (1 - p)\big) < 1 \text{ bit for } p \neq 0.5$$
- Interpretation: There’s less uncertainty because one outcome is much more likely.
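As a concrete sketch (the 10% chance of rain below is an assumed value for illustration, since the exact probability is left open here):

```python
import math

def binary_entropy(p):
    """Entropy of a two-outcome distribution (p, 1 - p), in bits."""
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

p_rain = 0.1   # assumed probability of rain, for illustration
print(binary_entropy(p_rain))  # ~0.469 bits, well below the 1 bit of a fair coin
```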
3. Language Text Compression
In English text:
- The letter “e” appears frequently, with a high probability.
- The letter “z” appears less frequently, with a low probability.
Entropy helps in designing compression algorithms:
- High-frequency letters: Less information per occurrence, can be encoded with shorter codes.
- Low-frequency letters: More information per occurrence, require longer codes.
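A rough sketch of this idea (the letter frequencies below are approximate, commonly cited figures for English text, used only for illustration):

```python
import math

# Approximate relative frequencies in English text (illustrative values).
freq = {"e": 0.127, "t": 0.091, "z": 0.0007}

for letter, p in freq.items():
    # Ideal code length for a symbol with probability p is -log2(p) bits.
    print(f"'{letter}': p = {p:<7} ideal code length = {-math.log2(p):.2f} bits")
# 'e' needs ~3 bits, while 'z' needs ~10.5 bits: frequent letters get short codes.
```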
Conclusion
Entropy is a crucial concept that quantifies the expected amount of information or surprise from a random variable’s outcomes. By using logarithms, we can:
- Add Information Content: Simplify the calculation of combined information from independent events.
- Reflect Uncertainty: Accurately represent how likely or surprising an event is.
Understanding entropy and its properties is essential for fields like information theory, data compression, communication systems, and machine learning.