1. ML Classifier for Cancer Detection

You have built an ML classifier that detects whether a tissue appearing in an image is cancerous or not. Consider the cancerous class as the positive class. The following confusion matrix shows the predicted results obtained in the validation set:

|                    | cancerous (predicted) | healthy (predicted) |
|--------------------|-----------------------|---------------------|
| cancerous (actual) | 30                    | 5                   |
| healthy (actual)   | 15                    | 100                 |

Compute the precision, recall and accuracy of your ML classifier.

  • Confusion Matrix
    • Precision
    • Accuracy
    • Recall

Solution

Based on the confusion matrix:

  • True Positive (TP) = 30
  • False Positive (FP) = 15
  • False Negative (FN) = 5
  • True Negative (TN) = 100

Calculated metrics:

  1. Precision = TP / (TP + FP) = 30 / 45 ≈ 0.6667, or about 66.67%

  2. Recall = TP / (TP + FN) = 30 / 35 ≈ 0.8571, or about 85.71%

  3. Accuracy = (TP + TN) / (TP + FP + FN + TN) = 130 / 150 ≈ 0.8667, or about 86.67%
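
A minimal sketch of the same calculations in Python (not part of the original exercise; the counts are taken from the confusion matrix above):

```python
# Confusion-matrix counts from the validation set
TP, FP, FN, TN = 30, 15, 5, 100

precision = TP / (TP + FP)                   # 30/45   ≈ 0.6667
recall    = TP / (TP + FN)                   # 30/35   ≈ 0.8571
accuracy  = (TP + TN) / (TP + FP + FN + TN)  # 130/150 ≈ 0.8667

print(f"precision={precision:.4f}, recall={recall:.4f}, accuracy={accuracy:.4f}")
```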

2. Student Exam Scores Normalization and Standardization

(*) The table below shows the scores achieved by a group of students on an exam. Using this data, perform the following tasks on the Score feature:

(a). A normalisation in the range [0, 1].

(b). A normalisation in the range [-1, 1].

(c). A standardisation of the data.

| ID    | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
|-------|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
| Score | 42 | 47 | 59 | 27 | 84 | 49 | 72 | 43 | 73 | 59 | 58 | 82 | 50 | 79 | 89 | 75 | 70 | 59 | 67 | 35 |
  • Normalization

  • The basic formula used when normalising to the range [0, 1] is: x' = (x − x_min) / (x_max − x_min)

  • To normalise to the range [−1, 1], multiply the result above by 2 and subtract 1: x'' = 2 · (x − x_min) / (x_max − x_min) − 1

Therefore, the formula used in part (b) is also a form of Min-Max Normalization, simply with the target range set to [−1, 1].

Solution

(a) Normalization in range [0, 1]:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

(b) Normalization in range [-1, 1]:

$$x'' = 2\,\frac{x - x_{\min}}{x_{\max} - x_{\min}} - 1$$

(c) Standardization of the data:

$$z = \frac{x - \mu}{\sigma}$$

Where:

  • $x$ is the original score
  • $x_{\min} = 27$, $x_{\max} = 89$
  • $\mu = 60.95$, $\sigma \approx 16.97$

To apply these formulas, substitute each score for $x$ in the respective equation.
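
A short NumPy sketch of the three transforms, applied to the scores in the table (a sketch, not part of the original solution):

```python
import numpy as np

scores = np.array([42, 47, 59, 27, 84, 49, 72, 43, 73, 59,
                   58, 82, 50, 79, 89, 75, 70, 59, 67, 35], dtype=float)

x_min, x_max = scores.min(), scores.max()   # 27, 89
mu, sigma = scores.mean(), scores.std()     # 60.95, ≈ 16.97 (population std, N in the denominator)

norm_01  = (scores - x_min) / (x_max - x_min)   # (a) range [0, 1]
norm_11  = 2 * norm_01 - 1                      # (b) range [-1, 1]
standard = (scores - mu) / sigma                # (c) z-scores

print(norm_01.round(3))
print(norm_11.round(3))
print(standard.round(3))
```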

3. Bike Rental Prediction Model

(*) We designed a model for predicting the number of bike rentals (y) from two attributes, temperature (x₁) and humidity (x₂),

The model was trained after normalising the training data (between [0,1]). x₁ had values between −10 and 39 while x₂ had values between 20 and 100. At test time, the model is used to predict the bike rentals for a vector . What is the value of the prediction y?

  • Feature
    • Already normalised
  • The test data also needs to be normalised before it can be fed to the model.
    • Why we normalise the data

Solution

  1. First, normalise the input data to the range [0, 1] using the training ranges:

$$x_1' = \frac{x_1 - (-10)}{39 - (-10)} = \frac{x_1 + 10}{49}, \qquad x_2' = \frac{x_2 - 20}{100 - 20} = \frac{x_2 - 20}{80}$$

  2. Apply the normalised values to the prediction model.

  3. Compute the final prediction.

Evaluating this expression gives the final prediction y.
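
A sketch of this pipeline in Python. The original model equation and test vector are not reproduced in the text above, so the weights `w`, bias `b`, and test point below are hypothetical placeholders:

```python
# Min-max normalisation using the *training* ranges, then a linear prediction.
# NOTE: w, b and the test point are hypothetical; substitute the model and
# vector given in the original exercise.
x1_min, x1_max = -10, 39     # temperature range seen in training
x2_min, x2_max = 20, 100     # humidity range seen in training

def normalise(x1, x2):
    return (x1 - x1_min) / (x1_max - x1_min), (x2 - x2_min) / (x2_max - x2_min)

w = (1.0, 1.0)   # hypothetical model weights
b = 0.0          # hypothetical bias

x1, x2 = 25, 60                      # hypothetical test vector
x1_n, x2_n = normalise(x1, x2)
y = w[0] * x1_n + w[1] * x2_n + b    # prediction on normalised inputs
print(y)
```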

4. Outlier Removal Criterion

A simple criterion to remove outliers from a dataset is to compute the mean, μ, and the standard deviation, σ, of the variable of interest and consider values outside the range [μ − 3σ, μ + 3σ] as outliers.

Applying this criterion to the Scores in Exercise 2, which ones of them can be considered as outliers?

Solution

  1. The data: 42, 47, 59, 27, 84, 49, 72, 43, 73, 59, 58, 82, 50, 79, 89, 75, 70, 59, 67, 35

  2. Mean (μ): μ = (42 + 47 + … + 67 + 35) / 20 = 1219 / 20 = 60.95

  3. Standard deviation (σ): σ = √[(Σ(x − μ)²) / N] ≈ 16.97

  4. Outlier range: lower bound: μ − 3σ = 60.95 − 3(16.97) = 10.04; upper bound: μ + 3σ = 60.95 + 3(16.97) = 111.86

  5. Outlier test: 10.04 < normal data < 111.86

Since all scores fall within this range, there are no outliers in the dataset under this criterion.

Even the lowest score (27) and the highest score (89) lie within the range, so they are not considered outliers.
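
A sketch of the 3σ criterion in Python, using the scores from Exercise 2 (illustrative, not part of the original solution):

```python
import numpy as np

scores = np.array([42, 47, 59, 27, 84, 49, 72, 43, 73, 59,
                   58, 82, 50, 79, 89, 75, 70, 59, 67, 35], dtype=float)

mu = scores.mean()            # 60.95
sigma = scores.std()          # population std, ≈ 16.97
lower, upper = mu - 3 * sigma, mu + 3 * sigma

outliers = scores[(scores < lower) | (scores > upper)]
print(f"range = [{lower:.2f}, {upper:.2f}], outliers = {outliers}")   # no outliers expected
```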

5. Joint Probability Mass Function Analysis

Suppose the joint probability mass function of two RVs X and Y is given as

$$p_{XY}(0, 1) = p_{XY}(1, 0) = p_{XY}(2, 1) = \tfrac{1}{3}, \qquad p_{XY}(x, y) = 0 \text{ otherwise.}$$

  • Probability Independent
  • Probability Correlation

(a). Are X and Y Independent?

  1. Marginal probabilities:
    • P(X = 0) = 1/3, P(X = 1) = 1/3, P(X = 2) = 1/3
    • P(Y = 0) = 1/3, P(Y = 1) = 2/3
  2. Independence check:
    • If X and Y were independent, then P(X = x, Y = y) = P(X = x)·P(Y = y) would have to hold for all x and y.

$$P(X = 1, Y = 0) = \tfrac{1}{3} \neq P(X = 1)\,P(Y = 0) = \tfrac{1}{3} \cdot \tfrac{1}{3} = \tfrac{1}{9}$$

$$P(X = 2, Y = 1) = \tfrac{1}{3} \neq P(X = 2)\,P(Y = 1) = \tfrac{1}{3} \cdot \tfrac{2}{3} = \tfrac{2}{9}$$

Therefore, X and Y are not independent.

(b). Are X and Y Uncorrelated?

  1. Expectations: E[X] = 0·(1/3) + 1·(1/3) + 2·(1/3) = 1, E[Y] = 0·(1/3) + 1·(2/3) = 2/3
  2. E[XY]: E[XY] = 2·1·(1/3) = 2/3 (the only non-zero term comes from X = 2, Y = 1)
  3. Covariance: Cov(X, Y) = E[XY] − E[X]·E[Y] = 2/3 − 1·(2/3) = 0

Since the covariance of X and Y is 0, X and Y are uncorrelated.

Conclusion: (a) X and Y are not independent. (b) X and Y are uncorrelated.
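
Both checks can be verified numerically with the sketch below, which assumes the joint PMF stated above (the reconstruction consistent with the marginals and covariance used in the solution):

```python
import numpy as np

# Joint PMF p[x, y] for X in {0, 1, 2} (rows) and Y in {0, 1} (columns)
p = np.array([[0, 1/3],
              [1/3, 0],
              [0, 1/3]])

px = p.sum(axis=1)            # marginal of X: [1/3, 1/3, 1/3]
py = p.sum(axis=0)            # marginal of Y: [1/3, 2/3]

independent = np.allclose(p, np.outer(px, py))   # False -> not independent

x_vals, y_vals = np.array([0, 1, 2]), np.array([0, 1])
ex  = (x_vals * px).sum()                        # E[X]  = 1
ey  = (y_vals * py).sum()                        # E[Y]  = 2/3
exy = (np.outer(x_vals, y_vals) * p).sum()       # E[XY] = 2/3
cov = exy - ex * ey                              # 0 -> uncorrelated

print(f"independent: {independent}, covariance: {cov:.4f}")
```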

6. Uncorrelated and Independent Random Variables

Two RVs X and Y are uncorrelated if Cov(X, Y) = 0. Since Cov(X, Y) = E[XY] − E[X]E[Y], the two RVs are uncorrelated if E[XY] = E[X]E[Y]. Show that if the RVs are independent, then they are also uncorrelated.

Solution

To show that if RVs X and Y are independent, they are also uncorrelated:

  1. Independence definition: p(x, y) = p(x)·p(y) for all x and y
  2. For independent variables: E[XY] = Σₓ Σᵧ x·y·p(x)·p(y) = (Σₓ x·p(x))·(Σᵧ y·p(y)) = E[X]·E[Y]
  3. Covariance: Cov(X, Y) = E[XY] − E[X]·E[Y]
  4. Substituting (2) into (3): Cov(X, Y) = E[X]·E[Y] − E[X]·E[Y] = 0

Therefore, if X and Y are independent, their covariance is 0, making them uncorrelated.
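
As a quick numerical sanity check (not part of the original solution), sampling two independent RVs gives a sample covariance close to zero:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(0, 10, size=100_000)   # independent draws
y = rng.normal(size=100_000)            # independent of x by construction

sample_cov = np.cov(x, y)[0, 1]
print(f"sample covariance ≈ {sample_cov:.4f}")   # close to 0, as expected
```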

7. Covariance and Correlation of Linear Transformation

Let Y = aX + b, where Y and X are RVs and a and b are constants.

(a). Find the covariance of X and Y.

(b). Find the correlation coefficient of X and Y.

Solution

(a). Covariance of X and Y:

$$\mathrm{Cov}(X, Y) = \mathrm{Cov}(X, aX + b) = a\,\mathrm{Cov}(X, X) = a\,\sigma_X^2$$

(b). Correlation coefficient of X and Y:

$$\rho_{XY} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{a\,\sigma_X^2}{\sigma_X \cdot |a|\,\sigma_X} = \frac{a}{|a|}$$

The correlation coefficient is 1 if a > 0, −1 if a < 0, and undefined if a = 0.
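
A quick numerical check of the a/|a| result (a sketch; the slope and intercept values below are arbitrary choices, not from the exercise):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100_000)

for a in (2.5, -0.7):                 # arbitrary positive and negative slopes
    y = a * x + 3.0                   # b = 3.0 is an arbitrary intercept
    rho = np.corrcoef(x, y)[0, 1]
    print(f"a = {a:+.1f}  ->  correlation ≈ {rho:+.3f}")   # ≈ +1 or −1
```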

8. Information Content of Six-Letter English Words

You need to store a six-letter English word. Assume there are 26 possible letters to choose from.

(a). Naively (assuming any combination of letters is equally likely and the letters are independent), how many bits of information do the 6 letters contain? [Compute the Shannon entropy of 6 independent RVs, each of which can take one of 26 possible values.]

(b). We know that letters are not really independent of each other. What are the implications for how much entropy is really represented by the 6 letters?

(c). We know there are in fact 22,000 6-letter words. Assuming they are equally likely, how many bits of entropy are there in the word?

(d). Some 6-letter words are more common than others - what does that mean for the entropy?

(a). Naive Entropy Calculation

  • Each letter: log₂(26) ≈ 4.70 bits
  • Six letters: 6 × log₂(26) ≈ 28.20 bits

For a uniform distribution with n equally likely outcomes, the entropy simplifies to H = log₂(n).

(b). Non-independent Letters Implication

Actual entropy would be lower due to letter dependencies and patterns in English words.

(c). Entropy with 22,000 Equally Likely Words

H = log₂(22,000) ≈ 14.43 bits

(d). Impact of Word Frequency on Entropy

Some words are much more common than others, so the distribution over words is non-uniform. A non-uniform distribution has lower entropy than the uniform case, so the true entropy of the word is less than the 14.43 bits computed in (c).
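
The calculations in (a) and (c) can be reproduced with a couple of lines of Python (a minimal sketch):

```python
import math

per_letter   = math.log2(26)         # ≈ 4.70 bits per independent letter
naive_word   = 6 * per_letter        # ≈ 28.20 bits for six independent letters
uniform_word = math.log2(22_000)     # ≈ 14.43 bits if all 22,000 words are equally likely

print(f"{per_letter:.2f} {naive_word:.2f} {uniform_word:.2f}")
```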

9. RFID Bee Monitoring System

An RFID reader and a microcontroller monitor bees entering a bee-hive. Each second it records how many bees have entered. For example:

| Time | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
|------|---|---|---|---|---|---|---|---|---|----|----|----|----|----|----|----|----|----|----|----|
| Bees | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0  | 0  | 2  | 0  | 1  | 0  | 0  | 0  | 0  | 1  | 0  |

Currently 2 bits are used each second to store the number of bees (0, 1, 2 or 3), but the microprocessor runs out of storage.

The PMF of the number of bees is: P(0) = 15/20 = 0.75, P(1) = 4/20 = 0.20, P(2) = 1/20 = 0.05.

(a). Suggest an encoding scheme that can store this more efficiently. [Hint: Remember that you need a way to ensure that the decoder knows where one ‘message’ ends and the next begins].

(b). After 100 seconds, the original storage scheme uses 200 bits of memory. What is the actual entropy in 100 seconds of observations (in bits)? Assume that individual bee arrival times are independent.

(c). How many bits would your encoding scheme need to store 100 seconds of observations (on average)?

(a). Efficient Encoding Scheme

Currently, 2 bits are used each second to store the number of bees (0, 1, 2 or 3). This fixed-length encoding uses 2 bits regardless of the frequency of each value.

We can use a more efficient variable-length encoding based on the probabilities:

  • Use Huffman Coding:
    • 0: 0 bees (15/20 probability)
    • 10: 1 bee (4/20 probability)
    • 11: 2 bees (1/20 probability)

This encoding reduces the average bits used:

  • 0 bees: 1 bit
  • 1 bee: 2 bits
  • 2 bees: 2 bits

This scheme reduces bit usage for the most common case (0 bees) while keeping the same length for 1 and 2 bees, resulting in an overall efficiency gain; a sketch of the encoder and decoder follows below.
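
A minimal sketch of this prefix code in Python (illustrative only; the code table is the one proposed above):

```python
# Prefix code: 0 -> "0", 1 -> "10", 2 -> "11".
# Because no codeword is a prefix of another, the decoder always knows
# where one message ends and the next begins.
ENCODE = {0: "0", 1: "10", 2: "11"}

def encode(counts):
    return "".join(ENCODE[c] for c in counts)

def decode(bits):
    counts, i = [], 0
    while i < len(bits):
        if bits[i] == "0":            # "0"  -> 0 bees
            counts.append(0)
            i += 1
        else:                         # "10" -> 1 bee, "11" -> 2 bees
            counts.append(1 if bits[i + 1] == "0" else 2)
            i += 2
    return counts

observations = [0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 2, 0, 1, 0, 0, 0, 0, 1, 0]
bits = encode(observations)
assert decode(bits) == observations
print(len(bits), "bits instead of", 2 * len(observations))   # 25 vs 40
```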

Extend: Calculate Bit Reduction

To calculate the bit reduction:

  1. Original scheme (fixed 2 bits per second):

    • Average bits per second = 2
  2. Huffman coding scheme:

    • Average bits per second = 0.75 × 1 + 0.20 × 2 + 0.05 × 2 = 1.25 bits
  3. Bit reduction:

    • 2 − 1.25 = 0.75 bits per second, i.e. 0.75 / 2 = 37.5%

This means that on average, the Huffman coding scheme saves 0.75 bits per second, or reduces the bit usage by 37.5%.

Over 100 seconds:

  • Original scheme: 200 bits
  • Huffman coding: 125 bits
  • Total reduction: 75 bits

The Huffman coding scheme thus reduces the required storage by 37.5% compared to the original fixed-length encoding.

(b). Actual Entropy in 100 Seconds

  • H = −(0.75·log₂ 0.75 + 0.20·log₂ 0.20 + 0.05·log₂ 0.05) ≈ 0.99 bits per second
  • 100 seconds: 100 × 0.99 ≈ 99 bits

Because the observations are recorded once per second and arrivals are assumed independent, the entropy is computed per second and scales linearly with the number of seconds.

(c). Bits Needed for Encoding Scheme (average)

We can use the expectation formula for the average code length.

  • Difference between Expectation and Average
  • Average bits per second = 0.75 × 1 + 0.20 × 2 + 0.05 × 2 = 1.25 bits
  • 100 seconds: 100 × 1.25 = 125 bits
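
The same numbers can be reproduced with a few lines of Python (a sketch, assuming the PMF above):

```python
import math

pmf = {0: 15/20, 1: 4/20, 2: 1/20}     # P(0)=0.75, P(1)=0.20, P(2)=0.05
code_len = {0: 1, 1: 2, 2: 2}          # lengths of the prefix code 0 / 10 / 11

entropy = -sum(p * math.log2(p) for p in pmf.values())   # ≈ 0.99 bits/s
avg_len = sum(pmf[k] * code_len[k] for k in pmf)         # 1.25 bits/s

print(f"entropy ≈ {entropy:.3f} bits/s -> {100 * entropy:.1f} bits per 100 s")
print(f"average code length = {avg_len:.2f} bits/s -> {100 * avg_len:.0f} bits per 100 s")
```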

10. Lab Room Allocation and Entropy

There is a 10% chance that there won't be enough space in the computer lab if our module is allocated Lab Room 1, and a 30% chance if it is allocated Lab Room 2. There is an even chance of being allocated either room.

(a). Compute the (marginal) entropy of the Full random variable, i.e. H(Full).

(b). Compute the conditional entropy H(Full | Lab). Confirm it is no greater than H(Full).

(a). Marginal Entropy of Full

  • P(Full) = 0.5 × 0.1 + 0.5 × 0.3 = 0.2, so H(Full) = −(0.2·log₂ 0.2 + 0.8·log₂ 0.8) ≈ 0.722 bits

(b). Conditional Entropy

  • H(Full | Lab = 1) = −(0.1·log₂ 0.1 + 0.9·log₂ 0.9) ≈ 0.469 bits

  • H(Full | Lab = 2) = −(0.3·log₂ 0.3 + 0.7·log₂ 0.7) ≈ 0.881 bits

  • H(Full | Lab) = 0.5 × 0.469 + 0.5 × 0.881 ≈ 0.675 bits

  • Since 0.675 < 0.722, the conditional entropy is indeed no greater than H(Full).

H(Full | Lab) measures the uncertainty of the Full variable when the Lab information is given. It answers the question "How certain can we be about the Full status when we know the Lab?" The lower conditional entropy means that knowing the Lab reduces the uncertainty about the Full status.

  • Without Lab information, Full status is more uncertain.
  • Knowing the Lab, especially if it’s Lab 1, we can be more confident that the Full probability is low.
  • If it’s Lab 2, we know the Full probability is relatively high.

Extend: Naive Bayes Vs Conditional Entropy

Naive Bayes and Conditional Entropy seemed similar in terms of supporting decision-making because:

  • Knowing Lab information reduces uncertainty about Full status. This is similar to updating posterior probabilities with new evidence (Lab information) in Naive Bayes.
  • Lower conditional entropy means better prediction is possible. This is similar to improving classification accuracy with new information in Naive Bayes.
  • Uncertainty Reduction: Both methods reduce uncertainty about the target variable (Full) through additional information (Lab).

The reason I saw them as similar is that both methods use conditional probability and support decision-making. In fact, both concepts use Conditional Probability. Naive Bayes calculates P(class|features), and conditional entropy calculates H(Y|X).

The difference in purpose is that Naive Bayes is used to solve classification problems and predict the class of new data. Conditional entropy measures the amount of information one variable provides about another.

Update method:

  • Naive Bayes: Updates beliefs by calculating posterior probabilities whenever new evidence is given.
  • Conditional Entropy: A static quantity that measures the uncertainty of one variable given another.

Result interpretation:

  • Naive Bayes: Provides the probability of belonging to a specific class.

  • Conditional Entropy: Measures the amount of information in bits.