Think of probability as a tool that helps us make sense of uncertainty. It's not about knowing what will happen for sure, but about understanding the chances of different outcomes. Probability guides everything from weather forecasts to stock market trends and even the decisions made by AI.
When you interact with AI language models like ChatGPT or Claude or other LLMs, the responses you receive are based on probability calculations. The model identifies the most likely response from a vast array of alternatives, effectively selecting the top contender based on its learned patterns and data.
But here’s the interesting part: even though we use probability all the time, people still haven't come to a clear conclusion about its definition. Mathematicians, philosophers, and scientists often debate what probability truly means. It's a tricky concept that shapes how we understand the uncertain world around us.
This seemingly simple concept is at the heart of:
Doctors assessing the likelihood of treatment success
Insurance companies calculating your premiums
Physicists describing the behavior of quantum particles
What is Probability theory?
Probability theory provides a mathematical framework for reasoning about uncertain events. It's the branch of mathematics that deals with analyzing and quantifying the likelihood of events occurring. In the context of deep learning, we often work with uncertain or incomplete information, and probability theory gives us a formal framework for reasoning about that uncertainty.
Table of Contents:
Introduction to Probability Theory
Core Concepts (Events, Sample Space, Probability)
Axioms of Probability
Key Areas (Discrete and Continuous Probability)
Random Variables (Discrete and Continuous)
Probability Mass Function (PMF)
Joint Probability Mass Function
Marginal Probability
Probability Density Function (PDF)
Conditional Probability
Chain Rule of Probability
Let's start:
Events: Possible outcomes of an experiment or process
Sample space: Set of all possible outcomes.
Probability: How likely an event is to occur, expressed as a number between 0 and 1.
Axioms of Probability:
The fundamental rules or principles that form the foundation of probability theory
Non-negativity: Probabilities are always non-negative.
Normalization: Probability of entire sample space is 1.
Coin toss: P(Heads) + P(Tails) = 0.5 + 0.5 = 1
Die roll: P(1) + P(2) + P(3) + P(4) + P(5) + P(6)
= 1/6 + 1/6 + 1/6 + 1/6 + 1/6 + 1/6 = 1
Additivity:
Imagine you have a regular six-sided die. We'll consider two events:
A: Rolling an even number (2, 4, or 6)
B: Rolling an odd number (1, 3, or 5)
These events are mutually exclusive because you can't roll an even and an odd number at the same time. Now, let's break down the additivity principle:
Probability of rolling an even number (A): P(A) = 3/6 = 1/2
Probability of rolling an odd number (B): P(B) = 3/6 = 1/2
Probability of rolling either an even OR an odd number:
P(A or B) = P(A) + P(B) = 1/2 + 1/2 = 1
The additivity principle simply states that for events that can't happen at the same time (mutually exclusive), you can add their individual probabilities to find the probability of either event occurring.
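To make these axioms concrete, here's a minimal Python sketch (my own illustration, not part of the original example) that checks non-negativity, normalization, and additivity for a fair six-sided die:

```python
# A minimal sketch (my own illustration) checking the axioms for a fair die.
die_pmf = {face: 1/6 for face in range(1, 7)}        # P(1) = ... = P(6) = 1/6

# Non-negativity: every probability is >= 0.
assert all(p >= 0 for p in die_pmf.values())

# Normalization: the probabilities over the whole sample space sum to 1.
assert abs(sum(die_pmf.values()) - 1.0) < 1e-9

# Additivity for mutually exclusive events:
# A = rolling an even number, B = rolling an odd number.
p_even = sum(die_pmf[f] for f in (2, 4, 6))           # P(A)
p_odd = sum(die_pmf[f] for f in (1, 3, 5))            # P(B)
print(p_even, p_odd, p_even + p_odd)                  # ≈ 0.5, 0.5, 1.0
```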
Key areas of probability theory:
1. Discrete probability:
Deals with countable outcomes (e.g., coin flips, die rolls), i.e., either heads or tails in a coin flip.
2. Continuous probability:
Handles uncountable, infinite outcomes (e.g., time, height). Height can theoretically take any real value within a range (e.g., 150.0 cm, 150.1 cm, 150.11 cm, 150.111 cm, etc.). There are infinitely many possible values between any two heights.
The probability of any exact height (e.g., exactly 170.000000... cm) is technically zero
We always work with ranges in continuous probability
4. Random Variables:
There are two kinds:
Discrete Random Variable
Probability Mass Function (PMF)
Joint Probability Mass Function (Joint PMF)
Joint Probability
Relationship between PMF & joint PMF
Marginal Probability
Relationship between marginal probabilities & joint probabilities.
Continuous Random Variable
Probability Density Function
4.1. Discrete Random Variable:
Imagine we're flipping a coin. We want to assign numbers to the outcomes so we can do math with them. This assignment is what we call a random variable.
Let's define a random variable X like this:
If the coin lands on heads, X = 1
If the coin lands on tails, X = 0
So X is a function that takes the physical outcome of the coin flip and turns it into a number:
X(Heads) = 1
X(Tails) = 0
What's the probability that X equals 1? (This is the same as asking "What's the probability of getting heads?")
What's the average value of X if we flip the coin many times? (This would tell us about the fairness of the coin)
This is the essence of a random variable - it's a way to assign numbers to outcomes so we can analyze random events mathematically.
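If you like to see this in code, here's a small Python sketch (illustrative only) of X as a function from outcomes to numbers, using simulation to estimate P(X = 1) and the average value of X:

```python
import random

# X maps the physical outcome of a coin flip to a number,
# exactly as defined above: X(Heads) = 1, X(Tails) = 0.
def X(outcome):
    return 1 if outcome == "Heads" else 0

random.seed(0)                                        # reproducible illustration
flips = [random.choice(["Heads", "Tails"]) for _ in range(100_000)]
values = [X(outcome) for outcome in flips]

# Empirical answers to the two questions above:
p_heads = values.count(1) / len(values)               # estimate of P(X = 1)
average = sum(values) / len(values)                    # long-run average of X

print(p_heads, average)                               # both ≈ 0.5 for a fair coin
```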
Now that we've defined our discrete random variable X, we need a way to describe its probability distribution. For discrete random variables like our coin flip example, we use what's called a Probability Mass Function (PMF).
Why do we need to describe the probability distribution?
Because it provides a complete picture of the random variable's behavior.
It tells us about all possible outcomes & their associated probabilities.
This information is crucial for making predictions & performing statistical analysis.
Answers some of the most important questions:
What's the likelihood of each outcome?
What's the most probable outcome?
4.1.a Probability Mass Function (PMF):
A function that gives the probability of each possible value of a single discrete random variable.
Deals with one random variable at a time
Assigns a probability to each possible value of that variable
Example: For a coin flip (X):
P(X = Heads) = 0.5
P(X = Tails) = 0.5
Probability mass function for our coin flip example would look like this:
P(X = 0) = 0.5 (probability of tails)
P(X = 1) = 0.5 (probability of heads)
It's called a mass function because it assigns a weight/mass to each discrete value that X can take.
Key properties of Probability Mass Function (PMF):
It gives the probability for each possible value of the discrete random variable.
All probabilities are non-negative.
The sum of all probabilities equals 1.
Using the PMF, we can easily answer questions like:
What's the probability that X equals 1? (0.5, or the probability of heads)
What's the probability that X is less than or equal to 0? (0.5, or the probability of tails)
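A quick sketch of the PMF as a plain Python dictionary, just to verify the properties and answer the two questions (an illustration, not from the note itself):

```python
# The coin-flip PMF as a plain dictionary.
pmf = {0: 0.5,    # P(X = 0), tails
       1: 0.5}    # P(X = 1), heads

# Key properties: non-negative probabilities that sum to 1.
assert all(p >= 0 for p in pmf.values())
assert abs(sum(pmf.values()) - 1.0) < 1e-9

# Answering the questions above:
print(pmf[1])                                        # P(X = 1) = 0.5
print(sum(p for x, p in pmf.items() if x <= 0))      # P(X <= 0) = 0.5
```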
4.1.b Joint Probability Mass Function (Joint PMF):
Deals with multiple discrete random variables. A function that gives the joint probability for every possible combination of values. Deals with two or more random variables simultaneously.
Joint Probability
What does it mean?
This is the probability of two or more events occurring together. It's a single value for a specific combination of outcomes. Example: P(X = 3, Y = 4) = 1/36 for a roll of two dice. The joint PMF is the complete function or table of probabilities; a joint probability is a single value from this function or table.
It describes the entire probability distribution for multiple discrete random variables.
Assigns a probability to each possible combination of values across all variables
Example: For two coin flips (X and Y)
Each of the following individual probabilities is a joint probability:
P(X = Heads, Y = Heads) = 0.25
P(X = Heads, Y = Tails) = 0.25
P(X = Tails, Y = Heads) = 0.25
P(X = Tails, Y = Tails) = 0.25
Joint PMF considers the simultaneous occurrence of multiple events, while the regular PMF only considers one event at a time.
4.1.c PMF vs Joint PMF difference:
Let's consider a scenario with two dice, a red die (R) and a blue die (B).
PMF for a single die (let's use the red die R):
This function gives the probability of each possible outcome of rolling just one die (the red one):
P(R = 1) = 1/6
P(R = 2) = 1/6
P(R = 3) = 1/6
...
P(R = 6) = 1/6
Here, we're only concerned with the probability of the outcome of one variable (R).
Joint PMF for both dice (R and B):
This function gives the probability of each possible combination of outcomes when rolling both dice R and B together:
P(R = 1, B = 1) = 1/36
P(R = 1, B = 2) = 1/36
P(R = 1, B = 3) = 1/36
P(R = 2, B = 2) = 1/36
...
P(R = 6, B = 6) = 1/36
Here we are looking at the probability of a specific combination of outcomes for both variables R and B simultaneously.
This is the complete table of all possible joint probabilities.
It would be a 6x6 table with each cell containing 1/36.
Key difference:
Joint PMF allows us to consider multiple variables at once, capturing how they occur together, while the regular PMF only deals with one variable at a time.
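Here's a short, illustrative Python sketch (using NumPy, purely my own example) of the single-die PMF next to the 6x6 joint PMF table:

```python
import numpy as np

# My own illustration of the tables above; indices 0..5 stand for faces 1..6.
pmf_red = np.full(6, 1/6)              # single-die PMF: P(R = r)
joint_pmf = np.full((6, 6), 1/36)      # joint PMF: P(R = r, B = b)

# The single-die PMF answers questions about one variable, e.g. P(R = 1):
print(pmf_red[0])                      # 1/6 ≈ 0.1667

# The joint PMF answers questions about a specific combination, e.g. P(R = 1, B = 1):
print(joint_pmf[0, 0])                 # 1/36 ≈ 0.0278

# The full joint table is still a valid distribution: all entries sum to 1.
print(joint_pmf.sum())                 # ≈ 1.0
```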
4.1.d Marginal Probability:
The probability of an outcome for one die, regardless of the outcome of the other die.
Consider the scenario with two dice, a red die (R) and a blue die (B):
For the red die: P(R = 2) is the probability of rolling a 2 on the red die, no matter what the blue die shows.
For the blue die: P(B = 3) is the probability of rolling a 3 on the blue die, no matter what the red die shows.
Calculating Marginal Probabilities:
To find P(R = 2), we sum all the joint probabilities where the red die is 2; similarly, for P(B = 3), we sum all the joint probabilities where the blue die is 3.
Interpretation: the marginal probability P(R = 2) = 1/6 means the probability of rolling a 2 on the red die is 1/6, regardless of what happens with the blue die.
Likewise, the probability of rolling a 3 on the blue die is 1/6, regardless of what happens with the red die.
4.1.e. Relationship between marginal probabilities & joint probabilities:
Marginal probabilities are derived from joint probabilities by summing over all the possibilities of the other variables.
Joint probabilities give us the complete picture of both dice together.
Marginal probabilities give us information about one die, ignoring the other.
We can always calculate marginal probabilities from joint probabilities, but not vice versa.
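As a sketch, here's how the marginals drop out of the joint table by summing over the other die (again, my own illustration with NumPy):

```python
import numpy as np

# Rows index the red die (faces 1..6), columns index the blue die.
joint_pmf = np.full((6, 6), 1/36)

# Marginal of the red die: sum the joint probabilities over every blue-die outcome.
pmf_red = joint_pmf.sum(axis=1)
# Marginal of the blue die: sum the joint probabilities over every red-die outcome.
pmf_blue = joint_pmf.sum(axis=0)

print(pmf_red[1])    # P(R = 2) = 1/6, regardless of the blue die
print(pmf_blue[2])   # P(B = 3) = 1/6, regardless of the red die
```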
4.2. Continuous Random Variable:
Imagine we're measuring the waiting time at a busy coffee shop. Let's define a random variable T as the time (in minutes) a customer waits for their order.
Continuous nature: T can take any real value greater than or equal to 0. It could be 2.5 minutes, 3.7 minutes, 4.1234 minutes, etc.
We might model this with an exponential distribution. The Probability Density function could be: f(t) = λe^(-λt) for t ≥ 0, where λ is the average number of customers served per minute.
Properties:
T is non-negative (you can't wait for negative time).
T can theoretically be any real number ≥ 0.
The probability of T taking any exact value is 0 (e.g., P(T = 2.5) = 0).
Questions we might ask:
What's the probability of waiting less than 5 minutes? (Integrate the PDF from 0 to 5.)
What's the average waiting time (the expected value of T)?
For an exponential distribution, the average (or expected value) is 1/λ.
If λ = 2 customers served per minute, the average wait time is 1/2 = 0.5 minutes.
This tells us the typical wait time over many customers.
What's the median waiting time? (We'd find the 50th percentile of the distribution)
For an exponential distribution, the median is (ln 2)/λ, where ln is the natural logarithm.
Using λ = 2, the median wait time is (ln 2)/2 ≈ 0.35 minutes.
This tells us the point where half the customers wait less time, and half wait more.
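Here's a rough Python sketch of these calculations for the exponential model, assuming λ = 2 as above (the Riemann sum is just one simple way to approximate the integral):

```python
import numpy as np

# A rough sketch of the waiting-time calculations, assuming λ = 2 as above.
lam = 2.0

def pdf(t):
    # Exponential PDF: f(t) = λ * e^(-λ t) for t >= 0
    return lam * np.exp(-lam * t)

# P(T < 5): integrate the PDF from 0 to 5.
# A simple Riemann sum approximates the integral; the exponential also has
# the closed form 1 - e^(-λ·5), shown alongside for comparison.
dt = 1e-5
t = np.arange(0.0, 5.0, dt)
p_less_than_5 = np.sum(pdf(t)) * dt
print(p_less_than_5, 1 - np.exp(-lam * 5.0))   # both ≈ 0.99995

# Mean and median of the exponential distribution:
print(1.0 / lam)          # expected wait: 0.5 minutes
print(np.log(2) / lam)    # median wait: ≈ 0.35 minutes
```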
4.2.a What is a Probability Density Function?
It's simply a function, like the exponential density f(t) = λe^(-λt) from the waiting-time example above. It's a type of function used only for continuous random variables, and it describes the relative likelihood of a continuous random variable taking values near a given point. For example, the probability that a customer waits exactly 2 minutes for their coffee is zero. Not 2.01 minutes, not 2.002 minutes, not 1.9999 minutes, not 2.000001 minutes, but precisely 2.000000... minutes; not a single millisecond more or less than 2 minutes.
The probability of this exact waiting time is 0.
Showing that the probability of an exact value is 0 demonstrates why we need a different approach for continuous variables. This is where probability density comes in.
Zero probability for an exact value also explains why we need to integrate over an interval to find probabilities, rather than simply evaluating the function at a point.
Key points:
It's used only for continuous random variables, not for discrete ones.
The function f(x) is non-negative for all real numbers x.
The total area under the curve of the PDF equals 1.
The probability density function f(x) is not a probability but a density. The probability is found by integrating the PDF over an interval. For a PDF, the integral of f(x) over all possible x is 1, which means the total area under the entire PDF curve is always equal to 1.
Since PDF is for continuous random variables, what do we have for discrete random variables?
Probability Mass Function:
Both PMFs and PDFs describe the probability distribution of random variables. In this sense, they serve analogous roles for discrete and continuous variables respectively.
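A small sketch contrasting the two: a PMF is summed, a PDF is integrated. The exponential rate λ = 2 is the assumed value from the coffee-shop example, and the integrals are approximated with simple Riemann sums:

```python
import numpy as np

# Discrete: the coin-flip PMF is summed, and the sum is 1.
pmf = {0: 0.5, 1: 0.5}
print(sum(pmf.values()))                             # 1.0

# Continuous: the exponential PDF is integrated, and the total area is 1.
lam = 2.0
dt = 1e-4
t = np.arange(0.0, 50.0, dt)                         # 50 minutes is effectively "infinity" here
print(np.sum(lam * np.exp(-lam * t)) * dt)           # ≈ 1.0

# An exact value has probability 0, but an interval around it does not:
interval = np.arange(1.9, 2.1, dt)
print(np.sum(lam * np.exp(-lam * interval)) * dt)    # P(1.9 < T < 2.1) ≈ 0.007, small but non-zero
```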
5. Conditional Probability:
Conditional probability is the probability of an event occurring, given that another event has already occurred.
An example of diagnosing strep throat:
The overall probability of having strep throat in the general population is:
P(Strep) = 0.01 (~1%)
This means that 1% of the population has strep throat at any given time.
Conditional probability: Let's consider the probability of having strep throat given that a person has a sore throat and a fever, P(Strep | Sore throat, Fever).
The conditional probability might be much higher, let's say:
P(Strep | Sore throat, Fever) = 0.3 (~30%)
Why is conditional probability more informative?
Conditional probability helps us update our beliefs based on new information.
Context-specific: Conditional probability takes into account specific symptoms (sore throat and fever), which provides a more targeted assessment based on the patient's actual condition.
Higher predictive value: The conditional probability (30%) is much higher than the overall probability (1%).
Decision-making: Given the higher probability, a doctor is more likely to order a strep test or consider antibiotics for a patient with these symptoms. If they only considered the overall probability of 1%, it's likely they would dismiss it.
In machine learning this is crucial, as it allows models to make predictions based on specific input features (like symptoms) rather than overall statistics. This leads to more accurate and useful predictions.
Definition: The conditional probability of event A given event B is denoted as P(A|B) and is defined as:
P(A|B) = P(A ∩ B) / P(B)
Where:
P(A|B) reads as "the probability of A given B"
P(A ∩ B) is the probability of both A and B occurring
P(B) is the probability of B occurring
Another simple example:
Example: Let's consider a deck of 52 playing cards.
Event A: Drawing a King
P(King) = 4/52 = 1/13
Event B: Drawing a Face card (Jack, Queen, or King)
P(Face card) = 12/52 = 3/13
P(King | Face card) = P(King ∩ Face card) / P(Face card)
P(King | Face card) = (4/52) / (12/52)
= 1/3
Interpretation: If we know we've drawn a face card, the probability of it being a King is 1/3.
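A quick sanity check of the card example in Python, using exact fractions (an illustration only):

```python
from fractions import Fraction

# P(King), P(Face card), and P(King ∩ Face card) as exact fractions.
p_king = Fraction(4, 52)
p_face = Fraction(12, 52)
p_king_and_face = Fraction(4, 52)   # every King is a face card

# Conditional probability: P(King | Face card) = P(King ∩ Face card) / P(Face card)
p_king_given_face = p_king_and_face / p_face
print(p_king_given_face)            # 1/3
```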
Applications in Deep Learning:
The Naive Bayes classifier uses conditional probability as its fundamental principle.
In neural networks, the outputs of a classifier can be interpreted as conditional probabilities of the classes given the input.
Conditional random fields (CRFs) use conditional probabilities for sequence labelling tasks.
6. Chain Rule of Probability:
Chain rule of probability is closely related to conditional probability. It's a direct application & extension of conditional probability to scenarios involving multiple events.
Chain rule is built on the definition of conditional probability.
Conditional probability: P(A|B) = P(A ∩ B) / P(B)
We can rearrange this to get:
P(A ∩ B) = P(A|B) * P(B)
This is the simplest form of the chain rule, for just two events.
Let's consider three events related to diagnosing strep throat:
A: Patient has a strep throat.
B: Patient has a sore throat.
C: Patient has a fever.
Simple Two-Event Case:
P(A ∩ B) = P(A|B) * P(B)
Which means:
The probability of having strep throat and a sore throat is equal to the probability of having strep throat given that you have a sore throat, times the probability of having a sore throat.
Now, let's extend this to three events using the Chain Rule:
P(A ∩ B ∩ C) = P(A | B ∩ C) * P(B | C) * P(C)
The probability of having strep throat, a sore throat, and a fever is equal to:
the probability of having strep throat given that you have both a sore throat and a fever,
times the probability of having a sore throat given that you have a fever,
times the probability of having a fever.
The extension of the chain rule allows us to calculate the joint probability of all three events occurring together by breaking it down into conditional probabilities.
The power of this rule is that it can be extended to any number of events.
For example, if we added a 4th symptom D, where
D = swollen lymph nodes,
we can extend it further to:
P(A ∩ B ∩ C ∩ D) = P(A | B, C, D) * P(B | C, D) * P(C | D) * P(D)
This process of breaking down joint probabilities into a product of conditional probabilities is fundamental in many areas of machine learning and probabilistic modeling, as it allows us to work with manageable pieces of information.
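As a rough sketch, here's the three-event chain rule in code for the strep example. P(Strep | Sore throat, Fever) = 0.3 comes from earlier in the note; the other two numbers are made up purely for illustration:

```python
# P(A | B ∩ C) = 0.3 is the value used earlier in this note;
# P(B | C) and P(C) below are made-up numbers, purely for illustration.
p_fever = 0.10                         # P(C)
p_sore_given_fever = 0.60              # P(B | C)
p_strep_given_sore_fever = 0.30        # P(A | B ∩ C)

# Chain rule: P(A ∩ B ∩ C) = P(A | B ∩ C) * P(B | C) * P(C)
p_all_three = p_strep_given_sore_fever * p_sore_given_fever * p_fever
print(p_all_three)                     # ≈ 0.018, i.e. about 1.8% of patients
```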
To be continued in the next note, chapter 2.
#CSCE598 #deeplearning #probability #probabilitytheory #ml #ai #mathematics
#DeepLearningNotes