Mathematical foundations

Covariance

\Large cov(x,y)=\frac{1}{n}\sum_{i=1}^{n} (x_{i} - \mu_{x})(y_{i} - \mu_{y})

Correlation

\Large corr(x,y)=\frac{cov(x,y)}{\sigma_{x}\,\sigma_{y}}
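
To make the two formulas concrete, here is a minimal numpy sketch (the sample data is made up) that computes both by hand and checks them against numpy's built-ins:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

# Population covariance: mean of the products of deviations from the means.
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))

# Correlation: covariance rescaled by both standard deviations, in [-1, 1].
corr_xy = cov_xy / (x.std() * y.std())

print(cov_xy)                          # manual covariance
print(np.cov(x, y, bias=True)[0, 1])   # numpy equivalent (bias=True -> divide by n)
print(corr_xy)                         # manual correlation
print(np.corrcoef(x, y)[0, 1])         # numpy equivalent
```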

Probability

  • the probability of two independent events happening at the same time: P(A and B) = P(A) × P(B)

  • the probability of two dependent events happening at the same time: P(A and B) = P(A) × P(B|A)

  • the probability of disjoint events A and B (they are mutually exclusive): P(A or B) = P(A) + P(B)

  • if A and B are not mutually exclusive: P(A or B) = P(A) + P(B) - P(A and B)

What is the probability that a card chosen from a standard deck will be a Jack or a heart? P(Jack or Heart) = P(Jack) + P(Heart) - P(Jack of Hearts) = 4/52 + 13/52 - 1/52 = 16/52
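
As a quick sanity check (an illustrative sketch, not from the original notes), a few lines of Python can enumerate a full deck and count the favorable cards:

```python
# Build a standard 52-card deck as (rank, suit) pairs.
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = [(r, s) for r in ranks for s in suits]

# Count cards that are a Jack OR a heart (the Jack of hearts counted once).
favorable = [c for c in deck if c[0] == "J" or c[1] == "hearts"]
print(len(favorable), "/", len(deck))  # 16 / 52
```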

Bayesian

In machine learning, there are two main approaches to probability: the Bayesian approach and the frequentist approach. The Bayesian approach treats probability as a degree of belief and uses Bayes' theorem to update probabilities as new data arrives. The frequentist approach treats probability as a long-run frequency and relies on methods such as hypothesis testing and confidence intervals to make decisions.

Bayes theorem

Bayes' theorem provides a way to update probabilities based on new evidence. Understanding it involves three levels:

  1. Knowing the formula - being able to plug in numbers
  2. Understanding why it's true - grasping the derivation
  3. Recognizing when to apply it - identifying real-world situations

The formula

The Bayes formula for the probability of hypothesis H given evidence E:

\Large P(H|E) = \frac{P(H)\,P(E|H)}{P(E)}

Where:

  • P(H) - the prior: probability of the hypothesis before seeing evidence
  • P(E|H) - the likelihood: probability of the evidence assuming the hypothesis is true
  • P(E) - the evidence: total probability of seeing the evidence under all hypotheses
  • P(H|E) - the posterior: updated probability of the hypothesis after seeing evidence

Another view of this formula expands the evidence over the hypothesis and its complement:

\Large P(H|E) = \frac{P(H)\,P(E|H)}{P(H)\,P(E|H) + P(\neg H)\,P(E|\neg H)}

The Steve example

Consider Steve, described as a "meek and tidy soul, with a need for order and a passion for detail." Is Steve more likely a librarian or a farmer?

The intuitive answer might be librarian, but Bayes' theorem requires considering:

  • Prior probability: There are roughly 20 farmers for every librarian in the population -> P(H) = 1 / (1 + 20) = 1/21
  • Likelihood: What fraction of librarians vs farmers fit Steve's description?

Even if 40% of librarians fit the description (4 out of 10 librarians), which is the likelihood P(E|H), while only 10% of farmers do (20 out of 200 farmers), the prior still matters:

  • Librarians matching the description: 10 × 0.40 = 4
  • Farmers matching the description: 200 × 0.10 = 20

Steve is about 5x more likely to be a farmer than a librarian.

This gives P(Librarian | description) = 4 / (4 + 20) ≈ 16.7%.
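
The same computation can be written as a few lines of Python (a minimal sketch reusing the numbers above):

```python
# Priors and likelihoods from the Steve example.
priors = {"librarian": 1 / 21, "farmer": 20 / 21}
likelihoods = {"librarian": 0.40, "farmer": 0.10}  # P(description | job)

# Multiply prior by likelihood, then renormalize by the total evidence.
unnormalized = {job: priors[job] * likelihoods[job] for job in priors}
evidence = sum(unnormalized.values())  # P(description)

posterior = {job: p / evidence for job, p in unnormalized.items()}
print(posterior)  # {'librarian': ~0.167, 'farmer': ~0.833}
```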

  • Rationality is not about knowing facts, it's about recognizing which facts are relevant.
  • Seeing evidence restricts the space of possibilities.

Visual representation

A geometric interpretation uses a unit square representing the sample space:

  1. Divide the square into regions representing each hypothesis (proportional to prior probabilities)
  2. Within each region, shade the area where the evidence holds (proportional to likelihood)
  3. The posterior is the ratio of shaded hypothesis area to total shaded area

This visualization shows how restricting to cases where evidence holds (conditioning) changes the probability.
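
A Monte Carlo sketch of this picture (purely illustrative, reusing the Steve numbers) samples points uniformly in the unit square and keeps only those landing in a shaded evidence region:

```python
import random

random.seed(0)
prior_h, like_h, like_not_h = 1 / 21, 0.40, 0.10

hits_h, hits_total = 0, 0
for _ in range(200_000):
    x, y = random.random(), random.random()
    in_h = x < prior_h                              # left strip = hypothesis region
    shaded = y < (like_h if in_h else like_not_h)   # evidence holds here
    if shaded:
        hits_total += 1
        hits_h += in_h

print(hits_h / hits_total)  # ≈ 0.167: the posterior as a ratio of shaded areas
```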

Key takeaways

  • Always consider base rates (priors) before updating beliefs
  • New evidence updates but does not replace prior knowledge
  • Context and representative sampling affect the validity of conclusions

The Bayesian approach handles uncertainty and complex data well. Frequentist methods are more common when data is abundant and variable relationships are well-defined.

See the conditional probability notebook exercise to simulate the probability of buying something given a customer's age and previous purchase data: totals contains the total number of people in each age group, and purchases contains the total number of things purchased by people in each age group.
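
As a sketch of what that computation looks like, assuming totals and purchases are dictionaries keyed by age group as described (the counts below are invented):

```python
totals = {"20s": 16576, "30s": 16619, "40s": 16632,
          "50s": 16805, "60s": 16664, "70s": 16704}
purchases = {"20s": 3392, "30s": 4974, "40s": 6670,
             "50s": 8319, "60s": 9944, "70s": 11713}

# P(purchase | age) for each age group: purchases divided by group size.
for age in totals:
    print(age, purchases[age] / totals[age])

# Overall P(purchase), ignoring age: total purchases over total people.
print(sum(purchases.values()) / sum(totals.values()))
```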

See this video from 3Blue1Brown for a geometric interpretation of Bayes' theorem.

Data distributions

See this notebook presenting some Python code on different data distributions such as Uniform, Gaussian, and Poisson. It can be executed in VS Code using the PyTorch kernel.
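
As a minimal stand-in for the notebook, here is a sketch that samples from those three distributions (using numpy for brevity instead of PyTorch):

```python
import numpy as np

rng = np.random.default_rng(42)

uniform = rng.uniform(low=0.0, high=1.0, size=10_000)   # flat over [0, 1)
gaussian = rng.normal(loc=0.0, scale=1.0, size=10_000)  # bell curve, mean 0, std 1
poisson = rng.poisson(lam=3.0, size=10_000)             # event counts, rate 3

for name, sample in [("uniform", uniform), ("gaussian", gaussian), ("poisson", poisson)]:
    print(name, sample.mean(), sample.std())
```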

Normalization

Normalization of ratings means adjusting values measured on different scales to a notionally common scale, often prior to averaging.

In statistics, normalization refers to creating shifted and scaled versions of statistics, with the intention that these normalized values can be compared across different datasets in a way that eliminates the effects of certain gross influences, as in an anomaly time series.

Feature scaling is used to bring all values into the range [0,1]. This is also called unity-based normalization:

\Large X' = \frac{X - X_{min}}{X_{max} - X_{min}}
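
A minimal Python sketch of this formula:

```python
import numpy as np

def min_max_normalize(x: np.ndarray) -> np.ndarray:
    """Rescale values linearly into [0, 1] (unity-based normalization)."""
    return (x - x.min()) / (x.max() - x.min())

ratings = np.array([2.0, 5.0, 3.5, 4.0, 1.0])
print(min_max_normalize(ratings))  # [0.25 1. 0.625 0.75 0.]
```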

Sigmoid function

A sigmoid function has an S-shaped curve; the most common example is the logistic function, which maps a real number to a value between 0 and 1:

\Large \sigma(x) = \frac{1}{1 + e^{-x}}

It is used as an activation function of artificial neurons. The logistic sigmoid function is invertible, and its inverse is the logit function:

\Large logit(p) = \ln\left(\frac{p}{1-p}\right)

p being a probability, p/(1-p) is the corresponding odds.
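
A short numpy sketch of the sigmoid/logit pair, checking that one inverts the other:

```python
import numpy as np

def sigmoid(x):
    """Map any real number to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    """Inverse of the sigmoid: log of the odds p / (1 - p)."""
    return np.log(p / (1.0 - p))

x = np.array([-2.0, 0.0, 2.0])
p = sigmoid(x)
print(p)         # [0.119 0.5 0.881] (approximately)
print(logit(p))  # recovers [-2, 0, 2]
```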