Statistics I
Lisbon Accounting and Business School – Polytechnic University of Lisbon
These slides are a free translation and adaptation of the slide deck for Estatística I by Prof. Sandra Custódio and Prof. Teresa Ferreira from the Lisbon Accounting and Business School – Polytechnic University of Lisbon.
Consider the following scenario:
💼 Investor: What’s the probability this startup will succeed?
📊 Analyst: Hard to say—every startup is different.
💼 Investor: But if you had to guess, based on similar cases?
📊 Analyst: Maybe 1 in 3 succeed under these conditions.
💼 Investor: So, would you bet on it?
📊 Analyst: Yes, I would.
💼 Investor: Even if the odds aren’t great?
📊 Analyst: I believe this one has what it takes.
Here we can define probability in terms of frequency of occurrence, i.e. as a percentage of successes in a moderately large number of similar situations.
This is the most natural and traditional way of thinking about probability.
But, what if this company belongs to a completely novel market sector?
There might be situations where the frequency concept is not adequate, because the event in question happens only once. In those cases, probabilities express subjective beliefs.
A company is recruiting a new CEO, and a board member says:
“I believe there’s a 90% chance that our chosen candidate will be an effective CEO.”
It might seem easy to disregard the second case as unscientific or useless. However, many times people need to make decisions under uncertainty with not enough data (or no data at all!) about previous realizations of the specific event.
Beliefs allow the decision maker to, well, make some decision, at least consistently.
Uncertainty
A set is a collection of objects, which are elements of the set.
Definition
Let \(S\) be a set and \(s\) an element of that set. We write \(s\in S\) to mean that \(s\) belongs to \(S\).
If \(s\) does not belong to \(S\), we write \(s\notin S\).
Definition
If a set \(S\) does not have any element, then it is the empty set, denoted by \(\emptyset\).
There are several ways to specify a set.
By extension, or as a list:
If a set \(S\) has a finite number of elements (\(x_i\in S\)) we can write it like this: \[S=\{x_1, x_2, ..., x_n\}\]
If a set \(S\) has infinitely many (but countably many) elements (\(x_i\in S\)) we can write it like: \[S=\{x_1, x_2, ...\}\]
By describing the property (\(P\)) that \(x\) must satisfy to be included in \(S\): \[S= \{x|x \text{ satisfies } P\}\] in this case \(|\) reads as such that. For example \[S=\{x\in\mathbb{R}|x\geq 0\}\] describes the non-negative real numbers.
This example is special, as the non-negative real numbers cannot be written down as a list: the interval \([0,\infty)\) is an uncountable set.
Definition
If \(\forall x\in S\) it is also true that \(x\in T\), then we say that \(S\) is a subset of \(T\), and we write it like \(S\subseteq T\).
Definition
If \(S\subseteq T\) and at the same time \(T\subseteq S\) then we say that \(S\) and \(T\) are equal, and we write it \(S=T\).
Definition
The universal set \(\Omega\) is the set that contains all objects that could conceivably be of interest in a particular context.
By definition, then, any set \(S\) must be a subset of \(\Omega\).
The universal set is important because it defines the scope of our analysis. Say we are studying the performance of students of Statistics I in 2026.
The cars parked outside our institution do not belong to the universal set, because they are not relevant for our purpose. Only students of Statistics I in 2026 belong to the universal set.
Definition
The complement of a set \(S\), with respect to \(\Omega\), is the set \(\{x\in\Omega| x\notin S\}\), that is, all the relevant elements that do not belong in \(S\). We denote it as \(S^c\).
Corollary: It is easy to see that \(\Omega^c=\emptyset\).
Definition
The union of two sets \(S\) and \(T\) is the set of all elements that belong to \(S\) or \(T\) (or both), and is denoted by \(S\cup T\). \[S\cup T=\{x\in\Omega | x\in S\ \vee\ x\in T\}\]
Definition
The intersection of two sets \(S\) and \(T\) is the set of all elements that belong to \(S\) and \(T\), and is denoted by \(S\cap T\). \[S\cap T=\{x\in\Omega | x\in S\wedge x\in T\}\]
Note that \(\vee\) stands for or, and \(\wedge\) stands for and.
Sometimes we might need to consider the union or intersection of many sets, and for that we can use a notation similar to the one we used for summations:
\[\bigcup_{n=1}^\infty S_n = S_1 \cup S_2 \cup ... = \{x\in\Omega | x \in S_n \text{ for some } n\}\]
\[\bigcap_{n=1}^\infty S_n = S_1 \cap S_2 \cap ... = \{x\in\Omega | x \in S_n \text{ for every } n\}\]
Definition
Two sets (say \(S\) and \(T\)) are said to be disjoint if \(S\cap T=\emptyset\).
More generally, a collection of sets \(S_n\) is disjoint if \(S_i\) and \(S_j\) are disjoint when \(i\neq j\).
Definition
A collection of sets is said to be a partition of a set \(S\) if the sets in the collection are:
Disjoint
Their union is \(S\)
We write \(\mathcal{P}(\Omega)\) for the collection of all partitions of \(\Omega\).
The number of elements of a set \(S\) is known as its cardinality and it is denoted as \(\# S\). \(\# S\) satisfies:
If we have two sets, \(S\) and \(T\), we define the set minus operation, \(S\setminus T\), as the set that contains all elements of \(S\) that do not belong to \(T\):
\[S\setminus T = \{x\in S| x\notin T\}\]
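The set operations defined so far can be tried out on small finite sets. The following is a minimal sketch in base R; the sets `omega`, `S` and `T_set` are made-up illustrations, not taken from the slides.

```r
# Illustrative universal set (assumed): the faces of a die
omega <- 1:6
S     <- c(2, 4, 6)
T_set <- c(4, 5, 6)   # named T_set because T is shorthand for TRUE in R

union_ST  <- union(S, T_set)        # S union T
inter_ST  <- intersect(S, T_set)    # S intersection T
S_minus_T <- setdiff(S, T_set)      # S \ T
S_comp    <- setdiff(omega, S)      # complement of S with respect to omega
card_S    <- length(S)              # cardinality #S

# Subset check: every element of the intersection also belongs to S
inter_in_S <- all(inter_ST %in% S)
# Disjointness check: S and its complement share no element
disjoint   <- length(intersect(S, S_comp)) == 0
```

Here `setdiff(omega, S)` plays the role of \(S^c\), which only makes sense once the universal set has been fixed.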
In probability, \(\Omega\), the universal set, is a non-empty set that contains all possible outcomes of an experiment. Each outcome is represented by \(\omega\), and obviously \(\omega\in\Omega\).
The sample space (\(\Omega\)) can be:
Consider the experiment of throwing a die 🎲 and noting the number shown on the side facing upwards.
Consider now the random experiment of measuring the life expectancy of a lamp 💡, measured in hours.
Remember, \(\Omega\) must include all possible outcomes from your experiment! Even the ones that seem ludicrous.
Definition
A subset \(A\) of the sample space \(\Omega\) is called an event. \[A\subseteq \Omega\]
By definition then \(\Omega\) is also an event.
Definition
We say that an event \(A\) is realized if, after an experiment, the realized outcome \(\omega\) satisfies \(\omega \in A\).
Let’s go back to our experiment with the 🎲
The sample space is: \(\Omega=\{1,2,3,4,5,6\}\)
Within this space, we can define the following events:
Now let’s revisit the example of our 💡
The sample space is: \(\Omega=\{x\in\mathbb{R}|x\geq 0\}\)
In this space, we can define the following events:
Definitions
An elementary event is any event that contains a single element (i.e. \(\# A = 1\))
An impossible event is an event with no outcome, (i.e. \(\# A=0\)). As a consequence, an impossible event coincides with the empty set \(\emptyset\).
The certain event is \(\Omega\) itself: whatever outcome \(\omega\) we obtain, it belongs to the sample space \(\Omega\) by definition.
Consider two events \(A\) and \(B\) both subsets of \(\Omega\)
\(A^c\) contains all the outcomes that are not in \(A\). \(A^c\) is the event of not \(A\).
If \(A\subseteq B\), then an outcome that realizes event \(A\) (\(\omega\in A\)), also realizes \(B\), as \(A\subseteq B\Rightarrow \omega\in B\) as well. \(A\Rightarrow B\)
For \(A\cup B\) to happen, we need \(\omega \in A\) or \(\omega \in B\), which means that \(A\) happens, or \(B\) happens, or both happen simultaneously.
For \(A\cap B\) to happen, we need \(\omega \in A\) and \(\omega \in B\), which means that \(A\) and \(B\) happen simultaneously.
\(A\) and \(B\) are incompatible if \(A\cap B=\emptyset\), i.e. if an outcome is in one set, it cannot be in the other. For example, \(A\) and \(A^c\) cannot happen simultaneously!
Consider the sample space \(\Omega =\{1,2,3,4,5,6\}\), from our 🎲 case.
Define the events: \[A=\{1\},\ B=\{3,6\},\ C=\{2,4,6\},\ D=\{4,5,6\}\]
Let’s define the following events in \(\Omega\)
Besides the frequency and subjective concepts of probability we already saw, there is an older one, called the “classic” interpretation, introduced by Pierre-Simon Laplace in 1812.
Laplace or Classic interpretation of probability
Let \(A\) be an event defined over a finite \(\Omega\). The probability of event \(A\) is defined as:
\[P(A)=\frac{\# A}{\# \Omega}\]
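As a small illustration of the counting formula, the sketch below applies it to the fair-die events \(C\) and \(D\) defined earlier. It assumes all six outcomes are equally likely; the helper `laplace_prob` is a hypothetical name introduced only for this example.

```r
# Sample space of one fair die and two events from the earlier example
omega <- 1:6
C <- c(2, 4, 6)
D <- c(4, 5, 6)

# Laplace probability: favourable outcomes over possible outcomes
laplace_prob <- function(event, space) length(event) / length(space)

p_C       <- laplace_prob(C, omega)                # 3/6
p_D       <- laplace_prob(D, omega)                # 3/6
p_C_and_D <- laplace_prob(intersect(C, D), omega)  # {4, 6} -> 2/6
p_C_or_D  <- laplace_prob(union(C, D), omega)      # {2, 4, 5, 6} -> 4/6
```

Note that `p_C_or_D` equals `p_C + p_D - p_C_and_D`, as the addition rule requires.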
Consider now an experiment throwing two dice 🎲 🎲
The problem with this interpretation is that it cannot be used, or becomes meaningless, when \(\Omega\) is infinite (countable or uncountable). Also, what if the outcomes are not equally likely (i.e. if the dice are not fair)?
The frequentist interpretation is still the dominant interpretation of probability today.
In this case, what we want is to observe several independent repetitions of the experiment. After a while, some statistical regularity begins to emerge.
Logically, if you run an experiment, and are interested in the probability of event \(A\), then the events you are registering are \(A\) and \(A^c\) or not \(A\).
Every time you run your experiment, you record whether you observed \(A\) or \(A^c\). Obviously, the total number of experiments is the number of times you observed \(A\) plus the number of times you observed \(A^c\).
Experiment: Draw a random number in the interval \([0,1]\). \(A\) denotes \(x<0.4\).
| \(A\) | \(A^c\) | \(N\) | \(P(A)\) |
|---|---|---|---|
| 0 | 1 | 1 | 0 |
| 2 | 8 | 10 | 0.2 |
| 17 | 33 | 50 | 0.34 |
| 39 | 61 | 100 | 0.39 |
| 217 | 283 | 500 | 0.434 |
| 802 | 1198 | 2000 | 0.401 |
As you can see, the more experiments we run, the more the ratio of occurrences of \(A\) over the total number of experiments stabilizes. More generally:
\[P(A)=\lim_{N\rightarrow \infty}\frac{A\text{ occurrences}}{N \text{- Number of Experiments}}\]
This is the relative frequency of \(A\) in \(N\) experiments: \(f_A\)
It is not always possible to repeat the experiment that many times under the same conditions.
It seems that the probability that the random number between 0 and 1 is below 0.4 is approximately 40%. The more experiments we run, the closer our relative frequency is to that number.
\[P(A)\underset{N \rightarrow \infty}{\rightarrow} 0.4\]
By the way, we will see later that theoretically, indeed \(P(A)=0.4\)
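The experiment in the table can be imitated with a short simulation. This is only a sketch: the seed and the sample sizes are arbitrary choices, so the relative frequencies it produces will not match the table above.

```r
set.seed(2026)  # arbitrary seed, used only to make the run reproducible

# Numbers of repetitions to try (assumed values)
N <- c(10, 100, 1000, 100000)

# For each N, draw N uniform numbers on [0, 1] and compute
# the relative frequency of the event A = {x < 0.4}
rel_freq <- sapply(N, function(n) mean(runif(n) < 0.4))

result <- data.frame(N = N, f_A = rel_freq)
print(result)
```

With the largest sample size the relative frequency should sit close to 0.4, while the small samples can be far off.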
Andrey Kolmogorov defined a set of properties that any probability measure \(P\) should have. These are known as Kolmogorov’s axioms (1933):
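For reference, the usual textbook statement of the axioms is given here (the notation \(A_n\) for a sequence of events is an assumption of this note, since the original list is not reproduced in this text):

\[P(A)\geq 0 \quad \text{for every event } A\subseteq\Omega\]
\[P(\Omega)=1\]
\[P\left(\bigcup_{n=1}^\infty A_n\right)=\sum_{n=1}^\infty P(A_n)\quad \text{whenever } A_i\cap A_j=\emptyset \text{ for } i\neq j\]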
Let \(A\) and \(B\) be some events in \(\Omega\)
Conditional probability, as the wording implies, means the probability of something happening given something else has happened. Now, note “something” here makes reference to an event.
\[P(A|B)\]
It reads the probability of \(A\), given \(B\).
Note that, thinking in terms of sets, by saying given \(B\) we immediately exclude everything that could have happened if \(B\) did not happen, and therefore our universal set is no longer \(\Omega\), but \(B\).
What we are looking for is, among the outcomes that live in \(B\), how many also live in \(A\) (because those would trigger event \(A\)). Actually, we are interested in the relative measure of those outcomes, compared to the whole size of \(B\): \[P(A|B)=\frac{P(A\cap B)}{P(B)}\]
Consider a factory that makes 10 wrenches 🔧. Among those, we know that 2 have imperfections. Suppose you intend to remove, randomly, 2 🔧 from the lot (of 10). Consider the following events:
\(A = \{\text{The first 🔧 is faulty}\}\)
\(B = \{\text{The second 🔧 is faulty}\}\)
What if we want to compute \(P(B)\)? To assess \(B\) correctly, we had better have some information on the realization of \(A\)!
If the first 🔧 was faulty, then \(A\) happened. If the first 🔧 was ok, then \(A^c\) happened, and therefore we can compute \(P(B|A)\) and \(P(B|A^c)\). We are assuming that we remove these 🔧 without replacing them.
Let’s see why this last detail (replacing the 🔧) is so relevant before going on.
Initial set:
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|
| 🔧 | 🔧 | 💥 | 🔧 | 💥 | 🔧 | 🔧 | 🔧 | 🔧 | 🔧 |
Remove one at random (you do not know in advance which), and look at it once it is out. Say we take out number 7.
| 1 | 2 | 3 | 4 | 5 | 6 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|
| 🔧 | 🔧 | 💥 | 🔧 | 💥 | 🔧 | 🔧 | 🔧 | 🔧 |
We observe, and voilà, it was a fine wrench 🔧. If we replace it though, we would be picking from
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|
| 🔧 | 🔧 | 💥 | 🔧 | 💥 | 🔧 | 🔧 | 🔧 | 🔧 | 🔧 |
That is, under exactly the same conditions as when we made our first choice, and therefore what happens with the first pick is irrelevant: these events are now independent!
With \(A\), we know that when we pick the second wrench (\(B\)), in the box there are 2 💥 and 8 🔧.
With \(A^c\), we know that when we pick the second wrench (\(B\)), in the box there are 2 💥 and 8 🔧.
The probability of getting a 💥 is the same in each scenario! \(P(B|A)= P(B|A^c)\)!
\[P(B|A)=\frac{2}{10}=0.2\ \text{and}\ P(B|A^c)=\frac{2}{10}=0.2\]
Initial set:
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|
| 🔧 | 🔧 | 💥 | 🔧 | 💥 | 🔧 | 🔧 | 🔧 | 🔧 | 🔧 |
Remove one at random (you do not know in advance which one is broken)
| 1 | 2 | 3 | 4 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|
| 🔧 | 🔧 | 💥 | 🔧 | 🔧 | 🔧 | 🔧 | 🔧 | 🔧 |
We observe, and voilà, it was a broken wrench 💥: \(A\) happened!
| 1 | 2 | 3 | 4 | 5 | 6 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|
| 🔧 | 🔧 | 💥 | 🔧 | 💥 | 🔧 | 🔧 | 🔧 | 🔧 |
We observe, and voilà, it was a fine wrench 🔧: \(A^c\) happened!
With \(A\), we know that when we pick the second wrench (\(B\)), in the box there are 1 💥 and 8 🔧.
With \(A^c\), we know that when we pick the second wrench (\(B\)), in the box there are 2 💥 and 7 🔧.
The probability of getting a 💥 is different in each scenario! \(P(B|A)\neq P(B|A^c)\)!
\[P(B|A)=\frac{1}{9}=0.111\ \text{and}\ P(B|A^c)=\frac{2}{9}=0.222\]
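To answer the earlier question about \(P(B)\) without replacement, the two conditional probabilities can be weighted by \(P(A)\) and \(P(A^c)\). The sketch below does this; it relies on the Law of Total Probability, which these slides only state later, so treat it as a preview rather than part of the original argument.

```r
# First draw: 2 faulty wrenches out of 10
p_A  <- 2 / 10
p_Ac <- 1 - p_A

# Second draw without replacement (9 wrenches left)
p_B_given_A  <- 1 / 9   # one faulty wrench left if the first was faulty
p_B_given_Ac <- 2 / 9   # two faulty wrenches left if the first was fine

# Probability that both draws are faulty
p_A_and_B <- p_B_given_A * p_A

# Probability that the second wrench is faulty, whatever the first was
p_B <- p_B_given_A * p_A + p_B_given_Ac * p_Ac
```

Under these assumptions `p_B` comes out equal to `p_A`: before looking at the first wrench, the second one is as likely to be faulty as the first.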
So formally
Definition
Let \(A,B\subset\Omega\). Then we say the probability of \(A\) given \(B\) is the conditional probability: \[P(A|B)=\frac{P(A\cap B)}{P(B)}\] Note that from here, we can obtain also \(P(A\cap B)=P(A|B)P(B)\). In both situations we need \(P(B)\neq 0\).
It follows that the identity \[P(B|A)=\frac{P(A\cap B)}{P(A)}\] or \[P(A\cap B)=P(B|A)P(A)\] with \(P(A)\neq 0\) also holds true.
Now let’s think about \(P(A\cap B \cap C)\):
Note that given the commutativity of the intersection, we could have obtained also:
And to make sense of all of this we need \(P(X)>0\), \(P(X\cap Y)>0\) with \(X,Y\in\{A,B,C\}\).
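For reference, the factorizations this discussion points to are presumably the standard chain-rule identities, obtained by applying \(P(X\cap Y)=P(X|Y)P(Y)\) twice:

\[P(A\cap B\cap C)=P(C|A\cap B)P(A\cap B)=P(C|A\cap B)P(B|A)P(A)\]

Grouping the same intersection in a different order gives, for example:

\[P(A\cap B\cap C)=P(A|B\cap C)P(B\cap C)=P(A|B\cap C)P(C|B)P(B)\]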
Consider a region with 1,000 adults. Their job data is captured by the following table:
| | Employed | Unemployed | Total |
|---|---|---|---|
| Women | 470 | 55 | 525 |
| Men | 430 | 45 | 475 |
| Total | 900 | 100 | 1,000 |
Let’s define the events:
\(W=\{Woman\}\), \(M=\{Man\}\), \(U=\{Unemployed\}\)
\(P(U|W)=\frac{P(U\cap W)}{P(W)}=\frac{0.055}{0.525}\approx 0.105\)
\(P(W|U)=\frac{P(W\cap U)}{P(U)}=\frac{0.055}{0.1}=0.55\)
Definition
Two events \(A\) and \(B\) \(\subset\Omega\), are probabilistically independent if and only if: \[P(A\cap B)=P(A)P(B)\]
From the definition of independence, we can obtain several properties. Let \(A\) and \(B\) be independent events with \(P(A)P(B)>0\):
Are \(W\) and \(U\) from the previous example independent?
\(P(W)=0.525\), \(P(U)=0.1\), \(P(W|U)=0.55\), \(P(U|W)=0.105\).
Note that \(P(W)\neq P(W|U)\) and \(P(U)\neq P(U|W)\). Therefore, they cannot be independent.
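The employment example can be reproduced directly from the counts in the table. This is a minimal sketch; the object names (`counts`, `p_W`, and so on) are just labels chosen for the illustration.

```r
# Counts from the employment table (rows: sex, columns: job status)
counts <- matrix(c(470, 55,
                   430, 45),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("Women", "Men"),
                                 c("Employed", "Unemployed")))
n <- sum(counts)   # 1,000 adults

p_W       <- sum(counts["Women", ]) / n          # P(W)
p_U       <- sum(counts[, "Unemployed"]) / n     # P(U)
p_U_and_W <- counts["Women", "Unemployed"] / n   # P(U and W)

# Conditional probabilities: joint probability divided by the conditioning one
p_U_given_W <- p_U_and_W / p_W
p_W_given_U <- p_U_and_W / p_U

# Independence would require P(U and W) = P(U) P(W)
independent <- isTRUE(all.equal(p_U_and_W, p_U * p_W))
```

Here `independent` is `FALSE`, since \(0.055\neq 0.1\times 0.525\).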
Consider a die 🎲 that is thrown twice. Consider the following two events:
\(A=\{\text{The die shows an odd number the first time}\}\) \(B=\{\text{The die shows a number }>4\text{ the second time}\}\)
Are \(A\) and \(B\) independent events?
In this case, \(\Omega=\{(x,y)\in \mathbb{N}^2| x,y \leq 6\}\) with \(\# \Omega = 6^2=36\)
\(P(A)=\frac{18}{36}=\frac{1}{2}\)
\(P(B)=\frac{12}{36}=\frac{1}{3}\)
\(P(A\cap B)=\frac{1}{6}=P(A)P(B)\)
\(P(A|B)=\frac{P(A\cap B)}{P(B)}=\frac{1}{2}=P(A)\)
\(P(B|A)=\frac{P(B\cap A)}{P(A)}=\frac{1}{3}=P(B)\)
They are independent events!
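The same conclusion can be checked by listing all 36 outcomes. The sketch below assumes a fair die, so every pair is equally likely and probabilities are simple proportions.

```r
# All ordered pairs (first throw, second throw)
omega <- expand.grid(first = 1:6, second = 1:6)

# Logical indicators of each event over the 36 outcomes
A <- omega$first %% 2 == 1   # odd number on the first throw
B <- omega$second > 4        # number greater than 4 on the second throw

p_A  <- mean(A)       # 18/36
p_B  <- mean(B)       # 12/36
p_AB <- mean(A & B)   # 6/36

# Independence check: P(A and B) against P(A) P(B)
independent <- isTRUE(all.equal(p_AB, p_A * p_B))
```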
Two events being independent is not the same as their being incompatible:
| \(A\) and \(B\) independent | \(A\) and \(B\) incompatible |
|---|---|
| \(P(A\cap B)=P(A)P(B)\) | \(P(A\cap B)=0\) |
| \(P(A\mid B)=P(A)\) and \(P(B\mid A)=P(B)\) | \(P(A\mid B)=0\) and \(P(B\mid A)=0\) |
Let \(A\) and \(B\) be two events such that: \(P(A)=0.6\), \(P(B)=t\), and \(P(A\cup B)=0.8\)
Find \(t\) such that \(A\) and \(B\) are:
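If \(A\) and \(B\) are incompatible, then \(P(A\cap B)=0\). From probability theory, we have \[P(A\cup B)=P(A)+P(B)-P(A\cap B)\] and therefore we get: \[0.8 = 0.6+t\Rightarrow t=0.2\]
If \(A\) and \(B\) are independent, then \(P(A\cap B)=P(A)P(B)=0.6t\). Using the same identity we just used: \[0.8=0.6+t-0.6t\Rightarrow t= 0.5\]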
Theorem
Let \(\{A_i\}_{i=1}^n\) be a partition of \(\Omega\), or \(\{A_i\}\in \mathcal{P}(\Omega)\). Then, for any \(B\subset\Omega\), it holds that:
\[P(B)=\sum_{i=1}^n P(A_i\cap B)=\sum_{i=1}^n P(B|A_i)P(A_i)\]
Consider a financial institution that sells two products, \(\alpha\) and \(\beta\), with very high yields. It is known that, among its clients, 10% invest a share of their wealth in \(\alpha\) and the rest in \(\beta\). From those who invest in \(\alpha\), 70% manage to get returns above the market. From among those who do not invest in \(\alpha\), 55% get returns above the market. Randomly choosing a client of this firm, find the probability this customer gets a return above the market.
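Let’s define the events: \(A_1=\{\text{The client invests in }\alpha\}\), \(A_2=\{\text{The client invests in }\beta\}\), and \(B=\{\text{The client gets a return above the market}\}\).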
Matching with the available data we obtain:
\(P(A_1)=0.1\), \(P(A_2)=0.9\), \(P(B|A_1)=0.7\), and \(P(B|A_2)=0.55\).
From the Law of Total Probability:
\[P(B)=\sum_{i=1}^2 P(A_i\cap B)\] \[P(B)=P(B|A_1)P(A_1)+P(B|A_2)P(A_2)\] \[P(B)=0.7\times 0.1 + 0.55\times 0.9 = 0.565\]
Bayes Theorem
Let events \(A_1\), \(A_2\), … , \(A_n\) with \(n\in\mathbb{N}\) be a partition of \(\Omega\). Then, for any event \(B\subset\Omega\), with \(P(B)>0\):
\[P(A_i|B)=\frac{P(A_i\cap B)}{P(B)}=\frac{P(B|A_i)P(A_i)}{\sum_{j=1}^n P(B|A_j)P(A_j)}\] with \(i=1,2,...,n\)
Note that this is a consequence of the Law of Total Probability.
Moreover, \(\sum_i P(A_i)=1\) and \(\sum_{i} P(A_i|B)=1\)
Bayes Theorem has been widely used in economics, in biomedical sciences, and social sciences when looking for causality.
If event \(B\) represents a consequence and event \(A_i\) a possible cause, Bayes Theorem allows us to assess the probability of this cause given the observed consequence, \(P(A_i|B)\).
Let’s go back to the previous example, about our investors.
Let’s compute the probability that the client invested his money in product \(\beta\), given that the client had returns above the market (event \(B\)).
If the customer invested in \(\beta\), then the event we are interested in is \(A_2\), but conditional on event \(B\), that is \(P(A_2|B)\):
\[P(A_2|B)=\frac{P(A_2\cap B)}{P(B)}=\frac{P(A_2)\times P(B|A_2)}{\sum_i P(A_i)\times P(B|A_i)}\]
We knew from the previous exercise that \(P(B)=0.565\), and therefore we obtain:
\[P(A_2|B)=\frac{0.9\times 0.55}{0.565}=0.876\]
How do we interpret this?
The probability that the client invested in \(\beta\), given that he had a return above the market, is 0.876.
All these computations can be very easy with the help of the following table:
| \(A_i\) | \(P(A_i)\) | \(P(B\mid A_i)\) | \(P(A_i)P(B\mid A_i)\) | \(P(A_i\mid B)\) |
|---|---|---|---|---|
| \(A_1\) | 0.1 | 0.7 | 0.07 | 0.124 |
| \(A_2\) | 0.9 | 0.55 | 0.495 | 0.876 |
| Total | 1 | | 0.565 | 1 |
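The table can be rebuilt in a few lines of R. This is a sketch using the figures of the investment example; the vector names `prior`, `likelihood`, `joint` and `posterior` are labels chosen here, not notation from the slides.

```r
# P(A_i): share of clients in each product
prior <- c(A1 = 0.10, A2 = 0.90)
# P(B | A_i): probability of beating the market within each product
likelihood <- c(A1 = 0.70, A2 = 0.55)

# P(A_i and B) = P(A_i) P(B | A_i)
joint <- prior * likelihood

# Law of Total Probability: P(B) is the sum of the joint probabilities
p_B <- sum(joint)

# Bayes Theorem: P(A_i | B) = P(A_i and B) / P(B)
posterior <- joint / p_B

bayes_table <- round(data.frame(prior, likelihood, joint, posterior), 3)
print(bayes_table)
```

The `posterior` column sums to 1 by construction, which is a useful sanity check when filling in such a table by hand.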
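Let’s verify now if the events \(A_1\) and \(B^c\) are independent or not!
According to the definition of independence, we need \(P(A_1\cap B^c)=P(A_1)P(B^c)\). Here \(P(A_1\cap B^c)=P(B^c|A_1)P(A_1)=(1-0.7)\times 0.1=0.03\), while \(P(A_1)P(B^c)=0.1\times(1-0.565)=0.0435\).
Since \(0.03\neq 0.0435\), \(A_1\) and \(B^c\) are not independent.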
A random variable is a function that will allow us to quantify (transform into a number) each outcome.
Random Variable
A random variable (r.v.) \(X\) is a function \(X:\Omega\rightarrow \Omega_X\subset \mathbb{R}\). \(\Omega_X\) is known as the support of the r.v. \(X\).
\[\omega\in\Omega \overset{X}{\rightarrow} X(\omega)\in\Omega_X\subset\mathbb{R}\]
\(X(\omega)\) is the image under \(X\) of the outcome \(\omega\)
Summarizing, a r.v. is a function that associates a real number to each outcome from \(\Omega\).
\(X\) is a discrete r.v. when:
In this case, \(\Omega_X=\{x_1, x_2, ... , x_n\}\) with \(n\in\mathbb{N}\) if \(\Omega_X\) is finite, and \(\Omega_X=\{x_1,x_2,...,x_n,...\}\) if it is countably infinite.
Let \(X\) be a discrete r.v. The pdf of \(X\) is a function \(f_X:\mathbb{R}\rightarrow\mathbb{R}\) such that:
\[f_X(x)=\left\{\begin{array}{cc}P(X=x) & ,\text{ if } x\in\Omega_X\\ 0 & ,\text{ if } x\in\mathbb{R}\setminus\Omega_X\end{array}\right.\]
Naturally, by construction the pdf satisfies the following properties:
\[f_X(x)\geq 0 \quad \forall x\in\mathbb{R}\]
\[\sum_{x_i\in\Omega_X}P(X=x_i)=1\]
The pdf gives the probability in a single point. The total probability is distributed among single points, \(x_i\). A reasonable representation of a pdf of a discrete r.v. could be:
| \(x\) | \(x_1\) | \(x_2\) | … | \(x_n\) | … |
|---|---|---|---|---|---|
| \(f(x)\) | \(p_1\) | \(p_2\) | … | \(p_n\) | … |
Where \(p_i=P(X=x_i)\)
Consider the discrete r.v. \(X\) with the following pdf:
| \(x\) | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| \(f(x)\) | \(0.05\) | \(a\) | \(0.35\) | \(0.25\) | \(0.05\) |
We could define:
Our table is now:
| \(x\) | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| \(f(x)\) | \(0.05\) | \(0.3\) | \(0.35\) | \(0.25\) | \(0.05\) |
What is \(P(X=2|X\leq 3)\)?
\[P(X=2|X\leq 3)=\frac{P(X=2 \cap X\leq 3)}{P(X\leq 3)}= \frac{P(X=2)}{P(X\leq 3)}\]
\[= \frac{f(2)}{f(0)+...+f(3)}=\frac{0.35}{0.95}\approx 0.368\]
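Both steps of this example, recovering the unknown \(a\) and computing the conditional probability, can be sketched in R as follows. The vector names are illustrative only.

```r
x <- 0:4
# pdf values as given, with the unknown a stored as NA
f_known <- c(0.05, NA, 0.35, 0.25, 0.05)

# The probabilities must add up to 1, which pins down a
a <- 1 - sum(f_known, na.rm = TRUE)
f <- replace(f_known, is.na(f_known), a)

# P(X = 2 | X <= 3) = P(X = 2) / P(X <= 3), since {X = 2} is inside {X <= 3}
p_cond <- f[x == 2] / sum(f[x <= 3])
```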
\(X\) is a continuous r.v. if:
Let \(X\) be a continuous r.v.
There is a function \(f_X:\mathbb{R}\rightarrow\mathbb{R}\), the pdf of \(X\) such that:
Technically, from Measure Theory, we need an absolutely continuous r.v. to ensure the existence of a pdf. These issues are beyond the scope of this course. Just know that when we say continuous r.v. we mean absolutely continuous r.v.
Note that this pdf allows us to compute the probability of events of the form \(X\in(a,b]\):
\[P(a<X\leq b)=\int_a^b f_X(x)dx\]
Observe that for the event \(X=a\) you would integrate from \(a\) to \(a\), an interval of zero width, and therefore the integral (and the probability) is 0.
Let \(X\) be a continuous r.v. with the following pdf:
\[ f(x)=\left\{\begin{array}{cc} \theta x^2 & , 0\leq x< 1\\ 0 & ,\mathbb{R}\setminus [0,1) \end{array}\right. \]
Support for \(X\): \(\Omega_X=[0,1)\)
\[\int_{-\infty}^{\infty}f(x)dx=1\Leftrightarrow\int_0^1\theta x^2dx=1\Leftrightarrow\left[\theta\frac{x^3}{3}\right]_{0}^1=1\]
\[\Leftrightarrow\theta\frac{1}{3}-\theta\frac{0}{3}=1\Leftrightarrow \theta=3\]
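The value of \(\theta\) can also be checked numerically. The sketch below uses base R's `integrate()` on the unscaled function \(x^2\); it is a numerical cross-check under that setup, not a replacement for the derivation.

```r
# Area under x^2 on [0, 1], i.e. the integral without the constant theta
area_unscaled <- integrate(function(x) x^2, lower = 0, upper = 1)$value

# theta must rescale that area to 1
theta <- 1 / area_unscaled

# Sanity check: the resulting pdf integrates to 1 over its support
total_prob <- integrate(function(x) theta * x^2, lower = 0, upper = 1)$value
```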
Let \(X\) be a r.v. The distribution function \(F_X:\mathbb{R}\rightarrow[0,1]\) is defined as:
\[F_X(x)=P(X\leq x)\]
\(F_X\) is unique.
With a discrete r.v.
| \(x\) | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| \(f(x)\) | \(0.05\) | \(0.3\) | \(0.35\) | \(0.25\) | \(0.05\) |
\[F(x)=P(X\leq x)=\left\{ \begin{array}{cc} 0 & x<0 \\ 0.05 & 0\leq x < 1 \\ 0.05 + 0.3 = 0.35 & 1 \leq x < 2 \\ 0.35 + 0.35 = .7 & 2 \leq x < 3 \\ 0.7 + 0.25 = .95 & 3 \leq x < 4 \\ 1 & x\geq 4 \end{array} \right.\]
Let’s revisit our previous example:
\[P(X=2|X\leq 3)= \frac{P(X=2)}{P(X\leq 3)}=\] \[\frac{F(2)-F(2^-)}{F(3)}= \frac{0.7-0.35}{0.95}=0.368 \]
With a continuous r.v.:
\[F_X(x)=P(X\leq x)=\int_{-\infty}^{x} f_X(t)dt\]
The distribution function \(F_X\) allows us to compute the probability of \(\{X\in(a,b]\}\)
\[P(a<X\leq b)=\int_a^b f_X(x)dx=F_X(b)-F_X(a)\]
```r
# Load necessary package
library(ggplot2)

# Create a sequence of x values and evaluate the standard normal density
x <- seq(-4, 4, length.out = 1000)
y <- dnorm(x)

# Create a data frame
df <- data.frame(x = x, y = y)

# Define the shaded region: keep the density only between a = -0.2 and b = 0.5
df$shade <- ifelse(df$x >= -0.2 & df$x <= 0.5, df$y, NA)

# Plot the density curve and shade the area that represents P(a < X <= b)
ggplot(df, aes(x = x, y = y)) +
  geom_line(color = "blue") +
  geom_area(aes(y = shade), fill = "skyblue", alpha = 0.5) +
  annotate("text", x = -0.2, y = 0.02, label = "a", color = "black", size = 7) +
  annotate("text", x = 0.5, y = 0.02, label = "b", color = "black", size = 7) +
  annotate("text", x = 0.35, y = 0.1, label = "P(a < x ~ '\u2264' ~ b)", parse = TRUE, size = 5) +
  labs(title = "Standard Normal Distribution",
       x = "x", y = "Density") +
  theme_minimal()
```

Consider the continuous r.v. defined previously, with the pdf:
\[ f(x)=\left\{ \begin{array}{cc} 3x^2 & , 0\leq x< 1\\ 0 & , \mathbb{R}\setminus [0,1) \end{array} \right. \]
Support for \(X\): \(\Omega_X=[0,1)\)
Distribution function (cdf):
\[ F(x)=P(X\leq x) = \int_{-\infty}^x f(t)dt = \left\{ \begin{array}{cc} 0 & , x<0\\ x^3 & ,0\leq x<1 \\ 1& ,x\geq 1 \end{array} \right. \]
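The identity \(P(a<X\leq b)=F_X(b)-F_X(a)\) can be checked numerically for this pdf. In the sketch below the endpoints \(a=0.2\) and \(b=0.5\) are arbitrary illustrative values, and `pdf_X` and `cdf_X` are names chosen for this example.

```r
# pdf and cdf of the example, written as vectorised functions
pdf_X <- function(x) ifelse(x >= 0 & x < 1, 3 * x^2, 0)
cdf_X <- function(x) ifelse(x < 0, 0, ifelse(x < 1, x^3, 1))

a <- 0.2   # assumed lower endpoint
b <- 0.5   # assumed upper endpoint

# Probability of (a, b] computed in two ways
by_cdf      <- cdf_X(b) - cdf_X(a)                        # b^3 - a^3
by_integral <- integrate(pdf_X, lower = a, upper = b)$value
```

Both routes give the same number, which is the point of having a closed-form cdf: no integration is needed once \(F_X\) is known.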
Regardless of whether the r.v. is discrete or continuous, \(F_X\) has the following properties:
\(F_X\) for a discrete r.v. \(X\)
\(F_X\) for a continuous r.v. \(X\)
For a discrete r.v.
\[P(X=x)=F_X(x)-F_X(x^-)\] \[+\downarrow \uparrow -\] \[F_X(x)=\sum_{x_i\leq x}P(X=x_i)\]
For a continuous r.v.
\[ f_X(x)=\left\{ \begin{array}{cc} F_X'(x) & ,x\in\mathbb{R} \text{ if }F_X'\text{ exists} \\ 0 & \text{, otherwise} \end{array} \right. \]
\[ \text{Derivative} \downarrow \uparrow \text{Primitive}\]
\[ F_X(x)=\int_{-\infty}^x f_X(t)dt\]
The pdf of \(X\), a continuous r.v. is not unique.
We could describe the range of \(X\), a r.v., as a population, in the statistical sense, because it describes all the possible values it can take.
We can use numerical values to do so, which can represent dispersion or centrality of the data.
The expected value or mean, is a location parameter for our r.v.
Definition
The expected value or mean of a random variable \(X\) is:
Not all random variables have an expected value: the defining sum or integral might not converge (for instance, it might be infinite).
Let \(X,Y\) be r.v.s, and \(a,b\in\mathbb{R}\) scalars. Some properties of the mean:
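For reference, the standard forms of the definition, which are the ones used in the worked examples of this section, are:

\[E[X]=\sum_{x_i\in\Omega_X}x_iP(X=x_i)\quad\text{if } X \text{ is discrete}\]
\[E[X]=\int_{-\infty}^{\infty}xf_X(x)dx\quad\text{if } X \text{ is continuous}\]

In both cases the expected value exists only if the sum or the integral converges absolutely.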
Let \(X\) be a discrete r.v. as in the previous example:
| \(x\) | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| \(f(x)\) | \(0.05\) | \(0.3\) | \(0.35\) | \(0.25\) | \(0.05\) |
Let \(g(X)=2(X-1)^2+3(X-1)-5\), find \(E[g(X)]\).
\[g(X)=2(X-1)^2+3(X-1)-5\] \[=2(X^2-2X+1)+3X-3-5\] \[=2X^2-4X+2+3X-8\] \[=2X^2-X-6\]
\[E[g(X)]=E[2X^2-X-6]\]
\[=2E[X^2]-E[X]-6\]
We only need to find \(E[X]\) and \(E[X^2]\) to obtain \(E[g(X)]\).
\[E[X]=\sum xP(X=x)\] \[ = 0\times .05 + 1 \times .3 + 2 \times .35 + 3 \times .25 + 4\times .05 = 1.95\]
\[E[X^2]=\sum x^2 P(X=x)\]
\[ = 0\times .05 + 1 \times .3 + 4 \times .35 + 9 \times .25 + 16\times .05 = 4.75\]
\[E[g(X)]=2\times 4.75 - 1.95 - 6 = 1.55\]
Recall our example for continuous r.v.s. \(X\): \[ f_X(x)=\left\{ \begin{array}{cc} 3x^2 & ,0\leq x < 1 \\ 0 & , \mathbb{R}\setminus[0,1) \end{array} \right. \]
Find \(E[g(X)]\) when \(g(X)=2(X-1)^2+3(X-1)-5\). We already know that \(g(X)=2X^2-X-6\). Let’s focus on \(E[X]\) and \(E[X^2]\).
\[E[X]=\int_{-\infty}^{\infty} xf_X(x)dx = \int_0^1 x\times 3x^2 dx\] \[= \int_0^1 3x^3dx=\left[3\frac{x^4}{4}\right]_{0}^1=\frac{3}{4}=0.75\]
\[E[X^2]=\int_{-\infty}^{\infty} x^2f_X(x)dx = \int_0^1 x^2\times 3x^2 dx\] \[= \int_0^1 3x^4dx=\left[3\frac{x^5}{5}\right]_{0}^1=\frac{3}{5}=0.6\]
Finally,
\(E[g(X)]=2\times 0.6 - 0.75 - 6 = -5.55\)
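Both results can be cross-checked by computing \(E[g(X)]\) directly, weighting \(g(x)\) by the pdf instead of going through \(E[X]\) and \(E[X^2]\). The sketch below does this for the discrete and the continuous example; it is an alternative route to the same numbers, not the method used in the derivation.

```r
g <- function(x) 2 * (x - 1)^2 + 3 * (x - 1) - 5

# Discrete example: sum of g(x) * P(X = x)
x <- 0:4
f <- c(0.05, 0.30, 0.35, 0.25, 0.05)
E_g_discrete <- sum(g(x) * f)

# Continuous example: integral of g(x) * f_X(x) over the support [0, 1)
pdf_X <- function(x) 3 * x^2
E_g_continuous <- integrate(function(x) g(x) * pdf_X(x),
                            lower = 0, upper = 1)$value
```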
The p-quantile, \(x_p\), of a r.v. \(X\) is a location parameter, with fixed value.
\(x_p\) is the value for \(x\in\Omega_X\) such that:
Or, equivalently, \(x\in\Omega_X\) such that \(F_X(x^-)\leq p \leq F_X(x)\)
For a continuous r.v., \(x_p\) is an \(x\in\Omega_X\) such that \(F_X(x)=p\)
Let’s apply this for the examples we just used for the expected value.
Find the median (0.5-quantile) for \(X\)
\[ F(x)=P(X\leq x)=\left\{ \begin{array}{cc} 0 &, x< 0\\ 0.05 &, 0\leq x <1 \\ 0.35 &, 1\leq x < 2 \\ 0.7 &, 2 \leq x < 3 \\ 0.95 &, 3 \leq x < 4 \\ 1 &, x\geq 4 \end{array} \right. \]
For example \(F(2^-)=0.35\leq 0.5 \leq 0.7=F(2)\) and therefore \(x_{0.5}=Me = 2\). Given that \(E[X]=1.95<Me(X)=2\), the distribution is slightly negatively (or left) skewed.
\[F_X(x)=P(X\leq x)=\left\{ \begin{array}{cc} 0 & x < 0 \\ x^3 & 0\leq x <1 \\ 1 & x\geq 1 \end{array} \right. \]
Let’s find \(x\) such that \(F(x)=0.5\)
\(F(x)=0.5\Leftrightarrow x^3=0.5\Leftrightarrow x=\sqrt[3]{0.5}\approx 0.7937\)
And therefore, \(x_{0.5}=Me\approx 0.7937\)
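Both medians can be recovered numerically. In this sketch the discrete median is taken as the smallest \(x\) with \(F(x)\geq 0.5\), and the continuous one is found by solving \(F(x)=0.5\) with base R's `uniroot()`; the tolerance passed to it is an arbitrary choice.

```r
# Discrete example: cumulative sums of the pdf give the cdf at each x
x <- 0:4
f <- c(0.05, 0.30, 0.35, 0.25, 0.05)
cdf_values <- cumsum(f)
median_discrete <- x[which(cdf_values >= 0.5)[1]]

# Continuous example: root of F(x) - 0.5 = x^3 - 0.5 on the support
median_continuous <- uniroot(function(x) x^3 - 0.5,
                             interval = c(0, 1), tol = 1e-10)$root
```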
Let \(X\) be a r.v. The variance of \(X\), if it exists, is defined as:
\[V[X]=E\left[\left(X-E[X]\right)^2\right]\]
It can be shown, very easily, with some algebraic manipulation that \(V[X]=E\left[X^2\right]-\left(E[X]\right)^2\)
Remember that \(E[X]\equiv\mu_X\)
Usually we write \(V[X]\) as \(\sigma^2_X\).
Some properties for the variance:
If \(\sigma^2_X\) is the variance of \(X\), then the standard deviation is defined as: \[\sigma_X=\sqrt{V[X]}\]
One characteristic of the standard deviation is that its units are the same as those of the random variable.
While the variance and standard deviation allow us to measure the dispersion of the data, we might want to have it relative to the mean (a \(\sigma_X=1\) can be a lot for \(X\) taking relatively low values, but negligible if we are talking in millions!)
For that we use the coefficient of variation:
\[CV_X =\frac{\sigma_X}{\mu_X}\times 100\]
Some properties of the \(CV_X\)
Let’s compute \(\sigma^2\), \(\sigma\), and \(CV\) for our previous examples:
Values below \(50\%\) for \(CV\) allow us to see \(\mu\) as representative for the data. The lower, the closer the data to \(\mu\) and therefore the more representative it is.
For the continuous r.v. case:
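The three measures for both examples can be computed as in the sketch below, which uses \(V[X]=E[X^2]-(E[X])^2\) and the values of \(E[X]\) and \(E[X^2]\) obtained earlier. The helper `dispersion` is a hypothetical name introduced for this illustration.

```r
# Variance, standard deviation and coefficient of variation
# from the first two moments E[X] and E[X^2]
dispersion <- function(m1, m2) {
  variance <- m2 - m1^2
  std_dev  <- sqrt(variance)
  c(variance = variance, sd = std_dev, cv = 100 * std_dev / m1)
}

# Discrete example: moments computed from the pdf table
x <- 0:4
f <- c(0.05, 0.30, 0.35, 0.25, 0.05)
disp_discrete <- dispersion(sum(x * f), sum(x^2 * f))

# Continuous example: E[X] = 0.75 and E[X^2] = 0.6 from the integrals above
disp_continuous <- dispersion(0.75, 0.6)

print(round(rbind(discrete = disp_discrete, continuous = disp_continuous), 4))
```

Under these numbers the discrete example has a coefficient of variation just below 50%, and the continuous one is well below it.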
When running an experiment, it could be interesting to study the relationship between two numeric features associated to each of the outcomes.
Random pair
A random pair \((X,Y)\) is a function \((X,Y):\Omega\rightarrow \left(\Omega_X,\Omega_Y\right)\subset\mathbb{R}^2\). \(\left(\Omega_X, \Omega_Y\right)\) is known as the support of the random pair \((X,Y)\).
\[\omega\in\Omega \overset{(X,Y)}{\rightarrow}\left(X(\omega),Y(\omega)\right)\in(\Omega_X,\Omega_Y)\subset\mathbb{R}^2\]
\(X(\omega)\) is the image, under \(X\) of outcome \(\omega\), and \(Y(\omega)\) the image under \(Y\) of the same outcome.
A random pair \((X,Y)\) is discrete when:
Let \((X,Y)\) be a discrete random pair. The joint density function \(f_{X,Y}(x,y)\) is a function \(f_{X,Y}:\mathbb{R}^2\rightarrow\mathbb{R}\) defined as:
\[ f_{X,Y}(x,y)=\left\{ \begin{array}{cl} P(X=x,Y=y) & , (x,y)\in(\Omega_X,\Omega_Y)\\ 0 & , (x,y)\in\mathbb{R}^2\setminus(\Omega_X,\Omega_Y) \end{array} \right. \]
\(f_{X,Y}\) satisfies the following properties:
A possible notation for \(P(X=x_i,Y=y_j)\) is \(p_{i,j}\)
| | \(y_1\) | \(y_2\) | \(\dots\) | \(y_j\) | \(\dots\) | |
|---|---|---|---|---|---|---|
| \(x_1\) | \(p_{11}\) | \(p_{12}\) | \(\dots\) | \(p_{1j}\) | \(\dots\) | \(\sum_{j=1}^\infty p_{1j}\) |
| \(x_2\) | \(p_{21}\) | \(p_{22}\) | \(\dots\) | \(p_{2j}\) | \(\dots\) | \(\sum_{j=1}^\infty p_{2j}\) |
| \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\ddots\) | \(\vdots\) | \(\ddots\) | \(\vdots\) |
| \(x_i\) | \(p_{i1}\) | \(p_{i2}\) | \(\dots\) | \(p_{ij}\) | \(\dots\) | \(\sum_{j=1}^\infty p_{ij}\) |
| \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\ddots\) | \(\vdots\) | \(\ddots\) | \(\vdots\) |
| | \(\sum_{i=1}^\infty p_{i1}\) | \(\sum_{i=1}^\infty p_{i2}\) | \(\dots\) | \(\sum_{i=1}^\infty p_{ij}\) | \(\dots\) | 1 |
Given a random pair \((X,Y)\), the marginal probability function of \(X\) and \(Y\) is respectively:
For \(i=1,2,...\) and \(j=1,2,...\). Note that these functions have one dimension only.
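In terms of the \(p_{ij}\) notation, the marginal probability functions are the row and column totals of the joint table (this is the standard form, written here with the symbols already in use):

\[f_X(x_i)=P(X=x_i)=\sum_{j=1}^\infty p_{ij}\]
\[f_Y(y_j)=P(Y=y_j)=\sum_{i=1}^\infty p_{ij}\]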
At SuperStore 🏪, three trained employees are qualified to operate the checkout counters, restock products on the shelves, and perform some administrative tasks. SuperStore has three checkout counters, and at least one of them must always be operating.
At any given day and moment when SuperStore is open to customers, consider the following random variables:
The r.v. \(X\) has \(\Omega_X=\{1,2,3\}\) and the following pdf:
| \(x\) | 1 | 2 | 3 |
|---|---|---|---|
| \(f_X(x)\) | 0.17 | 0.8 | 0.03 |
Consider the following table for the joint probability of \((X,Y)\)
| \(X\setminus Y\) | 0 | 1 | 2 | |
|---|---|---|---|---|
| 1 | \(a\) | \(2b\) | \(b\) | |
| 2 | 0.1 | \(c\) | 0 | |
| 3 | 0.03 | 0 | 0 | |
| | | | | 1 |
If \(P(X=1, Y=0)=0.02\), find \(b\) and \(c\)
| \(X\setminus Y\) | 0 | 1 | 2 | \(\color{red}{f_X(x)}\) |
|---|---|---|---|---|
| 1 | \(\color{red}{0.02}\) | \(2b\) | \(b\) | \(\color{red}{0.17}\) |
| 2 | 0.1 | \(c\) | 0 | \(\color{red}{0.8}\) |
| 3 | 0.03 | 0 | 0 | \(\color{red}{0.03}\) |
| \(\color{red}{f_Y(y)}\) | \(\color{red}{0.15}\) | \(\color{red}{2b+c}\) | \(\color{red}{b}\) | 1 |
\(P(X=2|Y\geq 1)\) is approximately…?
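One way to work through the exercise is sketched below: each row of the joint table must add up to the corresponding marginal \(f_X(x)\), which gives \(b\) and \(c\), and the conditional probability then follows from the completed table. The object names are illustrative.

```r
# Row 1: 0.02 + 2b + b = 0.17   ->  b
b <- (0.17 - 0.02) / 3
# Row 2: 0.1 + c + 0 = 0.8      ->  c
c_val <- 0.8 - 0.1

# Completed joint table (rows: values of X, columns: values of Y)
joint <- matrix(c(0.02, 2 * b, b,
                  0.10, c_val, 0,
                  0.03, 0,     0),
                nrow = 3, byrow = TRUE,
                dimnames = list(X = c("1", "2", "3"), Y = c("0", "1", "2")))

f_X <- rowSums(joint)   # marginal of X
f_Y <- colSums(joint)   # marginal of Y

# P(X = 2 | Y >= 1) = P(X = 2, Y >= 1) / P(Y >= 1)
p_cond <- sum(joint["2", c("1", "2")]) / sum(joint[, c("1", "2")])
```

Under these values the conditional probability is \(0.7/0.85\), roughly 0.82.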
Let \((X,Y)\) be a discrete random pair. \(X\) and \(Y\) are independent if, and only if: \[P(X=x, Y=y)=P(X=x)P(Y=y)\quad \forall(x,y)\in\mathbb{R}^2\]
That is, the joint pdf equals the product of the marginal pdfs.
… (continued exercise) Are \(X\) and \(Y\) independent?
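A quick check, assuming the values \(b=0.05\) and \(c=0.7\) for the SuperStore table (the same values that appear in the \(E[XY]\) computation further on), so that \(P(Y=1)=2b+c=0.8\):

\[P(X=3,Y=1)=0\neq P(X=3)P(Y=1)=0.03\times 0.8=0.024\]

A single pair \((x,y)\) where the product rule fails is enough, so under these values \(X\) and \(Y\) are not independent.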
Definition
Let the discrete random pair \((X,Y)\) have a joint pdf \(P(X=x,Y=y)\), and let \(g:\mathbb{R}^2\rightarrow\mathbb{R}\) be a function. The expected value or mean of \(g(X,Y)\) is:
\[E[g(X,Y)]=\sum_{i=1}^\infty\sum_{j=1}^\infty g(x_i,y_j)P(X=x_i,Y=y_j)\]
If \(g(x,y)=xy\), then \(E[g(X,Y)]=E[XY]\) and that equals \[\sum_{i=1}^\infty\sum_{j=1}^\infty x_iy_jP(X=x_i,Y=y_j)\]
Definition
Let the discrete random pair \((X,Y)\) have a joint pdf \(P(X=x,Y=y)\), and \(\mu_X=E[X]\) and \(\mu_Y=E[Y]\). The covariance between \(X\) and \(Y\) is:
\[cov(X,Y)=E[(X-\mu_X)(Y-\mu_Y)]\]
provided that \(E[(X-\mu_X)(Y-\mu_Y)]\) exists.
Note that this is equivalent to \(cov(X,Y)=E[XY]-E[X]E[Y]\)
The covariance tries to capture how the two r.v. move together. If it is positive, it means that both tend to go in the same direction more often than not (both above or below their means at the same time). Being negative means that more often than not when one is above its mean, the other is below.
If \(X\) and \(Y\) are independent r.v. then \(cov(X,Y)=0\). Note that the opposite is not necessarily true, i.e. \(cov(X,Y)=0\) does not imply that \(X\) and \(Y\) are independent.
Another important identity with the covariance is the following:
\[V[X\pm Y] = V[X]+V[Y]\pm 2 cov(X,Y)\]
From the first table: \(E[X]=0.17\times 1 + 0.8 \times 2 + 0.03 \times 3 = 1.86\)
\[E[XY]=\sum_{x}\sum_{y}xyP(X=x,Y=y)=\] \[= 1 \times 0 \times 0.02 + 1\times 1 \times 0.1 + 1\times 2 \times 0.05 +\] \[+ 2\times 0 \times 0.1 + 2 \times 1 \times 0.7 + 2\times 2 \times 0 + \] \[+ 3 \times 0 \times 0.03 + 3 \times 1 \times 0 + 3 \times 2 \times 0 = 1.6\]
\(cov(X,Y)=E[XY]-E[X]E[Y]=1.6-1.86\times 0.9=-0.074\)
A caveat of the covariance is that its units depend directly on the units of \(X\) and \(Y\). The correlation coefficient allows us to express this relationship between \(X\) and \(Y\) without being affected by the units in which these r.v. are measured.
\[\rho_{XY} = \frac{cov(X,Y)}{\sqrt{V[X]V[Y]}}=\frac{cov(X,Y)}{\sigma_X\sigma_Y}\]
Clearly \(\rho\in[-1,1]\). Note also that \(|\rho|=1\) if and only if \(P(Y=a+bX)=1\) for some \(a,b\in\mathbb{R}\) with \(b\neq 0\). If \(X\) and \(Y\) are independent r.v. then \(\rho=0\).
| Correlation coefficient | Correlation |
|---|---|
| \(\lvert\rho\rvert = 1\) | Perfect |
| \(0.8 \leq \lvert\rho\rvert < 1\) | Strong |
| \(0.5 \leq \lvert\rho\rvert < 0.8\) | Moderate |
| \(0.1 \leq \lvert\rho\rvert < 0.5\) | Weak |
| \(0 < \lvert\rho\rvert < 0.1\) | Very weak |
| \(\rho= 0\) | None |
Write positive or negative in front of correlation if \(\rho>0\) or \(\rho<0\) respectively.
From the marginal probability function we obtain \(V[X]\) and \(V[Y]\): \[V[X]=0.1804\text{ and }V[Y]=0.19\]
Therefore, \[\rho = \frac{cov(X,Y)}{\sigma_X\sigma_Y}=\frac{-0.074}{\sqrt{0.1804}\sqrt{0.19}}=-0.3997\]
We observe a weak negative linear correlation between \(X\) and \(Y\).
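The whole chain, from the joint table to the correlation coefficient, can be reproduced as in the sketch below. It uses the completed SuperStore table with the values that appear in the \(E[XY]\) computation; the variable names are chosen for the illustration.

```r
x <- 1:3   # values of X (rows)
y <- 0:2   # values of Y (columns)

joint <- matrix(c(0.02, 0.10, 0.05,
                  0.10, 0.70, 0.00,
                  0.03, 0.00, 0.00),
                nrow = 3, byrow = TRUE)

f_X <- rowSums(joint)   # marginal of X
f_Y <- colSums(joint)   # marginal of Y

E_X  <- sum(x * f_X)
E_Y  <- sum(y * f_Y)
E_XY <- sum(outer(x, y) * joint)   # outer(x, y)[i, j] = x_i * y_j

cov_XY <- E_XY - E_X * E_Y
V_X    <- sum(x^2 * f_X) - E_X^2
V_Y    <- sum(y^2 * f_Y) - E_Y^2
rho    <- cov_XY / sqrt(V_X * V_Y)
```

The correlation is unit-free, so rescaling `x` or `y` by a positive constant would change `cov_XY` but leave `rho` unchanged.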