Random Variables

Statistics I

Paulo Fagandini

Lisbon Accounting and Business School – Polytechnic University of Lisbon

Disclaimer

These slides are a free translation and adaptation from the slide deck for Estatística I by Prof. Sandra Custódio and Prof. Teresa Ferreira from the Lisbon Accounting and Business School - Polytechnic University of Lisbon.

Probability, background

The concept

Consider the following scenario:

💼 Investor: What’s the probability this startup will succeed?

📊 Analyst: Hard to say—every startup is different.

💼 Investor: But if you had to guess, based on similar cases?

📊 Analyst: Maybe 1 in 3 succeed under these conditions.

💼 Investor: So, would you bet on it?

📊 Analyst: Yes, I would.

💼 Investor: Even if the odds aren’t great?

📊 Analyst: I believe this one has what it takes.

The concept

Here we can define probability in terms of frequency of occurrence, i.e. as a percentage of successes in a moderately large number of similar situations.

This is the most natural and traditional way of thinking about probability.

  • Regarding a fair coin 🪙 we could say: “with probability 50% the coin lands on heads” meaning “roughly half of the time.”

But, what if this company belongs to a completely novel market sector?

The concept

There might be situations where the frequency concept is not adequate, because the event in question happens only once. In such cases, probabilities express subjective beliefs.

  • A company is recruiting a new CEO, and a board member says:

    “I believe there’s a 90% chance that our chosen candidate will be an effective CEO.”

The concept

It might seem easy to disregard the second case as unscientific or useless. However, many times people need to make decisions under uncertainty with not enough data (or no data at all!) about previous realizations of the specific event.

Beliefs allow the decision maker to, well, make some decision, at least consistently.

  • What’s the difference between both situations?
  • What do they have in common?

Uncertainty

A refresher on Set Theory

Sets and elements

A set is a collection of objects, which are elements of the set.

Definition

Let \(S\) represent a set, and \(s\) an element of that set, we write \(s\in S\) to mean \(s\) belongs to \(S\).

If \(s\) does not belong to \(S\), we write \(s\notin S\).

Definition

If a set \(S\) does not have any element, then it is the empty set, denoted by \(\emptyset\).

How to write down a Set

There are several ways to specify a set.

By extension, or as a list:

  • If a set \(S\) has a finite number of elements (\(x_i\in S\)) we can write it like this: \[S=\{x_1, x_2, ..., x_n\}\]

  • If a set \(S\) has infinitely (but countably) many elements (\(x_i\in S\)) we can write it like this: \[S=\{x_1, x_2, ...\}\]

How to write down a Set

By describing the property (\(P\)) that \(x\) must satisfy to be included in \(S\): \[S= \{x|x \text{ satisfies } P\}\] where \(|\) reads as “such that”. For example, \[S=\{x\in\mathbb{R}|x\geq 0\}\] describes the non-negative real numbers.

This example is special, as the non-negative real numbers cannot be written down as a list: the interval \([0,\infty)\) is an uncountable set.

More definitions

Definition

If \(\forall x\in S\) it is also true that \(x\in T\), then we say that \(S\) is a subset of \(T\), and we write it like \(S\subseteq T\).

Definition

If \(S\subseteq T\) and at the same time \(T\subseteq S\) then we say that \(S\) and \(T\) are equal, and we write it \(S=T\).

Definition

The universal set \(\Omega\) is the set that contains all objects that could conceivably be of interest in a particular context.

By definition, then, any set \(S\) must be a subset of \(\Omega\).

More definitions

The universal set is important because it defines the scope of our analysis. Say we are studying the performance of students of Statistics I in 2026.

The cars parked outside our institution do not belong to the universal set, because they are not relevant for our purpose. Only students of Statistics I in 2026 belong to the universal set.

Set Operations

Definition

The complement of a set \(S\), with respect to \(\Omega\), is the set \(\{x\in\Omega| x\notin S\}\), that is, all the relevant elements that do not belong in \(S\). We denote it as \(S^c\).

Corollary: It is easy to see that \(\Omega^c=\emptyset\).

Set Operations

Definition

The union of two sets \(S\) and \(T\) is the set of all elements that belong to \(S\) or \(T\) (or both), and is denoted by \(S\cup T\). \[S\cup T=\{x\in\Omega | x\in S\ \vee\ x\in T\}\]

Definition

The intersection of two sets \(S\) and \(T\) is the set of all elements that belong to \(S\) and \(T\), and is denoted by \(S\cap T\). \[S\cap T=\{x\in\Omega | x\in S\wedge x\in T\}\]

Note that \(\vee\) stands for or, and \(\wedge\) stands for and.

Set Operations

Sometimes we might need to consider the union or intersection of many sets, and for that we can use a notation similar to the one we used for summations:

  • \[\bigcup_{n=1}^\infty S_n = S_1 \cup S_2 \cup ... = \{x\in\Omega | x \in S_n \text{ for some } n\}\]

  • \[\bigcap_{n=1}^\infty S_n = S_1 \cap S_2 \cap ... = \{x\in\Omega | x \in S_n \text{ for every } n\}\]

Set Operations

Definition

Two sets (say \(S\) and \(T\)) are said to be disjoint if \(S\cap T=\emptyset\).

More generally, a collection of sets \(S_n\) is disjoint if \(S_i\) and \(S_j\) are disjoint when \(i\neq j\).

Definition

A collection of sets is said to be a partition of a set \(S\) if the sets in the collection are:

  • Disjoint

  • Their union is \(S\)

We write \(\mathcal{P}(\Omega)\) for the collection of all partitions of \(\Omega\).

Set Definition

The number of elements of a set \(S\) is known as its cardinality and it is denoted as \(\# S\). \(\# S\) satisfies:

  1. \(\# S \geq 0\)
  2. \(\# \emptyset = 0\)

Set definition

Given two sets, \(S\) and \(T\), we define the set difference (“set minus”), written \(S\setminus T\), as the set that contains all elements of \(S\) that do not belong to \(T\):

\[S\setminus T = \{x\in S| x\notin T\}\]

Back to Probability

In probability, \(\Omega\), the universal set, is a non-empty set that contains all possible outcomes of an experiment. Each outcome is represented by \(\omega\), and obviously \(\omega\in\Omega\).

Back to Probability

The sample space (\(\Omega\)) can be:

  • Discrete, when \(\#\Omega\) is finite, or countably infinite.
  • Continuous, when \(\#\Omega\) is uncountable.

Back to Probability

Consider the experiment of throwing a die 🎲 and noting the number shown on the side facing upwards.

  1. The sample space is \(\Omega=\{1,2,3,4,5,6\}\)
  2. In this case \(\# \Omega = 6\)
  3. \(\Omega\) is discrete.

Back to Probability

Consider now the random experiment of measuring the lifetime of a lamp 💡, in hours.

  1. The sample space is \(\Omega=\{x\in\mathbb{R}|x\geq 0\}\)
  2. In this case, \(\Omega\) is all non-negative real numbers.
  3. \(\Omega\) is continuous.

Remember, \(\Omega\) must include all possible outcomes from your experiment! Even the ones that seem ludicrous.

Back to Probability

Definition

A subset \(A\) of the sample space \(\Omega\) is called an event. \[A\subseteq \Omega\]

By definition then \(\Omega\) is also an event.

Definition

We say that an event \(A\) is realized (or occurs) if, after an experiment, the realized outcome \(\omega\) satisfies \(\omega \in A\).

Example

Let’s go back to our experiment with the 🎲

The sample space is: \(\Omega=\{1,2,3,4,5,6\}\)

Within this space, we can define the following events:

  1. \(A=\{1,3,5\}\), i.e. the number is odd.
  2. \(B=\{3,4,5,6\}\), i.e. the number is at least 3.
  3. \(C=\{1,2,3\}\) , i.e. the number is lower than 4.
  4. \(D=\{6\}\), i.e. the number is larger than 5.

Example

Now let’s revisit the example of our 💡

The sample space is: \(\Omega=\{x\in\mathbb{R}|x\geq 0\}\)

In this space, we can define the following events:

  1. \(A=\{x\in\mathbb{R}|75<x<95\}\), i.e. the 💡 lasts between 75 and 95 hours.
  2. \(B=\{x\in\mathbb{R}|x\leq 100\}\), i.e. the 💡 lasts no longer than 100 hours.
  3. \(C=\{x\in\mathbb{R}|x\geq 60\}\), i.e. the 💡 lasts at least 60 hours.

Events

Definitions

  • An elementary event is any event that contains a single element (i.e. \(\# A = 1\))

  • An impossible event is an event with no outcome, (i.e. \(\# A=0\)). As a consequence, an impossible event coincides with the empty set \(\emptyset\).

  • The certain event is the event \(\Omega\): whatever outcome \(\omega\) we obtain, it belongs to the sample space \(\Omega\) by definition.

Mixing up Sets and Probability

Consider two events \(A\) and \(B\) both subsets of \(\Omega\)

  1. \(A^c\) contains all the outcomes that are not in \(A\). \(A^c\) is the event of not \(A\).

  2. If \(A\subseteq B\), then an outcome that realizes event \(A\) (\(\omega\in A\)), also realizes \(B\), as \(A\subseteq B\Rightarrow \omega\in B\) as well. \(A\Rightarrow B\)

  3. For \(A\cup B\) to happen, we need \(\omega \in A\) or \(\omega \in B\), which means that \(A\) happens, or \(B\) happens, or both happen simultaneously.

Mixing up Sets and Probability

  1. For \(A\cap B\) to happen, we need \(\omega \in A\) and \(\omega \in B\), which means that \(A\) and \(B\) happen simultaneously.

  2. \(A\) and \(B\) are incompatible if \(A\cap B=\emptyset\), i.e. if an outcome is in one set, it cannot be in the other. For example, \(A\) and \(A^c\) cannot happen simultaneously!

Inherited set properties for events

Consider two events \(A\) and \(B\) both subsets of \(\Omega\)

  1. Commutativity: \(A\cup B = B\cup A\); \(A\cap B=B\cap A\)
  2. Associativity: \((A\cup B)\cup C=A\cup(B\cup C)\); \((A\cap B)\cap C=A\cap(B\cap C)\)
  3. Distributivity: \(A\cup(B\cap C)=(A\cup B)\cap(A\cup C)\); \(A\cap (B\cup C)=(A\cap B)\cup(A\cap C)\)
  4. De Morgan’s laws: \((A\cap B)^c=A^c\cup B^c\); \((A\cup B)^c=A^c\cap B^c\)

Inherited set properties for events

  1. \(\left(A^c\right)^c=A\)
  2. Complement law: \(A\cup A^c=\Omega\); \(A\cap A^c=\emptyset\)
  3. Identity element: \(A\cup\emptyset = A\); \(A\cap\Omega = A\)
  4. Absorbing element: \(A\cup\Omega = \Omega\); \(A\cap\emptyset = \emptyset\)
  5. Idempotent law: \(A\cup A=A\) ; \(A\cap A=A\)
  6. \(A\subset B\Rightarrow A\cap B=A\); \(A\subset B\Rightarrow A\cup B = B\)

Example

Consider the sample space \(\Omega =\{1,2,3,4,5,6\}\), from our 🎲 case.

Define the events: \[A=\{1\},\ B=\{3,6\},\ C=\{2,4,6\},\ D=\{4,5,6\}\]

Example

Let’s define the following events in \(\Omega\)

  • \(A\cup B\)
  • \(A\cap B\)
  • \(A^c\)
  • \((A\cup B)^c\)
  • \((B\cap C)^c\)
  • \(B\setminus C\)
  • \(C\setminus D\)
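These events can be worked out by hand. As a cross-check, here is a minimal R sketch using the base set functions `union`, `intersect` and `setdiff`. The variable names and the `complement` helper are illustrative assumptions, not part of the original slides.

```r
# Sample space and events from the die example
Omega <- 1:6
A <- c(1)
B <- c(3, 6)
C <- c(2, 4, 6)
D <- c(4, 5, 6)

# Complement with respect to Omega (hypothetical helper)
complement <- function(S) setdiff(Omega, S)

A_union_B      <- union(A, B)                  # A union B
A_inter_B      <- intersect(A, B)              # A intersect B (empty: A and B are disjoint)
A_comp         <- complement(A)                # complement of A
A_union_B_comp <- complement(union(A, B))      # complement of (A union B)
B_inter_C_comp <- complement(intersect(B, C))  # complement of (B intersect C)
B_minus_C      <- setdiff(B, C)                # B minus C
C_minus_D      <- setdiff(C, D)                # C minus D
```

Note that R vectors are not true sets (they can hold duplicates), so this only works as long as each event is written without repeated elements.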

Concept of probability

Besides the frequency and subjective concepts of probability that we already saw, there is an older one, known as the “classical” interpretation. It was introduced by Pierre-Simon Laplace in 1812.

Laplace or Classic interpretation of probability

Let \(A\) be an event defined over a finite \(\Omega\). The probability of event \(A\) is defined as:

\[P(A)=\frac{\# A}{\# \Omega}\]

Example

Consider now an experiment throwing two dice 🎲 🎲

  1. How many outcomes are in \(\Omega\)? \(6^2=36\), \(\#\Omega=36\).
  2. Let \(A\) be the event where both dice show the same number: \[A=\{(1,1), (2,2),...,(6,6)\}\] here \(\# A=6\)
  3. The probability that both dice show the same number is \[P(A)=\frac{\# A}{\# \Omega}=\frac{6}{36}=\frac{1}{6}\]
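The counting in this example can be reproduced by listing the sample space explicitly. The following R sketch assumes two fair, distinguishable dice and uses `expand.grid` to enumerate all ordered pairs; the column names are our own.

```r
# All ordered outcomes of two dice: one row per outcome
Omega <- expand.grid(first = 1:6, second = 1:6)
n_Omega <- nrow(Omega)

# Event A: both dice show the same number
in_A <- Omega$first == Omega$second
n_A <- sum(in_A)

# Laplace probability: favourable outcomes over possible outcomes
p_A <- n_A / n_Omega
```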

Laplace or Classic interpretation of probability

The problem with this interpretation is that it cannot be used, or becomes meaningless, when \(\Omega\) is infinite (whether countable or uncountable). Also, what if the outcomes are not equally likely (e.g. if the dice are not fair)?

Frequency interpretation

This is today still the dominant interpretation of probability.

In this case, what we want is to observe several independent repetitions of the experiment. After a while, some statistical regularity begins to emerge.

Frequency interpretation

Logically, if you run an experiment, and are interested in the probability of event \(A\), then the events you are registering are \(A\) and \(A^c\) or not \(A\).

Every time you run your experiment, you record whether you observed \(A\) or \(A^c\). The total number of experiments is then the number of times you observed \(A\) plus the number of times you observed \(A^c\).

Example:

Experiment: Draw a random number in the interval \([0,1]\). \(A\) denotes \(x<0.4\).

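| Times \(A\) occurred | Times \(A^c\) occurred | \(N\) | Relative frequency of \(A\) |
|---|---|---|---|
| 0 | 1 | 1 | 0 |
| 2 | 8 | 10 | 0.2 |
| 17 | 33 | 50 | 0.34 |
| 39 | 61 | 100 | 0.39 |
| 217 | 283 | 500 | 0.434 |
| 802 | 1198 | 2000 | 0.401 |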

Frequency interpretation

As you can see, the more experiments we run, the more the ratio of occurrences of \(A\) over the total number of experiments stabilizes. More generally:

\[P(A)=\lim_{N\rightarrow \infty}\frac{\text{number of occurrences of } A}{N}\]

where \(N\) is the number of experiments. The ratio inside the limit is the relative frequency of \(A\) in \(N\) experiments, denoted \(f_A\).

However, it is not always possible to repeat an experiment that many times under the same conditions.

About our previous example

It seems that the probability that the random number between 0 and 1 is below 0.4 is approximately 40%. The more experiments we run, the closer our relative frequency is to that number.

\[f_A\underset{N \rightarrow \infty}{\rightarrow} 0.4\]

By the way, we will see later that theoretically, indeed \(P(A)=0.4\)
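The stabilization of the relative frequency can be reproduced with a short simulation. This is only a sketch: it assumes the random number is drawn uniformly from \([0,1]\) (via `runif`), and the seed and the number of repetitions are arbitrary choices.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

N <- 100000
x <- runif(N)       # N independent draws from [0, 1]
occurs <- x < 0.4   # TRUE whenever event A happens

# Relative frequency of A after 1, 2, ..., N repetitions
f_A <- cumsum(occurs) / seq_len(N)

# Early values fluctuate a lot, later ones settle close to 0.4
f_A[c(10, 100, 1000, N)]
```

A different seed gives a different path, but the long-run behaviour should be the same.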

Probability Axioms

Probability

Andrey Kolmogorov defined a set of properties that any probability measure \(P\) should have. These are known as Kolmogorov’s axioms (1933):

  1. \(P(A)\in\mathbb{R}\) and \(P(A)\geq 0\), for any event \(A\subseteq\Omega\).
  2. \(P(\Omega)=1\)
  3. For any \(A\) and \(B\) disjoint, \(P(A)+P(B)=P(A\cup B)\)

Corollary

Let \(A\) and \(B\) be some events in \(\Omega\)

  1. \(P(\emptyset)=0\)
  2. \(P(A^c)=1-P(A)\)
  3. \(A\subset B\Rightarrow P(A)\leq P(B)\)
  4. \(0\leq P(A)\leq 1,\ \forall A\)
  5. \(P(A\setminus B)=P(A)-P(A\cap B)\)

Corollary

  1. \(P(A\cup B)= P(A)+P(B)-P(A\cap B)\)
  2. \[P\left(\bigcup^n_{i=1}A_i\right)=\sum_{i=1}^n P(A_i)-\sum_{i<j} P(A_i\cap A_j)+\sum_{i<j<k} P(A_i\cap A_j\cap A_k)-\dots+(-1)^{n-1}P\left(\bigcap_{i=1}^n A_i\right)\]
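As a sanity check of the union formulas for two and three events, here is a small R sketch on the fair-die sample space, with probabilities computed by counting (Laplace). The events are the ones used in the earlier die example; the helper `P` is our own shorthand.

```r
Omega <- 1:6
P <- function(E) length(E) / length(Omega)  # Laplace probability on a fair die

A <- c(1, 3, 5)     # odd number
B <- c(3, 4, 5, 6)  # at least 3
C <- c(1, 2, 3)     # lower than 4

# Two events: P(A u B) = P(A) + P(B) - P(A n B)
lhs2 <- P(union(A, B))
rhs2 <- P(A) + P(B) - P(intersect(A, B))

# Three events: singles, minus pairs, plus the triple intersection
lhs3 <- P(union(union(A, B), C))
rhs3 <- P(A) + P(B) + P(C) -
  P(intersect(A, B)) - P(intersect(A, C)) - P(intersect(B, C)) +
  P(intersect(intersect(A, B), C))
```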

Conditional Probability

Conditional probability, as the wording implies, means the probability of something happening given something else has happened. Now, note “something” here makes reference to an event.

\[P(A|B)\]

It reads the probability of \(A\), given \(B\).

Conditional Probability

Note that, thinking in terms of sets, by saying “given \(B\)” we immediately exclude everything that could have happened if \(B\) had not happened. Our universal set is therefore no longer \(\Omega\), but \(B\).

What we are looking for is, among the outcomes that lie in \(B\), how many also lie in \(A\) (because those would trigger event \(A\)). More precisely, we are interested in the relative measure of those outcomes, compared to the whole of \(B\): \[P(A|B)=\frac{P(A\cap B)}{P(B)}\]

Example

Consider a factory that makes 10 wrenches 🔧. Among those, we know that 2 have imperfections. Suppose you intend to remove, randomly, 2 🔧 from the lot (of 10). Consider the following events:

\(A = \{\text{The first wrench is faulty}\}\)

\(B = \{\text{The second wrench is faulty}\}\)

What if we want to compute \(P(B)\)? For a correct assessment of \(B\), we had better have some information on the realization of \(A\)!

Example

If the first 🔧 was faulty, then \(A\) happened. If the first 🔧 was ok, then \(A^c\) happened, and therefore we can compute \(P(B|A)\) and \(P(B|A^c)\). We are assuming that we are removing these 🔧 without replacing them.

Let’s see why this last detail (replacing the 🔧) is so relevant before going on.

Example: With replacement

Initial set:

1 2 3 4 5 6 7 8 9 10
🔧 🔧 💥 🔧 💥 🔧 🔧 🔧 🔧 🔧

Remove one at random (you cannot choose which one); only after removing it can you see what you got. Say we take out number 7.

1 2 3 4 5 6 7 8 9 10
🔧 🔧 💥 🔧 💥 🔧 🔧 🔧 🔧

We observe and, voilà, it was a fine wrench 🔧. If we put it back, though, we would be picking from

1 2 3 4 5 6 7 8 9 10
🔧 🔧 💥 🔧 💥 🔧 🔧 🔧 🔧 🔧

That is, under exactly the same conditions as our first pick. What happened with the first pick is therefore irrelevant: these events are now independent!

Example: with replacement

  • With \(A\), we know that when we pick the second wrench (\(B\)), in the box there are 2 💥 and 8 🔧.

  • With \(A^c\), we know that when we pick the second wrench (\(B\)), in the box there are 2 💥 and 8 🔧.

The probability of getting a 💥 is the same in each scenario! \(P(B|A)= P(B|A^c)\)!

\[P(B|A)=\frac{2}{10}=0.2\ \text{and}\ P(B|A^c)=\frac{2}{10}=0.2\]

Example: no replacement

Initial set:

1 2 3 4 5 6 7 8 9 10
🔧 🔧 💥 🔧 💥 🔧 🔧 🔧 🔧 🔧

Remove one at random (you do not know in advance whether it is broken)

1 2 3 4 5 6 7 8 9 10
🔧 🔧 💥 🔧 🔧 🔧 🔧 🔧 🔧

We observe and, voilà, it was a broken wrench 💥: \(A\) happened! Had we removed a fine wrench instead, the box would look like this:

1 2 3 4 5 6 7 8 9 10
🔧 🔧 💥 🔧 💥 🔧 🔧 🔧 🔧

Here we observe a fine wrench 🔧, so \(A^c\) happened!

Example: no replacement

  • With \(A\), we know that when we pick the second wrench (\(B\)), in the box there are 1 💥 and 8 🔧.

  • With \(A^c\), we know that when we pick the second wrench (\(B\)), in the box there are 2 💥 and 7 🔧.

The probability of getting a 💥 is different in each scenario! \(P(B|A)\neq P(B|A^c)\)!

\[P(B|A)=\frac{1}{9}=0.111\ \text{and}\ P(B|A^c)=\frac{2}{9}=0.222\]
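The two conditional probabilities can also be obtained by brute force, listing every possible ordered pair of picks and counting. The R sketch below assumes all ordered pairs are equally likely and that the faulty wrenches sit in positions 3 and 5, as in the picture; the variable names are illustrative.

```r
faulty <- c(3, 5)  # positions of the faulty wrenches in the picture

# All ordered pairs (first pick, second pick) without replacement
picks <- expand.grid(first = 1:10, second = 1:10)
picks <- picks[picks$first != picks$second, ]

A <- picks$first %in% faulty   # first wrench is faulty
B <- picks$second %in% faulty  # second wrench is faulty

# Conditional probabilities as ratios of counts
p_B_given_A  <- sum(A & B) / sum(A)
p_B_given_Ac <- sum(!A & B) / sum(!A)

# Unconditional probability that the second wrench is faulty
p_B <- mean(B)
```

The last line answers the question that opened the example: without any information on the first pick, the second wrench is faulty with the same probability as the first one.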

Conclusion Conditional Probability

So formally

Definition

Let \(A,B\subset\Omega\), then we say the probability of \(A\) given \(B\) is the conditional probability: \[P(A|B)=\frac{P(A\cap B)}{P(B)}\] Note that from here we can also obtain \(P(A\cap B)=P(A|B)P(B)\). In both cases we need \(P(B)\neq 0\).

Corollary

It follows that the identity \[P(B|A)=\frac{P(A\cap B)}{P(A)}\] or \[P(A\cap B)=P(B|A)P(A)\] with \(P(A)\neq 0\) also holds true.

Corollary

Now let’s think about \(P(A\cap B \cap C)\):

  • \(P(A\cap B \cap C)=P(A\cap (B \cap C))\)
  • \(P(A\cap B \cap C)=P(A|B\cap C)P(B\cap C)\)
  • \(P(A\cap B \cap C)=P(A|B\cap C)P(B|C) P(C)\)

Corollary

Note that given the commutativity of the intersection, we could have obtained also:

  • \(P(A\cap B \cap C)=P(A|B\cap C)P(C|B) P(B)\)
  • \(P(A\cap B \cap C)=P(B|A\cap C)P(A|C) P(C)\)
  • \(P(A\cap B \cap C)=P(B|A\cap C)P(C|A) P(A)\)
  • \(P(A\cap B \cap C)=P(C|A\cap B)P(A|B) P(B)\)
  • \(P(A\cap B \cap C)=P(C|A\cap B)P(B|A) P(A)\)

And to make sense of all of this we need \(P(X)>0\), \(P(X\cap Y)>0\) with \(X,Y\in\{A,B,C\}\).

Example

Consider a region with 1,000 adults. Their job data is captured by the following table:

| | Employed | Unemployed | Total |
|---|---|---|---|
| Women | 470 | 55 | 525 |
| Men | 430 | 45 | 475 |
| Total | 900 | 100 | 1,000 |

  1. Randomly selecting a person in this region, what is the probability this person is:
    1. Woman
    2. Unemployed
    3. Unemployed woman

Example

Let’s define the events:

\(W=\{\text{Woman}\}\), \(M=\{\text{Man}\}\), \(U=\{\text{Unemployed}\}\)

  1. Woman: \(P(W)=\frac{525}{1000}=0.525\)
  2. Unemployed: \(P(U)=\frac{100}{1000} = 0.1\)
  3. Unemployed woman: \(P(W\cap U)=\frac{55}{1000}=0.055\)

Example

  2. A citizen is randomly chosen from the population, and happens to be a woman. What is the probability that she is unemployed?

\(P(U|W)=\frac{P(U\cap W)}{P(W)}=\frac{0.055}{0.525}\approx 0.105\)

  3. A citizen is randomly chosen from the population, and happens to be unemployed. What is the probability that this person is a woman?

\(P(W|U)=\frac{P(W\cap U)}{P(U)}=\frac{0.055}{0.1}=0.55\)
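Since a conditional probability is the ratio of a joint probability to a marginal one, both answers can be recomputed directly from the table of counts. A minimal R sketch, with object names of our own choosing:

```r
# Counts from the table: rows are sex, columns are employment status
counts <- matrix(c(470, 55,
                   430, 45),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("Women", "Men"),
                                 c("Employed", "Unemployed")))
n <- sum(counts)

p_W  <- sum(counts["Women", ]) / n         # P(W)
p_U  <- sum(counts[, "Unemployed"]) / n    # P(U)
p_WU <- counts["Women", "Unemployed"] / n  # P(W and U)

p_U_given_W <- p_WU / p_W  # P(U | W)
p_W_given_U <- p_WU / p_U  # P(W | U)
```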

Independent Events

Definition

Two events \(A\) and \(B\) \(\subset\Omega\), are probabilistically independent if and only if: \[P(A\cap B)=P(A)P(B)\]

Independent Events

From the definition of independence, we can obtain several properties. Let \(A\) and \(B\) be independent events with \(P(A)P(B)>0\):

  1. \(P(A|B)=P(A)\) and \(P(B|A)=P(B)\) (remember conditional example with replacement)
  2. \(A^c\) and \(B\) are independent, as well as \(A\) and \(B^c\), and even \(A^c\) and \(B^c\).
  3. If \(A\) and \(B\) are incompatible, they cannot be independent. \(P(A\cap B)=P(\emptyset)=0\neq P(A)P(B)\)
  4. Any event is independent of \(\Omega\) and \(\emptyset\).

Example

Are \(W\) and \(U\) from the previous example independent?

\(P(W)=0.525\), \(P(U)=0.1\), \(P(W|U)=0.55\), \(P(U|W)=0.105\).

Note that \(P(W)\neq P(W|U)\) and \(P(U)\neq P(U|W)\). Therefore, they cannot be independent.

Example

Consider a die 🎲 that is thrown twice. Consider the following two events:

\(A=\{\text{The die shows an odd number the first time}\}\)

\(B=\{\text{The die shows a number }>4\text{ the second time}\}\)

Are \(A\) and \(B\) independent events?

Example

In this case, \(\Omega=\{(x,y)\in \mathbb{N}^2| x,y \leq 6\}\) with \(\# \Omega = 6^2=36\)

  • \(P(A)=\frac{18}{36}=\frac{1}{2}\)

  • \(P(B)=\frac{12}{36}=\frac{1}{3}\)

  • \(P(A\cap B)=\frac{1}{6}=P(A)P(B)\)

  • \(P(A|B)=\frac{P(A\cap B)}{P(B)}=\frac{1}{2}=P(A)\)

  • \(P(B|A)=\frac{P(B\cap A)}{P(A)}=\frac{1}{3}=P(B)\)

  • They are independent events!
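The same check can be done by enumeration. The R sketch below assumes a fair die, so that all 36 ordered pairs are equally likely and the mean of a logical vector gives the probability of the corresponding event.

```r
Omega <- expand.grid(first = 1:6, second = 1:6)

A <- Omega$first %% 2 == 1  # odd number on the first throw
B <- Omega$second > 4       # number above 4 on the second throw

p_A  <- mean(A)
p_B  <- mean(B)
p_AB <- mean(A & B)

# Independence holds when P(A and B) equals P(A) P(B), up to rounding error
independent <- isTRUE(all.equal(p_AB, p_A * p_B))
```

The comparison uses `all.equal` rather than `==` because products of fractions are stored as floating-point numbers.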

Remark

Two events being independent is not the same as their being incompatible:

| \(A\) and \(B\) independent | \(A\) and \(B\) incompatible |
|---|---|
| \(P(A\cap B)=P(A)P(B)\) | \(P(A\cap B)=0\) |
| \(P(A\mid B)=P(A)\) and \(P(B\mid A)=P(B)\) | \(P(A\mid B)=0\) and \(P(B\mid A)=0\) |

Example

Let \(A\) and \(B\) be two events such that: \(P(A)=0.6\), \(P(B)=t\), and \(P(A\cup B)=0.8\)

Find \(t\) such that \(A\) and \(B\) are:

  1. Mutually exclusive or incompatible.
  2. Independent.

Example

  1. In this case, what we need is that \(P(A\cap B)=0\).

From probability theory, we have \[P(A\cup B)=P(A)+P(B)-P(A\cap B)\] and therefore we get: \[0.8 = 0.6+t\Rightarrow t=0.2\]

Example

  2. To make \(A\) and \(B\) independent, we need that \(P(A\cap B)=P(A)P(B)=0.6t\):

Using the same identity we just used: \[0.8=0.6+t-0.6t\Rightarrow t= 0.5\]

Law of Total Probability

Theorem

Let \(\{A_i\}_{i=1}^n\) be a partition of \(\Omega\), or \(\{A_i\}\in \mathcal{P}(\Omega)\). Then, for any \(B\subset\Omega\), it holds that:

\[P(B)=\sum_{i=1}^n P(A_i\cap B)=\sum_{i=1}^n P(B|A_i)P(A_i)\]

Example

Consider a financial institution that sells two products, \(\alpha\) and \(\beta\), with very high yields. It is known that, among its clients, 10% invest in \(\alpha\) and the rest invest in \(\beta\). Among those who invest in \(\alpha\), 70% manage to get returns above the market. Among those who do not invest in \(\alpha\), 55% get returns above the market. Randomly choosing a client of this firm, find the probability that this client gets a return above the market.

Example

Let’s define the events:

  • \(A_1\) the client invests in \(\alpha\)
  • \(A_2\) the client invests in \(\beta\)
  • \(B\) has returns above the market.

Matching with the available data we obtain:

\(P(A_1)=0.1\), \(P(A_2)=0.9\), \(P(B|A_1)=0.7\), and \(P(B|A_2)=0.55\).

From the Law of Total Probability:

Example

\[P(B)=\sum_{i=1}^2 P(A_i\cap B)\] \[P(B)=P(B|A_1)P(A_1)+P(B|A_2)P(A_2)\] \[P(B)=0.7\times 0.1 + 0.55\times 0.9 = 0.565\]
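The Law of Total Probability is a weighted sum, which makes it easy to wrap in a small function. The helper below, `total_probability`, is a hypothetical name introduced only for illustration; it assumes the two vectors list \(P(B|A_i)\) and \(P(A_i)\) in the same order.

```r
# Hypothetical helper: P(B) from P(B | A_i) and P(A_i) over a partition
total_probability <- function(likelihood, prior) {
  stopifnot(length(likelihood) == length(prior),
            isTRUE(all.equal(sum(prior), 1)))  # probabilities of a partition add up to one
  sum(likelihood * prior)
}

p_B <- total_probability(likelihood = c(alpha = 0.70, beta = 0.55),
                         prior      = c(alpha = 0.10, beta = 0.90))
```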

Bayes Theorem

Bayes Theorem

Let the events \(A_1\), \(A_2\), … , \(A_n\), with \(n\in\mathbb{N}\), form a partition of \(\Omega\). Then, for any event \(B\subset\Omega\) with \(P(B)>0\):

\[P(A_i|B)=\frac{P(A_i\cap B)}{P(B)}=\frac{P(B|A_i)P(A_i)}{\sum_{j=1}^n P(B|A_j)P(A_j)}\] with \(i=1,2,...,n\)

Note that this is a consequence of the Law of Total Probability.

Bayes Theorem

Moreover, \(\sum_i P(A_i)=1\) and \(\sum_{i} P(A_i|B)=1\)

Bayes Theorem has been widely used in economics, in biomedical sciences, and social sciences when looking for causality.

If event \(B\) represents a consequence and event \(A_i\) a possible cause, Bayes Theorem allows us to assess the probability of this cause given the observed consequence, \(P(A_i|B)\).

Example

Let’s go back to the previous example, about our investors.

Let’s compute the probability that the client invested his money on product \(\beta\), but given that the client had returns above the market (event \(B\)).

Example

If the customer invested in \(\beta\), then the event we are interested in is \(A_2\), but conditional on event \(B\), \(P(A_2|B)\):

\[P(A_2|B)=\frac{P(A_2\cap B)}{P(B)}=\frac{P(A_2)\times P(B|A_2)}{\sum_i P(A_i)\times P(B|A_i)}\]

We knew from the previous exercise that \(P(B)=0.565\), and therefore we obtain:

\[P(A_2|B)=\frac{0.9\times 0.55}{0.565}=0.876\]

Example

How do we interpret this?

The probability that the client invested in \(\beta\), given that he had a return above the market, is 0.876.

Example

All these computations can be very easy with the help of the following table:

| \(A_i\) | \(P(A_i)\) | \(P(B\mid A_i)\) | \(P(A_i)P(B\mid A_i)\) | \(P(A_i\mid B)\) |
|---|---|---|---|---|
| \(A_1\) | 0.1 | 0.7 | 0.07 | 0.124 |
| \(A_2\) | 0.9 | 0.55 | 0.495 | 0.876 |
| Total | 1 | | 0.565 | 1 |
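The table can be built column by column in R, which mirrors how Bayes Theorem is applied: multiply, add up, divide. This is a sketch with column names of our own choosing.

```r
prior      <- c(A1 = 0.10, A2 = 0.90)  # P(A_i)
likelihood <- c(A1 = 0.70, A2 = 0.55)  # P(B | A_i)

joint     <- prior * likelihood  # P(A_i) P(B | A_i) = P(A_i and B)
p_B       <- sum(joint)          # Law of Total Probability
posterior <- joint / p_B         # Bayes Theorem: P(A_i | B)

bayes_table <- data.frame(prior, likelihood, joint,
                          posterior = round(posterior, 3))
bayes_table
```

Because every posterior is divided by the same \(P(B)\), the last column always adds up to one, which is a handy check when filling the table by hand.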

Example

Let’s verify now whether the events \(A_1\) and \(B^c\) are independent or not!

According to the definition of independence: \(P(A_1\cap B^c)=P(A_1)P(B^c)\)

  • \(P(A_1\cap B^c)=P(A_1)\times P(B^c|A_1)=0.1\times 0.3=0.03\)
  • \(P(A_1)\times P(B^c)=0.1\times(1-0.565)=0.0435\)
  • Then: \[P(A_1\cap B^c)=0.03\neq 0.0435=P(A_1)\times P(B^c)\]

Then \(A_1\) and \(B^c\) are not independent.


(Single) Random Variables

Random Variables

A random variable is a function that will allow us to quantify (transform into a number) each outcome.

Random Variable

A random variable (r.v.) \(X\) is a function \(X:\Omega\rightarrow \Omega_X\subset \mathbb{R}\). \(\Omega_X\) is known as the support of the r.v. \(X\).

\[\omega\in\Omega \overset{X}{\rightarrow} X(\omega)\in\Omega_X\subset\mathbb{R}\]

\(X(\omega)\) is the image under \(X\) of the outcome \(\omega\)

Summarizing, a r.v. is a function that associates a real number to each outcome from \(\Omega\).

Random Variables

  • \(\{X=x\}\), \(\{X\leq x\}\), \(\{X>x\}\) are events
  • \(\{X=x\}\) happens when, from our experiment, we obtain \(\omega\) such that \(X(\omega)=x\). \(\{X=x\}=\{\omega\in\Omega | X(\omega)=x\}\)
  • The probability of \(X\leq x\) for example, is then \[P(X\leq x)=P\left(\{\omega\in \Omega| X(\omega)\leq x\}\right)\]

Types of random variables

Discrete Random Variable

\(X\) is a discrete r.v. when:

  • The support of \(X\), \(\Omega_X\), is finite or countably infinite.
  • \(P(\Omega_X)=1\)

In this case, \(\Omega_X=\{x_1, x_2, ... , x_n\}\) with \(n\in\mathbb{N}\) if \(\Omega_X\) is finite, and \(\Omega_X=\{x_1,x_2,...,x_n,...\}\) if it is countably infinite.

Probability density function (pdf)

Let \(X\) be a discrete r.v. The pdf of \(X\) (for a discrete r.v. this is more commonly called the probability mass function, pmf) is a function \(f_X:\mathbb{R}\rightarrow\mathbb{R}\) such that:

\[f_X(x)=\left\{\begin{array}{cc}P(X=x) & ,\text{ if } x\in\Omega_X\\ 0 & ,\text{ if } x\in\mathbb{R}\setminus\Omega_X\end{array}\right.\]

Naturally, by construction the pdf satisfies the following properties:

\(f_X(x)\geq 0 \quad \forall x\in\mathbb{R}\)
\(\sum_{x_i\in\Omega_X}P(X=x_i)=1\)

Probability density function (pdf)

The pdf gives the probability in a single point. The total probability is distributed among single points, \(x_i\). A reasonable representation of a pdf of a discrete r.v. could be:

| \(x\) | \(x_1\) | \(x_2\) | \(\dots\) | \(x_n\) |
|---|---|---|---|---|
| \(f(x)\) | \(p_1\) | \(p_2\) | \(\dots\) | \(p_n\) |

Where \(p_i=P(X=x_i)\)

Example

Consider the discrete r.v. \(X\) with the following pdf:

| \(x\) | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| \(f(x)\) | \(0.05\) | \(a\) | \(0.35\) | \(0.25\) | \(0.05\) |

We could define:

  • Support of \(X\): \(\Omega_X=\{0,1,2,3,4\}\)
  • \(\sum_{x\in\Omega_X} f(x)=1 \Rightarrow 0.05 + a + 0.35 + 0.25 + 0.05 = 1\), that is, \(a=0.3\)

Example

Our table is now:

| \(x\) | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| \(f(x)\) | \(0.05\) | \(0.3\) | \(0.35\) | \(0.25\) | \(0.05\) |

What is \(P(X=2|X\leq 3)\)?

\[P(X=2|X\leq 3)=\frac{P(X=2 \cap X\leq 3)}{P(X\leq 3)}= \frac{P(X=2)}{P(X\leq 3)}\]

\[= \frac{f(2)}{f(0)+...+f(3)}=\frac{0.35}{0.95}\approx 0.368\]
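Both steps of this example (finding \(a\) and computing the conditional probability) can be reproduced with a few lines of R. This is a sketch in which the unknown entry is stored as `NA`; the variable names are our own.

```r
x <- 0:4
f <- c(0.05, NA, 0.35, 0.25, 0.05)  # NA marks the unknown value a

# The probabilities must add up to one, which pins down a
a <- 1 - sum(f, na.rm = TRUE)
f[is.na(f)] <- a

# P(X = 2 | X <= 3) = P(X = 2) / P(X <= 3)
p_cond <- f[x == 2] / sum(f[x <= 3])
round(p_cond, 3)
```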

Continuous random variables

\(X\) is a continuous r.v. if:

  • The support of \(X\), \(\Omega_X\), is uncountably infinite.
  • \(P(\Omega_X)=1\)
  • \(P(X=x)=0\) \(\forall x\in\mathbb{R}\).

Probability density function (pdf)

Let \(X\) be a continuous r.v.

There is a function \(f_X:\mathbb{R}\rightarrow\mathbb{R}\), the pdf of \(X\), such that:

  • \(f_X(x)\geq 0\), \(\forall x\in\mathbb{R}\)
  • \(\int_{-\infty}^{\infty}f_X(x)dx=1\)

Technically, from Measure Theory, we need an absolutely continuous r.v. to ensure the existence of a pdf. These issues are beyond the scope of this course. Just know that when we say continuous r.v. we mean absolutely continuous r.v.

Probability density function (pdf)

Note that this pdf allows us to compute the probability of events of the form \(\{X\in(a,b]\}\):

\[P(a<X\leq b)=\int_a^b f_X(x)dx\]

Observe that for the event \(\{X=a\}\) the integral runs from \(a\) to \(a\), an interval of length zero, and therefore the integral (and the probability) is 0.

Example

Let \(X\) be a continuous r.v. with the following pdf:

\[ f(x)=\left\{\begin{array}{cc} \theta x^2 & , 0\leq x< 1\\ 0 & ,\mathbb{R}\setminus [0,1) \end{array}\right. \]

  • Support for \(X\): \(\Omega_X=[0,1)\)

  • \[\int_{-\infty}^{\infty}f(x)dx=1\Leftrightarrow\int_0^1\theta x^2dx=1\Leftrightarrow\left[\theta\frac{x^3}{3}\right]_{0}^1=1\]

  • \[\theta\frac{1}{3}-\theta\frac{0}{3}=1\Leftrightarrow \theta=3\]
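The value of \(\theta\) can be double-checked numerically with R's `integrate` function. The sketch below relies on the fact that \(\theta\) is a constant, so it can be taken out of the integral; the object names are illustrative.

```r
# Integral of x^2 over the support [0, 1)
area <- integrate(function(x) x^2, lower = 0, upper = 1)$value

# theta * area must equal 1
theta <- 1 / area

# Check: the resulting density integrates to one
total <- integrate(function(x) theta * x^2, lower = 0, upper = 1)$value
```

`integrate` returns a numerical approximation, so `theta` matches the analytical result only up to a small numerical error.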

Cumulative distribution function (cdf)

Let \(X\) be a r.v. Its distribution function is the function \(F_X:\mathbb{R}\rightarrow[0,1]\) defined as:

\[F_X(x)=P(X\leq x)\]

\(F_X\) is unique.

Cumulative distribution function (cdf)

With a discrete r.v.

| \(x\) | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| \(f(x)\) | \(0.05\) | \(0.3\) | \(0.35\) | \(0.25\) | \(0.05\) |

\[F(x)=P(X\leq x)=\left\{ \begin{array}{cc} 0 & x<0 \\ 0.05 & 0\leq x < 1 \\ 0.05 + 0.3 = 0.35 & 1 \leq x < 2 \\ 0.35 + 0.35 = 0.7 & 2 \leq x < 3 \\ 0.7 + 0.25 = 0.95 & 3 \leq x < 4 \\ 1 & x\geq 4 \end{array} \right.\]

Cumulative distribution function (cdf)

Let’s revisit our previous example:

\[P(X=2|X\leq 3)= \frac{P(X=2)}{P(X\leq 3)}=\] \[\frac{F(2)-F(2^-)}{F(3)}= \frac{0.7-0.35}{0.95}=0.368 \]

Cumulative distribution function (cdf)

With a continuous r.v.:

\[F_X(x)=P(X\leq x)=\int_{-\infty}^{x} f_X(t)dt\]

The distribution function \(F_X\) allows us to compute the probability of \(\{X\in(a,b]\}\):

\[P(a<X\leq b)=\int_a^b f_X(x)dx=F_X(b)-F_X(a)\]

Example

# Load necessary package
library(ggplot2)

# Create a sequence of x values
x <- seq(-4, 4, length.out = 1000)
y <- dnorm(x)

# Create a data frame
df <- data.frame(x = x, y = y)

# Define the shaded region
df$shade <- ifelse(df$x >= -0.2 & df$x <= 0.5, df$y, NA)

# Plot the density, shade the area between a and b, and label both endpoints
ggplot(df, aes(x = x, y = y)) +
  geom_line(color = "blue") +
  geom_area(aes(y = shade), fill = "skyblue", alpha = 0.5) +
  annotate("text", x = -0.2, y = 0.02, label = "a", color = "black", size = 7) +
  annotate("text", x = 0.5, y = 0.02, label = "b", color = "black", size = 7) +
  annotate("text", x = 0.35, y = 0.1, label = "P(a < X ~ '\u2264' ~ b)", parse = TRUE, size = 5) +
  labs(title = "Standard Normal Distribution",
       x = "x", y = "Density") +
  theme_minimal()

Example

Consider the continuous r.v. defined previously, with the pdf:

\[ f(x)=\left\{ \begin{array}{cc} 3x^2 & , 0\leq x< 1\\ 0 & , \mathbb{R}\setminus [0,1) \end{array} \right. \]

Support for \(X\): \(\Omega_X=[0,1)\)

Distribution function (cdf):

\[ F(x)=P(X\leq x) = \int_{-\infty}^x f(t)dt = \left\{ \begin{array}{cc} 0 & , x<0\\ x^3 & ,0\leq x<1 \\ 1& ,x\geq 1 \end{array} \right. \]
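To see the pdf and the cdf of this example giving the same answer, the following R sketch computes the probability of an interval in two ways: by integrating the density and by taking a difference of the cdf. The endpoints 0.2 and 0.5 are arbitrary values chosen for illustration.

```r
f_X <- function(x) ifelse(x >= 0 & x < 1, 3 * x^2, 0)  # pdf
F_X <- function(x) pmin(pmax(x, 0), 1)^3               # cdf: 0 below 0, x^3 on [0, 1), 1 above

# Probability of an interval, computed in two ways
a <- 0.2
b <- 0.5
p_integral <- integrate(f_X, lower = a, upper = b)$value
p_cdf      <- F_X(b) - F_X(a)
```

Clamping the argument to \([0,1]\) with `pmax` and `pmin` before cubing is just a compact way of writing the three branches of the cdf.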

Properties of the cdf

Regardless of whether the r.v. is discrete or continuous, \(F_X\) has the following properties:

  • \(F_X:\mathbb{R}\rightarrow[0,1]\)
  • \(F_X\) is continuous from the right: \(\lim_{x\rightarrow a^+}F_X(x)=F_X(a)\)
  • \(F_X\) is monotone non-decreasing.
  • \(\lim_{x\rightarrow -\infty} F_X(x)=0\)
  • \(\lim_{x\rightarrow \infty} F_X(x)=1\)

Properties of the cdf

\(F_X\) for a discrete r.v. \(X\)

  • The support \(\Omega_X\) consists of isolated points, so \(F_X\) is a step function
  • It is continuous from the right
  • \(P(X<x)=F(x^-)\)
  • \(P(X=x)=F(x)-F(x^-)\)
  • \(P(a<X\leq b)=F(b)-F(a)\)
  • \(P(a\leq X\leq b)=F(b)-F(a^-)\)
  • \(P(a<X<b)=F(b^-)-F(a)\)
  • \(P(a\leq X < b) = F(b^-)-F(a^-)\)
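
The four interval probabilities can be checked against direct sums of the pmf. The sketch below assumes the discrete example used earlier (support 0 to 4) and arbitrary endpoints \(a=1\), \(b=3\); the helper names are hypothetical.

```r
# pmf of the discrete example
x <- 0:4
p <- c(0.05, 0.30, 0.35, 0.25, 0.05)

cdf <- function(t) sum(p[x <= t])      # F(t)
cdf_left <- function(t) sum(p[x < t])  # F(t^-)

a <- 1
b <- 3

# Probabilities obtained by summing the pmf over each interval
direct <- c(
  open_closed   = sum(p[x > a & x <= b]),
  closed_closed = sum(p[x >= a & x <= b]),
  open_open     = sum(p[x > a & x < b]),
  closed_open   = sum(p[x >= a & x < b])
)

# The same probabilities written with F and its left limits
via_cdf <- c(
  open_closed   = cdf(b) - cdf(a),
  closed_closed = cdf(b) - cdf_left(a),
  open_open     = cdf_left(b) - cdf(a),
  closed_open   = cdf_left(b) - cdf_left(a)
)

all.equal(direct, via_cdf)  # TRUE
```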

Properties of the cdf

\(F_X\) for a continuous r.v. \(X\)

  • Continuous support \(\Omega_X\)
  • It is continuous in \(\mathbb{R}\)
  • \(P(X<x)=P(X\leq x)=F(x)\)
  • \(P(a\square X\square b)=F(b)-F(a)\), [replace each \(\square\) with \(<\) or \(\leq\)]

pdf and cdf

For a discrete r.v.

\[P(X=x)=F_X(x)-F_X(x^-)\] \[+\downarrow \uparrow -\] \[F_X(x)=\sum_{x_i\leq x}P(X=x_i)\]

pdf and cdf

For a continuous r.v.

\[ f_X(x)=\left\{ \begin{array}{cc} F_X'(x) & ,x\in\mathbb{R} \text{ if }F_X'\text{ exists} \\ 0 & \text{, otherwise} \end{array} \right. \]

\[ \text{Primitive} \downarrow \uparrow \text{Derivative}\]

\[ F_X(x)=\int_{-\infty}^x f_X(t)dt\]

The pdf of \(X\), a continuous r.v., is not unique: changing \(f_X\) at a finite number of points does not change \(F_X\).

Statistical Moments of a Population

Describing a population

We could describe the range of \(X\), a r.v., as a population, in the statistical sense, because it describes all the possible values it can take.

We can use numerical values to do so, which can represent dispersion or centrality of the data.

Expected Value or mean

The expected value, or mean, is a location parameter for our r.v.

Definition

The expected value or mean of a random variable \(X\) is:

  • \(\mu_X=E[X]=\sum_{x\in\Omega_X}x P(X=x) < \infty\) if \(X\) is discrete
  • \(\mu_X=E[X]=\int_{-\infty}^{\infty}xf_X(x)dx<\infty\) if \(X\) is continuous.

Not all random variables have an expected value: the sum or integral above might not be finite.

Expected Value or mean

Let \(X,Y\) be r.v.s and \(a,b\in\mathbb{R}\) scalars. Some properties of the mean:

  1. \(E[a]=a\)
  2. \(E[aX+bY]=aE[X]+bE[Y]\)
  3. If \(Y=g(X)\) and \(X\) is:
    • Discrete, then \[E[Y]=\sum_{x\in\Omega_X}g(x)P(X=x)<\infty\]
    • Continuous, then \[E[Y]=\int_{-\infty}^{\infty}g(x)f_X(x)dx<\infty\]

Example discrete

Let \(X\) be a discrete r.v. as in the previous example:

\[ \begin{array}{c|ccccc} x & 0 & 1 & 2 & 3 & 4 \\ \hline f(x) & 0.05 & 0.3 & 0.35 & 0.25 & 0.05 \end{array} \]

Let \(g(X)=2(X-1)^2+3(X-1)-5\), find \(E[g(X)]\).

Example discrete

\[g(X)=2(X-1)^2+3(X-1)-5\] \[=2(X^2-2X+1)+3X-3-5\] \[=2X^2-4X+2+3X-8\] \[=2X^2-X-6\]

\[E[Y]=E[2X^2-X-6]\]

\[=2E[X^2]-E[X]-6\]

Example discrete

We only need to find \(E[X]\) and \(E[X^2]\) to obtain \(E[g(X)]\).

\[E[X]=\sum xP(X=x)\] \[ = 0\times .05 + 1 \times .3 + 2 \times .35 + 3 \times .25 + 4\times .05 = 1.95\]

\[E[X^2]=\sum x^2 P(X=x)\]

\[ = 0\times .05 + 1 \times .3 + 4 \times .35 + 9 \times .25 + 16\times .05 = 4.75\]

\[E[g(X)]=2\times 4.75 - 1.95 - 6 = 1.55\]
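
The result can be cross-checked in R in two ways: directly from \(\sum g(x)P(X=x)\), and through linearity using \(E[X]\) and \(E[X^2]\). This is a minimal sketch; the variable names are illustrative.

```r
# pmf of the discrete example
x <- 0:4
p <- c(0.05, 0.30, 0.35, 0.25, 0.05)

g <- function(x) 2 * (x - 1)^2 + 3 * (x - 1) - 5

e_x  <- sum(x * p)    # E[X]   = 1.95
e_x2 <- sum(x^2 * p)  # E[X^2] = 4.75

# Direct definition: sum of g(x) P(X = x)
e_g_direct <- sum(g(x) * p)

# Linearity, using g(X) = 2X^2 - X - 6
e_g_linear <- 2 * e_x2 - e_x - 6

c(direct = e_g_direct, linear = e_g_linear)  # both 1.55
```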

Example continuous

Recall our example of a continuous r.v. \(X\): \[ f_X(x)=\left\{ \begin{array}{cc} 3x^2 & ,0\leq x < 1 \\ 0 & , x\in\mathbb{R}\setminus[0,1) \end{array} \right. \]

Find \(E[g(X)]\) when \(g(X)=2(X-1)^2+3(X-1)-5\). We already know that \(g(X)=2X^2-X-6\), so let’s focus on \(E[X]\) and \(E[X^2]\).

Example continuous

\[E[X]=\int_{-\infty}^{\infty} xf_X(x)dx = \int_0^1 x\times 3x^2 dx\] \[= \int_0^1 3x^3dx=\left[3\frac{x^4}{4}\right]_{0}^1=\frac{3}{4}=0.75\]

\[E[X^2]=\int_{-\infty}^{\infty} x^2f_X(x)dx = \int_0^1 x^2\times 3x^2 dx\] \[= \int_0^1 3x^4dx=\left[3\frac{x^5}{5}\right]_{0}^1=\frac{3}{5}=0.6\]

Example continuous

Finally,

\(E[g(X)]=2\times 0.6 - 0.75 - 6 = -5.55\)
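
A numerical check with R's `integrate()` is sketched below, assuming the density \(3x^2\) on \([0,1)\) from the example. It integrates \(g(x)f_X(x)\) directly instead of going through \(E[X]\) and \(E[X^2]\).

```r
# Density on the support [0, 1) and the transformation g
pdf_x <- function(x) 3 * x^2
g <- function(x) 2 * (x - 1)^2 + 3 * (x - 1) - 5

# Moments by numerical integration over the support
e_x  <- integrate(function(x) x * pdf_x(x), lower = 0, upper = 1)$value
e_x2 <- integrate(function(x) x^2 * pdf_x(x), lower = 0, upper = 1)$value
e_g  <- integrate(function(x) g(x) * pdf_x(x), lower = 0, upper = 1)$value

round(c(E_X = e_x, E_X2 = e_x2, E_g = e_g), 4)  # 0.75, 0.6, -5.55
```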

p-quantile

The p-quantile, \(x_p\), of a r.v. \(X\) is a location parameter, defined for a fixed \(p\in(0,1)\).

p-quantile for discrete r.v.

\(x_p\) is a value \(x\in\Omega_X\) such that:

  1. \(P(X\leq x)\geq p\)
  2. \(P(X\geq x)\geq 1-p\)

Or, equivalently, \(x\in\Omega_X\) such that \(F_X(x^-)\leq p \leq F_X(x)\)

p-quantile for continuous r.v.

\(x_p\) is an \(x\in\Omega_X\) such that \(F_X(x)=p\)

Let’s apply this for the examples we just used for the expected value.

Example - Discrete

Find the median (0.5-quantile) for \(X\)

\[ F(x)=P(X\leq x)=\left\{ \begin{array}{cc} 0 &, x< 0\\ 0.05 &, 0\leq x <1 \\ 0.35 &, 1\leq x < 2 \\ 0.7 &, 2 \leq x < 3 \\ 0.95 &, 3 \leq x < 4 \\ 1 &, x\geq 4 \end{array} \right. \]

Since \(F(2^-)=0.35\leq 0.5 \leq 0.7=F(2)\), the median is \(x_{0.5}=Me = 2\). Given that \(E[X]=1.95<Me(X)=2\), the distribution is slightly negatively (or left) skewed.
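
The condition \(F_X(x^-)\leq p\leq F_X(x)\) can be evaluated for every point of the support at once. The sketch below assumes the pmf of the example; note that, in general, more than one support point may satisfy the condition when \(p\) coincides with a value of the cdf.

```r
# pmf of the discrete example
x <- 0:4
p <- c(0.05, 0.30, 0.35, 0.25, 0.05)

cdf <- cumsum(p)     # F(x) at each support point
cdf_left <- cdf - p  # F(x^-) at each support point

prob <- 0.5
# Support points satisfying F(x^-) <= prob <= F(x)
median_x <- x[cdf_left <= prob & prob <= cdf]
median_x  # 2
```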

Example - Continuous

\[F_X(x)=P(X\leq x)=\left\{ \begin{array}{cc} 0 & x < 0 \\ x^3 & 0\leq x <1 \\ 1 & x\geq 1 \end{array} \right. \]

Let’s find \(x\) such that \(F(x)=0.5\)

\(F(x)=0.5\Leftrightarrow x^3=0.5\Leftrightarrow x=\sqrt[3]{0.5}\approx 0.7937\)

And therefore, \(x_{0.5}=Me\approx 0.7937\)
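
When the equation \(F_X(x)=p\) has no convenient closed form, a root finder can be used instead. The following sketch applies R's `uniroot()` to this example, where the closed form is available for comparison; the name `cdf_x` is an illustrative assumption.

```r
# cdf of the continuous example on its support [0, 1)
cdf_x <- function(x) x^3

# Solve F(x) - 0.5 = 0 on the support
median_numeric <- uniroot(function(x) cdf_x(x) - 0.5,
                          interval = c(0, 1), tol = 1e-10)$root

round(c(numeric = median_numeric, closed_form = 0.5^(1 / 3)), 4)  # both 0.7937
```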

Variance

Let \(X\) be a r.v. The variance of \(X\), if it exists, is defined as:

\[V[X]=E\left[\left(X-E[X]\right)^2\right]\]

It can be shown, with some algebraic manipulation, that \(V[X]=E\left[X^2\right]-\left(E[X]\right)^2\).
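
The manipulation only requires expanding the square and using the linearity of the expected value, with \(\mu_X=E[X]\) treated as a constant:

\[V[X]=E\left[(X-\mu_X)^2\right]=E\left[X^2-2\mu_X X+\mu_X^2\right]\] \[=E\left[X^2\right]-2\mu_X E[X]+\mu_X^2=E\left[X^2\right]-2\mu_X^2+\mu_X^2=E\left[X^2\right]-\left(E[X]\right)^2\]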

Variance

Remember that \(E[X]\equiv\mu_X\)

  1. For a discrete r.v.: \[V[X]=\sum_{x\in\Omega_X}(x-\mu_X)^2P(X=x)\]
  2. For a continuous r.v.: \[V[X]=\int_{-\infty}^{\infty}(x-\mu_X)^2f_X(x)dx\]

Usually we write \(V[X]\) as \(\sigma^2_X\).

Variance

Some properties for the variance:

  1. \(V[a]=0\) for \(a\in\mathbb{R}\)
  2. \(V[aX+b] = a^2V[X]\) for \(a,b\in \mathbb{R}\) and \(X\) a r.v.
  3. If \(X\) and \(Y\) are independent r.v. with finite variance, then \[V[X\pm Y]=V[X]+V[Y]\]

Standard deviation \(\sigma_X\)

If \(\sigma^2_X\) is the variance of \(X\), then the standard deviation is defined as: \[\sigma_X=\sqrt{V[X]}\]

One characteristic of the standard deviation is that its units are the same as those of the random variable.

Coefficient of variation

While the variance and standard deviation allow us to measure the dispersion of the data, we might want to have it relative to the mean (a \(\sigma_X=1\) can be a lot for \(X\) taking relatively low values, but negligible if we are talking in millions!)

For that we use the coefficient of variation:

\[CV_X =\frac{\sigma_X}{\mu_X}\times 100\]

Coefficient of variation

Some properties of the \(CV_X\)

  1. It is not defined for \(\mu_X=0\)
  2. Lower values of \(CV_X\) mean less dispersion around \(\mu_X\) and, therefore, more precision.
  3. \(CV_X\) could be a measure for risk, if we are looking at returns for some asset.
  4. \(CV_X\) is the relative weight of deviations from the mean, over the mean itself.

Example

Let’s compute \(\sigma^2\), \(\sigma\), and \(CV\) for our previous examples:

  1. \(V[X]=E[X^2]-\mu_X^2=4.75-1.95^2=0.9475\)
  2. \(\sigma=\sqrt{V[X]}=\sqrt{0.9475}\approx 0.9734\)
  3. \(CV=\frac{\sigma}{\mu}\times 100 = \frac{0.9734}{1.95}\times 100 \approx 49.9\%\)

As a rule of thumb, values of \(CV\) below \(50\%\) allow us to see \(\mu\) as representative of the data. The lower the \(CV\), the closer the data are to \(\mu\), and therefore the more representative it is.

Example

For the continuous r.v. case:

  1. \(V[X]=E[X^2]-\mu_X^2=0.6-0.75^2=0.0375\)
  2. \(\sigma=\sqrt{V[X]}=\sqrt{0.0375}\approx 0.1936\)
  3. \(CV=\frac{\sigma}{\mu}\times 100 = \frac{0.1936}{0.75}\times 100\approx 25.8\% < 50\%\)
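
Both examples can be reproduced with a small helper that takes \(E[X]\) and \(E[X^2]\) as inputs. This is a sketch; the function name `summarise_rv` is hypothetical. Small differences in the last digit of the \(CV\) can appear depending on whether \(\sigma\) is rounded before dividing.

```r
# Variance, standard deviation and CV (in %) from the first two moments
summarise_rv <- function(mean_x, mean_x2) {
  variance <- mean_x2 - mean_x^2
  std_dev <- sqrt(variance)
  c(variance = variance, sd = std_dev, cv_percent = 100 * std_dev / mean_x)
}

discrete   <- summarise_rv(mean_x = 1.95, mean_x2 = 4.75)
continuous <- summarise_rv(mean_x = 0.75, mean_x2 = 0.6)

round(rbind(discrete, continuous), 4)
```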

Random Pairs

Random pair

When running an experiment, it can be interesting to study the relationship between two numeric features associated with each outcome.

Random pair

A random pair \((X,Y)\) is a function \((X,Y):\Omega\rightarrow \left(\Omega_X,\Omega_Y\right)\subset\mathbb{R}^2\). \(\left(\Omega_X, \Omega_Y\right)\) is known as the support of the random pair \((X,Y)\).

\[\omega\in\Omega \overset{(X,Y)}{\rightarrow}\left(X(\omega),Y(\omega)\right)\in(\Omega_X,\Omega_Y)\subset\mathbb{R}^2\]

\(X(\omega)\) is the image of outcome \(\omega\) under \(X\), and \(Y(\omega)\) is the image of the same outcome under \(Y\).

Discrete random pair

A random pair \((X,Y)\) is discrete when:

  • The support of \((X,Y)\), \((\Omega_X,\Omega_Y)\), is a finite or countably infinite set of pairs.
  • \(P\left((X,Y)\in(\Omega_X,\Omega_Y)\right)=1\)

Joint density function \(f_{X,Y}\)

Let \((X,Y)\) be a discrete random pair. The joint density function \(f_{X,Y}(x,y)\) is a function \(f_{X,Y}:\mathbb{R}^2\rightarrow\mathbb{R}\) defined as:

\[ f_{X,Y}(x,y)=\left\{ \begin{array}{cl} P(X=x,Y=y) & , (x,y)\in(\Omega_X,\Omega_Y)\\ 0 & , (x,y)\in\mathbb{R}^2\setminus(\Omega_X,\Omega_Y) \end{array} \right. \]

Joint density function \(f_{X,Y}\)

\(f_{X,Y}\) satisfies the following properties:

  1. \(f_{X,Y}(x,y)\geq 0\quad\forall(x,y)\in\mathbb{R}^2\)
  2. \(\sum_{x_i\in\Omega_X}\sum_{y_j\in\Omega_Y}P\left(X=x_i,Y=y_j\right)=1\)

A possible notation for \(P(X=x_i,Y=y_j)\) is \(p_{i,j}\)

Joint density function \(f_{X,Y}\)

\[ \begin{array}{c|ccccc|c} & y_1 & y_2 & \dots & y_j & \dots & \\ \hline x_1 & p_{11} & p_{12} & \dots & p_{1j} & \dots & \sum_{j=1}^\infty p_{1j} \\ x_2 & p_{21} & p_{22} & \dots & p_{2j} & \dots & \sum_{j=1}^\infty p_{2j} \\ \vdots & \vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\ x_i & p_{i1} & p_{i2} & \dots & p_{ij} & \dots & \sum_{j=1}^\infty p_{ij} \\ \vdots & \vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\ \hline & \sum_{i=1}^\infty p_{i1} & \sum_{i=1}^\infty p_{i2} & \dots & \sum_{i=1}^\infty p_{ij} & \dots & 1 \end{array} \]

Marginal probability function

Given a random pair \((X,Y)\), the marginal probability functions of \(X\) and \(Y\) are, respectively:

  • \(f_X(x_i) = P(X=x_i)=\) \[\sum_{j=1}^\infty P(X=x_i, Y=y_j) = \sum_{j=1}^\infty p_{ij}\]
  • \(f_Y(y_j) = P(Y=y_j)=\) \[\sum_{i=1}^\infty P(X=x_i, Y=y_j) = \sum_{i=1}^\infty p_{ij}\]

for \(i=1,2,...\) and \(j=1,2,...\). Note that each of these is a function of a single variable.

Example

At SuperStore 🏪, three trained employees are qualified to operate the checkout counters, restock products on the shelves, and perform some administrative tasks. SuperStore has three checkout counters, and at least one of them must always be operating.

On any given day and at any moment when SuperStore is open to customers, consider the following random variables:

  • \(X\): number of employees at the checkout counters 🛒 💳.
  • \(Y\): number of employees restocking products on the shelves 📦.

Example

The r.v. \(X\) has \(\Omega_X=\{1,2,3\}\) and the following pdf:

\[ \begin{array}{c|ccc} x & 1 & 2 & 3 \\ \hline f_X(x) & 0.17 & 0.8 & 0.03 \end{array} \]

Consider the following table for the joint probability of \((X,Y)\)

\[ \begin{array}{c|ccc|c} X\setminus Y & 0 & 1 & 2 & \\ \hline 1 & a & 2b & b & \\ 2 & 0.1 & c & 0 & \\ 3 & 0.03 & 0 & 0 & \\ \hline & & & & 1 \end{array} \]

Example

  1. If \(P(X=1, Y=0)=0.02\), find \(b\) and \(c\)

    • We directly get \(a=0.02\), as it is \(P(X=1, Y=0)\)
    • Note that the bottom row of the table is \(f_Y(y)\) and the rightmost column is \(f_X(x)\)
    • Fill in \(f_X(x)\) directly with \(0.17\), \(0.8\), and \(0.03\).
    • Fill in \(f_Y(y)\) with the sum of each column

Example

\[ \begin{array}{c|ccc|c} X\setminus Y & 0 & 1 & 2 & \color{red}{f_X(x)} \\ \hline 1 & \color{red}{0.02} & 2b & b & \color{red}{0.17} \\ 2 & 0.1 & c & 0 & \color{red}{0.8} \\ 3 & 0.03 & 0 & 0 & \color{red}{0.03} \\ \hline \color{red}{f_Y(y)} & \color{red}{0.15} & \color{red}{2b+c} & \color{red}{b} & 1 \end{array} \]

  • From first row: \(0.02 + 2b + b = 0.17\) \(\Rightarrow\) \(b=\frac{0.15}{3}=0.05\)
  • From second row: \(0.1 + c = 0.8\) \(\Rightarrow\) \(c=0.7\)
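
With \(b=0.05\) and \(c=0.7\), the completed joint table can be stored as a matrix, and the marginals follow from row and column sums. This is a sketch; the object names (`joint`, `f_x`, `f_y`) are illustrative.

```r
# Joint probabilities: rows are values of X (1, 2, 3), columns values of Y (0, 1, 2)
joint <- matrix(c(0.02, 0.10, 0.05,
                  0.10, 0.70, 0.00,
                  0.03, 0.00, 0.00),
                nrow = 3, byrow = TRUE,
                dimnames = list(X = c("1", "2", "3"), Y = c("0", "1", "2")))

f_x <- rowSums(joint)  # marginal of X: 0.17, 0.80, 0.03
f_y <- colSums(joint)  # marginal of Y: 0.15, 0.80, 0.05

sum(joint)  # total probability, should be 1
```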

Example

  2. \(P(X=2|Y\geq 1)\) is approximately…?

    • \[P(X=2|Y\geq 1)=\frac{P(X=2, Y\geq 1)}{P(Y\geq 1)}\]
    • \[= \frac{P(X=2, Y=1) + P(X=2, Y=2)}{P(Y=1)+P(Y=2)}\]
    • \[= \frac{0.7+0}{0.8+0.05}=\frac{0.7}{0.85}\approx 0.8235\]
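
The same conditional probability can be read off the joint matrix. The sketch below rebuilds the completed table so that it is self-contained; the object names are illustrative.

```r
# Completed joint table: rows X = 1, 2, 3; columns Y = 0, 1, 2
joint <- matrix(c(0.02, 0.10, 0.05,
                  0.10, 0.70, 0.00,
                  0.03, 0.00, 0.00),
                nrow = 3, byrow = TRUE,
                dimnames = list(X = c("1", "2", "3"), Y = c("0", "1", "2")))

# Numerator: P(X = 2, Y >= 1); denominator: P(Y >= 1)
p_joint_event <- sum(joint["2", c("1", "2")])
p_condition   <- sum(joint[, c("1", "2")])

cond_prob <- p_joint_event / p_condition
round(cond_prob, 4)  # 0.8235
```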

Independence of random variables (revisited)

Let \((X,Y)\) be a discrete random pair. \(X\) and \(Y\) are independent if, and only if: \[P(X=x, Y=y)=P(X=x)P(Y=y)\quad \forall(x,y)\in\mathbb{R}^2\]

That is, the joint probability function equals the product of the marginal probability functions.

Example

  3. (continued exercise) Are \(X\) and \(Y\) independent?

    • \(P(X=2, Y=2)=0\)
    • \(P(X=2)P(Y=2)= 0.8 \times 0.05=0.04\)
    • \(0\neq 0.04\)
    • \(X\) and \(Y\) are not independent.
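
A single counterexample is enough, but all cells can also be compared at once: under independence the joint table would equal the outer product of the marginals. A minimal sketch, with illustrative object names and an arbitrary numerical tolerance:

```r
# Completed joint table: rows X = 1, 2, 3; columns Y = 0, 1, 2
joint <- matrix(c(0.02, 0.10, 0.05,
                  0.10, 0.70, 0.00,
                  0.03, 0.00, 0.00),
                nrow = 3, byrow = TRUE,
                dimnames = list(X = c("1", "2", "3"), Y = c("0", "1", "2")))

f_x <- rowSums(joint)
f_y <- colSums(joint)

# Table implied by independence: P(X = x) * P(Y = y) in every cell
product_table <- outer(f_x, f_y)

# Independent only if every cell matches (up to rounding error)
independent <- all(abs(joint - product_table) < 1e-12)
independent  # FALSE
```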

Moments of a random pair

Definition

Let the discrete random pair \((X,Y)\) have joint probability function \(P(X=x,Y=y)\), and let \(g:\mathbb{R}^2\rightarrow\mathbb{R}\) be a function. The expected value or mean of \(g(X,Y)\) is:

\[E[g(X,Y)]=\sum_{i=1}^\infty\sum_{j=1}^\infty g(x_i,y_j)P(X=x_i,Y=y_j)\]

If \(g(x,y)=xy\), then \(E[g(X,Y)]=E[XY]\) and that equals \[\sum_{i=1}^\infty\sum_{j=1}^\infty x_iy_jP(X=x_i,Y=y_j)\]

Moments of a random pair

Definition

Let the discrete random pair \((X,Y)\) have joint probability function \(P(X=x,Y=y)\), with \(\mu_X=E[X]\) and \(\mu_Y=E[Y]\). The covariance between \(X\) and \(Y\) is:

\[cov(X,Y)=E[(X-\mu_X)(Y-\mu_Y)]\]

provided that \(E[(X-\mu_X)(Y-\mu_Y)]\) exists.

Note that this is equivalent to \(cov(X,Y)=E[XY]-E[X]E[Y]\)
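
The equivalence follows from expanding the product and applying the linearity of the expected value, with \(\mu_X\) and \(\mu_Y\) treated as constants:

\[cov(X,Y)=E[XY-\mu_Y X-\mu_X Y+\mu_X\mu_Y]\] \[=E[XY]-\mu_Y E[X]-\mu_X E[Y]+\mu_X\mu_Y=E[XY]-E[X]E[Y]\]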

Properties of the covariance

The covariance tries to capture how the two r.v. move together. If it is positive, it means that both tend to go in the same direction more often than not (both above or below their means at the same time). Being negative means that more often than not when one is above its mean, the other is below.

  1. \(cov(X,Y)=cov(Y,X)\)
  2. \(cov(X,X)=V[X]\) and \(cov(Y,Y)=V[Y]\) if \(V[X]\) and \(V[Y]\) exist.
  3. \(cov(a+bX, c+dY)=bd\, cov(X,Y)\) with \(a,b,c,d\in\mathbb{R}\)

Properties of the covariance

If \(X\) and \(Y\) are independent r.v. then \(cov(X,Y)=0\). Note that the opposite is not necessarily true, i.e. \(cov(X,Y)=0\) does not imply that \(X\) and \(Y\) are independent.

Another important identity with the covariance is the following:

\[V[X\pm Y] = V[X]+V[Y]\pm 2 cov(X,Y)\]

Example

  4. Knowing that \(E[Y]=0.9\), what is \(cov(X,Y)\)?

  • From the first table: \(E[X]=0.17\times 1 + 0.8 \times 2 + 0.03 \times 3 = 1.86\)

  • \[E[XY]=\sum_{x}\sum_{y}xyP(X=x,Y=y)\] \[= 1 \times 0 \times 0.02 + 1\times 1 \times 0.1 + 1\times 2 \times 0.05 +\] \[+ 2\times 0 \times 0.1 + 2 \times 1 \times 0.7 + 2\times 2 \times 0 + \] \[+ 3 \times 0 \times 0.03 + 3 \times 1 \times 0 + 3 \times 2 \times 0 = 1.6\]

  • \(cov(X,Y)=E[XY]-E[X]E[Y]=1.6-1.86\times 0.9=-0.074\)
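
The covariance can be recomputed from the joint matrix. In this sketch `outer(x_vals, y_vals)` builds the table of products \(xy\), which is then weighted by the joint probabilities; the object names are illustrative.

```r
# Completed joint table: rows X = 1, 2, 3; columns Y = 0, 1, 2
joint <- matrix(c(0.02, 0.10, 0.05,
                  0.10, 0.70, 0.00,
                  0.03, 0.00, 0.00),
                nrow = 3, byrow = TRUE)
x_vals <- 1:3
y_vals <- 0:2

e_x  <- sum(x_vals * rowSums(joint))          # E[X]  = 1.86
e_y  <- sum(y_vals * colSums(joint))          # E[Y]  = 0.90
e_xy <- sum(outer(x_vals, y_vals) * joint)    # E[XY] = 1.60

cov_xy <- e_xy - e_x * e_y
round(cov_xy, 4)  # -0.074
```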

Correlation coefficient

A caveat of the covariance is that its units depend directly on the units of \(X\) and \(Y\). The correlation coefficient allows us to express the relationship between \(X\) and \(Y\) without being affected by the units in which these r.v.s are measured.

\[\rho_{XY} = \frac{cov(X,Y)}{\sqrt{V[X]V[Y]}}=\frac{cov(X,Y)}{\sigma_X\sigma_Y}\]

It can be shown that \(\rho\in[-1,1]\). Note also that \(|\rho|=1\) if and only if \(P(Y=a+bX)=1\) for some \(a,b\in\mathbb{R}\) with \(b\neq 0\). If \(X\) and \(Y\) are independent r.v.s, then \(\rho=0\).

Correlation coefficient

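\[ \begin{array}{c|l} \text{Correlation coefficient} & \text{Correlation} \\ \hline |\rho| = 1 & \text{Perfect} \\ 0.8 \leq |\rho| < 1 & \text{Strong} \\ 0.5 \leq |\rho| < 0.8 & \text{Moderate} \\ 0.1 \leq |\rho| < 0.5 & \text{Weak} \\ 0 < |\rho| < 0.1 & \text{Very weak} \\ \rho= 0 & \text{None} \end{array} \]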

Write positive or negative in front of correlation if \(\rho>0\) or \(\rho<0\) respectively.

Example

  5. Based on the previous question, find the correlation coefficient between \(X\) and \(Y\).
  • From the marginal probability functions we obtain \(V[X]\) and \(V[Y]\): \[V[X]=0.1804\text{ and }V[Y]=0.19\]

  • Therefore, \[\rho = \frac{cov(X,Y)}{\sigma_X\sigma_Y}=\frac{-0.074}{\sqrt{0.1804}\sqrt{0.19}}=-0.3997\]

  • We observe a weak negative linear correlation between \(X\) and \(Y\).
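
The variances and the correlation coefficient can be reproduced from the joint matrix. A minimal sketch, with illustrative object names:

```r
# Completed joint table: rows X = 1, 2, 3; columns Y = 0, 1, 2
joint <- matrix(c(0.02, 0.10, 0.05,
                  0.10, 0.70, 0.00,
                  0.03, 0.00, 0.00),
                nrow = 3, byrow = TRUE)
x_vals <- 1:3
y_vals <- 0:2
f_x <- rowSums(joint)
f_y <- colSums(joint)

e_x <- sum(x_vals * f_x)
e_y <- sum(y_vals * f_y)

# Variances via E[X^2] - (E[X])^2, using the marginals
var_x <- sum(x_vals^2 * f_x) - e_x^2  # 0.1804
var_y <- sum(y_vals^2 * f_y) - e_y^2  # 0.19

cov_xy <- sum(outer(x_vals, y_vals) * joint) - e_x * e_y
rho <- cov_xy / sqrt(var_x * var_y)
round(rho, 4)  # -0.3997
```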
