Introduction to DS

December 2024 · Topic portal

Introduction

Notes covering basic data science topics.

Probability

Inference means drawing conclusions from data. We need probability because formal logic treats statements as always or never true, which rarely matches real-life situations. Natural language is too subjective to describe uncertainty, so we describe it mathematically through probability.

An experiment is something repeatable. The probability space consists of the sample space, event space, and probability measure.

  • Sample Space (Ω) is the set of all possible outcomes from one random experiment.
  • The event space is the set of all events possible, while an event is a set of outcomes.
  • A probability measure maps the event space to a number from 0 to 1.

A random variable is a function that maps the sample space to real numbers. Axioms are things we assume to be true.

Three Axioms of Probability:

  • Axiom 1: Probabilities are non-negative real numbers (greater than or equal to 0).

  • Axiom 2: The probability that an outcome is in the sample space (Omega) equals 1.

  • Axiom 3: For pairwise disjoint events, the probability of their union is the sum of their individual probabilities.

Probability Rules

  • Addition rule: p(A or B) = p(A) + p(B) - p(A and B).

  • Multiplication rule: If A and B are independent, p(A and B) = p(A) * p(B). If not independent, p(A and B) = p(A) * p(B|A).

Conditional probability: p(B | A) is the probability of B given A, calculated as p(B intersection A) / p(A). Conditioning shrinks the sample space to A: of all the probability assigned to A, we want the part where B also happens.
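
The shrinking-sample-space idea can be checked by direct counting. A minimal Python sketch with a hypothetical fair-die example (A = "roll is even", B = "roll greater than 3"):

```python
# Conditional probability by counting outcomes: p(B | A) = p(A and B) / p(A).
# Hypothetical example: one roll of a fair six-sided die.
outcomes = [1, 2, 3, 4, 5, 6]
A = {o for o in outcomes if o % 2 == 0}   # even rolls: {2, 4, 6}
B = {o for o in outcomes if o > 3}        # high rolls: {4, 5, 6}

p_A = len(A) / len(outcomes)              # 1/2
p_A_and_B = len(A & B) / len(outcomes)    # {4, 6} -> 1/3
p_B_given_A = p_A_and_B / p_A             # (1/3) / (1/2) = 2/3
```

Restricting to A leaves three equally likely outcomes, two of which are also in B, matching the 2/3 result.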

A normal distribution arises when 1) many events combine and 2) each event is independent of the others.

Moments characterize the shape of a distribution. They can be derived from the moment generating function when it converges (for some distributions it does not).

The moment generating function (MGF) of a random variable \(X\) is defined as:

\[M_X(t) = E[e^{tX}] = \int_{-\infty}^{\infty} e^{tx} f_X(x) dx\]

Example of a distribution without converging moments: the Cauchy distribution. Its expected value and variance are undefined, and higher raw moments are also undefined. It is characterized by fat tails, which becomes relevant when dealing with black swan events.

  • First (raw) moment: E[X], or the "balancing point". These are raw because they are taken about the origin.

  • Second (central) moment: variance. Central moments are calculated after subtracting the mean.

  • Third (central) moment: skewness, the expected cubed deviation. If it is zero, the distribution is symmetric; otherwise it is asymmetric.

  • Fourth (central) moment: kurtosis, which indicates heavy tails compared to a normal distribution, whose tails fall off exponentially fast.

Higher moments exist beyond these four, but these are the most common. For example, the Edgeworth series uses the 5th moment, sometimes called hyperskewness.

Hypothesis Testing

Normal distribution: Symmetric and fully described by the first two moments. It results from the combination of 1) many 2) independent events.

Weak Law of Large Numbers: For random sampling from independent and identically distributed (IID) variables with finite mean and variance (excluding black-swan cases like the Cauchy distribution), the probability that the sample mean deviates from the population mean by more than a small epsilon approaches zero as the sample size grows. This is a statement about probability only; individual deviations can still occur.

Central Limit Theorem (CLT): With repeated resampling, the distribution of sample means becomes normal as sample size increases, regardless of the underlying population, provided the underlying population has finite mean and variance. This phenomenon occurs due to the cancellation of deviations.

  • The CLT applies to the distribution of sample means, not the sample itself.
  • It begins to take effect as soon as the sample size increases.
  • Hence, 30 is not a magic number for a "big" sample size, although there is value going above a sample size of 30 as the standard error of the mean continues to decrease with larger sample sizes.
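
The two claims above (sample means become well-behaved and the SEM shrinks with the square root of n) are easy to see by simulation. A minimal sketch using a skewed exponential population; all numbers here are arbitrary:

```python
import random
import statistics

random.seed(0)

def sample_means(draw, n, reps=2000):
    """Distribution of the means of `reps` samples of size n."""
    return [statistics.mean(draw() for _ in range(n)) for _ in range(reps)]

# Skewed population with finite mean and variance (exponential, mean 1).
means_n5 = sample_means(lambda: random.expovariate(1.0), n=5)
means_n50 = sample_means(lambda: random.expovariate(1.0), n=50)

# The spread of the sample means is the SEM; it shrinks roughly as 1/sqrt(n).
sem_n5 = statistics.stdev(means_n5)
sem_n50 = statistics.stdev(means_n50)
```

Going from n = 5 to n = 50 shrinks the spread by roughly sqrt(10), with no special behaviour at n = 30.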

Random Sampling: Every member of the population has an equal chance of being chosen. The opposite is sampling bias.

Standard Error of the Mean (SEM): The standard deviation of the distribution of sample means, calculated as the population (or sample) standard deviation divided by the square root of the sample size. To cut the SEM by a factor k, the sample size must be increased by a factor k².

Practical Implications: In real-life scenarios, we typically have only one sample, not multiple samples of size n. The CLT is still relevant because it guarantees that deviations from the true mean shrink with the square root of the sample size, letting us quantify how close a single sample mean is likely to be to the population mean.

Common Misconception: The CLT doesn't suddenly converge at a sample size of 30. It begins converging as soon as the sample size increases, even from 1 to 2. The notion that 30 is sufficient isn't universally true, especially for certain unusual distributions.

Mean vs. Median: The mean minimizes the sum of squared (L2) distances to the points, while the median minimizes the sum of absolute (L1) distances.

Significance Testing

Real Experiments:

  • Create treatment and control conditions.
  • Randomly assign users to different treatments (Treatments = Independent variable).
  • Measure outcomes and analyze.

Randomization: The key ingredient that allows us to establish causality. Given a large number of users, randomization controls for pre-existing relationships, because we can assume an equal proportion of confounds in each treatment. This assumption is more plausible with a large sample size, which is another reason sample size matters.

A/B Tests: These apply experimental procedures to real-life situations, allowing for practical implementation of controlled experiments.

Statistical Significance:

Definition: Given that the null hypothesis is true, the observation is unlikely to be due to chance alone. This is a statement about data, not a conclusion. It means the probability of the data, assuming the null hypothesis is true, is below a chosen level (the "alpha" level). (Also invented by Fisher.)

Interpretation: It represents p(data | null hypothesis), not p(hypothesis | data). However, it can be inverted using Bayes' theorem and prior probabilities.

Caution: There's always a chance that sampling error alone could produce unlikely data. Hypothesis testing does not prove anything.

Null Hypothesis Significance Testing:

A method based on falsification for testing hypotheses. It assumes the opposite of what we want to show (the null hypothesis), collects data, and then either 1) the result is so unlikely under the null hypothesis that we reject it, supporting the alternative hypothesis, or 2) the result is plausible under the null hypothesis and we conclude nothing.

The p-value is the probability of observing data at least as extreme as ours, assuming the null hypothesis is true. For countable data, the binomial distribution or combinatorics can be used to find this probability.
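
For countable data this can be done directly with combinatorics. A sketch for a hypothetical coin experiment (one-tailed, counting outcomes at least as extreme):

```python
from math import comb

def binomial_p_value(k, n, p=0.5):
    """One-tailed p-value: probability of k or more successes in n trials,
    assuming the null-hypothesis success probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# 9 or more heads in 10 flips of a supposedly fair coin:
p = binomial_p_value(9, 10)   # (C(10,9) + C(10,10)) / 2^10 = 11/1024
```

With p around 0.011, this result would be declared significant at the conventional alpha of 0.05.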

Interesting Fact: The commonly used significance level of 0.05 (1 in 20 chance) originated from Fisher, who found it convenient in an era without calculators. Thus, the choice of alpha (significance level) is somewhat arbitrary.

The end result of a significance test is always a p-value.

Parametric Significance Tests

Parametric significance tests are a type of significance test: you reject the null hypothesis if the probability of the data under it is too small to be plausibly due to chance. These tests reduce the data to a distribution with a parameter, usually the mean.

Methodology:

  • Establish null hypothesis
  • Calculate test statistic
  • Determine likelihood of statistic under null hypothesis (p-value)

All tests measure how far the data is from the population mean in units of Standard Error of the Mean (SEM).

Misconception: Normalization doesn't necessarily make the data normally distributed. It's primarily used to compare variables with different units on the same scale.

Types of Errors:

Type 1 Error (False Positive): Concluding significance when there isn't any in reality. This could be due to sampling error.

Type 2 Error (False Negative): Failing to detect significance when it exists. This can occur if the sampling error is too large, and can be mitigated by increasing sample size.

Z-test:

Z = (sample mean - population mean) / SEM

SEM = population sd / √n

The Z-score is converted to a probability using the standard normal distribution (mean = 0, sd = 1). However, this test has limitations: it usually requires knowledge of population mean and SD, and relies on the Central Limit Theorem.
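
The calculation above can be sketched in a few lines, with the standard normal CDF built from the error function (the numbers are hypothetical):

```python
from math import erf, sqrt

def z_test(sample_mean, pop_mean, pop_sd, n):
    """Two-tailed z-test. Assumes the population mean and SD are known."""
    sem = pop_sd / sqrt(n)
    z = (sample_mean - pop_mean) / sem
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))   # standard normal CDF at |z|
    return z, 2 * (1 - cdf)                   # two-tailed p-value

z, p = z_test(sample_mean=103, pop_mean=100, pop_sd=15, n=100)
```

Here z = 3 / (15 / 10) = 2, giving a two-tailed p just under 0.05.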

Degrees of Freedom (DOF):

Misconception: DOF is not always n-1.

Why is it called this? The name comes from degrees of freedom in mechanics: how many ways an object can move.

DOF represents the number of independent pieces of information in the dataset. It's calculated as n - k, where k is the number of quantities calculated from the data.

Higher DOF means more evidence and a more stable estimate. DOF only increases if more independent samples are added, i.e. measurements, not calculations.

Marginals: The sums of rows or columns, i.e. totals. Each fixed marginal completely determines one value in the table.

DOF for a table: (r-1)*(c-1), where r and c count the category rows and columns; the marginal (sum) row and column are what remove the degrees of freedom.

(Student) T-test:

Used for small sample sizes and unknown population parameters. As DOF increases, the t-distribution approaches the z-distribution.

T statistic = (sample mean 1 - sample mean 2) / SEMpooled

Assumptions: mean is meaningful, data is normally distributed, homogeneity of variance.

Two versions:

  • T-test for 2 independent groups (DOF = n1 + n2 - 2)
  • T-test for 1 correlated group (DOF = N - 1)

Welch's t-test:

Preferred when homogeneity of variance can't be assumed. Like Student's t-test, but it doesn't pool variance; the two variances are modeled separately, which can lead to fractional DOF.
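
A sketch of the Welch statistic and its Welch-Satterthwaite DOF, which is where the fractional DOF comes from (the example data is made up):

```python
from math import sqrt
from statistics import mean, variance

def welch_t(x, y):
    """Welch's t statistic and DOF: variances are kept separate, not pooled."""
    vx, vy = variance(x) / len(x), variance(y) / len(y)
    t = (mean(x) - mean(y)) / sqrt(vx + vy)
    # Welch-Satterthwaite approximation; usually not an integer.
    dof = (vx + vy) ** 2 / (vx**2 / (len(x) - 1) + vy**2 / (len(y) - 1))
    return t, dof

t, dof = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
```

For these groups the DOF comes out near 5.9 rather than the n1 + n2 - 2 = 8 that pooling would give.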

ANOVA:

Extends t-test logic to more groups. Equivalent to multiple regression.

Non-parametric Significance Tests

Non-parametric tests are needed when it does not make sense to reduce data to sample means, such as with categorical data or ratings.

Types of Data:

  • Categorical: Sets with common attributes, only having nominality.
  • Ratings: Discrete but not categorical. They have nominality and ordinality, but not cardinality.

Properties of Numbers:

  • Nominality (values act as labels)
  • Cardinality (distances between values are meaningful)
  • Ordinality (values have a meaningful order)

Non-parametric tests typically require higher sample sizes / power but have fewer assumptions.

Chi-squared Test:

Used for categorical counts or frequencies. It compares observed data patterns with expected frequency counts under the null hypothesis.

Test statistic: the sum over categories of (observed - expected)² / expected count

Degrees of freedom (DOF) = number of categories - 1

Both the location and the spread of the chi-squared distribution increase with increasing DOF.
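
The statistic is simple enough to compute by hand. A sketch with a hypothetical die-fairness check:

```python
def chi_squared(observed, expected):
    """Chi-squared statistic: sum over categories of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical: 60 rolls of a die; the null hypothesis expects 10 per face.
stat = chi_squared([8, 9, 12, 11, 5, 15], [10] * 6)   # DOF = 6 - 1 = 5
```

The statistic (here 6.0) would then be compared against the chi-squared distribution with 5 DOF.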

Mann-Whitney U Test:

Also known as Mann-Whitney Wilcoxon test or ranksum test. It tests whether two samples come from populations with the same median.

Procedure: Rank all data from smallest to largest; sum of ranks for both datasets should be similar if from populations with same median.

Note: Still possible to have different distributions that happen to have the same median.

Kolmogorov-Smirnov (KS) Test:

Compares underlying distributions via the empirical Cumulative Distribution Function (CDF).

Procedure: Finds maximum distance between two distributions and compares to the null hypothesis that distributions are the same.

Kruskal-Wallis Test:

Non-parametric version of ANOVA for 3+ groups.

Permutation Test Approach:

For unusual cases, you can create your own test using this approach:

  1. Develop a statistic based on domain knowledge
  2. Sample without replacement
  3. Calculate the statistic for all groups
  4. Determine the p-value using statistic of original group

Note: Results should be robust to details of the test statistic.
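
The four steps above can be sketched directly; here the chosen statistic is the absolute difference in group means, and the data is made up:

```python
import random
from statistics import mean

random.seed(42)

def permutation_p_value(a, b, reps=5000):
    """Two-sided permutation test on the difference in group means."""
    observed = abs(mean(a) - mean(b))
    pooled = a + b
    hits = 0
    for _ in range(reps):
        random.shuffle(pooled)                 # relabel groups without replacement
        pa, pb = pooled[:len(a)], pooled[len(a):]
        if abs(mean(pa) - mean(pb)) >= observed:
            hits += 1
    return hits / reps

p = permutation_p_value([12, 14, 15, 16, 18], [1, 2, 3, 4, 5])
```

The p-value is the fraction of random relabelings that produce a difference at least as large as the observed one.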

Problems with p-values

Decisions based solely on statistical significance can be misleading. P-values merely indicate statistical significance, which only says that, assuming the null hypothesis (and a known distribution of the data under it), the data is unlikely.

Contextualizing p-values:

P-values need to be considered alongside:

  • Effect Size
  • Statistical Power
  • Confidence

Replicability Crisis: Many results fail to replicate. The root cause is lack of statistical power; the proximal causes are p-hacking and the incentive to publish statistically significant results.

P-hacking:

  • Flexible stopping: Adding observations until the significance threshold is reached, leading to alpha inflation. P-values then tend to sit just below the alpha level, because we stop as soon as significance is reached; adding more data points might push the result back above the threshold.
  • HARKing (Hypothesis After Results are Known): Developing hypotheses based on which variables show significance.
  • Removal of outliers: Should only be done if the outlier is known to be erroneous.

Effect Size:

Measures practical significance: the magnitude of the real difference between populations. For example, Cohen's d = (pop 1 mean - pop 2 mean) / population SD. (The denominator is the SD, not the SEM, i.e. an implied sample size of 1.)

Typical effect sizes in psychology are around 0.2 to 0.25.

Effect size is independent of sample size, but the ability to detect it relates to sample size through statistical power.
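
A sketch of Cohen's d using a pooled standard deviation (one common variant; the data is made up):

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(x, y):
    """Cohen's d: mean difference in units of the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / sqrt(pooled_var)

d = cohens_d([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
```

Note that d would stay the same for larger samples with the same means and SDs: effect size is independent of sample size.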

Power and Confidence

Statistical Power

Definition: The probability of detecting an effect of a given size if the effect exists. Mathematically, it's expressed as 1 - β, where β is the probability of making a Type II error.

Typically, a power of at least 0.8 (β = 0.2) is desired. Power is implicitly set by your sample size and effect size.

Power analysis tells you what sample size is large enough, if you know what effect size is likely.

Importance of Power:

  • Determines adequate sample size
  • Allows absence of evidence to become evidence of absence
  • Helps prevent real effects from being overwhelmed by sampling error

Factors Affecting Power

  • Alpha level: Higher alpha increases power, since more results cross the threshold and H1 is accepted more often
  • Sample size: Larger samples increase power non-linearly (√n relationship)
  • Effect size: Larger effects increase power due to greater distance between means. (easier to spot difference).
  • Type of test: Paired t-test (everyone is their own control, which removes variability, so a better chance of finding the effect) > independent t-test > other parametric tests (if the t-test's assumptions are met, it is powerful) > non-parametric tests (there is a lot of information in the distances between points that parametric tests can use and non-parametric tests discard).

Power Curve: TODO: INSERT.

In psychology, real effects are typically around 0.2. The rate of significant findings should be 10%-30%, but is often reported as 95% in practice[7].

Legitimate way to increase power: Using a one-tailed test increases power by concentrating all of alpha in one tail.

A Priori vs. Post Hoc Power: Only a priori power analysis is considered valid. Post-hoc (observed) power can be misleading due to potential discrepancies between true and sample effect sizes.

Confidence Intervals

Definition: A range that likely contains the true population parameter.

Confidence intervals visualize sampling error. A 95% CI means that if the study were repeated many times, 95% of the computed intervals would contain the true parameter, assuming only sampling error.

Calculation: CI can be calculated using z*SEM if the Central Limit Theorem (CLT) holds. If CLT cannot be proven, bootstrap methods are used.

Absence of evidence corresponds to a very wide CI. With higher statistical power the CI shrinks, which is what turns absence of evidence into evidence of absence.
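
A minimal z*SEM sketch, assuming the CLT applies (z = 1.96 for 95%; the data is made up):

```python
from math import sqrt
from statistics import mean, stdev

def ci_95(sample):
    """95% confidence interval for the mean: mean +/- 1.96 * SEM."""
    sem = stdev(sample) / sqrt(len(sample))
    m = mean(sample)
    return m - 1.96 * sem, m + 1.96 * sem

lo, hi = ci_95([4, 5, 6, 5, 4, 6, 5, 5])   # hypothetical measurements
```

A larger sample shrinks the SEM and therefore the width of the interval.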

Bootstrap

A resampling with replacement technique used to estimate confidence intervals when normal distribution assumptions don't hold.

Caution: Bootstrap can be dangerous with small samples as it may overrepresent parameters. It's only effective if the entire sample is representative of the population[6].

Limitation: Bootstrapping provides more numbers but not more data, and does not increase degrees of freedom.
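
A percentile-bootstrap sketch for a CI on the mean (toy data; real use needs a sample that represents the population):

```python
import random
from statistics import mean

random.seed(0)

def bootstrap_ci(sample, reps=5000, alpha=0.05):
    """Percentile bootstrap CI for the mean: resample WITH replacement."""
    boot_means = sorted(
        mean(random.choices(sample, k=len(sample))) for _ in range(reps)
    )
    return (boot_means[int(reps * alpha / 2)],
            boot_means[int(reps * (1 - alpha / 2)) - 1])

lo, hi = bootstrap_ci([2, 4, 4, 5, 7, 9, 3, 6, 5, 5])
```

No normality assumption is used anywhere; the interval comes straight from the empirical distribution of resampled means.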

Bayesian Inference

So far we have been taking a frequentist approach.

Bayesian vs. Frequentist Approach

Frequentist probability is based on the long-run frequency of outcomes, while Bayesian probability represents a degree of belief. Bayesian inference allows for updating prior beliefs with new information, which can be advantageous when prior beliefs are well-founded.

If prior belief is certain, ie with probability of 1 or 0, no evidence will change the posterior.

\[p(A | B) = \frac{p(B | A) \cdot p(A)}{p(B)}\]

Likelihood = probability of data, \(p(B | A)\)

Prior = prior belief, \(p(A)\); Evidence (marginal) = \(p(B)\)

Posterior = updated belief, \(p(A | B)\)
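
A worked sketch of the update, using a made-up diagnostic-test example (all numbers hypothetical):

```python
def posterior(prior, likelihood, evidence):
    """Bayes' theorem: p(A | B) = p(B | A) * p(A) / p(B)."""
    return likelihood * prior / evidence

# Hypothetical test: 1% prevalence, 95% sensitivity, 5% false-positive rate.
p_disease = 0.01
p_pos_given_disease = 0.95
p_pos = p_pos_given_disease * p_disease + 0.05 * (1 - p_disease)   # p(B)
p_disease_given_pos = posterior(p_disease, p_pos_given_disease, p_pos)
```

Despite the accurate test, the posterior is only about 16%, because the prior is low; this is the p(data | hypothesis) vs. p(hypothesis | data) distinction from earlier.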

Bayes' Factor Illustration

Now we cover the Bayesian versions of hypothesis testing, power, and confidence intervals.

Bayesian Hypothesis Testing

Bayesian hypothesis testing allows for direct comparison of specific hypotheses, unlike the traditional null hypothesis approach. The Bayes factor quantifies the relative evidence for one hypothesis over another, calculated as the ratio of likelihoods under one hypothesis over another.

A Bayes factor of 100 means p(d | h1) is 100 times larger than p(d | h0).

Bayesian Power Analysis

The Positive Predictive Value (PPV) represents the post-study probability that a significant result is true (p(effect true | significant) = p(True) * p(sig | true) / p(sig) ):

\[= \frac{\frac{\text{true}}{\text{total}} \cdot \text{power}}{\text{power} \cdot \left(\frac{\text{true}}{\text{total}}\right) + \alpha \cdot \left(\frac{\text{false}}{\text{total}}\right)}\]

This formula highlights that the post-study probability depends on power and the prior probability of a true effect, and that alpha is not (necessarily) equal to the false positive rate among significant findings.

You need higher power (ability to detect an effect) to get a higher posterior PPV: a higher ability to detect an effect if present means a higher post-study probability that a significant result is true. So if you are studying far-fetched effects, make sure you have enough power. Pre-study odds can be taken from the literature.
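
The PPV formula can be sketched directly. The two calls contrast a well-powered study of a plausible effect with an underpowered study of a far-fetched one (the priors and powers are made up):

```python
def ppv(prior, power, alpha=0.05):
    """Post-study probability that a significant result reflects a true effect."""
    return (prior * power) / (prior * power + alpha * (1 - prior))

high = ppv(prior=0.5, power=0.8)   # ~0.94
low = ppv(prior=0.1, power=0.2)    # ~0.31
```

In the second case, most "significant" results would be false positives even though alpha is 0.05.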

Bayesian Credible Intervals

In Bayesian statistics, parameters are treated as random variables with distributions corresponding to our prior beliefs. As sample size increases, the likelihood dominates the prior, converging to frequentist results with infinite data.

Hence with lots of data (high power), just use the frequentist approach; with little data but a strong prior, use the Bayesian approach.

Machine Learning - Regression

Types of Machine Learning

  • Supervised Learning: Regression and Classification
  • Unsupervised Learning: Dimension Reduction and Clustering
  • Reinforcement Learning

Linear Regression

Regression models correlations, not causation. It minimizes the summed squared distance (L2 norm) between predictions and actual values.

Don't interpret regression causally; there could be confounders that contribute to both variables. Regression is purely correlation.

Why is it called regression? Before it, the rule of three was used; Galton realised that outcomes "regress" to the mean, hence the name.

Beta coefficient:

  • Scaling factor making independent variable closest to dependent variable.
  • Interpreted as the change in standard deviations of DV for one standard deviation change in IV
  • Calculated as $$ \beta = (X^TX)^{-1}X^Ty $$
  • Equivalent to projecting the data points onto a subspace A (a line, plane, etc. representing the regression model) and finding the A that minimises the distance to all points. Hence beta can be derived with linear algebra.

Residual: the distance between the prediction and the actual value. (Note: not the perpendicular distance, but the vertical distance to the prediction line.)
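
For a single predictor, the normal equation \( \beta = (X^TX)^{-1}X^Ty \) reduces to a ratio of sums, which keeps a sketch dependency-free (the data is made up):

```python
from statistics import mean

def simple_ols(x, y):
    """Least-squares slope and intercept: the one-predictor case of
    beta = (X^T X)^{-1} X^T y."""
    mx, my = mean(x), mean(y)
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    return slope, my - slope * mx

xs, ys = [1, 2, 3, 4, 5], [2, 4, 6, 8, 10]        # y = 2x exactly
slope, intercept = simple_ols(xs, ys)
residuals = [y - (intercept + slope * x)          # vertical distances
             for x, y in zip(xs, ys)]
```

Because the toy data is perfectly linear, every residual is zero; real data leaves nonzero residuals to minimise.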

Multiple Regression

Allows comparison of multiple variables by accounting for confounds. It can help control for variables when experiments are not possible, but data is available on known confounds.

Model Evaluation

Coefficient of Determination (R²): Proportion of variance explained by the model

$$ R^2 = \frac{\text{Explained Variance}}{\text{Total Variance}} $$

R: Correlation between predicted and actual values.

Summing the correlation coefficients of all features won't add up to the explained variance, because a covariance factor (the relationship between the features) is subtracted.

The best guess without taking any variable into account is the mean; regression asks how much better we can do than that. Using the mean gives the total variance.

Confound: something else that links IV and DV.

Plotting multiple regression: Plotting predicted vs. actual values helps visualize model performance (R²).

Machine Learning - Regularization

Regularization aims to create models that not only account for maximum variance but are also as simple as possible.

Problems with Multiple Regression

  • Multicollinearity: Predictors are correlated with each other. We want predictors that are uncorrelated with each other but each correlated with the outcome. Solution: use fewer predictors or regularization. (Geometric intuition: if two predictor vectors are nearly collinear, they collapse from spanning a plane to roughly a line; many different planes can be drawn through a line, so the projection, and hence the coefficients, become wobbly and unreliable: a lot of extra variance for little gain.)
  • Curse of dimensionality: Many predictors need many parameters, so coverage (having some data in every possible combination of categories) becomes an issue. Goes away with lots of data.
  • Overfitting: A model that explains more variance in the sample is not necessarily better. With enough parameters any dataset can be fit perfectly, but every data point is a true value plus noise, so such a model won't generalise. Cross-validate and check RMSE on a held-out test set (unlike hypothesis testing, where you use all the data). Overfitting doesn't go away with more data.

One-hot Encoding

Converts categorical variables into separate boolean columns. To avoid the dummy variable trap, one category should be removed to prevent collinearity.

Dummy Variable Trap: Using all categories from one-hot encoding.

Bias-Variance Tradeoff

Bias: Model is too simple to capture the relationship (underfitting).

Variance: Model doesn't generalize well (overfitting). Not related to the earlier meaning of variance; here it means the fitted model varies a lot from sample to sample.

A middle ground is needed: linear regression pushes toward high variance and low bias, while regularization pushes it the other way.

Regularization Techniques

Ridge Regression: Adds λ * (summation of β²), a squared L2 penalty, to the minimization term, introducing some bias in exchange for a better fit to the population.

Lasso Regression: Uses L1 norm as penalty, which can reduce some coefficients to zero.
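
In one dimension (no intercept) the ridge solution has a closed form that makes the shrinkage visible. A sketch with toy numbers, using the common λ * Σβ² penalty:

```python
def ridge_slope(x, y, lam):
    """1-D ridge regression without intercept: the scalar case of
    beta = (X^T X + lambda * I)^{-1} X^T y. lam = 0 gives plain OLS."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)

ols = ridge_slope([1, 2, 3], [2, 4, 6], lam=0)     # 2.0: plain least squares
shrunk = ridge_slope([1, 2, 3], [2, 4, 6], lam=7)  # pulled toward zero
```

The penalty in the denominator shrinks the coefficient toward zero: that shrinkage is the added bias, bought in exchange for lower variance.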

Machine Learning - Classification

Classification aims to find boundaries that maximize distance between classes, unlike regression which minimizes distance from boundary to points.

Logistic Regression

Maps continuous inputs to binary outcomes using a nonlinear function (the sigmoid). It models the log of the odds (the logit) of the two classes, and predicts via the inverse of the logit, which is the sigmoid. A threshold must then be chosen above which we pick the class.

Computes the odds of the two classes: the probability of the event happening over the probability of it not happening.

Varying beta 0: the inflection point shifts further left as beta 0 increases.

Varying beta 1: the cutoff gets steeper as beta 1 increases.
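
A minimal sketch of the sigmoid and a thresholded prediction. The inflection point sits at x = -beta0/beta1, so for positive beta1 a larger beta0 shifts it left, while a larger beta1 steepens the cutoff (the coefficients below are made up):

```python
from math import exp

def sigmoid(z):
    """Inverse of the logit: maps log-odds to a probability in (0, 1)."""
    return 1 / (1 + exp(-z))

def predict(x, beta0, beta1, threshold=0.5):
    """Class probability and thresholded class label."""
    p = sigmoid(beta0 + beta1 * x)       # inflection at x = -beta0 / beta1
    return p, int(p >= threshold)

p, label = predict(2, beta0=-1, beta1=1)   # sigmoid(1) ~ 0.73 -> class 1
```

The output is always in (0, 1), avoiding the unbounded-line problem listed below.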

Problems with regression on classification:

  • Residuals are not normal, which linear regression requires.
  • RMSE is too high.
  • Most predictions are nonsensical.
  • Regression is not bounded.
  • A line implies a constant increase, but in reality there is a small range where an increase matters a lot for the class; before or after that range it barely matters.

ROC Curve

A metric for classification performance that plots True Positive Rate against False Positive Rate. The Area Under ROC (AUROC) quantifies overall performance, with 0.5 indicating random guessing.

Why is accuracy bad for classification? Class imbalance: always predicting the most common class already achieves that class's proportion as accuracy.

It is called ROC (receiver operating characteristic) because the technique originated in WWII radar analysis.

The threshold does not change the AUROC value; it only picks a point on the curve, trading true positives against false positives.
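
AUROC has a useful equivalent definition: the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counting half. That form is easy to sketch (the scores are made up):

```python
def auroc(pos_scores, neg_scores):
    """Probability a random positive outranks a random negative; ties count 0.5.
    0.5 = random guessing, 1.0 = perfect ranking."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

perfect = auroc([0.9, 0.8, 0.7], [0.3, 0.2, 0.1])   # 1.0
chance = auroc([0.5, 0.5], [0.5, 0.5])              # 0.5
```

No threshold appears anywhere in the computation, which is exactly why moving the threshold cannot change AUROC.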

Support Vector Machines (SVM)

Finds the line/plane that maximizes the margin between groups. Soft-margin SVM allows some misclassification with a penalty, unlike hard-margin SVM.

Machine Learning - Unsupervised Learning

Conceptually, this is using distance to yield labels.

Dimensionality Reduction

Takes advantage of correlation between columns: we create new columns that represent only the independent components. (Note: DOF refers to independent rows, not columns.) This is not feature selection; we are creating new columns.

Principal Component Analysis (PCA): Find an orthonormal basis, project the data onto it, and keep only the components with the largest eigenvalues; these preserve most of the variance. In other words, we keep projecting onto the directions with the most variance. Variables should be standardized (z-scored) before PCA.

Scree plot: a plot of the eigenvalues in decreasing order.
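
For two standardized variables the correlation matrix is [[1, r], [r, 1]], whose eigenvalues are 1 + r and 1 - r, so a two-variable scree plot can be sketched by hand (toy data):

```python
from statistics import mean, pstdev

def scree_2d(xs, ys):
    """Eigenvalues (descending) of the 2x2 correlation matrix of z-scored data:
    the values a scree plot would show for two variables."""
    zx = [(x - mean(xs)) / pstdev(xs) for x in xs]
    zy = [(y - mean(ys)) / pstdev(ys) for y in ys]
    r = sum(a * b for a, b in zip(zx, zy)) / len(zx)   # Pearson correlation
    return 1 + abs(r), 1 - abs(r)   # eigenvalues of [[1, r], [r, 1]]

big, small = scree_2d([1, 2, 3, 4], [2, 4, 6, 8])   # perfectly correlated
```

With perfect correlation the first component carries all the variance (eigenvalues 2 and 0), so one new column preserves everything: the extreme case of what PCA exploits.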

Clustering

K-means: Partitions data into K clusters.

Description in machine learning page.

The optimal K can be chosen using the elbow method or silhouette method.

Elbow Method: See Machine learning page.

Silhouette Method: Calculates a score for each datapoint based on its distance to its own cluster versus other clusters. The K with the highest average silhouette score is chosen.

  • For each datapoint, calculate the silhouette coefficient = ((average distance to points in the nearest other cluster) - (average distance to points in its own cluster)) / max(of those two). A bigger value means a better classification; a negative value suggests misclassification.
  • Sum the silhouette coefficients over all data points. The maximum possible sum is the number of datapoints, since each coefficient is at most 1.
  • Compare sum for different K, choose one with max score.

Error always goes down with increased K!

HDBSCAN: Cluster by repeatedly identifying points as core, edge, or anomaly points.