Dimensionality reduction and PCA

PCA is an unsupervised learning method: it uses no label information. It is typically used to explore data and understand its patterns, and as a preparation step for clustering analysis.

The objective of clustering is to find patterns and relationships within a dataset, such as:
- Geometric (feature distances)
- Connectivity (spectral clustering, graphs)

However, the original dimensionality, \(\mathbb{R}^n\), can be large! The goal is to reduce it to \(k\) dimensions, with \(k \ll n\).

Dimensionality reduction is the process of reducing the number of random variables under consideration:
- Combine, transform, or select variables
- Can use linear or nonlinear operations
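
To make the difference between selecting and combining variables concrete, here is a minimal sketch on a made-up array. The matrix `P` is a hypothetical projection chosen only for illustration; PCA is one principled way of choosing such a matrix.

```python
import numpy as np

X = np.arange(12, dtype=float).reshape(4, 3)   # 4 samples, n = 3 variables

# Select: keep a subset of the original variables.
X_selected = X[:, [0, 2]]

# Combine: each new variable is a linear combination of the originals.
P = np.array([[0.5, 0.0],
              [0.5, 0.0],
              [0.0, 1.0]])                      # hypothetical 3 x 2 projection matrix
X_combined = X @ P
```

Both results have k = 2 columns, but only the second mixes information from all three original variables.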

PCA, or principal component analysis, is one common way to perform dimensionality reduction. Dimensionality reduction has many uses in data science:
- Visualizing, exploring, and understanding the data
- Extracting "features" and dominant modes
- Cleaning data
- Speeding up subsequent learning tasks
- Building simpler models later

It has many practical applications as well, such as image/audio compression, face recognition, and natural language processing (latent semantic analysis) to name a few!

PCA, by hand

Step 1: Normalize

Normalize (sometimes called scaling and/or standardizing) your data, especially if the features are measured on different scales.
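
A minimal sketch of this step, assuming the data sit in a NumPy array with one row per sample and one column per feature. The helper name `standardize` and the toy values are illustrative and not part of the original analysis.

```python
import numpy as np

def standardize(X):
    """Center each column and scale it to unit standard deviation."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)          # population standard deviation (ddof=0)
    sigma[sigma == 0] = 1.0        # leave constant columns unscaled
    return (X - mu) / sigma

# Toy data: two features on very different scales (made-up values).
X = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [3.0, 2000.0]])
Z = standardize(X)
```

After this transformation every column has mean 0 and standard deviation 1, so no single feature dominates the covariance matrix simply because of its units.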

Step 2: Estimate the mean and covariance matrix from the data

The covariance matrix, C, captures the variability of the data points along different directions. Its off-diagonal entries capture the covariance (and hence the correlation) between different coordinates (the features in the x vector).
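
As a sketch of Step 2, the function below estimates the mean and covariance from a data matrix with one row per sample. The function name and the random test data are assumptions for illustration. The 1/m normalization matches the convention used later in this post.

```python
import numpy as np

def mean_and_covariance(X):
    """Return the feature means and the (d x d) covariance matrix of X.

    X has one row per sample (m rows) and one column per feature (d columns).
    """
    X = np.asarray(X, dtype=float)
    m = X.shape[0]
    mu = X.mean(axis=0)
    Xc = X - mu                    # centered data
    C = Xc.T @ Xc / m              # divide by m (not m - 1)
    return mu, C

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))       # synthetic data, 50 samples and 3 features
mu, C = mean_and_covariance(X)
```

The result should agree with `np.cov(X, rowvar=False, bias=True)`, which uses the same 1/m normalization.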

Step 3: Take the K eigenvectors of C that correspond to the K largest eigenvalues

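Use a solver for this part, such as the eigs function from scipy.sparse.linalg. Because C is symmetric, the eigsh function from the same module is also applicable and returns real values directly. In this case, I used K=2 to reduce the data to 2 dimensions.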

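C = UΛU^T

Here C is the d×d covariance matrix (d is the number of features), the columns of U are the eigenvectors of C, and Λ is the diagonal matrix of the corresponding eigenvalues.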
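
For a small dense covariance matrix, Step 3 can also be sketched with `np.linalg.eigh`, which is intended for symmetric matrices. This is an alternative to the sparse solver named above, not the code used in this post, and the helper name and example matrix are made up.

```python
import numpy as np

def top_k_eigen(C, k):
    """Return the k largest eigenvalues of symmetric C and their eigenvectors."""
    eigvals, eigvecs = np.linalg.eigh(C)      # ascending order for symmetric C
    order = np.argsort(eigvals)[::-1][:k]     # indices of the k largest
    return eigvals[order], eigvecs[:, order]

# Small symmetric example matrix (made-up values).
C = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 0.0],
              [0.0, 0.0, 0.5]])
lam, U = top_k_eigen(C, k=2)
```

Each column of `U` satisfies `C @ u = lam * u`, and the columns are orthonormal.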

Step 4: Compute the reduced representation of a data point

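For each retained eigenvector \(w_i\) with eigenvalue \(\lambda_i\), the reduced coordinate of a data point x with mean \(\mu\) is \(z_i = w_i^T (x - \mu) / \sqrt{\lambda_i}\), which is the form used in the code later in this post.

Key Concept: PCA is a linear dimensionality reduction scheme, because each principal component \(z_i\) is a linear transformation of x.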
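
A minimal sketch of Step 4 on synthetic data, chaining the earlier steps together. It assumes the same 1/sqrt(eigenvalue) scaling that the notebook code further down applies; the function name `pca_scores` is illustrative.

```python
import numpy as np

def pca_scores(X, k):
    """Project rows of X onto the top-k principal directions, scaled by 1/sqrt(eigenvalue)."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    Xc = X - mu
    C = Xc.T @ Xc / X.shape[0]
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1][:k]
    lam, W = eigvals[order], eigvecs[:, order]
    return (Xc @ W) / np.sqrt(lam)     # column i holds z_i = w_i^T (x - mu) / sqrt(lambda_i)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # synthetic correlated data
Z = pca_scores(X, k=2)
```

With this scaling the reduced coordinates are uncorrelated and each has unit variance, which is why the two axes of the scatter plots are directly comparable.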

Food consumption in European countries

This analysis uses PCA to compare European countries' consumption of various foods and drinks.

Libraries

import numpy as np
import pandas as pd
import math
import matplotlib.pyplot as plt
import scipy.io as spio
import scipy.sparse.linalg as ll
import sklearn.preprocessing as skpp

Load and Prep Data

# Import CSV and save it in different dataframes
food_consumption = pd.read_csv("data/food-consumption.csv")
display(food_consumption)

# pull out country list
countries = food_consumption['Country']
#display(countries)

# remove country column, convert to array
Anew = food_consumption.drop(columns=['Country']).to_numpy()
#display(Anew)

# get shape of matrix
m, n = Anew.shape
Country Real coffee Instant coffee Tea Sweetener Biscuits Powder soup Tin soup Potatoes Frozen fish ... Apples Oranges Tinned fruit Jam Garlic Butter Margarine Olive oil Yoghurt Crisp bread
0 Germany 90 49 88 19 57 51 19 21 27 ... 81 75 44 71 22 91 85 74 30 26
1 Italy 82 10 60 2 55 41 3 2 4 ... 67 71 9 46 80 66 24 94 5 18
2 France 88 42 63 4 76 53 11 23 11 ... 87 84 40 45 88 94 47 36 57 3
3 Holland 96 62 98 32 62 67 43 7 14 ... 83 89 61 81 15 31 97 13 53 15
4 Belgium 94 38 48 11 74 37 23 9 13 ... 76 76 42 57 29 84 80 83 20 5
5 Luxembourg 97 61 86 28 79 73 12 7 26 ... 85 94 83 20 91 94 94 84 31 24
6 England 27 86 99 22 91 55 76 17 20 ... 76 68 89 91 11 95 94 57 11 28
7 Portugal 72 26 77 2 22 34 1 5 20 ... 22 51 8 16 89 65 78 92 6 9
8 Austria 55 31 61 15 29 33 1 5 15 ... 49 42 14 41 51 51 72 28 13 11
9 Switzerland 73 72 85 25 31 69 10 17 19 ... 79 70 46 61 64 82 48 61 48 30
10 Sweden 97 13 93 31 61 43 43 39 54 ... 56 78 53 75 9 68 32 48 2 93
11 Denmark 96 17 92 35 66 32 17 11 51 ... 81 72 50 64 11 92 91 30 11 34
12 Norway 92 17 83 13 62 51 4 17 30 ... 61 72 34 51 11 63 94 28 2 62
13 Finland 98 12 84 20 64 27 10 8 18 ... 50 57 22 37 15 96 94 17 21 64
14 Spain 70 40 40 18 62 43 2 14 23 ... 59 77 30 38 86 44 51 91 16 13
15 Ireland 30 52 99 11 80 75 18 2 5 ... 57 52 46 89 5 97 25 31 3 9

16 rows × 21 columns

Normalize data

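In this case, we normalize the data because the features have very different ranges. Each feature (column) is scaled to unit standard deviation; centering happens in the PCA step.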

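# scale each feature (column) to unit standard deviation
stdA = np.std(Anew,axis = 0)
Anew = Anew @ np.diag(np.ones(stdA.shape[0])/stdA)
# transpose so rows are features and columns are countries
Anew = Anew.T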

#display(pd.DataFrame(Anew))
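
The diagonal-matrix product used above scales every column by its own standard deviation. As a sanity check, the sketch below uses the first three countries and the first three columns of the table as a small sample. It suggests that plain NumPy broadcasting gives the same result; the variable names are illustrative.

```python
import numpy as np

# Real coffee, Instant coffee, Tea for Germany, Italy, France (from the table above).
A = np.array([[90.0, 49.0, 88.0],
              [82.0, 10.0, 60.0],
              [88.0, 42.0, 63.0]])

std = A.std(axis=0)
scaled_matmul = A @ np.diag(1.0 / std)   # approach used in the notebook
scaled_broadcast = A / std               # equivalent broadcasting form
```

Both arrays hold the same values, and every column of the result has unit standard deviation.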

PCA

Here, we extract the first two dimensions following the steps listed above.

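# center each feature (row) on its mean
mu = np.mean(Anew,axis = 1)
xc = Anew - mu[:,None]

# covariance matrix (features x features), normalized by the number of countries m
C = np.dot(xc,xc.T)/m

# top K eigenpairs; eigs returns complex arrays, and C is symmetric, so only the real parts are kept
K=2
S,W = ll.eigs(C,k = K)
S = S.real
W = W.real

# project the centered data onto each eigenvector, scaled by 1/sqrt(eigenvalue)
dim1 = np.dot(W[:,0].T,xc)/math.sqrt(S[0]) # scores on the 1st principal component
dim2 = np.dot(W[:,1].T,xc)/math.sqrt(S[1]) # scores on the 2nd principal component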
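
One way to judge how much information is lost by keeping only K=2 dimensions is the share of total variance carried by the retained eigenvalues. The sketch below is an addition rather than part of the original notebook; the helper name and the example matrix are made up.

```python
import numpy as np

def explained_variance_ratio(C, k):
    """Fraction of total variance captured by each of the k largest eigenvalues of C."""
    eigvals = np.linalg.eigvalsh(C)[::-1]   # eigvalsh is ascending; reverse to descending
    return eigvals[:k] / eigvals.sum()

# Small symmetric example matrix (made-up values, eigenvalues 3, 1 and 0.5).
C = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 0.0],
              [0.0, 0.0, 0.5]])
ratio = explained_variance_ratio(C, k=2)
```

With the notebook's variables, the same quantity could be obtained as `S / np.trace(C)`, since the sum of all eigenvalues equals the trace of C.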

Plot 1

The locations of the countries seem logical (countries close to one another are likely to consume similar foods): the Nordic countries are grouped together, as are the countries in the south and southwest (Portugal, Spain, Italy), likely reflecting a strong Mediterranean influence. Additionally, countries in central Europe are grouped (Germany, Belgium, Switzerland).

plt.figure(figsize=(7, 7))
plt.xlabel("dim1")
plt.ylabel("dim2")
plt.plot(dim1, dim2, 'r.')
plt.axvline(0, linewidth=0.5)
plt.axhline(0, linewidth=0.5)

# add labels
for i in range(0, countries.shape[0]):
    plt.text(dim1[i], dim2[i] + 0.01, countries[i])
plt.show()

Check correlations to dim1 and dim2

Dimension 1 (PC 1) is highly correlated with consumption of tea, sweetener, tin soup, frozen veggies, and tinned fruit. Dimension 2 is positively correlated with consumption of instant coffee and powder soup, and negatively correlated with consumption of frozen fish and crisp bread.

# add dim1 and dim2
food_consumption['dim1'] = dim1.tolist()
food_consumption['dim2'] = dim2.tolist()
display(food_consumption.drop(columns=['Country']).corr(method='pearson').tail(2).style.format('{0:,.2f}'))
Real coffee Instant coffee Tea Sweetener Biscuits Powder soup Tin soup Potatoes Frozen fish Frozen veggies Apples Oranges Tinned fruit Jam Garlic Butter Margarine Olive oil Yoghurt Crisp bread dim1 dim2
dim1 0.08 0.43 0.70 0.78 0.58 0.43 0.77 0.51 0.51 0.75 0.61 0.51 0.88 0.68 -0.61 0.29 0.30 -0.38 0.21 0.43 1.00 0.00
dim2 -0.39 0.78 -0.09 -0.23 0.30 0.68 0.13 -0.36 -0.72 -0.54 0.48 0.26 0.35 0.13 0.36 0.13 -0.08 0.21 0.55 -0.76 0.00 1.00
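
The correlations in this table are tied directly to the eigenvectors: for unit-variance scores, the Pearson correlation between feature j and component i works out to sqrt(λ_i) · w_ji / σ_j, where σ_j is the feature's standard deviation. The sketch below checks this on synthetic data, not on the food table; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(16, 4)) @ rng.normal(size=(4, 4))   # synthetic stand-in, 16 samples x 4 features
Xc = X - X.mean(axis=0)
C = Xc.T @ Xc / X.shape[0]

eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1][:2]
Z = Xc @ eigvecs[:, order] / np.sqrt(eigvals[order])     # unit-variance scores

# Pearson correlation of every feature with each of the two score columns.
full = np.corrcoef(np.hstack([X, Z]), rowvar=False)      # (4 + 2) x (4 + 2)
feature_vs_pc = full[:4, 4:]                             # shape (4, 2)
pc_vs_pc = full[4:, 4:]                                  # 2 x 2 block

# Closed-form loadings implied by the eigen-decomposition.
loadings = eigvecs[:, order] * np.sqrt(eigvals[order]) / X.std(axis=0)[:, None]
```

The 2 x 2 block `pc_vs_pc` is the identity, which matches the 1.00 and 0.00 entries in the last two columns of the table.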

Flipping the Analysis

In the analysis above, we looked at which countries were similar to one another based on their food consumption. In the next analysis, we will look at how foods are related to one another based on the countries where they are eaten.

If part 1 was comparing culinary cultures by country, this next analysis can be thought of as comparing how common (or uncommon) food pairings are.

First, let’s treat this as a completely new problem and reset our notebook.

%reset -f

Libraries

import numpy as np
import pandas as pd
import math
import matplotlib.pyplot as plt
import scipy.io as spio
import scipy.sparse.linalg as ll
import sklearn.preprocessing as skpp

Load and Prep Data

# Import CSV and save it in different dataframes
food_consumption = pd.read_csv("data/food-consumption.csv")

# transpose df
food_consumption = food_consumption.set_index('Country').T.rename_axis(index='Food Item', columns=None).reset_index()
display(food_consumption)

# pull out food item list
food_items = food_consumption['Food Item']
#display(food_items)

# remove food item column, convert to array
Anew = food_consumption.drop(columns=['Food Item']).to_numpy()
#display(Anew)

# get shape of matrix
m, n = Anew.shape
Food Item Germany Italy France Holland Belgium Luxembourg England Portugal Austria Switzerland Sweden Denmark Norway Finland Spain Ireland
0 Real coffee 90 82 88 96 94 97 27 72 55 73 97 96 92 98 70 30
1 Instant coffee 49 10 42 62 38 61 86 26 31 72 13 17 17 12 40 52
2 Tea 88 60 63 98 48 86 99 77 61 85 93 92 83 84 40 99
3 Sweetener 19 2 4 32 11 28 22 2 15 25 31 35 13 20 18 11
4 Biscuits 57 55 76 62 74 79 91 22 29 31 61 66 62 64 62 80
5 Powder soup 51 41 53 67 37 73 55 34 33 69 43 32 51 27 43 75
6 Tin soup 19 3 11 43 23 12 76 1 1 10 43 17 4 10 2 18
7 Potatoes 21 2 23 7 9 7 17 5 5 17 39 11 17 8 14 2
8 Frozen fish 27 4 11 14 13 26 20 20 15 19 54 51 30 18 23 5
9 Frozen veggies 21 2 5 14 12 23 24 3 11 15 45 42 15 12 7 3
10 Apples 81 67 87 83 76 85 76 22 49 79 56 81 61 50 59 57
11 Oranges 75 71 84 89 76 94 68 51 42 70 78 72 72 57 77 52
12 Tinned fruit 44 9 40 61 42 83 89 8 14 46 53 50 34 22 30 46
13 Jam 71 46 45 81 57 20 91 16 41 61 75 64 51 37 38 89
14 Garlic 22 80 88 15 29 91 11 89 51 64 9 11 11 15 86 5
15 Butter 91 66 94 31 84 94 95 65 51 82 68 92 63 96 44 97
16 Margarine 85 24 47 97 80 94 94 78 72 48 32 91 94 94 51 25
17 Olive oil 74 94 36 13 83 84 57 92 28 61 48 30 28 17 91 31
18 Yoghurt 30 5 57 53 20 31 11 6 13 48 2 11 2 21 16 3
19 Crisp bread 26 18 3 15 5 24 28 9 11 30 93 34 62 64 13 9
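
The transposition chain used above can be checked on a tiny frame. This sketch uses two countries and two food items taken from the table; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["Germany", "Italy"],
    "Real coffee": [90, 82],
    "Tea": [88, 60],
})

flipped = (
    df.set_index("Country")                            # countries become the row index
      .T                                               # food items become rows
      .rename_axis(index="Food Item", columns=None)    # name the new index, drop the column-axis name
      .reset_index()                                   # turn the index back into a regular column
)
```

The result has one row per food item and one column per country, with the food names in a regular `Food Item` column.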

Normalize data

# In this case, we normalize the data because the features (now the countries) have different ranges
stdA = np.std(Anew,axis = 0)
Anew = Anew @ np.diag(np.ones(stdA.shape[0])/stdA)
# transpose so rows are countries and columns are food items
Anew = Anew.T

#display(pd.DataFrame(Anew))

PCA

mu = np.mean(Anew,axis = 1)
xc = Anew - mu[:,None]

C = np.dot(xc,xc.T)/m

K=2
S,W = ll.eigs(C,k = K)
S = S.real
W = W.real

dim1 = np.dot(W[:,0].T,xc)/math.sqrt(S[0]) # scores on the 1st principal component
dim2 = np.dot(W[:,1].T,xc)/math.sqrt(S[1]) # scores on the 2nd principal component
# dim1 and dim2 each have shape (20,): one score per food item
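
Because the roles of rows and columns are swapped in this second analysis, it can help to trace the array shapes. The sketch below repeats the same steps on random numbers of the same size (20 food items by 16 countries) and uses `np.linalg.eigh` in place of the sparse solver. It is a stand-in for illustration, not the food data.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.uniform(1, 100, size=(20, 16))      # synthetic stand-in: 20 food items x 16 countries

A = A / A.std(axis=0)                       # scale each country (column)
X = A.T                                     # 16 x 20: rows are countries, columns are food items
Xc = X - X.mean(axis=1, keepdims=True)      # center each country's row
C = Xc @ Xc.T / A.shape[0]                  # 16 x 16, normalized by m = 20

eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1][:2]
scores = (eigvecs[:, order].T @ Xc) / np.sqrt(eigvals[order])[:, None]   # 2 x 20
```

Here the covariance matrix is 16 x 16 (countries act as the features), and each row of `scores` has length 20, one value per food item.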

Plot 2: Common (and uncommon) pairings

The locations of the foods seem logical (similar or 'compatible' foods are grouped together). Fresh fruits such as apples and oranges are grouped. So are common toppings such as jam, margarine, and butter, and what they are put on (biscuits). Garlic and olive oil are loosely grouped and separated from the rest. Visually inspecting the graph, it seems that PC1 can be described as fish vs. high caffeine (coffee, tea) intake, and PC2 by how much garlic and olive oil are a part of a country's diet.

plt.figure(figsize=(7, 7))
plt.xlabel("dim1")
plt.ylabel("dim2")
plt.plot(dim1, dim2, 'r.')
plt.axvline(0, linewidth=0.5)
plt.axhline(0, linewidth=0.5)

# add labels
for i in range(0, food_items.shape[0]):
    plt.text(dim1[i], dim2[i] + 0.01, food_items[i])

plt.show()

Check correlations to dim1 and dim2

Dimension 1 (PC 1) is most highly correlated with food consumption in Germany (0.95), although it also has moderate to strong correlations with every other country (0.49 to 0.90), indicating that this is indeed the dimension along which variation is maximized. Dimension 2 is most strongly correlated with Sweden (positive) and Spain (negative).

# add dim1 and dim2
food_consumption['dim1'] = dim1.tolist()
food_consumption['dim2'] = dim2.tolist()
display(food_consumption.drop(columns=['Food Item']).corr(method='pearson').tail(2).style.format('{0:,.2f}'))
Germany Italy France Holland Belgium Luxembourg England Portugal Austria Switzerland Sweden Denmark Norway Finland Spain Ireland dim1 dim2
dim1 0.95 0.80 0.80 0.70 0.90 0.86 0.60 0.73 0.89 0.82 0.49 0.82 0.84 0.80 0.73 0.70 1.00 0.00
dim2 0.13 -0.42 -0.32 0.33 -0.07 -0.33 0.40 -0.49 -0.13 -0.24 0.56 0.43 0.38 0.35 -0.60 0.31 0.00 1.00
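
One caveat when reading the signs in these correlation tables: the sign of an eigenvector is arbitrary, so a solver may return w or -w, and the corresponding scores and correlations flip sign together. The sketch below illustrates this on synthetic data; it is an added note, and the names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 3)) @ np.diag([3.0, 1.0, 0.3])   # synthetic data
Xc = X - X.mean(axis=0)
C = Xc.T @ Xc / X.shape[0]

eigvals, eigvecs = np.linalg.eigh(C)
lam, w = eigvals[-1], eigvecs[:, -1]        # largest eigenpair

scores = Xc @ w
scores_flipped = Xc @ (-w)                  # -w is an equally valid eigenvector

corr = np.corrcoef(X[:, 0], scores)[0, 1]
corr_flipped = np.corrcoef(X[:, 0], scores_flipped)[0, 1]
```

The relative positions of the points, and therefore the groupings in the plots, are unaffected; only the orientation of each axis can differ between runs or solvers.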