import numpy as np
import pandas as pd
import math
import matplotlib.pyplot as plt
import scipy.io as spio
import scipy.sparse.linalg as ll
import sklearn.preprocessing as skpp
Dimensionality reduction and PCA
PCA is unsupervised learning (contains no label information). It is usually used to explore data and understand patterns and used to set up clustering analysis.
Objective of clustering is to find patterns and relationships within a dataset, such as: - Geometric (feature distances) - Connectivity (spectral clustering, graphs)
However, the original dimensionally {R^n} can be large! The goal is to reduce it, k≪n!
Dimensionality reduction is the process of reducing the number of random variables under consideration - Combine, transform or select variables - Can use linear or nonlinear operations
PCA, or prinicpal component analysis, is one common way to perform dimensionality reduction. Dimensionality reduction has many uses in data science: - Visualizing, exploring and understanding the data - Extracting ”features” and dominant modes - Cleaning data - Speeding up subsequent learning task - Building simpler model later
It has many practical applications as well, such as image/audio compression, face recognition, and natural language processing (latent semantic analysis) to name a few!
PCA, by hand
Step 1: Normalize
Normalize (sometimes called scaling and/or standardizing) your data, especially if the different dimensions contain different scales of data.
Step 2: Estimate the mean and covariance matrix from the data
The covariance matrix, C, captures the variability of the data points along different directions, it also captures correlations, covariance between different coordinates (features in the x vector).
Step 3: Take the largest K eigenvectors of C which correspond to the largest eigenvalues
Use a solver for this part, such as {eigs} function from scipy.sparse.linalg. In this case, I used K=2 to reduce the data to 2 dimensions.
C=UΛU^T
C has d×d dimensions
U=eigenvectors
Step 4: Compute the reduced representation of a data point
Key Concept: PCA is a form of linear reduction scheme, because the prinicpal component \(z_i\) is linear transformation of the x.
Food consumption in European countries
This anaylsis uses PCA to compare European countries consumption of various food and drink.
Libraries
Load and Prep Data
# Import CSV and save it in different dataframes
= pd.read_csv("data/food-consumption.csv")
food_consumption
display(food_consumption)
# pull out country list
= food_consumption['Country']
countries #display(countries)
# remove country column, convert to array
= food_consumption.drop(columns=['Country']).to_numpy()
Anew #display(Anew)
# get shape of matrix
= Anew.shape m, n
Country | Real coffee | Instant coffee | Tea | Sweetener | Biscuits | Powder soup | Tin soup | Potatoes | Frozen fish | ... | Apples | Oranges | Tinned fruit | Jam | Garlic | Butter | Margarine | Olive oil | Yoghurt | Crisp bread | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Germany | 90 | 49 | 88 | 19 | 57 | 51 | 19 | 21 | 27 | ... | 81 | 75 | 44 | 71 | 22 | 91 | 85 | 74 | 30 | 26 |
1 | Italy | 82 | 10 | 60 | 2 | 55 | 41 | 3 | 2 | 4 | ... | 67 | 71 | 9 | 46 | 80 | 66 | 24 | 94 | 5 | 18 |
2 | France | 88 | 42 | 63 | 4 | 76 | 53 | 11 | 23 | 11 | ... | 87 | 84 | 40 | 45 | 88 | 94 | 47 | 36 | 57 | 3 |
3 | Holland | 96 | 62 | 98 | 32 | 62 | 67 | 43 | 7 | 14 | ... | 83 | 89 | 61 | 81 | 15 | 31 | 97 | 13 | 53 | 15 |
4 | Belgium | 94 | 38 | 48 | 11 | 74 | 37 | 23 | 9 | 13 | ... | 76 | 76 | 42 | 57 | 29 | 84 | 80 | 83 | 20 | 5 |
5 | Luxembourg | 97 | 61 | 86 | 28 | 79 | 73 | 12 | 7 | 26 | ... | 85 | 94 | 83 | 20 | 91 | 94 | 94 | 84 | 31 | 24 |
6 | England | 27 | 86 | 99 | 22 | 91 | 55 | 76 | 17 | 20 | ... | 76 | 68 | 89 | 91 | 11 | 95 | 94 | 57 | 11 | 28 |
7 | Portugal | 72 | 26 | 77 | 2 | 22 | 34 | 1 | 5 | 20 | ... | 22 | 51 | 8 | 16 | 89 | 65 | 78 | 92 | 6 | 9 |
8 | Austria | 55 | 31 | 61 | 15 | 29 | 33 | 1 | 5 | 15 | ... | 49 | 42 | 14 | 41 | 51 | 51 | 72 | 28 | 13 | 11 |
9 | Switzerland | 73 | 72 | 85 | 25 | 31 | 69 | 10 | 17 | 19 | ... | 79 | 70 | 46 | 61 | 64 | 82 | 48 | 61 | 48 | 30 |
10 | Sweden | 97 | 13 | 93 | 31 | 61 | 43 | 43 | 39 | 54 | ... | 56 | 78 | 53 | 75 | 9 | 68 | 32 | 48 | 2 | 93 |
11 | Denmark | 96 | 17 | 92 | 35 | 66 | 32 | 17 | 11 | 51 | ... | 81 | 72 | 50 | 64 | 11 | 92 | 91 | 30 | 11 | 34 |
12 | Norway | 92 | 17 | 83 | 13 | 62 | 51 | 4 | 17 | 30 | ... | 61 | 72 | 34 | 51 | 11 | 63 | 94 | 28 | 2 | 62 |
13 | Finland | 98 | 12 | 84 | 20 | 64 | 27 | 10 | 8 | 18 | ... | 50 | 57 | 22 | 37 | 15 | 96 | 94 | 17 | 21 | 64 |
14 | Spain | 70 | 40 | 40 | 18 | 62 | 43 | 2 | 14 | 23 | ... | 59 | 77 | 30 | 38 | 86 | 44 | 51 | 91 | 16 | 13 |
15 | Ireland | 30 | 52 | 99 | 11 | 80 | 75 | 18 | 2 | 5 | ... | 57 | 52 | 46 | 89 | 5 | 97 | 25 | 31 | 3 | 9 |
16 rows × 21 columns
Normalize data
In this case, we normalize the data because features have very different ranges
= np.std(Anew,axis = 0)
stdA = Anew @ np.diag(np.ones(stdA.shape[0])/stdA)
Anew = Anew.T
Anew
#display(pd.DataFrame(Anew))
PCA
Here, we extract the first two dimensions following the steps listed above.
= np.mean(Anew,axis = 1)
mu = Anew - mu[:,None]
xc
= np.dot(xc,xc.T)/m
C
=2
K= ll.eigs(C,k = K)
S,W = S.real
S = W.real
W
= np.dot(W[:,0].T,xc)/math.sqrt(S[0]) # extract 1st eigenvalues
dim1 = np.dot(W[:,1].T,xc)/math.sqrt(S[1]) # extract 2nd eigenvalue dim2
Plot 1
The location of the countries does seem logical (countries close to one another are likely to consume similar foods) – Nordic countries are grouped together, as are countries in the S-SW (Portugal, Spain, Italy) with a likely strong Mediterranean influence. Additionally, countries in central Europe are grouped (Germany, Belgium, Switzerland).
=(7, 7))
plt.figure(figsize"dim1")
plt.xlabel("dim2")
plt.ylabel('r.')
plt.plot(dim1, dim2, 0, linewidth=0.5)
plt.axvline(0, linewidth=0.5)
plt.axhline(
# add labels
for i in range(0, countries.shape[0]):
+ 0.01, countries[i])
plt.text(dim1[i], dim2[i] plt.show()
Check correlations to dim1 and dim2
Dimension 1 (PC 1) is highly correlated to countries with high consumption of tea, sweetener, tin soup, frozen veggies, and tinned fruit. Dimension 2 is highly correlated with high consumption of instant coffee, powder soup, and low consumption of frozen fish and crisp bread).
# add dim1 and dim2
'dim1'] = dim1.tolist()
food_consumption['dim2'] = dim2.tolist()
food_consumption[=['Country']).corr(method='pearson').tail(2).style.format('{0:,.2f}')) display(food_consumption.drop(columns
Real coffee | Instant coffee | Tea | Sweetener | Biscuits | Powder soup | Tin soup | Potatoes | Frozen fish | Frozen veggies | Apples | Oranges | Tinned fruit | Jam | Garlic | Butter | Margarine | Olive oil | Yoghurt | Crisp bread | dim1 | dim2 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
dim1 | 0.08 | 0.43 | 0.70 | 0.78 | 0.58 | 0.43 | 0.77 | 0.51 | 0.51 | 0.75 | 0.61 | 0.51 | 0.88 | 0.68 | -0.61 | 0.29 | 0.30 | -0.38 | 0.21 | 0.43 | 1.00 | 0.00 |
dim2 | -0.39 | 0.78 | -0.09 | -0.23 | 0.30 | 0.68 | 0.13 | -0.36 | -0.72 | -0.54 | 0.48 | 0.26 | 0.35 | 0.13 | 0.36 | 0.13 | -0.08 | 0.21 | 0.55 | -0.76 | 0.00 | 1.00 |
Flipping the Analysis
In the analysis above, we looked at which countries were similar to one another based on their food consumptions. In the next set of analysis, we will look at how foods are related to another based on the countries were they are eaten.
If part 1 was comparing culinary culures by country, this next analysis can be thought of as comparing how common (or uncommon) food pairings are.
First, let’s treat this as a completely new problem and reset our notebook.
%reset -f
Libraries
import numpy as np
import pandas as pd
import math
import matplotlib.pyplot as plt
import scipy.io as spio
import scipy.sparse.linalg as ll
import sklearn.preprocessing as skpp
Load and Prep Data
# Import CSV and save it in different dataframes
= pd.read_csv("data/food-consumption.csv")
food_consumption
# transpose df
= food_consumption.set_index('Country').T.rename_axis(index='Food Item', columns=None).reset_index()
food_consumption
display(food_consumption)
# pull out country list
= food_consumption['Food Item']
food_items #display(food_items)
# remove food item column, convert to array
= food_consumption.drop(columns=['Food Item']).to_numpy()
Anew #display(Anew)
# get shape of matrix
= Anew.shape m, n
Food Item | Germany | Italy | France | Holland | Belgium | Luxembourg | England | Portugal | Austria | Switzerland | Sweden | Denmark | Norway | Finland | Spain | Ireland | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Real coffee | 90 | 82 | 88 | 96 | 94 | 97 | 27 | 72 | 55 | 73 | 97 | 96 | 92 | 98 | 70 | 30 |
1 | Instant coffee | 49 | 10 | 42 | 62 | 38 | 61 | 86 | 26 | 31 | 72 | 13 | 17 | 17 | 12 | 40 | 52 |
2 | Tea | 88 | 60 | 63 | 98 | 48 | 86 | 99 | 77 | 61 | 85 | 93 | 92 | 83 | 84 | 40 | 99 |
3 | Sweetener | 19 | 2 | 4 | 32 | 11 | 28 | 22 | 2 | 15 | 25 | 31 | 35 | 13 | 20 | 18 | 11 |
4 | Biscuits | 57 | 55 | 76 | 62 | 74 | 79 | 91 | 22 | 29 | 31 | 61 | 66 | 62 | 64 | 62 | 80 |
5 | Powder soup | 51 | 41 | 53 | 67 | 37 | 73 | 55 | 34 | 33 | 69 | 43 | 32 | 51 | 27 | 43 | 75 |
6 | Tin soup | 19 | 3 | 11 | 43 | 23 | 12 | 76 | 1 | 1 | 10 | 43 | 17 | 4 | 10 | 2 | 18 |
7 | Potatoes | 21 | 2 | 23 | 7 | 9 | 7 | 17 | 5 | 5 | 17 | 39 | 11 | 17 | 8 | 14 | 2 |
8 | Frozen fish | 27 | 4 | 11 | 14 | 13 | 26 | 20 | 20 | 15 | 19 | 54 | 51 | 30 | 18 | 23 | 5 |
9 | Frozen veggies | 21 | 2 | 5 | 14 | 12 | 23 | 24 | 3 | 11 | 15 | 45 | 42 | 15 | 12 | 7 | 3 |
10 | Apples | 81 | 67 | 87 | 83 | 76 | 85 | 76 | 22 | 49 | 79 | 56 | 81 | 61 | 50 | 59 | 57 |
11 | Oranges | 75 | 71 | 84 | 89 | 76 | 94 | 68 | 51 | 42 | 70 | 78 | 72 | 72 | 57 | 77 | 52 |
12 | Tinned fruit | 44 | 9 | 40 | 61 | 42 | 83 | 89 | 8 | 14 | 46 | 53 | 50 | 34 | 22 | 30 | 46 |
13 | Jam | 71 | 46 | 45 | 81 | 57 | 20 | 91 | 16 | 41 | 61 | 75 | 64 | 51 | 37 | 38 | 89 |
14 | Garlic | 22 | 80 | 88 | 15 | 29 | 91 | 11 | 89 | 51 | 64 | 9 | 11 | 11 | 15 | 86 | 5 |
15 | Butter | 91 | 66 | 94 | 31 | 84 | 94 | 95 | 65 | 51 | 82 | 68 | 92 | 63 | 96 | 44 | 97 |
16 | Margarine | 85 | 24 | 47 | 97 | 80 | 94 | 94 | 78 | 72 | 48 | 32 | 91 | 94 | 94 | 51 | 25 |
17 | Olive oil | 74 | 94 | 36 | 13 | 83 | 84 | 57 | 92 | 28 | 61 | 48 | 30 | 28 | 17 | 91 | 31 |
18 | Yoghurt | 30 | 5 | 57 | 53 | 20 | 31 | 11 | 6 | 13 | 48 | 2 | 11 | 2 | 21 | 16 | 3 |
19 | Crisp bread | 26 | 18 | 3 | 15 | 5 | 24 | 28 | 9 | 11 | 30 | 93 | 34 | 62 | 64 | 13 | 9 |
Normalize data
# In this case, we normalize the data because features have very different ranges
= np.std(Anew,axis = 0)
stdA = Anew @ np.diag(np.ones(stdA.shape[0])/stdA)
Anew = Anew.T
Anew
#display(pd.DataFrame(Anew))
PCA
= np.mean(Anew,axis = 1)
mu = Anew - mu[:,None]
xc
= np.dot(xc,xc.T)/m
C
=2
K= ll.eigs(C,k = K)
S,W = S.real
S = W.real
W
= np.dot(W[:,0].T,xc)/math.sqrt(S[0]) # extract 1st eigenvalues
dim1 = np.dot(W[:,1].T,xc)/math.sqrt(S[1]) # extract 2nd eigenvalue dim2
(20,)
Plot 2: Common (and uncommon) pairings
The location of the foods does seem logical (similar foods or ‘compatible’ foods are grouped together). Fresh fruit such as apples and oranges are grouped. So are common toppings such as jam, margarine, butter, and what they are put on (biscuits). Garlic and olive oil are loosely grouped and separated. Visually inspecting the graph, it seems that PC1 can be described as fish vs. high caffine (coffee, tea) intake and PC2 can be described by how much garlic and olive oil are apart of the diet for a country.
=(7, 7))
plt.figure(figsize"dim1")
plt.xlabel("dim2")
plt.ylabel('r.')
plt.plot(dim1, dim2, 0, linewidth=0.5)
plt.axvline(0, linewidth=0.5)
plt.axhline(
# add labels
for i in range(0, food_items.shape[0]):
+ 0.01, food_items[i])
plt.text(dim1[i], dim2[i]
plt.show()
Check correlations to dim1 and dim2
Dimension 1 (PC 1) is highly correlated to food consumption areas like Germany, although it also has mild to strong correlations to many of the countries – indicating that this is indeed the dimension upon variation is maximized. Dimension 2 is most highly correlated to Sweden (positive) and Spain (negative).
# add dim1 and dim2
'dim1'] = dim1.tolist()
food_consumption['dim2'] = dim2.tolist()
food_consumption[=['Food Item']).corr(method='pearson').tail(2).style.format('{0:,.2f}')) display(food_consumption.drop(columns
Germany | Italy | France | Holland | Belgium | Luxembourg | England | Portugal | Austria | Switzerland | Sweden | Denmark | Norway | Finland | Spain | Ireland | dim1 | dim2 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
dim1 | 0.95 | 0.80 | 0.80 | 0.70 | 0.90 | 0.86 | 0.60 | 0.73 | 0.89 | 0.82 | 0.49 | 0.82 | 0.84 | 0.80 | 0.73 | 0.70 | 1.00 | 0.00 |
dim2 | 0.13 | -0.42 | -0.32 | 0.33 | -0.07 | -0.33 | 0.40 | -0.49 | -0.13 | -0.24 | 0.56 | 0.43 | 0.38 | 0.35 | -0.60 | 0.31 | 0.00 | 1.00 |