Superconductors are materials that, when cooled below a certain temperature (called the critical temperature), lose all electrical resistance, meaning electric current can flow through them without any energy loss. This behavior distinguishes them from standard conductors, whose resistance decreases only gradually as their temperature is lowered. Superconductivity turns out to be highly useful in many fields, from health care to energy production and delivery, and it plays a critical role in the high-tech electrical industry.
This analysis is based on the Superconductivity dataset (from the UCI Machine Learning Repository: https://archive.ics.uci.edu/dataset/464/superconductivty+data) and has a dual purpose: introducing two robust classification models, namely Adaptive Boosting and Bagging Trees, and applying them to a real-world problem. For this reason, the report contains both applied parts, in which the interpretation phase is prominent, and more theoretical ones focused on mathematical details and performance comparison.
I want to mention from the beginning that these data are the output of a cleaning procedure conducted on a larger database created by Japan's National Institute for Materials Science. For this reason, they contain no missing values and will turn out to be low-noise, highly informative and well separated. As a consequence, we will not see a huge improvement in the performance of Bagging and AdaBoost with respect to standard methods.
rm(list = ls())
set.seed(123)
library(tidyverse)
library(caret) # to compute confusion matrices
library(tree) # to fit decision trees
library(heatmaply) # to manage correlation matrix visualization
library(rpart) # to fit decision trees with additional parameters control
library(class) # to perform k-NN
library(cluster) # to compute the silhouette matrix
library(knitr) # helps with data visualization
The Superconductivity data are composed of two datasets:
train.csv
data_tr <- read.csv("train.csv")
Below is shown a sample of 20 variables from the dataset train.csv. Each variable is an aggregate of underlying elemental properties.
## 'data.frame': 21263 obs. of 20 variables:
## $ wtd_std_atomic_radius : num 69.2 68 67.8 68.5 70.6 ...
## $ wtd_range_Valence : num 1.09 1.13 1.11 1.1 1.06 ...
## $ wtd_std_ElectronAffinity : num 42.6 41.7 41.6 42.1 43.5 ...
## $ gmean_fie : num 718 721 718 718 718 ...
## $ wtd_entropy_ThermalConductivity: num 0.263 0.568 0.25 0.257 0.273 ...
## $ mean_ElectronAffinity : num 81.8 90.9 81.8 81.8 81.8 ...
## $ std_ElectronAffinity : num 51.4 49.4 51.4 51.4 51.4 ...
## $ wtd_mean_ElectronAffinity : num 112 112 112 112 111 ...
## $ wtd_std_Valence : num 0.437 0.469 0.445 0.441 0.429 ...
## $ wtd_gmean_atomic_radius : num 84.5 84.4 84.2 84.4 84.8 ...
## $ wtd_range_ThermalConductivity : num 57.1 51.4 57.1 57.1 57.1 ...
## $ wtd_entropy_FusionHeat : num 0.995 1.073 0.927 0.964 1.045 ...
## $ wtd_range_atomic_mass : num 31.8 36.2 35.7 33.8 27.8 ...
## $ entropy_atomic_radius : num 1.26 1.51 1.26 1.26 1.26 ...
## $ wtd_entropy_atomic_mass : num 1.062 1.058 0.976 1.022 1.129 ...
## $ wtd_entropy_Valence : num 1.07 1.05 1.03 1.05 1.1 ...
## $ std_ThermalConductivity : num 169 199 169 169 169 ...
## $ entropy_Density : num 1.03 1.31 1.03 1.03 1.03 ...
## $ gmean_Valence : num 2.21 1.89 2.21 2.21 2.21 ...
## $ wtd_entropy_fie : num 0.791 0.807 0.774 0.783 0.805 ...
It is worth mentioning how the 81 variables were derived: they represent statistical summaries of the properties of the chemical elements contained in each superconductor. In other words, each chemical compound (superconductor) is composed of certain elements, for which specific attributes, such as atomic radius, atomic mass, electrical conductivity and others, are known. By applying various statistical aggregation functions to these properties (e.g., mean, range, standard deviation, geometric mean, etc.), the characteristics of the constituent elements have been aggregated for each superconductor. This approach naturally suggests a potential dimensionality reduction, as multiple redundant statistics have been computed for each elemental property.
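To make this construction concrete, here is a minimal sketch of how a few such aggregates could be computed for a single hypothetical compound. The composition and the exact formulas (in particular the weighted entropy) are illustrative assumptions, not the dataset's official definitions.
# Illustrative sketch (not part of the original feature construction).
atomic_mass <- c(O = 15.999, Cu = 63.546, Ba = 137.327) # elemental property values
stoich <- c(O = 6.5, Cu = 3, Ba = 2) # hypothetical composition of one compound
w_elem <- stoich / sum(stoich) # normalized element proportions
mean_atomic_mass <- mean(atomic_mass) # plain mean
wtd_mean_atomic_mass <- sum(w_elem * atomic_mass) # weighted mean
gmean_atomic_mass <- exp(mean(log(atomic_mass))) # geometric mean
range_atomic_mass <- max(atomic_mass) - min(atomic_mass) # range
share <- w_elem * atomic_mass / sum(w_elem * atomic_mass) # weighted shares
wtd_entropy_atomic_mass <- -sum(share * log(share)) # one possible weighted entropy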
unique_m.csv
data <- read.csv("unique_m.csv")
Below is shown a sample of 12 variables (including the target critical_temp) for 20 sampled superconductors (identified by the material column).
data %>%
select(material, critical_temp, sample(colnames(data), 11, replace = F)) %>%
sample_n(size = 20, replace = F) %>%
kable()
material | critical_temp | Tc | Os | P | Ge | N | F | Nb | W | V | Co | Nd |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Ba1Ni2Ge0.2P1.8 | 2.590 | 0 | 0 | 1.8 | 0.2 | 0 | 0.0 | 0.000 | 0 | 0 | 0.00 | 0.0000 |
Bi2Sr2Ca0.79Y0.21Cu2O8 | 90.000 | 0 | 0 | 0.0 | 0.0 | 0 | 0.0 | 0.000 | 0 | 0 | 0.00 | 0.0000 |
Y0.72Ca0.28Ba2Cu3O6 | 47.500 | 0 | 0 | 0.0 | 0.0 | 0 | 0.0 | 0.000 | 0 | 0 | 0.00 | 0.0000 |
Y0.7Tb0.3Ni2B2C1 | 8.100 | 0 | 0 | 0.0 | 0.0 | 0 | 0.0 | 0.000 | 0 | 0 | 0.00 | 0.0000 |
Y0.8Ce0.1Ca0.1Ba2Cu4O8 | 63.000 | 0 | 0 | 0.0 | 0.0 | 0 | 0.0 | 0.000 | 0 | 0 | 0.00 | 0.0000 |
S1N1 | 0.235 | 0 | 0 | 0.0 | 0.0 | 1 | 0.0 | 0.000 | 0 | 0 | 0.00 | 0.0000 |
Cs0.33Ta1S2 | 3.800 | 0 | 0 | 0.0 | 0.0 | 0 | 0.0 | 0.000 | 0 | 0 | 0.00 | 0.0000 |
Nd1.8315Y0.0185Ce0.15Cu1O4 | 20.000 | 0 | 0 | 0.0 | 0.0 | 0 | 0.0 | 0.000 | 0 | 0 | 0.00 | 1.8315 |
Ba2Ca1Cu2F2O4 | 82.000 | 0 | 0 | 0.0 | 0.0 | 0 | 2.0 | 0.000 | 0 | 0 | 0.00 | 0.0000 |
Eu1Fe1.4As2P0.6 | 28.700 | 0 | 0 | 0.6 | 0.0 | 0 | 0.0 | 0.000 | 0 | 0 | 0.00 | 0.0000 |
Ce0.3Nd0.7F0.5Bi1S2O0.5 | 4.070 | 0 | 0 | 0.0 | 0.0 | 0 | 0.5 | 0.000 | 0 | 0 | 0.00 | 0.7000 |
Li0.8Fe1.2Se0.78S0.22H1O1 | 35.600 | 0 | 0 | 0.0 | 0.0 | 0 | 0.0 | 0.000 | 0 | 0 | 0.00 | 0.0000 |
Fe1Se0.42Te0.58 | 14.800 | 0 | 0 | 0.0 | 0.0 | 0 | 0.0 | 0.000 | 0 | 0 | 0.00 | 0.0000 |
La3Ni1 | 2.200 | 0 | 0 | 0.0 | 0.0 | 0 | 0.0 | 0.000 | 0 | 0 | 0.00 | 0.0000 |
Y1Ba2Cu3O6.6 | 69.000 | 0 | 0 | 0.0 | 0.0 | 0 | 0.0 | 0.000 | 0 | 0 | 0.00 | 0.0000 |
Ga0.245Nb0.755 | 20.200 | 0 | 0 | 0.0 | 0.0 | 0 | 0.0 | 0.755 | 0 | 0 | 0.00 | 0.0000 |
Y1Ba2Cu3O | 90.000 | 0 | 0 | 0.0 | 0.0 | 0 | 0.0 | 0.000 | 0 | 0 | 0.00 | 0.0000 |
Er1Rh1.1Sn3.6 | 1.360 | 0 | 0 | 0.0 | 0.0 | 0 | 0.0 | 0.000 | 0 | 0 | 0.00 | 0.0000 |
La1.82Sr0.18Cu0.97Co0.03O4 | 3.000 | 0 | 0 | 0.0 | 0.0 | 0 | 0.0 | 0.000 | 0 | 0 | 0.03 | 0.0000 |
Lu0.8Pr0.2Ba2Cu3O7 | 79.000 | 0 | 0 | 0.0 | 0.0 | 0 | 0.0 | 0.000 | 0 | 0 | 0.00 | 0.0000 |
For the training of the classification models I will work exclusively with the first dataset (21263 x 82); only in the initial phase of data interpretation will I use the second, DTM-like dataset to better understand the structure of the problem.
In this phase I want to understand whether I am able to cluster superconductors based on their internal chemical elements; besides that, I am interested in assessing whether the critical temperature across the estimated groups is coherent with the materials involved. Indeed, superconductors (SC) can be split into two groups based on their critical temperature, namely High-Temperature SC and Low-Temperature SC. In this phase, however, I cluster them based only on the internal proportions of their chemical elements (the target feature is not used).
Therefore I load the "DTM" matrix (provided directly on the UCI site), on which I will try to cluster superconductors to better understand their composition and nature. In the clustering phase I remove the chemical formula of each compound and the critical temperature. I also normalize the frequency of appearance of each chemical element in each superconductor, so that every row sums to 1. It seemed more natural to me to normalize each row by its sum rather than defining the weights as \(w_{ij}=tf_{ij}\cdot idf_i\).
pro <- data %>% select(-c(material, critical_temp))
proe <- t(apply(pro, 1, function(x) x / sum(x)))
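For reference, here is a minimal sketch of the tf-idf weighting mentioned above, which was considered but not used. The definitions of tf (within-compound proportion) and idf (log inverse "document frequency" of an element across compounds) are assumptions made for illustration.
# Sketch of the tf-idf alternative (not used in the analysis).
tf <- t(apply(pro, 1, function(x) x / sum(x))) # within-compound element proportions
idf <- log(nrow(pro) / pmax(colSums(pro > 0), 1)) # guard against elements never present
w_tfidf <- sweep(tf, 2, idf, `*`) # w_ij = tf_ij * idf_j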
I want to perform hierarchical clustering (a distance-based clustering technique) with the aim of grouping together similar superconductors based on their internal proportions of 87 chemical elements. Specifically, I compute the distance matrix (d) using standard Euclidean distances and perform the clustering using Ward's method. I then plot the resulting dendrogram.
d <- dist(proe)
groups <- hclust(d, method = "ward.D2")
Here I choose the number of clusters, between 3 and 10, that maximizes the average silhouette width. The best number of clusters turns out to be 4.
i <- 0
sil <- numeric(8)
for (k in 3:10) {
i <- i + 1
lab <- cutree(groups, k = k)
s <- silhouette(lab, d)
summ <- summary(s)
sil[i] <- summ$avg.width
}
k_opt <- which.max(sil) + 2
Dendrogram with 4 highlighted clusters
plot(groups, labels = FALSE)
optclust <- cutree(groups, k = k_opt)
rect.hclust(groups, k = 4, border = c("red", "green", "dodgerblue", "darkorange"))
In the following lines I show, for each cluster, the average internal proportion (over all SC in the cluster) of each chemical element. The average frequencies are ordered from the most frequent to the least, and not all 87 elements are plotted. The code producing the histogram is very similar across the 4 classes, so I do not repeat it: essentially only the filter argument and the number of elements shown change.
chem <- data.frame(proe, optclust)
chem1 <- chem %>%
filter(optclust == 1) %>%
select(-optclust)
means_1 <- apply(chem1, 2, mean)
df1 <- data.frame(elements = colnames(proe), frequency_mean = means_1)
df1_or <- df1 %>%
arrange(desc(frequency_mean)) %>%
slice_head(n = 25)
Knowing the average frequency of the most common chemical elements in the first group (namely copper and oxygen), and after some research, I have labeled this first category as the Copper-Oxide superconductors group. These are known to be high-temperature superconductors. Among their applications are maglev trains and MRI machines.
ggplot(df1_or, aes(x = reorder(elements, -frequency_mean), y = frequency_mean, fill = elements)) +
geom_bar(stat = "identity", col = "grey50") +
theme_light() +
labs(x = "Element", y = "Mean Frequency", title = "Top 25 Elements by Mean Frequency") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
guides(fill = "none")
Using the same strategy as above, the second group is likely composed of BSCCO superconductors, namely bismuth-strontium-based cuprates. These are also known to be HTS, although the group likely contains some LTS as well. They are used in the power-transmission industry and are being explored for future particle accelerators.
The third group is composed of many different chemical elements, with an important iron component, hence I have labeled it as the Iron-Based Superconductors group. These are mostly LTS, although some reach higher critical temperatures. They have gained attention, for example, for applications in quantum computing.
The fourth and last group, due to the strong presence of niobium, has been labeled as the class of Niobium-Based Superconductors. These SC are known to be LTS. They are the backbone of high-energy particle accelerators such as the Large Hadron Collider (LHC) at CERN.
It is interesting to see whether the distinction between HTS and LTS (which is not precisely defined up to this point, since we have not yet fixed a threshold for the split) is respected across the four estimated categories. To assess this, I plot the boxplot of the critical temperature conditioned on the four estimated categories. I first relabel the four groups.
optclust_char <- ifelse(optclust == 1, "Copper-Oxide SC",
ifelse(optclust == 2, "BSSCO SC",
ifelse(optclust == 3, "Iron-based SC", "Niobium-based SC")
)
)
dfbox <- data.frame(critical_temp = data$critical_temp, optclust_char)
box_colors <- c(rep("salmon", 2), rep("lightblue", 2))
The distinction between HTS and LTS is clear, and the four estimated categories are supported by the critical temperature variable: given the estimated type of a superconductor, one can cross-check its class membership by looking at the critical temperature distribution of its group.
boxplot(critical_temp ~ optclust_char,
data = dfbox, ylab = "critical temperature",
cex.axis = 0.7, col = box_colors, xlab = "Superconductor Type"
)
text(x = 1.5, y = 150, labels = c("HTS"), col = "salmon")
text(x = 3.6, y = 60, labels = c("LTS"), col = "lightblue")
It is interesting to investigate the nature of the far outliers classified as Iron-based SC (LTS) that have a high critical temperature.
outliers <- data %>%
filter(optclust == 3) %>%
filter(critical_temp > 70)
outliers[,c(87,88)]
## critical_temp material
## 1 115 H1Br3C61
## 2 80 H1Cl3C61
## 3 185 H2S1
The far outlier H2S1 (hydrogen sulfide) has been clustered as an iron-based SC but has a critical temperature of 185 Kelvin. This suggests something unusual about this material, and indeed it is not, strictly speaking, an ordinary superconductor: only recent work has shown that it can become superconducting at extreme pressure. Beyond this far outlier, it is interesting to notice that many outliers fall within the iron-based category. Delving deeper, I hypothesize this could be a consequence of the unconventional superconductivity of iron-based SC: they share similar feature values, which places them in the third group, yet they show high variability in critical temperature (likely because superconductivity is triggered at unconventional thresholds, e.g., under extreme pressure).
data <- read.csv("train.csv")
I want to turn this problem from a regression task into a binary classification problem. Hence, the goal will be to accurately classify a superconductor either into the group of high-temperature superconductors (HTS) or into the group of low-temperature superconductors (LTS).
In this dataset, each superconductor can be labeled as HTS or LTS based on its critical temperature, which will define the binary target variable for the classification problem. In the scientific literature there are two widely used thresholds for this split: 30 Kelvin and 77 Kelvin. I will use the first one; this choice is also motivated by the fact that it produces more balanced classes. Below is the histogram of the continuous target variable.
hist(data$critical_temp, prob = T, breaks = 100, xlab = "critical temperature", main = "Histogram", ylab = "frequency density")
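As a quick numerical check of the class-balance argument above (an addition, not part of the original analysis), the snippet below compares the share of superconductors that each candidate threshold would label as HTS.
# Proportion of superconductors labeled HTS under each candidate threshold.
c(
  HTS_share_30K = mean(data$critical_temp >= 30),
  HTS_share_77K = mean(data$critical_temp >= 77)
)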
The target variable is highly skewed and will likely yield unbalanced classes. I will address this by adjusting the training set with oversampling. It is worth mentioning that HTS can be considered "more" interesting than LTS, due to their superconducting behaviour at higher temperatures: this property allows one to harness superconductivity without cooling the material to very low temperatures (cooling requires more energy). This is why I have arbitrarily chosen HTS as the class of interest. All variables are numeric except two; since transforming number_of_elements and range_Valence into factors did not seem meaningful, I converted them to numeric values to align them with the others.
data$number_of_elements <- as.numeric(data$number_of_elements)
data$range_Valence <- as.numeric(data$range_Valence)
Here are listed the above-mentioned features of the chemical elements that have been aggregated for each superconductor.
Chemical element features:
These quantities have been aggregated, for each SC, using the following statistics:
Feature selection via Principal Component Analysis (PCA)
I want to use PCA to do feature selection. PCA is an unsupervised technique performed on the covariates and completely independent of the target. The core idea in this phase is to summarize the information conveyed by the features, understand which subset of covariates best explains the overall variability of the dataset, and only then use this subset of covariates to perform classification. The selection technique I will use is based on PCA in the following way:
I will select a number q of principal components (with q < 81), knowing that these q components explain a certain fraction of the total variance of the data. Then, for each of the q principal components, I will select the variable that influences that component the most. This is done based on the PC loadings, namely the entries of the eigenvectors associated with each principal component.
data_features <- data %>% select(-critical_temp)
pca <- princomp(data_features, cor = T)
p <- ncol(data_features)
I arbitrarily select the first 20 principal components.
q <- 20
The first line below computes the fraction of total variance explained by the first q components; the second selects, for each of the q principal components, the variable with the highest squared loading, i.e., the most influential variable on that component.
sum((pca$sdev[1:q])^2) / p
variables <- names(data_features)[apply(pca$loadings[, 1:q], 2, function(x) which(x**2 == max(x**2)))]
Since the 18th principal component is primarily influenced by the same variable, wtd_entropy_ThermalConductivity, that most influences the 3rd principal component, I will replace the 18th component with the variable that has the second-highest influence on the 18th principal component (based on loadings).
index <- which(colnames(data_features) == "wtd_entropy_ThermalConductivity")
pc18 <- pca$loadings[-index, 18]
v18 <- names(data_features)[which(pc18**2 == max(pc18**2))]
variables[18] <- v18
If I change the number of components, I also need to check again whether there are repetitions among the selected variables, as in the helper sketched below.
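The de-duplication above was done by hand for the 18th component; a hypothetical helper like the one below (not part of the original code) could automate the check for any choice of q.
# Greedily pick, for each of the first q components, the variable with the
# largest squared loading that has not been selected yet.
select_pca_vars <- function(loadings, var_names, q) {
  selected <- character(q)
  for (j in 1:q) {
    ranked <- var_names[order(loadings[, j]^2, decreasing = TRUE)]
    selected[j] <- ranked[!(ranked %in% selected)][1]
  }
  selected
}
# e.g. variables <- select_pca_vars(pca$loadings, names(data_features), q)
# (when a top-loading variable is already taken, the result may differ slightly
# from the manual fix above)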
data_select <- data %>% select(c(critical_temp, variables))
There is no strong multicollinearity left: since I selected variables according to their impact on the principal component scores, highly correlated variables, which bring little additional information, have effectively been removed. Below I visualize the correlation matrix among the selected variables with a heatmap; as can be seen, highly correlated variables are no longer included.
data_1 <- data %>% select(variables)
Cor <- cor(data_1)
heatmaply_cor(Cor,
cexRow = 0.6, cexCol = 0.6,
xlab = "variables",
ylab = "variables", show_dendrogram = c(F, F),
k_col = 4, k_row = 2, key.title = "correlation",
plot_method = "ggplot"
)
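As a complementary numeric summary of the heatmap (an addition, not part of the original analysis), one can report the largest absolute pairwise correlation among the selected variables:
# Largest absolute off-diagonal correlation among the selected variables.
max(abs(Cor[upper.tri(Cor)]))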
I divide the data into a training and a test set, and also into features and response variable. I will sample 80% of the observations of the original dataset to form the training set; the remaining 20% will be the test set.
data_or <- data
data <- data_select
n <- nrow(data)
prop <- 80 / 100
ind_train <- sample(1:n, size = floor(prop * n), replace = FALSE)
ind_test <- c(1:n)[-ind_train]
Training: covariates and target
X_train <- data[ind_train, ] %>% select(-critical_temp)
y_train <- data[ind_train, ] %>% select(critical_temp)
Test: covariates and target
X_test <- data[ind_test, ] %>% select(-critical_temp)
y_test <- data[ind_test, ] %>% select(critical_temp)
It is useful at this point to understand whether normalization is needed. For this purpose, I use a barplot to display the range of each variable; to keep the plot readable, I ignore variables with a range above 5000. Only two of the selected variables turn out to have a range greater than 5000 (std_Density and its weighted version).
ranges <- as.numeric(apply(X_train, 2, function(x) max(x) - min(x)))
ind <- which(ranges < 5000)
coln <- colnames(X_train[, ind])
ranges <- ranges[ind]
ranges <- data.frame(ranges, row.names = coln)
ggplot(ranges, aes(x = coln, y = ranges)) +
geom_bar(stat = "identity", fill = "darkorange", col = "grey") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 60, hjust = 1, size = 7)) +
labs(x = "features", y = "Range", title = "Variable range Barplot")
I now normalize the matrix X_train of training features and rename it X_trn. In this way each variable will have range \([0,1]\).
n <- dim(X_train)[1]
p <- dim(X_train)[2]
X_trn <- matrix(data = NA, ncol = p, nrow = n)
for (j in 1:p) {
x_trj <- X_train[, j]
X_trn[, j] <- (x_trj - min(x_trj)) / (max(x_trj) - min(x_trj))
}
X_trn <- as.data.frame(X_trn)
colnames(X_trn) <- colnames(X_train)
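For reference, the same min-max normalization can be written more compactly; this is a sketch (an addition, not in the original code) that should produce the same result as the loop above.
# Equivalent, more compact min-max normalization of the training features.
rng <- apply(X_train, 2, range) # row 1: minima, row 2: maxima
X_trn_alt <- as.data.frame(scale(X_train, center = rng[1, ], scale = rng[2, ] - rng[1, ]))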
I check the normalization result on the selected training variables, and also which variables have been selected through PCA.
str(X_trn)
## 'data.frame': 17010 obs. of 20 variables:
## $ range_fie : num 0.0532 0 0.2735 0.6214 0.0606 ...
## $ gmean_atomic_radius : num 0.667 0.608 0.413 0.364 0.384 ...
## $ wtd_entropy_ThermalConductivity: num 0.0012 0 0.7174 0.1259 0.2981 ...
## $ wtd_mean_ElectronAffinity : num 0.292 0.105 0.133 0.361 0.12 ...
## $ wtd_mean_ThermalConductivity : num 0.133 0.14 0.226 0.244 0.382 ...
## $ wtd_range_fie : num 0.529 0 0.287 0.51 0.26 ...
## $ std_Density : num 0.0312 0 0.2629 0.2979 0.967 ...
## $ wtd_range_atomic_radius : num 0.8226 0 0.2275 0.0972 0.0541 ...
## $ std_Valence : num 0.333 0 0.373 0.144 0.667 ...
## $ gmean_fie : num 0.269 0.41 0.402 0.39 0.515 ...
## $ std_ElectronAffinity : num 0.111 0 0.194 0.337 0.326 ...
## $ wtd_mean_atomic_mass : num 0.426 0.861 0.265 0.225 0.309 ...
## $ std_atomic_mass : num 0.318 0 0.122 0.434 0.897 ...
## $ std_ThermalConductivity : num 0.1 0 0.263 0.784 0.24 ...
## $ wtd_gmean_FusionHeat : num 0.2534 0.3415 0.1566 0.0114 0.1173 ...
## $ gmean_FusionHeat : num 0.1541 0.3415 0.1426 0.0364 0.1477 ...
## $ wtd_range_Density : num 0.3813 0 0.1216 0.0951 0.2807 ...
## $ mean_ThermalConductivity : num 0.0977 0.1714 0.3225 0.327 0.4165 ...
## $ gmean_ThermalConductivity : num 0.0766 0.1792 0.2965 0.0237 0.4044 ...
## $ wtd_entropy_ElectronAffinity : num 0.00296 0 0.41143 0.4653 0.07437 ...
I transform the target into a binary variable with two levels: LTS = -1 and HTS = 1 (HTS is the class of interest).
y_train <- as.factor(ifelse(y_train < 30, -1, 1))
train <- data.frame(X_trn, critical_temp = y_train)
Problem of unbalanced classes
There is a problem of unbalanced classes. I address it by oversampling the under-represented class, artificially creating classes of equal size.
tab <- as.numeric(table(train$critical_temp))
diff <- max(tab) - min(tab)
train_HTS <- train %>%
group_by(critical_temp) %>%
filter(critical_temp == 1) %>%
sample_n(size = diff, replace = T)
train_rs <- as.data.frame(rbind(train, train_HTS))
train_rs <- train_rs[sample(nrow(train_rs)), ]
colnames(train_rs) <- colnames(train)
y_train_rs <- train_rs %>% select(critical_temp)
table(train_rs$critical_temp)
##
## -1 1
## 9915 9915
X_train_rs <- train_rs %>% select(-critical_temp)
Here I estimate a single, fully grown classification tree and then prune it using a penalization parameter chosen automatically by cross-validation based on the misclassification error. In the case of classification trees, the deviance used to grow the tree is not a sum of squares but (minus twice) the sum, over terminal regions and classes, of the class counts times the logarithm of the class proportions, i.e., a total cross-entropy weighted by the size of each class in each region.
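Explicitly, denoting by \(n_{mk}\) the number of class-\(k\) observations falling in terminal node \(m\) (with \(n_m\) observations in total) and by \(\hat p_{mk}=n_{mk}/n_m\) the corresponding class proportion, the deviance can be written as
\[ D = -2\sum_{m}\sum_{k} n_{mk}\,\log \hat p_{mk} \]
which is minus twice the multinomial log-likelihood of the tree.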
tree0 <- tree(critical_temp ~ ., data = train_rs, split = c("deviance"))
cvtree0 <- cv.tree(tree0, FUN = prune.misclass)
min_dev <- min(cvtree0$dev)
optimal_sizes <- cvtree0$size[cvtree0$dev == min_dev]
k_opt <- min(optimal_sizes)
pruned_tree <- prune.misclass(tree0, best = k_opt)
Below is the final pruned tree. As with every standard classification tree, interpretation is straightforward: given a specific superconductor and its features, one starts from the root, tests each node condition down to the leaves, and goes left whenever the condition is satisfied. (The interpretation is less immediate with normalized values, but it is still effective.) It is worth mentioning that trees are highly data-dependent, hence a different training set could have generated a different tree.
plot(pruned_tree)
text(pruned_tree, pretty = 0, cex = 0.5)
Here I normalize the test set using the training columns' minima and maxima. The normalization of the test set is kept separate because the normalization parameters must come from the training data only, so that no information from the test set leaks into the training phase.
pt <- ncol(X_test)
nt <- nrow(X_test)
X_tst <- matrix(data = NA, ncol = pt, nrow = nt)
for (j in 1:pt) {
x_tst <- X_train[, j]
X_tst[, j] <- (X_test[, j] - min(x_tst)) / (max(x_tst) - min(x_tst))
}
X_tst <- as.data.frame(X_tst)
colnames(X_tst) <- colnames(X_test)
y_test_fac <- as.factor(ifelse(y_test < 30, -1, 1))
pred_class0 <- predict(pruned_tree, X_tst, type = "class")
pre0 <- ifelse(pred_class0 == 1, 1, -1)
pref <- as.factor(pre0)
Below is the confusion matrix that evaluates the performance of the single pruned-tree classifier on the test labels.
conf <- confusionMatrix(pref, y_test_fac, positive = "1")
MER_single_pruned_tree <- 1 - as.numeric(conf$overall[1])
conf
## Confusion Matrix and Statistics
##
## Reference
## Prediction -1 1
## -1 2171 188
## 1 329 1565
##
## Accuracy : 0.8784
## 95% CI : (0.8682, 0.8881)
## No Information Rate : 0.5878
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7521
##
## Mcnemar's Test P-Value : 7.405e-10
##
## Sensitivity : 0.8928
## Specificity : 0.8684
## Pos Pred Value : 0.8263
## Neg Pred Value : 0.9203
## Prevalence : 0.4122
## Detection Rate : 0.3680
## Detection Prevalence : 0.4453
## Balanced Accuracy : 0.8806
##
## 'Positive' Class : 1
##
As we can see, since the structure of the data is simple and contains little noise, I already obtain good accuracy and specificity with a single tree. In the next section, I will introduce more robust methods.
I have also used the function rpart, from the library of the same name, to estimate a classification tree with a maximum depth arbitrarily set to 2. This will be useful later for comparing performances. Despite its simple structure, this classification tree already seems to have a fairly good performance.
tree2 <- rpart(critical_temp ~ .,
data = train_rs, method = "class",
control = rpart.control(maxdepth = 2)
)
pred_class2 <- predict(tree2, X_tst, type = "class")
pre2 <- ifelse(pred_class2 == 1, 1, -1)
pref2 <- as.factor(pre2)
conf2 <- confusionMatrix(pref2, y_test_fac, positive = "1")
MER_2 <- 1 - as.numeric(conf2$overall[1])
conf2
## Confusion Matrix and Statistics
##
## Reference
## Prediction -1 1
## -1 2222 338
## 1 278 1415
##
## Accuracy : 0.8552
## 95% CI : (0.8442, 0.8656)
## No Information Rate : 0.5878
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.6996
##
## Mcnemar's Test P-Value : 0.01745
##
## Sensitivity : 0.8072
## Specificity : 0.8888
## Pos Pred Value : 0.8358
## Neg Pred Value : 0.8680
## Prevalence : 0.4122
## Detection Rate : 0.3327
## Detection Prevalence : 0.3981
## Balanced Accuracy : 0.8480
##
## 'Positive' Class : 1
##
Adaptive Boosting Classification Trees
In this section, I have manually implemented the AdaBoost.M1 algorithm proposed by Freund and Schapire in 1997, and I have used it to perform binary classification on the Superconductivity dataset. The implementation follows the description of the algorithm in The Elements of Statistical Learning (page 339) by T. Hastie, R. Tibshirani and J. Friedman.
Explanation
The purpose of boosting is to weight the output of many weak classifiers to produce a powerful "committee-based" prediction. I perform AdaBoost on the dataset with balanced classes. The core idea of the method is to repeatedly refit a weak classifier on reweighted versions of the training data, with a clever updating mechanism for the observation weights (\(w_1,w_2,\dots,w_N\)). The algorithm starts with equal weights (\(w_i=1/N\) for all \(i = 1,2,\dots,N\)).
Then a weak classifier \(G_m(x)\) is trained on the weighted training data; in the case of trees, a weak classifier is a very short tree with only one split (a stump) or just a few nodes. Next, for the current iteration, the weighted error rate on the training set is computed (using the current weights):
\[ err_m=\frac{\sum_{i=1}^Nw_i\cdot I(y_i\neq G_m(x_i))}{\sum_{i=1}^Nw_i} \]
then a quantity \(\alpha_m\) is calculated, namely \(\alpha_m = \log((1-err_m)/err_m)\),
and finally each component of the weight vector is updated, based on how well the current classifier performed and on whether the corresponding observation was correctly classified. The update is computed as follows:
\[ w_i \leftarrow w_i\cdot \exp\{\alpha_m\cdot I(y_i\neq G_m(x_i))\}\qquad i=1,2,\dots,N \]
More and more weight is assigned to observations that are difficult to classify. Iterating this procedure M times (the number of boosting iterations), the algorithm trains many weak classifiers on differently weighted versions of the original data. The final prediction is a weighted vote of the M weak classifiers, using the vector of coefficients \(\alpha\). More precisely:
\[ G(x) = \operatorname{sign} \left[ \sum_{m=1}^M\alpha_m\cdot G_m(x) \right] \]
The final class prediction is then given, for every \(x\), by \(G(x)\).
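As a quick numerical illustration of the update (toy numbers, not taken from the fitted model):
# Toy illustration of a single AdaBoost weight update (all numbers invented).
err_toy <- 0.3 # weighted training error of the current weak classifier
alpha_toy <- log((1 - err_toy) / err_toy) # about 0.847
exp(alpha_toy) # about 2.33: factor applied to the weights of misclassified points
w_toy <- rep(1 / 10, 10) # ten observations with equal weights
miss <- c(TRUE, TRUE, rep(FALSE, 8)) # suppose the first two are misclassified
w_toy <- w_toy * ifelse(miss, exp(alpha_toy), 1)
w_toy <- w_toy / sum(w_toy) # renormalize, as in the implementation below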
N <- nrow(train_rs)
Initialization of the equal weights
w <- rep(1 / N, N)
Number of boosting iterations
M <- 5000
The vector of coefficients alpha and an empty list of classifiers to fill:
alpha_m <- numeric(M)
TREE <- vector("list", M)
The core algorithm that updates the weights \(w_i\) for \(i=1,2,\dots,N\), trains the weak classifiers \(G_1(x),\dots,G_M(x)\) and estimates the vector of coefficients \([\alpha_1,\dots,\alpha_M]\) is shown below. I have used rpart since it allows me to control the growth of the tree; I have chosen to let each tree grow up to 3 levels. The weighting of the training observations by \(\mathbf{w}\) is handled entirely through the weights = w argument of rpart.
for (m in 1:M) {
TREE[[m]] <- rpart(critical_temp ~ .,
data = train_rs, method = "class",
control = rpart.control(maxdepth = 3), weights = w
)
pred_class <- predict(TREE[[m]], X_train_rs, type = "class")
pre <- ifelse(pred_class == 1, 1, -1)
err_m <- sum(w * (pre != y_train_rs))
alpha_m[m] <- log((1 - err_m) / (err_m))
w <- w * exp(alpha_m[m] * (pre != y_train_rs))
w <- w / sum(w)
}
alpha_M <- matrix(alpha_m, ncol = 1)
I repeat the normalization of the test-data matrix.
pt <- ncol(X_test)
nt <- nrow(X_test)
X_tst <- matrix(data = NA, ncol = pt, nrow = nt)
for (j in 1:pt) {
x_tst <- X_train[, j]
X_tst[, j] <- (X_test[, j] - min(x_tst)) / (max(x_tst) - min(x_tst))
}
X_tst <- as.data.frame(X_tst)
colnames(X_tst) <- colnames(X_test)
y_test_fac <- as.factor(ifelse(y_test < 30, -1, 1))
Here I define the final classifier, a weighted vote of the weak classifiers with coefficients \(\alpha\) that account for how well each classifier (within its boosting iteration) classified the training observations.
G <- function(X, TREE, M, alpha_M) {
pred <- matrix(data = NA, ncol = M, nrow = nrow(X))
for (m in 1:M) {
pred_cl <- predict(TREE[[m]], X, type = "class")
pre <- ifelse(pred_cl == 1, 1, -1)
pred[, m] <- pre
}
out <- sign(pred %*% alpha_M)
return(out)
}
Here I compute the classes predicted by Adaptive Boosting.
G_boost <- G(X_tst, TREE, M, alpha_M)
Gb <- as.factor(G_boost)
As we can see, with respect to a standard classification tree I obtain an improvement in accuracy and sensitivity, but it is not dramatic.
confusionMatrix(Gb, y_test_fac, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction -1 1
## -1 2266 121
## 1 234 1632
##
## Accuracy : 0.9165
## 95% CI : (0.9078, 0.9247)
## No Information Rate : 0.5878
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8294
##
## Mcnemar's Test P-Value : 2.776e-09
##
## Sensitivity : 0.9310
## Specificity : 0.9064
## Pos Pred Value : 0.8746
## Neg Pred Value : 0.9493
## Prevalence : 0.4122
## Detection Rate : 0.3837
## Detection Prevalence : 0.4387
## Balanced Accuracy : 0.9187
##
## 'Positive' Class : 1
##
Test error reduction with AdaBoost
It is worth mentioning that at each iteration the observations misclassified by \(G_m(x)\) have their weights scaled by a factor of \(\exp(\alpha_m)\), increasing their relative influence on the next classifier \(G_{m+1}(x)\); following this argument one can show that the training error decreases, and empirically the test error typically decreases as well.
Here, I try to show empirically, through a simulation, how increasing the number of boosting iterations (M) reduces the misclassification error rate computed on the test set. I also compare it with a single pruned tree and with the MER of a depth-2 (double-stump) classification tree. MM is the vector of arbitrarily chosen numbers of boosting iterations.
MM <- c(c(1:10), seq(20, 100, 10), seq(150, 200, 25), 250, 300)
K <- length(MM)
N <- nrow(train_rs)
Below is a simple function to compute the MER, followed by the code computing the MER as a function of the number of boosting iterations. This chunk is a slight modification of the previous AdaBoost chunk.
class_err <- function(actual, predicted) mean(actual != predicted)
cer_boost <- numeric(K)
for (k in 1:K) {
M_iter <- MM[k]
w <- rep(1 / N, N)
alpha_m <- numeric(M_iter)
TREE <- vector("list", M_iter)
for (m in 1:MM[k]) {
TREE[[m]] <- rpart(critical_temp ~ .,
data = train_rs, method = "class",
control = rpart.control(maxdepth = 2), weights = w
)
pred_class <- predict(TREE[[m]], X_train_rs, type = "class")
pre <- ifelse(pred_class == 1, 1, -1)
err_m <- sum(w * (pre != y_train_rs))
alpha_m[m] <- log((1 - err_m) / (err_m))
w <- w * exp(alpha_m[m] * (pre != y_train_rs))
w <- w / sum(w)
}
alpha_M <- matrix(alpha_m, ncol = 1)
Gboost <- G(X_tst, TREE, M_iter, alpha_M)
Gboost <- as.factor(Gboost)
cer_boost[k] <- class_err(y_test_fac, Gboost)
}
plot(MM, cer_boost,
type = "l", col = "blue", pch = 19, lwd = 1, main = "Test Error rate - Discrete AdaBoost -",
xlab = "Boosting Iterations", ylab = "Test Misclassification Error"
)
abline(h = MER_single_pruned_tree, col = "black", lty = "dashed")
text(x = 260, y = MER_single_pruned_tree-0.005, labels = c("8 node CV pruned Tree"), cex = 0.7, col = "black")
abline(h = MER_2, col = "black", lty = "dashed")
text(x = 250, y = MER_2-0.005, labels = c("Double-Stumps Weak Learner"), cex = 0.7, col = "grey12")
points(MM, cer_boost, pch = 1, cex = 1.2, col = "blue")
text(x = 270, y = cer_boost[which(MM==250)]+0.005, labels = c("AdaBoost"), col = "black", cex = 0.7)
It can be seen how the test error rate of AdaBoost.M1 decreases as the number of boosting iterations increases, staying below the MER of the standard trees. At the same time, the improvement in MER is not large.
In this section, I have implemented the technique of Bagging, also known as bootstrap aggregation, specifically for classification trees. The idea behind this method is to fit many high-variance (hence data-dependent), low-bias classifiers \(\hat{f}(x)\) to \(B\) bootstrap samples \(\mathbf{Z}^{*b}\) of the training data \(\mathbf{Z}\) and then average the predictions yielded by each model. More precisely:
\[ \hat{f}_{bag}(x)=\frac{1}{B}\sum_{b=1}^{B}\hat{f}^{*b}(x) \]
More precisely, \(\hat{f}_{bag}(x)\) is a Monte Carlo estimate of the true bagging estimate \(\mathbb{E}_P{\left[\hat{f}^*(x)\right]}\), where \(\mathbf{Z}^*=\{(x_1^*,y_1^*),(x_2^*,y_2^*),\dots,(x_N^*,y_N^*)\}\) is drawn from P, the empirical distribution putting equal probability 1/N on each training point \((x_i,y_i)\). One can verify that taking this Monte Carlo estimate (which is nothing but an average) reduces the variance of the classifier.
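To make the variance-reduction claim explicit (an added remark): if the B bootstrap trees are identically distributed with variance \(\sigma^2\) and pairwise correlation \(\rho\), the variance of their average is
\[ Var\left[\hat{f}_{bag}(x)\right]=\rho\,\sigma^2+\frac{1-\rho}{B}\,\sigma^2 \]
so the second term vanishes as B grows, and the benefit of averaging is limited only by the correlation between the bootstrap trees.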
The way in which the averaging is carried out determines different types of bagging (consensus vote or probability averaging). In simple words, a bagged classifier is just the average of classifiers trained on a collection of bootstrap samples; doing so reduces the variance of the prediction and enhances test performance, since the bagged classifier is more robust. In this case, I have applied bootstrap aggregation (Bagging) to classification trees, using Section 8.7 of The Elements of Statistical Learning as the primary reference. Therefore, in this application \(\hat{f}^{*b}(x)=T^{*b}(x)\), since each classifier is a classification tree.
Consensus and Probability approach
In bagging trees there are two distinct approaches that can be followed to aggregate the classifiers: the consensus vote, in which each tree votes for its predicted class and the majority class wins, and the probability approach, in which the predicted class probabilities are averaged and the class with the highest average probability is chosen.
First I have tried the probability approach alone, which should produce improved estimates with lower variance; then I compared the test error, as a function of the number of bootstrap samples, across the two strategies, consensus and probability.
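Before the full implementation, the toy snippet below (invented numbers, three hypothetical trees, one observation) contrasts the two aggregation rules and shows that they can disagree.
# Columns: P(class -1), P(class 1); rows: three hypothetical bootstrap trees.
probs <- rbind(
  c(0.90, 0.10),
  c(0.45, 0.55),
  c(0.45, 0.55)
)
# Probability approach: average the class probabilities, then take the argmax.
colMeans(probs) # (0.60, 0.40) -> predicts class -1
# Consensus approach: each tree votes for its own argmax class; the majority wins.
table(apply(probs, 1, which.max)) # two votes for column 2 -> predicts class 1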
class_err <- function(actual, predicted) mean(actual != predicted)
pred <- vector("list")
Number of bagging iterations
B <- 200
Here I define the function that aggregates the bootstrap classifiers (specifically their posterior-probability matrices) into a single bagged classifier (one averaged matrix).
mean_list <- function(x) Reduce(`+`, x) / B
n_trn <- nrow(train_rs)
n_tst <- nrow(X_tst)
for (b in 1:B) {
Z <- train_rs[sample(nrow(train_rs), nrow(train_rs), replace = TRUE), ]
Tree <- tree(critical_temp ~ ., data = Z, split = "deviance")
pred[[b]] <- matrix(predict(Tree, X_tst, type = "vector"), ncol = 2, nrow = n_tst)
}
final_mat_pred <- mean_list(pred)
pred_BAG <- as.factor(ifelse(apply(final_mat_pred, 1, which.max) == 1, -1, 1))
y_test_fac <- as.factor(ifelse(y_test < 30, -1, 1))
confusionMatrix(pred_BAG, y_test_fac, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction -1 1
## -1 2237 179
## 1 263 1574
##
## Accuracy : 0.8961
## 95% CI : (0.8865, 0.9051)
## No Information Rate : 0.5878
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7871
##
## Mcnemar's Test P-Value : 7.884e-05
##
## Sensitivity : 0.8979
## Specificity : 0.8948
## Pos Pred Value : 0.8568
## Neg Pred Value : 0.9259
## Prevalence : 0.4122
## Detection Rate : 0.3701
## Detection Prevalence : 0.4319
## Balanced Accuracy : 0.8963
##
## 'Positive' Class : 1
##
MER comparison: Consensus and Probability Bagging
In this case I am interested in a fully grown, high-variance tree that I want to make more robust using bagging; for this reason I use rpart.control with complexity parameter cp = 0, minsplit = 2 and maxdepth = 30. I have then modified the code above to include the consensus-vote strategy and compute the test error for both approaches. First, I initialize the quantities used in this chunk.
BB <- c(c(1:10), seq(15, 50, 5), seq(60, 100, 10), seq(125, 250, 25))
H <- length(BB)
mer_prob <- numeric(H)
mer_cons <- numeric(H)
y_test_fac <- as.factor(ifelse(y_test < 30, -1, 1))
mean_list <- function(x, B) Reduce(`+`, x) / B
class_err <- function(actual, predicted) mean(actual != predicted)
n_trn <- nrow(train_rs)
n_tst <- nrow(X_tst)
for (h in 1:H) {
indicator <- matrix(data = 0, nrow = n_tst, ncol = 2)
pred <- vector("list", BB[h])
for (b in 1:BB[h]) {
Z <- train_rs[sample(nrow(train_rs), nrow(train_rs), replace = TRUE), ]
Tree <- rpart(critical_temp ~ .,
data = Z, method = "class",
control = rpart.control(cp = 0, minsplit = 2, maxdepth = 30)
)
pred[[b]] <- matrix(predict(Tree, X_tst, type = "prob"), ncol = 2, nrow = n_tst)
predi <- predict(Tree, X_tst, type = "class")
predi01 <- ifelse(predi == -1, 1, 2)
for (i in 1:n_tst) {
k <- indicator[i, predi01[i]]
indicator[i, predi01[i]] <- k + 1
}
}
final_mat_pred <- mean_list(pred, BB[h])
pred_BAG <- as.factor(ifelse(apply(final_mat_pred, 1, which.max) == 1, -1, 1))
mer_prob[h] <- class_err(y_test_fac, pred_BAG)
mat <- t(apply(indicator, 1, function(x) x / sum(x)))
pred_cons <- as.factor(unlist(apply(mat, 1, which.max)))
levels(pred_cons) <- c(-1, 1)
mer_cons[h] <- class_err(y_test_fac, pred_cons)
}
plot(BB, mer_prob,
type = "l", col = "green3",
xlab = "Number of Bootstrap Samples",
ylab = "Test Misclassification Error",
main = "Test Error Rate -Bagging Trees-", lwd = 1
)
points(BB, mer_prob, col = "green3", lwd = 1, cex = 1.2)
legend(
x = 200, y = 0.082, legend = c("probability", "consensus"), cex = 0.9,
bty = "n", pch = c(19, 19), col = c("green3", "darkorange2")
)
points(BB, mer_cons, col = "darkorange2", type = "l", lwd = 1)
points(BB, mer_cons, col = "darkorange2", lwd = 1, cex = 1.2)
In this plot I could also have shown the MER of the single tree, to illustrate how bagging reduces the misclassification error rate on the test set, but since it was considerably higher it would have obscured the difference between the test errors of the consensus and probability approaches.
Probability approach
It is also interesting to compare the variance of this bagging procedure with that of standard classification trees over different bootstrap samples, this time for a fixed number of B = 150 iterations. In the following graph, for W = 50 repetitions, I estimate the test MER of the bagged classifier (obtained by averaging 150 probability matrices over bootstrap samples) and the test MER of a single tree fitted on the first bootstrap sample of that repetition. In other words, I want to show how bagging behaves compared to a standard tree when the underlying data change.
W <- 50
B <- 150
y_test_fac <- as.factor(ifelse(y_test < 30, -1, 1))
class_err <- function(actual, predicted) mean(actual != predicted)
pred <- vector("list")
mer_bag <- numeric(W)
mer_stand <- numeric(W)
X_tst <- as.data.frame(X_tst)
colnames(X_tst) <- colnames(X_test)
for (w in 1:W) {
for (b in 1:B) {
Z <- train_rs[sample(nrow(train_rs), nrow(train_rs), replace = TRUE), ]
Tree <- tree(critical_temp ~ ., data = Z, split = "deviance")
pred[[b]] <- matrix(predict(Tree, X_tst, type = "vector"), ncol = 2, nrow = n_tst)
}
final_mat_pred <- mean_list(pred, B)
pred_BAG <- as.factor(ifelse(apply(final_mat_pred, 1, which.max) == 1, -1, 1))
mer_bag[w] <- class_err(y_test_fac, pred_BAG)
pred_stand <- as.factor(ifelse(apply(pred[[1]], 1, which.max) == 1, -1, 1))
mer_stand[w] <- class_err(y_test_fac, pred_stand)
}
plot(1:W, mer_stand,
type = "l", col = "red3",
ylim = c(min(mer_bag), max(mer_stand)),
xlab = "index of Bootstrap iteration",
main = "Misclassification Error Rate",
ylab = "MER"
)
points(1:W, mer_bag, col = "green3", type = "l")
It is clear how changing the underlying data creates high variability in the test error rate of a single classification tree, while bagging reduces not only the test error but also its variance.
To conclude, I have found that the superconductivity dataset consists of rather clean and informative data. After removing highly correlated variables, the absence of substantial noise in the data leaves little room for improvement in classification metrics (such as accuracy and sensitivity) when shifting from standard techniques to more advanced ones. This analysis has been useful to understand empirically how, in the presence of non-sparse, low-noise data, more advanced models (Bagging and Boosting trees) can be set aside in favor of simpler but more interpretable models. In this case, it is not only the interpretability of standard models that should persuade us to apply them (rather than more advanced ones), but also the small improvement in classification metrics across the different approaches.
I had also tried to fit a logistic regression model on the training data to perform classification; I chose not to include it in the report because, in the diagnostic phase, the structural linearity of the predictor was not validated. Since I did not want to estimate a logistic regression (possibly penalized) with a non-linear predictor, I opted for a non-parametric approach using k-nearest neighbours (k-NN). I implemented it with the function knn of the library class, and performed the cross-validation to choose the optimal neighbourhood size manually.
k-nearest neighbours, with k manually cross-validated
I want to choose the optimal neighbourhood size by manual cross-validation. It is worth mentioning that shuffling the rows of the oversampled training set and then taking the first fold_dim rows as the first fold, the next fold_dim rows as the second, and so on up to the V-th fold, is equivalent to randomly dividing the dataset into V parts and then performing cross-validation. I have arbitrarily chosen the neighbourhood sizes to test.
kk <- c(c(1:12), seq(14, 100, 4))
V <- 10
nrow_X_train_rs <- nrow(X_train_rs) # number of rows of the oversampled training set
y_train_rs_fact <- as.factor(train_rs$critical_temp) # oversampled target as a factor vector
fold_dim <- nrow_X_train_rs / V
sh_index <- sample(nrow_X_train_rs)
X_train_sh <- X_train_rs[sh_index, ]
y_train_sh <- y_train_rs_fact[sh_index]
pt <- ncol(X_test)
nt <- nrow(X_test)
X_tst <- matrix(data = NA, ncol = pt, nrow = nt)
for (j in 1:pt) {
x_tst <- X_train[, j]
X_tst[, j] <- (X_test[, j] - min(x_tst)) / (max(x_tst) - min(x_tst))
}
y_test_fac <- as.factor(ifelse(y_test < 30, -1, 1))
MER <- matrix(data = NA, nrow = V, ncol = length(kk))
calc_class_err <- function(actual, predicted) {
mean(actual != predicted)
}
Actual Cross validation phase:
for (v in 1:V) {
ind_vld_v <- seq(((v - 1) * fold_dim) + 1, fold_dim * (v), 1)
X_vld_v <- X_train_sh[ind_vld_v, ]
X_small_train_v <- X_train_sh[-ind_vld_v, ]
y_vld_v <- y_train_sh[ind_vld_v]
y_small_train_v <- y_train_sh[-ind_vld_v]
for (i in 1:length(kk)) {
pred <- knn(
train = X_small_train_v,
test = X_vld_v, cl = y_small_train_v,
k = kk[i]
)
MER[v, i] <- calc_class_err(y_vld_v, pred)
}
}
MER_means <- colMeans(MER)
Below, the coloured lines represent the misclassification error rate as a function of the neighbourhood size k for each of the V folds, while the black line depicts the average MER over the folds. The optimal k is the one minimizing the average MER; in this case it turns out to be 1.
plot(kk, MER[1, ],
type = "l", col = sample(c("red", "green", "blue", "dodgerblue", "darkorange", "purple")),
ylim = c(0.04, 0.10), lwd = 0.7, ylab = "MER", xlab = "k- size of the neighbourhood", main = "MER behavior across 10-fold CV"
)
for (v in 1:V) {
points(kk, MER[v, ], lwd = 0.7, type = "l", col = sample(c("red", "green", "blue", "dodgerblue", "darkorange", "purple")))
}
points(kk, MER_means, type = "l", col = "black", lwd = 2)
abline(h = min(MER_means), col = "grey", lwd = 1.3, lty = "dashed")
abline(v = kk[which.min(MER_means)], col = "grey", lwd = 1.3, lty = "dashed")
points(kk[which.min(MER_means)], min(MER_means), col = "black", pch = 19, cex = 0.6)
legend(x = 70, y = 0.08, legend = c("Average MER"), bty = "n", cex = 0.8, fill = "black")
index <- which.min(MER_means)
kCV <- kk[index]
pred_kCV <- knn(
train = X_train_rs,
test = X_tst, cl = y_train_rs_fact,
k = kCV
)
confusionMatrix(pred_kCV, y_test_fac, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction -1 1
## -1 2312 145
## 1 163 1633
##
## Accuracy : 0.9276
## 95% CI : (0.9194, 0.9352)
## No Information Rate : 0.5819
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8514
##
## Mcnemar's Test P-Value : 0.3327
##
## Sensitivity : 0.9184
## Specificity : 0.9341
## Pos Pred Value : 0.9092
## Neg Pred Value : 0.9410
## Prevalence : 0.4181
## Detection Rate : 0.3840
## Detection Prevalence : 0.4223
## Balanced Accuracy : 0.9263
##
## 'Positive' Class : 1
##
It can be observed that, despite its simplicity, k-NN performs quite well compared with Boosting and Bagging. Given its simple structure, it is the method I would choose when communicating how the analysis was performed. There exists an adaptive version of the algorithm that works well with sparse data, but that was not the case here. Regarding the neighbourhood size, k = 1 turns out to be the optimal choice, since there is not much noise to be counteracted by a larger k.
Outcomes of this analysis:
Using an unsupervised approach (similar to clustering in a text-mining context), I have discovered 4 clusters of superconductors, to which I assigned labels; I have also checked the coherence of the estimated classes using the critical temperature variable (not used in the clustering phase). In the interpretation phase, working out the proportions of the chemical elements was crucial.
Then the theory of two classification methods, namely Adaptive Boosting and Bagging, has been presented; through simulation, the properties of variance reduction and test-error improvement have been examined, with a focus on the two ways of performing bagging.
I have seen how advanced methods, applied to an already clean dataset, do not necessarily bring large improvements in classification metrics compared to standard methods.
In the end, a cross-validated k-NN turns out to be very accurate for this problem; k = 1 has been chosen by cross-validation.