benjburgess.github.io

Support Vector Machines

Advanced Tutorial

Introduction

As I’ve detailed in the introduction to support vector machines (SVMs), SVMs are a machine learning tool which can be used to classify data into two or more groups. Outlined below is an example of how an SVM approach can be applied to a relatively complex dataset, in this case one describing the properties of a large number of different wines. Here, we will build an SVM capable of accurately classifying the quality of a wine based on its chemical properties, where quality is a category between 1 and 10, with 1 reflecting the lowest quality wine and 10 the highest. For this example, we will be using R.

The first thing to do is download the data and save it to your local machine. The data is available from Kaggle and is from an academic paper by Cortez et al. (2009).

Throughout this tutorial, code chunks will have a blue background and output will have a green background. This tutorial is meant to be a relatively advanced example of SVMs, particularly with regards to the number of different classes (i.e., wine qualities) and the data cleaning involved. For a simpler SVM workflow in R, see the Introductory Tutorial on SVMs. Importantly, this tutorial also addresses some of the issues of analyses using SVMs.

Scenario

For this tutorial we will attempt to find a solution to the following scenario.

Let’s imagine we work at a wine merchant’s which specialises in mid-range wines. Our company is about to receive a new shipment of white wines and wants to know how to price them. In order to do this, the company needs to work out their quality, with higher quality wines likely to command a higher price than those of a lower quality. The company knows that these wines will all be mid-range wines, but does not know whether each wine will be of i) a lower-medium quality (quality class of 5); ii) a medium quality (quality class of 6); or iii) a higher-medium quality (quality class of 7). The company has a large-scale dataset on previous wines detailing their quality and a number of chemical properties. As such, the company wants us to generate a model which is capable of determining each wine’s quality from its chemical properties. In particular, the company is keen for the model not to misclassify wines of lower-medium (5) quality as higher-medium (7) quality (and vice versa), as this may result in customers being substantially overcharged for wines (and potentially dissatisfied) or substantially undercharged (lower profits for the company). The company also wants to avoid wines of medium (6) and higher-medium (7) quality being assigned a lower quality as much as possible, given this may harm profits (but is unlikely to unduly harm customer satisfaction).

As detailed above, we are going to attempt to meet the company’s aims using a support vector machine.

Packages and set.seed()

Firstly, we need to load the various packages used in this tutorial. If any of these packages are not installed on your local machine, they can be installed via the install.packages() function. We will also set the seed, simply to make the results reproducible.

library(ggplot2)
library(data.table)
library(scales)
library(scutr)
library(caret)
library(caTools)
library(dplyr)

set.seed(7654)

Data loading

The next thing we need to do is load in the data. Load it from wherever you saved the file on your local machine, using whatever filename you gave it (see the above link). This data represents the dataset referred to in the scenario.

At this point we will also change some of the column names. In the original file, several column names contain blank spaces, which R isn’t a fan of. The following code converts these blank spaces to underscores.

df <- fread("~/R_sandbox/Wine_Data_Kaggle.csv", data.table=FALSE)

colnames(df) <- gsub(" ", "_", colnames(df))

head(df)
##    type fixed_acidity volatile_acidity citric_acid residual_sugar chlorides
## 1 white           7.0             0.27        0.36           20.7     0.045
## 2 white           6.3             0.30        0.34            1.6     0.049
## 3 white           8.1             0.28        0.40            6.9     0.050
## 4 white           7.2             0.23        0.32            8.5     0.058
## 5 white           7.2             0.23        0.32            8.5     0.058
## 6 white           8.1             0.28        0.40            6.9     0.050
##   free_sulfur_dioxide total_sulfur_dioxide density   pH sulphates alcohol
## 1                  45                  170  1.0010 3.00      0.45     8.8
## 2                  14                  132  0.9940 3.30      0.49     9.5
## 3                  30                   97  0.9951 3.26      0.44    10.1
## 4                  47                  186  0.9956 3.19      0.40     9.9
## 5                  47                  186  0.9956 3.19      0.40     9.9
## 6                  30                   97  0.9951 3.26      0.44    10.1
##   quality
## 1       6
## 2       6
## 3       6
## 4       6
## 5       6
## 6       6

Data cleaning

At this point, checking our data reveals a few rows with missing values. We will remove these rows, as they would complicate any analysis.

original.dim <- dim(df)[1]

df <- na.omit(df)

new.dim <- dim(df)[1]

original.dim - new.dim
## [1] 34

As you can see, 34 rows have been removed from our dataframe.

At this stage we can remove the redundant data from our dataset. Our project brief is to develop a model which is capable of predicting the quality of mid-range white wines. As such, we can remove all data on red wines. We can also remove all data on higher and lower quality wines, as these are not part of the brief and would only complicate our analyses.

This subsetting can be done with the following code.

A quick note on the type column: SVMs require numeric inputs, so a categorical variable such as type (either ‘white’ or ‘red’) would normally need to be encoded as two dummy variables, namely white and red. Wherever type is ‘white’, white would take a value of 1 and red a value of 0, and vice versa (see this link on stackoverflow for a more thorough explanation of why this is necessary). Here, however, once we have subset the data to white wines only, the type column carries no information at all, so we can simply remove it. A hedged sketch of the dummy-variable encoding is given after the code below.

#only select white wines
df <- subset(df, type == "white")

#remove 'type' column as our dataset is now entirely white wines
df <- df[,-1]

#only select mid quality wines and set quality as a factor
df <- subset(df, quality %in% c(5, 6, 7))

df$quality <- factor(df$quality)
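
As an aside, here is a minimal, purely illustrative sketch of the dummy-variable encoding mentioned above, using caret’s dummyVars() function. The data frame df_both is hypothetical and assumes both red and white wines had been retained.

#hypothetical sketch: one-hot encode 'type' if both wine colours had been retained
#df_both is an assumed data frame which still contains the original 'type' column
df_both$type <- factor(df_both$type)
dummies <- dummyVars(~ type, data = df_both)
type_encoded <- predict(dummies, newdata = df_both)
df_both <- cbind(df_both[, colnames(df_both) != "type"], type_encoded)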

At this point we need to split our data into two distinct subsets. The first will be used for training the SVM (the train dataset) and the second will be used for testing it (the test dataset). Here, we will split our data while preserving the same ratio of wines of each quality (5, 6, 7) in both the test and train datasets.

Our analysis will be run on the training dataset, with the test dataset held back until the very end of the analysis.

For our analysis, the training dataset will comprise 70% of the original dataset and the test dataset the remaining 30%. Although a 70/30 split is relatively common, other analyses may use a different split (e.g., 67/33) for the training and test datasets.

This can be done using the following code.

split_determined  <- sample.split(df$quality, SplitRatio = 0.7)
df.train <- df[split_determined,]
df.test  <- df[!split_determined,]

Now, if we look at our train and test datasets, one thing we can easily notice is that the different variables are on very different scales.

head(df.train)
##   fixed_acidity volatile_acidity citric_acid residual_sugar chlorides
## 1           7.0             0.27        0.36           20.7     0.045
## 2           6.3             0.30        0.34            1.6     0.049
## 3           8.1             0.28        0.40            6.9     0.050
## 4           7.2             0.23        0.32            8.5     0.058
## 5           7.2             0.23        0.32            8.5     0.058
## 7           6.2             0.32        0.16            7.0     0.045
##   free_sulfur_dioxide total_sulfur_dioxide density   pH sulphates alcohol
## 1                  45                  170  1.0010 3.00      0.45     8.8
## 2                  14                  132  0.9940 3.30      0.49     9.5
## 3                  30                   97  0.9951 3.26      0.44    10.1
## 4                  47                  186  0.9956 3.19      0.40     9.9
## 5                  47                  186  0.9956 3.19      0.40     9.9
## 7                  30                  136  0.9949 3.18      0.47     9.6
##   quality
## 1       6
## 2       6
## 3       6
## 4       6
## 5       6
## 7       6

This is an issue for SVMs: if variables are on different scales, the numerically larger variables can be given greater importance than the numerically smaller ones, which has the potential to bias our model. As such, we need to rescale each variable to be on the same scale. This can be done with the following code; see this stackoverflow link for more details.

train_scaling <- preProcess(df.train, method=c("scale", "center"))
df.train <- predict(train_scaling, df.train)
df.test  <- predict(train_scaling, df.test)

This code ensures that the test and train datasets are both scaled in the same way. If they weren’t, when we came to test the SVM all our model predictions would be wrong, as the test and train datasets would be subtly different. As such, our test dataset is rescaled using the scaling factors from our train dataset.
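
As a quick sanity check, each numeric variable in the train dataset should now have a mean of approximately 0 and a standard deviation of 1 (the test dataset will be close to, but not exactly, these values, as it was scaled using the train dataset’s factors):

#check the scaling of the train dataset (column 12 is the quality factor)
round(colMeans(df.train[, -12]), 3)
round(apply(df.train[, -12], 2, sd), 3)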

If we have a look at our train dataset, we can plot a histogram of the counts of wines in each of the different quality categories.

ggplot(df.train, aes(x=as.numeric(as.character(quality)))) + 
  geom_histogram(color="black", fill="white") + 
  xlab("quality")

As we can see, there are more wines of quality 6 than there are of quality 5 or 7. This may be an issue for our SVM, as it may therefore be inherently biased towards predicting wines of quality 6. As such, we could in theory get a reasonable accuracy for our SVM if it predicted group 6 accurately but the other groups poorly; this would represent a biased model. One way we can address this is with a technique called SMOTE or SCUT, which rebalances our dataset to have equal proportions of each of the different wine qualities. SMOTE (explained more here) is a method for generating new data based on existing data in the same group. I’m not going to go into it too much here, but this technique uses the nearest neighbour algorithm to generate new data. SMOTE is the method to use if your data contains only two groups, while SCUT is the method to use if your data comprises three or more groups. Here, we have three different groups, so we will be using SCUT. Note that SCUT only needs to be applied to the train dataset, not the test dataset.

SCUT is available from the scutr package, and can be run using the following code.

df.train <- SCUT(df.train, "quality", oversample="oversample_smote")
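
Before replotting, we can confirm numerically that the classes are now balanced; after SCUT, each quality should contain the same number of wines:

#check the counts of each wine quality after rebalancing
table(df.train$quality)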

If we now replot our dataset, using the following code, we can see that we’ve got equal counts of each of the wine qualities in our dataset.

ggplot(df.train, aes(x=as.numeric(as.character(quality)))) + 
  geom_histogram(color="black", fill="white") + 
  xlab("quality")

Our dataset is relatively complex, containing data on a dozen or so chemical properties of each wine. At this stage, we are going to try to reduce some of this complexity by applying a dimensionality reduction procedure to the dataset. In essence, this reformats our dataset into a number of components, each accounting for a varying amount of the variance within our dataset. These components no longer correspond to individual variables; instead, the original variables are grouped together within them. Here, we will use principal component analysis (PCA) to reduce the dimensionality of our dataset. This can be done with the following code:

#conduct a principal component analysis using the train dataset,
#excluding the quality column (column 12)
pca_train <- prcomp(df.train[-c(12)])

summary(pca_train)
## Importance of components:
##                          PC1    PC2    PC3    PC4    PC5     PC6     PC7
## Standard deviation     1.861 1.2565 1.1768 1.1507 1.0297 0.96716 0.88395
## Proportion of Variance 0.293 0.1336 0.1172 0.1120 0.0897 0.07914 0.06611
## Cumulative Proportion  0.293 0.4266 0.5438 0.6558 0.7455 0.82463 0.89074
##                            PC8    PC9    PC10    PC11
## Standard deviation     0.75925 0.6441 0.53317 0.12568
## Proportion of Variance 0.04877 0.0351 0.02405 0.00134
## Cumulative Proportion  0.93951 0.9746 0.99866 1.00000

Now, we need to make a decision about the number of principal components to retain for our analyses. There are various ways of selecting principal components; two of the most common are the variance explained criterion and the Kaiser-Guttman criterion (see this link for a thorough explanation of selecting principal components). In short, the variance explained criterion suggests selecting principal components, starting with PC1 and continuing in descending order of variance explained, until a cumulative variance threshold is met (with 70-90% of variance being common thresholds). In contrast, the Kaiser-Guttman criterion says that only those PCs which have a variance (squared SD) of over 1 should be selected. Here, the two methods result in a different number of PCs being selected. If we use a threshold of 80% variance explained, we would select PCs 1-6; however, if we use the Kaiser-Guttman criterion, then only PCs 1-5 would be selected. For the purposes of this tutorial, we are going to use the variance explained criterion with a threshold of 90%, and therefore select the first eight PCs. Selecting PCs 1-8 can be done using the following code (which also applies the identical PCA transformation to the test dataset).
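
As a sketch, both criteria can also be applied programmatically from the prcomp() output, which avoids reading the cumulative proportions off the summary table by eye:

#proportion of variance explained by each PC
prop_var <- pca_train$sdev^2 / sum(pca_train$sdev^2)

#variance explained criterion: number of PCs needed to explain >= 90% of variance
which(cumsum(prop_var) >= 0.90)[1]

#Kaiser-Guttman criterion: number of PCs with variance (squared SD) greater than 1
sum(pca_train$sdev^2 > 1)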

#For the train dataset, select the first 8 PCs
trainp <- as_tibble(pca_train$x) %>% select(PC1:PC8)

#Apply the same PCA transformation to the test dataset and select the first 8 PCs
testp <- as_tibble(predict(pca_train, newdata = df.test[-c(12)])) %>% select(PC1:PC8)

#add the qualities to each dataset again
df.train <- cbind(trainp, df.train[c(12)])
df.test <- cbind(testp, df.test[c(12)])

At this point, we are able to move on to training the SVM.

Training the SVM

At this point we are ready to start training our SVM. However, we first have to make several decisions about our model. Firstly, we have to decide which kernel to use. There are many resources which discuss kernel choice; see the links provided on the initial page for support vector machines (SVMs).

For this analysis, we will be using a radial kernel.

The next decision we need to make is which values of the cost and sigma parameters to consider. Again, for more information on these parameters, consult the links provided on the initial page for support vector machines (SVMs). Fortunately, the package we are using (caret) allows us to test various different parameter values and then select the best ones.

Hsu et al. (2003) advise that SVMs should be developed in two distinct stages. Firstly, a relatively coarse grid of parameter values is used to train the model against the training dataset (i.e., training the model), at which point the best parameter values are identified. These parameter values are then carried forward to the next stage (i.e., testing the model). For this tutorial, we are not going to further tune our SVM, but will instead use a cross-validation approach when training it (see below).

The range of parameter values to test is at the discretion of the user, although values are usually spaced on a logarithmic scale. Hsu et al. (2003) suggest that values for cost and sigma could be \(2^{-5}\), \(2^{-3}\), … \(2^{15}\). In the interests of speed, for this analysis we will be using cost and sigma parameter values of \(2^{-2}\), \(2^{-1}\), … \(2^{5}\).

When training the SVM, the best parameter values are those which generate an SVM that minimises the number of wines assigned to the incorrect class (i.e., the model which assigns the most wines to the correct class). As part of this procedure, we will be using k-fold cross-validation as a way of increasing the robustness of our results. I’m not going to go into k-fold cross-validation here, but for more information see this explanation.

Following this, our SVM can be trained using the following code.

radial_grid <- expand.grid(C = 2^c(-2:5),
                           sigma = 2^c(-2:5))


svm_trained <- train(quality ~., 
                     data = df.train, 
                     method = "svmRadial", 
                     trControl = trainControl(method="repeatedcv", number=10, repeats=2), 
                     tuneGrid = radial_grid)




#show the summary of our svm training
svm_trained
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 3156 samples
##    8 predictor
##    3 classes: '5', '6', '7' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 2 times) 
## Summary of sample sizes: 2841, 2840, 2840, 2841, 2841, 2841, ... 
## Resampling results across tuning parameters:
## 
##   C      sigma  Accuracy   Kappa    
##    0.25   0.25  0.6451257  0.4676681
##    0.25   0.50  0.6559038  0.4838288
##    0.25   1.00  0.6715965  0.5073636
##    0.25   2.00  0.6980397  0.5470835
##    0.25   4.00  0.5719254  0.3578845
##    0.25   8.00  0.4851110  0.2276649
##    0.25  16.00  0.4588145  0.1882183
##    0.25  32.00  0.4469333  0.1703947
##    0.50   0.25  0.6612815  0.4919022
##    0.50   0.50  0.6909094  0.5363507
##    0.50   1.00  0.7271968  0.5907775
##    0.50   2.00  0.7360581  0.6040736
##    0.50   4.00  0.6896420  0.5344444
##    0.50   8.00  0.6381556  0.4572103
##    0.50  16.00  0.6048835  0.4072993
##    0.50  32.00  0.5890406  0.3835410
##    1.00   0.25  0.6777634  0.5166369
##    1.00   0.50  0.7126208  0.5689209
##    1.00   1.00  0.7474641  0.6211820
##    1.00   2.00  0.7666333  0.6499432
##    1.00   4.00  0.7452499  0.6178629
##    1.00   8.00  0.6874208  0.5311164
##    1.00  16.00  0.6503523  0.4755097
##    1.00  32.00  0.6142295  0.4213232
##    2.00   0.25  0.6970928  0.5456326
##    2.00   0.50  0.7313002  0.5969374
##    2.00   1.00  0.7566504  0.6349649
##    2.00   2.00  0.7713847  0.6570710
##    2.00   4.00  0.7561827  0.6342578
##    2.00   8.00  0.6950278  0.5425233
##    2.00  16.00  0.6566884  0.4850162
##    2.00  32.00  0.6221560  0.4332103
##    4.00   0.25  0.7111997  0.5667908
##    4.00   0.50  0.7393889  0.6090715
##    4.00   1.00  0.7628328  0.6442365
##    4.00   2.00  0.7734431  0.6601577
##    4.00   4.00  0.7565002  0.6347343
##    4.00   8.00  0.6951861  0.5427597
##    4.00  16.00  0.6566884  0.4850162
##    4.00  32.00  0.6221560  0.4332103
##    8.00   0.25  0.7200630  0.5800823
##    8.00   0.50  0.7474711  0.6212010
##    8.00   1.00  0.7648943  0.6473304
##    8.00   2.00  0.7732849  0.6599201
##    8.00   4.00  0.7558663  0.6337829
##    8.00   8.00  0.6953448  0.5429978
##    8.00  16.00  0.6566884  0.4850162
##    8.00  32.00  0.6221560  0.4332103
##   16.00   0.25  0.7265569  0.5898267
##   16.00   0.50  0.7508009  0.6261936
##   16.00   1.00  0.7644226  0.6466240
##   16.00   2.00  0.7721763  0.6582569
##   16.00   4.00  0.7555493  0.6333075
##   16.00   8.00  0.6953448  0.5429978
##   16.00  16.00  0.6566884  0.4850162
##   16.00  32.00  0.6221560  0.4332103
##   32.00   0.25  0.7343146  0.6014669
##   32.00   0.50  0.7466760  0.6200088
##   32.00   1.00  0.7644226  0.6466226
##   32.00   2.00  0.7724943  0.6587336
##   32.00   4.00  0.7555493  0.6333075
##   32.00   8.00  0.6953448  0.5429978
##   32.00  16.00  0.6566884  0.4850162
##   32.00  32.00  0.6221560  0.4332103
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 2 and C = 4.

So, as we can see from the model training, the best performing model was when sigma equals 2 and cost equals 4. Here, the best accuracy of the model was ~77.3%, which means that across the repeated 10-fold cross-validation for these parameter values, ~77.3% of all wines were assigned the correct category.
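
Rather than reading these values off the table above, they can also be extracted directly from the trained caret model object:

#extract the best parameter values and the corresponding cross-validated accuracy
svm_trained$bestTune
max(svm_trained$results$Accuracy)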

Testing the SVM

This analysis has so far been based on the training dataset, but at this stage we can now shift our attention to the test dataset we subset from the original data. Using our SVM we can now predict the quality of every wine in the test dataset and compare our predictions to the actual class of the wine. From this we can calculate various different metrics which we can use to assess the performance of our SVM. This can be done using the following code.

#predict the qualities of wines from the test dataset
pred <- predict(svm_trained, df.test)


cm <- caret::confusionMatrix(pred, df.test$quality) 
cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   5   6   7
##          5 335 217  43
##          6  86 368  90
##          7  13  71 130
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6157          
##                  95% CI : (0.5892, 0.6417)
##     No Information Rate : 0.4848          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3931          
##                                           
##  Mcnemar's Test P-Value : 3.713e-16       
## 
## Statistics by Class:
## 
##                      Class: 5 Class: 6 Class: 7
## Sensitivity            0.7719   0.5610  0.49430
## Specificity            0.7171   0.7475  0.92294
## Pos Pred Value         0.5630   0.6765  0.60748
## Neg Pred Value         0.8694   0.6440  0.88323
## Prevalence             0.3208   0.4848  0.19438
## Detection Rate         0.2476   0.2720  0.09608
## Detection Prevalence   0.4398   0.4021  0.15817
## Balanced Accuracy      0.7445   0.6542  0.70862

There’s an awful lot of information here, so let’s break it down piece by piece to assess our SVM.

Firstly, let’s start with the confusion matrix using the following code.

cm$table
##           Reference
## Prediction   5   6   7
##          5 335 217  43
##          6  86 368  90
##          7  13  71 130

The confusion matrix is basically a way of allowing us to visually assess whether our SVM assigns wines (from the test dataset) their correct qualities or, if not, which qualities they are assigned instead. From our matrix we can see that the SVM broadly classifies wines into the correct qualities (as shown by the large numbers along the major diagonal). However, there are relatively large numbers of wines that have not been classified correctly (as shown by the numbers off the major diagonal). This indicates that our SVM may not be as accurate as we would perhaps have hoped. However, this is just a visual inspection, and our initial impressions can be confirmed (or contradicted) using some statistics.
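
In fact, the overall accuracy reported below is simply the proportion of wines lying on this major diagonal, which we can verify directly:

#overall accuracy: wines on the major diagonal divided by all wines in the test dataset
sum(diag(cm$table)) / sum(cm$table)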

Let’s start by considering the overall statistics for our model, which can be called using the following code.

cm$overall
##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##   6.156689e-01   3.930815e-01   5.891513e-01   6.416828e-01   4.848485e-01 
## AccuracyPValue  McnemarPValue 
##   2.966695e-22   3.712686e-16

So, of the metrics here, the main ones we are going to consider are accuracy and kappa.

If we start with accuracy, the accuracy of our SVM on the test dataset is ~61.6%, which is lower than the accuracy we obtained from the k-fold cross-validation at the training stage (~77.3%). Overall, this suggests that our SVM may be slightly overfitting the training data.

The next metric to consider is the kappa value. The kappa value is an alternative measure of accuracy which compares the observed accuracy of our SVM to the accuracy expected by chance. I’m not going to explain kappa in any great detail (but see this comprehensive explanation on stackoverflow); as with the accuracy metric, the higher the value of kappa the better. Kappa generally provides a less biased measure of the accuracy of a model. As such, the value of kappa here (~0.39) indicates that our SVM does a moderate job of classifying wine quality correctly.
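
For intuition, kappa can be calculated by hand from the confusion matrix as \(\kappa = \frac{p_o - p_e}{1 - p_e}\), where \(p_o\) is the observed accuracy and \(p_e\) is the accuracy expected by chance; a minimal sketch:

#observed accuracy: proportion of wines on the major diagonal
n <- sum(cm$table)
p_o <- sum(diag(cm$table)) / n

#expected accuracy: chance agreement given the row and column totals
p_e <- sum(rowSums(cm$table) * colSums(cm$table)) / n^2

#Cohen's kappa
(p_o - p_e) / (1 - p_e)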

Overall, these are two of the most commonly used metrics for assessing the performance of an SVM.

The final thing we can do when assessing our SVM performance is to consider how the performance varies between different wine qualities (much as we visually did for the confusion matrix above). This can be done by running the following code.

cm$byClass
##          Sensitivity Specificity Pos Pred Value Neg Pred Value Precision
## Class: 5   0.7718894   0.7170838      0.5630252      0.8693931 0.5630252
## Class: 6   0.5609756   0.7474892      0.6764706      0.6440049 0.6764706
## Class: 7   0.4942966   0.9229358      0.6074766      0.8832309 0.6074766
##             Recall        F1 Prevalence Detection Rate Detection Prevalence
## Class: 5 0.7718894 0.6511176  0.3207687     0.24759793            0.4397635
## Class: 6 0.5609756 0.6133333  0.4848485     0.27198817            0.4020695
## Class: 7 0.4942966 0.5450734  0.1943829     0.09608278            0.1581670
##          Balanced Accuracy
## Class: 5         0.7444866
## Class: 6         0.6542324
## Class: 7         0.7086162

As we can see, we have a variety of different metrics for each of the different wine qualities. Here, we are going to focus predominantly on sensitivity, specificity, and balanced accuracy, but an explanation of the other metrics can be found here.

We can cut out the other metrics, and focus on these three, using the following code. First, let’s start with sensitivity. For a given wine quality, sensitivity is calculated as \(\frac{TruePositive}{TruePositive + FalseNegative}\). In other words, sensitivity is the number of wines which were correctly assigned a given quality, divided by the total number of wines which are really of that quality. For example, the sensitivity for wines of quality 5 can be calculated as \(\frac{335}{335+86+13} = 0.7719\) (using the confusion matrix above, where the reference qualities are in the columns). This indicates that, of all wines which are really of quality 5, 77.2% were correctly classified. As we can see below, sensitivity for quality 5 wines is fairly high. However, our SVM appears to be less good at correctly classifying wines of quality 6 or 7.

#Sensitivity
cm$byClass[,1]*100
## Class: 5 Class: 6 Class: 7 
## 77.18894 56.09756 49.42966
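
These sensitivities can be cross-checked by hand against the confusion matrix, since caret places the reference classes in the columns:

#sensitivity by hand: correct predictions divided by the column (reference) totals
diag(cm$table) / colSums(cm$table)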

Secondly, we will consider specificity. For a given wine quality, specificity is calculated as \(\frac{TrueNegative}{TrueNegative + FalsePositive}\). In other words, specificity is the number of wines correctly not assigned a given quality, divided by the total number of wines which are really of a different quality. For example, the specificity for wines of quality 7 can be calculated as \(\frac{335+217+86+368}{335+217+86+368+13+71} = 0.9229\) (using the confusion matrix above). This indicates that, of all wines which are really of quality 5 or 6, 92.3% were not incorrectly classified as being quality 7. As we can see below, specificity for all the wine qualities is fairly high, in particular for wines of quality 7.

#Specificity
cm$byClass[,2]*100
## Class: 5 Class: 6 Class: 7 
## 71.70838 74.74892 92.29358

Finally, we will consider balanced accuracy. Balanced accuracy is simply the average of the sensitivity and specificity for a given wine quality. As we can see below, overall, the SVM has a relatively high balanced accuracy for wines of quality 5 and 7 (although we might like the balanced accuracy for wines of quality 6 to be higher).

#Balanced accuracy
cm$byClass[,11]*100
## Class: 5 Class: 6 Class: 7 
## 74.44866 65.42324 70.86162

Meeting scenario aims

Overall, we have put together an SVM which appears to do a reasonable job of classifying mid-range white wines into different qualities. While we have assessed a number of different metrics, we need to compare the company’s aims to the model performance. Firstly, we have created an SVM which appears to do an adequate job of classifying these wines; this meets the first objective of the company. Secondly, the company had a number of specific aims which it wanted the model to meet (see below).

“In particular, the company is keen for the model not to misclassify wines of lower-medium (5) quality as higher-medium (7) quality (and vice versa) as this may result in customers being particularly overcharged for wines (and potentially dissatisfied customers) or particularly undercharged for wines (lower profits for the company).”

For this aim, \(\frac{13}{335+86+13}=0.0300\) (i.e., 3%) of lower-medium (5) quality wines were incorrectly assigned the higher-medium (7) quality class. Likewise, \(\frac{43}{43+90+130}=0.1635\) (i.e., 16%) of higher-medium (7) quality wines were incorrectly assigned the lower-medium (5) quality class. Overall, this suggests that using our SVM to classify these wines will meet this company aim.

“The company also wants to avoid wines of medium (6) and higher-medium (7) being assigned a lower quality as much as possible given this may harm profits (but is unlikely to unduly harm customer satisfaction).”

For this aim, \(\frac{368+71+130}{217+368+71+43+90+130}=0.6192\) (i.e., ~62%) of medium (6) and higher-medium (7) quality wines are assigned their correct (or a higher) quality class (reading the true qualities from the columns of the confusion matrix above). The majority of these wines would therefore be priced at or above their true quality, although the remaining ~38% being assigned a lower quality suggests there is still some room for improvement against this aim.
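
Both of these checks can be reproduced programmatically from the confusion matrix (rows are predictions, columns are the reference qualities); a short sketch:

tab <- cm$table

#proportion of true quality 5 wines misclassified as quality 7, and vice versa
tab["7", "5"] / sum(tab[, "5"])
tab["5", "7"] / sum(tab[, "7"])

#proportion of true quality 6 and 7 wines assigned their correct (or a higher) quality
(tab["6", "6"] + tab["7", "6"] + tab["7", "7"]) / sum(tab[, c("6", "7")])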

Improving the SVM

As we’ve shown from the model testing, we have developed an SVM which broadly meets the aims of the company in classifying the qualities of wines. However, there are areas in which the model falls short, and we may therefore wish to improve it further. One approach may be to tune the SVM over a narrower range of cost and sigma values, although we would need to ensure that we do not overfit the model in doing so. If the company altered its aims and simply wanted the SVM to identify whether a wine was of a ‘higher’ or ‘lower’ quality, then we could recode our data by classifying wines with a quality of 6 or 7 as ‘higher’ and those with a quality of 5 as ‘lower’ (see the sketch below). Rerunning the SVM with this recoded data may result in an SVM with a higher predictive ability. Finally, it may be that our relatively limited dataset (~6000 data points) is inherently noisy (i.e., lacking clear boundaries between the different wine qualities), and that generating an SVM capable of classifying wine qualities with a higher degree of accuracy may not be possible. As such, it is important to note that SVMs (like all other approaches) have conditions under which they excel, and others under which they perform relatively poorly. We might therefore obtain a higher accuracy by using an alternative machine learning approach (e.g., decision trees, random forests, or KNN).
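
As an illustration, a minimal sketch of this hypothetical recoding (applied to the quality column of the cleaned data, before splitting and scaling) might look like:

#hypothetical recoding of quality into a binary 'higher'/'lower' variable
df$quality <- factor(ifelse(df$quality %in% c(6, 7), "higher", "lower"))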

Concluding remarks

In this tutorial, we have built a support vector machine (SVM) which meets the overall aims of the company. We can classify medium quality wines with reasonable accuracy and, importantly, we broadly meet the specific aims of the company. In doing so, this tutorial has illustrated various approaches to data cleaning, data processing, and dimensionality reduction, as well as the various measures which can be used to assess SVM performance. Finally, we have also discussed potential options for improving our SVM.

References

Cortez, P., Cerdeira, A., Almeida, F., Matos, T., & Reis, J. (2009). Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, 47(4), 547-553.

Hsu, C. W., Chang, C. C., & Lin, C. J. (2003). A practical guide to support vector classification. Technical report, Department of Computer Science, National Taiwan University.