Practical Introduction


Package and Data Loading


As mentioned during the session setup, load the following packages using the library() function. Because we will be using a data set which contains large numbers, also set scipen to 999 within the options() function, so that values are printed in full rather than in scientific notation.

library(tidyverse)
library(caret)
library(tree)
library(rpart)
library(rpart.plot)

options(scipen = 999)
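
To see what the scipen setting changes, the following minimal sketch formats the same number under R's default penalty and under the value used in this session. The variable names (big, default_print, fixed_print) are illustrative only and are not used later in the practical.

```r
# An arbitrary large value, used purely for illustration
big <- 100000000

# scipen = 0 is R's default: long numbers may be shown in scientific notation
options(scipen = 0)
default_print <- format(big)

# scipen = 999 adds a large penalty to scientific notation,
# so fixed notation is chosen instead
options(scipen = 999)
fixed_print <- format(big)

print(default_print)
print(fixed_print)
```

The first value is shown in scientific notation and the second is written out in full. The option is left at 999 at the end, matching the setting above.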

For this session, we will be using data from the UC Irvine Machine Learning Repository, in particular the Red Wine Quality data set. It can be downloaded directly from the site, or found in the .zip file.

winequality_red <- read_csv("data/winequality-red.csv")
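
One thing to check after loading: the copy of this file distributed by the UCI repository is, as far as we are aware, semicolon-delimited, in which case read_csv() would return a single column. The sketch below is one way to handle either layout. It assumes readr version 2.0 or later (for show_col_types), and read_wine() is a hypothetical helper, not part of any package.

```r
library(readr)

# Hypothetical helper: inspect the header line and choose the delimiter.
# Assumes the file uses either ";" or "," as its separator.
read_wine <- function(path) {
  header <- readLines(path, n = 1)
  delim <- if (grepl(";", header, fixed = TRUE)) ";" else ","
  read_delim(path, delim = delim, show_col_types = FALSE)
}

# Usage, with the same path as above:
# winequality_red <- read_wine("data/winequality-red.csv")
```

If the copy in the .zip file is already comma-separated, the read_csv() call above works as written and this helper is unnecessary.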




This data set contains 1599 observations of the following 12 variables:

  • Fixed Acidity (‘fixed acidity’) - the fixed (nonvolatile) acids in the wine.
  • Volatile Acidity (‘volatile acidity’) - the amount of acetic acid in the wine, related to vinegar flavours.
  • Citric Acid (‘citric acid’) - amount of citric acid in the wine, related to “freshness”.
  • Residual Sugar (‘residual sugar’) - the amount of sugar remaining after fermentation.
  • Chlorides (‘chlorides’) - the amount of salt in the wine.
  • Free Sulfur Dioxide (‘free sulfur dioxide’) - A free form of SO2, preventing microbial growth and oxidation.
  • Total Sulfur Dioxide (‘total sulfur dioxide’) - Total SO2 amount, calculated from both free and bound forms.
  • Density (‘density’) - The density of the wine, which depends on its alcohol and sugar content.
  • pH (‘pH’) - Describes how acidic or basic the wine is (between 0, very acidic, and 14, very basic).
  • Sulphates (‘sulphates’) - A wine additive which contributes to SO2 levels, acts as an antimicrobial and antioxidant.
  • Alcohol (‘alcohol’) - The percent alcohol content of the wine.
  • Quality (‘quality’) - The output variable, a score from 0 to 10 based upon sensory data.
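
Once the data is loaded, it is worth confirming that it matches the description above. The following is a minimal sketch using base R only. The names expected_cols and missing_wine_cols() are hypothetical and introduced here for illustration; the column names are those listed above.

```r
# The 12 column names listed above
expected_cols <- c(
  "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
  "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density",
  "pH", "sulphates", "alcohol", "quality"
)

# Hypothetical helper: returns the expected column names absent from `df`
missing_wine_cols <- function(df) {
  setdiff(expected_cols, names(df))
}

# Usage, once winequality_red has been loaded:
# missing_wine_cols(winequality_red)   # empty if every column is present
# dim(winequality_red)                 # number of rows and columns
# summary(winequality_red$`fixed acidity`)
```

Note the backticks in the last line: several column names contain spaces, so they must be wrapped in backticks when referred to with $ or inside tidyverse functions.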