On January 12, 2010, a magnitude 7.0 earthquake struck Haiti, causing significant damage and affecting approximately 3 million people. In the wake of the disaster, aid groups worked to locate displaced persons in order to provide food and water. However, because infrastructure was destroyed on a large scale over a wide area, additional assistance was needed to locate people quickly.
Little is left of a neighborhood on a hillside near downtown Port-au-Prince on Jan. 15. More than a million people were displaced by the quake. (David Gilkey/NPR)
Displaced persons created makeshift shelters out of blue tarps. To locate displaced persons quickly, high-resolution geo-referenced images of the destroyed areas were captured from aircraft. The volume of imagery generated was too large for aid workers to process in time to supply aid. Therefore, a team from the Rochester Institute of Technology used data-mining algorithms to analyze the images and identify blue tarps. The goal was to locate displaced persons as quickly as possible and communicate their locations to rescue workers so they could deliver resources in time.
Sample geo-referenced image used for the analysis
As the final project for SYS 6018 - Data Mining, we were assigned to use techniques learned in the course to build models that locate as many of the displaced persons as possible from the provided imagery data, as accurately and as quickly as possible. The data made available to students consisted of a CSV of red, green, and blue pixel values plus a class label indicating whether each pixel represented a blue tarp or something else, such as vegetation.
We were also provided multiple text files containing data and extra information to be used as a final hold-out test set. We were expected to wrangle these data into a usable format.
The US Government spent $1.5B on Haiti disaster relief by the end of 2010. For this project, we will assume that $2,000,000 was allocated to our team to deliver supplies to displaced individuals in the immediate aftermath. Our team has been assigned an area where 2,022 displaced people are believed to be (this is the number of blue tarps in our training data set). Anything less than an 80% delivery success rate will be considered a disaster relief failure. 80% of 2,022 people is 1,618; therefore, if we fail to locate more than 404 of our blue tarps, our mission would be considered a failure.
These considerations will help guide the selection of thresholds later in the analysis.
| Item | Amount |
|------|--------|
| Budget | $2,000,000 |
| Cost per Delivery (True Positive) | -$750 |
| Cost per Mis-Delivery (False Positive) | -$300 |
| Cost per Missed Delivery (False Negative) | -$4,946 |
| Cost per True Negative | $0 |
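The cost calculations later in this analysis refer to these per-outcome costs through the constants TP_C, FP_C, and FN_C. A minimal definition consistent with the table above (the budget and max_FP objects are illustrative additions, not project requirements):
budget <- 2000000 #total funds assumed allocated to the team
TP_C <- -750 #cost per delivery (true positive)
FP_C <- -300 #cost per mis-delivery (false positive)
FN_C <- -4946 #cost per missed delivery (false negative)
#Delivering to all 2,022 tarps would cost 2,022 * $750 = $1,516,500, leaving
#$483,500 for mis-deliveries, i.e. at most about 1,611 false positives.
max_FP <- floor((budget - 2022*abs(TP_C))/abs(FP_C))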
tic("Run Time")
The data provided for analysis was generated from overhead images and stored as a three-channel output: the red, green, and blue values for each pixel. Each pixel also carried a class label indicating whether it was a blue tarp or something else, such as vegetation or soil. The RGB color model is an additive model: the integer values of the red, green, and blue channels are combined to represent a single color, and each component is typically stored as an 8-bit integer ranging from 0 to 255.
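As a toy illustration of the additive model (not part of the project code), R's built-in rgb() function shows how three 8-bit components map to one color:
#Toy example: combine 8-bit red, green, and blue components into a single color
rgb(255, 0, 0, maxColorValue = 255) #"#FF0000", pure red
rgb(0, 0, 255, maxColorValue = 255) #"#0000FF", pure blue
rgb(60, 90, 160, maxColorValue = 255) #an arbitrary mixed color leaning blue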
The data was visualized with the ggpairs function. For each pair of variables in the data frame, ggpairs generates a scatterplot, displays a Pearson correlation, and, on the diagonal, shows each variable's distribution. The plots were also color-coded by class; the class label describes what kind of object a pixel is associated with. Our data frame contained the following classes: Blue Tarp, Rooftop, Soil, Various Non-tarp, and Vegetation. The 2D representation gives only partial insight into the behavior and relationships of the predictors. Since three channels are combined to generate a color, plotting the data in 3D is a more meaningful way to investigate trends and behavior between classes.
The 3D scatter plot shows a significant amount of overlap between the different classes, though some separation between them is still visible.
df <- as_tibble(read.csv("HaitiPixels.csv")) #read data in as a tibble
"Check for NA values"
anyNA(df) #check for NA values
"Summary of Data"
summary(df) #quick look at data
df$Class <- factor(df$Class) #make Class a factor variable.
#> [1] "Check for NA values"
#> [1] FALSE
#> [1] "Summary of Data"
#> Class Red Green Blue
#> Length:63241 Min. : 48 Min. : 48.0 Min. : 44.0
#> Class :character 1st Qu.: 80 1st Qu.: 78.0 1st Qu.: 63.0
#> Mode :character Median :163 Median :148.0 Median :123.0
#> Mean :163 Mean :153.7 Mean :125.1
#> 3rd Qu.:255 3rd Qu.:226.0 3rd Qu.:181.0
#> Max. :255 Max. :255.0 Max. :255.0
We can see from the output that there aren’t any NA values that need to be removed or adjusted. The values in each predictor column fall within the expected 0 to 255 range.
#Reference [1]
# The palette with grey:
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#CC79A7", "#0072B2", "#D55E00")
#"#F0E442",
#view scatter and correlations
p <- ggpairs(df[,2:4], lower.panel = NULL, upper = list(continuous = wrap("cor", size = 3)), aes(color=df$Class)) #+ scale_fill_manual(values=cbPalette)
#Reference: https://stackoverflow.com/questions/34740210/how-to-change-the-color-palette-for-ggallyggpairs/34743555
for(i in 1:p$nrow) {
for(j in 1:p$ncol){
p[i,j] <- p[i,j] +
scale_fill_manual(values=cbPalette) +
scale_color_manual(values=cbPalette)
}
}
p
attach(df) #attach df variables
All three predictors are highly correlated with one another, which is unsurprising given the way RGB values are used to represent colors.
fig <- plot_ly(df, x=~Red, y=~Green, z=~Blue, color=~Class, colors=c("#999999", "#E69F00", "#56B4E9", "#009E73", "#CC79A7")) #Reference: https://plotly.com/r/3d-scatter-plots/ https://plotly.com/r/3d-surface-plots/
fig <- fig %>% add_markers()
fig <- fig %>% layout(scene=list(xaxis=list(title="Red"),
yaxis = list(title = 'Green'),
zaxis = list(title = 'Blue')))
fig
As noted in the previous section, the data provided is sufficiently clean; only one further adjustment to the data frame is needed. Since our main interest is predicting whether a pixel represents a blue tarp or not, the Class column of the data frame needs to be converted into a binary indicator for blue tarp versus not blue tarp.
df <- mutate(df, Blue_Tarp_or_Not = ifelse(Class != "Blue Tarp", 0, 1)) #add binary column indicating whether the Class variable is "Blue Tarp" or not
attach(df)
df$Blue_Tarp_or_Not <- factor(df$Blue_Tarp_or_Not, levels = c(0, 1), labels = c("NBT", "BT")) #ensure new column is a factor with explicit level order
"First Six Rows of Data Frame"
head(df)
df_factor <- df[, -1]
"Last Six Rows of Data Frame"
tail(df_factor)
attach(df_factor)
#> [1] "First Six Rows of Data Frame"
#> Class Red Green Blue Blue_Tarp_or_Not
#> 1 Vegetation 64 67 50 NBT
#> 2 Vegetation 64 67 50 NBT
#> 3 Vegetation 64 66 49 NBT
#> 4 Vegetation 75 82 53 NBT
#> 5 Vegetation 74 82 54 NBT
#> 6 Vegetation 72 76 52 NBT
#> [1] "Last Six Rows of Data Frame"
#> Red Green Blue Blue_Tarp_or_Not
#> 63236 136 145 150 BT
#> 63237 138 146 150 BT
#> 63238 134 141 152 BT
#> 63239 136 143 151 BT
#> 63240 132 139 149 BT
#> 63241 133 141 153 BT
fig1 <- plot_ly(df_factor, x=~Red, y=~Green, z=~Blue, color=~Blue_Tarp_or_Not, colors = cbPalette) #Reference: https://plotly.com/r/3d-scatter-plots/ https://plotly.com/r/3d-surface-plots/
fig1 <- fig1 %>% add_markers()
fig1 <- fig1 %>% layout(scene=list(xaxis=list(title="Red"),
yaxis = list(title = 'Green'),
zaxis = list(title = 'Blue')))
fig1
After the class label is converted into a binary indicator, it is easier to see separation between the data points for blue tarps and non-tarps. In the region with RGB values of roughly 95 to 130, however, there is more overlap than separation between the two classes. Observations that fall in this region will likely be difficult to classify correctly.
A few general notes on the various metrics and considerations that will be explored for each model.
ROC
The receiver operating characteristic (ROC) curve is a graphic that illustrates the true positive percentage and the false positive percentage produced by the fitted model at different threshold values. Generally, we seek to maximize the true positive rate (TPR) and minimize the false positive rate (FPR).
For this project, we are interested in maximizing the TPR and minimizing the FPR. However, our use case involves saving lives, so we are especially concerned with the false negative rate: we want to avoid missed blue tarps as best we can with the budget we're allowed. Due to the imbalance in the ratio of BT to NBT observations, we must also be concerned with false positives (NBTs flagged as BTs) depleting our available resources before we are able to find all the blue tarps (TPs).
AUC
The area under the curve (AUC) is a metric that indicates how well a model can differentiate between two classes. The higher the AUC, the better the model is at predicting class labels correctly.
Thresholds
It can be difficult to determine what threshold to choose for a model from an ROC curve alone. To aid threshold selection, I graphed the TPR, FPR, and threshold in a 3D scatterplot. As the interactive graphics show, this makes it much easier to determine which threshold gives the model performance appropriate for your application. In our case, threshold selection is guided by the aim of maximizing the TPR while minimizing the FPR and ensuring that our operation does not exhaust its budget on incorrectly labeled non-tarps (false positives) before we reach our success criterion.
Confusion Matrix
The confusion matrix displays the true negatives, false negatives, false positives, and true positives, giving a sense of the performance of a model. These counts are used to calculate the metrics commonly used to compare models, such as the TPR (also known as sensitivity or recall), FPR, FNR, true negative rate (specificity), and precision.
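For reference, each of these metrics is a simple ratio of the four confusion-matrix counts; the helper below is a generic sketch (the function name is illustrative, not part of the project code).
#Generic sketch: common metrics computed from the four confusion-matrix counts
cm_metrics <- function(TP, FP, FN, TN) {
  c(TPR = TP/(TP + FN), #sensitivity / recall
    FPR = FP/(FP + TN),
    FNR = FN/(FN + TP),
    TNR = TN/(TN + FP), #specificity
    Precision = TP/(TP + FP))
}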
Sampling Variability
Understanding the confidence you can place in the model you've created is critical to understanding how well it will perform when given a new data set with different values. How much variability can you expect from your model?
Logistic regression is a classification technique, which, for our purposes, will be used to perform binary classification. Our logistic regression model estimates the probability that an observation is a blue tarp. The model coefficients are fit by maximum likelihood, and the resulting log-odds for each observation are converted into a probability. Whether an observation is classified as a blue tarp or not depends on the threshold value chosen: a probability above the threshold is assigned the blue tarp label, and a probability below the threshold is assigned the not-blue-tarp label.
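A minimal sketch of that conversion, using placeholder pixel values; the coefficients and threshold would come from the glm.fit object and lr.thresh value created below.
#Sketch: convert a pixel's RGB values into a probability via the fitted log-odds
classify_pixel <- function(red, green, blue, coefs, threshold = 0.5) {
  eta <- coefs[1] + coefs[2]*red + coefs[3]*green + coefs[4]*blue #log-odds
  p <- 1/(1 + exp(-eta)) #inverse logit gives P(blue tarp)
  ifelse(p > threshold, "BT", "NBT") #apply the chosen threshold
}
#e.g. classify_pixel(60, 90, 160, coef(glm.fit$finalModel), threshold = lr.thresh)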
#pass
fitControl <- trainControl(method = "cv",
number = 10,
returnResamp = 'all',
savePredictions = 'final',
classProbs = TRUE)
set.seed(4)
tic("Log. Reg.")
glm.fit <- caret::train(Blue_Tarp_or_Not~Red+Green+Blue,
data = df_factor,
method="glm",
family="binomial",
trControl= fitControl)
glm.time <- toc(quiet = TRUE)
glm.fit
"Summary"
summary(glm.fit)
#> Generalized Linear Model
#>
#> 63241 samples
#> 3 predictor
#> 2 classes: 'NBT', 'BT'
#>
#> No pre-processing
#> Resampling: Cross-Validated (10 fold)
#> Summary of sample sizes: 56917, 56917, 56917, 56916, 56917, 56917, ...
#> Resampling results:
#>
#> Accuracy Kappa
#> 0.9952879 0.9207043
#>
#> [1] "Summary"
#>
#> Call:
#> NULL
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -3.4266 -0.0205 -0.0015 0.0000 3.7911
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 0.20984 0.18455 1.137 0.256
#> Red -0.26031 0.01262 -20.632 <2e-16 ***
#> Green -0.21831 0.01330 -16.416 <2e-16 ***
#> Blue 0.47241 0.01548 30.511 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 17901.6 on 63240 degrees of freedom
#> Residual deviance: 1769.5 on 63237 degrees of freedom
#> AIC: 1777.5
#>
#> Number of Fisher Scoring iterations: 12
The model fit summary output indicates that the three predictors all have significant p-values, meaning that they all contribute to the model. The accuracy output indicates that the average accuracy over the ten folds was 99.5%. Further discussion of the variability over the ten folds can be found in the Sampling Variability tab.
#pass
glm.prob <- predict(glm.fit, newdata=df_factor , type = "prob") #returns df with col 0 (prob not blue tarp) and 1 (prob blue tarp)
par(pty="s")
glm_roc <- roc(df_factor $Blue_Tarp_or_Not, glm.prob[,2], plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Positive Percentage", col="#0072B2", lwd=4, print.auc=TRUE, main="Log. Reg. ROC Curve")
An AUC of 99.9% indicates that our model is able to distinguish between the classes almost perfectly. It should be possible to choose a threshold that can identify most of the training blue tarps without going over our budget.
roc.info_glm <- roc(df_factor$Blue_Tarp_or_Not, glm.prob[,2], legacy.axes=TRUE)
roc.glm.df <- data.frame(tpp=roc.info_glm$sensitivities*100, fpp=(1-roc.info_glm$specificities)*100, thresholds=roc.info_glm$thresholds)
#roc.glm.df[roc.glm.df>98.5 & roc.glm.df < 99,]
glm.thresholds <- data.matrix(roc.glm.df$thresholds)
fig2 <- plot_ly(roc.glm.df, x=~tpp, y=~fpp, z=~thresholds, colors = "#0072B2") #Reference: https://plotly.com/r/3d-scatter-plots/ https://plotly.com/r/3d-surface-plots/
fig2 <- fig2 %>% add_markers()
fig2 <- fig2 %>% layout(scene=list(xaxis=list(title="True Positive Rate"),
yaxis = list(title = 'False Positive Rate'),
zaxis = list(title = 'Threshold')))
fig2
lr.thresh <- 0.03265
glm.pred_thresh <- factor(ifelse(glm.prob[,2]>lr.thresh,"BT", "NBT"), levels=c("NBT", "BT"))
cm.glm_thresh <- confusionMatrix(factor(glm.pred_thresh),df_factor $Blue_Tarp_or_Not, positive = "BT")
"Threshold:"
lr.thresh
cm.glm_thresh
acc_LR <- cm.glm_thresh[["overall"]][["Accuracy"]]*100
auc_LR <- glm_roc[["auc"]]
thresh_LR <- lr.thresh
sens_LR <- cm.glm_thresh[["byClass"]][["Sensitivity"]]*100
spec_LR <- cm.glm_thresh[["byClass"]][["Specificity"]]*100
FDR_LR <- ((cm.glm_thresh[["table"]][2,1])/(cm.glm_thresh[["table"]][2,1]+cm.glm_thresh[["table"]][2,2]))*100
prec_LR <- cm.glm_thresh[["byClass"]][["Precision"]]*100
F.glm <- round(2*((prec_LR*sens_LR)/(prec_LR+sens_LR))/100, 2)
beta_val <- 3
F3.glm <- round(((1+beta_val**2)*((prec_LR*sens_LR)/(beta_val*prec_LR+sens_LR)))/100, 2)
cost.glm <- round(cm.glm_thresh[["table"]][1]*0 + cm.glm_thresh[["table"]][2]*FP_C + cm.glm_thresh[["table"]][3]*FN_C + cm.glm_thresh[["table"]][4]*TP_C, 4)
cost.glm
#> [1] "Threshold:"
#> [1] 0.03265
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction NBT BT
#> NBT 60171 40
#> BT 1048 1982
#>
#> Accuracy : 0.9828
#> 95% CI : (0.9818, 0.9838)
#> No Information Rate : 0.968
#> P-Value [Acc > NIR] : < 2.2e-16
#>
#> Kappa : 0.7761
#>
#> Mcnemar's Test P-Value : < 2.2e-16
#>
#> Sensitivity : 0.98022
#> Specificity : 0.98288
#> Pos Pred Value : 0.65413
#> Neg Pred Value : 0.99934
#> Prevalence : 0.03197
#> Detection Rate : 0.03134
#> Detection Prevalence : 0.04791
#> Balanced Accuracy : 0.98155
#>
#> 'Positive' Class : BT
#>
#> [1] -1998740
A threshold of 0.03265 was selected for this model.
"10 Fold Results"
glm.fit$resample #point est +/- std from 10 folds "variation in the third decimal place"...
glm.sd <- sd(glm.fit[["resample"]][["Accuracy"]]*100)
#plot(glm.fit[["resample"]][["Accuracy"]], main="Accuracy per Fold", xlab= "Fold Number", ylab="Accuracy")
#> [1] "10 Fold Results"
#> Accuracy Kappa parameter Resample
#> 1 0.9955724 0.9259239 none Fold01
#> 2 0.9963631 0.9393021 none Fold02
#> 3 0.9950980 0.9169589 none Fold03
#> 4 0.9943083 0.9038085 none Fold04
#> 5 0.9950980 0.9177836 none Fold05
#> 6 0.9960468 0.9340241 none Fold06
#> 7 0.9944655 0.9052940 none Fold07
#> 8 0.9955724 0.9262890 none Fold08
#> 9 0.9954143 0.9223164 none Fold09
#> 10 0.9949399 0.9153427 none Fold10
The average accuracy across the ten folds is 99.53% with a standard deviation of 0.064. Because the fold accuracies vary only in the third decimal place, we can have confidence in the model performing well even when there is some variation in the data. However, if another data set is significantly different from the training data set, there could be a considerable change in performance.
LDA is a classification method that assigns each observation to the class with the highest estimated likelihood. LDA assumes the predictors in each class follow a normal distribution and that all classes share the same covariance matrix, which creates a linear separation boundary between classes. Caret automatically converts the likelihood into a probability, which is what we use in conjunction with our chosen threshold to classify our data.
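Under the hood, caret's "lda" method wraps MASS::lda; a minimal sketch of obtaining the posterior probabilities directly (assuming the df_factor data frame built above):
library(MASS)
#Direct LDA fit: class-specific means, one covariance matrix shared by both classes
lda.direct <- lda(Blue_Tarp_or_Not ~ Red + Green + Blue, data = df_factor)
#predict() returns $class (the maximum-posterior label) and $posterior, one
#probability column per class; the "BT" column is what gets thresholded
head(predict(lda.direct, newdata = df_factor)$posterior[, "BT"])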
#pass
fitControl <- trainControl(method = "cv",
number = 10,
returnResamp = 'all',
savePredictions = 'final',
classProbs = TRUE)
set.seed(4)
tic("LDA")
lda.fit <- caret::train(Blue_Tarp_or_Not~Red+Green+Blue,
data = df_factor,
preProcess=c("center","scale"),
method="lda",
verbose= FALSE,
trControl= fitControl)
lda.time <- toc(quiet = TRUE)
lda.fit
#> Linear Discriminant Analysis
#>
#> 63241 samples
#> 3 predictor
#> 2 classes: 'NBT', 'BT'
#>
#> Pre-processing: centered (3), scaled (3)
#> Resampling: Cross-Validated (10 fold)
#> Summary of sample sizes: 56917, 56917, 56917, 56916, 56917, 56917, ...
#> Resampling results:
#>
#> Accuracy Kappa
#> 0.983887 0.7524541
#pass
lda.prob <- predict(lda.fit, newdata=df_factor, type = "prob") #returns df with col 0 (prob not blue tarp) and 1 (prob blue tarp)
par(pty="s")
lda_roc <- roc(df_factor$Blue_Tarp_or_Not, lda.prob[,2], plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Positive Percentage", col="#0072B2", lwd=4, print.auc=TRUE, main="LDA ROC Curve")
An AUC of 98.9% indicates that our model distinguishes between the classes fairly well. However, because the false positive percentage is higher where the true positive percentage nears 100%, it will likely be more difficult to choose a threshold that can identify most of the training blue tarps without going over our budget.
roc.info_lda <- roc(df_factor $Blue_Tarp_or_Not, lda.prob[,2], legacy.axes=TRUE)
roc.lda.df <- data.frame(tpp=roc.info_lda$sensitivities*100, fpp=(1-roc.info_lda$specificities)*100, thresholds=roc.info_lda$thresholds)
#roc.lda.df[roc.lda.df>91.5 & roc.lda.df < 91.6,]
fig3 <- plot_ly(roc.lda.df, x=~tpp, y=~fpp, z=~thresholds, colors = "#0072B2") #Reference: https://plotly.com/r/3d-scatter-plots/ https://plotly.com/r/3d-surface-plots/
fig3 <- fig3 %>% add_markers()
fig3 <- fig3 %>% layout(scene=list(xaxis=list(title="True Positive Rate"),
yaxis = list(title = 'False Positive Rate'),
zaxis = list(title = 'Threshold')))
fig3
lda.thresh <- 0.01
lda.pred_thresh <- factor(ifelse(lda.prob[,2]>lda.thresh,"BT", "NBT"), levels=c("NBT", "BT"))
cm.lda_thresh <- confusionMatrix(factor(lda.pred_thresh),df_factor$Blue_Tarp_or_Not, positive = "BT")
"Threshold:"
lda.thresh
cm.lda_thresh
acc_lda <- cm.lda_thresh[["overall"]][["Accuracy"]]*100
auc_lda <- lda_roc[["auc"]]
thresh_lda <- lda.thresh
sens_lda <- cm.lda_thresh[["byClass"]][["Sensitivity"]]*100
spec_lda <- cm.lda_thresh[["byClass"]][["Specificity"]]*100
FDR_lda <- ((cm.lda_thresh[["table"]][2,1])/(cm.lda_thresh[["table"]][2,1]+cm.lda_thresh[["table"]][2,2]))*100
prec_lda <- cm.lda_thresh[["byClass"]][["Precision"]]*100
F.lda <- round(2*((prec_lda*sens_lda)/(prec_lda+sens_lda))/100, 2)
F3.lda <- round(((1+beta_val**2)*((prec_lda*sens_lda)/(beta_val*prec_lda+sens_lda)))/100, 2)
cost.lda <- round(cm.lda_thresh[["table"]][1]*0 + cm.lda_thresh[["table"]][2]*FP_C + cm.lda_thresh[["table"]][3]*FN_C + cm.lda_thresh[["table"]][4]*TP_C, 4)
cost.lda
#> [1] "Threshold:"
#> [1] 0.01
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction NBT BT
#> NBT 59944 231
#> BT 1275 1791
#>
#> Accuracy : 0.9762
#> 95% CI : (0.975, 0.9774)
#> No Information Rate : 0.968
#> P-Value [Acc > NIR] : < 2.2e-16
#>
#> Kappa : 0.6921
#>
#> Mcnemar's Test P-Value : < 2.2e-16
#>
#> Sensitivity : 0.88576
#> Specificity : 0.97917
#> Pos Pred Value : 0.58415
#> Neg Pred Value : 0.99616
#> Prevalence : 0.03197
#> Detection Rate : 0.02832
#> Detection Prevalence : 0.04848
#> Balanced Accuracy : 0.93246
#>
#> 'Positive' Class : BT
#>
#> [1] -2868276
Based on the performance of the LDA model, I was unable to find a threshold that satisfied the budget. Therefore, I chose a threshold that minimized the cost and the number of false negatives.
"10 Fold Results"
lda.fit$resample
lda.sd <- sd(lda.fit[["resample"]][["Accuracy"]]*100)
#> [1] "10 Fold Results"
#> Accuracy Kappa parameter Resample
#> 1 0.9857685 0.7793824 none Fold01
#> 2 0.9854522 0.7824666 none Fold02
#> 3 0.9829222 0.7328063 none Fold03
#> 4 0.9845059 0.7630316 none Fold04
#> 5 0.9807084 0.7153126 none Fold05
#> 6 0.9843454 0.7632954 none Fold06
#> 7 0.9802340 0.6914626 none Fold07
#> 8 0.9867173 0.7912156 none Fold08
#> 9 0.9840291 0.7507018 none Fold09
#> 10 0.9841872 0.7548660 none Fold10
The average accuracy across the ten folds is 98.39% with a standard deviation of 0.208. Because the fold accuracies vary only in the third decimal place, we can have confidence in the model performing well even when there is some variation in the data. However, if another data set is significantly different from the training data set, there could be a considerable change in performance.
QDA is a classification method that, like LDA, assigns each observation to the class with the highest estimated likelihood. QDA assumes each class follows a normal distribution with its own mean and its own covariance matrix. Caret automatically converts the likelihood into a probability, which is what we use in conjunction with our chosen threshold to classify our data.
Because QDA allows the classes to have different covariances, the decision boundary it creates is quadratic. Therefore, a QDA model will tend to outperform an LDA model when the boundary between classes is non-linear and the classes have different covariance structures.
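Analogous to the LDA sketch above, MASS::qda fits a separate covariance matrix for each class; a minimal sketch (again assuming df_factor):
library(MASS)
#Direct QDA fit: each class gets its own mean vector and covariance matrix
qda.direct <- qda(Blue_Tarp_or_Not ~ Red + Green + Blue, data = df_factor)
head(predict(qda.direct, newdata = df_factor)$posterior[, "BT"]) #P(blue tarp)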
#pass
fitControl <- trainControl(method = "cv",
number = 10,
returnResamp = 'all',
savePredictions = 'final',
classProbs = TRUE)
set.seed(4)
tic("QDA")
qda.fit <- caret::train(Blue_Tarp_or_Not~Red+Green+Blue,
data = df_factor,
preProcess=c("center","scale"),
method="qda",
verbose= FALSE,
trControl= fitControl)
qda.time <- toc(quiet = TRUE)
qda.fit
#> Quadratic Discriminant Analysis
#>
#> 63241 samples
#> 3 predictor
#> 2 classes: 'NBT', 'BT'
#>
#> Pre-processing: centered (3), scaled (3)
#> Resampling: Cross-Validated (10 fold)
#> Summary of sample sizes: 56917, 56917, 56917, 56916, 56917, 56917, ...
#> Resampling results:
#>
#> Accuracy Kappa
#> 0.9945763 0.9051517
#pass
qda.prob <- predict(qda.fit, newdata=df_factor , type = "prob") #returns df with col 0 (prob not blue tarp) and 1 (prob blue tarp)
par(pty="s")
qda_roc <- roc(df_factor $Blue_Tarp_or_Not, qda.prob[,2], plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Positive Percentage", col="#0072B2", lwd=4, print.auc=TRUE, main="QDA ROC Curve")
An AUC of 99.8% indicates that our model is able to distinguish between the classes very well. It should be possible to choose a threshold that can identify most of the training blue tarps without going over our budget.
roc.info_qda <- roc(df_factor$Blue_Tarp_or_Not, qda.prob[,2], legacy.axes=TRUE)
roc.qda.df <- data.frame(tpp=roc.info_qda$sensitivities*100, fpp=(1-roc.info_qda$specificities)*100, thresholds=roc.info_qda$thresholds)
#roc.qda.df[roc.qda.df>98 & roc.qda.df < 99,]
fig4 <- plot_ly(roc.qda.df, x=~tpp, y=~fpp, z=~thresholds, colors = "#0072B2") #Reference: https://plotly.com/r/3d-scatter-plots/ https://plotly.com/r/3d-surface-plots/
fig4 <- fig4 %>% add_markers()
fig4 <- fig4 %>% layout(scene=list(xaxis=list(title="True Positive Rate"),
yaxis = list(title = 'False Positive Rate'),
zaxis = list(title = 'Threshold')))
fig4
qda.thresh <- 0.02
qda.pred_thresh <- factor(ifelse(qda.prob[,2]>qda.thresh,"BT", "NBT"), levels=c("NBT", "BT"))
cm.qda_thresh <- confusionMatrix(factor(qda.pred_thresh),df_factor $Blue_Tarp_or_Not, positive = "BT")
"Threshold:"
qda.thresh
cm.qda_thresh
acc_qda <- cm.qda_thresh[["overall"]][["Accuracy"]]*100
auc_qda <- qda_roc[["auc"]]
thresh_qda <- qda.thresh
sens_qda <- cm.qda_thresh[["byClass"]][["Sensitivity"]]*100
spec_qda <- cm.qda_thresh[["byClass"]][["Specificity"]]*100
FDR_qda <- ((cm.qda_thresh[["table"]][2,1])/(cm.qda_thresh[["table"]][2,1]+cm.qda_thresh[["table"]][2,2]))*100
prec_qda <- cm.qda_thresh[["byClass"]][["Precision"]]*100
F.qda <- round(2*((prec_qda*sens_qda)/(prec_qda+sens_qda))/100,2)
F3.qda <- round(((1+beta_val**2)*((prec_qda*sens_qda)/(beta_val*prec_qda+sens_qda)))/100, 2)
cost.qda <- round(cm.qda_thresh[["table"]][1]*0 + cm.qda_thresh[["table"]][2]*FP_C + cm.qda_thresh[["table"]][3]*FN_C + cm.qda_thresh[["table"]][4]*TP_C, 4)
cost.qda
#> [1] "Threshold:"
#> [1] 0.02
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction NBT BT
#> NBT 59807 14
#> BT 1412 2008
#>
#> Accuracy : 0.9775
#> 95% CI : (0.9763, 0.9786)
#> No Information Rate : 0.968
#> P-Value [Acc > NIR] : < 2.2e-16
#>
#> Kappa : 0.727
#>
#> Mcnemar's Test P-Value : < 2.2e-16
#>
#> Sensitivity : 0.99308
#> Specificity : 0.97694
#> Pos Pred Value : 0.58713
#> Neg Pred Value : 0.99977
#> Prevalence : 0.03197
#> Detection Rate : 0.03175
#> Detection Prevalence : 0.05408
#> Balanced Accuracy : 0.98501
#>
#> 'Positive' Class : BT
#>
#> [1] -1998844
"10 Fold Results"
qda.fit$resample #point est +/- std from 10 folds "variation in the third decimal place"...
qda.sd <- sd(qda.fit[["resample"]][["Accuracy"]]*100)
#> [1] "10 Fold Results"
#> Accuracy Kappa parameter Resample
#> 1 0.9954143 0.9203183 none Fold01
#> 2 0.9958887 0.9298247 none Fold02
#> 3 0.9938330 0.8917279 none Fold03
#> 4 0.9941502 0.8978148 none Fold04
#> 5 0.9938330 0.8911622 none Fold05
#> 6 0.9954143 0.9215293 none Fold06
#> 7 0.9928843 0.8710492 none Fold07
#> 8 0.9957306 0.9269411 none Fold08
#> 9 0.9947818 0.9088589 none Fold09
#> 10 0.9938330 0.8922908 none Fold10
The average accuracy across the ten folds is 99.46% with a standard deviation of 0.101. Because the fold accuracies vary only in the third decimal place, we can have confidence in the model performing well even when there is some variation in the data. However, if another data set is significantly different from the training data set, there could be a considerable change in performance.
K nearest neighbors is a modeling method that estimates the probability that an observation belongs to one class or another based on the classes assigned to its k nearest neighbors (the observations closest to it in the predictor space). Moderate k values such as 5 or 10 are often chosen to avoid too much bias (which occurs when k is too large) and too much variance (which occurs when k is too small).
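For each pixel, the reported class probability is simply the fraction of its k nearest neighbors labeled BT; a minimal sketch using caret's knn3 function (a direct, untuned fit, not the model trained below):
library(caret)
#Sketch: k-nearest-neighbor class probabilities with caret::knn3
knn.direct <- knn3(Blue_Tarp_or_Not ~ Red + Green + Blue, data = df_factor, k = 7)
#Each row gives the proportion of the 7 nearest neighbors in class NBT and in BT
head(predict(knn.direct, newdata = df_factor, type = "prob"))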
#pass
fitControl <- trainControl(method = "cv",
number = 10,
returnResamp = 'all',
savePredictions = 'final',
classProbs = TRUE)
set.seed(4)
tic("KNN")
knn.fit <- train(Blue_Tarp_or_Not~Red+Green+Blue,
data = df_factor,
preProcess=c("center","scale"),
method="knn",
trControl= fitControl,
tuneLength=10
)
knn.time <- toc(quiet = TRUE)
knn.fit
#> k-Nearest Neighbors
#>
#> 63241 samples
#> 3 predictor
#> 2 classes: 'NBT', 'BT'
#>
#> Pre-processing: centered (3), scaled (3)
#> Resampling: Cross-Validated (10 fold)
#> Summary of sample sizes: 56917, 56917, 56917, 56916, 56917, 56917, ...
#> Resampling results across tuning parameters:
#>
#> k Accuracy Kappa
#> 5 0.9972328 0.9555000
#> 7 0.9973435 0.9573825
#> 9 0.9972803 0.9563272
#> 11 0.9971221 0.9537932
#> 13 0.9972170 0.9552906
#> 15 0.9971379 0.9540331
#> 17 0.9970431 0.9524698
#> 19 0.9969956 0.9517428
#> 21 0.9969956 0.9516839
#> 23 0.9969482 0.9509483
#>
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was k = 7.
The model fit indicates that the highest accuracy is obtained when k is 7.
plot(knn.fit)
#pass
knn.prob <- predict(knn.fit, newdata=df_factor , type = "prob") #returns df with col 0 (prob not blue tarp) and 1 (prob blue tarp)
par(pty="s")
knn_roc <- roc(df_factor $Blue_Tarp_or_Not, knn.prob[,2], plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Positive Percentage", col="#0072B2", lwd=4, print.auc=TRUE, main="KNN ROC Curve")
An AUC of 100.0% indicates that our model distinguishes between the classes perfectly on the training data. It should be possible to choose a threshold that can identify all of the training blue tarps without going over our budget.
roc.info_knn <- roc(df_factor$Blue_Tarp_or_Not, knn.prob[,2], legacy.axes=TRUE)
roc.knn.df <- data.frame(tpp=roc.info_knn$sensitivities*100, fpp=(1-roc.info_knn$specificities)*100, thresholds=roc.info_knn$thresholds)
#roc.knn.df[roc.knn.df>99 & roc.knn.df < 100,]
#roc.knn.df
fig5 <- plot_ly(roc.knn.df, x=~tpp, y=~fpp, z=~thresholds, colors = "#0072B2") #Reference: https://plotly.com/r/3d-scatter-plots/ https://plotly.com/r/3d-surface-plots/
fig5 <- fig5 %>% add_markers()
fig5 <- fig5 %>% layout(scene=list(xaxis=list(title="True Positive Rate"),
yaxis = list(title = 'False Positive Rate'),
zaxis = list(title = 'Threshold')))
fig5
knn.thresh <- 0.07
knn.pred_thresh <- factor(ifelse(knn.prob[,2]>knn.thresh,"BT", "NBT"), levels=c("NBT", "BT"))
cm.knn_thresh <- confusionMatrix(factor(knn.pred_thresh),df_factor $Blue_Tarp_or_Not, positive = "BT")
"Threshold:"
knn.thresh
cm.knn_thresh
acc_knn <- cm.knn_thresh[["overall"]][["Accuracy"]]*100
auc_knn <- knn_roc[["auc"]]
thresh_knn <- knn.thresh
sens_knn <- cm.knn_thresh[["byClass"]][["Sensitivity"]]*100
spec_knn <- cm.knn_thresh[["byClass"]][["Specificity"]]*100
FDR_knn <- ((cm.knn_thresh[["table"]][2,1])/(cm.knn_thresh[["table"]][2,1]+cm.knn_thresh[["table"]][2,2]))*100
prec_knn <- cm.knn_thresh[["byClass"]][["Precision"]]*100
k_knn <- knn.fit[["bestTune"]][["k"]]
F.knn <- round(2*((prec_knn*sens_knn)/(prec_knn+sens_knn))/100,2)
F3.knn <- round(((1+beta_val**2)*((prec_knn*sens_knn)/(beta_val*prec_knn+sens_knn)))/100, 2)
cost.knn <- round(cm.knn_thresh[["table"]][1]*0 + cm.knn_thresh[["table"]][2]*FP_C + cm.knn_thresh[["table"]][3]*FN_C + cm.knn_thresh[["table"]][4]*TP_C,4)
cost.knn
#> [1] "Threshold:"
#> [1] 0.07
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction NBT BT
#> NBT 60685 0
#> BT 534 2022
#>
#> Accuracy : 0.9916
#> 95% CI : (0.9908, 0.9923)
#> No Information Rate : 0.968
#> P-Value [Acc > NIR] : < 2.2e-16
#>
#> Kappa : 0.879
#>
#> Mcnemar's Test P-Value : < 2.2e-16
#>
#> Sensitivity : 1.00000
#> Specificity : 0.99128
#> Pos Pred Value : 0.79108
#> Neg Pred Value : 1.00000
#> Prevalence : 0.03197
#> Detection Rate : 0.03197
#> Detection Prevalence : 0.04042
#> Balanced Accuracy : 0.99564
#>
#> 'Positive' Class : BT
#>
#> [1] -1676700
"10 Fold Results"
knn.fit$resample #point est +/- std from 10 folds "variation in the third decimal place"...
knn.sd <- sd(knn.fit[["resample"]][["Accuracy"]]*100)
#> [1] "10 Fold Results"
#> Accuracy Kappa k Resample
#> 1 0.9977862 0.9638571 5 Fold01
#> 2 0.9974700 0.9586938 7 Fold01
#> 3 0.9974700 0.9586938 9 Fold01
#> 4 0.9976281 0.9613688 11 Fold01
#> 5 0.9977862 0.9638571 13 Fold01
#> 6 0.9977862 0.9638571 15 Fold01
#> 7 0.9976281 0.9611816 17 Fold01
#> 8 0.9976281 0.9611816 19 Fold01
#> 9 0.9976281 0.9611816 21 Fold01
#> 10 0.9976281 0.9611816 23 Fold01
#> 11 0.9979443 0.9668394 5 Fold02
#> 12 0.9973118 0.9570464 7 Fold02
#> 13 0.9974700 0.9594773 9 Fold02
#> 14 0.9974700 0.9594773 11 Fold02
#> 15 0.9973118 0.9568423 13 Fold02
#> 16 0.9971537 0.9544119 15 Fold02
#> 17 0.9971537 0.9544119 17 Fold02
#> 18 0.9973118 0.9568423 19 Fold02
#> 19 0.9973118 0.9566362 21 Fold02
#> 20 0.9973118 0.9566362 23 Fold02
#> 21 0.9974700 0.9598579 5 Fold03
#> 22 0.9968374 0.9500570 7 Fold03
#> 23 0.9968374 0.9495856 9 Fold03
#> 24 0.9973118 0.9570464 11 Fold03
#> 25 0.9974700 0.9596685 13 Fold03
#> 26 0.9974700 0.9596685 15 Fold03
#> 27 0.9971537 0.9546271 17 Fold03
#> 28 0.9974700 0.9596685 19 Fold03
#> 29 0.9976281 0.9620998 21 Fold03
#> 30 0.9971537 0.9548402 23 Fold03
#> 31 0.9965217 0.9437481 5 Fold04
#> 32 0.9971542 0.9544120 7 Fold04
#> 33 0.9968379 0.9493467 9 Fold04
#> 34 0.9965217 0.9442814 11 Fold04
#> 35 0.9969960 0.9517651 13 Fold04
#> 36 0.9966798 0.9466877 15 Fold04
#> 37 0.9966798 0.9466877 17 Fold04
#> 38 0.9965217 0.9442814 19 Fold04
#> 39 0.9963636 0.9418865 21 Fold04
#> 40 0.9962055 0.9392161 23 Fold04
#> 41 0.9976281 0.9613688 5 Fold05
#> 42 0.9977862 0.9642031 7 Fold05
#> 43 0.9976281 0.9613688 9 Fold05
#> 44 0.9973118 0.9562180 11 Fold05
#> 45 0.9974700 0.9588925 13 Fold05
#> 46 0.9974700 0.9588925 15 Fold05
#> 47 0.9974700 0.9588925 17 Fold05
#> 48 0.9973118 0.9562180 19 Fold05
#> 49 0.9974700 0.9588925 21 Fold05
#> 50 0.9974700 0.9588925 23 Fold05
#> 51 0.9974700 0.9588925 5 Fold06
#> 52 0.9979443 0.9666803 7 Fold06
#> 53 0.9974700 0.9590893 9 Fold06
#> 54 0.9973118 0.9562180 11 Fold06
#> 55 0.9971537 0.9537541 13 Fold06
#> 56 0.9974700 0.9588925 15 Fold06
#> 57 0.9973118 0.9564281 17 Fold06
#> 58 0.9973118 0.9564281 19 Fold06
#> 59 0.9974700 0.9588925 21 Fold06
#> 60 0.9973118 0.9564281 23 Fold06
#> 61 0.9971537 0.9537541 5 Fold07
#> 62 0.9977862 0.9638571 7 Fold07
#> 63 0.9977862 0.9640309 9 Fold07
#> 64 0.9974700 0.9588925 11 Fold07
#> 65 0.9976281 0.9615542 13 Fold07
#> 66 0.9971537 0.9537541 15 Fold07
#> 67 0.9973118 0.9564281 17 Fold07
#> 68 0.9971537 0.9537541 19 Fold07
#> 69 0.9973118 0.9564281 21 Fold07
#> 70 0.9973118 0.9564281 23 Fold07
#> 71 0.9976281 0.9622782 5 Fold08
#> 72 0.9976281 0.9622782 7 Fold08
#> 73 0.9974700 0.9598579 9 Fold08
#> 74 0.9973118 0.9574490 11 Fold08
#> 75 0.9974700 0.9598579 13 Fold08
#> 76 0.9974700 0.9598579 15 Fold08
#> 77 0.9973118 0.9570464 17 Fold08
#> 78 0.9968374 0.9498224 19 Fold08
#> 79 0.9968374 0.9495856 21 Fold08
#> 80 0.9969956 0.9519931 23 Fold08
#> 81 0.9966793 0.9466875 5 Fold09
#> 82 0.9969956 0.9515345 7 Fold09
#> 83 0.9969956 0.9517649 9 Fold09
#> 84 0.9965212 0.9442812 11 Fold09
#> 85 0.9965212 0.9440158 13 Fold09
#> 86 0.9963631 0.9416101 15 Fold09
#> 87 0.9962049 0.9389263 17 Fold09
#> 88 0.9960468 0.9365327 19 Fold09
#> 89 0.9958887 0.9335201 21 Fold09
#> 90 0.9958887 0.9335201 23 Fold09
#> 91 0.9960468 0.9377164 5 Fold10
#> 92 0.9965212 0.9450623 7 Fold10
#> 93 0.9968374 0.9500567 9 Fold10
#> 94 0.9963631 0.9426991 11 Fold10
#> 95 0.9963631 0.9426991 13 Fold10
#> 96 0.9963631 0.9426991 15 Fold10
#> 97 0.9962049 0.9400680 17 Fold10
#> 98 0.9963631 0.9426991 19 Fold10
#> 99 0.9960468 0.9377164 21 Fold10
#> 100 0.9962049 0.9403468 23 Fold10
The average accuracy across the ten folds when k = 7 is 99.73%, with a standard deviation of 0.052 across the resamples. Because the fold accuracies vary only in the third decimal place, we can have confidence in the model performing well even when there is some variation in the data. However, if another data set is significantly different from the training data set, there could be a considerable change in performance.
Random forest is a classification method that averages a large number of decorrelated decision trees to assign a class to an observation. Decorrelation is accomplished by limiting the number of predictors each tree may consider when splitting the data. To classify a new observation, every tree in the model votes, and the observation receives the class chosen by the largest number of trees.
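caret's "rf" method wraps the randomForest package; the vote-proportion behavior described above can be seen directly in a minimal sketch (again assuming df_factor; this is not the tuned project model):
library(randomForest)
set.seed(4)
#500 decorrelated trees; each split considers mtry randomly chosen predictors
rf.direct <- randomForest(Blue_Tarp_or_Not ~ Red + Green + Blue,
                          data = df_factor, ntree = 500, mtry = 2)
#type = "prob" returns the fraction of trees voting for each class; the majority
#vote (or a chosen threshold on the BT column) determines the final label
head(predict(rf.direct, newdata = df_factor, type = "prob"))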
#pass
fitControl <- trainControl(method = "cv",
number = 10,
returnResamp = 'all',
savePredictions = 'final',
classProbs = TRUE)
set.seed(4)
tic("RF")
rf.fit <- train(Blue_Tarp_or_Not~Red+Green+Blue,
data = df_factor,
preProcess=c("center","scale"),
method="rf",
trControl= fitControl,
tuneLength=3
)
rf.time <- toc(quiet = TRUE)
rf.fit
#> note: only 2 unique complexity parameters in default grid. Truncating the grid to 2 .
#>
#> Random Forest
#>
#> 63241 samples
#> 3 predictor
#> 2 classes: 'NBT', 'BT'
#>
#> Pre-processing: centered (3), scaled (3)
#> Resampling: Cross-Validated (10 fold)
#> Summary of sample sizes: 56917, 56917, 56917, 56916, 56917, 56917, ...
#> Resampling results across tuning parameters:
#>
#> mtry Accuracy Kappa
#> 2 0.9969798 0.9509141
#> 3 0.9969482 0.9504008
#>
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was mtry = 2.
The RF model was fit with 500 trees, and the tuning parameter mtry = 2 (the number of variables considered at each split) was selected. The difference in accuracy between mtry = 2 and mtry = 3 is incredibly small (in the hundred-thousandths place).
plot(rf.fit)
#pass
RF.prob <- predict(rf.fit, newdata=df_factor , type = "prob") #returns df with col 0 (prob not blue tarp) and 1 (prob blue tarp)
par(pty="s")
RF_roc <- roc(df_factor $Blue_Tarp_or_Not, RF.prob[,2], plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Positive Percentage", col="#0072B2", lwd=4, print.auc=TRUE, main="RF ROC Curve")
An AUC of 99.4% indicates that our model distinguishes between the classes quite well. It is likely that a threshold can be chosen that identifies most of the training blue tarps without going over our budget.
roc.info_rf <- roc(df_factor$Blue_Tarp_or_Not, RF.prob[,2], legacy.axes=TRUE)
roc.rf.df <- data.frame(tpp=roc.info_rf$sensitivities*100, fpp=(1-roc.info_rf$specificities)*100, thresholds=roc.info_rf$thresholds)
#roc.rf.df[roc.rf.df>99 & roc.rf.df < 100,]
#roc.rf.df
fig6 <- plot_ly(roc.rf.df, x=~tpp, y=~fpp, z=~thresholds, colors = "#0072B2") #Reference: https://plotly.com/r/3d-scatter-plots/ https://plotly.com/r/3d-surface-plots/
fig6 <- fig6 %>% add_markers()
fig6 <- fig6 %>% layout(scene=list(xaxis=list(title="True Positive Rate"),
yaxis = list(title = 'False Positive Rate'),
zaxis = list(title = 'Threshold')))
fig6
RF.thresh <- 0.36
RF.pred_thresh <- factor(ifelse(RF.prob[,2]>RF.thresh,"BT", "NBT"), levels=c("NBT", "BT"))
cm.RF_thresh <- confusionMatrix(factor(RF.pred_thresh),df_factor $Blue_Tarp_or_Not, positive = "BT")
"Threshold:"
RF.thresh
cm.RF_thresh
acc_RF <- cm.RF_thresh[["overall"]][["Accuracy"]]*100
auc_RF <- RF_roc[["auc"]]
thresh_RF <- RF.thresh
sens_RF <- cm.RF_thresh[["byClass"]][["Sensitivity"]]*100
spec_RF <- cm.RF_thresh[["byClass"]][["Specificity"]]*100
FDR_RF <- ((cm.RF_thresh[["table"]][2,1])/(cm.RF_thresh[["table"]][2,1]+cm.RF_thresh[["table"]][2,2]))*100
prec_RF <- cm.RF_thresh[["byClass"]][["Precision"]]*100
mtry_best <- rf.fit[["bestTune"]][["mtry"]]
ntree <- rf.fit[["finalModel"]][["ntree"]]
F.rf <- round(2*((prec_RF*sens_RF)/(prec_RF+sens_RF))/100,2)
F3.rf <- round(((1+beta_val**2)*((prec_RF*sens_RF)/(beta_val*prec_RF+sens_RF)))/100, 2)
cost.RF <- round(cm.RF_thresh[["table"]][1]*0 + cm.RF_thresh[["table"]][2]*FP_C + cm.RF_thresh[["table"]][3]*FN_C + cm.RF_thresh[["table"]][4]*TP_C,4)
cost.RF
#> [1] "Threshold:"
#> [1] 0.36
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction NBT BT
#> NBT 61203 22
#> BT 16 2000
#>
#> Accuracy : 0.9994
#> 95% CI : (0.9992, 0.9996)
#> No Information Rate : 0.968
#> P-Value [Acc > NIR] : <2e-16
#>
#> Kappa : 0.9903
#>
#> Mcnemar's Test P-Value : 0.4173
#>
#> Sensitivity : 0.98912
#> Specificity : 0.99974
#> Pos Pred Value : 0.99206
#> Neg Pred Value : 0.99964
#> Prevalence : 0.03197
#> Detection Rate : 0.03163
#> Detection Prevalence : 0.03188
#> Balanced Accuracy : 0.99443
#>
#> 'Positive' Class : BT
#>
#> [1] -1613612
"10 Fold Results"
rf.fit$resample #point est +/- std from 10 folds "variation in the third decimal place"...
rf.sd <- sd(rf.fit[["resample"]][["Accuracy"]]*100)
#> [1] "10 Fold Results"
#> Accuracy Kappa mtry Resample
#> 1 0.9971537 0.9528468 2 Fold01
#> 2 0.9973118 0.9555754 3 Fold01
#> 3 0.9973118 0.9560058 2 Fold02
#> 4 0.9974700 0.9588925 3 Fold02
#> 5 0.9963631 0.9416101 2 Fold03
#> 6 0.9965212 0.9442812 3 Fold03
#> 7 0.9966798 0.9453903 2 Fold04
#> 8 0.9968379 0.9476083 3 Fold04
#> 9 0.9971537 0.9533048 2 Fold05
#> 10 0.9963631 0.9401887 3 Fold05
#> 11 0.9977862 0.9643737 2 Fold06
#> 12 0.9976281 0.9617378 3 Fold06
#> 13 0.9969956 0.9505907 2 Fold07
#> 14 0.9971537 0.9530769 3 Fold07
#> 15 0.9971537 0.9537541 2 Fold08
#> 16 0.9973118 0.9564281 3 Fold08
#> 17 0.9968374 0.9491052 2 Fold09
#> 18 0.9966793 0.9464329 3 Fold09
#> 19 0.9963631 0.9421597 2 Fold10
#> 20 0.9962049 0.9397866 3 Fold10
The average accuracy across the ten folds when mtry = 2 is 99.70%, with a standard deviation of 0.045 across the resamples. Because the fold accuracies vary only in the third decimal place, we can have confidence in the model performing well even when there is some variation in the data. However, if another data set is significantly different from the training data set, there could be a considerable change in performance.
Support vector machines (SVMs) are a classification (or regression) technique that generates a hyperplane in N-dimensional space to separate the classes. This is accomplished by choosing the hyperplane that creates the largest margin between the classes. SVM is touted as one of the best classifiers when the data is numeric and continuous and the right kernel is used (not exactly our use case, though we will see that SVM does a good job predicting on our data set).
Support vectors are the data points closest to the hyperplane; they influence its position and orientation and fall within the allowable margin on either side of the hyperplane. The SVM model produces a signed decision score for each observation, with the margin boundaries at -1 and 1; the sign of the score indicates which class the model assigns the observation to.
Because class probabilities are requested in trainControl, caret's predict function returns probabilities rather than raw decision scores. This allows a threshold to be chosen and confusion matrices to be generated.
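caret's svmRadial method wraps kernlab::ksvm; the distinction between decision scores and probabilities can be seen directly in a minimal sketch (fit on a random subsample to keep it quick; the kernel and cost parameter here are placeholders, not the tuned values):
library(kernlab)
set.seed(4)
svm_sub <- df_factor[sample(nrow(df_factor), 5000), ] #subsample keeps the sketch quick
#prob.model = TRUE fits a secondary model mapping decision scores to probabilities
svm.direct <- ksvm(Blue_Tarp_or_Not ~ Red + Green + Blue, data = svm_sub,
                   kernel = "rbfdot", C = 1, prob.model = TRUE)
head(predict(svm.direct, newdata = svm_sub, type = "decision")) #signed scores
head(predict(svm.direct, newdata = svm_sub, type = "probabilities")) #class probabilities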
#pass
fitControl <- trainControl(method = "cv",
number = 10,
returnResamp = 'all',
savePredictions = 'final',
classProbs = TRUE)
set.seed(4)
tic("SVM")
svm.radial.fit <- train(Blue_Tarp_or_Not~Red+Green+Blue,
data = df_factor,
preProcess=c("center","scale"),
method="svmRadial",
trControl= fitControl,
tuneLength=10
#tuneGrid = expand.grid(C=seq(0,10, length=10),
# sigma =seq(0,10, length=10))
)
svm.time <- toc(quiet = TRUE)
svm.radial.fit
"Summary"
summary(svm.radial.fit)
#> Support Vector Machines with Radial Basis Function Kernel
#>
#> 63241 samples
#> 3 predictor
#> 2 classes: 'NBT', 'BT'
#>
#> Pre-processing: centered (3), scaled (3)
#> Resampling: Cross-Validated (10 fold)
#> Summary of sample sizes: 56917, 56917, 56917, 56916, 56917, 56917, ...
#> Resampling results across tuning parameters:
#>
#> C Accuracy Kappa
#> 0.25 0.9967268 0.9463658
#> 0.50 0.9968849 0.9489059
#> 1.00 0.9969798 0.9505083
#> 2.00 0.9970589 0.9518827
#> 4.00 0.9971063 0.9526307
#> 8.00 0.9971854 0.9538882
#> 16.00 0.9972170 0.9545384
#> 32.00 0.9973909 0.9575077
#> 64.00 0.9973751 0.9574081
#> 128.00 0.9974384 0.9584802
#>
#> Tuning parameter 'sigma' was held constant at a value of 9.020118
#> Accuracy was used to select the optimal model using the largest value.
#> The final values used for the model were sigma = 9.020118 and C = 128.
#> [1] "Summary"
#> Length Class Mode
#> 1 ksvm S4
plot(svm.radial.fit)
Linear and polynomial SVM kernels were also considered. The radial kernel produced the highest accuracy of the three, by less than a percentage point, and was therefore chosen for building the SVM model.
The best tuning parameters selected for the radial model fit were sigma = 9.020118 and C = 128.
#pass
SVM.prob <- predict(svm.radial.fit, newdata=df_factor , type = "prob") #returns df with col 0 (prob not blue tarp) and 1 (prob blue tarp)
par(pty="s")
SVM_roc <- roc(df_factor $Blue_Tarp_or_Not, SVM.prob[,2], plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Positive Percentage", col="#0072B2", lwd=4, print.auc=TRUE, main="SVM ROC Curve")
An AUC of 99.9% indicates that our model is able to distinguish between the classes almost perfectly. It should be possible to choose a threshold that can identify most of the training blue tarps without going over our budget.
roc.info_svm <- roc(df_factor$Blue_Tarp_or_Not, SVM.prob[,2], legacy.axes=TRUE)
roc.svm.df <- data.frame(tpp=roc.info_svm$sensitivities*100, fpp=(1-roc.info_svm$specificities)*100, thresholds=roc.info_svm$thresholds)
#roc.svm.df[roc.svm.df>99 & roc.svm.df < 100,]
#roc.svm.df
fig7 <- plot_ly(roc.svm.df, x=~tpp, y=~fpp, z=~thresholds, colors = "#0072B2") #Reference: https://plotly.com/r/3d-scatter-plots/ https://plotly.com/r/3d-surface-plots/
fig7 <- fig7 %>% add_markers()
fig7 <- fig7 %>% layout(scene=list(xaxis=list(title="True Positive Rate"),
yaxis = list(title = 'False Positive Rate'),
zaxis = list(title = 'Threshold')))
fig7
SVM.thresh <- 0.0048
SVM.pred_thresh <- factor(ifelse(SVM.prob[,2]>SVM.thresh,"BT", "NBT"), levels=c("NBT", "BT"))
cm.SVM_thresh <- confusionMatrix(factor(SVM.pred_thresh),df_factor $Blue_Tarp_or_Not, positive = "BT")
"Threshold:"
SVM.thresh
cm.SVM_thresh
acc_SVM <- cm.SVM_thresh[["overall"]][["Accuracy"]]*100
auc_SVM <- SVM_roc[["auc"]]
thresh_SVM <- SVM.thresh
sens_SVM <- cm.SVM_thresh[["byClass"]][["Sensitivity"]]*100
spec_SVM <- cm.SVM_thresh[["byClass"]][["Specificity"]]*100
FDR_SVM <- ((cm.SVM_thresh[["table"]][2,1])/(cm.SVM_thresh[["table"]][2,1]+cm.SVM_thresh[["table"]][2,2]))*100
prec_SVM <- cm.SVM_thresh[["byClass"]][["Precision"]]*100
sigma_best <- round(svm.radial.fit[["bestTune"]][["sigma"]], 4)
C_best <- svm.radial.fit[["bestTune"]][["C"]]
F.svm <- round(2*((prec_SVM*sens_SVM)/(prec_SVM+sens_SVM))/100,2)
F3.svm <- round(((1+beta_val**2)*((prec_SVM*sens_SVM)/(beta_val*prec_SVM+sens_SVM)))/100, 2)
cost.SVM <- round(cm.SVM_thresh[["table"]][1]*0 + cm.SVM_thresh[["table"]][2]*FP_C + cm.SVM_thresh[["table"]][3]*FN_C + cm.SVM_thresh[["table"]][4]*TP_C, 4)
cost.SVM
#> [1] "Threshold:"
#> [1] 0.0048
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction NBT BT
#> NBT 60374 4
#> BT 845 2018
#>
#> Accuracy : 0.9866
#> 95% CI : (0.9856, 0.9875)
#> No Information Rate : 0.968
#> P-Value [Acc > NIR] : < 2.2e-16
#>
#> Kappa : 0.8194
#>
#> Mcnemar's Test P-Value : < 2.2e-16
#>
#> Sensitivity : 0.99802
#> Specificity : 0.98620
#> Pos Pred Value : 0.70486
#> Neg Pred Value : 0.99993
#> Prevalence : 0.03197
#> Detection Rate : 0.03191
#> Detection Prevalence : 0.04527
#> Balanced Accuracy : 0.99211
#>
#> 'Positive' Class : BT
#>
#> [1] -1786784
"10 Fold Results"
svm.radial.fit$resample #point est +/- std from 10 folds "variation in the third decimal place"...
svm.sd <- sd(svm.radial.fit[["resample"]][["Accuracy"]]*100)
#> [1] "10 Fold Results"
#> Accuracy Kappa sigma C Resample
#> 1 0.9966793 0.9448527 9.020118 0.25 Fold01
#> 2 0.9968374 0.9476075 9.020118 0.50 Fold01
#> 3 0.9973118 0.9555754 9.020118 1.00 Fold01
#> 4 0.9973118 0.9555754 9.020118 2.00 Fold01
#> 5 0.9974700 0.9580860 9.020118 4.00 Fold01
#> 6 0.9974700 0.9580860 9.020118 8.00 Fold01
#> 7 0.9974700 0.9580860 9.020118 16.00 Fold01
#> 8 0.9974700 0.9580860 9.020118 32.00 Fold01
#> 9 0.9974700 0.9582906 9.020118 64.00 Fold01
#> 10 0.9974700 0.9582906 9.020118 128.00 Fold01
#> 11 0.9966793 0.9456543 9.020118 0.25 Fold02
#> 12 0.9971537 0.9535305 9.020118 0.50 Fold02
#> 13 0.9973118 0.9560058 9.020118 1.00 Fold02
#> 14 0.9973118 0.9562180 9.020118 2.00 Fold02
#> 15 0.9971537 0.9535305 9.020118 4.00 Fold02
#> 16 0.9974700 0.9586938 9.020118 8.00 Fold02
#> 17 0.9974700 0.9586938 9.020118 16.00 Fold02
#> 18 0.9977862 0.9640309 9.020118 32.00 Fold02
#> 19 0.9977862 0.9643737 9.020118 64.00 Fold02
#> 20 0.9976281 0.9619196 9.020118 128.00 Fold02
#> 21 0.9966793 0.9453897 9.020118 0.25 Fold03
#> 22 0.9971537 0.9535305 9.020118 0.50 Fold03
#> 23 0.9974700 0.9588925 9.020118 1.00 Fold03
#> 24 0.9974700 0.9590893 9.020118 2.00 Fold03
#> 25 0.9974700 0.9590893 9.020118 4.00 Fold03
#> 26 0.9976281 0.9615542 9.020118 8.00 Fold03
#> 27 0.9973118 0.9566362 9.020118 16.00 Fold03
#> 28 0.9971537 0.9544119 9.020118 32.00 Fold03
#> 29 0.9969956 0.9517649 9.020118 64.00 Fold03
#> 30 0.9969956 0.9517649 9.020118 128.00 Fold03
#> 31 0.9963636 0.9410501 9.020118 0.25 Fold04
#> 32 0.9960474 0.9356152 9.020118 0.50 Fold04
#> 33 0.9963636 0.9407660 9.020118 1.00 Fold04
#> 34 0.9963636 0.9407660 9.020118 2.00 Fold04
#> 35 0.9963636 0.9407660 9.020118 4.00 Fold04
#> 36 0.9962055 0.9383392 9.020118 8.00 Fold04
#> 37 0.9965217 0.9437481 9.020118 16.00 Fold04
#> 38 0.9969960 0.9515348 9.020118 32.00 Fold04
#> 39 0.9973123 0.9568424 9.020118 64.00 Fold04
#> 40 0.9974704 0.9594774 9.020118 128.00 Fold04
#> 41 0.9971537 0.9530769 9.020118 0.25 Fold05
#> 42 0.9974700 0.9580860 9.020118 0.50 Fold05
#> 43 0.9971537 0.9530769 9.020118 1.00 Fold05
#> 44 0.9973118 0.9555754 9.020118 2.00 Fold05
#> 45 0.9973118 0.9553569 9.020118 4.00 Fold05
#> 46 0.9973118 0.9553569 9.020118 8.00 Fold05
#> 47 0.9974700 0.9580860 9.020118 16.00 Fold05
#> 48 0.9974700 0.9580860 9.020118 32.00 Fold05
#> 49 0.9974700 0.9582906 9.020118 64.00 Fold05
#> 50 0.9976281 0.9609926 9.020118 128.00 Fold05
#> 51 0.9966793 0.9448527 9.020118 0.25 Fold06
#> 52 0.9966793 0.9448527 9.020118 0.50 Fold06
#> 53 0.9968374 0.9478633 9.020118 1.00 Fold06
#> 54 0.9968374 0.9478633 9.020118 2.00 Fold06
#> 55 0.9971537 0.9530769 9.020118 4.00 Fold06
#> 56 0.9973118 0.9557916 9.020118 8.00 Fold06
#> 57 0.9973118 0.9557916 9.020118 16.00 Fold06
#> 58 0.9977862 0.9638571 9.020118 32.00 Fold06
#> 59 0.9976281 0.9611816 9.020118 64.00 Fold06
#> 60 0.9977862 0.9640309 9.020118 128.00 Fold06
#> 61 0.9969956 0.9505907 9.020118 0.25 Fold07
#> 62 0.9969956 0.9505907 9.020118 0.50 Fold07
#> 63 0.9968374 0.9476075 9.020118 1.00 Fold07
#> 64 0.9971537 0.9530769 9.020118 2.00 Fold07
#> 65 0.9971537 0.9530769 9.020118 4.00 Fold07
#> 66 0.9969956 0.9501048 9.020118 8.00 Fold07
#> 67 0.9973118 0.9555754 9.020118 16.00 Fold07
#> 68 0.9974700 0.9582906 9.020118 32.00 Fold07
#> 69 0.9974700 0.9584932 9.020118 64.00 Fold07
#> 70 0.9974700 0.9582906 9.020118 128.00 Fold07
#> 71 0.9968374 0.9486156 9.020118 0.25 Fold08
#> 72 0.9971537 0.9535305 9.020118 0.50 Fold08
#> 73 0.9974700 0.9586938 9.020118 1.00 Fold08
#> 74 0.9976281 0.9613688 9.020118 2.00 Fold08
#> 75 0.9977862 0.9640309 9.020118 4.00 Fold08
#> 76 0.9977862 0.9640309 9.020118 8.00 Fold08
#> 77 0.9977862 0.9640309 9.020118 16.00 Fold08
#> 78 0.9981025 0.9693170 9.020118 32.00 Fold08
#> 79 0.9977862 0.9643737 9.020118 64.00 Fold08
#> 80 0.9977862 0.9642031 9.020118 128.00 Fold08
#> 81 0.9962049 0.9380407 9.020118 0.25 Fold09
#> 82 0.9962049 0.9377398 9.020118 0.50 Fold09
#> 83 0.9962049 0.9377398 9.020118 1.00 Fold09
#> 84 0.9963631 0.9401887 9.020118 2.00 Fold09
#> 85 0.9963631 0.9401887 9.020118 4.00 Fold09
#> 86 0.9966793 0.9453897 9.020118 8.00 Fold09
#> 87 0.9966793 0.9453897 9.020118 16.00 Fold09
#> 88 0.9968374 0.9481165 9.020118 32.00 Fold09
#> 89 0.9971537 0.9535305 9.020118 64.00 Fold09
#> 90 0.9974700 0.9588925 9.020118 128.00 Fold09
#> 91 0.9969956 0.9515345 9.020118 0.25 Fold10
#> 92 0.9971537 0.9539755 9.020118 0.50 Fold10
#> 93 0.9968374 0.9488616 9.020118 1.00 Fold10
#> 94 0.9968374 0.9491052 9.020118 2.00 Fold10
#> 95 0.9968374 0.9491052 9.020118 4.00 Fold10
#> 96 0.9969956 0.9515345 9.020118 8.00 Fold10
#> 97 0.9968374 0.9493464 9.020118 16.00 Fold10
#> 98 0.9968374 0.9493464 9.020118 32.00 Fold10
#> 99 0.9966793 0.9469395 9.020118 64.00 Fold10
#> 100 0.9966793 0.9469395 9.020118 128.00 Fold10
The average accuracy across the ten folds at the selected tuning parameters (sigma = 9.02, C = 128) is 99.74%, with a standard deviation of 0.045 across the resamples. Because the fold accuracies vary only in the third decimal place, we can have confidence in the model performing well even when there is some variation in the data. However, if another data set is significantly different from the training data set, there could be a considerable change in performance.
A model fit to maximize performance on its training data set is not guaranteed to perform well on future data sets; if a model has been overfit, its performance on new data can vary significantly. Here we have a hold-out test data set of approximately 2 million data points that we will use to evaluate the final models.
In the subsections below, the final fit models will predict the class of the test data. A confusion matrix and ROC curve are shown for each model. A collective summary table and comparison of model performance are located in later sections.
df_FHO <- as_tibble(read.csv("SYS6018_FHO_Data.csv"))
df_FHO$Blue_Tarp_or_Not <- factor(df_FHO$Blue_Tarp_or_Not, labels = c("NBT", "BT"))
df_FHO
#> # A tibble: 2,004,177 x 4
#> Red Green Blue Blue_Tarp_or_Not
#> <int> <int> <int> <fct>
#> 1 104 89 63 NBT
#> 2 101 80 60 NBT
#> 3 103 87 69 NBT
#> 4 107 93 72 NBT
#> 5 109 99 68 NBT
#> 6 103 73 53 NBT
#> 7 100 79 56 NBT
#> 8 98 70 51 NBT
#> 9 97 73 56 NBT
#> 10 99 79 61 NBT
#> # ... with 2,004,167 more rows
tic("Log.Reg. Predict")
glm.prob_FHO <- predict(glm.fit, newdata=df_FHO , type = "prob") #returns df with col 0 (prob not blue tarp) and 1 (prob blue tarp)
glm.predict.time <- toc(quiet = TRUE)
lr.thresh_FHO <- lr.thresh
glm.pred_thresh_FHO <- factor(ifelse(glm.prob_FHO[,2]>lr.thresh_FHO,"BT", "NBT"), levels=c("NBT", "BT"))
cm.glm_thresh_FHO <- confusionMatrix(factor(glm.pred_thresh_FHO),df_FHO$Blue_Tarp_or_Not, positive = "BT")
"Threshold:"
lr.thresh_FHO
cm.glm_thresh_FHO
acc_LR_FHO <- round(cm.glm_thresh_FHO[["overall"]][["Accuracy"]]*100, 2)
auc_LR_FHO <- round(glm_roc[["auc"]], 2)
thresh_LR_FHO <- lr.thresh_FHO
sens_LR_FHO <- round(cm.glm_thresh_FHO[["byClass"]][["Sensitivity"]]*100, 2)
spec_LR_FHO <- round(cm.glm_thresh_FHO[["byClass"]][["Specificity"]]*100, 2)
FDR_LR_FHO <- round(((cm.glm_thresh_FHO[["table"]][2,1])/(cm.glm_thresh_FHO[["table"]][2,1]+cm.glm_thresh_FHO[["table"]][2,2]))*100, 2)
prec_LR_FHO <- round(cm.glm_thresh_FHO[["byClass"]][["Precision"]]*100, 2)
F.glm_FHO <- round(2*((prec_LR_FHO*sens_LR_FHO)/(prec_LR_FHO+sens_LR_FHO))/100, 2)
F3.glm_FHO <- round(((1+beta_val**2)*((prec_LR_FHO*sens_LR_FHO)/(beta_val*prec_LR_FHO+sens_LR_FHO)))/100, 2)
#> [1] "Threshold:"
#> [1] 0.03265
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction NBT BT
#> NBT 1767335 0
#> BT 222362 14480
#>
#> Accuracy : 0.8891
#> 95% CI : (0.8886, 0.8895)
#> No Information Rate : 0.9928
#> P-Value [Acc > NIR] : 1
#>
#> Kappa : 0.103
#>
#> Mcnemar's Test P-Value : <2e-16
#>
#> Sensitivity : 1.000000
#> Specificity : 0.888243
#> Pos Pred Value : 0.061138
#> Neg Pred Value : 1.000000
#> Prevalence : 0.007225
#> Detection Rate : 0.007225
#> Detection Prevalence : 0.118174
#> Balanced Accuracy : 0.944122
#>
#> 'Positive' Class : BT
#>
par(pty="s")
glm_roc_FHO <- roc(df_FHO$Blue_Tarp_or_Not, glm.prob_FHO[,2], plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Positive Percentage", col="#0072B2", lwd=4, print.auc=TRUE, main="Log. Reg. FHO ROC Curve")
auc_LR_FHO <- round(glm_roc_FHO[["auc"]], 2)
tic("LDA Predict")
lda.prob_FHO <- predict(lda.fit, newdata=df_FHO, type = "prob") #returns df with col 0 (prob not blue tarp) and 1 (prob blue tarp)
lda.predict.time <- toc(quiet = TRUE)
lda.thresh_FHO <- lda.thresh
lda.pred_thresh_FHO <- factor(ifelse(lda.prob_FHO[,2]>lda.thresh_FHO,"BT", "NBT"), levels=c("NBT", "BT"))
cm.lda_thresh_FHO <- confusionMatrix(factor(lda.pred_thresh_FHO),df_FHO$Blue_Tarp_or_Not, positive = "BT")
"Threshold:"
lda.thresh_FHO
cm.lda_thresh_FHO
acc_lda_FHO <- round(cm.lda_thresh_FHO[["overall"]][["Accuracy"]]*100, 2)
thresh_lda_FHO <- lda.thresh_FHO
sens_lda_FHO <- round(cm.lda_thresh_FHO[["byClass"]][["Sensitivity"]]*100,2)
spec_lda_FHO <- round(cm.lda_thresh_FHO[["byClass"]][["Specificity"]]*100,2)
FDR_lda_FHO <- round(((cm.lda_thresh_FHO[["table"]][2,1])/(cm.lda_thresh_FHO[["table"]][2,1]+cm.lda_thresh_FHO[["table"]][2,2]))*100,2)
prec_lda_FHO <- round(cm.lda_thresh_FHO[["byClass"]][["Precision"]]*100, 2)
F.lda_FHO <- round(2*((prec_lda_FHO*sens_lda_FHO)/(prec_lda_FHO+sens_lda_FHO))/100, 2)
F3.lda_FHO <- round(((1+beta_val**2)*((prec_lda_FHO*sens_lda_FHO)/(beta_val*prec_lda_FHO+sens_lda_FHO)))/100, 2)
#> [1] "Threshold:"
#> [1] 0.01
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction NBT BT
#> NBT 1926064 620
#> BT 63633 13860
#>
#> Accuracy : 0.9679
#> 95% CI : (0.9677, 0.9682)
#> No Information Rate : 0.9928
#> P-Value [Acc > NIR] : 1
#>
#> Kappa : 0.2928
#>
#> Mcnemar's Test P-Value : <2e-16
#>
#> Sensitivity : 0.957182
#> Specificity : 0.968019
#> Pos Pred Value : 0.178855
#> Neg Pred Value : 0.999678
#> Prevalence : 0.007225
#> Detection Rate : 0.006916
#> Detection Prevalence : 0.038666
#> Balanced Accuracy : 0.962601
#>
#> 'Positive' Class : BT
#>
par(pty="s")
lda_roc_FHO <- roc(df_FHO$Blue_Tarp_or_Not, lda.prob_FHO[,2], plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Positive Percentage", col="#0072B2", lwd=4, print.auc=TRUE, main="LDA FHO ROC Curve")
auc_lda_FHO <- round(lda_roc_FHO[["auc"]],2)
tic("QDA Predict")
qda.prob_FHO <- predict(qda.fit, newdata = df_FHO, type = "prob") # returns a data frame with columns NBT (prob. not blue tarp) and BT (prob. blue tarp)
qda.predict.time <- toc(quiet = TRUE)
qda.thresh_FHO <- qda.thresh
qda.pred_thresh_FHO <- factor(ifelse(qda.prob_FHO[,2]>qda.thresh_FHO,"BT", "NBT"), levels=c("NBT", "BT"))
cm.qda_thresh_FHO <- confusionMatrix(factor(qda.pred_thresh_FHO),df_FHO$Blue_Tarp_or_Not, positive = "BT")
"Threshold:"
qda.thresh_FHO
cm.qda_thresh_FHO
acc_qda_FHO <- round(cm.qda_thresh_FHO[["overall"]][["Accuracy"]]*100,2)
thresh_qda_FHO <- qda.thresh_FHO
sens_qda_FHO <- round(cm.qda_thresh_FHO[["byClass"]][["Sensitivity"]]*100,2)
spec_qda_FHO <- round(cm.qda_thresh_FHO[["byClass"]][["Specificity"]]*100,2)
FDR_qda_FHO <- round(((cm.qda_thresh_FHO[["table"]][2,1])/(cm.qda_thresh_FHO[["table"]][2,1]+cm.qda_thresh_FHO[["table"]][2,2]))*100,2)
prec_qda_FHO <- round(cm.qda_thresh_FHO[["byClass"]][["Precision"]]*100,2)
F.qda_FHO <- round(2*((prec_qda_FHO*sens_qda_FHO)/(prec_qda_FHO+sens_qda_FHO))/100,2)
F3.qda_FHO <- round(((1+beta_val**2)*((prec_qda_FHO*sens_qda_FHO)/(beta_val*prec_qda_FHO+sens_qda_FHO)))/100, 2)
#> [1] "Threshold:"
#> [1] 0.02
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction NBT BT
#> NBT 1925760 1103
#> BT 63937 13377
#>
#> Accuracy : 0.9675
#> 95% CI : (0.9673, 0.9678)
#> No Information Rate : 0.9928
#> P-Value [Acc > NIR] : 1
#>
#> Kappa : 0.2827
#>
#> Mcnemar's Test P-Value : <2e-16
#>
#> Sensitivity : 0.923826
#> Specificity : 0.967866
#> Pos Pred Value : 0.173022
#> Neg Pred Value : 0.999428
#> Prevalence : 0.007225
#> Detection Rate : 0.006675
#> Detection Prevalence : 0.038576
#> Balanced Accuracy : 0.945846
#>
#> 'Positive' Class : BT
#>
par(pty="s")
qda_roc_FHO <- roc(df_FHO$Blue_Tarp_or_Not, qda.prob_FHO[,2], plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Positive Percentage", col="#0072B2", lwd=4, print.auc=TRUE, main="QDA FHO ROC Curve")
auc_qda_FHO <- round(qda_roc_FHO[["auc"]],2)
tic("KNN Predict")
knn.prob_FHO <- predict(knn.fit, newdata = df_FHO, type = "prob") # returns a data frame with columns NBT (prob. not blue tarp) and BT (prob. blue tarp)
knn.predict.time <- toc(quiet = TRUE)
knn.thresh_FHO <- knn.thresh
knn.pred_thresh_FHO <- factor(ifelse(knn.prob_FHO[,2]>knn.thresh_FHO,"BT", "NBT"), levels=c("NBT", "BT"))
cm.knn_thresh_FHO <- confusionMatrix(factor(knn.pred_thresh_FHO),df_FHO$Blue_Tarp_or_Not, positive = "BT")
"Threshold:"
knn.thresh_FHO
cm.knn_thresh_FHO
acc_knn_FHO <- round(cm.knn_thresh_FHO[["overall"]][["Accuracy"]]*100,2)
thresh_knn_FHO <- knn.thresh_FHO
sens_knn_FHO <- round(cm.knn_thresh_FHO[["byClass"]][["Sensitivity"]]*100,2)
spec_knn_FHO <- round(cm.knn_thresh_FHO[["byClass"]][["Specificity"]]*100,2)
FDR_knn_FHO <- round(((cm.knn_thresh_FHO[["table"]][2,1])/(cm.knn_thresh_FHO[["table"]][2,1]+cm.knn_thresh_FHO[["table"]][2,2]))*100,2)
prec_knn_FHO <- round(cm.knn_thresh_FHO[["byClass"]][["Precision"]]*100,2)
F.knn_FHO <- round(2*((prec_knn_FHO*sens_knn_FHO)/(prec_knn_FHO+sens_knn_FHO))/100,2)
F3.knn_FHO <- round(((1+beta_val**2)*((prec_knn_FHO*sens_knn_FHO)/(beta_val*prec_knn_FHO+sens_knn_FHO)))/100, 2)
#> [1] "Threshold:"
#> [1] 0.07
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction NBT BT
#> NBT 1941216 988
#> BT 48481 13492
#>
#> Accuracy : 0.9753
#> 95% CI : (0.9751, 0.9755)
#> No Information Rate : 0.9928
#> P-Value [Acc > NIR] : 1
#>
#> Kappa : 0.3453
#>
#> Mcnemar's Test P-Value : <2e-16
#>
#> Sensitivity : 0.931768
#> Specificity : 0.975634
#> Pos Pred Value : 0.217708
#> Neg Pred Value : 0.999491
#> Prevalence : 0.007225
#> Detection Rate : 0.006732
#> Detection Prevalence : 0.030922
#> Balanced Accuracy : 0.953701
#>
#> 'Positive' Class : BT
#>
par(pty="s")
knn_roc_FHO <- roc(df_FHO$Blue_Tarp_or_Not, knn.prob_FHO[,2], plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Positive Percentage", col="#0072B2", lwd=4, print.auc=TRUE, main="KNN FHO ROC Curve")
auc_knn_FHO <- round(knn_roc_FHO[["auc"]],2)
tic("RF Predict")
RF.prob_FHO <- predict(rf.fit, newdata = df_FHO, type = "prob") # returns a data frame with columns NBT (prob. not blue tarp) and BT (prob. blue tarp)
rf.predict.time <- toc(quiet = TRUE)
RF.thresh_FHO <- RF.thresh
RF.pred_thresh_FHO <- factor(ifelse(RF.prob_FHO[,2]>RF.thresh_FHO,"BT", "NBT"), levels=c("NBT", "BT"))
cm.RF_thresh_FHO <- confusionMatrix(factor(RF.pred_thresh_FHO),df_FHO$Blue_Tarp_or_Not, positive = "BT")
"Threshold:"
RF.thresh_FHO
cm.RF_thresh_FHO
acc_RF_FHO <- round(cm.RF_thresh_FHO[["overall"]][["Accuracy"]]*100,2)
thresh_RF_FHO <- RF.thresh
sens_RF_FHO <- round(cm.RF_thresh_FHO[["byClass"]][["Sensitivity"]]*100,2)
spec_RF_FHO <- round(cm.RF_thresh_FHO[["byClass"]][["Specificity"]]*100,2)
FDR_RF_FHO <- round(((cm.RF_thresh_FHO[["table"]][2,1])/(cm.RF_thresh_FHO[["table"]][2,1]+cm.RF_thresh_FHO[["table"]][2,2]))*100,2)
prec_RF_FHO <- round(cm.RF_thresh_FHO[["byClass"]][["Precision"]]*100,2)
F.rf_FHO <- round(2*((prec_RF_FHO*sens_RF_FHO)/(prec_RF_FHO+sens_RF_FHO))/100,2)
F3.rf_FHO <- round(((1+beta_val**2)*((prec_RF_FHO*sens_RF_FHO)/(beta_val*prec_RF_FHO+sens_RF_FHO)))/100, 2)
#> [1] "Threshold:"
#> [1] 0.36
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction NBT BT
#> NBT 1974225 2387
#> BT 15472 12093
#>
#> Accuracy : 0.9911
#> 95% CI : (0.991, 0.9912)
#> No Information Rate : 0.9928
#> P-Value [Acc > NIR] : 1
#>
#> Kappa : 0.5712
#>
#> Mcnemar's Test P-Value : <2e-16
#>
#> Sensitivity : 0.835152
#> Specificity : 0.992224
#> Pos Pred Value : 0.438709
#> Neg Pred Value : 0.998792
#> Prevalence : 0.007225
#> Detection Rate : 0.006034
#> Detection Prevalence : 0.013754
#> Balanced Accuracy : 0.913688
#>
#> 'Positive' Class : BT
#>
par(pty="s")
rf_roc_FHO <- roc(df_FHO$Blue_Tarp_or_Not, RF.prob_FHO[,2], plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Positive Percentage", col="#0072B2", lwd=4, print.auc=TRUE, main="RF FHO ROC Curve")
auc_RF_FHO <- round(rf_roc_FHO[["auc"]],2)
tic("SVM Predict")
SVM.prob_FHO <- predict(svm.radial.fit, newdata = df_FHO, type = "prob") # returns a data frame with columns NBT (prob. not blue tarp) and BT (prob. blue tarp)
svm.predict.time <- toc(quiet = TRUE)
SVM.thresh_FHO <- SVM.thresh
SVM.pred_thresh_FHO <- factor(ifelse(SVM.prob_FHO[,2]>SVM.thresh_FHO,"BT", "NBT"), levels=c("NBT", "BT"))
cm.SVM_thresh_FHO <- confusionMatrix(factor(SVM.pred_thresh_FHO),df_FHO$Blue_Tarp_or_Not, positive = "BT")
"Threshold:"
SVM.thresh_FHO
cm.SVM_thresh_FHO
acc_SVM_FHO <- round(cm.SVM_thresh_FHO[["overall"]][["Accuracy"]]*100,2)
thresh_SVM_FHO <- SVM.thresh
sens_SVM_FHO <- round(cm.SVM_thresh_FHO[["byClass"]][["Sensitivity"]]*100,2)
spec_SVM_FHO <- round(cm.SVM_thresh_FHO[["byClass"]][["Specificity"]]*100,2)
FDR_SVM_FHO <- round(((cm.SVM_thresh_FHO[["table"]][2,1])/(cm.SVM_thresh_FHO[["table"]][2,1]+cm.SVM_thresh_FHO[["table"]][2,2]))*100,2)
prec_SVM_FHO <- round(cm.SVM_thresh_FHO[["byClass"]][["Precision"]]*100,2)
F.svm_FHO <- round(2*((prec_SVM_FHO*sens_SVM_FHO)/(prec_SVM_FHO+sens_SVM_FHO))/100,2)
F3.svm_FHO <- round(((1+beta_val**2)*((prec_SVM_FHO*sens_SVM_FHO)/(beta_val*prec_SVM_FHO+sens_SVM_FHO)))/100, 2)
#> [1] "Threshold:"
#> [1] 0.0048
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction NBT BT
#> NBT 1929956 2627
#> BT 59741 11853
#>
#> Accuracy : 0.9689
#> 95% CI : (0.9686, 0.9691)
#> No Information Rate : 0.9928
#> P-Value [Acc > NIR] : 1
#>
#> Kappa : 0.2666
#>
#> Mcnemar's Test P-Value : <2e-16
#>
#> Sensitivity : 0.818577
#> Specificity : 0.969975
#> Pos Pred Value : 0.165559
#> Neg Pred Value : 0.998641
#> Prevalence : 0.007225
#> Detection Rate : 0.005914
#> Detection Prevalence : 0.035722
#> Balanced Accuracy : 0.894276
#>
#> 'Positive' Class : BT
#>
par(pty="s")
svm_roc_FHO <- roc(df_FHO$Blue_Tarp_or_Not, SVM.prob_FHO[,2], plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Positive Percentage", col="#0072B2", lwd=4, print.auc=TRUE, main="SVM FHO ROC Curve")
auc_SVM_FHO <- round(svm_roc_FHO[["auc"]],2)
Cross-Validation Performance Summary
Method | KNN (k = 7) | LDA | QDA | Log. Regression | Random Forest (mtry = 2, nTrees = 500) | SVM (sigma = 9.0201, C = 128) |
---|---|---|---|---|---|---|
Accuracy | 99.16% | 97.62% | 97.75% | 98.28% | 99.94% | 98.66% |
AUC | 99.98% | 98.89% | 99.82% | 99.85% | 99.45% | 99.92% |
Threshold | 0.07 | 0.01 | 0.02 | 0.03265 | 0.36 | 0.0048 |
Sensitivity=Recall | 100% | 88.58% | 99.31% | 98.02% | 98.91% | 99.8% |
Specificity=1-FPR | 99.13% | 97.92% | 97.69% | 98.29% | 99.97% | 98.62% |
FDR | 20.89% | 41.59% | 41.29% | 34.59% | 0.79% | 29.51% |
Precision | 79.11% | 58.41% | 58.71% | 65.41% | 99.21% | 70.49% |
Cost | -$1,676,700 | -$2,868,276 | -$1,998,844 | -$1,998,740 | -$1,613,612 | -$1,786,784 |
F1 Score | 0.88 | 0.7 | 0.74 | 0.78 | 0.99 | 0.83 |
F3 Score | 2.35 | 1.96 | 2.12 | 2.18 | 2.47 | 2.26 |
Final Hold Out (FHO) Performance Summary
Method | KNN (k = 7) | LDA | QDA | Log. Regression | Random Forest (mtry = 2, nTrees = 500) | SVM (sigma = 9.0201, C = 128) |
---|---|---|---|---|---|---|
Accuracy | 97.53% | 96.79% | 96.75% | 88.91% | 99.11% | 96.89% |
AUC | 96.27% | 99.21% | 99.15% | 99.94% | 98.04% | 94.12% |
Threshold | 0.07 | 0.01 | 0.02 | 0.03265 | 0.36 | 0.0048 |
Sensitivity=Recall | 93.18% | 95.72% | 92.38% | 100% | 83.52% | 81.86% |
Specificity=1-FPR | 97.56% | 96.8% | 96.79% | 88.82% | 99.22% | 97% |
FDR | 78.23% | 82.11% | 82.7% | 93.89% | 56.13% | 83.44% |
Precision | 21.77% | 17.89% | 17.3% | 6.11% | 43.87% | 16.56% |
F1 Score | 0.35 | 0.3 | 0.29 | 0.12 | 0.58 | 0.28 |
F3 Score | 1.28 | 1.15 | 1.11 | 0.52 | 1.7 | 1.03 |
Selection and Interpretation of Tuning Parameters:
Three models - KNN, RF, and SVM - had tuning parameters.
KNN’s tuning parameter is k, the number of nearest neighbors consulted when assigning a class to a new observation. My code specified a tuneLength of 10, which resulted in a model being fit for ten different k values: 5, 7, 9, 11, 13, 15, 17, 19, 21, and 23. The k with the highest cross-validated accuracy, k = 7, was selected as the best fit and used for the final model. A minimal sketch of how this grid can be specified is shown below.
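The sketch below (not the fitting code used earlier in the report) assumes a training frame df_train with the Red, Green, and Blue predictors and the Blue_Tarp_or_Not outcome; the trainControl settings mirror the cross-validation setup described above.
library(caret)
ctrl_sketch <- trainControl(method = "cv", number = 10, classProbs = TRUE) # 10-fold CV; keep class probabilities for later thresholding
# Implicit grid: tuneLength = 10 lets caret pick ten candidate k values (5, 7, ..., 23)
knn.sketch <- train(Blue_Tarp_or_Not ~ Red + Green + Blue, data = df_train,
                    method = "knn", trControl = ctrl_sketch, tuneLength = 10)
# Equivalent explicit grid, if the exact candidates should be controlled
knn.sketch2 <- train(Blue_Tarp_or_Not ~ Red + Green + Blue, data = df_train,
                     method = "knn", trControl = ctrl_sketch,
                     tuneGrid = expand.grid(k = seq(5, 23, by = 2)))
knn.sketch$bestTune # k with the highest CV accuracy (k = 7 in this analysis)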
RF’s tuning parameter was mtry, the number of predictors randomly considered at each split as the trees are grown. With only three predictors, values of mtry = 2 and mtry = 3 were considered. The model fit with mtry = 2 produced a higher accuracy than mtry = 3 and was selected as the best fit, although the difference in accuracy between the two was incredibly small (in the hundred-thousandths place). The number of trees, ntree, can also be specified; caret used the default of 500 trees, and given the high accuracy that produced, I did not investigate this parameter further. A sketch of an explicit grid over mtry, with ntree passed through, follows.
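As a sketch, again assuming the df_train frame and ctrl_sketch control object from the KNN sketch above: ntree is not part of caret's tuning grid for method = "rf", so it is passed straight through to randomForest(); the value 1000 below is only an example for checking sensitivity to the tree count.
library(caret)
ctrl_sketch <- trainControl(method = "cv", number = 10, classProbs = TRUE)
rf.sketch <- train(Blue_Tarp_or_Not ~ Red + Green + Blue, data = df_train,
                   method = "rf", trControl = ctrl_sketch,
                   tuneGrid = expand.grid(mtry = c(2, 3)), # only 2 and 3 make sense with three predictors
                   ntree = 1000)
rf.sketch$results # CV accuracy for mtry = 2 vs. mtry = 3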
SVM has two tuning parameters: C, the cost penalty (which ultimately controls the margin size), and sigma, a scaling factor on the exponential term (referred to as gamma in our textbook). More specifically, sigma is the “inverse width parameter” of the Gaussian radial basis kernel. The larger the C value, the bigger the penalty for misclassifications; this results in fewer support vectors and a smaller margin, and smaller margins mean lower bias and higher variance. With caret, a tuneLength of 10 was used, so ten values of C were fit on separate models. While trying to explain how sigma was chosen, I realized that only a single sigma value is used when tuneLength is specified. This thread indicates that svmRadial chooses one sigma based on the output of sigest, a function from the kernlab library. Per the function’s description, sigest returns a range of sigma values that should give good results; I assume svmRadial picks one of those values, though on what basis I am not sure. If I had realized this sooner, I could have used svmRadialSigma, which tunes over both cost and sigma.
Returning to the parameters that were chosen: C = 128 and sigma = 9.020118 produced the best fit among the ten options provided by tuneLength. If I had more time to spend on the project, I would investigate the difference in performance between models fit with C = 32 and C = 128 on the FHO data; there is only a 0.0000475% difference in cross-validated accuracy between those two C values. A smaller C results in a larger margin, and it is possible that a wider margin would predict the FHO data better. A hedged sketch of tuning over both C and sigma is shown below.
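A sketch of what tuning over both parameters with svmRadialSigma might look like; the sigma and C candidates below are illustrative choices that bracket the values selected above, and df_train / ctrl_sketch are the same assumed objects as in the earlier sketches.
library(caret)
ctrl_sketch <- trainControl(method = "cv", number = 10, classProbs = TRUE)
svm.grid <- expand.grid(sigma = c(1, 3, 9, 27), C = 2^(4:8)) # includes C = 32 and C = 128
svm.sketch <- train(Blue_Tarp_or_Not ~ Red + Green + Blue, data = df_train,
                    method = "svmRadialSigma", trControl = ctrl_sketch,
                    tuneGrid = svm.grid)
svm.sketch$bestTune # jointly selected sigma and C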
The time it took to fit each model and to run predictions on the FHO data set was recorded. These run times are not absolute; they will vary with hardware and with the size of the data sets passed to the functions. They should therefore be read as a relative measure for comparing the algorithms.
mfrt.glm <- round((glm.time$toc-glm.time$tic), 2)
mfrt.lda <- round((lda.time$toc-lda.time$tic), 2)
mfrt.qda <- round((qda.time$toc-qda.time$tic), 2)
mfrt.knn <- round((knn.time$toc-knn.time$tic), 2)
mfrt.rf <- round((rf.time$toc-rf.time$tic), 2)
mfrt.svm <- round((svm.time$toc-svm.time$tic), 2)
Model Fitting
Model | Run Time (s) |
---|---|
Log. Reg. | 5.57 |
LDA | 1.53 |
QDA | 1.33 |
KNN | 43.38 |
RF | 66.39 |
SVM | 171.98 |
mprt.glm <- round((glm.predict.time$toc-glm.predict.time$tic), 2)
mprt.lda <- round((lda.predict.time$toc-lda.predict.time$tic),2)
mprt.qda <- round((qda.predict.time$toc-qda.predict.time$tic) ,2)
mprt.knn <- round((knn.predict.time$toc-knn.predict.time$tic) ,2)
mprt.rf <- round((rf.predict.time$toc-rf.predict.time$tic),2)
mprt.svm <- round((svm.predict.time$toc-svm.predict.time$tic),2)
Predicting on the FHO
Model | Run Time (s) |
---|---|
Log. Reg. | 6.03 |
LDA | 6.8 |
QDA | 6.65 |
KNN | 358.67 |
RF | 16.67 |
SVM | 123.14 |
Computer Specs.
I have reordered the conclusions provided to us in the project description based on what I think provides the best flow for this section.
A discussion of the relevance of the metrics calculated in the tables to this application context
Accuracy
Accuracy is a misleading metric for our application. A high accuracy (the percentage of predictions that are correct) does not guarantee that we will achieve the goals of the use case. This is especially true for this project, where the data set is imbalanced and achieving a high true positive rate is paramount to executing the mission. The further we push the threshold to maximize the TPR, the more false positives the model predicts. False positives cost us resources and time, both of which are limited. A high accuracy therefore does not tell us whether a model will help us achieve our primary objective.
AUC
The area under the curve acts like a quick check as we compare performance between models. This metric quantifies how well a model is able to differentiate between classes. It is a good piece of information but not a critical metric that will help us decide which model will perform the best for this application.
ROC and Threshold
The ROC curve and the threshold are not performance metrics themselves, so they do not compare algorithms directly. However, the ROC curve is generated from performance metrics (TPR and FPR) evaluated at different threshold values. The threshold sets the decision point that determines which class an observation is assigned to: the prediction function produces a probability, that probability is compared to the threshold, and a class is assigned. By adjusting the threshold we can favor a metric such as the TPR. So the ROC curve and the threshold are the levers we use to tune the metrics that matter most for our use case; the sketch below shows how the ROC points map back to candidate thresholds.
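As an illustration, the sensitivity/specificity pairs underlying a ROC curve can be pulled straight out of a pROC object. This sketch reuses the glm_roc_FHO object built above; because that object was created with percent = TRUE, sensitivity and specificity are reported on a 0-100 scale while the thresholds remain on the probability scale.
library(pROC)
# Every point on the ROC curve corresponds to one candidate threshold
roc_pts <- coords(glm_roc_FHO, x = "all",
                  ret = c("threshold", "sensitivity", "specificity"),
                  transpose = FALSE)
head(roc_pts)
# Example: among thresholds keeping sensitivity at or above 80%, find the one with the best specificity (lowest FPR)
ok <- roc_pts[roc_pts$sensitivity >= 80, ]
ok[which.max(ok$specificity), ]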
Precision and Recall
The metrics most relevant to our use case are precision and recall. Our primary goal is to correctly identify blue tarps (maximize the true positive rate) so that aid can be delivered to the people associated with them. Recall quantifies how many true positives are identified out of all that could be identified, TP / (TP + FN); in other words, the percentage of the blue tarps actually present that we manage to find. It is critical that we achieve the highest recall we can within our limitations, because false negatives are effectively deaths.
Due to the class imbalance in the data set, we also have to be concerned with the number of false positives. Precision, which quantifies the percentage of true positives among all predicted positives, TP / (TP + FP), captures this concern. Identifying all 2,022 blue tarps in our data set (maximizing the TPR) does us no good if we also flag 10,000 non-tarps (a high FPR): traveling to 10,000 extra, incorrect locations would be a staggering waste of resources and would cost lives as aid workers run out of time to deliver supplies. To maximize precision, the algorithm must produce a high number of TP and a low number of FP.
Specificity
Specificity is the true negative rate or (1 - FPR). This metric indicates how well the model is classifying its negative condition. While we are somewhat concerned about how well the model is classifying true negatives as negatives, there is no penalty for a TN. A TN costs us $0 in this use case. Our false negatives and false positives are a larger concern to us. This makes the FPR and precision a larger focus for us.
FDR
The false discovery rate (FDR) indicates what percentage of our predicted positives are actually false positives, which for us means wasted time and resources. The FDR is the complement of precision (FDR = 1 - precision), so as we maximize precision we automatically minimize the FDR. The helper below makes these relationships explicit.
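As a minimal sketch (not the code used to produce the tables above), these metrics can be computed directly from any of the caret confusionMatrix objects generated earlier. Note that the helper uses the standard F-beta definition, with beta^2 on the precision term, which differs slightly from the F3 variant computed in the model chunks.
# Recall, precision, FDR, and the standard F-beta from a caret confusionMatrix object.
# cm$table is laid out with rows = Prediction (NBT, BT) and columns = Reference (NBT, BT).
summarise_cm <- function(cm, beta = 3) {
  TP <- cm$table["BT", "BT"]
  FP <- cm$table["BT", "NBT"]
  FN <- cm$table["NBT", "BT"]
  recall <- TP / (TP + FN)
  precision <- TP / (TP + FP)
  FDR <- 1 - precision # FDR is the complement of precision
  f_beta <- (1 + beta^2) * precision * recall / (beta^2 * precision + recall) # standard F-beta, maximum value 1
  c(recall = recall, precision = precision, FDR = FDR, f_beta = f_beta)
}
summarise_cm(cm.glm_thresh_FHO)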
Discussion of budget and threshold selection
As is the reality of any aid or disaster relief situation, there are not unlimited resources to provide aid to those affected by the disaster. To create a more realistic scenario, a budget was defined to guide threshold selection. Our aim was to maximize the number of TP while minimizing the FP and ensuring that the operation does not exhaust the allotted budget; the budget would be quickly drained by pursuing pixels incorrectly labeled as blue tarps (false positives).
Additionally, an overall success rate of 80% was imposed to make the qualification for operational success more stringent. Allocating millions of dollars to provide life-saving resources to displaced persons would be a political and international relations nightmare if a minimum level of success were not achieved.
In our training data set we know that there are 2,022 displaced persons hidden within 63,241 observations. For the mission to be considered a success, 80% of those 2,022 displaced persons (1,618 people) must be identified and provided resources. The mission was allocated $2,000,000 to locate the 2,022 displaced persons assigned to us; if we fail to find 404 of these people, the mission is a failure regardless of its cost. To numerically force failure in that case, the cost of a FN is set to $2,000,000 divided by 404.4 (20% of 2,022), roughly $4,946, which keeps us from coming in under budget if we miss too many tarps. As previously stated in the Project Budget section, the cost of a TP is $750 and the cost of a FP is $300.
This budget limit guided my selection of threshold values. I first sought the lowest possible number of FN and FP, which typically put the project over budget. I then adjusted the threshold to keep FN as low as possible while allowing more FP. The sketch below shows the cost calculation used to vet each candidate threshold.
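A minimal sketch of that calculation, using the per-outcome costs from the Project Budget section and any caret confusionMatrix object; the function name and its use on the FHO matrix are illustrative only.
# Mission cost for a given confusion matrix: TP -$750, FP -$300, FN -$4,946, against a $2,000,000 budget
mission_cost <- function(cm, budget = 2e6) {
  TP <- cm$table["BT", "BT"]
  FP <- cm$table["BT", "NBT"]
  FN <- cm$table["NBT", "BT"]
  cost <- 750 * TP + 300 * FP + 4946 * FN
  list(cost = cost,
       under_budget = cost <= budget,
       recall = TP / (TP + FN),
       success = TP / (TP + FN) >= 0.80) # 80% delivery success criterion
}
mission_cost(cm.RF_thresh_FHO) # illustrative: the $2M budget was scoped to the 2,022-tarp training area, not the full FHO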
Based on the chosen threshold values, random forest had the lowest cost, followed by KNN and SVM. Logistic regression and QDA were just under budget. No matter the threshold I chose, I could not get LDA under budget, so I chose the threshold that resulted in the lowest cost for that model. The LDA confusion matrix had the highest number of FN of all the models, which accounts for its having the lowest recall percentage.
A discussion of the best performing algorithm(s) in the cross-validation and hold-out data
Based on the cross-validation recall, precision, and cost, I rank the models from most to least recommended:
The algorithm that performed best on the cross-validation data, judged by recall, precision, and cost (in that order of importance), was random forest. KNN and SVM earn honorable mentions based on the same metrics.
Logistic regression, LDA, and QDA have low precision percentages compared to RF, KNN, and SVM. This can be explained by the large number of FP predicted by those models relative to RF, KNN, and SVM:
Model | FP | TP |
---|---|---|
RF | 16 | 2,000 |
KNN | 534 | 2,022 |
SVM | 845 | 2,018 |
Log. Reg. | 1,048 | 1,982 |
LDA | 1,275 | 1,791 |
QDA | 1,412 | 2,008 |
Based on recall and precision on the final hold out data set, I rank the models from most to least recommended:
All qualitative judgments are relative to the other five models; I am not impressed with any of the models' performance on the FHO data. I placed LDA at the top of the list because it had a small number of FN and an acceptable number of FP.
A discussion or analysis justifying why your findings above are compatible or reconcilable
I am a bit surprised by the difference in model performance between the training and test data sets; I would not say the performance was consistent across them. I have to consider whether the training and test sets differ or whether my models were overfit to the training data set.
Below, the training data set and a subset of the FHO data set are graphed. I do not see a significant difference in the separation of the two classes.
Training Data Set
fig1
Subset of the Test Data Set
attach(df_FHO)
fig_end <- plot_ly(df_subset, x=~Red, y=~Green, z=~Blue, color=~Blue_Tarp_or_Not, colors = cbPalette) #Reference: https://plotly.com/r/3d-scatter-plots/ https://plotly.com/r/3d-surface-plots/
fig_end <- fig_end %>% add_markers()
fig_end <- fig_end %>% layout(scene=list(xaxis=list(title="Red"),
yaxis = list(title = 'Green'),
zaxis = list(title = 'Blue')))
fig_end
#Reference [1]
# The palette with grey:
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#CC79A7", "#0072B2", "#D55E00")
#"#F0E442",
#view scatterplots and Pearson correlations for the subset, colored by class
p <- ggpairs(df_subset[,1:3], mapping = aes(color = df_subset$Blue_Tarp_or_Not), upper = list(continuous = wrap("cor", size = 3))) # the pairs()-style lower.panel argument is not a ggpairs argument and was ignored, so it has been dropped
#Reference: https://stackoverflow.com/questions/34740210/how-to-change-the-color-palette-for-ggallyggpairs/34743555
for(i in 1:p$nrow) {
for(j in 1:p$ncol){
p[i,j] <- p[i,j] +
scale_fill_manual(values=cbPalette) +
scale_color_manual(values=cbPalette)
}
}
p
Let’s see what the ROC curves might tell us about the observed difference in performance.
For our logistic regression model, it looks like the threshold chosen on the cross-validation data is not appropriate for the test data. On the FHO data the model achieves a perfect TPR even at thresholds above 0.03265, so the chosen threshold is lower than necessary and lets an incredibly high number of NBT pixels be classified as BT, resulting in an approximately 11% FPR.
The LDA model achieves a higher TPR on the FHO data than on the training data. My hypothesis is that the FHO data better satisfies the assumptions of LDA; for example, it is possible that each class is more concentrated near the center of its Gaussian distribution, so that fewer predictions fall near the decision boundary. If more observations sit far from the decision boundary (predicted probabilities far from the threshold), you would see better performance.
The QDA decision boundary fit to the training data must not match the shape of the FHO data as well, because we cannot reach 100% TPR without at least ~10% FPR; roughly 10% of the NBT data must lie on the wrong side of the boundary. Since LDA improves on the FHO data, it is possible that a linear decision boundary simply separates this data better.
For KNN, with a much larger data set, k = 7 may no longer be the best setting for the tuning parameter, resulting in higher error. With so many more data points, seven neighbors probably are not enough to predict the class accurately in regions of heavy overlap.
For RF, because the model averages 500 trees, a shift in the test data or a large number of points sitting near class boundaries could produce significant misclassification. If that were the case, there would be a fraction of points that cannot be classified correctly, as we are seeing with the FHO data.
For SVM, the margin may be too narrow or incorrectly placed for the test data, or there could be a pocket of data on the wrong side of the margin. Because SVM fixes a decision boundary at a particular location in the predictor space, data that shifts around it and violates the margin produces a large number of incorrect predictions. That could explain the difference in performance and the shape of the FHO ROC curve.
Collectively, what I have outlined for most of these methods is that the FHO data contains at least a portion of observations that violate the previously learned decision boundaries, so the models cannot reach 100% TPR without a significant percentage of FPs. Logistic regression is the exception: there, I think the threshold itself is the wrong value, allowing a large number of FP to be predicted.
A recommendation and rationale regarding which algorithm to use for detection of blue tarps
If we are not allowed to make any adjustments to the models or threshold values based on what we learned from the FHO, then I would recommend the LDA model for detecting blue tarps. All of our models exceed an 80% TPR, which counts as a success; however, some have an FPR that would be too costly (logistic regression) and others have prediction run times that would not scale to an operational deployment (KNN and SVM). LDA and QDA have similar recall and precision, but LDA's recall is about 3.3 percentage points better. That difference could be attributable to sampling variation; with another two million test points, QDA might come out on top. In the end a decision has to be made, and based on the information at hand, LDA is my pick.
If we were allowed to make additional adjustments, I think the logistic regression model could be refined to pare down the FPR and become a useful model. Its AUC was roughly 99.9% on both the training and test data sets, which means the model does a great job of differentiating between the classes; the main issue was the threshold selection. Having two million data points to train on would allow for better threshold selection.
Future Work
Having only three predictors, the RGB values, leaves a lot to be desired when building a predictive model, especially when the three predictors effectively combine into a single predictor: a color. Granted, in a disaster situation, having frequently imaged areas available for review is no small feat, and having any data at all is valuable. Still, there must be a way to innovate further and collect additional data to improve the model. The group from the Rochester Institute of Technology used LIDAR data in their predictive models, and LIDAR can be used to classify damage to infrastructure by measuring changes in structure height relative to previous imagery. I therefore think LIDAR data should be explored in addition to the RGB data.
In a similar vein, there has been work to develop infrastructure damage classifications that can be applied to satellite imagery through machine learning, and a database has been created to further this work. It can be used to take before images, in which structures are outlined with polygons, and compare them with after images; the changes in the polygons help classify the scale of damage. Something of this nature could be used to identify damage and indicate where displaced persons might be in a semi-urban environment, and I wonder whether it could be adapted to identify tarp polygons directly.
If that did not work, this kind of technology could instead be used to track whether roads and travel routes have been affected. You could pull satellite data from before the event, identify the primary and secondary routes of travel, and then feed the algorithm daily image updates to see whether any of those routes have been compromised.
More specifically related to my own work, if I had more time I would make precision / recall / threshold 3-D plots like the ones I made for the ROC curves. Comparing the two views would be interesting and probably would have helped me pick better threshold values; it was rather late in my analysis that I realized I should focus more on precision and recall. If you look back at my tables you will see I added F1 and F3 metrics. The F3 metric places three times the emphasis on recall over precision; the higher the score, the better the combination of recall and precision, with the emphasis on recall. I like the F scores and think they are a handy metric when, as here, both precision and recall matter. A sketch of such a precision / recall / threshold plot is included below.
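As a starting point, precision and recall can be computed over a grid of thresholds and viewed with plotly in the same style as the 3-D scatter plots above. This sketch uses the logistic regression FHO probabilities, but any of the probability data frames would work; note it is slow on two million rows.
library(plotly)
# Sweep a grid of thresholds and record precision and recall at each one
thresholds <- seq(0.01, 0.99, by = 0.01)
pr <- t(sapply(thresholds, function(th) {
  pred <- factor(ifelse(glm.prob_FHO[, 2] > th, "BT", "NBT"), levels = c("NBT", "BT"))
  TP <- sum(pred == "BT" & df_FHO$Blue_Tarp_or_Not == "BT")
  FP <- sum(pred == "BT" & df_FHO$Blue_Tarp_or_Not == "NBT")
  FN <- sum(pred == "NBT" & df_FHO$Blue_Tarp_or_Not == "BT")
  c(threshold = th, precision = TP / (TP + FP), recall = TP / (TP + FN))
}))
pr <- as.data.frame(pr)
# 3-D view of the precision / recall trade-off as the threshold moves
plot_ly(pr, x = ~threshold, y = ~recall, z = ~precision, type = "scatter3d", mode = "lines+markers")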
Being able to compare precision and recall alongside algorithm execution time helped me decide which algorithms are best suited for this application. Keeping algorithmic complexity in mind, it would likely be worthwhile for aid workers to run new observations through the model in smaller batches so that fresh predictions can be reported to the field frequently (a sketch of this idea follows). It would be even better if the field recorded whether each visited location was a TP or FP; that validated data could then be used to retrain and update the model.
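A sketch of the batching idea, with a hypothetical batch size and the random forest fit and threshold from above; in practice each batch of predicted tarps would be joined back to its geo-references before being reported.
# Score incoming pixels in batches so predicted blue-tarp locations can be pushed to the field
# as each batch finishes, rather than waiting on the full data set
batch_size <- 50000
batches <- split(seq_len(nrow(df_FHO)), ceiling(seq_len(nrow(df_FHO)) / batch_size))
for (idx in batches) {
  probs <- predict(rf.fit, newdata = df_FHO[idx, ], type = "prob")
  hits <- idx[probs[, "BT"] > RF.thresh]
  # df_FHO[hits, ] would be reported to aid workers here
}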
A sincere thank you to Professor Schwartz for his excellent instruction this semester. In addition, I am very grateful to my cohort: casual discussions about the course concepts with Sam Parsons, Edward Thompson, Michael Davies, Paul Hicks, and many others have taught me a lot. They inspired much of my learning and helped to reinforce the rest.
#> Run Time: 928.58 sec elapsed