Supervised Image Classification with Random Forest in R

In this post, I demonstrate how to load and visualize georeferenced spatial data and perform a supervised random forest classification using R.

Yuri Souza
2025-04-29

Quick Intro and Background

Machine learning has been one of my favorite topics over the last few years, as I’ve been using approaches such as predictive models and cluster analysis. However, these are just a subset of machine learning’s potential. With that in mind, I decided to take a step forward and learn what I believe to be the next topic in my field of expertise: Supervised Image Classification. This is a highly effective method for automatically identifying features in an image, such as forest patches, crops, urban areas, water bodies, and so on. In a few words, it works by telling an algorithm what each of these features looks like, based on reflectance signatures; the algorithm then searches the images for patterns of pixel aggregation that resemble those of the input features.

I’ve been trying to perform supervised image classification of satellite images in R for a while, and I’ve always stumbled upon conceptual problems, such as: How do I do it? Which parameters should I use for validation? Which packages and functions? What is happening in the background?

Even though this technique is very popular among remote sensing practitioners and is widely covered on the internet, it’s definitely easier to find tutorials for Python or for software such as ArcGIS and QGIS than for R. I’m not saying there aren’t R tutorials out there; in fact, the content available here draws from those tutorials, but I found the information scattered across a few blogs and a bit scarce. So, after struggling to find a tutorial that I liked, I decided to create my own, based on what I could find online, while also adding my personal touch.

This is my attempt to learn how to perform supervised image classification using the random forest algorithm. I’m pretty sure there are many other ways to carry it out, and certainly more efficient ones, but this is what I have so far, and I believe it will suit a large audience.

For the record, what I’m presenting here is a compilation of what is available in the following blogs and tutorials:

About Random Forest

Random Forest is a supervised, nonlinear algorithm that assigns observations in a dataset to categories or classes using classification or regression techniques. It builds on the approach of Decision Trees, with the difference that Random Forest uses a collection of Decision Trees instead of a single one, making the output more flexible and accurate. A single decision tree has a limitation: it tends to show low accuracy during the testing phase due to a process called over-fitting (which here can be understood as a model that matches the training set so closely that it fails to make correct predictions on new data). Random Forest, on the other hand, helps to reduce tree correlation by injecting more randomness into the tree-growing process: at each split it searches for the best feature among a random subset of features, which results in a better, more accurate model. Being able to randomize the data and draw random samples is important in building a model because it creates variation, minimizes bias, and therefore increases the model’s potential for generalization.

Random Forest and Decision Trees both have a tree-like structure made up of nodes and branches. At each node, the data gets split based on one of the input features, creating two or more branches. This splitting continues at the next nodes, leading to even more branches that help break down the original data into smaller parts. The process goes on until you reach a point where most of the data in a node belongs to the same class, making further splits unnecessary. The very first split happens at the root node, while the final nodes, called leaves, are linked to specific class labels. The routes you take from the root to each leaf outline the rules for classifying the data. Check this out for a deep explanation.
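To make this concrete, here is a toy sketch of a single decision tree grown with the rpart package (which ships with R) on the built-in iris data. This example is not part of the classification workflow below; it only illustrates the root/branch/leaf structure described above:

```r
# Toy example: grow a single classification tree on the built-in iris data.
# The printed output shows the root node, the split rules at each branch,
# and the leaves with their class labels.
library(rpart)

tree <- rpart(Species ~ ., data = iris, method = "class")
print(tree)

# The routes from the root to each leaf are the classification rules
preds <- predict(tree, iris, type = "class")
mean(preds == iris$Species)  # in-sample accuracy of one tree
```

Printing the fitted object lists each node with its splitting rule (e.g. a threshold on Petal.Length), which is exactly the root-to-leaf path structure discussed above.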

Decision Tree vs Random Forest Source

Random Forest uses the bagging method in decision trees and, as a result, increases the accuracy of the learning model. Bagging trees introduces a random component into the tree-building process by building many trees on bootstrapped copies of the training data. Bagging then aggregates the predictions across all the trees; this aggregation reduces the variance of the overall procedure and results in improved predictive performance.

Simply put, bagging (short for bootstrap aggregating) is directly derived from bootstrapping: it uses sampling with replacement to generate multiple datasets from the original training set. Bagging extends bootstrapping by applying it to model training and prediction aggregation, turning a statistical resampling method into a powerful ensemble learning technique.


Bootstrapping is primarily a resampling technique used to estimate variability or confidence in statistical measures by generating multiple bootstrap samples. Bagging builds on this concept by using each of these bootstrap samples to train separate models, often referred to as base learners. While bootstrapping focuses solely on data generation, bagging takes it further by combining the predictions from these multiple models to form a more stable and accurate overall prediction. The main goal of bagging is to reduce variance and minimize overfitting in regression and classification tasks, making it a key ensemble method in machine learning, whereas bootstrapping itself is not directly concerned with model performance.
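To make the bootstrapping-versus-bagging distinction concrete, here is a toy sketch in base R using rpart trees as base learners on the built-in iris data (again, purely illustrative and separate from the workflow below):

```r
# Toy sketch of bagging: grow several trees on bootstrapped copies of the
# training data, then aggregate their predictions by majority vote.
library(rpart)

set.seed(42)
n_trees <- 25
n <- nrow(iris)

# Bootstrap: each base learner sees a sample of rows drawn with replacement
trees <- lapply(seq_len(n_trees), function(i) {
  boot_idx <- sample(n, n, replace = TRUE)
  rpart(Species ~ ., data = iris[boot_idx, ], method = "class")
})

# Aggregate: each tree votes and the majority class wins
votes <- sapply(trees, function(tr) as.character(predict(tr, iris, type = "class")))
bagged_pred <- apply(votes, 1, function(v) names(which.max(table(v))))

mean(bagged_pred == iris$Species)  # agreement of the ensemble with the labels
```

A true random forest adds one more layer of randomness on top of this: at each split it considers only a random subset of the predictor variables, which further decorrelates the trees.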


For more information:

That said, let’s get started!

In a few words, my goal here is to take drone optical imagery I collected during my fieldwork in Brazil and build a model able to identify Bare Soil, Grass, and Trees in it. This image was taken in Chapada dos Veadeiros National Park, Brazil.

I broke down the workflow of this tutorial into six main steps, along with some intermediate ones. This step-by-step approach really helped me grasp how the classification process works.

1. Installing and Loading the Packages

2. Loading, Checking, and Cleaning the Data

2.1. Loading the Datasets

2.2. Checking the Coordinate Reference System

2.3. Checking the Data Information and Structure

2.4. Cropping the Raster Image

2.5. Renaming and Selecting the Bands

2.6. Visualizing the Raster and Polygons

3. Creating, Preparing, and Visualizing the Vegetation Indexes

3.1. Creating Vegetation Indexes

3.2. Renaming, Visualizing, and Checking Vegetation Index Values

3.3. Scaling Band Values and Remaking Vegetation Indexes

3.4. Renaming, Visualizing, and Checking Vegetation Index Values

3.5. Extracting the Pixel Values of Mapped Features and Assigning Vegetation Classes

4. Building and Training the Model

4.1. Partitioning the Data Into Training and Testing

4.2. Assessing the Proportion of Split Data

4.3. Building the Training Model

4.4. Assessing Model Performance

4.5. Applying the Model to the Test Dataset

4.6. Performing Image Classification

4.7. Visualizing the classification

If you want to follow along and try each step described here yourself, please download the data HERE.

Hands to Work

To get started, the very first step is to load the necessary packages (assuming you already have R installed on your machine). We’ll utilize raster and terra for handling raster data, sf for managing spatial data like points and polygons, and tidyverse for data tabulation and visualization. Additionally, we’ll rely on caret for implementing random forest methods, and use mapview along with leaflet to plot and visualize both the polygons and raster images. While these are the primary packages we’ll need, I’ll also incorporate some others to further explore the results of our image classification.

1. Installing and Loading the Packages

Begin by installing and loading the necessary packages, along with any additional ones you’ll need. I prefer a syntax that first checks whether a package is already installed; if it isn’t, it installs the package and then loads it afterward.
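A minimal sketch of that install-if-missing pattern (the package list mirrors the ones named above; adjust it to your needs):

```r
# Install each package only if it is missing, then load it.
pkgs <- c("raster", "terra", "sf", "tidyverse", "caret", "mapview", "leaflet")

for (pkg in pkgs) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)                 # install only when not yet present
  }
  library(pkg, character.only = TRUE)     # then attach it to the session
}
```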

2. Loading, Checking, and Cleaning the Data

2.1. Loading the Datasets

Let’s load all the files and information we’ll need later on:

In this step we load the files and make sure to rename their classes using a standard nomenclature.

raw_raster <- terra::rast("woody_savanna.tif") |> 
  terra::aggregate(fact=2) # original pixel size = 10.8cm, new pixel size 21.6cm (old * 2)
raw_raster
class       : SpatRaster 
dimensions  : 759, 782, 6  (nrow, ncol, nlyr)
resolution  : 1.99956e-06, 1.95189e-06  (x, y)
extent      : -47.63899, -47.63742, -14.0681, -14.06662  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326) 
source(s)   : memory
names       : woody~nna_1, woody~nna_2, woody~nna_3, woody~nna_4, woody~nna_5, woody~nna_6 
min values  :        2.89,       16.26,       16.97,       24.26,      117.52,           0 
max values  :    65535.00,    65535.00,    65535.00,    65535.00,    65535.00,       65535 
plot(raw_raster)
sampled_area <- terra::vect("sampled_area.shp")
sampled_area
 class       : SpatVector 
 geometry    : polygons 
 dimensions  : 1, 1  (geometries, attributes)
 extent      : -47.63879, -47.63768, -14.06791, -14.06682  (xmin, xmax, ymin, ymax)
 source      : sampled_area.shp
 coord. ref. : lon/lat WGS 84 (EPSG:4326) 
 names       :    Id
 type        : <int>
 values      :     0
plot(sampled_area)
train_features <- sf::st_read("train_features.shp")
Reading layer `train_features' from data source 
  `D:\Others\souzayuri.github.io\_posts\2025-01-05-random_forest_sup_classific_r\train_features.shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 46 features and 9 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: -47.63869 ymin: -14.06791 xmax: -47.63768 ymax: -14.06683
Geodetic CRS:  WGS 84
train_features
Simple feature collection with 46 features and 9 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: -47.63869 ymin: -14.06791 xmax: -47.63768 ymax: -14.06683
Geodetic CRS:  WGS 84
First 10 features:
   Classcode Classname Classvalue RED GREEN BLUE Count   Shape_Leng
1       <NA>     grass          1 255   255  190 16651 0.0002222801
2       <NA>     grass          1 255   255  190 13637 0.0002344464
3       <NA>     grass          1 255   255  190  9875 0.0001834221
4       <NA>     grass          1 255   255  190 11577 0.0002181070
5       <NA>     grass          1 255   255  190 10001 0.0002419869
6       <NA>     grass          1 255   255  190  7684 0.0001557914
7       <NA>     grass          1 255   255  190 16943 0.0002274619
8       <NA>     grass          1 255   255  190  6667 0.0001647991
9       <NA>     grass          1 255   255  190  2964 0.0001024218
10      <NA>     grass          1 255   255  190  4238 0.0001182262
     Shape_Area                       geometry
1  3.318277e-09 POLYGON ((-47.63788 -14.066...
2  2.717603e-09 POLYGON ((-47.63782 -14.067...
3  1.967932e-09 POLYGON ((-47.63772 -14.066...
4  2.307135e-09 POLYGON ((-47.63781 -14.066...
5  1.993210e-09 POLYGON ((-47.63793 -14.067...
6  1.531386e-09 POLYGON ((-47.63788 -14.067...
7  3.376445e-09 POLYGON ((-47.63793 -14.066...
8  1.328762e-09 POLYGON ((-47.63768 -14.067...
9  5.907468e-10 POLYGON ((-47.63769 -14.066...
10 8.447055e-10 POLYGON ((-47.63785 -14.066...
# renaming their classes accordingly to keep it consistent across upcoming analyses.
train_features_clnd <- sf::st_read("train_features.shp") |> 
  dplyr::rename(OBJECTID = "Classcode",
                Class = "Classname") |> 
  dplyr::mutate(OBJECTID = as.character(1:46),
                Class = as.factor(Class),
                Code = as.numeric(Class)) |> 
  dplyr::select(c(1:3))
Reading layer `train_features' from data source 
  `D:\Others\souzayuri.github.io\_posts\2025-01-05-random_forest_sup_classific_r\train_features.shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 46 features and 9 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: -47.63869 ymin: -14.06791 xmax: -47.63768 ymax: -14.06683
Geodetic CRS:  WGS 84
train_features_clnd
Simple feature collection with 46 features and 3 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: -47.63869 ymin: -14.06791 xmax: -47.63768 ymax: -14.06683
Geodetic CRS:  WGS 84
First 10 features:
   OBJECTID Class Classvalue                       geometry
1         1 grass          1 POLYGON ((-47.63788 -14.066...
2         2 grass          1 POLYGON ((-47.63782 -14.067...
3         3 grass          1 POLYGON ((-47.63772 -14.066...
4         4 grass          1 POLYGON ((-47.63781 -14.066...
5         5 grass          1 POLYGON ((-47.63793 -14.067...
6         6 grass          1 POLYGON ((-47.63788 -14.067...
7         7 grass          1 POLYGON ((-47.63793 -14.066...
8         8 grass          1 POLYGON ((-47.63768 -14.067...
9         9 grass          1 POLYGON ((-47.63769 -14.066...
10       10 grass          1 POLYGON ((-47.63785 -14.066...
plot(train_features_clnd)

2.2. Checking the Coordinate Reference System

We also check their CRS (coordinate reference system) and make sure they share the same coordinate properties.

# Compare CRS, if they are different make them equal
if (terra::crs(raw_raster) == terra::crs(train_features_clnd)) {
  print("CRS are identical!")
} else {
  print("CRS are different. Therefore, reprojecting to match raster projection...")
  # Reproject train_features to match raw_raster
  train_features_clnd <- sf::st_transform(train_features_clnd, crs = terra::crs(raw_raster))
}
[1] "CRS are different. Therefore, reprojecting to match raster projection..."
# Compare CRS again to ensure the reprojection above worked.
if (terra::crs(raw_raster) == terra::crs(train_features_clnd)) {
  print("CRS are identical!")
} else {
  print("CRS are different. Therefore, reprojecting to match raster projection...")
  # Reproject train_features to match raw_raster
  train_features_clnd <- sf::st_transform(train_features_clnd, crs = terra::crs(raw_raster))
}
[1] "CRS are identical!"

2.3. Checking the Data Information and Structure

Let’s check the training data and the number of features, which total 46. Here we can see that each training class has 15 features/polygons, except grass, which has 16 because I added one more by mistake and was too lazy to remove it =).

glimpse(train_features_clnd)
Rows: 46
Columns: 4
$ OBJECTID   <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10"…
$ Class      <fct> grass, grass, grass, grass, grass, grass, grass, …
$ Classvalue <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2…
$ geometry   <POLYGON [°]> POLYGON ((-47.63788 -14.066..., POLYGON (…
summary(train_features_clnd)
   OBJECTID               Class      Classvalue             geometry 
 Length:46          bare_soil:15   Min.   :1.000   POLYGON      :46  
 Class :character   grass    :16   1st Qu.:1.000   epsg:4326    : 0  
 Mode  :character   trees    :15   Median :2.000   +proj=long...: 0  
                                   Mean   :1.978                     
                                   3rd Qu.:3.000                     
                                   Max.   :3.000                     

2.4. Cropping the Raster Image

Now that we’ve standardized the data, we’ll crop the raster image to the 100m x 100m grid limit. We’ll also check the CRS again to make sure it’s correct and plot each of the bands side by side.

cropped_raster <- terra::crop(raw_raster, sampled_area)
cropped_raster
class       : SpatRaster 
dimensions  : 559, 558, 6  (nrow, ncol, nlyr)
resolution  : 1.99956e-06, 1.95189e-06  (x, y)
extent      : -47.63879, -47.63768, -14.06791, -14.06682  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326) 
source(s)   : memory
names       : woody~nna_1, woody~nna_2, woody~nna_3, woody~nna_4, woody~nna_5, woody~nna_6 
min values  :        2.89,       16.26,       16.97,       24.26,      165.85,       65535 
max values  :     3139.13,     7843.21,     9368.55,    17875.35,    20952.24,       65535 
plot(cropped_raster)
# Compare CRS
if (terra::crs(raw_raster) == terra::crs(train_features_clnd)) {
  print("CRS are identical!")
} else {
  print("CRS are different. Therefore, reprojecting to match raster projection...")
  # Reproject train_features to match raw_raster
  train_features_clnd <- sf::st_transform(train_features_clnd, crs = terra::crs(raw_raster))
}
[1] "CRS are identical!"

2.5. Renaming and Selecting the Bands

So far, we have ensured the consistency of the image coordinates and defined the grid area we want to work on. Now, let’s rename the bands according to their respective spectral signatures. This will help us later identify which band is which. In this step, I also removed the MASK band by selecting only the bands of interest according to their position in the band list (1 to 5).

names(cropped_raster) <- c("Blue","Green","Red","RE","NIR","MASK")
plot(cropped_raster)
cropped_raster_bands <- cropped_raster[[1:5]]
plot(cropped_raster_bands)

2.6. Visualizing the Raster and Polygons

Now that we have the raster with five bands and the polygons ready for training the model, let’s visualize our data. We’ll create a natural color composite. To do this, we need to supply the positions of the RGB bands in the raster image (which is why naming them beforehand was important) and overlay it with each class represented by the polygons. For visualization, we’ll use mapview() because it’s interactive (just because we want something cool!).

train_class_colors <- unique(train_features_clnd$Class) |> 
  as.data.frame() |>
  dplyr::mutate(hex = c(grass = "#ff7f00",
                        bare_soil = "#e41a1c",
                        tree = "#55eed1"))
train_class_colors
  unique(train_features_clnd$Class)     hex
1                             grass #ff7f00
2                             trees #e41a1c
3                         bare_soil #55eed1
mapview::viewRGB(brick(cropped_raster_bands[[1:3]]), 
                 r = 3, g = 2, b = 1,
                 map.types = c("Esri.WorldImagery", "OpenStreetMap", "OpenTopoMap"),
                 maxpixels =  866760,
                 layer.name = "RGB Image") + 
  mapview::mapView(train_features_clnd, 
                   zcol = "Class", 
                   color = "black",  
                   lwd = 2,
                   col.regions = train_class_colors$hex,
                   layer.name = "Features")

3. Creating, Preparing, and Visualizing the Vegetation Indexes

One nice aspect of using optical imagery rasters with various bands is the ability to rearrange these bands to create new and informative images. To provide some context, the drone imagery we’re working with includes natural color RGB bands, which are part of the visible spectrum for humans, along with the Red Edge and Near-Infrared bands, which lie outside this spectrum. The latter two bands are particularly useful for highlighting vegetation characteristics, as plants reflect a significant amount of infrared light. This unique property makes infrared imagery an invaluable tool for analyzing vegetation, as it offers deeper insights into the state, health, vigor, and other attributes of plants that natural color imagery simply cannot reveal.

Now that we have our raster image and training features prepared, let’s move on to the next step: using this raster image and bands to create various vegetation indexes. They will be stacked as bands and will be used to train the model, using the mapped polygons, and to predict the vegetation classes in the later steps of the classification. This process will improve the classification of the image by pinpointing pixel combinations that are more accurately representative of the specific vegetation characteristics we aim to map.

3.1. Creating Vegetation Indexes

When conducting image classification, it’s crucial to provide clear information that highlights the distinguishing features you’re mapping. In this section, we’ll focus on creating three widely used vegetation indexes in remote sensing: the NDVI (Normalized Difference Vegetation Index), the EVI (Enhanced Vegetation Index), and the SAVI (Soil Adjusted Vegetation Index). Each of these indexes captures vegetation in a unique way, and using them alongside the spectral bands from the drone imagery makes the classification more robust.

This is a straightforward process that primarily involves utilizing the Red, Near Infrared, and Blue bands, along with a few predefined parameters. Each index has its own arithmetic equation, and we can identify these bands based on their positions within the raster. We will then plot these indexes side-by-side.

The table below summarizes the similarities and differences among these indexes.

| Feature | NDVI (Normalized Difference Vegetation Index) | EVI (Enhanced Vegetation Index) | SAVI (Soil Adjusted Vegetation Index) |
|---|---|---|---|
| Formula | (NIR - Red) / (NIR + Red) | 2.5 × (NIR - Red) / (NIR + 6 × Red - 7.5 × Blue + 1) | ((NIR - Red) / (NIR + Red + 0.5)) × (1 + 0.5) |
| Bands Used | NIR, Red | NIR, Red, Blue | NIR, Red |
| Purpose | General greenness/vegetation presence | Improves sensitivity in high biomass areas | Adjusts for soil background noise |
| Advantages | Simple, widely used | Corrects for atmospheric & canopy background effects | Better for areas with sparse vegetation or exposed soil |
| Limitations | Sensitive to soil brightness and atmospheric conditions | More complex, needs Blue band | Requires adjustment factor (L) based on soil type |
| Value Range | -1 to +1 (usually 0 to 0.8 in vegetated areas) | -1 to +1 (typically 0 to 0.9 for vegetation) | -1 to +1 (usually 0 to 0.7, depending on L and vegetation density) |
nir <- cropped_raster_bands[[5]]
red <- cropped_raster_bands[[3]]
blue <- cropped_raster_bands[[1]]

# Check the range of your input bands
print(paste("NIR range:", minmax(nir)))
[1] "NIR range: 165.850002288818" "NIR range: 20952.2397460938"
print(paste("Red range:", minmax(red)))
[1] "Red range: 16.9700002372265" "Red range: 9368.55017089844"
print(paste("Blue range:", minmax(blue)))
[1] "Blue range: 2.88999992236495" "Blue range: 3139.12994384766"
cropped_raster_bands_ndvi <- (nir - red) / (nir + red)
cropped_raster_bands_ndvi
class       : SpatRaster 
dimensions  : 559, 558, 1  (nrow, ncol, nlyr)
resolution  : 1.99956e-06, 1.95189e-06  (x, y)
extent      : -47.63879, -47.63768, -14.06791, -14.06682  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326) 
source(s)   : memory
name        :         NIR 
min value   : -0.08708532 
max value   :  0.95849029 
cropped_raster_bands_evi <- 2.5 * ((nir - red) / (nir + (6 * red) - (7.5 * blue) + 1))
cropped_raster_bands_evi
class       : SpatRaster 
dimensions  : 559, 558, 1  (nrow, ncol, nlyr)
resolution  : 1.99956e-06, 1.95189e-06  (x, y)
extent      : -47.63879, -47.63768, -14.06791, -14.06682  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326) 
source(s)   : memory
name        :       NIR 
min value   : -8.517467 
max value   : 83.766299 
cropped_raster_bands_savi <- ((nir - red) / (nir + red + 0.5)) * (1 + 0.5)
cropped_raster_bands_savi
class       : SpatRaster 
dimensions  : 559, 558, 1  (nrow, ncol, nlyr)
resolution  : 1.99956e-06, 1.95189e-06  (x, y)
extent      : -47.63879, -47.63768, -14.06791, -14.06682  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326) 
source(s)   : memory
name        :        NIR 
min value   : -0.1306091 
max value   :  1.4368568 
# Create a red-to-green color palette
red_to_green_palette <- colorRampPalette(c("darkred", "yellow", "darkgreen"))
# Let's visualize them together
par(mfrow = c(1, 3))

plot(cropped_raster_bands_ndvi, main = "NDVI", col = red_to_green_palette(100))
plot(cropped_raster_bands_evi, main = "EVI", col = red_to_green_palette(100))
plot(cropped_raster_bands_savi, main = "SAVI", col = red_to_green_palette(100))
par(mfrow = c(1, 1))

3.2. Renaming, Visualizing, and Checking Vegetation Index Values

Let’s stack these three indexes onto our raster data so we have more information to feed the model. After stacking the index rasters with our main raster, we’ll notice that the new layers are all named NIR. That is because the first spectral band used in each of the calculations above was the NIR band. To avoid confusion between the NIR spectral band and the new index layers, we’ll rename them according to the index they represent.

train_features_clnd_indexes <- c(cropped_raster_bands, 
                                 cropped_raster_bands_ndvi,
                                 cropped_raster_bands_evi, 
                                 cropped_raster_bands_savi)
train_features_clnd_indexes
class       : SpatRaster 
dimensions  : 559, 558, 8  (nrow, ncol, nlyr)
resolution  : 1.99956e-06, 1.95189e-06  (x, y)
extent      : -47.63879, -47.63768, -14.06791, -14.06682  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326) 
source(s)   : memory
names       :    Blue,   Green,     Red,       RE,      NIR,         NIR, ... 
min values  :    2.89,   16.26,   16.97,    24.26,   165.85, -0.08708532, ... 
max values  : 3139.13, 7843.21, 9368.55, 17875.35, 20952.24,  0.95849029, ... 
names(train_features_clnd_indexes) <- c(names(cropped_raster_bands),"NDVI", "EVI", "SAVI")
train_features_clnd_indexes
class       : SpatRaster 
dimensions  : 559, 558, 8  (nrow, ncol, nlyr)
resolution  : 1.99956e-06, 1.95189e-06  (x, y)
extent      : -47.63879, -47.63768, -14.06791, -14.06682  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326) 
source(s)   : memory
names       :    Blue,   Green,     Red,       RE,      NIR,        NDVI, ... 
min values  :    2.89,   16.26,   16.97,    24.26,   165.85, -0.08708532, ... 
max values  : 3139.13, 7843.21, 9368.55, 17875.35, 20952.24,  0.95849029, ... 
plot(train_features_clnd_indexes)
summary(train_features_clnd_indexes)
      Blue             Green              Red         
 Min.   :   2.89   Min.   :  52.24   Min.   :  58.04  
 1st Qu.: 384.11   1st Qu.:1156.35   1st Qu.:1209.77  
 Median : 528.55   Median :1651.34   Median :1781.48  
 Mean   : 556.88   Mean   :1654.48   Mean   :1883.19  
 3rd Qu.: 704.35   3rd Qu.:2108.03   3rd Qu.:2453.26  
 Max.   :2919.83   Max.   :7528.84   Max.   :9368.55  
       RE               NIR               NDVI         
 Min.   :  147.7   Min.   :  269.3   Min.   :-0.07994  
 1st Qu.: 3228.9   1st Qu.: 3683.2   1st Qu.: 0.31428  
 Median : 4458.6   Median : 4960.7   Median : 0.47468  
 Mean   : 4603.5   Mean   : 5228.5   Mean   : 0.46199  
 3rd Qu.: 5786.7   3rd Qu.: 6528.0   3rd Qu.: 0.61267  
 Max.   :16438.5   Max.   :19855.9   Max.   : 0.93235  
      EVI                SAVI        
 Min.   :-0.07473   Min.   :-0.1199  
 1st Qu.: 0.39774   1st Qu.: 0.4713  
 Median : 0.70507   Median : 0.7119  
 Mean   : 0.73613   Mean   : 0.6929  
 3rd Qu.: 1.03850   3rd Qu.: 0.9189  
 Max.   :83.76630   Max.   : 1.3981  

3.3. Scaling Band Values and Remaking Vegetation Indexes

You have probably noticed that some of the index values, especially EVI and SAVI, range beyond the expected -1 to 1. That is because we need to convert our raw band values to reflectance values normalized to the 0-1 range. This conversion could have been done in earlier steps, but I wanted to show you this common issue. To convert, we simply divide each band by the theoretical maximum value of the drone image (in our case 65535, since it is a 16-bit image). Let’s convert these values and redo what we did in step 3.1. Creating Vegetation Indexes.

cropped_raster_bands_scaled <- cropped_raster_bands / 65535


# splitting the required bands
nir <- cropped_raster_bands_scaled[[5]]
red <- cropped_raster_bands_scaled[[3]]
blue <- cropped_raster_bands_scaled[[1]]

# Check the range of your input bands
print(paste("NIR range:", minmax(nir)))
[1] "NIR range: 0.00253070881649223" "NIR range: 0.319710685070478"  
print(paste("Red range:", minmax(red)))
[1] "Red range: 0.000258945605206782"
[2] "Red range: 0.142954912198038"   
print(paste("Blue range:", minmax(blue)))
[1] "Blue range: 4.40985720968177e-05"
[2] "Blue range: 0.0479000525497468"  
cropped_raster_bands_ndvi_scaled <- (nir - red) / (nir + red)
cropped_raster_bands_ndvi_scaled
class       : SpatRaster 
dimensions  : 559, 558, 1  (nrow, ncol, nlyr)
resolution  : 1.99956e-06, 1.95189e-06  (x, y)
extent      : -47.63879, -47.63768, -14.06791, -14.06682  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326) 
source(s)   : memory
name        :         NIR 
min value   : -0.08708532 
max value   :  0.95849029 
cropped_raster_bands_evi_scaled <- 2.5 * ((nir - red) / (nir + (6 * red) - (7.5 * blue) + 1))
cropped_raster_bands_evi_scaled
class       : SpatRaster 
dimensions  : 559, 558, 1  (nrow, ncol, nlyr)
resolution  : 1.99956e-06, 1.95189e-06  (x, y)
extent      : -47.63879, -47.63768, -14.06791, -14.06682  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326) 
source(s)   : memory
name        :        NIR 
min value   : -0.0196188 
max value   :  0.4756189 
cropped_raster_bands_savi_scaled <- ((nir - red) / (nir + red + 0.5)) * (1 + 0.5)
cropped_raster_bands_savi_scaled
class       : SpatRaster 
dimensions  : 559, 558, 1  (nrow, ncol, nlyr)
resolution  : 1.99956e-06, 1.95189e-06  (x, y)
extent      : -47.63879, -47.63768, -14.06791, -14.06682  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326) 
source(s)   : memory
name        :         NIR 
min value   : -0.02529499 
max value   :  0.48395901 
# Create a red-to-green color palette
red_to_green_palette <- colorRampPalette(c("darkred", "yellow", "darkgreen"))
# Let's visualize them together
par(mfrow = c(1, 3))

plot(cropped_raster_bands_ndvi_scaled, main = "NDVI", col = red_to_green_palette(100))
plot(cropped_raster_bands_evi_scaled, main = "EVI", col = red_to_green_palette(100))
plot(cropped_raster_bands_savi_scaled, main = "SAVI", col = red_to_green_palette(100))

3.4. Renaming, Visualizing, and Checking Vegetation Index Values

If you check the new values, you’ll see that they now fall within the expected -1 to 1 range, confirming the bands were properly scaled.

train_features_clnd_indexes_scaled <- c(cropped_raster_bands_scaled, 
                                        cropped_raster_bands_ndvi_scaled,
                                        cropped_raster_bands_evi_scaled, 
                                        cropped_raster_bands_savi_scaled)
train_features_clnd_indexes_scaled
class       : SpatRaster 
dimensions  : 559, 558, 8  (nrow, ncol, nlyr)
resolution  : 1.99956e-06, 1.95189e-06  (x, y)
extent      : -47.63879, -47.63768, -14.06791, -14.06682  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326) 
source(s)   : memory
names       :         Blue,        Green,          Red,           RE,         NIR,         NIR, ... 
min values  : 4.409857e-05, 0.0002481117, 0.0002589456, 0.0003701839, 0.002530709, -0.08708532, ... 
max values  : 4.790005e-02, 0.1196797144, 0.1429549122, 0.2727603586, 0.319710685,  0.95849029, ... 
# renaming the bands by using their position in the list
names(train_features_clnd_indexes_scaled) <- c(names(cropped_raster_bands_scaled),"NDVI", "EVI", "SAVI")
train_features_clnd_indexes_scaled
class       : SpatRaster 
dimensions  : 559, 558, 8  (nrow, ncol, nlyr)
resolution  : 1.99956e-06, 1.95189e-06  (x, y)
extent      : -47.63879, -47.63768, -14.06791, -14.06682  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326) 
source(s)   : memory
names       :         Blue,        Green,          Red,           RE,         NIR,        NDVI, ... 
min values  : 4.409857e-05, 0.0002481117, 0.0002589456, 0.0003701839, 0.002530709, -0.08708532, ... 
max values  : 4.790005e-02, 0.1196797144, 0.1429549122, 0.2727603586, 0.319710685,  0.95849029, ... 
plot(train_features_clnd_indexes_scaled)
summary(train_features_clnd_indexes_scaled)
      Blue               Green                Red           
 Min.   :0.0000441   Min.   :0.0007971   Min.   :0.0008856  
 1st Qu.:0.0058611   1st Qu.:0.0176447   1st Qu.:0.0184598  
 Median :0.0080652   Median :0.0251978   Median :0.0271837  
 Mean   :0.0084975   Mean   :0.0252458   Mean   :0.0287356  
 3rd Qu.:0.0107477   3rd Qu.:0.0321666   3rd Qu.:0.0374343  
 Max.   :0.0445538   Max.   :0.1148827   Max.   :0.1429549  
       RE                NIR                NDVI         
 Min.   :0.002254   Min.   :0.004109   Min.   :-0.07994  
 1st Qu.:0.049271   1st Qu.:0.056202   1st Qu.: 0.31428  
 Median :0.068034   Median :0.075696   Median : 0.47468  
 Mean   :0.070244   Mean   :0.079781   Mean   : 0.46199  
 3rd Qu.:0.088299   3rd Qu.:0.099611   3rd Qu.: 0.61267  
 Max.   :0.250836   Max.   :0.302982   Max.   : 0.93235  
      EVI                SAVI         
 Min.   :-0.01902   Min.   :-0.02439  
 1st Qu.: 0.05560   1st Qu.: 0.06694  
 Median : 0.09205   Median : 0.10844  
 Mean   : 0.10633   Mean   : 0.12251  
 3rd Qu.: 0.14630   3rd Qu.: 0.16798  
 Max.   : 0.47562   Max.   : 0.48396  

3.5. Extracting the Pixel Values of Mapped Features and Assigning Vegetation Classes

With these images, we’ve everything we need to move forward, preparing the data that will be used to train and test our random forest model. Let’s start by extracting the pixel values for the features we mapped. We will later use these pixel values to train and test the model.

# extracting the pixel information using the mapped features/polygons
train_features_clnd_pxls <- extract(train_features_clnd_indexes_scaled, train_features_clnd)
head(train_features_clnd_pxls)
  ID        Blue      Green        Red         RE        NIR
1  1 0.006558328 0.02629480 0.03789761 0.06722560 0.06768292
2  1 0.007127642 0.03018280 0.04512795 0.08374609 0.08382895
3  1 0.006670024 0.02567712 0.03890700 0.06104662 0.06191714
4  1 0.006507515 0.02560601 0.04088975 0.07106096 0.06911742
5  1 0.007614557 0.02979995 0.04475639 0.07366522 0.07263493
6  1 0.007649500 0.02975463 0.04506905 0.07024475 0.07175830
       NDVI        EVI       SAVI
1 0.2821098 0.05976755 0.07377707
2 0.3001080 0.07435983 0.09229806
3 0.2282206 0.04619273 0.05744646
4 0.2565984 0.05575727 0.06941147
5 0.2374838 0.05427793 0.06773307
6 0.2284504 0.05193264 0.06490289
# Let's use the ID values to index into the Class vector. This indexing operation retrieves the class name that corresponds to each ID value. These retrieved class names are then assigned to a new column called Class.
train_features_clnd_pxls$Class <- train_features_clnd$Class[train_features_clnd_pxls$ID]
head(train_features_clnd_pxls)
  ID        Blue      Green        Red         RE        NIR
1  1 0.006558328 0.02629480 0.03789761 0.06722560 0.06768292
2  1 0.007127642 0.03018280 0.04512795 0.08374609 0.08382895
3  1 0.006670024 0.02567712 0.03890700 0.06104662 0.06191714
4  1 0.006507515 0.02560601 0.04088975 0.07106096 0.06911742
5  1 0.007614557 0.02979995 0.04475639 0.07366522 0.07263493
6  1 0.007649500 0.02975463 0.04506905 0.07024475 0.07175830
       NDVI        EVI       SAVI Class
1 0.2821098 0.05976755 0.07377707 grass
2 0.3001080 0.07435983 0.09229806 grass
3 0.2282206 0.04619273 0.05744646 grass
4 0.2565984 0.05575727 0.06941147 grass
5 0.2374838 0.05427793 0.06773307 grass
6 0.2284504 0.05193264 0.06490289 grass
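The ID-based lookup above is plain R vector indexing; a minimal standalone sketch with made-up labels shows the mechanics:

```r
# Hypothetical example: one class label per mapped polygon
polygon_classes <- factor(c("grass", "trees", "bare_soil"))

# Each extracted pixel carries the ID of the polygon it came from
pixel_ids <- c(1, 1, 2, 3, 3, 3)

# Indexing the class vector by ID repeats the right label for every pixel
pixel_classes <- polygon_classes[pixel_ids]
as.character(pixel_classes)
# "grass" "grass" "trees" "bare_soil" "bare_soil" "bare_soil"
```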
# now that we have the class names we don't need the ID index position
train_features_clnd_pxls$ID <- NULL

summary(train_features_clnd_pxls)
      Blue              Green               Red          
 Min.   :0.001006   Min.   :0.002216   Min.   :0.002351  
 1st Qu.:0.006488   1st Qu.:0.022588   1st Qu.:0.023575  
 Median :0.008557   Median :0.029002   Median :0.034620  
 Mean   :0.009274   Mean   :0.029265   Mean   :0.035058  
 3rd Qu.:0.011763   3rd Qu.:0.035875   3rd Qu.:0.045128  
 Max.   :0.029494   Max.   :0.094121   Max.   :0.131305  
       RE                NIR                NDVI         
 Min.   :0.005442   Min.   :0.003323   Min.   :-0.07855  
 1st Qu.:0.058055   1st Qu.:0.061778   1st Qu.: 0.21691  
 Median :0.072678   Median :0.076645   Median : 0.34773  
 Mean   :0.076391   Mean   :0.084564   Mean   : 0.40261  
 3rd Qu.:0.090349   3rd Qu.:0.100305   3rd Qu.: 0.61267  
 Max.   :0.211573   Max.   :0.242752   Max.   : 0.88841  
      EVI                SAVI                Class     
 Min.   :-0.01962   Min.   :-0.02530   bare_soil:2156  
 1st Qu.: 0.04550   1st Qu.: 0.05575   grass    :5988  
 Median : 0.07358   Median : 0.08878   trees    :8149  
 Mean   : 0.10130   Mean   : 0.11685                   
 3rd Qu.: 0.14868   3rd Qu.: 0.17074                   
 Max.   : 0.40614   Max.   : 0.41690                   

3.6. Visualizing the Vegetation Classes Distribution and Trends

Now that we’ve assigned each band and vegetation index to the mapped features, let’s take a look at the distribution of each vegetation class using a frequency histogram.

val_grass <- subset(train_features_clnd_pxls, Class == "grass")
head(val_grass)
         Blue      Green        Red         RE        NIR      NDVI
1 0.006558328 0.02629480 0.03789761 0.06722560 0.06768292 0.2821098
2 0.007127642 0.03018280 0.04512795 0.08374609 0.08382895 0.3001080
3 0.006670024 0.02567712 0.03890700 0.06104662 0.06191714 0.2282206
4 0.006507515 0.02560601 0.04088975 0.07106096 0.06911742 0.2565984
5 0.007614557 0.02979995 0.04475639 0.07366522 0.07263493 0.2374838
6 0.007649500 0.02975463 0.04506905 0.07024475 0.07175830 0.2284504
         EVI       SAVI Class
1 0.05976755 0.07377707 grass
2 0.07435983 0.09229806 grass
3 0.04619273 0.05744646 grass
4 0.05575727 0.06941147 grass
5 0.05427793 0.06773307 grass
6 0.05193264 0.06490289 grass
val_trees <- subset(train_features_clnd_pxls, Class == "trees")
head(val_trees)
           Blue      Green        Red         RE        NIR      NDVI
5123 0.01274861 0.03244617 0.04266209 0.07163714 0.07920165 0.2998395
5124 0.01041855 0.02804318 0.03677150 0.05825589 0.06731792 0.2934633
5125 0.00786923 0.01944717 0.02781994 0.04466133 0.05063920 0.2908426
5126 0.01530221 0.04395438 0.05080659 0.10412344 0.11215274 0.3764507
5127 0.01654933 0.05380575 0.05253895 0.14308217 0.17049882 0.5288785
5128 0.01178302 0.03397559 0.03511559 0.08490883 0.10346059 0.4931945
            EVI       SAVI Class
5123 0.07369464 0.08813722 trees
5124 0.06312248 0.07584910 trees
5125 0.04924143 0.05917252 trees
5126 0.11777173 0.13880071 trees
5127 0.21658119 0.24471723 trees
5128 0.13939068 0.16054076 trees
val_bare_soil <- subset(train_features_clnd_pxls, Class == "bare_soil")
head(val_bare_soil)
            Blue      Green        Red         RE        NIR
11363 0.01098466 0.02928374 0.03323293 0.06334142 0.06236515
11364 0.01298375 0.03484810 0.04463157 0.05669749 0.05331304
11365 0.01331884 0.03622065 0.04737346 0.06195178 0.06458808
11366 0.01330053 0.03285496 0.04463889 0.05397818 0.04485557
11367 0.01359854 0.03721920 0.04782483 0.06062165 0.05441871
11368 0.01355749 0.03716655 0.04862684 0.07034501 0.06854978
             NDVI          EVI         SAVI     Class
11363 0.304736465 0.0617533699 0.0733688227 bare_soil
11364 0.088636497 0.0177357488 0.0217782723 bare_soil
11365 0.153754740 0.0344585245 0.0421953440 bare_soil
11366 0.002421136 0.0004465992 0.0005513494 bare_soil
11367 0.064491932 0.0133007840 0.0164232973 bare_soil
11368 0.170024881 0.0395726837 0.0484211666 bare_soil
# NDVI
par(mfrow = c(3, 1))
hist(val_grass$NDVI, main = "grass", xlab = "NDVI", xlim = c(0, 1), col = "#e5ca81")
hist(val_trees$NDVI, main = "trees", xlab = "NDVI", xlim = c(0, 1), col = "#81a581")
hist(val_bare_soil$NDVI, main = "bare_soil", xlab = "NDVI", xlim = c(0, 1), col = "#7b6f89")

We can also fit a linear model using the pixel values of different indexes to see how they correlate and visualize them using a scatter plot.

# Scatterplot of the NDVI and EVI indexes
plot(NDVI ~ EVI, data = val_grass, pch = ".", col = "#e5ca81", xlab = "EVI", ylab = "NDVI")
points(NDVI ~ EVI, data = val_trees, pch = ".", col = "#81a581")
points(NDVI ~ EVI, data = val_bare_soil, pch = ".", col = "#7b6f89")

# fitting a linear regression to draw the trend line for each vegetation class
model_grass <- lm(NDVI ~ EVI, data = val_grass) 
abline(model_grass, col = "#e5ca81", lwd = 2)    

model_trees <- lm(NDVI ~ EVI, data = val_trees) 
abline(model_trees, col = "#81a581", lwd = 2)   

model_bare_soil <- lm(NDVI ~ EVI, data = val_bare_soil) 
abline(model_bare_soil, col = "#7b6f89", lwd = 2)        

legend("topright", legend = c("grass", "trees", "bare_soil"), 
       fill = c("#e5ca81", "#81a581", "#7b6f89"), bg = "white")

Or use a density plot, along with facet_wrap(), to display all vegetation classes and indexes together, making them easy to compare.

# selecting only the indexes and classes
train_features_clnd_pxls_indexes <- train_features_clnd_pxls |> 
  dplyr::select(c(NDVI, EVI, SAVI, Class)) |> 
  tidyr::pivot_longer(cols = 1:3, values_to = "values", names_to = "indexes")
train_features_clnd_pxls_indexes
# A tibble: 48,879 × 3
   Class indexes values
   <fct> <chr>    <dbl>
 1 grass NDVI    0.282 
 2 grass EVI     0.0598
 3 grass SAVI    0.0738
 4 grass NDVI    0.300 
 5 grass EVI     0.0744
 6 grass SAVI    0.0923
 7 grass NDVI    0.228 
 8 grass EVI     0.0462
 9 grass SAVI    0.0574
10 grass NDVI    0.257 
# ℹ 48,869 more rows
# add a density curve for each index using geom_density_ridges() from the ggridges package
ggplot(train_features_clnd_pxls_indexes, 
                        aes(x = values, 
                            fill = indexes)) +
  geom_density_ridges(aes(y = 0, fill = indexes, color = indexes),  
                      jittered_points = TRUE, 
                      alpha = .5, 
                      scale = 1,
                      point_shape = "|",
                      point_size = 3, 
                      position = ggridges::position_points_jitter(width = 0.05, height = 0)) +  
    geom_vline(data = train_features_clnd_pxls_indexes |> 
               dplyr::group_by(Class, indexes) |> 
               dplyr::summarise(mean = mean(values), .groups = "drop") |> 
               dplyr::ungroup(),
             aes(xintercept = mean, color = indexes), linetype = "dashed", linewidth = 0.8) +
  scale_fill_manual(values = c("#e5ca81", "#7b6f89", "#81a581"), labels = c("EVI", "NDVI", "SAVI")) +
  scale_color_manual(values = c("#e5ca81", "#7b6f89", "#81a581"), labels = c("EVI", "NDVI", "SAVI")) +
  theme_bw() + labs(x = "", y = "Frequency") +
  theme(axis.title = element_text(size = 20, face = "bold", color = "gray20"),
        axis.text.x.bottom = element_text(size = 14),
        axis.text.y = element_text(size = 14),
        legend.direction = "horizontal",
        legend.position = "bottom",
        legend.title = element_text(size = 16, face = "bold"),
        legend.text = element_text(size = 14),
        title = element_text(size = 18)) +
  facet_wrap(~Class, ncol = 3, scales = "free")

4. Building and Training the Model

With the image bands normalized, the vegetation indexes calculated, and the vegetation classes assigned to the previously mapped features, we’re now able to move forward with building, training, and testing the model.

The training set is used to train the Random Forest model, while the testing set is used to evaluate its performance on data it hasn’t seen before. This separation helps prevent overfitting — where the model memorizes the training data but fails to generalize to new data. In this case, 80% of the data is used to teach the model which vegetation classes correspond to specific band reflectance values and vegetation indexes. The remaining 20% is then used to test how accurately the trained model can predict vegetation classes for new “unseen” data. These proportions are just recommendations, but for small sample sizes it is suggested to use 90% for training.
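As a rough sketch of what such a split does, here is a base-R version using sample() on a toy data frame (the data and numbers are made up, and unlike caret's createDataPartition(), this simple approach does not stratify by class):

```r
set.seed(1)
# Toy stand-in for the pixel table: 100 rows, one fake predictor, one class column
toy <- data.frame(NDVI = runif(100), Class = rep(c("grass", "trees"), 50))

# Draw 80% of the row indices at random for training
train_idx <- sample(nrow(toy), size = 0.8 * nrow(toy))

train_set <- toy[train_idx, ]   # 80 rows the model learns from
test_set  <- toy[-train_idx, ]  # 20 held-out rows for evaluation

nrow(train_set)  # 80
nrow(test_set)   # 20
```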

4.1. Partitioning the Data Into Training and Testing

We’ll use the caret function createDataPartition() to split the data into training and testing sets. Other functions with different parameters can also be used to split it, but for the purpose of this tutorial, let’s keep it simple.

After partitioning, we’ll use one portion of the dataset for training (80%) and the other for testing (20%). If we check the number of rows in the new datasets and compute a simple proportion (the “rule of three”), we can confirm that the data was properly split.

# set.seed() ensures that the values you get on your computer are the same as the ones in this tutorial
set.seed(044) 

# checking the structure of our dataset
str(train_features_clnd_pxls)
'data.frame':   16293 obs. of  9 variables:
 $ Blue : num  0.00656 0.00713 0.00667 0.00651 0.00761 ...
 $ Green: num  0.0263 0.0302 0.0257 0.0256 0.0298 ...
 $ Red  : num  0.0379 0.0451 0.0389 0.0409 0.0448 ...
 $ RE   : num  0.0672 0.0837 0.061 0.0711 0.0737 ...
 $ NIR  : num  0.0677 0.0838 0.0619 0.0691 0.0726 ...
 $ NDVI : num  0.282 0.3 0.228 0.257 0.237 ...
 $ EVI  : num  0.0598 0.0744 0.0462 0.0558 0.0543 ...
 $ SAVI : num  0.0738 0.0923 0.0574 0.0694 0.0677 ...
 $ Class: Factor w/ 3 levels "bare_soil","grass",..: 2 2 2 2 2 2 2 2 2 2 ...
# partitioning the data into 80% (p = 0.8, the percentage of data that goes to training).
rf_train_features_clnd_pxls_indexes <- caret::createDataPartition(train_features_clnd_pxls$Class, list = FALSE, p = 0.8)
str(rf_train_features_clnd_pxls_indexes)
 int [1:13036, 1] 1 4 5 6 7 8 10 11 12 13 ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr "Resample1"
# setting 80% for training - Use "rf_train_features_clnd_pxls_indexes" as a vector of row indices that specify which rows to extract from the train_features_clnd_pxls dataset.
rf_train_data <- train_features_clnd_pxls[rf_train_features_clnd_pxls_indexes, ] 
str(rf_train_data)
'data.frame':   13036 obs. of  9 variables:
 $ Blue : num  0.00656 0.00651 0.00761 0.00765 0.0069 ...
 $ Green: num  0.0263 0.0256 0.0298 0.0298 0.0266 ...
 $ Red  : num  0.0379 0.0409 0.0448 0.0451 0.0429 ...
 $ RE   : num  0.0672 0.0711 0.0737 0.0702 0.0714 ...
 $ NIR  : num  0.0677 0.0691 0.0726 0.0718 0.072 ...
 $ NDVI : num  0.282 0.257 0.237 0.228 0.253 ...
 $ EVI  : num  0.0598 0.0558 0.0543 0.0519 0.0569 ...
 $ SAVI : num  0.0738 0.0694 0.0677 0.0649 0.0709 ...
 $ Class: Factor w/ 3 levels "bare_soil","grass",..: 2 2 2 2 2 2 2 2 2 2 ...
# setting 20% for testing - Use "-rf_train_features_clnd_pxls_indexes" as a vector of row indices that specify which rows to extract from the train_features_clnd_pxls dataset.
rf_test_data <- train_features_clnd_pxls[-rf_train_features_clnd_pxls_indexes, ]
str(rf_test_data)
'data.frame':   3257 obs. of  9 variables:
 $ Blue : num  0.00713 0.00667 0.00715 0.00598 0.00656 ...
 $ Green: num  0.0302 0.0257 0.0283 0.0229 0.0268 ...
 $ Red  : num  0.0451 0.0389 0.0445 0.0357 0.0434 ...
 $ RE   : num  0.0837 0.061 0.0747 0.0563 0.0806 ...
 $ NIR  : num  0.0838 0.0619 0.0736 0.0588 0.0798 ...
 $ NDVI : num  0.3 0.228 0.247 0.245 0.296 ...
 $ EVI  : num  0.0744 0.0462 0.0566 0.0471 0.0706 ...
 $ SAVI : num  0.0923 0.0574 0.0707 0.0584 0.0878 ...
 $ Class: Factor w/ 3 levels "bare_soil","grass",..: 2 2 2 2 2 2 2 2 2 2 ...

4.2. Assessing the Proportion of Split Data

Let’s check the proportion of data points/pixels and features selected for training and testing data, respectively.

Training:

rf_train_data_classCount <- rf_train_data |> 
  dplyr::group_by(Class) |> 
  count()

rf_train_data_classCount
# A tibble: 3 × 2
# Groups:   Class [3]
  Class         n
  <fct>     <int>
1 bare_soil  1725
2 grass      4791
3 trees      6520
(prop.table(table(rf_train_data$Class))*100)

bare_soil     grass     trees 
 13.23259  36.75207  50.01534 

Testing:

rf_test_data_classCount <- rf_test_data |> 
  dplyr::group_by(Class) |> 
  count()

rf_test_data_classCount
# A tibble: 3 × 2
# Groups:   Class [3]
  Class         n
  <fct>     <int>
1 bare_soil   431
2 grass      1197
3 trees      1629
(prop.table(table(rf_test_data$Class))*100)

bare_soil     grass     trees 
 13.23304  36.75161  50.01535 

4.3. Building the Training Model

To build our training model, I’ll break the process down into three main steps:

1) Defining parameters for training. The function trainControl() generates parameters that further control how models are created, letting you set up the rules for how your model will be trained and tested. In this part, we’ll use Cross-Validation with 10 folds. It means that the training data will be divided into 10 equal parts, with the model trained on 9 parts and validated on the remaining part. This process repeats 10 times with a different validation fold each time, ensuring every observation gets used for both training and validation.

2) Defining Response and Predictive Variables. We’ll define Class as the response variable and the reflectance bands, along with the vegetation indexes, as predictors. The reason is that we want to know how vegetation classes can be determined by their spectral signature.

3) Building and Running the Model. Now we’ll use the parameters and variables defined above to build the model. We’ll use the train() function from the caret package to fit the model with the Random Forest algorithm. We’ll use the Kappa (Cohen’s Kappa) metric (which assesses model performance by comparing the observed accuracy with the accuracy expected by chance), 1000 trees (the number of decision trees grown in the Random Forest; more trees generally lead to a more robust model, as the ensemble can better capture complex patterns in the data, but require more processing), and set tuneLength = 8 to test different values of the hyperparameter mtry.

Hyperparameters are configuration settings that control the learning algorithm’s behavior. Setting tuneLength to 8 means we will test 8 different values of the key hyperparameter mtry (the number of variables randomly considered at each decision tree split; here up to 8, because we have 8 predictors). Rather than manually selecting which values to try, the caret package creates a grid of 8 evenly spaced values across the appropriate range for this hyperparameter. Each of these configurations undergoes the full cross-validation process, and the one producing the best performance (measured by the Kappa statistic) is selected for the final model.

It is worth dwelling on the hyperparameter mtry to better understand how the Random Forest handles our data in the background. mtry is a key hyperparameter in Random Forest models that stands for the number of “variables randomly sampled as candidates at each split”. When a Random Forest builds each decision tree, at each node (decision point) the algorithm doesn’t consider all available predictors. Instead, it randomly selects a subset of them to consider for that particular split. The size of this random subset is controlled by mtry. By testing all 8 possible values, we get a thorough hyperparameter search that covers the entire possible range and finds the optimal number of variables to consider at each split for this specific dataset.

Note that we can afford to test all 8 values because this is a small number of predictors and won’t be computationally demanding, but you don’t have to set tuneLength to the number of predictors.
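To make mtry concrete, here is a tiny base-R sketch of what happens at a single tree node: a random subset of the predictors is drawn as split candidates. The predictor names match this tutorial, but the sampling below is purely illustrative, not taken from randomForest’s internals:

```r
set.seed(44)
predictors <- c("Blue", "Green", "Red", "RE", "NIR", "NDVI", "EVI", "SAVI")
mtry <- 3  # number of candidate variables considered at one split

# At each node, the algorithm draws mtry predictors at random and
# searches for the best split only among those candidates
candidates <- sample(predictors, mtry)
candidates
```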

# 1) Defining parameters for training
cvControl <- caret::trainControl(method = 'cv', # cross-validation
                                 number = 10, # number of folds
                                 savePredictions = TRUE, #save the predictions made during cross-validation (useful for later analysis)
                                 verboseIter = FALSE) #won't print progress details during training


# 2) Defining Response and Predictive Variables 
respVar <- c("Class") # response
predVar <- c("Blue","Green","Red","RE","NIR","NDVI","EVI","SAVI") # predictors


# 3) Building and Running the Model
rfModel <- caret::train(rf_train_data[,predVar], # predictors
                        rf_train_data[,respVar], # response
                        method = "rf", # Random Forest algorithm
                        metric = "Kappa",
                        ntree = 1000, # number of trees, 1000
                        trControl = cvControl, # rules for training and testing the training data
                        tuneLength = 8, # different numbers of hyperparameters   
                        importance = TRUE) # Calculates variable importance (which features matter most)
note: only 7 unique complexity parameters in default grid. Truncating the grid to 7 .
rfModel
Random Forest 

13036 samples
    8 predictor
    3 classes: 'bare_soil', 'grass', 'trees' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 11732, 11732, 11733, 11732, 11733, 11732, ... 
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa    
  2     0.9260515  0.8751096
  3     0.9286596  0.8796275
  4     0.9284290  0.8792625
  5     0.9298101  0.8816041
  6     0.9299632  0.8818674
  7     0.9291963  0.8806139
  8     0.9292729  0.8807892

Kappa was used to select the optimal model using the largest value.
The final value used for the model was mtry = 6.

4.4. Assessing Model Performance

Among the tested mtry values, mtry = 6 was selected for the final model because it produced the highest Kappa and Accuracy across the tuning parameters. Accuracy measures the proportion of correctly classified samples (higher is better), while Kappa measures the agreement between predicted and actual classifications while accounting for the possibility of agreement occurring by chance (higher is better; 0.81-1.00 indicates almost perfect agreement). Even though Kappa and Cross-Validation are distinct things, the model reports them together because the Kappa values are obtained through cross-validation: the values shown are averaged across the resampling results from our cross-validation folds.

In summary, when training the Random Forest model with caret’s train() we specified 10-fold Cross-Validation as our resampling method. For each candidate mtry value (2 through 8), the performance metric (Kappa) is computed on each of the 10 folds and then averaged. When we call plot() on the train object, caret visualizes these averaged Cross-Validation results, with the tuning parameter (mtry) on the X-axis and the performance metric (Kappa) on the Y-axis.
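Kappa itself is straightforward to compute by hand. Below is a minimal base-R sketch using a small made-up 2x2 confusion matrix (not this model’s results), just to show the observed-versus-expected logic:

```r
# kappa = (observed accuracy - expected accuracy) / (1 - expected accuracy)
cohens_kappa <- function(cm) {
  n <- sum(cm)
  observed <- sum(diag(cm)) / n                     # proportion on the diagonal
  expected <- sum(rowSums(cm) * colSums(cm)) / n^2  # agreement expected by chance
  (observed - expected) / (1 - expected)
}

# Hypothetical 2-class confusion matrix (rows = predicted, cols = reference)
cm <- matrix(c(45,  5,
               10, 40), nrow = 2, byrow = TRUE)
round(cohens_kappa(cm), 2)  # 0.7
```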

plot(rfModel)

Another way to assess performance is by checking the error rates of each classified feature. Values from 0 to 0.25 represent the proportion of misclassified samples. The error rates usually decrease as more trees are added to the forest and stabilize somewhere below 0.10 (10%). In the graph below, the black line represents the overall OOB error rate, while the colored lines represent the error rate of each class.

As we can see, the model didn’t classify “bare_soil” as well as the other classes. The same misclassification trends appear in the confusion matrix. Trees had the lowest error rate, meaning the model structure was well suited to classifying this vegetation type. Overall, the OOB error rate was low, supporting the robustness of our model.

# a first glimpse of the model's classification performance
rfModel$finalModel

Call:
 randomForest(x = x, y = y, ntree = 1000, mtry = param$mtry, importance = TRUE) 
               Type of random forest: classification
                     Number of trees: 1000
No. of variables tried at each split: 6

        OOB estimate of  error rate: 6.9%
Confusion matrix:
          bare_soil grass trees class.error
bare_soil      1295   302   128  0.24927536
grass           194  4482   115  0.06449593
trees            58   102  6360  0.02453988
# checking the error values of each class
head(rfModel$finalModel$err.rate)
            OOB bare_soil      grass      trees
[1,] 0.10169492 0.2962382 0.09355742 0.05631470
[2,] 0.10528321 0.2790918 0.10641248 0.05773143
[3,] 0.10289159 0.2750760 0.10669280 0.05387755
[4,] 0.10393744 0.2796209 0.11256804 0.05020080
[5,] 0.09870550 0.2716208 0.10304015 0.04955722
[6,] 0.09860541 0.2722256 0.10355625 0.04845598
# visualizing the error values of each class
plot(rfModel$finalModel)

We should also check the importance of each reflectance band and vegetation index for identifying the vegetation cover. For this purpose, we’ll visualize the variables ranked by MeanDecreaseAccuracy and MeanDecreaseGini.

MeanDecreaseAccuracy is a performance metric that measures the proportion of correct predictions across all trees in the random forest. When measuring variable importance, the model first establishes a baseline accuracy using Out-Of-Bag (OOB) samples. Then, values for a specific variable are randomly shuffled in these samples, and the model is re-evaluated. The difference between the baseline accuracy and this new accuracy is the decrease in accuracy attributable to that variable. A larger decrease in accuracy when a variable is shuffled indicates greater importance of that variable. Variables are presented in descending order of importance, with those at the top causing the greatest drop in accuracy when removed or shuffled.

The MeanDecreaseGini measures how each variable contributes to the node purity in the random forest. A higher value indicates that a particular predictor variable plays a greater role in partitioning the data into the defined classes. Higher values indicate greater importance of the variable to your model. This metric measures the total decrease in Gini impurity for a variable across all trees in the forest. Variables that lead to larger decreases in Gini impurity are considered more important, as they’re more effective at creating homogeneous (pure) nodes.

In summary, Accuracy focuses on prediction performance, while Gini focuses on the creation of pure nodes during tree construction.
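For intuition, here is a small base-R sketch of the Gini impurity formula itself (1 minus the sum of squared class proportions; the label vectors are made up). A split that separates a mixed node into pure children produces the impurity decrease that MeanDecreaseGini accumulates across the forest:

```r
# Gini impurity for a vector of class labels at a single node
gini_impurity <- function(labels) {
  p <- table(labels) / length(labels)  # class proportions at the node
  1 - sum(p^2)
}

gini_impurity(c("trees", "trees", "trees", "trees"))  # 0: a pure node
gini_impurity(c("trees", "grass", "trees", "grass"))  # 0.5: maximally mixed (2 classes)
```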

varImpPlot(rfModel$finalModel)

4.5. Applying the Model to the Test Dataset

We’ve split the data into 80% for training and 20% for testing, meaning that our model has not seen 20% of it yet. Let’s use the model built to see how it performs in predicting those 20% left.

In this step, we’ll evaluate the model performance by checking the Accuracy, Kappa, Confusion Matrix, Sensitivity, and Specificity. The first two we’ve seen before and focus more on the overall performance of the model, while the last three evaluate it by each of the classified features.

The Confusion Matrix summarizes which features were misclassified and mistaken for others; for example, 25 bare_soil pixels and 30 grass pixels were misclassified as trees. Sensitivity measures how well the model identifies actual positives, in other words: of all actual “trees” pixels, how many did the model correctly classify as “trees”? The model correctly identified 97.85% of them. Complementarily, Specificity measures how well the model identifies actual negatives: of all non-“grass” pixels, how many did the model correctly classify as not “grass”? Here, 95.87% of non-“grass” pixels were correctly classified as not “grass”.

Check out these two videos if you want to know more about these metrics: - Confusion Matrix Explained - 7 Must-Know Metrics That Power Machine Learning
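To see where these percentages come from, we can recompute Sensitivity and Specificity by hand (the counts below are copied from the confusionMatrix() output shown further down in this post):

```r
# This tutorial's confusion matrix (rows = predicted, columns = reference)
cm <- matrix(c(341,   48,   15,
                65, 1119,   20,
                25,   30, 1594),
             nrow = 3, byrow = TRUE,
             dimnames = list(Predicted = c("bare_soil", "grass", "trees"),
                             Reference = c("bare_soil", "grass", "trees")))

# Sensitivity for "trees": correctly predicted trees / all reference trees
sens_trees <- cm["trees", "trees"] / sum(cm[, "trees"])
round(sens_trees, 4)  # 0.9785

# Specificity for "grass": non-grass pixels not labeled grass / all reference non-grass
non_grass   <- sum(cm[, colnames(cm) != "grass"])
false_grass <- sum(cm["grass", colnames(cm) != "grass"])
spec_grass  <- (non_grass - false_grass) / non_grass
round(spec_grass, 4)  # 0.9587
```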

rfPredict <- predict(rfModel, rf_test_data)
head(rfPredict)
[1] grass grass grass grass grass grass
Levels: bare_soil grass trees
conf_matrix <- confusionMatrix(rfPredict, rf_test_data$Class)
conf_matrix
Confusion Matrix and Statistics

           Reference
Prediction  bare_soil grass trees
  bare_soil       341    48    15
  grass            65  1119    20
  trees            25    30  1594

Overall Statistics
                                          
               Accuracy : 0.9377          
                 95% CI : (0.9288, 0.9457)
    No Information Rate : 0.5002          
    P-Value [Acc > NIR] : < 2e-16         
                                          
                  Kappa : 0.8952          
                                          
 Mcnemar's Test P-Value : 0.07009         

Statistics by Class:

                     Class: bare_soil Class: grass Class: trees
Sensitivity                    0.7912       0.9348       0.9785
Specificity                    0.9777       0.9587       0.9662
Pos Pred Value                 0.8441       0.9294       0.9666
Neg Pred Value                 0.9685       0.9620       0.9782
Prevalence                     0.1323       0.3675       0.5002
Detection Rate                 0.1047       0.3436       0.4894
Detection Prevalence           0.1240       0.3697       0.5063
Balanced Accuracy              0.8844       0.9468       0.9724
conf_table <- as.data.frame(conf_matrix$table)

colnames(conf_table) <- c("Predicted", "Actual", "Frequency")
conf_table
  Predicted    Actual Frequency
1 bare_soil bare_soil       341
2     grass bare_soil        65
3     trees bare_soil        25
4 bare_soil     grass        48
5     grass     grass      1119
6     trees     grass        30
7 bare_soil     trees        15
8     grass     trees        20
9     trees     trees      1594
ggplot(data = conf_table, aes(x = Actual, y = Predicted, fill = Frequency)) +
  geom_tile(color = "white") +
  scale_fill_gradient(low = "white", high = "darkgreen") +
  geom_text(aes(label = Frequency), color = "black", size = 5) +
  labs(title = "Confusion Matrix Heatmap",
       x = "Actual Labels",
       y = "Predicted Labels",
       fill = "Frequency") +
  theme_minimal(base_size = 14) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

4.6. Performing Image Classification

Just recapping, so far we’ve been using pixel values extracted within the boundaries of our mapped features to:

1) Train the model using 80% of these pixel values.

2) Test the model using the remaining 20% of these pixel values.

Now, we’re going to use the trained model to predict the vegetation cover type for the full image (the one containing all reflectance bands and indexes). To do so, we’ll use the predict() function.

image_lc_predict <- terra::predict(train_features_clnd_indexes_scaled, rfModel) 
image_lc_predict
class       : SpatRaster 
dimensions  : 559, 558, 1  (nrow, ncol, nlyr)
resolution  : 1.99956e-06, 1.95189e-06  (x, y)
extent      : -47.63879, -47.63768, -14.06791, -14.06682  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326) 
source(s)   : memory
categories  : class 
name        :     class 
min value   : bare_soil 
max value   :     trees 
# checking how many pixels were assigned for each feature
pixel_counts <- terra::freq(image_lc_predict)
pixel_counts
  layer     value  count
1     1 bare_soil  74005
2     1     grass 155720
3     1     trees  82197

4.7. Visualizing the Classification

Let’s plot the created image and determine feature colors for better visualization.

# Let's define the palette (mainly using hexadecimal colours).
pal <-c("#7b6f89", "#81a581", "#e5ca81")

tmap::tm_shape(image_lc_predict)+
  tmap::tm_raster(style = "cat",
                  labels = c("bare_soil",
                             "grass", 
                             "trees"),
                  palette = pal,
                  title = "Legend") +
  tmap::tm_layout(main.title= "",
                  main.title.size =.9 ) +
  tmap::tm_layout(legend.outside = TRUE,
                  legend.outside.position = "right",
                  legend.title.size = .8)

We can also visualize it interactively, through the leaflet() function, and overlay it on a basemap. It looks like our model performed fairly well at predicting the vegetation types and cover.

raster_lc_predict <- raster::raster(image_lc_predict)

pal <- leaflet::colorFactor(
  palette = c("#7b6f89", "#81a581", "#e5ca81"), 
  levels = c(1, 2, 3) 
)

leaflet::leaflet() |> 
  leaflet::addTiles(group = "Base Map") |> 
  leafem::addRasterRGB(raster::brick(cropped_raster_bands[[1:3]]),
               r = 3, g = 2, b = 1,
               group = "RGB Image") |> 
  leaflet::addRasterImage(raster_lc_predict,
                 colors = pal,
                 opacity = 0.4,
                 group = "Land Cover Classification") |> 
  leaflet::addLegend(pal = pal,
            values = c(1, 2, 3), 
            title = "Land Cover Classification",
            labels = c("bare_soil", "grass", "trees"),
            position = "bottomright") |> 
  leaflet::addLayersControl(
    baseGroups = c("Base Map", "RGB Image"),
    overlayGroups = c("Land Cover Classification"),
    options = leaflet::layersControlOptions(collapsed = FALSE))

Let’s also visualize the pixel frequency of each class using an interactive barplot.

raster_lc_predict_df <- as.data.frame(raster_lc_predict) |> 
  dplyr::mutate(category = dplyr::case_when(
    class == 1 ~ "bare_soil",
    class == 2 ~ "grass",
    class == 3 ~ "trees",
    TRUE ~ NA_character_)) |> 
  dplyr::group_by(category) |> 
  dplyr::summarise(count = dplyr::n()) |> 
  dplyr::ungroup() |> 
  dplyr::mutate(prop = round(((count*100)/(sum(count))), 2)) |> 
  dplyr::arrange(dplyr::desc(prop))

head(raster_lc_predict_df)
# A tibble: 3 × 3
  category   count  prop
  <chr>      <int> <dbl>
1 grass     155720  49.9
2 trees      82197  26.4
3 bare_soil  74005  23.7
plotly::plot_ly(raster_lc_predict_df, 
        x = ~category, 
        y = ~prop,
        type = "bar",
        text = ~prop, 
        textposition = 'auto') |> 
  plotly::layout(title = "Vegetation Cover",
               xaxis = list(title = "Vegetation Type",
                            tickvals = c("bare_soil", "grass", "trees"),
                            ticktext = c("Bare Soil", "Grass", "Trees")),
               yaxis = list(title = "Cover Proportion (%)"))
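Finally, if you need this class summary outside of R (for a report or a GIS), the table can be written to disk. Here is a minimal base-R sketch that rebuilds the summary from the pixel counts above (the output file name is just an example):

```r
# Pixel counts per class, from terra::freq() earlier in the post
counts <- c(bare_soil = 74005, grass = 155720, trees = 82197)

# Rebuild the summary table: counts plus percentage cover
summary_df <- data.frame(
  category = names(counts),
  count    = as.integer(counts),
  prop     = round(100 * counts / sum(counts), 2),
  row.names = NULL
)

# Write it out as CSV (example path in the session's temp directory)
out_file <- file.path(tempdir(), "land_cover_summary.csv")
write.csv(summary_df, out_file, row.names = FALSE)

summary_df
```

For the classified raster itself, terra::writeRaster(image_lc_predict, "land_cover.tif") would export a GeoTIFF you can open in QGIS or ArcGIS.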

Thank you for sticking around for this whole tutorial. Remember, although we used it for satellite image classification, the same concepts carry over to other predictive models and other types of classification.


Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/souzayuri/souzayuri.github.io, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Souza (2025, April 29). Yuri Souza, Ecologist: Supervised Image Classification with Random Forest in R. Retrieved from https://souzayuri.github.io/posts/2025-01-05-random_forest_sup_classific_r/

BibTeX citation

@misc{souza2025supervised,
  author = {Souza, Yuri},
  title = {Yuri Souza, Ecologist: Supervised Image Classification with Random Forest in R},
  url = {https://souzayuri.github.io/posts/2025-01-05-random_forest_sup_classific_r/},
  year = {2025}
}