In this post, I demonstrate how to load and visualize georeferenced spatial data and perform a supervised random forest classification using R.
Machine learning has been one of my favorite topics over the last few years, and I’ve been using approaches such as predictive models and cluster analysis. However, these are just a subset of what machine learning can do. With that in mind, I decided to take a step forward and learn what I believe to be the next topic in my field of expertise: Supervised Image Classification. This is a highly effective method for automatically identifying features in an image, such as forest patches, crops, urban areas, water bodies, and so on. In a few words, it works by telling an algorithm what each of these features looks like, based on reflectance signatures; the algorithm then searches the images for patterns of pixel aggregation that resemble the patterns of the input features.
I’ve been trying to perform supervised image classification of satellite images in R for a while, and I’ve always stumbled upon conceptual problems, such as: How do I do it? Which parameters should I use for validation? Which packages and functions? What is happening in the background?
Even though this has been a very popular technique among remote sensing practitioners and is widely covered on the internet, it’s definitely easier to find tutorials for Python or for software such as ArcGIS and QGIS than for R. I’m not saying there aren’t R tutorials out there; in fact, the content available here comes from those tutorials, but I found the information spread across a few blogs and a bit scarce. So, after struggling to find a tutorial that I liked, I decided to create my own, based on what I could find online, while also adding my personal touch.
This is my attempt to learn how to perform supervised image classification using the random forest algorithm. I’m pretty sure there are many other ways to carry it out, and definitely more efficient ones, but this is what I could get so far, and I believe it will suit a large audience.
For the record, what I’m presenting here is a compilation of what is available in the following blogs and tutorials:
Random Forest is a supervised nonlinear algorithm that classifies a dataset into categories or classes using classification or regression techniques. It is based on the approach of Decision Trees, with the difference that Random Forest uses a collection of Decision Trees instead of a single one, therefore providing more flexibility and accuracy in the output. Decision trees on their own have a limitation: they tend to show low accuracy during the testing phase due to a process called over-fitting (which here can be understood as a model matching the training set so closely that it fails to make correct predictions on new data). Random Forest, on the other hand, helps to reduce tree correlation by injecting more randomness into the tree-growing process: it searches for the best feature among a random subset of features, resulting in a better and more accurate model. Being able to randomize the data and draw random samples is important in building a model because it creates variation, minimizes bias, creates equivalent treatment groups, and therefore increases the model’s potential for generalization.
Random Forest and Decision Trees both have a tree-like structure made up of nodes and branches. At each node, the data gets split based on one of the input features, creating two or more branches. This splitting continues at the next nodes, leading to even more branches that help break down the original data into smaller parts. The process goes on until you reach a point where most of the data in a node belongs to the same class, making further splits unnecessary. The very first split happens at the root node, while the final nodes, called leaves, are linked to specific class labels. The routes you take from the root to each leaf outline the rules for classifying the data. Check this out for a deep explanation.
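To make the node-splitting idea concrete, here is a minimal sketch (my own example, not from the original tutorials) that grows a single classification tree on R’s built-in iris dataset using rpart, a package that ships with R:

```r
library(rpart)

# Grow a single classification tree: each node splits the data on one
# feature, and splitting stops when the leaves are (mostly) pure
single_tree <- rpart(Species ~ ., data = iris, method = "class")

# Print the split rules from the root node down to the leaves
print(single_tree)

# Classify the training data with the fitted tree
tree_pred <- predict(single_tree, iris, type = "class")
table(tree_pred, iris$Species)
```

Printing the tree shows the root split and the path to each leaf, i.e., exactly the classification rules described above.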
Random Forest uses the bagging method in decision trees and, as a result, increases the accuracy of the learning model. Bagging trees introduces a random component into the tree-building process by building many trees on bootstrapped copies of the training data. Bagging then aggregates the predictions across all the trees; this aggregation reduces the variance of the overall procedure and results in improved predictive performance.
Simply put, bagging (short for bootstrap aggregating) is directly derived from bootstrapping: it uses sampling with replacement to generate multiple datasets from the original training set, then turns this statistical resampling method into a powerful ensemble learning technique by training a model on each sample and aggregating their predictions.
Bootstrapping is primarily a resampling technique used to estimate variability or confidence in statistical measures by generating multiple bootstrap samples. Bagging builds on this concept by using each of these bootstrap samples to train separate models, often referred to as base learners. While bootstrapping focuses solely on data generation, bagging takes it further by combining the predictions from these multiple models to form a more stable and accurate overall prediction. The main goal of bagging is to reduce variance and minimize overfitting in regression and classification tasks, making it a key ensemble method in machine learning, whereas bootstrapping itself is not directly concerned with model performance.
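To illustrate the bootstrapping-versus-bagging distinction, here is a hand-rolled sketch (again on the built-in iris data, not the drone imagery) that draws bootstrap samples, fits one tree per sample, and aggregates the predictions by majority vote. A Random Forest does this too, but additionally considers only a random subset of features at each split:

```r
library(rpart)

set.seed(42)
n_trees <- 25

# Bootstrapping: each sample is drawn from the training data WITH replacement
boot_preds <- sapply(seq_len(n_trees), function(i) {
  boot_idx  <- sample(nrow(iris), replace = TRUE)
  boot_tree <- rpart(Species ~ ., data = iris[boot_idx, ], method = "class")
  as.character(predict(boot_tree, iris, type = "class"))
})

# Bagging: aggregate the per-tree predictions by majority vote
majority_vote <- apply(boot_preds, 1, function(x) names(which.max(table(x))))
mean(majority_vote == iris$Species)  # ensemble accuracy on the training data
```

The single trees vary from sample to sample; the vote across them is what stabilizes the final prediction.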
For more information:
In a few words, my goal here is to use drone optical imagery I took during my fieldwork in Brazil and build a model able to identify Bare Soil, Grass, and Trees in it. This image was taken in Chapada dos Veadeiros National Park, Brazil.
I broke down the workflow of this tutorial into six main steps, along with some intermediate ones. This step-by-step approach really helped me grasp how the classification process works.
If you want to follow along and try each step described here yourself, please download the data HERE.
To get started, the very first step is to load the necessary packages (assuming you already have R installed on your machine). We’ll utilize raster and terra for handling raster data, sf for managing spatial data like points and polygons, and tidyverse for data tabulation and visualization. Additionally, we’ll rely on caret for implementing random forest methods, and use mapview along with leaflet to plot and visualize both the polygons and raster images. While these are the primary packages we’ll need, I’ll also incorporate some others to further explore the results of our image classification.
Begin by installing and loading the necessary packages, along with the additional ones you’ll need. I prefer a setup that first checks if a package is already installed; if it isn’t, it installs the package and then loads it.
pkgs <- c("tidyverse", "terra", "raster", "sf", "sp", "caret",
          "mapview", "leaflet", "ggridges", "randomForest",
          "DiagrammeR", "tmap", "leafem", "plotly")

# Install any package that is missing, then load it
for (p in pkgs) {
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
    library(p, character.only = TRUE)
  }
}
Let’s load all the files and information we’ll need later on:
woody_savanna.tif: this is a drone image that I took during my master’s. It has six stacked bands: Blue, Green, Red, Red Edge, Near Infrared, and a Mask, in this order. The mask band flags pixels that could not be identified during image compilation (to build this image, I used Agisoft Metashape).
sampled_area.shp: this is a polygon shapefile that we’ll use to clip the image and restrict it to an area of 100m x 100m (thereby removing the edges of the image).
train_features.shp: this file contains the features we’ll use to train our Random Forest model. It contains three classes: Bare Soil, Grass, and Trees.
In this step we load the files and make sure to rename their classes using a standard nomenclature.
raw_raster <- terra::rast("woody_savanna.tif") |>
terra::aggregate(fact=2) # original pixel size = 10.8cm, new pixel size 21.6cm (old * 2)
raw_raster
class : SpatRaster
dimensions : 759, 782, 6 (nrow, ncol, nlyr)
resolution : 1.99956e-06, 1.95189e-06 (x, y)
extent : -47.63899, -47.63742, -14.0681, -14.06662 (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326)
source(s) : memory
names : woody~nna_1, woody~nna_2, woody~nna_3, woody~nna_4, woody~nna_5, woody~nna_6
min values : 2.89, 16.26, 16.97, 24.26, 117.52, 0
max values : 65535.00, 65535.00, 65535.00, 65535.00, 65535.00, 65535
plot(raw_raster)
sampled_area <- terra::vect("sampled_area.shp")
sampled_area
class : SpatVector
geometry : polygons
dimensions : 1, 1 (geometries, attributes)
extent : -47.63879, -47.63768, -14.06791, -14.06682 (xmin, xmax, ymin, ymax)
source : sampled_area.shp
coord. ref. : lon/lat WGS 84 (EPSG:4326)
names : Id
type : <int>
values : 0
plot(sampled_area)
train_features <- sf::st_read("train_features.shp")
Reading layer `train_features' from data source
`D:\Others\souzayuri.github.io\_posts\2025-01-05-random_forest_sup_classific_r\train_features.shp'
using driver `ESRI Shapefile'
Simple feature collection with 46 features and 9 fields
Geometry type: POLYGON
Dimension: XY
Bounding box: xmin: -47.63869 ymin: -14.06791 xmax: -47.63768 ymax: -14.06683
Geodetic CRS: WGS 84
train_features
Simple feature collection with 46 features and 9 fields
Geometry type: POLYGON
Dimension: XY
Bounding box: xmin: -47.63869 ymin: -14.06791 xmax: -47.63768 ymax: -14.06683
Geodetic CRS: WGS 84
First 10 features:
Classcode Classname Classvalue RED GREEN BLUE Count Shape_Leng
1 <NA> grass 1 255 255 190 16651 0.0002222801
2 <NA> grass 1 255 255 190 13637 0.0002344464
3 <NA> grass 1 255 255 190 9875 0.0001834221
4 <NA> grass 1 255 255 190 11577 0.0002181070
5 <NA> grass 1 255 255 190 10001 0.0002419869
6 <NA> grass 1 255 255 190 7684 0.0001557914
7 <NA> grass 1 255 255 190 16943 0.0002274619
8 <NA> grass 1 255 255 190 6667 0.0001647991
9 <NA> grass 1 255 255 190 2964 0.0001024218
10 <NA> grass 1 255 255 190 4238 0.0001182262
Shape_Area geometry
1 3.318277e-09 POLYGON ((-47.63788 -14.066...
2 2.717603e-09 POLYGON ((-47.63782 -14.067...
3 1.967932e-09 POLYGON ((-47.63772 -14.066...
4 2.307135e-09 POLYGON ((-47.63781 -14.066...
5 1.993210e-09 POLYGON ((-47.63793 -14.067...
6 1.531386e-09 POLYGON ((-47.63788 -14.067...
7 3.376445e-09 POLYGON ((-47.63793 -14.066...
8 1.328762e-09 POLYGON ((-47.63768 -14.067...
9 5.907468e-10 POLYGON ((-47.63769 -14.066...
10 8.447055e-10 POLYGON ((-47.63785 -14.066...
# renaming their classes accordingly to keep it consistent across upcoming analyses.
train_features_clnd <- sf::st_read("train_features.shp") |>
dplyr::rename(OBJECTID = "Classcode",
Class = "Classname") |>
dplyr::mutate(OBJECTID = as.character(1:46),
Class = as.factor(Class),
Code = as.numeric(Class)) |>
dplyr::select(c(1:3))
Reading layer `train_features' from data source
`D:\Others\souzayuri.github.io\_posts\2025-01-05-random_forest_sup_classific_r\train_features.shp'
using driver `ESRI Shapefile'
Simple feature collection with 46 features and 9 fields
Geometry type: POLYGON
Dimension: XY
Bounding box: xmin: -47.63869 ymin: -14.06791 xmax: -47.63768 ymax: -14.06683
Geodetic CRS: WGS 84
train_features_clnd
Simple feature collection with 46 features and 3 fields
Geometry type: POLYGON
Dimension: XY
Bounding box: xmin: -47.63869 ymin: -14.06791 xmax: -47.63768 ymax: -14.06683
Geodetic CRS: WGS 84
First 10 features:
OBJECTID Class Classvalue geometry
1 1 grass 1 POLYGON ((-47.63788 -14.066...
2 2 grass 1 POLYGON ((-47.63782 -14.067...
3 3 grass 1 POLYGON ((-47.63772 -14.066...
4 4 grass 1 POLYGON ((-47.63781 -14.066...
5 5 grass 1 POLYGON ((-47.63793 -14.067...
6 6 grass 1 POLYGON ((-47.63788 -14.067...
7 7 grass 1 POLYGON ((-47.63793 -14.066...
8 8 grass 1 POLYGON ((-47.63768 -14.067...
9 9 grass 1 POLYGON ((-47.63769 -14.066...
10 10 grass 1 POLYGON ((-47.63785 -14.066...
plot(train_features_clnd)
We also check their CRS (coordinate reference system) and ensure they’ll have the same coordinate properties.
# Compare CRS, if they are different make them equal
if (terra::crs(raw_raster) == terra::crs(train_features_clnd)) {
print("CRS are identical!")
} else {
print("CRS are different. Therefore, reprojecting to match raster projection...")
# Reproject train_features to match raw_raster
train_features_clnd <- sf::st_transform(train_features_clnd, crs = terra::crs(raw_raster))
}
[1] "CRS are different. Therefore, reprojecting to match raster projection..."
# Compare CRS again to ensure the above code worked.
if (terra::crs(raw_raster) == terra::crs(train_features_clnd)) {
print("CRS are identical!")
} else {
print("CRS are different. Therefore, reprojecting to match raster projection...")
# Reproject train_features to match raw_raster
train_features_clnd <- sf::st_transform(train_features_clnd, crs = terra::crs(raw_raster))
}
[1] "CRS are identical!"
Let’s check the training data and the number of features, which totals 46. Each training class has 15 features/polygons; Grass has 16 because I added one more by mistake and was too lazy to remove it =).
glimpse(train_features_clnd)
Rows: 46
Columns: 4
$ OBJECTID <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10"…
$ Class <fct> grass, grass, grass, grass, grass, grass, grass, …
$ Classvalue <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2…
$ geometry <POLYGON [°]> POLYGON ((-47.63788 -14.066..., POLYGON (…
summary(train_features_clnd)
OBJECTID Class Classvalue geometry
Length:46 bare_soil:15 Min. :1.000 POLYGON :46
Class :character grass :16 1st Qu.:1.000 epsg:4326 : 0
Mode :character trees :15 Median :2.000 +proj=long...: 0
Mean :1.978
3rd Qu.:3.000
Max. :3.000
Now that we’ve standardized the data, we’ll crop the raster image to the 100m x 100m grid limit. We also want to check the CRS and make sure it’s correct, and plot each of these bands side-by-side.
cropped_raster <- terra::crop(raw_raster, sampled_area)
cropped_raster
class : SpatRaster
dimensions : 559, 558, 6 (nrow, ncol, nlyr)
resolution : 1.99956e-06, 1.95189e-06 (x, y)
extent : -47.63879, -47.63768, -14.06791, -14.06682 (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326)
source(s) : memory
names : woody~nna_1, woody~nna_2, woody~nna_3, woody~nna_4, woody~nna_5, woody~nna_6
min values : 2.89, 16.26, 16.97, 24.26, 165.85, 65535
max values : 3139.13, 7843.21, 9368.55, 17875.35, 20952.24, 65535
plot(cropped_raster)
# Compare CRS
if (terra::crs(raw_raster) == terra::crs(train_features_clnd)) {
print("CRS are identical!")
} else {
print("CRS are different. Therefore, reprojecting to match raster projection...")
# Reproject train_features to match raw_raster
train_features_clnd <- sf::st_transform(train_features_clnd, crs = terra::crs(raw_raster))
}
[1] "CRS are identical!"
So far, we have ensured the consistency of the image coordinates and defined the grid area we want to work on. Now, let’s rename the bands according to their respective spectral signatures. This will help us later identify which band is which. In this step, I also remove the Mask band by selecting only the bands of interest, according to their position in the band list (1 to 5).
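The code for this subsetting and renaming step is not shown above; based on the band names that appear in the later outputs (Blue, Green, Red, RE, NIR), it could look like the sketch below, where cropped_raster_bands is the object name the later chunks rely on:

```r
# Keep only bands 1 to 5 (dropping band 6, the Mask) and rename them
# according to their spectral signatures
cropped_raster_bands <- cropped_raster[[1:5]]
names(cropped_raster_bands) <- c("Blue", "Green", "Red", "RE", "NIR")
cropped_raster_bands
```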
Now that we have the raster with five bands and the polygons ready for training the model, let’s visualize our data. We’ll create a natural color composition. To do this, we need to give the positions of the RGB bands of the raster image (that is why naming them beforehand was important) and overlay it with each class represented by the polygons. For visualization, we’ll use mapview() because it’s interactive (just because we want something cool!).
train_class_colors <- unique(train_features_clnd$Class) |>
  as.data.frame() |>
  dplyr::mutate(hex = c("#ff7f00",   # grass
                        "#e41a1c",   # trees
                        "#55eed1"))  # bare_soil
train_class_colors
unique(train_features_clnd$Class) hex
1 grass #ff7f00
2 trees #e41a1c
3 bare_soil #55eed1
mapview::viewRGB(brick(cropped_raster_bands[[1:3]]),
r = 3, g = 2, b = 1,
map.types = c("Esri.WorldImagery", "OpenStreetMap", "OpenTopoMap"),
maxpixels = 866760,
layer.name = "RGB Image") +
mapview::mapView(train_features_clnd,
zcol = "Class",
color = "black",
lwd = 2,
col.regions = train_class_colors$hex,
layer.name = "Features")
One nice aspect of using optical imagery rasters with various bands is the ability to rearrange these bands to create new and informative images. To provide some context, the drone imagery we’re working with includes natural color RGB bands, which are part of the visible spectrum for humans, along with the Red Edge and Near-Infrared bands, which lie outside this spectrum. The latter two bands are particularly useful for highlighting vegetation characteristics, as plants reflect a significant amount of infrared light. This unique property makes infrared imagery an invaluable tool for analyzing vegetation, as it offers deeper insights into the state, health, vigor, and other attributes of plants that natural color imagery simply cannot reveal.
Now that we have our raster image and training features prepared, let’s move on to the next step: using this raster image and bands to create various vegetation indexes. They will be stacked as bands and will be used to train the model, using the mapped polygons, and to predict the vegetation classes in the later steps of the classification. This process will improve the classification of the image by pinpointing pixel combinations that are more accurately representative of the specific vegetation characteristics we aim to map.
When conducting image classification, it’s crucial to provide clear information that highlights the distinguishing features you’re mapping. In this section, we’ll focus on creating three vegetation indexes widely used in remote sensing: the NDVI (Normalized Difference Vegetation Index), the EVI (Enhanced Vegetation Index), and the SAVI (Soil Adjusted Vegetation Index). Each of these indexes captures vegetation in its own way, and using them alongside the spectral bands from the drone imagery makes the classification more robust.
This is a straightforward process that primarily involves utilizing the Red, Near Infrared, and Blue bands, along with a few predefined parameters. Each index has its own arithmetic equation, and we can identify these bands based on their positions within the raster. We will then plot these indexes side-by-side.
Check out the table below, summarizing the similarities and differences among these indexes.

| Feature | NDVI (Normalized Difference Vegetation Index) | EVI (Enhanced Vegetation Index) | SAVI (Soil Adjusted Vegetation Index) |
|---|---|---|---|
| Formula | (NIR - Red) / (NIR + Red) | 2.5 × (NIR - Red) / (NIR + 6 × Red - 7.5 × Blue + 1) | ((NIR - Red) / (NIR + Red + 0.5)) × (1 + 0.5) |
| Bands Used | NIR, Red | NIR, Red, Blue | NIR, Red |
| Purpose | General greenness/vegetation presence | Improves sensitivity in high biomass areas | Adjusts for soil background noise |
| Advantages | Simple, widely used | Corrects for atmospheric & canopy background effects | Better for areas with sparse vegetation or exposed soil |
| Limitations | Sensitive to soil brightness and atmospheric conditions | More complex, needs Blue band | Requires adjustment factor (L) based on soil type |
| Value Range | -1 to +1 (usually 0 to 0.8 in vegetated areas) | -1 to +1 (typically 0 to 0.9 for vegetation) | -1 to +1 (usually 0 to 0.7, depending on L and vegetation density) |
nir <- cropped_raster_bands[[5]]
red <- cropped_raster_bands[[3]]
blue <- cropped_raster_bands[[1]]
# Check the range of your input bands
print(paste("NIR range:", minmax(nir)))
print(paste("Red range:", minmax(red)))
print(paste("Blue range:", minmax(blue)))
[1] "NIR range: 165.850002288818" "NIR range: 20952.2397460938"
[1] "Red range: 16.9700002372265" "Red range: 9368.55017089844"
[1] "Blue range: 2.88999992236495" "Blue range: 3139.12994384766"
cropped_raster_bands_ndvi <- (nir - red) / (nir + red)
cropped_raster_bands_ndvi
class : SpatRaster
dimensions : 559, 558, 1 (nrow, ncol, nlyr)
resolution : 1.99956e-06, 1.95189e-06 (x, y)
extent : -47.63879, -47.63768, -14.06791, -14.06682 (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326)
source(s) : memory
name : NIR
min value : -0.08708532
max value : 0.95849029
cropped_raster_bands_evi <- 2.5 * ((nir - red) / (nir + (6 * red) - (7.5 * blue) + 1))
cropped_raster_bands_evi
class : SpatRaster
dimensions : 559, 558, 1 (nrow, ncol, nlyr)
resolution : 1.99956e-06, 1.95189e-06 (x, y)
extent : -47.63879, -47.63768, -14.06791, -14.06682 (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326)
source(s) : memory
name : NIR
min value : -8.517467
max value : 83.766299
cropped_raster_bands_savi <- ((nir - red) / (nir + red + 0.5)) * (1 + 0.5)
cropped_raster_bands_savi
class : SpatRaster
dimensions : 559, 558, 1 (nrow, ncol, nlyr)
resolution : 1.99956e-06, 1.95189e-06 (x, y)
extent : -47.63879, -47.63768, -14.06791, -14.06682 (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326)
source(s) : memory
name : NIR
min value : -0.1306091
max value : 1.4368568
# Create a red-to-green color palette
red_to_green_palette <- colorRampPalette(c("darkred", "yellow", "darkgreen"))
# Let's visualize them together
par(mfrow = c(1, 3))
plot(cropped_raster_bands_ndvi, main = "NDVI", col = red_to_green_palette(100))
plot(cropped_raster_bands_evi, main = "EVI", col = red_to_green_palette(100))
plot(cropped_raster_bands_savi, main = "SAVI", col = red_to_green_palette(100))
Let’s stack these three indexes onto our raster data so we have more information to feed the model. After stacking the index rasters with our main raster, we’ll notice that the new layers are all named NIR. That is because the first spectral band used in the calculations above was the NIR band. To avoid confusion between the NIR spectral band and the new index layers, we’ll rename them according to the index each represents.
train_features_clnd_indexes <- c(cropped_raster_bands,
cropped_raster_bands_ndvi,
cropped_raster_bands_evi,
cropped_raster_bands_savi)
train_features_clnd_indexes
class : SpatRaster
dimensions : 559, 558, 8 (nrow, ncol, nlyr)
resolution : 1.99956e-06, 1.95189e-06 (x, y)
extent : -47.63879, -47.63768, -14.06791, -14.06682 (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326)
source(s) : memory
names : Blue, Green, Red, RE, NIR, NIR, ...
min values : 2.89, 16.26, 16.97, 24.26, 165.85, -0.08708532, ...
max values : 3139.13, 7843.21, 9368.55, 17875.35, 20952.24, 0.95849029, ...
names(train_features_clnd_indexes) <- c(names(cropped_raster_bands),"NDVI", "EVI", "SAVI")
train_features_clnd_indexes
class : SpatRaster
dimensions : 559, 558, 8 (nrow, ncol, nlyr)
resolution : 1.99956e-06, 1.95189e-06 (x, y)
extent : -47.63879, -47.63768, -14.06791, -14.06682 (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326)
source(s) : memory
names : Blue, Green, Red, RE, NIR, NDVI, ...
min values : 2.89, 16.26, 16.97, 24.26, 165.85, -0.08708532, ...
max values : 3139.13, 7843.21, 9368.55, 17875.35, 20952.24, 0.95849029, ...
plot(train_features_clnd_indexes)
summary(train_features_clnd_indexes)
Blue Green Red
Min. : 2.89 Min. : 52.24 Min. : 58.04
1st Qu.: 384.11 1st Qu.:1156.35 1st Qu.:1209.77
Median : 528.55 Median :1651.34 Median :1781.48
Mean : 556.88 Mean :1654.48 Mean :1883.19
3rd Qu.: 704.35 3rd Qu.:2108.03 3rd Qu.:2453.26
Max. :2919.83 Max. :7528.84 Max. :9368.55
RE NIR NDVI
Min. : 147.7 Min. : 269.3 Min. :-0.07994
1st Qu.: 3228.9 1st Qu.: 3683.2 1st Qu.: 0.31428
Median : 4458.6 Median : 4960.7 Median : 0.47468
Mean : 4603.5 Mean : 5228.5 Mean : 0.46199
3rd Qu.: 5786.7 3rd Qu.: 6528.0 3rd Qu.: 0.61267
Max. :16438.5 Max. :19855.9 Max. : 0.93235
EVI SAVI
Min. :-0.07473 Min. :-0.1199
1st Qu.: 0.39774 1st Qu.: 0.4713
Median : 0.70507 Median : 0.7119
Mean : 0.73613 Mean : 0.6929
3rd Qu.: 1.03850 3rd Qu.: 0.9189
Max. :83.76630 Max. : 1.3981
You have probably noticed that the EVI and SAVI values range beyond the expected -1 to 1 (NDVI, being a normalized ratio, is unaffected by a constant scaling of the bands). That is because we need to convert our raw band values to reflectance values normalized to the 0 to 1 range. This conversion could have been done in earlier steps, but I wanted to show you this common issue. To convert, we simply divide each band by the theoretical maximum value of the drone image (in our case 65535, since it is a 16-bit image). Let’s convert these numbers and redo what we did in step 3.1, Creating Vegetation Indexes.
cropped_raster_bands_scaled <- cropped_raster_bands / 65535
# splitting the required bands
nir <- cropped_raster_bands_scaled[[5]]
red <- cropped_raster_bands_scaled[[3]]
blue <- cropped_raster_bands_scaled[[1]]
# Check the range of your input bands
print(paste("NIR range:", minmax(nir)))
print(paste("Red range:", minmax(red)))
print(paste("Blue range:", minmax(blue)))
[1] "NIR range: 0.00253070881649223" "NIR range: 0.319710685070478"
[1] "Red range: 0.000258945605206782"
[2] "Red range: 0.142954912198038"
[1] "Blue range: 4.40985720968177e-05"
[2] "Blue range: 0.0479000525497468"
cropped_raster_bands_ndvi_scaled <- (nir - red) / (nir + red)
cropped_raster_bands_ndvi_scaled
class : SpatRaster
dimensions : 559, 558, 1 (nrow, ncol, nlyr)
resolution : 1.99956e-06, 1.95189e-06 (x, y)
extent : -47.63879, -47.63768, -14.06791, -14.06682 (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326)
source(s) : memory
name : NIR
min value : -0.08708532
max value : 0.95849029
cropped_raster_bands_evi_scaled <- 2.5 * ((nir - red) / (nir + (6 * red) - (7.5 * blue) + 1))
cropped_raster_bands_evi_scaled
class : SpatRaster
dimensions : 559, 558, 1 (nrow, ncol, nlyr)
resolution : 1.99956e-06, 1.95189e-06 (x, y)
extent : -47.63879, -47.63768, -14.06791, -14.06682 (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326)
source(s) : memory
name : NIR
min value : -0.0196188
max value : 0.4756189
cropped_raster_bands_savi_scaled <- ((nir - red) / (nir + red + 0.5)) * (1 + 0.5)
cropped_raster_bands_savi_scaled
class : SpatRaster
dimensions : 559, 558, 1 (nrow, ncol, nlyr)
resolution : 1.99956e-06, 1.95189e-06 (x, y)
extent : -47.63879, -47.63768, -14.06791, -14.06682 (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326)
source(s) : memory
name : NIR
min value : -0.02529499
max value : 0.48395901
# Create a red-to-green color palette
red_to_green_palette <- colorRampPalette(c("darkred", "yellow", "darkgreen"))
# Let's visualize them together
par(mfrow = c(1, 3))
plot(cropped_raster_bands_ndvi_scaled, main = "NDVI", col = red_to_green_palette(100))
plot(cropped_raster_bands_evi_scaled, main = "EVI", col = red_to_green_palette(100))
plot(cropped_raster_bands_savi_scaled, main = "SAVI", col = red_to_green_palette(100))
If you check the new values, you’ll see that the indexes now stay within the expected -1 to 1 range, confirming the bands were properly scaled.
train_features_clnd_indexes_scaled <- c(cropped_raster_bands_scaled,
cropped_raster_bands_ndvi_scaled,
cropped_raster_bands_evi_scaled,
cropped_raster_bands_savi_scaled)
train_features_clnd_indexes_scaled
class : SpatRaster
dimensions : 559, 558, 8 (nrow, ncol, nlyr)
resolution : 1.99956e-06, 1.95189e-06 (x, y)
extent : -47.63879, -47.63768, -14.06791, -14.06682 (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326)
source(s) : memory
names : Blue, Green, Red, RE, NIR, NIR, ...
min values : 4.409857e-05, 0.0002481117, 0.0002589456, 0.0003701839, 0.002530709, -0.08708532, ...
max values : 4.790005e-02, 0.1196797144, 0.1429549122, 0.2727603586, 0.319710685, 0.95849029, ...
# renaming the bands by using their position in the list
names(train_features_clnd_indexes_scaled) <- c(names(cropped_raster_bands_scaled),"NDVI", "EVI", "SAVI")
train_features_clnd_indexes_scaled
class : SpatRaster
dimensions : 559, 558, 8 (nrow, ncol, nlyr)
resolution : 1.99956e-06, 1.95189e-06 (x, y)
extent : -47.63879, -47.63768, -14.06791, -14.06682 (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326)
source(s) : memory
names : Blue, Green, Red, RE, NIR, NDVI, ...
min values : 4.409857e-05, 0.0002481117, 0.0002589456, 0.0003701839, 0.002530709, -0.08708532, ...
max values : 4.790005e-02, 0.1196797144, 0.1429549122, 0.2727603586, 0.319710685, 0.95849029, ...
plot(train_features_clnd_indexes_scaled)
summary(train_features_clnd_indexes_scaled)
Blue Green Red
Min. :0.0000441 Min. :0.0007971 Min. :0.0008856
1st Qu.:0.0058611 1st Qu.:0.0176447 1st Qu.:0.0184598
Median :0.0080652 Median :0.0251978 Median :0.0271837
Mean :0.0084975 Mean :0.0252458 Mean :0.0287356
3rd Qu.:0.0107477 3rd Qu.:0.0321666 3rd Qu.:0.0374343
Max. :0.0445538 Max. :0.1148827 Max. :0.1429549
RE NIR NDVI
Min. :0.002254 Min. :0.004109 Min. :-0.07994
1st Qu.:0.049271 1st Qu.:0.056202 1st Qu.: 0.31428
Median :0.068034 Median :0.075696 Median : 0.47468
Mean :0.070244 Mean :0.079781 Mean : 0.46199
3rd Qu.:0.088299 3rd Qu.:0.099611 3rd Qu.: 0.61267
Max. :0.250836 Max. :0.302982 Max. : 0.93235
EVI SAVI
Min. :-0.01902 Min. :-0.02439
1st Qu.: 0.05560 1st Qu.: 0.06694
Median : 0.09205 Median : 0.10844
Mean : 0.10633 Mean : 0.12251
3rd Qu.: 0.14630 3rd Qu.: 0.16798
Max. : 0.47562 Max. : 0.48396
With these images, we have everything we need to move forward: preparing the data that will be used to train and test our random forest model. Let’s start by extracting the pixel values for the features we mapped; we will later use these pixel values to train and test the model.
# extracting the pixel information using the mapped features/polygons
train_features_clnd_pxls <- terra::extract(train_features_clnd_indexes_scaled, train_features_clnd)
head(train_features_clnd_pxls)
ID Blue Green Red RE NIR
1 1 0.006558328 0.02629480 0.03789761 0.06722560 0.06768292
2 1 0.007127642 0.03018280 0.04512795 0.08374609 0.08382895
3 1 0.006670024 0.02567712 0.03890700 0.06104662 0.06191714
4 1 0.006507515 0.02560601 0.04088975 0.07106096 0.06911742
5 1 0.007614557 0.02979995 0.04475639 0.07366522 0.07263493
6 1 0.007649500 0.02975463 0.04506905 0.07024475 0.07175830
NDVI EVI SAVI
1 0.2821098 0.05976755 0.07377707
2 0.3001080 0.07435983 0.09229806
3 0.2282206 0.04619273 0.05744646
4 0.2565984 0.05575727 0.06941147
5 0.2374838 0.05427793 0.06773307
6 0.2284504 0.05193264 0.06490289
# Let's use the ID values to index into the Class vector. This retrieves the class name corresponding to each ID and assigns it to a new column called Class.
train_features_clnd_pxls$Class <- train_features_clnd$Class[train_features_clnd_pxls$ID]
head(train_features_clnd_pxls)
ID Blue Green Red RE NIR
1 1 0.006558328 0.02629480 0.03789761 0.06722560 0.06768292
2 1 0.007127642 0.03018280 0.04512795 0.08374609 0.08382895
3 1 0.006670024 0.02567712 0.03890700 0.06104662 0.06191714
4 1 0.006507515 0.02560601 0.04088975 0.07106096 0.06911742
5 1 0.007614557 0.02979995 0.04475639 0.07366522 0.07263493
6 1 0.007649500 0.02975463 0.04506905 0.07024475 0.07175830
NDVI EVI SAVI Class
1 0.2821098 0.05976755 0.07377707 grass
2 0.3001080 0.07435983 0.09229806 grass
3 0.2282206 0.04619273 0.05744646 grass
4 0.2565984 0.05575727 0.06941147 grass
5 0.2374838 0.05427793 0.06773307 grass
6 0.2284504 0.05193264 0.06490289 grass
# now that we have the class names we don't need the ID index position
train_features_clnd_pxls$ID <- NULL
summary(train_features_clnd_pxls)
Blue Green Red
Min. :0.001006 Min. :0.002216 Min. :0.002351
1st Qu.:0.006488 1st Qu.:0.022588 1st Qu.:0.023575
Median :0.008557 Median :0.029002 Median :0.034620
Mean :0.009274 Mean :0.029265 Mean :0.035058
3rd Qu.:0.011763 3rd Qu.:0.035875 3rd Qu.:0.045128
Max. :0.029494 Max. :0.094121 Max. :0.131305
RE NIR NDVI
Min. :0.005442 Min. :0.003323 Min. :-0.07855
1st Qu.:0.058055 1st Qu.:0.061778 1st Qu.: 0.21691
Median :0.072678 Median :0.076645 Median : 0.34773
Mean :0.076391 Mean :0.084564 Mean : 0.40261
3rd Qu.:0.090349 3rd Qu.:0.100305 3rd Qu.: 0.61267
Max. :0.211573 Max. :0.242752 Max. : 0.88841
EVI SAVI Class
Min. :-0.01962 Min. :-0.02530 bare_soil:2156
1st Qu.: 0.04550 1st Qu.: 0.05575 grass :5988
Median : 0.07358 Median : 0.08878 trees :8149
Mean : 0.10130 Mean : 0.11685
3rd Qu.: 0.14868 3rd Qu.: 0.17074
Max. : 0.40614 Max. : 0.41690
Now that we’ve assigned each band and vegetation index to the mapped features, let’s split the pixels by vegetation class and take a look at each class’s distribution using frequency histograms.
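The per-class tables below come from splitting the extracted pixel table by vegetation class. That step isn’t shown explicitly, so here is a sketch of how these subsets were presumably created (the object names val_grass, val_trees, and val_bare_soil are taken from the plotting code further down):

```r
# assumed step: subsetting the extracted pixel table by vegetation class
val_grass     <- subset(train_features_clnd_pxls, Class == "grass")
val_trees     <- subset(train_features_clnd_pxls, Class == "trees")
val_bare_soil <- subset(train_features_clnd_pxls, Class == "bare_soil")
head(val_grass)
head(val_trees)
head(val_bare_soil)
```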
Blue Green Red RE NIR NDVI
1 0.006558328 0.02629480 0.03789761 0.06722560 0.06768292 0.2821098
2 0.007127642 0.03018280 0.04512795 0.08374609 0.08382895 0.3001080
3 0.006670024 0.02567712 0.03890700 0.06104662 0.06191714 0.2282206
4 0.006507515 0.02560601 0.04088975 0.07106096 0.06911742 0.2565984
5 0.007614557 0.02979995 0.04475639 0.07366522 0.07263493 0.2374838
6 0.007649500 0.02975463 0.04506905 0.07024475 0.07175830 0.2284504
EVI SAVI Class
1 0.05976755 0.07377707 grass
2 0.07435983 0.09229806 grass
3 0.04619273 0.05744646 grass
4 0.05575727 0.06941147 grass
5 0.05427793 0.06773307 grass
6 0.05193264 0.06490289 grass
Blue Green Red RE NIR NDVI
5123 0.01274861 0.03244617 0.04266209 0.07163714 0.07920165 0.2998395
5124 0.01041855 0.02804318 0.03677150 0.05825589 0.06731792 0.2934633
5125 0.00786923 0.01944717 0.02781994 0.04466133 0.05063920 0.2908426
5126 0.01530221 0.04395438 0.05080659 0.10412344 0.11215274 0.3764507
5127 0.01654933 0.05380575 0.05253895 0.14308217 0.17049882 0.5288785
5128 0.01178302 0.03397559 0.03511559 0.08490883 0.10346059 0.4931945
EVI SAVI Class
5123 0.07369464 0.08813722 trees
5124 0.06312248 0.07584910 trees
5125 0.04924143 0.05917252 trees
5126 0.11777173 0.13880071 trees
5127 0.21658119 0.24471723 trees
5128 0.13939068 0.16054076 trees
Blue Green Red RE NIR
11363 0.01098466 0.02928374 0.03323293 0.06334142 0.06236515
11364 0.01298375 0.03484810 0.04463157 0.05669749 0.05331304
11365 0.01331884 0.03622065 0.04737346 0.06195178 0.06458808
11366 0.01330053 0.03285496 0.04463889 0.05397818 0.04485557
11367 0.01359854 0.03721920 0.04782483 0.06062165 0.05441871
11368 0.01355749 0.03716655 0.04862684 0.07034501 0.06854978
NDVI EVI SAVI Class
11363 0.304736465 0.0617533699 0.0733688227 bare_soil
11364 0.088636497 0.0177357488 0.0217782723 bare_soil
11365 0.153754740 0.0344585245 0.0421953440 bare_soil
11366 0.002421136 0.0004465992 0.0005513494 bare_soil
11367 0.064491932 0.0133007840 0.0164232973 bare_soil
11368 0.170024881 0.0395726837 0.0484211666 bare_soil
# NDVI
par(mfrow = c(3, 1))
hist(val_grass$NDVI, main = "grass", xlab = "NDVI", xlim = c(0, 1), col = "#e5ca81")
hist(val_trees$NDVI, main = "trees", xlab = "NDVI", xlim = c(0, 1), col = "#81a581")
hist(val_bare_soil$NDVI, main = "bare_soil", xlab = "NDVI", xlim = c(0, 1), col = "#7b6f89")
We can also fit a linear model using the pixel values of different indexes to see how they correlate and visualize them using a scatter plot.
# Scatterplot of the NDVI and EVI indexes
plot(NDVI ~ EVI, data = val_grass, pch = ".", col = "#e5ca81", xlab = "EVI", ylab = "NDVI")
points(NDVI ~ EVI, data = val_trees, pch = ".", col = "#81a581")
points(NDVI ~ EVI, data = val_bare_soil, pch = ".", col = "#7b6f89")
# fitting a regression model to get the trend line for each vegetation class
model_grass <- lm(NDVI ~ EVI, data = val_grass)
abline(model_grass, col = "#e5ca81", lwd = 2)
model_trees <- lm(NDVI ~ EVI, data = val_trees)
abline(model_trees, col = "#81a581", lwd = 2)
model_bare_soil <- lm(NDVI ~ EVI, data = val_bare_soil)
abline(model_bare_soil, col = "#7b6f89", lwd = 2)
legend("topright", legend = c("grass", "trees", "bare_soil"),
fill = c("#e5ca81", "#81a581", "#7b6f89"), bg = "white")
Or use density plots, along with facet_wrap(), to display all vegetation classes and indexes together, making them easy to compare.
# selecting only the indexes and classes
train_features_clnd_pxls_indexes <- train_features_clnd_pxls |>
dplyr::select(c(NDVI, EVI, SAVI, Class)) |>
tidyr::pivot_longer(cols = 1:3, values_to = "values", names_to = "indexes")
train_features_clnd_pxls_indexes
# A tibble: 48,879 × 3
Class indexes values
<fct> <chr> <dbl>
1 grass NDVI 0.282
2 grass EVI 0.0598
3 grass SAVI 0.0738
4 grass NDVI 0.300
5 grass EVI 0.0744
6 grass SAVI 0.0923
7 grass NDVI 0.228
8 grass EVI 0.0462
9 grass SAVI 0.0574
10 grass NDVI 0.257
# ℹ 48,869 more rows
# plotting the density distribution of each index, per class, using geom_density_ridges()
ggplot(train_features_clnd_pxls_indexes,
aes(x = values,
fill = indexes)) +
geom_density_ridges(aes(y = 0, fill = indexes, color = indexes),
jittered_points = TRUE,
alpha = .5,
scale = 1,
point_shape = "|",
point_size = 3,
position = ggridges::position_points_jitter(width = 0.05, height = 0)) +
geom_vline(data = train_features_clnd_pxls_indexes |>
dplyr::group_by(Class, indexes) |>
dplyr::summarise(mean = mean(values), .groups = "drop") |>
dplyr::ungroup(),
aes(xintercept = mean, color = indexes), linetype = "dashed", linewidth = 0.8) +
scale_fill_manual(values = c("#e5ca81", "#7b6f89", "#81a581"), labels = c("EVI", "NDVI", "SAVI")) +
scale_color_manual(values = c("#e5ca81", "#7b6f89", "#81a581"), labels = c("EVI", "NDVI", "SAVI")) +
theme_bw() + labs(x = "", y = "Frequency") +
theme(axis.title = element_text(size = 20, face = "bold", color = "gray20"),
axis.text.x.bottom = element_text(size = 14),
axis.text.y = element_text(size = 14),
legend.direction = "horizontal",
legend.position = "bottom",
legend.title = element_text(size = 16, face = "bold"),
legend.text = element_text(size = 14),
title = element_text(size = 18)) +
facet_wrap(~Class, ncol = 3, scales = "free")
With the image bands normalized, the vegetation indexes calculated, and the vegetation classes assigned to the previously mapped features, we’re now able to move forward to building and testing the model.
The training set is used to train the Random Forest model, while the testing set is used to evaluate its performance on data it hasn’t seen before. This separation helps prevent overfitting — where the model memorizes the training data but fails to generalize to new data. In this case, 80% of the data is used to teach the model which vegetation classes correspond to specific band reflectance values and vegetation indexes. The remaining 20% is then used to test how accurately the trained model can predict vegetation classes for new “unseen” data. These proportions are just recommendations, but for small sample sizes it is suggested to use 90% for training.
We’ll use the caret function createDataPartition() to split the data into training and testing sets. Other functions with different parameters can also be used to split it, but for the purpose of this tutorial, let’s keep it simple.
After partitioning, we’ll use one portion of the dataset for training (80%) and the other for testing (20%). If we check the number of rows in the new datasets and apply a simple rule of three (proportion), we can confirm that the data was properly split.
# setting a seed ensures that the values you get on your computer are the same as the ones in this tutorial
set.seed(044)
# checking the structure of our dataset
str(train_features_clnd_pxls)
'data.frame': 16293 obs. of 9 variables:
$ Blue : num 0.00656 0.00713 0.00667 0.00651 0.00761 ...
$ Green: num 0.0263 0.0302 0.0257 0.0256 0.0298 ...
$ Red : num 0.0379 0.0451 0.0389 0.0409 0.0448 ...
$ RE : num 0.0672 0.0837 0.061 0.0711 0.0737 ...
$ NIR : num 0.0677 0.0838 0.0619 0.0691 0.0726 ...
$ NDVI : num 0.282 0.3 0.228 0.257 0.237 ...
$ EVI : num 0.0598 0.0744 0.0462 0.0558 0.0543 ...
$ SAVI : num 0.0738 0.0923 0.0574 0.0694 0.0677 ...
$ Class: Factor w/ 3 levels "bare_soil","grass",..: 2 2 2 2 2 2 2 2 2 2 ...
# partitioning the data into 80% (p = 0.8, the percentage of data that goes to training).
rf_train_features_clnd_pxls_indexes <- caret::createDataPartition(train_features_clnd_pxls$Class, list = FALSE, p = 0.8)
str(rf_train_features_clnd_pxls_indexes)
int [1:13036, 1] 1 4 5 6 7 8 10 11 12 13 ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr "Resample1"
# setting 80% for training - Use "rf_train_features_clnd_pxls_indexes" as a vector of row indices that specify which rows to extract from the train_features_clnd_pxls dataset.
rf_train_data <- train_features_clnd_pxls[rf_train_features_clnd_pxls_indexes, ]
str(rf_train_data)
'data.frame': 13036 obs. of 9 variables:
$ Blue : num 0.00656 0.00651 0.00761 0.00765 0.0069 ...
$ Green: num 0.0263 0.0256 0.0298 0.0298 0.0266 ...
$ Red : num 0.0379 0.0409 0.0448 0.0451 0.0429 ...
$ RE : num 0.0672 0.0711 0.0737 0.0702 0.0714 ...
$ NIR : num 0.0677 0.0691 0.0726 0.0718 0.072 ...
$ NDVI : num 0.282 0.257 0.237 0.228 0.253 ...
$ EVI : num 0.0598 0.0558 0.0543 0.0519 0.0569 ...
$ SAVI : num 0.0738 0.0694 0.0677 0.0649 0.0709 ...
$ Class: Factor w/ 3 levels "bare_soil","grass",..: 2 2 2 2 2 2 2 2 2 2 ...
# setting 20% for testing - Use "-rf_train_features_clnd_pxls_indexes" as a vector of row indices that specify which rows to extract from the train_features_clnd_pxls dataset.
rf_test_data <- train_features_clnd_pxls[-rf_train_features_clnd_pxls_indexes, ]
str(rf_test_data)
'data.frame': 3257 obs. of 9 variables:
$ Blue : num 0.00713 0.00667 0.00715 0.00598 0.00656 ...
$ Green: num 0.0302 0.0257 0.0283 0.0229 0.0268 ...
$ Red : num 0.0451 0.0389 0.0445 0.0357 0.0434 ...
$ RE : num 0.0837 0.061 0.0747 0.0563 0.0806 ...
$ NIR : num 0.0838 0.0619 0.0736 0.0588 0.0798 ...
$ NDVI : num 0.3 0.228 0.247 0.245 0.296 ...
$ EVI : num 0.0744 0.0462 0.0566 0.0471 0.0706 ...
$ SAVI : num 0.0923 0.0574 0.0707 0.0584 0.0878 ...
$ Class: Factor w/ 3 levels "bare_soil","grass",..: 2 2 2 2 2 2 2 2 2 2 ...
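Before comparing class proportions, a quick sanity check on the row counts confirms the split (a minimal sketch using the objects created above):

```r
# the training set should hold ~80% of the rows, the test set ~20%
nrow(rf_train_data) / nrow(train_features_clnd_pxls) # ~0.80
nrow(rf_test_data)  / nrow(train_features_clnd_pxls) # ~0.20
```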
Let’s check the proportion of data points/pixels and features selected for training and testing data, respectively.
Training:
rf_train_data_classCount <- rf_train_data |>
dplyr::group_by(Class) |>
count()
rf_train_data_classCount
# A tibble: 3 × 2
# Groups: Class [3]
Class n
<fct> <int>
1 bare_soil 1725
2 grass 4791
3 trees 6520
(prop.table(table(rf_train_data$Class))*100)
bare_soil grass trees
13.23259 36.75207 50.01534
Testing:
rf_test_data_classCount <- rf_test_data |>
dplyr::group_by(Class) |>
count()
rf_test_data_classCount
# A tibble: 3 × 2
# Groups: Class [3]
Class n
<fct> <int>
1 bare_soil 431
2 grass 1197
3 trees 1629
(prop.table(table(rf_test_data$Class))*100)
bare_soil grass trees
13.23304 36.75161 50.01535
To build our training model, I’ll break the process down into three main steps:
1) Defining parameters for training. The function trainControl() generates parameters that further control how models are created, letting you set up the rules for how your model will be trained and tested. In this part, we’ll use Cross-Validation with 10 folds. It means that the training data will be divided into 10 equal parts, with the model trained on 9 parts and validated on the remaining part. This process repeats 10 times with a different validation fold each time, ensuring every observation gets used for both training and validation.
2) Defining Response and Predictive Variables. We’ll define Class as the response variable and the reflectance bands, along with the vegetation indexes, as predictors. The reason is that we want to know how vegetation classes can be determined by their spectral signature.
3) Building and Running the Model. Now we’ll use the parameters and variables defined above to build the model. We’re going to use the function train(), from the package caret, with the Random Forest algorithm. We’ll use the Kappa (Cohen’s Kappa) metric, which checks model performance by comparing observed accuracy with expected accuracy, taking into account the possibility of agreement occurring by chance; 1,000 trees (the number of decision trees to grow in the Random Forest: more trees generally lead to a more robust model, as the ensemble can better capture complex patterns in the data, but require more processing); and set tuneLength to 8, the number of values to test for the hyperparameter mtry.
The hyperparameters are configuration settings that control the learning algorithm’s behavior. Setting tuneLength to 8 means that we will test 8 different values for the key hyperparameter mtry (the number of variables randomly considered at each decision-tree split; in this case 8, because we have 8 predictors). Rather than manually selecting which values to try, the caret package creates a grid of 8 evenly spaced values across the appropriate range for this hyperparameter. Each of these configurations undergoes the full cross-validation process, and the one producing the best performance (measured by the Kappa statistic) is selected for the final model.
It is worth dwelling on mtry to better understand how the Random Forest handles our data in the background. mtry is a key hyperparameter in Random Forest models that stands for the number of “variables randomly sampled as candidates at each split”. When a Random Forest builds each decision tree, at each node (decision point) the algorithm doesn’t consider all available predictors; instead, it randomly selects a subset of them for that particular split. The size of this random subset is controlled by mtry. By testing all values from 2 to 8, we get a thorough hyperparameter search that covers the entire plausible range and finds the optimal number of variables to consider at each split for our specific dataset.
Note that we can test all 8 values because this is a small number of predictors and won’t be computationally demanding, but you don’t have to set tuneLength to the number of predictors.
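As an aside, instead of letting tuneLength generate the grid, the mtry values can be spelled out explicitly with tuneGrid; a sketch of the equivalent setup (the grid name myGrid is hypothetical):

```r
# explicit alternative to tuneLength = 8: test mtry values from 2 to 8
myGrid <- expand.grid(mtry = 2:8)
# then pass tuneGrid = myGrid (instead of tuneLength = 8) to caret::train()
```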
# 1) Defining parameters for training
cvControl <- caret::trainControl(method = 'cv', # cross-validation
number = 10, # number of folds
savePredictions = TRUE, #save the predictions made during cross-validation (useful for later analysis)
verboseIter = FALSE) #won't print progress details during training
# 2) Defining Response and Predictive Variables
respVar <- c("Class") # response
predVar <- c("Blue","Green","Red","RE","NIR","NDVI","EVI","SAVI") # predictors
# 3) Building and Running the Model
rfModel <- caret::train(rf_train_data[,predVar], # predictors
rf_train_data[,respVar], # response
method = "rf", # Random Forest algorithm
metric = "Kappa",
ntree = 1000, # number of trees, 1000
trControl = cvControl, # rules for training and testing the training data
tuneLength = 8, # different numbers of hyperparameters
importance = TRUE) # Calculates variable importance (which features matter most)
note: only 7 unique complexity parameters in default grid. Truncating the grid to 7 .
rfModel
Random Forest
13036 samples
8 predictor
3 classes: 'bare_soil', 'grass', 'trees'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 11732, 11732, 11733, 11732, 11733, 11732, ...
Resampling results across tuning parameters:
mtry Accuracy Kappa
2 0.9260515 0.8751096
3 0.9286596 0.8796275
4 0.9284290 0.8792625
5 0.9298101 0.8816041
6 0.9299632 0.8818674
7 0.9291963 0.8806139
8 0.9292729 0.8807892
Kappa was used to select the optimal model using the largest value.
The final value used for the model was mtry = 6.
Among the tuning values tested, mtry = 6 was selected for the final model because it yielded the highest Kappa and Accuracy values across tuning parameters. Accuracy measures the proportion of correctly classified samples (higher is better), while Kappa measures the agreement between predicted and actual classifications while accounting for the possibility of agreement occurring by chance (higher is better; 0.81-1.00 indicates almost perfect agreement). Even though Kappa and Cross-Validation are distinct things, the model reports Kappa (Cross-Validation) because the Kappa values are obtained through cross-validation: the values shown are averaged across the resampling results of our folds.
In summary, when training a Random Forest model with caret::train() we specified 10-fold Cross-Validation as the resampling method. For each tuning value (mtry from 2 to 8) and each fold, the performance metric (Kappa) is computed and then averaged over the 10 folds. When we call plot() on the train object, caret visualizes these averaged Cross-Validation results, with the tuning parameter (mtry) on the X-axis and the performance metric (Kappa) on the Y-axis.
plot(rfModel)
Another way to visualize the results is by checking the error rates of each classified feature. The values from 0 to 0.25 represent the proportion of misclassified samples. The error rates usually decrease as more trees are added to the forest and stabilize somewhere below 0.10 (10%). In the graph below, each colored line represents the error rate of one class, while the black line represents the overall Out-Of-Bag (OOB) error.
As we can see, the model didn’t perform as well classifying “bare_soil” as it did for the other classes, a trend we can also see in the misclassified samples of the confusion matrix. Trees had the lowest error rate, meaning the model structure was well suited for classifying this vegetation type. Overall, the OOB error rate was low, supporting the robustness of our model.
# first glimpse on the models classification performance
rfModel$finalModel
Call:
randomForest(x = x, y = y, ntree = 1000, mtry = param$mtry, importance = TRUE)
Type of random forest: classification
Number of trees: 1000
No. of variables tried at each split: 6
OOB estimate of error rate: 6.9%
Confusion matrix:
bare_soil grass trees class.error
bare_soil 1295 302 128 0.24927536
grass 194 4482 115 0.06449593
trees 58 102 6360 0.02453988
# checking the error values of each class
head(rfModel$finalModel$err.rate)
OOB bare_soil grass trees
[1,] 0.10169492 0.2962382 0.09355742 0.05631470
[2,] 0.10528321 0.2790918 0.10641248 0.05773143
[3,] 0.10289159 0.2750760 0.10669280 0.05387755
[4,] 0.10393744 0.2796209 0.11256804 0.05020080
[5,] 0.09870550 0.2716208 0.10304015 0.04955722
[6,] 0.09860541 0.2722256 0.10355625 0.04845598
# visualizing the error values of each class
plot(rfModel$finalModel)
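Note that plot() does not draw a legend for this graph; assuming the default matplot() colours and line types used by plot.randomForest, one can be added like this (a sketch):

```r
# add a legend matching the default colours/linetypes of plot.randomForest
legend("topright",
       legend = colnames(rfModel$finalModel$err.rate), # "OOB" plus the classes
       col = 1:4, lty = 1:4, lwd = 2)
```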
We should also check the importance of each reflectance band and vegetation index for identifying the vegetation cover. For this purpose, we’ll visualize the variables ranked by MeanDecreaseAccuracy and MeanDecreaseGini.
MeanDecreaseAccuracy measures how much the model’s predictive accuracy drops when the values of a given variable are permuted. The model first establishes a baseline accuracy using the Out-Of-Bag (OOB) samples. Then, the values of a specific variable are randomly shuffled in these samples, and the model is re-evaluated. The difference between the baseline accuracy and this new accuracy is the decrease in accuracy attributable to that variable. A larger decrease when a variable is shuffled indicates greater importance. Variables are presented in descending order of importance, with those at the top causing the greatest drop in accuracy when shuffled.
The MeanDecreaseGini measures how each variable contributes to the node purity in the random forest. A higher value indicates that a particular predictor variable plays a greater role in partitioning the data into the defined classes. Higher values indicate greater importance of the variable to your model. This metric measures the total decrease in Gini impurity for a variable across all trees in the forest. Variables that lead to larger decreases in Gini impurity are considered more important, as they’re more effective at creating homogeneous (pure) nodes.
In summary, Accuracy focuses on prediction performance, while Gini focuses on the creation of pure nodes during tree construction.
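If you prefer the raw numbers behind varImpPlot(), the importance scores can be inspected directly; a minimal sketch:

```r
# numeric importance table: per-class accuracy decrease,
# MeanDecreaseAccuracy, and MeanDecreaseGini for each predictor
head(randomForest::importance(rfModel$finalModel))
```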
varImpPlot(rfModel$finalModel)
We’ve split the data into 80% for training and 20% for testing, meaning our model hasn’t seen 20% of the data yet. Let’s use the model we built to see how it performs at predicting that remaining 20%.
In this step, we’ll evaluate the model performance by checking the Accuracy, Kappa, Confusion Matrix, Sensitivity, and Specificity. The first two we’ve seen before and focus more on the overall performance of the model, while the last three evaluate it by each of the classified features.
The Confusion Matrix summarizes which features were misclassified as other features: for example, 25 bare_soil pixels and 30 grass pixels were classified as trees. Sensitivity measures how well the model correctly identifies actual positives; in other words: “Of all actual “trees” pixels, how many did the model correctly classify as “trees”?” Here, the model correctly identified 97.85% of actual “trees” pixels. Complementarily, Specificity measures how well the model correctly identifies actual negatives: “Of all non-“grass” pixels, how many did the model correctly classify as not “grass”?” Here, 95.87% of non-“grass” pixels were correctly classified as not “grass”.
Check out these two videos if you want to know more about these metrics: - Confusion Matrix Explained - 7 Must-Know Metrics That Power Machine Learning
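The prediction call itself isn’t shown above; judging by the output below, it was presumably something like:

```r
# assumed step: predicting vegetation classes for the held-out 20% test pixels
rfPredict <- predict(rfModel, rf_test_data)
head(rfPredict)
```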
[1] grass grass grass grass grass grass
Levels: bare_soil grass trees
conf_matrix <- confusionMatrix(rfPredict, rf_test_data$Class)
conf_matrix
Confusion Matrix and Statistics
Reference
Prediction bare_soil grass trees
bare_soil 341 48 15
grass 65 1119 20
trees 25 30 1594
Overall Statistics
Accuracy : 0.9377
95% CI : (0.9288, 0.9457)
No Information Rate : 0.5002
P-Value [Acc > NIR] : < 2e-16
Kappa : 0.8952
Mcnemar's Test P-Value : 0.07009
Statistics by Class:
Class: bare_soil Class: grass Class: trees
Sensitivity 0.7912 0.9348 0.9785
Specificity 0.9777 0.9587 0.9662
Pos Pred Value 0.8441 0.9294 0.9666
Neg Pred Value 0.9685 0.9620 0.9782
Prevalence 0.1323 0.3675 0.5002
Detection Rate 0.1047 0.3436 0.4894
Detection Prevalence 0.1240 0.3697 0.5063
Balanced Accuracy 0.8844 0.9468 0.9724
conf_table <- as.data.frame(conf_matrix$table)
colnames(conf_table) <- c("Predicted", "Actual", "Frequency")
conf_table
Predicted Actual Frequency
1 bare_soil bare_soil 341
2 grass bare_soil 65
3 trees bare_soil 25
4 bare_soil grass 48
5 grass grass 1119
6 trees grass 30
7 bare_soil trees 15
8 grass trees 20
9 trees trees 1594
ggplot(data = conf_table, aes(x = Actual, y = Predicted, fill = Frequency)) +
geom_tile(color = "white") +
scale_fill_gradient(low = "white", high = "darkgreen") +
geom_text(aes(label = Frequency), color = "black", size = 5) +
labs(title = "Confusion Matrix Heatmap",
x = "Actual Labels",
y = "Predicted Labels",
fill = "Frequency") +
theme_minimal(base_size = 14) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Just recapping: so far we’ve used the pixel values extracted within the boundaries of our mapped features to:
1) Train the model using 80% of these pixel values.
2) Test the model using the remaining 20% of these pixel values.
Now we’re going to use the trained model to predict the vegetation cover type for the full image (the one containing all reflectance bands and indexes). To do so, we’ll use the function predict().
image_lc_predict <- terra::predict(train_features_clnd_indexes, rfModel)
image_lc_predict
class : SpatRaster
dimensions : 559, 558, 1 (nrow, ncol, nlyr)
resolution : 1.99956e-06, 1.95189e-06 (x, y)
extent : -47.63879, -47.63768, -14.06791, -14.06682 (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326)
source(s) : memory
categories : class
name : class
min value : bare_soil
max value : trees
# checking how many pixels were assigned to each feature
pixel_counts <- terra::freq(image_lc_predict)
pixel_counts
layer value count
1 1 bare_soil 74005
2 1 grass 155720
3 1 trees 82197
Let’s plot the created image and determine feature colors for better visualization.
# Let's define the palette (using hexadecimal colours), following the factor level order: bare_soil, grass, trees
pal <- c("#7b6f89", "#e5ca81", "#81a581")
tmap::tm_shape(image_lc_predict)+
tmap::tm_raster(style = "cat",
labels = c("bare_soil",
"grass",
"trees"),
palette = pal,
title = "Legend") +
tmap::tm_layout(main.title= "",
main.title.size =.9 ) +
tmap::tm_layout(legend.outside = TRUE,
legend.outside.position = "right",
legend.title.size = .8)
We can also visualize it interactively, through the leaflet() function, and overlay it on a basemap. It looks like our model performed fairly well at predicting the vegetation types and cover.
raster_lc_predict <- raster::raster(image_lc_predict)
pal <- leaflet::colorFactor(
palette = c("#7b6f89", "#e5ca81", "#81a581"), # bare_soil, grass, trees
levels = c(1, 2, 3)
)
leaflet::leaflet() |>
leaflet::addTiles(group = "Base Map") |>
leafem::addRasterRGB(raster::brick(cropped_raster_bands[[1:3]]),
r = 3, g = 2, b = 1,
group = "RGB Image") |>
leaflet::addRasterImage(raster_lc_predict,
colors = pal,
opacity = 0.4,
group = "Land Cover Classification") |>
leaflet::addLegend(pal = pal,
values = c(1, 2, 3),
title = "Land Cover Classification",
labels = c("bare_soil", "grass", "trees"),
position = "bottomright") |>
leaflet::addLayersControl(
baseGroups = c("Base Map", "RGB Image"),
overlayGroups = c("Land Cover Classification"),
options = leaflet::layersControlOptions(collapsed = FALSE))
Let’s also visualize the pixel frequency of each class using an interactive barplot.
raster_lc_predict_df <- as.data.frame(raster_lc_predict) |>
dplyr::mutate(category = dplyr::case_when(
class == 1 ~ "bare_soil",
class == 2 ~ "grass",
class == 3 ~ "trees",
TRUE ~ NA_character_)) |>
dplyr::group_by(category) |>
dplyr::summarise(count = dplyr::n()) |>
dplyr::ungroup() |>
dplyr::mutate(prop = round(((count*100)/(sum(count))), 2)) |>
dplyr::arrange(dplyr::desc(prop))
head(raster_lc_predict_df)
# A tibble: 3 × 3
category count prop
<chr> <int> <dbl>
1 grass 155720 49.9
2 trees 82197 26.4
3 bare_soil 74005 23.7
plotly::plot_ly(raster_lc_predict_df,
x = ~category,
y = ~prop,
type = "bar",
text = ~prop,
textposition = 'auto') |>
plotly::layout(title = "Vegetation Cover",
xaxis = list(title = "Vegetation Type",
tickvals = c("bare_soil", "grass", "trees"),
ticktext = c("Bare Soil", "Grass", "Trees")),
yaxis = list(title = "Cover Proportion (%)"))
Thank you for sticking around for this whole tutorial. Remember, we used this tutorial for satellite image classification, but the same concepts can also be applied to predictive models and other types of classification.
If you see mistakes or want to suggest changes, please create an issue on the source repository.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/souzayuri/souzayuri.github.io, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Souza (2025, April 29). Yuri Souza, Ecologist: Supervised Image Classification with Random Forest in R. Retrieved from https://souzayuri.github.io/posts/2025-01-05-random_forest_sup_classific_r/
BibTeX citation
@misc{souza2025supervised,
  author = {Souza, Yuri},
  title = {Yuri Souza, Ecologist: Supervised Image Classification with Random Forest in R},
  url = {https://souzayuri.github.io/posts/2025-01-05-random_forest_sup_classific_r/},
  year = {2025}
}