Supervised Image Classification with Random Forest in R

In this post, I demonstrate how to load and visualize georeferenced spatial data and perform a supervised random forest classification using R.

Yuri Souza
2025-03-21

Quick intro and background

Machine learning has been one of my favorite topics over the last few years, as I have been using approaches such as predictive models and cluster analysis. However, these are just a subset of what machine learning can do. With that in mind, I decided to take a step forward and learn what I believe to be the next topic in my field of expertise: Supervised Image Classification. This is a highly effective method for automatically identifying features in an image, such as forest patches, crops, urban areas, water bodies, and so on. In a few words, it works by telling an algorithm what each of these features looks like; the algorithm then searches the image for patterns of pixel aggregation that resemble those of the input features.

I’ve been trying to perform supervised image classification of satellite images in R for a while, and I’ve always stumbled over conceptual problems, such as: How do I do it? Which parameters should I use for validation? Which packages and functions?

Even though this technique is very popular among remote sensing practitioners and widely covered on the internet, it is definitely easier to find tutorials for Python or for software such as ArcGIS and QGIS than for R. I’m not saying there aren’t R tutorials out there; in fact, the content available here comes from those tutorials, but I found the information scattered across a few blogs and a bit scarce. So, after struggling to find a tutorial that I liked, I decided to create my own, based on what I could find online, while also adding my own touch.

This is my attempt to learn how to perform supervised image classification using the random forest algorithm. I’m pretty sure there are many other ways to carry it out, and definitely more efficient ones, but this is what I have so far, and I believe it will suit a large audience.

For the record, this tutorial is a compilation of what is available in the following blogs and tutorials:

That said, let’s get started!!!

In a few words, my goal here is to use drone optical imagery I collected during my fieldwork in Brazil and build a model able to identify Bare Soil, Grass, and Trees in it. The image was taken in Chapada dos Veadeiros National Park, Brazil.

I broke down the workflow of this tutorial into six main steps, along with some intermediate ones and a #BONUS. This step-by-step approach really helped me grasp how the classification process works.

1. Installing and Loading the Packages

2. Getting the Data Ready

2.1. Loading the Datasets

2.2. Checking up the Coordinate Reference System

2.3. Checking up the Data Information and Structure

2.4. Cropping the Raster Image

2.5. Renaming and Selecting the Bands

2.6. Visualizing the Raster and Polygons

3. Training the Model

3.1. Creating Vegetation Indexes

4. Evaluating the Model

5. Classifying the Image

6. Exploring the Results

7. BONUS

If you want to follow along and try each step described here yourself, please download the data HERE.

Hands to Work

To get started, the very first step is to load the necessary packages (assuming you already have R installed on your machine). We’ll utilize raster and terra for handling raster data, sf for managing spatial data like points and polygons, and tidyverse for data tabulation and visualization. Additionally, we’ll rely on caret for implementing random forest methods, and use mapview along with leaflet to plot and visualize both the polygons and raster images. While these are the primary packages we’ll need, I’ll also incorporate some others to further explore the results of our image classification.

1. Installing and Loading the Packages

Begin by installing and loading the necessary packages, along with the additional ones you will need. I prefer a syntax that first checks whether a package is already installed; if it isn’t, it installs the package and then loads it afterward.
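Here is a minimal sketch of that pattern (the package list simply mirrors the ones mentioned above):

# for each package: install it if missing, then load it
pkgs <- c("raster", "terra", "sf", "tidyverse", "caret", "mapview", "leaflet")
for (p in pkgs) {
  if (!requireNamespace(p, quietly = TRUE)) install.packages(p)
  library(p, character.only = TRUE)
}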

2. Getting the Data Ready

2.1. Loading the Datasets

Let’s load all the files and information we will need later on:

In this step we load the files and make sure to rename their classes using a standard nomenclature.

raw_raster <- terra::rast("woody_savanna.tif") |> 
  terra::aggregate(fact=2) # original pixel size = 10.8cm, new pixel size 21.6cm (old * 2)
raw_raster
class       : SpatRaster 
dimensions  : 759, 782, 6  (nrow, ncol, nlyr)
resolution  : 1.99956e-06, 1.95189e-06  (x, y)
extent      : -47.63899, -47.63742, -14.0681, -14.06662  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326) 
source(s)   : memory
names       : woody~nna_1, woody~nna_2, woody~nna_3, woody~nna_4, woody~nna_5, woody~nna_6 
min values  :        2.89,       16.26,       16.97,       24.26,      117.52,           0 
max values  :    65535.00,    65535.00,    65535.00,    65535.00,    65535.00,       65535 
plot(raw_raster)
sampled_area <- terra::vect("sampled_area.shp")
sampled_area
 class       : SpatVector 
 geometry    : polygons 
 dimensions  : 1, 1  (geometries, attributes)
 extent      : -47.63879, -47.63768, -14.06791, -14.06682  (xmin, xmax, ymin, ymax)
 source      : sampled_area.shp
 coord. ref. : lon/lat WGS 84 (EPSG:4326) 
 names       :    Id
 type        : <int>
 values      :     0
plot(sampled_area)
train_features <- sf::st_read("train_features.shp")
Reading layer `train_features' from data source 
  `F:\souzayuri.github.io\_posts\2025-01-05-random_forest_sup_classific_r\train_features.shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 46 features and 9 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: -47.63869 ymin: -14.06791 xmax: -47.63768 ymax: -14.06683
Geodetic CRS:  WGS 84
train_features
Simple feature collection with 46 features and 9 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: -47.63869 ymin: -14.06791 xmax: -47.63768 ymax: -14.06683
Geodetic CRS:  WGS 84
First 10 features:
   Classcode Classname Classvalue RED GREEN BLUE Count   Shape_Leng
1       <NA>     grass          1 255   255  190 16651 0.0002222801
2       <NA>     grass          1 255   255  190 13637 0.0002344464
3       <NA>     grass          1 255   255  190  9875 0.0001834221
4       <NA>     grass          1 255   255  190 11577 0.0002181070
5       <NA>     grass          1 255   255  190 10001 0.0002419869
6       <NA>     grass          1 255   255  190  7684 0.0001557914
7       <NA>     grass          1 255   255  190 16943 0.0002274619
8       <NA>     grass          1 255   255  190  6667 0.0001647991
9       <NA>     grass          1 255   255  190  2964 0.0001024218
10      <NA>     grass          1 255   255  190  4238 0.0001182262
     Shape_Area                       geometry
1  3.318277e-09 POLYGON ((-47.63788 -14.066...
2  2.717603e-09 POLYGON ((-47.63782 -14.067...
3  1.967932e-09 POLYGON ((-47.63772 -14.066...
4  2.307135e-09 POLYGON ((-47.63781 -14.066...
5  1.993210e-09 POLYGON ((-47.63793 -14.067...
6  1.531386e-09 POLYGON ((-47.63788 -14.067...
7  3.376445e-09 POLYGON ((-47.63793 -14.066...
8  1.328762e-09 POLYGON ((-47.63768 -14.067...
9  5.907468e-10 POLYGON ((-47.63769 -14.066...
10 8.447055e-10 POLYGON ((-47.63785 -14.066...
# renaming the classes to keep them consistent across the upcoming analyses.
train_features_clnd <- sf::st_read("train_features.shp") |> 
  dplyr::rename(OBJECTID = "Classcode",
                Class = "Classname") |> 
  dplyr::mutate(OBJECTID = as.character(1:46),
                Class = as.factor(Class),
                Code = as.numeric(Class)) |> 
  dplyr::select(c(1:3))
Reading layer `train_features' from data source 
  `F:\souzayuri.github.io\_posts\2025-01-05-random_forest_sup_classific_r\train_features.shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 46 features and 9 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: -47.63869 ymin: -14.06791 xmax: -47.63768 ymax: -14.06683
Geodetic CRS:  WGS 84
train_features_clnd
Simple feature collection with 46 features and 3 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: -47.63869 ymin: -14.06791 xmax: -47.63768 ymax: -14.06683
Geodetic CRS:  WGS 84
First 10 features:
   OBJECTID Class Classvalue                       geometry
1         1 grass          1 POLYGON ((-47.63788 -14.066...
2         2 grass          1 POLYGON ((-47.63782 -14.067...
3         3 grass          1 POLYGON ((-47.63772 -14.066...
4         4 grass          1 POLYGON ((-47.63781 -14.066...
5         5 grass          1 POLYGON ((-47.63793 -14.067...
6         6 grass          1 POLYGON ((-47.63788 -14.067...
7         7 grass          1 POLYGON ((-47.63793 -14.066...
8         8 grass          1 POLYGON ((-47.63768 -14.067...
9         9 grass          1 POLYGON ((-47.63769 -14.066...
10       10 grass          1 POLYGON ((-47.63785 -14.066...
plot(train_features_clnd)

2.2. Checking up the Coordinate Reference System

We also check their CRS (coordinate reference system) and ensure they all have the same coordinate properties.

# Compare CRS, if they are different make them equal
if (terra::crs(raw_raster) == terra::crs(train_features_clnd)) {
  print("CRS are identical!")
} else {
  print("CRS are different. Therefore, reprojecting to match raster projection...")
  # Reproject train_features to match raw_raster
  train_features_clnd <- sf::st_transform(train_features_clnd, crs = terra::crs(raw_raster))
}
[1] "CRS are different. Therefore, reprojecting to match raster projection..."
# Compare CRS again to ensure the above reprojection worked.
if (terra::crs(raw_raster) == terra::crs(train_features_clnd)) {
  print("CRS are identical!")
} else {
  print("CRS are different. Therefore, reprojecting to match raster projection...")
  # Reproject train_features to match raw_raster
  train_features_clnd <- sf::st_transform(train_features_clnd, crs = terra::crs(raw_raster))
}
[1] "CRS are identical!"

2.3. Checking up the Data Information and Structure

Let’s check the training data and the number of features, which total 46. Here we can see that each training class has 15 features/polygons (grass has 16 because I added one more by mistake).

glimpse(train_features_clnd)
Rows: 46
Columns: 4
$ OBJECTID   <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10"…
$ Class      <fct> grass, grass, grass, grass, grass, grass, grass, …
$ Classvalue <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2…
$ geometry   <POLYGON [°]> POLYGON ((-47.63788 -14.066..., POLYGON (…
summary(train_features_clnd)
   OBJECTID               Class      Classvalue             geometry 
 Length:46          bare_soil:15   Min.   :1.000   POLYGON      :46  
 Class :character   grass    :16   1st Qu.:1.000   epsg:4326    : 0  
 Mode  :character   trees    :15   Median :2.000   +proj=long...: 0  
                                   Mean   :1.978                     
                                   3rd Qu.:3.000                     
                                   Max.   :3.000                     

2.4. Cropping the Raster Image

Now that we have standardized the data, we will crop the raster image to the 100 m x 100 m grid limit. We also want to double-check the CRS, and then plot each of the bands side by side.

cropped_raster <- terra::crop(raw_raster, sampled_area)
cropped_raster
class       : SpatRaster 
dimensions  : 559, 558, 6  (nrow, ncol, nlyr)
resolution  : 1.99956e-06, 1.95189e-06  (x, y)
extent      : -47.63879, -47.63768, -14.06791, -14.06682  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326) 
source(s)   : memory
names       : woody~nna_1, woody~nna_2, woody~nna_3, woody~nna_4, woody~nna_5, woody~nna_6 
min values  :        2.89,       16.26,       16.97,       24.26,      165.85,       65535 
max values  :     3139.13,     7843.21,     9368.55,    17875.35,    20952.24,       65535 
plot(cropped_raster)
# Compare CRS
if (terra::crs(raw_raster) == terra::crs(train_features_clnd)) {
  print("CRS are identical!")
} else {
  print("CRS are different. Therefore, reprojecting to match raster projection...")
  # Reproject train_features to match raw_raster (train_features_clnd is an sf object, so use st_transform)
  train_features_clnd <- sf::st_transform(train_features_clnd, crs = terra::crs(raw_raster))
}
[1] "CRS are identical!"

2.5. Renaming and Selecting the Bands

So far, we have ensured the consistency of the image coordinates and defined the grid area we want to work on. Now, let’s rename the bands according to their respective spectral signatures. This will help us later identify which band is which. In this step, I also removed the MASK band by selecting only the bands of interest according to their position in the band list (1 to 5).

names(cropped_raster) <- c("Blue","Green","Red","RE","NIR","MASK")
plot(cropped_raster)
cropped_raster_bands <- cropped_raster[[1:5]]
plot(cropped_raster_bands)

2.6. Visualizing the Raster and Polygons

Now that we have the raster with five bands and the polygons ready for training the model, let’s visualize our data. We’ll create a natural color composition. To do this, we need to inform the position of the RGB bands in the raster image (that is why naming them beforehand was important) and overlay it with each class represented by the polygons. For visualization, we will use mapview() because it’s interactive (just because we want something cool!!!).

train_class_colors <- unique(train_features_clnd$Class) |> 
  as.data.frame() |>
  # the hex codes are assigned by row order (grass, trees, bare_soil);
  # names inside c() would be ignored here, so we use a plain vector
  dplyr::mutate(hex = c("#ff7f00", "#e41a1c", "#55eed1"))
train_class_colors
  unique(train_features_clnd$Class)     hex
1                             grass #ff7f00
2                             trees #e41a1c
3                         bare_soil #55eed1
mapview::viewRGB(brick(cropped_raster_bands[[1:3]]), 
                 r = 3, g = 2, b = 1,
                 map.types = c("Esri.WorldImagery", "OpenStreetMap", "OpenTopoMap"),
                 maxpixels =  866760,
                 layer.name = "RBG Image") + 
  mapview::mapView(train_features_clnd, 
                   zcol = "Class", 
                   color = "black",  
                   lwd = 2,
                   col.regions = train_class_colors$hex,
                   layer.name = "Features")

3. Training the Model

One cool aspect of using optical imagery rasters with various bands is the ability to rearrange these bands to create new, informative images. To provide some context, the drone imagery we’re working with includes natural color RGB bands, which are part of the visible spectrum for humans, along with the Red Edge and Near-Infrared bands, which lie outside this spectrum. The latter two bands are particularly useful for highlighting vegetation characteristics, as plants reflect a significant amount of infrared light. This unique property makes infrared imagery an invaluable tool for analyzing vegetation, as it offers deeper insights into the state, health, vigor, and other attributes of plants that natural color imagery simply cannot reveal.

Now that we have our raster image and training features prepared, let’s move on to the next step: using this raster image and bands to create various vegetation indexes. They will be stacked as bands and will be used to train the model, using the mapped polygons, and to predict the vegetation classes in the later steps of the classification. This process will improve the classification of the image by pinpointing pixel combinations that are more accurately representative of the specific vegetation characteristics we aim to map.

3.1. Creating Vegetation Indexes

When conducting image classification, it’s crucial to provide clear information that highlights the distinguishing features you’re mapping. In this section, we will focus on creating three widely used vegetation indexes in remote sensing: the NDVI (Normalized Difference Vegetation Index), the EVI (Enhanced Vegetation Index), and the SAVI (Soil Adjusted Vegetation Index). Each of these indexes interprets vegetation in unique ways, and we will utilize them alongside the spectral bands from the drone imagery to classify the image effectively.

This is a straightforward process that primarily involves utilizing the Red, Near Infrared, and Blue bands, along with a few predefined parameters. Each index has its own arithmetic equation, and we can identify these bands based on their positions within the raster. We will then plot these indexes side-by-side.

Check out the table below, which summarizes the similarities and differences among these indexes.

| Feature | NDVI (Normalized Difference Vegetation Index) | EVI (Enhanced Vegetation Index) | SAVI (Soil Adjusted Vegetation Index) |
|---|---|---|---|
| Formula | (NIR - Red) / (NIR + Red) | 2.5 × (NIR - Red) / (NIR + C1 × Red - 7.5 × Blue + 1), with C1 = 6 | ((NIR - Red) / (NIR + Red + 0.5)) × (1 + 0.5) |
| Bands Used | NIR, Red | NIR, Red, Blue | NIR, Red |
| Purpose | General greenness/vegetation presence | Improves sensitivity in high biomass areas | Adjusts for soil background noise |
| Advantages | Simple, widely used | Corrects for atmospheric & canopy background effects | Better for areas with sparse vegetation or exposed soil |
| Limitations | Sensitive to soil brightness and atmospheric conditions | More complex, needs Blue band | Requires adjustment factor (L) based on soil type |
| Best Use Cases | Moderate to dense vegetation | Dense forests, tropical areas | Arid and semi-arid areas, sparse vegetation |
cropped_raster_bands_ndvi <- (cropped_raster_bands[[5]] - cropped_raster_bands[[3]]) / 
  (cropped_raster_bands[[5]] + cropped_raster_bands[[3]])
cropped_raster_bands_ndvi
class       : SpatRaster 
dimensions  : 559, 558, 1  (nrow, ncol, nlyr)
resolution  : 1.99956e-06, 1.95189e-06  (x, y)
extent      : -47.63879, -47.63768, -14.06791, -14.06682  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326) 
source(s)   : memory
name        :         NIR 
min value   : -0.08708532 
max value   :  0.95849029 
cropped_raster_bands_evi <- 2.5 * ((cropped_raster_bands[[5]] - cropped_raster_bands[[3]]) /
    (cropped_raster_bands[[5]] + (6 * cropped_raster_bands[[3]]) - (7.5 * cropped_raster_bands[[1]]) + 1))
cropped_raster_bands_evi
class       : SpatRaster 
dimensions  : 559, 558, 1  (nrow, ncol, nlyr)
resolution  : 1.99956e-06, 1.95189e-06  (x, y)
extent      : -47.63879, -47.63768, -14.06791, -14.06682  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326) 
source(s)   : memory
name        :       NIR 
min value   : -8.517467 
max value   : 83.766299 
cropped_raster_bands_savi <- ((cropped_raster_bands[[5]] - cropped_raster_bands[[3]]) /
    (cropped_raster_bands[[5]] + cropped_raster_bands[[3]] + 0.5)) * (1 + 0.5)
cropped_raster_bands_savi
class       : SpatRaster 
dimensions  : 559, 558, 1  (nrow, ncol, nlyr)
resolution  : 1.99956e-06, 1.95189e-06  (x, y)
extent      : -47.63879, -47.63768, -14.06791, -14.06682  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326) 
source(s)   : memory
name        :        NIR 
min value   : -0.1306091 
max value   :  1.4368568 
# Create a red-to-green color palette
red_to_green_palette <- colorRampPalette(c("darkred", "yellow", "darkgreen"))
# Let's visualize them together
par(mfrow = c(1, 3))

plot(cropped_raster_bands_ndvi, main = "NDVI", col = red_to_green_palette(100))
plot(cropped_raster_bands_evi, main = "EVI", col = red_to_green_palette(100))
plot(cropped_raster_bands_savi, main = "SAVI", col = red_to_green_palette(100))
par(mfrow = c(1, 1))

Let’s stack these three indexes onto our raster data so we have more information to feed the model. After stacking the index rasters with our main raster, we notice that the new layers are all named NIR. That is because the first spectral band used in each of the calculations above was the NIR band. To avoid confusion between the NIR spectral band and the new index layers, we will rename them according to the index they represent.

train_features_clnd_indexes <- c(cropped_raster_bands, 
                                 cropped_raster_bands_ndvi,
                                 cropped_raster_bands_evi, 
                                 cropped_raster_bands_savi)
train_features_clnd_indexes
class       : SpatRaster 
dimensions  : 559, 558, 8  (nrow, ncol, nlyr)
resolution  : 1.99956e-06, 1.95189e-06  (x, y)
extent      : -47.63879, -47.63768, -14.06791, -14.06682  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326) 
source(s)   : memory
names       :    Blue,   Green,     Red,       RE,      NIR,         NIR, ... 
min values  :    2.89,   16.26,   16.97,    24.26,   165.85, -0.08708532, ... 
max values  : 3139.13, 7843.21, 9368.55, 17875.35, 20952.24,  0.95849029, ... 
names(train_features_clnd_indexes) <- c(names(cropped_raster_bands),"NDVI", "EVI", "SAVI")
train_features_clnd_indexes
class       : SpatRaster 
dimensions  : 559, 558, 8  (nrow, ncol, nlyr)
resolution  : 1.99956e-06, 1.95189e-06  (x, y)
extent      : -47.63879, -47.63768, -14.06791, -14.06682  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326) 
source(s)   : memory
names       :    Blue,   Green,     Red,       RE,      NIR,        NDVI, ... 
min values  :    2.89,   16.26,   16.97,    24.26,   165.85, -0.08708532, ... 
max values  : 3139.13, 7843.21, 9368.55, 17875.35, 20952.24,  0.95849029, ... 
plot(train_features_clnd_indexes)

Now we have everything we need to prepare the data that will be used for training and testing our RF model. Let’s start by extracting the pixel values for the features we mapped; these pixel values will be used to train and test the model later.

train_features_clnd_pxls <- terra::extract(train_features_clnd_indexes, train_features_clnd)
head(train_features_clnd_pxls)
  ID   Blue   Green     Red      RE     NIR      NDVI       EVI
1  1 429.80 1723.23 2483.62 4405.63 4435.60 0.2821098 0.3028237
2  1 467.11 1978.03 2957.46 5488.30 5493.73 0.3001080 0.3212719
3  1 437.12 1682.75 2549.77 4000.69 4057.74 0.2282206 0.2344632
4  1 426.47 1678.09 2679.71 4656.98 4529.61 0.2565984 0.2656323
5  1 499.02 1952.94 2933.11 4827.65 4760.13 0.2374838 0.2453411
6  1 501.31 1949.97 2953.60 4603.49 4702.68 0.2284504 0.2342670
       SAVI
1 0.4231342
2 0.4501354
3 0.3423050
4 0.3848709
5 0.3562026
6 0.3426532
train_features_clnd_pxls$Class <- train_features_clnd$Class[train_features_clnd_pxls$ID]
head(train_features_clnd_pxls)
  ID   Blue   Green     Red      RE     NIR      NDVI       EVI
1  1 429.80 1723.23 2483.62 4405.63 4435.60 0.2821098 0.3028237
2  1 467.11 1978.03 2957.46 5488.30 5493.73 0.3001080 0.3212719
3  1 437.12 1682.75 2549.77 4000.69 4057.74 0.2282206 0.2344632
4  1 426.47 1678.09 2679.71 4656.98 4529.61 0.2565984 0.2656323
5  1 499.02 1952.94 2933.11 4827.65 4760.13 0.2374838 0.2453411
6  1 501.31 1949.97 2953.60 4603.49 4702.68 0.2284504 0.2342670
       SAVI Class
1 0.4231342 grass
2 0.4501354 grass
3 0.3423050 grass
4 0.3848709 grass
5 0.3562026 grass
6 0.3426532 grass
train_features_clnd_pxls$ID <- NULL

summary(train_features_clnd_pxls)
      Blue             Green             Red             RE         
 Min.   :  65.94   Min.   : 145.2   Min.   : 154   Min.   :  356.7  
 1st Qu.: 425.17   1st Qu.:1480.3   1st Qu.:1545   1st Qu.: 3804.7  
 Median : 560.81   Median :1900.7   Median :2269   Median : 4763.0  
 Mean   : 607.79   Mean   :1917.9   Mean   :2298   Mean   : 5006.3  
 3rd Qu.: 770.89   3rd Qu.:2351.1   3rd Qu.:2957   3rd Qu.: 5921.0  
 Max.   :1932.87   Max.   :6168.2   Max.   :8605   Max.   :13865.4  
      NIR               NDVI               EVI         
 Min.   :  217.8   Min.   :-0.07855   Min.   :-0.0729  
 1st Qu.: 4048.6   1st Qu.: 0.21691   1st Qu.: 0.2374  
 Median : 5022.9   Median : 0.34773   Median : 0.4421  
 Mean   : 5541.9   Mean   : 0.40261   Mean   : 0.6292  
 3rd Qu.: 6573.5   3rd Qu.: 0.61267   3rd Qu.: 1.0327  
 Max.   :15908.8   Max.   : 0.88841   Max.   : 2.1119  
      SAVI               Class     
 Min.   :-0.1178   bare_soil:2156  
 1st Qu.: 0.3254   grass    :5988  
 Median : 0.5216   trees    :8149  
 Mean   : 0.6039                   
 3rd Qu.: 0.9189                   
 Max.   : 1.3325                   

Now that we have extracted the pixel values within those polygon boundaries, let's take a look at the distribution of some of these pixels.

val_grass <- subset(train_features_clnd_pxls, Class == "grass")
head(val_grass)
    Blue   Green     Red      RE     NIR      NDVI       EVI
1 429.80 1723.23 2483.62 4405.63 4435.60 0.2821098 0.3028237
2 467.11 1978.03 2957.46 5488.30 5493.73 0.3001080 0.3212719
3 437.12 1682.75 2549.77 4000.69 4057.74 0.2282206 0.2344632
4 426.47 1678.09 2679.71 4656.98 4529.61 0.2565984 0.2656323
5 499.02 1952.94 2933.11 4827.65 4760.13 0.2374838 0.2453411
6 501.31 1949.97 2953.60 4603.49 4702.68 0.2284504 0.2342670
       SAVI Class
1 0.4231342 grass
2 0.4501354 grass
3 0.3423050 grass
4 0.3848709 grass
5 0.3562026 grass
6 0.3426532 grass
val_trees <- subset(train_features_clnd_pxls, Class == "trees")
head(val_trees)
        Blue   Green     Red      RE      NIR      NDVI       EVI
5123  835.48 2126.36 2795.86 4694.74  5190.48 0.2998395 0.3812958
5124  682.78 1837.81 2409.82 3817.80  4411.68 0.2934633 0.3639547
5125  515.71 1274.47 1823.18 2926.88  3318.64 0.2908426 0.3598006
5126 1002.83 2880.55 3329.61 6823.73  7349.93 0.3764507 0.5074274
5127 1084.56 3526.16 3443.14 9376.89 11173.64 0.5288785 0.8154784
5128  772.20 2226.59 2301.30 5564.50  6780.29 0.4931945 0.7567093
          SAVI Class
5123 0.4497311 trees
5124 0.4401627 trees
5125 0.4362214 trees
5126 0.5646496 trees
5127 0.7932905 trees
5128 0.7397510 trees
val_bare_soil <- subset(train_features_clnd_pxls, Class == "bare_soil")
head(val_bare_soil)
        Blue   Green     Red      RE     NIR        NDVI         EVI
11363 719.88 1919.11 2177.92 4151.08 4087.10 0.304736465 0.405983246
11364 850.89 2283.77 2924.93 3715.67 3493.87 0.088636497 0.097004139
11365 872.85 2373.72 3104.62 4060.01 4232.78 0.153754740 0.172870271
11366 871.65 2153.15 2925.41 3537.46 2939.61 0.002421136 0.002543767
11367 891.18 2439.16 3134.20 3972.84 3566.33 0.064491932 0.068860180
11368 888.49 2435.71 3186.76 4610.06 4492.41 0.170024881 0.192570391
             SAVI     Class
11363 0.457068219 bare_soil
11364 0.132944390 bare_soil
11365 0.230616395 bare_soil
11366 0.003631395 bare_soil
11367 0.096730680 bare_soil
11368 0.255020717 bare_soil
# NDVI
par(mfrow = c(3, 1))
hist(val_grass$NDVI, main = "grass", xlab = "NDVI", xlim = c(0, 1), col = "#e5ca81")
hist(val_trees$NDVI, main = "trees", xlab = "NDVI", xlim = c(0, 1), col = "#81a581")
hist(val_bare_soil$NDVI, main = "bare_soil", xlab = "NDVI", xlim = c(0, 1), col = "#7b6f89")

par(mfrow = c(1, 1))

# Scatterplot and trend lines of the NDVI and EVI indexes
plot(NDVI ~ EVI, data = val_grass, pch = ".", col = "#e5ca81", xlab = "EVI", ylab = "NDVI")
points(NDVI ~ EVI, data = val_trees, pch = ".", col = "#81a581")
points(NDVI ~ EVI, data = val_bare_soil, pch = ".", col = "#7b6f89")


model_grass <- lm(NDVI ~ EVI, data = val_grass) 
abline(model_grass, col = "#e5ca81", lwd = 2)    

model_trees <- lm(NDVI ~ EVI, data = val_trees) 
abline(model_trees, col = "#81a581", lwd = 2)   


model_bare_soil <- lm(NDVI ~ EVI, data = val_bare_soil) 
abline(model_bare_soil, col = "#7b6f89", lwd = 2)        

legend("topright", legend = c("grass", "trees", "bare_soil"), 
       fill = c("#e5ca81", "#7b6f89", "#81a581"), bg = "white")

Let's visualize the three indexes for each mapped feature in a single graph.

train_features_clnd_pxls_indexes <- train_features_clnd_pxls |> 
  dplyr::select(c(NDVI, EVI, SAVI, Class)) |> 
  tidyr::pivot_longer(cols = 1:3, values_to = "values", names_to = "indexes")
train_features_clnd_pxls_indexes
# A tibble: 48,879 × 3
   Class indexes values
   <fct> <chr>    <dbl>
 1 grass NDVI     0.282
 2 grass EVI      0.303
 3 grass SAVI     0.423
 4 grass NDVI     0.300
 5 grass EVI      0.321
 6 grass SAVI     0.450
 7 grass NDVI     0.228
 8 grass EVI      0.234
 9 grass SAVI     0.342
10 grass NDVI     0.257
# ℹ 48,869 more rows
# density ridges for each index per class, with dashed vertical lines at each index's mean
ggplot(train_features_clnd_pxls_indexes, 
                        aes(x = values, 
                            fill = indexes)) +
  geom_density_ridges(aes(y = 0, fill = indexes, color = indexes),  
                      jittered_points = TRUE, 
                      alpha = .5, 
                      scale = 1,
                      point_shape = "|",
                      point_size = 3, 
                      position = ggridges::position_points_jitter(width = 0.05, height = 0)) +  
    geom_vline(data = train_features_clnd_pxls_indexes |> 
               dplyr::group_by(Class, indexes) |> 
               dplyr::summarise(mean = mean(values), .groups = "drop") |> 
               dplyr::ungroup(),
             aes(xintercept = mean, color = indexes), linetype = "dashed", size = 0.8) +
  scale_fill_manual(values = c("#e5ca81", "#7b6f89", "#81a581"), labels = c("EVI", "NDVI", "SAVI")) +
  scale_color_manual(values = c("#e5ca81", "#7b6f89", "#81a581"), labels = c("EVI", "NDVI", "SAVI")) +
  theme_bw() + labs(x = "", y = "Frequency") +
  theme(axis.title = element_text(size = 20, face = "bold", color = "gray20"),
        axis.text.x.bottom = element_text(size = 14),
        axis.text.y = element_text(size = 14),
        legend.direction = "horizontal",
        legend.position = "bottom",
        legend.title = element_text(size = 16, face = "bold"),
        legend.text = element_text(size = 14),
        title = element_text(size = 18)) +
  facet_wrap(~Class, ncol = 3, scales = "free")

Now let's build the train and test data. At this stage we partition our dataset into two: a train and a test dataset. We train the model with the train dataset and evaluate the model performance using the test data.

We will use the caret function createDataPartition() to split the data into training and testing sets.

set.seed(124) 

str(train_features_clnd_pxls)
'data.frame':   16293 obs. of  9 variables:
 $ Blue : num  430 467 437 426 499 ...
 $ Green: num  1723 1978 1683 1678 1953 ...
 $ Red  : num  2484 2957 2550 2680 2933 ...
 $ RE   : num  4406 5488 4001 4657 4828 ...
 $ NIR  : num  4436 5494 4058 4530 4760 ...
 $ NDVI : num  0.282 0.3 0.228 0.257 0.237 ...
 $ EVI  : num  0.303 0.321 0.234 0.266 0.245 ...
 $ SAVI : num  0.423 0.45 0.342 0.385 0.356 ...
 $ Class: Factor w/ 3 levels "bare_soil","grass",..: 2 2 2 2 2 2 2 2 2 2 ...
# partitioning the data: 80% of the rows, stratified by Class
rf_train_features_clnd_pxls_indexes <- caret::createDataPartition(train_features_clnd_pxls$Class, list = FALSE, p = 0.8)

Now that we have partitioned the dataset, we will use those row indices to select the training data (80%) and the test data (20%). If we check the number of rows in the new datasets and apply the "rule of three", we can confirm the data was split as requested (13,036 of 16,293 rows ≈ 80%; see the quick check after the code below).

# setting 80% for training - rf_train_features_clnd_pxls_indexes is a vector of row indices specifying which rows to extract from train_features_clnd_pxls.
rf_train_data <- train_features_clnd_pxls[rf_train_features_clnd_pxls_indexes, ] 
str(rf_train_data)
'data.frame':   13036 obs. of  9 variables:
 $ Blue : num  467 437 426 501 452 ...
 $ Green: num  1978 1683 1678 1950 1740 ...
 $ Red  : num  2957 2550 2680 2954 2814 ...
 $ RE   : num  5488 4001 4657 4603 4677 ...
 $ NIR  : num  5494 4058 4530 4703 4720 ...
 $ NDVI : num  0.3 0.228 0.257 0.228 0.253 ...
 $ EVI  : num  0.321 0.234 0.266 0.234 0.262 ...
 $ SAVI : num  0.45 0.342 0.385 0.343 0.379 ...
 $ Class: Factor w/ 3 levels "bare_soil","grass",..: 2 2 2 2 2 2 2 2 2 2 ...
# setting 20% for testing - the minus sign selects all rows of train_features_clnd_pxls that are NOT in the training indices.
rf_test_data <- train_features_clnd_pxls[-rf_train_features_clnd_pxls_indexes, ]
str(rf_test_data)
'data.frame':   3257 obs. of  9 variables:
 $ Blue : num  430 499 469 435 487 ...
 $ Green: num  1723 1953 1853 1748 1969 ...
 $ Red  : num  2484 2933 2915 2714 3154 ...
 $ RE   : num  4406 4828 4898 4657 5250 ...
 $ NIR  : num  4436 4760 4823 4749 5357 ...
 $ NDVI : num  0.282 0.237 0.247 0.273 0.259 ...
 $ EVI  : num  0.303 0.245 0.254 0.286 0.267 ...
 $ SAVI : num  0.423 0.356 0.37 0.409 0.388 ...
 $ Class: Factor w/ 3 levels "bare_soil","grass",..: 2 2 2 2 2 2 2 2 2 2 ...
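
To make the 80/20 check explicit, dividing the row counts confirms the proportions:

# sanity check: share of rows in each split
nrow(rf_train_data) / nrow(train_features_clnd_pxls) # ~0.80
nrow(rf_test_data)  / nrow(train_features_clnd_pxls) # ~0.20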

Let's check the class proportions selected in each set of data.

rf_train_data_classCount <- rf_train_data |> 
  dplyr::group_by(Class) |> 
  count()

rf_train_data_classCount
# A tibble: 3 × 2
# Groups:   Class [3]
  Class         n
  <fct>     <int>
1 bare_soil  1725
2 grass      4791
3 trees      6520
(prop.table(table(rf_train_data$Class))*100)

bare_soil     grass     trees 
 13.23259  36.75207  50.01534 
rf_test_data_classCount <- rf_test_data |> 
  dplyr::group_by(Class) |> 
  count()

rf_test_data_classCount
# A tibble: 3 × 2
# Groups:   Class [3]
  Class         n
  <fct>     <int>
1 bare_soil   431
2 grass      1197
3 trees      1629

Proportion of the response classes that make up the test data.

(prop.table(table(rf_test_data$Class))*100)

bare_soil     grass     trees 
 13.23304  36.75161  50.01535 

The caret function trainControl() generates parameters that further control how models are created. Here we set up 10-fold cross-validation.

set.seed(124)
cvControl <- caret::trainControl(method = 'cv',
                                 number = 10,
                                 savePredictions = TRUE,
                                 verboseIter = FALSE)


respVar <- c("Class")
predVar <- c("Blue","Green","Red","RE","NIR","NDVI","EVI","SAVI")


rfModel <- caret::train(rf_train_data[,predVar],  # predictor columns
                        rf_train_data[,respVar],  # response column (Class)
                        method = "rf",
                        metric = "Kappa",         # metric used to pick the best mtry
                        ntree = 1000,
                        trControl = cvControl,
                        tuneLength = 6,           # number of mtry values to try
                        importance = TRUE)

rfModel
Random Forest 

13036 samples
    8 predictor
    3 classes: 'bare_soil', 'grass', 'trees' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 11733, 11733, 11732, 11732, 11733, 11732, ... 
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa    
  2     0.9272793  0.8773001
  3     0.9282768  0.8790817
  4     0.9285835  0.8796430
  5     0.9292738  0.8807883
  6     0.9297337  0.8815564
  8     0.9290430  0.8804391

Kappa was used to select the optimal model using the largest value.
The final value used for the model was mtry = 6.
plot(rfModel)

plot(rfModel$finalModel)

varImpPlot(rfModel$finalModel)
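
If you prefer the importance scores as a table rather than a plot, caret's varImp() on the train object should give the same information rescaled to 0-100 (a minimal sketch):

rf_var_imp <- caret::varImp(rfModel) # scaled importance, caret's default
rf_var_imp
plot(rf_var_imp)

4. Evaluating the Model

With the model trained, let's predict the classes of the held-out test data and compare them against the true labels using a confusion matrix.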

rfPredict <- predict(rfModel, rf_test_data)
head(rfPredict)
[1] grass grass grass grass grass grass
Levels: bare_soil grass trees
conf_matrix <- confusionMatrix(rfPredict, rf_test_data$Class)
conf_matrix
Confusion Matrix and Statistics

           Reference
Prediction  bare_soil grass trees
  bare_soil       299    47     9
  grass            94  1126    26
  trees            38    24  1594

Overall Statistics
                                          
               Accuracy : 0.9269          
                 95% CI : (0.9174, 0.9356)
    No Information Rate : 0.5002          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.8763          
                                          
 Mcnemar's Test P-Value : 2.36e-07        

Statistics by Class:

                     Class: bare_soil Class: grass Class: trees
Sensitivity                    0.6937       0.9407       0.9785
Specificity                    0.9802       0.9417       0.9619
Pos Pred Value                 0.8423       0.9037       0.9626
Neg Pred Value                 0.9545       0.9647       0.9781
Prevalence                     0.1323       0.3675       0.5002
Detection Rate                 0.0918       0.3457       0.4894
Detection Prevalence           0.1090       0.3826       0.5084
Balanced Accuracy              0.8370       0.9412       0.9702
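If you want these per-class statistics programmatically rather than from the printout, they live in the caret object itself:

# per-class statistics as a data frame, straight from the confusionMatrix object
as.data.frame(conf_matrix$byClass)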
conf_table <- as.data.frame(conf_matrix$table)

colnames(conf_table) <- c("Predicted", "Actual", "Frequency")
conf_table
  Predicted    Actual Frequency
1 bare_soil bare_soil       299
2     grass bare_soil        94
3     trees bare_soil        38
4 bare_soil     grass        47
5     grass     grass      1126
6     trees     grass        24
7 bare_soil     trees         9
8     grass     trees        26
9     trees     trees      1594
ggplot(data = conf_table, aes(x = Actual, y = Predicted, fill = Frequency)) +
  geom_tile(color = "white") +
  scale_fill_gradient(low = "white", high = "darkgreen") +
  geom_text(aes(label = Frequency), color = "black", size = 5) +
  labs(title = "Confusion Matrix Heatmap",
       x = "Actual Labels",
       y = "Predicted Labels",
       fill = "Frequency") +
  theme_minimal(base_size = 14) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
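
5. Classifying the Image

With the model's performance verified on the test set, we can now apply it to every pixel of the eight-band raster stack to produce the classified map.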

image_lc_predict <- raster::predict(train_features_clnd_indexes, rfModel) 
# Let's define the palette (using hexadecimal colors).
pal <- c("#7b6f89", "#e5ca81", "#81a581")

tm_shape(image_lc_predict)+
  tm_raster(style = "cat",labels = c("bare_soil",
                                     "grass",
                                     "trees"),
            palette = pal,
            title = "Legend")+
  tm_layout(main.title= "",
            main.title.size =.9 )+
  tm_layout(legend.outside = TRUE,
            legend.outside.position = "right",
            legend.title.size = .8)

raster_lc_predict <- raster(image_lc_predict)

# Define the palette for the classified raster
pal <- colorFactor(
  palette = c("#7b6f89", "#e5ca81", "#81a581"), 
  levels = c(1, 2, 3),
  na.color = "transparent" 
)


leaflet() |> 
  addTiles(group = "Base Map") |> 
  leafem::addRasterRGB(brick(cropped_raster_bands[[1:3]]), # addRasterRGB() comes from the leafem package
               r = 3, g = 2, b = 1,
               group = "RGB Image") |> 
  addRasterImage(raster_lc_predict,
                 colors = pal,
                 opacity = 0.4,
                 group = "Land Cover Classification") |> 
  # with pal= and values=, addLegend ignores 'labels'; using colors= + labels= shows the class names
  addLegend(colors = c("#7b6f89", "#e5ca81", "#81a581"),
            labels = c("bare_soil", "grass", "trees"),
            title = "Land Cover Classification",
            position = "bottomright") |> 
    addLayersControl(
    baseGroups = c("Base Map", "RGB Image"),
    overlayGroups = c("Land Cover Classification"),
    options = layersControlOptions(collapsed = FALSE))
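
6. Exploring the Results

Finally, let's quantify how much of the sampled area falls into each class.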
raster_lc_predict_df <- as.data.frame(raster_lc_predict) |> 
  mutate(category = case_when(
    class == 1 ~ "bare_soil",
    class == 2 ~ "grass",
    class == 3 ~ "trees",
    TRUE ~ NA_character_)) |> 
  group_by(category) |> 
  summarise(count = n()) |> 
  ungroup() |> 
  mutate(prop = round(((count*100)/(sum(count))), 2)) |> 
  arrange(desc(prop))

head(raster_lc_predict_df)
# A tibble: 3 × 3
  category   count  prop
  <chr>      <int> <dbl>
1 trees     230519 73.9 
2 grass      57288 18.4 
3 bare_soil  24115  7.73
plot_ly(raster_lc_predict_df, 
        x = ~category, 
        y = ~prop,
        type = "bar",
        text = ~prop, 
        textposition = 'auto') |> 
        layout(title = "Vegetation Cover",
               xaxis = list(title = "Vegetation Type",
                            tickvals = c("bare_soil", "grass", "trees"),
                            ticktext = c("Bare Soil", "Grass", "Trees")),
               yaxis = list(title = "Cover Proportation (%)")) 
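
If you want to keep the classified map for later use in a GIS, one line saves it to disk (the filename here is just an example):

# write the classified SpatRaster to a GeoTIFF (illustrative filename)
terra::writeRaster(image_lc_predict, "woody_savanna_classified.tif", overwrite = TRUE)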


Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/souzayuri/souzayuri.github.io, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Souza (2025, March 21). Yuri Souza, Ecologist: Supervised Image Classification with Random Forest in R. Retrieved from https://souzayuri.github.io/posts/2025-01-05-random_forest_sup_classific_r/

BibTeX citation

@misc{souza2025supervised,
  author = {Souza, Yuri},
  title = {Yuri Souza, Ecologist: Supervised Image Classification with Random Forest in R},
  url = {https://souzayuri.github.io/posts/2025-01-05-random_forest_sup_classific_r/},
  year = {2025}
}