Image Supervised Classification with Random Forest in R

Here I demonstrate how to load and visualize georeferenced spatial data and how to perform a supervised random forest classification. This is a work in progress and will be finished in the coming days.

Yuri Souza
2025-01-05

I will divide the workflow of this tutorial into four main steps:

  1. Getting the data ready
  2. Train Model
  3. Evaluate Model
  4. Classify Image

The data used in this tutorial can be accessed HERE

The first thing we will do is load the required packages. We will use raster and terra to handle raster data, sf for spatial vector data, tidyverse for tabular data and visualization, caret for the random forest model, and mapview along with leaflet for interactive data visualization, plus ggridges and tmap for some of the plots below.
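
Here is a minimal sketch of that setup step; the exact package list is inferred from the functions used throughout this post (caret's method = "rf" also requires the randomForest package to be installed):

library(raster)    # legacy raster classes (brick() is used by mapview::viewRGB below)
library(terra)     # raster and vector handling
library(sf)        # spatial vector data
library(tidyverse) # data wrangling and ggplot2
library(caret)     # model training and evaluation
library(mapview)   # interactive maps
library(leaflet)   # basemaps for mapview
library(ggridges)  # ridgeline density plots
library(tmap)      # thematic map of the final classification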

Now, let's load our raster data and the shapefile containing a few features delimited for use as training data in the random forest model.

raw_raster <- terra::rast("woody_savanna.tif") |> 
  terra::aggregate(fact=2) # original pixel size = 10.8cm, new pixel size 21.6cm (old * 2)
raw_raster
class       : SpatRaster 
dimensions  : 759, 782, 6  (nrow, ncol, nlyr)
resolution  : 1.99956e-06, 1.95189e-06  (x, y)
extent      : -47.63899, -47.63742, -14.0681, -14.06662  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326) 
source(s)   : memory
names       : woody~nna_1, woody~nna_2, woody~nna_3, woody~nna_4, woody~nna_5, woody~nna_6 
min values  :        2.89,       16.26,       16.97,       24.26,      117.52,           0 
max values  :    65535.00,    65535.00,    65535.00,    65535.00,    65535.00,       65535 
plot(raw_raster)
sampled_area <- terra::vect("sampled_area.shp")
sampled_area
 class       : SpatVector 
 geometry    : polygons 
 dimensions  : 1, 1  (geometries, attributes)
 extent      : -47.63879, -47.63768, -14.06791, -14.06682  (xmin, xmax, ymin, ymax)
 source      : sampled_area.shp
 coord. ref. : lon/lat WGS 84 (EPSG:4326) 
 names       :    Id
 type        : <int>
 values      :     0
plot(sampled_area)
train_features <- sf::st_read("train_features.shp")
Reading layer `train_features' from data source 
  `F:\souzayuri.github.io\_posts\2025-01-05-random_forest_sup_classific_r\train_features.shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 46 features and 9 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: -47.63869 ymin: -14.06791 xmax: -47.63768 ymax: -14.06683
Geodetic CRS:  WGS 84
train_features
Simple feature collection with 46 features and 9 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: -47.63869 ymin: -14.06791 xmax: -47.63768 ymax: -14.06683
Geodetic CRS:  WGS 84
First 10 features:
   Classcode Classname Classvalue RED GREEN BLUE Count   Shape_Leng
1       <NA>     grass          1 255   255  190 16651 0.0002222801
2       <NA>     grass          1 255   255  190 13637 0.0002344464
3       <NA>     grass          1 255   255  190  9875 0.0001834221
4       <NA>     grass          1 255   255  190 11577 0.0002181070
5       <NA>     grass          1 255   255  190 10001 0.0002419869
6       <NA>     grass          1 255   255  190  7684 0.0001557914
7       <NA>     grass          1 255   255  190 16943 0.0002274619
8       <NA>     grass          1 255   255  190  6667 0.0001647991
9       <NA>     grass          1 255   255  190  2964 0.0001024218
10      <NA>     grass          1 255   255  190  4238 0.0001182262
     Shape_Area                       geometry
1  3.318277e-09 POLYGON ((-47.63788 -14.066...
2  2.717603e-09 POLYGON ((-47.63782 -14.067...
3  1.967932e-09 POLYGON ((-47.63772 -14.066...
4  2.307135e-09 POLYGON ((-47.63781 -14.066...
5  1.993210e-09 POLYGON ((-47.63793 -14.067...
6  1.531386e-09 POLYGON ((-47.63788 -14.067...
7  3.376445e-09 POLYGON ((-47.63793 -14.066...
8  1.328762e-09 POLYGON ((-47.63768 -14.067...
9  5.907468e-10 POLYGON ((-47.63769 -14.066...
10 8.447055e-10 POLYGON ((-47.63785 -14.066...
train_features_clnd <- sf::st_read("train_features.shp") |> 
  dplyr::rename(OBJECTID = "Classcode",
                Class = "Classname") |> 
  dplyr::mutate(OBJECTID = as.character(1:46),
                Class = as.factor(Class),
                Code = as.numeric(Class)) |> 
  dplyr::select(c(1:3))
Reading layer `train_features' from data source 
  `F:\souzayuri.github.io\_posts\2025-01-05-random_forest_sup_classific_r\train_features.shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 46 features and 9 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: -47.63869 ymin: -14.06791 xmax: -47.63768 ymax: -14.06683
Geodetic CRS:  WGS 84
train_features_clnd
Simple feature collection with 46 features and 3 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: -47.63869 ymin: -14.06791 xmax: -47.63768 ymax: -14.06683
Geodetic CRS:  WGS 84
First 10 features:
   OBJECTID Class Classvalue                       geometry
1         1 grass          1 POLYGON ((-47.63788 -14.066...
2         2 grass          1 POLYGON ((-47.63782 -14.067...
3         3 grass          1 POLYGON ((-47.63772 -14.066...
4         4 grass          1 POLYGON ((-47.63781 -14.066...
5         5 grass          1 POLYGON ((-47.63793 -14.067...
6         6 grass          1 POLYGON ((-47.63788 -14.067...
7         7 grass          1 POLYGON ((-47.63793 -14.066...
8         8 grass          1 POLYGON ((-47.63768 -14.067...
9         9 grass          1 POLYGON ((-47.63769 -14.066...
10       10 grass          1 POLYGON ((-47.63785 -14.066...
plot(train_features_clnd)
# Compare CRS
if (terra::crs(raw_raster) == terra::crs(train_features_clnd)) {
  print("CRS are identical!")
} else {
  print("CRS are different. Therefore, reprojecting to match raster projection...")
  # Reproject train_features to match raw_raster
  train_features_clnd <- sf::st_transform(train_features_clnd, crs = terra::crs(raw_raster))
}
[1] "CRS are different. Therefore, reprojecting to match raster projection..."
# Compare CRS
if (terra::crs(raw_raster) == terra::crs(train_features_clnd)) {
  print("CRS are identical!")
} else {
  print("CRS are different. Therefore, reprojecting to match raster projection...")
  # Reproject train_features to match raw_raster
  train_features_clnd <- sf::st_transform(train_features_clnd, crs = terra::crs(raw_raster))
}
[1] "CRS are identical!"
glimpse(train_features_clnd)
Rows: 46
Columns: 4
$ OBJECTID   <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10"…
$ Class      <fct> grass, grass, grass, grass, grass, grass, grass, …
$ Classvalue <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2…
$ geometry   <POLYGON [°]> POLYGON ((-47.63788 -14.066..., POLYGON (…
summary(train_features_clnd)
   OBJECTID               Class      Classvalue             geometry 
 Length:46          bare_soil:15   Min.   :1.000   POLYGON      :46  
 Class :character   grass    :16   1st Qu.:1.000   epsg:4326    : 0  
 Mode  :character   trees    :15   Median :2.000   +proj=long...: 0  
                                   Mean   :1.978                     
                                   3rd Qu.:3.000                     
                                   Max.   :3.000                     

Now, let's crop the raster and check the projection again.

cropped_raster <- terra::crop(raw_raster, sampled_area)
cropped_raster
class       : SpatRaster 
dimensions  : 559, 558, 6  (nrow, ncol, nlyr)
resolution  : 1.99956e-06, 1.95189e-06  (x, y)
extent      : -47.63879, -47.63768, -14.06791, -14.06682  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326) 
source(s)   : memory
names       : woody~nna_1, woody~nna_2, woody~nna_3, woody~nna_4, woody~nna_5, woody~nna_6 
min values  :        2.89,       16.26,       16.97,       24.26,      165.85,       65535 
max values  :     3139.13,     7843.21,     9368.55,    17875.35,    20952.24,       65535 
plot(cropped_raster)
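
Note that terra::crop only trims the raster to the polygon's extent (its bounding box). If you also want the pixels outside the polygon set to NA, you can additionally mask it; a minimal sketch (not used in the rest of this post):

# set pixels outside the sampled area to NA (optional step)
cropped_masked <- terra::mask(cropped_raster, sampled_area)
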
# Compare CRS
if (terra::crs(raw_raster) == terra::crs(train_features_clnd)) {
  print("CRS are identical!")
} else {
  print("CRS are different. Therefore, reprojecting to match raster projection...")
  # Reproject train_features to match raw_raster (train_features_clnd is an sf
  # object, so we use sf::st_transform rather than terra::project)
  train_features_clnd <- sf::st_transform(train_features_clnd, crs = terra::crs(raw_raster))
}
[1] "CRS are identical!"

Now, let's rename the bands of our composition

names(cropped_raster) <- c("Blue","Green","Red","RE","NIR","MASK")
plot(cropped_raster)

Let's remove the MASK band, as we will not use it

cropped_raster_bands <- cropped_raster[[1:5]]
plot(cropped_raster_bands)

Let's make an RGB band composition and visualize it on an interactive map with mapview

train_class_colors <- unique(train_features_clnd$Class) |> 
  as.data.frame() |>
  # one color per class, in the order the classes appear in the output below
  dplyr::mutate(hex = c(grass = "#ff7f00",
                        trees = "#e41a1c",
                        bare_soil = "#55eed1"))
train_class_colors
  unique(train_features_clnd$Class)     hex
1                             grass #ff7f00
2                             trees #e41a1c
3                         bare_soil #55eed1
mapview::viewRGB(brick(cropped_raster_bands[[1:3]]), 
                 r = 3, g = 2, b = 1,
                 map.types = c("Esri.WorldImagery", "OpenStreetMap", "OpenTopoMap"),
                 maxpixels =  866760,
                 layer.name = "RBG Image") + 
  mapview::mapView(train_features_clnd, 
                   zcol = "Class", 
                   color = "black",  
                   lwd = 2,
                   col.regions = train_class_colors$hex,
                   layer.name = "Features")

Now that we have checked and named the raster bands, and visualized the raster overlaid with the features that will be used for training and testing the RF model, let's do some additional raster calculations to improve the model's ability to identify features. In this step we will compute three vegetation indexes from the band composition of our raster image: NDVI = (NIR - Red) / (NIR + Red), EVI = 2.5 * (NIR - Red) / (NIR + 6 * Red - 7.5 * Blue + 1), and SAVI = ((NIR - Red) / (NIR + Red + L)) * (1 + L), with the soil adjustment factor L = 0.5.

cropped_raster_bands_ndvi <- (cropped_raster_bands[[5]] - cropped_raster_bands[[3]]) / 
  (cropped_raster_bands[[5]] + cropped_raster_bands[[3]])
cropped_raster_bands_ndvi
class       : SpatRaster 
dimensions  : 559, 558, 1  (nrow, ncol, nlyr)
resolution  : 1.99956e-06, 1.95189e-06  (x, y)
extent      : -47.63879, -47.63768, -14.06791, -14.06682  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326) 
source(s)   : memory
name        :         NIR 
min value   : -0.08708532 
max value   :  0.95849029 
cropped_raster_bands_evi <- 2.5 * ((cropped_raster_bands[[5]] - cropped_raster_bands[[3]]) /
    (cropped_raster_bands[[5]] + 6 * cropped_raster_bands[[3]] - 7.5 * cropped_raster_bands[[1]] + 1))
cropped_raster_bands_evi
class       : SpatRaster 
dimensions  : 559, 558, 1  (nrow, ncol, nlyr)
resolution  : 1.99956e-06, 1.95189e-06  (x, y)
extent      : -47.63879, -47.63768, -14.06791, -14.06682  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326) 
source(s)   : memory
name        :       NIR 
min value   : -8.517467 
max value   : 83.766299 
cropped_raster_bands_savi <- ((cropped_raster_bands[[5]] - cropped_raster_bands[[3]]) /
    (cropped_raster_bands[[5]] + cropped_raster_bands[[3]] + 0.5)) * (1 + 0.5)
cropped_raster_bands_savi
class       : SpatRaster 
dimensions  : 559, 558, 1  (nrow, ncol, nlyr)
resolution  : 1.99956e-06, 1.95189e-06  (x, y)
extent      : -47.63879, -47.63768, -14.06791, -14.06682  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326) 
source(s)   : memory
name        :        NIR 
min value   : -0.1306091 
max value   :  1.4368568 
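
A side note: the EVI formula assumes reflectance values scaled to 0-1, so with raw digital numbers in the thousands the +1 term is negligible and EVI falls outside its usual -1 to 1 range (here -8.52 to 83.77). That does not hurt the classifier, which only needs discriminative layers, but if you want interpretable EVI values you could rescale the bands first; a minimal sketch, assuming the sensor stores 16-bit values:

# rescale digital numbers to 0-1 before computing the indexes (16-bit range is an assumption)
scaled_bands <- cropped_raster_bands / 65535
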
# Create a red-to-green color palette
red_to_green_palette <- colorRampPalette(c("darkred", "yellow", "darkgreen"))
# Let's visualize them together
par(mfrow = c(1, 3))

plot(cropped_raster_bands_ndvi, main = "NDVI", col = red_to_green_palette(100))
plot(cropped_raster_bands_evi, main = "EVI", col = red_to_green_palette(100))
plot(cropped_raster_bands_savi, main = "SAVI", col = red_to_green_palette(100))
par(mfrow = c(1, 1))

Let’s stack these three indexes onto our raster data so we can feed the model more information. After stacking the index rasters with our main raster, we will notice that the new layers are all named NIR. That is because the first spectral band used in the calculations above was the NIR band. To avoid confusion between the NIR spectral band and the new index layers, we will rename them according to the index they represent.

train_features_clnd_indexes <- c(cropped_raster_bands, 
                                 cropped_raster_bands_ndvi,
                                 cropped_raster_bands_evi, 
                                 cropped_raster_bands_savi)
train_features_clnd_indexes
class       : SpatRaster 
dimensions  : 559, 558, 8  (nrow, ncol, nlyr)
resolution  : 1.99956e-06, 1.95189e-06  (x, y)
extent      : -47.63879, -47.63768, -14.06791, -14.06682  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326) 
source(s)   : memory
names       :    Blue,   Green,     Red,       RE,      NIR,         NIR, ... 
min values  :    2.89,   16.26,   16.97,    24.26,   165.85, -0.08708532, ... 
max values  : 3139.13, 7843.21, 9368.55, 17875.35, 20952.24,  0.95849029, ... 
names(train_features_clnd_indexes) <- c(names(cropped_raster_bands),"NDVI", "EVI", "SAVI")
train_features_clnd_indexes
class       : SpatRaster 
dimensions  : 559, 558, 8  (nrow, ncol, nlyr)
resolution  : 1.99956e-06, 1.95189e-06  (x, y)
extent      : -47.63879, -47.63768, -14.06791, -14.06682  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326) 
source(s)   : memory
names       :    Blue,   Green,     Red,       RE,      NIR,        NDVI, ... 
min values  :    2.89,   16.26,   16.97,    24.26,   165.85, -0.08708532, ... 
max values  : 3139.13, 7843.21, 9368.55, 17875.35, 20952.24,  0.95849029, ... 
plot(train_features_clnd_indexes)

Now we have everything we need to prepare the data that will be used for training and testing our RF model. Let's get started by extracting the pixel values for the features we mapped. These pixel values will be used to train and test the model later.

train_features_clnd_pxls <- terra::extract(train_features_clnd_indexes, train_features_clnd)
head(train_features_clnd_pxls)
  ID   Blue   Green     Red      RE     NIR      NDVI       EVI
1  1 429.80 1723.23 2483.62 4405.63 4435.60 0.2821098 0.3028237
2  1 467.11 1978.03 2957.46 5488.30 5493.73 0.3001080 0.3212719
3  1 437.12 1682.75 2549.77 4000.69 4057.74 0.2282206 0.2344632
4  1 426.47 1678.09 2679.71 4656.98 4529.61 0.2565984 0.2656323
5  1 499.02 1952.94 2933.11 4827.65 4760.13 0.2374838 0.2453411
6  1 501.31 1949.97 2953.60 4603.49 4702.68 0.2284504 0.2342670
       SAVI
1 0.4231342
2 0.4501354
3 0.3423050
4 0.3848709
5 0.3562026
6 0.3426532
train_features_clnd_pxls$Class <- train_features_clnd$Class[train_features_clnd_pxls$ID]
head(train_features_clnd_pxls)
  ID   Blue   Green     Red      RE     NIR      NDVI       EVI
1  1 429.80 1723.23 2483.62 4405.63 4435.60 0.2821098 0.3028237
2  1 467.11 1978.03 2957.46 5488.30 5493.73 0.3001080 0.3212719
3  1 437.12 1682.75 2549.77 4000.69 4057.74 0.2282206 0.2344632
4  1 426.47 1678.09 2679.71 4656.98 4529.61 0.2565984 0.2656323
5  1 499.02 1952.94 2933.11 4827.65 4760.13 0.2374838 0.2453411
6  1 501.31 1949.97 2953.60 4603.49 4702.68 0.2284504 0.2342670
       SAVI Class
1 0.4231342 grass
2 0.4501354 grass
3 0.3423050 grass
4 0.3848709 grass
5 0.3562026 grass
6 0.3426532 grass
train_features_clnd_pxls$ID <- NULL

summary(train_features_clnd_pxls)
      Blue             Green             Red             RE         
 Min.   :  65.94   Min.   : 145.2   Min.   : 154   Min.   :  356.7  
 1st Qu.: 425.17   1st Qu.:1480.3   1st Qu.:1545   1st Qu.: 3804.7  
 Median : 560.81   Median :1900.7   Median :2269   Median : 4763.0  
 Mean   : 607.79   Mean   :1917.9   Mean   :2298   Mean   : 5006.3  
 3rd Qu.: 770.89   3rd Qu.:2351.1   3rd Qu.:2957   3rd Qu.: 5921.0  
 Max.   :1932.87   Max.   :6168.2   Max.   :8605   Max.   :13865.4  
      NIR               NDVI               EVI         
 Min.   :  217.8   Min.   :-0.07855   Min.   :-0.0729  
 1st Qu.: 4048.6   1st Qu.: 0.21691   1st Qu.: 0.2374  
 Median : 5022.9   Median : 0.34773   Median : 0.4421  
 Mean   : 5541.9   Mean   : 0.40261   Mean   : 0.6292  
 3rd Qu.: 6573.5   3rd Qu.: 0.61267   3rd Qu.: 1.0327  
 Max.   :15908.8   Max.   : 0.88841   Max.   : 2.1119  
      SAVI               Class     
 Min.   :-0.1178   bare_soil:2156  
 1st Qu.: 0.3254   grass    :5988  
 Median : 0.5216   trees    :8149  
 Mean   : 0.6039                   
 3rd Qu.: 0.9189                   
 Max.   : 1.3325                   

Now that we have attached pixel information to those polygon delimitations, let's take a look at the distribution of some of these pixels

val_grass <- subset(train_features_clnd_pxls, Class == "grass")
head(val_grass)
    Blue   Green     Red      RE     NIR      NDVI       EVI
1 429.80 1723.23 2483.62 4405.63 4435.60 0.2821098 0.3028237
2 467.11 1978.03 2957.46 5488.30 5493.73 0.3001080 0.3212719
3 437.12 1682.75 2549.77 4000.69 4057.74 0.2282206 0.2344632
4 426.47 1678.09 2679.71 4656.98 4529.61 0.2565984 0.2656323
5 499.02 1952.94 2933.11 4827.65 4760.13 0.2374838 0.2453411
6 501.31 1949.97 2953.60 4603.49 4702.68 0.2284504 0.2342670
       SAVI Class
1 0.4231342 grass
2 0.4501354 grass
3 0.3423050 grass
4 0.3848709 grass
5 0.3562026 grass
6 0.3426532 grass
val_trees <- subset(train_features_clnd_pxls, Class == "trees")
head(val_trees)
        Blue   Green     Red      RE      NIR      NDVI       EVI
5123  835.48 2126.36 2795.86 4694.74  5190.48 0.2998395 0.3812958
5124  682.78 1837.81 2409.82 3817.80  4411.68 0.2934633 0.3639547
5125  515.71 1274.47 1823.18 2926.88  3318.64 0.2908426 0.3598006
5126 1002.83 2880.55 3329.61 6823.73  7349.93 0.3764507 0.5074274
5127 1084.56 3526.16 3443.14 9376.89 11173.64 0.5288785 0.8154784
5128  772.20 2226.59 2301.30 5564.50  6780.29 0.4931945 0.7567093
          SAVI Class
5123 0.4497311 trees
5124 0.4401627 trees
5125 0.4362214 trees
5126 0.5646496 trees
5127 0.7932905 trees
5128 0.7397510 trees
val_bare_soil <- subset(train_features_clnd_pxls, Class == "bare_soil")
head(val_bare_soil)
        Blue   Green     Red      RE     NIR        NDVI         EVI
11363 719.88 1919.11 2177.92 4151.08 4087.10 0.304736465 0.405983246
11364 850.89 2283.77 2924.93 3715.67 3493.87 0.088636497 0.097004139
11365 872.85 2373.72 3104.62 4060.01 4232.78 0.153754740 0.172870271
11366 871.65 2153.15 2925.41 3537.46 2939.61 0.002421136 0.002543767
11367 891.18 2439.16 3134.20 3972.84 3566.33 0.064491932 0.068860180
11368 888.49 2435.71 3186.76 4610.06 4492.41 0.170024881 0.192570391
             SAVI     Class
11363 0.457068219 bare_soil
11364 0.132944390 bare_soil
11365 0.230616395 bare_soil
11366 0.003631395 bare_soil
11367 0.096730680 bare_soil
11368 0.255020717 bare_soil
# NDVI
par(mfrow = c(3, 1))
hist(val_grass$NDVI, main = "grass", xlab = "NDVI", xlim = c(0, 1), col = "#e5ca81")
hist(val_trees$NDVI, main = "trees", xlab = "NDVI", xlim = c(0, 1), col = "#81a581")
hist(val_bare_soil$NDVI, main = "bare_soil", xlab = "NDVI", xlim = c(0, 1), col = "#7b6f89")

par(mfrow = c(1, 1))

# Scatterplot and trend lines of the NDVI and EVI indexes
plot(NDVI ~ EVI, data = val_grass, pch = ".", col = "#e5ca81", xlab = "EVI", ylab = "NDVI")
points(NDVI ~ EVI, data = val_trees, pch = ".", col = "#81a581")
points(NDVI ~ EVI, data = val_bare_soil, pch = ".", col = "#7b6f89")


model_grass <- lm(NDVI ~ EVI, data = val_grass) 
abline(model_grass, col = "#e5ca81", lwd = 2)    

model_trees <- lm(NDVI ~ EVI, data = val_trees) 
abline(model_trees, col = "#81a581", lwd = 2)   


model_bare_soil <- lm(NDVI ~ EVI, data = val_bare_soil) 
abline(model_bare_soil, col = "#7b6f89", lwd = 2)        

legend("topright", legend = c("grass", "trees", "bare_soil"), 
       fill = c("#e5ca81", "#7b6f89", "#81a581"), bg = "white")

Let's visualize the three indexes for each mapped feature in a single graph

train_features_clnd_pxls_indexes <- train_features_clnd_pxls |> 
  dplyr::select(c(NDVI, EVI, SAVI, Class)) |> 
  tidyr::pivot_longer(cols = 1:3, values_to = "values", names_to = "indexes")
train_features_clnd_pxls_indexes
# A tibble: 48,879 × 3
   Class indexes values
   <fct> <chr>    <dbl>
 1 grass NDVI     0.282
 2 grass EVI      0.303
 3 grass SAVI     0.423
 4 grass NDVI     0.300
 5 grass EVI      0.321
 6 grass SAVI     0.450
 7 grass NDVI     0.228
 8 grass EVI      0.234
 9 grass SAVI     0.342
10 grass NDVI     0.257
# ℹ 48,869 more rows
# density ridges for each index, with dashed vertical lines marking the mean value per class
ggplot(train_features_clnd_pxls_indexes, 
                        aes(x = values, 
                            fill = indexes)) +
  geom_density_ridges(aes(y = 0, fill = indexes, color = indexes),  
                      jittered_points = TRUE, 
                      alpha = .5, 
                      scale = 1,
                      point_shape = "|",
                      point_size = 3, 
                      position = ggridges::position_points_jitter(width = 0.05, height = 0)) +  
    geom_vline(data = train_features_clnd_pxls_indexes |> 
               dplyr::group_by(Class, indexes) |> 
               dplyr::summarise(mean = mean(values), .groups = "drop") |> 
               dplyr::ungroup(),
             aes(xintercept = mean, color = indexes), linetype = "dashed", size = 0.8) +
  scale_fill_manual(values = c("#e5ca81", "#7b6f89", "#81a581"), labels = c("EVI", "NDVI", "SAVI")) +
  scale_color_manual(values = c("#e5ca81", "#7b6f89", "#81a581"), labels = c("EVI", "NDVI", "SAVI")) +
  theme_bw() + labs(x = "", y = "Frequency") +
  theme(axis.title = element_text(size = 20, face = "bold", color = "gray20"),
        axis.text.x.bottom = element_text(size = 14),
        axis.text.y = element_text(size = 14),
        legend.direction = "horizontal",
        legend.position = "bottom",
        legend.title = element_text(size = 16, face = "bold"),
        legend.text = element_text(size = 14),
        title = element_text(size = 18)) +
  facet_wrap(~Class, ncol = 3, scales = "free")

Now let's build the training and test data. At this stage we partition our dataset into two: a training and a test dataset. We train the model with the training dataset and evaluate the model performance using the test data.

We will use the caret function createDataPartition() to split the data into training and testing sets.

set.seed(124) 

str(train_features_clnd_pxls)
'data.frame':   16293 obs. of  9 variables:
 $ Blue : num  430 467 437 426 499 ...
 $ Green: num  1723 1978 1683 1678 1953 ...
 $ Red  : num  2484 2957 2550 2680 2933 ...
 $ RE   : num  4406 5488 4001 4657 4828 ...
 $ NIR  : num  4436 5494 4058 4530 4760 ...
 $ NDVI : num  0.282 0.3 0.228 0.257 0.237 ...
 $ EVI  : num  0.303 0.321 0.234 0.266 0.245 ...
 $ SAVI : num  0.423 0.45 0.342 0.385 0.356 ...
 $ Class: Factor w/ 3 levels "bare_soil","grass",..: 2 2 2 2 2 2 2 2 2 2 ...
# partitioning the data into 80%
rf_train_features_clnd_pxls_indexes <- caret::createDataPartition(train_features_clnd_pxls$Class, list = FALSE, p = 0.8)

Now that we have created the partition index, we will use it to select 80% of the data for training and the remaining 20% for testing. If we check the number of rows in the new datasets and apply a simple rule of three, we can confirm that the data was split as requested (see the quick check after the two str() calls below).

# setting 80% for training - rf_train_features_clnd_pxls_indexes is a vector of row indices specifying which rows to extract from train_features_clnd_pxls
rf_train_data <- train_features_clnd_pxls[rf_train_features_clnd_pxls_indexes, ] 
str(rf_train_data)
'data.frame':   13036 obs. of  9 variables:
 $ Blue : num  467 437 426 501 452 ...
 $ Green: num  1978 1683 1678 1950 1740 ...
 $ Red  : num  2957 2550 2680 2954 2814 ...
 $ RE   : num  5488 4001 4657 4603 4677 ...
 $ NIR  : num  5494 4058 4530 4703 4720 ...
 $ NDVI : num  0.3 0.228 0.257 0.228 0.253 ...
 $ EVI  : num  0.321 0.234 0.266 0.234 0.262 ...
 $ SAVI : num  0.45 0.342 0.385 0.343 0.379 ...
 $ Class: Factor w/ 3 levels "bare_soil","grass",..: 2 2 2 2 2 2 2 2 2 2 ...
# setting 20% for testing - the minus sign drops the training rows, keeping the remaining 20% of train_features_clnd_pxls
rf_test_data <- train_features_clnd_pxls[-rf_train_features_clnd_pxls_indexes, ]
str(rf_test_data)
'data.frame':   3257 obs. of  9 variables:
 $ Blue : num  430 499 469 435 487 ...
 $ Green: num  1723 1953 1853 1748 1969 ...
 $ Red  : num  2484 2933 2915 2714 3154 ...
 $ RE   : num  4406 4828 4898 4657 5250 ...
 $ NIR  : num  4436 4760 4823 4749 5357 ...
 $ NDVI : num  0.282 0.237 0.247 0.273 0.259 ...
 $ EVI  : num  0.303 0.245 0.254 0.286 0.267 ...
 $ SAVI : num  0.423 0.356 0.37 0.409 0.388 ...
 $ Class: Factor w/ 3 levels "bare_soil","grass",..: 2 2 2 2 2 2 2 2 2 2 ...
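
As a quick check (a minimal sketch), the training rows should account for about 80% of the 16,293 extracted pixels:

# proportion of rows assigned to training (should be ~0.80)
nrow(rf_train_data) / (nrow(rf_train_data) + nrow(rf_test_data))
# 13036 / 16293 = 0.8001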

Let's check the proportion of classes selected in each dataset

rf_train_data_classCount <- rf_train_data |> 
  dplyr::group_by(Class) |> 
  count()

rf_train_data_classCount
# A tibble: 3 × 2
# Groups:   Class [3]
  Class         n
  <fct>     <int>
1 bare_soil  1725
2 grass      4791
3 trees      6520
(prop.table(table(rf_train_data$Class))*100)

bare_soil     grass     trees 
 13.23259  36.75207  50.01534 
rf_test_data_classCount <- rf_test_data |> 
  dplyr::group_by(Class) |> 
  count()

rf_test_data_classCount
# A tibble: 3 × 2
# Groups:   Class [3]
  Class         n
  <fct>     <int>
1 bare_soil   431
2 grass      1197
3 trees      1629

Proportion of the response classes that make up the test data.

(prop.table(table(rf_test_data$Class))*100)

bare_soil     grass     trees 
 13.23304  36.75161  50.01535 

The function trainControl generates parameters that further control how models are created. Here we use 10-fold cross-validation.

set.seed(124)
cvControl <- caret::trainControl(method = 'cv',
                                 number = 10,
                                 savePredictions = TRUE,
                                 verboseIter = FALSE)


respVar <- c("Class")
predVar <- c("Blue","Green","Red","RE","NIR","NDVI","EVI","SAVI")


rfModel <- caret::train(rf_train_data[, predVar],
                        rf_train_data[, respVar],
                        method = "rf",
                        metric = "Kappa",
                        ntree = 500,
                        trControl = cvControl,
                        tuneLength = 6,
                        importance = TRUE)


rfModel
Random Forest 

13036 samples
    8 predictor
    3 classes: 'bare_soil', 'grass', 'trees' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 11733, 11733, 11732, 11732, 11733, 11732, ... 
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa    
  2     0.9281236  0.8787451
  3     0.9285834  0.8795842
  4     0.9284300  0.8793783
  5     0.9295808  0.8813170
  6     0.9295036  0.8811901
  8     0.9293499  0.8809498

Kappa was used to select the optimal model using the largest value.
The final value used for the model was mtry = 5.
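
Since we trained with importance = TRUE, we can also inspect which predictors the model relied on; a minimal sketch (output not shown):

caret::varImp(rfModel)       # scaled variable importance per class
plot(caret::varImp(rfModel)) # dotplot of the same importances
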
rfPredict <- predict(rfModel,rf_test_data)
head(rfPredict)
[1] grass grass grass grass grass grass
Levels: bare_soil grass trees
confusionMatrix(rfPredict,rf_test_data$Class)
Confusion Matrix and Statistics

           Reference
Prediction  bare_soil grass trees
  bare_soil       294    44    11
  grass            98  1130    26
  trees            39    23  1592

Overall Statistics
                                          
               Accuracy : 0.926           
                 95% CI : (0.9165, 0.9348)
    No Information Rate : 0.5002          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.8747          
                                          
 Mcnemar's Test P-Value : 6.167e-08       

Statistics by Class:

                     Class: bare_soil Class: grass Class: trees
Sensitivity                   0.68213       0.9440       0.9773
Specificity                   0.98054       0.9398       0.9619
Pos Pred Value                0.84241       0.9011       0.9625
Neg Pred Value                0.95289       0.9666       0.9769
Prevalence                    0.13233       0.3675       0.5002
Detection Rate                0.09027       0.3469       0.4888
Detection Prevalence          0.10715       0.3850       0.5078
Balanced Accuracy             0.83134       0.9419       0.9696
# train_features_clnd_indexes is a SpatRaster, so we predict with terra
rf_predictions <- terra::predict(train_features_clnd_indexes, rfModel) 
# Let's define the palette (hexadecimal colors)
pal <- c("#7b6f89", "#e5ca81", "#81a581")

tm_shape(rf_predictions) +
  tm_raster(style = "cat",
            labels = c("bare_soil",
                       "grass",
                       "trees"),
            palette = pal,
            title = "Legend") +
  tm_layout(main.title = "",
            main.title.size = .9,
            legend.outside = TRUE,
            legend.outside.position = "right",
            legend.title.size = .8)
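
Finally, if you want to keep the classified map, you can write it to disk; a minimal sketch (the filename is just an example):

# save the classified raster to a GeoTIFF (example filename)
terra::writeRaster(rf_predictions, "woody_savanna_classified.tif", overwrite = TRUE)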

Post Cover Image Credits

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/souzayuri/souzayuri.github.io, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Souza (2025, Jan. 5). Yuri Souza, Data Scientist: Image Supervised Classification with Random Forest in R. Retrieved from https://souzayuri.github.io/posts/2025-01-05-random_forest_sup_classific_r/

BibTeX citation

@misc{souza2025image,
  author = {Souza, Yuri},
  title = {Yuri Souza, Data Scientist: Image Supervised Classification with Random Forest in R},
  url = {https://souzayuri.github.io/posts/2025-01-05-random_forest_sup_classific_r/},
  year = {2025}
}