Here I demonstrate how to load and visualize georeferenced spatial data and run a supervised random forest classification. This post is a work in progress and should be finished in the coming days.
I will divide the workflow of this tutorial into four main steps:
1. loading and visualizing the raster and the training features;
2. computing vegetation indices (NDVI, EVI, SAVI) and stacking them with the spectral bands;
3. extracting pixel values and splitting them into training and test sets;
4. training the random forest model, evaluating it, and mapping the classified image.
The data used in this tutorial can be accessed HERE
The first thing we will do is load the required packages. We will use raster and terra to handle raster data, sf for spatial vector data, tidyverse for tabular data and visualization, caret for the random forest, and mapview along with leaflet for interactive data visualization.
if (!require("tidyverse")) install.packages("tidyverse")
if (!require("terra")) install.packages("terra")
if (!require("raster")) install.packages("raster")
if (!require("sf")) install.packages("sf")
if (!require("sp")) install.packages("sp")
if (!require("caret")) install.packages("caret")
if (!require("mapview")) install.packages("mapview")
if (!require("leaflet")) install.packages("leaflet")
if (!require("ggridges")) install.packages("ggridges")
if (!require("randomForest")) install.packages("randomForest")
if (!require("tmap")) install.packages("tmap")
Now, let's load our raster data and the shapefile containing a few features delimited for use as training data in the random forest model.
raw_raster <- terra::rast("woody_savanna.tif") |>
terra::aggregate(fact=2) # original pixel size = 10.8cm, new pixel size 21.6cm (old * 2)
raw_raster
class : SpatRaster
dimensions : 759, 782, 6 (nrow, ncol, nlyr)
resolution : 1.99956e-06, 1.95189e-06 (x, y)
extent : -47.63899, -47.63742, -14.0681, -14.06662 (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326)
source(s) : memory
names : woody~nna_1, woody~nna_2, woody~nna_3, woody~nna_4, woody~nna_5, woody~nna_6
min values : 2.89, 16.26, 16.97, 24.26, 117.52, 0
max values : 65535.00, 65535.00, 65535.00, 65535.00, 65535.00, 65535
plot(raw_raster)
sampled_area <- terra::vect("sampled_area.shp")
sampled_area
class : SpatVector
geometry : polygons
dimensions : 1, 1 (geometries, attributes)
extent : -47.63879, -47.63768, -14.06791, -14.06682 (xmin, xmax, ymin, ymax)
source : sampled_area.shp
coord. ref. : lon/lat WGS 84 (EPSG:4326)
names : Id
type : <int>
values : 0
plot(sampled_area)
train_features <- sf::st_read("train_features.shp")
Reading layer `train_features' from data source
`F:\souzayuri.github.io\_posts\2025-01-05-random_forest_sup_classific_r\train_features.shp'
using driver `ESRI Shapefile'
Simple feature collection with 46 features and 9 fields
Geometry type: POLYGON
Dimension: XY
Bounding box: xmin: -47.63869 ymin: -14.06791 xmax: -47.63768 ymax: -14.06683
Geodetic CRS: WGS 84
train_features
Simple feature collection with 46 features and 9 fields
Geometry type: POLYGON
Dimension: XY
Bounding box: xmin: -47.63869 ymin: -14.06791 xmax: -47.63768 ymax: -14.06683
Geodetic CRS: WGS 84
First 10 features:
Classcode Classname Classvalue RED GREEN BLUE Count Shape_Leng
1 <NA> grass 1 255 255 190 16651 0.0002222801
2 <NA> grass 1 255 255 190 13637 0.0002344464
3 <NA> grass 1 255 255 190 9875 0.0001834221
4 <NA> grass 1 255 255 190 11577 0.0002181070
5 <NA> grass 1 255 255 190 10001 0.0002419869
6 <NA> grass 1 255 255 190 7684 0.0001557914
7 <NA> grass 1 255 255 190 16943 0.0002274619
8 <NA> grass 1 255 255 190 6667 0.0001647991
9 <NA> grass 1 255 255 190 2964 0.0001024218
10 <NA> grass 1 255 255 190 4238 0.0001182262
Shape_Area geometry
1 3.318277e-09 POLYGON ((-47.63788 -14.066...
2 2.717603e-09 POLYGON ((-47.63782 -14.067...
3 1.967932e-09 POLYGON ((-47.63772 -14.066...
4 2.307135e-09 POLYGON ((-47.63781 -14.066...
5 1.993210e-09 POLYGON ((-47.63793 -14.067...
6 1.531386e-09 POLYGON ((-47.63788 -14.067...
7 3.376445e-09 POLYGON ((-47.63793 -14.066...
8 1.328762e-09 POLYGON ((-47.63768 -14.067...
9 5.907468e-10 POLYGON ((-47.63769 -14.066...
10 8.447055e-10 POLYGON ((-47.63785 -14.066...
train_features_clnd <- sf::st_read("train_features.shp") |>
dplyr::rename(OBJECTID = "Classcode",
Class = "Classname") |>
dplyr::mutate(OBJECTID = as.character(1:46),
Class = as.factor(Class),
Code = as.numeric(Class)) |>
dplyr::select(c(1:3))
Reading layer `train_features' from data source
`F:\souzayuri.github.io\_posts\2025-01-05-random_forest_sup_classific_r\train_features.shp'
using driver `ESRI Shapefile'
Simple feature collection with 46 features and 9 fields
Geometry type: POLYGON
Dimension: XY
Bounding box: xmin: -47.63869 ymin: -14.06791 xmax: -47.63768 ymax: -14.06683
Geodetic CRS: WGS 84
train_features_clnd
Simple feature collection with 46 features and 3 fields
Geometry type: POLYGON
Dimension: XY
Bounding box: xmin: -47.63869 ymin: -14.06791 xmax: -47.63768 ymax: -14.06683
Geodetic CRS: WGS 84
First 10 features:
OBJECTID Class Classvalue geometry
1 1 grass 1 POLYGON ((-47.63788 -14.066...
2 2 grass 1 POLYGON ((-47.63782 -14.067...
3 3 grass 1 POLYGON ((-47.63772 -14.066...
4 4 grass 1 POLYGON ((-47.63781 -14.066...
5 5 grass 1 POLYGON ((-47.63793 -14.067...
6 6 grass 1 POLYGON ((-47.63788 -14.067...
7 7 grass 1 POLYGON ((-47.63793 -14.066...
8 8 grass 1 POLYGON ((-47.63768 -14.067...
9 9 grass 1 POLYGON ((-47.63769 -14.066...
10 10 grass 1 POLYGON ((-47.63785 -14.066...
plot(train_features_clnd)
# Compare CRS
if (terra::crs(raw_raster) == terra::crs(train_features_clnd)) {
print("CRS are identical!")
} else {
print("CRS are different. Therefore, reprojecting to match raster projection...")
# Reproject train_features to match raw_raster
train_features_clnd <- sf::st_transform(train_features_clnd, crs = terra::crs(raw_raster))
}
[1] "CRS are different. Therefore, reprojecting to match raster projection..."
# Compare CRS
if (terra::crs(raw_raster) == terra::crs(train_features_clnd)) {
print("CRS are identical!")
} else {
print("CRS are different. Therefore, reprojecting to match raster projection...")
# Reproject train_features to match raw_raster
train_features_clnd <- sf::st_transform(train_features_clnd, crs = terra::crs(raw_raster))
}
[1] "CRS are identical!"
glimpse(train_features_clnd)
Rows: 46
Columns: 4
$ OBJECTID <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10"…
$ Class <fct> grass, grass, grass, grass, grass, grass, grass, …
$ Classvalue <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2…
$ geometry <POLYGON [°]> POLYGON ((-47.63788 -14.066..., POLYGON (…
summary(train_features_clnd)
OBJECTID Class Classvalue geometry
Length:46 bare_soil:15 Min. :1.000 POLYGON :46
Class :character grass :16 1st Qu.:1.000 epsg:4326 : 0
Mode :character trees :15 Median :2.000 +proj=long...: 0
Mean :1.978
3rd Qu.:3.000
Max. :3.000
Now, lets crop the raster and check projection
cropped_raster <- terra::crop(raw_raster, sampled_area)
cropped_raster
class : SpatRaster
dimensions : 559, 558, 6 (nrow, ncol, nlyr)
resolution : 1.99956e-06, 1.95189e-06 (x, y)
extent : -47.63879, -47.63768, -14.06791, -14.06682 (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326)
source(s) : memory
names : woody~nna_1, woody~nna_2, woody~nna_3, woody~nna_4, woody~nna_5, woody~nna_6
min values : 2.89, 16.26, 16.97, 24.26, 165.85, 65535
max values : 3139.13, 7843.21, 9368.55, 17875.35, 20952.24, 65535
plot(cropped_raster)
# Compare CRS
if (terra::crs(raw_raster) == terra::crs(train_features_clnd)) {
print("CRS are identical!")
} else {
print("CRS are different. Therefore, reprojecting to match raster projection...")
# Reproject train_features to match raw_raster
train_features_clnd <- sf::st_transform(train_features_clnd, crs = terra::crs(raw_raster))
}
[1] "CRS are identical!"
Now, let's rename our band composition.
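The renaming code itself is not shown here; a minimal sketch, assuming the band order (Blue, Green, Red, RedEdge, NIR, plus a mask layer) that matches the layer names printed further below:

```r
# assumed band order for this sensor; adjust to your own raster's layer layout
names(cropped_raster) <- c("Blue", "Green", "Red", "RE", "NIR", "MASK")
```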
Let's remove the MASK band, as we will not use it.
cropped_raster_bands <- cropped_raster[[1:5]]
plot(cropped_raster_bands)
Let's make an RGB band composition and visualize it on an interactive map with mapview.
train_class_colors <- unique(train_features_clnd$Class) |>
  as.data.frame() |>
  dplyr::mutate(hex = c("#ff7f00",   # grass
                        "#e41a1c",   # trees
                        "#55eed1"))  # bare_soil
train_class_colors
unique(train_features_clnd$Class) hex
1 grass #ff7f00
2 trees #e41a1c
3 bare_soil #55eed1
mapview::viewRGB(brick(cropped_raster_bands[[1:3]]),
r = 3, g = 2, b = 1,
map.types = c("Esri.WorldImagery", "OpenStreetMap", "OpenTopoMap"),
maxpixels = 866760,
layer.name = "RGB Image") +
mapview::mapView(train_features_clnd,
zcol = "Class",
color = "black",
lwd = 2,
col.regions = train_class_colors$hex,
layer.name = "Features")
Now that we have checked and named the raster bands, and visualized the raster overlaid with the features that will be used for training and testing the RF model, let's do an additional raster calculation to improve the model's ability to identify features. In this step we will compute the NDVI, EVI, and SAVI indices from the band composition of our raster image.
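For reference, these are the standard formulas for the three indices (with the soil adjustment factor $L = 0.5$ for SAVI), which the code below implements band by band:

$$\mathrm{NDVI} = \frac{\mathrm{NIR}-\mathrm{Red}}{\mathrm{NIR}+\mathrm{Red}}$$

$$\mathrm{EVI} = 2.5\,\frac{\mathrm{NIR}-\mathrm{Red}}{\mathrm{NIR}+6\,\mathrm{Red}-7.5\,\mathrm{Blue}+1}$$

$$\mathrm{SAVI} = \frac{\mathrm{NIR}-\mathrm{Red}}{\mathrm{NIR}+\mathrm{Red}+L}\,(1+L)$$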
cropped_raster_bands_ndvi <- (cropped_raster_bands[[5]] - cropped_raster_bands[[3]]) /
(cropped_raster_bands[[5]] + cropped_raster_bands[[3]])
cropped_raster_bands_ndvi
class : SpatRaster
dimensions : 559, 558, 1 (nrow, ncol, nlyr)
resolution : 1.99956e-06, 1.95189e-06 (x, y)
extent : -47.63879, -47.63768, -14.06791, -14.06682 (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326)
source(s) : memory
name : NIR
min value : -0.08708532
max value : 0.95849029
cropped_raster_bands_evi <- 2.5 * ((cropped_raster_bands[[5]] - cropped_raster_bands[[3]]) /
(cropped_raster_bands[[5]] + 6 * cropped_raster_bands[[3]] - 7.5 * cropped_raster_bands[[1]] + 1))
cropped_raster_bands_evi
class : SpatRaster
dimensions : 559, 558, 1 (nrow, ncol, nlyr)
resolution : 1.99956e-06, 1.95189e-06 (x, y)
extent : -47.63879, -47.63768, -14.06791, -14.06682 (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326)
source(s) : memory
name : NIR
min value : -8.517467
max value : 83.766299
cropped_raster_bands_savi <- ((cropped_raster_bands[[5]] - cropped_raster_bands[[3]]) /
(cropped_raster_bands[[5]] + cropped_raster_bands[[3]] + 0.5)) * (1 + 0.5)
cropped_raster_bands_savi
class : SpatRaster
dimensions : 559, 558, 1 (nrow, ncol, nlyr)
resolution : 1.99956e-06, 1.95189e-06 (x, y)
extent : -47.63879, -47.63768, -14.06791, -14.06682 (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326)
source(s) : memory
name : NIR
min value : -0.1306091
max value : 1.4368568
# Create a red-to-green color palette
red_to_green_palette <- colorRampPalette(c("darkred", "yellow", "darkgreen"))
# Lets visuallize them together
par(mfrow = c(1, 3))
plot(cropped_raster_bands_ndvi, main = "NDVI", col = red_to_green_palette(100))
plot(cropped_raster_bands_evi, main = "EVI", col = red_to_green_palette(100))
plot(cropped_raster_bands_savi, main = "SAVI", col = red_to_green_palette(100))
Let's stack these three indices onto our raster data so the model has more information to work with. After stacking the index rasters with our main raster, we will notice that the new layers are all named NIR. That is because the first spectral band used in each calculation above was the NIR band. To avoid confusion between the NIR spectral band and the new index layers, we will rename them according to the index they represent.
train_features_clnd_indexes <- c(cropped_raster_bands,
cropped_raster_bands_ndvi,
cropped_raster_bands_evi,
cropped_raster_bands_savi)
train_features_clnd_indexes
class : SpatRaster
dimensions : 559, 558, 8 (nrow, ncol, nlyr)
resolution : 1.99956e-06, 1.95189e-06 (x, y)
extent : -47.63879, -47.63768, -14.06791, -14.06682 (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326)
source(s) : memory
names : Blue, Green, Red, RE, NIR, NIR, ...
min values : 2.89, 16.26, 16.97, 24.26, 165.85, -0.08708532, ...
max values : 3139.13, 7843.21, 9368.55, 17875.35, 20952.24, 0.95849029, ...
names(train_features_clnd_indexes) <- c(names(cropped_raster_bands),"NDVI", "EVI", "SAVI")
train_features_clnd_indexes
class : SpatRaster
dimensions : 559, 558, 8 (nrow, ncol, nlyr)
resolution : 1.99956e-06, 1.95189e-06 (x, y)
extent : -47.63879, -47.63768, -14.06791, -14.06682 (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326)
source(s) : memory
names : Blue, Green, Red, RE, NIR, NDVI, ...
min values : 2.89, 16.26, 16.97, 24.26, 165.85, -0.08708532, ...
max values : 3139.13, 7843.21, 9368.55, 17875.35, 20952.24, 0.95849029, ...
plot(train_features_clnd_indexes)
Now we have everything we need to prepare the data for training and testing our RF model. Let's get started by extracting the pixel values within the features we mapped. These pixel values will be used to train and test the model later.
train_features_clnd_pxls <- terra::extract(train_features_clnd_indexes, train_features_clnd)
head(train_features_clnd_pxls)
ID Blue Green Red RE NIR NDVI EVI
1 1 429.80 1723.23 2483.62 4405.63 4435.60 0.2821098 0.3028237
2 1 467.11 1978.03 2957.46 5488.30 5493.73 0.3001080 0.3212719
3 1 437.12 1682.75 2549.77 4000.69 4057.74 0.2282206 0.2344632
4 1 426.47 1678.09 2679.71 4656.98 4529.61 0.2565984 0.2656323
5 1 499.02 1952.94 2933.11 4827.65 4760.13 0.2374838 0.2453411
6 1 501.31 1949.97 2953.60 4603.49 4702.68 0.2284504 0.2342670
SAVI
1 0.4231342
2 0.4501354
3 0.3423050
4 0.3848709
5 0.3562026
6 0.3426532
train_features_clnd_pxls$Class <- train_features_clnd$Class[train_features_clnd_pxls$ID]
head(train_features_clnd_pxls)
ID Blue Green Red RE NIR NDVI EVI
1 1 429.80 1723.23 2483.62 4405.63 4435.60 0.2821098 0.3028237
2 1 467.11 1978.03 2957.46 5488.30 5493.73 0.3001080 0.3212719
3 1 437.12 1682.75 2549.77 4000.69 4057.74 0.2282206 0.2344632
4 1 426.47 1678.09 2679.71 4656.98 4529.61 0.2565984 0.2656323
5 1 499.02 1952.94 2933.11 4827.65 4760.13 0.2374838 0.2453411
6 1 501.31 1949.97 2953.60 4603.49 4702.68 0.2284504 0.2342670
SAVI Class
1 0.4231342 grass
2 0.4501354 grass
3 0.3423050 grass
4 0.3848709 grass
5 0.3562026 grass
6 0.3426532 grass
train_features_clnd_pxls$ID <- NULL
summary(train_features_clnd_pxls)
Blue Green Red RE
Min. : 65.94 Min. : 145.2 Min. : 154 Min. : 356.7
1st Qu.: 425.17 1st Qu.:1480.3 1st Qu.:1545 1st Qu.: 3804.7
Median : 560.81 Median :1900.7 Median :2269 Median : 4763.0
Mean : 607.79 Mean :1917.9 Mean :2298 Mean : 5006.3
3rd Qu.: 770.89 3rd Qu.:2351.1 3rd Qu.:2957 3rd Qu.: 5921.0
Max. :1932.87 Max. :6168.2 Max. :8605 Max. :13865.4
NIR NDVI EVI
Min. : 217.8 Min. :-0.07855 Min. :-0.0729
1st Qu.: 4048.6 1st Qu.: 0.21691 1st Qu.: 0.2374
Median : 5022.9 Median : 0.34773 Median : 0.4421
Mean : 5541.9 Mean : 0.40261 Mean : 0.6292
3rd Qu.: 6573.5 3rd Qu.: 0.61267 3rd Qu.: 1.0327
Max. :15908.8 Max. : 0.88841 Max. : 2.1119
SAVI Class
Min. :-0.1178 bare_soil:2156
1st Qu.: 0.3254 grass :5988
Median : 0.5216 trees :8149
Mean : 0.6039
3rd Qu.: 0.9189
Max. : 1.3325
Now that we have pixel values for each delimited polygon, let's take a look at how these pixels are distributed within each class.
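The head() outputs below come from per-class subsets (val_grass, val_trees, val_bare_soil) whose creation is not shown in the original code; a minimal sketch of how to build them with dplyr, assuming the column names used above:

```r
# split the extracted pixel table by class for per-class inspection and plotting
val_grass     <- dplyr::filter(train_features_clnd_pxls, Class == "grass")
val_trees     <- dplyr::filter(train_features_clnd_pxls, Class == "trees")
val_bare_soil <- dplyr::filter(train_features_clnd_pxls, Class == "bare_soil")

head(val_grass)
head(val_trees)
head(val_bare_soil)
```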
Blue Green Red RE NIR NDVI EVI
1 429.80 1723.23 2483.62 4405.63 4435.60 0.2821098 0.3028237
2 467.11 1978.03 2957.46 5488.30 5493.73 0.3001080 0.3212719
3 437.12 1682.75 2549.77 4000.69 4057.74 0.2282206 0.2344632
4 426.47 1678.09 2679.71 4656.98 4529.61 0.2565984 0.2656323
5 499.02 1952.94 2933.11 4827.65 4760.13 0.2374838 0.2453411
6 501.31 1949.97 2953.60 4603.49 4702.68 0.2284504 0.2342670
SAVI Class
1 0.4231342 grass
2 0.4501354 grass
3 0.3423050 grass
4 0.3848709 grass
5 0.3562026 grass
6 0.3426532 grass
Blue Green Red RE NIR NDVI EVI
5123 835.48 2126.36 2795.86 4694.74 5190.48 0.2998395 0.3812958
5124 682.78 1837.81 2409.82 3817.80 4411.68 0.2934633 0.3639547
5125 515.71 1274.47 1823.18 2926.88 3318.64 0.2908426 0.3598006
5126 1002.83 2880.55 3329.61 6823.73 7349.93 0.3764507 0.5074274
5127 1084.56 3526.16 3443.14 9376.89 11173.64 0.5288785 0.8154784
5128 772.20 2226.59 2301.30 5564.50 6780.29 0.4931945 0.7567093
SAVI Class
5123 0.4497311 trees
5124 0.4401627 trees
5125 0.4362214 trees
5126 0.5646496 trees
5127 0.7932905 trees
5128 0.7397510 trees
Blue Green Red RE NIR NDVI EVI
11363 719.88 1919.11 2177.92 4151.08 4087.10 0.304736465 0.405983246
11364 850.89 2283.77 2924.93 3715.67 3493.87 0.088636497 0.097004139
11365 872.85 2373.72 3104.62 4060.01 4232.78 0.153754740 0.172870271
11366 871.65 2153.15 2925.41 3537.46 2939.61 0.002421136 0.002543767
11367 891.18 2439.16 3134.20 3972.84 3566.33 0.064491932 0.068860180
11368 888.49 2435.71 3186.76 4610.06 4492.41 0.170024881 0.192570391
SAVI Class
11363 0.457068219 bare_soil
11364 0.132944390 bare_soil
11365 0.230616395 bare_soil
11366 0.003631395 bare_soil
11367 0.096730680 bare_soil
11368 0.255020717 bare_soil
par(mfrow = c(1, 1))
# Scatterplot and trend lines of NDVI against EVI for each class
plot(NDVI ~ EVI, data = val_grass, pch = ".", col = "#e5ca81", xlab = "EVI", ylab = "NDVI")
points(NDVI ~ EVI, data = val_trees, pch = ".", col = "#81a581")
points(NDVI ~ EVI, data = val_bare_soil, pch = ".", col = "#7b6f89")
model_grass <- lm(NDVI ~ EVI, data = val_grass)
abline(model_grass, col = "#e5ca81", lwd = 2)
model_trees <- lm(NDVI ~ EVI, data = val_trees)
abline(model_trees, col = "#81a581", lwd = 2)
model_bare_soil <- lm(NDVI ~ EVI, data = val_bare_soil)
abline(model_bare_soil, col = "#7b6f89", lwd = 2)
legend("topright", legend = c("grass", "trees", "bare_soil"),
fill = c("#e5ca81", "#81a581", "#7b6f89"), bg = "white")
Let's visualize the three indices for each mapped feature in a single graph.
train_features_clnd_pxls_indexes <- train_features_clnd_pxls |>
dplyr::select(c(NDVI, EVI, SAVI, Class)) |>
tidyr::pivot_longer(cols = 1:3, values_to = "values", names_to = "indexes")
train_features_clnd_pxls_indexes
# A tibble: 48,879 × 3
Class indexes values
<fct> <chr> <dbl>
1 grass NDVI 0.282
2 grass EVI 0.303
3 grass SAVI 0.423
4 grass NDVI 0.300
5 grass EVI 0.321
6 grass SAVI 0.450
7 grass NDVI 0.228
8 grass EVI 0.234
9 grass SAVI 0.342
10 grass NDVI 0.257
# ℹ 48,869 more rows
# ridgeline density plot of each index per class, with dashed lines marking the per-class means
ggplot(train_features_clnd_pxls_indexes,
aes(x = values,
fill = indexes)) +
geom_density_ridges(aes(y = 0, fill = indexes, color = indexes),
jittered_points = TRUE,
alpha = .5,
scale = 1,
point_shape = "|",
point_size = 3,
position = ggridges::position_points_jitter(width = 0.05, height = 0)) +
geom_vline(data = train_features_clnd_pxls_indexes |>
             dplyr::group_by(Class, indexes) |>
             dplyr::summarise(mean = mean(values), .groups = "drop"),
           aes(xintercept = mean, color = indexes), linetype = "dashed", linewidth = 0.8) +
scale_fill_manual(values = c("#e5ca81", "#7b6f89", "#81a581"), labels = c("EVI", "NDVI", "SAVI")) +
scale_color_manual(values = c("#e5ca81", "#7b6f89", "#81a581"), labels = c("EVI", "NDVI", "SAVI")) +
theme_bw() + labs(x = "", y = "Frequency") +
theme(axis.title = element_text(size = 20, face = "bold", color = "gray20"),
axis.text.x.bottom = element_text(size = 14),
axis.text.y = element_text(size = 14),
legend.direction = "horizontal",
legend.position = "bottom",
legend.title = element_text(size = 16, face = "bold"),
legend.text = element_text(size = 14),
title = element_text(size = 18)) +
facet_wrap(~Class, ncol = 3, scales = "free")
Now let's build the train and test data. At this stage we partition our dataset into two: a training and a testing dataset. We train the model with the training set and evaluate model performance using the test set.
We will use the caret function createDataPartition() to split the data into training and testing sets.
str(train_features_clnd_pxls)
'data.frame': 16293 obs. of 9 variables:
$ Blue : num 430 467 437 426 499 ...
$ Green: num 1723 1978 1683 1678 1953 ...
$ Red : num 2484 2957 2550 2680 2933 ...
$ RE : num 4406 5488 4001 4657 4828 ...
$ NIR : num 4436 5494 4058 4530 4760 ...
$ NDVI : num 0.282 0.3 0.228 0.257 0.237 ...
$ EVI : num 0.303 0.321 0.234 0.266 0.245 ...
$ SAVI : num 0.423 0.45 0.342 0.385 0.356 ...
$ Class: Factor w/ 3 levels "bare_soil","grass",..: 2 2 2 2 2 2 2 2 2 2 ...
# partitioning the data into 80%
rf_train_features_clnd_pxls_indexes <- caret::createDataPartition(train_features_clnd_pxls$Class, list = FALSE, p = 0.8)
Now that we have partitioned 80% of the dataset, we will use that portion as training data (80%) and the remainder as test data (20%). If we check the number of rows in the new datasets, a simple proportion confirms that the data was split just as requested.
# setting 80% for training - use rf_train_features_clnd_pxls_indexes as a vector of row indices selecting which rows of train_features_clnd_pxls go into the training set
rf_train_data <- train_features_clnd_pxls[rf_train_features_clnd_pxls_indexes, ]
str(rf_train_data)
'data.frame': 13036 obs. of 9 variables:
$ Blue : num 467 437 426 501 452 ...
$ Green: num 1978 1683 1678 1950 1740 ...
$ Red : num 2957 2550 2680 2954 2814 ...
$ RE : num 5488 4001 4657 4603 4677 ...
$ NIR : num 5494 4058 4530 4703 4720 ...
$ NDVI : num 0.3 0.228 0.257 0.228 0.253 ...
$ EVI : num 0.321 0.234 0.266 0.234 0.262 ...
$ SAVI : num 0.45 0.342 0.385 0.343 0.379 ...
$ Class: Factor w/ 3 levels "bare_soil","grass",..: 2 2 2 2 2 2 2 2 2 2 ...
# setting the remaining 20% aside for testing - the rows NOT in rf_train_features_clnd_pxls_indexes
rf_test_data <- train_features_clnd_pxls[-rf_train_features_clnd_pxls_indexes, ]
str(rf_test_data)
'data.frame': 3257 obs. of 9 variables:
$ Blue : num 430 499 469 435 487 ...
$ Green: num 1723 1953 1853 1748 1969 ...
$ Red : num 2484 2933 2915 2714 3154 ...
$ RE : num 4406 4828 4898 4657 5250 ...
$ NIR : num 4436 4760 4823 4749 5357 ...
$ NDVI : num 0.282 0.237 0.247 0.273 0.259 ...
$ EVI : num 0.303 0.245 0.254 0.286 0.267 ...
$ SAVI : num 0.423 0.356 0.37 0.409 0.388 ...
$ Class: Factor w/ 3 levels "bare_soil","grass",..: 2 2 2 2 2 2 2 2 2 2 ...
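With both subsets in hand, the 80/20 split can be sanity-checked directly from the row counts (a quick check, not part of the original code):

```r
# proportion of the full dataset in each split
nrow(rf_train_data) / nrow(train_features_clnd_pxls)  # ~ 0.80
nrow(rf_test_data)  / nrow(train_features_clnd_pxls)  # ~ 0.20
```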
Let's check the proportion of each class in the two datasets.
rf_train_data_classCount <- rf_train_data |>
dplyr::group_by(Class) |>
count()
rf_train_data_classCount
# A tibble: 3 × 2
# Groups: Class [3]
Class n
<fct> <int>
1 bare_soil 1725
2 grass 4791
3 trees 6520
(prop.table(table(rf_train_data$Class))*100)
bare_soil grass trees
13.23259 36.75207 50.01534
rf_test_data_classCount <- rf_test_data |>
dplyr::group_by(Class) |>
count()
rf_test_data_classCount
# A tibble: 3 × 2
# Groups: Class [3]
Class n
<fct> <int>
1 bare_soil 431
2 grass 1197
3 trees 1629
Proportion of the response classes that make up the test data.
(prop.table(table(rf_test_data$Class))*100)
bare_soil grass trees
13.23304 36.75161 50.01535
The function trainControl() generates parameters that further control how models are created.
set.seed(124)
cvControl <- caret::trainControl(method = 'cv',
number = 10,
savePredictions = TRUE,
verboseIter = FALSE)
respVar <- c("Class")
predVar <- c("Blue","Green","Red","RE","NIR","NDVI","EVI","SAVI")
rfModel <- caret::train(rf_train_data[,predVar],
rf_train_data[,respVar],
method="rf",
metric = "Kappa",
ntree= 500,
trControl= cvControl,
tuneLength=6,
importance=TRUE)
rfModel
Random Forest
13036 samples
8 predictor
3 classes: 'bare_soil', 'grass', 'trees'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 11733, 11733, 11732, 11732, 11733, 11732, ...
Resampling results across tuning parameters:
mtry Accuracy Kappa
2 0.9281236 0.8787451
3 0.9285834 0.8795842
4 0.9284300 0.8793783
5 0.9295808 0.8813170
6 0.9295036 0.8811901
8 0.9293499 0.8809498
Kappa was used to select the optimal model using the largest value.
The final value used for the model was mtry = 5.
rfPredict <- predict(rfModel, rf_test_data)
head(rfPredict)
[1] grass grass grass grass grass grass
Levels: bare_soil grass trees
confusionMatrix(rfPredict,rf_test_data$Class)
Confusion Matrix and Statistics
Reference
Prediction bare_soil grass trees
bare_soil 294 44 11
grass 98 1130 26
trees 39 23 1592
Overall Statistics
Accuracy : 0.926
95% CI : (0.9165, 0.9348)
No Information Rate : 0.5002
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.8747
Mcnemar's Test P-Value : 6.167e-08
Statistics by Class:
Class: bare_soil Class: grass Class: trees
Sensitivity 0.68213 0.9440 0.9773
Specificity 0.98054 0.9398 0.9619
Pos Pred Value 0.84241 0.9011 0.9625
Neg Pred Value 0.95289 0.9666 0.9769
Prevalence 0.13233 0.3675 0.5002
Detection Rate 0.09027 0.3469 0.4888
Detection Prevalence 0.10715 0.3850 0.5078
Balanced Accuracy 0.83134 0.9419 0.9696
rf_predictions <- terra::predict(train_features_clnd_indexes, rfModel)
# Let's define the palette (using hexadecimal colours)
pal <- c("#7b6f89", "#e5ca81", "#81a581")
tm_shape(rf_predictions) +
  tm_raster(style = "cat",
            labels = c("bare_soil",
                       "grass",
                       "trees"),
            palette = pal,
            title = "Legend") +
  tm_layout(main.title = "",
            main.title.size = .9) +
  tm_layout(legend.outside = TRUE,
            legend.outside.position = "right",
            legend.title.size = .8)
If you see mistakes or want to suggest changes, please create an issue on the source repository.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/souzayuri/souzayuri.github.io, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Souza (2025, Jan. 5). Yuri Souza, Data Scientist: Image Supervised Classification with Random Forest in R. Retrieved from https://souzayuri.github.io/posts/2025-01-05-random_forest_sup_classific_r/
BibTeX citation
@misc{souza2025image,
  author = {Souza, Yuri},
  title = {Yuri Souza, Data Scientist: Image Supervised Classification with Random Forest in R},
  url = {https://souzayuri.github.io/posts/2025-01-05-random_forest_sup_classific_r/},
  year = {2025}
}