spectralR: Spectral reflectance visualisations for user-defined areas

Oleh Prylutskyi, Vladimir Mikryukov, Dariia Shyriaieva

2022-08-05

Introduction

The spectralR package (homepage and source code) aims to obtain, process, and visualize spectral reflectance data for user-defined earth surface classes, so that one can visually explore in which wavelengths the classes differ. User-defined classes might represent different habitat or vegetation types, crops, land uses, landscapes, or other territories. The input is a shapefile with polygons of the surface classes. So far, the single source of spectral data is pixel data from the optical bands of the Sentinel-2 Level-2A satellite mission, obtained through the Google Earth Engine service.

The initial purpose of spectralR development was to quickly and visually explore the differences in spectral reflectance patterns for supervised vegetation (and, more widely, habitat) mapping. While machine learning classification methods, such as RandomForest, typically utilize the full set of available spectral data (satellite bands), we observed that highly similar habitat types (e.g., different types of grasslands, floodplain habitats) might have similar reflectance values during one season and different ones in another, so time-series reflectance data are also needed. Without a preliminary selection of the most discriminating satellite bands for the given habitat types in the given conditions, the models became overfitted and required extensive machine resources for calculation.

The first version of our product was based on three code snippets: two written in R (shapefile preparation and visualization of the results) and one for Google Earth Engine (satellite image preparation and the spectral data sampling itself). That approach, though, required a number of manual operations and downloading big files to a local machine. So we moved to the recent rgee R package, which provides a bridge between R and the Python API for Google Earth Engine. All operations with satellite images now run in the cloud, and the obtained pixel data is visualized locally afterward. Therefore, the most resource-hungry functions do not overload your local machine despite the large amount of input data. However, a stable Internet connection is required to use the API.

To use rgee you need a Google Earth Engine account. If you don’t have one, first register using your Google account.

Depending on your operating system and current Python configuration, running rgee may require the installation of additional R and Python packages. See the following links for instructions, as well as the Quick Start User’s Guide, the official documentation, and the source code of rgee for solving issues that arise during installation and setup.

We strongly encourage you to follow the official rgee installation guide and pay attention to the messages that arrive during installation.

The overall workflow is as follows:

  1. Load the user’s ESRI shapefile containing polygons of user-defined surface classes, as well as a text or numerical field with class names (labels).
  2. Apply rgee functionality to retrieve multi-band pixel data for the class polygons from the Google Earth Engine service.
  3. Visualize the retrieved pixel data locally, using the ggplot2 approach.

Essential requirements: R with the rgee, sf, and ggplot2 packages, a Google Earth Engine account, and a stable Internet connection.

Essential preparations

Install and set up rgee

install.packages('rgee') # install regular version from CRAN, more stable
# remotes::install_github("r-spatial/rgee") # to install development version, more recent

Load the library:

library(rgee)

Run the following just once to complete the installation of the dependencies:

ee_install()

If something goes wrong at this step, see rgee’s installation guide.

Check non-R dependencies

ee_check() 

rgee developers recommend installing the version of the Earth Engine Python API that rgee was tested with, using the following command. Although it is called an “upgrade”, it might actually downgrade your version of earthengine-api. You may up(or down)grade or not; in our tests, newer versions of earthengine-api worked flawlessly.

ee_install_upgrade()

Initialize the Google Earth Engine API for the current session:

ee_Initialize()

At this step, you will be prompted to link your Google Earth Engine account with rgee. Follow the instructions in the R console: you will be redirected to a web browser to log into your Google Earth Engine account and confirm access rights. To finalize authentication, paste the generated authorization code back into the R console.

If everything went well at the last step and you see a message about successful initialization of the API with your GEE username in the console, then congratulations: you have managed to install and configure rgee!

Note that rgee uses the earthengine-api Python package, which is why we need to set up a local Python environment to use rgee.

Unfortunately, you have to repeat the environment setup and re-authorization each time Python gets updated. For actively updated operating systems, like regular versions of Ubuntu, that’s quite annoying. On the other hand, this built-in repair system protects you from accidentally breaking the earthengine Python environment while installing or using other Python-related tools, which is convenient, at least if you are a weak Python user like me. If you get an error message

“The current Python PATH: /home/user/.virtualenvs/rgee/bin/python does not have the Python package ‘earthengine-api’ installed. Are you restarted/terminated your R session after install miniconda or run ee_install()?”

then follow the instructions that appear in the R console; they will help you delete your previous configuration and set it up again.

Installation of the other dependencies

spectralR was designed on top of the rgee, sf, and ggplot2 R packages, which you should install and set up before using spectralR. While the other dependencies may be installed automatically during the spectralR installation process, it is strongly recommended to also check that the packages geojsonio, dplyr, reshape2, tibble, and tidyr are installed and work properly.

Installation of spectralR

So far, spectralR can only be installed from GitHub, although we are planning to land it on CRAN soon.

library(remotes)
install_github("olehprylutskyi/spectralR")

We offer two datasets to illustrate the functionality of spectralR through selected use cases. The first dataset, ‘Homilsha Forests’, illustrates “small data”: a small-size area and a small number of polygons (8) belonging to a few categories (5 land-cover classes). The area belongs to the Homilsha Forests National Park, situated in the Kharkiv region (Eastern Ukraine). The polygons were created manually in GIS software. Habitat borders and assignments were based on a field survey and adjusted according to the satellite imagery. The land cover categories correspond to the CORINE classification (level 3) (available via https://land.copernicus.eu/), but some of them have shortened names: cropland = 2.1.1 Non-irrigated arable land; coniferous forest = 3.1.2 Coniferous forest; meadow = 3.2.1 Natural grasslands; reed = 4.1.1 Inland marshes; water = 5.1.2 Water bodies.

The second dataset, ‘Buzkyi Gard’, represents “large data”: a large area with complicated topography, hundreds of polygons, and dozens of classes. End users should notice little difference in the workflow, but under the hood we implemented two different algorithms for “small” and “large” spatial data, which will be explained later.

Use case 1. Basic habitat types of ‘Homilsha Forests’ National Park.

Environment preparation

# Reset R's brain before the new analysis session started. It will erase all the objects stored in 
# R memory, while keeping loaded libraries.
rm(list = ls())

# Load required packages
library(tidyverse)
library(rgee)
library(sf)
library(geojsonio)
library(reshape2)
library(spectralR)

Upload and process vector data

Function prepare.vector.data takes a shapefile with polygons of different surface classes (habitats, crops, vegetation, etc.) and returns a ready-for-sampling sf object. One should specify the shapefile name (the file should be in the working directory; absolute paths were not tested) and the name of the field that contains the class labels. The function extracts the geometries of all polygons, marks them with the custom labels (‘label’), and automatically assigns an integer class ID (‘class’) to each entry. The latter variable is required because the Google Earth Engine sampler accepts only numerical class IDs, so don’t delete either field. Don’t worry: the resulting dataframe will be marked with your custom labels, not hard-to-remember numbers.

While preparing a shapefile with custom polygons in desktop GIS software (e.g., QGIS), it is highly recommended to make sure that all polygon geometries are valid and that every polygon has a non-empty class label.
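Such checks can also be done in R with the sf package before calling prepare.vector.data. A minimal sketch, in which the file name my_polygons.shp is a placeholder for your own shapefile:

```r
library(sf)

# Read the shapefile (placeholder name) and count valid/invalid geometries
polygons <- st_read("my_polygons.shp", quiet = TRUE)
table(st_is_valid(polygons))

# Repair invalid geometries (e.g., self-intersections) and save a clean copy
polygons_valid <- st_make_valid(polygons)
st_write(polygons_valid, "my_polygons_valid.shp", delete_dsn = TRUE)
```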

# Extract polygons from shapefile and prepare sf object with proper structure
sf_df <- prepare.vector.data(system.file("extdata/test_shapefile.shp", package = "spectralR"), "veget_type")

Explore the resulting spatial object:

head(sf_df)
#> Simple feature collection with 6 features and 2 fields
#> Geometry type: POLYGON
#> Dimension:     XY
#> Bounding box:  xmin: 36.34492 ymin: 49.55411 xmax: 36.54525 ymax: 49.66422
#> Geodetic CRS:  WGS 84
#>               label class                       geometry
#> 1 coniferous_forest     1 POLYGON ((36.35601 49.62186...
#> 2             water     5 POLYGON ((36.50612 49.57447...
#> 3              reed     4 POLYGON ((36.43128 49.59587...
#> 4          cropland     2 POLYGON ((36.42945 49.62938...
#> 5            meadow     3 POLYGON ((36.34492 49.64281...
#> 6            meadow     3 POLYGON ((36.37633 49.66116...

The example above uses an internal test shapefile. To use your own data, put all the shapefile’s component files into your working directory and use the following syntax (uncomment the row and change the file and field names before executing):

# sf_df <- prepare.vector.data("your-shapefile-within-working-directory-name.shp", "name-of-the-field-with-class-labels")

Obtain pixel values from the Sentinel-2 Level-2A image collection

Function get.pixel.data is the heart of spectralR. It takes the sf polygon object obtained at the previous step and returns a data frame with brightness values for each pixel intersecting the polygons, for each optical band of the Sentinel-2 sensor, marked with the surface class label from the polygons. get.pixel.data starts by converting the sf object into a GEE feature collection. It then prepares a satellite image for pixel sampling, performs the sampling, and finally exports the resulting object back into an R data frame (non-spatial now).

One of the trickiest and most error-prone steps of this procedure is the conversion between a GEE feature collection and an R sf object. rgee implements three ways to do so.

The first one is the most straightforward and quick, but it is usable only for small data (fewer than 15,000 entries, or local files under 1.5 MB). The other two require intermediate storage to hold your data between conversions (because Earth Engine is a Google service, it is seamlessly integrated with both Google Drive and Google Cloud Storage).

In spectralR, we use the first two methods. get.pixel.data assesses the size of your sf object or feature collection, then activates either the getInfo or the getInfo_to_asset pathway. If your data is considered “small”, getInfo will be used, and you will notice nothing special. But if your data is larger than our (arbitrary) threshold, the getInfo_to_asset method will be activated, and you may be prompted to authorize with Google Drive and allow rgee to access your GDrive files and folders. In that case, please follow the instructions in the R console and then in your web browser. You only need to do this once, when you perform your first large query; your credentials will be stored in your local earthengine environment.

After authorization and first use, a folder named rgee_backup will be created in your Google Drive storage, where all the intermediate files will be stored. If you run out of storage, get.pixel.data won’t be able to use your Google Drive as intermediate storage, and you will receive an error. Although get.pixel.data overwrites the objects each time it is launched, we strongly recommend cleaning the rgee_backup folder in your Drive from time to time, either manually or using the following command from the rgee package:

ee_clean_container(name = "rgee_backup", type = "drive", quiet = FALSE)

To use the get.pixel.data function, we need to specify a few values: the sf object from the previous step, the start and end days of the satellite image collection, the maximum acceptable cloudiness of the images, and the scale (pixel size, in metres).

The resulting pixel data will be saved within the working directory and can be loaded during following sessions.

Note 1. A particular satellite image is usually not ready for instant sampling: it contains clouds, cloud shadows, and aerosols, and may not cover all of the territory of interest. Another issue is that each particular pixel differs slightly in reflectance between images taken on different days, due to differences in atmospheric conditions and the angle of sunlight at the moments the images were taken. Google Earth Engine has built-in algorithms for image pre-processing, atmospheric correction, and mosaicking, which allow one to obtain a ready-to-use, rectified image. The approach used in spectralR is to take, for each pixel and each of the 10 optical bands, the median value across several images, thereby making a composite image. To define the set of images over which the median is calculated, we set a timespan for the image collection. The Sentinel-2 apparatus takes a picture once every five days, so if you set up a month-long timespan, you can expect each pixel value to be calculated from 5 to 6 values (remember, some images might turn out to be unacceptably cloudy).
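The median-compositing idea can be sketched in base R. For a single pixel in a single band, with hypothetical values (one per overpass within a month-long timespan), the composite is simply the median, which is robust to the occasional cloudy outlier:

```r
# Reflectance of one pixel in one band over a month-long timespan
# (hypothetical values, one per ~5-day overpass; 0.31 is a cloudy outlier)
pixel_b4 <- c(0.034, 0.036, 0.31, 0.033, 0.035, 0.034)

median(pixel_b4)  # 0.0345: the composite value, unaffected by the cloud
```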

Note 2. The finest resolution for Sentinel-2 data is 10 m, while using a higher scale_value decreases the required computational resources and the size of the resulting data frame. Although the sampling of satellite data is performed in the cloud, GEE itself places some limitations on geocalculations. If you are about to sample large areas, consider setting a higher ‘scale’ value (e.g., 100, 1000). Read more in the GEE best practices.

# Get pixel data
reflectance = get.pixel.data(sf_df, "2019-05-15", "2019-06-30", 10, 100)

# Save pixel data for further sessions
save(reflectance, file = "reflectance_test_data")

Here we chose a 100 m scale (the pixel size of the resulting imagery is 100×100 m), which resulted in a sampling dataset of 2060 rows. A finer pixel size would result in a larger sampling dataset, which would require intermediate storage (see use case 2).

Let’s have a look at the resulting data:

head(reflectance)
#>       B11     B12      B2      B3      B4      B5      B6      B7      B8
#> 1 0.11700 0.06375 0.02790 0.04240 0.03460 0.06740 0.15810 0.18520 0.19275
#> 2 0.11065 0.05955 0.02720 0.04155 0.03320 0.06475 0.15590 0.18180 0.18900
#> 3 0.10665 0.05815 0.02640 0.04085 0.03210 0.06365 0.15095 0.17770 0.18440
#> 4 0.11265 0.06320 0.02945 0.04350 0.03540 0.06735 0.16075 0.18935 0.19675
#> 5 0.10885 0.05875 0.02720 0.04130 0.03315 0.06435 0.15470 0.18120 0.18840
#> 6 0.11825 0.06630 0.02800 0.04300 0.03700 0.06905 0.15030 0.17590 0.18225
#>       B8A             label
#> 1 0.19995 coniferous_forest
#> 2 0.19565 coniferous_forest
#> 3 0.19280 coniferous_forest
#> 4 0.20490 coniferous_forest
#> 5 0.19500 coniferous_forest
#> 6 0.19050 coniferous_forest

We have a data frame whose number of rows equals the number of sampled “pixels” of the satellite image, with 10 variables holding the reflectance values for each optical band of Sentinel-2. Each sampled pixel carries the surface class label of the user’s polygon it intersected.

Note: spectralR returns true reflectance values (between 0 and 1), back-transforming GEE’s values (which are multiplied by 10,000 for computational convenience).
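For reference, if you ever fetch raw Sentinel-2 Level-2A values from GEE yourself, the back-transformation is a plain division by 10,000 (a sketch with a made-up digital number):

```r
dn <- 1935                # hypothetical raw digital number from GEE
reflectance_value <- dn / 10000
reflectance_value         # 0.1935, a true reflectance between 0 and 1
```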

Visualize results

First of all, one should explore the quality and completeness of the obtained pixel data.

# load(file = "./reflectance_test_data") # restore previously saved pixel data

summary(factor(reflectance$label)) # how many pixels in each class? 
#> coniferous_forest          cropland            meadow              reed 
#>              1270               165                35               144 
#>             water 
#>               446

For reliable results, it is recommended to keep the sizes of the surface classes similar. Classes represented by only a few sampled pixels should be excluded from further analysis.
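One way to drop under-represented classes is a simple dplyr filter. This is a sketch; the threshold of 50 pixels is an arbitrary choice, so adjust it to your data:

```r
library(dplyr)

min_pixels <- 50  # arbitrary minimum class size

reflectance_filtered <- reflectance %>%
  group_by(label) %>%
  filter(n() >= min_pixels) %>%  # keep only sufficiently sampled classes
  ungroup()
```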

Visual overview of pixel data

# Number of spectral values for different classes
ggplot(reflectance, aes(x=label))+
  geom_bar()+
  labs(x = 'Classes of surface', y = 'Number of pixels',
       title = "Total pixel data for different classes of surface",
       caption = 'Data: Sentinel-2 Level-2A')+
  theme_minimal()

Function spectral.curves.plot transforms the data and plots smoothed curves for each surface class, using ggplot2’s geom_smooth() aesthetics. For large data of thousands of rows (and we need large data for reliable conclusions!), ggplot2 uses the GAM method for drawing the trendline.

Depending on the data size, the processing may take some time.

# Make a ggplot object
p1 <- spectral.curves.plot(reflectance)
#> Joining, by = "variable"

# Default plot
p1

Since the output of spectral.curves.plot is a ggplot object, we can apply any tools provided by the ggplot2 package to make it more visually pleasing.

# Customized plot
p1+
  labs(x = 'Wavelength, nm', y = 'Reflectance',
       colour = "Surface classes",
       fill = "Surface classes",
       title = "Spectral reflectance curves for different surface classes",
       caption = 'Data: Sentinel-2 Level-2A')+
  theme_minimal()

You can save the plot as a *.png file (or in other standard graphical formats) using the ggsave or png() functions.

ggsave("Spectral_curves_usecase1.png", width=16, height=12, unit="cm", dpi=300)

Function stat.summary.plot makes a plot with a statistical summary of the reflectance values (mean, mean − standard deviation, mean + standard deviation) for the defined surface classes. Given reflectance data as input, the function returns a ggplot2 object with basic visual aesthetics. The default aesthetics are a line with a statistical summary for each satellite band (geom_line + geom_pointrange).

Wavelength values (nm) are derived from the known mean value for each optical band of the Sentinel-2 sensor.
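For orientation, the approximate central wavelengths of the Sentinel-2 optical bands are:

```r
# Approximate central wavelengths (nm) of the Sentinel-2 optical bands;
# these values define the x-axis positions on the reflectance plots
sentinel2_wavelengths <- c(
  B2 = 490,  B3 = 560,  B4 = 665,  B5 = 705,  B6 = 740,
  B7 = 783,  B8 = 842,  B8A = 865, B11 = 1610, B12 = 2190
)
```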

# Make a ggplot object
p2 <- stat.summary.plot(reflectance)
#> Joining, by = "variable"
#> Joining, by = "variable"

# Default plot
p2

Add a touch of customization.

# Customized plot
p2 + 
  labs(x = 'Sentinel-2 bands', y = 'Reflectance',
       colour = "Surface classes",
       title = "Reflectance for different surface classes",
       caption='Data: Sentinel-2 Level-2A\nmean ± standard deviation')+
  theme_minimal()

Save the plot as a *.png file

ggsave("Statsummary_usecase1.png", width=16, height=12, unit="cm", dpi=300)

Function violin.plot helps to visualize the reflectance as violin plots for each surface class per satellite band. It takes reflectance data as input and returns a ggplot2 object with basic visual aesthetics. The default aesthetic is geom_violin.

# Make a ggplot object
p3 <- violin.plot(reflectance)

# Customized plot
p3 + 
  labs(x='Surface class',y='Reflectance',
       fill="Surface classes",
       title = "Reflectance for different surface classes",
       caption='Data: Sentinel-2 Level-2A')+
  theme_minimal()


# Save the plot as a *.png file
ggsave("Violins_usecase1.png", width=22, height=16, unit="cm", dpi=300)

Use case 2. Habitats of Buzkyi Gard National Park, Mykolaiv region, Ukraine.

Buzkyi Gard National Park is located in the Southern Bug river valley (south-western Ukraine, 31.08325, 47.90720). The National Park is characterized by unique steppe-zone landscapes with diverse vegetation and flora (Shyriaieva et al., 2022). Natural habitats of high conservation value, i.e., a wide range of dry grasslands, rocky outcrops, and steppe forests, are concentrated in the river valleys and surrounded by croplands. The habitat types were defined using precise vegetation data (phytosociological relevés with GPS locations) and field mapping data for the artificial habitats. For the classification we used the revised version of the EUNIS expert system (version 2021-06-01), with additions from the old EUNIS 2007 for non-revised habitat groups and from the National Habitat Catalogue of Ukraine. As a result, the dataset contains 380 hand-drawn polygons of 26 different EUNIS habitat types.

Environment preparation

# Reset R's brain before the new analysis session started. It will erase all the objects stored in 
# R memory while keeping loaded libraries.
rm(list = ls())

# Load required packages
library(tidyverse)
library(rgee)
library(sf)
library(geojsonio)
library(reshape2)
library(spectralR)

Upload and process vector data

# Extract polygons from shapefile and prepare sf object with proper structure
sf_df <- prepare.vector.data(system.file("extdata/SouthernBuh-habitats_shapefile.shp", package = "spectralR"), "eunis_2020")

Explore the resulting spatial object:

head(sf_df)
#> Simple feature collection with 6 features and 2 fields
#> Geometry type: POLYGON
#> Dimension:     XY
#> Bounding box:  xmin: 30.96412 ymin: 47.91221 xmax: 31.06714 ymax: 48.01866
#> Geodetic CRS:  WGS 84
#>   label class                       geometry
#> 1   R1B    11 POLYGON ((31.05245 47.92259...
#> 2   R1B    11 POLYGON ((31.06686 47.91231...
#> 3   U33    22 POLYGON ((30.96416 48.01731...
#> 4   R1B    11 POLYGON ((30.96983 48.01847...
#> 5   R12     9 POLYGON ((30.97225 48.01432...
#> 6   U33    22 POLYGON ((30.96804 48.01308...

Get reflectance values

reflectance = get.pixel.data(sf_df, "2019-05-15", "2019-06-30", 10, 10)

# save pixel data for further sessions
save(reflectance, file = "reflectance_BG_data") 

The habitats of the Southern Buh River valley are often represented by small areas (especially rocky outcrops, grasslands, etc.). Therefore, many polygons are relatively small (up to 1 ha) and have irregular shapes. We recommend decreasing the scale value (pixel size) in such cases, approaching the minimum value of 10 m. On the other hand, that requires more computational power and processing time.
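To anticipate how large the sampling dataset will be before launching a query, you can roughly divide the total polygon area by the pixel area. A sketch, assuming the sf_df object from above (for WGS 84 data, sf computes geodetic areas in square metres):

```r
library(sf)

scale_value <- 10                      # pixel size, m
total_area <- sum(st_area(sf_df))      # total polygon area, m^2
approx_pixels <- as.numeric(total_area) / scale_value^2
approx_pixels                          # rough expected number of sampled pixels
```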

Quantitative overview of pixel data

summary(factor(reflectance$label)) # how many pixels in each class?
#>  C2.2  C2.3    J1  J3.2  J4.2    J5   Q51   Q53   R12   R1A   R1B   R21   R36 
#>   741  3054 11204   983   447    87  1276     3   659    52  1448   310    49 
#>   S35   S36   S91   T11   T13   T19   T1E   T1H   U33   V11   V34   V38   X18 
#>   694   220     4   829   331   294    72 14372   165 48778    18  4717   130

With “big data”, spectralR uses your Google Drive as intermediate storage between R and GEE. Please make sure you have enough free space in the Drive, and purge the rgee_backup directory manually from time to time.

Spectral reflectance curves for different habitat types

# Create basic ggplot object
p1 <- spectral.curves.plot(reflectance)
#> Joining, by = "variable"
#> Warning in mask$eval_all_mutate(quo): NAs introduced by coercion

# Plotting
p1 +
  labs(x = 'Wavelength, nm', y = 'Reflectance',
       colour = "Habitat types",
       fill = "Habitat types",
       title = "Spectral reflectance curves for different habitat types\nSouthern Bug National park, Ukraine",
       caption = 'Data: Sentinel-2 Level-2A')+
  theme_minimal()

The plot above, unlike the one from use case 1, looks messy and is hard to read because of the large number of surface classes (habitat types). To make the resulting plot easier to comprehend, both spectral.curves.plot and stat.summary.plot have an argument target_classes, which takes a list of the surface classes that should be highlighted, while the others are turned gray as a background. The default value is NULL, in which case all classes are colourized, as above. Let’s give it a try.

p1_1 <- spectral.curves.plot(
  data = reflectance, 
  target_classes = list("C2.2", "C2.3", "J5")
  )
#> Joining, by = "variable"
#> Warning in mask$eval_all_mutate(quo): NAs introduced by coercion

# Plotting
p1_1 +
  labs(x = 'Wavelength, nm', y = 'Reflectance',
       colour = "Habitat types",
       fill = "Habitat types",
       title = "Spectral reflectance curves for different habitat types\nSouthern Bug National park, Ukraine",
       caption = 'Data: Sentinel-2 Level-2A')+
  theme_minimal()

Needless to say, the user is also able to filter the input data to obtain what they want.

Suppose we need only grasslands in our reflectance plot (habitat codes starting with R) and want to highlight the differences and similarities in the spectral reflectance of continental dry grasslands (true steppe, type R1B).

# Filter reflectance data
grasslands <- reflectance %>%
  filter(label %in% c("R12", "R1A", "R1B", "R21", "R36"))

# Construct a ggplot object with basic aestetics 
p1_2 <- spectral.curves.plot(
  data = grasslands, 
  target_classes = list("R1B")
  )
#> Joining, by = "variable"
#> Warning in mask$eval_all_mutate(quo): NAs introduced by coercion

# Plotting
p1_2 +
  labs(x = 'Wavelength, nm', y = 'Reflectance',
       colour = "Habitat types",
       fill = "Habitat types",
       title = "Spectral reflectance curves for different types of\ngrassland habitats, Southern Bug National park, Ukraine",
       caption = 'Data: Sentinel-2 Level-2A')+
  theme_minimal()

Statistical summary for each habitat type

# Create basic ggplot object
p2 <- stat.summary.plot(reflectance)
#> Joining, by = "variable"
#> Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
#> Joining, by = "variable"

# Plotting
p2 + 
  labs(x = 'Sentinel-2 bands', y = 'Reflectance',
       colour = "Habitat types",
       title = "Reflectance for different habitat types\nSouthern Bug National park, Ukraine",
       caption='Data: Sentinel-2 Level-2A\nmean ± standard deviation')+
  theme_minimal()

In the same way, create a plot that highlights only tall-helophyte beds (type Q51), compared to two other common floodplain habitat types (R21 Mesic lowland pasture, T11 Riparian forest).

# Filter reflectance data
floodplain <- reflectance %>%
  filter(label %in% c("Q51", "R21", "T11"))

# Create basic ggplot object
p2_1 <- stat.summary.plot(
  data = floodplain, 
  target_classes = list("Q51")
)
#> Joining, by = "variable"
#> Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
#> Joining, by = "variable"

# Plotting
p2_1 + 
  labs(x = 'Sentinel-2 bands', y = 'Reflectance',
       colour = "Habitat types",
       title = "Reflectance of tall-helophyte beds,\nSouthern Bug National park, Ukraine",
       caption='Data: Sentinel-2 Level-2A\nmean ± standard deviation')+
  theme_minimal()

Create violin plots for the given habitat types

# Create basic ggplot object
p3 <- violin.plot(reflectance)

# Plotting
p3 + 
  labs(x = "Habitat type", y = "Reflectance",
       fill = "Habitat types",
       title = "Reflectance for different habitat types\nSouthern Bug National park, Ukraine",
       caption ='Data: Sentinel-2 Level-2A')+
  theme_minimal()