The spectralR package is aimed at obtaining, processing, and visualizing spectral reflectance data for user-defined earth surface classes, to visually explore in which wavelengths the classes differ. User-defined classes might represent different habitat or vegetation types, crops, land uses, landscapes, or other territories. The input should be a shapefile with polygons of surface classes. So far, the single source of spectral data is pixel data from the optical bands of the Sentinel-2 Level-2A satellite mission, obtained through the Google Earth Engine service.
The initial purpose of spectralR development was to quickly and visually explore the differences in spectral reflectance patterns for supervised vegetation (and, more broadly, habitat) mapping. While machine learning classification methods, such as Random Forest, typically utilize the full set of available spectral data (satellite bands), we observed that highly similar habitat types (e.g., different types of grasslands, floodplain habitats) might have similar reflectance values in one season and different values in another, so time-series reflectance data are also needed. Without a preliminary selection of the most discriminating satellite bands for the given habitat types in the given conditions, the models become overfitted and require extensive machine resources for calculation.
The first version of our product was based on three code snippets: two written in R (shapefile preparation and visualization of the results) and one in Google Earth Engine for satellite image preparation and the spectral data sampling itself. That approach, though, required a number of manual operations and downloading big files to a local machine. So we moved to the recent rgee R package, which provides a bridge between R and the Python API for Google Earth Engine. All the operations with satellite images now run in the cloud, and the obtained pixel data is visualized locally afterwards. Therefore, the most resource-hungry functions do not overload your local machine despite the large amount of input data. However, a stable Internet connection is required for using the API.
To use rgee, you should have a Google Earth Engine account. If you don’t, first register using your Google account.
Depending on the operating system you use and your current Python
configuration, it may require the installation of additional R and
Python packages for running
rgee. See the following links
for instructions. See also Quick
Start User’s Guide, Official
documentation, and the source code of rgee for
solving issues that arise during installation and setting up.
We strongly encourage you to follow the official installation guide and pay attention to the messages that appear during installation.
The overall workflow is as follows: spectralR uses rgee functionality to retrieve multi-band pixel data for class polygons from the Google Earth Engine service. You will need:
A stable Internet connection (for using the API)
An installed and correctly pre-configured Python environment (v. 3.5 or above)
An active Google Earth Engine account
install.packages('rgee')   # install regular version from CRAN, more stable
# remotes::install_github("r-spatial/rgee")   # to install development version, more recent
Load the library:
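In our case that is:

```r
library(rgee)
```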
It is necessary to do this just once, to complete the installation of the dependencies:
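rgee provides a helper for this step; a minimal sketch using its documented ee_install() function:

```r
# Installs a dedicated Python environment with the earthengine-api package
# (run once; restart the R session afterwards)
rgee::ee_install()
```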
If something goes wrong at this step, see the “Check non-R dependencies” section of the rgee documentation.
rgee developers recommend installing the version of the Earth Engine Python API which rgee was tested with, using the following command. Although the command says “upgrade”, it might actually downgrade your version of earthengine_api. You may up(or down)grade or not; in our tests, newer versions of earthengine_api worked flawlessly.
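The command in question is rgee's documented ee_install_upgrade() helper:

```r
# Pins the earthengine-api Python package to the version rgee was tested with
rgee::ee_install_upgrade()
```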
Initialize the Google Earth Engine API for the current session:
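The initialization itself is done with rgee's ee_Initialize():

```r
library(rgee)
ee_Initialize()
```

If you plan to use Google Drive as mediating storage for large data (see below), `ee_Initialize(drive = TRUE)` requests the Drive credentials at the same time.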
At this step, you will be prompted to link your Google Earth Engine account with rgee. Follow the instructions in the R console, and you will be redirected to a web browser to log into your Google Earth Engine account and confirm access rights. During this step, you should paste the generated authorization code into the R console to finalize the authentication.
If everything is OK at the last step and you see a message in the console about successful initialization of the API with your GEE username, congratulations, you have managed to install and configure rgee.
Pay attention that rgee relies on the earthengine_api package, which depends on Python. That’s why we need to set up a local Python environment for using rgee.
Unfortunately, you have to repeat the environment setting and
re-authorization each time Python gets updates. For actively updated
operating systems, like regular versions of Ubuntu, that’s quite
annoying. On the other hand, this built-in repairing system protects you
from accidentally breaking the
environment during installation or using any other Python-related tools,
which is convenient, at least if you are a weak Python user like me. If
you get an error message
“The current Python PATH: /home/user/.virtualenvs/rgee/bin/python does not have the Python package "earthengine-api" installed. Are you restarted/terminated your R session after install miniconda or run ee_install()?”
then follow the instructions that appear in the R console; they will help you delete your previous configuration and set it up again.
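A sketch of that reset, assuming the current rgee helpers (ee_clean_pyenv() removes the existing rgee Python environment configuration):

```r
# Assumption: ee_clean_pyenv() / ee_install() match your installed rgee version
rgee::ee_clean_pyenv()   # delete the previous Python configuration
rgee::ee_install()       # set it up again (restart R afterwards)
rgee::ee_Initialize()    # re-authorize
```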
spectralR was designed on top of the rgee and ggplot2 R packages, which you should install and set up before using spectralR. While the other dependencies may be installed automatically during the spectralR installation process, it is strongly recommended to also check that these packages are installed and work properly.
So far, spectralR can be installed only from GitHub sources, although we are planning to land it on CRAN soon.
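A sketch of the installation; the GitHub repository path below is an assumption, so verify it against the package homepage:

```r
# Hypothetical repository path -- check the spectralR homepage for the actual one
remotes::install_github("olehprylutskyi/spectralR")
```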
We offer two datasets to illustrate the functionality of
spectralR through the selected use-cases. The first
dataset, namely ‘Homilsha Forests’, illustrates the “small data”:
small-size area, a small number of polygons (8 polygons) which belong to
a few categories (5 land-cover classes). The area belongs to the
Homilsha Forests National Park, situated in the Kharkiv region (Eastern
Ukraine). The polygons were created manually in the GIS software.
Habitat borders and assignments were based on a field survey and
adjusted according to the satellite imagery. The land cover categories
correspond to the CORINE classification (level 3, available via https://land.copernicus.eu/), but some of them have
shortened names: cropland = 2.1.1 Non-irrigated arable land; coniferous
forest = 3.1.2 Coniferous forest; meadow = 3.2.1 Natural grasslands;
reed = 4.1.1 Inland marshes; water = 5.1.2 Water bodies.
The second dataset, ‘Buzkyi Gard’, represents “large data”: a large area with complicated topography, hundreds of polygons, and dozens of classes. End users should not notice large differences in the workflow, but under the hood we implemented two different algorithms for “small” and “large” spatial data, which will be explained later.
# Reset R's brain before the new analysis session starts. It will erase all the
# objects stored in R memory, while keeping loaded libraries.
rm(list = ls())

# Load required packages
library(tidyverse)
library(rgee)
library(sf)
library(geojsonio)
library(reshape2)
library(spectralR)
prepare.vector.data takes a shapefile with
polygons of different classes of surface (habitats, crops, vegetation,
etc.) and retrieves the ready-for-sampling
sf object. One should specify the shapefile name (the file should be within the working directory; using absolute paths was not tested) and the name of the field that contains class labels. The function extracts geometries
of all polygons, marks them with custom labels (‘label’), and
automatically assigns integer class ID (‘class’) for each entry. The
last variable is required because the Google Earth Engine sampler respects only a numerical class ID, so don’t delete any field! Don’t panic: the resulting dataframe will be marked according to your custom labels, not hard-to-memorize numbers.
While preparing a shapefile with custom polygons using desktop GIS software (e.g., in QGIS), it is highly recommended to follow these recommendations:
if possible, draw polygons in homogeneous landscapes and avoid class mixture;
keep geometries simple, if possible. Avoid multipolygons, holes inside polygons, etc.;
tiny polygons (same size as satellite imagery resolution or lesser) tend to produce a stronger “edge effect”;
few big polygons are easier to process than a lot of small ones;
GEE has its own memory limitation, which may result in extended processing time.
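The geometry-related recommendations above can be checked in R before sampling; a minimal sketch with the sf package, assuming a hypothetical file my_polygons.shp in the working directory:

```r
library(sf)

polygons <- st_read("my_polygons.shp")   # hypothetical file name

table(st_geometry_type(polygons))   # watch for MULTIPOLYGON entries
all(st_is_valid(polygons))          # TRUE means no broken geometries
summary(st_area(polygons))          # spot polygons close to the pixel size
```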
# Extract polygons from shapefile and prepare sf object with proper structure
sf_df <- prepare.vector.data(
  system.file("extdata/test_shapefile.shp", package = "spectralR"),
  "veget_type"
)
Explore the resulting spatial object:
head(sf_df)
#> Simple feature collection with 6 features and 2 fields
#> Geometry type: POLYGON
#> Dimension:     XY
#> Bounding box:  xmin: 36.34492 ymin: 49.55411 xmax: 36.54525 ymax: 49.66422
#> Geodetic CRS:  WGS 84
#>               label class                       geometry
#> 1 coniferous_forest     1 POLYGON ((36.35601 49.62186...
#> 2             water     5 POLYGON ((36.50612 49.57447...
#> 3              reed     4 POLYGON ((36.43128 49.59587...
#> 4          cropland     2 POLYGON ((36.42945 49.62938...
#> 5            meadow     3 POLYGON ((36.34492 49.64281...
#> 6            meadow     3 POLYGON ((36.37633 49.66116...
The example above uses an internal test shapefile. To use your own data, put all the shapefile components into your working directory and use the following syntax (uncomment the row and change the file names before executing):
# sf_df <- prepare.vector.data("your-shapefile-within-working-directory-name.shp", "name-of-the-field-with-class-labels")
get.pixel.data is the heart of
spectralR. It takes the sf polygon object obtained at the previous step and retrieves a data frame with brightness values for each pixel intersecting the polygons, for each optical band of the Sentinel-2 sensor, marked according to the label of the surface class from the sf object.
get.pixel.data starts with the conversion of the sf object into a GEE feature collection. Then it prepares a satellite image for pixel sampling, performs the sampling, and finally exports the resulting object back into an R data frame (non-spatial).
One of the trickiest and most error-prone steps of this procedure is the conversion between a GEE feature collection and an R data frame. rgee implements three ways to do so:
through the JSON translator (the getInfo pathway);
using Google Drive;
using Google Cloud Storage.
The first one is the most straightforward and quick but usable only for small data (less than 15000 entries, or 1.5 MB size local files). The other two require transient storage to store your data between conversions (because Earth Engine is Google’s service and seamlessly integrated with both Google Drive and Google Cloud Storage).
In spectralR, we use the first two methods. get.pixel.data assesses the size of your sf object or feature collection, then activates either the getInfo or the getInfo_to_asset pathway. If your data is considered to be “small”, getInfo will be used, and
you will notice nothing special. But, if your data is larger than our
arbitrary threshold, the method
getInfo_to_asset will be
activated, and you may be prompted to authorize rgee to access your Google Drive files and folders. In such a case, please follow the instructions in the R console and then in your web
browser. You only need to do this once, when you perform your first large query: your credentials will be stored on your local machine.
After authorization and first use, a folder named rgee_backup will be created in your Google Drive storage, where all the intermediate files will be stored. If you run out of storage, rgee won’t be able to use your Google Drive as mediating storage, and you will receive an error. Though intermediate objects are cleaned up each time the function is launched, we strongly recommend cleaning the rgee_backup folder in your Drive from time to time, either manually or using the following command from the R console:
ee_clean_container(name = "rgee_backup", type = "drive", quiet = FALSE)
To use the get.pixel.data function, we need to specify:
polygons of surface classes as an sf object, prepared at the previous step;
starting day for Sentinel image collection in “YYYY-MM-DD” format. See Note 1 below;
final day for Sentinel image collection, in “YYYY-MM-DD” format;
cloud threshold (maximum percent of cloud-covered pixels per image) by which individual satellite images will be filtered;
a scale of the resulting satellite images in meters (pixel size). See Note 2 below.
The resulting pixel data will be saved within a working directory and can be loaded during the following sessions.
Note 1. Particular satellite imagery is usually not
ready for instant sampling – it contains clouds, cloud shadows,
aerosols, and may not cover all the territory of your interest. Another
issue is that each particular pixel slightly differs in reflectance
between images taken on different days due to differences in atmospheric
conditions and angle of sunlight at the moments images were taken. The
Google Earth Engine has built-in algorithms for image pre-processing, atmospheric correction, and mosaicking, which allow one
to obtain a ready-to-use, rectified image. The approach used in
spectralR is to find a median value for each pixel between
several images within each of 10 optical bands, thereby making a
composite image. To define the set of images between which we are going to calculate the median, we should set a timespan of image collection. The Sentinel-2 apparatus takes a picture once every five days, so if you set up a month-long timespan, you can expect that each pixel value will be calculated based on 5 to 6 values (remember, some images might be filtered out by the cloud threshold).
Note 2. The finest resolution for Sentinel data is 10 m, while using a higher scale_value decreases the required computational resources and the size of the resulting data frame. Although sampling satellite data is performed in the cloud, there are some limitations on geocalculations placed by GEE itself. If you are about to sample large areas, consider setting a higher ‘scale’ value (e.g., 100, 1000). Read more in GEE best practices.
# Get pixel data
reflectance <- get.pixel.data(sf_df, "2019-05-15", "2019-06-30", 10, 100)

# Save pixel data for further sessions
save(reflectance, file = "reflectance_test_data")
Here we chose a 100 m scale (the pixel size for the resulting imagery is 100x100 m), which resulted in a sampling dataset of 2060 rows. A finer pixel size would result in a larger sampling dataset, which would require using mediating storage (see use case 2).
Let’s have a look at the resulting data:
head(reflectance)
#>       B11     B12      B2      B3      B4      B5      B6      B7      B8
#> 1 0.11700 0.06375 0.02790 0.04240 0.03460 0.06740 0.15810 0.18520 0.19275
#> 2 0.11065 0.05955 0.02720 0.04155 0.03320 0.06475 0.15590 0.18180 0.18900
#> 3 0.10665 0.05815 0.02640 0.04085 0.03210 0.06365 0.15095 0.17770 0.18440
#> 4 0.11265 0.06320 0.02945 0.04350 0.03540 0.06735 0.16075 0.18935 0.19675
#> 5 0.10885 0.05875 0.02720 0.04130 0.03315 0.06435 0.15470 0.18120 0.18840
#> 6 0.11825 0.06630 0.02800 0.04300 0.03700 0.06905 0.15030 0.17590 0.18225
#>       B8A             label
#> 1 0.19995 coniferous_forest
#> 2 0.19565 coniferous_forest
#> 3 0.19280 coniferous_forest
#> 4 0.20490 coniferous_forest
#> 5 0.19500 coniferous_forest
#> 6 0.19050 coniferous_forest
We have a data frame with the number of rows equal to the number of sampled “pixels” of satellite image and 10 variables with reflectance values for each optical band of Sentinel 2. Each sampled pixel has a label of a surface class of the user’s polygon it intersected.
spectralR returns true reflectance values (between 0 and 1), transforming GEE’s values (which are multiplied by 10000 for calculation convenience).
First of all, one should explore the quality and comprehensiveness of obtained pixel data.
# load(file = "./reflectance_test_data")   # restore previously saved pixel data

summary(factor(reflectance$label))   # how many pixels in each class?
#> coniferous_forest          cropland            meadow              reed
#>              1270               165                35               144
#>             water
#>               446
For reliable results, it is recommended to keep the sizes of the surface classes similar. Classes represented by only a few sampled pixels should be excluded from further analysis.
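A minimal base-R sketch of such filtering (the data and the 4-pixel threshold below are toy assumptions; with real data, use the reflectance object from get.pixel.data and a threshold of, say, 100):

```r
# Toy stand-in for sampled pixel data (assumption: real data comes from get.pixel.data)
reflectance <- data.frame(
  B2    = runif(12),
  label = rep(c("water", "meadow", "cropland"), times = c(6, 2, 4))
)

# Keep only classes with at least `min_pixels` sampled pixels
min_pixels  <- 4
class_sizes <- table(reflectance$label)
keep        <- names(class_sizes[class_sizes >= min_pixels])

reflectance_filtered <- reflectance[reflectance$label %in% keep, ]
table(reflectance_filtered$label)   # the 2-pixel 'meadow' class is gone
```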
Visual overview of pixel data
# Number of spectral values for different classes
ggplot(reflectance, aes(x = label)) +
  geom_bar() +
  labs(x = 'Classes of surface', y = 'Number of pixels',
       title = "Total pixel data for different classes of surface",
       caption = 'Data: Sentinel-2 Level-2A') +
  theme_minimal()
spectral.curves.plot transforms the data and plots smoothed curves for each surface class, using geom_smooth() aesthetics. For large (thousands of rows) data (and we need large data for reliable results), ggplot2 uses the GAM method for drawing a trendline.
Depending on the data size, the processing may take some time.
# Make a ggplot object
p1 <- spectral.curves.plot(reflectance)
#> Joining, by = "variable"

# Default plot
p1
Since the output is a ggplot object, we can apply any tools provided by the ggplot2 package to make it more visually pleasing.
# Customized plot
p1 +
  labs(x = 'Wavelength, nm', y = 'Reflectance',
       colour = "Surface classes",
       fill = "Surface classes",
       title = "Spectral reflectance curves for different surface classes",
       caption = 'Data: Sentinel-2 Level-2A') +
  theme_minimal()
You can save the plot as a *.png file (or in other standard graphical formats) using ggsave:
ggsave("Spectral_curves_usecase1.png", width=16, height=12, unit="cm", dpi=300)
stat.summary.plot makes a plot with a statistical summary of reflectance values (mean, mean - standard deviation, mean + standard deviation) for the defined classes of surface. Given reflectance data as input, the function returns a ggplot2 object with basic visual aesthetics. The default aesthetics are lines with a statistical summary for each satellite band (geom_line).
Wavelength values (nm) are acquired from the mean known value for each optical band of the Sentinel 2 sensor.
# Make a ggplot object
p2 <- stat.summary.plot(reflectance)
#> Joining, by = "variable"
#> Joining, by = "variable"

# Default plot
p2
Add a touch of customization.
# Customized plot
p2 +
  labs(x = 'Sentinel-2 bands', y = 'Reflectance',
       colour = "Surface classes",
       title = "Reflectance for different surface classes",
       caption = 'Data: Sentinel-2 Level-2A\nmean ± standard deviation') +
  theme_minimal()
Save the plot as a *.png file
ggsave("Statsummary_usecase1.png", width=16, height=12, unit="cm", dpi=300)
violin.plot helps to visualize reflectance as violin plots for each surface class per satellite band. It takes reflectance data as input and returns a ggplot2 object with basic visual aesthetics. The default aesthetic is geom_violin.
# Make a ggplot object
p3 <- violin.plot(reflectance)

# Customized plot
p3 +
  labs(x = 'Surface class', y = 'Reflectance',
       fill = "Surface classes",
       title = "Reflectance for different surface classes",
       caption = 'Data: Sentinel-2 Level-2A') +
  theme_minimal()
# Save the plot as a *.png file
ggsave("Violins_usecase1.png", width=22, height=16, unit="cm", dpi=300)
Buzkyi Gard National Park is located in Southern Bug valley (south-western Ukraine, 31.08325,47.90720). The National Park is characterized by unique steppe zone landscapes with diverse vegetation and flora (Shyriaieva et al., 2022). Natural habitats with high conservation value, i.e., a wide range of dry grasslands, rocky outcrops, and steppe forests, are concentrated in river valleys and surrounded by croplands. The habitat types were defined using precise vegetation data (phytosociological relevés with GPS location) and field mapping data for the artificial habitats. Then, the new revised version of EUNIS expert system, version 2021-06-01 with additions from the old EUNIS 2007 for non-revised habitat groups and the National Habitat Catalogue of Ukraine were used for the classification. As a result, the dataset contains 380 hand-drawn polygons of 26 different EUNIS habitat types.
# Reset R's brain before the new analysis session starts. It will erase all the
# objects stored in R memory, while keeping loaded libraries.
rm(list = ls())

# Load required packages
library(tidyverse)
library(rgee)
library(sf)
library(geojsonio)
library(reshape2)
library(spectralR)
Upload and process vector data.
# Extract polygons from shapefile and prepare sf object with proper structure
sf_df <- prepare.vector.data(
  system.file("extdata/SouthernBuh-habitats_shapefile.shp", package = "spectralR"),
  "eunis_2020"
)
Explore the resulting spatial object:
head(sf_df)
#> Simple feature collection with 6 features and 2 fields
#> Geometry type: POLYGON
#> Dimension:     XY
#> Bounding box:  xmin: 30.96412 ymin: 47.91221 xmax: 31.06714 ymax: 48.01866
#> Geodetic CRS:  WGS 84
#>   label class                       geometry
#> 1   R1B    11 POLYGON ((31.05245 47.92259...
#> 2   R1B    11 POLYGON ((31.06686 47.91231...
#> 3   U33    22 POLYGON ((30.96416 48.01731...
#> 4   R1B    11 POLYGON ((30.96983 48.01847...
#> 5   R12     9 POLYGON ((30.97225 48.01432...
#> 6   U33    22 POLYGON ((30.96804 48.01308...
Get reflectance values
reflectance <- get.pixel.data(sf_df, "2019-05-15", "2019-06-30", 10, 10)

# Save pixel data for further sessions
save(reflectance, file = "reflectance_BG_data")
The habitats of the Southern Buh River Valley are often represented by small areas (especially rocky outcrops, grasslands, etc.). Therefore, many polygons are relatively small (up to 1 ha) and have an irregular shape. We recommend decreasing the scale value (pixel size) in such cases, approaching the minimum value (10 m). On the other hand, that requires more computational power and processing time.
Quantitative overview of pixel data
summary(factor(reflectance$label))   # how many pixels in each class?
#>  C2.2  C2.3    J1  J3.2  J4.2    J5   Q51   Q53   R12   R1A   R1B   R21   R36
#>   741  3054 11204   983   447    87  1276     3   659    52  1448   310    49
#>   S35   S36   S91   T11   T13   T19   T1E   T1H   U33   V11   V34   V38   X18
#>   694   220     4   829   331   294    72 14372   165 48778    18  4717   130
With “big data”, spectralR uses your Google Drive as intermediate storage between R and GEE. Please take care to have enough storage in the Drive, and purge the rgee_backup directory manually from time to time.
Spectral reflectance curves for different habitat types
# Create basic ggplot object
p1 <- spectral.curves.plot(reflectance)
#> Joining, by = "variable"
#> Warning in mask$eval_all_mutate(quo): NAs introduced by coercion

# Plotting
p1 +
  labs(x = 'Wavelength, nm', y = 'Reflectance',
       colour = "Habitat types",
       fill = "Habitat types",
       title = "Spectral reflectance curves for different habitat types\nSouthern Bug National park, Ukraine",
       caption = 'Data: Sentinel-2 Level-2A') +
  theme_minimal()
The plot above, unlike the one from use case 1, looks messy and is hard to read because of the large number of surface classes (habitat types). To make the resulting plot easier to comprehend, both plotting functions have an argument target_classes, which takes as input a list of the surface classes that should be highlighted, while the others are turned gray, as a background. The default value is NULL: all the classes are colourized, as above. Let’s give it a try.
p1_1 <- spectral.curves.plot(
  data = reflectance,
  target_classes = list("C2.2", "C2.3", "J5")
)
#> Joining, by = "variable"
#> Warning in mask$eval_all_mutate(quo): NAs introduced by coercion

# Plotting
p1_1 +
  labs(x = 'Wavelength, nm', y = 'Reflectance',
       colour = "Habitat types",
       fill = "Habitat types",
       title = "Spectral reflectance curves for different habitat types\nSouthern Bug National park, Ukraine",
       caption = 'Data: Sentinel-2 Level-2A') +
  theme_minimal()
Needless to say, the user is also able to filter the input data to obtain what they want. Suppose we need only grasslands in our reflectance plot (habitats whose codes start with R) and want to highlight the differences/similarities in the spectral reflectance of Continental dry grasslands (true steppe, type R1B).
# Filter reflectance data. Note: use %in%, not ==, for membership tests --
# == recycles the vector of labels and silently drops rows.
grasslands <- reflectance %>%
  filter(label %in% c("R12", "R1A", "R1B", "R21", "R36"))

# Construct a ggplot object with basic aesthetics
p1_2 <- spectral.curves.plot(
  data = grasslands,
  target_classes = list("R1B")
)
#> Joining, by = "variable"
#> Warning in mask$eval_all_mutate(quo): NAs introduced by coercion

# Plotting
p1_2 +
  labs(x = 'Wavelength, nm', y = 'Reflectance',
       colour = "Habitat types",
       fill = "Habitat types",
       title = "Spectral reflectance curves for different types of\ngrassland habitats, Southern Bug National park, Ukraine",
       caption = 'Data: Sentinel-2 Level-2A') +
  theme_minimal()
Statistical summary for each habitat type
# Create basic ggplot object
p2 <- stat.summary.plot(reflectance)
#> Joining, by = "variable"
#> Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
#> Joining, by = "variable"

# Plotting
p2 +
  labs(x = 'Sentinel-2 bands', y = 'Reflectance',
       colour = "Habitat types",
       title = "Reflectance for different habitat types\nSouthern Bug National park, Ukraine",
       caption = 'Data: Sentinel-2 Level-2A\nmean ± standard deviation') +
  theme_minimal()
In the same way, create a plot that highlights only tall-helophyte beds (type Q51), compared to two other common floodplain habitat types (R21 Mesic lowland pasture, T11 Riparian forest).
# Filter reflectance data (again, %in% rather than == to avoid recycling)
floodplain <- reflectance %>%
  filter(label %in% c("Q51", "R21", "T11"))

# Create basic ggplot object
p2_1 <- stat.summary.plot(
  data = floodplain,
  target_classes = list("Q51")
)
#> Joining, by = "variable"
#> Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
#> Joining, by = "variable"

# Plotting
p2_1 +
  labs(x = 'Sentinel-2 bands', y = 'Reflectance',
       colour = "Habitat types",
       title = "Reflectance of tall-helophyte beds,\nSouthern Bug National park, Ukraine",
       caption = 'Data: Sentinel-2 Level-2A\nmean ± standard deviation') +
  theme_minimal()
Create violin plots for the given habitat types
# Create basic ggplot object
p3 <- violin.plot(reflectance)

# Plotting
p3 +
  labs(x = "Habitat type", y = "Reflectance",
       fill = "Habitat types",
       title = "Reflectance for different habitat types\nSouthern Bug National park, Ukraine",
       caption = 'Data: Sentinel-2 Level-2A') +
  theme_minimal()