High Influx Filter¶
QC protocol for Private Weather Stations¶
This notebook presents how to use the ‘High Influx filter’ in the Python package pypwsqc, a quality assurance protocol developed for automated private weather stations (PWS).
The package is based on the original R code available at https://github.com/LottedeVos/PWSQC/.
pypwsqc depends on the poligrain, xarray, pandas and numpy packages. Make sure to install and import the required packages first.
[2]:
import poligrain as plg
import xarray as xr
import pypwsqc
Download example data¶
In this example, we use an open PWS dataset from Amsterdam, called the “AMS PWS” dataset. By running the cell below, an example NetCDF-file will be downloaded to your current repository (if your machine is connected to the internet).
[2]:
!curl -OL https://github.com/OpenSenseAction/OS_data_format_conventions/raw/main/notebooks/data/OpenSense_PWS_example_format_data.nc
Data preparations¶
This package handles rainfall data as xarray Datasets. The data set must have time and id dimensions, latitude and longitude as coordinates, and rainfall as data variable.
An example of how to convert .csv data to a xarray dataset is found here.
We now load the data set under the name ds_pws.
[3]:
ds_pws = xr.open_dataset("OpenSense_PWS_example_format_data.nc")
We then slice the time series to 1 month of data to shorten the computation period of the example.
[4]:
ds_pws = ds_pws.sel(time=slice("2017-07-01", "2017-07-30"))
ds_pws
[4]:
<xarray.Dataset> Size: 9MB
Dimensions: (time: 8640, id: 134)
Coordinates:
* time (time) datetime64[ns] 69kB 2017-07-01 ... 2017-07-30T23:55:00
* id (id) <U6 3kB 'ams1' 'ams2' 'ams3' ... 'ams132' 'ams133' 'ams134'
elevation (id) <U3 2kB ...
latitude (id) float64 1kB ...
longitude (id) float64 1kB ...
Data variables:
rainfall (id, time) float64 9MB ...
Attributes:
title: PWS data from Amsterdam
file author: Maximilian Graf
institution: Wageningen University and Research, Department of ...
date: 2022-10-18 10:32:00
source: Netamo PWS
history: Data derived and reformated from the originally pu...
naming convention: OpenSense-0.1
license restrictions: CC-BY 4.0 https://creativecommons.org/licenses/by/...
reference: https://doi.org/10.1029/2019GL083731
comment: Reproject coordinates¶
First we reproject the coordinates to a local metric coordinate reference system to allow for distance calculations. In the Amsterdam example we use EPSG:25832. Remember to use a local metric reference system for your use case! We use the function spatial.project_point_coordinates in the poligrainpackage.
[4]:
ds_pws.coords["x"], ds_pws.coords["y"] = plg.spatial.project_point_coordinates(
x=ds_pws.longitude, y=ds_pws.latitude, target_projection="EPSG:25832"
)
Create distance matrix¶
Then, we calculate the distances between all stations in our data set. If your data set has a large number of stations this can take some time.
[5]:
distance_matrix = plg.spatial.calc_point_to_point_distances(ds_pws, ds_pws)
Select range for neighbouring checks¶
The quality control is performed by comparing time series of each station with the time series of neighbouring stations within a specificed range max_distance. The selected range depends on the use case and area of interest. In this example, we use 10’000 meters. max_distance is called d in the original publication.
Select considered range around each station¶
[6]:
max_distance = 10e3
Quality control¶
Now the data set is prepared to run the quality control.
High Influx filter¶
Conditions for raising High Influx flag:
If median rainfall of neighbours is below threshold ϕA, then high influx if rainfall above threshold ϕB
If median rainfall of neighbours is above ϕA, then high influx if rainfall exceeds median times ϕB/ϕA
Filter cannot be applied if less than n_stat neighbours are reporting data (HI flag is set to -1)
NOTE! The filter cannot be applied if the station has reported NaN data in the last
ninttime steps. This gives more -1 flags than in the original R-implementation that does not use this condition. This choice was done to ensure that timesteps without data at the evaluated station is not mistakenly being interpreted as timesteps who have passed the quality control (if they would have been flagged with 0) or as time steps with a High Influx issue (if they would have been flagged with 1).
For settings for parameter ϕA, ϕB and n_stat, see table 1 in https://agupubs.onlinelibrary.wiley.com/doi/full/10.1029/2019GL083731
Set HI parameters¶
[7]:
hi_thres_a = 0.4
hi_thres_b = 10
nint = 6
n_stat = 5
Run HI filter¶
[8]:
%%time
ds_pws_flagged = pypwsqc.flagging.hi_filter(
ds_pws, hi_thres_a, hi_thres_b, nint, n_stat, distance_matrix, max_distance
)
CPU times: total: 0 ns
Wall time: 0 ns
Save flagged dataset¶
[9]:
ds_pws_flagged.to_netcdf("hi_flagged_dataset.nc")