Comparing similarities between Time Series¶
Authors:
- Victor Tang
Reviewed/Edited by:
- Marcos Kavlin
- Dr. Andrew Dean
Purpose¶
This notebook is the fifth in the SWIFT-C Tutorial. Its goal is to show you, the user, how Dynamic Time Warping (DTW) can be used to compare similarities between time series. In the context of Wetland Function Assessment, this can be used to 'classify' watersheds according to reference time series, thus enabling the classification of a landscape based on different key floodplain typologies.
Workflow¶
- Import required packages.
- Load data.
- Define necessary functions.
- Apply the functions and plot the results.
Notes¶
This notebook relies on the dtw-python package linked below.
dtw package: https://pypi.org/project/dtw-python/
1. Import required packages¶
In order to run this notebook, dtw-python will have to be installed if it hasn't been installed previously.
To do so run the following command in your terminal:
pip install dtw-python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from dtw import dtw
Importing the dtw module. When using in academic works please cite: T. Giorgino. Computing and Visualizing Dynamic Time Warping Alignments in R: The dtw Package. J. Stat. Soft., doi:10.18637/jss.v031.i07.
2. Load data¶
In this notebook we compare the time series of each landscape unit to the reference time series we obtained from the cluster analysis. We will therefore load both datasets as pandas.DataFrames.
df_ts = pd.read_csv("data/vh_2022.csv", index_col=0)
df_ref = pd.read_csv("data/vh_ref_2022.csv", index_col=0)
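The code later in this notebook treats each row of df_ts as one time series and looks up a reference by a row label such as "Class 01". The sketch below illustrates that layout with tiny in-memory CSVs. The column names (unit_id, class, t0, t1, t2) and all values are made up for illustration and are not taken from the actual data files.

```python
import io

import pandas as pd

# Hypothetical CSV content: one row per landscape unit, one column per time step.
csv_ts = io.StringIO(
    "unit_id,t0,t1,t2\n"
    "1,-18.2,-17.5,-19.0\n"
    "2,-15.1,-14.8,-16.3\n"
)

# Hypothetical reference file: one row per class, same time-step columns.
csv_ref = io.StringIO(
    "class,t0,t1,t2\n"
    "Class 01,-18.0,-17.0,-19.5\n"
    "Class 02,-15.0,-15.0,-16.0\n"
)

# index_col=0 turns the first column into the row index, as in the cell above.
toy_ts = pd.read_csv(csv_ts, index_col=0)
toy_ref = pd.read_csv(csv_ref, index_col=0)

print(toy_ts.shape)                    # (number of units, number of time steps)
print(toy_ref.loc["Class 01"].values)  # reference series for one class
```

If your own files are organised differently (for example, one column per landscape unit), they would need to be transposed before being passed to the functions defined in the next section.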
3. Defining functions¶
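Our time series currently have one data point every 10 days. However, we would like to increase the temporal resolution by linearly interpolating additional data points between consecutive observations, so that the algorithm has more data points to work with in the DTW calculation. The interpolation is controlled by the parameter intv: each original time step is split into intv sub-steps, so with intv = 4 (the value used below) three points are added between each pair of observations.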
We will therefore define two functions:
- interpolate_time_series: linearly interpolates our time series in order to increase the temporal resolution.
- calculate_dtw: interpolates the time series and the reference, then calculates the DTW 'distance' of each time series to the reference.
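Before the definitions, a toy illustration of the interpolation step may help. The sketch below uses np.interp rather than the pandas-based approach of interpolate_time_series, and the series values, the 10-day spacing, and the name factor (standing in for intv) are assumptions made for the example.

```python
import numpy as np

# Toy series: 4 observations, assumed to be 10 days apart (values are made up).
days = np.array([0.0, 10.0, 20.0, 30.0])
values = np.array([-18.0, -16.0, -20.0, -19.0])

factor = 4  # hypothetical upsampling factor, playing the role of `intv`

# New time axis: each 10-day interval is split into `factor` equal sub-steps.
n_new = (len(days) - 1) * factor + 1
days_new = np.linspace(days[0], days[-1], n_new)

# Linear interpolation between the original observations.
values_new = np.interp(days_new, days, values)

print(days_new)
print(values_new)
```

Every factor-th value of values_new is an original observation, and the values in between lie on straight lines connecting neighbouring observations. For the same input and factor, interpolate_time_series should produce a series of the same length.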
def interpolate_time_series(ts_data: np.ndarray, intv: int) -> np.ndarray:
    """Upsample time series data by linear interpolation.

    Args:
        ts_data: np.ndarray
            2-d array containing time series data
            for interpolation (one series per row).
        intv: int
            Upsampling factor. Each original time step is
            split into intv sub-steps, so intv - 1 points
            are interpolated between consecutive observations.
    Returns:
        ts_intp: np.ndarray
            Time series interpolated to the
            finer temporal resolution.
    """
    nrow, ncol = ts_data.shape
    # Place the observations in every intv-th column; the rest stay NaN.
    mat = np.full([nrow, ncol * intv], np.nan)
    mat[:, ::intv] = ts_data
    df = pd.DataFrame(mat)
    # Fill the NaN gaps linearly along each row.
    df = df.interpolate(method="linear", axis=1)
    # Drop the columns after the last observation. An explicit end
    # index is used so that intv=1 returns the input unchanged.
    ts_intp = df.values[:, : (ncol - 1) * intv + 1]
    return ts_intp
def calculate_dtw(
    ts_data: np.ndarray, ts_ref: np.ndarray, window_size: int, intv: int
) -> np.ndarray:
    """Calculate the DTW distance of
    each time series to the reference.

    Args:
        ts_data: np.ndarray
            2-d array containing time series data
            (one series per row).
        ts_ref: np.ndarray
            1-d array containing the reference
            time series data.
        window_size: int
            Size of the Sakoe-Chiba window to use for
            the DTW calculation, in interpolated time steps.
        intv: int
            Upsampling factor passed to
            interpolate_time_series.
    Returns:
        d: np.ndarray
            1-d array specifying each time series'
            distance to the reference.
    """
    # interpolate_time_series expects a 2-d array, so add a row axis.
    ts_ref = ts_ref[np.newaxis, :]
    ts_data = interpolate_time_series(ts_data, intv)
    ts_ref = interpolate_time_series(ts_ref, intv)
    n = ts_data.shape[0]
    d = np.full(n, np.nan)
    # Align each series with the reference and store the DTW distance.
    for i in range(n):
        alignment = dtw(
            ts_ref,
            ts_data[i, :],
            keep_internals=True,
            window_type="sakoechiba",
            window_args={"window_size": window_size},
        )
        d[i] = alignment.distance
    return d
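To give a feel for what the DTW distance and the window size represent, here is a minimal plain-NumPy sketch of DTW restricted to a Sakoe-Chiba band. It is an illustration only and not the dtw-python implementation: the function name dtw_distance_banded and the toy series are hypothetical, and the recursion below weights all steps equally, whereas dtw-python's default step pattern (symmetric2) does not, so the numbers will generally differ from those returned by calculate_dtw.

```python
import numpy as np


def dtw_distance_banded(x, y, window_size):
    """Toy DTW distance with a Sakoe-Chiba band (illustrative only)."""
    n, m = len(x), len(y)
    # Cumulative cost matrix, padded with an extra row and column of infinities.
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        # Only cells with |i - j| <= window_size may be visited.
        j_lo = max(1, i - window_size)
        j_hi = min(m, i + window_size)
        for j in range(j_lo, j_hi + 1):
            local = abs(x[i - 1] - y[j - 1])
            cost[i, j] = local + min(
                cost[i - 1, j],      # advance in x only
                cost[i, j - 1],      # advance in y only
                cost[i - 1, j - 1],  # advance in both
            )
    return cost[n, m]


ref = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
shifted = np.array([0.0, 0.0, 1.0, 2.0, 1.0])    # same shape, delayed by one step
different = np.array([2.0, 2.0, 2.0, 2.0, 2.0])  # flat series

d_shifted = dtw_distance_banded(ref, shifted, window_size=2)
d_different = dtw_distance_banded(ref, different, window_size=2)
d_nowarp = dtw_distance_banded(ref, shifted, window_size=0)

print(d_shifted, d_different, d_nowarp)
```

With a window of 0 no warping is allowed, and the result is simply the sum of point-wise differences. Allowing a small window lets the delayed series be matched to the reference at a lower cost, while the flat series stays far away regardless. This is the behaviour the window_size argument of calculate_dtw controls.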
3.1 Apply functions and plot¶
The next step is to apply the functions we have just defined to our data. We have set the window size for DTW to 7 and the interpolation parameter intv to 4. These parameters worked well for our use case, but we highly recommend you test the functions and method with your own data to determine what yields the best results.
In the cell below we will make two plots:
- The first will show the distribution of the DTW distances between each of our time series and the reference 'Class 01'. We will also add a dashed red line that identifies the threshold below which a time series is considered to belong to 'Class 01'.
- The second plot displays the time series selected by the threshold, as well as the reference time series of 'Class 01'.
Together, the two plots give a practitioner the information needed to fine-tune the threshold they plan to use.
ts_array = df_ts.values
ref_array = df_ref.loc["Class 01"].values
d = calculate_dtw(ts_array, ref_array, 7, 4)
thr = 160
idx = d < thr
df_ts_sel = df_ts.iloc[idx, :]
n_sel = df_ts_sel.shape[0]
fig, axes = plt.subplots(2, 1, figsize=[9, 8])
axes[0].hist(d, bins=100, alpha=0.4)
axes[0].axvline(thr, color="tab:red", linestyle="--",
                lw=1.2, label="Threshold")
axes[0].set_xlabel("DTW", fontsize=14)
axes[0].set_ylabel("# Samples", fontsize=14)
axes[0].set_title("Distribution of DTW", fontsize=14, fontweight="bold")
axes[0].legend()
axes[1].plot(df_ts_sel.values.T, color="tab:green", lw=0.8, alpha=0.1)
axes[1].plot(ref_array, color="k", lw=2, label="Reference Time Series")
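# The plotted series are the original (non-interpolated) data,
# so the x-axis is the time step and the y-axis the backscatter.
axes[1].set_xlabel("Time Step (10 days)", fontsize=14)
axes[1].set_ylabel("Backscatter (dB)", fontsize=14)
axes[1].set_title(f"Selected Time Series ({n_sel})", fontsize=14,
                  fontweight="bold")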
axes[1].legend()
fig.tight_layout()
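When fine-tuning the threshold, it can also help to see how the number of selected time series changes across a few candidate values. The sketch below does this on made-up distances. The array d_toy, its two synthetic groups, and the candidate thresholds are assumptions for illustration only; in practice you would use the array d returned by calculate_dtw.

```python
import numpy as np

# Made-up DTW distances standing in for the output of calculate_dtw.
rng = np.random.default_rng(0)
d_toy = np.concatenate([
    rng.normal(120, 15, size=50),   # series resembling the reference
    rng.normal(260, 40, size=200),  # series that do not
])

# Count how many series fall below each candidate threshold.
candidate_thresholds = [120, 140, 160, 180, 200]
counts = [int(np.sum(d_toy < t)) for t in candidate_thresholds]

for thr_toy, n_below in zip(candidate_thresholds, counts):
    print(f"threshold={thr_toy}: {n_below} of {d_toy.size} series selected")
```

A range of thresholds over which the count changes little suggests a gap in the distance distribution, which is often a reasonable place to put the cut-off. The final choice should still be checked visually against the two plots above.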