Comparing similarities between Time Series¶

Authors:

Victor Tang

Reviewed/Edited by:

Marcos Kavlin
Dr. Andrew Dean

Purpose¶

This notebook is the fifth in the SWIFT-C Tutorial. Its goal is to show you, the user, how to use Dynamic Time Warping (DTW) to compare similarities between time series. In the context of Wetland Function Assessment, DTW can be used to 'classify' watersheds according to reference time series, thus enabling the classification of a landscape based on key floodplain typologies.
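
To make the idea of a DTW 'distance' concrete, here is a minimal sketch of the textbook DTW recurrence in plain NumPy. It is for intuition only and is not the implementation used in this notebook: the function name dtw_distance and the toy arrays are hypothetical, and the dtw-python package may weight alignment steps differently, so its values need not match this sketch.

```python
import numpy as np


def dtw_distance(x: np.ndarray, y: np.ndarray) -> float:
    """Textbook DTW distance between two 1-d series (illustrative only)."""
    n, m = len(x), len(y)
    # cost[i, j] = minimal accumulated cost of aligning x[:i] with y[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(x[i - 1] - y[j - 1])
            # Extend the cheapest of the three allowed predecessor cells.
            cost[i, j] = local + min(
                cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1]
            )
    return float(cost[n, m])


a = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
b = np.array([0.0, 0.0, 1.0, 2.0, 1.0, 0.0])  # same shape as a, one step late
c = np.array([2.0, 2.0, 2.0, 2.0, 2.0])       # flat series

print(dtw_distance(a, b))  # 0.0: warping absorbs the time shift
print(dtw_distance(a, c))  # 6.0: the shapes genuinely differ
```

The shifted series b is at distance zero from a because DTW may match one point to several neighbours in time, which is what makes it suitable for comparing seasonal signals that peak at slightly different dates.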

Workflow¶

Import required packages.
Load data.
Define necessary functions.
Plot the results.

Notes¶

This notebook relies on the dtw-python package linked below.

dtw package: https://pypi.org/project/dtw-python/

1. Import required packages¶

To run this notebook, dtw-python must be installed.

If it is not installed yet, run the following command in your terminal:

pip install dtw-python

In [3]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from dtw import dtw

Importing the dtw module. When using in academic works please cite:
  T. Giorgino. Computing and Visualizing Dynamic Time Warping Alignments in R: The dtw Package.
  J. Stat. Soft., doi:10.18637/jss.v031.i07.

2. Load data¶

In this notebook we compare the time series of each landscape unit to the reference time series obtained from the cluster analysis. We therefore load both datasets as pandas DataFrames.
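
The CSV files themselves are not shown here, so the sketch below builds synthetic stand-ins to illustrate the layout that the rest of the notebook appears to rely on: one row per landscape unit (or per reference class) and one column per time step. This layout is an assumption inferred from how the DataFrames are used further down. All names ending in _demo, the unit_* and t00-style labels, the array sizes and the backscatter values are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_units, n_steps = 5, 36  # hypothetical sizes

cols = [f"t{k:02d}" for k in range(n_steps)]

# One row per landscape unit, one column per time step (assumed layout).
df_ts_demo = pd.DataFrame(
    rng.normal(-18.0, 2.0, size=(n_units, n_steps)),
    index=[f"unit_{k}" for k in range(n_units)],
    columns=cols,
)

# One row per reference class, indexed by class name (assumed layout).
df_ref_demo = pd.DataFrame(
    rng.normal(-18.0, 2.0, size=(2, n_steps)),
    index=["Class 01", "Class 02"],
    columns=cols,
)

ts_array_demo = df_ts_demo.values                     # 2-d: units x time
ref_array_demo = df_ref_demo.loc["Class 01"].values   # 1-d: time

# Quick sanity checks before running DTW: expected dimensionality, and
# no missing values (NaNs would propagate into the DTW cost).
print(ts_array_demo.shape, ref_array_demo.shape)
print(bool(df_ts_demo.isna().any().any()))
```

Running the same two checks on your own df_ts and df_ref after loading is a cheap way to catch a transposed file or a missing index column.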

In [4]:

df_ts = pd.read_csv("data/vh_2022.csv", index_col=0)
df_ref = pd.read_csv("data/vh_ref_2022.csv", index_col=0)

3. Defining functions¶

Our time series currently have one data point every 10 days. We would like to increase the temporal resolution by linear interpolation, so that the DTW algorithm has more data points to align against the reference. With the setting used later in this notebook (intv = 4), each 10-day step is split into four sub-steps, i.e. three interpolated points are inserted between consecutive observations.

We will therefore define two functions:

interpolate_time_series: linearly interpolates our time series to increase the temporal resolution.
calculate_dtw: interpolates the time series and the reference, then calculates the DTW 'distance' of each time series to the reference.
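
As a small worked example of what the interpolation step does, the sketch below upsamples a three-point toy series with np.interp. This is an illustrative alternative to the pandas-based function defined in the next cell, not a replacement for it; the names series, factor, t_old, t_new and dense are hypothetical.

```python
import numpy as np

series = np.array([0.0, 4.0, 2.0])  # three observations, e.g. 10 days apart
factor = 4                           # split each original step into 4 sub-steps

# Positions of the original observations on the finer grid: 0, 4, 8.
t_old = np.arange(len(series)) * factor
# Every position on the finer grid, from the first to the last observation.
t_new = np.arange((len(series) - 1) * factor + 1)

dense = np.interp(t_new, t_old, series)
print(dense.tolist())
# [0.0, 1.0, 2.0, 3.0, 4.0, 3.5, 3.0, 2.5, 2.0]
```

Three points become nine: the original values are kept at positions 0, 4 and 8, and three interpolated values are placed between each pair of observations.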

In [31]:

def interpolate_time_series(ts_data: np.ndarray, intv: int) -> np.ndarray:
    """Linearly interpolate time series to a
       higher temporal resolution.

    Args:
        ts_data: np.ndarray
            2-d array containing time series data
            for interpolation, one series per row.

        intv: int
            Integer factor by which the temporal
            resolution is increased. Each original
            time step is split into `intv` sub-steps,
            i.e. `intv - 1` points are added between
            consecutive observations.

    Returns:
        ts_intp: np.ndarray
            Interpolated time series with
            (ncol - 1) * intv + 1 columns.

    """

    nrow, ncol = ts_data.shape

    # Place the observations every `intv` columns; the gaps stay NaN.
    mat = np.full([nrow, ncol * intv], np.nan)
    mat[:, ::intv] = ts_data

    # Fill the NaN gaps by linear interpolation along the time axis.
    df = pd.DataFrame(mat).interpolate(method="linear", axis=1)

    # Drop the padded columns after the last observation. Slicing by the
    # output length also works for intv=1, where -(intv - 1) would be 0.
    ts_intp = df.values[:, : (ncol - 1) * intv + 1]

    return ts_intp

In [32]:

def calculate_dtw(
    ts_data: np.ndarray, ts_ref: np.ndarray, window_size: int, intv: int
) -> np.ndarray:
    """This calculates the DTW distance of
       each time series to the reference.

    Args:
        ts_data: np.ndarray
            2-d array containing time series data,
            one series per row.

        ts_ref: np.ndarray
            1-d array containing the reference
            time series.

        window_size: int
            Integer specifying the size of the
            Sakoe-Chiba window to use for the DTW
            calculation, counted in time steps of
            the interpolated series.

        intv: int
            Integer factor by which the temporal
            resolution is increased using linear
            interpolation before running DTW.

    Returns:
        d: np.ndarray
            1-d array specifying each time series'
            distance to the reference.

    """

    # interpolate_time_series expects a 2-d array, so give the
    # reference a leading axis of length 1.
    ts_ref = ts_ref[np.newaxis, :]

    ts_data = interpolate_time_series(ts_data, intv)
    ts_ref = interpolate_time_series(ts_ref, intv)

    n = ts_data.shape[0]
    d = np.full(n, np.nan)
    for i in range(n):
        # Align the reference with the i-th series inside the window.
        alignment = dtw(
            ts_ref,
            ts_data[i, :],
            keep_internals=True,
            window_type="sakoechiba",
            window_args={"window_size": window_size},
        )
        d[i] = alignment.distance
    return d

3.1 Apply functions and plot¶

The next step is to apply the functions we have just defined to our data. We have set the DTW window size to 7 and the interpolation factor to 4 (intv = 4). These parameters worked well for our use case, but we highly recommend that you test the functions and method with your own data to determine what yields the best results.
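
To help with choosing a window size, the sketch below illustrates the usual definition of a Sakoe-Chiba band: point i of one series may only be matched with point j of the other when |i - j| <= window_size. The helper sakoe_chiba_mask is hypothetical and written here for illustration; it is not part of dtw-python, and we assume rather than verify that the package applies the same rule.

```python
import numpy as np


def sakoe_chiba_mask(n: int, m: int, window_size: int) -> np.ndarray:
    """Boolean mask of the (i, j) cells with |i - j| <= window_size."""
    i = np.arange(n)[:, np.newaxis]
    j = np.arange(m)[np.newaxis, :]
    return np.abs(i - j) <= window_size


# Small example: two series of length 6 and a window of 1 step.
mask = sakoe_chiba_mask(6, 6, window_size=1)
print(mask.astype(int))
print(int(mask.sum()), "of", mask.size, "cells allowed")
```

Because calculate_dtw interpolates both series before calling dtw, the window is counted in steps of the interpolated series. With observations every 10 days and intv = 4, one step is 2.5 days, so a window of 7 allows shifts of up to roughly 17.5 days. A larger window tolerates bigger timing differences but makes the comparison less strict.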

In the cell below we will make two plots:

The first shows the distribution of the DTW distances between our time series and the reference 'Class 01'. A dashed red line marks the threshold below which a time series is considered to belong to 'Class 01'.
The second displays the time series selected by the threshold, together with the reference time series of 'Class 01'.

Together, the two plots give a practitioner the information needed to fine-tune the threshold they plan to use.
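
One simple way to explore the threshold before plotting is to count how many series fall below a few candidate values. The sketch below does this on synthetic distances; d_demo is a hypothetical stand-in for the array returned by calculate_dtw, and the candidate thresholds are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic stand-in for DTW distances (hypothetical values).
d_demo = rng.gamma(shape=4.0, scale=60.0, size=1000)

counts = []
for thr_demo in (100, 160, 220):
    # Same selection rule as in the notebook: keep distances below the threshold.
    n_sel_demo = int((d_demo < thr_demo).sum())
    counts.append(n_sel_demo)
    print(f"threshold {thr_demo}: {n_sel_demo} series selected "
          f"({n_sel_demo / d_demo.size:.1%})")
```

A threshold where the count changes sharply for a small change in value usually deserves a closer look in the histogram.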

In [34]:

ts_array = df_ts.values
ref_array = df_ref.loc["Class 01"].values

# DTW distance of every time series to the 'Class 01' reference.
d = calculate_dtw(ts_array, ref_array, window_size=7, intv=4)

# Series with a distance below the threshold are assigned to 'Class 01'.
thr = 160
idx = d < thr

df_ts_sel = df_ts.iloc[idx, :]
n_sel = df_ts_sel.shape[0]

fig, axes = plt.subplots(2, 1, figsize=[9, 8])

# Top: distribution of DTW distances with the threshold.
axes[0].hist(d, bins=100, alpha=0.4)
axes[0].axvline(thr, color="tab:red", linestyle="--",
                lw=1.2, label="Threshold")
axes[0].set_xlabel("DTW", fontsize=14)
axes[0].set_ylabel("# Samples", fontsize=14)
axes[0].set_title("Distribution of DTW", fontsize=14, fontweight="bold")
axes[0].legend()

# Bottom: selected (uninterpolated) time series and the reference.
# Time is on the x-axis and backscatter on the y-axis.
axes[1].plot(df_ts_sel.values.T, color="tab:green", lw=0.8, alpha=0.1)
axes[1].plot(ref_array, color="k", lw=2, label="Reference Time Series")
axes[1].set_xlabel("Time Step (10 days)", fontsize=14)
axes[1].set_ylabel("Backscatter (dB)", fontsize=14)
axes[1].set_title(f"Selected Time Series ({n_sel})", fontsize=14,
                  fontweight="bold")
axes[1].legend()

fig.tight_layout()

Figure: distribution of DTW distances with the threshold (top) and the selected time series with the 'Class 01' reference time series (bottom).