Model Selection

The gtime.model_selection module deals with model selection.

class gtime.model_selection.FeatureSplitter(drop_na_mode: str = 'any')

Splits the feature matrices X and y into X_train, y_train, X_test, y_test.

X and y are the feature matrices obtained from the FeatureCreation class.

Parameters

drop_na_mode : str, optional, default: 'any'

How to drop the NaN values contained in the X and y matrices. Only ‘any’ is currently supported.

Examples

>>> import pandas as pd
>>> import numpy as np
>>> from gtime.model_selection import FeatureSplitter
>>> X = pd.DataFrame.from_dict({"feature_0": [np.nan, 0, 1, 2, 3, 4, 5, 6, 7, 8],
...                             "feature_1": [np.nan, np.nan, 0.5, 1.5, 2.5, 3.5,
...                                            4.5, 5.5, 6.5, 7.5, ]
...                            })
>>> y = pd.DataFrame.from_dict({"y_0": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
...                             "y_1": [1, 2, 3, 4, 5, 6, 7, 8, 9, np.nan],
...                             "y_2": [2, 3, 4, 5, 6, 7, 8, 9, np.nan, np.nan]
...                            })
>>> feature_splitter = FeatureSplitter()
>>> X_train, y_train, X_test, y_test = feature_splitter.transform(X, y)
>>> X_train
   feature_0  feature_1
2        1.0        0.5
3        2.0        1.5
4        3.0        2.5
5        4.0        3.5
6        5.0        4.5
7        6.0        5.5
>>> y_train
   y_0  y_1  y_2
2    2  3.0  4.0
3    3  4.0  5.0
4    4  5.0  6.0
5    5  6.0  7.0
6    6  7.0  8.0
7    7  8.0  9.0
>>> X_test
   feature_0  feature_1
8        7.0        6.5
9        8.0        7.5
>>> y_test
   y_0  y_1  y_2
8    8  9.0  NaN
9    9  NaN  NaN

transform(X: DataFrame, y: DataFrame) -> (DataFrame, DataFrame, DataFrame, DataFrame)

Split the feature matrices X and y into X_train, y_train, X_test, y_test.

X and y are the feature matrices obtained from the FeatureCreation class.

Parameters

X : pd.DataFrame, shape (n_samples, n_features), required

The feature matrix.

y : pd.DataFrame, shape (n_samples, horizon), required

The y matrix.

Returns

X_train, y_train, X_test, y_test : Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame, pd.DataFrame]

The X and y matrices, split between train and test.

gtime.model_selection.blocking_time_series_split(time_series: DataFrame, n_splits=4, split_on='index')

Time Series cross-validator

Blocking time series split adds a margin after each split. The margin separates the folds used at successive iterations, which prevents the model from memorizing patterns from one iteration to the next.

If the data is not a time series, the split is based on the number of samples. If the data has a time index and split_on is ‘time’, the series is divided based on time.
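
The sample-based case can be illustrated with a short sketch. It is not gtime's implementation: the helper name blocking_folds is hypothetical, and it simply assumes that the folds are consecutive, non-overlapping blocks of the index, as in the index-based example of this function.

```python
import numpy as np
import pandas as pd


def blocking_folds(df: pd.DataFrame, n_splits: int = 4):
    """Hypothetical helper (not gtime's code): yield disjoint index blocks."""
    # np.array_split also handles lengths that are not a multiple of n_splits.
    for positions in np.array_split(np.arange(len(df)), n_splits):
        yield df.index[positions]


df = pd.DataFrame(np.zeros((16, 4)), columns=list("ABCD"))
folds = list(blocking_folds(df, n_splits=4))
```

Each of the four folds then covers four consecutive rows, and no row appears in more than one fold.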

Parameters

time_series : pandas DataFrame, shape (n_samples, n_features), required

n_splits : int, optional, default: 4

The number of splits/folds on the dataset.

split_on : str, optional, default: ‘index’

Either ‘index’ or ‘time’. If ‘time’, the dataframe index must be a DatetimeIndex, PeriodIndex or TimedeltaIndex, and the dataset is split based on time.

Yields

fold : RangeIndex with the indexes of each fold, or time_fold : DatetimeIndex of each fold if split_on is ‘time’.

Examples

Example 1)

>>> import numpy as np
>>> import pandas as pd
>>> from gtime.model_selection import blocking_time_series_split
>>> date_rng = pd.date_range(start='1/1/2018', end='1/08/2018', freq='D')
>>> time_series = pd.DataFrame(date_rng, columns=['date'])
>>> time_series.set_index('date', inplace=True)
>>> time_series['data'] = np.random.randint(0, 100, size=(len(date_rng)))
>>> for fold in blocking_time_series_split(time_series, n_splits=4, split_on='time'):
...     print(fold)
DatetimeIndex(['2018-01-01', '2018-01-02'], dtype='datetime64[ns]', name='date', freq=None)
DatetimeIndex(['2018-01-03', '2018-01-04'], dtype='datetime64[ns]', name='date', freq=None)
DatetimeIndex(['2018-01-05', '2018-01-06'], dtype='datetime64[ns]', name='date', freq=None)
DatetimeIndex(['2018-01-07', '2018-01-08'], dtype='datetime64[ns]', name='date', freq=None)

Example 2)

>>> df = pd.DataFrame(np.random.randint(0, 100, size=(16, 4)), columns=list('ABCD'))
>>> for fold in blocking_time_series_split(df, n_splits=4, split_on='index'):
...     print(fold)
RangeIndex(start=0, stop=4, step=1)
RangeIndex(start=4, stop=8, step=1)
RangeIndex(start=8, stop=12, step=1)
RangeIndex(start=12, stop=16, step=1)

gtime.model_selection.horizon_shift(time_series: DataFrame, horizon: Union[int, List[int]] = 5) -> DataFrame

Perform a shift of the original time_series for each time step between 1 and horizon.

Parameters

time_series : pd.DataFrame, shape (n_samples, n_features), required

The time series to shift.

horizon : int or list of int, optional, default: 5

How far into the future to predict. If an int, it is the number of shifts performed on y, one for each step from 1 to horizon. If a list, only the listed steps are used.

Returns

y : pd.DataFrame, shape (n_samples, horizon)

The shifted time series.
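
The shifting can be pictured with pandas' own shift method. The sketch below is an illustration under assumptions, not gtime's implementation: the helper name shift_targets is hypothetical, it only looks at the first column of the input, and the y_k column names are copied from the Examples section of this function.

```python
import pandas as pd


def shift_targets(series: pd.DataFrame, horizon=2) -> pd.DataFrame:
    """Hypothetical helper (not gtime's code): build y_k columns via shift(-k)."""
    # An int means "every step from 1 to horizon"; a list selects specific steps.
    steps = range(1, horizon + 1) if isinstance(horizon, int) else horizon
    first_column = series.iloc[:, 0]
    # shift(-k) moves the value observed k steps later onto the current row,
    # leaving NaN in the last k rows where no future value exists.
    return pd.DataFrame({f"y_{k}": first_column.shift(-k) for k in steps})


X = pd.DataFrame(range(0, 5), index=pd.date_range("2020-01-01", "2020-01-05"))
y = shift_targets(X, horizon=2)
y_only_2 = shift_targets(X, horizon=[2])
```

The NaN values at the end of each shifted column are the rows for which the future is not yet known.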

Examples

>>> import pandas as pd
>>> from gtime.model_selection import horizon_shift
>>> X = pd.DataFrame(range(0, 5), index=pd.date_range("2020-01-01", "2020-01-05"))
>>> horizon_shift(X, horizon=2)
            y_1  y_2
2020-01-01  1.0  2.0
2020-01-02  2.0  3.0
2020-01-03  3.0  4.0
2020-01-04  4.0  NaN
2020-01-05  NaN  NaN
>>> horizon_shift(X, horizon=[2])
            y_2
2020-01-01  2.0
2020-01-02  3.0
2020-01-03  4.0
2020-01-04  NaN
2020-01-05  NaN
gtime.model_selection.time_series_split(time_series: DataFrame, n_splits=4, split_on='index')

Time Series cross-validator

time_series_split provides indices to split time series data observed at fixed time intervals. In each split, the indices must be higher than in the previous one, so shuffling in the cross-validator is inappropriate. The input dataframe is split into n_splits folds.

If the data is not a time series, the split is based on the number of samples. If the data has a time index and split_on is ‘time’, the series is divided based on time.
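
For the sample-based case, the growing folds can be sketched as prefixes of the index. This is not gtime's implementation: the helper name expanding_folds is hypothetical, and it assumes that every fold starts at the first row and ends at the next block boundary, as in the index-based example of this function.

```python
import numpy as np
import pandas as pd


def expanding_folds(df: pd.DataFrame, n_splits: int = 4):
    """Hypothetical helper (not gtime's code): yield growing index prefixes."""
    # Assumes len(df) >= n_splits so that every block is non-empty.
    blocks = np.array_split(np.arange(len(df)), n_splits)
    for block in blocks:
        # Each fold runs from the first row up to the end of the current block.
        yield df.index[: block[-1] + 1]


df = pd.DataFrame(np.zeros((16, 4)), columns=list("ABCD"))
folds = list(expanding_folds(df, n_splits=4))
```

Because every fold is a prefix of the data, later folds always contain the earlier ones and the temporal order is never shuffled.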

Parameters

time_series : pandas DataFrame, shape (n_samples, n_features), required

n_splits : int, optional, default: 4

The number of splits/folds on the dataset.

split_on : str, optional, default: ‘index’

Either ‘index’ or ‘time’. If ‘time’, the dataframe index must be a DatetimeIndex, PeriodIndex or TimedeltaIndex, and the dataset is split based on time.

Yields

fold : RangeIndex with the indexes of each fold, or time_fold : DatetimeIndex of each fold if split_on is ‘time’.

Examples

Example 1)

>>> import numpy as np
>>> import pandas as pd
>>> from gtime.model_selection import time_series_split
>>> date_rng = pd.date_range(start='1/1/2018', end='1/08/2018', freq='D')
>>> time_series = pd.DataFrame(date_rng, columns=['date'])
>>> time_series.set_index('date', inplace=True)
>>> time_series['data'] = np.random.randint(0, 100, size=(len(date_rng)))
>>> for fold in time_series_split(time_series, n_splits=4, split_on='time'):
...     print(fold)
DatetimeIndex(['2018-01-01', '2018-01-02'], dtype='datetime64[ns]', name='date', freq=None)
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04'],
              dtype='datetime64[ns]', name='date', freq=None)
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
               '2018-01-05', '2018-01-06'],
              dtype='datetime64[ns]', name='date', freq=None)
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
               '2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08'],
              dtype='datetime64[ns]', name='date', freq=None)

Example 2)

>>> df = pd.DataFrame(np.random.randint(0, 100, size=(16, 4)), columns=list('ABCD'))
>>> for fold in time_series_split(df, n_splits=4, split_on='index'):
...     print(fold)
RangeIndex(start=0, stop=4, step=1)
RangeIndex(start=0, stop=8, step=1)
RangeIndex(start=0, stop=12, step=1)
RangeIndex(start=0, stop=16, step=1)