Model Selection

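The gtime.model_selection module provides utilities for model selection: splitting feature matrices into train and test sets, shifting a time series over a forecast horizon, and generating time series cross-validation folds.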

class gtime.model_selection.FeatureSplitter(drop_na_mode: str = 'any')

Splits the feature matrices X and y into X_train, y_train, X_test, y_test.

X and y are the feature matrices obtained from the FeatureCreation class.

Parameters

drop_na_mode : str, optional, default: 'any': How to drop the NaN values contained in the X and y matrices. Only 'any' is supported for the moment.

Examples

>>> import pandas as pd
>>> import numpy as np
>>> from gtime.model_selection import FeatureSplitter
>>> X = pd.DataFrame.from_dict({"feature_0": [np.nan, 0, 1, 2, 3, 4, 5, 6, 7, 8],
...                             "feature_1": [np.nan, np.nan, 0.5, 1.5, 2.5, 3.5,
...                                            4.5, 5.5, 6.5, 7.5, ]
...                            })
>>> y = pd.DataFrame.from_dict({"y_0": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
...                             "y_1": [1, 2, 3, 4, 5, 6, 7, 8, 9, np.nan],
...                             "y_2": [2, 3, 4, 5, 6, 7, 8, 9, np.nan, np.nan]
...                            })
>>> feature_splitter = FeatureSplitter()
>>> X_train, y_train, X_test, y_test = feature_splitter.transform(X, y)
>>> X_train
   feature_0  feature_1
2        1.0        0.5
3        2.0        1.5
4        3.0        2.5
5        4.0        3.5
6        5.0        4.5
7        6.0        5.5
>>> y_train
   y_0  y_1  y_2
2    2  3.0  4.0
3    3  4.0  5.0
4    4  5.0  6.0
5    5  6.0  7.0
6    6  7.0  8.0
7    7  8.0  9.0
>>> X_test
   feature_0  feature_1
8        7.0        6.5
9        8.0        7.5
>>> y_test
   y_0  y_1  y_2
8    8  9.0  NaN
9    9  NaN  NaN

transform(X: DataFrame, y: DataFrame) → Tuple[DataFrame, DataFrame, DataFrame, DataFrame]

Split the feature matrices X and y into X_train, y_train, X_test, y_test.

X and y are the feature matrices obtained from the FeatureCreation class.

Parameters

X : pd.DataFrame, shape (n_samples, n_features), required: The feature matrix.
y : pd.DataFrame, shape (n_samples, horizon), required: The y matrix.

Returns

X_train, y_train, X_test, y_test : Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame, pd.DataFrame]: The X and y, split between train and test.
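The split shown in the example above can be approximated with plain pandas. The following is a minimal sketch inferred from the example output, not gtime's implementation. split_train_test is a hypothetical helper, and the rule it encodes (drop rows where X has a NaN, then send rows where y has a NaN to the test set) is an assumption.

```python
import numpy as np
import pandas as pd


def split_train_test(X, y):
    """Hypothetical helper mimicking the split in the example (not gtime code)."""
    # Assumption: rows where any feature is NaN are dropped from both matrices.
    x_complete = X.notna().all(axis=1)
    X, y = X[x_complete], y[x_complete]
    # Assumption: rows whose targets are all known form the training set;
    # rows with at least one unknown target form the test set.
    y_complete = y.notna().all(axis=1)
    return X[y_complete], y[y_complete], X[~y_complete], y[~y_complete]


X = pd.DataFrame({"feature_0": [np.nan, 0, 1, 2, 3]})
y = pd.DataFrame({"y_1": [1, 2, 3, 4, np.nan]})
X_train, y_train, X_test, y_test = split_train_test(X, y)
print(list(X_train.index), list(X_test.index))  # prints [1, 2, 3] [4]
```

Under this reading, the test set is the tail of the series whose future values are not yet observed, which is why y_test in the example still contains NaN.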

gtime.model_selection.blocking_time_series_split(time_series: DataFrame, n_splits=4, split_on='index')

Time Series cross-validator

Blocking Time Series Split works by adding a margin after each split. The margin sits between the folds used at each iteration, which prevents the model from memorizing patterns from one iteration to the next.

If the data is not a time series, the split is based on the number of samples. If the data has a time index and split_on is 'time', the time series is divided based on time.

Parameters

time_series : pandas DataFrame, shape (n_samples, n_features), required

n_splits : int, optional, default = 4: The number of splits/folds on the dataset.

split_on : 'index' or 'time', optional, default = 'index': If the parameter is 'time', the dataframe index must be a DatetimeIndex, PeriodIndex or TimedeltaIndex, and the dataset is split based on time.

Yields

fold : RangeIndex with the indexes of each fold, or time_fold : DatetimeIndex of each fold if split_on is 'time'.

Examples

Example 1)

>>> import numpy as np
>>> import pandas as pd
>>> from gtime.model_selection import blocking_time_series_split
>>> date_rng = pd.date_range(start='1/1/2018', end='1/08/2018', freq='D')
>>> time_series = pd.DataFrame(date_rng, columns=['date'])
>>> time_series.set_index('date', inplace=True)
>>> time_series['data'] = np.random.randint(0, 100, size=(len(date_rng)))
>>> for fold in blocking_time_series_split(time_series, n_splits=4, split_on='time'):
...     print(fold)
DatetimeIndex(['2018-01-01', '2018-01-02'], dtype='datetime64[ns]', name='date', freq=None)
DatetimeIndex(['2018-01-03', '2018-01-04'], dtype='datetime64[ns]', name='date', freq=None)
DatetimeIndex(['2018-01-05', '2018-01-06'], dtype='datetime64[ns]', name='date', freq=None)
DatetimeIndex(['2018-01-07', '2018-01-08'], dtype='datetime64[ns]', name='date', freq=None)

Example 2)

>>> df = pd.DataFrame(np.random.randint(0, 100, size=(16, 4)), columns=list('ABCD'))
>>> for fold in blocking_time_series_split(df, n_splits=4, split_on='index'):
...     print(fold)
RangeIndex(start=0, stop=4, step=1)
RangeIndex(start=4, stop=8, step=1)
RangeIndex(start=8, stop=12, step=1)
RangeIndex(start=12, stop=16, step=1)
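The index-based fold boundaries in Example 2 can be reproduced with simple integer arithmetic. This is a rough sketch, not the library's code: blocking_folds is a hypothetical helper, it yields plain range objects instead of RangeIndex, it leaves out any margin between folds (the example output shows contiguous folds), and letting the last fold absorb the remainder is an assumption.

```python
def blocking_folds(n_samples, n_splits=4):
    """Yield consecutive, non-overlapping index ranges (hypothetical helper)."""
    fold_size = n_samples // n_splits
    for i in range(n_splits):
        start = i * fold_size
        # Assumption: the last fold absorbs any leftover samples.
        stop = n_samples if i == n_splits - 1 else start + fold_size
        yield range(start, stop)


for fold in blocking_folds(16, n_splits=4):
    print(fold)
# prints range(0, 4), range(4, 8), range(8, 12), range(12, 16), one per line
```

Because the blocks never overlap, a model evaluated on one fold has not seen any sample from the other folds.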

gtime.model_selection.horizon_shift(time_series: DataFrame, horizon: Union[int, List[int]] = 5) → DataFrame

Perform a shift of the original time_series for each time step between 1 and horizon.

Parameters

time_series : pd.DataFrame, shape (n_samples, n_features), required: The time series to shift.
horizon : int or list of int, optional, default: 5: How far into the future it is necessary to predict. An int gives one shift of y for every step from 1 to horizon; a list gives only the listed steps.

Returns

y : pd.DataFrame, shape (n_samples, horizon): The shifted time series.

Examples

>>> import pandas as pd
>>> from gtime.model_selection import horizon_shift
>>> X = pd.DataFrame(range(0, 5), index=pd.date_range("2020-01-01", "2020-01-05"))
>>> horizon_shift(X, horizon=2)
            y_1  y_2
2020-01-01  1.0  2.0
2020-01-02  2.0  3.0
2020-01-03  3.0  4.0
2020-01-04  4.0  NaN
2020-01-05  NaN  NaN
>>> horizon_shift(X, horizon=[2])
            y_2
2020-01-01  2.0
2020-01-02  3.0
2020-01-03  4.0
2020-01-04  NaN
2020-01-05  NaN
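The output above is consistent with a negative pandas shift per step. The following sketch illustrates that idea under stated assumptions: shift_targets is a hypothetical stand-in rather than gtime's implementation, it only handles a single-column input, and the y_k column naming is copied from the example output.

```python
import pandas as pd


def shift_targets(time_series, horizon=5):
    """Hypothetical stand-in for horizon_shift (single-column input assumed)."""
    # An int means every step from 1 to horizon; a list means only those steps.
    steps = range(1, horizon + 1) if isinstance(horizon, int) else horizon
    values = time_series.iloc[:, 0]
    # shift(-k) moves the value observed k steps later onto the current row,
    # leaving NaN in the last k rows where no future value exists.
    return pd.DataFrame({f"y_{k}": values.shift(-k) for k in steps})


X = pd.DataFrame(range(0, 5), index=pd.date_range("2020-01-01", "2020-01-05"))
y = shift_targets(X, horizon=2)
print(y)
```

The trailing NaN rows are the ones a splitter such as FeatureSplitter would treat as not yet observed.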

gtime.model_selection.time_series_split(time_series: DataFrame, n_splits=4, split_on='index')

Time Series cross-validator

time_series_split provides indices to split time series data samples that are observed at fixed time intervals. Each fold extends the previous one with later indices, so shuffling in the cross-validator is inappropriate. The input dataframe is split into n_splits folds.

If the data is not a time series, the split is based on the number of samples. If the data has a time index and split_on is 'time', the time series is divided based on time.

Parameters

time_series : pandas DataFrame, shape (n_samples, n_features), required

n_splits : int, optional, default = 4: The number of splits/folds on the dataset.

split_on : 'index' or 'time', optional, default = 'index': If the parameter is 'time', the dataframe index must be a DatetimeIndex, PeriodIndex or TimedeltaIndex, and the dataset is split based on time.

Yields

fold : RangeIndex with the indexes of each fold, or time_fold : DatetimeIndex of each fold if split_on is 'time'.

Examples

Example 1)

>>> import numpy as np
>>> import pandas as pd
>>> from gtime.model_selection import time_series_split
>>> date_rng = pd.date_range(start='1/1/2018', end='1/08/2018', freq='D')
>>> time_series = pd.DataFrame(date_rng, columns=['date'])
>>> time_series.set_index('date', inplace=True)
>>> time_series['data'] = np.random.randint(0, 100, size=(len(date_rng)))
>>> for fold in time_series_split(time_series, n_splits=4, split_on='time'):
...     print(fold)
DatetimeIndex(['2018-01-01', '2018-01-02'], dtype='datetime64[ns]', name='date', freq=None)
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04'],
              dtype='datetime64[ns]', name='date', freq=None)
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
               '2018-01-05', '2018-01-06'],
              dtype='datetime64[ns]', name='date', freq=None)
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
               '2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08'],
              dtype='datetime64[ns]', name='date', freq=None)

Example 2)

>>> df = pd.DataFrame(np.random.randint(0, 100, size=(16, 4)), columns=list('ABCD'))
>>> for fold in time_series_split(df, n_splits=4, split_on='index'):
...     print(fold)
RangeIndex(start=0, stop=4, step=1)
RangeIndex(start=0, stop=8, step=1)
RangeIndex(start=0, stop=12, step=1)
RangeIndex(start=0, stop=16, step=1)
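The expanding folds in Example 2 always start at the first sample and grow by one block per iteration. As a hedged illustration only: expanding_folds is a hypothetical helper, not part of gtime, it yields plain range objects instead of RangeIndex, and having the final fold cover every sample when n_samples is not divisible by n_splits is an assumption.

```python
def expanding_folds(n_samples, n_splits=4):
    """Yield index ranges that all start at 0 and grow (hypothetical helper)."""
    fold_size = n_samples // n_splits
    for i in range(1, n_splits + 1):
        # Assumption: the final fold always covers the whole dataset.
        stop = n_samples if i == n_splits else i * fold_size
        yield range(0, stop)


for fold in expanding_folds(16, n_splits=4):
    print(fold)
# prints range(0, 4), range(0, 8), range(0, 12), range(0, 16), one per line
```

Each fold contains the previous one, so the temporal order of the samples is preserved across iterations.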