Model Selection
The gtime.model_selection module deals with model selection.
- class gtime.model_selection.FeatureSplitter(drop_na_mode: str = 'any')
Splits the feature matrices X and y into X_train, y_train, X_test, y_test.
X and y are the feature matrices obtained from the FeatureCreation class.
Parameters
- drop_na_mode : str, optional, default: 'any'
How to drop the NaN values contained in the X and y matrices. Only 'any' is supported for the moment.
Examples
>>> import pandas as pd
>>> import numpy as np
>>> from gtime.model_selection import FeatureSplitter
>>> X = pd.DataFrame.from_dict({"feature_0": [np.nan, 0, 1, 2, 3, 4, 5, 6, 7, 8],
...                             "feature_1": [np.nan, np.nan, 0.5, 1.5, 2.5, 3.5,
...                                           4.5, 5.5, 6.5, 7.5, ]
...                            })
>>> y = pd.DataFrame.from_dict({"y_0": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
...                             "y_1": [1, 2, 3, 4, 5, 6, 7, 8, 9, np.nan],
...                             "y_2": [2, 3, 4, 5, 6, 7, 8, 9, np.nan, np.nan]
...                            })
>>> feature_splitter = FeatureSplitter()
>>> X_train, y_train, X_test, y_test = feature_splitter.transform(X, y)
>>> X_train
   feature_0  feature_1
2        1.0        0.5
3        2.0        1.5
4        3.0        2.5
5        4.0        3.5
6        5.0        4.5
7        6.0        5.5
>>> y_train
   y_0  y_1  y_2
2    2  3.0  4.0
3    3  4.0  5.0
4    4  5.0  6.0
5    5  6.0  7.0
6    6  7.0  8.0
7    7  8.0  9.0
>>> X_test
   feature_0  feature_1
8        7.0        6.5
9        8.0        7.5
>>> y_test
   y_0  y_1  y_2
8    8  9.0  NaN
9    9  NaN  NaN
- transform(X: pandas.DataFrame, y: pandas.DataFrame) -> Tuple[pandas.DataFrame, pandas.DataFrame, pandas.DataFrame, pandas.DataFrame]
Split the feature matrices X and y into X_train, y_train, X_test, y_test. X and y are the feature matrices obtained from the FeatureCreation class.
Parameters
- X : pd.DataFrame, shape (n_samples, n_features), required
The feature matrix.
- y : pd.DataFrame, shape (n_samples, horizon), required
The y matrix.
Returns
- X_train, y_train, X_test, y_test : Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame, pd.DataFrame]
The X and y, split between train and test.
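The split shown in the FeatureSplitter example can be approximated with plain pandas. The sketch below is inferred from the example output only and is not gtime's implementation: it assumes that rows with any NaN in X are dropped, that the remaining rows with complete targets form the training set, and that rows with at least one missing target form the test set. The function name split_features_sketch is hypothetical.

```python
import numpy as np
import pandas as pd


def split_features_sketch(X, y):
    """Hypothetical re-creation of the split shown in the example (not gtime code)."""
    # Assumption: with drop_na_mode='any', rows with any NaN feature are discarded.
    X_complete = X.dropna(axis=0, how="any")
    y_known = y.loc[X_complete.index]
    # Assumption: rows whose targets are only partly known cannot be trained on,
    # so they end up in the test set.
    has_missing_target = y_known.isna().any(axis=1)
    X_train, y_train = X_complete[~has_missing_target], y_known[~has_missing_target]
    X_test, y_test = X_complete[has_missing_target], y_known[has_missing_target]
    return X_train, y_train, X_test, y_test


X = pd.DataFrame({
    "feature_0": [np.nan, 0, 1, 2, 3, 4, 5, 6, 7, 8],
    "feature_1": [np.nan, np.nan, 0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5],
})
y = pd.DataFrame({
    "y_0": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    "y_1": [1, 2, 3, 4, 5, 6, 7, 8, 9, np.nan],
    "y_2": [2, 3, 4, 5, 6, 7, 8, 9, np.nan, np.nan],
})
X_train, y_train, X_test, y_test = split_features_sketch(X, y)
```

On the data from the example, this sketch selects rows 2 to 7 for training and rows 8 and 9 for testing, which matches the output listed above.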
- gtime.model_selection.blocking_time_series_split(time_series: DataFrame, n_splits=4, split_on='index')
Time Series cross-validator
Blocking Time Series Split works by adding a margin after each split. The margin separates the folds used at each iteration, which prevents the model from memorizing patterns from one iteration to the next.
If the data is not a time series, the split is based on the number of samples. If the data has a time index and split_on is 'time', the time series is divided based on time.
Parameters
time_series : pandas DataFrame, shape (n_samples, n_features), required
n_splits : int, optional, default = 4
The number of splits/folds on the dataset.
split_on : 'index' or 'time', optional, default = 'index'
If the parameter is 'time', the dataframe index must be a DatetimeIndex, PeriodIndex or TimedeltaIndex, and the dataset is split based on time.
Yields
fold : RangeIndex with the indexes of each fold, or time_fold : DatetimeIndex of each fold if split_on is 'time'.
Examples
Example 1)
>>> import numpy as np
>>> import pandas as pd
>>> from gtime.model_selection import blocking_time_series_split
>>> date_rng = pd.date_range(start='1/1/2018', end='1/08/2018', freq='D')
>>> time_series = pd.DataFrame(date_rng, columns=['date'])
>>> time_series.set_index('date', inplace=True)
>>> time_series['data'] = np.random.randint(0, 100, size=(len(date_rng)))
>>> for fold in blocking_time_series_split(time_series, n_splits=4, split_on='time'):
...     print(fold)
DatetimeIndex(['2018-01-01', '2018-01-02'], dtype='datetime64[ns]', name='date', freq=None)
DatetimeIndex(['2018-01-03', '2018-01-04'], dtype='datetime64[ns]', name='date', freq=None)
DatetimeIndex(['2018-01-05', '2018-01-06'], dtype='datetime64[ns]', name='date', freq=None)
DatetimeIndex(['2018-01-07', '2018-01-08'], dtype='datetime64[ns]', name='date', freq=None)
Example 2)
>>> df = pd.DataFrame(np.random.randint(0, 100, size=(16, 4)), columns=list('ABCD'))
>>> for fold in blocking_time_series_split(df, n_splits=4, split_on='index'):
...     print(fold)
RangeIndex(start=0, stop=4, step=1)
RangeIndex(start=4, stop=8, step=1)
RangeIndex(start=8, stop=12, step=1)
RangeIndex(start=12, stop=16, step=1)
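To make the shape of these folds concrete, the following sketch yields contiguous, non-overlapping blocks of a dataframe index, split on the number of samples. It is an illustration under stated assumptions, not gtime's implementation: the name block_folds_sketch is hypothetical, and the margin handling described above is left out.

```python
import numpy as np
import pandas as pd


def block_folds_sketch(frame, n_splits=4):
    """Hypothetical helper: yield n_splits contiguous, non-overlapping index blocks."""
    # n_splits + 1 evenly spaced boundaries between 0 and the number of samples.
    edges = np.linspace(0, len(frame), n_splits + 1).astype(int)
    for start, stop in zip(edges[:-1], edges[1:]):
        # Each fold starts where the previous one ended, so no sample is reused.
        yield frame.index[int(start):int(stop)]


df = pd.DataFrame(np.zeros((16, 4)), columns=list("ABCD"))
block_folds = [list(fold) for fold in block_folds_sketch(df, n_splits=4)]
```

With 16 rows and n_splits=4, each block holds four consecutive positions, mirroring the RangeIndex output in Example 2.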
- gtime.model_selection.horizon_shift(time_series: DataFrame, horizon: Union[int, List[int]] = 5) -> DataFrame
Perform a shift of the original time_series for each time step between 1 and horizon.
Parameters
- time_series : pd.DataFrame, shape (n_samples, n_features), required
The list of TimeSeriesFeature from which to compute the feature_extraction.
- horizon : int or list of int, optional, default: 5
How far into the future to predict. This corresponds to the number of shifts that are going to be performed on y. When a list is given, only the listed shifts are computed (see the example with horizon=[2]).
Returns
- y : pd.DataFrame, shape (n_samples, horizon)
The shifted time series.
Examples
>>> import pandas as pd
>>> from gtime.model_selection import horizon_shift
>>> X = pd.DataFrame(range(0, 5), index=pd.date_range("2020-01-01", "2020-01-05"))
>>> horizon_shift(X, horizon=2)
            y_1  y_2
2020-01-01  1.0  2.0
2020-01-02  2.0  3.0
2020-01-03  3.0  4.0
2020-01-04  4.0  NaN
2020-01-05  NaN  NaN
>>> horizon_shift(X, horizon=[2])
            y_2
2020-01-01  2.0
2020-01-02  3.0
2020-01-03  4.0
2020-01-04  NaN
2020-01-05  NaN
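The output above is consistent with shifting the series backwards by each step, so that column y_k holds the value observed k steps later. The sketch below reproduces that behaviour with pandas' Series.shift; it is inferred from the example and is not gtime's implementation. The name horizon_shift_sketch is hypothetical, and the sketch assumes a single-column input.

```python
import pandas as pd


def horizon_shift_sketch(time_series, horizon=5):
    """Hypothetical re-creation of the example output (assumes one input column)."""
    # An int means every step from 1 to horizon; a list means exactly those steps.
    steps = range(1, horizon + 1) if isinstance(horizon, int) else horizon
    values = time_series.iloc[:, 0]
    # shift(-k) moves future values up by k rows; the last k rows become NaN.
    return pd.DataFrame({f"y_{k}": values.shift(-k) for k in steps})


X = pd.DataFrame(range(0, 5), index=pd.date_range("2020-01-01", "2020-01-05"))
y_all = horizon_shift_sketch(X, horizon=2)
y_only_2 = horizon_shift_sketch(X, horizon=[2])
```

The trailing NaN values are the rows for which the future value lies beyond the end of the series.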
- gtime.model_selection.time_series_split(time_series: DataFrame, n_splits=4, split_on='index')
Time Series cross-validator
time_series_split provides indices to split time series data samples that are observed at fixed time intervals. In each split, subsequent indices must be higher than the previous ones, so shuffling in the cross-validator is inappropriate. The input dataframe is split into n_splits folds.
If the data is not a time series, the split is based on the number of samples. If the data has a time index and split_on is 'time', the time series is divided based on time.
Parameters
time_series : pandas DataFrame, shape (n_samples, n_features), required
n_splits : int, optional, default = 4
The number of splits/folds on the dataset.
split_on : 'index' or 'time', optional, default = 'index'
If the parameter is 'time', the dataframe index must be a DatetimeIndex, PeriodIndex or TimedeltaIndex, and the dataset is split based on time.
Yields
fold : RangeIndex with the indexes of each fold, or time_fold : DatetimeIndex of each fold if split_on is 'time'.
Examples
Example 1)
>>> import numpy as np
>>> import pandas as pd
>>> from gtime.model_selection import time_series_split
>>> date_rng = pd.date_range(start='1/1/2018', end='1/08/2018', freq='D')
>>> time_series = pd.DataFrame(date_rng, columns=['date'])
>>> time_series.set_index('date', inplace=True)
>>> time_series['data'] = np.random.randint(0, 100, size=(len(date_rng)))
>>> for fold in time_series_split(time_series, n_splits=4, split_on='time'):
...     print(fold)
DatetimeIndex(['2018-01-01', '2018-01-02'], dtype='datetime64[ns]', name='date', freq=None)
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04'],
              dtype='datetime64[ns]', name='date', freq=None)
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
               '2018-01-05', '2018-01-06'],
              dtype='datetime64[ns]', name='date', freq=None)
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
               '2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08'],
              dtype='datetime64[ns]', name='date', freq=None)
Example 2)
>>> df = pd.DataFrame(np.random.randint(0, 100, size=(16, 4)), columns=list('ABCD'))
>>> for fold in time_series_split(df, n_splits=4, split_on='index'):
...     print(fold)
RangeIndex(start=0, stop=4, step=1)
RangeIndex(start=0, stop=8, step=1)
RangeIndex(start=0, stop=12, step=1)
RangeIndex(start=0, stop=16, step=1)
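The expanding folds in both examples can be sketched as follows. This is one possible reading of the documented behaviour, not gtime's implementation: the name expanding_folds_sketch is hypothetical, and the 'time' branch assumes that the covered time span is cut into n_splits pieces of equal duration and that the index is a DatetimeIndex.

```python
import numpy as np
import pandas as pd


def expanding_folds_sketch(frame, n_splits=4, split_on="index"):
    """Hypothetical helper: yield folds that always start at the first sample."""
    if split_on == "time":
        # Assumption: equal-duration cut points between the first and last timestamp.
        edges = pd.date_range(start=frame.index.min(), end=frame.index.max(),
                              periods=n_splits + 1)
        for edge in edges[1:]:
            # Keep every sample observed up to and including the cut point.
            yield frame.index[frame.index <= edge]
    else:
        # Equal-sized steps in the number of samples.
        stops = np.linspace(0, len(frame), n_splits + 1).astype(int)[1:]
        for stop in stops:
            yield frame.index[:int(stop)]


date_rng = pd.date_range(start="2018-01-01", end="2018-01-08", freq="D")
time_series = pd.DataFrame({"data": np.arange(len(date_rng))}, index=date_rng)
time_fold_sizes = [len(f) for f in expanding_folds_sketch(time_series, 4, "time")]

df = pd.DataFrame(np.zeros((16, 4)), columns=list("ABCD"))
index_fold_sizes = [len(f) for f in expanding_folds_sketch(df, 4, "index")]
```

Because every fold contains all earlier samples, each fold is a superset of the previous one, and the temporal order of the samples is preserved.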