Recently, while writing a feature engineering module, I found two excellent packages: tsfresh and sklearn.
tsfresh is specialized for time series data. It mainly includes two modules, feature extraction and feature selection:
from tsfresh import feature_selection, feature_extraction
To limit the number of irrelevant features, tsfresh uses the FRESH algorithm (FeatuRe Extraction based on Scalable Hypothesis tests). The whole process consists of three steps.
First, the algorithm characterizes each time series with comprehensive and well-established feature mappings. The feature calculators used to derive the features are contained in tsfresh.feature_extraction.feature_calculators.
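To make the idea of a feature mapping concrete, here is a minimal sketch in plain Python. The dictionary keys mirror the names of some tsfresh calculators, but the function bodies are simplified stand-ins written for this post, not the library's code, and `characterize` is a hypothetical helper name.

```python
from statistics import fmean, pstdev


def abs_energy(x):
    """Sum of the squared values of the series."""
    return sum(v * v for v in x)


def mean_abs_change(x):
    """Mean absolute difference between consecutive values (needs >= 2 values)."""
    return fmean([abs(b - a) for a, b in zip(x, x[1:])])


def characterize(series):
    """Map one time series to a flat dict of named scalar features."""
    return {
        "mean": fmean(series),
        "standard_deviation": pstdev(series),  # population standard deviation
        "maximum": max(series),
        "abs_energy": abs_energy(series),
        "mean_abs_change": mean_abs_change(series),
    }


series = [1.0, 3.0, 2.0, 5.0, 4.0]
features = characterize(series)
print(features)
```

Each time series becomes one row of scalar features, and that row is what the later selection steps work on.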
In the second step, each extracted feature vector is individually and independently evaluated with respect to its significance for predicting the target under investigation. These tests are contained in the submodule tsfresh.feature_selection.significance_tests. The result of the significance tests is a vector of p-values, quantifying the significance of each feature for predicting the target.
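As an illustration of how one feature can be turned into a p-value against a binary target, here is a small sketch that uses a permutation test on the difference in class means. This is a swapped-in technique chosen because it fits in a few lines of standard-library Python; it is not one of the tests that tsfresh ships, and `permutation_p_value` and the toy data are made up for this example.

```python
import random


def permutation_p_value(feature, target, n_permutations=2000, seed=0):
    """Two-sided permutation test on the gap between class means.

    feature: list of floats, target: list of 0/1 labels (both classes present).
    """
    def mean_gap(labels):
        pos = [f for f, t in zip(feature, labels) if t == 1]
        neg = [f for f, t in zip(feature, labels) if t == 0]
        return abs(sum(pos) / len(pos) - sum(neg) / len(neg))

    observed = mean_gap(target)
    rng = random.Random(seed)
    shuffled = list(target)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any link between feature and target
        if mean_gap(shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_permutations + 1)


target = [0] * 6 + [1] * 6
informative = [1.0, 1.2, 0.9, 1.1, 1.3, 0.8, 3.0, 3.2, 2.9, 3.1, 3.3, 2.8]
noise = [2.0, 1.0, 3.0, 2.5, 1.5, 2.2, 2.1, 1.1, 2.9, 2.4, 1.6, 2.3]

p_values = [permutation_p_value(f, target) for f in (informative, noise)]
print(p_values)
```

The feature that separates the two classes gets a small p-value, while the noise feature gets a large one. Running this once per feature yields the vector of p-values described above.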
Finally, the vector of p-values is evaluated on the basis of the Benjamini-Yekutieli procedure in order to decide which features to keep.
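The Benjamini-Yekutieli step can be sketched in a few lines of plain Python. This is a textbook-style version written for illustration, not tsfresh's implementation; the function name and the `fdr_level` parameter name are my own choices here.

```python
def benjamini_yekutieli(p_values, fdr_level=0.05):
    """Return one boolean per p-value: True if the feature is kept as relevant."""
    m = len(p_values)
    if m == 0:
        return []
    # Correction factor c(m) = 1 + 1/2 + ... + 1/m for arbitrary dependence.
    harmonic = sum(1.0 / i for i in range(1, m + 1))
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k with p_(k) <= k * alpha / (m * c(m)).
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * fdr_level / (m * harmonic):
            cutoff_rank = rank
    keep = [False] * m
    for idx in order[:cutoff_rank]:
        keep[idx] = True
    return keep


p_values = [0.001, 0.6, 0.008, 0.04, 0.9]
print(benjamini_yekutieli(p_values))  # [True, False, True, False, False]
```

With five features, only the two smallest p-values fall under their rank-dependent thresholds, so only those two features are kept.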
In summary, tsfresh is a scalable and efficient tool for feature engineering.
Although tsfresh is powerful, I chose sklearn.
I downloaded the heart disease data set. Its target is binary and it has 13 features. I only used MinMaxScaler to transform the age, trestbps, and chol columns (the code below passes column indices 0, 3, 4, and 7). For models I chose an AutoSklearnClassifier ensemble and a RandomForest ensemble, but both models performed poorly.
import pandas as pd
from numpy import set_printoptions, inf
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from autosklearn.classification import AutoSklearnClassifier

set_printoptions(threshold=inf)  # print arrays in full

# Load the data: the last column is the target, the rest are features.
data = pd.read_csv("../data_set/heart.csv")
X = data[data.columns[:data.shape[1] - 1]].values
y = data[data.columns[-1]].values

# Rescale the selected columns to the range [0, 1].
X[:, [0, 3, 4, 7]] = MinMaxScaler().fit_transform(X[:, [0, 3, 4, 7]])

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# auto-sklearn ensemble with a 120-second budget and no feature preprocessing.
model_auto = AutoSklearnClassifier(time_left_for_this_task=120, n_jobs=3, include_preprocessors=["no_preprocessing"], seed=3)
model_auto.fit(x_train, y_train)
y_pred = model_auto.predict(x_test)
print(accuracy_score(y_test, y_pred))  # reported: 0.8021978021978022

# Random forest: fit on the training split, then score on the test split.
model = RandomForestClassifier(n_estimators=500)
model.fit(x_train, y_train)
y_pred_rf = model.predict(x_test)
print(accuracy_score(y_test, y_pred_rf))  # reported: 0.8051648351648352
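For reference, the min-max scaling applied in the script above maps each column to [0, 1] via (x - min) / (max - min). Here is a minimal sketch of that formula in plain Python. The `min_max_scale` helper and the sample ages are made up for illustration, and returning zeros for a constant column is an assumption of this sketch.

```python
def min_max_scale(column):
    """Rescale a list of numbers to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column has no spread; map every value to 0.0.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


ages = [29, 54, 77, 41]
scaled = min_max_scale(ages)
print(scaled)
```

A random forest splits on thresholds, so a monotone rescaling like this would not be expected to change its accuracy much, which may be part of why the scaling step did not help here.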
My personal website provides an AutoML service. I uploaded this data set to it, and it got a better score than my code: http://simple-automl.com/preview.html
0.8131868131868132
Original post: https://www.cnblogs.com/xu-xiaofeng/p/10934296.html