Original article link: here
The focus of this post is Anomaly Detection.
In other words, starting from different definitions of what counts as an anomaly, we use different detection algorithms/methods to find anomalous values.
Before going through the corresponding methods, we first need to know what types of anomalies there are.
For reference, see this article.
Anomalies fall into three categories: point anomalies, contextual anomalies, and collective anomalies.
To summarize the author's approach to the different kinds of anomalies:
Most of the methods follow the same pattern: fit a model to the data, use that model to predict/reconstruct the original data, and take the 1% of points with the largest deviation (the points the model fits worst) as anomalies.
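That shared thresholding step is worth seeing on its own. Below is a minimal sketch of it; top_fraction_anomalies is an illustrative helper name, not from the original:

import numpy as np

def top_fraction_anomalies(scores, fraction=0.01):
    # Flag the `fraction` of points with the largest anomaly scores
    scores = np.asarray(scores)
    n_outliers = max(1, int(fraction * len(scores)))
    threshold = np.sort(scores)[-n_outliers]  # smallest score still in the top fraction
    return scores >= threshold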
Let's look at the imported packages:
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.covariance import EllipticEnvelope
# from pyemma import msm  # not available on Kaggle Kernel
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM
The steps in the original article are:
outliers_fraction = 0.01
Isolation Forest is a recursive partitioning algorithm: it isolates points by repeatedly splitting on random features, and points that take fewer splits to isolate are more likely to be anomalies.
# Take the useful features and standardize them
data = df[['value', 'hours', 'daylight', 'DayOfTheWeek', 'WeekDay']]
scaler = preprocessing.StandardScaler()
np_scaled = scaler.fit_transform(data)
data = pd.DataFrame(np_scaled)
# Train the isolation forest
model = IsolationForest(contamination=outliers_fraction)
model.fit(data)
# Add the labels back to the main dataframe (predict gives 1 = inlier, -1 = outlier; map to 0/1)
df['anomaly25'] = pd.Series(model.predict(data))
df['anomaly25'] = df['anomaly25'].map({1: 0, -1: 1})
print(df['anomaly25'].value_counts())
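Beyond the hard 0/1 labels, IsolationForest also exposes a continuous score through scikit-learn's decision_function (lower means more anomalous). A quick way to sanity-check the flagged points ('score25' is a column name introduced here for illustration):

df['score25'] = model.decision_function(data)
print(df.loc[df['anomaly25'] == 1, 'score25'].describe())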
One-Class SVM:
# Take the useful features and standardize them
data = df[['value', 'hours', 'daylight', 'DayOfTheWeek', 'WeekDay']]
scaler = preprocessing.StandardScaler()
np_scaled = scaler.fit_transform(data)
data = pd.DataFrame(np_scaled)
# Train the one-class SVM
model = OneClassSVM(nu=0.95 * outliers_fraction)  # nu=0.95 * outliers_fraction + 0.05
model.fit(data)
# Add the labels back to the main dataframe (1 = inlier, -1 = outlier; map to 0/1)
df['anomaly26'] = pd.Series(model.predict(data))
df['anomaly26'] = df['anomaly26'].map({1: 0, -1: 1})
print(df['anomaly26'].value_counts())
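For OneClassSVM, nu is an upper bound on the fraction of training points treated as outliers, which is why the original sets it slightly below outliers_fraction. It exposes the same decision_function interface if you want the raw boundary distances ('score26' is an illustrative column name):

df['score26'] = model.decision_function(data)
print(df['score26'].describe())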
The steps in the original article:
# Create 4 different data sets based on the categories defined before
df_class0 = df.loc[df['categories'] == 0, 'value']
df_class1 = df.loc[df['categories'] == 1, 'value']
df_class2 = df.loc[df['categories'] == 2, 'value']
df_class3 = df.loc[df['categories'] == 3, 'value']
# Apply EllipticEnvelope (Gaussian distribution) to each category; class 0 shown here
envelope = EllipticEnvelope(contamination=outliers_fraction)
X_train = df_class0.values.reshape(-1, 1)
envelope.fit(X_train)
df_class0 = pd.DataFrame(df_class0)
df_class0['deviation'] = envelope.decision_function(X_train)
df_class0['anomaly'] = envelope.predict(X_train)
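Only class 0 is shown above; the original applies the same envelope to each of the four categories. A hedged sketch of the full loop, writing the labels back into df under an illustrative 'anomaly_ee' column (it reuses the EllipticEnvelope import above):

df['anomaly_ee'] = 0
for cat in range(4):
    subset = df.loc[df['categories'] == cat, 'value']
    X = subset.values.reshape(-1, 1)
    env = EllipticEnvelope(contamination=outliers_fraction)
    env.fit(X)
    # map predict()'s {1: inlier, -1: outlier} onto the 0/1 convention used above
    df.loc[subset.index, 'anomaly_ee'] = (env.predict(X) == -1).astype(int)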
The steps in the original article:
# Define the discrete states by binning the value
x1 = (df['value'] <= 18).astype(int)
x2 = ((df['value'] > 18) & (df['value'] <= 21)).astype(int)
x3 = ((df['value'] > 21) & (df['value'] <= 24)).astype(int)
x4 = ((df['value'] > 24) & (df['value'] <= 27)).astype(int)
x5 = (df['value'] > 27).astype(int)
df_mm = x1 + 2*x2 + 3*x3 + 4*x4 + 5*x5
# Get the anomaly labels for our dataset (evaluate sequences of 5 values; anomaly = less than 20% probable)
# THE ORIGINAL USES pyemma, NOT AVAILABLE IN KAGGLE KERNEL
# df_anomaly = markovAnomaly(df_mm, 5, 0.20)
# df_anomaly = pd.Series(df_anomaly)
# print(df_anomaly.value_counts())
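Since pyemma is unavailable, here is a hedged numpy sketch of what markovAnomaly is described as doing: estimate a transition matrix from the state sequence, score each window of 5 states by the product of its transition probabilities, and flag windows that are less than 20% probable. All names below are illustrative, not the original implementation.

import numpy as np
import pandas as pd

def markov_anomaly_sketch(states, window=5, threshold=0.20):
    # Estimate a first-order transition matrix from the observed state sequence
    states = np.asarray(states) - 1               # states 1..5 -> indices 0..4
    n = int(states.max()) + 1
    trans = np.full((n, n), 1e-6)                 # small prior so no row is all zero
    for a, b in zip(states[:-1], states[1:]):
        trans[a, b] += 1
    trans /= trans.sum(axis=1, keepdims=True)     # row-normalize into probabilities
    # Score each window by the product of its transition probabilities
    labels = np.zeros(len(states), dtype=int)
    for i in range(len(states) - window + 1):
        w = states[i:i + window]
        prob = np.prod([trans[w[j], w[j + 1]] for j in range(window - 1)])
        if prob < threshold:
            labels[i:i + window] = 1              # mark the whole improbable window
    return pd.Series(labels)

df_anomaly = markov_anomaly_sketch(df_mm, 5, 0.20)
print(df_anomaly.value_counts())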
Here, the previous 50 data points are used to predict each new point.
The 1% of points with the largest prediction error (diff) are then selected as anomalies.
# Important parameters and train/test size
prediction_time = 1
testdatasize = 1000
unroll_length = 50
testdatacut = testdatasize + unroll_length + 1
# Train data (.to_numpy() replaces the deprecated .as_matrix(); get ndarray as result)
x_train = data_n[0:-prediction_time - testdatacut].to_numpy()
y_train = data_n[prediction_time:-testdatacut][0].to_numpy()
# Test data
x_test = data_n[-testdatacut:-prediction_time].to_numpy()
y_test = data_n[prediction_time - testdatacut:][0].to_numpy()
# unroll: create a sequence of the 50 previous data points for each data point
def unroll(data, sequence_length=24):
    result = []
    for index in range(len(data) - sequence_length):
        result.append(data[index: index + sequence_length])
    return np.asarray(result)
# Adapt the datasets to the sequence data shape
x_train = unroll(x_train, unroll_length)
x_test = unroll(x_test, unroll_length)
y_train = y_train[-x_train.shape[0]:]
y_test = y_test[-x_test.shape[0]:]
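Before building the network it is worth a quick sanity check: after unrolling, x_* should have shape (samples, 50, n_features) and y_* should be a matching (samples,) vector.

print('x_train', x_train.shape, 'y_train', y_train.shape)
print('x_test ', x_test.shape, 'y_test ', y_test.shape)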
# Specific libraries for the RNN
# Keras is a high-level layer built on TensorFlow, to stay at a high level / easy implementation
# (keras.layers.core / keras.layers.recurrent are legacy paths; modern Keras exposes everything from keras.layers)
from keras.layers import Dense, Activation, Dropout, LSTM
from keras.models import Sequential, model_from_json
import time  # helper libraries
import sys
# Build the model
# (input_dim/output_dim are legacy Keras 1 arguments; use input_shape/units instead)
model = Sequential()
model.add(LSTM(
    units=50,
    input_shape=(unroll_length, x_train.shape[-1]),
    return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(
    units=100,
    return_sequences=False))
model.add(Dropout(0.2))
model.add(Dense(units=1))
model.add(Activation('linear'))
start = time.time()
model.compile(loss='mse', optimizer='rmsprop')
print('compilation time : {}'.format(time.time() - start))
# Train the model
# (epochs replaces the deprecated nb_epoch argument; the original trained up to 350 epochs)
model.fit(
    x_train,
    y_train,
    batch_size=3028,
    epochs=30,
    validation_split=0.1)
# Create the list of differences between predictions and test data
loaded_model = model
diff = []
ratio = []
p = loaded_model.predict(x_test)
# predictions = lstm.predict_sequences_multiple(loaded_model, x_test, 50, 50)
for u in range(len(y_test)):
    pr = p[u][0]
    ratio.append((y_test[u] / pr) - 1)
    diff.append(abs(y_test[u] - pr))
# Select the prediction/reality data points that are furthest apart as anomalies
diff = pd.Series(diff)
number_of_outliers = int(outliers_fraction * len(diff))
threshold = diff.nlargest(number_of_outliers).min()
# Data with anomaly label (test data part)
test = (diff >= threshold).astype(int)
# The training data part, where we didn't predict anything (overfitting possible): no anomaly
complement = pd.Series(0, index=np.arange(len(data_n) - testdatasize))
# Add the labels back to the main dataframe
# (pd.concat replaces the removed Series.append; ignore_index must be a bool, not the string 'True')
df['anomaly27'] = pd.concat([complement, test], ignore_index=True)
print(df['anomaly27'].value_counts())
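The original article also plots the detected anomalies on top of the raw series; below is a minimal matplotlib sketch of that figure, assuming the df index is the time axis:

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(df.index, df['value'], color='blue', label='value')
anoms = df.loc[df['anomaly27'] == 1]
ax.scatter(anoms.index, anoms['value'], color='red', label='anomaly')
ax.legend()
plt.show()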
Source: https://www.cnblogs.com/niemand-01/p/14391592.html