AI Fundamentals / Machine Learning
1. The Perceptron's Limitation
- Try to draw a single straight line so that only the black dots lie on one side and only the white dots on the other.
- This is the XOR (exclusive OR) problem, the classic example of a single-layer perceptron's limits (a quick demonstration follows below).
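A minimal sketch of the problem (assumed toy setup): scikit-learn's single-layer Perceptron cannot separate the XOR points, so its accuracy stays around 0.5 no matter how long it trains.
import numpy as np
from sklearn.linear_model import Perceptron
x_data = [[0, 0], [0, 1], [1, 0], [1, 1]]
y_data = [0, 1, 1, 0]  # XOR labels
model = Perceptron()
model.fit(x_data, y_data)
print('Perceptron XOR score : ', model.score(x_data, y_data))  # stays near 0.5, never 1.0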
2. SVM
- A support vector machine is a supervised learning algorithm that maximizes the margin.
- The margin is the widest region within which the given data points can move without causing an error.
- SVC (classification): support vector machine for classification
- SVR (regression): support vector machine for regression (a small margin sketch follows below)
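A small illustrative sketch (toy data assumed): after fitting, SVC exposes the support vectors, i.e. the points closest to the decision boundary that define the margin.
import numpy as np
from sklearn.svm import SVC
x = np.array([[0, 0], [1, 0], [0, 1], [2, 2], [3, 2], [2, 3]])
y = np.array([0, 0, 0, 1, 1, 1])
model = SVC(kernel='linear')  # a linear kernel keeps the margin easy to inspect
model.fit(x, y)
print('support vectors :\n', model.support_vectors_)  # the points that pin down the margin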
3. Decision Tree
- A model widely used for both classification and regression problems.
- A decision tree essentially learns by chaining yes/no questions until it reaches a decision (see the printout sketch below).
- In scikit-learn, decision trees are implemented as DecisionTreeRegressor and DecisionTreeClassifier.
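A short sketch of the yes/no idea: sklearn's export_text prints the questions a fitted tree actually learned (a shallow depth is assumed here just for readability).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text
datasets = load_iris()
model = DecisionTreeClassifier(max_depth=2)  # keep the tree shallow so the printout stays small
model.fit(datasets.data, datasets.target)
print(export_text(model, feature_names=list(datasets.feature_names)))  # each split is a yes/no question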
4. Scaling
Normalization
- Scales features into a specific range, usually [0, 1].
- The smallest value maps to 0 and the largest to 1, so every feature ends up in the [0, 1] range.
Standardization
- Scales features to mean 0 and variance 1.
- That is, it rescales features to the scale of a standard normal distribution (it does not change the distribution's shape).
Caution
- Apply fit_transform() to the training data.
- Apply transform() only to the test data, as in the sketch below.
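A minimal sketch of this rule, using the same iris data as the later examples:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
x, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7, random_state=100)
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)  # fit + transform on the training set only
x_test = scaler.transform(x_test)        # transform only: never fit on the test set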
1. MinMaxScaler()
- Also called Min-Max normalization; scales features into a given range, [0, 1] by default.
- Very sensitive to outliers; often considered more useful for regression than for classification.
2. StandardScaler()
- Scales features to mean 0 and variance 1 (z-score standardization).
- Also very sensitive to outliers; often considered more useful for classification than for regression.
3. MaxAbsScaler()
- Scales each feature so that its maximum absolute value becomes 1.
- All values end up between -1 and 1; for all-positive data it behaves exactly like MinMaxScaler.
4. RobustScaler()
- Uses the median and the interquartile range instead of the mean and variance.
- Minimizes the influence of outliers (see the comparison sketch below).
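A small comparison sketch (a toy column with one outlier is assumed) showing how each of the four scalers reacts:
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, MaxAbsScaler, RobustScaler
x = np.array([[1.], [2.], [3.], [4.], [100.]])  # 100 is an outlier
for scaler in [MinMaxScaler(), StandardScaler(), MaxAbsScaler(), RobustScaler()]:
    # the outlier squeezes the first three scalers' output; RobustScaler is far less affected
    print(type(scaler).__name__, scaler.fit_transform(x).ravel())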
5. Random Forest
- An ensemble method that trains many decision trees, each with injected randomness.
- It builds multiple learners and then combines them into a single final learner.
• Characteristics
Randomness: the forest is made of trees that each differ slightly in character
Decorrelation: the individual trees' predictions are not correlated with one another
Robustness: errors do not propagate between trees, so the forest resists noise
Generalization: the randomization mitigates overfitting (a brief comparison sketch follows below)
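A brief sketch of these points: on the breast cancer data used later in these notes, a forest of randomized trees usually generalizes better than a single tree grown on the same split.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
x, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7, random_state=100)
tree = DecisionTreeClassifier(random_state=100).fit(x_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=100).fit(x_train, y_train)
print('single tree acc : ', tree.score(x_test, y_test))
print('random forest acc : ', forest.score(x_test, y_test))  # typically higher than the single tree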
6. Boosting-family Models
- Boosting is the process of combining weak classifiers into one strong classifier.
- The basic principle of ensemble algorithms: combine classifiers A, B, and C, each with a low accuracy of, say, 0.3, to obtain a higher overall accuracy, e.g., around 0.7.
- Boosting runs this process sequentially: build classifier A, use what it got wrong to build B, and use that in turn to build C.
7. Adaptive Boosting (AdaBoost)
- Classifies by majority vote and puts extra weight on the misclassified samples (see the sketch below).
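A minimal sketch (assumed setup): scikit-learn's AdaBoostClassifier builds weak learners sequentially, re-weighting the misclassified samples each round so later learners focus on earlier mistakes.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
x, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7, random_state=100)
model = AdaBoostClassifier(n_estimators=100, random_state=100)  # default weak learner is a depth-1 stump
model.fit(x_train, y_train)
print('AdaBoost acc : ', model.score(x_test, y_test))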
8. Gradient Boosting Model (GBM)
- Weights the errors through the gradient of the loss function (a minimal sketch follows below).
- LightGBM, CatBoost, and XGBoost are packages that implement the gradient boosting algorithm.
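A small sketch using scikit-learn's own GradientBoostingRegressor (parameters chosen only for illustration): each new tree is fit to the gradient of the loss, and learning_rate shrinks each tree's contribution.
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
x, y = fetch_california_housing(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7, random_state=100)
model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=100)
model.fit(x_train, y_train)
print('GBM r2 : ', model.score(x_test, y_test))  # score() returns R^2 for regressors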
9. K-Fold
- The most widely used cross-validation technique for ML models.
- Builds K data folds and runs training plus validation K times, once per fold (see the index sketch below).
- The usual cross-validation choice for regression problems.
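A minimal sketch of the mechanics: KFold with n_splits=5 yields five (train, validation) index splits, and each sample lands in the validation fold exactly once.
import numpy as np
from sklearn.model_selection import KFold
x = np.arange(10)
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
for i, (train_idx, val_idx) in enumerate(kfold.split(x)):
    print('fold', i, 'validation indices :', val_idx)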
10. StratifiedKFold
• For skewed (imbalanced) label distributions
• The usual cross-validation choice for classification (see the sketch below)
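A short sketch with deliberately skewed labels: StratifiedKFold keeps the 3:1 class ratio inside every validation fold, which plain KFold does not guarantee.
import numpy as np
from sklearn.model_selection import StratifiedKFold
x = np.zeros((12, 1))            # the features are irrelevant here
y = np.array([0] * 9 + [1] * 3)  # skewed labels: 75% class 0, 25% class 1
kfold = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
for train_idx, val_idx in kfold.split(x, y):
    print('validation labels :', y[val_idx])  # each fold keeps 3 of class 0 and 1 of class 1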
https://dev-with-gpt.tistory.com/58
1. Solving the XOR Problem
import numpy as np
from sklearn.linear_model import Perceptron
from sklearn.svm import SVC, SVR, LinearSVC, LinearSVR
from sklearn.metrics import accuracy_score
from keras.models import Sequential
from keras.layers import Dense
#1. Data
# build an MLP model and reach ACC = 1.0
x_data = [[0,0], [0,1], [1,0], [1,1]]
y_data = [0, 1, 1, 0]
#2. Model
model = Sequential()
model.add(Dense(32, input_dim=2))
model.add(Dense(64, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# the single sigmoid output unit corresponds to sklearn's Perceptron
#3. Compile, train
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['acc'])
model.fit(x_data, y_data, batch_size=1, epochs=100)
#4. Evaluate, predict
loss, acc = model.evaluate(x_data, y_data)
y_predict = model.predict(x_data)
print(x_data, "predictions : ", y_predict)
print('model loss : ', loss)
print('acc : ', acc)
# the MLP reaches ACC = 1.0:
# model loss : 0.00042129121720790863
# acc : 1.0
2. SVM Models (SVC, SVR)
import numpy as np
from sklearn.linear_model import Perceptron
from sklearn.svm import SVC, SVR, LinearSVC, LinearSVR
from sklearn.metrics import accuracy_score
#1. Data
x_data = [[0,0], [0,1], [1,0], [1,1]]
y_data = [0, 1, 1, 0]
#2. Model
model = SVC()  # the kernel SVM handles XOR, which a single-layer perceptron cannot
#3. Train
model.fit(x_data, y_data)
#4. Evaluate, predict
result = model.score(x_data, y_data)
print('model score : ', result)
y_predict = model.predict(x_data)
print(x_data, "predictions : ", y_predict)
acc = accuracy_score(y_data, y_predict)
print('acc : ', acc)
# SVC on the XOR data that defeats the perceptron
# model score : 1.0
# [[0, 0], [0, 1], [1, 0], [1, 1]] predictions : [0 1 1 0]
# acc : 1.0
3. Linear Models
(Perceptron and LogisticRegression are classification models; LinearRegression is a regression model)
[SVR regression model]
#1. Exercise: compare the svm models against my tf/keras model on
#1. iris
#2. cancer
#3. wine
#4. california
import numpy as np
from sklearn.linear_model import Perceptron
from sklearn.svm import SVR, LinearSVR
from sklearn.metrics import accuracy_score
from keras.models import Sequential
from keras.layers import Dense
from sklearn.datasets import load_breast_cancer
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing
import tensorflow as tf
tf.random.set_seed(100)  # fix the random seed used for weight initialization
#1. Data
datasets = fetch_california_housing()
x = datasets['data']
y = datasets['target']
print(x.shape, y.shape)  # (20640, 8) (20640,)
print('y values : ', np.unique(y))  # continuous house prices, not class labels
#2. Model
model = SVR()
#3. Train
model.fit(x, y)
#4. Evaluate, predict
result = model.score(x, y)  # score() returns R^2 for regressors, not accuracy
print('result cali SVR r2 : ', result)
#
#2. Model
model = LinearSVR()
#3. Train
model.fit(x, y)
#4. Evaluate, predict
result = model.score(x, y)
print('result cali LinearSVR r2 : ', result)
####
#2. Model
model = SVR()
x_train, x_test, y_train, y_test = train_test_split(
    x, y, train_size=0.7, random_state=100, shuffle=True
)
#3. Train
model.fit(x_train, y_train)
#4. Evaluate, predict
result = model.score(x_test, y_test)
print('result cali SVR (train/test split) r2 : ', result)
# result cali SVR r2 : -0.01658668690926901
# result cali LinearSVR r2 : -0.41929123956268755
# result cali SVR (train/test split) r2 : -0.01663695941103427
[SVC classification model]
#1. Exercise: compare the svm models against my tf/keras model on
#1. iris
#2. cancer
#3. wine
#4. california
import numpy as np
from sklearn.linear_model import Perceptron
from sklearn.svm import SVC, LinearSVC
from sklearn.metrics import accuracy_score
from keras.models import Sequential
from keras.layers import Dense
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing
#1. Data
datasets = load_breast_cancer()
x = datasets['data']
y = datasets['target']
print(x.shape, y.shape)  # (569, 30) (569,)
print('y labels : ', np.unique(y))  # y labels : [0 1]
#2. Model
model = SVC()
#3. Train
model.fit(x, y)
#4. Evaluate, predict
result = model.score(x, y)
print('result cancer SVC acc : ', result)
# ####
#2. Model
model = LinearSVC()
#3. Train
model.fit(x, y)
#4. Evaluate, predict
result = model.score(x, y)
print('result cancer LinearSVC acc : ', result)
#2. Model
model = SVC()
x_train, x_test, y_train, y_test = train_test_split(
    x, y, train_size=0.7, random_state=100, shuffle=True
)
#3. Train
model.fit(x_train, y_train)
#4. Evaluate, predict
result = model.score(x_test, y_test)
print('result cancer SVC (train/test split) acc : ', result)
# result cancer SVC acc : 0.9226713532513181
# result cancer LinearSVC acc : 0.9244288224956063
# result cancer SVC (train/test split) acc : 0.9064327485380117
4. Tree Models (DecisionTreeClassifier, DecisionTreeRegressor)
[DecisionTreeRegressor]
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Dropout
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.datasets import fetch_california_housing
# from sklearn.datasets import load_boston  # no longer provided, for ethical reasons
from sklearn.svm import SVR, LinearSVR
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.preprocessing import MaxAbsScaler, RobustScaler
from sklearn.tree import DecisionTreeRegressor
#1. Data
# datasets = load_boston()
datasets = fetch_california_housing()
x = datasets.data
# ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']
y = datasets.target
# MedInc: median income of the block group
# HouseAge: median house age in the block group
# AveRooms: average number of rooms per household
# AveBedrms: average number of bedrooms per household
# Population: block group population
# AveOccup: average number of household members
# Latitude: block group latitude
# Longitude: block group longitude
print(datasets.feature_names)
print(datasets.DESCR)
print(x.shape)  # (20640, 8)
print(y.shape)  # (20640,)
model = DecisionTreeRegressor()  # note: replaced below by the Keras model used for the comparison
x_train, x_test, y_train, y_test = train_test_split(
x, y, train_size=0.7, random_state=100, shuffle=True
)
# apply a scaler
# scaler = MinMaxScaler()
scaler = StandardScaler()
# scaler = MaxAbsScaler()
# scaler = RobustScaler()
x_train = scaler.fit_transform(x_train)  # fit on the training data only
x_test = scaler.transform(x_test)
#2. Build the model
model = Sequential()
model.add(Dense(8, input_dim=8))  # x_train.shape is (14447, 8), so 8 input features
model.add(Dense(100))
model.add(Dropout(0.25))
model.add(Dense(100))
model.add(Dropout(0.25))
model.add(Dense(100))
model.add(Dense(100))
model.add(Dense(1))  # regression, so the output layer has a single unit
#3. Compile, train
model.compile(loss='mse', optimizer='adam')
model.fit(x_train, y_train, epochs=500, batch_size=125)  # epochs = number of training passes
#4. Evaluate, predict
loss = model.evaluate(x_test, y_test)
print('loss : ', loss)
y_predict = model.predict(x_test)
r2 = r2_score(y_test, y_predict)
print('r2 score : ', r2)
# loss : 0.506873607635498
# r2 score : 0.617474494949859
[DecisionTreeClassifier]
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, accuracy_score
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.preprocessing import MaxAbsScaler, RobustScaler
import time
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier
#1. Data
datasets = load_iris()
print(datasets.DESCR)
print(datasets.feature_names)
# ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
x = datasets.data
y = datasets.target
print(x.shape, y.shape)  # (150, 4) (150,) -> input_dim = 4
model = DecisionTreeClassifier()  # note: replaced below by the Keras model used for the comparison
x_train, x_test, y_train, y_test = train_test_split(
x, y, train_size=0.7, random_state=100, shuffle=True
)
print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)
print(y_test)
# apply a scaler
# scaler = MinMaxScaler()
scaler = StandardScaler()
# scaler = MaxAbsScaler()
# scaler = RobustScaler()
x_train = scaler.fit_transform(x_train)  # fit on the training data only
x_test = scaler.transform(x_test)
#2. Build the model
model = Sequential()
model.add(Dense(100, input_dim=4))
model.add(Dense(100))
model.add(Dense(100))
model.add(Dense(500))
model.add(Dense(3, activation='softmax'))
#3. Compile, train
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])
start_time = time.time()
model.fit(x_train, y_train, epochs=100, batch_size=100)
end_time = time.time() - start_time
#4. Evaluate, predict
# y_predict = model.predict(x_test)
# y_predict = np.round(y_predict)
# loss = model.evaluate(x_test, y_test)
# acc = accuracy_score(y_test, y_predict)
#
y_predict = model.predict(x_test)
y_predict = np.argmax(y_predict, axis=1)
#
print(y_predict)
loss = model.evaluate(x_test, y_test)
acc = accuracy_score(y_test, y_predict)
# np.argmax(..., axis=n) returns the index of the largest value along the given axis.
# With axis=1 it picks the index of the largest value in each row, which turns
# the softmax outputs into multi-class predictions.
# axis=0 works down the rows (per column); axis=1 works across the columns (per row).
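# A quick illustrative example of the argmax behavior described above:
# np.argmax([[0.1, 0.7, 0.2], [0.8, 0.1, 0.1]], axis=1) -> array([1, 0])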
print('loss : ', loss)
print('acc : ', acc)
print('elapsed time : ', end_time)
# loss : [0.11477744579315186, 0.9555555582046509]
# acc : 0.9555555555555556
# elapsed time : 0.9926300048828125
# with the scaler applied:
# loss : [0.049853190779685974, 0.9777777791023254]
# acc : 0.9777777777777777
# elapsed time : 0.9749109745025635
5. Ensemble Models (RandomForestClassifier, RandomForestRegressor)
[RandomForestClassifier]
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, KFold, StratifiedKFold
from sklearn.model_selection import cross_val_score, cross_validate, cross_val_predict
from sklearn.preprocessing import MinMaxScaler
from sklearn.utils import all_estimators
from sklearn.metrics import r2_score, accuracy_score
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
import warnings
warnings.filterwarnings('ignore')
#1. Data
datasets = load_iris()
x = datasets.data
y = datasets.target
x_train, x_test, y_train, y_test = train_test_split(
    x, y, train_size=0.8, shuffle=True, random_state=42  # validation folds are drawn from the training split
)
# kfold
n_splits = 5
random_state = 42
kfold = StratifiedKFold(n_splits=n_splits,
                        shuffle=True,
                        random_state=random_state)
scaler = MinMaxScaler()
x_train = scaler.fit_transform(x_train)  # fit on the training data only
x_test = scaler.transform(x_test)
#2. Model
model = RandomForestClassifier()
#3. Train
model.fit(x_train, y_train)
#4. Results
score = cross_val_score(model,
                        x_train, y_train,
                        cv=kfold)  # cross-validation
print(score)
y_predict = cross_val_predict(model,
                              x_test, y_test,
                              cv=kfold)
y_predict = np.round(y_predict).astype(int)
# print('cv predict : ', y_predict)
acc = accuracy_score(y_test, y_predict)
print('cv RandomForestClassifier iris pred acc : ', acc)
#[0.91666667 0.95833333 0.91666667 0.83333333 1. ]
# cv RandomForestClassifier iris pred acc : 0.9666666666666667
# [0.95833333 0.95833333 0.875 0.95833333 0.91666667]
# cv RandomForestClassifier iris pred acc : 0.9666666666666667
[RandomForestRegressor]
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split, KFold, StratifiedKFold
from sklearn.model_selection import cross_val_score, cross_validate, cross_val_predict
from sklearn.preprocessing import MinMaxScaler
from sklearn.utils import all_estimators
from sklearn.metrics import r2_score, accuracy_score
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
import warnings
warnings.filterwarnings('ignore')
#1. Data
datasets = fetch_california_housing()
x = datasets.data
y = datasets.target
x_train, x_test, y_train, y_test = train_test_split(
    x, y, train_size=0.8, shuffle=True, random_state=42  # validation folds are drawn from the training split
)
# kfold
n_splits = 5
random_state = 42
kfold = KFold(n_splits=n_splits,
              shuffle=True,
              random_state=random_state)
scaler = MinMaxScaler()
x_train = scaler.fit_transform(x_train)  # fit on the training data only
x_test = scaler.transform(x_test)
#2. Model
model = RandomForestRegressor()
#3. Train
model.fit(x_train, y_train)
#4. Results
score = cross_val_score(model,
                        x_train, y_train,
                        cv=kfold)  # cross-validation
print(score)
y_predict = cross_val_predict(model,
                              x_test, y_test,
                              cv=kfold)
# y_predict = np.round(y_predict).astype(int)  # not needed for regression
# print('cv predict : ', y_predict)
r2 = r2_score(y_test, y_predict)
print('cv RandomForestRegressor california pred r2 : ', r2)
#### feature importances
print(model, ":", model.feature_importances_)
import matplotlib.pyplot as plt
n_features = datasets.data.shape[1]
plt.barh(range(n_features), model.feature_importances_, align='center')
plt.yticks(np.arange(n_features), datasets.feature_names)
plt.title('cali Feature Importances')
plt.ylabel('Feature')
plt.xlabel('importances')
plt.ylim(-1, n_features)  # keep all horizontal bars in view
plt.show()
6. all_estimators (loop over every algorithm and report its accuracy)
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.utils import all_estimators
from sklearn.metrics import r2_score, accuracy_score
import warnings
warnings.filterwarnings('ignore')
#1. Data
datasets = load_breast_cancer()
x = datasets.data
y = datasets.target
x_train, x_test, y_train, y_test = train_test_split(
    x, y, train_size=0.7, random_state=42, shuffle=True
)
scaler = MinMaxScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
#2. Model
allAlgorithms = all_estimators(type_filter='classifier')
print('allAlgorithms : ', allAlgorithms)
print('total : ', len(allAlgorithms))  # 41 classifiers
#3. Output
for (name, algorithm) in allAlgorithms:
    try:
        model = algorithm()
        model.fit(x_train, y_train)
        y_predict = model.predict(x_test)
        acc = accuracy_score(y_test, y_predict)
        print(name, 'accuracy : ', acc)
    except:
        print(name, 'failed to run!')
7. KFold and StratifiedKFold (regression vs. classification)
[KFold]
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, KFold
from sklearn.model_selection import cross_val_score, cross_validate, cross_val_predict
from sklearn.preprocessing import MinMaxScaler
from sklearn.utils import all_estimators
from sklearn.metrics import r2_score, accuracy_score
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
import warnings
warnings.filterwarnings('ignore')
#1. Data
datasets = load_wine()
x = datasets.data
y = datasets.target
x_train, x_test, y_train, y_test = train_test_split(
    x, y, train_size=0.8, shuffle=True, random_state=42  # validation folds are drawn from the training split
)
# kfold
n_splits = 5
random_state = 42
kfold = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
scaler = MinMaxScaler()
x_train = scaler.fit_transform(x_train)  # fit on the training data only
x_test = scaler.transform(x_test)
#2. Model
model = RandomForestRegressor()
#3. Train
model.fit(x_train, y_train)
#4. Results
score = cross_val_score(model,
                        x_train, y_train,
                        cv=kfold)  # cross-validation
print(score)
y_predict = cross_val_predict(model,
                              x_test, y_test,
                              cv=kfold)
y_predict = np.round(y_predict).astype(int)  # round the regressor's output back to class labels
# print('cv predict : ', y_predict)
r2 = r2_score(y_test, y_predict)
# acc = accuracy_score(y_test, y_predict)
print('cv RandomForestRegressor wine pred r2 : ', r2)
# [0.92044174 0.93128764 0.86083611 0.94454865 0.9748    ]
# cv RandomForestRegressor wine pred r2 : 0.7619047619047619
[StratifiedKFold]
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, KFold, StratifiedKFold
from sklearn.model_selection import cross_val_score, cross_validate, cross_val_predict
from sklearn.preprocessing import MinMaxScaler
from sklearn.utils import all_estimators
from sklearn.metrics import r2_score, accuracy_score
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
import warnings
warnings.filterwarnings('ignore')
#1. Data
datasets = load_iris()
x = datasets.data
y = datasets.target
x_train, x_test, y_train, y_test = train_test_split(
    x, y, train_size=0.8, shuffle=True, random_state=42  # validation folds are drawn from the training split
)
# kfold
n_splits = 5
random_state = 42
kfold = StratifiedKFold(n_splits=n_splits,
                        shuffle=True,
                        random_state=random_state)
scaler = MinMaxScaler()
x_train = scaler.fit_transform(x_train)  # fit on the training data only
x_test = scaler.transform(x_test)
#2. Model
model = RandomForestClassifier()
#3. Train
model.fit(x_train, y_train)
#4. Results
score = cross_val_score(model,
                        x_train, y_train,
                        cv=kfold)  # cross-validation
print(score)
y_predict = cross_val_predict(model,
                              x_test, y_test,
                              cv=kfold)
y_predict = np.round(y_predict).astype(int)
# print('cv predict : ', y_predict)
acc = accuracy_score(y_test, y_predict)
print('cv RandomForestClassifier iris pred acc : ', acc)
#[0.91666667 0.95833333 0.91666667 0.83333333 1. ]
# cv RandomForestClassifier iris pred acc : 0.9666666666666667
# [0.95833333 0.95833333 0.875 0.95833333 0.91666667]
# cv RandomForestClassifier iris pred acc : 0.9666666666666667
8. Feature Importances
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split, KFold, StratifiedKFold
from sklearn.model_selection import cross_val_score, cross_validate, cross_val_predict
from sklearn.preprocessing import MinMaxScaler
from sklearn.utils import all_estimators
from sklearn.metrics import r2_score, accuracy_score
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
import warnings
warnings.filterwarnings('ignore')
#1. Data
datasets = fetch_california_housing()
x = datasets.data
y = datasets.target
x_train, x_test, y_train, y_test = train_test_split(
    x, y, train_size=0.8, shuffle=True, random_state=42  # validation folds are drawn from the training split
)
# kfold
n_splits = 5
random_state = 42
kfold = KFold(n_splits=n_splits,
              shuffle=True,
              random_state=random_state)
scaler = MinMaxScaler()
x_train = scaler.fit_transform(x_train)  # fit on the training data only
x_test = scaler.transform(x_test)
#2. Model
model = RandomForestRegressor()
#3. Train
model.fit(x_train, y_train)
#4. Results
score = cross_val_score(model,
                        x_train, y_train,
                        cv=kfold)  # cross-validation
print(score)
y_predict = cross_val_predict(model,
                              x_test, y_test,
                              cv=kfold)
# y_predict = np.round(y_predict).astype(int)  # not needed for regression
# print('cv predict : ', y_predict)
r2 = r2_score(y_test, y_predict)
print('cv RandomForestRegressor california pred r2 : ', r2)
#### feature importances
print(model, ":", model.feature_importances_)
import matplotlib.pyplot as plt
n_features = datasets.data.shape[1]
plt.barh(range(n_features), model.feature_importances_, align='center')
plt.yticks(np.arange(n_features), datasets.feature_names)
plt.title('cali Feature Importances')
plt.ylabel('Feature')
plt.xlabel('importances')
plt.ylim(-1, n_features)  # keep all horizontal bars in view
plt.show()
Exercise
Apply each boosting-family model to iris, cancer, wine, and california.
Apply exercises 1 and 2 to the team project.
[LightGBM]
pip install xgboost
pip install catboost
pip install lightgbm
import numpy as np
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense, Dropout
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split, KFold, StratifiedKFold
from sklearn.model_selection import cross_val_score, cross_validate, cross_val_predict
from sklearn.preprocessing import MinMaxScaler
from sklearn.utils import all_estimators
from sklearn.metrics import r2_score, accuracy_score
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
import warnings
warnings.filterwarnings('ignore')
#1. Data
path = './'
datasets = pd.read_csv(path + 'train.csv')
print(datasets.columns)
# print(datasets.head(7))
# 1. x, y data
x = datasets[['6~7_ride', '7~8_ride', '8~9_ride',
              '9~10_ride', '10~11_ride', '11~12_ride', '6~7_takeoff', '7~8_takeoff',
              '8~9_takeoff', '9~10_takeoff', '10~11_takeoff', '11~12_takeoff']]
y = datasets[['18~20_ride']]
x = x.astype('int64')
print(x.info())
x_train, x_test, y_train, y_test = train_test_split(
    x, y, train_size=0.8, shuffle=True, random_state=42  # validation folds are drawn from the training split
)
# kfold
n_splits = 5
random_state = 77
kfold = KFold(n_splits=n_splits,
              shuffle=True,
              random_state=random_state)
scaler = MinMaxScaler()
x_train = scaler.fit_transform(x_train)  # fit on the training data only
x_test = scaler.transform(x_test)  # x itself stays a DataFrame, so its columns can label the plot below
#2. Model
from lightgbm import LGBMRegressor
model = LGBMRegressor()
#3. Train
model.fit(x_train, y_train)
#4. Results
score = cross_val_score(model,
                        x_train, y_train,
                        cv=kfold)  # cross-validation
print(score)
y_predict = cross_val_predict(model,
                              x_test, y_test,
                              cv=kfold)
r2 = r2_score(y_test, y_predict)
print('lightgbm r2 : ', r2)
import matplotlib.pyplot as plt
n_features = x.shape[1]  # number of selected features
plt.barh(range(n_features), model.feature_importances_, align='center')
plt.yticks(np.arange(n_features), x.columns)  # label the bars with the DataFrame's column names
plt.xlabel('Feature Importance')
plt.ylabel('Features')
plt.show()
[CatBoost]
pip install catboost
import numpy as np
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense, Dropout
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split, KFold, StratifiedKFold
from sklearn.model_selection import cross_val_score, cross_validate, cross_val_predict
from sklearn.preprocessing import MinMaxScaler
from sklearn.utils import all_estimators
from sklearn.metrics import r2_score, accuracy_score
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
import warnings
warnings.filterwarnings('ignore')
#1. Data
path = './'
datasets = pd.read_csv(path + 'train.csv')
print(datasets.columns)
# print(datasets.head(7))
# 1. x, y data
x = datasets[['6~7_ride', '7~8_ride', '8~9_ride',
              '9~10_ride', '10~11_ride', '11~12_ride', '6~7_takeoff', '7~8_takeoff',
              '8~9_takeoff', '9~10_takeoff', '10~11_takeoff', '11~12_takeoff']]
y = datasets[['18~20_ride']]
x = x.astype('int64')
print(x.info())
x_train, x_test, y_train, y_test = train_test_split(
    x, y, train_size=0.8, shuffle=True, random_state=42  # validation folds are drawn from the training split
)
# kfold
n_splits = 5
random_state = 77
kfold = KFold(n_splits=n_splits,
              shuffle=True,
              random_state=random_state)
scaler = MinMaxScaler()
x_train = scaler.fit_transform(x_train)  # fit on the training data only
x_test = scaler.transform(x_test)  # x itself stays a DataFrame, so its columns can label the plot below
#2. Model
from catboost import CatBoostRegressor
model = CatBoostRegressor()
#3. Train
model.fit(x_train, y_train)
#4. Results
score = cross_val_score(model,
                        x_train, y_train,
                        cv=kfold)  # cross-validation
print(score)
y_predict = cross_val_predict(model,
                              x_test, y_test,
                              cv=kfold)
r2 = r2_score(y_test, y_predict)
print('catboost r2 : ', r2)
import matplotlib.pyplot as plt
n_features = x.shape[1]  # number of selected features
plt.barh(range(n_features), model.feature_importances_, align='center')
plt.yticks(np.arange(n_features), x.columns)  # label the bars with the DataFrame's column names
plt.xlabel('Feature Importance')
plt.ylabel('Features')
plt.show()
[XGBoost]
pip install xgboost
pip install catboost
pip install lightgbm
import numpy as np
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense, Dropout
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split, KFold, StratifiedKFold
from sklearn.model_selection import cross_val_score, cross_validate, cross_val_predict
from sklearn.preprocessing import MinMaxScaler
from sklearn.utils import all_estimators
from sklearn.metrics import r2_score, accuracy_score
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
import warnings
warnings.filterwarnings('ignore')
#1. Data
path = './'
datasets = pd.read_csv(path + 'train.csv')
print(datasets.columns)
# print(datasets.head(7))
# 1. x, y data
x = datasets[['6~7_ride', '7~8_ride', '8~9_ride',
              '9~10_ride', '10~11_ride', '11~12_ride', '6~7_takeoff', '7~8_takeoff',
              '8~9_takeoff', '9~10_takeoff', '10~11_takeoff', '11~12_takeoff']]
y = datasets[['18~20_ride']]
x = x.astype('int64')
print(x.info())
x_train, x_test, y_train, y_test = train_test_split(
    x, y, train_size=0.8, shuffle=True, random_state=42  # validation folds are drawn from the training split
)
# kfold
n_splits = 5
random_state = 77
kfold = KFold(n_splits=n_splits,
              shuffle=True,
              random_state=random_state)
scaler = MinMaxScaler()
x_train = scaler.fit_transform(x_train)  # fit on the training data only
x_test = scaler.transform(x_test)  # x itself stays a DataFrame, so its columns can label the plot below
#2. Model
from xgboost import XGBRegressor
model = XGBRegressor()
#3. Train
model.fit(x_train, y_train)
#4. Results
score = cross_val_score(model,
                        x_train, y_train,
                        cv=kfold)  # cross-validation
print(score)
y_predict = cross_val_predict(model,
                              x_test, y_test,
                              cv=kfold)
r2 = r2_score(y_test, y_predict)
print('xgboost r2 : ', r2)
import matplotlib.pyplot as plt
n_features = x.shape[1]  # number of selected features
plt.barh(range(n_features), model.feature_importances_, align='center')
plt.yticks(np.arange(n_features), x.columns)  # label the bars with the DataFrame's column names
plt.xlabel('Feature Importance')
plt.ylabel('Features')
plt.show()
Exercise 2.
Check the feature importances, redefine the feature set, and compare the performance.