We will work through a regression problem using the sample_data
that ships with Google Colab.
California housing : Regression
Imports
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm, probplot, skew
from IPython.display import display
from sklearn.preprocessing import StandardScaler
Load Data
In [2]:
train_df = pd.read_csv('/content/sample_data/california_housing_train.csv')
test_df = pd.read_csv('/content/sample_data/california_housing_test.csv')
In [3]:
train_df.head()
Out[3]:
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
---|---|---|---|---|---|---|---|---|---|
0 | -114.31 | 34.19 | 15.0 | 5612.0 | 1283.0 | 1015.0 | 472.0 | 1.4936 | 66900.0 |
1 | -114.47 | 34.40 | 19.0 | 7650.0 | 1901.0 | 1129.0 | 463.0 | 1.8200 | 80100.0 |
2 | -114.56 | 33.69 | 17.0 | 720.0 | 174.0 | 333.0 | 117.0 | 1.6509 | 85700.0 |
3 | -114.57 | 33.64 | 14.0 | 1501.0 | 337.0 | 515.0 | 226.0 | 3.1917 | 73400.0 |
4 | -114.57 | 33.57 | 20.0 | 1454.0 | 326.0 | 624.0 | 262.0 | 1.9250 | 65500.0 |
In [4]:
train_df.describe()
Out[4]:
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
---|---|---|---|---|---|---|---|---|---|
count | 17000.000000 | 17000.000000 | 17000.000000 | 17000.000000 | 17000.000000 | 17000.000000 | 17000.000000 | 17000.000000 | 17000.000000 |
mean | -119.562108 | 35.625225 | 28.589353 | 2643.664412 | 539.410824 | 1429.573941 | 501.221941 | 3.883578 | 207300.912353 |
std | 2.005166 | 2.137340 | 12.586937 | 2179.947071 | 421.499452 | 1147.852959 | 384.520841 | 1.908157 | 115983.764387 |
min | -124.350000 | 32.540000 | 1.000000 | 2.000000 | 1.000000 | 3.000000 | 1.000000 | 0.499900 | 14999.000000 |
25% | -121.790000 | 33.930000 | 18.000000 | 1462.000000 | 297.000000 | 790.000000 | 282.000000 | 2.566375 | 119400.000000 |
50% | -118.490000 | 34.250000 | 29.000000 | 2127.000000 | 434.000000 | 1167.000000 | 409.000000 | 3.544600 | 180400.000000 |
75% | -118.000000 | 37.720000 | 37.000000 | 3151.250000 | 648.250000 | 1721.000000 | 605.250000 | 4.767000 | 265000.000000 |
max | -114.310000 | 41.950000 | 52.000000 | 37937.000000 | 6445.000000 | 35682.000000 | 6082.000000 | 15.000100 | 500001.000000 |
Distribution of training data
In [5]:
train_df.columns.values.tolist()
Out[5]:
['longitude', 'latitude', 'housing_median_age', 'total_rooms', 'total_bedrooms', 'population', 'households', 'median_income', 'median_house_value']
In [6]:
def plot_df(df):
    # Scatter-plot every feature against the target (the last column)
    features = df.columns.values.tolist()
    target = features.pop()
    plt.figure(figsize=(10, 10))
    for i, feature in enumerate(features):
        plt.subplot(4, 2, i + 1)
        plt.grid()
        plt.plot(df[feature], df[target], 'r.')
        plt.title(feature)
    plt.tight_layout()
    plt.show()
    plt.close()

plot_df(train_df)
In [7]:
train_df['population'] >= 20000
Out[7]:
0        False
1        False
2        False
3        False
4        False
         ...
16995    False
16996    False
16997    False
16998    False
16999    False
Name: population, Length: 17000, dtype: bool
In [8]:
train_df[train_df['population'] >= 20000]
Out[8]:
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
---|---|---|---|---|---|---|---|---|---|
2274 | -117.42 | 33.35 | 14.0 | 25135.0 | 4819.0 | 35682.0 | 4769.0 | 2.5729 | 134400.0 |
12772 | -121.79 | 36.64 | 11.0 | 32627.0 | 6445.0 | 28566.0 | 6082.0 | 2.3087 | 118800.0 |
In [9]:
train_df[train_df['population'] >= 20000].index
Out[9]:
Int64Index([2274, 12772], dtype='int64')
In [10]:
# Treat outliers: drop the rows with population >= 20000
train_df = train_df.drop(train_df[train_df['population'] >= 20000].index)
In [11]:
plot_df(train_df)
Target variable analysis
In [12]:
def plot_hist_prob(df, feature):
    plt.figure(figsize=(10, 4))
    # Left panel: histogram with a fitted normal curve
    plt.subplot(1, 2, 1)
    sns.distplot(df[feature], fit=norm)
    plt.title('Distribution')
    # Right panel: normal probability plot
    plt.subplot(1, 2, 2)
    probplot(df[feature], plot=plt)
    plt.tight_layout()
    plt.show()
    plt.close()

target = 'median_house_value'
plot_hist_prob(train_df, target)
/usr/local/lib/python3.6/dist-packages/seaborn/distributions.py:2557: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms). warnings.warn(msg, FutureWarning)
In [13]:
#log-transformation
#log(1+x)
train_df[target] = np.log1p(train_df[target])
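Since the target now lives on a log(1 + x) scale, anything predicted on that scale has to be mapped back with the inverse, `np.expm1`, before it can be read as a dollar value. A minimal round-trip sketch, using the first three `median_house_value` entries from the `head()` output above (the variable names `prices`, `log_prices`, and `restored` are illustrative only):

```python
import numpy as np

# First three median_house_value entries from train_df.head() above.
prices = np.array([66900.0, 80100.0, 85700.0])

log_prices = np.log1p(prices)    # forward transform: log(1 + x)
restored = np.expm1(log_prices)  # inverse transform: exp(y) - 1

print(log_prices)
print(restored)
```

The same inverse would apply to model predictions made on the transformed target.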
In [14]:
plot_hist_prob(train_df, target)
Feature engineering
In [15]:
full_df = pd.concat([
    train_df.drop(target, axis=1),
    test_df.drop(target, axis=1)
])
full_df
Out[15]:
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income |
---|---|---|---|---|---|---|---|---|
0 | -114.31 | 34.19 | 15.0 | 5612.0 | 1283.0 | 1015.0 | 472.0 | 1.4936 |
1 | -114.47 | 34.40 | 19.0 | 7650.0 | 1901.0 | 1129.0 | 463.0 | 1.8200 |
2 | -114.56 | 33.69 | 17.0 | 720.0 | 174.0 | 333.0 | 117.0 | 1.6509 |
3 | -114.57 | 33.64 | 14.0 | 1501.0 | 337.0 | 515.0 | 226.0 | 3.1917 |
4 | -114.57 | 33.57 | 20.0 | 1454.0 | 326.0 | 624.0 | 262.0 | 1.9250 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
2995 | -119.86 | 34.42 | 23.0 | 1450.0 | 642.0 | 1258.0 | 607.0 | 1.1790 |
2996 | -118.14 | 34.06 | 27.0 | 5257.0 | 1082.0 | 3496.0 | 1036.0 | 3.3906 |
2997 | -119.70 | 36.30 | 10.0 | 956.0 | 201.0 | 693.0 | 220.0 | 2.2895 |
2998 | -117.12 | 34.10 | 40.0 | 96.0 | 14.0 | 46.0 | 14.0 | 3.2708 |
2999 | -119.63 | 34.42 | 42.0 | 1765.0 | 263.0 | 753.0 | 260.0 | 8.5608 |
19998 rows × 8 columns
In [16]:
# Skewness of every feature, largest first
def print_skewness():
    feats = full_df.columns.values.tolist()
    skewed_feats = full_df[feats].apply(lambda x: skew(x)).sort_values(ascending=False)
    display(skewed_feats)

print_skewness()
total_rooms           3.930562
total_bedrooms        3.286475
households            3.231976
population            3.226013
median_income         1.636937
latitude              0.469927
housing_median_age    0.057776
longitude            -0.303023
dtype: float64
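For reference, with its default arguments `scipy.stats.skew` returns the biased Fisher-Pearson coefficient, i.e. the third central moment divided by the second central moment raised to the power 1.5. The sketch below recomputes it by hand on synthetic data; `sample_skewness` and `right_skewed` are hypothetical names introduced only for this illustration.

```python
import numpy as np
from scipy.stats import skew

def sample_skewness(x):
    """Biased Fisher-Pearson skewness: m3 / m2**1.5."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)  # second central moment
    m3 = np.mean(centered ** 3)  # third central moment
    return m3 / m2 ** 1.5

# Synthetic right-skewed sample (not the housing data).
rng = np.random.default_rng(0)
right_skewed = rng.lognormal(size=1000)

print(sample_skewness(right_skewed), skew(right_skewed))
```

A long right tail gives a positive value, which matches the count-like columns (`total_rooms`, `population`, and so on) appearing at the top of the list.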
In [17]:
plot_hist_prob(full_df, 'total_rooms')
In [18]:
# Box-Cox transformation of highly skewed features
def fixing_skewness():
    from scipy.special import boxcox1p
    from scipy.stats import boxcox_normmax
    feats = full_df.columns.values.tolist()
    skewed_feats = full_df[feats].apply(lambda x: skew(x)).sort_values(ascending=False)
    # Only transform features whose skewness exceeds 1
    high_skew = skewed_feats[skewed_feats > 1]
    for feat in high_skew.index:
        # Shift by 1 so every value is strictly positive when estimating lambda
        full_df[feat] = boxcox1p(full_df[feat], boxcox_normmax(full_df[feat] + 1))

fixing_skewness()
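To make the transform concrete: `boxcox1p(x, lam)` computes `((1 + x)**lam - 1) / lam` for a nonzero `lam` and falls back to `log1p(x)` when `lam` is 0. The sketch below checks a hand-written version against SciPy on the first five `total_rooms` values from the table above. The helper `boxcox1p_manual` is hypothetical, and `lam = 0.25` is an arbitrary illustrative value; the notebook estimates lambda per feature with `boxcox_normmax`.

```python
import numpy as np
from scipy.special import boxcox1p

def boxcox1p_manual(x, lam):
    """Box-Cox transform of 1 + x, written out explicitly."""
    x = np.asarray(x, dtype=float)
    if lam == 0:
        return np.log1p(x)
    return ((1.0 + x) ** lam - 1.0) / lam

# First five total_rooms values from the table above.
rooms = np.array([5612.0, 7650.0, 720.0, 1501.0, 1454.0])
lam = 0.25  # illustrative only, not the value boxcox_normmax would choose

print(boxcox1p_manual(rooms, lam))
print(boxcox1p(rooms, lam))
```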
In [19]:
print_skewness()
latitude              0.469927
total_bedrooms        0.111805
total_rooms           0.109239
households            0.103211
population            0.098523
housing_median_age    0.057776
median_income        -0.005347
longitude            -0.303023
dtype: float64
In [20]:
plot_hist_prob(full_df, 'total_rooms')
In [21]:
full_df
Out[21]:
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income |
---|---|---|---|---|---|---|---|---|
0 | -114.31 | 34.19 | 15.0 | 25.363016 | 17.377704 | 17.635147 | 14.011835 | 0.835193 |
1 | -114.47 | 34.40 | 19.0 | 27.445129 | 19.354420 | 18.195048 | 13.928676 | 0.936432 |
2 | -114.56 | 33.69 | 17.0 | 14.598093 | 9.603326 | 12.554290 | 8.882992 | 0.885849 |
3 | -114.57 | 33.64 | 14.0 | 17.911356 | 11.805374 | 14.382370 | 11.089628 | 1.246235 |
4 | -114.57 | 33.57 | 20.0 | 17.756645 | 11.686931 | 15.249883 | 11.635321 | 0.966043 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
2995 | -119.86 | 34.42 | 23.0 | 17.743298 | 14.287117 | 18.778910 | 15.134444 | 0.721303 |
2996 | -118.14 | 34.06 | 27.0 | 24.941418 | 16.572808 | 25.106247 | 17.757042 | 1.280894 |
2997 | -119.70 | 36.30 | 10.0 | 15.814389 | 10.056551 | 15.740909 | 10.992368 | 1.059720 |
2998 | -117.12 | 34.10 | 40.0 | 7.823240 | 3.699959 | 6.304133 | 3.804378 | 1.260250 |
2999 | -119.63 | 34.42 | 42.0 | 18.716082 | 10.941239 | 16.138496 | 11.606549 | 1.817227 |
19998 rows × 8 columns
In [22]:
scaler = StandardScaler()
full_df.loc[:] = scaler.fit_transform(full_df)
full_df
Out[22]:
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income |
---|---|---|---|---|---|---|---|---|
0 | 2.623441 | -0.672627 | -1.083286 | 1.480233 | 1.706900 | -0.230818 | 0.169856 | -1.758411 |
1 | 2.543582 | -0.574318 | -0.765359 | 2.035311 | 2.455572 | -0.081288 | 0.141070 | -1.374958 |
2 | 2.498662 | -0.906696 | -0.924322 | -1.389630 | -1.237611 | -1.587739 | -1.605496 | -1.566545 |
3 | 2.493671 | -0.930103 | -1.162768 | -0.506335 | -0.403595 | -1.099522 | -0.841668 | -0.201552 |
4 | 2.493671 | -0.962872 | -0.685877 | -0.547580 | -0.448455 | -0.867840 | -0.652776 | -1.262803 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
2995 | -0.146656 | -0.564955 | -0.447432 | -0.551138 | 0.536354 | 0.074641 | 0.558447 | -2.189780 |
2996 | 0.711824 | -0.733485 | -0.129505 | 1.367837 | 1.402049 | 1.764454 | 1.466261 | -0.070280 |
2997 | -0.066798 | 0.315145 | -1.480695 | -1.065373 | -1.065954 | -0.736703 | -0.875335 | -0.907996 |
2998 | 1.220923 | -0.714759 | 0.903758 | -3.195765 | -3.473485 | -3.256939 | -3.363461 | -0.148469 |
2999 | -0.031860 | -0.564955 | 1.062722 | -0.291800 | -0.730882 | -0.630522 | -0.662735 | 1.961132 |
19998 rows × 8 columns
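The cell above fits the scaler on the combined train and test features. A commonly used alternative, shown here only as a sketch on synthetic data and not what this notebook does, is to fit the scaler on the training rows and reuse those statistics for the test rows, so that no test information enters the preprocessing. The arrays `toy_train` and `toy_test` are made up for the illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for train/test feature matrices (not the housing data).
rng = np.random.default_rng(0)
toy_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
toy_test = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

scaler = StandardScaler()
train_scaled = scaler.fit_transform(toy_train)  # learn mean/std from train only
test_scaled = scaler.transform(toy_test)        # reuse the train statistics

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled.mean(axis=0))
```

With this variant the training columns come out with mean 0 and standard deviation 1, while the test columns are only approximately standardized.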
In [23]:
x_train = full_df[:len(train_df)]
x_test = full_df[len(train_df):]
train_df = pd.concat([x_train, train_df[target]], axis=1)
test_df = pd.concat([x_test, test_df[target]], axis=1)
del x_train, x_test
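Note that only `train_df[target]` was log-transformed earlier; `test_df[target]` is still in raw dollars at this point. If test predictions are scored later, both sides need to be on the same scale. One possible way to do that, sketched with a hypothetical helper `rmse_on_log_scale` and made-up predictions:

```python
import numpy as np

def rmse_on_log_scale(y_true_raw, y_pred_log):
    """RMSE between log1p(raw targets) and predictions already on the log scale."""
    y_true_log = np.log1p(np.asarray(y_true_raw, dtype=float))
    y_pred_log = np.asarray(y_pred_log, dtype=float)
    return float(np.sqrt(np.mean((y_true_log - y_pred_log) ** 2)))

# Raw dollar targets (first three values from the training table above).
y_true_raw = np.array([66900.0, 80100.0, 85700.0])
# Pretend predictions: exactly 0.1 too high on the log scale.
y_pred_log = np.log1p(y_true_raw) + 0.1

print(rmse_on_log_scale(y_true_raw, y_pred_log))
```

Alternatively, the predictions could be mapped back with `np.expm1` and compared in dollars.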
AutoML with PyCaret
In [24]:
!pip install pycaret
Requirement already satisfied: pycaret in /usr/local/lib/python3.6/dist-packages (2.2.3)
In [25]:
from pycaret.utils import enable_colab, check_metric
from pycaret.regression import *
In [26]:
enable_colab()
Colab mode enabled.
In [27]:
train_df
Out[27]:
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
---|---|---|---|---|---|---|---|---|---|
0 | 2.623441 | -0.672627 | -1.083286 | 1.480233 | 1.706900 | -0.230818 | 0.169856 | -1.758411 | 11.110969 |
1 | 2.543582 | -0.574318 | -0.765359 | 2.035311 | 2.455572 | -0.081288 | 0.141070 | -1.374958 | 11.291044 |
2 | 2.498662 | -0.906696 | -0.924322 | -1.389630 | -1.237611 | -1.587739 | -1.605496 | -1.566545 | 11.358620 |
3 | 2.493671 | -0.930103 | -1.162768 | -0.506335 | -0.403595 | -1.099522 | -0.841668 | -0.201552 | 11.203693 |
4 | 2.493671 | -0.962872 | -0.685877 | -0.547580 | -0.448455 | -0.867840 | -0.652776 | -1.262803 | 11.089821 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
16995 | -2.342770 | 2.318776 | 1.857540 | 0.023189 | -0.187857 | -0.384780 | -0.188402 | -0.847382 | 11.620892 |
16996 | -2.347761 | 2.370271 | 0.585831 | 0.105599 | 0.236958 | -0.001115 | 0.147503 | -0.708898 | 11.277216 |
16997 | -2.362734 | 2.908630 | -0.924322 | 0.295703 | 0.245455 | 0.058325 | 0.118385 | -0.312980 | 11.548302 |
16998 | -2.362734 | 2.889904 | -0.765359 | 0.292946 | 0.303904 | 0.120510 | 0.188815 | -1.206272 | 11.359786 |
16999 | -2.387690 | 2.300050 | 1.857540 | -0.250310 | -0.559368 | -0.541958 | -0.613507 | -0.324815 | 11.457423 |
16998 rows × 9 columns
In [28]:
reg = setup(data=train_df, target='median_house_value')
| | Description | Value |
---|---|---|
0 | session_id | 3134 |
1 | Target | median_house_value |
2 | Original Data | (16998, 9) |
3 | Missing Values | False |
4 | Numeric Features | 8 |
5 | Categorical Features | 0 |
6 | Ordinal Features | False |
7 | High Cardinality Features | False |
8 | High Cardinality Method | None |
9 | Transformed Train Set | (11898, 8) |
10 | Transformed Test Set | (5100, 8) |
11 | Shuffle Train-Test | True |
12 | Stratify Train-Test | False |
13 | Fold Generator | KFold |
14 | Fold Number | 10 |
15 | CPU Jobs | -1 |
16 | Use GPU | False |
17 | Log Experiment | False |
18 | Experiment Name | reg-default-name |
19 | USI | 610c |
20 | Imputation Type | simple |
21 | Iterative Imputation Iteration | None |
22 | Numeric Imputer | mean |
23 | Iterative Imputation Numeric Model | None |
24 | Categorical Imputer | constant |
25 | Iterative Imputation Categorical Model | None |
26 | Unknown Categoricals Handling | least_frequent |
27 | Normalize | False |
28 | Normalize Method | None |
29 | Transformation | False |
30 | Transformation Method | None |
31 | PCA | False |
32 | PCA Method | None |
33 | PCA Components | None |
34 | Ignore Low Variance | False |
35 | Combine Rare Levels | False |
36 | Rare Level Threshold | None |
37 | Numeric Binning | False |
38 | Remove Outliers | False |
39 | Outliers Threshold | None |
40 | Remove Multicollinearity | False |
41 | Multicollinearity Threshold | None |
42 | Clustering | False |
43 | Clustering Iteration | None |
44 | Polynomial Features | False |
45 | Polynomial Degree | None |
46 | Trignometry Features | False |
47 | Polynomial Threshold | None |
48 | Group Features | False |
49 | Feature Selection | False |
50 | Features Selection Threshold | None |
51 | Feature Interaction | False |
52 | Feature Ratio | False |
53 | Interaction Threshold | None |
54 | Transform Target | False |
55 | Transform Target Method | box-cox |
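The "Transformed Train Set" and "Transformed Test Set" rows are consistent with a 70/30 hold-out split of the 16998 input rows. The 0.7 train fraction is an assumption here (the cell does not set `train_size`), as is rounding the test size up. The arithmetic, as a small sketch:

```python
import math

n_rows = 16998        # "Original Data" rows from the summary
test_fraction = 0.3   # assumed: 1 - train_size, with train_size = 0.7

n_test = math.ceil(n_rows * test_fraction)   # rows held out for predict_model
n_train = n_rows - n_test                    # rows used for cross-validation

print(n_train, n_test)
```

The cross-validated scores in the following cells are computed on the training part only; the hold-out part is scored later by `predict_model(blended)`.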
In [29]:
# Cross-validate every available regressor and keep the three best
best_3 = compare_models(n_select=3)
Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec) | |
---|---|---|---|---|---|---|---|---|
catboost | CatBoost Regressor | 0.1549 | 0.0496 | 0.2226 | 0.8478 | 0.0172 | 0.0129 | 4.643 |
lightgbm | Light Gradient Boosting Machine | 0.1625 | 0.0527 | 0.2295 | 0.8382 | 0.0177 | 0.0135 | 0.214 |
xgboost | Extreme Gradient Boosting | 0.1622 | 0.0534 | 0.2310 | 0.8360 | 0.0178 | 0.0135 | 5.586 |
rf | Random Forest Regressor | 0.1624 | 0.0544 | 0.2332 | 0.8329 | 0.0180 | 0.0135 | 5.625 |
et | Extra Trees Regressor | 0.1722 | 0.0583 | 0.2414 | 0.8209 | 0.0186 | 0.0143 | 3.113 |
gbr | Gradient Boosting Regressor | 0.1934 | 0.0687 | 0.2621 | 0.7890 | 0.0202 | 0.0161 | 1.646 |
knn | K Neighbors Regressor | 0.2197 | 0.0904 | 0.3005 | 0.7226 | 0.0232 | 0.0183 | 0.082 |
lr | Linear Regression | 0.2395 | 0.1012 | 0.3180 | 0.6893 | 0.0246 | 0.0200 | 0.362 |
ridge | Ridge Regression | 0.2395 | 0.1012 | 0.3180 | 0.6893 | 0.0246 | 0.0200 | 0.022 |
br | Bayesian Ridge | 0.2395 | 0.1012 | 0.3180 | 0.6893 | 0.0246 | 0.0200 | 0.023 |
huber | Huber Regressor | 0.2385 | 0.1017 | 0.3189 | 0.6876 | 0.0247 | 0.0199 | 0.084 |
dt | Decision Tree Regressor | 0.2278 | 0.1055 | 0.3246 | 0.6762 | 0.0250 | 0.0189 | 0.110 |
ada | AdaBoost Regressor | 0.3056 | 0.1466 | 0.3828 | 0.5498 | 0.0295 | 0.0254 | 0.596 |
lar | Least Angle Regression | 0.2895 | 0.1644 | 0.4050 | 0.4951 | 0.0310 | 0.0241 | 0.025 |
par | Passive Aggressive Regressor | 0.3091 | 0.1650 | 0.4048 | 0.4934 | 0.0313 | 0.0256 | 0.030 |
omp | Orthogonal Matching Pursuit | 0.3243 | 0.1737 | 0.4167 | 0.4666 | 0.0322 | 0.0270 | 0.022 |
lasso | Lasso Regression | 0.4608 | 0.3260 | 0.5709 | -0.0008 | 0.0439 | 0.0384 | 0.018 |
en | Elastic Net | 0.4608 | 0.3260 | 0.5709 | -0.0008 | 0.0439 | 0.0384 | 0.023 |
llar | Lasso Least Angle Regression | 0.4608 | 0.3260 | 0.5709 | -0.0008 | 0.0439 | 0.0384 | 0.020 |
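The columns of this table are standard regression metrics. They are computed on the log-scale target, which is why the errors look so small. As a rough sketch of what MAE, MSE, RMSE and R2 measure, here is a dependency-free version applied to made-up toy values (not taken from the notebook):

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R2 for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}

# Toy log-scale targets and predictions, for illustration only
y_true = [11.0, 12.0, 13.0, 12.0]
y_pred = [11.5, 12.0, 12.5, 12.0]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

An R2 slightly below zero, as in the last three rows (-0.0008), means the model did marginally worse than always predicting the mean of the target.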
In [30]:
# Combine the three selected models into one ensemble, evaluated with 3-fold cross-validation
blended = blend_models(estimator_list=best_3, fold=3)
MAE | MSE | RMSE | R2 | RMSLE | MAPE | |
---|---|---|---|---|---|---|
0 | 0.1596 | 0.0511 | 0.2261 | 0.8435 | 0.0174 | 0.0133 |
1 | 0.1587 | 0.0521 | 0.2283 | 0.8399 | 0.0177 | 0.0132 |
2 | 0.1543 | 0.0490 | 0.2213 | 0.8494 | 0.0171 | 0.0128 |
Mean | 0.1576 | 0.0507 | 0.2252 | 0.8443 | 0.0174 | 0.0131 |
SD | 0.0023 | 0.0013 | 0.0029 | 0.0039 | 0.0002 | 0.0002 |
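The Mean and SD rows summarize the three folds. Taking the R2 column as printed, the figures match a plain average and a population standard deviation (the sample standard deviation would round to 0.0048 instead). A quick check with the standard library:

```python
import statistics

# R2 per fold, copied from the blend_models output
r2_per_fold = [0.8435, 0.8399, 0.8494]

mean_r2 = statistics.mean(r2_per_fold)
sd_r2 = statistics.pstdev(r2_per_fold)   # population SD (divides by n)

print(round(mean_r2, 4), round(sd_r2, 4))
```

Other columns may differ in the last digit when recomputed this way, because the table shows fold values that are already rounded.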
In [31]:
# Without a data argument, predict_model scores the model on the hold-out set created by setup()
pred = predict_model(blended)
Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | |
---|---|---|---|---|---|---|---|
0 | Voting Regressor | 0.1517 | 0.0479 | 0.2188 | 0.8545 | 0.0169 | 0.0126 |
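`predict_model` labels the blended estimator as a Voting Regressor. For regression, an ensemble of this kind typically averages the predictions of its base models. The sketch below illustrates only that idea, with three hypothetical stand-in models and equal weights; it is an assumption for illustration, not PyCaret's internals.

```python
from typing import Callable, List, Sequence

Model = Callable[[Sequence[float]], float]

def blend_predict(models: List[Model], row: Sequence[float]) -> float:
    """Average the predictions of all models for one feature row."""
    return sum(model(row) for model in models) / len(models)

# Hypothetical stand-ins for three fitted regressors
def model_a(row: Sequence[float]) -> float:
    return 12.0 + 0.1 * row[0]

def model_b(row: Sequence[float]) -> float:
    return 12.0 - 0.1 * row[0]

def model_c(row: Sequence[float]) -> float:
    return 12.0 + 0.3 * row[0]

blended_value = blend_predict([model_a, model_b, model_c], [1.0])
print(blended_value)
```

Averaging tends to cancel out errors the individual models do not share, which fits the hold-out R2 above (0.8545) being on par with the best single model's cross-validated score.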
In [32]:
# Refit the blended model on the full dataset, hold-out rows included
final_model = finalize_model(blended)
In [33]:
# Predict on the separate test file; the predictions are added as a 'Label' column
predictions = predict_model(final_model, data=test_df)
In [34]:
predictions.head()
Out[34]:
longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | Label | |
---|---|---|---|---|---|---|---|---|---|---|
0 | -1.239722 | 0.816052 | -0.129505 | 0.867908 | 0.582140 | 0.374182 | 0.555823 | 1.397282 | 344700.0 | 12.956008 |
1 | 0.631966 | -0.639857 | 1.142204 | -0.498552 | -0.515856 | -0.537079 | -0.579865 | 0.059820 | 176500.0 | 12.163323 |
2 | 0.876533 | -0.864563 | -0.129505 | 0.742244 | 0.176407 | 0.320667 | 0.241569 | 1.107858 | 270500.0 | 12.523111 |
3 | 0.602019 | -0.845838 | -0.050023 | -3.441488 | -3.428861 | -3.215307 | -3.506875 | 1.234294 | 330000.0 | 12.495220 |
4 | -0.051824 | 0.329189 | -0.765359 | -0.748750 | -0.826367 | -0.471742 | -0.781667 | -0.380637 | 81700.0 | 11.205035 |
In [35]:
# 'Label' is on the log scale, so invert it with expm1 before comparing against the dollar prices
check_metric(predictions['median_house_value'], np.expm1(predictions['Label']), 'R2')
Out[35]:
0.8213
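The model was trained on the log-transformed target, so `Label` is on the log scale while `median_house_value` in `test_df` is still in dollars. That is why `np.expm1` is applied before scoring. A dependency-free sketch of the same computation follows, using the five rows from `predictions.head()` purely as sample input. The resulting score describes those five rows only and is not expected to match the reported 0.8213.

```python
import math

# (actual price, predicted log-scale Label) from predictions.head()
rows = [
    (344700.0, 12.956008),
    (176500.0, 12.163323),
    (270500.0, 12.523111),
    (330000.0, 12.495220),
    (81700.0, 11.205035),
]

actual = [price for price, _ in rows]
pred_prices = [math.expm1(label) for _, label in rows]   # back to dollars

mean_actual = sum(actual) / len(actual)
ss_res = sum((a - p) ** 2 for a, p in zip(actual, pred_prices))
ss_tot = sum((a - mean_actual) ** 2 for a in actual)
r2 = 1.0 - ss_res / ss_tot

print(round(r2, 4))
```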
In [36]:
# With no plot argument, plot_model draws the residuals plot
plot_model(final_model)