Activity - Machine Learning - Introduction

import pandas as pd

In this activity you will try to classify drug consumption from personal variables (demographics and personality scores). Please work collaboratively and discuss your ideas; at the end we will share results.

data = pd.read_csv(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/00373/drug_consumption.data",
    names=[
        "age",
        "gender",
        "education",
        "country",
        "ethnicity",
        "nscore",
        "escore",
        "oscore",
        "ascore",
        "cscore",
        "impulsive",
        "ss",
        "alcohol",
        "amphet",
        "amyl",
        "benzos",
        "caff",
        "cannabis",
        "choc",
        "coke",
        "crack",
        "ecstasy",
        "heroin",
        "ketamine",
        "legalh",
        "lsd",
        "meth",
        "mushrooms",
        "nicotine",
        "semer",
        "vsa"
    ],
    index_col=0
)
data
            age     gender  education    country  ethnicity     nscore     escore     oscore     ascore     cscore  ...  ecstasy  heroin  ketamine  legalh  lsd  meth  mushrooms  nicotine  semer  vsa
   1    0.49788    0.48246   -0.05921    0.96082    0.12600    0.31287   -0.57545   -0.58331   -0.91699   -0.00665  ...      CL0     CL0       CL0     CL0  CL0   CL0        CL0       CL2    CL0  CL0
   2   -0.07854   -0.48246    1.98437    0.96082   -0.31685   -0.67825    1.93886    1.43533    0.76096   -0.14277  ...      CL4     CL0       CL2     CL0  CL2   CL3        CL0       CL4    CL0  CL0
   3    0.49788   -0.48246   -0.05921    0.96082   -0.31685   -0.46725    0.80523   -0.84732   -1.62090   -1.01450  ...      CL0     CL0       CL0     CL0  CL0   CL0        CL1       CL0    CL0  CL0
   4   -0.95197    0.48246    1.16365    0.96082   -0.31685   -0.14882   -0.80615   -0.01928    0.59042    0.58489  ...      CL0     CL0       CL2     CL0  CL0   CL0        CL0       CL2    CL0  CL0
   5    0.49788    0.48246    1.98437    0.96082   -0.31685    0.73545   -1.63340   -0.45174   -0.30172    1.30612  ...      CL1     CL0       CL0     CL1  CL0   CL0        CL2       CL2    CL0  CL0
 ...        ...        ...        ...        ...        ...        ...        ...        ...        ...        ...  ...      ...     ...       ...     ...  ...   ...        ...       ...    ...  ...
1884   -0.95197    0.48246   -0.61113   -0.57009   -0.31685   -1.19430    1.74091    1.88511    0.76096   -1.13788  ...      CL0     CL0       CL0     CL3  CL3   CL0        CL0       CL0    CL0  CL5
1885   -0.95197   -0.48246   -0.61113   -0.57009   -0.31685   -0.24649    1.74091    0.58331    0.76096   -1.51840  ...      CL2     CL0       CL0     CL3  CL5   CL4        CL4       CL5    CL0  CL0
1886   -0.07854    0.48246    0.45468   -0.57009   -0.31685    1.13281   -1.37639   -1.27553   -1.77200   -1.38502  ...      CL4     CL0       CL2     CL0  CL2   CL0        CL2       CL6    CL0  CL0
1887   -0.95197    0.48246   -0.61113   -0.57009   -0.31685    0.91093   -1.92173    0.29338   -1.62090   -2.57309  ...      CL3     CL0       CL0     CL3  CL3   CL0        CL3       CL4    CL0  CL0
1888   -0.95197   -0.48246   -0.61113    0.21128   -0.31685   -0.46725    2.12700    1.65653    1.11406    0.41594  ...      CL3     CL0       CL0     CL3  CL3   CL0        CL3       CL6    CL0  CL2

1885 rows × 31 columns

Step 0: Explore the data

Please read the description of the data and its variables at the following link: https://archive-beta.ics.uci.edu/dataset/373/drug+consumption+quantified

Feel free to compute some descriptive statistics, but don’t spend more than 5 minutes on this.
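As a starting point for the exploration, a minimal sketch could look like the following. It uses a tiny hand-made stand-in frame (`toy`, a hypothetical name with illustrative values) so that it runs without downloading anything; in the notebook you would call the same pandas methods on `data`.

```python
import pandas as pd

# Tiny hand-made stand-in for `data` (illustrative values, not real rows).
toy = pd.DataFrame(
    {
        "nscore": [0.31, -0.68, -0.47, -0.15, 0.74, 1.13],
        "impulsive": [-0.22, -0.71, -1.38, -1.38, -0.22, 0.53],
        "nicotine": ["CL2", "CL4", "CL0", "CL2", "CL2", "CL6"],
    }
)

# Count, mean, std and quartiles for the numeric columns only.
summary = toy.describe()
print(summary)

# Number of respondents per consumption class, sorted by class label.
class_counts = toy["nicotine"].value_counts().sort_index()
print(class_counts)

# Mean of one numeric feature within each consumption class.
mean_by_class = toy.groupby("nicotine")["impulsive"].mean()
print(mean_by_class)
```

The class counts are worth a quick look before modelling: if one class dominates, plain accuracy can look good even for a model that always predicts that class.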

Step 1: Prepare your data

Step 1.1: Select a drug

y = data["nicotine"]  # Change it for the drug you want to classify
y
1       CL2
2       CL4
3       CL0
4       CL2
5       CL2
       ... 
1884    CL0
1885    CL5
1886    CL6
1887    CL4
1888    CL6
Name: nicotine, Length: 1885, dtype: object
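The target is a string label from `CL0` to `CL6`, which the dataset description linked in Step 0 orders from "never used" (`CL0`) to "used in last day" (`CL6`). If seven classes turn out to be too fine-grained, one option is to collapse them into a binary target. The sketch below shows this on a few made-up labels; the names `y_toy`, `y_level` and `y_binary` and the cut-off at `CL3` are assumptions for illustration, not part of the activity.

```python
import pandas as pd

# Stand-in for `y = data["nicotine"]` (a few made-up labels).
y_toy = pd.Series(["CL2", "CL4", "CL0", "CL2", "CL6", "CL1"], name="nicotine")

# Strip the "CL" prefix so that the ordering becomes an explicit integer 0..6.
y_level = y_toy.str[2:].astype(int)

# Hypothetical binary target: 1 for CL3 or higher, 0 otherwise.
y_binary = (y_level >= 3).astype(int)
print(y_binary.tolist())  # [0, 1, 0, 0, 1, 0]
```

Keeping all seven classes is equally valid; the scikit-learn classifiers handle string labels directly.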

Step 1.2: Select attributes

Feel free to remove any variable you think should not be considered by your machine learning algorithm.

X = data.loc[
    :,
    [
        "age",
        "gender",
        "education",
        "country",
        "ethnicity",
        "nscore",
        "escore",
        "oscore",
        "ascore",
        "cscore",
        "impulsive",
        "ss"
    ]
]
X
            age     gender  education    country  ethnicity     nscore     escore     oscore     ascore     cscore  impulsive         ss
   1    0.49788    0.48246   -0.05921    0.96082    0.12600    0.31287   -0.57545   -0.58331   -0.91699   -0.00665   -0.21712   -1.18084
   2   -0.07854   -0.48246    1.98437    0.96082   -0.31685   -0.67825    1.93886    1.43533    0.76096   -0.14277   -0.71126   -0.21575
   3    0.49788   -0.48246   -0.05921    0.96082   -0.31685   -0.46725    0.80523   -0.84732   -1.62090   -1.01450   -1.37983    0.40148
   4   -0.95197    0.48246    1.16365    0.96082   -0.31685   -0.14882   -0.80615   -0.01928    0.59042    0.58489   -1.37983   -1.18084
   5    0.49788    0.48246    1.98437    0.96082   -0.31685    0.73545   -1.63340   -0.45174   -0.30172    1.30612   -0.21712   -0.21575
 ...        ...        ...        ...        ...        ...        ...        ...        ...        ...        ...        ...        ...
1884   -0.95197    0.48246   -0.61113   -0.57009   -0.31685   -1.19430    1.74091    1.88511    0.76096   -1.13788    0.88113    1.92173
1885   -0.95197   -0.48246   -0.61113   -0.57009   -0.31685   -0.24649    1.74091    0.58331    0.76096   -1.51840    0.88113    0.76540
1886   -0.07854    0.48246    0.45468   -0.57009   -0.31685    1.13281   -1.37639   -1.27553   -1.77200   -1.38502    0.52975   -0.52593
1887   -0.95197    0.48246   -0.61113   -0.57009   -0.31685    0.91093   -1.92173    0.29338   -1.62090   -2.57309    1.29221    1.22470
1888   -0.95197   -0.48246   -0.61113    0.21128   -0.31685   -0.46725    2.12700    1.65653    1.11406    0.41594    0.88113    1.22470

1885 rows × 12 columns

Step 2: Select your classification model

from sklearn#FIX ME#
model = #FIX ME#
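One possible way to fill in the two placeholders is sketched below. `LogisticRegression` is only an assumed choice for illustration; the activity leaves the classifier up to you, and other scikit-learn estimators such as `DecisionTreeClassifier` (in `sklearn.tree`) or `KNeighborsClassifier` (in `sklearn.neighbors`) are imported and instantiated in the same way.

```python
from sklearn.linear_model import LogisticRegression

# One possible choice among many scikit-learn classifiers.
# max_iter is raised above its default of 100 to give the solver
# more room to converge.
model = LogisticRegression(max_iter=1000)
print(model)
```

Every scikit-learn classifier exposes the same `fit` / `predict` / `score` interface, so swapping in a different model later only changes these two lines.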

Step 3: Train your model

model.fit(#FIX ME#)
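Training means passing the feature table and the target to `fit`. The sketch below illustrates the call on synthetic stand-ins (`X_toy` and `y_toy` are hypothetical names, and the data is randomly generated so the example runs offline); in the notebook the equivalent call would be `model.fit(X, y)`. The classifier is again an assumed choice.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins for X and y (random numbers, not the real dataset).
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(
    rng.normal(size=(60, 3)), columns=["nscore", "impulsive", "ss"]
)
# Made-up rule so that there is something to learn: label depends on `ss`.
y_toy = pd.Series(np.where(X_toy["ss"] > 0, "CL3", "CL0"), name="nicotine")

model = LogisticRegression(max_iter=1000)
model.fit(X_toy, y_toy)  # in the notebook: model.fit(X, y)

# After fitting, the model knows the class labels and can predict them.
print(model.classes_)
print(model.predict(X_toy.head()))
```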

Step 4: Evaluate your model

from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(model, X, y)
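Besides the plot, the same information is available as numbers. The sketch below shows `confusion_matrix`, `accuracy_score` and `classification_report` from `sklearn.metrics` on hand-made labels (`y_true` and `y_pred` are invented for illustration); in the notebook you would compare `y` with `model.predict(X)`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hand-made true labels and predictions, just to show the metric calls.
labels = ["CL0", "CL3", "CL6"]
y_true = np.array(["CL0", "CL0", "CL3", "CL3", "CL6", "CL6"])
y_pred = np.array(["CL0", "CL3", "CL3", "CL3", "CL6", "CL0"])

# Rows are true classes, columns are predicted classes, both in `labels` order.
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)

# Fraction of correct predictions: 4 of the 6 labels match here.
acc = accuracy_score(y_true, y_pred)
print(acc)

# Per-class precision, recall and F1 in one text table.
print(classification_report(y_true, y_pred, labels=labels, zero_division=0))
```

Note that scoring a model on the same data it was trained on tends to give an optimistic picture, which is the motivation for the optional train/test exercise.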

Optional: Train/test split

Train your model on the training data and compute the score on the test dataset. Feel free to do some hyperparameter optimization as well.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
# YOU CAN DO IT! #
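One way this exercise could be approached is sketched below: split the data, tune one hyperparameter with cross-validation on the training part only, then score once on the held-out test part. The data is synthetic (`X_toy` and `y_toy` are hypothetical stand-ins), and both `KNeighborsClassifier` and the `n_neighbors` grid are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-ins for X and y (random numbers, not the real dataset).
rng = np.random.default_rng(42)
X_toy = pd.DataFrame(
    rng.normal(size=(200, 4)), columns=["nscore", "escore", "impulsive", "ss"]
)
# Made-up rule so that there is something to learn.
y_toy = pd.Series(np.where(X_toy["ss"] + X_toy["impulsive"] > 0, "CL3", "CL0"))

X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42
)

# 5-fold cross-validated search over one hyperparameter, training data only.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 5, 15]},
    cv=5,
)
search.fit(X_train, y_train)
print(search.best_params_)

# Accuracy on data that was used neither for fitting nor for tuning.
test_accuracy = search.score(X_test, y_test)
print(test_accuracy)
```

Keeping the test set out of the search is the point of the design: if the hyperparameters were picked using the test data, the final score would no longer estimate performance on unseen respondents.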