Activity - Machine Learning - Introduction


import pandas as pd

In this activity you will try to classify drug consumption from personal variables. Please work collaboratively and discuss your ideas; we will share results at the end.

data = pd.read_csv(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/00373/drug_consumption.data",
    names=[
        "age",
        "gender",
        "education",
        "country",
        "ethnicity",
        "nscore",
        "escore",
        "oscore",
        "ascore",
        "cscore",
        "impulsive",
        "ss",
        "alcohol",
        "amphet",
        "amyl",
        "benzos",
        "caff",
        "cannabis",
        "choc",
        "coke",
        "crack",
        "ecstasy",
        "heroin",
        "ketamine",
        "legalh",
        "lsd",
        "meth",
        "mushrooms",
        "nicotine",
        "semer",
        "vsa"
    ],
    index_col=0
)
data

	age	gender	education	country	ethnicity	nscore	escore	oscore	ascore	cscore	...	ecstasy	heroin	ketamine	legalh	lsd	meth	mushrooms	nicotine	semer	vsa
1	0.49788	0.48246	-0.05921	0.96082	0.12600	0.31287	-0.57545	-0.58331	-0.91699	-0.00665	...	CL0	CL0	CL0	CL0	CL0	CL0	CL0	CL2	CL0	CL0
2	-0.07854	-0.48246	1.98437	0.96082	-0.31685	-0.67825	1.93886	1.43533	0.76096	-0.14277	...	CL4	CL0	CL2	CL0	CL2	CL3	CL0	CL4	CL0	CL0
3	0.49788	-0.48246	-0.05921	0.96082	-0.31685	-0.46725	0.80523	-0.84732	-1.62090	-1.01450	...	CL0	CL0	CL0	CL0	CL0	CL0	CL1	CL0	CL0	CL0
4	-0.95197	0.48246	1.16365	0.96082	-0.31685	-0.14882	-0.80615	-0.01928	0.59042	0.58489	...	CL0	CL0	CL2	CL0	CL0	CL0	CL0	CL2	CL0	CL0
5	0.49788	0.48246	1.98437	0.96082	-0.31685	0.73545	-1.63340	-0.45174	-0.30172	1.30612	...	CL1	CL0	CL0	CL1	CL0	CL0	CL2	CL2	CL0	CL0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1884	-0.95197	0.48246	-0.61113	-0.57009	-0.31685	-1.19430	1.74091	1.88511	0.76096	-1.13788	...	CL0	CL0	CL0	CL3	CL3	CL0	CL0	CL0	CL0	CL5
1885	-0.95197	-0.48246	-0.61113	-0.57009	-0.31685	-0.24649	1.74091	0.58331	0.76096	-1.51840	...	CL2	CL0	CL0	CL3	CL5	CL4	CL4	CL5	CL0	CL0
1886	-0.07854	0.48246	0.45468	-0.57009	-0.31685	1.13281	-1.37639	-1.27553	-1.77200	-1.38502	...	CL4	CL0	CL2	CL0	CL2	CL0	CL2	CL6	CL0	CL0
1887	-0.95197	0.48246	-0.61113	-0.57009	-0.31685	0.91093	-1.92173	0.29338	-1.62090	-2.57309	...	CL3	CL0	CL0	CL3	CL3	CL0	CL3	CL4	CL0	CL0
1888	-0.95197	-0.48246	-0.61113	0.21128	-0.31685	-0.46725	2.12700	1.65653	1.11406	0.41594	...	CL3	CL0	CL0	CL3	CL3	CL0	CL3	CL6	CL0	CL2

1885 rows × 31 columns

Step 0: Explore the data

Please read the description of the data and its variables at the following link: https://archive-beta.ics.uci.edu/dataset/373/drug+consumption+quantified

Feel free to compute some descriptive statistics, but don't spend more than 5 minutes on this.
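As a starting point for the exploration, the sketch below summarizes numeric attributes with `describe()` and counts the classes of one drug column with `value_counts()`. It uses a tiny made-up frame (`toy`, a hypothetical stand-in with invented values) so that it runs without a download; in the activity you would call the same methods on `data`.

```python
import pandas as pd

# Hypothetical stand-in for `data`: invented values, same kind of columns.
toy = pd.DataFrame(
    {
        "nscore": [0.31, -0.68, -0.47, -0.15, 0.74, 1.13],
        "escore": [-0.58, 1.94, 0.81, -0.81, -1.63, -1.38],
        "nicotine": ["CL2", "CL4", "CL0", "CL2", "CL2", "CL6"],
    }
)

# Summary statistics (count, mean, std, quartiles) for the numeric columns only.
summary = toy.describe()
print(summary)

# Number of rows in each usage class, ordered by class label.
class_counts = toy["nicotine"].value_counts().sort_index()
print(class_counts)
```

A strongly unbalanced class count is worth keeping in mind when you later read the accuracy of a classifier.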

Step 1: Prepare your data

Step 1.1: Select a drug

y = data["nicotine"]  # Change this to the drug you want to classify
y

1       CL2
2       CL4
3       CL0
4       CL2
5       CL2
       ... 
1884    CL0
1885    CL5
1886    CL6
1887    CL4
1888    CL6
Name: nicotine, Length: 1885, dtype: object

Step 1.2: Select attributes

Feel free to remove any variable you think should not be considered by your machine learning algorithm.

X = data.loc[
    :,
    [
        "age",
        "gender",
        "education",
        "country",
        "ethnicity",
        "nscore",
        "escore",
        "oscore",
        "ascore",
        "cscore",
        "impulsive",
        "ss"
    ]
]
X

	age	gender	education	country	ethnicity	nscore	escore	oscore	ascore	cscore	impulsive	ss
1	0.49788	0.48246	-0.05921	0.96082	0.12600	0.31287	-0.57545	-0.58331	-0.91699	-0.00665	-0.21712	-1.18084
2	-0.07854	-0.48246	1.98437	0.96082	-0.31685	-0.67825	1.93886	1.43533	0.76096	-0.14277	-0.71126	-0.21575
3	0.49788	-0.48246	-0.05921	0.96082	-0.31685	-0.46725	0.80523	-0.84732	-1.62090	-1.01450	-1.37983	0.40148
4	-0.95197	0.48246	1.16365	0.96082	-0.31685	-0.14882	-0.80615	-0.01928	0.59042	0.58489	-1.37983	-1.18084
5	0.49788	0.48246	1.98437	0.96082	-0.31685	0.73545	-1.63340	-0.45174	-0.30172	1.30612	-0.21712	-0.21575
...	...	...	...	...	...	...	...	...	...	...	...	...
1884	-0.95197	0.48246	-0.61113	-0.57009	-0.31685	-1.19430	1.74091	1.88511	0.76096	-1.13788	0.88113	1.92173
1885	-0.95197	-0.48246	-0.61113	-0.57009	-0.31685	-0.24649	1.74091	0.58331	0.76096	-1.51840	0.88113	0.76540
1886	-0.07854	0.48246	0.45468	-0.57009	-0.31685	1.13281	-1.37639	-1.27553	-1.77200	-1.38502	0.52975	-0.52593
1887	-0.95197	0.48246	-0.61113	-0.57009	-0.31685	0.91093	-1.92173	0.29338	-1.62090	-2.57309	1.29221	1.22470
1888	-0.95197	-0.48246	-0.61113	0.21128	-0.31685	-0.46725	2.12700	1.65653	1.11406	0.41594	0.88113	1.22470

1885 rows × 12 columns
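If you decide to leave some attributes out, one option is `DataFrame.drop`, sketched here on a small made-up frame. `X_toy` and its values are hypothetical, and dropping `country` and `ethnicity` is only an example; which columns to remove is your call.

```python
import pandas as pd

# Hypothetical stand-in for X with invented values.
X_toy = pd.DataFrame(
    {
        "age": [0.50, -0.08, -0.95],
        "country": [0.96, 0.96, -0.57],
        "ethnicity": [0.13, -0.32, -0.32],
        "nscore": [0.31, -0.68, 1.13],
    }
)

# drop() returns a new frame; X_toy itself is left unchanged.
X_reduced = X_toy.drop(columns=["country", "ethnicity"])
print(X_reduced.columns.tolist())
```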

Step 2: Select your classification model

from sklearn#FIX ME#


model = #FIX ME#

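Any scikit-learn classifier can fill the two placeholders above. The sketch below lists three common candidates; these are illustrative choices with arbitrary hyperparameters, not the intended solution of the activity.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Three candidate classifiers (assumed choices; hyperparameters are arbitrary).
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=3, random_state=42),
    "k_nearest_neighbors": KNeighborsClassifier(n_neighbors=5),
}

# Pick one; every scikit-learn classifier exposes the same fit/predict interface.
model = candidates["decision_tree"]
print(model)
```

Because the interface is shared, swapping the model later only means changing this one line.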

Step 3: Train your model

model.fit(#FIX ME#)

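`fit` takes the attribute matrix first and the target second, so with the variables defined earlier the call would be `model.fit(X, y)`. A minimal sketch on made-up data follows; `X_toy` and `y_toy` are hypothetical stand-ins, and the decision tree is an assumed model choice.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-ins for X and y with invented values.
X_toy = pd.DataFrame(
    {
        "impulsive": [-0.22, -0.71, -1.38, 0.88, 1.29, 0.53],
        "ss": [-1.18, -0.22, 0.40, 1.92, 1.22, -0.53],
    }
)
y_toy = pd.Series(["CL0", "CL0", "CL0", "CL6", "CL6", "CL6"])

# Assumed model choice; any scikit-learn classifier is trained the same way.
model = DecisionTreeClassifier(random_state=42)
model.fit(X_toy, y_toy)

# After fitting, the model knows the class labels and can predict new rows.
print(model.classes_)
predictions = model.predict(X_toy)
print(predictions)
```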

Step 4: Evaluate your model

from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_estimator(model, X, y)
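The call above draws the counts for each pair of true and predicted class. The same numbers are available as an array through `confusion_matrix`, and `accuracy_score` condenses them into a single fraction. A sketch with made-up label lists (`y_true` and `y_pred` are hypothetical, not results from this dataset):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels, invented for illustration.
labels = ["CL0", "CL1", "CL2"]
y_true = ["CL0", "CL0", "CL1", "CL1", "CL2", "CL2"]
y_pred = ["CL0", "CL1", "CL1", "CL1", "CL2", "CL0"]

# Rows are true classes, columns are predicted classes, in the order of `labels`.
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)

# Fraction of correct predictions: diagonal sum divided by the total.
accuracy = accuracy_score(y_true, y_pred)
print(accuracy)
```

Note that `ConfusionMatrixDisplay.from_estimator(model, X, y)` evaluates the model on the same rows it was trained on, which tends to look better than performance on unseen data.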

Optional: Train/test split

Train your model on the training data and compute its score on the test data. Feel free to do some hyperparameter optimization as well.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# YOU CAN DO IT! #
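One way to approach this, sketched on synthetic data from `make_classification` so that it runs on its own: fit on the training split only, tune a hyperparameter by cross-validation with `GridSearchCV`, and report the score on the held-out test split. The decision tree and the `max_depth` grid are assumed choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X and y (12 attributes, 3 classes).
X_toy, y_toy = make_classification(
    n_samples=300, n_features=12, n_informative=6, n_classes=3, random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42
)

# 5-fold cross-validated search over tree depth, using training data only.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={"max_depth": [2, 4, 8, None]},
    cv=5,
)
search.fit(X_train, y_train)

# Accuracy of the best model (refit on all training data) on unseen rows.
test_score = search.score(X_test, y_test)
print(search.best_params_, test_score)
```

Keeping the test split out of the search means the final score is measured on rows that influenced neither the fit nor the choice of `max_depth`.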