Activity - Machine Learning - Introduction
import pandas as pd
In this activity you will try to classify drug consumption from personal attributes. Please work collaboratively and discuss your ideas; at the end we will share results.
data = pd.read_csv(
"https://archive.ics.uci.edu/ml/machine-learning-databases/00373/drug_consumption.data",
names=[
"age",
"gender",
"education",
"country",
"ethnicity",
"nscore",
"escore",
"oscore",
"ascore",
"cscore",
"impulsive",
"ss",
"alcohol",
"amphet",
"amyl",
"benzos",
"caff",
"cannabis",
"choc",
"coke",
"crack",
"ecstasy",
"heroin",
"ketamine",
"legalh",
"lsd",
"meth",
"mushrooms",
"nicotine",
"semer",
"vsa"
],
index_col=0
)
data
| | age | gender | education | country | ethnicity | nscore | escore | oscore | ascore | cscore | ... | ecstasy | heroin | ketamine | legalh | lsd | meth | mushrooms | nicotine | semer | vsa |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0.49788 | 0.48246 | -0.05921 | 0.96082 | 0.12600 | 0.31287 | -0.57545 | -0.58331 | -0.91699 | -0.00665 | ... | CL0 | CL0 | CL0 | CL0 | CL0 | CL0 | CL0 | CL2 | CL0 | CL0 |
2 | -0.07854 | -0.48246 | 1.98437 | 0.96082 | -0.31685 | -0.67825 | 1.93886 | 1.43533 | 0.76096 | -0.14277 | ... | CL4 | CL0 | CL2 | CL0 | CL2 | CL3 | CL0 | CL4 | CL0 | CL0 |
3 | 0.49788 | -0.48246 | -0.05921 | 0.96082 | -0.31685 | -0.46725 | 0.80523 | -0.84732 | -1.62090 | -1.01450 | ... | CL0 | CL0 | CL0 | CL0 | CL0 | CL0 | CL1 | CL0 | CL0 | CL0 |
4 | -0.95197 | 0.48246 | 1.16365 | 0.96082 | -0.31685 | -0.14882 | -0.80615 | -0.01928 | 0.59042 | 0.58489 | ... | CL0 | CL0 | CL2 | CL0 | CL0 | CL0 | CL0 | CL2 | CL0 | CL0 |
5 | 0.49788 | 0.48246 | 1.98437 | 0.96082 | -0.31685 | 0.73545 | -1.63340 | -0.45174 | -0.30172 | 1.30612 | ... | CL1 | CL0 | CL0 | CL1 | CL0 | CL0 | CL2 | CL2 | CL0 | CL0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1884 | -0.95197 | 0.48246 | -0.61113 | -0.57009 | -0.31685 | -1.19430 | 1.74091 | 1.88511 | 0.76096 | -1.13788 | ... | CL0 | CL0 | CL0 | CL3 | CL3 | CL0 | CL0 | CL0 | CL0 | CL5 |
1885 | -0.95197 | -0.48246 | -0.61113 | -0.57009 | -0.31685 | -0.24649 | 1.74091 | 0.58331 | 0.76096 | -1.51840 | ... | CL2 | CL0 | CL0 | CL3 | CL5 | CL4 | CL4 | CL5 | CL0 | CL0 |
1886 | -0.07854 | 0.48246 | 0.45468 | -0.57009 | -0.31685 | 1.13281 | -1.37639 | -1.27553 | -1.77200 | -1.38502 | ... | CL4 | CL0 | CL2 | CL0 | CL2 | CL0 | CL2 | CL6 | CL0 | CL0 |
1887 | -0.95197 | 0.48246 | -0.61113 | -0.57009 | -0.31685 | 0.91093 | -1.92173 | 0.29338 | -1.62090 | -2.57309 | ... | CL3 | CL0 | CL0 | CL3 | CL3 | CL0 | CL3 | CL4 | CL0 | CL0 |
1888 | -0.95197 | -0.48246 | -0.61113 | 0.21128 | -0.31685 | -0.46725 | 2.12700 | 1.65653 | 1.11406 | 0.41594 | ... | CL3 | CL0 | CL0 | CL3 | CL3 | CL0 | CL3 | CL6 | CL0 | CL2 |
1885 rows × 31 columns
Step 0: Explore the data
Please read the description of the data and its variables at the following link: https://archive-beta.ics.uci.edu/dataset/373/drug+consumption+quantified
Feel free to compute some descriptive statistics, but don’t spend more than 5 minutes on them.
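As a starting point for the descriptive statistics, here is a minimal sketch. It uses a small hypothetical frame called `toy`, built from a few values visible in the first five rows above, so that it runs without downloading anything; on the real data you would call the same method on `data`.

```python
import pandas as pd

# Hypothetical toy frame (an assumption for illustration): three columns
# copied from the first five rows shown in the output above.
toy = pd.DataFrame(
    {
        "age": [0.49788, -0.07854, 0.49788, -0.95197, 0.49788],
        "nscore": [0.31287, -0.67825, -0.46725, -0.14882, 0.73545],
        "nicotine": ["CL2", "CL4", "CL0", "CL2", "CL2"],
    }
)

# describe() summarizes only the numeric columns by default
# (count, mean, std, min, quartiles, max).
numeric_summary = toy.describe()
print(numeric_summary)
```

The drug columns hold class labels rather than numbers, so they are left out of this summary.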
Step 1: Prepare your data
Step 1.1: Select a drug
y = data["nicotine"]  # Change this to the drug you want to classify
y
1 CL2
2 CL4
3 CL0
4 CL2
5 CL2
...
1884 CL0
1885 CL5
1886 CL6
1887 CL4
1888 CL6
Name: nicotine, Length: 1885, dtype: object
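Before choosing a model it can help to see how the classes are distributed. The sketch below is only an illustration: `y_sample` is a hypothetical series holding the ten nicotine labels visible in the output above, not the full column. On the real data you would apply the same calls to `y`.

```python
import pandas as pd

# The ten labels visible in the printed output above (an illustrative
# sample, not the full 1885-row column).
y_sample = pd.Series(
    ["CL2", "CL4", "CL0", "CL2", "CL2", "CL0", "CL5", "CL6", "CL4", "CL6"],
    name="nicotine",
)

# Share of each class, ordered by class label rather than by frequency.
class_share = y_sample.value_counts(normalize=True).sort_index()
print(class_share)
```

A strongly unbalanced distribution is worth keeping in mind when you later interpret accuracy scores.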
Step 1.2: Select attributes
Feel free to remove any variable you think should not be considered by your machine learning algorithm.
X = data.loc[
:,
[
"age",
"gender",
"education",
"country",
"ethnicity",
"nscore",
"escore",
"oscore",
"ascore",
"cscore",
"impulsive",
"ss"
]
]
X
| | age | gender | education | country | ethnicity | nscore | escore | oscore | ascore | cscore | impulsive | ss |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0.49788 | 0.48246 | -0.05921 | 0.96082 | 0.12600 | 0.31287 | -0.57545 | -0.58331 | -0.91699 | -0.00665 | -0.21712 | -1.18084 |
2 | -0.07854 | -0.48246 | 1.98437 | 0.96082 | -0.31685 | -0.67825 | 1.93886 | 1.43533 | 0.76096 | -0.14277 | -0.71126 | -0.21575 |
3 | 0.49788 | -0.48246 | -0.05921 | 0.96082 | -0.31685 | -0.46725 | 0.80523 | -0.84732 | -1.62090 | -1.01450 | -1.37983 | 0.40148 |
4 | -0.95197 | 0.48246 | 1.16365 | 0.96082 | -0.31685 | -0.14882 | -0.80615 | -0.01928 | 0.59042 | 0.58489 | -1.37983 | -1.18084 |
5 | 0.49788 | 0.48246 | 1.98437 | 0.96082 | -0.31685 | 0.73545 | -1.63340 | -0.45174 | -0.30172 | 1.30612 | -0.21712 | -0.21575 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1884 | -0.95197 | 0.48246 | -0.61113 | -0.57009 | -0.31685 | -1.19430 | 1.74091 | 1.88511 | 0.76096 | -1.13788 | 0.88113 | 1.92173 |
1885 | -0.95197 | -0.48246 | -0.61113 | -0.57009 | -0.31685 | -0.24649 | 1.74091 | 0.58331 | 0.76096 | -1.51840 | 0.88113 | 0.76540 |
1886 | -0.07854 | 0.48246 | 0.45468 | -0.57009 | -0.31685 | 1.13281 | -1.37639 | -1.27553 | -1.77200 | -1.38502 | 0.52975 | -0.52593 |
1887 | -0.95197 | 0.48246 | -0.61113 | -0.57009 | -0.31685 | 0.91093 | -1.92173 | 0.29338 | -1.62090 | -2.57309 | 1.29221 | 1.22470 |
1888 | -0.95197 | -0.48246 | -0.61113 | 0.21128 | -0.31685 | -0.46725 | 2.12700 | 1.65653 | 1.11406 | 0.41594 | 0.88113 | 1.22470 |
1885 rows × 12 columns
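An alternative to listing the columns you keep is to drop the ones you do not want. The sketch below assumes a hypothetical toy frame, and the choice of `country` and `ethnicity` as dropped columns is purely illustrative, not a recommendation.

```python
import pandas as pd

# Hypothetical toy frame with a few of the attribute columns,
# using values from the first three rows shown above.
toy = pd.DataFrame(
    {
        "age": [0.49788, -0.07854, 0.49788],
        "country": [0.96082, 0.96082, 0.96082],
        "ethnicity": [0.12600, -0.31685, -0.31685],
        "nscore": [0.31287, -0.67825, -0.46725],
    }
)

# drop() returns a new frame; the original is left unchanged.
X_reduced = toy.drop(columns=["country", "ethnicity"])
print(X_reduced.columns.tolist())
```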
Step 2: Select your classification model
from sklearn#FIX ME#
model = #FIX ME#
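If you are unsure where to start, one possible way to fill in the two cells above is sketched here. The decision tree is just one option among the many classifiers scikit-learn offers, and the `max_depth` and `random_state` values are illustrative assumptions rather than the intended answer.

```python
from sklearn.tree import DecisionTreeClassifier

# One possible choice; the hyperparameter values are illustrative only.
model = DecisionTreeClassifier(max_depth=3, random_state=42)
print(model)
```

Any estimator that follows the scikit-learn classifier interface (`fit`, `predict`, `score`) can be swapped in the same way.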
Step 3: Train your model
model.fit(#FIX ME#)
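Training means calling `fit` with the attributes and the target. The sketch below shows the pattern on synthetic data so that it runs on its own: `X_toy` and `y_toy` are hypothetical stand-ins for `X` and `y`, and the decision tree is again only an assumed choice of model.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical stand-ins for X and y: three numeric attributes and a
# two-class label derived from one of them.
X_toy = pd.DataFrame(rng.normal(size=(60, 3)), columns=["nscore", "escore", "ss"])
y_toy = np.where(X_toy["ss"] > 0, "CL2", "CL0")

model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_toy, y_toy)  # attributes first, target second

print(model.classes_)
print(model.predict(X_toy.head()))
```

With the real data, the call would take `X` and `y` from Step 1 instead.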
Step 4: Evaluate your model
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(model, X, y)
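The plot can be complemented with the raw numbers behind it. This is a minimal sketch on synthetic data (`X_toy` and `y_toy` are hypothetical, and the decision tree is an assumed model choice). It computes the confusion matrix and the accuracy on the same data used for training, which mirrors the cell above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical stand-ins for X and y.
X_toy = rng.normal(size=(60, 3))
y_toy = np.where(X_toy[:, 2] > 0, "CL2", "CL0")

model = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_toy, y_toy)
y_pred = model.predict(X_toy)

# Rows are true classes, columns are predicted classes, in the order of `labels`.
cm = confusion_matrix(y_toy, y_pred, labels=model.classes_)
acc = accuracy_score(y_toy, y_pred)
print(cm)
print(acc)
```

Scores computed on the training data tend to be optimistic, which is the motivation for the optional train/test split.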
Optional: Train/test split
Train your model on the training data and compute its score on the test dataset. Feel free to try some hyperparameter optimization as well.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
# YOU CAN DO IT! #
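One possible shape for this optional part is sketched below, under several assumptions: the data are synthetic (`X_toy`, `y_toy`), the model is a decision tree, and the hyperparameter search is a small grid over `max_depth` with 5-fold cross-validation. None of these choices is prescribed by the activity.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical stand-ins for X and y, with some label noise added.
X_toy = rng.normal(size=(200, 3))
y_toy = np.where(X_toy[:, 2] + 0.5 * rng.normal(size=200) > 0, "CL2", "CL0")

X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42
)

# The search uses only the training split; the candidate values are illustrative.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={"max_depth": [1, 2, 3, 5]},
    cv=5,
)
search.fit(X_train, y_train)

# The test split is used once, to score the refitted best model.
test_score = search.score(X_test, y_test)
print(search.best_params_)
print(test_score)
```

Keeping the test split out of the search means the final score is measured on data the model selection never saw.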