Activity - Data Visualization
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style="darkgrid")
We will use the Wisconsin breast cancer dataset for this activity. The main goal is to create visualizations that help identify which attributes could predict or classify the type of tumor (benign or malignant).
Attribute | Domain |
---|---|
Sample code number | id number |
Clump Thickness | 1 - 10 |
Uniformity of Cell Size | 1 - 10 |
Uniformity of Cell Shape | 1 - 10 |
Marginal Adhesion | 1 - 10 |
Single Epithelial Cell Size | 1 - 10 |
Bare Nuclei | 1 - 10 |
Bland Chromatin | 1 - 10 |
Normal Nucleoli | 1 - 10 |
Mitoses | 1 - 10 |
Class | 2 for benign, 4 for malignant |
More details here.
breast_cancer_data = pd.read_csv(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data",
    names=[
        "code",
        "clump_thickness",
        "uniformity_cell_size",
        "uniformity_cell_shape",
        "marginal_adhesion",
        "single_epithelial_cell_size",
        "bare_nuclei",
        "bland_chromatin",
        "normal_cucleoli",
        "mitoses",
        "class",
    ],
    index_col=0
)
breast_cancer_data.head()
code | clump_thickness | uniformity_cell_size | uniformity_cell_shape | marginal_adhesion | single_epithelial_cell_size | bare_nuclei | bland_chromatin | normal_cucleoli | mitoses | class |
---|---|---|---|---|---|---|---|---|---|---|
1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |
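For now, we will drop the bare_nuclei column. As the output below shows, it was read with dtype object rather than int64, most likely because the raw file marks missing values with a non-numeric placeholder such as "?".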
breast_cancer_data.dtypes
clump_thickness int64
uniformity_cell_size int64
uniformity_cell_shape int64
marginal_adhesion int64
single_epithelial_cell_size int64
bare_nuclei object
bland_chromatin int64
normal_cucleoli int64
mitoses int64
class int64
dtype: object
breast_cancer_data.drop(columns="bare_nuclei", inplace=True)
breast_cancer_data.dtypes
clump_thickness int64
uniformity_cell_size int64
uniformity_cell_shape int64
marginal_adhesion int64
single_epithelial_cell_size int64
bland_chromatin int64
normal_cucleoli int64
mitoses int64
class int64
dtype: object
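Dropping the column is the simplest option. If you would rather keep it, the following is a minimal sketch of one alternative: coerce the column to numbers and discard only the rows that fail to parse. It assumes the column is `object` because of non-numeric placeholders; the `toy` frame and its values (including `"?"`) are hypothetical stand-ins, not rows from the dataset.

```python
import pandas as pd

# Hypothetical toy frame standing in for breast_cancer_data before the drop.
toy = pd.DataFrame(
    {"bare_nuclei": ["1", "10", "?", "4"], "class": [2, 4, 2, 4]}
)

# Anything that cannot be parsed as a number becomes NaN.
toy["bare_nuclei"] = pd.to_numeric(toy["bare_nuclei"], errors="coerce")
n_missing = int(toy["bare_nuclei"].isna().sum())

# Remove only the affected rows, then restore the integer dtype.
cleaned = toy.dropna(subset=["bare_nuclei"]).astype({"bare_nuclei": "int64"})
```

Whether losing a few rows is preferable to losing a whole attribute depends on how many values are missing, so checking `n_missing` first is a reasonable habit.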
And let’s add a categorical column with the type of tumor.
class_dict = {2: "benign", 4: "malignant"}
breast_cancer_data["class_name"] = breast_cancer_data["class"].map(class_dict)
breast_cancer_data.head()
code | clump_thickness | uniformity_cell_size | uniformity_cell_shape | marginal_adhesion | single_epithelial_cell_size | bland_chromatin | normal_cucleoli | mitoses | class | class_name |
---|---|---|---|---|---|---|---|---|---|---|
1000025 | 5 | 1 | 1 | 1 | 2 | 3 | 1 | 1 | 2 | benign |
1002945 | 5 | 4 | 4 | 5 | 7 | 3 | 2 | 1 | 2 | benign |
1015425 | 3 | 1 | 1 | 1 | 2 | 3 | 1 | 1 | 2 | benign |
1016277 | 6 | 8 | 8 | 1 | 3 | 3 | 7 | 1 | 2 | benign |
1017023 | 4 | 1 | 1 | 3 | 2 | 3 | 1 | 1 | 2 | benign |
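Since `map` silently produces NaN for any value missing from the dictionary, a quick sanity check after the mapping can be useful. The sketch below uses a small hypothetical series (`toy_class`) in place of the real `class` column; it counts each label and confirms nothing was left unmapped.

```python
import pandas as pd

# Hypothetical class codes standing in for breast_cancer_data["class"].
toy_class = pd.Series([2, 4, 2, 2, 4], name="class")
class_dict = {2: "benign", 4: "malignant"}

toy_names = toy_class.map(class_dict)

# How many samples of each type, and how many codes had no mapping?
counts = toy_names.value_counts()
unmapped = int(toy_names.isna().sum())
```

On the real data the same two lines would be applied to `breast_cancer_data["class_name"]`; the counts also show how balanced the two classes are, which matters when reading any plot colored by class.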
For example, here is a scatter plot of two attributes, colored by tumor type:
sns.scatterplot(
    data=breast_cancer_data,
    x="clump_thickness",
    y="uniformity_cell_size",
    hue="class_name"
)
<Axes: xlabel='clump_thickness', ylabel='uniformity_cell_size'>
![../_images/b77e2f5bc615094a51d30e5ae0055c1c7f97c624c4c7111bb47e820e14f905a7.png](../_images/b77e2f5bc615094a51d30e5ae0055c1c7f97c624c4c7111bb47e820e14f905a7.png)
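Because every attribute is an integer from 1 to 10, many samples share the same (x, y) position and are drawn on top of each other in a scatter plot. One possible way to expose this, sketched below with a hypothetical `toy` frame, is to count how many samples fall on each position per class.

```python
import pandas as pd

# Hypothetical rows standing in for breast_cancer_data.
toy = pd.DataFrame(
    {
        "clump_thickness": [5, 5, 5, 3, 8, 8],
        "uniformity_cell_size": [1, 1, 1, 1, 10, 10],
        "class_name": [
            "benign", "benign", "malignant",
            "benign", "malignant", "malignant",
        ],
    }
)

# Number of samples at each (x, y, class) combination.
pair_counts = (
    toy.groupby(["clump_thickness", "uniformity_cell_size", "class_name"])
    .size()
    .rename("n")
    .reset_index()
)
```

The resulting frame could then be passed to `sns.scatterplot` with `size="n"` so that marker size reflects how many samples overlap at each point.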
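Your turn! Explore other attributes and plot types to see which ones best separate benign from malignant tumors.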
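As one possible starting point (a sketch, not the intended solution), the data can be reshaped to long form so that several attributes are compared by class in a single figure. The `toy` frame and the names `attribute`, `score`, and `long_form` are illustrative assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical rows standing in for breast_cancer_data.
toy = pd.DataFrame(
    {
        "clump_thickness": [5, 3, 8, 10, 1, 7],
        "uniformity_cell_size": [1, 1, 10, 7, 1, 8],
        "class_name": [
            "benign", "benign", "malignant",
            "malignant", "benign", "malignant",
        ],
    }
)

# One row per (sample, attribute) pair.
long_form = toy.melt(
    id_vars="class_name", var_name="attribute", value_name="score"
)

# Side-by-side box plots: one pair of boxes per attribute.
fig, ax = plt.subplots(figsize=(8, 4))
sns.boxplot(data=long_form, x="attribute", y="score", hue="class_name", ax=ax)
ax.tick_params(axis="x", rotation=45)
fig.tight_layout()
```

With the real data, the numeric `class` column would need to be dropped before melting, for example `breast_cancer_data.drop(columns="class").melt(id_vars="class_name", ...)`. Attributes whose boxes barely overlap between the two classes are natural candidates for classification.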