Data Analysis#
The main goal of this class is to learn how to gather, explore, clean and analyze different types of datasets.
We will introduce some data analysis common tasks using the pandas
package.
pandas
is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.
import pandas as pd
# from pathlib import Path # Run this line if you are working in a local environment
Dataset: Breast Cancer Wisconsin#
Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.
This breast cancer databases was obtained from the University of Wisconsin Hospitals, Madison from Dr. William H. Wolberg.
Source: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
Attribute |
Domain |
---|---|
Sample code number |
id number |
Clump Thickness |
1 - 10 |
Uniformity of Cell Size |
1 - 10 |
Uniformity of Cell Shape |
1 - 10 |
Marginal Adhesion |
1 - 10 |
Single Epithelial Cell Size |
1 - 10 |
Bare Nuclei |
1 - 10 |
Bland Chromatin |
1 - 10 |
Normal Nucleoli |
1 - 10 |
Mitoses |
1 - 10 |
Class |
(2 for benign, 4 for malignant) |
data_filepath = "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data"
# data_filepath = Path().resolve().parent / "data" / "breast-cancer-wisconsin.data"
data_filepath
'https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data'
More details in the following file you can download: https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.names
The easiest way to open a plain text file as this one is using pd.read_csv
.
breast_cancer_data = pd.read_csv(
data_filepath ,
names=[
"code",
"clump_thickness",
"uniformity_cell_size",
"uniformity_cell_shape",
"marginal_adhesion",
"single_epithelial_cell_size",
"bare_nuclei",
"bland_chromatin",
"normal_cucleoli",
"mitoses",
"class",
],
index_col=0
)
breast_cancer_data.head()
clump_thickness | uniformity_cell_size | uniformity_cell_shape | marginal_adhesion | single_epithelial_cell_size | bare_nuclei | bland_chromatin | normal_cucleoli | mitoses | class | |
---|---|---|---|---|---|---|---|---|---|---|
code | ||||||||||
1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |
Let’s explore this data a little bit before start working with it.
breast_cancer_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 699 entries, 1000025 to 897471
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 clump_thickness 699 non-null int64
1 uniformity_cell_size 699 non-null int64
2 uniformity_cell_shape 699 non-null int64
3 marginal_adhesion 699 non-null int64
4 single_epithelial_cell_size 699 non-null int64
5 bare_nuclei 699 non-null object
6 bland_chromatin 699 non-null int64
7 normal_cucleoli 699 non-null int64
8 mitoses 699 non-null int64
9 class 699 non-null int64
dtypes: int64(9), object(1)
memory usage: 60.1+ KB
breast_cancer_data.dtypes
clump_thickness int64
uniformity_cell_size int64
uniformity_cell_shape int64
marginal_adhesion int64
single_epithelial_cell_size int64
bare_nuclei object
bland_chromatin int64
normal_cucleoli int64
mitoses int64
class int64
dtype: object
breast_cancer_data.describe()
clump_thickness | uniformity_cell_size | uniformity_cell_shape | marginal_adhesion | single_epithelial_cell_size | bland_chromatin | normal_cucleoli | mitoses | class | |
---|---|---|---|---|---|---|---|---|---|
count | 699.000000 | 699.000000 | 699.000000 | 699.000000 | 699.000000 | 699.000000 | 699.000000 | 699.000000 | 699.000000 |
mean | 4.417740 | 3.134478 | 3.207439 | 2.806867 | 3.216023 | 3.437768 | 2.866953 | 1.589413 | 2.689557 |
std | 2.815741 | 3.051459 | 2.971913 | 2.855379 | 2.214300 | 2.438364 | 3.053634 | 1.715078 | 0.951273 |
min | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 2.000000 |
25% | 2.000000 | 1.000000 | 1.000000 | 1.000000 | 2.000000 | 2.000000 | 1.000000 | 1.000000 | 2.000000 |
50% | 4.000000 | 1.000000 | 1.000000 | 1.000000 | 2.000000 | 3.000000 | 1.000000 | 1.000000 | 2.000000 |
75% | 6.000000 | 5.000000 | 5.000000 | 4.000000 | 4.000000 | 5.000000 | 4.000000 | 1.000000 | 4.000000 |
max | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 4.000000 |
breast_cancer_data.describe(include="all")
clump_thickness | uniformity_cell_size | uniformity_cell_shape | marginal_adhesion | single_epithelial_cell_size | bare_nuclei | bland_chromatin | normal_cucleoli | mitoses | class | |
---|---|---|---|---|---|---|---|---|---|---|
count | 699.000000 | 699.000000 | 699.000000 | 699.000000 | 699.000000 | 699 | 699.000000 | 699.000000 | 699.000000 | 699.000000 |
unique | NaN | NaN | NaN | NaN | NaN | 11 | NaN | NaN | NaN | NaN |
top | NaN | NaN | NaN | NaN | NaN | 1 | NaN | NaN | NaN | NaN |
freq | NaN | NaN | NaN | NaN | NaN | 402 | NaN | NaN | NaN | NaN |
mean | 4.417740 | 3.134478 | 3.207439 | 2.806867 | 3.216023 | NaN | 3.437768 | 2.866953 | 1.589413 | 2.689557 |
std | 2.815741 | 3.051459 | 2.971913 | 2.855379 | 2.214300 | NaN | 2.438364 | 3.053634 | 1.715078 | 0.951273 |
min | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | NaN | 1.000000 | 1.000000 | 1.000000 | 2.000000 |
25% | 2.000000 | 1.000000 | 1.000000 | 1.000000 | 2.000000 | NaN | 2.000000 | 1.000000 | 1.000000 | 2.000000 |
50% | 4.000000 | 1.000000 | 1.000000 | 1.000000 | 2.000000 | NaN | 3.000000 | 1.000000 | 1.000000 | 2.000000 |
75% | 6.000000 | 5.000000 | 5.000000 | 4.000000 | 4.000000 | NaN | 5.000000 | 4.000000 | 1.000000 | 4.000000 |
max | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | NaN | 10.000000 | 10.000000 | 10.000000 | 4.000000 |
Series#
Series are one-dimensional labeled arrays. You can think they are similar to columns of a excel spreadsheet.
There are multiple ways to create a pd.Series
, using lists, dictionaies, np.array
or from a file.
Since we already loaded the breast cancer data we will use it as an example. Each list of this file has been converted to a pd.Series
.
clump_thick_series = breast_cancer_data["clump_thickness"].copy()
clump_thick_series.head()
code
1000025 5
1002945 5
1015425 3
1016277 6
1017023 4
Name: clump_thickness, dtype: int64
type(clump_thick_series)
pandas.core.series.Series
pd.Series
are made with index and values.
clump_thick_series.index
Int64Index([1000025, 1002945, 1015425, 1016277, 1017023, 1017122, 1018099,
1018561, 1033078, 1033078,
...
654546, 654546, 695091, 714039, 763235, 776715, 841769,
888820, 897471, 897471],
dtype='int64', name='code', length=699)
clump_thick_series.values
array([ 5, 5, 3, 6, 4, 8, 1, 2, 2, 4, 1, 2, 5, 1, 8, 7, 4,
4, 10, 6, 7, 10, 3, 8, 1, 5, 3, 5, 2, 1, 3, 2, 10, 2,
3, 2, 10, 6, 5, 2, 6, 10, 6, 5, 10, 1, 3, 1, 4, 7, 9,
5, 10, 5, 10, 10, 8, 8, 5, 9, 5, 1, 9, 6, 1, 10, 4, 5,
8, 1, 5, 6, 1, 9, 10, 1, 1, 5, 3, 2, 2, 4, 5, 3, 3,
5, 3, 3, 4, 2, 1, 3, 4, 1, 2, 1, 2, 5, 9, 7, 10, 2,
4, 8, 10, 7, 10, 1, 1, 6, 1, 8, 10, 10, 3, 1, 8, 4, 1,
3, 1, 4, 10, 5, 5, 1, 7, 3, 8, 1, 5, 2, 5, 3, 3, 5,
4, 3, 4, 1, 3, 2, 9, 1, 2, 1, 3, 1, 3, 8, 1, 7, 10,
4, 1, 5, 1, 2, 1, 9, 10, 4, 3, 1, 5, 4, 5, 10, 3, 1,
3, 1, 1, 6, 8, 5, 2, 5, 4, 5, 1, 1, 6, 5, 8, 2, 1,
10, 5, 1, 10, 7, 5, 1, 3, 4, 8, 5, 1, 3, 9, 10, 1, 5,
1, 5, 10, 1, 1, 5, 8, 8, 1, 10, 10, 8, 1, 1, 6, 6, 1,
10, 4, 7, 10, 1, 10, 8, 1, 10, 7, 6, 8, 10, 3, 3, 10, 9,
8, 10, 5, 3, 2, 1, 1, 5, 8, 8, 4, 3, 1, 10, 6, 6, 9,
5, 3, 3, 3, 5, 10, 5, 8, 10, 7, 5, 10, 3, 10, 1, 8, 5,
3, 7, 3, 3, 3, 1, 1, 10, 3, 2, 1, 10, 7, 8, 10, 3, 6,
5, 1, 1, 8, 10, 1, 5, 5, 5, 8, 9, 8, 1, 10, 1, 8, 10,
1, 1, 7, 3, 2, 1, 8, 1, 1, 4, 5, 6, 1, 4, 7, 3, 3,
5, 1, 3, 10, 1, 8, 10, 10, 5, 5, 5, 8, 1, 6, 1, 1, 8,
10, 1, 2, 1, 7, 1, 5, 1, 3, 4, 5, 2, 3, 2, 1, 4, 5,
8, 8, 10, 6, 3, 3, 4, 2, 2, 6, 5, 1, 1, 4, 1, 4, 5,
3, 1, 1, 1, 3, 5, 1, 10, 3, 2, 2, 3, 7, 5, 2, 5, 1,
10, 3, 1, 1, 3, 3, 4, 3, 1, 3, 3, 5, 3, 1, 1, 4, 1,
2, 3, 1, 1, 10, 5, 8, 3, 8, 1, 5, 2, 3, 10, 4, 5, 3,
9, 5, 8, 1, 2, 1, 5, 5, 3, 6, 10, 10, 4, 4, 5, 10, 5,
1, 1, 5, 2, 1, 5, 1, 5, 4, 5, 3, 4, 2, 10, 10, 8, 5,
5, 5, 3, 6, 4, 4, 10, 10, 6, 4, 1, 3, 6, 6, 4, 5, 3,
4, 4, 5, 4, 5, 5, 9, 8, 5, 1, 3, 10, 3, 6, 1, 5, 4,
5, 5, 3, 1, 4, 4, 4, 6, 4, 4, 4, 1, 3, 8, 1, 5, 2,
1, 5, 5, 3, 6, 4, 1, 1, 3, 4, 1, 4, 10, 7, 3, 3, 4,
4, 6, 4, 7, 4, 1, 3, 2, 1, 5, 5, 4, 6, 5, 3, 5, 4,
2, 5, 6, 2, 3, 7, 3, 1, 3, 4, 3, 4, 5, 5, 2, 5, 5,
5, 1, 3, 4, 5, 3, 4, 8, 10, 8, 7, 3, 1, 10, 5, 5, 1,
1, 1, 5, 5, 6, 3, 5, 1, 8, 5, 9, 5, 4, 2, 10, 5, 4,
5, 4, 5, 3, 5, 3, 1, 4, 5, 5, 10, 4, 1, 5, 5, 10, 5,
8, 2, 2, 4, 3, 1, 4, 5, 3, 6, 7, 1, 5, 3, 4, 2, 2,
4, 6, 5, 1, 8, 3, 3, 10, 4, 4, 5, 4, 3, 3, 1, 2, 3,
1, 1, 5, 3, 3, 1, 5, 4, 3, 3, 5, 5, 7, 1, 1, 4, 1,
1, 3, 1, 5, 3, 5, 5, 3, 3, 2, 5, 1, 4, 1, 5, 1, 2,
10, 5, 5, 1, 1, 1, 1, 3, 4, 1, 1, 5, 3, 3, 3, 2, 5,
4, 4])
Now, imagine you want to access to a specific value from the third patient.
clump_thick_series.iloc[2] # Remember Python is a 0-indexed progamming language.
3
However what if you want to know the clump thickness of a specific patient. Since we have their codes we can access with another method.
For example, for patient’s code 1166654
clump_thick_series.loc[1166654]
10
Don’t forget
loc
refers to indexes (labels).iloc
refers to order.
We will focus on loc
instead of iloc
since the power of pandas
comes from its indexes can be numeric or categoricals. If you only need to do order-based analysis pandas
could be overkill and numpy
could be enough.
What if you want to get the values of several patients? For example patients 1166654
and 1178580
clump_thick_series.loc[[1166654, 1178580]]
code
1166654 10
1178580 5
Name: clump_thickness, dtype: int64
Important
Notice if the argument is just one label the loc
returns only the value. On the other hand, if the argument is a list then loc
returns a pd.Series
object.
type(clump_thick_series.loc[1166654])
numpy.int64
type(clump_thick_series.loc[[1166654, 1178580]])
pandas.core.series.Series
You can even edit or add values with these methods.
For instance, what if the dataset is wrong about patient 1166654
and clump thickness should have been 6
instead of 10
?
We can fix that easily.
clump_thick_series.loc[1166654] = 6
Warning
You should have got a SettingWithCopyWarning
message after running the last code cell if we had not used the copy()
method.
I would suggest you to read this link if you get that warning. But in simple words, loc
returns a view, that means if you change anything it will change the main object itself. This is a feature, not an error. We have to be careful with this in the future.
Ok, let’s check that change we made
clump_thick_series.loc[1166654]
6
Howerver, notice since we copied the column this didn’t change the value in the original dataset.
breast_cancer_data.loc[1166654]
clump_thickness 10
uniformity_cell_size 3
uniformity_cell_shape 5
marginal_adhesion 1
single_epithelial_cell_size 10
bare_nuclei 5
bland_chromatin 3
normal_cucleoli 10
mitoses 2
class 4
Name: 1166654, dtype: object
Attention
You can try to create clump_thick_series
without the .copy()
method and explore what happens if you change values.
I would suggest you to use copies if you are not sure.
Another common mask is when you want to filter by a condition.
For example, let’s get all the patients with a clump thickness greater than 7.
clump_thick_series > 7
code
1000025 False
1002945 False
1015425 False
1016277 False
1017023 False
...
776715 False
841769 False
888820 False
897471 False
897471 False
Name: clump_thickness, Length: 699, dtype: bool
You can do logical comparations with pd.Series
but this only will return another pd.Series
of boolean objects (True/False). We want to keep only those ones where the value is true.
clump_thick_series.loc[clump_thick_series > 7]
code
1017122 8
1044572 8
1050670 10
1054593 10
1057013 8
..
736150 10
822829 8
1253955 8
1268952 10
1369821 10
Name: clump_thickness, Length: 128, dtype: int64
You can avoid using loc
in this task but to be honest I rather use it.
clump_thick_series[clump_thick_series > 7]
code
1017122 8
1044572 8
1050670 10
1054593 10
1057013 8
..
736150 10
822829 8
1253955 8
1268952 10
1369821 10
Name: clump_thickness, Length: 128, dtype: int64
However, my favorite version is using a functional approach with the function lambda
. It is less intuitive at the beginning but it allows you to concatenate operations.
clump_thick_series.loc[lambda x: x > 7]
code
1017122 8
1044572 8
1050670 10
1054593 10
1057013 8
..
736150 10
822829 8
1253955 8
1268952 10
1369821 10
Name: clump_thickness, Length: 128, dtype: int64
DataFrames#
pd.DataFrame
are 2-dimensional arrays with horizontal and vertical labels (indexes and columns). It is the natural extension of pd.Series
and you can even think they are a multiple pd.Series
concatenated.
type(breast_cancer_data)
pandas.core.frame.DataFrame
There are a few useful methods for exploring the data, let’s explore some of them.
breast_cancer_data.shape
(699, 10)
breast_cancer_data.head()
clump_thickness | uniformity_cell_size | uniformity_cell_shape | marginal_adhesion | single_epithelial_cell_size | bare_nuclei | bland_chromatin | normal_cucleoli | mitoses | class | |
---|---|---|---|---|---|---|---|---|---|---|
code | ||||||||||
1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |
breast_cancer_data.tail()
clump_thickness | uniformity_cell_size | uniformity_cell_shape | marginal_adhesion | single_epithelial_cell_size | bare_nuclei | bland_chromatin | normal_cucleoli | mitoses | class | |
---|---|---|---|---|---|---|---|---|---|---|
code | ||||||||||
776715 | 3 | 1 | 1 | 1 | 3 | 2 | 1 | 1 | 1 | 2 |
841769 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 2 |
888820 | 5 | 10 | 10 | 3 | 7 | 3 | 8 | 10 | 2 | 4 |
897471 | 4 | 8 | 6 | 4 | 3 | 4 | 10 | 6 | 1 | 4 |
897471 | 4 | 8 | 8 | 5 | 4 | 5 | 10 | 4 | 1 | 4 |
breast_cancer_data.max()
clump_thickness 10
uniformity_cell_size 10
uniformity_cell_shape 10
marginal_adhesion 10
single_epithelial_cell_size 10
bare_nuclei ?
bland_chromatin 10
normal_cucleoli 10
mitoses 10
class 4
dtype: object
breast_cancer_data.mean()
/tmp/ipykernel_16492/3888804924.py:1: FutureWarning: The default value of numeric_only in DataFrame.mean is deprecated. In a future version, it will default to False. In addition, specifying 'numeric_only=None' is deprecated. Select only valid columns or specify the value of numeric_only to silence this warning.
breast_cancer_data.mean()
clump_thickness 4.417740
uniformity_cell_size 3.134478
uniformity_cell_shape 3.207439
marginal_adhesion 2.806867
single_epithelial_cell_size 3.216023
bland_chromatin 3.437768
normal_cucleoli 2.866953
mitoses 1.589413
class 2.689557
dtype: float64
We can inspectionate values using loc
as well
breast_cancer_data.loc[1166654]
clump_thickness 10
uniformity_cell_size 3
uniformity_cell_shape 5
marginal_adhesion 1
single_epithelial_cell_size 10
bare_nuclei 5
bland_chromatin 3
normal_cucleoli 10
mitoses 2
class 4
Name: 1166654, dtype: object
However, you shouldn’t use a double loc
. Technical reason here.
breast_cancer_data.loc[1166654].loc["clump_thickness"]
10
Just use a double index notation
breast_cancer_data.loc[1166654, "clump_thickness"]
10
breast_cancer_data.loc[1166654, ["clump_thickness", "class"]]
clump_thickness 10
class 4
Name: 1166654, dtype: object
breast_cancer_data.loc[[1166654, 1178580], ["clump_thickness", "class"]]
clump_thickness | class | |
---|---|---|
code | ||
1166654 | 10 | 4 |
1178580 | 5 | 2 |
Boolean masks also work
breast_cancer_data.loc[lambda x: x["clump_thickness"] > 7]
clump_thickness | uniformity_cell_size | uniformity_cell_shape | marginal_adhesion | single_epithelial_cell_size | bare_nuclei | bland_chromatin | normal_cucleoli | mitoses | class | |
---|---|---|---|---|---|---|---|---|---|---|
code | ||||||||||
1017122 | 8 | 10 | 10 | 8 | 7 | 10 | 9 | 7 | 1 | 4 |
1044572 | 8 | 7 | 5 | 10 | 7 | 9 | 5 | 5 | 4 | 4 |
1050670 | 10 | 7 | 7 | 6 | 4 | 10 | 4 | 1 | 2 | 4 |
1054593 | 10 | 5 | 5 | 3 | 6 | 7 | 7 | 10 | 1 | 4 |
1057013 | 8 | 4 | 5 | 1 | 2 | ? | 7 | 3 | 1 | 4 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
736150 | 10 | 4 | 3 | 10 | 3 | 10 | 7 | 1 | 2 | 4 |
822829 | 8 | 10 | 10 | 10 | 6 | 10 | 10 | 10 | 10 | 4 |
1253955 | 8 | 7 | 4 | 4 | 5 | 3 | 5 | 10 | 1 | 4 |
1268952 | 10 | 10 | 7 | 8 | 7 | 1 | 10 | 10 | 3 | 4 |
1369821 | 10 | 10 | 10 | 10 | 5 | 10 | 10 | 10 | 7 | 4 |
129 rows × 10 columns
Or getting a specific column
breast_cancer_data.loc[:, "bare_nuclei"]
code
1000025 1
1002945 10
1015425 2
1016277 4
1017023 1
..
776715 2
841769 1
888820 3
897471 4
897471 5
Name: bare_nuclei, Length: 699, dtype: object
However, you can also access directly to a column without loc
.
breast_cancer_data["bare_nuclei"]
code
1000025 1
1002945 10
1015425 2
1016277 4
1017023 1
..
776715 2
841769 1
888820 3
897471 4
897471 5
Name: bare_nuclei, Length: 699, dtype: object
There are some cool methods you can use for exploring your data
breast_cancer_data.loc[:, "bare_nuclei"].value_counts()
1 402
10 132
2 30
5 30
3 28
8 21
4 19
? 16
9 9
7 8
6 4
Name: bare_nuclei, dtype: int64
What about with those ?
values?
They are representing a missing value!
breast_cancer_data.loc[lambda s: s['bare_nuclei'] == '?']
clump_thickness | uniformity_cell_size | uniformity_cell_shape | marginal_adhesion | single_epithelial_cell_size | bare_nuclei | bland_chromatin | normal_cucleoli | mitoses | class | |
---|---|---|---|---|---|---|---|---|---|---|
code | ||||||||||
1057013 | 8 | 4 | 5 | 1 | 2 | ? | 7 | 3 | 1 | 4 |
1096800 | 6 | 6 | 6 | 9 | 6 | ? | 7 | 8 | 1 | 2 |
1183246 | 1 | 1 | 1 | 1 | 1 | ? | 2 | 1 | 1 | 2 |
1184840 | 1 | 1 | 3 | 1 | 2 | ? | 2 | 1 | 1 | 2 |
1193683 | 1 | 1 | 2 | 1 | 3 | ? | 1 | 1 | 1 | 2 |
1197510 | 5 | 1 | 1 | 1 | 2 | ? | 3 | 1 | 1 | 2 |
1241232 | 3 | 1 | 4 | 1 | 2 | ? | 3 | 1 | 1 | 2 |
169356 | 3 | 1 | 1 | 1 | 2 | ? | 3 | 1 | 1 | 2 |
432809 | 3 | 1 | 3 | 1 | 2 | ? | 2 | 1 | 1 | 2 |
563649 | 8 | 8 | 8 | 1 | 2 | ? | 6 | 10 | 1 | 4 |
606140 | 1 | 1 | 1 | 1 | 2 | ? | 2 | 1 | 1 | 2 |
61634 | 5 | 4 | 3 | 1 | 2 | ? | 2 | 3 | 1 | 2 |
704168 | 4 | 6 | 5 | 6 | 7 | ? | 4 | 9 | 1 | 2 |
733639 | 3 | 1 | 1 | 1 | 2 | ? | 3 | 1 | 1 | 2 |
1238464 | 1 | 1 | 1 | 1 | 1 | ? | 2 | 1 | 1 | 2 |
1057067 | 1 | 1 | 1 | 1 | 1 | ? | 1 | 1 | 1 | 2 |
pandas
has a specific object for denoting null values, pd.NA
.
breast_cancer_data.loc[lambda s: s['bare_nuclei'] == '?', 'bare_nuclei'] = pd.NA
Tip
Same result but way more elegant is achieved with the following code line breast_cancer_data.replace({'bare_nuclei': {"?": pd.NA}})
Let’s see the rows with null values
breast_cancer_data.loc[lambda s: s['bare_nuclei'] == pd.NA]
clump_thickness | uniformity_cell_size | uniformity_cell_shape | marginal_adhesion | single_epithelial_cell_size | bare_nuclei | bland_chromatin | normal_cucleoli | mitoses | class | |
---|---|---|---|---|---|---|---|---|---|---|
code |
Wait a second, why is it not showing me the null values? Null values have weird behaviors in Python.
pd.NA == pd.NA
<NA>
Important
pandas
will recognize None
, np.na
and pd.NA
as null values, be careful!
There are special methods for working with null values
breast_cancer_data.loc[lambda s: s['bare_nuclei'].isnull()]
clump_thickness | uniformity_cell_size | uniformity_cell_shape | marginal_adhesion | single_epithelial_cell_size | bare_nuclei | bland_chromatin | normal_cucleoli | mitoses | class | |
---|---|---|---|---|---|---|---|---|---|---|
code | ||||||||||
1057013 | 8 | 4 | 5 | 1 | 2 | <NA> | 7 | 3 | 1 | 4 |
1096800 | 6 | 6 | 6 | 9 | 6 | <NA> | 7 | 8 | 1 | 2 |
1183246 | 1 | 1 | 1 | 1 | 1 | <NA> | 2 | 1 | 1 | 2 |
1184840 | 1 | 1 | 3 | 1 | 2 | <NA> | 2 | 1 | 1 | 2 |
1193683 | 1 | 1 | 2 | 1 | 3 | <NA> | 1 | 1 | 1 | 2 |
1197510 | 5 | 1 | 1 | 1 | 2 | <NA> | 3 | 1 | 1 | 2 |
1241232 | 3 | 1 | 4 | 1 | 2 | <NA> | 3 | 1 | 1 | 2 |
169356 | 3 | 1 | 1 | 1 | 2 | <NA> | 3 | 1 | 1 | 2 |
432809 | 3 | 1 | 3 | 1 | 2 | <NA> | 2 | 1 | 1 | 2 |
563649 | 8 | 8 | 8 | 1 | 2 | <NA> | 6 | 10 | 1 | 4 |
606140 | 1 | 1 | 1 | 1 | 2 | <NA> | 2 | 1 | 1 | 2 |
61634 | 5 | 4 | 3 | 1 | 2 | <NA> | 2 | 3 | 1 | 2 |
704168 | 4 | 6 | 5 | 6 | 7 | <NA> | 4 | 9 | 1 | 2 |
733639 | 3 | 1 | 1 | 1 | 2 | <NA> | 3 | 1 | 1 | 2 |
1238464 | 1 | 1 | 1 | 1 | 1 | <NA> | 2 | 1 | 1 | 2 |
1057067 | 1 | 1 | 1 | 1 | 1 | <NA> | 1 | 1 | 1 | 2 |
Or you can explore for any column if there is any null value
breast_cancer_data.isnull().any()
clump_thickness False
uniformity_cell_size False
uniformity_cell_shape False
marginal_adhesion False
single_epithelial_cell_size False
bare_nuclei True
bland_chromatin False
normal_cucleoli False
mitoses False
class False
dtype: bool
Or maybe for rows using axis=1
.
breast_cancer_data.isnull().any(axis=1)
code
1000025 False
1002945 False
1015425 False
1016277 False
1017023 False
...
776715 False
841769 False
888820 False
897471 False
897471 False
Length: 699, dtype: bool
Ok, now we will fix the bare_nuclei
column. Imagine you want to replace the null values with the mean value.
breast_cancer_data['bare_nuclei'].mean()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[46], line 1
----> 1 breast_cancer_data['bare_nuclei'].mean()
File ~/mambaforge/envs/casbbi-nrt-ds/lib/python3.11/site-packages/pandas/core/generic.py:11847, in NDFrame._add_numeric_operations.<locals>.mean(self, axis, skipna, level, numeric_only, **kwargs)
11829 @doc(
11830 _num_doc,
11831 desc="Return the mean of the values over the requested axis.",
(...)
11845 **kwargs,
11846 ):
> 11847 return NDFrame.mean(self, axis, skipna, level, numeric_only, **kwargs)
File ~/mambaforge/envs/casbbi-nrt-ds/lib/python3.11/site-packages/pandas/core/generic.py:11401, in NDFrame.mean(self, axis, skipna, level, numeric_only, **kwargs)
11393 def mean(
11394 self,
11395 axis: Axis | None | lib.NoDefault = lib.no_default,
(...)
11399 **kwargs,
11400 ) -> Series | float:
> 11401 return self._stat_function(
11402 "mean", nanops.nanmean, axis, skipna, level, numeric_only, **kwargs
11403 )
File ~/mambaforge/envs/casbbi-nrt-ds/lib/python3.11/site-packages/pandas/core/generic.py:11353, in NDFrame._stat_function(self, name, func, axis, skipna, level, numeric_only, **kwargs)
11343 warnings.warn(
11344 "Using the level keyword in DataFrame and Series aggregations is "
11345 "deprecated and will be removed in a future version. Use groupby "
(...)
11348 stacklevel=find_stack_level(),
11349 )
11350 return self._agg_by_level(
11351 name, axis=axis, level=level, skipna=skipna, numeric_only=numeric_only
11352 )
> 11353 return self._reduce(
11354 func, name=name, axis=axis, skipna=skipna, numeric_only=numeric_only
11355 )
File ~/mambaforge/envs/casbbi-nrt-ds/lib/python3.11/site-packages/pandas/core/series.py:4816, in Series._reduce(self, op, name, axis, skipna, numeric_only, filter_type, **kwds)
4812 raise NotImplementedError(
4813 f"Series.{name} does not implement {kwd_name}."
4814 )
4815 with np.errstate(all="ignore"):
-> 4816 return op(delegate, skipna=skipna, **kwds)
File ~/mambaforge/envs/casbbi-nrt-ds/lib/python3.11/site-packages/pandas/core/nanops.py:93, in disallow.__call__.<locals>._f(*args, **kwargs)
91 try:
92 with np.errstate(invalid="ignore"):
---> 93 return f(*args, **kwargs)
94 except ValueError as e:
95 # we want to transform an object array
96 # ValueError message to the more typical TypeError
97 # e.g. this is normally a disallowed function on
98 # object arrays that contain strings
99 if is_object_dtype(args[0]):
File ~/mambaforge/envs/casbbi-nrt-ds/lib/python3.11/site-packages/pandas/core/nanops.py:155, in bottleneck_switch.__call__.<locals>.f(values, axis, skipna, **kwds)
153 result = alt(values, axis=axis, skipna=skipna, **kwds)
154 else:
--> 155 result = alt(values, axis=axis, skipna=skipna, **kwds)
157 return result
File ~/mambaforge/envs/casbbi-nrt-ds/lib/python3.11/site-packages/pandas/core/nanops.py:418, in _datetimelike_compat.<locals>.new_func(values, axis, skipna, mask, **kwargs)
415 if datetimelike and mask is None:
416 mask = isna(values)
--> 418 result = func(values, axis=axis, skipna=skipna, mask=mask, **kwargs)
420 if datetimelike:
421 result = _wrap_results(result, orig_values.dtype, fill_value=iNaT)
File ~/mambaforge/envs/casbbi-nrt-ds/lib/python3.11/site-packages/pandas/core/nanops.py:706, in nanmean(values, axis, skipna, mask)
703 dtype_count = dtype
705 count = _get_counts(values.shape, mask, axis, dtype=dtype_count)
--> 706 the_sum = _ensure_numeric(values.sum(axis, dtype=dtype_sum))
708 if axis is not None and getattr(the_sum, "ndim", False):
709 count = cast(np.ndarray, count)
File ~/mambaforge/envs/casbbi-nrt-ds/lib/python3.11/site-packages/numpy/core/_methods.py:49, in _sum(a, axis, dtype, out, keepdims, initial, where)
47 def _sum(a, axis=None, dtype=None, out=None, keepdims=False,
48 initial=_NoValue, where=True):
---> 49 return umr_sum(a, axis, dtype, out, keepdims, initial, where)
TypeError: can only concatenate str (not "int") to str
Oh no! We need to convert that column to a numeric column
pd.to_numeric(breast_cancer_data['bare_nuclei'])
code
1000025 1.0
1002945 10.0
1015425 2.0
1016277 4.0
1017023 1.0
...
776715 2.0
841769 1.0
888820 3.0
897471 4.0
897471 5.0
Name: bare_nuclei, Length: 699, dtype: float64
breast_cancer_data['bare_nuclei'] = pd.to_numeric(breast_cancer_data['bare_nuclei'])
Other option could have been
breast_cancer_data = breast_cancer_data.assign(
bare_nuclei=lambda x: pd.to_numeric(x['bare_nuclei'])
)
I like this last one better, but don’t worry!
breast_cancer_data['bare_nuclei'].mean()
3.5446559297218156
Now, every value is a integer, so we should convert this value to a integer, you should ask to the experts what makes more sense. Let’s say it is better to approximate this value to the a bigger integer.
There is a scientific computing package called numpy
that we don’t have time to cover but you should check it out.
import numpy as np
bare_nuclei_mean = np.ceil(breast_cancer_data['bare_nuclei'].mean())
bare_nuclei_mean
4.0
Now, as an example, let’s think we want to fill those null values with the mean value of the column.
If you are wondering if there is any method for this the answer is yes!
pd.DataFrame.fillna?
breast_cancer_data.fillna(value={'bare_nuclei': bare_nuclei_mean})
clump_thickness | uniformity_cell_size | uniformity_cell_shape | marginal_adhesion | single_epithelial_cell_size | bare_nuclei | bland_chromatin | normal_cucleoli | mitoses | class | |
---|---|---|---|---|---|---|---|---|---|---|
code | ||||||||||
1000025 | 5 | 1 | 1 | 1 | 2 | 1.0 | 3 | 1 | 1 | 2 |
1002945 | 5 | 4 | 4 | 5 | 7 | 10.0 | 3 | 2 | 1 | 2 |
1015425 | 3 | 1 | 1 | 1 | 2 | 2.0 | 3 | 1 | 1 | 2 |
1016277 | 6 | 8 | 8 | 1 | 3 | 4.0 | 3 | 7 | 1 | 2 |
1017023 | 4 | 1 | 1 | 3 | 2 | 1.0 | 3 | 1 | 1 | 2 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
776715 | 3 | 1 | 1 | 1 | 3 | 2.0 | 1 | 1 | 1 | 2 |
841769 | 2 | 1 | 1 | 1 | 2 | 1.0 | 1 | 1 | 1 | 2 |
888820 | 5 | 10 | 10 | 3 | 7 | 3.0 | 8 | 10 | 2 | 4 |
897471 | 4 | 8 | 6 | 4 | 3 | 4.0 | 10 | 6 | 1 | 4 |
897471 | 4 | 8 | 8 | 5 | 4 | 5.0 | 10 | 4 | 1 | 4 |
699 rows × 10 columns
breast_cancer_data.isnull().any()
clump_thickness False
uniformity_cell_size False
uniformity_cell_shape False
marginal_adhesion False
single_epithelial_cell_size False
bare_nuclei True
bland_chromatin False
normal_cucleoli False
mitoses False
class False
dtype: bool
What? There still null values. That is because most of pandas
functions return a copy of the DataFrame. You have to options
To assign the result to the same variable.
If the method allows it, you can use
inplace=True
.
breast_cancer_data.fillna(value={'bare_nuclei': bare_nuclei_mean}, inplace=True)
breast_cancer_data.isnull().any()
clump_thickness False
uniformity_cell_size False
uniformity_cell_shape False
marginal_adhesion False
single_epithelial_cell_size False
bare_nuclei False
bland_chromatin False
normal_cucleoli False
mitoses False
class False
dtype: bool
Summary and next steps#
In this session we explore a data set, reading it, understanding its elements and methods. Also we clean the dataset with null values.
We didn’t have enough time but you should learn about merging datasets, aggreagation, etc.
Just a few examples:
cancer_names = pd.DataFrame(
[[2, "benign"], [4, "malignant"], [0, "unknown"]],
columns=["class", "cancer"]
)
cancer_names
class | cancer | |
---|---|---|
0 | 2 | benign |
1 | 4 | malignant |
2 | 0 | unknown |
breast_cancer_data2 = breast_cancer_data.merge(
cancer_names,
how="left",
on="class"
)
breast_cancer_data2
clump_thickness | uniformity_cell_size | uniformity_cell_shape | marginal_adhesion | single_epithelial_cell_size | bare_nuclei | bland_chromatin | normal_cucleoli | mitoses | class | cancer | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5 | 1 | 1 | 1 | 2 | 1.0 | 3 | 1 | 1 | 2 | benign |
1 | 5 | 4 | 4 | 5 | 7 | 10.0 | 3 | 2 | 1 | 2 | benign |
2 | 3 | 1 | 1 | 1 | 2 | 2.0 | 3 | 1 | 1 | 2 | benign |
3 | 6 | 8 | 8 | 1 | 3 | 4.0 | 3 | 7 | 1 | 2 | benign |
4 | 4 | 1 | 1 | 3 | 2 | 1.0 | 3 | 1 | 1 | 2 | benign |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
694 | 3 | 1 | 1 | 1 | 3 | 2.0 | 1 | 1 | 1 | 2 | benign |
695 | 2 | 1 | 1 | 1 | 2 | 1.0 | 1 | 1 | 1 | 2 | benign |
696 | 5 | 10 | 10 | 3 | 7 | 3.0 | 8 | 10 | 2 | 4 | malignant |
697 | 4 | 8 | 6 | 4 | 3 | 4.0 | 10 | 6 | 1 | 4 | malignant |
698 | 4 | 8 | 8 | 5 | 4 | 5.0 | 10 | 4 | 1 | 4 | malignant |
699 rows × 11 columns
breast_cancer_data2["cancer"].unique()
array(['benign', 'malignant'], dtype=object)
breast_cancer_data2.groupby("cancer").mean().T
cancer | benign | malignant |
---|---|---|
clump_thickness | 2.956332 | 7.195021 |
uniformity_cell_size | 1.325328 | 6.572614 |
uniformity_cell_shape | 1.443231 | 6.560166 |
marginal_adhesion | 1.364629 | 5.547718 |
single_epithelial_cell_size | 2.120087 | 5.298755 |
bare_nuclei | 1.427948 | 7.597510 |
bland_chromatin | 2.100437 | 5.979253 |
normal_cucleoli | 1.290393 | 5.863071 |
mitoses | 1.063319 | 2.589212 |
class | 2.000000 | 4.000000 |
A very good place to learn is in the official user guide.