# Data Analysis

The main goal of this class is to learn how to gather, explore, clean and analyze different types of datasets.

We will introduce some common data analysis tasks using the `pandas` package.

> `pandas` is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.
>
> --<cite> https://pandas.pydata.org </cite>--

In [2]:
import pandas as pd

# from pathlib import Path  # Uncomment this line if you are working in a local environment

## Dataset: Breast Cancer Wisconsin

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.

This breast cancer database was obtained from Dr. William H. Wolberg at the University of Wisconsin Hospitals, Madison.

Source: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29


| Attribute                   | Domain  |
|-----------------------------|---------|
| Sample code number          | id number |
| Clump Thickness             | 1 - 10 |
| Uniformity of Cell Size     | 1 - 10 |
| Uniformity of Cell Shape    | 1 - 10 |
| Marginal Adhesion           | 1 - 10 |
| Single Epithelial Cell Size | 1 - 10 |
| Bare Nuclei                 | 1 - 10 |
| Bland Chromatin             | 1 - 10 |
| Normal Nucleoli             | 1 - 10 |
| Mitoses                     |  1 - 10 |
| Class                       |  (2 for benign, 4 for malignant) |

In [3]:
data_filepath = "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data"
# data_filepath = Path().resolve().parent / "data" / "breast-cancer-wisconsin.data"
data_filepath

'https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data'

More details are available in the following file, which you can download: https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.names

The easiest way to open a plain-text file like this one is with `pd.read_csv`.

In [4]:
breast_cancer_data = pd.read_csv(
    data_filepath,
    names=[
        "code",
        "clump_thickness",
        "uniformity_cell_size",
        "uniformity_cell_shape",
        "marginal_adhesion",
        "single_epithelial_cell_size",
        "bare_nuclei",
        "bland_chromatin",
        "normal_cucleoli",
        "mitoses",
        "class",
    ],
    index_col=0
)
breast_cancer_data.head()

code,clump_thickness,uniformity_cell_size,uniformity_cell_shape,marginal_adhesion,single_epithelial_cell_size,bare_nuclei,bland_chromatin,normal_cucleoli,mitoses,class
1000025,5,1,1,1,2,1,3,1,1,2
1002945,5,4,4,5,7,10,3,2,1,2
1015425,3,1,1,1,2,2,3,1,1,2
1016277,6,8,8,1,3,4,3,7,1,2
1017023,4,1,1,3,2,1,3,1,1,2


Let's explore this data a little before we start working with it.

In [5]:
breast_cancer_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 699 entries, 1000025 to 897471
Data columns (total 10 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   clump_thickness              699 non-null    int64 
 1   uniformity_cell_size         699 non-null    int64 
 2   uniformity_cell_shape        699 non-null    int64 
 3   marginal_adhesion            699 non-null    int64 
 4   single_epithelial_cell_size  699 non-null    int64 
 5   bare_nuclei                  699 non-null    object
 6   bland_chromatin              699 non-null    int64 
 7   normal_cucleoli              699 non-null    int64 
 8   mitoses                      699 non-null    int64 
 9   class                        699 non-null    int64 
dtypes: int64(9), object(1)
memory usage: 60.1+ KB


In [6]:
breast_cancer_data.dtypes

clump_thickness                 int64
uniformity_cell_size            int64
uniformity_cell_shape           int64
marginal_adhesion               int64
single_epithelial_cell_size     int64
bare_nuclei                    object
bland_chromatin                 int64
normal_cucleoli                 int64
mitoses                         int64
class                           int64
dtype: object

In [7]:
breast_cancer_data.describe()

,clump_thickness,uniformity_cell_size,uniformity_cell_shape,marginal_adhesion,single_epithelial_cell_size,bland_chromatin,normal_cucleoli,mitoses,class
count,699.0,699.0,699.0,699.0,699.0,699.0,699.0,699.0,699.0
mean,4.41774,3.134478,3.207439,2.806867,3.216023,3.437768,2.866953,1.589413,2.689557
std,2.815741,3.051459,2.971913,2.855379,2.2143,2.438364,3.053634,1.715078,0.951273
min,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0
25%,2.0,1.0,1.0,1.0,2.0,2.0,1.0,1.0,2.0
50%,4.0,1.0,1.0,1.0,2.0,3.0,1.0,1.0,2.0
75%,6.0,5.0,5.0,4.0,4.0,5.0,4.0,1.0,4.0
max,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,4.0


In [8]:
breast_cancer_data.describe(include="all")

,clump_thickness,uniformity_cell_size,uniformity_cell_shape,marginal_adhesion,single_epithelial_cell_size,bare_nuclei,bland_chromatin,normal_cucleoli,mitoses,class
count,699.0,699.0,699.0,699.0,699.0,699.0,699.0,699.0,699.0,699.0
unique,,,,,,11.0,,,,
top,,,,,,1.0,,,,
freq,,,,,,402.0,,,,
mean,4.41774,3.134478,3.207439,2.806867,3.216023,,3.437768,2.866953,1.589413,2.689557
std,2.815741,3.051459,2.971913,2.855379,2.2143,,2.438364,3.053634,1.715078,0.951273
min,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,2.0
25%,2.0,1.0,1.0,1.0,2.0,,2.0,1.0,1.0,2.0
50%,4.0,1.0,1.0,1.0,2.0,,3.0,1.0,1.0,2.0
75%,6.0,5.0,5.0,4.0,4.0,,5.0,4.0,1.0,4.0


## Series

Series are one-dimensional labeled arrays. You can think of them as similar to the columns of an Excel spreadsheet.

There are multiple ways to create a `pd.Series`: from lists, dictionaries, a `np.array` or a file.
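As a quick illustration of these options, here is a minimal sketch. The labels `patient_a`, `patient_b` and `patient_c` and the values are made up for this example; they are not part of the dataset.

```python
import numpy as np
import pandas as pd

# From a list: pandas assigns a default integer index (0, 1, 2, ...)
from_list = pd.Series([5, 3, 6])

# From a dictionary: the keys become the index labels
from_dict = pd.Series({"patient_a": 5, "patient_b": 3, "patient_c": 6})

# From a NumPy array, with explicit (hypothetical) labels and a name
from_array = pd.Series(
    np.array([5, 3, 6]),
    index=["patient_a", "patient_b", "patient_c"],
    name="clump_thickness",
)

print(from_dict.loc["patient_b"])
```

In all three cases you end up with the same kind of object; only the way the index is built differs.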

Since we already loaded the breast cancer data, we will use it as an example. Each column of this file has been converted to a `pd.Series`.

In [9]:
clump_thick_series = breast_cancer_data["clump_thickness"].copy()
clump_thick_series.head()

code
1000025    5
1002945    5
1015425    3
1016277    6
1017023    4
Name: clump_thickness, dtype: int64

In [10]:
type(clump_thick_series)

pandas.core.series.Series

A `pd.Series` is made up of an _index_ and _values_.

In [11]:
clump_thick_series.index

Int64Index([1000025, 1002945, 1015425, 1016277, 1017023, 1017122, 1018099,
            1018561, 1033078, 1033078,
            ...
             654546,  654546,  695091,  714039,  763235,  776715,  841769,
             888820,  897471,  897471],
           dtype='int64', name='code', length=699)

In [12]:
clump_thick_series.values

array([ 5,  5,  3,  6,  4,  8,  1,  2,  2,  4,  1,  2,  5,  1,  8,  7,  4,
        4, 10,  6,  7, 10,  3,  8,  1,  5,  3,  5,  2,  1,  3,  2, 10,  2,
        3,  2, 10,  6,  5,  2,  6, 10,  6,  5, 10,  1,  3,  1,  4,  7,  9,
        5, 10,  5, 10, 10,  8,  8,  5,  9,  5,  1,  9,  6,  1, 10,  4,  5,
        8,  1,  5,  6,  1,  9, 10,  1,  1,  5,  3,  2,  2,  4,  5,  3,  3,
        5,  3,  3,  4,  2,  1,  3,  4,  1,  2,  1,  2,  5,  9,  7, 10,  2,
        4,  8, 10,  7, 10,  1,  1,  6,  1,  8, 10, 10,  3,  1,  8,  4,  1,
        3,  1,  4, 10,  5,  5,  1,  7,  3,  8,  1,  5,  2,  5,  3,  3,  5,
        4,  3,  4,  1,  3,  2,  9,  1,  2,  1,  3,  1,  3,  8,  1,  7, 10,
        4,  1,  5,  1,  2,  1,  9, 10,  4,  3,  1,  5,  4,  5, 10,  3,  1,
        3,  1,  1,  6,  8,  5,  2,  5,  4,  5,  1,  1,  6,  5,  8,  2,  1,
       10,  5,  1, 10,  7,  5,  1,  3,  4,  8,  5,  1,  3,  9, 10,  1,  5,
        1,  5, 10,  1,  1,  5,  8,  8,  1, 10, 10,  8,  1,  1,  6,  6,  1,
       10,  4,  7, 10,  1

Now, imagine you want to access the value for the third patient.

In [13]:
clump_thick_series.iloc[2]  # Remember Python is a 0-indexed programming language.

3

However, what if you want to know the clump thickness of a specific patient? Since we have their codes, we can access the value with another method.

For example, for the patient with code `1166654`:

In [14]:
clump_thick_series.loc[1166654]

10

Don't forget

* `loc` refers to indexes (__labels__).
* `iloc` refers to order.

We will focus on `loc` rather than `iloc`, since much of the power of `pandas` comes from its indexes, whose labels can be numeric or categorical. If you only need order-based analysis, `pandas` could be overkill and `numpy` might be enough.
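To make the difference concrete, here is a small sketch with a made-up three-patient series (the codes look like the ones in the dataset, but the series itself is only an illustration). Sorting changes the order of the rows, so the result of `iloc` changes while the result of `loc` does not.

```python
import pandas as pd

# Hypothetical toy series indexed by patient code
toy = pd.Series(
    [5, 3, 6],
    index=[1000025, 1015425, 1016277],
    name="clump_thickness",
)

by_position = toy.iloc[1]    # second element, whatever its label is
by_label = toy.loc[1015425]  # element whose label is 1015425

# After sorting by value the order is different...
sorted_toy = toy.sort_values()
position_after_sort = sorted_toy.iloc[1]    # now a different patient
label_after_sort = sorted_toy.loc[1015425]  # still the same patient

print(by_position, by_label, position_after_sort, label_after_sort)
```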

What if you want to get the values of several patients? For example, patients `1166654` and `1178580`:

In [15]:
clump_thick_series.loc[[1166654, 1178580]]

code
1166654    10
1178580     5
Name: clump_thickness, dtype: int64

```{important}
Notice that if the argument is a single label, `loc` returns only the value (as long as that label appears only once in the index). On the other hand, if the argument is a list, `loc` returns a `pd.Series` object.
```

In [16]:
type(clump_thick_series.loc[1166654])

numpy.int64

In [17]:
type(clump_thick_series.loc[[1166654, 1178580]])

pandas.core.series.Series
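One caveat worth knowing: the index printed earlier contains repeated codes (for example `1033078` appears twice). When a label is repeated, `loc` with a single label returns a `pd.Series` with every match rather than a single value. The sketch below uses a made-up toy series to show this; the values are hypothetical.

```python
import pandas as pd

# Hypothetical toy series where one code appears twice
toy = pd.Series(
    [2, 4, 7],
    index=[1033078, 1033078, 1035283],
    name="clump_thickness",
)

unique_label = toy.loc[1035283]    # a single value
repeated_label = toy.loc[1033078]  # a pd.Series with two entries

print(type(unique_label), type(repeated_label))
print(toy.index.is_unique)
```

Checking `index.is_unique` before relying on scalar results is a cheap safeguard.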

You can even edit or add values with these methods.

For instance, what if the dataset is wrong about patient `1166654` and clump thickness should have been `6` instead of `10`? 

We can fix that easily.

In [18]:
clump_thick_series.loc[1166654] = 6

```{warning}
You would have gotten a `SettingWithCopyWarning` message after running the last code cell if we had not used the `copy()` method.

I suggest reading [this link](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy) if you get that warning. In simple terms, selecting part of a DataFrame can return a __view__ of the original data instead of a copy, which means that changing it may also change the main object itself. This is a feature, not an error, but we have to be careful with it in the future.
```


OK, let's check the change we made.

In [19]:
clump_thick_series.loc[1166654]

6

However, notice that since we copied the column, this didn't change the value in the original dataset.

In [20]:
breast_cancer_data.loc[1166654]

clump_thickness                10
uniformity_cell_size            3
uniformity_cell_shape           5
marginal_adhesion               1
single_epithelial_cell_size    10
bare_nuclei                     5
bland_chromatin                 3
normal_cucleoli                10
mitoses                         2
class                           4
Name: 1166654, dtype: object

```{attention}
You can try creating `clump_thick_series` without the `.copy()` method and explore what happens when you change values.

I suggest you __use copies if you are not sure__.
```
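Here is a minimal sketch of the safe pattern, using a made-up two-row DataFrame (the values are hypothetical). Because the column is copied explicitly, editing the copy leaves the original untouched.

```python
import pandas as pd

# Hypothetical toy DataFrame indexed by patient code
toy = pd.DataFrame(
    {"clump_thickness": [5, 10]},
    index=[1000025, 1166654],
)

# An explicit copy is independent of the original DataFrame
independent = toy["clump_thickness"].copy()
independent.loc[1166654] = 6

print(independent.loc[1166654])             # value in the copy
print(toy.loc[1166654, "clump_thickness"])  # value in the original
```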

Another common operation is filtering by a condition with a boolean mask.

For example, let's get all the patients with a clump thickness greater than 7.

In [21]:
clump_thick_series > 7

code
1000025    False
1002945    False
1015425    False
1016277    False
1017023    False
           ...  
776715     False
841769     False
888820     False
897471     False
897471     False
Name: clump_thickness, Length: 699, dtype: bool

You can do logical comparisons with a `pd.Series`, but this only returns another `pd.Series` of boolean values (True/False). We want to keep only the entries where the value is true.

In [22]:
clump_thick_series.loc[clump_thick_series > 7]

code
1017122     8
1044572     8
1050670    10
1054593    10
1057013     8
           ..
736150     10
822829      8
1253955     8
1268952    10
1369821    10
Name: clump_thickness, Length: 128, dtype: int64

You can avoid using `loc` for this task, but to be honest I prefer to use it.

In [23]:
clump_thick_series[clump_thick_series > 7]

code
1017122     8
1044572     8
1050670    10
1054593    10
1057013     8
           ..
736150     10
822829      8
1253955     8
1268952    10
1369821    10
Name: clump_thickness, Length: 128, dtype: int64

However, my favorite version uses a functional approach with a `lambda` function. It is less intuitive at the beginning, but it allows you to chain operations.

In [24]:
clump_thick_series.loc[lambda x: x > 7]

code
1017122     8
1044572     8
1050670    10
1054593    10
1057013     8
           ..
736150     10
822829      8
1253955     8
1268952    10
1369821    10
Name: clump_thickness, Length: 128, dtype: int64
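To show what chaining looks like, here is a small sketch on a made-up series (labels and values are hypothetical). Each `lambda` receives the result of the previous step, so no intermediate variables are needed.

```python
import pandas as pd

# Hypothetical toy series
toy = pd.Series([5, 8, 10, 3, 9], index=list("abcde"), name="clump_thickness")

result = (
    toy
    .loc[lambda x: x > 7]   # keep values greater than 7
    .loc[lambda x: x < 10]  # then drop the maximum score
    .sort_values(ascending=False)
)

print(result)
```

With the non-lambda version you would have to store the intermediate series in a variable to build the second mask.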

## DataFrames

A `pd.DataFrame` is a 2-dimensional array with row and column labels (_index_ and _columns_). It is the natural extension of `pd.Series`, and you can even think of it as multiple `pd.Series` concatenated side by side.
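As a rough illustration of that idea, the sketch below builds a small DataFrame by concatenating two made-up series that share the same index. The values are hypothetical and only mimic the first rows of the dataset.

```python
import pandas as pd

index = [1000025, 1002945, 1015425]
clump = pd.Series([5, 5, 3], index=index, name="clump_thickness")
klass = pd.Series([2, 2, 2], index=index, name="class")

# axis=1 places the series side by side; their names become the column names
toy_frame = pd.concat([clump, klass], axis=1)

print(toy_frame)
print(type(toy_frame["clump_thickness"]))
```

Selecting a single column gives back a `pd.Series`.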

In [25]:
type(breast_cancer_data)

pandas.core.frame.DataFrame

There are a few useful methods for exploring the data. Let's look at some of them.

In [26]:
breast_cancer_data.shape

(699, 10)

In [27]:
breast_cancer_data.head()

code,clump_thickness,uniformity_cell_size,uniformity_cell_shape,marginal_adhesion,single_epithelial_cell_size,bare_nuclei,bland_chromatin,normal_cucleoli,mitoses,class
1000025,5,1,1,1,2,1,3,1,1,2
1002945,5,4,4,5,7,10,3,2,1,2
1015425,3,1,1,1,2,2,3,1,1,2
1016277,6,8,8,1,3,4,3,7,1,2
1017023,4,1,1,3,2,1,3,1,1,2


In [28]:
breast_cancer_data.tail()

code,clump_thickness,uniformity_cell_size,uniformity_cell_shape,marginal_adhesion,single_epithelial_cell_size,bare_nuclei,bland_chromatin,normal_cucleoli,mitoses,class
776715,3,1,1,1,3,2,1,1,1,2
841769,2,1,1,1,2,1,1,1,1,2
888820,5,10,10,3,7,3,8,10,2,4
897471,4,8,6,4,3,4,10,6,1,4
897471,4,8,8,5,4,5,10,4,1,4


In [29]:
breast_cancer_data.max()

clump_thickness                10
uniformity_cell_size           10
uniformity_cell_shape          10
marginal_adhesion              10
single_epithelial_cell_size    10
bare_nuclei                     ?
bland_chromatin                10
normal_cucleoli                10
mitoses                        10
class                           4
dtype: object

In [30]:
breast_cancer_data.mean(numeric_only=True)  # skip the non-numeric bare_nuclei column


clump_thickness                4.417740
uniformity_cell_size           3.134478
uniformity_cell_shape          3.207439
marginal_adhesion              2.806867
single_epithelial_cell_size    3.216023
bland_chromatin                3.437768
normal_cucleoli                2.866953
mitoses                        1.589413
class                          2.689557
dtype: float64

We can inspect values using `loc` as well.

In [31]:
breast_cancer_data.loc[1166654]

clump_thickness                10
uniformity_cell_size            3
uniformity_cell_shape           5
marginal_adhesion               1
single_epithelial_cell_size    10
bare_nuclei                     5
bland_chromatin                 3
normal_cucleoli                10
mitoses                         2
class                           4
Name: 1166654, dtype: object

However, you shouldn't chain two `loc` calls. The technical reason is explained [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#why-does-assignment-fail-when-using-chained-indexing).

In [32]:
breast_cancer_data.loc[1166654].loc["clump_thickness"]

10

Instead, pass both the row label and the column label to a single `loc`.

In [33]:
breast_cancer_data.loc[1166654, "clump_thickness"]

10

In [34]:
breast_cancer_data.loc[1166654, ["clump_thickness", "class"]]

clump_thickness    10
class               4
Name: 1166654, dtype: object

In [35]:
breast_cancer_data.loc[[1166654, 1178580], ["clump_thickness", "class"]]

code,clump_thickness,class
1166654,10,4
1178580,5,2
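The single-`loc` form matters most when assigning. The following is a minimal sketch on a made-up DataFrame (values are hypothetical): with both labels in one call, the assignment acts on the DataFrame itself. With a chained `loc`, the write may land on a temporary object instead, depending on the `pandas` version and settings.

```python
import pandas as pd

# Hypothetical toy DataFrame indexed by patient code
toy = pd.DataFrame(
    {"clump_thickness": [5, 10], "class": [2, 4]},
    index=[1000025, 1166654],
)

# One loc call with both the row label and the column label
toy.loc[1166654, "clump_thickness"] = 6

print(toy)
```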


Boolean masks also work

In [36]:
breast_cancer_data.loc[lambda x: x["clump_thickness"] > 7]

code,clump_thickness,uniformity_cell_size,uniformity_cell_shape,marginal_adhesion,single_epithelial_cell_size,bare_nuclei,bland_chromatin,normal_cucleoli,mitoses,class
1017122,8,10,10,8,7,10,9,7,1,4
1044572,8,7,5,10,7,9,5,5,4,4
1050670,10,7,7,6,4,10,4,1,2,4
1054593,10,5,5,3,6,7,7,10,1,4
1057013,8,4,5,1,2,?,7,3,1,4
...,...,...,...,...,...,...,...,...,...,...
736150,10,4,3,10,3,10,7,1,2,4
822829,8,10,10,10,6,10,10,10,10,4
1253955,8,7,4,4,5,3,5,10,1,4
1268952,10,10,7,8,7,1,10,10,3,4


Or getting a specific column

In [37]:
breast_cancer_data.loc[:, "bare_nuclei"]

code
1000025     1
1002945    10
1015425     2
1016277     4
1017023     1
           ..
776715      2
841769      1
888820      3
897471      4
897471      5
Name: bare_nuclei, Length: 699, dtype: object

However, you can also access a column directly without `loc`.

In [38]:
breast_cancer_data["bare_nuclei"]

code
1000025     1
1002945    10
1015425     2
1016277     4
1017023     1
           ..
776715      2
841769      1
888820      3
897471      4
897471      5
Name: bare_nuclei, Length: 699, dtype: object

There are some cool methods you can use for exploring your data

In [39]:
breast_cancer_data.loc[:, "bare_nuclei"].value_counts()

1     402
10    132
2      30
5      30
3      28
8      21
4      19
?      16
9       9
7       8
6       4
Name: bare_nuclei, dtype: int64

What about those `?` values?

They represent missing values!

In [40]:
breast_cancer_data.loc[lambda s: s['bare_nuclei'] == '?']

code,clump_thickness,uniformity_cell_size,uniformity_cell_shape,marginal_adhesion,single_epithelial_cell_size,bare_nuclei,bland_chromatin,normal_cucleoli,mitoses,class
1057013,8,4,5,1,2,?,7,3,1,4
1096800,6,6,6,9,6,?,7,8,1,2
1183246,1,1,1,1,1,?,2,1,1,2
1184840,1,1,3,1,2,?,2,1,1,2
1193683,1,1,2,1,3,?,1,1,1,2
1197510,5,1,1,1,2,?,3,1,1,2
1241232,3,1,4,1,2,?,3,1,1,2
169356,3,1,1,1,2,?,3,1,1,2
432809,3,1,3,1,2,?,2,1,1,2
563649,8,8,8,1,2,?,6,10,1,4


`pandas` has a specific object for denoting null values, `pd.NA`.

In [41]:
breast_cancer_data.loc[lambda s: s['bare_nuclei'] == '?', 'bare_nuclei'] = pd.NA

```{tip}
A more elegant way to get the same result is `breast_cancer_data.replace({'bare_nuclei': {"?": pd.NA}})`. Note that `replace` returns a new DataFrame, so you need to assign the result back to a variable to keep it.
```
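Another option, not used in this session, is to tell `pd.read_csv` up front which strings mean "missing" through its `na_values` parameter. The sketch below reads a few made-up lines from an in-memory string instead of the real file; the rows and the reduced column list are hypothetical.

```python
import io

import pandas as pd

# Hypothetical three-line sample in the same comma-separated layout
raw_text = "1000025,5,1\n1057013,8,?\n1096800,6,?\n"

toy = pd.read_csv(
    io.StringIO(raw_text),
    names=["code", "clump_thickness", "bare_nuclei"],
    index_col=0,
    na_values="?",  # treat "?" as a missing value while parsing
)

print(toy)
print(toy.dtypes)
```

A side effect is that the column is parsed as numeric right away, because no stray strings remain in it.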

Let's see the rows with null values


In [42]:
breast_cancer_data.loc[lambda s: s['bare_nuclei'] == pd.NA]

code,clump_thickness,uniformity_cell_size,uniformity_cell_shape,marginal_adhesion,single_epithelial_cell_size,bare_nuclei,bland_chromatin,normal_cucleoli,mitoses,class


Wait a second, why is it not showing the rows with null values? Null values behave in unusual ways: comparing `pd.NA` with anything, even with itself, does not give `True`.

In [43]:
pd.NA == pd.NA

<NA>

```{important}
`pandas` will recognize `None`, `np.nan` and `pd.NA` as null values, so be careful!
```
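The following sketch gathers these behaviors in one place, using only made-up values. The takeaway is that equality comparisons are not a reliable way to find nulls, whereas `pd.isna` (or the `isna`/`isnull` methods) treats all three null markers the same way.

```python
import numpy as np
import pandas as pd

# Equality does not work for detecting nulls
nan_equals_itself = np.nan == np.nan  # False
na_equals_itself = pd.NA == pd.NA     # <NA>, neither True nor False

# pd.isna recognizes every null marker
print(pd.isna(None), pd.isna(np.nan), pd.isna(pd.NA))

# The same applies element-wise on a Series
values = pd.Series([1, None, np.nan, pd.NA], dtype="object")
null_mask = values.isna()
print(null_mask)
```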

There are special methods for working with null values

In [44]:
breast_cancer_data.loc[lambda s: s['bare_nuclei'].isnull()]

code,clump_thickness,uniformity_cell_size,uniformity_cell_shape,marginal_adhesion,single_epithelial_cell_size,bare_nuclei,bland_chromatin,normal_cucleoli,mitoses,class
1057013,8,4,5,1,2,,7,3,1,4
1096800,6,6,6,9,6,,7,8,1,2
1183246,1,1,1,1,1,,2,1,1,2
1184840,1,1,3,1,2,,2,1,1,2
1193683,1,1,2,1,3,,1,1,1,2
1197510,5,1,1,1,2,,3,1,1,2
1241232,3,1,4,1,2,,3,1,1,2
169356,3,1,1,1,2,,3,1,1,2
432809,3,1,3,1,2,,2,1,1,2
563649,8,8,8,1,2,,6,10,1,4


Or you can check, for each column, whether it contains any null values.

In [45]:
breast_cancer_data.isnull().any()

clump_thickness                False
uniformity_cell_size           False
uniformity_cell_shape          False
marginal_adhesion              False
single_epithelial_cell_size    False
bare_nuclei                     True
bland_chromatin                False
normal_cucleoli                False
mitoses                        False
class                          False
dtype: bool

Or maybe for rows using `axis=1`.

In [46]:
breast_cancer_data.isnull().any(axis=1)

code
1000025    False
1002945    False
1015425    False
1016277    False
1017023    False
           ...  
776715     False
841769     False
888820     False
897471     False
897471     False
Length: 699, dtype: bool
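If you want to know how many null values there are, and not only whether there are any, one possibility is to replace `any` with `sum`, since `True` counts as 1. Here is a small sketch on a made-up DataFrame (values are hypothetical).

```python
import numpy as np
import pandas as pd

# Hypothetical toy DataFrame with two missing values
toy = pd.DataFrame(
    {"clump_thickness": [5, 8, 1], "bare_nuclei": [1.0, np.nan, np.nan]},
    index=[1000025, 1057013, 1183246],
)

nulls_per_column = toy.isnull().sum()     # one count per column
nulls_per_row = toy.isnull().sum(axis=1)  # one count per row

print(nulls_per_column)
print(nulls_per_row)
```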

Ok, now we will fix the `bare_nuclei` column. Imagine you want to replace the null values with the mean value.

In [None]:
breast_cancer_data['bare_nuclei'].mean()

Oh no! We need to convert that column to a numeric column

In [52]:
pd.to_numeric(breast_cancer_data['bare_nuclei'])

code
1000025     1.0
1002945    10.0
1015425     2.0
1016277     4.0
1017023     1.0
           ... 
776715      2.0
841769      1.0
888820      3.0
897471      4.0
897471      5.0
Name: bare_nuclei, Length: 699, dtype: float64
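The conversion above works because the `?` strings were already replaced with null values. If they were still in the column, `pd.to_numeric` would raise an error by default. As a sketch of an alternative, its `errors="coerce"` option turns anything that cannot be parsed into `NaN`. The series below is made up for illustration.

```python
import pandas as pd

# Hypothetical raw column that still contains a "?" marker
raw = pd.Series(["1", "10", "?", "4"], name="bare_nuclei")

# Unparseable entries become NaN instead of raising an error
converted = pd.to_numeric(raw, errors="coerce")

print(converted)
print(converted.mean())  # NaN values are skipped by default
```

Be aware that coercion silently hides every unparseable value, so it is worth inspecting the column with `value_counts()` first, as we did.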

In [53]:
breast_cancer_data['bare_nuclei'] = pd.to_numeric(breast_cancer_data['bare_nuclei'])

Another option would have been:

In [55]:
breast_cancer_data = breast_cancer_data.assign(
    bare_nuclei=lambda x: pd.to_numeric(x['bare_nuclei'])
)

I like this last one better, but don't worry!

In [56]:
breast_cancer_data['bare_nuclei'].mean()

3.5446559297218156

Every other value in this column is an integer, so we should convert the mean to an integer as well. You should ask the experts which rounding makes more sense; let's say it is better to round this value up to the next integer.

There is a scientific computing package called `numpy` that we don't have time to cover but you should check it out.

In [57]:
import numpy as np

In [58]:
bare_nuclei_mean = np.ceil(breast_cancer_data['bare_nuclei'].mean())
bare_nuclei_mean

4.0
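Rounding up is only one of the possible choices. As a quick comparison, the sketch below applies three `numpy` rounding functions to the mean reported above (the number is copied by hand from the earlier output).

```python
import numpy as np

mean_value = 3.5446559297218156  # mean of bare_nuclei reported above

rounded_up = np.ceil(mean_value)        # next integer above
rounded_down = np.floor(mean_value)     # next integer below
rounded_nearest = np.round(mean_value)  # closest integer

print(rounded_up, rounded_down, rounded_nearest)
```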

Now, as an example, let's say we want to fill those null values with the mean value of the column.

If you are wondering if there is any method for this the answer is yes!

In [59]:
pd.DataFrame.fillna?

Signature:
pd.DataFrame.fillna(
    self,
    value: 'Hashable | Mapping | Series | DataFrame' = None,
    *,
    method: 'FillnaOptions | None' = None,
    axis: 'Axis | None' = None,
    inplace: 'bool' = False,
    limit: 'int | None' = None,
    downcast: 'dict | None' = None,
) ->

In [60]:
breast_cancer_data.fillna(value={'bare_nuclei': bare_nuclei_mean})

code,clump_thickness,uniformity_cell_size,uniformity_cell_shape,marginal_adhesion,single_epithelial_cell_size,bare_nuclei,bland_chromatin,normal_cucleoli,mitoses,class
1000025,5,1,1,1,2,1.0,3,1,1,2
1002945,5,4,4,5,7,10.0,3,2,1,2
1015425,3,1,1,1,2,2.0,3,1,1,2
1016277,6,8,8,1,3,4.0,3,7,1,2
1017023,4,1,1,3,2,1.0,3,1,1,2
...,...,...,...,...,...,...,...,...,...,...
776715,3,1,1,1,3,2.0,1,1,1,2
841769,2,1,1,1,2,1.0,1,1,1,2
888820,5,10,10,3,7,3.0,8,10,2,4
897471,4,8,6,4,3,4.0,10,6,1,4


In [61]:
breast_cancer_data.isnull().any()

clump_thickness                False
uniformity_cell_size           False
uniformity_cell_shape          False
marginal_adhesion              False
single_epithelial_cell_size    False
bare_nuclei                     True
bland_chromatin                False
normal_cucleoli                False
mitoses                        False
class                          False
dtype: bool

What? There are still null values. That is because most `pandas` methods return a new DataFrame instead of modifying the original one. You have two options:

* To assign the result to the same variable.
* If the method allows it, you can use `inplace=True`.
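The first option can be sketched as follows, on a made-up one-column DataFrame (the values and the fill value `4.0` are only illustrative). The call on its own leaves `toy` unchanged; the result is kept only once it is assigned back.

```python
import numpy as np
import pandas as pd

# Hypothetical toy DataFrame with one missing value
toy = pd.DataFrame({"bare_nuclei": [1.0, np.nan, 10.0]})

# fillna returns a new DataFrame; `toy` itself is not modified
filled = toy.fillna(value={"bare_nuclei": 4.0})
still_has_nulls = toy["bare_nuclei"].isnull().any()

# Assigning the result to the same variable keeps the change
toy = toy.fillna(value={"bare_nuclei": 4.0})

print(still_has_nulls, toy["bare_nuclei"].isnull().any())
```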

In [62]:
breast_cancer_data.fillna(value={'bare_nuclei': bare_nuclei_mean}, inplace=True)

In [63]:
breast_cancer_data.isnull().any()

clump_thickness                False
uniformity_cell_size           False
uniformity_cell_shape          False
marginal_adhesion              False
single_epithelial_cell_size    False
bare_nuclei                    False
bland_chromatin                False
normal_cucleoli                False
mitoses                        False
class                          False
dtype: bool

## Summary and next steps

In this session we explored a dataset: reading it and understanding its elements and methods. We also cleaned the null values in the dataset.

We didn't have enough time, but you should also learn about merging datasets, aggregation, etc.

Just a few examples:

In [66]:
cancer_names = pd.DataFrame(
    [[2, "benign"], [4, "malignant"], [0, "unknown"]],
    columns=["class", "cancer"]
)
cancer_names

,class,cancer
0,2,benign
1,4,malignant
2,0,unknown


In [68]:
breast_cancer_data2 = breast_cancer_data.merge(
    cancer_names,
    how="left",
    on="class"
)
breast_cancer_data2

,clump_thickness,uniformity_cell_size,uniformity_cell_shape,marginal_adhesion,single_epithelial_cell_size,bare_nuclei,bland_chromatin,normal_cucleoli,mitoses,class,cancer
0,5,1,1,1,2,1.0,3,1,1,2,benign
1,5,4,4,5,7,10.0,3,2,1,2,benign
2,3,1,1,1,2,2.0,3,1,1,2,benign
3,6,8,8,1,3,4.0,3,7,1,2,benign
4,4,1,1,3,2,1.0,3,1,1,2,benign
...,...,...,...,...,...,...,...,...,...,...,...
694,3,1,1,1,3,2.0,1,1,1,2,benign
695,2,1,1,1,2,1.0,1,1,1,2,benign
696,5,10,10,3,7,3.0,8,10,2,4,malignant
697,4,8,6,4,3,4.0,10,6,1,4,malignant


In [69]:
breast_cancer_data2["cancer"].unique()

array(['benign', 'malignant'], dtype=object)
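Notice that the merged result above is indexed `0, 1, 2, ...` and the patient codes are gone: merging on a column does not keep the original index. If you want to preserve it, one possible pattern is to move the index into a column before merging and restore it afterwards. The sketch below uses made-up toy data with the same column names.

```python
import pandas as pd

# Hypothetical toy data indexed by patient code
toy = pd.DataFrame(
    {"class": [2, 4, 2]},
    index=pd.Index([1000025, 1017122, 1015425], name="code"),
)
names = pd.DataFrame(
    {"class": [2, 4, 0], "cancer": ["benign", "malignant", "unknown"]}
)

merged = (
    toy
    .reset_index()                         # turn `code` into a regular column
    .merge(names, how="left", on="class")  # left join keeps every patient
    .set_index("code")                     # restore `code` as the index
)

print(merged)
```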

In [71]:
breast_cancer_data2.groupby("cancer").mean().T

cancer,benign,malignant
clump_thickness,2.956332,7.195021
uniformity_cell_size,1.325328,6.572614
uniformity_cell_shape,1.443231,6.560166
marginal_adhesion,1.364629,5.547718
single_epithelial_cell_size,2.120087,5.298755
bare_nuclei,1.427948,7.59751
bland_chromatin,2.100437,5.979253
normal_cucleoli,1.290393,5.863071
mitoses,1.063319,2.589212
class,2.0,4.0
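Aggregation is not limited to the mean. As a small sketch, `agg` accepts a list of function names and computes all of them per group. The DataFrame below is made up for illustration.

```python
import pandas as pd

# Hypothetical toy data with the same column names
toy = pd.DataFrame(
    {
        "cancer": ["benign", "benign", "malignant", "malignant"],
        "clump_thickness": [3, 5, 8, 10],
    }
)

summary = (
    toy
    .groupby("cancer")["clump_thickness"]
    .agg(["mean", "max", "count"])  # one output column per function
)

print(summary)
```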


A very good place to learn more is the official [user guide](https://pandas.pydata.org/pandas-docs/stable/user_guide/index.html).