Data Analysis#

The main goal of this class is to learn how to gather, explore, clean and analyze different types of datasets.

We will introduce some common data analysis tasks using the pandas package.

pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

https://pandas.pydata.org

import pandas as pd

# from pathlib import Path  # Uncomment this line if you are working in a local environment

Dataset: Breast Cancer Wisconsin#

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.

This breast cancer database was obtained from the University of Wisconsin Hospitals, Madison, from Dr. William H. Wolberg.

Source: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)

Attribute                     Domain
---------------------------   -----------------------------
Sample code number            id number
Clump Thickness               1 - 10
Uniformity of Cell Size       1 - 10
Uniformity of Cell Shape      1 - 10
Marginal Adhesion             1 - 10
Single Epithelial Cell Size   1 - 10
Bare Nuclei                   1 - 10
Bland Chromatin               1 - 10
Normal Nucleoli               1 - 10
Mitoses                       1 - 10
Class                         2 for benign, 4 for malignant

data_filepath = "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data"
# data_filepath = Path().resolve().parent / "data" / "breast-cancer-wisconsin.data"
data_filepath
'https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data'

More details are available in the following file, which you can download: https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.names

The easiest way to open a plain text file like this one is using pd.read_csv.

breast_cancer_data = pd.read_csv(
    data_filepath,
    names=[
        "code",
        "clump_thickness",
        "uniformity_cell_size",
        "uniformity_cell_shape",
        "marginal_adhesion",
        "single_epithelial_cell_size",
        "bare_nuclei",
        "bland_chromatin",
        "normal_cucleoli",
        "mitoses",
        "class",
    ],
    index_col=0
)
breast_cancer_data.head()
clump_thickness uniformity_cell_size uniformity_cell_shape marginal_adhesion single_epithelial_cell_size bare_nuclei bland_chromatin normal_cucleoli mitoses class
code
1000025 5 1 1 1 2 1 3 1 1 2
1002945 5 4 4 5 7 10 3 2 1 2
1015425 3 1 1 1 2 2 3 1 1 2
1016277 6 8 8 1 3 4 3 7 1 2
1017023 4 1 1 3 2 1 3 1 1 2

Let’s explore this data a little bit before we start working with it.

breast_cancer_data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 699 entries, 1000025 to 897471
Data columns (total 10 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   clump_thickness              699 non-null    int64 
 1   uniformity_cell_size         699 non-null    int64 
 2   uniformity_cell_shape        699 non-null    int64 
 3   marginal_adhesion            699 non-null    int64 
 4   single_epithelial_cell_size  699 non-null    int64 
 5   bare_nuclei                  699 non-null    object
 6   bland_chromatin              699 non-null    int64 
 7   normal_cucleoli              699 non-null    int64 
 8   mitoses                      699 non-null    int64 
 9   class                        699 non-null    int64 
dtypes: int64(9), object(1)
memory usage: 60.1+ KB
breast_cancer_data.dtypes
clump_thickness                 int64
uniformity_cell_size            int64
uniformity_cell_shape           int64
marginal_adhesion               int64
single_epithelial_cell_size     int64
bare_nuclei                    object
bland_chromatin                 int64
normal_cucleoli                 int64
mitoses                         int64
class                           int64
dtype: object
breast_cancer_data.describe()
clump_thickness uniformity_cell_size uniformity_cell_shape marginal_adhesion single_epithelial_cell_size bland_chromatin normal_cucleoli mitoses class
count 699.000000 699.000000 699.000000 699.000000 699.000000 699.000000 699.000000 699.000000 699.000000
mean 4.417740 3.134478 3.207439 2.806867 3.216023 3.437768 2.866953 1.589413 2.689557
std 2.815741 3.051459 2.971913 2.855379 2.214300 2.438364 3.053634 1.715078 0.951273
min 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 2.000000
25% 2.000000 1.000000 1.000000 1.000000 2.000000 2.000000 1.000000 1.000000 2.000000
50% 4.000000 1.000000 1.000000 1.000000 2.000000 3.000000 1.000000 1.000000 2.000000
75% 6.000000 5.000000 5.000000 4.000000 4.000000 5.000000 4.000000 1.000000 4.000000
max 10.000000 10.000000 10.000000 10.000000 10.000000 10.000000 10.000000 10.000000 4.000000
breast_cancer_data.describe(include="all")
clump_thickness uniformity_cell_size uniformity_cell_shape marginal_adhesion single_epithelial_cell_size bare_nuclei bland_chromatin normal_cucleoli mitoses class
count 699.000000 699.000000 699.000000 699.000000 699.000000 699 699.000000 699.000000 699.000000 699.000000
unique NaN NaN NaN NaN NaN 11 NaN NaN NaN NaN
top NaN NaN NaN NaN NaN 1 NaN NaN NaN NaN
freq NaN NaN NaN NaN NaN 402 NaN NaN NaN NaN
mean 4.417740 3.134478 3.207439 2.806867 3.216023 NaN 3.437768 2.866953 1.589413 2.689557
std 2.815741 3.051459 2.971913 2.855379 2.214300 NaN 2.438364 3.053634 1.715078 0.951273
min 1.000000 1.000000 1.000000 1.000000 1.000000 NaN 1.000000 1.000000 1.000000 2.000000
25% 2.000000 1.000000 1.000000 1.000000 2.000000 NaN 2.000000 1.000000 1.000000 2.000000
50% 4.000000 1.000000 1.000000 1.000000 2.000000 NaN 3.000000 1.000000 1.000000 2.000000
75% 6.000000 5.000000 5.000000 4.000000 4.000000 NaN 5.000000 4.000000 1.000000 4.000000
max 10.000000 10.000000 10.000000 10.000000 10.000000 NaN 10.000000 10.000000 10.000000 4.000000

Series#

Series are one-dimensional labeled arrays. You can think of them as similar to the columns of an Excel spreadsheet.

There are multiple ways to create a pd.Series: from lists, dictionaries, np.array objects, or from a file.
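
For example, here is a minimal sketch (with made-up values) creating a pd.Series from a list and from a dictionary:

series_from_list = pd.Series([5, 3, 8], name="clump_thickness")  # default 0-based integer index
series_from_dict = pd.Series({"p1": 5, "p2": 3, "p3": 8}, name="clump_thickness")  # dict keys become the index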

Since we already loaded the breast cancer data, we will use it as an example. Each column of this file has been converted to a pd.Series.

clump_thick_series = breast_cancer_data["clump_thickness"].copy()
clump_thick_series.head()
code
1000025    5
1002945    5
1015425    3
1016277    6
1017023    4
Name: clump_thickness, dtype: int64
type(clump_thick_series)
pandas.core.series.Series

pd.Series are made with index and values.

clump_thick_series.index
Index([1000025, 1002945, 1015425, 1016277, 1017023, 1017122, 1018099, 1018561,
       1033078, 1033078,
       ...
        654546,  654546,  695091,  714039,  763235,  776715,  841769,  888820,
        897471,  897471],
      dtype='int64', name='code', length=699)
clump_thick_series.values
array([ 5,  5,  3,  6,  4,  8,  1,  2,  2,  4,  1,  2,  5,  1,  8,  7,  4,
        4, 10,  6,  7, 10,  3,  8,  1,  5,  3,  5,  2,  1,  3,  2, 10,  2,
        3,  2, 10,  6,  5,  2,  6, 10,  6,  5, 10,  1,  3,  1,  4,  7,  9,
        5, 10,  5, 10, 10,  8,  8,  5,  9,  5,  1,  9,  6,  1, 10,  4,  5,
        8,  1,  5,  6,  1,  9, 10,  1,  1,  5,  3,  2,  2,  4,  5,  3,  3,
        5,  3,  3,  4,  2,  1,  3,  4,  1,  2,  1,  2,  5,  9,  7, 10,  2,
        4,  8, 10,  7, 10,  1,  1,  6,  1,  8, 10, 10,  3,  1,  8,  4,  1,
        3,  1,  4, 10,  5,  5,  1,  7,  3,  8,  1,  5,  2,  5,  3,  3,  5,
        4,  3,  4,  1,  3,  2,  9,  1,  2,  1,  3,  1,  3,  8,  1,  7, 10,
        4,  1,  5,  1,  2,  1,  9, 10,  4,  3,  1,  5,  4,  5, 10,  3,  1,
        3,  1,  1,  6,  8,  5,  2,  5,  4,  5,  1,  1,  6,  5,  8,  2,  1,
       10,  5,  1, 10,  7,  5,  1,  3,  4,  8,  5,  1,  3,  9, 10,  1,  5,
        1,  5, 10,  1,  1,  5,  8,  8,  1, 10, 10,  8,  1,  1,  6,  6,  1,
       10,  4,  7, 10,  1, 10,  8,  1, 10,  7,  6,  8, 10,  3,  3, 10,  9,
        8, 10,  5,  3,  2,  1,  1,  5,  8,  8,  4,  3,  1, 10,  6,  6,  9,
        5,  3,  3,  3,  5, 10,  5,  8, 10,  7,  5, 10,  3, 10,  1,  8,  5,
        3,  7,  3,  3,  3,  1,  1, 10,  3,  2,  1, 10,  7,  8, 10,  3,  6,
        5,  1,  1,  8, 10,  1,  5,  5,  5,  8,  9,  8,  1, 10,  1,  8, 10,
        1,  1,  7,  3,  2,  1,  8,  1,  1,  4,  5,  6,  1,  4,  7,  3,  3,
        5,  1,  3, 10,  1,  8, 10, 10,  5,  5,  5,  8,  1,  6,  1,  1,  8,
       10,  1,  2,  1,  7,  1,  5,  1,  3,  4,  5,  2,  3,  2,  1,  4,  5,
        8,  8, 10,  6,  3,  3,  4,  2,  2,  6,  5,  1,  1,  4,  1,  4,  5,
        3,  1,  1,  1,  3,  5,  1, 10,  3,  2,  2,  3,  7,  5,  2,  5,  1,
       10,  3,  1,  1,  3,  3,  4,  3,  1,  3,  3,  5,  3,  1,  1,  4,  1,
        2,  3,  1,  1, 10,  5,  8,  3,  8,  1,  5,  2,  3, 10,  4,  5,  3,
        9,  5,  8,  1,  2,  1,  5,  5,  3,  6, 10, 10,  4,  4,  5, 10,  5,
        1,  1,  5,  2,  1,  5,  1,  5,  4,  5,  3,  4,  2, 10, 10,  8,  5,
        5,  5,  3,  6,  4,  4, 10, 10,  6,  4,  1,  3,  6,  6,  4,  5,  3,
        4,  4,  5,  4,  5,  5,  9,  8,  5,  1,  3, 10,  3,  6,  1,  5,  4,
        5,  5,  3,  1,  4,  4,  4,  6,  4,  4,  4,  1,  3,  8,  1,  5,  2,
        1,  5,  5,  3,  6,  4,  1,  1,  3,  4,  1,  4, 10,  7,  3,  3,  4,
        4,  6,  4,  7,  4,  1,  3,  2,  1,  5,  5,  4,  6,  5,  3,  5,  4,
        2,  5,  6,  2,  3,  7,  3,  1,  3,  4,  3,  4,  5,  5,  2,  5,  5,
        5,  1,  3,  4,  5,  3,  4,  8, 10,  8,  7,  3,  1, 10,  5,  5,  1,
        1,  1,  5,  5,  6,  3,  5,  1,  8,  5,  9,  5,  4,  2, 10,  5,  4,
        5,  4,  5,  3,  5,  3,  1,  4,  5,  5, 10,  4,  1,  5,  5, 10,  5,
        8,  2,  2,  4,  3,  1,  4,  5,  3,  6,  7,  1,  5,  3,  4,  2,  2,
        4,  6,  5,  1,  8,  3,  3, 10,  4,  4,  5,  4,  3,  3,  1,  2,  3,
        1,  1,  5,  3,  3,  1,  5,  4,  3,  3,  5,  5,  7,  1,  1,  4,  1,
        1,  3,  1,  5,  3,  5,  5,  3,  3,  2,  5,  1,  4,  1,  5,  1,  2,
       10,  5,  5,  1,  1,  1,  1,  3,  4,  1,  1,  5,  3,  3,  3,  2,  5,
        4,  4])

Now, imagine you want to access a specific value, for example from the third patient.

clump_thick_series.iloc[2]  # Remember Python is a 0-indexed programming language.
3

However, what if you want to know the clump thickness of a specific patient? Since we have their codes, we can access it with another method.

For example, for the patient with code 1166654:

clump_thick_series.loc[1166654]
10

Don’t forget

  • loc refers to indexes (labels).

  • iloc refers to order.

We will focus on loc instead of iloc, since the power of pandas comes from its indexes, which can be numeric or categorical. If you only need to do position-based analysis, pandas could be overkill and numpy could be enough.
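
To see the difference, here is a minimal sketch with a toy Series (hypothetical values and a non-default index):

toy_series = pd.Series([10, 20, 30], index=[101, 202, 303])
toy_series.loc[202]   # 20, selected by label
toy_series.iloc[1]    # 20, selected by position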

What if you want to get the values of several patients? For example, patients 1166654 and 1178580:

clump_thick_series.loc[[1166654, 1178580]]
code
1166654    10
1178580     5
Name: clump_thickness, dtype: int64

Important

Notice that if the argument is just one label, loc returns only the value. On the other hand, if the argument is a list, then loc returns a pd.Series object.

type(clump_thick_series.loc[1166654])
numpy.int64
type(clump_thick_series.loc[[1166654, 1178580]])
pandas.core.series.Series

You can even edit or add values with these methods.

For instance, what if the dataset is wrong about patient 1166654 and clump thickness should have been 6 instead of 10?

We can fix that easily.

clump_thick_series.loc[1166654] = 6

Warning

If we had not used the copy() method, you would have gotten a SettingWithCopyWarning message after running the last code cell.

I suggest reading this link if you get that warning. In simple words, loc can return a view, which means that if you change anything, the main object itself can change as well. This is a feature, not an error, but we have to be careful with it in the future.

Ok, let’s check that change we made

clump_thick_series.loc[1166654]
6

However, notice that since we copied the column, this didn’t change the value in the original dataset.

breast_cancer_data.loc[1166654]
clump_thickness                10
uniformity_cell_size            3
uniformity_cell_shape           5
marginal_adhesion               1
single_epithelial_cell_size    10
bare_nuclei                     5
bland_chromatin                 3
normal_cucleoli                10
mitoses                         2
class                           4
Name: 1166654, dtype: object

Attention

You can try to create clump_thick_series without the .copy() method and explore what happens when you change values, as sketched below.

I would suggest using copies if you are not sure.
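
A sketch of that experiment follows; the exact behavior (whether the warning appears and whether the original DataFrame changes) depends on your pandas version and copy-on-write settings, so treat it as exploratory.

no_copy_series = breast_cancer_data["clump_thickness"]  # note: no .copy()
no_copy_series.loc[1166654] = 6  # may raise SettingWithCopyWarning
breast_cancer_data.loc[1166654, "clump_thickness"]  # the original DataFrame may have changed too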

Another common task is filtering by a condition, using a boolean mask.

For example, let’s get all the patients with a clump thickness greater than 7.

clump_thick_series > 7
code
1000025    False
1002945    False
1015425    False
1016277    False
1017023    False
           ...  
776715     False
841769     False
888820     False
897471     False
897471     False
Name: clump_thickness, Length: 699, dtype: bool

You can do logical comparisons with a pd.Series, but this will only return another pd.Series of boolean values (True/False). We want to keep only the rows where the value is True.

clump_thick_series.loc[clump_thick_series > 7]
code
1017122     8
1044572     8
1050670    10
1054593    10
1057013     8
           ..
736150     10
822829      8
1253955     8
1268952    10
1369821    10
Name: clump_thickness, Length: 128, dtype: int64

You can avoid using loc for this task, but to be honest, I’d rather use it.

clump_thick_series[clump_thick_series > 7]
code
1017122     8
1044572     8
1050670    10
1054593    10
1057013     8
           ..
736150     10
822829      8
1253955     8
1268952    10
1369821    10
Name: clump_thickness, Length: 128, dtype: int64

However, my favorite version uses a functional approach with a lambda function. It is less intuitive at the beginning, but it allows you to chain operations.

clump_thick_series.loc[lambda x: x > 7]
code
1017122     8
1044572     8
1050670    10
1054593    10
1057013     8
           ..
736150     10
822829      8
1253955     8
1268952    10
1369821    10
Name: clump_thickness, Length: 128, dtype: int64
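
For instance, lambdas make it easy to chain several filters in one expression; here is a small sketch combining two conditions on the same Series:

clump_thick_series.loc[lambda x: x > 7].loc[lambda x: x < 10]  # keep values greater than 7, then drop the 10s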

DataFrames#

A pd.DataFrame is a 2-dimensional array with horizontal and vertical labels (index and columns). It is the natural extension of pd.Series, and you can even think of it as multiple pd.Series concatenated.
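
As a minimal sketch (with made-up values and hypothetical patient codes), you can build a pd.DataFrame from a dictionary of columns:

toy_df = pd.DataFrame(
    {
        "clump_thickness": [5, 3, 8],
        "mitoses": [1, 1, 2],
    },
    index=pd.Index([1000025, 1015425, 1044572], name="code"),  # hypothetical patient codes
)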

type(breast_cancer_data)
pandas.core.frame.DataFrame

There are a few useful methods for exploring the data; let’s look at some of them.

breast_cancer_data.shape
(699, 10)
breast_cancer_data.head()
clump_thickness uniformity_cell_size uniformity_cell_shape marginal_adhesion single_epithelial_cell_size bare_nuclei bland_chromatin normal_cucleoli mitoses class
code
1000025 5 1 1 1 2 1 3 1 1 2
1002945 5 4 4 5 7 10 3 2 1 2
1015425 3 1 1 1 2 2 3 1 1 2
1016277 6 8 8 1 3 4 3 7 1 2
1017023 4 1 1 3 2 1 3 1 1 2
breast_cancer_data.tail()
clump_thickness uniformity_cell_size uniformity_cell_shape marginal_adhesion single_epithelial_cell_size bare_nuclei bland_chromatin normal_cucleoli mitoses class
code
776715 3 1 1 1 3 2 1 1 1 2
841769 2 1 1 1 2 1 1 1 1 2
888820 5 10 10 3 7 3 8 10 2 4
897471 4 8 6 4 3 4 10 6 1 4
897471 4 8 8 5 4 5 10 4 1 4
breast_cancer_data.max()
clump_thickness                10
uniformity_cell_size           10
uniformity_cell_shape          10
marginal_adhesion              10
single_epithelial_cell_size    10
bare_nuclei                     ?
bland_chromatin                10
normal_cucleoli                10
mitoses                        10
class                           4
dtype: object
breast_cancer_data.mean()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File /opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/pandas/core/nanops.py:1680, in _ensure_numeric(x)
   1679 try:
-> 1680     x = x.astype(np.complex128)
   1681 except (TypeError, ValueError):

ValueError: complex() arg is a malformed string

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
File /opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/pandas/core/nanops.py:1683, in _ensure_numeric(x)
   1682 try:
-> 1683     x = x.astype(np.float64)
   1684 except ValueError as err:
   1685     # GH#29941 we get here with object arrays containing strs

ValueError: could not convert string to float: '1102411010111113391111011071?17111111511111107?31011191183458856110232821211091121104211311112948101111111111610551313101019291083521032121010711011011110112111?11551?821101105311011?1010113?210111111101010111101111010181081810111171111010111105111108110105114111058101105110781101?1029102115121091?1101010810111810101010311010411011041?1117111010101010151011?10?105?110411011010113511111?1081510?11011101410811101011011101011110111181131011310471010331110101111111111111101111101121101111111191141111211?4110310121310111101211111181011111043211111101111016103111511141010111111111111011510131103411011051111111111154111111101011110115101111111011111111121111110115111511111111111101310510101121111111010111101311101011011111111110811101102101111?111211146511111311121111111111214111111110111111111158111111111101011111111151121345'

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
Cell In[29], line 1
----> 1 breast_cancer_data.mean()

File /opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/pandas/core/generic.py:11556, in NDFrame._add_numeric_operations.<locals>.mean(self, axis, skipna, numeric_only, **kwargs)
  11539 @doc(
  11540     _num_doc,
  11541     desc="Return the mean of the values over the requested axis.",
   (...)
  11554     **kwargs,
  11555 ):
> 11556     return NDFrame.mean(self, axis, skipna, numeric_only, **kwargs)

File /opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/pandas/core/generic.py:11201, in NDFrame.mean(self, axis, skipna, numeric_only, **kwargs)
  11194 def mean(
  11195     self,
  11196     axis: Axis | None = 0,
   (...)
  11199     **kwargs,
  11200 ) -> Series | float:
> 11201     return self._stat_function(
  11202         "mean", nanops.nanmean, axis, skipna, numeric_only, **kwargs
  11203     )

File /opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/pandas/core/generic.py:11158, in NDFrame._stat_function(self, name, func, axis, skipna, numeric_only, **kwargs)
  11154     nv.validate_stat_func((), kwargs, fname=name)
  11156 validate_bool_kwarg(skipna, "skipna", none_allowed=False)
> 11158 return self._reduce(
  11159     func, name=name, axis=axis, skipna=skipna, numeric_only=numeric_only
  11160 )

File /opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/pandas/core/frame.py:10519, in DataFrame._reduce(self, op, name, axis, skipna, numeric_only, filter_type, **kwds)
  10515     df = df.T
  10517 # After possibly _get_data and transposing, we are now in the
  10518 #  simple case where we can use BlockManager.reduce
> 10519 res = df._mgr.reduce(blk_func)
  10520 out = df._constructor(res).iloc[0]
  10521 if out_dtype is not None:

File /opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/pandas/core/internals/managers.py:1534, in BlockManager.reduce(self, func)
   1532 res_blocks: list[Block] = []
   1533 for blk in self.blocks:
-> 1534     nbs = blk.reduce(func)
   1535     res_blocks.extend(nbs)
   1537 index = Index([None])  # placeholder

File /opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/pandas/core/internals/blocks.py:339, in Block.reduce(self, func)
    333 @final
    334 def reduce(self, func) -> list[Block]:
    335     # We will apply the function and reshape the result into a single-row
    336     #  Block with the same mgr_locs; squeezing will be done at a higher level
    337     assert self.ndim == 2
--> 339     result = func(self.values)
    341     if self.values.ndim == 1:
    342         # TODO(EA2D): special case not needed with 2D EAs
    343         res_values = np.array([[result]])

File /opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/pandas/core/frame.py:10482, in DataFrame._reduce.<locals>.blk_func(values, axis)
  10480     return values._reduce(name, skipna=skipna, **kwds)
  10481 else:
> 10482     return op(values, axis=axis, skipna=skipna, **kwds)

File /opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/pandas/core/nanops.py:96, in disallow.__call__.<locals>._f(*args, **kwargs)
     94 try:
     95     with np.errstate(invalid="ignore"):
---> 96         return f(*args, **kwargs)
     97 except ValueError as e:
     98     # we want to transform an object array
     99     # ValueError message to the more typical TypeError
    100     # e.g. this is normally a disallowed function on
    101     # object arrays that contain strings
    102     if is_object_dtype(args[0]):

File /opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/pandas/core/nanops.py:158, in bottleneck_switch.__call__.<locals>.f(values, axis, skipna, **kwds)
    156         result = alt(values, axis=axis, skipna=skipna, **kwds)
    157 else:
--> 158     result = alt(values, axis=axis, skipna=skipna, **kwds)
    160 return result

File /opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/pandas/core/nanops.py:421, in _datetimelike_compat.<locals>.new_func(values, axis, skipna, mask, **kwargs)
    418 if datetimelike and mask is None:
    419     mask = isna(values)
--> 421 result = func(values, axis=axis, skipna=skipna, mask=mask, **kwargs)
    423 if datetimelike:
    424     result = _wrap_results(result, orig_values.dtype, fill_value=iNaT)

File /opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/pandas/core/nanops.py:727, in nanmean(values, axis, skipna, mask)
    724     dtype_count = dtype
    726 count = _get_counts(values.shape, mask, axis, dtype=dtype_count)
--> 727 the_sum = _ensure_numeric(values.sum(axis, dtype=dtype_sum))
    729 if axis is not None and getattr(the_sum, "ndim", False):
    730     count = cast(np.ndarray, count)

File /opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/pandas/core/nanops.py:1686, in _ensure_numeric(x)
   1683         x = x.astype(np.float64)
   1684     except ValueError as err:
   1685         # GH#29941 we get here with object arrays containing strs
-> 1686         raise TypeError(f"Could not convert {x} to numeric") from err
   1687 else:
   1688     if not np.any(np.imag(x)):

TypeError: Could not convert ['1102411010111113391111011071?17111111511111107?31011191183458856110232821211091121104211311112948101111111111610551313101019291083521032121010711011011110112111?11551?821101105311011?1010113?210111111101010111101111010181081810111171111010111105111108110105114111058101105110781101?1029102115121091?1101010810111810101010311010411011041?1117111010101010151011?10?105?110411011010113511111?1081510?11011101410811101011011101011110111181131011310471010331110101111111111111101111101121101111111191141111211?4110310121310111101211111181011111043211111101111016103111511141010111111111111011510131103411011051111111111154111111101011110115101111111011111111121111110115111511111111111101310510101121111111010111101311101011011111111110811101102101111?111211146511111311121111111111214111111110111111111158111111111101011111111151121345'] to numeric

We can inspect values using loc as well.

breast_cancer_data.loc[1166654]
clump_thickness                10
uniformity_cell_size            3
uniformity_cell_shape           5
marginal_adhesion               1
single_epithelial_cell_size    10
bare_nuclei                     5
bland_chromatin                 3
normal_cucleoli                10
mitoses                         2
class                           4
Name: 1166654, dtype: object

However, you shouldn’t chain two loc calls like this; there is a technical reason for it.

breast_cancer_data.loc[1166654].loc["clump_thickness"]
10

Just use the double-index notation:

breast_cancer_data.loc[1166654, "clump_thickness"]
10
breast_cancer_data.loc[1166654, ["clump_thickness", "class"]]
clump_thickness    10
class               4
Name: 1166654, dtype: object
breast_cancer_data.loc[[1166654, 1178580], ["clump_thickness", "class"]]
clump_thickness class
code
1166654 10 4
1178580 5 2

Boolean masks also work

breast_cancer_data.loc[lambda x: x["clump_thickness"] > 7]
clump_thickness uniformity_cell_size uniformity_cell_shape marginal_adhesion single_epithelial_cell_size bare_nuclei bland_chromatin normal_cucleoli mitoses class
code
1017122 8 10 10 8 7 10 9 7 1 4
1044572 8 7 5 10 7 9 5 5 4 4
1050670 10 7 7 6 4 10 4 1 2 4
1054593 10 5 5 3 6 7 7 10 1 4
1057013 8 4 5 1 2 ? 7 3 1 4
... ... ... ... ... ... ... ... ... ... ...
736150 10 4 3 10 3 10 7 1 2 4
822829 8 10 10 10 6 10 10 10 10 4
1253955 8 7 4 4 5 3 5 10 1 4
1268952 10 10 7 8 7 1 10 10 3 4
1369821 10 10 10 10 5 10 10 10 7 4

129 rows × 10 columns

Or getting a specific column

breast_cancer_data.loc[:, "bare_nuclei"]
code
1000025     1
1002945    10
1015425     2
1016277     4
1017023     1
           ..
776715      2
841769      1
888820      3
897471      4
897471      5
Name: bare_nuclei, Length: 699, dtype: object

However, you can also access a column directly, without loc.

breast_cancer_data["bare_nuclei"]
code
1000025     1
1002945    10
1015425     2
1016277     4
1017023     1
           ..
776715      2
841769      1
888820      3
897471      4
897471      5
Name: bare_nuclei, Length: 699, dtype: object

There are some cool methods you can use for exploring your data

breast_cancer_data.loc[:, "bare_nuclei"].value_counts()
1     402
10    132
2      30
5      30
3      28
8      21
4      19
?      16
9       9
7       8
6       4
Name: bare_nuclei, dtype: int64

What about those ? values?

They represent missing values!

breast_cancer_data.loc[lambda s: s['bare_nuclei'] == '?']
clump_thickness uniformity_cell_size uniformity_cell_shape marginal_adhesion single_epithelial_cell_size bare_nuclei bland_chromatin normal_cucleoli mitoses class
code
1057013 8 4 5 1 2 ? 7 3 1 4
1096800 6 6 6 9 6 ? 7 8 1 2
1183246 1 1 1 1 1 ? 2 1 1 2
1184840 1 1 3 1 2 ? 2 1 1 2
1193683 1 1 2 1 3 ? 1 1 1 2
1197510 5 1 1 1 2 ? 3 1 1 2
1241232 3 1 4 1 2 ? 3 1 1 2
169356 3 1 1 1 2 ? 3 1 1 2
432809 3 1 3 1 2 ? 2 1 1 2
563649 8 8 8 1 2 ? 6 10 1 4
606140 1 1 1 1 2 ? 2 1 1 2
61634 5 4 3 1 2 ? 2 3 1 2
704168 4 6 5 6 7 ? 4 9 1 2
733639 3 1 1 1 2 ? 3 1 1 2
1238464 1 1 1 1 1 ? 2 1 1 2
1057067 1 1 1 1 1 ? 1 1 1 2

pandas has a specific object for denoting null values, pd.NA.

breast_cancer_data.loc[lambda s: s['bare_nuclei'] == '?',  'bare_nuclei'] = pd.NA

Tip

The same result can be achieved more elegantly with the following line: breast_cancer_data = breast_cancer_data.replace({'bare_nuclei': {"?": pd.NA}})
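
Yet another option (a sketch, not used in the rest of this notebook) is to handle the "?" sentinel at load time, via the na_values parameter of pd.read_csv; the bare_nuclei column is then parsed directly as a numeric column:

breast_cancer_data_v2 = pd.read_csv(
    data_filepath,
    names=["code", "clump_thickness", "uniformity_cell_size", "uniformity_cell_shape",
           "marginal_adhesion", "single_epithelial_cell_size", "bare_nuclei",
           "bland_chromatin", "normal_cucleoli", "mitoses", "class"],
    index_col=0,
    na_values="?",  # treat "?" as a missing value while parsing
)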

Let’s see the rows with null values

breast_cancer_data.loc[lambda s: s['bare_nuclei'] == pd.NA]
clump_thickness uniformity_cell_size uniformity_cell_shape marginal_adhesion single_epithelial_cell_size bare_nuclei bland_chromatin normal_cucleoli mitoses class
code

Wait a second, why is it not showing me the null values? Null values have peculiar behaviors in Python: comparing them with == does not return True.

pd.NA == pd.NA
<NA>

Important

pandas will recognize None, np.nan, and pd.NA as null values, so be careful!
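
A quick sketch to check this with pd.isna (numpy is imported here only for the example):

import numpy as np
pd.isna(None), pd.isna(np.nan), pd.isna(pd.NA)  # (True, True, True)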

There are special methods for working with null values

breast_cancer_data.loc[lambda s: s['bare_nuclei'].isnull()]
clump_thickness uniformity_cell_size uniformity_cell_shape marginal_adhesion single_epithelial_cell_size bare_nuclei bland_chromatin normal_cucleoli mitoses class
code
1057013 8 4 5 1 2 <NA> 7 3 1 4
1096800 6 6 6 9 6 <NA> 7 8 1 2
1183246 1 1 1 1 1 <NA> 2 1 1 2
1184840 1 1 3 1 2 <NA> 2 1 1 2
1193683 1 1 2 1 3 <NA> 1 1 1 2
1197510 5 1 1 1 2 <NA> 3 1 1 2
1241232 3 1 4 1 2 <NA> 3 1 1 2
169356 3 1 1 1 2 <NA> 3 1 1 2
432809 3 1 3 1 2 <NA> 2 1 1 2
563649 8 8 8 1 2 <NA> 6 10 1 4
606140 1 1 1 1 2 <NA> 2 1 1 2
61634 5 4 3 1 2 <NA> 2 3 1 2
704168 4 6 5 6 7 <NA> 4 9 1 2
733639 3 1 1 1 2 <NA> 3 1 1 2
1238464 1 1 1 1 1 <NA> 2 1 1 2
1057067 1 1 1 1 1 <NA> 1 1 1 2

Or you can check, for each column, whether there is any null value.

breast_cancer_data.isnull().any()
clump_thickness                False
uniformity_cell_size           False
uniformity_cell_shape          False
marginal_adhesion              False
single_epithelial_cell_size    False
bare_nuclei                     True
bland_chromatin                False
normal_cucleoli                False
mitoses                        False
class                          False
dtype: bool

Or check rows instead, using axis=1.

breast_cancer_data.isnull().any(axis=1)
code
1000025    False
1002945    False
1015425    False
1016277    False
1017023    False
           ...  
776715     False
841769     False
888820     False
897471     False
897471     False
Length: 699, dtype: bool
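
You can combine that row-wise mask with loc to pull out the incomplete records (a quick sketch):

breast_cancer_data.loc[breast_cancer_data.isnull().any(axis=1)]  # rows with at least one null value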

Ok, now we will fix the bare_nuclei column. Imagine you want to replace the null values with the mean value.

breast_cancer_data['bare_nuclei'].mean()

Oh no! We need to convert that column to a numeric type first.

pd.to_numeric(breast_cancer_data['bare_nuclei'])
code
1000025     1.0
1002945    10.0
1015425     2.0
1016277     4.0
1017023     1.0
           ... 
776715      2.0
841769      1.0
888820      3.0
897471      4.0
897471      5.0
Name: bare_nuclei, Length: 699, dtype: float64
breast_cancer_data['bare_nuclei'] = pd.to_numeric(breast_cancer_data['bare_nuclei'])

Another option would have been:

breast_cancer_data = breast_cancer_data.assign(
    bare_nuclei=lambda x: pd.to_numeric(x['bare_nuclei'])
)

I like this last one better, but don’t worry, both work!

breast_cancer_data['bare_nuclei'].mean()
3.5446559297218156

Now, every value in this column is an integer, so we should convert the mean to an integer as well; you should ask the domain experts what makes more sense. Let’s say it is better to round this value up to the next integer.

There is a scientific computing package called numpy that we don’t have time to cover, but you should check it out.

import numpy as np
bare_nuclei_mean = np.ceil(breast_cancer_data['bare_nuclei'].mean())
bare_nuclei_mean
4.0

Now, as an example, let’s say we want to fill those null values with that rounded mean.

If you are wondering whether there is a method for this, the answer is yes!

pd.DataFrame.fillna?
Signature:
pd.DataFrame.fillna(
    self,
    value: 'Hashable | Mapping | Series | DataFrame' = None,
    *,
    method: 'FillnaOptions | None' = None,
    axis: 'Axis | None' = None,
    inplace: 'bool' = False,
    limit: 'int | None' = None,
    downcast: 'dict | None' = None,
) -> 'DataFrame | None'
Docstring:
Fill NA/NaN values using the specified method.

Parameters
----------
value : scalar, dict, Series, or DataFrame
    Value to use to fill holes (e.g. 0), alternately a
    dict/Series/DataFrame of values specifying which value to use for
    each index (for a Series) or column (for a DataFrame).  Values not
    in the dict/Series/DataFrame will not be filled. This value cannot
    be a list.
method : {'backfill', 'bfill', 'pad', 'ffill', None}, default None
    Method to use for filling holes in reindexed Series
    pad / ffill: propagate last valid observation forward to next valid
    backfill / bfill: use next valid observation to fill gap.
axis : {0 or 'index', 1 or 'columns'}
    Axis along which to fill missing values. For `Series`
    this parameter is unused and defaults to 0.
inplace : bool, default False
    If True, fill in-place. Note: this will modify any
    other views on this object (e.g., a no-copy slice for a column in a
    DataFrame).
limit : int, default None
    If method is specified, this is the maximum number of consecutive
    NaN values to forward/backward fill. In other words, if there is
    a gap with more than this number of consecutive NaNs, it will only
    be partially filled. If method is not specified, this is the
    maximum number of entries along the entire axis where NaNs will be
    filled. Must be greater than 0 if not None.
downcast : dict, default is None
    A dict of item->dtype of what to downcast if possible,
    or the string 'infer' which will try to downcast to an appropriate
    equal type (e.g. float64 to int64 if possible).

Returns
-------
DataFrame or None
    Object with missing values filled or None if ``inplace=True``.

See Also
--------
interpolate : Fill NaN values using interpolation.
reindex : Conform object to new index.
asfreq : Convert TimeSeries to specified frequency.

Examples
--------
>>> df = pd.DataFrame([[np.nan, 2, np.nan, 0],
...                    [3, 4, np.nan, 1],
...                    [np.nan, np.nan, np.nan, np.nan],
...                    [np.nan, 3, np.nan, 4]],
...                   columns=list("ABCD"))
>>> df
     A    B   C    D
0  NaN  2.0 NaN  0.0
1  3.0  4.0 NaN  1.0
2  NaN  NaN NaN  NaN
3  NaN  3.0 NaN  4.0

Replace all NaN elements with 0s.

>>> df.fillna(0)
     A    B    C    D
0  0.0  2.0  0.0  0.0
1  3.0  4.0  0.0  1.0
2  0.0  0.0  0.0  0.0
3  0.0  3.0  0.0  4.0

We can also propagate non-null values forward or backward.

>>> df.fillna(method="ffill")
     A    B   C    D
0  NaN  2.0 NaN  0.0
1  3.0  4.0 NaN  1.0
2  3.0  4.0 NaN  1.0
3  3.0  3.0 NaN  4.0

Replace all NaN elements in column 'A', 'B', 'C', and 'D', with 0, 1,
2, and 3 respectively.

>>> values = {"A": 0, "B": 1, "C": 2, "D": 3}
>>> df.fillna(value=values)
     A    B    C    D
0  0.0  2.0  2.0  0.0
1  3.0  4.0  2.0  1.0
2  0.0  1.0  2.0  3.0
3  0.0  3.0  2.0  4.0

Only replace the first NaN element.

>>> df.fillna(value=values, limit=1)
     A    B    C    D
0  0.0  2.0  2.0  0.0
1  3.0  4.0  NaN  1.0
2  NaN  1.0  NaN  3.0
3  NaN  3.0  NaN  4.0

When filling using a DataFrame, replacement happens along
the same column names and same indices

>>> df2 = pd.DataFrame(np.zeros((4, 4)), columns=list("ABCE"))
>>> df.fillna(df2)
     A    B    C    D
0  0.0  2.0  0.0  0.0
1  3.0  4.0  0.0  1.0
2  0.0  0.0  0.0  NaN
3  0.0  3.0  0.0  4.0

Note that column D is not affected since it is not present in df2.
File:      ~/mambaforge/envs/casbbi-nrt-ds/lib/python3.11/site-packages/pandas/core/frame.py
Type:      function
breast_cancer_data.fillna(value={'bare_nuclei': bare_nuclei_mean})
clump_thickness uniformity_cell_size uniformity_cell_shape marginal_adhesion single_epithelial_cell_size bare_nuclei bland_chromatin normal_cucleoli mitoses class
code
1000025 5 1 1 1 2 1.0 3 1 1 2
1002945 5 4 4 5 7 10.0 3 2 1 2
1015425 3 1 1 1 2 2.0 3 1 1 2
1016277 6 8 8 1 3 4.0 3 7 1 2
1017023 4 1 1 3 2 1.0 3 1 1 2
... ... ... ... ... ... ... ... ... ... ...
776715 3 1 1 1 3 2.0 1 1 1 2
841769 2 1 1 1 2 1.0 1 1 1 2
888820 5 10 10 3 7 3.0 8 10 2 4
897471 4 8 6 4 3 4.0 10 6 1 4
897471 4 8 8 5 4 5.0 10 4 1 4

699 rows × 10 columns

breast_cancer_data.isnull().any()
clump_thickness                False
uniformity_cell_size           False
uniformity_cell_shape          False
marginal_adhesion              False
single_epithelial_cell_size    False
bare_nuclei                     True
bland_chromatin                False
normal_cucleoli                False
mitoses                        False
class                          False
dtype: bool

What? There are still null values. That is because most pandas functions return a copy of the DataFrame. You have two options:

  • To assign the result to the same variable.

  • If the method allows it, you can use inplace=True.
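
The first option would look like this (a sketch, equivalent to the inplace call run below):

breast_cancer_data = breast_cancer_data.fillna(value={'bare_nuclei': bare_nuclei_mean})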

breast_cancer_data.fillna(value={'bare_nuclei': bare_nuclei_mean}, inplace=True)
breast_cancer_data.isnull().any()
clump_thickness                False
uniformity_cell_size           False
uniformity_cell_shape          False
marginal_adhesion              False
single_epithelial_cell_size    False
bare_nuclei                    False
bland_chromatin                False
normal_cucleoli                False
mitoses                        False
class                          False
dtype: bool

Summary and next steps#

In this session we explored a dataset: we read it, got familiar with its elements and methods, and cleaned its null values.

We didn’t have enough time to cover everything, but you should also learn about merging datasets, aggregation, etc.

Just a few examples:

cancer_names = pd.DataFrame(
    [[2, "benign"], [4, "malignant"], [0, "unknown"]],
    columns=["class", "cancer"]
)
cancer_names
class cancer
0 2 benign
1 4 malignant
2 0 unknown
breast_cancer_data2 = breast_cancer_data.merge(
    cancer_names,
    how="left",
    on="class"
)
breast_cancer_data2
clump_thickness uniformity_cell_size uniformity_cell_shape marginal_adhesion single_epithelial_cell_size bare_nuclei bland_chromatin normal_cucleoli mitoses class cancer
0 5 1 1 1 2 1.0 3 1 1 2 benign
1 5 4 4 5 7 10.0 3 2 1 2 benign
2 3 1 1 1 2 2.0 3 1 1 2 benign
3 6 8 8 1 3 4.0 3 7 1 2 benign
4 4 1 1 3 2 1.0 3 1 1 2 benign
... ... ... ... ... ... ... ... ... ... ... ...
694 3 1 1 1 3 2.0 1 1 1 2 benign
695 2 1 1 1 2 1.0 1 1 1 2 benign
696 5 10 10 3 7 3.0 8 10 2 4 malignant
697 4 8 6 4 3 4.0 10 6 1 4 malignant
698 4 8 8 5 4 5.0 10 4 1 4 malignant

699 rows × 11 columns

breast_cancer_data2["cancer"].unique()
array(['benign', 'malignant'], dtype=object)
breast_cancer_data2.groupby("cancer").mean().T
cancer benign malignant
clump_thickness 2.956332 7.195021
uniformity_cell_size 1.325328 6.572614
uniformity_cell_shape 1.443231 6.560166
marginal_adhesion 1.364629 5.547718
single_epithelial_cell_size 2.120087 5.298755
bare_nuclei 1.427948 7.597510
bland_chromatin 2.100437 5.979253
normal_cucleoli 1.290393 5.863071
mitoses 1.063319 2.589212
class 2.000000 4.000000
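
And as a small teaser for aggregation, .agg lets you compute several statistics per group at once (a sketch on a single column):

breast_cancer_data2.groupby("cancer")["clump_thickness"].agg(["mean", "std", "count"])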

A very good place to keep learning is the official pandas user guide.