{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Data Analysis"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The main goal of this class is to learn how to gather, explore, clean and analyze different types of datasets.\n",
    "\n",
    "We will introduce some data analysis common tasks using the `pandas` package."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> `pandas` is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.\n",
    ">\n",
    "> --<cite> https://pandas.pydata.org </cite>--"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# from pathlib import Path  # Run this line if you are working in a local environment"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Dataset: Breast Cancer Wisconsin"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.\n",
    "\n",
    "This breast cancer databases was obtained from the University of Wisconsin Hospitals, Madison from Dr. William H. Wolberg.\n",
    "\n",
    "Source: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29\n",
    "\n",
    "\n",
    "| Attribute                   | Domain  |\n",
    "|-----------------------------|---------|\n",
    "| Sample code number          | id number |\n",
    "| Clump Thickness             | 1 - 10 |\n",
    "| Uniformity of Cell Size     | 1 - 10 |\n",
    "| Uniformity of Cell Shape    | 1 - 10 |\n",
    "| Marginal Adhesion           | 1 - 10 |\n",
    "| Single Epithelial Cell Size | 1 - 10 |\n",
    "| Bare Nuclei                 | 1 - 10 |\n",
    "| Bland Chromatin             | 1 - 10 |\n",
    "| Normal Nucleoli             | 1 - 10 |\n",
    "| Mitoses                     |  1 - 10 |\n",
    "| Class                       |  (2 for benign, 4 for malignant) |"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data'"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data_filepath = \"https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data\"\n",
    "# data_filepath = Path().resolve().parent / \"data\" / \"breast-cancer-wisconsin.data\"\n",
    "data_filepath"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "More details in the following file you can download: https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.names"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The easiest way to open a plain text file as this one is using `pd.read_csv`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>clump_thickness</th>\n",
       "      <th>uniformity_cell_size</th>\n",
       "      <th>uniformity_cell_shape</th>\n",
       "      <th>marginal_adhesion</th>\n",
       "      <th>single_epithelial_cell_size</th>\n",
       "      <th>bare_nuclei</th>\n",
       "      <th>bland_chromatin</th>\n",
       "      <th>normal_cucleoli</th>\n",
       "      <th>mitoses</th>\n",
       "      <th>class</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>code</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1000025</th>\n",
       "      <td>5</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1002945</th>\n",
       "      <td>5</td>\n",
       "      <td>4</td>\n",
       "      <td>4</td>\n",
       "      <td>5</td>\n",
       "      <td>7</td>\n",
       "      <td>10</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1015425</th>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1016277</th>\n",
       "      <td>6</td>\n",
       "      <td>8</td>\n",
       "      <td>8</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>4</td>\n",
       "      <td>3</td>\n",
       "      <td>7</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1017023</th>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         clump_thickness  uniformity_cell_size  uniformity_cell_shape  \\\n",
       "code                                                                    \n",
       "1000025                5                     1                      1   \n",
       "1002945                5                     4                      4   \n",
       "1015425                3                     1                      1   \n",
       "1016277                6                     8                      8   \n",
       "1017023                4                     1                      1   \n",
       "\n",
       "         marginal_adhesion  single_epithelial_cell_size bare_nuclei  \\\n",
       "code                                                                  \n",
       "1000025                  1                            2           1   \n",
       "1002945                  5                            7          10   \n",
       "1015425                  1                            2           2   \n",
       "1016277                  1                            3           4   \n",
       "1017023                  3                            2           1   \n",
       "\n",
       "         bland_chromatin  normal_cucleoli  mitoses  class  \n",
       "code                                                       \n",
       "1000025                3                1        1      2  \n",
       "1002945                3                2        1      2  \n",
       "1015425                3                1        1      2  \n",
       "1016277                3                7        1      2  \n",
       "1017023                3                1        1      2  "
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "breast_cancer_data = pd.read_csv(\n",
    "    data_filepath ,\n",
    "    names=[\n",
    "        \"code\",\n",
    "        \"clump_thickness\",\n",
    "        \"uniformity_cell_size\",\n",
    "        \"uniformity_cell_shape\",\n",
    "        \"marginal_adhesion\",\n",
    "        \"single_epithelial_cell_size\",\n",
    "        \"bare_nuclei\",\n",
    "        \"bland_chromatin\",\n",
    "        \"normal_cucleoli\",\n",
    "        \"mitoses\",\n",
    "        \"class\",\n",
    "    ],\n",
    "    index_col=0\n",
    ")\n",
    "breast_cancer_data.head()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's explore this data a little bit before start working with it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "Int64Index: 699 entries, 1000025 to 897471\n",
      "Data columns (total 10 columns):\n",
      " #   Column                       Non-Null Count  Dtype \n",
      "---  ------                       --------------  ----- \n",
      " 0   clump_thickness              699 non-null    int64 \n",
      " 1   uniformity_cell_size         699 non-null    int64 \n",
      " 2   uniformity_cell_shape        699 non-null    int64 \n",
      " 3   marginal_adhesion            699 non-null    int64 \n",
      " 4   single_epithelial_cell_size  699 non-null    int64 \n",
      " 5   bare_nuclei                  699 non-null    object\n",
      " 6   bland_chromatin              699 non-null    int64 \n",
      " 7   normal_cucleoli              699 non-null    int64 \n",
      " 8   mitoses                      699 non-null    int64 \n",
      " 9   class                        699 non-null    int64 \n",
      "dtypes: int64(9), object(1)\n",
      "memory usage: 60.1+ KB\n"
     ]
    }
   ],
   "source": [
    "breast_cancer_data.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "clump_thickness                 int64\n",
       "uniformity_cell_size            int64\n",
       "uniformity_cell_shape           int64\n",
       "marginal_adhesion               int64\n",
       "single_epithelial_cell_size     int64\n",
       "bare_nuclei                    object\n",
       "bland_chromatin                 int64\n",
       "normal_cucleoli                 int64\n",
       "mitoses                         int64\n",
       "class                           int64\n",
       "dtype: object"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "breast_cancer_data.dtypes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>clump_thickness</th>\n",
       "      <th>uniformity_cell_size</th>\n",
       "      <th>uniformity_cell_shape</th>\n",
       "      <th>marginal_adhesion</th>\n",
       "      <th>single_epithelial_cell_size</th>\n",
       "      <th>bland_chromatin</th>\n",
       "      <th>normal_cucleoli</th>\n",
       "      <th>mitoses</th>\n",
       "      <th>class</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>699.000000</td>\n",
       "      <td>699.000000</td>\n",
       "      <td>699.000000</td>\n",
       "      <td>699.000000</td>\n",
       "      <td>699.000000</td>\n",
       "      <td>699.000000</td>\n",
       "      <td>699.000000</td>\n",
       "      <td>699.000000</td>\n",
       "      <td>699.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>4.417740</td>\n",
       "      <td>3.134478</td>\n",
       "      <td>3.207439</td>\n",
       "      <td>2.806867</td>\n",
       "      <td>3.216023</td>\n",
       "      <td>3.437768</td>\n",
       "      <td>2.866953</td>\n",
       "      <td>1.589413</td>\n",
       "      <td>2.689557</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>2.815741</td>\n",
       "      <td>3.051459</td>\n",
       "      <td>2.971913</td>\n",
       "      <td>2.855379</td>\n",
       "      <td>2.214300</td>\n",
       "      <td>2.438364</td>\n",
       "      <td>3.053634</td>\n",
       "      <td>1.715078</td>\n",
       "      <td>0.951273</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>2.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>2.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>2.000000</td>\n",
       "      <td>2.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>2.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>4.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>2.000000</td>\n",
       "      <td>3.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>2.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>6.000000</td>\n",
       "      <td>5.000000</td>\n",
       "      <td>5.000000</td>\n",
       "      <td>4.000000</td>\n",
       "      <td>4.000000</td>\n",
       "      <td>5.000000</td>\n",
       "      <td>4.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>4.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>10.000000</td>\n",
       "      <td>10.000000</td>\n",
       "      <td>10.000000</td>\n",
       "      <td>10.000000</td>\n",
       "      <td>10.000000</td>\n",
       "      <td>10.000000</td>\n",
       "      <td>10.000000</td>\n",
       "      <td>10.000000</td>\n",
       "      <td>4.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       clump_thickness  uniformity_cell_size  uniformity_cell_shape  \\\n",
       "count       699.000000            699.000000             699.000000   \n",
       "mean          4.417740              3.134478               3.207439   \n",
       "std           2.815741              3.051459               2.971913   \n",
       "min           1.000000              1.000000               1.000000   \n",
       "25%           2.000000              1.000000               1.000000   \n",
       "50%           4.000000              1.000000               1.000000   \n",
       "75%           6.000000              5.000000               5.000000   \n",
       "max          10.000000             10.000000              10.000000   \n",
       "\n",
       "       marginal_adhesion  single_epithelial_cell_size  bland_chromatin  \\\n",
       "count         699.000000                   699.000000       699.000000   \n",
       "mean            2.806867                     3.216023         3.437768   \n",
       "std             2.855379                     2.214300         2.438364   \n",
       "min             1.000000                     1.000000         1.000000   \n",
       "25%             1.000000                     2.000000         2.000000   \n",
       "50%             1.000000                     2.000000         3.000000   \n",
       "75%             4.000000                     4.000000         5.000000   \n",
       "max            10.000000                    10.000000        10.000000   \n",
       "\n",
       "       normal_cucleoli     mitoses       class  \n",
       "count       699.000000  699.000000  699.000000  \n",
       "mean          2.866953    1.589413    2.689557  \n",
       "std           3.053634    1.715078    0.951273  \n",
       "min           1.000000    1.000000    2.000000  \n",
       "25%           1.000000    1.000000    2.000000  \n",
       "50%           1.000000    1.000000    2.000000  \n",
       "75%           4.000000    1.000000    4.000000  \n",
       "max          10.000000   10.000000    4.000000  "
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "breast_cancer_data.describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>clump_thickness</th>\n",
       "      <th>uniformity_cell_size</th>\n",
       "      <th>uniformity_cell_shape</th>\n",
       "      <th>marginal_adhesion</th>\n",
       "      <th>single_epithelial_cell_size</th>\n",
       "      <th>bare_nuclei</th>\n",
       "      <th>bland_chromatin</th>\n",
       "      <th>normal_cucleoli</th>\n",
       "      <th>mitoses</th>\n",
       "      <th>class</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>699.000000</td>\n",
       "      <td>699.000000</td>\n",
       "      <td>699.000000</td>\n",
       "      <td>699.000000</td>\n",
       "      <td>699.000000</td>\n",
       "      <td>699</td>\n",
       "      <td>699.000000</td>\n",
       "      <td>699.000000</td>\n",
       "      <td>699.000000</td>\n",
       "      <td>699.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>unique</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>11</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>top</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>freq</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>402</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>4.417740</td>\n",
       "      <td>3.134478</td>\n",
       "      <td>3.207439</td>\n",
       "      <td>2.806867</td>\n",
       "      <td>3.216023</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3.437768</td>\n",
       "      <td>2.866953</td>\n",
       "      <td>1.589413</td>\n",
       "      <td>2.689557</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>2.815741</td>\n",
       "      <td>3.051459</td>\n",
       "      <td>2.971913</td>\n",
       "      <td>2.855379</td>\n",
       "      <td>2.214300</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2.438364</td>\n",
       "      <td>3.053634</td>\n",
       "      <td>1.715078</td>\n",
       "      <td>0.951273</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>2.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>2.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>2.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>2.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>4.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>2.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>2.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>6.000000</td>\n",
       "      <td>5.000000</td>\n",
       "      <td>5.000000</td>\n",
       "      <td>4.000000</td>\n",
       "      <td>4.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>5.000000</td>\n",
       "      <td>4.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>4.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>10.000000</td>\n",
       "      <td>10.000000</td>\n",
       "      <td>10.000000</td>\n",
       "      <td>10.000000</td>\n",
       "      <td>10.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>10.000000</td>\n",
       "      <td>10.000000</td>\n",
       "      <td>10.000000</td>\n",
       "      <td>4.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        clump_thickness  uniformity_cell_size  uniformity_cell_shape  \\\n",
       "count        699.000000            699.000000             699.000000   \n",
       "unique              NaN                   NaN                    NaN   \n",
       "top                 NaN                   NaN                    NaN   \n",
       "freq                NaN                   NaN                    NaN   \n",
       "mean           4.417740              3.134478               3.207439   \n",
       "std            2.815741              3.051459               2.971913   \n",
       "min            1.000000              1.000000               1.000000   \n",
       "25%            2.000000              1.000000               1.000000   \n",
       "50%            4.000000              1.000000               1.000000   \n",
       "75%            6.000000              5.000000               5.000000   \n",
       "max           10.000000             10.000000              10.000000   \n",
       "\n",
       "        marginal_adhesion  single_epithelial_cell_size bare_nuclei  \\\n",
       "count          699.000000                   699.000000         699   \n",
       "unique                NaN                          NaN          11   \n",
       "top                   NaN                          NaN           1   \n",
       "freq                  NaN                          NaN         402   \n",
       "mean             2.806867                     3.216023         NaN   \n",
       "std              2.855379                     2.214300         NaN   \n",
       "min              1.000000                     1.000000         NaN   \n",
       "25%              1.000000                     2.000000         NaN   \n",
       "50%              1.000000                     2.000000         NaN   \n",
       "75%              4.000000                     4.000000         NaN   \n",
       "max             10.000000                    10.000000         NaN   \n",
       "\n",
       "        bland_chromatin  normal_cucleoli     mitoses       class  \n",
       "count        699.000000       699.000000  699.000000  699.000000  \n",
       "unique              NaN              NaN         NaN         NaN  \n",
       "top                 NaN              NaN         NaN         NaN  \n",
       "freq                NaN              NaN         NaN         NaN  \n",
       "mean           3.437768         2.866953    1.589413    2.689557  \n",
       "std            2.438364         3.053634    1.715078    0.951273  \n",
       "min            1.000000         1.000000    1.000000    2.000000  \n",
       "25%            2.000000         1.000000    1.000000    2.000000  \n",
       "50%            3.000000         1.000000    1.000000    2.000000  \n",
       "75%            5.000000         4.000000    1.000000    4.000000  \n",
       "max           10.000000        10.000000   10.000000    4.000000  "
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "breast_cancer_data.describe(include=\"all\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Series"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Series are one-dimensional labeled arrays. You can think they are similar to columns of a excel spreadsheet. "
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are multiple ways to create a `pd.Series`, using lists, dictionaies, `np.array` or from a file. \n",
    "\n",
    "Since we already loaded the breast cancer data we will use it as an example. Each list of this file has been converted to a `pd.Series`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "code\n",
       "1000025    5\n",
       "1002945    5\n",
       "1015425    3\n",
       "1016277    6\n",
       "1017023    4\n",
       "Name: clump_thickness, dtype: int64"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "clump_thick_series = breast_cancer_data[\"clump_thickness\"].copy()\n",
    "clump_thick_series.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "pandas.core.series.Series"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type(clump_thick_series)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`pd.Series` are made with _index_ and _values_."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Int64Index([1000025, 1002945, 1015425, 1016277, 1017023, 1017122, 1018099,\n",
       "            1018561, 1033078, 1033078,\n",
       "            ...\n",
       "             654546,  654546,  695091,  714039,  763235,  776715,  841769,\n",
       "             888820,  897471,  897471],\n",
       "           dtype='int64', name='code', length=699)"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "clump_thick_series.index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 5,  5,  3,  6,  4,  8,  1,  2,  2,  4,  1,  2,  5,  1,  8,  7,  4,\n",
       "        4, 10,  6,  7, 10,  3,  8,  1,  5,  3,  5,  2,  1,  3,  2, 10,  2,\n",
       "        3,  2, 10,  6,  5,  2,  6, 10,  6,  5, 10,  1,  3,  1,  4,  7,  9,\n",
       "        5, 10,  5, 10, 10,  8,  8,  5,  9,  5,  1,  9,  6,  1, 10,  4,  5,\n",
       "        8,  1,  5,  6,  1,  9, 10,  1,  1,  5,  3,  2,  2,  4,  5,  3,  3,\n",
       "        5,  3,  3,  4,  2,  1,  3,  4,  1,  2,  1,  2,  5,  9,  7, 10,  2,\n",
       "        4,  8, 10,  7, 10,  1,  1,  6,  1,  8, 10, 10,  3,  1,  8,  4,  1,\n",
       "        3,  1,  4, 10,  5,  5,  1,  7,  3,  8,  1,  5,  2,  5,  3,  3,  5,\n",
       "        4,  3,  4,  1,  3,  2,  9,  1,  2,  1,  3,  1,  3,  8,  1,  7, 10,\n",
       "        4,  1,  5,  1,  2,  1,  9, 10,  4,  3,  1,  5,  4,  5, 10,  3,  1,\n",
       "        3,  1,  1,  6,  8,  5,  2,  5,  4,  5,  1,  1,  6,  5,  8,  2,  1,\n",
       "       10,  5,  1, 10,  7,  5,  1,  3,  4,  8,  5,  1,  3,  9, 10,  1,  5,\n",
       "        1,  5, 10,  1,  1,  5,  8,  8,  1, 10, 10,  8,  1,  1,  6,  6,  1,\n",
       "       10,  4,  7, 10,  1, 10,  8,  1, 10,  7,  6,  8, 10,  3,  3, 10,  9,\n",
       "        8, 10,  5,  3,  2,  1,  1,  5,  8,  8,  4,  3,  1, 10,  6,  6,  9,\n",
       "        5,  3,  3,  3,  5, 10,  5,  8, 10,  7,  5, 10,  3, 10,  1,  8,  5,\n",
       "        3,  7,  3,  3,  3,  1,  1, 10,  3,  2,  1, 10,  7,  8, 10,  3,  6,\n",
       "        5,  1,  1,  8, 10,  1,  5,  5,  5,  8,  9,  8,  1, 10,  1,  8, 10,\n",
       "        1,  1,  7,  3,  2,  1,  8,  1,  1,  4,  5,  6,  1,  4,  7,  3,  3,\n",
       "        5,  1,  3, 10,  1,  8, 10, 10,  5,  5,  5,  8,  1,  6,  1,  1,  8,\n",
       "       10,  1,  2,  1,  7,  1,  5,  1,  3,  4,  5,  2,  3,  2,  1,  4,  5,\n",
       "        8,  8, 10,  6,  3,  3,  4,  2,  2,  6,  5,  1,  1,  4,  1,  4,  5,\n",
       "        3,  1,  1,  1,  3,  5,  1, 10,  3,  2,  2,  3,  7,  5,  2,  5,  1,\n",
       "       10,  3,  1,  1,  3,  3,  4,  3,  1,  3,  3,  5,  3,  1,  1,  4,  1,\n",
       "        2,  3,  1,  1, 10,  5,  8,  3,  8,  1,  5,  2,  3, 10,  4,  5,  3,\n",
       "        9,  5,  8,  1,  2,  1,  5,  5,  3,  6, 10, 10,  4,  4,  5, 10,  5,\n",
       "        1,  1,  5,  2,  1,  5,  1,  5,  4,  5,  3,  4,  2, 10, 10,  8,  5,\n",
       "        5,  5,  3,  6,  4,  4, 10, 10,  6,  4,  1,  3,  6,  6,  4,  5,  3,\n",
       "        4,  4,  5,  4,  5,  5,  9,  8,  5,  1,  3, 10,  3,  6,  1,  5,  4,\n",
       "        5,  5,  3,  1,  4,  4,  4,  6,  4,  4,  4,  1,  3,  8,  1,  5,  2,\n",
       "        1,  5,  5,  3,  6,  4,  1,  1,  3,  4,  1,  4, 10,  7,  3,  3,  4,\n",
       "        4,  6,  4,  7,  4,  1,  3,  2,  1,  5,  5,  4,  6,  5,  3,  5,  4,\n",
       "        2,  5,  6,  2,  3,  7,  3,  1,  3,  4,  3,  4,  5,  5,  2,  5,  5,\n",
       "        5,  1,  3,  4,  5,  3,  4,  8, 10,  8,  7,  3,  1, 10,  5,  5,  1,\n",
       "        1,  1,  5,  5,  6,  3,  5,  1,  8,  5,  9,  5,  4,  2, 10,  5,  4,\n",
       "        5,  4,  5,  3,  5,  3,  1,  4,  5,  5, 10,  4,  1,  5,  5, 10,  5,\n",
       "        8,  2,  2,  4,  3,  1,  4,  5,  3,  6,  7,  1,  5,  3,  4,  2,  2,\n",
       "        4,  6,  5,  1,  8,  3,  3, 10,  4,  4,  5,  4,  3,  3,  1,  2,  3,\n",
       "        1,  1,  5,  3,  3,  1,  5,  4,  3,  3,  5,  5,  7,  1,  1,  4,  1,\n",
       "        1,  3,  1,  5,  3,  5,  5,  3,  3,  2,  5,  1,  4,  1,  5,  1,  2,\n",
       "       10,  5,  5,  1,  1,  1,  1,  3,  4,  1,  1,  5,  3,  3,  3,  2,  5,\n",
       "        4,  4])"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "clump_thick_series.values"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, imagine you want to access to a specific value from the third patient."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "3"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "clump_thick_series.iloc[2]  # Remember Python is a 0-indexed progamming language."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "However what if you want to know the clump thickness of a specific patient. Since we have their codes we can access with another method.\n",
    "\n",
    "For example, for patient's code `1166654`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "10"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "clump_thick_series.loc[1166654]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Don't forget\n",
    "\n",
    "* `loc` refers to indexes (__labels__).\n",
    "* `iloc` refers to order.\n",
    "\n",
    "We will focus on `loc` instead of `iloc` since the power of `pandas` comes from its indexes can be numeric or categoricals. If you only need to do order-based analysis `pandas` could be overkill and `numpy` could be enough."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What if you want to get the values of several patients? For example patients `1166654` and `1178580`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "code\n",
       "1166654    10\n",
       "1178580     5\n",
       "Name: clump_thickness, dtype: int64"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "clump_thick_series.loc[[1166654, 1178580]]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```{important}\n",
    "Notice if the argument is just one label the `loc` returns only the value. On the other hand, if the argument is a list then `loc` returns a `pd.Series` object.\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "numpy.int64"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type(clump_thick_series.loc[1166654])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "pandas.core.series.Series"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type(clump_thick_series.loc[[1166654, 1178580]])"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can even edit or add values with these methods.\n",
    "\n",
    "For instance, what if the dataset is wrong about patient `1166654` and clump thickness should have been `6` instead of `10`? \n",
    "\n",
    "We can fix that easily."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "clump_thick_series.loc[1166654] = 6"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```{warning}\n",
    "You should have got a `SettingWithCopyWarning` message after running the last code cell if we had not used the `copy()` method.\n",
    "\n",
    "I would suggest you to read [this link](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy) if you get that warning. But in simple words, `loc` returns a __view__, that means if you change anything it will change the main object itself. This is a feature, not an error. We have to be careful with this in the future.\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "Ok, let's check that change we made"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "6"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "clump_thick_series.loc[1166654]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Howerver, notice since we copied the column this didn't change the value in the original dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "clump_thickness                10\n",
       "uniformity_cell_size            3\n",
       "uniformity_cell_shape           5\n",
       "marginal_adhesion               1\n",
       "single_epithelial_cell_size    10\n",
       "bare_nuclei                     5\n",
       "bland_chromatin                 3\n",
       "normal_cucleoli                10\n",
       "mitoses                         2\n",
       "class                           4\n",
       "Name: 1166654, dtype: object"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "breast_cancer_data.loc[1166654]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```{attention}\n",
    "You can try to create `clump_thick_series` without the `.copy()` method and explore what happens if you change values.\n",
    "\n",
    "I would suggest you __to use copies if you are not sure__.\n",
    "```"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Another common mask is when you want to filter by a condition.\n",
    "\n",
    "For example, let's get all the patients with a clump thickness greater than 7."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "code\n",
       "1000025    False\n",
       "1002945    False\n",
       "1015425    False\n",
       "1016277    False\n",
       "1017023    False\n",
       "           ...  \n",
       "776715     False\n",
       "841769     False\n",
       "888820     False\n",
       "897471     False\n",
       "897471     False\n",
       "Name: clump_thickness, Length: 699, dtype: bool"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "clump_thick_series > 7"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can do logical comparations with `pd.Series` but this only will return another `pd.Series` of boolean objects (True/False). We want to keep only those ones where the value is true."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "code\n",
       "1017122     8\n",
       "1044572     8\n",
       "1050670    10\n",
       "1054593    10\n",
       "1057013     8\n",
       "           ..\n",
       "736150     10\n",
       "822829      8\n",
       "1253955     8\n",
       "1268952    10\n",
       "1369821    10\n",
       "Name: clump_thickness, Length: 128, dtype: int64"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "clump_thick_series.loc[clump_thick_series > 7]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can avoid using `loc` in this task but to be honest I rather use it. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "code\n",
       "1017122     8\n",
       "1044572     8\n",
       "1050670    10\n",
       "1054593    10\n",
       "1057013     8\n",
       "           ..\n",
       "736150     10\n",
       "822829      8\n",
       "1253955     8\n",
       "1268952    10\n",
       "1369821    10\n",
       "Name: clump_thickness, Length: 128, dtype: int64"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "clump_thick_series[clump_thick_series > 7]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "However, my favorite version is using a functional approach with the function `lambda`. It is less intuitive at the beginning but it allows you to concatenate operations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "code\n",
       "1017122     8\n",
       "1044572     8\n",
       "1050670    10\n",
       "1054593    10\n",
       "1057013     8\n",
       "           ..\n",
       "736150     10\n",
       "822829      8\n",
       "1253955     8\n",
       "1268952    10\n",
       "1369821    10\n",
       "Name: clump_thickness, Length: 128, dtype: int64"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "clump_thick_series.loc[lambda x: x > 7]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## DataFrames"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`pd.DataFrame` are 2-dimensional arrays with horizontal and vertical labels (_indexes_ and _columns_). It is the natural extension of `pd.Series` and you can even think they are a multiple `pd.Series` concatenated."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "pandas.core.frame.DataFrame"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type(breast_cancer_data)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are a few useful methods for exploring the data, let's explore some of them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(699, 10)"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "breast_cancer_data.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>clump_thickness</th>\n",
       "      <th>uniformity_cell_size</th>\n",
       "      <th>uniformity_cell_shape</th>\n",
       "      <th>marginal_adhesion</th>\n",
       "      <th>single_epithelial_cell_size</th>\n",
       "      <th>bare_nuclei</th>\n",
       "      <th>bland_chromatin</th>\n",
       "      <th>normal_cucleoli</th>\n",
       "      <th>mitoses</th>\n",
       "      <th>class</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>code</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1000025</th>\n",
       "      <td>5</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1002945</th>\n",
       "      <td>5</td>\n",
       "      <td>4</td>\n",
       "      <td>4</td>\n",
       "      <td>5</td>\n",
       "      <td>7</td>\n",
       "      <td>10</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1015425</th>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1016277</th>\n",
       "      <td>6</td>\n",
       "      <td>8</td>\n",
       "      <td>8</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>4</td>\n",
       "      <td>3</td>\n",
       "      <td>7</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1017023</th>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         clump_thickness  uniformity_cell_size  uniformity_cell_shape  \\\n",
       "code                                                                    \n",
       "1000025                5                     1                      1   \n",
       "1002945                5                     4                      4   \n",
       "1015425                3                     1                      1   \n",
       "1016277                6                     8                      8   \n",
       "1017023                4                     1                      1   \n",
       "\n",
       "         marginal_adhesion  single_epithelial_cell_size bare_nuclei  \\\n",
       "code                                                                  \n",
       "1000025                  1                            2           1   \n",
       "1002945                  5                            7          10   \n",
       "1015425                  1                            2           2   \n",
       "1016277                  1                            3           4   \n",
       "1017023                  3                            2           1   \n",
       "\n",
       "         bland_chromatin  normal_cucleoli  mitoses  class  \n",
       "code                                                       \n",
       "1000025                3                1        1      2  \n",
       "1002945                3                2        1      2  \n",
       "1015425                3                1        1      2  \n",
       "1016277                3                7        1      2  \n",
       "1017023                3                1        1      2  "
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "breast_cancer_data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>clump_thickness</th>\n",
       "      <th>uniformity_cell_size</th>\n",
       "      <th>uniformity_cell_shape</th>\n",
       "      <th>marginal_adhesion</th>\n",
       "      <th>single_epithelial_cell_size</th>\n",
       "      <th>bare_nuclei</th>\n",
       "      <th>bland_chromatin</th>\n",
       "      <th>normal_cucleoli</th>\n",
       "      <th>mitoses</th>\n",
       "      <th>class</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>code</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>776715</th>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>841769</th>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>888820</th>\n",
       "      <td>5</td>\n",
       "      <td>10</td>\n",
       "      <td>10</td>\n",
       "      <td>3</td>\n",
       "      <td>7</td>\n",
       "      <td>3</td>\n",
       "      <td>8</td>\n",
       "      <td>10</td>\n",
       "      <td>2</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>897471</th>\n",
       "      <td>4</td>\n",
       "      <td>8</td>\n",
       "      <td>6</td>\n",
       "      <td>4</td>\n",
       "      <td>3</td>\n",
       "      <td>4</td>\n",
       "      <td>10</td>\n",
       "      <td>6</td>\n",
       "      <td>1</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>897471</th>\n",
       "      <td>4</td>\n",
       "      <td>8</td>\n",
       "      <td>8</td>\n",
       "      <td>5</td>\n",
       "      <td>4</td>\n",
       "      <td>5</td>\n",
       "      <td>10</td>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        clump_thickness  uniformity_cell_size  uniformity_cell_shape  \\\n",
       "code                                                                   \n",
       "776715                3                     1                      1   \n",
       "841769                2                     1                      1   \n",
       "888820                5                    10                     10   \n",
       "897471                4                     8                      6   \n",
       "897471                4                     8                      8   \n",
       "\n",
       "        marginal_adhesion  single_epithelial_cell_size bare_nuclei  \\\n",
       "code                                                                 \n",
       "776715                  1                            3           2   \n",
       "841769                  1                            2           1   \n",
       "888820                  3                            7           3   \n",
       "897471                  4                            3           4   \n",
       "897471                  5                            4           5   \n",
       "\n",
       "        bland_chromatin  normal_cucleoli  mitoses  class  \n",
       "code                                                      \n",
       "776715                1                1        1      2  \n",
       "841769                1                1        1      2  \n",
       "888820                8               10        2      4  \n",
       "897471               10                6        1      4  \n",
       "897471               10                4        1      4  "
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "breast_cancer_data.tail()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "clump_thickness                10\n",
       "uniformity_cell_size           10\n",
       "uniformity_cell_shape          10\n",
       "marginal_adhesion              10\n",
       "single_epithelial_cell_size    10\n",
       "bare_nuclei                     ?\n",
       "bland_chromatin                10\n",
       "normal_cucleoli                10\n",
       "mitoses                        10\n",
       "class                           4\n",
       "dtype: object"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "breast_cancer_data.max()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/tmp/ipykernel_6822/3888804924.py:1: FutureWarning: The default value of numeric_only in DataFrame.mean is deprecated. In a future version, it will default to False. In addition, specifying 'numeric_only=None' is deprecated. Select only valid columns or specify the value of numeric_only to silence this warning.\n",
      "  breast_cancer_data.mean()\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "clump_thickness                4.417740\n",
       "uniformity_cell_size           3.134478\n",
       "uniformity_cell_shape          3.207439\n",
       "marginal_adhesion              2.806867\n",
       "single_epithelial_cell_size    3.216023\n",
       "bland_chromatin                3.437768\n",
       "normal_cucleoli                2.866953\n",
       "mitoses                        1.589413\n",
       "class                          2.689557\n",
       "dtype: float64"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "breast_cancer_data.mean()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can inspectionate values using `loc` as well"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "clump_thickness                10\n",
       "uniformity_cell_size            3\n",
       "uniformity_cell_shape           5\n",
       "marginal_adhesion               1\n",
       "single_epithelial_cell_size    10\n",
       "bare_nuclei                     5\n",
       "bland_chromatin                 3\n",
       "normal_cucleoli                10\n",
       "mitoses                         2\n",
       "class                           4\n",
       "Name: 1166654, dtype: object"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "breast_cancer_data.loc[1166654]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "However, you shouldn't use a double `loc`. Technical reason [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#why-does-assignment-fail-when-using-chained-indexing)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "10"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "breast_cancer_data.loc[1166654].loc[\"clump_thickness\"]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Just use a double index notation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "10"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "breast_cancer_data.loc[1166654, \"clump_thickness\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "clump_thickness    10\n",
       "class               4\n",
       "Name: 1166654, dtype: object"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "breast_cancer_data.loc[1166654, [\"clump_thickness\", \"class\"]]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>clump_thickness</th>\n",
       "      <th>class</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>code</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1166654</th>\n",
       "      <td>10</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1178580</th>\n",
       "      <td>5</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         clump_thickness  class\n",
       "code                           \n",
       "1166654               10      4\n",
       "1178580                5      2"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "breast_cancer_data.loc[[1166654, 1178580], [\"clump_thickness\", \"class\"]]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Boolean masks also work"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>clump_thickness</th>\n",
       "      <th>uniformity_cell_size</th>\n",
       "      <th>uniformity_cell_shape</th>\n",
       "      <th>marginal_adhesion</th>\n",
       "      <th>single_epithelial_cell_size</th>\n",
       "      <th>bare_nuclei</th>\n",
       "      <th>bland_chromatin</th>\n",
       "      <th>normal_cucleoli</th>\n",
       "      <th>mitoses</th>\n",
       "      <th>class</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>code</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1017122</th>\n",
       "      <td>8</td>\n",
       "      <td>10</td>\n",
       "      <td>10</td>\n",
       "      <td>8</td>\n",
       "      <td>7</td>\n",
       "      <td>10</td>\n",
       "      <td>9</td>\n",
       "      <td>7</td>\n",
       "      <td>1</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1044572</th>\n",
       "      <td>8</td>\n",
       "      <td>7</td>\n",
       "      <td>5</td>\n",
       "      <td>10</td>\n",
       "      <td>7</td>\n",
       "      <td>9</td>\n",
       "      <td>5</td>\n",
       "      <td>5</td>\n",
       "      <td>4</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1050670</th>\n",
       "      <td>10</td>\n",
       "      <td>7</td>\n",
       "      <td>7</td>\n",
       "      <td>6</td>\n",
       "      <td>4</td>\n",
       "      <td>10</td>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1054593</th>\n",
       "      <td>10</td>\n",
       "      <td>5</td>\n",
       "      <td>5</td>\n",
       "      <td>3</td>\n",
       "      <td>6</td>\n",
       "      <td>7</td>\n",
       "      <td>7</td>\n",
       "      <td>10</td>\n",
       "      <td>1</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1057013</th>\n",
       "      <td>8</td>\n",
       "      <td>4</td>\n",
       "      <td>5</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>?</td>\n",
       "      <td>7</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>736150</th>\n",
       "      <td>10</td>\n",
       "      <td>4</td>\n",
       "      <td>3</td>\n",
       "      <td>10</td>\n",
       "      <td>3</td>\n",
       "      <td>10</td>\n",
       "      <td>7</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>822829</th>\n",
       "      <td>8</td>\n",
       "      <td>10</td>\n",
       "      <td>10</td>\n",
       "      <td>10</td>\n",
       "      <td>6</td>\n",
       "      <td>10</td>\n",
       "      <td>10</td>\n",
       "      <td>10</td>\n",
       "      <td>10</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1253955</th>\n",
       "      <td>8</td>\n",
       "      <td>7</td>\n",
       "      <td>4</td>\n",
       "      <td>4</td>\n",
       "      <td>5</td>\n",
       "      <td>3</td>\n",
       "      <td>5</td>\n",
       "      <td>10</td>\n",
       "      <td>1</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1268952</th>\n",
       "      <td>10</td>\n",
       "      <td>10</td>\n",
       "      <td>7</td>\n",
       "      <td>8</td>\n",
       "      <td>7</td>\n",
       "      <td>1</td>\n",
       "      <td>10</td>\n",
       "      <td>10</td>\n",
       "      <td>3</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1369821</th>\n",
       "      <td>10</td>\n",
       "      <td>10</td>\n",
       "      <td>10</td>\n",
       "      <td>10</td>\n",
       "      <td>5</td>\n",
       "      <td>10</td>\n",
       "      <td>10</td>\n",
       "      <td>10</td>\n",
       "      <td>7</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>129 rows × 10 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "         clump_thickness  uniformity_cell_size  uniformity_cell_shape  \\\n",
       "code                                                                    \n",
       "1017122                8                    10                     10   \n",
       "1044572                8                     7                      5   \n",
       "1050670               10                     7                      7   \n",
       "1054593               10                     5                      5   \n",
       "1057013                8                     4                      5   \n",
       "...                  ...                   ...                    ...   \n",
       "736150                10                     4                      3   \n",
       "822829                 8                    10                     10   \n",
       "1253955                8                     7                      4   \n",
       "1268952               10                    10                      7   \n",
       "1369821               10                    10                     10   \n",
       "\n",
       "         marginal_adhesion  single_epithelial_cell_size bare_nuclei  \\\n",
       "code                                                                  \n",
       "1017122                  8                            7          10   \n",
       "1044572                 10                            7           9   \n",
       "1050670                  6                            4          10   \n",
       "1054593                  3                            6           7   \n",
       "1057013                  1                            2           ?   \n",
       "...                    ...                          ...         ...   \n",
       "736150                  10                            3          10   \n",
       "822829                  10                            6          10   \n",
       "1253955                  4                            5           3   \n",
       "1268952                  8                            7           1   \n",
       "1369821                 10                            5          10   \n",
       "\n",
       "         bland_chromatin  normal_cucleoli  mitoses  class  \n",
       "code                                                       \n",
       "1017122                9                7        1      4  \n",
       "1044572                5                5        4      4  \n",
       "1050670                4                1        2      4  \n",
       "1054593                7               10        1      4  \n",
       "1057013                7                3        1      4  \n",
       "...                  ...              ...      ...    ...  \n",
       "736150                 7                1        2      4  \n",
       "822829                10               10       10      4  \n",
       "1253955                5               10        1      4  \n",
       "1268952               10               10        3      4  \n",
       "1369821               10               10        7      4  \n",
       "\n",
       "[129 rows x 10 columns]"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "breast_cancer_data.loc[lambda x: x[\"clump_thickness\"] > 7]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Or getting a specific column"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "code\n",
       "1000025     1\n",
       "1002945    10\n",
       "1015425     2\n",
       "1016277     4\n",
       "1017023     1\n",
       "           ..\n",
       "776715      2\n",
       "841769      1\n",
       "888820      3\n",
       "897471      4\n",
       "897471      5\n",
       "Name: bare_nuclei, Length: 699, dtype: object"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "breast_cancer_data.loc[:, \"bare_nuclei\"]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "However, you can also access directly to a column without `loc`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "code\n",
       "1000025     1\n",
       "1002945    10\n",
       "1015425     2\n",
       "1016277     4\n",
       "1017023     1\n",
       "           ..\n",
       "776715      2\n",
       "841769      1\n",
       "888820      3\n",
       "897471      4\n",
       "897471      5\n",
       "Name: bare_nuclei, Length: 699, dtype: object"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "breast_cancer_data[\"bare_nuclei\"]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are some cool methods you can use for exploring your data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1     402\n",
       "10    132\n",
       "2      30\n",
       "5      30\n",
       "3      28\n",
       "8      21\n",
       "4      19\n",
       "?      16\n",
       "9       9\n",
       "7       8\n",
       "6       4\n",
       "Name: bare_nuclei, dtype: int64"
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "breast_cancer_data.loc[:, \"bare_nuclei\"].value_counts()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What about with those `?` values?\n",
    "\n",
    "They are representing a missing value!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>clump_thickness</th>\n",
       "      <th>uniformity_cell_size</th>\n",
       "      <th>uniformity_cell_shape</th>\n",
       "      <th>marginal_adhesion</th>\n",
       "      <th>single_epithelial_cell_size</th>\n",
       "      <th>bare_nuclei</th>\n",
       "      <th>bland_chromatin</th>\n",
       "      <th>normal_cucleoli</th>\n",
       "      <th>mitoses</th>\n",
       "      <th>class</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>code</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1057013</th>\n",
       "      <td>8</td>\n",
       "      <td>4</td>\n",
       "      <td>5</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>?</td>\n",
       "      <td>7</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1096800</th>\n",
       "      <td>6</td>\n",
       "      <td>6</td>\n",
       "      <td>6</td>\n",
       "      <td>9</td>\n",
       "      <td>6</td>\n",
       "      <td>?</td>\n",
       "      <td>7</td>\n",
       "      <td>8</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1183246</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>?</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1184840</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>?</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1193683</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>?</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1197510</th>\n",
       "      <td>5</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>?</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1241232</th>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>?</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>169356</th>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>?</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>432809</th>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>?</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>563649</th>\n",
       "      <td>8</td>\n",
       "      <td>8</td>\n",
       "      <td>8</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>?</td>\n",
       "      <td>6</td>\n",
       "      <td>10</td>\n",
       "      <td>1</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>606140</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>?</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>61634</th>\n",
       "      <td>5</td>\n",
       "      <td>4</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>?</td>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>704168</th>\n",
       "      <td>4</td>\n",
       "      <td>6</td>\n",
       "      <td>5</td>\n",
       "      <td>6</td>\n",
       "      <td>7</td>\n",
       "      <td>?</td>\n",
       "      <td>4</td>\n",
       "      <td>9</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>733639</th>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>?</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1238464</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>?</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1057067</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>?</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         clump_thickness  uniformity_cell_size  uniformity_cell_shape  \\\n",
       "code                                                                    \n",
       "1057013                8                     4                      5   \n",
       "1096800                6                     6                      6   \n",
       "1183246                1                     1                      1   \n",
       "1184840                1                     1                      3   \n",
       "1193683                1                     1                      2   \n",
       "1197510                5                     1                      1   \n",
       "1241232                3                     1                      4   \n",
       "169356                 3                     1                      1   \n",
       "432809                 3                     1                      3   \n",
       "563649                 8                     8                      8   \n",
       "606140                 1                     1                      1   \n",
       "61634                  5                     4                      3   \n",
       "704168                 4                     6                      5   \n",
       "733639                 3                     1                      1   \n",
       "1238464                1                     1                      1   \n",
       "1057067                1                     1                      1   \n",
       "\n",
       "         marginal_adhesion  single_epithelial_cell_size bare_nuclei  \\\n",
       "code                                                                  \n",
       "1057013                  1                            2           ?   \n",
       "1096800                  9                            6           ?   \n",
       "1183246                  1                            1           ?   \n",
       "1184840                  1                            2           ?   \n",
       "1193683                  1                            3           ?   \n",
       "1197510                  1                            2           ?   \n",
       "1241232                  1                            2           ?   \n",
       "169356                   1                            2           ?   \n",
       "432809                   1                            2           ?   \n",
       "563649                   1                            2           ?   \n",
       "606140                   1                            2           ?   \n",
       "61634                    1                            2           ?   \n",
       "704168                   6                            7           ?   \n",
       "733639                   1                            2           ?   \n",
       "1238464                  1                            1           ?   \n",
       "1057067                  1                            1           ?   \n",
       "\n",
       "         bland_chromatin  normal_cucleoli  mitoses  class  \n",
       "code                                                       \n",
       "1057013                7                3        1      4  \n",
       "1096800                7                8        1      2  \n",
       "1183246                2                1        1      2  \n",
       "1184840                2                1        1      2  \n",
       "1193683                1                1        1      2  \n",
       "1197510                3                1        1      2  \n",
       "1241232                3                1        1      2  \n",
       "169356                 3                1        1      2  \n",
       "432809                 2                1        1      2  \n",
       "563649                 6               10        1      4  \n",
       "606140                 2                1        1      2  \n",
       "61634                  2                3        1      2  \n",
       "704168                 4                9        1      2  \n",
       "733639                 3                1        1      2  \n",
       "1238464                2                1        1      2  \n",
       "1057067                1                1        1      2  "
      ]
     },
     "execution_count": 40,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "breast_cancer_data.loc[lambda s: s['bare_nuclei'] == '?']"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`pandas` has a specific object for denoting null values, `pd.NA`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [],
   "source": [
    "breast_cancer_data.loc[lambda s: s['bare_nuclei'] == '?',  'bare_nuclei'] = pd.NA"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```{tip}\n",
    "Same result but way more elegant is achieved with the following code line `breast_cancer_data.replace({'bare_nuclei': {\"?\": pd.NA}})`\n",
    "```"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's see the rows with null values\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>clump_thickness</th>\n",
       "      <th>uniformity_cell_size</th>\n",
       "      <th>uniformity_cell_shape</th>\n",
       "      <th>marginal_adhesion</th>\n",
       "      <th>single_epithelial_cell_size</th>\n",
       "      <th>bare_nuclei</th>\n",
       "      <th>bland_chromatin</th>\n",
       "      <th>normal_cucleoli</th>\n",
       "      <th>mitoses</th>\n",
       "      <th>class</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>code</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "Empty DataFrame\n",
       "Columns: [clump_thickness, uniformity_cell_size, uniformity_cell_shape, marginal_adhesion, single_epithelial_cell_size, bare_nuclei, bland_chromatin, normal_cucleoli, mitoses, class]\n",
       "Index: []"
      ]
     },
     "execution_count": 42,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "breast_cancer_data.loc[lambda s: s['bare_nuclei'] == pd.NA]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Wait a second, why is it not showing me the null values? Null values have weird behaviors in Python."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<NA>"
      ]
     },
     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.NA == pd.NA"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```{important}\n",
    "`pandas` will recognize `None`, `np.na` and `pd.NA` as null values, be careful!\n",
    "```"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are special methods for working with null values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>clump_thickness</th>\n",
       "      <th>uniformity_cell_size</th>\n",
       "      <th>uniformity_cell_shape</th>\n",
       "      <th>marginal_adhesion</th>\n",
       "      <th>single_epithelial_cell_size</th>\n",
       "      <th>bare_nuclei</th>\n",
       "      <th>bland_chromatin</th>\n",
       "      <th>normal_cucleoli</th>\n",
       "      <th>mitoses</th>\n",
       "      <th>class</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>code</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1057013</th>\n",
       "      <td>8</td>\n",
       "      <td>4</td>\n",
       "      <td>5</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>7</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1096800</th>\n",
       "      <td>6</td>\n",
       "      <td>6</td>\n",
       "      <td>6</td>\n",
       "      <td>9</td>\n",
       "      <td>6</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>7</td>\n",
       "      <td>8</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1183246</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1184840</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1193683</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1197510</th>\n",
       "      <td>5</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1241232</th>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>169356</th>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>432809</th>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>563649</th>\n",
       "      <td>8</td>\n",
       "      <td>8</td>\n",
       "      <td>8</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>6</td>\n",
       "      <td>10</td>\n",
       "      <td>1</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>606140</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>61634</th>\n",
       "      <td>5</td>\n",
       "      <td>4</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>704168</th>\n",
       "      <td>4</td>\n",
       "      <td>6</td>\n",
       "      <td>5</td>\n",
       "      <td>6</td>\n",
       "      <td>7</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>4</td>\n",
       "      <td>9</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>733639</th>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1238464</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1057067</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         clump_thickness  uniformity_cell_size  uniformity_cell_shape  \\\n",
       "code                                                                    \n",
       "1057013                8                     4                      5   \n",
       "1096800                6                     6                      6   \n",
       "1183246                1                     1                      1   \n",
       "1184840                1                     1                      3   \n",
       "1193683                1                     1                      2   \n",
       "1197510                5                     1                      1   \n",
       "1241232                3                     1                      4   \n",
       "169356                 3                     1                      1   \n",
       "432809                 3                     1                      3   \n",
       "563649                 8                     8                      8   \n",
       "606140                 1                     1                      1   \n",
       "61634                  5                     4                      3   \n",
       "704168                 4                     6                      5   \n",
       "733639                 3                     1                      1   \n",
       "1238464                1                     1                      1   \n",
       "1057067                1                     1                      1   \n",
       "\n",
       "         marginal_adhesion  single_epithelial_cell_size bare_nuclei  \\\n",
       "code                                                                  \n",
       "1057013                  1                            2        <NA>   \n",
       "1096800                  9                            6        <NA>   \n",
       "1183246                  1                            1        <NA>   \n",
       "1184840                  1                            2        <NA>   \n",
       "1193683                  1                            3        <NA>   \n",
       "1197510                  1                            2        <NA>   \n",
       "1241232                  1                            2        <NA>   \n",
       "169356                   1                            2        <NA>   \n",
       "432809                   1                            2        <NA>   \n",
       "563649                   1                            2        <NA>   \n",
       "606140                   1                            2        <NA>   \n",
       "61634                    1                            2        <NA>   \n",
       "704168                   6                            7        <NA>   \n",
       "733639                   1                            2        <NA>   \n",
       "1238464                  1                            1        <NA>   \n",
       "1057067                  1                            1        <NA>   \n",
       "\n",
       "         bland_chromatin  normal_cucleoli  mitoses  class  \n",
       "code                                                       \n",
       "1057013                7                3        1      4  \n",
       "1096800                7                8        1      2  \n",
       "1183246                2                1        1      2  \n",
       "1184840                2                1        1      2  \n",
       "1193683                1                1        1      2  \n",
       "1197510                3                1        1      2  \n",
       "1241232                3                1        1      2  \n",
       "169356                 3                1        1      2  \n",
       "432809                 2                1        1      2  \n",
       "563649                 6               10        1      4  \n",
       "606140                 2                1        1      2  \n",
       "61634                  2                3        1      2  \n",
       "704168                 4                9        1      2  \n",
       "733639                 3                1        1      2  \n",
       "1238464                2                1        1      2  \n",
       "1057067                1                1        1      2  "
      ]
     },
     "execution_count": 44,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "breast_cancer_data.loc[lambda s: s['bare_nuclei'].isnull()]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Or you can explore for any column if there is any null value"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "clump_thickness                False\n",
       "uniformity_cell_size           False\n",
       "uniformity_cell_shape          False\n",
       "marginal_adhesion              False\n",
       "single_epithelial_cell_size    False\n",
       "bare_nuclei                     True\n",
       "bland_chromatin                False\n",
       "normal_cucleoli                False\n",
       "mitoses                        False\n",
       "class                          False\n",
       "dtype: bool"
      ]
     },
     "execution_count": 45,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "breast_cancer_data.isnull().any()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Or maybe for rows using `axis=1`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "code\n",
       "1000025    False\n",
       "1002945    False\n",
       "1015425    False\n",
       "1016277    False\n",
       "1017023    False\n",
       "           ...  \n",
       "776715     False\n",
       "841769     False\n",
       "888820     False\n",
       "897471     False\n",
       "897471     False\n",
       "Length: 699, dtype: bool"
      ]
     },
     "execution_count": 46,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "breast_cancer_data.isnull().any(axis=1)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Ok, now we will fix the `bare_nuclei` column. Imagine you want to replace the null values with the mean value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "raises-exception"
    ]
   },
   "outputs": [],
   "source": [
    "breast_cancer_data['bare_nuclei'].mean()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Oh no! We need to convert that column to a numeric column"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "code\n",
       "1000025     1.0\n",
       "1002945    10.0\n",
       "1015425     2.0\n",
       "1016277     4.0\n",
       "1017023     1.0\n",
       "           ... \n",
       "776715      2.0\n",
       "841769      1.0\n",
       "888820      3.0\n",
       "897471      4.0\n",
       "897471      5.0\n",
       "Name: bare_nuclei, Length: 699, dtype: float64"
      ]
     },
     "execution_count": 52,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.to_numeric(breast_cancer_data['bare_nuclei'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {},
   "outputs": [],
   "source": [
    "breast_cancer_data['bare_nuclei'] = pd.to_numeric(breast_cancer_data['bare_nuclei'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Other option could have been"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {},
   "outputs": [],
   "source": [
    "breast_cancer_data = breast_cancer_data.assign(\n",
    "    bare_nuclei=lambda x: pd.to_numeric(x['bare_nuclei'])\n",
    ")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "I like this last one better, but don't worry!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "3.5446559297218156"
      ]
     },
     "execution_count": 56,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "breast_cancer_data['bare_nuclei'].mean()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, every value is a integer, so we should convert this value to a integer, you should ask to the experts what makes more sense. Let's say it is better to approximate this value to the a bigger integer.\n",
    "\n",
    "There is a scientific computing package called `numpy` that we don't have time to cover but you should check it out."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "4.0"
      ]
     },
     "execution_count": 58,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bare_nuclei_mean = np.ceil(breast_cancer_data['bare_nuclei'].mean())\n",
    "bare_nuclei_mean"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, as an example, let's think we want to fill those null values with the mean value of the column.\n",
    "\n",
    "If you are wondering if there is any method for this the answer is yes!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[0;31mSignature:\u001b[0m\n",
      "\u001b[0mpd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mDataFrame\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfillna\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\u001b[0m\n",
      "\u001b[0;34m\u001b[0m    \u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
      "\u001b[0;34m\u001b[0m    \u001b[0mvalue\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'Hashable | Mapping | Series | DataFrame'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
      "\u001b[0;34m\u001b[0m    \u001b[0;34m*\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
      "\u001b[0;34m\u001b[0m    \u001b[0mmethod\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'FillnaOptions | None'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
      "\u001b[0;34m\u001b[0m    \u001b[0maxis\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'Axis | None'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
      "\u001b[0;34m\u001b[0m    \u001b[0minplace\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'bool'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mFalse\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
      "\u001b[0;34m\u001b[0m    \u001b[0mlimit\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'int | None'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
      "\u001b[0;34m\u001b[0m    \u001b[0mdowncast\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'dict | None'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
      "\u001b[0;34m\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m->\u001b[0m \u001b[0;34m'DataFrame | None'\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;31mDocstring:\u001b[0m\n",
      "Fill NA/NaN values using the specified method.\n",
      "\n",
      "Parameters\n",
      "----------\n",
      "value : scalar, dict, Series, or DataFrame\n",
      "    Value to use to fill holes (e.g. 0), alternately a\n",
      "    dict/Series/DataFrame of values specifying which value to use for\n",
      "    each index (for a Series) or column (for a DataFrame).  Values not\n",
      "    in the dict/Series/DataFrame will not be filled. This value cannot\n",
      "    be a list.\n",
      "method : {'backfill', 'bfill', 'pad', 'ffill', None}, default None\n",
      "    Method to use for filling holes in reindexed Series\n",
      "    pad / ffill: propagate last valid observation forward to next valid\n",
      "    backfill / bfill: use next valid observation to fill gap.\n",
      "axis : {0 or 'index', 1 or 'columns'}\n",
      "    Axis along which to fill missing values. For `Series`\n",
      "    this parameter is unused and defaults to 0.\n",
      "inplace : bool, default False\n",
      "    If True, fill in-place. Note: this will modify any\n",
      "    other views on this object (e.g., a no-copy slice for a column in a\n",
      "    DataFrame).\n",
      "limit : int, default None\n",
      "    If method is specified, this is the maximum number of consecutive\n",
      "    NaN values to forward/backward fill. In other words, if there is\n",
      "    a gap with more than this number of consecutive NaNs, it will only\n",
      "    be partially filled. If method is not specified, this is the\n",
      "    maximum number of entries along the entire axis where NaNs will be\n",
      "    filled. Must be greater than 0 if not None.\n",
      "downcast : dict, default is None\n",
      "    A dict of item->dtype of what to downcast if possible,\n",
      "    or the string 'infer' which will try to downcast to an appropriate\n",
      "    equal type (e.g. float64 to int64 if possible).\n",
      "\n",
      "Returns\n",
      "-------\n",
      "DataFrame or None\n",
      "    Object with missing values filled or None if ``inplace=True``.\n",
      "\n",
      "See Also\n",
      "--------\n",
      "interpolate : Fill NaN values using interpolation.\n",
      "reindex : Conform object to new index.\n",
      "asfreq : Convert TimeSeries to specified frequency.\n",
      "\n",
      "Examples\n",
      "--------\n",
      ">>> df = pd.DataFrame([[np.nan, 2, np.nan, 0],\n",
      "...                    [3, 4, np.nan, 1],\n",
      "...                    [np.nan, np.nan, np.nan, np.nan],\n",
      "...                    [np.nan, 3, np.nan, 4]],\n",
      "...                   columns=list(\"ABCD\"))\n",
      ">>> df\n",
      "     A    B   C    D\n",
      "0  NaN  2.0 NaN  0.0\n",
      "1  3.0  4.0 NaN  1.0\n",
      "2  NaN  NaN NaN  NaN\n",
      "3  NaN  3.0 NaN  4.0\n",
      "\n",
      "Replace all NaN elements with 0s.\n",
      "\n",
      ">>> df.fillna(0)\n",
      "     A    B    C    D\n",
      "0  0.0  2.0  0.0  0.0\n",
      "1  3.0  4.0  0.0  1.0\n",
      "2  0.0  0.0  0.0  0.0\n",
      "3  0.0  3.0  0.0  4.0\n",
      "\n",
      "We can also propagate non-null values forward or backward.\n",
      "\n",
      ">>> df.fillna(method=\"ffill\")\n",
      "     A    B   C    D\n",
      "0  NaN  2.0 NaN  0.0\n",
      "1  3.0  4.0 NaN  1.0\n",
      "2  3.0  4.0 NaN  1.0\n",
      "3  3.0  3.0 NaN  4.0\n",
      "\n",
      "Replace all NaN elements in column 'A', 'B', 'C', and 'D', with 0, 1,\n",
      "2, and 3 respectively.\n",
      "\n",
      ">>> values = {\"A\": 0, \"B\": 1, \"C\": 2, \"D\": 3}\n",
      ">>> df.fillna(value=values)\n",
      "     A    B    C    D\n",
      "0  0.0  2.0  2.0  0.0\n",
      "1  3.0  4.0  2.0  1.0\n",
      "2  0.0  1.0  2.0  3.0\n",
      "3  0.0  3.0  2.0  4.0\n",
      "\n",
      "Only replace the first NaN element.\n",
      "\n",
      ">>> df.fillna(value=values, limit=1)\n",
      "     A    B    C    D\n",
      "0  0.0  2.0  2.0  0.0\n",
      "1  3.0  4.0  NaN  1.0\n",
      "2  NaN  1.0  NaN  3.0\n",
      "3  NaN  3.0  NaN  4.0\n",
      "\n",
      "When filling using a DataFrame, replacement happens along\n",
      "the same column names and same indices\n",
      "\n",
      ">>> df2 = pd.DataFrame(np.zeros((4, 4)), columns=list(\"ABCE\"))\n",
      ">>> df.fillna(df2)\n",
      "     A    B    C    D\n",
      "0  0.0  2.0  0.0  0.0\n",
      "1  3.0  4.0  0.0  1.0\n",
      "2  0.0  0.0  0.0  NaN\n",
      "3  0.0  3.0  0.0  4.0\n",
      "\n",
      "Note that column D is not affected since it is not present in df2.\n",
      "\u001b[0;31mFile:\u001b[0m      ~/mambaforge/envs/casbbi-nrt-ds/lib/python3.11/site-packages/pandas/core/frame.py\n",
      "\u001b[0;31mType:\u001b[0m      function"
     ]
    }
   ],
   "source": [
    "pd.DataFrame.fillna?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>clump_thickness</th>\n",
       "      <th>uniformity_cell_size</th>\n",
       "      <th>uniformity_cell_shape</th>\n",
       "      <th>marginal_adhesion</th>\n",
       "      <th>single_epithelial_cell_size</th>\n",
       "      <th>bare_nuclei</th>\n",
       "      <th>bland_chromatin</th>\n",
       "      <th>normal_cucleoli</th>\n",
       "      <th>mitoses</th>\n",
       "      <th>class</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>code</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1000025</th>\n",
       "      <td>5</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>1.0</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1002945</th>\n",
       "      <td>5</td>\n",
       "      <td>4</td>\n",
       "      <td>4</td>\n",
       "      <td>5</td>\n",
       "      <td>7</td>\n",
       "      <td>10.0</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1015425</th>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>2.0</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1016277</th>\n",
       "      <td>6</td>\n",
       "      <td>8</td>\n",
       "      <td>8</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>4.0</td>\n",
       "      <td>3</td>\n",
       "      <td>7</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1017023</th>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>1.0</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>776715</th>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>2.0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>841769</th>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>888820</th>\n",
       "      <td>5</td>\n",
       "      <td>10</td>\n",
       "      <td>10</td>\n",
       "      <td>3</td>\n",
       "      <td>7</td>\n",
       "      <td>3.0</td>\n",
       "      <td>8</td>\n",
       "      <td>10</td>\n",
       "      <td>2</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>897471</th>\n",
       "      <td>4</td>\n",
       "      <td>8</td>\n",
       "      <td>6</td>\n",
       "      <td>4</td>\n",
       "      <td>3</td>\n",
       "      <td>4.0</td>\n",
       "      <td>10</td>\n",
       "      <td>6</td>\n",
       "      <td>1</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>897471</th>\n",
       "      <td>4</td>\n",
       "      <td>8</td>\n",
       "      <td>8</td>\n",
       "      <td>5</td>\n",
       "      <td>4</td>\n",
       "      <td>5.0</td>\n",
       "      <td>10</td>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>699 rows × 10 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "         clump_thickness  uniformity_cell_size  uniformity_cell_shape  \\\n",
       "code                                                                    \n",
       "1000025                5                     1                      1   \n",
       "1002945                5                     4                      4   \n",
       "1015425                3                     1                      1   \n",
       "1016277                6                     8                      8   \n",
       "1017023                4                     1                      1   \n",
       "...                  ...                   ...                    ...   \n",
       "776715                 3                     1                      1   \n",
       "841769                 2                     1                      1   \n",
       "888820                 5                    10                     10   \n",
       "897471                 4                     8                      6   \n",
       "897471                 4                     8                      8   \n",
       "\n",
       "         marginal_adhesion  single_epithelial_cell_size  bare_nuclei  \\\n",
       "code                                                                   \n",
       "1000025                  1                            2          1.0   \n",
       "1002945                  5                            7         10.0   \n",
       "1015425                  1                            2          2.0   \n",
       "1016277                  1                            3          4.0   \n",
       "1017023                  3                            2          1.0   \n",
       "...                    ...                          ...          ...   \n",
       "776715                   1                            3          2.0   \n",
       "841769                   1                            2          1.0   \n",
       "888820                   3                            7          3.0   \n",
       "897471                   4                            3          4.0   \n",
       "897471                   5                            4          5.0   \n",
       "\n",
       "         bland_chromatin  normal_cucleoli  mitoses  class  \n",
       "code                                                       \n",
       "1000025                3                1        1      2  \n",
       "1002945                3                2        1      2  \n",
       "1015425                3                1        1      2  \n",
       "1016277                3                7        1      2  \n",
       "1017023                3                1        1      2  \n",
       "...                  ...              ...      ...    ...  \n",
       "776715                 1                1        1      2  \n",
       "841769                 1                1        1      2  \n",
       "888820                 8               10        2      4  \n",
       "897471                10                6        1      4  \n",
       "897471                10                4        1      4  \n",
       "\n",
       "[699 rows x 10 columns]"
      ]
     },
     "execution_count": 60,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "breast_cancer_data.fillna(value={'bare_nuclei': bare_nuclei_mean})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "clump_thickness                False\n",
       "uniformity_cell_size           False\n",
       "uniformity_cell_shape          False\n",
       "marginal_adhesion              False\n",
       "single_epithelial_cell_size    False\n",
       "bare_nuclei                     True\n",
       "bland_chromatin                False\n",
       "normal_cucleoli                False\n",
       "mitoses                        False\n",
       "class                          False\n",
       "dtype: bool"
      ]
     },
     "execution_count": 61,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "breast_cancer_data.isnull().any()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What? There still null values. That is because most of `pandas` functions return a copy of the DataFrame. You have to options\n",
    "\n",
    "* To assign the result to the same variable.\n",
    "* If the method allows it, you can use `inplace=True`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {},
   "outputs": [],
   "source": [
    "breast_cancer_data.fillna(value={'bare_nuclei': bare_nuclei_mean}, inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "clump_thickness                False\n",
       "uniformity_cell_size           False\n",
       "uniformity_cell_shape          False\n",
       "marginal_adhesion              False\n",
       "single_epithelial_cell_size    False\n",
       "bare_nuclei                    False\n",
       "bland_chromatin                False\n",
       "normal_cucleoli                False\n",
       "mitoses                        False\n",
       "class                          False\n",
       "dtype: bool"
      ]
     },
     "execution_count": 63,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "breast_cancer_data.isnull().any()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Summary and next steps"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this session we explore a data set, reading it, understanding its elements and methods. Also we clean the dataset with null values.\n",
    "\n",
    "We didn't have enough time but you should learn about merging datasets, aggreagation, etc.\n",
    "\n",
    "Just a few examples:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>class</th>\n",
       "      <th>cancer</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2</td>\n",
       "      <td>benign</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>4</td>\n",
       "      <td>malignant</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "      <td>unknown</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   class     cancer\n",
       "0      2     benign\n",
       "1      4  malignant\n",
       "2      0    unknown"
      ]
     },
     "execution_count": 66,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cancer_names = pd.DataFrame(\n",
    "    [[2, \"benign\"], [4, \"malignant\"], [0, \"unknown\"]],\n",
    "    columns=[\"class\", \"cancer\"]\n",
    ")\n",
    "cancer_names"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>clump_thickness</th>\n",
       "      <th>uniformity_cell_size</th>\n",
       "      <th>uniformity_cell_shape</th>\n",
       "      <th>marginal_adhesion</th>\n",
       "      <th>single_epithelial_cell_size</th>\n",
       "      <th>bare_nuclei</th>\n",
       "      <th>bland_chromatin</th>\n",
       "      <th>normal_cucleoli</th>\n",
       "      <th>mitoses</th>\n",
       "      <th>class</th>\n",
       "      <th>cancer</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>5</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>1.0</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>benign</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>5</td>\n",
       "      <td>4</td>\n",
       "      <td>4</td>\n",
       "      <td>5</td>\n",
       "      <td>7</td>\n",
       "      <td>10.0</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>benign</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>2.0</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>benign</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>6</td>\n",
       "      <td>8</td>\n",
       "      <td>8</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>4.0</td>\n",
       "      <td>3</td>\n",
       "      <td>7</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>benign</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>1.0</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>benign</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>694</th>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>2.0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>benign</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>695</th>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>benign</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>696</th>\n",
       "      <td>5</td>\n",
       "      <td>10</td>\n",
       "      <td>10</td>\n",
       "      <td>3</td>\n",
       "      <td>7</td>\n",
       "      <td>3.0</td>\n",
       "      <td>8</td>\n",
       "      <td>10</td>\n",
       "      <td>2</td>\n",
       "      <td>4</td>\n",
       "      <td>malignant</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>697</th>\n",
       "      <td>4</td>\n",
       "      <td>8</td>\n",
       "      <td>6</td>\n",
       "      <td>4</td>\n",
       "      <td>3</td>\n",
       "      <td>4.0</td>\n",
       "      <td>10</td>\n",
       "      <td>6</td>\n",
       "      <td>1</td>\n",
       "      <td>4</td>\n",
       "      <td>malignant</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>698</th>\n",
       "      <td>4</td>\n",
       "      <td>8</td>\n",
       "      <td>8</td>\n",
       "      <td>5</td>\n",
       "      <td>4</td>\n",
       "      <td>5.0</td>\n",
       "      <td>10</td>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>4</td>\n",
       "      <td>malignant</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>699 rows × 11 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "     clump_thickness  uniformity_cell_size  uniformity_cell_shape  \\\n",
       "0                  5                     1                      1   \n",
       "1                  5                     4                      4   \n",
       "2                  3                     1                      1   \n",
       "3                  6                     8                      8   \n",
       "4                  4                     1                      1   \n",
       "..               ...                   ...                    ...   \n",
       "694                3                     1                      1   \n",
       "695                2                     1                      1   \n",
       "696                5                    10                     10   \n",
       "697                4                     8                      6   \n",
       "698                4                     8                      8   \n",
       "\n",
       "     marginal_adhesion  single_epithelial_cell_size  bare_nuclei  \\\n",
       "0                    1                            2          1.0   \n",
       "1                    5                            7         10.0   \n",
       "2                    1                            2          2.0   \n",
       "3                    1                            3          4.0   \n",
       "4                    3                            2          1.0   \n",
       "..                 ...                          ...          ...   \n",
       "694                  1                            3          2.0   \n",
       "695                  1                            2          1.0   \n",
       "696                  3                            7          3.0   \n",
       "697                  4                            3          4.0   \n",
       "698                  5                            4          5.0   \n",
       "\n",
       "     bland_chromatin  normal_cucleoli  mitoses  class     cancer  \n",
       "0                  3                1        1      2     benign  \n",
       "1                  3                2        1      2     benign  \n",
       "2                  3                1        1      2     benign  \n",
       "3                  3                7        1      2     benign  \n",
       "4                  3                1        1      2     benign  \n",
       "..               ...              ...      ...    ...        ...  \n",
       "694                1                1        1      2     benign  \n",
       "695                1                1        1      2     benign  \n",
       "696                8               10        2      4  malignant  \n",
       "697               10                6        1      4  malignant  \n",
       "698               10                4        1      4  malignant  \n",
       "\n",
       "[699 rows x 11 columns]"
      ]
     },
     "execution_count": 68,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "breast_cancer_data2 = breast_cancer_data.merge(\n",
    "    cancer_names,\n",
    "    how=\"left\",\n",
    "    on=\"class\"\n",
    ")\n",
    "breast_cancer_data2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['benign', 'malignant'], dtype=object)"
      ]
     },
     "execution_count": 69,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "breast_cancer_data2[\"cancer\"].unique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>cancer</th>\n",
       "      <th>benign</th>\n",
       "      <th>malignant</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>clump_thickness</th>\n",
       "      <td>2.956332</td>\n",
       "      <td>7.195021</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>uniformity_cell_size</th>\n",
       "      <td>1.325328</td>\n",
       "      <td>6.572614</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>uniformity_cell_shape</th>\n",
       "      <td>1.443231</td>\n",
       "      <td>6.560166</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>marginal_adhesion</th>\n",
       "      <td>1.364629</td>\n",
       "      <td>5.547718</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>single_epithelial_cell_size</th>\n",
       "      <td>2.120087</td>\n",
       "      <td>5.298755</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>bare_nuclei</th>\n",
       "      <td>1.427948</td>\n",
       "      <td>7.597510</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>bland_chromatin</th>\n",
       "      <td>2.100437</td>\n",
       "      <td>5.979253</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>normal_cucleoli</th>\n",
       "      <td>1.290393</td>\n",
       "      <td>5.863071</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mitoses</th>\n",
       "      <td>1.063319</td>\n",
       "      <td>2.589212</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>class</th>\n",
       "      <td>2.000000</td>\n",
       "      <td>4.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "cancer                         benign  malignant\n",
       "clump_thickness              2.956332   7.195021\n",
       "uniformity_cell_size         1.325328   6.572614\n",
       "uniformity_cell_shape        1.443231   6.560166\n",
       "marginal_adhesion            1.364629   5.547718\n",
       "single_epithelial_cell_size  2.120087   5.298755\n",
       "bare_nuclei                  1.427948   7.597510\n",
       "bland_chromatin              2.100437   5.979253\n",
       "normal_cucleoli              1.290393   5.863071\n",
       "mitoses                      1.063319   2.589212\n",
       "class                        2.000000   4.000000"
      ]
     },
     "execution_count": 71,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "breast_cancer_data2.groupby(\"cancer\").mean().T"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A very good place to learn is in the official [user guide](https://pandas.pydata.org/pandas-docs/stable/user_guide/index.html)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "casbbi-nrt-ds",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.0"
  },
  "orig_nbformat": 4,
  "vscode": {
   "interpreter": {
    "hash": "aaa63b10875fd439d6e29c7af3cfc11b4f81a200a31b26fa6fffddaf9fa68644"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}