{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data Analysis"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The main goal of this class is to learn how to gather, explore, clean and analyze different types of datasets.\n",
"\n",
"We will introduce some data analysis common tasks using the `pandas` package."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"> `pandas` is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.\n",
">\n",
"> -- https://pandas.pydata.org --"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# from pathlib import Path # Run this line if you are working in a local environment"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Dataset: Breast Cancer Wisconsin"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.\n",
"\n",
"This breast cancer databases was obtained from the University of Wisconsin Hospitals, Madison from Dr. William H. Wolberg.\n",
"\n",
"Source: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29\n",
"\n",
"\n",
"| Attribute | Domain |\n",
"|-----------------------------|---------|\n",
"| Sample code number | id number |\n",
"| Clump Thickness | 1 - 10 |\n",
"| Uniformity of Cell Size | 1 - 10 |\n",
"| Uniformity of Cell Shape | 1 - 10 |\n",
"| Marginal Adhesion | 1 - 10 |\n",
"| Single Epithelial Cell Size | 1 - 10 |\n",
"| Bare Nuclei | 1 - 10 |\n",
"| Bland Chromatin | 1 - 10 |\n",
"| Normal Nucleoli | 1 - 10 |\n",
"| Mitoses | 1 - 10 |\n",
"| Class | (2 for benign, 4 for malignant) |"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data'"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data_filepath = \"https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data\"\n",
"# data_filepath = Path().resolve().parent / \"data\" / \"breast-cancer-wisconsin.data\"\n",
"data_filepath"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"More details in the following file you can download: https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.names"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The easiest way to open a plain text file as this one is using `pd.read_csv`."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" clump_thickness | \n",
" uniformity_cell_size | \n",
" uniformity_cell_shape | \n",
" marginal_adhesion | \n",
" single_epithelial_cell_size | \n",
" bare_nuclei | \n",
" bland_chromatin | \n",
" normal_cucleoli | \n",
" mitoses | \n",
" class | \n",
"
\n",
" \n",
" code | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 1000025 | \n",
" 5 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 2 | \n",
" 1 | \n",
" 3 | \n",
" 1 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" 1002945 | \n",
" 5 | \n",
" 4 | \n",
" 4 | \n",
" 5 | \n",
" 7 | \n",
" 10 | \n",
" 3 | \n",
" 2 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" 1015425 | \n",
" 3 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 2 | \n",
" 2 | \n",
" 3 | \n",
" 1 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" 1016277 | \n",
" 6 | \n",
" 8 | \n",
" 8 | \n",
" 1 | \n",
" 3 | \n",
" 4 | \n",
" 3 | \n",
" 7 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" 1017023 | \n",
" 4 | \n",
" 1 | \n",
" 1 | \n",
" 3 | \n",
" 2 | \n",
" 1 | \n",
" 3 | \n",
" 1 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" clump_thickness uniformity_cell_size uniformity_cell_shape \\\n",
"code \n",
"1000025 5 1 1 \n",
"1002945 5 4 4 \n",
"1015425 3 1 1 \n",
"1016277 6 8 8 \n",
"1017023 4 1 1 \n",
"\n",
" marginal_adhesion single_epithelial_cell_size bare_nuclei \\\n",
"code \n",
"1000025 1 2 1 \n",
"1002945 5 7 10 \n",
"1015425 1 2 2 \n",
"1016277 1 3 4 \n",
"1017023 3 2 1 \n",
"\n",
" bland_chromatin normal_cucleoli mitoses class \n",
"code \n",
"1000025 3 1 1 2 \n",
"1002945 3 2 1 2 \n",
"1015425 3 1 1 2 \n",
"1016277 3 7 1 2 \n",
"1017023 3 1 1 2 "
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"breast_cancer_data = pd.read_csv(\n",
" data_filepath ,\n",
" names=[\n",
" \"code\",\n",
" \"clump_thickness\",\n",
" \"uniformity_cell_size\",\n",
" \"uniformity_cell_shape\",\n",
" \"marginal_adhesion\",\n",
" \"single_epithelial_cell_size\",\n",
" \"bare_nuclei\",\n",
" \"bland_chromatin\",\n",
" \"normal_cucleoli\",\n",
" \"mitoses\",\n",
" \"class\",\n",
" ],\n",
" index_col=0\n",
")\n",
"breast_cancer_data.head()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's explore this data a little bit before start working with it."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Int64Index: 699 entries, 1000025 to 897471\n",
"Data columns (total 10 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 clump_thickness 699 non-null int64 \n",
" 1 uniformity_cell_size 699 non-null int64 \n",
" 2 uniformity_cell_shape 699 non-null int64 \n",
" 3 marginal_adhesion 699 non-null int64 \n",
" 4 single_epithelial_cell_size 699 non-null int64 \n",
" 5 bare_nuclei 699 non-null object\n",
" 6 bland_chromatin 699 non-null int64 \n",
" 7 normal_cucleoli 699 non-null int64 \n",
" 8 mitoses 699 non-null int64 \n",
" 9 class 699 non-null int64 \n",
"dtypes: int64(9), object(1)\n",
"memory usage: 60.1+ KB\n"
]
}
],
"source": [
"breast_cancer_data.info()"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"clump_thickness int64\n",
"uniformity_cell_size int64\n",
"uniformity_cell_shape int64\n",
"marginal_adhesion int64\n",
"single_epithelial_cell_size int64\n",
"bare_nuclei object\n",
"bland_chromatin int64\n",
"normal_cucleoli int64\n",
"mitoses int64\n",
"class int64\n",
"dtype: object"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"breast_cancer_data.dtypes"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" clump_thickness | \n",
" uniformity_cell_size | \n",
" uniformity_cell_shape | \n",
" marginal_adhesion | \n",
" single_epithelial_cell_size | \n",
" bland_chromatin | \n",
" normal_cucleoli | \n",
" mitoses | \n",
" class | \n",
"
\n",
" \n",
" \n",
" \n",
" count | \n",
" 699.000000 | \n",
" 699.000000 | \n",
" 699.000000 | \n",
" 699.000000 | \n",
" 699.000000 | \n",
" 699.000000 | \n",
" 699.000000 | \n",
" 699.000000 | \n",
" 699.000000 | \n",
"
\n",
" \n",
" mean | \n",
" 4.417740 | \n",
" 3.134478 | \n",
" 3.207439 | \n",
" 2.806867 | \n",
" 3.216023 | \n",
" 3.437768 | \n",
" 2.866953 | \n",
" 1.589413 | \n",
" 2.689557 | \n",
"
\n",
" \n",
" std | \n",
" 2.815741 | \n",
" 3.051459 | \n",
" 2.971913 | \n",
" 2.855379 | \n",
" 2.214300 | \n",
" 2.438364 | \n",
" 3.053634 | \n",
" 1.715078 | \n",
" 0.951273 | \n",
"
\n",
" \n",
" min | \n",
" 1.000000 | \n",
" 1.000000 | \n",
" 1.000000 | \n",
" 1.000000 | \n",
" 1.000000 | \n",
" 1.000000 | \n",
" 1.000000 | \n",
" 1.000000 | \n",
" 2.000000 | \n",
"
\n",
" \n",
" 25% | \n",
" 2.000000 | \n",
" 1.000000 | \n",
" 1.000000 | \n",
" 1.000000 | \n",
" 2.000000 | \n",
" 2.000000 | \n",
" 1.000000 | \n",
" 1.000000 | \n",
" 2.000000 | \n",
"
\n",
" \n",
" 50% | \n",
" 4.000000 | \n",
" 1.000000 | \n",
" 1.000000 | \n",
" 1.000000 | \n",
" 2.000000 | \n",
" 3.000000 | \n",
" 1.000000 | \n",
" 1.000000 | \n",
" 2.000000 | \n",
"
\n",
" \n",
" 75% | \n",
" 6.000000 | \n",
" 5.000000 | \n",
" 5.000000 | \n",
" 4.000000 | \n",
" 4.000000 | \n",
" 5.000000 | \n",
" 4.000000 | \n",
" 1.000000 | \n",
" 4.000000 | \n",
"
\n",
" \n",
" max | \n",
" 10.000000 | \n",
" 10.000000 | \n",
" 10.000000 | \n",
" 10.000000 | \n",
" 10.000000 | \n",
" 10.000000 | \n",
" 10.000000 | \n",
" 10.000000 | \n",
" 4.000000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" clump_thickness uniformity_cell_size uniformity_cell_shape \\\n",
"count 699.000000 699.000000 699.000000 \n",
"mean 4.417740 3.134478 3.207439 \n",
"std 2.815741 3.051459 2.971913 \n",
"min 1.000000 1.000000 1.000000 \n",
"25% 2.000000 1.000000 1.000000 \n",
"50% 4.000000 1.000000 1.000000 \n",
"75% 6.000000 5.000000 5.000000 \n",
"max 10.000000 10.000000 10.000000 \n",
"\n",
" marginal_adhesion single_epithelial_cell_size bland_chromatin \\\n",
"count 699.000000 699.000000 699.000000 \n",
"mean 2.806867 3.216023 3.437768 \n",
"std 2.855379 2.214300 2.438364 \n",
"min 1.000000 1.000000 1.000000 \n",
"25% 1.000000 2.000000 2.000000 \n",
"50% 1.000000 2.000000 3.000000 \n",
"75% 4.000000 4.000000 5.000000 \n",
"max 10.000000 10.000000 10.000000 \n",
"\n",
" normal_cucleoli mitoses class \n",
"count 699.000000 699.000000 699.000000 \n",
"mean 2.866953 1.589413 2.689557 \n",
"std 3.053634 1.715078 0.951273 \n",
"min 1.000000 1.000000 2.000000 \n",
"25% 1.000000 1.000000 2.000000 \n",
"50% 1.000000 1.000000 2.000000 \n",
"75% 4.000000 1.000000 4.000000 \n",
"max 10.000000 10.000000 4.000000 "
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"breast_cancer_data.describe()"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" clump_thickness | \n",
" uniformity_cell_size | \n",
" uniformity_cell_shape | \n",
" marginal_adhesion | \n",
" single_epithelial_cell_size | \n",
" bare_nuclei | \n",
" bland_chromatin | \n",
" normal_cucleoli | \n",
" mitoses | \n",
" class | \n",
"
\n",
" \n",
" \n",
" \n",
" count | \n",
" 699.000000 | \n",
" 699.000000 | \n",
" 699.000000 | \n",
" 699.000000 | \n",
" 699.000000 | \n",
" 699 | \n",
" 699.000000 | \n",
" 699.000000 | \n",
" 699.000000 | \n",
" 699.000000 | \n",
"
\n",
" \n",
" unique | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" 11 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" top | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" 1 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" freq | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" 402 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" mean | \n",
" 4.417740 | \n",
" 3.134478 | \n",
" 3.207439 | \n",
" 2.806867 | \n",
" 3.216023 | \n",
" NaN | \n",
" 3.437768 | \n",
" 2.866953 | \n",
" 1.589413 | \n",
" 2.689557 | \n",
"
\n",
" \n",
" std | \n",
" 2.815741 | \n",
" 3.051459 | \n",
" 2.971913 | \n",
" 2.855379 | \n",
" 2.214300 | \n",
" NaN | \n",
" 2.438364 | \n",
" 3.053634 | \n",
" 1.715078 | \n",
" 0.951273 | \n",
"
\n",
" \n",
" min | \n",
" 1.000000 | \n",
" 1.000000 | \n",
" 1.000000 | \n",
" 1.000000 | \n",
" 1.000000 | \n",
" NaN | \n",
" 1.000000 | \n",
" 1.000000 | \n",
" 1.000000 | \n",
" 2.000000 | \n",
"
\n",
" \n",
" 25% | \n",
" 2.000000 | \n",
" 1.000000 | \n",
" 1.000000 | \n",
" 1.000000 | \n",
" 2.000000 | \n",
" NaN | \n",
" 2.000000 | \n",
" 1.000000 | \n",
" 1.000000 | \n",
" 2.000000 | \n",
"
\n",
" \n",
" 50% | \n",
" 4.000000 | \n",
" 1.000000 | \n",
" 1.000000 | \n",
" 1.000000 | \n",
" 2.000000 | \n",
" NaN | \n",
" 3.000000 | \n",
" 1.000000 | \n",
" 1.000000 | \n",
" 2.000000 | \n",
"
\n",
" \n",
" 75% | \n",
" 6.000000 | \n",
" 5.000000 | \n",
" 5.000000 | \n",
" 4.000000 | \n",
" 4.000000 | \n",
" NaN | \n",
" 5.000000 | \n",
" 4.000000 | \n",
" 1.000000 | \n",
" 4.000000 | \n",
"
\n",
" \n",
" max | \n",
" 10.000000 | \n",
" 10.000000 | \n",
" 10.000000 | \n",
" 10.000000 | \n",
" 10.000000 | \n",
" NaN | \n",
" 10.000000 | \n",
" 10.000000 | \n",
" 10.000000 | \n",
" 4.000000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" clump_thickness uniformity_cell_size uniformity_cell_shape \\\n",
"count 699.000000 699.000000 699.000000 \n",
"unique NaN NaN NaN \n",
"top NaN NaN NaN \n",
"freq NaN NaN NaN \n",
"mean 4.417740 3.134478 3.207439 \n",
"std 2.815741 3.051459 2.971913 \n",
"min 1.000000 1.000000 1.000000 \n",
"25% 2.000000 1.000000 1.000000 \n",
"50% 4.000000 1.000000 1.000000 \n",
"75% 6.000000 5.000000 5.000000 \n",
"max 10.000000 10.000000 10.000000 \n",
"\n",
" marginal_adhesion single_epithelial_cell_size bare_nuclei \\\n",
"count 699.000000 699.000000 699 \n",
"unique NaN NaN 11 \n",
"top NaN NaN 1 \n",
"freq NaN NaN 402 \n",
"mean 2.806867 3.216023 NaN \n",
"std 2.855379 2.214300 NaN \n",
"min 1.000000 1.000000 NaN \n",
"25% 1.000000 2.000000 NaN \n",
"50% 1.000000 2.000000 NaN \n",
"75% 4.000000 4.000000 NaN \n",
"max 10.000000 10.000000 NaN \n",
"\n",
" bland_chromatin normal_cucleoli mitoses class \n",
"count 699.000000 699.000000 699.000000 699.000000 \n",
"unique NaN NaN NaN NaN \n",
"top NaN NaN NaN NaN \n",
"freq NaN NaN NaN NaN \n",
"mean 3.437768 2.866953 1.589413 2.689557 \n",
"std 2.438364 3.053634 1.715078 0.951273 \n",
"min 1.000000 1.000000 1.000000 2.000000 \n",
"25% 2.000000 1.000000 1.000000 2.000000 \n",
"50% 3.000000 1.000000 1.000000 2.000000 \n",
"75% 5.000000 4.000000 1.000000 4.000000 \n",
"max 10.000000 10.000000 10.000000 4.000000 "
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"breast_cancer_data.describe(include=\"all\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Series"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Series are one-dimensional labeled arrays. You can think they are similar to columns of a excel spreadsheet. "
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"There are multiple ways to create a `pd.Series`, using lists, dictionaies, `np.array` or from a file. \n",
"\n",
"Since we already loaded the breast cancer data we will use it as an example. Each list of this file has been converted to a `pd.Series`."
]
},
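{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of the first three options, assuming small made-up values and labels that are not part of the dataset:\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# From a list: pandas assigns a default integer index (0, 1, 2, ...).\n",
"from_list = pd.Series([5, 3, 6])\n",
"\n",
"# From a dictionary: the keys become the index labels.\n",
"from_dict = pd.Series({\"a\": 5, \"b\": 3, \"c\": 6})\n",
"\n",
"# From a NumPy array, with an explicit index and a name (both illustrative).\n",
"from_array = pd.Series(np.array([5, 3, 6]), index=[\"a\", \"b\", \"c\"], name=\"example\")\n",
"```\n",
"\n",
"In each case the result is a `pd.Series`; only the way the index is built differs."
]
},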
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"code\n",
"1000025 5\n",
"1002945 5\n",
"1015425 3\n",
"1016277 6\n",
"1017023 4\n",
"Name: clump_thickness, dtype: int64"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"clump_thick_series = breast_cancer_data[\"clump_thickness\"].copy()\n",
"clump_thick_series.head()"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"pandas.core.series.Series"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(clump_thick_series)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"`pd.Series` are made with _index_ and _values_."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Int64Index([1000025, 1002945, 1015425, 1016277, 1017023, 1017122, 1018099,\n",
" 1018561, 1033078, 1033078,\n",
" ...\n",
" 654546, 654546, 695091, 714039, 763235, 776715, 841769,\n",
" 888820, 897471, 897471],\n",
" dtype='int64', name='code', length=699)"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"clump_thick_series.index"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([ 5, 5, 3, 6, 4, 8, 1, 2, 2, 4, 1, 2, 5, 1, 8, 7, 4,\n",
" 4, 10, 6, 7, 10, 3, 8, 1, 5, 3, 5, 2, 1, 3, 2, 10, 2,\n",
" 3, 2, 10, 6, 5, 2, 6, 10, 6, 5, 10, 1, 3, 1, 4, 7, 9,\n",
" 5, 10, 5, 10, 10, 8, 8, 5, 9, 5, 1, 9, 6, 1, 10, 4, 5,\n",
" 8, 1, 5, 6, 1, 9, 10, 1, 1, 5, 3, 2, 2, 4, 5, 3, 3,\n",
" 5, 3, 3, 4, 2, 1, 3, 4, 1, 2, 1, 2, 5, 9, 7, 10, 2,\n",
" 4, 8, 10, 7, 10, 1, 1, 6, 1, 8, 10, 10, 3, 1, 8, 4, 1,\n",
" 3, 1, 4, 10, 5, 5, 1, 7, 3, 8, 1, 5, 2, 5, 3, 3, 5,\n",
" 4, 3, 4, 1, 3, 2, 9, 1, 2, 1, 3, 1, 3, 8, 1, 7, 10,\n",
" 4, 1, 5, 1, 2, 1, 9, 10, 4, 3, 1, 5, 4, 5, 10, 3, 1,\n",
" 3, 1, 1, 6, 8, 5, 2, 5, 4, 5, 1, 1, 6, 5, 8, 2, 1,\n",
" 10, 5, 1, 10, 7, 5, 1, 3, 4, 8, 5, 1, 3, 9, 10, 1, 5,\n",
" 1, 5, 10, 1, 1, 5, 8, 8, 1, 10, 10, 8, 1, 1, 6, 6, 1,\n",
" 10, 4, 7, 10, 1, 10, 8, 1, 10, 7, 6, 8, 10, 3, 3, 10, 9,\n",
" 8, 10, 5, 3, 2, 1, 1, 5, 8, 8, 4, 3, 1, 10, 6, 6, 9,\n",
" 5, 3, 3, 3, 5, 10, 5, 8, 10, 7, 5, 10, 3, 10, 1, 8, 5,\n",
" 3, 7, 3, 3, 3, 1, 1, 10, 3, 2, 1, 10, 7, 8, 10, 3, 6,\n",
" 5, 1, 1, 8, 10, 1, 5, 5, 5, 8, 9, 8, 1, 10, 1, 8, 10,\n",
" 1, 1, 7, 3, 2, 1, 8, 1, 1, 4, 5, 6, 1, 4, 7, 3, 3,\n",
" 5, 1, 3, 10, 1, 8, 10, 10, 5, 5, 5, 8, 1, 6, 1, 1, 8,\n",
" 10, 1, 2, 1, 7, 1, 5, 1, 3, 4, 5, 2, 3, 2, 1, 4, 5,\n",
" 8, 8, 10, 6, 3, 3, 4, 2, 2, 6, 5, 1, 1, 4, 1, 4, 5,\n",
" 3, 1, 1, 1, 3, 5, 1, 10, 3, 2, 2, 3, 7, 5, 2, 5, 1,\n",
" 10, 3, 1, 1, 3, 3, 4, 3, 1, 3, 3, 5, 3, 1, 1, 4, 1,\n",
" 2, 3, 1, 1, 10, 5, 8, 3, 8, 1, 5, 2, 3, 10, 4, 5, 3,\n",
" 9, 5, 8, 1, 2, 1, 5, 5, 3, 6, 10, 10, 4, 4, 5, 10, 5,\n",
" 1, 1, 5, 2, 1, 5, 1, 5, 4, 5, 3, 4, 2, 10, 10, 8, 5,\n",
" 5, 5, 3, 6, 4, 4, 10, 10, 6, 4, 1, 3, 6, 6, 4, 5, 3,\n",
" 4, 4, 5, 4, 5, 5, 9, 8, 5, 1, 3, 10, 3, 6, 1, 5, 4,\n",
" 5, 5, 3, 1, 4, 4, 4, 6, 4, 4, 4, 1, 3, 8, 1, 5, 2,\n",
" 1, 5, 5, 3, 6, 4, 1, 1, 3, 4, 1, 4, 10, 7, 3, 3, 4,\n",
" 4, 6, 4, 7, 4, 1, 3, 2, 1, 5, 5, 4, 6, 5, 3, 5, 4,\n",
" 2, 5, 6, 2, 3, 7, 3, 1, 3, 4, 3, 4, 5, 5, 2, 5, 5,\n",
" 5, 1, 3, 4, 5, 3, 4, 8, 10, 8, 7, 3, 1, 10, 5, 5, 1,\n",
" 1, 1, 5, 5, 6, 3, 5, 1, 8, 5, 9, 5, 4, 2, 10, 5, 4,\n",
" 5, 4, 5, 3, 5, 3, 1, 4, 5, 5, 10, 4, 1, 5, 5, 10, 5,\n",
" 8, 2, 2, 4, 3, 1, 4, 5, 3, 6, 7, 1, 5, 3, 4, 2, 2,\n",
" 4, 6, 5, 1, 8, 3, 3, 10, 4, 4, 5, 4, 3, 3, 1, 2, 3,\n",
" 1, 1, 5, 3, 3, 1, 5, 4, 3, 3, 5, 5, 7, 1, 1, 4, 1,\n",
" 1, 3, 1, 5, 3, 5, 5, 3, 3, 2, 5, 1, 4, 1, 5, 1, 2,\n",
" 10, 5, 5, 1, 1, 1, 1, 3, 4, 1, 1, 5, 3, 3, 3, 2, 5,\n",
" 4, 4])"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"clump_thick_series.values"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, imagine you want to access to a specific value from the third patient."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"clump_thick_series.iloc[2] # Remember Python is a 0-indexed progamming language."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"However what if you want to know the clump thickness of a specific patient. Since we have their codes we can access with another method.\n",
"\n",
"For example, for patient's code `1166654`"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"10"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"clump_thick_series.loc[1166654]"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Don't forget\n",
"\n",
"* `loc` refers to indexes (__labels__).\n",
"* `iloc` refers to order.\n",
"\n",
"We will focus on `loc` instead of `iloc` since the power of `pandas` comes from its indexes can be numeric or categoricals. If you only need to do order-based analysis `pandas` could be overkill and `numpy` could be enough."
]
},
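{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The difference between labels and positions is easiest to see when the labels are integers that are not in order. A small sketch, assuming made-up codes and values:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical patient codes used as integer labels (not from the dataset).\n",
"s = pd.Series([5, 3, 6], index=[1003, 1001, 1002], name=\"clump_thickness\")\n",
"\n",
"by_label = s.loc[1001]   # label-based: the row whose index label is 1001 -> 3\n",
"by_position = s.iloc[0]  # position-based: the first row -> 5\n",
"\n",
"# Sorting changes positions but not labels.\n",
"sorted_s = s.sort_index()\n",
"```\n",
"\n",
"After sorting, `sorted_s.iloc[0]` points to a different patient than `s.iloc[0]`, while `sorted_s.loc[1001]` still returns the same value as `s.loc[1001]`."
]
},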
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"What if you want to get the values of several patients? For example patients `1166654` and `1178580`"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"code\n",
"1166654 10\n",
"1178580 5\n",
"Name: clump_thickness, dtype: int64"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"clump_thick_series.loc[[1166654, 1178580]]"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"```{important}\n",
"Notice if the argument is just one label the `loc` returns only the value. On the other hand, if the argument is a list then `loc` returns a `pd.Series` object.\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"numpy.int64"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(clump_thick_series.loc[1166654])"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"pandas.core.series.Series"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(clump_thick_series.loc[[1166654, 1178580]])"
]
},
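{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"One caveat: the index of this dataset contains repeated codes (for example, `1033078` appears twice in the index printed earlier). A sketch of how `loc` behaves in that situation, using made-up values:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Toy Series with one repeated label; the values are illustrative only.\n",
"s = pd.Series([2, 4, 7], index=[1033078, 1033078, 1166654])\n",
"\n",
"unique_hit = s.loc[1166654]     # the label is unique: a single value\n",
"duplicate_hit = s.loc[1033078]  # the label is repeated: a Series with two rows\n",
"```\n",
"\n",
"So a single label only yields a scalar when that label is unique in the index."
]
},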
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"You can even edit or add values with these methods.\n",
"\n",
"For instance, what if the dataset is wrong about patient `1166654` and clump thickness should have been `6` instead of `10`? \n",
"\n",
"We can fix that easily."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"clump_thick_series.loc[1166654] = 6"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"```{warning}\n",
"You should have got a `SettingWithCopyWarning` message after running the last code cell if we had not used the `copy()` method.\n",
"\n",
"I would suggest you to read [this link](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy) if you get that warning. But in simple words, `loc` returns a __view__, that means if you change anything it will change the main object itself. This is a feature, not an error. We have to be careful with this in the future.\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"Ok, let's check that change we made"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"6"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"clump_thick_series.loc[1166654]"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Howerver, notice since we copied the column this didn't change the value in the original dataset."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"clump_thickness 10\n",
"uniformity_cell_size 3\n",
"uniformity_cell_shape 5\n",
"marginal_adhesion 1\n",
"single_epithelial_cell_size 10\n",
"bare_nuclei 5\n",
"bland_chromatin 3\n",
"normal_cucleoli 10\n",
"mitoses 2\n",
"class 4\n",
"Name: 1166654, dtype: object"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"breast_cancer_data.loc[1166654]"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"```{attention}\n",
"You can try to create `clump_thick_series` without the `.copy()` method and explore what happens if you change values.\n",
"\n",
"I would suggest you __to use copies if you are not sure__.\n",
"```"
]
},
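{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the guarantee that `.copy()` gives, assuming a toy `DataFrame` with made-up codes and values. What happens without `.copy()` depends on the pandas version and its copy-on-write settings, so only the copy case is shown:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Toy data, not taken from the dataset.\n",
"df = pd.DataFrame({\"clump_thickness\": [10, 5]}, index=[101, 102])\n",
"\n",
"safe = df[\"clump_thickness\"].copy()  # independent copy of the column\n",
"safe.loc[101] = 6                    # edit the copy only\n",
"\n",
"# The original DataFrame is untouched because we edited a copy.\n",
"original_value = df.loc[101, \"clump_thickness\"]\n",
"```\n",
"\n",
"Here `original_value` is still `10`, while `safe.loc[101]` is `6`."
]
},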
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Another common mask is when you want to filter by a condition.\n",
"\n",
"For example, let's get all the patients with a clump thickness greater than 7."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"code\n",
"1000025 False\n",
"1002945 False\n",
"1015425 False\n",
"1016277 False\n",
"1017023 False\n",
" ... \n",
"776715 False\n",
"841769 False\n",
"888820 False\n",
"897471 False\n",
"897471 False\n",
"Name: clump_thickness, Length: 699, dtype: bool"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"clump_thick_series > 7"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"You can do logical comparations with `pd.Series` but this only will return another `pd.Series` of boolean objects (True/False). We want to keep only those ones where the value is true."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"code\n",
"1017122 8\n",
"1044572 8\n",
"1050670 10\n",
"1054593 10\n",
"1057013 8\n",
" ..\n",
"736150 10\n",
"822829 8\n",
"1253955 8\n",
"1268952 10\n",
"1369821 10\n",
"Name: clump_thickness, Length: 128, dtype: int64"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"clump_thick_series.loc[clump_thick_series > 7]"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"You can avoid using `loc` in this task but to be honest I rather use it. "
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"code\n",
"1017122 8\n",
"1044572 8\n",
"1050670 10\n",
"1054593 10\n",
"1057013 8\n",
" ..\n",
"736150 10\n",
"822829 8\n",
"1253955 8\n",
"1268952 10\n",
"1369821 10\n",
"Name: clump_thickness, Length: 128, dtype: int64"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"clump_thick_series[clump_thick_series > 7]"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"However, my favorite version is using a functional approach with the function `lambda`. It is less intuitive at the beginning but it allows you to concatenate operations."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"code\n",
"1017122 8\n",
"1044572 8\n",
"1050670 10\n",
"1054593 10\n",
"1057013 8\n",
" ..\n",
"736150 10\n",
"822829 8\n",
"1253955 8\n",
"1268952 10\n",
"1369821 10\n",
"Name: clump_thickness, Length: 128, dtype: int64"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"clump_thick_series.loc[lambda x: x > 7]"
]
},
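{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"To illustrate why the `lambda` form makes chaining easier, here is a small sketch on made-up values. Each `lambda` receives the result of the previous step, so no intermediate variable is needed:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Toy values with hypothetical patient codes as the index.\n",
"s = pd.Series([5, 8, 10, 3, 9], index=[101, 102, 103, 104, 105])\n",
"\n",
"result = (\n",
"    s.loc[lambda x: x > 7]   # keep values greater than 7\n",
"    .sort_values()           # order what is left\n",
"    .loc[lambda x: x < 10]   # then drop the values equal to 10\n",
")\n",
"```\n",
"\n",
"With a plain boolean mask, the second filter would need a named intermediate result to refer to, because `s` itself no longer matches the filtered data."
]
},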
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## DataFrames"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"`pd.DataFrame` are 2-dimensional arrays with horizontal and vertical labels (_indexes_ and _columns_). It is the natural extension of `pd.Series` and you can even think they are a multiple `pd.Series` concatenated."
]
},
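{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The idea of a `pd.DataFrame` as several `pd.Series` placed side by side can be sketched with `pd.concat`. The codes and values below are made up for illustration:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"idx = [101, 102, 103]  # hypothetical patient codes\n",
"thickness = pd.Series([5, 3, 6], index=idx, name=\"clump_thickness\")\n",
"mitoses = pd.Series([1, 1, 2], index=idx, name=\"mitoses\")\n",
"\n",
"# Concatenate along columns: each Series becomes one column, aligned on the index.\n",
"toy_df = pd.concat([thickness, mitoses], axis=1)\n",
"\n",
"# Selecting a single column gives back a Series.\n",
"column = toy_df[\"mitoses\"]\n",
"```\n",
"\n",
"The names of the Series become the column names, and the shared index becomes the row labels."
]
},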
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"pandas.core.frame.DataFrame"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(breast_cancer_data)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"There are a few useful methods for exploring the data, let's explore some of them."
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(699, 10)"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"breast_cancer_data.shape"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" clump_thickness | \n",
" uniformity_cell_size | \n",
" uniformity_cell_shape | \n",
" marginal_adhesion | \n",
" single_epithelial_cell_size | \n",
" bare_nuclei | \n",
" bland_chromatin | \n",
" normal_cucleoli | \n",
" mitoses | \n",
" class | \n",
"
\n",
" \n",
" code | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 1000025 | \n",
" 5 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 2 | \n",
" 1 | \n",
" 3 | \n",
" 1 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" 1002945 | \n",
" 5 | \n",
" 4 | \n",
" 4 | \n",
" 5 | \n",
" 7 | \n",
" 10 | \n",
" 3 | \n",
" 2 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" 1015425 | \n",
" 3 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 2 | \n",
" 2 | \n",
" 3 | \n",
" 1 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" 1016277 | \n",
" 6 | \n",
" 8 | \n",
" 8 | \n",
" 1 | \n",
" 3 | \n",
" 4 | \n",
" 3 | \n",
" 7 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" 1017023 | \n",
" 4 | \n",
" 1 | \n",
" 1 | \n",
" 3 | \n",
" 2 | \n",
" 1 | \n",
" 3 | \n",
" 1 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" clump_thickness uniformity_cell_size uniformity_cell_shape \\\n",
"code \n",
"1000025 5 1 1 \n",
"1002945 5 4 4 \n",
"1015425 3 1 1 \n",
"1016277 6 8 8 \n",
"1017023 4 1 1 \n",
"\n",
" marginal_adhesion single_epithelial_cell_size bare_nuclei \\\n",
"code \n",
"1000025 1 2 1 \n",
"1002945 5 7 10 \n",
"1015425 1 2 2 \n",
"1016277 1 3 4 \n",
"1017023 3 2 1 \n",
"\n",
" bland_chromatin normal_cucleoli mitoses class \n",
"code \n",
"1000025 3 1 1 2 \n",
"1002945 3 2 1 2 \n",
"1015425 3 1 1 2 \n",
"1016277 3 7 1 2 \n",
"1017023 3 1 1 2 "
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"breast_cancer_data.head()"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" clump_thickness | \n",
" uniformity_cell_size | \n",
" uniformity_cell_shape | \n",
" marginal_adhesion | \n",
" single_epithelial_cell_size | \n",
" bare_nuclei | \n",
" bland_chromatin | \n",
" normal_cucleoli | \n",
" mitoses | \n",
" class | \n",
"
\n",
" \n",
" code | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 776715 | \n",
" 3 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 3 | \n",
" 2 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" 841769 | \n",
" 2 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 2 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" 888820 | \n",
" 5 | \n",
" 10 | \n",
" 10 | \n",
" 3 | \n",
" 7 | \n",
" 3 | \n",
" 8 | \n",
" 10 | \n",
" 2 | \n",
" 4 | \n",
"
\n",
" \n",
" 897471 | \n",
" 4 | \n",
" 8 | \n",
" 6 | \n",
" 4 | \n",
" 3 | \n",
" 4 | \n",
" 10 | \n",
" 6 | \n",
" 1 | \n",
" 4 | \n",
"
\n",
" \n",
" 897471 | \n",
" 4 | \n",
" 8 | \n",
" 8 | \n",
" 5 | \n",
" 4 | \n",
" 5 | \n",
" 10 | \n",
" 4 | \n",
" 1 | \n",
" 4 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" clump_thickness uniformity_cell_size uniformity_cell_shape \\\n",
"code \n",
"776715 3 1 1 \n",
"841769 2 1 1 \n",
"888820 5 10 10 \n",
"897471 4 8 6 \n",
"897471 4 8 8 \n",
"\n",
" marginal_adhesion single_epithelial_cell_size bare_nuclei \\\n",
"code \n",
"776715 1 3 2 \n",
"841769 1 2 1 \n",
"888820 3 7 3 \n",
"897471 4 3 4 \n",
"897471 5 4 5 \n",
"\n",
" bland_chromatin normal_cucleoli mitoses class \n",
"code \n",
"776715 1 1 1 2 \n",
"841769 1 1 1 2 \n",
"888820 8 10 2 4 \n",
"897471 10 6 1 4 \n",
"897471 10 4 1 4 "
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"breast_cancer_data.tail()"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"clump_thickness 10\n",
"uniformity_cell_size 10\n",
"uniformity_cell_shape 10\n",
"marginal_adhesion 10\n",
"single_epithelial_cell_size 10\n",
"bare_nuclei ?\n",
"bland_chromatin 10\n",
"normal_cucleoli 10\n",
"mitoses 10\n",
"class 4\n",
"dtype: object"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"breast_cancer_data.max()"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"clump_thickness 4.417740\n",
"uniformity_cell_size 3.134478\n",
"uniformity_cell_shape 3.207439\n",
"marginal_adhesion 2.806867\n",
"single_epithelial_cell_size 3.216023\n",
"bland_chromatin 3.437768\n",
"normal_cucleoli 2.866953\n",
"mitoses 1.589413\n",
"class 2.689557\n",
"dtype: float64"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"breast_cancer_data.mean(numeric_only=True)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also inspect values using `loc`."
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"clump_thickness 10\n",
"uniformity_cell_size 3\n",
"uniformity_cell_shape 5\n",
"marginal_adhesion 1\n",
"single_epithelial_cell_size 10\n",
"bare_nuclei 5\n",
"bland_chromatin 3\n",
"normal_cucleoli 10\n",
"mitoses 2\n",
"class 4\n",
"Name: 1166654, dtype: object"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"breast_cancer_data.loc[1166654]"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"However, you shouldn't chain two `loc` calls. The technical reason is explained [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#why-does-assignment-fail-when-using-chained-indexing)."
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"10"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"breast_cancer_data.loc[1166654].loc[\"clump_thickness\"]"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Instead, pass the row label and the column label to a single `loc` call."
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"10"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"breast_cancer_data.loc[1166654, \"clump_thickness\"]"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"clump_thickness 10\n",
"class 4\n",
"Name: 1166654, dtype: object"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"breast_cancer_data.loc[1166654, [\"clump_thickness\", \"class\"]]"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" clump_thickness class\n",
"code \n",
"1166654 10 4\n",
"1178580 5 2"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"breast_cancer_data.loc[[1166654, 1178580], [\"clump_thickness\", \"class\"]]"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Boolean masks also work; here the mask is built by a callable that receives the DataFrame."
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" clump_thickness uniformity_cell_size uniformity_cell_shape \\\n",
"code \n",
"1017122 8 10 10 \n",
"1044572 8 7 5 \n",
"1050670 10 7 7 \n",
"1054593 10 5 5 \n",
"1057013 8 4 5 \n",
"... ... ... ... \n",
"736150 10 4 3 \n",
"822829 8 10 10 \n",
"1253955 8 7 4 \n",
"1268952 10 10 7 \n",
"1369821 10 10 10 \n",
"\n",
" marginal_adhesion single_epithelial_cell_size bare_nuclei \\\n",
"code \n",
"1017122 8 7 10 \n",
"1044572 10 7 9 \n",
"1050670 6 4 10 \n",
"1054593 3 6 7 \n",
"1057013 1 2 ? \n",
"... ... ... ... \n",
"736150 10 3 10 \n",
"822829 10 6 10 \n",
"1253955 4 5 3 \n",
"1268952 8 7 1 \n",
"1369821 10 5 10 \n",
"\n",
" bland_chromatin normal_cucleoli mitoses class \n",
"code \n",
"1017122 9 7 1 4 \n",
"1044572 5 5 4 4 \n",
"1050670 4 1 2 4 \n",
"1054593 7 10 1 4 \n",
"1057013 7 3 1 4 \n",
"... ... ... ... ... \n",
"736150 7 1 2 4 \n",
"822829 10 10 10 4 \n",
"1253955 5 10 1 4 \n",
"1268952 10 10 3 4 \n",
"1369821 10 10 7 4 \n",
"\n",
"[129 rows x 10 columns]"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"breast_cancer_data.loc[lambda x: x[\"clump_thickness\"] > 7]"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also select a specific column."
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"code\n",
"1000025 1\n",
"1002945 10\n",
"1015425 2\n",
"1016277 4\n",
"1017023 1\n",
" ..\n",
"776715 2\n",
"841769 1\n",
"888820 3\n",
"897471 4\n",
"897471 5\n",
"Name: bare_nuclei, Length: 699, dtype: object"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"breast_cancer_data.loc[:, \"bare_nuclei\"]"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"However, you can also access a column directly, without `loc`."
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"code\n",
"1000025 1\n",
"1002945 10\n",
"1015425 2\n",
"1016277 4\n",
"1017023 1\n",
" ..\n",
"776715 2\n",
"841769 1\n",
"888820 3\n",
"897471 4\n",
"897471 5\n",
"Name: bare_nuclei, Length: 699, dtype: object"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"breast_cancer_data[\"bare_nuclei\"]"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"There are some cool methods you can use for exploring your data"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1 402\n",
"10 132\n",
"2 30\n",
"5 30\n",
"3 28\n",
"8 21\n",
"4 19\n",
"? 16\n",
"9 9\n",
"7 8\n",
"6 4\n",
"Name: bare_nuclei, dtype: int64"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"breast_cancer_data.loc[:, \"bare_nuclei\"].value_counts()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"What about those `?` values?\n",
"\n",
"They represent missing values!"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" clump_thickness uniformity_cell_size uniformity_cell_shape \\\n",
"code \n",
"1057013 8 4 5 \n",
"1096800 6 6 6 \n",
"1183246 1 1 1 \n",
"1184840 1 1 3 \n",
"1193683 1 1 2 \n",
"1197510 5 1 1 \n",
"1241232 3 1 4 \n",
"169356 3 1 1 \n",
"432809 3 1 3 \n",
"563649 8 8 8 \n",
"606140 1 1 1 \n",
"61634 5 4 3 \n",
"704168 4 6 5 \n",
"733639 3 1 1 \n",
"1238464 1 1 1 \n",
"1057067 1 1 1 \n",
"\n",
" marginal_adhesion single_epithelial_cell_size bare_nuclei \\\n",
"code \n",
"1057013 1 2 ? \n",
"1096800 9 6 ? \n",
"1183246 1 1 ? \n",
"1184840 1 2 ? \n",
"1193683 1 3 ? \n",
"1197510 1 2 ? \n",
"1241232 1 2 ? \n",
"169356 1 2 ? \n",
"432809 1 2 ? \n",
"563649 1 2 ? \n",
"606140 1 2 ? \n",
"61634 1 2 ? \n",
"704168 6 7 ? \n",
"733639 1 2 ? \n",
"1238464 1 1 ? \n",
"1057067 1 1 ? \n",
"\n",
" bland_chromatin normal_cucleoli mitoses class \n",
"code \n",
"1057013 7 3 1 4 \n",
"1096800 7 8 1 2 \n",
"1183246 2 1 1 2 \n",
"1184840 2 1 1 2 \n",
"1193683 1 1 1 2 \n",
"1197510 3 1 1 2 \n",
"1241232 3 1 1 2 \n",
"169356 3 1 1 2 \n",
"432809 2 1 1 2 \n",
"563649 6 10 1 4 \n",
"606140 2 1 1 2 \n",
"61634 2 3 1 2 \n",
"704168 4 9 1 2 \n",
"733639 3 1 1 2 \n",
"1238464 2 1 1 2 \n",
"1057067 1 1 1 2 "
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"breast_cancer_data.loc[lambda s: s['bare_nuclei'] == '?']"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"`pandas` has a specific object for denoting null values, `pd.NA`."
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [],
"source": [
"breast_cancer_data.loc[lambda s: s['bare_nuclei'] == '?', 'bare_nuclei'] = pd.NA"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"```{tip}\n",
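"The same result can be achieved more elegantly with `breast_cancer_data = breast_cancer_data.replace({'bare_nuclei': {\"?\": pd.NA}})`. Unlike the assignment above, `replace` returns a new DataFrame, so the result has to be assigned back.\n",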
"```"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's see the rows with null values\n"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Empty DataFrame\n",
"Columns: [clump_thickness, uniformity_cell_size, uniformity_cell_shape, marginal_adhesion, single_epithelial_cell_size, bare_nuclei, bland_chromatin, normal_cucleoli, mitoses, class]\n",
"Index: []"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"breast_cancer_data.loc[lambda s: s['bare_nuclei'] == pd.NA]"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Wait a second, why is it not showing the rows with null values? Null values behave in surprising ways: comparing with `pd.NA` using `==` never evaluates to `True`, not even for `pd.NA` itself."
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<NA>"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.NA == pd.NA"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"```{important}\n",
"`pandas` recognizes `None`, `np.nan` and `pd.NA` as null values, so be careful!\n",
"```"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"There are special methods for working with null values"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" clump_thickness uniformity_cell_size uniformity_cell_shape \\\n",
"code \n",
"1057013 8 4 5 \n",
"1096800 6 6 6 \n",
"1183246 1 1 1 \n",
"1184840 1 1 3 \n",
"1193683 1 1 2 \n",
"1197510 5 1 1 \n",
"1241232 3 1 4 \n",
"169356 3 1 1 \n",
"432809 3 1 3 \n",
"563649 8 8 8 \n",
"606140 1 1 1 \n",
"61634 5 4 3 \n",
"704168 4 6 5 \n",
"733639 3 1 1 \n",
"1238464 1 1 1 \n",
"1057067 1 1 1 \n",
"\n",
" marginal_adhesion single_epithelial_cell_size bare_nuclei \\\n",
"code \n",
"1057013 1 2 <NA> \n",
"1096800 9 6 <NA> \n",
"1183246 1 1 <NA> \n",
"1184840 1 2 <NA> \n",
"1193683 1 3 <NA> \n",
"1197510 1 2 <NA> \n",
"1241232 1 2 <NA> \n",
"169356 1 2 <NA> \n",
"432809 1 2 <NA> \n",
"563649 1 2 <NA> \n",
"606140 1 2 <NA> \n",
"61634 1 2 <NA> \n",
"704168 6 7 <NA> \n",
"733639 1 2 <NA> \n",
"1238464 1 1 <NA> \n",
"1057067 1 1 <NA> \n",
"\n",
" bland_chromatin normal_cucleoli mitoses class \n",
"code \n",
"1057013 7 3 1 4 \n",
"1096800 7 8 1 2 \n",
"1183246 2 1 1 2 \n",
"1184840 2 1 1 2 \n",
"1193683 1 1 1 2 \n",
"1197510 3 1 1 2 \n",
"1241232 3 1 1 2 \n",
"169356 3 1 1 2 \n",
"432809 2 1 1 2 \n",
"563649 6 10 1 4 \n",
"606140 2 1 1 2 \n",
"61634 2 3 1 2 \n",
"704168 4 9 1 2 \n",
"733639 3 1 1 2 \n",
"1238464 2 1 1 2 \n",
"1057067 1 1 1 2 "
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"breast_cancer_data.loc[lambda s: s['bare_nuclei'].isnull()]"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Or you can check, column by column, whether there is any null value."
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"clump_thickness False\n",
"uniformity_cell_size False\n",
"uniformity_cell_shape False\n",
"marginal_adhesion False\n",
"single_epithelial_cell_size False\n",
"bare_nuclei True\n",
"bland_chromatin False\n",
"normal_cucleoli False\n",
"mitoses False\n",
"class False\n",
"dtype: bool"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"breast_cancer_data.isnull().any()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Or check row by row using `axis=1`."
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"code\n",
"1000025 False\n",
"1002945 False\n",
"1015425 False\n",
"1016277 False\n",
"1017023 False\n",
" ... \n",
"776715 False\n",
"841769 False\n",
"888820 False\n",
"897471 False\n",
"897471 False\n",
"Length: 699, dtype: bool"
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"breast_cancer_data.isnull().any(axis=1)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Ok, now we will fix the `bare_nuclei` column. Imagine you want to replace the null values with the mean value."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"raises-exception"
]
},
"outputs": [],
"source": [
"breast_cancer_data['bare_nuclei'].mean()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Oh no! The column still holds strings (`object` dtype), so we need to convert it to a numeric column first."
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"code\n",
"1000025 1.0\n",
"1002945 10.0\n",
"1015425 2.0\n",
"1016277 4.0\n",
"1017023 1.0\n",
" ... \n",
"776715 2.0\n",
"841769 1.0\n",
"888820 3.0\n",
"897471 4.0\n",
"897471 5.0\n",
"Name: bare_nuclei, Length: 699, dtype: float64"
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.to_numeric(breast_cancer_data['bare_nuclei'])"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [],
"source": [
"breast_cancer_data['bare_nuclei'] = pd.to_numeric(breast_cancer_data['bare_nuclei'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Another option would have been"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [],
"source": [
"breast_cancer_data = breast_cancer_data.assign(\n",
" bare_nuclei=lambda x: pd.to_numeric(x['bare_nuclei'])\n",
")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"I like this last one better, but either approach works, so don't worry!"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3.5446559297218156"
]
},
"execution_count": 56,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"breast_cancer_data['bare_nuclei'].mean()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Every value in this column is an integer, so we should convert the mean to an integer as well. In practice, you should ask the domain experts which rounding makes more sense. Let's say it is better to round this value up to the next integer.\n",
"\n",
"There is a scientific computing package called `numpy` that we don't have time to cover but you should check it out."
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np"
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"4.0"
]
},
"execution_count": 58,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"bare_nuclei_mean = np.ceil(breast_cancer_data['bare_nuclei'].mean())\n",
"bare_nuclei_mean"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, as an example, suppose we want to fill those null values with the rounded-up mean value of the column.\n",
"\n",
"If you are wondering whether there is a method for this, the answer is yes!"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Signature:\n",
"pd.DataFrame.fillna(\n",
"    self,\n",
"    value: 'Hashable | Mapping | Series | DataFrame' = None,\n",
"    *,\n",
"    method: 'FillnaOptions | None' = None,\n",
"    axis: 'Axis | None' = None,\n",
"    inplace: 'bool' = False,\n",
"    limit: 'int | None' = None,\n",
"    downcast: 'dict | None' = None,\n",
") -> 'DataFrame | None'\n",
"Docstring:\n",
"Fill NA/NaN values using the specified method.\n",
"\n",
"Parameters\n",
"----------\n",
"value : scalar, dict, Series, or DataFrame\n",
" Value to use to fill holes (e.g. 0), alternately a\n",
" dict/Series/DataFrame of values specifying which value to use for\n",
" each index (for a Series) or column (for a DataFrame). Values not\n",
" in the dict/Series/DataFrame will not be filled. This value cannot\n",
" be a list.\n",
"method : {'backfill', 'bfill', 'pad', 'ffill', None}, default None\n",
" Method to use for filling holes in reindexed Series\n",
" pad / ffill: propagate last valid observation forward to next valid\n",
" backfill / bfill: use next valid observation to fill gap.\n",
"axis : {0 or 'index', 1 or 'columns'}\n",
" Axis along which to fill missing values. For `Series`\n",
" this parameter is unused and defaults to 0.\n",
"inplace : bool, default False\n",
" If True, fill in-place. Note: this will modify any\n",
" other views on this object (e.g., a no-copy slice for a column in a\n",
" DataFrame).\n",
"limit : int, default None\n",
" If method is specified, this is the maximum number of consecutive\n",
" NaN values to forward/backward fill. In other words, if there is\n",
" a gap with more than this number of consecutive NaNs, it will only\n",
" be partially filled. If method is not specified, this is the\n",
" maximum number of entries along the entire axis where NaNs will be\n",
" filled. Must be greater than 0 if not None.\n",
"downcast : dict, default is None\n",
" A dict of item->dtype of what to downcast if possible,\n",
" or the string 'infer' which will try to downcast to an appropriate\n",
" equal type (e.g. float64 to int64 if possible).\n",
"\n",
"Returns\n",
"-------\n",
"DataFrame or None\n",
" Object with missing values filled or None if ``inplace=True``.\n",
"\n",
"See Also\n",
"--------\n",
"interpolate : Fill NaN values using interpolation.\n",
"reindex : Conform object to new index.\n",
"asfreq : Convert TimeSeries to specified frequency.\n",
"\n",
"Examples\n",
"--------\n",
">>> df = pd.DataFrame([[np.nan, 2, np.nan, 0],\n",
"... [3, 4, np.nan, 1],\n",
"... [np.nan, np.nan, np.nan, np.nan],\n",
"... [np.nan, 3, np.nan, 4]],\n",
"... columns=list(\"ABCD\"))\n",
">>> df\n",
" A B C D\n",
"0 NaN 2.0 NaN 0.0\n",
"1 3.0 4.0 NaN 1.0\n",
"2 NaN NaN NaN NaN\n",
"3 NaN 3.0 NaN 4.0\n",
"\n",
"Replace all NaN elements with 0s.\n",
"\n",
">>> df.fillna(0)\n",
" A B C D\n",
"0 0.0 2.0 0.0 0.0\n",
"1 3.0 4.0 0.0 1.0\n",
"2 0.0 0.0 0.0 0.0\n",
"3 0.0 3.0 0.0 4.0\n",
"\n",
"We can also propagate non-null values forward or backward.\n",
"\n",
">>> df.fillna(method=\"ffill\")\n",
" A B C D\n",
"0 NaN 2.0 NaN 0.0\n",
"1 3.0 4.0 NaN 1.0\n",
"2 3.0 4.0 NaN 1.0\n",
"3 3.0 3.0 NaN 4.0\n",
"\n",
"Replace all NaN elements in column 'A', 'B', 'C', and 'D', with 0, 1,\n",
"2, and 3 respectively.\n",
"\n",
">>> values = {\"A\": 0, \"B\": 1, \"C\": 2, \"D\": 3}\n",
">>> df.fillna(value=values)\n",
" A B C D\n",
"0 0.0 2.0 2.0 0.0\n",
"1 3.0 4.0 2.0 1.0\n",
"2 0.0 1.0 2.0 3.0\n",
"3 0.0 3.0 2.0 4.0\n",
"\n",
"Only replace the first NaN element.\n",
"\n",
">>> df.fillna(value=values, limit=1)\n",
" A B C D\n",
"0 0.0 2.0 2.0 0.0\n",
"1 3.0 4.0 NaN 1.0\n",
"2 NaN 1.0 NaN 3.0\n",
"3 NaN 3.0 NaN 4.0\n",
"\n",
"When filling using a DataFrame, replacement happens along\n",
"the same column names and same indices\n",
"\n",
">>> df2 = pd.DataFrame(np.zeros((4, 4)), columns=list(\"ABCE\"))\n",
">>> df.fillna(df2)\n",
" A B C D\n",
"0 0.0 2.0 0.0 0.0\n",
"1 3.0 4.0 0.0 1.0\n",
"2 0.0 0.0 0.0 NaN\n",
"3 0.0 3.0 0.0 4.0\n",
"\n",
"Note that column D is not affected since it is not present in df2.\n",
"File: ~/mambaforge/envs/casbbi-nrt-ds/lib/python3.11/site-packages/pandas/core/frame.py\n",
"Type: function"
]
}
],
"source": [
"pd.DataFrame.fillna?"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" clump_thickness uniformity_cell_size uniformity_cell_shape \\\n",
"code \n",
"1000025 5 1 1 \n",
"1002945 5 4 4 \n",
"1015425 3 1 1 \n",
"1016277 6 8 8 \n",
"1017023 4 1 1 \n",
"... ... ... ... \n",
"776715 3 1 1 \n",
"841769 2 1 1 \n",
"888820 5 10 10 \n",
"897471 4 8 6 \n",
"897471 4 8 8 \n",
"\n",
" marginal_adhesion single_epithelial_cell_size bare_nuclei \\\n",
"code \n",
"1000025 1 2 1.0 \n",
"1002945 5 7 10.0 \n",
"1015425 1 2 2.0 \n",
"1016277 1 3 4.0 \n",
"1017023 3 2 1.0 \n",
"... ... ... ... \n",
"776715 1 3 2.0 \n",
"841769 1 2 1.0 \n",
"888820 3 7 3.0 \n",
"897471 4 3 4.0 \n",
"897471 5 4 5.0 \n",
"\n",
" bland_chromatin normal_cucleoli mitoses class \n",
"code \n",
"1000025 3 1 1 2 \n",
"1002945 3 2 1 2 \n",
"1015425 3 1 1 2 \n",
"1016277 3 7 1 2 \n",
"1017023 3 1 1 2 \n",
"... ... ... ... ... \n",
"776715 1 1 1 2 \n",
"841769 1 1 1 2 \n",
"888820 8 10 2 4 \n",
"897471 10 6 1 4 \n",
"897471 10 4 1 4 \n",
"\n",
"[699 rows x 10 columns]"
]
},
"execution_count": 60,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"breast_cancer_data.fillna(value={'bare_nuclei': bare_nuclei_mean})"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"clump_thickness                False\n",
"uniformity_cell_size           False\n",
"uniformity_cell_shape          False\n",
"marginal_adhesion              False\n",
"single_epithelial_cell_size    False\n",
"bare_nuclei                     True\n",
"bland_chromatin                False\n",
"normal_cucleoli                False\n",
"mitoses                        False\n",
"class                          False\n",
"dtype: bool"
]
},
"execution_count": 61,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"breast_cancer_data.isnull().any()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"What? There are still null values. That is because most `pandas` methods return a new DataFrame instead of modifying the original one. You have two options:\n",
"\n",
"* Assign the result back to the same variable.\n",
"* If the method allows it, pass `inplace=True`."
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {},
"outputs": [],
"source": [
"breast_cancer_data.fillna(value={'bare_nuclei': bare_nuclei_mean}, inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"clump_thickness                False\n",
"uniformity_cell_size           False\n",
"uniformity_cell_shape          False\n",
"marginal_adhesion              False\n",
"single_epithelial_cell_size    False\n",
"bare_nuclei                    False\n",
"bland_chromatin                False\n",
"normal_cucleoli                False\n",
"mitoses                        False\n",
"class                          False\n",
"dtype: bool"
]
},
"execution_count": 63,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"breast_cancer_data.isnull().any()"
]
},
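{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The first option, assigning the result back, can be sketched on a small toy frame. This is a minimal illustration and not part of the analysis above: `toy` and `toy_mean` are hypothetical names, and the values are made up.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy data (not the breast cancer dataset)\n",
"toy = pd.DataFrame({\"bare_nuclei\": [1.0, np.nan, 10.0, np.nan]})\n",
"toy_mean = toy[\"bare_nuclei\"].mean()  # NaN values are skipped by default\n",
"\n",
"# Calling fillna without keeping the result leaves `toy` unchanged\n",
"toy.fillna(value={\"bare_nuclei\": toy_mean})\n",
"still_has_nulls = toy[\"bare_nuclei\"].isnull().any()\n",
"\n",
"# Assigning the result back to the same variable keeps the change\n",
"toy = toy.fillna(value={\"bare_nuclei\": toy_mean})\n",
"has_nulls_after = toy[\"bare_nuclei\"].isnull().any()\n",
"\n",
"print(still_has_nulls, has_nulls_after)\n",
"```\n",
"\n",
"Reassignment works with any method that returns a new object, whereas `inplace=True` is only accepted by some methods."
]
},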
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Summary and next steps"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"In this session we explored a dataset: we read it, got to know its elements and methods, and cleaned it by filling its null values.\n",
"\n",
"We didn't have enough time to cover them, but you should also learn about merging datasets, aggregation, etc.\n",
"\n",
"Just a few examples:"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>class</th>\n",
"      <th>cancer</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>2</td>\n",
"      <td>benign</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>4</td>\n",
"      <td>malignant</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>0</td>\n",
"      <td>unknown</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"   class     cancer\n",
"0      2     benign\n",
"1      4  malignant\n",
"2      0    unknown"
]
},
"execution_count": 66,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cancer_names = pd.DataFrame(\n",
"    [[2, \"benign\"], [4, \"malignant\"], [0, \"unknown\"]],\n",
"    columns=[\"class\", \"cancer\"]\n",
")\n",
"cancer_names"
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>clump_thickness</th>\n",
"      <th>uniformity_cell_size</th>\n",
"      <th>uniformity_cell_shape</th>\n",
"      <th>marginal_adhesion</th>\n",
"      <th>single_epithelial_cell_size</th>\n",
"      <th>bare_nuclei</th>\n",
"      <th>bland_chromatin</th>\n",
"      <th>normal_cucleoli</th>\n",
"      <th>mitoses</th>\n",
"      <th>class</th>\n",
"      <th>cancer</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>5</td>\n",
"      <td>1</td>\n",
"      <td>1</td>\n",
"      <td>1</td>\n",
"      <td>2</td>\n",
"      <td>1.0</td>\n",
"      <td>3</td>\n",
"      <td>1</td>\n",
"      <td>1</td>\n",
"      <td>2</td>\n",
"      <td>benign</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>5</td>\n",
"      <td>4</td>\n",
"      <td>4</td>\n",
"      <td>5</td>\n",
"      <td>7</td>\n",
"      <td>10.0</td>\n",
"      <td>3</td>\n",
"      <td>2</td>\n",
"      <td>1</td>\n",
"      <td>2</td>\n",
"      <td>benign</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>3</td>\n",
"      <td>1</td>\n",
"      <td>1</td>\n",
"      <td>1</td>\n",
"      <td>2</td>\n",
"      <td>2.0</td>\n",
"      <td>3</td>\n",
"      <td>1</td>\n",
"      <td>1</td>\n",
"      <td>2</td>\n",
"      <td>benign</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3</th>\n",
"      <td>6</td>\n",
"      <td>8</td>\n",
"      <td>8</td>\n",
"      <td>1</td>\n",
"      <td>3</td>\n",
"      <td>4.0</td>\n",
"      <td>3</td>\n",
"      <td>7</td>\n",
"      <td>1</td>\n",
"      <td>2</td>\n",
"      <td>benign</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>4</th>\n",
"      <td>4</td>\n",
"      <td>1</td>\n",
"      <td>1</td>\n",
"      <td>3</td>\n",
"      <td>2</td>\n",
"      <td>1.0</td>\n",
"      <td>3</td>\n",
"      <td>1</td>\n",
"      <td>1</td>\n",
"      <td>2</td>\n",
"      <td>benign</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>...</th>\n",
"      <td>...</td>\n",
"      <td>...</td>\n",
"      <td>...</td>\n",
"      <td>...</td>\n",
"      <td>...</td>\n",
"      <td>...</td>\n",
"      <td>...</td>\n",
"      <td>...</td>\n",
"      <td>...</td>\n",
"      <td>...</td>\n",
"      <td>...</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>694</th>\n",
"      <td>3</td>\n",
"      <td>1</td>\n",
"      <td>1</td>\n",
"      <td>1</td>\n",
"      <td>3</td>\n",
"      <td>2.0</td>\n",
"      <td>1</td>\n",
"      <td>1</td>\n",
"      <td>1</td>\n",
"      <td>2</td>\n",
"      <td>benign</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>695</th>\n",
"      <td>2</td>\n",
"      <td>1</td>\n",
"      <td>1</td>\n",
"      <td>1</td>\n",
"      <td>2</td>\n",
"      <td>1.0</td>\n",
"      <td>1</td>\n",
"      <td>1</td>\n",
"      <td>1</td>\n",
"      <td>2</td>\n",
"      <td>benign</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>696</th>\n",
"      <td>5</td>\n",
"      <td>10</td>\n",
"      <td>10</td>\n",
"      <td>3</td>\n",
"      <td>7</td>\n",
"      <td>3.0</td>\n",
"      <td>8</td>\n",
"      <td>10</td>\n",
"      <td>2</td>\n",
"      <td>4</td>\n",
"      <td>malignant</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>697</th>\n",
"      <td>4</td>\n",
"      <td>8</td>\n",
"      <td>6</td>\n",
"      <td>4</td>\n",
"      <td>3</td>\n",
"      <td>4.0</td>\n",
"      <td>10</td>\n",
"      <td>6</td>\n",
"      <td>1</td>\n",
"      <td>4</td>\n",
"      <td>malignant</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>698</th>\n",
"      <td>4</td>\n",
"      <td>8</td>\n",
"      <td>8</td>\n",
"      <td>5</td>\n",
"      <td>4</td>\n",
"      <td>5.0</td>\n",
"      <td>10</td>\n",
"      <td>4</td>\n",
"      <td>1</td>\n",
"      <td>4</td>\n",
"      <td>malignant</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>699 rows × 11 columns</p>\n",
"</div>"
],
"text/plain": [
"     clump_thickness  uniformity_cell_size  uniformity_cell_shape  \\\n",
"0                  5                     1                      1   \n",
"1                  5                     4                      4   \n",
"2                  3                     1                      1   \n",
"3                  6                     8                      8   \n",
"4                  4                     1                      1   \n",
"..               ...                   ...                    ...   \n",
"694                3                     1                      1   \n",
"695                2                     1                      1   \n",
"696                5                    10                     10   \n",
"697                4                     8                      6   \n",
"698                4                     8                      8   \n",
"\n",
"     marginal_adhesion  single_epithelial_cell_size  bare_nuclei  \\\n",
"0                    1                            2          1.0   \n",
"1                    5                            7         10.0   \n",
"2                    1                            2          2.0   \n",
"3                    1                            3          4.0   \n",
"4                    3                            2          1.0   \n",
"..                 ...                          ...          ...   \n",
"694                  1                            3          2.0   \n",
"695                  1                            2          1.0   \n",
"696                  3                            7          3.0   \n",
"697                  4                            3          4.0   \n",
"698                  5                            4          5.0   \n",
"\n",
"     bland_chromatin  normal_cucleoli  mitoses  class     cancer  \n",
"0                  3                1        1      2     benign  \n",
"1                  3                2        1      2     benign  \n",
"2                  3                1        1      2     benign  \n",
"3                  3                7        1      2     benign  \n",
"4                  3                1        1      2     benign  \n",
"..               ...              ...      ...    ...        ...  \n",
"694                1                1        1      2     benign  \n",
"695                1                1        1      2     benign  \n",
"696                8               10        2      4  malignant  \n",
"697               10                6        1      4  malignant  \n",
"698               10                4        1      4  malignant  \n",
"\n",
"[699 rows x 11 columns]"
]
},
"execution_count": 68,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"breast_cancer_data2 = breast_cancer_data.merge(\n",
"    cancer_names,\n",
"    how=\"left\",\n",
"    on=\"class\"\n",
")\n",
"breast_cancer_data2"
]
},
{
"cell_type": "code",
"execution_count": 69,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['benign', 'malignant'], dtype=object)"
]
},
"execution_count": 69,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"breast_cancer_data2[\"cancer\"].unique()"
]
},
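{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Only two labels appear because a left merge keeps every row of the left table and adds a label only where the `class` value has a match; the `unknown` row of `cancer_names` matches nothing. A minimal sketch on hypothetical toy frames (`samples` and `labels` are made-up names and values) uses the `indicator=True` option of `merge` to show where each row came from:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy frames (not the breast cancer dataset)\n",
"samples = pd.DataFrame({\"class\": [2, 4, 2, 7]})  # 7 has no label\n",
"labels = pd.DataFrame(\n",
"    {\"class\": [2, 4, 0], \"cancer\": [\"benign\", \"malignant\", \"unknown\"]}\n",
")\n",
"\n",
"# how=\"left\" keeps all rows of `samples`; indicator=True adds a `_merge` column\n",
"merged = samples.merge(labels, how=\"left\", on=\"class\", indicator=True)\n",
"\n",
"# Rows without a matching label get NaN in `cancer` and \"left_only\" in `_merge`\n",
"unmatched = merged[merged[\"_merge\"] == \"left_only\"]\n",
"print(merged)\n",
"print(len(unmatched))\n",
"```\n",
"\n",
"Checking for `left_only` rows is a quick way to spot class codes that are missing from the lookup table."
]
},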
{
"cell_type": "code",
"execution_count": 71,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th>cancer</th>\n",
"      <th>benign</th>\n",
"      <th>malignant</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>clump_thickness</th>\n",
"      <td>2.956332</td>\n",
"      <td>7.195021</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>uniformity_cell_size</th>\n",
"      <td>1.325328</td>\n",
"      <td>6.572614</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>uniformity_cell_shape</th>\n",
"      <td>1.443231</td>\n",
"      <td>6.560166</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>marginal_adhesion</th>\n",
"      <td>1.364629</td>\n",
"      <td>5.547718</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>single_epithelial_cell_size</th>\n",
"      <td>2.120087</td>\n",
"      <td>5.298755</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>bare_nuclei</th>\n",
"      <td>1.427948</td>\n",
"      <td>7.597510</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>bland_chromatin</th>\n",
"      <td>2.100437</td>\n",
"      <td>5.979253</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>normal_cucleoli</th>\n",
"      <td>1.290393</td>\n",
"      <td>5.863071</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>mitoses</th>\n",
"      <td>1.063319</td>\n",
"      <td>2.589212</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>class</th>\n",
"      <td>2.000000</td>\n",
"      <td>4.000000</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"cancer                         benign  malignant\n",
"clump_thickness              2.956332   7.195021\n",
"uniformity_cell_size         1.325328   6.572614\n",
"uniformity_cell_shape        1.443231   6.560166\n",
"marginal_adhesion            1.364629   5.547718\n",
"single_epithelial_cell_size  2.120087   5.298755\n",
"bare_nuclei                  1.427948   7.597510\n",
"bland_chromatin              2.100437   5.979253\n",
"normal_cucleoli              1.290393   5.863071\n",
"mitoses                      1.063319   2.589212\n",
"class                        2.000000   4.000000"
]
},
"execution_count": 71,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"breast_cancer_data2.groupby(\"cancer\").mean().T"
]
},
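{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"`mean()` is only one possible aggregation. As a sketch on hypothetical toy data (the frame and the output column names below are made up for illustration), `agg` with named aggregations computes several statistics per group in a single call:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy data (not the breast cancer dataset)\n",
"toy = pd.DataFrame({\n",
"    \"cancer\": [\"benign\", \"benign\", \"malignant\", \"malignant\"],\n",
"    \"clump_thickness\": [1, 3, 8, 10],\n",
"    \"mitoses\": [1, 1, 2, 4],\n",
"})\n",
"\n",
"# Each keyword is an output column: (input column, aggregation function)\n",
"summary = toy.groupby(\"cancer\").agg(\n",
"    mean_thickness=(\"clump_thickness\", \"mean\"),\n",
"    max_mitoses=(\"mitoses\", \"max\"),\n",
"    n_samples=(\"mitoses\", \"count\"),\n",
")\n",
"print(summary)\n",
"```\n",
"\n",
"Named aggregations keep the result columns flat and self-describing, which helps when the summary is merged or plotted later."
]
},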
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"A very good place to learn more is the official [user guide](https://pandas.pydata.org/pandas-docs/stable/user_guide/index.html)."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "casbbi-nrt-ds",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.0"
},
"orig_nbformat": 4,
"vscode": {
"interpreter": {
"hash": "aaa63b10875fd439d6e29c7af3cfc11b4f81a200a31b26fa6fffddaf9fa68644"
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}