{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Análisis Exploratorio de Datos"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Hemos hablado mucho de datos, todo muy de libro o juguete. Esta clase intentará acercarte a algunos de los principales desafíos a la hora de trabajar con distintas fuentes de datos y los problemas usuales que podrías encontrar."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Fuentes de datos"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Para variar un poco, utilizaremos la librería `pathlib` en lugar de `os` para manejar directorios. El paradigma es un poco distinto, en lugar de muchas funciones, la filosofía es tratar a los directorios como objetos que tienen sus propios métodos y operaciones."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"from pathlib import Path"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"/home/jovyan/work/mat281_2020S2/data\n"
]
}
],
"source": [
"data_path = Path().resolve().parent / \"data\"\n",
"print(data_path)"
]
},
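{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sketch of that object-oriented flavor: the `/` operator joins path segments, and methods such as `exists()` and `glob()` replace the corresponding `os`/`os.path` calls. (What the listing prints depends on what your `data` folder actually contains.)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Paths are objects: methods instead of standalone functions.\n",
"print(data_path.exists())\n",
"print(sorted(p.name for p in data_path.glob(\"*.csv\")))"
]
},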
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### CSV"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Del inglés _Comma-Separated Values_, los archivos CSV utilizan comas (\",\") para separar valores y cada registro consiste de una fila. \n",
"\n",
"- Pros:\n",
" * Livianos.\n",
" * De fácil entendimiento.\n",
" * Editables usando un editor de texto.\n",
"- Contras:\n",
" * No está totalmente estandarizado (e.g. ¿Qué pasa si un valor tiene comas?)\n",
" * Son sensible al _encoding_ (es la forma en que se codifica un carácter).\n",
" \n",
"Pandas posee su propia función para leer csv: `pd.read_csv()`."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# Documentación\n",
"# pd.read_csv?"
]
},
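{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the comma problem: the usual convention is to quote values that contain commas, and `pd.read_csv` honors those quotes. The inline string here is made up for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import io\n",
"\n",
"# The quoted field \"hola, mundo\" stays in one column despite its comma.\n",
"csv_text = 'nombre,comentario\\nJuan,\"hola, mundo\"\\n'\n",
"pd.read_csv(io.StringIO(csv_text))"
]
},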
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Un ejemplo de _encoding_ incorrecto"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" nombre | \n",
" apellido | \n",
" edad | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Juan | \n",
" P茅rez | \n",
" 12.0 | \n",
"
\n",
" \n",
" 1 | \n",
" Le贸n | \n",
" Pardo | \n",
" 29.0 | \n",
"
\n",
" \n",
" 2 | \n",
" Jos茅 | \n",
" Nu帽ez | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" nombre apellido edad\n",
"0 Juan P茅rez 12.0\n",
"1 Le贸n Pardo 29.0\n",
"2 Jos茅 Nu帽ez NaN"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.read_csv(data_path / \"encoding_example.csv\", sep=\",\", encoding=\"gbk\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Mientras que el mismo dataset con el encoding correcto luce así"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" nombre | \n",
" apellido | \n",
" edad | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Juan | \n",
" Pérez | \n",
" 12.0 | \n",
"
\n",
" \n",
" 1 | \n",
" León | \n",
" Pardo | \n",
" 29.0 | \n",
"
\n",
" \n",
" 2 | \n",
" José | \n",
" Nuñez | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" nombre apellido edad\n",
"0 Juan Pérez 12.0\n",
"1 León Pardo 29.0\n",
"2 José Nuñez NaN"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.read_csv(data_path / \"encoding_example.csv\", sep=\",\", encoding=\"utf-8\")"
]
},
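{
"cell_type": "markdown",
"metadata": {},
"source": [
"When the encoding is unknown, one low-tech approach (a sketch; libraries like `chardet` automate this) is to decode the raw bytes with a few candidate encodings and see which one looks sensible:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Try candidate encodings on the raw bytes; a wrong but decodable choice shows up as mojibake.\n",
"raw = (data_path / \"encoding_example.csv\").read_bytes()\n",
"for enc in (\"utf-8\", \"latin-1\", \"gbk\"):\n",
"    try:\n",
"        print(f\"{enc:>8}: {raw.decode(enc).splitlines()[1]}\")\n",
"    except UnicodeDecodeError as err:\n",
"        print(f\"{enc:>8}: failed ({err})\")"
]
},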
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### JSON"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Acrónimo de _JavaScript Object Notation_, utilizado principalmente para intercambiar datos entre una aplicación web y un servidor.\n",
"\n",
"- Pros:\n",
" * Livianos.\n",
" * De fácil entendimiento.\n",
" * Editables usando un editor de texto.\n",
" * Formato estandarizado.\n",
"- Contras:\n",
" * La lectura con pandas puede ser un poco complicada.\n",
" \n",
"Pandas posee su propia función para leer JSON: `pd.read_json()`."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"# pd.read_json?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Se parecen mucho a los diccionarios de python pero en un archivo de texto."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{\n",
" \"integer\": {\n",
" \"0\": 5,\n",
" \"1\": 5,\n",
" \"2\": 9,\n",
" \"3\": 6,\n",
" \"4\": 6,\n",
" \"5\": 9,\n",
" \"6\": 7,\n",
" \"7\": 1,\n"
]
}
],
"source": [
"!head ../data/json_example.json"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" integer | \n",
" datetime | \n",
" category | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 5 | \n",
" 2015-01-01 00:00:00 | \n",
" 0 | \n",
"
\n",
" \n",
" 1 | \n",
" 5 | \n",
" 2015-01-01 00:00:01 | \n",
" 0 | \n",
"
\n",
" \n",
" 2 | \n",
" 9 | \n",
" 2015-01-01 00:00:02 | \n",
" 0 | \n",
"
\n",
" \n",
" 3 | \n",
" 6 | \n",
" 2015-01-01 00:00:03 | \n",
" 0 | \n",
"
\n",
" \n",
" 4 | \n",
" 6 | \n",
" 2015-01-01 00:00:04 | \n",
" 0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" integer datetime category\n",
"0 5 2015-01-01 00:00:00 0\n",
"1 5 2015-01-01 00:00:01 0\n",
"2 9 2015-01-01 00:00:02 0\n",
"3 6 2015-01-01 00:00:03 0\n",
"4 6 2015-01-01 00:00:04 0"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.read_json(data_path / \"json_example.json\", orient=\"columns\").head()"
]
},
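{
"cell_type": "markdown",
"metadata": {},
"source": [
"The tricky part is usually the `orient` argument, which tells pandas how the JSON structure maps to rows and columns. A tiny sketch with an inline, made-up document:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import io\n",
"\n",
"# orient=\"records\" expects a list of row-dicts;\n",
"# orient=\"columns\" (used above) expects a dict of column-dicts.\n",
"pd.read_json(io.StringIO('[{\"a\": 1, \"b\": 2}, {\"a\": 3, \"b\": 4}]'), orient=\"records\")"
]
},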
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Pickle"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Es un módulo que implementa protocolos binarios de serialización y des-serialización de objetos de Python.\n",
"\n",
"* Pros\n",
" - Puede representar una inmensa cantidad de tipos de objetos de python.\n",
" - En un contexto de seguridad, como no es legible por el ser humano (representación binaria) puede ser útil para almacenar datos sensibles.\n",
"* Contras:\n",
" - Solo Python.\n",
" - Si viene de un tercero podría tener contenido malicioso.\n",
" \n",
"Pandas posee su propia función para leer pickles: `pd.read_pickle()`."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"# pd.read_pickle?"
]
},
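{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal round-trip sketch with a throwaway dataframe, showing that serializing and restoring preserves the object. The temporary file name is arbitrary."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import tempfile\n",
"\n",
"# Write a dataframe to a pickle and read it back unchanged.\n",
"tmp = Path(tempfile.gettempdir()) / \"roundtrip_example.pkl\"\n",
"pd.DataFrame({\"a\": [1, 2, 3]}).to_pickle(tmp)\n",
"pd.read_pickle(tmp)  # only ever do this with files you trust"
]
},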
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" name | \n",
" year_start | \n",
" year_end | \n",
" position | \n",
" height | \n",
" weight | \n",
" birth_date | \n",
" college | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Alaa Abdelnaby | \n",
" 1991 | \n",
" 1995 | \n",
" F-C | \n",
" 6-10 | \n",
" 240.0 | \n",
" June 24, 1968 | \n",
" Duke University | \n",
"
\n",
" \n",
" 1 | \n",
" Zaid Abdul-Aziz | \n",
" 1969 | \n",
" 1978 | \n",
" C-F | \n",
" 6-9 | \n",
" 235.0 | \n",
" April 7, 1946 | \n",
" Iowa State University | \n",
"
\n",
" \n",
" 2 | \n",
" Kareem Abdul-Jabbar | \n",
" 1970 | \n",
" 1989 | \n",
" C | \n",
" 7-2 | \n",
" 225.0 | \n",
" April 16, 1947 | \n",
" University of California, Los Angeles | \n",
"
\n",
" \n",
" 3 | \n",
" Mahmoud Abdul-Rauf | \n",
" 1991 | \n",
" 2001 | \n",
" G | \n",
" 6-1 | \n",
" 162.0 | \n",
" March 9, 1969 | \n",
" Louisiana State University | \n",
"
\n",
" \n",
" 4 | \n",
" Tariq Abdul-Wahad | \n",
" 1998 | \n",
" 2003 | \n",
" F | \n",
" 6-6 | \n",
" 223.0 | \n",
" November 3, 1974 | \n",
" San Jose State University | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" name year_start year_end position height weight \\\n",
"0 Alaa Abdelnaby 1991 1995 F-C 6-10 240.0 \n",
"1 Zaid Abdul-Aziz 1969 1978 C-F 6-9 235.0 \n",
"2 Kareem Abdul-Jabbar 1970 1989 C 7-2 225.0 \n",
"3 Mahmoud Abdul-Rauf 1991 2001 G 6-1 162.0 \n",
"4 Tariq Abdul-Wahad 1998 2003 F 6-6 223.0 \n",
"\n",
" birth_date college \n",
"0 June 24, 1968 Duke University \n",
"1 April 7, 1946 Iowa State University \n",
"2 April 16, 1947 University of California, Los Angeles \n",
"3 March 9, 1969 Louisiana State University \n",
"4 November 3, 1974 San Jose State University "
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.read_pickle(data_path / 'nba.pkl').head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### SQL"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Conocimos las bases de datos relacionales SQL en clases anteriores y como recordarás existe la función `pd.read_sql()`, lo interesante aquí es que debes crear una conexión antes de poder leer la base de datos. Cada Sistema de Gestión de Bases de Datos Relacionales (_Relational Database Management System_ o RDBMS) tiene su propia forma de conectarse."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"# pd.read_sql?"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" AlbumId | \n",
" Title | \n",
" ArtistId | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" For Those About To Rock We Salute You | \n",
" 1 | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" Balls to the Wall | \n",
" 2 | \n",
"
\n",
" \n",
" 2 | \n",
" 3 | \n",
" Restless and Wild | \n",
" 2 | \n",
"
\n",
" \n",
" 3 | \n",
" 4 | \n",
" Let There Be Rock | \n",
" 1 | \n",
"
\n",
" \n",
" 4 | \n",
" 5 | \n",
" Big Ones | \n",
" 3 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" AlbumId Title ArtistId\n",
"0 1 For Those About To Rock We Salute You 1\n",
"1 2 Balls to the Wall 2\n",
"2 3 Restless and Wild 2\n",
"3 4 Let There Be Rock 1\n",
"4 5 Big Ones 3"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import sqlite3\n",
"connector = sqlite3.connect(data_path / \"chinook.db\")\n",
"pd.read_sql_query(\"select * from albums\", con=connector).head()"
]
},
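{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see how the connection object changes with the RDBMS, here is the same query through a SQLAlchemy engine (a sketch, assuming `sqlalchemy` is installed); for another system you would only swap the URL, e.g. `postgresql://user:password@host/dbname`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sqlalchemy import create_engine\n",
"\n",
"# SQLite URLs take the form sqlite:///<path>; the engine replaces the raw connector.\n",
"engine = create_engine(f\"sqlite:///{data_path / 'chinook.db'}\")\n",
"pd.read_sql_query(\"select * from albums\", con=engine).head()"
]
},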
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### API\n",
"\n",
"¿Has escuchado el término __API__? Fuera de todo tecnicismo, las APIs (_Application Programming Interface_) permiten hacer uso de funciones ya existentes en otro software (o de la infraestructura ya existente en otras plataformas) para no estar reinventando la rueda constantemente, reutilizando así código que se sabe que está probado y que funciona correctamente. Por ejemplo, cuando haces una compra online y utilizas WebPay o una página utiliza los mapas de GoogleMaps. ¡Hay APIs en todos lados!\n",
"\n",
"Utilizaremos la API de Open Notify para obtener cuántas personas hay en el espacio en este momento ([link](http://open-notify.org/Open-Notify-API/People-In-Space/))."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"import requests"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"response has type \n",
"\n"
]
}
],
"source": [
"response = requests.get(\"http://api.open-notify.org/astros.json\")\n",
"print(f\"response has type {type(response)}\")\n",
"print(response)"
]
},
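{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before parsing anything it is good practice to check that the request actually succeeded; a small sketch using standard `requests` attributes:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# raise_for_status() throws for 4xx/5xx responses; 200 means OK.\n",
"response.raise_for_status()\n",
"print(response.status_code)"
]
},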
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Puedes acceder a su contenido como un JSON de la siguiente manera"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'number': 3,\n",
" 'people': [{'craft': 'ISS', 'name': 'Chris Cassidy'},\n",
" {'craft': 'ISS', 'name': 'Anatoly Ivanishin'},\n",
" {'craft': 'ISS', 'name': 'Ivan Vagner'}],\n",
" 'message': 'success'}"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"response.json()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Lo cual en la práctica lo carga como un diccionario en Python"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dict"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(response.json())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Por lo que podemos cargar ciertas estructuras a dataframes con métodos de pandas que utilicen diccionarios. Por dar un ejemplo, dentro del JSON obtenido hay una lista de personas."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" craft | \n",
" name | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" ISS | \n",
" Chris Cassidy | \n",
"
\n",
" \n",
" 1 | \n",
" ISS | \n",
" Anatoly Ivanishin | \n",
"
\n",
" \n",
" 2 | \n",
" ISS | \n",
" Ivan Vagner | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" craft name\n",
"0 ISS Chris Cassidy\n",
"1 ISS Anatoly Ivanishin\n",
"2 ISS Ivan Vagner"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.DataFrame.from_dict(response.json()[\"people\"])"
]
},
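{
"cell_type": "markdown",
"metadata": {},
"source": [
"For more deeply nested JSON, `pd.json_normalize` (a top-level function since pandas 1.0) can flatten a record path directly; a sketch on the same response:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Extract the list under the \"people\" key and flatten it into a dataframe.\n",
"pd.json_normalize(response.json(), record_path=\"people\")"
]
},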
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Manos a la obra"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"El análisis exploratorio de datos es una forma de analizar datos definido por John W. Tukey (E.D.A.: Exploratory data analysis) es el tratamiento estadístico al que se someten las muestras recogidas durante un proceso de investigación en cualquier campo científico. Para mayor rapidez y precisión, todo el proceso suele realizarse por medios informáticos, con aplicaciones específicas para el tratamiento estadístico. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"El análisis exploratorio de datos debería dar respuestas (al menos) a lo siguiente:\n",
"1. ¿Qué pregunta(s) estás tratando de resolver (o probar que estás equivocado)?\n",
"2. ¿Qué tipo de datos tiene y cómo trata los diferentes tipos?\n",
"3. ¿Qué falta en los datos y cómo los maneja?\n",
"4. ¿Qué hacer con los datos faltantes, outliers o información mal inputada?\n",
"5. ¿Se puede sacar más provecho a los datos ?\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"toc-hr-collapsed": false
},
"source": [
"### Ejemplo: Datos de terremotos\n",
"\n",
"El dataset `earthquakes.csv` contiene la información de los terremotos de los países durante el año 2000 al 2011. Debido a que la información de este dataset es relativamente fácil de trabajar, hemos creado un dataset denominado `earthquakes_contaminated.csv` que posee información contaminada en cada una de sus columnas. De esta forma se podrá ilustrar los distintos inconvenientes al realizar análisis exploratorio de datos."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Año | \n",
" Pais | \n",
" Magnitud | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 2011 | \n",
" Turkey | \n",
" 7.1 | \n",
"
\n",
" \n",
" 1 | \n",
" 2011 | \n",
" India | \n",
" 6.9 | \n",
"
\n",
" \n",
" 2 | \n",
" 2011 | \n",
" Japan | \n",
" 7.1 | \n",
"
\n",
" \n",
" 3 | \n",
" 2011 | \n",
" Burma | \n",
" 6.8 | \n",
"
\n",
" \n",
" 4 | \n",
" 2011 | \n",
" Japan | \n",
" 9.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Año Pais Magnitud\n",
"0 2011 Turkey 7.1\n",
"1 2011 India 6.9\n",
"2 2011 Japan 7.1\n",
"3 2011 Burma 6.8\n",
"4 2011 Japan 9.0"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.read_csv(data_path / \"earthquakes.csv\").head()"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Año | \n",
" Pais | \n",
" Magnitud | \n",
" Informacion | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 2000 | \n",
" Turkey | \n",
" 6 | \n",
" info no valiosa | \n",
"
\n",
" \n",
" 1 | \n",
" 2000 | \n",
" Turkmenistan | \n",
" 7 | \n",
" info no valiosa | \n",
"
\n",
" \n",
" 2 | \n",
" 2000 | \n",
" Azerbaijan | \n",
" 6.5 | \n",
" info no valiosa | \n",
"
\n",
" \n",
" 3 | \n",
" 2000 | \n",
" Azerbaijan | \n",
" 6.8 | \n",
" info no valiosa | \n",
"
\n",
" \n",
" 4 | \n",
" 2000 | \n",
" Papua New Guinea | \n",
" 8 | \n",
" info no valiosa | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Año Pais Magnitud Informacion\n",
"0 2000 Turkey 6 info no valiosa\n",
"1 2000 Turkmenistan 7 info no valiosa\n",
"2 2000 Azerbaijan 6.5 info no valiosa\n",
"3 2000 Azerbaijan 6.8 info no valiosa\n",
"4 2000 Papua New Guinea 8 info no valiosa"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"earthquakes = pd.read_csv(data_path / \"earthquakes_contaminated.csv\")\n",
"earthquakes.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"__Variables__\n",
"\n",
"* __Pais__:\n",
" - Descripción: País del devento sísmico.\n",
" - Tipo: _string_\n",
" - Observaciones: No deberían encontrarse nombres de ciudades, comunas, pueblos, estados, etc.\n",
"* Año:\n",
" - Descripción: Año del devento sísmico.\n",
" - Tipo: _integer_\n",
" - Observaciones: Los años deben estar entre 2000 y 2011.\n",
"* Magnitud:\n",
" - Descripción: Magnitud del devento sísmico medida en [Magnitud de Momento Sísmico](https://en.wikipedia.org/wiki/Moment_magnitude_scale).\n",
" - Tipo: _float_\n",
" - Observaciones: Magnitudes menores a 9.6.\n",
"* Informacion:\n",
" - Descripción: Columna contaminante.\n",
" - Tipo: _string_\n",
" - Observaciones: A priori pareciera que no entrega información a los datos."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A pesar que la magnitud es un _float_, el conocimiento de los datos nos da información relevante, pues el terremoto con mayor magnitud registrado a la fecha fue el de Valdivia, Chile el 22 de mayo de 1960 con una magnitud entre 9.4 - 9.6. \n",
"\n",
"__Los datos son solo bytes en el disco duro si es que no entregan valor y conocimiento.__"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### ¿Qué pregunta(s) estás tratando de resolver (o probar que estás equivocado)?\n",
"\n",
"A modo de ejemplo, consideremos que que queremos conocer la mayor magnitud de terremoto en cada país a lo largo de los años."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### ¿Qué tipo de datos tiene y cómo trata los diferentes tipos?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Por el conocimiento de los datos sabemos que `Pais` e `Información` son variables categóricas, mientras que `Año` y `Magnitud` son variables numéricas."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Utilizemos las herramientas que nos entrega `pandas`."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"RangeIndex: 228 entries, 0 to 227\n",
"Data columns (total 4 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 Año 226 non-null object\n",
" 1 Pais 226 non-null object\n",
" 2 Magnitud 225 non-null object\n",
" 3 Informacion 220 non-null object\n",
"dtypes: object(4)\n",
"memory usage: 7.2+ KB\n"
]
}
],
"source": [
"earthquakes.info()"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" count | \n",
" unique | \n",
" top | \n",
" freq | \n",
"
\n",
" \n",
" \n",
" \n",
" Año | \n",
" 226 | \n",
" 16 | \n",
" 2003 | \n",
" 31 | \n",
"
\n",
" \n",
" Pais | \n",
" 226 | \n",
" 74 | \n",
" Indonesia | \n",
" 27 | \n",
"
\n",
" \n",
" Magnitud | \n",
" 225 | \n",
" 45 | \n",
" 6.4 | \n",
" 14 | \n",
"
\n",
" \n",
" Informacion | \n",
" 220 | \n",
" 3 | \n",
" info valiosa | \n",
" 166 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" count unique top freq\n",
"Año 226 16 2003 31\n",
"Pais 226 74 Indonesia 27\n",
"Magnitud 225 45 6.4 14\n",
"Informacion 220 3 info valiosa 166"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"earthquakes.describe(include=\"all\").T"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Año object\n",
"Pais object\n",
"Magnitud object\n",
"Informacion object\n",
"dtype: object"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"earthquakes.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Todas las columnas son de tipo `object`, sospechoso. Además, algunas no tienen datos."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"__Tip__: Típicamente se utilizan nombres de columnas en minúsculas y sin espacios. Un truco es hacer lo siguiente:"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" año | \n",
" pais | \n",
" magnitud | \n",
" informacion | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 2000 | \n",
" Turkey | \n",
" 6 | \n",
" info no valiosa | \n",
"
\n",
" \n",
" 1 | \n",
" 2000 | \n",
" Turkmenistan | \n",
" 7 | \n",
" info no valiosa | \n",
"
\n",
" \n",
" 2 | \n",
" 2000 | \n",
" Azerbaijan | \n",
" 6.5 | \n",
" info no valiosa | \n",
"
\n",
" \n",
" 3 | \n",
" 2000 | \n",
" Azerbaijan | \n",
" 6.8 | \n",
" info no valiosa | \n",
"
\n",
" \n",
" 4 | \n",
" 2000 | \n",
" Papua New Guinea | \n",
" 8 | \n",
" info no valiosa | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" año pais magnitud informacion\n",
"0 2000 Turkey 6 info no valiosa\n",
"1 2000 Turkmenistan 7 info no valiosa\n",
"2 2000 Azerbaijan 6.5 info no valiosa\n",
"3 2000 Azerbaijan 6.8 info no valiosa\n",
"4 2000 Papua New Guinea 8 info no valiosa"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"earthquakes = earthquakes.rename(columns=lambda x: x.lower().strip())\n",
"earthquakes.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Se le aplicó una función `lambda` a cada nombre de columna! Puum! "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### ¿Qué falta en los datos y cómo los maneja?\n",
"\n",
"No es necesario agregar más variables, pero si procesarla."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### ¿Qué hacer con los datos faltantes, outliers o información mal inputada?\n",
"\n",
"A continuación iremos explorando cada una de las columnas."
]
},
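{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before going column by column, a per-column null count gives a quick overview (a sketch):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Number of missing values in each column.\n",
"earthquakes.isnull().sum()"
]
},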
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"La columna año posee los siguientes valores únicos:\n",
" ['1990' '1997' '1999' '2000' '2001' '2002' '2003' '2004' '2005' '2006'\n",
" '2007' '2008' '2009' '2010' '2011' 'dos mil uno' nan]\n",
"\n",
"\n",
"La columna pais posee los siguientes valores únicos:\n",
" ['Afghanistan' 'Afghanistan ' 'Algeria' 'Algeria ' 'Argentina'\n",
" 'Azerbaijan' 'Azerbaijan ' 'Bangladesh' 'Burma ' 'Chile' 'Chile ' 'China'\n",
" 'China ' 'Colombia' 'Costa Rica' 'Costa Rica '\n",
" 'Democratic Republic of the Congo' 'Democratic Republic of the Congo '\n",
" 'Dominican Republic' 'Ecuador' 'El Salvador ' 'Greece' 'Greece '\n",
" 'Guadeloupe' 'Guatemala' 'Haiti ' 'India' 'India ' 'Indonesia'\n",
" 'Indonesia ' 'Iran' 'Iran ' 'Iran, 2005 Qeshm earthquake' 'Italy'\n",
" 'Italy ' 'Japan' 'Japan ' 'Kazakhstan' 'Kyrgyzstan ' 'Martinique'\n",
" 'Mexico ' 'Morocco' 'Morocco ' 'Mozambique' 'New Zealand' 'New Zealand '\n",
" 'Nicaragua' 'Pakistan' 'Pakistan ' 'Panama' 'Papua New Guinea' 'Peru'\n",
" 'Peru ' 'Philippines' 'Russian Federation' 'Rwanda' 'Samoa ' 'Serbia'\n",
" 'Slovenia' 'Solomon Islands ' 'Taiwan' 'Taiwan ' 'Tajikistan'\n",
" 'Tajikistan ' 'Tanzania' 'Tanzania ' 'Turkey' 'Turkey ' 'Turkmenistan'\n",
" 'United States ' 'Venezuela' 'Vietnam' 'arica' 'shile' nan]\n",
"\n",
"\n",
"La columna magnitud posee los siguientes valores únicos:\n",
" ['-10' '2002-Tanzania-5.8' '2003-japan-8.5' '4.7' '4.9' '5' '5.1' '5.2'\n",
" '5.3' '5.4' '5.5' '5.6' '5.7' '5.8' '5.9' '6' '6.1' '6.2' '6.3' '6.4'\n",
" '6.5' '6.6' '6.7' '6.8' '6.9' '7' '7.1' '7.2' '7.3' '7.4' '7.5' '7.6'\n",
" '7.7' '7.8' '7.9' '8' '8.1' '8.3' '8.4' '8.5' '8.6' '8.8' '9' '9.1' '9.7'\n",
" nan]\n",
"\n",
"\n",
"La columna informacion posee los siguientes valores únicos:\n",
" ['info no valiosa' 'info valiosa' 'valiosa' nan]\n",
"\n",
"\n"
]
}
],
"source": [
"for col in earthquakes:\n",
" print(f\"La columna {col} posee los siguientes valores únicos:\\n {earthquakes[col].sort_values().unique()}\\n\\n\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* En la columna `año` se presentan las siguientes anomalías:\n",
" * Datos vacíos.\n",
" * Años sin importancia: Se ha establecido que los años de estudios son desde el año 2000 al 2011.\n",
" * Nombres mal escritos: en este caso sabemos que 'dos mil uno' corresponde a '2001'.\n",
"* En la columna `pais` se presentan las siguientes anomalías:\n",
" * Datos vacíos.\n",
" * Ciudades, e.g. _arica_.\n",
" * Países mal escritos e.g. _shile_.\n",
" * Países repetidos pero mal formateados, e.g. _Turkey_.\n",
" * Cruce de información, e.g. _Iran, 2005 Qeshm earthquake_.\n",
"* En la columna `magnitud` se presentan las siguientes anomalías:\n",
" * Datos vacíos.\n",
" * Cruce de información, e.g. _2002-Tanzania-5.8_.\n",
" * Valores imposibles, e.g. _9.7_.\n",
"* La columna `informacion` realmente no está entregando ninguna información valiosa al problema."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Partamos por eliminar la columna `informacion`."
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" año | \n",
" pais | \n",
" magnitud | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 2000 | \n",
" Turkey | \n",
" 6 | \n",
"
\n",
" \n",
" 1 | \n",
" 2000 | \n",
" Turkmenistan | \n",
" 7 | \n",
"
\n",
" \n",
" 2 | \n",
" 2000 | \n",
" Azerbaijan | \n",
" 6.5 | \n",
"
\n",
" \n",
" 3 | \n",
" 2000 | \n",
" Azerbaijan | \n",
" 6.8 | \n",
"
\n",
" \n",
" 4 | \n",
" 2000 | \n",
" Papua New Guinea | \n",
" 8 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" año pais magnitud\n",
"0 2000 Turkey 6\n",
"1 2000 Turkmenistan 7\n",
"2 2000 Azerbaijan 6.5\n",
"3 2000 Azerbaijan 6.8\n",
"4 2000 Papua New Guinea 8"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"eqk = earthquakes.drop(columns=\"informacion\") # A veces es importante no sobrescribir el dataframe original para realizar análisis posteriores.\n",
"eqk.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Respecto a la columna `año`, corregir estos errores no es difícil, pero suele ser tedioso. Aparte que si no se realiza un correcto análisis es posible no detectar estos errores a tiempo. Empecemos con los registros nulos."
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" año | \n",
" pais | \n",
" magnitud | \n",
"
\n",
" \n",
" \n",
" \n",
" 225 | \n",
" NaN | \n",
" NaN | \n",
" 2002-Tanzania-5.8 | \n",
"
\n",
" \n",
" 226 | \n",
" NaN | \n",
" NaN | \n",
" 2003-japan-8.5 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" año pais magnitud\n",
"225 NaN NaN 2002-Tanzania-5.8\n",
"226 NaN NaN 2003-japan-8.5"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"eqk.loc[lambda x: x[\"año\"].isnull()]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Veamos el archivo"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"sed: can't read data/earthquakes_contaminated.csv: No such file or directory\n"
]
}
],
"source": [
"! sed -n \"226,228p\" data/earthquakes_contaminated.csv"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Toda la información está contenida en una columna!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Para editar la información usaremos dos herramientas:\n",
" * Los métodos de `str` en `pandas`, en particular para dividir una columna.\n",
" * `loc` para asignar los nuevos valores."
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([['2002', 'Tanzania', '5.8'],\n",
" ['2003', 'japan', '8.5']], dtype=object)"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"eqk.loc[lambda x: x[\"año\"].isnull(), \"magnitud\"].str.split(\"-\", expand=True).values"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [],
"source": [
"eqk.loc[lambda x: x[\"año\"].isnull(), :] = eqk.loc[lambda x: x[\"año\"].isnull(), \"magnitud\"].str.split(\"-\", expand=True).values"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" año | \n",
" pais | \n",
" magnitud | \n",
"
\n",
" \n",
" \n",
" \n",
" 225 | \n",
" 2002 | \n",
" Tanzania | \n",
" 5.8 | \n",
"
\n",
" \n",
" 226 | \n",
" 2003 | \n",
" japan | \n",
" 8.5 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" año pais magnitud\n",
"225 2002 Tanzania 5.8\n",
"226 2003 japan 8.5"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"eqk.loc[[225, 226]]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ahora los registros que no se pueden convertir a `numeric`. Veamos que no es posible convertirlo."
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"invalid literal for int() with base 10: 'dos mil uno'\n"
]
}
],
"source": [
"try:\n",
" eqk[\"año\"].astype(np.int)\n",
"except Exception as e:\n",
" print(e)"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 True\n",
"1 True\n",
"2 True\n",
"3 True\n",
"4 True\n",
" ... \n",
"223 True\n",
"224 True\n",
"225 True\n",
"226 True\n",
"227 True\n",
"Name: año, Length: 228, dtype: bool"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"eqk[\"año\"].str.isnumeric().fillna(False)"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" año | \n",
" pais | \n",
" magnitud | \n",
"
\n",
" \n",
" \n",
" \n",
" 31 | \n",
" dos mil uno | \n",
" China | \n",
" 5.4 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" año pais magnitud\n",
"31 dos mil uno China 5.4"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"eqk.loc[lambda x: ~ x[\"año\"].str.isnumeric()]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Veamos el valor a cambiar"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'dos mil uno'"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"eqk.loc[lambda x: ~ x[\"año\"].str.isnumeric(), \"año\"].iloc[0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Reemplazar es muy fácil!"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 2000\n",
"1 2000\n",
"2 2000\n",
"3 2000\n",
"4 2000\n",
" ... \n",
"223 1990\n",
"224 1999\n",
"225 2002\n",
"226 2003\n",
"227 2005\n",
"Name: año, Length: 228, dtype: object"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"eqk[\"año\"].str.replace(\"dos mil uno\", \"2001\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Para asignar en el dataframe basta con:"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [],
"source": [
"eqk[\"año\"] = eqk[\"año\"].str.replace(\"dos mil uno\", \"2001\").astype(np.int)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"La forma encadenada sería:"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [],
"source": [
"# eqk[\"año\"] = eqk.assign(año=lambda x: x[\"año\"].str.replace(\"dos mil uno\", \"2001\").astype(\"int\"))"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"año int64\n",
"pais object\n",
"magnitud object\n",
"dtype: object"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"eqk.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finalmentem, filtremos los años necesarios:"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [],
"source": [
"eqk = eqk.query(\"2000 <= año <= 2011\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Siguiendo de forma análoga con la columna `magnitud`."
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" año | \n",
" pais | \n",
" magnitud | \n",
"
\n",
" \n",
" \n",
" \n",
" 219 | \n",
" 2010 | \n",
" Colombia | \n",
" NaN | \n",
"
\n",
" \n",
" 220 | \n",
" 2005 | \n",
" Indonesia | \n",
" NaN | \n",
"
\n",
" \n",
" 221 | \n",
" 2010 | \n",
" Venezuela | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" año pais magnitud\n",
"219 2010 Colombia NaN\n",
"220 2005 Indonesia NaN\n",
"221 2010 Venezuela NaN"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"eqk.loc[lambda x: x[\"magnitud\"].isnull()]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"La verdad es que no hay mucho que hacer con estos valores, por el momento no _inputaremos_ ningún valor y los descartaremos."
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [],
"source": [
"eqk = eqk.loc[lambda x: x[\"magnitud\"].notnull()]"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Ya es posible transformar la columna a float.\n"
]
}
],
"source": [
"try:\n",
" eqk[\"magnitud\"].astype(np.float)\n",
" print(\"Ya es posible transformar la columna a float.\")\n",
"except:\n",
" print(\"Aún no es posible transformar la columna a float.\")"
]
},
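{
"cell_type": "markdown",
"metadata": {},
"source": [
"An alternative worth knowing (a sketch, not the route we take below): `pd.to_numeric` with `errors=\"coerce\"` converts what it can and turns the rest into `NaN` instead of raising."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Unparseable entries would become NaN instead of raising a ValueError.\n",
"pd.to_numeric(eqk[\"magnitud\"], errors=\"coerce\").head()"
]
},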
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [],
"source": [
"eqk = eqk.astype({\"magnitud\": np.float})"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"año int64\n",
"pais object\n",
"magnitud float64\n",
"dtype: object"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"eqk.dtypes"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([ 6. , 7. , 6.5, 6.8, 8. , 5.7, 6.4, 5.5, 6.3,\n",
" 5.4, 6.1, 6.7, 7.9, 7.2, 7.5, 5.3, 5.9, 9.7,\n",
" 5.8, 4.7, 7.6, 8.4, 5. , 5.6, 6.6, 6.2, 7.1,\n",
" 7.3, 5.1, 5.2, 8.3, 6.9, 9.1, 4.9, 7.8, 8.6,\n",
" 7.7, 7.4, 8.5, 8.1, 8.8, 9. , -10. ])"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"eqk.magnitud.unique()"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" año | \n",
" pais | \n",
" magnitud | \n",
"
\n",
" \n",
" \n",
" \n",
" 22 | \n",
" 2000 | \n",
" shile | \n",
" 9.7 | \n",
"
\n",
" \n",
" 217 | \n",
" 2011 | \n",
" shile | \n",
" -10.0 | \n",
"
\n",
" \n",
" 218 | \n",
" 2011 | \n",
" shile | \n",
" -10.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" año pais magnitud\n",
"22 2000 shile 9.7\n",
"217 2011 shile -10.0\n",
"218 2011 shile -10.0"
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"eqk.query(\"magnitud < 0 or 9.6 < magnitud\")"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [],
"source": [
"eqk = eqk.query(\"0 <= magnitud <= 9.6\")"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" año | \n",
" pais | \n",
" magnitud | \n",
"
\n",
" \n",
" \n",
" \n",
"
\n",
"
"
],
"text/plain": [
"Empty DataFrame\n",
"Columns: [año, pais, magnitud]\n",
"Index: []"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"eqk.query(\"magnitud < 0 or 9.6 < magnitud\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finalmente, para la columna `pais`. Comenzaremos con los nombres erróneos, estos los podemos mapear directamente."
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 Turkey\n",
"1 Turkmenistan\n",
"2 Azerbaijan\n",
"3 Azerbaijan \n",
"4 Papua New Guinea\n",
" ... \n",
"215 China \n",
"216 New Zealand \n",
"225 Tanzania\n",
"226 japan\n",
"227 Chile\n",
"Name: pais, Length: 219, dtype: object"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"map_paises = {\"arica\": \"Chile\", \"shile\": \"Chile\", \"Iran, 2005 Qeshm earthquake\": \"Iran\"}\n",
"eqk[\"pais\"].map(map_paises).fillna(eqk[\"pais\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Para editarlo en el dataframe basta hacer un `assign`."
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [],
"source": [
"eqk = eqk.assign(pais=lambda x: x[\"pais\"].map(map_paises).fillna(x[\"pais\"]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ahora formatearemos los nombres, pasándolos a minúsculas y quitando los espacios al principio y final de cada _string_. Y ahabíamos hablado del ejemplo de _Turkey_."
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['Turkey', 'Turkey '], dtype=object)"
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"eqk.loc[lambda x: x[\"pais\"].apply(lambda s: \"Turkey\" in s), \"pais\"].unique()"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [],
"source": [
"# Chaining method\n",
"eqk = eqk.assign(pais=lambda x: x[\"pais\"].str.lower().str.strip())"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['turkey'], dtype=object)"
]
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"eqk.loc[lambda x: x[\"pais\"].apply(lambda s: \"turkey\" in s), \"pais\"].unique()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Nota que no hay países con valores nulos porque ya fueron reparados."
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" año | \n",
" pais | \n",
" magnitud | \n",
"
\n",
" \n",
" \n",
" \n",
"
\n",
"
"
],
"text/plain": [
"Empty DataFrame\n",
"Columns: [año, pais, magnitud]\n",
"Index: []"
]
},
"execution_count": 56,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"eqk.loc[lambda x: x[\"pais\"].isnull()]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### ¿Se puede sacar más provecho a los datos ?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"No es posible crear variables nuevas o algo por el estilo, ya se hizo todo el procesamiento necesario para cumplir las reglas de negocio."
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(228, 4)"
]
},
"execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"earthquakes.shape"
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(219, 3)"
]
},
"execution_count": 58,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"eqk.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Dar respuesta\n",
"\n",
"Como es un método de agregación podríamos simplemente hacer un `groupby`."
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"pais año \n",
"afghanistan 2000 6.3\n",
" 2001 5.0\n",
" 2002 7.3\n",
" 2003 5.8\n",
" 2004 6.5\n",
" ... \n",
"turkmenistan 2000 7.0\n",
"united states 2001 6.8\n",
" 2003 6.6\n",
"venezuela 2006 5.5\n",
"vietnam 2005 5.3\n",
"Name: magnitud, Length: 134, dtype: float64"
]
},
"execution_count": 59,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"eqk.groupby([\"pais\", \"año\"])[\"magnitud\"].max()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sin embargo, en ocasiones, una tabla __pivoteada__ es mucho más explicativa."
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" año | \n",
" 2000 | \n",
" 2001 | \n",
" 2002 | \n",
" 2003 | \n",
" 2004 | \n",
" 2005 | \n",
" 2006 | \n",
" 2007 | \n",
" 2008 | \n",
" 2009 | \n",
" 2010 | \n",
" 2011 | \n",
"
\n",
" \n",
" pais | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" afghanistan | \n",
" 6.3 | \n",
" 5 | \n",
" 7.3 | \n",
" 5.8 | \n",
" 6.5 | \n",
" 6.5 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" algeria | \n",
" 5.7 | \n",
" | \n",
" | \n",
" 6.8 | \n",
" | \n",
" | \n",
" 5.2 | \n",
" | \n",
" 5.5 | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" argentina | \n",
" 7.2 | \n",
" | \n",
" | \n",
" | \n",
" 6.1 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" azerbaijan | \n",
" 6.8 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" bangladesh | \n",
" | \n",
" | \n",
" | \n",
" 5.6 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" burma | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" 6.8 | \n",
"
\n",
" \n",
" chile | \n",
" | \n",
" 6.3 | \n",
" | \n",
" | \n",
" | \n",
" 8 | \n",
" | \n",
" 7.7 | \n",
" | \n",
" | \n",
" 8.8 | \n",
" | \n",
"
\n",
" \n",
" china | \n",
" 5.9 | \n",
" 5.6 | \n",
" 5.5 | \n",
" 6.3 | \n",
" 5.3 | \n",
" 5.2 | \n",
" 5 | \n",
" 6.1 | \n",
" 7.9 | \n",
" 5.7 | \n",
" 6.9 | \n",
" 5.4 | \n",
"
\n",
" \n",
" colombia | \n",
" 6.5 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" 5.9 | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" costa rica | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" 6.4 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" 6.1 | \n",
" | \n",
" | \n",
"
\n",
" \n",
" democratic republic of the congo | \n",
" | \n",
" | \n",
" 6.2 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" 5.9 | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" dominican republic | \n",
" | \n",
" | \n",
" | \n",
" 6.4 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" ecuador | \n",
" 5.5 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" el salvador | \n",
" | \n",
" 7.6 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" greece | \n",
" | \n",
" | \n",
" 6.2 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" 6.4 | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" guadeloupe | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" 6.3 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" guatemala | \n",
" | \n",
" | \n",
" | \n",
" 6.4 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" haiti | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" 7 | \n",
" | \n",
"
\n",
" \n",
" india | \n",
" | \n",
" 7.6 | \n",
" 6.5 | \n",
" | \n",
" | \n",
" 5.1 | \n",
" 5.3 | \n",
" 5.1 | \n",
" | \n",
" | \n",
" | \n",
" 6.9 | \n",
"
\n",
" \n",
" indonesia | \n",
" 7.9 | \n",
" | \n",
" 7.5 | \n",
" 6.9 | \n",
" 9.1 | \n",
" 8.6 | \n",
" 7.7 | \n",
" 8.5 | \n",
" 7.3 | \n",
" 7.6 | \n",
" | \n",
" | \n",
"
\n",
" \n",
" iran | \n",
" 5.3 | \n",
" | \n",
" 6.5 | \n",
" 6.6 | \n",
" 6.3 | \n",
" 6.4 | \n",
" 6.1 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" italy | \n",
" | \n",
" 4.7 | \n",
" 5.9 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" 6.2 | \n",
" | \n",
" | \n",
"
\n",
" \n",
" japan | \n",
" 6.1 | \n",
" 6.8 | \n",
" | \n",
" 8.5 | \n",
" 6.6 | \n",
" 6.6 | \n",
" | \n",
" 6.7 | \n",
" 6.9 | \n",
" 6.4 | \n",
" | \n",
" 9 | \n",
"
\n",
" \n",
" kazakhstan | \n",
" | \n",
" | \n",
" | \n",
" 6 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" kyrgyzstan | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" 6.9 | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" martinique | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" 7.4 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" mexico | \n",
" | \n",
" | \n",
" | \n",
" 7.5 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" morocco | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" 6.3 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" mozambique | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" 7 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" new zealand | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" 5.4 | \n",
" | \n",
" | \n",
" 6.6 | \n",
" | \n",
" | \n",
" | \n",
" 6.3 | \n",
"
\n",
" \n",
" nicaragua | \n",
" 5.4 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" pakistan | \n",
" | \n",
" | \n",
" 6.3 | \n",
" | \n",
" 5.4 | \n",
" 7.6 | \n",
" 4.9 | \n",
" 5.2 | \n",
" 6.4 | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" panama | \n",
" | \n",
" | \n",
" | \n",
" 6.5 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" papua new guinea | \n",
" 8 | \n",
" | \n",
" 7.6 | \n",
" | \n",
" | \n",
" 6.1 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" peru | \n",
" | \n",
" 8.4 | \n",
" | \n",
" | \n",
" | \n",
" 7.5 | \n",
" | \n",
" 8 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" philippines | \n",
" | \n",
" | \n",
" 7.5 | \n",
" 6.5 | \n",
" | \n",
" 7.1 | \n",
" | \n",
" 5.3 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" russian federation | \n",
" | \n",
" | \n",
" | \n",
" 7.3 | \n",
" | \n",
" | \n",
" | \n",
" 6.2 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" rwanda | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" 5.3 | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" samoa | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" 8.1 | \n",
" | \n",
" | \n",
"
\n",
" \n",
" serbia | \n",
" | \n",
" | \n",
" 5.7 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" slovenia | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" 5.2 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" solomon islands | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" 8.1 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" taiwan | \n",
" 6.4 | \n",
" | \n",
" 7.1 | \n",
" | \n",
" 5.2 | \n",
" | \n",
" 7 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" tajikistan | \n",
" | \n",
" | \n",
" 5.2 | \n",
" | \n",
" | \n",
" | \n",
" 5.6 | \n",
" 5.2 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" tanzania | \n",
" 6.4 | \n",
" | \n",
" 5.8 | \n",
" | \n",
" | \n",
" 6.8 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" turkey | \n",
" 6 | \n",
" | \n",
" 6.5 | \n",
" 6.3 | \n",
" 5.6 | \n",
" 5.9 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" 6.1 | \n",
" 7.1 | \n",
"
\n",
" \n",
" turkmenistan | \n",
" 7 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" united states | \n",
" | \n",
" 6.8 | \n",
" | \n",
" 6.6 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" venezuela | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" 5.5 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" vietnam | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" 5.3 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
"año 2000 2001 2002 2003 2004 2005 2006 2007 2008 \\\n",
"pais \n",
"afghanistan 6.3 5 7.3 5.8 6.5 6.5 \n",
"algeria 5.7 6.8 5.2 5.5 \n",
"argentina 7.2 6.1 \n",
"azerbaijan 6.8 \n",
"bangladesh 5.6 \n",
"burma \n",
"chile 6.3 8 7.7 \n",
"china 5.9 5.6 5.5 6.3 5.3 5.2 5 6.1 7.9 \n",
"colombia 6.5 5.9 \n",
"costa rica 6.4 \n",
"democratic republic of the congo 6.2 5.9 \n",
"dominican republic 6.4 \n",
"ecuador 5.5 \n",
"el salvador 7.6 \n",
"greece 6.2 6.4 \n",
"guadeloupe 6.3 \n",
"guatemala 6.4 \n",
"haiti \n",
"india 7.6 6.5 5.1 5.3 5.1 \n",
"indonesia 7.9 7.5 6.9 9.1 8.6 7.7 8.5 7.3 \n",
"iran 5.3 6.5 6.6 6.3 6.4 6.1 \n",
"italy 4.7 5.9 \n",
"japan 6.1 6.8 8.5 6.6 6.6 6.7 6.9 \n",
"kazakhstan 6 \n",
"kyrgyzstan 6.9 \n",
"martinique 7.4 \n",
"mexico 7.5 \n",
"morocco 6.3 \n",
"mozambique 7 \n",
"new zealand 5.4 6.6 \n",
"nicaragua 5.4 \n",
"pakistan 6.3 5.4 7.6 4.9 5.2 6.4 \n",
"panama 6.5 \n",
"papua new guinea 8 7.6 6.1 \n",
"peru 8.4 7.5 8 \n",
"philippines 7.5 6.5 7.1 5.3 \n",
"russian federation 7.3 6.2 \n",
"rwanda 5.3 \n",
"samoa \n",
"serbia 5.7 \n",
"slovenia 5.2 \n",
"solomon islands 8.1 \n",
"taiwan 6.4 7.1 5.2 7 \n",
"tajikistan 5.2 5.6 5.2 \n",
"tanzania 6.4 5.8 6.8 \n",
"turkey 6 6.5 6.3 5.6 5.9 \n",
"turkmenistan 7 \n",
"united states 6.8 6.6 \n",
"venezuela 5.5 \n",
"vietnam 5.3 \n",
"\n",
"año 2009 2010 2011 \n",
"pais \n",
"afghanistan \n",
"algeria \n",
"argentina \n",
"azerbaijan \n",
"bangladesh \n",
"burma 6.8 \n",
"chile 8.8 \n",
"china 5.7 6.9 5.4 \n",
"colombia \n",
"costa rica 6.1 \n",
"democratic republic of the congo \n",
"dominican republic \n",
"ecuador \n",
"el salvador \n",
"greece \n",
"guadeloupe \n",
"guatemala \n",
"haiti 7 \n",
"india 6.9 \n",
"indonesia 7.6 \n",
"iran \n",
"italy 6.2 \n",
"japan 6.4 9 \n",
"kazakhstan \n",
"kyrgyzstan \n",
"martinique \n",
"mexico \n",
"morocco \n",
"mozambique \n",
"new zealand 6.3 \n",
"nicaragua \n",
"pakistan \n",
"panama \n",
"papua new guinea \n",
"peru \n",
"philippines \n",
"russian federation \n",
"rwanda \n",
"samoa 8.1 \n",
"serbia \n",
"slovenia \n",
"solomon islands \n",
"taiwan \n",
"tajikistan \n",
"tanzania \n",
"turkey 6.1 7.1 \n",
"turkmenistan \n",
"united states \n",
"venezuela \n",
"vietnam "
]
},
"execution_count": 60,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"eqk.pivot_table(\n",
" index=\"pais\",\n",
" columns=\"año\",\n",
" values=\"magnitud\",\n",
" aggfunc=\"max\",\n",
" fill_value=\"\"\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"¿Notas las similitudes con `groupby`? Ambos son métodos de agregación, pero retornan formas de la matriz distintas."
]
},
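{
"cell_type": "markdown",
"metadata": {},
"source": [
"In fact, one can go from one shape to the other: unstacking the year level of the `groupby` result reproduces the pivoted layout (a sketch):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Moving \"año\" from the row index to the columns mirrors pivot_table.\n",
"eqk.groupby([\"pais\", \"año\"])[\"magnitud\"].max().unstack(\"año\").head()"
]
},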
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sin embargo, esto se vería mucho mejor con una visualización, que es lo que veremos en el próximo módulo."
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
""
],
"text/plain": [
"alt.Chart(...)"
]
},
"execution_count": 61,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import altair as alt\n",
"alt.themes.enable('opaque')\n",
"\n",
"alt.Chart(\n",
" eqk.groupby([\"pais\", \"año\"])[\"magnitud\"].max().reset_index()\n",
").mark_rect().encode(\n",
" x='año:O',\n",
" y='pais:N',\n",
" color='magnitud:Q'\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Resumen\n",
"\n",
"* En el _mundo real_ te encontrarás con múltiples fuentes de datos, es importante adaptarse ya que las tecnologías cambian constantemente.\n",
"* Datos deben entregar valor a través del análisis.\n",
"* Es poco probable que los datos vengan _\"limpios\"_.\n",
"* El análisis exploratorio de datos (EDA) es una metodología que sirve para asegurarse de la calidad de los datos.\n",
"* A medida que se tiene más experticia en el tema, mejor es el análisis de datos y por tanto, mejor son los resultados obtenidos.\n",
"* No existe un procedimiento estándar para realizar el EDA, pero siempre se debe tener claro el problema a resolver.\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}