{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Agrupando Datos" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Group By__ se refiere al proceso que involucra uno o más de los siguientes pasos:\n", "\n", "* Dividir los datos en grupos basados en algún criterio.\n", "* Aplicar una función a cada uno de los grupos independientemente.\n", "* Combinar los resultados en una estructura de datos.\n", "\n", "La división es el paso principal. Usualmente, el usuario busca dividir la data en grupos y luego _hacer algo_ con estos grupos. En el paso de aplicar se puede: \n", "\n", "* Agregación: Calcular alguna estadística(s) para cada grupo. Por ejemplo, la suma, promedio y/o conteo de cada grupo.\n", "\n", "* Transformación: Algún cálculo específico a cada grupo pero devolviendo un índice similar al original. Por ejemplo, estandarizar o rellenar valores nulos respecto a cada grupo.\n", "\n", "* Filtrado: Descartar grupos de acuerdo a un cálculo grupal que se evalua verdadero o falso. Por ejemplo, descartar los regristros que la cantidad de miembros del grupo es menor a cierto umbral." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "La clase de hoy será motivada con el dataset de monstruos de bolsillo favorito de los milleniasl: __Pokemon__." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nametype1type2hpattackdefensesp_atksp_defspeedgenerationlegendary
#
1BulbasaurGrassPoison4549496565451False
2IvysaurGrassPoison6062638080601False
3VenusaurGrassPoison808283100100801False
4Mega VenusaurGrassPoison80100123122120801False
5CharmanderFireNaN3952436050651False
\n", "
" ], "text/plain": [ " name type1 type2 hp attack defense sp_atk sp_def speed \\\n", "# \n", "1 Bulbasaur Grass Poison 45 49 49 65 65 45 \n", "2 Ivysaur Grass Poison 60 62 63 80 80 60 \n", "3 Venusaur Grass Poison 80 82 83 100 100 80 \n", "4 Mega Venusaur Grass Poison 80 100 123 122 120 80 \n", "5 Charmander Fire NaN 39 52 43 60 50 65 \n", "\n", " generation legendary \n", "# \n", "1 1 False \n", "2 1 False \n", "3 1 False \n", "4 1 False \n", "5 1 False " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pkm = (\n", " pd.read_csv(os.path.join(\"..\", \"data\", \"pokemon.csv\"), index_col=\"#\")\n", " .rename(columns=lambda x: x.replace(\" \", \"\").replace(\".\", \"_\").lower())\n", ")\n", "pkm.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ejemplo: ¿Sabes cuántos pokemones legendarios hay por generación? ¿No? Agrupemos por generación." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pkm.groupby(\"generation\")" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "pandas.core.groupby.generic.DataFrameGroupBy" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(pkm.groupby(\"generation\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Hacer un groupby nos entrega un objeto groupby, usualmente no nos ayuda mucho, pero ya tiene los grupos separados internamente. \n", "\n", "Idea: Iteremos por grupo y contemos!" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "type(name): \n", "type(group): \n", "\n", "name: 1\n", "group:\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nametype1type2hpattackdefensesp_atksp_defspeedgenerationlegendary
#
1BulbasaurGrassPoison4549496565451False
2IvysaurGrassPoison6062638080601False
3VenusaurGrassPoison808283100100801False
4Mega VenusaurGrassPoison80100123122120801False
5CharmanderFireNaN3952436050651False
....................................
162DragoniteDragonFlying9113495100100801False
163MewtwoPsychicNaN10611090154901301True
164Mega Mewtwo XPsychicFighting1061901001541001301True
165Mega Mewtwo YPsychicNaN106150701941201401True
166MewPsychicNaN1001001001001001001False
\n", "

166 rows × 11 columns

\n", "
" ], "text/plain": [ " name type1 type2 hp attack defense sp_atk sp_def \\\n", "# \n", "1 Bulbasaur Grass Poison 45 49 49 65 65 \n", "2 Ivysaur Grass Poison 60 62 63 80 80 \n", "3 Venusaur Grass Poison 80 82 83 100 100 \n", "4 Mega Venusaur Grass Poison 80 100 123 122 120 \n", "5 Charmander Fire NaN 39 52 43 60 50 \n", ".. ... ... ... ... ... ... ... ... \n", "162 Dragonite Dragon Flying 91 134 95 100 100 \n", "163 Mewtwo Psychic NaN 106 110 90 154 90 \n", "164 Mega Mewtwo X Psychic Fighting 106 190 100 154 100 \n", "165 Mega Mewtwo Y Psychic NaN 106 150 70 194 120 \n", "166 Mew Psychic NaN 100 100 100 100 100 \n", "\n", " speed generation legendary \n", "# \n", "1 45 1 False \n", "2 60 1 False \n", "3 80 1 False \n", "4 80 1 False \n", "5 65 1 False \n", ".. ... ... ... \n", "162 80 1 False \n", "163 130 1 True \n", "164 130 1 True \n", "165 140 1 True \n", "166 100 1 False \n", "\n", "[166 rows x 11 columns]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "for name, group in pkm.groupby(\"generation\"):\n", " print(f\"type(name): {type(name)}\")\n", " print(f\"type(group): {type(group)}\\n\")\n", " print(f\"name: {name}\")\n", " print(f\"group:\")\n", " display(group)\n", " break" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "La generación 1 tiene 6 pokemones legendarios.\n", "La generación 2 tiene 5 pokemones legendarios.\n", "La generación 3 tiene 18 pokemones legendarios.\n", "La generación 4 tiene 13 pokemones legendarios.\n", "La generación 5 tiene 15 pokemones legendarios.\n", "La generación 6 tiene 8 pokemones legendarios.\n" ] } ], "source": [ "for name, group in pkm.groupby(\"generation\"):\n", " print(f\"La generación {name} tiene {group['legendary'].sum()} pokemones legendarios.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lo anterior no es lo mejor, porque es secuencial, es decir, debemos esperar a que la i-ésima iteración termine para ejecutar la (i+1)-ésima iteración." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Group By - Aggregation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Una vez el objeto GroupBy ha sido creado, es posible aplicar diferentes métodos para realizar los cálculos requeridos. El más común, es `aggregate()`, o equivalentemente, `agg()`." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
legendary
generation
16
25
318
413
515
68
\n", "
" ], "text/plain": [ " legendary\n", "generation \n", "1 6\n", "2 5\n", "3 18\n", "4 13\n", "5 15\n", "6 8" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pkm.groupby(\"generation\").agg({\"legendary\": \"sum\"})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Nota que después de aplicado el `agg` el nombre de la columna se mantiene, esto se puede cambiar de la siguiente forma:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
legendaries_sum
generation
16
25
318
413
515
68
\n", "
" ], "text/plain": [ " legendaries_sum\n", "generation \n", "1 6\n", "2 5\n", "3 18\n", "4 13\n", "5 15\n", "6 8" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Nombre de la columna como argumento de agg, tupla con nombre de la columna y operación.\n", "pkm.groupby(\"generation\").agg(legendaries_sum=(\"legendary\", \"sum\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Comparemos tiempos" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "5.46 ms ± 237 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n" ] } ], "source": [ "%%timeit\n", "aux = pd.DataFrame().rename_axis(index=\"generation\")\n", "for name, group in pkm.groupby(\"generation\"):\n", " aux.loc[name, \"legendary\"] = group['legendary'].sum()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1.4 ms ± 143 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n" ] } ], "source": [ "%%timeit\n", "pkm.groupby(\"generation\").agg({\"legendary\": \"sum\"})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Si es algo pequeño de una sola columna, puedes acceder directamente a ella." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "generation\n", "1 6\n", "2 5\n", "3 18\n", "4 13\n", "5 15\n", "6 8\n", "Name: legendary, dtype: int64" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pkm.groupby(\"generation\")[\"legendary\"].sum() # Ojo! Devuelve una serie" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Puedes agrupar por más de una columna." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
hp_max
type1type2
BugElectric70
Fighting80
Fire85
Flying86
Ghost1
.........
WaterIce130
Poison80
Psychic95
Rock100
Steel84
\n", "

136 rows × 1 columns

\n", "
" ], "text/plain": [ " hp_max\n", "type1 type2 \n", "Bug Electric 70\n", " Fighting 80\n", " Fire 85\n", " Flying 86\n", " Ghost 1\n", "... ...\n", "Water Ice 130\n", " Poison 80\n", " Psychic 95\n", " Rock 100\n", " Steel 84\n", "\n", "[136 rows x 1 columns]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pkm.groupby([\"type1\", \"type2\"]).agg(hp_max=(\"hp\", \"max\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "También puedes agregar más de una columna." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
hp_maxattack_max
type1type2
BugElectric7077
Fighting80185
Fire8585
Flying86155
Ghost190
............
WaterIce13095
Poison8095
Psychic9575
Rock100108
Steel8486
\n", "

136 rows × 2 columns

\n", "
" ], "text/plain": [ " hp_max attack_max\n", "type1 type2 \n", "Bug Electric 70 77\n", " Fighting 80 185\n", " Fire 85 85\n", " Flying 86 155\n", " Ghost 1 90\n", "... ... ...\n", "Water Ice 130 95\n", " Poison 80 95\n", " Psychic 95 75\n", " Rock 100 108\n", " Steel 84 86\n", "\n", "[136 rows x 2 columns]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(\n", " pkm.groupby([\"type1\", \"type2\"])\n", " .agg(\n", " hp_max=(\"hp\", \"max\"),\n", " attack_max=(\"attack\", \"max\")\n", " )\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Incluso hacer más de una agregación a una misma columna" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
hp
minmeanmax
type1type2
BugElectric5060.00000070
Fighting8080.00000080
Fire5570.00000085
Flying3063.00000086
Ghost11.0000001
...............
WaterIce5090.000000130
Poison4061.66666780
Psychic6087.00000095
Rock5470.750000100
Steel8484.00000084
\n", "

136 rows × 3 columns

\n", "
" ], "text/plain": [ " hp \n", " min mean max\n", "type1 type2 \n", "Bug Electric 50 60.000000 70\n", " Fighting 80 80.000000 80\n", " Fire 55 70.000000 85\n", " Flying 30 63.000000 86\n", " Ghost 1 1.000000 1\n", "... .. ... ...\n", "Water Ice 50 90.000000 130\n", " Poison 40 61.666667 80\n", " Psychic 60 87.000000 95\n", " Rock 54 70.750000 100\n", " Steel 84 84.000000 84\n", "\n", "[136 rows x 3 columns]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(\n", " pkm.groupby([\"type1\", \"type2\"])\n", " .agg({\"hp\": [\"min\", \"mean\", \"max\"]}\n", " )\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Si quieres cambiar los nombres es un poco más verboso." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
hp_minhp_meanhp_max
type1type2
BugElectric5060.00000070
Fighting8080.00000080
Fire5570.00000085
Flying3063.00000086
Ghost11.0000001
...............
WaterIce5090.000000130
Poison4061.66666780
Psychic6087.00000095
Rock5470.750000100
Steel8484.00000084
\n", "

136 rows × 3 columns

\n", "
" ], "text/plain": [ " hp_min hp_mean hp_max\n", "type1 type2 \n", "Bug Electric 50 60.000000 70\n", " Fighting 80 80.000000 80\n", " Fire 55 70.000000 85\n", " Flying 30 63.000000 86\n", " Ghost 1 1.000000 1\n", "... ... ... ...\n", "Water Ice 50 90.000000 130\n", " Poison 40 61.666667 80\n", " Psychic 60 87.000000 95\n", " Rock 54 70.750000 100\n", " Steel 84 84.000000 84\n", "\n", "[136 rows x 3 columns]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(\n", " pkm.groupby([\"type1\", \"type2\"])\n", " .agg(\n", " hp_min=(\"hp\", \"min\"),\n", " hp_mean=(\"hp\", \"mean\"),\n", " hp_max=(\"hp\", \"max\"),\n", " )\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "También puedes aplicar tus propias funciones" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
hp_range
type1type2
BugElectric20
Fighting0
Fire30
Flying56
Ghost0
.........
WaterIce80
Poison40
Psychic35
Rock46
Steel0
\n", "

136 rows × 1 columns

\n", "
" ], "text/plain": [ " hp_range\n", "type1 type2 \n", "Bug Electric 20\n", " Fighting 0\n", " Fire 30\n", " Flying 56\n", " Ghost 0\n", "... ...\n", "Water Ice 80\n", " Poison 40\n", " Psychic 35\n", " Rock 46\n", " Steel 0\n", "\n", "[136 rows x 1 columns]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(\n", " pkm.groupby([\"type1\", \"type2\"])\n", " .agg(\n", " hp_range=(\"hp\", lambda x: x.max() - x.min()),\n", " )\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finalmente, si quieres interactuar con más de una columna, necesitas el método `apply`." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "type1 type2 \n", "Bug Electric 7.000000\n", " Fighting 60.000000\n", " Fire 12.500000\n", " Flying 8.571429\n", " Ghost 45.000000\n", " ... \n", "Water Ice -30.000000\n", " Poison 10.000000\n", " Psychic -31.000000\n", " Rock -30.000000\n", " Steel -2.000000\n", "Length: 136, dtype: float64" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(\n", " pkm.groupby([\"type1\", \"type2\"])\n", " .apply(lambda df: df[\"attack\"].mean() - df[\"defense\"].mean())\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Eres libre de definir tu propia función y entregarla a un groupby, por ejemplo, usando el mismo ejemplo anterior:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "def attack_minus_defense(df):\n", " return df[\"attack\"].mean() - df[\"defense\"].mean()" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "type1 type2 \n", "Bug Electric 7.000000\n", " Fighting 60.000000\n", " Fire 12.500000\n", " Flying 8.571429\n", " Ghost 45.000000\n", " ... \n", "Water Ice -30.000000\n", " Poison 10.000000\n", " Psychic -31.000000\n", " Rock -30.000000\n", " Steel -2.000000\n", "Length: 136, dtype: float64" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pkm.groupby([\"type1\", \"type2\"]).apply(attack_minus_defense)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Nota que no es necesario usar la función lambda, pero si es importante que tu función definida tenga como argumento un dataframe (puedes pensar que en cada grupo se le entregará el dataframe filtrado)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Group By - Transform" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ejemplo: Normalizar cada columna agrupados por generación" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
hpattackdefensesp_atksp_defspeedlegendary
#
1-0.739479-0.898969-0.763283-0.198010-0.160373-0.929521-0.193065
2-0.206695-0.476132-0.2744790.2375420.427740-0.424060-0.193065
30.5036850.1743860.4238120.8182771.2118920.249889-0.193065
40.5036850.7598521.8203951.4570861.9960430.249889-0.193065
5-0.952593-0.801391-0.972770-0.343193-0.748487-0.255573-0.193065
........................
796-0.8737540.8291822.3371490.8086412.496477-0.6398513.022779
797-0.8737542.8854211.0620582.6959801.1669681.6955103.022779
7980.5611161.171889-0.5318062.3814231.8317230.1386033.022779
7990.5611162.885421-0.5318063.0105361.8317230.5278303.022779
8000.5611161.1718891.3808311.7523100.5022140.1386033.022779
\n", "

800 rows × 7 columns

\n", "
" ], "text/plain": [ " hp attack defense sp_atk sp_def speed legendary\n", "# \n", "1 -0.739479 -0.898969 -0.763283 -0.198010 -0.160373 -0.929521 -0.193065\n", "2 -0.206695 -0.476132 -0.274479 0.237542 0.427740 -0.424060 -0.193065\n", "3 0.503685 0.174386 0.423812 0.818277 1.211892 0.249889 -0.193065\n", "4 0.503685 0.759852 1.820395 1.457086 1.996043 0.249889 -0.193065\n", "5 -0.952593 -0.801391 -0.972770 -0.343193 -0.748487 -0.255573 -0.193065\n", ".. ... ... ... ... ... ... ...\n", "796 -0.873754 0.829182 2.337149 0.808641 2.496477 -0.639851 3.022779\n", "797 -0.873754 2.885421 1.062058 2.695980 1.166968 1.695510 3.022779\n", "798 0.561116 1.171889 -0.531806 2.381423 1.831723 0.138603 3.022779\n", "799 0.561116 2.885421 -0.531806 3.010536 1.831723 0.527830 3.022779\n", "800 0.561116 1.171889 1.380831 1.752310 0.502214 0.138603 3.022779\n", "\n", "[800 rows x 7 columns]" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(\n", " pkm.groupby(\"generation\")\n", " .transform(lambda s: (s - s.mean()) / s.std())\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "También se lo puedes aplicar a una sola columna." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "1 -0.898969\n", "2 -0.476132\n", "3 0.174386\n", "4 0.759852\n", "5 -0.801391\n", " ... \n", "796 0.829182\n", "797 2.885421\n", "798 1.171889\n", "799 2.885421\n", "800 1.171889\n", "Name: attack, Length: 800, dtype: float64" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(\n", " pkm.groupby(\"generation\")[\"attack\"]\n", " .transform(lambda s: (s - s.mean()) / s.std())\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Personalmente, no suelo utilizar mucho este método, porque prefiero guardar la data original. Suelo agregar nuevas columnas, por ejemplo:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nametype1type2hpattackdefensesp_atksp_defspeedgenerationlegendaryattack_nrm
#
1BulbasaurGrassPoison4549496565451False-0.898969
2IvysaurGrassPoison6062638080601False-0.476132
3VenusaurGrassPoison808283100100801False0.174386
4Mega VenusaurGrassPoison80100123122120801False0.759852
5CharmanderFireNaN3952436050651False-0.801391
.......................................
796DiancieRockFairy50100150100150506True0.829182
797Mega DiancieRockFairy501601101601101106True2.885421
798Hoopa ConfinedPsychicGhost8011060150130706True1.171889
799Hoopa UnboundPsychicDark8016060170130806True2.885421
800VolcanionFireWater8011012013090706True1.171889
\n", "

800 rows × 12 columns

\n", "
" ], "text/plain": [ " name type1 type2 hp attack defense sp_atk sp_def \\\n", "# \n", "1 Bulbasaur Grass Poison 45 49 49 65 65 \n", "2 Ivysaur Grass Poison 60 62 63 80 80 \n", "3 Venusaur Grass Poison 80 82 83 100 100 \n", "4 Mega Venusaur Grass Poison 80 100 123 122 120 \n", "5 Charmander Fire NaN 39 52 43 60 50 \n", ".. ... ... ... .. ... ... ... ... \n", "796 Diancie Rock Fairy 50 100 150 100 150 \n", "797 Mega Diancie Rock Fairy 50 160 110 160 110 \n", "798 Hoopa Confined Psychic Ghost 80 110 60 150 130 \n", "799 Hoopa Unbound Psychic Dark 80 160 60 170 130 \n", "800 Volcanion Fire Water 80 110 120 130 90 \n", "\n", " speed generation legendary attack_nrm \n", "# \n", "1 45 1 False -0.898969 \n", "2 60 1 False -0.476132 \n", "3 80 1 False 0.174386 \n", "4 80 1 False 0.759852 \n", "5 65 1 False -0.801391 \n", ".. ... ... ... ... \n", "796 50 6 True 0.829182 \n", "797 110 6 True 2.885421 \n", "798 70 6 True 1.171889 \n", "799 80 6 True 2.885421 \n", "800 70 6 True 1.171889 \n", "\n", "[800 rows x 12 columns]" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pkm.assign(\n", " attack_nrm=lambda df: df.groupby(\"generation\")[\"attack\"].transform(lambda s: (s - s.mean()) / s.std())\n", ")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Group By - Filter" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ejemplo: Filtrar el dataframe de pokemons pero manteniendo solo las generaciones que tengan más de 10 pokemones legendarios.\n", "\n", "Como te estás dando cuenta, no se puede hacer con una máscara, porque el criterio no depende de cada registro, depende del grupo al cual pertenece el regirstro." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nametype1type2hpattackdefensesp_atksp_defspeedgenerationlegendary
#
273TreeckoGrassNaN4045356555703False
274GrovyleGrassNaN5065458565953False
275SceptileGrassNaN708565105851203False
276Mega SceptileGrassDragon7011075145851453False
277TorchicFireNaN4560407050453False
\n", "
" ], "text/plain": [ " name type1 type2 hp attack defense sp_atk sp_def speed \\\n", "# \n", "273 Treecko Grass NaN 40 45 35 65 55 70 \n", "274 Grovyle Grass NaN 50 65 45 85 65 95 \n", "275 Sceptile Grass NaN 70 85 65 105 85 120 \n", "276 Mega Sceptile Grass Dragon 70 110 75 145 85 145 \n", "277 Torchic Fire NaN 45 60 40 70 50 45 \n", "\n", " generation legendary \n", "# \n", "273 3 False \n", "274 3 False \n", "275 3 False \n", "276 3 False \n", "277 3 False " ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pkm_filtered = pkm.groupby(\"generation\").filter(lambda df: df[\"legendary\"].sum() > 10)\n", "pkm_filtered.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Veamos cuales son las generaciones que permanecieron luego del filtrado." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([3, 4, 5])" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pkm_filtered[\"generation\"].unique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Verifiquemos que filtramos correctamente" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "generation\n", "1 6\n", "2 5\n", "3 18\n", "4 13\n", "5 15\n", "6 8\n", "Name: legendary, dtype: int64" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pkm.groupby(\"generation\")[\"legendary\"].sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lo importante del argumento del método `filter` es que retorne un booleano! Es distinto a cuando uno hace una máscara, donde se obtiene una serie de elementos booleanos.\n", "\n", "Por ejemplo, filtrar los pokemones que son de la primera generación generación." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nametype1type2hpattackdefensesp_atksp_defspeedgenerationlegendary
#
167ChikoritaGrassNaN4549654965452False
168BayleefGrassNaN6062806380602False
169MeganiumGrassNaN808210083100802False
170CyndaquilFireNaN3952436050652False
171QuilavaFireNaN5864588065802False
....................................
796DiancieRockFairy50100150100150506True
797Mega DiancieRockFairy501601101601101106True
798Hoopa ConfinedPsychicGhost8011060150130706True
799Hoopa UnboundPsychicDark8016060170130806True
800VolcanionFireWater8011012013090706True
\n", "

634 rows × 11 columns

\n", "
" ], "text/plain": [ " name type1 type2 hp attack defense sp_atk sp_def \\\n", "# \n", "167 Chikorita Grass NaN 45 49 65 49 65 \n", "168 Bayleef Grass NaN 60 62 80 63 80 \n", "169 Meganium Grass NaN 80 82 100 83 100 \n", "170 Cyndaquil Fire NaN 39 52 43 60 50 \n", "171 Quilava Fire NaN 58 64 58 80 65 \n", ".. ... ... ... .. ... ... ... ... \n", "796 Diancie Rock Fairy 50 100 150 100 150 \n", "797 Mega Diancie Rock Fairy 50 160 110 160 110 \n", "798 Hoopa Confined Psychic Ghost 80 110 60 150 130 \n", "799 Hoopa Unbound Psychic Dark 80 160 60 170 130 \n", "800 Volcanion Fire Water 80 110 120 130 90 \n", "\n", " speed generation legendary \n", "# \n", "167 45 2 False \n", "168 60 2 False \n", "169 80 2 False \n", "170 65 2 False \n", "171 80 2 False \n", ".. ... ... ... \n", "796 50 6 True \n", "797 110 6 True \n", "798 70 6 True \n", "799 80 6 True \n", "800 70 6 True \n", "\n", "[634 rows x 11 columns]" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pkm.loc[lambda df: df[\"generation\"] != 1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "La máscara dentro de `loc` es una serie de elementos booleanos." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "1 False\n", "2 False\n", "3 False\n", "4 False\n", "5 False\n", " ... \n", "796 True\n", "797 True\n", "798 True\n", "799 True\n", "800 True\n", "Name: generation, Length: 800, dtype: bool" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pkm[\"generation\"] != 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Resumen" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Agrupar datos por condiciones es una tarea usual.\n", "* Dependiendo de tu objetivo es posible operar, transformar o filtrar los grupos." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }