{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Regresión" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Los objetivos de esta clase son:\n", "\n", "* Comprender/recordar regresión lineal.\n", "* Estimar el error al aplicar modelos matemáticos a los datos.\n", "* Introducir la librería _scikit-learn_.\n", "* Otros tipos de regresión." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Motivación" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "La regresión lineal es una técnica universalmente utilizada y a pesar de su simpletaza, la derivación de este método entrega importantes consideraciones sobre su implementación, sus hipótesis y sus posibles extensiones.\n", "\n", "Para motivar el estudio utilizaremos datos de diabetes disponibles en la biblioteca _scikit learn_." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ThemeRegistry.enable('opaque')" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "import pandas as pd\n", "import altair as alt\n", "\n", "from sklearn import datasets\n", "\n", "alt.themes.enable('opaque') # Para quienes utilizan temas oscuros en Jupyter Lab" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agesexbmibps1s2s3s4s5s6target
00.0380760.0506800.0616960.021872-0.044223-0.034821-0.043401-0.0025920.019908-0.017646151.0
1-0.001882-0.044642-0.051474-0.026328-0.008449-0.0191630.074412-0.039493-0.068330-0.09220475.0
20.0852990.0506800.044451-0.005671-0.045599-0.034194-0.032356-0.0025920.002864-0.025930141.0
3-0.089063-0.044642-0.011595-0.0366560.0121910.024991-0.0360380.0343090.022692-0.009362206.0
40.005383-0.044642-0.0363850.0218720.0039350.0155960.008142-0.002592-0.031991-0.046641135.0
\n", "
" ], "text/plain": [ " age sex bmi bp s1 s2 s3 \\\n", "0 0.038076 0.050680 0.061696 0.021872 -0.044223 -0.034821 -0.043401 \n", "1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163 0.074412 \n", "2 0.085299 0.050680 0.044451 -0.005671 -0.045599 -0.034194 -0.032356 \n", "3 -0.089063 -0.044642 -0.011595 -0.036656 0.012191 0.024991 -0.036038 \n", "4 0.005383 -0.044642 -0.036385 0.021872 0.003935 0.015596 0.008142 \n", "\n", " s4 s5 s6 target \n", "0 -0.002592 0.019908 -0.017646 151.0 \n", "1 -0.039493 -0.068330 -0.092204 75.0 \n", "2 -0.002592 0.002864 -0.025930 141.0 \n", "3 0.034309 0.022692 -0.009362 206.0 \n", "4 -0.002592 -0.031991 -0.046641 135.0 " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True, as_frame=True)\n", "diabetes = pd.concat([diabetes_X, diabetes_y], axis=1)\n", "diabetes.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "En este conjunto la variable a predecir (_target_) es una medida cuantitativa de la progresión de la enfermedad un año después de la línea base. Contiene 10 características (_features_) numéricas, definidas en la siguienta tabla. Cada una de estas ha sido centrada y escalada por su desviación estandar multiplicada por la cantidad de muestras, i.e. 
la suma de cuadrados de cada columna es 1, lo que en la práctica significa que cada columna tiene norma unitaria.\n", "\n", "| Feature | Descripción |\n", "| :------------- | :----------: |\n", "| age | age in years|\n", "| sex | sex |\n", "| bmi | body mass index|\n", "| bp | average blood pressure|\n", "| s1 | tc, total serum cholesterol|\n", "| s2 | ldl, low-density lipoproteins|\n", "| s3 | hdl, high-density lipoproteins|\n", "| s4 | tch, total cholesterol / HDL|\n", "| s5 | ltg, possibly log of serum triglycerides level|\n", "| s6 | glu, blood sugar level|" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A continuación, exploremos los datos." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countmeanstdmin25%50%75%max
age442.0-3.634285e-160.047619-0.107226-0.0372990.0053830.0380760.110727
sex442.01.308343e-160.047619-0.044642-0.044642-0.0446420.0506800.050680
bmi442.0-8.045349e-160.047619-0.090275-0.034229-0.0072840.0312480.170555
bp442.01.281655e-160.047619-0.112400-0.036656-0.0056710.0356440.132044
s1442.0-8.835316e-170.047619-0.126781-0.034248-0.0043210.0283580.153914
s2442.01.327024e-160.047619-0.115613-0.030358-0.0038190.0298440.198788
s3442.0-4.574646e-160.047619-0.102307-0.035117-0.0065840.0293120.181179
s4442.03.777301e-160.047619-0.076395-0.039493-0.0025920.0343090.185234
s5442.0-3.830854e-160.047619-0.126097-0.033249-0.0019480.0324330.133599
s6442.0-3.412882e-160.047619-0.137767-0.033179-0.0010780.0279170.135612
target442.01.521335e+0277.09300525.00000087.000000140.500000211.500000346.000000
\n", "
" ], "text/plain": [ " count mean std min 25% 50% \\\n", "age 442.0 -3.634285e-16 0.047619 -0.107226 -0.037299 0.005383 \n", "sex 442.0 1.308343e-16 0.047619 -0.044642 -0.044642 -0.044642 \n", "bmi 442.0 -8.045349e-16 0.047619 -0.090275 -0.034229 -0.007284 \n", "bp 442.0 1.281655e-16 0.047619 -0.112400 -0.036656 -0.005671 \n", "s1 442.0 -8.835316e-17 0.047619 -0.126781 -0.034248 -0.004321 \n", "s2 442.0 1.327024e-16 0.047619 -0.115613 -0.030358 -0.003819 \n", "s3 442.0 -4.574646e-16 0.047619 -0.102307 -0.035117 -0.006584 \n", "s4 442.0 3.777301e-16 0.047619 -0.076395 -0.039493 -0.002592 \n", "s5 442.0 -3.830854e-16 0.047619 -0.126097 -0.033249 -0.001948 \n", "s6 442.0 -3.412882e-16 0.047619 -0.137767 -0.033179 -0.001078 \n", "target 442.0 1.521335e+02 77.093005 25.000000 87.000000 140.500000 \n", "\n", " 75% max \n", "age 0.038076 0.110727 \n", "sex 0.050680 0.050680 \n", "bmi 0.031248 0.170555 \n", "bp 0.035644 0.132044 \n", "s1 0.028358 0.153914 \n", "s2 0.029844 0.198788 \n", "s3 0.029312 0.181179 \n", "s4 0.034309 0.185234 \n", "s5 0.032433 0.133599 \n", "s6 0.027917 0.135612 \n", "target 211.500000 346.000000 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "diabetes.describe().T" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "age 1.000000\n", "sex 1.000000\n", "bmi 1.000000\n", "bp 1.000000\n", "s1 1.000000\n", "s2 1.000000\n", "s3 1.000000\n", "s4 1.000000\n", "s5 1.000000\n", "s6 1.000000\n", "target 3584.818126\n", "dtype: float64" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "diabetes.apply(np.linalg.norm)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "También es interesante ver como se relaciona cada _feature_ con el _target_." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.HConcatChart(...)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "base = alt.Chart(diabetes).mark_circle().encode(\n", " x=alt.X(alt.repeat(\"row\"), type='quantitative'),\n", " y=\"target\"\n", ").properties(\n", " width=300,\n", " height=300\n", ")\n", "\n", "base.repeat(row=diabetes_X.columns.tolist()[:4]) | base.repeat(row=diabetes_X.columns.tolist()[4:7]) | base.repeat(row=diabetes_X.columns.tolist()[7:])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Modelo" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Supondremos que tenemos $m$ datos. \n", "\n", "Cada dato $x^{(i)}$, $i=1,\\dots,$ $m$ tiene $n$ componentes,\n", "$x^{(i)} = (x^{(i)}_1, ..., x^{(i)}_n)$. \n", "\n", "Conocemos además el valor (etiqueta) asociado a $x^{(i)}$ que llamaremos $y^{(i)}$, $i=1,\\dots, m$ ." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Nuestra hipótesis de modelo lineal puede escribirse como\n", "\n", "$$\\begin{aligned}\n", "h_{\\theta}(x) &= \\theta_0 + \\theta_1 x_1 + \\theta_2 x_2 + ... 
+ \\theta_n x_n \\\\\n", " &= \\begin{bmatrix}\\theta_0 & \\theta_1 & \\theta_2 & \\dots & \\theta_n\\end{bmatrix} \\begin{bmatrix}1 \\\\ x_1 \\\\x_2 \\\\ \\vdots \\\\ x_n\\end{bmatrix} \\\\\n", " &= \\theta^T \\begin{bmatrix}1\\\\x\\end{bmatrix} = \\begin{bmatrix}1 & x^T\\end{bmatrix} \\theta \\end{aligned}$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Definiremos $x^{(i)}_0 =1$, de modo que\n", "$h_{\\theta}(x^{(i)}) = (x^{(i)})^T \\theta $ y buscamos el vector de parámetros\n", "$$\\theta = \\begin{bmatrix}\\theta_0 \\\\ \\theta_1 \\\\ \\theta_2 \\\\ \\vdots \\\\ \\theta_n\\end{bmatrix}$$\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Definamos las matrices\n", "\n", "$$\\begin{aligned}\n", "Y &= \\begin{bmatrix}y^{(1)} \\\\ y^{(2)} \\\\ \\vdots \\\\ y^{(m)}\\end{bmatrix}\\end{aligned}$$\n", "\n", "y\n", "\n", "$$\\begin{aligned}\n", "X = \n", "\\begin{bmatrix} \n", "1 & x^{(1)}_1 & \\dots & x^{(1)}_n \\\\ \n", "1 & x^{(2)}_1 & \\dots & x^{(2)}_n \\\\\n", "\\vdots & \\vdots & & \\vdots \\\\\n", "1 & x^{(m)}_1 & \\dots & x^{(m)}_n \\\\\n", "\\end{bmatrix}\n", "= \n", "\\begin{bmatrix} \n", "- (x^{(1)})^T - \\\\ \n", "- (x^{(2)})^T - \\\\\n", "\\vdots \\\\\n", "- (x^{(m)})^T - \\\\\n", "\\end{bmatrix}\\end{aligned}$$\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Luego la evaluación\n", "de todos los datos puede escribirse matricialmente como\n", "\n", "$$\\begin{aligned}\n", "X \\theta &= \n", "\\begin{bmatrix}\n", "1 & x_1^{(1)} & ... & x_n^{(1)} \\\\\n", "\\vdots & \\vdots & & \\vdots \\\\\n", "1 & x_1^{(m)} & ... & x_n^{(m)} \\\\\n", "\\end{bmatrix}\n", "\\begin{bmatrix}\\theta_0 \\\\ \\theta_1 \\\\ \\vdots \\\\ \\theta_n\\end{bmatrix} \\\\\n", "& = \n", "\\begin{bmatrix}\n", "1 \\theta_0 + x^{(1)}_1 \\theta_1 + ... 
+ x^{(1)}_n \\theta_n \\\\\n", "\\vdots \\\\\n", "1 \\theta_0 + x^{(m)}_1 \\theta_1 + ... + x^{(m)}_n \\theta_n \\\\\n", "\\end{bmatrix} \\\\\n", "& = \n", "\\begin{bmatrix}\n", "h(x^{(1)}) \\\\\n", "\\vdots \\\\\n", "h(x^{(m)})\n", "\\end{bmatrix}\\end{aligned}$$\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Nuestro problema es\n", "encontrar un “buen” conjunto de valores $\\theta$ de modo que\n", "\n", "$$\\begin{aligned}\n", "\\begin{bmatrix}\n", "h(x^{(1)}) \\\\\n", "h(x^{(2)}) \\\\\n", "\\vdots \\\\\n", "h(x^{(m)})\n", "\\end{bmatrix}\n", "\\approx\n", "\\begin{bmatrix}y^{(1)} \\\\ y^{(2)} \\\\ \\vdots \\\\ y^{(m)}\\end{bmatrix}\\end{aligned}$$\n", "\n", "es decir, que $$X \\theta \\approx Y$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Para encontrar el mejor vector $\\theta$ podríamos definir una función de costo $J(\\theta)$ de la siguiente manera:\n", "\n", "$$J(\\theta) = \\frac{1}{2} \\sum_{i=1}^{m} \\left( h_{\\theta}(x^{(i)}) - y^{(i)}\\right)^2$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Implementaciones" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Aproximación Ingenieril" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "¿Cómo podemos resolver el problema\n", "en el menor número de pasos?\n", "\n", "Deseamos resolver el sistema $$A \\theta = b$$ con\n", "$A \\in \\mathbb{R}^{m \\times n}$ y $m > n$ (la matriz $A$ es _skinny_, es decir, tiene más filas que columnas).\n", "\n", "¿Cómo resolvemos?" 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Bueno,\n", "si $A \\in \\mathbb{R}^{m \\times n}$, entonces\n", "$A^T \\in \\mathbb{R}^{n \\times m}$ y la multiplicación está bien definida,\n", "y obtenemos el siguiente sistema lineal de dimensión $n \\times n$, conocido como **Ecuación Normal**:\n", "$$(A^T A) \\ \\theta = A^T b$$ \n", "\n", "Si la matriz $A^T A$ es invertible, el sistema se puede solucionar “sin mayor reparo”. $$\\theta = (A^T A)^{-1} A^T b$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "En nuestro caso, obtendríamos $$\\theta = (X^T X)^{-1} X^T Y$$ Esta\n", "respuesta, aunque correcta, no admite interpretaciones y no permite\n", "generalizar a casos más complejos.\n", "\n", "En particular...\n", "\n", "- ¿Qué relación tiene con la función de costo (no) utilizada?\n", "\n", "- ¿Qué pasa si $A^T A$ no es invertible?\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Aproximación Machine Learning" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "¿Cómo podemos obtener una\n", "buena aproximación para $\\theta$?\n", "\n", "Queremos encontrar $\\theta^*$ que minimice $J(\\theta)$.\n", "\n", "Basta con utilizar una buena rutina de optimización para cumplir con\n", "dicho objetivo.\n", "\n", "En particular, una elección natural es tomar la dirección de mayor\n", "descenso, es decir, el método del máximo descenso (gradient descent).\n", "\n", "$$\\theta^{(n+1)} = \\theta^{(n)} - \\alpha \\nabla_{\\theta} J(\\theta^{(n)})$$\n", "donde $\\alpha >0$ es la tasa de aprendizaje." 
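Antes de especializar el gradiente para nuestra función de costo, puede ayudar ver la regla de actualización en acción sobre una función de una variable. Lo siguiente es un sketch mínimo que no es parte del material original: la función $f(\theta) = (\theta - 3)^2$, la tasa `alpha` y la tolerancia son elecciones puramente ilustrativas.

```python
# Sketch mínimo de la regla theta^(n+1) = theta^(n) - alpha * grad,
# aplicada a f(theta) = (theta - 3)**2, cuyo gradiente es
# f'(theta) = 2 * (theta - 3) y cuyo mínimo está en theta = 3.
def descenso_gradiente(grad, theta0, alpha=0.1, tol=1e-8, max_iter=10_000):
    theta = theta0
    for _ in range(max_iter):
        nuevo_theta = theta - alpha * grad(theta)
        if abs(nuevo_theta - theta) < tol:  # criterio de parada
            return nuevo_theta
        theta = nuevo_theta
    return theta

theta_min = descenso_gradiente(lambda th: 2 * (th - 3), theta0=0.0)
print(theta_min)  # aproximadamente 3
```

Con una tasa demasiado grande (por ejemplo `alpha=1.1` en este mismo ejemplo) la iteración diverge; de ahí que la elección de $\alpha$ sea crucial.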
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "En\n", "nuestro caso, puesto que tenemos\n", "$$J(\\theta) = \\frac{1}{2} \\sum_{i=1}^{m} \\left( h_{\\theta}(x^{(i)}) - y^{(i)}\\right)^2$$\n", "se tiene que\n", "\n", "$$\\begin{aligned}\n", "\\frac{\\partial J(\\theta)}{\\partial \\theta_k} &=\n", "\\frac{\\partial }{\\partial \\theta_k} \\frac{1}{2} \\sum_{i=1}^{m} \\left( h_{\\theta}(x^{(i)}) - y^{(i)}\\right)^2 \\\\\n", "&= \\frac{1}{2} \\sum_{i=1}^{m} 2 \\left( h_{\\theta}(x^{(i)}) - y^{(i)}\\right) \\frac{\\partial h_{\\theta}(x^{(i)})}{\\partial \\theta_k} \\\\\n", "&= \\sum_{i=1}^{m} \\left( h_{\\theta}(x^{(i)}) - y^{(i)}\\right) x^{(i)}_k\\end{aligned}$$\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Luego, el algoritmo queda como sigue:\n", "$$\\begin{aligned}\n", "\\theta^{(n+1)} & = \\theta^{(n)} - \\alpha \\nabla_{\\theta} J(\\theta^{(n)}) \\\\\\\\\n", "\\frac{\\partial J(\\theta)}{\\partial \\theta_k}\n", "&= \\sum_{i=1}^{m} \\left( h_{\\theta}(x^{(i)}) - y^{(i)}\\right) x^{(i)}_k\\end{aligned}$$\n", "\n", "**Observación**: La elección de $\\alpha$ es crucial para la convergencia. En\n", "particular, una regla de trabajo es utilizar $0.01/m$. Notar que el parámetro $\\alpha$ no es un parámetro del modelo como tal, sino que es parte del algoritmo; este tipo de parámetros se suelen llamar **hyperparameters**. 
Puedes reconocerlos porque su valor es conocido antes de la fase de entrenamiento del modelo.\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "def lms_regression_slow(X, Y, theta, tol=1E-6):\n", "    m, n = X.shape\n", "    converged = False\n", "    alpha = 0.01 / len(Y)\n", "    while not converged:\n", "        gradient = 0.\n", "        for xiT, yi in zip(X, Y):\n", "            xiT = xiT.reshape(1, n)\n", "            hi = np.dot(xiT, theta)\n", "            gradient += (hi - yi) * xiT.T\n", "        new_theta = theta - alpha * gradient\n", "        # Criterio de parada: cambio relativo de theta menor que tol\n", "        converged = np.linalg.norm(theta - new_theta) < tol * np.linalg.norm(theta)\n", "        theta = new_theta\n", "    return theta" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "m = 1000\n", "t = np.linspace(0, 1, m)\n", "x = 2 + 2 * t\n", "y = 300 + 100 * t\n", "X = np.array([np.ones(m), x]).T\n", "Y = y.reshape(m, 1)\n", "theta_0 = np.array([[0.0], [0.0]])" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[199.39672176]\n", " [ 50.19457286]]\n" ] } ], "source": [ "theta_slow = lms_regression_slow(X, Y, theta_0)\n", "print(theta_slow)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Validamos si nuestro resultado es el indicado con una tolerancia dada." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.allclose(X @ theta_slow, Y, atol=0.5)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.allclose(X @ theta_slow, Y, atol=1e-3)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Implementación Vectorial" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "**¿Cómo podemos obtener una justificación para la ecuación normal?**\n", "\n", "Necesitamos los siguientes ingredientes:\n", "\n", "$$\\begin{aligned}\n", "\\nabla_x &(x^T A x) = A x + A^T x \\\\ \n", "\\nabla_x &(b^T x) = b \\end{aligned}$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Se tiene\n", "\n", "$$\\begin{aligned}\n", "J(\\theta) \n", "&= \\frac{1}{2} \\sum_{i=1}^{m} \\left( h_{\\theta}(x^{(i)}) - y^{(i)}\\right)^2 \\\\\n", "&= \\frac{1}{2} \\sum_{i=1}^{m} \\left( h_{\\theta}(x^{(i)}) - y^{(i)}\\right) \\left( h_{\\theta}(x^{(i)}) - y^{(i)}\\right) \\\\\n", "&= \\frac{1}{2} \\left( X \\theta - Y \\right)^T \\left( X \\theta - Y \\right) \\\\\n", "&= \\frac{1}{2} \\left( \\theta^T X^T - Y^T \\right) \\left( X \\theta - Y \\right) \\\\\n", "&= \\frac{1}{2} \\left( \\theta^T X^T X \\theta - \\theta^T X^T Y - Y^T X \\theta + Y^T Y \\right) \\\\\n", "&= \\frac{1}{2} \\left( \\theta^T X^T X \\theta - 2 (Y^T X) \\theta + Y^T Y \\right)\\end{aligned}$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Aplicando a cada uno de los términos, obtenemos:\n", "\n", "$$\\begin{aligned}\n", "\\nabla_\\theta ( \\theta^T X^T X \\theta ) &= X^T X 
\\theta + (X^T X)^T \\theta \\\\\n", "& = 2 X^T X \\theta\\end{aligned}$$\n", "\n", "también se tiene\n", "\n", "$$\\begin{aligned}\n", "\\nabla_\\theta ( Y^T X \\theta ) &= (Y^T X) ^T\\\\\n", "&= X^T Y\\end{aligned}$$\n", "\n", "y por último\n", "\n", "$$\\begin{aligned}\n", "\\nabla_\\theta ( Y^T Y ) = 0\\end{aligned}$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Por lo tanto se tiene que\n", "\n", "$$\\begin{aligned}\n", "\\nabla_\\theta J(\\theta) \n", "& = \\nabla_\\theta \\frac{1}{2} \\left( \\theta^T X^T X \\theta - 2 (Y^T X) \\theta + Y^T Y \\right) \\\\\n", "&= \\frac{1}{2} ( 2 X^T X \\theta - 2 X^T Y + 0 ) \\\\\n", "&= X^T X \\theta - X^T Y \\end{aligned}$$\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Esto significa que el problema $$\\min_\\theta J(\\theta)$$ se resuelve al\n", "hacer todas las derivadas parciales iguales a cero (i.e., gradiente igual\n", "a cero) $$\\nabla_\\theta J(\\theta) = 0$$ lo cual en nuestro caso se\n", "convierte convenientemente en la ecuación normal $$X^T X \\theta = X^T Y$$\n", "y se tiene $$\\hat{\\theta} = (X^T X)^{-1} X^T Y$$\n", "Aquí $\\hat{\\theta}$ es una estimación de $\\theta$. 
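La fórmula anterior requiere que $X^T X$ sea invertible. Cuando no lo es (por ejemplo, si hay columnas linealmente dependientes), el problema de mínimos cuadrados sigue teniendo solución vía `np.linalg.lstsq`, que trabaja con la SVD y devuelve la solución de norma mínima. Un sketch con una matriz de diseño inventada solo para ilustrar:

```python
import numpy as np

# Matriz de diseño hipotética con una columna duplicada: X^T X es singular
# y la fórmula con la inversa no aplica directamente.
X_sing = np.array([[1.0, 2.0, 2.0],
                   [1.0, 3.0, 3.0],
                   [1.0, 4.0, 4.0],
                   [1.0, 5.0, 5.0]])
Y_sing = np.array([1.0, 2.0, 3.0, 4.0])

# lstsq resuelve min ||X theta - Y|| aunque X no tenga rango columna completo.
theta_hat, residuos, rango, _ = np.linalg.lstsq(X_sing, Y_sing, rcond=None)
print(rango)              # 2: rango deficiente
print(X_sing @ theta_hat) # las predicciones quedan bien definidas igualmente
```

Aunque `theta_hat` no es único en este caso, las predicciones $X\hat{\theta}$ sí lo son; `lstsq` elige el $\hat{\theta}$ de norma mínima.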
" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "def lms_regression_fast(X, Y, theta, tol=1E-6):\n", " converged = False\n", " alpha = 0.01 / len(Y)\n", " theta = theta.reshape(X.shape[1], 1)\n", " A = np.dot(X.T, X)\n", " b = np.dot(X.T, Y)\n", " while not converged:\n", " gradient = np.dot(A, theta) - b\n", " new_theta = theta - alpha * gradient\n", " converged = np.linalg.norm(theta - new_theta) < tol * np.linalg.norm(theta)\n", " theta = new_theta\n", " return theta" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[199.39672176]\n", " [ 50.19457286]]\n" ] } ], "source": [ "theta_fast = lms_regression_fast(X, Y, theta_0)\n", "print(theta_fast)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Validación" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.allclose(X @ theta_fast, Y, atol=0.5)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.allclose(X @ theta_fast, Y, atol=1e-3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "También es posible usar la implementación de resolución de sistemas lineales dispoinible en numpy." 
] }, { "cell_type": "code", "execution_count": 15, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "def matrix_regression(X, Y, theta, tol=1E-6):\n", "    # theta y tol se mantienen solo por compatibilidad de firma; no se utilizan.\n", "    A = np.dot(X.T, X)\n", "    b = np.dot(X.T, Y)\n", "    sol = np.linalg.solve(A, b)\n", "    return sol" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[200.]\n", " [ 50.]]\n" ] } ], "source": [ "theta_npsolve = matrix_regression(X, Y, theta_0)\n", "print(theta_npsolve)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Interpretación Probabilística" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Consideremos el modelo lineal\n", "\n", "$$ Y = X \\theta + \\varepsilon $$\n", "\n", "donde $\\varepsilon$ es un vector de errores aleatorios de media cero y matriz de dispersión $\\sigma^2 I$, donde $I$ es la matriz identidad. Es usual añadir el supuesto de normalidad al vector de errores, por lo que se asume que \n", "\n", "$$\\varepsilon \\sim \\mathcal{N}(0, \\sigma^2 I)$$\n", "\n", "Cabe destacar que:\n", "\n", "- $\\theta$ no es una variable aleatoria, es un parámetro\n", " (desconocido).\n", "- $Y \\ | \\ X; \\theta \\sim \\mathcal{N}(X \\theta, \\sigma^2 I)$\n", "\n", "\n", "La función de verosimilitud $L(\\theta)$ nos\n", "permite entender qué tan probable es encontrar los datos observados,\n", "para una elección del parámetro $\\theta$.\n", "\n", "$$\n", "L(\\theta) = \\left( 2 \\pi \\sigma^2 \\right)^{-m/2} \\, \\exp\\left(- \\frac{1}{2 \\sigma ^2} || Y - X \\theta ||^2 \\right)\n", "$$\n", "\n", "Sea $l(\\theta) = \\log L(\\theta)$ la log-verosimilitud. Luego, ignorando los términos constantes se tiene\n", "\n", "$$\n", "l(\\theta) = -\\frac{m}{2} \\log \\sigma^2 - \\frac{1}{2 \\sigma ^2} || Y - X \\theta ||^2\n", "$$\n", "\n", "Derivando respecto a $\\theta$:\n", "\n", "$$\n", "\\begin{aligned}\n", "\\frac{\\partial l(\\theta)}{\\partial \\theta}\n", "&= - \\frac{1}{2 \\sigma ^2} \\left( - 2 X^T Y + 2 X^T X \\theta \\right) \\\\\n", "&= \\frac{1}{\\sigma ^2} \\left( X^T Y - X^T X \\theta \\right) \\\\\n", "\\end{aligned}\n", "$$\n", "\n", "Luego podemos usar toda nuestra artillería de optimización resolviendo $\\partial l(\\theta) / \\partial \\theta = 0$ y demostrando que es un máximo. Nuevamente llegamos a \n", "\n", "$$\\hat{\\theta} = (X^T X)^{-1} X^T Y$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Scikit-learn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "_**[Scikit-learn](https://scikit-learn.org/)** Machine Learning in Python_\n", "\n", "Scikit-learn is an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection and evaluation, and many other utilities.\n", "\n", "* Simple and efficient tools for predictive data analysis\n", "* Accessible to everybody, and reusable in various contexts\n", "* Built on `numpy`, `scipy` and `matplotlib`\n", "* Open source, commercially usable - BSD license" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Scikit-learn cuenta con una enorme cantidad de herramientas de regresión, siendo la regresión lineal la más simple de estas. 
Ajustar una es tan sencillo como:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[200.]\n", " [ 50.]]\n" ] } ], "source": [ "from sklearn.linear_model import LinearRegression\n", "\n", "reg = LinearRegression(fit_intercept=False)\n", "reg.fit(X, Y)\n", "theta_sklearn = reg.coef_.T\n", "print(theta_sklearn)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Nota que primero se crea un objeto `LinearRegression` en el que se declaran algunos parámetros; por ejemplo, en nuestro caso la matriz de diseño `X` ya posee una columna de intercepto, por lo que no es necesario incluirla en el modelo de scikit-learn. Luego se ajusta el modelo `reg` con el método `fit()`.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Benchmark" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Implementación simple" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1min 33s ± 1.2 s per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" ] } ], "source": [ "%%timeit\n", "lms_regression_slow(X, Y, theta_0)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[199.39672176],\n", " [ 50.19457286]])" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "theta_slow" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Implementación vectorizada" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "241 ms ± 7.73 ms per loop (mean ± std. dev. 
of 7 runs, 1 loop each)\n" ] } ], "source": [ "%%timeit\n", "lms_regression_fast(X, Y, theta_0)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[199.39672176],\n", " [ 50.19457286]])" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "theta_fast" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Implementación numpy" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "17.1 µs ± 475 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)\n" ] } ], "source": [ "%%timeit\n", "matrix_regression(X, Y, theta_0)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[200.],\n", " [ 50.]])" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "theta_npsolve" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Implementación scikit-learn" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "278 µs ± 2.58 µs per loop (mean ± std. dev. 
of 7 runs, 1000 loops each)\n" ] } ], "source": [ "%%timeit\n", "LinearRegression(fit_intercept=False).fit(X, Y).coef_.T" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[200.],\n", " [ 50.]])" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "theta_sklearn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Algunos comentarios:\n", "\n", "- La implementación simple es **miles de veces** más lenta que la más rápida, que en este caso es la implementación de numpy.\n", "- La implementación de numpy es sin duda la más rápida, pero no es posible utilizarla con matrices singulares.\n", "- Las implementaciones de _gradient descent_ escritas _from scratch_ no son lo suficientemente precisas.\n", "- scikit-learn demora más pues es más flexible, además de realizar validaciones al momento de ajustar los modelos." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Aspectos Prácticos\n", "\n", "Al realizar regresión, algunos autores indican que es conveniente normalizar/estandarizar los datos, es\n", "decir, transformarlos para que tengan una escala común:\n", "\n", "- Utilizando la media y la desviación estándar\n", " $$\\frac{x_i-\\overline{x_i}}{\\sigma_{x_i}}$$\n", "\n", "- Utilizando mínimos y máximos\n", " $$\\frac{x_i-\\min{x_i}}{\\max{x_i} - \\min{x_i}}$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "**¿Por qué normalizar?**\n", "\n", "- Los valores numéricos poseen escalas de magnitud distintas.\n", "- Las variables tienen distintos significados físicos.\n", "- Los algoritmos funcionan mejor.\n", "- La interpretación de los resultados es más sencilla.\n", "\n", "**Algunos problemas potenciales**\n", "\n", "- Normalizar puede introducir colinealidad en los datos, produciendo inestabilidad numérica en la implementación." 
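Las dos transformaciones anteriores se pueden escribir directamente con numpy. Un sketch mínimo sobre una columna artificial (los valores son inventados solo para ilustrar las fórmulas):

```python
import numpy as np

x = np.array([10.0, 12.0, 14.0, 18.0, 26.0])  # columna de ejemplo

# Estandarización: resta la media y divide por la desviación estándar
x_estandar = (x - x.mean()) / x.std()

# Escalamiento min-max: deja los valores en el intervalo [0, 1]
x_minmax = (x - x.min()) / (x.max() - x.min())

print(x_estandar.mean(), x_estandar.std())  # aproximadamente 0 y 1
print(x_minmax.min(), x_minmax.max())       # 0.0 y 1.0
```

En scikit-learn, estas mismas transformaciones están encapsuladas en los objetos `preprocessing.StandardScaler` y `preprocessing.MinMaxScaler`.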
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "En particular, scikit-learn ofrece herramientas para transformar datos. Por ejemplo, para escalar, la forma más fácil y directa es utilizar `sklearn.preprocessing.scale`" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 0.80050009, 1.06548848, 1.29708846, ..., -0.05449919,\n", " 0.41855058, -0.37098854],\n", " [-0.03956713, -0.93853666, -1.08218016, ..., -0.83030083,\n", " -1.43655059, -1.93847913],\n", " [ 1.79330681, 1.06548848, 0.93453324, ..., -0.05449919,\n", " 0.06020733, -0.54515416],\n", " ...,\n", " [ 0.87686984, 1.06548848, -0.33441002, ..., -0.23293356,\n", " -0.98558469, 0.32567395],\n", " [-0.9560041 , -0.93853666, 0.82123474, ..., 0.55838411,\n", " 0.93615545, -0.54515416],\n", " [-0.9560041 , -0.93853666, -1.53537419, ..., -0.83030083,\n", " -0.08871747, 0.06442552]])" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn import preprocessing\n", "\n", "diabetes_X_scaled = preprocessing.scale(diabetes_X)\n", "diabetes_X_scaled" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sin embargo, para aprovechar todas las bondades de scikit-learn se recomienda hacer uso de los objetos `Transformer`." 
] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 0.80050009, 1.06548848, 1.29708846, ..., -0.05449919,\n", " 0.41855058, -0.37098854],\n", " [-0.03956713, -0.93853666, -1.08218016, ..., -0.83030083,\n", " -1.43655059, -1.93847913],\n", " [ 1.79330681, 1.06548848, 0.93453324, ..., -0.05449919,\n", " 0.06020733, -0.54515416],\n", " ...,\n", " [ 0.87686984, 1.06548848, -0.33441002, ..., -0.23293356,\n", " -0.98558469, 0.32567395],\n", " [-0.9560041 , -0.93853666, 0.82123474, ..., 0.55838411,\n", " 0.93615545, -0.54515416],\n", " [-0.9560041 , -0.93853666, -1.53537419, ..., -0.83030083,\n", " -0.08871747, 0.06442552]])" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "scaler = preprocessing.StandardScaler().fit(diabetes_X)\n", "scaler.transform(diabetes_X)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Application" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Applying this to our diabetes data is as easy as " ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "reg = LinearRegression(fit_intercept=True).fit(diabetes_X, diabetes_y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To obtain the regression coefficients and the intercept, access the attributes of the fitted instance" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ -10.01219782, -239.81908937, 519.83978679, 324.39042769,\n", " -792.18416163, 476.74583782, 101.04457032, 177.06417623,\n", " 751.27932109, 67.62538639])" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "reg.coef_" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "152.1334841628965" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], 
"source": [ "reg.intercept_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is also possible to predict, or to obtain the score associated with data." ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([206.11706979, 68.07234761, 176.88406035, 166.91796559,\n", " 128.45984241, 106.34908972, 73.89417947, 118.85378669,\n", " 158.81033076, 213.58408893, 97.07853583, 95.1016223 ,\n", " 115.06673301, 164.67605023, 103.07517946, 177.17236996,\n", " 211.75953205, 182.84424343, 147.99987605, 124.01702527,\n", " 120.33094632, 85.80377894, 113.11286302, 252.44934852,\n", " 165.48821056, 147.72187623, 97.12824075, 179.09342974,\n", " 129.05497324, 184.78138552, 158.71515746, 69.47588393,\n", " 261.50255826, 112.81897436, 78.37194762, 87.66624129,\n", " 207.92460213, 157.87686037, 240.84370686, 136.93372685,\n", " 153.48187659, 74.15703284, 145.63105805, 77.8280105 ,\n", " 221.0786645 , 125.22224022, 142.60147066, 109.4926324 ,\n", " 73.14037106, 189.87368742, 157.93636782, 169.55816531,\n", " 134.18186217, 157.72356219, 139.1077439 , 72.73252701,\n", " 207.8289973 , 80.10834588, 104.08562488, 134.57807971,\n", " 114.23779529, 180.67760064, 61.12644508, 98.7215441 ,\n", " 113.79626149, 189.96141244, 148.98263155, 124.33457266,\n", " 114.83969622, 122.00224605, 73.91315064, 236.70948329,\n", " 142.31366526, 124.51427625, 150.84273716, 127.75408702,\n", " 191.16674356, 77.05921006, 166.82129568, 91.00741773,\n", " 174.75026808, 122.83488194, 63.27214662, 151.99895968,\n", " 53.73407848, 166.00134469, 42.65030679, 153.04135861,\n", " 80.54493791, 106.9048058 , 79.94239571, 187.1634566 ,\n", " 192.60115666, 61.07125918, 107.40466928, 125.04038427,\n", " 207.72180472, 214.21749964, 123.47505642, 139.16396617,\n", " 168.21035724, 106.9267784 , 150.64502809, 157.92231541,\n", " 152.75856279, 116.22255529, 73.03090141, 155.66898717,\n", " 230.14278537, 143.50191007, 38.0947967 , 121.860737 ,\n", " 152.79569851, 
207.99651918, 291.23082717, 189.17431487,\n", " 214.02871163, 235.18090808, 165.3872774 , 151.25000032,\n", " 156.57626783, 200.44154589, 219.35211772, 174.79049427,\n", " 169.23161767, 187.8719893 , 57.49473392, 108.55110499,\n", " 92.68518048, 210.87365701, 245.47433558, 69.84529943,\n", " 113.0351432 , 68.42945176, 141.69628649, 239.46177949,\n", " 58.3802079 , 235.47268158, 254.91986281, 253.31042713,\n", " 155.50813249, 230.55904185, 170.44063216, 117.99200943,\n", " 178.55548636, 240.07155813, 190.3398776 , 228.66100769,\n", " 114.24162642, 178.36570405, 209.09273631, 144.85567253,\n", " 200.65791056, 121.34184881, 150.50918174, 199.02165018,\n", " 146.2806806 , 124.02443772, 85.26036769, 235.16536625,\n", " 82.17255475, 231.29266191, 144.36634395, 197.04778326,\n", " 146.99720377, 77.18477545, 59.3728572 , 262.67891084,\n", " 225.12578458, 220.20506312, 46.59691745, 88.1040833 ,\n", " 221.77623752, 97.24900614, 164.48869956, 119.90114263,\n", " 157.79986195, 223.08505437, 99.5885471 , 165.84341641,\n", " 179.47571002, 89.83382843, 171.82492808, 158.36337775,\n", " 201.47857482, 186.39202728, 197.47094269, 66.57241937,\n", " 154.59826802, 116.18638034, 195.92074021, 128.04740268,\n", " 91.20285628, 140.56975398, 155.23013996, 169.70207476,\n", " 98.75498537, 190.1453107 , 142.5193942 , 177.26966106,\n", " 95.31403505, 69.0645889 , 164.16669511, 198.06460718,\n", " 178.26228169, 228.58801706, 160.67275473, 212.28682319,\n", " 222.48172067, 172.85184399, 125.27697688, 174.7240982 ,\n", " 152.38282657, 98.58485669, 99.73695497, 262.29658755,\n", " 223.73784832, 221.3425256 , 133.61497308, 145.42593933,\n", " 53.04259372, 141.81807792, 153.68369915, 125.21948824,\n", " 77.25091512, 230.26311068, 78.90849563, 105.20931175,\n", " 117.99633487, 99.06361032, 166.55382825, 159.34391027,\n", " 158.27612808, 143.05658763, 231.55938678, 176.64144413,\n", " 187.23572317, 65.38504165, 190.66078824, 179.74973878,\n", " 234.91022512, 119.15540438, 85.63464409, 
100.85860205,\n", " 140.4174259 , 101.83836332, 120.66138775, 83.06599161,\n", " 234.58754656, 245.16192142, 263.26766492, 274.87431887,\n", " 180.67699732, 203.05474761, 254.21769367, 118.44122343,\n", " 268.44988948, 104.83643442, 115.87172349, 140.45788952,\n", " 58.46850453, 129.83264097, 263.78452618, 45.01240356,\n", " 123.28697604, 131.08314499, 34.89018315, 138.35659686,\n", " 244.30370588, 89.95612306, 192.07094588, 164.32674962,\n", " 147.74783541, 191.89381753, 176.44296313, 158.34707354,\n", " 189.19183226, 116.58275843, 111.44622859, 117.45262547,\n", " 165.79457547, 97.80241129, 139.54389024, 84.17453643,\n", " 159.9389204 , 202.4011919 , 80.48200416, 146.64621068,\n", " 79.05274311, 191.33759392, 220.67545196, 203.75145711,\n", " 92.87093594, 179.15570241, 81.80126162, 152.82706623,\n", " 76.79700486, 97.79712384, 106.83424483, 123.83477117,\n", " 218.13375502, 126.02077447, 206.76300555, 230.57976636,\n", " 122.0628518 , 135.67694517, 126.36969016, 148.49621551,\n", " 88.07082258, 138.95595037, 203.86570118, 172.55362727,\n", " 122.95773416, 213.92445645, 174.88857841, 110.07169487,\n", " 198.36767241, 173.24601643, 162.64946177, 193.31777358,\n", " 191.53802295, 284.13478714, 279.30688474, 216.0070265 ,\n", " 210.08517801, 216.22213925, 157.01489819, 224.06561179,\n", " 189.05840605, 103.56829281, 178.70442926, 111.81492124,\n", " 290.99913121, 182.64959461, 79.33602602, 86.33287509,\n", " 249.15238929, 174.51439576, 122.10645431, 146.27099383,\n", " 170.6555544 , 183.50018707, 163.36970989, 157.03563376,\n", " 144.42617093, 125.30179325, 177.50072942, 104.57821235,\n", " 132.1746674 , 95.06145678, 249.9007786 , 86.24033937,\n", " 62.00077469, 156.81087903, 192.3231713 , 133.85292727,\n", " 93.67456315, 202.49458467, 52.53953733, 174.82926235,\n", " 196.9141296 , 118.06646574, 235.3011088 , 165.09286707,\n", " 160.41863314, 162.37831419, 254.05718804, 257.23616403,\n", " 197.50578991, 184.06609359, 58.62043851, 194.3950396 ,\n", " 110.77475548, 
142.20916765, 128.82725506, 180.12844365,\n", " 211.26415225, 169.59711427, 164.34167693, 136.2363478 ,\n", " 174.50905908, 74.67649224, 246.29542114, 114.14131338,\n", " 111.54358708, 140.02313284, 109.99647408, 91.37269237,\n", " 163.01389345, 75.16389857, 254.05755095, 53.47055785,\n", " 98.48060512, 100.66268306, 258.58885744, 170.67482041,\n", " 61.91866052, 182.3042492 , 171.26913027, 189.19307553,\n", " 187.18384852, 87.12032949, 148.37816611, 251.35898288,\n", " 199.69712357, 283.63722409, 50.85577124, 172.14848891,\n", " 204.06179478, 174.16816194, 157.93027543, 150.50201654,\n", " 232.9761832 , 121.5808709 , 164.54891787, 172.67742636,\n", " 226.78005938, 149.46967223, 99.14026374, 80.43680779,\n", " 140.15557121, 191.90593837, 199.27952034, 153.63210613,\n", " 171.80130949, 112.11314588, 162.60650576, 129.8448476 ,\n", " 258.02898298, 100.70869427, 115.87611124, 122.53790409,\n", " 218.17749233, 60.94590955, 131.09513588, 119.48417359,\n", " 52.60848094, 193.01802803, 101.05169913, 121.22505534,\n", " 211.8588945 , 53.44819015])" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "reg.predict(diabetes_X)" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.5177494254132934" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "reg.score(diabetes_X, diabetes_y)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }