{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "provenance": [] }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "language_info": { "name": "python" } }, "cells": [ { "cell_type": "markdown", "source": [ "# A miniature example Jupyter notebook for the ZUM project using Python" ], "metadata": { "id": "uWO8GBphz8r8" } }, { "cell_type": "markdown", "source": [ "This example demonstrates **only** how to download the data files and additional source code files; it should not be treated as a template for carrying out and documenting the project." ], "metadata": { "id": "iWEro61OMReE" } }, { "cell_type": "markdown", "source": [ "**Downloading the data files (original source: [UCI](https://archive.ics.uci.edu/ml/datasets/census+income); the version used here has been converted to CSV format):**" ], "metadata": { "id": "S0bAPzgEQLEP" } }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "TuCXV5IaxkVi", "outputId": "0d4aa84c-3edf-4c8d-9942-744f9af40ec4" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "--2023-03-29 08:32:44-- http://elektron.elka.pw.edu.pl/~pcichosz/zum/projekt/census-income-train.csv\n", "Resolving elektron.elka.pw.edu.pl (elektron.elka.pw.edu.pl)... 194.29.160.103\n", "Connecting to elektron.elka.pw.edu.pl (elektron.elka.pw.edu.pl)|194.29.160.103|:80... connected.\n", "HTTP request sent, awaiting response... 200 OK\n", "Length: 3518606 (3.4M) [text/csv]\n", "Saving to: ‘census-income-train.csv’\n", "\n", "census-income-train 100%[===================>] 3.36M 775KB/s in 5.1s \n", "\n", "2023-03-29 08:32:49 (669 KB/s) - ‘census-income-train.csv’ saved [3518606/3518606]\n", "\n", "--2023-03-29 08:32:50-- http://elektron.elka.pw.edu.pl/~pcichosz/zum/projekt/census-income-test.csv\n", "Resolving elektron.elka.pw.edu.pl (elektron.elka.pw.edu.pl)... 
194.29.160.103\n", "Connecting to elektron.elka.pw.edu.pl (elektron.elka.pw.edu.pl)|194.29.160.103|:80... connected.\n", "HTTP request sent, awaiting response... 200 OK\n", "Length: 1759072 (1.7M) [text/csv]\n", "Saving to: ‘census-income-test.csv’\n", "\n", "census-income-test. 100%[===================>] 1.68M 503KB/s in 3.4s \n", "\n", "2023-03-29 08:32:53 (503 KB/s) - ‘census-income-test.csv’ saved [1759072/1759072]\n", "\n" ] } ], "source": [ "!wget -nc elektron.elka.pw.edu.pl/~pcichosz/zum/projekt/census-income-train.csv\n", "!wget -nc elektron.elka.pw.edu.pl/~pcichosz/zum/projekt/census-income-test.csv" ] }, { "cell_type": "markdown", "source": [ "**Downloading additional source files:**" ], "metadata": { "id": "_bDx6nfkQTTs" } }, { "cell_type": "code", "source": [ "!wget -nc elektron.elka.pw.edu.pl/~pcichosz/zum/projekt/plot_roc.py" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "pnrPFkCJPdGg", "outputId": "9bccf77b-0b29-42ed-e391-4c50550442f2" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "--2023-03-29 08:32:53-- http://elektron.elka.pw.edu.pl/~pcichosz/zum/projekt/plot_roc.py\n", "Resolving elektron.elka.pw.edu.pl (elektron.elka.pw.edu.pl)... 194.29.160.103\n", "Connecting to elektron.elka.pw.edu.pl (elektron.elka.pw.edu.pl)|194.29.160.103|:80... connected.\n", "HTTP request sent, awaiting response... 200 OK\n", "Length: 429 [text/x-python]\n", "Saving to: ‘plot_roc.py’\n", "\n", "plot_roc.py 100%[===================>] 429 --.-KB/s in 0s \n", "\n", "2023-03-29 08:32:54 (62.7 MB/s) - ‘plot_roc.py’ saved [429/429]\n", "\n" ] } ] }, { "cell_type": "markdown", "source": [ "Experiments..." 
], "metadata": { "id": "J01C4VuMgJQM" } }, { "cell_type": "code", "source": [ "import pandas as pd" ], "metadata": { "id": "013yQzUd9Peu" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# Load the training and test sets downloaded above\n", "cin_train = pd.read_csv('census-income-train.csv')\n", "cin_test = pd.read_csv('census-income-test.csv')" ], "metadata": { "id": "Hn1E9Ovh9T4B" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "from sklearn.ensemble import RandomForestClassifier\n", "from sklearn.preprocessing import OneHotEncoder\n", "from sklearn.compose import make_column_transformer, make_column_selector\n", "from sklearn.pipeline import make_pipeline" ], "metadata": { "id": "qI-s_LkX7aSM" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# One-hot encode the non-numeric columns, pass the numeric ones through unchanged,\n", "# then fit a random forest with default hyperparameters\n", "rf = make_pipeline(make_column_transformer((OneHotEncoder(handle_unknown='ignore'),\n", "                                            make_column_selector(dtype_exclude='number')),\n", "                                           remainder='passthrough'),\n", "                   RandomForestClassifier())\n", "rf.fit(cin_train.drop('income', axis=1), cin_train['income'])" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 191 }, "outputId": "88fd1885-a97e-4bbf-e5be-3b2a63632be1", "id": "udd2G8bzc143" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "Pipeline(steps=[('columntransformer',\n", "                 ColumnTransformer(remainder='passthrough',\n", "                                   transformers=[('onehotencoder',\n", "                                                  OneHotEncoder(handle_unknown='ignore'),\n", "                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x7f0dd5103d90>)])),\n", "                ('randomforestclassifier', RandomForestClassifier())])" ], "text/html": [ "
Pipeline(steps=[('columntransformer',\n",
              "                 ColumnTransformer(remainder='passthrough',\n",
              "                                   transformers=[('onehotencoder',\n",
              "                                                  OneHotEncoder(handle_unknown='ignore'),\n",
              "                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x7f0dd5103d90>)])),\n",
              "                ('randomforestclassifier', RandomForestClassifier())])
" ] }, "metadata": {}, "execution_count": 6 } ] }, { "cell_type": "code", "source": [ "cin_prob = rf.predict_proba(cin_test)[:,1]" ], "metadata": { "id": "qLiKcgkXI3XO" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "from plot_roc import plot_roc" ], "metadata": { "id": "cusUz2N5Q320" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "plot_roc(cin_test['income'].to_numpy(), cin_prob)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 283 }, "id": "_4IURdNwJAeU", "outputId": "ee6cbe57-aad1-46c0-ed4a-f69132f154c6" }, "execution_count": null, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "
" ], "image/png": "\n" }, "metadata": { "needs_background": "light" } } ] } ] }