{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Cleaning scripts"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Harmonize MX and Dmitry Nikolayev's datasets\n",
    "\n",
    "The replication datasets of Michalopoulos and Xue (MX) contain motifs data on 958 ethnic groups, taken from the 2019 version of Yuri Berezkin's catalog. However, these do not contain location data for the ethnic groups. Independent of MX, Dmitry Nikolayev provides coordinates for the ethnic groups [here](https://github.com/macleginn/mythology-queries). However, this only contains 926 ethnic groups and appears to be based on an older version of Berezkin's cataglog.\n",
    "\n",
    "This notebook reconciles the MX and DN datasets to the greatest extent possible. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 774,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import re\n",
    "from sklearn.linear_model import LinearRegression"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Load the MX dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 640,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>group_Berezkin</th>\n",
       "      <th>a1</th>\n",
       "      <th>a10</th>\n",
       "      <th>a11a</th>\n",
       "      <th>a11b</th>\n",
       "      <th>a11c</th>\n",
       "      <th>a12</th>\n",
       "      <th>a12a</th>\n",
       "      <th>a12b</th>\n",
       "      <th>a12c</th>\n",
       "      <th>...</th>\n",
       "      <th>n28</th>\n",
       "      <th>n29</th>\n",
       "      <th>n3</th>\n",
       "      <th>n30</th>\n",
       "      <th>n4</th>\n",
       "      <th>n5</th>\n",
       "      <th>n6</th>\n",
       "      <th>n7</th>\n",
       "      <th>n8</th>\n",
       "      <th>n9</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Abaza (Abazins)</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Abkhaz</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Aceh</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Ache</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Achomavi</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>953</th>\n",
       "      <td>Teleut</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>954</th>\n",
       "      <td>Central Yakuts</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>955</th>\n",
       "      <td>Arabs: Iraq</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>956</th>\n",
       "      <td>Liaoning and Jilin Chinese</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>957</th>\n",
       "      <td>Norvegians, Faroe islanders</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>958 rows × 2565 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                  group_Berezkin a1 a10 a11a a11b a11c a12 a12a a12b a12c   \n",
       "0                Abaza (Abazins)  0   0    0    0    0   1    0    0    1  \\\n",
       "1                         Abkhaz  0   0    0    0    0   1    0    0    0   \n",
       "2                           Aceh  0   0    0    0    0   0    0    0    0   \n",
       "3                           Ache  0   0    0    1    0   1    1    0    0   \n",
       "4                       Achomavi  0   0    0    0    0   1    1    0    0   \n",
       "..                           ... ..  ..  ...  ...  ...  ..  ...  ...  ...   \n",
       "953                       Teleut  0   0    0    0    0   0    0    0    0   \n",
       "954               Central Yakuts  0   0    0    0    0   0    0    0    0   \n",
       "955                  Arabs: Iraq  0   0    0    0    0   0    0    0    0   \n",
       "956   Liaoning and Jilin Chinese  0   0    0    0    0   0    0    0    0   \n",
       "957  Norvegians, Faroe islanders  0   0    0    0    0   0    0    0    0   \n",
       "\n",
       "     ... n28 n29 n3 n30 n4 n5 n6 n7 n8 n9  \n",
       "0    ...   0   0  1   0  0  0  0  0  0  0  \n",
       "1    ...   0   0  0   0  0  0  1  0  0  0  \n",
       "2    ...   0   0  0   0  0  0  0  0  0  0  \n",
       "3    ...   0   0  0   0  0  0  0  0  0  0  \n",
       "4    ...   0   0  0   0  0  0  0  0  0  0  \n",
       "..   ...  ..  .. ..  .. .. .. .. .. .. ..  \n",
       "953  ...   0   0  0   0  0  0  0  0  0  0  \n",
       "954  ...   0   0  0   0  0  1  0  0  0  0  \n",
       "955  ...   0   0  0   0  0  0  0  0  0  0  \n",
       "956  ...   0   0  0   0  0  0  0  0  0  0  \n",
       "957  ...   0   0  0   0  0  0  0  0  0  0  \n",
       "\n",
       "[958 rows x 2565 columns]"
      ]
     },
     "execution_count": 640,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Motifs_Berezkin_groups = pd.read_stata('../../datasets/folklore/MX2021/Original_Files/Motifs_Berezkin_groups.dta')\n",
    "df = Motifs_Berezkin_groups.drop(\n",
    "    ['oid', 'motifs_total', 'nmbr_author', 'nmbr_language', 'nmbr_publisher', 'nmbr_title', 'year_firstpub', 'year_avgpub'], \n",
    "    axis=1\n",
    ")\n",
    "df"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Load Nikolayev's coordinates dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 641,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Longitude</th>\n",
       "      <th>Latitude</th>\n",
       "      <th>Name</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>20.0</td>\n",
       "      <td>-26.0</td>\n",
       "      <td>Bushmen</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>21.0</td>\n",
       "      <td>-32.0</td>\n",
       "      <td>Khoikhoi</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>26.5</td>\n",
       "      <td>-32.5</td>\n",
       "      <td>Xhosa</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>30.5</td>\n",
       "      <td>-28.5</td>\n",
       "      <td>Zulu,Swasi</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>26.5</td>\n",
       "      <td>-27.5</td>\n",
       "      <td>Sotho, Tswana</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>921</th>\n",
       "      <td>-72.5</td>\n",
       "      <td>-39.0</td>\n",
       "      <td>Mapuche</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>922</th>\n",
       "      <td>-67.5</td>\n",
       "      <td>-42.0</td>\n",
       "      <td>Puelche</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>923</th>\n",
       "      <td>-69.0</td>\n",
       "      <td>-47.0</td>\n",
       "      <td>Tehuelche</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>924</th>\n",
       "      <td>-68.5</td>\n",
       "      <td>-54.5</td>\n",
       "      <td>Selknam</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>925</th>\n",
       "      <td>-71.0</td>\n",
       "      <td>-55.0</td>\n",
       "      <td>Yamana</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>926 rows × 3 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "     Longitude  Latitude           Name\n",
       "0         20.0     -26.0        Bushmen\n",
       "1         21.0     -32.0       Khoikhoi\n",
       "2         26.5     -32.5          Xhosa\n",
       "3         30.5     -28.5     Zulu,Swasi\n",
       "4         26.5     -27.5  Sotho, Tswana\n",
       "..         ...       ...            ...\n",
       "921      -72.5     -39.0        Mapuche\n",
       "922      -67.5     -42.0        Puelche\n",
       "923      -69.0     -47.0      Tehuelche\n",
       "924      -68.5     -54.5        Selknam\n",
       "925      -71.0     -55.0         Yamana\n",
       "\n",
       "[926 rows x 3 columns]"
      ]
     },
     "execution_count": 641,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "coords = pd.read_json('../../datasets/folklore/coords.json')\n",
    "coords"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Match MX to coords and aggregate\n",
    "\n",
    "Certain groups in MX are more disaggregated than in `coords`. For example, \"Yakut\" in `coords` is disaggregated as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 642,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>group_Berezkin</th>\n",
       "      <th>a1</th>\n",
       "      <th>a10</th>\n",
       "      <th>a11a</th>\n",
       "      <th>a11b</th>\n",
       "      <th>a11c</th>\n",
       "      <th>a12</th>\n",
       "      <th>a12a</th>\n",
       "      <th>a12b</th>\n",
       "      <th>a12c</th>\n",
       "      <th>...</th>\n",
       "      <th>n28</th>\n",
       "      <th>n29</th>\n",
       "      <th>n3</th>\n",
       "      <th>n30</th>\n",
       "      <th>n4</th>\n",
       "      <th>n5</th>\n",
       "      <th>n6</th>\n",
       "      <th>n7</th>\n",
       "      <th>n8</th>\n",
       "      <th>n9</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>941</th>\n",
       "      <td>NW Yakuts (Yessey,Anabar,Olenyok, Lower Lena)</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>945</th>\n",
       "      <td>NE Yakuts (Yana,Indigirka,Kolyma)</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>950</th>\n",
       "      <td>Western Yakuts (Olyokma,Vilyuy)</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>954</th>\n",
       "      <td>Central Yakuts</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>4 rows × 2565 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                                    group_Berezkin a1 a10 a11a a11b a11c a12   \n",
       "941  NW Yakuts (Yessey,Anabar,Olenyok, Lower Lena)  0   0    0    0    0   0  \\\n",
       "945              NE Yakuts (Yana,Indigirka,Kolyma)  0   0    0    0    0   1   \n",
       "950                Western Yakuts (Olyokma,Vilyuy)  0   0    0    0    0   0   \n",
       "954                                 Central Yakuts  0   0    0    0    0   0   \n",
       "\n",
       "    a12a a12b a12c  ... n28 n29 n3 n30 n4 n5 n6 n7 n8 n9  \n",
       "941    0    0    0  ...   0   1  0   0  0  0  0  0  0  0  \n",
       "945    1    0    0  ...   0   0  0   0  0  1  0  0  0  0  \n",
       "950    0    0    0  ...   0   0  0   0  0  1  0  0  0  0  \n",
       "954    0    0    0  ...   0   0  0   0  0  1  0  0  0  0  \n",
       "\n",
       "[4 rows x 2565 columns]"
      ]
     },
     "execution_count": 642,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[df['group_Berezkin'].str.contains('Yakut')]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For these cases, we change the MX group names to match the `coords` names then aggregate. I have prepared the dictionary `match_names_to_coords` for this task."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 643,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>mx_name</th>\n",
       "      <th>coords_name</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Wolof</td>\n",
       "      <td>Fulbe,Wolof,Serer</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Serer</td>\n",
       "      <td>Fulbe,Wolof,Serer</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Japan AD 700-1700</td>\n",
       "      <td>Japan</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Japanese folklore</td>\n",
       "      <td>Japan</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>NW Yakuts (Yessey,Anabar,Olenyok, Lower Lena)</td>\n",
       "      <td>Yakut</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Central Yakuts</td>\n",
       "      <td>Yakut</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>NE Yakuts (Yana,Indigirka,Kolyma)</td>\n",
       "      <td>Yakut</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>Western Yakuts (Olyokma,Vilyuy)</td>\n",
       "      <td>Yakut</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>Trans NG East Highlands</td>\n",
       "      <td>Trans New Guinea East</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>Trans NG East Lowlands North</td>\n",
       "      <td>Trans New Guinea East</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>Trans NG East Lowlands South</td>\n",
       "      <td>Trans New Guinea East</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>Northern Khanty</td>\n",
       "      <td>Hanty</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>Southern Khanty</td>\n",
       "      <td>Hanty</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>Eastern Khanty(Ostyaks)</td>\n",
       "      <td>Hanty</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>Forest Yukaghir (Upper Kolyma)</td>\n",
       "      <td>Yukaghir</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>Tundra Yukaghir (Lower Kolyma)</td>\n",
       "      <td>Yukaghir</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>Lahu,Sani,Nasu,Jino</td>\n",
       "      <td>Lahu,Sani,Hani,Nasu,Jino</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>Hani, Akha</td>\n",
       "      <td>Lahu,Sani,Hani,Nasu,Jino</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                          mx_name               coords_name\n",
       "0                                           Wolof         Fulbe,Wolof,Serer\n",
       "1                                           Serer         Fulbe,Wolof,Serer\n",
       "2                               Japan AD 700-1700                     Japan\n",
       "3                               Japanese folklore                     Japan\n",
       "4   NW Yakuts (Yessey,Anabar,Olenyok, Lower Lena)                     Yakut\n",
       "5                                  Central Yakuts                     Yakut\n",
       "6               NE Yakuts (Yana,Indigirka,Kolyma)                     Yakut\n",
       "7                 Western Yakuts (Olyokma,Vilyuy)                     Yakut\n",
       "8                         Trans NG East Highlands     Trans New Guinea East\n",
       "9                    Trans NG East Lowlands North     Trans New Guinea East\n",
       "10                   Trans NG East Lowlands South     Trans New Guinea East\n",
       "11                                Northern Khanty                     Hanty\n",
       "12                                Southern Khanty                     Hanty\n",
       "13                        Eastern Khanty(Ostyaks)                     Hanty\n",
       "14                 Forest Yukaghir (Upper Kolyma)                  Yukaghir\n",
       "15                 Tundra Yukaghir (Lower Kolyma)                  Yukaghir\n",
       "16                            Lahu,Sani,Nasu,Jino  Lahu,Sani,Hani,Nasu,Jino\n",
       "17                                     Hani, Akha  Lahu,Sani,Hani,Nasu,Jino"
      ]
     },
     "execution_count": 643,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "match_names_to_coords = pd.read_csv('match_names_to_coords.csv')\n",
    "match_names_to_coords"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 644,
   "metadata": {},
   "outputs": [],
   "source": [
    "dict_to_coords = match_names_to_coords.set_index('mx_name').to_dict()['coords_name']\n",
    "df_to_coords = df.copy()\n",
    "df_to_coords['group_Berezkin'] = df_to_coords['group_Berezkin'].replace(dict_to_coords)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 645,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>group_Berezkin</th>\n",
       "      <th>a1</th>\n",
       "      <th>a10</th>\n",
       "      <th>a11a</th>\n",
       "      <th>a11b</th>\n",
       "      <th>a11c</th>\n",
       "      <th>a12</th>\n",
       "      <th>a12a</th>\n",
       "      <th>a12b</th>\n",
       "      <th>a12c</th>\n",
       "      <th>...</th>\n",
       "      <th>n28</th>\n",
       "      <th>n29</th>\n",
       "      <th>n3</th>\n",
       "      <th>n30</th>\n",
       "      <th>n4</th>\n",
       "      <th>n5</th>\n",
       "      <th>n6</th>\n",
       "      <th>n7</th>\n",
       "      <th>n8</th>\n",
       "      <th>n9</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>941</th>\n",
       "      <td>Yakut</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>945</th>\n",
       "      <td>Yakut</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>950</th>\n",
       "      <td>Yakut</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>954</th>\n",
       "      <td>Yakut</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>4 rows × 2565 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "    group_Berezkin a1 a10 a11a a11b a11c a12 a12a a12b a12c  ... n28 n29 n3   \n",
       "941          Yakut  0   0    0    0    0   0    0    0    0  ...   0   1  0  \\\n",
       "945          Yakut  0   0    0    0    0   1    1    0    0  ...   0   0  0   \n",
       "950          Yakut  0   0    0    0    0   0    0    0    0  ...   0   0  0   \n",
       "954          Yakut  0   0    0    0    0   0    0    0    0  ...   0   0  0   \n",
       "\n",
       "    n30 n4 n5 n6 n7 n8 n9  \n",
       "941   0  0  0  0  0  0  0  \n",
       "945   0  0  1  0  0  0  0  \n",
       "950   0  0  1  0  0  0  0  \n",
       "954   0  0  1  0  0  0  0  \n",
       "\n",
       "[4 rows x 2565 columns]"
      ]
     },
     "execution_count": 645,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_to_coords[df_to_coords['group_Berezkin'].str.contains('Yakut')]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 646,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "group_Berezkin      object\n",
       "a1                category\n",
       "a10               category\n",
       "a11a              category\n",
       "a11b              category\n",
       "                    ...   \n",
       "n5                category\n",
       "n6                category\n",
       "n7                category\n",
       "n8                category\n",
       "n9                category\n",
       "Length: 2565, dtype: object"
      ]
     },
     "execution_count": 646,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_to_coords.dtypes"
   ]
  },
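  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Some groups, such as Yakut above, appear in several rows. To aggregate these duplicates into one row per group, we first need to convert the categorical motif columns into numerical columns."
   ]
  },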
  {
   "cell_type": "code",
   "execution_count": 647,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Convert every motif column (all columns except the group name) from category to a numeric dtype.\n",
    "cols = [i for i in df_to_coords.columns if i not in [\"group_Berezkin\"]]\n",
    "for col in cols:\n",
    "    df_to_coords[col] = pd.to_numeric(df_to_coords[col])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 648,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "group_Berezkin    object\n",
       "a1                 int64\n",
       "a10                int64\n",
       "a11a               int64\n",
       "a11b               int64\n",
       "                   ...  \n",
       "n5                 int64\n",
       "n6                 int64\n",
       "n7                 int64\n",
       "n8                 int64\n",
       "n9                 int64\n",
       "Length: 2565, dtype: object"
      ]
     },
     "execution_count": 648,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_to_coords.dtypes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 649,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(947, 2565)"
      ]
     },
     "execution_count": 649,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Collapse duplicate rows per group; max() marks a motif as present if any duplicate row records it.\n",
    "df_to_coords = df_to_coords.groupby('group_Berezkin').max().reset_index()\n",
    "df_to_coords.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 650,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>group_Berezkin</th>\n",
       "      <th>a1</th>\n",
       "      <th>a10</th>\n",
       "      <th>a11a</th>\n",
       "      <th>a11b</th>\n",
       "      <th>a11c</th>\n",
       "      <th>a12</th>\n",
       "      <th>a12a</th>\n",
       "      <th>a12b</th>\n",
       "      <th>a12c</th>\n",
       "      <th>...</th>\n",
       "      <th>n28</th>\n",
       "      <th>n29</th>\n",
       "      <th>n3</th>\n",
       "      <th>n30</th>\n",
       "      <th>n4</th>\n",
       "      <th>n5</th>\n",
       "      <th>n6</th>\n",
       "      <th>n7</th>\n",
       "      <th>n8</th>\n",
       "      <th>n9</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>912</th>\n",
       "      <td>Yakut</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>1 rows × 2565 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "    group_Berezkin  a1  a10  a11a  a11b  a11c  a12  a12a  a12b  a12c  ...   \n",
       "912          Yakut   0    0     0     0     0    1     1     0     0  ...  \\\n",
       "\n",
       "     n28  n29  n3  n30  n4  n5  n6  n7  n8  n9  \n",
       "912    0    1   0    0   0   1   0   0   0   0  \n",
       "\n",
       "[1 rows x 2565 columns]"
      ]
     },
     "execution_count": 650,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_to_coords[df_to_coords['group_Berezkin'].str.contains('Yakut')]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Match coords to MX\n",
    "\n",
    "Next, where a group has a one-to-one match between MX and `coords` but under a different name, we change the name in `coords` to follow MX. I have prepared the `match_names_to_mx` mapping (loaded from `match_names_to_mx.csv`) for this."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 651,
   "metadata": {},
   "outputs": [],
   "source": [
    "match_names_to_mx = pd.read_csv('match_names_to_mx.csv')\n",
    "match_names_to_mx = match_names_to_mx.set_index('coords_name').to_dict()['mx_name']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 652,
   "metadata": {},
   "outputs": [],
   "source": [
    "coords['Name'] = coords['Name'].replace(match_names_to_mx)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 653,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(950, 4)"
      ]
     },
     "execution_count": 653,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "groups = df_to_coords[['group_Berezkin']].copy()\n",
    "df_coords = pd.merge(groups, coords, left_on=['group_Berezkin'], right_on=['Name'], how='outer')\n",
    "df_coords.shape"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following rows have no clear match and are discarded: 24 MX groups without coordinates and 3 `coords` entries without an MX counterpart."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 654,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>group_Berezkin</th>\n",
       "      <th>Longitude</th>\n",
       "      <th>Latitude</th>\n",
       "      <th>Name</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>24</th>\n",
       "      <td>Almora (Rangkas)</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>40</th>\n",
       "      <td>Arabs (literary tradition)</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>253</th>\n",
       "      <td>Fujian Chinese</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>254</th>\n",
       "      <td>Fula (Pular)</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>258</th>\n",
       "      <td>Galicians</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>274</th>\n",
       "      <td>Gulf: Kuwait,Bahrain,Qatar,Oman</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>286</th>\n",
       "      <td>Henan Chinese</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>291</th>\n",
       "      <td>Himachali Pahari</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>304</th>\n",
       "      <td>Iban,Bidayu,Sakarram</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>306</th>\n",
       "      <td>Icelanders (after A.D. 1800)</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>326</th>\n",
       "      <td>Jiangsu and Zhejang Chinese</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>381</th>\n",
       "      <td>Khotan Saka</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>439</th>\n",
       "      <td>Liaoning and Jilin Chinese</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>475</th>\n",
       "      <td>Maldives</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>674</th>\n",
       "      <td>Salars</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>690</th>\n",
       "      <td>Scandinavians, early written sources</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>714</th>\n",
       "      <td>Sichuan Chinese</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>781</th>\n",
       "      <td>Teleut</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>829</th>\n",
       "      <td>Tujia</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>830</th>\n",
       "      <td>Tulu</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>864</th>\n",
       "      <td>Urums, Rumei</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>880</th>\n",
       "      <td>Wallons</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>907</th>\n",
       "      <td>Xinca</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>927</th>\n",
       "      <td>Yeyi</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>947</th>\n",
       "      <td>NaN</td>\n",
       "      <td>-20.0</td>\n",
       "      <td>65.0</td>\n",
       "      <td>Edda,Saxo Grammaticus</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>948</th>\n",
       "      <td>NaN</td>\n",
       "      <td>40.0</td>\n",
       "      <td>32.0</td>\n",
       "      <td>1001 nights</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>949</th>\n",
       "      <td>NaN</td>\n",
       "      <td>111.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>Other Dayak</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                           group_Berezkin  Longitude  Latitude   \n",
       "24                       Almora (Rangkas)        NaN       NaN  \\\n",
       "40             Arabs (literary tradition)        NaN       NaN   \n",
       "253                        Fujian Chinese        NaN       NaN   \n",
       "254                          Fula (Pular)        NaN       NaN   \n",
       "258                             Galicians        NaN       NaN   \n",
       "274       Gulf: Kuwait,Bahrain,Qatar,Oman        NaN       NaN   \n",
       "286                         Henan Chinese        NaN       NaN   \n",
       "291                      Himachali Pahari        NaN       NaN   \n",
       "304                  Iban,Bidayu,Sakarram        NaN       NaN   \n",
       "306          Icelanders (after A.D. 1800)        NaN       NaN   \n",
       "326           Jiangsu and Zhejang Chinese        NaN       NaN   \n",
       "381                           Khotan Saka        NaN       NaN   \n",
       "439            Liaoning and Jilin Chinese        NaN       NaN   \n",
       "475                              Maldives        NaN       NaN   \n",
       "674                                Salars        NaN       NaN   \n",
       "690  Scandinavians, early written sources        NaN       NaN   \n",
       "714                       Sichuan Chinese        NaN       NaN   \n",
       "781                                Teleut        NaN       NaN   \n",
       "829                                 Tujia        NaN       NaN   \n",
       "830                                  Tulu        NaN       NaN   \n",
       "864                          Urums, Rumei        NaN       NaN   \n",
       "880                               Wallons        NaN       NaN   \n",
       "907                                 Xinca        NaN       NaN   \n",
       "927                                  Yeyi        NaN       NaN   \n",
       "947                                   NaN      -20.0      65.0   \n",
       "948                                   NaN       40.0      32.0   \n",
       "949                                   NaN      111.0       1.0   \n",
       "\n",
       "                      Name  \n",
       "24                     NaN  \n",
       "40                     NaN  \n",
       "253                    NaN  \n",
       "254                    NaN  \n",
       "258                    NaN  \n",
       "274                    NaN  \n",
       "286                    NaN  \n",
       "291                    NaN  \n",
       "304                    NaN  \n",
       "306                    NaN  \n",
       "326                    NaN  \n",
       "381                    NaN  \n",
       "439                    NaN  \n",
       "475                    NaN  \n",
       "674                    NaN  \n",
       "690                    NaN  \n",
       "714                    NaN  \n",
       "781                    NaN  \n",
       "829                    NaN  \n",
       "830                    NaN  \n",
       "864                    NaN  \n",
       "880                    NaN  \n",
       "907                    NaN  \n",
       "927                    NaN  \n",
       "947  Edda,Saxo Grammaticus  \n",
       "948            1001 nights  \n",
       "949            Other Dayak  "
      ]
     },
     "execution_count": 654,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_coords[df_coords['group_Berezkin'].isna() | df_coords['Name'].isna()]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 655,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(923, 3)"
      ]
     },
     "execution_count": 655,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_coords = df_coords.dropna()\n",
    "df_coords = df_coords.drop(['Name'], axis=1)\n",
    "df_coords = df_coords.rename(\n",
    "    columns={\n",
    "        'group_Berezkin': 'group', \n",
    "        'Longitude': 'longitude',\n",
    "        'Latitude': 'latitude'\n",
    "    })\n",
    "df_coords.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 656,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>group</th>\n",
       "      <th>longitude</th>\n",
       "      <th>latitude</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Abaza (Abazins)</td>\n",
       "      <td>42.0</td>\n",
       "      <td>44.2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Abenaki,Penobscot</td>\n",
       "      <td>-70.5</td>\n",
       "      <td>44.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Abkhaz</td>\n",
       "      <td>40.8</td>\n",
       "      <td>43.2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Abor,Gallong,Tani</td>\n",
       "      <td>95.0</td>\n",
       "      <td>28.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Aceh</td>\n",
       "      <td>95.6</td>\n",
       "      <td>5.3</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "               group  longitude  latitude\n",
       "0    Abaza (Abazins)       42.0      44.2\n",
       "1  Abenaki,Penobscot      -70.5      44.5\n",
       "2             Abkhaz       40.8      43.2\n",
       "3  Abor,Gallong,Tani       95.0      28.5\n",
       "4               Aceh       95.6       5.3"
      ]
     },
     "execution_count": 656,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_coords.head()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Merge and export\n",
    "\n",
    "The final list contains 923 groups."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 657,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(947, 2565)"
      ]
     },
     "execution_count": 657,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_to_coords.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 658,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(923, 2568)"
      ]
     },
     "execution_count": 658,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Outer merge, then drop the MX groups that have no coordinates (their 'group' value is NaN).\n",
    "df_groups_motifs = pd.merge(df_to_coords, df_coords, left_on=['group_Berezkin'], right_on=['group'], how='outer').reset_index(drop=True)\n",
    "df_groups_motifs.dropna(subset=['group'], inplace=True)\n",
    "df_groups_motifs.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 659,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_groups_motifs.drop(['group_Berezkin', 'longitude', 'latitude'], axis=1, inplace=True)\n",
    "group_col = df_groups_motifs.pop('group')\n",
    "df_groups_motifs.insert(0, 'group', group_col)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 661,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_groups_motifs = pd.melt(df_groups_motifs, id_vars=['group'], var_name='motif_id', value_name='present')\n",
    "df_groups_motifs = df_groups_motifs.fillna(0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 662,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_groups_motifs = df_groups_motifs[df_groups_motifs['present'] == 1]\n",
    "df_groups_motifs = df_groups_motifs.drop(['present'], axis=1)\n",
    "df_groups_motifs.to_csv('groups_motifs.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 663,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_coords.to_json('coords_clean.json', orient='records')"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Clean and export motifs list\n",
    "\n",
    "The motifs list has 2564 entries, which equals the number of motif columns in the MX data (2565 columns minus `group_Berezkin`)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 666,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>motif_id</th>\n",
       "      <th>title_english</th>\n",
       "      <th>title_russian</th>\n",
       "      <th>title_english_googleAPI</th>\n",
       "      <th>desc_eng</th>\n",
       "      <th>desc_russian</th>\n",
       "      <th>desc_english_googleAPI</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>a1</td>\n",
       "      <td>The old sun</td>\n",
       "      <td>Древнее солнце</td>\n",
       "      <td>Ancient sun</td>\n",
       "      <td>Another sun, usually less benevolent and/or po...</td>\n",
       "      <td>Другое солнце – обычно менее могущественное ил...</td>\n",
       "      <td>Another sun — usually less powerful or less be...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>a10</td>\n",
       "      <td>The sun finds its eyes</td>\n",
       "      <td>Солнце находит себе глаза</td>\n",
       "      <td>The sun finds its eyes</td>\n",
       "      <td>The sun gets his bright eye or eyes from an an...</td>\n",
       "      <td>Солнце получает свои сверкающие глаза (глаз) о...</td>\n",
       "      <td>The sun gets its sparkling eyes (eyes) from th...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>a11a</td>\n",
       "      <td>Eyes of the Sun and the Moon: coolness and night</td>\n",
       "      <td>Глаза светил: прохлада и ночь</td>\n",
       "      <td>Eyes of the luminaries: coolness and night</td>\n",
       "      <td>Visible sun and/or moon are the Sun's and/or t...</td>\n",
       "      <td>Видимое солнце или луна есть их глаза; если бы...</td>\n",
       "      <td>The visible sun or moon is their eyes; if the ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>a11b</td>\n",
       "      <td>One-eyed luminaries</td>\n",
       "      <td>Одноглазые светила</td>\n",
       "      <td>One-eyed luminaries</td>\n",
       "      <td>The Sun or the Moon have only one eye (the Mun...</td>\n",
       "      <td>Солнце или Месяц одноглаз (мундуруку: слеп)</td>\n",
       "      <td>Sun or Month odnoglaz (Munduruku: blind)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>a11c</td>\n",
       "      <td>The Sun, the Moon and monster’s eyes</td>\n",
       "      <td>Солнце, Луна и глаза чудовища</td>\n",
       "      <td>Sun, moon and monster eyes</td>\n",
       "      <td>The Sun and the Moon kill a monster whose eyes...</td>\n",
       "      <td>Солнце и Луна убивают чудовище, чьи глаза свет...</td>\n",
       "      <td>The sun and the moon kill a monster whose eyes...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2559</th>\n",
       "      <td>n5</td>\n",
       "      <td>They recognize winter by rime, summer by rain</td>\n",
       "      <td>Зиму узнают по инею, лето по дождю</td>\n",
       "      <td>Winter learn by hoarfrost, summer by rain</td>\n",
       "      <td>Long trips, campaigns, flights or battles are ...</td>\n",
       "      <td>Длительные поездки, походы, полеты или битвы о...</td>\n",
       "      <td>Long trips, trips, flights or battles are desc...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2560</th>\n",
       "      <td>n6</td>\n",
       "      <td>Horse tells to whip him strongly</td>\n",
       "      <td>Хлестнуть коня</td>\n",
       "      <td>Whip a horse</td>\n",
       "      <td>A horse tells his rider to whip him with such ...</td>\n",
       "      <td>Конь велит всаднику хлестнуть его так сильно, ...</td>\n",
       "      <td>The horse tells the rider to whip him so hard ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2561</th>\n",
       "      <td>n7</td>\n",
       "      <td>Three apples</td>\n",
       "      <td>Три яблока</td>\n",
       "      <td>Three apples</td>\n",
       "      <td>Closing formula of the folktale: three apples ...</td>\n",
       "      <td>Сказочный текст завершается формулой, сообщающ...</td>\n",
       "      <td>The fabulous text ends with a formula that sta...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2562</th>\n",
       "      <td>n8</td>\n",
       "      <td>Storyteller instead of a cannonball</td>\n",
       "      <td>Сказочник вместо ядра</td>\n",
       "      <td>The storyteller instead of the core</td>\n",
       "      <td>Closing formula of the folktale: characters pu...</td>\n",
       "      <td>Сказочный текст завершается формулой, сообщающ...</td>\n",
       "      <td>The fabulous text ends with a formula that sta...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2563</th>\n",
       "      <td>n9</td>\n",
       "      <td>Who is coming?</td>\n",
       "      <td>Кто приближается?</td>\n",
       "      <td>Who is coming?</td>\n",
       "      <td>Two persons see a horseman who is ever nearer ...</td>\n",
       "      <td>Двое персонажей обсуждают приближение всадника...</td>\n",
       "      <td>Two characters are discussing the approach of ...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>2564 rows × 7 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "     motif_id                                     title_english   \n",
       "0          a1                                       The old sun  \\\n",
       "1         a10                            The sun finds its eyes   \n",
       "2        a11a  Eyes of the Sun and the Moon: coolness and night   \n",
       "3        a11b                               One-eyed luminaries   \n",
       "4        a11c              The Sun, the Moon and monster’s eyes   \n",
       "...       ...                                               ...   \n",
       "2559       n5     They recognize winter by rime, summer by rain   \n",
       "2560       n6                  Horse tells to whip him strongly   \n",
       "2561       n7                                      Three apples   \n",
       "2562       n8               Storyteller instead of a cannonball   \n",
       "2563       n9                                    Who is coming?   \n",
       "\n",
       "                           title_russian   \n",
       "0                         Древнее солнце  \\\n",
       "1              Солнце находит себе глаза   \n",
       "2          Глаза светил: прохлада и ночь   \n",
       "3                     Одноглазые светила   \n",
       "4          Солнце, Луна и глаза чудовища   \n",
       "...                                  ...   \n",
       "2559  Зиму узнают по инею, лето по дождю   \n",
       "2560                      Хлестнуть коня   \n",
       "2561                          Три яблока   \n",
       "2562               Сказочник вместо ядра   \n",
       "2563                   Кто приближается?   \n",
       "\n",
       "                         title_english_googleAPI   \n",
       "0                                    Ancient sun  \\\n",
       "1                         The sun finds its eyes   \n",
       "2     Eyes of the luminaries: coolness and night   \n",
       "3                            One-eyed luminaries   \n",
       "4                     Sun, moon and monster eyes   \n",
       "...                                          ...   \n",
       "2559   Winter learn by hoarfrost, summer by rain   \n",
       "2560                                Whip a horse   \n",
       "2561                                Three apples   \n",
       "2562         The storyteller instead of the core   \n",
       "2563                              Who is coming?   \n",
       "\n",
       "                                               desc_eng   \n",
       "0     Another sun, usually less benevolent and/or po...  \\\n",
       "1     The sun gets his bright eye or eyes from an an...   \n",
       "2     Visible sun and/or moon are the Sun's and/or t...   \n",
       "3     The Sun or the Moon have only one eye (the Mun...   \n",
       "4     The Sun and the Moon kill a monster whose eyes...   \n",
       "...                                                 ...   \n",
       "2559  Long trips, campaigns, flights or battles are ...   \n",
       "2560  A horse tells his rider to whip him with such ...   \n",
       "2561  Closing formula of the folktale: three apples ...   \n",
       "2562  Closing formula of the folktale: characters pu...   \n",
       "2563  Two persons see a horseman who is ever nearer ...   \n",
       "\n",
       "                                           desc_russian   \n",
       "0     Другое солнце – обычно менее могущественное ил...  \\\n",
       "1     Солнце получает свои сверкающие глаза (глаз) о...   \n",
       "2     Видимое солнце или луна есть их глаза; если бы...   \n",
       "3           Солнце или Месяц одноглаз (мундуруку: слеп)   \n",
       "4     Солнце и Луна убивают чудовище, чьи глаза свет...   \n",
       "...                                                 ...   \n",
       "2559  Длительные поездки, походы, полеты или битвы о...   \n",
       "2560  Конь велит всаднику хлестнуть его так сильно, ...   \n",
       "2561  Сказочный текст завершается формулой, сообщающ...   \n",
       "2562  Сказочный текст завершается формулой, сообщающ...   \n",
       "2563  Двое персонажей обсуждают приближение всадника...   \n",
       "\n",
       "                                 desc_english_googleAPI  \n",
       "0     Another sun — usually less powerful or less be...  \n",
       "1     The sun gets its sparkling eyes (eyes) from th...  \n",
       "2     The visible sun or moon is their eyes; if the ...  \n",
       "3              Sun or Month odnoglaz (Munduruku: blind)  \n",
       "4     The sun and the moon kill a monster whose eyes...  \n",
       "...                                                 ...  \n",
       "2559  Long trips, trips, flights or battles are desc...  \n",
       "2560  The horse tells the rider to whip him so hard ...  \n",
       "2561  The fabulous text ends with a formula that sta...  \n",
       "2562  The fabulous text ends with a formula that sta...  \n",
       "2563  Two characters are discussing the approach of ...  \n",
       "\n",
       "[2564 rows x 7 columns]"
      ]
     },
     "execution_count": 666,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Motif_Master = pd.read_stata('../../datasets/folklore/MX2021/Original_Files/Motif_Master.dta')\n",
    "Motif_Master"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 695,
   "metadata": {},
   "outputs": [],
   "source": [
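    "# Keep the motif ID plus the original English title and description;\n",
    "# .copy() makes df_motifs independent of Motif_Master for the edits below.\n",
    "df_motifs = Motif_Master[['motif_id', 'title_english', 'desc_eng']].copy()\n",
    "df_motifs = df_motifs.rename(\n",
    "    columns={\n",
    "        'title_english': 'title',\n",
    "        'desc_eng': 'description',\n",
    "    })"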
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 696,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>motif_id</th>\n",
       "      <th>title</th>\n",
       "      <th>description</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>a1</td>\n",
       "      <td>The old sun</td>\n",
       "      <td>Another sun, usually less benevolent and/or po...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>a10</td>\n",
       "      <td>The sun finds its eyes</td>\n",
       "      <td>The sun gets his bright eye or eyes from an an...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>a11a</td>\n",
       "      <td>Eyes of the Sun and the Moon: coolness and night</td>\n",
       "      <td>Visible sun and/or moon are the Sun's and/or t...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>a11b</td>\n",
       "      <td>One-eyed luminaries</td>\n",
       "      <td>The Sun or the Moon have only one eye (the Mun...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>a11c</td>\n",
       "      <td>The Sun, the Moon and monster’s eyes</td>\n",
       "      <td>The Sun and the Moon kill a monster whose eyes...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2559</th>\n",
       "      <td>n5</td>\n",
       "      <td>They recognize winter by rime, summer by rain</td>\n",
       "      <td>Long trips, campaigns, flights or battles are ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2560</th>\n",
       "      <td>n6</td>\n",
       "      <td>Horse tells to whip him strongly</td>\n",
       "      <td>A horse tells his rider to whip him with such ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2561</th>\n",
       "      <td>n7</td>\n",
       "      <td>Three apples</td>\n",
       "      <td>Closing formula of the folktale: three apples ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2562</th>\n",
       "      <td>n8</td>\n",
       "      <td>Storyteller instead of a cannonball</td>\n",
       "      <td>Closing formula of the folktale: characters pu...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2563</th>\n",
       "      <td>n9</td>\n",
       "      <td>Who is coming?</td>\n",
       "      <td>Two persons see a horseman who is ever nearer ...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>2564 rows × 3 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "     motif_id                                             title   \n",
       "0          a1                                       The old sun  \\\n",
       "1         a10                            The sun finds its eyes   \n",
       "2        a11a  Eyes of the Sun and the Moon: coolness and night   \n",
       "3        a11b                               One-eyed luminaries   \n",
       "4        a11c              The Sun, the Moon and monster’s eyes   \n",
       "...       ...                                               ...   \n",
       "2559       n5     They recognize winter by rime, summer by rain   \n",
       "2560       n6                  Horse tells to whip him strongly   \n",
       "2561       n7                                      Three apples   \n",
       "2562       n8               Storyteller instead of a cannonball   \n",
       "2563       n9                                    Who is coming?   \n",
       "\n",
       "                                            description  \n",
       "0     Another sun, usually less benevolent and/or po...  \n",
       "1     The sun gets his bright eye or eyes from an an...  \n",
       "2     Visible sun and/or moon are the Sun's and/or t...  \n",
       "3     The Sun or the Moon have only one eye (the Mun...  \n",
       "4     The Sun and the Moon kill a monster whose eyes...  \n",
       "...                                                 ...  \n",
       "2559  Long trips, campaigns, flights or battles are ...  \n",
       "2560  A horse tells his rider to whip him with such ...  \n",
       "2561  Closing formula of the folktale: three apples ...  \n",
       "2562  Closing formula of the folktale: characters pu...  \n",
       "2563  Two persons see a horseman who is ever nearer ...  \n",
       "\n",
       "[2564 rows x 3 columns]"
      ]
     },
     "execution_count": 696,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_motifs"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A few descriptions are blank in the MX file by mistake (the six motifs listed below). I fill these in manually from http://www.mythologydatabase.com/bd/."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 697,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>motif_id</th>\n",
       "      <th>title</th>\n",
       "      <th>description</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>99</th>\n",
       "      <td>a8a</td>\n",
       "      <td>The Sun, the Moon and the star: released by th...</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>770</th>\n",
       "      <td>h21a</td>\n",
       "      <td>Not to kill a big fish</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1099</th>\n",
       "      <td>i97</td>\n",
       "      <td>Rainbow horse</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1833</th>\n",
       "      <td>l23b</td>\n",
       "      <td>Transformation into spindle</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1859</th>\n",
       "      <td>l37a</td>\n",
       "      <td>To get know causes of problems</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2011</th>\n",
       "      <td>m105a</td>\n",
       "      <td>Make believe killing of children</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     motif_id                                              title description\n",
       "99        a8a  The Sun, the Moon and the star: released by th...            \n",
       "770      h21a                             Not to kill a big fish            \n",
       "1099      i97                                      Rainbow horse            \n",
       "1833     l23b                        Transformation into spindle            \n",
       "1859     l37a                     To get know causes of problems            \n",
       "2011    m105a                   Make believe killing of children            "
      ]
     },
     "execution_count": 697,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_motifs[df_motifs['description'] == '']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 698,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Key the fixes on motif_id rather than on row position, so they survive any reordering.\n",
    "missing_descriptions = {\n",
    "    'a8a': 'The sun, moon and star (stars) appear as three consecutive and comparable objects/characters in the stories about the abduction and subsequent release of heavenly bodies.',\n",
    "    'h21a': 'The fish is concentrated in a small container, from which the owner takes as much as necessary. Another character opens the receptacle, breaking the rules, and the fish breaks out of it.',\n",
    "    'i97': 'The rainbow is an ungulate animal (horse, bull, goat, sheep).',\n",
    "    'l23b': 'Trying to free himself, the captured character consistently changes his appearance. The last transformation is a small wooden object (usually a spindle that must be broken in half).',\n",
    "    'l37a': 'On the way to a powerful being, a person meets characters who ask him to put questions to that being on their behalf (usually to find out the reason for their misfortunes).',\n",
    "    'm105a': 'The character hides his children but tells another person that he killed them, and the other believes him. See motif M104.',\n",
    "}\n",
    "for motif_id, description in missing_descriptions.items():\n",
    "    df_motifs.loc[df_motifs['motif_id'] == motif_id, 'description'] = description"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 699,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_motifs.to_csv('motifs.csv', index=False, encoding='utf-8')"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Concept-tagging motifs\n",
    "\n",
    "MX use ConceptNet to tag motifs with concepts. This lets them test, for example, whether societies close to high-intensity earthquake regions have more earthquake-related motifs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 670,
   "metadata": {},
   "outputs": [],
   "source": [
    "Concepts_Tagged_Per_Motif = pd.read_stata('../../datasets/folklore/MX2021/Original_Files/Concepts_Tagged_Per_Motif.dta')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 671,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>motif_id</th>\n",
       "      <th>say_related</th>\n",
       "      <th>one_related</th>\n",
       "      <th>go_related</th>\n",
       "      <th>get_related</th>\n",
       "      <th>would_related</th>\n",
       "      <th>know_related</th>\n",
       "      <th>make_related</th>\n",
       "      <th>like_related</th>\n",
       "      <th>think_related</th>\n",
       "      <th>...</th>\n",
       "      <th>mindful_related</th>\n",
       "      <th>optimum_related</th>\n",
       "      <th>repercussion_related</th>\n",
       "      <th>shabby_related</th>\n",
       "      <th>subjectivity_related</th>\n",
       "      <th>aspiring_related</th>\n",
       "      <th>distorted_related</th>\n",
       "      <th>galley_related</th>\n",
       "      <th>overlapping_related</th>\n",
       "      <th>situational_related</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>a1</td>\n",
       "      <td>[]</td>\n",
       "      <td>['one', 'another']</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>...</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>a10</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>['get']</td>\n",
       "      <td>['get', 'find']</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>...</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>a11a</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>['would']</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>...</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>a11b</td>\n",
       "      <td>[]</td>\n",
       "      <td>['one']</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>...</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>a11c</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>['take', 'give']</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>['give']</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>...</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2559</th>\n",
       "      <td>n5</td>\n",
       "      <td>['describe', 'mean', 'know']</td>\n",
       "      <td>[]</td>\n",
       "      <td>['get']</td>\n",
       "      <td>['get']</td>\n",
       "      <td>[]</td>\n",
       "      <td>['know', 'recognize', 'learn']</td>\n",
       "      <td>[]</td>\n",
       "      <td>['like', 'similar']</td>\n",
       "      <td>['know']</td>\n",
       "      <td>...</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2560</th>\n",
       "      <td>n6</td>\n",
       "      <td>['tell']</td>\n",
       "      <td>[]</td>\n",
       "      <td>['come']</td>\n",
       "      <td>['come']</td>\n",
       "      <td>['would']</td>\n",
       "      <td>['tell']</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>...</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2561</th>\n",
       "      <td>n7</td>\n",
       "      <td>['say']</td>\n",
       "      <td>['one', 'three', 'least']</td>\n",
       "      <td>['get']</td>\n",
       "      <td>['get', 'give']</td>\n",
       "      <td>[]</td>\n",
       "      <td>['say']</td>\n",
       "      <td>['give']</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>...</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2562</th>\n",
       "      <td>n8</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>['arrive']</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>['make']</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>...</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2563</th>\n",
       "      <td>n9</td>\n",
       "      <td>[]</td>\n",
       "      <td>['one', 'two']</td>\n",
       "      <td>['come']</td>\n",
       "      <td>['come']</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>...</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>2564 rows × 9887 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "     motif_id                   say_related                one_related   \n",
       "0          a1                            []         ['one', 'another']  \\\n",
       "1         a10                            []                         []   \n",
       "2        a11a                            []                         []   \n",
       "3        a11b                            []                    ['one']   \n",
       "4        a11c                            []                         []   \n",
       "...       ...                           ...                        ...   \n",
       "2559       n5  ['describe', 'mean', 'know']                         []   \n",
       "2560       n6                      ['tell']                         []   \n",
       "2561       n7                       ['say']  ['one', 'three', 'least']   \n",
       "2562       n8                            []                         []   \n",
       "2563       n9                            []             ['one', 'two']   \n",
       "\n",
       "     go_related       get_related would_related   \n",
       "0            []                []            []  \\\n",
       "1       ['get']   ['get', 'find']            []   \n",
       "2            []                []     ['would']   \n",
       "3            []                []            []   \n",
       "4            []  ['take', 'give']            []   \n",
       "...         ...               ...           ...   \n",
       "2559    ['get']           ['get']            []   \n",
       "2560   ['come']          ['come']     ['would']   \n",
       "2561    ['get']   ['get', 'give']            []   \n",
       "2562         []        ['arrive']            []   \n",
       "2563   ['come']          ['come']            []   \n",
       "\n",
       "                        know_related make_related         like_related   \n",
       "0                                 []           []                   []  \\\n",
       "1                                 []           []                   []   \n",
       "2                                 []           []                   []   \n",
       "3                                 []           []                   []   \n",
       "4                                 []     ['give']                   []   \n",
       "...                              ...          ...                  ...   \n",
       "2559  ['know', 'recognize', 'learn']           []  ['like', 'similar']   \n",
       "2560                        ['tell']           []                   []   \n",
       "2561                         ['say']     ['give']                   []   \n",
       "2562                              []     ['make']                   []   \n",
       "2563                              []           []                   []   \n",
       "\n",
       "     think_related  ... mindful_related optimum_related repercussion_related   \n",
       "0               []  ...              []              []                   []  \\\n",
       "1               []  ...              []              []                   []   \n",
       "2               []  ...              []              []                   []   \n",
       "3               []  ...              []              []                   []   \n",
       "4               []  ...              []              []                   []   \n",
       "...            ...  ...             ...             ...                  ...   \n",
       "2559      ['know']  ...              []              []                   []   \n",
       "2560            []  ...              []              []                   []   \n",
       "2561            []  ...              []              []                   []   \n",
       "2562            []  ...              []              []                   []   \n",
       "2563            []  ...              []              []                   []   \n",
       "\n",
       "     shabby_related subjectivity_related aspiring_related distorted_related   \n",
       "0                []                   []               []                []  \\\n",
       "1                []                   []               []                []   \n",
       "2                []                   []               []                []   \n",
       "3                []                   []               []                []   \n",
       "4                []                   []               []                []   \n",
       "...             ...                  ...              ...               ...   \n",
       "2559             []                   []               []                []   \n",
       "2560             []                   []               []                []   \n",
       "2561             []                   []               []                []   \n",
       "2562             []                   []               []                []   \n",
       "2563             []                   []               []                []   \n",
       "\n",
       "     galley_related overlapping_related situational_related  \n",
       "0                []                  []                  []  \n",
       "1                []                  []                  []  \n",
       "2                []                  []                  []  \n",
       "3                []                  []                  []  \n",
       "4                []                  []                  []  \n",
       "...             ...                 ...                 ...  \n",
       "2559             []                  []                  []  \n",
       "2560             []                  []                  []  \n",
       "2561             []                  []                  []  \n",
       "2562             []                  []                  []  \n",
       "2563             []                  []                  []  \n",
       "\n",
       "[2564 rows x 9887 columns]"
      ]
     },
     "execution_count": 671,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Concepts_Tagged_Per_Motif"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 672,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>column</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>motif_id</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>say_related</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>one_related</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>go_related</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>get_related</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9882</th>\n",
       "      <td>aspiring_related</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9883</th>\n",
       "      <td>distorted_related</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9884</th>\n",
       "      <td>galley_related</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9885</th>\n",
       "      <td>overlapping_related</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9886</th>\n",
       "      <td>situational_related</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>9887 rows × 1 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                   column\n",
       "0                motif_id\n",
       "1             say_related\n",
       "2             one_related\n",
       "3              go_related\n",
       "4             get_related\n",
       "...                   ...\n",
       "9882     aspiring_related\n",
       "9883    distorted_related\n",
       "9884       galley_related\n",
       "9885  overlapping_related\n",
       "9886  situational_related\n",
       "\n",
       "[9887 rows x 1 columns]"
      ]
     },
     "execution_count": 672,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "concepts_columns = pd.DataFrame(Concepts_Tagged_Per_Motif.columns, columns=['column'])\n",
    "concepts_columns"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "MX investigate whether certain concepts are more likely to appear in an oral tradition's motifs given the physical environment and mode of subsistence of its linguistic group. The concepts listed for each setting are:\n",
    "\n",
    "- **near earthquake regions**: earthquake\n",
    "- **cold climates**: frozen, cold, ice, frost, freeze\n",
    "- **farming societies**: cereal, grain, cob, corn, maize, crop, wheat, flour, rice\n",
    "- **pastoral societies**: cattle, agriculture, graze, herder, farm, herdsman, livestock, pasture\n",
    "- **fishing societies**: fish\n",
    "- **hunting societies**: hunt, chase, deer, scavenger, hunter, pursuit, search, quest"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 752,
   "metadata": {},
   "outputs": [],
   "source": [
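    "# Word lists adapted from the MX lists above; each entry needs a matching '<word>_related' column in Concepts_Tagged_Per_Motif\n",
    "words_earthquakes = ['earthquake', 'quake']\n",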
    "words_coldness = ['frozen', 'cold', 'ice', 'frost', 'freeze', 'freezer', 'iceberg']\n",
    "words_farming = ['cereal', 'grain', 'corn', 'crop', 'wheat', 'flour', 'rice']\n",
    "words_pastoral = ['cattle', 'agriculture', 'graze', 'herd', 'farm', 'farming', 'farmhouse', 'farmland', 'shepherd', 'herding', 'livestock', 'pasture']\n",
    "words_fishing = ['fish', 'fishing', 'fisherman', 'fishery']\n",
    "words_hunting = ['hunt', 'hunting', 'chase', 'deer', 'hunter', 'pursuit', 'search', 'quest']"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
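    "Double-check that these words appear in the concepts list, and adjust the word lists above if they do not. The check below is a substring match, so it also returns unrelated columns such as `question_related` and `conquest_related`; only the exact `<word>_related` columns are used later."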
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 751,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>column</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>96</th>\n",
       "      <td>question_related</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>230</th>\n",
       "      <td>research_related</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>644</th>\n",
       "      <td>search_related</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1024</th>\n",
       "      <td>researcher_related</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1331</th>\n",
       "      <td>purchase_related</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1346</th>\n",
       "      <td>request_related</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2218</th>\n",
       "      <td>hunt_related</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2460</th>\n",
       "      <td>hunting_related</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2526</th>\n",
       "      <td>deer_related</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2596</th>\n",
       "      <td>chase_related</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2655</th>\n",
       "      <td>hunter_related</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2991</th>\n",
       "      <td>questionnaire_related</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3569</th>\n",
       "      <td>pursuit_related</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3676</th>\n",
       "      <td>quest_related</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5679</th>\n",
       "      <td>questionable_related</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5856</th>\n",
       "      <td>questioning_related</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6876</th>\n",
       "      <td>conquest_related</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                     column\n",
       "96         question_related\n",
       "230        research_related\n",
       "644          search_related\n",
       "1024     researcher_related\n",
       "1331       purchase_related\n",
       "1346        request_related\n",
       "2218           hunt_related\n",
       "2460        hunting_related\n",
       "2526           deer_related\n",
       "2596          chase_related\n",
       "2655         hunter_related\n",
       "2991  questionnaire_related\n",
       "3569        pursuit_related\n",
       "3676          quest_related\n",
       "5679   questionable_related\n",
       "5856    questioning_related\n",
       "6876       conquest_related"
      ]
     },
     "execution_count": 751,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "concepts_columns[concepts_columns['column'].str.contains('|'.join(words_hunting))]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 746,
   "metadata": {},
   "outputs": [],
   "source": [
    "def prep_concepts(words, concept):\n",
    "    \"\"\"Compute, per group, the share of motifs tagged with any concept in `words`.\"\"\"\n",
    "\n",
    "    def encode(x):\n",
    "        # '[]' means no tagged words; a non-empty bracketed list means the concept is present\n",
    "        if x == '[]':\n",
    "            return 0\n",
    "        elif re.search(r'\\[.+\\]', x):\n",
    "            return 1\n",
    "    \n",
    "    columns = [w + '_related' for w in words]\n",
    "    columns.insert(0, 'motif_id')\n",
    "\n",
    "    # Copy only the needed columns so the assignment below does not touch the source frame\n",
    "    motifs_concept = Concepts_Tagged_Per_Motif[columns].copy()\n",
    "\n",
    "    # Convert to 1s and 0s\n",
    "    motifs_concept[columns[1:]] = motifs_concept[columns[1:]].applymap(encode)\n",
    "\n",
    "    # Flag motifs where at least one of the words is present, and keep only those\n",
    "    motifs_concept[concept] = motifs_concept[columns[1:]].apply(lambda row: row.max(), axis=1)\n",
    "    motifs_concept = motifs_concept[motifs_concept[concept] == 1]\n",
    "    motifs_concept = motifs_concept[['motif_id', concept]]\n",
    "    \n",
    "    # Attach column with concept presence to groups_motifs list;\n",
    "    # motifs without the concept get 0 after the left merge\n",
    "    groups_concept = df_groups_motifs.copy()\n",
    "    groups_concept = pd.merge(groups_concept, motifs_concept, on='motif_id', how='left')\n",
    "    groups_concept = groups_concept.fillna(0)\n",
    "\n",
    "    # Compute share of motifs with the concept\n",
    "    groups_concept_sum = groups_concept.groupby('group')[concept].mean().reset_index()\n",
    "\n",
    "    return groups_concept_sum\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 747,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>group</th>\n",
       "      <th>earthquakes</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Abaza (Abazins)</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Abenaki,Penobscot</td>\n",
       "      <td>0.031250</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Abkhaz</td>\n",
       "      <td>0.003279</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Abor,Gallong,Tani</td>\n",
       "      <td>0.012422</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Aceh</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>918</th>\n",
       "      <td>Zaparo</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>919</th>\n",
       "      <td>Zapotec,Chatino</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>920</th>\n",
       "      <td>Zoque</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>921</th>\n",
       "      <td>Zulu,Swasi</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>922</th>\n",
       "      <td>Zuni</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>923 rows × 2 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                 group  earthquakes\n",
       "0      Abaza (Abazins)     0.000000\n",
       "1    Abenaki,Penobscot     0.031250\n",
       "2               Abkhaz     0.003279\n",
       "3    Abor,Gallong,Tani     0.012422\n",
       "4                 Aceh     0.000000\n",
       "..                 ...          ...\n",
       "918             Zaparo     0.000000\n",
       "919    Zapotec,Chatino     0.000000\n",
       "920              Zoque     0.000000\n",
       "921         Zulu,Swasi     0.000000\n",
       "922               Zuni     0.000000\n",
       "\n",
       "[923 rows x 2 columns]"
      ]
     },
     "execution_count": 747,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "groups_earthquakes = prep_concepts(words_earthquakes, 'earthquakes')\n",
    "groups_earthquakes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 753,
   "metadata": {},
   "outputs": [],
   "source": [
    "groups_concepts = df_coords.copy()\n",
    "\n",
    "groups_concepts = pd.merge(groups_concepts, prep_concepts(words_earthquakes, 'earthquakes'), on='group', how='left')\n",
    "groups_concepts = pd.merge(groups_concepts, prep_concepts(words_coldness, 'coldness'), on='group', how='left')\n",
    "groups_concepts = pd.merge(groups_concepts, prep_concepts(words_farming, 'farming'), on='group', how='left')\n",
    "groups_concepts = pd.merge(groups_concepts, prep_concepts(words_pastoral, 'pastoral'), on='group', how='left')\n",
    "groups_concepts = pd.merge(groups_concepts, prep_concepts(words_fishing, 'fishing'), on='group', how='left')\n",
    "groups_concepts = pd.merge(groups_concepts, prep_concepts(words_hunting, 'hunting'), on='group', how='left')\n",
    "\n",
    "groups_concepts = groups_concepts.melt(\n",
    "    id_vars=['group', 'longitude', 'latitude'],\n",
    "    value_vars=['earthquakes', 'coldness', 'farming', 'pastoral', 'fishing', 'hunting'],\n",
    "    var_name='concept', \n",
    "    value_name='share'\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 754,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>group</th>\n",
       "      <th>longitude</th>\n",
       "      <th>latitude</th>\n",
       "      <th>concept</th>\n",
       "      <th>share</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Abaza (Abazins)</td>\n",
       "      <td>42.0</td>\n",
       "      <td>44.2</td>\n",
       "      <td>earthquakes</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Abenaki,Penobscot</td>\n",
       "      <td>-70.5</td>\n",
       "      <td>44.5</td>\n",
       "      <td>earthquakes</td>\n",
       "      <td>0.031250</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Abkhaz</td>\n",
       "      <td>40.8</td>\n",
       "      <td>43.2</td>\n",
       "      <td>earthquakes</td>\n",
       "      <td>0.003279</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Abor,Gallong,Tani</td>\n",
       "      <td>95.0</td>\n",
       "      <td>28.5</td>\n",
       "      <td>earthquakes</td>\n",
       "      <td>0.012422</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Aceh</td>\n",
       "      <td>95.6</td>\n",
       "      <td>5.3</td>\n",
       "      <td>earthquakes</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5533</th>\n",
       "      <td>Zaparo</td>\n",
       "      <td>-75.0</td>\n",
       "      <td>-2.5</td>\n",
       "      <td>hunting</td>\n",
       "      <td>0.285714</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5534</th>\n",
       "      <td>Zapotec,Chatino</td>\n",
       "      <td>-96.5</td>\n",
       "      <td>16.5</td>\n",
       "      <td>hunting</td>\n",
       "      <td>0.166667</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5535</th>\n",
       "      <td>Zoque</td>\n",
       "      <td>-92.5</td>\n",
       "      <td>16.5</td>\n",
       "      <td>hunting</td>\n",
       "      <td>0.109091</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5536</th>\n",
       "      <td>Zulu,Swasi</td>\n",
       "      <td>30.5</td>\n",
       "      <td>-28.5</td>\n",
       "      <td>hunting</td>\n",
       "      <td>0.092593</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5537</th>\n",
       "      <td>Zuni</td>\n",
       "      <td>-109.0</td>\n",
       "      <td>35.0</td>\n",
       "      <td>hunting</td>\n",
       "      <td>0.065421</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5538 rows × 5 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                  group  longitude  latitude      concept     share\n",
       "0       Abaza (Abazins)       42.0      44.2  earthquakes  0.000000\n",
       "1     Abenaki,Penobscot      -70.5      44.5  earthquakes  0.031250\n",
       "2                Abkhaz       40.8      43.2  earthquakes  0.003279\n",
       "3     Abor,Gallong,Tani       95.0      28.5  earthquakes  0.012422\n",
       "4                  Aceh       95.6       5.3  earthquakes  0.000000\n",
       "...                 ...        ...       ...          ...       ...\n",
       "5533             Zaparo      -75.0      -2.5      hunting  0.285714\n",
       "5534    Zapotec,Chatino      -96.5      16.5      hunting  0.166667\n",
       "5535              Zoque      -92.5      16.5      hunting  0.109091\n",
       "5536         Zulu,Swasi       30.5     -28.5      hunting  0.092593\n",
       "5537               Zuni     -109.0      35.0      hunting  0.065421\n",
       "\n",
       "[5538 rows x 5 columns]"
      ]
     },
     "execution_count": 754,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "groups_concepts"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 755,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.5"
      ]
     },
     "execution_count": 755,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "groups_concepts['share'].max()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 756,
   "metadata": {},
   "outputs": [],
   "source": [
    "groups_concepts.to_csv('groups_concepts.csv', index=False)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Folklore and contemporary beliefs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 757,
   "metadata": {},
   "outputs": [],
   "source": [
    "Country_Regressions_Ready = pd.read_stata('../../datasets/folklore/MX2021/Replication_Tables_Figures/Country_Regressions_Ready.dta')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 761,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>cntry</th>\n",
       "      <th>lrgdpch2010</th>\n",
       "      <th>lnp06_18pc</th>\n",
       "      <th>lnavgy06_18</th>\n",
       "      <th>fem19</th>\n",
       "      <th>trust_wvsavg</th>\n",
       "      <th>lntrust_wvsavg</th>\n",
       "      <th>risktaking</th>\n",
       "      <th>trust_gps</th>\n",
       "      <th>patience</th>\n",
       "      <th>...</th>\n",
       "      <th>harm_vice</th>\n",
       "      <th>fair_vice</th>\n",
       "      <th>ingroup_vice</th>\n",
       "      <th>auth_vice</th>\n",
       "      <th>purity_vice</th>\n",
       "      <th>harm_virtue</th>\n",
       "      <th>fair_virtue</th>\n",
       "      <th>ingroup_virtue</th>\n",
       "      <th>auth_virtue</th>\n",
       "      <th>purity_virtue</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>AFG</td>\n",
       "      <td>6.955211</td>\n",
       "      <td>NaN</td>\n",
       "      <td>-1.678959</td>\n",
       "      <td>48.848999</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.120764</td>\n",
       "      <td>0.315964</td>\n",
       "      <td>-0.201360</td>\n",
       "      <td>...</td>\n",
       "      <td>42.266427</td>\n",
       "      <td>0.952258</td>\n",
       "      <td>14.069490</td>\n",
       "      <td>3.765717</td>\n",
       "      <td>7.757953</td>\n",
       "      <td>3.237963</td>\n",
       "      <td>2.588764</td>\n",
       "      <td>42.566406</td>\n",
       "      <td>23.111392</td>\n",
       "      <td>3.172359</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>AGO</td>\n",
       "      <td>8.538473</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>75.372002</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>23.754314</td>\n",
       "      <td>1.250721</td>\n",
       "      <td>9.719039</td>\n",
       "      <td>1.912417</td>\n",
       "      <td>8.483376</td>\n",
       "      <td>3.865623</td>\n",
       "      <td>0.673173</td>\n",
       "      <td>25.869074</td>\n",
       "      <td>10.680665</td>\n",
       "      <td>3.299046</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>AIA</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>40.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>24.000000</td>\n",
       "      <td>5.000000</td>\n",
       "      <td>11.000000</td>\n",
       "      <td>4.000000</td>\n",
       "      <td>3.000000</td>\n",
       "      <td>65.000000</td>\n",
       "      <td>40.000000</td>\n",
       "      <td>5.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>ALB</td>\n",
       "      <td>8.797417</td>\n",
       "      <td>-0.805970</td>\n",
       "      <td>0.162193</td>\n",
       "      <td>47.081001</td>\n",
       "      <td>1.192248</td>\n",
       "      <td>0.175841</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>63.299846</td>\n",
       "      <td>0.115660</td>\n",
       "      <td>25.964083</td>\n",
       "      <td>8.133978</td>\n",
       "      <td>14.429597</td>\n",
       "      <td>8.202667</td>\n",
       "      <td>3.342797</td>\n",
       "      <td>80.812782</td>\n",
       "      <td>48.338948</td>\n",
       "      <td>5.107501</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>AND</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.211741</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>82.000516</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>27.000358</td>\n",
       "      <td>9.000020</td>\n",
       "      <td>18.000079</td>\n",
       "      <td>9.000000</td>\n",
       "      <td>4.000040</td>\n",
       "      <td>111.000854</td>\n",
       "      <td>59.000357</td>\n",
       "      <td>8.000060</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>194</th>\n",
       "      <td>WSM</td>\n",
       "      <td>NaN</td>\n",
       "      <td>-0.636054</td>\n",
       "      <td>-0.144895</td>\n",
       "      <td>23.587999</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>6.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>2.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>8.000000</td>\n",
       "      <td>3.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>195</th>\n",
       "      <td>YEM</td>\n",
       "      <td>7.780259</td>\n",
       "      <td>-2.526622</td>\n",
       "      <td>NaN</td>\n",
       "      <td>5.827000</td>\n",
       "      <td>1.403987</td>\n",
       "      <td>0.339316</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>41.246887</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>14.811722</td>\n",
       "      <td>6.000000</td>\n",
       "      <td>7.094139</td>\n",
       "      <td>1.282417</td>\n",
       "      <td>2.094139</td>\n",
       "      <td>51.776191</td>\n",
       "      <td>28.964469</td>\n",
       "      <td>2.905861</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>196</th>\n",
       "      <td>ZAF</td>\n",
       "      <td>8.924416</td>\n",
       "      <td>0.369796</td>\n",
       "      <td>2.071975</td>\n",
       "      <td>48.768002</td>\n",
       "      <td>1.212878</td>\n",
       "      <td>0.192996</td>\n",
       "      <td>0.970596</td>\n",
       "      <td>-0.166918</td>\n",
       "      <td>0.057912</td>\n",
       "      <td>...</td>\n",
       "      <td>29.194306</td>\n",
       "      <td>0.536004</td>\n",
       "      <td>12.453652</td>\n",
       "      <td>3.678451</td>\n",
       "      <td>7.425861</td>\n",
       "      <td>4.298234</td>\n",
       "      <td>1.763692</td>\n",
       "      <td>41.109678</td>\n",
       "      <td>19.954455</td>\n",
       "      <td>3.224105</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>197</th>\n",
       "      <td>ZMB</td>\n",
       "      <td>7.324647</td>\n",
       "      <td>-2.477074</td>\n",
       "      <td>0.008516</td>\n",
       "      <td>70.785004</td>\n",
       "      <td>1.115467</td>\n",
       "      <td>0.109273</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>10.663001</td>\n",
       "      <td>0.902600</td>\n",
       "      <td>4.673161</td>\n",
       "      <td>2.997479</td>\n",
       "      <td>3.888401</td>\n",
       "      <td>1.197730</td>\n",
       "      <td>0.437495</td>\n",
       "      <td>14.061597</td>\n",
       "      <td>6.331164</td>\n",
       "      <td>1.679670</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>198</th>\n",
       "      <td>ZWE</td>\n",
       "      <td>5.765326</td>\n",
       "      <td>-2.742053</td>\n",
       "      <td>0.691748</td>\n",
       "      <td>78.733002</td>\n",
       "      <td>1.087762</td>\n",
       "      <td>0.084122</td>\n",
       "      <td>0.523195</td>\n",
       "      <td>-0.509133</td>\n",
       "      <td>-0.238587</td>\n",
       "      <td>...</td>\n",
       "      <td>18.091701</td>\n",
       "      <td>1.646761</td>\n",
       "      <td>7.775430</td>\n",
       "      <td>2.343199</td>\n",
       "      <td>4.236911</td>\n",
       "      <td>4.851816</td>\n",
       "      <td>1.968671</td>\n",
       "      <td>20.617171</td>\n",
       "      <td>7.614549</td>\n",
       "      <td>1.679720</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>199 rows × 9146 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "    cntry  lrgdpch2010  lnp06_18pc  lnavgy06_18      fem19  trust_wvsavg   \n",
       "0     AFG     6.955211         NaN    -1.678959  48.848999           NaN  \\\n",
       "1     AGO     8.538473         NaN          NaN  75.372002           NaN   \n",
       "2     AIA          NaN         NaN          NaN        NaN           NaN   \n",
       "3     ALB     8.797417   -0.805970     0.162193  47.081001      1.192248   \n",
       "4     AND          NaN    0.211741          NaN        NaN           NaN   \n",
       "..    ...          ...         ...          ...        ...           ...   \n",
       "194   WSM          NaN   -0.636054    -0.144895  23.587999           NaN   \n",
       "195   YEM     7.780259   -2.526622          NaN   5.827000      1.403987   \n",
       "196   ZAF     8.924416    0.369796     2.071975  48.768002      1.212878   \n",
       "197   ZMB     7.324647   -2.477074     0.008516  70.785004      1.115467   \n",
       "198   ZWE     5.765326   -2.742053     0.691748  78.733002      1.087762   \n",
       "\n",
       "     lntrust_wvsavg  risktaking  trust_gps  patience  ...  harm_vice   \n",
       "0               NaN    0.120764   0.315964 -0.201360  ...  42.266427  \\\n",
       "1               NaN         NaN        NaN       NaN  ...  23.754314   \n",
       "2               NaN         NaN        NaN       NaN  ...  40.000000   \n",
       "3          0.175841         NaN        NaN       NaN  ...  63.299846   \n",
       "4               NaN         NaN        NaN       NaN  ...  82.000516   \n",
       "..              ...         ...        ...       ...  ...        ...   \n",
       "194             NaN         NaN        NaN       NaN  ...   6.000000   \n",
       "195        0.339316         NaN        NaN       NaN  ...  41.246887   \n",
       "196        0.192996    0.970596  -0.166918  0.057912  ...  29.194306   \n",
       "197        0.109273         NaN        NaN       NaN  ...  10.663001   \n",
       "198        0.084122    0.523195  -0.509133 -0.238587  ...  18.091701   \n",
       "\n",
       "     fair_vice  ingroup_vice  auth_vice  purity_vice  harm_virtue   \n",
       "0     0.952258     14.069490   3.765717     7.757953     3.237963  \\\n",
       "1     1.250721      9.719039   1.912417     8.483376     3.865623   \n",
       "2     0.000000     24.000000   5.000000    11.000000     4.000000   \n",
       "3     0.115660     25.964083   8.133978    14.429597     8.202667   \n",
       "4     0.000000     27.000358   9.000020    18.000079     9.000000   \n",
       "..         ...           ...        ...          ...          ...   \n",
       "194   0.000000      0.000000   0.000000     2.000000     0.000000   \n",
       "195   1.000000     14.811722   6.000000     7.094139     1.282417   \n",
       "196   0.536004     12.453652   3.678451     7.425861     4.298234   \n",
       "197   0.902600      4.673161   2.997479     3.888401     1.197730   \n",
       "198   1.646761      7.775430   2.343199     4.236911     4.851816   \n",
       "\n",
       "     fair_virtue  ingroup_virtue  auth_virtue  purity_virtue  \n",
       "0       2.588764       42.566406    23.111392       3.172359  \n",
       "1       0.673173       25.869074    10.680665       3.299046  \n",
       "2       3.000000       65.000000    40.000000       5.000000  \n",
       "3       3.342797       80.812782    48.338948       5.107501  \n",
       "4       4.000040      111.000854    59.000357       8.000060  \n",
       "..           ...             ...          ...            ...  \n",
       "194     0.000000        8.000000     3.000000       0.000000  \n",
       "195     2.094139       51.776191    28.964469       2.905861  \n",
       "196     1.763692       41.109678    19.954455       3.224105  \n",
       "197     0.437495       14.061597     6.331164       1.679670  \n",
       "198     1.968671       20.617171     7.614549       1.679720  \n",
       "\n",
       "[199 rows x 9146 columns]"
      ]
     },
     "execution_count": 761,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Country_Regressions_Ready"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 762,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Country code plus the three outcome/regressor pairs used in the cross-country regressions\n",
    "columns = ['cntry', 'lntrust_wvsavg', 'tricksters_punish', 'risktaking', 'challenge_competition', 'fem19', 'malebias']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 772,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "-0.70643896"
      ]
     },
     "execution_count": 772,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Country_Regressions_Ready['trust_gps'].min()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 764,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>cntry</th>\n",
       "      <th>lntrust_wvsavg</th>\n",
       "      <th>tricksters_punish</th>\n",
       "      <th>risktaking</th>\n",
       "      <th>challenge_competition</th>\n",
       "      <th>fem19</th>\n",
       "      <th>malebias</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>AFG</td>\n",
       "      <td>NaN</td>\n",
       "      <td>-0.024060</td>\n",
       "      <td>0.120764</td>\n",
       "      <td>0.064749</td>\n",
       "      <td>48.848999</td>\n",
       "      <td>0.301265</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>AGO</td>\n",
       "      <td>NaN</td>\n",
       "      <td>-0.090911</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.080376</td>\n",
       "      <td>75.372002</td>\n",
       "      <td>0.081707</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>AIA</td>\n",
       "      <td>NaN</td>\n",
       "      <td>-0.005747</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.091954</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.160920</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>ALB</td>\n",
       "      <td>0.175841</td>\n",
       "      <td>-0.028450</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.053650</td>\n",
       "      <td>47.081001</td>\n",
       "      <td>0.249465</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>AND</td>\n",
       "      <td>NaN</td>\n",
       "      <td>-0.009901</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.072607</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.211221</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>194</th>\n",
       "      <td>WSM</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.022222</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.022222</td>\n",
       "      <td>23.587999</td>\n",
       "      <td>0.044444</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>195</th>\n",
       "      <td>YEM</td>\n",
       "      <td>0.339316</td>\n",
       "      <td>0.033566</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.075617</td>\n",
       "      <td>5.827000</td>\n",
       "      <td>0.248730</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>196</th>\n",
       "      <td>ZAF</td>\n",
       "      <td>0.192996</td>\n",
       "      <td>-0.021237</td>\n",
       "      <td>0.970596</td>\n",
       "      <td>0.072578</td>\n",
       "      <td>48.768002</td>\n",
       "      <td>0.150613</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>197</th>\n",
       "      <td>ZMB</td>\n",
       "      <td>0.109273</td>\n",
       "      <td>-0.058759</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.073665</td>\n",
       "      <td>70.785004</td>\n",
       "      <td>0.102419</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>198</th>\n",
       "      <td>ZWE</td>\n",
       "      <td>0.084122</td>\n",
       "      <td>-0.061962</td>\n",
       "      <td>0.523195</td>\n",
       "      <td>0.090295</td>\n",
       "      <td>78.733002</td>\n",
       "      <td>0.157501</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>199 rows × 7 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "    cntry  lntrust_wvsavg  tricksters_punish  risktaking   \n",
       "0     AFG             NaN          -0.024060    0.120764  \\\n",
       "1     AGO             NaN          -0.090911         NaN   \n",
       "2     AIA             NaN          -0.005747         NaN   \n",
       "3     ALB        0.175841          -0.028450         NaN   \n",
       "4     AND             NaN          -0.009901         NaN   \n",
       "..    ...             ...                ...         ...   \n",
       "194   WSM             NaN           0.022222         NaN   \n",
       "195   YEM        0.339316           0.033566         NaN   \n",
       "196   ZAF        0.192996          -0.021237    0.970596   \n",
       "197   ZMB        0.109273          -0.058759         NaN   \n",
       "198   ZWE        0.084122          -0.061962    0.523195   \n",
       "\n",
       "     challenge_competition      fem19  malebias  \n",
       "0                 0.064749  48.848999  0.301265  \n",
       "1                 0.080376  75.372002  0.081707  \n",
       "2                 0.091954        NaN  0.160920  \n",
       "3                 0.053650  47.081001  0.249465  \n",
       "4                 0.072607        NaN  0.211221  \n",
       "..                     ...        ...       ...  \n",
       "194               0.022222  23.587999  0.044444  \n",
       "195               0.075617   5.827000  0.248730  \n",
       "196               0.072578  48.768002  0.150613  \n",
       "197               0.073665  70.785004  0.102419  \n",
       "198               0.090295  78.733002  0.157501  \n",
       "\n",
       "[199 rows x 7 columns]"
      ]
     },
     "execution_count": 764,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "regressions = Country_Regressions_Ready[columns]\n",
    "regressions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 765,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Save the subset to CSV without the DataFrame index\n",
    "regressions.to_csv('regressions.csv', index=False)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Obtain OLS coefficients\n",
    "\n",
    "To plot the trendlines of the cross-country regressions, I re-run each regression myself by OLS to recover the fitted coefficients and intercept."
   ]
  },
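  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a sketch of what the cells below estimate, each cross-country regression has the same form. The symbols $\\beta_k$, $z_{1c}$, $z_{2c}$ and $\\varepsilon_c$ are my own notation, not taken from MX:\n",
    "\n",
    "$$y_c = \\beta_0 + \\beta_1 x_c + \\beta_2 z_{1c} + \\beta_3 z_{2c} + \\varepsilon_c$$\n",
    "\n",
    "where $c$ indexes countries, $z_{1c}$ is `lnyear_firstpub`, $z_{2c}$ is `lnnmbr_title`, and the pair $(y_c, x_c)$ is one of:\n",
    "\n",
    "- `lntrust_wvsavg` on `tricksters_punish`\n",
    "- `risktaking` on `challenge_competition`\n",
    "- `fem19` on `malebias`\n",
    "\n",
    "`LinearRegression` returns $\\hat\\beta_1, \\hat\\beta_2, \\hat\\beta_3$ in `coef_` (in the column order passed to `fit`) and $\\hat\\beta_0$ in `intercept_`."
   ]
  },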
  {
   "cell_type": "code",
   "execution_count": 778,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Keep the outcome, the regressor of interest, and the two controls, then drop rows with any missing value\n",
    "df_trust = Country_Regressions_Ready[['cntry', 'lntrust_wvsavg', 'tricksters_punish', 'lnyear_firstpub', 'lnnmbr_title']]\n",
    "df_trust = df_trust.dropna()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 779,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>cntry</th>\n",
       "      <th>lntrust_wvsavg</th>\n",
       "      <th>tricksters_punish</th>\n",
       "      <th>lnyear_firstpub</th>\n",
       "      <th>lnnmbr_title</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>ALB</td>\n",
       "      <td>0.175841</td>\n",
       "      <td>-0.028450</td>\n",
       "      <td>7.537470</td>\n",
       "      <td>3.978251</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>ARG</td>\n",
       "      <td>0.183864</td>\n",
       "      <td>-0.021799</td>\n",
       "      <td>7.537475</td>\n",
       "      <td>3.611511</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>ARM</td>\n",
       "      <td>0.180377</td>\n",
       "      <td>0.001109</td>\n",
       "      <td>7.536747</td>\n",
       "      <td>3.793288</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>AUS</td>\n",
       "      <td>0.385149</td>\n",
       "      <td>-0.015488</td>\n",
       "      <td>7.532831</td>\n",
       "      <td>3.636905</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>AUT</td>\n",
       "      <td>0.293364</td>\n",
       "      <td>-0.026174</td>\n",
       "      <td>7.520525</td>\n",
       "      <td>4.067886</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>193</th>\n",
       "      <td>VNM</td>\n",
       "      <td>0.390781</td>\n",
       "      <td>-0.022763</td>\n",
       "      <td>7.542804</td>\n",
       "      <td>3.477649</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>195</th>\n",
       "      <td>YEM</td>\n",
       "      <td>0.339316</td>\n",
       "      <td>0.033566</td>\n",
       "      <td>7.549919</td>\n",
       "      <td>2.764451</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>196</th>\n",
       "      <td>ZAF</td>\n",
       "      <td>0.192996</td>\n",
       "      <td>-0.021237</td>\n",
       "      <td>7.533932</td>\n",
       "      <td>3.457292</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>197</th>\n",
       "      <td>ZMB</td>\n",
       "      <td>0.109273</td>\n",
       "      <td>-0.058759</td>\n",
       "      <td>7.556819</td>\n",
       "      <td>2.716290</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>198</th>\n",
       "      <td>ZWE</td>\n",
       "      <td>0.084122</td>\n",
       "      <td>-0.061962</td>\n",
       "      <td>7.552801</td>\n",
       "      <td>3.221077</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>104 rows × 5 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "    cntry  lntrust_wvsavg  tricksters_punish  lnyear_firstpub  lnnmbr_title\n",
       "3     ALB        0.175841          -0.028450         7.537470      3.978251\n",
       "6     ARG        0.183864          -0.021799         7.537475      3.611511\n",
       "7     ARM        0.180377           0.001109         7.536747      3.793288\n",
       "10    AUS        0.385149          -0.015488         7.532831      3.636905\n",
       "11    AUT        0.293364          -0.026174         7.520525      4.067886\n",
       "..    ...             ...                ...              ...           ...\n",
       "193   VNM        0.390781          -0.022763         7.542804      3.477649\n",
       "195   YEM        0.339316           0.033566         7.549919      2.764451\n",
       "196   ZAF        0.192996          -0.021237         7.533932      3.457292\n",
       "197   ZMB        0.109273          -0.058759         7.556819      2.716290\n",
       "198   ZWE        0.084122          -0.061962         7.552801      3.221077\n",
       "\n",
       "[104 rows x 5 columns]"
      ]
     },
     "execution_count": 779,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_trust"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 783,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>lntrust_wvsavg</th>\n",
       "      <th>tricksters_punish</th>\n",
       "      <th>lnyear_firstpub</th>\n",
       "      <th>lnnmbr_title</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>104.000000</td>\n",
       "      <td>104.000000</td>\n",
       "      <td>104.000000</td>\n",
       "      <td>104.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>0.228953</td>\n",
       "      <td>-0.020420</td>\n",
       "      <td>7.540038</td>\n",
       "      <td>3.397068</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>0.107977</td>\n",
       "      <td>0.019649</td>\n",
       "      <td>0.011507</td>\n",
       "      <td>0.588741</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>0.055010</td>\n",
       "      <td>-0.061962</td>\n",
       "      <td>7.487174</td>\n",
       "      <td>1.393201</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>0.148913</td>\n",
       "      <td>-0.028557</td>\n",
       "      <td>7.535173</td>\n",
       "      <td>3.094192</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>0.209711</td>\n",
       "      <td>-0.021608</td>\n",
       "      <td>7.538700</td>\n",
       "      <td>3.543206</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>0.290824</td>\n",
       "      <td>-0.011935</td>\n",
       "      <td>7.546710</td>\n",
       "      <td>3.761305</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>0.519469</td>\n",
       "      <td>0.041841</td>\n",
       "      <td>7.568855</td>\n",
       "      <td>4.599667</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       lntrust_wvsavg  tricksters_punish  lnyear_firstpub  lnnmbr_title\n",
       "count      104.000000         104.000000       104.000000    104.000000\n",
       "mean         0.228953          -0.020420         7.540038      3.397068\n",
       "std          0.107977           0.019649         0.011507      0.588741\n",
       "min          0.055010          -0.061962         7.487174      1.393201\n",
       "25%          0.148913          -0.028557         7.535173      3.094192\n",
       "50%          0.209711          -0.021608         7.538700      3.543206\n",
       "75%          0.290824          -0.011935         7.546710      3.761305\n",
       "max          0.519469           0.041841         7.568855      4.599667"
      ]
     },
     "execution_count": 783,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_trust.describe()"
   ]
  },
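  {
   "cell_type": "code",
   "execution_count": 781,
   "metadata": {},
   "outputs": [],
   "source": [
    "# OLS of lntrust_wvsavg on tricksters_punish plus the two controls; y is a one-column DataFrame, so coef_ has shape (1, 3)\n",
    "reg_trust = LinearRegression().fit(df_trust[['tricksters_punish', 'lnyear_firstpub', 'lnnmbr_title']], df_trust[['lntrust_wvsavg']])"
   ]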
  },
  {
   "cell_type": "code",
   "execution_count": 782,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Coefficients:  [[ 1.856006   -3.2042735  -0.02046409]]\n",
      "Intercept:  [24.496714]\n"
     ]
    }
   ],
   "source": [
    "print('Coefficients: ', reg_trust.coef_)\n",
    "print('Intercept: ', reg_trust.intercept_)"
   ]
  },
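  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To turn these parameters into a two-dimensional trendline, one option (an assumption on my part, not necessarily how MX draw theirs) is to hold the two controls at their sample means from `df_trust.describe()` and let only $x$ = `tricksters_punish` vary. Writing $\\bar z_1$ and $\\bar z_2$ for the means of `lnyear_firstpub` and `lnnmbr_title`:\n",
    "\n",
    "$$\\hat y(x) = \\hat\\beta_0 + \\hat\\beta_2 \\bar z_1 + \\hat\\beta_3 \\bar z_2 + \\hat\\beta_1 x$$\n",
    "\n",
    "$$\\hat y(x) \\approx 24.4967 - 3.2043 \\times 7.5400 - 0.0205 \\times 3.3971 + 1.8560\\,x \\approx 0.267 + 1.856\\,x$$\n",
    "\n",
    "As a rough check, an OLS fit with an intercept passes through the sample means: $0.267 + 1.856 \\times (-0.0204) \\approx 0.229$, which matches the mean of `lntrust_wvsavg`. The printed coefficients are rounded, so the last digits are approximate."
   ]
  },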
  {
   "cell_type": "code",
   "execution_count": 784,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>risktaking</th>\n",
       "      <th>challenge_competition</th>\n",
       "      <th>lnyear_firstpub</th>\n",
       "      <th>lnnmbr_title</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>76.000000</td>\n",
       "      <td>76.000000</td>\n",
       "      <td>76.000000</td>\n",
       "      <td>76.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>0.012658</td>\n",
       "      <td>0.057511</td>\n",
       "      <td>7.540360</td>\n",
       "      <td>3.384763</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>0.301881</td>\n",
       "      <td>0.015940</td>\n",
       "      <td>0.011473</td>\n",
       "      <td>0.538440</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>-0.792435</td>\n",
       "      <td>0.005366</td>\n",
       "      <td>7.487174</td>\n",
       "      <td>1.393201</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>-0.157406</td>\n",
       "      <td>0.048737</td>\n",
       "      <td>7.535480</td>\n",
       "      <td>3.074302</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>-0.019577</td>\n",
       "      <td>0.059116</td>\n",
       "      <td>7.538700</td>\n",
       "      <td>3.416417</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>0.163387</td>\n",
       "      <td>0.066100</td>\n",
       "      <td>7.549553</td>\n",
       "      <td>3.692096</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>0.970596</td>\n",
       "      <td>0.113599</td>\n",
       "      <td>7.560711</td>\n",
       "      <td>4.599667</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       risktaking  challenge_competition  lnyear_firstpub  lnnmbr_title\n",
       "count   76.000000              76.000000        76.000000     76.000000\n",
       "mean     0.012658               0.057511         7.540360      3.384763\n",
       "std      0.301881               0.015940         0.011473      0.538440\n",
       "min     -0.792435               0.005366         7.487174      1.393201\n",
       "25%     -0.157406               0.048737         7.535480      3.074302\n",
       "50%     -0.019577               0.059116         7.538700      3.416417\n",
       "75%      0.163387               0.066100         7.549553      3.692096\n",
       "max      0.970596               0.113599         7.560711      4.599667"
      ]
     },
     "execution_count": 784,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Same specification for risk taking: outcome, regressor of interest, and the two controls\n",
    "df_risk = Country_Regressions_Ready[['cntry', 'risktaking', 'challenge_competition', 'lnyear_firstpub', 'lnnmbr_title']]\n",
    "df_risk = df_risk.dropna()\n",
    "df_risk.describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 787,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Coefficients:  [[ 5.438077   -2.2783062  -0.23996948]]\n",
      "Intercept:  [17.6914]\n"
     ]
    }
   ],
   "source": [
    "reg_risk = LinearRegression().fit(df_risk[['challenge_competition', 'lnyear_firstpub', 'lnnmbr_title']], df_risk[['risktaking']])\n",
    "print('Coefficients: ', reg_risk.coef_)\n",
    "print('Intercept: ', reg_risk.intercept_)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 788,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>fem19</th>\n",
       "      <th>malebias</th>\n",
       "      <th>lnyear_firstpub</th>\n",
       "      <th>lnnmbr_title</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>174.000000</td>\n",
       "      <td>174.000000</td>\n",
       "      <td>174.000000</td>\n",
       "      <td>174.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>51.511448</td>\n",
       "      <td>0.179793</td>\n",
       "      <td>7.543276</td>\n",
       "      <td>3.181844</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>15.755623</td>\n",
       "      <td>0.054670</td>\n",
       "      <td>0.013249</td>\n",
       "      <td>0.642237</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>5.827000</td>\n",
       "      <td>0.044444</td>\n",
       "      <td>7.487174</td>\n",
       "      <td>1.305195</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>44.729750</td>\n",
       "      <td>0.142352</td>\n",
       "      <td>7.536789</td>\n",
       "      <td>2.815713</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>53.423500</td>\n",
       "      <td>0.187226</td>\n",
       "      <td>7.543113</td>\n",
       "      <td>3.257886</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>60.614251</td>\n",
       "      <td>0.210022</td>\n",
       "      <td>7.551734</td>\n",
       "      <td>3.637969</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>84.160004</td>\n",
       "      <td>0.310007</td>\n",
       "      <td>7.590555</td>\n",
       "      <td>4.599667</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            fem19    malebias  lnyear_firstpub  lnnmbr_title\n",
       "count  174.000000  174.000000       174.000000    174.000000\n",
       "mean    51.511448    0.179793         7.543276      3.181844\n",
       "std     15.755623    0.054670         0.013249      0.642237\n",
       "min      5.827000    0.044444         7.487174      1.305195\n",
       "25%     44.729750    0.142352         7.536789      2.815713\n",
       "50%     53.423500    0.187226         7.543113      3.257886\n",
       "75%     60.614251    0.210022         7.551734      3.637969\n",
       "max     84.160004    0.310007         7.590555      4.599667"
      ]
     },
     "execution_count": 788,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Same specification for fem19 (outcome) on malebias, with the two controls\n",
    "df_fem = Country_Regressions_Ready[['cntry', 'fem19', 'malebias', 'lnyear_firstpub', 'lnnmbr_title']]\n",
    "df_fem = df_fem.dropna()\n",
    "df_fem.describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 789,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Coefficients:  [[-111.90181     -12.22385       0.28624266]]\n",
      "Intercept:  [162.92767]\n"
     ]
    }
   ],
   "source": [
    "reg_fem = LinearRegression().fit(df_fem[['malebias', 'lnyear_firstpub', 'lnnmbr_title']], df_fem[['fem19']])\n",
    "print('Coefficients: ', reg_fem.coef_)\n",
    "print('Intercept: ', reg_fem.intercept_)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "twopoints-venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.9"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}