{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Cleaning scripts" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Harmonize MX and Dmitry Nikolayev's datasets\n", "\n", "The replication datasets of Michalopoulos and Xue (MX) contain motif data on 958 ethnic groups, taken from the 2019 version of Yuri Berezkin's catalog. However, they do not contain location data for the ethnic groups. Independently of MX, Dmitry Nikolayev (DN) provides coordinates for the ethnic groups [here](https://github.com/macleginn/mythology-queries). However, his dataset contains only 926 ethnic groups and appears to be based on an older version of Berezkin's catalog.\n", "\n", "This notebook reconciles the MX and DN datasets to the greatest extent possible." ] }, { "cell_type": "code", "execution_count": 774, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import re\n", "from sklearn.linear_model import LinearRegression" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Load the MX dataset." ] }, { "cell_type": "code", "execution_count": 640, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " group_Berezkin a1 a10 a11a a11b a11c a12 a12a a12b a12c \n", "0 Abaza (Abazins) 0 0 0 0 0 1 0 0 1 \\\n", "1 Abkhaz 0 0 0 0 0 1 0 0 0 \n", "2 Aceh 0 0 0 0 0 0 0 0 0 \n", "3 Ache 0 0 0 1 0 1 1 0 0 \n", "4 Achomavi 0 0 0 0 0 1 1 0 0 \n", ".. ... .. .. ... ... ... .. ... ... ... \n", "953 Teleut 0 0 0 0 0 0 0 0 0 \n", "954 Central Yakuts 0 0 0 0 0 0 0 0 0 \n", "955 Arabs: Iraq 0 0 0 0 0 0 0 0 0 \n", "956 Liaoning and Jilin Chinese 0 0 0 0 0 0 0 0 0 \n", "957 Norvegians, Faroe islanders 0 0 0 0 0 0 0 0 0 \n", "\n", " ... n28 n29 n3 n30 n4 n5 n6 n7 n8 n9 \n", "0 ... 0 0 1 0 0 0 0 0 0 0 \n", "1 ... 0 0 0 0 0 0 1 0 0 0 \n", "2 ... 0 0 0 0 0 0 0 0 0 0 \n", "3 ... 0 0 0 0 0 0 0 0 0 0 \n", "4 ... 0 0 0 0 0 0 0 0 0 0 \n", ".. ... .. .. .. .. .. .. .. .. .. .. \n", "953 ... 0 0 0 0 0 0 0 0 0 0 \n", "954 ... 0 0 0 0 0 1 0 0 0 0 \n", "955 ... 0 0 0 0 0 0 0 0 0 0 \n", "956 ... 0 0 0 0 0 0 0 0 0 0 \n", "957 ... 0 0 0 0 0 0 0 0 0 0 \n", "\n", "[958 rows x 2565 columns]" ] }, "execution_count": 640, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Motifs_Berezkin_groups = pd.read_stata('../../datasets/folklore/MX2021/Original_Files/Motifs_Berezkin_groups.dta')\n", "df = Motifs_Berezkin_groups.drop(\n", " ['oid', 'motifs_total', 'nmbr_author', 'nmbr_language', 'nmbr_publisher', 'nmbr_title', 'year_firstpub', 'year_avgpub'], \n", " axis=1\n", ")\n", "df" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Load Nikolayev's coordinates dataset." ] }, { "cell_type": "code", "execution_count": 641, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Longitude Latitude Name\n", "0 20.0 -26.0 Bushmen\n", "1 21.0 -32.0 Khoikhoi\n", "2 26.5 -32.5 Xhosa\n", "3 30.5 -28.5 Zulu,Swasi\n", "4 26.5 -27.5 Sotho, Tswana\n", ".. ... ... ...\n", "921 -72.5 -39.0 Mapuche\n", "922 -67.5 -42.0 Puelche\n", "923 -69.0 -47.0 Tehuelche\n", "924 -68.5 -54.5 Selknam\n", "925 -71.0 -55.0 Yamana\n", "\n", "[926 rows x 3 columns]" ] }, "execution_count": 641, "metadata": {}, "output_type": "execute_result" } ], "source": [ "coords = pd.read_json('../../datasets/folklore/coords.json')\n", "coords" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Match MX to coords and aggregate\n", "\n", "Certain groups in MX are more disaggregated than in `coords`. For example, \"Yakut\" in `coords` is disaggregated as follows:" ] }, { "cell_type": "code", "execution_count": 642, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " group_Berezkin a1 a10 a11a a11b a11c a12 \n", "941 NW Yakuts (Yessey,Anabar,Olenyok, Lower Lena) 0 0 0 0 0 0 \\\n", "945 NE Yakuts (Yana,Indigirka,Kolyma) 0 0 0 0 0 1 \n", "950 Western Yakuts (Olyokma,Vilyuy) 0 0 0 0 0 0 \n", "954 Central Yakuts 0 0 0 0 0 0 \n", "\n", " a12a a12b a12c ... n28 n29 n3 n30 n4 n5 n6 n7 n8 n9 \n", "941 0 0 0 ... 0 1 0 0 0 0 0 0 0 0 \n", "945 1 0 0 ... 0 0 0 0 0 1 0 0 0 0 \n", "950 0 0 0 ... 0 0 0 0 0 1 0 0 0 0 \n", "954 0 0 0 ... 0 0 0 0 0 1 0 0 0 0 \n", "\n", "[4 rows x 2565 columns]" ] }, "execution_count": 642, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df['group_Berezkin'].str.contains('Yakut')]" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "For these cases, we change the MX group names to match the `coords` names then aggregate. I have prepared the dictionary `match_names_to_coords` for this task." ] }, { "cell_type": "code", "execution_count": 643, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " mx_name coords_name\n", "0 Wolof Fulbe,Wolof,Serer\n", "1 Serer Fulbe,Wolof,Serer\n", "2 Japan AD 700-1700 Japan\n", "3 Japanese folklore Japan\n", "4 NW Yakuts (Yessey,Anabar,Olenyok, Lower Lena) Yakut\n", "5 Central Yakuts Yakut\n", "6 NE Yakuts (Yana,Indigirka,Kolyma) Yakut\n", "7 Western Yakuts (Olyokma,Vilyuy) Yakut\n", "8 Trans NG East Highlands Trans New Guinea East\n", "9 Trans NG East Lowlands North Trans New Guinea East\n", "10 Trans NG East Lowlands South Trans New Guinea East\n", "11 Northern Khanty Hanty\n", "12 Southern Khanty Hanty\n", "13 Eastern Khanty(Ostyaks) Hanty\n", "14 Forest Yukaghir (Upper Kolyma) Yukaghir\n", "15 Tundra Yukaghir (Lower Kolyma) Yukaghir\n", "16 Lahu,Sani,Nasu,Jino Lahu,Sani,Hani,Nasu,Jino\n", "17 Hani, Akha Lahu,Sani,Hani,Nasu,Jino" ] }, "execution_count": 643, "metadata": {}, "output_type": "execute_result" } ], "source": [ "match_names_to_coords = pd.read_csv('match_names_to_coords.csv')\n", "match_names_to_coords" ] }, { "cell_type": "code", "execution_count": 644, "metadata": {}, "outputs": [], "source": [ "dict_to_coords = match_names_to_coords.set_index('mx_name').to_dict()['coords_name']\n", "df_to_coords = df.copy()\n", "df_to_coords['group_Berezkin'] = df_to_coords['group_Berezkin'].replace(dict_to_coords)" ] }, { "cell_type": "code", "execution_count": 645, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " group_Berezkin a1 a10 a11a a11b a11c a12 a12a a12b a12c ... n28 n29 n3 \n", "941 Yakut 0 0 0 0 0 0 0 0 0 ... 0 1 0 \\\n", "945 Yakut 0 0 0 0 0 1 1 0 0 ... 0 0 0 \n", "950 Yakut 0 0 0 0 0 0 0 0 0 ... 0 0 0 \n", "954 Yakut 0 0 0 0 0 0 0 0 0 ... 0 0 0 \n", "\n", " n30 n4 n5 n6 n7 n8 n9 \n", "941 0 0 0 0 0 0 0 \n", "945 0 0 1 0 0 0 0 \n", "950 0 0 1 0 0 0 0 \n", "954 0 0 1 0 0 0 0 \n", "\n", "[4 rows x 2565 columns]" ] }, "execution_count": 645, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_to_coords[df_to_coords['group_Berezkin'].str.contains('Yakut')]" ] }, { "cell_type": "code", "execution_count": 646, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "group_Berezkin object\n", "a1 category\n", "a10 category\n", "a11a category\n", "a11b category\n", " ... \n", "n5 category\n", "n6 category\n", "n7 category\n", "n8 category\n", "n9 category\n", "Length: 2565, dtype: object" ] }, "execution_count": 646, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_to_coords.dtypes" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "To be able to aggregate, we need to change the categorical columns into numerical columns." ] }, { "cell_type": "code", "execution_count": 647, "metadata": {}, "outputs": [], "source": [ "cols = [i for i in df_to_coords.columns if i not in [\"group_Berezkin\"]]\n", "for col in cols:\n", " df_to_coords[col] = pd.to_numeric(df_to_coords[col])" ] }, { "cell_type": "code", "execution_count": 648, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "group_Berezkin object\n", "a1 int64\n", "a10 int64\n", "a11a int64\n", "a11b int64\n", " ... 
\n", "n5 int64\n", "n6 int64\n", "n7 int64\n", "n8 int64\n", "n9 int64\n", "Length: 2565, dtype: object" ] }, "execution_count": 648, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_to_coords.dtypes" ] }, { "cell_type": "code", "execution_count": 649, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(947, 2565)" ] }, "execution_count": 649, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_to_coords = df_to_coords.groupby('group_Berezkin').max().reset_index()\n", "df_to_coords.shape" ] }, { "cell_type": "code", "execution_count": 650, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " group_Berezkin a1 a10 a11a a11b a11c a12 a12a a12b a12c ... \n", "912 Yakut 0 0 0 0 0 1 1 0 0 ... \\\n", "\n", " n28 n29 n3 n30 n4 n5 n6 n7 n8 n9 \n", "912 0 1 0 0 0 1 0 0 0 0 \n", "\n", "[1 rows x 2565 columns]" ] }, "execution_count": 650, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_to_coords[df_to_coords['group_Berezkin'].str.contains('Yakut')]" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Match coords to MX\n", "\n", "Next, where a group has a one-to-one match between MX and `coords` but under a different name, we change the name in `coords` to follow MX. I have prepared the `match_names_to_mx` mapping (loaded from `match_names_to_mx.csv`) for this." ] }, { "cell_type": "code", "execution_count": 651, "metadata": {}, "outputs": [], "source": [ "match_names_to_mx = pd.read_csv('match_names_to_mx.csv')\n", "match_names_to_mx = match_names_to_mx.set_index('coords_name').to_dict()['mx_name']" ] }, { "cell_type": "code", "execution_count": 652, "metadata": {}, "outputs": [], "source": [ "coords['Name'] = coords['Name'].replace(match_names_to_mx)" ] }, { "cell_type": "code", "execution_count": 653, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(950, 4)" ] }, "execution_count": 653, "metadata": {}, "output_type": "execute_result" } ], "source": [ "groups = df_to_coords[['group_Berezkin']].copy()\n", "df_coords = pd.merge(groups, coords, left_on=['group_Berezkin'], right_on=['Name'], how='outer')\n", "df_coords.shape" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The following groups are those with no clear match and shall be discarded." ] }, { "cell_type": "code", "execution_count": 654, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " group_Berezkin Longitude Latitude \n", "24 Almora (Rangkas) NaN NaN \\\n", "40 Arabs (literary tradition) NaN NaN \n", "253 Fujian Chinese NaN NaN \n", "254 Fula (Pular) NaN NaN \n", "258 Galicians NaN NaN \n", "274 Gulf: Kuwait,Bahrain,Qatar,Oman NaN NaN \n", "286 Henan Chinese NaN NaN \n", "291 Himachali Pahari NaN NaN \n", "304 Iban,Bidayu,Sakarram NaN NaN \n", "306 Icelanders (after A.D. 1800) NaN NaN \n", "326 Jiangsu and Zhejang Chinese NaN NaN \n", "381 Khotan Saka NaN NaN \n", "439 Liaoning and Jilin Chinese NaN NaN \n", "475 Maldives NaN NaN \n", "674 Salars NaN NaN \n", "690 Scandinavians, early written sources NaN NaN \n", "714 Sichuan Chinese NaN NaN \n", "781 Teleut NaN NaN \n", "829 Tujia NaN NaN \n", "830 Tulu NaN NaN \n", "864 Urums, Rumei NaN NaN \n", "880 Wallons NaN NaN \n", "907 Xinca NaN NaN \n", "927 Yeyi NaN NaN \n", "947 NaN -20.0 65.0 \n", "948 NaN 40.0 32.0 \n", "949 NaN 111.0 1.0 \n", "\n", " Name \n", "24 NaN \n", "40 NaN \n", "253 NaN \n", "254 NaN \n", "258 NaN \n", "274 NaN \n", "286 NaN \n", "291 NaN \n", "304 NaN \n", "306 NaN \n", "326 NaN \n", "381 NaN \n", "439 NaN \n", "475 NaN \n", "674 NaN \n", "690 NaN \n", "714 NaN \n", "781 NaN \n", "829 NaN \n", "830 NaN \n", "864 NaN \n", "880 NaN \n", "907 NaN \n", "927 NaN \n", "947 Edda,Saxo Grammaticus \n", "948 1001 nights \n", "949 Other Dayak " ] }, "execution_count": 654, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_coords[df_coords['group_Berezkin'].isna() | df_coords['Name'].isna()]" ] }, { "cell_type": "code", "execution_count": 655, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(923, 3)" ] }, "execution_count": 655, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_coords = df_coords.dropna()\n", "df_coords = df_coords.drop(['Name'], axis=1)\n", "df_coords = df_coords.rename(\n", " columns={\n", " 'group_Berezkin': 'group', \n", " 'Longitude': 'longitude',\n", " 'Latitude': 'latitude'\n", " })\n", 
"df_coords.shape" ] }, { "cell_type": "code", "execution_count": 656, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " group longitude latitude\n", "0 Abaza (Abazins) 42.0 44.2\n", "1 Abenaki,Penobscot -70.5 44.5\n", "2 Abkhaz 40.8 43.2\n", "3 Abor,Gallong,Tani 95.0 28.5\n", "4 Aceh 95.6 5.3" ] }, "execution_count": 656, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_coords.head()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Merge and export\n", "\n", "After discarding the unmatched groups, the final list contains 923 groups." ] }, { "cell_type": "code", "execution_count": 657, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(947, 2565)" ] }, "execution_count": 657, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_to_coords.shape" ] }, { "cell_type": "code", "execution_count": 658, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(923, 2568)" ] }, "execution_count": 658, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_groups_motifs = pd.merge(df_to_coords, df_coords, left_on=['group_Berezkin'], right_on=['group'], how='outer').reset_index(drop=True)\n", "df_groups_motifs.dropna(subset=['group'], inplace=True)\n", "df_groups_motifs.shape" ] }, { "cell_type": "code", "execution_count": 659, "metadata": {}, "outputs": [], "source": [ "df_groups_motifs.drop(['group_Berezkin', 'longitude', 'latitude'], axis=1, inplace=True)\n", "group_col = df_groups_motifs.pop('group')\n", "df_groups_motifs.insert(0, 'group', group_col)" ] }, { "cell_type": "code", "execution_count": 661, "metadata": {}, "outputs": [], "source": [ "df_groups_motifs = pd.melt(df_groups_motifs, id_vars=['group'], var_name='motif_id', value_name='present')\n", "df_groups_motifs = df_groups_motifs.fillna(0)" ] }, { "cell_type": "code", "execution_count": 662, "metadata": {}, "outputs": [], "source": [ "df_groups_motifs = df_groups_motifs[df_groups_motifs['present'] == 1]\n", "df_groups_motifs = df_groups_motifs.drop(['present'], axis=1)\n", "df_groups_motifs.to_csv('groups_motifs.csv')" ] }, { "cell_type": "code",
"execution_count": 663, "metadata": {}, "outputs": [], "source": [ "df_coords.to_json('coords_clean.json', orient='records')" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Clean and export motifs list\n", "\n", "There are a total of 2564 motifs." ] }, { "cell_type": "code", "execution_count": 666, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " motif_id title_english \n", "0 a1 The old sun \\\n", "1 a10 The sun finds its eyes \n", "2 a11a Eyes of the Sun and the Moon: coolness and night \n", "3 a11b One-eyed luminaries \n", "4 a11c The Sun, the Moon and monster’s eyes \n", "... ... ... \n", "2559 n5 They recognize winter by rime, summer by rain \n", "2560 n6 Horse tells to whip him strongly \n", "2561 n7 Three apples \n", "2562 n8 Storyteller instead of a cannonball \n", "2563 n9 Who is coming? \n", "\n", " title_russian \n", "0 Древнее солнце \\\n", "1 Солнце находит себе глаза \n", "2 Глаза светил: прохлада и ночь \n", "3 Одноглазые светила \n", "4 Солнце, Луна и глаза чудовища \n", "... ... \n", "2559 Зиму узнают по инею, лето по дождю \n", "2560 Хлестнуть коня \n", "2561 Три яблока \n", "2562 Сказочник вместо ядра \n", "2563 Кто приближается? \n", "\n", " title_english_googleAPI \n", "0 Ancient sun \\\n", "1 The sun finds its eyes \n", "2 Eyes of the luminaries: coolness and night \n", "3 One-eyed luminaries \n", "4 Sun, moon and monster eyes \n", "... ... \n", "2559 Winter learn by hoarfrost, summer by rain \n", "2560 Whip a horse \n", "2561 Three apples \n", "2562 The storyteller instead of the core \n", "2563 Who is coming? \n", "\n", " desc_eng \n", "0 Another sun, usually less benevolent and/or po... \\\n", "1 The sun gets his bright eye or eyes from an an... \n", "2 Visible sun and/or moon are the Sun's and/or t... \n", "3 The Sun or the Moon have only one eye (the Mun... \n", "4 The Sun and the Moon kill a monster whose eyes... \n", "... ... \n", "2559 Long trips, campaigns, flights or battles are ... \n", "2560 A horse tells his rider to whip him with such ... \n", "2561 Closing formula of the folktale: three apples ... \n", "2562 Closing formula of the folktale: characters pu... \n", "2563 Two persons see a horseman who is ever nearer ... \n", "\n", " desc_russian \n", "0 Другое солнце – обычно менее могущественное ил... 
\\\n", "1 Солнце получает свои сверкающие глаза (глаз) о... \n", "2 Видимое солнце или луна есть их глаза; если бы... \n", "3 Солнце или Месяц одноглаз (мундуруку: слеп) \n", "4 Солнце и Луна убивают чудовище, чьи глаза свет... \n", "... ... \n", "2559 Длительные поездки, походы, полеты или битвы о... \n", "2560 Конь велит всаднику хлестнуть его так сильно, ... \n", "2561 Сказочный текст завершается формулой, сообщающ... \n", "2562 Сказочный текст завершается формулой, сообщающ... \n", "2563 Двое персонажей обсуждают приближение всадника... \n", "\n", " desc_english_googleAPI \n", "0 Another sun — usually less powerful or less be... \n", "1 The sun gets its sparkling eyes (eyes) from th... \n", "2 The visible sun or moon is their eyes; if the ... \n", "3 Sun or Month odnoglaz (Munduruku: blind) \n", "4 The sun and the moon kill a monster whose eyes... \n", "... ... \n", "2559 Long trips, trips, flights or battles are desc... \n", "2560 The horse tells the rider to whip him so hard ... \n", "2561 The fabulous text ends with a formula that sta... \n", "2562 The fabulous text ends with a formula that sta... \n", "2563 Two characters are discussing the approach of ... \n", "\n", "[2564 rows x 7 columns]" ] }, "execution_count": 666, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Motif_Master = pd.read_stata('../../datasets/folklore/MX2021/Original_Files/Motif_Master.dta')\n", "Motif_Master" ] }, { "cell_type": "code", "execution_count": 695, "metadata": {}, "outputs": [], "source": [ "df_motifs = Motif_Master[['motif_id', 'title_english', 'desc_eng']]\n", "df_motifs = df_motifs.rename(\n", " columns={\n", " 'title_english': 'title', \n", " 'desc_eng': 'description'\n", " })" ] }, { "cell_type": "code", "execution_count": 696, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| | motif_id | title | description |
| 0 | a1 | The old sun | Another sun, usually less benevolent and/or po... |
| 1 | a10 | The sun finds its eyes | The sun gets his bright eye or eyes from an an... |
| 2 | a11a | Eyes of the Sun and the Moon: coolness and night | Visible sun and/or moon are the Sun's and/or t... |
| 3 | a11b | One-eyed luminaries | The Sun or the Moon have only one eye (the Mun... |
| 4 | a11c | The Sun, the Moon and monster’s eyes | The Sun and the Moon kill a monster whose eyes... |
| ... | ... | ... | ... |
| 2559 | n5 | They recognize winter by rime, summer by rain | Long trips, campaigns, flights or battles are ... |
| 2560 | n6 | Horse tells to whip him strongly | A horse tells his rider to whip him with such ... |
| 2561 | n7 | Three apples | Closing formula of the folktale: three apples ... |
| 2562 | n8 | Storyteller instead of a cannonball | Closing formula of the folktale: characters pu... |
| 2563 | n9 | Who is coming? | Two persons see a horseman who is ever nearer ... |
\n", "

2564 rows × 3 columns

\n", "
" ], "text/plain": [ " motif_id title \n", "0 a1 The old sun \\\n", "1 a10 The sun finds its eyes \n", "2 a11a Eyes of the Sun and the Moon: coolness and night \n", "3 a11b One-eyed luminaries \n", "4 a11c The Sun, the Moon and monster’s eyes \n", "... ... ... \n", "2559 n5 They recognize winter by rime, summer by rain \n", "2560 n6 Horse tells to whip him strongly \n", "2561 n7 Three apples \n", "2562 n8 Storyteller instead of a cannonball \n", "2563 n9 Who is coming? \n", "\n", " description \n", "0 Another sun, usually less benevolent and/or po... \n", "1 The sun gets his bright eye or eyes from an an... \n", "2 Visible sun and/or moon are the Sun's and/or t... \n", "3 The Sun or the Moon have only one eye (the Mun... \n", "4 The Sun and the Moon kill a monster whose eyes... \n", "... ... \n", "2559 Long trips, campaigns, flights or battles are ... \n", "2560 A horse tells his rider to whip him with such ... \n", "2561 Closing formula of the folktale: three apples ... \n", "2562 Closing formula of the folktale: characters pu... \n", "2563 Two persons see a horseman who is ever nearer ... \n", "\n", "[2564 rows x 3 columns]" ] }, "execution_count": 696, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_motifs" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Some descriptions are blank by mistake. I fill these in manually from http://www.mythologydatabase.com/bd/." ] }, { "cell_type": "code", "execution_count": 697, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| | motif_id | title | description |
| 99 | a8a | The Sun, the Moon and the star: released by th... | |
| 770 | h21a | Not to kill a big fish | |
| 1099 | i97 | Rainbow horse | |
| 1833 | l23b | Transformation into spindle | |
| 1859 | l37a | To get know causes of problems | |
| 2011 | m105a | Make believe killing of children | |
\n", "
" ], "text/plain": [ " motif_id title description\n", "99 a8a The Sun, the Moon and the star: released by th... \n", "770 h21a Not to kill a big fish \n", "1099 i97 Rainbow horse \n", "1833 l23b Transformation into spindle \n", "1859 l37a To get know causes of problems \n", "2011 m105a Make believe killing of children " ] }, "execution_count": 697, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_motifs[df_motifs['description'] == '']" ] }, { "cell_type": "code", "execution_count": 698, "metadata": {}, "outputs": [], "source": [ "df_motifs.loc[99, 'description'] = 'The sun, moon and star (stars) appear as three consecutive and comparable objects/characters in the stories about the abduction and subsequent release of heavenly bodies.'\n", "df_motifs.loc[770, 'description'] = 'The fish is concentrated in a small container, from which the owner takes as much as necessary. Another character opens the receptacle, breaking the rules, and the fish breaks out of it.'\n", "df_motifs.loc[1099, 'description'] = 'The rainbow is an ungulate animal (horse, bull, goat, sheep).'\n", "df_motifs.loc[1833, 'description'] = 'Trying to free himself, the captured character consistently changes his appearance. The last transformation is a small wooden object (usually a spindle that must be broken in half).'\n", "df_motifs.loc[1859, 'description'] = 'On the way to a powerful being, a person meets characters who ask him to ask him questions on their behalf (usually to find out the reason for their misfortunes).'\n", "df_motifs.loc[2011, 'description'] = 'The character hides his children, but tells the other person that he killed them, he believes. 
See M104 motif.'" ] }, { "cell_type": "code", "execution_count": 699, "metadata": {}, "outputs": [], "source": [ "df_motifs.to_csv('motifs.csv', index=False, encoding='utf-8')" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Concept-tagging motifs\n", "\n", "MX use ConceptNet to tag motifs with concepts. This allows them to check, say, whether societies close to high-intensity earthquake regions have more motifs related to earthquakes. " ] }, { "cell_type": "code", "execution_count": 670, "metadata": {}, "outputs": [], "source": [ "Concepts_Tagged_Per_Motif = pd.read_stata('../../datasets/folklore/MX2021/Original_Files/Concepts_Tagged_Per_Motif.dta')" ] }, { "cell_type": "code", "execution_count": 671, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
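| | motif_id | say_related | one_related | go_related | get_related | would_related | know_related | make_related | like_related | think_related | ... | mindful_related | optimum_related | repercussion_related | shabby_related | subjectivity_related | aspiring_related | distorted_related | galley_related | overlapping_related | situational_related |
| 0 | a1 | [] | ['one', 'another'] | [] | [] | [] | [] | [] | [] | [] | ... | [] | [] | [] | [] | [] | [] | [] | [] | [] | [] |
| 1 | a10 | [] | [] | ['get'] | ['get', 'find'] | [] | [] | [] | [] | [] | ... | [] | [] | [] | [] | [] | [] | [] | [] | [] | [] |
| 2 | a11a | [] | [] | [] | [] | ['would'] | [] | [] | [] | [] | ... | [] | [] | [] | [] | [] | [] | [] | [] | [] | [] |
| 3 | a11b | [] | ['one'] | [] | [] | [] | [] | [] | [] | [] | ... | [] | [] | [] | [] | [] | [] | [] | [] | [] | [] |
| 4 | a11c | [] | [] | [] | ['take', 'give'] | [] | [] | ['give'] | [] | [] | ... | [] | [] | [] | [] | [] | [] | [] | [] | [] | [] |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2559 | n5 | ['describe', 'mean', 'know'] | [] | ['get'] | ['get'] | [] | ['know', 'recognize', 'learn'] | [] | ['like', 'similar'] | ['know'] | ... | [] | [] | [] | [] | [] | [] | [] | [] | [] | [] |
| 2560 | n6 | ['tell'] | [] | ['come'] | ['come'] | ['would'] | ['tell'] | [] | [] | [] | ... | [] | [] | [] | [] | [] | [] | [] | [] | [] | [] |
| 2561 | n7 | ['say'] | ['one', 'three', 'least'] | ['get'] | ['get', 'give'] | [] | ['say'] | ['give'] | [] | [] | ... | [] | [] | [] | [] | [] | [] | [] | [] | [] | [] |
| 2562 | n8 | [] | [] | [] | ['arrive'] | [] | [] | ['make'] | [] | [] | ... | [] | [] | [] | [] | [] | [] | [] | [] | [] | [] |
| 2563 | n9 | [] | ['one', 'two'] | ['come'] | ['come'] | [] | [] | [] | [] | [] | ... | [] | [] | [] | [] | [] | [] | [] | [] | [] | [] |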
\n", "

2564 rows × 9887 columns

\n", "
" ], "text/plain": [ " motif_id say_related one_related \n", "0 a1 [] ['one', 'another'] \\\n", "1 a10 [] [] \n", "2 a11a [] [] \n", "3 a11b [] ['one'] \n", "4 a11c [] [] \n", "... ... ... ... \n", "2559 n5 ['describe', 'mean', 'know'] [] \n", "2560 n6 ['tell'] [] \n", "2561 n7 ['say'] ['one', 'three', 'least'] \n", "2562 n8 [] [] \n", "2563 n9 [] ['one', 'two'] \n", "\n", " go_related get_related would_related \n", "0 [] [] [] \\\n", "1 ['get'] ['get', 'find'] [] \n", "2 [] [] ['would'] \n", "3 [] [] [] \n", "4 [] ['take', 'give'] [] \n", "... ... ... ... \n", "2559 ['get'] ['get'] [] \n", "2560 ['come'] ['come'] ['would'] \n", "2561 ['get'] ['get', 'give'] [] \n", "2562 [] ['arrive'] [] \n", "2563 ['come'] ['come'] [] \n", "\n", " know_related make_related like_related \n", "0 [] [] [] \\\n", "1 [] [] [] \n", "2 [] [] [] \n", "3 [] [] [] \n", "4 [] ['give'] [] \n", "... ... ... ... \n", "2559 ['know', 'recognize', 'learn'] [] ['like', 'similar'] \n", "2560 ['tell'] [] [] \n", "2561 ['say'] ['give'] [] \n", "2562 [] ['make'] [] \n", "2563 [] [] [] \n", "\n", " think_related ... mindful_related optimum_related repercussion_related \n", "0 [] ... [] [] [] \\\n", "1 [] ... [] [] [] \n", "2 [] ... [] [] [] \n", "3 [] ... [] [] [] \n", "4 [] ... [] [] [] \n", "... ... ... ... ... ... \n", "2559 ['know'] ... [] [] [] \n", "2560 [] ... [] [] [] \n", "2561 [] ... [] [] [] \n", "2562 [] ... [] [] [] \n", "2563 [] ... [] [] [] \n", "\n", " shabby_related subjectivity_related aspiring_related distorted_related \n", "0 [] [] [] [] \\\n", "1 [] [] [] [] \n", "2 [] [] [] [] \n", "3 [] [] [] [] \n", "4 [] [] [] [] \n", "... ... ... ... ... \n", "2559 [] [] [] [] \n", "2560 [] [] [] [] \n", "2561 [] [] [] [] \n", "2562 [] [] [] [] \n", "2563 [] [] [] [] \n", "\n", " galley_related overlapping_related situational_related \n", "0 [] [] [] \n", "1 [] [] [] \n", "2 [] [] [] \n", "3 [] [] [] \n", "4 [] [] [] \n", "... ... ... ... 
\n", "2559 [] [] [] \n", "2560 [] [] [] \n", "2561 [] [] [] \n", "2562 [] [] [] \n", "2563 [] [] [] \n", "\n", "[2564 rows x 9887 columns]" ] }, "execution_count": 671, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Concepts_Tagged_Per_Motif" ] }, { "cell_type": "code", "execution_count": 672, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| | column |
| 0 | motif_id |
| 1 | say_related |
| 2 | one_related |
| 3 | go_related |
| 4 | get_related |
| ... | ... |
| 9882 | aspiring_related |
| 9883 | distorted_related |
| 9884 | galley_related |
| 9885 | overlapping_related |
| 9886 | situational_related |
\n", "

9887 rows × 1 columns

\n", "
" ], "text/plain": [ " column\n", "0 motif_id\n", "1 say_related\n", "2 one_related\n", "3 go_related\n", "4 get_related\n", "... ...\n", "9882 aspiring_related\n", "9883 distorted_related\n", "9884 galley_related\n", "9885 overlapping_related\n", "9886 situational_related\n", "\n", "[9887 rows x 1 columns]" ] }, "execution_count": 672, "metadata": {}, "output_type": "execute_result" } ], "source": [ "concepts_columns = pd.DataFrame(Concepts_Tagged_Per_Motif.columns, columns=['column'])\n", "concepts_columns" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "MX investigate whether certain concepts are more likely to appear in the motifs of an oral tradition given its linguistic group's physical environment and mode of subsistence.\n", "\n", "- **near earthquake regions**: earthquake\n", "- **cold climates**: frozen, cold, ice, frost, freeze\n", "- **farming societies**: cereal, grain, cob, corn, maize, crop, wheat, flour, rice\n", "- **pastoral societies**: cattle, agriculture, graze, herder, farm, herdsman, livestock, pasture\n", "- **fishing societies**: fish\n", "- **hunting societies**: hunt, chase, deer, scavenger, hunter, pursuit, search, quest" ] }, { "cell_type": "code", "execution_count": 752, "metadata": {}, "outputs": [], "source": [ "words_earthquakes = ['earthquake', 'quake']\n", "words_coldness = ['frozen', 'cold', 'ice', 'frost', 'freeze', 'freezer', 'iceberg']\n", "words_farming = ['cereal', 'grain', 'corn', 'crop', 'wheat', 'flour', 'rice']\n", "words_pastoral = ['cattle', 'agriculture', 'graze', 'herd', 'farm', 'farming', 'farmhouse', 'farmland', 'shepherd', 'herding', 'livestock', 'pasture']\n", "words_fishing = ['fish', 'fishing', 'fisherman', 'fishery']\n", "words_hunting = ['hunt', 'hunting', 'chase', 'deer', 'hunter', 'pursuit', 'search', 'quest']" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Double-check that each word has a matching `<word>_related` column in the concepts list, and adjust the word lists if not. The check matches substrings, so it also returns unrelated columns such as `conquest_related`." 
] }, { "cell_type": "code", "execution_count": 751, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| | column |
| 96 | question_related |
| 230 | research_related |
| 644 | search_related |
| 1024 | researcher_related |
| 1331 | purchase_related |
| 1346 | request_related |
| 2218 | hunt_related |
| 2460 | hunting_related |
| 2526 | deer_related |
| 2596 | chase_related |
| 2655 | hunter_related |
| 2991 | questionnaire_related |
| 3569 | pursuit_related |
| 3676 | quest_related |
| 5679 | questionable_related |
| 5856 | questioning_related |
| 6876 | conquest_related |
\n", "
" ], "text/plain": [ " column\n", "96 question_related\n", "230 research_related\n", "644 search_related\n", "1024 researcher_related\n", "1331 purchase_related\n", "1346 request_related\n", "2218 hunt_related\n", "2460 hunting_related\n", "2526 deer_related\n", "2596 chase_related\n", "2655 hunter_related\n", "2991 questionnaire_related\n", "3569 pursuit_related\n", "3676 quest_related\n", "5679 questionable_related\n", "5856 questioning_related\n", "6876 conquest_related" ] }, "execution_count": 751, "metadata": {}, "output_type": "execute_result" } ], "source": [ "concepts_columns[concepts_columns['column'].str.contains('|'.join(words_hunting))]" ] }, { "cell_type": "code", "execution_count": 746, "metadata": {}, "outputs": [], "source": [ "def prep_concepts(words, concept):\n", "    # Share of each group's motifs tagged with at least one word in `words`\n", "\n", "    def encode(x):\n", "        # '[]' means no tagged words (0); a non-empty list string means tagged (1)\n", "        return int(bool(re.search(r'\\[.+\\]', x)))\n", "\n", "    columns = ['motif_id'] + [w + '_related' for w in words]\n", "    motifs_concept = Concepts_Tagged_Per_Motif[columns].copy()\n", "\n", "    # Convert to 1s and 0s\n", "    motifs_concept[columns[1:]] = motifs_concept[columns[1:]].apply(lambda col: col.map(encode))\n", "\n", "    # Create summary column: 1 if any of the words is tagged\n", "    motifs_concept[concept] = motifs_concept[columns[1:]].max(axis=1)\n", "    motifs_concept = motifs_concept[motifs_concept[concept] == 1]\n", "    motifs_concept = motifs_concept[['motif_id', concept]]\n", "\n", "    # Attach column with concept presence to groups_motifs list\n", "    groups_concept = df_groups_motifs.copy()\n", "    groups_concept = pd.merge(groups_concept, motifs_concept, on='motif_id', how='left')\n", "    groups_concept = groups_concept.fillna(0)\n", "\n", "    # Compute share of motifs with the concept\n", "    groups_concept_sum = groups_concept.groupby('group')[concept].mean().reset_index()\n", "\n", "    return groups_concept_sum\n" ] }, { "cell_type": "code", "execution_count": 747, "metadata": {}, 
"outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
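| | group | earthquakes |
| 0 | Abaza (Abazins) | 0.000000 |
| 1 | Abenaki,Penobscot | 0.031250 |
| 2 | Abkhaz | 0.003279 |
| 3 | Abor,Gallong,Tani | 0.012422 |
| 4 | Aceh | 0.000000 |
| ... | ... | ... |
| 918 | Zaparo | 0.000000 |
| 919 | Zapotec,Chatino | 0.000000 |
| 920 | Zoque | 0.000000 |
| 921 | Zulu,Swasi | 0.000000 |
| 922 | Zuni | 0.000000 |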
\n", "

923 rows × 2 columns

\n", "
" ], "text/plain": [ " group earthquakes\n", "0 Abaza (Abazins) 0.000000\n", "1 Abenaki,Penobscot 0.031250\n", "2 Abkhaz 0.003279\n", "3 Abor,Gallong,Tani 0.012422\n", "4 Aceh 0.000000\n", ".. ... ...\n", "918 Zaparo 0.000000\n", "919 Zapotec,Chatino 0.000000\n", "920 Zoque 0.000000\n", "921 Zulu,Swasi 0.000000\n", "922 Zuni 0.000000\n", "\n", "[923 rows x 2 columns]" ] }, "execution_count": 747, "metadata": {}, "output_type": "execute_result" } ], "source": [ "groups_earthquakes = prep_concepts(words_earthquakes, 'earthquakes')\n", "groups_earthquakes" ] }, { "cell_type": "code", "execution_count": 753, "metadata": {}, "outputs": [], "source": [ "groups_concepts = df_coords.copy()\n", "\n", "groups_concepts = pd.merge(groups_concepts, prep_concepts(words_earthquakes, 'earthquakes'), on='group', how='left')\n", "groups_concepts = pd.merge(groups_concepts, prep_concepts(words_coldness, 'coldness'), on='group', how='left')\n", "groups_concepts = pd.merge(groups_concepts, prep_concepts(words_farming, 'farming'), on='group', how='left')\n", "groups_concepts = pd.merge(groups_concepts, prep_concepts(words_pastoral, 'pastoral'), on='group', how='left')\n", "groups_concepts = pd.merge(groups_concepts, prep_concepts(words_fishing, 'fishing'), on='group', how='left')\n", "groups_concepts = pd.merge(groups_concepts, prep_concepts(words_hunting, 'hunting'), on='group', how='left')\n", "\n", "groups_concepts = groups_concepts.melt(\n", " id_vars=['group', 'longitude', 'latitude'],\n", " value_vars=['earthquakes', 'coldness', 'farming', 'pastoral', 'fishing', 'hunting'],\n", " var_name='concept', \n", " value_name='share'\n", ")" ] }, { "cell_type": "code", "execution_count": 754, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
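| | group | longitude | latitude | concept | share |
| 0 | Abaza (Abazins) | 42.0 | 44.2 | earthquakes | 0.000000 |
| 1 | Abenaki,Penobscot | -70.5 | 44.5 | earthquakes | 0.031250 |
| 2 | Abkhaz | 40.8 | 43.2 | earthquakes | 0.003279 |
| 3 | Abor,Gallong,Tani | 95.0 | 28.5 | earthquakes | 0.012422 |
| 4 | Aceh | 95.6 | 5.3 | earthquakes | 0.000000 |
| ... | ... | ... | ... | ... | ... |
| 5533 | Zaparo | -75.0 | -2.5 | hunting | 0.285714 |
| 5534 | Zapotec,Chatino | -96.5 | 16.5 | hunting | 0.166667 |
| 5535 | Zoque | -92.5 | 16.5 | hunting | 0.109091 |
| 5536 | Zulu,Swasi | 30.5 | -28.5 | hunting | 0.092593 |
| 5537 | Zuni | -109.0 | 35.0 | hunting | 0.065421 |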
\n", "

5538 rows × 5 columns

\n", "
" ], "text/plain": [ " group longitude latitude concept share\n", "0 Abaza (Abazins) 42.0 44.2 earthquakes 0.000000\n", "1 Abenaki,Penobscot -70.5 44.5 earthquakes 0.031250\n", "2 Abkhaz 40.8 43.2 earthquakes 0.003279\n", "3 Abor,Gallong,Tani 95.0 28.5 earthquakes 0.012422\n", "4 Aceh 95.6 5.3 earthquakes 0.000000\n", "... ... ... ... ... ...\n", "5533 Zaparo -75.0 -2.5 hunting 0.285714\n", "5534 Zapotec,Chatino -96.5 16.5 hunting 0.166667\n", "5535 Zoque -92.5 16.5 hunting 0.109091\n", "5536 Zulu,Swasi 30.5 -28.5 hunting 0.092593\n", "5537 Zuni -109.0 35.0 hunting 0.065421\n", "\n", "[5538 rows x 5 columns]" ] }, "execution_count": 754, "metadata": {}, "output_type": "execute_result" } ], "source": [ "groups_concepts" ] }, { "cell_type": "code", "execution_count": 755, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.5" ] }, "execution_count": 755, "metadata": {}, "output_type": "execute_result" } ], "source": [ "groups_concepts['share'].max()" ] }, { "cell_type": "code", "execution_count": 756, "metadata": {}, "outputs": [], "source": [ "groups_concepts.to_csv('groups_concepts.csv', index=False)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Folklore and contemporary beliefs" ] }, { "cell_type": "code", "execution_count": 757, "metadata": {}, "outputs": [], "source": [ "Country_Regressions_Ready = pd.read_stata('../../datasets/folklore/MX2021/Replication_Tables_Figures/Country_Regressions_Ready.dta')" ] }, { "cell_type": "code", "execution_count": 761, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| | cntry | lrgdpch2010 | lnp06_18pc | lnavgy06_18 | fem19 | trust_wvsavg | lntrust_wvsavg | risktaking | trust_gps | patience | ... | harm_vice | fair_vice | ingroup_vice | auth_vice | purity_vice | harm_virtue | fair_virtue | ingroup_virtue | auth_virtue | purity_virtue |
| 0 | AFG | 6.955211 | NaN | -1.678959 | 48.848999 | NaN | NaN | 0.120764 | 0.315964 | -0.201360 | ... | 42.266427 | 0.952258 | 14.069490 | 3.765717 | 7.757953 | 3.237963 | 2.588764 | 42.566406 | 23.111392 | 3.172359 |
| 1 | AGO | 8.538473 | NaN | NaN | 75.372002 | NaN | NaN | NaN | NaN | NaN | ... | 23.754314 | 1.250721 | 9.719039 | 1.912417 | 8.483376 | 3.865623 | 0.673173 | 25.869074 | 10.680665 | 3.299046 |
| 2 | AIA | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 40.000000 | 0.000000 | 24.000000 | 5.000000 | 11.000000 | 4.000000 | 3.000000 | 65.000000 | 40.000000 | 5.000000 |
| 3 | ALB | 8.797417 | -0.805970 | 0.162193 | 47.081001 | 1.192248 | 0.175841 | NaN | NaN | NaN | ... | 63.299846 | 0.115660 | 25.964083 | 8.133978 | 14.429597 | 8.202667 | 3.342797 | 80.812782 | 48.338948 | 5.107501 |
| 4 | AND | NaN | 0.211741 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 82.000516 | 0.000000 | 27.000358 | 9.000020 | 18.000079 | 9.000000 | 4.000040 | 111.000854 | 59.000357 | 8.000060 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 194 | WSM | NaN | -0.636054 | -0.144895 | 23.587999 | NaN | NaN | NaN | NaN | NaN | ... | 6.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 | 0.000000 | 0.000000 | 8.000000 | 3.000000 | 0.000000 |
| 195 | YEM | 7.780259 | -2.526622 | NaN | 5.827000 | 1.403987 | 0.339316 | NaN | NaN | NaN | ... | 41.246887 | 1.000000 | 14.811722 | 6.000000 | 7.094139 | 1.282417 | 2.094139 | 51.776191 | 28.964469 | 2.905861 |
| 196 | ZAF | 8.924416 | 0.369796 | 2.071975 | 48.768002 | 1.212878 | 0.192996 | 0.970596 | -0.166918 | 0.057912 | ... | 29.194306 | 0.536004 | 12.453652 | 3.678451 | 7.425861 | 4.298234 | 1.763692 | 41.109678 | 19.954455 | 3.224105 |
| 197 | ZMB | 7.324647 | -2.477074 | 0.008516 | 70.785004 | 1.115467 | 0.109273 | NaN | NaN | NaN | ... | 10.663001 | 0.902600 | 4.673161 | 2.997479 | 3.888401 | 1.197730 | 0.437495 | 14.061597 | 6.331164 | 1.679670 |
| 198 | ZWE | 5.765326 | -2.742053 | 0.691748 | 78.733002 | 1.087762 | 0.084122 | 0.523195 | -0.509133 | -0.238587 | ... | 18.091701 | 1.646761 | 7.775430 | 2.343199 | 4.236911 | 4.851816 | 1.968671 | 20.617171 | 7.614549 | 1.679720 |
\n", "

199 rows × 9146 columns

\n", "
" ], "text/plain": [ " cntry lrgdpch2010 lnp06_18pc lnavgy06_18 fem19 trust_wvsavg \n", "0 AFG 6.955211 NaN -1.678959 48.848999 NaN \\\n", "1 AGO 8.538473 NaN NaN 75.372002 NaN \n", "2 AIA NaN NaN NaN NaN NaN \n", "3 ALB 8.797417 -0.805970 0.162193 47.081001 1.192248 \n", "4 AND NaN 0.211741 NaN NaN NaN \n", ".. ... ... ... ... ... ... \n", "194 WSM NaN -0.636054 -0.144895 23.587999 NaN \n", "195 YEM 7.780259 -2.526622 NaN 5.827000 1.403987 \n", "196 ZAF 8.924416 0.369796 2.071975 48.768002 1.212878 \n", "197 ZMB 7.324647 -2.477074 0.008516 70.785004 1.115467 \n", "198 ZWE 5.765326 -2.742053 0.691748 78.733002 1.087762 \n", "\n", " lntrust_wvsavg risktaking trust_gps patience ... harm_vice \n", "0 NaN 0.120764 0.315964 -0.201360 ... 42.266427 \\\n", "1 NaN NaN NaN NaN ... 23.754314 \n", "2 NaN NaN NaN NaN ... 40.000000 \n", "3 0.175841 NaN NaN NaN ... 63.299846 \n", "4 NaN NaN NaN NaN ... 82.000516 \n", ".. ... ... ... ... ... ... \n", "194 NaN NaN NaN NaN ... 6.000000 \n", "195 0.339316 NaN NaN NaN ... 41.246887 \n", "196 0.192996 0.970596 -0.166918 0.057912 ... 29.194306 \n", "197 0.109273 NaN NaN NaN ... 10.663001 \n", "198 0.084122 0.523195 -0.509133 -0.238587 ... 18.091701 \n", "\n", " fair_vice ingroup_vice auth_vice purity_vice harm_virtue \n", "0 0.952258 14.069490 3.765717 7.757953 3.237963 \\\n", "1 1.250721 9.719039 1.912417 8.483376 3.865623 \n", "2 0.000000 24.000000 5.000000 11.000000 4.000000 \n", "3 0.115660 25.964083 8.133978 14.429597 8.202667 \n", "4 0.000000 27.000358 9.000020 18.000079 9.000000 \n", ".. ... ... ... ... ... 
\n", "194 0.000000 0.000000 0.000000 2.000000 0.000000 \n", "195 1.000000 14.811722 6.000000 7.094139 1.282417 \n", "196 0.536004 12.453652 3.678451 7.425861 4.298234 \n", "197 0.902600 4.673161 2.997479 3.888401 1.197730 \n", "198 1.646761 7.775430 2.343199 4.236911 4.851816 \n", "\n", " fair_virtue ingroup_virtue auth_virtue purity_virtue \n", "0 2.588764 42.566406 23.111392 3.172359 \n", "1 0.673173 25.869074 10.680665 3.299046 \n", "2 3.000000 65.000000 40.000000 5.000000 \n", "3 3.342797 80.812782 48.338948 5.107501 \n", "4 4.000040 111.000854 59.000357 8.000060 \n", ".. ... ... ... ... \n", "194 0.000000 8.000000 3.000000 0.000000 \n", "195 2.094139 51.776191 28.964469 2.905861 \n", "196 1.763692 41.109678 19.954455 3.224105 \n", "197 0.437495 14.061597 6.331164 1.679670 \n", "198 1.968671 20.617171 7.614549 1.679720 \n", "\n", "[199 rows x 9146 columns]" ] }, "execution_count": 761, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Country_Regressions_Ready" ] }, { "cell_type": "code", "execution_count": 762, "metadata": {}, "outputs": [], "source": [ "columns = ['cntry', 'lntrust_wvsavg', 'tricksters_punish', 'risktaking', 'challenge_competition', 'fem19', 'malebias']" ] }, { "cell_type": "code", "execution_count": 772, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "-0.70643896" ] }, "execution_count": 772, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Country_Regressions_Ready['trust_gps'].min()" ] }, { "cell_type": "code", "execution_count": 764, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
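| | cntry | lntrust_wvsavg | tricksters_punish | risktaking | challenge_competition | fem19 | malebias |
| 0 | AFG | NaN | -0.024060 | 0.120764 | 0.064749 | 48.848999 | 0.301265 |
| 1 | AGO | NaN | -0.090911 | NaN | 0.080376 | 75.372002 | 0.081707 |
| 2 | AIA | NaN | -0.005747 | NaN | 0.091954 | NaN | 0.160920 |
| 3 | ALB | 0.175841 | -0.028450 | NaN | 0.053650 | 47.081001 | 0.249465 |
| 4 | AND | NaN | -0.009901 | NaN | 0.072607 | NaN | 0.211221 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 194 | WSM | NaN | 0.022222 | NaN | 0.022222 | 23.587999 | 0.044444 |
| 195 | YEM | 0.339316 | 0.033566 | NaN | 0.075617 | 5.827000 | 0.248730 |
| 196 | ZAF | 0.192996 | -0.021237 | 0.970596 | 0.072578 | 48.768002 | 0.150613 |
| 197 | ZMB | 0.109273 | -0.058759 | NaN | 0.073665 | 70.785004 | 0.102419 |
| 198 | ZWE | 0.084122 | -0.061962 | 0.523195 | 0.090295 | 78.733002 | 0.157501 |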
\n", "

199 rows × 7 columns

\n", "
" ], "text/plain": [ " cntry lntrust_wvsavg tricksters_punish risktaking \n", "0 AFG NaN -0.024060 0.120764 \\\n", "1 AGO NaN -0.090911 NaN \n", "2 AIA NaN -0.005747 NaN \n", "3 ALB 0.175841 -0.028450 NaN \n", "4 AND NaN -0.009901 NaN \n", ".. ... ... ... ... \n", "194 WSM NaN 0.022222 NaN \n", "195 YEM 0.339316 0.033566 NaN \n", "196 ZAF 0.192996 -0.021237 0.970596 \n", "197 ZMB 0.109273 -0.058759 NaN \n", "198 ZWE 0.084122 -0.061962 0.523195 \n", "\n", " challenge_competition fem19 malebias \n", "0 0.064749 48.848999 0.301265 \n", "1 0.080376 75.372002 0.081707 \n", "2 0.091954 NaN 0.160920 \n", "3 0.053650 47.081001 0.249465 \n", "4 0.072607 NaN 0.211221 \n", ".. ... ... ... \n", "194 0.022222 23.587999 0.044444 \n", "195 0.075617 5.827000 0.248730 \n", "196 0.072578 48.768002 0.150613 \n", "197 0.073665 70.785004 0.102419 \n", "198 0.090295 78.733002 0.157501 \n", "\n", "[199 rows x 7 columns]" ] }, "execution_count": 764, "metadata": {}, "output_type": "execute_result" } ], "source": [ "regressions = Country_Regressions_Ready[columns]\n", "regressions" ] }, { "cell_type": "code", "execution_count": 765, "metadata": {}, "outputs": [], "source": [ "regressions.to_csv('regressions.csv', index=False) " ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Obtain OLS coefficients\n", "\n", "In order to plot the trendlines of the cross-country regressions, I run the regressions myself to get the models' parameters." ] }, { "cell_type": "code", "execution_count": 778, "metadata": {}, "outputs": [], "source": [ "df_trust = Country_Regressions_Ready[['cntry', 'lntrust_wvsavg', 'tricksters_punish', 'lnyear_firstpub', 'lnnmbr_title']]\n", "df_trust = df_trust.dropna()" ] }, { "cell_type": "code", "execution_count": 779, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
     cntry  lntrust_wvsavg  tricksters_punish  lnyear_firstpub  lnnmbr_title
3      ALB        0.175841          -0.028450         7.537470      3.978251
6      ARG        0.183864          -0.021799         7.537475      3.611511
7      ARM        0.180377           0.001109         7.536747      3.793288
10     AUS        0.385149          -0.015488         7.532831      3.636905
11     AUT        0.293364          -0.026174         7.520525      4.067886
...    ...             ...                ...              ...           ...
193    VNM        0.390781          -0.022763         7.542804      3.477649
195    YEM        0.339316           0.033566         7.549919      2.764451
196    ZAF        0.192996          -0.021237         7.533932      3.457292
197    ZMB        0.109273          -0.058759         7.556819      2.716290
198    ZWE        0.084122          -0.061962         7.552801      3.221077
\n", "

104 rows × 5 columns

\n", "
" ], "text/plain": [ " cntry lntrust_wvsavg tricksters_punish lnyear_firstpub lnnmbr_title\n", "3 ALB 0.175841 -0.028450 7.537470 3.978251\n", "6 ARG 0.183864 -0.021799 7.537475 3.611511\n", "7 ARM 0.180377 0.001109 7.536747 3.793288\n", "10 AUS 0.385149 -0.015488 7.532831 3.636905\n", "11 AUT 0.293364 -0.026174 7.520525 4.067886\n", ".. ... ... ... ... ...\n", "193 VNM 0.390781 -0.022763 7.542804 3.477649\n", "195 YEM 0.339316 0.033566 7.549919 2.764451\n", "196 ZAF 0.192996 -0.021237 7.533932 3.457292\n", "197 ZMB 0.109273 -0.058759 7.556819 2.716290\n", "198 ZWE 0.084122 -0.061962 7.552801 3.221077\n", "\n", "[104 rows x 5 columns]" ] }, "execution_count": 779, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_trust" ] }, { "cell_type": "code", "execution_count": 783, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
       lntrust_wvsavg  tricksters_punish  lnyear_firstpub  lnnmbr_title
count      104.000000         104.000000       104.000000    104.000000
mean         0.228953          -0.020420         7.540038      3.397068
std          0.107977           0.019649         0.011507      0.588741
min          0.055010          -0.061962         7.487174      1.393201
25%          0.148913          -0.028557         7.535173      3.094192
50%          0.209711          -0.021608         7.538700      3.543206
75%          0.290824          -0.011935         7.546710      3.761305
max          0.519469           0.041841         7.568855      4.599667
\n", "
" ], "text/plain": [ " lntrust_wvsavg tricksters_punish lnyear_firstpub lnnmbr_title\n", "count 104.000000 104.000000 104.000000 104.000000\n", "mean 0.228953 -0.020420 7.540038 3.397068\n", "std 0.107977 0.019649 0.011507 0.588741\n", "min 0.055010 -0.061962 7.487174 1.393201\n", "25% 0.148913 -0.028557 7.535173 3.094192\n", "50% 0.209711 -0.021608 7.538700 3.543206\n", "75% 0.290824 -0.011935 7.546710 3.761305\n", "max 0.519469 0.041841 7.568855 4.599667" ] }, "execution_count": 783, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_trust.describe()" ] }, { "cell_type": "code", "execution_count": 781, "metadata": {}, "outputs": [], "source": [ "reg_trust = LinearRegression().fit(df_trust[['tricksters_punish', 'lnyear_firstpub', 'lnnmbr_title']], df_trust[['lntrust_wvsavg']])" ] }, { "cell_type": "code", "execution_count": 782, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Coefficients: [[ 1.856006 -3.2042735 -0.02046409]]\n", "Intercept: [24.496714]\n" ] } ], "source": [ "print('Coefficients: ', reg_trust.coef_)\n", "print('Intercept: ', reg_trust.intercept_)" ] }, { "cell_type": "code", "execution_count": 784, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
       risktaking  challenge_competition  lnyear_firstpub  lnnmbr_title
count   76.000000              76.000000        76.000000     76.000000
mean     0.012658               0.057511         7.540360      3.384763
std      0.301881               0.015940         0.011473      0.538440
min     -0.792435               0.005366         7.487174      1.393201
25%     -0.157406               0.048737         7.535480      3.074302
50%     -0.019577               0.059116         7.538700      3.416417
75%      0.163387               0.066100         7.549553      3.692096
max      0.970596               0.113599         7.560711      4.599667
\n", "
" ], "text/plain": [ " risktaking challenge_competition lnyear_firstpub lnnmbr_title\n", "count 76.000000 76.000000 76.000000 76.000000\n", "mean 0.012658 0.057511 7.540360 3.384763\n", "std 0.301881 0.015940 0.011473 0.538440\n", "min -0.792435 0.005366 7.487174 1.393201\n", "25% -0.157406 0.048737 7.535480 3.074302\n", "50% -0.019577 0.059116 7.538700 3.416417\n", "75% 0.163387 0.066100 7.549553 3.692096\n", "max 0.970596 0.113599 7.560711 4.599667" ] }, "execution_count": 784, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_risk = Country_Regressions_Ready[['cntry', 'risktaking', 'challenge_competition', 'lnyear_firstpub', 'lnnmbr_title']]\n", "df_risk = df_risk.dropna()\n", "df_risk.describe()" ] }, { "cell_type": "code", "execution_count": 787, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Coefficients: [[ 5.438077 -2.2783062 -0.23996948]]\n", "Intercept: [17.6914]\n" ] } ], "source": [ "reg_risk = LinearRegression().fit(df_risk[['challenge_competition', 'lnyear_firstpub', 'lnnmbr_title']], df_risk[['risktaking']])\n", "print('Coefficients: ', reg_risk.coef_)\n", "print('Intercept: ', reg_risk.intercept_)" ] }, { "cell_type": "code", "execution_count": 788, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
            fem19    malebias  lnyear_firstpub  lnnmbr_title
count  174.000000  174.000000       174.000000    174.000000
mean    51.511448    0.179793         7.543276      3.181844
std     15.755623    0.054670         0.013249      0.642237
min      5.827000    0.044444         7.487174      1.305195
25%     44.729750    0.142352         7.536789      2.815713
50%     53.423500    0.187226         7.543113      3.257886
75%     60.614251    0.210022         7.551734      3.637969
max     84.160004    0.310007         7.590555      4.599667
\n", "
" ], "text/plain": [ " fem19 malebias lnyear_firstpub lnnmbr_title\n", "count 174.000000 174.000000 174.000000 174.000000\n", "mean 51.511448 0.179793 7.543276 3.181844\n", "std 15.755623 0.054670 0.013249 0.642237\n", "min 5.827000 0.044444 7.487174 1.305195\n", "25% 44.729750 0.142352 7.536789 2.815713\n", "50% 53.423500 0.187226 7.543113 3.257886\n", "75% 60.614251 0.210022 7.551734 3.637969\n", "max 84.160004 0.310007 7.590555 4.599667" ] }, "execution_count": 788, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_fem = Country_Regressions_Ready[['cntry', 'fem19', 'malebias', 'lnyear_firstpub', 'lnnmbr_title']]\n", "df_fem = df_fem.dropna()\n", "df_fem.describe()" ] }, { "cell_type": "code", "execution_count": 789, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Coefficients: [[-111.90181 -12.22385 0.28624266]]\n", "Intercept: [162.92767]\n" ] } ], "source": [ "reg_fem = LinearRegression().fit(df_fem[['malebias', 'lnyear_firstpub', 'lnnmbr_title']], df_fem[['fem19']])\n", "print('Coefficients: ', reg_fem.coef_)\n", "print('Intercept: ', reg_fem.intercept_)" ] } ], "metadata": { "kernelspec": { "display_name": "twopoints-venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.9" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 2 }