# Cleaning scripts

## Harmonize MX and Dmitry Nikolayev's datasets

The replication datasets of Michalopoulos and Xue (MX) contain motif data on 958 ethnic groups, taken from the 2019 version of Yuri Berezkin's catalog. However, they do not include location data for these groups. Independently of MX, Dmitry Nikolayev (DN) provides coordinates for the ethnic groups [here](https://github.com/macleginn/mythology-queries). However, his dataset covers only 926 ethnic groups and appears to be based on an older version of Berezkin's catalog.

This notebook reconciles the MX and DN datasets to the greatest extent possible. 

In [774]:
import pandas as pd
import re
from sklearn.linear_model import LinearRegression

Load the MX dataset.

In [640]:
Motifs_Berezkin_groups = pd.read_stata('../../datasets/folklore/MX2021/Original_Files/Motifs_Berezkin_groups.dta')
df = Motifs_Berezkin_groups.drop(
    ['oid', 'motifs_total', 'nmbr_author', 'nmbr_language', 'nmbr_publisher', 'nmbr_title', 'year_firstpub', 'year_avgpub'], 
    axis=1
)
df

Unnamed: 0,group_Berezkin,a1,a10,a11a,a11b,a11c,a12,a12a,a12b,a12c,...,n28,n29,n3,n30,n4,n5,n6,n7,n8,n9
0,Abaza (Abazins),0,0,0,0,0,1,0,0,1,...,0,0,1,0,0,0,0,0,0,0
1,Abkhaz,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,1,0,0,0
2,Aceh,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Ache,0,0,0,1,0,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Achomavi,0,0,0,0,0,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
953,Teleut,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
954,Central Yakuts,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
955,Arabs: Iraq,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
956,Liaoning and Jilin Chinese,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Load Nikolayev's coordinates dataset.

In [641]:
coords = pd.read_json('../../datasets/folklore/coords.json')
coords

Unnamed: 0,Longitude,Latitude,Name
0,20.0,-26.0,Bushmen
1,21.0,-32.0,Khoikhoi
2,26.5,-32.5,Xhosa
3,30.5,-28.5,"Zulu,Swasi"
4,26.5,-27.5,"Sotho, Tswana"
...,...,...,...
921,-72.5,-39.0,Mapuche
922,-67.5,-42.0,Puelche
923,-69.0,-47.0,Tehuelche
924,-68.5,-54.5,Selknam


### Match MX to coords and aggregate

Certain groups in MX are more disaggregated than in `coords`. For example, "Yakut" in `coords` is disaggregated as follows:

In [642]:
df[df['group_Berezkin'].str.contains('Yakut')]

Unnamed: 0,group_Berezkin,a1,a10,a11a,a11b,a11c,a12,a12a,a12b,a12c,...,n28,n29,n3,n30,n4,n5,n6,n7,n8,n9
941,"NW Yakuts (Yessey,Anabar,Olenyok, Lower Lena)",0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
945,"NE Yakuts (Yana,Indigirka,Kolyma)",0,0,0,0,0,1,1,0,0,...,0,0,0,0,0,1,0,0,0,0
950,"Western Yakuts (Olyokma,Vilyuy)",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
954,Central Yakuts,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0


In these cases, we rename the MX groups to match the `coords` names and then aggregate. I have prepared the lookup table `match_names_to_coords` for this task.

In [643]:
match_names_to_coords = pd.read_csv('match_names_to_coords.csv')
match_names_to_coords

Unnamed: 0,mx_name,coords_name
0,Wolof,"Fulbe,Wolof,Serer"
1,Serer,"Fulbe,Wolof,Serer"
2,Japan AD 700-1700,Japan
3,Japanese folklore,Japan
4,"NW Yakuts (Yessey,Anabar,Olenyok, Lower Lena)",Yakut
5,Central Yakuts,Yakut
6,"NE Yakuts (Yana,Indigirka,Kolyma)",Yakut
7,"Western Yakuts (Olyokma,Vilyuy)",Yakut
8,Trans NG East Highlands,Trans New Guinea East
9,Trans NG East Lowlands North,Trans New Guinea East


In [644]:
dict_to_coords = match_names_to_coords.set_index('mx_name').to_dict()['coords_name']
df_to_coords = df.copy()
df_to_coords['group_Berezkin'] = df_to_coords['group_Berezkin'].replace(dict_to_coords)

In [645]:
df_to_coords[df_to_coords['group_Berezkin'].str.contains('Yakut')]

Unnamed: 0,group_Berezkin,a1,a10,a11a,a11b,a11c,a12,a12a,a12b,a12c,...,n28,n29,n3,n30,n4,n5,n6,n7,n8,n9
941,Yakut,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
945,Yakut,0,0,0,0,0,1,1,0,0,...,0,0,0,0,0,1,0,0,0,0
950,Yakut,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
954,Yakut,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0


In [646]:
df_to_coords.dtypes

group_Berezkin      object
a1                category
a10               category
a11a              category
a11b              category
                    ...   
n5                category
n6                category
n7                category
n8                category
n9                category
Length: 2565, dtype: object

To aggregate, we first need to convert the categorical columns to numeric columns.

In [647]:
cols = [i for i in df_to_coords.columns if i not in ["group_Berezkin"]]
for col in cols:
    df_to_coords[col] = pd.to_numeric(df_to_coords[col])

In [648]:
df_to_coords.dtypes

group_Berezkin    object
a1                 int64
a10                int64
a11a               int64
a11b               int64
                   ...  
n5                 int64
n6                 int64
n7                 int64
n8                 int64
n9                 int64
Length: 2565, dtype: object
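
As a small illustration of why this conversion matters, the sketch below uses a made-up 0/1 column rather than the real motif data. It assumes the categorical columns are unordered, in which case pandas refuses to take a maximum until the values are numeric.

```python
import pandas as pd

# Hypothetical 0/1 flags stored as an unordered categorical column
flags = pd.Series([0, 1, 0], dtype='category')

try:
    flags.max()
except TypeError as err:
    # Unordered categoricals do not support max()
    print('max() failed:', err)

# After conversion the column is numeric and can be aggregated
numeric_flags = pd.to_numeric(flags)
print(numeric_flags.max())
```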

In [649]:
df_to_coords = df_to_coords.groupby('group_Berezkin').max().reset_index()
df_to_coords.shape

(947, 2565)

In [650]:
df_to_coords[df_to_coords['group_Berezkin'].str.contains('Yakut')]

Unnamed: 0,group_Berezkin,a1,a10,a11a,a11b,a11c,a12,a12a,a12b,a12c,...,n28,n29,n3,n30,n4,n5,n6,n7,n8,n9
912,Yakut,0,0,0,0,0,1,1,0,0,...,0,1,0,0,0,1,0,0,0,0
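
The rename-then-aggregate step can be reproduced on a toy frame. The two motif columns and their values below are invented for illustration. Taking the maximum within each group marks a motif as present for the merged group if any of its subgroups has it.

```python
import pandas as pd

# Toy motif table: two Yakut subgroups and one unrelated group (made-up values)
toy = pd.DataFrame({
    'group_Berezkin': ['Central Yakuts', 'NE Yakuts', 'Aceh'],
    'a1': [0, 1, 0],
    'a2': [1, 0, 0],
})

# Map the disaggregated names onto the coarser name used in `coords`
toy_names = {'Central Yakuts': 'Yakut', 'NE Yakuts': 'Yakut'}
toy['group_Berezkin'] = toy['group_Berezkin'].replace(toy_names)

# One row per group; a motif is present if any subgroup has it
toy_agg = toy.groupby('group_Berezkin').max().reset_index()
print(toy_agg)
```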


### Match coords to MX

Next, where a group has a one-to-one match between MX and `coords` but under a different name, we change the name in `coords` to follow MX. I have prepared the `match_names_to_mx` dictionary for this.

In [651]:
match_names_to_mx = pd.read_csv('match_names_to_mx.csv')
match_names_to_mx = match_names_to_mx.set_index('coords_name').to_dict()['mx_name']

In [652]:
coords['Name'] = coords['Name'].replace(match_names_to_mx)

In [653]:
groups = df_to_coords[['group_Berezkin']].copy()
df_coords = pd.merge(groups, coords, left_on=['group_Berezkin'], right_on=['Name'], how='outer')
df_coords.shape

(950, 4)

The following groups have no clear match in the other dataset and will be discarded.

In [654]:
df_coords[df_coords['group_Berezkin'].isna() | df_coords['Name'].isna()]

Unnamed: 0,group_Berezkin,Longitude,Latitude,Name
24,Almora (Rangkas),,,
40,Arabs (literary tradition),,,
253,Fujian Chinese,,,
254,Fula (Pular),,,
258,Galicians,,,
274,"Gulf: Kuwait,Bahrain,Qatar,Oman",,,
286,Henan Chinese,,,
291,Himachali Pahari,,,
304,"Iban,Bidayu,Sakarram",,,
306,Icelanders (after A.D. 1800),,,
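
An alternative way to list unmatched rows is the `indicator` option of `pd.merge`, which records the side each row came from. The sketch below uses a three-row toy example, not the full datasets.

```python
import pandas as pd

# Toy inputs: 'Galicians' has no coordinates, 'Bushmen' has no motif row
left = pd.DataFrame({'group_Berezkin': ['Abkhaz', 'Aceh', 'Galicians']})
right = pd.DataFrame({
    'Name': ['Abkhaz', 'Aceh', 'Bushmen'],
    'Longitude': [40.8, 95.6, 20.0],
    'Latitude': [43.2, 5.3, -26.0],
})

# indicator=True adds a `_merge` column: 'both', 'left_only' or 'right_only'
merged = pd.merge(left, right, left_on='group_Berezkin', right_on='Name',
                  how='outer', indicator=True)

# Rows present on only one side are the unmatched groups
unmatched = merged[merged['_merge'] != 'both']
print(unmatched[['group_Berezkin', 'Name', '_merge']])
```

The `_merge` column makes it explicit whether a group is missing coordinates or missing motif data, which the `isna()` filter only shows indirectly.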


In [655]:
df_coords = df_coords.dropna()
df_coords = df_coords.drop(['Name'], axis=1)
df_coords = df_coords.rename(
    columns={
        'group_Berezkin': 'group', 
        'Longitude': 'longitude',
        'Latitude': 'latitude'
    })
df_coords.shape

(923, 3)

In [656]:
df_coords.head()

Unnamed: 0,group,longitude,latitude
0,Abaza (Abazins),42.0,44.2
1,"Abenaki,Penobscot",-70.5,44.5
2,Abkhaz,40.8,43.2
3,"Abor,Gallong,Tani",95.0,28.5
4,Aceh,95.6,5.3


### Merge and export

The final list contains 923 groups.

In [657]:
df_to_coords.shape

(947, 2565)

In [658]:
df_groups_motifs = pd.merge(df_to_coords, df_coords, left_on=['group_Berezkin'], right_on=['group'], how='outer').reset_index(drop=True)
df_groups_motifs.dropna(subset=['group'], inplace=True)
df_groups_motifs.shape

(923, 2568)

In [659]:
df_groups_motifs.drop(['group_Berezkin', 'longitude', 'latitude'], axis=1, inplace=True)
group_col = df_groups_motifs.pop('group')
df_groups_motifs.insert(0, 'group', group_col)

In [661]:
df_groups_motifs = pd.melt(df_groups_motifs, id_vars=['group'], var_name='motif_id', value_name='present')
df_groups_motifs = df_groups_motifs.fillna(0)

In [662]:
df_groups_motifs = df_groups_motifs[df_groups_motifs['present'] == 1]
df_groups_motifs = df_groups_motifs.drop(['present'], axis=1)
df_groups_motifs.to_csv('groups_motifs.csv')
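
The wide-to-long reshaping above can be sketched on a toy table with made-up values. Each remaining row of the result is one (group, motif) pair for which the motif is present.

```python
import pandas as pd

# Toy wide table: one row per group, one 0/1 column per motif (made-up values)
wide = pd.DataFrame({
    'group': ['Abkhaz', 'Aceh'],
    'a1': [0, 1],
    'a12': [1, 1],
})

# Reshape to one row per (group, motif) pair
long_df = pd.melt(wide, id_vars=['group'], var_name='motif_id', value_name='present')

# Keep only the pairs where the motif is present, then drop the flag
long_df = long_df[long_df['present'] == 1].drop(columns='present').reset_index(drop=True)
print(long_df)
```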

In [663]:
df_coords.to_json('coords_clean.json', orient='records')

## Clean and export motifs list

There are a total of 2564 motifs.

In [666]:
Motif_Master = pd.read_stata('../../datasets/folklore/MX2021/Original_Files/Motif_Master.dta')
Motif_Master

Unnamed: 0,motif_id,title_english,title_russian,title_english_googleAPI,desc_eng,desc_russian,desc_english_googleAPI
0,a1,The old sun,Древнее солнце,Ancient sun,"Another sun, usually less benevolent and/or po...",Другое солнце – обычно менее могущественное ил...,Another sun — usually less powerful or less be...
1,a10,The sun finds its eyes,Солнце находит себе глаза,The sun finds its eyes,The sun gets his bright eye or eyes from an an...,Солнце получает свои сверкающие глаза (глаз) о...,The sun gets its sparkling eyes (eyes) from th...
2,a11a,Eyes of the Sun and the Moon: coolness and night,Глаза светил: прохлада и ночь,Eyes of the luminaries: coolness and night,Visible sun and/or moon are the Sun's and/or t...,Видимое солнце или луна есть их глаза; если бы...,The visible sun or moon is their eyes; if the ...
3,a11b,One-eyed luminaries,Одноглазые светила,One-eyed luminaries,The Sun or the Moon have only one eye (the Mun...,Солнце или Месяц одноглаз (мундуруку: слеп),Sun or Month odnoglaz (Munduruku: blind)
4,a11c,"The Sun, the Moon and monster’s eyes","Солнце, Луна и глаза чудовища","Sun, moon and monster eyes",The Sun and the Moon kill a monster whose eyes...,"Солнце и Луна убивают чудовище, чьи глаза свет...",The sun and the moon kill a monster whose eyes...
...,...,...,...,...,...,...,...
2559,n5,"They recognize winter by rime, summer by rain","Зиму узнают по инею, лето по дождю","Winter learn by hoarfrost, summer by rain","Long trips, campaigns, flights or battles are ...","Длительные поездки, походы, полеты или битвы о...","Long trips, trips, flights or battles are desc..."
2560,n6,Horse tells to whip him strongly,Хлестнуть коня,Whip a horse,A horse tells his rider to whip him with such ...,"Конь велит всаднику хлестнуть его так сильно, ...",The horse tells the rider to whip him so hard ...
2561,n7,Three apples,Три яблока,Three apples,Closing formula of the folktale: three apples ...,"Сказочный текст завершается формулой, сообщающ...",The fabulous text ends with a formula that sta...
2562,n8,Storyteller instead of a cannonball,Сказочник вместо ядра,The storyteller instead of the core,Closing formula of the folktale: characters pu...,"Сказочный текст завершается формулой, сообщающ...",The fabulous text ends with a formula that sta...


In [695]:
df_motifs = Motif_Master[['motif_id', 'title_english', 'desc_eng']]
df_motifs = df_motifs.rename(
    columns={
        'title_english': 'title', 
        'desc_eng': 'description'
    })

In [696]:
df_motifs

Unnamed: 0,motif_id,title,description
0,a1,The old sun,"Another sun, usually less benevolent and/or po..."
1,a10,The sun finds its eyes,The sun gets his bright eye or eyes from an an...
2,a11a,Eyes of the Sun and the Moon: coolness and night,Visible sun and/or moon are the Sun's and/or t...
3,a11b,One-eyed luminaries,The Sun or the Moon have only one eye (the Mun...
4,a11c,"The Sun, the Moon and monster’s eyes",The Sun and the Moon kill a monster whose eyes...
...,...,...,...
2559,n5,"They recognize winter by rime, summer by rain","Long trips, campaigns, flights or battles are ..."
2560,n6,Horse tells to whip him strongly,A horse tells his rider to whip him with such ...
2561,n7,Three apples,Closing formula of the folktale: three apples ...
2562,n8,Storyteller instead of a cannonball,Closing formula of the folktale: characters pu...


Some descriptions are blank by mistake. I fill these in manually from http://www.mythologydatabase.com/bd/.

In [697]:
df_motifs[df_motifs['description'] == '']

Unnamed: 0,motif_id,title,description
99,a8a,"The Sun, the Moon and the star: released by th...",
770,h21a,Not to kill a big fish,
1099,i97,Rainbow horse,
1833,l23b,Transformation into spindle,
1859,l37a,To get know causes of problems,
2011,m105a,Make believe killing of children,


In [698]:
df_motifs.loc[99, 'description'] = 'The sun, moon and star (stars) appear as three consecutive and comparable objects/characters in the stories about the abduction and subsequent release of heavenly bodies.'
df_motifs.loc[770, 'description'] = 'The fish is concentrated in a small container, from which the owner takes as much as necessary. Another character opens the receptacle, breaking the rules, and the fish breaks out of it.'
df_motifs.loc[1099, 'description'] = 'The rainbow is an ungulate animal (horse, bull, goat, sheep).'
df_motifs.loc[1833, 'description'] = 'Trying to free himself, the captured character consistently changes his appearance. The last transformation is a small wooden object (usually a spindle that must be broken in half).'
df_motifs.loc[1859, 'description'] = 'On the way to a powerful being, a person meets characters who ask him to ask him questions on their behalf (usually to find out the reason for their misfortunes).'
df_motifs.loc[2011, 'description'] = 'The character hides his children, but tells the other person that he killed them, he believes. See M104 motif.'
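
The assignments above rely on row positions staying fixed. A possible alternative, sketched here on a toy frame, keys the replacement text on `motif_id` instead. The `fixes` dictionary and the placeholder descriptions for `a1` and `n7` are hypothetical.

```python
import pandas as pd

# Toy motif list with one blank description (texts for a1 and n7 are placeholders)
motifs = pd.DataFrame({
    'motif_id': ['a1', 'i97', 'n7'],
    'description': ['Another sun.', '', 'Three apples.'],
})

# Hypothetical lookup of replacement text keyed by motif_id
fixes = {'i97': 'The rainbow is an ungulate animal (horse, bull, goat, sheep).'}

# Fill only the blank rows; ids without a fix stay blank
blank = motifs['description'] == ''
motifs.loc[blank, 'description'] = motifs.loc[blank, 'motif_id'].map(fixes).fillna('')
print(motifs)
```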

In [699]:
df_motifs.to_csv('motifs.csv', index=False, encoding='utf-8')

## Concept-tagging motifs

MX use ConceptNet to tag motifs with concepts. This allows them to check, say, whether societies close to high-intensity earthquake regions have more motifs related to earthquakes. 

In [670]:
Concepts_Tagged_Per_Motif = pd.read_stata('../../datasets/folklore/MX2021/Original_Files/Concepts_Tagged_Per_Motif.dta')

In [671]:
Concepts_Tagged_Per_Motif

Unnamed: 0,motif_id,say_related,one_related,go_related,get_related,would_related,know_related,make_related,like_related,think_related,...,mindful_related,optimum_related,repercussion_related,shabby_related,subjectivity_related,aspiring_related,distorted_related,galley_related,overlapping_related,situational_related
0,a1,[],"['one', 'another']",[],[],[],[],[],[],[],...,[],[],[],[],[],[],[],[],[],[]
1,a10,[],[],['get'],"['get', 'find']",[],[],[],[],[],...,[],[],[],[],[],[],[],[],[],[]
2,a11a,[],[],[],[],['would'],[],[],[],[],...,[],[],[],[],[],[],[],[],[],[]
3,a11b,[],['one'],[],[],[],[],[],[],[],...,[],[],[],[],[],[],[],[],[],[]
4,a11c,[],[],[],"['take', 'give']",[],[],['give'],[],[],...,[],[],[],[],[],[],[],[],[],[]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2559,n5,"['describe', 'mean', 'know']",[],['get'],['get'],[],"['know', 'recognize', 'learn']",[],"['like', 'similar']",['know'],...,[],[],[],[],[],[],[],[],[],[]
2560,n6,['tell'],[],['come'],['come'],['would'],['tell'],[],[],[],...,[],[],[],[],[],[],[],[],[],[]
2561,n7,['say'],"['one', 'three', 'least']",['get'],"['get', 'give']",[],['say'],['give'],[],[],...,[],[],[],[],[],[],[],[],[],[]
2562,n8,[],[],[],['arrive'],[],[],['make'],[],[],...,[],[],[],[],[],[],[],[],[],[]


In [672]:
concepts_columns = pd.DataFrame(Concepts_Tagged_Per_Motif.columns, columns=['column'])
concepts_columns

Unnamed: 0,column
0,motif_id
1,say_related
2,one_related
3,go_related
4,get_related
...,...
9882,aspiring_related
9883,distorted_related
9884,galley_related
9885,overlapping_related


MX investigate whether certain concepts are more likely to appear in the motifs of an oral tradition given its linguistic group's physical environment and mode of subsistence. They associate the following concepts with each setting:

- **near earthquake regions**: earthquake
- **cold climates**: frozen, cold, ice, frost, freeze
- **farming societies**: cereal, grain, cob, corn, maize, crop, wheat, flour, rice
- **pastoral societies**: cattle, agriculture, graze, herder, farm, herdsman, livestock, pasture
- **fishing societies**: fish
- **hunting societies**: hunt, chase, deer, scavenger, hunter, pursuit, search, quest

In [752]:
words_earthquakes = ['earthquake', 'quake']
words_coldness = ['frozen', 'cold', 'ice', 'frost', 'freeze', 'freezer', 'iceberg']
words_farming = ['cereal', 'grain', 'corn', 'crop', 'wheat', 'flour', 'rice']
words_pastoral = ['cattle', 'agriculture', 'graze', 'herd', 'farm', 'farming', 'farmhouse', 'farmland', 'shepherd', 'herding', 'livestock', 'pasture']
words_fishing = ['fish', 'fishing', 'fisherman', 'fishery']
words_hunting = ['hunt', 'hunting', 'chase', 'deer', 'hunter', 'pursuit', 'search', 'quest']

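Double-check that each word has a matching `<word>_related` column in the concepts list. If not, adjust the word lists accordingly. Note that the substring search below also returns partial matches, such as `question_related` for `quest`.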

In [751]:
concepts_columns[concepts_columns['column'].str.contains('|'.join(words_hunting))]

Unnamed: 0,column
96,question_related
230,research_related
644,search_related
1024,researcher_related
1331,purchase_related
1346,request_related
2218,hunt_related
2460,hunting_related
2526,deer_related
2596,chase_related
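
Because the substring search also returns partial matches, an exact-match check may be easier to read. The sketch below uses a made-up column list and word list. It builds the expected `<word>_related` names and reports which are missing.

```python
import pandas as pd

# Hypothetical subset of column names and a hypothetical word list
columns = pd.Series(['quest_related', 'question_related', 'hunt_related', 'hunting_related'])
words = ['hunt', 'quest', 'scavenger']

# Build the exact column names we expect and test membership
expected = pd.Series([w + '_related' for w in words])
found = expected[expected.isin(columns)]
missing = expected[~expected.isin(columns)]

print('found:', found.tolist())
print('missing:', missing.tolist())
```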


In [746]:
def prep_concepts(words, concept):

    def encode(x):
        # Each cell holds the string form of a list; '[]' means no tagged words
        if x == '[]':
            return 0
        elif re.search(r'\[.+\]', x):
            return 1
    
    columns = [w + '_related' for w in words]
    columns.insert(0, 'motif_id')

    # Select the needed columns, then copy, so the assignments below act on an independent frame
    motifs_concept = Concepts_Tagged_Per_Motif[columns].copy()

    # Convert to 1s and 0s
    motifs_concept[columns[1:]] = motifs_concept[columns[1:]].applymap(encode)

    # Create summary column
    motifs_concept[concept] = motifs_concept[columns[1:]].apply(lambda row: row.max(), axis=1)
    motifs_concept = motifs_concept[motifs_concept[concept] == 1]
    motifs_concept = motifs_concept[['motif_id', concept]]
    
    # Attach column with concept presence to groups_motifs list
    groups_concept = df_groups_motifs.copy()
    groups_concept = pd.merge(groups_concept, motifs_concept, on='motif_id', how='left')
    groups_concept = groups_concept.fillna(0)

    # Compute share of motifs with the concept
    groups_concept_sum = groups_concept.groupby('group')[concept].mean().reset_index()

    return groups_concept_sum
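
To make the logic of `prep_concepts` concrete, here is a reduced sketch on toy data. The motif ids, groups and tags are all invented. It follows the same three steps: encode the tag strings as 0/1, flag a motif if any word column is non-empty, then take the mean flag per group.

```python
import re
import pandas as pd

def encode(x):
    # 1 if the stringified list has any content, 0 for '[]'
    return int(bool(re.search(r'\[.+\]', x)))

# Toy concept tags: m1 and m3 are tagged, m2 is not
tags = pd.DataFrame({
    'motif_id': ['m1', 'm2', 'm3'],
    'fish_related': ["['fish']", '[]', '[]'],
    'fishing_related': ['[]', '[]', "['fishing']"],
})
word_cols = ['fish_related', 'fishing_related']

# A motif carries the concept if any of its word columns is non-empty
tags['fishing'] = tags[word_cols].apply(lambda col: col.map(encode)).max(axis=1)

# Toy (group, motif) pairs: G1 has m1 and m2, G2 has only m2
pairs = pd.DataFrame({'group': ['G1', 'G1', 'G2'], 'motif_id': ['m1', 'm2', 'm2']})

# Share of each group's motifs that carry the concept
shares = (pairs.merge(tags[['motif_id', 'fishing']], on='motif_id', how='left')
               .groupby('group')['fishing'].mean())
print(shares)
```

Here `G1` gets a share of 0.5, since one of its two motifs is tagged, and `G2` gets 0.0.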


In [747]:
groups_earthquakes = prep_concepts(words_earthquakes, 'earthquakes')
groups_earthquakes

Unnamed: 0,group,earthquakes
0,Abaza (Abazins),0.000000
1,"Abenaki,Penobscot",0.031250
2,Abkhaz,0.003279
3,"Abor,Gallong,Tani",0.012422
4,Aceh,0.000000
...,...,...
918,Zaparo,0.000000
919,"Zapotec,Chatino",0.000000
920,Zoque,0.000000
921,"Zulu,Swasi",0.000000


In [753]:
groups_concepts = df_coords.copy()

groups_concepts = pd.merge(groups_concepts, prep_concepts(words_earthquakes, 'earthquakes'), on='group', how='left')
groups_concepts = pd.merge(groups_concepts, prep_concepts(words_coldness, 'coldness'), on='group', how='left')
groups_concepts = pd.merge(groups_concepts, prep_concepts(words_farming, 'farming'), on='group', how='left')
groups_concepts = pd.merge(groups_concepts, prep_concepts(words_pastoral, 'pastoral'), on='group', how='left')
groups_concepts = pd.merge(groups_concepts, prep_concepts(words_fishing, 'fishing'), on='group', how='left')
groups_concepts = pd.merge(groups_concepts, prep_concepts(words_hunting, 'hunting'), on='group', how='left')

groups_concepts = groups_concepts.melt(
    id_vars=['group', 'longitude', 'latitude'],
    value_vars=['earthquakes', 'coldness', 'farming', 'pastoral', 'fishing', 'hunting'],
    var_name='concept', 
    value_name='share'
)

In [754]:
groups_concepts

Unnamed: 0,group,longitude,latitude,concept,share
0,Abaza (Abazins),42.0,44.2,earthquakes,0.000000
1,"Abenaki,Penobscot",-70.5,44.5,earthquakes,0.031250
2,Abkhaz,40.8,43.2,earthquakes,0.003279
3,"Abor,Gallong,Tani",95.0,28.5,earthquakes,0.012422
4,Aceh,95.6,5.3,earthquakes,0.000000
...,...,...,...,...,...
5533,Zaparo,-75.0,-2.5,hunting,0.285714
5534,"Zapotec,Chatino",-96.5,16.5,hunting,0.166667
5535,Zoque,-92.5,16.5,hunting,0.109091
5536,"Zulu,Swasi",30.5,-28.5,hunting,0.092593


In [755]:
groups_concepts['share'].max()

0.5

In [756]:
groups_concepts.to_csv('groups_concepts.csv', index=False)

## Folklore and contemporary beliefs

In [757]:
Country_Regressions_Ready = pd.read_stata('../../datasets/folklore/MX2021/Replication_Tables_Figures/Country_Regressions_Ready.dta')

In [761]:
Country_Regressions_Ready

Unnamed: 0,cntry,lrgdpch2010,lnp06_18pc,lnavgy06_18,fem19,trust_wvsavg,lntrust_wvsavg,risktaking,trust_gps,patience,...,harm_vice,fair_vice,ingroup_vice,auth_vice,purity_vice,harm_virtue,fair_virtue,ingroup_virtue,auth_virtue,purity_virtue
0,AFG,6.955211,,-1.678959,48.848999,,,0.120764,0.315964,-0.201360,...,42.266427,0.952258,14.069490,3.765717,7.757953,3.237963,2.588764,42.566406,23.111392,3.172359
1,AGO,8.538473,,,75.372002,,,,,,...,23.754314,1.250721,9.719039,1.912417,8.483376,3.865623,0.673173,25.869074,10.680665,3.299046
2,AIA,,,,,,,,,,...,40.000000,0.000000,24.000000,5.000000,11.000000,4.000000,3.000000,65.000000,40.000000,5.000000
3,ALB,8.797417,-0.805970,0.162193,47.081001,1.192248,0.175841,,,,...,63.299846,0.115660,25.964083,8.133978,14.429597,8.202667,3.342797,80.812782,48.338948,5.107501
4,AND,,0.211741,,,,,,,,...,82.000516,0.000000,27.000358,9.000020,18.000079,9.000000,4.000040,111.000854,59.000357,8.000060
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
194,WSM,,-0.636054,-0.144895,23.587999,,,,,,...,6.000000,0.000000,0.000000,0.000000,2.000000,0.000000,0.000000,8.000000,3.000000,0.000000
195,YEM,7.780259,-2.526622,,5.827000,1.403987,0.339316,,,,...,41.246887,1.000000,14.811722,6.000000,7.094139,1.282417,2.094139,51.776191,28.964469,2.905861
196,ZAF,8.924416,0.369796,2.071975,48.768002,1.212878,0.192996,0.970596,-0.166918,0.057912,...,29.194306,0.536004,12.453652,3.678451,7.425861,4.298234,1.763692,41.109678,19.954455,3.224105
197,ZMB,7.324647,-2.477074,0.008516,70.785004,1.115467,0.109273,,,,...,10.663001,0.902600,4.673161,2.997479,3.888401,1.197730,0.437495,14.061597,6.331164,1.679670


In [762]:
columns = ['cntry', 'lntrust_wvsavg', 'tricksters_punish', 'risktaking', 'challenge_competition', 'fem19', 'malebias']

In [772]:
Country_Regressions_Ready['trust_gps'].min()

-0.70643896

In [764]:
regressions = Country_Regressions_Ready[columns]
regressions

Unnamed: 0,cntry,lntrust_wvsavg,tricksters_punish,risktaking,challenge_competition,fem19,malebias
0,AFG,,-0.024060,0.120764,0.064749,48.848999,0.301265
1,AGO,,-0.090911,,0.080376,75.372002,0.081707
2,AIA,,-0.005747,,0.091954,,0.160920
3,ALB,0.175841,-0.028450,,0.053650,47.081001,0.249465
4,AND,,-0.009901,,0.072607,,0.211221
...,...,...,...,...,...,...,...
194,WSM,,0.022222,,0.022222,23.587999,0.044444
195,YEM,0.339316,0.033566,,0.075617,5.827000,0.248730
196,ZAF,0.192996,-0.021237,0.970596,0.072578,48.768002,0.150613
197,ZMB,0.109273,-0.058759,,0.073665,70.785004,0.102419


In [765]:
regressions.to_csv('regressions.csv', index=False) 

### Obtain OLS coefficients

To plot the trendlines of the cross-country regressions, I rerun the regressions myself to obtain the model parameters.

In [778]:
df_trust = Country_Regressions_Ready[['cntry', 'lntrust_wvsavg', 'tricksters_punish', 'lnyear_firstpub', 'lnnmbr_title']]
df_trust = df_trust.dropna()

In [779]:
df_trust

Unnamed: 0,cntry,lntrust_wvsavg,tricksters_punish,lnyear_firstpub,lnnmbr_title
3,ALB,0.175841,-0.028450,7.537470,3.978251
6,ARG,0.183864,-0.021799,7.537475,3.611511
7,ARM,0.180377,0.001109,7.536747,3.793288
10,AUS,0.385149,-0.015488,7.532831,3.636905
11,AUT,0.293364,-0.026174,7.520525,4.067886
...,...,...,...,...,...
193,VNM,0.390781,-0.022763,7.542804,3.477649
195,YEM,0.339316,0.033566,7.549919,2.764451
196,ZAF,0.192996,-0.021237,7.533932,3.457292
197,ZMB,0.109273,-0.058759,7.556819,2.716290


In [783]:
df_trust.describe()

Unnamed: 0,lntrust_wvsavg,tricksters_punish,lnyear_firstpub,lnnmbr_title
count,104.0,104.0,104.0,104.0
mean,0.228953,-0.02042,7.540038,3.397068
std,0.107977,0.019649,0.011507,0.588741
min,0.05501,-0.061962,7.487174,1.393201
25%,0.148913,-0.028557,7.535173,3.094192
50%,0.209711,-0.021608,7.5387,3.543206
75%,0.290824,-0.011935,7.54671,3.761305
max,0.519469,0.041841,7.568855,4.599667


In [781]:
reg_trust = LinearRegression().fit(df_trust[['tricksters_punish', 'lnyear_firstpub', 'lnnmbr_title']], df_trust[['lntrust_wvsavg']])

In [782]:
print('Coefficients: ', reg_trust.coef_)
print('Intercept: ', reg_trust.intercept_)

Coefficients:  [[ 1.856006   -3.2042735  -0.02046409]]
Intercept:  [24.496714]
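
One way to turn such coefficients into a plotted trendline is to vary the regressor of interest while holding the controls at their sample means. This is an assumption about how the line is drawn, not something stated in MX. The sketch below uses synthetic data and `numpy.linalg.lstsq` in place of `LinearRegression` to keep it dependency-light.

```python
import numpy as np

# Synthetic data: one regressor of interest (x) and one control (c)
rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
c = rng.normal(size=n)
y = 1.0 + 2.0 * x - 0.5 * c  # noiseless, so the fit recovers the coefficients

# Ordinary least squares with an explicit intercept column
X = np.column_stack([np.ones(n), x, c])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, b_x, b_c = beta

# Trendline: vary x over its range, hold the control at its sample mean
x_grid = np.linspace(x.min(), x.max(), 5)
trend = intercept + b_x * x_grid + b_c * c.mean()
print(trend)
```

With several controls, each contributes its coefficient times its mean to the constant term of the line.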


In [784]:
df_risk = Country_Regressions_Ready[['cntry', 'risktaking', 'challenge_competition', 'lnyear_firstpub', 'lnnmbr_title']]
df_risk = df_risk.dropna()
df_risk.describe()

Unnamed: 0,risktaking,challenge_competition,lnyear_firstpub,lnnmbr_title
count,76.0,76.0,76.0,76.0
mean,0.012658,0.057511,7.54036,3.384763
std,0.301881,0.01594,0.011473,0.53844
min,-0.792435,0.005366,7.487174,1.393201
25%,-0.157406,0.048737,7.53548,3.074302
50%,-0.019577,0.059116,7.5387,3.416417
75%,0.163387,0.0661,7.549553,3.692096
max,0.970596,0.113599,7.560711,4.599667


In [787]:
reg_risk = LinearRegression().fit(df_risk[['challenge_competition', 'lnyear_firstpub', 'lnnmbr_title']], df_risk[['risktaking']])
print('Coefficients: ', reg_risk.coef_)
print('Intercept: ', reg_risk.intercept_)

Coefficients:  [[ 5.438077   -2.2783062  -0.23996948]]
Intercept:  [17.6914]


In [788]:
df_fem = Country_Regressions_Ready[['cntry', 'fem19', 'malebias', 'lnyear_firstpub', 'lnnmbr_title']]
df_fem = df_fem.dropna()
df_fem.describe()

Unnamed: 0,fem19,malebias,lnyear_firstpub,lnnmbr_title
count,174.0,174.0,174.0,174.0
mean,51.511448,0.179793,7.543276,3.181844
std,15.755623,0.05467,0.013249,0.642237
min,5.827,0.044444,7.487174,1.305195
25%,44.72975,0.142352,7.536789,2.815713
50%,53.4235,0.187226,7.543113,3.257886
75%,60.614251,0.210022,7.551734,3.637969
max,84.160004,0.310007,7.590555,4.599667


In [789]:
reg_fem = LinearRegression().fit(df_fem[['malebias', 'lnyear_firstpub', 'lnnmbr_title']], df_fem[['fem19']])
print('Coefficients: ', reg_fem.coef_)
print('Intercept: ', reg_fem.intercept_)

Coefficients:  [[-111.90181     -12.22385       0.28624266]]
Intercept:  [162.92767]
