I. Introduction

This report was prepared at Texas Tech University as the final project for a Multivariate Analysis class by Marcin Grzechowiak, Mikaela Pisani, Roger Valdez and Jereamy Riggs. The dataset comes from Gapminder World, which aggregates data from several sources. The goal is to compare countries from economic, social and governmental perspectives.

The aim of this post is to present the results of the analysis; the complete report can be found here: Github Link. The report describes the variables and their sources as well as the data cleaning process. It also interprets the visualizations presented here in more detail.

II. Data cleaning

The complete code for data loading, merging and cleaning is in the function Read_Clean(): Github Link.

library(readr)
library(readxl)
library(dplyr)
library(countrycode)
library(car)
source('Read_Clean.R')
#cleaned <- Read_Clean()
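
As a rough illustration of the merge-then-clean idea, here is a minimal base R sketch. It is not the authors' Read_Clean() function: the two tables, the country labels "A" to "D" and all values are made-up toy data, and dropping incomplete rows is only one possible cleaning rule.

```r
# Toy indicator tables (hypothetical values, not Gapminder data)
income <- data.frame(country = c("A", "B", "C"),
                     income  = c(1000, 2500, NA))
life   <- data.frame(country  = c("A", "B", "D"),
                     life_exp = c(60, 72, 80))

# Full outer join on the country key keeps every country from both tables
merged <- merge(income, life, by = "country", all = TRUE)

# Assumed cleaning rule: keep only countries with a value for every indicator
cleaned_toy <- merged[complete.cases(merged), ]
print(cleaned_toy)
```

Only countries that appear in both tables with no missing values survive this step, which is why the choice of indicators affects how many countries remain for the analysis.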

III. Visualizations

Dimension Reduction

a. Multidimensional Scaling (MDS)
Code for the graph: Github Link

Continents presented in 3D space

To give a general overview of the data, all countries were plotted in 3-dimensional space. At first glance, clusters of continents can be seen: countries on the same continent generally show a similar profile.
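
The idea of projecting observations into three dimensions can be sketched with classical MDS from base R. This is an illustration under assumptions, not the code behind the graph: the built-in state.x77 table (US states) stands in for the country data, state.region stands in for the continent label, and Euclidean distance on standardized variables is an assumed choice.

```r
# Stand-in data: US state indicators instead of the Gapminder country table
x <- scale(state.x77)        # standardize so no single indicator dominates
d <- dist(x)                 # Euclidean distances between observations (assumed metric)

# Classical (metric) MDS into 3 dimensions
mds <- cmdscale(d, k = 3)

# Attach the grouping label that would be used to colour the 3D plot
coords <- data.frame(mds, region = state.region)
names(coords)[1:3] <- c("Dim1", "Dim2", "Dim3")
head(coords)
```

Plotting Dim1, Dim2 and Dim3 coloured by the grouping label is what makes group-level clusters visible.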

b. Principal Components Analysis (PCA)
Code for the PCA: Github Link

PC1, PC2, and PC3 are interpreted as follows:
PC1: loads highly on variables such as number of phones, life expectancy, corruption index, access to the Internet and income.
PC2: loads highly on the number of suicides and the sex ratio.
PC3: mainly reflects inequality.
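
Interpretations like these come from reading the loadings of each component. A minimal sketch with base R's prcomp() follows; it uses the built-in state.x77 table as a stand-in for the cleaned country data, so the variables and loadings differ from those in the report.

```r
# Stand-in data: US state indicators instead of the cleaned Gapminder table
pca <- prcomp(state.x77, center = TRUE, scale. = TRUE)

# Proportion of variance explained by the first three components
print(summary(pca)$importance[, 1:3])

# Loadings: how strongly each original variable contributes to PC1..PC3
print(round(pca$rotation[, 1:3], 2))
```

Variables with large absolute loadings on a component are the ones used to name it. The sign of a component is arbitrary, so only the relative pattern matters.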

Summary Result for the first three Principal Components.


Plot PC1 vs PC2


Plot PC1 vs PC3


PCA on the World Map
To show which countries rank highest on each principal component, the results were plotted on a world map. For each component (PC1, PC2, and PC3), the 15 countries with the highest scores were selected and plotted.

# PrinCompPlot <- PCA(cleaned)
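
Selecting the top observations per component can be sketched as below. This is not the authors' PCA() function: top_n_by_pc is a hypothetical helper, and state.x77 again stands in for the country data.

```r
# Stand-in data: US state indicators instead of the cleaned Gapminder table
pca    <- prcomp(state.x77, scale. = TRUE)
scores <- pca$x[, 1:3]          # component scores for PC1..PC3

# Hypothetical helper: names of the n observations with the highest score per PC
top_n_by_pc <- function(scores, n = 15) {
  lapply(as.data.frame(scores), function(s) {
    rownames(scores)[order(s, decreasing = TRUE)][seq_len(n)]
  })
}

top <- top_n_by_pc(scores, n = 15)
print(top$PC1)
```

Because the sign of a principal component is arbitrary, it is worth checking the loadings first to confirm that a high score means what the component's label suggests.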

Note: the number of murders, the size of the armed forces and the percentage of investments were excluded from the analysis. These variables correlated weakly with the remaining columns, so many more dimensions would have been needed to explain the data, and the result could not have been represented well on a 2-dimensional plot.



Cluster Analysis

Several cluster analysis techniques were applied to find similarities between continents and countries and to form groups.

Hierarchical Clustering between Continents

Code for the graph: Github Link

library(ape)
source('cluster_continents.R')
# Cl_continents <- cluster_continents(cleaned)

The hierarchical clustering shows three main clusters. The first cluster includes North America, Europe, South America, Central America, and Asia; the second contains only Oceania; and the third only Africa. Interestingly, Oceania is not clustered with developed regions such as North America and Europe, even though it contains countries such as Australia and New Zealand. This is most likely because Oceania also includes many small island nations that are less developed.
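
Clustering at the group level can be sketched by averaging the standardized indicators per group and running hclust() on the group means. This is an illustration only: state.region stands in for continents, and the average linkage and Euclidean distance are assumptions, since the post does not state which the authors used.

```r
# Stand-in data: US states grouped by region instead of countries by continent
x <- as.data.frame(scale(state.x77))

# One row per group: the mean of every standardized indicator
region_means <- aggregate(x, by = list(region = state.region), FUN = mean)
rownames(region_means) <- as.character(region_means$region)

# Agglomerative clustering on the group profiles (assumed: average linkage)
hc <- hclust(dist(region_means[, -1]), method = "average")
# plot(hc)   # draws the dendrogram

# Cut the tree into three clusters
groups <- cutree(hc, k = 3)
print(groups)
```

With only a handful of groups, a single unusual group can end up in its own cluster, much like Oceania and Africa do in the result described above.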



K-means and Model-Based Clustering between countries
Code for the graph: Github Link

library(mclust)
source('clusters_countries.R')
# Cl_countries <- clusters_countries(cleaned)
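
The two approaches can be compared side by side in a short sketch. This is not the authors' clusters_countries() function: state.x77 stands in for the country data, k = 3 is an arbitrary choice, and the model-based step runs only if the mclust package is installed.

```r
set.seed(42)                       # k-means starts from random centers
x <- scale(state.x77)              # stand-in data, standardized

# K-means with an assumed k = 3 and several random restarts
km <- kmeans(x, centers = 3, nstart = 25)
print(table(cluster = km$cluster, region = state.region))

# Model-based clustering: Mclust picks the number of components by BIC
if (requireNamespace("mclust", quietly = TRUE)) {
  library(mclust)
  mb <- Mclust(x, G = 1:5, verbose = FALSE)
  print(table(kmeans = km$cluster, model_based = mb$classification))
}
```

K-means requires the number of clusters up front, while model-based clustering selects it from the data, so the two methods need not agree on how many groups there are.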



IV. Exploratory Factor Analysis (EFA)

Exploratory Factor Analysis links observable variables to unobservable (latent) factors via regression modeling. Because it uncovers relationships between variables without assuming in advance that such relationships exist, it helps determine which factors, and how many, best describe the data.
Code for the graph: Github Link

source('EFA.R')
# EFA_loadings(cleaned)
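
How loadings reveal latent factors can be shown on simulated data with base R's factanal(). This sketch is not the authors' EFA_loadings() function: the two latent factors, the six observed variables v1 to v6 and all coefficients are invented for illustration.

```r
set.seed(1)
n  <- 300
f1 <- rnorm(n)                     # latent factor 1 (unobserved)
f2 <- rnorm(n)                     # latent factor 2 (unobserved)
noise <- function() rnorm(n, sd = 0.5)

# Observed variables: v1..v3 driven by f1, v4..v6 driven by f2
obs <- data.frame(
  v1 = 0.8 * f1 + noise(), v2 = 0.7 * f1 + noise(), v3 = 0.9 * f1 + noise(),
  v4 = 0.8 * f2 + noise(), v5 = 0.7 * f2 + noise(), v6 = 0.9 * f2 + noise()
)

# Maximum-likelihood factor analysis with two factors and varimax rotation
efa <- factanal(obs, factors = 2, rotation = "varimax")
print(efa$loadings, cutoff = 0.3)  # hide small loadings for readability
```

Each factor is then named after the variables that load heavily on it, which is the same reasoning behind factor labels such as "Developed" or "High Inequality".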

Based on the loadings, the factors were interpreted as follows:
- Factor 1: Developed
- Factor 2: High Inequality
- Factor 3: Suicide
- Factor 4: Gender/Income

source('EFA.R')
# groups = EFA_plot(cleaned)

Note: Some countries, such as Singapore and Qatar, belong to these groups but are too small to be visible on the map.

V. Conclusion

The project was based on indicators pulled from the Gapminder website. To profile countries more precisely, further economic, social, financial and technological factors could be added to the models presented in the report. The project is available online; to contribute to further analysis, please visit: https://github.com/grzechowiak/Multivariate-Analysis-Project