
Data science skills: Building partnership for efficient school curriculum delivery in Africa

Abstract

Data science is a concept that unifies statistics, data analysis, machine learning and their related methods in order to analyze actual phenomena with data and provide better understanding. This article investigates the acquisition of data science skills in building partnership for efficient school curriculum delivery in Africa, especially in the teaching of statistics courses at the beginners’ level in tertiary institutions. Illustrations were made using big data on 16 selected West African countries sourced from the United Nations Educational, Scientific and Cultural Organization (UNESCO), with special focus on some macro-economic variables that drive economic policy. Data description techniques were applied to the sourced open data with the aid of R analytics software for data science, as an improvement on the traditional methods of data description for learning, thereby opening a new chapter of curriculum delivery in African schools. Although the collaboration is not without its own challenges, its prospects in creating a self-driven learning culture among students of tertiary institutions have greatly enhanced the quality of teaching, advanced students’ skills in machine learning, improved their understanding of the role of data in global perspective, and strengthened their ability to critique claims based on data.



1. Introduction

Data science is a “concept to unify statistics, data analysis, machine learning and their related methods” in order to “understand and analyze actual phenomena” with data. It employs techniques and theories drawn from many fields within the context of mathematics, statistics, computer science, and information science.

Data Science has spread its branches through several quintessential fields in modern day learning. It has emerged as a global phenomenon that has revolutionized industries and has increased their performances substantially [1]. Given the vast increase in the volume and complexity of data and the new technologies that have been developed to process and analyze this information, it can be argued that there is an increased need for statistical thinking in the context of working with data [2]. Key statistical reasoning topics that are critical for Data Scientists to know at a deep level include but are not limited to the following: developing clear statements of the problem/scientific research question; ensuring acquisition of high-quality data; understanding the process that produced the data, to provide proper context for analysis; allowing domain knowledge of the problem to guide both data collection and analysis; approaching modeling as a process that requires an overall strategy.

The modern day “romance” between Data Science and Statistics cannot be overemphasized (see Fig. 1). Statistics can be a powerful tool when performing the art of Data Science. From a high-level view, statistics is the use of mathematics to perform technical analysis of data. A basic visualization such as a bar chart might give some high-level information, but with statistics one gets to operate on the data in a much more information-driven and targeted way. The analysis involved helps to form concrete conclusions about our data rather than just guesstimating. Using statistics, we can gain deeper and more fine grained insights into how exactly our data is structured and based on that structure, optimally apply other data science techniques to get even more information [3].

Figure 1. The interactive disciplines of data science.

Education is the key to shaping the lives of people. Since the dawn of civilization, humans have evolved through education and have developed mechanisms to improve education. In the 21st century, where data is omnipresent in every walk of life, education is no exception. With advancements in computing techniques, it is possible to absorb all of this information through powerful big-data platforms [4]. Schools have to keep themselves updated with the demands of industry so as to provide appropriate courses to their students, and keeping up with the growth of industries is a challenge for them. To accommodate this, schools are using data science systems to analyze growing trends in the market [5]. Using various statistical measures and monitoring techniques, data science can be useful for analyzing industrial patterns and can help course creators to incorporate useful topics. Furthermore, using predictive analytics, schools can analyze demand for new skill sets and curate courses that address them [6].

The performance of students depends on the teachers. While many techniques have been used to assess the performance of teachers, they have been mostly manual in nature. With the breakthrough in data science, it is possible to keep track of teacher performance, not only from recorded data but also from real-time data. As a result, real-time monitoring of teachers makes rigorous data collection and analysis possible. Furthermore, unstructured data such as student reviews can be stored and managed on a big data platform.

1.1 Data science and statistics curriculum

A growing number of students are completing bachelor’s degrees in statistics and entering the workforce as data analysts. In these positions, they are expected to understand how to use databases and other data warehouses, scrape data from Internet sources, program solutions to complex problems in multiple languages, and think algorithmically as well as statistically [7]. This increase in the number of undergraduates may help address the impending shortage of quantitatively trained workers. Statistics graduates at the bachelor’s level often work as analysts, and as a result need training in statistical methods, statistical thinking and statistical practice; a foundation in theoretical statistics; increased skills in computing and data-related technologies; and the ability to communicate [6, 7]. Computing skills to enable processing of large data sets are particularly relevant, as noted in the recent London Report on the Future of Statistics. Much of the statistics education literature focuses on the introductory statistics course and statistics before college. Given the relatively few decades since the establishment of undergraduate statistics programs, this is not surprising. While there has been impressive growth in the number of students taking introductory statistics, there has been a relative dearth of articles on the curriculum beyond the introductory course [8].

The digital age is having a profound impact on statistics and the nature of data analysis, and these changes necessitate a re-evaluation of training and education practices in statistics. Computing is an increasingly important and necessary aspect of a statistician’s work, and needs to be incorporated into the statistics curriculum [9]. Successful statisticians must be familiar with the computer, for they are expected to be able to access data from various sources, apply the latest statistical methodologies, and communicate their findings to others in novel ways and via new media. In addition, researchers exploring new statistical methodology rely on computer experiments and simulation to explore the characteristics of methods as an aid to formalizing their mathematical framework [10, 11, 12].

Thus, for the field of statistics to have its greatest impact on policy and science, statisticians must seriously reflect on these major changes and their implications for statistics education. Faculties of science in African higher institutions need to indicate to students that computing and data science are important elements of their statistics education, and these must be taught with an intellectual foundation that provides students with the skills to reason about important computational tasks and to continue learning about new computational topics in statistics and data science. Instead of teaching similar concepts with varying degrees of mathematical rigor, statisticians need to address what is missing from the curricula and take the lead in improving the level of students’ data competence. It is our responsibility, as statistics educators, to ensure our students have the computational understanding, skills, and confidence needed to participate actively and whole-heartedly in the computational arena.

Based on the discussion above, traditional statistics is the basis of data science, but the statistics curriculum needs some improvement. These changes are necessary in order to attract and prepare future statisticians, and to keep pace with the rapidly changing “big science” fields. As the practice of science and statistics research continues to change, the field’s perspectives and attitudes must also change so as to realize its potential and maximize the important influence that statistical thinking has on scientific endeavors.

2. Materials and methods

2.1 Materials

Socio-economic panel data spanning 1999 to 2018, consisting of the variables GDP at Purchasing Power Parity (PPP) per capita (constant 2011 international $), GNI per capita based on PPP, and official exchange rates for sixteen (16) West African countries, as published by the United Nations Educational, Scientific and Cultural Organization (UNESCO), were used for data description and visualization in the R statistical software for data science. The resulting dataset (named social.csv) contains 320 rows and 5 columns: the four variables described below plus a period column holding the year of each observation. The data frame includes the following columns:

  • 1. Variable Country identifies each of the West African countries by a two-letter abbreviation. It is a factor with levels: BJ, Benin; BF, Burkina Faso; CV, Cape Verde; GM, Gambia; GH, Ghana; GN, Guinea; GW, Guinea Bissau; CI, Cote d’Ivoire; LR, Liberia; ML, Mali; MR, Mauritania; NE, Niger; NG, Nigeria; SN, Senegal; SL, Sierra Leone; and TG, Togo, as published by UNESCO.

  • 2. Variable GDPPC_PPP is the Gross Domestic Product per capita at PPP, adjusted for inflation. It is the total monetary or market value of all finished goods and services produced within a country’s borders in a specific period of time, divided by the average (or mid-year) population for the same year.

  • 3. Variable GNIPC_PPP is the Gross National Income Per Capita based on Purchasing Power Parity rates. It is the gross national income converted to US dollars using the PPP rates.

  • 4. Variable ER is the Exchange Rate. It is the value of each selected West African currency in relation to the United States dollar (US$).

These variables were used to explain the data description techniques to the students, which also served as a means of deepening their knowledge of the usefulness of socio-economic indicators.

2.2 Methods

Descriptive Statistics: Descriptive statistics are the first technique used to represent nearly every dataset, as they form the foundation for more complicated computations. Sets of R commands were written to calculate summary statistics, including the mean, standard deviation, range, quartiles and percentiles, as expressed in the following equations:

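Arithmetic Mean: The arithmetic mean of observations $x_1, x_2, \ldots, x_n$ for ungrouped data is given by

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \quad i = 1, 2, \ldots, n \tag{1}$$

For grouped data, we have

$$\bar{x} = \frac{1}{\sum_{i=1}^{n} f_i}\sum_{i=1}^{n} f_i x_i, \quad i = 1, 2, \ldots, n \tag{2}$$

where $f_i$ is the frequency of each observation.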
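As a minimal sketch of Eqs (1) and (2) in R, the snippet below computes the ungrouped and grouped means of a small toy vector. The vector x and the object names are hypothetical and are not taken from the UNESCO dataset.

```r
# Hypothetical toy data (not from the UNESCO dataset)
x <- c(2, 4, 4, 5, 7, 7, 7, 9)

# Eq. (1): ungrouped mean, the sum of the observations divided by n
mean_ungrouped <- sum(x) / length(x)

# Eq. (2): grouped mean from the distinct values x_i and their frequencies f_i
freq   <- table(x)                    # frequencies f_i
values <- as.numeric(names(freq))     # distinct values x_i
mean_grouped <- sum(as.numeric(freq) * values) / sum(freq)

# Both versions should agree with the built-in mean()
print(c(mean_ungrouped, mean_grouped, mean(x)))
```

Because the frequencies here are simply counts of repeated values, the grouped and ungrouped formulas give the same answer; with class intervals and midpoints the grouped mean would only approximate the ungrouped one.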

Median: The middle value after a set of observations $x_1, x_2, \ldots, x_n$ is arranged in order of magnitude is given by

$$x_{med} = \left(\tfrac{1}{2}(n+1)\right)\text{th observation} \tag{3}$$

Equation (3) is used when the number of observations is odd. When the number of observations is even, we have

$$x_{med} = \frac{1}{2}\left[\left(\tfrac{n}{2}\right)\text{th observation} + \left(\tfrac{n}{2}+1\right)\text{th observation}\right] \tag{4}$$

For grouped observations with corresponding frequencies $f_1, f_2, \ldots, f_n$, we have

$$x_{med} = L_1 + \left[\frac{\tfrac{1}{2}N - f^{*}}{f_m}\right] C \tag{5}$$

where $L_1$ is the lower class boundary of the median class; $N$ is the total number of observations under consideration; $f^{*}$ is the cumulative frequency of the classes preceding the median class; $f_m$ is the frequency of the median class; and $C$ is the class interval. The median class is the class in which $\tfrac{1}{2}N$ falls in the cumulative frequency column.

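Variance: The variance of observations $x_1, x_2, \ldots, x_n$ for ungrouped data is given by

$$s^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2, \quad i = 1, 2, \ldots, n \tag{6}$$

For grouped data, we have

$$s^2 = \frac{1}{\sum_{i=1}^{n} f_i}\sum_{i=1}^{n} f_i (x_i - \bar{x})^2, \quad i = 1, 2, \ldots, n \tag{7}$$

The square roots of Eqs (6) and (7) give the standard deviation.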
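One practical point worth illustrating is that Eq. (6) divides by n, whereas R's built-in var() and sd() divide by n - 1. The short sketch below, on a hypothetical toy vector, shows how the two are related; the object names are illustrative only.

```r
# Hypothetical toy data (not from the UNESCO dataset)
x <- c(2, 4, 4, 5, 7, 7, 7, 9)
n <- length(x)

# Eq. (6): variance with divisor n, and its square root
var_eq6 <- sum((x - mean(x))^2) / n
sd_eq6  <- sqrt(var_eq6)

# Built-in versions use the divisor n - 1
var_r <- var(x)
sd_r  <- sd(x)

# The two differ only by the factor (n - 1) / n
print(c(var_eq6 = var_eq6, var_r_rescaled = var_r * (n - 1) / n))
print(c(sd_eq6 = sd_eq6, sd_r = sd_r))
```

For a dataset of a few hundred rows the difference between the two divisors is small, but it is large enough to matter when students check hand calculations against software output.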

Range (R): Given observations $x_1, x_2, \ldots, x_n$, the difference between the maximum and minimum values is referred to as the range. It is given as

$$R = \text{max observation} - \text{min observation} \tag{8}$$

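Quartiles: These divide a given set of observations $x_1, x_2, \ldots, x_n$ into four (4) equal parts, given as

$$Q_i = L_{Q_i} + \left[\frac{\tfrac{i}{4}N - f^{*}_{Q_i}}{f_{Q_i}}\right] C, \quad i = 1, 2, 3 \tag{9}$$

where $Q_i$ is the $i$th quartile; $L_{Q_i}$ is the lower class boundary of the $i$th quartile class; $f^{*}_{Q_i}$ is the cumulative frequency of the classes preceding the $i$th quartile class; $f_{Q_i}$ is the frequency of the $i$th quartile class; and $C$ is the class interval.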

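Percentiles: These divide a given set of observations $x_1, x_2, \ldots, x_n$ into one hundred (100) equal parts, given as

$$P_i = L_{P_i} + \left[\frac{\tfrac{i}{100}N - f^{*}_{P_i}}{f_{P_i}}\right] C, \quad i = 1, 2, \ldots, 99 \tag{10}$$

where $P_i$ is the $i$th percentile; $L_{P_i}$ is the lower class boundary of the $i$th percentile class; $f^{*}_{P_i}$ is the cumulative frequency of the classes preceding the $i$th percentile class; $f_{P_i}$ is the frequency of the $i$th percentile class; and $C$ is the class interval.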

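Moments: Given observations $x_1, x_2, \ldots, x_n$, the $r$th moment about the origin for ungrouped and grouped data is defined, respectively, by

$$\mu'_r = \frac{1}{n}\sum_{i=1}^{n} x_i^{r}, \quad r = 1, 2, \ldots \tag{11}$$

$$\mu'_r = \frac{\sum_{i=1}^{n} f_i x_i^{r}}{\sum_{i=1}^{n} f_i}, \quad r = 1, 2, \ldots \tag{12}$$

The corresponding $r$th moment about the mean for ungrouped and grouped data is defined, respectively, by

$$\mu_r = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^{r} \tag{13}$$

$$\mu_r = \frac{1}{\sum_{i=1}^{n} f_i}\sum_{i=1}^{n} f_i (x_i - \bar{x})^{r}, \quad i = 1, 2, \ldots, n \tag{14}$$

Setting $r = 1, 2, 3, 4$ gives the first, second, third and fourth moments, respectively.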

Skewness and Kurtosis: Skewness is the measure of departure of a curve from symmetry, while kurtosis is the measure of peakedness. The distribution of a set of data is symmetrical if the three measures of central tendency coincide. Students were shown how the skewness and kurtosis of a curve can be measured using the method of moments, as given below:

$$\alpha_1 = \frac{\mu_1}{\sigma} = 0 \tag{15}$$

$$\alpha_2 = \frac{\mu_2}{\sigma^2} = \frac{\sigma^2}{\sigma^2} = 1 \tag{16}$$

$$\alpha_3 = \frac{\mu_3}{\sigma^3} = \frac{\mu_3}{\mu_2^{3/2}} \tag{17}$$

$$\alpha_4 = \frac{\mu_4}{\sigma^4} = \frac{\mu_4}{\mu_2^{2}} \tag{18}$$

If $\alpha_3 = 0$, the distribution is symmetrical; $\alpha_3 < 0$ indicates a negatively skewed curve and $\alpha_3 > 0$ a positively skewed curve. $\alpha_4$ is the standardized 4th moment (peakedness) of the dataset, hereinafter referred to as kurtosis. The criterion classifies the curve as mesokurtic ($\alpha_4 = 3$), platykurtic ($\alpha_4 < 3$) or leptokurtic ($\alpha_4 > 3$).
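The moment-based coefficients in Eqs (17) and (18) can be sketched directly in base R. The helper names central_moment(), alpha3() and alpha4() and the toy vectors below are hypothetical; to the best of our understanding they follow the same divisor-n definitions as skewness() and kurtosis() in the moments package, but that correspondence is an assumption rather than something verified here.

```r
# r-th moment about the mean, Eq. (13); missing values are dropped first
central_moment <- function(x, r) {
  x <- x[!is.na(x)]
  mean((x - mean(x))^r)
}

# Eq. (17): skewness coefficient
alpha3 <- function(x) central_moment(x, 3) / central_moment(x, 2)^(3 / 2)

# Eq. (18): kurtosis coefficient (equals 3 for a normal distribution)
alpha4 <- function(x) central_moment(x, 4) / central_moment(x, 2)^2

# Hypothetical toy data: one symmetric sample, one with a long right tail
x_sym   <- c(1, 2, 3, 4, 5)
x_right <- c(1, 2, 2, 3, 3, 3, 4, 10)

print(c(skew_sym = alpha3(x_sym), skew_right = alpha3(x_right)))
print(c(kurt_sym = alpha4(x_sym), kurt_right = alpha4(x_right)))
```

The symmetric sample gives a skewness of zero, while the sample with one large value gives a positive coefficient, matching the interpretation rule stated above.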

The Shapiro-Wilk normality test is a test of normality in frequentist statistics. It tests the null hypothesis that a sample $x_1, x_2, \ldots, x_n$ came from a normally distributed population. The test statistic is written as

$$W = \frac{\left[\sum_{i=1}^{n} a_i x_{(i)}\right]^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \tag{19}$$

where $x_{(i)}$ is the $i$th order statistic and the constants $a_i$ are given by

$$(a_1, \ldots, a_n) = \frac{m^{T}V^{-1}}{\left(m^{T}V^{-1}V^{-1}m\right)^{1/2}}, \quad m = (m_1, \ldots, m_n)^{T}.$$

Here $m_1, \ldots, m_n$ are the expected values of the order statistics of independent and identically distributed random variables sampled from the standard normal distribution, and $V$ is the covariance matrix of those order statistics. The null hypothesis may be rejected if $W$ is too small.
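To see how the decision rule behaves in practice, the sketch below applies shapiro.test() from base R to two simulated samples. The seed, sample size and distributions are arbitrary choices made for illustration and have nothing to do with the UNESCO data.

```r
set.seed(123)  # arbitrary seed so the simulation can be repeated

# Hypothetical simulated samples
x_norm <- rnorm(200, mean = 50, sd = 10)  # drawn from a normal population
x_skew <- rexp(200, rate = 1)             # strongly right-skewed population

res_norm <- shapiro.test(x_norm)
res_skew <- shapiro.test(x_skew)

# W lies between 0 and 1; small W (and a small p-value) speaks against normality
print(c(W_normal = unname(res_norm$statistic), W_skewed = unname(res_skew$statistic)))
print(c(p_normal = res_norm$p.value, p_skewed = res_skew$p.value))
```

The returned object is a list, so the statistic and p-value can be extracted with $statistic and $p.value and collected into a summary table instead of being copied by hand from the console.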

3. Results and discussion

The dataset was extracted in MS Excel and saved as a comma-delimited file (social.csv). An object named w_africans was then created in R by importing the social.csv file into the console using the command line:

w_africans <- read.csv("social.csv", header=TRUE, stringsAsFactors=TRUE) #stringsAsFactors=TRUE keeps Country as a factor in R 4.0.0 and later

However, the w_africans dataset was inspected for correctness before commencing the analysis using the commands stated below and the output is as given in Table 1.

Table 1

Output of the first 15 observations of the w_africans dataset

      Country  period  GDPPC_PPP  GNIPC_PPP      ER
1     BJ       1999    1621.90    1260       615.47
2     BJ       2000    1666.47    1320       710.21
3     BJ       2001    1703.02    1380       732.40
4     BJ       2002    1728.70    1410       693.71
5     BJ       2003    1734.70    1450       579.90
6     BJ       2004    1757.90    1510       527.34
7     BJ       2005    1735.97    1540       527.26
8     BJ       2006    1752.96    1600       522.43
9     BJ       2007    1805.62    1690       478.63
10    BJ       2008    1841.19    1770       446.00
11    BJ       2009    1831.88    1770       470.29
12    BJ       2010    1818.78    1770       494.79
13    BJ       2011    1820.89    1820       471.25
14    BJ       2012    1855.94    1880       510.56
15    BJ       2013    1934.62    1990       493.90

#Displaying the first 15 observations of the w_africans dataset

print(head(w_africans, n=15))

The columns (variables) in the w_africans dataset were also explored using ls(DATAVAR) or names(DATAVAR), where DATAVAR represents the name of the data frame to be explored. The commands and their results are given below.

#Dataset variable names can be viewed using names(dataset) or ls(dataset)

ls(w_africans)

[1] "Country" "ER" "GDPPC_PPP" "GNIPC_PPP"

#Viewing the number of rows and columns in the w_africans dataset; use ncol(dataset) and nrow(dataset)

ncol(w_africans); nrow(w_africans)

[1] 5

[1] 320

From the output, the w_africans dataset contains 5 columns (the four variables described earlier plus the period column) and 320 rows.

#A more advanced way to view the structure of the dataset is by using str(DATAVAR)

str(w_africans) #Data structure

'data.frame': 320 obs. of 5 variables:

$ Country  : Factor w/ 16 levels "BF","BJ","CI",..: 2 2 2 2 2 2 2 2 2 2 ...

$ period   : int 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 ...

$ GDPPC_PPP: num 1622 1666 1703 1729 1735 ...

$ GNIPC_PPP: int 1260 1320 1380 1410 1450 1510 1540 1600 1690 1770 ...

$ ER       : num 615 710 732 694 580 ...

The w_africans data frame includes 2 numeric variables, 2 integer variables and 1 categorical variable.

The Mean value of each of the variables is computed using the commands:

#Calculate the mean of a variable with mean(DATAVAR$VAR): mean of GDPPC_PPP variable

mean(w_africans$GDPPC_PPP, na.rm=TRUE)

[1] 2258.119

#mean of GNIPC_PPP variable

mean(w_africans$GNIPC_PPP, na.rm=TRUE)

[1] 2117.962

#mean of ER variable

mean(w_africans$ER, na.rm=TRUE)

[1] 857.6926

Here, the average GDP at purchasing power parity per capita, GNI at purchasing power parity per capita and exchange rate (ER) for the 16 West African countries between 1999 and 2018 are about $2258.12, $2117.96 and 857.69 per US$ respectively.

Note: The argument na.rm=TRUE tells R to remove missing values, if there are any, before computing the statistic.

For the standard deviation, the following commands apply; the results represent the spread of the variables.

sd(w_africans$GDPPC_PPP, na.rm=TRUE)#Standard deviation of GDPPC_PPP

[1] 1331.402

sd(w_africans$GNIPC_PPP, na.rm=TRUE)#Standard deviation of GNIPC_PPP

[1] 1341.855

sd(w_africans$ER, na.rm=TRUE)#Standard deviation of ER

[1] 1596.375

Continuing in the same vein for the range computation, the minimum and maximum of a single variable are computed using the min(VAR) and max(VAR) functions. Students were taught how to calculate minimums and maximums using the code below:

#Minimum and maximum GDP of the selected West African countries

min(w_africans$GDPPC_PPP, na.rm=TRUE); max(w_africans$GDPPC_PPP, na.rm=TRUE)

[1] 754.86

[1] 6661.99

From the output, the minimum GDP at purchasing power parity per capita is $754.86 and the maximum is about $6,661.99. This indicates a large gap in GDP per capita at purchasing power parity among the West African countries.

#Minimum and maximum GNIPC of the selected West African countries

min(w_africans$GNIPC_PPP, na.rm=TRUE); max(w_africans$GNIPC_PPP, na.rm=TRUE)

[1] 600

[1] 7330

It can be inferred that the Gross National Income at PPP per capita of the selected West African countries lies between $600 and $7330 inclusive.

#Minimum and maximum ER of the selected West African countries

min(w_africans$ER, na.rm=TRUE); max(w_africans$ER, na.rm=TRUE)

[1] 0.27

[1] 9088.32

Within the period studied, the minimum exchange rate to the US$ is attributed to Ghana’s cedi, which suggests that Ghana’s economy has not been adversely affected by external forces, while the maximum exchange rate of 9088.32 is attributed to Guinea. We can infer that West African countries are still developing and that their currencies may take some time to catch up with those of other continents.

The command range(VAR) summarizes the minimum and maximum of an individual variable. These computations are demonstrated in the following code:

#Calculate the range of a variable with range(VAR)

range(w_africans$GDPPC_PPP, na.rm=TRUE)#Range of variable GDPPC_PPP

[1] 754.86 6661.99

range(w_africans$GNIPC_PPP, na.rm=TRUE)#Range of variable GNIPC_PPP

[1] 600 7330

range(w_africans$ER, na.rm=TRUE)#Range of variable ER

[1] 0.27 9088.32

Students have been taught that a quartile is a value computed from a collection of numeric measurements that shows an observation’s rank when compared with all other observations present. A quartile can alternatively be expressed as a percentile, since the two are identical except that percentiles are on a scale of 0 to 100. Thus, we used the quantile() function to obtain quartiles and percentiles in R, with the command

quantile(VAR, probs=c(prob value1, prob value2, …, prob valuei))

#Calculate the 25th, 50th, 75th and 95th percentiles for GDP per capita at PPP

quantile(w_africans$GDPPC_PPP, na.rm=TRUE, probs=c(0.25, 0.50, 0.75, 0.95))

25% 50% 75% 95%

1369.780 1728.700 2851.580 5361.187

From the output, it is easily observed that the 25th percentile of GDP at PPP per capita was $1369.780, with a median (50th percentile) of $1728.700; the 75th percentile was about $2851.580, and 95% of the observations fall at or below about $5361.187. This may explain the wide gap in GDP growth among the West African countries, since GDP per capita is correlated with GNI per capita.

#Calculate the 25th, 50th, 75th and 95th percentiles for GNIPC_PPP

quantile(w_africans$GNIPC_PPP, na.rm=TRUE, probs=c(0.25, 0.50, 0.75, 0.95))

25% 50% 75% 95%

1195 1680 2625 5435

#Calculate the 25th, 50th, 75th and 95th percentiles for ER

quantile(w_africans$ER, na.rm=TRUE, probs=c(0.25, 0.50, 0.75, 0.95))

25% 50% 75% 95%

83.060 494.040 591.740 4528.037
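The link between quartiles and percentiles can be checked on a small example. The sketch below uses a hypothetical toy vector; note that quantile() works on the raw observations with its default interpolation rule (type 7), so it will not in general reproduce the grouped-data formulas in Eqs (9) and (10).

```r
# Hypothetical toy data (not from the UNESCO dataset)
x <- 1:9

# Quartiles requested as the 25th, 50th and 75th percentiles
q <- quantile(x, probs = c(0.25, 0.50, 0.75))
print(q)

# The second quartile is the median
print(median(x))

# Any percentile can be requested the same way, e.g. the 95th
p95 <- quantile(x, probs = 0.95)
print(p95)
```

The names attached to the result ("25%", "50%", "75%") make it easy to pick out a single value, for example q[["50%"]] for the median.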

Table 2

Pooled descriptive statistics

Statistic               GDP per capita, PPP ($)   GNI per capita, PPP ($)   ER
Mean                    2258.119                  2117.962                  857.693
Standard Deviation      1331.402                  1341.855                  1596.375
25th Percentile (Q1)    1369.780                  1195                      83.060
50th Percentile (Q2)    1728.700                  1680                      494.040
75th Percentile (Q3)    2851.580                  2625                      591.740
95th Percentile         5361.187                  5435                      4528.037
Minimum                 754.860                   600                       0.27
Maximum                 6661.990                  7330                      9088.32

Source: Extracted from R-console output.

Table 3

Variables normality test

Moments                        GDP per capita, PPP ($)   GNI per capita, PPP ($)   ER
Skewness                       1.353                     1.518                     3.283
Kurtosis                       4.227                     4.940                     13.810
Shapiro Wilk Test Statistics   0.848                     0.840                     0.502
P-value                        0.000                     0.000                     0.000

Source: Extracted from R-console output.

Students were also taught how to use the summary(x) function, where x can be any of a number of objects, including datasets, variables, and linear models, to generate the descriptive statistics of the variables in a dataset. The code for the w_africans dataset is written below, followed by the results.

#Summarize the w_africans dataset using the command summary(x)

print(summary(w_africans))

 Country         period        GDPPC_PPP         GNIPC_PPP        ER
 BF     : 20     Min.   :1999  Min.   : 754.9    Min.   : 600     Min.   :   0.27
 BJ     : 20     1st Qu.:2004  1st Qu.:1369.8    1st Qu.:1195     1st Qu.:  83.06
 CI     : 20     Median :2008  Median :1728.7    Median :1680     Median : 494.04
 CV     : 20     Mean   :2008  Mean   :2258.1    Mean   :2118     Mean   : 857.69
 GH     : 20     3rd Qu.:2013  3rd Qu.:2851.6    3rd Qu.:2625     3rd Qu.: 591.74
 GM     : 20     Max.   :2018  Max.   :6662.0    Max.   :7330     Max.   :9088.32
 (Other):200                   NA's   :1         NA's   :1

The summary output provides the descriptive statistics of all variables in the sample dataset and is presented explicitly in Table 2. Further exploration was carried out on the data by checking their respective distributions through skewness, kurtosis and a further test, the Shapiro-Wilk test of normality. Skewness and kurtosis were computed using the “moments” library in R. Students were taught how to load packages in R with library(). Details are given below, while the summary is presented in Table 3:

library(moments)

skewness(w_africans$GDPPC_PPP, na.rm=T) #Skewness coefficient of GDP per capita at PPP

[1] 1.353004

skewness(w_africans$GNIPC_PPP, na.rm=T) #Skewness coefficient of GNIPC at PPP

[1] 1.517567

skewness(w_africans$ER, na.rm=T) #Skewness coefficient of ER

[1] 3.283139

kurtosis(w_africans$GDPPC_PPP, na.rm=T) #Kurtosis coefficient of GDP per capita at PPP

[1] 4.226773

kurtosis(w_africans$GNIPC_PPP, na.rm=T) #Kurtosis coefficient of GNIPC at PPP

[1] 4.940481

kurtosis(w_africans$ER, na.rm=T) #Kurtosis coefficient of ER

[1] 13.80796

shapiro.test(w_africans$GDPPC_PPP)#GDP test of Normality

Shapiro-Wilk normality test

data: w_africans$GDPPC_PPP

W = 0.84758, p-value < 2.2e-16

shapiro.test(w_africans$GNIPC_PPP)#GNIPC test of Normality

Shapiro-Wilk normality test

data: w_africans$GNIPC_PPP

W = 0.83966, p-value < 2.2e-16

shapiro.test(w_africans$ER)#ER test of Normality

Shapiro-Wilk normality test

data: w_africans$ER

W = 0.5022, p-value < 2.2e-16

Table 4

Cross-section data description on average

S/n   Country         Code   Mean GDP per capita PPP   Mean GNIPC PPP         Mean ER
1     Benin           BJ     1841.461 [141.8469]       1759.500 [330.621]     554.3915 [82.72886]
2     Burkina Faso    BF     1386.965 [213.250]        1320.000 [333.024]     555.261 [82.9732]
3     Cape Verde      CV     5355.335 [1009.555]       5039.500 [1414.874]    93.1725 [13.51637]
4     Cote D’Ivoire   CI     2913.830 [338.916]        2647.500 [614.524]     555.261 [82.9732]
5     Gambia          GM     1460.178 [40.394]         1349.500 [184.033]     29.855 [10.73796]
6     Ghana           GH     3031.057 [696.137]        2897.500 [969.063]     1.772 [1.346909]
7     Guinea          GN     1735.404 [226.013]        1593.500 [399.569]     5075.988 [2625.065]
8     Guinea Bissau   GW     1430.202 [72.702]         1360.500 [239.109]     555.261 [82.9732]
9     Liberia         LR     1137.824 [136.222]        970.526 [190.860]      71.5625 [24.41597]
10    Mali            ML     1794.605 [151.470]        1670.500 [321.943]     555.261 [82.9732]
11    Mauritania      MR     3348.436 [370.771]        3193.000 [627.259]     28.2255 [4.042962]
12    Niger           NE     823.119 [61.436]          779.500 [138.049]      555.261 [82.973]
13    Nigeria         NG     4565.789 [907.056]        4237.500 [1296.651]    158.823 [61.094]
14    Senegal         SN     2758.823 [263.357]        2614.500 [524.740]     555.261 [82.973]
15    Sierra Leone    SL     1204.011 [243.925]        1158.500 [342.856]     3822.465 [1756.515]
16    Togo            TG     1286.844 [226.013]        1238.500 [399.569]     555.261 [82.9732]

Values in square brackets [ ] represent standard deviations. Source: Extracted from R-console output.

Positive coefficients of 1.353, 1.518 and 3.283 indicate that the economic variables GDP, GNIPC and ER are skewed to the right and may not be normally distributed. Since kurtosis measures the fourth moment, the kurtosis coefficients of 4.227, 4.940 and 13.808 are all greater than 3, indicating a leptokurtic shape compared to a normal distribution, most markedly for the exchange rate. The Shapiro-Wilk tests confirm the non-normality of the data, since the associated p-values are lower than the 5% level of significance.

Figure 2. Normal Q-Q plots of GDP at PPP per capita of some selected West African countries.

Figure 3. Normal Q-Q plots of GNI at PPP per capita of some selected West African countries.

Figure 4. Normal Q-Q plots of ER of some selected West African countries.

Figure 5. Bar chart of average GDP per capita based on PPP rates of selected West African countries.

Figure 6. Bar chart of average GNIPC based on PPP rates of selected West African countries.

Figure 7. Bar chart of average ER of selected West African countries.

Quantile plots visualize the distribution of the data for each variable. The plots generated by the commands below are given in Figs 2–4 respectively.

par(mfrow=c(2,2)) #Partitioning of plots space

#Quantile plot of GDP per capita at PPP rates

qqnorm(w_africans$GDPPC_PPP);qqline(w_africans$GDPPC_PPP,col="red")

#Quantile plot of GNI per capita at PPP rates

qqnorm(w_africans$GNIPC_PPP);qqline(w_africans$GNIPC_PPP,col="black")

#Quantile plot of Exchange rate

qqnorm(w_africans$ER);qqline(w_africans$ER,col="green")

Figs 2–4 show that the points in the quantile plots of the selected variables do not lie on the theoretical normal line. Thus, the variables are not normally distributed, in line with the Shapiro-Wilk test results.

Students were also introduced to data splitting in R using dataframe_name[n:m,]. This method was used because the data are structured as a panel, in which the first 20 rows represent the Republic of Benin, followed by Burkina Faso, and so on. The command line used is given below, with the per-country results presented in Table 4.

benin_d<-w_africans[1:20,];benin_d #Extract the Republic of Benin observations from the panel-structured data
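Selecting rows by position works only as long as the row order is known. A sketch of an alternative that selects by country code is shown below on a hypothetical miniature panel; the data frame toy and its values are invented for illustration and are not UNESCO figures.

```r
# Hypothetical miniature panel with made-up values
toy <- data.frame(
  Country = rep(c("BJ", "BF"), each = 3),
  period  = rep(1999:2001, times = 2),
  ER      = c(1, 2, 3, 4, 5, 6)
)

# Positional split, as in dataframe_name[n:m, ]
benin_by_position <- toy[1:3, ]

# Label-based split: keep the rows whose Country code is "BJ"
benin_by_code <- toy[toy$Country == "BJ", ]

# split() returns one data frame per country in a named list
by_country <- split(toy, toy$Country)
print(names(by_country))
print(benin_by_code)
```

The label-based form keeps working if the file is re-sorted or if a country has fewer than 20 rows, which is the main reason to prefer it once students are comfortable with logical indexing.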

The data were further explored using the ExPanDaR package in R. Average GDP, GNIPC and ER per cross-section (country) were visualized from the Shiny app using the simple bar charts presented in Figs 5–7 respectively.

library(ExPanDaR)

ExPanD(df=w_africans)

Figs 5–7 show that Cape Verde (CV) recorded the highest average GDP per capita and GNI per capita at purchasing power parity among the West African countries, followed by Nigeria (NG). Ghana (GH) possesses the strongest currency among the West African nations in terms of the US$ exchange rate. Niger (NE) recorded the lowest average GDP per capita and GNIPC at PPP, and Guinea (GN) had the weakest currency within the selected timeframe. This is also evident from Table 4, which reports the associated variability around the mean.
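Per-country averages and standard deviations of the kind reported in Table 4 can also be produced in base R without slicing each country by hand. The sketch below uses aggregate() on a hypothetical two-country data frame with invented values; it is one possible approach under those assumptions, not necessarily the procedure used to build Table 4.

```r
# Hypothetical miniature panel with made-up values and one missing entry
toy <- data.frame(
  Country = rep(c("AA", "BB"), each = 3),
  GDP     = c(10, 20, NA, 1, 2, 3),
  ER      = c(5, 5, 5, 2, 4, 6)
)

# Mean of each variable within each country; na.rm = TRUE is passed on to mean()
country_means <- aggregate(toy[, c("GDP", "ER")],
                           by = list(Country = toy$Country),
                           FUN = mean, na.rm = TRUE)

# Standard deviation of each variable within each country
country_sds <- aggregate(toy[, c("GDP", "ER")],
                         by = list(Country = toy$Country),
                         FUN = sd, na.rm = TRUE)

print(country_means)
print(country_sds)
```

Passing na.rm = TRUE matters here: a country with a missing observation would otherwise get NA for its mean rather than the average of its available years.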

3.1 Summary of findings

This paper presented students’ learning experience on the introduction of data science skills for curriculum delivery in Africa, using socio-economic data extracted from the UNESCO website. The interactive session taught students how to use R software to compute descriptive statistics and how to interpret the results appropriately for the type of data analyzed. This bridged the gap between the traditional method of data analysis and current data science practice, especially in the area of big data. Findings from the analysis showed that economic growth varies from country to country, as shown by the pictorial representation of the data and the respective spread of observations around the mean. The results indicate that Cape Verde (CV) is better off than the other West African countries in terms of economic growth, taking purchasing power parity into consideration. They also suggest that Nigeria’s economic growth, like that of other developing countries, may be marred by inflation, resulting in the devaluation of the naira in the international market. Hence, West African countries in general are far from being developed compared to countries in Asia, America and Europe, to mention a few.

4. Conclusion

Introducing beginner students in statistics to data science is a vexatious task, especially in African countries where regular supply of power is a luxury and uninterrupted internet facilities are quite expensive and almost impossible. The developing nature of most Africa countries has created a paradoxical approach to achieving reasonable success in students’ learning of data science. However, for the purpose of this research, great achievement was made in introducing the students to data description using R software for data science, thereby equipping them with a career in data analysis. From the beginning, students offering introductory statistics gain reasonable experience of what constitutes both the practical and conceptual aspects of the working life of a data scientist, as they were able to run simple codes on exploratory data analysis using the focused data. The students equally enhanced their knowledge in deducing reasonable inference from the output of data analysis. 200 level students were able to run with ease, R codes to estimate basic descriptive statistics within a 1 hour lecture period. The activities was carried out without much supervision on the part of the tutor. Comparison was made per member countries on their developmental rate taking their respective Gross Domestic Product, Gross National Income per capita, and Exchange Rate into consideration.

We are of the opinion that topics covered in data science courses can and should be brought into a variety of statistics courses at the undergraduate level, with adequate facilities provided for their teaching and learning. Thus, key data science skills need to be introduced, reiterated, and reinforced throughout the undergraduate statistics curriculum.

Although the exercise is not without its challenges, its prospects for creating a self-driven learning culture among students of tertiary institutions have greatly enhanced the quality of teaching, advanced students' skills in machine learning, improved their understanding of the role of data in a global perspective, and developed their ability to critique claims based on data on the spot.

Acknowledgments

The authors are grateful to Federal Polytechnic Ilaro and the students of the Department of Mathematics & Statistics for creating an enabling environment for the data science activities carried out in this research.

References

[1] Jordan, M.I. and Mitchell, T.M. Machine learning: trends, perspectives, and prospects. Science, 2015, 349(6245), 255–260.

[2] Mayer-Schönberger, V. and Cukier, K. Big Data: A Revolution That Will Transform How We Live, Work, and Think, 2013. New York: Houghton Mifflin Harcourt.

[3] Provost, F. and Fawcett, T. Data science and its relationship to big data and data-driven decision making. Big Data, 2013, 1(1), 51–59.

[4] Kuhn, T.S. The Structure of Scientific Revolutions. 3rd ed. Chicago, IL: University of Chicago Press, 1996.

[5] Box, G.E.P. Science and statistics. Journal of the American Statistical Association, 2012, 71(356), 791–799. Reprint of original from 1962.

[6] Nolan, D. and Temple Lang, D. Computing in the statistics curricula. The American Statistician, 2010, 64, 97–107. doi: 10.1198/tast.2010.09132.

[7] Hardin, J., Hoerl, R., Horton, N.J. and Nolan, D. Data Science in Statistics Curricula: Preparing Students to “Think with Data”. The American Statistician, 2014. doi: 10.1080/00031305.2015.1077729.

[8] Tukey, J.W. The future of data analysis. Annals of Mathematical Statistics, 1962, 33(1), 1–67. Moore, D.S., McCabe, G.P. and Craig, B.A. (2012). Introduction to the Practice of Statistics. New York: WH Freeman.

[9] National Science Foundation. Accelerating discovery in science and engineering through Petascale simulations and analysis (PetaApps), 2008. Posted July 28, 2008.

[10] Gershman, S.J., Horvitz, E.J. and Tenenbaum, J.B. Computational rationality: a converging paradigm for intelligence in brains, minds, and machines. Science, 2015, 349(6245), 273–278.

[11] Horton, N.J. and Hardin, J.S. Teaching the next generation of statistics students to ‘think with data’: special issue on statistics and the undergraduate curriculum, The American Statistician, 2015, 69, 259–265. doi: 10.1080/00031305.2015.1094283.

[12] Hoerl, J., Horton, J., Nolan, N.J., Baumer, D., Hall-Holt, D. and Ward, M.D. Data science in statistics curricula: preparing students to ‘think with data’, The American Statistician, 2015, 69, 343–353. doi: 10.1080/00031305.2015.1077729.

Appendices

Appendix 1: Data

GDP per capita (PPP), GNI per capita (PPP), and exchange rate of 16 selected West African countries, 1999–2018.

Country | Period | GDP per capita, PPP (2011 international $) | GNI per capita, PPP ($) | Exchange rate
BJ | 1999 | 1621.9 | 1260 | 615.47
BJ | 2000 | 1666.47 | 1320 | 710.21
BJ | 2001 | 1703.02 | 1380 | 732.4
BJ | 2002 | 1728.7 | 1410 | 693.71
BJ | 2003 | 1734.7 | 1450 | 579.9
BJ | 2004 | 1757.9 | 1510 | 527.34
BJ | 2005 | 1735.97 | 1540 | 527.26
BJ | 2006 | 1752.96 | 1600 | 522.43
BJ | 2007 | 1805.62 | 1690 | 478.63
BJ | 2008 | 1841.19 | 1770 | 446
BJ | 2009 | 1831.88 | 1770 | 470.29
BJ | 2010 | 1818.78 | 1770 | 494.79
BJ | 2011 | 1820.89 | 1820 | 471.25
BJ | 2012 | 1855.94 | 1880 | 510.56
BJ | 2013 | 1934.62 | 1990 | 493.9
BJ | 2014 | 2001.05 | 2100 | 493.76
BJ | 2015 | 1987.14 | 2110 | 591.21
BJ | 2016 | 2009.66 | 2160 | 592.61
BJ | 2017 | 2069.29 | 2260 | 580.66
BJ | 2018 | 2151.54 | 2400 | 555.45
BF | 1999 | 1086.62 | 840 | 615.7
BF | 2000 | 1075.4 | 850 | 711.98
BF | 2001 | 1114.2 | 900 | 733.04
BF | 2002 | 1129.74 | 930 | 696.99
BF | 2003 | 1183.09 | 990 | 581.2
BF | 2004 | 1200.42 | 1030 | 528.28
BF | 2005 | 1266.36 | 1120 | 527.47
BF | 2006 | 1305.92 | 1200 | 522.89
BF | 2007 | 1338.84 | 1260 | 479.27
BF | 2008 | 1393.7 | 1340 | 447.81
BF | 2009 | 1392.2 | 1340 | 472.19
BF | 2010 | 1423.38 | 1360 | 495.28
BF | 2011 | 1472.72 | 1420 | 471.87
BF | 2012 | 1521.45 | 1520 | 510.53
BF | 2013 | 1562.3 | 1590 | 494.04
BF | 2014 | 1582.33 | 1620 | 494.41
BF | 2015 | 1596.33 | 1650 | 591.45
BF | 2016 | 1642.48 | 1710 | 593.01
BF | 2017 | 1696.23 | 1810 | 582.09
BF | 2018 | 1755.59 | 1920 | 555.72
CV | 1999 | 3472.6 | 2660 | 102.7
CV | 2000 | 3896.96 | 3020 | 115.88
CV | 2001 | 3915.16 | 3150 | 123.21
CV | 2002 | 4053.37 | 3270 | 117.26
CV | 2003 | 4157.15 | 3440 | 97.79
CV | 2004 | 4513.97 | 3820 | 88.75
CV | 2005 | 4759.13 | 4090 | 88.65
CV | 2006 | 5071.86 | 4470 | 87.93
CV | 2007 | 5768.87 | 5320 | 80.62
CV | 2008 | 6078.55 | 5690 | 75.34
CV | 2009 | 5929.44 | 5600 | 80.04
CV | 2010 | 5943.35 | 5570 | 83.28
CV | 2011 | 6102.41 | 5860 | 79.28
CV | 2012 | 6090.55 | 5940 | 86.32
CV | 2013 | 6061.31 | 6070 | 83.07
CV | 2014 | 6021.63 | 6050 | 83.03
CV | 2015 | 6007.22 | 6180 | 99.39
CV | 2016 | 6214.08 | 6470 | 99.69
CV | 2017 | 6387.1 | 6790 | 97.81
CV | 2018 | 6661.99 | 7330 | 93.41
GM | 1999 | 1416.72 | 1060 | 11.4
GM | 2000 | 1448.62 | 1110 | 12.79
GM | 2001 | 1484.89 | 1150 | 15.69
GM | 2002 | 1391.43 | 1080 | 19.92
GM | 2003 | 1440.18 | 1160 | 28.53
GM | 2004 | 1493.71 | 1240 | 30.03
GM | 2005 | 1434.39 | 1230 | 28.58
GM | 2006 | 1407.03 | 1240 | 28.07
GM | 2007 | 1415.08 | 1290 | 24.87
GM | 2008 | 1452.45 | 1360 | 22.19
GM | 2009 | 1500.82 | 1410 | 26.64
GM | 2010 | 1551.59 | 1470 | 28.01
GM | 2011 | 1440.79 | 1390 | 29.46
GM | 2012 | 1476.06 | 1460 | 32.08
GM | 2013 | 1500.51 | 1520 | 35.96
GM | 2014 | 1442.1 | 1490 | 41.73
GM | 2015 | 1481.48 | 1540 | 42.51
GM | 2016 | 1443.69 | 1530 | 43.88
GM | 2017 | 1465.34 | 1580 | 46.61
GM | 2018 | 1516.69 | 1680 | 48.15
GH | 1999 | 2193.1 | 1670 | 0.27
GH | 2000 | 2219.21 | 1710 | 0.54
GH | 2001 | 2252.13 | 1790 | 0.72
GH | 2002 | 2296.58 | 1860 | 0.79
GH | 2003 | 2357.33 | 1940 | 0.87
GH | 2004 | 2428.26 | 2050 | 0.9
GH | 2005 | 2507.59 | 2210 | 0.91
GH | 2006 | 2600.79 | 2370 | 0.92
GH | 2007 | 2644.72 | 2480 | 0.94
GH | 2008 | 2813.21 | 2690 | 1.06
GH | 2009 | 2875.42 | 2770 | 1.41
GH | 2010 | 3026.36 | 2920 | 1.43
GH | 2011 | 3368.8 | 3260 | 1.51
GH | 2012 | 3595.64 | 3480 | 1.8
GH | 2013 | 3769.94 | 3830 | 1.95
GH | 2014 | 3791.28 | 3880 | 2.9
GH | 2015 | 3786.96 | 3990 | 3.67
GH | 2016 | 3830.5 | 4060 | 3.91
GH | 2017 | 4051.46 | 4340 | 4.35
GH | 2018 | 4211.85 | 4650 | 4.59
GN | 1999 | 1515.65 | 1150 | 1387.4
GN | 2000 | 1518.52 | 1180 | 1746.87
GN | 2001 | 1541.09 | 1210 | 1950.56
GN | 2002 | 1588.79 | 1300 | 1975.84
GN | 2003 | 1577.93 | 1230 | 1984.93
GN | 2004 | 1583.62 | 1270 | 2243.93
GN | 2005 | 1598.17 | 1290 | 3644.33
GN | 2006 | 1582.66 | 1360 | 5148.75
GN | 2007 | 1653.28 | 1470 | 4197.75
GN | 2008 | 1682.66 | 1500 | 4601.69
GN | 2009 | 1626.17 | 1450 | 4801.08
GN | 2010 | 1666.49 | 1530 | 5726.07
GN | 2011 | 1721.45 | 1600 | 6658.03
GN | 2012 | 1783.67 | 1710 | 6985.83
GN | 2013 | 1812.88 | 1780 | 6907.88
GN | 2014 | 1836.56 | 1880 | 7014.12
GN | 2015 | 1859.74 | 1930 | 7485.52
GN | 2016 | 2007.34 | 2130 | 8959.72
GN | 2017 | 2213.46 | 2420 | 9088.32
GN | 2018 | 2337.95 | 2480 | 9011.13
GW | 1999 | 1365.77 | 1000 | 615.7
GW | 2000 | 1410.92 | 1090 | 711.98
GW | 2001 | 1411.49 | 1100 | 733.04
GW | 2002 | 1367.12 | 1110 | 696.99
GW | 2003 | 1343.98 | 1100 | 581.2
GW | 2004 | 1349.35 | 1140 | 528.28
GW | 2005 | 1374.03 | 1200 | 527.47
GW | 2006 | 1372.44 | 1250 | 522.89
GW | 2007 | 1383.12 | 1300 | 479.27
GW | 2008 | 1392.52 | 1320 | 447.81
GW | 2009 | 1403.55 | 1340 | 472.19
GW | 2010 | 1430.97 | 1400 | 495.28
GW | 2011 | 1506.7 | 1520 | 471.87
GW | 2012 | 1442.15 | 1480 | 510.53
GW | 2013 | 1450 | 1470 | 494.04
GW | 2014 | 1425.77 | 1560 | 494.41
GW | 2015 | 1474.24 | 1610 | 591.45
GW | 2016 | 1526.81 | 1690 | 593.01
GW | 2017 | 1576.75 | 1740 | 582.09
GW | 2018 | 1596.36 | 1790 | 555.72
CI | 1999 | 3132.64 | 2310 | 615.7
CI | 2000 | 2989.15 | 2160 | 711.98
CI | 2001 | 2922.03 | 2100 | 733.04
CI | 2002 | 2810.19 | 2030 | 696.99
CI | 2003 | 2714.01 | 1940 | 581.2
CI | 2004 | 2690.74 | 2070 | 528.28
CI | 2005 | 2679.79 | 2300 | 527.47
CI | 2006 | 2662.33 | 2350 | 522.89
CI | 2007 | 2650.49 | 2400 | 479.27
CI | 2008 | 2657.67 | 2460 | 447.81
CI | 2009 | 2682.04 | 2500 | 472.19
CI | 2010 | 2673.01 | 2520 | 495.28
CI | 2011 | 2495.5 | 2400 | 471.87
CI | 2012 | 2696.19 | 2660 | 510.53
CI | 2013 | 2864.05 | 2840 | 494.04
CI | 2014 | 3038.84 | 3130 | 494.41
CI | 2015 | 3225.19 | 3340 | 591.45
CI | 2016 | 3395.09 | 3650 | 593.01
CI | 2017 | 3564.6 | 3760 | 582.09
CI | 2018 | 3733.05 | 4030 | 555.72
LR | 1999 | | | 41.9
LR | 2000 | 1317.87 | 930 | 40.9
LR | 2001 | 1307.93 | 880 | 48.59
LR | 2002 | 1325.38 | 900 | 61.75
LR | 2003 | 910.1 | 610 | 59.38
LR | 2004 | 916.49 | 650 | 54.91
LR | 2005 | 940.16 | 700 | 57.1
LR | 2006 | 981.89 | 780 | 58.01
LR | 2007 | 1034.29 | 870 | 61.27
LR | 2008 | 1063.37 | 930 | 63.21
LR | 2009 | 1076.11 | 960 | 68.29
LR | 2010 | 1101.48 | 980 | 71.4
LR | 2011 | 1154.41 | 1090 | 72.23
LR | 2012 | 1211.05 | 1120 | 73.51
LR | 2013 | 1281.55 | 1200 | 77.52
LR | 2014 | 1257.63 | 1190 | 83.89
LR | 2015 | 1225.93 | 1190 | 86.19
LR | 2016 | 1176.19 | 1160 | 94.43
LR | 2017 | 1175.64 | 1170 | 112.71
LR | 2018 | 1161.18 | 1130 | 144.06
ML | 1999 | 1508.48 | 1160 | 615.7
ML | 2000 | 1465.76 | 1150 | 711.98
ML | 2001 | 1642.35 | 1270 | 733.04
ML | 2002 | 1643.04 | 1270 | 696.99
ML | 2003 | 1738.13 | 1410 | 581.2
ML | 2004 | 1710.11 | 1430 | 528.28
ML | 2005 | 1763.9 | 1520 | 527.47
ML | 2006 | 1786.31 | 1580 | 522.89
ML | 2007 | 1788.03 | 1640 | 479.27
ML | 2008 | 1812.05 | 1700 | 447.81
ML | 2009 | 1835.97 | 1740 | 472.19
ML | 2010 | 1875.19 | 1760 | 495.28
ML | 2011 | 1877.89 | 1810 | 471.87
ML | 2012 | 1808.01 | 1770 | 510.53
ML | 2013 | 1796.77 | 1800 | 494.04
ML | 2014 | 1868.31 | 1920 | 494.41
ML | 2015 | 1922.43 | 2010 | 591.45
ML | 2016 | 1974.31 | 2070 | 593.01
ML | 2017 | 2019.44 | 2170 | 582.09
ML | 2018 | 2055.62 | 2230 | 555.72
MR | 1999 | 2922.44 | 2320 | 20.95
MR | 2000 | 2833.93 | 2280 | 23.89
MR | 2001 | 2813.65 | 2230 | 25.56
MR | 2002 | 2755.18 | 2370 | 27.17
MR | 2003 | 2839.11 | 2480 | 26.3
MR | 2004 | 2918.42 | 2610 | 26.43
MR | 2005 | 3090.86 | 2840 | 26.55
MR | 2006 | 3570.52 | 3200 | 26.86
MR | 2007 | 3567.26 | 3300 | 25.86
MR | 2008 | 3503.27 | 3350 | 23.82
MR | 2009 | 3367.49 | 3310 | 26.24
MR | 2010 | 3426.47 | 3300 | 27.59
MR | 2011 | 3483.52 | 3380 | 28.11
MR | 2012 | 3578.1 | 3510 | 29.66
MR | 2013 | 3685.7 | 3690 | 30.07
MR | 2014 | 3779.09 | 3810 | 30.27
MR | 2015 | 3722.7 | 3830 | 32.47
MR | 2016 | 3690.24 | 3890 | 35.24
MR | 2017 | 3696.35 | 4000 | 35.79
MR | 2018 | 3724.41 | 4160 | 35.68
NE | 1999 | 793.78 | 610 | 615.7
NE | 2000 | 754.86 | 600 | 711.98
NE | 2001 | 779.6 | 630 | 733.04
NE | 2002 | 774.09 | 630 | 696.99
NE | 2003 | 785.6 | 650 | 581.2
NE | 2004 | 757.75 | 650 | 528.28
NE | 2005 | 762.87 | 680 | 527.47
NE | 2006 | 777.48 | 710 | 522.89
NE | 2007 | 772.37 | 730 | 479.27
NE | 2008 | 815.04 | 780 | 447.81
NE | 2009 | 778.98 | 750 | 472.19
NE | 2010 | 812.3 | 790 | 495.28
NE | 2011 | 799.26 | 790 | 471.87
NE | 2012 | 859.79 | 860 | 510.53
NE | 2013 | 870.4 | 880 | 494.04
NE | 2014 | 900.14 | 930 | 494.41
NE | 2015 | 903.42 | 940 | 591.45
NE | 2016 | 912.03 | 960 | 593.01
NE | 2017 | 920.63 | 990 | 582.09
NE | 2018 | 931.99 | 1030 | 555.72
NG | 1999 | 2996.94 | 2270 | 92.34
NG | 2000 | 3069.44 | 2230 | 101.7
NG | 2001 | 3170.44 | 2440 | 111.23
NG | 2002 | 3565.39 | 2760 | 120.58
NG | 2003 | 3731.46 | 2910 | 129.22
NG | 2004 | 3973.62 | 3190 | 132.89
NG | 2005 | 4121.5 | 3390 | 131.27
NG | 2006 | 4258.59 | 3830 | 128.65
NG | 2007 | 4421.36 | 3990 | 125.81
NG | 2008 | 4597 | 4220 | 118.55
NG | 2009 | 4835.95 | 4450 | 148.9
NG | 2010 | 5085.41 | 4710 | 150.3
NG | 2011 | 5213.84 | 4920 | 153.86
NG | 2012 | 5290.63 | 5130 | 157.5
NG | 2013 | 5494.52 | 5420 | 157.31
NG | 2014 | 5687.59 | 5810 | 158.55
NG | 2015 | 5685.93 | 5910 | 192.44
NG | 2016 | 5448.91 | 5760 | 253.49
NG | 2017 | 5351.44 | 5710 | 305.79
NG | 2018 | 5315.82 | 5700 | 306.08
SN | 1999 | 2398.95 | 1840 | 615.7
SN | 2000 | 2417.83 | 1890 | 711.98
SN | 2001 | 2468.53 | 1980 | 733.04
SN | 2002 | 2424.87 | 1970 | 696.99
SN | 2003 | 2523.67 | 2100 | 581.2
SN | 2004 | 2605.44 | 2230 | 528.28
SN | 2005 | 2682.44 | 2370 | 527.47
SN | 2006 | 2677.93 | 2450 | 522.89
SN | 2007 | 2736.88 | 2570 | 479.27
SN | 2008 | 2772.55 | 2660 | 447.81
SN | 2009 | 2754.75 | 2640 | 472.19
SN | 2010 | 2775.7 | 2690 | 495.28
SN | 2011 | 2739.34 | 2700 | 471.87
SN | 2012 | 2800.41 | 2810 | 510.53
SN | 2013 | 2799.96 | 2850 | 494.04
SN | 2014 | 2902.51 | 3010 | 494.41
SN | 2015 | 3001.82 | 3140 | 591.45
SN | 2016 | 3104.24 | 3260 | 593.01
SN | 2017 | 3232.31 | 3460 | 582.09
SN | 2018 | 3356.34 | 3670 | 555.72
SL | 1999 | 875.35 | 660 | 1804.2
SL | 2000 | 908.71 | 700 | 2092.13
SL | 2001 | 820.7 | 650 | 1986.15
SL | 2002 | 993.28 | 800 | 2099.03
SL | 2003 | 1036.66 | 860 | 2347.94
SL | 2004 | 1057.69 | 890 | 2701.3
SL | 2005 | 1063.91 | 930 | 2889.59
SL | 2006 | 1073.92 | 970 | 2961.91
SL | 2007 | 1129.38 | 1120 | 2985.19
SL | 2008 | 1162.41 | 1200 | 2981.51
SL | 2009 | 1172.86 | 1230 | 3385.65
SL | 2010 | 1208.05 | 1200 | 3978.09
SL | 2011 | 1255.45 | 1240 | 4349.16
SL | 2012 | 1413.88 | 1490 | 4344.04
SL | 2013 | 1669.13 | 1720 | 4332.5
SL | 2014 | 1707.1 | 1760 | 4524.16
SL | 2015 | 1326.21 | 1400 | 5080.75
SL | 2016 | 1376.4 | 1330 | 6289.94
SL | 2017 | 1403.79 | 1500 | 7384.43
SL | 2018 | 1425.34 | 1520 | 7931.63
TG | 1999 | 1282.72 | 970 | 615.7
TG | 2000 | 1235.46 | 960 | 711.98
TG | 2001 | 1182.2 | 940 | 733.04
TG | 2002 | 1140.99 | 930 | 696.99
TG | 2003 | 1167.5 | 970 | 581.2
TG | 2004 | 1162.34 | 990 | 528.28
TG | 2005 | 1145.91 | 1010 | 527.47
TG | 2006 | 1161.06 | 1050 | 522.89
TG | 2007 | 1156.06 | 1080 | 479.27
TG | 2008 | 1170.78 | 1120 | 447.81
TG | 2009 | 1202.52 | 1160 | 472.19
TG | 2010 | 1241.92 | 1210 | 495.28
TG | 2011 | 1286.47 | 1360 | 471.87
TG | 2012 | 1334.66 | 1360 | 510.53
TG | 2013 | 1379.4 | 1440 | 494.04
TG | 2014 | 1423.55 | 1520 | 494.41
TG | 2015 | 1467.25 | 1620 | 591.45
TG | 2016 | 1501.12 | 1640 | 593.01
TG | 2017 | 1529.52 | 1680 | 582.09
TG | 2018 | 1565.46 | 1760 | 555.72

Source: Extracted from UIS.stat report (uis.unesco.org).