
Serving the poorest of the poor through targeted education: Using the business classroom to help Fe y Alegría-Bolivia schools with analytics and pattern visualization

Abstract

This paper describes the analytical support that a Saint Joseph’s University (SJU) Haub School’s Data Mining class has provided over the past three academic years to Fe y Alegría in Bolivia (FyA:B), a Jesuit-sponsored institution dedicated to education of the poorest of the poor in over twenty countries, mostly in Latin America. The paper details the involvement of undergraduate business school students as global citizens helping FyA:B identify from survey data alone Bolivian high-school students in the most unfavorable socio-economic condition, i.e., those who might most benefit from school reach-out efforts and teacher attention. This initiative is an important social sustainability instrument in an environment of very limited resources as it supports Fe y Alegría’s core mission of providing justice-based education for those who need it most while also helping business school students in the U.S. increase their awareness of vastly different realities. The paper provides contextual foundation and historical background for this ongoing initiative and then describes its evolution over time as sequential cohorts of students in a data-mining semester-long class focus on the issue and, through live interactions with FyA:B in several iterations, have engaged in continuous analysis improvement and tool fine-tuning. The paper lists the statistical methods used in the business classroom and describes different survey response databases, but focuses mostly on the social impact of the initiative. In closing, the paper provides an example of the work done: a web-based data visualization instrument which allows for very efficient examination of survey answers.

Introduction

This paper describes the analytical support that a Saint Joseph’s University (SJU) Haub School’s Data Mining class has provided over the past three academic years to Fe y Alegría in Bolivia (FyA:B), a Jesuit-sponsored institution dedicated to education of the poorest of the poor in over twenty countries, mostly in Latin America. The initial question addresses how to identify, from survey data alone, which early high-school students (the third year of la secundaria corresponding to high-school freshman students in the U.S.) are most impoverished so that support efforts can be targeted earlier to those with highest need. While SJU and FyA:B have had an ongoing partnership for over fifteen years, this question was first brought into the SJU data mining classroom in the fall of 2015, and resulted in a discovery-based learning opportunity for undergraduate business students to address a real-world challenge with potentially high benefits for the underserved. In essence, the work done was to classify the underserved by identifying those most in need to help them achieve economic prosperity while initiating responsible global citizens in the current generation. Each progressive semester, students have “mined” survey data and have interacted with FyA:B through an in-class Skype consultation and/or presentation of results.

The data provides SJU students with a window into a vastly different reality, one of scarce resources and urgent needs. Coming in contact with another culture becomes a mind-broadening experience that has led U.S. business students to build on prior terms’ cohorts’ analyses and creatively add value in innovative ways. Over the semesters this has resulted in a virtuous cycle that every term has culminated with the presentation of new and different approaches to not only identify the most impoverished students in the Bolivian schools surveyed, but also to then allow FyA:B schools to provide targeted support. There has been an evolution in the scope of the analyses over subsequent semesters: what began as an analysis of two FyA:B schools in the city of Potosí evolved into an examination of schools across different regions of the country and later incorporated non-FyA:B schools. Throughout this period, the scope of the business undergraduate students’ contributions grew, including survey question suggestions and pattern visualization techniques. This evolution has led to deeper student involvement and a broadening impact. As a concluding example, this paper shares a system created by SJU students in Fall 2017 to help make initial analyses easier for those looking to implement immediate student outreach initiatives in Bolivia. A structured web-based Tableau dashboard visualization tool was developed. This innovative dashboard is an empowerment tool, i.e., a method for impoverished student identification that may be shared with those able to help them within the FyA:B organization. In summary, this paper provides evidence of how the FyA:B-SJU partnership adds value to both FyA:B’s high school students and to SJU’s business undergraduate students (future global citizens) through the application of data analytics and data visualization.

The next section provides the contextual foundation for this paper, with a brief description of Fe y Alegría (FyA) and its work in Bolivia, an introduction to that country, a description of the FyA:B-SJU partnership, and an account of the beginnings of the research to better understand individual student socio-economic conditions through surveys. The third section provides conceptual detail on the various survey data-base techniques used inside the SJU classroom. The fourth section presents the task-based chronological evolution of the SJU student involvement in the joint initiative described in this paper. The fifth section illustrates one specific student-developed result of this effort – a tool for easy visualization of surveyed student characteristics. The last section concludes.

The Fe y Alegría:Bolivia-Saint Joseph’s Partnership leading to the data-mining initiative

This section describes the Bolivian context, Fe y Alegría, and the FyA:Bolivia-SJU partnership that helped facilitate the original request for student survey analysis.

Bolivian context

Bolivia is a landlocked developing country in South America with a 2017 estimated population of 11.1 million inhabitants, GDP of US$ 37.78 billion, and an area of 1,098 thousand square kilometers (almost three times the size of Montana). Broadly speaking, the country has three geographic regions with very different climates. These are the very warm low altitude plains that include jungles in the northeast, savannas in the east, and the Chaco swamplands in the southeast; the temperate middle-altitude regions, that include the valleys and the Andes foothills; and the western mountains, that include a high-altitude plateau separating two main Andes ranges, with inclement weather and harsh conditions (almost one-third of the country is above ten thousand feet). Over the centuries the different flora, fauna, and human adaptation factors in each geographic region led to the development of very distinct autochthonous cultures and because Bolivia is the Latin American country with the highest percentage of indigenous ethnicity inhabitants (over sixty percent), cultural plurality permeates society to this day: the country’s official name is Estado Plurinacional de Bolivia. There are over three dozen native-american tribal nations represented, the most numerous of which are the Aymaras, the Quéchuas, and the Guaranis (Mesa, Gisbert, and Mesa Gisbert, 2008: 17, 43, 49).

Raw materials are plentiful, and the country exports mineral commodities including natural gas, crude oil, and tin, but since the discovery of the New World Bolivia’s original inhabitants have seen the richness of their land used to benefit outsiders. From 1556 to 1783 Potosí’s Cerro Rico alone yielded 45,000 tons of pure silver, which was minted into bars and coins and transported to Spain (Ferguson, 2008: 23). Bolivia declared independence from Spain in 1825, initiating a turbulent republican period which featured almost 200 coups d’état, an average of over one per year (2018 CIA World Factbook). In December 2005 the country elected its first-ever indigenous president, Evo Morales, by the widest margin of any elected president since the most recent restoration of civilian rule in 1982. Mr. Morales ran on a platform to empower the native population and was twice re-elected. Despite recent improvement, Bolivia still has very unequal income and wealth distribution. Roughly forty percent of the population is below the poverty line, and although the country’s Gini index for distribution of family income has fallen from 0.60 to 0.47, Bolivia still exhibits extreme poverty (World Bank, 2018).

Bolivia’s population is young and 92.5 percent of the population is literate. Pre-university level education includes two cycles, the primary cycle including the six years of elementary school (ages 6 to 11), and the secondary cycle comprising six years (ages 12 to 17). According to the Bolivian Ministry of Education - MEB (2004), Spanish is the mother tongue for roughly two thirds of Bolivians, Quéchua for one-fifth, while most of the remainder speak Aymara. The country respects cultural heritage (as exemplified by the word “plurinational” in its official name) and bilingual education with Spanish is now a reality for many Quéchua and Aymara children. Although recent public initiatives have done much to improve education in Bolivia, it still lags other South American countries in most pedagogical metrics.

Fe y Alegría in Bolivia

Founded in 1955 in Caracas, Venezuela, Fe y Alegría (FyA, translation “Faith and Joy”) is a Jesuit-sponsored not-for-profit organization focusing on education and development of the “poorest of the poor” in over twenty countries, mostly in Latin America but also including Chad, Madagascar, and Spain. The popular saying is that FyA’s work begins “where the pavement ends, where there is no running water, where the city loses its name”. FyA has developed a unique approach to providing the managerial, administrative, pedagogical, and developmental expertise for in-network schools and operates as an international federation because schools in different countries have vastly different regulatory environments. FyA acts in each country through a small staff which leverages capabilities and resources across schools in the network to train and develop faculty members, to work with individual school personnel to establish and reach aggressive goals, to identify and develop best practices, and to ensure that these best practices are disseminated. In 2018 FyA schools numbered over one thousand worldwide and reached over five hundred thousand students.

Fe y Alegría started in Bolivia in 1966 and is present in every Bolivian province (departamento), operating in a decentralized structure with departmental (provincial) directors who provide local leadership and a national office that coordinates nationwide support activities. FyA:B counts over four hundred schools with over ten thousand teachers and over 180,000 students, and is now an integral part of the country’s educational system, offering a wide range of educational services. The largest area is “formal education,” which oversees a network of elementary and secondary schools including classes in the widely spoken Quéchua and Aymara indigenous languages. The local impact of FyA is apparent because under very harsh conditions network schools not only help individuals become fully integrated members of society with a deep appreciation and respect for their own culture and heritage, but also foster within the communities served a very strong sense of self-worth and local identity.

The Fe y Alegría:Bolivia-Saint Joseph’s University partnership

The partnership between FyA:B and SJU began over fifteen years ago through the facilitation of an agreement between the Jesuit Provinces of Maryland and Bolivia to collaborate and share resources. In 2001, SJU staff conducted two exploratory visits to Bolivia and tangible steps were taken to initiate a joint collaboration, resulting in three initiatives: FyA:B staff attendance at the English Services Center near SJU; periodic ten-day SJU faculty and staff immersion trips to Bolivia, which initially were scheduled annually subject to funding availability; and workshops by SJU faculty for FyA:B. Five years later, in 2006, two more initiatives were added: student immersion trips to Bolivia, consisting of a full academic course in which students studied the Bolivian context and participated in a week-long trip to the country over spring break; and periodic immersion trips by FyA:B personnel to Philadelphia. The latter became biennial, taking place in even-numbered years, as the SJU faculty and staff immersion trips to Bolivia moved to odd-numbered years, subject to funding availability. Although there had been several faculty collaborations for the benefit of FyA, the first long-term community-engaged research project began after the 2008 immersion trip and had the objective of examining school efficiencies; for more details on this project please see Neiva de Figueiredo and Marca Barrientos (2012). In 2015, the initiative described in this paper took hold and was followed by several other service-related in-class initiatives with Fe y Alegría. For a detailed account of the process by which the partnership supported one research initiative please see Neiva de Figueiredo et al. (2013).

The partnership has been able to grow over the years despite the constant pressures on time given the many responsibilities both of SJU and FyA members because a solid foundation of trust was gradually built. Contributing to this growth in trust were several factors. One was a strengths-based approach through which both parties understood that each had a lot to learn from the other and endeavored to identify and recognize each other’s abilities. A second factor was a mutual respect for cultural characteristics, including the Bolivian culture’s gift of focusing on the whole individual, therefore moving beyond the task at hand. A third factor was the concerted mutual deference and curiosity which led to active listening, in turn leading to effective communication. Social change is achieved only gradually and FyA’s work is evidence of its expertise in achieving it. The process of consensus-building at various levels, the knowledge that success is determined locally by local stakeholders, and FyA’s gradual approach emphasizing patience, are just some of the precious gifts that SJU has received from this partnership.

The original FyA:Bolivia request which led to the research and in-class business student activities described herein therefore likely would not have occurred had it not been for this slowly built trust foundation. In 2015 FyA:B was searching for a way to identify, through survey data alone, students who were in most need of reach-out efforts. The idea was consistent with FyA’s mission to provide education support to those who need it most. In other words, FyA:B was hoping to identify as many students as possible with unfavorable personal situations, something that is particularly important for high-schools in environments of resource scarcity, because older students’ families tend to be less involved in education.

Methods considered in addressing Fe y Alegría Bolivia survey analysis requests over time

This section summarizes the statistical concepts that were used in sequential iterations by the SJU undergraduate business classroom to address FyA:B survey analysis requests. The goal is to provide a succinct rationale and basic definitions without excessive statistical discussion, i.e., to enumerate several statistical analyses that were executed in the SJU classroom in this long-term sustainability project. The subsequent section describes the chronology of these analyses.

The following statistical and data analysis tools were applied by students through many iterations, with slight adaptations each semester through class interactions, the new data sets, and changing goals of the project.

Data cleaning. Incorrect survey answers such as blanks and mistakes were identified throughout each phase. Consistent and reasonable techniques were needed to ensure that all iterations considered the data through a similar lens, which led to the development of identical data-cleaning rules. Typically, responses with more than two blanks or errors were removed from the data. For responses with two or fewer errors, substitutions were made in consultation with Fe y Alegría or using reasonable inferences.
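The cleaning rule just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the column names are borrowed from the survey variables reported later in the paper, the valid answer ranges are hypothetical, and filling with the column's low median is an assumed stand-in for the substitutions that were actually made in consultation with Fe y Alegría.

```python
from statistics import median_low

MAX_BAD = 2  # threshold taken from the rule described above


def is_bad(value, valid):
    """A blank (None or empty string) or an out-of-range answer counts as bad."""
    return value is None or value == "" or value not in valid


def clean(responses, valid_answers):
    """Drop responses with more than MAX_BAD bad answers, then fill the rest.

    responses: list of dicts mapping question -> answer
    valid_answers: dict mapping question -> set of allowed answers (hypothetical)
    """
    kept = [
        dict(r)  # copy so the raw survey data is left untouched
        for r in responses
        if sum(is_bad(r.get(q), v) for q, v in valid_answers.items()) <= MAX_BAD
    ]
    for q, valid in valid_answers.items():
        good = [r[q] for r in kept if not is_bad(r.get(q), valid)]
        # Assumed substitution rule: low median of the valid answers to this question.
        fill = median_low(good) if good else None
        for r in kept:
            if is_bad(r.get(q), valid):
                r[q] = fill
    return kept


valid = {"no_meals": {1, 2, 3, 4}, "dad_wk": {1, 2, 3, 4, 5}, "mom_wk": {1, 2, 3, 4, 5}}
survey = [
    {"no_meals": 3, "dad_wk": 5, "mom_wk": 2},
    {"no_meals": "", "dad_wk": 4, "mom_wk": 1},     # one blank: kept and filled
    {"no_meals": 9, "dad_wk": None, "mom_wk": ""},  # three bad answers: dropped
    {"no_meals": 2, "dad_wk": 3, "mom_wk": 4},
]
cleaned = clean(survey, valid)
```

Applying the same function with the same threshold to every new data set is one way to keep successive iterations comparable, which is the purpose the identical data-cleaning rules served.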

Data Visualization. Histograms, boxplots, bar charts, tree diagrams, and pivot tables were used in each iteration of the analysis: every step provided graphics and visuals that often helped guide both the students doing the analysis and the team at Fe y Alegría who relied on the results.

Distribution analysis. In every data iteration the distributions of survey responses were considered both on a per-school basis and on an aggregate basis to identify underlying data patterns. Histograms provided quick insights into how students responded to each question and helped quickly examine differences among schools regarding pertinent questions.

Principal Component Analysis (PCA)/Factor Analysis (FA). Principal component analysis allows the user to reduce the number of independent variables in a model while capturing as much of the total variability among all the independent variables as possible. The independent variables are regrouped into factors, which combine associated variables into orthogonal (uncorrelated) groups using weights identified in the factor analysis. In these iterations PCA/FA was used to identify which variables could be eliminated from consideration without substantial loss of explanatory variation.

Analysis of Variance. One of the main goals was often to find schools that were similar to, and schools that were different from, one another, focusing on survey questions that might be indicative of poverty. In each iteration, analyses of variance were run in order to assess the overall average answer to any specific question as well as how each school compared to that average. The boxplots and results of the ANOVA tests revealed differences among the schools being analyzed.
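As a minimal sketch of the comparison described above, the one-way ANOVA F statistic can be computed directly from per-school answers to a single question. The school names and answers below are hypothetical, and the sketch stops at the F statistic and its degrees of freedom; turning F into a p-value would require an F-distribution routine from a statistics package.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (F, df_between, df_within) for a dict of school -> list of answers."""
    all_values = [x for g in groups.values() for x in g]
    grand = mean(all_values)  # overall average answer to the question
    k, n = len(groups), len(all_values)
    # Variation of each school's mean around the overall average.
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    # Variation of individual answers around their own school's mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups.values() for x in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical answers to "number of meals per day" in three schools.
meals_by_school = {
    "School A": [3, 3, 2, 3],
    "School B": [2, 2, 1, 2],
    "School C": [3, 2, 3, 3],
}
f_stat, df_b, df_w = one_way_anova(meals_by_school)
```

A large F relative to the reference distribution with (df_b, df_w) degrees of freedom suggests that at least one school's average differs from the others, which is the kind of difference the boxplots make visible.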

Cluster Analysis. This technique regroups the students (row data) into like groups that have minimal variation within each group while maximizing the variability between any two different groups. The output provides an “average” student answer to each question by cluster. These averages provide insights into students who have fewer meals, minimal access to electricity or water, or parents who have fewer job prospects (i.e., who work fewer hours each week). Once created, these clusters can be used to rank the students in need using the UN definition of poverty or other reasonable indicators of poverty.

In 1995, the United Nations defined absolute poverty as: “…a condition characterized by severe deprivation of basic human needs, including food, safe drinking water, sanitation facilities, health, shelter, education and information. It depends not only on income but also on access to services.”
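The cluster-then-rank idea can be illustrated with a small k-means sketch (Lloyd's algorithm), followed by ordering the clusters by their average answers. Everything here is an assumption made for illustration: the paper does not state which clustering algorithm or software the class used, the eight students are invented, and summing unscaled centroid values is a deliberately crude deprivation score.

```python
from statistics import mean


def assign(points, centroids):
    """Index of the nearest centroid (squared Euclidean distance) for each point."""
    return [
        min(range(len(centroids)),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
        for p in points
    ]


def kmeans(points, centroids, max_iter=100):
    """Plain Lloyd's algorithm starting from the given centroids."""
    labels = assign(points, centroids)
    for _ in range(max_iter):
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Keep the old centroid if a cluster ends up empty.
            new_centroids.append(tuple(map(mean, zip(*members))) if members else old)
        new_labels = assign(points, new_centroids)
        centroids = new_centroids
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
    return labels, centroids


# Hypothetical students: (has water 0/1, has electricity 0/1, meals per day).
students = [
    (1, 1, 3), (1, 1, 3), (1, 1, 2), (1, 1, 3),
    (0, 1, 1), (0, 0, 1), (1, 0, 1), (0, 0, 2),
]
labels, centers = kmeans(students, centroids=[students[0], students[5]])

# Rank clusters from most to least deprived: lower average access and fewer
# meals give a lower score (an assumed, simplistic indicator).
ranked = sorted(range(len(centers)), key=lambda j: sum(centers[j]))
most_deprived = [s for s, lab in zip(students, labels) if lab == ranked[0]]
```

The centroids play the role of the “average” student answer per cluster, and the ranking step is where a poverty definition such as the UN one enters the analysis.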

Multiple Linear Regression. The data set of answers is formed by purely independent variables with no clear dependent variable that could be used to create a predictive regression model. Several possibilities have been considered with the long term goal of establishing a dependent variable representing a poverty score. Such a score could then be used to develop a weighted model to help categorize future survey takers. Some progress has been made but better models continue to be considered through successive iterations of data analysis.

Bootstrapping. This advanced simulation technique is an iterative process of sampling from within a larger data set in an attempt to identify key data patterns. Simulations were applied in an effort to create a meaningful independent variable by which a poverty value might be created for each current student within the collected data.
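The resampling idea behind bootstrapping can be sketched as follows. This is a generic percentile-bootstrap illustration, not the simulation the class ran: the answers are hypothetical, and the statistic (mean meals per day) was chosen only because it is easy to follow.

```python
import random
from statistics import mean


def bootstrap_means(values, n_resamples=2000, seed=0):
    """Sorted means of n_resamples samples drawn with replacement from values."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    return sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))


# Hypothetical answers to "number of meals per day".
meals = [3, 3, 2, 3, 2, 2, 1, 2, 3, 2, 3, 3, 1, 2, 3]
boot = bootstrap_means(meals)

# Approximate 95% percentile interval for the mean.
lower = boot[int(0.025 * len(boot))]
upper = boot[int(0.975 * len(boot)) - 1]
```

Each resample is the same size as the original data and is drawn with replacement, so the spread of the resampled means indicates how stable a pattern is within the collected data.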

Logistic Regression. From each cluster iteration groups of students may be considered to be “more impoverished” than other groups when considering the average answers in each cluster. A binary dependent variable where the more impoverished students receive a value of one while the remaining students receive a value of zero was considered. In this way a logistic regression model can be made to evaluate which variables are most useful in identifying poverty, possibly creating a predictive tool that could be implemented with future students.
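A minimal sketch of such a model is shown below, fitted by batch gradient descent in plain Python. The data, the two predictors (meals per day and days the father works per week), the learning rate, and the number of epochs are all hypothetical choices for illustration; the class presumably relied on a statistics package rather than hand-written code.

```python
import math


def fit_logistic(X, y, lr=0.1, epochs=5000):
    """Fit intercept and weights by batch gradient descent on the log-loss."""
    w = [0.0] * (len(X[0]) + 1)  # w[0] is the intercept
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi  # predicted probability minus label
            grad[0] += err
            for j, xj in enumerate(xi, start=1):
                grad[j] += err * xj
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w


def predict_proba(w, xi):
    """Estimated probability that a student belongs to the impoverished group."""
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical students: (meals per day, days father works per week).
X = [(1, 1), (1, 2), (2, 1), (2, 2), (1, 4), (3, 1),
     (2, 3), (3, 4), (3, 5), (2, 5), (3, 3), (2, 4)]
# 1 = placed in a "more impoverished" cluster, 0 = otherwise.
y = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

w = fit_logistic(X, y)
# exp(coefficient) is the odds ratio for a one-unit increase in that predictor;
# a value below 1 means the predictor lowers the odds of being flagged.
odds_ratios = [math.exp(wj) for wj in w[1:]]
p_low = predict_proba(w, (1, 1))   # few meals, father works one day
p_high = predict_proba(w, (3, 5))  # three meals, father works five days
```

Once fitted, the same coefficients could in principle be applied to answers from future surveys to estimate each new student's probability of belonging to the impoverished group.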

The evolving SJU student responses since the FyA:Bolivia original request: 2015-2018

This section summarizes the in-class analyses conducted by SJU undergraduate business students in response to the original FyA:B inquiry, and is structured in chronological order to describe the data sets, analyses, successes, and take-aways in each academic semester.

A) Data set 1 (from four schools in Potosí, with 272 pre-data-cleaning student responses and 261 after)

Fall 2015 - a class of SJU students was asked to help assess a simple analytical question posed by Miguel Marca Barrientos. The initial question had two parts:

1. What was the meaning of the principal component analysis/factor analysis (PCA/FA) output which had been generated for him by an outside source?

2. Could the coefficients from said analysis be applied as weights to produce a meaningful poverty measure for each student that might be indicative of poverty (poverty score)?

The data mining class was working through the topics of Principal Component Analysis and Factor Analysis and collectively decided to focus on the first question as a real world example of survey analysis. The class met via Skype with FyA:B to identify the most important variables and discussed how they might be formed into factors, which is the goal of PCA/FA. Though this provided insight into the analysis that Miguel had in hand, it did not help answer the more important question of creating a predictive model to identify poverty.

The students inquired whether they could continue working through the analysis in an attempt to help with the second question, leading to the use of data mining to address social issues. This initiated a social entrepreneurship relationship with the goal of helping fine-tune student poverty degree identification within FyA:B schools. On the back end Miguel was working to implement training programs for the students identified as most in need to help break the poverty cycle. In the remainder of the course, the class worked to create a poverty model with insights and results presented to Miguel at the end of the semester. After data cleaning (resulting in 11 students being dropped), PCA/FA was performed. This technique allowed for data reduction and refined independent variables. Next, cluster analysis was applied to partition the 261 remaining students into subgroups with similar attributes to help identify those most impoverished. Several iterations were run using between 4 and 9 clusters and the class settled on using eight clusters (see Table 1) with a preset goal of identifying the 25% most impoverished. The class decided to use more clusters and to rank them using key “poverty” features as defined by the United Nations, based on key survey questions on food, shelter, electricity, water, and parents’ work. Using this logic, the analysis identified the 23 most impoverished students (the lowest 9%), i.e., those who lacked the basic necessities of food, electricity, or water (the blue-highlighted clusters in Table 1, i.e., clusters 1, 2, 6, and 7, which lack water or electricity). Moreover, Cluster 8 identified 36 students (13.8%, the moderate-lowest group) who had on average 2.28 meals a day and whose fathers were less educated and less likely to work a full 5 days in each given week.

Table 1. Cluster output from first iteration of data analysis, Fall 2015.

Cluster     Water   Electricity   no_meals   food_b4 school   dad_ed   dad_wk   mom_ed   mom_wk
1 (n=9)     Yes     No            2.11       1.33             2.78     3.11     2.33     2.44
2 (n=6)     No      Yes           3          1.17             2.17     2.5      2        1.33
3 (n=21)    Yes     Yes           2.95       1.38             4.48     3.71     4.29     3.57
4 (n=90)    Yes     Yes           2.43       1                2.63     2.83     2.14     1.41
5 (n=91)    Yes     Yes           2.57       1.18             2.46     2.45     2.07     1.64
6 (n=1)     No      No            1          2                2        4        2        2
7 (n=7)     No      Yes           1.57       1.29             3        2.57     2.71     2.14
8 (n=36)    Yes     Yes           2.28       2                2.25     2.417    2.11     1.22

Students recognized that when creating smaller clusters, bigger groups tended to stick together while the smaller groups identified were narrowly defined (e.g., lacking water or lacking electricity), resulting in uneven partitions; the original request of identifying a specific percentage of those most in need (25%) was therefore more elusive. At this point the class was able to provide initial identification of students most in need by school and by name. With only a week remaining in class in the fall 2015 term, other methods to model the data, including an attempt to create a dependent variable using the clusters, were considered but could not be fully implemented. It is important to note that although the data was dated (collected two years prior in 2013), the teachers within the schools were able to confirm that a majority of the students identified in this cluster analysis were indeed the most impoverished.

Successes: data cleaning, integrity, Principal Component Analysis and Factor Analysis, cluster analysis, concepts and goals.

Needing work: simpler cluster technique and creation of a usable predictive model.

B) Data set 2 (from six schools in Potosí and Sucre, with 838 pre-data-cleaning student responses and 731 after). This data was used from Spring 2016 through Fall 2017, with techniques and conclusions advancing through continuous analysis improvement and tool fine-tuning.

Spring 2016 – For the first half of the semester the class focused time and attention on the first data set with 261 students. The new data arrived in the middle of March and required intense cleaning, with 107 responses dropped due to missing or invalid answers. As the class cleaned the data, they also read up on the two Bolivian regions and provided suggestions on how to adapt questions to get more useful insights, which were considered and implemented in subsequent surveys. Once the data was cleaned, correlations were checked and two variables (mothers’ and fathers’ education levels) seemed to be moderately correlated (0.54). As with previous data, PCA/FA was run with similar results to the first data set and was again used to reduce the variables and identify which survey questions would be considered when running the cluster analysis. The students tried several iterations of cluster analysis as a class – 8 clusters were chosen, one of which identified the 38 most impoverished Bolivian students. A new application was then considered with this second data set. The 38 identified students were given a dependent binary value of “1” indicating poverty while the remaining students were given a value of “0” indicating they were less impoverished. Meanwhile, identifying the students and their schools allowed immediate feedback to FyA:B.

Table 2. Table of counts and percentage of students by school in initial analysis.

School                      # of students from school   % of students from the school
Luis Espinal Camps          17                          11
Gualberto Paredes           6                           7
Sagrada Familia             5                           4
Loyola De Fe Y Alegría B    4                           2
Jose Maria Valez            4                           4
Fray Vicente Bernedo B      2                           2

Using the newly created binary dependent variable, the class attempted to create a logistic regression. This model would help identify weighted averages that might be useful when applied to future survey-based data sets. However, with only about 5% of students (38 of 731) coded as “1” for impoverished, the model was less accurate due to the small-sample bias in the logistic model’s maximum likelihood estimation (Allison, 2012). The attempt at obtaining a logistic regression needs further investigation in order to ensure model accuracy.

Successes: data cleaning, integrity, Principal Component Analysis and Factor Analysis, cluster analysis, creation of initial dependent variable.

Needing work: the number of identified most impoverished students (38) was too small, and the predictive model relied heavily on two survey questions, namely electricity and water.

Fall 2016/Spring 2017 – In these semesters, the classes worked through the initial cleaning, Principal Component Analysis and Factor Analysis, and cluster analysis in smaller groups, with the whole class discussing successes and failures. Students were then invited to suggest the best next steps for analysis.

One major change in Fall 2016 was prompted by one group that decided to pre-partition a subset of the data. This group classified any Bolivian student who reported being without electricity or water as impoverished. Students classified this way were removed from the data set, allowing the clustering technique to be applied to the remaining students (n = 693). The class embraced this change. After partitioning, the number of students identified as impoverished increased to 69 (9%), nearly double the earlier count. In turn, this change helped create a slightly better logistic model that could feasibly be applied to future survey collection and data analysis. While the new dependent variable produced a feasible model, logistic models do not have easily interpretable weights: some analysts focus on the magnitude of the effect and others on the magnitude of the odds ratios, and both are easy to misinterpret (Norton and Dowd, 2018). Despite the challenges of interpreting the outputs in layman’s terms, this was the first time that a reasonable model was created. The results were shared with Miguel with the goal of producing coefficients for future survey applications.
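The pre-partitioning rule described above can be sketched in a few lines of Python. This is only an illustration: the field names (`electricity`, `running_water`), the assumption that 1 means yes and 2 means no, and the four toy responses are all hypothetical and are not taken from the actual survey file.

```python
# Assumed coding for yes/no survey questions (hypothetical: 1 = yes, 2 = no).
YES, NO = 1, 2

def pre_partition(responses):
    """Split responses into (flagged, remaining).

    A student is flagged as impoverished when they report having
    no electricity or no running water; everyone else is kept for
    the later cluster analysis.
    """
    flagged, remaining = [], []
    for r in responses:
        lacks_basic_service = r["electricity"] == NO or r["running_water"] == NO
        (flagged if lacks_basic_service else remaining).append(r)
    return flagged, remaining

# Made-up toy responses, for illustration only.
responses = [
    {"id": "A", "electricity": YES, "running_water": YES},
    {"id": "B", "electricity": NO, "running_water": YES},
    {"id": "C", "electricity": YES, "running_water": NO},
    {"id": "D", "electricity": YES, "running_water": YES},
]
flagged, remaining = pre_partition(responses)
```

Only the `remaining` list would then be passed on to clustering, mirroring the way the flagged students were set aside before the cluster analysis was run.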


Successes: data cleaning, integrity, Principal Component Analysis and Factor Analysis, cluster analysis, initial dependent variable created, larger sample identified, decent logistic model created.

Needing work: number of identified most impoverished students (69) still small, creation of a continuous dependent variable.

Two major changes occurred in Spring 2017. First, in addition to the two usual undergraduate classes, a class of graduate students was invited to analyze the data and help create a dependent variable. Second, in this iteration the concept of bootstrapping was considered in an attempt to create a continuous dependent variable, using simulated samples from within the survey to make estimates about the population (students in FyA:B). In other words, once the initial steps of cleaning the data and reducing the independent variables (survey questions) through Principal Component Analysis and Factor Analysis had been completed, cluster analysis could be applied using a resampling method called bootstrapping. The technique is often useful for analyzing smallish, expensive-to-collect data sets for which prior information is sparse, distributional assumptions are unclear, and further data may be difficult to acquire (Henderson, 2005). In this case, no prior data was available, the data had no labels (an unsupervised setting), and we had established a viable method for creating a model that needed verification.

Once SJU students had performed the initial data cleaning and reduction, they were given an opportunity to work through a few replications of the method by hand. Instead of dealing with the entire data set, smaller subsets (about 25%) of the larger data set were randomly chosen. Each subset was then run through cluster analysis (typically with four clusters), and each cluster was ranked according to need by viewing the averages of the responses on the questions being considered. In doing this by hand, the two undergraduate classes noticed that the rankings naturally followed an ordered pattern that most often depended on how many meals the students consumed daily. Fewer meals were typically associated with the group that would be considered most impoverished. Working in groups (where each group ran an iteration), the classes recorded a value for the students within each cluster once the rankings were put in order. Because the subsets were randomly generated and 20 separate groups ran the analysis, each Bolivian student was chosen multiple times and was assigned a cluster each time they were chosen for an iteration. The classes averaged these recorded values to see whether a continuous variable might be created, giving each student a value from 1 to 4.

Meanwhile, working on their own, a group of graduate students came to a similar conclusion. This group recognized that true bootstrapping requires resampling to be done thousands of times in order to obtain stable results, and they wrote an R program that repeated the resampling 10,000 times. Each time a sample was taken, a cluster value was recorded for each sampled student (approximately 2,000 values assigned to each student). Again, the values were averaged, creating a feasible continuous dependent variable ranging from 1 to 4. In this way a multiple linear regression became a feasible option for identifying weights that could be applied to questions to obtain a poverty score. The distribution of this continuous dependent variable revealed clear partitions of students; the lowest section (y-value ≤ 2.3) was assigned a value of one for impoverished, with the remaining students given a value of zero indicating less impoverished (see Figure 1). At this point both a logistic regression model and a multiple regression model became available as feasible methodologies that FyA:B could use to evaluate students then or in the future.
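The resampling idea can be sketched in Python as follows. This is a simplified illustration, not the R program the graduate students wrote: it clusters on a single made-up variable (meals per day) with a tiny one-dimensional k-means, draws random 25% subsets, and uses far fewer rounds than 10,000. The student IDs, the meal values, and the function names `kmeans_1d` and `bootstrap_scores` are all hypothetical.

```python
import random
from statistics import mean

def kmeans_1d(values, k, iters=10):
    """Tiny 1-D k-means; returns the k cluster centers in ascending order."""
    s = sorted(values)
    # Start the centers at evenly spaced order statistics of the sample.
    centers = [s[(2 * i + 1) * len(s) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centers[c]))
            groups[nearest].append(v)
        # Keep the old center when a cluster ends up empty.
        centers = [mean(g) if g else centers[i] for i, g in enumerate(groups)]
    return sorted(centers)

def bootstrap_scores(meals, k=4, fraction=0.25, rounds=500, seed=0):
    """Average the need rank (1 = fewest meals) each student receives
    across many randomly drawn subsets."""
    rng = random.Random(seed)
    ids = list(meals)
    size = max(k, int(len(ids) * fraction))
    recorded = {i: [] for i in ids}
    for _ in range(rounds):
        sample = rng.sample(ids, size)            # random subset of students
        centers = kmeans_1d([meals[i] for i in sample], k)
        for i in sample:
            # Rank of the nearest center: 1 is the lowest-meals cluster.
            rank = 1 + min(range(k), key=lambda c: abs(meals[i] - centers[c]))
            recorded[i].append(rank)
    return {i: mean(r) for i, r in recorded.items() if r}

# Made-up data: 40 hypothetical students eating 1 to 4 meals per day.
meals = {f"S{n:02d}": 1 + n % 4 for n in range(40)}
scores = bootstrap_scores(meals)

# Binary flag from the continuous score, using the 2.3 cutoff reported above.
flags = {i: int(s <= 2.3) for i, s in scores.items()}
```

Each student's averaged rank plays the role of the continuous dependent variable, and the cutoff converts it back into a binary label. In practice the clustering would use all of the retained survey questions rather than meals alone.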


Figure 1. Distribution of dependent averages after bootstrapping is applied.

Over the summer, work was done to assess the most viable cluster size and the best model. A more advanced R program was created that could be applied to this data set and to future survey results, providing an open-source tool that FyA:B could use (Garwood & Dhobale, 2018).

Successes: two reasonable and useful models created.

Needing work: use other data samples for verification of process.

C) Data set 3 (eight schools in various regions of the country with 204 pre-data-cleaning student responses and between 18 and 32 students from each school).

This final data set was dramatically different from previous sets. No student names were provided, and the schools were not currently part of the FyA:B school system. Many students left answers blank. The underlying goal here was also noticeably different, namely to identify schools with students similar to those within FyA:B, on the assumption that these schools might be interested in being adopted by the Fe y Alegría administration.

Fall 2017 – While waiting for access to the new data set, the class used data set 2 for the first iteration of the project. At this juncture the students were asked to clean, organize, and analyze the data for insights and anomalies. One group proposed a dashboard that could filter all the students based on eight questions, providing a list of students who could be high need. The filter could be applied to as few as one question, and the resulting visual dashboard would narrow down the students who fell within a need category based on the filter(s) chosen. Within each question the possible responses were colored red (likely impoverished), yellow (possibly impoverished), or green (less likely impoverished) to indicate a different need level for the filtered students on each of the eight questions. While this approach does not go through the statistical processes required to build a model, it is useful for immediate visualization by the user and was shared with FyA:B soon after its creation (please see the Tableau illustration in section 5), knowing it could be updated with new data easily and efficiently.

As for the class, once students had become familiar with Fe y Alegría, they were asked to compare data set 2 (with which they were now familiar) against data set 3 (the new and very different data set). One objective proposed to the classes was to use the data to advise FyA:B (Miguel) as to how the two data sets compared with one another and whether any patterns existed. The class began by running analyses of variance to identify whether any schools in the 2017 data aligned more closely with the 2016 data set, i.e., seemed to have similar students. A secondary effort was made to join all of the schools and identify the lowest 25% of students overall. As in the previous semester, the students ran Principal Component Analysis and Factor Analysis to identify the characteristics providing the most variability, and worked to isolate clusters of impoverished students. Finally, drilling down into the data, efforts were made to identify where the students were from (school) and who the students were: student names (2016 data) or student numbers (2017 data). An example of the ANOVA used to identify in-need students looked at the average number of meals per day by data set (Figure 2), where there was evidence of two schools in each data set falling primarily below the average of 3 meals a day.
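As a rough illustration of the kind of comparison involved, the one-way ANOVA F statistic for meals per day by school can be computed by hand in Python. The school labels and meal counts below are made up for the example and do not come from either survey; the function name `one_way_anova_f` is likewise hypothetical.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for {group name: [values]}."""
    samples = [g for g in groups.values() if g]
    all_values = [x for g in samples for x in g]
    grand_mean = mean(all_values)
    k, n = len(samples), len(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in samples)
    # Variation of individual answers around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in samples for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Made-up meals-per-day answers for three hypothetical schools.
meals_by_school = {
    "School A": [1, 2, 2, 3],
    "School B": [2, 3, 3, 4],
    "School C": [3, 4, 4, 5],
}
f_stat = one_way_anova_f(meals_by_school)

# Schools whose average falls below 3 meals a day.
below_three = [s for s, g in meals_by_school.items() if mean(g) < 3]
```

A large F value suggests that average meals per day differ between schools, and the list of schools averaging under three meals points to where need may be concentrated.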

Figure 2. ANOVA of the number of meals per day by school showing both 2016 and 2017 data.

In summary, this analysis provided early insights into patterns, and also a recommendation to FyA:B (Miguel Marca) as to which schools seemed to have the most need in the new data set and how these schools compared with current FyA:B schools.

Successes: adapting to new and different data, identifying similar schools across the data sets.

Needing work: small samples and the lack of names introduce a new goal of creating a poverty index to help find the schools with the most need.

Tableau Dashboard: An expanded example of students moving the project forward

When the class created the original Tableau dashboard, it contained tree maps partitioned by eight separate questions. A tree map is a space-constrained visualization of a hierarchical structure which uses enclosure to visualize trees, using size and color coding to map sub-trees onto a sequence of nested rectangular areas (Shneiderman & Wattenberg, 2001). By nesting the data into separate categories (eight in this case) and coloring the level of poverty using answers to the survey questions, survey respondents are easily compared to one another, and the visual allows for quick interpretation of their level of need based on their answer to each question. Dropdown menus at the top of the dashboard contain filters that partition these tree maps and allow for a user-friendly and interactive experience, while the color-coding of responses visually conveys each student’s environment.

Table 3. Questions used in the original Tableau dashboard.

1  How Many Meals Per Day Do You Eat?
2  Is There Running Water in Your House (Do You Have Access to Potable Water)?
3  How Many People Sleep In Your Bedroom?
4  Is There A Shower In Your House?
5  Which One of the Following Groups Best Fits Your Mother’s Current Work Situation?
6  Which One of the Following Groups Best Fits Your Father’s Current Work Situation?
7  Does Your House Have Electricity?
8  Do You Work and Get Paid For It?

The questions used in the original Tableau dashboard can be seen in Table 3. Several improvements were suggested to ensure a more successful final product that could be shared within FyA:B. These included increased participant confidentiality, translation to Spanish, categorization of answers based on need, and reconsideration of the questions used. The first element that needed to be rectified was the confidentiality of the students. Originally, each square represented a student and, as the user hovered over a student’s cell, the student’s name was displayed. Protecting confidentiality was therefore a necessary part of updating the dashboard. Second, the dashboard was in English. Its main users are Spanish-speaking, and therefore the resulting dashboard should be in Spanish as well. Third, although color-coding was a success of the original data compilation, only one question was partitioned into three categories. To reduce the polarization of the other survey questions, the three-level scale was adopted for all questions that had three or more responses. Finally, a review of the 22 questions in the survey led to the conclusion that the eight questions initially selected did not adequately capture the environments of the students. For example, two questions addressed the status of running water in the home and likely conveyed similar information. For this reason, some questions were swapped out so that the dashboard could cover a different facet of the student’s environment. Moreover, the water information considers only access to potable water at this time, with the goal of creating a water index in the next iteration.

The privacy of participants was protected by adapting each tree map to use alternate identifiers for both the school name and the student name. To create the identifier codes, a three-letter code was created for each school, as shown in the following list:

Code   School Name
FVB    FRAY VICENTE BERNEDO B
GBP    GUALBERTO PAREDES
JMV    JOSE MARIA VELAZ
LEC    LUIS ESPINAL CAMPS
LFA    LOYOLA DE FE Y ALEGRÍA B
SAF    SAGRADA FAMILIA

In addition, each student was assigned an identifying number. The first student in the list was assigned 001, the second 002, and so on, increasing in increments of 1 and restarting for each school’s list of students. To create a student ID, the school and student identifiers were concatenated to form a unique ID, so that there was a confidential identifier that maintained regional information. The list linking IDs to names can be supplied to individuals using the tool who need to decode a student ID to identify the original student.
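The ID scheme just described can be sketched in Python. The school codes are those listed above; the roster entries and the function name `assign_student_ids` are placeholders invented for illustration, and the exact ID layout (code followed directly by the three-digit number) is an assumption, since the text says only that the two identifiers were concatenated.

```python
SCHOOL_CODES = {
    "FRAY VICENTE BERNEDO B": "FVB",
    "GUALBERTO PAREDES": "GBP",
    "JOSE MARIA VELAZ": "JMV",
    "LUIS ESPINAL CAMPS": "LEC",
    "LOYOLA DE FE Y ALEGRÍA B": "LFA",
    "SAGRADA FAMILIA": "SAF",
}

def assign_student_ids(roster):
    """Build confidential IDs for a roster of (school, student name) pairs.

    Numbering starts at 001 and restarts for each school. Returns the IDs
    in roster order plus a lookup key mapping each ID back to the name.
    """
    counters = {}
    ids, key = [], {}
    for school, name in roster:
        counters[school] = counters.get(school, 0) + 1
        student_id = f"{SCHOOL_CODES[school]}{counters[school]:03d}"
        ids.append(student_id)
        key[student_id] = name
    return ids, key

# Placeholder roster; the names are not real students.
roster = [
    ("LUIS ESPINAL CAMPS", "Student One"),
    ("LUIS ESPINAL CAMPS", "Student Two"),
    ("SAGRADA FAMILIA", "Student Three"),
    ("LUIS ESPINAL CAMPS", "Student Four"),
]
ids, key = assign_student_ids(roster)
```

Only the `ids` would appear in the dashboard; the `key` is the decoding list that would be shared separately with those who need to identify the original student.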


The survey originally came to the school in Spanish, and was translated to English for the use of the data mining students. For this tool to be efficient and user-friendly, it had to be translated from English back to Spanish to allow for ease of use by Miguel and others at FyA:B.

Table 4. Questions used in the final Tableau dashboard.

1  How Many Meals Per Day Do You Eat?
2  Is There Running Water in Your House (Do You Have Access to Potable Water)?
3  Which One of the Following Groups Best Fits Your Mother’s Current Work Situation?
4  Which One of the Following Groups Best Fits Your Father’s Current Work Situation?
5  Does Your House Have Electricity?
6  Do You Work and Get Paid For It?
7  Are There Any Violent Gangs In Your Neighborhood Or School?

The questions used in the final Tableau dashboard can be seen in Table 4. Although the survey asks for the highest education level achieved by the mother and father, SJU students elected to include the parents’ employment status in the dashboard, as it more directly reflects how the respondent is being provided for. These two questions, combined with the students’ need to work, give the user an idea of the respondent’s family financial situation, which is consistent with the UN definition of poverty (United Nations, 1995).

Although most students have access to electricity and running water, these are crucial questions, as they directly reflect poverty level and are not speculative in the way the father’s current work situation is. Finally, the presence of gangs in students’ neighborhoods and schools, and even some descriptive scenarios of their circle of friends, are indicators of socio-economic status, as organized crime is often more prevalent in neighborhoods of lower socio-economic status. The region may also need to be considered, as urban regions often have a higher proportion of such activity.

In the survey supplied to students, multiple choice answers ranged from 1 to 2 for yes/no questions and from 1 to 4 or 1 to 5 when multiple individual answers were an option. When considering each question, most answers were grouped into three color categories: green, yellow, and red, indicating a low, medium, and high level of need respectively. In the case of two-answer (yes/no) questions, there is no corresponding medium or intermediate need (yellow) category. Table 5 provides an example of how the questions were color coded for the tree maps and indicates how answers were categorized by need.

Table 5. Responses to the gang question and how they were color coded.

Question: Are there any violent gangs in your neighborhood or school?

Provided Answers                                                             Need          Base Color
There aren’t any gangs. Neither at school nor in my neighborhood.            Low Need      Green
There are gangs in my neighborhood, but they do not come close to school.    Low Need      Green
When I come to school or go home I see gangs.                                Medium Need   Yellow
There are gangs at my school.                                                High Need     Red
Some of my friends are part of a gang.                                       High Need     Red
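The color-coding rule in Table 5 amounts to a simple lookup, sketched here in Python. The mapping of numeric codes 1 to 5 onto the answers in the order listed, the coding of yes/no answers as 1 = yes and 2 = no, and all names in the snippet are assumptions made for illustration; the dashboard itself was built in Tableau, not Python.

```python
# Need level to tree-map color, as described in the text.
NEED_COLOR = {"low": "green", "medium": "yellow", "high": "red"}

# Gang question: answer codes assumed to follow the order shown in Table 5.
GANG_NEED = {1: "low", 2: "low", 3: "medium", 4: "high", 5: "high"}

# A yes/no question has no medium (yellow) level.
# Assumed coding: 1 = has electricity, 2 = does not.
ELECTRICITY_NEED = {1: "low", 2: "high"}

def color_for(answer, need_by_answer):
    """Return the display color for a coded survey answer."""
    return NEED_COLOR[need_by_answer[answer]]

gang_colors = [color_for(a, GANG_NEED) for a in range(1, 6)]
electricity_colors = [color_for(a, ELECTRICITY_NEED) for a in (1, 2)]
```

Keeping one small mapping per question makes it straightforward to adjust how an answer is categorized without touching the rest of the logic.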

By translating the dashboard, adjusting the confidentiality for students, and altering how the information is presented, the final product is a user-friendly and informative dashboard. This tool will be helpful in identifying trends (for example, by region and school) and in giving a surface-level view of socio-economic well-being among FyA:B students. In the long term, this tool will hopefully become impactful, as it can update immediately when new survey responses are uploaded and instantly reflect the students who are most in need.

Conclusion

This paper and the work herein share achievements made in educational sustainability through an ongoing process that addresses new ways of identifying students who are in the greatest economic need while also engaging collegiate students in the stewardship of global citizenship and responsibility. It does this by describing the analytical support that a Saint Joseph’s University (SJU) Haub School’s Data Mining class has provided over the past three academic years to Fe y Alegría in Bolivia (FyA:B). The initial question was how to identify from survey data which early high-school students (in la secundaria) are most impoverished so that support efforts can be targeted earlier to those with the highest need. Over time more data and new questions have arisen and been addressed. Changes in approach, analysis, and instruction have been implemented. With continued work, the next step would be to create a feasible and useful poverty index using the survey responses. Once created, an index shown to be reasonable, reliable, and reproducible could help schools both in Bolivia and perhaps in other countries in which Fe y Alegría does its important work.


References

Allison, Paul. Logistic Regression for Rare Events. 2012. https://statisticalhorizons.com/logistic-regression-for-rare-events, accessed on May 19, 2018.

Ferguson, N. The Ascent of Money. Penguin Press, New York, 2008.

Garwood, Kathleen C & Dhobale, Arpit. (2018) A comparison of cluster algorithms as applied to unsupervised surveys. International Journal of Business Intelligence and Data Mining. In press, 2018.

Henderson, A. Ralph. The bootstrap: A technique for data-driven statistics. Using computer-intensive analyses to explore experimental data. Clinica Chimica Acta. 359 1 (2005): 1-26. DOI: 10.1016/j.cccn.2005.04.002.

MEB - Ministerio de Educación de Bolivia. La Educación en Bolivia: Indicadores, Cifras y Resultados. La Paz, Bolivia, 2004.

Mesa, J., Gisbert, T., Gisbert, C.D.M. Historia de Bolivia. (7th edn). Ed. Gisbert, La Paz, Bolivia, 2008.

Neiva de Figueiredo, J., and Marca Barrientos, M. “A decision support methodology for increasing school efficiency in Bolivia’s low-income communities.” International Transactions in Operational Research 19 (2012): 99–121.

Neiva de Figueiredo, J., Jursca-Keffer, A.M., Marca Barrientos, M., and Gonzalez Camacho, S. “A robust University-NGO partnership: Analysing school efficiencies in Bolivia with community-based management techniques.” Gateways: International Journal of Community Research and Engagement 6 (2013): 93-112.

Norton, Edward C. & Dowd, Bryan E. “Log Odds and the Interpretation of Logit Models.” Health Services Research. 53 2 (2018): 859-878. Published online on May 30, 2017.

Shneiderman B, Wattenberg M. “Ordered Treemap Layouts.” In: Proceedings of the IEEE Symposium on Information Visualization 2001 (INFOVIS’01), Washington, DC, USA, 2001. IEEE Computer Society (2001): 73.

United Nations, The Copenhagen Declaration and Programme of Action, World Summit for Social Development. (1995): 6-12. New York, United Nations, 1995.

US Government: Central Intelligence Agency. World Factbook, 2018. Washington, D.C., 2018. https://www.cia.gov/library/publications/the-world-factbook/geos/print_bl.html, accessed on May 14, 2018.

World Bank, 2018. World Bank Country Overview – Bolivia. Washington, D.C., 2018. http://www.worldbank.org/en/country/bolivia/overview, accessed on May 14, 2018.
