
Determining the location of postal centers in B&H using machine learning clustering method and GIS

Amel Kosovac*, Ermin Muharemović*, Muhamed Begović*, Edvin Šimić*

* Faculty of Traffic and Communications, University of Sarajevo, Bosnia and Herzegovina [email protected]

Abstract - The rapid development of technology directly drives the growth of e-commerce shipments, especially in the business-to-consumer (B2C) segment. This increase in e-commerce shipments has a strong impact on the express delivery industry. Under these conditions, organizing a postal network is a significant challenge. The problem is how many postal centers should be established in a specific geographical area, and at what locations, in order to optimize the level of service for users. This challenge has recently received increased attention in both industry and academia. The aim of this paper is first to provide a concise overview of current approaches to determining the optimal location of postal centers. The second part of the paper proposes an approach that relies on machine learning clustering methods under defined conditions and in a specific geographical environment, using appropriate geographic information tools for spatial data analysis and visualization.

Keywords - postal centers; B2C model; shipments pickup and delivery; machine learning; geographic information systems

I. INTRODUCTION

The development of technology has introduced new ways of communication between people and enabled a whole new approach to selling products. Today there is almost no geographical barrier in the market. E-commerce has removed these barriers and created new challenges in the postal industry [1]. This type of market gives people in rural areas direct access to a wide variety of products. Customers want the ordered product delivered to their address as soon as possible. These new conditions have set postal companies a demanding task in terms of network organization, delivery quality and cost management.

In order to find a balance between customer expectations and costs, postal companies have to pay more attention to defining the optimal number of postal centers in a given geographic area. Based on the number of shipments picked up and delivered in a specific geographic area, new approaches make it simpler to determine the optimal number of postal centers. To reduce costs, postal companies adopt a hub-and-spoke network structure [2]. The location of the central warehouse, or hub, is determined by the geographical area with the highest percentage of picked-up and delivered shipments, as well as by the development of the transport infrastructure.

In this paper, the clusters, that is, the postal centers, are determined from the number of shipments in pickup and delivery using the K-means clustering method. The data used in this research cover a period of one month and include more than 30,000 shipments in pickup and delivery. The research integrates machine learning clustering and GIS software to determine the number and position of postal centers in Bosnia and Herzegovina.

II. RELATED RESEARCH

The pioneer of research on facility location in relation to customers was Weber in 1909, whose main focus was reducing the total distance and overall cost. The one-hub siting problem was introduced by O'Kelly [3] and is equivalent to a Weber least-cost location model. That paper introduced the problem of locating hub facilities so as to serve a set of interacting cities. The problem of hub determination can be viewed in two ways: in terms of network characteristics and in terms of location problems.

Article [4] describes the simple plant location problem (SPLP) considering an important family of discrete, deterministic, single-criterion and widely applicable optimization problems. In addition to the methods described above, clustering methods are also used to determine the optimal center location.

Fuente and Lozano used cluster analysis and cost evaluation to determine warehouse number and location in Spain [5]. One of the basic clustering algorithms is k-means. Because of its simplicity and ease of application, it is still widely used even though it was described more than 60 years ago. Choosing a postal center location is a long-standing problem, studied mainly by supply chain researchers. The Warehouse Location Problem (WLP) has often been described as a clustering problem, and many papers have developed methods to solve it. In addition, the postal center needs to be considered from a network perspective, particularly as part of a hub-and-spoke system.

1582 MIPRO 2020/miproBIS

The hub location problem frequently arises in the design of transportation and distribution systems, postal delivery networks and airline passenger flows [6]. A hub-and-spoke network typically involves three simultaneous decisions: the optimal number of hub nodes, their locations, and the allocation of the non-hub nodes to the hubs [7]. With the K-means method, the number of clusters is likewise chosen by the user, and this choice is one of the most important factors when applying the method.

The authors of [8] illustrate some of the key issues: measuring similarity, forming clusters, and deciding on the number of clusters that best represents the structure of the data, an issue addressed later in this paper. Paper [9] finds the optimal allocation of warehouses and hospitals to demand cluster centers. In [10], the authors transform a special location-allocation problem into a clustering problem. Clustering methods used for logistics network optimization in GIS make it possible to visualize problems and make it easier to decide on the number of clusters with respect to many practical issues that software cannot take into account. More recent studies integrate machine learning into GIS software, as is done in this paper. For example, several machine learning models were integrated with GIS to determine the potential site for a proposed hotel based on business success indicators [11].

Previously, GIS was used only as a tool for better visualization of existing data and previously conducted research. GIS, however, offers far more than mere visualization. Nowadays, it is used for analyzing large amounts of data generated by complex systems such as traffic. It performs demanding spatial analyses while also supporting decisions important to many segments of traffic engineering [12]. In [13], the authors used the clustering method in GIS to develop a tool for planning and managing the reuse of agricultural drainage water for irrigation in the Nile Delta. There is also an approach [14] that conducts geochemical analysis using machine learning, in which Simple K-means is used. A decision support system for inbound logistics and transportation based on a Geographic Information System (GIS) was presented in [15]. The clustering method is often used to solve location and network routing problems in order to reduce costs [16]. In recent years, there has been great progress in solving the facility location problem based on clustering. A continuous facility location problem and its application to a clustering problem are described in [17], whose authors propose good approximation algorithms for Euclidean and squared Euclidean distance functions.

III. ANALYSIS AND INTERPRETATION OF RESULTS

The application of machine learning methods to optimization processes in various fields enables the detection of relations and patterns among large amounts of data.

Figure 1. Results of the Elbow method

In the practical implementation, real-world data on the locations of shipment pickup and delivery were used. These data were generated by one of the commercial courier operators covering the territory of Bosnia and Herzegovina.

The data exclusively cover the B2C (business-to-consumer) packet pickup and delivery segment. The authors of this paper ignored the existing layout of postal centers and tried to determine the optimal number and location of the centers from the data alone. Given that the data were unlabeled, the described problem has to be solved by unsupervised machine learning methods. Additionally, since each point must be assigned to a particular center from which a limited geographical area would be served, the problem is naturally a clustering problem.

Clustering is one of the most widely used data exploration techniques for gaining knowledge about data structure. It identifies subgroups in the data so that the data points in the same subgroup (cluster) are very similar, while the data points in different subgroups are very different. To implement clustering, the K-means method was used, primarily because of its ease of implementation and because it scales well to large datasets. Since the research includes over 30,000 records of individual pickups and deliveries, K-means was a suitable choice. The random initialization trap, in which a poor choice of initial cluster centers leads the algorithm to a misleading clustering, was avoided by using k-means++ initialization.

The algorithm was implemented in the Python programming language. The program code uses the pandas and NumPy libraries for importing and preprocessing the data, sklearn for implementing the K-means method, and matplotlib for displaying the results graphically.

The K-means method requires the number of clusters to be set manually before the algorithm is run. Due to the geographical specificity of Bosnia and Herzegovina, the minimum number of centers should be 5. Existing companies that provide commercial pickup and delivery services have about 10 centers on average. Therefore, the research was conducted by running the K-means method with a set of different values of k, where 𝑘 ∈ [5,15], and comparing the numerical and graphical results for each selected number of clusters. In the first attempt to determine the optimal number of clusters, the Elbow method was used.
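The clustering step described above can be sketched as follows. This is a minimal illustration, not the authors' code: it uses only the Python standard library instead of sklearn so that the k-means++ seeding and the assignment and update steps are visible, and it runs on synthetic points rather than the shipment data. The function names `kmeans_pp_init`, `assign` and `kmeans`, the seeds, and the three-blob test data are assumptions made for the example.

```python
import math
import random


def kmeans_pp_init(points, k, rng):
    """Choose k initial centers with k-means++ seeding."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Squared distance from every point to its nearest chosen center.
        d2 = [min(math.dist(p, c) ** 2 for c in centers) for p in points]
        # Draw the next center with probability proportional to that squared distance.
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers


def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points]


def kmeans(points, k, seed=0, max_iter=100):
    """Lloyd's algorithm started from k-means++ seeds."""
    rng = random.Random(seed)
    centers = kmeans_pp_init(points, k, rng)
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Move the center to the mean of its members; keep it if the cluster is empty.
            new_centers.append(
                tuple(sum(axis) / len(members) for axis in zip(*members))
                if members else centers[j]
            )
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(points, centers)


# Synthetic stand-in for shipment coordinates: three noisy blobs (assumed data).
data_rng = random.Random(42)
points = [(cx + data_rng.gauss(0, 0.5), cy + data_rng.gauss(0, 0.5))
          for cx, cy in [(0, 0), (10, 0), (0, 10)] for _ in range(50)]

centers, labels = kmeans(points, k=3)
```

A sweep such as the one in the paper would call `kmeans(points, k)` for each k from 5 to 15 and compare the resulting clusterings. Note that plain Euclidean distance on latitude and longitude is only an approximation of ground distance.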


(a) Silhouette analysis for K-means clustering: k=11

(b) Silhouette analysis for K-means clustering: k=12

(c) Silhouette analysis for K-means clustering: k=13

(d) Silhouette analysis for K-means clustering: k=14

Figure 2. Determining the optimal number of clusters using silhouette analysis


Figure 3. Results of the Silhouette score method

Figure 4. Part of the QGIS visualization giving insight into the real-world geographical locations of the determined centers

The Elbow method is a heuristic method for evaluating consistency within a cluster. The number of clusters is increased as long as each additional cluster yields a significantly better fit to the data. However, the results obtained were vague and difficult to interpret (see Fig. 1). It was very difficult to determine where the "elbow" was located, because with each increase in the number of clusters the value of the WCSS parameter (within-cluster sum of squares) decreased by roughly the same amount.

If $C$ denotes a cluster, $S_i$ denotes the shipments that belong to that cluster, and $d$ is the distance between a shipment $S_i$ and the center of its cluster $C$, then WCSS can be defined as follows, where $n$ is the observed number of clusters:

$$WCSS = \sum_{S_i \in C_1} d(S_i, C_1) + \sum_{S_i \in C_2} d(S_i, C_2) + \cdots + \sum_{S_i \in C_n} d(S_i, C_n) \tag{1}$$
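As a small worked example of equation (1), the sketch below computes WCSS for four hand-picked points. The function name `wcss` and the `squared` flag are assumptions of this example: equation (1) as printed sums the distances d, whereas the conventional within-cluster sum of squares (and the `inertia_` attribute that sklearn's `KMeans` reports) sums squared Euclidean distances, so both variants are shown.

```python
import math


def wcss(points, labels, centers, squared=True):
    """Sum, over all points, the (squared) distance to the assigned center."""
    total = 0.0
    for p, label in zip(points, labels):
        d2 = sum((a - b) ** 2 for a, b in zip(p, centers[label]))
        total += d2 if squared else math.sqrt(d2)
    return total


# Four illustrative points (not shipment data).
pts = [(0, 0), (0, 2), (10, 0), (10, 4)]

# Two clusters, each center placed at the mean of its two points.
two_labels = [0, 0, 1, 1]
two_centers = [(0, 1), (10, 2)]

# One cluster containing everything, centered at the overall mean.
one_labels = [0, 0, 0, 0]
one_centers = [(5, 1.5)]

wcss_two = wcss(pts, two_labels, two_centers)                     # 1 + 1 + 4 + 4 = 10
wcss_two_plain = wcss(pts, two_labels, two_centers, squared=False)  # 1 + 1 + 2 + 2 = 6
wcss_one = wcss(pts, one_labels, one_centers)                     # 111
```

Going from one cluster to two drops the squared WCSS from 111 to 10. The Elbow method looks for the value of k after which such drops become small.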

As the Elbow method did not produce clearly interpretable results, a Silhouette analysis was introduced. The silhouette value measures how similar a feature (data point) is to the cluster to which it belongs. It takes values between -1 and +1. A higher value means that the feature fits its own cluster more clearly, that is, it is more distant from neighboring clusters. If too many points have a small or even negative value, it is highly likely that the number of clusters is not ideal for the data.

If $d_a(i)$ denotes the mean distance between $S_i$ and all other data points in the same cluster, and $d_b(i)$ denotes the smallest mean distance from $S_i$ to all points in any other cluster of which $S_i$ is not a member, then the silhouette value can be defined as:

$$s(i) = \frac{d_b(i) - d_a(i)}{\max\{d_a(i), d_b(i)\}} \tag{2}$$
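Equation (2) can be checked by hand on a toy example. The sketch below is an illustration only, using the standard library and Euclidean distance. The function name `silhouette_values` and the four test points are assumptions, and a point that is alone in its cluster is given the value 0, which is the usual convention.

```python
import math
from collections import defaultdict


def silhouette_values(points, labels):
    """Return s(i) from equation (2) for every point."""
    clusters = defaultdict(list)
    for idx, label in enumerate(labels):
        clusters[label].append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette values need at least two clusters")

    values = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            values.append(0.0)  # singleton cluster: value set to 0 by convention
            continue
        # d_a: mean distance to the other members of the same cluster.
        d_a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # d_b: smallest mean distance to the members of any other cluster.
        d_b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != label
        )
        denom = max(d_a, d_b)
        values.append((d_b - d_a) / denom if denom > 0 else 0.0)
    return values


pts = [(0, 0), (2, 0), (10, 0), (12, 0)]

# Sensible grouping: for the first point d_a = 2 and d_b = 11, so s = 9/11.
good = silhouette_values(pts, [0, 0, 1, 1])

# The third point forced into the wrong cluster: d_a = 9, d_b = 2, so s = -7/9.
bad = silhouette_values(pts, [0, 0, 0, 1])
```

The negative value for the misassigned point is the signal described above: the point lies closer to a neighboring cluster than to its own.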

Fig. 2 shows in detail the Silhouette score for each value of k, where 𝑘 ∈ [11,14], as well as for each individual cluster and each location within that cluster. In addition, the two-dimensional Cartesian coordinate system shows the locations of each individual pickup and delivery, as well as the estimated ideal positions of the centers (clusters). Membership in a cluster is indicated by a common color for all locations within its area of coverage.

This method yielded much clearer results. It showed that, for a set threshold value of 0.577, a number of centers between 11 and 14 is acceptable, and the best results were achieved with k = 13 (see Fig. 3).

Unfortunately, building a postal center at the optimal location determined by these algorithms is often not possible in practice. The cluster center may fall in an unfavorable location. Possible reasons include a location in the city center where such facilities are not feasible, the absence or high cost of suitable land to buy or rent for construction, protected areas where construction is prohibited, inadequate access to traffic infrastructure, etc. Therefore, certain optimal locations determined by the algorithm must be adjusted in space according to the characteristics of the environment.
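One simple way to make such an adjustment, shown here only as a sketch and not as the procedure used in the paper, is to move each computed center to the nearest site from a list of locations known to be feasible. The great-circle distance is computed with the haversine formula. The function names, the candidate sites and all coordinates below are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_feasible_site(center, sites):
    """Return (name, distance_km) of the feasible site closest to the center."""
    lat, lon = center
    return min(
        ((name, haversine_km(lat, lon, s_lat, s_lon))
         for name, (s_lat, s_lon) in sites.items()),
        key=lambda item: item[1],
    )


# Hypothetical values, for illustration only (not taken from the paper).
cluster_center = (43.86, 18.41)   # (latitude, longitude) of a computed center
candidate_sites = {
    "site_A": (43.85, 18.35),
    "site_B": (44.20, 17.91),
}

site_name, shift_km = nearest_feasible_site(cluster_center, candidate_sites)
```

The returned distance indicates how far the center had to be moved, which can be compared across centers to judge how much the adjustment departs from the clustering result.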

The output of the results, that is, the position of the clusters (postal centers) in the coordinate system of the matplotlib library, does not give insight into the real geographical position of the points. It is much easier to draw conclusions using geoinformation systems, which provide rich visualization and additional analysis of spatial data. For this reason, the .csv output from Python was imported into QGIS (Quantum Geographical Information System) software, and the actual geographical positions of the clusters were interpreted using the OpenStreetMap basemap. From the visualization in GIS (see Fig. 4), it was concluded that the positions of the centers are closest to the following larger settlements: Sarajevo, Zenica, Bihać, Mostar, Trebinje, Posušje, Goražde, Bijeljina, Tuzla, Brčko, Banja Luka, Prijedor, and Doboj.
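The export step can be sketched as below. The column names, the function name `write_centers_csv` and the coordinates are assumptions made for the example, since the paper does not describe the layout of its .csv file. A file of this shape can be loaded in QGIS as a delimited text layer, with longitude as the X field, latitude as the Y field and EPSG:4326 as the coordinate reference system.

```python
import csv
import io


def write_centers_csv(centers, fileobj):
    """Write (latitude, longitude) cluster centers as CSV rows."""
    writer = csv.writer(fileobj)
    writer.writerow(["center_id", "longitude", "latitude"])
    for center_id, (lat, lon) in enumerate(centers):
        writer.writerow([center_id, lon, lat])


# Hypothetical coordinates, for illustration only (not results from the paper).
example_centers = [(43.86, 18.41), (44.77, 17.19)]

# An in-memory buffer keeps the example self-contained; to produce a real
# file, pass open("centers.csv", "w", newline="") instead.
buffer = io.StringIO()
write_centers_csv(example_centers, buffer)

# Read the text back to confirm the layout.
rows = list(csv.DictReader(io.StringIO(buffer.getvalue())))
```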

IV. CONCLUSION

In optimizing the costs of organizing a postal network, great attention is paid to the quality of service delivery and customer satisfaction. Only through the good organization of pickup and delivery, and an optimal number of centers, is it possible to achieve a high level of customer satisfaction in the B2C segment. The analysis of the K-means method results demonstrates that there is potential for the application of machine learning and geoinformation systems for determining the number and spatial dispersion of postal centers.

The data used in this research related only to the distance between the delivery/pickup locations and the respective postal centers. Other factors were not considered, such as the type and weight of shipments, the development and condition of the transport infrastructure, postal center fleet capacity, the number of stops in pickup and delivery, etc. In future research, it is desirable to test the accuracy of the method by considering multiple factors, that is, multiple dimensions that influence the selection of the optimal number of centers and their location. It would also be useful to compare and verify the results using other clustering methods, such as hierarchical clustering.

REFERENCES

[1] S. Čaušević, E. Muharemović, B. Memić, and M. Begović, “Integration of logistics information systems with electronic sales channels,” in International Scientific Conference “Science and Traffic Development” - ZIRP 2018, 2018, pp. 53–62.

[2] M. Masaeli, S. A. Alumur, and J. H. Bookbinder, “Shipment scheduling in hub location problems,” Transportation Research Part B: Methodological, vol. 115, pp. 126–142, 2018.

[3] M. E. O’Kelly, “Location of Interacting Hub Facilities,” Transportation Science, vol. 20, no. 2, pp. 92–106, 1986.

[4] J. Krarup and P. M. Pruzan, “The simple plant location problem: Survey and synthesis,” European Journal of Operational Research, vol. 12, no. 1, 1983.

[5] D. de la Fuente and J. Lozano, “Determining warehouse number and location in Spain by cluster analysis,” International Journal of Physical Distribution & Logistics Management, vol. 28, no. 1, pp. 68–79, 1998.

[6] M. Naeem and B. Ombuki-Berman, “An efficient genetic algorithm for the uncapacitated single allocation hub location problem,” in 2010 IEEE World Congr. Comput. Intell. WCCI 2010 - 2010 IEEE Congr. Evol. Comput. CEC 2010, 2010.

[7] M. E. O’Kelly, “Hub facility location with fixed costs,” Papers in Regional Science, vol. 71, no. 3, pp. 293–306, 1992.

[8] C. Beckett, L. Eriksson, E. Johansson, and C. Wikström, Multivariate Data Analysis (MVDA). 2017.

[9] L. Özdamar and O. Demir, “A hierarchical clustering and routing procedure for large scale disaster relief logistics planning,” Transportation Research Part E: Logistics and Transportation Review, vol. 48, no. 3, pp. 591–602, 2012.

[10] K. Liao and D. Guo, “A Clustering-based approach to the capacitated facility location problem,” Transactions in GIS, vol. 12, no. 3, pp. 323–339, 2008.

[11] Y. Yang, J. Tang, H. Luo, and R. Law, “Hotel location evaluation: A combination of machine learning tools and web GIS,” International Journal of Hospitality Management, vol. 47, pp. 14–24, 2015.

[12] S. Čaušević, A. Deljanin, M. Begović, and E. Deljanin, “Potentials and advantages of applying geographic information systems in various fields of traffic engineering,” in Road and Rail Infrastructure V, 2018, vol. 5, pp. 1285–1290.

[13] M. Shaban, B. Urban, A. El Saadi, and M. Faisal, “Detection and mapping of water pollution variation in the Nile Delta using multivariate clustering and GIS techniques,” Journal of Environmental Management, vol. 91, no. 8, pp. 1785–1793, 2010.

[14] G. H. Alférez, J. Rodríguez, B. Clausen, and L. Pompe, “Interpreting the geochemistry of southern California granitic rocks using machine learning,” in Proc. 2015 Int. Conf. Artif. Intell. ICAI 2015 - WORLDCOMP 2015, 2019, no. Lil, pp. 592–598.

[15] R. Tangkitjaroenmongkol, S. Kaittisin, and S. Ongwattanakul, “Inbound Logistics Cassava Starch Planning,” in Eighth Int. Jt. Conf. Comput. Sci. Softw. Eng. Inbound, 2011, pp. 204–209.

[16] S. Barreto, C. Ferreira, J. Paixão, and B. S. Santos, “Using clustering analysis in a capacitated location-routing problem,” European Journal of Operational Research, vol. 179, no. 3, pp. 968–977, 2007.

[17] L. A. A. Meira and F. K. Miyazawa, “A continuous facility location problem and its application to a clustering problem,” in Proc. ACM Symp. Appl. Comput., 2008, pp. 1826–1831.

