Relation between air and soil pollution based on statistical analysis and interpolation of Nickel (Ni) and Lead (Pb): Case study of Zagreb, Croatia

Purpose. This paper focuses on the comparison of Ni and Pb concentrations in air and soil pollution in the Zagreb area. Due to the very limited amount of publicly available data from soil analysis samples, 2016 and 2019 were chosen as the best possible indicators of related changes in metal concentrations in soil and air. Methods. Testing the normality of Ni and Pb concentrations in the total deposited matter (TDM) confirmed the feasibility of using two parametric statistical tools – the Pearson correlation coefficient and the t -test. The Inverse Distance Weighting (IDW) interpolation method was selected as the best approach for a small number of measurements. Findings. The insufficient amount of data is the main shortcoming for urban health policy in a large area like Zagreb. The small number of air measurement stations and especially soil sampling sites cannot lead to any reliable conclusions about urban pollutants, their activity over time and direct links to soil toxic degradation based on statistical or geological methods and analyses. However, there is no doubt that urban pollution sources fill the soil with accumulated toxic elements such as Ni and Pb, especially in suburban areas located along the paths of the dominant wind directions. Originality. This is an original research that for the first time statistically analyzes and maps publicly available air and soil pollution data for the period 2016-2019. Practical implications. This research is a necessary step in determining the future planning of air and soil measurement stations in the Zagreb urban area.


Introduction
Soil and air pollution continues to be a major problem in urban areas such as Zagreb. As the largest city in Croatia, it faces high population density and heavy traffic, which has a significant impact on air and soil quality. The total mass of pollutants transferred from the air to surfaces is called the total deposited matter (TDM). The TDM is one of the parameters for assessing environmental air quality. As part of TDM determination, TDM metal concentrations are also determined (Pb, Cd, As, Ni, Tl, etc.). Considering that air particles settle to the ground as a result of deposition in the air, it can be concluded that there is a correlation between the concentrations of metals in air and soil.
Air quality is an important environmental variable that determines the spatial accumulation of heavy metals in the soil. Some results have shown that there is a sources-receptor relationship mechanism between air quality and heavy metals in soil. Industry releases heavy metal particles into the atmosphere, which then enter the soil through atmospheric deposition and precipitation [1].
However, soil pollution in urban or rural areas can be affected by both air and water streams. Even more, both me-chanisms are generally the most active in mining areas, increasing the concentrations of heavy metals, for example, in industrial discharged water or gases. So Kadriu et al. in [2] proved for several sampling sites that the industrial activity (mine and flotation) of the Trepça Mine is the main cause of environmental pollution. Their analysis of discharged water from the mine and the flotation showed concentrations of heavy metals in the surrounding environment.
Moreover, Polishchuk et al. in [3] defined the problem of assessing the guaranteed atmospheric air quality for a point source. Such point(s) can be defined in various fields such as mining, metallurgy, chemistry etc., but are always described as a vector random field and a variable with multidimensional distribution density. Consequently, there is a clear link between air quality and industrial sources.
Skrobala et al. in [4] presented mining as one of the most dangerous sources of environmental pollution using the case study of the Nadiya mine rock dumps in Ukraine. Pollution depends on the relief and slope exposure, which affect the distribution of chemical elements and their increase compared to the background values. Dispersion (according to the Principal Component Analysis) influenced by anthropogenic activity is the most intense for Mg, Pb, Sn, Fe, Al, Cu, P, Ni, Zn, affecting vegetation near dumps.
However, in Croatia, mining activities do not have a direct influence on the environment due to the closure of all active coal mines several decades ago (last in 1999). Only a few relatively small quarries are still active, but their influence is not chemical, like the accumulation of heavy metals, dissolved ions etc., but in air quality due to the large quantities and volumes of mining dust. Gravel and clay pits do not have any influence on local inhabited areas, except for aquifers (unless they are properly planned).
Consequently, the biggest air pollutant in the city of Zagreb is traffic [5], in addition to several industrial plants in the eastern part of the city, a heating plants, a waste disposal site and an airport near the city. Of course, pollution can also occur when a pollutant enters the soil directly. However, this data is not considered in this work, since the main goal is to monitor where most wind-blown pollutants (mostly caused by traffic) settle, so that we can better adapt the measurement stations in further research. Polluting substances are carried by the wind and, depending on the particle size, direction and speed of the wind, settle on the soil surface near the source of pollution or are carried to more distant areas (primarily if they are small particles PM2.5 and smaller, ultrafine particles). Many papers study the relationship between pollutants and the influence of meteorology on the behaviour of pollutants, the direction of their trajectories and, ultimately, deposition [6]- [8]. The general conclusion of all the papers is that wind direction plays a key role in pollutant migration from its source to deposition. The mechanisms that explain this relationship are complex, since many other phenomena influence the path of particles from the source to the soil surface, such as the size of the particles themselves, wind direction and speed, rainfall, vegetation, tall buildings, etc. [9].
Heavy metals are very dangerous for the environment and organisms. Through the food chain, humans suffer the consequences of soil pollution. If soil is contaminated with heavy metals, it is difficult to rehabilitate it [10]. These heavy metals are persistent and bioaccumulative, do not biodegrade or metabolize in the environment [11].
Within the framework of this paper, Ni and Pb were selected as potentially toxic elements in high concentrations. These elements were chosen because of the same elements measured in soil near air measurement stations during two years of observations. Nickel (Ni) is a toxic and carcinogenic heavy metal. It enters the soil with diesel fuel, smelters, city effluent and mining [12]. The average concentration of Ni in soils is 20-30 mg/kg [13]. Spatial distribution of Ni in central Croatia area ranges within 12 and 427 mg/kg with an average value of 33 mg/kg, which is lower than the average value for the entire Croatia (48 mg/kg) ( Fig. 1) [14].
Lead (Pb) is a heavy metal mostly presented in the top layer of soil due to the deposition from air containing smoke from vehicles. In uncontaminated soils, Pb concentrations are generally below 50 mg/kg [15]. Also, industrial activities have a large impact on heavy metal pollution, ecological and health risks. When the trace element concentration exceeds the critical threshold, it may have a potentially toxic effect on human health and the ecosystem [16]. According to the International Agency for Research on Cancer (IARC), high concentrations or extreme exposure to Pb can cause severe neurological and hematological disorders in the exposed population, mainly children.

Figure 1. Regional spatial distribution of Ni [10]
On the other hand, short-term exposure to a high Ni concentration via inhalation can cause kidney and lung disorders. Exposure to high levels via ingestion can lead to neurological disorders and gastrointestinal discomfort. Moreover, long-term exposure through dermal contact may result in dermatitis, while inhalation can cause many respiratory diseases, such as nasal and lung cancer [17]. Lead content in this region varies between 14 and 217 mg/kg, with an average value of 27 mg/kg, which signi-ficantly exceeds the state average content of this element in soils ( Fig. 2) [14].

Figure 2. Regional spatial distribution of Pb [14]
Soil enrichment with Cd, Zn, As, and Pb is attributed to high-temperature coal combustion [18]. In the areas of large cities, there are mixed sources of pollution from industrial emissions, the burning of fossil fuels, emissions from vehicles, irrigation with wastewater, as well as the disposal of solid waste and natural sources of parent soil materials in urban soils [19].
The major methods for the prevention and control of heavy metals in soils are focused on the identification of heavy metals in the soil from different pollution sources. The average contribution rates of the soil-forming process in Cd, Hg, As, and Pb are around 80-90%, and those of anthropogenic activities are 10-20% [20]. Certain papers come to the conclusion that the source material plays a very important role in the distribution and quantity of metals in soil horizons. Based on the example of Cd, it was concluded that the parent material is the decisive factor determining the magnitude and depth of exogenous Cd accumulation in the soil profile. These findings suggest that the parent materialinduced Cd fraction distribution and accumulation should be considered in order to effectively explore remediation strategies for Cd pollution [21].
However, how pollutants behave in an unsaturated zone is a very complex question. Their carrier is affected by many physical, chemical and microbiological processes, such as advection, dispersion, diffusion, sorption, cation exchange, etc. [22]. Sorption processes of potentially toxic metals in soils differ at different soil pH values, and the soil ability to retain them depends on its resistance to any soil pH changes [23].
A total of 420 surface soil layer samples (0-15 cm) were sampled in Zagreb and surrounding areas using systematic sampling on a 2 km grid [24]. The conclusion of this work is that the influence of the anthropogenic urban environment on the distribution of heavy metals in the soils of Zagreb is noticeable only for Hg, Pb and Zn, while pollution caused by Cr, Ni and Mn has not revealed. When investigating the movement of pollutants in the unsaturated zone of the Zagreb aquifer [25], the highest concentrations of potentially toxic elements (Pb, Cd, Zn) were determined in the shallowest part of the soil profile. Considering the obtained results of the predicted transport model, in this case, the initial concentration of potentially toxic elements on the soil profile surface is about 10% of the elements carried into the aquifer. Therefore, the soil surface layer can be considered the most significantly polluted with potentially toxic metals from the air.
An interesting research was published by Saeiti et al. [26], where the authors pointed out mixed models as a useful way of longitudinal discriminant data analyses. Although the model was applied to specific patient deviations from the general trend over time (i.e., in biostatistics), they introduced a so called "joint distribution" to include all model random effects and assumed as multivariate normal distribution. Two sample sets are tested (n = 250 and 2500), where misspecification of the random effect distributions is found. Authors [26] stated that "when there is severe departure from normality, more flexible mixture distributions can give better classification accuracy. However, in many cases, the incorrect assumption of a single multivariate normal distribution has little impact on the classification accuracy". Although analysis considered time-shifted medical data, an important general conclusion can be drawn. If there is a slight deviation (for random effects) from a normal distribution, then assuming a single Gaussian distribution will not lead to significantly less accuracy. However, if the deviation is larger, more accurate classifications will be reached with "a more flexible random effects distributions", but it is only possible for larger datasets.
In a similar study, also in the field of medicine, presented by Sheng et al. [27], a correlation is analysed in clustered data (formed by two ears of the same individual). They also propose a linear mixed-effects model for analysing the multilevel data structures. Generally, they concluded that more efficient models are mixed, multivariate, with a larger number of inputs (in this case, method for both ears, including participants, audiologists and confounder).
Also often in geology there is an approach to increase artificially the data using bootstrap as a nonparametric statistical method. Ivšinović and Litvić [28] show that resampling an input dataset can provide a chance to reach a new set which is characterised by normal distribution. However, this method is based on a large number of resampling. The authors show that at least 1000 (new) subsurface porosity artificial data were necessary to obtain a normal distribution.
All these approaches deal with distribution of data and can be tested for both small and large datasets in different fields of science and engineering. They all have their limitations, especially bootstrap when resampled datasets may not be statistical representative, but they all can help to make decision how to spatially represent variability.

Study area description
The study area includes six measurement stations in Zagreb for observing air quality (A1-A6). The behaviour of air pollutants in the environment, as well as human exposure to air pollutants in Croatia, are investigated and monitored by the Institute for Medical Research and Occupational Health (IMI). During the two-year period, 2016 and 2019, daily samples of Ni and Pb concentrations in the total deposited matter were taken, and monthly and annual averages are considered for mapping and statistical analyses. Additionally, soil samples were taken from two locations (S1 and S2), and values in 2016 and 2019 were compared with air pollutants. These 2 years are the only ones when mutual air and soil measurements exist. Moreover, locations S1 and S2 were chosen due to their proximity to possible pollution sources. S1 is the location of the Zagreb wastewater facility and S2 is close to the city landfill. The entire observed area (air measurement and soil sampling stations) covers approximately 300 km 2 . The sampled locations are shown in Figure 3.

Figure 3. Study areas with measurement stations (Aair measurement stations, Ssoil sampling stations)
Soil samples were taken at a depth of 20 cm and subsequently subjected to drying, sieving and preparation for sequential extraction analysis (BCR) [29]. The BCR method was used to determine the overall distribution of heavy metals in sediments.

Geological settings
The study area consists primarily of Quaternary deposits composed of Pleistocene non-carbonate loess and Holocene alluvium, flooded and splay sediments [30]. It is represented by lowlands located at the very southwestern margin of the Pannonian Basin System, below Medvednica Mt. These Holocene sediments are carried by the Sava River in numerous redeposition cycles originating from the Alpine regions [31].
Apart from alluvial deposition, a significant effect of tectonics results in pronounced heterogeneity and anisotropy [32].
Quaternary sediments surrounding Zagreb were firstly described at the beginning of the century [33]. Based on all available geologic-geophysical data, Pleistocene and Holocene sediments have been completed to a depth of approximately 100 m. At this depth, sediments are composed of loose and weakly cemented clasts, the sizes of which correspond to gravels, sands, silts and clays in variable proportions (Fig. 4).
The upper part (upper 50 m) is dominated by gravels. They are underlain by a thin layer of silty clay overlying the following sequence: gravels and sands, thin silty clayed sediments and sand, silts and clays forming the lowermost 10 meters [34]. Analysed surface soil samples were taken from 0-20 cm depth in locations S1 and S2 [25], [29], [35], while metal concentrations in the total deposited matter were measured monthly in 2016 and 2019 at six locations, namely A1-A6 (Table 1-3).

Statistical analysis
The Inverse Distance Weighting (IDW) interpolation method was selected, which will give the best data for a small number of measurements [36]. The IDW method is based on the distance ponder with an exponent, mostly the second power [37]- [39], between the measured data and the location where an unknown value is estimated. Such values are in grid nodes.
where: ZUIestimated value; di-distance to the "i-th" location; ppower of distance; Zimeasured values at "i-th" location (i = 1 ,…, n) [40]. The result of IDW interpolation depends on the value of the exponent p, which is obtained experimentally and has a different value in different fields of science [38]. Specifically, in this work, the value of 2 was used as the exponent p, which in previous studies in similar areas was shown to be optimal [38], [39]. This means that the weight given to each point is inversely proportional to the square of the distance between the unknown point and the known point. This gives more weight to nearby points and less weight to distant points, resulting in a smoother and more accurate interpolation surface. However, the choice of power exponent depends on the nature of the data and the desired accuracy of the interpolation. For example, a power exponent of 1 will give equal weight to all neighbouring points, while a higher power exponent will give more weight to the closest points.
To investigate the correlation between variables (in this case, metal concentrations), it is useful to check the normality of the TDM annual data distribution. In the limited datasets, a normality check is quite an unreliable analysis. However, the amount of data for normality analyses, t-test and interpolation was different. Statistical data were checked based on monthly averages for two compared years (t-test) or single years (normality check). It included 24 or 12 values, respectively. Oppositely, for interpolation, the datasets included 6 points and 2 auxiliary (for interpretation) data. So, 10+ data points are considered the minimum for the logical selection of formal normality tests, where also one was "stricter" (Shapiro-Wilk) and another "lighter" (Kolmogorov-Smirnov). One of the most frequently used and reliable (strong) tests for confirming the hypothesis of a normal distribution, the Shapiro-Wilk (SW) test is expressed by the Equation: where: x(i)the i-th smallest number in the sample; xa sample mean; aia constant. This is a hypothesis test that is applied to a sample and whose null hypothesis is that the sample was generated from a normal distribution. If the p-value is low, we can reject such a null hypothesis and say that the sample was not generated from a normal distribution [41].
Due to the small number of data (monthly averages), we additionally check the normality with another test called the Kolmogorov-Smirnov (KS) test [42]. The test is based on the empirical distribution function (ECDF). Given N ordered data points Y1, Y2, ..., YN, the ECDF is defined as: where: n(i)the number of points less than Yi; Yiordered from the smallest to the largest value. This is a step function that increases by 1/N at the value of each ordered data point.
For groups where normality is confirmed by at least one formal test, the parametric t-test was made to compare the means of the two groups of monthly values between 2016 and 2019. The paired sample t-test was chosen to determine whether the mean difference between two sets of observations is zero [43]. In the t-test, we used the p-value as a measure of the statistical significance of the observed result. If the p-value is small (less than 0.05), it suggests that the observed result is unlikely to have occurred by chance, and we reject the null hypothesis. On the other hand, if the p-value is large, this suggests that the observed result could have been obtained by chance (the concentration of individual metals in different years is statistically significant). It is important to note that the p-value does not provide information on the magnitude of the effect or the size of the difference between groups; it only shows us whether the observed effect is statistically significant. Two sets of observations are paired if each observation in one set has a special correspondence of relationship with exactly one observation in the other data set [44]. The formula for estimating the paired t-test is: where: mthe mean of the group; σthe standard deviation of the group; nthe sample size. Also, for samples with a normal distribution, a correlation was made using Pearson correlation coefficient [45]. The Pearson correlation coefficient is a measure of the strength of a linear relationship between two variables [46]. It is commonly denoted by the Greek letter ρ (rho) and may be referred to as the population correlation coefficient. Given a pair of random variables (X, Y), the formula for ρ [44] is: where: covthe covariance; σXthe standard deviation of X; σYthe standard deviation of Y [47]. When applied to a sample, it is commonly represented as r and may be referred to as the sample correlation coefficient or the sample Pearson correlation coefficient.
ii ii x x y y r x x y y where: rcorrelation coefficient; xivalues of the x-variable in a sample; xa mean of the values of the x-variable; yivalues of the y-variable in a sample; xa mean of the values of the y-variable.

Results and discussion
In order to check the normality of the distribution of monthly values of Pb and Ni concentrations in the air during 2016 and 2019, Shapiro-Wilks and Kolmogorov-Smirnov formal tests were performed. The results are presented in Table 4. Most monthly concentrations have a normal distribution, especially when analysed using the KS test. It was expected because the KS test is not as strict as the SW test. The monthly Ni concentrations at the A2 measurement site in 2016 are an exception, since there was one extreme value, which was probably real and not a measurement error. Based on this, we can compare the monthly concentrations of individual measurement stations in 2016 and 2019 (except for Ni at A2) using the t-test and Person correlation coefficient (Table 5). In the t-test, we used the p-value as a measure of the statistical significance of the observed result. If the p-value is less than 0.05, it suggests that the observed result is unlikely to have occurred by chance, and we reject the null hypothesis (the concentration of individual metals in differrent years is not statistically significant). When normality is not satisfied, this means that the data distribution is not Gaussian, and therefore statistical methods based on this assumption may not be valid. In the context of air pollution data, this may happen because air pollution sources are not evenly distributed in space and time, or because some pollutants are more persistent in the atmosphere than others. In this case it is important to use non-parametric methods, such as the Wilcoxon rank-sum test, the Kruskal-Wallis test, Spearman's rank correlation or Robust regression, which do not assume a specific data distribution. Alternatively, one can transform the data using a suitable transformation, such as logarithmic or square root transformation, to make the distribution more normal before applying parametric methods.

Table 5. Results of statistical paired t-test (p-value) and Pearson correlation coefficient (r) between concentrations in 2016 and
Pearson correlation coefficient does not significantly differ from zero. According to this correlation coefficient, there is not significant linear relationship between monthly concentrations of Ni in 2016 and 2019, as well as between respective Pb concentrations. A paired t-test is used because it can be assumed that, without additional external pollutants, the sample means will be similar, i.e., statistically related. The null hypothesis assuming similarity between 2016 and 2019 monthly means is rejected based on the p value at locations A1, A5 and A6 for both Pb and Ni concentrations. At measurement station A4, the null hypothesis is not rejected for Ni concentrations, and it is rejected for Pb. At measurement station A2 and measurement station A3, null hypothesis is not rejected for both Pb and Ni concentrations.
In short, using an additional t-test, the hypothesis of equality of their means is mostly rejected (H1 : μ1 ≠ μ2), which places the time-shifted data of majority of air samples in different weather and soil conditions. This can be compared to negative correlations often calculated between the same datasets, where both values mask any relationship between stations, which probably cannot be the prevailing statement. However, finding the critical value of the rock petrophysical values when applying t-test in geological research could be a tentative process [48], and an increase in the number of data can help reach the representativeness of the statistics, i.e., belonging to the so called "parent population". This means that increasing the network of stations probably improve the relationship between different stations, expressed by statistical values such as correlation coefficient or t-test values.
Significant Since the spatial distribution of these potentially toxic elements (Pb and Ni) can be significantly affected by wind direction and speed, both factors were observed using an annual wind rose valid for Zagreb (meteorological station located in the uptown). The data for measured wind properties reflect the same time frames (2016 and 2019) as the Pb and Ni concentration measurements from the air. The length of each wind direction spike is the cardinal direction frequency in which the wind blows. It is represented as a percentage (0, 10, 20, 30…).
In both years (Fig. 7), the wind direction was predominantly towards the NE with speed of 0-2 m/s. In addition, the second spike was directed towards the SE with speed of mostly 0-1 m/s, rarely higher. Wind roses for 2016 and 2019 are very comparable.
Moreover, we used points S1 and S2 for a quantitative comparison of mapping results (A1-A6) and dominant wind directions (Figs. 9 and 10). Two soil samples were taken as an example to see if metal concentrations in soils increase or decrease over the years, as metals from the TDM settle and accumulate in the topsoil over the years. Of course, this is an assumption that cannot be precisely confirmed with such a small amount of data, and this paper is precisely a guide for further research into the relationship between air and soil metal pollution.

Conclusions
This paper is focused on the comparison of Ni and Pb concentrations in air and soil pollution in the Zagreb area. Due to the very limited amount of publicly available data from soil analysis samples, 2016 and 2019 were chosen as the best possible indicators of related changes in metal concentrations in soil and air.
The main problem was that the input data contained only six measurement stations for air quality and, more importantly, just two analysed soil sampling sites for the same metals (measured in total deposited matter). As can be seen on the maps, the only area defined by the air measurement stations can be mapped, due to the inadequate number of soil sampling sites. Moreover, those two subareas cannot be mapped simultaneously, but each by itself due to different physical values, but they are interpreted as possibly dependent.
Moreover, a larger dataset of air concentrations was statistically analysed by itself, based on monthly averages for 2016 and 2019 used to calculate the (mapped) annual averages. This included formal normal distribution testing, linear correlation calculations, and t-test.
Testing the normality of Ni and Pb concentrations in the TDM confirmed the feasibility of using two parametric statistical toolsthe Pearson correlation coefficient and the t-test. Interestingly, the Pearson correlation coefficient does not significantly differ from zero, i.e., the "r" is "not significant" and there is no significant linear relationship between the monthly concentrations of Ni and Pb in 2016 and 2019.
Furthermore, the t-test for most of the data shows that the null hypothesis is rejected. Thus, it can be concluded that the environment, i.e., the pollutant sources operating in the observed area, have changed over the three years. This is also confirmed by the absence of a linear correlation. In the locations where the null hypothesis was not rejected (A2 and A3), it is concluded that the area is still potentially polluted by the same sources in similar proportions.
Since the mean values of Pb and Ni concentrations in the air in 2019 are lower compared to 2016, while the Pb concentration in soil, on the contrary, increased in 2019, it can be concluded that, as expected, pollution from the air is accumulated in the surface soil layers. From the graphic display of wind directions and speeds, it is evident that the most dominant wind directions were in the northeast and southeast directions. The influence of the NE direction cannot be confirmed due to the absence of soil samplings, but the SE is clearly reflected in the increase in Pb in S1 and S2, although the air concentration in the Zagreb urban area decreased over the analysed years.
The deposition of particles from the air, their accumulation and behaviour within soil horizons is a very complex issue. Some analytical methods on how to deal with them are shown in this work. However, the most substantial conclusion is probably that the insufficient amount of data is the main shortcoming for urban health policy in a large area like Zagreb. The small number of air measurement stations and especially soil sampling sites cannot lead to any reliable conclusions about urban pollutants, their activity over time and direct links to soil toxic degradation based on statistical or geological methods and analyses. There is no doubt that urban pollution sources fill the soil with accumulated toxic elements such as Ni and Pb, especially in suburban areas located along the paths of the dominant wind directions.
However, with a rather restricted dataset and no meaningful spatial network of stations, normality and t-test could be used as some of the rare available statistical tools that could help to explain meaningless correlation values (all were in the range of r = +/-0.25, which was very strange depending on the number of calculations, where at least one pair could randomly reach a high correlation). Such an approach could be much more effective if the sampling network were expanded in the future. This is considered a necessary condition for any proper protection of urban soils and waters, and especially citizens, and is crucial for any reliable health policy that people can trust based on qualitative and quantitative statistical and geological tools, methods, and facts.