By James Taylor, Claudio Piccinini and Corentin LeRoux

**Introduction**

One of the aims of the Spatial Data Processing group is to understand how the large spatial data sets now being generated can be simplified to deliver information to users. In this report, we look at two potential end-user scenarios with very different intentions. The first case study looks at the wealth of information generated by the vineyard imaging sensor and how it may be compressed (simplified) for a more rapid in-field decision process. The second focuses on how raw yield data can be compressed to provide more coherent information to the grower to prompt decision-making.

**Case Study 1:**

*Problem:*

Images are collected along rows using the CMU sensor, and berry count and berry colour data are subsequently derived from them. While mapping and visualising these data are important for future management, at harvest the critical logistical information is:

a) how many berries/bunches are there in a row and,

b) what percentage of them are ready to pick.

This then drives a decision on whether to send picking crews down a given row and how many boxes (and how much support) are needed in the row. This information informs the expected man-hours and time required for a harvest operation.

However, to utilise this information effectively, growers require a fast turnaround in the analysis. At the moment, the CMU sensor developed within the project is capable of very high-throughput imaging, which generates very large volumes of data. In fact, there is so much data that it cannot be sent over wireless networks; it has had to be stored on external disk drives, which are physically transported to a computer laboratory for processing. While part of the sensor development is to shorten processing times, and significant progress has been made on this, growers need information turned around as soon as possible, and relying on physically moving data is time-consuming. From an engineering and purely scientific perspective, the more data the better. However, from a viticulture perspective, the more timely the better, provided accuracy is not adversely affected.

To address the viticulture need for timeliness, a data reduction analysis was performed to understand the trade-off between data density and measurement accuracy of the total berry count and the percentage of harvestable fruit within rows in table grape vineyards.

The analysis assumed that the number of images to be processed could be reduced and interpolation used to fill gaps. Interpolation is much faster than image processing leading to a saving in processing time and reduction in data size for a rapid output. It does not preclude the collection and processing of high resolution imagery for future vineyard management that is not time-constrained. As described subsequently, the analysis investigated the best way to predict total number of berries and the % of each color grade within a vineyard row by considering:

- how much data to subset (from 1% - 50% of data used)

- regular or random sub-sampling, including where the first point is

- simple vs complex interpolation methods

- interpolate along the rows only or to allow data from neighboring rows to be used also.

*Methods:*

The following analysis is to determine the effect of data reduction (and data processing) on estimates of the number of berries and % of harvestable berries in each row. It does not consider the spatial distribution of berry count or color within a row.

As a starting point, the full data set provided by CMU was considered to be correct, i.e. the number of berries in a row was assumed to be the number counted in the continuous overlapping images taken along the row. Each image also returns a % of berries with each color grade (A – D), which can be analysed in the same manner as the berry count data. The prediction of berry count and color grade % was done on each row individually.

For any given row *k*, the data was subset to 50, 33.3, 25, 20, 16.7, 14.3, 12.5, 11.1, 10, 5, 3.3, 2 and 1% of the original data density. This was done in two ways:

1) randomly along the row, or

2) systematically by selecting every second, third, fourth,…, hundredth observation along the row.

Within each of approaches 1) and 2) above, iterations were run where the subsets were constrained to either

A) force the first and last point in the row to be used or,

B) set no restriction on the location of the starting point.

Forcing the first and last points ensures that the row length is properly defined and also standardises the systematic subset approach as the initial seed point is always set as the first point in the row.
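The two subsetting strategies and the end-point constraint can be sketched as below. This is a minimal illustration and not the code used in the study: the function names, the use of image indices along a row, and the mapping of each retention level to a stride (every 2nd to every 100th observation) are assumptions made for the example.

```python
import random

# Strides assumed to correspond to the 13 retention levels
# (100 / stride gives 50, 33.3, 25, ..., 2 and 1 % of the data).
STRIDES = [2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 50, 100]


def systematic_subset(n_obs, stride, force_ends=True, rng=None):
    """Indices of every `stride`-th observation along a row of n_obs images.

    force_ends=True  : start at index 0 and always keep the last index.
    force_ends=False : start at a random offset in [0, stride).
    """
    rng = rng or random.Random()
    start = 0 if force_ends else rng.randrange(stride)
    idx = list(range(start, n_obs, stride))
    if force_ends and idx[-1] != n_obs - 1:
        idx.append(n_obs - 1)
    return idx


def random_subset(n_obs, fraction, force_ends=True, rng=None):
    """Sorted random sample of about `fraction` of the observations in a row.

    Assumes the row holds at least two observations.
    """
    rng = rng or random.Random()
    n_keep = max(2, round(n_obs * fraction))
    if force_ends:
        # Keep both row ends, then draw the remainder from the interior.
        interior = rng.sample(range(1, n_obs - 1), n_keep - 2)
        return sorted([0, n_obs - 1] + interior)
    return sorted(rng.sample(range(n_obs), n_keep))
```

Passing a seeded `random.Random` instance as `rng` makes a run repeatable, which is useful when the same subset has to be reused across several interpolation methods.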

The original data set is a continuum; that is, the imagery can be mosaicked together to form a continuous picture of the trellis. Each meter (or foot) of row can then be assigned a berry count and color profile (% of each color grade) and georeferenced with the center point. Subsetting the data creates gaps, so that some points along the row have no imagery (data) assigned. To calculate counts along the entire row, these gaps need to be filled. This can be done with simple approaches or with more complex geostatistical ones. Interpolation along the rows was trialled using the following smoothers and interpolators (in order of simplicity):

i) Nearest Neighbor (NN) approximation

ii) Moving average (MA) interpolation (using a window equivalent to 1.5 row width)

iii) Inverse Distance Weighting (IDW) interpolation, and

iv) Ordinary Kriging (OK) with a global variogram.

For each of the interpolation processes, two scenarios were imposed:

a) Interpolation only with data from within the same row, i.e. not using data from neighbouring rows. Preliminary analysis shows that management along a row tends to have a large effect on production patterns, so it was hypothesised that it was not useful (and potentially misleading) to include data from neighboring rows.

b) Interpolation methods were free to select data from neighbouring rows. Where this was the case, the data in the neighbouring rows was subset and processed in the same manner as the target row before the interpolation.
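As an illustration of the two simplest gap-filling options under scenario a) (within-row data only), the sketch below fills a single row by nearest neighbour and by inverse distance weighting. It is a generic one-dimensional version written for this note, not the implementation used in the analysis; the function names, the weighting power of 2 and the number of neighbours `k` are assumptions.

```python
def nearest_neighbor_fill(positions, sampled):
    """Give every position the value of its closest sampled position.

    positions : distances along the row (m), one per image location
    sampled   : {index into positions: measured value} for retained images
    """
    known = sorted(sampled)
    return [
        sampled[min(known, key=lambda j: abs(positions[j] - x))]
        for x in positions
    ]


def idw_fill(positions, sampled, power=2.0, k=4):
    """Inverse distance weighted estimate at every position in one row.

    Sampled positions keep their measured value; each gap is filled from
    its k nearest sampled positions, weighted by 1 / distance**power.
    Positions are assumed to be distinct, so gap distances are never zero.
    """
    known = sorted(sampled)
    filled = []
    for i, x in enumerate(positions):
        if i in sampled:
            filled.append(sampled[i])
            continue
        nearest = sorted(known, key=lambda j: abs(positions[j] - x))[:k]
        weights = [1.0 / abs(positions[j] - x) ** power for j in nearest]
        weighted_sum = sum(w * sampled[j] for w, j in zip(weights, nearest))
        filled.append(weighted_sum / sum(weights))
    return filled


def row_total(filled):
    """Row-level berry count: the sum of the gap-filled values."""
    return sum(filled)
```

Allowing neighbouring rows to contribute (scenario b) would mean replacing the along-row distance with a two-dimensional distance and pooling the sampled points from adjacent rows.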

The analysis therefore:

- tested 13 levels of data reduction from 50% to 1% of the original data size

- compared a systematic to a random subset approach

- tested 4 methods of interpolation to fill gaps left by the subset

- compared the effect of constraining interpolation to use within-row data only vs within-row and neighbouring-row data (unconstrained).

The variations in systematic vs. random subsets, interpolation and intra- vs. inter-row data availability were applied to all levels of subsets (1 – 50%).

The information within the original (full) data set was considered to be correct. Following the subset and interpolation, berry counts and the % of each color grade were derived for each row and compared to the original full count and the average percentage error calculated over all the rows.
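The comparison against the full data set reduces to a percentage error per row, averaged over rows. The sketch below shows one way to compute it. Whether the reported average is signed or absolute is not stated here, so both are given, and the function names are illustrative.

```python
def percent_error(predicted, reference):
    """Signed error of a row estimate, as a % of the full-data value."""
    return 100.0 * (predicted - reference) / reference


def mean_percent_error(predicted_rows, reference_rows):
    """Average signed % error over rows (negative = under-prediction)."""
    errors = [percent_error(p, r) for p, r in zip(predicted_rows, reference_rows)]
    return sum(errors) / len(errors)


def mean_absolute_percent_error(predicted_rows, reference_rows):
    """Average magnitude of the % error over rows, ignoring its sign."""
    errors = [abs(percent_error(p, r)) for p, r in zip(predicted_rows, reference_rows)]
    return sum(errors) / len(errors)
```

The signed version exposes any systematic bias, while the absolute version shows the typical size of the error even when over- and under-predictions cancel out.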

*Results:*

With so many permutations, it is not possible to present all outcomes in this research note, so only the key observations are noted here for the berry counting and the color grading.

*1) Predicting berry counts (total berries in a row).*

There was no difference in % error in prediction between the advanced geostatistical methods (kriging) and the simple approaches (IDW or moving average) for any given level of data subset. However, there was a very large difference in processing time (minutes vs multiple hours) so using simple interpolation approaches appears sufficient with these data.

The % error *increased* when information from adjoining rows was incorporated into the moving average and kriging methods. This indicates that information from adjacent rows is degrading the prediction, relative to within-row information. This needs to be investigated and is likely a case where management is overriding environmental effects on production.

For the simple geostatistical approaches, there was no difference between forcing the first and last points and not forcing (free selection of) points. However, systematic subsets performed better than random subsets (lower % error in general across different levels of subsets) with the more advanced interpolation methods.

The outcome of the iterations indicated that inverse distance weighting (IDW) interpolation, a simple interpolator, combined with a systematic subset approach and a random starting point provides both rapid and high-quality results. Figure 1a shows the percentage error of this approach as the % of data retained decreases. The % error is always within ±1%, even when only 1-5% of the original data is used. There is a very slight bias towards under-prediction of total berry count. Providing an indication of total berry count to within 1% of the measured value is of sufficient quality for growers to make harvest logistics decisions. (Note that this is the error relative to the visible berries; it does not currently correct for any potentially occluded berries.)

*2) Color grade percentages.*

When the same approaches were applied to determine the percentage of each color grade within a row, similar results were obtained. Again, IDW interpolation with an ordered systematic subset approach was as accurate as other more complicated approaches. However, while the error in the total number of berries (Part 1 above) was low (±1%), the error for individual color grades was much higher, typically ±10% (or differences of 7000+ berries in Fig 1b). There was no trend in the error, and even subsets that retained 20-50% of the original data still had high potential errors when compared with the most severe subsets (1-5%). Figure 1b below shows the error for color grade D.

***Figure 1b:** Total berry count error for Grade D berries across 30 rows in a vineyard.*

This provides less certainty for the producer. Ideally, information is needed on the total number of harvestable berries, not just on the total berry count. There still appear to be opportunities to improve the way that the color data is decomposed into grades and then reconstructed after data reduction, and we will continue to address this issue.

**Case Study 2:**

*Problem:*

Yield sensors are available for grape harvesters, providing point-rich yield maps. These point maps are often noisy and not intuitive to interpret over time (see Figure 2a). There are ways of mapping the yield data that make these patterns easier to visualise, and this works well with 1 or 2 years of data. However, yield data can (and should) be collected every year, so it is possible to have numerous years of yield data. Figure 2a shows 4 years of yield data for 4 blocks in a vineyard in the Lake Erie region, NY. As the number of years of data increases, our ability to visually integrate the spatio-temporal patterns of yield tends to diminish. Growers need a simpler way of visualising their yield data that does not require a lot of knowledge of interpolation.

*Method:*

To this end, an algorithm to generate zones within fields (vineyards) from multi-temporal yield data sets has been developed. The algorithm is specifically designed for multivariate analysis of yield data. It aims to identify areas/zones where the yield response is similar across multiple years. These could be areas that are consistently relatively high or low yielding, or areas where the relative yield changes between years but does so in a consistent manner within the zone. Unlike other zoning approaches that integrate yield, canopy and environmental (soil) data, this approach is purely intended to provide information on how yield varies over time and space. It is not intended to explain the drivers of the spatio-temporal yield variation; that is the next step.

***Figure 2a:** Point yield maps of 4 blocks of Concord grapes (Vitis labrusca) from a vineyard in Westfield, NY (Lake Erie viticulture region). Color scheme is from Red → Blue to indicate Low → High yield.*

*The algorithm*: The complete description of the algorithm is in press in Computers and Electronics in Agriculture[1], but is described briefly here. The algorithm uses a segmentation, not a classification, approach to the zoning so that each zone is spatially constrained (i.e. zones cannot be split over different areas).

The raw yield data is cleaned of erroneous data and standardised to remove mean yield differences between years. A 5 m² grid is placed over the vineyard blocks and the average standardised response within the 5 m² ‘pixels’ is extracted for each year.
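The standardisation and gridding step might look like the sketch below. It is a simplified stand-in rather than the published procedure: z-scoring each year is only one possible way to remove between-year mean differences, the cell is taken to be a square with a 5 m edge, and the function names and the `(x, y, value)` point layout are assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev


def standardise(yields):
    """Centre and scale one year's yield points (z-scores, an assumption).

    Requires some variation in the values (non-zero standard deviation).
    """
    mu, sd = mean(yields), pstdev(yields)
    return [(y - mu) / sd for y in yields]


def grid_average(points, cell=5.0):
    """Average point values inside square grid cells.

    points : iterable of (x, y, value) with x, y in metres
    cell   : cell edge length in metres (5 m assumed here)
    returns {(col, row): mean value of the points in that cell}
    """
    cells = defaultdict(list)
    for x, y, value in points:
        cells[(int(x // cell), int(y // cell))].append(value)
    return {key: mean(values) for key, values in cells.items()}
```

Running `grid_average` once per year on the standardised points gives each cell one value per year, which is the multi-year vector that a zoning step would work on.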

Using the 5 m² aggregated data, an algorithm to select seeds (starting points) for the zoning algorithm is applied. Each block is zoned independently, so the number of seeds differs between blocks. Once the seeds have been selected, a multivariate region-growing algorithm is applied to the data to assign each pixel to a seed point using neighbourhood relationships. By default, the number of zones created will equal the initial number of seeds selected. There are options to use different threshold values in the seed selection process to alter the number of initial seeds selected. The novelty in this approach is that the zoning segmentation algorithm is multi-dimensional, in contrast to previous applications that have been univariate, i.e. only able to zone an individual year and not multi-year data.
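To make the idea of multivariate region growing concrete, the sketch below grows zones outward from given seeds on a grid, always assigning next the unlabelled neighbour whose multi-year yield vector was closest to a zone's mean when it was queued. This is a generic seeded region-growing illustration and should not be read as the published algorithm: seed selection is not shown, and the 4-neighbour connectivity, the Euclidean distance and all names are assumptions.

```python
import heapq
import math


def grow_regions(pixels, seeds):
    """Assign grid cells to zones by seeded region growing.

    pixels : {(col, row): tuple of standardised yields, one value per year}
    seeds  : list of (col, row) keys, one per zone
    returns {(col, row): zone id}, with zone ids starting at 1
    """
    labels = {seed: zone for zone, seed in enumerate(seeds, start=1)}
    sums = {zone: list(pixels[seed]) for seed, zone in labels.items()}
    counts = {zone: 1 for zone in sums}
    heap = []

    def distance(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    def push_neighbours(cell, zone):
        # Queue unlabelled 4-neighbours, ranked by distance to the zone mean.
        col, row = cell
        zone_mean = [s / counts[zone] for s in sums[zone]]
        for nb in ((col + 1, row), (col - 1, row), (col, row + 1), (col, row - 1)):
            if nb in pixels and nb not in labels:
                heapq.heappush(heap, (distance(pixels[nb], zone_mean), nb, zone))

    for seed, zone in list(labels.items()):
        push_neighbours(seed, zone)

    while heap:
        _, cell, zone = heapq.heappop(heap)
        if cell in labels:
            continue  # already claimed through a closer candidate
        labels[cell] = zone
        for k, value in enumerate(pixels[cell]):
            sums[zone][k] += value
        counts[zone] += 1
        push_neighbours(cell, zone)
    return labels
```

Because a cell can only join a zone through an already labelled neighbour, every zone is spatially contiguous by construction, and the number of zones equals the number of seeds.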

After applying the algorithm, a simplified map of yield zones is generated that integrates the 4 years of data. This is shown in Fig 2b. (Note: As each vineyard block was zoned independently, there is no relationship between Zone 1 in the different blocks, and likewise for Zone 2, Zone 3, etc.) Each zone has a unique yield response over time, and a response that is different to its neighbours in the block.

[1] LeRoux, C., et al. (under revision). A zone-based approach for processing and interpreting variability in multitemporal yield data sets. Computers and Electronics in Agriculture.

***Figure 2c:** Actual mean yield (t/ac) within the zones in each year. The zones remain stable over time as they are derived by integrating the 4 years of yield maps. This visualises how individual zones change over time.*

*Results:*

In Figure 2c, the average actual yield for each year within each zone has been mapped using a standard legend. This visualises the yield response for each zone over time. It is similar to the information presented in the point maps (Fig 2a) but organised into zones.

An alternative way of viewing the data presented in Figure 2c is to plot the trend in average yield response over time for the different zones. This has been done for Zone 2 and its neighbouring zones (Zones 4, 5, 7 and 8) in the rectangular block second from the top (north) (highlighted by the square box in Fig. 2d). In the graph in Fig. 2d, Zone 2 (purple) has the lowest response in most years. Zones 5 (light green) and 7 (orange) show the same trend over time. Zone 5 is approximately 1-2 t/ac higher than Zone 2, and Zone 7 is a further 1-2 t/ac higher than Zone 5 (i.e. Zone 7 yields 2-4 t/ac more than Zone 2). While Zones 2, 5 and 7 follow the same temporal pattern, Zones 4 and 8 deviate, especially in 2017, indicating a different effect on production. The depressed (2016) and very depressed (2017) yield response in Zone 4, relative to the neighbouring zones and to its 2014-15 yield, seems of particular concern. Zone 8 (pink) also appears depressed in 2017 relative to its response in 2014-16, when it was similar to Zone 5.
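The zone-by-year summaries behind this kind of comparison can be produced with a simple grouping step. The sketch below is illustrative only; the record layout `(zone, year, yield in t/ac)` and the function names are assumptions.

```python
from collections import defaultdict
from statistics import mean


def zone_year_means(records):
    """records: iterable of (zone, year, yield_t_ac) -> {zone: {year: mean}}."""
    grouped = defaultdict(lambda: defaultdict(list))
    for zone, year, value in records:
        grouped[zone][year].append(value)
    return {
        zone: {year: mean(values) for year, values in sorted(by_year.items())}
        for zone, by_year in grouped.items()
    }


def zone_difference(means, zone_a, zone_b):
    """Per-year difference in mean yield (zone_a minus zone_b), shared years only."""
    shared = sorted(set(means[zone_a]) & set(means[zone_b]))
    return {year: means[zone_a][year] - means[zone_b][year] for year in shared}
```

Zones that follow the same temporal pattern show a roughly constant difference across years, whereas a zone that deviates in a particular season shows a difference that changes sharply in that year.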

The zones in this block are quite small, on average just over 1 acre in size. However, they are a potentially manageable size, although some may need slight reconfiguring to fit with mechanical operations. Within this block, Zone 6 is a very small zone (centre bottom, surrounded by Zone 8). The response of Zone 6 is very similar to that of Zone 5; however, its size makes it unsuitable for management. In reality, Zone 6 would be treated as part of Zone 8 (pink).

The information presented in Figure 2d is a first step in simplifying the dense and often confusing data in the Figure 2a yield maps. Discussions are on-going with local collaborators to understand the resolution at which growers a) want information and b) want to perform management. In this iteration, zones of 0.5-1.5 acres have been targeted. Changing some of the defaults in the zoning algorithm could generate finer (smaller) or coarser (larger) zones.

**Summation**

These two case studies illustrate some of the challenges and the opportunities when dealing with large amounts of spatio-temporal agricultural data. Information (not data) is needed in a timely and comprehensible manner to make decisions. The examples here, while a step forward, are certainly not the end point, nor have they met grower criteria. We will continue to discuss outcomes and preferences with the end-users to arrive at a best-use scenario that integrates best practice science with practical solutions and interoperability behind the farm-gate.

In case study 2, for example, the size of yield zones is a concern for the grower, as is the decision to treat and analyse the blocks separately. In many cases, such concerns are subjective and, while there may be a general preference, there is unlikely to be a “one-size-fits-all” solution. Approaches such as those presented here therefore also need to be adaptive to grower preferences.

This work is on-going and evolving. We welcome any comments on it.