## Abstract

A new method is proposed for unravelling the patterns between a set of experiments and the features that characterize those experiments. The aims are to extract these patterns in the form of a coupling between the rows and columns of the corresponding data matrix and to use this geometry as a support for model testing. These aims are reached through two key steps, namely application of an iterative geometric approach to couple the metric spaces associated with the rows and columns, and use of statistical physics to generate matrices that mimic the original data while maintaining their inherent structure, thereby providing the basis for hypothesis testing and statistical inference. The power of this new method is illustrated on the study of the impact of water stress conditions on the attributes of ‘Cabernet Sauvignon’ Grapes, Juice, Wine and Bottled Wine from two vintages. The first step, named data mechanics, de-convolutes the intrinsic effects of grape berries and wine attributes due to the experimental irrigation conditions from the extrinsic effects of the environment. The second step provides an analysis of the associations of some attributes of the bottled wine with characteristics of either the matured grape berries or the resulting juice, thereby identifying statistically significant associations between the juice pH, yeast assimilable nitrogen, and sugar content and the bottled wine alcohol level.

## 1. Introduction

The ever-growing roles of digital sensors whose technologies are constantly advancing, allowing for better and broader experimental coverage of the parameters defining a process, and the corresponding advances in the technologies of the computers that monitor and analyse the output of these sensors have led to an exponential growth in the amount of data that scientists collect. Research may be at the threshold of an era in which hypothesis-driven science is being complemented with data-driven discovery. This alternative way to pursue research affects all fields, from genomics in biology, to astrophysics and many domains in social sciences. The collected data allow for the study of processes at scales never observed before. Of even higher significance, the seemingly unlimited breadth and depth of their sources provide the means to move beyond reductionism and develop efforts to study complex systems using a more comprehensive approach. These data, however, come with a high level of complexity and heterogeneity, usually providing indirect measurements of hidden, albeit essential, processes that are key to the systems under study. The analyses of such complex data require developments pertaining to many disciplines that are directly or indirectly coupled to data sciences. In this paper, we introduce a new method for detecting patterns within, and between datasets collected in a longitudinal study. This method draws from such interdisciplinary efforts; we illustrate its applications to the analysis of the effects of water stress on the process of winemaking.

Winemaking is the process of producing wine, starting from the growing of grapes and ending with bottling the finished wine. It has been presented as an art, a skill, or a science that predates recorded history. Competition among wines, continuous consumer demand for new wine styles, the economic importance of wine, and the growing concern about the impact of winemaking on the environment make winemaking one of the most studied processes in agriculture. Among all factors potentially affecting wine quality, the amount of irrigation and/or rainfall available during the growing season has become a recent focus. Indeed, most of the world's wine-producing regions experience seasonal drought conditions. With the increased aridity predicted from climatic shifts, water availability is likely to become a limiting factor for wine production. Some European regions are already experiencing low precipitation incapable of supporting optimum grapevine growth [1]. Understanding the relationships between irrigation (or lack thereof) and wine quality has, therefore, become the subject of a multitude of studies (see e.g. [2–6]). In this article, we focus on characterizing those relationships at four stages of the process of winemaking, namely the composition of grapes at harvest, the juice after processing, wine composition after alcoholic fermentation and the composition of the bottled wine.

There are many difficulties to overcome when assessing the effects of changing one or many variables that may affect the quality of wine. Firstly, a definition of ‘quality’ is needed. While ultimately quality should be defined based on customers' appreciation, the latter is definitely subjective, prone to biases and not amenable to quantitative measures. Instead, quality is usually considered as a generic term that refers to levels of a diverse range of chemical constituents in the wine. While the specific chemical components that determine the overall ‘quality’ of a wine are difficult to identify [7,8], it is generally understood that flavonoid (such as tannins) concentrations, acidity, pH and sugar content are some of the key attributes [9]. Secondly, the concentrations of these components either in the grapes, juice or wine are influenced by many factors, such as soil physico-chemical properties, weather, water availability during grape growth, and yeast types and possible adducts during alcoholic fermentation. Most inference analyses of winemaking are designed to analyse the effects of modifying only a few (sometimes even just one) of these factors; it is, however, often unclear whether the observed responses in wine quality are direct or indirect effects, as those factors are not independent. Understanding how these factors interact and consequently influence the levels of the various chemical constituents is therefore integral to a proper analysis of any experiments that study the impacts of the manipulation of even a few factors on wine quality [6]. Finally, the nature of these interactions between these different factors is not known, that is, one does not possess the information that would allow us to disentangle their effects. Accordingly, a method is needed that is able to analyse the relationships between factors affecting grape growth and wine quality in the context of these unknown dependences between the factors and the measures of quality; this method should also allow for estimating the relationships between these stages, i.e. identify factors whose effects carry through the whole process. Such a method is introduced in this paper.

In a typical manipulative experiment designed to assess the effect of environment on winemaking a few factors, *n*, are manipulated, and a few descriptors, *m*, of wine quality (usually based on biochemical assays), are evaluated. We refer to the *n* factors as ‘nodes’ belonging to a space and to the *m* descriptors as ‘features’ belonging to a space . The relationships within each of the two spaces are unknown, while the relationships between the two spaces are stored in a data matrix with *n* rows and *m* columns. In the experiments considered in this paper, similar node spaces are tested at four different stages of the process of winemaking, corresponding to four feature spaces , with . We are interested in characterizing the relationships between and for all *i*, as well as in finding possible coupling between the phases, i.e. identifying factors (nodes) that have persistent effects throughout the different phases as well as pairs of features from two phases that show strong associations.

Our procedure for extracting this information proceeds through two steps. In the first step, we build a coupling geometry between and its associated feature space by computing iteratively two tightly coupled ultrametric trees onto the rows and columns of the data matrix . Based on this coupling geometry, we extract interacting patterns within nodes, within features, and between nodes and features. This bears similarity with the topological concept of shapes, as introduced by Carlsson [10] to analyse data, or by Mémoli [11] for shape comparison. The result of this analysis is a new data matrix that is a permutation of , with the rows and columns re-organized to highlight the patterns described above. This matrix is then further optimized so that it corresponds to the minimum or ‘macrostate’ of an energy function *E*, akin to the energy of an Ising model in statistical mechanics [12,13]. In the second step, bootstrapping is applied to generate a set of random matrices that emulate , i.e. that conform to its internal structure. Each of these matrices is then characterized by its energy , and the distribution of those energies characterizes the space of states or configurations for the coupling that conforms with the data. These matrices serve as support for estimating the significance of associations between features at one phase and features at another phase of winemaking.

The method presented here is an extension of our recent work on analysing the geometry of data and networks. We have recently developed a new methodology, called data cloud geometry-tree (DCG-tree) to capture the underlying geometry of a set of data points [14,15]. It is an integral part of the data mechanics approach [16] which captures the coupling geometry in binary bipartite networks, a prelude to step 1 of our method. The novelties of the approach included in this paper include extensions of those methods to weighted matrices, and the development of a bootstrapping technique for generating matrices that mimic weighted matrices.

The paper is organized as follows: in the next section, the experiments on the effects of water stress on winemaking are briefly described. A complete depiction of the methods used to quantify the chemical components of grapes, juice and wine is provided in the electronic supplementary material. In addition, we include in the latter a qualitative analysis of the data of the water stress experiments, discussed in the light of recent similar studies. In the main body of the paper, we focus on describing a new statistical framework for analysing those data, and the results of the corresponding analyses. The following section comprehensively covers the data analysis methods and their implementation. In the results section, we use our method to characterize the relationships between different conditions of water stress and wine attributes at four stages of winemaking, as well as the degrees of association between the chemical composition at different phases. Finally, we conclude with a discussion of the advantages as well as limitations of our approach and propose directions for future improvement.

## 2. Experimental: measuring the effects of water stress on grapes, juice and wine

### 2.1. Grape cultivation, winemaking and irrigation regimes

Six irrigation regimes were designed to analyse the effects of timed water stress on grape growth (table 1). Those water regimes are switched at veraison, a key time point of berry maturation which typically occurs in mid-summer. Irrigation regimes were based on stem-water potential measurements taken at solar noon, which are accepted to be the most sensitive measures of whole-plant water status, while also being easily converted into the prevalent measure of leaf-water potential (LWP) as described in [17]. Twenty-four measurements were taken twice a week using a calibrated manual pump-up pressure chamber (PMS Instruments, Albany, OR, USA).

The six irrigation regimes were implemented during the 2011 growing season in a commercial vineyard (Mast Ranch) in the Dunnigan Hills American Viticultural Area (AVA; Esparto, CA, USA) (38°45 N, 125°57 W), planted in 1999 to ‘Cabernet Sauvignon’ grapevines on 1103P rootstock and followed during the 2012 and 2013 growing seasons. Each treatment comprised a ‘data row’ of 80 grapevines split into four sub-sample groups of 20 grapevines each. Two buffer rows on each side of the data row were treated identically with regard to irrigation, but were not sampled. The Dunnigan Hills AVA is classified as region 5 under the Winkler Growing-Degree-Day (GDD) concept, a heat-summation of the growing season between 1 April and 31 October (2012: 4878 GDD; 2013: 4899 GDD). Winter precipitations amounted to 23.10 inches in 2012 and 3.14 inches in 2013. For the 2013 season, winter irrigation was applied to saturate the soil-profile to levels comparable with the 2012 season. All irrigation applied was accounted for with flow-meters positioned at the end of each data row.

Grapes were harvested separately for each water treatment, based on a target concentration of soluble solids of 24 Brix. Accordingly, treatments were harvested over a three-week period during both seasons, with the water treatments RHP, LD and ED− consistently harvested the earliest, and ED+ the latest. Differences in the onset of ripening were not observed. Four samples are considered for each regime, leading to a total of 48 different experiments for the grapes at harvest.

After harvest, grapes from the four replicates of one irrigation regime were pooled and crushed. The resulting, homogenized must was divided into triplicate fermentations, leading to 36 wines. All winemaking occurred at the Robert Mondavi Institute Teaching and Research Winery at the University of California, Davis. The use of this facility enables full control over all parameters of the winemaking process, with high precision and reproducibility. A previous R&D red wine fermentation protocol of the Constellation R&D department was adopted to function with the 200 l TJ fermenters available at the Pilot winery. The whole winemaking process is thus characterized by four major phases: grapes at harvest, juice sampled from the processed grapes (2 days after harvest), wine at the time of bottling and bottled wine; these four phases will be referred to as ‘Grape’, ‘Juice’, ‘Wine’ and ‘Bottled Wine’, respectively. The authors do acknowledge that this is not a traditional longitudinal study, as the Grape phase included 48 samples, while the subsequent phases comprised 36 samples.

### 2.2. Chemical composition at the different phases of winemaking

Each stage of the winemaking process has its own set of attributes that will ultimately affect wine quality. We, therefore, consider different aspects of chemical composition, one for each phase, that include 14 measurements in the grapes, six measures for the juice, six other measures for the wine at bottling time, and finally 13 measures for the bottled wine; the list of those measures is provided in table 2. Details on how those measures are performed are provided in the electronic supplementary material. It should be noted that the chemical composition discussed here only comprises critical factors that are commonly assessed in the wine industry. The dimension of further aroma- and flavour-molecules was not explored as part of this project.

### 2.3. Preprocessing the data

Based on our experimental setting, each phase of the winemaking process is characterized with a data matrix, where , the number of rows, is the number of water stress experiments considered (i.e. 48 for Grapes, and 36 for the other phases), and is the number of attributes or features considered at that phase (table 2). Each feature is measured using a specific technique, and its quantification is expressed with a unit reflecting the chemistry considered. As a consequence, the columns of the data matrix have different scales and ranges. To minimize these scale effects, each column of the data matrix is preprocessed to have 0 mean and variance of 1.

We note that there is implicit and explicit collinearity in the data matrices. In the Juice phase, yeast assimilable nitrogen (YAN) is in fact the sum of NOPA and ammonia. At the Bottled Wine phase, intensity is the sum of A420 and A520, while hue is the ratio A420/A520. The effects of the presence of collinearity are minimized, but not removed by the preprocessing of the data.

## 3. Data mechanics: a new method to detect coupling between nodes and features

Consider *m* water stress experiments at a phase *i* in the winemaking process, where *i* varies from 1 to 4. As part of each experiment, *n* measurements are taken, corresponding to *n* features designed to estimate the quality of the wine at this stage. The set of experiments forms a node space denoted , while the set of measurements forms a feature space and is denoted as . For simplicity, we consider that neither nor are ordered with respect to temporal, spatial or ordinal axes of any kind. Let be the data matrix whose rows are the experiments, columns the measurements, and the value of measurement *j* for experiment *i*. The latter is usually a real number; as such, we refer to as being weighted, in opposition to a binary matrix whose values are either 0 or 1. Our goals are to characterize the relationships between and each of the feature spaces and to measure the coupling between the different phases during winemaking. Our approach mirrors these two goals in that it proceeds in two stages, firstly building statistical models that capture the coupled geometry between and the feature space for all four phases, and secondly, assessing possible associations between two phases based on these statistical models. These two stages can themselves be broken down into three steps:

(1) capture the geometry of the data matrix by building coupled ultrametric tree structures on its node and feature spaces;

(2) use the ultrametric tree structure to optimize a matrix equivalent to that minimizes an Ising model-like potential field built upon the matrix; represents the ‘macrostate’ of ; and

(3) generate sampling matrices that mimic while conforming to its structure; build the distribution of energies of those matrices.

Steps 1 and 2 are used to define the macrostate of the data matrix , while step 3 allows for the construction of an ensemble of matrices or microstates that conform to the macrostate. These matrices are then used to assess the significance of associations between features from two different phases.

### 3.1. Step 1: capturing the geometry of the data matrix

Capturing the structure of a data matrix can be stated as the problem of identifying sub-matrices, i.e. blocks formed with subsets of rows that exhibit similar behaviour over a subset of the columns. It is very unlikely that the data matrix obtained by simply collecting measurements over a set of experiments will reveal those blocks. Indeed, the initial ordering is often arbitrary, without prior knowledge of similarity. The method proposed here uses successive permutations of the rows and columns in order to reveal this similarity. As a first step, the rows and columns are clustered separately: experiments are regrouped based on similarities over the values of their features, while features are regrouped based on their distributions of values over the different experiments. We use the DCG engine [14,15] to perform this clustering. The output of DCG is a tree, as well as an ultrametric. It should be noted that an ultrametric *d* on a space is a metric that satisfies the strong triangular inequality for any nodes *x*, *y* and *z* in . The initial, independent clustering of the rows and columns of the data matrix does not take into account possible correlations. Two rows are paired based on similarities over all features, while only a subset of those features may be relevant to those two rows. To capture those correlations, we follow the concept of coupling geometries between metric spaces [21]. Namely, we connect the ultrametric on the rows of the matrix with the ultrametric on its columns using a metric equivalent to a Gromov–Wasserstein distance [11,21]. The two ultrametric and are then updated iteratively through this connection, until they do not change anymore. The whole procedure is implemented as the data mechanics procedure. An initial version of this procedure applied to binary matrices is described in [16]. Its extension to weighted matrices is fully explained in the electronic supplementary material. The outputs of data mechanics include the converged ultrametric trees and , as well as a matrix , the reordered version of that conforms to those trees. The block structure of can be visualized using a heat map.

### 3.2. Step 2: finding the macrostate of the data matrix

The data matrix is akin to the bi-adjacency matrix of a weighted bipartite network, which we write as [22]. As such, any permutation of the rows and/or columns of lead to a new matrix that represents the same network. Let us refer to such a permutation as , where and are permutation matrices of sizes *m* × *m* and *n* × *n*, respectively; those matrices belong to the corresponding groups and . Finally, let be the set of all permutations of the matrix .

Clearly, both and the optimized matrix generated by the data mechanics algorithm described above belong to this set *S*. While they represent the same data, their structures are quite different: corresponds to uninformed ordering of the *m* experiments and *n* features, while puts order in the rows and columns by conforming to coupled similarity measures given by their respective optimized ultrametrics. As such, the latter is a better representation of the shape of the data [10]. It is not clear, however, whether this representation is optimal. Indeed, we note that an ultrametric tree is invariant with respect to switching arrangements between branches on the same level, and to switching arrangements among nodes within the lowest branches. Ultimately, we are interested in a representation of the data that would explicitly identify the coupling geometries within and between rows and columns. Such a representation is identified with the ground state, or macrostate of a ferromagnetic potential field *E*, defined by
3.1where is the set of the four nearest neighbours to the (*i*, *j*) entry and the interaction potential is taken to be constant 1 for simplicity. This potential is equivalent to the potential of a two-dimensional Ising model in statistical physics [12]. The second term mimics the action of an external magnetic field with magnetic moment equal to 1. The coefficient *μ* can be seen as a chemical potential, in which case the second term becomes a coupling to an external ‘bath’ of spins. The ground state corresponding to equation (3.1) is expected to reveal the patterns of the data, implicitly present but hidden in . This ground state is a permutation of , written as .

There are *m*! and *n*! permutation matrices in the groups and , respectively. Therefore, finding the ground state defined above amounts to performing a search in a space of dimension *m*! × *n*!. The computational complexity of such a search would make it completely prohibitive. To alleviate this problem, the search is constrained to conform to the block structures identified in the optimized matrix , i.e. by constraining the permutation matrix *σ* to the structure encoded in the ultrametric tree and the permutation matrix *π* to the structure encoded in the ultrametric tree . We perform this constrained search using algorithm 1 in [13]. The output of this search is an optimized matrix that represents the ground state of the potential energy imposed on the matrix, i.e. that captures the inherent geometry of the data.

### 3.3. Step 3: generate random matrices that mimic the input data matrix

The key concept of this stage may be likened to mimicry. We rely on the principle that all random matrices are constructed such that they possess all (or at least most of) the global characteristics embedded within the input matrix . The global characteristics of particular interest are those patterns that capture the coupling within, and between the nodes (rows) and features (columns) represented in . To ensure that the random matrices inherit those characteristics, we enforce that they ‘conform’ to the geometry encoded in the input matrix. Thus, the block information identified using steps 1 and 2 described above is retained in every random matrix generated. This principle is drawn from the idea that a real dynamic system is characterized by a combination of deterministic structures and randomness within these structures [23].

The low energy matrix serves therefore as a natural basis for generating the random matrices, as it was designed to capture the geometric patterns of the data. The generation of one random matrix is then performed as follows: we isolate the diagonal and off-diagonal blocks from , generate random sub-matrices for each of these blocks with constraints on their marginal sums of rows and columns, and combine the corresponding sub-matrices into the final random matrix. While most of these steps are straightforward, an algorithm is needed for generating a random matrix from a given matrix that conforms to constraints on its geometry. We distinguish the cases where the matrix is binary or weighted.

The problem of counting the total number of binary matrices with given constraints and testing hypotheses about them appear in many fields, including mathematics, statistics, ecology and social sciences. As a consequence, there is a large body of literature on generating random binary matrices with constraints, see e.g. [24–27] and references therein. We use for this purpose an adapted version of the algorithm originally developed by Bayati and co-workers to generate random binary graphs with prescribed degrees on their nodes [28]. Our version is described in detail in [13,16].

Much less has been developed for non-binary matrices (i.e. matrices with positive real, categorical or even signed values). To our knowledge, there are no algorithms available for generating random weighted matrices. We propose a method based on the concept of breaking the matrix up into a series of binary matrices (i.e. a form of binary splitting) that can then be sampled with one of the methods cited above. This method is implemented in algorithm S2 described in the electronic supplementary material.

An energy *E*(*A*_{R}) is computed for each random matrix *A*_{R} generated using this scheme, using equation (3.1). The random matrices enable hypothesis testing on the input data matrix; this will be used to assess the significance of association between features.

## 4. Results

We analyse the impact of water stress on four phases of the winemaking process, namely grape composition at harvest, juice composition after processing, wine composition after fermentations and bottled wine. Each phase is characterized with a set of chemical constituents meant to capture the qualities of the pre-wine products. First, our analyses of the couplings between water stress and those compounds are reported, using the data mechanics method introduced above. Subsequently, a sensitivity analysis of the relationships between composition of the bottled wine with that of previous phases in winemaking is documented as an attempt to identify key constituents in pre-wine products that correlate with the quality of the bottled wine.

### 4.1. Coupling geometries in four different phases of winemaking

Six different irrigation regimes were designed to analyse the effects of water stress on grape maturation in the field, as well as on the subsequent wine at three different phases, namely as Juice just after harvest, as Wine after the fermentation stage and as Bottled Wine (see table 1 and §2 for details). The whole analysis was repeated over two successive years, 2012 and 2013. The results of all those experiments are stored in four data matrices, one for each phase. The Grape matrix has size 48 × 14, while the Juice, Wine and Bottled Wine matrices have 36 rows, and 6, 6 and 13 columns for the corresponding measurements, respectively. Each of these matrices was subsequently analysed using the data mechanics method introduced in the §3.1. The data mechanics method was designed to reveal patterns, or coupled geometries between nodes (water stress experiments) and features (wine attributes). The outcomes include two ultrametric trees, one for the nodes and one for the features, as well as a reorganized (i.e. permuted) data matrix that conforms to those trees. In figure 1, we show these resulting matrices for all four phases of the winemaking process.

Surprisingly, the most significant association observed between water stress conditions and wine attribute measurements relates to the years in which the experiments were performed rather than to the water stress itself. Indeed, the ultrametric trees on the irrigation experiments identify two groups at a coarse level, with each group mapping to either 2012, highlighted with a blue bar, or 2013, highlighted with a black bar (figure 1). This behaviour is strong at the Grape maturation and Juice phase, and is still present, if somewhat weaker at the Wine and Bottled Wine phases. A minor exception is presented by one of the 2013 LD treatments in the Wine phase, which clusters among the otherwise homogeneous 2012 treatments. The strong influence of the year illustrates that the effects of the environment (external conditions) dominate over the effects of management (irrigation) in this specific set of experiments. The actual environmental factor that explains these results is unclear. All experiments were performed in the same field. Growing season heat sums (degree days) were comparable. Winter rainfall differed among vintages, but was accounted for by irrigation. These 2 years in fact coincide with the beginning of the recent drought that California is experiencing. The weather was warm both years, with a relatively cool month of July (average high temperature of 95°F) and relatively warm month of August (average high temperature of 99°F) in 2012, and opposite trends in 2013, namely hot July (99°F on average) and cool August, (95°F on average).

In figure 1, the ultrametric trees for the experiments (rows of each of the four data matrices) have been coloured according to the water stress conditions considered (see legend). Regroupings of the replicates for each water stress condition over each year were observed, without consistent patterns between these conditions over the 2 different years, or between the four different phases. The two opposite experiments CTL (control, with full irrigation over the whole period) and ED− (constant stress over the whole period) are usually observed to be close in the ultrametric trees corresponding to 2013, but not in those for 2012. The two conditions ED (coloured in yellow) and ED+ (coloured in magenta) are expected to show similar behaviours as both include high stress levels before veraison and nearly normal irrigation afterwards. These similarities are observed in the ultrametric trees for 2013, especially at the Wine and Bottled Wine phases, but not in the ultrametric trees corresponding to 2012. Clearly, in the experiments reported here, differences between irrigation regimes are too small compared with differences in the environment in 2 different years of winemaking to be of significance.

Figure 1 also reveals the ultrametric trees on the attributes (features) at different phases of the winemaking process. At the Grape phase, a regrouping of the main acids (tartaric, malic, lactic and acetic) with the anthocyanins is evident, while tannins, total soluble solids (TSS) and pyrazines form a different group. This separation into two main groups reveals, for example, a well-known relationship in wine, namely that low acidity in the grapes is usually associated with high sugar content (as measured by TSS) especially in warm climates, and vice versa. As a more minor observation, the two independent measures of anthocyanin contents (ACY1 and ACY2) are found to be very similar over the whole range of experiments. The clustering of attributes at the Juice phase is less informative. Two main groups are found, with the first group corresponding to those attributes associated with nitrogen content, and the second including all the other attributes. At the two Wine phases, the marginal residual sugar amount in wine is directly associated with its levels of tartaric acid and malic acid: the wines from 2012 have relatively low levels of these three attributes, while wines from 2013 exhibit an opposite behaviour.

### 4.2. Sampling a data matrix

Each measurement included in a data matrix, such as those considered in this study, is prone to experimental errors. To generate a meaningful model from such measurements would, therefore, require sampling according to these experimental errors. The situation is, however, more complicated when the model is designed from the whole data matrix, i.e. is meant to capture coupling between experiments and measurements. In such a situation, the sampling needs to accommodate this coupling. In practice, this amounts to generating random matrices that conform to the geometry of the data matrix. In §3.3 a method to build such random matrices was proposed.

We considered four different (geometric) structures for the Juice data matrix: a fine level that accounts for both rows and columns clustering, two levels for which only the clustering of either the rows or the columns of the matrix is considered, and a level with no prior information on the organization of the matrix. Each level corresponds to a partitioning of the full matrix into sub-blocks, whose structures are conserved in the random matrices. One hundred random matrices were generated at each level. Figure 2 shows the distributions of the energies of those random matrices, computed using equation (3.1).

Given the original data matrix for the Juice phase, we note first that the energy of the ‘ground state’ is low, but not lower than the energies of all the random matrices generated, which may seem counterintuitive. The ground state is in fact the result of permutations of the original data matrix, while the random matrices conform to its structure, without a global constraint on its values (see §3.3 for details). As such, the sum of the elements of any random matrix is not guaranteed to be equal to the sum of the elements of . The second term on the left of equation (3.1) is meant to account for this, as it sets a coupling to a ‘bath of spins’. The corresponding coupling constant, *μ*, was arbitrarily set to 1, leading to the small discrepancy seen here. The key observation from figure 2, however, is that the matrices generated from constraints on both rows and columns of the data matrix have significantly lower energies than matrices generated without block constraints, and that their energies indicate that they resemble the ground state. The same behaviours were observed for the data matrices corresponding to the other phases of winemaking.

### 4.3. Associations between wine characteristics and grape and juice properties

The associations between the bottled wine hue and alcohol content were studied (as those two characteristics are usually considered as good indicators of ‘quality’), and all chemical characteristics characterizing the Grape and Juice phases. Let *W* be the vector that contains the values for one wine attribute observed over *m* fermentations, and let *F _{i}*, for be the corresponding values for the

*i*th feature of either the Grape phase or Juice phase. It should be noted that while the correspondence is direct for the Juice phase, it requires some preprocessing for the Grape phase. Indeed, there are 36 fermentation conditions associated with the Bottled Wine, and 48 field conditions associated with the Grapes. These two sets can be mapped as follows. Each irrigation regime is characterized with four field replicates and three fermentation replicates. We randomly assign one of the field replicates to each fermentation. While we understand that this mapping is not exact, it does conform to the water regime and year structures of the data. Random assignments and the subsequent analyses were repeated multiple times and were not found to produce significant differences.

The traditional approach of building a multilinear regression model between *W* and the *K* feature vectors *F _{i}* is likely to fail, or at least generate noisy and unreliable models because of the collinearity of the features. The authors propose to use the concept of model selection instead, combined with an empirical assessment of significance, in lieu of using significance tests based on the assumption of data being normally distributed. We generate a comprehensive set of models by systematically sampling the contribution of each and all of the features. Namely, the following multilinear model is considered,
4.1where each

*α*may take the value of 0 or 1 to indicate if the feature

_{i}*F*is included in the model or not. There are 2

_{i}*possible values for the*

^{K}*K*-dimensional vector

*α*. A regression model is built for each of these values and characterized with its Akaike information criterion (AIC) [29]. The model with the lowest AIC is selected. Features whose coefficient

*α*is set to 1 for this model are deemed significant. This analysis is repeated over the 100 random matrices described above. The results are stored in a table

_{i}*S*of size 100 ×

*K*, such that element

*S*(

*i*,

*j*) = 1 if feature

*F*was found significant in experiment

_{j}*i*, and 0 otherwise. Figure 3 shows the corresponding tables as bar plots for the two bottled wine features alcohol and hue, and the attributes of Grape and Juice.

There are no significant associations between both wine alcohol and hue and the chemical composition of the grape berries in the study presented here. While general associations of irrigation and grape composition were present, this study focuses on the modelling of four data matrices starting at harvest. Additional dimensions, such as variable extractability, could not be considered in detail. On the other hand, significant associations are found when wine is compared with the juice composition. For example, the alcohol content of bottled wine is found to be related significantly to the pH, YAN and total suspended solids (TSS) of the juice. All three associations are expected. Indeed, pH is an indirect measure of the amount of acids in the juice, and acidity influences the growth and vitality of yeast during fermentation. YAN measures the amount of nitrogen available to the yeast, with nitrogen being its second most important nutrient needed to carry out successful fermentation. The most important nutrients are sugars, which are the main soluble solids in Juice, whose concentrations are given by the TSS value. It may be noted that the association with sugar appears less significant than the associations with pH and YAN. Hue, i.e. a measure of the colour of the wine, is found to be related significantly to the amounts of ammonia, malic acid and suspended solids in the juice.

## 5. Discussion

Understanding the effects of water stress on grapes and subsequently on the quality of wine has long been of interest to winemakers [2–6], especially in the current context of global warming and regional water limitations. Experiments designed to analyse those effects typically proceed by applying multiple regimes of irrigation during the growing season, and studying their impact using the resulting composition of grapes or wines. Unfortunately, this system is laden with difficulties, the most important being that the chemical constituents are often covariant, making it difficult to differentiate their individual effects. This work uses a new approach to circumvent at least some of these difficulties. Our goal is to study the relationships between the experiments (i.e. the different water stress conditions), between the features (i.e. the chemical composition of the winemaking phase considered), as well as the coupling between those two dimensions, as encoded in the corresponding data matrix. Principal component analysis (PCA), and partial least squares (PLS), while the most common techniques used to analyse such data matrices, provide only partial representations of their underlying patterns due to the assumptions behind them, mostly the idea of linearity of the relationships. A new method that circumvents those problems is introduced in this paper. The reader is referred to the electronic supplementary material for detailed descriptions of these three techniques as well as for the when and why of applying them on their data. This new approach includes two key steps: (i) a geometric approach to capture the coupling structure embedded in the data matrix and (ii) a statistical physics approach to generate matrices that mimic the original data matrix while maintaining its inherent structure, thereby providing the basis for hypothesis testing and statistical inference. Applications of the former to study the impact of water stress on the attributes of Grapes, Juice, Wine and Bottled Wine phases in making ‘Cabernet Sauvignon’ over 2 years, 2012 and 2013, has allowed us to separate the intrinsic effects due to the experimental conditions, from the extrinsic effects due to the environment. Applications of the latter enabled analyses of the association of some attributes of the bottled wine with characteristics of either the matured grapes or the resulting juice, thereby identifying statistically significant associations between the Juice pH, YAN and sugar content with the bottled wine alcohol level.

In this work, a data matrix collects the measurements of a list of chemical constituents at a given phase in the winemaking process (columns of the matrix), for a series of experimental conditions, namely water stress conditions over two independent years (rows of the matrix). Originally, the data are organized using arbitrary orders both along the rows and along the columns. The purpose of step 1 in our approach is to reorganize this matrix through permutations to identify its ‘geometry’, i.e. a block structure that exposes similarities between rows, between columns, as well as coupling between rows and columns. This concept of coupling is often referred to as ‘biclusters', namely subsets of rows which exhibit similar behaviour for a subset of columns, especially in the context of clustering microarray gene expression data in genomics studies [30]. Finding the biclusters in a data matrix usually depends on a merit function that evaluates the quality of these biclusters. Several methods have been developed to solve this NP-hard problem. These methods can be divided into two somewhat opposite groups: those that directly reorganize the rows and columns of the matrix to increase local coherence between rows and columns [31], thereby revealing biclusters, and those that instead narrow down the rows and column to directly identify stable biclusters, as implemented for example in the coupled two way clustering (CTWC) method [32,33]. The data mechanics approach proposed in this work differs in that it does not look at biclusters as such. Instead, it iteratively refines a coupling between metrics defined on the row and columns, much akin to a Gromov–Wasserstein distance [11,21], thereby identifying the ‘geometry’ of the data matrix as a whole. Applications of this technique to the data considered here provide insight into the relative importance of intrinsic versus extrinsic effects on grape maturation, and subsequently on wine quality, as we could clearly identify a dominant effect of the year of the study over the individual water stress conditions considered in the experiments reported here (figure 1).

While it is always possible to compute a multivariate regression model between an observation and features that influence this observation, it is usually difficult to assess the significance of this model. This is notably true when there is implicit or explicit collinearity in the data. A multiple regression model with correlated features may indicate how well the entire set of features predicts the observation, but it may not give valid results about any individual feature, or about which features of the model are significant. Step 2 in our approach is a means to avoid this complication. First, we use a model selection approach based on AIC to select the best model between observation and features. Secondly, we repeat the model selection on multiple variants of the data matrix generated by a random sampling procedure that conforms to the geometry of the original matrix. This second step enables us to estimate significance explicitly and not based on a statistical model that is likely to be unreliable in the presence of collinearity.

The essence of our approach lies in its step 2, namely the bootstrapping of the data matrix , i.e. the generation of a collection of random matrices that mimic . Indeed, those matrices serve to generate a statistical model on the relationship between the nodes and features that define the data matrix, which can then be used to test hypotheses, such as the dependence of different phases in the winemaking process. The problem of generating matrices that are similar to a given matrix is much akin to the problem of randomizing the edges of a network (see [34] for a short review). There are many options to perform the latter. The simplest is to assume that interactions are equally likely between nodes in the network, in which case edges can be rearranged randomly. Most real networks, however, are unlikely to satisfy this assumption. It is then necessary to constrain the randomization such that it conforms with the structure (geometry) of the network. This requires that the geometry of the network be known, the so-called community extraction problem [35,36], and that this geometry be translated into constraints for the random generation of edges. This task remains a challenge for the network bootstrapping community [37,38]. In our case, it is solved using step 1 of our approach. To generate the random matrices, we have proposed a method that relies on sequential binary splitting of the matrix of interest. This method still requires development. For example, constraints on the random matrices have not been considered. This leads to the problem that the energies of these matrices may not reflect their true differences with the ground state matrix (figure 2). The authors are currently working on solving this problem.

## Authors' contributions

All authors contributed to the conception, design and interpretation of the experiments presented in the paper. C.H. collected all the data on the winemaking process, while C.-H.H. performed the data analysis. All authors participated in the drafting and revision of the paper, and approve its final version.

## Competing interests

We declare we have no competing interests.

## Funding

This work was partially funded by the USDA Specialty Crop Research Initiative (SCRI), specifically ‘Precision canopy and water management of specialty crops through sensor-based decision making’, grant number 2010-51181-21068. H.F. acknowledges support from the National Science Foundation, grant DMS-1007219 (co-funded by Cyber-enabled Discovery and Innovation (CDI) programme). P.K. acknowledges support from the Ministry of Education of Singapore by grant number: MOE2012-T3-1-008.

## Acknowledgements

The authors acknowledge our colleagues Yannis Toutounzis, Oren Kaye and Jim Orvis at Constellation Brands for assistance with fruit production and analyses.

- Received August 23, 2015.
- Accepted September 23, 2015.

- © 2015 The Author(s)