The U.S. Delegation to ISO/TC131/SC8/WG13 (Math Modeling ad hoc Project) was asked by other delegations to determine the minimum number of data samples needed to create a useful mathematical model of a hydraulic pump. As a result, coupled with some disagreements concerning technique, I came up with what appears to be a new idea: the progressive and iterative generation of math models by adding one sample to the regression data source as the iteration continues, until all samples have been included in the models.
If the pump has been subjected to, say, 353 data samples (data rows, or observations in the test data file), nearly 353 different models will be generated. Each model is created by using one more data sample than the previous model. It is a calculation-intensive process, to say the least. I call the procedure Progressively Sequenced Regression (PSR) analysis.
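The loop described above can be sketched in a few lines. This is a minimal illustration, not the code used in the reports: the function name `psr_coefficients`, the starting sample count, the three-term model, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def psr_coefficients(X, y, start=None):
    """Refit an ordinary least-squares model on the first n rows,
    for n = start .. len(y), and record each model's results."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_rows, n_terms = X.shape
    if start is None:
        start = n_terms + 1  # assumption: begin with one more row than unknowns
    sizes = np.arange(start, n_rows + 1)
    coefs = np.empty((len(sizes), n_terms))
    rms = np.empty(len(sizes))
    for i, n in enumerate(sizes):
        # Each model uses exactly one more sample than the previous one
        beta, *_ = np.linalg.lstsq(X[:n], y[:n], rcond=None)
        resid = y[:n] - X[:n] @ beta
        coefs[i] = beta
        rms[i] = np.sqrt(np.mean(resid ** 2))  # RMS residual as a figure of merit
    return sizes, coefs, rms

# Synthetic stand-in for a 353-row test data file (not real pump data)
rng = np.random.default_rng(0)
n = 353
pressure = rng.uniform(0.0, 1.0, n)
speed = rng.uniform(0.0, 1.0, n)
X = np.column_stack([np.ones(n), pressure, speed])
true_beta = np.array([2.0, -0.5, 1.5])
y = X @ true_beta + rng.normal(0.0, 0.01, n)

sizes, coefs, rms = psr_coefficients(X, y)
print(len(sizes))  # 350 models: sample counts 4 through 353
```

Plotting each column of `coefs` and `rms` against `sizes` gives the kind of trajectory in which a plateau can be looked for.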
PSR analysis works as it does for one basic reason: When a sufficient number of samples has been accumulated to produce a reliable mathematical model with good predictive ability, the model reaches “information saturation,” and any further samples taken from within the same hyper volume will not appreciably change the model or its residual error. The necessary and sufficient number of samples will have been reached. Further testing—along with the resulting greater number of samples—will contribute little or nothing to the model. This is the consequence of having reached information saturation.
However, a well-known and well-established statistical procedure is also necessary: Latin hypercube (LHC) sampling. Together, the two procedures allow the user (the test lab) to converge on and verify the adequacy of the sample size. The combination yields a far more objective assessment of sample size than other statistics-based methods, which require knowledge of parameters often unknown to engineers.
Below are four summary reports that may be downloaded, copied, and distributed without penalty by anyone with a stated interest in the subject of Sample Size Reduction as applied to mathematical modeling using linear, multiple regression as a model development method. However, each document’s cover page and all succeeding pages may not be separated and must be left intact, with no editing, modifying, or redacting whatsoever.
Other members of the Working Group assisting with this project are Paul Michael, MSOE, research chemist, and Samuel Hall, engineer, Danfoss, Ames, Iowa. I intend to write a more formal paper for publication in a scholarly journal after the next meeting, which takes place May 16 in London.
The four reports are presented below, in the order in which I wrote them. As such, they point out the evolution of the procedures and my thinking as research continued. The final result was that Latin hypercube sampling—or an equivalent method—is vital to verifying the minimum number of samples required to produce useful and reliable models.
Latin hypercube sampling is a procedure for designing an experiment; that is, it creates a set of test set points to be used in the lab test procedure. The procedure was used in several tests on a piston pump. The test set points are so well distributed throughout the tested hyperspace that the method all but guarantees that information saturation will be reached for almost any number of samples. PSR analysis is then used to verify that information saturation has indeed been reached. This is the first time that an objective means of determining minimum data sample size has been devised, and it promises to shorten time in the lab.
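For readers unfamiliar with the method, here is a minimal sketch of how a Latin hypercube set point matrix can be generated. The function name `latin_hypercube` and the variable ranges are hypothetical and chosen only for illustration; they are not the ranges used in the pump tests.

```python
import numpy as np

def latin_hypercube(n_samples, bounds, seed=None):
    """Return an (n_samples, n_variables) matrix of set points in which
    each variable's range is cut into n_samples equal strata and every
    stratum is used exactly once."""
    rng = np.random.default_rng(seed)
    bounds = np.asarray(bounds, dtype=float)
    n_dims = len(bounds)
    # One random position inside each stratum, for every variable
    strata = np.arange(n_samples)[:, None]
    u = (strata + rng.random((n_samples, n_dims))) / n_samples
    # Shuffle each column independently so the variables are decoupled
    for j in range(n_dims):
        u[:, j] = rng.permutation(u[:, j])
    low, high = bounds[:, 0], bounds[:, 1]
    return low + u * (high - low)

# Hypothetical ranges: pressure (bar), speed (rpm),
# displacement (fraction of full), viscosity (cSt)
bounds = [(20.0, 250.0), (500.0, 3000.0), (0.2, 1.0), (10.0, 100.0)]
points = latin_hypercube(50, bounds, seed=42)
print(points.shape)  # (50, 4): one row per planned test set point
```

Because the rows come out in random order, the first n rows of the matrix are themselves scattered across the whole test space, which is the property a progressive analysis relies on.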
Abstract: The theory of statistical analysis and its frequent companion subject, ordinary linear multiple regression, hold that, of the total universe of possible measurements, it is possible to obtain reliable models by sampling only a limited number of observations. That is, it is not necessary to make all possible measurements, because in the case of a pump there would be no real physical limit to the amount of data. There is always a practical limit, however, set by the cost and time needed to conduct testing.
If the theory of valid but limited sample size is true, then it follows that there must be a sample count at which the coefficients and figures of merit (FoMs) reach terminal values that are representative of many more samples. The data for a variable displacement piston pump and for a fixed displacement gear pump were shown to converge on such terminal values. However, without proper sequencing of the data, false negatives are likely to result.
This report contains the results of several Progressively Sequenced Regression (PSR) Analyses on hydraulic pump performance test data that demonstrate how the converged values can be displayed. It attempts to determine the conditions that must be met to avoid false negatives, and to provide procedural assurance that Progressively Sequenced Regression Analysis will lead to convergence. At the zone of convergence, sufficient data samples will have been collected because the regressed model coefficients and figures of merit converge on respective plateaus. Identifying the onset of the convergence zone and its plateaus will establish that sufficient data samples have been gathered.
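One possible way to locate the onset of the convergence zone numerically is sketched below. The criterion used here (every later value stays within a relative tolerance of the terminal value) and the name `plateau_onset` are assumptions made for illustration; the abstract does not state the criterion applied in the report.

```python
import numpy as np

def plateau_onset(sizes, trajectory, rel_tol=0.05):
    """Return the smallest sample count from which every later value of
    every tracked quantity stays within rel_tol of its terminal value."""
    trajectory = np.asarray(trajectory, dtype=float)
    if trajectory.ndim == 1:
        trajectory = trajectory[:, None]  # treat a single series as one column
    terminal = trajectory[-1]
    # Guard against division by zero when a terminal value is near zero
    scale = np.maximum(np.abs(terminal), 1e-12)
    inside = np.all(np.abs(trajectory - terminal) / scale <= rel_tol, axis=1)
    outside = np.flatnonzero(~inside)
    # The plateau starts one step after the last excursion outside the band
    first = 0 if outside.size == 0 else outside[-1] + 1
    return int(sizes[first])

# Made-up coefficient trajectory for sample counts 5 through 14
sizes = np.arange(5, 15)
coefficient = [3.0, 1.5, 2.4, 2.2, 1.95, 2.05, 1.98, 2.01, 2.0, 2.0]
onset = plateau_onset(sizes, coefficient, rel_tol=0.05)
print(onset)  # 9
```

A relative tolerance is a poor fit for a coefficient whose terminal value is close to zero; an absolute band would be the safer choice in that case.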
Abstract: This Sequel has been prepared in order to report on research that was underway when the original report was distributed about three weeks ago. The main discovery of this sequel is that the reason for reaching the convergence plateau at times in 50 samples and at others in 100 samples has now been identified. It originates with the creation of the Latin hypercube test sequence. The number of samples required to cross the learning zone is completely predictable and controllable. The results are very interesting—perhaps puzzling. But in the end, they are eminently commonsensical and rational.
Abstract: Test set points are a matrix of planned values for the independent variables to be used in a laboratory test program. The matrix will have, for a hydraulic pump, data columns consisting of the planned values of test pressure, test speed, test displacement, and test viscosity. During the actual test procedure, the independent variable values will be set into the test apparatus as “target test points,” and after the test equipment has stabilized at the instant steady state condition, all the independent variables and all pertinent dependent variables will be measured and recorded and placed into an output data matrix.
In this context, efficacious set point randomization applies to the creation of the sequence of set points to be used in the testing of a hydraulic pump. Latin hypercube sampling is one design of experiments method that was applied in the investigations reported here. Efficacy of sample randomization requires a uniformity of distribution and penetration throughout the test plan’s hyperspace to sufficient density that essentially no region is left unsampled, with some exceptions.
The selection of test set points must be such that their very limited number of samples permeates the entirety of the hyper volume with the same totality as would the use of the tens of thousands of possible test points. Only then can the sample data set produce information saturation, which is the desired effect.
Information saturation occurs when a model created from a data set remains essentially unchanged if more samples are added to the data set. Test point sequencing—i.e., experiment design—shall have taken place prior to the test. Limited sampling theory as a representation of a total population is a proven, and universally accepted, method of making modeling a practical and finite endeavor. LHC efficacy was clearly demonstrated in the Sequel Report in Fig. 4, page 5.
Efficacious randomization shall have so thoroughly characterized the entire unlimited universe of possible test points that no further samples are needed in order to produce a useful and accurate model from the output data matrix, but it shall have done so with the smallest number of samples in that output matrix. That is, the efficacious data set must demonstrate that it has reached information saturation. An efficacious set point matrix shall select, sequence, and distribute the set points throughout the entire test plan’s hyper volume so as to assure information saturation.
Progressively Sequenced Regression Analysis (PSR analysis) provides an objective means for assuring the testing agency that information saturation will occur within the chosen number of data points, and a sufficient number of samples has been collected so that the data is capable of producing accurate and reliable mathematical models.
Abstract: The original report on the method of Progressively Sequenced Regression Analysis demonstrated how PSR Analysis would track the regression coefficients and chosen figures of merit for a given set of source data and a specific regression function as the number of samples in the source data increased by one observation in an iterative and progressive manner. The basic purpose was to determine when the number of samples had reached a point where further increases in the number of samples caused no significant change in the model’s regression coefficients or figures of merit. In this way, the sufficient number of samples could be objectively verified.
This study also revealed that the effects of collinearity between regressor terms are displayed in a graphical and clearly visible way. This paper provides an example of collinearity between regressor terms and shows how it reveals itself and can be recognized in PSR Analysis.
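The following sketch suggests one way collinearity might show up in a progressive analysis: coefficients of nearly collinear terms wander far more from one sample count to the next than those of independent terms. It is not the example from the report. The data are synthetic, and the helper names and the use of standard deviation as a measure of wander are assumptions made for illustration.

```python
import numpy as np

def psr_trajectory(X, y, start):
    """Least-squares coefficients refit on the first n rows, n = start..len(y)."""
    return np.array([np.linalg.lstsq(X[:n], y[:n], rcond=None)[0]
                     for n in range(start, len(y) + 1)])

rng = np.random.default_rng(1)
n = 200
x1 = rng.uniform(0.0, 1.0, n)
x2_independent = rng.uniform(0.0, 1.0, n)
x2_collinear = x1 + rng.normal(0.0, 0.01, n)  # almost a copy of x1
noise = rng.normal(0.0, 0.05, n)

def coefficient_spread(x2):
    """Standard deviation of each coefficient across all progressive models."""
    X = np.column_stack([np.ones(n), x1, x2])
    y = 1.0 + 2.0 * x1 + 3.0 * x2 + noise
    return psr_trajectory(X, y, start=20).std(axis=0)

spread_independent = coefficient_spread(x2_independent)
spread_collinear = coefficient_spread(x2_collinear)
print(spread_independent)
print(spread_collinear)
```

In a plot of coefficient against sample count, the collinear pair would be expected to drift in opposite directions while their sum stays comparatively steady.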
For more information, please e-mail me.