by Andrew W. Jones, Technical Manager, KMI/PAREXEL LLC
How large a sample set of previously recorded data is needed to determine ranges that are truly representative of the process, and will the ranges be useful in the Validation effort rather than setting one up for failure? This is a difficult question to answer. It is important to note that the batches selected should have no changes between them and thus be produced under the same processing conditions. The draft FDA Guidance for Industry, Manufacturing, Processing, or Holding Active Pharmaceutical Ingredients, from March of 1998, suggests that 10-30 consecutive batches be examined to assess process consistency. 1
This is a good target statistically because, when selecting a sample size or population, the concept of Normality or Degree of Normality becomes important. As a general rule of thumb, the population should be large enough that the distribution of the samples within the population approaches a Normal or Gaussian distribution (one defined by a bell-shaped curve). Thus, in theory, if the samples are normally distributed, then about 99.7% (often rounded to 99%) of the samples will fall within the +/- 3 Standard Deviation (S.D.) range. 2 When considering this, one should think in terms of Statistical Significance (the P-Value). Here the P-Value is treated as the significance with which a given sample is indicative of the whole population. Thus, at the rounded 99% level (a P-Value of 0.01), a given sample has at most about a 1% chance (more precisely, about 0.3%) of falling outside the +/- 3 S.D. range, or (assuming no relationship with other variables) the sample has a 99% statistical significance as representative of the population. Finally, the central limit theorem states that as the sample size gets larger, the distribution of the sample mean approaches a normal distribution. A larger sample set also makes it easier to judge whether the data themselves are normally distributed, so in general a sample size of 10 - 30, as the FDA suggests, is a reasonable basis for treating the data as approximately normal.
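The coverage figures for a Normal distribution can be checked directly. The following is a minimal sketch using only the Python standard library; the function name coverage_within is a hypothetical helper introduced for illustration, and the calculation assumes an ideal Normal distribution rather than real batch data.

```python
from statistics import NormalDist


def coverage_within(k: float) -> float:
    """Fraction of an ideal Normal population within +/- k S.D. of the mean."""
    std_normal = NormalDist(mu=0.0, sigma=1.0)
    return std_normal.cdf(k) - std_normal.cdf(-k)


if __name__ == "__main__":
    for k in (1, 2, 3):
        # Report the coverage as a percentage for each multiple of the S.D.
        print(f"+/- {k} S.D.: {coverage_within(k):.2%}")
```

For +/- 3 S.D. the coverage is about 99.7%, so roughly 3 samples in 1,000 would be expected to fall outside the range if the data were truly Normal.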
The data used to establish the parameters must be extracted from controlled documents.
Once the number of batches to review is selected, the next step is to determine from which documents the processing data will be extracted. Typically, the range-establishing data must be taken from approved and controlled documents (see the examples below).
Examples of Controlled Documents:
The data extracted from the controlled documents will be analyzed to establish ranges.
Having established where the data will be selected from, the data must then be analyzed for specific trends so as to define ranges for the Process Validation Protocol’s acceptance criteria. These acceptance criteria are what the “Actual” process data collected during the execution of the Protocol will be compared to in order to verify its acceptance. This is where much thought needs to be applied, so that the acceptance criteria are not so tight that failure is imminent or so broad that meeting the criteria proves nothing. Listed below are general steps that can be incorporated into the analysis.
Trend Charts, also referred to as X-Charts, are a good way of plotting data points from a set of data where the target is the same metric (for example, pH as measured at a specific point in the process). It is a matter of defining the X-axis by the number of samples and the Y-axis by the metric that is being used. As an example, the X-axis could be a list of the batches by batch number and the Y-axis could be pH. Figure 4 is an example of a type of trend chart. In this way the data is presented graphically and can be assessed with respect to setting a range.
With the data plotted, one can quickly assess any visible trends in the data. Additionally, one can now begin the task of applying statistics to the data. It is important to determine if there are outliers in the data. Outliers may exist and can usually be rationalized by adverse events in processing, as long as they are reported appropriately. Outliers can also exist as samples that are “statistically insignificant.” As mentioned before, the P-Value is treated as the significance with which a given sample is indicative of the whole population, so outliers would have a very low P-Value. One method for determining outliers is to use a box-plot, where a box is drawn from a lower point (typically defined by the 25th percentile) to an upper point (typically the 75th percentile). The “H-spread” is defined as the distance between the upper and lower points. 3 Outliers are then determined to be any data that fall outside a predetermined multiplier of the H-spread. For example, if the lower and upper outlier limits are set at 4 times the H-spread, any data point above or below these limits is statistically insignificant and is treated as an outlier.
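The box-plot approach can be sketched in a few lines. This is an illustration under stated assumptions, not a prescribed method: the limits are assumed to be measured outward from the edges of the box (lower point minus the multiplier times the H-spread, upper point plus the same amount), the default multiplier of 1.5 is a common convention rather than a value taken from this article, and the function name and pH readings are hypothetical.

```python
from statistics import quantiles


def h_spread_outliers(values, multiplier=1.5):
    """Return (lower_limit, upper_limit, outliers) using box-plot fences.

    Assumption: the fences sit `multiplier` H-spreads beyond the box edges.
    """
    # 25th and 75th percentiles define the lower and upper points of the box.
    lower_point, _median, upper_point = quantiles(values, n=4, method="inclusive")
    h_spread = upper_point - lower_point
    lower_limit = lower_point - multiplier * h_spread
    upper_limit = upper_point + multiplier * h_spread
    outliers = [v for v in values if v < lower_limit or v > upper_limit]
    return lower_limit, upper_limit, outliers


if __name__ == "__main__":
    # Hypothetical pH readings for 12 consecutive batches (illustration only).
    ph_values = [7.02, 6.98, 7.05, 7.01, 6.97, 7.03,
                 7.00, 6.99, 7.04, 7.62, 7.01, 6.96]
    low, high, flagged = h_spread_outliers(ph_values, multiplier=4)
    print(f"limits: {low:.3f} to {high:.3f}; outliers: {flagged}")
```

A larger multiplier widens the limits and flags fewer points, so the multiplier should be chosen and justified before the data are reviewed.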
With the accumulated data plotted, the Degree of Normality should be investigated so that the data can be analyzed by the appropriate method. There are several tests for determining the Degree of Normality; some common ones are the Kolmogorov-Smirnov test, the Chi-Square goodness-of-fit test, and the Shapiro-Wilk W test. 4 Once the Degree of Normality is determined, a more appropriate statistical method can be applied for setting ranges. If the data is determined to be non-Normal, then there are two approaches to evaluating the data. The first is to apply a Nonparametric Statistical model or to transform the data (e.g. with the Box-Cox Transformation 5 ); however, nonparametric tests are considered to be less powerful and less flexible in terms of the conclusions that they provide, so it is preferred to increase the sample size such that a normal distribution is approached. 5 If the data is determined to be Normal, or the sample size is increased such that the data is distributed more normally, then the data can be better analyzed for its range characteristics.
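As a rough illustration of one of these checks, the Kolmogorov-Smirnov statistic can be computed with the Python standard library. This is a sketch under stated assumptions: the Normal curve is fitted using the sample mean and standard deviation, and the cutoff of 1.36 divided by the square root of the sample size is the usual large-sample approximation for a 5% significance level. Because the parameters are estimated from the same data, that cutoff is conservative, so a validated statistics package should be used for any formal conclusion. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def ks_statistic_normal(values):
    """Largest gap between the empirical CDF and a Normal CDF fitted to the data."""
    data = sorted(values)
    n = len(data)
    fitted = NormalDist(mu=mean(data), sigma=stdev(data))
    d = 0.0
    for i, x in enumerate(data):
        cdf = fitted.cdf(x)
        # The empirical CDF steps from i/n to (i + 1)/n at x; check both sides.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d


def looks_normal(values, coefficient=1.36):
    """True if the K-S statistic is within the approximate 5% cutoff (assumption)."""
    return ks_statistic_normal(values) <= coefficient / sqrt(len(values))


if __name__ == "__main__":
    # Hypothetical pH readings (illustration only).
    ph_values = [7.02, 6.98, 7.05, 7.01, 6.97, 7.03, 7.00, 6.99, 7.04, 7.01, 6.96]
    print(f"D = {ks_statistic_normal(ph_values):.3f}, "
          f"looks normal: {looks_normal(ph_values)}")
```

With only 10 - 30 batches, a test like this has limited power to detect non-Normality, so the result is best read alongside the trend chart and box-plot rather than on its own.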
The data, having now been displayed graphically, should be analyzed mathematically. This can be done by using simple statistics in which the mean and the standard deviation are determined. The mean refers to the average of the samples in the population. The standard deviation is the measure of the variation in the population from the mean. If the distribution proves to be normal, as from our normality tests above or by selecting a large enough population such that the central limit theorem predicts the distribution to be normal, then it stands that about 99.7% (often rounded to 99%) of the data will fall within the +/- 3 SD range. Using our example from Figure 4, the data is analyzed for its mean and standard deviation using the formulas displayed in Figure 5. Once these are determined, the +/- 3 SD limits can be applied to the trend charts by drawing them as lines at their values. This graphically displays the data per batch and shows how it fits within the statistical limit of +/- 3 SD (see Figure 6).
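The calculation of the mean, the standard deviation, and the +/- 3 SD limits can be sketched as follows. The batch numbers and pH values are hypothetical, the function names are introduced for illustration only, and the sample standard deviation (n - 1 denominator) is assumed; the formulas in Figure 5 may use a different convention.

```python
from statistics import mean, stdev


def sd_limits(values, k=3.0):
    """Return (lower_limit, centre, upper_limit) at +/- k sample SD around the mean."""
    centre = mean(values)
    sd = stdev(values)  # sample SD with n - 1 denominator (assumption)
    return centre - k * sd, centre, centre + k * sd


def batches_outside(values_by_batch, lower, upper):
    """List the batch identifiers whose value falls outside the limits."""
    return [batch for batch, value in values_by_batch.items()
            if value < lower or value > upper]


if __name__ == "__main__":
    # Hypothetical pH results keyed by batch number (illustration only).
    ph_by_batch = {"B001": 7.02, "B002": 6.98, "B003": 7.05, "B004": 7.01,
                   "B005": 6.97, "B006": 7.03, "B007": 7.00, "B008": 6.99,
                   "B009": 7.04, "B010": 7.01}
    lower, centre, upper = sd_limits(list(ph_by_batch.values()))
    print(f"mean = {centre:.3f}, limits = {lower:.3f} to {upper:.3f}")
    print("batches outside limits:", batches_outside(ph_by_batch, lower, upper))
```

The lower and upper values returned here are the lines that would be drawn on the trend chart, and new batch results can be passed through batches_outside to see whether they fall within the proposed range.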