This dissertation is submitted in partial fulfillment of the
requirements for the degree of:
Accepted for the School of Engineering and Applied Science:
| Dean, School of Engineering and Applied Science |
| Symbol | Meaning |
| $p$, $p_i$ | Number of variables, in system $i$ |
| $n$, $n_i$ | Number of observations, in system $i$ |
| $s_{ab}$ | Covariance of $a$ and $b$ |
| $s^2_x$ | Variance of $x$ |
| $\hat{\mathbf{Y}}$, $\hat{\mathbf{Y}}_i$ | Metamodel estimate, submodel $i$ estimate |
| $X$ | Column vector of X variables |
| $\mathbf{X}$ | $n \times p$ data matrix |
| $\mathbf{Y}$ | $n \times p_y$ output matrix or vector |
| $b$ | Vector of model parameters |
| $\hat{b}$ | Vector of estimated model parameters |
| $\mathbf{H}$ | The H matrix from linear models |
As a motivating example, a semiconductor manufacturing plant illustrates a large system composed of many interrelated subcomponents, with several levels of understanding of the system.
Process engineers in the semiconductor industry often focus on the individual processes rather than the overall manufacturing system [Herrmann et al., 2000, p. 1491].
Simulation models of factories focus on measures of system performance such as total lot processing time, throughput, utilization, and work-in-process [Herrmann et al., 2000]. However, high-level models of product performance characteristics, such as yield, conformance to specifications, and quality, require an understanding of how the subsystems' outputs influence and contribute to the end product. For example, it is believed that quality characteristics in processing at the Gate Contact level can have significant effects on the performance and yield characteristics of the memory chips. An understanding of this effect, and of the effects of competing models, is important for management decision making.
From a bottom-up perspective, manufacturing systems have embraced statistical process monitoring and control to identify and understand problem areas in manufacturing. Traditionally, these methods were univariate and tightly bound to the specific manufacturing process. Distinct or interrelated processes require more advanced techniques, such as multivariate or multi-stream techniques [Montgomery, 1996, p. 235]. As the measurement systems were integrated into the machinery, control methods were used to monitor the processes more accurately and to control them actively. In complex manufacturing processes, the necessity for control across distinct runs has inspired the run-to-run (R2R) control methodology, in which information external to the process is used to adjust the active controls in the process [Moyne et al., 2001]. Optimizing each sub-process to minimize the variation in its outputs may indeed improve the overall product performance, but local optimization of sub-processes does not guarantee global optimization, especially when the local optimizations do not target the same variables. This research builds a framework for combining locally optimized processes into a global model of product performance characteristics.
Hierarchical control of manufacturing systems applied in the semiconductor industry is a new and active area of research [Moyne et al., 2001, pp. 9, 321]. The work shown in Leang et al. [1996] applies the R2R control scheme to several machines in a subprocess that forms a single layer on a semiconductor wafer. This is an expansion from a single sub-process to several sub-processes and is the first step up from the bottom level of a bottom-up hierarchical model. On the other hand, simulation models of the logistical characteristics of a factory, for example, are the first step down from the top of a top-down hierarchical model. Between these top-down and bottom-up methodologies is a need for understanding the relationship between low-level sub-process characteristics and product performance characteristics.
Our conceptual approach uses hierarchical modeling and data fusion to understand the interrelationships between low-level process parameters and high-level product performance characteristics.
The key characteristics of these data appear to be well studied individually; however, interactions among the characteristics outlined below make the common solutions difficult to apply. Together, these issues produce a problem that is incompletely understood.
Suppose the cardiac, car, and housing data were all somehow related in a larger system. How can useful models be developed from the data to help provide understanding of the influence of the system's subcomponents on a variable of interest? An example is modeling high blood pressure as a function of housing, medical, census, and consumer goods data. Medical data are likely indexed on individual patients, while housing and census data may be indexed by household or by region.
Inconsistent process structure data: Changes in a process fracture a data set and raise questions of whether the change affected, improved, or hindered one or several indices of performance. In the univariate case with one process change, a simple hypothesis test suffices. With multivariate indices, multiple models and tests must be considered together, necessitating adjustments in hypothesis testing significance levels [Neter et al., 1996, p. 1024]. For example, the effects of a policy change in a school or administration at UVA on a variable that is used to rank colleges might be testable using a two-sample hypothesis test, but one must also consider the many other policy changes that occur at UVA. The many process changes in a semiconductor manufacturing plant likewise interrelate and confound the analysis.
Missing data: Sampling plans in the Die Sort testing process produce missing data rates of up to 67%. These data may not be missing at random: wafers and lots with higher failure rates are examined more carefully. Missing data due to process changes are also non-random.
Large number of variables relative to sample size: Many tools exist for dealing with high dimensional data; however, each of these methods begins with a well-formed data matrix. Combining heterogeneous data sets increases problems with missing data and dimensionality. Forming a larger data matrix requires joins of data sets, and the large number of variables exacerbates missing data problems. Considering the Metal 1 layer production data, suppose only 7% of the data in each of the 10 sub-process data sets were missing at random; the resulting table join would have missing data in about 52% of its rows. Established techniques, such as SAS PROC REG, PROC CLUSTER, and PROC PCA, each delete cases with missing elements. Combining sets with missing data therefore results in larger missing data problems.
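The 52% figure can be checked with a short calculation. The sketch below assumes that each of the 10 data sets independently loses a given row with probability 0.07; the function name is illustrative and not part of any tool used in this work.

```python
def fraction_incomplete_rows(miss_rate: float, n_tables: int) -> float:
    """Share of joined rows with at least one missing block.

    Assumes each table drops a given row independently with
    probability miss_rate (an assumption made for this sketch).
    """
    complete = (1.0 - miss_rate) ** n_tables  # row present in every table
    return 1.0 - complete


incomplete = fraction_incomplete_rows(0.07, 10)
print(f"{incomplete:.1%} of joined rows contain missing data")
```

Under case-wise deletion, roughly half of the joined table would be discarded even though each individual data set is 93% complete.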
Solutions to problems of missing data are generally limited to data that are missing at random (MAR) or missing completely at random (MCAR) (SAS:ProcMI, p. 154). Complex systems that require combination of data from different data sources must treat non-random sources of missing data, as well as the technical problems of indexing and combining many large data sets. Missing data due to process changes cause systematic changes in the data set: large blocks of nulls exist after an old process ends or before a new process starts. Such missing data are systematic and violate the MAR assumption.
Data aggregation: Differences in data aggregation are normally treated either by summarizing the non-aggregated data to match the level of aggregation in the smaller data set, or by expanding the aggregated data to match the non-aggregated data through duplication, resampling, or simulation. Expanding the analysis to lower levels of aggregation increases the dimensionality. For example, using chip-level information to predict lot-level characteristics increases the number of variables by a factor equal to the number of chips in a lot.
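A minimal sketch of the two treatments, summarizing chip-level results up to the lot level and duplicating lot-level values down to the chip level, is shown here. The record layout, lot names, and process values are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical chip-level test records: (lot, wafer, chip, passed)
chip_tests = [
    ("lot_A", 1, 1, True), ("lot_A", 1, 2, False),
    ("lot_A", 2, 1, True), ("lot_A", 2, 2, True),
    ("lot_B", 1, 1, False), ("lot_B", 1, 2, False),
]
# Hypothetical lot-level process measurement (one value per lot)
lot_process = {"lot_A": 41.5, "lot_B": 44.0}


def summarize_to_lot(rows):
    """Option 1: roll chip results up to one yield value per lot."""
    passed_by_lot = defaultdict(list)
    for lot, _wafer, _chip, passed in rows:
        passed_by_lot[lot].append(1.0 if passed else 0.0)
    return {lot: mean(flags) for lot, flags in passed_by_lot.items()}


def expand_to_chip(rows, lot_values):
    """Option 2: duplicate each lot-level value onto every chip row."""
    return [(lot, wafer, chip, passed, lot_values[lot])
            for lot, wafer, chip, passed in rows]


lot_yield = summarize_to_lot(chip_tests)
chip_table = expand_to_chip(chip_tests, lot_process)
print(lot_yield)
print(len(chip_table), "chip rows, each carrying its lot's process value")
```

Summarizing keeps the number of rows small at the cost of within-lot detail; expanding keeps the detail but repeats the lot-level values, so the expanded rows are not independent observations.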
Interpretation of competing models: Multiple competing models require higher levels of significance than single models [Neter et al., 1996, p. 1024]. Often more than one model is available for describing a system, and it would be good to use methods that combine the available models. Decision makers need tools to compare competing models. For example, some process engineers believe the Gate Contact (GC) layer is the most critical layer; others say that lithography is the most important process.
Hierarchical modeling: Hierarchical models are present in a number of disciplines. In manufacturing, a hierarchical nested control methodology establishes clearly defined levels, inputs, outputs, and the relationships between them, and fits process controllers into controllable points in the process [Moyne et al., 2001, p. 8]. It is not our intent to deliver control mechanisms, but to develop understanding of the relationships between process variables and product characteristics.
Hierarchical modeling in the form of predicting the intermediate-level characteristics and using them in turn to predict high-level characteristics is limited by the predictive power of the intermediate data. If the statistics of interest cannot be reliably predicted from the intermediates, then estimates of the intermediates will do no better. For this reason, the research plan will use a flattened hierarchy to estimate the final variables directly, and will recombine these estimates using regression and more advanced tools.
Hierarchical modeling used in medical research is called meta-analysis, which combines the results of multiple studies to improve the information. Meta-analyses are typically performed on similar sorts of interventions, with the hierarchical levels and elements based on the study, type of model, and an overall effect. These studies typically have univariate outputs, and the studies identify the effects of treatments on an outcome [Sutton et al., 2000]. For example, several drug efficacy studies show that a drug reduces the risk of disease; the studies are grouped together, and pooled estimates of effects by study type and drug are produced. The benefit of meta-analysis is to provide a Bayesian prior distribution of an effect, which tends to moderate and tighten the confidence intervals of each study, increasing the information gained.
Another hierarchical modeling body of literature is structural equation modeling (SEM), a methodology of specifying causality and confirming a hierarchy of models in a system. SEM studies are primarily low dimensional, and serve to support models of causality in social sciences.
Data fusion is a method for combining sources of data to produce better information about the system. Meta-analysis is a means of combining a number of disparate published studies into an overall model. In the manufacturing environment, states of machine operation (in control or out of control) could be considered treatments, and the yield and failure rate data could be modeled as a log odds ratio. If partitions of the manufacturing system were modeled as separate studies, then the resulting separate models could be combined using meta-analytical techniques. Results from a meta-analysis model would take the form ``out of control conditions in the sub-models under study will produce $m_i \pm c s_i$% greater failure rates in margin yield for each sub-model $i$.''
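One way to make this concrete is the standard fixed-effect, inverse-variance pooling of log odds ratios. The sketch below is an illustration of that textbook technique and is not necessarily the combination rule used in this research; the sub-model names and pass/fail counts are hypothetical.

```python
import math


def log_odds_ratio(fail_out, pass_out, fail_in, pass_in):
    """Log odds ratio of failure (out of control vs. in control)
    and its large-sample standard error from a 2x2 count table."""
    lor = math.log((fail_out * pass_in) / (pass_out * fail_in))
    se = math.sqrt(1 / fail_out + 1 / pass_out + 1 / fail_in + 1 / pass_in)
    return lor, se


def pool_fixed_effect(estimates):
    """Inverse-variance weighted average of (estimate, std. error) pairs."""
    weights = [1.0 / se ** 2 for _, se in estimates]
    pooled = sum(w * est for w, (est, _) in zip(weights, estimates)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se


# Hypothetical counts per sub-model: (fail_out, pass_out, fail_in, pass_in)
counts = {
    "metal_1": (30, 70, 15, 85),
    "gate_contact": (40, 160, 20, 180),
}
per_model = {name: log_odds_ratio(*c) for name, c in counts.items()}
pooled, pooled_se = pool_fixed_effect(list(per_model.values()))

for name, (lor, se) in per_model.items():
    print(f"{name}: log OR = {lor:.3f} (SE {se:.3f})")
print(f"pooled: log OR = {pooled:.3f} (SE {pooled_se:.3f})")
```

The pooled standard error is smaller than that of any single sub-model, which is the sense in which combining the partitions increases the information gained.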
Although I mentioned dynamic programming in my proposal, I use it only to provide a notation for describing manufacturing questions. For example, the ill-defined ``Golden Signature'' modeling can be used for diagnosis of plant problems, in-line disposition, or future design, all with very different needs for analysis. The observability and controllability of a control system depend upon understanding the system. Building understanding of the overall system must come before prediction and control. This work will help to build understanding of the relationships between process variables and product variables.
I propose building a two-level hierarchical model. The sub-components are the models matching existing data structures and engineering expectations. Examples of sub-models matching the data structures are the electrical testing databases and the logistics database. Examples of sub-models from engineering expectations are the Metal 1 process data, the Gate Contact process data, the Key Quality Control measurement parameters, and the TEG PSL database. If simple models of product outputs can be created for each of these, the models can then be combined using data fusion techniques.
To demonstrate the method while limiting the scope of the effort, I will use lot-level aggregated data on three output variables (margin failure, functional, and DC testing data), process data from two semiconductor layers (Metal-1 and Gate Contact), the quality control data, and the electrical test data to build models of the margin yield, the functional yield, and the DC testing yield.
Equation 1.1 shows how partitioning the problem based on sub-models can reduce the dimensionality of the problem. Each of the $i$ sub-models is a $\hat{\mathbf{Y}}_i = f(\mathbf{X}_i)$, where the estimation function could be linear, but may require data cleaning, missing data imputation, variable selection, feature reduction, or feature extraction. These models may not explain significant portions of the variation in $\mathbf{Y}$, but it is hoped that, taken together, the several models can help explain some portion of the variation.
| (1.1) |
| (1.2) |
Although the partitioning and segmenting of the problem decreases the dimensionality of the sub-problems, it does not solve all the problems in the manufacturing data. Missing data, high dimensionality, aggregation, and structure changes will remain as problems, but will be more manageable in the smaller models.
Techniques for building the submodels (Equation 1.2) include regression, logistic regression, principal components regression, and partial least squares. Since the outputs of the models are to be combined in the upper-level model, other techniques could be used, as long as they produce an estimate $\hat{\mathbf{Y}}$ of the product characteristics. The top-level model may use these estimates to build understanding of the effects of the sub-models on $\hat{\mathbf{Y}}$ and of the interactions between processes.
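The two-level structure can be sketched with ordinary least squares at both levels. This is only an illustration under stated assumptions: the data are synthetic, the two blocks `X1` and `X2` stand in for two hypothetical sub-process data sets, NumPy is assumed to be available, and the fit is evaluated in-sample, so a real study would need held-out data.

```python
import numpy as np


def fit_ols(X, y):
    """Least-squares coefficients for y ~ 1 + X."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict_ols(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef


def r_squared(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Synthetic stand-ins for two sub-process data sets and one yield variable
rng = np.random.default_rng(0)
n = 60
X1 = rng.normal(size=(n, 3))  # hypothetical sub-process 1 variables
X2 = rng.normal(size=(n, 2))  # hypothetical sub-process 2 variables
y = 0.8 * X1[:, 0] - 0.5 * X2[:, 1] + rng.normal(scale=0.5, size=n)

# Lower level: one sub-model per data set, each estimating y directly
sub_estimates = [predict_ols(fit_ols(X_i, y), X_i) for X_i in (X1, X2)]

# Upper level: combine the sub-model estimates in a second regression
Y_hat_sub = np.column_stack(sub_estimates)
top_coef = fit_ols(Y_hat_sub, y)
y_hat_top = predict_ols(top_coef, Y_hat_sub)

r2_sub = [r_squared(y, est) for est in sub_estimates]
r2_top = r_squared(y, y_hat_top)
print("sub-model R^2:", [round(float(r), 3) for r in r2_sub])
print("top-level R^2:", round(float(r2_top), 3))
```

The top-level coefficients in `top_coef` indicate how much weight each sub-model receives, which is one simple way to compare the contributions of competing sub-models.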
Theoretical elements - Semiconductor manufacturing has a number of interesting elements, some of which have been treated separately.
Automated manufacturing systems, such as those that produce semiconductors, can produce such large quantities of data that understanding of the interrelations between subsystems is difficult. This work produces
In order to control and improve chip production in semiconductor manufacture, a company may seek to use manufacturing and process data already recorded during the process to more fully understand the system and provide avenues for improvements. The data that semiconductor manufacturers collect are large in terms of storage space and number of variables, but small in terms of the number of coherent observations. While any particular operation may have a number of observations, the large number of monitored variables produces effects similar to short-run manufacturing processes: insufficient degrees of freedom to reliably model the process. This work will produce a modeling methodology for managing hierarchical manufacturing data and will seek to produce useful models for semiconductor manufacturing, and for complex manufacturing systems in general.
Smaller lot sizes and more flexible manufacturing processes, along with increased process complexity, combine to produce short-run processes. As more automated measuring and recording equipment enters the manufacturing process, a difficulty with the dimensionality of the problem emerges: runs shorter than the dimensionality of the problem. A high dimensional manufacturing process can have fewer distinct observations $n$ than process variables $p$. In prior work with a semiconductor manufacturer, we established that direct models of $Y_{yield} = b X_{process}$ can be ill-defined due to the dimensions of the $n \times p$ data matrix $X$, where $n \ll p$. These conditions lead to instability in the parameter estimates $b$ for a linear model and to singularity of $\mathrm{cov}(X)$, yet they occur in complex manufacturing processes. Because the degrees of freedom in the system are limited by the number of observations, additional constraints must be placed upon the overall model. Assuming a hierarchical structure to the process (i.e., that the outputs of the system are functions of certain key parameters of the system, which are in turn functions of lower-level operations in the production process) may constrain the models into estimable and testable problems.
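The singularity and instability described above can be demonstrated on a small synthetic example. The dimensions below ($n = 8$, $p = 30$) and the random data are assumptions made for illustration only, and NumPy is assumed to be available.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 8, 30                      # far fewer observations than variables
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# The p x p sample covariance has rank at most n - 1, so it is singular
S = np.cov(X, rowvar=False)
rank_S = np.linalg.matrix_rank(S)
print(f"cov(X) is {S.shape[0]}x{S.shape[1]} with rank {rank_S}")

# Minimum-norm least-squares solution fits the n observations exactly
b_min, *_ = np.linalg.lstsq(X, y, rcond=None)

# Adding any null-space direction of X gives a different, equally exact fit
_, _, Vt = np.linalg.svd(X)       # rows n..p-1 of Vt span the null space
b_alt = b_min + 5.0 * Vt[-1]

print("both fit exactly:", np.allclose(X @ b_min, y), np.allclose(X @ b_alt, y))
print("same coefficients:", np.allclose(b_min, b_alt))
```

Because infinitely many coefficient vectors reproduce the data exactly, the data alone cannot identify $b$; this is the motivation for adding structural constraints such as the hierarchy.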
High dimensional data analysis requires effective visualization methods, since traditional methods such as run plots and scatter plots do not scale well to high dimensional systems [Forrest and Mastrangelo, 2001]. Using methods from clustering to sort the variables and observations can aid in high dimensional visualization. A certain manufacturer seeks a ``Golden Signature'' program which intends to identify production parameters from high quality lots and estimate the effects of deviations from these ideal parameters. This ``Golden Signature'' program requires a comprehensive model relating the many low level process variables to the high level yield variables.
Although difficult, this project is an ideal application of systems engineering because the complex system is managed by different groups of people. Semiconductor manufacturing is an extreme case of complicated manufacturing systems, with issues of discrete part manufacturing, aggregations and dis-aggregations of data in time, production lots, and production processes. The interactions between production, engineering, information technology, and management indicate a need for an interdisciplinary method integrating the process. The general approach proposed here is to develop a methodology that uses the hierarchical structures inherent in the manufacturing and engineering processes to manage the complex models and high dimensionality of manufacturing processes.
Commercial semiconductor devices are manufactured in and on the surface of wafers: thin disks, typically 200 mm or 300 mm in diameter, cut from large ultra-pure crystals. An area on the wafer containing a single discrete device or integrated circuit (IC) is called a chip or die. Depending on the dimensions of the wafer and the dies, several hundred chips are formed on a single wafer.
During fabrication, wafers are transported and processed in standard lots of twenty-five wafers each. Each lot undergoes hundreds of individual processing steps, in which different parts of the ICs are etched in thin layers of material grown or deposited on the working surface of the wafers. Each process step must be tightly controlled to ensure dimensional tolerances typically measured in nanometers.
Fabrication of a single lot requires approximately three months. Throughout, process settings, engineering parameters, and test data are logged for each fabrication tool at both the wafer and lot level, via a central computer network called a manufacturing execution system. With as many as 5000 wafer starts a week, process and engineering databases requiring hundreds of gigabytes of memory are normal.
The data detail the manufacturing processes involved in production. Analysis of the data differs from current data mining techniques developed for business sales information, market-basket analysis, image analysis, or spatial data because of the large number of variables, the interactions between sub-processes, and the relatively small number of observations. For example, a memory device involving 22 layers of semiconductor can involve 524 processing steps over 3 months with 21710 process variables. Figure 1.1 shows a sample of 90 days of lot-level production data for one product, the misalignments between separate data tables, and that $n = 221 \ll 21710 = p$. Besides the vast amounts of data, another challenge is that the measurements are commonly collected on different aggregations of parts at the chip, wafer, batch, and lot levels. Since the measurements for a particular chip are spread out over time, are collected at different aggregation levels, and are numerous relative to the production yield data, current data mining and analysis techniques such as clustering and linear regression modeling are inefficient and difficult to apply to semiconductor manufacturing environments.
The target of the proposed work is the system operational level: to extract knowledge from data from sophisticated processes in order to improve operations, that is, to improve productivity, decrease ramp-up time, and identify and validate quality control parameters. These improvements will ultimately increase yield. The anticipated research will focus on two areas: operational modeling of manufacturing data, and data representation and manipulation. I will develop a methodological approach to solving the complex modeling problems that arise in semiconductor manufacturing. I will also show how subsystems of the manufacturing process could be combined to produce an overall model suitable for process monitoring and improvement.
If we think of each uppercase $I_n$, $A_n$, $X_n$, $Y_n$ as a row vector of the various parameters in each processing step, the current database system records the $A_n$ vector of machine settings and the $Y_n$ vector of measurements at each processing step. For example, if step $n=5$ is a visual inspection that is always done the same way, $A_5$ is the constant procedure that is used for inspection, $Y_5$ is the results of the inspection, and $d_5$ is any change in the wafer due to the inspection (e.g. a mote of dust became stuck to the part). The part changed from an uninspected wafer without dust ($X_4$) to an inspected part with dust ($X_5$), and for the next step we know everything we did before, plus the fact that it was inspected and the results of the inspection. For a more complex operation, masking for example, the vectors would be much more complicated: the action would have many more options and machine settings, and the step would produce much more data.
This model captures several important facts: we may not know everything important about the wafer at every stage; we measure some features that may or may not be representative of the state of the part; we can do different things at different stages; and we know more and more about the part as it steps through the process. The $I_N$ vector holds every item of recorded data about a particular wafer, and may be thousands or millions of attributes wide. This model is very general, but has enough elements to represent several problems of interest to a manufacturer: yield improvement, design of the best recipe, in-line classification and disposition, and identification of new defects.
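The bookkeeping in this model can be sketched as a small data structure. The class and field names below are hypothetical, and the sketch reflects one reading of the model: only the actions $A_n$ and measurements $Y_n$ are recorded, while the true part state $X_n$ and disturbance $d_n$ are treated as unobserved and therefore are not stored.

```python
from dataclasses import dataclass, field


@dataclass
class StepRecord:
    """What is recorded at processing step n."""
    step: int
    action: dict        # A_n: procedure or machine settings
    measurements: dict  # Y_n: measurement results


@dataclass
class WaferHistory:
    """I_n: everything recorded about one wafer up to step n."""
    wafer_id: str
    records: list = field(default_factory=list)

    def log(self, action: dict, measurements: dict) -> None:
        step = len(self.records) + 1
        self.records.append(StepRecord(step, action, measurements))

    def information_vector(self) -> dict:
        """Flatten the history into one wide row keyed by step and name."""
        row = {}
        for rec in self.records:
            for name, value in rec.action.items():
                row[f"A{rec.step}.{name}"] = value
            for name, value in rec.measurements.items():
                row[f"Y{rec.step}.{name}"] = value
        return row


history = WaferHistory("W01")
history.log({"procedure": "visual_inspection"}, {"defects_seen": 1})
history.log({"mask_id": "M1", "exposure_ms": 120}, {"overlay_nm": 3.2})
row = history.information_vector()
print(sorted(row))
```

Each additional step widens the flattened row, which shows how a process with hundreds of steps produces an information vector with thousands of attributes per wafer.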
Each of these alternative manufacturing problems depends on estimating $Y_N$ from different $I_{0,\ldots,N}$. This mathematical model is general enough to capture all of the elements of these problems, but it might also be intractable due to a number of problems:
To expedite the collection of data, each machine operation is recorded in a transactional database whose structure mirrors the physical production and testing machinery. This optimizes data collection, but hinders data analysis. Each machine is capable of emitting a number of different measurement records, and tables corresponding to each machine and each record type are automatically updated as production flows through the machines. While this data recording was initially driven by contractual agreements with the parent companies of the manufacturing plant, current efforts seek to use this data to improve the production process. The transactional database holds all the required data, but since the results are not aligned with the batch, lot, wafer, reticle shot, or chip, using the data for analysis is not possible without intricate database queries.
High dimensional data often requires reduction of the number of dimensions in order to build knowledge. Examples of data domains similar to semiconductor manufacturing data include high dimensional data from image analysis, radar and spectral data, text recognition and mining, speech recognition, genetic code sequences, and chemometrics. Interpretation of high dimensional data is difficult, as is understanding of a high dimensional model. Several of these fields have underlying spatial or theoretical models on which to base further analysis. If the theory is lacking, however, then the use of data mining techniques to build models may help to develop theory about these complex domains. Semiconductor manufacturing differs from these domains in that the structure of complex manufacturing data is not well organized for analysis.
Examination of the semiconductor manufacturing process leads one to the question: What methods are appropriate for using high-dimensional, small-sample manufacturing data to predict and understand complex manufacturing processes?
High dimensional complex systems present a number of unique challenges in data analysis for understanding, prediction, and control. This work studies the top-level system of a semiconductor manufacturing factory in order to build understanding of the effects of lower-level processes on the upper-level goals and objectives. Process engineers in the semiconductor industry often focus on the individual processes rather than the overall manufacturing system [Herrmann et al., 2000, p. 1491].
To examine the relationships between low-level processes and high-level system performance characteristics, we can use a hierarchical model of process outputs based on sub-processes, which can support decisions about allocating resources between sub-processes.
Although the problem domain is broad, the focus on a methodology for understanding the product performance characteristics as related to the process subcomponents is more tightly defined. This methodology is applicable beyond the factory supplying my data; large-scale systems with heterogeneous sub-components need an understanding of the effects of the sub-components on the entire system [Fowler et al., 2000].
Van Zant [1997] provides a good overview of the entire semiconductor manufacturing process, while Horton [1998] further explains a number of yield modeling techniques and formulas. Nurani et al. [1998] predict yield from in-line pattern defect density information drawn from multiple-layer inspections. Shindo et al. [1998] model the effect that defects on lower layers of a semiconductor sandwich have on upper layers.
Hess and Weiland [1999] use a sampling plan across a wafer and lot to produce defect density distributions, which can help to model yield. The paper seems most applicable to predicting the yield of one program based on the defect maps of another program (e.g. 128MiB DRAM based on 64MiB DRAM).
Cunningham and MacKinnon [1998] show a number of defect characterization statistics in order to more fully understand low yield. The methods discussed include quadrat statistics (defects per die), spatial point pattern statistics for spatial randomness and spatial clustering, and collinearity identification. A fuller explanation of spatial clustering monitoring by Hansen et al. [1997] provides guidance in creating test statistics from a pick-map and using them to monitor wafers for spatial randomness. Friedman et al. [1997] explain a method for monitoring large area defects by separating a smoothed cluster from the underlying spatially random component.
Chaudry et al. [1998], working with SEMATECH, propose an object-oriented database to provide responsive control during the manufacturing process. This method recognizes the hierarchy of the semiconductor manufacturing process, but requires good models of inter-process interactions (i.e. detailed models of downstream features based on upstream parameters). Given these detailed models, an ``active-database'' could generate novel recipes as the process proceeds.
Richards and Shen [2000] develop a model of the physical characteristics of a semiconductor device based on some electrical test data. This is the inverse of the problem of predicting in-line electrical test results based on process parameters.
Fowler et al. [2000] survey a number of modeling methods applied to semiconductor manufacturing data from the probe testing machine in a micromechanical accelerometer production process. The data are in several families (i.e. min, max, ave, std, quartiles, and quartile range) of the several monitored variables. Although this work uses wafer-level semiconductor data, it uses data from only one step in the manufacturing process. They use some ad hoc clustering to stratify the yields before applying some tools; on the set of low-yield wafers, the models did not work well. Fowler et al. [2000] used PCA-based regression to reduce the dimensionality, but the resulting model was poor and difficult to interpret. The model complexity is 138 = 23 × 6 parameters based on 1123 observations, which is about a 1:8 ratio. The micromechanical accelerometer in Fowler et al. [2000] may be well described by 23 features, but a 64MiB memory chip is a much more complicated device with many more elements.
Shin and Park [2000] discuss data volume in semiconductor manufacturing as 1,000,000 wafer transactions per day. This is consistent with what we see in other fabrication plants: 25 lots of 25 wafers starting and progressing through a 500-step process in 90 days, assuming three transactions per wafer per operation. They use a hybrid neural network with memory (implemented as a k-nearest neighbor vector). The k-NN approach helps the interpretation of the reasoning done by the neural net, and improves on the plain neural net's performance by adding the outputs of a k-nearest neighbor model to the inputs of the neural net.
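The order of magnitude quoted above can be checked with a short calculation. This is only a back-of-envelope sketch: it assumes that 25 lots start every day, so that in steady state one day of processing equals one day of wafer starts carried through every step. The constant names are illustrative.

```python
# Back-of-envelope check of the daily transaction volume.
# Assumption (not stated explicitly in the text): 25 lots start each day,
# so steady-state daily work equals one day's starts times all 500 steps.
LOTS_PER_DAY = 25
WAFERS_PER_LOT = 25
PROCESS_STEPS = 500
TRANSACTIONS_PER_WAFER_STEP = 3

wafer_starts_per_day = LOTS_PER_DAY * WAFERS_PER_LOT
transactions_per_day = (
    wafer_starts_per_day * PROCESS_STEPS * TRANSACTIONS_PER_WAFER_STEP
)
print(transactions_per_day)  # 937500, close to the 1,000,000 figure
```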
These papers have shown that yield modeling in semiconductor manufacture is an area of significant current interest, that attempts are being made to estimate physical characteristics based on easily measured electrical characteristics, and that yield modeling is still not perfectly understood. Opportunities exist for linking the various levels of semiconductor modeling to produce useful models of yield.
An examination of different types of models and the dimensionality of their data provides insight into the use of degrees of freedom in model building. Large data can be large in a number of different ways, depending on the shape, size, and storage requirements of the data. Data with large storage requirements can cause slow processing. Data with large numbers of observations can also impact processing speed. Data with large numbers of dimensions or variables relative to the number of observations can cause modeling problems through the estimation of the covariance of a data set.
A simple linear model $Y = bX$, which does not require a covariance matrix estimate but only point estimates of the coefficients, requires a $b$ for each $x$ in the model. Even if the $b$ terms are zero, it requires a degree of freedom to estimate each of them as zero. This simple model requires at least $p$ observations in order to make estimates of the process. A more rigorous linear model additionally estimates standard errors of the model coefficients in order to determine whether the parameters are significant. These two sets of parameters imply that $2 \times p$ degrees of freedom are consumed by a simple modeling process.
Using the simple Hotelling multivariate process monitor, $T^2$, for example, requires estimating $p^2$ covariance and $p$ mean coefficients. The number of degrees of freedom consumed by these estimates exceeds that of linear models because it includes covariance terms between each pair of variables.
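To make the comparison concrete, the following sketch tabulates the parameter counts described above for a few values of $p$. The function names are hypothetical. The `unique_only` option is an addition that uses the fact that a symmetric covariance matrix has only $p(p+1)/2$ distinct entries; the text itself counts $p^2$.

```python
def linear_model_params(p):
    """Coefficients plus their standard errors: 2 * p, as counted in the text."""
    return 2 * p


def t2_params(p, unique_only=False):
    """Mean vector (p terms) plus covariance terms for a Hotelling T^2 monitor.

    The text counts p**2 covariance terms; a symmetric covariance matrix
    has only p * (p + 1) // 2 distinct entries.
    """
    cov_terms = p * (p + 1) // 2 if unique_only else p * p
    return p + cov_terms


for p in (5, 71, 500):
    print(p, linear_model_params(p), t2_params(p), t2_params(p, unique_only=True))
```

Even with the symmetric count, the covariance terms grow quadratically in $p$, while the linear model grows only linearly.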
The small sample sizes available over a time span of interest provide for only small models relative to the potential dimensionality of the problem. In order to provide useful models with only a limited number of observations, the models should be limited to a complexity smaller than the number of observations. Complexity in this sense is the number of parameters in the model.
A very small model of the manufacturing process might estimate two terms representing an index of a process step and its effect on yield. For example: process yield is 0.75 plus some factor times the anneal temperature in step 255. Estimating confidence intervals of the intercept and factor would consume four observations, leaving the rest of the observations to estimate the uncertainty in the predicted yield. The problem with models like these is that there are a great number of competing models, and the uncertainty in model parameters nearly guarantees acceptance of invalid models.
More data would be consumed to validate models and to choose between competing models [Kennedy et al., 1998]. As an extension to the general wisdom of a sample size of 6-10 times the complexity of the dataset, He and Shau [2000] establish bounds on the increase of complexity of a model as the sample size increases. Their limits are based on the types of functions being estimated (i.e., linear and logistic regressions and a spatial median) and the continuity of the functions. Discontinuous functions can support less complexity on the same data, while increasingly large samples can support more complex models, but not at the rate of increase of the sample size (e.g., a sample of 100 points supporting a 10-term model would better satisfy asymptotic assumptions than a sample of 1000 points supporting a 100-term model). Under one reference they cite, a linear model without discontinuities would support only about 3 times as many terms with 10 times as many samples: 31 terms on 1000 samples is similar to 10 terms on 100 samples.
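The 10-to-31 example is consistent with the supportable number of terms growing as the square root of the sample size. The sketch below assumes that scaling purely to reproduce the numbers quoted above; the exponent, the reference point, and the function name are assumptions, not a statement of the cited bounds.

```python
def supported_terms(n, n_ref=100, terms_ref=10, exponent=0.5):
    """Terms supportable at sample size n, scaled from a reference point.

    exponent=0.5 is an assumption chosen to match the 10 -> 31 example.
    """
    return terms_ref * (n / n_ref) ** exponent


print(int(supported_terms(1000)))  # 31
print(1000 // 10)                  # 100 terms under a fixed 10:1 rule of thumb
```

Under this assumed scaling, a fixed observations-per-parameter ratio becomes increasingly optimistic as the sample grows.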
Huffer and Park [2000] show a test for structure, basically by removing the first and second moments in the data, then studying the multivariate distribution with a chi-squared test. This leads into other methods of high dimensional visualization, such as Sliced Inverse Regression (SIR) [Li, 2001]. SIR bins the output variable, calculates the corresponding means and covariance of the input variables, reduces the dimension of the predictors, and then examines the output variables in the reduced space [Basilevsky, 1994].
Several feature extraction methods, such as Principal Component Analysis (PCA), Singular Value Decomposition (SVD), Factor Analysis, and Partial Least Squares (PLS), can be used to code the original variables in a smaller dimensional space. PCA produces a set of uncorrelated linear combinations of the initial variables ranked by their contributions to the overall variance [Johnson and Wichern, 1992]. Each PCA component includes each of the original variables, encoded in the associated eigenvector. Singular value analysis is a method of characterizing a data matrix of less than full rank with eigenvalues, eigenvectors, and an orthonormal basis matrix [Basilevsky, 1994]. Alter [2000] uses SVD to reduce the dimensionality of high dimensional gene data ($n \times p = 14 \times 5981$) to a smaller space of `eigengenes'. These feature extraction techniques map the high dimensional data into a different space, then truncate the dimension of the new space into a smaller dimension.
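The map-then-truncate idea can be sketched with a PCA computed through the SVD of the centered data matrix. The data here are synthetic, with only the $n \ll p$ shape borrowed from the gene example; `pca_reduce` is a hypothetical helper name.

```python
import numpy as np


def pca_reduce(X, k):
    """Project the rows of X onto its first k principal components."""
    Xc = X - X.mean(axis=0)                      # center each variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :k] * s[:k]                    # component scores (n x k)
    explained = s**2 / np.sum(s**2)              # share of total variance
    return scores, Vt[:k], explained[:k]


rng = np.random.default_rng(0)
X = rng.normal(size=(14, 200))                   # synthetic stand-in, n << p
scores, loadings, explained = pca_reduce(X, k=3)
print(scores.shape, loadings.shape)
```

Each row of `loadings` has a weight for every original variable, which is why extracted components can be harder to interpret than a selected subset of variables.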
In contrast to feature extraction methods, feature selection methods attempt to choose a subset of the initial variables while maintaining the information required to reliably model the process. Feature selection methods include or exclude variables from an analysis based on some measure of relevance. Methods of variable subset selection include nested model methods, such as backwards elimination or forward selection in regression using changes in model $R^2$; decision trees, such as C4.5 or CART, that use an information measure to rank variables; and manual methods using advice from domain experts.
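As an illustration of the nested model approach, the following is a minimal sketch of forward selection driven by the change in $R^2$. The stopping threshold `min_gain`, the helper names, and the synthetic data are assumptions made for the example.

```python
import numpy as np


def r_squared(X, y):
    """R^2 of an ordinary least squares fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)


def forward_select(X, y, min_gain=0.01):
    """Greedily add the variable giving the largest increase in R^2."""
    chosen, best = [], 0.0
    remaining = list(range(X.shape[1]))
    while remaining:
        candidates = [(r_squared(X[:, chosen + [j]], y), j) for j in remaining]
        r2, j = max(candidates)
        if r2 - best < min_gain:      # stop when the gain is negligible
            break
        chosen.append(j)
        remaining.remove(j)
        best = r2
    return chosen, best


rng = np.random.default_rng(1)
X = rng.normal(size=(200, 8))
y = 2.0 * X[:, 2] - 1.5 * X[:, 5] + rng.normal(scale=0.5, size=200)
selected, r2 = forward_select(X, y)
print(selected, round(r2, 3))
```

Unlike feature extraction, the result is a list of original column indices, so the selected model stays directly interpretable in terms of the measured variables.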
Bocchieri and Wilpon [1993] discuss the addition and elimination of features in a speech recognition problem. As equipment becomes faster, new features and higher order transformations of the original features become available. The new speech recognition variables can improve the accuracy of speech recognition algorithms, but the computational complexity of the algorithms becomes an issue. Bocchieri and Wilpon [1993] suggest a method for limiting the number of features based on a misclassification distance in each of the dimensions. John et al. [1994] suggest eliminating irrelevant variables using a ``wrapper'' technique based on stepwise selection or elimination of features and applying the data mining technique to each of the subsets. John et al. [1994] develop a definition of weak relevance based on conditional dependence on a subset of variables. Hall and Holmes [2000] compare several methods of attribute selection and suggest information gain and a correlation based method for high dimensional data. Hall [2000] develops a correlation based feature selection method as a heuristic search of all subsets of features. Wu and Urpani [1999] suggest eliminating the least relevant features rather than selecting the most relevant in order to handle messy data. Liu and Setiono [2001] propose several random search methods for selecting subsets from high dimensional data.
Feature extraction methods are often used to summarize and index high dimensional databases for similarity searches. Aslandogan and Yu [1999] survey several systems for image storage and retrieval. Dimensional reduction of color or spectral histograms, and texture signatures derived from Fourier transforms of the images, are also discussed.
Each semiconductor chip, wafer, lot, or batch carries a large number of independent process variables and characteristic measurements which may differ with each chip/wafer/lot/batch. Image analysis involves a large number of pixels and their associated characteristics, which may differ for each image. For example, digital cameras routinely produce 1.3 megapixel images; reducing these to a simple greyscale image of 100x200 pixels by 8 bit depth for internet web presentation produces an array of pixels containing 20000 variables with 256 levels. Dimensional reduction is a strategy of creating summary or signature features (or variables) that may give an analyst a better perspective than a pixel-by-pixel representation. For example, an analyst could query the database for images that are `green', or, in manufacturing, create indices by lot number, by process, and by yield. This facilitates similarity signature modeling in that an analyst could request all of the lots similar in yield to lots X, Y, and Z, for example. In addition, a practical consideration of indexing strategies is that the number of attributes or fields in relational tables is limited. For example, the commercial database program Oracle 7 limits the number of attributes to 256 in a table. Ng and Tam [1999] use a multiple level filter to manage high dimensional data. They find that color, texture, and `eigenface' representations of image data may generate 256, 240, or 400 dimensions, respectively. Compared to the original dimensionality of the image data, the reduction is dramatic, but search through an index with > 20 dimensions essentially degenerates into a sequential search. Ng and Tam [1999] solve this problem with a system for storing a multidimensional index in a hierarchy based on transformed features.
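A signature feature of the kind described above can be as simple as an intensity histogram. The sketch below reduces a synthetic 100x200 greyscale image (20000 pixel variables) to a 16-bin signature; the bin count and the function name are illustrative assumptions and are not taken from any of the cited systems.

```python
import numpy as np


def grey_histogram_signature(image, bins=16):
    """Summarize an 8-bit greyscale image as a normalized intensity histogram."""
    counts, _ = np.histogram(image, bins=bins, range=(0, 256))
    return counts / counts.sum()


rng = np.random.default_rng(2)
image = rng.integers(0, 256, size=(100, 200), dtype=np.uint8)  # synthetic image
signature = grey_histogram_signature(image)
print(image.size, signature.size)  # 20000 16
```

A signature this short fits comfortably within a relational table's attribute limit and can be indexed directly for similarity queries.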
Gene expression data is a domain with small sample sizes and a large number of variables, many of which are irrelevant to the problem of interest. Eisen et al. [1998] describe data with $n \approx 10^2$ and $p \approx 10^5$, and a method for clustering variables based on a correlation coefficient using average linkage, then displaying them for human interpretation. The general model using gene arrays is to take samples of the biological manifestation, and then compare gene arrays in order to identify genes that are related to the question of interest. Kamimura et al. [2000] use mean hypothesis testing, checking for statistically significant differences between variables given classes of outputs, to determine the relevance of genetic information to a problem of interest.
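A per-variable mean comparison of this kind can be sketched as follows. The Welch t statistic is an assumed choice for illustration and is not necessarily the test used in the cited work; the data are synthetic, with five relevant variables planted among 1000.

```python
import numpy as np


def welch_t(X, labels):
    """Welch t statistic per column, comparing class 1 with class 0."""
    a, b = X[labels == 1], X[labels == 0]
    diff = a.mean(axis=0) - b.mean(axis=0)
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    return diff / se


rng = np.random.default_rng(3)
n, p = 40, 1000                       # few observations, many variables
labels = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, p))
X[labels == 1, :5] += 4.0             # only the first five variables are relevant
t = welch_t(X, labels)
top = np.argsort(-np.abs(t))[:5]      # five largest |t| values
print(sorted(top.tolist()))
```

With this many variables, some irrelevant ones will show moderately large statistics by chance, so any significance threshold needs a multiple-comparison adjustment.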
Identification of subcomponents can come from examining the form of the data, the division of the people studying the problem, the models applied to the data, and the reports used to examine the system.
In the semiconductor manufacturing plant under study, data is stored along functional or departmental lines. Manufacturing data is collected automatically for each machine and stored in a large database with a table for each machine. Testing measurements are recorded by a different department in a different database. Process engineers use a summary report of historically critical data to monitor the process. Quality control engineers use a different set of databases to monitor the processes. Production managers use routing and output data to monitor the efficiency of the plant. Each of these functional groups produces, stores, accesses and analyzes data drawn from portions of the entire process, draws conclusions, and makes decisions based on their mental models of the process. Considering each of these systems as a subcomponent of the larger production system raises a question of how to combine and understand these models in relation to one another.
If a large system can be decomposed into meaningful subunits, understanding may be gained by examining the relationships of subunits to the high level system outputs and between subunits. If users of the system conceive of the system as an agglomeration of subsystems, a modeling methodology that explains the interdependencies and effects of the subcomponents may help build understanding of the system.
Measurable process outputs
If the intermediate monitoring variables are used as the sole predictor of the higher level system output, the resulting predictions can only be as good as the best prediction based on the intermediate variables. Several studies have identified an increase in US citizens' heights with an increase in nutrition; using height as a sole surrogate variable to summarize the nutritional variables in prediction of health will limit the results to the information present in the height variable. The intermediate variables can be used, but they should be augmented with other information to produce an improved estimate.
End result variables
In the specific case of manufacturing, process yield is an important measure of the performance of a manufacturing system. Alternate important measures might be time expended or resources used. From the standpoint of the process engineer attempting to understand yield, a focus on defects and their root causes points towards defect rate as a meaningful measure of process performance. If, for a particular defect, the defect rate can be understood and attributed to root causes, then the path toward improving the process or reducing the defect rate can more clearly be found.
In a manufacturing process, risks or failure rates might be more commonly used, but models whose response ranges over $(-\infty, +\infty)$ are more tractable.
Suppose that the estimate of the defect rate due to some submodel $M_i$ is $\hat{y}_i = f_i(\vec{b}_i, \vec{X}_i)$, where $\vec{X}_i$ is a vector of variables in submodel $i$, $\vec{b}_i$ is a vector of parameters, and $f_i()$ is an estimation model. The model $\hat{y}_i$ will be able to predict the system output with some level of accuracy. Suppose further that there are 10 independent subsystems and that each is responsible for 1/20 of the variance observed in the output, with the remaining 1/2 of the variance due to other sources. Each of the 10 models can explain at most 5% of the variation and may seem insignificant, but a combination of the subsystems has the potential to explain 1/2 of the variance in the entire system.
If the entire system is small enough to be modeled as one system, doing so would assuredly produce better prediction results; however, segmentation into subcomponents may aid in the interpretation of the resulting model.
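The variance argument above can be illustrated with a small simulation. This is a sketch on synthetic data: ten independent unit-variance subsystem outputs plus noise of variance 10, so that each subsystem accounts for about 1/20 of the output variance. The sample size and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 20000, 10
subsystem = rng.normal(size=(n, k))               # independent submodel outputs
noise = rng.normal(scale=np.sqrt(10.0), size=n)   # other sources: half the variance
y = subsystem.sum(axis=1) + noise                 # total variance is about 20


def r_squared(X, y):
    """R^2 of an ordinary least squares fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)


single = [r_squared(subsystem[:, [j]], y) for j in range(k)]
combined = r_squared(subsystem, y)
print(np.round(single, 3), round(combined, 3))
```

Each entry of `single` should come out near 0.05 and `combined` near 0.5, matching the 1/20 and 1/2 shares in the text.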
Interdependencies between the subsystems can be reflected in the correlations between the subsystem outputs.
Continuity of x/n at x=n and x=0: for the discrete case:
|
|
If we cannot reject this null hypothesis, then the adjustment is reasonable. Alternately, if the adjustment is unreasonable, the underlying problem is a mixture of two processes: one with the base distribution, and another process which generates many of the x=0 cases.
The submodels XXX
Two internal reports produced at the manufacturer summarize production and testing data: the TEG electrical characterization data through WIPNavigator, and the SIView quality control and process data reports. These two reports represent variables of special interest to the process and design engineers. The internal reports present the variables in these datasets with scatter plots against yield and other output variables, encouraging univariate models. Discussions with process and design engineers help to build the relationships in Figure A.1. Combining these variables with analysis is a first step in creating a model of yield or other process outputs. Each of these variables could be used as a response variable for subsets of the process data. The resulting hierarchical model could use the process data to predict the engineering variables, and then in turn predict the yield response. Although the global model uses the same data, the structure imposed by the constraints reduces the number of parameter estimates required by a multivariate model, and helps manage the short run and small sample sizes.
Specifically, DS contains 27 lot-level variables, and TEG contains 71 measurements, each a vector of lot summary statistics, monitoring Prime Specification Limits (PSLs) thought important by process engineers. QC contains 21710 variables separated into measurement and process data. Quality control and process engineers have selected 31 of the QC measurement variables as key Engineering Specifications (ES), each a vector of several lot summary statistics and wafer samples.
Modeling the PSL variables as outputs of lower level processes, such as the ES, and the ES parameters as outputs from QC measurement and test data divides the manufacturing system into ``natural'' subcomponents. Also, modeling the yield performances in the die sort database as functions of the TEG PSLs may aid in assessing the relative benefits of measuring the PSLs.
Figure A.1 shows a number of modeling efforts that may succeed in relating the low level production data to the high level yield and efficiency data. Each connecting line in Figure A.1 represents a smaller, more manageable model that may provide insight into the wafer production process. For instance, at the bottom of the figure, each of the several dotted lines links some subset of production data to an engineering specification. It may be possible to discover and control the process parameters contributing to the TV nitride measure using partial least squares or other regression techniques to develop lower and intermediate models. By monitoring and controlling the engineering specifications, it should be possible to understand, predict, and control some characteristics in electrical test and die sort.
Extraction of some die sort and engineering test data from a production system has already been accomplished. What remains is extracting production and measurement information, clarifying the hierarchy between elements of these data, building some of the many possible sub-models, and showing that the sub-models can be combined in a hierarchical model.
Referring to Figure A.1, I will build a limited number of models at different levels of the process, relating the outputs of lower level processes to higher level variables of interest. Specifically, data will be cleaned and feature-selected from a database table, aligned with higher level output data, and models will be built using methods such as multiple regression, principal components regression, and partial least squares. The smaller models may be amenable to linear regression, general linear models, or partial least squares, depending on dimensionality and the output variables. The organizational hierarchy of the existing data, shown in Figure A.1, allows for testing of the sub-models. Relating the elements of the production hierarchy this way should demonstrate that although a simple multiple regression on the process parameters to the yield output is not possible, a detailed overall model of the production process may be successful.
At this writing, domains of similar complexity have been explored in the literature, data has been extracted, and preliminary models of die sort variables as a function of in-line testing data have been attempted. I have used SAS summary data, visualization tools, and elicited expert knowledge to manually select relevant feature subsets, and built models of in-line testing data.
Separating the production margin yields into the contributors to margin yield failures allows modeling of the different defect rates separately, and can provide a more detailed explanation of the correlations between electrical test and die sort data.
Preliminary models on limited data indicate that ordinary least squares models of these separate yield failures are stronger than models of the overall yield. XXX
Multivariate models including the entire process are impractical due to data structure problems. By developing a framework for relating smaller models in a hierarchical manner, a complex manufacturing process can be broken down into more manageable smaller models which can be combined to produce an overall model.
Opportunities exist to build a number of sub-models relating process parameters to key engineering parameters, key engineering parameters to electrical test data, and electrical test data to die sort and yield data. Feature selection and feature extraction methods can help to reduce the dimensionality of the sub-models in cases where the sub-models have rank problems. Combining these models to build a hierarchical model could increase understanding of the semiconductor manufacturing process. Although the combined model could represent the entire process, it is hoped that the sub-models and overall model will enhance understanding of the process as compared to an under-determined overall model and the many parallel univariate models. Some of the sub-models may prove useful and provide avenues for improvements of the process.
The extraction of useful information from the large storage space and high dimensional databases of semiconductor manufacturing may provide substantial benefits in yield modeling. Using a methodology which captures the hierarchical nature of the semiconductor manufacturing process can provide a method for managing short-run, data-poor complex processes. By segmenting the model into discrete units, multiple less-complex models can be created from the available data. Ultimately, the development of a hierarchical model can help manage the complexity of the semiconductor process and relate high-level yield information and electrical parameters to specific manufacturing process parameters. The methodology used to produce and combine these models into a comprehensive model of the semiconductor manufacturing process will be applicable to other complex manufacturing processes, and also to other complex systems with n << p.
The decomposition and metamodeling approach is motivated by two needs: modeling systems with more variables than observations, and understanding complex systems and the relationships between their subcomponents.
The information loss is a real concern, and is apparent from a comparison of a partitioned flat model and a hierarchical model. In the flat model, the sub-model covariance matrices along the diagonal are represented in the sub-models, while the between sub-model covariances are represented as the off-diagonal elements in the metamodel covariance matrix. Since the information is used in a flat model, but is not used in this meta-modeling methodology, the loss of information indicates that the methodology should not be used if a flat model is adequate. However, if a flat model is infeasible for some reason, e.g., $p \gg n$, it is not a fair comparison. An argument for using a meta-model even if an alternative flat model exists is that it may help in interpretation: if there is a clear hierarchy to the entire system, intermediate variables may help explain the interaction between, and the contributions of, the subsystems to the process output.
The contamination of the meta-model inputs by the submodel estimation process points to a clear methodological solution: use a separate training set for the sub-model fitting and for the meta-model fitting. If the sub-models and meta-model are trained on the same data set, then from (5.13) we see that $\hat{Y}_1$ is a function of $Y$.
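The separate-training-set remedy can be sketched as follows. The two partitions, the ordinary least squares sub-models, and the even split are assumptions made for the example; the point is only that the meta-model inputs are predictions on observations the sub-models never saw.

```python
import numpy as np


def fit_ols(X, y):
    """Least squares coefficients, intercept first."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict_ols(coef, X):
    return coef[0] + X @ coef[1:]


rng = np.random.default_rng(5)
n = 300
X1, X2 = rng.normal(size=(n, 4)), rng.normal(size=(n, 6))  # two hypothetical partitions
y = X1[:, 0] + 0.5 * X2[:, 1] + rng.normal(size=n)

idx = rng.permutation(n)
sub_idx, meta_idx = idx[: n // 2], idx[n // 2:]

# Sub-models are fit on one half of the observations only.
subs = [fit_ols(X[sub_idx], y[sub_idx]) for X in (X1, X2)]

# Meta-model inputs are sub-model predictions on the other half, so they
# are not functions of the y values used to fit the meta-model.
Z_meta = np.column_stack(
    [predict_ols(c, X[meta_idx]) for c, X in zip(subs, (X1, X2))]
)
meta = fit_ols(Z_meta, y[meta_idx])
print(Z_meta.shape, meta.shape)
```

The cost is that each stage sees only part of the data, which matters when observations are already scarce.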
|
|
| (5.9) |
If a submodel (5.9) is generated for each partition $i$ (5.10), the covariances of the $\hat{Y}_{\cdot}$ estimates from the submodels show the structure of the meta-model.
|
|
Assuming the form of the submodel (5.9) is a linear least squares fit, $\hat{Y}$ on $X$, where each is centered with mean zero, the least squares model gives $\hat{Y} = X\hat{b}$, where $\hat{b} = (X^t X)^{-1} X^t Y$. The covariance of $\hat{Y}$ is then $s^2_{\hat{Y}} = \hat{Y}^t \hat{Y}/(n-1)$, which is:
|
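Substituting the least squares estimate into $\hat{Y}^t \hat{Y}/(n-1)$ gives $Y^t X (X^t X)^{-1} X^t Y/(n-1) = Y^t H Y/(n-1)$, with $H = X (X^t X)^{-1} X^t$. The following numerical check of that identity is a sketch on synthetic centered data; the dimensions and coefficients are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 50, 4
X = rng.normal(size=(n, p))
Y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(size=n)
X = X - X.mean(axis=0)                    # center, as assumed in the text
Y = Y - Y.mean()

b_hat = np.linalg.solve(X.T @ X, X.T @ Y)
Y_hat = X @ b_hat
H = X @ np.linalg.solve(X.T @ X, X.T)     # hat matrix X (X^t X)^{-1} X^t

var_direct = Y_hat @ Y_hat / (n - 1)
var_via_H = Y @ H @ Y / (n - 1)
print(np.isclose(var_direct, var_via_H))
```

The identity relies on $H$ being symmetric and idempotent, so $H^t H = H$.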
When examining the covariance of two models, the development is similar, but the simplification is not as clear:
|
The simplification of the model is to replace the large covariance matrix of the flat model (5.8) with the summary structure of the covariance matrix of the metamodel (5.11).
Assuming linear models, each element of the meta-model covariance matrix (5.11) is generated as:
|
Intermediate variables as the links in a hierarchy: Models of intermediate variables, TEG to DS, have correlation coefficients of $r=0.007$ in some cases. This concept turned out to be flawed because the meta-model will at best predict only as well as the model of the top level parameters as a function of the sub-model outputs. Specifically, using TEG variables as intermediaries limits the performance of the metamodel to that of the model of the process outputs based on actual TEG data. In this case, a correlation coefficient of $r=0.007$ indicates that the metamodel will not be very useful.