BUYERS' GUIDE TO DATABASE MINING & PREDICTIVE MODELING TOOLS
Many database mining and predictive modeling tools are available today. They rest on different technologies, and each product has different features. But what do you really need? When evaluating products, you must separate the facts from the hype and distinguish the features you really need from the bells and whistles you will never use. Before buying any database mining or predictive modeling tool, ask yourself these questions:

Is it an over-hyped modeling technology?

Over-hyped technologies are supported by a large number of vendors, technical writers, and academic researchers who have a vested interest in maintaining the myth that these technologies produce superior models. Don't let yourself be caught up in this type of hype. Ask yourself how a black box can outperform modeling methods that allow you to incorporate expert knowledge into model building. It cannot! Look carefully and ask what facilities the software provides for incorporating your knowledge into the model building process.

Do I need a multipurpose product?

Some products offer several different methods for predictive modeling or database mining. They are, for the most part, libraries of algorithms published in various AI journals, and for a good reason: developing a superior technology for pattern discovery requires a lot of research, money, and dedication. Unless the vendor is called Microsoft, it is not likely to have the resources to pursue several such developments at once. Look for tools that do one thing but do it very well.

Can I gain insight into my data?

The key to good modeling is understanding the model. Experienced analysts know that the most important product of data analysis is not the numbers or even the recommended decisions but improved insight into the data and their relationships. They are not likely to accept modeling results without understanding why one decision is recommended over another and which assumptions are most critical.
Look for clear explanations of the patterns discovered. For example, rules are much easier to read, understand, and use than a complicated equation. Also look for full auditability of models and decisions, which lets you see where the patterns were found and how decisions were made.

Can the software ignore the noise?

Most business and industrial databases contain data that represent anomalous situations, noisy periods, and the like. Neural networks and other parametric techniques cannot recognize this type of data. Ask your vendor whether the software can identify and ignore noise and anomalies during pattern search and model building.

When is my model applicable/not applicable?

When making decisions or predictions from new data, parametric techniques like neural networks or regression will always give you an answer, even in situations where the developed model is no longer applicable. See whether the software can recognize and tell you when the model is not applicable.

Can I evaluate the quality of rules?

A number of products will give you rules from data, many of which are variations of published inductive algorithms such as ID3. But can you trust the rules? Even very good techniques cannot guarantee good results all the time, so you must have a way to evaluate the quality of the rules before using them. Look for indicators in the reports that show how strong the rules are and how well they generalize the data. See if the package provides rule validation features that allow you to evaluate predictive capability quickly. If not, this should tell you something.

Will the system outperform others?

Watch out for this phrase. It is well documented that, given a set of training and test data, most tools will perform very similarly. This is because each tool can be optimized for a given training and test set until it predicts the test data quite well. The question, however, is how well such an optimized system will work with future data.
We cannot know this in advance, but we can learn from experienced analysts and modelers. The accepted wisdom is that models based on the principles that underlie the problem, rather than on fortuitous correlations, are likely to perform well over time. See if you can understand the models' underlying principles, key features, and the relationships between data.

Is a fancy interface really important?

When a vendor develops a software product, there is always the question of where to put money first: into interfaces or into software functions. Newer products that are still evolving often focus on adding new features and methods. Later, when no further improvements are being made in software performance, attention shifts to nicer-looking interfaces, direct connections to other products, and other routine maintenance issues. Therefore, don't let the look of the interface influence your choice too much. Often an attractive interface means that the product has exhausted its potential for growth.

Is data preprocessing required?

Most tools need some data preprocessing to overcome their weaknesses. Preprocessing means extra work for you as well as the risk of introducing errors into the analysis. Ask if the product can handle both symbolic and numeric data; if not, you will have to change your data to fit the tool. If you will be working with incomplete databases, ask whether missing values need to be filled in. Filling in missing values with averages or other substitutes can distort the relationships in the data and give you invalid results.

Does the product overfit?

The answer is simple. If a model includes redundant variables, it may overfit your data. Redundant means that a variable can be removed from the rules without losing any ability to make predictions. Redundancy is dangerous because it may give you false confidence in the accuracy of the model.

Do I need all these features?
Some tools include features such as statistical tests, multidimensional graphs, and plots that you already have in your spreadsheet, database, or dedicated statistical tool. Look for software that focuses on what it should do and does not try to dazzle you with features you can get somewhere else for a fraction of the cost.
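
To make the earlier question about rule quality concrete, here is a minimal sketch of the kind of check it describes. It is not taken from any particular product: the `Rule` class, the `rule_quality` function, and the sample records are all hypothetical, and "coverage" and "accuracy" are just one plausible pair of strength indicators. The point is only that a rule that looks perfect on the data it was found in can look much weaker on records it has not seen.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Record = Dict[str, object]


@dataclass
class Rule:
    """Hypothetical IF-THEN rule: if condition(record) holds, predict outcome."""
    condition: Callable[[Record], bool]
    outcome: str


def rule_quality(rule: Rule, records: List[Record], target: str) -> Dict[str, float]:
    """Coverage: share of records the rule fires on.
    Accuracy: share of those records whose target equals the rule's outcome."""
    fired = [r for r in records if rule.condition(r)]
    correct = [r for r in fired if r[target] == rule.outcome]
    coverage = len(fired) / len(records) if records else 0.0
    accuracy = len(correct) / len(fired) if fired else 0.0
    return {"coverage": coverage, "accuracy": accuracy}


# Made-up records for illustration only.
train = [
    {"income": "high", "late_payments": 0, "risk": "low"},
    {"income": "high", "late_payments": 1, "risk": "low"},
    {"income": "low", "late_payments": 3, "risk": "high"},
    {"income": "low", "late_payments": 0, "risk": "low"},
]
holdout = [
    {"income": "high", "late_payments": 0, "risk": "low"},
    {"income": "high", "late_payments": 4, "risk": "high"},
    {"income": "low", "late_payments": 2, "risk": "high"},
    {"income": "low", "late_payments": 1, "risk": "low"},
]

# Rule under test: IF income is high THEN risk is low.
rule = Rule(condition=lambda r: r["income"] == "high", outcome="low")

train_q = rule_quality(rule, train, "risk")      # quality on the data the rule came from
holdout_q = rule_quality(rule, holdout, "risk")  # quality on unseen records

print("train:  ", train_q)
print("holdout:", holdout_q)
```

In this toy data the rule is right every time it fires on the training records but only half the time on the held-out ones, which is the sort of gap a rule validation report should make visible.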
REDUCT & Lobbe Technologies Inc.
P.O. Box 800, 186 - 8120 No. 2 Road, Richmond, BC, Canada V7C 5J8
ph: (604) 275-3711, fax: (604) 275-3711, email: dispatch@reduct.com