
Chapter 7. Data mining: process, toolkits, and standards

This chapter covers
  • A brief overview of the data mining process
  • Introduction to key mining algorithms
  • WEKA, the open source data mining software
  • JDM, the Java Data Mining standard

The data mining process enables us to find gems of information by analyzing data. In this chapter, you’ll be introduced to the field of data mining. The various data mining algorithms, tools, and data mining jargon can be overwhelming. This chapter provides a brief overview and walks you through the process involved in building useful models. Implementing algorithms takes time and expertise. Fortunately, there are free open source data mining frameworks that we can leverage. We use WEKA—Waikato Environment for Knowledge Analysis—a Java-based open source toolkit that’s widely used in the data mining community. We look at the core packages of WEKA and work through a simple example to show how WEKA can be used for learning. We really don’t want our implementation to be specific to WEKA. Fortunately, two initiatives through the Java Community Process—JSR 73 and JSR 247—provide a standard API for data mining. This API is known as Java Data Mining (JDM). We discuss JDM in the last section of this chapter and review its core components. We take an even deeper look at JDM in chapters 9 and 10, when we discuss clustering and predictive models.

7.1. Core concepts of data mining

Data mining is the automated process of analyzing data to discover patterns and build predictive models. Data mining has a strong theoretical foundation and draws from many fields, including mathematics, statistics, and machine learning. Machine learning is a branch of artificial intelligence that deals with developing algorithms that machines can use to learn the patterns in data in an automated manner. Data mining is different from data analysis, which typically deals with fitting data to already-known models. Querying for data and analyzing the data for summaries and trends—commonly known as reporting and OLAP (online analytic processing)—are common forms of data analysis. Data mining, on the other hand, deals with the discovery of previously unknown patterns or models by analyzing the data. These patterns and models provide new insight into the data being analyzed. Reporting, OLAP, and data mining all fall in the general area of business intelligence.

In this section, we look at some of the core concepts associated with the process of data mining. We first discuss the three forms of attributes: numerical, ordinal, and nominal. Next, we look at supervised and unsupervised learning, some of the key learning algorithms, and the data mining process.

7.1.1. Attributes

A learning algorithm needs data in order to learn and find patterns; this data can be in the form of examples, where each example consists of a set of attributes or dimensions. Each attribute should be independent of other attributes; it shouldn’t be possible to compute the value of an attribute based on another attribute’s value. Depending on whether the attributes can be ordered and the kind of values they can take, we can categorize attributes into the following three forms:

  • Numerical values have a real number associated with them. Two numerical values can be compared. Further, numerical values can be continuous (for example, a person’s age) or can take only discrete values, such as the number of images uploaded by a user. Since every numerical value can be compared to another numerical value, there is ordering associated with numerical attributes. Furthermore, the difference in magnitude between two numerical values provides a measure of closeness—for example, the number 5 is closer to 3 than the number 10.
  • Ordinal values are discrete values that have an ordering associated with them, but no measure of the distance between values. For example, {small, medium, large} may be used to characterize the size of an article. Here, there’s no absolute measurement of how much smaller small is compared to medium; just that medium is larger than small, while large is larger than medium.
  • Nominal values consist of discrete values that are in no particular order, also sometimes called categorical. For example, the color of a person’s eyes could be {blue, green, black, brown}. Here, a linear distance measure isn’t helpful in getting a measure of closeness between two points—for example, there’s no way to state that blue is closer to green than black in the previous example.

There are a number of algorithms that work either with continuous values or with nominal values. It’s possible to convert a continuous variable into a discrete variable and vice versa.

Continuous numeric values can be discretized by creating appropriate bins. For example, the number of times a user has logged in to the application in a week can be converted into the discrete values small, medium, and large. One such binning criterion might be this: if the number of logins is one, then it falls in the small category; two to five logins corresponds to the medium category; and greater than five amounts to a large number of logins.
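As a concrete sketch of this binning rule, the following small class maps a login count to the three categories. The category names and boundaries are the ones assumed in the text; the class and method names are purely illustrative.

import java.util.Arrays;

// A minimal sketch of the binning rule described above: 1 login = small,
// 2 to 5 logins = medium, more than 5 = large.
public class LoginBinner {
  public static String bin(int numberOfLogins) {
    if (numberOfLogins <= 1) {
      return "small";      // one login (or none) in the week
    } else if (numberOfLogins <= 5) {
      return "medium";     // two to five logins
    }
    return "large";        // more than five logins
  }

  public static void main(String[] args) {
    System.out.println(Arrays.asList(bin(1), bin(4), bin(9))); // [small, medium, large]
  }
}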

Discrete variables can be converted to numerical variables in the following manner. First, let’s consider the example of a person’s eye color. There are four values associated with this nominal variable: {blue, green, black, brown}. This can be transformed into four attributes: blue, green, black, and brown. Each of these attributes has a value of 1 when the color of the person’s eyes matches the attribute; otherwise the value is 0.

Ordinal variables, which are discrete and have an ordering associated with them, can be treated differently. Let’s consider an attribute that takes three values: small, medium, and large. This can be converted into three attributes: small, medium, and large. When the variable takes the value large, all three of these attributes take the value 1; medium corresponds to setting the small and medium attributes to 1; and small corresponds to setting only the small attribute to 1. Table 7.1 summarizes the common terms used to describe the different kinds of attributes.
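Before moving to that summary, here’s a small illustrative sketch of the two conversions just described: one indicator attribute per nominal value, and the cumulative encoding for ordinal values. The attribute names mirror the examples in the text; the helper class and methods are purely illustrative.

import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch of converting discrete attributes to numerical ones.
public class DiscreteToNumeric {

  // Nominal: one indicator attribute per value; only the matching one is 1.
  public static Map<String, Integer> encodeEyeColor(String eyeColor) {
    String[] values = {"blue", "green", "black", "brown"};
    Map<String, Integer> encoded = new LinkedHashMap<String, Integer>();
    for (String value : values) {
      encoded.put(value, value.equals(eyeColor) ? 1 : 0);
    }
    return encoded;
  }

  // Ordinal: cumulative encoding; large sets all three attributes,
  // medium sets small and medium, small sets only small.
  public static Map<String, Integer> encodeSize(String size) {
    String[] ordered = {"small", "medium", "large"};
    Map<String, Integer> encoded = new LinkedHashMap<String, Integer>();
    boolean on = true;
    for (String value : ordered) {
      encoded.put(value, on ? 1 : 0);
      if (value.equals(size)) {
        on = false;   // everything above the observed value stays 0
      }
    }
    return encoded;
  }

  public static void main(String[] args) {
    System.out.println(encodeEyeColor("green")); // {blue=0, green=1, black=0, brown=0}
    System.out.println(encodeSize("medium"));    // {small=1, medium=1, large=0}
  }
}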

Table 7.1. Common terms used to describe attributes

Attribute type | Description | Discrete/continuous | Example
Continuous | Takes real values | Continuous | The amount of time spent by a user on the site
Ordinal | There is ordering in the fixed set of values that the attribute can take | Discrete or continuous | Length of session expressed as {small, medium, large}
Nominal | There is no ordering related to the values taken from the fixed set by the attribute | Discrete | Gender of a person {male, female}

Algorithms that discover relationships between different attributes in a dataset are known as association rule algorithms, while algorithms that analyze the importance of an attribute in relation to predicting the value of another attribute or in clustering data are known as attribute importance algorithms. Based on how data is analyzed, algorithms can be classified into supervised and unsupervised learning, which we look at next.

7.1.2. Supervised and unsupervised learning

In supervised learning, we have a training dataset with a set of instances or examples for which the predicted value is known. Each example consists of input attributes and a predicted attribute, as shown in figure 7.1. The aim of the algorithm is to build a mathematical model that can predict the output attribute value given a set of input attribute values. Decision trees, neural networks, regression, Bayesian belief networks, and so on are all examples of predictive models. The predictive models built by some algorithms, such as decision trees, belief networks, and rule induction, are easier to understand than those of other algorithms, such as neural networks and regression. The accuracy of a predictive model is measured by how well the model does on previously unseen data. When the predicted attribute—the target—is a categorical attribute, the predictive model is also known as a classifier and the problem is one of classification. For example, a predictive model might classify a blog entry into appropriate categories. When the output attribute is continuous, the model is also known as a regressor and the problem is one of regression.

Figure 7.1. A predictive model makes a prediction based on the values for the input attributes.

In unsupervised learning, there’s no predicted value to be learned. The algorithm analyzes the multidimensional data to form clusters—groups of similar points. For example, figure 7.2 shows two clusters formed by analyzing two-dimensional data. K-means clustering, hierarchical clustering, and density-based clustering are examples of commonly used clustering algorithms. Unsupervised learning is good at analyzing the data and discovering patterns in an automated manner.

Figure 7.2. Two clusters in a two-dimensional attribute space found by analyzing the proximity of the data points

Now that we’ve classified the various learning algorithms into supervised and unsupervised learning, let’s briefly look at some commonly used learning algorithms. We revisit a few of these algorithms in greater detail in chapters 9 and 10.

7.1.3. Key learning algorithms

In this section, we provide a high-level overview of some of the commonly used learning algorithms: decision trees, clustering, regression, neural networks (MLP and RBF), SVM, and Bayesian algorithms.

A decision tree is one of the most commonly used classifiers and deals only with nominal attributes. Let’s go through an example to understand the basic concepts. Let’s assume that we want to predict whether a user will like a recently introduced new feature in our application. Assume that we’ve created a dataset consisting of two attributes: user gender and number of logins. The first step is to convert all the attributes into nominal attributes. By the process of binning, the number of logins gets converted into three values: {small, medium, large}. The user’s gender is already nominal and can take two values: {male, female}. Let’s assume that we’ve created a dataset where each instance is a user with these two attributes, along with whether the user liked the new feature. Figure 7.3 shows an example decision tree, which consists of nodes and links. A node in the tree corresponds to an attribute value evaluation. The root node in this example corresponds to the attribute number of logins. The arcs correspond to the categorical values associated with the attribute; there are three links from the root node, corresponding to the three nominal values {small, medium, large}.

Figure 7.3. An example decision tree showing two attributes

It’s easy to translate decision trees into rules to understand the logic discovered. For example, the rules corresponding to the branch where the number of logins is small are

If {number of logins = “small”} and {gender = male} then ....

If {number of logins = “small”} and {gender = female} then ....

Among clustering algorithms, k-means is perhaps one of the most well-known. The algorithm is typically seeded with k randomly selected points as cluster centers, where k is the predefined number of clusters. Each example is then associated with the cluster whose center is closest to that point. At the end of each iteration, the means of the k clusters are recomputed by looking at all the points associated with each cluster. This process of learning continues until the examples don’t move between clusters or a maximum number of iterations is reached.
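The following is a minimal, self-contained sketch of the k-means loop just described, for one-dimensional data. It isn’t WEKA’s implementation (chapter 9 covers clustering in detail), and the seeding and stopping criteria are deliberately simplified.

import java.util.Arrays;
import java.util.Random;

// A bare-bones k-means sketch for one-dimensional points, illustrating the
// assign-then-recompute loop described above. Not an industrial implementation.
public class SimpleKMeans1D {
  public static int[] cluster(double[] points, int k, int maxIterations) {
    Random random = new Random(42);
    double[] centers = new double[k];
    for (int i = 0; i < k; i++) {                 // seed with k randomly chosen points
      centers[i] = points[random.nextInt(points.length)];
    }
    int[] assignment = new int[points.length];
    for (int iter = 0; iter < maxIterations; iter++) {
      boolean changed = false;
      for (int p = 0; p < points.length; p++) {   // assign each point to the closest center
        int best = 0;
        for (int c = 1; c < k; c++) {
          if (Math.abs(points[p] - centers[c]) < Math.abs(points[p] - centers[best])) {
            best = c;
          }
        }
        if (assignment[p] != best) {
          assignment[p] = best;
          changed = true;
        }
      }
      double[] sum = new double[k];
      int[] count = new int[k];
      for (int p = 0; p < points.length; p++) {   // recompute each center as its cluster mean
        sum[assignment[p]] += points[p];
        count[assignment[p]]++;
      }
      for (int c = 0; c < k; c++) {
        if (count[c] > 0) {
          centers[c] = sum[c] / count[c];
        }
      }
      if (!changed) {
        break;                                    // no example moved between clusters
      }
    }
    return assignment;
  }

  public static void main(String[] args) {
    double[] weeklyLogins = {1, 2, 2, 8, 9, 10};
    System.out.println(Arrays.toString(cluster(weeklyLogins, 2, 100)));
  }
}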

Hierarchical clustering is another popular clustering algorithm. In this algorithm, each data point starts as its own cluster. Next, the two clusters that are most similar are combined into a parent node. This process is repeated until we’re left with a single cluster containing all the points.

Density-based clustering algorithms try to find high-density areas that are separated by low-density areas. These algorithms automatically determine the number of clusters by omitting low-density regions, which are treated as noise. We look at clustering algorithms in greater detail in chapter 9.

Given two points in a two-dimensional space, it’s fairly straightforward to compute the two constants—the slope and intercept—of the line that joins them. Now extend this concept to finding the line, or other higher-dimensional function, that best fits multiple points in a multi-dimensional space. Regression-based algorithms represent the data in matrix form and transform the matrix to compute the required parameters. Regression-based techniques require numerical attributes to create the predictive models. These algorithms aim to minimize the sum of the squared error between the predicted value and the actual value for all the cases in the training dataset.
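As a tiny, concrete illustration of minimizing squared error, the following sketch fits a line y = a + b*x to a handful of points using the closed-form least-squares solution; real regression packages solve the general multi-dimensional problem in matrix form. The sample data is illustrative.

// Fits y = a + b*x by ordinary least squares; a toy illustration of the
// squared-error minimization described above.
public class SimpleLinearRegression {
  public static double[] fit(double[] x, double[] y) {
    double meanX = 0, meanY = 0;
    for (int i = 0; i < x.length; i++) {
      meanX += x[i] / x.length;
      meanY += y[i] / x.length;
    }
    double numerator = 0, denominator = 0;
    for (int i = 0; i < x.length; i++) {
      numerator += (x[i] - meanX) * (y[i] - meanY);
      denominator += (x[i] - meanX) * (x[i] - meanX);
    }
    double b = numerator / denominator;   // slope
    double a = meanY - b * meanX;         // intercept
    return new double[] {a, b};
  }

  public static void main(String[] args) {
    double[] age = {20, 30, 40, 35};
    double[] logins = {5, 2, 3, 4};
    double[] coefficients = fit(age, logins);
    System.out.println("intercept=" + coefficients[0] + " slope=" + coefficients[1]);
  }
}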

Multi-layer perceptron (MLP) and radial basis functions (RBF) are two of the most commonly used neural networks. Neural networks are useful both as predictive models and as classifiers.

An MLP consists of a number of layers, beginning with the input layer, as shown in figure 7.4. The number of input nodes corresponds to the number of input attributes. Depending on the nature of the transformation function used by the nodes, the input values may be scaled between –1 and 1. Each link in the network carries a weight by which the output from a node is multiplied. In a three-layer network, the second set of nodes is known as the hidden nodes. The input to a node is the sum of the outputs from the nodes in the previous layer, each multiplied by the weight of the corresponding link. The third layer is the output layer; it predicts the attribute of interest.

Figure 7.4. A multi-layer perceptron where the input from one layer feeds into the next layer

Building an MLP predictive model consists of estimating the weights associated with each of the links. Typically, a gradient descent algorithm is used to learn the weights associated with an MLP; the learning procedure is known as back propagation. With this procedure, there’s no guarantee of finding a global minimum, and the learning process may be enhanced to run in conjunction with some optimization methodologies, such as simulated annealing or genetic algorithms.

In an RBF, the data is first clustered into k clusters using the k-means clustering algorithm. Each cluster corresponds to a node in the network, whose output depends on the proximity of the input to the center of that cluster. The outputs from this layer are combined using weights to form the network’s output. Learning the weights associated with these links is a linear regression problem.

A relatively new algorithm that’s becoming popular for classification problems is the support vector machine (SVM) algorithm. Consider a two-dimensional space with a large number of points. There are many lines that can be used to divide the points into two segments; let these be known as separating lines. Now define the margin as the distance between a separating line and a parallel line that passes through the point closest to it. SVM selects the line that has the maximum margin associated with it. Points that this second, parallel line passes through are known as support vectors. SVM generalizes this concept across multiple dimensions and has been observed to work well. The output variable can be discrete, containing two values, or continuous.

Algorithms based on probability theory are commonly known as Bayesian algorithms. One simple but good classifier is the Naïve Bayes’ classifier. This algorithm assumes independence between the input attributes and makes a prediction based on probabilities estimated from the training data. Bayesian belief networks (BBN), also known as probabilistic belief networks, are another approach to estimating probabilities using Bayes’ theorem. BBNs are directed acyclic graphs (DAG), where a link signifies a conditional distribution function between the parent and the child node.

Table 7.2 summarizes the different kinds of algorithms used for mining data, along with the kinds of input and output attributes.

Table 7.2. Summary of different kinds of data mining algorithms

Type of algorithm | Description | Type of input | Type of output | Example
Regression | Builds a predictive model that predicts the output variable based on the values of the inputs | Continuous | Continuous | Regression, neural networks
Classification | Predicts the output value for the discrete output variable | Discrete[a] | Discrete | Decision tree, Naïve Bayes’
Clustering | Creates clusters in data to find patterns | Discrete or continuous | None | k-means, hierarchical clustering, density-based algorithms
Attribute importance | Determines the importance of attributes with respect to predicting the output attribute | Discrete or continuous | None | Minimum description length, decision tree pruning
Association rules | Finds interesting relationships in data by looking at co-occurring items | Discrete | None | Association rules, Apriori

a Many tool sets allow the use of both continuous and discrete variables as input for regression and classification.

There’s a lot more involved in understanding and implementing these algorithms. We take a more detailed look at a few of them in later chapters. Fortunately, rather than having to reimplement all these algorithms, there are a few open source data mining platforms that we can leverage, one being WEKA. You can find the list of data mining tool vendors at http://www.dmoz.org//Computers/Software/Databases/Data_Mining/Tool_Vendors/.

With this brief overview on the different kinds of learning algorithms, we’re now ready to look at the process of analyzing data to discover intelligence.

7.1.4. The mining process

Analyzing data to build a predictive model or discover clusters consists of the following six steps. Typically, the process is iterative, where you may repeat these steps in the quest for a better model or to refine an existing one:

  1. Modeling and selecting the attributes— We first need to understand what we’re looking for. Is our aim to build a predictive model or find patterns in the data? Based on the needs, identify the attributes that are available for use in the analysis.
  2. Creating the learning dataset— We need a set of examples—a dataset—to be used by the algorithm. The dataset is typically broken up into two datasets: the training dataset, which usually contains 90 percent of the data and is used for creating the predictive model, and the testing dataset, which is used for evaluating the quality of the predictive model. It may be computationally infeasible or too expensive to analyze all entries in a large dataset. In this case, random sampling is used to create a smaller dataset that’s expected to be representative of the complete dataset.
  3. Normalizing and cleaning the data— Instances that have missing values for any of the attributes are removed from the dataset. Further, each attribute may be normalized to the [–1, 1] or [0, 1] scale (see the sketch after this list). Typically, a distance measure is used, and normalizing the data ensures that an attribute doesn’t skew the distance measurement because of a bigger range of values. For example, an attribute that takes a value between 0 and 100 will show a bigger distance measure for two points than an attribute that takes values between 0 and 1.
  4. Analyzing the data— The training dataset is analyzed to either build a predictive model or discover clusters. A common problem to avoid in analyzing the data is overfitting the data—the model memorizes the training data, leading to poor predictive capabilities. The algorithm may be intelligent enough to prune the number of attributes being used based on their effectiveness in predicting the attribute of interest.
  5. Evaluating the quality of the predictive model— The quality of the predictive model is evaluated using the testing dataset.
  6. Embedding the predictive model— Once a predictive model has been built, it can be embedded in your application to make predictions.
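Step 3 above mentions normalizing each attribute to a common range. Here’s a minimal sketch of min-max normalization to [0, 1]; the minimum and maximum are assumed to come from the training data, and the class name is illustrative.

// Min-max normalization of a single attribute to the [0, 1] range.
// The min and max values are assumed to come from the training dataset.
public class MinMaxNormalizer {
  private final double min;
  private final double max;

  public MinMaxNormalizer(double min, double max) {
    this.min = min;
    this.max = max;
  }

  public double normalize(double value) {
    if (max == min) {
      return 0.0;            // degenerate attribute: every value is the same
    }
    return (value - min) / (max - min);
  }

  public static void main(String[] args) {
    MinMaxNormalizer ageNormalizer = new MinMaxNormalizer(0, 100);
    System.out.println(ageNormalizer.normalize(35)); // 0.35
  }
}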

A common approach to avoid overfitting and select the best predictive model is to use the k-fold cross validation strategy. This approach is typically used when the number of examples available for learning and testing is small. In this case, the dataset is broken randomly into k subsets. Then k learning runs are executed; in each run, one of the k subsets is used for validating the predictive model while the remaining (k–1) subsets are used for the learning process. Typically, the error associated with a predictive model is the average error on the test data across the k runs. A number of predictive models may be created using different algorithms or different settings, and the one with the least average error across the k runs is selected.
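With WEKA, which we introduce in the next section, k-fold cross validation can be run through the Evaluation class. The sketch below assumes an Instances object whose class attribute has already been set; the choice of classifier is just an example.

import java.util.Random;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.RBFNetwork;
import weka.core.Instances;

// Sketch of 10-fold cross validation using WEKA's Evaluation class.
// Assumes 'data' is an Instances object with its class attribute set.
public class CrossValidationSketch {
  public static void report(Instances data) throws Exception {
    Classifier classifier = new RBFNetwork();          // any classifier will do here
    Evaluation evaluation = new Evaluation(data);
    evaluation.crossValidateModel(classifier, data, 10, new Random(1));
    System.out.println(evaluation.toSummaryString());  // average error across the 10 runs
  }
}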

With this background on the data mining process and the key learning algorithms, it’s helpful to work through an example to see the learning process in action. Writing learning algorithms can be complex and tedious; fortunately, a lot of work has been done in the open source community for us. We leverage one such open source data mining framework: WEKA.

7.2. Using an open source data mining framework: WEKA

The Waikato Environment for Knowledge Analysis, commonly known as WEKA, is one of the most popular suites of data mining algorithms that have been written in Java. WEKA is available under the GNU General Public License.[1] WEKA was developed at the University of Waikato in New Zealand. Early work on WEKA began in 1993; work on the Java version started in 1997. In September 2006, Pentaho,[2] a developer of open source business intelligence software, bought WEKA. WEKA contains tools for data preprocessing, classification, regression, clustering, association rules, and visualization. It also has a GUI application through which you can apply the various data mining algorithms to datasets. We are more interested in its Java API and will use that directly.

1http://www.gnu.org/copyleft/gpl.html

2http://www.pentaho.com/

Over the years, the WEKA APIs have been leveraged to build additional tools and packages—you can see a list of these at http://weka.sourceforge.net/wiki/index.php/Related_Projects. RapidMiner,[3] formerly known as Yet Another Learning Environment (YALE), is one such project that leverages WEKA to provide an environment for data mining. There’s also an excellent book on data mining and WEKA written by two of the professors[4] associated with WEKA.

3http://rapid-i.com/content/blogcategory/10/69/

4 Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques, 2nd Edition. Morgan Kaufmann, San Francisco, 2005.

In this section, we familiarize ourselves with the WEKA learning platform. We begin with a brief tutorial in which we use the WEKA learning environment to orient ourselves with the WEKA Java packages and key classes. Next, we write a simple example that invokes the WEKA API. At the end of this section, you should be familiar with WEKA, its GUI application and Java API, and what’s involved in embedding the platform. We leverage WEKA later in chapters 9, 10, and 12.

7.2.1. Using the WEKA application: a step-by-step tutorial

We begin with a brief tutorial to guide us through installing and running WEKA on Windows.

Installing Weka

First, we need to download the WEKA package from http://www.cs.waikato.ac.nz/ml/weka/. We use the developer version 3.5.6. Note that the WEKA book uses an older version, 3.4. If you don’t have Java, download the self-extracting executable that includes Java VM 5.0 (weka-3-5-6jre.exe; 31,661,572 bytes); otherwise download the WEKA version (weka-3-5-6.exe; 15,665,809 bytes) that doesn’t have the JVM. Install the software using the default values. I installed mine at C:\Program Files\Weka-3-5. If you open this directory on your machine, it should look similar to figure 7.5.

Figure 7.5. The directory structure and some of the files for WEKA

A good place to start is to open the documentation.html page that’s available in the Weka-3-5 directory, a screenshot of which is shown in figure 7.6.

Figure 7.6. WEKA documentation that’s available in the install

The top three links on the page are tutorials for the three environments for which WEKA has a GUI application. These components are

  • Explorer— An environment for exploring data
  • Experimenter— An environment for performing experiments and comparing different learning algorithms
  • KnowledgeFlow— An environment that’s similar to Explorer but supports drag-and-drop functionality to carry out the learning process

The Package Documentation link points to the JavaDoc for the WEKA classes, which we explore later in this section. Next, it’s worthwhile to spend a few minutes exploring the WEKA GUI application. Start the GUI application by going to the Start menu of your Windows computer and selecting the Weka 3.5 (with console) option in the Weka 3.5.6 program menu. You should see a window similar to the one shown in figure 7.7. Before we jump into the API, you may want to spend a few minutes playing with the Explorer application. The Explorer Guide link shown in figure 7.6 is a good reference to go through.

Figure 7.7. WEKA GUI with options to start one of four applications

Alternatively, here’s a simple five-step exercise that you may find useful to explore the application:

1.  Select the Explorer option under the Applications menu to start Explorer.

2.  Click on Open File and go to the data directory. Select the iris.arff dataset.

3.  This should open up a window similar to the one in figure 7.8. Note that there are five attributes shown in the Attributes window, and you can find out the details about each attribute in the two windows on the right side.

Figure 7.8. The five attributes of the iris.arff dataset, with details about the sepallength attribute

4.  We now convert the sepallength variable into a discrete variable: select the Choose button and then select the Discretize filter, as shown in figure 7.9. This variable is now a nominal attribute that can take three values.

Figure 7.9. Converting a continuous variable into a discrete variable using filters in WEKA

5.  Click the Visualize All option to see a visual representation of the five attributes.

Similarly, spend some time exploring the capabilities of the Experimenter and the KnowledgeFlow application.

With that brief overview of the WEKA application, we’re now ready to explore the WEKA APIs.

7.2.2. Understanding the WEKA APIs

In this section, we explore the WEKA APIs. A good introduction to the WEKA Java APIs is the tutorial.pdf file, which is part of the WEKA installation in the C:\Program Files\Weka-3-5 directory.

If you use an IDE such as Eclipse, you may also want to add the WEKA source available in the weka-src.jar file to your Eclipse project.[5] The compiled Java classes are in weka.jar; you’ll want to add this jar file to your project’s lib directory and classpath so that Java can find it at runtime.

5 Unzip this jar file in the src/java directory and include the files in your project.

At this stage, it’ll be useful to look at the JavaDoc for the APIs. Click on the Package Documentation link as shown in figure 7.6. Table 7.3 contains six of the most important packages that we use.

Table 7.3. The key packages in WEKA

Package | Description
weka.core | Core package containing common components used by other packages: classes for modeling attributes, datasets, converters, matrix manipulation, text parsing, tree representation, and XML.
weka.classifiers | Contains implementations of the various classification algorithms, including algorithms for numerical prediction.
weka.clusterers | Contains implementations of the various clustering algorithms.
weka.attributeSelection | Algorithms associated with selecting attributes.
weka.associations | Algorithms related to finding associations.
weka.filters | Classes related to applying filters on the dataset; for example, to remove an attribute from analysis.

The weka.core package contains the representation for a dataset. As shown in figure 7.10, each dataset is represented by the class Instances, which contains a list of examples, represented by the class Instance. Each Instance is composed of a number of attributes. WEKA uses its own implementation for a vector, FastVector.

Figure 7.10. A dataset in WEKA is represented by instances.

To load data from various sources, WEKA defines the Loader interface, as shown in figure 7.11. This has the following method to create a dataset:

Instances getDataSet() throws IOException;
Figure 7.11. The Loader interface, used to load data from various sources and create a dataset

There are a number of Loader implementations, including ones that read data from a CSV file and from a database.
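As a concrete example, the following sketch (which follows the standard WEKA usage pattern) uses the CSVLoader implementation to build an Instances object from a comma-separated file; the file name users.csv is hypothetical, and treating the last column as the class attribute is an assumption for illustration.

import java.io.File;

import weka.core.Instances;
import weka.core.converters.CSVLoader;

// Sketch: load a dataset from a CSV file using one of WEKA's Loader implementations.
public class CsvLoaderExample {
  public static Instances load() throws Exception {
    CSVLoader loader = new CSVLoader();
    loader.setSource(new File("users.csv"));       // hypothetical file name
    Instances data = loader.getDataSet();
    data.setClassIndex(data.numAttributes() - 1);  // treat the last column as the class
    return data;
  }
}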

The weka.classifiers package contains implementations of classification and prediction algorithms. As shown in figure 7.12, a Classifier learns its model using Instances. It can then classify an Instance. The WEKA library contains a large set of classification and prediction algorithms, some of which are shown in figure 7.12.

Figure 7.12. Classifier uses instances to build the model and classifies an instance.

Similarly, the weka.clusterer package contains the implementation for the various clustering algorithms. Each Clusterer creates clusters from the Instances and then associates an Instance with the appropriate cluster, as shown in figure 7.13. The figure also shows some of the clustering algorithms that are available.

Figure 7.13. Clusterer uses instances to build the model and associate an instance with the appropriate cluster.

The weka.associations package contains two algorithms, Apriori and PredictiveApriori, that are available to learn association rules, as shown in figure 7.14. All association-learning algorithms implement the Associator interface. CARuleMiner is an optional interface for those schemes that can produce class association rules.

Figure 7.14. Association-learning algorithms available in WEKA

So far, you should have a good understanding of the core packages of WEKA. You’re encouraged to explore the other packages mentioned in table 7.3: weka.attributeSelection and weka.filters. Now it’s time to write some Java code to demonstrate the learning process. We create a dataset, build a model, and then evaluate it using the WEKA API.

7.2.3. Using the WEKA APIs via an example

In this section, we use the WEKA API to create a dataset and build a predictive model to predict values. There’s also a tutorial on the WEKA wiki.[6] We write the code to solve the following learning problem.

6http://weka.sourceforge.net/wiki/index.php/Programmatic_Use and http://weka.sourceforge.net/wiki/index.php/Use_Weka_in_your_Java_code

Imagine that we want to predict the number of times a user is expected to log in to our application within a week. Unfortunately, we’re still a relatively new company and we have data for only four users, which is shown in table 7.4. We’ve captured two attributes during the registration process about the user: age and gender. Age is a continuous attribute, while gender is a nominal attribute. We know that the learning dataset is really small, and potentially we may not even find a good predictor, but we’re keen to try out the WEKA mining APIs, so we go ahead and build the predictive model in preparation for future better times.

Table 7.4. The data associated with the WEKA API tutorial

User | Age | Gender | Number of logins
John | 20 | male | 5
Jane | 30 | female | 2
Ed | 40 | male | 3
Amy | 35 | female | 4

For our example, we do the following five steps:

1.  Create the attributes.

2.  Create the dataset for learning.

3.  Build the predictive model.

4.  Evaluate the quality of the model built.

5.  Predict the number of logins for a new user.

We implement a class WEKATutorial, which follows these five steps. The code for this class is shown in listing 7.1.

Listing 7.1. Implementation of the WEKATutorial

The main method for our tutorial simply invokes the method executeWekaTutorial(), which consists of invoking five methods that execute each of the five steps. Let’s look at the first step, createAttributes(), the code for which is shown in listing 7.2.

Listing 7.2. Implementation of the method to create attributes

Remember, as shown in figure 7.10, WEKA uses its own implementation, FastVector, for creating a list of objects. There are two ways to create an Attribute.

For continuous attributes, such as age, we need to simply pass in the name of the attribute in the constructor:

Attribute ageAttribute = new Attribute("age");

For nominal attributes, we first need to create a FastVector that contains the various values that the attribute can take. In the case of attribute gender, we do this with the following code:

FastVector genderAttributeValues = new FastVector(2);
genderAttributeValues.addElement("male");
genderAttributeValues.addElement("female");

The constructor for nominal attributes takes in the name of the attribute and a FastVector containing the various values that this attribute can take. Therefore, we create the genderAttribute as follows:

Attribute genderAttribute = new Attribute("gender", genderAttributeValues);
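Putting these fragments together, the createAttributes() method might look roughly like the following sketch. The numberOfLogins attribute name is an assumption (the text doesn’t show it), so the book’s actual listing 7.2 may differ.

import weka.core.Attribute;
import weka.core.FastVector;

// Sketch of step 1: create the age, gender, and number-of-logins attributes.
public class AttributeCreationSketch {

  public static FastVector createAttributes() {
    Attribute ageAttribute = new Attribute("age");                   // continuous attribute

    FastVector genderAttributeValues = new FastVector(2);            // nominal attribute values
    genderAttributeValues.addElement("male");
    genderAttributeValues.addElement("female");
    Attribute genderAttribute = new Attribute("gender", genderAttributeValues);

    Attribute numLoginsAttribute = new Attribute("numberOfLogins");  // the value to predict

    FastVector allAttributes = new FastVector(3);
    allAttributes.addElement(ageAttribute);
    allAttributes.addElement(genderAttribute);
    allAttributes.addElement(numLoginsAttribute);
    return allAttributes;
  }
}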

Next, we need to create the dataset for the data contained in table 7.4. A dataset is represented by Instances, which is composed of a number of Instance objects. Each Instance has values associated with each of the attributes. The code for creating the Instances is shown in listing 7.3.

Listing 7.3. Implementation of the method createLearningDataSet

To create the dataset for our example, we need to create an instance of Instances:

Instances trainingDataSet = new Instances("wekaTutorial",
      allAttributes, 4);

The constructor takes three parameters: the name for the dataset, the FastVector of attributes, and the expected size for the dataset. The method createInstance creates an instance of Instance. Note that there needs to be a dataset associated with each Instance:

instance.setDataset(associatedDataSet);
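Putting these pieces together, here’s a rough sketch of what createLearningDataSet() and createInstance() might look like for the four users in table 7.4. It reuses the attribute names from the previous sketch (numberOfLogins is an assumed name), so the book’s actual listing 7.3 may differ.

import weka.core.FastVector;
import weka.core.Instance;
import weka.core.Instances;

// Sketch of step 2: build the learning dataset for the four users in table 7.4.
public class LearningDataSetSketch {

  public static Instances createLearningDataSet(FastVector allAttributes) {
    Instances trainingDataSet = new Instances("wekaTutorial", allAttributes, 4);
    trainingDataSet.setClassIndex(2);   // numberOfLogins is the attribute to predict

    trainingDataSet.add(createInstance(trainingDataSet, 20, "male", 5));   // John
    trainingDataSet.add(createInstance(trainingDataSet, 30, "female", 2)); // Jane
    trainingDataSet.add(createInstance(trainingDataSet, 40, "male", 3));   // Ed
    trainingDataSet.add(createInstance(trainingDataSet, 35, "female", 4)); // Amy
    return trainingDataSet;
  }

  private static Instance createInstance(Instances associatedDataSet,
                                         double age, String gender, double numLogins) {
    Instance instance = new Instance(3);
    instance.setDataset(associatedDataSet);   // each instance needs an associated dataset
    instance.setValue(associatedDataSet.attribute("age"), age);
    instance.setValue(associatedDataSet.attribute("gender"), gender);
    instance.setValue(associatedDataSet.attribute("numberOfLogins"), numLogins);
    return instance;
  }
}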

Now that we’ve created the learning dataset, we’re ready to create a predictive model. There are a variety of predictive models that we can use; for this example we use the radial basis function (RBF) neural network. The code for creating the predictive model is shown in listing 7.4.

Listing 7.4. Creating the predictive model

The constructor for creating the RBF is fairly simple:

RBFNetwork rbfLearner = new RBFNetwork();

We go with the default parameters associated with RBF learning, except we set the number of clusters to be used to 2:

rbfLearner.setNumClusters(2);

Once we have an instance of a classifier, it’s simple enough to build the predictive model:

Classifier classifier = getClassifier();
classifier.buildClassifier(learningDataset);

Having built the predictive model, we need to evaluate its quality. To do so, we typically use another set of data, commonly known as the test dataset; we iterate over all its instances and compare the predicted value with the expected value. The code for this is shown in listing 7.5.

Listing 7.5. Evaluating the quality and predicting the number of logins

Evaluating the quality of the model built is fairly straightforward. We simply need to create an instance of an Evaluation object and pass in the classifier for evaluation:

Evaluation learningSetEvaluation = new Evaluation(learningDataset);
     learningSetEvaluation.evaluateModel(classifier, learningDataset);

Lastly, we use the predictive model for predicting the number of logins for previously unknown cases. The code is shown in listing 7.6.

Listing 7.6. Predicting the number of logins

We try to predict the number of logins for two users. The first user is a 32-year-old male; the second is a 32-year-old female. Listing 7.7 shows the output from running the program.
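Since the body of listing 7.6 isn’t reproduced here, the following is a rough sketch of how the prediction step might look. It assumes the classifier and dataset built in the earlier steps, and the attribute names match the earlier sketches rather than the book’s exact listing; the actual program output follows in listing 7.7.

import weka.classifiers.Classifier;
import weka.core.Instance;
import weka.core.Instances;

// Sketch of step 5: predict the number of logins for a new user.
// Assumes 'classifier' was built as in listing 7.4 and 'dataset' is the
// learning dataset (age, gender, number of logins) from listing 7.3.
public class LoginPredictor {
  public static double predict(Classifier classifier, Instances dataset,
                               double age, String gender) throws Exception {
    Instance instance = new Instance(3);           // age, gender, numberOfLogins
    instance.setDataset(dataset);                  // the instance needs an associated dataset
    instance.setValue(dataset.attribute("age"), age);
    instance.setValue(dataset.attribute("gender"), gender);
    return classifier.classifyInstance(instance);  // predicted number of logins
  }
}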

Listing 7.7. The output from the main method
Correlation coefficient             0.4528
Mean absolute error                 0.9968
Root mean squared error             0.9968
Relative absolute error            99.6764 %
Root relative squared error        89.16  %
Total Number of Instances           4

Predicted number of logins [age=32]:
   Male = 3.3578194529075382
   Female = 2.9503429358320865

Listing 7.7 shows the details of how well the predictive model performed on the training data. As shown, the correlation coefficient[7] measures the quality of the prediction; for a perfect fit, this value will be 1. The model shows an error of about 1.

7 See http://mathworld.wolfram.com/CorrelationCoefficient.html for more details.

The model predicts that the 32-year-old male is expected to log in 3.35 times, while the 32-year-old female is expected to log in 2.95 times. Using the data presented to the model, the model predicts that male users are more likely to log in than female users.

This example has been helpful in understanding the WEKA APIs. It also brings out an important issue: the example we implemented makes our application highly dependent on WEKA. For example, the WEKA APIs use FastVector instead of perhaps a List to contain objects. What if tomorrow we wanted to switch to a different vendor or implementation? Switching to a different vendor implementation at that point would be painful and time consuming. Wouldn’t it be nice if there were a standard data mining API, which different vendors implemented? This would make it easy for a developer to understand the core APIs and, if needed, switch to a different implementation of the specification with few changes, if any, to the code. This is where the Java Data Mining (JDM) specification developed under Java Community Process JSR 73 and JSR 247 comes in.

7.3. Standard data mining API: Java Data Mining (JDM)

JDM aims at building a standard API for data mining, such that client applications coded to the specification aren’t dependent on any specific vendor application. The JDBC specification provides a good analogy to the potential of JDM. The promise is that just like it’s fairly easy to access different databases using JDBC, in the same manner, applications written to the JDM specification should make it simple to switch between different implementations of data mining functions. JDM has wide support from the industry, with representations from a number of companies including Oracle, IBM, SPSS, CA, Fair Isaac, SAP, SAS, BEA, and others. Oracle[8] and KXEN[9] have implementations compliant with the JDM specification as of early 2008. It’s only a matter of time before other vendors and data mining toolkits adopt the specification.

8http://www.oracle.com/technology/products/bi/odm/odm_jdev_extension.html

9http://kxen.com/products/analytic_framework/apis.php

Work on JSR 73[10] began in July 2000, with the final release in August 2004. JDM supports the five different types of algorithms we looked at in section 7.1: clustering, classification, regression, attribute importance, and association rules. It also supports common data mining operations such as building, evaluating, applying, and saving a model. It defines XML Schema for representing models as well as accessing data mining capabilities from a web service.

10http://www.jcp.org/en/jsr/detail?id=73

JSR 247,[11] commonly known as JDM 2.0, addresses features that were deferred from JDM 1.0. Some of the features JSR 247 addresses are multivariate statistics, time series analysis, anomaly detection, transformations, text mining, multi-target models, and model comparisons. Work on the project started in June 2004, and the public review draft was approved in December 2006.

11http://www.jcp.org/en/jsr/detail?id=247

If you’re interested in the details of JDM, I encourage you to download and read the two specifications—they’re well written and easy to follow. You should also look at a recent well-written book[12] by Mark Hornick, the specification lead for the two JSRs on data mining and JDM. He coauthored the book with two other members of the specification committee, Erik Marcadé, from KXEN, and Sunil Venkayala from Oracle.

12Java Data Mining: Strategy, Standard, and Practice, 2007, Morgan Kaufmann.

Next, we briefly look at the JDM architecture and the core components of the API. Toward the end of the section, we write code that demonstrates how a connection can be made to a data mining engine using the JDM APIs. In later chapters, when we discuss clustering, predictive models, and other algorithms, we review relevant sections of the JDM API in more detail.

7.3.1. JDM architecture

The JDM architecture has the following three logical components. These components could be either collocated or distributed on different machines:

  1. The API: The programming interface that’s used by the client. It shields the client from knowing about any vendor-specific implementations.
  2. The Data Mining Engine (DME): The engine that provides data mining functionality to the client.
  3. The mining object repository (MOR): The repository used to store the data mining objects.

All packages in JDM begin with javax.datamining. There are several key packages, which are shown in table 7.5.

Table 7.5. Key JDM packages

Concept | Packages | Comments
Common objects used throughout | javax.datamining | Contains common objects, such as MiningObject and Factory, that are used throughout the JDM packages
Top-level objects used in other packages | javax.datamining.base | Contains top-level interfaces such as Task, Model, BuildSettings, and AlgorithmSettings. Also introduced to avoid cyclic package dependencies
Algorithm-related packages | javax.datamining.algorithm, javax.datamining.association, javax.datamining.attributeimportance, javax.datamining.clustering, javax.datamining.supervised, javax.datamining.rule | Contains interfaces associated with the different types of algorithms, namely association, attribute importance, clustering, and supervised learning (which covers both classification and regression). Also contains Java interfaces representing the predicate rules created as part of the models, such as a tree model
Connecting to the data mining engine | javax.datamining.resource | Contains classes associated with connecting to a data mining engine (DME) and metadata associated with the DME
Data-related packages | javax.datamining.data, javax.datamining.statistics | Contains classes associated with representing both a physical and logical dataset and statistics associated with the input mining data
Models and tasks | javax.datamining.task, javax.datamining.modeldetail | Contains classes for the different types of tasks: build, evaluate, import, and export. Provides detail on the various model representations

Next, let’s take a deeper look at some of the key JDM objects.

7.3.2. Key JDM objects

The MiningObject is a top-level interface for JDM classes. It has basic information such as a name and description, and can be saved in the MOR by the DME. JDM has the following types of MiningObject, as shown in figure 7.15.

Figure 7.15. Key JDM objects

  • Classes associated with describing the input data, including both the physical (PhysicalDataSet) and logical (LogicalData) aspects of the data.
  • Classes associated with settings. There are two kinds: AlgorithmSettings is the base class for specifying the settings associated with a particular algorithm, while BuildSettings is the base class for the high-level specification used to build a data mining model. There are five kinds of build settings, one for each type of model: association, clustering, regression, classification, and attribute importance.
  • Model is the base class for mining models created by analyzing the data. There are five different kinds of models: association, clustering, regression, classification, and attribute importance.
  • Task is the base class for the different kinds of data mining operations, such as applying, testing, importing, and exporting a model.

We look at each of these in more detail in the next few sections. Let’s begin with representing the dataset.

7.3.3. Representing the dataset

JDM has different interfaces to describe the physical and logical aspects of the data, as shown in figure 7.16. PhysicalDataSet is an interface that describes the input data used for data mining, while LogicalData represents the data used for model input. Attributes of the PhysicalDataSet, represented by PhysicalAttribute, are mapped to attributes of the LogicalData, which are represented by LogicalAttribute. The separation of physical and logical data enables us to map multiple PhysicalDataSets into one LogicalData for building a model. One PhysicalDataSet can also translate to multiple LogicalData objects with variations in the mappings or definitions of the attributes.

Figure 7.16. Key JDM interfaces to describe the physical and logical aspects of the data

Each PhysicalDataSet is composed of zero or more PhysicalAttributes. An instance of PhysicalAttribute is created through the PhysicalAttributeFactory. Each PhysicalAttribute has an AttributeDataType, an enumeration that contains one of the values {double, integer, string, unknown}. The PhysicalAttribute also has a PhysicalAttributeRole, another enumeration that's used to define special roles that some attributes may have. For example, taxonomyParentId represents a column of data that contains the parent identifiers for a taxonomy.

LogicalData is composed of one or more LogicalAttributes. Each LogicalAttribute is created by the LogicalAttributeFactory and has an associated AttributeType. Each AttributeType is an enumeration with the values {numerical, categorical, ordinal, not specified}. Also associated with a LogicalAttribute is a DataPreparationStatus, which specifies whether the data is prepared or unprepared. For categorical attributes, there's also an associated CategorySet, which specifies the set of categorical values associated with the LogicalAttribute.

Now that we know how to represent a dataset, let’s look at how models are represented in the JDM.

7.3.4. Learning models

The output of a data mining algorithm is represented by the Model interface. Model, which extends MiningObject, is the base class for representing the five different kinds of data mining models, as shown in figure 7.17. Each model may have an associated ModelDetail, which captures algorithm-specific implementations. For example, NeuralNetworkModelDetail in the case of a neural network model captures the detailed representation of a fully connected MLP network model. Similarly, TreeModelDetail contains model details for a decision tree, and contains methods to traverse the tree and get information related to the decision tree. To keep figure 7.17 simple, the subclasses of ModelDetail are omitted.

Figure 7.17. The model representation in JDM

Table 7.6 shows the six subclasses of the Model interface. Note that SupervisedModel acts as a base interface for both ClassificationModel and RegressionModel.

Table 7.6. Key subclasses for Model

Model type | Description
AssociationModel | Model created by an association algorithm. It contains data associated with itemsets and rules.
AttributeImportanceModel | Ranks the attributes analyzed. Each attribute has a weight associated with it, which can be used as an input for building a model.
ClusteringModel | Represents the output from a clustering algorithm. Contains information to describe the clusters and associate a point with the appropriate cluster.
SupervisedModel | Is a common interface for supervised learning–related models.
ClassificationModel | Represents the model created by a classification algorithm.
RegressionModel | Represents the model created by a regression algorithm.

So far, we’ve looked at how to represent the data and the kinds of model representation. Next, let’s look at how settings are set for the different kinds of algorithms.

7.3.5. Algorithm settings

AlgorithmSettings, as shown in figure 7.18, is the common base class for specifying the settings associated with the various algorithms. A DME will typically use defaults for the settings and then use the specified settings to override the defaults.

Figure 7.18. The settings associated with the different kinds of algorithms

Each specific kind of algorithm typically has its own interface to capture the settings. For example, KMeansSettings captures the settings associated with the k-means algorithm. This interface specifies settings such as the number of clusters, the maximum number of iterations, the distance function to be used, and the error tolerance range.

So far in this section, we’ve looked at the JDM objects for representing the dataset, the learning models, and the settings for the algorithms. Next, let’s look at the different kinds of tasks that are supported by the JDM.

7.3.6. JDM tasks

There are five main types of tasks in JDM. These are tasks associated with building a model, evaluating a model, computing statistics, applying a model, and importing and exporting models from the MOR. Figure 7.19 shows the interfaces for some of the tasks in JDM. Tasks can be executed either synchronously or asynchronously. Some of the tasks associated with data mining, such as learning the model and evaluating a large dataset, take a long time to run. JDM supports specifying these as asynchronous tasks and monitoring the status associated with them.

Figure 7.19. The interfaces associated with the various tasks supported by JDM

The Task interface is an abstraction of the metadata needed to define a data-mining task. The task of applying a mining model to data is captured by ApplyTask. DataSetApplyTask is used to apply the model to a dataset, while RecordApplyTask is used to apply the mining model to a single record. ExportTask and ImportTask are used to export and import mining models from the MOR.

Task objects can be referenced, reexecuted, or executed at a later time. DME doesn’t allow two tasks to be executed with the same name, but a task that has completed can be re-executed if required. Tasks executed asynchronously provide a reference to an ExecutionHandle. Clients can monitor and control the execution of the task using the ExecutionHandle object.

Next, we look at the details of clients connecting to the DME and the use of ExecutionHandle to monitor the status.

7.3.7. JDM connection

JDM allows clients to connect to the DME using a vendor-neutral connection architecture, based on the principles of the Java Connector Architecture (JCA). Figure 7.20 shows the key interfaces associated with this process.

Figure 7.20. The interfaces associated with creating a Connection to the data-mining service

The client code looks up an instance of ConnectionFactory, perhaps using JNDI, and specifies a user name and password to the ConnectionFactory. The ConnectionFactory creates Connection objects, which are expected to be single-threaded and are analogous to the Connection objects created while accessing a database using the JDBC protocol. The ConnectionSpec associated with the ConnectionFactory contains details about the DME name, URI, locale, and the user name and password to be used.

A Connection object encapsulates a connection to the DME. It authenticates users, supports the retrieval and storage of named objects, and executes tasks. Each Connection object is a relatively heavyweight JDM object and needs to be associated with a single thread. Clients can access the DME via either a single Connection object or via multiple instances. Version specification for the implementation is captured in the ConnectionMetaData object.

The Connection interface has two methods available to execute a task. The first one is used for synchronous tasks and returns an ExecutionStatus object:

public ExecutionStatus execute(Task task, java.lang.Long timeout)
    throws JDMException

The other one is for asynchronous execution:

public ExecutionHandle execute(java.lang.String taskName)
    throws JDMException

It returns a reference to an ExecutionHandle, which can be used to monitor the task’s status. The Connection object also has methods to look for mining objects, such as the following, which looks for mining objects of the specified type that were created in a specified time period:

public java.util.Collection getObjectNames(java.util.Date createdAfter,
       java.util.Date createdBefore,
       NamedObject objectType) throws JDMException

With this overview of the connection process, let’s look at some sample code that can be used to connect to the DME.

7.3.8. Sample code for accessing DME

It’s now time to write some code to illustrate how the JDM APIs can be used to create a Connection to the DME. The first part of the code deals with the constructor and the main method, which calls the method to create a new connection. This is shown in listing 7.8.

Listing 7.8. Constructor and main method for JDMConnectionExample

In our example, we use a JDMConnectionExample object to create a new instance of the Connection object. The constructor for JDMConnectionExample takes in four parameters: the username and password for the DME, the URI for the DME server, and the URI for the provider. Sample values are shown in the main method. The main method creates a Connection object with the following call:

Connection connection = eg.createANewConnection();

There are three steps involved in getting a new Connection, as shown in listing 7.9.

Listing 7.9. Creating a new connection in the JDMConnectionExample

First, we need to create an instance of the ConnectionFactory. Next, we need to obtain a ConnectionSpec from the ConnectionFactory, populate it with the credentials, and then create a new Connection from the ConnectionFactory using the ConnectionSpec.

Listing 7.10 contains the remaining part of the code for this example, and deals with creating the connection factory and the initial context.

Listing 7.10. Getting a ConnectionFactory and ConnectionSpec

To get the ConnectionFactory, we first need to create the InitialContext for the JNDI lookup. The constructor for InitialContext takes a Hashtable, and we set the provider URL, username, and password for the lookup. Here the code

(ConnectionFactory) initialJNDIContext.lookup(
      "java:com/env/jdm/yourDMServer");

provides access to the ConnectionFactory. We get access to the ConnectionSpec with

ConnectionSpec connectionSpec = connectionFactory.getConnectionSpec();

The ConnectionSpec object is populated with the serverURI, the name, and password credentials, and a new Connection object is created from the ConnectionFactory by the following code:

connectionFactory.getConnection(connectionSpec);
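Pulling the fragments of listings 7.8 through 7.10 together, a sketch of the JDMConnectionExample class might look like the following. The JNDI lookup name is the one shown above, the main method and exception handling are omitted, and the book’s actual listings may differ in their details.

import java.util.Hashtable;

import javax.datamining.resource.Connection;
import javax.datamining.resource.ConnectionFactory;
import javax.datamining.resource.ConnectionSpec;
import javax.naming.Context;
import javax.naming.InitialContext;

// Sketch of creating a JDM Connection: look up the ConnectionFactory via JNDI,
// populate a ConnectionSpec, and ask the factory for a Connection.
public class JDMConnectionExample {
  private final String userName;
  private final String password;
  private final String serverURI;
  private final String providerURI;

  public JDMConnectionExample(String userName, String password,
                              String serverURI, String providerURI) {
    this.userName = userName;
    this.password = password;
    this.serverURI = serverURI;
    this.providerURI = providerURI;
  }

  public Connection createANewConnection() throws Exception {
    ConnectionFactory connectionFactory = createConnectionFactory();
    ConnectionSpec connectionSpec = connectionFactory.getConnectionSpec();
    connectionSpec.setURI(serverURI);          // DME server URI
    connectionSpec.setName(userName);          // DME credentials
    connectionSpec.setPassword(password);
    return connectionFactory.getConnection(connectionSpec);
  }

  private ConnectionFactory createConnectionFactory() throws Exception {
    Hashtable<String, String> env = new Hashtable<String, String>();
    env.put(Context.PROVIDER_URL, providerURI);
    env.put(Context.SECURITY_PRINCIPAL, userName);
    env.put(Context.SECURITY_CREDENTIALS, password);
    InitialContext initialJNDIContext = new InitialContext(env);
    return (ConnectionFactory) initialJNDIContext.lookup(
        "java:com/env/jdm/yourDMServer");
  }
}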

Once you have a Connection object, you can execute the different types of Tasks that are available, per the JDM specification. This completes our JDM example and a brief overview of the JDM architecture and the key APIs. Before we end this chapter, it’s useful to briefly discuss how JDM fits in with PMML, an XML standard for representing data mining models.

7.3.9. JDM models and PMML

Predictive Model Markup Language (PMML) is an XML standard developed by the Data Mining Group[13] (DMG) to represent predictive models. There’s wide support among the vendors to import and/or export PMML models. But PMML doesn’t specify the settings used to create the model, so there may be some loss of information when JDM models are converted to PMML format and vice versa; this is dependent on each vendor’s JDM model implementation. PMML does contain adequate information to apply and test the model. PMML models map readily to JDM. JDM also influenced certain aspects of the PMML 2.0 release.

13http://www.dmg.org/

7.4. Summary

Data mining is the automated process of analyzing data to discover previously unknown patterns and create predictive models. Mining algorithms need data in order to learn. A dataset is composed of a number of examples. Each example consists of values for a set of attributes. An attribute can be continuous or discrete. Discrete values that have an ordering associated with them are known as ordinal, while those that don’t have any ordering are called nominal.

There are five major types of mining algorithms:

  • Attribute importance— Ranks the available attributes in terms of importance for predicting the output variable
  • Association rules— Finds interesting relationships in data by looking at co-occurring items
  • Clustering— Finds clusters of similar data points
  • Regression— Predicts the value of the output variable based on the input attributes
  • Classification— Predicts the value of a discrete output attribute, assigning each example to one of a set of enumerated values

Writing mining algorithms is complex. Fortunately, there are a few open source data mining platforms that one can use. WEKA is perhaps the most commonly used Java-based open source data mining platform. WEKA includes all five types of learning algorithms, along with APIs to represent and manipulate the data.

You don’t want to tie your application code to a specific vendor implementation of data mining algorithms. Java Data Mining (JDM) is a specification developed under Java Community Process JSR 73 and JSR 247. JDM aims at providing a set of vendor-neutral APIs for accessing and using a data-mining engine. There are a couple of data-mining engines that are compliant with the JDM specification, and it’s expected that more companies will implement it in the future.

With this background, you should have a basic understanding of the data mining process; the algorithms; WEKA, the open source data mining toolkit; and JDM, the Java Data Mining standard.

For the learning process, we need a dataset. In the next chapter, chapter 8, we build a text analysis toolkit, which enables us to convert unstructured text into a format that can be used by the learning algorithms. We take a more detailed look at some of the data-mining algorithms, especially those associated with clustering and predictive models in chapter 9 and chapter 10.

7.5. Resources

Burges, Christopher J. C. “A tutorial on support vector machines for pattern recognition.” 1998. Data Mining and Knowledge Discovery. http://www.umiacs.umd.edu/~joseph/support-vector-machines4.pdf

“Familiarize yourself with data mining functions and algorithms.” 2007. JavaWorld. http://www.javaworld.com/javaworld/jw-02-2007/jw-02-jdm.html?page=2

Hornick, Mark, Erik Marcadé, and Sunil Venkayala. Java Data Mining: Strategy, Standard, and Practice. 2007. Morgan Kaufmann.

Java Data Mining API 1.0. JSR 73. http://www.jcp.org/en/jsr/detail?id=73

Java Data Mining API 2.0. JSR 247. http://www.jcp.org/en/jsr/detail?id=247

Jose, Benoy. “The Java Data Mining API.” Java Boutique. http://javaboutique.internet.com/articles/mining_java/

Moore, Andrew. “Statistical Data Mining Algorithms.” http://www.autonlab.org/tutorials/

Sommers, Frank. “Mine Your Own Data with the JDM API.” 2005. http://www.artima.com/lejava/articles/data_mining.html

Tan, Pang-Ning, Michael Steinbach, and Vipin Kumar. Introduction to Data Mining. 2006.

“Use Weka in your Java Code.” http://weka.sourceforge.net/wiki/index.php/Use_Weka_in_your_Java_code

Vapnik, Vladimir. Statistical Learning Theory. 1998. Wiley Science.

Venkayala, Sunil. “Using Java Data Mining to Develop Advanced Analytics Applications: The predictive capabilities of enterprise Java apps.” Java Developer Journal, http://jdj.sys-con.com/read/49091.htm

Witten, Ian H. and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques, 2nd Edition. 2005. Morgan Kaufmann, San Francisco.