
CHAPTER 6

Identify Risk or Threat Model

Define Goal

Define the goal of the artificial intelligence (AI) model for risk identification or the threat model.

The goal is to identify the risk associated with the application area. Instead of relying on standard, manual identification, we use AI/machine learning (ML) to automate the process.


See Figure 6.1.

Figure 6.1 Identify the risk or threat model


Now let us evaluate stores and their sales performance to decide whether to continue operating, close down, or improve sales.

Evaluation Steps

Let us go through one measure, the key performance indicator (KPI) of sales performance, in the following example. For simplicity, see Figure 6.2. We consider the retail industry in this example to illustrate the steps involved in identifying risk.

Figure 6.2 Evaluation steps to identify risk

Evaluation Measure or KPI

From the given measure or KPI, we consider the retail industry and take the sales transaction amount from a point of sale (POS) dataset as the measure. POS transactions are an important revenue-fueling point for any corporation or store; a POS transaction occurs when customers check out of the store and pay. Analysts can consider many important measures and KPIs to evaluate risk, so planning is required to determine how each measure/KPI can be used in the risk identification process. The same process can then be automated to evaluate other measures and KPIs.

Input:

  • List of measures and KPIs

Output:

  • Identified risk measures list
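As a minimal illustration of this step, the following sketch aggregates a hypothetical POS transaction table into a monthly sales-amount KPI per store using pandas; the column names and file path are assumptions for illustration, not part of the original dataset description.

import pandas as pd

# Hypothetical POS transaction extract; column names are assumptions.
pos = pd.read_csv("pos_transactions.csv",
                  parse_dates=["transaction_date"])

# Monthly sales transaction amount per store: the KPI evaluated in this step.
monthly_sales = (
    pos.groupby(["store_name", pos["transaction_date"].dt.to_period("M")])
       ["sales_transaction_amount"]
       .sum()
       .reset_index(name="monthly_sales_amount")
)
print(monthly_sales.head())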

Evaluate the Business Process

From the given business process, consider the retail industry and the POS transaction business process. In the POS business process, KPIs are derived from multiple dimensions and measures.

In this step of evaluating the business process, the input and expected output follow.

Input:

  • List of business processes and measures/KPIs.

Expected output:

  • Identified risk measures, listed by business process.

Steps:

  • Feed all measures of the given business process to train the model.

Evaluate Given Dataset

From the given dataset (the retail industry POS dataset), do the following in order:

Identify the business process > identify the list of measures > identify the risk

Evaluate Project-Related Documents

From the given project-related documents, collect the data.

Define Risk Identification Steps

Let us take an example of an analytical step of evaluation of stores.

There are three possibilities of actions:

  • The store is doing very well; it continues to operate and keeps up sales.
  • The store is not making adequate sales; determine the overall loss and come up with a strategy going forward.
  • The store is just doing OK and is meeting its breakeven point. The store has the possibility of increasing sales; if the store performs well in sales, it continues to strategize to keep up the sales.

Let us get into data collection now.

Data Collection

Collect Risk Data

Collect the appropriate dataset associated with the application area. The dataset will have to be collected by resources working in the application area. This includes KPIs and key risk indicators (KRIs).

Input:

  • Risk management plan.
  • Risk supporting documents.
    • Historically identified risk datasets, such as historical data on past risk occurrences, their impact, and the mitigation actions taken.
  • Industry policy and standards.
    • Retail industry standard KPIs, such as the minimum, maximum, and average of industry-level KPIs, along with KRIs from retail industry statistics.
  • Technology policy and standards.
    • POS terminal performance data such as system slowness and network slowness.
  • Risk policy and standards.
  • POS terminal secured network access log data.
  • Organization-specific policies and standards.
    • Organization survival minimum-requirement data, such as a rule that the store is closed down if revenue falls below 40 percent for three consecutive months (see the rule-check sketch after this list). Similarly, all other business rules are governed by the organization's governance department.
    • Hacker-avoidance policies.
    • Organization rules, regulations, and policies.
  • ISO 31000 risk standards.
    • Risk categorization documents.
  • Dataset
    • POS transaction dataset [use this dataset for training and explanation purposes].
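The sketch below illustrates how one such organization rule (the 40 percent revenue threshold over three consecutive months referenced above) could be checked against monthly store revenue with pandas; the file name, column names, and threshold encoding are assumptions for illustration only.

import pandas as pd

# Hypothetical monthly revenue per store, expressed as a percentage of target revenue.
revenue = pd.read_csv("monthly_store_revenue.csv")  # columns: store_name, month, revenue_pct

# Flag months where revenue is below the assumed 40 percent survival threshold.
revenue["below_threshold"] = revenue["revenue_pct"] < 40

# A store breaches the rule when it stays below the threshold for three consecutive months.
def breaches_rule(store_flags, window=3):
    return store_flags.astype(int).rolling(window).sum().eq(window).any()

breaches = (
    revenue.sort_values("month")
           .groupby("store_name")["below_threshold"]
           .apply(breaches_rule)
)
print(breaches[breaches])  # stores that trigger the close-down rule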

Design Algorithm

Determine appropriate algorithms based on the risk dataset. Conduct further research on how the algorithm can be designed to fit the data.

Identify Features

Identify features from the POS transaction dataset, considering only the important features to explain what is going on.

From the POS transaction dataset:

  • Sales transaction amount, store name, store location, transaction date (measures and attributes in the POS transaction dataset).

From unstructured documents:

  • Extract data from unstructured documents, such as risk plan documents and other Word documents, using natural language processing (NLP) to support natural language understanding (see the extraction sketch after this feature list).

From retail industry statistics:

  • Retail industry average sales transaction amount by category, store location, and month.

From organization rules document:

  • The minimum monthly sales decrease percentage that is considered a negative risk.

From the risk standards document:

  • Monthly revenue decrease

From risk supporting documents:

  • Past sales transaction amount, risk flag, impact, and action taken
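As referenced above, here is a minimal sketch of pulling structured signals out of an unstructured risk document, assuming spaCy and its small English model are installed; the file name and entity filter are illustrative assumptions, not the book's prescribed tooling.

import spacy

nlp = spacy.load("en_core_web_sm")          # assumed to be installed
with open("risk_plan.txt") as f:            # hypothetical risk plan document
    doc = nlp(f.read())

# Keep entities usable as risk features: amounts, percentages, dates, organizations.
extracted = [(ent.text, ent.label_) for ent in doc.ents
             if ent.label_ in {"MONEY", "PERCENT", "DATE", "ORG"}]
print(extracted)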

Output:

  • Is there a risk or not?
  • Determine true or false

ML/AI use case:

This is a binary classification problem addressed with supervised learning. Now we do a deep dive into the binary classification steps.
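A minimal sketch of framing the problem, assuming the collected data has been assembled into a pandas DataFrame shaped like Table 6.1 with a RISK column as the label; the file name and encoding choices are assumptions.

import pandas as pd

pos = pd.read_csv("pos_risk_dataset.csv")          # hypothetical assembled dataset

# Features: every column except RISK; one-hot encode categorical attributes such as store name.
X = pd.get_dummies(pos.drop(columns=["RISK"]))

# Binary label: 1 = risk, 0 = no risk.
y = pos["RISK"].astype(int)
print(X.shape, y.value_counts())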

Identify a List of Binary Classification Models

Here is the typical list of algorithms used for binary classification.

Deep learning algorithms

  • Convolutional Neural Network
  • Recurrent Neural Network
  • Hierarchical Attention Network

Ensemble

  • Random Forest

Neural Networks

  • Radial Basis Function Network
  • Perceptron
  • Back-Propagation
  • Hopfield Network

Support Vector Machine (SVM)

  • Multiclass SVM

Regression

  • Logistic regression

Bayesian

  • Naive Bayes

Decision Tree

  • Classification and Regression Tree
  • Conditional Decision Trees

Dimensionality Reduction

  • Quadratic Discriminant Analysis
  • Linear Discriminant Analysis (LDA)

Instance based

  • K-Nearest Neighbors

You can pick more algorithms to consider for training and evaluation. For simplicity, we limit this binary classification problem to the following algorithms.

  • LDA
  • Quadratic Discriminant Analysis
  • Logistic regression

Let’s get into data preparation steps from the POS transaction dataset.

Data Preparation

Table 6.1 shows the sample dataset considered in our experiment.


Table 6.1 Sample Data for Risk Identification Use Case

  • Features: all columns except RISK
  • Expected output: RISK column
  • Input all the features in the above dataset and find feature scoring for the test dataset.
  • Identify important features, train with identified features, and find scoring for the test dataset.
  • Create a bias/variance adjustment for the training dataset to avoid the influence of a dominant class (i.e., to avoid overfitting/underfitting).
  • Split the dataset into three sets as follows (see the split sketch after this list):
    • Training dataset (60 percent)
    • Test dataset (20 percent)
    • Validation dataset (20 percent)
  • Train/test/validate the model
  • Use the identified datasets to train the model.
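A minimal sketch of the 60/20/20 split described above, assuming the X and y arrays from the framing step; scikit-learn's train_test_split is applied twice because it only splits two ways at a time.

from sklearn.model_selection import train_test_split

# First split off 60 percent for training; the remaining 40 percent is split in half below.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y)

# Split the remaining 40 percent into test (20 percent) and validation (20 percent).
X_test, X_val, y_test, y_val = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42, stratify=y_rest)

print(len(X_train), len(X_test), len(X_val))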

Train the Model

Train the selected algorithms using the prepared dataset.
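A minimal sketch of this step, assuming the training split from the data preparation step and the three algorithms selected earlier; it simply fits each scikit-learn estimator and reports training accuracy.

from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis)
from sklearn.linear_model import LogisticRegression

models = {
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
    "Logistic regression": LogisticRegression(max_iter=1000),
}

# Fit each selected algorithm on the prepared training dataset.
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "training accuracy:", model.score(X_train, y_train))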

Tune the Model

Tune the hyperparameters used to train the model.

Run the training multiple times with different combinations of the provided hyperparameters (batch size, epochs, optimizer, learning rate, momentum, and dropout rate) to find the optimum combination that produces the best results.

  • Batch size = [10, 20, 40, 60, 80, 100]
  • epochs = [10, 50, 100]
  • optimizer = [‘SGD’, ‘RMSprop’, ‘Adagrad’, ‘Adadelta’, ‘Adam’, ‘Adamax’, ‘Nadam’]
  • learn rate = [0.001, 0.01, 0.1, 0.2, 0.3]
  • momentum = [0.0, 0.2, 0.4, 0.6, 0.8, 0.9]
  • dropout rate = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

By using GridSearchCV, you can automate trying out different combinations of hyperparameters; using GridSearchCV is recommended.

grid_search = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, scoring='accuracy')
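A fuller sketch of this call, shown here with logistic regression (one of the selected algorithms) and a hypothetical parameter grid; the neural-network hyperparameters listed above would instead require wrapping a Keras model in a scikit-learn-compatible wrapper.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Hypothetical grid for logistic regression; substitute your own estimator and grid.
param_grid = {
    "C": [0.01, 0.1, 1.0, 10.0],
    "solver": ["lbfgs", "liblinear"],
}

model = LogisticRegression(max_iter=1000)
grid_search = GridSearchCV(estimator=model, param_grid=param_grid,
                           n_jobs=-1, scoring="accuracy", cv=5)

# X_train and y_train come from the data preparation step.
grid_search.fit(X_train, y_train)
print(grid_search.best_params_, grid_search.best_score_)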

Compare models with different hyperparameters, choose the best fit hyperparameters, and use the model to train further with the full dataset.

Test the Model

Test the model using the test dataset for each selected algorithm with the given methods:

  • Specify the loss function used with each algorithm.
  • Learning rate curve.
  • Learning curve.

Plot the loss versus epoch (learning rate) graph. See Figure 6.3. Pick the model that shows the "good" learning rate curve depicted in the graph.

Figure 6.3 Loss versus epoch learning graph
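A minimal sketch, assuming per-epoch training and validation loss values are available (for example, from a Keras History object or logged manually), of plotting the loss versus epoch curve used here.

import matplotlib.pyplot as plt

# history is assumed to be a dict of per-epoch losses collected during training.
plt.plot(history["loss"], label="training loss")
plt.plot(history["val_loss"], label="validation loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend()
plt.show()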

Receiver Operating Characteristic Curve

Receiver operating characteristic (ROC) is another common tool used with binary classifiers. See Figure 6.4. For this use case, we choose the model whose curve stays closest to the top-left corner (a high true positive rate at a low false positive rate), because we expect at least 90 percent accuracy from the identify risk or threat model.

Figure 6.4 ROC curve
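A minimal sketch, assuming a fitted scikit-learn classifier and the held-out test split from data preparation, of computing and plotting the ROC curve and its area under the curve.

from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

# Probability of the positive (risk) class from the fitted model.
y_scores = model.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_scores)
print("AUC:", roc_auc_score(y_test, y_scores))

plt.plot(fpr, tpr, label="model")
plt.plot([0, 1], [0, 1], linestyle="--", label="chance")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()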

Evaluate the Model

Evaluate the model using accuracy and mean square error, and determine the learning rate. The following scoring methods are used.

Scoring Methods

  • Precision score
  • Recall score
  • F1 score
  • Support score
  • Accuracy score
  • Area under the curve/ROC
  • Learning rate (ranges from 0 to 1)
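A minimal sketch, assuming a fitted model and the test split from earlier steps, of computing most of the scores listed above with scikit-learn (precision, recall, F1, support, accuracy, and AUC).

from sklearn.metrics import classification_report, accuracy_score, roc_auc_score

y_pred = model.predict(X_test)
# classification_report prints precision, recall, F1 score, and support per class.
print(classification_report(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))
print("AUC/ROC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))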

Confusion Matrix

  • Analyze the inputs that are improperly classified using the confusion matrix.
  • Accuracy versus epoch graph.
  • Decide the final model with accuracy.
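A minimal sketch, under the same assumptions as the scoring sketch above, of producing the confusion matrix used to analyze misclassified inputs.

from sklearn.metrics import confusion_matrix

# Rows are actual classes, columns are predicted classes (no risk vs. risk).
print(confusion_matrix(y_test, y_pred))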

Repeat the steps from data collection to data preparation, feature extraction, training, testing, and evaluation of the model until reaching the necessary accuracy of 85 percent and above.

Model Conclusion

Based on accuracy and F1 score, linear discriminant analysis is the best performing model for this binary classification problem of risk identification. See Table 6.2.


Table 6.2 Model score comparison

Model                              Accuracy    Precision   Recall      F1 score    AUC ROC
Linear discriminant analysis       0.948757    0.959455    0.928571    0.947142    0.993056
Quadratic discriminant analysis    0.94348     0.939012    0.941415    0.943893    0.989324
Logistic regression                0.943348    0.945292    0.932303    0.943039    0.989973
Random Forest                      0.941126    0.944909    0.928457    0.940401    0.982154
K-Nearest Neighbors                0.935848    0.943361    0.916987    0.93403     0.954365
Bayes                              0.930994    0.931379    0.917903    0.930448    0.97938

Note. AUC = area under the curve, ROC = receiver operating characteristic.


Binary Classification Methods Comparison

Publish/Production of the Model

  • Retrain the model until it yields the desired output.
    • Repeat the steps from data collection, data preparation, feature extraction, training, testing, and evaluation of the model; then publish the model.
    • Repeat the entire process for all measures in this business process, as explained in the evaluation steps.
  • Possible ways to productionalize the trained model follow (see the deployment sketch after this list).
    • Host in Google cloud, Microsoft Azure, or AWS.
  • How do we deploy the trained model?
    https://cloud.google.com/mlengine/docs/tensorflow/deploying-models
  • Regularly monitor and update the model.
  • How do we use the productionalized model for business users?
    • Integrate the trained model with the application for business users to identify new risks or threats from the new dataset.
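As referenced above, here is a minimal sketch of one way to productionalize the trained model, assuming a fitted scikit-learn estimator: persist it with joblib and expose it behind a simple Flask prediction endpoint, which cloud platforms such as Google Cloud, Azure, or AWS could then host. The endpoint name and request schema are illustrative assumptions.

import joblib
from flask import Flask, request, jsonify

# Persist the final trained model so the serving process can load it.
joblib.dump(model, "risk_model.joblib")

app = Flask(__name__)
risk_model = joblib.load("risk_model.joblib")

@app.route("/predict", methods=["POST"])
def predict():
    # Hypothetical request schema: {"features": [...]} with values in training-column order.
    features = request.get_json()["features"]
    prediction = risk_model.predict([features])[0]
    return jsonify({"risk": bool(prediction)})

if __name__ == "__main__":
    app.run()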

Conclusion

Types of resources needed for this project follow:

  • Risk analysis subject matter experts
  • Risk mitigation strategists
  • Data analysts
  • Data architects
  • Data scientists
  • Data engineers

All of the previous steps can be simplified using www.BizStats.AI.

To automate, the following steps are to be done:

  • Provide the input dataset
  • Train multiple models
  • Present models with accuracy
  • Pick the model and activate it
  • The business user can directly use it just by searching.