

Predictive model


On our most recent modeling projects, the volume of data and the complexity of interconnected business systems have required our enterprise modeling team to blend methodologies from statistical analysis, rule-based expert systems, and data mining. Nextbit recently implemented a Classification and Predictive Model for fraud detection at one of the largest financial institutions in Italy. The predictive modeling techniques were applied as part of a “model evolution”: data mining was used to discover rules of behavior unknown to the business by examining patterns and trends in historical records; statistical analysis was applied to identify the predictive power of variables and elements within these patterns; and modeling techniques (including regression models, decision tree classification, factor and cluster analysis) were used to extrapolate rules of behavior applied to hundreds of millions of records processed each day to estimate the likelihood of fraud.


 The Nextbit model evolution process guides the general design and implementation of Classification and Predictive Models. It involves a problem definition phase, a model construction phase, and a validation phase. The diagram below is a high-level overview of the Classification and Predictive Model development process.
  1. Problem Definition. At this stage we decide what we are attempting with the model, its basic components, and the nature and meaning of the data elements. A clear specification of project objectives is a crucial phase of the project. The objective statement defines what we and the client expect from the model, and what decisions will be made based on the model output. The aim of this particular project was to classify customers into homogeneous groups according to their credit card utilization behavior, to predict the customer patterns with the highest propensity to fraud risk, and to understand the main determinants of credit card fraud. We looked both at variables describing the clients and their credit card utilization and at statistics on credit card transactions. This phase typically requires 3 weeks of work and sets the project objectives and milestones.
  2. Data Partition. We identify the input data sets, sample a smaller data set from them, and partition it into training, validation, and test data sets. In this project we split the credit card transaction data set by time period, taking as the training set all transactions that occurred in a chosen semester and as the validation set those that occurred in the following semester.
  3. Data Cleaning and Transformation. This is the most time-consuming and the most difficult part of the process. It consists of a preliminary descriptive data analysis, in which we explore the data sets statistically and graphically, plot the data, and obtain descriptive statistics. Large databases contain noise due to common error conditions such as missing values, wrong values, and outliers. The critical elements for analysis are isolated, and the data are scrubbed and purified. We prepare the data for analysis by identifying and removing outliers and replacing missing values where necessary. To reinforce the predictive power of the model, we also transform qualitative variables, aggregating and simplifying their categories, and create additional indicators from the available raw data. This part of the process took 6 weeks of data cleaning.
  4. Data Exploration. Some of the variables are redundant or irrelevant for the analysis; it is therefore important to move from a long list to a short list of input variables. This phase precedes the modeling and is aimed at (i) determining which variables are likely to be predictive and (ii) performing association analysis to detect highly correlated input variables, if any. For this project, we first split the credit card transactions into two groups (genuine and fraudulent) and then ran the following statistical analyses: (i) t-tests for between-group differences in the means of the quantitative input variables, (ii) chi-square tests for between-group differences in the distributions of the qualitative input variables, and (iii) logistic regressions to establish which input variables (both qualitative and quantitative) may influence whether a transaction is fraudulent.
  5. Statistical Analysis. After data acquisition and data cleaning, we move to statistical analyses aimed at classifying customers into homogeneous groups.
  6. Correlations and Latent Variables. First we check whether correlations exist among the variables included in the data and considered for the classification analysis. We used factor analysis to eliminate redundant variables and synthesize correlated variables into common hidden factors. Through this technique we reduced the inputs from 52 quantitative variables to 13 uncorrelated, standardized latent factors, simplifying the dimensional complexity of the model.
  7. Segmentation. Next we perform cluster analysis to find groupings of customers in the data, based both on the set of factors identified in the previous phase and on the most relevant business patterns. After grouping into clusters, we use the most relevant input variables to describe each group. In this project clients were partitioned into 22 clusters characterized by similar consumption behavior, in terms of the average amount and frequency of expenditure and the types of goods and services purchased, but also in terms of age, gender, educational level, and socioeconomic status.
  8. Data Modeling. Having selected the most relevant information on clients and their credit card transactions, we build a model aimed at predicting fraudulent transactions. As the predictive technique we chose decision trees, which are easier to interpret than many alternatives. We develop the model on the training set and analyze all the patterns we want to discover. We first compare competing predictive models (building charts that plot the percentage of respondents, the percentage of respondents captured, and lift) and then choose the most suitable one. We selected the best model by comparing, in particular, the misclassification error, sensitivity, specificity, and captured responses of each candidate. Depending on the business decision-making process, any of these criteria can become the decisive factor in choosing which model to use. With the validation set we assess and calibrate the model and evaluate its goodness of fit.
  9. Model Prediction. Once a model has been validated with statistical tests and endorsed by the client through the User Acceptance Test, we move to the prediction analysis phase. One purpose of this model is to improve the fraud detection rate by predicting whether a current credit card transaction is fraudulent, given the available information. During the implementation phase we constantly monitor the model's performance and its ability to predict outcomes, in terms of misclassification error, sensitivity, specificity, and captured responses.
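The time-based partition described in step 2 can be sketched in a few lines of Python, assuming the transactions live in a pandas DataFrame with a date column (all column names, dates, and values here are hypothetical, not taken from the project):

```python
import pandas as pd

# Hypothetical transactions table; column names and values are illustrative.
tx = pd.DataFrame({
    "tx_date": pd.to_datetime(
        ["2023-02-10", "2023-05-03", "2023-06-30",
         "2023-07-01", "2023-09-15", "2023-12-20"]),
    "amount": [120.0, 75.5, 1999.9, 42.0, 310.0, 880.0],
    "is_fraud": [0, 0, 1, 0, 1, 0],
})

# Training set: all transactions in the chosen semester;
# validation set: the transactions of the following semester.
train = tx[tx["tx_date"] < "2023-07-01"]
valid = tx[tx["tx_date"] >= "2023-07-01"]
```

Splitting by time rather than at random ensures the model is validated on transactions it could not have seen during training, which mirrors how the model is used in production.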
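The cleaning pass of step 3 commonly combines missing-value imputation with outlier treatment. A minimal sketch, assuming median imputation and winsorization at the 1st/99th percentiles (the thresholds and column name are assumptions for illustration, not the project's actual cleaning rules):

```python
import numpy as np
import pandas as pd

# Toy column with a missing value and an extreme outlier.
df = pd.DataFrame({"amount": [10.0, 12.0, np.nan, 11.0, 5000.0]})

# Replace missing amounts with the median of the observed values.
median = df["amount"].median()
df["amount"] = df["amount"].fillna(median)

# Winsorize extreme outliers by clipping at the 1st/99th percentiles.
lo, hi = df["amount"].quantile([0.01, 0.99])
df["amount"] = df["amount"].clip(lo, hi)
```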
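The exploration tests of step 4 map directly onto standard SciPy routines. A sketch on synthetic data (the distributions and the contingency table are made up; only the test procedures match the step above):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic amounts for genuine vs. fraudulent transactions.
genuine = rng.normal(50, 10, 500)
fraud = rng.normal(80, 10, 500)

# t-test: do the two groups differ in mean transaction amount?
t_stat, p_val = stats.ttest_ind(genuine, fraud, equal_var=False)

# chi-square: does a qualitative variable (e.g. purchase channel)
# have a different distribution in the two groups?
# Rows: genuine/fraud; columns: counts per channel (made-up table).
table = np.array([[300, 150, 50],
                  [100, 100, 100]])
chi2, p_chi, dof, expected = stats.chi2_contingency(table)
```

Variables whose tests show significant between-group differences are the candidates for the short list of predictive inputs.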
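The dimensionality reduction of step 6 (52 quantitative variables down to 13 latent factors) can be reproduced in outline with scikit-learn's factor analysis; the data below is synthetic, generated from a handful of hidden factors purely to make the correlations visible:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Synthetic stand-in for 52 correlated quantitative inputs,
# generated from 5 hidden factors plus noise.
latent = rng.normal(size=(1000, 5))
loadings = rng.normal(size=(5, 52))
X = latent @ loadings + rng.normal(scale=0.5, size=(1000, 52))

# Standardize, then extract a reduced set of latent factors
# (13, matching the count reported in step 6).
X_std = StandardScaler().fit_transform(X)
fa = FactorAnalysis(n_components=13, random_state=0)
factors = fa.fit_transform(X_std)
```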
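For the segmentation of step 7, k-means is one common clustering choice (the original text does not name the algorithm used, so this is an assumption); the factor scores here are random placeholders, with k set to the 22 clusters reported above:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Placeholder factor scores for 1,000 customers (5 features, illustrative).
scores = rng.normal(size=(1000, 5))

# Partition customers into 22 clusters, as in the project described above.
km = KMeans(n_clusters=22, n_init=10, random_state=0).fit(scores)
labels = km.labels_
```

Each cluster is then profiled with the most relevant input variables (spending amount and frequency, purchase types, demographics) to give it a business interpretation.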
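Steps 8 and 9 come together in fitting a decision tree on the training period and scoring it on the validation period with the criteria named above (misclassification error, sensitivity, specificity). A self-contained sketch on synthetic data:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(3)
# Synthetic training/validation sets with two informative features;
# the label rule is arbitrary, chosen only to make the task learnable.
X_train = rng.normal(size=(2000, 2))
y_train = (X_train[:, 0] + X_train[:, 1] > 1).astype(int)
X_valid = rng.normal(size=(1000, 2))
y_valid = (X_valid[:, 0] + X_valid[:, 1] > 1).astype(int)

# Fit a shallow decision tree on the training period.
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(X_train, y_train)

# Evaluate on the validation period with the criteria from step 8.
pred = tree.predict(X_valid)
tn, fp, fn, tp = confusion_matrix(y_valid, pred).ravel()
misclassification = (fp + fn) / len(y_valid)
sensitivity = tp / (tp + fn)   # fraud caught / actual fraud
specificity = tn / (tn + fp)   # genuine passed / actual genuine
```

The same three metrics, recomputed on live traffic, are what the ongoing monitoring in step 9 tracks.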
