Systematic Fraud Detection Through Automated Data Analytics in MATLAB

By Jan Eggers, MathWorks

As the Madoff Ponzi scheme and recent high-profile rate-rigging scandals have shown, fraud is a significant threat to financial organizations, government institutions, and individual investors. Financial services and other organizations have responded by stepping up their efforts to detect fraud.

Systematic fraud detection presents several challenges. First, fraud detection methods require complex investigations that involve the processing of large amounts of heterogeneous data. The data is derived from multiple sources and crosses multiple knowledge domains, including finance, economics, business, and law. Gathering and processing this data manually is prohibitively time-consuming as well as error-prone. Second, fraud is a "needle in a haystack" problem because only a very small fraction of the data is likely to come from a fraudulent case. The vast quantity of regular data (that is, data produced by nonfraudulent sources) tends to mask the cases of fraud. Third, fraudsters are continually changing their methods, which means that detection strategies are frequently several steps behind.

Using hedge fund data as an example, this article demonstrates how MATLAB® can be used to automate the process of acquiring and analyzing fraud detection data. It shows how to import and aggregate heterogeneous data, construct and test models to identify indicators for potential fraud, and apply machine learning techniques to the calculated indicators to classify a fund as fraudulent or nonfraudulent.

The statistical techniques and workflow described are applicable to any area requiring detailed analysis of large amounts of heterogeneous data from multiple sources, including data mining and operational research tasks in retail and logistic analysis, defense intelligence, and medical informatics.

The Hedge Fund Case Study

The number of hedge funds has grown exponentially in recent years: The Eurekahedge database indicates a total of approximately 20,000 active funds worldwide.¹ Hedge funds are minimally regulated investment vehicles and, therefore, prime targets of fraud. For example, hedge fund managers may fake return data to create the illusion of high profits and attract more investors.

We will use monthly returns data from January 1991 to October 2008 from three hedge funds:

Gateway Fund
Growth Fund of America
Fairfield Sentry Fund

The Fairfield Sentry Fund is a Madoff fund known to have reported fake data. As such, it offers a benchmark for verifying the efficacy of fraud detection mechanisms.

Gathering Heterogeneous Data

Data for the Gateway Fund can be downloaded from the Natixis web site as a Microsoft® Excel® file containing the net asset value (NAV) of the fund on a monthly basis. Using the MATLAB Data Import Tool, we define how the data is to be imported (Figure 1). The Data Import Tool can automatically generate the MATLAB code to reproduce the defined import style.

Figure 1. The MATLAB Data Import Tool for interactively importing data from files.

After importing the NAV for the Gateway Fund, we use the following code to calculate the monthly returns:

% Calculate monthly returns
gatewayReturns = tick2ret(gatewayNAV);
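
The same arithmetic can be sketched in base MATLAB. With its default settings, tick2ret computes simple period-over-period returns, so a series of n NAV values yields n-1 returns. The variable navExample and its values are hypothetical and serve only to illustrate the calculation.

```matlab
% Hypothetical month-end NAV values (illustration only, not Gateway Fund data)
navExample = [100; 102; 101; 104];

% Simple return: r(t) = NAV(t) / NAV(t-1) - 1
simpleReturns = diff(navExample) ./ navExample(1:end-1);

% One observation is lost: 4 NAV values give 3 returns
disp(simpleReturns)
```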

For the Growth Fund of America, we use Datafeed Toolbox™ to obtain data from Yahoo! Finance, specifying the ticker symbol for the fund (AGTHX), the name of the relevant field (adjusted close price), and the time period of interest:

% Connect to yahoo and fetch data
c = yahoo;
data = fetch(c, 'AGTHX', 'Adj Close', startDate, endDate);

Unfortunately, Yahoo does not provide data for the period from January 1991 to February 1993. For this time period, we have to collect the data manually.

Using the financial time series object in Financial Toolbox™, we convert the imported daily data to the desired monthly frequency:

% Convert daily data to monthly frequency
tsobj = fints(dates, agthxClose);
tsobj = tomonthly(tsobj);
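
As an alternative for the same step, the conversion can be sketched with the timetable data type and the retime function (available in MATLAB since R2016b). This is not the code used in the article; the dates and prices below are hypothetical, and taking the last observed value of each month is an assumption about how the monthly figure should be chosen.

```matlab
% Hypothetical daily closing prices for 90 consecutive days (illustration only)
dates      = datetime(2008, 1, 1) + caldays(0:89)';
agthxClose = 30 + 0.01 * (0:89)';

% Store the daily series in a timetable
dailyTT = timetable(dates, agthxClose);

% Aggregate to monthly frequency, keeping the last observation of each month
monthlyTT = retime(dailyTT, 'monthly', 'lastvalue');

% Simple monthly returns from the month-end prices
monthlyReturns = diff(monthlyTT.agthxClose) ./ monthlyTT.agthxClose(1:end-1);
```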

Finally, we import reported data from the Fairfield Sentry fund. We use two freely available Java™ libraries, PDFBox and FontBox, to read the text from the PDF version of the Fairfield Sentry fund fact sheet:

% Instantiate necessary classes
pdfdoc = org.apache.pdfbox.pdmodel.PDDocument;
reader = org.apache.pdfbox.util.PDFTextStripper;

% Read data
pdfdoc = pdfdoc.load(FilePath);
pdfstr = reader.getText(pdfdoc);

Having imported the text, we extract the parts containing the data of interest—that is, a table of monthly returns.
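
One way to do this extraction is with a regular expression. The sketch below is only an illustration: the text layout and the percentage values are invented, and the actual fact sheet may need a different pattern.

```matlab
% Hypothetical extracted text (invented layout and values, not the fact sheet)
pdfstr = sprintf('Year Jan Feb Mar\n2007 1.23%% -0.45%% 0.67%%\n2008 0.10%% 0.20%% -0.30%%');

% Make sure we work on a MATLAB character vector
txt = char(pdfstr);

% Capture every signed decimal number that is followed by a percent sign
tokens = regexp(txt, '(-?\d+\.\d+)%', 'tokens');

% Convert the captured strings to a numeric row vector of returns (in percent)
monthlyReturns = cellfun(@(t) str2double(t{1}), tokens);
```

Because the year labels carry no percent sign, the pattern skips them and returns only the six return values.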

Some tests for fraudulent data require comparison of the funds' returns data to standard market data. We import the benchmark data for each fund using the techniques described above.

Once the data is imported and available, we can assess its consistency—for example, by comparing the normalized performance of all three funds (Figure 2).

Figure 2. Plot comparing the performance of the funds under consideration.

Simply viewing the plot allows for a qualitative assessment. For example, the Madoff fund exhibits unusually smooth growth, yielding a high profit. Furthermore, there are no obvious indications of inconsistency in the underlying data. This means that we will be able to use formal methods to detect fraudulent activities.
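
A normalized performance comparison of this kind can be sketched by compounding each fund's returns from a common starting value of 1. The returns matrix below is hypothetical (one column per fund) and is not the data behind Figure 2.

```matlab
% Hypothetical monthly returns, one column per fund (illustration only)
exampleReturns = [ 0.01   0.02   0.008;
                  -0.03   0.04   0.009;
                   0.02  -0.05   0.007];

% Compound the returns; every fund starts at a normalized value of 1
normalizedPerf = [ones(1, size(exampleReturns, 2)); cumprod(1 + exampleReturns)];

% Plot all funds on a common scale
figure;
plot(0:size(exampleReturns, 1), normalizedPerf);
xlabel('Month');
ylabel('Normalized value');
legend('Fund A', 'Fund B', 'Fund C', 'Location', 'best');
```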

Analyzing the Returns Data

Since misbehavior or fraud in hedge funds manifests itself mainly in misreported data, academic researchers have focused on devising methods to analyze and flag potentially manipulated fund returns. We compute metrics introduced by Bollen and Pool² and use them as potential indicators for fraud on the reported hedge fund returns. For example:

Discontinuity at zero in the fund's returns distribution
Low correlation with other assets, contradicting market trends
Unconditional and conditional serial correlation, indicating smoother than expected trends
Number of returns equal to zero
Number of negative, unique, and consecutive identical returns
Distribution of the first digit (Does it follow Benford's law?) and the last digit (Is it uniform?) of reported returns
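
Several of the simpler indicators in this list reduce to counts and a correlation coefficient. The following sketch uses a hypothetical returns vector; counting a return as "consecutive identical" when it equals the return immediately before it is one possible reading of that indicator, not necessarily the exact definition used by Bollen and Pool.

```matlab
% Hypothetical monthly returns in percent (illustration only, not fund data)
r = [0.5; 0.5; 0.7; 0.0; -0.2; 0.9; 0.9; 0.9; 0.3; 0.0; 0.6; 0.4];

numZero     = sum(r == 0);       % returns exactly equal to zero
numNegative = sum(r < 0);        % negative returns
numUnique   = numel(unique(r));  % distinct return values

% Returns identical to the immediately preceding return (assumed definition)
numRepeats = sum(diff(r) == 0);

% Unconditional serial correlation at lag 1
c = corrcoef(r(1:end-1), r(2:end));
serialCorr = c(1, 2);
```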

To illustrate the techniques, we will focus on discontinuity at zero.

Testing for Discontinuity at Zero

Since funds with a higher number of positive returns attract more capital, fund managers have an incentive to misreport results to avoid negative returns. This means that a discontinuity at zero can be a potential indicator for fraud.

One test for such a discontinuity is counting the number of return observations that fall in three adjacent bins, two to the left of zero and one to the right. The number of observations in the middle bin should approximately equal the average of the surrounding two bins. A significant shortfall in the middle bin observations must be flagged.
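
A minimal sketch of the three-bin comparison follows. The returns, the bin width, the normal-approximation standardization, and the 5% one-sided threshold are all assumptions made for illustration; they are not taken from the analysis behind this article.

```matlab
% Hypothetical monthly returns in percent (illustration only, not fund data)
r = [0.8 1.1 0.3 0.6 -1.4 0.2 0.9 0.1 1.3 0.4 -1.2 0.7 0.5 0.05 0.95 -1.6];

binWidth = 1.0;  % assumed bin width; in practice it depends on the data

% Three adjacent bins: [-2w,-w), [-w,0), [0,w]
edges  = [-2*binWidth, -binWidth, 0, binWidth];
counts = histcounts(r, edges);   % [left, middle, right]

% The middle bin is expected to hold about the average of its neighbors
expectedMiddle = (counts(1) + counts(3)) / 2;
shortfall      = expectedMiddle - counts(2);

% Standardize the difference using a binomial approximation (assumption)
n = numel(r);
p = counts / n;
v = n*p(2)*(1 - p(2)) + 0.25*n*(p(1) + p(3))*(1 - p(1) - p(3));
z = (counts(2) - expectedMiddle) / sqrt(v);

% Flag a significant shortfall (one-sided 5% level, assumed threshold)
flagDiscontinuity = z < -1.645;
```

In this example the bin just below zero is empty while its neighbors hold 3 and 11 observations, so the test raises a flag.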

Figure 3 shows the histograms of the funds' returns, with the two bins around zero highlighted. Green bars indicate no flag, and red bars indicate potential fraud. Only the Madoff fund did not pass this test.

Figure 3. Histograms of monthly returns for funds under consideration.

Results for Funds Under Consideration

Applying all the tests described above to the present data yields a table of indicators for each fund (Figure 4).

The Madoff fund raised a flag in nine out of ten tests, but the other two funds also raised flags. Positive test results do not prove that a given hedge fund was involved in fraudulent activities. However, a table like the one shown in Figure 4 indicates funds that merit further investigation.

Classifying Analysis Results with Machine Learning

We now have a set of flags that can be used as indicators for fraud. Automating the analytics enables us to review larger data sets and to use the computed flags to categorize funds as fraudulent or nonfraudulent. This classification problem can be addressed using machine learning methods—for example, bagged decision trees, using the TreeBagger algorithm in Statistics and Machine Learning Toolbox™. The TreeBagger algorithm will require data for supervised learning to train the models. Note that our example uses data for only three funds. Applying bagged decision trees or other machine learning methods to an actual problem would require considerably more data than this small, illustrative set.

We want to build a model to classify funds as fraudulent or nonfraudulent, applying the indicators described in the section “Analyzing the Returns Data” as predictor variables. To create the model, we need a training set of data. Let us consider M hedge funds that are known to be either fraudulent or nonfraudulent. We store this information in the M-by-1 vector yTrain and compute the corresponding M-by-N matrix xTrain of indicators, where N is the number of indicators. We can then create a bagged decision tree model using the following code:

% Create fraud detection model based on training data
fraudModel = TreeBagger(nTrees,xTrain,yTrain);

where nTrees is the number of decision trees created based on bootstrapped samples of the training data. The output of the nTrees decision trees is aggregated into a single classification.

Now, for a new fund, the classification can be performed by

% Apply fraud detection model to new data
isFraud = predict(fraudModel, xNew);
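
The training and prediction steps can be tried end to end on synthetic data, as in the sketch below. The data, the labeling rule, the class names, and the value of nTrees are all invented for illustration and say nothing about real funds. The sketch also requests out-of-bag predictions so that the classification error can be estimated without a separate test set.

```matlab
rng(1);  % make the synthetic data reproducible

% Synthetic training set (illustration only): M funds, N binary indicator flags
M = 200;
N = 6;
xTrain = double(rand(M, N) > 0.5);

% Assumed labeling rule for the synthetic data: many flags -> fraudulent
classNames   = {'nonfraudulent', 'fraudulent'};
isFraudTrain = sum(xTrain, 2) >= 4;
yTrain       = reshape(classNames(isFraudTrain + 1), [], 1);

% Train bagged trees and keep out-of-bag information for error estimation
nTrees     = 50;
fraudModel = TreeBagger(nTrees, xTrain, yTrain, 'OOBPrediction', 'on');
oobErr     = oobError(fraudModel);  % error using 1, 2, ..., nTrees trees

% Classify two hypothetical new funds
xNew = [1 1 1 1 1 0;
        0 0 1 0 0 0];
[label, score] = predict(fraudModel, xNew);

% For classification, predict returns labels as a cell array of character
% vectors; convert them to a logical flag
isFraud = strcmp(label, 'fraudulent');
```

The columns of score follow the order of fraudModel.ClassNames and can be used to rank funds by estimated fraud probability rather than relying on the hard classification alone.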

We can use the fraud detection model to classify hedge funds based purely on their returns data. Since the model is automated, it can be scaled to a large number of funds.

The Bigger Picture

This article outlines the process of developing a fully automated algorithm for fraud detection based on hedge fund returns. The approach can be applied to a much larger data set using large-scale data processing solutions such as MATLAB Parallel Server™ and Apache™ Hadoop®. Both technologies enable you to cope with data that exceeds the amount of memory available on a single machine.

The context in which the algorithm is deployed depends largely on the application use cases. Fund-of-funds managers working mostly with Excel might prefer to deploy the algorithm as an Excel add-in. They could use the module to investigate funds under consideration for future investments. Regulatory authorities could integrate a fraud detection scheme into their production systems, where it would periodically perform the analysis on new data, summarizing results in an automatically generated report.

We used advanced statistics to compute individual fraud indicators, and machine learning to create the classification model. In addition to the bagged decision trees discussed here, many other machine learning techniques are available in MATLAB, Statistics and Machine Learning Toolbox, and Deep Learning Toolbox™, enabling you to extend or alter the proposed solution according to the requirements of your project.

¹ Eurekahedge

² Bollen, Nicolas P. B., and Pool, Veronika K. “Suspicious Patterns in Hedge Fund Returns and the Risk of Fraud” (November 2011). https://www2.owen.vanderbilt.edu/nick.bollen/

Published 2014 - 92196v00
