Detecting Collusion, Corruption, and Fraud

by Carlos Petricioli

The World Bank Group lends billions of dollars each year to fund development projects in its efforts to reduce global poverty. My team and I are helping investigators at the Bank search for patterns of collusion, corruption, and fraud in its contracts data, using models of contract-specific risk. Developing an automated approach to detecting these offenses can help the World Bank efficiently target future investigations.

Contractors providing goods and services on World Bank projects are typically hired through a competitive bidding process. Occasionally, prospective contractors influence the competitive system by colluding with other contractors, bribing government officials, or otherwise manipulating the bidding process. These offenses have far-reaching effects on the price and quality of contract delivery. The World Bank is committed to detecting instances of collusion, corruption, and fraud in order to maximize its global impact.

Early this summer, my team and I met with the World Bank team in charge of attacking this problem. Our first objective was to understand what corruption looks like in the data. Their suggestion was to look for specific patterns in the procurement data. For example, turn-taking behavior among suppliers of goods and services is a possible indicator of collusion, and patterns of non-competitive bidding, such as one supplier winning all the contracts, are a possible indicator of corruption, among other indicators.
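To make the "one supplier winning all the contracts" indicator concrete, here is a minimal sketch that computes, for each market, the share of contracts won by the most frequent winner. It is an illustration rather than the indicator we actually built: the function name, the idea of a "market" key (for example a country and sector pair), and the sample rows are all assumptions.

```python
from collections import Counter, defaultdict

def top_supplier_share(contracts):
    """Return {market: (top_supplier, share_of_contracts)} for each market.

    `contracts` is an iterable of (market, supplier) pairs. Both fields are
    hypothetical stand-ins for columns in the procurement data.
    """
    wins = defaultdict(Counter)
    for market, supplier in contracts:
        wins[market][supplier] += 1

    result = {}
    for market, counts in wins.items():
        supplier, n_wins = counts.most_common(1)[0]
        result[market] = (supplier, n_wins / sum(counts.values()))
    return result

# Made-up rows, only to show the shape of the input.
sample = [
    ("roads/X", "A"), ("roads/X", "A"), ("roads/X", "A"), ("roads/X", "B"),
    ("water/Y", "C"), ("water/Y", "D"),
]
shares = top_supplier_share(sample)
```

A share close to 1.0 in a market with many contracts would be one signal worth a closer look. On its own it does not show wrongdoing, since some markets simply have few qualified suppliers.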

We incorporated data from multiple sources. We used historical data on over 300,000 major contracts funded by World Bank loans from the past 20 years, including features such as company name, country, sector, and total award amount. To classify each contract more accurately, we needed additional features, so we incorporated annual economic development indicators, collected by the World Bank, for countries and the industries within them. Finally, the World Bank gave us investigations data covering companies and projects investigated for collusion, corruption, or fraud in past years, including specific allegations and case outcomes.

The first big problem we faced after cleaning all the data was that company names are represented by different text strings across our data sources, so a single company may appear in several very different ways (e.g. ACME Inc. vs. A.C.M.E. Co.). This mattered because, to build a model that predicts the risk of corruption or fraudulent activity for a contract, we need each company to be identified consistently across the data. Our solution was company name disambiguation. We reconciled company names by querying each name on Google and comparing the top 10 URL results. Names that had at least 7 links in common were considered to be a single company. This was a computationally complicated task because of the size of our data. Google does not like this kind of heavy use of its resources, so we had to create a large number of virtual machines, query Google for the URLs from each machine, and then gather all the results in a database. The result of this process was good enough for the World Bank team, because the disambiguated names give them a better way of investigating companies.
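The matching rule can be sketched in a few lines. This is a simplified illustration, not our production code: it assumes the top 10 URLs for each name have already been collected into a dictionary, and it merges matches transitively with a small union-find, which is an assumption of the sketch. The function name, the `min_shared` parameter, and the example URLs are made up.

```python
from itertools import combinations

def group_names(top_urls, min_shared=7):
    """Group names whose top-URL sets share at least `min_shared` links.

    `top_urls` maps each company name to its list of top search-result URLs
    (assumed to be collected beforehand). Matches are merged transitively.
    """
    parent = {name: name for name in top_urls}

    def find(name):
        # Follow parent links to the group representative (path halving).
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    url_sets = {name: set(urls) for name, urls in top_urls.items()}
    for a, b in combinations(url_sets, 2):
        if len(url_sets[a] & url_sets[b]) >= min_shared:
            parent[find(a)] = find(b)

    groups = {}
    for name in top_urls:
        groups.setdefault(find(name), set()).add(name)
    return list(groups.values())

# Made-up URLs: the two ACME spellings share 7 of their 10 results.
shared = [f"https://example.com/acme/{i}" for i in range(7)]
results = {
    "ACME Inc.": shared + [f"https://example.com/a/{i}" for i in range(3)],
    "A.C.M.E. Co.": shared + [f"https://example.com/b/{i}" for i in range(3)],
    "Globex Corp.": [f"https://example.com/globex/{i}" for i in range(10)],
}
groups = group_names(results)
```

Comparing every pair of names is quadratic, which is fine for a sketch but too slow for hundreds of thousands of names. At that scale, one option is to index names by URL first and only compare names that share at least one link.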

After long nights of waiting for this algorithm to finish, we were finally ready to build tools for proactive investigation within the World Bank. To evaluate contract risk, we generated features and models tracking companies’ historical involvement in World Bank projects within specific countries and sectors. We also created co-award network features for each company. For example, below you can see the network for General Electric Company. Every project that General Electric Company has been part of is shown in blue, and every company that worked on each of those projects is shown in green. This is a public version of the network. In the version we delivered to the World Bank, companies are colored differently depending on whether they were investigated and found guilty. Network features turned out to be very good predictors of risk in our final model.
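As a rough sketch of what a co-award network feature can look like, the snippet below builds the project-to-company links from a list of awards and counts, for each company, its projects and its distinct co-awardees. The two feature names and the sample awards are illustrative assumptions, not the exact features in our model.

```python
from collections import defaultdict

def co_award_features(awards):
    """Compute simple network features from (project, company) pairs.

    The field names and the two features returned here are hypothetical
    examples of co-award features.
    """
    companies_by_project = defaultdict(set)
    projects_by_company = defaultdict(set)
    for project, company in awards:
        companies_by_project[project].add(company)
        projects_by_company[company].add(project)

    features = {}
    for company, projects in projects_by_company.items():
        # Collect every company that shares at least one project.
        partners = set()
        for project in projects:
            partners |= companies_by_project[project]
        partners.discard(company)
        features[company] = {
            "n_projects": len(projects),
            "n_co_awardees": len(partners),
        }
    return features

# Made-up awards: Company A shares P1 with B and P2 with C.
awards = [
    ("P1", "Company A"), ("P1", "Company B"),
    ("P2", "Company A"), ("P2", "Company C"),
    ("P3", "Company C"),
]
features = co_award_features(awards)
```

The same two dictionaries are enough to draw the immediate neighborhood of a company: its projects, and the other companies attached to each of them.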

In terms of the model, we trained a binary classifier separating past contracts that were investigated by the World Bank from those that were not. We evaluated and compared models using precision, recall, and area under the ROC curve. A random forest provided the best results across all metrics on a held-out test set.
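For readers less familiar with these metrics, here is a small dependency-free sketch of how they are computed from true labels (1 = investigated, 0 = not) and model scores. The labels, scores, and the 0.5 threshold are made up for illustration and are not results from our model.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels (1 = positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def roc_auc(y_true, scores):
    """Area under the ROC curve: the probability that a random positive
    scores higher than a random negative, counting ties as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and scores for six contracts.
y_true = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # illustrative threshold

precision, recall = precision_recall(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

Precision and recall depend on the chosen threshold, while the area under the ROC curve summarizes ranking quality across all thresholds, which is why it helps to look at all three together.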

Finally, we developed an interactive dashboard for World Bank investigators to track a company’s activity across countries, sectors, and time. Using this tool, investigators can track the contract awards a company has received, including under different names (e.g. ACME, Inc. vs. ACME Co.), view a risk score for each World Bank contract as calculated by our contract risk model, and visualize the immediate neighborhood of the company in its co-award network.