Under the hood: The secret science of predicting credit risk

Anna Catherine / Content Specialist / 2022-06-29


Financial inclusion has long been viewed through a welfare lens. Lenders, particularly banks, saw it mostly as an obligation imposed upon them by the RBI. But that’s changing, thanks to credit scoring models driven by alternative data. Risk engines powered by such credit-scoring models help lenders develop a highly nuanced understanding of customers, including those with little to no credit history.

This has brought about a paradigm shift in lending to the bottom of the pyramid: financial inclusion, once considered an imperative duty, is now seen as a massive business opportunity. This shift in attitude could unlock trillions of dollars in GDP; the World Economic Forum estimates the figure at $3.7 trillion by 2025.

The world is betting on machine learning (ML) models and data to transform the unit economics of the current lending setup and make ‘financial inclusion’ a lucrative business. But what are these alternative-data-driven credit-decisioning models? What goes into their making? How do they work? Why do lenders and platforms need them? Let’s explore these questions and understand the fundamentals of these ML models.

What are alternative credit-decisioning models?

Credit risk modelling is a technique used to determine the level of risk associated with extending credit to a borrower. Data is the fuel that runs these credit-decisioning engines. What differentiates next-gen models from traditional ones is their use of big data, machine learning, and automation.

Take the case of our risk intelligence product FinBox DeviceConnect to better understand how it works. 

It’s a fully automated risk engine embedded in the business rule engine (BRE) to underwrite digitally acquired customers — a completely straight-through processing system that enables the BRE to generate loan offers instantly. The level of intelligence gathering it undertakes can be unfathomable for the uninitiated, so here’s a whitepaper you can download to get the full picture.

To put it simply, it extracts the applicant’s device data (in an anonymised format and with the applicant’s explicit consent) to analyse risk and decide on the loan amount, tenure, EMI affordability of the applicant, and more. But how does it arrive at these decisions? How is the brain of the risk engine wired to gauge how risky a particular applicant is?

To understand this, let’s delve into the fundamental mathematics that governs ML-based credit risk models. Let’s look at four machine learning techniques commonly used for building credit risk models.

  • K-Nearest Neighbours

  • Logistic Regression

  • Decision Trees

  • Neural Networks

Machine learning models are trained on historical data. For instance, FinBox’s underwriting model is trained on a historical dataset benchmarked on 16 million customers, predominantly new-to-credit customers. What does this mean?

Imagine it like this: give a rock to a geologist, and they will quickly tell you what rock it is without having to compare its attributes with every other rock they know of. This is because the geologist is well-versed in rocks and their attributes. Similarly, our credit risk models are trained on data from over 16 million customers, so they can tell you almost instantly what type of customer the applicant is. How do they do that? The answer lies in machine learning models.

K-Nearest Neighbours (KNN)

This model can be used to solve both classification and regression problems. It classifies borrowers using proximity: an applicant is assumed to behave like the past borrowers who most resemble them. Basically, the idea behind KNN is, show me your friends and I’ll tell you who you are.

To simplify matters, let’s assume that the database has only two features: age and income for this and all the subsequent models. Then it would look something like the graph below; sky blue dots represent all those who paid back the loan and the yellows indicate those who didn't. The indigo blue dot is the applicant you want to assess. You then circle in the nearest neighbours of the applicant. 

In this case, 80% of them (4 people) paid back and 20% (1 person) defaulted. Now you may ask why pick only 5 neighbours and not 10. The number 5 is not an arbitrary value but one arrived at through testing. The model is tried with many different values, each evaluated by how often its predictions turned out to be correct, and the best-performing value is kept as the final optimised number.
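The neighbour vote and the search for a good number of neighbours can be sketched in a few lines of Python. This is a minimal illustration, not FinBox's implementation: the eight past borrowers in `HISTORY`, the applicant's age and income, and the helper names `knn_vote`, `predict`, and `leave_one_out_accuracy` are all made up for the example, and the two features are left unscaled for simplicity.

```python
import math
from collections import Counter

# Hypothetical toy history: (age in years, monthly income in thousands of rupees, outcome).
HISTORY = [
    (25, 18, "repaid"),
    (27, 22, "repaid"),
    (30, 25, "repaid"),
    (31, 20, "repaid"),
    (29, 15, "defaulted"),
    (45, 60, "repaid"),
    (50, 12, "defaulted"),
    (22, 8, "defaulted"),
]


def knn_vote(history, applicant, k=5):
    """Return the share of each outcome among the k nearest past borrowers."""
    # Sort past borrowers by straight-line distance to the applicant.
    by_distance = sorted(history, key=lambda row: math.dist(row[:2], applicant))
    counts = Counter(outcome for _, _, outcome in by_distance[:k])
    return {outcome: n / k for outcome, n in counts.items()}


def predict(history, applicant, k):
    """Predict the most common outcome among the k nearest neighbours."""
    shares = knn_vote(history, applicant, k)
    return max(shares, key=shares.get)


def leave_one_out_accuracy(history, k):
    """Hide each borrower in turn and check whether the rest predict them correctly."""
    hits = 0
    for i, (age, income, outcome) in enumerate(history):
        rest = history[:i] + history[i + 1:]
        hits += predict(rest, (age, income), k) == outcome
    return hits / len(history)


shares = knn_vote(HISTORY, applicant=(28, 20), k=5)
print(shares)  # 4 of the 5 nearest neighbours repaid, 1 defaulted

# Try a few candidate values of k and keep the one that predicts best.
best_k = max((1, 3, 5), key=lambda k: leave_one_out_accuracy(HISTORY, k))
print(best_k)
```

In practice the features would be put on a comparable scale first, so that income measured in thousands does not drown out age when distances are computed.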

The above figure is a 2D graph as it’s based on 2 features, which makes it easy to visualise. Had there been 4 features, say, age, income, expenditure, and location, the data would look something like this.


Now imagine 5000 such features! As you begin to visualise it you will soon realise you have run out of dimensions to plot your variables on. The above prototype has been depicted to merely help you make sense of the operations that run in the backend and give you a feel for what starts to happen as your input space becomes multidimensional. 
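Although 5,000 dimensions cannot be drawn, the arithmetic behind "nearness" does not change. The short sketch below, using two randomly generated and entirely hypothetical applicants, computes the same straight-line distance one would measure on the 2D graph, only across 5,000 features.

```python
import math
import random

random.seed(0)  # fixed seed so the illustration is repeatable
N_FEATURES = 5000

# Two hypothetical applicants, each described by 5,000 made-up feature values in [0, 1].
applicant_a = [random.random() for _ in range(N_FEATURES)]
applicant_b = [random.random() for _ in range(N_FEATURES)]

# Same formula as in 2D: square root of the sum of squared differences.
distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(applicant_a, applicant_b)))
print(f"distance across {N_FEATURES} features: {distance:.2f}")
```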

Logistic regression

Logistic regression estimates the probability of an event occurring, given input variables. The most common form of logistic regression models a binary outcome: is he lying or not, did she vote or not, would he default on the loan or not. Let’s explore how it works in credit risk modelling.

In the graph above, yellow dots (people who defaulted on a loan) are placed at 0 on the vertical axis, since those loans were not paid back. Similarly, those who paid back loans taken in the past (sky blue dots) are placed at 1, as it can be said that the probability of them paying back their loan was 1. This is just a graphical way of saying that something definitely happened or didn’t happen.

You may ask, what’s the curve that goes through the middle? This is where things get really interesting. Using some very cool maths that we won’t dive into here, the logistic regression algorithm calculates a curve that looks something like the one in the figure above.

When you have a new loan application, you just need to find where the applicant (indigo blue dot) falls on the curve. The indigo blue dotted line tells us the probability of a new person with a specific income paying back their loan; the probability of them defaulting is simply one minus that value.
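For readers who want a peek at the maths after all, here is a minimal sketch of how such a curve can be fitted. It assumes a single feature (income), ten made-up past borrowers, and plain gradient descent; the names `sigmoid` and `fit`, the learning rate, and the number of steps are illustrative choices rather than anything taken from FinBox's models.

```python
import math

# Hypothetical history: monthly income (thousands of rupees) and outcome (1 = repaid, 0 = defaulted).
INCOMES = [8, 12, 15, 18, 20, 22, 25, 30, 40, 60]
REPAID = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]


def sigmoid(z):
    """The S-shaped curve: squashes any number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit(xs, ys, lr=1.0, steps=5000):
    """Fit P(repaid | income) = sigmoid(w * scaled_income + b) by gradient descent."""
    mean = sum(xs) / len(xs)
    scale = max(xs) - min(xs)  # rescale incomes so the steps stay well-behaved
    w = b = 0.0
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            z = (x - mean) / scale
            error = sigmoid(w * z + b) - y  # predicted probability minus actual outcome
            grad_w += error * z
            grad_b += error
        w -= lr * grad_w / len(xs)
        b -= lr * grad_b / len(xs)
    return lambda income: sigmoid(w * (income - mean) / scale + b)


repay_probability = fit(INCOMES, REPAID)

p_low = repay_probability(10)   # an applicant earning 10k a month
p_high = repay_probability(50)  # an applicant earning 50k a month
print(f"P(repay | 10k) = {p_low:.2f}, P(default | 10k) = {1 - p_low:.2f}")
print(f"P(repay | 50k) = {p_high:.2f}, P(default | 50k) = {1 - p_high:.2f}")
```

Reading a value off the fitted function is the code equivalent of following the dotted line from the applicant's income up to the curve.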

Decision trees

A decision tree is a model that tests one attribute at each node and branches out based on the result of that test.

It’s something like the ‘20 questions game’ to guess the animal. For example, you ask if it’s a wild animal or not and if it is, you have successfully eliminated 96% of the animal world, for only 4% live in the wild. Then you go on to ask more questions that will help you finally identify the animal. This exercise will look something like the depiction below.

This is how it works in credit risk modelling as well. Let’s say you want to classify your historical dataset into groups based on different parameters, say income and age, as in the figure below.

You split the dataset into two: those with income less than ₹20k go into Group 1 and the rest into Group 2. You then split each of these further based on age, into those below 28 years of age and those above (as in the figure above). Having classified your dataset, you may now want to know which group the loan applicant belongs to. In this case, the applicant (indigo blue dot) falls into Group 4. If this were to be represented as a decision tree, it would look like the figure below.

You can keep splitting it further based on various other parameters until your applicant is narrowed down to a small group where the outcome (default or no default) becomes reasonably predictable.
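The two splits described above can be written out as a tiny hand-built tree. The sketch below is illustrative only: the ₹20k and 28-year thresholds come from the text, while the ten past borrowers, the applicant, the choice to treat exactly 28 as "28 or older", and the helper names `leaf` and `default_rates` are assumptions. Leaves are labelled descriptively rather than by group number, since the numbering depends on the figure.

```python
from collections import defaultdict

INCOME_SPLIT_K = 20  # income threshold from the text, in thousands of rupees
AGE_SPLIT = 28       # age threshold from the text


def leaf(age, income_k):
    """Walk the two-level tree: first the income test, then the age test."""
    income_side = "income < 20k" if income_k < INCOME_SPLIT_K else "income >= 20k"
    age_side = "age < 28" if age < AGE_SPLIT else "age >= 28"
    return f"{income_side}, {age_side}"


# Hypothetical history: (age, monthly income in thousands, defaulted? 1 = yes, 0 = no).
HISTORY = [
    (22, 8, 1), (25, 15, 1), (26, 18, 0),
    (35, 12, 1), (40, 18, 0),
    (24, 30, 0), (27, 22, 1),
    (33, 45, 0), (45, 60, 0), (50, 25, 0),
]


def default_rates(history):
    """Share of past borrowers who defaulted in each leaf of the tree."""
    totals = defaultdict(int)
    defaults = defaultdict(int)
    for age, income_k, defaulted in history:
        key = leaf(age, income_k)
        totals[key] += 1
        defaults[key] += defaulted
    return {key: defaults[key] / totals[key] for key in totals}


rates = default_rates(HISTORY)
applicant_leaf = leaf(age=30, income_k=35)
print(applicant_leaf, rates[applicant_leaf])
```

A real tree-learning algorithm would choose the split attributes and thresholds from the data instead of having them hard-coded.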

Neural Networks

The structure of neural networks is inspired by the human brain. Artificial neural networks (ANNs) comprise layers of nodes: an input layer that receives the data, one or more hidden layers, and an output layer (see the figure below).

Each node, or artificial neuron, is connected to other nodes and the relationship is defined by a specific weight and threshold. If the output of any individual node is above the specified threshold value, that node is activated, sending data to the next layer of the network. Otherwise, no data is passed along to the next layer of the network. (IBM)
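The weight-and-threshold behaviour described above can be sketched as follows. This is a deliberately simplified picture: the weights and thresholds are hand-picked and hypothetical (a real network learns them from data), the two inputs are made-up scaled values, and the function names `node_output` and `layer_output` are illustrative.

```python
def node_output(inputs, weights, threshold):
    """Pass the weighted sum onward only if it clears the node's threshold."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return weighted_sum if weighted_sum > threshold else 0.0


def layer_output(inputs, layer):
    """Evaluate every node in a layer; each node is a (weights, threshold) pair."""
    return [node_output(inputs, weights, threshold) for weights, threshold in layer]


# Hypothetical, hand-picked numbers for illustration only.
applicant = [0.3, 0.8]  # made-up, already-scaled age and income
hidden_layer = [([0.5, 1.0], 0.5), ([-1.0, 0.2], 0.0)]
output_node = ([1.0, 1.0], 0.5)

hidden = layer_output(applicant, hidden_layer)  # the second node stays inactive
score = node_output(hidden, *output_node)
print(hidden, score)
```

Here the first hidden node fires because its weighted sum clears the threshold, while the second passes nothing along.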

Neural networks make for a fascinating algorithm, but they are highly complex and therefore hard to explain. Jeremy Mahoney, however, breaks it down beautifully for the layman; read it here.

Hybrid: Best of all worlds

At FinBox, our data scientists have gone to great lengths to combine the best of all decisioning models to build a hybrid credit-decisioning model. Different types of machine learning algorithms are grouped together to build FinBox DeviceConnect, which solves the classification problem during underwriting for our partner lenders and platforms. If you would like to know more about how FinBox DeviceConnect segments customers, read a detailed explanation here.
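How the models are combined inside FinBox DeviceConnect is not described here. One common and simple way to blend several models, shown below purely as an assumption, is to average the repayment probabilities they each produce. The three stand-in models return fixed, made-up scores, and `blended_repayment_probability` is a hypothetical helper name.

```python
def blended_repayment_probability(models, applicant):
    """Average the repayment probabilities returned by several models."""
    scores = [model(applicant) for model in models]
    return sum(scores) / len(scores)


# Hypothetical stand-ins for trained models; each returns a fixed, made-up probability.
def knn_like(applicant):
    return 0.8


def logistic_like(applicant):
    return 0.7


def tree_like(applicant):
    return 0.9


blended = blended_repayment_probability(
    [knn_like, logistic_like, tree_like],
    applicant={"age": 28, "income_k": 20},
)
print(f"blended repayment probability: {blended:.2f}")
```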

Why do lenders need alternative credit models?

Having explained the mechanics of these credit-decisioning models, let’s come to the why of it: why do lenders and platforms even need these models in the first place?

According to McKinsey research, the average bank with €50 billion in assets from small and medium-size enterprises (SMEs) could see €100 million to €200 million of additional profit with improved credit-decisioning models. This is the profit side of the argument. 

As the space becomes more democratised with ‘open banking’, banks and large NBFCs cannot play the incumbent card for long. Unless traditional lenders start rolling with the punches, they will have to cede a sizable portion of their client base to challenger banks and fintechs. Hence, adopting these models is essential as part of a broad business strategy.

In other words, using new credit-decisioning models is not only a way to boost profits but also a business-critical competitive imperative. With our automated credit-decisioning models, we enable lenders to tap new data sources, gain insights into customer behaviour, open up to new customer segments, and react faster to changes in the business environment.