FSI & Banking platform with the Databricks Data Intelligence Platform - Real-time fraud detection






What is the Databricks Data Intelligence Platform for Banking?



It's the only enterprise data platform that lets you leverage all your data, from any source, on any workload, to optimize your business with real-time data at the lowest cost.

The Lakehouse allows you to centralize all your data, from customer and retail banking data to real-time fraud detection, providing operational speed and efficiency at a scale never before possible.

Simple


A single platform and governance/security layer for your data warehousing and AI, to **accelerate innovation** and **reduce risk**. No need to stitch together multiple solutions with disparate governance and high complexity.

Open


Built on open source and open standards. You own your data and prevent vendor lock-in, with easy integration with external solutions. Being open also lets you share your data with any external organization, regardless of their data stack/vendor.

Multicloud


One consistent data platform across clouds. Process your data wherever you need it.


Reducing fraud with the Lakehouse



Being able to collect and centralize information in real time is critical for the industry. Data is the key to unlocking capabilities such as real-time personalization and fraud prevention.

What we'll build



In this demo, we'll build an end-to-end Banking platform, collecting data from multiple sources in real time.

With this information, we'll not only be able to analyze past fraud and understand common patterns, but also rate the risk of financial transactions in real time.

Based on this information, we'll be able to proactively reduce fraud. A typical example is asking for an extra security challenge, or requiring human intervention, when the model returns a high fraud score.

At a very high level, this is the flow we'll implement:



1. Ingest and create our Banking database, with tables easy to query in SQL.
2. Secure the data and grant read access to the Data Analyst and Data Science teams.
3. Run BI queries to analyze past fraud.
4. Build ML models and deploy them to provide real-time fraud detection capabilities.

Our dataset



To simplify this demo, we'll assume that an external system periodically sends data to our blob storage (S3/ADLS/GCS):

- Banking transaction history
- Customer data (profile)
- Metadata table
- Report of past fraud (used as our label)

*Note that, at a technical level, our data could come from any source. Databricks can ingest data from any system (Salesforce, Fivetran, message queues like Kafka, blob storage, SQL & NoSQL databases...).*
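To make the setup concrete, here is a minimal sketch of how you could peek at those raw files from a Databricks notebook (where `spark`, `display`, and `dbutils` are predefined). The `/Volumes/main/fsi/raw` path and the file formats are hypothetical placeholders, not the demo's actual layout:

```python
# Hypothetical landing zone for the files sent by the external system.
raw_path = "/Volumes/main/fsi/raw"

# List the transaction history files that have landed so far.
display(dbutils.fs.ls(f"{raw_path}/transactions"))

# Preview a few customer profiles and past-fraud reports (our future label).
display(spark.read.json(f"{raw_path}/customers").limit(10))
display(spark.read.option("header", True).csv(f"{raw_path}/fraud_reports").limit(10))
```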

Let's see how this data can be used within the Lakehouse to analyze our customer transactions and detect potential fraud in real time.

1/ Ingesting and preparing the data (Data Engineering)









Our first step is to ingest and clean the raw data we receive so that our Data Analyst team can start running analysis on top of it.


Delta Lake



All the tables we'll create in the Lakehouse will be stored as Delta Lake tables. [Delta Lake](https://delta.io) is an open storage framework for reliability and performance.

It provides many capabilities *(ACID transactions, DELETE/UPDATE/MERGE, zero-copy clone, Change Data Capture...)*.

For more details on Delta Lake, run `dbdemos.install('delta-lake')`
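As a quick illustration of those capabilities, here is a hedged sketch of a few Delta operations, using a hypothetical `main.fsi_fraud.banking_transactions` table and placeholder column names:

```python
# DELETE / UPDATE run as ACID transactions directly on the Delta table.
spark.sql("DELETE FROM main.fsi_fraud.banking_transactions WHERE amount < 0")

# MERGE: upsert late-arriving corrections (a view named `corrections` is assumed to exist).
spark.sql("""
  MERGE INTO main.fsi_fraud.banking_transactions AS t
  USING corrections AS c
    ON t.transaction_id = c.transaction_id
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *
""")

# Time travel: read the table as it was at an earlier version.
previous_version = spark.read.option("versionAsOf", 10).table("main.fsi_fraud.banking_transactions")
```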

Simplify ingestion with Spark Declarative Pipelines (SDP)



Databricks simplifies data ingestion and transformation with Spark Declarative Pipelines, allowing SQL users to create advanced pipelines in batch or streaming. The engine simplifies pipeline deployment and testing and reduces operational complexity, so that you can focus on your business transformations and ensure data quality.

Open the FSI Banking & Fraud Spark Declarative Pipeline or the [SQL notebook]($./01-Data-ingestion/01.1-sdp-sql/01-SDP-fraud-detection-SQL) *(alternative: an SDP Python version will be available soon)*.

For more details on SDP: `dbdemos.install('pipeline-bike')` or `dbdemos.install('declarative-pipeline-cdc')`
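The demo's pipeline is defined in the SQL notebook above. Purely as an illustration of the declarative style (this is not the upcoming Python notebook), a hedged Python sketch with hypothetical table names, paths, and expectations could look like this:

```python
import dlt
from pyspark.sql import functions as F

# Bronze: incrementally ingest the raw transaction files with Auto Loader.
@dlt.table(comment="Raw banking transactions ingested from cloud storage")
def banking_transactions_bronze():
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/Volumes/main/fsi/raw/transactions"))   # placeholder landing path

# Silver: declare data-quality expectations; rows failing them are dropped and tracked.
@dlt.table(comment="Cleaned transactions ready for analysts")
@dlt.expect_or_drop("valid_amount", "amount > 0")
@dlt.expect_or_drop("valid_customer", "customer_id IS NOT NULL")
def banking_transactions_silver():
    return (dlt.read_stream("banking_transactions_bronze")
            .withColumn("amount", F.col("amount").cast("double")))
```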

2/ Securing data & governance (Unity Catalog)










Now that our first tables have been created, we need to grant our Data Analyst team READ access so they can start analyzing our banking data.

Let's see how Unity Catalog provides security & governance across our data assets, including data lineage and audit logs.

Note that Unity Catalog integrates Delta Sharing, an open protocol to share your data with any external organization, regardless of their stack. For more details: `dbdemos.install('delta-sharing-airlines')`

Open the [Unity Catalog notebook]($./02-Data-governance/02-UC-data-governance-ACL-fsi-fraud) to see how to set up ACLs and explore lineage with the Data Explorer.
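As a taste of what the notebook covers, here is a minimal sketch of the Unity Catalog grants (catalog, schema, and group names are placeholders for this illustration):

```python
# Allow the analyst and data-science groups to browse the catalog/schema and read its tables.
for group in ("analysts", "data-scientists"):
    spark.sql(f"GRANT USE CATALOG ON CATALOG main TO `{group}`")
    spark.sql(f"GRANT USE SCHEMA ON SCHEMA main.fsi_fraud TO `{group}`")
    spark.sql(f"GRANT SELECT ON SCHEMA main.fsi_fraud TO `{group}`")
```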

3/ Analyzing past fraud (BI / Data warehousing / SQL)












Our datasets are now properly ingested, secured, high-quality, and easily discoverable within our organization.

Data Analysts are now ready to run interactive BI queries with low latency and high throughput, including on serverless data warehouses that provide instant start and stop.

Let's see how data warehousing can be done with Databricks, including integration with external BI tools like Power BI, Tableau, and others!

Open the [Data warehousing notebook]($./03-BI-data-warehousing/03-BI-Datawarehousing-fraud) to start running your BI queries, or directly open the Banking Fraud Analysis dashboard.
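For a flavour of the analysis, here is the kind of SQL an analyst might run; table and column names are hypothetical placeholders, and in the demo this would run on a SQL warehouse or behind the dashboard:

```python
# Fraud rate per originating country, from the curated transactions table (placeholder names).
display(spark.sql("""
  SELECT country_orig,
         COUNT(*)                                                   AS transactions,
         SUM(CASE WHEN is_fraud THEN 1 ELSE 0 END)                  AS fraud_count,
         ROUND(100 * AVG(CASE WHEN is_fraud THEN 1 ELSE 0 END), 2)  AS fraud_rate_pct
  FROM main.fsi_fraud.banking_transactions_silver
  GROUP BY country_orig
  ORDER BY fraud_rate_pct DESC
"""))
```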

4/ Predict Fraud risk with Data Science & AutoML












Being able to run analysis on our past data already gives us a lot of insight to drive our business: we can better understand our customer data and past fraud.

However, knowing where we had fraud in the past isn't enough. We now need to take it to the next level and build a predictive model to detect potential threats before they happen, reducing our risk and increasing customer satisfaction.

This is where the Lakehouse brings value: within the same platform, anyone can start building ML models to run such analysis, including low-code solutions such as AutoML.

ML: Fraud Detection Model Training


Let's see how Databricks accelerates ML projects with AutoML: one-click model training with the [04.1-AutoML-FSI-fraud]($./04-Data-Science-ML/04.1-AutoML-FSI-fraud) notebook.
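For reference, here is a hedged sketch of what such a run looks like with the AutoML Python API; the feature table and label column names are placeholders, not the demo's exact schema:

```python
from databricks import automl

# Features joined from the transaction, customer, and fraud-report tables (placeholder name).
training_df = spark.table("main.fsi_fraud.transactions_features")

# One call trains and evaluates many candidate models, logging everything to MLflow.
summary = automl.classify(
    dataset=training_df,
    target_col="is_fraud",       # the past-fraud label from our ingested reports
    timeout_minutes=20,
)

print(summary.best_trial.model_path)   # best model, ready to be registered for serving
```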

ML: Real-time Model Serving



Once our model is trained and available, Databricks Model Serving can be used to serve real-time inferences, enabling fraud detection as transactions happen.

Review the [04.3-Model-serving-realtime-inference-fraud]($./04-Data-Science-ML/04.3-Model-serving-realtime-inference-fraud) notebook.
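To give an idea of the real-time flow, here is a hedged sketch of scoring one transaction against a Model Serving endpoint over REST; the endpoint name, feature names, workspace URL, and score threshold are all placeholders:

```python
import os
import requests

workspace_url = "https://<your-workspace>.cloud.databricks.com"   # placeholder workspace URL
endpoint_name = "fsi_fraud_detection"                              # placeholder endpoint name
token = os.environ["DATABRICKS_TOKEN"]                             # PAT or service-principal token

payload = {"dataframe_records": [{
    "amount": 1250.0,
    "country_orig": "US",
    "country_dest": "FR",
    "is_unauthorized_overdraft": 0,
}]}

response = requests.post(
    f"{workspace_url}/serving-endpoints/{endpoint_name}/invocations",
    headers={"Authorization": f"Bearer {token}"},
    json=payload,
)
fraud_score = response.json()["predictions"][0]

# Illustrative decision logic: escalate when the model scores high.
if fraud_score > 0.8:
    print("High risk: trigger an extra security challenge or human review")
```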

5/ Deploying and orchestrating the full workflow









While our data pipeline is almost complete, we're missing one last step: orchestrating the full workflow in production.

With the Databricks Lakehouse, there is no need to manage an external orchestrator to run your jobs. Databricks Workflows simplifies all your jobs, with advanced alerting, monitoring, branching options, and more.

Open the [workflow and orchestration notebook]($./06-Workflow-orchestration/06-Workflow-orchestration-fsi-fraud) to schedule our pipeline (data ingestion, model retraining, etc.).
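The notebook configures this job through Workflows. Purely as an illustration, the same two-step job could also be declared with the Databricks SDK; notebook paths and task names below are placeholders, and serverless job compute is assumed:

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()   # picks up the workspace URL and token from the environment

job = w.jobs.create(
    name="fsi-fraud-detection",
    tasks=[
        jobs.Task(
            task_key="ingest_data",
            notebook_task=jobs.NotebookTask(notebook_path="/Repos/fsi/01-SDP-fraud-detection-SQL"),
        ),
        jobs.Task(
            task_key="retrain_model",
            depends_on=[jobs.TaskDependency(task_key="ingest_data")],
            notebook_task=jobs.NotebookTask(notebook_path="/Repos/fsi/04.1-AutoML-FSI-fraud"),
        ),
    ],
)
print(f"Created job {job.job_id}")
```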

Conclusion



We demonstrated how to implement an end-to-end pipeline with the Lakehouse, using a single, unified and secure platform:

- Data ingestion
- Data analysis / DW / BI
- Data science / ML
- Workflow & orchestration

As a result, our analyst team was able to easily build a system that not only analyzes past fraud but also predicts fraudulent transactions in real time, so the business can take action accordingly.

This was only an introduction to the Databricks Platform. For more details, contact your account team and explore more demos with `dbdemos.list()`.