Foundation Model fine-tuning: Named Entity Recognition
In this demo, we will fine-tune Llama 3.2 3B with Instruction Fine Tuning, specializing it to extract drug names from text. This task is called NER (Named Entity Recognition).
Fine tuning an open-source model on a medical Named Entity Recognition task will make the model output
1. More accurate, and
2. More efficient, reducing model serving expenses

Preparing our dataset
For simplicity in this example, we'll use an existing NER dataset from Huggingface. In commercial applications, it typically makes more sense to invest in data labeling to get enough samples to improve model performance.
Preparing your dataset for Instruction Fine Tuning is key. The Databricks Mosaic AI research team has published some [helpful guidelines](https://www.databricks.com/blog/limit-less-more-instruction-tuning) for developing a training data curation strategy.

Build our prompt template to extract entities
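A minimal sketch of such a prompt template, assuming we ask the model to answer with a JSON array of drug names (the exact wording used in the demo may differ):

```python
# Hypothetical system prompt for drug-name extraction; the demo's actual
# template may be worded differently.
SYSTEM_PROMPT = (
    "You are a medical NER assistant. Extract all drug names from the "
    "user's sentence and answer ONLY with a JSON array of strings."
)

def build_messages(sentence: str) -> list[dict]:
    """Wrap a sentence in the chat format expected by a chat endpoint."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": sentence},
    ]

messages = build_messages("The patient received cisplatin and etoposide.")
```

Using the chat format (rather than raw text) lets the serving layer apply the model's prompt template for us.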
Extracting our entities with a baseline version (non fine-tuned)
Let's start by performing a first entity extraction with our baseline, non fine-tuned model.
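Because a non-fine-tuned model often wraps its answer in extra prose, a small post-processing helper is useful when parsing the completions. This is an illustrative sketch, assuming the model was prompted to answer with a JSON array:

```python
import json
import re

def extract_drug_list(raw_completion: str) -> list[str]:
    """Pull the first JSON array out of a possibly noisy completion.

    Baseline models may surround the answer with extra text, so we
    search for the bracketed list instead of parsing the whole string.
    """
    match = re.search(r"\[.*?\]", raw_completion, re.DOTALL)
    if not match:
        return []
    try:
        return [d.strip().lower() for d in json.loads(match.group(0))]
    except json.JSONDecodeError:
        return []

drugs = extract_drug_list('Sure! Here they are: ["Cisplatin", "Etoposide"] Hope this helps.')
# drugs == ["cisplatin", "etoposide"]
```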
We will be using the same endpoint `dbdemos_llm_not_fine_tuned_llama3p2_3B` as in the previous [../02-llm-evaluation]($../02-llm-evaluation) notebook to reduce cost.
**Make sure you run that notebook first to set up the endpoint.**

Evaluating our baseline model
We can see that our model extracts a good number of entities, but it also sometimes adds random text before or after the answer.

Precision & recall for entity extraction
We'll benchmark our model by computing its precision and recall. Let's compute these values for each sentence in our test dataset.
*NOTE: Results will vary from run to run*
In this sample, the baseline LLM has a recall of about 0.9652, meaning it successfully identifies about 96.52% of all actual drug names present in the text. This metric is crucial in healthcare and related fields, where missing a drug name can lead to incomplete or incorrect information processing.
An average precision of 0.9174 means that when the baseline LLM identifies a token or a sequence of tokens as a drug name, about 91.74% of those identifications are correct.
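The per-sentence computation can be sketched as a simple set comparison between predicted and ground-truth drug names:

```python
def precision_recall(predicted: list[str], ground_truth: list[str]) -> tuple[float, float]:
    """Set-based precision and recall for one sentence.

    precision = correct predictions / all predictions
    recall    = correct predictions / all true drug names
    """
    pred, truth = set(predicted), set(ground_truth)
    tp = len(pred & truth)  # true positives
    precision = tp / len(pred) if pred else 1.0
    recall = tp / len(truth) if truth else 1.0
    return precision, recall

p, r = precision_recall(["cisplatin", "aspirin"], ["cisplatin", "etoposide"])
# p == 0.5 (1 of 2 predictions correct), r == 0.5 (1 of 2 true drugs found)
```

Averaging these per-sentence values over the test dataset gives the benchmark figures above.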
Fine-tuning our model
Fine tuning data preparation
Before fine-tuning, we need to apply our prompt template to the samples in the training dataset, and extract the ground truth list of drugs into the list format we are targeting.
We'll save this to our Databricks catalog as a table. Usually, this is part of a full Data Engineering pipeline.
Remember that this step is key for your fine-tuning: make sure your training dataset is of high quality!
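A sketch of turning labeled sentences into chat-formatted training rows (field names and the system prompt here are illustrative assumptions):

```python
import json

# Hypothetical system prompt; reuse the same one at training and inference time.
SYSTEM_PROMPT = "Extract all drug names and answer only with a JSON array."

def to_training_row(sentence: str, drugs: list[str]) -> dict:
    """Build one chat-formatted training example: the prompt as the user
    turn and the ground-truth drug list as the assistant answer."""
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": sentence},
            {"role": "assistant", "content": json.dumps(drugs)},
        ]
    }

row = to_training_row("Patients were given cisplatin.", ["cisplatin"])
```

In the demo, rows like this would then be written to a table in the Databricks catalog as part of the data pipeline.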
Prepare the eval dataset as well. We have the data available in `df_validation`
Fine-tuning
Once our data is ready, we can simply call the fine-tuning API.

Tracking model fine-tuning through your MLflow experiment
You can open the MLflow Experiment run to track your fine-tuning run. This is useful for deciding how to tune the training (e.g., add more epochs if your model is still improving at the end of the run).

Deploy Fine-Tuned model to serving endpoint
Post-fine-tuning evaluation
The fine-tuned model was registered to Unity Catalog and deployed to an endpoint with just a couple of clicks through the UI.

Benchmarking recall & precision
Let's now evaluate it again, comparing the new precision and recall against the baseline model.

Measuring token output
Let's see if our new model behaves as expected.
We also slightly reduced the output length, removing the extra text (and therefore cost) on top of improving accuracy!
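A rough way to quantify this saving is to compare output lengths between the two models. This sketch uses a whitespace split as a crude token proxy; a real comparison would use the model's tokenizer, and the outputs below are made-up examples:

```python
def approx_token_count(text: str) -> int:
    """Rough token estimate via whitespace split (illustrative proxy only)."""
    return len(text.split())

# Hypothetical outputs for the same sentence before and after fine-tuning.
baseline_out = 'Sure! Here are the drugs I found: ["cisplatin", "etoposide"] Let me know if you need anything else.'
finetuned_out = '["cisplatin", "etoposide"]'

saved = approx_token_count(baseline_out) - approx_token_count(finetuned_out)
```

Fewer output tokens per request translates directly into lower serving cost.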
Conclusion:
In this notebook, we saw how Databricks simplifies Fine Tuning and LLM deployment using Instruction Fine tuning for Entity Extraction.
We covered how Databricks makes it easy to evaluate our performance improvement between the baseline and fine tuned model.
Fine-tuning can be applied to a wide range of use cases. Using the Chat API simplifies fine-tuning, as the system formats the prompt for us out of the box; use it whenever you can!
License:
This demo leverages the following drug dataset:
```
@inproceedings{Tiktinsky2022ADF,
title = "A Dataset for N-ary Relation Extraction of Drug Combinations",
author = "Tiktinsky, Aryeh and Viswanathan, Vijay and Niezni, Danna and Meron Azagury, Dana and Shamay, Yosi and Taub-Tabib, Hillel and Hope, Tom and Goldberg, Yoav",
booktitle = "Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
month = jul,
year = "2022",
address = "Seattle, United States",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.naacl-main.233",
doi = "10.18653/v1/2022.naacl-main.233",
pages = "3190--3203",
}
```