Text normalization is the de facto standard for any Natural Language Processing task. Why? Raw text is unstructured data with a ton of dimensionality to it. What exactly does unstructured data mean? Well, it means there has been no concerted effort to organize the data into some kind of model or database structure. This makes sense when you consider that language itself is incredibly nuanced and can be hard to measure or standardize.
One example of normalizing text is removing emojis to make the text easier to process. But should you remove emojis? For example, if I were to write “I’m so happy for you! 🙄🙄🙄”, it means something completely different than “I’m so happy for you!”, or even “I’m so happy for you! 😡😡😡”. In the world of modern LLM application development, where the models have already been pre-trained and you are only enriching prompts with custom data (the Retrieval Augmented Generation technique), should you remove emojis? Well, let’s talk about it!
Don’t Lemmatize or Use Stemming
What are lemmatization and stemming? They are strategies for simplifying words by reducing each word to its base form. For example, “running” and “runs” can both become “run”. Although this process helps to structure and standardize your text, it can also strip away nuance and context, which is why I don’t recommend it. While developing our own custom AI chatbot, we ran an experiment using the Natural Language Toolkit (NLTK) library in Python to lemmatize text before uploading it to a vector database. After uploading it to the vector database, we asked our chatbot a series of automated, repeatable questions. Adding lemmatization decreased accuracy in answering those questions by approximately 7%.
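To see how reducing words to a base form can blur meaning, here is a toy sketch. It is a naive suffix stripper written only for illustration: it is not NLTK's stemmer or lemmatizer, and it is not the code from our experiment. The `SUFFIXES` tuple and the `toy_stem` and `toy_stem_text` names are hypothetical.

```python
# Toy suffix-stripping stemmer for illustration only.
# This is NOT the Porter algorithm and NOT NLTK's implementation.
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_stem_text(text: str) -> str:
    """Apply toy_stem to every whitespace-separated token."""
    return " ".join(toy_stem(token) for token in text.split())


# A noun ("the building") and a verb form ("she builds") collapse to the same string.
print(toy_stem("building"), toy_stem("builds"))
print(toy_stem_text("The building was scheduled"))
```

Once both words are stored as “build”, the embedding model can no longer tell the noun from the verb, which is the kind of lost context described above.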
Don’t Lowercase Everything
Case can carry important semantic meaning. I have two words for this one: proper nouns. If your data uses a lot of proper nouns, relies on user input that is not all the same case, or if the training data contains both lower and upper case, then you should not standardize the case of your text before uploading it to a vector database.
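One way to check whether lowercasing would hurt your data is to look for words that differ only by case. The sketch below is a minimal example under simple assumptions: it splits on whitespace only, and the `case_collisions` function and the sample sentences are hypothetical, made up for illustration.

```python
from collections import defaultdict


def case_collisions(texts):
    """Return tokens that appear with more than one casing.

    Keys are the lowercased tokens; values are the sorted original spellings.
    """
    forms = defaultdict(set)
    for text in texts:
        for token in text.split():
            forms[token.lower()].add(token)
    return {key: sorted(variants) for key, variants in forms.items() if len(variants) > 1}


sample = [
    "Apple released a new phone",
    "I ate an apple",
    "The US team won",
    "Join us tomorrow",
]
print(case_collisions(sample))
```

Here “Apple” the company and “apple” the fruit, and “US” the country and “us” the pronoun, would become indistinguishable after lowercasing. If a check like this turns up many such pairs in your own data, that is a sign to leave case alone.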
Do Deduplicate your Data
Duplicate data can cause bias. It can also increase your token usage. Both are bad. Ensuring you have the most relevant, clean, and meaningful data is one of the most crucial steps in developing a custom chatbot. Deduplicating our data increased our accuracy by approximately 14%.
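A simple way to deduplicate is to hash each chunk of text and keep only the first occurrence of each hash. This is a minimal sketch, not the exact process we used: the `deduplicate` function is a hypothetical name, and it only catches exact duplicates (after collapsing whitespace), not near-duplicates or paraphrases.

```python
import hashlib


def deduplicate(chunks):
    """Keep the first occurrence of each chunk, comparing on whitespace-normalized text."""
    seen = set()
    unique = []
    for chunk in chunks:
        # Collapse runs of whitespace for the comparison key only;
        # the stored chunk keeps its original formatting and case.
        normalized = " ".join(chunk.split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique


chunks = [
    "Reset your password from the login page.",
    "Reset your   password from the login page.",
    "Contact support if the reset email never arrives.",
]
print(deduplicate(chunks))
```

Storing a hash rather than the full text keeps the `seen` set small when chunks are long. Run this before you upload to the vector database, so duplicates are never embedded or retrieved in the first place.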
Do Remove PII (Personally Identifiable Information)
With the rise of LLMs, it’s important to ensure the data you use does not contain any information about people that is considered PII. For example, say you have internal employee documentation that you’d like to feed into your vector database, but someone noted their personal phone number for on-call support. Do you want to expose that person’s phone number to anyone who uses your LLM? I don’t think your coworker would be very happy about it. The variety of prompt injection attacks is effectively endless, and the best way to protect against them is simply not to expose any data that you don’t want public. Data that never reaches your vector database can’t be leaked, no matter how the prompt is manipulated.
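As a starting point, you can redact obvious PII such as phone numbers and email addresses with regular expressions before uploading. The sketch below is deliberately simplistic: the `EMAIL_RE` and `PHONE_RE` patterns and the `redact_pii` function are hypothetical, the phone pattern assumes US-style numbers, and it will miss many real-world formats as well as other kinds of PII like names and addresses. Treat it as a first pass, not a guarantee.

```python
import re

# Simplified patterns for illustration; they do not cover every valid format.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")


def redact_pii(text: str) -> str:
    """Replace email addresses and US-style phone numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


doc = "For on-call support, call 555-123-4567 or email jane.doe@example.com"
print(redact_pii(doc))
```

Replacing the value with a placeholder, rather than deleting it, keeps the sentence readable for the model while removing the sensitive part. It is still worth having a person review a sample of the redacted output.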
Do Ensure Data Is Ethical
The world is scary enough as it is; the last thing we need is for LLMs to reinforce that. Make sure the data you are using was obtained ethically and does not reinforce discrimination or bias. Let’s use AI to build a better world, not a worse one.