GenAI Guide on Data Preparation Part 1: Introduction

In our previous Generative AI series, we talked a lot about how fine tuning works, along with different approaches to fine tuning (by way of introduction). However, before we even get to this step, we need to decide what data we are going to fine tune our foundational model with and how we are going to prepare it. The truth is, there are no flawless data sets. But striving to make them as clean as possible is the key to success. That’s why data preparation is often said to take up to 80% of a data science project’s time.

Before we get into the weeds of this series, let’s provide some historical context.

In 2014, Amazon initiated the development of an experimental recruitment tool driven by machine learning (ML). Similar to Amazon’s rating system, this tool aimed to assign scores ranging from 1 to 5 stars to job applicants based on resume screening. While the idea seemed promising, it turned out that the ML model had a preference for men. Resumes containing the term “women’s,” such as “women’s softball team captain,” were penalized. In 2018, Reuters reported that Amazon eventually discontinued the project.

The question arises: How did Amazon’s ML model end up being biased? There are several potential reasons for this, including:

  • AI gone rogue?
  • Inexperienced data scientists?
  • Faulty data sets?

While these factors may contribute to bias, data plays a significant role in determining the success or failure of projects. In the case of Amazon, the models were trained on a dataset consisting mostly of resumes submitted by men over a period of 10 years. This raises an important question: How is data prepared for machine learning, and in our context, for Generative AI?

How much data is enough to train a good model?

Everything starts with careful planning and problem formulation, especially when it involves harnessing the power of machine learning. This process is not too different from making any other business decision. As you embark on constructing a training data set, you encounter the first hurdle: determining the optimal amount of data required to train a high-quality model. Should you settle for just a few samples? Or do you need thousands, or perhaps even more? Unfortunately, there is no one-size-fits-all formula that can precisely calculate the ideal dataset size for a machine learning model. Numerous factors come into play, ranging from the specific problem you aim to solve to the learning algorithm employed within the model.

As a general guideline, it is advisable to gather as much data as possible, as it becomes challenging to predict which specific data samples and how many of them will bring the most value to the model. In other words, a large volume of training data is typically beneficial. However, the term “a lot” might seem rather vague and open to interpretation, leaving room for uncertainty.

To provide a clearer understanding, let’s explore a few real-life examples. You’re probably familiar with Gmail, the email service from Google. One of its intelligent features is the smart reply suggestions, which conveniently generate short responses for users. To achieve this functionality, the Google team gathered and processed a training set comprising 238 million sample messages, both with and without responses. On the other hand, consider Google Translate, a project that required trillions of examples to accomplish its goals. But what about Generative AI?

Data Quality

This is often cited as the most important element by many in the field of Generative AI. Data quality trumps parameter size every time. This has been true long before Generative AI took off.

The quality of the data you feed into machine learning models is crucial. Quality data is accurate, relevant, complete, and free from bias. If the data is full of inaccuracies, irrelevant features, missing values, or biases, even the best machine learning models and algorithms will struggle to produce meaningful results. This is encapsulated in the saying “garbage in, garbage out.”

Even a smaller dataset of high-quality data can often outperform a larger dataset of poor-quality data. Let’s consider a scenario where you’re developing a model to detect fraudulent transactions. A smaller dataset of well-labeled, accurate, and varied examples of both fraudulent and non-fraudulent transactions would be more valuable than a larger dataset with inaccurate labels, irrelevant features, or a lack of variety in the transaction examples.

However, a high-quality dataset alone is not enough; it also needs to be representative of the problem space. For example, a high-quality dataset for fraudulent transaction detection would need to include examples of various types of fraudulent transactions, not just a single type.

On the other hand, quantity of data has its place. More data can lead to more robust and generalizable models, especially when the data is diverse and represents a broad spectrum of the problem space. This is particularly true for deep learning models, such as large language models, which often perform better with more data.

In conclusion, while both quality and quantity are important, the quality of data should be the primary focus when developing machine learning models. It’s better to have a smaller set of high-quality, representative data than a large volume of poor-quality or non-representative data. However, once quality is ensured, increasing the quantity of data can further improve the model’s performance.

What questions and methodologies should I be using to ensure I’m using the right data?

Here are some questions and considerations that might help, using sales forecasting as an example:

1. Understanding your business needs and goals:

    • What is the business problem you’re trying to solve?
    • What kind of sales are you trying to predict (e.g., retail, e-commerce, B2B, etc.)?
    • What do you expect to gain from the AI model? Increased accuracy in forecasts? Better inventory management? Optimized pricing?
    • What does success look like to you in this project?

2. Assessing your current situation and readiness:

    • Do you have a data strategy in place? If yes, what does it look like?
    • Have you used any data analytics or machine learning tools in the past? What were your experiences?
    • Do you have the necessary data to train the model? Is the data clean, labeled, and stored in an accessible format?
    • Do you have the necessary hardware and software infrastructure to support an AI model?
    • Do you have skilled personnel to manage and maintain the AI model, or would you need external support?

3. Gauging your commitment and resources:

    • Are you ready to invest time and resources to ensure the success of this project? This includes not just the cost of developing the model, but also ongoing costs for maintenance and updates.
    • Do you have buy-in from key stakeholders, including executive leadership? This is crucial for ensuring the project gets the resources and support it needs.
    • How will you handle the change management aspect of implementing AI in your sales process?

Frameworks and methodologies:

CRISP-DM (Cross-Industry Standard Process for Data Mining):

This methodology outlines the life cycle of a data mining project, from business understanding and data understanding to data preparation, modeling, evaluation, and deployment.

Data Maturity Model: Assessing an organization’s data maturity can help you understand its readiness to adopt AI. This includes aspects like data quality, data management, data literacy within the organization, and the existence of a data-driven culture.



Data Sources

Collecting data for fine-tuning large language models generally involves gathering both structured and unstructured data. Structured data refers to information that is highly organized and easily searchable in relational databases (like SQL), while unstructured data is information that doesn’t fit into these predefined models (like text, images, or audio files).

In the context of an enterprise, data can indeed come in many forms, including Word documents, SharePoint files, databases, emails, and more. Here are some strategies to collect data from these sources:

  1. Database Extraction: Structured data can often be found in various types of databases, such as SQL databases or NoSQL databases like MongoDB. Data can be extracted directly from these databases using queries specific to the database language. Enterprise resource planning (ERP) and customer relationship management (CRM) systems are other common sources of structured data in an enterprise context.
  2. Document Parsing: Unstructured data can often come from Word documents, PDFs, and other types of files. There are various libraries available for parsing these documents and extracting the text. For instance, Apache POI can be used for Microsoft Office documents, and PyPDF2 can be used for PDFs.
  3. Email Extraction: Emails are a rich source of text data. Depending on the email service used, there may be an API that can be used to access the emails. For instance, the Gmail API can be used to access and download emails from a Gmail account. The emails can then be parsed to extract the relevant text data.
  4. SharePoint: Microsoft provides the SharePoint Online Management Shell, which can be used to manage SharePoint Online users, sites, and site collections. SharePoint also has APIs that can be used to extract data.
  5. Web Scraping Intranets: With appropriate permissions, internal web pages can be scraped in a similar manner to external websites, allowing you to extract both structured and unstructured data.
  6. Logs: Many systems generate logs which are stored in text files or databases. These can be a valuable source of data for tasks such as anomaly detection.
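As one illustration of the strategies above, here is a minimal sketch of the email parsing step using only Python’s standard `email` module. The raw message, the addresses, and the `extract_email_text` helper are all hypothetical; in practice the raw text would come from whatever email API your organization uses.

```python
from email import message_from_string

# Hypothetical raw message, standing in for one downloaded via an email API.
RAW_EMAIL = """\
From: customer@example.com
To: support@example.com
Subject: Cannot export report

Hi team,
The export button does nothing when I click it.
"""


def extract_email_text(raw):
    """Return the subject, sender, and plain-text body of a raw email."""
    msg = message_from_string(raw)
    if msg.is_multipart():
        # Keep only the text/plain parts and skip attachments or HTML.
        parts = [
            part.get_payload(decode=True).decode(
                part.get_content_charset() or "utf-8", errors="replace"
            )
            for part in msg.walk()
            if part.get_content_type() == "text/plain"
        ]
        body = "\n".join(parts)
    else:
        body = msg.get_payload()
    return {
        "subject": msg["Subject"],
        "sender": msg["From"],
        "body": body.strip(),
    }


record = extract_email_text(RAW_EMAIL)
print(record)
```

Each resulting dictionary can then be written out alongside records from the other sources.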

Extracting data from databases and documents involves several steps, including connecting to the data source, querying or parsing the data, and then storing the data in a format suitable for machine learning. Let’s break down these steps:


Extracting data from databases:

  1. Connect to the database: The exact process depends on the type of database (e.g., SQL or MongoDB). You can use database-specific libraries in programming languages like Python or R to establish a connection.
  2. Query the data: Use SQL queries or equivalent commands to extract the required data. This might include filtering for specific columns, rows, or conditions.
  3. Store the extracted data: The extracted data can be stored in several formats, including CSV, Excel, or JSON, depending on the data’s structure and the requirements of your machine learning model. In Python, pandas is a popular library for managing structured data, and it can directly import data from SQL databases and export it to CSV, Excel, or JSON.
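The three database steps above can be sketched as follows. This is only an illustration under stated assumptions: an in-memory SQLite database stands in for a real enterprise database, and the `tickets` table, its columns, and its rows are made up for the example.

```python
import csv
import io
import sqlite3

# Step 1: connect. An in-memory SQLite database stands in for a real one.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tickets (id INTEGER, product TEXT, question TEXT, status TEXT)"
)
conn.executemany(
    "INSERT INTO tickets VALUES (?, ?, ?, ?)",
    [
        (1, "reports", "How do I export a report?", "closed"),
        (2, "billing", "Why was I charged twice?", "open"),
        (3, "reports", "Can I schedule a report?", "closed"),
    ],
)

# Step 2: query only the columns and rows needed for training.
cursor = conn.execute(
    "SELECT id, question FROM tickets WHERE status = ? ORDER BY id", ("closed",)
)
columns = [col[0] for col in cursor.description]
rows = cursor.fetchall()
conn.close()

# Step 3: store the result as CSV (a string buffer here; a file works the same way).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(columns)
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

With pandas installed, `pandas.read_sql_query` and `DataFrame.to_csv` cover steps 2 and 3 in two calls; the standard-library version is shown so the sketch runs without extra dependencies.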


Extracting data from documents:

  1. Parse the documents: For Word documents, you can use libraries like python-docx to read the text. For PDFs, libraries like PyPDF2 or PDFMiner can be useful.
  2. Store the extracted data: The parsed text data can be stored in a text file or as a CSV/JSON file along with other metadata. If you’re dealing with multiple documents, you may create a CSV file where each row represents a document, and there are columns for the document ID, text content, and any other relevant metadata.
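To make the one-row-per-document layout concrete, here is a small sketch using the standard `csv` module. The document IDs, file names, and text are hypothetical, and the text is assumed to have been parsed already by a library such as python-docx or PDFMiner.

```python
import csv
import io

# Hypothetical parsed documents; in practice the text would come from a parser.
documents = [
    {
        "doc_id": "kb-001",
        "source": "install_guide.docx",
        "text": "Run the installer.\nRestart the service.",
    },
    {
        "doc_id": "kb-002",
        "source": "faq.pdf",
        "text": "Reports can be exported as CSV, PDF, or XLSX.",
    },
]

buffer = io.StringIO()  # stands in for open("documents.csv", "w", newline="")
writer = csv.DictWriter(buffer, fieldnames=["doc_id", "source", "text"])
writer.writeheader()
writer.writerows(documents)

# Reading the data back shows that commas and line breaks inside the text
# survive the round trip, because the csv module quotes those fields.
buffer.seek(0)
restored = list(csv.DictReader(buffer))
print(restored[0]["text"])
```

Letting the `csv` module handle quoting matters here, since document text almost always contains commas and line breaks that would break a hand-built CSV line.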

Storing Data for Model Fine-Tuning:

Once the data has been extracted and stored in an intermediary format like CSV or JSON, the next step is to prepare it for model fine-tuning. This generally involves preprocessing the data into the format expected by the model and then splitting it into training, validation, and test sets (we’ll get to this in later posts).
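As a preview of the splitting step, here is a minimal sketch of a reproducible random split. The 80/10/10 ratio, the fixed seed, and the `split_dataset` helper are assumptions chosen for illustration, not a recommendation for every project.

```python
import random


def split_dataset(examples, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle examples and split them into train, validation, and test lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains becomes the test set
    return train, val, test


examples = [{"id": i} for i in range(100)]  # placeholder examples
train, val, test = split_dataset(examples)
print(len(train), len(val), len(test))
```

Shuffling before slicing avoids accidentally putting, say, all of the oldest records in the training set and all of the newest in the test set.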


Data Preparation (Labelling)

Let’s consider a scenario where you’re developing a generative AI model to function as a knowledge base for customer support in a software company. The goal of the model would be to generate answers to customer queries based on previous interactions and a predefined knowledge base.

To train such a model, you would need a large dataset of customer interactions, including both the customer queries and the corresponding responses provided by the support staff.

  • Labeling Process: Each interaction (or conversation) in this dataset could be treated as an individual example. The customer’s question would be the input, and the support staff’s response would be the label or the desired output. Thus, the task of labeling would involve pairing each question with its correct answer.
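The pairing described above can be sketched in a few lines. The interactions are invented, and the `input` and `label` field names are assumptions; the format your fine-tuning tooling expects may use different keys.

```python
import json

# Hypothetical support interactions: a customer query plus the staff response.
interactions = [
    {
        "customer": "How do I reset my password? ",
        "agent": "Use the 'Forgot password' link on the sign-in page.",
    },
    {
        "customer": "Can I export reports to CSV?",
        "agent": "Yes, open the report and choose Export > CSV.",
    },
    {"customer": "Is there a mobile app?", "agent": ""},  # unanswered
]


def to_labeled_examples(records):
    """Pair each question (input) with its answer (label), skipping incomplete pairs."""
    examples = []
    for record in records:
        question = record["customer"].strip()
        answer = record["agent"].strip()
        if question and answer:
            examples.append({"input": question, "label": answer})
    return examples


labeled = to_labeled_examples(interactions)

# One JSON object per line (JSON Lines) is a common way to store such pairs.
jsonl = "\n".join(json.dumps(example) for example in labeled)
print(jsonl)
```

Interactions without a usable response are skipped rather than given an empty label, since an empty answer would teach the model nothing useful.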

The labeled data is typically stored in a file format that can be easily read and processed by machine learning algorithms. There are several file formats commonly used for storing labeled data, including:

CSV (Comma-Separated Values): This is a text file format that stores data in a tabular format, with each row representing an example and each column representing a feature or attribute. The last column typically contains the target label or output value.

JSON (JavaScript Object Notation): This is a lightweight data interchange format that stores data in a hierarchical structure using key-value pairs. JSON files can be used to store labeled data in a nested format, where each example is a JSON object containing the input features and target label.

TFRecord (TensorFlow Record): This is a binary file format used by the TensorFlow framework to store large datasets efficiently. TFRecord files store data as a sequence of binary records, where each record contains a serialized example object that includes the input features and target label.


Here is an example of how to store labeled data in a CSV file:
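A minimal sketch of writing such a file with Python’s `csv` module is shown here. The rows are illustrative values made up for this sketch, not real data; only the column layout (three input features followed by a binary label) matters.

```python
import csv
import io

# Illustrative rows only: age, gender, and income are input features and
# the last column is the binary target label.
header = ["age", "gender", "income", "label"]
rows = [
    [34, "F", 72000, 1],
    [51, "M", 48000, 0],
    [27, "F", 39000, 0],
]

buffer = io.StringIO()  # stands in for open("labeled_data.csv", "w", newline="")
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(header)
writer.writerows(rows)
print(buffer.getvalue())
```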







In this example, each row represents an example, with the first three columns representing input features (age, gender, and income) and the last column representing the target label (0 or 1).

The specific file format used for labeling data depends on the preferences of the data scientist and the tools and libraries being used for the machine learning project.

Once labeled, this dataset could be used to train a machine learning model, like a sequence-to-sequence (Seq2Seq) model, which is a type of model well-suited for generating responses in a conversational context.

Throughout training, the model learns to map customer questions (inputs) to the appropriate responses (labels). Given a new customer question, the model can then generate an appropriate response based on what it’s learned.

This is an example of how labeling can be used in the context of generative AI for an enterprise application. The quality of the labels (in this case, the quality and relevance of the support staff responses) is crucial for the model’s ability to generate useful responses.

Keep in mind, creating a successful AI-powered customer support system requires not just an initial labeled dataset, but also continuous learning and updating as the system interacts with customers and the product evolves. This can involve an ongoing labeling process, where new customer interactions are continually being added to the training dataset to help the model stay updated.

In other cases, labels are readily available within the data itself. For example, if you are constructing a model to predict whether a person will repay a loan, you would have access to historical loan repayment and bankruptcy data, which already records the outcome you want the model to predict.

Data Reduction

Of course, labeling isn’t the only procedure needed when preparing data for machine learning. One of the most crucial data preparation processes is data reduction and cleansing. Wait, what? Reduce data? Clean it? Shouldn’t we collect all the data possible? Well, you do need to collect all possible data, but it doesn’t mean that every piece of it carries value for your machine learning project. So you do the reduction to put only relevant data in your model.


Imagine you work for an ecommerce company and you’re tasked with building a machine learning model to predict whether a customer will make a purchase within the next 30 days.

Your dataset includes a wealth of information about each customer’s browsing and purchasing behavior. You have a number of variables such as the number of website visits in the last month, average session duration, total number of items viewed, number of purchases made last month, total spend last month, time since last purchase, and so on.

In this hypothetical dataset, each column (feature) represents a dimension in the high-dimensional space. If your dataset is very large and contains many features (dimensions), this high dimensionality could pose a challenge to your predictive model.

Dimensionality reduction could be helpful here. Let’s go over some examples:

    1. Redundant Features: There might be features that are highly correlated. For instance, the total number of items viewed and total spend last month might be strongly correlated. Similarly, the number of website visits and average session duration could also be strongly correlated. In such cases, one of each pair could be eliminated without losing much information.
    2. Low Variance Features: There may be features in the dataset that exhibit very little variation. For example, if most of your customers are from the United States, the “Country” feature might have little variance and might not be very informative. Such features could be candidates for removal.
    3. Irrelevant Features: Sometimes, a dataset may contain features that are not related to the outcome we want to predict. For example, if we have a feature like “Customer’s Favorite Color”, it might not be useful in predicting their purchasing behavior and could potentially be removed.
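The first two checks above (redundant and low-variance features) can be automated. Here is a minimal sketch in plain Python; the feature values, the thresholds, and the `prune_features` helper are assumptions for illustration. Judging whether a feature is irrelevant still requires domain knowledge.

```python
from statistics import mean, pvariance

# Hypothetical customer features, one list of values per column.
features = {
    "items_viewed": [10, 40, 26, 61, 34, 15],
    "total_spend": [20.0, 85.0, 50.0, 130.0, 70.0, 28.0],
    "avg_session_minutes": [5, 3, 9, 4, 8, 6],
    "is_us_customer": [1, 1, 1, 1, 1, 1],
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    norm = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return cov / norm


def prune_features(columns, min_variance=1e-9, max_corr=0.95):
    """Return the names of features that are neither near-constant nor redundant."""
    selected = []
    for name, values in columns.items():
        if pvariance(values) <= min_variance:
            continue  # low-variance feature: carries almost no information
        if any(abs(pearson(values, columns[kept])) > max_corr for kept in selected):
            continue  # redundant with a feature that was already kept
        selected.append(name)
    return selected


selected = prune_features(features)
print(selected)
```

In this toy data, `is_us_customer` is constant and `total_spend` tracks `items_viewed` almost perfectly, so only `items_viewed` and `avg_session_minutes` are kept. Which member of a correlated pair to keep depends on column order here; in a real project that choice deserves a deliberate decision.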

After initial pruning of features based on the above, you could further use feature selection or feature extraction techniques to reduce dimensionality. Principal Component Analysis (PCA), for example, could be used to further reduce the number of features, especially when the number is very high. t-Distributed Stochastic Neighbor Embedding (t-SNE) is another dimensionality reduction technique, though it is mostly used to visualize high-dimensional data rather than to produce model inputs.

However, keep in mind that every time we remove or combine features, we are losing some information. Therefore, it’s important to assess the performance of your model after performing dimensionality reduction to ensure it still meets your needs.

Cleaning Data

Datasets are often incomplete, containing empty cells, meaningless records, or question marks instead of necessary values. Not to mention that some data can be corrupted or just inaccurate. That needs to be fixed. It’s generally better to feed a model imputed data than to leave blank spaces for it to speculate about. For example, you can fill in missing values with selected constants or with values predicted from other observations in the data set. As for corrupted or inaccurate data, you typically correct it where possible or delete it from the set.


Consider you’re a data scientist at a major health insurance company, tasked with building a machine learning model to predict the likelihood of insurance claims being fraudulent. You’re provided with a historical dataset that includes attributes like the patient’s age, diagnosis, treatment details, cost, location, and more.

However, upon inspecting the dataset, you notice several issues. There are missing values in critical fields, like diagnosis and cost. Some records have a question mark in the location field, and a few entries seem to have impossible values (e.g., age listed as 250).

In such a case, data cleaning becomes essential before feeding it into your predictive model.

    1. Missing Values: For fields like diagnosis, it might not be appropriate to impute values since the data is highly specific. You may choose to exclude these records from your dataset. For cost, you might decide to impute missing values based on other related fields. For example, you might fill in missing cost values with the average cost for that particular treatment.
    2. Incorrect Entries: In fields like location, question marks could be replaced with an “Unknown” tag, or you may choose to exclude these records depending on how critical the location data is to your model.
    3. Corrupted Data: For clearly erroneous entries like an age of 250, you might decide to remove these records from your dataset altogether. Alternatively, if you notice that such errors are common and follow a pattern, you might infer that these errors could have been systematically introduced (e.g., 250 could signify “age not disclosed”) and address them accordingly.
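The cleaning rules above can be sketched as a single function that also keeps a log of every change. The claim records, the 120-year age limit, and the `clean_claims` helper are hypothetical, and the decisions encoded here (drop, impute, or tag) are just one reasonable reading of the steps.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical claim records showing the problems described above.
claims = [
    {"age": 45, "diagnosis": "fracture", "treatment": "cast", "cost": 1200.0, "location": "TX"},
    {"age": 52, "diagnosis": "fracture", "treatment": "cast", "cost": None, "location": "?"},
    {"age": 250, "diagnosis": "flu", "treatment": "antiviral", "cost": 90.0, "location": "CA"},
    {"age": 31, "diagnosis": None, "treatment": "antiviral", "cost": 110.0, "location": "NY"},
    {"age": 38, "diagnosis": "fracture", "treatment": "cast", "cost": 1000.0, "location": "WA"},
]


def clean_claims(records, max_age=120):
    """Return cleaned copies of the records plus a log of every change made."""
    log, valid = [], []
    for i, record in enumerate(records):
        if record["diagnosis"] is None:
            log.append(f"row {i}: dropped (missing diagnosis)")
        elif not 0 <= record["age"] <= max_age:
            log.append(f"row {i}: dropped (implausible age {record['age']})")
        else:
            valid.append((i, dict(record)))  # copy so the original stays untouched

    # Known costs per treatment, taken from valid rows only, for imputation.
    costs = defaultdict(list)
    for _, record in valid:
        if record["cost"] is not None:
            costs[record["treatment"]].append(record["cost"])

    cleaned = []
    for i, record in valid:
        if record["cost"] is None:
            known = costs.get(record["treatment"])
            if not known:
                log.append(f"row {i}: dropped (no cost available to impute from)")
                continue
            record["cost"] = mean(known)
            log.append(f"row {i}: cost imputed as {record['cost']}")
        if record["location"] == "?":
            record["location"] = "Unknown"
            log.append(f"row {i}: location '?' replaced with 'Unknown'")
        cleaned.append(record)
    return cleaned, log


cleaned, change_log = clean_claims(claims)
print(len(cleaned), "records kept")
print("\n".join(change_log))
```

Returning the log together with the cleaned data makes it easy to store both side by side.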

In all these steps, it’s important to maintain a record of what changes were made to the original dataset. This ensures reproducibility and transparency in your data preprocessing pipeline. By performing careful data cleaning, you’re likely to build a more effective machine learning model that accurately predicts insurance claim fraud.

Okay, data is reduced and cleansed. Here comes another fun part, data wrangling.

Data Wrangling / Data Normalization

Data preprocessing involves transforming raw data into a format that effectively describes the underlying problem for a machine learning model. This step includes various techniques such as formatting and normalization, which may sound technical but are not as intimidating as they seem.

When combining data from multiple sources, it’s essential to ensure that the format aligns with your machine learning system’s requirements. For instance, if the collected data is in an .xls file format but you need it in a plain text format like .csv, you would perform formatting to convert it accordingly.

In addition to formatting, it’s crucial to make the data instances consistent across datasets.


Let’s say you have a dataset for a customer segmentation task, and one of the features is “Age” which ranges from 20 to 80 years. Another feature is “Income” which ranges from $20,000 to $200,000. If you directly input these features into a machine learning model without normalization, the income feature with its larger numerical range might dominate the model’s learning process, leading to biased results.

To address this, you can apply normalization techniques to ensure equal importance is given to both features. One commonly used method is z-score normalization (standardization). In this approach, each value is transformed by subtracting the mean of the feature and dividing by its standard deviation. This process results in a new feature distribution with a mean of zero and a standard deviation of one.

For example, let’s say the mean age is 40 years and the standard deviation is 10 years. The mean income is $100,000, and the standard deviation is $50,000. By applying z-score normalization, an age value of 30 years would be transformed to (30 – 40) / 10 = -1, and an income value of $150,000 would be transformed to (150,000 – 100,000) / 50,000 = 1.
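The arithmetic above is easy to verify in code. This is a small sketch using the standard `statistics` module; the `z_score` and `standardize` helpers and the sample ages are assumptions for illustration.

```python
from statistics import mean, pstdev


def z_score(value, feature_mean, feature_std):
    """Standardize one value: subtract the mean, divide by the standard deviation."""
    return (value - feature_mean) / feature_std


# The two values from the example above.
age_z = z_score(30, 40, 10)
income_z = z_score(150_000, 100_000, 50_000)
print(age_z, income_z)


def standardize(values):
    """Standardize a whole column using its own mean and standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


ages = [20, 30, 40, 50, 60]  # hypothetical sample
scaled_ages = standardize(ages)
print(scaled_ages)
```

In a real pipeline, the mean and standard deviation should be computed on the training set only and then reused for validation and test data, so that no information leaks from the held-out sets.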

Normalization ensures that both features are on a similar scale, ranging from negative values to positive values centered around zero. This allows the machine learning model to treat each feature equally and make fair comparisons between them during the learning process.

By normalizing the features, you remove any potential bias that may arise due to differences in the numerical ranges. This helps in improving the model’s performance and prevents features with larger values from dominating the predictions solely based on their scale.

Normalization is an essential step in data preprocessing, ensuring that features are comparable and balanced in their impact on the machine learning model’s performance.

Feature Engineering

Moving beyond existing features, there are situations where new features need to be created, a process known as feature engineering. This process involves extracting more useful information from complex variables.


For instance, in predicting customer demand for hotel rooms, you might have date-time information in its native form. Recognizing that demand varies based on specific days, months, and even time periods (e.g., more bookings at night, fewer in the morning), you can decompose the date and time into separate numerical features. This enables the model to leverage the predictive power of each aspect more efficiently. To make your data work best for your models, you often need to make it easier for the models to detect patterns. This often involves transforming raw data into a format that is more compatible with your chosen algorithm(s).

Let’s illustrate this with a Python example where we extract different features from a date-time column using the pandas library:

import pandas as pd

# Let's assume we have a DataFrame df with a 'date_time' column:
# 1,000 hourly timestamps starting on January 1, 2022
df = pd.DataFrame({
    'date_time': pd.date_range(start='2022-01-01', periods=1000, freq=pd.Timedelta(hours=1))
})

# Convert the date_time column to pandas datetime format
# (a no-op here, but needed when timestamps are loaded as strings)
df['date_time'] = pd.to_datetime(df['date_time'])

# Extract various features
df['year'] = df['date_time'].dt.year
df['month'] = df['date_time'].dt.month
df['day'] = df['date_time'].dt.day
df['hour'] = df['date_time'].dt.hour
df['day_of_week'] = df['date_time'].dt.dayofweek  # Monday is 0, Sunday is 6
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)  # 5 and 6 correspond to Sat and Sun
df['is_night'] = (df['hour'] > 20) | (df['hour'] < 6)  # assuming night hours are between 9 PM and 6 AM
df['is_night'] = df['is_night'].astype(int)

In the above code snippet, we first create a DataFrame with a ‘date_time’ column containing timestamps for every hour starting from the 1st of January, 2022. Then, we extract various features such as the year, month, day, hour, and day of the week from this timestamp. We also create two binary features: ‘is_weekend’, which indicates whether the day is a weekend (Saturday or Sunday), and ‘is_night’, which indicates whether the timestamp falls during night hours (9 PM to 6 AM).

By performing this feature engineering, we’re helping a machine learning model understand the different ways that time can impact hotel bookings, without it having to derive these complex relationships on its own. As a result, the model can more effectively learn from the data and make more accurate predictions. Ultimately, the accuracy of a machine learning model, or of a fine-tuned foundational model, relies on the quality and preprocessing of the training data you provide.

In conclusion, data preparation is a crucial step in the process of fine-tuning a foundational model for Generative AI. It involves careful consideration of data quality, dataset size, data sources, labeling, data reduction, data cleaning, data wrangling, normalization, and feature engineering. By ensuring high-quality data, addressing biases, and handling missing or irrelevant information, we lay the foundation for training effective machine learning models. The examples and methodologies discussed in this blog post serve as guidelines for data scientists embarking on their data preparation journey. Remember, the success of a fine-tuned model depends heavily on the quality and preprocessing of the training data provided. With thorough data preparation, we pave the way for more accurate, reliable, and impactful AI systems in various domains and applications.
