GenAI Guide on Data Preparation Part 2: Data Collection

Collecting Data from Different Sources

As mentioned in our previous blog, gathering data to fine-tune large language models typically involves collecting both structured and unstructured data. Structured data refers to well-organized information that can be easily searched in relational databases such as SQL. Unstructured data, on the other hand, encompasses information that doesn’t conform to these predetermined formats, such as text, images, or audio files.

Last time we briefly explored strategies for collecting data from the following sources. In this blog we will dig deeper into those strategies and provide code examples to help you understand each process.

  1. Database Extraction: Structured data can often be found in various types of databases, such as SQL databases or NoSQL databases like MongoDB. Data can be extracted directly from these databases using queries specific to the database language. Enterprise resource planning (ERP) and customer relationship management (CRM) systems are other common sources of structured data in an enterprise context.
  2. Document Parsing: Unstructured data can often come from Word documents, PDFs, and other types of files. There are various libraries available for parsing these documents and extracting the text. For instance, Apache POI can be used for Microsoft Office documents, and PyPDF2 can be used for PDFs.
  3. Email Extraction: Emails are a rich source of text data. Depending on the email service used, there may be an API that can be used to access the emails. For instance, the Gmail API can be used to access and download emails from a Gmail account. The emails can then be parsed to extract the relevant text data.
  4. SharePoint: Microsoft provides the SharePoint Online Management Shell, which can be used to manage SharePoint Online users, sites, and site collections. SharePoint also has APIs that can be used to extract data.
  5. Web Scraping Intranets: With appropriate permissions, internal web pages can be scraped in a similar manner to external websites, allowing you to extract both structured and unstructured data.
  6. Logs: Many systems generate logs which are stored in text files or databases. These can be a valuable source of data for tasks such as anomaly detection.

Let’s see how we can connect to a database, query it, and store the extracted structured data.

1) Getting data from a SQL database using Python

The process of retrieving data from databases comprises several stages, which encompass establishing a connection with the data source, querying or parsing the data, and subsequently storing it in a machine learning-compatible format:

  1. Connect to the database: The exact process depends on the type of database (e.g., SQL, MongoDB, etc.). You can use database-specific libraries in programming languages like Python or R to establish a connection.
  2. Query the data: Use SQL queries or equivalent commands to extract the required data. This might include filtering for specific columns, rows, or conditions.
  3. Store the extracted data: The extracted data can be stored in several formats, including CSV, Excel, or JSON, depending on the data’s structure and the requirements of your machine learning model. In Python, pandas is a popular library for managing structured data, and it can directly import data from SQL databases and export it to CSV, Excel, or JSON.
Here’s an example of Python code that demonstrates connecting to a SQL database, querying the data, and storing the extracted data in CSV format using the pandas library:
import pandas as pd
import sqlite3  # Assuming you're connecting to a SQLite database

# Connect to the database
conn = sqlite3.connect('your_database.db')

# Query the data
query = "SELECT column1, column2, column3 FROM your_table WHERE condition;"
data = pd.read_sql_query(query, conn)

# Store the extracted data as CSV
data.to_csv('extracted_data.csv', index=False)

# Close the database connection
conn.close()

In this instance, we presume that you are establishing a connection to a SQLite database (your_database.db). Please replace ‘your_table’ with the actual table name and ‘condition’ with your specific SQL query condition. Feel free to modify the SELECT statement to retrieve the desired columns.

The data retrieved will be stored in a pandas DataFrame named ‘data’. You can use pandas’ to_csv() function to export the DataFrame to a CSV file named ‘extracted_data.csv’. To ensure the row index is not included in the CSV file, set the parameter index=False.

If you haven’t already installed pandas, use the following command in a code cell to install it (sqlite3 is part of Python’s standard library and needs no installation):

!pip install pandas

Make sure to adapt the code accordingly if you are using a different database and library (e.g., for MongoDB, you would employ the pymongo library for connecting and querying).
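For illustration, here is a minimal sketch of that MongoDB route using pymongo. The connection URI, database, collection, and field names (‘your_database’, ‘your_collection’, and so on) are placeholders to replace with your own:

```python
import csv

def docs_to_rows(docs, fields):
    """Flatten MongoDB documents (plain dicts) into CSV rows,
    filling missing fields with an empty string."""
    return [[doc.get(field, "") for field in fields] for doc in docs]

def export_collection(uri, db_name, collection_name, fields, out_path):
    # Imported here so docs_to_rows stays usable without pymongo installed
    from pymongo import MongoClient

    client = MongoClient(uri)
    # Project only the fields we need (note: _id is returned by default)
    docs = client[db_name][collection_name].find({}, {f: 1 for f in fields})
    with open(out_path, "w", newline="") as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(fields)
        writer.writerows(docs_to_rows(docs, fields))
    client.close()

# Example usage (all names are placeholders):
# export_collection('mongodb://localhost:27017', 'your_database',
#                   'your_collection', ['column1', 'column2'], 'extracted_data.csv')
```

Because MongoDB documents are schemaless, docs_to_rows() fills any missing field with an empty string so every CSV row has the same number of columns.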

Similarly, extracting unstructured data involves multiple steps.


2) Getting data from documents

  1. Parse Documents: For Word documents, you can use libraries like python-docx to read the text. For PDFs, libraries like PyPDF2 or PDFMiner can be useful.
  2. Store the extracted data: The parsed text data can be stored in a text file or as a CSV/JSON file along with other metadata. If you’re dealing with multiple documents, you may create a CSV file where each row represents a document, and there are columns for the document ID, text content, and any other relevant metadata.
Here’s an example of Python code that demonstrates parsing Word documents using the python-docx library and storing the extracted data along with metadata into a CSV or JSON file.

First, let’s install the python-docx library:

!pip install python-docx

Next, you can employ the parse_word_document() function, which takes the path to a Word document as input and utilizes the python-docx library to extract the text from the document’s paragraphs:

import docx
import csv
import json

def parse_word_document(doc_path):
    doc = docx.Document(doc_path)
    paragraphs = [p.text for p in doc.paragraphs]
    return paragraphs

# Example usage
document_path = 'your_document.docx'
parsed_data = parse_word_document(document_path)

# Metadata for the document
metadata = {
    'document_id': '12345',
    'author': 'John Doe',
    'date': '2023-07-13'
}
# Store the extracted data in a CSV file
csv_filename = 'extracted_data.csv'
with open(csv_filename, 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['document_id', 'text'])
    for paragraph in parsed_data:
        writer.writerow([metadata['document_id'], paragraph])

# Store the extracted data in a JSON file
json_filename = 'extracted_data.json'
data = {
    'metadata': metadata,
    'text': parsed_data
}
with open(json_filename, 'w') as jsonfile:
    json.dump(data, jsonfile)

You can modify the parse_word_document() function to extract other relevant information from the Word document if needed.

Once you’ve parsed the Word document and obtained the parsed_data, you can define the metadata for the document in a dictionary. In the example, we provided a basic metadata structure with keys such as document_id, author, and date. Adapt these fields according to your specific needs.

To store the extracted data, you have the option to choose between CSV or JSON formats. For CSV, we open a file using the csv module, write the header row, and then iterate over the parsed data, writing each paragraph along with the document ID to a new row.

Alternatively, for JSON, we create a dictionary called ‘data’ that includes both the metadata and the parsed data. We then utilize json.dump() to write this dictionary to a JSON file.

Remember to replace ‘your_document.docx’ with the actual path to your Word document. Also, adjust the filenames (‘extracted_data.csv’ and ‘extracted_data.json’) as desired.
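For PDFs the flow is similar. Below is a minimal sketch using the PdfReader class from PyPDF2 (the 3.x API; older releases exposed PdfFileReader instead), with ‘your_document.pdf’ as a placeholder path:

```python
def join_page_texts(pages):
    """Join per-page text into a single string, skipping empty pages."""
    return "\n".join(page.strip() for page in pages if page and page.strip())

def parse_pdf(pdf_path):
    # Imported here so join_page_texts stays usable without PyPDF2 installed
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    # extract_text() may return None for pages with no extractable text
    return join_page_texts(page.extract_text() or "" for page in reader.pages)

# Example usage (placeholder path):
# pdf_text = parse_pdf('your_document.pdf')
```

The extracted text can then be written to a CSV or JSON file with metadata in exactly the same way as the Word example above.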

3) Getting data from emails

Emails contain a wealth of textual data, presenting a valuable resource for analysis. Depending on the email service utilized, there might be an available API that allows accessing the email content. Let’s see how to access and download emails from a Gmail account:

Here’s an example of Python code that uses the IMAP protocol to access and download emails from a Gmail account, and then uses the email library to parse the emails and extract the relevant text data:

Make sure to substitute ‘your_email@gmail.com’ and ‘your_password’ with your real Gmail account credentials; note that Gmail typically requires an app-specific password for IMAP access when two-factor authentication is enabled. This code establishes a connection to the Gmail server using IMAP, retrieves all emails from the inbox, and extracts the pertinent text data from each email. You can carry out additional processing with the extracted text data based on your specific needs.

import imaplib
import email

# IMAP settings
IMAP_SERVER = 'imap.gmail.com'
EMAIL_ACCOUNT = 'your_email@gmail.com'
EMAIL_PASSWORD = 'your_password'

def extract_text_from_email(msg):
    text = ""
    if msg.is_multipart():
        for part in msg.walk():
            content_type = part.get_content_type()
            if content_type == 'text/plain':
                text += part.get_payload()
    else:
        text = msg.get_payload()
    return text

# Connect to the IMAP server and log in
imap = imaplib.IMAP4_SSL(IMAP_SERVER)
imap.login(EMAIL_ACCOUNT, EMAIL_PASSWORD)
imap.select('INBOX')

# Search for all emails
status, response = imap.search(None, 'ALL')
email_ids = response[0].split()

for email_id in email_ids:
    status, response = imap.fetch(email_id, '(RFC822)')
    raw_email = response[0][1]
    msg = email.message_from_bytes(raw_email)

    # Extract relevant text data from the email
    text_data = extract_text_from_email(msg)

    # Do further processing with the extracted text data
    # (e.g., save it to a file, perform analysis, etc.)

# Close the connection
imap.close()
imap.logout()

4) Getting data from SharePoint

To interact with SharePoint Online using Python, you can utilize the SharePlum library, which offers a Pythonic interface to the SharePoint REST API. Below is an example of Python code to extract data from SharePoint Online using SharePlum:

Make sure to replace ‘your_username’, ‘your_password’, and ‘https://your_sharepoint_site_url’ with your actual SharePoint Online credentials and site URL. Additionally, update ‘YourListName’ with the name of the specific list or library from which you wish to extract data. This code establishes a connection to SharePoint Online, retrieves all items from the specified list, and extracts the pertinent data from each item. You can further process and utilize the extracted data based on your specific requirements.

from shareplum import Office365
from shareplum import Site

# SharePoint Online credentials
username = 'your_username'
password = 'your_password'
site_url = 'https://your_sharepoint_site_url'

# Connect to SharePoint Online
authcookie = Office365(site_url, username=username, password=password).GetCookies()
site = Site(site_url, authcookie=authcookie)

# Specify the SharePoint list/library to extract data from
list_name = 'YourListName'

# Retrieve all items from the list
data = site.List(list_name).GetListItems()

# Extract relevant data from the response
for item in data:
    # Access item fields
    field1 = item['Field1']
    field2 = item['Field2']
    # Process the extracted data as needed
    # (e.g., store it, perform further analysis, etc.)

# SharePlum manages the HTTP session; no explicit disconnect call is needed

5) Web scraping intranets

To retrieve data from internal web pages using Python, you can leverage libraries such as requests and BeautifulSoup. Here’s an example code snippet that demonstrates how to extract data from an internal web page:

import requests
from bs4 import BeautifulSoup

# Specify the URL of the internal web page
url = 'http://your_intranet_page_url'

# Send a GET request to the web page
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Parse the HTML content of the web page
    soup = BeautifulSoup(response.content, 'html.parser')

    # Find specific elements or data on the web page using BeautifulSoup
    # For example, let's extract all the links on the page
    links = soup.find_all('a')

    # Process the extracted data as per your requirement
    for link in links:
        link_text = link.text
        link_url = link['href']
        print(f"Link: {link_text}\nURL: {link_url}")
else:
    print(f"Request failed with status code: {response.status_code}")

In the code snippet above, the requests library is utilized to send a GET request to the specified internal web page URL. The code then checks whether the request was successful by verifying the status code (200). Once successful, BeautifulSoup is employed to parse the HTML content of the web page.

In this example, we extract all the <a> tags (links) from the web page using soup.find_all(‘a’). However, you have the flexibility to modify this extraction method to suit your specific requirements. For instance, you can use different BeautifulSoup methods like find(), find_all(), or CSS selectors to locate specific elements on the web page.

After extracting the data, you are free to process it according to your needs. In this case, the extracted links are iterated over, and their text and URLs are printed. Nonetheless, you can perform any desired data processing or choose to store the extracted data in a suitable format.

To proceed with the code, ensure you have installed the necessary libraries (requests and beautifulsoup4) if you haven’t already done so. You can use the following command in a code cell to install them:

!pip install requests beautifulsoup4


Remember to replace ‘http://your_intranet_page_url’ with the real URL of the internal web page from which you wish to extract data.
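If installing third-party packages on a locked-down intranet host is not an option, the standard library’s html.parser module can cover simple cases such as link extraction. This is a minimal sketch rather than a full substitute for BeautifulSoup on messy real-world HTML:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect (text, href) pairs from anchor tags using only the stdlib."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Accumulate text only while inside an anchor tag
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None

# Example usage with the response from requests:
# parser = LinkExtractor()
# parser.feed(response.text)
# print(parser.links)
```
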

6) Extracting logs

To access log files using Python, you can utilize file I/O operations to read and process the log data. Here’s an example code snippet that demonstrates how to access log files:

log_file_path = 'path_to_your_log_file.log'

# Open the log file in read mode
with open(log_file_path, 'r') as log_file:
    # Read the contents of the log file
    log_data = log_file.read()

    # Process the log data as per your requirement
    # For example, let's print each line of the log file
    lines = log_data.split('\n')
    for line in lines:
        print(line)
In the code above, you should specify the path to your log file by replacing ‘path_to_your_log_file.log’ with the actual file path.

The code uses a with statement to open the log file in read mode, ensuring proper closure even if an exception occurs during reading. The contents of the log file are then read using the read() method and stored in the log_data variable.

After reading the log data, you have the flexibility to process it as needed. In this example, we split the log data into individual lines using the newline character (‘\n’) as the delimiter. Subsequently, we iterate over the lines and print each line. However, you can customize this process to suit any specific processing requirements for your log data.

Make sure that you possess the necessary permissions to access the log file, and replace ‘path_to_your_log_file.log’ with the correct file path to access your specific log file.

Keep in mind that log files may have different formats and structures, so you might need to adapt the code to parse and extract relevant information based on the specific format of your log file.
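As one concrete illustration, the sketch below parses lines in the common Apache/nginx access-log style with a regular expression. The pattern and field names are an assumption about the log layout and will need adapting to your own files:

```python
import re

# Pattern for a hypothetical Apache-style access-log line; adapt it to
# the actual layout of your own log files.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_log_line(line):
    """Return a dict of named fields for a matching line, or None otherwise."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

# Example usage:
# parse_log_line('127.0.0.1 - - [13/Jul/2023:10:00:00 +0000] '
#                '"GET /index.html HTTP/1.1" 200 1024')
```

Lines that fail to match (for example, stack traces interleaved with access entries) come back as None, so they can be filtered out or routed to separate handling.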

Alright! We have the data collected, now what?

In the next blog we’ll explore how we can label the data so that it can be used to train a machine learning model.
