Building a Lightweight Python Data Pipeline for a Research Team


When a small research team reached out to me looking for help with data consolidation, I saw a great opportunity to apply my Python skills to a real-world workflow problem. Their process involved manually copying data from dozens of Excel files into a master spreadsheet every week—and then summarizing those results in a report. It was slow, error-prone, and draining valuable time from their actual research.

In this post, I’ll walk through how I built a simple ETL (Extract, Transform, Load) pipeline using Python, pandas, openpyxl, and matplotlib to streamline their data process and automate reporting.


Project Goals

  • Extract data from multiple Excel workbooks with slightly inconsistent formatting
  • Transform the data into a clean, standardized structure
  • Load the combined data into a single master file for reporting
  • Visualize key trends using simple plots
  • Add validation checks and logging to ensure reliability

Step-by-Step: How I Built It

1. 📂 Project Setup

I organized the pipeline into the following file structure:

project/
│
├── input_data/
│   └── [raw Excel files here]
│
├── output/
│   ├── cleaned_master.xlsx
│   └── summary_report.pdf
│
├── logs/
│   └── pipeline.log
│
├── pipeline.py
└── config.yaml

I also created a simple config.yaml file to store things like input/output paths and expected column headers. This made it easy to adjust the setup without touching the main code.
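
For reference, here's roughly how pipeline.py reads that config, assuming PyYAML is installed; the key names below (input_dir, output_dir, expected_columns) are illustrative stand-ins for the real ones:

import yaml

with open("config.yaml") as f:
    config = yaml.safe_load(f)

input_dir = config["input_dir"]                # e.g. "input_data/"
output_dir = config["output_dir"]              # e.g. "output/"
expected_columns = config["expected_columns"]  # e.g. ["date", "participants"]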


2. 📥 Extraction

Each Excel file had a single sheet of data, but the formatting sometimes varied: extra header rows, slightly different column names, and so on. I used pandas (with the openpyxl engine) to read and parse each file:

import pandas as pd
import openpyxl  # required by pandas as the .xlsx read engine
import os

def load_excel_files(input_dir):
    """Read every .xlsx file in input_dir into a list of DataFrames."""
    all_data = []
    for file in os.listdir(input_dir):
        if file.endswith(".xlsx"):
            df = pd.read_excel(os.path.join(input_dir, file), engine="openpyxl")
            all_data.append(df)
    return all_data
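
One way to cope with the occasional extra header row is to detect where the real header sits and re-read from there. A minimal sketch of that idea; the expected_first_column default is illustrative, not the team's actual header:

def read_with_header_detection(path, expected_first_column="date"):
    """Re-read a workbook whose real header row isn't the first row."""
    raw = pd.read_excel(path, engine="openpyxl", header=None)
    # Find the first row whose first cell matches the expected header label
    for i, value in enumerate(raw[0]):
        if str(value).strip().lower() == expected_first_column:
            return pd.read_excel(path, engine="openpyxl", header=i)
    # Fall back to the default behavior if no such row is found
    return pd.read_excel(path, engine="openpyxl")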

I also implemented a basic validation check to make sure each file had the required columns (using fuzzy matching if needed).
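
Here's a minimal sketch of that check, using difflib from the standard library for the fuzzy matching; the REQUIRED_COLUMNS list is illustrative (in the real pipeline it came from config.yaml):

from difflib import get_close_matches

REQUIRED_COLUMNS = ["date", "participants"]  # illustrative, not the team's real schema

def validate_columns(df, required=REQUIRED_COLUMNS, cutoff=0.8):
    """Rename near-miss headers to their expected names; fail loudly if one is missing."""
    normalized = {str(c).lower().strip(): c for c in df.columns}
    renames = {}
    for col in required:
        match = get_close_matches(col, list(normalized), n=1, cutoff=cutoff)
        if not match:
            raise ValueError(f"Required column '{col}' not found")
        renames[normalized[match[0]]] = col
    return df.rename(columns=renames)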


3. 🔧 Transformation

Once I had the dataframes, I applied cleaning steps (sketched below) to:

  • Standardize column names (.str.lower().str.strip())
  • Drop empty rows and columns
  • Convert date strings to datetime
  • Normalize categorical values (e.g., turning “Yes”, “YES”, “yes” into “Yes”)
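
A minimal sketch of those steps; the date and response column names are stand-ins for the team's actual columns:

def clean_dataframe(df):
    # Standardize column names
    df.columns = df.columns.str.lower().str.strip()
    # Drop fully empty rows and columns
    df = df.dropna(how="all").dropna(axis=1, how="all")
    # Convert date strings to datetime (column name is illustrative)
    if "date" in df.columns:
        df["date"] = pd.to_datetime(df["date"], errors="coerce")
    # Normalize categorical values, e.g. "YES"/"yes" -> "Yes" (column name is illustrative)
    if "response" in df.columns:
        df["response"] = df["response"].astype(str).str.strip().str.capitalize()
    return df

cleaned_dfs = [clean_dataframe(df) for df in all_data]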

Then I concatenated all the cleaned dataframes into a master dataset:

combined_df = pd.concat(cleaned_dfs, ignore_index=True)

To help the team catch issues early, I added assertions and logging:

import logging

logging.basicConfig(filename='logs/pipeline.log', level=logging.INFO)
logging.info(f"Successfully loaded {len(cleaned_dfs)} files at {pd.Timestamp.now()}")

# Fail fast if any values are missing
assert not combined_df.isnull().any().any(), "Missing values found!"

4. 📊 Reporting & Visualization

Using matplotlib, I created a simple time-series line plot and a bar chart showing trends across the research period. Here's the line plot:

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
combined_df.groupby("month")["participants"].sum().plot(kind='line', ax=ax)
plt.title("Monthly Participant Count (2023-2024)")
plt.xlabel("Month")
plt.ylabel("Participants")
plt.savefig("output/summary_report.pdf")
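
The bar chart followed the same pattern. A quick sketch, again using the month and participants columns (the output filename here is illustrative):

fig, ax = plt.subplots()
combined_df.groupby("month")["participants"].sum().plot(kind='bar', ax=ax)
ax.set_title("Participants by Month")
ax.set_xlabel("Month")
ax.set_ylabel("Participants")
fig.savefig("output/participants_by_month.pdf")  # illustrative path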


The team could easily include these plots in their presentations and grant reports.


5. 💾 Output & Automation

I exported the final master dataset to Excel with one command:

combined_df.to_excel("output/cleaned_master.xlsx", index=False)

To wrap everything up, I turned the script into a runnable pipeline.py and included comments and usage instructions.
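
As a rough sketch, the entry point ties the earlier pieces together like this; it reuses load_excel_files and the clean_dataframe helper sketched above, plus the illustrative config keys from earlier:

import yaml

def main():
    with open("config.yaml") as f:
        config = yaml.safe_load(f)
    raw_dfs = load_excel_files(config["input_dir"])            # Extract
    cleaned_dfs = [clean_dataframe(df) for df in raw_dfs]      # Transform
    combined_df = pd.concat(cleaned_dfs, ignore_index=True)
    combined_df.to_excel("output/cleaned_master.xlsx", index=False)  # Load
    logging.info("Pipeline finished with %d rows", len(combined_df))

if __name__ == "__main__":
    main()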


What I Learned

  • Working with real data means handling inconsistency and building in safeguards
  • Logging and validation can prevent silent errors
  • Even a small data pipeline can save hours of manual work
  • Communication with non-technical stakeholders is key—I provided documentation and demoed the script

Outcome

After deploying the script, the research team told me they reduced their weekly data processing time from 3 hours to under 10 minutes. Better yet, the reports were more accurate and consistent—freeing them up to focus on insights, not spreadsheets.


Thanks for reading!
If you're working with messy Excel files and want to automate your reporting, feel free to reach out or check out the code (available on request).