Case study

Cloud Data Platform Ingestor for Enterprise Data Management

Company: Enterprise Client

Industry: Financial Services

Challenge: Heterogeneous data formats with no unified ingestion or processing pipeline

Impact: 28 data pipelines supporting near real-time processing across the organisation

An enterprise organisation embarked on a Cloud Data Platform (CDP) initiative to centralise and govern data across multiple business units. The project required building Ingestor and Exporter tools that could handle the full lifecycle of data, from ingestion in diverse formats through to governed storage in Amazon Redshift and export back to teams in their preferred formats.

SISU Solutions was engaged to design and build the Ingestor component, which needed to accept data in formats ranging from Excel and CSV to XML, JSON, and DAT files, transform it into optimised Parquet format, and load it into Redshift tables with full schema validation and governance.

The challenge

Taming data diversity at enterprise scale

The core challenge was integrating heterogeneous data formats into a unified processing pipeline. Each business unit had its own data standards, file formats, and delivery mechanisms. Building a system flexible enough to handle this diversity while enforcing consistent quality and governance standards required careful architectural thinking.

Schema validation and type enforcement across these diverse data structures was critical. Incorrectly typed data flowing into the data warehouse would undermine trust and break downstream analytics. The metadata management system needed to support multiple file types while remaining simple enough for operations teams to maintain.
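To illustrate the kind of type enforcement described above, the sketch below checks each row against a declared schema before anything is loaded. The schema layout, the column names, and the `validate_row` helper are hypothetical; the case study does not describe the actual validation rules.

```python
from datetime import date
from decimal import Decimal, InvalidOperation

# Hypothetical schema: column name -> expected type. Not taken from the real platform.
SCHEMA = {"account_id": int, "balance": Decimal, "as_of": date}


def coerce(value: str, expected: type):
    """Convert a raw string to the expected type, raising ValueError on failure."""
    if expected is int:
        return int(value)
    if expected is Decimal:
        try:
            return Decimal(value)
        except InvalidOperation as exc:
            raise ValueError(f"not a decimal: {value!r}") from exc
    if expected is date:
        return date.fromisoformat(value)
    return value


def validate_row(row: dict, schema: dict = SCHEMA):
    """Return (typed_row, errors). A row with any error should not be loaded."""
    typed, errors = {}, []
    for column, expected in schema.items():
        if column not in row:
            errors.append(f"missing column: {column}")
            continue
        try:
            typed[column] = coerce(row[column], expected)
        except ValueError as exc:
            errors.append(f"{column}: {exc}")
    return typed, errors
```

Collecting every error for a row, rather than stopping at the first, gives operations teams a complete picture of what is wrong with a rejected file.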

The architecture also needed to be extensible. New data sources and formats would continue to emerge as more teams onboarded to the platform, so the system couldn't be hardcoded around today's requirements. Error handling and logging in the distributed, event-driven environment had to be robust enough to diagnose issues quickly across complex multi-step pipelines.

The solution

Event-driven, modular, and built to scale

The SISU team developed a Python-based metadata module with a key/value structure that allowed configuration-driven processing for each data source. This meant new data sources could be onboarded by defining metadata rather than writing new code, which reduced the effort required to expand the platform.
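A minimal sketch of what configuration-driven processing with key/value metadata might look like is shown below. The source names, the keys (`format`, `delimiter`, `target_table`), and the `load_source_config` helper are all hypothetical and only illustrate the idea; the real metadata module is not described in this case study.

```python
# Hypothetical metadata: one key/value entry per data source.
SOURCE_METADATA = {
    "treasury_positions": {
        "format": "csv",
        "delimiter": "|",
        "target_table": "staging.treasury_positions",
    },
    "risk_limits": {
        "format": "json",
        "target_table": "staging.risk_limits",
    },
}

# Keys every entry must define (an assumption made for this sketch).
REQUIRED_KEYS = {"format", "target_table"}


def load_source_config(source_name: str, metadata: dict = SOURCE_METADATA) -> dict:
    """Look up a source's settings and fail fast if the entry is incomplete."""
    try:
        config = metadata[source_name]
    except KeyError:
        raise KeyError(f"no metadata defined for source {source_name!r}") from None
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise ValueError(f"{source_name}: missing keys {sorted(missing)}")
    return config
```

Under this approach, onboarding a new source would mean adding one more entry to the metadata rather than changing the processing code.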

A CI/CD pipeline was implemented using Azure DevOps for metadata version control and deployment, ensuring changes were tested and promoted through environments reliably. Modular AWS Lambda functions were built in Python for format-specific data processing. Amazon S3 handled data staging, and Apache Airflow orchestrated the end-to-end ETL workflow.
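The modular, format-specific processing described above can be pictured as one small parser per format behind a common interface. The sketch below is a single-process illustration using only the Python standard library; in the platform itself each format was handled by its own Lambda function. The `parser` decorator, the function names, and the flat XML layout are assumptions, and the Parquet conversion step is left out.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET
from typing import Callable

# Registry mapping a format name to the function that parses it.
PARSERS: dict[str, Callable[[str], list[dict]]] = {}


def parser(fmt: str):
    """Register a function as the parser for one file format."""
    def register(func):
        PARSERS[fmt] = func
        return func
    return register


@parser("csv")
def parse_csv(text: str) -> list[dict]:
    # The first line is treated as the header row.
    return list(csv.DictReader(io.StringIO(text)))


@parser("json")
def parse_json(text: str) -> list[dict]:
    # Accept either a list of records or a single record.
    data = json.loads(text)
    return data if isinstance(data, list) else [data]


@parser("xml")
def parse_xml(text: str) -> list[dict]:
    # Assumes a flat layout: <rows><row><col>value</col>...</row>...</rows>
    root = ET.fromstring(text)
    return [{child.tag: child.text for child in row} for row in root]


def parse(fmt: str, text: str) -> list[dict]:
    """Dispatch to the parser registered for fmt."""
    if fmt not in PARSERS:
        raise ValueError(f"unsupported format: {fmt}")
    return PARSERS[fmt](text)
```

Because every parser returns the same shape (a list of dictionaries), supporting another format only requires registering one more function.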

The ingestion pipeline was event-driven: SNS/SQS messages invoked the Lambda functions, enabling near real-time processing as data arrived. Incoming files were converted to Parquet, a columnar format, for efficient loading into Redshift. Custom JSON-format logging was built for analytics and debugging and integrated with CloudWatch for operational monitoring. Infrastructure was managed through Terraform, keeping the entire platform reproducible and version-controlled.
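As a rough sketch of the event-driven trigger, the handler below assumes one common wiring: S3 object notifications are published to an SNS topic, delivered to an SQS queue without raw message delivery, and the queue invokes the Lambda function. The case study does not confirm this exact topology, and the function names and log fields are hypothetical. Each processed object is reported as a single JSON log line on standard output, which Lambda forwards to CloudWatch Logs.

```python
import json
from urllib.parse import unquote_plus


def extract_s3_objects(event: dict) -> list[tuple[str, str]]:
    """Unwrap SQS -> SNS -> S3 notifications into (bucket, key) pairs."""
    objects = []
    for sqs_record in event.get("Records", []):
        # The SQS body is the SNS envelope, serialised as a JSON string.
        sns_envelope = json.loads(sqs_record["body"])
        # The SNS "Message" field is the S3 event, again a JSON string.
        s3_event = json.loads(sns_envelope["Message"])
        for s3_record in s3_event.get("Records", []):
            bucket = s3_record["s3"]["bucket"]["name"]
            # Object keys arrive URL-encoded in S3 notifications.
            key = unquote_plus(s3_record["s3"]["object"]["key"])
            objects.append((bucket, key))
    return objects


def handler(event, context):
    """Lambda entry point: log each newly arrived object as one JSON line."""
    objects = extract_s3_objects(event)
    for bucket, key in objects:
        print(json.dumps({
            "level": "INFO",
            "event": "object_received",
            "bucket": bucket,
            "key": key,
        }))
    return {"processed": len(objects)}
```

Keeping the unwrapping logic in its own function makes it testable locally with a hand-built event, without deploying anything to AWS.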

Results

28 data pipelines deployed
6+ data formats supported
Near real-time processing via event-driven Lambda
