Data architecture, pipelines, infrastructure, and more
I've had the opportunity to model, architect and build a data platform from scratch on 3 occasions.
1. The first was at Voicemod, where I built a data platform on AWS to track user events, marketing campaigns and product data. I used Redshift as the data warehouse, S3 as the intermediate storage, Kinesis Firehose for batching events, DBT (data build tool) for transformations, Airflow as the orchestrator and Looker for visualization (a minimal sketch of this orchestration pattern follows the list). For backend services related to data, I used AWS ECS (Elastic Container Service) on auto-scaling EC2 instances.
2. The second was at Time2Play Media (KafeRocks), where I built a data platform on GCP (Google Cloud Platform) to track marketing campaigns, SEO-related data (Ahrefs and some other SaaS tools), data from Google Ads, Analytics and Search Console, as well as affiliate data from multiple sources. I used BigQuery as the data warehouse, GCS (Google Cloud Storage) as the intermediate storage, Dataflow for processing streaming data from Pub/Sub, Airflow hosted on Google Compute Engine as the orchestrator, DBT for transformations, the DBT catalog for data discovery and Domo as the data visualization tool.
3. The third was during my time at New Work, where I worked on a data lake solution for our machine learning platform. I used S3 as the storage layer along with Apache Hudi as the data lake format, providing ACID transactions on S3. All of these platforms were built with IaC (Terraform and Terragrunt) and had logging and monitoring support: Prometheus, Grafana and Sentry on AWS, and Cloud Monitoring and alerting on GCP. Additionally, all the data stores were periodically backed up to ensure reliability and data integrity.
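To give a flavour of the orchestration pattern the first two platforms shared, below is a minimal sketch of an Airflow DAG that runs DBT via a BashOperator. The DAG id, schedule and project paths are illustrative placeholders rather than the actual production setup.

```python
# Hypothetical Airflow DAG showing the daily run-then-test DBT pattern.
# The dag_id, schedule and paths are placeholders, not the real configuration.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="warehouse_transformations",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Build the DBT models in the warehouse (Redshift or BigQuery, depending on the platform)
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command="dbt run --project-dir /opt/dbt --profiles-dir /opt/dbt",
    )

    # Run DBT tests so broken data never reaches the reporting layer
    dbt_test = BashOperator(
        task_id="dbt_test",
        bash_command="dbt test --project-dir /opt/dbt --profiles-dir /opt/dbt",
    )

    dbt_run >> dbt_test
```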
At New Work, my team and I built a company-wide machine learning platform to help data scientists and engineers build, train, deploy and monitor machine learning models.
I was part of a central team that built the platform and provided support to the data science teams across the company. The platform was built on top of AWS. Since we had an AWS account per team, we used our own account as the central account to host Metaflow for ML pipeline orchestration, MLflow for model tracking and S3 for storing model artifacts.
We also provided our stakeholders with a Terraform module to deploy the client side of things (Step Functions, Lambdas, S3 buckets, etc.) in their own accounts, along with a CI/CD pipeline on GitHub Actions to deploy that infrastructure. On top of this, we set up monitoring via AWS CloudWatch and built an accompanying library to make it easier to integrate existing workflows with the new platform.
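To make the setup more concrete, here's a hedged sketch of what a training flow on that platform could look like, with Metaflow driving the steps and MLflow recording the run. The flow name, tracking URI, data and metric are invented for the example.

```python
# Illustrative Metaflow flow that logs to a central MLflow server.
# The tracking URI, data loading and metric are placeholders.
import mlflow
from metaflow import FlowSpec, step


class TrainFlow(FlowSpec):
    @step
    def start(self):
        # Load training data, e.g. from the Hudi-backed S3 data lake
        self.training_rows = [(1, 0.3), (2, 0.7)]  # stand-in data
        self.next(self.train)

    @step
    def train(self):
        # Point at the centrally hosted MLflow instance (hypothetical URL)
        mlflow.set_tracking_uri("https://mlflow.internal.example.com")
        with mlflow.start_run():
            # Train the model here; log metrics and artifacts for later comparison
            mlflow.log_metric("accuracy", 0.9)  # placeholder metric
        self.next(self.end)

    @step
    def end(self):
        print("Training flow finished")


if __name__ == "__main__":
    TrainFlow()
```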
I've worked on multiple data pipelines using Apache Spark across many companies.
At Voicemod, we used Spark via Glue jobs to process extracts and dumps from our MongoDB backend.
At New Work, we used Apache Spark with Scala for our data processing, both on-premise and on AWS. Our on-premise pipelines were used to transform and sink data onto our HDFS cluster and Kafka queues, while our AWS workloads ran on EMR. These were usually triggered via CloudWatch events or scheduled via Airflow.
At HelloFresh, we use PySpark and Airflow to process data on AWS EMR on EKS.
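To show the common shape of these jobs, here's a minimal PySpark sketch of an extract-clean-sink step; the bucket names, paths and columns are assumptions for illustration only.

```python
# Minimal PySpark job sketch; buckets, paths and columns are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily_events_transform").getOrCreate()

# Read raw extracts (e.g. MongoDB dumps or Firehose batches landed on S3)
raw = spark.read.json("s3://example-raw-bucket/events/date=2024-01-01/")

# Basic cleaning: drop duplicates, normalise timestamps, keep only the fields we need
events = (
    raw.dropDuplicates(["event_id"])
    .withColumn("event_ts", F.to_timestamp("event_ts"))
    .select("event_id", "user_id", "event_type", "event_ts")
)

# Sink to partitioned Parquet for downstream consumers (warehouse loads, ML, etc.)
events.write.mode("overwrite").partitionBy("event_type").parquet(
    "s3://example-curated-bucket/events/"
)
```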
I've worked on implementing CI/CD pipelines for our ETLs, data microservices and infrastructure as code.
1. At Voicemod, we used GitLab CI/CD to deploy our ETL codebase and Airflow DAGs onto our AWS ECS cluster and to run DBT tests before deploying the models to the data warehouse.
2. At Time2Play Media, we used Bitbucket Pipelines and later GitHub Actions to deploy our Dataflow jobs, Airflow DAGs and DBT models to GCP.
3. At RakutenTV, we used ArgoCD to deploy our Streamlit applications on Kubernetes.
4. At New Work, we used Jenkins and GitHub Actions to deploy our infrastructure as code and data pipelines, and to run linting and testing for our microservices and platform.
During my time at Voicemod, my team and I built a data warehouse using Amazon Redshift to store and analyze data from multiple sources (CRM, product, performance marketing).
It was before AWS released AQUA (Advanced Query Accelerator), so we had to use a lot of techniques to optimize query performance. I gained experience with traditional warehouses with coupled storage and compute, scaling the cluster without downtime and the art of choosing the right distribution and sort keys. I also learned how to use the VACUUM and ANALYZE commands to keep the cluster performing well.
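As a small illustration of that maintenance work, here's a hedged snippet that runs VACUUM and ANALYZE against Redshift from Python; the cluster endpoint, credentials and table name are placeholders.

```python
# Hypothetical Redshift maintenance snippet; endpoint, credentials and table are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="example-cluster.abc123.eu-west-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="maintenance_user",
    password="change-me",
)
conn.autocommit = True  # VACUUM cannot run inside a transaction block

with conn.cursor() as cur:
    # Reclaim space and re-sort rows after heavy deletes and updates
    cur.execute("VACUUM FULL analytics.fact_events;")
    # Refresh table statistics so the planner chooses good query plans
    cur.execute("ANALYZE analytics.fact_events;")

conn.close()
```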
At RakutenTV, we used Snowflake as our data warehouse. I contributed to the data modeling and ETL processes where I improved code quality, security and performance of the Python codebase.
I also worked on data quality checks and monitoring of the data pipelines. There were three environments (dev, staging and prod), and I worked on the CI/CD pipelines to test and deploy the codebase to each of them. Beyond that, I spent a good amount of time on performance optimization and tuning of Snowflake queries, and had the opportunity to work with the Matillion ETL tool to build data pipelines and automate certain tasks using its API.
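As an example of the kind of data quality check involved, here's a hedged sketch using the Snowflake Python connector; the account, warehouse, table and columns are invented, and real credentials would come from a secrets manager rather than being hard-coded.

```python
# Illustrative Snowflake data quality check; connection details and table are made up.
import snowflake.connector

conn = snowflake.connector.connect(
    account="example_account",
    user="dq_service_user",
    password="change-me",  # placeholder
    warehouse="ANALYTICS_WH",
    database="ANALYTICS",
    schema="MARTS",
)

try:
    cur = conn.cursor()
    # Fail the pipeline if today's load produced no rows or null business keys
    cur.execute(
        """
        SELECT COUNT(*) AS total_rows,
               COUNT_IF(content_id IS NULL) AS null_keys
        FROM fact_streams
        WHERE load_date = CURRENT_DATE()
        """
    )
    total_rows, null_keys = cur.fetchone()
    assert total_rows > 0, "No rows loaded today"
    assert null_keys == 0, f"{null_keys} rows have a null content_id"
finally:
    conn.close()
```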
At Time2Play Media, we used Google BigQuery as our data warehouse. It was my first experience working with a managed cloud data warehouse. GCP also provides a free tier of 1 TB of data processed per month, which was very helpful for our small team.
I worked on the data modeling, ETL processes, data quality checks and monitoring of the data pipelines. I also worked on the CI/CD pipelines to test and deploy the codebase to the GCP environment.
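Because of that free tier, it paid to estimate query cost up front. Here's a hedged sketch using a BigQuery dry run to check bytes scanned before actually executing; the project, dataset and columns are placeholders.

```python
# Hedged sketch: estimate bytes scanned with a dry run before spending quota.
# The project, dataset, table and columns are illustrative.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
    SELECT campaign, SUM(clicks) AS clicks
    FROM `example-project.marketing.ads_performance`
    WHERE event_date = CURRENT_DATE()
    GROUP BY campaign
"""

# Dry run: validates the query and reports the bytes it would process, at no cost
dry = client.query(
    sql, job_config=bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
)
print(f"Query would process {dry.total_bytes_processed / 1e9:.2f} GB")

# Only run it for real if the scan is acceptably small
if dry.total_bytes_processed < 10 * 1024**3:
    for row in client.query(sql).result():
        print(row.campaign, row.clicks)
```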
At New Work, we used Apache Hudi as the data lake format on S3. We used it to store the training data for our machine learning models.
I worked on exporting our data from our on-premise Hadoop cluster to S3 using Apache Spark. The data was then stored on S3 using Apache Hudi to enable ACID transactions on the data lake and to take advantage of Hudi's indexing capabilities for GDPR deletions.
The tables were registered with the AWS Glue catalog for querying using Athena. I am also a contributor to the Apache Hudi project (both Java and Rust).
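Here's a hedged sketch of what writing such a Hudi table with catalog sync can look like in PySpark; the record key, partition column, bucket names and database are assumptions, and the Hudi Spark bundle has to be on the classpath for it to run. GDPR deletions follow the same write path with the operation set to delete, which is where Hudi's record-level indexing pays off.

```python
# Illustrative Hudi upsert on S3 with Hive/Glue catalog sync; names and keys are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("hudi_training_data_sink")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# Data exported from the on-premise Hadoop cluster, landed as Parquet on S3
df = spark.read.parquet("s3://example-export-bucket/training_data/")

hudi_options = {
    "hoodie.table.name": "training_data",
    "hoodie.datasource.write.recordkey.field": "user_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.partitionpath.field": "event_date",
    "hoodie.datasource.write.operation": "upsert",  # "delete" for GDPR removal jobs
    # Register/refresh the table in the metastore so Athena can query it via Glue
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.mode": "hms",
    "hoodie.datasource.hive_sync.database": "ml_lake",
    "hoodie.datasource.hive_sync.table": "training_data",
}

df.write.format("hudi").options(**hudi_options).mode("append").save(
    "s3://example-datalake-bucket/training_data/"
)
```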
At Voicemod, our product team used Google Cloud Platform (GCP) for running backend services. The backend customer data was stored in MongoDB, and changes were streamed to Google Cloud Pub/Sub. Our data and analytics stack was on AWS, so I built a data pipeline to consume events from GCP Pub/Sub and publish them to AWS Kinesis Firehose. The data was then transformed and stored in S3, which helped batch events by time and size. We also transformed and synced this data to Redshift for analytics.
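A hedged sketch of that bridge is below; the GCP project, subscription and Firehose delivery stream names are placeholders rather than the real ones.

```python
# Illustrative Pub/Sub -> Kinesis Firehose bridge; project, subscription and stream are made up.
import boto3
from google.cloud import pubsub_v1

firehose = boto3.client("firehose", region_name="eu-west-1")
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("example-gcp-project", "user-events-sub")


def forward(message):
    # Push each change event to Firehose, which batches by time/size before landing on S3
    firehose.put_record(
        DeliveryStreamName="user-events-stream",
        Record={"Data": message.data + b"\n"},
    )
    message.ack()


streaming_pull_future = subscriber.subscribe(subscription_path, callback=forward)
print("Listening for Pub/Sub messages...")
try:
    streaming_pull_future.result()  # block; the callback handles each message
except KeyboardInterrupt:
    streaming_pull_future.cancel()
```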
I've architected solutions for product and marketing analytics while working at Voicemod (AWS) and Time2Play Media (GCP). I learnt a lot about architecting based on business requirements and constraints.
For example, on AWS I created a solution to track the performance of our marketing campaigns: callbacks hit an API Gateway endpoint, were queued onto SQS and were then batched to a Lambda that did basic transformation before storing the results on S3.
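For illustration, here's a hedged sketch of the Lambda at the end of that chain; the bucket name and payload fields are assumptions rather than the production code.

```python
# Illustrative SQS-triggered Lambda landing marketing callbacks on S3.
# The bucket name and payload shape are placeholders.
import json
import uuid

import boto3

s3 = boto3.client("s3")
BUCKET = "example-marketing-callbacks"


def handler(event, context):
    # SQS delivers callbacks in batches; apply a light transformation and land them on S3
    records = []
    for record in event["Records"]:
        payload = json.loads(record["body"])
        records.append(
            {
                "campaign_id": payload.get("campaign_id"),
                "event": payload.get("event"),
                "received_at": record["attributes"]["SentTimestamp"],
            }
        )

    key = f"callbacks/{uuid.uuid4()}.json"
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(records).encode("utf-8"))
    return {"batchItemFailures": []}  # no partial failures to report
```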
I've also worked on a solution to track user events from multiple sources and sync them to a data warehouse. Some designs were built around a CDP (Customer Data Platform) such as mParticle or Segment to track user events and sync them to multiple destinations (including the data warehouse).
I have built internal tools using Streamlit for technical as well as non-technical stakeholders: dashboards to monitor the health of our data pipelines and data quality, plus ad-hoc repetitive queries that could be automated and exported as CSV or Excel files.
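A minimal Streamlit sketch of such a tool is below, with a stand-in DataFrame where the real tools queried the warehouse.

```python
# Minimal Streamlit internal-tool sketch; the data is a stand-in for a warehouse query.
import pandas as pd
import streamlit as st

st.title("Pipeline health & ad-hoc exports")

# In the real tools this came from the warehouse; here it's a hard-coded stand-in
df = pd.DataFrame(
    {
        "pipeline": ["events_ingest", "marketing_sync"],
        "last_run": ["2024-01-01 02:00", "2024-01-01 03:00"],
        "status": ["success", "failed"],
    }
)

st.subheader("Pipeline status")
st.dataframe(df)

# Let non-technical stakeholders export the result without writing SQL
st.download_button(
    label="Download as CSV",
    data=df.to_csv(index=False).encode("utf-8"),
    file_name="pipeline_status.csv",
    mime="text/csv",
)
```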