Deliver fresh and fast insights
At RakutenTV, we had many cases where external systems delivered data to S3 in real time.
I designed and implemented a real-time ingestion pipeline into Snowflake using Snowpipe, which relies on S3 event notifications and an AWS SQS queue to load data as soon as it arrives.
I also built a Python Streamlit application to monitor ingestion status and data quality. This allowed us to deliver fresh insights to our stakeholders and cut the time to insight from days to minutes.
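As a rough sketch of that kind of monitoring app (account details, pipe, and table names below are placeholders, not the actual RakutenTV setup), one can query Snowflake's pipe status and copy history and render them with Streamlit:

```python
# monitoring_app.py - minimal sketch; connection details and object names are placeholders
import snowflake.connector
import streamlit as st

st.title("Snowpipe ingestion monitor")

# Connect to Snowflake (in practice credentials come from st.secrets or a secrets manager)
conn = snowflake.connector.connect(
    account="my_account",      # placeholder
    user="monitor_user",       # placeholder
    password="***",            # placeholder
    warehouse="MONITOR_WH",
    database="RAW",
    schema="EVENTS",
)
cur = conn.cursor()

# Current state of the auto-ingest pipe (execution state, pending file count, ...)
cur.execute("SELECT SYSTEM$PIPE_STATUS('RAW.EVENTS.EVENTS_PIPE')")
st.subheader("Pipe status")
st.json(cur.fetchone()[0])

# Files loaded into the target table in the last 24 hours
cur.execute(
    """
    SELECT file_name, row_count, error_count, last_load_time
    FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
        TABLE_NAME => 'EVENTS',
        START_TIME => DATEADD('hour', -24, CURRENT_TIMESTAMP())))
    ORDER BY last_load_time DESC
    """
)
st.subheader("Recent loads")
st.dataframe(cur.fetch_pandas_all())
```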
At New Work, my team and I worked on a project to enrich customer data in real time using Kafka Streams and ksqlDB.
It was deployed on our on-premises Kubernetes cluster as a StatefulSet, since the application kept its state in on-disk RocksDB stores.
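The production application ran on the JVM (Kafka Streams / ksqlDB); purely as an illustration of the underlying stream-table join pattern, here is a Python sketch using confluent-kafka, with an in-memory dict standing in for the RocksDB-backed state and placeholder topic and field names:

```python
# enrichment_sketch.py - conceptual illustration of the stream-table join;
# the real app used Kafka Streams / ksqlDB with RocksDB-backed state stores.
import json
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",   # placeholder
    "group.id": "customer-enrichment",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "localhost:9092"})

# "Table" side: latest profile per customer, built from a compacted topic
profiles = {}  # customer_id -> profile dict

consumer.subscribe(["customer-profiles", "customer-events"])  # placeholder topics

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    payload = json.loads(msg.value())
    if msg.topic() == "customer-profiles":
        # Upsert the lookup "table" from the compacted profile topic
        profiles[payload["customer_id"]] = payload
    else:
        # Enrich the event with the latest known profile and forward it downstream
        enriched = {**payload, "profile": profiles.get(payload["customer_id"])}
        producer.produce("customer-events-enriched", json.dumps(enriched).encode())
        producer.poll(0)
```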
Later, we also built a stream processing job using Apache Hudi DeltaStreamer to sink this data into a Hudi table on HDFS.
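DeltaStreamer itself is configured and launched via spark-submit; as a rough PySpark equivalent of the sink it performed, an upsert into a Hudi table on HDFS looks something like this (table name, key fields, and paths are placeholders):

```python
# hudi_sink_sketch.py - PySpark sketch of upserting a batch into a Hudi table on HDFS.
# Requires the hudi-spark bundle on the Spark classpath; names and paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-sink-sketch").getOrCreate()

df = spark.read.json("hdfs:///landing/customer_events/")  # placeholder source

(df.write.format("hudi")
    .option("hoodie.table.name", "customer_events")
    .option("hoodie.datasource.write.recordkey.field", "event_id")
    .option("hoodie.datasource.write.precombine.field", "event_ts")
    .option("hoodie.datasource.write.operation", "upsert")
    .mode("append")
    .save("hdfs:///warehouse/hudi/customer_events"))
```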
Some of the legacy applications used Akka Streams for stream processing.
During my time at New Work, I worked on a POC to ingest real-time data into a Feast feature store.
We used Athena as our offline store, with the data stored on S3, and DynamoDB as our online store.
The POC used the Python Faust library to consume data from Kafka topics and write it to the feature store.
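A minimal sketch of that pattern is below, assuming placeholder topic names, fields, and a locally configured Feast repo; the exact Feast write API differs between versions:

```python
# feature_ingest_sketch.py - sketch of consuming Kafka events with Faust and writing
# them to a Feast feature store; topic, fields, and repo path are placeholders.
import pandas as pd
import faust
from feast import FeatureStore

app = faust.App("feature-ingest", broker="kafka://localhost:9092")  # placeholder broker

class UserActivity(faust.Record):
    user_id: str
    clicks: int
    event_ts: str

activity_topic = app.topic("user-activity", value_type=UserActivity)  # placeholder topic
store = FeatureStore(repo_path=".")  # Feast repo configured with Athena offline / DynamoDB online

@app.agent(activity_topic)
async def ingest(events):
    async for event in events:
        df = pd.DataFrame([{
            "user_id": event.user_id,
            "clicks": event.clicks,
            "event_timestamp": pd.Timestamp(event.event_ts),
        }])
        # Push the latest feature values to the online store (DynamoDB);
        # the call name varies across Feast versions.
        store.write_to_online_store("user_activity_features", df)

# Run with: faust -A feature_ingest_sketch worker -l info
```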
I created a POC to ingest data directly from Kafka into Qdrant, a vector database. The application was written in Rust; it used a Rust Kafka client to consume the messages and the Qdrant Rust SDK to batch records by time and size and write the vectors to the database. Offsets were committed only after a batch had been uploaded to Qdrant successfully, and Tokio was used for async IO. (Repository)
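The linked implementation is in Rust; purely to illustrate the batching and commit-after-upload pattern, here is a rough Python equivalent using confluent-kafka and qdrant-client (broker, topic, collection, and batch sizes are placeholders):

```python
# kafka_to_qdrant_sketch.py - Python illustration of the batching / commit-after-upload
# pattern described above; the actual project is written in Rust.
import json
import time
from confluent_kafka import Consumer
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # placeholder
    "group.id": "qdrant-ingest",
    "enable.auto.commit": False,            # offsets are committed manually
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["embeddings"])          # placeholder topic
client = QdrantClient(url="http://localhost:6333")  # placeholder endpoint

BATCH_SIZE, BATCH_SECONDS = 256, 5.0
batch, batch_start = [], time.monotonic()

while True:
    msg = consumer.poll(1.0)
    if msg is not None and not msg.error():
        record = json.loads(msg.value())
        batch.append(PointStruct(id=record["id"],
                                 vector=record["vector"],
                                 payload=record.get("payload", {})))
    # Flush on size or time, then commit offsets only after a successful upload
    if batch and (len(batch) >= BATCH_SIZE or time.monotonic() - batch_start >= BATCH_SECONDS):
        client.upsert(collection_name="documents", points=batch)  # placeholder collection
        consumer.commit(asynchronous=False)  # at-least-once: commit after upload succeeds
        batch, batch_start = [], time.monotonic()
```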