Journal and whitepaper publications.
In this paper, we describe a new web-based corpus for hypernym detection. It consists of 32 GB of high-quality English paragraphs along with their part-of-speech-tagged and dependency-parsed versions. The current state of the art in hypernym detection relies on a corpus that is not freely available. We evaluate the state-of-the-art methods on our corpus and achieve similar results. The advantage of our corpus is that it is available under an open license. Our main contribution is the corpus with POS tags and dependency parses, together with the code to extract the corpus and reproduce the results we report.
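To make the task concrete, here is a minimal sketch of pattern-based hypernym extraction in the spirit of Hearst patterns. It is purely illustrative, not the exact method evaluated in the paper; real systems use many more patterns and lean on the POS tags and dependency parses to filter spurious matches.

```python
import re

# One Hearst-style pattern: "X such as Y" suggests X is a hypernym of Y.
SUCH_AS = re.compile(r"(\w+) such as (\w+)")

def extract_hypernym_pairs(text: str) -> list[tuple[str, str]]:
    """Return (hypernym, hyponym) pairs found by the pattern."""
    return [(m.group(1), m.group(2)) for m in SUCH_AS.finditer(text)]

sentence = "The corpus mentions animals such as dogs and vehicles such as cars."
print(extract_hypernym_pairs(sentence))
# [('animals', 'dogs'), ('vehicles', 'cars')]
```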
OSS projects I've contributed to.
I've contributed to the Apache Hudi project, mainly fixing bugs, adding integration tests, and improving the Rust implementation of Hudi.
I contributed some bug fixes to the Feast project during my time at New Work when I was evaluating it for our feature store POC.
I contributed bug fixes to the OpenMetadata project during my time at New Work when I was working on the data catalog POC. We eventually decided to use DataHub by LinkedIn for our data catalog.
Fun projects I've worked on to learn new technologies and concepts.
Krank is a queue-based autoscaling solution that dynamically adjusts the number of replicas for your deployments based on metrics collected from message brokers. Currently it supports Kafka, with RabbitMQ and SQS support planned. The core scaling loop is sketched below.
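A minimal sketch of that loop, assuming a hypothetical get_total_lag() helper and an illustrative per-replica throughput; the real tool queries the broker for consumer lag and patches the deployment through the Kubernetes API:

```python
import math
import time

# Hypothetical knobs; names are illustrative, not Krank's actual config.
MESSAGES_PER_REPLICA = 1000   # lag one replica can work off per interval
MIN_REPLICAS, MAX_REPLICAS = 1, 20

def get_total_lag(group: str, topic: str) -> int:
    """Stub: in the real tool this queries the broker, e.g. Kafka consumer
    group offsets vs. partition high watermarks."""
    raise NotImplementedError

def desired_replicas(lag: int) -> int:
    wanted = math.ceil(lag / MESSAGES_PER_REPLICA)
    return max(MIN_REPLICAS, min(MAX_REPLICAS, wanted))

def scale_loop(group: str, topic: str, deployment: str) -> None:
    while True:
        lag = get_total_lag(group, topic)
        replicas = desired_replicas(lag)
        # In the real tool: patch the Deployment's replica count via the
        # Kubernetes API (e.g. AppsV1Api.patch_namespaced_deployment_scale).
        print(f"{deployment}: lag={lag} -> replicas={replicas}")
        time.sleep(30)
```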
A Kafka Connect plugin to stream data from Kafka to AWS Kinesis, inspired by the AWSLabs connector.
An in-memory key-value store, made fault tolerant via a write-ahead log (WAL) and backed by a red-black tree for fast lookups. It also supports accessing the store via a REST API. I created this project to learn more about storage data structures and plan to extend it with LSM-tree-based storage.
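The WAL idea in a nutshell: append every write to a log and fsync it before applying it in memory, then replay the log on startup to rebuild the state. A minimal Python sketch (a plain dict stands in for the red-black tree):

```python
import json
import os

class WalKVStore:
    """Tiny key-value store: every write hits the log before memory,
    so the state survives a crash."""

    def __init__(self, wal_path: str = "store.wal"):
        self.wal_path = wal_path
        self.data: dict[str, str] = {}
        self._replay()

    def _replay(self) -> None:
        if not os.path.exists(self.wal_path):
            return
        with open(self.wal_path) as wal:
            for line in wal:
                entry = json.loads(line)
                if entry["op"] == "set":
                    self.data[entry["key"]] = entry["value"]
                else:  # "del"
                    self.data.pop(entry["key"], None)

    def _append(self, entry: dict) -> None:
        with open(self.wal_path, "a") as wal:
            wal.write(json.dumps(entry) + "\n")
            wal.flush()
            os.fsync(wal.fileno())  # make the write durable before applying

    def set(self, key: str, value: str) -> None:
        self._append({"op": "set", "key": key, "value": value})
        self.data[key] = value

    def delete(self, key: str) -> None:
        self._append({"op": "del", "key": key})
        self.data.pop(key, None)

    def get(self, key: str) -> str | None:
        return self.data.get(key)
```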
A simple project that sends Airflow metrics to Google BigQuery. It listens to the StatsD metrics emitted by Airflow and forwards them to BigQuery, supporting both sync and async modes.
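A sketch of the synchronous path; the destination table is hypothetical and the parsing is simplified (real StatsD lines can also carry sample rates, which this ignores):

```python
import socket
import time

from google.cloud import bigquery

TABLE_ID = "my-project.metrics.airflow"  # hypothetical destination table

def parse_statsd(line: str) -> dict:
    """Parse a StatsD datagram like 'airflow.ti.finish:1|c' into a row."""
    name, rest = line.split(":", 1)
    value, metric_type = rest.split("|", 1)
    return {"name": name, "value": float(value),
            "type": metric_type, "ts": time.time()}

def listen(host: str = "0.0.0.0", port: int = 8125) -> None:
    client = bigquery.Client()
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))  # Airflow's StatsD client sends UDP here
    while True:
        data, _ = sock.recvfrom(4096)
        rows = [parse_statsd(line) for line in data.decode().splitlines()]
        errors = client.insert_rows_json(TABLE_ID, rows)  # streaming insert
        if errors:
            print("BigQuery insert errors:", errors)
```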
Shepherd herds your Kafka messages to Kinesis. It has an inbuilt AIOScheduler that schedules a poll from Kafka and forwards the result to Kinesis based on either the time to wait or the number of records to send. The Forwarder forwards messages to Kinesis using an in-memory buffer, and the consumer consumes from Kafka using the AIOKafka library, which is asynchronous/non-blocking.
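A condensed sketch of that poll-buffer-forward loop; the topic, stream, and threshold values are illustrative, and the blocking boto3 call is a simplification of what the real project does:

```python
import asyncio

import boto3
from aiokafka import AIOKafkaConsumer

kinesis = boto3.client("kinesis")  # boto3 is blocking; the real project
                                   # handles this more carefully

async def herd(topic: str, stream: str, max_records: int = 500,
               wait_seconds: float = 5.0) -> None:
    consumer = AIOKafkaConsumer(topic, bootstrap_servers="localhost:9092")
    await consumer.start()
    try:
        while True:
            # Poll until either the time or the record-count threshold hits.
            batches = await consumer.getmany(
                timeout_ms=int(wait_seconds * 1000), max_records=max_records)
            buffer = [record for records in batches.values()
                      for record in records]
            if not buffer:
                continue
            kinesis.put_records(
                StreamName=stream,
                Records=[{"Data": r.value,
                          "PartitionKey": (r.key or b"default").decode()}
                         for r in buffer])
    finally:
        await consumer.stop()

asyncio.run(herd("my-topic", "my-stream"))
```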
The Qdrant Kafka Ingestor is a Rust-based CLI application designed to ingest data from Kafka and upload it to a Qdrant database. This tool is particularly useful for integrating streaming data pipelines with vector search engines.
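The project itself is written in Rust; here is the same pipeline shape sketched in Python with confluent-kafka and qdrant-client, with hypothetical topic, collection, and message layout:

```python
import json

from confluent_kafka import Consumer
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

# Hypothetical names: adjust topic, collection, and connection details.
consumer = Consumer({"bootstrap.servers": "localhost:9092",
                     "group.id": "qdrant-ingestor"})
consumer.subscribe(["embeddings"])
qdrant = QdrantClient(url="http://localhost:6333")

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    # Assume messages look like {"id": 1, "vector": [...], "payload": {...}}.
    doc = json.loads(msg.value())
    qdrant.upsert(
        collection_name="documents",
        points=[PointStruct(id=doc["id"], vector=doc["vector"],
                            payload=doc.get("payload", {}))])
```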
Sentinel is a systemd runner mainly developed for running Python programs as services. I never got around to completing this one; let's see if I can revive it.
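The core idea, sketched in Python (the unit keys are standard systemd; the function and its defaults are hypothetical, not Sentinel's actual API):

```python
from pathlib import Path

UNIT_TEMPLATE = """\
[Unit]
Description={description}

[Service]
ExecStart={python} {script}
Restart=on-failure

[Install]
WantedBy=multi-user.target
"""

def write_unit(name: str, script: str, description: str,
               python: str = "/usr/bin/python3") -> Path:
    """Render a minimal unit file (needs root to write). Enabling it is
    left to systemctl: daemon-reload, then enable --now <name>."""
    unit = UNIT_TEMPLATE.format(description=description,
                                python=python, script=script)
    path = Path(f"/etc/systemd/system/{name}.service")
    path.write_text(unit)
    return path
```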