Publications

Journal and whitepaper publications.

Hypernym-LIBre: A Free Web-based Corpus for Hypernym Detection

In this paper, we describe a new web-based corpus for hypernym detection. It consists of 32 GB of high-quality English paragraphs along with their part-of-speech-tagged and dependency-parsed versions. The current state of the art in hypernym detection uses a corpus that is not freely available. We evaluate state-of-the-art methods on our corpus and achieve similar results. The advantage of this corpus is that it is available under an open license. Our main contributions are the corpus with POS tags and dependency tags, and the code to extract the data and reproduce the results we achieved with it.

Artificial Intelligence, Natural Language Processing, Machine Learning

Open Source Software Contributions

OSS projects I've contributed to.

Apache Hudi

I've contributed to the Apache Hudi project, mainly by fixing bugs, adding integration tests, and improving the Rust implementation of Hudi.

Rust, Apache Hudi, Datalake Table format

Feast Feature Store

I contributed some bug fixes to the Feast project during my time at New Work when I was evaluating it for our feature store POC.

Python, Feast, Feature Store, Spark

Open Metadata

I contributed bug fixes to the Open Metadata project during my time at New Work when I was working on the data catalog POC. We eventually decided to use DataHub by LinkedIn for our data catalog.

Java, Open Metadata, Data Catalog

Toy Projects

Fun projects I've worked on to learn new technologies and concepts.

Krank - Queue Based Autoscaler

Krank is a queue-based auto-scaling solution designed to dynamically adjust the number of replicas for your deployments based on the metrics collected from various message brokers. Currently, it supports Kafka, with plans to support RabbitMQ and SQS in the future.

Kubernetes, Kafka, Prometheus, Kotlin, Helm
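The core idea of queue-based scaling can be sketched in a few lines of Python. This is an illustrative sketch only, not Krank's actual implementation: the function name `desired_replicas`, its parameters, and the "lag divided by target lag per replica" rule are all assumptions chosen to show the general technique.

```python
import math


def desired_replicas(
    total_lag: int,
    target_lag_per_replica: int,
    min_replicas: int = 1,
    max_replicas: int = 10,
) -> int:
    """Return a replica count proportional to the queue backlog.

    Hypothetical rule: one replica per `target_lag_per_replica` pending
    messages, clamped to the [min_replicas, max_replicas] range.
    """
    if target_lag_per_replica <= 0:
        raise ValueError("target_lag_per_replica must be positive")
    if min_replicas > max_replicas:
        raise ValueError("min_replicas must not exceed max_replicas")

    # Round up so that a partially filled "slot" still gets a replica.
    wanted = math.ceil(total_lag / target_lag_per_replica)

    # Clamp so the deployment never scales to zero or past its ceiling.
    return max(min_replicas, min(max_replicas, wanted))
```

In this sketch a backlog of 250 messages with a target of 100 per replica yields 3 replicas, while an empty queue falls back to `min_replicas`.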

Kafka Connect AWS Kinesis

A Kafka Connect plugin to stream data from Kafka to AWS Kinesis. Inspired by the AWSLabs connector.

Kafka, AWS Kinesis, Java, Kafka Connect

Vajra Key-Value Store

An in-memory key-value store, made fault tolerant by a write-ahead log (WAL) and backed by a red-black tree for fast lookups. The store can also be accessed via a REST API. I created this project to learn more about storage data structures, and I plan to extend it to support LSM-tree-based storage.

Java, Spring Boot, KV Store, RB-Tree, LSM Tree
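The write-ahead-log idea can be illustrated with a small Python sketch. It is not Vajra's code: the class name `WalStore`, the JSON-lines log format, and the method names are assumptions, and a plain `dict` stands in for the red-black tree because Python's standard library has no ordered tree map.

```python
import json
import os
import tempfile


class WalStore:
    """Toy in-memory key-value store that survives restarts via a WAL."""

    def __init__(self, wal_path: str) -> None:
        self._data = {}  # stand-in for the red-black tree
        self._wal_path = wal_path
        self._replay()  # rebuild in-memory state from the log
        self._wal = open(wal_path, "a", encoding="utf-8")

    def _replay(self) -> None:
        if not os.path.exists(self._wal_path):
            return
        with open(self._wal_path, encoding="utf-8") as log:
            for line in log:
                if line.strip():
                    self._apply(json.loads(line))

    def _apply(self, record: dict) -> None:
        if record["op"] == "put":
            self._data[record["key"]] = record["value"]
        elif record["op"] == "delete":
            self._data.pop(record["key"], None)

    def _log(self, record: dict) -> None:
        # Persist the record before touching memory: that is the "write-ahead".
        self._wal.write(json.dumps(record) + "\n")
        self._wal.flush()
        os.fsync(self._wal.fileno())

    def put(self, key: str, value) -> None:
        record = {"op": "put", "key": key, "value": value}
        self._log(record)
        self._apply(record)

    def delete(self, key: str) -> None:
        record = {"op": "delete", "key": key}
        self._log(record)
        self._apply(record)

    def get(self, key: str, default=None):
        return self._data.get(key, default)

    def close(self) -> None:
        self._wal.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "store.wal")
        store = WalStore(path)
        store.put("answer", 42)
        store.close()
        # A fresh instance recovers the value by replaying the log.
        recovered = WalStore(path)
        print(recovered.get("answer"))
        recovered.close()
```

Every mutation is appended and synced to disk before it is applied in memory, so reopening the same log file restores the last acknowledged state. A real store would also need log compaction and handling for a partially written final record, which this sketch omits.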

Airflow Metrics to Google BigQuery

A simple project to send Airflow metrics to Google BigQuery. Supports both sync and async modes. It listens to StatsD metrics emitted by Airflow and sends them to BigQuery.

Python, Airflow, Google BigQuery, StatsD, GCP
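The listening step boils down to parsing StatsD lines of the form `name:value|type`, optionally followed by `|@sample_rate`, into rows that could be inserted into a table. The sketch below shows that parsing only. The function name `parse_statsd_line` and the row layout are assumptions for illustration, not this project's actual schema, and the BigQuery upload is left out.

```python
def parse_statsd_line(line: str) -> dict:
    """Parse one StatsD line, e.g. 'some.metric:1|c', into a flat row.

    The returned keys (name, value, type, sample_rate) are a hypothetical
    row layout, not the schema used by the project described above.
    """
    name, _, rest = line.strip().partition(":")
    if not name or not rest:
        raise ValueError(f"not a StatsD line: {line!r}")

    parts = rest.split("|")
    if len(parts) < 2 or not parts[1]:
        raise ValueError(f"missing metric type: {line!r}")

    sample_rate = 1.0
    for extra in parts[2:]:
        # The optional third field carries the sample rate as '@0.5'.
        if extra.startswith("@"):
            sample_rate = float(extra[1:])

    return {
        "name": name,
        "value": float(parts[0]),
        "type": parts[1],  # 'c' counter, 'g' gauge, 'ms' timer
        "sample_rate": sample_rate,
    }
```

Rows in this shape could then be buffered and written in batches (async mode) or one at a time (sync mode); which of these the project does internally is not shown here.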

Shepherd

Shepherd herds your Kafka messages to Kinesis. It has a built-in AIOScheduler that schedules a poll from Kafka and forwards the records to Kinesis based on the time to wait or the number of records to send. The Forwarder forwards messages to Kinesis using an in-memory buffer. The consumer reads from Kafka using the AIOKafka library, which is asynchronous and non-blocking.

Python, Kafka, Kinesis, AIOKafka, AsyncIO
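The "flush on time or on record count" behaviour can be sketched with plain asyncio. This is a simplified illustration under stated assumptions, not Shepherd's code: the class name `BufferedForwarder`, its parameters, and the injected `send` coroutine (standing in for the Kinesis call) are all hypothetical.

```python
import asyncio
import time
from typing import Awaitable, Callable, List


class BufferedForwarder:
    """Collect records in memory and flush them by count or by age."""

    def __init__(
        self,
        send: Callable[[List], Awaitable[None]],  # hypothetical sink coroutine
        max_records: int = 500,
        max_wait_seconds: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._send = send
        self._max_records = max_records
        self._max_wait = max_wait_seconds
        self._clock = clock
        self._buffer: List = []
        self._first_record_at = 0.0

    async def add(self, record) -> None:
        if not self._buffer:
            # Start the age timer when the first record of a batch arrives.
            self._first_record_at = self._clock()
        self._buffer.append(record)
        if len(self._buffer) >= self._max_records:
            await self.flush()

    async def flush_if_due(self) -> None:
        """Flush when the oldest buffered record has waited long enough."""
        if self._buffer and self._clock() - self._first_record_at >= self._max_wait:
            await self.flush()

    async def flush(self) -> None:
        if not self._buffer:
            return
        # Swap the buffer out first so new records can arrive during send.
        batch, self._buffer = self._buffer, []
        await self._send(batch)


async def _demo() -> None:
    async def print_batch(batch: List) -> None:
        print("forwarding", batch)

    forwarder = BufferedForwarder(print_batch, max_records=2)
    for record in ("a", "b", "c"):
        await forwarder.add(record)
    await forwarder.flush()  # drain whatever is left on shutdown


if __name__ == "__main__":
    asyncio.run(_demo())
```

A scheduler task would call `flush_if_due()` periodically while the consumer loop calls `add()` for each polled message. Injecting the clock and the sink keeps the batching logic testable without a real broker.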

Qdrant Kafka Ingestor

The Qdrant Kafka Ingestor is a Rust-based CLI application designed to ingest data from Kafka and upload it to a Qdrant database. This tool is particularly useful for integrating streaming data pipelines with vector search engines.

Rust, Kafka, Qdrant, Tokio

Sentinel

Sentinel is a systemd runner developed mainly for running Python programs as services. I never got around to completing this one. Let's see if I can revive it.

Rust, Systemd, Clap, Serde