A Data Engineer passionate about Data Science 📊. I like automating things, building pipelines, exploring scalability problems, improving efficiency, and tuning performance. I’m a strong advocate for 📜 open source, ☁️ Cloud computing, 🚀 DevOps, 🆕 Innovation, and Automation 🤖
I am a Data Engineer with 5+ years of experience in the design, architecture, development, and deployment of Hadoop, Spark, AWS, and Big Data technologies, with work experience in the Middle East and India in the Healthcare and Pharmaceuticals domains.

View Resume
M.S. Data Analytics Engineering, 2021 - 2023
B.Tech in Computer Science and Engineering, 2016
Manipal Institute of Technology
MY MAJOR EXPERTISE
AWS (RDS, S3, EMR, Glue), Spark, Scala, Hive, Hadoop, Snowflake, Git, Bitbucket, Maven
Innovations & Enhancements:
Automated post-migration validation reports, bringing down costs by 90%.
Improved runtime from 18 hours to 9.5 hours by refactoring Spark Scala ETL code.
Refactored tables to use Parquet format, Snappy compression, and partitioning.
Improved efficiency and cut turnaround time from 6 hours to 1.5 hours by automating data quality and validity checks between Hive and SQL Server loads.
Developed automation scripts for SIT in Spark Scala to assess the quality, validity, and counts of inbound data files and tables, removing manual effort and intervention.
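A minimal sketch of what one such automated check could look like, combined with the Parquet/Snappy/partitioning refactor described above. This is illustrative only: the database, table, and path names are hypothetical placeholders, and it assumes a Spark installation with Hive support.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: compare row counts between a source Hive table and its migrated
// copy, then rewrite the target as partitioned, Snappy-compressed Parquet.
// All table names and paths below are hypothetical.
object PostMigrationValidation {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("post-migration-validation")
      .enableHiveSupport()
      .getOrCreate()

    // Automated count/validity check between source and target loads
    val sourceCount = spark.table("source_db.claims").count()
    val targetCount = spark.table("target_db.claims").count()

    if (sourceCount != targetCount)
      println(s"MISMATCH: source=$sourceCount target=$targetCount")
    else
      println(s"OK: $sourceCount rows match in both tables")

    // Parquet + Snappy + partition-by-date refactor pattern
    spark.table("target_db.claims")
      .write
      .option("compression", "snappy")
      .partitionBy("load_date")
      .mode("overwrite")
      .parquet("/warehouse/claims_parquet")

    spark.stop()
  }
}
```

Scheduling a job like this (e.g., via Control-M after each load) is what replaces the manual validation effort.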
Spark, Scala, Hive, Impala, Hadoop, Unix, shell scripting, Control-M, Bamboo, Git, Bitbucket, Maven, Eclipse, Cloudera distribution
Central Data Repository for MIT, Manipal:
Delivered a web application whose main objectives were to serve as a means of data entry, collect the required data, analyze it, and generate reports dynamically according to the user's custom report-format requirements. Data was loaded from the databases using Sqoop and analyzed on a Hadoop cluster; reports were generated by querying with Hive and displayed in the web application.
Project Management System:
Developed a web application that enabled users from different departments to interact and work on their respective projects, each accessing their own functions, at scale.
CERTIFICATIONS, HONOURS AND AWARDS
A content-based recommendation engine API for movies of the 1900s, built using NLP, Flask, Heroku, and Python.
Predicting the cost of treatment and insurance using Machine Learning.
Streamed real-time tweets on current affairs (e.g., COVID-19) into Elasticsearch using a Kafka 2.0.0 high-throughput producer and consumer, configured with safe, idempotent, and compression settings.
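A hedged sketch of the producer configuration such a pipeline might use. The config keys are standard Kafka producer settings available in 2.0.0; the broker address, topic name, and payload are hypothetical placeholders.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}

// Sketch of a "safe", idempotent, compressed Kafka producer.
// Broker address and topic name are hypothetical.
val props = new Properties()
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
  "org.apache.kafka.common.serialization.StringSerializer")
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
  "org.apache.kafka.common.serialization.StringSerializer")

// Safe producer: no duplicates on retry, no reordering
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true")
props.put(ProducerConfig.ACKS_CONFIG, "all")
props.put(ProducerConfig.RETRIES_CONFIG, Integer.toString(Integer.MAX_VALUE))
props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, "5")

// High throughput: batch and compress messages before sending
props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy")
props.put(ProducerConfig.LINGER_MS_CONFIG, "20")
props.put(ProducerConfig.BATCH_SIZE_CONFIG, Integer.toString(32 * 1024))

val producer = new KafkaProducer[String, String](props)
producer.send(new ProducerRecord("tweets", "key", "tweet json payload"))
producer.close()
```

Idempotence plus `acks=all` gives the "safe" delivery guarantees, while `linger.ms`, `batch.size`, and Snappy compression trade a few milliseconds of latency for much higher throughput.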
Determining which programming languages and execution engines are the quickest or the slowest at processing files.
Some of my recent literary work