I am Thomas George Thomas

A Data Engineer passionate about Data Science 📊. I like automating things, building pipelines, exploring scalability problems, improving efficiency, and tuning performance. I’m a strong advocate for 📜 open source, ☁️ cloud computing, 🚀 DevOps, 🆕 innovation, and automation 🤖.
















Senior Software Engineer - Niche

Legato Health Technologies, Anthem Inc.

Oct 2020 – Aug 2021 Bangalore, India

Highlights :

  • Transformed segmentation data from S3 and Athena into the reporting layer in Snowflake.
  • Built data pipelines in AWS using S3, RDS, Athena, Step Functions, and EMR.
  • Migrated data from the on-premise Hadoop cluster to AWS and Snowflake.


AWS: RDS, S3, EMR, Glue, Spark, Scala, Hive, Hadoop, Snowflake, Git, Bitbucket, Maven


Software Engineer - Big Data

Legato Health Technologies, Anthem Inc.

Jun 2018 – Oct 2020 Bangalore, India

Highlights :

  • Engaged primarily in developing Spark Scala code involving RDDs, DataFrames, and Spark SQL.
  • Developed shell scripts to process 1.5 TB of CSV and Parquet data from the inbound to the outbound layer for generating Tableau reports.
  • Developed fully automated CI/CD pipelines using Bamboo to migrate Unix items and ETL jars into pre-prod and prod environments, removing any manual effort.

Innovations & Enhancements:

  • Automated post-migration validation reports, bringing down costs by 90%.

  • Improved runtime from 18 hours to 9.5 hours by refactoring Spark Scala ETL code.

  • Refactored tables to use the Parquet format, Snappy compression, and partitioning.

  • Reduced turnaround time from 6 hours to 1.5 hours by automating data quality and validity checks between Hive and SQL Server loads.

  • Developed automation scripts for SIT in Spark Scala to assess the quality, validity, and counts of inbound data files and tables, removing manual effort and intervention.


Spark, Scala, Hive, Impala, Hadoop, Unix, Shell scripting, Control-M, Bamboo, Git, Bitbucket, Maven, Eclipse, Cloudera distribution


Software Engineer - Big Data & Hadoop Developer

Middle East Management Consultancy and Marketing

Jun 2016 – May 2018 Muscat, Oman

Highlights :

  • Shipped and delivered the product end to end.
  • Implemented Sqoop for massive dataset transfer between the Hadoop file system and RDBMS.
  • Involved in the design and creation of partitioned table DDLs in Hive.
  • Worked on performance tuning, analysis, and response time reduction techniques in SQL and Sqoop.
  • Worked with different file formats such as CSV, Parquet, and Snappy-compressed files.
  • Processed delimited data using Spark SQL to build pipelines from the landing zone to the outbound layer.

Practice School Student / Researcher

Manipal Institute of Technology

Jan 2016 – May 2016 Manipal, India

Central Data Repository for MIT, Manipal:

Delivered a web application for data entry, data collection, analysis, and dynamic report generation according to each user's custom report format requirements. Data was loaded from the databases using Sqoop and analyzed on a Hadoop cluster. Reports were generated by querying with Hive and displayed in the web application.


Software Development Intern

CGI Information Systems and Management Consultants Pvt. Ltd

May 2015 – Jul 2015 Manipal, India

Project Management System:

Developed a web application that enabled users across departments to interact with one another and with their respective projects, and to access project functions at scale.



Anthem Go Above IMPACT Award 2021

Awarded for going above and beyond in 2021
See certificate

IBM Data Science Professional

Earned for completing the IBM Data Science certification.
See certificate

Annual Team Innovation Award

Awarded for innovations delivered for 2019 – 2020.

Arctic Code Vault Contributor

Awarded for OSS contributions to the GitHub Archive Program.
See certificate

Iron Man of Technology 2

Awarded for being a standout performer for Q4 of 2019.
See certificate



Retro Movies Recommender

A content-based recommendation engine API for movies of the 1900s, built using NLP, Flask, Heroku, and Python.

Clustering Paris and London

Clustering neighborhoods of Paris and London using machine learning.

Treatment Cost Prediction

Predicting the cost of treatment and insurance using machine learning.

Covid 19 Tweet Data Scraping

Streams real-time tweets on current affairs such as COVID-19 into Elasticsearch using a high-throughput Kafka 2.0.0 producer and consumer with safe, idempotent, and compressed configurations.
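
For readers unfamiliar with the terms, here is a minimal sketch of what a "safe, idempotent, compressed" producer setup might look like. The keys are standard Kafka producer property names; the values, the broker address, and the helper `is_safe_producer` are illustrative assumptions, not the project's actual settings.

```python
# Illustrative producer settings. Keys are standard Kafka producer property
# names; every value below is an assumption, not taken from the project.
SAFE_HIGH_THROUGHPUT_PRODUCER = {
    "bootstrap.servers": "localhost:9092",  # assumed local broker
    # Safety: wait for all in-sync replicas and deduplicate retried sends.
    "acks": "all",
    "enable.idempotence": True,
    "retries": 2147483647,
    "max.in.flight.requests.per.connection": 5,
    # Throughput: compress batches and let them fill up briefly before sending.
    "compression.type": "snappy",
    "linger.ms": 20,
    "batch.size": 32 * 1024,
}


def is_safe_producer(config: dict) -> bool:
    """Hypothetical check that a config meets the idempotent-producer constraints."""
    return (
        config.get("acks") == "all"
        and config.get("enable.idempotence") is True
        and int(config.get("max.in.flight.requests.per.connection", 5)) <= 5
    )
```

A dictionary like this could be handed to a Kafka client that accepts Java-style property names; clients that use other key styles would need the names adapted.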

File Processing Comparative Analytics

Determining which programming languages and execution engines are the quickest or the slowest at processing files.

Movie Analytics in Spark and Scala

Data cleaning, pre-processing, and analytics on a million movies using Spark and Scala.

Recent Publications

Some of my recent literary work

Determining insurance and treatment costs based on different features to ultimately build a regression model that accurately predicts trends.

Comparing and benchmarking popular programming languages and execution engines.

Six definite ways to improve efficiency and reduce load times.

Clustering Neighborhoods of London and Paris using Machine Learning