File Processing Comparative Analytics

We conduct a data science experiment of sorts, benchmarking popular programming languages and execution engines against each other. Among Java, Scala and Python, we determine which language is the fastest and which is the slowest at processing files. We then run the same comparison for the execution engines Hadoop and Spark.

Observations

Among the programming languages, Python has the shortest execution time for both the small and the large file, while Scala has the longest. Interestingly, Scala takes 7 seconds longer to process the small file than the large one.
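This summary does not say how execution time was measured, so the following is only a minimal Python sketch of one way such a comparison could be timed. The word-count workload, the helper names `count_words` and `time_run`, and the generated file contents are all assumptions made for illustration, not the benchmark actually used here.

```python
import tempfile
import time
from collections import Counter
from pathlib import Path


def count_words(path):
    """Hypothetical workload: count word occurrences in a text file."""
    counts = Counter()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            counts.update(line.split())
    return counts


def time_run(func, *args, repeats=3):
    """Return the best wall-clock time in seconds over several runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        timings.append(time.perf_counter() - start)
    return min(timings)


if __name__ == "__main__":
    # Generate a small and a large input file (sizes are arbitrary assumptions).
    with tempfile.TemporaryDirectory() as tmp:
        small = Path(tmp) / "small.txt"
        large = Path(tmp) / "large.txt"
        small.write_text("spark hadoop python\n" * 1_000, encoding="utf-8")
        large.write_text("spark hadoop python\n" * 100_000, encoding="utf-8")
        for label, path in (("small", small), ("large", large)):
            print(f"{label}: {time_run(count_words, path):.4f} s")
```

Taking the best of several runs reduces noise from one-off costs such as a cold file cache. For a cross-language comparison, the same workload and the same input files would need to be used in each language.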

Among the execution engines, Spark has the shortest execution time, while Hadoop’s MapReduce engine has the longest. This ordering is in line with the often-quoted claim that Spark can be up to 100 times faster than Hadoop.

Thomas George Thomas

Data Scientist & Big Data Engineer with interests in Big Data, Data Science, Cloud Computing, Machine Learning, AI and DevOps.

Publications

Comparing and Benchmarking popular programming languages and execution engines