How to get a Pandas Dataframe from a gzip file?

Data today comes in many formats, and it is often compressed to save storage space and to speed up transmission between platforms. Lossless compression shrinks the data without discarding any information, so the original can be recovered on any platform by decompressing it. gzip is one such format: it compresses large files into smaller ones that are easy to decompress, and it is widely used for data transmission in cloud and server environments and in various ETL tools. In this article, let’s see how to load a gzip file into a pandas DataFrame.

Contents

  1. What is a gzip file?
  2. Benefits of a gzip file?
  3. Implementation to get pandas dataframe from gzip file
  4. Summary

What is a gzip file?

gzip is a file compression format, and also the name of the program that produces it, in which larger files are compressed into smaller ones. gzip files conventionally end with the .gz extension. The format was first released in 1992 as a free, open source replacement for the older Unix program “compress”, and gzip files are now widely used for data transmission and in ETL tools.
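As a minimal sketch of what compression and decompression look like in practice, the following uses Python's standard gzip module. The sample bytes are made up for illustration and do not come from the datasets used later in this article.

```python
import gzip

# Hypothetical sample data: repetitive text compresses well.
original = b"station_id,state,count\n" + b"000001,CA,42\n" * 1000

# Compress in memory, then decompress to recover the original bytes.
compressed = gzip.compress(original)
restored = gzip.decompress(compressed)

# Print the size before and after compression, in bytes.
print(len(original), len(compressed))
```

The round trip is lossless: the decompressed bytes are identical to the input.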


Benefits of a gzip file?

  1. Files are easy to compress and decompress on different platforms.
  2. Smaller files reduce data transmission time on cloud platforms.
  3. It can compress any type of data, from images to plain text.
  4. Decompression is fast on web servers, where the format is widely used (reportedly by about 75% of servers).

Implementation to get pandas dataframe from gzip file

Because gzip can compress many kinds of data, the time needed to load a gzip file varies with the platform and the resources available. On a well-provisioned cloud platform or server, a gzip file will typically decompress faster than on modest local hardware.

In this article, standard gzip files are used, and the full implementation of how to load a gzip file into a pandas DataFrame is shown.

Let’s import some basic libraries that are needed to load the DataFrame.

import numpy as np
import pandas as pd

Here the Python subprocess module is used to list the files available in the input directory. Its check_output function runs a command, in this case ls, and returns the command's output as bytes, which are decoded into a UTF-8 string before printing. Note that ls is a Unix command, so this step only works on Unix-like systems.

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))
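For a listing that does not depend on a Unix shell, a portable sketch using the standard pathlib module might look like the following. The helper name list_files and the example file names are hypothetical, and a temporary directory stands in for ../input so that the sketch runs anywhere.

```python
from pathlib import Path
import tempfile


def list_files(directory):
    """Return the sorted names of the entries in `directory`."""
    return sorted(p.name for p in Path(directory).iterdir())


# Demonstration on a temporary directory with two empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "example_small.txt.gz").touch()
    (Path(tmp) / "example_big.txt.gz").touch()
    listing = list_files(tmp)

print(listing)
```

In the article's environment, list_files("../input") would play the same role as the ls call.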

Two gzip files of different sizes are used here: a large one of about 465 MB and a small one of about 2.26 MB.

Let’s see if there is a time difference between loading a smaller gzip file and a larger gzip file in the same working environment.
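One way to measure such a difference is to time each pd.read_csv call with time.perf_counter. The sketch below is an illustration under assumptions: it generates two small synthetic gzip files (the helpers write_sample and timed_read, the column names, and the row counts are all made up) instead of the traffic datasets, so the timings it prints say nothing about the real files.

```python
import gzip
import os
import tempfile
import time

import pandas as pd


def write_sample(path, n_rows):
    """Write a hypothetical gzip-compressed CSV with n_rows data rows."""
    with gzip.open(path, "wt", newline="") as f:
        f.write("station_id,volume\n")
        for i in range(n_rows):
            f.write(f"{i},{i % 100}\n")


def timed_read(path):
    """Read a gzip-compressed CSV and return (DataFrame, seconds elapsed)."""
    start = time.perf_counter()
    df = pd.read_csv(path, compression="gzip")
    return df, time.perf_counter() - start


with tempfile.TemporaryDirectory() as tmp:
    small_path = os.path.join(tmp, "small.csv.gz")
    big_path = os.path.join(tmp, "big.csv.gz")
    write_sample(small_path, 1_000)
    write_sample(big_path, 100_000)

    small_df, small_secs = timed_read(small_path)
    big_df, big_secs = timed_read(big_path)

print(f"small: {len(small_df)} rows in {small_secs:.4f} s")
print(f"big:   {len(big_df)} rows in {big_secs:.4f} s")
```

The same timed_read helper could be pointed at the two files in ../input to compare their load times directly.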

Loading a smaller gzip file

Here we load the smaller gzip file, which is 2.26 MB, into a DataFrame.

gzip_df_small = pd.read_csv('../input/dot_traffic_stations_2015.txt.gz', compression='gzip',
                            header=0, sep=',', quotechar='"')
gzip_df_small.head(10)
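The compression argument can usually be left out: pd.read_csv defaults to compression='infer', which picks gzip when the path ends in .gz. The sketch below checks this on a tiny synthetic file; the file name and its contents are made up for illustration.

```python
import gzip
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "stations.csv.gz")
    with gzip.open(path, "wt", newline="") as f:
        f.write("station_id,state\n1,CA\n2,NY\n")

    # Explicit compression, as in the article's calls.
    explicit = pd.read_csv(path, compression="gzip")
    # Default behaviour: compression is inferred from the .gz extension.
    inferred = pd.read_csv(path)

print(explicit.equals(inferred))
```

Passing compression='gzip' explicitly is still useful when the file name does not carry the .gz extension.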

Loading a larger gzip file

Here we load the larger gzip file, which is 465.12 MB, into a DataFrame.

gzip_df_big = pd.read_csv('../input/dot_traffic_2015.txt.gz', compression='gzip',
                          header=0, sep=',', quotechar='"')

gzip_df_big.head(10)
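When a gzip file is too large to hold comfortably in memory, one option is to read it in pieces with the chunksize parameter of pd.read_csv, which returns an iterator of DataFrames. This is a sketch on synthetic data, not the approach used above: the file, the column names, and the chunk size are assumptions chosen for illustration.

```python
import gzip
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv.gz")
    with gzip.open(path, "wt", newline="") as f:
        f.write("station_id,volume\n")
        for i in range(10_000):
            f.write(f"{i},{i % 7}\n")

    total_rows = 0
    total_volume = 0
    # With chunksize set, read_csv yields DataFrames of up to 2,500 rows each.
    for chunk in pd.read_csv(path, compression="gzip", chunksize=2_500):
        total_rows += len(chunk)
        total_volume += int(chunk["volume"].sum())

print(total_rows, total_volume)
```

Each chunk is an ordinary DataFrame, so aggregates can be accumulated without ever loading the whole file at once.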

Main results of decompressing gzip files

  1. Depending on the size of the gzip file and the working environment, loading it can take anywhere from a fraction of a second to a few minutes.
  2. Loading time can vary considerably between platforms, because decompression speed depends on the resources available.
  3. You need to know how the data is stored and delimited in order to pass the correct separator and quote characters, especially when fields contain special characters.
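To illustrate the last point, the sketch below reads a small synthetic gzip file whose fields are separated by a pipe character and where one quoted field contains the separator itself. The data and file name are made up for illustration.

```python
import gzip
import os
import tempfile

import pandas as pd

# Hypothetical pipe-delimited data; the first field of row one contains a "|".
rows = 'name|note\n"Main St|North"|"busy"\nElm|quiet\n'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "quoted.txt.gz")
    with gzip.open(path, "wt", newline="") as f:
        f.write(rows)

    # sep and quotechar must match how the file was written.
    df = pd.read_csv(path, compression="gzip", sep="|", quotechar='"')

print(df)
```

Because the quote character is declared, the separator inside the quoted field is treated as data and the row still parses into two columns.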

Summary

Transferring large datasets in their original form across platforms is time consuming and wastes storage and bandwidth, and serving uncompressed data to every application is not always feasible. This is where compressed file formats play a vital role in transmitting data efficiently. gzip is one such format: it is used mainly for transmitting data on web servers and in ETL tools because it is lightweight and decompresses quickly on any platform. Once a gzip file is loaded into a pandas DataFrame, the data can be manipulated easily to suit the needs of the user or data handler.