A compact way to store your dataframes to S3 directly from Python

Jagrit Varshney   /    Data Scientist    /    2021-07-23



    “Give me six hours to chop down a tree and I will spend the first four sharpening the axe.” 

    This popular quote is a great representation of the FinBox ethic. Yes, we’re about delivering stellar quality, but to us, the tools matter just as much as the final product. 

    Our focus remains on optimizing our workflows - and one such optimization covers moving data between Jupyter notebooks/scripts and AWS S3. As I continue to push myself and refine my skills at FinBox, I want to share some basic hacks that might help fellow data scientists save time and disk space without overwriting conflicts.

    So let’s get started.

    There are a lot of ways to export your dataframe and copy it to S3. You may need to share this data with teammates or re-use it for analysis. The simplest method is to save the data onto the system’s disk and then move it to S3. One way to do this is exporting the dataframe using pd.to_csv and uploading it to S3 using:

    1. aws s3 cp <source> <destination> CLI command

    2. Creating an S3 client using boto3 in Python and then calling its put_object method

    3. Uploading it to S3 using the AWS console

    This approach requires you to export the dataframe to disk first, and is the best way to, well, fill up your server/local machine disk space.
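For reference, the disk-based route looks something like this. The bucket name and keys are hypothetical, and the boto3 upload is shown commented out since it needs configured credentials:

```python
import pandas as pd

df = pd.DataFrame({"user_id": [1, 2], "score": [0.7, 0.9]})

# Step 1: write the dataframe to local disk
df.to_csv("report.csv", index=False)

# Step 2: upload the file with boto3 (bucket/key are hypothetical;
# requires boto3 installed and AWS credentials configured):
#   import boto3
#   s3 = boto3.client("s3")
#   s3.upload_file("report.csv", "my-analytics-bucket", "reports/report.csv")
```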

    Pandas supports uploading your files directly to S3 via pd.to_csv. It also supports Feather and Parquet files. The only drawback is that it will overwrite any existing file with the same name at the given S3 location.
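The direct route is a one-liner: pandas accepts an s3:// path when s3fs is installed. The S3 calls below are commented out because they need a real bucket and credentials (the bucket name here is hypothetical), but the call signature is identical for a local path:

```python
import pandas as pd

df = pd.DataFrame({"loan_id": [101, 102], "amount": [5000, 7500]})

# With s3fs installed and credentials configured, pandas writes straight
# to S3 -- no local file needed (bucket/key are hypothetical):
#   df.to_csv("s3://my-analytics-bucket/exports/loans.csv", index=False)
#   df.to_parquet("s3://my-analytics-bucket/exports/loans.parquet")

# Same call shape against a local path:
df.to_csv("loans.csv", index=False)
```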

    Here’s how this drawback can be overcome. upload_df_s3() uploads a dataframe to S3 without overwriting any existing file: it uses boto3 to check whether an object with that key already exists before writing.
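The script linked below is the canonical version. As a rough sketch of how such a helper might work - the function signature, the numeric-suffix naming scheme, and the injectable client are my assumptions, not the actual implementation:

```python
import io


def upload_df_s3(df, bucket, key, s3_client=None):
    """Upload a dataframe as CSV to S3 without overwriting an existing key.

    If the key already exists, a numeric suffix is appended
    (e.g. data.csv -> data_1.csv). This is an illustrative sketch,
    not FinBox's actual helper.
    """
    if s3_client is None:
        import boto3  # assumed installed, with credentials configured
        s3_client = boto3.client("s3")

    def key_exists(k):
        # list_objects_v2 with the key as prefix tells us if it already exists
        resp = s3_client.list_objects_v2(Bucket=bucket, Prefix=k)
        return any(obj["Key"] == k for obj in resp.get("Contents", []))

    stem, dot, ext = key.rpartition(".")
    final_key, n = key, 0
    while key_exists(final_key):
        n += 1
        final_key = f"{stem}_{n}{dot}{ext}" if dot else f"{key}_{n}"

    # serialize the dataframe in memory -- nothing touches local disk
    buf = io.StringIO()
    df.to_csv(buf, index=False)
    s3_client.put_object(Bucket=bucket, Key=final_key, Body=buf.getvalue())
    return final_key
```

Accepting the client as a parameter also makes the helper easy to test against a stub without touching real S3.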

    upload_df_s3.py on GitHub

    Before you go, let me leave you with some quick reminders:

    1. Configure your AWS access key and secret key in the terminal

    2. Install the package s3fs using pip install s3fs

    3. Your IAM user or role should have the s3:ListBucket, s3:GetObject and s3:PutObject permissions for the bucket

    So, that was my take on the best way to upload your dataframes from Python to S3. I hope it is of use to fellow professionals in my field. As I go forward in this journey here, I’m excited for all that’s in store. In particular, I look forward to sharpening my skills as I continue to push myself to write more often.
