Optimizing AWS Costs for Big Data Workloads

2018-11-24 · Rupesh Bansal · Engineering team lead

This is part 1 of a 3-part series on tricks and workarounds for optimizing AWS bills. We will cover multiple AWS technologies: RDS, Redshift, EC2 instances, S3, Kinesis, and API Gateway.

This blog captures my learnings from a recent extensive exercise of shrinking our AWS bill by roughly 50% without compromising on any resources needed by our applications.

Many moons ago, when we had very little experience with AWS technologies and no clue about the scale and volume to which the business might grow, we over-provisioned some of our AWS resources just to be on the safer side. We at FinBox committed this mistake, and when our AWS bills started skyrocketing, we narrowed down multiple areas in AWS which could have been optimized. In this blog, we will discuss Redshift and RDS. In subsequent blogs, we will discuss how you can optimize API Gateway, Kinesis Streams, EC2 instances and S3.


Redshift

Redshift is AWS's petabyte-scale, fully managed columnar database offering. Companies store their hot data in it for speed and SQL-like query syntax.

  1. Compression

Over 15% of our total infrastructure cost goes to Redshift. If you store a table in Redshift without proper encoding of its columns, the table will likely not get compressed and will end up taking huge space. Since Redshift pricing is based on the number of nodes, your cost is essentially proportional to your storage. AWS provides a very cool feature called deep copy, with which you can compress your existing uncompressed columns, though it comes with its own limitations. If you are not sure which compression you should apply to each column, Redshift provides a command:

analyze compression <TABLE_NAME>

This is a time-consuming step and might take a while to produce the suggested encodings.

This will output results like:

Column   | Encoding
---------+----------
A        | RAW
B        | mostly16
C        | bytedict
D        | lzo
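If you have many tables, it helps to script this. Below is a minimal sketch of parsing the text report from the command above into a column-to-encoding map that can later drive DDL generation; it assumes the two-column layout shown above, and the function name is our own.

```python
# Sketch: parse the text output of `analyze compression <table>` into a
# {column: encoding} map. Assumes the "Column | Encoding" layout shown above.

def parse_compression_report(report: str) -> dict:
    encodings = {}
    for line in report.splitlines():
        # Skip separator rows made only of -, + and | characters.
        if "|" not in line or set(line.strip()) <= {"-", "+", "|"}:
            continue
        col, enc = (part.strip() for part in line.split("|", 1))
        if col.lower() == "column":
            continue  # skip the header row
        encodings[col] = enc
    return encodings

report = """\
Column   | Encoding
---------+----------
A        | RAW
B        | mostly16
C        | bytedict
D        | lzo"""

print(parse_compression_report(report))
# {'A': 'RAW', 'B': 'mostly16', 'C': 'bytedict', 'D': 'lzo'}
```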

Please note that you should never compress your sort key in Redshift; it expects the sort key in raw format for indexing and scanning. Having said that, once you know the encoding you should apply to each column, create a new table with the schema updated to the expected encodings. You can get the schema of the current table using:

pg_dump -h <HOST> -U <USERNAME> -d <DATABASE> -t <TABLENAME> --schema-only -p 5439;

The output would be something like:

--
-- PostgreSQL database dump
--

-- Dumped from database version 8.0.2
-- Dumped by pg_dump version 9.6.10

SET statement_timeout = 0;
SET lock_timeout = 0;
SET idle_in_transaction_session_timeout = 0;
SET client_encoding = 'UTF8';
SET standard_conforming_strings = off;
SELECT pg_catalog.set_config('search_path', '', false);
SET check_function_bodies = false;
SET client_min_messages = warning;
SET escape_string_warning = off;
SET row_security = off;

SET default_tablespace = '';
SET default_with_oids = true;

--
-- Name: TABLENAME; Type: TABLE; Schema: public; Owner: USERNAME
--

CREATE TABLE public.TABLENAME (
    id bigint DEFAULT "identity"(334987, 0, '1,1'::text),
    batch_id character varying(50),
    start_timr timestamp without time zone,
    duration integer,
    type character varying(10),
    created_at timestamp without time zone
);

ALTER TABLE public.TABLENAME OWNER TO USERNAME;

--
-- PostgreSQL database dump complete
--

Create a new table using the output of the above command, with the updated column encodings. Now copy everything from the original table into it, do some sanity testing, then drop the old table and rename the new one. This simple activity can free up 30–40% of disk space, which will directly get reflected in your savings. This activity might be challenging for use cases where data is continuously being ingested into Redshift. We use Firehose to load into Redshift. Firehose provides retry functionality: if ingestion of incoming data points fails, it keeps retrying for up to 2 hours (you can find this option in the Firehose configuration settings). This helps us make the deep-copy activity smooth and hassle-free.
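The create/copy/drop/rename sequence is mechanical, so it can be generated per table. Here is a rough sketch; the table name, column names and encodings are illustrative, not from our actual schema.

```python
# Sketch: generate the SQL statements for a deep copy into a re-encoded table.
# The table and encodings below are illustrative examples only.

def deep_copy_statements(table: str, ddl_with_encodings: str) -> list:
    tmp = f"{table}_encoded"
    return [
        ddl_with_encodings.replace(table, tmp, 1),    # new table with ENCODE clauses
        f"insert into {tmp} select * from {table};",  # the deep copy itself
        f"select count(*) from {tmp};",               # sanity-check row counts match
        f"drop table {table};",
        f"alter table {tmp} rename to {table};",
    ]

stmts = deep_copy_statements(
    "events",
    "create table events (batch_id varchar(50) encode lzo, duration int encode zstd);",
)
print(stmts[1])  # insert into events_encoded select * from events;
```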

  2. Daily ETL

We used Redshift as our data warehouse cum data lake. This was one of the biggest mistakes we had made in our server configuration. All of us had heard about Redshift Spectrum, but we never actually bothered to dig deep into this Amazon offering. It turned out to be a game changer for us in terms of saving money. Spectrum provides a very nice abstraction over your structured and semi-structured data stored in S3: you can query it from Redshift itself and join it with tables stored there. This helped us bring down our x-node cluster to x/4 nodes.

Note that your S3 storage bucket and your Redshift cluster must be in the same region.

External tables can be created only in an external schema, which virtually refers to an external database. To create an external schema for Spectrum, use:

create external schema spectrum
from data catalog
database 'spectrumdb'
iam_role 'arn:aws:iam::123456789012:role/SpectrumRole'
create external database if not exists;

External tables can be created using:

create external table spectrum.<TABLE_NAME>(
    column_a integer,
    column_b integer,
    column_c integer,
    column_d integer,
    column_e smallint,
    column_f decimal(8,2),
    column_g timestamp)
row format delimited
fields terminated by '\t'
stored as textfile
location 's3://<BUCKET>/a/b/c/d'
table properties ('numRows'='36000');

Now you can simply use Redshift's UNLOAD command to move data into this external table's S3 location, and COPY to bring it back when needed. Spectrum billing is based on the amount of data scanned, and since we rarely query our raw data once it has been featurized, it helps us save some bucks.
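To see why the scan-based billing matters, here is a back-of-the-envelope cost sketch. The $5-per-TB-scanned rate matches Spectrum's published pricing at the time of writing; check your region's current rate, and the 10x Parquet reduction below is an illustrative assumption, not a measurement.

```python
# Back-of-the-envelope Spectrum scan cost. The default rate is an assumption
# based on Spectrum's published $5/TB-scanned pricing; verify for your region.

def spectrum_query_cost(bytes_scanned: int, usd_per_tb: float = 5.0) -> float:
    tb = bytes_scanned / 1024 ** 4
    return tb * usd_per_tb

# A query that scans 50 GB of raw text data:
raw = spectrum_query_cost(50 * 1024 ** 3)
# The same query if Parquet's compression and column pruning cut the scan 10x:
parquet = spectrum_query_cost(5 * 1024 ** 3)
print(round(raw, 4), round(parquet, 4))
```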

Redshift comes with two node types: dense compute and dense storage. You can either optimize your cluster for queries, or get more storage but compromise on speed. But there is a small catch here: for the latter, Redshift forces you to take a cluster with a minimum of 2 TB of storage. Even though the per-GB cost of this node type is cheap, if your volume is roughly 200–300 GB there is no point in using it. To keep our Redshift cluster at a single node, we use Airflow to move everything accumulated in Redshift that the featurization and extraction jobs have already incorporated into their models out to S3, in Parquet format. We have configured Spectrum on the same cluster, which helps us run all our ad-hoc queries. Note that we migrate the data from Redshift to S3 in Parquet format because Spectrum billing is based on the amount of data scanned, and Parquet is a heavily compressed columnar storage format for structured data.

  3. Vacuum Cleaning

Whenever a delete or update query is executed on Redshift, it does not reclaim the free space automatically. This is a deliberate design choice by the Redshift developers. Hence, every firm using Redshift is expected to set up a policy of periodically running VACUUM on its clusters. There are multiple types of VACUUM, which you can choose depending upon the use case and resource availability.

This is a time-consuming task and might lock your tables for some time.
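A simple periodic policy can pick the cheapest VACUUM variant per table from stats such as the unsorted percentage and deleted-row share (Redshift exposes these in system views like svv_table_info). The thresholds below are illustrative assumptions, not Redshift recommendations.

```python
# Sketch of a policy for choosing a VACUUM variant per table. The 20%
# thresholds are illustrative; tune them for your own cluster and workload.

def vacuum_statement(table: str, unsorted_pct: float, deleted_pct: float) -> str:
    if unsorted_pct > 20 and deleted_pct > 20:
        return f"vacuum full {table};"         # reclaim space and re-sort
    if deleted_pct > 20:
        return f"vacuum delete only {table};"  # just reclaim deleted rows
    if unsorted_pct > 20:
        return f"vacuum sort only {table};"    # just restore sort order
    return f"-- {table}: no vacuum needed"

print(vacuum_statement("events", unsorted_pct=35.0, deleted_pct=5.0))
# vacuum sort only events;
```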


RDS

The cost of the RDS service comprises EC2 compute cost, EBS data storage cost, IO cost, and outbound data transfer cost. To optimize your RDS bills, it is important to select a right-sized instance for your application, and to provision the amount of throughput your application actually needs. You can use the AWS management console to fine-tune the RDS configuration, or use fully managed services which take care of optimizing your RDS instance.

One major learning from my recent experience with RDS: always avoid data dumping. It is important that all the devs in your team understand this and do not load unwanted tables or databases into RDS. This might seem a trivial problem, but if you are a company that deals with humongous data and relies heavily on machine learning, chances are high that someone will dump data into your running RDS instance for temporary testing or training and later forget to delete it. We at FinBox have 50+ databases in some of our instances, and it is difficult to keep track of which ones are actually in use. Set up a policy and train developers so that only data that needs to be in RDS is stored there. Review the schema periodically and optimize it for RDS.

Another thing to keep in mind is to carefully design and review SQL queries, which will in turn minimize the load on the machine. Things like caching an auth token instead of validating it from the database can remove one DB call per request made to the server. Small things like these can make a tremendous difference to the load on your machine, which will directly get reflected in your bills.
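The token-caching idea can be sketched as a tiny TTL cache in front of the DB lookup. Everything here is illustrative: `TokenCache` and `validate_from_db` are hypothetical names, not from any real framework.

```python
import time

# Sketch: a small TTL cache in front of auth-token validation, so repeated
# requests with the same token skip the database round trip.
# TokenCache and validate_from_db are illustrative names.

class TokenCache:
    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self._cache = {}  # token -> (is_valid, cached_at)

    def is_valid(self, token: str, validate_from_db) -> bool:
        hit = self._cache.get(token)
        if hit and time.monotonic() - hit[1] < self.ttl:
            return hit[0]                     # cache hit: no DB call
        result = validate_from_db(token)      # cache miss: one DB call
        self._cache[token] = (result, time.monotonic())
        return result

calls = []
def fake_db_lookup(token):
    calls.append(token)       # count simulated DB round trips
    return token == "good"

cache = TokenCache()
print(cache.is_valid("good", fake_db_lookup))  # True (one DB call)
print(cache.is_valid("good", fake_db_lookup))  # True (served from cache)
print(len(calls))                              # 1
```

In production you would also cap the cache size and invalidate entries on logout or token revocation.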

We realized that even though we had Reserved Instances, our daily RDS bills were significantly high. This came as a surprise, as we expected the bill to be negligible. Upon some research, we realized that the RIs cover just the RDS instance; storage, snapshot and backup charges are separate. After further digging, we realized that while spinning up a new RDS instance, AWS provides some defaults which a user should understand before hitting create. Some of these defaults are:

a. Default instance type: AWS suggests m4.xlarge as the default type. Be sure you really need an instance this big. For most of my use cases, especially our dev environments, a dual-core machine with 4 GB of RAM (t2.medium, or m4.large if you want to stay in the m family) is more than enough. This really depends on your use case and the kind of traffic you will route to this DB.

b. Multi-AZ deployment: This option configures whether your database is replicated across multiple availability zones. It is checked by default, but again, know your use case. You might not want to replicate your DB across AZs and multiply your server cost several times over for your dev environments.

c. Storage type: AWS provides 3 types of storage:

  • Provisioned IOPS — Provisioned IOPS storage is designed to meet the needs of I/O-intensive workloads, particularly database workloads, that require low I/O latency and consistent I/O throughput.

  • General Purpose SSD: General Purpose SSD (gp2) volumes offer cost-effective storage that is ideal for a broad range of workloads. These volumes deliver single-digit millisecond latencies and the ability to burst to 3,000 IOPS for extended periods of time. Baseline performance for these volumes is determined by the volume's size.

  • Magnetic — AWS RDS also supports magnetic storage for backward compatibility. We recommend that you use General Purpose SSD or Provisioned IOPS for any new storage needs. The maximum amount of storage allowed for DB instances on magnetic storage is less than that of the other storage types.
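Because gp2 baseline performance scales with volume size, a quick sizing check helps decide between gp2 and Provisioned IOPS. The figures below (3 IOPS per GB, a 100 IOPS floor, a 10,000 IOPS ceiling) reflect gp2 behavior as documented around the time of writing; verify against the current AWS docs before relying on them.

```python
# Rough gp2 sizing check. Constants are assumptions from gp2's documented
# behavior at the time (3 IOPS/GB, min 100, max 10,000); verify current docs.

def gp2_baseline_iops(size_gb: int) -> int:
    return max(100, min(3 * size_gb, 10000))

def needs_provisioned_iops(size_gb: int, sustained_iops: int) -> bool:
    # Bursting to 3,000 IOPS covers spikes, not sustained load, so compare
    # the sustained requirement against the baseline only.
    return sustained_iops > gp2_baseline_iops(size_gb)

print(gp2_baseline_iops(300))             # 900
print(needs_provisioned_iops(300, 2500))  # True: consider io1 or a bigger volume
```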

Please note that it is not an easy task to reduce the size of your instance. It's a very common situation: a team realizes that their RDS instance is way over-provisioned and needs to shrink the DB. Some might try to take a snapshot and restore a new DB with less disk space, only to realize that AWS does not provide a way to shrink a snapshot. We at FinBox found ourselves in a similar situation and wrote a simple script which takes a backup from RDS and restores it into a smaller variant.
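One piece of such a script is deciding how big the replacement instance's storage should be. The helper below is a hypothetical sketch of that sizing step only (the real workflow then runs pg_dump against the old endpoint and pg_restore against the new one); the 30% headroom and 20 GB floor are illustrative assumptions, and minimum storage varies by engine.

```python
import math

# Sketch: size the storage for the replacement (smaller) RDS instance from
# the data actually used. Headroom and floor values are assumptions.

def shrink_target_gb(used_gb: float, headroom: float = 0.3, min_gb: int = 20) -> int:
    # A snapshot restore cannot shrink storage, so the new instance is created
    # fresh: size it from actual data used, plus headroom for growth.
    return max(min_gb, math.ceil(used_gb * (1 + headroom)))

print(shrink_target_gb(180))  # 234
print(shrink_target_gb(8))    # 20 (floor applied; adjust per engine)
```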

Github Gist

I hope you found this blog useful. Feel free to get in touch if you find any mistakes or have any queries. In subsequent blogs, I will discuss optimizations for S3, API Gateway, Kinesis Streams and EC2 instances.

FinBox works with banks & NBFCs to digitize their customer journeys & to help them underwrite NTC customers using alternative data from the smartphone. To get in touch, drop a line here

We are hiring!!! If the stuff I have discussed above excites you, feel free to get in touch at rupesh@finbox.in OR tech@finbox.in
