As huge fans of Amazon Redshift, it’s exciting to see a whole ecosystem growing up around this amazing technology. With the gift-giving season in full swing, we’ve put together our top Amazon Redshift tips to take you further next year. Happy querying!
Tip #1: Use the same AWS region
Make sure you launch your Redshift cluster in the region where the majority of your data resides. There are two benefits to doing this:
Firstly, the performance of your loads using the COPY statement will be better.
Secondly, you won’t incur inter-region data transfer costs, charges that catch a lot of people off guard. The analogy to think about here is local versus long-distance telephone calling: long distance is a lot more expensive.
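As a rough sketch, a same-region load might look like the following; the table name, bucket path, and IAM role are placeholders:

```sql
-- Load data from an S3 bucket that lives in the same region as the cluster.
-- Table name, S3 path, and IAM role are hypothetical.
COPY events
FROM 's3://my-same-region-bucket/events/2023-12-01/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
FORMAT AS CSV
GZIP;
```

If the bucket lived in another region, the COPY would also need its REGION option, and every load would pay cross-region transfer charges.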
Tip #2: Pick the right data distribution style
Amazon Redshift provides a few data distribution styles, distinct from dist keys and sort keys. The default is EVEN (round-robin) distribution, which may or may not be the best option for your data, depending on how you join your tables. The other options are KEY distribution and ALL distribution. It’s important to pick the right distribution style to avoid broadcasting data across nodes at query time.
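Here is a minimal sketch of the three styles on hypothetical tables (names and columns are illustrative):

```sql
-- KEY distribution: rows with the same user_id land on the same node,
-- so joins on user_id avoid shuffling data between nodes.
CREATE TABLE page_views (
    user_id   BIGINT,
    url       VARCHAR(2048),
    viewed_at TIMESTAMP
)
DISTSTYLE KEY
DISTKEY (user_id);

-- ALL distribution: a full copy of this small dimension table on every node.
CREATE TABLE countries (
    country_code CHAR(2),
    country_name VARCHAR(256)
)
DISTSTYLE ALL;

-- EVEN distribution: round-robin, the default; fine when there is no dominant join key.
CREATE TABLE raw_logs (
    line VARCHAR(65535)
)
DISTSTYLE EVEN;
```

A reasonable rule of thumb is KEY distribution for large tables that join on a common column, ALL for small, frequently joined dimension tables, and EVEN when neither applies.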
Tip #3: Use SORT keys for faster table scans
Besides DISTKEY and DISTSTYLE, SORTKEY is really important to understand and use appropriately. The sort key defines the order in which Redshift stores a table’s data on disk. Because Redshift stores data in blocks and tracks the minimum and maximum sort key values for each block, entire blocks can be skipped when the sort key columns are used as predicates in your query. This makes your table scans fast!
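As an illustration, a hypothetical events table sorted on its timestamp column lets range queries on that column skip most blocks:

```sql
-- Hypothetical table sorted on created_at.
CREATE TABLE events (
    event_id   BIGINT,
    user_id    BIGINT,
    event_type VARCHAR(64),
    created_at TIMESTAMP
)
DISTKEY (user_id)
COMPOUND SORTKEY (created_at);

-- Because rows are stored in created_at order, Redshift can skip any block
-- whose min/max created_at range falls outside the filter below.
SELECT event_type, COUNT(*) AS event_count
FROM events
WHERE created_at BETWEEN '2023-12-01' AND '2023-12-08'
GROUP BY event_type;
```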
Tip #4: Use column compression
Appropriately using column compression has two benefits:
Firstly, your data will take up less space on the Redshift cluster. Redshift stores similar data sequentially (a property of being a column store), resulting in better compression ratios than a row store would achieve.
Secondly, you trade a bit of CPU time (for decompression) in exchange for faster IO performance. Most analytical use cases are heavily IO bound, so this is a good trade-off to make.
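One way to apply compression, sketched on a hypothetical table (the specific encodings are illustrative, not a recommendation):

```sql
-- Explicit per-column encodings on a hypothetical table.
CREATE TABLE clicks (
    click_id   BIGINT        ENCODE ZSTD,
    user_id    BIGINT        ENCODE ZSTD,
    url        VARCHAR(2048) ENCODE ZSTD,
    clicked_at TIMESTAMP     ENCODE ZSTD
);

-- Or let Redshift suggest encodings based on a sample of data already loaded.
ANALYZE COMPRESSION clicks;
```

If you leave the encodings off entirely, COPY can also pick them automatically on the first load into an empty table.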
Tip #5: Use Upsert to Update Existing Rows
Though Redshift doesn’t natively support upserts, we described a workaround for Redshift upserts a while back. Instead of dropping your tables and reloading your entire dataset, you can incrementally update your Redshift cluster.
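A common version of this workaround (the details in that post may differ) stages the incoming rows in a temporary table, deletes the matching rows from the target, and inserts the staged rows; the table and column names below are placeholders:

```sql
BEGIN;

-- Stage the new and changed rows in a temporary table shaped like the target.
CREATE TEMP TABLE users_staging (LIKE users);

COPY users_staging
FROM 's3://my-bucket/users/incremental/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
FORMAT AS CSV;

-- Remove the rows that are about to be replaced...
DELETE FROM users
USING users_staging
WHERE users.user_id = users_staging.user_id;

-- ...then insert the fresh versions.
INSERT INTO users
SELECT * FROM users_staging;

COMMIT;
```

Wrapping the whole thing in a transaction keeps readers from ever seeing the table with the old rows deleted but the new ones not yet inserted.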