by Sristi Raj
With the advent of cloud data warehouses and analytics tools, concepts like the data lake, in-place querying, schema on read, and data mashup have become an integral part of an organization’s analytics capabilities. Consequently, tools and features such as AWS Redshift Spectrum, Athena, Presto, and Hive now play a game-changing role in helping data analysts explore and mine data, and get value out of data assets lying unused within the organization. However, with so many options to choose from, it is essential to understand the subtle differences between them, and the advantages of each, in order to select the right tool for a particular use case. In this blog, we will explore some capabilities of Redshift Spectrum and the advantages of using it in certain use cases, leaving the comparison between the different tools to a separate blog post.
What is AWS Redshift Spectrum?
According to Amazon: “AWS Redshift spectrum is a capability or a service which can be enabled with AWS Redshift that enables you to run SQL queries directly against all your data, out to exabytes, in Amazon S3 and for which you simply pay for the number of bytes scanned.
With Amazon Redshift Spectrum, you now have a fast, cost-effective engine that minimizes data processed with dynamic partition pruning. You can further improve query performance by reducing the data scanned. You do this by partitioning and compressing data and by using a columnar format for storage.”
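To make the "pay for the number of bytes scanned" point concrete, here is a rough back-of-the-envelope sketch in Python. The price per terabyte, the dataset size, and the helper names are all assumptions made up for illustration (actual pricing varies by region and over time), and the column estimate naively assumes every column is the same width.

```python
# Illustrative sketch only: the price and the sizes used here are assumptions,
# not figures taken from AWS. Check current pricing for your region.
ASSUMED_PRICE_PER_TB_USD = 5.0
BYTES_PER_TB = 1024 ** 4


def scan_cost_usd(bytes_scanned, price_per_tb=ASSUMED_PRICE_PER_TB_USD):
    """Cost of a query that scans `bytes_scanned` bytes in S3."""
    return bytes_scanned / BYTES_PER_TB * price_per_tb


def estimated_bytes_scanned(total_bytes,
                            partitions_read=1, partitions_total=1,
                            columns_read=1, columns_total=1,
                            compression_ratio=1.0):
    """Very rough estimate of the bytes a query has to read.

    - Partition pruning: only the partitions matching the filter are read.
    - Columnar format: only the referenced columns are read
      (assumes equally sized columns, which is a simplification).
    - Compression: fewer bytes are stored, so fewer are scanned.
    """
    partition_fraction = partitions_read / partitions_total
    column_fraction = columns_read / columns_total
    return total_bytes * partition_fraction * column_fraction / compression_ratio


if __name__ == "__main__":
    raw = 10 * BYTES_PER_TB  # hypothetical 10 TB of uncompressed row-oriented files
    optimized = estimated_bytes_scanned(
        raw,
        partitions_read=1, partitions_total=10,  # filter hits 1 of 10 partitions
        columns_read=2, columns_total=20,        # query touches 2 of 20 columns
        compression_ratio=4.0,                   # assumed 4:1 compression
    )
    print("full scan cost (USD):", scan_cost_usd(raw))
    print("optimized scan cost (USD):", scan_cost_usd(optimized))
```

The exact numbers matter less than the shape of the calculation: each of the three techniques multiplies down the bytes scanned, so their savings compound.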
It is clear from the above that Redshift Spectrum is a feature that can be enabled with an AWS Redshift database to query data stored as files in AWS S3 buckets. This may sound similar to external tables in Oracle or SQL Server, but it comes with a lot more flexibility and features, which I will discuss later in the post.
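For readers who have not used external tables before, the following Python sketch renders the kind of DDL statement that points a table definition at files in S3. The helper `external_table_ddl`, the schema `spectrum_demo`, the table, the columns, and the bucket path are all hypothetical. The statement follows the general shape of Redshift's `CREATE EXTERNAL TABLE`, it assumes an external schema has already been created, and the helper does no escaping or validation.

```python
def external_table_ddl(schema, table, columns, partition_columns,
                       s3_location, file_format="PARQUET"):
    """Render a CREATE EXTERNAL TABLE statement as a string.

    `columns` and `partition_columns` are lists of (name, sql_type) pairs.
    Illustration only: inputs are not escaped or validated.
    """
    cols = ",\n  ".join(f"{name} {sql_type}" for name, sql_type in columns)
    parts = ", ".join(f"{name} {sql_type}" for name, sql_type in partition_columns)
    return (
        f"CREATE EXTERNAL TABLE {schema}.{table} (\n  {cols}\n)\n"
        f"PARTITIONED BY ({parts})\n"
        f"STORED AS {file_format}\n"
        f"LOCATION '{s3_location}';"
    )


if __name__ == "__main__":
    # All names below are made up for the example.
    ddl = external_table_ddl(
        schema="spectrum_demo",
        table="sales",
        columns=[("order_id", "BIGINT"), ("amount", "DECIMAL(10,2)")],
        partition_columns=[("sale_date", "DATE")],
        s3_location="s3://example-bucket/sales/",
    )
    print(ddl)
```

The data itself never moves into the database: the table is only metadata describing where the files live and how to read them, which is the sense in which the feature resembles external tables in other databases.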