Create External Table in Amazon Athena Database to Query Amazon S3 Text Files

Amazon Athena is a serverless AWS query service which can be used by cloud developers and analytic professionals to query data of your data lake stored as text files in Amazon S3 buckets folders. In this Amazon Athena tutorial, I want to show SQL query developers how they can map their data lake data stored as text files in Amazon S3 buckets as external tables in an Amazon Athena database and execute sample SELECT query to query data from these text files.

For AWS professionals who want to follow the steps covered in this Amazon Athena tutorial, I share the sample data file albums.csv here. Please put the sample text file into a separate folder in your Amazon S3 buckets

First of all the data lake or part of the data lake should be defined in Athena as a logical database. To create a database on Amazon Athena data engineers and SQL developers can use following SQL DDL command.
Instead of S3RedshiftDB use a database name which describes your data collection.


create database S3RedshiftDB

Code

As seen in below screenshot from Amazon Athena Query Editor screen, type CREATE DATABASE command with the database name argument and press "Run Query" button.

create database on Amazon Athena Query Editor

After we have created our Amazon Athena database, SQL developer can create tables in this new Athena database using SQL DDL commands which will map to Amazon S3 folders containing data files.
Now switch to recently created new Athena database using left Database menu
Choose s3redshiftdb (or the database name you used) from databases dropdownlist.

Right after the initial creation of an Athena database, as expected there will be no existing tables or views as seen in following screenshot. The 0 number right after Tables and Views are showing that there is not any table or view in that database yet.

Amazon Athena database tables and views list

To create a new table from Amazon S3 data stored as text files in S3 bucket folders, click on Create table link button and choose Create table from S3 bucket data option.

create table from Amazon S3 bucket data in Athena database

This option will lead the AWS Athena developer to a wizard which enables developers to add a new table to existing Athena database

First select the Athena database name where you want to create a new table.
Give a name to your new Athena database table by typing in Table Name textbox.
Additionally provide the Amazon S3 bucket folder which contains data files for the Athena database table.
It is in format "s3://bucket-name/folder-name/sub-folder/"

The format of the location of input data set is exteremely important because the help text for location input text field includes region identifier within the S3 file path. Please do not use the AWS region codes within S3 folder path otherwise queries executed on this external Athena table will fail to retrieve any data.

Athena table creation from Amazon S3 bucket folders

The following step is for selecting the Amazon S3 bucket files' data format among available data formats like well known CSV (comma separated values) files, delimited text files, JSON files, Parquet files, ORC files, TSV and Apache web log files, etc.

Amazon S3 data file formats for Athena database tables

My sample data containing music albums is created and stores in Amazon S3 bucket folder as a comma separated CSV file. So I choose option CSV and continued to next step in table creation wizard.

In step 3, Athena developer will actually define the data structure similar to SQL developers do during table creation. Column name and data type should be provided in this step for all columns.

By the way bulk add columns option maybe more easier for some Athena developers especially if the table or file structure includes big number of columns.
For this sample, I will choose to create Athena database table columns using the user interface typing column name and choosing approptiate column type from available data types list.

create Amazon Athena table columns defining column names and data types

Last step is defining partitions for the data. Partitioning data is actually an important method to boost query performance in every database platform also in serverless query services like Amazon Athena.
For simplicity of this Athena tutorial, I will skip this step. But if your data is very big in sizes and can be splitted into different files according to a column value, you can use this column for partitioning your Athena tables to increase query performance based on partition columns.

configure partition columns for Amazon Athena table to increase query performance

Amazon Athena can now complete external table creation by pressing Create table command.

The create table wizard will create the external table within the targeted Amazon Athena database.
Additionaly the wizard will prepare the SQL command (DDL Data Definition Command) which is executed for table creation within a new Amazon Athena query editor window.
The developers can save this SQL command for future requirements. You can simply execute this SQL command on Athena Query Editor window instead of following each step in the table creation wizard manually once more in future.


CREATE EXTERNAL TABLE IF NOT EXISTS s3redshiftdb.album (

 `AlbumId` int,

 `Title` string,

 `ArtistId` int

)

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'

WITH SERDEPROPERTIES (

 'serialization.format' = ',',

 'field.delim' = ','

) LOCATION 's3://cdl-redshift-1-data-transfer/spectrum-sample-data/albums/'

TBLPROPERTIES (

 'has_encrypted_data'='false',

 'skip.header.line.count'='1'

);

Code

And a SQL programmer will definitely use above SQL command example as a template for other Athena database table creations and will simply replace table name, columns and Amazon S3 bucket folder path and execute the new SQL statement on Query Editor without using the wizard.

create external table in Amazon Athena database for data in Amazon S3 buckets

Amazon Athena developers can now preview table data to test if external table creation is successfull or not by simply pressing the 3 dots and choosing the "Preview data" menu option.

preview Amazon Athena table on Query Editor

As Athena developers can see, each user interface command is displayed with its SQL statement in a new Query Editor window. For example to preview external table data in an Amazon Athena database for a table can be also managed by executing following SQL SELECT command.


select * from "s3redshiftdb"."album" limit 10;

Code

Please note that the SQL DDL command has a section named TBLPROPERTIES for table properties.
TBLPROPERTIES has_encrypted_data property identifies that the data is encrypted or not.
There are a number of other table properties. One of them is skip.header.line.count which SQL developers creating Amazon Athena external tables use a lot.
If the source CSV data files includes a column headers line, external table will fail to understand this information and will try to display column headers as an additional data row if the skip.header.line.count table property is not set correctly.

Please note that when you create an Amazon Athena external table, the SQL developer provides the S3 bucket folder as an argument to the CREATE TABLE command, not the file's path. This means that, for different database entities like for example, sales orders, sales order details, customers, etc, create different Amazon S3 folders and place text files containing data of the same entity into the same bucket folder. In short, organize your Amazon S3 bucket folders so that each folder can be defined as a separate table in Amazon Athena database.

It is also important that every file in a particular folder has identical structure, like delimeters, column header row counts, number of columns, column data types, etc.
Otherwise, the data scanned by Amazon Athena will be interpreted different than the developers expect and query results will be wrong.