AWS Glue Data Catalog
The AWS Glue Data Catalog contains references to data that is used as sources and targets of your
ETL jobs in AWS Glue. The AWS Glue Data Catalog is an index to the location, schema, and runtime
metrics of your data. You can use information in the Data Catalog to create and monitor your ETL
jobs. Information in the Data Catalog is stored as metadata tables, where each table specifies a
single data store. By setting up a crawler, you can automatically scan many types of data
stores, including DynamoDB, Amazon S3, and Java Database Connectivity (JDBC)-connected stores,
extract metadata and schemas, and then create table definitions in the AWS Glue Data Catalog.
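For example, the following Python sketch uses the AWS SDK (boto3) to create and start a crawler over an S3 prefix; the crawler name, IAM role, database, and bucket are hypothetical placeholders.

```python
import boto3

glue = boto3.client("glue")

# Create a crawler that scans an S3 prefix and writes the inferred table
# definitions into a Data Catalog database. Role and names are hypothetical.
glue.create_crawler(
    Name="orders-crawler",
    Role="GlueCrawlerRole",
    DatabaseName="streaming_demo",
    Targets={"S3Targets": [{"Path": "s3://my-example-bucket/orders/"}]},
)

# Run the crawler; when it finishes, the discovered tables appear in the
# "streaming_demo" database of the AWS Glue Data Catalog.
glue.start_crawler(Name="orders-crawler")
```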
To work with Amazon Kinesis Data Streams in AWS Glue streaming ETL jobs, it is a best practice to
define your stream in a table in an AWS Glue Data Catalog database. You define a stream-sourced
table by specifying the Kinesis stream and one of the supported formats (CSV, JSON, ORC, Parquet,
Avro, or a custom format defined with Grok). You can enter a schema manually, or you can leave
schema detection to the AWS Glue job to perform at runtime.
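As a sketch, such a Kinesis-backed catalog table can also be created with boto3, as shown below. The database, table, stream, and column names are hypothetical, and the typeOfData, streamName, and endpointUrl storage parameters are assumed to follow the pattern AWS Glue uses for Kinesis-sourced tables.

```python
import boto3

glue = boto3.client("glue")

glue.create_table(
    DatabaseName="streaming_demo",
    TableInput={
        "Name": "orders_stream",
        # Classification tells the streaming job how to parse the records.
        "Parameters": {"classification": "json"},
        "StorageDescriptor": {
            # A schema can be declared here, or omitted and left for the
            # streaming job to infer at runtime.
            "Columns": [
                {"Name": "order_id", "Type": "string"},
                {"Name": "amount", "Type": "double"},
            ],
            # Assumed parameter keys for a Kinesis-sourced table.
            "Parameters": {
                "typeOfData": "kinesis",
                "streamName": "orders-stream",
                "endpointUrl": "https://kinesis.us-east-1.amazonaws.com",
            },
        },
    },
)
```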
AWS Glue streaming ETL job
AWS Glue runs your ETL jobs in a serverless Apache Spark environment, on virtual resources that it
provisions and manages in its own service account. In addition to running Apache Spark-based jobs,
AWS Glue layers additional functionality on top of Spark with DynamicFrames.
DynamicFrames are distributed tables that support nested data such as structures and arrays.
Each record is self-describing, designed for schema flexibility with semi-structured data. A record
in a DynamicFrame contains both data and the schema describing the data. Both Apache Spark
DataFrames and DynamicFrames are supported in your ETL scripts, and you can convert them
back and forth. DynamicFrames provide a set of advanced transformations for data cleaning and
ETL.
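The following sketch illustrates the round trip between a Spark DataFrame and a DynamicFrame inside a Glue ETL script; the sample data and column names are hypothetical.

```python
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# An ordinary Spark DataFrame with hypothetical order records.
df = spark.createDataFrame(
    [("o-1", "19.99"), ("o-2", "5.00")], ["order_id", "amount"]
)

# Convert it to a DynamicFrame, whose records each carry their own schema.
dyf = DynamicFrame.fromDF(df, glue_context, "orders")

# Apply a DynamicFrame transformation, e.g. casting a column to double.
dyf = dyf.resolveChoice(specs=[("amount", "cast:double")])

# Convert back to a DataFrame when Spark-native operations are needed.
df_clean = dyf.toDF()
```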
By using Apache Spark Structured Streaming in your AWS Glue job, you can create streaming ETL
jobs that run continuously and consume data from streaming sources such as Amazon Kinesis Data
Streams, Apache Kafka, and Amazon MSK. The jobs can clean, merge, and transform the data, then
load the results into stores such as Amazon S3, Amazon DynamoDB, or JDBC data stores.
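The source side of such a job can be sketched as follows, reading the hypothetical orders_stream catalog table defined earlier into a streaming DataFrame (glue_context is the GlueContext created in the previous sketch).

```python
# Read the Kinesis-backed catalog table as a streaming DataFrame.
data_frame = glue_context.create_data_frame.from_catalog(
    database="streaming_demo",
    table_name="orders_stream",
    transformation_ctx="source",
    additional_options={
        "startingPosition": "TRIM_HORIZON",  # start from the oldest record
        "inferSchema": "true",               # let the job derive the schema
    },
)
```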
AWS Glue processes and writes out data in 100-second windows by default. This allows data to
be processed efficiently, and permits aggregations to be performed on data that arrives later
than expected. You can adjust the window size to balance response latency against the accuracy
of your aggregations. AWS Glue streaming jobs use checkpoints to track the data that has
already been read from the streaming source.
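Continuing the sketch, the hypothetical batch function below is applied to each window with forEachBatch, which also takes the window size and a checkpoint location; the bucket paths are placeholders.

```python
from awsglue.dynamicframe import DynamicFrame

def process_batch(batch_df, batch_id):
    # Write each non-empty micro-batch to S3 as Parquet.
    if batch_df.count() > 0:
        dyf = DynamicFrame.fromDF(batch_df, glue_context, "batch")
        glue_context.write_dynamic_frame.from_options(
            frame=dyf,
            connection_type="s3",
            connection_options={"path": "s3://my-example-bucket/curated/"},
            format="parquet",
        )

glue_context.forEachBatch(
    frame=data_frame,
    batch_function=process_batch,
    options={
        "windowSize": "100 seconds",  # the default; tune latency vs accuracy
        "checkpointLocation": "s3://my-example-bucket/checkpoints/",
    },
)
```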