Notes: (AWS re:Invent 2020 DAT310) Deep Dive on Amazon Timestream

Abstract

In recent years, the time-series database (TSDB) has gradually emerged as its own category because of the particular shape of its workloads; it suits IoT applications and DevOps/app-analytics scenarios. AWS's product in this space is Amazon Timestream. Built on AWS's strengths in distributed computing and storage, Timestream offers a serverless architecture and high scalability, which makes its underlying structure quite interesting to dig into.

From this short session, I captured three key points:

  1. The use cases of Time Series Database and the strengths/advantages of Amazon Timestream.
  2. How to restructure data writes around the billing rules (modifying the ingestion code to cut the cost from about $25 to $0.78 per month).
  3. Best practices for query processing.

It is not an especially deep dive, but it is a 30-minute session that covers the architectural concepts and terminology, and it is a good quick overview for anyone comparing TSDBs.



Topic

Deep Dive on Amazon Timestream

Speaker

  • Tony Gibbs (Principal Database Solution Architect, AWS)

Content

Overview: Deep dive on Amazon Timestream

  • Introducing Amazon Timestream
  • Architectural concepts and terminology
  • Data storage and ingestion
  • Query processing
  • Additional resources

Introducing Amazon Timestream

Time-series use cases

  • IoT applications
    • Collect motion or temperature data from device sensors, interpolate to identify time ranges without motion, and alert consumers to take actions such as turning off the lights to save energy.
  • DevOps analysis
    • Collect and analyze performance and health metrics such as CPU/memory utilization, network data, and IOPS to monitor health and optimize instance usage.
  • App analysis
    • Easily store and analyze clickstream data at scale to understand the customer journey - the user activity across your applications over a period of time.

Building with time-series data is challenging

  • Relational databases
    • ❌ Inefficient at processing time-series data
    • ❌ Data management issues with rigid schema
    • ❌ Limited integrations for ML, analytics, and data collection
  • Existing time-series solutions
    • ❌ Difficult to scale for large volumes of data
    • ❌ Minimal data lifecycle management
    • ❌ Real-time and historical data are decoupled

Amazon Timestream

Fast, scalable, and serverless time-series database

  • Serverless and easy to use
    • No servers to manage or instances to provision; software patches, indexes, and database optimizations are handled automatically
  • Performance at scale
    • Capable of ingesting trillions of events daily; the adaptive SQL query engine provides rapid point-in-time queries with its in-memory store, and fast analytical queries through its magnetic store
  • Purpose-built for time-series data
    • Built-in analytics using standard SQL with added interpolation and smoothing functions to identify trends, patterns, and anomalies
  • Secure from the ground up
    • All data is encrypted in flight and at rest using AWS Key Management Service (AWS KMS) with customer managed keys (CMKs)

Architectural concepts and terminology

  • Continuous releases
    • No maintenance or downtime
    • Serverless architecture

Terminology and concepts: Tables

  • Encrypted container that holds records
  • No data definition or columns are specified at creation
  • Time-based data retention policies for controlling data lifecycle within storage tiers

Terminology and concepts: Storage tiers

Two storage tiers: in-memory and magnetic

  • Retention periods are required for both tiers at table creation
  • Retention periods can be modified after table creation
  • In-memory store retention can range from 1 hour to 1 year, and magnetic store retention from 1 day to 200 years
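
A minimal sketch of setting both retention periods at table creation with boto3, and adjusting them later; the database/table names and retention values here are placeholders of my own, not the session's:

```python
import boto3

write_client = boto3.client('timestream-write')

# Both retention periods are required at table creation:
# in-memory store retention in hours, magnetic store retention in days.
write_client.create_table(
    DatabaseName='devops',
    TableName='ec2_metrics',
    RetentionProperties={
        'MemoryStoreRetentionPeriodInHours': 24,    # 1 hour to 1 year
        'MagneticStoreRetentionPeriodInDays': 365,  # 1 day to 200 years
    },
)

# Retention periods can be modified after table creation.
write_client.update_table(
    DatabaseName='devops',
    TableName='ec2_metrics',
    RetentionProperties={
        'MemoryStoreRetentionPeriodInHours': 72,
        'MagneticStoreRetentionPeriodInDays': 1825,
    },
)
```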

Amazon Timestream architecture

  • Decoupled architecture
    • Highly available - 99.99% SLA
    • Independently scalable ingestion, storage, and SQL processing
  • High throughput auto-scaling ingestion
    • Data is replicated across multiple Availability Zones
    • Automatic data deduplication handling
    • No need to provision or configure write I/O
  • Multiple tiers of storage
    • Scalable to petabytes and beyond
    • In-memory store is designed for fast point-in-time queries
    • Magnetic store is designed for high performance analytics queries and low cost long-term storage
  • Scalable SQL query engine
    • Adaptive query engine is capable of querying data across multiple data tiers
    • No indexes to configure and no provisioning required

Terminology and concepts: Dimensions

Are a set of attributes that uniquely describe a measurement

  • Each table allows up to 128 unique dimensions
  • All dimensions are represented as varchars
  • Dimensions are dynamically added to the table during ingestion

Terminology and concepts: Measures

Each Amazon Timestream record contains a single measurement comprised of a name and value

  • Each table supports up to 1,024 unique measure names
  • Measurement values support boolean, bigint, double, and varchar
  • Measures are dynamically added to the table during ingestion

Terminology and concepts: Time series

Sequence of records that are represented as data points over a time interval for a given measurement

  • A time series is a set of timestamp and measure-value pairs that share the same dimension names, dimension values, and measure name

Example: Time series in Amazon Timestream
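
The slide walked through an example; a hypothetical illustration of a single time series (records sharing the same dimension values and measure name, ordered by time; the values are mine, not the slide's):

```
time                     region     az          hostname    measure_name     measure_value
2020-12-01 19:00:00.000  us-east-1  us-east-1a  host-24Gju  cpu_utilization  35.1
2020-12-01 19:00:05.000  us-east-1  us-east-1a  host-24Gju  cpu_utilization  38.5
2020-12-01 19:00:10.000  us-east-1  us-east-1a  host-24Gju  cpu_utilization  45.3
```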

Terminology and concepts: Data modeling

  • In a traditional relational database, we would create a wide table or use dimension and fact tables to model the data.
  • Amazon Timestream represents data with a single measure per record, as illustrated below.
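
For example, a wide relational row holding two metrics maps to two single-measure Timestream records (hypothetical values of my own):

```
Relational (wide) row:
  time                     hostname    cpu_utilization  memory_utilization
  2020-12-01 19:00:00.000  host-24Gju  35.1             55.3

Timestream records (one measure per record):
  time                     hostname    measure_name        measure_value
  2020-12-01 19:00:00.000  host-24Gju  cpu_utilization     35.1
  2020-12-01 19:00:00.000  host-24Gju  memory_utilization  55.3
```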

Characteristics of Amazon Timestream data

  • All records require a timestamp, one or more dimensions, a measurement name, and a measurement value.
  • Records cannot be deleted or updated.
    • Records are only removed when they reach the retention limit within the magnetic tier.
    • Choice of first or last writer wins semantics for handling duplicates.
  • Multiple measures are logically represented as multiple individual records.
    • One measure per record.
  • Automatically scales to handle high throughput, real-time data ingestion.

Data storage and ingestion

Data ingestion: Connectivity

  • Data is written using the AWS SDK
    • Java, Python, Golang, Node.js, .NET, etc.
    • AWS CLI
  • Adapters and plugins
    • AWS IoT Core
    • Amazon Kinesis Data Analytics for Apache Flink connector (GitHub)
    • Telegraf connector (GitHub)

Data ingestion: Pricing

  • $0.50 per 1 million writes of 1 KB (pricing in us-east-1, us-east-2, us-west-2 regions)

Scenario

  • Send 100 different measurements
  • Assume that measurements are sent every 5 seconds
  • Assume on average each record is 110 bytes

Example: Data ingestion using Python (1)
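
The first version of the slide's code writes one record per request; a minimal sketch of that pattern using boto3 (the database, table, dimension, and measure names below are placeholders of my own, not the session's):

```python
import time
import boto3

write_client = boto3.client('timestream-write')

# Placeholder dimensions and measurements for illustration.
dimensions = [
    {'Name': 'region', 'Value': 'us-east-1'},
    {'Name': 'az', 'Value': 'us-east-1a'},
    {'Name': 'hostname', 'Value': 'host-24Gju'},
]
measurements = {'cpu_utilization': 35.2, 'memory_utilization': 55.3}  # ... up to 100 measures

# Naive pattern: one write_records call per measurement.
# Each request is billed at the 1 KB minimum even though a record is only ~110 bytes.
for name, value in measurements.items():
    write_client.write_records(
        DatabaseName='devops',
        TableName='ec2_metrics',
        Records=[{
            'Dimensions': dimensions,
            'MeasureName': name,
            'MeasureValue': str(value),
            'MeasureValueType': 'DOUBLE',
            'Time': str(int(time.time() * 1000)),
            'TimeUnit': 'MILLISECONDS',
        }],
    )
```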

Data ingestion: Calculating pricing (1)
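
Working through the scenario with one record per request (the slide's exact figures may differ slightly): every request is billed at the 1 KB minimum, so 100 requests every 5 seconds works out to 100 × 17,280 requests/day × 30 days ≈ 51.84 million 1 KB write units per month, or roughly $25.92/month at $0.50 per million.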

Example: Data ingestion using Python (2)

Put all measurements into one list (records).
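
A minimal sketch of the batched pattern, using the same placeholder names as the previous snippet:

```python
import time
import boto3

write_client = boto3.client('timestream-write')

dimensions = [
    {'Name': 'region', 'Value': 'us-east-1'},
    {'Name': 'az', 'Value': 'us-east-1a'},
    {'Name': 'hostname', 'Value': 'host-24Gju'},
]
measurements = {'cpu_utilization': 35.2, 'memory_utilization': 55.3}  # ... up to 100 measures

# Batched pattern: build one list of records and send a single write_records call.
# A request can carry up to 100 records, so the 1 KB minimum applies once per batch.
now = str(int(time.time() * 1000))
records = [
    {
        'Dimensions': dimensions,
        'MeasureName': name,
        'MeasureValue': str(value),
        'MeasureValueType': 'DOUBLE',
        'Time': now,
        'TimeUnit': 'MILLISECONDS',
    }
    for name, value in measurements.items()
]

write_client.write_records(DatabaseName='devops', TableName='ec2_metrics', Records=records)
```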

Data ingestion: Calculating pricing (2)
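
With all 100 records batched into one request, each request carries roughly 100 × 110 bytes ≈ 11 KB of payload, billed as 11 write units. That gives 17,280 × 30 = 518,400 requests per month × 11 ≈ 5.7 million write units, or roughly $2.85/month (the slide's exact figure may differ slightly).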

Example: Data ingestion using Python (3)

Move the attributes shared by every record into common attributes.
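
A minimal sketch of the common-attributes pattern, again with my own placeholder names:

```python
import time
import boto3

write_client = boto3.client('timestream-write')

# Attributes shared by every record in the batch are sent once as CommonAttributes,
# so the dimensions, timestamp, and value type are not repeated per record.
common_attributes = {
    'Dimensions': [
        {'Name': 'region', 'Value': 'us-east-1'},
        {'Name': 'az', 'Value': 'us-east-1a'},
        {'Name': 'hostname', 'Value': 'host-24Gju'},
    ],
    'MeasureValueType': 'DOUBLE',
    'Time': str(int(time.time() * 1000)),
    'TimeUnit': 'MILLISECONDS',
}
measurements = {'cpu_utilization': 35.2, 'memory_utilization': 55.3}  # ... up to 100 measures

# Each record now carries only what is unique to it: the measure name and value.
records = [
    {'MeasureName': name, 'MeasureValue': str(value)}
    for name, value in measurements.items()
]

write_client.write_records(
    DatabaseName='devops',
    TableName='ec2_metrics',
    CommonAttributes=common_attributes,
    Records=records,
)
```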

Data ingestion: Calculating pricing (3)
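
With the shared dimensions and timestamp moved into common attributes, each record carries only its measure name and value. Assuming the request payload drops to roughly 3 KB, that gives 518,400 × 3 ≈ 1.56 million write units per month, or roughly $0.78/month (the figure quoted in the abstract).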

Data ingestion: Pricing recap
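
Reconstructed from the scenario above (approximate monthly cost; the slide's exact figures may differ slightly):

  • One record per request: ≈ $25.92/month
  • 100 records batched per request: ≈ $2.85/month
  • Batched with common attributes: ≈ $0.78/month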

Storage: Memory and magnetic stores

  • In-memory tier
    • Handles the ingestion of all data
    • The timestamp associated with each record must land within the in-memory tier's retention period
    • Automatically handles data deduplication
    • Optimized for latency sensitive point-in-time queries
    • $0.036/GB/hour (pricing in us-east-1, us-east-2, us-west-2 regions)
  • Magnetic disk tier
    • Optimized for high performance analytical queries
    • Cost effective for long-term storage
    • $0.03/GB/month (pricing in us-east-1, us-east-2, us-west-2 regions)

Best practices: Data storage and ingestion

  • Use record batching
    • A single write_records request can write a batch of up to 100 records
    • Each write request has a minimum charge of 1KB
  • Use common attributes
    • This removes the need to redundantly send dimensional data for each record
  • Make measure and dimension names only as long as necessary
    • There are ingestion and storage costs for user-defined dimension and measure names
  • Configure the in-memory tier retention to be just long enough to accommodate late-arriving data
    • The in-memory tier is optimized for queries that access narrow windows of time
  • The magnetic tier is better optimized for analytics queries
    • The magnetic tier is cost-optimized to store data indefinitely

Query processing

Query processing: SQL and connectivity

  • (Mostly) ANSI-2003 SQL for querying
    • Time-series, interpolation, and gap-filling functions
    • 250+ scalar, aggregate, and windowing functions
    • Pricing is $0.01/GB of data scanned (pricing in us-east-1, us-east-2, us-west-2 regions)
  • Data is queried using the AWS SDK or AWS CLI
    • Java, Python, Node.js, .NET, etc.
    • JDBC Driver
    • Amazon QuickSight support
    • Grafana (Open Source Edition)

Best practices: Query processing

  • Queries should have a predicate on the measure_name
  • Queries should have a predicate on time
  • Most queries should have a predicate on one or more dimensions
  • Predicates on time, measure_name, and dimensions can reduce data scan charges by leveraging range-restricted scans
  • Queries using a GROUP BY clause will perform faster if the first grouping dimension has a high cardinality
  • Select only the columns that are necessary; reading unnecessary columns can impact both performance and cost
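
Putting these practices together, a minimal sketch of a query issued with boto3 (the database, table, dimension, and measure names are placeholders of my own):

```python
import boto3

query_client = boto3.client('timestream-query')

# Predicates on measure_name, time, and a dimension keep the scan range-restricted;
# only the columns that are needed are selected.
query = """
    SELECT hostname, BIN(time, 1m) AS binned_time, AVG(measure_value::double) AS avg_cpu
    FROM "devops"."ec2_metrics"
    WHERE measure_name = 'cpu_utilization'
      AND time > ago(1h)
      AND hostname = 'host-24Gju'
    GROUP BY hostname, BIN(time, 1m)
    ORDER BY binned_time
"""

# Results are paginated; follow NextToken until the full result set has been read.
response = query_client.query(QueryString=query)
rows = response['Rows']
while 'NextToken' in response:
    response = query_client.query(QueryString=query, NextToken=response['NextToken'])
    rows.extend(response['Rows'])
```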

Additional resources
