Notes: Pokémon Migrates Its Database to Aurora and DynamoDB

(Figure: the beauty of an aurora. Image source.)

Yesterday I saw a tweet mentioning that Pokémon had migrated its database to AWS Aurora PostgreSQL and DynamoDB. It turned out to be a talk from AWS re:Invent 2019 (GAM304). When I was there in person I skipped almost all of the game-related sessions, so now I'm going back to catch up on this one.

Some passages in the notes below are my own observations and additions, made while or after watching. The purpose of these notes is to improve my own learning efficiency and to keep up the habit of sharing.



AWS re:Invent 2019: Migrating the live Pokémon database to Aurora PostgreSQL (GAM304)

  • 影片出處:https://youtu.be/2eEKuK5eOC4
  • Speakers
    • Chris Finch, Sr. SA Game Tech Evangelist, AWS
    • Jeff Webb, Development Manager, The Pokémon Company International
    • David Williams, Sr. DevOps Engineer, The Pokémon Company International
  • Agenda
    • Introduction
    • A brief history
    • The challenge
    • The solution
    • The results (if you just want the summary and want to save time, skip straight to this section)

Introduction

  • The Pokémon Company International (TPCi)
    • Subsidiary of The Pokémon Company (TPC)
    • Manages Pokémon property outside of Asia
  • Scopes
    • Brand management
    • Localization
    • Trading card game
    • Marketing
    • Licensing
    • PokemonCenter.com
    • Engineering

A brief history

  • Before Pokémon GO (pre-2016)
    • All the consoles are managed by TPC.
    • Pokemon.com was the focus of the small tech team and included:
      • Pokemon.com
        • Marketing and user engagement
        • Pokémon TV
      • Organized Play
        • Trading card game league/tournament management
      • Pokémon Trainer Club
        • User registration
        • User profile management
        • Authentication
        • Millions of accounts
          • Used by Pokemon.com and a few smaller digital products
      • Pokémon Trainer Club Service (PTCS)
        • Purpose: User registration and login system
          • COPPA
          • GDPR
        • Size: Into the hundreds of millions
        • Usage: Millions of logins a day
  • Preparing for Pokémon GO
    • Lift & shift from co-lo to AWS in Spring 2016.
    • Split PTC data out to NoSQL DB in preparation for GO.
    • Pokémon GO launched (July 2016)
      • And everything changed
        • 10x growth of PTCS users in 6 months
        • 100% increase in PTCS users by end of 2017
        • Additional 50% increase in users by end of 2018
      • Service and DB performance was good

The challenge

  • Service stability issues: 2017/2018
    • Service and DB availability was not good
      • Downtime: 137 hours (down or degraded) in a six-month period (about 3.17% of the total: 180 days × 24 = 4,320 hours, and 137 / 4,320 ≈ 3.17%)
    • Licensing and infrastructure costs increasing
      • 300 nodes required to support the service
    • Engineering time
      • Full-time support from 1-2 resources
  • Business drivers for change
    • Instability of the DB platform was impacting customer and visitor experience.
    • Future project success required achieving new goals
      1. Stabilize our services & infrastructure for reduced downtime & customer impact
      2. Reduce operational overhead of managing the DB platform
      3. Reduce costs associated with our DB platform
  • Infrastructure drivers for change
    • Oversized EC2 instances
    • Duplicated infrastructure
      • Maintain multiple redundant copies of data and indexes
    • Operational overhead
      • Routine activities no longer routine
      • Unexpected behaviors: Amazon EC2 restarts, capacity, etc.
      • DB backups
      • Patching/upgrading became unmanageable
  • Data tier architecture
    • Hundreds of DB instances across several roles
    • All deployed with Amazon EC2 Auto Scaling groups
    • One datastore for all data

The solution

  • Design goals
    • Leverage managed services
      • Optimize resource utilization for our core business
    • Use appropriate datastore for data
      • Event data
      • User data
      • Configuration/TTL data
    • High availability, stability, performance
    • Reduce cost & right-size infrastructure
      • Only want to use and pay for what we need
  • Choosing Amazon Aurora PostgreSQL
    • Amazon DynamoDB
      • Pros
        • Tier 1 AWS service
        • Multi-region
        • Highly scalable
        • Easier lift from JSON?
      • Cons
        • Encryption (at the time)
        • Internal expertise
    • Aurora MySQL
      • Pros
        • Internal expertise
        • Encryption
        • Feature rich
      • Cons
        • Nascent tech
        • Multi-region
        • JSON to relational ETL
    • Aurora PostgreSQL
      • Pros
        • Internal expertise
        • Encryption
        • DB internals
        • Native support for JSON (see the query sketch after this list)
      • Cons
        • Feature lag
        • Multi-region
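
The slide lists native JSON support as a key advantage of Aurora PostgreSQL without showing any schema. As a rough illustration only (the table, column, and connection details are hypothetical, not from the talk), here is a minimal sketch of how PostgreSQL's JSONB type lets documents from a NoSQL store land without a JSON-to-relational ETL step:

```python
# Minimal JSONB sketch: store a NoSQL-style document as-is and query inside
# it with PostgreSQL's native JSON operators. All names are hypothetical.
import psycopg2
from psycopg2.extras import Json

conn = psycopg2.connect("dbname=ptcs user=app")  # hypothetical DSN
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS profiles (
            user_id text PRIMARY KEY,
            doc     jsonb NOT NULL
        )
    """)
    doc = {"screen_name": "ash", "country": "US", "activated": True}
    cur.execute(
        "INSERT INTO profiles (user_id, doc) VALUES (%s, %s) "
        "ON CONFLICT (user_id) DO UPDATE SET doc = EXCLUDED.doc",
        ("u123", Json(doc)),
    )
    # Query inside the document directly; no relational remodeling needed.
    cur.execute("SELECT user_id FROM profiles WHERE doc->>'country' = %s", ("US",))
    print(cur.fetchall())
conn.close()
```
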
  • Data-driven approach
    • Acceptance criteria
      • Authentication: 2k/sec
      • User signups: 60+/sec
      • Admin bulk queries
    • Testing
      • User generation: 200m
      • Performance test suite: burst, soak testing (see the load-test sketch after this list)
    • Iterate
      • Rework schema
      • Rerun tests
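
The talk doesn't show the performance test suite itself, so the following is only a toy sketch of the burst-testing idea against the stated acceptance criteria. The endpoint, payload, and rates are hypothetical placeholders; a real suite would drive far higher load (toward 2k logins/sec) from many distributed workers:

```python
# Toy burst-test sketch: fire batches of logins and count successes.
# Endpoint and payload are hypothetical, not from the talk.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

AUTH_URL = "https://auth.example.com/login"  # hypothetical endpoint
BATCH_SIZE = 100        # scale out across workers to approach 2k/sec
DURATION_SECONDS = 10   # a soak test would run for hours instead

def login_once(i: int) -> int:
    resp = requests.post(AUTH_URL, json={"user": f"load-user-{i}", "password": "x"})
    return resp.status_code

def burst() -> None:
    statuses = []
    with ThreadPoolExecutor(max_workers=50) as pool:
        deadline = time.monotonic() + DURATION_SECONDS
        i = 0
        while time.monotonic() < deadline:
            statuses += list(pool.map(login_once, range(i, i + BATCH_SIZE)))
            i += BATCH_SIZE
    ok = sum(1 for s in statuses if s == 200)
    print(f"{ok}/{len(statuses)} logins succeeded")

if __name__ == "__main__":
    burst()
```
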
  • The migration plan
    • An iterative approach
      • Improve cache stability
      • Migrate out TTL and configuration data
      • Stream event data
    • Each phase should independently deliver value
    • Migration phases
      • It looks like everything started out self-hosted on EC2; the phases below are marked (1) (2) (3)…
      • Application tier –> (2) move configuration data out to DynamoDB: auth config, TTL tables –> (3) stream event data through Amazon Kinesis Data Streams into analytics storage (S3); S3 is cheap compared with keeping EC2 instances running (see the boto3 sketch after this list).
        • PTC instances
        • Auth instances
        • Batch/Async instances
      • Data tier –> (4) consolidate the data/query/index nodes into profile data stored in Aurora PostgreSQL
        • Data nodes
        • Query nodes
        • Index nodes
        • Cache nodes –> (1) switch to ElastiCache Memcached
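
To make phases (2) and (3) concrete, here is a hedged boto3 sketch of the two managed-service moves: configuration data with native expiry in DynamoDB, and event data pushed into a Kinesis data stream on its way to analytics storage in S3. The table, stream, and attribute names are hypothetical, not from the talk:

```python
# Sketch of phases (2) and (3). Assumes the AWS resources already exist.
import json
import time

import boto3

dynamodb = boto3.client("dynamodb")
kinesis = boto3.client("kinesis")

# (2) Config/TTL data in DynamoDB: enable native item expiry so "TTL tables"
# no longer need a self-managed datastore to age entries out.
dynamodb.update_time_to_live(
    TableName="auth-config",  # hypothetical table
    TimeToLiveSpecification={"Enabled": True, "AttributeName": "expires_at"},
)
dynamodb.put_item(
    TableName="auth-config",
    Item={
        "config_key": {"S": "session:u123"},
        "expires_at": {"N": str(int(time.time()) + 3600)},  # expires in 1 hour
    },
)

# (3) Event data into Kinesis Data Streams; a delivery pipeline on the other
# end (e.g. Kinesis Data Firehose) batches records into S3 for analytics.
kinesis.put_record(
    StreamName="ptcs-events",  # hypothetical stream
    Data=json.dumps({"type": "login", "user_id": "u123"}).encode("utf-8"),
    PartitionKey="u123",
)
```
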
  • Planning for Aurora PostgreSQL
    • AWS Professional Services to bridge the knowledge gap
      • Validate our schema design and provide feedback
      • Advice on how to tune DB parameter groups
      • Tools for planning, monitoring, and tuning
    • Aurora PostgreSQL cluster design
      • PTC instances –> Cluster
      • Auth instances –> Login
      • Batch/Async instances –> Admin/bulk
  • The migration: Extract-Transform-Load (ETL)
    • Split into a NoSQL live cluster, a NoSQL backup cluster, and a NoSQL extraction cluster.
    • Transform & load
      • “Pretty easy”
      • Abandon users that had never activated
      • Minor data changes, nothing structural
    • Extract
      • Leverage NoSQL cluster architecture
      • Map-reduce to find users not extracted
      • Extraction process marks users as such in the backup cluster (see the sketch after this list)
      • Any user changes in production would overwrite changes in backup cluster
    • Test
      • 11m user multi-cluster test setup
      • Dozens of test runs
      • Test cases - inactive users, user updates
    • ~2% of documents were not overwritten
    • And iterate
      • User profile change documents
      • Third cluster
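
The talk doesn't name the NoSQL product or its API, so here is a plain-Python sketch of the extraction bookkeeping described above: a map-reduce-style pass finds unextracted users, the extractor marks them in the backup cluster, and any replicated production change overwrites the mark so the user is picked up again on the next pass:

```python
# Plain-Python stand-in for the extraction marking scheme; names invented.

def users_to_extract(backup_cluster: dict) -> list:
    """Map-reduce stand-in: documents not yet marked as extracted."""
    return [uid for uid, doc in backup_cluster.items() if not doc.get("_extracted")]

def extract_user(backup_cluster: dict, uid: str, target: dict) -> None:
    doc = backup_cluster[uid]
    target[uid] = {k: v for k, v in doc.items() if k != "_extracted"}  # load step
    backup_cluster[uid] = {**doc, "_extracted": True}  # mark in backup cluster

def replicate_change(backup_cluster: dict, uid: str, new_doc: dict) -> None:
    """A production write replicated into backup drops the mark, flagging the
    user for re-extraction (~2% of documents in the team's runs)."""
    backup_cluster[uid] = dict(new_doc)  # note: no "_extracted" key survives

# Tiny walkthrough:
backup = {"u1": {"name": "ash"}, "u2": {"name": "misty"}}
aurora: dict = {}
for uid in users_to_extract(backup):
    extract_user(backup, uid, aurora)
replicate_change(backup, "u1", {"name": "ash", "country": "US"})
print(users_to_extract(backup))  # ['u1'] -> u1 needs re-extraction
```
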
  • Migration Day (the cutover sequence is sketched in code after these steps)
    • First, stop PTCS profile maintenance activities.
    • Auth cannot stop. Auth (login) traffic keeps hitting NoSQL while the extraction keeps syncing data into Aurora PostgreSQL.
    • Once Aurora PostgreSQL has fully caught up, switch auth from writing to NoSQL to writing to Aurora PostgreSQL.
    • Testers then come in to test PTCS, now pointed at Aurora PostgreSQL.
    • Finally, shut down NoSQL.
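
Since the talk only describes the sequence, the routing mechanism below is hypothetical: a minimal sketch of a login router that keeps auth flowing to NoSQL until Aurora has caught up, then flips over without downtime:

```python
# Hypothetical cutover sketch; class names and the flag mechanism are invented.

class DictBackend:
    """Stand-in datastore; just enough to make the sketch runnable."""
    def __init__(self, users: dict):
        self.users = users

    def authenticate(self, user_id: str, password: str) -> bool:
        return self.users.get(user_id) == password

class AuthRouter:
    """Routes logins to NoSQL until cutover, then to Aurora; auth never stops."""
    def __init__(self, nosql: DictBackend, aurora: DictBackend):
        self.nosql = nosql
        self.aurora = aurora
        self.use_aurora = False

    def login(self, user_id: str, password: str) -> bool:
        backend = self.aurora if self.use_aurora else self.nosql
        return backend.authenticate(user_id, password)

    def cutover(self) -> None:
        # Flip only after the extraction pipeline reports Aurora is in sync.
        self.use_aurora = True

router = AuthRouter(DictBackend({"u1": "pw"}), DictBackend({"u1": "pw"}))
assert router.login("u1", "pw")  # still served by NoSQL while syncing
router.cutover()                 # Aurora caught up: flip the flag
assert router.login("u1", "pw")  # now served by Aurora; NoSQL can retire
```
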
  • How did it go?
    • The good
      • No authentication downtime
      • Plan worked as expected
    • The bad
      • Patches
      • Some underperforming queries
    • The ugly
      • Nothing
    • 95% of users experienced no impact
    • Performance was good and consistent

The results

  • Design goals revisited
    • Leverage managed services (checked)
    • Use appropriate datastore for data (checked)
    • High availability, stability, performance (checked)
    • Reduce cost & right-size infrastructure (checked)
  • Overall value
    • Technology
      • Old platform: 3rd party NoSQL
      • New platform: Aurora, DynamoDB, S3
      • Benefits: Independent scaling / managed
    • Infra/Licensing
      • Old: ~300 nodes / costly licensing
      • New: ~10–20 nodes / ~$0
      • Benefits: ~$3.5–4.5 million/year savings
    • Dedicated resources
      • Old: 1.5 dev/engineer
      • New: None
      • Benefits: 1.5 dev/engineer savings (they're still around, just reassigned to support other work :p)
    • Stability
      • Old: 137 hours (6 months)
      • New: 0
      • Benefits: Customer experience, priceless (you could probably put a dollar figure on the business value here, but writing "priceless" gets more applause (kidding XDD))
  • Project retrospective
    • What went well
      • An agile approach to the project & problems we encountered
      • Leveraging the experts: AWS DB Professional Services (Critical point! Ask for help!)
      • Segmenting the data and how it was handled
    • Key learnings
      • Deeper understanding of our data and services
      • Prior solution was more expensive (money and people) than we realized
  • Tech org moving forward
    • Next phase
      • Monitor & optimize the platform and our services
      • Understand upcoming needs & changes to our data
      • Scope our advanced analytics requirements
    • New architectural tenets
      • Define and use common design patterns and components
      • Simplify toolset, preferring managed services
      • Documentation must exist, be well-defined
      • Data standards & practices
      • Use data to make decisions
  • The whole room cheered at the end. This session was such a blast! XDD