1
Scaling and stabilizing large server side infrastructure
Yoshinori Matsunobu
Principal Infrastructure Architect, DeNA
2
Who am I
  • 2006-2010 Lead MySQL Consultant at MySQL AB (now
    Oracle)
  • Sep 2010-present: Database and Infrastructure
    engineer at DeNA
  • Eliminating single-point-of-failure
  • Establishing no-downtime operations
  • Performance optimizations, reducing the total
    number of servers
  • Avoiding sudden performance stalls
  • Many more
  • Speaking at many conferences such as mysqlconf,
    OSCON
  • Oracle ACE Director since 2011

3
DeNA Company Background
  • One of the largest social game providers in Japan
  • Social game platform Mobage and many social
    game titles
  • Subsidiary (ngmoco) in San Francisco
  • Mobile phone, smartphone, and PC games localized
    for Japan
  • Operating thousands of servers in multiple
    datacenters
  • 2-3 billion page views per day
  • 35 million users
  • Received the O'Reilly MySQL Award in 2011
  • Note: this is not a sponsored session, so the
    talk and slides are neutral

4
Agenda
  • Scaling strategies
  • Cloud vs Real Servers
  • Performance practices
  • Administration practices

5
Server side apps: What's difficult?
At least these parameter values differ per user, change very frequently, and
must not be lost: Name, Job, Equipment, ATK points, DEF points, Social
points, Current LP, Max LP, Estimated time of recovery, Current BP, Max BP,
EXP/LV, etc.
6
Server side apps: What's difficult?
  • Dynamic data must not be lost
  • Should be stored into (stable) databases
  • Should be highly available
  • Some kind of data redundancy is needed
  • Very frequently selected and updated
  • Top page is the most frequently accessed. All of
    the latest values need to be fetched
  • Name does not change, but LP/EXP change very
    frequently
  • Caching (read scaling) and write tuning strategy
    should be planned
  • Data size can be huge
  • 100KB per user x 100M users = 10TB in total
  • Some kinds of data distribution strategy might
    be needed
  • Focus on text (images/movies are read only, and
    can be located on static contents servers, so it's
    easy from the server side point of view)
  • Not so different between online games and
    general web services

7
Characteristics of online/social games
  • It is very difficult to predict growth
  • Social games grow (or shrink) rapidly
  • e.g. estimating 10 million page views per day
    before going live...
  • Good case: 100 million page views/day
  • Bad case: only 10K page views/day
  • It is necessary to prepare servers for handling
    enough traffic
  • It takes time to purchase, ship and setup 100
    physical servers
  • Too many unused servers might kill your company
  • For smaller companies
  • Using cloud services (AWS, Rackspace, etc) is
    much less risky
  • For larger companies
  • Servers in stock can probably be used

8
Scaling strategy
  1. Single Server
  2. Multiple-Tier servers
  3. Scaling Reads / Database Redundancy
  4. Horizontal Partitioning / Sharding
  5. Distributing across regions

9
Single Server
  • Logically separated
  • Issues
  • Single Point of Failure
  • Service capacity is very limited (soon becoming
    CPU or disk I/O bound)

Web/App
Database
10
H/W failure often happens
  • At DeNA, we run thousands of machines in
    production
  • N,000 web servers
  • 1,000 database (mainly MySQL) servers
  • 100 caching (memcached, etc) servers
  • Most of server failures are caused by H/W
    problems
  • Disk I/O errors, memory corruption, etc
  • Mature middleware is stable enough
  • MySQL has not crashed due to its own bugs for years
  • Do not be too afraid of (but do prepare for)
    upgrading middleware
  • Older software (CentOS4, MySQL5.0, etc) has lots
    of bugs that will never be fixed

11
Multiple-Tier Servers
  • Physically separated Web and Database Tier
  • Multiple Web servers
  • Single database server
  • Web servers scale
  • Issues
  • Database server is still a single point of
    failure
  • Service capacity is limited by the database
    server's performance
  • A single database server can't handle TBs of
    active game data

Web/App
Database
12
Scaling Reads / Database Redundancy
  • This is probably the most common deployment in
    the world
  • Data is replicated from master to slaves
  • Read traffic can be scaled with cache servers
    and slave servers
  • Issues
  • Master database is still single point of failure
  • Write is not scalable

Cache
Web/App
Database(W master)
Replication
Database(R slave)
13
Replication for read scaling/Redundancy
  • Asynchronous replication is the majority
  • In case of master crash you might lose some of
    the latest data
  • Most of (open source) database software does not
    support synchronous replication
  • Sync replication does not perform well between
    remote datacenters
  • Be careful about replication delay
  • Replication is single threaded in most databases
  • Game users are very sensitive to even a few
    seconds of delay
  • How do you feel if you can't find virtual items
    that you bought just now?
  • Especially during limited time gaming events
  • We use SSD on slave servers so that replication
    slaves can keep up

14
Resolving server address (our case)
[Diagram: web/app servers call connect(ff_m) / connect(ff_s); MyDNS (the
global catalog database) returns the name-to-host mapping, which is cached
locally. Example mappings: ff_m -> db1001; ff_s -> db1011, db1012, db1013;
gundam_m -> db2103; gundam_s -> db1713, db1714, db1715, db1716]
  • Other approaches
  • Using distribution-aware databases such as
    MongoDB
  • + Good if the database software itself is stable
  • Using a Load Balancer for distributing access to
    slaves
  • - Increases the number of servers and response time
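As a hedged illustration of the lookup-and-cache pattern above (the catalog
table db_map, host names, and credentials are invented for this sketch, not
DeNA's actual MyDNS schema):

    use strict;
    use warnings;
    use DBI;

    my %local_cache;    # logical name -> arrayref of physical hosts

    sub resolve_db_hosts {
        my ($logical_name) = @_;
        # Serve from the local cache first to avoid hitting the catalog
        return $local_cache{$logical_name} if $local_cache{$logical_name};

        my $catalog = DBI->connect('DBI:mysql:database=catalog;host=mydns',
                                   'app', 'secret', { RaiseError => 1 });
        my $hosts = $catalog->selectcol_arrayref(
            'SELECT host FROM db_map WHERE logical_name = ?',
            undef, $logical_name);
        $catalog->disconnect;
        return $local_cache{$logical_name} = $hosts;
    }

    # connect(ff_m): writes go to the single master host for the game
    my ($master) = @{ resolve_db_hosts('ff_m') };    # e.g. db1001
    my $dbh = DBI->connect("DBI:mysql:database=ff;host=$master",
                           'app', 'secret', { RaiseError => 1 });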

15
Scaling Writes
  • Database write operations
  • INSERT, UPDATE, DELETE
  • Most write operations do random disk reads as
    well as writes
  • INSERT: reading target index blocks
    (2,000-4,000/s)
  • UPDATE: reading matched records (200-2,000/s)
  • DELETE: reading matched records and indexes
    (100-2,000/s)
  • Performance highly depends on storage devices
  • Using SSD (great at random reads) is a good
    practice
  • Single database server is not enough to handle
    massive write requests

16
Horizontal Partitioning / Sharding
  • Data is divided into shards
  • Both reads and writes are scalable
  • How should application programs choose proper
    shard?
  • Hashing, Mapping table
  • Each master database still might be single point
    of failure (depending on database products)
  • Datacenter crash results in service failure
  • Re-sharding (moving data) is painful

Cache
Web/App
Database(W1,2,..)
Shard1
Shard2
Database(R1,2,..)
1 < uid < 1M
1M < uid < 2M
17
Approaches for Sharding
  • Developing application framework
  • Creating a catalog (mapping) table (e.g.
    user_id 1..1M -> shard1)
  • Looking up the catalog table (and caching it), then
    accessing the proper shard (see the sketch below)
  • So far many people have taken this approach
  • Once you create a framework, you can use it for
    many other games
  • Re-sharding (moving data between shards) is
    difficult
  • Using sharding-aware database products
  • Database client library automatically selects a
    proper shard
  • Many recent distributed NoSQL databases support
    it
  • MongoDB, MemBase, HBase (and many more) are
    popular
  • Newer databases tend to have lots of bugs and
    stability problems
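A minimal sketch of the two shard-selection approaches, assuming
user_id-based sharding (shard names and ranges are illustrative):

    use strict;
    use warnings;

    # 1) Hashing: stateless and simple, but re-sharding moves most keys
    sub shard_by_hash {
        my ($user_id, $num_shards) = @_;
        return 'shard' . (($user_id % $num_shards) + 1);
    }

    # 2) Catalog (mapping) table: flexible ranges; in production the map
    #    would come from a catalog database and be cached locally
    my @range_map = (
        { min => 1,         max => 1_000_000, shard => 'shard1' },
        { min => 1_000_001, max => 2_000_000, shard => 'shard2' },
    );

    sub shard_by_catalog {
        my ($user_id) = @_;
        for my $r (@range_map) {
            return $r->{shard}
                if $user_id >= $r->{min} && $user_id <= $r->{max};
        }
        die "no shard mapped for user_id=$user_id";
    }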

18
Approaches to avoid Re-Sharding
[Diagram: running N shards (Instance 1, Instance 2, ...) within a single
server -> running a dedicated shard per server -> moving to a
higher-spec server]
  • Don't move data into a different shard; instead
    move the entire shard to a higher-spec environment
    (larger RAM, SSDs, etc)
  • This is easy to do with replication slaves
  • In practice we've been able to avoid re-sharding

19
Handling rapidly growing users
  • On one of our most popular online games..
  • At first we started with 2 database shards + 3
    slaves per shard
  • The number of registered users in the first two
    days after launch was much higher than expected
  • We added two more shards dynamically, then mapped
    all new users to new shards
  • We have heavily used range-partitioning for
    removing older data
  • Can drop older data very quickly (a few
    milliseconds; see the sketch below)
  • This has helped a lot to keep total data size
    down (less than 250GB per shard)
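For example (a sketch; table and partition names are hypothetical), dropping
a range partition in MySQL removes the whole range at once instead of
deleting rows one by one:

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('DBI:mysql:database=game;host=shard1-master',
                           'app', 'secret', { RaiseError => 1 });

    # Assumes a table partitioned by range, e.g.:
    #   CREATE TABLE battle_log ( ... )
    #   PARTITION BY RANGE (TO_DAYS(created))
    #   (PARTITION p20110101 VALUES LESS THAN (TO_DAYS('2011-01-08')), ...);

    # DROP PARTITION unlinks the partition's data, so it finishes almost
    # instantly regardless of how many rows the partition holds
    $dbh->do('ALTER TABLE battle_log DROP PARTITION p20110101');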

20
Mitigating Replication delay
  • For older games using HDD on slaves...
  • Even 1,000 updates per sec causes replication
    delay
  • Replacing HDD with SATA SSD on slaves is a good
    practice
  • Many updates (4,000 updates per second)
  • Some slaves with SSD got behind master
  • Increasing RAM
  • Increasing RAM from 32GB to 64GB per shard
  • Reducing the total write volumes
  • innodb_doublewrite=0 helps in InnoDB (MySQL)
  • Avoiding sudden short stalls
  • Using xfs filesystem instead of ext3
  • We migrated to higher-spec servers without
    downtime. We haven't needed re-sharding so far

21
Cloud vs Physical servers
  • DeNA uses physical servers (N thousands)
  • Some of our subsidiaries use AWS / AppEngine

22
Advantages of cloud servers
  • Initial costs are very small
  • You don't need to buy 10 servers (which may cost
    more than $50,000) for a new game that might be
    closed within a few months
  • No lead time to increase servers
  • It is not uncommon to take 1 month to get new
    physical H/W components
  • No penalty to decrease servers
  • If unused physical servers can't be used anywhere
    else, they just waste money

23
Disadvantages of Cloud servers
  • Taking longer time to analyze problems
  • Network configurations are black box
  • Storage devices are black box
  • Problems caused by disks and network often
    happen, but it takes longer time to find root
    causes
  • This really matters for games that generate
    $10M/month
  • Limited choices for performance optimizations
  • What I want:
  • Using direct attached PCIe SSD for handling
    massive reads/writes
  • Customizing Linux to reduce TCP connection
    timeouts (from 3 seconds to 0.5-1 second)
  • Updating device drivers
  • Per-server performance tends to be (much) lower
    than on a physical server
  • More expensive than physical servers when your
    system becomes large

24
Distributing across DC/regions
  • Availability
  • Single datacenter crash should not result in
    service failure
  • Latency / Response time
  • Round trip time (RTT) between Tokyo and San
    Francisco exceeds 100ms
  • 10 round trips within 1 HTTP request will take
    more than 1 second
  • If you plan to release games in Japan/APAC, I
    don't recommend sending all contents from the US
  • Use CDN (to send static contents from APAC
    region) or run servers in APAC

25
Distributing across multiple regions
  • Network latency (100ms RTT) should be considered
  • Web servers should access local databases as
    much as possible
  • Conflict detection/resolution is a very tough
    issue
  • What if user_id=1 was updated in both regions at
    the same time?
  • A "home region" ID should be considered
  • When updating users in Tokyo, always access
    masters in Tokyo
  • Bulk operations should be considered (see the
    sketch below)
  • To reduce round-trips between Web and DB
  • Stored Procedure, Proxy server, etc...

Web/ Cache
Web/ Cache
DB(W1,2,..)
DB(W1,2,..)
DB(R1,2,..)
DB(R1,2,..)
Tokyo DC
US East DC
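The bulk-operation point can be sketched as follows (table, columns, and
hosts are hypothetical): one IN query costs a single round trip where
per-row lookups would cost one RTT each.

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('DBI:mysql:database=game;host=local-master',
                           'app', 'secret', { RaiseError => 1 });

    my @user_ids = (1, 2, 3, 42);

    # One round trip instead of one per user id
    my $placeholders = join(',', ('?') x @user_ids);
    my $rows = $dbh->selectall_arrayref(
        "SELECT user_id, lp, exp FROM user_status
         WHERE user_id IN ($placeholders)",
        { Slice => {} }, @user_ids);

    printf "user %d: LP=%d\n", $_->{user_id}, $_->{lp} for @$rows;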
26
Monitoring and Performance Practices
  • Monitoring
  • Fighting against stalls
  • Improving per server performance
  • Consolidating servers

27
Monitoring Server
  • Server activity
  • reachable via ping, http, mysql, etc
  • H/W and OS errors
  • Memory usage (to avoid out of memory)
  • Disk failure, Disk block failure
  • RAID controller failure
  • Network errors (sometimes caused by bad switches)
  • Clock time

28
Monitoring Server Performance
  • Resource utilization per second
  • Load average (Not perfect, but better than
    nothing)
  • Concurrency (the number of running threads;
    useful to detect stalls)
  • Disk busy rate (svctm, iowait)
  • CPU utilization (especially on web servers)
  • Network traffic
  • Free memory (Swap is bad / Be careful about
    memory leak)
  • Web Servers
  • HTTP response time
  • The number of processes/threads that can accept
    new HTTP connections
  • Database Servers
  • Queries per second
  • Bad queries
  • Long running transactions (blocking other
    clients)
  • Replication delay
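Replication delay, for instance, can be polled on each MySQL slave roughly
like this (a sketch; host, credentials, and the 5-second threshold are
placeholders):

    use strict;
    use warnings;
    use DBI;

    my $slave = DBI->connect('DBI:mysql:host=db1011', 'monitor', 'secret',
                             { RaiseError => 1 });

    my $status = $slave->selectrow_hashref('SHOW SLAVE STATUS');
    my $delay  = $status->{Seconds_Behind_Master};

    if (!defined $delay) {
        warn "replication is not running\n";    # NULL when a thread stopped
    } elsif ($delay > 5) {
        warn "slave is $delay seconds behind the master\n";
    }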

29
Best practices for minimizing replication delay
  • Application side
  • Continuously monitor bad queries
  • Especially when deploying new modules
  • Do not run massive updates in single DML
    statement
  • e.g. LOAD DATA, ALTER TABLE
  • Reduce the number of DML statements
  • INSERT + UPDATE + DELETE -> a single UPDATE (see
    the sketch after this list)
  • Infrastructure side
  • Using SSD on slave servers
  • Using larger RAM
  • Using xfs filesystem
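One way to collapse an insert-or-update pair into a single statement is
MySQL's INSERT ... ON DUPLICATE KEY UPDATE; this sketch uses a hypothetical
table and is not necessarily how DeNA rewrote its queries:

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('DBI:mysql:database=game;host=shard1-master',
                           'app', 'secret', { RaiseError => 1 });

    # One statement, and one event for the slave's single replication
    # thread to apply, instead of a SELECT plus an INSERT or UPDATE
    $dbh->do(q{
        INSERT INTO user_status (user_id, lp, exp)
        VALUES (?, ?, ?)
        ON DUPLICATE KEY UPDATE lp = VALUES(lp), exp = exp + VALUES(exp)
    }, undef, 42, 100, 5);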

30
Performance analysis
  • Identifying reasons why the server is so slow or
    resource consuming
  • Web server
  • Profiling functions that take long elapsed time
    (e.g. using Devel::NYTProf for Perl-based applications)
  • Database server
  • Bad queries (full table scan/etc) / indexing
  • Only a few query patterns consume more than 90%
    of the time

31
Thundering Herd / C10K
  • Most middleware can't handle 100K concurrent
    requests per server
  • Work hard for stable request processing
  • Sudden cache miss (CDN bug, memcached bug, etc)
    will result in sending burst requests toward
    backend servers
  • Case: a memcached bug (recently fixed)
  • memcached crashed when thousands of persistent
    connections are established
  • All requests to memcached went to database
    servers (cache miss)
  • Backend database servers couldn't handle 10x more
    queries
  • -> service down
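The dangerous read path looks roughly like this sketch (keys, hosts, and
expiry are hypothetical): every memcached miss falls through to the
database, so a cache-layer crash turns the whole request stream into
database queries at once.

    use strict;
    use warnings;
    use Cache::Memcached;
    use DBI;

    my $memd = Cache::Memcached->new({ servers => ['cache1:11211'] });
    my $dbh  = DBI->connect('DBI:mysql:database=game;host=db-slave',
                            'app', 'secret', { RaiseError => 1 });

    sub get_user_status {
        my ($user_id) = @_;
        my $cached = $memd->get("user_status:$user_id");
        return $cached if $cached;    # normal case: served from cache

        # Cache miss: if the cache layer dies, ALL requests take this
        # path at the same time and hammer the database
        my $row = $dbh->selectrow_hashref(
            'SELECT * FROM user_status WHERE user_id = ?', undef, $user_id);
        $memd->set("user_status:$user_id", $row, 60);    # re-cache for 60s
        return $row;
    }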

32
Stalls
  • Stalls cause bad response time and unstable
    resource usage
  • In bad cases web servers or database servers will
    go down
  • Serious if it happens on database servers
  • All web servers are affected
  • Identifying stalls is important, but difficult
  • Very often the root causes are inside the
    middleware's source code
  • I often take stack traces and dig into the MySQL
    source code

33
Avoiding sudden performance drops
[Chart: response time over time; Product A is faster on average but shows
sudden drops, Product B stays stable]
  • Some unstable database servers suddenly drop
    performance in some situations
  • Low performance is a problem because we can't
    meet customers' demands during that time
  • Though product A is better on average, product B
    is much more stable
  • Don't trust benchmarks. Vendors' benchmarks show
    the best scores but don't show the bad numbers

34
Monitoring Stalls
  • All clients are blocked for a short period of
    time (less than one second to a few seconds)
  • The number of internally running threads grows
    significantly (5-20 on average, but suddenly jumps
    to 1000+, and TOO MANY CONNECTIONS errors are
    thrown)
  • Increased response time
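A crude stall detector along these lines samples the running-thread count
every second (a sketch; the threshold of 100 and the host are illustrative):

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('DBI:mysql:host=db1001', 'monitor', 'secret',
                           { RaiseError => 1 });

    while (1) {
        my (undef, $running) = $dbh->selectrow_array(
            q{SHOW GLOBAL STATUS LIKE 'Threads_running'});
        # Normally 5-20; a sudden jump toward the connection limit is
        # the signature of a stall
        warn scalar(localtime) . ": Threads_running=$running\n"
            if $running > 100;
        sleep 1;
    }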

35
Avoiding Stalls
  • Avoid holding locks for a long time
  • Other clients that need the locks are blocked
  • Establishing a new TCP connection sometimes takes
    3 seconds due to SYN retries
  • Prohibiting cheats from malicious users
  • e.g. massive requests from the same user id
  • Be careful when choosing database products
  • Most newly announced database products don't
    care about stalls

36
Improving per-server performance
  • To handle 1 million queries per second..
  • 1000 queries/sec per server -> 1000 servers in
    total
  • 10000 queries/sec per server -> 100 servers in
    total
  • An additional 900 servers will cost $10M initially,
    $1M every year
  • If you can increase per server throughput, you
    can reduce the total number of servers and TCO

37
Recent H/W trends
  • 64-bit Linux + large RAM
  • 60GB-128GB RAM is quite common
  • SSD
  • Database performance can be N times faster
  • Using SSD on MySQL slaves is a good practice to
    eliminate replication delay
  • Network, CPU
  • Don't use 100Mbps
  • CPU speed matters: there is still a lot of
    single-threaded code inside middleware

38
Consolidating Web/DB servers
  • How do you handle unpopular games?
  • Running a small game on high-end servers is not
    cost effective
  • Recent H/W is fast
  • Running N DB instances on a single server is not
    uncommon
  • DeNA consolidates 2-10 games in a single database
    server

39
Performance is not everything
  • High Availability, Disaster Recovery (backups),
    Security, etc
  • Be careful about malicious users
  • Bot / repeatable access via tools
  • Duplicating items / Real money trades
  • Illegal logins
  • etc

40
Administration practices
  • Automating setups
  • Operations without downtime
  • Automating failover

41
Automation
  • Automation is important to reduce operational
    costs
  • The number of dev/ops engineers cannot grow as
    quickly as the number of servers
  • At DeNA
  • Hundreds of servers are managed per devops
    engineer
  • Initial server setups (installing OS, setting up
    Web server) can be done within 30 minutes
  • Can add to / remove from services in seconds to
    minutes
  • Do not automate (in production) what you do not
    understand
  • How do you restart failover when automated
    failover stops with errors in the middle?
  • Without in-depth understanding it is very hard to
    recover

42
Automating installation/setup
  • Automate software installation / filesystem
    partitioning / etc
  • Kickstart + local yum repository
  • Cloning OS
  • Copying entire OS image (including software
    packages and conf files) from read-only base
    server
  • We use this approach
  • Automate initial configuration
  • Not all servers are identical
  • Hostname, IP addr, disk size
  • Some parameters (e.g. server-id in MySQL)
  • Use a configuration manager like Chef/Puppet (we
    use a similar internal tool)
  • Automate continuous configuration checking
  • Sometimes people change configurations
    tentatively and forget to set them back
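Continuous configuration checking can be as simple as this sketch (the
expected values below are examples, not recommendations):

    use strict;
    use warnings;
    use DBI;

    my %expected = (
        max_connections => 2000,
        sync_binlog     => 1,
    );

    my $dbh = DBI->connect('DBI:mysql:host=db1001', 'monitor', 'secret',
                           { RaiseError => 1 });

    # Report any live setting that drifted from the expected value
    for my $name (sort keys %expected) {
        my (undef, $value) = $dbh->selectrow_array(
            'SHOW GLOBAL VARIABLES LIKE ?', undef, $name);
        warn "$name is $value, expected $expected{$name}\n"
            if $value ne $expected{$name};
    }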

43
Moving Servers: Games are expanding rapidly
  • For expanding games
  • Adding more web servers
  • Adding more replication slaves or cache servers
    (for read scaling)
  • Adding more shards (for write scaling)
  • DeNA has Perl based sharding framework on
    application side so that we can add new shards
    without stopping services
  • Scaling up the master's H/W, or upgrading the
    MySQL version
  • More RAM, HDD -> SSD/PCIe SSD, faster CPU, etc

44
Moving Servers: Games are shrinking gradually
  • For shrinking games
  • Decreasing web servers
  • Decreasing replication slaves
  • Migrating servers to lower-spec machines
  • Consolidating a few servers into a single machine

45
Moving database servers
  • For many databases (MySQL etc), there is only one
    master server (replication master) per shard
  • Moving the master is not trivial
  • Many people allocate scheduled maintenance
    downtime
  • We want to move master servers more frequently
  • Scaling-up or Scaling-down (Online games have
    many more opportunities than non-game web
    services)
  • Upgrading MySQL version / updating non-dynamic
    parameters
  • Working around a power outage by moving games to
    a remote datacenter

46
Desire for easier operations
  • In many cases people do not want to allocate
    maintenance window
  • Announcing to users, coordinating with customer
    support, etc
  • Longer downtime reduces revenue and hurts brands
  • Operations staff will be exhausted by too much
    midnight work
  • Reducing maintenance time is important to manage
    hundreds or thousands of servers

47
Our case
  • Previous
  • Allocating a 30 minute maintenance window after
    2:00am
  • Announcing on the top page of the game
  • Coordinating with the customer support team
  • Couldn't do this many times because it's painful
  • Current
  • Migrating to a new server within 0.5-3 seconds,
    gracefully, with an online MySQL master switching
    tool (MHA for MySQL, open source)
  • No maintenance window to allocate, so switches
    can be done in the daytime
  • Note: be very careful about error handling when
    using databases that support automated master
    switching
  • Killing database sessions might result in data
    inconsistency
  • If it waits for a long time to disconnect,
    downtime will be longer
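With MHA, an online switch is a single command, along these lines (the
config path and new master host are placeholders):

    masterha_master_switch --master_state=alive --conf=/etc/mha/app1.cnf \
        --new_master_host=db1002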

48
Automating Failover
  • Master database server is usually a single point
    of failure, and difficult for failover
  • Bleeding edge databases that support automated
    failover often don't work as expected
  • Split-brain, false positives, etc
  • In our case: using MHA for automated MySQL
    failover
  • Takes 9-12 seconds to detect failure, 0.5-N
    seconds to complete failover
  • Mostly caused by H/W failure

49
Manual, Simple failover
  • In extreme crash scenarios, automated failover is
    dangerous
  • e.g. datacenter failure
  • Identifying root cause should be the first
    priority
  • But failover should be doable with a simple
    enough command
  • Just one command

50
Summary
  • Understanding scaling solutions is important to
    provide large/growing social games
  • Stable performance is important to avoid sudden
    service outages. Do not trust sales talk; much
    middleware still has stall problems
  • Online games tend to grow or shrink rapidly.
    Solutions for setting up/migrating servers
    (including master database) are important

51
Thank you!
  • Yoshinori.Matsunobu@gmail.com
  • Yoshinori.Matsunobu on Facebook
  • @matsunobu on Twitter