Title: Scaling and stabilizing large server-side infrastructure. Yoshinori Matsunobu, Principal Infrastructure Architect, DeNA
1. Scaling and stabilizing large server-side infrastructure
Yoshinori Matsunobu, Principal Infrastructure Architect, DeNA
2. Who am I
- 2006-2010: Lead MySQL Consultant at MySQL AB (now Oracle)
- Sep 2010-: Database and infrastructure engineer at DeNA
  - Eliminating single points of failure
  - Establishing no-downtime operations
  - Performance optimizations, reducing the total number of servers
  - Avoiding sudden performance stalls
  - Many more
- Speaking at many conferences such as MySQL Conf and OSCON
- Oracle ACE Director since 2011
3. DeNA Company Background
- One of the largest social game providers in Japan
  - Social game platform Mobage and many social game titles
  - Subsidiary (ngmoco) in San Francisco
  - Games for Japanese feature phones, smartphones, and PCs
- Operating thousands of servers in multiple datacenters
  - 2-3 billion page views per day
  - 35 million users
- Received the O'Reilly MySQL Award in 2011
- Note: this is not a sponsored session, so the talk and slides are neutral
4. Agenda
- Scaling strategies
- Cloud vs Real Servers
- Performance practices
- Administration practices
5. Server-side apps: What's difficult?
At least these parameter values differ per user, change very frequently, and must not be lost:
Name, Job, Equipment, ATK points, DEF points, Social points, Current LP, Max LP, Estimated time of recovery, Current BP, Max BP, EXP/LV, etc.
6. Server-side apps: What's difficult?
- Dynamic data must not be lost
  - Should be stored in (stable) databases
  - Should be highly available: some kind of data redundancy is needed
- Very frequently selected and updated
  - The top page is the most frequently accessed, and all of the latest values need to be fetched
  - Name does not change, but LP/EXP change very frequently
  - Caching (read scaling) and write tuning strategies should be planned
- Data size can be huge
  - 100KB per user x 100M users = 10TB in total
  - Some kind of data distribution strategy might be needed
- Focus on text (images/movies are read-only and can be placed on static content servers, so they are easy from the server-side point of view)
- Not so different between online games and general web services
7. Characteristics of online/social games
- It is very difficult to predict growth
  - Social games grow (or shrink) rapidly
  - e.g., estimating 10 million page views per day before going live...
    - Good case: 100 million page views/day
    - Bad case: only 10K page views/day
- It is necessary to prepare enough servers to handle the traffic
  - It takes time to purchase, ship, and set up 100 physical servers
  - Too many unused servers might kill your company
- For smaller companies, using cloud services (AWS, Rackspace, etc.) is much less risky
- For larger companies, servers in stock can probably be used
8. Scaling strategy
- Single Server
- Multiple-Tier servers
- Scaling Reads / Database Redundancy
- Horizontal Partitioning / Sharding
- Distributing across regions
9. Single Server
- Web/App and database are only logically separated
- Issues
  - Single point of failure
  - Service capacity is very limited (soon becomes CPU- or disk-I/O-bound)
[Diagram: Web/App and Database running on a single server]
10. H/W failure often happens
- At DeNA, we run thousands of machines in production
  - N,000 web servers
  - 1,000 database (mainly MySQL) servers
  - 100 caching (memcached, etc.) servers
- Most server failures are caused by H/W problems
  - Disk I/O errors, memory corruption, etc.
- Mature middleware is stable enough
  - MySQL has not crashed due to its own bugs for years
- Do not be too afraid of (but do prepare for) upgrading middleware
  - Older software (CentOS 4, MySQL 5.0, etc.) has lots of bugs that will never be fixed
11. Multiple-Tier Servers
- Web and database tiers are physically separated
  - Multiple web servers
  - Single database server
- Web servers scale
- Issues
  - The database server is still a single point of failure
  - Service capacity is limited by the database server's performance
  - A single database server can't handle TBs of active game data
[Diagram: multiple Web/App servers in front of a single Database server]
12. Scaling Reads / Database Redundancy
- This is probably the most common deployment in the world
- Data is replicated from master to slaves
- Read traffic can be scaled with cache servers and slave servers
- Issues
  - The master database is still a single point of failure
  - Writes are not scalable
[Diagram: Web/App servers read from Cache and slave Databases (R), write to the master Database (W); replication flows from master to slaves]
13. Replication for read scaling / redundancy
- Asynchronous replication is the majority
  - In case of a master crash, you might lose some of the latest data
  - Most (open source) database software does not support synchronous replication
  - Sync replication does not perform well between remote datacenters
- Be careful about replication delay
  - Replication is single-threaded in most databases
  - Game users are very strict about even a few seconds of delay
    - How would you feel if you couldn't find virtual items that you bought just now?
    - Especially during limited-time gaming events
  - We use SSDs on slave servers so that replication slaves can keep up
14. Resolving server addresses (our case)
- Getting mapping info from the global catalog and caching it locally (see the sketch after this slide)
[Diagram: applications call connect(ff_m) / connect(ff_s); MyDNS (the global catalog database) returns mappings such as ff_m -> db1001; ff_s -> db1011, db1012, db1013 (weighted); gundam_m -> db2103; gundam_s -> db1713, db1714, db1715, db1716]
- Other approaches
  - Using distribution-aware databases such as MongoDB
    - Good if the database software itself is stable
  - Using a load balancer to distribute access to slaves
    - Increases the number of servers and the response time
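Below is a minimal sketch of the lookup-and-cache pattern described above, assuming a resolver that maps logical names like ff_m (master) and ff_s (slaves) to hosts. The TTL value and the plain-DNS fallback are illustrative assumptions, not DeNA's actual implementation.

```python
import socket
import time

_CACHE = {}          # name -> (resolved_host, expire_at)
_TTL_SECONDS = 60.0  # re-resolve periodically to pick up catalog changes

def resolve(name: str) -> str:
    """Resolve a logical DB name to a host, caching the result locally."""
    now = time.time()
    cached = _CACHE.get(name)
    if cached and cached[1] > now:
        return cached[0]
    # In production this would query the global catalog (MyDNS);
    # an ordinary DNS lookup stands in for it here.
    host = socket.gethostbyname(name)
    _CACHE[name] = (host, now + _TTL_SECONDS)
    return host
```

Caching locally keeps the catalog database off the critical path of every request, while the TTL bounds how stale a mapping can get after a failover or re-mapping.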
15. Scaling Writes
- Database write operations: INSERT, UPDATE, DELETE
- Most write operations do random disk reads as well as writes
  - INSERT: reading target index blocks (2,000-4,000/s)
  - UPDATE: reading matched records (200-2,000/s)
  - DELETE: reading matched records and indexes (100-2,000/s)
- Increasing RAM helps reduce the number of random reads
- Performance highly depends on storage devices
  - Using SSDs (great at random reads) is a good practice
- A single database server is not enough to handle massive write requests
16. Horizontal Partitioning / Sharding
- Data is divided into shards
- Both reads and writes are scalable
- How should application programs choose the proper shard?
  - Hashing, mapping table
- Each master database still might be a single point of failure (depending on the database product)
  - A datacenter crash results in service failure
- Re-sharding (moving data) is painful
[Diagram: Web/App servers go through Cache to Shard1 (1 < uid < 1M) and Shard2 (1M < uid < 2M), each shard with write masters Database(W1,2,..) and read slaves Database(R1,2,..)]
17. Approaches for Sharding
- Developing an application framework
  - Creating a catalog (mapping) table (e.g., user_id 1..1M -> shard1)
  - Looking up the catalog table (and caching it), then accessing the proper shard (see the sketch after this list)
  - So far many people have taken this approach
  - Once you create a framework, you can use it for many other games
  - Re-sharding (moving data between shards) is difficult
- Using sharding-aware database products
  - The database client library automatically selects the proper shard
  - Many recent distributed NoSQL databases support this
    - MongoDB, Membase, HBase (and many more) are popular
  - Newer databases tend to have lots of bugs and stability problems
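Here is a minimal sketch of the catalog-lookup approach, assuming a range-style catalog like the deck's example (user_id 1..1M -> shard1). The shard names and range boundaries are illustrative; in production the catalog would be loaded from a mapping table and cached.

```python
import bisect

# Catalog entries: (upper_bound_exclusive, shard_name), sorted by bound.
CATALOG = [
    (1_000_000, "shard1"),
    (2_000_000, "shard2"),
    (3_000_000, "shard3"),
]
_BOUNDS = [bound for bound, _ in CATALOG]

def shard_for_user(user_id: int) -> str:
    """Pick the shard whose user_id range contains this user."""
    idx = bisect.bisect_right(_BOUNDS, user_id)
    if idx >= len(CATALOG):
        raise ValueError(f"user_id {user_id} is beyond the last shard range")
    return CATALOG[idx][1]

print(shard_for_user(42))         # shard1
print(shard_for_user(1_500_000))  # shard2
```

A range catalog like this is what makes the "map all new users to new shards" trick on a later slide possible: appending a new (bound, shard) entry routes new registrations without touching existing rows.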
18. Approaches to avoid Re-Sharding
[Diagram: shards shown as Instance 1..N, contrasting "running N shards within a single server" and "running a dedicated shard per server", then "moving to a higher-spec server"]
- Don't move data into a different shard; move the entire dataset to a higher-spec environment (larger RAM, SSDs, etc.)
  - This is easy to do using replication slaves
- In practice we've been able to avoid re-sharding
19. Handling rapidly growing users
- On one of our most popular online games...
  - At first we started with 2 database shards, 3 slaves per shard
  - The number of registered users in the first two days after launch was much higher than expected
  - We added two more shards dynamically, then mapped all new users to the new shards
- We have heavily used range partitioning for removing older data (see the sketch after this list)
  - Older data can be dropped very quickly (within milliseconds)
  - This has helped a lot in reducing total data size (less than 250GB database size per shard)
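The sketch below illustrates the range-partitioning technique: DROP PARTITION removes a whole partition as a metadata operation, without scanning rows, which is why it is so much faster than DELETE. The table, column, and partition names are illustrative assumptions.

```python
# DDL for a log-style table partitioned by month (MySQL syntax).
CREATE_SQL = """
CREATE TABLE user_action_log (
    user_id BIGINT NOT NULL,
    action  VARCHAR(64) NOT NULL,
    created DATE NOT NULL
)
PARTITION BY RANGE (TO_DAYS(created)) (
    PARTITION p201106 VALUES LESS THAN (TO_DAYS('2011-07-01')),
    PARTITION p201107 VALUES LESS THAN (TO_DAYS('2011-08-01')),
    PARTITION p201108 VALUES LESS THAN (TO_DAYS('2011-09-01'))
);
"""

def drop_old_partition_sql(table: str, partition: str) -> str:
    """Drop an entire old partition instead of DELETE-ing millions of rows."""
    return f"ALTER TABLE {table} DROP PARTITION {partition};"

print(drop_old_partition_sql("user_action_log", "p201106"))
```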
20. Mitigating replication delay
- For older games using HDDs on slaves...
  - Even 1,000 updates per second causes replication delay
  - Replacing HDDs with SATA SSDs on slaves is a good practice
- For games with many updates (4,000 updates per second), some slaves got behind the master even with SSDs
  - Increasing RAM: from 32GB to 64GB per shard
  - Reducing the total write volume: innodb_doublewrite=0 helps in InnoDB (MySQL)
  - Avoiding sudden short stalls: using the xfs filesystem instead of ext3
- We migrated to higher-spec servers without downtime, and we haven't needed to re-shard so far
21. Cloud vs. Physical servers
- DeNA uses physical servers (N thousands)
- Some of our subsidiaries use AWS / App Engine
22. Advantages of cloud servers
- Initial costs are very small
  - You don't need to buy 10 servers (which may cost more than $50,000) for a new game that might be closed within a few months
- No lead time to add servers
  - It is not uncommon for new physical H/W components to take 1 month to arrive
- No penalty for removing servers
  - If unused physical servers can't be used anywhere else, they just waste money
23. Disadvantages of cloud servers
- It takes longer to analyze problems
  - Network configurations are a black box
  - Storage devices are a black box
  - Problems caused by disks and networks often happen, but it takes longer to find the root causes
  - This really matters for games that generate $10M/month
- Limited choices for performance optimizations. What I want:
  - Using direct-attached PCIe SSDs for handling massive reads/writes
  - Customizing Linux to reduce TCP connection timeouts (3 seconds to 0.5-1 second)
  - Updating device drivers
- Per-server performance tends to be (much) lower than on physical servers
- More expensive than physical servers when your system becomes large
24. Distributing across DCs/regions
- Availability
  - A single datacenter crash should not result in service failure
- Latency / response time
  - The round-trip time (RTT) between Tokyo and San Francisco exceeds 100ms
  - 10 round trips within 1 HTTP request will take more than 1 second
  - If you plan to release games in Japan/APAC, I don't recommend serving all content from the US
  - Use a CDN (to serve static content from the APAC region) or run servers in APAC
25. Distributing across multiple regions
- Network latency (100ms RTT) should be considered
  - Web servers should access local databases as much as possible
- Conflict detection/resolution is a very tough issue
  - What if user_id=1 was updated in both regions at the same time?
  - A "my region" ID should be considered: when updating users in Tokyo, always access the masters in Tokyo (see the sketch after this slide)
- Bulk operations should be considered
  - To reduce round trips between Web and DB
  - Stored procedures, proxy servers, etc.
[Diagram: Tokyo DC and US East DC, each with its own Web/Cache tier, write masters DB(W1,2,..), and read slaves DB(R1,2,..)]
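A minimal sketch of the "home region" rule described above: every user is owned by exactly one region, updates always go to that region's masters (even from the other region), and reads stay local. Region names, hostnames, and the ownership rule are illustrative assumptions.

```python
LOCAL_REGION = "tokyo"

MASTERS = {
    "tokyo":   "db-master.tokyo.example.com",
    "us_east": "db-master.us-east.example.com",
}

def home_region(user_id: int) -> str:
    """Decide which region owns a user (here: a simple static rule)."""
    return "tokyo" if user_id % 2 == 0 else "us_east"

def master_for_update(user_id: int) -> str:
    """Writes go to the user's home-region master, even if it is remote."""
    return MASTERS[home_region(user_id)]

def slave_for_read() -> str:
    """Reads stay in the local region for low latency."""
    return f"db-slave.{LOCAL_REGION}.example.com"
```

Because both regions agree on who owns each user, the "updated in both regions at the same time" conflict cannot occur; the price is a cross-ocean round trip whenever a remote user must be updated.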
26. Monitoring and Performance Practices
- Monitoring
- Fighting against stalls
- Improving per server performance
- Consolidating servers
27. Monitoring Servers
- Server liveness
  - Reachable via ping, http, mysql, etc.
- H/W and OS errors
  - Memory usage (to avoid running out of memory)
  - Disk failure, disk block failure
  - RAID controller failure
  - Network errors (sometimes caused by bad switches)
  - Clock time
28. Monitoring Server Performance
- Resource utilization per second
  - Load average (not perfect, but better than nothing)
  - Concurrency (the number of running threads; useful to detect stalls)
  - Disk busy rate (svctm, iowait)
  - CPU utilization (especially on web servers)
  - Network traffic
  - Free memory (swapping is bad / be careful about memory leaks)
- Web servers
  - HTTP response time
  - The number of processes/threads that can accept new HTTP connections
- Database servers (see the polling sketch after this list)
  - Queries per second
  - Bad queries
  - Long-running transactions (blocking other clients)
  - Replication delay
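A minimal polling sketch for two of the database metrics above, queries per second and replication delay, assuming the mysql-connector-python package and a monitoring account; host and credentials are placeholders.

```python
import time
import mysql.connector

def poll(host: str, user: str, password: str, interval: float = 1.0):
    conn = mysql.connector.connect(host=host, user=user, password=password)
    cur = conn.cursor()
    prev = None
    while True:
        # 'Questions' is a global counter of executed statements;
        # its per-interval delta approximates queries per second.
        cur.execute("SHOW GLOBAL STATUS LIKE 'Questions'")
        questions = int(cur.fetchall()[0][1])
        if prev is not None:
            print(f"qps={(questions - prev) / interval:.0f}")
        prev = questions

        # Seconds_Behind_Master is NULL when replication is broken,
        # so None here should alert rather than be treated as zero.
        cur.execute("SHOW SLAVE STATUS")
        row = cur.fetchone()
        if row is not None:
            cols = [d[0] for d in cur.description]
            print(f"delay={row[cols.index('Seconds_Behind_Master')]}")
        time.sleep(interval)
```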
29. Best practices for minimizing replication delay
- Application side
  - Continuously monitor bad queries, especially when deploying new modules
  - Do not run massive updates in a single DML statement (LOAD DATA, ALTER TABLE)
  - Reduce the number of DML statements (INSERT + UPDATE + DELETE -> a single UPDATE; see the sketch after this list)
- Infrastructure side
  - Using SSDs on slave servers
  - Using larger RAM
  - Using the xfs filesystem
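A sketch of the statement-count reduction mentioned above: since replication is single-threaded, the slave replays one event faster than many. The table and column names are illustrative.

```python
# Instead of issuing one UPDATE per user...
naive = [
    "UPDATE user_status SET lp = lp - 10 WHERE user_id = 1;",
    "UPDATE user_status SET lp = lp - 10 WHERE user_id = 2;",
    "UPDATE user_status SET lp = lp - 10 WHERE user_id = 3;",
]

# ...send a single statement covering all affected rows, so the slave
# replays one replication event instead of three.
def batched_update(user_ids: list[int]) -> str:
    id_list = ",".join(str(u) for u in user_ids)
    return f"UPDATE user_status SET lp = lp - 10 WHERE user_id IN ({id_list});"

print(batched_update([1, 2, 3]))
```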
30. Performance analysis
- Identifying why a server is slow or resource-consuming
- Web server
  - Profiling functions that take a long elapsed time (e.g., using NYTProf for Perl-based applications)
- Database server
  - Bad queries (full table scans, etc.) / indexing
  - Only a few query patterns consume more than 90% of the time
31. Thundering Herd / C10K
- Most middleware can't handle 100K concurrent requests per server
- Work hard for stable request processing
  - A sudden cache miss (CDN bug, memcached bug, etc.) will result in burst requests toward backend servers
- Case: a memcached bug (recently fixed)
  - memcached crashed when thousands of persistent connections were established
  - All requests to memcached went to the database servers (cache misses)
  - The backend database servers couldn't handle 10x more queries -> service down
32. Stalls
- Stalls cause bad response times and unstable resource usage
  - In bad cases, web servers or database servers will go down
- Serious if it happens on database servers: all web servers are affected
- Identifying stalls is important, but difficult
  - Very often the root causes are inside the middleware's source code
  - I often take stack traces and dig into the MySQL source code
33. Avoiding sudden performance drops
[Chart: response time over time; product A is faster on average but has sudden spikes, product B is flat and stable]
- Some unstable database servers suddenly drop performance in certain situations
- Low performance is a problem because we can't meet customers' demands during that time
- Though product A is better on average, product B is much more stable
- Don't trust benchmarks. Vendors' benchmarks show their best scores but don't show the bad numbers
34. Monitoring Stalls
- All clients are blocked for a short period of time (from less than one second to a few seconds)
- The number of internal running threads grows significantly (5-20 on average, but suddenly grows to 1000 and TOO MANY CONNECTIONS errors are thrown; see the sketch after this list)
- Increased response time
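A minimal sketch of the detection rule implied above: running threads normally sit around 5-20, so a jump of an order of magnitude signals a stall. The sampling loop is omitted (see the polling sketch on the monitoring slide); the multiplier is an illustrative assumption.

```python
def is_stall(recent_samples: list[int], current: int, factor: int = 10) -> bool:
    """Flag a stall when the thread count dwarfs the recent baseline."""
    baseline = sum(recent_samples) / len(recent_samples)
    return current > baseline * factor

recent = [8, 12, 15, 9, 11]    # normal Threads_running samples
print(is_stall(recent, 14))    # False: within the usual range
print(is_stall(recent, 1000))  # True: sudden growth, likely a stall
```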
35. Avoiding Stalls
- Avoid holding locks for a long time
  - Other clients that need the locks are blocked
  - Establishing a new TCP connection sometimes takes 3 seconds due to SYN retries
- Prohibit cheating by malicious users
  - e.g., massive requests from the same user id
- Be careful when choosing database products
  - Most newly announced database products don't care about stalls
36. Improving per-server performance
- To handle 1 million queries per second...
  - At 1,000 queries/sec per server: 1,000 servers in total
  - At 10,000 queries/sec per server: 100 servers in total
- The additional 900 servers will cost $10M initially, plus $1M every year
- If you can increase per-server throughput, you can reduce the total number of servers and the TCO
37. Recent H/W trends
- 64-bit Linux with large RAM
  - 60GB-128GB RAM is quite common
- SSD
  - Database performance can be N times faster
  - Using SSDs on MySQL slaves is a good practice to eliminate replication delay
- Network, CPU
  - Don't use 100Mbps networking
  - CPU speed matters: there is still a lot of single-threaded code inside middleware
38. Consolidating Web/DB servers
- How do you handle unpopular games?
  - Running a small game on high-end servers is not cost-effective
- Recent H/W is fast
  - Running N DB instances on a single server is not uncommon
  - DeNA consolidates 2-10 games on a single database server
39. Performance is not everything
- High availability, disaster recovery (backups), security, etc.
- Be careful about malicious users
  - Bots / repeated access via tools
  - Duplicating items / real-money trading
  - Illegal logins
  - etc.
40. Administration practices
- Automating setups
- Operations without downtime
- Automating failover
41. Automation
- Automation is important to reduce operational costs
  - The number of dev/ops engineers cannot grow as quickly as the number of servers
- At DeNA
  - Hundreds of servers are managed per devops engineer
  - Initial server setup (installing the OS, setting up the web server) can be done within 30 minutes
  - Servers can be added to / removed from services in seconds to minutes
- Do not automate (in production) what you do not understand
  - How do you resume a failover when the automated failover stops with errors in the middle?
  - Without understanding it in depth, it is very hard to recover
42. Automating installation/setup
- Automate software installation / filesystem partitioning / etc.
  - Kickstart + a local yum repository
- Cloning the OS
  - Copying the entire OS image (including software packages and conf files) from a read-only base server
  - We use this approach
- Automate initial configuration
  - Not all servers are identical: hostname, IP address, disk size, some parameters (server-id in MySQL)
  - Use a configuration manager like Chef/Puppet (we use a similar internal tool)
- Automate continuous configuration checking (see the sketch after this list)
  - Sometimes people change configurations tentatively and forget to set them back
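A minimal sketch of continuous configuration checking for the last point: compare a server's live MySQL variables against expected values and report drift. It assumes mysql-connector-python; the variable names and expected values are illustrative, not DeNA's settings.

```python
import mysql.connector

EXPECTED = {
    "innodb_flush_log_at_trx_commit": "2",
    "max_connections": "2000",
    "sync_binlog": "0",
}

def check_config(host: str, user: str, password: str) -> list[str]:
    conn = mysql.connector.connect(host=host, user=user, password=password)
    cur = conn.cursor()
    drifted = []
    for name, expected in EXPECTED.items():
        # Names come from our own EXPECTED dict, so inlining them is safe.
        cur.execute(f"SHOW GLOBAL VARIABLES LIKE '{name}'")
        rows = cur.fetchall()
        actual = rows[0][1] if rows else None
        if actual != expected:
            # Someone changed it "tentatively" and forgot to set it back.
            drifted.append(f"{name}: expected {expected}, got {actual}")
    return drifted
```

Run this periodically (from cron or the monitoring system) against every server, so tentative changes surface within hours instead of during the next outage.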
43. Moving Servers: games are expanding rapidly
- For expanding games
  - Adding more web servers
  - Adding more replication slaves or cache servers (for read scaling)
  - Adding more shards (for write scaling)
    - DeNA has a Perl-based sharding framework on the application side, so we can add new shards without stopping services
  - Scaling up the master's H/W, or upgrading the MySQL version
    - More RAM, HDD -> SSD/PCI-E SSD, faster CPUs, etc.
44. Moving Servers: games are shrinking gradually
- For shrinking games
  - Decreasing web servers
  - Decreasing replication slaves
  - Migrating to lower-spec machines
  - Consolidating a few servers onto a single machine
45. Moving database servers
- For many databases (MySQL, etc.), there is only one master server (replication master) per shard
- Moving the master is not trivial
  - Many people allocate scheduled maintenance downtime for it
- We want to move master servers more frequently
  - Scaling up or scaling down (online games have many more such opportunities than non-game web services)
  - Upgrading the MySQL version / updating non-dynamic parameters
  - Working around power outages / moving games to a remote datacenter
46. Desire for easier operations
- In many cases people do not want to allocate a maintenance window
  - Announcing to users, coordinating with customer support, etc.
  - Longer downtime reduces revenue and hurts the brand
  - Operations staff will be exhausted by too much midnight work
- Reducing maintenance time is important for managing hundreds or thousands of servers
47. Our case
- Previously
  - Allocating a 30-minute maintenance window after 2:00am
  - Announcing it on the top page of the game
  - Coordinating with the customer support team
  - We couldn't do this often because it's painful
- Currently
  - Migrating to a new server within 0.5-3 seconds, gracefully, with an online MySQL master switching tool (MHA for MySQL, OSS)
  - No maintenance window to allocate; switching can be done in the daytime
- Note: be very careful about error handling when using databases that support automated master switching
  - Killing database sessions might result in data inconsistency
  - If the tool waits a long time for sessions to disconnect, downtime will be longer
48. Automating Failover
- The master database server is usually a single point of failure, and failing it over is difficult
- Bleeding-edge databases that support automated failover often don't work as expected
  - Split brain, false positives, etc.
- In our case: using MHA for automated MySQL failover
  - It takes 9-12 seconds to detect the failure and 0.5-N seconds to complete the failover
  - Failovers are mostly caused by H/W failure
49. Manual, Simple Failover
- In extreme crash scenarios, automated failover is dangerous
  - e.g., datacenter failure
- Identifying the root cause should be the first priority
- But failover should then be possible with a simple enough command
  - Just one command
50. Summary
- Understanding scaling solutions is important for providing large/growing social games
- Stable performance is important to avoid sudden service outages. Do not trust sales talk; much middleware still has many stall problems
- Online games tend to grow or shrink rapidly. Solutions for setting up and migrating servers (including master databases) are important
51. Thank you!
- Yoshinori.Matsunobu_at_gmail.com
- Yoshinori.Matsunobu at Facebook
- @matsunobu at Twitter