Title: SIP Server Scalability
1 SIP Server Scalability
- IRT Internal Seminar
- Kundan Singh, Henning Schulzrinne and Jonathan Lennox
- May 10, 2005
2 Agenda
- Why do we need scalability?
- Scaling the server
  - SIP express router (Iptel.org)
  - sipd (Columbia University)
  - Threads/processes/events
- Scaling using load sharing
  - DNS-based, identifier-based
  - Two-stage architecture
- Conclusions
27 slides
3 Internet telephony (SIP: Session Initiation Protocol)
[Diagram: alice@yahoo.com calls bob@example.com; proxies at yahoo.com (192.1.2.4) and example.com (129.1.2.3), the latter backed by a user-location DB]
4 Scalability Requirements: Depends on role in the network architecture
[Diagram: IP phones in a cybercafe and ISPs reach the carrier network via gateways (GW) and SIP/MGC media gateways (MG); the carrier network connects to the PSTN (SIP/PSTN) and to PBXs over T1 PRI/BRI trunks serving PSTN phones]
5 Scalability Requirements: Depends on traffic type
- Registration (uniform arrival)
  - Authentication, mobile users
- Call routing (Poisson arrival)
  - Stateful vs stateless proxy, redirect, programmable scripts
- Beyond telephony (unknown distribution)
  - Instant messaging, presence (including sensors), device control
- Stateful calls (Poisson arrival, exponential call duration)
  - Firewall, conference, voicemail
- Transport type
  - UDP/TCP/TLS (cost of security)
6 SIPstone: SIP server performance metrics
- Steady-state rate for successful registrations, call forwarding and unsuccessful call attempts, measured using 15-min test runs
- Measure requests/s under a given delay constraint
- Performance = f(user, DNS, UDP/TCP, g(request), L), where g = request type and arrival pdf (requests/s), L = logging on/off
- Measured for register, outbound proxy, redirect, proxy-480 and proxy-200 scenarios
- Parameters
  - Measurement interval, transaction response time, RPS (registrations/s), CPS (calls/s), transaction failure probability < 5%
  - Delay budget: R1 < 500 ms, R2 < 2000 ms
- Shortcomings
  - Does not consider forking, scripting, Via headers, packet size, different call rates, SSL. Is there a linear combination of results?
  - Whitebox measurements (turnaround time)
  - Extend to SIMPLEstone
[Message-flow diagram: Loader -> Server (with SQL database) -> Handler. REGISTER/200 OK measures R1; INVITE/100 Trying/180 Ringing/200 OK/ACK followed by BYE/200 OK measures R2]
7 SIP server: What happens inside a proxy?
[Diagram: proxy processing steps annotated as (blocking) I/O, critical sections (lock), and critical sections (r/w lock)]
8 Lessons Learnt (sipd): In-memory database
- Call routing involves (>= 1) contact lookups
  - Approx. 10 ms per SQL query
  - Cache (FastSQL): < 1 ms
- Loading the entire database is easy
  - Periodic refresh
- Potentially useful for DNS lookups as well
[Diagram: web config writes to the SQL database; the server keeps a cache with periodic refresh]
[Graph: turnaround time vs RPS; 2002, Narayanan, single-CPU Sun Ultra10]
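The periodic-refresh cache above can be sketched as a fixed-size table with per-entry expiry; a miss or an expired entry falls back to the SQL database. This is an illustrative sketch, not sipd's actual code; all names (`cache_put`, `cache_get`, sizes, TTL) are made up for the example.

```c
#include <assert.h>
#include <string.h>
#include <time.h>

#define CACHE_SLOTS 1024
#define CACHE_TTL   60          /* refresh interval, seconds (assumed) */

struct cache_entry {
    char   user[64];            /* key: user part of the URI */
    char   contact[128];        /* value: registered contact */
    time_t expires;             /* 0 = empty slot */
};

static struct cache_entry cache[CACHE_SLOTS];

static unsigned hash_str(const char *s)
{
    unsigned h = 5381;          /* djb2 string hash */
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

/* Insert or overwrite; linear probing, reusing empty/expired slots.
 * Overwriting an expired slot can break a probe chain, which merely
 * causes a cache miss and a DB fallback -- acceptable for a cache. */
void cache_put(const char *user, const char *contact, time_t now)
{
    unsigned i = hash_str(user) % CACHE_SLOTS;
    for (unsigned n = 0; n < CACHE_SLOTS; n++, i = (i + 1) % CACHE_SLOTS) {
        if (cache[i].expires <= now || !strcmp(cache[i].user, user)) {
            strncpy(cache[i].user, user, sizeof cache[i].user - 1);
            strncpy(cache[i].contact, contact, sizeof cache[i].contact - 1);
            cache[i].expires = now + CACHE_TTL;
            return;
        }
    }
}

/* Returns the contact, or NULL on miss/expiry (caller queries SQL). */
const char *cache_get(const char *user, time_t now)
{
    unsigned i = hash_str(user) % CACHE_SLOTS;
    for (unsigned n = 0; n < CACHE_SLOTS; n++, i = (i + 1) % CACHE_SLOTS) {
        if (cache[i].expires == 0)
            return NULL;        /* empty slot ends the probe chain */
        if (cache[i].expires > now && !strcmp(cache[i].user, user))
            return cache[i].contact;
    }
    return NULL;
}
```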
9 Lessons Learnt (sipd): Thread-per-request does not scale
- One thread per message doesn't scale
  - Too many threads over a short timescale
  - Stateless: 2-4 threads per transaction
  - Stateful: 30 s holding time
- Thread pool + queue
  - Less thread overhead, more useful processing
  - Pre-fork processes for SIP-CGI
- Overload management
  - Graceful failure; prefer dropping requests over responses
  - Not enough if the holding time is high: each request holds (blocks) a thread
[Graph: throughput vs load; the thread pool with overload control degrades gracefully, thread-per-request collapses]
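The thread-pool-plus-queue design above can be sketched with POSIX threads: a fixed pool of workers drains a bounded queue, and when the queue is full the new message is dropped instead of spawning another thread, which is the overload behaviour argued for here. A minimal sketch; pool size, queue size and all names (`pool_schedule` etc.) are illustrative, not sipd's API.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define POOL_THREADS 4
#define QUEUE_MAX    64

typedef void (*task_fn)(void *);

struct task { task_fn fn; void *arg; };

static struct task queue[QUEUE_MAX];
static int q_head, q_tail, q_len;
static int shutting_down;
static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  q_nonempty = PTHREAD_COND_INITIALIZER;

/* Returns 0 on success, -1 if the queue is full (request dropped). */
int pool_schedule(task_fn fn, void *arg)
{
    int rc = -1;
    pthread_mutex_lock(&q_lock);
    if (q_len < QUEUE_MAX) {
        queue[q_tail] = (struct task){ fn, arg };
        q_tail = (q_tail + 1) % QUEUE_MAX;
        q_len++;
        rc = 0;
        pthread_cond_signal(&q_nonempty);
    }
    pthread_mutex_unlock(&q_lock);
    return rc;
}

static void *worker(void *unused)
{
    (void)unused;
    for (;;) {
        pthread_mutex_lock(&q_lock);
        while (q_len == 0 && !shutting_down)
            pthread_cond_wait(&q_nonempty, &q_lock);
        if (q_len == 0) {               /* shutting down, queue drained */
            pthread_mutex_unlock(&q_lock);
            return NULL;
        }
        struct task t = queue[q_head];
        q_head = (q_head + 1) % QUEUE_MAX;
        q_len--;
        pthread_mutex_unlock(&q_lock);
        t.fn(t.arg);                    /* process outside the lock */
    }
}

static pthread_t threads[POOL_THREADS];

void pool_start(void)
{
    for (int i = 0; i < POOL_THREADS; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
}

void pool_stop(void)                    /* drain queue, then join workers */
{
    pthread_mutex_lock(&q_lock);
    shutting_down = 1;
    pthread_cond_broadcast(&q_nonempty);
    pthread_mutex_unlock(&q_lock);
    for (int i = 0; i < POOL_THREADS; i++)
        pthread_join(threads[i], NULL);
}

static void demo_task(void *arg)        /* example task: count completions */
{
    atomic_fetch_add((atomic_int *)arg, 1);
}
```

Processing happens outside the lock, so the critical section stays short; the bounded queue is what turns overload into explicit drops rather than unbounded thread growth.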
10 What is the best architecture?
- Event-based
  - Reactive system
- Process pool
  - Each pool process receives a message and processes it to the end (SER)
- Thread pool
  - Receive and hand over to a pool thread (sipd)
  - Or: each pool thread receives and processes to the end
- Staged event-driven: each stage has a thread pool
11 Stateless proxy: UDP, no DNS, six messages per call
[Flow diagram: recvfrom or accept/recv -> parse -> REGISTER: update DB, build response; other requests: lookup DB -> found: proxy (modify request, DNS) or redirect/reject; responses: match transaction (stateful) or modify response (stateless) -> sendto, send or sendmsg]
12 Stateless proxy: UDP, no DNS, six messages per call

Architecture | 1x Pentium IV 3 GHz, 1 GB, Linux 2.4.20 (CPS) | 4x Pentium 450 MHz, 512 MB, Linux 2.4.20 (CPS) | 1x UltraSPARC-IIi 300 MHz, 64 MB, Solaris (CPS) | 2x UltraSPARC-II 300 MHz, 256 MB, Solaris (CPS)
Event-based | 1650 | 370 | 150 | 190
Thread/msg | 1400 | TBD | 100 | TBD
Thread-pool1 | 1450 | 600 (?) | 110 | 220 (?)
Thread-pool2 | 1600 | 1150 (?) | 152 | TBD
Process-pool | 1700 | 1400 | 160 | 350
13 Stateful proxy: UDP, no DNS, eight messages per call
- Event-based
  - Single thread: socket listener + scheduler/timer
- Thread-per-message
  - pool_schedule outperforms pthread_create per message
- Thread-pool1 (sipd)
- Thread-pool2
  - N event-based threads; each handles a specific subset of requests (hash(Call-ID))
  - Receive, then hand over to the correct thread
  - poll() in multiple threads => bad on multi-CPU
- Process pool
  - Not finished yet
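The thread-pool2 dispatch above can be sketched as follows: hashing the Call-ID pins every message of a call to the same event thread, so transaction state for that call needs no locking. The hash function and names are illustrative, not sipd's actual code.

```c
#include <assert.h>

#define NUM_EVENT_THREADS 4     /* N event-based threads (assumed) */

/* FNV-1a string hash; any stable hash with good spread works here. */
static unsigned fnv1a(const char *s)
{
    unsigned h = 2166136261u;
    while (*s) {
        h ^= (unsigned char)*s++;
        h *= 16777619u;
    }
    return h;
}

/* All messages with the same Call-ID map to the same thread index. */
int thread_for_call(const char *call_id)
{
    return (int)(fnv1a(call_id) % NUM_EVENT_THREADS);
}
```

The receiving thread computes `thread_for_call()` once per message and enqueues the message on that thread's private queue; stability of the mapping is the whole point.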
14 Stateful proxy: UDP, no DNS, eight messages per call

Architecture | 1x Pentium IV 3 GHz, 1 GB, Linux 2.4.20 (CPS) | 4x Pentium 450 MHz, 512 MB, Linux 2.4.20 (CPS) | 1x UltraSPARC-IIi 360 MHz, 256 MB, Solaris 5.9 (CPS) | 2x UltraSPARC-II 300 MHz, 256 MB, Solaris 5.8 (CPS)
Event-based | 1200 | 300 | 160 | 160
Thread/msg | 650 | 175 | 90 | 120
Thread-pool1 | 950 | 340 (p4) | 120 | 120 (p4)
Thread-pool2 | 1100 | 500 (p4) | 155 | 200 (p4)
Process-pool | - | - | - | -
15 Lessons Learnt: What is the best architecture?
- Stateless
  - CPU is the bottleneck; memory use is constant
  - Process pool is the best
  - Event-based is not good for multi-CPU
  - Thread/msg and thread-pool are similar
  - Thread-pool2 is close to process-pool
- Stateful
  - Memory can become the bottleneck
  - Thread-pool2 is good, but does not scale N x CPU; not good if P > number of CPUs
  - Process pool may be better (?)
16 Lessons Learnt (sipd): Avoid blocking function calls
- DNS
  - 10-25 ms (29 queries)
  - Cache: from 110 to 900 CPS
  - Internal vs external; non-blocking
- Logger
  - Lazy logger runs as a separate thread: while (1) { lock; write all; unlock; sleep; }
- Date formatter
  - strftime() was ~10% of REGISTER processing
  - Update the date variable only once per second
- random32()
  - Cache gethostid(): 37 us per call
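The date-formatter trick above can be sketched in a few lines: SIP responses carry an RFC 1123 Date header, but calling strftime() per message is wasteful, so cache the formatted string and rebuild it only when the wall clock ticks to a new second. A minimal sketch, not sipd's actual code; note the static buffer makes it single-threaded as written.

```c
#include <assert.h>
#include <string.h>
#include <time.h>

/* Returns the RFC 1123 date string for `now`, reformatting only when
 * the second changes; repeated calls within a second reuse the cache. */
const char *cached_date(time_t now)
{
    static time_t last;
    static char buf[64];
    if (now != last || buf[0] == '\0') {
        struct tm tm;
        gmtime_r(&now, &tm);
        strftime(buf, sizeof buf, "%a, %d %b %Y %H:%M:%S GMT", &tm);
        last = now;
    }
    return buf;
}
```

In a multi-threaded server the cached string would live behind a lock or in thread-local storage; the saving is the same either way.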
17 Lessons Learnt (sipd): Resource management
- Socket management
  - Problems: OS limit (1024), liveness detection, retransmission
  - One socket per transaction does not scale
  - Global socket if the downstream server is alive; soft state works for UDP
  - Hard for TCP/TLS: apply connection reuse
- Socket buffer size
  - 64 KB to 128 KB; tradeoff: memory per socket vs number of sockets
- Memory management
  - Problems: too many malloc/free calls, leaks
  - Memory pool: transaction-specific memory, "free once", also less memcpy
  - About 30% performance gain
  - Stateful: 650 to 800 CPS; stateless: 900 to 1200 CPS

Stateless processing time (us) | INV | 180 | 200 | ACK | BYE | 200 | REG | 200
W/o mempool | 155 | 67 | 67 | 95 | 139 | 62 | 237 | 70
W/ mempool | 111 | 49 | 48 | 64 | 106 | 41 | 202 | 48
Improvement (%) | 28 | 27 | 28 | 33 | 24 | 34 | 15 | 31
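The transaction-specific "free once" memory pool above can be sketched as a bump-pointer arena: small allocations are carved out of one block, and the whole transaction's memory is released with a single reset. An illustrative sketch under assumed names, not sipd's allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct mempool {
    char  *base;                /* one block backing all allocations */
    size_t size;
    size_t used;
};

int pool_init(struct mempool *p, size_t size)
{
    p->base = malloc(size);
    p->size = size;
    p->used = 0;
    return p->base ? 0 : -1;
}

/* Bump allocator, 8-byte aligned; NULL when the arena is exhausted.
 * No per-object free: allocation is a pointer increment. */
void *pool_alloc(struct mempool *p, size_t n)
{
    size_t aligned = (p->used + 7) & ~(size_t)7;
    if (aligned + n > p->size)
        return NULL;
    p->used = aligned + n;
    return p->base + aligned;
}

/* "Free once": drop the whole transaction's memory in O(1). */
void pool_reset(struct mempool *p)
{
    p->used = 0;
}

void pool_destroy(struct mempool *p)
{
    free(p->base);
    p->base = NULL;
}
```

The ~30% gain reported above comes from replacing many malloc/free pairs per transaction with one `pool_reset()` when the transaction completes.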
18 Lessons Learnt (SER): Optimizations
- Reduce copying and string operations
  - Data lumps, counted strings (5-10%)
  - Reduce URI comparison to the local case
  - User part as a keyword, use r2 parameters
- Parser
  - Lazy parsing (2-6x), incremental parsing
  - 32-bit header-name parser (2-3.5x); use padding to align
  - Fast for the general (canonicalized) case
  - Case comparison: hash table, sixth bit
- Database
  - Cache is divided into domains for locking

(2003, Jan Janak, "SIP proxy server effectiveness", Master's thesis, Czech Technical University)
19 Lessons Learnt (SER): Protocol bottlenecks and other scalability concerns
- Protocol bottlenecks
  - Parsing: order of headers, host names vs IP addresses, line folding, scattered headers (Via, Route)
  - Authentication: reuse credentials in subsequent requests
  - TCP: message length unknown until Content-Length is parsed
- Other scalability concerns
  - Configuration: broken digest clients, wrong passwords, wrong expires values
  - Overuse of features
    - Use stateless instead of stateful operation if possible
    - Record-Route only when needed
    - Avoid an outbound proxy if possible
20 Load Sharing: Distribute load among multiple servers
- Single-server scalability has a maximum capacity limit
- Multiple servers
  - DNS-based
  - Identifier-based
  - Network address translation
  - Same IP address
21 Load Sharing (DNS-based): Redundant proxies and databases
- REGISTER: write to both D1 and D2
- INVITE: read from D1 or D2
- Database write/synchronization traffic becomes the bottleneck
[Diagram: proxies P1, P2, P3 in front of replicated databases D1 and D2]
22 Load Sharing (Identifier-based): Divide the user space
- Proxy and database on the same host
- A first-stage proxy may get overloaded; use many
- Hashing: static vs dynamic
[Diagram: first-stage proxies fan out to P1/D1 (users a-h), P2/D2 (i-q), P3/D3 (r-z)]
23 Load Sharing: Comparison of the two designs
[Diagram: DNS-based design (P1-P3 each writing to every database D1, D2) vs identifier-based design (P1/D1 serving a-h, P2/D2 serving i-q, P3/D3 serving r-z)]
- Compare total time per DB, where D = number of database servers, N = number of writes (REGISTERs), r = reads/writes = (INV + REG)/REG, T = write latency, t = read latency/write latency
24 Scalability (and Reliability): Two-stage architecture for CINEMA
- First stage: stateless servers s1, s2, s3 (with backup ex), selected via SRV records:
    example.com _sip._udp SRV 0 40 s1.example.com
                          SRV 0 40 s2.example.com
                          SRV 0 20 s3.example.com
                          SRV 1 0  ex.backup.com
- Second stage: the user space is split into groups, each a cluster with its own SRV records:
    a.example.com _sip._udp SRV 0 0 a1.example.com
                            SRV 1 0 a2.example.com
    b.example.com _sip._udp SRV 0 0 b1.example.com
                            SRV 1 0 b2.example.com
- A first-stage server rewrites sip:bob@example.com to sip:bob@b.example.com
- Request rate = f(stateless, groups). Bottleneck: CPU, memory or bandwidth?
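The SRV records above drive a standard RFC 2782-style selection: pick among the records with the lowest priority value, weighted by weight, and use higher-priority (backup) records only on failure. A simplified single-pick sketch; the caller supplies the random number, and failover bookkeeping is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct srv {
    int priority;               /* lower value = preferred */
    int weight;                 /* share within the same priority */
    const char *target;
};

/* rnd: caller-supplied random number (e.g. from random32()). */
const char *srv_pick(const struct srv *recs, int n, int rnd)
{
    int best = -1, total = 0;
    for (int i = 0; i < n; i++)          /* find the lowest priority */
        if (best < 0 || recs[i].priority < best)
            best = recs[i].priority;
    for (int i = 0; i < n; i++)          /* total weight at that priority */
        if (recs[i].priority == best)
            total += recs[i].weight;
    int r = total ? rnd % total : 0;
    for (int i = 0; i < n; i++) {        /* weighted pick */
        if (recs[i].priority != best)
            continue;
        if (total == 0 || r < recs[i].weight)
            return recs[i].target;
        r -= recs[i].weight;
    }
    return NULL;
}
```

With the example.com records above, s1 and s2 each receive 40% of first contacts and s3 receives 20%, while ex.backup.com (priority 1) is contacted only if all priority-0 servers fail.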
25 Load Sharing: Results (UDP, stateless, no DNS, no mempool)

S (first stage) | P (second stage) | CPS
3 | 3 | 2800
2 | 3 | 2100
2 | 2 | 1800
1 | 2 | 1050
0 | 1 | 900
26 Lessons Learnt: Load sharing
- Non-uniform distribution
  - Identifier distribution (bad hash function)
  - Call distribution => adjust dynamically
- Stateless proxy
  - S = 1050, P = 900 CPS
  - S3P3 => 10 million BHCA (busy hour call attempts): 2800 CPS x 3600 s/hour ~ 10 M
- Stateful proxy
  - S = 800, P = 650 CPS
- Registration (no auth)
  - S = 2500, P = 2400 RPS
  - S3P3 => 10 million subscribers (1-hour refresh)
- Memory pool and thread-pool2/event-based further increase the capacity (approx 1.8x)
27 Conclusions and future work
- Server scalability
  - Non-blocking calls, processes/events/threads, resource management, optimizations
- Load sharing
  - DNS-based, identifier-based, two-stage
- Current and future work
  - Measure process-pool performance for the stateful proxy
  - Optimize sipd
    - Use thread-pool2/event-based (?)
    - Memory: use counted strings, clean up after 200 (?)
    - CPU: use hash tables
  - Presence, call-stateful and TLS performance (Vishal and Eilon)
28 Backup slides

29 Telephone scalability (PSTN: Public Switched Telephone Network)
[Diagram: telephone switches (SSP) interconnected by the bearer network]
30 SIP server: Comparison with HTTP server
- Signaling-bound (vs data-bound)
  - No file I/O (except scripts, logging)
  - No caching; DB read and write frequencies are comparable
- Transactions
  - Stateful: wait for the response
- Depends on external entities: DNS, SQL database
- Transport: UDP in addition to TCP/TLS
- Goals
  - Carrier-class scaling using commodity hardware
  - Try not to customize/recompile the OS or implement (parts of) the server in the kernel (khttpd, AFPA)
31 Related work: Scalability for (web) servers
- Existing work
  - Connection dispatcher
  - Content/session-based redirection
  - DNS-based load sharing
- HTTP vs SIP
  - UDP + TCP; signaling is not bandwidth-intensive; no caching of responses; read/write ratio is comparable for the DB
- SIP scalability bottlenecks
  - Signaling (chapter 4), real-time media data, gateway
  - 302-redirect to a less loaded server, REFER the session to another location, signal upstream to reduce load
32 Related work: 3GPP (Release 5)'s IP Multimedia core network Subsystem uses SIP
- Proxy-CSCF (call session control function)
  - First contact in the visited network; 911 lookup; dialplan
- Interrogating-CSCF
  - First contact in the operator's network
  - Locates the S-CSCF for REGISTER
- Serving-CSCF
  - User policy and privileges, session control service
  - Registrar
- Connection to PSTN: MGCF and MGW
33 Server-based vs peer-to-peer

Aspect | Server-based | Peer-to-peer
Reliability, failover latency | DNS-based; depends on client retry timeout, DB replication latency, registration refresh interval | DHT self-organization and periodic registration refresh; depends on client timeout, registration refresh interval
Scalability, number of users | Depends on the number of servers in the two stages | Depends on refresh rate, join/leave rate, uptime
Call setup latency | One or two steps | O(log N) steps
Security | TLS, digest authentication, S/MIME | Additionally needs a reputation system, working around spy nodes
Maintenance, configuration | Administrator: DNS, database, middlebox | Automatic; one-time bootstrap node addresses
PSTN interoperability | Gateways, TRIP, ENUM | Interact with server-based infrastructure or co-locate a peer node with the gateway
34 Comparison of sipd and SER
- sipd
  - Thread pool + events (reactive system), memory pool
  - Pentium IV 3 GHz, 1 GB: 1200 CPS, 2400 RPS (no auth)
- SER
  - Process pool, custom memory management
  - Pentium III 850 MHz, 512 MB: > 2000 CPS, 1800 RPS