At Yelp, we use MySQL replication to provide the performance we need to serve our 120 million monthly unique visitors. In this post Josh Snyder walks us through some of the internals of MySQL’s replication system, and discusses an issue that we saw in production.
MySQL replication and the slave I/O thread’s purpose
On a MySQL slave, the task of replication is split between two threads: the I/O and SQL threads. The I/O thread’s job is to download the log of events executed on the master and write them to disk in the relay log. The SQL thread then reads and executes the downloaded events on the slave. For the purposes of this post we’ll discuss the I/O thread.
The slave I/O thread connects to a master server using the MySQL protocol, just like any other client. It executes certain housekeeping queries on the master, and immediately afterward asks for a dump of the binary log starting at a specific log file and position. Once the slave I/O thread requests the binlog dump, the connection will be devoted entirely to the binlog stream until the TCP stream is closed. This means that the results of the housekeeping queries mentioned above will need to be cached for use during the entire lifetime of the TCP stream. Practically speaking this is not a problem, because it’s very cheap to break and reestablish the connection.
What happens when the connection breaks?
Nothing right away. The MySQL slave can’t recognize the difference between silence due to a lack of events in the binary log and silence due to connection issues.
The slave I/O thread can take a guess, though, which is where
MASTER_HEARTBEAT_PERIOD come in.
The manual describes
slave_net_timeout as, “the number of seconds to wait for more data from the master before the slave considers the connection broken, aborts the read, and tries to reconnect.” This is correct, but the actual behavior is somewhat more subtle. As implemented, slave_net_timeout is the number of seconds that any read on the slave I/O socket will block before timing out. It’s passed directly to
SO_RCVTIMEO. Therefore, if the I/O thread ever spends more than slave_net_timeout seconds reading from its socket, the read will time out and an error will be returned. MySQL handles this error internally as
ER_NET_READ_INTERRUPTED, and its ultimate reaction is to terminate the connection and initiate a new one.
A small value of
slave_net_timeout could then pose a problem. If slave_net_timeout is two seconds, MySQL will terminate and reconnect the replication stream whenever the master spends more than two seconds without publishing an event to its binary log. To remedy this, the slave can request that the master send periodic heartbeats using the
MASTER_HEARTBEAT_PERIOD replication setting (which defaults to half of
slave_net_timeout). If the master heartbeat period elapses and the master has no events to send, it will insert a
Heartbeat_log_event into the binary log stream. These events go over the wire, but are never written to disk; they don’t even get recorded in the slave’s relay log.
Heartbeat events serve to “fill the silence” created by an idle master. Busy masters (i.e. those that always have an event to send within one heartbeat period) will never send heartbeat events. This can be verified on a slave by checking that the
Slave_received_heartbeats status variable is never incremented when the master is busily accepting writes.
What does a slave reconnection look like?
I set up a master-slave MySQL pair within Docker containers, with an active binary log stream coming down from the master. I set
slave_net_timeout to 30 seconds and then broke replication by removing the replication TCP flow from the tables of my stateful firewall.
Using iptables/conntrack to break a replication connection
First, set up a firewall rule that uses conntrack to disallow all invalid TCP sessions:
root@mysql_master # iptables -A INPUT -p tcp --dport 3306 -m state --state INVALID -j LOG
root@mysql_master # iptables -A INPUT -p tcp --dport 3306 -m state --state INVALID -j DROP
Be sure to disable conntrack’s TCP loose mode, which will otherwise allow TCP ACKs to establish valid connections:
root@mysql_master # echo 0 > /proc/sys/net/netfilter/nf_conntrack_tcp_loose
Then simply delete the conntrack entry for MySQL replication, and our stateful firewall will block any MySQL tcp sessions that were in flight:
root@mysql_master # conntrack -D -p tcp --dport 3306
I set my
slave_net_timeout to 30 seconds and then broke replication. As expected, the slave I/O thread stopped advancing.
Seconds_Behind_Master displays zero, because the slave SQL thread is not “actively processing updates”. Nothing of consequence happens until
Once that happens, MySQL reconnects and
Seconds_Behind_Master spikes to...wait what?
How did that happen? We spent only the thirty second
slave_net_timeout disconnected from the master host, but the slave SQL thread thinks it’s a whopping 1634 seconds behind. This is due to MySQL bug #66921, discussed by Domas Mituzas here. Succinctly, when the connection is reestablished, the slave SQL thread re-executes the format description event from the master’s binary log, noting its timestamp as the latest executed event. Seconds_Behind_Master becomes equal to the age of the format description event in the master’s binary log file, which in this case was 1634 seconds.
MySQL doesn’t bother to print to the error log that the I/O thread recovered from a connection failure. It considers these kinds of issues routine, so it gracefully recovers without comment. This presents an issue for a database administrator who would like to know how frequently reconnects are occurring. So far the only way I’ve found to gather this information is to interrogate the master about the age of its ‘Binlog Dump’ threads. 
select id, host, time_ms, date_sub(now(), interval time second) as started
where command = 'Binlog Dump'
order by time_ms;
In some cases, the I/O thread’s attempt to reconnect will fail. This will cause a cheery error message to appear in the error log:
[ERROR] Slave I/O: error reconnecting to master 'email@example.com:3306' - retry-time: 60 retries: 86400, Error_code: 2003
Error code 2003 is the familiar “
Cannot connect to MySQL server” error. Replication isn’t currently happening, but MySQL will gracefully handle the situation by continuing its attempts to reconnect to the master . The error message means that the slave I/O thread will reconnect at 60 second (
MASTER_CONNECT_RETRY) intervals, and will persist in doing so 86400 (
An annoying outage
On one occasion, packet loss due to a network provider issue caused some of Yelp’s MySQL 5.5 series databases to print different and scarier messages:
[ERROR] Slave I/O: The slave I/O thread stops because a fatal error is encountered when it try to get the value of SERVER_ID variable from master. Error: , Error_code: 1159
[ERROR] The slave I/O thread stops because SET @master_heartbeat_period on master failed. Error: Error_code: 1593
perror informs us that these errors are network timeout and fatal slave errors, respectively.
$ perror 1159 1593
MySQL error code 1159 (ER_NET_READ_INTERRUPTED): Got timeout reading communication packets
MySQL error code 1593 (ER_SLAVE_FATAL_ERROR): Fatal error: %s
In both cases, MySQL did not gracefully handle the situation. The slave I/O thread did not attempt to reconnect: it was stone dead. A human had to intervene to restart replication. These problems arose because the I/O thread wasn’t correctly handling errors when doing its start-of-connection housekeeping queries. Our investigation revealed that each error message implicates a different code site.
The first error,
ER_NET_READ_INTERRUPTED, is already familiar to us. It’s the error that occurs when
slave_net_timeout elapses without receiving any data from the master. In this case, the replication client executed
SHOW VARIABLES LIKE 'SERVER_ID', and did not receive a response in time. As explained above, MySQL drops the connection and reconnects when it encounters this error during a binlog dump. At this specific code site, MySQL handles errors a little differently. Instead of catching and handling
ERR_NET_READ_INTERRUPTED the code checks for any network error. The code that matches the error codes looks like:
bool is_network_error(uint errorno)
if (errorno == CR_CONNECTION_ERROR ||
errorno == CR_CONN_HOST_ERROR ||
errorno == CR_SERVER_GONE_ERROR ||
errorno == CR_SERVER_LOST ||
errorno == ER_CON_COUNT_ERROR ||
errorno == ER_SERVER_SHUTDOWN)
As you can see,
ER_NET_READ_INTERRUPTED is missing from this list. Because of this, MySQL decided to terminate the slave I/O thread instead of gracefully reconnecting.
The second error was similar. Setting the heartbeat period on the master is one of the housekeeping queries done by the slave before requesting a binlog dump. The query looks like
SET @master_heartbeat_period=%s. Due to the way this query’s result was handled, any error at all would cause death of the slave I/O thread.
We worked with Percona to get fixes for these two issues included in the Percona Server 5.5.36 release. The patch, written by Vlad Lesin, also includes numerous test cases to exercise this behavior.
Rubber Meets Road
I wanted to see Vlad’s change in practice, so I constructed a testbed to mimic packet loss. I chose some very aggressive settings in an attempt to get a high frequency of connection breakage.
On the master, I left my existing stateful firewall rules in place. I then emulated 60% packet loss:
tc qdisc add dev eth0 root netem drop 60%
On the slave, I set
SET GLOBAL slave_net_timeout=2;
CHANGE MASTER TO master_connect_retry=1;
I continuously wrote into the master at a rate of 5 queries/second, with each event being 217 bytes. This translates to 1085 bytes/second of binlog traffic.
Using Percona Server 5.5.36 as a baseline, I used tcpdump to determine that these network conditions caused MySQL to reconnect 74 times over a five minute period. At no point did the I/O thread fail entirely. Percona Server 5.5.35 is a different story. Over a five minute period under the same conditions, MySQL reconnected 69 times. On five occasions the I/O thread failed completely:
3x [ERROR] Slave I/O: The slave I/O thread stops because SET @master_heartbeat_period on master failed. Error: , Error_code: 1593
2x [ERROR] Slave I/O: The slave I/O thread stops because a fatal error is encountered when it try to get the value of SERVER_ID variable from master. Error: , Error_code: 1159
Based on these results, I’d say the fix is effective. Moreover, it seems unlikely that there are any further bugs of this kind in the reconnection code.
I was very happy to have a checkout of MySQL handy while debugging this. In general, I’ve found the MySQL replication source code to be readable enough (if not simple), especially when you’re searching for a specific error code. Most importantly, though, I wouldn’t be able to live without the ability to spin-up a throwaway MySQL instance for testing.
-  well, I lied. I know of one other way. You can configure replication in a way that causes the slave I/O thread to print a warning when it connects. The warning will be printed to the slave’s error log. I’ve seen this, for instance, when a MySQL 5.6 replica connects to a MySQL 5.5 master.
-  whether your application will gracefully handle the resulting replication delay is a different story entirely