The Gory Details of the Moodle Performance Issue of August 18, 2010

This post provides more technical details related to the Moodle performance degradation we saw on Wednesday, 8/18. Warning, there is a lot of technical jargon below! This is a narrative of the events as they happened.

Early on the morning of August 18, Daniel L’Hommedieu (member of the DELTA Application Support group) was working to get our report server synchronized with our Moodle database servers. After starting that, he noticed that not only was it not catching up as it should have, it was getting farther and farther behind. He then checked the secondary MySQL server for Moodle and noticed that it, too, was showing a significant lag in its replication and that it was similarly getting farther behind. At this point we also started noticing a significant increase in the time to load pages on each of the Moodle web servers. Page load times went higher than the 10 second threshold on each of the web servers and our monitoring system started letting us know by paging us.

After we started investigating, we noticed the sessions2 table was staying locked and causing a bottleneck for everyone. The purpose of this table is to provide shared information between each of the web servers (most web developers would call these sessions), should one of them go down. We had previously researched InnoDB and we thought that the different database engine could fix the locking we were seeing, so we decided to change the storage engine from MyISAM to InnoDB on that one particular table. This would alleviate the table-level locking problem and allow requests to process in parallel due to the design of InnoDB allowing row-level locking (see http://dev.mysql.com/doc/refman/5.0/en/internal-locking.html). This was at the time of the emergency maintenance around noon. We put Moodle in maintenance mode, shut down all the web servers, and waited for the secondary database to catch up with all pending transactions. At this point we issued the change table command and waited for it to complete. Once it was finished, we started the web servers on the web nodes and monitored things for a bit. We took Moodle out of maintenance mode around 12:45. After monitoring for a while we noticed this change did help but affected the situation in a way we hadn’t anticipated.

Upon further monitoring we realized the requests were not waiting on shared locked resources anymore, but were instead processing all at once. This change had the effect of causing a higher load on the MySQL servers and did not really have a positive effect on the page load times. Although the replication between the two MySQL servers improved drastically, this was not the solution we were looking for.

At the same time these changes were being made, my colleague Jeff Webster was doing research on the various Moodle forums to see if there was another cause or a better solution. Jeff found an entry on the Moodle forums that seemed directly related to the problem we were seeing (see http://moodle.org/mod/forum/discuss.php?d=106108 and  http://tracker.moodle.org/browse/MDL-16641). To sum up, every hit to Moodle was causing a complete table scan of the sessions2 table, some 20,000 rows, instead of a much quicker direct look-up (if the select statement were correct and used the correct table indexes). After Jeff and I looked at the suggested code fix, we decided it was worth making this change to see if we could get a performance increase and created a patch file. The code in question really seemed to be a bug, even though the discussion in the forum seemed to suggest it wasn’t a bug at all. We took a web server from the load balancer pool to separate it from the production cluster and made the code change. We immediately noticed a dramatic performance increase in page load times. After a little more testing we decided it was worth pushing the code out to the rest of the web servers since there was minimal risk (the code change was an adjustment to the WHERE clause from a binary match to a simple equals).

Since this update, there have not been any of the performance issues we saw on the 18th. Average page load times have dropped 60 percent, even with increased load. We will keep watching the situation, but it appears that we have solved the major issue we saw on the first day of classes.

We’re very happy to have been able to solve this problem and apologize for any inconvenience the problem caused.

Glossary

– The session information is stored on the web server using the session identifier (session ID) generated as a result of the first (sometimes the first authenticated) request from the end user running a web browser. The “storage” of session IDs and the associated session data (user name, account number, etc.) on the web server is accomplished using a variety of techniques including, but not limited to: local memory, flat files, and databases.

– A technique used in database management systems, where a row is locked for writing to prevent other users from accessing data being while it is being updated.

– A technique used in database management systems, where an entire table is locked while data in it is being updated.