Friday morning rolled around. I already knew it would be a tough day. I didn’t know it would be quite as bad as it was. Just like Thursday, I didn’t get my second email, so I knew right away something was up. Sure enough, tasks were backed up, and the importer was importing around one record every 6 or 7 seconds. No blocking queries again. And of course, a client sent in a critical ticket, which immediately got escalated to one of the founders, who promptly sent it on to me.
This client’s API call were timing out. A simple call to a small table was taking 17 minutes to run. Other normally fast queries were taking an unusually long time to run. And, as before, SQL Server Profiler wasn’t showing any one specific problem. In this case, it looked like everything was a problem. Queries were not utilizing indexes, and that became highly baffling. I had to address this one client’s issue, though, and found our code was sloppy in that one area. So, with a quick consulting of our development director and CTO, I got permission to rewrite that piece and put out an emergency patch. That worked for that one client for that one query. The system was still slow. More folks got involved from the IT side, and they started looking at network bandwidth. One looked at the database server’s disk I/O and found highly unsual activity along with very long wait states. Aha! Something, finally.
I surmised at this point that one very large table with 4.4 billion records was fragmented. We restored a backup onto a secondary server and ran a check. It was 67% fragmented. Definitely a culprit.
During the restore, I researched more into troubleshooting, specifically what the numbers in SQL Server Profiler truly meant. I’m not a DBA, but I know enough to work on issues like these, though is stresses me out severely. Anyway, I found that while Duration is a good number to look at, it could be faulty because it measures the round-trip from the time a query is invoked until the time it completes, including any waiting in between. Since we now knew we had long waits for disk I/O, I needed to look at something else. The Reads column was that key. We found one query running on our API server over and over doing 2.6 million reads each time. There was the smoking gun. But, what was the query? It was some prepared statement, so we set out to look for it.
We ultimately found the problem in the code. It was a piece of code that had been running for a long time with no troubles. It’s unclear as to why it suddenly caused a problem. We’ll search for that answer in the next week or so. Friday evening, we created another patch and pushed it to production. The number of reads went from 2.6M to anywhere between 10 and 100. That’s a HUGE difference. The database is now nice and happy, and things are performing as they should. We still have the large table with high fragmentation. We need to schedule a maintenance window to reindex it. Our timings show it will take around 4 hours for the reindexing to run. Not too bad.
I learned a few lessons from this ordeal.
- SQL Server Profiler is your friend. Understanding what the numbers mean and how to use them can quickly show up most problems.
- One high CPU utilizing query can have a cascading effect on the system. In this case, the query optimization engine decided that many of the indexes in these smaller tables were useless since queries were taking so long, and defaulted back to a clustered index scan on the primary key instead of using the index. Once the problem was resolved, the optimization engine started using the indexes again.
- In the future, one of my first checks will be disk I/O. We had a lot of disk thrashing and a large disk queue length.
Saturday morning came around, and all was quiet. The database was humming along nicely, and clients’ jobs were running well. Looks like I’ll have a relaxing weekend after all.