J-Unleashed!

On databases, programming and more

Posts Tagged ‘SQL Server’

Autogrow stopped working again

Posted by Joe on April 4, 2009

Back in December, we had a client who ran a defrag while SQL Server was running. This caused the main database to run out of filegroups space in the ‘PRIMARY’ filegroup. Well, it happened again this week. This time, though, a defrag was not the culprit. We’re still trying to determine what causes this from time to time. Here’s the error in the event log:

MODIFY FILE encountered operating system error 1450(Insufficient system resources exist to complete the requested service.) while attempting to expand the physical file

The solution? Stop all SQL services. Make a physical copy of the database’s MDF file. Change the name of the original MDF file to something else. Rename the copied MDF file to the same name as the original MDF file. Restart SQL services.

Original research showed that this was probably a highly fragmented data file that caused this error, but I don’t know if that’s the case here. Making a copy definitely wrote the physical bits to another area of the disk. For this to happen twice in a three month period is highly suspicious. I’m thinking there’s a drive issue, but system checks didn’t show anything. The investigation continues . . .

Posted in SQL Server | Tagged: | Leave a Comment »

Turn off SQL Server before defragmenting your hard drive

Posted by Joe on December 15, 2008

Had a fun time this morning with a client’s machine where they defragmented the hard drive on the server over the weekend. One would expect this to be pretty benign. An interesting side effect occurred that I can’t yet explain. The main database ran out of filegroups space in the ‘PRIMARY’ filegroup. The database was set to autogrow, but it wouldn’t autogrow. A manual grow yielded this message:

 

MODIFY FILE encountered operating system error 1450(Insufficient system resources exist to complete the requested service.) while attempting to expand the physical file

Most references to operating system error 1450 yielded results pertaining to a backup. That wasn’t the case here. After quite a bit of searching for information, I came across a blog by Imran Shaikh that basically said this can result from a highly fragmented data file. That was indeed the case with this data file. We turned off SQL Server, ran a defrag again, and turned SQL Server back on. The data file was once again autogrowing. 

I don’t have a real thorough understanding of what happened. My suspicion is that while the original defrag occurred, it couldn’t defrag the database file because it was in use. That may have boxed in the data file so it couldn’t autogrow any more. It’s a bit strange if that’s the case because I would have expected the file to grow elsewhere on the disk, but it didn’t. So, lesson learned: if you’re going to defragment your server’s hard drive, turn off SQL Server.

Posted in SQL Server | Tagged: | Leave a Comment »

Database Madness – Day 3

Posted by Joe on December 13, 2008

Friday morning rolled around. I already knew it would be a tough day. I didn’t know it would be quite as bad as it was. Just like Thursday, I didn’t get my second email, so I knew right away something was up. Sure enough, tasks were backed up, and the importer was importing around one record every 6 or 7 seconds. No blocking queries again. And of course, a client sent in a critical ticket, which immediately got escalated to one of the founders, who promptly sent it on to me.
This client’s API call were timing out. A simple call to a small table was taking 17 minutes to run. Other normally fast queries were taking an unusually long time to run. And, as before, SQL Server Profiler wasn’t showing any one specific problem. In this case, it looked like everything was a problem. Queries were not utilizing indexes, and that became highly baffling. I had to address this one client’s issue, though, and found our code was sloppy in that one area. So, with a quick consulting of our development director and CTO, I got permission to rewrite that piece and put out an emergency patch. That worked for that one client for that one query. The system was still slow. More folks got involved from the IT side, and they started looking at network bandwidth. One looked at the database server’s disk I/O and found highly unsual activity along with very long wait states. Aha! Something, finally.
I surmised at this point that one very large table with 4.4 billion records was fragmented. We restored a backup onto a secondary server and ran a check. It was 67% fragmented. Definitely a culprit.
During the restore, I researched more into troubleshooting, specifically what the numbers in SQL Server Profiler truly meant. I’m not a DBA, but I know enough to work on issues like these, though is stresses me out severely. Anyway, I found that while Duration is a good number to look at, it could be faulty because it measures the round-trip from the time a query is invoked until the time it completes, including any waiting in between. Since we now knew we had long waits for disk I/O, I needed to look at something else. The Reads column was that key. We found one query running on our API server over and over doing 2.6 million reads each time. There was the smoking gun. But, what was the query? It was some prepared statement, so we set out to look for it.
We ultimately found the problem in the code. It was a piece of code that had been running for a long time with no troubles. It’s unclear as to why it suddenly caused a problem. We’ll search for that answer in the next week or so. Friday evening, we created another patch and pushed it to production. The number of reads went from 2.6M to anywhere between 10 and 100. That’s a HUGE difference. The database is now nice and happy, and things are performing as they should. We still have the large table with high fragmentation. We need to schedule a maintenance window to reindex it. Our timings show it will take around 4 hours for the reindexing to run. Not too bad.
I learned a few lessons from this ordeal.
  1. SQL Server Profiler is your friend. Understanding what the numbers mean and how to use them can quickly show up most problems.
  2. One high CPU utilizing query can have a cascading effect on the system. In this case, the query optimization engine decided that many of the indexes in these smaller tables were useless since queries were taking so long, and defaulted back to a clustered index scan on the primary key instead of using the index. Once the problem was resolved, the optimization engine started using the indexes again.
  3. In the future, one of my first checks will be disk I/O. We had a lot of disk thrashing and a large disk queue length.
Saturday morning came around, and all was quiet. The database was humming along nicely, and clients’ jobs were running well. Looks like I’ll have a relaxing weekend after all.

Posted in SQL Server | Tagged: | Leave a Comment »

Database Madness – Day 2

Posted by Joe on December 13, 2008

 

Thursday morning rolled around. In the words of Arthur Dent, fictional character of Hitchhiker’s Guide to the Galaxy: “This must be Thursday. I never could get the hang of Thursdays.”
 
 I take a train into work, so I get up early and arrive by just after 6am. I have two custom jobs that send me an email when they complete, so I’m used to seeing two emails every morning. This particular morning, there was only one. This could be one of two things. Either we didn’t receive the file from the client, or something was awry with the processor. We did indeed receive the file, so that ruled out the first possibility. The processor had picked up the job, so that ruled out the second possibility. Very strange. There was a third possibility that I hadn’t originally considered.
 
I checked the list of tasks the system was processing, expecting to see the normal 4 or 5 at this time of day. There were nearly 150. Yikes!
 
In that list was a very odd thing. One client sent 16 import jobs through our user interface so they were all processing simultaneously. Well, surely that would be the culprit. I stopped the importer, but things didn’t get any better. I ran a few simple queries against the database and they were coming back with good speed, so no problems there. I checked for blocking queries, and there were some, but a refresh showed them going away quickly.
 
As the morning progressed, tensions rose. Folks started ambling into the office and awareness that we have a problem quickly entered their coffee-starved minds. This was going to be a long day. By 10am, the situation was beginning to get dire. Publishing jobs were going so slow, and normally fast queries were taking a very long time to run. There was nothing weird with the database except for a very high CPU utilization that we were accustomed to seeing because of one client who had intense jobs. However, it was higher than normal – 98 to 100%. I ran a DBCC SHOWCONTIG on a suspect table and it had 36% fragmentation. We usually re-index this table evey 5 weeks or so because it has a tendency to get fragmented easily.
 
Executives were there, and I recommended restarting the server since we couldn’t pinpoint anything. Even SQL Server Profiler wasn’t showing us any singular query that was a suspect. All agreed, and so we stop SQL Server, and for good measure, rebooted the box. I also recommended taking the scheduled maintenance window for that evening to reindex this table. All agreed. The database came back up, was slow, but still running. We’d have to limp along until evening.
 
Thursday evening came around and we reindexed that table and one other that had a high fragmentation percentage as well. We brought the system back up and things were running, but we noticed an unusual slowness in several areas. But things were running. We were tired, and we decided to leave it until morning.

Posted in SQL Server | Tagged: | Leave a Comment »

Database Madness – Day 1

Posted by Joe on December 13, 2008

I work at WhatCounts, an WhatCounts, Inc.. We send out email marketing and newsletter campaigns for the corporate-type clients. From time to time, we run into database and other system performance problems. Our database is almost a terabyte in size, so performance is an absolute must when dealing with this much data.
It all started on Wednesday around 10am. Complaints started coming in that the publishing jobs were not running. We checked and found one job was holding up the entire publishing queue. It was running, but very, very slowly. We check the normal things: blocking queries, slow network, etc. All checked out fine. As we dug a deeper, we found this one client’s publishing job retrieved different content for each message from an external site, and the content was conditional, depending on who was getting the email. We have two technologies to retrieve content: one that caches and one that doesn’t. This one was using the one that doesn’t. Fortunately for us, we could just go into their email template and change it to the one that does. That’s what we did, and the system started flowing nicely again.
Little did we know that this was just the beginning of our problem.

Posted in SQL Server | Tagged: | Leave a Comment »