At ~10:00 Pacific today, while I was BS’ing with two of the DBAs at my company about everything from how much the DBEs irritate us to what is wrong with our storage infrastructure, the pager went off [CLOISTER BELL]. The page indicated one of our DS/DW virtual SQL 2008 instances was offline. I ran over, got onto the Windows 2008 failover cluster tool, and viewed the status of the cluster. I saw that the SQL Server application was offline. Then I noticed the transaction log volume was offline/RED/non-existent. We immediately contacted our SOC and had them page out the Storage, Windows and Service Management teams.
While I waited for a bridge to be created I tried to bring the transaction log volume online. That didn’t work. It looked like the storage simply wasn’t there. Now it is 10AM and no one should be doing anything at 10AM, RIGHT?!?!?! The bridge was started, a priority 2 incident was created and storage got on the call. The person in question, whose name will be kept secret because I might start cursing, explained that while cleaning up the retired SVSP storage we had removed yesterday, he accidentally removed the wrong volume. This got my blood boiling, so I asked “When is it coming back?” This person said it was GONE gone, as in removed and the data destroyed. Holy Crap! So he was in the process of creating a new logs volume.
Now thankfully the master/model/msdb log files were not placed on this volume when SQL 2008 was installed. Also, the logs volume was a dependency of SQL, so when it was gone, SQL crashed and alerted us. We could bring up SQL just fine, but all the user databases were in SUSPECT mode. The funny thing is that in SSMS it didn’t actually say the word SUSPECT; it just showed the name of the database with no other icons next to it. What to do next? Well, here is your checklist:
- When were your last FULL & DIFF backups?
For 3 of the 4 core databases we had a recent FULL/DIFF. For the database which was ~6TB in size we had NONE.
- Why didn’t you have a backup?
- WHERE ARE YOUR BACKUPS?
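For the first question, a quick way to check is the msdb backup history (a minimal sketch, assuming the standard msdb.dbo.backupset table is intact; type 'D' is a FULL backup and 'I' is a DIFF):

```sql
-- Most recent FULL and DIFF backup per database, from msdb history
SELECT database_name,
       MAX(CASE WHEN [type] = 'D' THEN backup_finish_date END) AS last_full,
       MAX(CASE WHEN [type] = 'I' THEN backup_finish_date END) AS last_diff
FROM msdb.dbo.backupset
GROUP BY database_name
ORDER BY database_name;
```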
Now our team huddled together and we started to come up with ideas. My manager, who was on IM, mind you, remembered that in the SQL 2000 days you could use sp_attach_single_file_db to attach a database without a transaction log file. Since we had a DBA tools database that was down because of the missing t-log file, we waited until the new log volume was created, built out the folder structure needed for recovery, and tried:
- sp_attach_single_file_db — Failed.
- CREATE DATABASE … ON FILE … FOR ATTACH_REBUILD_LOG — Failed.
- DBCC CHECKDB (<DB>, REPAIR_ALLOW_DATA_LOSS) — Failed.
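For the record, those attempts looked roughly like this (a sketch; the DBATools name and file path are stand-ins, not our actual layout):

```sql
-- Attempt 1: the old SQL 2000-era procedure (deprecated by SQL 2008)
EXEC sp_attach_single_file_db @dbname = 'DBATools',
     @physname = 'D:\Data\DBATools.mdf';

-- Attempt 2: the newer syntax that tries to rebuild the missing log
CREATE DATABASE DBATools
    ON (FILENAME = 'D:\Data\DBATools.mdf')
    FOR ATTACH_REBUILD_LOG;

-- Attempt 3: repair straight away -- fails while the database is SUSPECT
DBCC CHECKDB ('DBATools', REPAIR_ALLOW_DATA_LOSS);
```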
Now what could we do? So I did a Google search [no, I will never BING] and I found the answer. Paul Randal had blogged on August 28, 2008 about how to fix this exact issue. We had forgotten to put the database into EMERGENCY and SINGLE_USER mode prior to running the DBCC command. So what we had to run was:
ALTER DATABASE <DB> SET EMERGENCY;
ALTER DATABASE <DB> SET SINGLE_USER;
DBCC CHECKDB (<DB>, REPAIR_ALLOW_DATA_LOSS) WITH NO_INFOMSGS, ALL_ERRORMSGS;
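One follow-up worth noting: once the repair finishes, the database is still in SINGLE_USER mode, so it has to be put back before anyone else can connect:

```sql
-- Return the repaired database to normal multi-user access
ALTER DATABASE <DB> SET MULTI_USER;
```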
We took the smallest of the 4 core user databases and ran the steps above. In about 38 minutes the DBCC CHECKDB finished for a 385GB database. We did a check and everything looked okay [tables, views, data, etc.]. Given the size/time for this test, we estimated the recovery of the next 2 core databases [1.9 & 3.6 TB each] would take about 2.5 and 5 hours respectively. The largest database, which is ~6TB in size, would take about 9 hours to get done. The speed was estimated at 10GB per minute, taking into account our storage infrastructure, memory and the MAXDOP setting of the server. The recovery is under way, and even though we will have some data loss, we really lucked out. By noon tomorrow we should be back up 100%.
The moral of this story, kids? Paul Randal is a rockstar and I will buy him a round of anything he wants the next time I see him. As for our storage group, that is another sad story.