Last week, a table in one of our applications became full, and our disk space usage on the server was at 100%. Enter disaster mode. I scrambled to remove as much as I could from the server, and to my relief, I was able to get down to about 78% space usage on the box. While this was good, there was still more to be done. In this particular application, page views are recorded and saved to a MySQL DB table every time any page in the application is viewed. The application has been running for several years now, and we have in the past cleared old data, but this is my first crack at it, personally.
My first idea was to just delete the records that were more than 2 years old, and be done with it. Simple enough, right? Not so much. My project manager wanted to keep the old data, an idea with which I agreed. That old data could come in handy some day. So now my plan was to export the entire page_views table to a .sql file and copy it down to my machine, and then delete the old page views from the table. No problem! Hahahahahha oh man was I wrong. I got the data file exported and copied down to my local machine (the page_views table was 26 gigs with around 147,000,000 records in it) at about 3 gigabytes, a process that took upwards of 5 hours. I then removed the .sql file and we were still looking good as far as disk space goes. I was quite happy with that result, and I was ready to delete the old data.
Without any reading or research, I just assumed that running this command (which would effectively delete about 30,000,000 records from the page_views table):
mysql> DELETE FROM page_views WHERE created_at < NOW() - INTERVAL 2 YEAR;
would accomplish my goals in short order and I would be free to go about the rest of my business. I couldn't have been more wrong. I let that query run for over an hour until I realized that something wasn't quite right. I began my quest to learn more about MySQL and how it really works. As I continued to find, the query above would take an absurd amount of time. Here's why: Every single record in the DELETE query is scanned, then marked for deletion, and logged. All of this has to happen before a record is deleted. And to make matters worse, when a record is finally deleted, the table is rescanned again because of the indexes AFTER EVERY RECORD. That is just awful. Now I know.
But my optimistic self decided to let the query run over night, and I naively expected the operation to be completed the next morning. Not even close. A single record had yet to be deleted. :epic face palm: Armed with that knowledge, and with the knowledge that MySQL would be able to roll back the command, I killed the query. The rollback process took less than 24 hours, and this morning I began the process of deleting large amounts of MySQL data the right way:
CREATE TABLE page_views_new LIKE page_views;
LOCK TABLES page_views_new WRITE, page_views READ;
INSERT INTO page_views_new SELECT * FROM page_views WHERE created_at > NOW() - INTERVAL 2 YEAR;
UNLOCK TABLES;
RENAME TABLE page_views TO page_views_old;
RENAME TABLE page_views_new TO page_views;
DROP TABLE page_views_old;
That's pretty much it, though I am still learning the in's and out's of fine tuning large queries in MySQL.
You know what they say, better late than never!