The following is my log of the OSGrid asset conversion saga.
Day 1: Today we’re taking a third attempt at doing the big fragstore conversion for OSGrid, for those not following the Saga of the Asset Server – about two months ago we started noticing major scalibility issues surrounding assets. Right now they are thrown into MySQL as a blob table, resulting in large amounts of waste both in duplicate content and in the fact we’re storing a filesystem inside of a relational database – you can read up on the earlier design decisions leading to FragStore here.
The previous two conversion attempts have suffered “mysterious MySQL glitches” which we assume may be related to various bugs with long running commands. Apparently the proper course of action when running a query that takes more than 60 seconds to process the command, is to freeze up entirely and stop processing requests – for now and evermore.
In an attempt to make this run a bit smoother – we’ve broken up the process into 2,000 batches of 1,000 assets each – previously our batch mechanism was using MySQL LIMIT X, Y which has the side effect of getting slower and slower as you progress down the table (thus causing the above); so we’ve shifted to using a numeric ID on the assets table. Putting the numeric ID on there allows us to at least index sequential accesses – LIMIT unfortunately will not use any form of index hinting.
mysql> ALTER TABLE `assets` -> ADD COLUMN `numericID` int(11) UNSIGNED NOT NULL AUTO_INCREMENT AFTER `access_time`, -> DROP PRIMARY KEY, -> ADD PRIMARY KEY (`numericID`), -> ADD UNIQUE INDEX `assetID` (`id`); Query OK, 1623826 rows affected (1 hour 11 min 2.54 sec) Records: 1623826 Duplicates: 0 Warnings: 0
An hour and a bit later, the difference between the speed of processing before and after is pretty astounding
mysql> select id from assets limit 540000,10; 10 rows in set (11.53 sec) mysql> select id from assets where numericID between 540000 AND 540009; 10 rows in set (0.42 sec)
It doesnt need a whole bunch of explanation to figure the above may help with our situation. Running the revised and simplified “AssetConverterMarkII” appears to go without a hitch – data is stored into the database, the metadata table is being filled correctly – all in all it appears to be functional. With one minor teeny little problem.
Only the first 4096 bytes of data are being written to the backend store. The remaining sectors of data are written – but consist entirely of zeros. Retrieving the data results in a buffer of the same length as the original stored asset – but often half the data is completely missing. An hour later, it looks like the data is being sent to the backend voldestore correctly, but either on the way there or on the way back, it loses something. Unfortunately it looks like the problem is outside the purview of the client adapter and is somewhere in the deep murk of the backend storage server.
Day 2: Rethinking time – after spending quite some time hunting for some alternatives, the simplest solution looks to be the best.
While I am keen to use Project Voldemort in the long term – in the short term debugging our implementation details are just not on my agenda. We use a IKVM cross-compiled connector library from Java, and the problem looks like it is sitting in there somewhere. Unfortunately debugging Java IKVM libraries from within .NET is painful at best, and not something easily fitting into our timescale.
The simplest solution is to throw the asset blobs onto the filesystem – filesystems are after all developed to handle tiny little files. Directories will slow down when there is more than about 3,000 entries within them, so we’re breaking storage up into “/b1/b2/hash.blob” – this means assuming an even distribution, approximately 30 files per directory at current size, scaling us up to a capacity of 100 million assets before we need to rethink the situation.
Distribution and redundancy are both things I am still keen to employ – putting us on the filesystem does allow us to look at things such as KosmosFS which provide transparent distributed filesystems on Linux, and also gives us the opportunity to look at commercial filestores down the road if we ever win the lottery.
Rewriting fragstore to use filesystem components where voldemort was employed took all of an hour and the asset converter was up and running – a lot faster too. Our conversion transfer rate on Voldemort was 66 assets per second. FragstoreFS?
10,000 Assets Processed (102.04 asset(s)/sec): 0 error(s) so far.
The second thing I wanted to test was just how big a savings we were getting from using Content Addressable Storage – with 10,000 processed, we ended up with 613 duplicates eliminated (6.1%). With 20,000 – 1390 (6.9%), with 90,000 – 8962 (9.95%). We’re hoping as the full dataset is processed – the % of duplicates eliminated continues to increase.

Fig 1. CAS Duplicate Savings
The next issue to present itself was a slowdown as the conversion occured – the number above (102.4) held firm for the first 10% of the conversion, then conversion speed began to massively taper off, first down to 71.39/sec, then down to 50.22/sec by 150,000 converted. My fears were a reprise of the situation we thought fixed on day 1 – slowdowns on accessing as we move further down the table.
Nebadon suggested that this infact might be actually because as OSGrid has become more popular the average size of an asset has increased over time – so we skipped a million rows down the table and started converting some of the later rows. Conversion speed? 68.31/sec. This indicates that yes, later assets are more expensive to process – but the conversion speed should still average the 60 or so per second we need to be able to convert the entire database in under 24 hours.
Appearing somewhat happy with the results, conversion on the complete database has started, but we wont know how well it has worked until tommorow.
Day 3: Stay tuned!


Very interesting stuff!
Thanks for posting and for notifying us in IRC.
Starky Rubble
30 Apr 09 at 7:25 pm
Actually, databases are becoming more and more usable for storing things like this.
I’d suggest as this dataset is only going to grow, you might be better off creating two seperate tables – one with the metadata (Filename, Checksum, AssetID, etc), and another with only three fields (Primary ID, Relational ID, Data).
I can also forsee using a traditional on-disk filesystem having some potentially devastating effects in this type of scenario also – for example, depending on the activity of the nodes pulling information down, are you going to have enough filehandles available on the operation system to serve up all your requests?
Either way, for this application – it might be beneficial to get yourself a copy of the MySQL query analyser, and enable the slow query log on the server so you guys can pinpoint which queries/tables need optimising for performance.
Will Dowling
1 May 09 at 1:33 am
Hey Will,
Postgres apparently isnt too bad here – by MySQL has shown it definetely does not like 200GB tables (we’ve had to recover it multiple times now – any any query against the table that isnt on the primary key will nuke the box for 2 hours). I have a feeling that any problems we have with MySQL right now will be significantly worse if we ever hit the same 300TB+ ranges Linden Lab do.
Re: on-disk performance, I don’t think that will be an issue yet, but it is a long term consideration – the big options I am looking at now are using some FUSE-based “filesystems” designed for distribution and clustering across multiple machines. KosmosFS looks promising, but there are a few others too.
Adam Frisby
1 May 09 at 7:53 am
G’day again
MySQL should be able to handle the load fine, but you’ll have the same issues that you’d have with any other RDBMS.
There’s a few things I’d be curious to know about the current setup. What storage engine are you guys using? What is the underlying storage? How does the database get to the underlying storage (local, nfs, iscsi)?
At the end of the day, in this type of scenario whilst the database is 100% definitively the authorative source for asset information, success will be measured by what caching/expiry you do outside the database.
Your recent posts scaling are making me tempted to put up my hand and offer some help – let me know if you need anything
Will Dowling
4 May 09 at 1:38 am
We’ve varied between InnoDB and MyISAM for this, MyISAM has unfortunate table-level locking which can hit us badly if we try do analysis on the table while the grid is running; but we’ve found it is more reliable – InnoDB has corrupted for us in the past and unlike MyISAM, there are no innodb repair tools – had to restore from backup.
Underlying disk is a RAID1 SCSI array, so the speed there isnt too much of a problem. The hardware itself is fairly decent — but it’s our experience that MySQL just doesnt do ‘big data’ well.
One of the IBM engineers who used to work on data warehousing has said that Postgres will perform significantly better on GB/TB-sized tables – but ultimately there’s more at play than reliability & speed; decent partial distribution and clustering are going to be must haves this year.
Adam Frisby
4 May 09 at 6:43 pm
Btw, definetely always looking for extra volunteers – there’s a ridiculous amount of work that needs to be done with this stuff. I’ll grab you on Facebook or something tommorow/today.
Adam Frisby
4 May 09 at 6:44 pm