Last week it came to my attention that the server currently acting as the asset server for OSGrid.org is nearly full on disk space, and at the time, moving to an alternate server was impossible due to certain DNS requirements. Moving the asset server onto another machine is no longer blocked; however, there are a number of serious problems with the way assets are currently being stored.
To enumerate the problems at present, the current architecture needs a little description.
Current Setup
At the moment, assets are stored and retrieved through an interface called IAssetStore. Where an asset actually gets stored is up to the module that implements this interface, although the majority of providers use some kind of relational database as a backend. This works well for small grids and standalone regions where the number and size of assets is small; however, OSGrid.org (using the MySQL adapter) has several million asset entries and a 75GB table file. Because lookups go through the primary key, individual asset requests are still very fast. Performing operations over the entire table, however, requires a 75GB table scan, so even the simplest of queries takes several hours.
It should be noted that there is no duplicate detection: two copies of the same texture each take up the full disk space and table space. Given that there is a very high number of duplicates in the database, this accounts for a very significant amount of waste.
OSGrid.org’s Profile
OSGrid’s asset server is currently running on a high performance machine donated by the Electric Sheep Company. Its disk is a ~140GB SCSI RAID-1 array, which, while fine for the metadata (maybe even overkill), becomes an issue when raw storage capacity is in question. The current assets table is around 75GB and growing at a rate of approximately 10% per week. At the current rate of growth, the disk will be full in about 6 weeks’ time, which means migration to something more scalable in the long term is necessary.
In the long term, however, moving to a new server every 6 months is not an ideal situation, especially since OSGrid is a nonprofit enterprise and hence relies on donations. This creates a necessity to switch to something in which capacity can be added more easily without bringing the grid down and without constantly moving to bigger and better hardware, especially if it ever hits the kind of scale where it simply becomes impossible to buy hardware big enough. And that is ignoring any associated problems with RDBMSs and table size.
The second large concern is the inability to run queries against the table due to its size. While this will always be a concern, it can be limited by applying appropriate indexes and limiting the table size where possible. Running things such as garbage collection obviously becomes impossible when the table is too large.
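As a rough sanity check on the six-week figure, here is a minimal sketch. It assumes simple compound growth of 10% per week and treats the whole ~140GB array as usable, both of which are simplifications; the function name is made up for illustration.

```python
def weeks_of_headroom(current_gb, capacity_gb, weekly_growth):
    """Count how many full weeks of compound growth still fit on the disk."""
    weeks = 0
    size = current_gb
    # Keep stepping forward while next week's size would still fit.
    while size * (1 + weekly_growth) <= capacity_gb:
        size *= 1 + weekly_growth
        weeks += 1
    return weeks


# Figures from the text: 75GB table, ~140GB array, ~10% growth per week.
print(weeks_of_headroom(75, 140, 0.10))
```

With those inputs the table reaches roughly 133GB after six weeks and would pass 140GB in the seventh, so the function returns 6.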
Candidates for a Solution
For the last week I have been investigating some possibilities for scaling the asset cluster without spending anything other than development time (see ‘relies on donations’ above). Some of the commercial solutions, such as Isilon OneFS, which Linden Lab employs, appear to fit most of the requirements, but the cost is simply prohibitive. They also have some limitations, such as difficulty with garbage collection, duplicate linking, etc. Thus our requirements look something like this:
- Partial Replication – Scaling should be achievable by adding new machines into the cluster, rather than hopping from one to another.
- Inexpensive – the total cost for setting this up should not exceed the donations OSGrid receives to keep running. Ideally it shouldn’t add any new monthly expenses.
- Handles large datasets – while OSGrid only has 75GB of assets, the Second Life grid has 200TB+. This means whatever solution is chosen needs to have good scaling prospects and remain useful. Moving from one provider to another is a time consuming and difficult process, so it should not need to be done too often.
- CAS – The DB should utilize content addressable storage to eliminate problems associated with duplicate files – ideally these should be handled completely transparently.
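To make the CAS requirement concrete, here is a minimal in-memory sketch. The class name, the use of SHA-256 and the two dictionaries are all assumptions made for illustration; nothing here is taken from an existing OpenSim module.

```python
import hashlib


class ContentAddressedStore:
    """Toy content-addressable store: identical bytes are kept only once."""

    def __init__(self):
        self._blobs = {}  # content hash -> raw bytes
        self._ids = {}    # asset id -> content hash

    def put(self, asset_id, data):
        # The storage key is derived from the content itself (SHA-256 is an assumption).
        digest = hashlib.sha256(data).hexdigest()
        self._blobs.setdefault(digest, data)  # a duplicate upload stores nothing new
        self._ids[asset_id] = digest
        return digest

    def get(self, asset_id):
        # Callers keep using their own ids; the hash stays an internal detail.
        return self._blobs[self._ids[asset_id]]

    def stored_bytes(self):
        return sum(len(blob) for blob in self._blobs.values())
```

Two uploads of the same texture under different ids end up pointing at one stored blob, which is the "completely transparent" duplicate handling described above.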
Optional requirements (’would be nice if..’)
- Fault Tolerance – if an individual system goes down, the system should be able to compensate automatically without downtime.
- Caching – Frequently accessed items should be cached for more speedy access in future.
- Read Replication – assets should be requestable from multiple separate machines to increase scaling properties.
- Very speedy access on a primary key – key-value stores tend to be ideal here (e.g. BDB) since data access is well controlled.
- In Production Already – Ultimately there’s a lot more confidence in a system if it’s been proven to scale elsewhere already.
There are a number of solutions which could fit this bill: the hosted variety (Amazon S3 / SimpleDB), Bitcache, Custom Hackery of the Routing Asset Server variety, Custom Hackery of the write-your-own-DB variety, etc. Evaluated against the above criteria, most of these options were ruled out.
Amazon (the hosted option)
Amazon’s S3 is a well known option for storing large numbers of images for popular websites. It effectively says “Don’t do this, we’ll do it for you”, and if the price was right, it’d definitely be my selection. Unfortunately the math doesn’t quite add up for OSGrid to utilize S3 in the long run. At the current rate of bandwidth and size, OSGrid would be paying around $300.00/month; growing 10% per week and storing the resulting 7100 GB is a minimum $1,100/month without any transfers or requests. By comparison, the same amount of space on plain hard disks is a mere $2,500 once-off ($10,700 savings over a year is enough justification to skip this one).
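The $10,700 figure appears to follow from twelve months at the $1,100 storage-only rate minus the one-off disk purchase. The sketch below just reproduces that arithmetic; the names are illustrative, and it ignores transfer and request fees as well as the cost of hosting and powering the disks.

```python
S3_MONTHLY_USD = 1100    # storage only for the projected 7100 GB (figure from the text)
DISK_ONE_OFF_USD = 2500  # plain hard disks, paid once (figure from the text)


def first_year_savings(monthly_hosted, one_off_disks, months=12):
    """Hosted cost over the period minus the one-off hardware purchase."""
    return monthly_hosted * months - one_off_disks


savings = first_year_savings(S3_MONTHLY_USD, DISK_ONE_OFF_USD)
print(savings)
```

That is 12 × 1,100 = 13,200 for the hosted option against 2,500 for disks, a difference of 10,700.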
Bitcache
After I mentioned the problem on IRC, Bitcache was suggested as an option. For those who haven’t seen it before, it’s a storage engine that operates on REST principles. The main limitation is that it requires an additional database to store the URLs of where assets are stored if you wish to start spreading assets over multiple machines. That’s not a huge limitation, and it’s definitely easier to integrate into the current solution than writing one yourself. The other downsides are that it doesn’t appear to be used in production anywhere ‘large’, and the backend is written in Ruby, which isn’t well known for scalability.
Custom Hackery of the Routing Asset Server variety
This option involves writing a special version of the asset server which, instead of processing the request itself, does a UUID lookup and then hands the request over to the server which has that asset stored. This allows assets to be shifted around using standard database migration tools. It’s an option; however, fault tolerance is an issue, as it’s possible to lose a section of the database and suffer the consequences thereof. This was the option I was considering until only yesterday, albeit using a new custom BDB-based adapter for the final asset store to increase speed and scalability.
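A minimal sketch of the routing idea, with in-memory dictionaries standing in for the lookup table and for the backend servers. The class and method names are hypothetical and do not correspond to any existing asset server code.

```python
import uuid


class RoutingAssetServer:
    """Toy router: looks up which backend holds an asset, then delegates to it."""

    def __init__(self):
        self._locations = {}  # asset UUID -> backend name
        self._backends = {}   # backend name -> {asset UUID: bytes}

    def add_backend(self, name):
        self._backends[name] = {}

    def store(self, asset_id, data, backend):
        self._backends[backend][asset_id] = data
        self._locations[asset_id] = backend

    def fetch(self, asset_id):
        # Step 1: UUID lookup. Step 2: ask the backend that owns the asset.
        backend = self._locations[asset_id]
        return self._backends[backend][asset_id]

    def migrate(self, asset_id, target):
        # Move the data, then repoint the lookup entry at the new backend.
        source = self._locations[asset_id]
        self._backends[target][asset_id] = self._backends[source].pop(asset_id)
        self._locations[asset_id] = target


server = RoutingAssetServer()
server.add_backend("assets-a")
server.add_backend("assets-b")
asset_id = uuid.uuid4()
server.store(asset_id, b"texture data", "assets-a")
server.migrate(asset_id, "assets-b")
```

Each asset lives on exactly one backend in this sketch, which is where the fault tolerance concern comes from: lose that backend and every asset it held is gone.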
For those who like pretty pictures -

Custom Hackery of the Write-your-own-DB variety
Bad. No.
Enter Fragstore
Fragstore is similar in principle to how a number of the above systems work. Fragstore utilizes a pair of databases at its core: one contains information such as access times, hash information, etc., and the other contains the data itself. I use the term ‘database’ here rather loosely, since the data store is actually a piece of technology called (ugh) Project Voldemort. PV is a Java project used by LinkedIn for their own internal database. It’s similar in principle to Amazon’s SimpleDB and Google’s BigTable, and has similar properties to them in terms of scaling. It has a number of useful properties which match up to the above, including fault tolerance, partial replication, high speed access, caching and more, plus it’s been put through its paces at LinkedIn already.
The downside, however, is that running queries on, say, access time would be ridiculously expensive. Likewise it lacks CAS support, so duplicate files would be difficult to track down and link together. It’s also written in Java and lacks a .NET connector (something thankfully IKVMC was able to fix with the help of dmiles).
Fragstore builds upon the strength of PV by utilizing it as a raw data store, but then integrates a traditional DBMS for handling metadata and querying the tables. We also connect in access to the old database so assets uploaded to the old system can still be accessed while the conversion is taking place. The result looks something like this:

There are a couple of key changes to note. First, we’ve switched from a UUID-based primary key to an integer one; this is done because of MySQL’s lack of a native UUID datatype and the speed at which string-based primary keys are searched (several orders of magnitude slower). While not used in normal operations, it will allow us to run the garbage collector on a slaved machine and produce a list of redundant IDs, which can then be eliminated significantly faster. Likewise, we can eliminate all copies of an asset at once by eliminating the hash value, which is useful, for instance, in handling DMCA takedown requests.
When will this be done? I’m aiming to have a beta version available in the next few days; if it performs adequately, we will run a test by converting the entire OSGrid asset database over to it. Keep your eyes peeled for more information soon.
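To illustrate the split between metadata and raw data, here is a toy sketch and not the actual Fragstore code. It assumes sqlite3 as a stand-in for the metadata DBMS, a plain dictionary as a stand-in for Project Voldemort, SHA-256 as the content hash, and a copy-on-first-read fallback to the old database; every name in it is hypothetical.

```python
import hashlib
import sqlite3
import time


class FragstoreSketch:
    """Toy model: SQL metadata, key-value blobs, read-through to a legacy store."""

    def __init__(self, legacy_assets=None):
        # Metadata side: integer primary key, UUID and hash as ordinary columns.
        self.meta = sqlite3.connect(":memory:")
        self.meta.execute(
            "CREATE TABLE assets ("
            " id INTEGER PRIMARY KEY,"
            " uuid TEXT UNIQUE NOT NULL,"
            " hash TEXT NOT NULL,"
            " access_time REAL NOT NULL)"
        )
        self.meta.execute("CREATE INDEX assets_hash ON assets (hash)")
        self.blobs = {}                          # hash -> bytes (key-value stand-in)
        self.legacy = dict(legacy_assets or {})  # uuid -> bytes (old database stand-in)

    def put(self, asset_uuid, data):
        digest = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(digest, data)  # duplicates share one blob
        self.meta.execute(
            "INSERT OR REPLACE INTO assets (uuid, hash, access_time) VALUES (?, ?, ?)",
            (asset_uuid, digest, time.time()),
        )
        return digest

    def get(self, asset_uuid):
        row = self.meta.execute(
            "SELECT hash FROM assets WHERE uuid = ?", (asset_uuid,)
        ).fetchone()
        if row is None:
            # Not converted yet: read from the old store and copy it across.
            data = self.legacy[asset_uuid]  # raises KeyError if unknown everywhere
            self.put(asset_uuid, data)
            return data
        self.meta.execute(
            "UPDATE assets SET access_time = ? WHERE uuid = ?",
            (time.time(), asset_uuid),
        )
        return self.blobs[row[0]]

    def take_down(self, digest):
        # Removing by hash drops every UUID that pointed at the same content.
        removed = self.meta.execute(
            "DELETE FROM assets WHERE hash = ?", (digest,)
        ).rowcount
        self.blobs.pop(digest, None)
        return removed
```

In this sketch a takedown only touches the new store; a real one would also have to cover the old database, or the asset could be copied back in on the next read.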


Hello,
Microsoft Azure and Google App Engine can be an alternate solution to Amazon hosting.
Laurent
14 Feb 09 at 10:47 am
[...] a little link to some of the things facing OSGrid.org as they ponder an architecture for their asset server that won’t barf up a lung every six [...]
Reading Radar » Maybe You’ll Believe Someone Else?
14 Feb 09 at 5:37 pm
App Engine is even more expensive than S3. Azure’s consumption pricing isn’t quite as easy to calculate, but I’d expect it to be just as high.
Adam Frisby
14 Feb 09 at 11:23 pm
Hi Adam
I was interested to read this article as it confirmed something I have felt for some time: that OpenSim asset growth is a real problem long term.
I set up a small private OpenSim grid with a friend some time ago. We are not developers in the OpenSim sense, but we do have extensive IT experience in database systems, so we thought we would investigate the database behind OpenSim and see if we could do anything to help in that way. So far we have built a PHP-based administration tool that allows drill-down of the data. We are now working on a utility to identify and archive redundant assets. It’s nearly finished. When it is, we will be looking for a way to give this to the OpenSim community so grid operators can keep their disk storage requirements down.
We realised that assets get created by OpenSim but never destroyed, even when those assets have no pointers towards them in any of the tables. Such assets could be removed from the asset table with no effect on OpenSim other than disk space savings and retrieval efficiency improvements. This is especially noticeable in script assets. A script is written and creates an asset. It is updated as it is developed, and a new asset is created each time, leaving the older one redundant and unused. The same sort of problem exists for other types of assets too.
In our small and not very active grid we found that 75% of all assets are not pointed to and so can be removed. I would expect that on very active grids like OSGrid this percentage would be much higher.
So apply that sort of maths to your current problem and maybe you don’t need to upgrade disk so often. Of course it will still grow and your plan is still a good one, but with the advent of Hypergrid I think smaller grids with well managed asset archival would also be a good scalable way forward.
Would love to have your views on what we are doing and will explain our approach in more detail if you are interested.
Bob Wellman
17 Feb 09 at 10:25 am
[...] to scaling OpenSim is amazing. Sure I am a fanboy, but anyone wondering why I think so should read Adam’s FragStore blog post [...]
FragStore, End of Finite UUIDs and More « Mo Hax
19 Feb 09 at 2:38 am