<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Adam Frisby &#187; distribution</title>
	<atom:link href="http://www.adamfrisby.com/blog/tag/distribution/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.adamfrisby.com/blog</link>
	<description>ZOMGWTFHAI</description>
	<lastBuildDate>Sat, 26 Dec 2009 07:02:09 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>FragStore &#8211; A Fragmenting Asset Store</title>
		<link>http://www.adamfrisby.com/blog/2009/02/fragstore-a-fragmenting-asset-store/</link>
		<comments>http://www.adamfrisby.com/blog/2009/02/fragstore-a-fragmenting-asset-store/#comments</comments>
		<pubDate>Sat, 14 Feb 2009 04:40:38 +0000</pubDate>
		<dc:creator>Adam Frisby</dc:creator>
				<category><![CDATA[OpenSim]]></category>
		<category><![CDATA[assets]]></category>
		<category><![CDATA[cable beach]]></category>
		<category><![CDATA[distribution]]></category>
		<category><![CDATA[scaling]]></category>
		<category><![CDATA[storage]]></category>

		<guid isPermaLink="false">http://www.adamfrisby.com/blog/?p=107</guid>
		<description><![CDATA[Last week it came to my attention that the server currently acting as the asset server for OSGrid.org is nearly full on disk space, and at the time, moving to an alternate server was impossible due to certain DNS requirements. The ability to move the asset server onto another machine is no longer a blocking [...]]]></description>
			<content:encoded><![CDATA[<p>Last week it came to my attention that the server currently acting as the asset server for OSGrid.org is nearly full on disk space, and at the time, moving to an alternate server was impossible due to certain DNS requirements. Moving the asset server onto another machine is no longer a blocking problem; however, there are a number of serious problems with the way assets are currently being stored.</p>
<p>Before enumerating the problems, the current architecture needs a little description.</p>
<h3>Current Setup</h3>
<p>At the moment, assets are stored and retrieved through an interface called IAssetStore. Where an asset actually gets stored is up to the module that implements this interface, although the majority of providers use some kind of relational database as a backend. This works well for small grids and standalone regions where the number and size of assets is small; however, OSGrid.org (using the MySQL adapter) has several million asset entries and a 75GB table file. Because lookups go through the primary key, individual asset requests are still very fast &#8211; however, performing operations over the entire table requires a 75GB table scan, so even the simplest of queries takes several hours.</p>
<p>It should be noted that there is no duplicate detection: two copies of the same texture each take up the full disk space and table space. Given that there is a very high number of duplicates in the database, this accounts for a very significant amount of waste.</p>
<h3>OSGrid.org&#8217;s Profile</h3>
<p>OSGrid&#8217;s asset server is currently running on a high performance machine donated by the Electric Sheep Company. Its disk is a ~140GB SCSI RAID-1 array which, while fine for the metadata (maybe even overkill), becomes an issue when raw storage capacity is in question. The current assets table is around 75GB, and growing at a rate of approximately 10% per week. At the current rate of growth, the disk will be full in about 6 weeks&#8217; time, which means migration to something more scalable in the long term is necessary.</p>
<p>In the long term, however, moving to a new server every 6 months is not an ideal situation, especially since OSGrid is a nonprofit enterprise and hence relies on donations. This creates a necessity to switch to something in which capacity can be more easily added without bringing the grid down or constantly moving to bigger and better hardware, especially if it ever hits the kind of scale where it simply becomes impossible to buy hardware big enough. That is before considering any problems associated with RDBMSs and table size.</p>
<p>The second large concern is the inability to run queries against the table due to its size &#8211; while this will always be a concern, it can be limited by applying appropriate indexes and limiting the table size where possible. Running things such as garbage collection obviously becomes impossible when the table is too large.</p>
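<p>The six week estimate can be checked with a small compound growth calculation. This is only a sketch using the rounded figures quoted above (75GB used, roughly 140GB of capacity, 10% growth per week); the function name is made up for illustration.</p>

```python
import math

def weeks_until_full(current_gb, capacity_gb, weekly_growth):
    """Weeks of compound growth before current_gb reaches capacity_gb."""
    return math.log(capacity_gb / current_gb) / math.log(1 + weekly_growth)

# Rounded figures from the post: 75GB table, ~140GB array, 10% per week.
weeks = weeks_until_full(current_gb=75, capacity_gb=140, weekly_growth=0.10)
print(f"Disk full in roughly {weeks:.1f} weeks")
```

<p>With these inputs the result lands between six and seven weeks, which is in line with the estimate above.</p>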
<h3>Candidates for a Solution</h3>
<p>For the last week I have been investigating some possibilities for scaling the asset cluster without spending anything other than development time (see &#8216;relies on donations&#8217; above). Some of the commercial solutions, such as Isilon OneFS, which Linden Lab employs, appear to fit most of the requirements, but the cost is simply prohibitive. They also have some limitations, such as difficulty with garbage collection, duplicate linking, etc. Thus our requirements look something like this:</p>
<ul>
<li>Partial Replication &#8211; Scaling should be achievable by adding new machines into the cluster, rather than hopping from one to another.</li>
<li>Inexpensive &#8211; the total cost for setting this up should not exceed the donations OSGrid receives to keep running. Ideally it shouldn&#8217;t add any new monthly expenses.</li>
<li>Handles large datasets &#8211; while OSGrid only has 75GB of assets, the Second Life grid has 200TB+. This means whatever solution is chosen needs to have good scaling prospects and remain useful. Moving from one provider to another is a time consuming and difficult process &#8211; so it should not need to be done too often.</li>
<li>CAS &#8211; The DB should utilize content addressable storage to eliminate problems associated with duplicate files &#8211; ideally these should be handled completely transparently.</li>
</ul>
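<p>To make the CAS requirement concrete, here is a minimal in-memory sketch. It is not the OpenSim implementation: the class name is hypothetical, and the choice of SHA-256 as the content hash is an assumption made purely for illustration.</p>

```python
import hashlib

class ContentAddressedStore:
    """Toy in-memory store that keys each blob by a hash of its bytes."""

    def __init__(self):
        self._blobs = {}  # hex digest -> raw bytes

    def put(self, data):
        # SHA-256 is an assumption; any strong content hash would do.
        digest = hashlib.sha256(data).hexdigest()
        # Identical content hashes to the same key, so a duplicate
        # upload does not consume any additional space.
        self._blobs.setdefault(digest, data)
        return digest

    def get(self, digest):
        return self._blobs[digest]

    def __len__(self):
        return len(self._blobs)

store = ContentAddressedStore()
first = store.put(b"same texture bytes")
second = store.put(b"same texture bytes")   # duplicate upload
other = store.put(b"a different texture")
print(first == second, len(store))
```

<p>The two identical uploads come back with the same key and only one copy is kept, which is the transparent duplicate handling the requirement asks for.</p>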
<p>Optional requirements (&#8216;would be nice if..&#8217;)</p>
<ul>
<li>Fault Tolerance &#8211; if an individual system goes down, the system should be able to compensate automatically without downtime.</li>
<li>Caching &#8211; Frequently accessed items should be cached for more speedy access in future.</li>
<li>Read Replication &#8211; assets should be requestable from multiple separate machines to increase scaling properties.</li>
<li>Very speedy access on a primary key &#8211; key-value stores tend to be ideal here (e.g. BDB) since data access is well controlled.</li>
<li>In Production Already &#8211; Ultimately there&#8217;s a lot more confidence in a system if it&#8217;s been proven to scale elsewhere already.</li>
</ul>
<p>A number of solutions could fit this bill: the hosted variety (Amazon S3 / SimpleDB), Bitcache, Custom Hackery of the Routing Asset Server variety, Custom Hackery of the write-your-own-DB variety, etc. Evaluated against the above criteria, most of these options were ruled out.</p>
<h3>Amazon (the hosted option)</h3>
<p>Amazon&#8217;s S3 is a well known option for storing large numbers of images for popular websites &#8211; it effectively says &#8220;Don&#8217;t do this, we&#8217;ll do it for you&#8221;, and if the price was right, it&#8217;d definitely be my selection. Unfortunately the math doesn&#8217;t quite add up for OSGrid to utilize S3 in the long run. At the current rate of bandwidth and size, OSGrid would be paying around $300.00/month. With growth of 10% per week, storing the resulting 7100 GB would cost a minimum of $1,100/month before any transfers or requests. By comparison, the same amount of space on plain hard disks is a mere $2,500 once-off (a $10,700 saving over a year is enough justification to skip this one).</p>
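<p>The comparison can be reproduced with a few lines of arithmetic. This sketch uses only the figures quoted above and makes the simplifying assumption that the S3 storage bill stays flat at $1,100/month for the whole year; the per-GB rate it prints is implied by those figures, not taken from a price list.</p>

```python
monthly_s3_storage = 1100   # USD/month for the projected 7100 GB (from the post)
projected_gb = 7100         # GB, projected asset volume (from the post)
disk_purchase = 2500        # USD, one-off cost of the same space on plain disks

# Assumption: the monthly bill is treated as constant over twelve months.
s3_first_year = 12 * monthly_s3_storage
savings = s3_first_year - disk_purchase
implied_rate = monthly_s3_storage / projected_gb  # USD per GB per month

print(f"S3 storage over a year: ${s3_first_year:,}")
print(f"Saving with plain disks: ${savings:,}")
print(f"Implied storage rate: ${implied_rate:.3f}/GB/month")
```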
<h3>Bitcache</h3>
<p>After mentioning the problem on IRC, Bitcache was suggested as an option. For those who haven&#8217;t seen it before &#8211; it&#8217;s a storage engine that operates on REST principles. The main limitation is that it requires an additional database to store the URLs of where assets are kept if you wish to start spreading assets over multiple machines. That is not a huge limitation, and it&#8217;s definitely easier to integrate into the current solution than writing one yourself. The other downsides are that it doesn&#8217;t appear to be used in production anywhere &#8216;large&#8217;, and the backend is written in Ruby, which isn&#8217;t well known for scalability.</p>
<h3>Custom Hackery of the Routing Asset Server variety</h3>
<p>This option involves writing a special version of the asset server which, instead of processing the request itself, does a UUID lookup and then sends the request over to another server which has that asset stored. This allows assets to be shifted around using standard database migration tools. It&#8217;s an option; however, fault tolerance is an issue, as it&#8217;s possible to lose a section of the database and suffer the consequences thereof. This was the option I was considering until only yesterday, albeit using a new custom BDB-based adapter for the final asset store to increase speed and scalability.</p>
<p>For those who like pretty pictures:</p>
<p><img class="alignnone" title="Routing Asset Server" src="http://www.adamfrisby.com/routingserver.png" alt="" width="412" height="785" /></p>
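<p>In code, the routing idea boils down to a lookup table from asset UUID to the backend that holds it. The sketch below is a hypothetical in-memory illustration only: the class, its method names and the backend hostnames are all invented, and a real version would forward the request rather than just return a name.</p>

```python
import uuid

class RoutingAssetServer:
    """Front end that maps an asset UUID to the backend holding it."""

    def __init__(self):
        self._locations = {}  # asset UUID -> backend name

    def register(self, asset_id, backend):
        self._locations[asset_id] = backend

    def route(self, asset_id):
        # The UUID lookup: find which backend should serve this asset.
        backend = self._locations.get(asset_id)
        if backend is None:
            raise KeyError(f"unknown asset {asset_id}")
        return backend

    def migrate(self, asset_id, new_backend):
        # Moving an asset only means updating the routing entry
        # once the data itself has been copied across.
        self.route(asset_id)  # raises KeyError for unknown assets
        self._locations[asset_id] = new_backend

router = RoutingAssetServer()
texture = uuid.uuid4()
router.register(texture, "assets-a.example")
before = router.route(texture)
router.migrate(texture, "assets-b.example")
after = router.route(texture)
print(before, after)
```

<p>The sketch also shows where the fault tolerance worry comes from: each asset lives on exactly one backend, so losing that backend loses every asset routed to it.</p>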
<h3>Custom Hackery of the Write-your-own-DB variety</h3>
<p>Bad. No.</p>
<h3>Enter Fragstore</h3>
<p>Fragstore is similar in principle to how a number of the above systems work. Fragstore utilizes a pair of databases at its core &#8211; one contains information such as access times, hash information, etc. The other contains the data itself &#8211; I use the term &#8216;database&#8217; here rather loosely, since the datastore is actually a piece of technology called (ugh) Project Voldemort. PV is a Java project used by LinkedIn for their own internal database &#8211; it&#8217;s similar in principle to Amazon&#8217;s SimpleDB and Google&#8217;s BigTable &#8211; and has similar properties to them in terms of scaling. It has a number of useful properties which match up to the above &#8211; including fault tolerance, partial replication, high speed access, caching and more &#8211; plus it&#8217;s been put through its paces at LinkedIn already.</p>
<p>The downside, however, is that running queries on, say, access time would be ridiculously expensive. Likewise it lacks CAS support, so duplicate files would be difficult to track down and link together. It&#8217;s also written in Java and lacks a .NET connector (something thankfully IKVMC was able to fix with the help of dmiles).</p>
<p>Fragstore builds upon the strength of PV by utilizing it as a raw data store, but then integrates in a traditional DBMS for handling of metadata and querying the tables. We also connect in access to the old database so assets uploaded to the old system can still be accessed while the conversion is taking place. The result looks something like this:</p>
<p><img class="alignnone" title="FragStore" src="http://www.adamfrisby.com/fragstore.png" alt="" width="700" height="576" /></p>
<p>There are a couple of key changes to note. First, we&#8217;ve switched from a UUID-based primary key to an integer one. This is done because of MySQL&#8217;s internal lack of a UUID datatype and the speed at which string based primary keys are searched (several orders of magnitude slower). While not used in normal operations, it will allow us to run the garbage collector on a slaved machine and produce a list of redundant IDs which can be eliminated significantly faster. Likewise, we can eliminate all copies of an asset at once by eliminating the hash value &#8211; useful for instance in handling DMCA takedown requests.</p>
<p>When will this be done? I&#8217;m aiming to have a beta version available in the next few days. If it performs adequately, we will run a test by converting the entire OSGrid asset database over to it. Keep your eyes peeled for more information soon.</p>
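<p>The layering described above can be sketched in a few dozen lines. This is a toy model, not Fragstore itself: plain dictionaries stand in for the metadata DBMS, the key-value data store and the old asset database, SHA-256 is an assumed hash, and every name is hypothetical.</p>

```python
import hashlib
import itertools
import uuid

class FragStoreSketch:
    def __init__(self, legacy_assets=None):
        self._next_id = itertools.count(1)
        self._metadata = {}  # integer id -> {"uuid": UUID, "hash": str}
        self._by_uuid = {}   # asset UUID -> integer id
        self._blobs = {}     # content hash -> bytes (raw data store stand-in)
        self._legacy = dict(legacy_assets or {})  # old database: UUID -> bytes

    def store(self, asset_id, data):
        if asset_id in self._by_uuid:
            return self._by_uuid[asset_id]  # already stored
        digest = hashlib.sha256(data).hexdigest()
        self._blobs.setdefault(digest, data)  # duplicates share one blob
        row_id = next(self._next_id)          # integer primary key
        self._metadata[row_id] = {"uuid": asset_id, "hash": digest}
        self._by_uuid[asset_id] = row_id
        return row_id

    def fetch(self, asset_id):
        row_id = self._by_uuid.get(asset_id)
        if row_id is not None:
            return self._blobs[self._metadata[row_id]["hash"]]
        # Not converted yet: fall back to the old database.
        return self._legacy[asset_id]

    def purge_hash(self, digest):
        """Remove every asset sharing this content hash (e.g. a takedown)."""
        doomed = [rid for rid, row in self._metadata.items()
                  if row["hash"] == digest]
        for rid in doomed:
            del self._by_uuid[self._metadata[rid]["uuid"]]
            del self._metadata[rid]
        self._blobs.pop(digest, None)
        return len(doomed)

old_id, a, b = uuid.uuid4(), uuid.uuid4(), uuid.uuid4()
fs = FragStoreSketch(legacy_assets={old_id: b"legacy bytes"})
fs.store(a, b"texture")
fs.store(b, b"texture")  # same content, different UUID
removed = fs.purge_hash(hashlib.sha256(b"texture").hexdigest())
print(removed, fs.fetch(old_id))
```

<p>Two assets with identical content share one stored blob, a single purge by hash removes both, and assets still living in the old database remain readable through the fallback path.</p>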
]]></content:encoded>
			<wfw:commentRss>http://www.adamfrisby.com/blog/2009/02/fragstore-a-fragmenting-asset-store/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
	</channel>
</rss>
