<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Adam Frisby &#187; voldemort</title>
	<atom:link href="http://www.adamfrisby.com/blog/tag/voldemort/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.adamfrisby.com/blog</link>
	<description>ZOMGWTFHAI</description>
	<lastBuildDate>Sat, 26 Dec 2009 07:02:09 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>DB&#8217;s Considered Harmful [... to my sanity.]</title>
		<link>http://www.adamfrisby.com/blog/2009/04/dbs-considered-harmful-to-my-sanity/</link>
		<comments>http://www.adamfrisby.com/blog/2009/04/dbs-considered-harmful-to-my-sanity/#comments</comments>
		<pubDate>Thu, 30 Apr 2009 18:14:15 +0000</pubDate>
		<dc:creator>Adam Frisby</dc:creator>
				<category><![CDATA[OpenSim]]></category>
		<category><![CDATA[assets]]></category>
		<category><![CDATA[fragstore]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[osgrid]]></category>
		<category><![CDATA[voldemort]]></category>

		<guid isPermaLink="false">http://www.adamfrisby.com/blog/?p=230</guid>
		<description><![CDATA[The following is my log of the OSGrid asset conversion saga.
Day 1: Today we&#8217;re taking a third attempt at doing the big fragstore conversion for OSGrid, for those not following the Saga of the Asset Server &#8211; about two months ago we started noticing major scalability issues surrounding assets. Right now they are thrown into [...]]]></description>
			<content:encoded><![CDATA[<p>The following is my log of the OSGrid asset conversion saga.</p>
<p><strong>Day 1</strong>: Today we&#8217;re taking a third attempt at the big fragstore conversion for OSGrid. For those not following the Saga of the Asset Server: about two months ago we started noticing major scalability issues surrounding assets. Right now they are thrown into MySQL as a blob table, which wastes a lot of space, both through duplicate content and through the fact that we&#8217;re storing a filesystem inside a relational database &#8211; you can read up on the <a href="http://www.adamfrisby.com/blog/2009/02/fragstore-a-fragmenting-asset-store/">earlier design decisions leading to FragStore here</a>.</p>
<p>The previous two conversion attempts suffered &#8220;mysterious MySQL glitches&#8221;, which we assume are related to various bugs with long-running commands. Apparently the proper course of action when a query takes more than 60 seconds to process is to freeze up entirely and stop processing requests &#8211; for now and evermore.</p>
<p>In an attempt to make this run a bit smoother, we&#8217;ve broken the process up into 2,000 batches of 1,000 assets each. Previously our batch mechanism used MySQL LIMIT X, Y, which has the side effect of getting slower and slower as you progress down the table (thus causing the above), so we&#8217;ve shifted to using a numeric ID on the assets table. Putting the numeric ID on there allows us to at least index sequential accesses &#8211; LIMIT unfortunately cannot use an index to jump to an offset, so it has to read and discard every row it skips.</p>
<blockquote>
<pre>mysql&gt; ALTER TABLE `assets`
    -&gt; ADD COLUMN `numericID`  int(11) UNSIGNED NOT NULL AUTO_INCREMENT AFTER `access_time`,
    -&gt; DROP PRIMARY KEY,
    -&gt; ADD PRIMARY KEY (`numericID`),
    -&gt; ADD UNIQUE INDEX `assetID` (`id`);

Query OK, 1623826 rows affected (1 hour 11 min 2.54 sec)
Records: 1623826  Duplicates: 0  Warnings: 0</pre>
</blockquote>
<p>An hour and a bit later, the difference in query speed before and after is pretty astounding:</p>
<blockquote>
<pre>mysql&gt; select id from assets limit 540000,10;
10 rows in set (11.53 sec)

mysql&gt; select id from assets where numericID between 540000 AND 540009;
10 rows in set (0.42 sec)</pre>
</blockquote>
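<p>As a rough sketch of what batching on the numeric ID looks like (Python here purely for illustration; the real converter isn&#8217;t shown in this post, and the function names below are made up), each batch becomes a plain range query against the new primary key. It assumes the IDs run densely from 1 upwards, which should hold right after adding the AUTO_INCREMENT column; if rows are deleted later, some batches will simply come back short.</p>
<blockquote>
<pre># Sketch only: names are hypothetical, not taken from AssetConverterMarkII.
def batch_ranges(first_id, last_id, batch_size):
    """Yield inclusive (low, high) numericID ranges covering first_id..last_id."""
    for low in range(first_id, last_id + 1, batch_size):
        high = min(low + batch_size - 1, last_id)
        yield low, high


def batch_query(low, high):
    """Build the range query for one batch (both bounds are plain integers)."""
    return "SELECT id FROM assets WHERE numericID BETWEEN %d AND %d" % (low, high)


if __name__ == "__main__":
    for low, high in batch_ranges(1, 2500, 1000):
        print(batch_query(low, high))</pre>
</blockquote>
<p>Each batch only touches its own slice of the primary key, so the last batch should cost about the same as the first.</p>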
<p>It doesn&#8217;t need much explanation to see how that speedup helps our situation. Running the revised and simplified &#8220;AssetConverterMarkII&#8221; appears to go without a hitch &#8211; data is stored into the database, the metadata table is being filled correctly &#8211; all in all it appears to be functional. With one minor teeny little problem.</p>
<p>Only the first 4096 bytes of data are being written to the backend store. The remaining sectors of data are written &#8211; but consist entirely of zeros. Retrieving the data results in a buffer of the same length as the original stored asset &#8211; but often half the data is completely missing. An hour later, it looks like the data is being sent to the backend voldestore correctly, but either on the way there or on the way back, it loses something. Unfortunately it looks like the problem is outside the purview of the client adapter and is somewhere in the deep murk of the backend storage server.</p>
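<p>For anyone wanting to catch this kind of corruption early, a round-trip check along these lines would do it. This is a hypothetical Python sketch, not the code we ran: it compares what went in with what came back and reports the first offset where they differ.</p>
<blockquote>
<pre># Sketch only: a generic round-trip comparison, not part of the converter.
def first_mismatch(original, retrieved):
    """Return the first offset where the buffers differ, or None if identical."""
    for offset, (sent, received) in enumerate(zip(original, retrieved)):
        if sent != received:
            return offset
    if len(original) != len(retrieved):
        # One buffer is a prefix of the other; they diverge where it ends.
        return min(len(original), len(retrieved))
    return None


if __name__ == "__main__":
    asset = bytes([7]) * 10000
    # Simulate the symptom: same length back, zeros after the first 4096 bytes.
    damaged = asset[:4096] + bytes(len(asset) - 4096)
    print(first_mismatch(asset, damaged))</pre>
</blockquote>
<p>A mismatch that always lands on the same offset points at a fixed-size buffer somewhere in the chain, rather than at the data itself.</p>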
<p><strong>Day 2</strong>: Rethinking time &#8211; after spending quite some time hunting for some alternatives, the simplest solution looks to be the best.</p>
<p>While I am keen to use Project Voldemort in the long term &#8211; in the short term, debugging our implementation details is just not on my agenda. We use an IKVM cross-compiled connector library from Java, and the problem looks like it is sitting in there somewhere. Unfortunately debugging Java IKVM libraries from within .NET is painful at best, and not something that fits easily into our timescale.</p>
<p>The simplest solution is to throw the asset blobs onto the filesystem &#8211; filesystems are, after all, developed to handle tiny little files. Directories will slow down when there are more than about 3,000 entries within them, so we&#8217;re breaking storage up into &#8220;/b1/b2/hash.blob&#8221;. Assuming an even distribution, this means approximately 30 files per directory at current size, scaling us up to a capacity of 100 million assets before we need to rethink the situation.</p>
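<p>To make the layout concrete, here is a minimal Python sketch of how such a path could be derived. Two things in it are assumptions for illustration only: that the hash is SHA-256, and that b1 and b2 are the first two bytes of the hash written as hex. Under those assumptions there are 256 x 256 = 65,536 leaf directories, which for roughly two million assets works out to about 30 files each.</p>
<blockquote>
<pre>import hashlib
import posixpath


# Sketch only: SHA-256 and the two-hex-byte split are assumptions,
# and blob_path is a hypothetical name.
def blob_path(root, data):
    """Return root/b1/b2/hash.blob for the given asset bytes."""
    digest = hashlib.sha256(data).hexdigest()
    b1, b2 = digest[0:2], digest[2:4]
    return posixpath.join(root, b1, b2, digest + ".blob")


def files_per_directory(asset_count, fanout=256 * 256):
    """Average files per leaf directory, assuming an even distribution."""
    return asset_count / fanout


if __name__ == "__main__":
    print(blob_path("/assets", b"example asset data"))
    print(files_per_directory(2000000))</pre>
</blockquote>
<p>Because the directory names come from the hash itself, no lookup table is needed to find a blob again: the hash is the address.</p>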
<p>Distribution and redundancy are both things I am still keen to employ &#8211; putting us on the filesystem does allow us to look at things such as KosmosFS which provide transparent distributed filesystems on Linux, and also gives us the opportunity to look at commercial filestores down the road if we ever win the lottery.</p>
<p>Rewriting fragstore to use filesystem components where Voldemort was employed took all of an hour, and the asset converter was up and running &#8211; a lot faster too. Our conversion transfer rate on Voldemort was 66 assets per second. FragstoreFS?</p>
<blockquote>
<pre>10,000 Assets Processed (102.04 asset(s)/sec): 0 error(s) so far.</pre>
</blockquote>
<p>The second thing I wanted to test was just how big a saving we were getting from using Content Addressable Storage &#8211; with 10,000 processed, we ended up with 613 duplicates eliminated (6.1%). With 20,000 &#8211; 1390 (6.9%), with 90,000 &#8211; 8962 (9.95%). We&#8217;re hoping that as the full dataset is processed, the percentage of duplicates eliminated continues to increase.</p>
<div id="attachment_249" class="wp-caption aligncenter" style="width: 503px"><img class="size-full wp-image-249" title="OSGrid Content Addressable Storage Dupe Savings" src="http://www.adamfrisby.com/blog/wp-content/uploads/osgrid_casdupes.png" alt="Fig 1. CAS Duplicate Savings" width="493" height="286" /><p class="wp-caption-text">Fig 1. CAS Duplicate Savings</p></div>
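<p>The dedupe itself falls out of the storage scheme: if the key is a hash of the content, identical assets collapse onto the same key. A minimal in-memory Python sketch of the idea follows. The class name and the choice of SHA-256 are assumptions for illustration, and the real store writes to disk rather than to a dict.</p>
<blockquote>
<pre>import hashlib


# Sketch only: an in-memory stand-in for a content addressable store.
class ContentStore:
    def __init__(self):
        self.blobs = {}        # content hash mapped to the stored bytes
        self.stored = 0        # total put() calls
        self.duplicates = 0    # put() calls that hit existing content

    def put(self, data):
        """Store data under its hash; identical content is kept only once."""
        key = hashlib.sha256(data).hexdigest()
        self.stored += 1
        if key in self.blobs:
            self.duplicates += 1
        else:
            self.blobs[key] = data
        return key

    def duplicate_percentage(self):
        """Share of put() calls eliminated as duplicates, in percent."""
        if self.stored == 0:
            return 0.0
        return 100.0 * self.duplicates / self.stored


if __name__ == "__main__":
    store = ContentStore()
    for asset in (b"texture", b"sound", b"texture", b"notecard"):
        store.put(asset)
    print(store.duplicates, store.duplicate_percentage())</pre>
</blockquote>
<p>The metadata table still needs one row per asset ID pointing at the hash, so only the blob is shared between duplicates.</p>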
<p>The next issue to present itself was a slowdown as the conversion occurred &#8211; the number above (102.04/sec) held firm for the first 10% of the conversion, then conversion speed began to taper off massively, first down to 71.39/sec, then down to 50.22/sec by 150,000 converted. My fear was a reprise of the situation we thought fixed on day 1 &#8211; slowdowns on access as we move further down the table.</p>
<p>Nebadon suggested that this might in fact be because, as OSGrid has become more popular, the average size of an asset has increased over time &#8211; so we skipped a million rows down the table and started converting some of the later rows. Conversion speed? 68.31/sec. This indicates that yes, later assets are more expensive to process &#8211; but the conversion speed should still average the 60 or so per second we need to be able to convert the entire database in under 24 hours.</p>
<p>Reasonably happy with the results, we&#8217;ve started conversion on the complete database, but we won&#8217;t know how well it has worked until tomorrow.</p>
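<p>As a back-of-the-envelope check on that target, using the 1,623,826 rows reported by the ALTER TABLE on day 1 (the helper names here are just for illustration):</p>
<blockquote>
<pre># Sketch only: simple throughput arithmetic, assuming a steady rate.
def hours_to_convert(total_assets, assets_per_second):
    """Wall-clock hours needed at a constant conversion rate."""
    return total_assets / assets_per_second / 3600.0


def minimum_rate(total_assets, hours):
    """Assets per second needed to finish within the given hours."""
    return total_assets / (hours * 3600.0)


if __name__ == "__main__":
    print(round(hours_to_convert(1623826, 60), 2))
    print(round(minimum_rate(1623826, 24), 2))</pre>
</blockquote>
<p>At a steady 60 assets per second the whole table takes roughly seven and a half hours, so there is a fair amount of headroom if the rate keeps sagging on the larger assets.</p>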
<p><strong>Day 3</strong>: Stay tuned!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.adamfrisby.com/blog/2009/04/dbs-considered-harmful-to-my-sanity/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
	</channel>
</rss>
