Adam Frisby

Archive for the ‘performance’ tag

The Imaginary 45K Wall

with 8 comments

Kohala (the SimHost region on OSgrid)

I see reports on a fairly regular basis that OpenSim supports 45,000 prims versus Second Life®’s 15,000. It’s rubbish, but it comes from a somewhat logical source. The viewer itself will not display more than 45,000 objects in the ‘prims parcel supports’ field. There’s no technical reason for it – it just clamps the value (and I’m not entirely sure why). If OpenSim is set to unlimited prims (or 99,999,999), the viewer will show it as ‘Supports 45,000 Prims’ and not what it really is (‘Supports 99,999,999 Prims’).
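To make the clamping concrete, here is a minimal sketch of what the viewer appears to be doing. The function and constant names are my own invention for illustration, not anything taken from the viewer source; only the 45,000 figure comes from the behaviour described above.

```python
# Hypothetical names; only the 45,000 ceiling comes from the observed behaviour.
VIEWER_DISPLAY_CAP = 45_000

def displayed_prim_limit(region_prim_limit):
    """Return the number the viewer shows, not the number the region allows."""
    return min(region_prim_limit, VIEWER_DISPLAY_CAP)

print(displayed_prim_limit(15_000))       # below the cap: shown unchanged
print(displayed_prim_limit(99_999_999))   # above the cap: shown as 45000
```

The region's real limit is untouched; only the number on screen changes.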

But you can go well above that boundary. Some of Shenlei’s Leviathan builds are now breaking the 160,000 primitive mark (and I have no doubt she intends to push it further!), but those builds are incredibly intensive on other aspects of the system, particularly memory usage. Primitive counts as a resource delimiter have never been accurate as far as underlying consumption goes; how they evolved goes straight back to when SL was still an MMORPG with nifty building tools (circa 2002-2004).

This is evident in SL – certain scripts cause “lag”, popular clubs can block other users from accessing the region (by eating up the whole max avatar count) – neither of these is factored into the current resource limits. This is applicable in OpenSim too – an unscripted region will behave better than a scripted one, an empty region will have uptimes measured in months, a popular one in days (or hours).

So, with this entry I aim to do two things. First, dispel the myth that OpenSim supports 45K primitives. That is incorrect – OpenSim supports whatever you tell it to handle; whether it behaves is up to the underlying consumption required for your build, and what you are hosting it on. Second, clarify where the limits really are, and how you can optimise for them.

With SimHost, we needed to give a number to our potential customers that is indicative of their usage, in a manner that can be understood like prim counts, but reflects the actual capacity used. We decided to go with RAM usage – this is because memory is the primary requirement of the OpenSim software. The amount of memory a region needs is pretty much directly proportional to all the other requirements (scripts need a roughly equal amount of memory as CPU, so do avatars, prims, etc.). There are other limits too – network bandwidth, processor usage, etc. All of these can become a bottleneck depending on the design of a region.

To give you a rough estimate of capacity-by-memory, one of our heavier customers has a 10,000 prim sim which hosts weekly meetings; it’s somewhat scripted – memory use for this region is between 405MB (Resident) and 1070MB (Total). Each avatar in the region adds between 20 and 50MB to the “resident” figure (and when the region is occupied, some of the paged memory moves into resident as it is accessed). If you use this as an example – 1024MB of resident memory should get you a “standard region equivalent”; if you want to start pushing it further, then you might want to allocate 2GB dedicated to the region.
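As a back-of-the-envelope version of that estimate, the sketch below extrapolates from the one customer region quoted above. Treating memory as a fixed base plus a linear per-avatar cost is my simplifying assumption (it ignores paged memory moving into resident), and the function names are made up for the example.

```python
# Figures taken from the example region above; the linear model is an assumption.
BASE_RESIDENT_MB = 405        # resident memory of the 10,000-prim example region
PER_AVATAR_MB = (20, 50)      # each avatar adds 20-50MB to the resident figure

def resident_estimate_mb(avatars):
    """Return a (low, high) resident-memory estimate in MB for an avatar count."""
    low, high = PER_AVATAR_MB
    return (BASE_RESIDENT_MB + avatars * low, BASE_RESIDENT_MB + avatars * high)

def max_avatars(budget_mb):
    """Avatars that fit in budget_mb in the worst case (50MB per avatar)."""
    return max(0, (budget_mb - BASE_RESIDENT_MB) // PER_AVATAR_MB[1])

print(resident_estimate_mb(12))   # (645, 1005)
print(max_avatars(1024))          # 12
print(max_avatars(2048))          # 32
```

On these numbers a 1024MB allocation covers a meeting of about a dozen avatars in the worst case, which is roughly what you would expect from a "standard region".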

Network bandwidth is directly tied to the number of avatars, the number of primitives, and the number and size of textures. You can drop your bandwidth requirements fairly dramatically simply by building more efficiently, encouraging texture re-use, optimising your textures, etc. Sculpties actually work to your benefit here, since they can replace many prims with just one, and that one is ‘instanced’ – so no matter how many copies you use, it is only downloaded once by the viewer.

Processor usage is generally not a problem; to avoid any issues, giving a region a dedicated core will let it do its own thing. Scripts are about the only thing that can really push this figure (and physics to a much lesser degree). With recent updates to OpenSim enabling much larger concurrencies, processor usage is beginning to show up as a factor; but an average user will often struggle to push average CPU core usage above 10%.

So, next time you see the claim that ‘OpenSim supports 45,000 prims’ (as I often do) – think of it not as a hard limit, or even a ballpark figure that is remotely accurate. OpenSim will try to serve whatever you tell it to – but whether it does so successfully is more likely to be up to other factors relating to the underlying hardware than the software itself.

Written by Adam Frisby

November 2nd, 2009 at 2:41 pm

85.

with 9 comments

OK, so we didn’t quite get to 100 as originally planned – but this time it wasn’t OpenSim’s fault. Yes, by the end you could tell the sim was straining – and at about 65 avatars, the physics engine finally choked on trying to solve a 15 avatar capsule interpenetration (or at least, my interpretation of the bug – analysis pending); but it kept on accepting logins and people kept arriving – and very quickly we hit 70, … 75, … 80 then peaked at 85 before running out of people, slipping back to 79 and manually shutting the sim down to grab the all important debug dump.

85 Avatars in Wright Plaza

It’s important to note here – these were real clients, using SL-derived viewers. By comparison libsl is a lot friendlier on the packet engine than the full viewer, so bots tend to be a less effective test. (Plus users introduce randomness that bots can’t quite emulate.) Wright Plaza with 85 avatars and their attachments weighs in at a healthy 15,400 prims – so there was no shortage of texture or prim data to be sent to each client – it’s actually probably one of the nastiest sims to do load tests in – which makes it great for this. Furthermore, the hardware it is located on isn’t exactly top of the line, or even middle-of-the-line.

The short news is – we’ve made some really impressive progress in the last week. Earlier we got up to 50 – which was tweaked, tailored and adjusted to get us to where 100 or even 150 isn’t really that out of the question anymore. There are three big causes for this – first, abandoning OpenJpeg for decoding J2K textures made some very noticeable improvements to stability (work is in progress to abandon it for encoding too); this means we’re not crashing on the way up – which means we can hit higher concurrencies more reliably. Second – John Hurliman from Intel rewrote our throttle routines and some low-level packeting code, which delivered a big boost to packet performance. Third – multiple efforts to reduce memory use in key places have at least halved operating memory requirements – at 85 concurrent, memory was peaking at a mere 1.7GB (~20MB/user).

A result of these improvements is that memory IO is no longer such a major bottleneck – we’re actually beginning to hit the point where CPU usage is becoming the more important bottleneck (we were hitting 90% CPU at peak – although the physics interpenetration mentioned above might be distorting this, since it could lead to run-away CPU use). That is a refreshing change, since CPU use is a lot easier to optimise around, and the tools for CPU profiling are a lot better than those for memory IO profiling and produce a lot more meaningful information.

We’d like to continue these load tests – the information the devs have gotten in the last week has been absolutely invaluable. Having a big pool of testers able to jump in at a moment’s notice has resulted in getting performance fixes tested and integrated a lot faster than usual – it’s also helped stability, since each crash has been diagnosed and debugged in series as it is encountered. It’d be very easy to say that performance and stability wise, more has happened in the last week than the last 6 months – and we still need your help to keep going. We’re going to be continuing these load tests next week – there will probably be another major effort at getting 100+ avatars in a sim next Friday (same time, 1PM PST). If you want to know when the next test is planned, and help out – either hang around in #opensim on Freenode, or follow @osgrid or @adamfrisby where I’ll announce them as they come.

Next stop, 150.

Written by Adam Frisby

October 9th, 2009 at 11:39 pm

Persistent Vegetative Simulation

with 3 comments

If a tree falls in a forest and nobody can see it, does it still fire a collision event?

In Second Life(R) it does, and this is one of the reasons that the platform is so expensive – maintaining 20,000 regions requires 20,000 processor cores. Consider the WWW by contrast – a single server can power thousands of websites, because the cost of hosting is virtually free until someone accesses the site. Cost there is not tied to how much space is being hosted – it’s based on simultaneous concurrent accesses and the complexity of processing the requests.

OpenSim falls into the same trap in many ways – while the cost of hosting an empty region is nearly nil; there are two key aspects where processing will always continue even if there is no need or desire to do so. These areas are physics and scripting – arguably two of the larger expenses on the processor bill.

Physics can be limited somewhat – switching away from a persistent simulator (ODE) to something that computes physics ‘on demand’ (POS, BasicPhysics) will remove the expense of physics while no-one is in the region, but you lose the ability to have objects move independently. For those hosting geographical areas this may be perfect – if you don’t have any scripts in the region and compute physics purely on demand, you are looking at a raw cost in just memory – about 20mb per region, and even that can be paged to disk safely.

However, hosting conventional regions is a trickier prospect – users have an expectation of certain features working, people use scripting and want to physically interact with objects. One of the options worth considering may be simply dialling the simFPS according to the number of viewers in the region. Drop down to 1Hz when there are no users, but dial back up to 45Hz when users appear – all easy enough to do within the codebase, although it doesn’t reduce the processor cost to zero while inactive.
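A minimal sketch of that dial, assuming a simple two-state policy. The names are hypothetical and this is not OpenSim’s actual frame loop; only the 1Hz and 45Hz figures come from the paragraph above.

```python
# Two-state simFPS dial; names are illustrative, not OpenSim identifiers.
IDLE_HZ = 1       # nobody in the region
ACTIVE_HZ = 45    # at least one viewer connected

def target_sim_hz(viewer_count):
    """Pick the simulation rate from the number of connected viewers."""
    return ACTIVE_HZ if viewer_count > 0 else IDLE_HZ

def frame_interval_seconds(viewer_count):
    """How long the main loop should wait between simulation frames."""
    return 1.0 / target_sim_hz(viewer_count)

print(target_sim_hz(0), frame_interval_seconds(0))   # 1 1.0
print(target_sim_hz(3))                              # 45
```

An empty region still ticks once a second here, which is why the cost drops a long way but never reaches zero.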

If we consider the ultimate goal to be reducing Virtual World running expenses to something similar to that of a web-host (costs being per-access, measured in processor time, bandwidth, etc.) then we need to go a few steps further. One of the bigger steps is reducing expectations – right now you can safely assume LSL will always be running, so therefore you can write things like “servers” in LSL. Servers are a good example of the problem – they are something that interacts with the world, but aren’t supposed to be part of it; yet they consume the expenses of being in the world all the same.

Servers could be much better replaced by simple web scripts or server daemons on a normal server – which have a decent API that can connect to the world and make their interactions without needing to be part of it. By doing so, you not only make the servers more efficient (Apache+PHP is faster and more powerful than LSL), but you reduce the persistent load on the world itself.

The work I am doing on MRM is addressing part of this problem – redefining the API to reduce the number of scripts you need in the world, but a total solution here includes some kind of “remoting-style” API that lets you send messages and interact with the world, from a completely independent outside authority. The other side of the same coin is reducing ‘background noise’ processing – timers, sensorrepeats and other recurring tasks all place a small but visible background load on the server, all of these events will keep the server processing even if there is nothing to do.

Some things may require this – but encouraging people to think about when they need these events is better. Limitations on the LSL API really prevent this in SL, but in OpenSim would you be willing to use a “TimerIfAvatarInRegion” instead of a plain Timer? If you could – then it would allow the back-end server to attach conditions to your scheduler which provide optimisations when no-one is there. Allowing these kinds of optimizations is going to be key in the long term – because a cost of one processor core per four concurrent users is just simply insane and will never be adopted in the wider marketplace VW operators hope to gain. If webservers could only handle four simultaneous users – Google would require millions of servers to operate instead of just thousands.
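To show what I mean by attaching conditions to the scheduler, here is a toy sketch. “TimerIfAvatarInRegion” does not exist anywhere yet, and the class and parameter names below are hypothetical; the point is only that a timer flagged this way costs nothing on an empty region.

```python
class RegionScheduler:
    """Toy scheduler: timers may declare that they only matter when avatars are present."""

    def __init__(self):
        self.avatar_count = 0
        self._timers = []   # list of (callback, needs_avatar) pairs

    def add_timer(self, callback, only_if_avatar_in_region=False):
        self._timers.append((callback, only_if_avatar_in_region))

    def tick(self):
        """Fire every timer whose condition holds; return how many fired."""
        fired = 0
        for callback, needs_avatar in self._timers:
            if needs_avatar and self.avatar_count == 0:
                continue    # skipped entirely: no script time spent
            callback()
            fired += 1
        return fired

scheduler = RegionScheduler()
scheduler.add_timer(lambda: None)                                  # plain Timer
scheduler.add_timer(lambda: None, only_if_avatar_in_region=True)   # conditional timer
print(scheduler.tick())      # 1 - region is empty, only the plain timer runs
scheduler.avatar_count = 1
print(scheduler.tick())      # 2 - someone arrived, both run
```

A real implementation could go further and stop waking the script engine at all while every remaining timer is conditional.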

Written by Adam Frisby

April 11th, 2009 at 7:51 am

Posted in OpenSim


Ideas for Scene Graph Optimisation

with 2 comments

Author’s note: I use terms such as ‘disadvantage’ when referring to Second Life’s building tools as a comparison to professional tools with professional artists; naturally, user generated content tends to lean towards less efficient building techniques. This is not a slight on the content creators themselves, just that the tools make lots more work for people writing renderers and dealing with efficiency.

Second Note: Like my previous post, a large deal of this is speculation. I plan on confirming or denying a large number of my suspicions with the Xenki viewer’s design, but at this point this should be read as just ramblings on the author’s blog rather than any authoritative statement.

As a sidenote from my previous post – I have some more ideas I’d like to try to put into practice directly with rendering Second Life(tm)-style scenes faster for Xenki. The mainline SL client achieves it, as far as I can tell, through a combination of utter brute force (equivalent to sending an entire dam through a garden hose every minute – it’s pretty impressive) and lots and lots and lots of caching.

This is not going to play well with WPF at all (I can see that much already): first, we don’t have access to low level hardware, and second, I don’t want to debug a thousand graphics glitches with every nuanced bit of hardware. Thanks, but no thanks, I’d rather let MS worry about that part.

So, if brute force is out of the question, what options exist for making things render faster?

First is the obvious one – let’s cache better.

One of the things that has been lamented previously is that Second Life has dynamic content, ergo we cannot cache the scene. I suspect this isn’t the whole deal: while it is true that every object in the scene can potentially be moved (by scripts or by avatars building) at any moment, we can evaluate a lot of them on probabilities and discount swathes as unlikely to move.

Objects

Objects can be pretty easily split between “Likely to move” and “Unlikely to move.” Likely to move objects were either recently created, marked temporary or physical, or contain scripts. While it is true the others could still move, the probability is significantly lower, and therefore we can more readily cache them. If they get moved, then we’ll need to rebuild that cache (without the object that moved), but for now – it’s acceptable.
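The split described above could look something like the sketch below. The field names and the five-minute “recently created” threshold are assumptions picked for the example; nothing here is taken from Xenki.

```python
from dataclasses import dataclass

RECENT_SECONDS = 300   # assumed cut-off for "recently created"

@dataclass
class SceneObject:
    name: str
    age_seconds: float
    temporary: bool = False
    physical: bool = False
    has_scripts: bool = False

def likely_to_move(obj):
    """Recently created, temporary, physical or scripted objects stay out of the cache."""
    return (obj.age_seconds < RECENT_SECONDS
            or obj.temporary or obj.physical or obj.has_scripts)

def partition(objects):
    """Split a scene into (static, dynamic) lists."""
    static, dynamic = [], []
    for obj in objects:
        (dynamic if likely_to_move(obj) else static).append(obj)
    return static, dynamic

scene = [
    SceneObject("wall", age_seconds=86_400),
    SceneObject("door", age_seconds=86_400, has_scripts=True),
    SceneObject("new box", age_seconds=10),
]
static, dynamic = partition(scene)
print([o.name for o in static])    # ['wall']
print([o.name for o in dynamic])   # ['door', 'new box']
```

Anything in the static list that does end up moving simply triggers a cache rebuild and gets reclassified.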

This cache could take the form of rendering the entire ‘static’ portion of the scene to a single massive vertex buffer, and then rendering the dynamic elements individually (or in smaller caches). This is very similar to how modern games work – however in that case you have the advantage of being able to build a BSP tree in the editor. I am uncertain as to whether we are capable of doing BSP generation fast enough to make this dynamic cache feasible, but it is an interesting idea nonetheless (insert additional concerns about wide open spaces and BSP trees here).

A potential downside here is that we’ll need to change how LOD works for this to be effective – rather than having LOD calculated “on the fly” as your camera navigates, we will need to force the scene, then only update LOD periodically as the cache refreshes. In this case, LOD may become a function of the size of the object in absolute terms rather than relative to screen space.

Maintaining this cache on an idle processor

One of the great things about processors lately has been the abundance of cores added; this means chances are there is a piece of hardware sitting in the machine without much to do. We can leverage this by doing the cache building and maintenance on a separate thread which runs on another processor. Because the cache is not a prerequisite to rendering, we can optimise the cache in the background, then use it when it is available.
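Structurally, the background cache might look like the sketch below: the renderer keeps asking for the cache and falls back to drawing objects individually until a build has finished. The class is hypothetical, objects are reduced to plain lists of vertices, and note that Python threads show the shape of the idea rather than real multi-core speed-up.

```python
import threading

class StaticSceneCache:
    """Merged vertex list for the static scene, built off the render thread."""

    def __init__(self):
        self._lock = threading.Lock()
        self._vertices = None   # None until the first build completes

    def rebuild_async(self, static_objects):
        """Start a background rebuild and return the worker thread."""
        snapshot = [list(obj) for obj in static_objects]   # copy so the scene can keep changing
        worker = threading.Thread(target=self._rebuild, args=(snapshot,), daemon=True)
        worker.start()
        return worker

    def _rebuild(self, objects):
        merged = [vertex for obj in objects for vertex in obj]
        with self._lock:
            self._vertices = merged   # swap in the finished cache in one step

    def get(self):
        """Return the cache if ready, else None (caller renders without it)."""
        with self._lock:
            return self._vertices

cache = StaticSceneCache()
print(cache.get())   # None: nothing built yet, render uncached
worker = cache.rebuild_async([[(0, 0, 0)], [(1, 1, 1), (2, 2, 2)]])
worker.join()        # a real renderer would not block here
print(cache.get())   # [(0, 0, 0), (1, 1, 1), (2, 2, 2)]
```

The renderer never waits on the worker; it just picks up whichever cache was most recently completed.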

Handling Textures Better

Second Life has the disadvantage of not using professionally created textures on every surface – this means that it’s possible for a microscopic object that you cannot really see to have a massive 1024×1024 texture attached to it, increasing both bandwidth usage and the amount of texture memory consumed in displaying your scene.

An idea for fixing this problem could be to measure the surface area each texture is applied to, then use this surface area to approximate what resolution we should render each texture at (converting that 1024×1024 texture down to a 32×32 texture if it is only used once, on that object).
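One way to turn surface area into a resolution is sketched below. The texel density of 64 texels per metre, the 8-pixel floor and the function name are all numbers and names I have made up for illustration; the real values would need tuning against actual scenes.

```python
import math

def target_resolution(surface_area_m2, source_size=1024,
                      texels_per_metre=64, min_size=8):
    """Pick a power-of-two texture size for the total area a texture covers."""
    if surface_area_m2 <= 0:
        return min_size
    wanted = math.sqrt(surface_area_m2) * texels_per_metre   # edge length in texels
    size = 1
    while size < wanted:      # round up to the next power of two
        size *= 2
    return max(min_size, min(source_size, size))   # never upscale past the source

print(target_resolution(0.25))     # 32   - a 0.5m x 0.5m face
print(target_resolution(400.0))    # 1024 - a 20m x 20m wall keeps full size
print(target_resolution(0.0001))   # 8    - a 1cm x 1cm speck
```

Summing the area over every face that uses a texture before calling this would keep a widely used texture sharp while shrinking the one-off ones.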

Doing this, in combination with careful management of the amount of texture memory available (downsampling to fit memory and applicability together), may get around at least part of the “huge texture memory consumption problem”.

Written by Adam Frisby

August 6th, 2008 at 4:43 pm

Posted in Xenki


Procedural Generation of Prims considered harmful?

with one comment

Yep.

I said it – one of the things that’s been touted as so fantastic about SL’s rendering performance is the speed at which you can push them to the graphics card, the amount of caching in vertex buffers that can be done, etc.

I’m about to say that it actually doesn’t seem to matter that much, and Prims lose out in a lot of cases for some very interesting, but difficult to fix reasons. Doing performance workarounds for this is going to be complex, irritating and make me wish I was dealing with my precious meshes.

I should note here, that the performance of the XBAP application on my crummy laptop graphics card is still relatively solid – and I’m brute forcing nearly every operation at this point.

Reason Number Uno: Fill rate, “invisible” triangles.

Prims waste a lot of triangles in areas we cannot see – occlusion culling of whole objects works well here, but it doesn’t work when we’re dealing with potentially a few thousand triangles that are part of an object, but inseparable. This is due more to construction techniques than to something we can fix at the renderer level, but nonetheless it has a major impact on performance.

Possible Solutions

I’m experimenting with using CSG (Constructive Solid Geometry – boolean operations) at the moment as a method of reducing the number of hidden triangles pushed to the screen. This will have some complexity when involving transparent surfaces, but if we discount transparent primitives from the algorithm we may get a reasonable reduction in the number of triangles pushed to the screen, at the expense of increasing the number of vertex buffers used (prims do have vertex caching on their side).

This is something I plan on experimenting with and am looking at ways to do CSG in C# without me having to dig out research papers.

Reason Number Duo: Really Inefficient Texturing

This is a more annoying issue – namely that as we start drawing triangles for the procedural surface, we have to flick the texture index multiple times to render the primitive (assuming it isn’t the same texture on all sides). On a spherical or curved surface this isn’t so much of a problem – we push a few thousand, flip, push a few thousand more. Fine.

On boxes – Push 2 triangles. Flip. Push 2 triangles. Flip. Now, of course it’s better not to flip at all – and as some people will point out, pushing 2 triangles vs a few thousand is better and still more efficient. The problem here is how primitives differ from mesh based models.

Traditionally in mesh based modelling, you generate a single texture with a UV map for the entire object. By wrapping and contorting it, you can render the entire object in one single pass, which means we don’t need to pause, do a new texture lookup, and repeat as many times. It still happens occasionally, but the number is much much lower.

If your scene (such as in a modern game) only has 50 uniquely textured objects on scene at once (look closely and you will find it’s probably not much higher than this number) this is fine. It works well – if we appropriately stage our render pipeline, we might even be able to group these into a single pass each.

SL? You’re lucky if your scene has fewer than 100 textures visible. I’ve seen regions where this number is many times more, potentially in the thousands – and as I pointed out earlier, we’re flipping textures midway through rendering single object collections, which is possibly hurting the performance gains we are making by being able to cache those collections originally.

Yeuck.

Some possible solutions here

There’s a couple of potential solutions to this, but I think the easiest one is to leave this to ATi/NVidia/Intel – pipelining similar textures is something I expect their drivers to do. If this does become a problem, I have some ideas in place for grouping similarly textured faces from different primitive groups into single vertex collections.
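For the grouping idea, here is a rough sketch of what I have in mind. Faces are reduced to (texture, triangles) pairs and the function names are hypothetical; it also ignores transparency, which would need draw ordering preserved.

```python
from collections import defaultdict

def count_texture_switches(faces):
    """Texture binds needed when faces are drawn in the order given."""
    switches = 0
    current = None
    for texture, _triangles in faces:
        if texture != current:
            switches += 1
            current = texture
    return switches

def group_by_texture(faces):
    """Merge triangles from all faces that share a texture into one batch each."""
    batches = defaultdict(list)
    for texture, triangles in faces:
        batches[texture].extend(triangles)
    return dict(batches)

# Two boxes drawn face by face, alternating between two textures.
faces = [("wood", [1, 2]), ("brick", [3, 4]), ("wood", [5, 6]), ("brick", [7, 8])]
print(count_texture_switches(faces))                          # 4
batches = group_by_texture(faces)
print(batches)   # {'wood': [1, 2, 5, 6], 'brick': [3, 4, 7, 8]}
print(count_texture_switches(list(batches.items())))          # 2
```

After grouping, the number of texture binds is bounded by the number of distinct textures rather than the number of faces.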

Written by Adam Frisby

August 6th, 2008 at 3:30 pm

 
