Saturday, February 7, 2015

Latency Laggers: Don't Be a Party Crasher

In the wake of our second huge event turn-out in less than a month, and in the wake of the chaos caused by three of the 30+ users attending, it's time to discuss the issues and the choices event hosts have to make as the party slows to a crawl. You know who is lagging you. You don't want them to have to leave. You stay silent and the whole thing comes apart. You say something to save the event but it ruins it for you because you turned out someone you care about.

How does a single user crash the system?

A high latency user is a user with a connection slower than good catsup. I know and love several users with this problem. I anticipate their arrival on my grid, savor their company, and I want to have them at the big events because they are cool and they are fun. Trouble is, once they join your workshop/grid-hop/party--other users are blocked from arriving, local chat becomes impossible, teleports grind to a halt. Under the server's hood a data tsunami is building, and when it lets go, all the avatars will go down.

Think of it this way:

My father was what I like to call a ponderer. If you wanted a banana for breakfast, you better ask the night before. Otherwise it might be dinnertime before you got approval for your breakfast. Add to that, he had nine children to care for on his own. This is a high latency responder in a high traffic situation. Kids can generate requests at the rate of about a hundred per second: Can I? Will you? Are you? How do you? What if? How does?

Multiply that by nine. Pop's mental queue was so overwhelmed that the answer to any question not involving threat to life or limb was likely to be delivered a week to a month after it was asked. RL chat lag at work. 

Basically this is the Opensim latency issue. We have the host grid server, which is spitting out requests like an impatient child to the incoming avatar: Tell me this info. Give me that data. Do you hear me? Huh? Huh? Huh? Did you? Gimme. Gimme. Gimme. Gimme Now!

We have the high latency user answering:

Umm... Just a sec...

Where a sec=a-whole-freaking-minute-of-waiting.

Ignored, our host server melts down into an avatar-booting tantrum.

Or to be more precise about the cause, I'll quote Aine Caoimhe:

Viewer latency is an issue that we've begun to encounter in Opensim and appears to be a code bug/oversight that allows a single user's viewer to bring an entire region to its knees if a "perfect storm" of conditions occurs. This typically happens in a busy region with many concurrent users and/or a large amount of content when one of the users in the region is on an internet connection that is slow or unreliable. This viewer "latency" can generate a bit of lag and could even snowball into what I call a "cascading latency" issue that can reach severe enough proportions to crash a simulator.

Based on our testing, it seems that what happens is if a sim doesn't receive a packet response from a viewer (usually due to latency) it immediately resends the same packet again to try to elicit a response. The simulator seems to prioritize this resend ahead of all other user traffic, too, so a highly latent viewer can rapidly fill the entire traffic queue with resends and results in all other user traffic also grinding to a halt. If this only happens briefly and then the connection is restored, the simulator can recover once it catches up with everything. In serious cases, the other users' traffic will also start to time out because the simulator doesn't know that it's being backlogged by the queue, resulting in their traffic starting to be handled as latent too. Once it hits this tipping point, it rapidly cascades into either a complete simulator crash or the sim might manage to just kick a large number of the people from the region and then recover with whoever it left. Of course everyone who got disconnected will immediately try to return which just puts it under extreme load again....
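For readers who think better in code, here is a toy model of that cascade. It is a sketch only, not OpenSim's actual networking code: the function name, the "ticks", the capacity figure, and the one-packet-per-user rate are all made-up assumptions chosen to show the shape of the problem. The one rule borrowed from the description above is that resends for unanswered packets are served ahead of all other traffic.

```python
def simulate_backlog(ticks, capacity, users, ack_delay):
    """Toy model of resend starvation (hypothetical, not OpenSim code).

    Each tick the simulator can send `capacity` packets. Every user,
    including one latent user, needs one new packet per tick. The latent
    user's acknowledgements arrive `ack_delay` ticks late, and every
    unacknowledged packet is resent each tick ahead of all new traffic.
    Returns the number of ordinary packets still waiting at the end.
    """
    unacked = []   # tick at which each still-unanswered packet was first sent
    backlog = 0    # ordinary packets waiting for a free slot

    for tick in range(ticks):
        # Acks for packets first sent `ack_delay` or more ticks ago arrive now.
        unacked = [sent_at for sent_at in unacked if tick - sent_at < ack_delay]

        # Resends are prioritized, so they take their slots first.
        resends = min(len(unacked), capacity)
        free_slots = capacity - resends

        # Everyone's new traffic queues up behind the resends.
        backlog += users
        sent = min(free_slots, backlog)
        backlog -= sent

        # Assumption: whenever anything is sent, one packet goes to the
        # latent user and joins the unanswered pile.
        if sent > 0:
            unacked.append(tick)

    return backlog


healthy = simulate_backlog(ticks=100, capacity=20, users=10, ack_delay=2)
laggy = simulate_backlog(ticks=100, capacity=20, users=10, ack_delay=60)
print("backlog with a quick viewer:", healthy)
print("backlog with a latent viewer:", laggy)
```

With these invented numbers the quick viewer leaves nothing waiting, while the slow one soaks up every sending slot with resends and leaves a growing pile of undelivered packets for everybody else. The real simulator is far more complicated, but the shape of the problem is the same.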

Okey-dokey. So what can we do about that?

I am going to start by dropping to my knees and begging you all to do a couple of simple things before you try to join a high-traffic event. Grid owners usually know the major sources of their latency problems. They love those users. They don't want to ask them to leave an event. I'd also like to add that this doesn't mean slow users should skip every event, only the events where server resources are seriously taxed. Be kind to event hosts and check your usage.

Consider how you connect to the internet:

a) Satellite users--if it is a group of 10 or more, not gonna work. :(

b) Cell phone users--if you are using a mobile hotspot or tethering in on a cellphone it might work. It depends on a lot of things. See the pre-event speed test below to determine if it will work for you.

c) Wireless users--typically free community WiFi or hooking up at the local coffee shop is gonna be slooow. See the speed test below. Also, where possible hook up by a wire to the router. If you're using a cellphone, try USB tethering direct to computer rather than WiFi hotspot mode.

d) Throttled users--if you are over your monthly data limit and your connection is throttled. Not gonna work. :(

Pre-event speed-testing your connection:

A ping test is like ping-pong. You whack a ball (packet) of data out to a server on the internet and see how successfully and quickly it is returned. I'm pasting in Aine's description (slightly condensed) of running the test. The pics are of my test so the numbers in my test differ from those in her description.

1) Open a command prompt. Windows XP users can get one with Start > Run. In Windows 7/8, click Start and type "cmd" in the search box, which should bring up cmd.exe as an option to run. This brings up a console with a command prompt.

2) Copy and paste this in the window: ping -n 20 narasnook.com

Use the address of the grid you're going to without the port number...can be either a domain name or an IP address.
3) You will get something like this as a result:
First thing to look at is the number and percentage lost.

- if all are lost (100%) then the grid server does not allow ping tests (Nara's does but some others may not)
- if 1 or more is lost, run a second test with 50 pings instead of 20 (ping -n 50 narasnook.com)
- if 2 or more of those 50 are lost (loss of 4% or more) then the connection is likely going to be one that causes problems. Even 2% can cause problems, in fact.
- ideally you want to consistently see 0% as was the case in my test above

Next, look at the average value (my result was 96ms which is a little slow but I had other traffic on my system when I ran it)
- typical (good) average values will be anywhere from 30ms to as much as 150ms depending on distance between you and the server and how much internet traffic (in general) is going on at this time of day. Trans-Atlantic is often in the 120-150ms range. Local is usually more 30-75ms.
- if your average is >250ms you could cause issues...try a 50-ping test and see if that holds pretty stable and if it does, be sensitive when you go to the party. If you're experiencing chat lag (more than a second) or script lag (dialogs are slow to pop up), you're the likely cause.
- averages over 500ms are extremely likely to cause issues, and 1000ms (a full second!!!) is almost guaranteed sim death in a busy region

Finally, look at the max value (mine was 108ms)
- if max <200 you're fine.
- if you have a max value between 200-500ms take a look at the individual results and see how many were over 200. If it was only one or two you're probably fine. If it was more than that, run a 50-ping test and look at the individual results as they appear on your screen, counting how many of them are over 300ms. If more than about 10 of 50 are longer than 300ms then there's a good chance you'll cause problems in a busy sim.
- if your max value was over 500ms then you have some latency and are probably going to cause issues but run a 50-ping test just to see. If you're getting any sort of regular ping responses over 500ms you're almost certain to kill a region if it's busy.
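If you would rather not eyeball the numbers, the rules of thumb above can be written down as a small checker. This is a rough sketch: the function names and the wording of the notes are my own inventions, and the thresholds are simply copied from the description above. Averages between 150ms and 250ms are not covered by the description, so the sketch stays silent about them.

```python
def assess_ping(sent, lost, avg_ms, max_ms):
    """Apply the rules of thumb above to one ping summary.

    Returns a list of plain-language notes. An empty list means none of
    the warning thresholds were hit.
    """
    if sent <= 0:
        raise ValueError("sent must be a positive number of pings")
    if lost == sent:
        return ["100% lost: the grid server may not allow ping tests"]

    notes = []
    loss_pct = 100.0 * lost / sent
    if lost > 0:
        if sent < 50:
            notes.append("packets lost: rerun the test with 50 pings")
        elif loss_pct >= 4:
            notes.append("4% or more lost: likely to cause problems")
        else:
            notes.append("a little loss: even 2% can cause problems")

    if avg_ms >= 1000:
        notes.append("average of a full second: almost guaranteed sim death")
    elif avg_ms > 500:
        notes.append("average over 500ms: extremely likely to cause issues")
    elif avg_ms > 250:
        notes.append("average over 250ms: could cause issues, run 50 pings")

    if max_ms > 500:
        notes.append("max over 500ms: probably going to cause issues")
    elif max_ms >= 200:
        notes.append("max of 200-500ms: check the individual replies")

    return notes


def count_slow_replies(reply_times_ms, limit_ms=300):
    """Count individual replies slower than limit_ms (300ms in the text)."""
    return sum(1 for ms in reply_times_ms if ms > limit_ms)


# The figures quoted in the description: 20 sent, 0 lost, avg 96ms, max 108ms.
print(assess_ping(sent=20, lost=0, avg_ms=96, max_ms=108))
```

For the 50-ping follow-up, feed the individual reply times to `count_slow_replies`; a count above roughly 10 of 50 is the danger sign described above.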

From Nara:
Keep in mind situations are fluid and things can still go wrong even if you pass the test. We're in the wilds of a new frontier and we have to roll with the whims of technology, pipelines, and mercury in retrograde.

Things all users can do to lower stress on a server--
1) Don't stream videos, Skype, surf the web, and log into the party as three different avatars all at once.

2) Don't hypergrid in with 15,000 inventory items in your suitcase. If you're going to a high-traffic event, the only items in your suitcase should be a landmark, clothing and attachments you are wearing. Extra copies of your hair, skirt, feet--to replace those the hypergrid eats.

3) If your connection is slowish, log in an hour ahead of the event and sit your avatar and leave it sitting. It gives the server time to deal with you.

4) Set your viewer bandwidth to 500kbps or less.

5) If the event is at six, be annoyingly early or very fashionably late. Everyone logging in at 6 on the dot does not work.

6) If there should be a region crash, everyone trying to log back in immediately is not going to get anyone anywhere. Take a potty break. Fix a sandwich. Return at leisure.

Advice from Aine:
Things region/grid owners can do to help the latency issue--


1) Force HG to arrive in an "empty" region first, so their initial HG login stuff can be handled separately on a sim that isn't under stress. That way when they tp to the party region it's handled as a local tp and is far less difficult for the region to manage

2) Keep party regions as sparse as you dare...disable any unnecessary scripts, try not to have too many different/high res textures in it, etc.

3) Keep in mind that each avi increases sim load exponentially, not linearly, and that sitting avis (including people who are couples dancing or pole dancing) are far less of a drain on resources than an avi who is standing (or singles dancing), since they become phantom and don't use physics.

4) For a party, change opensim.ini to disable avatar collisions (so they can't bump into each other when standing/walking) since that also reduces physics calculations

5) For a party, set/manage the per-user throttle settings of the region (in opensim.ini) to restrict general traffic levels so people with viewers set to ridiculously high traffic levels don't create issues even if they have excellent connections...if your grid admin doesn't know how to do that, ask all people who visit to set their viewer bandwidth to 500kbps or less (in user preferences)

Key "tells" that typically signal a latent viewer in the region:

>>>> if the console starts to flood with "resend" messages for a specific user, they are probably using a high-latency connection. Keep in mind, though, that if the region reaches that "cascade" level of resends, you'll be seeing them for pretty much every user in the region. Some log levels might not show this on the console (I don't recall off the top of my head what logging level is required to display them).

>>>> in console run "show stats" and look at UnackB, as well as the two pending. High counts in any of these means that some latency has occurred but this can also be from initial tps into a region with high traffic.

>>>> general chat lag that persists/occurs when a sim isn't in the process of handling an inbound tp (the initial arrival in a busy region will usually cause a bit)

>>>> slow response (or complete stoppage) of script dialog boxes appearing, since that uses region chat too

>>>> if you have direct system-level access, run a per-user level test for packet resends and response times...when a latent viewer is in a region it can generate 60k+ resends in less than 5 minutes!!! If it manages to queue up enough resends, it will also start to make your other users appear latent and they'll start to rack up resends too. Compare total resends against time present in region to get an average resend rate...a latent viewer will have a very high value
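The resend-rate comparison in that last tell is simple division. Here is a minimal sketch, assuming you have already collected a resend count and a time-in-region figure for each user by whatever means your setup allows. The visitor names and the first set of numbers are invented for illustration; the 60,000 resends in 5 minutes comes from the description above.

```python
def resend_rate(total_resends, seconds_in_region):
    """Average resends per second for one user."""
    if seconds_in_region <= 0:
        raise ValueError("seconds_in_region must be positive")
    return total_resends / seconds_in_region


# Hypothetical per-user figures: (total resends, seconds present in region).
observed = {
    "Visitor A": (1200, 3600),   # invented: a well-behaved hour-long visit
    "Visitor B": (60000, 300),   # the 60k-in-5-minutes case from the text
}

rates = {name: resend_rate(count, secs) for name, (count, secs) in observed.items()}
worst = max(rates, key=rates.get)
print(worst, "averages", rates[worst], "resends per second")
```

Ranking by rate rather than by raw count keeps a long-staying visitor with a decent connection from looking worse than a latent viewer who only just arrived.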

Back to Nara again:

Bottom line here for event hosts-- if you have a high latency user dragging the whole thing down, you have to tell them what is going on and ask them to log out so the region won't crash. Up to this point I've let the crashes come where they may and hoped for the best--but that isn't fair to everyone who worked hard to prepare the event and all the people who made time to join you. Hopefully sharing this explanation of the situation and how to avoid it will limit the number of times you find yourself in that situation.

Last words from Aine:

At Refuge grid we're actively collecting and providing detailed data to the Opensim developers to help them track down and eventually eliminate the issue. Until then, there's no way to prevent a high-latency viewer from disrupting everyone else's experience, nor is there any "fix" for someone who has such an issue (although perhaps the Sl on Go (http://www.firestormviewer.org/firestorm-on-sl-go/) service would work?) except to be sympathetic to region owners and other users and not go to regions that have high traffic. A sim-owner can take steps to make a region more "volume-friendly" but of course this may defeat the entire purpose of having the sim in the first place.

I hope this helps :)

2 comments:

  1. Great post and the information is really helpful. Thank you, Nara and Aine.
