Programming Bad Performance

January 15, 2011

Last week an interesting problem surfaced at work. An application engineer received reports of slow performance on a particular website and needed some help from my group to track down the source of the problem. This engineer had done some fantastic research and was able to answer almost every question we threw at him about the details surrounding the issue. I am going to run through the problem itself and the questions we asked that led us to the likely culprit. The solution was discovered a few days later and ended up surprising us, as it was not something we had even considered as a possible cause. Although the application engineer collected a lot of data in the form of trace logs and packet captures, my group never had to examine any of it. The problem was solved before we actually had to dig into the data ourselves. With a whiteboard and some direct questions, we were able to point the engineer in the right direction. He did all the work.

Problem:

A URL that was Internet facing was performing very sluggishly compared to others.

When did the problem start? Unknown.

Possible causes to consider:

1. Remote end of the connection
2. Internet connectivity
3. Firewall
4. Intrusion prevention sensor/Content filter/Other security hardware
5. Router/Switch/Load balancer problem on the internal network hosting the site
6. Server hosting the site
7. Web server software on the server hosting the site (e.g., IIS, Apache)
8. Web site code (e.g., HTML, ASP, JScript, CSS, XML)

Troubleshooting: For the purposes of isolating the problem, we started with the remote connectivity and worked our way inward. From here on out, I am going to refer to the application engineer as Bob. That’s not his real name, but it’s a lot easier to type than “application engineer” or his actual name.

Had Bob checked into the remote side as being the source of the problem? Yes, he had. In fact, he ran the same checks from other ISPs and experienced the same result. That rules out item 1 on the list of possible causes.

Bob had a lot of additional information to add regarding this problem. First, this particular ā€œwebsiteā€ was really one specific URL that was problematic. Over a dozen URLs using the exact same hostname were fine. It was just this one URL that was having a problem. That rules out item 2 as being the issue. Second, Bob stated that the problem was occurring on the internal network as well. That rules out items 3 and 4 from the list of possible causes. Now we’re getting somewhere. At this point, we know that we aren’t dealing with a problem isolated to the Internet. That’s actually a good thing, because it’s never easy to explain to people that you have no control over traffic once it leaves your network. It just comes off like you are passing the buck to non-network-savvy people.

Bob added an additional piece that would vindicate the network hardware. He stated that the average packet size on all of the URLs that were working great was somewhere over 1,000 bytes. However, for the URL that was operating sluggishly, the average packet size was a little over 200 bytes. (Bob described this as the average MTU, but the MTU is a fixed per-interface maximum; what he had actually measured was average packet size.) The discussion then went on for a few minutes about how packet size affects performance and why a 200-byte average is not good compared to the other URLs and their 1,000-byte-plus averages.
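If you ever want to verify this sort of thing yourself, average packet size is easy to pull out of a capture. Here is a minimal Python sketch using the scapy library (my assumption; Bob's actual tooling wasn't specified, and the capture file name and port are made up for illustration):

    # Minimal sketch: average packet size for web traffic in a capture.
    # Assumes scapy is installed; "slow_url.pcap" is a made-up file name.
    from scapy.all import rdpcap, TCP

    packets = rdpcap("slow_url.pcap")

    # Keep only packets to or from the web server port (80 here, for illustration)
    web = [p for p in packets
           if p.haslayer(TCP) and 80 in (p[TCP].sport, p[TCP].dport)]

    if web:
        avg = sum(len(p) for p in web) / len(web)
        print("Average packet size: %.0f bytes across %d packets" % (avg, len(web)))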

At this point, we know there is a packet size problem and that it occurs on both the external and internal network. Now, I know that every switch this traffic traverses on the internal network allows an MTU of 1500, so I don’t think a piece of networking gear is causing the problem. This seems like it is going to be something with the system itself. It turns out that the server hosting all of these URLs is one of several servers hiding behind a load balancer. I know my load balancer isn’t manipulating packet sizes, so I feel comfortable ruling out item 5 as the source of the problem.

Has Bob checked the servers hosting these URLs? Bob indicates that there are 4 different servers behind the load balancer hosting these same URLs, and they are all having the same problem. He tested the URL on each individual server and experienced the same latency. It is possible that we are dealing with a problem on all 4 servers; however, the odds of the same server hardware problem hitting all of them are very low. Considering that these same servers host over a dozen other URLs that are running with no problems, I am convinced that we can rule out item 6 as a possible culprit.

Now we are looking at the web server software or the site code itself as the culprit. While I am by no means an expert when it comes to IIS, Apache, or other web server software, I am willing to bet the issue is not with the web server software. My reasoning is that only 1 URL is experiencing the problem and over a dozen other URLs are not. They are all using the same hostname, so one would expect any setting in the web server software that could influence packet sizes, if such a thing exists, to be the same across every URL.

At this point in the troubleshooting process, we figured it must be something in the code. Our recommendation to Bob was that he go back to the developers and have them check their code.

Bob came back several days later. He found the problem. Actually, there was no problem. The way the developers had coded this particular URL was what caused the behavior. In this case, they had a bunch of really small CSS files that were used in conjunction with the problematic URL. The client would make the request and then have to fetch tons of these tiny CSS files. Because the files were so small, the packets carrying them were small, which explains the low average packet size. Small files on their own wouldn’t be too much of a problem, except that in this case there were so many of them: every file meant another request and another round trip, and all of those round trips added up. That is what was causing the latency.
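Some back-of-the-napkin math shows why lots of tiny files hurt so much. The numbers below are completely made up, and the sketch assumes the worst case of one request at a time, but the principle holds:

    # Back-of-the-napkin page load estimate (illustrative numbers only).
    # Assumes requests are made one at a time, which is the worst case.
    rtt = 0.05            # 50 ms round trip time
    bandwidth = 3e6 / 8   # a 3 Mbps circuit, in bytes per second

    def load_time(num_files, avg_file_bytes):
        # Each file costs at least one round trip plus its transfer time
        return num_files * (rtt + avg_file_bytes / bandwidth)

    # One 150 KB file vs. one hundred 1.5 KB files (same total payload)
    print("1 big file:     %.2f sec" % load_time(1, 150000))
    print("100 tiny files: %.2f sec" % load_time(100, 1500))

With these made-up numbers, the hundred tiny files take about 5.4 seconds versus 0.45 seconds for one big file carrying the same data. The round trips, not the bytes, dominate.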

In this particular case, there was nothing wrong with any infrastructure or server equipment. Everything was working as designed. If nothing else, it was a reminder that developers don’t always consider application performance over the network when designing software. They routinely get beat up for having poor security; I guess you can add poor network performance to the list as well. It is a generally accepted belief that programs are usually designed for low-latency LAN environments and are very rarely designed with WAN performance in mind. I shouldn’t be surprised to find a case like this, in which the code wasn’t designed with network performance in mind at all.

I feel it is also important to point out that it is fairly difficult to write code that takes all factors into consideration (e.g., security, network performance). Maybe the best solution is to involve the various entities during the testing of code to ensure it will perform properly. I can see how this issue would have been overlooked, since it was a single URL that was affected. Had it been an entire program, it might have been caught during testing.

Make Your Job Easier

Note: While I thought about detailing the technical steps necessary for delegation on different pieces of equipment, I decided to go with the more ā€œarchitecturalā€ or ā€œphilosophicalā€ approach in this post. Besides, there are plenty of others out there who do a far better job with graphics and CLI examples.

Recently, I took some steps to make my job a little easier. I delegated access to another group that does not normally have anything to do with the network side of the house. In this particular instance, I was able to give that group access to a Cisco ACE load balancer. Normally, giving non-network people access to equipment would be frowned upon. This is especially true for equipment in a data center that controls data flows for your most important applications. I had to consider the following:

1. Can I give them specific levels of access?
2. Will they be able to perform operations with relative ease?
3. Does it make sense to do this?

Question 1 was easy. Of course we can provide granular levels of access; it is hard to find a piece of equipment on most enterprise networks that can’t do this. Question 2 was a ā€œmost likelyā€, but could have been tough if everything needed to be done via CLI. Question 3 was probably the most important. Generally speaking, most technical problems can be solved given enough time and resources (i.e., people, money, and equipment). What many of us should ask, and some of us fail to ask, is whether or not we SHOULD do something. I for one love playing with new equipment. Build an Ethernet switch that interfaces with a toaster and I want to play with it. However, is there any use for something like that? Is there a large community of people out there who want connectivity with their toaster?

The point is that while a lot of things are possible, not everything is necessary. Sometimes giving people access to network equipment can cause more harm than good. While I am a big fan of providing as much information to others as possible, if that information cannot be interpreted correctly, you are wasting your time.

For example, I have been in environments where non-network groups were given access to Netflow data. While that sounds great on the surface, the reality was that the data was being interpreted incorrectly. When looking at something like a 3 Mbps circuit, some people would see full utilization and assume that more bandwidth was required. What they failed to take into account was that the QoS markings of the traffic indicated that a bunch of AF11 (what we had deemed scavenger) traffic was using the bulk of the bandwidth. Had any additional traffic tagged AF21 or higher come over the circuit, it would have pushed down the AF11 traffic, gradually squeezing it until it hit the bandwidth limit set for that class. More bandwidth was not needed when the Netflow data was viewed in its entirety. Had this particular group understood QoS markings, they would have come to a different conclusion. Could we, the network group, have provided more in-depth training on this particular product? Sure, but how long would that training have to be before the individuals understood QoS well enough to interpret traffic flows correctly? If you are a QoS fan, how long did it take you before you understood things like shaping vs. policing? Or L2 vs. L3 markings?
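To make ā€œviewing the data in its entiretyā€ concrete, here is a small Python sketch of the idea: break utilization out by DSCP class before concluding you need more bandwidth. The flow records and byte counts below are entirely made up:

    # Illustrative sketch: break utilization out by DSCP class before
    # concluding more bandwidth is needed. Flow records are made up.
    DSCP_NAMES = {10: "AF11 (our scavenger class)", 18: "AF21", 0: "Best Effort"}

    # (dscp, bytes) pairs as a Netflow collector might export them
    flows = [(10, 900000), (10, 750000), (18, 120000), (0, 80000)]

    totals = {}
    for dscp, nbytes in flows:
        totals[dscp] = totals.get(dscp, 0) + nbytes

    grand_total = sum(totals.values())
    for dscp, nbytes in sorted(totals.items(), key=lambda kv: -kv[1]):
        name = DSCP_NAMES.get(dscp, "DSCP %d" % dscp)
        print("%-26s %5.1f%% of traffic" % (name, 100.0 * nbytes / grand_total))

In this made-up sample, the scavenger class accounts for almost 90% of the traffic, so the circuit that looks ā€œfullā€ would happily make room the moment higher-priority traffic showed up.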

Back to the issue at hand. Does it make sense to give another group access to the load balancer? Yes. In this case it did. The typical process for maintenance on a server getting requests via the ACE load balancer was to have the network group pull it out of the active pool. Then, another group would make whatever changes were needed. Once they were done, they would contact the network group who would place the server back into the pool. If you are having to make changes to a dozen servers, this process can take some time. Why not just give the group making changes to the server limited access to the load balancer so they can do everything themselves? Time and resources would be saved by all.

That brings me back to the second question: can we make it easy for them to make changes to the load balancer? In the case of the Cisco ACE, yes. We had an instance of Application Network Manager (ANM) running in our data center to help us. While I tend to be a fan of the CLI (except in the case of the Cisco ASA), not everyone else is. Sometimes a GUI is far more helpful for people who need to make changes to network gear. That’s where ANM comes in. In a matter of minutes, I was able to create a domain (which is where you define the servers and farms you are giving access to) and a role (you can create your own if you don’t like the default ones) for this other group to use. Now they had access to select servers and their corresponding server farms, but not enough access to do any real damage.

After doing that, I just had to create some instructions for the 2 tasks they would need to perform. First, they needed to know how to remove servers from a load balanced pool. Second, they needed to know how to add servers back into it. With ANM and the specific domain/role I assigned to their group, this is a piece of cake. I took the appropriate screenshots to walk them through adding and removing a server and put them in a nice, concise MS Word document. There are times when I am hesitant to put a lot of pictures in instructions; sometimes people get offended when you drop it down to an elementary school level. Thankfully, this particular group LOVED pictures, so everything worked out. In about 15 minutes we ran through the instructions. Additionally, I asked if they wanted a bit more detail about the Cisco ACE load balancer in general, so we talked about what it does and where it sits physically in the network. Everyone seemed happy with the training, and I think they were truly excited about not having to wait on the network group anymore when they needed to make changes.

Problem solved. Everyone was happy, and I know that outside group is reaping the benefits of being able to make changes on their own. I have jumped on to conference calls several times recently and noticed that servers were being added to and removed from load balanced pools without the network group having to do anything. The group I gave access to was taking care of it.

If you have the means to delegate processes to other groups, I would recommend that you do it, provided it complies with any security and administrative policies your company or IT department has. You do have those policies in place, right? 😉 If it makes your job easier, makes other people’s jobs easier, and you get to impart some knowledge about the network to external groups, why not do it?