Last week an interesting problem surfaced at work. An application engineer received reports of slow performance on a particular website and needed some help from my group to track down the source of the problem. This engineer had done some fantastic research on the problem and was able to answer almost every question we threw at him in regards to details surrounding the issue. I am going to try and run through the problem itself and the questions we asked which led us to the possible culprit. The solution to the problem was discovered a few days later and ended up surprising us as it was not even something we had considered could be the cause. Although the application engineer collected a lot of data in the form of trace logs and packet captures, my group didn’t examine any of this data. The problem was solved before we actually had to get in and look at the data ourselves. With a white board and some direct questions, we were able to point the engineer in the right direction. He did all the work.
An Internet-facing URL was performing very sluggishly compared to others.
When did the problem start? Unknown.
Possible causes to consider:
1. Remote end of the connection
2. Internet connectivity
4. Intrusion prevention sensor/Content filter/Other security hardware
5. Router/Switch/Load balancer problem on the internal network hosting the site
6. Server hosting the site
7. Web server software on server hosting the site (e.g., IIS, Apache)
8. Web site code (e.g., HTML, ASP, JScript, CSS, XML)
Troubleshooting: For the purposes of isolating the problem, we started with the remote connectivity and worked our way inward. From here on out, I am going to refer to the application engineer as Bob. That’s not his real name, but it’s a lot easier to type than “application engineer” or his actual name.
Had Bob checked into the remote side as being the source of the problem? Yes, he had. In fact, he ran the same checks from other ISPs and experienced the same result. That rules out item 1 on the list of possible causes.
Bob had a lot of additional information to add regarding this problem. First, the problem was really limited to one specific URL. Over a dozen URLs using the same exact hostname were fine. It was just this one particular URL that was having a problem. That rules out item 2 as being the issue. Second, Bob stated that the problem was occurring on the internal network as well. That rules out items 3 and 4 from the list of possible causes. Now we’re getting somewhere. At this point, we know that we aren’t dealing with a problem isolated to the Internet. That’s actually a good thing because it’s never easy when you have to explain to people that you have no control over traffic once it leaves your network. To people who aren’t network savvy, it just comes off like you are passing the buck.
Bob added an additional piece that would vindicate the network hardware from being the culprit. He stated that the average packet size (we loosely called it the “average MTU”, although MTU is strictly the maximum size a link will carry) on all of the URLs that were working great was somewhere over 1000 bytes. However, for the URL that was operating sluggishly, the average size was a little over 200 bytes. The discussion then went on for a few minutes about how packet size affects performance and how a 200-byte average is not good when compared to the other URLs and their averages of more than 1000 bytes.
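To make the comparison concrete, here is a minimal sketch of how an average packet size could be worked out from the packet lengths in a capture. The lengths are made-up illustrative values, not Bob's data, and `average_packet_size` is a hypothetical helper name.

```python
from statistics import mean

def average_packet_size(lengths):
    """Return the mean packet length in bytes, or 0.0 for an empty capture."""
    return mean(lengths) if lengths else 0.0

# Illustrative values only (assumed, not taken from the real captures).
healthy_url = [1460, 1460, 1460, 620, 1460, 1460]
sluggish_url = [210, 180, 240, 150, 260, 200]

print("healthy URL average:", average_packet_size(healthy_url))
print("sluggish URL average:", average_packet_size(sluggish_url))
```

A low average like the second one does not by itself mean anything is misconfigured. It only says that most of the packets were carrying very little data.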
At this point, we know there is an MTU problem and that problem occurs on the external and internal network. Now I know that every switch this traffic is traversing on the internal network is going to allow an MTU of 1500, so I don’t think there is a piece of networking gear causing the problem. This seems like it is going to be something with the system itself. It turns out that this particular server hosting all of these URLs is one of several servers hiding behind a load balancer. I know my load balancer isn’t messing with the MTU, so I feel comfortable in ruling out item 5 as being the source of the problem.
Has Bob checked the server hosting these URLs? Bob indicates that there are 4 different servers behind the load balancer hosting these same URLs and they are all having the same problems. He tested the URL on each individual server and experienced latency. It is possible that we are dealing with a problem on all 4 servers. However, the odds of that being a server hardware problem are very low. Considering the fact that these same servers host over a dozen more URLs that are running with no problems, I am convinced that we can rule out item 6 as a possible culprit.
Now we are looking at the web server software or the site code itself as being the culprit. While I am by no means an expert when it comes to IIS, Apache, or other web server software, I am willing to bet that the issue is not with the web server software. My reasoning is that only 1 URL is experiencing the problem and over a dozen other URLs are not. They are all using the same hostname, so one would expect any sort of MTU setting in the web server software, if there is one, to be the same across every URL.
At this point in the troubleshooting process, we figured it must be something in the code. Our recommendation to Bob was that he go back to the developers and have them check their code.
Bob came back several days later. He found the problem. Actually, nothing was broken. The way the developers had coded this particular URL was what caused the slowness. In this case, they had a bunch of really small CSS files that were used in conjunction with the URL that was problematic. The client would make the request and then it would have to grab tons of really small CSS files. Because the files were so small, the packets carrying them were small too, which is why the average size we had been calling MTU was so low. I suppose that small file sizes wouldn’t be too much of a problem, except in this case, there were too many files that had to be transferred, each needing its own request and response. That is what was causing the latency.
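A rough back-of-the-envelope model shows why the number of files matters more than their size. This is only a sketch under stated assumptions: the function name `estimated_load_time`, the 80 ms round-trip time, the 10 Mbit/s link, and the two parallel connections are all invented for illustration, and the model ignores TCP slow start, connection setup, and server processing time.

```python
import math

def estimated_load_time(file_sizes_bytes, rtt_s, bandwidth_bps, parallel_connections=1):
    """Rough estimate: one round trip per request plus raw transfer time."""
    if not file_sizes_bytes:
        return 0.0
    # Requests are issued in batches, one batch per round trip.
    rounds = math.ceil(len(file_sizes_bytes) / parallel_connections)
    transfer_s = sum(file_sizes_bytes) * 8 / bandwidth_bps
    return rounds * rtt_s + transfer_s

# Assumed numbers for illustration only.
many_small = [500] * 100   # one hundred 500-byte CSS files
one_combined = [50_000]    # the same 50,000 bytes in a single file

print("many small files:", estimated_load_time(many_small, 0.080, 10_000_000, 2))
print("one combined file:", estimated_load_time(one_combined, 0.080, 10_000_000, 2))
```

With these assumed numbers the transfer time is identical in both cases. Nearly all of the difference comes from the round trips, which is consistent with what Bob found.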
In this particular case, there was nothing wrong with any infrastructure or server equipment. Everything was working as designed. If nothing else, it was a reminder that developers don’t always consider application performance over the network when designing software. They routinely get beat up for having poor security. I guess you can add poor network performance to the list as well. I think it is a generally accepted belief that programs are usually designed for low latency LAN environments, and very rarely are designed with WAN performance in mind. I shouldn’t be surprised to find a case like this in which the code wasn’t designed with network performance in mind at all.
I feel that it is also important to point out that it is fairly difficult to write code that takes all factors into consideration (e.g., security, network). Maybe the best solution is to involve the various entities during the testing of code to ensure it will perform properly. I can see how this issue would have been overlooked since it was a simple URL that was affected. Had it been an entire program that was affected, it might have been caught during testing.
Inevitably, we are all going to come across things in our jobs that we are deficient in. Maybe we know a little about a certain topic, but we need to know more. Maybe we know absolutely nothing and need a basic introduction to the topic. Regardless, there will come a time in which we need to increase our knowledge and understanding of something in this ever growing world of networking or just IT in general.
The problem, as I see it, is how to go about filling in those gaps. When you are just starting out in the IT world, you may not have a good methodology for learning about IT things. If you have been in the industry for a long time, you may already have a good system that works for you. No matter which category you fall into, the fact that you will constantly have to learn is unavoidable. There are NO exceptions to this rule. If you wish to be at the top of your game in IT from a technical standpoint, you must make a habit of constantly learning new things. Failure to do so means that your knowledge will become dated and you will drift off into obscurity working as some corporate slave in a dark and dreary cubicle. This may or may not involve working for the government. 🙂
Now that we have established that static knowledge is a dead end, let’s look at how to ensure we are always at the top of our game. I offer you the 5 step plan. Others have 12 step programs. Maybe some have less. I only have 5. I am all about efficiency…..and my program doesn’t cost you a dime.
1. Examine your current level of knowledge. – How much do you already know about the subject in question? The answer to that question is going to dictate the kind of resources you use. Let’s use BGP for example. If you need to learn about the basics of it, there are a few good books that can handle that. There are also plenty of websites with white papers and blog posts that give a generic overview of BGP. There are some classes out there that will accomplish the same thing. However, there are quite a few books and white papers that will completely blow your mind if you don’t already have a decent understanding of BGP. The service provider side of BGP comes to mind. Enterprises and service providers use BGP in VERY different ways.
2. Find out where the information is. – For starters, you need to identify what kind of learner you are. Some of us are visual learners. Some of us are auditory learners. Some of us learn by doing. Perhaps you are a mix of several different methods. Only you know what works best for you. If you need a lot of pictures and the topic is relatively mainstream, maybe a visual CBT (computer-based training) course is what you need. If that is the case, I highly recommend you check out CBT Nuggets. If what you are looking for is somewhat more obscure, then I would recommend asking other people who do what you do. There are a variety of places in which you can ask these questions, like LinkedIn, forums, or Twitter. I prefer Twitter because it is a lot quicker. The only possible problem would be having enough people see the request. If you are new to Twitter, or very rarely use it, you may not have many followers who would see your message. Feel free to engage others in a substantive manner and over time your followers will grow. If all you do is tell everyone what you ate for lunch or what the weather is like in your part of the world, you probably aren’t going to get anywhere. If you absolutely refuse to use something like Twitter, then consider posting on Cisco’s forums, if your issue is of a Cisco nature, or on networking-forum.com. There are other forums out there as well as mailing lists (NANOG comes to mind). All of the major vendors have support forums as well. Keep in mind that you may have to sift through tons of information before you finally find the information you are looking for. There is not always going to be a technical paper or book that explains exactly what you are looking for. Sometimes you have to piece it together from multiple sources. Actually, I would recommend that you use multiple sources unless it is some vendor specific thing that you can only get in one place. I have found out that you cannot trust a single source for 100% accuracy.
Not that all sources are wrong, but imperfect human beings write books, white papers, and blog posts. Other imperfect human beings double check these same sources. When the content is of a technical nature, things get missed. This is especially true for the deeper technical things.
3. Execute. – You have all of the appropriate resources identified. Now you just need to get that information into your head. There are no shortcuts. While I wish I could learn kung-fu like Neo did in The Matrix, it isn’t going to happen. You have to put in the time required to absorb all of that information. Sometimes it can be done in a matter of minutes. Sometimes it takes weeks.
4. Ignore any distractions. – In the course of your learning, you are bound to come across something else that is interesting or neat. Resist the temptation to get sidetracked and stay focused on the main thing you are trying to learn. If you want to go back at another time and research the other items that pop up, then make a note of them. By focusing on the main thing you are trying to learn, you have a better chance of retaining information than if you start going in 100 different directions with every new thing that appears.
5. Allow the information to digest. – Sometimes it helps to simply think about things. Just go over it in your head. I tend to do this in conjunction with step 3. If I need to absorb a large amount of information, I like to take it in pieces and digest it little by little. By stopping to sort things out in your head, you can really come to terms with what makes sense and what doesn’t. I am very thankful my current employer allows me the freedom to do this. While it may look like I am spacing out on any given day in my cubicle, lots of times I am just thinking about something I just read or watched. It’s my way of performing a “write memory” on my brain. One of the other things I will do is drive to and from work in complete silence. That really helps because all I have to focus on is not crashing the car, which is relatively simple.
**Note – When asking others about a certain technology or product, do yourself a favor and research it first. Try and figure some things out on your own. This isn’t so much a problem with people who have been in the industry for a number of years as it is with those who have only been in IT for a few years or less. It’s not that people don’t want to answer the question. There will always be someone who will just blurt out an answer. The issue with asking without having done any research on your own is that you miss out on a great opportunity to develop your own research methods. There’s a reason that lmgtfy.com was created and is often quoted on Twitter. It has been my experience that those who last in IT are the ones that only need a nudge in the right direction. They don’t want their hand held. They just want a sanity check every now and then. The people who never want to put in the time or effort to figure something out and habitually want you to solve their problems are the ones that won’t make it in the long run. Well, they might have a job, but they won’t be anywhere near what they could be if they put forth some effort.
I am not going to make the bold claim that the 5 steps I laid out will work for everyone. They work for me when I follow them, and I don’t always follow them. I find that the instances in which I have tried to cram something new into my head without following these steps end badly. I forget something and have to start all over again. When I take the time to really dig into something and not rush it, it tends to stay with me at least from a conceptual point of view.
It’s not that I don’t have anything to say! People who know me know that I very rarely shut up for more than a few minutes. It’s just that I have been fairly busy lately. A lot of different things have been eating into my time and writing things for a network blog take a lot of time and effort. I have a 4 day Cisco ACE class next week in which I will be out of town, so I hope to get several posts done at night when I am sitting in the hotel. You don’t actually think I will be going out at night do you? Hmmmm…..a week away from the office and a training day that ends at 4:30pm. That leaves me all sorts of time for the following:
1. Catch up on the billion or so web pages I have bookmarked.
2. Get some things written for the blog that revolve around possible competitors to the Nexus 7000. With HP, Arista, Brocade, Force10, and Juniper selling competing products, there’s a lot of data to sift through. I honestly have no idea who will come out on top. It might just be the Nexus 7000!
3. Comment on my experience with the ACE class I will be taking with Global Knowledge. I’ve spent the last several days at work focused on ACE, so I am very interested in filling in the gaps of my knowledge regarding this interesting product.
4. Read up a little more on the Cisco/EMC/VMware vBlock concept. I went to a presentation today about that and am intrigued to say the least.
5. Write about the concept of baselining your in-house applications. This would be focused on knowing what the normal TCP/UDP operations look like from a packet capture standpoint.
I try and keep a running list in Evernote of the things I would like to write about. The list continues to grow, but the time it takes to transform just one of those ideas into a somewhat coherent post just hasn’t been there.
I hope to have some new content up early next week. The last thing I want is to end up abandoning this blog and waste all my time playing mindless games on my iPad, although I do enjoy doing that a few times a week.
It’s late August here in the United States. That means one thing for a lot of people. Football is starting. No offense rest of the world. Your football is my soccer, although I tend to side with you that my soccer should be called football. How often does one kick an American football? A LOT less than we touch the ball with our hands. I’m getting off on a “semantics” tangent though. It is the one sport that predominately resides within North America. Yes, I am acknowledging you too Canada!
Many athletes at all age levels have been practicing for several months and are ready to get started with the football season. Many a Saturday afternoon, Sunday afternoon, and Monday night will be spent watching people knock each other over to carry a piece of pig skin across some lines on the ground and celebrate by dancing as gracefully as one can when covered by all that protective gear. Millions of people will watch all the way up to early next year when the championships are decided by as little as 1 point. For the most part, there are no do-overs. All of you “instant replay” fans just bite your tongue and let me carry this analogy as far as I can. When the game is over, it is over. There are no series of games like baseball, hockey, and basketball have. You have one shot at glory. Miss it, and you’ll have to wait until next season.
There’s a German proverb which says: “To aim is not enough. You must hit!”
I get paid for things that go bump in the night. Whether that thing happens to be a router failing, or a circuit deciding it no longer likes my 1’s and 0’s, my job is to fix it and fix it fast.
I do come to work during the day. I go to meetings and look at configurations of various hardware. I build network diagrams and dispense or seek advice on a number of different things. I participate in the important philosophical discussions like whether or not Anakin Skywalker was a better Jedi than Luke or Yoda (in my opinion, Anakin (aka Darth Vader) was the better Jedi and was robbed of his destiny by his meddling child and his rebel scum friends). I put in change requests for maintenance that must be performed. I can plow through the day to day stuff with hardly any interaction from management. Of course, they care about the quality of the work and if I used these stencils in my Visio diagrams, they might object. However, my overall existence in the day to day network operations life is rather calm.
In essence, I do the things that need to be done during the day, but my REAL job comes in spurts. Kind of like football players. (From now on, when I say football, I mean American football.) My game time comes at odd hours much like the police officer or fire fighter. When trouble happens, I need to perform. I need to be able to ask the right questions and formulate a short list of what the possible problems are. I need to be able to troubleshoot in a logical fashion either working up/down the OSI model or grabbing a packet capture and examining the session flows. When it is my equipment or systems that are at fault, I have to get in there and make the big play. I need a touchdown each and every time. I can’t drop a pass or fumble the ball. I get paid for results and rest assured my management is watching. They have to. All it takes is for someone much higher up on the food chain to ask why they pay the salaries of network people who can’t seem to fix the network. Then, I am out on the street forced to sell my services to the highest bidder, who hopefully doesn’t play Dungeons and Dragons or World of Warcraft with any of my now former co-workers/managers. I would have used a sport like tennis or basketball, but since I am in IT, the odds of that happening are much less than a bunch of technical geeks sitting around in Viking helmets and leather tunics taking part in the raid of an Ogre village on World of Warcraft over a shared broadband connection in an obscure apartment complex deep in suburbia while guzzling Red Bulls and listening to angry death metal music. By the way, for all of you D&D geeks who are shaking your heads in disgust at my mention of Ogre villages, I get it. I saw Shrek. I know they are solitary creatures, but I needed an effective illustration. If I used Elf village, the visual would have been less powerful.
Am I saying that we can’t make mistakes? Well, that depends. There are some places in which you can’t. Ever. Most places will allow mistakes. We’re all human and mistakes will happen. Of course, with enough attention to detail those mistakes can be minimized significantly. What I AM saying is that you need to be able to perform when a crisis hits. Your entire career at a particular company may come to a screeching halt over just a few minutes of doing the wrong thing. It won’t matter how long you have been with company X if your performance is so poor that company X starts bleeding millions of dollars due to an outage that you can’t fix. Problems are going to happen. Outages are going to happen. If your company expects you to fix them, you better fix them. Now I know that some people get in over their heads. It may be the company’s fault for placing an unrealistic demand on you, or it may be your fault for misrepresenting your capabilities. If your company is expecting you to fix and support issues with F5 load balancers and you have never so much as looked at an F5 load balancer, you better let someone know and get up to speed as fast as you can. After all, your job typically is whatever your company says it is. Don’t like that? Tough. Go somewhere else. Life isn’t fair. Sometimes you are the one in the room everyone is counting on to fix the problem, even if it isn’t your equipment that is causing the problem.
In the interest of brevity, let me close with some thoughts on how to ensure your performance is top notch.
1. Know what the scope of your job is. – This may seem a bit simplistic, but you need to be on the same page as management when it comes to your responsibilities. You cannot rely on someone else to tell you if that piece of network gear buried in some rack in a data center is your responsibility. You are going to have to find that out yourself and it needs to happen before the problem occurs. Hopefully your co-workers who have been there longer than you have a good grasp on what things belong to you. For example, if your boss expects you to take care of the wireless network, you better do it or have it handed off to someone else who can take care of it when a problem arises.
2. Develop your skills around your responsibilities. – I’m not advocating you abandon any sort of professional development that is not DIRECTLY related to your job. However, a BIG part of getting a pay check from a company is directly tied to being able to do your job as defined by the company. Good managers won’t load you up with things you are not able to do unless you have managed to con your way into a job by being a bit liberal with your resume. If you are stuck with something you are relatively new to, do the best you can and make sure your management KNOWS you are doing the best you can. Read books, configuration guides, white papers, and other technical documentation. Attend a training class. A career in IT is all about adaptation. None of us are working with the same hardware/software we were 10 years ago. If you are, odds are you either work for the government or a REALLY cheap company. Perhaps there are one or two things that have had a ten year plus shelf life, but for the most part, technology changes so fast that a decade is a lifetime in IT.
3. Be prepared. – Expect the unexpected. Think about different failure scenarios and design the network to remediate any single points of failure. If need be, have some block time purchased with an external consultant or VAR that has considerable experience with your specific hardware/software platforms. Carry maintenance contracts on all your hardware/software that is critical.
4. Raise any red flags early on. – If there are issues you know are going to be a problem, let someone know as soon as possible. Document those issues. Fix those issues. Even if the company says no due to budgetary reasons or some technical issue, at least you have done your homework and tried to make these issues known. If a problem does occur, nobody can come back to you and say that you should have known about this, or that it was your fault, etc. Additionally, it may work out to your benefit as management typically appreciates people who just want to make things better and do what is right for the stability of the network.
5. Stay calm during the outage/problem. – Remember that in a lot of people’s eyes, it is always the network that is at fault. Don’t let that get you down. Stay focused and work on the problem at hand. Ask as many questions as needed to get an idea of what the scope of the problem is. Don’t be afraid to ask very basic questions. One of the best ones to ask is “What changed?” or “When did the problem start?”. Maintain professionalism at all times. I get upset when I am on a conference call and someone won’t stop moaning about why it’s not their fault long enough for me to ask a question or answer one. However, it’s rather immature and unprofessional for me to lash out at them in anger even if I know it’s not my issue. There ARE times when I have had to tell someone to stop talking so that I could either answer a question or ask one of someone else on the call. I hate having to do that, but sometimes in the interest of getting it fixed YOU HAVE TO. If you are dealing with an issue where people are congregating around your desk watching over your shoulder, try and tune them out. You can’t always tell them to get lost or to leave you alone. You have to learn to work under pressure, but if you have taken item number 2 to heart, you should be able to minimize the time these people are hovering near your desk.
6. Be humble. – If people know that you don’t know it all, they tend to cut you a little more slack. If you are condescending and treat people like garbage because they don’t know the difference between a “routed” protocol and a “routing” protocol, they will be very unforgiving of your mistakes. Remember, there is no way possible you can know it all. There are people out there who know far more than you do. Sometimes they are in the same room as you. If you save the day and score a touchdown, good job. You don’t have to do the happy dance in front of everyone if you figure out what caused that routing loop. Your actions will speak for themselves. On the other hand, if you storm into the room demanding people shut up and watch you perform, you better get it right. If you don’t, your stock just went down and at some point, you’ll be looking for work elsewhere.
Ask any athlete how hard they have to work in order to get to their peak performance level and you’ll no doubt hear a recurring answer. You will find that it took a lot of time and effort to get there. There are no shortcuts. When the wide receiver catches the ball and runs 80 yards to the end zone for a touchdown, you can bet he ran sprints hundreds of times in the months prior. When the quarterback throws the ball for 50 yards and drops it right on the chest of the wide receiver, you can bet he threw that same pass hundreds of times in the months prior. When the defensive end wraps his arms around the running back and slams him to the ground, you can bet he practiced on a tackling dummy hundreds of times in the months prior. The examples go on and on. Peak performance takes time and effort. You practice and refine your skills for what is usually a short performance. Sometimes the performance extends over a couple of days or weeks, but generally issues get diagnosed and resolved in a relatively short time. How you prepare will determine the outcome. If you take shortcuts, expect poor results. If you put in the effort to perform well, good things will come your way. Granted, you probably won’t get a multi-million dollar contract with company X, but how many football players do you know who understand cool stuff like policy routing and VRFs? Oh, and being able to fix problems on the network quickly leaves you more time to play World of Warcraft.
I have a bit of a problem when it comes to information. I tend to resemble someone on the TV show Hoarders. I have loads of PDF files on my laptop. Some are on my iPad. Some are on my desktop PC. I even have some on a little flash drive I carry around in my pocket. Of course, I have plenty of books. Just for networking related stuff, I have a pile at home as well as a good size collection at work. Then there are the URL’s. Every day I save all of the valuable URL’s I have discovered from Twitter and RSS feeds and put them in their own little folder with the date as the name under my bookmarks in Firefox. If I follow you on Twitter and you post a link, odds are I have looked at it and bookmarked it if it is something that pertains to my interests. If I read your blog, and odds are I do, I will bookmark various posts of yours and at some point go back and reference them. You see, I don’t always have time to read everything during the day. Additionally, if it is a post like this, or this, I will have to go back and read it all when I have a considerable amount of free time.
Therein lies the problem. I never seem to have time to go back and sift through everything like I had planned. Well, that’s not entirely true. I have the time. I just get caught up in all the new links that are posted on Twitter every day and wind up spending study time skimming new blog posts or digging through websites. There’s a lot of good info out there that people are sharing. I suppose I could limit my intake to just routing and switching, but what fun would that be? Besides, I don’t want to be ignorant of the other things that are out there. After all, it wasn’t that long ago that I had absolutely nothing to do with voice, storage, wireless, and security. Times are changing, and changing fast.
There’s just so much out there that needs to be absorbed. Just when I think I have a handle on most of the Cisco product line, they go and release UCS, and the Nexus 1000V, and the ASR1000’s, and Clean Air. It never ends. There is always a new technology or some new hardware to read up on.
The realization I have come to is that there is no use in collecting information if you are not going to use it. All of those PDF’s, books, and URLs will do me no good if I never use them. At the same time, if I stop keeping up with what is current, I will fall behind and be of less help to my employer. I won’t be able to effectively design anything because I won’t be aware of what the possibilities are.
One of two things has to happen. The first option is that I can really narrow down the focus to just the things that directly pertain to my job. That will alleviate some of the information I have been hoarding. The second option is to start dedicating a bigger portion of my day to information consumption. I think option two is the best one as I can’t see myself ignoring products and technologies that I am not using today due to the fact that I may be using them tomorrow. Besides, it’s more fun when you have a wide range of technologies to keep up with as opposed to a handful.
I don’t know how everyone else handles their technical knowledge maintenance. If you happen to have a tried and true method of keeping up with all things networking, I would love to hear about it.
Maybe you’ve experienced this before. You are minding your own business without a care in the world when all of a sudden the phone rings.
Them: There’s a problem and we think the network is the cause. Can you check it?
You: Check what? The network?
You: Which part? What am I looking for?
Them: Any sort of problem.
(Fast forward an hour or so later)
You: Well, I ran a packet capture on the switch port connecting to system XYZ. I see a bunch of TCP resets coming from your server.
Them: Okay. We’ll take a look.
(Fast forward another half hour or so)
Them: It looks like we found the problem. Process blah-blah-blah was failing due to a dependency on process ha-ha-ha. We reset the services and everything is working again. Thanks for your help.
You: Okay. Not a problem.
(Back to life as before)
Sound familiar? If you have been in networking for more than a couple of years, this should evoke all kinds of warm and fuzzy memories. Meals were missed. Plans were canceled. Sleep was lost. All in the name of defending the network’s honor. Oh yes. This is the part about a career in networking that is conveniently left out of the brochure you are given before signing your life over to Cisco/Juniper/Citrix/Aruba/Nortel/F5/Brocade/Alcatel/etc.
I have seen more than my fair share of these incidents. With the exception of a brief stint in consulting and about 2 years doing things in the US military that you’ll never do anywhere else, I have lived my entire IT existence in the “corporate” setting. By that I mean chained to a desk looking over logs and configurations. Slaving away on the same network for years on end. Getting to know the lay of the land in the same way one knows all the sounds an old car or house makes. In short, after you work on a certain network long enough, you can see into the guts of it like Neo can with the Matrix.
If you are like me, you have a certain affinity for your network. Sure, it may need some help with cabling or a cleaner route table, but you work with what you have. You make changes as you can. You replace hardware as the budget allows. You care for it like a farmer does his corn fields. Is this creeping you out yet? Well, it shouldn’t. There are plenty of people out there who love their networks even to the point of showing them off to the world.
Here’s the problem with being a networking engineer/administrator/architect/designer/janitor. You have to understand everyone else’s piece of the pie, but not too many people have to understand yours. Fair? No, it isn’t, but as an officer I once worked for in the military told me: “That’s a burden you have to bear.” He was right, even if I didn’t like hearing it. That is not to say that all other entities within IT or greater corporate America are completely clueless when it comes to networks. Quite the contrary. There are plenty of systems people who understand networks very well. You can give them an IP with a classless subnet mask and they don’t even bat an eye because they know exactly what you mean when you say it’s a slash 26 network. However, when it comes to “applications” people, my experience has been that they only have to know their piece of the pie and can conveniently blame the network when a problem arises. I know what you’re thinking. Did he just paint all applications people with a broad brush? Yes. Yes I did. Of course, if you happen to be an applications person, I meant everyone else. Not you. 😉
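For anyone who does bat an eye at "slash 26," here is a minimal sketch of what that shorthand means, using Python's standard `ipaddress` module. The 192.0.2.64/26 network is just an assumed example from the documentation address range, not anything from a real environment.

```python
import ipaddress

# Hypothetical example network; 192.0.2.0/24 is reserved for documentation.
net = ipaddress.ip_network("192.0.2.64/26")

netmask = str(net.netmask)                   # dotted-decimal form of the /26 mask
total_addresses = net.num_addresses          # includes network and broadcast addresses
usable_hosts = total_addresses - 2           # subtract network and broadcast
first_host = str(net.network_address + 1)    # first assignable address
last_host = str(net.broadcast_address - 1)   # last assignable address

print(netmask, total_addresses, usable_hosts, first_host, last_host)
```

A /26 leaves 6 host bits, so the block holds 64 addresses, 62 of which can be assigned to hosts.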
That brings me to the title of this mini-rant/post. You can plead your case before everyone telling them that it probably isn’t the network, but they’re not going to believe you. Why? A lack of understanding or a lack of visibility into your world. You see, the network is just a big murky box to them. Maybe if they had access to some monitoring platforms they could be swayed, but unless your monitoring package can go down to the transaction level like Compuware’s Vantage product, you’re still going to have some explaining to do. However, in a way that I cannot begin to explain, people tend to believe packet captures. Don’t ask me why. I can tell you until I am blue in the face that the switches and routers on the network for the most part couldn’t care less what your payload is and you won’t believe me. You may not even understand TCP, UDP, and the rest of the acronym soup being tossed around, but for some reason, Wireshark or tcpdump results are more credible than Stephen Hawking discussing time travel. If you want some good laughs around things like this, follow this guy on Twitter. He seems to deal with this on a regular basis and has some hilarious tweets to show for it.
Let me end this post with the following suggestions:
1. Get familiar with interpreting packet captures. Wireshark is the best-known packet capture utility for Windows boxes out there. There’s even a good book that covers everything in detail. You’ll also need to know about TCP and how it works. There are other protocols like UDP and ICMP that will be good to know, but TCP is by far the most useful protocol to know and understand when dealing with packet captures. For some good info on TCP, see here.
2. Don’t be afraid to run a packet capture early on in the troubleshooting process. I am finding that this tends to solve the problem when all other methods fail.
3. Don’t EVER, and I stress EVER, state emphatically that there is no way possible that the network is at fault. 99 times out of 100 you may be right. Get it wrong 1 time, and everyone will be gunning for you. There’s always the possibility that the network is at fault. Even when everything you know is telling you that it isn’t the network, if you don’t have a packet capture to back it up, you’re wasting your time.
4. Educate your co-workers about the network, or networking in general. Try to do this without condescension. Nobody wants to listen to Nick Burns tell them how stupid they are. The more people know, the less likely they are to hurl unsubstantiated accusations your way that you are manipulating traffic to break their application. It makes every organization a lot stronger when education is provided from the various departments. Please understand that although you and I might get excited when talking about routing protocols, not everyone else will. Oh how I wish my wife and I could have the EIGRP vs OSPF discussion, but it’s just not going to happen. Some people are not going to want to know a whole lot about the network, so try and figure out how much they really want to know and tailor the education to that level.
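To make the first two suggestions a little more concrete, here is a rough sketch of the kind of check behind "I see a bunch of TCP resets coming from your server." It decodes the flag byte of a raw TCP header and counts resets per source. The sample segments, IP addresses, and the helper names (`tcp_flags`, `make_header`, `count_resets`) are all made up for illustration; in practice Wireshark or tcpdump does this decoding for you.

```python
import struct
from collections import Counter

# Bit positions of the classic TCP flags in byte 13 of the TCP header.
TCP_FLAG_BITS = {"FIN": 0x01, "SYN": 0x02, "RST": 0x04,
                 "PSH": 0x08, "ACK": 0x10, "URG": 0x20}

def tcp_flags(header: bytes) -> set:
    """Return the names of the flags set in a raw TCP header."""
    if len(header) < 20:
        raise ValueError("a TCP header is at least 20 bytes")
    flag_byte = header[13]
    return {name for name, bit in TCP_FLAG_BITS.items() if flag_byte & bit}

def make_header(sport: int, dport: int, flags: list) -> bytes:
    """Build a minimal 20-byte TCP header for the sample data (hypothetical helper)."""
    bits = 0
    for name in flags:
        bits |= TCP_FLAG_BITS[name]
    # Fields: ports, seq, ack, data offset (5 words), flags, window, checksum, urgent.
    return struct.pack("!HHIIBBHHH", sport, dport, 0, 0, 5 << 4, bits, 8192, 0, 0)

def count_resets(segments) -> Counter:
    """Count RST segments per source IP from (src_ip, tcp_header) pairs."""
    resets = Counter()
    for src_ip, header in segments:
        if "RST" in tcp_flags(header):
            resets[src_ip] += 1
    return resets

# Made-up capture: a client at 198.51.100.10 and a server at 192.0.2.20.
sample = [
    ("198.51.100.10", make_header(51000, 443, ["SYN"])),
    ("192.0.2.20", make_header(443, 51000, ["RST", "ACK"])),
    ("192.0.2.20", make_header(443, 51001, ["RST"])),
    ("198.51.100.10", make_header(51002, 443, ["ACK", "PSH"])),
]
reset_counts = count_resets(sample)
print(reset_counts.most_common())
```

A source that dominates the reset count is a reasonable place to start asking questions, though a reset alone does not say why the connection was refused or torn down.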
If nothing else, looking at a bunch of packet captures will help you appreciate what is going on behind the scenes every time you read an e-mail message or look at a website. Although other people might not appreciate it, I find that it helps my wife fall asleep faster when I talk about the various TCP flags and why they are used in data transmissions. At least she will never blame the network. 🙂
****Note – While I thought about detailing the technical steps necessary for delegation on different pieces of equipment, I decided to go with the more “architectural” or “philosophical” approach in this post. Besides, there are plenty of others out there who do a far better job with graphics and CLI examples.
Recently, I took some steps to make my job a little easier. I delegated access to another group that does not normally have anything to do with the network side of the house. In this particular instance, I was able to give that group access to a Cisco ACE load balancer. Normally, giving non-network people access to equipment would be frowned upon. This is especially true for equipment in a data center that controls data flows for your most important applications. I had to consider the following:
1. Can I give them specific levels of access?
2. Will they be able to perform operations with relative ease?
3. Does it make sense to do this?
Question 1 was easy. Of course we can provide granular levels of access. It is hard to find a piece of equipment on most enterprise networks that can’t do this. Question 2 was a “most likely”, but could have been tough if everything needed to be done via CLI. Question 3 was probably the most important. Generally speaking, most technical problems can be solved given enough time and resources (i.e., people, money, and equipment). What many of us should ask, and some of us fail to ask, is whether or not we SHOULD do something. I for one love playing with new equipment. Build an Ethernet switch that interfaces with a toaster and I want to play with it. However, is there any use for something like that? Is there a large community of people out there that want connectivity with their toaster?
The point is that while a lot of things are possible, not everything is necessary. Sometimes giving people access to network equipment can cause more harm than good. While I am a big fan of providing as much information to others as possible, if that information cannot be interpreted correctly, you are wasting your time. For example, I have been in environments where non-network groups were given access to Netflow data. While that sounds great on the surface, the reality was that the data was being interpreted incorrectly. When looking at something like a 3Mbps circuit, some people would see full utilization and assume that more bandwidth was required. What they failed to take into account was that the QoS markings of the traffic indicated that a bunch of AF11 (what was deemed scavenger) traffic was using the bulk of the bandwidth. Had any additional traffic come over the circuit that was tagged as AF21 or higher, it would have pushed down the AF11 traffic and gradually used more and more of the circuit until it reached the bandwidth limit that was set for that specific class of traffic. More bandwidth was not needed when the Netflow data was viewed in its entirety. Had this particular group understood QoS markings, they would have come to a different conclusion. Could we, the network group, have provided more in-depth training on this particular product? Sure, but how long would that training have to be before the individuals understood QoS well enough to interpret traffic flows correctly? If you are a QoS fan, how long did it take you before you understood things like shaping vs policing? Or L2 vs L3 markings?
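Here is a small sketch of what viewing the Netflow data "in its entirety" could look like: break utilization down by QoS class instead of looking at one total. The flow records, the 60-second window, and the function name are assumptions made up for illustration. The DSCP values are the standard ones (AF11 is DSCP 10, AF21 is DSCP 18), and treating AF11 as scavenger simply follows the environment described above.

```python
from collections import defaultdict

CIRCUIT_BPS = 3_000_000   # the 3 Mbps circuit from the example
INTERVAL_SECONDS = 60     # assumed measurement window

# Standard DSCP code points for the classes discussed; 0 is best effort.
DSCP_NAMES = {10: "AF11", 18: "AF21", 0: "BE"}

# Hypothetical flow records: (dscp, bytes transferred during the window).
flows = [
    (10, 15_000_000),
    (10, 3_000_000),
    (18, 2_250_000),
    (0, 1_125_000),
]

def utilization_by_class(flows, circuit_bps, interval_seconds):
    """Return each class's share of circuit capacity over the window (0.0 to 1.0)."""
    capacity_bits = circuit_bps * interval_seconds
    bits = defaultdict(int)
    for dscp, byte_count in flows:
        bits[DSCP_NAMES.get(dscp, f"DSCP{dscp}")] += byte_count * 8
    return {name: total / capacity_bits for name, total in bits.items()}

shares = utilization_by_class(flows, CIRCUIT_BPS, INTERVAL_SECONDS)
total_share = sum(shares.values())
non_scavenger_share = total_share - shares["AF11"]

for name, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{name}: {share:.0%}")
print(f"total: {total_share:.0%}, without scavenger: {non_scavenger_share:.0%}")
```

With these made-up numbers the circuit looks 95% full, but only 15% of it is traffic anyone has promised to protect, which is a very different story from "we need more bandwidth."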
Back to the issue at hand. Does it make sense to give another group access to the load balancer? Yes. In this case it did. The typical process for maintenance on a server getting requests via the ACE load balancer was to have the network group pull it out of the active pool. Then, another group would make whatever changes were needed. Once they were done, they would contact the network group who would place the server back into the pool. If you are having to make changes to a dozen servers, this process can take some time. Why not just give the group making changes to the server limited access to the load balancer so they can do everything themselves? Time and resources would be saved by all.
That brings me back to the second question: can we make it easy for them to make changes to the load balancer? In the case of Cisco ACE, yes. We had an instance of Application Network Manager (ANM) running in our data center to help us. While I tend to be a fan of CLI (except in the case of the Cisco ASA), not everyone else is. Sometimes a GUI is far more helpful for people who need to make changes to network gear. That’s where ANM comes in. In a matter of minutes, I was able to create a domain (which is where you define the servers and farms you are giving access to) and a role (you can create your own if you don’t like the default ones) for this other group to use. Now they had access to select servers and their corresponding server farms, but not enough access to do any real damage.
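The idea behind that domain and role pairing can be sketched as a toy model. To be clear, this is not ANM's or the ACE's actual API or configuration; every class, method, and server name below is hypothetical. It only illustrates the principle: the delegated group can flip servers in and out of service, but only the servers inside its domain.

```python
from dataclasses import dataclass, field

@dataclass
class ServerFarm:
    """A load balanced pool: server name -> True if in service (toy model)."""
    name: str
    in_service: dict = field(default_factory=dict)

    def active_servers(self):
        return sorted(s for s, up in self.in_service.items() if up)

@dataclass(frozen=True)
class Domain:
    """The set of servers a delegated group is allowed to touch."""
    name: str
    servers: frozenset

class DelegatedOperator:
    """Hypothetical role: may only take servers in its domain in or out of service."""

    def __init__(self, domain: Domain):
        self.domain = domain

    def set_in_service(self, farm: ServerFarm, server: str, state: bool) -> None:
        if server not in self.domain.servers:
            raise PermissionError(f"{server} is outside domain {self.domain.name}")
        if server not in farm.in_service:
            raise KeyError(f"{server} is not a member of farm {farm.name}")
        farm.in_service[server] = state

# Made-up farm and domain for illustration.
farm = ServerFarm("web-farm", {"web01": True, "web02": True, "db01": True})
app_team = DelegatedOperator(Domain("app-team", frozenset({"web01", "web02"})))

app_team.set_in_service(farm, "web01", False)    # pull web01 for maintenance
during_maintenance = farm.active_servers()
app_team.set_in_service(farm, "web01", True)     # put it back when done
after_maintenance = farm.active_servers()

try:
    app_team.set_in_service(farm, "db01", False)  # outside the domain
    blocked = False
except PermissionError:
    blocked = True
```

Keeping the allowed operations down to "in service" and "out of service" is what makes the delegation low risk: the group can do its maintenance without being able to change how the farm itself is built.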
After doing that, I just had to create some instructions for the 2 tasks they would need to do. First, they needed to know how to remove servers from a load balanced pool. Second, they needed to know how to add servers to a load balanced pool. With ANM and the specific domain/role I assigned to their group, this is a piece of cake. I took the appropriate screenshots to walk them through the process of adding and removing a server and put them in a nice concise MS Word document. There are times when I am hesitant to put a lot of pictures in instructions. Sometimes people get offended when you drop it down to an elementary school level. Thankfully this particular group LOVED pictures, so everything worked out. In about 15 minutes we ran through the instructions. Additionally, I asked if they wanted a bit more detail about the Cisco ACE load balancer in general, so we talked about what it does and where it physically sits in the network. Everyone seemed happy with the training, and I think they were truly excited about not having to wait on the network group anymore when they needed to make changes.
Problem solved. Everyone was happy, and I know that outside group is reaping the benefits of being able to make changes on their own. I have jumped on to conference calls several times recently and noticed that servers were being added to and removed from load balanced pools without the network group having to do anything. The group I gave access to was taking care of it.
If you have the means to delegate processes to other groups, I would recommend that you do it, provided it complies with any security and administrative policies your company or IT department has. You do have those policies in place, right? 😉 If it makes your job easier, makes other people’s jobs easier, and you get to impart some knowledge about the network to external groups, why not do it?