****Note – I am NOT in any way shape or form a VMware expert. I can’t guarantee you that I will be 100% correct in my terminology or representation of VMware, VMotion, VSphere, etc. I apologize in advance. I am just a network guy trying to understand how the Nexus 1000V ties into the VMware ecosystem. I also understand that companies other than VMware are doing virtualization. Please feel free to correct my inaccuracies via the comments.
Paradigm shifts are coming. Some of them are already here. About 5 or 6 years ago I was first introduced to server virtualization in the form of VMware ESX server. For you old mainframe people, you probably weren’t as impressed as I was when I learned about this particular technology.
When it came to VMware, I wasn’t doing anything fancy. I was just using it to host a few Windows servers. When these boxes were physical, they were only using a fraction of their CPU, memory, and disk space. In most cases, they were specific applications that vendors would only support if they were on their own server. From a networking standpoint, there was absolutely nothing fancy that I was doing. All of the traffic from the virtual machines came out of a shared 1 gig port. For me, VMware was a fantastic product in that it allowed me to reduce power, rack space, and cooling requirements.
I realize that some people will take issue with my use of the term “server virtualization”. To some, software and hardware virtualization are different animals. For the purposes of non-VMware people like myself, the fact that I used VMware to reduce the physical server sprawl means that I refer to it as “server virtualization”.
Fast forward to today. It is getting harder and harder to find a company that isn’t doing some sort of server virtualization. It isn’t just about reducing physical server footprint and maximizing CPU and memory resources. These days, you can achieve phenomenal uptime rates due to things like VMotion. For those who are unfamiliar with VMotion, it is a service within VMware that can move a running virtual machine from one physical host (i.e., an ESX/ESXi server) to another. This can happen ahead of hardware maintenance on the physical host, because of additional CPU/memory resource requirements, or for other reasons that the VMware administrator deems important. As I understand it, recovery from an outright host failure is handled by a companion feature, VMware HA, which restarts the affected virtual machines on another host.
Today, from a networking standpoint, there are 3 options when it comes to networking inside the VMware vSphere 4 ecosystem:
vNetwork Standard Switch – 1 or more of these standard switches reside on a single ESX host. This would be the vSwitch in older versions of ESX. This is basically a no frills switch. Think of this as managing switches without the use of VTP. You have to touch a lot of these switches if certain VLANs reside on multiple ESX hosts.
vNetwork Distributed Switch – 1 or more of these will reside in a “Datacenter”. By “Datacenter”, I am not referring to a physical location. Rather, in VMware lingo, it is a logical grouping of ESX clusters (each comprised of ESX hosts). This is the equivalent of running VTP across a network of Cisco switches. You can make changes and have them show up on each ESX host that is part of the “Datacenter”. This particular switch type has several advantages over the standard switch in terms of feature availability. It also allows you to move virtual machines between multiple servers via vMotion and have the policies associated with each machine follow it.
Cisco Nexus 1000V – Similar to the distributed switch, except it was built on NX-OS and you can manage it almost like you would any physical Cisco switch. It also has a few more features that the regular VMware distributed switch does not.
That’s the basic overview as I understand it. What I had been struggling with was the actual architecture behind it. How does it work? I can look at a physical switch like the 3750 or 6500 and get a fairly decent understanding of it. Not the level I would like to have, but I understand that vendors like Cisco don’t want to give away their “secret sauce” to everyone that comes along and asks for it.
As luck would have it, my company has purchased several instances of the Nexus 1000V and last week, I was able to spend a day with a Cisco corporate resource and one of the server/storage engineers my company employs. I didn’t realize how deficient I was in the world of VMware until I got into a room with these 2 guys and we started talking through how we would design and implement the Nexus 1000V. I kept asking them to explain things over and over. In the end, a fair amount of pictures on the white board caused the light bulb in my head to go active. I still have much reading to do, but for now I understand it a LOT more than I did. Now, let’s see if I can have it make sense to you. 🙂
The Nexus 1000V is basically comprised of 2 different parts: the VEM and the VSM. If we were to assign these 2 things to actual hardware pieces, the VEM (Virtual Ethernet Module) would be the equivalent of a line card in a switch like the Nexus 7000 or a Catalyst 6500. In essence, this is the data plane. The second piece is the VSM (Virtual Supervisor Module). This is the same as the supervisor module in the Nexus 7000 or Catalyst 6500. As you probably already guessed, this is the control plane piece.
Here’s where it gets a bit crazy. The VSM can support up to 64 VEMs per 1000V. You can also have a second VSM that operates in standby mode until the active fails. In theory, you have a virtual chassis with 66 slots. In the Nexus 1000V CLI, you can actually type a “show module” and they will all show up. Each ESX host will show up as its own module. Will you ever have 64 VEMs under a single VSM? Maybe. However, there are limitations around the Nexus 1000V that make that unlikely.
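To make the virtual chassis idea concrete, here is a minimal Python sketch of how the slots might be laid out. The limits of 2 VSMs and 64 VEMs come from the paragraph above. The function name, the host names, and the slot ordering (VSMs in the first two slots, VEMs after them) are my own assumptions for illustration, not output from a real switch.

```python
MAX_VSMS = 2    # one active VSM plus an optional standby
MAX_VEMS = 64   # limit quoted above for a single Nexus 1000V


def build_virtual_chassis(esx_hosts, redundant_vsm=True):
    """Map slot numbers to modules for a hypothetical virtual chassis."""
    if len(esx_hosts) > MAX_VEMS:
        raise ValueError(f"at most {MAX_VEMS} VEMs are supported")
    chassis = {1: "VSM (active)"}
    if redundant_vsm:
        chassis[2] = "VSM (standby)"
    # Assumption: VEM slots start right after the two VSM slots.
    for offset, host in enumerate(esx_hosts):
        chassis[MAX_VSMS + 1 + offset] = f"VEM on {host}"
    return chassis


if __name__ == "__main__":
    hosts = [f"esx{n:02d}" for n in range(1, 65)]  # hypothetical host names
    chassis = build_virtual_chassis(hosts)
    print(len(chassis), "slots in use")
    for slot in (1, 2, 3, 66):
        print(slot, chassis[slot])
```

With two VSMs and a full set of 64 ESX hosts, the dictionary ends up with 66 entries, which is where the “66 slots” figure comes from.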
The VEM lives on each ESX server, but where does the VSM reside? It resides in its own guest VM. You actually create a separate virtual machine for the VSM when installing the Nexus 1000V. That guest VM resides on one of the ESX servers within the “datacenter” that the Nexus 1000V controls. You access that guest VM just like you would a physical switch in your network by using the CLI. Once the VSM is installed, the network resource can go in via SSH or Telnet and configure away.
Those are the basic components of the Nexus 1000V. There are other things that need to be mentioned such as how communication happens from the guest VM perspective to the rest of the network and vice versa. Additionally, we need to discuss the benefits of using the Nexus 1000V over the standard VMware distributed switch. There’s a lot more than just the management aspect of it. I will cover that in part 2. Additionally, I plan on doing a write up on the Nexus 1010 appliance. This allows you to REALLY move the control plane piece out of the VMware environment and put it on a box with a Cisco logo on it.
***Please note that these are my own thoughts and not those of my employer.
First and foremost, I have to credit Jimmy Ray Purser from Cisco for putting the idea for this post in my head. Back in late April of this year, he wrote an article for Network World entitled “The ABC’s of Anybody but Cisco.”
Like a lot of people, I work on networks that have a lot of Cisco gear. I’m very comfortable with their stuff. I’ve used CatOS, IOS, and NX-OS. Switches, routers, phones, firewalls, load balancers, access points, voice gateways, ACS, etc. It’s all familiar. Old hat. A lot like the old comfy t-shirt that Tom Hollingsworth wears. I’ve invested a lot of my time and energy into learning their product set. One could say my familiarity with their gear has been responsible for a significant portion of my earnings over the years. I don’t want to give the impression that I am a paid shill for Cisco. After all, I don’t work for a Cisco partner. I’m one of those corporate types who occupies a cubicle and acts as the caretaker for a decent sized network. Not a massive network, but big enough to have some direct support from our local Cisco office when we need it. We don’t use Cisco for every single thing on the network side, but if the various vendors on our network had voting rights, let’s just say that Cisco would be able to influence any election in their favor.
Although I use a lot of Cisco gear, I also try and keep an eye on the other vendors out there. On a given day, I probably get at least a dozen e-mails from other vendors and networking publications. Additionally, I have an ever growing RSS feed list comprised of dozens of vendor blogs and blogs from the many networking professionals I follow on Twitter or have found via word of mouth. In other words, I spend at least 10% of my day consuming information from Cisco, other vendors, and networking professionals who may or may not share my viewpoint. I feel that gives me the potential to have a very well rounded view of the networking industry. Whether or not I can process all of that information to form worthwhile opinions is another matter. Suffice to say, I am trying to put in the hours to ensure I can make the best decisions for my employer.
The challenge to vendors other than Cisco is figuring out how to pitch their product and have it SERIOUSLY considered by people who manage networks where Cisco dominates. This can be done, but it takes marketing and sales people who understand who their likely customers are. I’ve seen and read plenty of non-Cisco pitches. Some are very good and offer compelling reasons to consider their product. A lot offer nothing other than “We’re not Cisco!”.
A few years ago, I read a very interesting book about Wal-Mart called “The Wal-Mart Effect”. I thought it was a mostly balanced look at the company and how they do business. One of the most interesting parts of that book was where a retail executive was giving advice on how to beat Wal-Mart. He said that you would NEVER beat them on price. They are too big and have too much influence over their suppliers. Wal-Mart will do whatever it takes to get the lowest price possible from the supplier. You beat Wal-Mart by creating a better experience for the customer. Give them better quality. Give them better product selection. Give them nicer facilities to shop in. That’s how you beat them. If you try and take them on in a price war, you WILL lose. I think about what that person said and try and apply that to Cisco’s competitors. How do you compete with the overall networking market leader? In my mind, that requires some different thinking.
1. Don’t spend all day bashing Cisco. – If I am willingly talking to another vendor as an alternative to Cisco, it’s probably because I realize there are other options out there. You are not helping your cause any if the main thing you have to offer is that Cisco sucks and you are better. Up until July of this year, the Riverbed blog was full of posts slamming Cisco. I’m not going to say that Cisco didn’t deserve some of those. I’m not even going to make the argument that WAAS is the same as or better than Riverbed when it comes to WAN optimization. Clearly Riverbed is doing some great things in WAN optimization. They’re very easy to set up and administer. Their products work very well. Sell me on those points. Sell me on the fact that it works and that you’re offering me things that the other WAN optimization vendors are not. When I have to deal with corporate arrogance, be it from marketing content or sales people, I get turned off on the product real fast. Why? Well, arrogance breeds complacency. It doesn’t allow an organization to see clearly and eventually someone is going to catch up and turn your double digit market share into single digits. When competing with a company like Cisco, you are competing with a very well oiled marketing machine. Don’t ever forget that. In fairness to Riverbed, I haven’t seen that mentality lately. As a result of that, my feelings toward them have improved greatly.
2. Don’t preach to me about standards. – If there’s one thing I hear the most, it’s that vendor XYZ is a completely “standards” based company. If you have been around the Cisco community for a few years, you know how instrumental they are in driving standards. VRRP, MPLS, PoE, LLDP, 802.1q, and CAPWAP among others are a direct result of Cisco and their influence. Additionally, some of these companies that want you to use their hardware because it is standards based are producing their own variants of certain standards. I can think of a couple of vendors right off the top of my head that have their own enhancements to VRRP. Most of the Cisco gear I have been around supports these “standards”. Remember, they drove the creation of quite a few of them in the first place. Yes, Cisco does make modifications/enhancements to things like OSPF and other protocols, but so do most large vendors. My point is that everyone supports standards, or they wouldn’t be standards. What vendors are really trying to tell me when they are preaching the fact that they are standards based is that by using proprietary protocols, I am locking myself into Cisco. That, in their minds, is a bad thing. I can assure you that I am aware of the risks of running EIGRP, HSRP, or whatever proprietary protocol that Cisco puts out. It’s also not a good idea to assume that I am running any proprietary protocols just because I have Cisco gear.
3. Your presentation needs to be polished. – Please, please, please spend some time on your documentation and presentation material. Understand that people who are buying and deploying Cisco have access to very polished design guides, configuration guides, product data sheets, etc. While I am not asking you to replicate the entire Cisco documentation/support ecosystem, it would help if you ran your documentation through a spelling and grammar check before displaying it on your website. For examples of good documentation outside of Cisco, see Juniper. They have their act together.
4. Innovate – Sometimes a radical approach to the way we are used to seeing things is needed. Aerohive is doing some very interesting things in the wireless space. They’re not using controllers to manage their AP’s. The AP’s manage each other. Very different and very cool. I’m not a wireless expert. I don’t use their products. I don’t have any plans to use their products in the near future. However, I AM keeping an eye on them and if an opportunity comes up in which they would be a good fit, I won’t hesitate to reach out to them.
Juniper has completely sold me on their SA line of SSL VPN appliances. Everyone I talk to about them has the same feeling. Very feature rich!
Riverbed is the leader in WAN optimization. I use their product and am very satisfied with it. It just works.
Arista has created a compelling offering in the data center switching environment. The fact that I can run numerous third party applications on their switch due to their granting me shell access is VERY interesting. Oh, and I should also mention that nobody else can give me the same amount (384) of wire speed 10Gbps ports in a chassis (7500) of their size (11RU).
Brocade and Force10 both offer an interesting alternative to the standard top of rack copper switch or Nexus 2k FEX line. They have line cards that support the MRJ21 connector system developed by Tyco Electronics. Essentially, your top of rack switch is reduced to a patch panel that connects back to a Brocade or Force10 chassis. However, this is not a 1 to 1 patch panel. Rather, a single MRJ21 cable(about the width of a pencil) connects the patch panel and chassis. Each cable supports 6 10/100/1000 connections. The patch panel is not the ordinary punch down block you are used to seeing. It’s a modular cassette or fixed 24 or 48 port 1U patch panel. None of these cassettes or patch panels require power. All management is done from the central chassis. If you are familiar with Nexus 2k administration from the Nexus 5k, this is similar. The benefits to this technology are three fold. First, there is no power consumption required in the top of rack. Second, all administration is done from the central chassis which can be located at the end of the row, middle of the row, or on the other side of the data center. Third, with the modular capabilities, you can split up the 24 or 48 ports your typical top of rack switch has. 1 logical 48 port switch being managed at the central chassis can be spread out over 2 or more racks.
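The cabling savings are easy to work out. Here is a quick Python sketch of the arithmetic, using the figure of 6 copper connections per MRJ21 cable from the paragraph above. The function name is just something I made up for the example.

```python
import math

PORTS_PER_MRJ21_CABLE = 6  # each MRJ21 cable carries 6 10/100/1000 connections


def mrj21_cables_needed(copper_ports):
    """Return how many MRJ21 cables are needed to carry the given port count."""
    if copper_ports < 0:
        raise ValueError("port count cannot be negative")
    return math.ceil(copper_ports / PORTS_PER_MRJ21_CABLE)


if __name__ == "__main__":
    for ports in (24, 48):
        print(f"{ports} ports -> {mrj21_cables_needed(ports)} MRJ21 cables")
```

So a 48 port patch panel needs 8 cables running back to the chassis instead of 48 individual copper runs, and a 24 port panel needs 4.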
That’s it. Four simple things from my perspective. There may be more, but to me, these are the big ones. I didn’t mention pricing. I don’t usually see this as being an issue. Other vendors know what the pricing is going to be from Cisco. They can figure out based on the size of the company what the discount is probably going to be. They may not know the exact amount, but they can make a pretty good guess. Your product needs to be cheaper. That’s a great selling point for a lot of vendors. They know you are paying for the Cisco name, so they can use that as leverage. If it is not cheaper, it better have some sort of compelling reason to be chosen. There needs to be a “wow” factor and not vaporware. It needs to be legitimate.
You’ll also notice that I did not address the problems with Cisco itself. That would take another long post and there are plenty of other capable bloggers out there hammering away at them on a regular basis. My goal in this post is to focus on how to compete with Cisco.
**** Please note that these are my own thoughts and observations and should not in any way be taken to be the opinion(s) of my employer. Additionally, this is a rather long post, so please bear with me. I promise not to waste your time by babbling incessantly about non relevant things.
Finally! After many hours spent sifting through vendor websites and reading various documents, I have finished my comparison. If there’s one thing I came away with in this process, it’s that some vendors are better than others at providing specifics regarding their platforms. By far, Juniper was the best at providing in depth documentation on their hardware and software. Although Cisco has a ton of information out there about the Nexus 7000, I found that a lot of it was more on the architecture/design side and less on the actual specifics of the platform itself. Some vendors still hide documentation behind a login that only works with a valid support contract. In my opinion, that’s not a good thing. I think most people research products before they decide to buy, so why hide things that are going to cause roadblocks for people like myself trying to do some initial research? I’ve read MANY brochures, white papers, data sheets, third party “independent” tests(meaning a vendor paid for a canned report that gives a big thumbs up to their product), and other marketing documents in the past couple of weeks. I did not actively seek out conversations with sales people in regards to these products. I did have a couple of conversations around these products and not all the people I talked to were straight sales people. Some were very technical. However, I wanted to go off the things that the websites were advertising. Once the list is narrowed down to 2 or 3 platforms, the REAL work begins with an even deeper dive into the platforms.
I wish I could display the whole thing on this website and have it look pretty. Unfortunately, I don’t know how to do that and make it look nice. Remember, I get paid for networking stuff and not my web skills! In consideration of that, I have attached a PDF file of my comparison chart. I have the original in Excel format, but WordPress wouldn’t allow me to upload it. If you want a copy, I can certainly e-mail it to you. You can send me your e-mail address via a direct message in Twitter. I can be found here.
What IS included in the spreadsheet.
I would love to say that I did all of this work for the benefit of my fellow network engineers, but I would be lying if I said that. I built this out of a specific need that my employer has or will have in the coming months/years. Due to that, some of the features that were important to me may not be important to you. If you find yourself wondering why I included it, just chalk it up to it being something that I considered a requirement. Having said that, it would be selfish not to share this information with you, so take it for what it’s worth.
When it comes to the actual numbers of things like fan trays and power supplies, I tend to build out the chassis to the full amount it will hold. If it can take 8 power supplies, I will probably use 8. Same with fabric modules. I like to plan with the belief that I will fully populate the chassis at some point, so I want to have enough power, throughput, and cooling on board to handle any new blades. All chassis examined have the ability to run on less than the maximum number of power supplies.
When it comes to throughput rates, you have to distinguish between full duplex numbers and half duplex numbers. They don’t always specify which is which, so you have to dig through a lot of documentation to figure out what they are really saying. Thankfully marketing people tend to favor the larger numbers so more often than not, the number given is full duplex. In the case of slot bandwidth, I used the half duplex speed. The backplane numbers are all full duplex.
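Since the two kinds of numbers get mixed up so often, here is a tiny Python sketch of the conversion I had to do over and over. I am using “half duplex” the way the paragraph above does, meaning the one-direction figure, with the full duplex figure counting both directions. The 230 Gbps value is a made-up number for illustration, not a spec from any of these chassis.

```python
def to_full_duplex(one_way_gbps):
    """Convert a one-direction (half duplex) figure to a full duplex figure."""
    return one_way_gbps * 2


def to_one_way(full_duplex_gbps):
    """Convert a full duplex marketing figure back to one direction."""
    return full_duplex_gbps / 2


if __name__ == "__main__":
    slot_one_way = 230  # hypothetical per-slot bandwidth in Gbps
    print("full duplex:", to_full_duplex(slot_one_way), "Gbps")
    print("one way:", to_one_way(460), "Gbps")
```

The point is simply that a data sheet quoting 460 Gbps per slot and another quoting 230 Gbps per slot may be describing the same thing.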
What IS NOT included in the spreadsheet and why.
If I were to include every single thing these switches support, the spreadsheet would be 10 times bigger than it already is. There are quite a few things that I consider to be basic requirements. These basic things were left out of the sheet to avoid cluttering it up with things you probably already know. For example, does the switch support IPv6? This should be a resounding yes. If it doesn’t, why in the world would I even consider it? The same can be said with routing protocols. They all should support OSPFv2 and RIPv2 at a minimum. Most, if not all support IS-IS and BGP as well. It is also worth pointing out that I may not even need this switch to run layer 3. I am looking for 10Gig aggregation and am not necessarily concerned about anything other than layer 2. All of these switches also support QoS. Perhaps they do things a little differently between each switch, but the basics are still the basics and I don’t really need a billion different options when it comes to QoS. That may change in a few years, but for now, I am not looking at running anything other than non-storage traffic over these switches.
I think you see my point by now. I could go on and on about what isn’t included. If it is something well known like SSH for management purposes, I don’t need to include it in the list. It’s a given.
Special note on the TOR(Top of Rack) fabric extension.
While I primarily need 10Gig aggregation, another bonus is the ability to have 1Gig copper aggregation as well. However, I don’t want it all coming back to the chassis itself. The Nexus 7010 has the ability through the Nexus 5000’s(of which I already own several) to attach Nexus 2000 series fabric extenders that function as top of rack switches(although it’s not REALLY a switch). This is a nice bonus feature as I can aggregate a lot of copper connections back to 1 chassis without all the spaghetti wiring that is commonly seen in 6500’s and 4500’s. In the case of Brocade and Force10, they actually have the TOR extensions as nothing more than MRJ-21 patch panels. With 1 cable(which is the width of a pencil) per 6 copper ports, the amount of wiring coming back to the chassis is reduced tremendously.
Additionally, there is no power consumption at the top of the rack like there is with the Nexus 2000’s and it is a direct link to the top of rack connections unlike the Nexus model where I have an intermediate 5000 series switch in between.
One final note. The HP/H3C A12508 is listed on the HP site as the A12508, but when you click into the actual product page, it is listed as the S12508. These terms can be mixed and matched and mean the same chassis. I have chosen to use A12508 as the model number as much as possible in this post, but my previous post that mentioned the various switches used the letter “S” instead of “A”.
I plan on posting a few more thoughts on this process as it pertains to specific platforms. I was awed by several of the platforms, not just by the hardware itself, but by the approach the company is taking to the data center in general. Any of these platforms will do the job I need them to do. Some will do that job a lot better than others. As for cost, I have only seen numbers on a few of the platforms. That’s something that is important, but not the most important. You can read my previous post on this for more clarification on what my thought process is.
Remember that I am not claiming to be an expert in regards to any of these platforms. I have done many hours of research on them, but there is a chance that some information in this PDF file will be wrong. If you see any glaring errors, please let me know. I promise you won’t hurt my feelings. If anything is marked “Unknown”, rest assured that I looked at every possible piece of literature on the website that I could reasonably find. If you managed to read this far in the post, the file is below. Enjoy!
*****Update – The Juniper 8200 series does support multi-chassis link aggregation. It just requires another piece to make it work. The XRE200 External Routing Engine gives the 8200 this capability. Thanks to Abner Germanow from Juniper for clarifying that!
I have an increasing need for 10Gig connectivity. Although I may have enough ports today, I have to plan for the future. While I can easily buy some more Nexus 5000 series switches, I would rather have a more capable platform. As a heavy user of Cisco hardware, the logical choice was to use the Nexus 7000 series line. It is a platform that I can grow into over time. I don’t need the big 7018, so the 7010 will suffice. My company has a great relationship with Cisco and our sales rep and local engineer are top notch. No hard selling on their part so the relationship is, in my opinion, a very good one.
Having said that, I also have to point out that I have an obligation to my company to ensure the best product is selected. It would be irresponsible of me to make a technical decision of six digit magnitude and have it come up short in features. I need to make sure the product we select is the best fit for our particular needs. That doesn’t mean the Nexus 7010 is the wrong device. For all I know it will be the best thing for us. Of course, I still have to do my due diligence.
Over the past several weeks, I have been looking over some of the competition. Granted, I still have to spend a lot more time looking at Nexus 7010 competitors, so I am nowhere near done. I’ve been really busy with other things, so I haven’t been able to dedicate as much time as I thought I would to figuring this out. What I have done so far is narrow down a list of vendors and the appropriate product that can compete with the Nexus 7010. Here’s a short list of the features I am looking to compare:
1. 10Gig port count across the entire chassis.
2. 10Gig port/blade/module oversubscription rates. (Some products may not have this issue.)
3. Size of chassis.
4. Power consumption.
5. Layer 2 features(STP, TRILL, proprietary)
6. Layer 3 features(Standard based protocols, proprietary protocols)
7. Cost(Not the main driver, but it is a factor to consider after the technical merits.)
8. Product age(Is it a new platform, or has it been around for more than a year or two?)
9. Focus of the company
10. Size of the company
11. Support structure of the company
12. Code updates(Is there a defined release cycle?)
13. Availability of documentation from the vendor.
14. Connectivity options other than 10Gig(1Gig copper ports or some type of TOR integration aka Nexus 2000’s?)
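For item 2 in the list above, the oversubscription rate is just the total port bandwidth on a module divided by the bandwidth the slot can actually pass to the fabric. Here is a rough Python sketch of that calculation. The port counts and the 80 Gbps slot figure are hypothetical numbers chosen for the example, not measurements of any of the platforms I am comparing.

```python
def oversubscription_ratio(port_count, port_speed_gbps, slot_bandwidth_gbps):
    """Return front-panel demand divided by slot bandwidth (1.0 means line rate)."""
    if slot_bandwidth_gbps <= 0:
        raise ValueError("slot bandwidth must be positive")
    return (port_count * port_speed_gbps) / slot_bandwidth_gbps


if __name__ == "__main__":
    # Hypothetical module: 32 x 10Gig ports sharing 80 Gbps to the fabric.
    ratio = oversubscription_ratio(32, 10, 80)
    print(f"oversubscription is {ratio:.0f}:1")
```

Both bandwidth figures need to be measured the same way (one direction or both) or the ratio will be off by a factor of two.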
Obviously there are going to be other things to consider. I also was very vague on the L2 and L3 feature requirements. That was on purpose. As I go through this process, I will be able to elaborate more on the particular L2/3 features that are needed vs those that are available.
Here are the models I am comparing:
Cisco Nexus 7010
Brocade NetIron MLX 16
HP S12508 – This was recently changed from the S9512E as it was recommended by someone from HP that it was a better comparison to the Nexus 7010.
It is pretty hard if not downright impossible to find competing platforms that have exactly the same specs. I tried to find the closest match in terms of 10Gig port capability since that is the main driver behind this project.
More posts to come soon on this. I am still trying to decide if I want to do a post on each platform individually or do a few posts focusing on certain features that they all have in common. Any thoughts on this are appreciated.
Days 3 and 4 did not disappoint! I don’t know if I stated this in the earlier posts, but the days basically consisted of lecture in the morning and labs after lunch. I REALLY, REALLY enjoyed the lecture portion. Again, I have to state that the instructor was fairly knowledgeable in regards to ACE, so he was able to actually teach instead of regurgitate a slide deck like other classes I have been in. That makes all the difference in the world. As for the labs, I guess they do some good if you have not had much experience with the ACE CLI. We did not do any labs using the built in GUI or ANM. The problem I have with labs is that they are a very canned and controlled environment. You end up just going through the motions without actually soaking up what it is that you are doing. Ideally, the labs would need to be tailored to your environment to have the greatest effect. This of course, is not realistic. Having said that, I am sure there are some people who get something out of it. My opinion was shared by others in the class in regards to the effectiveness of the labs, so I am not the only one who feels this way. However, the effectiveness of the lecture portion completely overshadowed any shortcomings of the lab portion.
In the interest of brevity, I am going to touch on the things I thought were the most interesting, but I don’t want this post to be so long it requires a coffee break to finish.
Route Health Injection – On a simplistic level, RHI allows the ACE to inject a host route into the network. You would use this to advertise the VIP (virtual IP) that clients use to connect to a server farm. If the server farm is not available due to any number of issues, the host route can be automatically removed from the route table and not advertised. The alternative is to simply advertise the VIPs as part of a regular subnet advertisement like you do with any other VLAN or subnet. Again, I am simplifying this and need to point out that this is NOT something that is specific to Cisco ACE. Other vendors implement similar technologies.
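The logic behind RHI can be sketched in a few lines of Python. This is only a conceptual model of “advertise the host route while the server farm is healthy, withdraw it when it is not”. The function name, the VIP addresses, and the farm names are all invented for the example, and none of this is how ACE implements it internally.

```python
def advertised_host_routes(vip_to_farm, healthy_farms):
    """Return the /32 host routes that should currently be advertised."""
    return sorted(
        f"{vip}/32"
        for vip, farm in vip_to_farm.items()
        if farm in healthy_farms
    )


if __name__ == "__main__":
    vips = {
        "10.1.1.100": "WEB-FARM",   # hypothetical VIP and server farm names
        "10.1.1.200": "APP-FARM",
    }
    print(advertised_host_routes(vips, {"WEB-FARM", "APP-FARM"}))
    # APP-FARM fails its health checks, so its host route is withdrawn.
    print(advertised_host_routes(vips, {"WEB-FARM"}))
```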
KeepAlive-Appliance Protocol (KAL-AP) – There are a few variations of the Cisco ACE, and one of those is the Global Site Selector (GSS). Its purpose is simply to provide higher level load balancing between data centers. Basically, it is a load balancer of load balancers. By using KAL-AP, the GSS can query VIPs at multiple data centers and determine which one is the best fit to send traffic to.
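Here is a very rough Python sketch of the “load balancer of load balancers” decision. I am assuming that each data center answers a query with a single load number where lower means less busy, and that no answer means the VIP is unreachable. Those assumptions, the function name, and the data center names are mine for illustration and are not taken from the KAL-AP specification.

```python
def pick_data_center(load_reports):
    """Pick the reachable data center with the lowest reported load.

    load_reports maps a data center name to its reported load,
    or to None when the query went unanswered.
    """
    reachable = {dc: load for dc, load in load_reports.items() if load is not None}
    if not reachable:
        return None
    return min(reachable, key=reachable.get)


if __name__ == "__main__":
    reports = {"dc-east": 40, "dc-west": 15, "dc-south": None}  # hypothetical values
    print("send clients to:", pick_data_center(reports))
```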
There are a couple of things that the ACE 4710 appliance does that the ACE module cannot. I asked the question as to why this is the case and was told that the ACE appliance has a different architecture than the module. It has certain functionality that might come to the module at some point, but for now is restricted to the appliance. These extra functions really revolve around the ACE appliance being able to cache certain HTTP objects and speed up the process of delivering a web page to an end user. A fair amount of detail on this can be found here.
It sure seems as if I cut back on the information from days 3 and 4 when compared to 1 and 2. I did. Although there were plenty of interesting things covered in the past 2 days of class, a lot of those things would take a while to explain and draw out via diagrams. That’s also assuming that I actually understand these things well enough to explain them in depth.
That brings me to a more philosophical point regarding the type of niche product that Cisco ACE is. While it would be great if you knew the CLI on ACE backwards and forwards, it really isn’t necessary. What is necessary is an understanding of what a platform like ACE is capable of. I sat in a meeting today in which some developers wanted ACE to perform health checks on a server outside of a load balance pool and use the results of that query to determine whether or not servers should be removed from a load balance pool. Basically, they wanted to do something that ACE is not really designed to do. Spending 4 days in a classroom learning all about ACE gave me the information needed to have a productive meeting with these developers today. I was able to answer their questions and give better guidance than I would have a couple of weeks ago. I don’t know all the commands for ACE. I will still have to use the configuration guides to look things up now and again. The important thing is that I understand the capabilities and limitations of the ACE load balancer a lot better today than I did prior to taking the ACE class. My main goal is to know what it can and cannot do in order to design anything requiring load balancing properly. To me that is more important than memorizing commands.
Day 2 of ACE boot camp did not disappoint! Another full day of lecture and labs. We covered the following topics:
Modular Policy CLI
Managing the ACE Appliance and Service Module
Layer 4 Load Balancing
I’ll cover some general things about each topic and go into additional details on the points I thought were interesting.
Modular Policy CLI – ACE classifies which traffic it will load-balance based on policy maps, which are composed of class maps. If this sounds a lot like how you build QoS policies on IOS-based routers, it is. The big difference is that ACE is far more restrictive in what those policies contain.
Managing the ACE Appliance and Service Module – Like most Cisco devices, ACE can be managed in a number of different ways: Telnet, SSH, HTTPS, and SNMP. You can even use the XML API if you want. With SNMP, versions 1 and 2 cannot understand contexts. SNMP version 3 can. In order for SNMP versions 1 and 2 to work with contexts, you have to use the community string format of “community@context” where “community” is the community name and “context” is the name of the virtual context. When the GET, SET, or whatever SNMP action you choose hits the ACE, the “@context” portion is understood and the request is passed along to the appropriate context.
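The “community@context” format is easy to picture as a simple string split. Here is a small Python sketch of that idea. The split_community function is purely hypothetical, and how ACE actually parses the string internally (or what it does when no context is given) is not something I can vouch for, so the sketch just returns None in that case.

```python
from typing import Optional, Tuple


def split_community(community_string: str) -> Tuple[str, Optional[str]]:
    """Split "community@context" into (community, context).

    Returns None for the context when no "@context" suffix is present.
    Splitting on the LAST "@" is an assumption made for this sketch.
    """
    community, separator, context = community_string.rpartition("@")
    if not separator:
        # No "@" at all: the whole string is the community name.
        return community_string, None
    return community, context or None
```

For example, split_community("public@web-ctx") gives the community "public" and the context "web-ctx", while a plain "public" comes back with no context at all.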
Security Features – There are a ton of different ways to restrict traffic entering and leaving the ACE. Most of the time you will be focused on traffic entering the ACE. As with applying ACLs to interfaces on switches and routers, you will very rarely see access lists applied in the outbound direction. That feature is there in case you have some special need to use it.
An interesting capability that the access lists have in ACE is the ability to use object groups to identify which traffic to permit or deny. If you have ever worked on the PIX, ASA, or FWSM, you will be familiar with object groups. They make traffic identification much easier, not to mention simplifying the ACE configuration itself.
The much more granular security options were of great interest to me. Take something like IP fragmentation and reassembly. You can specify the maximum number of fragments allowed from one packet. If a packet exceeds the number you specify, you can just drop the traffic. Many other options exist with regards to the packet stream itself. You can prevent certain flags from being set. If violations occur, not only can you drop the traffic, but you could actually reset the flag itself and then send the traffic through the ACE. While most options are configurable, there are some rules that are always enforced. For example, the source IP of a traffic flow can never equal the destination IP.
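As a rough way of thinking about those checks, here is a Python sketch of the decision logic. Everything in it (PacketInfo, apply_checks, the flag names) is invented for illustration, and the order of the checks is my own assumption rather than anything documented for ACE.

```python
from dataclasses import dataclass, replace
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class PacketInfo:
    """Just enough packet detail for this sketch (hypothetical model)."""
    src_ip: str
    dst_ip: str
    fragment_count: int = 1
    flags: FrozenSet[str] = frozenset()


def apply_checks(
    packet: PacketInfo,
    max_fragments: int,
    banned_flags: FrozenSet[str] = frozenset(),
    clear_banned: bool = False,
) -> Tuple[str, PacketInfo]:
    """Return ("drop" or "forward", possibly modified packet)."""
    # Always-enforced rule from the text: source can never equal destination.
    if packet.src_ip == packet.dst_ip:
        return "drop", packet
    # Configurable rule: too many fragments for one packet.
    if packet.fragment_count > max_fragments:
        return "drop", packet
    # Configurable rule: either drop on a banned flag, or clear it and forward.
    if packet.flags & banned_flags:
        if not clear_banned:
            return "drop", packet
        packet = replace(packet, flags=packet.flags - banned_flags)
    return "forward", packet
```

The clear_banned switch mirrors the two choices described above: drop the offending traffic outright, or fix the flag and let the traffic through.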
Layer 4 Load Balancing – This is exactly what it sounds like: load balancing based on TCP/UDP flows. I think the neatest part about this particular topic was the fact that you can actually load balance traffic across multiple firewalls and have the return traffic come back through the same firewall. This of course requires an ACE on both sides of the firewall, but with the ability for the ACE module to have up to 250 virtual contexts, it doesn’t have to be 2 separate physical ACE modules. The same module can host both contexts that live on either side of the firewall. It is fairly clever how they make this work. Essentially, when traffic comes from one firewall into the ACE, it remembers the MAC address of the sending firewall and places that connection in a state table. When traffic comes back through the ACE, it already knows which firewall to send the traffic to based on that state table. I’m not sure I would want to use an ACE module for load balancing through firewalls, but there are plenty of customers out there that are already doing it or could see the benefit in doing something like that.
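The "remember the MAC address of the sending firewall" trick can be sketched in a few lines of Python. This is only my mental model of the state table, not ACE's actual data structure. Flow and FirewallStateTable are hypothetical names, and keying the table on a 5-tuple is an assumption on my part.

```python
from typing import Dict, NamedTuple, Optional


class Flow(NamedTuple):
    """A connection 5-tuple (assumed key for the state table)."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str

    def reversed(self) -> "Flow":
        # The return direction of the same connection.
        return Flow(self.dst_ip, self.dst_port, self.src_ip, self.src_port, self.protocol)


class FirewallStateTable:
    """Remembers which firewall (by MAC) each connection arrived through."""

    def __init__(self) -> None:
        self._mac_by_flow: Dict[Flow, str] = {}

    def learn(self, flow: Flow, firewall_mac: str) -> None:
        # Called when traffic arrives from a firewall.
        self._mac_by_flow[flow] = firewall_mac

    def firewall_for_return(self, return_flow: Flow) -> Optional[str]:
        # Return traffic is the learned flow with source and destination swapped.
        return self._mac_by_flow.get(return_flow.reversed())
```

Because the lookup uses the reversed flow, the reply to a connection always lands on the MAC of the firewall that delivered the original traffic, and an unknown flow simply returns None.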
Health Monitoring – If there’s one thing the ACE seems to have a fairly large number of options on, it’s health monitoring, or probes. All the major protocols have specific probes on the ACE that are used to check the health of the back-end or “real” servers. This is way beyond the load balancer simply pinging the server to make sure it is up and running. Let’s say you used the HTTP probe. Instead of just trying a simple ping to check a back-end server’s status, the HTTP probe can actually go out and make an HTTP connection to the server or serverfarm. That’s a far more intelligent way to query server status. Based on the probe results, any number of things can be done to the various serverfarms and servers ACE may be providing services for. They may be taken out of active status, have their priority reduced, etc.
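Here is a minimal Python sketch of the difference between "is it pingable" and "does it actually answer HTTP", plus one possible way of acting on the results. None of this is ACE code. The http_probe function, the RealServer class, and the "three consecutive failures" threshold are all assumptions I picked for the example.

```python
import http.client


def http_probe(host: str, port: int = 80, path: str = "/",
               expected_status: int = 200, timeout: float = 2.0) -> bool:
    """Make a real HTTP GET and pass only if the expected status comes back."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        return conn.getresponse().status == expected_status
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, or a malformed response all count as failure.
        return False
    finally:
        conn.close()


class RealServer:
    """Tracks probe results for one back-end server (hypothetical model)."""

    def __init__(self, name: str, fail_threshold: int = 3) -> None:
        self.name = name
        self.fail_threshold = fail_threshold
        self.consecutive_failures = 0
        self.in_service = True

    def record_probe(self, passed: bool) -> None:
        if passed:
            # A single success puts the server back in rotation in this sketch.
            self.consecutive_failures = 0
            self.in_service = True
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.fail_threshold:
            self.in_service = False
```

A polling loop would call something like server.record_probe(http_probe("10.1.1.11")) on a timer. Taking the server out of service is only one possible reaction. Lowering its priority, as mentioned above, would be another.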
There’s a LOT more to this stuff. This was only day 2 of 4! More to come.
First off, let me point out that this is not a boot camp with a certification in mind. It’s a 4 day course given by Firefly Communications. Although I booked the course through Global Knowledge, I was told that they typically outsource their data center courses to Firefly. Works for me. As long as it is quality training, I don’t care if you outsource it to Elbonia. I am assuming they use the term “boot camp” because it is an end to end ACE class taught in just 4 days.
Which brings me to my first point. My company was able to use Cisco Learning Credits to pay for this class. At 30 credits, that translates to $3,000 US for 4 days’ worth of training. Sitting in the class, I couldn’t help but notice people doing regular work while the instructor was going through his lecture. I realize most places are understaffed. Outages happen. Fires have to get put out. However, $3,000 for 4 days to me is a big deal. If you send your employees off to training that is critical/applicable for their job, LET THEM TRAIN! Leave them alone while they are there. Of course, that’s a two-way street in that some employees need to learn to let go as well. The company will function without them for a few days. You can turn off “martyr” and “hero” mode for a couple of days. I am checking e-mail at night, but not being obsessive about it. I have very capable co-workers who can do anything and everything without my help.
Now, on to the actual class. Let me begin by commenting on the quality of instruction. I’ve been to plenty of poor classes in which someone was trying to shovel test material down your throat the whole time. I’ve also sat in several classes where the instructor was obviously out of their league and could not field questions from the crowd that weren’t covered on the vendor approved slide deck. That is simply not the case with Firefly. My instructor is very competent and when he hits the limit of his knowledge, he indicates that. So far, I think I have only seen 1 time out of the dozen or so questions he was hit with today in which that was the case. I guess that is what $3,000 a seat gets you.
It seems as if there is a fairly decent mix of people in this class. About a dozen or so are in attendance. A fair number of them are actually using the ACE 4710 appliance, which I thought was rather interesting. Of course, most are using the standard ACE module. There are varying levels of experience with ACE as well. I was under the impression that I would be here mainly for the second half of the class, as I felt comfortable with the basics. Of course, just when you go and get comfortable, you realize how little you know. I learned a LOT today. Mainly, it was about things I never really bothered to dig into. You see, like most people, I probably only dig into the features I absolutely need right now. Maybe we plan on coming back and covering everything else at a later time, but I think that happens far less than we’d like it to. Some of the things we covered today that I was horribly deficient on were:
Resource Management – If you use multiple contexts, RM can prevent a single context from taking over all of the module’s resources. I don’t use this as it is currently not a concern, but it is good to know if things change!
ACE 4710 appliance – I don’t use it and never have. However, it does do a few things the module does not, mainly centered around application acceleration. We have not covered that exhaustively yet, but I will take good notes when we do.
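Going back to resource management for a moment, the basic idea of not letting contexts claim more than the module has can be sketched in Python. This is a simplification under my own assumptions: I am modeling each context as having a guaranteed minimum expressed as a percentage of the module, and remaining_after_guarantees is a name I made up.

```python
from typing import Dict


def remaining_after_guarantees(minimum_percent_by_context: Dict[str, float]) -> float:
    """Return the unreserved percentage left after all guaranteed minimums.

    Raises ValueError if the guarantees add up to more than 100 percent,
    i.e. the module would be oversubscribed.
    """
    total = sum(minimum_percent_by_context.values())
    if total > 100:
        raise ValueError(f"guarantees total {total}%, which oversubscribes the module")
    return 100 - total
```

With guarantees of 20 and 30 percent for two contexts, half of the module is still unreserved. Adding a third context that asks for 60 percent would raise the error.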
There were other things covered in which I was glad to get a decent refresher. The main one being TCP sequence numbers. They are always a bit confusing to me if I don’t study them on a fairly regular basis. Although you weren’t there with me in class today, you can read this post by Jeremy Stretch which talks about TCP sequencing. He even uses nice graphics!
We ended the day doing a pretty simple lab in which we created some contexts and messed around with resource management to see if we could oversubscribe the module in terms of CPU, memory, etc. relative to other contexts. Overall, it was a really good first day. I am eagerly anticipating what tomorrow will be like. It is also good to be taught by someone who actually helped develop the slide deck the course is taught from. He was able to add funny little details about how he created this drawing or that. It’s always nice to have someone teach who has a great sense of humor. So far, I give the Firefly ACE boot camp 2 thumbs up!
I am hoping to get a wee bit more technical in the following posts regarding ACE boot camp as the remaining days will REALLY focus on load balancing. Who knows? I might even post a graphic or two! Shocking isn’t it?