Customer Edge Site High Availability for Application Delivery - Reference Architecture
Purpose This guide describes the reference architecture for deploying a highly available F5 Distributed Cloud (F5XC) Customer Edge (CE) site. It explains the networking options available to deploy a highly available multi-node CE site in an on-premise data center, branch location, or on the public cloud when deployed manually. Audience This guide is for technical readers, including NetOps and Solution Architect teams who want to better understand the various options for deploying a highly available F5 Distributed Cloud Customer Edge (CE) site. The guide assumes the reader is familiar with basic networking concepts like routing protocols, DNS, and data center network architecture. Also, the reader must be aware of various F5XC concepts such as Load Balancing, BGP configuration, Sites and Virtual Sites, and Site Local Inside (SLI) and Site Local Outside (SLO) interfaces. Introduction To create a resilient network architecture, all components on the network must be deployed in a redundant topology to handle device and connectivity failures. A CE acts as an L7 gateway and sits in the path of the network traffic, hence it needs a redundant architecture. For a production setup, it is recommended to deploy the site as a three-node cluster. These three nodes are the control nodes. Additional worker nodes can be added for higher L7 and security performance. Clustering on CE Site A CE can be deployed as a multi-node site for redundancy and scaling performance. CE runs kubernetes on its nodes and inherits k8s HA architecture of having either one or three control nodes and optional worker nodes. Production deployments are recommended to have 3 control nodes for redundancy and additional worker nodes to meet the performance requirements of the site. A multi-node site can tolerate one control node failure as it needs at least 2 nodes to form the quorum for HA. It's important to ensure that multiple control nodes don't fail simultaneously in each site. Worker node failures do not cause the whole site to fail. It only reduces the total throughput the site can handle. Note: The control nodes may also be addressed as master nodes in legacy documentation. Although they are called control nodes, they run both the control plane and data plane functions. Figure: CE Clustering In a multi-node setup, two CE control nodes form tunnels to the two closest REs. If one of the control nodes with a tunnel fails, it gets reassigned to the remaining control node. In a single node site, the same node forms tunnels to two different REs. Worker nodes are not supported for sites with a single control node. Figure: RE – CE connectivity CE Site HA Options In a regular deployment, a multi-node CE site is used to achieve redundancy. A load balancer configured on the CE site uses the IP address of SLI, SLO, or both interfaces as VIP by default. But this means the load balancer domain/hostname will need to resolve to multiple IP addresses across the nodes of the CE. To simplify this, F5XC also allows users to specify a custom IPv4 address as the VIP for each load balancer. An alternative topology is to use multiple single-node sites deployed across different availability zones in the data center or public cloud. In this case, the sites can be grouped into a Virtual Site. A load balancer can be configured with a custom VIP advertised to this Virtual Site. Both of these options are explained in detail in the sections below. High Availability Options for Single CE Site This section describes the deployment options available to direct the traffic across the CE nodes and lists the pros and cons of each option. The feasibility of these options may vary by environment (on-premise or public cloud) and networking tools available. These nuances are also explained for each option. For L4 and L7 load balancer VIPs on the CE site, all nodes (control and worker nodes) can actively receive the traffic. The site bandwidth scales linearly with the number of nodes. So, multiple worker nodes can be deployed based on the performance requirements of the site. For public load balancers, the VIP is on RE, and the bandwidth is limited to the bandwidth of the tunnel connecting the two CE control nodes to the REs. Layer 3 Redundancy Using Static Routing With ECMP This is the simplest way to configure redundancy for Load Balancer VIP on the CE cluster. The application admin can configure the LB with a user-specified VIP and the Network admin can configure equal-cost static routes for this VIP IP addresses, with the SLI/SLO IP addresses of the CE nodes as the next hop. The router uses Equal Cost Multi-Path (ECMP) to spread the traffic across the CE nodes. It is recommended to use consistent hashing ECMP configuration on the router to ensure an active session to a CE node is not rehashed in case another node fails. Figure: Static Routing Pros: VIP IP can be from any valid subnet. It is not restricted to the SLO or SLI subnet where it is advertised. Simple L3 routing configuration. Can scale with worker nodes with minimal route configuration change. All active nodes can receive the traffic. Cons: Needs routing configuration changes external to F5XC, every time a new LB VIP is created/deleted. Traffic will get blackholed when a CE node fails, until the node’s route is removed from the route configuration, or the node is restored. When To Use: When the NetOps team does not have access to routing devices with dynamic routing protocol capabilities like BGP. In use cases where the number of load balancers on the site is small and doesn’t change often, the operational overhead of configuring and managing the routes is less. Layer 3 Redundancy Using BGP Routing With ECMP BGP peering can be configured between F5XC CE and the router. This configuration requires LBs to be created with user-specified VIPs. The CE advertises equal cost, /32 routes to the VIP with the SLO/SLI as the next hop. The router uses Equal Cost Multi-Path (ECMP) to spread the traffic across the CE nodes. It is recommended to use sticky/persistent ECMP configuration on the router to ensure an active session to a CE node is not rehashed to a different node in case of a node failure. Note: Separate BGP peers must be configured for VIPs on SLO and SLI. Users can select the peer interface on CE while configuring the peers. For more information check BGP. Figure:BGP Routing Pros: VIP IP can be from any valid subnet. It is not restricted to the SLO or SLI subnet where it is advertised. Can automatically scale with worker nodes. Automatically revokes the route for failed CE node. Faster failover than any other method. All active nodes can receive the traffic. Cons: Needs advanced network configuration on the router. The router must support BGP. When To Use: The site has a large number of load balancers configured. Load balancers are frequently created and deleted. The application requires fast failover and minimal disruption in case of node failure. With network overlay technologies like Cisco ACI. Layer 2 redundancy using VRRP/GARP A user can enable VRRP on the CE site. This configuration requires LBs to be created with user-specified VIPs. Only control nodes participate in the VRRP redundancy group and one of them is elected as the leader for a VIP. VIPs will be randomly placed on different nodes. Only the leader node for the VIP sends out Gratuitous ARP (GARP) broadcasts for the VIP. If the leader node fails, a new leader is elected, and VIP is placed on it. Figure: VRRP/GARP Pros: No network configuration is required external to F5XC. Automatic failover of VIP when VRRP leader node fails. Cons: Only control nodes can receive the traffic. Only one node is actively receiving traffic at a given time. VIPs will be placed on different control nodes randomly. Equal distribution of VIPs across control nodes is not guaranteed. Failover can be slow depending on the ARP resolution time on the network. When To Use: The application team does not have access to routers, DNS servers, or load balancers on the network. (see other deployment options for details) The application does not require high throughput. Some traffic loss can be tolerated in case of node failure (e.g. for non-critical applications) Does not work in public cloud deployments as the cloud networking blocks GARP requests. External Proxy Load Balancing Network admin can configure an external load balancer (LB), with the CE SLO/SLI IP addresses in its origin pool, to spread the traffic across CE nodes. This can be a TCP or HTTP load balancer. For an external TCP LB, the client IP will be lost as the LB will SNAT the request before forwarding it to the CE nodes. F5XC does not support proxy protocol on the client side so it cannot be used to convey the client IP to the load balancer on the CE. For an external HTTP LB, the traffic will still get SNAT-ed, but the client IP can be persisted to the CE nodes if the external LB can add the X-Forwarded-For header to the request. If the LB on the CE site is a HTTPS LB or TCP LB with TLS enabled, the external LB will have to host the TLS certificate as it will terminate the client TLS sessions. A wild card certificate can be used to simplify this issue, but this may not always be a viable option for the applications. In case of public cloud deployments where L2 ARP and routing protocols may not work, in addition to TCP or HTTP load balancers, users also have the option to use Network Load Balancer on AWS and Google Cloud, Standard Load Balancer on Azure or similar feature on other public clouds, that does not SNAT the traffic but just forwards it to the CE nodes just like a router running ECMP. Note: For multi-node public cloud sites created using the F5XC console, it will automatically create the required cloud native LB. But we also support manual CE deployment in the cloud in which case the user will have to create the LB. Figure: External LB Pros: All active nodes can receive the traffic. Health probes can be configured to get F5XC LB health to avoid traffic blackholing. Can scale with worker nodes. Works for public cloud deployments. Cons: Managing certificates on external LB can be operationally challenging for TLS traffic. No source IP retention in the case of TCP LB Adds additional proxy hop. External LB can become a performance bottleneck even if CE is scaled out using worker nodes. When To Use: There is an existing load balancer (usually in DMZ) in the traffic path, but CE is used for additional services like WAAP, DDoS, etc. In public clouds where cloud LBs can be used to load balance to the nodes. DNS Load Balancing Network admins can use DNS to resolve application hostnames to the SLI/SLO IP addresses on the CE Nodes. The DNS can be configured to respond with one IP at a time in a round robin manner. Alternatively, a private DNS LB or Global Server Load Balancer (GSLB) can be used which can make load-based intelligent decisions to distribute the traffic more evenly. User-specified VIPs must not be used in this case as the hostname must resolve to the individual node’s SLI/SLO IP address for the traffic to get routed to the node. Figure: DNS LB Pros: Can be configured using existing DNS servers. Can scale with worker nodes. Works for public cloud deployments. (Not recommended as better options are available) This does not add a L4/L7 hop to the traffic path. Cons: Needs DNS configuration changes external to F5XC, every time a new LB VIP is created/deleted, or a new worker node is added. Traffic will get blackholed when a CE node fails, until the node’s IP is removed from the DNS configuration, or the node is restored. Intelligent distribution of traffic requires GSLB which can be expensive. Subject to DNS cache and TTL causing clients to resolve to a down CE node. When To Use: The application team only has access to GSLB or DNS server and does not want to limit the traffic to only one node at a time as in the case of the VRRP/GARP option High performance is not a requirement as multiple clients may resolve to the same node even if the site has multiple nodes Can be used in the public cloud if the user does not want to create an external LB. High Availability Using Multiple Single Node Sites Across Availability Zones Instead of deploying a single multi-node site, customers can opt to deploy two (or more) single-node sites and use them together (as individual sites or grouped into a Virtual Site) to advertise a VIP. This can be useful if the data center has two AZs so it’s more logical to deploy a CE on each AZ than deploying a three-node CE with one node in one AZ and two nodes in the other. By upgrading one site at a time, it is guaranteed that at least one site will always be online to serve the traffic, providing resiliency against upgrade failures. This is very useful in case of critical applications demanding zero downtime. All the deployment options above, other than the VRRP/GARP method, can be used in this case. It is recommended to use a consistent hash configuration for ECMP on the router to ensure all packets in a TCP session from a client are always routed to the same site. In this deployment, each CE has two tunnels to the nearest REs. Hence, this method is also beneficial when you want to publish an app to the internet using F5XC Regional Edges (REs) as you can scale throughput by adding CEs and hence more tunnels. Note: This is a big advantage of this topology over a multi-node site as the latter is limited to only two tunnels. For Public load balancers, the VIP is on the Regional Edges (REs) on the F5XC global network. The load balancing happens on the REs and the CEs provide secure connectivity with auto SNAT between REs and private origins. So, to get the most out of the available compute, the CEs in this case can be configured for Enhanced L3 performance mode as all the L7 processing happens on the RE. Figure: RE-CE tunnels for multiple single-node site deployment Conclusion This guide should help the reader learn about the various HA options available in F5XC to make an informed decision on which method to choose based on their requirements and the networking tools available. For a more detailed explanation of the above options with config examples also see:F5 Distributed Cloud – CE High Availability Options: A Comparative Exploration Related Articles F5XC Load Balancing and Distributed Proxy Concepts F5XC Virtual Network Concepts F5XC Site BGP Configurations on F5XC F5 Distributed Cloud - Customer Edge Site - Deployment & Routing Options F5 Distributed Cloud - Listener Logic530Views3likes0CommentsF5 Distributed Cloud – CE High Availability Options: A Comparative Exploration
This article explores an alternative approach to achieve HA across single CE nodes, catering for use cases requiring higher performance and granular control over redundancy and failover management. Introduction F5 Distributed Cloud offers different techniques to achieve High Availability (HA) for Customer Edge (CE) nodes in an active-active configuration to provide redundancy, scaling on-demand and simplify management. By default, F5 Distributed Cloud uses a method for clustering CE nodes, in which CEs keep track of peers by sending heartbeats and facilitating traffic exchange among themselves. This method also handles the automatic transfer of traffic, virtual IPs, and services between CE peers —excellent for simplified deployment and running App Stack sites hosting Kubernetes workloads. However, if CE nodes are deployed mainly to manage L3/L7 traffic and application security, this default model might lack the flexibility needed for certain scenarios. Many of our customers tell us that achieving high availability is not so straightforward with the current clustering model. These customers often have a lot of experience in managing redundancy and high availability across traditional network devices. They like to manage everything themselves—from scheduling when to switch over to a redundant pair (planned failover), to choosing how many network paths (tunnels) to use between CEs to REs (Regional Edges) or other CEs. They also want to handle any issues device by device, decide the number of CE nodes in a redundancy group, and be able to direct traffic to different CEs when one is being updated. Their feedback inspired us to write this article, where we explore a different approach to achieve high availability across CEs. The default clustering model is explained in this document: https://docs.cloud.f5.com/docs/ves-concepts/site#cluster-of-nodes Throughout this article, we will dive into several key areas: An overview of the default CE clustering model, highlighting its inherent challenges and advantages. Introduction to an alternative clustering strategy: Single Node Clustering, including: An analysis of its challenges and benefits. Identification of scenarios where this approach is most applicable. A guide to the configuration steps necessary to implement this model. An exploration of failover behavior within this framework. A comparison table showing how this new method differs from the default clustering method. By the end of this article, readers will gain an understanding of both clustering approaches, enabling informed decisions on the optimal strategy for their specific needs. Default CE Clustering Overview In a standard CE clustering setup, a cluster must have at least three Master nodes, with subsequent additions acting as Worker nodes. A CE cluster is configured as a "Site," centralizing operations like pool configuration and software upgrades to simplify management. In this clustering method, frequent communication is required between control plane components of the nodes on a low latency network. When a failover happens, the VIPs and services - including customer’s compute workloads - will transition to the other active nodes. As shown in the picture above, a CE cluster is treated as a single site, regardless of the number of nodes it contains. In a Mesh Group scenario, each mesh link is associated with one single tunnel connected to the cluster. These tunnels are distributed among the master nodes in the cluster, optimizing the total number of tunnels required for a large-scale Mesh Group. It also means that the site will be connected to REs only via 2 tunnels – one to each RE. Design Considerations for Default CE Clustering model: Best suited for: 1- App Stack Sites: Running Kubernetes workloads necessitates the default clustering method for container orchestration across nodes. 2- Large-scale Site-Mesh Groups (SMG) 3- Cluster-wide upgrade preference: Customers who favour managing nodes collectively will find cluster-wide upgrades more convenient, however without control over the upgrade sequence of individual nodes. Challenges: o Network Bottleneck for Ingress Traffic: A cluster connected to two Regional Edge (RE) sites via only 2 tunnels can lead to only two nodes processing external (ingress) traffic, limiting the use of additional nodes to process internal traffic only. o Three-master node requirement: Some customers are accustomed to dual-node HA models and may find the requirement for three master nodes resource-intensive. o Hitless upgrades: Controlled, phased upgrades are preferred by some customers for testing before widespread deployment, which is challenging with cluster-wide upgrades. o Cross-site deployments: High network latency between remote data centers can impact cluster performance due to the latency sensitivity of etcd daemon, the backbone of cluster state management. If the network connection across the nodes gets disconnected, all nodes will most likely stop the operation due to the quorum requirements of etcd. Therefore, F5 recommends deploying separate clusters for different physical sites. o Service Fault Sprawl and limited Node fault tolerance: Default clusters can sometimes experience a cascading effect where a fault in a node spreads throughout the cluster. Additionally, a standard 3-node cluster can generally only tolerate the failure of two nodes. If the cluster was originally configured with three nodes, functionality may be lost if reduced to a single active node. These limitations stem from the underlying clustering design and its dependency on etcd for maintaining cluster state. The Alternative Solution: HA Between Multiple Single Nodes The good news is that we can achieve the key objectives of the clustering – which are streamlined management and high availability - without the dependency on the control plane clustering mechanisms. Streamlined management using “Virtual Site”: F5 Distributed Cloud provides a mechanism called “Virtual Site” to perform operations on a group of sites (site = node or cluster of nodes), reducing the need to repeat the same set of operations for each site. The “Virtual Site” acts as an abstraction layer, grouping nodes tagged with a unique label and allows collectively addressing these nodes as a single entity. Configuration of origin pools and load balancers can reference Virtual Sites instead of individual sites/nodes, to facilitate cluster-like management for two or more nodes and enabling controlled day 2 operations. When a node is disassociated from Virtual Site by removing the label, it's no longer eligible for new connections, and its listeners are simultaneously deactivated. Upgrading nodes is streamlined: simply remove the node's label to exclude it from the Virtual Site, perform the upgrade, and then reapply the label once the node is operational again. This procedure offers you a controlled failover process, ensuring minimal disruption and enhanced manageability by minimizing the blast radius and limiting the cope of downtime. As traffic is rerouted to other CEs, if something goes wrong with an upgrade of a CE node, the services will not be impacted. HA/Redundancy across multiple nodes: Each single node in a Virtual Site connect to dual REs through IPSec or SSL/TLS tunnels, ensuring even load distribution and true active-active redundancy. External (Ingress) Traffic: In the Virtual Site model, the Regional Edges (REs) distribute external traffic evenly across all nodes. This contrasts with the default clustering approach where only two CE nodes are actively connected to the REs. The main Virtual Site advantage lies in its true active/active configuration for CEs, increasing the total ingress traffic capacity. If a node becomes unavailable, the REs will automatically reroute the new connections to another operational node within the Virtual Site, and the services (connection to origin pools) remain uninterrupted. Internal (East-West) Traffic: For managing internal traffic within a single CE node in a Virtual Site (for example, when LB objects are configured to be advertised within the local site), all network techniques applicable to the default clustering model can be employed in this model as well, except for the Layer 2 attachment (VRRP) method. Preferred load distribution method for internal traffic across CEs: Our preferred methods for load balancing across CE nodes are either DNS based load balancing or Equal-Cost Multi-Path (ECMP) routing utilizing BGP for redundancy. DNS Load Balancer Behavior: If a node is detached from a Virtual Site, its associated listeners and Virtual IPs (VIPs) are automatically withdrawn. Consequently, the DNS load balancer's health checks will mark those VIPs as down and prevent them from receiving internal network traffic. Current limitation for custom VIP and BGP: When using BGP, please note a current limitation that prevents configuring a custom VIP address on the Virtual Site. As a workaround, custom VIPs should be advertised on individual sites instead. The F5 product team is actively working to address this gap. For a detailed exploration of traffic routing options to CEs, please refer to the following article here: https://community.f5.com/kb/technicalarticles/f5-distributed-cloud---customer-edge-site---deployment--routing-options/319435 Design Considerations for Single Node HA Model: Best suited for: 1- Customers with high throughput requirement: This clustering model ensures that all Customer Edge (CE) nodes are engaged in managing ingress traffic from Regional Edges (REs), which allows for scalable expansion by adding additional CEs as required. In contrast, the default clustering model limits ingress traffic processing to only two CE nodes per cluster, and more precisely, to a single node from each RE, regardless of the number of worker nodes in the cluster. Consequently, this model is more advantageous for customers who have high throughput demands. 2- Customers who prefer to use controlled failover and software upgrades This clustering model enables a sequential upgrade process, where nodes are updated individually to ensure each node upgrades successfully before moving on to the other nodes. The process involves detaching the node from the cluster by removing its site label, which causes redirecting traffic to the remaining nodes during the upgrade. Once upgraded, the label is reapplied, and this process is repeated for each node in turn. This is a model that customers have known for 20+ years for upgrade procedures, with a little wrinkle with the label. 3- Customers who prefer to distribute the load across remote sites Nodes are deployed independently and do not require inter-node heartbeat communication, unlike the default clustering method. This independence allows for their deployment across various data centers and availability zones while being managed as a single entity. They are compatible with both Layer 2 (L2) spanned and Layer 3 (L3) spanned data centers, where nodes in different L3 networks utilize distinct gateways. As long as the nodes can access the origin pools, they can be integrated into the same "Virtual Site". This flexibility caters to customers' traditional preferences, such as deploying two CE nodes per location, which is fully supported by this clustering model. Challenges: Lack of VRRP Support: The primary limitation of this clustering method is the absence of VRRP support for internal VIPs. However, there are some alternative methods to distribute internal traffic across CE nodes. These include DNS based routing, BGP with Equal-Cost Multi-Path (ECMP) routing, or the implementation of CEs behind another Layer 4 (L4) load balancer capable of traffic distribution without source address alteration, such as F5 BIG-IPs or the standard load balancers provided by Azure or AWS. Limitation on Custom VIP IP Support: Currently, the F5 Distributed Cloud Console has a restriction preventing the configuration of custom virtual IPs for load balancer advertisements on Virtual Sites. We anticipate this limitation will be addressed in future updates to the F5 Distributed Cloud platform. As a temporary solution, you can advertise the LB across multiple individual sites within the Virtual Site. This approach enables the configuration of custom VIPs on those sites. Requires extra steps for upgrading nodes Unlike the Default clustering model where upgrades can be performed collectively on a group of nodes, this clustering model requires upgrading nodes on an individual basis. This may introduce more steps, especially in larger clusters, but it remains significantly simpler than traditional network device upgrades. Large-Scale Mesh Group: In F5 Distributed Cloud, the "Mesh Group" feature allows for direct connections between sites (whether individual CE sites or clusters of CEs) and other selected sites through IPSec tunnels. For CE clusters, tunnels are established on a per-cluster basis. However, for single-node sites, each node creates its own tunnels to connect with remote CEs. This setup can lead to an increased number of tunnels needed to establish the mesh. For example, in a network of 10 sites configured with dual-CE Virtual Sites, each CE is required to establish 18 IPSec tunnels to connect with other sites, or 19 for a full mesh configuration. Comparatively, a 10-site network using the default clustering method—with a minimum of 3 CEs per site—would only need up to 9 tunnels from each CE for full connectivity. Opting for Virtual Sites with dual CEs, a common choice, effectively doubles the number of required tunnels from each CE when compared to the default clustering setup. However, despite this increase in tunnels, opting for a Mesh configuration with single-node clusters can offer advantages in terms of performance and load distribution. Note: Use DC Groups as an alternative solution to Secure Mesh Group for CE connectivity: For customers with existing private connectivity between their CE nodes, running Site Mesh Group (SMG) with numerous IPsec tunnels can be less optimal. As a more scalable alternative for these customers, we recommend using DC Cluster Group (DCG). This method utilizes IP-in-IP tunnels over the existing private network, eliminating the need for individual encrypted IPsec tunnels between each node and streamlining communication between CE nodes via IP-n-IP encapsulations. Configuration Steps The configuration for creating single node clusters involves the following steps: Creating a Label Creating a Virtual Site Applying the label to the CE nodes (sites) Review and validate the configuration The detailed configuration guide for the above steps can be found here: https://docs.cloud.f5.com/docs/how-to/fleets-vsites/create-virtual-site Example Configuration: In this example, you can create a label called "my-vsite" to group CE nodes that belong to the same Virtual Site. Within this label, you can then define different values to represent different environments or clusters, such as specific Azure region or an on-premise data center. Then a Virtual Site of “CE” type can be created to represent the CE cluster in “Azure-AustraliaEast-vSite" and tied to any CE that is tagged with the label “my-vsite=Azure-AustraliaEast-vSite”: Now, any CE node that should join the cluster (Virtual Site), should get this label: Verification: To confirm the Virtual Site configuration is functioning as intended, we joined two CEs (k1-azure-ce2 and k1-azure-ce03) into the Virtual Site and evaluated the routing and load balancing behavior. Test 1: Public Load Balancer (Virtual Site referenced in the pool) The diagram shows a public "Load Balancer" advertised on the RE referencing a pool that uses the newly created Virtual Site to access the private application: As shown below, the pool member was configured to be accessed through the Virtual Site: Analysis of the request logs in the Performance dashboard confirmed that all requests to the public website were evenly distributed across both CEs. Test 2: Internal Load Balancer (LB advertised on the Virtual Site) We deployed an internal Load Balancer and advertised it on the newly created Virtual Site, utilizing the pool that also references the same Virtual Site (k1-azure-ce2 and k1-azure-ce03). As shown below, the Load Balancer was configured to be advertised on the Virtual Site. Note: Here we couldn't use a "shared" custom VIP across the Virtual Site due to a current platform constraint. If a custom VIP is required, we should use "site" as opposed to "Virtual Site" and advertise the Load Balancer on all sites, like below picture: Request logs revealed that when traffic reached either CE node within the Virtual Site, the request was processed and forwarded locally to the pool member. In the example below: src_site: Indicates the CE (k1-azure-ce2) that processed the request. src_ip: Represents the client's source IP address (192.168.1.68). dst_site: Indicates the CE (k1-azure-ce2) from which the pool member is accessed. dst_ip: Represents the IP address of the pool member (192.168.1.6). Resilience Testing: To assess the Virtual Site's resilience, we intentionally blocked network access from k1-azure-ce2 CE to the pool member (192.168.1.6). The CE automatically rerouted traffic to the pool member via the other CE (k1-azure-ce03) in the Virtual Site. Note:By default, CEs can communicate with each other via the F5 Global Network. This can be customized to use direct connectivity through tunnels if the CEs are members of the same DC Cluster Group (IP-n-IP tunneling) or Secure Mesh Groups (IPSec tunneling). The following picture shows the traffic flow via F5 Global Network. The following picture shows the traffic flow via the IP-n-IP tunnel when a DC Clustering Group (DCG) is configured across the CE nodes. Failover Behaviour When a CE node is tied to a Virtual Site, all internal Load Balancers (VIPs) advertised on that Virtual Site will be deployed in the CE. Additionally, the Regional Edge (RE) begins to use this node as one of the potential next hops for connections to the origin pool. Should the CE become unavailable, or if it lacks the necessary network access to the origin server, the RE will almost seamlessly reroute connections through the other operational CEs in the Virtual Site. Uncontrolled Failover: During instances of uncontrolled failover, such as when a node is unexpectedly shut down from the hypervisor, we have observed a handful of new connections experiencing timeouts. However, these issues were resolved by implementing health checks within the origin pool, which prevented any subsequent connection drops. Note: Irrespective of the clustering model in use, it's always recommended to configure health checks for the origin pool. This practice enhances failover responsiveness and mitigates any additional latency incurred during traffic rerouting. Controlled Failover: The moment a CE node is disassociated from the Virtual Site — by the removal of its label— the CE node will not be used by RE to connect to origin pools anymore. At the same time, all Load Balancer listeners associated with that Virtual Site are withdrawn from the node. This effectively halts traffic processing for those applications, preventing the node from receiving related traffic. During controlled failover scenarios, we have observed seamless service continuity on externally advertised services (to REs). On-Demand Scaling: F5 Distributed Cloud provides a flexible solution that enables customers to scale the number of active CE nodes according to demand. This allows you to easily add more powerful CE nodes during peak periods (such as promotional events) and then remove them when demand subsides. With the Virtual Sites method, you can even mix and match node sizes within your cluster (Virtual Site), providing granular control over resources. It's advisable to monitor CE node performance and implement node related alerts. These alerts notify you when nodes are operating at high capacity, allowing for timely addition of extra nodes as needed. Moreover, you can monitor node’s health in the dashboard. CPU, Memory and Disk utilizations of nodes can be a good factor in determining if more nodes are needed or not. Furthermore, the use of Virtual Sites makes managing this process even easier, thanks to labels. Node Based Alerts: Node-based alerts are essential for maintaining efficient CE operations. Accessing the alerts in the Console: To view alerts, go to Multi-Cloud Network Connect > Notifications > Alerts. Here, you can see both "Active Alerts" and "All Alerts." Alerts related to node health fall under the "infrastructure" alert group. The following screenshot shows alerts indicating high loads on the nodes. Configuring Alert Policies: Alert policies determine the notification process for raised alerts. To set up an alert policy, navigate to Multi-Cloud Network Connect > Alerts Management > Alert Policies. An alert policy consists of two main elements: the alert receiver configuration and the policy rules. Configuring Alert Receiver: The configuration allows for integration with platforms like Slack and PagerDuty, among others, facilitating notifications through commonly used channels. Configuring Alert Rules: For alert selection, we recommend configuring notifications for alerts with severity of “Major” or “Critical” at a minimum. Alternatively, the “infrastructure” group which includes node-based alerts can be selected. Comparison Table Criteria Default Cluster Single Node HA Minimum number of nodes in HA 3 2 Upgrade operations Per cluster Per Node Network redundancy and client side routing for east-west traffic VRRP, BGP, DNS, L4/7 LB DNS, L4/7 LB, BGP* Tunnels to RE 2 tunnels per cluster 2 tunnels per node Tunnels to other CEs (SMG or DCG) 1 tunnel from each cluster 1 tunnel from each node External traffic processing Limited to 2 nodes All nodes will be active Internal traffic processing All nodes can be active All nodes can be active Scale management in Public Cloud Sites Straightforward, by configuring ingress interfaces in Azure/AWS/GCP sites Straightforward, by adding or removing the labels Scale management in Secure Mesh Sites Requires reconfiguring the cluster (secure mesh site) - may cause interruption Straightforward, by adding or removing the labels Custom VIP IP Available Not Available (Planned to be available in future releases), workaround available. Node sizes All nodes should be same size. Upgrading node size in a cluster is a disruptive operation. Any node sizes or clusters can join the Virtual Site * When using BGP, please note a current limitation that prevents configuring custom VIP address on the Virtual Site. Conclusion: F5 Distributed Cloud offers a flexible approach to High Availability (HA) across CE nodes, allowing customers to select the redundancy model that best fits their specific use cases and requirements. While we continue to advocate for default clustering approach due to their operational simplicity and shared VRRP VIP or, unified network configuration benefits, especially for routine tasks like upgrades, the Virtual Site and single node HA model presents some great use cases. It not only addresses the limitations and challenges of the default clustering model, but also introduces a solution that is both scalable and adaptable. While Virtual Sites offer their own benefits, we recognize they also present trade-offs. The overall benefits, particularly for scenarios demanding high ingress (RE to CE) throughput and controlled failover capabilities cater to specific customer demands. The F5 product and development team remains committed to addressing the limitations of both default clustering and Virtual Sites discussed throughout this article. Their focus is on continuous improvement and finding the solutions that best serve our customers' needs. References and Additional Links: Default Clustering model: https://docs.cloud.f5.com/docs/ves-concepts/site#cluster-of-nodes Configuration guide for Virtual Sites:https://docs.cloud.f5.com/docs/how-to/fleets-vsites/create-virtual-site Routing Options for CEs:https://community.f5.com/kb/technicalarticles/f5-distributed-cloud---customer-edge-site---deployment--routing-options/319435 Configuration guide for DC Clustering Group:https://docs.cloud.f5.com/docs/how-to/advanced-networking/configure-dc-cluster-group1.4KViews4likes0CommentsCreate a BIG-IP HA Pair in Azure
Use an Azure ARM template to create a high availability (active-standby) pair of BIG-IP Virtual Edition instances in Microsoft Azure. When one BIG-IP VE goes standby, the other becomes active, the virtual server address is reassigned from one external NIC to another. Today, let’s walk through how to create a high availability pair of BIG-IP VE instances in Microsoft Azure. When we’re done, we’ll have an active-standby pair of BIG-IP VEs. To start, go to the F5 Networks Github repository. Click F5-azure-arm-templates. Then go to Supported>failover. You have several options at this point. You can chose which templates to use based on your needs, failing over via API calls, via upstream load balancers, and NIC counts. Read each readme to determine your desired deployment strategy. When you already have your subnets and existing IP addresses defined but to see how it works, let’s deploy a new stack. Click new stack and scroll down to the Deploy button. If you have a trial or production license from F5, you can use the BYOL or BIG-IQ as license server options but in this case we’re going to choose the PAYG option. Click Deploy and the template opens in the Azure portal. Now we simply fill out the fields. We’ll create a new Resource Group and set a password for the BIG-IP VEs. When you get to the questions: The DNS label is used as part of the URL. Instance Name is just the name of the VM in Azure. Instance Type determines how much memory and CPU you’ll have. Image Name determines how many BIG-IP modules you can run (and you can choose the latest BIG-IP version). Licensed Bandwidth determines the maximum throughput of the traffic going through BIG-IP. Select the Number of External IP addresses (we’ll start with one but can add more later). For instance, if you plan on running more than one application behind the BIG-IP, then you’ll need the appropriate external IP addresses. Vnet Address Prefix is for the address ranges of you subnets (we’ll leave at default). The next 3 fields (Tenant ID, Client ID, Service Principal Secret) have to do with security. Rather than using your own credentials to modify resources in Azure, you can create an Active Directory application and assign permissions to it. The last two fields also go together. Managed Routes let you route traffic from other external networks through the BIG-IPs. The Route Table Tag means that anytime this tag is found in the route table, routes that have this destination are updated so that the next hop is the IP address of the active BIG-IP VE. This is useful if you want all outbound traffic to go through the BIG-IP or if you want to send traffic from a bunch of different Vnets through the BIG-IP. We’ll leave the rest as default but the Restricted Src Address is good way to put IP addresses on my network – the ones that are allowed to connect to the BIG-IP. We’ll agree to the terms and click Purchase. We’re redirected to the Dashboard with the Deployment in Progress indicator. This takes about 15 minutes. Once finished we’ll go check all the resources in the Resource Group. Let’s find out where the virtual server address is located since this is associated with one of the external NICs, which have ‘ext’ in the name. Click the one you want. Then click IP Configuration under Settings. When you look at the IP Configuration for these NICs, whenever the NIC has two IP addresses that’s the NIC for the active BIG-IP. The Primary IP address is the BIG-IP Self IP and the Secondary IP is the virtual server address. If we look at the other external NIC we’ll see that it only has one Self IP and that’s the Primary and it doesn’t have the Secondary virtual server address. The virtual server address is assigned to the active BIG-IP. When we force the active BIG-IP to standby, the virtual server address is reassigned from one NIC to the other. To see this, we’ll log into the BIG-IPs and on the active BIG-IP, we’ll click Force to Standby and the other BIG-IP becomes Active. When we go back to Azure, we can see that the virtual server IP is no longer associated with the external NIC. And if we wait a few minutes, we’ll see that the address is now associated with the other NIC. So basically how BIG-IP HA works in the Azure cloud is by reassigning the virtual server address from one BIG-IP to another. Thanks to our TechPubs group and check out the demo video. ps6.4KViews0likes6CommentsLightboard Lessons: BIG-IP Deployments in Azure Cloud
In this edition of Lightboard Lessons, I cover the deployment of a BIG-IP in Azure cloud. There are a few videos associated with this topic, and each video will address a specific use case. Topics will include the following: Azure Overview with BIG-IP BIG-IP High Availability Failover Methods Glossary: ALB = Azure Load Balancer ILB = Azure Internal Load Balancer HA = High Availability VE = Virtual Edition NVA = Network Virtual Appliance DSR = Direct Server Return RT = Route Table UDR = User Defined Route WAF = Web Application Firewall Azure Overview with BIG-IP This overview covers the on-prem BIG-IP with a 3-nic example setup. I then discuss the Azure cloud network and cloud components and how that relates to making a BIG-IP work in the Azure cloud. Things discussed include NICs, routes, network security groups, and IP configurations. The important thing to remember is that the cloud is not like on-prem regarding things like L2 and L3 networking components. This makes a difference as you assign NICs and IPs to each virtual machine in the cloud. Read more here on F5 CloudDocs for Azure BIG-IP Deployments. BIG-IP HA and Failover Methods The high availability section will review three different videos. These videos will discuss the failover methods for a BIG-IP cluster and how traffic can failover to the second device upon a failover event. I also discuss IP addressing options for the BIG-IP VIP/listeners and "why". Question: “Which F5 solution is right for me? Autoscaling or HA solutions?” Use these bullet points as guidance: Auto Scale Solution Ramp up/down time to consider as new instances come and go Dynamically adjust instance count based on CPU, memory, and throughput No failover, all devices are Active/Active Self-healing upon device failure (thanks to cloud provider native features) Instances are deployed with 1-NIC only HA Failover (non auto scale) No Ramp up/down time since no additional devices are "auto" scaling No dynamic scaling of the cluster, it will remain as two (2) instances Yes failover, UDRs and IP config will failover to other BIG-IP instance No self-healing, manual maintenance is required by user (similar to on-prem) Instances can be deployed with multiple NICs if needed HA Using API for Failover How do IP addresses and routes failover to the other BIG-IP unit and still process traffic with no layer 2 (L2) networking? Easy, API calls to the cloud. When you deploy an HA pair of BIG-IP instances in the Azure cloud, the BIG-IP instances are onboarded with various cloud scripts. These scripts help facilitate the moving of cloud objects by detecting failover events, triggering API calls to Azure cloud, and thus moving cloud objects (ex. Azure IPs, Azure route next-hops). Traffic now processes successfully on the newly active BIG-IP instance. Benefits of Failover via API: This is most similar to a traditional HA setup No ALB or ILB required VIPs, SNATs, Floating IPs, and managed routes (UDR) can failover to the peer SNAT pool can be used if port exhaustion is a concern SNAT automap is optional (UDR routes needed if SNAT none) Requirements for Failover via API: Service Principal required with correct permissions BIG-IP needs outbound internet access to Azure REST API on port 443 Mutli-NIC required Other things to know: BIG-IP pair will be active/standby Failover times are dependent on Azure API queue (30-90 seconds, sometimes longer) I have experienced up to 20 minutes to failover IPs (public IPs, private IPs) UDR route table entries typically take 5-10 seconds in my testing experience BIG-IP listener can be secondary private IP associated with NIC can be an IP within network prefix being routed to BIG-IP via UDR Read about the F5 GitHub Azure Failover via API templates. HA Using ALB for Failover This type of BIG-IP deployment in Azure requires the use of an Azure load balancer. This ALB sits in a Tier 1 position and acts as an Layer 4 (L4) load balancer to the BIG-IP instances. The ALB performs health checks against the BIG-IP instances with configurable timers. This can result in a much faster failover time than the "HA via API" method in which the latter is dependent on the Azure API queue. In default mode, the ALB has Direct Server Return (DSR) disabled. This means the ALB will DNAT the destination IP requested by the client. This results in the BIG-IP VIP/listener IP listening on a wildcard 0.0.0.0/0 or the NIC subnet range like 10.1.1.0/24. Why? Because ALB will send traffic to the BIG-IP instance on a private IP. This IP will be unique per BIG-IP instance and cannot "float" over without an API call. Remember, no L2...no ARP in the cloud. Rather than create two different listener IP objects for each app, you can simply use a network range listener or a wildcard. The video has a quick example of this using various ports like 0.0.0.0/0:443, 0.0.0.0/0:9443. Benefits of Failover via LB: 3-NIC deployment supports sync-only Active/Active or sync-fail Active/Standby Failover times depend on ALB health probe (ex. 5 sec) Multiple traffic groups are supported Requirements for Failover via LB: ALB and/or ILB required SNAT automap required Other things to know: BIG-IP pair will be active/standby or active/active depending on setup ALB is for internet traffic ILB is for internal traffic ALB has DSR disabled by default Failover times are much quicker than "HA via API" Times are dependent on Azure LB health probe timers Azure LB health probe can be tcp:80 for example (keep it simple) Backend pool members for ALB are the BIG-IP secondary private IPs BIG-IP listener can be wildcard like 0.0.0.0/0 can be network range associated with NIC subnet like 10.1.1.0/24 can use different ports for different apps like 0.0.0.0/0:443, 0.0.0.0/0:9443 Read about the F5 GitHub Azure Failover via ALB templates. HA Using ALB for Failover with DSR Enabled (Floating IP) This is a quick follow up video to the previous "HA via ALB". In this fourth video, I discuss the "HA via ALB" method again but this time the ALB has DSR enabled. Whew! Lots of acronyms! When DSR is enabled, the ALB will send the traffic to the backend pool (aka BIG-IP instances) private IP without performing destination NAT (DNAT). This means...if client requested 2.2.2.2, then the ALB will send a request to the backend pool (BIG-IP) on same destination 2.2.2.2. As a result, the BIG-IP VIP/listener will match the public IP on the ALB. This makes use of a floating IP. Benefits of Failover via LB with ALB DSR Enabled: Reduces configuration complexity between the ALB and BIG-IP The IP you see on the ALB will be the same IP as the BIG-IP listener Failover times depend on ALB health probe (ex. 5 sec) Requirements for Failover via LB: DSR enabled on the Azure ALB or ILB ALB and/or ILB required SNAT automap required Dummy VIP "healthprobe" to check status of BIG-IP on individual self IP of each instance Create one "healthprobe" listener for each BIG-IP (total of 2) VIP listener IP #1 will be BIG-IP #1 self IP of external network VIP listener IP #2 will be BIG-IP #2 self IP of external network VIP listener port can be 8888 for example (this should match on the ALB health probe side) attach iRule to listener for up/down status Example iRule... when HTTP_REQUEST { HTTP::respond 200 content "OK" } Other things to know: ALB is for internet traffic ILB is for internal traffic BIG-IP pair will operate as active/active Failover times are much quicker than "HA via API" Times are dependent on Azure LB health probe timers Backend pool members for ALB are the BIG-IP primary private IPs BIG-IP listener can be same IP as the ALB public IP can use different ports for different apps like 2.2.2.2:443, 2.2.2.2:8443 Read about the F5 GitHub Azure Failover via ALB templates. Also read about Azure LB and DSR. Auto Scale BIG-IP with ALB This type of BIG-IP deployment takes advantage of the native cloud features by creating an auto scaling group of BIG-IP instances. Similar to the "HA via LB" mentioned earlier, this deployment makes use of an ALB that sits in a Tier 1 position and acts as a Layer 4 (L4) load balancer to the BIG-IP instances. Azure auto scaling is accomplished by using Azure Virtual Machine Scale Sets that automatically increase or decrease BIG-IP instance count. Benefits of Auto Scale with LB: Dynamically increase/decrease BIG-IP instance count based on CPU and throughput If using F5 auto scale WAF templates, then those come with pre-configured WAF policies F5 devices will self-heal (cloud VM scale set will replace damaged instances with new) Requirements for Auto Scale with LB: Service Principal required with correct permissions BIG-IP needs outbound internet access to Azure REST API on port 443 ALB required SNAT automap required Other things to know: BIG-IP cluster will be active/active BIG-IP will be deployed with 1-NIC BIG-IP onboarding time BIG-IP VE process takes about 3-8 minutes depending on instance type and modules Azure VM Scale Set configured with 10 minute window for scale up/down window (ex. to prevent flapping) Take these timers into account when looking at full readiness to accept traffic BIG-IP listener can be wildcard like 0.0.0.0/0 can use different ports for different apps like 0.0.0.0/0:443, 0.0.0.0/0:9443 Licensing PAYG marketplace licensing can be used BIG-IQ license manager can be used for BYOL licensing Sorry, no video yet...a picture will have to do! Here's an example diagram of auto scale with ALB. Read about the F5 GitHub Azure Auto Scale via ALB templates. Auto Scale BIG-IP with DNS This type of BIG-IP deployment takes advantage of the native cloud features by creating an auto scaling group of BIG-IP instances. Unlike "HA via LB" mentioned earlier or "Auto Scale with ALB", this deployment makes use of DNS that acts as a method to distribute traffic to the auto scaling BIG-IP instances. This solution integrates with F5 BIG-IP DNS (formerly named GTM). And...since there is no ALB in front of the BIG-IP instances, this means you do not need SNAT automap on the BIG-IP listeners. In other words, if you have apps that need to see the real client IP and they are non-HTTP apps (can't pass XFF header) then this is one method to consider. Benefits of Auto Scale with DNS: Dynamically increase/decrease BIG-IP instance count based on CPU and throughput If using F5 auto scale WAF templates, then those come with pre-configured WAF policies F5 devices will self-heal (cloud VM scale set will replace damaged instances with new) ALB not required (cost savings) SNAT automap not required Requirements for Auto Scale with DNS: Service Principal required with correct permissions BIG-IP needs outbound internet access to Azure REST API on port 443 SNAT automap optional BIG-IP DNS (aka GTM) needs connectivity to each BIG-IP auto scaled instance Other things to know: BIG-IP cluster will be active/active BIG-IP will be deployed with 1-NIC BIG-IP onboarding time BIG-IP VE process takes about 3-8 minutes depending on instance type and modules Azure VM Scale Set configured with 10 minute window for scale up/down window (ex. to prevent flapping) Take these timers into account when looking at full readiness to accept traffic BIG-IP listener can be wildcard like 0.0.0.0/0 can use different ports for different apps like 0.0.0.0/0:443, 0.0.0.0/0:9443 Licensing PAYG marketplace licensing can be used BIG-IQ license manager can be used for BYOL licensing Sorry, no video yet...a picture will have to do! Here's an example diagram of auto scale with DNS. Read about the F5 GitHub Azure Auto Scale via DNS templates. Summary That's it for now! I hope you enjoyed the video series (here in full on YouTube) and quick explanation. Please leave a comment if this helped or if you have additional questions. Additional Resources F5 High Availability - Public Cloud Guidance The Hitchhiker’s Guide to BIG-IP in Azure The Hitchhiker’s Guide to BIG-IP in Azure – “Deployment Scenarios” The Hitchhiker’s Guide to BIG-IP in Azure – “High Availability” The Hitchhiker’s Guide to BIG-IP in Azure – “Life Cycle Management”9.4KViews7likes7CommentsConfigure HA Groups on BIG-IP
Last week we talked about how HA Groups work on BIG-IP and this week we’ll look at how to configure HA Groupson BIG-IP. To recap, an HA group is a configuration object you create and assign to a traffic group for devices in a device group. An HA group defines health criteria for a resource (such as an application server pool) that the traffic group uses. With an HA group, the BIG-IP system can decide whether to keep a traffic group active on its current device or fail over the traffic group to another device when resources such as pool members fall below a certain level. First, some prerequisites: Basic Setup: Each BIG-IP (v13) is licensed, provisioned and configured to run BIG-IP LTM HA Configuration: All BIG-IP devices are members of a sync-failover device group and synced Each BIG-IP has a unique virtual server with a unique server pool assigned to it All virtual addresses are associated with traffic-group-1 To the BIG-IP GUI! First you go to System>High Availability>HA Group List>and then click the Create button. The first thing is to name the group. Give it a detailed name to indicate the object group type, the device it pertains to and the traffic group it pertains to. In this case, we’ll call it ‘ha_group_deviceA_tg1.’ Next, we’ll click Add in the Pools area under Health Conditions and add the pool for BIG-IP A to the HA Group which we’ve already created. We then move on to the minimum member count. The minimum member count is members that need to be up for traffic-group-1 to remain active on BIG-IP A. In this case, we want 3 out of 4 members to be up. If that number falls below 3, the BIG-IP will automatically fail the traffic group over to another device in the device group. Next is HA Score and this is the sufficient threshold which is the number of up pool members you want to represent a full health score. In this case, we’ll choose 4. So if 4 pool members are up then it is considered to have a full health score. If fewer than 4 members are up, then this health score would be lower. We’ll give it a default weight of 10 since 10 represents the full HA score for BIG-IP A. We’re going to say that all 4 members need to be active in the group in order for BIG-IP to give BIG-IP A an HA score of 10. And we click Add. We’ll then see a summary of the health conditions we just specified including the minimum member count and sufficient member count. Then click Create HA Group. Next, we go to Device Management>Traffic Groups>and click on traffic-group-1. Now, we’ll associate this new HA Group with traffic-group-1. Go to the HA Group setting and select the new HA Group from the drop-down list. And then select the Failover Method to Device with the Best HA Score. Click Save. Now we do the same thing for BIG-IP B. So again, go to System>High Availability>HA Group List>and then click the Create button. Give it a special name, click Add in the Pools area and select the pool you’ve already created for BIG-IP B. Again, for our situation, we’ll specify a minimum of 3 members to be up if traffic-group-1 is active on BIG-IP B. This minimum number does not have to be the same as the other HA Group, but it is for this example. Again, a default weight of 10 in the HA Score for all pool members. Click Add and then Create HA Group for BIG-IP B. And then, Device Management>Traffic Groups> and click traffic-group-1. Choose BIG-IP B’s HA Group and select the same Failover method as BIG-IP A – Based on HA Score. Click Save. Lastly, you would create another HA Group on BIG-IP C as we’ve done on BIG-IP A and BIG-IP B. Once that happens, you’ll have the same set up as this: As you can see, BIG-IP A has lost another pool member causing traffic-group-1 to failover and the BIG-IP software has chosen BIG-IP C as the next active device to host the traffic group because BIG-IP C has the highest HA Score based on the health of its pool. Thanks to our TechPubs group for the basis of this article and check out a video demo here. ps9.1KViews1like0CommentsHigh Availability Groups on BIG-IP
High Availability of applications is critical to an organization’s survival. On BIG-IP, HA Groups is a feature that allows BIG-IP to fail over automatically based not on the health of the BIG-IP system itself but rather on the health of external resources within a traffic group. These external resources include the health and availability of pool members, trunk links, VIPRION cluster members or a combination of all three. This is the only cause of failover that is triggered based on resources outside of the BIG-IP. An HA group is a configuration object you create and assign to a traffic group for devices in a device group. An HA group defines health criteria for a resource (such as an application server pool) that the traffic group uses. With an HA group, the BIG-IP system can decide whether to keep a traffic group active on its current device or fail over the traffic group to another device when resources such as pool members fall below a certain level. In this scenario, there are three BIG-IP Devices – A, B, C and each device has two traffic groups on it. As you can see, for BIG-IP A, traffic-group 1 is active. For BIG-IP B, traffic-group 2 is active and for BIG-IP C, both traffic groups are in a standby state. Attached to traffic-group 1 on BIG-IP A is an HA group which specifies that there needs to be a minimum of 3 pool members out of 4 to be up for traffic-group-1 to remain active on BIG-IP A. Similarly, on BIG-IP B the traffic-group needs a minimum of 3 pool members up out of 4 for this traffic group to stay active on BIG-IP B. On BIG-IP A, if fewer than 3 members of traffic-group-1 are up, this traffic-group will fail-over. So let’s say that 2 pool members go down on BIG-IP A. Traffic-group-1 responds by failing-over to the device (BIG-IP) that has the healthiest pool…which in this case is BIG-IP C. Now we see that traffic-group-1 is active on BIG-IP C. Achieving the ultimate ‘Five Nines’ of web site availability (around 5 minutes of downtime a year) has been a goal of many organizations since the beginning of the internet era. There are several ways to accomplish this but essentially a few principles apply. Eliminate single points of failure by adding redundancy so if one component fails, the entire system still works. Have reliable crossover to the duplicate systems so they are ready when needed. And have the ability to detect failures as they occur so proper action can be taken. If the first two are in place, hopefully you never see a failure. But if you do, HA Groups can help. ps Related: Lightboard Lessons: BIG-IP Basic Nomenclature Lightboard Lessons: Device Services Clustering HA Groups Overview2.2KViews0likes2CommentsThe Cloud is Still a Datacenter Somewhere
Application delivery is always evolving. Initially, applications were delivered out of a physical data center, either dedicated raised floor at the corporate headquarters or from some leased space rented from one of the web hosting vendors during the late 1990’s to early 2000’s or some combination of both. Soon global organizations and ecommerce sites alike, started to distribute their applications and deploy them at multiple physical data centers to address geo-location, redundancy and disaster recovery challenges. This was an expensive endeavor back then even without adding the networking, bandwidth and leased line costs. When server virtualization emerged and organizations had the ability to divide resources for different applications, content delivery was no longer tethered 1:1 with a physical device. It could live anywhere. With virtualization technology as the driving force, the cloud computing industry was formed and offered yet another avenue to deliver applications. Application delivery evolved again. As cloud adoption grew, along with the Softwares, Platforms and Infrastructures enabling it, organizations were able to quickly, easily and cost effectively distribute their resources around the globe. This allows organizations to place content closer the user depending on location, and provides some fault tolerance in case of a data outage. Today, there is a mixture of options available to deliver critical applications. Many organizations have on-premises private, owned data center facilities, some leased resources at a dedicated location and maybe even some cloud services. In order to achieve or even maintain continuous application availability and keep up with the pace of new application rollouts, many organizations are looking to expand their data center options, including cloud, to ensure application availability. This is important since 84% of datacenters had issues with power, space and cooling capacity, assets, and uptime that negatively impacted business operations according to IDC. This leads to delays in application rollouts, disrupted customer service or even unplanned expenses to remedy the situation. Operating in multiple data centers is no easy task, however, and new data center deployments or even integrating existing data centers can cause havoc for visitors, employees and IT staff alike. Critical areas of attention include public web properties, employee access to corporate resources and communication tools like email along with the security and required back end data replication for content consistency. On top of that, maintaining control over critical systems spread around the globe is always a major concern. A combination of BIG-IP technologies provides organizations the global application services for DNS, federated identity, security, SSL offload, optimization & application health/availability to create an intelligent cost effective, resilient global application delivery infrastructure across a hybrid mix of data centers. Organizations can minimize downtime, ensure continuous availability and have on demand scalability when needed. Simplify, secure and consolidate across multiple data centers while mitigating impact to users or applications. ps Related: Datacenter Transformation and Cloud The Event-Driven Data Center Hybrid Architectures Do Not Require Private Cloud The Dynamic Data Center: Cloud's Overlooked Little Brother Decade old Data Centers Technorati Tags: datacenter,f5,big-ip,hybrid,cloud,private,multi,applications,silva Connect with Peter: Connect with F5:460Views0likes0CommentsSurviving Disasters with F5 GTM and Oracle Enterprise Manager
Let's say you have 2 or more Data Centers, or locations where you run your applications. MANs, WANs, Cloud whatever - You've architected diverse fiber routes, multipath IP routing, perhaps some Spanning Tree gobblygookness... and you feel confident that your network can handle an outage, right ? So what about the mission critical Applications, Middleware, and Databases running ON TOP of all that fancy, expensive Disaster Recovery bundle of ca$$$h you and the CIO spent ? Did you even test the failover ? Failback ?? How many man-hours and dozens of scripts did it really take ? And more importantly, how much money could your company lose while everyone waits ?? Allow me to try and put your mind at ease, with some great Oracle Enterprise Manager 12c and F5 Networks solutions, and the soothing words of Maximum Availability Architecture Best Practices guidance. We recently finished a hot-off-the-press MAA whitepaper that details how to use F5's Global Traffic Manager with EM12c to provide quick failover of your Data Center's Oracle Management Framework, the EM12c platform itself. You see, it stands to reason, that if you are going to try and move Apps or Databases from one Data Center to another, you have to have a management platform that in and of itself is Highly Available - right ? I mean, how do you even expect to handle the movement of apps and DBs, if your "management framework" is single site only, and thatsite now looks like something from the movie Twister ? Of course, it could be any natural or man-made disaster, but the fiber-eating backhoe is my fave. You need a fully HA EM12c environment to start with first, like the cornerstone of any solid DR/BC solution, the foundation has to be there first. For more details on the different Level 1-4 architectures and descriptions, see the EM12c Framework docs, Part Number E24473-18. But at Last, for Level 4, the best of the best: after you set up Level 3 in each of your sites, you add GTM to your DNS system to quickly detect failures and route both EM agent and administrative access from one Data Center to the other. Where to find all this goodness? The F5 Oracle Enterprise Manager, and Oracle MAA pages. EM Level 3 HA Guide: http://www.oracle.com/technetwork/oem/framework-infra/wp-em12c-config-oms-ha-bigip-1552459.pdf EM Level 4 HA Guide: http://www.oracle.com/technetwork/database/availability/oms-dr-f5-big-ip-gtm-1880830.pdf So if on some dark day you do end up with DataCenter-Twisted for your Primary site, GTM will route you to the Standby site, where EM12c is up and running and waiting to help you control the Apps, Middleware and Databases that are critical to your business. And just in case you are wondering, it is rumored that F5 Networks gets its name from the F5 class tornado ...280Views0likes0CommentsMy application is not the next Twitter so why should I care about high availability?
It often seems that load balancing and high availability are associated with only high traffic sites, like Twitter and Google. But load balancing and high availability isn't just for Web 2.0 phenomenons or web monsters; it can be an invaluable tool in your strategy to maintain service level agreements and customer satisfaction no matter how large or small your customer base - and data center - might be. Load balancing is integral to scalability, to being able to increase the capacity of your web and application servers. But it also just as inexorably linked to high availability through its ability to provide fail-over. Fail-over ensures that if, for any reason, one server in a pool/farm is unavailable that requests are redirected to a secondary or stand-by server. This ensures the site or application is available at all times. More typically, all servers in a pool are utilized at all times to improve performance and to maintain availability in the event that one or more server becomes unavailable. This is true whether you have two servers or two-hundred servers in your pool; whether you're Twitter or Bob's Widget Shop. Where's F5? Storage Decisions Sept 23-24 in New York Networld IT Roadmap Sept 23 in Dallas Oracle Open World Sept 21-25 in San Francisco Storage Networking World Oct 13-16 in Dallas Storage Expo 2008 UK Oct 15-16 in London Storage Networking World Oct 27-29 in Frankfurt High availability can be used in multiple scenarios to provide for continued availability regardless of size and reach of your site or application. MAINTENANCE WINDOWS Everybody has them, and they often result in downtime that, while understandable, may frustrate customers or users, especially if it's unscheduled. Patching, upgrades, migrations, hardware changes - these can all lead to necessary downtime. By implementing a high availability strategy through load balancing, you can ensure that applications remain available. This is accomplished by performing whatever tasks are necessary on one server, allowing the second (or more) to continue to serve requests. Because the load balancer (or application delivery controller) mediates between clients and your servers, customers see no interruption of service while you are working on any one one the servers in the pool. JUST IN CASE Unscheduled downtime is a nice way of saying "Things happen". And when those things happen that cause a server to fail - hardware, infections, licensing issues, bugs - it's nice to know that your application or site, and thus your overall availability, will not likely be affected. Unanticipated downtime can destroy your overall availability rating and cause users to go into a tizzy. A high availability deployment will prevent "things" from taking down your entire site or application, giving you time to focus on the problem at hand and solve it without fielding calls from any number of interested and angry constituents. WIGGLE ROOM Sometimes you develop an application and you're pretty sure you can serve the needs of your customers just fine. And then something happens and you're the next best thing on the web since Google. Maybe you got slashdotted or farked, or maybe you're the last retailer left in the country that's selling The Hottest Christmas Toy this year. Whatever the reason, a sudden spike in volume of users can leave your servers smoking. Implementing a high availability infrastructure can ensure that even if you don't always need that second (or one hundred and twenty-second) server that in the event you do need it, it's there and immediately usable. There's nothing special you need to do, it just picks up the extra load by virtue of being part of the pool. And if you need even more wiggle room, you can add another, and another, and another server transparently. There's no need to interrupt service, just tell your load balancer or application delivery controller that the server is available and it immediately becomes part of the pool, serving up your application to hungry users. High availability isn't just for huge sites and applications, it's a good strategy for anyone who delivers applications via the web. If your business might suffer from downtime, then you need to consider implementing a high-availability strategy sooner rather than later.222Views0likes1Comment4 things you can do in your code now to make it more scalable later
No one likes to hear that they need to rewrite or re-architect an application because it doesn't scale. I'm sure no one at Twitter thought that they'd need to be overhauling their architecture because it gained popularity as quickly as it did. Many developers, especially in the enterprise space, don't worry about the kind of scalability that sites like Twitter or LinkedIn need to concern themselves with, but they still need to be (or at least should be) concerned with scalability in general and the effects of inserting an application into a high-scalability environment, such as one fronted by a load balancer or application delivery controller. There are some very simple things you can do in your code, when you're developing an application, that can ease the transition into a high-availability architecture and that will eventually lead to a faster, more scalable application. Here are four things you can do now - and why - to make your application fit better into a high availability environment in the future and avoid rewriting or re-architecting your solutions later. Where's F5? Storage Decisions Sept 23-24 in New York Networld IT Roadmap Sept 23 in Dallas Oracle Open World Sept 21-25 in San Francisco Storage Networking World Oct 13-16 in Dallas Storage Expo 2008 UK Oct 15-16 in London Storage Networking World Oct 27-29 in Frankfurt 1. Don't assume your application is always responsible for cookie encryption Encrypting cookies in today's privacy lax environment that is the Internet is the responsible thing to do. In the first iterations of your application you will certainly be responsible for handling the encryption and decryption of cookies, but later on, when the application is inserted into a high-availability environment and there exists an application delivery controller (ADC), that functionality can be offloaded to the ADC. Offloading the responsibility for encryption and decryption of cookies to the ADC improves performance because the ADC employs hardware acceleration. To make it easier to offload this responsibility to an ADC in the future but support it early on, use a configuration flag to indicate whether you should decrypt or encrypt cookies before examining them. That way you can simply change the configuration flag later on and immediately take advantage of a performance boost from the network infrastructure. 2. Don't assume the client IP is accurate If you need to use/store/access the client's IP address, don't assume the traditional HTTP header is accurate. Early on it certainly will be, but when the application is inserted into a high availability environment and a full-proxy solution is sitting in front of your application, it won't be. A full-proxy mediates between client and server, which means it is the client when talking to the server, so its IP address becomes the "client IP". Almost all full-proxies insert the real client IP address into the X-Forwarded-For HTTP header, so you should always check that header before checking the client IP address. If there is an X-Forwarded-For value, you'll more than likely want to use it instead of the client IP address. This simple check should alleviate the need to make changes to your application when it's moved into a high availability environment. 3. Don't use relative paths Always use the FQDN (fully qualified domain name) when referencing images, scripts, etc... inside your application. Furthermore, use different host names for different content types - i.e. images.example.com and scripts.example.com. Early on all the hosts will point to the same server, probably, but by insuring that you're using the FQDN now makes architecting that high availability environment much easier. While any intelligent application delivery controller can perform layer 7 switching on any part of the URI and arrive at the same architecture, it's much more efficient to load balance and route application data based on the host name. By using the FQDN and separating host names by content type you can later optimize and tune specific servers for delivery of that content, or use the CNAME trick to improve parallelism and performance in request heavy applications. 4. Separate out API rate limiting functionality If you're writing an application with an API for integration later, separate out the rate limiting functionality. Initially you may need it, but when the application is inserted into a high-availability environment with an intelligent application delivery controller, it can take over that functionality and spare your application from having to reject requests that exceed the set limits. Like cookie encryption, use a configuration flag to determine whether you should check this limitation or not so it can be easily be turned on and off at will. By offloading the responsibility for rate limiting to an application delivery controller you remove the need for the server to waste resources (connections, RAM, cycles) on requests it won't respond to anyway. This improves the capacity of the server and thus your application, making it more efficient and more scalable. By thinking about the ways in which your application will need to interact with a high availability infrastructure later and adjusting your code to take that into consideration you can save yourself a lot of headaches later on when your application is inserted into that infrastructure. That means less rewriting of applications, less troubleshooting, and fewer servers needed to scale up quickly to meet demand. Happy coding!367Views0likes1Comment