What is High Availability (HA)?
Simply put, High Availability means keeping a service available to the user without any noticeable interruption. It is achieved using the following principles:
- Redundancy
- Monitoring
- Failover
Implementing a highly available solution provides the benefits of reducing incurred downtime, whether due to loss of service, a site outage, human error, or scheduled maintenance, and improving application and service performance through scalability.
Another advantage of implementing high availability is that a plethora of technologies exists to support the needs of any organization. These range from network load balancing and service clustering at the hypervisor or database level to service backups and replication, all depending on budget and existing infrastructure resources.
The caveat is that in most if not all cases, someone needs to implement and support the solution and technology of choice. That expertise must come from the systems administrator, from someone else within the organization (e.g., a network load balancer expert), or from a paid support contract with the technology vendor. HA always has costs in time and money, all determined by the solution and technology of choice.
Why HA is a Good Thing
HA protects resources so that in the event of a computer service outage, your business can continue with minimal interruption or loss of throughput. By reducing the risk of downtime, we improve business continuity.
One way that HA reduces the risk of downtime is by eliminating single points of failure. If your entire PaperCut system runs on one server, then that server is a single point of failure, and so are its components, such as the hard drive, power supply, and network interface card. Even software changes such as driver updates and operating system patches can bring down the system if they contain severe defects. It is therefore essential to test changes before they go live in a production environment, and to follow standardized change control processes thereafter.
Making any system Highly Available is not just about how it is architected, but also about the processes and procedures that are implemented to help reduce the risk of a system failure. All in all, HA gives you methodologies to eliminate single points of failure, reduce the risk of downtime, and increase business continuity. That’s why it’s a good thing.
Business Continuity Planning
Having a business continuity plan is paramount when we look to implement high availability within an organization. It provides the business with defined guidelines of how operations will continue during an unanticipated disruption of service through the means of prevention and recovery systems while maintaining service levels for the end-users.
When creating a business continuity plan, it’s good practice to conduct a business impact analysis and recovery strategy initiative. This helps set expectations, define processes and guide resolutions.
Business Impact Analysis (BIA)
A BIA is used to identify the time-sensitive and business-critical functions that need to be maintained, the current processes, and the resources required to support them. This allows the business to better understand the operational ramifications should the print service fail, and indicates how long the company can operate without print.
Producing questionnaires, running workshops, and formulating interviews with the relevant stakeholders are some of the ways to ascertain the required information.
Recovery strategies
When generating a recovery strategy, we must use the BIA to identify and document the resource requirements of the business. We can then recognize the gaps between recovery requirements and existing business capabilities by conducting a gap analysis exercise.
With this information, we can answer questions such as:
- What current technologies and processes exist within the organization?
- What in-house organization resources can we utilize to support the business requirements?
- What teams or individuals can we contact when a failure occurs?
Implementing a business continuity plan is hugely beneficial. It provides the organization with a roadmap and a set of processes that support making the right decisions if and when unexpected operational failures occur.
Organizational Factors
Before we can achieve a Highly Available system we need to identify where the weak links are in the computing services. Where are the single points of failure? This is a complicated question because it depends on many factors, such as:
Business value
First, we have to know what parts of the business need to continue, and what their value is to the overall organization. If we are running a hospital and the snack vending machine can’t process payments, it’s probably not as vital as ICU patient monitoring. However, if you’re a university during finals week that same vending system could be mission critical.
User behavior
Another factor in proper HA planning is user behavior. Are there peak usage times which stress certain components of the computing system? Do some applications create higher system load? Which users will need system access even in the event of a disaster?
System infrastructure
Are there multiple types of devices that need to be protected? Have we identified all the pieces of the system? What if we lose power to the primary datacenter?
Textbooks are written on techniques to protect computing services. Whether it’s the servers, operating systems, databases or power sources, we need to understand business goals and any single points of failure in the computing systems that will put them at risk.
Achieving HA
HA is achieved primarily through two methods: redundancy and recovery. Both give the computing system resilience (i.e. an ability to return to full operation).
Redundancy
Redundancy solves the problem of a potential failure by having a duplicate standing by. It uses technologies like RAID, virtual machine images, clusters and network load balancers (NLBs). If you’re using RAID and a hard disk crashes, no problem: the data has been redundantly written to other disks. If you are using virtual machines and the whole VM crashes, no problem: just spin up the latest VM image on a new VM server and you’re back in business. Clusters and NLBs have multiple servers running and can divert computing requests away from a failed server to one that is still up. Theoretically, you can have an extremely low risk of downtime by implementing redundancy.
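The effect of adding duplicates can be illustrated with a simple availability calculation. The Python sketch below is illustrative only: the function name and the 99% figure are assumptions, and the formula holds only if the redundant copies fail independently of one another.

```python
def combined_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant components when any one can carry the load.

    Assumption: failures are independent, which real systems only approximate.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # The group is unavailable only when every copy is down at the same time.
    return 1.0 - (1.0 - single) ** copies


if __name__ == "__main__":
    for n in (1, 2, 3):
        result = combined_availability(0.99, n)
        print(f"{n} server(s) at 99% each -> {result:.4%} combined")
```

Under these assumptions, each additional copy removes a large share of the remaining downtime. Anything the copies share, such as a power supply, a storage array or a data center, breaks the independence assumption and lowers the real figure.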
However, the cost is usually at least double for a redundant system. And there is added time and complexity to build and maintain these systems, which also require highly trained personnel.
Serious consideration should be given to determine the type and level of redundancy. One PaperCut customer had a very sophisticated Linux HA system with clustered DNS servers and clustered database servers. It was bullet-proof. However, when their system administrator left the organization, no one knew how to maintain it. System upgrades were painful and time-intensive. Eventually, they scrapped the complex, over-engineered system and implemented VM servers that could be spun up on new hardware at a moment’s notice. It provided the same level of redundancy and much less complexity.
Recovery
HA can also be accomplished with a good disaster recovery plan. This is the most basic form of high availability and avoids most of the complexity of many HA techniques. Good disaster recovery plans will have procedures to minimize downtime for key systems. One such procedure could be taking a database backup every night, and writing daily transactions to an offsite server. This should give you the ability to have the full database back online within a short time even if the main database server crashes and burns.
Resilience
System resilience is the ability to ensure the continuous availability of operational services such as printing within an organization. Utilizing the PaperCut Site Server component can protect the print infrastructure from unexpected network outages or unreliable network links across multiple remote office locations.
In the event of a connectivity failure between the head office and a remote site, the PaperCut Site Server will automatically take over the role of the PaperCut Primary Application Server. The failover process is seamless and transparent, protecting print tracking, copy, authentication, and Find-Me printing from going offline. Once network connectivity to the Primary Application Server at the head office has resumed, the PaperCut Site Server hands back the responsibility to the Primary Application Server.
For more information on what the PaperCut Site server protects, please see our Offline Operations KB.
What is the right level of High Availability?
Once upon a time, we had another PaperCut customer with a robust installation that included clustered application servers, clustered print servers, and clustered database servers – all of which pointed to a SAN. From a system point of view, it was on the high end of HA… Until a fire tore through the data center, and then it wasn’t on the high end of anything, let alone available. Forced into a redesign, they reconstructed their PaperCut installation and opted for more modern virtual machine technologies to provide the same level of HA.
This leads us to a primary consideration: what is the right level of HA? There is a significant difference in TCO between providing 99% uptime and 99.99% uptime. Is the difference necessary and worth the cost? Even if HA is a good thing for your business, and the ways to achieve it are well understood, the question still needs to be considered: what level of HA is necessary? You’ll have different answers to this question for various functions in your organization. Your mission-critical systems need more HA techniques for more parts of the infrastructure if the cost of an outage would be greater than the cost of providing HA.
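The comparison between the cost of an outage and the cost of providing HA can be sketched as a rough calculation. The Python below is a minimal illustration: the function names and every monetary figure are hypothetical, and a real assessment would use numbers from the business impact analysis.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def annual_downtime_hours(uptime_percent: float) -> float:
    """Hours of downtime per year allowed by a given uptime percentage."""
    return HOURS_PER_YEAR * (1 - uptime_percent / 100)


def ha_is_worthwhile(current_uptime: float, target_uptime: float,
                     outage_cost_per_hour: float, annual_ha_cost: float) -> bool:
    """True if the downtime avoided is worth more than the HA solution costs."""
    hours_saved = (annual_downtime_hours(current_uptime)
                   - annual_downtime_hours(target_uptime))
    return hours_saved * outage_cost_per_hour > annual_ha_cost


if __name__ == "__main__":
    # Hypothetical figures: moving from 99% to 99.99% uptime for 20,000 a year.
    print(ha_is_worthwhile(99.0, 99.99, outage_cost_per_hour=500, annual_ha_cost=20_000))
    print(ha_is_worthwhile(99.0, 99.99, outage_cost_per_hour=100, annual_ha_cost=20_000))
```

With these made-up numbers, the same HA investment pays for itself for a system whose outages cost 500 per hour, but not for one whose outages cost 100 per hour, which is why different functions in the organization get different answers.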
For example, vital systems such as order placement, user authentication, and database may need 99.99% uptime. This might be enabled with multiple HA techniques like virtual machine snapshots, clustering, synchronous replication, hot sites, and off-site backups. However, the print system might be just fine with an hour of downtime.
Our focus when considering HA should be on two objectives:
Recovery Point Objective (RPO)
RPO is the maximum amount of data loss the business process can cope with, expressed as a length of time. It’s a measure of how far back in time you must go to get a recovery point, so it sets the longest acceptable interval between snapshots or backups of your data.
Recovery Time Objective (RTO)
RTO is the maximum tolerable time from point of failure back to normal operation. It is a target, and should not be confused with Mean Time To Recovery (MTTR), which is the average recovery time actually achieved.
RPO and RTO need to be carefully considered, because together they determine the cost of recovery. The smaller the times for RPO and RTO, the larger the cost of recovery. If you want recovery points measured in seconds and recovery time in minutes, then expect a very high cost of recovery.
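A quick way to reason about the two objectives is to compare them with what the current backup regime can deliver. The following Python sketch is illustrative only: the function name and the sample durations are assumptions, and the worst case assumes a failure occurs just before the next backup would have run.

```python
from datetime import timedelta


def meets_objectives(backup_interval: timedelta, estimated_restore_time: timedelta,
                     rpo: timedelta, rto: timedelta) -> tuple[bool, bool]:
    """Return (rpo_ok, rto_ok) for a given backup schedule and restore estimate."""
    # Worst case, everything since the last backup is lost, so the backup
    # interval is the largest possible data loss.
    rpo_ok = backup_interval <= rpo
    # The restore estimate must fit inside the tolerable recovery window.
    rto_ok = estimated_restore_time <= rto
    return rpo_ok, rto_ok


if __name__ == "__main__":
    # Hypothetical example: nightly backups, a two-hour restore,
    # an RPO of four hours and an RTO of eight hours.
    print(meets_objectives(
        backup_interval=timedelta(hours=24),
        estimated_restore_time=timedelta(hours=2),
        rpo=timedelta(hours=4),
        rto=timedelta(hours=8),
    ))
```

In this made-up case the restore time fits the RTO, but nightly backups cannot satisfy a four-hour RPO. Closing that gap means more frequent backups or replication, which is where the additional cost comes from.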
Uptime - and how it’s calculated
Living in the current era of the “always-on, always available” service, there is an ever-increasing demand for 100% uptime of business-critical services. Whether it is your print management solution or the broader supporting infrastructure, in reality, things are bound to fail no matter how hard we prepare. Therefore we must align the expectation to something more achievable, for example 99% uptime, by utilizing the tools available to create an HA environment.
As such, uptime is expressed as the availability of resources over a year, measured in minutes or seconds. Availability levels start at 99% and are counted in “nines” (99%, 99.9%, 99.99%, and so on), each representing the ratio of minutes the service is up and running to the total minutes in a year.
The organization’s requirements determine what level of uptime is required and what level of downtime is acceptable.
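To make the “nines” concrete, the short Python sketch below converts an uptime percentage into the downtime it allows per year. The function name is an assumption, and the calculation uses a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(uptime_percent: float) -> float:
    """Minutes of downtime per year permitted by a given uptime percentage."""
    return MINUTES_PER_YEAR * (100 - uptime_percent) / 100


if __name__ == "__main__":
    for level in (99.0, 99.9, 99.99, 99.999):
        minutes = downtime_minutes_per_year(level)
        print(f"{level}% uptime -> {minutes:,.1f} minutes "
              f"({minutes / 60:.2f} hours) of downtime per year")
```

On this basis, 99% uptime still allows about 87.6 hours of downtime a year, while 99.99% allows under an hour (roughly 52.6 minutes).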
Protecting PaperCut with HA
Now let’s apply all this to a PaperCut deployment. We don’t want to start with the assumption that printing systems need to be protected at the same level as other business functions. We should start with the business objective questions. What is the maximum amount of time that print jobs could be lost and need to be reprinted (RPO)? What is the maximum time allowable from failure of printing systems to full recovery (RTO)? And finally, what is the expected total cost of recovery?
It is possible with PaperCut to improve overall system resilience at multiple points in the infrastructure. PaperCut recommends using HA technologies and methods that are most familiar to the customer, and where they have trained personnel to support them. This reduces overall system complexity by not introducing new tools and procedures to learn and implement in the event of an outage.
Remember our clustered Linux HA customer mentioned earlier? It turns out that their single point of failure was the person who set up the complex environment using multiple HA technologies that no one else understood. PaperCut does not want to force customers to add yet another tool on the HA stack unless they feel it necessary.
However, HA technologies have evolved to the point where very favorable RPO and RTO can be achieved. The full printing system, including PaperCut, can be protected against downtime with off-the-shelf HA technologies. The customer does not have to learn our way of providing HA; they should be able to use what they already know. Keep in mind that whichever option they choose, it cannot fully protect against the most common sources of downtime: full hard disks, dead NICs and human error.
Therefore, the discussion to add HA for PaperCut should start with, “How does the customer defend against downtime on other servers, and in particular the printing systems?” For example, it would do us little good to defend the PaperCut server and not other print servers as well. Anywhere that PaperCut is running, or where it utilizes a resource, should be considered in the HA plan. Simply put, the main pieces that we want to evaluate protecting are the PaperCut Application Server, Site Server, Print Provider (i.e. print servers), database and the multi-function devices (MFDs). If a customer already employs HA methods on other servers and resources, they should be able to include PaperCut in the same manner with very little additional training or expense.
Whatever methods you employ for PaperCut HA, “The general principle is start light, and build over time” (Chris Dance, PaperCut CEO).
PaperCut Examples
For more information see the High Availability Whitepaper (High Availability - Protecting PaperCut NG & MF, printing and beyond).
Related articles
- High Availability Whitepaper:
- Failover technical documents:
- Application Server Failover (PaperCut NG/MF Application Server)
- Cluster Server
- VM Clusters
- Network Load Balancing