Compliance and Operational Security
In this chapter, we will discuss best business practices regarding compliance and operational security. This will include discussions of risk-related concepts, such as false positives and risk calculation, risk mitigation strategies, and incident response procedures. This chapter also covers security-related awareness and training concepts, such as identifying PII (personally identifiable information), data handling, threat awareness, and proper use of social and P2P networking. It will also cover business continuity practices, environmental controls, disaster recovery, and CIA concepts. The core Security+ exam objectives covered in this chapter are as follows:
- Explain risk-related concepts
- Carry out appropriate risk mitigation strategies
- Execute appropriate incident response procedures
- Explain the importance of security-related awareness and training
- Compare and contrast aspects of business continuity
- Explain the impact and proper use of environmental controls
- Execute disaster recovery plans and procedures
- Exemplify the concepts of confidentiality, integrity, and availability (CIA)
Risk-Related Concepts
Risks to technical systems come in many shapes and sizes. It is important for administrators of those systems to understand the types of risks; to equip themselves with tools to judge the magnitude of various risks; and to be prepared to implement policies, procedures, and technical solutions to mitigate those risks. This section will cover the following topics:
- Control types
- False positives
- Importance of policies in reducing risk
- Risk calculation
- Quantitative versus qualitative
- Risk avoidance, transference, acceptance, mitigation, and deterrence
- Risks associated with cloud computing and virtualization
Control Types
Control of the use of technical systems is not limited to technical means. The use of these systems can be controlled by technical means, by management policies put in place regarding proper use, or by non-technical means, such as physically securing important devices. The following is a list of control types:
Technical controls – These include the use of built-in controls, such as the Authentication, Authorization, and Accounting (AAA) model, to control access to and use of resources. An example of a technical control would be denying all other users access to a given user's mailbox.
Management controls – These policies cover the allowed use of an asset. Management controls may or may not have technical or operational controls in place to force compliance. An example of a management control would be a rule stating that it is impermissible to access another user's mailbox.
Operational controls – Operational controls focus on the actions of people, as opposed to technical systems. An example of an operational control would be the practice of always escorting or monitoring administrators in a mission-critical server room.
False Positives
Any system that attempts to filter content dynamically must make a judgment call as to whether the content conforms to or violates the policy. An anti-virus program might mistakenly classify an innocuous file as a virus. A spam filter may tag a legitimate e-mail as spam and send it to quarantine. When an agent erroneously identifies conforming content as a policy violation, the agent has identified a false positive.
There may be substantial costs associated with excessive false positives. Imagine a spam filter that flags just one percent of the legitimate mail flowing through a 50-person organization as spam. The administrator of the system would need to manually correct the routing of misclassified mail equal to half of an average user's mail flow. If a timesaving system such as anti-spam has a high enough false-positive rate, it may actually end up reducing efficiency.
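As a rough illustration, the following sketch reproduces that arithmetic in Python; the per-user message volume is an assumed figure chosen only to make the numbers concrete, not data from the example.

```python
# Rough illustration of the false-positive cost described above.
users = 50
legit_messages_per_user_per_day = 40   # assumed average mail flow per user
false_positive_rate = 0.01             # 1% of legitimate mail flagged as spam

total_legit = users * legit_messages_per_user_per_day
misrouted = total_legit * false_positive_rate

print(f"Legitimate messages per day: {total_legit}")
print(f"Misrouted to quarantine:     {misrouted:.0f}")
print(f"Equivalent to {misrouted / legit_messages_per_user_per_day:.1f}x "
      f"one user's average daily mail flow")
```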
Importance of Policies in Reducing Risk
Organizations may implement a number of policies to mitigate – or reduce – risks to their organization. Management may want to consider the benefits and drawbacks associated with many different common types of policies. For any policy to be effective in reducing risk, it must be practiced. A policy that is made and ignored provides minimal benefit to any organization.
Privacy policy – A privacy policy describes the ways in which an organization may and may not use information that may be considered private. For instance, will a company sell information regarding shopping habits to third parties? Will an organization be able to release name and address information to a print shop in order to outsource mailing information? Organizations considering implementing a privacy policy will want to consider any legal requirement for confidentiality, customer confidence/comfort level with the disclosure of information pertaining to them, and the impact an overly restrictive policy might have on the organization’s ability to operate effectively.
Acceptable use – An acceptable use policy states the allowed and disallowed uses of company resources. An Internet acceptable use policy, for example, might state whether Internet streaming music is permissible, or list certain types of sites that may not be visited. It may state hours during which certain activities are permissible, such as personal web activity during the lunch hour only. These policies serve to inform users of the expectations around the use of company resources.
Security policy – A security policy is concerned with the policies and practices to be taken to ensure resources are protected against unauthorized use. This includes internal and external threats, such as hacking or unauthorized disclosure of data. The security policy encompasses the policies put in place to enforce many other policies, such as privacy and acceptable use.
Mandatory vacations – Employees involved with the handling of sensitive data may be required to take “mandatory vacations.” The goal of a mandatory vacation policy is to have another employee take over the responsibilities of sensitive employees, increasing the likelihood that any improper activity will be detected, while simultaneously discouraging these types of activities. By mandating a number of consecutive working days away from the office, an organization can ensure that an employee’s work will be seen by another employee.
Job rotation – Job rotation, or regularly assigning new employees to handle sensitive data, accomplishes many of the same goals as a mandatory vacation. If an employee or set of employees is assured that their work will regularly be taken over by someone else, there is less incentive to engage in unacceptable activities. A downside of job rotation is that it may stunt the development of expertise in any given area, as an employee who regularly changes roles is more likely to become a generalist rather than a specialist.
Separation of duties – The separation of duties ensures that there are checks and balances in place; that is, that no one person is capable of unilaterally taking common actions that may violate policy. One common example of this would be having a different person write checks than the person who signs them. This ensures that each transaction must pass through at least two people. Another example would be a programmer submitting his or her code for manager approval before it is merged into the current build. The programmer is only concerned with writing the necessary code, and the manager is concerned with managing the build.
Least privilege – The principle of least privilege in data security states that each user should have the rights necessary to do all he or she is required to do, and no more. If an administrator is in charge of databases and backups, he or she does not necessarily need permission to re-program routers or access user mailboxes. By restricting permissions to those that are necessary, an organization can protect data from unnecessary exposure (i.e., protecting confidential information without impeding the flow of approved information).
Risk Calculation
When assessing risk, it is necessary to estimate the likelihood of any individual occurrence of the risk to be protected against and the impact of any occurrence, and from these to calculate the annualized loss expectancy (ALE). This is called risk calculation.
Likelihood – This is the chance that any given risk will occur in a given time period. The estimation of the likelihood of any given risk occurring is often relatively stable over large numbers of actors. Assuming risk is relatively evenly distributed across a given industry, industry averages are often a good indicator of the likelihood of occurrence.
Impact – This is the loss that the organization would suffer as a result of an occurrence of the risk to be mitigated. This might be a catastrophic data loss, or a claim for unemployment insurance. Whatever the risk, the impact is the cost if the risk should occur.
ALE (annualized loss expectancy) – This is calculated by multiplying the annualized likelihood of a risk occurring (how often it is expected to occur in a given year) by the expected impact of a single occurrence, arriving at an expected value for the annualized loss from a given risk.
For example, imagine a catastrophic data loss has a one-in-four chance of occurring in any given year. The likelihood of this loss is 0.25, or 25%. The impact of this loss would be sizable, estimated at around $10 million. The annualized loss expectancy is calculated by multiplying the likelihood, 0.25, by the impact, $10 million, for an ALE of $2.5 million due to a catastrophic data loss. In deciding how to mitigate this risk, any control costing more than $2.5 million annually can likely be ruled out.
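The following is a minimal sketch of that calculation, using the hypothetical figures from the example above:

```python
# Minimal sketch of the annualized loss expectancy (ALE) calculation.
def annualized_loss_expectancy(likelihood_per_year: float, impact: float) -> float:
    """ALE = annualized likelihood of occurrence x impact of a single occurrence."""
    return likelihood_per_year * impact

likelihood = 0.25          # one-in-four chance per year
impact = 10_000_000        # estimated cost of a catastrophic data loss

ale = annualized_loss_expectancy(likelihood, impact)
print(f"ALE: ${ale:,.0f} per year")   # $2,500,000
```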
Quantitative versus Qualitative
Risk assessment can be performed either quantitatively or qualitatively. Asset value is used when performing a quantitative risk assessment, which might be used when deciding whether to acquire an additional backup service provider. There may be no moral- or values-based assessment included at all, just a strict, by-the-numbers calculation of whether the expected additional uptime (the asset being protected) would make up for the increased cost.
Qualitative risk assessment, on the other hand, bases risk mitigation on the type of loss. A breach of a database of private customer data might have a relatively low monetary cost, but a high cost in goodwill. Allowing untrusted access to private customer data may also be considered a moral failing. When risk mitigation is based on non-monetary factors, we consider it qualitative risk mitigation.
Risk Avoidance, Transference, Acceptance, Mitigation, and Deterrence
The main strategies used to deal with risk are avoidance, transference, acceptance, mitigation, and deterrence.
Avoidance is the decision that a certain action is too risky to be taken. An example of this may be the decision to close a business that regularly is the target of legal troubles. Rather than continuing to operate in a risky environment, the decision can be made to avoid the risk altogether.
Purchasing insurance to reduce risk is an example of transference (transferring the risk to another party). Though the risk of the occurrence is still just as high as it was before the transference, now the risk is borne by another party. Insurance is most frequently used to protect against low-likelihood, high-impact risks.
Acceptance of risk is another option. If the risks exist, but there is no way economically to mitigate them, then acceptance may be the best course of action. This is often true for low-impact risks. They are often considered the cost of doing business.
Mitigation is the reduction of the impact of an occurrence of any given risk. For instance, routine user permission reviews mitigate the risk associated with user access rights: a single compromised account then cannot lead to a compromise of the entire infrastructure.
Deterrence is the attempt to reduce the likelihood of a given risk. An example of risk deterrence is regular security audits to ensure the systems are secure.
Risk can be managed via mitigation, acceptance, and transference, but not elimination. There is no way to eliminate risk or uncertainty fully, but knowing the risks being taken can lead to better decisions in managing those risks.
Risks Associated with Cloud Computing and Virtualization
Cloud computing and virtualization carry with them an additional set of risks. The greatest additional risk with cloud computing is the loss of the physical control of data. When data resides in the cloud, it is only as secure as the cloud service hosting it. If the service becomes unavailable for any reason, from an interrupted Internet connection to an unforeseen closure of the company, your data becomes unavailable, perhaps permanently.
Virtualization may seem like a magic bullet for reducing hardware risks, but it comes with its own set of challenges. Virtualized servers have the same software information security requirements as physical servers, with the additional design complexity that comes with no longer assigning roles to specific hardware. If you outsource virtual server hosting, you gain all the risks and benefits of cloud computing and virtualization, which should be taken into consideration before any decisions are made.
Appropriate Risk Mitigation Strategies
Once risks have been assessed, it is time to consider the best ways to mitigate them. The costs and expected benefits of any mitigation steps should be weighed before implementation to make sure they are the right fit for your organization. This section will cover the following topics:
- Security controls based on risk
- Change management
- Incident management
- User rights and permissions reviews
- Routine audits
- Policies and procedures to prevent data loss or theft
Security Controls Based on Risk
Security controls are safeguards against security risks. They can be generally divided into three categories. Preventative controls, such as locks on doors, are those that exist in advance of an incident to deter or prevent the incident. Detective controls serve to identify and record information about a security incident in progress. These might include fire alarms or CCTV cameras. Corrective controls are designed to minimize damage caused by an event, such as a specialized fire suppression system for a room that houses vital electronics.
In addition to the purpose the controls serve, controls can be categorized according to their usage. We have already discussed technical, management, and operational controls. In addition to these basic categories, we can also consider physical controls and legal/regulatory controls.
It is important to match the control to the risk. For Internet browsing, a simple acceptable use policy might be sufficient in some situations. In others, it may be necessary to implement preventative, detective, and corrective controls of many different natures. One might use firewall rules, real-time monitoring, logging, and reporting on the technical side. One could further restrain physical access to the Internet by separating networks with access to sensitive internal information from those that have access to the Internet, or implementing operational constraints such as reserving time on Internet-enabled machines.
As you can see from our simple example, there are a number of controls that can be put in place to achieve the same objectives, but the more in-depth the controls are, the more likely they are to carry undesirable costs. It is important to match the level of the controls to the risks they are intended to mitigate.
Change Management
Change management is the systemization of making changes to infrastructure. A well-implemented change management system (CMS) protects against ad-hoc configuration errors and can provide a method to roll back undesirable changes. A well-executed change management strategy will document key changes and the current state of vital infrastructure, providing a baseline for configuration data that infrastructure in use can be expected to match. Major changes, such as updated firmware versions or operating system replacements, should always be noted in a CMS.
Incident Management
Incident management is the system and process of responding to security issues. It can be broadly defined as “What to do when things go wrong.” The incident management process is centered around an event, whether a simple event-log entry or a power outage. The process is concerned with defining the problem and taking corrective action.
Securing a system is not a one-time event, but, rather, an ongoing process. When an incident does occur, it may be a sign that there are insufficient controls in place, or that the controls in place have been improperly applied. Maintaining system security requires constant re-evaluation of risks and threats. Incident management should be taken as an opportunity to learn from the failure that occurred and strengthen the systems in place by addressing the issue and considering possible mitigation strategies.
Incident response procedures will be covered in greater detail in the chapter objective to follow.
User Rights and Permissions Reviews
In keeping with the best practice of least privilege, user rights and permissions should be regularly reviewed. User permissions and group memberships should be checked to ensure that current permissions reflect only those needed to perform assigned duties.
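As an illustration only, the following sketch compares each user's current permissions against a hypothetical approved baseline and flags anything beyond what is needed; the user names, permission sets, and data structures are invented for the example and do not reflect any real directory API.

```python
# Hypothetical least-privilege review: flag permissions not in the approved baseline.
approved = {
    "dbadmin":  {"database", "backup"},
    "netadmin": {"router", "switch"},
}

current = {
    "dbadmin":  {"database", "backup", "mailbox"},   # "mailbox" is not approved
    "netadmin": {"router", "switch"},
}

for user, perms in current.items():
    excess = perms - approved.get(user, set())
    if excess:
        print(f"{user}: remove unneeded permissions {sorted(excess)}")
```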
Routine Audits
In addition to regular user rights and permissions reviews, other regular information security audits should take place. These may include server event log review, basic penetration testing to ensure proper firewall and server configuration, or version audits to make sure software is updated as required and that software versions are accurately tracked in the CMS.
Policies and Procedures to Prevent Data Loss or Theft
Data loss and data theft are two risks that no company wants to experience. With data loss, data is no longer available to those who need it. With data theft, confidential, private, or proprietary data is exposed to unauthorized parties. In both cases, the data is not where it should be.
Policies and procedures to prevent data loss are often centered on backup/recovery, in which a copy is made of the data to be protected as it exists at a certain point in time. If something happens to the data, only the data that has changed after the backup was made is at risk of being lost. These backups in time are known as checkpoints or restore points. Other policies to protect data are concerned with not losing it in the first place. This may mean redundant servers, limiting access to make changes to data, and regularly verifying data integrity.
Data theft occurs when unauthorized individuals are able to access proprietary data. To combat data theft, policies defining who has access rights to sensitive data must be made and enforced. Backups should be stored securely, with either physical security of the backup media itself, encryption of the contents of the backup media, or both.
Systems should be hardened as a matter of course, with only those services enabled that are required, and user rights should be established only for those who require access. Data theft often occurs without the removal of any data, but instead the copying of data, so it may be difficult to detect. Limit access to confidential or private data as much as possible, and audit access of the data. Unfortunately, once unauthorized access to private data has occurred, it is likely too late to maintain control of the data. To combat data theft, you must prevent access in the first place.
Appropriate Incident Response Procedures
When something goes wrong, a response is warranted to determine just what went wrong, why it went wrong, and to set it right again. When investigating a security incident, there are a number of best practices to be followed to ensure a thorough investigation and recovery. Basic forensic procedures and swift action can help to keep a “canary in a coalmine” from becoming a catastrophe. This section will cover the following topics:
- Basic forensic procedures
- Damage and loss control
- Chain of custody
- Incident response: First responder
Basic Forensic Procedures
When gathering information to analyze a security response, there are a number of considerations and best practices to implement to ensure all available information can be gathered. The following are basic forensic procedures:
Order of volatility – When gathering information after a security incident, gather it in order from most to least volatile; that is, in the order in which it is likely to be changed or lost. The most volatile information is held in RAM, and it is constantly changing. Next, there may be information that changes regularly, such as swap files or working files for system processes. Less volatile data would be disk contents. Data unlikely to be changed, and thus the least volatile, is that held on separate systems, such as logs or archived data.
When gathering data, it is not necessarily a good idea to shut down an affected system immediately, as doing so would cause the loss of the most volatile information. If possible, information should be gathered before taking actions that would cause the loss of volatile data.
Capture system image – In order to preserve the non-volatile contents of a system for forensic investigation, standard practice is to create a system image. This image is a snapshot in time of the contents of a disk or array. Upon creating an image, it is advisable to create a hash of the image, such as MD5 or SHA512, to ensure future data integrity. The hash of the file can be compared to a newly generated hash of any future copy of the file to ensure no changes have been made to the file.
It is possible to make disk, partition, or array images. Making an image of a single disk in a striped redundant array of independent disks (RAID) will not yield a usable image. Take care that the type of image you create conforms to the type of data it is necessary to capture.
Network traffic and logs – Investigations of an incident can be greatly aided by analyzing data that is captured constantly and logged. Because network communication is transient in nature, traffic analysis can only occur if the traffic is already being captured; any logging that may be useful must be configured in advance. The same can be said for any transient event. It is difficult to know ahead of time what information may be useful, so an effort should be made to find a balance between monitoring overhead and depth of tracking.
Capture video – Any relevant video should be recorded and kept. This video can be reviewed later as an objective record of what happened, rather than relying on memory alone.
Record time offset – It is not unusual for systems on the same network to have slight variations in the time of their system clocks. It may be important to know in what order the events occurred when comparing logs from disparate sources. Note the variation of time on a given system’s clock from a trusted benchmark.
Take hashes – Hashes ensure data integrity. A changed file hash is an indicator that the contents of the file are no longer the same as they were when the hash was created. Create and log the hash of any file for which data integrity is important; a minimal hashing sketch follows this list.
Screenshots – In addition to video, screenshots of work in progress can be created to help form a record of the state of a system at any given time.
Witnesses – Like the contents of RAM, firsthand recollections are volatile. Gather information from people soon after the event, as memories do not remain constant over time.
Track man hours and expense – Record the time spent responding to an incident and the work performed during that time. Having a record of the work that was done and when can be valuable in determining the total cost of a response, in addition to which steps have already been taken and which steps still need to be taken. Tracking the costs of responding to an incident can be useful in determining whether the response is proportionate to the risk and adjusting priorities accordingly.
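The following is a minimal sketch of taking and verifying hashes of a captured image, as described under "Capture system image" and "Take hashes" above. The file paths are hypothetical; hashlib is part of the Python standard library.

```python
# Hash a captured evidence file so its integrity can be verified later.
import hashlib

def file_hash(path: str, algorithm: str = "sha512") -> str:
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # Read in chunks so large disk images do not have to fit in memory.
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()

original = file_hash("/evidence/disk_image.dd")
# Later, re-hash any copy of the image; a mismatch indicates the copy was altered.
copy = file_hash("/evidence/disk_image_copy.dd")
print("integrity verified" if original == copy else "integrity check FAILED")
```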
Damage and Loss Control
When responding to an incident, a major goal is to minimize the severity of the incident, also known as damage and loss control. This is an important consideration that must often be weighed against other imperatives, such as ensuring system integrity. For some incidents, by the time you become aware of an issue, the threat has passed. At other times, an incident may represent an ongoing threat, such as a virus or a Trojan-horse-infected system. When responding to an incident in progress, steps must be taken quickly to prevent further damage and loss.
Chain of Custody
The list of those who have handled or have accessed information gathered for forensic purposes is known as the chain of custody. Its purpose is to provide documentation as to who has handled the evidence, and to be able to account for the whereabouts of evidence at all times.
Incident Response: First Responder
As a first responder to an incident, you may need to make decisions based on incomplete information. You will need to weigh competing considerations, such as thoroughness of forensic data gathered against the ongoing threat an affected system may pose. This may mean quickly classifying the incident as past or ongoing, and taking immediate action in the case of an ongoing threat. The first priority is to limit the damage while destroying as little evidence as possible. A policy should be put in place that outlines the processes a first responder to an incident should use, such as what constitutes an incident that requires elevation to a team of responders.
Importance of Security-Related Awareness and Training
Having the greatest policies in the world will not do you any good if no one knows about them or follows them. It is important to educate users and administrators on the proper responses to common policy requirements. This section will cover the following topics:
- Security policy training and procedures
- Personally identifiable information (PII)
- Information classification: Sensitivity of data (hard or soft)
- Data labeling, handling, and disposal
- Compliance with laws, best practices, and standards
- User habits
- Threat awareness
- Use of social and P2P networking
Security Policy Training and Procedures
Users must be made aware of policies that directly impact them. An acceptable use policy will not impact the behavior of anyone who is not aware of it. Users should undergo frequent security policy training and understand related procedures.
A good practice is to have frequent (usually annual) security-related awareness training, and to have users sign a user agreement. This leaves no doubt as to the expectations of user conduct. Keeping users aware of and compliant with security policies minimizes the organizational risk posed by users.
Personally Identifiable Information (PII)
Personally identifiable information (PII) is just what it sounds like: any information that personally identifies an individual. This information is not limited to private medical, legal, or financial data. Any document that contains two or more personal identity items, such as a person's birthday, full name, or social security number, is considered to contain PII.
Usually, those who have to access documents or other data containing PII have the principle of least privilege applied to them. Access to PII should be limited to those with a legitimate need to access the data, and the access should be logged in case of misuse.
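As a simple illustration of the "two or more personal identity items" rule, the following sketch flags text that matches at least two simplified patterns. The patterns (U.S.-style SSN, date, and name formats) are illustrative assumptions only, not a complete PII detector.

```python
# Naive check for documents that may contain PII (two or more identity items).
import re

PATTERNS = {
    "ssn":       re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "birthday":  re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
    "full_name": re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b"),
}

def may_contain_pii(text: str) -> bool:
    hits = {name for name, pattern in PATTERNS.items() if pattern.search(text)}
    return len(hits) >= 2   # two or more identity items present

print(may_contain_pii("Jane Doe, born 04/12/1980"))        # True
print(may_contain_pii("Invoice total: $1,200, net 30"))    # False
```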
Information Classification: Sensitivity of Data (Hard or Soft)
Not all data is created equal. Some data is more sensitive than other data. Data may be available to anyone internal to the organization, may be shared with third parties, or may be available only to the board of directors.
For soft data, or data without a physical presence, such as database entries or other computer files, access can be controlled with user rights and permissions. For very sensitive data, encryption may be used to ensure that only specific people can access it and that they are unable to disseminate it. For hard data, such as printouts or a physical copy of information, access to sensitive data must be controlled physically.
Data Labeling, Handling, and Disposal
When it comes to sensitive data, how will your users distinguish between sensitive data that must be controlled and public information that does not require protection? The answer is any media that contains confidential data should be documented and labeled. If the data is soft, the indication can be made in the file itself. If it is a hard copy of the data, it can be clearly marked.
Whether the data is in hard or soft format, the location of sensitive data and changes to its location or content should always be recorded in an accurate log. There are often record-keeping requirements regarding sensitive data, such as a prescribed length of time medical or financial records must be maintained. For sensitive data that must be kept until an expiration date, controls should be put in place to prevent the accidental destruction of the data.
For data that must be kept confidential, it is important to track not just the original confidential data, but any copies or backups that are made. To reiterate, all confidential data should be labeled, cataloged, and tracked until disposal.
Compliance with Laws, Best Practices, and Standards
There are many levels of good ideas when it comes to policies and procedures, from practices that simply make things run smoothly to practices required by law. It is important for companies to comply with applicable laws and to follow best practices and standards.
When a certain policy or practice is required by law, it is known as a compliance law. Some well-known compliance laws are Sarbanes-Oxley and HIPAA. Different industries may have different compliance requirements. Failing to conform to compliance laws may have severe penalties, including fines and/or jail time.
Best practices and standards are good ideas that, while not mandated by law, are nonetheless expected of systems. These can be defined generally and casually (e.g., make backups of important information) or specifically and formally (e.g., IEEE standards regarding network wiring). Audits conducted to ensure an organization is meeting compliance laws and industry standards are collectively known as compliance audits.
User Habits
User behavior and training, or user habits, are an integral component of threat mitigation. Users should be trained to recognize common errors and the risks associated with them.
Password behaviors – Users should not keep passwords at their desk or at any other insecure location. Passwords should be memorized or physically secured.
Password masking – The risk of unauthorized knowledge of passwords can also occur via "shoulder surfing," which refers to the act of reading a password over the shoulder of a user entering it. Users should be aware of shoulder surfing and taught to protect their computer password in the same way they protect their ATM PIN. Shoulder-surfing attacks can be partially mitigated by masking passwords, a process by which the password is displayed as a series of asterisks rather than as the characters actually typed.
Password expiration – Password expiration helps to limit the term of exposure in the case that an account password becomes compromised. It also limits the time available for computationally intensive attacks against complex passwords.
Data handling – Data, both hard and soft, should be kept in designated areas. Soft data should be stored in the appropriate server folder, with user permissions assigned as necessary. Hard data should be available only to those with a need to access it.
Clean desk policy – Physically securing hard data might include storing files in locking file cabinets or the implementation of a “clean desk policy.” Under a clean desk policy, hard data is not permitted to be left at desks while unattended. This type of policy helps to reinforce the importance of routinely protecting data in the mind of the users.
Prevent tailgating – Restricting physical access to an area is only effective if users do not unwittingly bypass physical security. Tailgating is the process by which an unauthorized user follows a legitimately authorized user through a physically secured point. Users should be reminded to allow access only to those known to have access rights, or to require each entrant to authenticate individually.
Personally owned devices – Personal devices such as laptops, USB drives, or smart phones have the capacity to hold malicious code. It is important to educate the users about the manner in which these devices may be used, if at all.
Threat Awareness
Users should be aware, at least generally, of the types of threats to their computers. They should be taught basic steps to mitigate these threats as a component of regular user training. This is known as threat awareness.
New viruses – Users should know not to run programs from untrusted sources, such as from unknown e-mail attachments or programs downloaded over peer-to-peer networks.
Phishing attacks – Users should be taught not to log in to sensitive websites through links sent via e-mail.
Zero-day exploits – Previously unknown threats are constantly coming into the wild. Though there is no protection against true zero-day exploits, diligent patching can protect systems from prolonged vulnerability to new attacks as they become known.
Use of Social and P2P Networking
The use of social and peer-to-peer (P2P) networks carries with it the risk of accidental public disclosure of private information. Many P2P network clients share certain information by default. Any files in the shared directories can become available to the world at large. Furthermore, files downloaded from P2P networks can come from an unknown source, often identified only by a file name. Due to the anonymous nature of these downloads, the files should not be trusted without verification of their authenticity (such as a matching MD5 hash of a known good file).
Social networks can divulge a large quantity of seemingly innocuous information. They may also violate the principle of minimal disclosure. For instance, an attacker may learn via a social networking post that a certain person is out of the office. The attacker may then use that information to convince a subordinate of the vacationing worker to send information that they would not normally divulge, claiming the missing worker was supposed to send it but cannot be reached. To minimize disclosure, employees should be encouraged not to post work details on social networks.
Aspects of Business Continuity
One of the goals of a well-planned infrastructure is to maximize the availability of network services. There are a number of concepts that can be applied to infrastructure planning to ensure business continuity and high availability, even in the face of unplanned incidents. This section will cover the following topics:
- Business impact analysis
- Removing single points of failure
- Business continuity planning and testing
- Continuity of operations
- Disaster recovery
- IT contingency planning
- Succession planning
Business Impact Analysis
Resources should be put toward mitigating losses with the greatest bang for the buck. This is done through business impact analysis. The loss from an interruption of service is not calculated on the cost to provide the service, but rather on the loss of the benefit a service usually provides. For example, the cost of the loss of a relatively inexpensive server running inexpensive software should not be measured in replacement cost, but in the loss of value from a server being unavailable. When allocating resources to minimize the business impact of any given outage, you should ask, “What happens if we lose a certain service, or a certain piece of hardware?”
A plan to keep the business going should consider the natural or technical disasters that can cause outages, and how these incidents affect the business. An administrator should create IT contingency planning recovery point objectives with the business in mind. A recovery point objective (RPO) is the maximum amount of data, measured in time, that can acceptably be lost due to an outage. It may be absolutely necessary to keep a database with mirrored transactions fully synced at all times, so that absolutely no data is ever at risk, or it may be that certain data only needs to be protected on a daily, or even weekly, basis. The recovery point should be set according to business needs.
The length of time any particular outage renders services unavailable is known as the recovery time. The goal made for restoring service after an outage is known as the recovery time objective (RTO). If a server loses a power supply, is a plan in place to restore service immediately, or is it permissible to leave the server unavailable while parts are shipped? The RTOs should likewise be set according to business needs.
After conducting a business impact analysis, a plan should be crafted that satisfies the RPOs and the RTOs. These plans can range from a simple offsite backup to a fully redundant online offsite data center.
Removing Single Points of Failure
One of the most basic ways of increasing uptime is by removing single points of failure. Will a critical service experience downtime when a single piece of hardware (such as a single disk drive, a single server, a single rack, or a single site) has an issue? The more critical service availability is, the more important it is to build redundancy into the infrastructure design.
This may mean multiple power supplies, uninterruptible power supplies, RAID arrays, redundant service providers (such as power or Internet), or multiple locations from which a service can be provided.
Business Continuity Planning and Testing
After establishing RPOs and RTOs, it is a good idea to review and test your recovery plan to ensure that it actually functions. Without business continuity planning and testing, you risk a situation such as an RTO plan that relies on a restore from backup, only to have a disaster damage both the original data and the drive necessary to read the backup media.
“Wargaming” various scenarios will help to ensure that your plans are actionable in the case of a disaster. Testing restores not only will ensure that your data protection is functioning as desired, but also will better prepare the system’s team to respond in the case of an actual disaster recovery. Regular testing ensures that business continuity planning will actually lead to the continuing provision of service, even in the face of a disaster that might otherwise have had a catastrophic business impact.
Continuity of Operations
When disasters happen, it is important to ensure your business can still function. This is known as continuity of operations. Some services that are normally provided by means of technical infrastructure can also be provided by other means. The provision of non-technical means to provide critical services can provide a strong “worst-case scenario” fallback position. For instance, a computerized catalog might normally provide a quick way to learn the necessary parts and service for a car dealership. If the dealership were to experience a computer outage, but had hard copies of the necessary service manuals on hand, work could continue under less-than-ideal circumstances rather than grinding to a halt.
Though the risk of outage cannot be totally eliminated, it can be appreciably reduced though business continuity planning and testing. Even in cases of technical outage, businesses may be able to prepare non-technical means of continuing operations.
The best metric for determining the effectiveness of a continuity of operations plan or a disaster recovery plan is the mean time to restore. In cases where even a complete, catastrophic loss of a site does not impact the provision of services, the mean time to restore approaches zero. The sooner you can restore full functionality after a disaster or technical failure, the better.
Disaster Recovery
When it comes time to perform disaster recovery, a plan should already be in place and tested. The disaster recovery process is simply the application of the already established disaster recovery plan to meet the RPOs and RTOs. The first time you are restoring data from backups made as part of your disaster recovery plan should not be in the midst of an actual disaster. A well-tested plan should leave no surprises. This will be discussed later in the chapter.
IT Contingency Planning
Part of a disaster recovery or business continuity plan should be IT contingency planning, which is the practice of planning so that failures, if they occur at all, happen a little at a time. This can include redundant systems for the most critical services, strong non-technical fallback positions for some services (such as service manuals in place of an online parts database), or alternative sites that may be brought online temporarily. It is closely related to business continuity planning; however, it is concerned only with the delivery of services.
Succession Planning
In any organization, there are a number of indispensable roles. A good succession plan ensures that no role exists in which only one person can fulfill it. There may be any number of reasons a certain individual can no longer perform his or her assigned role, such as a sudden departure or incapacity. To minimize the disruption caused when a certain role changes hands, encourage key personnel to train others in the performance of their key duties. Another tactic to minimize disruption caused when roles change is to encourage regular job rotation where practical.
Impact and Proper Use of Environmental Controls
The systems administrator may need to take certain steps to ensure that the environment in which systems are intended to function does not degrade the performance of the systems. Systems should not be kept in an environment that is too hot, or too dusty, or at risk of water or fire damage. The best environments are regularly monitored to ensure the environmental variables remain in an ideal range. This section will cover the following topics:
- HVAC
- Fire suppression
- EMI shielding
- Hot and cold aisles
- Environmental monitoring
- Temperature and humidity controls
- Video monitoring
HVAC
HVAC (heating, venting, and air conditioning) is of great concern to a systems administrator. These systems are vital to ensuring control of temperature and humidity. Heavy use of computing resources tends to generate high levels of waste heat. If this heat were to be trapped around the computers generating it, it would degrade the performance and lifetime of the computers. For this reason, high-density computing infrastructure almost always requires specialized cooling.
Fire Suppression
One risk of any high-power-use environment is fire. In the event of a fire, electronic resources can be severely damaged by standard sprinkler systems.
For enclosed server rooms, a specialized fire suppression agent may be used, such as halon or argon. These systems are safe for electronics but present a threat to people. If a specialized fire suppression system is installed, it is vital that it include safety measures to protect people from the accidental discharge of potentially deadly agents. At the very minimum, areas that hold electronics must have a fire notification system so that any fires are detected and can be responded to as quickly as possible.
EMI Shielding
Electromagnetic interference (EMI) is generated by the motion of electricity. EMI can be a problem because strong EMI can interfere with network communications. Where EMI is regularly observable, it may be possible to determine the pattern of electricity that generates it. This is particularly troublesome in highly sensitive network communications. For highly sensitive networks, it is recommended to shield network runs to contain the emission of EMI.
Hot and Cold Aisles
In a data-center environment, airflow should not be an afterthought, but, rather, an integral design component. Data centers should utilize hot and cold aisles. A well-designed data center will maximize the ventilation of hot air out while maximizing the inflow of cold air for more energy-efficient heat regulation. One of the most effective methods for airflow management is the creation of hot and cold aisles: consecutive rows of server racks are oriented so that they vent hot air into a shared "hot" aisle and draw cool air from a shared "cold" aisle. This configuration is repeated for each pair of aisles, alternating hot and cold.
The vents designed to draw hot air out are placed in the hot aisles, while those designed to supply cold airflow are placed in the cold aisles. This allows the same space with the same thermal output to be cooled much more efficiently than a system that treats the entire data center as a single undifferentiated zone.
Environmental Monitoring
A planned environment must also live up to the expectations of the planners. The only way to ensure the effectiveness of environmental controls is environmental monitoring. Warnings and alerts should be configured for cases where the environment fails to remain within an acceptable range of conditions. This way, if problems with the environment are encountered, they can be addressed prior to causing problems with the equipment.
Temperature and Humidity Controls
In addition to specialized cooling requirements, electronics infrastructure can also require temperature and humidity controls. Extremely dry or dusty environments can lead to additional static discharge. For this reason, high particulate environments and those with extremely dry air are undesirable.
Video Monitoring
Where security of a data-center is concerned, the best choice is video monitoring. This can often provide a much more complete picture than a simple security log recording entries and exits. A video log of sensitive areas reveals not only the entries into and exits from a secure area but also an impartial record of where and when these entries and exits took place and what was done in the secure area.
Disaster Recovery Plans and Procedures
When your data is at stake, it is, of course, imperative to protect against exterior threats. However, threats can come in the form of not only malicious attacks but also accidents and from nature itself (i.e., failures and disasters), which is why it is necessary to prepare for these events. This section will cover the following topics:
- Backup/backout contingency plans or policies
- Backups, execution, and frequency
- Redundancy and fault tolerance
- High availability
- Cold site, hot site, and warm site
- Mean time to restore, mean time between failures, RTOs, and RPOs
Backup/Backout Contingency Plans or Policies
The most important thing to remember when creating data recovery strategies is to ensure production data is backed up in an offsite location. This ensures that in a worst-case scenario, if the main site is destroyed, the data will be safely kept in a different physical place. The two major factors in disaster recovery are:
- Ensuring data availability through redundancy; and
- Preparation of the offsite location in which to resume operations (hot, warm, or cold; these will be discussed later).
It is vital to have these plans outlined and well documented in every enterprise to ensure corporate data is not lost. These disaster recovery plans (DRPs) should include the following major types of measures:
- Preventative measures, which help to ensure disasters do not occur
- Detective measures, which assist in detecting disasters
- Corrective measures, which address recovery after a disaster
DRPs should also include the location of the disaster recovery site and details on its production readiness; that is, whether it is equipped properly to begin operations immediately after a disaster, as well as a hierarchical list of critical systems and data, to ensure important systems are backed up in the offsite location.
Backups, Execution, and Frequency
When considering important factors for a DRP, it is extremely important to ensure your company’s backup plan not only meets business requirements but also preserves all data reliably. A major component of ensuring the integrity and availability (IA of CIA) of your data is to store backups offsite, usually at the disaster recovery site or at another physically distant location, in the event of a natural disaster. You should never store data backups onsite, unless they are redundant copies.
Redundancy and Fault Tolerance
Making backups of your data in the event of a disaster is one way to ensure continuity of operations after a major event. However, in order to ensure your data is not lost or compromised during regular, everyday operations, you should configure your systems for redundancy and fault tolerance.
Redundant and fault-tolerant systems include hardware- and software-based mechanisms that allow for partial failure while still maintaining continuity of operations and integrity of data. Appliances and features that comprise these types of systems can be any of the following:
- Uninterruptible power supplies (UPSs). UPSs supply power to computers and servers for a short duration in the event of external power failure. UPSs are usually employed so users can save their work and shut down their computers, as the power supplied is usually not enough for more than a few minutes' worth of uptime – and in cases where external generators or auxiliary power is employed, this gives those systems time to come online.
- Hardware-based redundancy and fault tolerance. Having multiple systems with redundant copies of data or backups can help in case of failure.
- Server clustering. Clustering is the act of linking multiple computers, usually servers, together to act as one unit; it is a way to avoid single-point-of-failure problems, where an individual server problem causes the whole system to crash. If one server crashes, the remaining machines will still be able to perform the required actions.
- Load balancing. When you have high demand for data on a network, you can have problems with slow network response times or outages. Load balancing across multiple servers can assist with this. Load balancing simply takes requests and sends them to the server that is being least used at the moment.
- Redundant array of independent disks (RAID). The whole purpose of RAID is to combine drives in a machine to provide either better availability, in the case of drive failure, or better speed. There are multiple types of RAID configurations:
- RAID 0 has no redundancy. It performs block-level striping of data. (Striping means that a single piece of data is written across multiple disks, so that when it is read later, all of the individual pieces will be read from multiple disks at once – thus improving read and write times greatly.) RAID 0 requires a minimum of two disks.
- RAID 1 is the ultimate in redundancy. It performs a 1:1 mirroring of data from the original drive to each additional drive in the array, thus improving not only redundancy but also read times (not write times). RAID 1 requires a minimum of two disks.
- RAID 2 allows you to rebuild a lost drive through parity, and the data is striped at the bit level to provide better performance. RAID 2 requires a minimum of three disks.
- RAID 3 is byte-level striping with parity, but all parity bits are stored on a dedicated parity drive. RAID 3 requires a minimum of three disks.
- RAID 4 is block-level striping but also has a dedicated parity drive. This is similar to RAID 5, in which one drive from the array can fail and the lost data can be rebuilt from the parity drive; however, the performance of a RAID 4 configuration will be greatly reliant on that of the parity drive. RAID 4 requires a minimum of three disks.
- RAID 5 is block-level striping but has distributed parity, which removes the requirement for a dedicated parity drive and the potential performance bottleneck this configuration could impose. A drive can fail from RAID 5 and its contents can be rebuilt from the parity bits that are distributed among the remaining drives (every drive has parity information for the other drives, but not its own). RAID 5 requires a minimum of three disks.
- RAID 6 is extremely fault-tolerant in that it uses double distributed parity. This allows for the loss and rebuilding of two disks in an array; RAID 6 is usually used with a minimum of four drives. RAID 6 also uses block-level striping.
It is important to note that RAID is concerned more with availability than with integrity of data. RAID arrays simply ensure data will be available; ensuring it is not corrupt or data is not lost is not a major strength of RAID.
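As a quick illustration of the trade-offs above, the following sketch estimates usable capacity for the most common RAID levels, assuming identical drives; the drive count and size are hypothetical.

```python
# Rough usable-capacity estimates for common RAID levels with identical drives.
def usable_capacity(level: int, drives: int, size_tb: float) -> float:
    if level == 0:
        return drives * size_tb            # striping only, no redundancy
    if level == 1:
        return size_tb                     # every additional drive mirrors the first
    if level == 5:
        return (drives - 1) * size_tb      # one drive's worth of distributed parity
    if level == 6:
        return (drives - 2) * size_tb      # two drives' worth of distributed parity
    raise ValueError("level not covered in this sketch")

for level in (0, 1, 5, 6):
    print(f"RAID {level}: {usable_capacity(level, drives=4, size_tb=2):.0f} TB usable "
          f"from 4 x 2 TB drives")
```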
High Availability
Another method of combating data loss and unavailability is to ensure high availability. Methods of ensuring this include using RAID 1 and RAID 5 configurations, as well as ensuring systems fail properly – that is, deciding whether, when they do fail, they should fail "open" (where data is still available and thus unsecured) or "closed" (where data is locked down and unavailable, but secured). The latter behavior is also described as "fail-safe."
Cold Site, Hot Site, and Warm Site
For disaster recovery, it is imperative to include in your backup plan an off-site location from which to base operations after the event. The site can have varying levels of serviceability and readiness; these are described as cold site, hot site, and warm site.
A cold site is a location that is usually just a building that has basic infrastructure, including lights, power, and networking ability. These types of sites usually require hardware and data to be shipped in, as well as all the additional requirements of an office – desks, chairs, etc. This is the least prepared, but least costly, site to maintain.
A hot site is a location that is a near replica of the production environment in terms of equipment, data, power, and connectivity. It usually only requires minimal work to resume operations at the new site. Data backups are usually stored at a hot site.
A warm site is somewhere between the two: it is usually a fully functional production office but does not have up-to-date data or recent backups. A warm site takes longer than a hot site to prepare for full use, but much less time than a cold site.
Mean Time to Restore, Mean Time between Failures, RTOs, and RPOs
For business and accountability purposes, it is important to have a plan in place that details the approximate, or average, time it takes for systems to come back online. This metric is called mean time to restore (MTTR), or mean time to recovery. Another important metric is mean time between failures (MTBF). Recovery time objectives (RTOs) and recovery point objectives (RPOs) address the speedy resolution of the circumstances surrounding such failures.
MTTR simply refers to the average time it takes for your services or equipment to come back online after a failure. An 8-hour MTTR is obviously better than a 36-hour one.
MTBF refers to the average time between system failures. The bigger the number, the better, as it indicates your systems stay up for longer periods of time. This is a prediction usually supplied by the manufacturer for a given system model.
RTOs indicate a certain time window in which production systems must be up and running again after a disaster or failure.
RPOs indicate the maximum timeframe in which data can be lost after a disaster or failure. This usually means that frequent backups must be maintained (usually at a hot site) to prevent loss of data. Infrequent backups can cause data loss, as a system can go down after backups have occurred and after new data has been generated. Frequent backups can help mitigate this and support rigorous RPOs.
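As a brief illustration, the following sketch relates MTTR and MTBF to expected availability using the standard estimate availability = MTBF / (MTBF + MTTR); the figures used are hypothetical.

```python
# Relate MTBF and MTTR to an expected availability percentage.
mtbf_hours = 2000    # average time between failures
mttr_hours = 8       # average time to restore service after a failure

availability = mtbf_hours / (mtbf_hours + mttr_hours)
print(f"Expected availability: {availability:.3%}")   # about 99.602%
```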
Concepts of Confidentiality, Integrity, and Availability (CIA)
Of all the major security concepts, the concept of confidentiality, integrity, and availability (CIA) is a cornerstone. In this section, we will discuss what each of the three parts mean to the security administrator, and give examples of each.
Confidentiality
Throughout history, confidentiality has kept certain pieces of information from the eyes of unprivileged individuals, and has been used to further the agendas of kings, to carry out wars, and to protect the information of those with the power to do so.
In the realm of security, confidentiality refers to the protection of data from unauthorized users. Corporations do not want their proprietary information (i.e., the data on the commercial products from which they profit) to fall into the hands of the public; therefore, e-mail transactions may be scanned for sensitive data, users utilizing removable storage may be subject to encryption requirements, VPNs and remote connections are given the highest security, and datacenters are equipped with mantraps to keep unauthorized personnel from entering.
Integrity
The second component of the CIA model is integrity. In the world of data security, not only is it vital to ensure confidentiality, but data is worthless unless it can be verified that it is still the same as it was yesterday, or that an e-mail received is still genuine and has not been tampered with. Administrators must ensure data is not changed – that is, they must ensure that it maintains its integrity. When digital signatures are added to e-mails and databases confirm data consistency, this is verification of data integrity.
Availability
The last component comprising the CIA model is availability. Data must be kept secure (confidentiality), intact (integrity), and, lastly, accessible (availability). If important, mission-critical data is inaccessible for any reason, whether it is due to malware, network attacks, physical theft, or environmental controls such as air conditioning failure, this third and last part of the CIA model has not been met, and this security concept is unfulfilled.