Given a scenario, implement a network troubleshooting methodology. This chapter breaks down a multiple-step network troubleshooting methodology. Although the scope of a problem may be unclear at the beginning, troubleshooting is considered a science, as it always involves a pre-determined process. Network engineers who follow this process will get better at troubleshooting over time, optimizing it based on their experience.
A methodology is needed in order to speed up the problem finding and solving process. Although the methodology can be different from case to case, it is important to use one that best suits your company, network, and internal structure. A proposed methodology includes the following steps:
Step 1: Gather information
Step 2: Identify the affected areas in the network
Step 3: Determine whether anything has changed in the network
Step 4: Establish the most probable cause of the problem
Step 5: Determine whether escalation is necessary
Step 6: Create an action plan and a possible solution
Step 7: Implement and test the solution
Step 8: Analyze the results
Step 9: Document the process and solution
This process is represented by the flow diagram in Figure 6.1 below:
Figure 6.1 – Network Troubleshooting Methodology
In the following sections, we will analyze each of the nine troubleshooting methodology steps.
Gather Information
Information gathering is the first step in the network troubleshooting methodology, as it is in many other IT processes, including the handling of a security incident. This step is also called the reconnaissance phase, and it aims to obtain, as quickly as possible, all the relevant information that will assist you during the next phases.
The event that triggers the information gathering phase is a network issue, most likely reported by an end-user who discovered it or who was directly affected by it. The end-user could have reported the issue via e-mail, by phone, or by opening a help desk ticket.
One of the first things that should occur in this phase is interviewing the affected parties and impacted users to find out the exact nature of the network incident. You should correlate these interviews with the application/device logs and error messages gathered from the affected systems. You also need to isolate the issue early and figure out whether the problem affects a single user (end-station) or a group of users, such as an entire VLAN or network segment.
Depending on the number of users affected, you can start the troubleshooting process at the Physical Layer or at an upper layer in the OSI reference model. For example, if you have multiple users in a VLAN/subnet and none of them are able to access a specific application, you should not begin by examining the cabling from their workstations to see if they are plugged in; instead, you should immediately move up the OSI model, as the probability of physical connectivity issues is low. Focus on the things all those users have in common from a networking perspective and this might lead to a particular switch that they are all connected to and that might be malfunctioning.
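As a minimal sketch of this idea, the following Python snippet takes a handful of trouble reports and lists the network attributes that every affected user has in common. The report fields (`vlan`, `access_switch`, `subnet`) and all sample values are hypothetical and chosen only for illustration; in practice this data would likely come from a help desk or inventory system.

```python
def shared_attributes(reports):
    """Return the attribute/value pairs that are identical in every report."""
    if not reports:
        return {}
    common = dict(reports[0])
    for report in reports[1:]:
        # Keep only the pairs that this report also has with the same value.
        common = {k: v for k, v in common.items() if report.get(k) == v}
    return common


# Hypothetical trouble reports from three affected users.
reports = [
    {"user": "alice", "vlan": 20, "access_switch": "sw-3f-01", "subnet": "10.1.20.0/24"},
    {"user": "bob", "vlan": 20, "access_switch": "sw-3f-01", "subnet": "10.1.20.0/24"},
    {"user": "carol", "vlan": 30, "access_switch": "sw-3f-01", "subnet": "10.1.30.0/24"},
]

print(shared_attributes(reports))  # {'access_switch': 'sw-3f-01'}
```

Here the only element all three users share is the access switch, which makes it a reasonable first place to look.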
If the problem reported involves an application issue, you should try to gather some screenshots from the affected users to see how the problem manifested. Depending on the situation, you might need to walk the users through certain areas over the phone or remotely connect to their station via RDP or other terminal services to analyze the issue. If this is not possible, the network administrator or operator might have to personally go to the affected system and do some hands-on analysis in order to properly gather the necessary information.
After all of these steps are completed, you should document the relevant information gathered, including the following:
- The type of problem
- Problem description
- Which systems were affected
- How the systems were affected
- In what context the problem manifested
Possible effects of the problem on the systems affected include:
- Slow performance/response
- Data corruption
- Logon issues
- Resource access issues
- Misconfigurations
Another critical aspect you should cover during the information gathering phase is identifying the specific moment at which the problem manifested. The time of occurrence should be correlated with different network events that happened in that period, including changes made by users in different areas. This information, together with details about the symptoms and error messages, should offer a complete information set specific to this phase and should allow you to proceed to the next step in the troubleshooting process.
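One way to picture this correlation is a simple time-window filter over a change log. The sketch below is illustrative only: the log entries, the device names, and the one-hour window are all assumptions, not values prescribed by the methodology.

```python
from datetime import datetime, timedelta


def events_near(incident_time, events, window=timedelta(hours=1)):
    """Return events that occurred within `window` before the incident."""
    start = incident_time - window
    return [e for e in events if start <= e["time"] <= incident_time]


# Hypothetical change log for the day of the incident.
change_log = [
    {"time": datetime(2024, 3, 4, 8, 15), "what": "ACL updated on dist-sw-1"},
    {"time": datetime(2024, 3, 4, 9, 40), "what": "DHCP scope modified"},
    {"time": datetime(2024, 3, 4, 11, 5), "what": "Firmware upgrade on core-rtr-2"},
]

incident = datetime(2024, 3, 4, 10, 0)
for event in events_near(incident, change_log):
    print(event["time"], event["what"])
```

With these sample values, only the DHCP scope change falls inside the hour before the incident, so it would be the first event to investigate.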
Identify the Affected Areas in the Network
In order to properly and rapidly identify the affected areas of the network, you should make use of good mapping tools, including:
- Packet sniffers (like Wireshark)
- Detailed topology diagrams (both physical and logical)
- Other schematics of the network
You need to understand the physical and logical network topology in order to identify the affected areas and trace the problem throughout the network. In addition, you should also be able to use tools like ping, event viewer, and other monitoring tools.
Note: If the organization uses some kind of security policy, you should also understand this policy as part of the troubleshooting process.
Solid network documentation helps in this phase, including documentation about IP addressing within the network. Using VLSM and address aggregation will be useful, as they can prevent problems within a network area from affecting the entire routing domain. Using IP addressing aggregation to represent many networks can create problem domains, which will help in the troubleshooting process.
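To illustrate how aggregation defines problem domains, the sketch below uses Python's standard `ipaddress` module to collapse contiguous subnets into summary routes and then finds which summary contains a given host. The subnets and the host address are hypothetical examples, and `problem_domain` is a helper name introduced here for illustration.

```python
import ipaddress

# Hypothetical subnets documented for two areas of the network.
subnets = [
    ipaddress.ip_network(n)
    for n in (
        "10.1.0.0/24", "10.1.1.0/24", "10.1.2.0/24", "10.1.3.0/24",
        "10.2.0.0/24", "10.2.1.0/24",
    )
]

# Aggregate contiguous subnets into the smallest set of summary networks.
summaries = list(ipaddress.collapse_addresses(subnets))
print(summaries)


def problem_domain(host, summaries):
    """Return the summary network containing `host`, or None if there is none."""
    addr = ipaddress.ip_address(host)
    for net in summaries:
        if addr in net:
            return net
    return None


print(problem_domain("10.1.2.57", summaries))  # 10.1.0.0/22
```

The four 10.1.x.0/24 subnets collapse into 10.1.0.0/22 and the two 10.2.x.0/24 subnets into 10.2.0.0/23, so a fault reported from 10.1.2.57 can be scoped to the first summary.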
Another part of this phase is understanding which applications, services, and protocols are used by every group of users within the network. Knowing which areas of the organization are using a certain type of application can help to quickly isolate the problem domain and continue the troubleshooting process.
An important thing that helps in this troubleshooting step is designing a modular network that contains multiple layers and modules, such as the following:
- Core Layer
- Distribution Layer
- Access Layer
- Management module
- Remote Access module
- VPN module
If you have WAN connectivity to remote and branch offices, it would help to train remote network operators in the troubleshooting process so they will be able to remotely help from their respective offices. In addition, if the problem includes other technology areas, you should consult with the colleagues responsible for those areas and maybe even form a troubleshooting team in order to fix the problem as soon as possible.
Depending on the situation, to minimize the effect of the problem and shorten the troubleshooting process, you should make sure you have:
- Backups of the system
- Roll-back techniques, especially in situations in which modifying device configurations does not solve the problem
- Spare parts
- Failover between devices and modules
Determine Whether Anything Has Changed in the Network
Depending on the actual problem, the network administrator/operator must follow a mental flowchart that starts with the problem reported by the user, for example, an issue logging on to a system. First, make sure you go through the standard network troubleshooting process using your knowledge of the OSI reference model. If a user is on a system that is up and running but the user's credentials are not being passed to a central server, you should try to determine whether this is a single-user issue or a widespread issue, because this will dramatically affect the troubleshooting method.
If it’s a single user issue, you should use standard networking troubleshooting tools, including:
- Ping
- Event logs
- RDP to access the system
If it’s a widespread issue, you should examine the services and the system logs on the servers the users are trying to access to see if you can learn any information about the issue (maybe some kind of authentication problem).
Establish the Most Probable Cause of the Problem
At this point, you should have discovered what the problem is and how it manifested, based on symptoms such as the following:
- Service outage/inaccessible
- Slow service
- Logon issues
- Dropped sessions
- Data corruption
The next step is to find the cause of the problem, for example:
- Cabling problems
- Connectivity between an Access Layer switch and a Distribution Layer switch, either in the server room or in the wiring closet
- DoS attack on a system (e.g., router, switch, or server)
- Software issue (a user misconfiguration or a user installing some type of application)
- IP addressing issue (DHCP problem)
You should ask everyone involved in the incident what the last change in the system was and try to obtain details on this. In this troubleshooting phase, you should consider every possible cause, but you should rank the causes based on the symptoms and start with the most obvious ones. In the end, you will know exactly what you should test to solve the particular issue.
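The ordering of candidate causes can be sketched as a simple scoring exercise. The mapping between causes and symptoms below is an invented example for illustration, not a diagnostic reference; in practice it would be built from your own experience and documentation.

```python
# Hypothetical mapping of causes to the symptoms they typically produce.
CAUSE_SYMPTOMS = {
    "cabling problem": {"service outage", "dropped sessions"},
    "DHCP problem": {"service outage", "logon issues"},
    "DoS attack": {"slow service", "dropped sessions", "service outage"},
    "software misconfiguration": {"logon issues", "data corruption"},
}


def rank_causes(observed, cause_symptoms=CAUSE_SYMPTOMS):
    """Order causes by how many observed symptoms they explain (most first)."""
    observed = set(observed)
    scored = [
        (len(observed & symptoms), cause)
        for cause, symptoms in cause_symptoms.items()
    ]
    # Drop causes that explain nothing; break ties alphabetically.
    matching = [(score, cause) for score, cause in scored if score > 0]
    return [cause for score, cause in sorted(matching, key=lambda p: (-p[0], p[1]))]


print(rank_causes({"slow service", "dropped sessions"}))
# ['DoS attack', 'cabling problem']
```

The cause that explains the most observed symptoms is tested first, and causes that match none of the symptoms are set aside.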
Determine Whether Escalation Is Necessary
Many organizations use the Information Technology Infrastructure Library (ITIL) framework, which is a systematic approach for information technology management in an organization. One of the domains specified by this library is incident management. Many organizations have internal help desk or service desk structures.
While help desks usually serve outside customers and vendors, service desks serve internal customers and other departments within the same company. Users can issue a trouble ticket to the service desk system and that will be processed through some type of workflow using e-mail or other automatic process. At some point, the service desk operators have to decide whether the solution is beyond their capabilities and responsibilities and, if it is, whether they should escalate this to a higher-level team and involve other people in the process. Usually, organizations have a three-tier escalation model:
- Level 1: The service desk operators are in direct contact with the customer/users. This is where the problem is reported.
- Level 2: Service desk personnel who are more qualified than Level 1 technicians are used for escalation.
- Level 3: This is the highest escalation level and it often includes network engineers and application developers.
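The escalation decision can be pictured as routing a ticket to the lowest level that is able to handle it. The skill sets assigned to each level below are purely hypothetical; every organization defines its own boundaries between levels.

```python
# Hypothetical capabilities per escalation level.
LEVEL_SKILLS = {
    1: {"password reset", "cable check"},
    2: {"password reset", "cable check", "vlan assignment", "dhcp scope"},
    3: {"password reset", "cable check", "vlan assignment", "dhcp scope",
        "routing design", "application bug"},
}


def lowest_capable_level(required_skill, level_skills=LEVEL_SKILLS):
    """Return the lowest level that can handle the ticket, or None."""
    for level in sorted(level_skills):
        if required_skill in level_skills[level]:
            return level
    return None


print(lowest_capable_level("password reset"))  # 1
print(lowest_capable_level("dhcp scope"))      # 2
print(lowest_capable_level("routing design"))  # 3
```

A result higher than the operator's own level means the ticket should be escalated; a result of `None` suggests that no internal level covers the problem.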
Create an Action Plan and a Possible Solution
Most of the time the action plan needed in a troubleshooting methodology is based on experience and analysis of documentation created by previous network engineers or operators. The process of creating an action plan involves documenting every one of the previous steps in the troubleshooting process, often by taking notes and using a PDA or an audio recorder to capture all the meaningful information along the way.
An important thing to remember is that you should act on one event at a time. If several problems occur simultaneously, you should prioritize them based on the way they affect users and the impact on the network and even on the business. Depending on the situation, you may want to delegate other technicians to specific technology areas in order to cover all the affected zones at the same time.
Once the action to be taken has been identified, you should implement a single fix/solution at a time. Do not try to mix different solutions just to hurry things up. Most often this will lead to other problems. You should move on to the next fix only if the previous one does not work.
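The one-fix-at-a-time rule can be sketched as a loop that applies a single change, checks whether the problem is resolved, and only then moves on. The `port` dictionary, the fix functions, and the verification check are hypothetical stand-ins for real device changes and real tests.

```python
def apply_fixes_one_at_a_time(fixes, problem_resolved):
    """Apply fixes in order, stopping at the first one that resolves the problem."""
    attempted = []
    for name, apply_fix in fixes:
        apply_fix()
        attempted.append(name)
        if problem_resolved():
            return name, attempted
    return None, attempted


# Hypothetical state of a switch port, used in place of a real device.
port = {"duplex": "half", "vlan": 10}


def set_full_duplex():
    port["duplex"] = "full"


def move_to_vlan_20():
    port["vlan"] = 20


def replace_cable():
    port["cable"] = "new"


def users_can_reach_app():
    # Assumption for this example: the application is only reachable from VLAN 20.
    return port["vlan"] == 20


fixes = [
    ("set full duplex", set_full_duplex),
    ("move port to VLAN 20", move_to_vlan_20),
    ("replace cable", replace_cable),
]
winner, attempted = apply_fixes_one_at_a_time(fixes, users_can_reach_app)
print(winner, attempted)
```

Because the loop stops as soon as the check passes, the third fix is never applied, and the list of attempted fixes shows exactly which change resolved the issue.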
Another important rule states that backup should happen first and rollback second. Before implementing the fix, you should have a backup of the data, system, and configuration files, and you should also have some way to roll back to the last known good configuration if the fix does not work.
Other special cases are those in which the problem is intermittent or the solution cannot be implemented during production hours. If this is the case, you should carefully schedule a change control window and follow a strict procedure to cover every possible solution. You should always have a backup plan in case the primary solution fails; this will allow you to speed up the troubleshooting process and make maximum use of the scheduled maintenance window.
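A minimal sketch of the backup-first, rollback-second rule, applied to a configuration stored as a local text file, might look like the following. The file name, the `.bak` suffix, and the verification check are assumptions made for this example; real network devices usually have their own mechanisms for saving and restoring configurations.

```python
import shutil
import tempfile
from pathlib import Path


def change_with_rollback(config_path, new_text, verify):
    """Back up the file, apply the change, and restore the backup if verify fails."""
    config_path = Path(config_path)
    backup_path = config_path.with_name(config_path.name + ".bak")
    shutil.copy2(config_path, backup_path)      # 1. backup first
    config_path.write_text(new_text)            # 2. apply the fix
    if verify(config_path.read_text()):         # 3. test the result
        return True
    shutil.copy2(backup_path, config_path)      # 4. roll back to last known good
    return False


# Demonstration in a temporary directory with a deliberately broken change.
with tempfile.TemporaryDirectory() as tmp:
    cfg = Path(tmp) / "switch.cfg"
    cfg.write_text("hostname sw1\n")
    ok = change_with_rollback(
        cfg,
        "hostnme sw1-new\n",                    # typo: fails verification
        lambda text: text.startswith("hostname "),
    )
    print(ok, repr(cfg.read_text()))            # False 'hostname sw1\n'
```

Because the backup is taken before anything is modified, a failed change leaves the file exactly as it was.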
Do not panic if things get out of control. Ask your colleagues for assistance when needed and stay calm so you can maintain control of the situation.
Implement and Test the Solution
One of the most critical parts of the troubleshooting process is testing the solution. If the solution involves some type of major change or fundamental modification to the network infrastructure or design, the recommendation is to test the solution in an isolated prototype environment first.
If possible, as a best-case scenario, you should have an exact mirror of the network topology, or at least try to get as close to this as possible. This could mean trying to create a subset of the network infrastructure on which the solution can be tested. For example, the solution might involve applying some type of service pack or software upgrade on different devices, and this should be carefully tested in an isolated environment before launching such a drastic change into production.
During the implementation and testing phase, network technicians usually create scripts that execute multiple tasks at once in order to save time. In the same spirit, you should prepare a detailed implementation plan and testing procedure before starting the actual work to minimize possible problems that might occur. In addition, you should have technicians with higher seniority available if things get out of control.
From a testing standpoint, solutions should be tested based on their complexity, starting with the simple ones first, in order to achieve maximum efficiency.
Analyze the Results
The testing phase results may or may not be favorable, so you should have an iterative process in place that will permit you to go back to a different phase of the process until you find the right solution. This means that, as part of the troubleshooting process, you should know which phase you should go back to. For example, if you performed enough information gathering and you are sure that you have all the facts, you do not need to return to that phase.
The iterative process may also involve the following actions:
- Using an audio recording device throughout the process to capture the actions taken
- Sharing the results with the online community, using work groups and bulletin boards that can help you obtain answers quickly
- Escalation to a Level 2 or Level 3 technician
After finding and implementing the solution to the problem reported, most network professionals just move on without taking one important step into consideration: implementing preventative measures, to avoid having the issue occur again in the future. This includes procedures that mitigate the problem, such as building a redundant network topology with failover capabilities to minimize the effect of a device going down.
As a network technician, you should be prepared for unexpected risk, as the things that you do to fix a problem will often have unexpected consequences for other users, systems, or applications.
Document the Process and Solution
The documentation process should cover all phases of the troubleshooting process. Using a PDA or an audio recording device can assist in recording every step of the process, including mistakes and unexpected consequences. Another source for obtaining information to be used in the documentation process is logging servers that generate customized reports based on customized filters.
Various Web-based tools from different vendors are available for documentation purposes and for creating customized reports and summaries (using XML or other formats). A common document management system used for this purpose is Microsoft SharePoint, which offers the capability of using document libraries.
The end goal of this process is to generate a series of reports and summaries that complete the troubleshooting process and allow the technician to present the final resolution to management, since the technician is responsible for delivering documentation on the solution and on the entire process. Part of the goal in this phase is constant improvement, which includes storing the solution in a common knowledge database that can provide valuable information for similar cases in the future.
Summary
Troubleshooting is considered a science, as it always involves a pre-determined process. Network engineers who follow this process will get better at troubleshooting over time, optimizing it based on their experience.
A methodology is needed in order to speed up the problem finding and solving process. Although the methodology can be different from case to case, it is important to use one that best suits your company, network, and internal structure. A proposed methodology includes the following steps:
Step 1: Gather information
Step 2: Identify the affected areas in the network
Step 3: Determine whether anything has changed in the network
Step 4: Establish the most probable cause of the problem
Step 5: Determine whether escalation is necessary
Step 6: Create an action plan and a possible solution
Step 7: Implement and test the solution
Step 8: Analyze the results
Step 9: Document the process and solution