


Transport retry count exceeded on mlx5_2:1/IB #7416

Comments

paboyle commented Sep 16, 2021

Describe the bug

Runtime errors stopping the job

Steps to Reproduce

Non-reproducible and intermittent; it occurs when running "Grid" code in production jobs on Booster at Juelich.

Setup and versions

Linux jwlogin23.juwels 4.18.0-305.10.2.el8_4.x86_64 #1 SMP Tue Jul 20 17:25:16 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux


yosefe commented Sep 16, 2021

@paboyle Is there any other failure, except the "retry exceeded" error?
On how many nodes does the issue happen? Are there any UCX environment variables used?
Is the application using MPI-RMA?
It is likely that this issue is related to fabric configuration/health, so it should be taken up with Mellanox/NVIDIA support.

paboyle commented Sep 17, 2021

Hi,
Christoph Lehner is hitting the problem and will put his environment into this message thread.
Thanks!

lehner commented Sep 17, 2021

Thanks for getting back to us quickly! There are no other failures but typically a large (factor of 2) fluctuation in network performance prior to the errors. Application is not using MPI-RMA. UCX environment:

export UCX_MEMTYPE_CACHE=n
export UCX_RNDV_SCHEME=put_zcopy

I tried export UCX_IB_SL=1 as well, but the problem persisted in that case too. Thanks!

lehner commented Sep 17, 2021

It happened on 256 nodes

paboyle commented Sep 17, 2021

We’re using MPI_Sendrecv and MPI_Isend / Irecv pairs + all_reduce in code.

Whether it is software or hardware, we don't know; we hoped you could steer us based on the nature of the error reported.

paboyle commented Sep 17, 2021

gdr_copy, RNDV put_zcopy; many of the transfers are GPU-GPU.

paboyle commented Sep 17, 2021

Christoph — UCX version?

lehner commented Sep 17, 2021

Cuda 11.4
UCX:
ucx_info -v
UCT version=1.10.1 revision 6a5856e
configured with: --prefix=/p/software/juwelsbooster/stages/2020/software/UCX/1.10.1 --build=x86_64-pc-linux-gnu --host=x86_64-pc-linux-gnu --with-verbs --without-java --disable-doxygen-doc --enable-optimizations --enable-mt --disable-debug --disable-logging --disable-assertions --disable-params-check --disable-dependency-tracking --with-cuda=/p/software/juwelsbooster/stages/2020/software/CUDA/11.3 --enable-cma --with-rc --with-ud --with-dc --with-mlx5-dv --with-ib-hw-tm --with-dm --with-avx --with-gdrcopy --without-cm

yosefe commented Sep 17, 2021

For now, it looks like a HW issue.
As an experiment, can you pls try with UCX_RNDV_SCHEME=get_zcopy and UCX_TLS=dc (each one separately)?
Is adaptive routing enabled on the cluster? (You can make sure it is disabled by setting UCX_IB_AR_ENABLE=n.)

lehner commented Sep 20, 2021

Thank you, I have jobs in the queue to conduct both experiments.

AndiH commented Sep 21, 2021

We are seeing similar error messages on JUWELS Booster for other applications as well.

Could you also, separately, set UCX_RC_TIMEOUT=4s as an additional experiment?
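
For reference, the suggested experiments map onto environment settings like these (each run in a separate job; the variable names and values are exactly the ones proposed above):

export UCX_RNDV_SCHEME=get_zcopy   # instead of put_zcopy
export UCX_TLS=dc                  # restrict UCX to the DC transport
export UCX_IB_AR_ENABLE=n          # make sure adaptive routing is disabled
export UCX_RC_TIMEOUT=4s           # increase the RC transport timeout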


How to Fix ‘Interface CRC Error Count’ inside HD Tune

Some Windows users are reporting that they always end up seeing a warning (Ultra DMA CRC Error Count) when analyzing their HDD using the HD Tune utility. While some affected users are seeing this with used hard drives, others are reporting this issue with brand new HDDs.

Interface CRC Error Count inside HD Tune

What is Ultra DMA CRC Error Count?

This is a S.M.A.R.T. (Self-Monitoring, Analysis, and Reporting Technology) parameter that indicates the total quantity of CRC errors during UltraDMA mode. The raw value of this attribute indicates the number of errors found during data transfer in UltraDMA mode by ICRC (Interface CRC).

But keep in mind that this parameter is considered informational by most hardware vendors. Although the degradation of this parameter can be regarded as an indicator of an aging drive with potential electromechanical problems, it does NOT directly indicate imminent drive failure.

To get the complete picture of the health of your HDD, you need to pay attention to other parameters and the overall drive health.
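
If you want to read the raw counter yourself rather than rely on HD Tune's interpretation, one option (assuming the free smartmontools package is installed; the device path is just an example) is to dump the S.M.A.R.T. attribute table:

smartctl -A /dev/sda

Look for the row with ID 199, usually labelled UDMA_CRC_Error_Count; its raw value is the same counter that HD Tune reports.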

After investigating this issue thoroughly, it turns out that there are several different underlying causes that might end up producing this particular warning:

  • Generic False Positive – Keep in mind that a warning thrown by the HD Tune utility does not necessarily mean that your HDD is failing. This utility uses generic thresholds aggregated from every manufacturer, so data that is concerning for one manufacturer might not be concerning for another. To get a more accurate result, you will need to run the brand-specific diagnostic tool and see if the same kind of warning occurs.
  • Incompatibility between Samsung SSD and SATA Controller – If you’re encountering this issue with an SSD, chances are it’s due to a conflict between your solid-state drive and the Microsoft or AMD SATA controller driver. To fix this incompatibility, you’ll need to use Registry Editor to disable NCQ (Native Command Queue).
  • Faulty SATA Cable or SATA port – As it turns out, you can also expect to encounter this type of issue if you’re dealing with a faulty SATA port or a damaged SATA cable. In this case, you can identify the culprit by testing the HDD on a different machine and replacing the current SATA cable.
  • Failing HDD or SSD – Under certain circumstances, you can expect to see this warning in the early stages of a failing drive. In this case, the only thing you can do is back up your data before the drive breaks down for good and start looking for a replacement.

Now that you know every potential scenario that might cause this warning, here’s a list of methods that will help you identify and resolve the Ultra DMA CRC Error Count error:

Method 1: Running the brand-specific diagnostic tool

Keep in mind that the HD Tune Utility is a 3rd party tool that will ‘judge’ the health of an HDD solely by comparing its S.M.A.R.T. values against a set of generic thresholds.

Because of this, it’s highly recommended to avoid making a decision based on the HD Tune Utility alone and instead run the brand-specific diagnostic tool – the official testing tools are specifically designed for their brand’s products.

Depending on your HDD manufacturer, install and scan your hard drive with the proprietary diagnostic utility. To make matters easier for you, we’ve made a list of the most popular brand-specific diagnostic tools:

Note: If your HDD manufacturer is not included in the list above, search online for the diagnostic tool specific to your brand, then install and run it to see if the Ultra DMA CRC Error Count warning is still raised.

If the manufacturer-specific diagnostic tool doesn’t raise any concerns in relation to the value of Ultra DMA CRC Error Count, then you can safely ignore the warning thrown by HD Tune.

However, if the warning is also displayed in the manufacturer-specific analysis tool, move down to the next potential fix below.

Method 2: Fix the Incompatibility between Samsung SSD and SATA Controller (if applicable)

As it turns out, the Ultra DMA CRC Error Count error is not restricted to an HDD and can also occur if you’re using an SSD.

But if you’re seeing this error with a Samsung SSD, there’s a high chance that the issue has nothing to do with a bad cable or solid-state health – it’s most likely due to an incompatibility between your Samsung SSD and your chipset SATA controller.

If you find yourself in this particular scenario, you can fix the issue and prevent this warning from appearing by disabling NCQ (Native Command Queue) in your SATA driver.

Note: This will not affect the functionality of your SATA drive.

If this scenario is applicable, follow the instructions below to fix the incompatibility between your Samsung SSD and the SATA Controller:

  1. Press Windows key + R to open up a Run dialog box. Next, inside the text box, type ‘regedit’, then press Ctrl + Shift + Enter to open up the Registry Editor with admin access. When you’re prompted by the UAC (User Account Control), click Yes to grant administrative access.
  2. Once you’re inside the Registry Editor, use the left-hand menu to navigate to the following locations, depending on whether you’re using a Microsoft SATA Controller driver or an AMD SATA Controller driver:

Note: You can either navigate here manually or you can paste the location directly into the navigation bar

  3. Once you’re inside the correct location, right-click on Device, then choose New > DWORD (32-bit) Value from the context menu that appears.
  4. Next, name the newly created DWORD NcqDisabled if you’re using the Microsoft SATA Controller driver, or name it AmdSataNCQDisabled if you’re using the AMD SATA Controller driver.
  5. Finally, double-click on the DWORD that you’ve just created, then set the Base to Hexadecimal and the value to 1 to disable NCQ and prevent the same incompatibility from creating the Ultra DMA CRC Error Count error (a scriptable equivalent is sketched at the end of this method).

If the same issue is still occurring even after following the instructions above, or this scenario was not applicable, move down to the next potential fix below.
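
Before you move on: if you prefer to script the NCQ change rather than click through Registry Editor, a minimal sketch with the built-in reg command could look like this. The key path shown assumes the standard Microsoft AHCI driver (storahci); if you use the AMD driver, substitute the AMD key path from the list above and the AmdSataNCQDisabled value name.

reg add "HKLM\SYSTEM\CurrentControlSet\Services\storahci\Parameters\Device" /v NcqDisabled /t REG_DWORD /d 1 /f

Reboot afterwards so the SATA driver picks up the new value.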

    Method 3: Replace the power and SATA cable

    As several affected users have confirmed, this particular issue can also be associated with a faulty SATA cable or a faulty SATA port. Because of this, the Ultra DMA CRC Error Count error can also be a symptom of a damaged or poorly seated cable.

    To test this theory, connect your HDD to a different computer, or, if you don’t have a second machine to do some testing on, at least use a different SATA port and cable.

    Example of a SATA Port on the motherboard

    After you switch to a different SATA port, repeat the scan inside the HD Tune utility and see if the Ultra DMA CRC Error Count error is still occurring. If the issue has stopped occurring, consider taking your motherboard to an IT technician to investigate for loose pins.

    On the other hand, if the issue doesn’t occur while you use a different SATA cable, you’ve just managed to identify your culprit.

    In case you’ve eliminated both the SATA cable and the SATA port from the list of culprits, move down to the next potential fix below, as the issue is most likely occurring due to a failing drive.

    Method 4: Backup your HDD data

    If you’ve previously made sure that the Ultra DMA CRC Error Count warning is a genuine concern, the first thing you should do is back up your data to ensure that you don’t lose anything in case the drive goes bad.

    If you’re looking to back up your HDD data while you figure out which replacement to get, keep in mind that you have two ways forward – you can either back up your HDD using the built-in tools or use a 3rd party utility.

    A. Backing up the files on your HDD via Command Prompt

    If you’re comfortable with using an elevated CMD terminal, you can create a backup and save it on external storage without the need to install 3rd party software.

    But keep in mind that, depending on your preferred approach, you might need to insert or plug in compatible installation media.

    If you’re comfortable with this approach, here are the instructions for backing up your files from an elevated Command Prompt.
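
    The original step-by-step instructions are not reproduced here, but as a rough sketch, one common built-in option is robocopy, run from that elevated prompt (the source and destination paths are examples only):

    robocopy C:\Users D:\Backup\Users /E /XJ /R:1 /W:1 /LOG:D:\Backup\robocopy.log

    The /E switch copies all subfolders, /XJ skips junction points so the copy does not loop, and /R:1 /W:1 keep the job from stalling on files that cannot be read, which is useful on a drive that is already throwing CRC errors.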

    B. Backing up the files on your HDD via an Imaging 3rd party software

    On the other hand, if you’re comfortable with trusting a 3rd party utility with your HDD backup, you’ll have a lot of extra features that are simply not available when creating a regular backup via Command Prompt.

    You can use a 3rd party backup software to either clone or create an image of your HDD and save it externally or on the cloud. Here’s a list of the best cloning & imaging software that you should consider using.

    Method 5: Send your HDD for replacement or order a replacement

    If you’ve made sure that the Ultra DMA CRC Error Count warning you’re seeing is genuine and you have successfully backed up your HDD data in advance, the only thing you can do right now is to look for a replacement.

    Of course, if your HDD is still protected by the warranty, you should send it in for repair right away.

    But if the warranty has expired, or if you still have the option to return it, our recommendation is to stay away from a legacy HDD (Hard Disk Drive) and go for an SSD (Solid State Drive) instead.

    Although SSDs are still more expensive than traditional HDDs, they are much less prone to failure, and their read and write speeds can be roughly 10x higher.

    If you’re in the market for an SSD, here’s our advanced guide to buying the best solid-state drive for your needs.


    Common Error Messages on Catalyst 6500/6000 Series Switches Running Cisco IOS Software



    Introduction

    This document provides a brief explanation of common syslog and error messages that you see on Cisco Catalyst 6500/6000 series switches that run Cisco IOS® system software. Use the Cisco CLI Analyzer (registered customers only) if you have an error message that does not appear in this document. The tool provides the meaning of error messages that Cisco IOS Software and Catalyst OS (CatOS) software generate.

    Note: The exact format of the syslog and error messages that this document describes can vary slightly. The variation depends on the software release that runs on the Supervisor Engine.

    Note: This minimum logging configuration on the Catalyst 6500/6000 is recommended:

    Set the date and time on the switch, or configure the switch to use the Network Time Protocol (NTP) in order to obtain the date and time from an NTP server.

    Ensure that logging and logging time stamps are enabled, which is the default.

    Configure the switch to log to a syslog server, if possible.
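
    As a minimal sketch of that recommended baseline (the NTP and syslog server addresses are placeholders):

    Router(config)# ntp server 192.0.2.10
    Router(config)# service timestamps log datetime msec localtime
    Router(config)# logging on
    Router(config)# logging host 192.0.2.20
    Router(config)# logging trap informational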

    Prerequisites

    Requirements

    There are no specific requirements for this document.

    Components Used

    This document is not restricted to specific software and hardware versions.

    Conventions

    Refer to Cisco Technical Tips Conventions for more information on document conventions.

    %C6KPWR-SP-4-UNSUPPORTED: unsupported module in slot [num], power not allowed: [chars]

    Problem

    The switch reports this error message:

    C6KPWR-SP-4-UNSUPPORTED: unsupported module in slot [num], power not allowed: [chars]

    This example shows the console output that is displayed when this problem occurs:

    Description

    This message indicates that the module in the specified slot is not supported. The [num] is the slot number, and [chars] provides more details about the error.

    Workaround

    Upgrade the Supervisor Engine software to a version that supports the hardware module. Refer to the Supported Hardware section of the Cisco Catalyst 6500 Series Switches Release Notes for the relevant release. In order to resolve the issue that the message describes, perform one of these actions:

    Insert or replace the Switch Fabric Module.

    Move the unsupported module to a different slot.

    %DUAL-3-INTERNAL: IP-EIGRP 1: Internal Error

    Problem

    The switch reports this error message:

    %DUAL-3-INTERNAL: IP-EIGRP 1: Internal Error

    Description

    The error message indicates that there is an internal bug in the Cisco IOS Software. The bug has been fixed in these releases:

    Cisco IOS Software Release 12.2(0.4)

    Cisco IOS Software Release 12.1(6.1)

    Cisco IOS Software Release 12.2(0.5)T

    Cisco IOS Software Release 12.1(6.5)E

    Cisco IOS Software Release 12.1(6.5)EC

    Cisco IOS Software Release 12.1(6)E02

    Cisco IOS Software Release 12.2(0.18)S

    Cisco IOS Software Release 12.2(2)B

    Cisco IOS Software Release 12.2(15)ZN

    Workaround

    Upgrade the Cisco IOS Software to one of these releases or to the latest release.

    %EARL_L3_ASIC-SP-4-INTR_THROTTLE: Throttling "IP_TOO_SHRT"

    Problem

    The switch reports this error message:

    %EARL_L3_ASIC-SP-4-INTR_THROTTLE: Throttling "IP_TOO_SHRT"

    This example shows the console output that is displayed when this problem occurs:

    Description

    This message indicates that the switch forwarding engine receives an IP packet of a length that is shorter than the minimum allowed length. The switch drops the packet. In earlier versions, the packet is silently dropped and counted in the forwarding engine statistics. In later versions, the error message is recorded in the syslog once every 30 minutes. These issues can cause the switch forwarding engine to receive this type of IP packet:

    A bad network interface card (NIC) driver

    A NIC driver bug

    A bad application

    The switch simply reports that it has received these "bad" packets and intends to drop them.

    Workaround

    The origin of the problem is external to the switch. Unfortunately, the forwarding engine does not keep track of the source IP address of the device that sends these bad packets. The only way to detect the device is to use a sniffer to track down the source and then replace the device.

    %EARL_L3_ASIC-SP-3-INTR_WARN: EARL L3 ASIC: Non-fatal interrupt [chars]

    Problem

    The switch reports this error message:

    EARL_L3_ASIC-SP-3-INTR_WARN: EARL L3 ASIC: Non-fatal interrupt [chars]

    This example shows the console output that is displayed when this problem occurs:

    Description

    The error message %EARL_L3_ASIC-SP-3-INTR_WARN indicates that the Enhanced Address Recognition Logic (EARL) Layer 3 (L3) application-specific integrated circuit (ASIC) detected an unexpected non-fatal condition. This indicates that a bad packet, probably a packet which contains a Layer 3 IP checksum error, was received and dropped. The cause of the issue is a device on the network that sends out bad packets. These issues, among others, can cause the bad packets:

    Bad NIC drivers

    In older Cisco IOS Software releases, these packets are normally dropped without being logged. The logging of error messages about this problem is a feature found in Cisco IOS Software Release 12.2SX and later.

    Workaround

    This message is for informational purposes only. As a workaround, use one of these two options:

    Use a network sniffer in order to identify the source that sends out the erroneous packets. Then, resolve the issue with the source device or application.

    Disable Layer 3 error checks in the switch hardware for:

    Packet checksum errors

    Packet length errors

    Packets that have the same source and destination IP addresses

    Use the no mls verify command to stop these error checks, as these examples show:
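
    The original console examples are not reproduced here; as a hedged sketch, the global configuration commands generally take this form on PFC3-based Supervisor Engines (verify the exact syntax for your release):

    Router(config)# no mls verify ip checksum
    Router(config)# no mls verify ip length minimum
    Router(config)# no mls verify ip same-address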

    %EARL_NETFLOW-4-TCAM_THRLD: Netflow TCAM threshold exceeded, TCAM Utilization [[dec]%]

    Problem

    The switch reports this error message:

    EARL_NETFLOW-4-TCAM_THRLD: Netflow TCAM threshold exceeded, TCAM Utilization [[dec]%]

    This example shows the console output that is displayed when this problem occurs:

    Note: If you want to filter out this specific error message, be aware that all error messages with the same severity level will be filtered. A specific log message cannot be filtered without affecting other logs that are under the same severity level.

    Description

    This message indicates that the NetFlow ternary content addressable memory (TCAM) is almost full. Aggressive aging will be temporarily enabled. If you change the NetFlow mask to FULL mode, TCAM for NetFlow can overflow because there are so many entries. Issue the show mls netflow ip count command in order to check this information.

    The Supervisor Engine 720 checks how full the NetFlow table is every 30 seconds. The Supervisor Engine turns on aggressive aging when the table size reaches almost 90 percent. The idea behind aggressive aging is that the table is nearly full, so there are new active flows that cannot be created. Therefore, it makes sense to aggressively age out the less active flows (or inactive flows) in the table in order to make space for more active flows.

    The capacity for each policy feature card (PFC) NetFlow table (IPv4), for PFC3a and PFC3b, is 128,000 flows. For the PFC3bXL, the capacity is 256,000 flows.

    Workaround

    In order to prevent this problem, disable the FULL NetFlow mode. Issue the no mls flow ip command.

    Note: Generally, the no mls flow ip command does not affect packet forwarding because TCAM for packet forwarding and TCAM for NetFlow accounting are separate.

    In order to recover from this issue, enable MLS fast aging. When you enable MLS fast aging time, initially set the value to 128 seconds. If the size of the MLS cache continues to grow over 32K entries, decrease the setting until the cache size remains less than 32K entries. If the cache continues to grow over 32K entries, decrease the normal MLS aging time. Any aging-time value that is not a multiple of 8 seconds is adjusted to the closest multiple of 8 seconds.

    Another workaround is to disable service internal, if you have it enabled, and to remove mls flow ip interface-full if you do not need full-flow NetFlow.
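
    In command form, the workaround described above looks roughly like this (the aging values are the initial ones suggested; tune them to your flow table size):

    Router# show mls netflow ip count
    Router(config)# no mls flow ip
    Router(config)# mls aging fast time 128
    Router(config)# mls aging normal 64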

    %ETHCNTR-3-LOOP_BACK_DETECTED : Keepalive packet loop-back detected on [chars]

    Problem

    The switch reports this error message, and the port is forced to linkdown:

    %ETHCNTR-3-LOOP_BACK_DETECTED : Keepalive packet loop-back detected on [chars]

    This example shows the console output that is displayed when this problem occurs:

    Description

    The problem occurs because the keepalive packet is looped back to the port that sent the keepalive. Keepalives are sent on the Catalyst switches in order to prevent loops in the network. Keepalives are enabled by default on all interfaces. You see this problem on the device that detects and breaks the loop, but not on the device that causes the loop.

    Workaround

    Issue the no keepalive interface command in order to disable keepalives. Disabling the keepalive prevents errdisablement of the interface, but it does not remove the loop.

    Note: In Cisco IOS Software Release 12.2(x)SE-based releases and later, keepalives are not sent on fiber and uplink interfaces by default.
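
    For reference, disabling keepalives on a hypothetical affected interface looks like this:

    Router(config)# interface GigabitEthernet1/1
    Router(config-if)# no keepalive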

    loadprog: error - on file open boot: cannot load "cisco2-Cat6k-MSFC"

    Problem

    The switch reports this error message:

    loadprog: error - on file open boot: cannot load "bootflash:c6msfc2-boot-mz.121-8a.EX"

    Description

    The problem occurs only at an unaligned write to the device that is close to an internal 64-byte boundary. The problem can occur under one of these circumstances:

    During the write of a crash dump file

    Something causes the system to crash at the time of the write of the file.

    When code is corrupted during migration from CatOS to Cisco IOS Software

    Workaround

    The workaround is to modify the device driver so that it correctly handles unaligned access. If the error occurs because of a code corruption during migration from CatOS to Cisco IOS Software, erase the Flash and download a new, valid CatOS software image.

    %L3_ASIC-DFC3-4-ERR_INTRPT: Interrupt TF_INT:FI_DATA_INT

    Problem

    The switch reports this error message:

    %L3_ASIC-DFC3-4-ERR_INTRPT: Interrupt TF_INT:FI_DATA_INT occurring in EARL Layer 3 ASIC

    Description

    This error message indicates that there is an error in the Layer 3 (L3) forwarding application-specific integrated circuit (ASIC). Basically, the switch shows this message when some transient traffic passes through the ASIC and the software simply reports the occurrence of an interrupt condition. As soon as this condition is met, the counters that the show earl statistics command shows increase. Every time that the software tries to recover from such a state, the switch generates this syslog message. Generally, this message is informational if its occurrence remains low. But if the error message occurs frequently, there can be a problem with the hardware.

    Check the counters value in the show earl statistics command output. If the counters increase rapidly, it indicates a possible problem with the hardware.

    %MLS_STAT-SP-4-IP_LEN_ERR: MAC/IP length inconsistencies

    Problem

    The switch reports this error message:

    %MLS_STAT-SP-4-IP_LEN_ERR: MAC/IP length inconsistencies

    This example shows the console output that is displayed when this problem occurs:

    Description

    These messages indicate that packets were received in which the IP length does not match the MAC length of the packet. The Supervisor Engine dropped these packets. There are no negative effects on the switch because it drops the packets. The switch reports the message for informational purposes. The cause of the issue is a device on the network that sends out bad packets. These issues, among others, can cause the bad packets:

    Bad NIC drivers

    Use a network sniffer in order to find the source that sends out the erroneous packets. Then, resolve the issue with the source device or application.

    The other workaround is a switch configuration that stops the switch checks for:

    Packet checksum errors

    Packet length errors

    Packets that have the same source and destination IP addresses

    Use these commands in order to stop the switch checks:

    %MLS_STAT-SP-4-IP_CSUM_ERR: IP checksum errors

    Problem

    The switch reports this error message:

    %MLS_STAT-SP-4-IP_CSUM_ERR: IP checksum errors

    This example shows the console output that is displayed when this problem occurs:

    Description

    These messages indicate that the switch receives IP packets that have an invalid checksum value. There are no negative effects on the switch because the switch drops the packets. The switch reports the message for informational purposes. The cause of the issue is a device on the network that sends out bad packets. These issues, among others, can cause the bad packets:

    Bad NIC drivers

    Workaround

    As a workaround, use one of these two options:

    Use a network sniffer in order to identify the source that sends out the erroneous packets. Then, resolve the issue with the source device or application.

    Disable Layer 3 error checks in the switch hardware for both:

    Packet checksum errors

    Packet length errors

    In order to stop these error checks, use the no mls verify command, as these examples show:

    %MCAST-SP-6-ADDRESS_ALIASING_FALLBACK

    Problem

    The switch reports this error message:

    This example shows the console output that is displayed when this problem occurs:

    Description

    This message indicates that the switch receives excessive multicast traffic that is destined for a multicast MAC address in the 01-00-5e-00-00-xx range. This multicast range is reserved for Internet Group Management Protocol (IGMP) control traffic, for example:

    The switch CPU normally processes all the IGMP control traffic. Therefore, Cisco IOS Software provides a mechanism to ignore excessive IGMP multicast traffic that is destined for reserved addresses. The mechanism ensures that the CPU does not become overwhelmed. Use of this mechanism is referred to as "fallback mode".

    Find the source of the illegal multicast traffic. Then, either stop the transmission or modify the characteristics of the stream so that the transmission no longer infringes upon the IGMP control data space. Also, use the error message in the Problem section, which identifies a network source that potentially causes the problem.

    c6k_pwr_get_fru_present(): can’t find fru_info for fru type 6, #

    Problem

    The switch reports this error message:

    c6k_pwr_get_fru_present(): can’t find fru_info for fru type 6, #

    This example shows the console output that is displayed when this problem occurs:

    Description

    This error message appears because of an erroneous response from the switch to Simple Network Management Protocol (SNMP) polling of the port adapters that Flex WAN modules use. This error message is cosmetic in nature, and there are no detrimental switch performance issues. The issue is fixed in these releases:

    Cisco IOS Software Release 12.1(11b)E4

    Cisco IOS Software Release 12.1(12c)E1

    Cisco IOS Software Release 12.1(13)E

    Cisco IOS Software Release 12.1(13)EC

    %MROUTE-3-TWHEEL_DELAY_ERR

    Problem

    The switch reports this error message:

    This example shows the console output that is displayed when this problem occurs:

    Description

    This message appears when the switch receives Protocol Independent Multicast (PIM) join/prune packets that advertise a high hold-time value. The packets advertise a higher hold-time value than the maximum delay that the OS of the switch allows, which is 4 minutes. These packets are multicast control packets, such as PIM, Distance Vector Multicast Routing Protocol (DVMRP), and other types.

    Later releases of Cisco IOS Software for the Catalyst 6500/6000 have increased this maximum delay to 65,535 seconds, or approximately 17 minutes. The issue is fixed in these releases:

    Cisco IOS Software Release 12.1(12c)E

    Cisco IOS Software Release 12.2(12)T01

    Cisco IOS Software Release 12.1(13)E

    Cisco IOS Software Release 12.1(13)EC

    Workaround

    Configure the third-party device that generates the PIM packets to use timers that are recommended by protocol standards.

    %MCAST-SP-6-GC_LIMIT_EXCEEDED

    Problem

    The switch reports this error message:

    This example shows the console output that is displayed when this problem occurs:

    Description

    This error message is logged when the IGMP snooping function on the switch has created the maximum number of allowed Layer 2 (L2) entries. The default maximum number of L2 entries that the switch can create for multicast groups is 15,488. In later versions of Cisco IOS Software, only the hardware-installed L2 multicast entries count toward the limit. Refer to Cisco bug ID CSCdx89380 (registered customers only) for more details. The issue is fixed in Cisco IOS Software Release 12.1(13)E1 and later.

    Workaround

    You can manually raise the L2 limit. Issue the ip igmp l2-entry-limit command.
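
    A sketch of the command, with an illustrative limit value (the exact syntax and supported range can vary by release, so check the command reference for yours):

    Router(config)# ip igmp l2-entry-limit 20000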

    %MISTRAL-SP-3-ERROR: Error condition detected: TM_NPP_PARITY_ERROR

    Problem

    The switch reports this error message:

    %MISTRAL-SP-3-ERROR: Error condition detected: TM_NPP_PARITY_ERROR

    This example shows the console output that is displayed when this problem occurs:

    Description

    This error message indicates that there was a parity error in the next-page pointer of the internal Table Manager. If the switch runs Cisco IOS Software Release 12.1(8)E or later, the switch detects the parity error and resets the Mistral ASIC. The switch can then continue, without the need to reload. A random static discharge or other external factors can cause the memory parity error. If you see the error message only once or rarely, monitor the switch syslog in order to confirm that the error message is an isolated incident. If these error messages reoccur, create a service request with Cisco Technical Support.

    %MLS_STAT-4-IP_TOO_SHRT: Too short IP packets received

    Problem

    The switch reports this error message:

    %MLS_STAT-4-IP_TOO_SHRT: Too short IP packets received

    This example shows the console output that is displayed when this problem occurs:

    Description

    The message indicates that the switch forwarding engine receives an IP packet of a length that is shorter than the minimum allowed length. The switch drops the packet. In earlier versions, the packet is silently dropped and counted in the forwarding engine statistics. This applies to software releases that are earlier than 7.x or earlier than Cisco IOS Software Release 12.1(13)E. In software releases that are later than 7.x or later than Cisco IOS Software Release 12.1(13)E, the message is recorded in the syslog once every 30 minutes.

    There is no effect on the switch side. The switch drops the bad packet, which the receiving device would have dropped consequently. The only concern is that there is a device that sends bad packets. Possible causes include:

    A bad NIC driver

    A NIC driver bug

    A bad application

    Because of hardware limitations, the Supervisor Engine does not keep track of the source IP, MAC address, or port of the device that sends the bad packets. You must use a packet-sniffing application in order to detect these devices and track down the source address.

    The message in the Problem section is simply a warning/informational message from the switch. The message does not provide any information about the source port, MAC address, or IP address.

    Use a packet-sniffing application inside the network. Try to shut down some interface or remove some device from the network in order to determine if you can isolate the device that malfunctions.

    Processor [number] of module in slot [number] cannot service session requests

    Problem

    The switch reports this error message:

    Processor [number] of module in slot [number] cannot service session requests

    Description

    This error occurs when you issue the session slot number processor number command in an attempt to establish a session in these situations:

    You try to establish a session to a module to which a session has already been established while logging into the switch.

    You try to establish a session for an unavailable module in the slot.

    You try to establish a session for an unavailable processor in the module.

    %PM_SCP-1-LCP_FW_ERR: System resetting module [dec] to recover from error: [chars]

    Problem

    The switch reports this error message:

    %PM_SCP-1-LCP_FW_ERR: System resetting module [dec] to recover from error: [chars]

    These examples show the console output that is displayed when this problem occurs:

    %PM_SCP-SP-1-LCP_FW_ERR: System resetting module 13 to recover from error: Linecard received system exception

    %PM_SCP-SP-1-LCP_FW_ERR: System resetting module 4 to recover from error: Coil Pb Rx Parity Error - Port #14

    Description

    The message indicates that the firmware of the specified module has detected an error. The system automatically resets the module in order to recover from the error. The [dec] is the module number, and [chars] is the error.

    Workaround

    Reseat the module or put the module in a different slot and allow the module to go through the complete bootup diagnostics test. For more information on online diagnostics on the Catalyst 6500 series switches, refer to Configuring Online Diagnostics. After the module passes the diagnostics test, monitor the recurrence of the error message. If the error occurs again or the diagnostics test detects any issues, create a service request with Cisco Technical Support for further troubleshooting.

    %PM_SCP-2-LCP_FW_ERR_INFORM: Module [dec] is experiencing the following error: [chars]

    Problem

    The switch reports this error message:

    %PM_SCP-2-LCP_FW_ERR_INFORM: Module [dec] is experiencing the following error: [chars]

    This example shows the console output that is displayed when this problem occurs:

    %PM_SCP-SP-2-LCP_FW_ERR_INFORM: Module 4 is experiencing the following error: Bus Asic #0 transient Pb error

    Description

    The module reports an error condition, where [dec] is the module number and [chars] is the error. This condition is usually caused by an improperly seated line card or a hardware failure. If the error message is seen on all of the line cards, the cause is an improperly seated module.

    Workaround

    Reseat and reset the line card or the module. Then issue the show diagnostic result module module_# command.

    If the error message persists after the module is reset, create a service request with Cisco Technical Support for further troubleshooting.

    %PM_SCP-SP-2-LCP_FW_ERR_INFORM: Module [dec] is experiencing the following error: [chars]

    Problem

    The switch reports this error message:

    %PM_SCP-SP-2-LCP_FW_ERR_INFORM: Module 4 is experiencing the following error: Port #36 transient TX Pb error

    Description

    This error message indicates a transient error on module number 4 in the datapath of port 36. In most cases, this is a one time/transient issue.

    Workaround

    Shut and unshut the port Gi4/36, and monitor for recurrence of the issue.

    If the error re-occurs, set the diagnostic to complete with the diagnostic bootup level complete command. Then, physically reseat the linecard.

    If the error message persists after the module is reseated, create a service request with Cisco Technical Support for further troubleshooting with these command outputs:

    %PM_SCP-SP-4-UNK_OPCODE: Received unknown unsolicited message from module [dec], opcode [hex]

    Problem

    The switch reports this error message:

    %PM_SCP-SP-4-UNK_OPCODE: Received unknown unsolicited message from module [dec], opcode [hex]

    These examples show the console output that is displayed when this problem occurs:

    Dec 10 12:44:18.117: %PM_SCP-SP-4-UNK_OPCODE: Received unknown unsolicited message from module 2, opcode 0x330

    Dec 10 12:44:25.210: %PM_SCP-SP-4-UNK_OPCODE: Received unknown unsolicited message from module 2, opcode 0x114

    Description

    This error message simply indicates that the Supervisor Engine does not understand the control message from the line card because of features that are not supported by the switch Cisco IOS Software release.

    Line cards send out control messages to the active Supervisor Engine that indicate the features that the software supports. But if the software does not support any of the line card features, these control messages are not recognized and the error message is displayed. This message is a harmless occurrence and does not affect any functions on the Supervisor Engine or the line cards.

    Workaround

    Upgrade the Supervisor Engine software to the latest version that has the maximum feature support. Because this error message does not affect production or traffic, you can ignore the message.

    %PM_SCP-SP-3-TRANSCEIVER_BAD_EEPROM: Integrity check on transceiver in LAN port 5/2 failed: bad key

    Problem

    The switch reports this error message:

    %PM_SCP-SP-3-TRANSCEIVER_BAD_EEPROM: Integrity check on transceiver in LAN port 5/2 failed: bad key

    Description

    The reason for this error message is the usage of non-Cisco SFP GBIC, which is not supported.

    Cisco SFP GBICs have a unique encrypted code (Quality ID) that enables Cisco IOS/CatOS to identify Cisco pluggable parts. Non-Cisco GBICs do not have this code, so the integrity check fails even though the GBIC itself can possibly still work. Refer to %PM_SCP-SP-3-TRANSCEIVER_BAD_EEPROM for more information.

    %PM_SCP-SP-3-LCP_FW_ABLC: Late collision message from module [dec], port:035

    Problem

    The switch reports this error message:

    %PM_SCP-SP-3-LCP_FW_ABLC: Late collision message from module 3, port:035

    Description

    Late Collisions: A late collision occurs when two devices transmit at the same time and neither side of the connection detects a collision. This happens because the time to propagate the signal from one end of the network to the other is longer than the time to put the entire packet on the network. The two devices that cause the late collision never see that the other is sending until after they put the entire packet on the network. Late collisions are not detected by the transmitter until after the first 64-byte slot time, because they can only be detected in transmissions of packets longer than 64 bytes.

    Possible Causes: Late collisions result from a duplex mismatch, incorrect cabling, or a non-compliant number of hubs in the network. Bad NICs can also cause late collisions.

    %PM-3-INVALID_BRIDGE_PORT: Bridge Port number is out of range

    Problem

    The switch reports this error message:

    Description

    This issue appears to be cosmetic and is due to an SNMP poll of the MIB object dot1dTpFdbEntry.

    Workaround

    You can block the OID from being polled on this device. This defect is fixed in Cisco IOS Software Release 12.2(33)SRD04 and later.

    %QM-4-TCAM_ENTRY: Hardware TCAM entry capacity exceeded

    Problem

    The switch reports this error message:

    %QM-4-TCAM_ENTRY: Hardware TCAM entry capacity exceeded

    Description

    TCAM is a specialized piece of memory designed for rapid table lookups by the ACL and QoS engines. This message indicates exhaustion of the TCAM resources and software switching of packets. Each interface has its own ID in TCAM and therefore uses more TCAM resources. Most likely, this problem is caused either by the presence of the mls qos marking statistics command or by a configuration in which the hardware TCAM does not have the capacity to handle all of the configured ACLs.

    Workaround

    Disable the mls qos marking statistics command as it is enabled by default.

    Try to share the same ACLs across multiple interfaces in order to reduce the TCAM resource contention.

    %slot_earl_icc_shim_addr: Slot [num] is neither SuperCard nor Supervisor - Invalid slot

    Problem

    The switch reports this error message:

    %slot_earl_icc_shim_addr: Slot [num] is neither SuperCard nor Supervisor - Invalid Slot

    Description

    This message occurs when an SNMP Manager polls for the TCAM data of a line card which does not have any TCAM information. This occurs only for a line card in a Catalyst 6500 switch that runs Cisco IOS Software. If the line card has TCAM information during the SNMP poll, the data is given to the network management system (NMS) for further processing. Refer to Cisco bug ID CSCec39383 (registered customers only) for more details. This issue is fixed in Cisco IOS Software Release 12.2(18).

    As a workaround, you can block the query of TCAM data by the NMSs. The MIB object that provides TCAM usage data is cseTcamUsageTable. Complete these steps on the router in order to avoid tracebacks:

    Issue the snmp-server view tcamBlock cseTcamUsageTable excluded command.

    Issue the snmp-server view tcamBlock iso included command.

    Issue the snmp-server community public view tcamBlock ro command.

    Issue the snmp-server community private view tcamBlock rw command.

    %SYSTEM_CONTROLLER-SP-3-ERROR: Error condition detected: TM_NPP_PARITY_ERROR

    Problem

    The switch reports this error message:

    %SYSTEM_CONTROLLER-SP-3-ERROR: Error condition detected: TM_NPP_PARITY_ERROR

    This example shows the console output that is displayed when this problem occurs:

    Description

    The most common errors from the Mistral ASIC on the MSFC are TM_DATA_PARITY_ERROR, SYSDRAM_PARITY_ERROR, SYSAD_PARITY_ERROR, and TM_NPP_PARITY_ERROR. Possible causes of these parity errors are random static discharge or other external factors. This error message indicates that there was a parity error. Processor Memory Parity Errors (PMPEs) are broken down into two types: single event upsets (SEUs) and repeated errors.

    These single-bit errors occur when a bit in a data word changes unexpectedly due to external events (which causes, for example, a zero to spontaneously change to a one). SEUs are a universal phenomenon irrespective of vendor or technology. SEUs occur very infrequently, but all computer and network systems, even a PC, are subject to them. SEUs are also called soft errors; they are caused by noise and result in a transient, inconsistent error in the data that is unrelated to a component failure, most often the result of cosmic radiation.

    Repeated errors (often referred to as hard errors) are caused by failed components. A hard error is caused by a failed component or a board-level problem, such as an improperly manufactured printed circuit board, that results in repeated occurrences of the same error.

    Workaround

    If you see the error message only once or rarely, monitor the switch syslog in order to confirm that the error message is an isolated incident. If these error messages reoccur, reseat the supervisor engine blade. If the errors stop, it was a hard parity error. If these error messages continue to reoccur, open a case with the Technical Assistance Center.

    %SYSTEM_CONTROLLER-SW2_SPSTBY-3-ERROR: Error condition detected: TM_NPP_PARITY_ERROR

    Problem

    The switch reports this error message:

    %SYSTEM_CONTROLLER-SW2_SPSTBY-3-ERROR: Error condition detected: TM_NPP_PARITY_ERROR

    Description

    This error message indicates that there was a parity error. Possible causes are a random static discharge or other external factors, a transient back-panel connectivity problem, or power issues. Sometimes the line card is also not able to access the serial PROM (SPROM) contents on the module in order to determine the identification of the line card.

    All computer and network systems are susceptible to the rare occurrence of Single Event Upsets (SEUs), sometimes described as parity errors. These single-bit errors occur when a bit in a data word changes unexpectedly due to external events, which causes, for example, a zero to spontaneously change to a one. SEUs are a universal phenomenon irrespective of vendor and technology. SEUs occur very infrequently, but all computer and network systems, even a PC, are subject to them. SEUs are also called soft errors; they are caused by noise, result in a transient, inconsistent error in the data, and are unrelated to a component failure.

    Repeated errors, often referred to as hard errors, are caused by failed components. A hard error is caused by a failed component or a board-level problem, such as an improperly manufactured printed circuit board, that results in repeated occurrences of the same error.

    Workaround

    If these error messages reoccur, reseat the supervisor module during the maintenance window.

    SP: Linecard endpoint of Channel 14 lost Sync. to Lower fabric and trying to recover now!

    Problem

    The switch reports this error message:

    SP: Linecard endpoint of Channel 14 lost Sync. to Lower fabric and trying to recover now!

    Description

    The error message usually points to a mis-seated linecard. In most cases, you can physically reseat the linecard in order to solve this problem. In some cases, the module is faulty.

    Issue the show fabric fpoe map command in order to identify the module that causes this error message.

    This example is the result of the show fabric fpoe map command. From the output, you can identify that the module in slot 12 causes the error message.

    Reseat the module which causes the error message.

    %SYSTEM-1-INITFAIL: Network boot is not supported

    Problem

    While a Cisco Catalyst 6000/6500 switch boots, it can throw an error message similar to this:

    Description

    This error occurs mostly when the boot variables are not configured properly to boot the switch from a valid flash device.

    In the illustration, notice the last line of the message:

    The name of the flash device mentioned is bootdisk, and the first part of the IOS filename, s72033, indicates that the IOS is for the Supervisor 720 module. The Supervisor 720 module does not have or support a flash device named bootdisk. Because the Supervisor 720 module does not have a local flash of that name, the switch assumes that you want to boot from the network, so it displays the error message.

    Resolution

    Configure the boot variable with the correct flash device name and the valid software file name.

    These flash devices are supported by the Supervisor modules:

    Supervisor Engine 1 and Supervisor Engine 2

    Flash Device Name Description
    bootflash: Onboard flash memory
    slot0: Linear Flash PC card (PCMCIA slot)
    disk0: ATA Flash PC card (PCMCIA slot)

    Supervisor Engine 720

    Flash Device Name Description
    bootflash: Onboard flash memory
    disk0: CompactFlash Type II card only (disk 0 slot)
    disk1: CompactFlash Type II card (disk 1 slot)

    Supervisor Engine 32

    Flash Device Name Description
    bootdisk: Onboard flash memory
    disk0: CompactFlash Type II card only (disk 0 slot)
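
    As a hedged example, on a Supervisor Engine 720 the boot variable could be set like this (the image filename is hypothetical; use the name of the image that is actually present on your flash device):

    Router(config)# boot system flash disk0:s72033-ipservicesk9_wan-mz.122-18.SXF.bin
    Router(config)# end
    Router# copy running-config startup-config
    Router# show bootvar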

    CPU_MONITOR-3-TIMED_OUT or CPU_MONITOR-6-NOT_HEARD

    Problem

    The switch reports these error messages:

    Description

    These messages indicate that CPU monitor messages have not been heard for a significant amount of time. A time-out most probably occurs, which resets the system. [dec] is the number of seconds.

    The problem possibly occurs because of these reasons:

    Badly seated line card or module

    Bad ASIC or bad backplane

    High traffic in the Ethernet out-of-band channel (EOBC)

    The EOBC channel is a half duplex channel that services many other functions, which includes Simple Network Management Protocol (SNMP) traffic and packets that are destined to the switch. If the EOBC channel is full of messages because of a storm of SNMP traffic, then the channel is subjected to collisions. When this happens, EOBC is possibly not able to carry IPC messages. This makes the switch display the error message.

    Workaround

    Reseat the line card or module. If a maintenance window can be scheduled, reset the switch in order to clear any transient issues.

    % Invalid IDPROM image for linecard

    Problem

    The %Invalid IDPROM image for linecard error message is received in Catalyst 6500 series switches running Cisco IOS system software.

    The error message can look similar to these messages:

    Description

    This error indicates that the installed linecards did not boot correctly because the supervisor generated a bad signal onto the control bus. In some scenarios, improper seating can also cause the supervisor or linecards not to be recognized on a Cat6500 chassis. Refer to Cisco bug ID CSCdz65855 (registered customers only) for more information.

    Workaround

    If a redundant supervisor setup is available, perform a forced switchover and reseat the originally active supervisor.

    If it is a single supervisor setup, schedule downtime and complete these steps:

    Move the supervisor module to another slot.

    Reseat all the line cards and make sure that they are placed properly.

    Refer to Online Insertion and Removal (OIR) of Modules in Cisco Catalyst Switches for more information on online insertion and removal of modules.

    %CPU_MONITOR-SP-6-NOT_HEARD or %CPU_MONITOR-SP-3-TIMED_OUT

    Problem

    The switch reports these error messages:

    Description

    The supervisor sends an SCP ping once every 2 seconds to each line card. If no response is received after 3 pings (6 seconds), it is counted as the first failure. After 25 such successive failures, or after 150 seconds of not receiving a response from the line card, the supervisor power cycles that line card. After every 30 seconds, this error message is seen on the switch:

    After 150 seconds, the module gets power cycled with these syslogs:

    %C6KPWR-4-DISABLED: Power to module in slot [dec] set [chars]

    Problem

    The switch reports this error message:

    This example shows the console output that is displayed when this problem occurs:

    Description

    This message indicates that the module in the indicated slot was powered off for the indicated reason. [dec] is the slot number, and [chars] indicates the power status.

    The switch has its normal vibrations, and over time those vibrations can cause a module to come slightly away from the backplane. When this happens, the supervisor's keepalive polling does not receive a response from the module within the allotted time, and the supervisor reboots the module in order to try to gain a better connection to it. If the module still does not respond to the polls, the supervisor continuously reboots the module, eventually puts it into error disable, and does not allow any power to reach this module.

    Workaround

    A simple reseat of the module corrects this issue 90 percent of the time. If you reseat the module, it realigns the switch fabric and ensures a firm connection to the backplane.

    If the concerned module is the Content Switching Module (CSM), consider an upgrade of the CSM software to release 4.1(7) or later. This issue is documented at Cisco bug ID CSCei85928 (against CSM software) (registered customers only) and Cisco bug ID CSCek28863 (against Cisco IOS software) (registered customers only).

    The latest CSM software can be downloaded from the Cisco Catalyst 6000 Content Switching Module software download page.

    ONLINE-SP-6-INITFAIL: Module [dec]: Failed to [chars]

    Problem

    The switch reports the error message:

    This example shows the console output that is displayed when this problem occurs:

    Description

    The cause of the crash is that the Pinnacle ASIC failed to synchronize. This is usually caused by a bad contact or a badly seated card.

    Workaround

    The system recovers without user intervention. If the error message recurs, then reseat the concerned line card or module.

    FM_EARL7-4-FLOW_FEAT_FLOWMASK_REQ_FAIL

    Problem

    The switch reports the error message:

    This example shows the console output that is displayed when this problem occurs:

    Description

    The flow mask request for the flow-based feature is unsuccessful. This condition can occur because of a TCAM resource exception, a flow mask register resource exception, or an unresolvable flow mask conflict with other NetFlow-based features. The NetFlow shortcut installation and hardware acceleration for the feature can be disabled under this condition, and the feature can be applied in software.

    If you have ingress reflexive ACLs only (reflect and eval configured in the ingress direction on different interfaces), then the reflexive ACL flowmask requirement is based on the ingress reflexive ACLs. As long as the reflexive ACL is configured on a different interface than QoS micro-flow policing, or does not overlap with the micro-flow policing policy ACL when on the same interface, they can coexist in hardware. If they are on the same interface and the reflexive ACL and QoS policy overlap, then the reflexive ACL disables NetFlow shortcut installation, and traffic that matches the reflexive ACL is software switched. This is due to the conflicting flowmask requirements.

    In the case of an egress reflexive ACL, the reflexive ACL flowmask requirement is global on all the interfaces, since there is only ingress NetFlow. If QoS user-based micro-flow policing is configured in this case, the reflexive ACL disables NetFlow shortcut installation, and traffic that matches the reflexive ACL is software switched.

    Workaround

    Issue the show fm fie flowmask command in order to determine the NetFlow shortcut installation enable / disable status for the feature. If the NetFlow shortcut installation and hardware acceleration is disabled for the feature, use only ingress reflexive access-lists in combination with micro-flow policing, and make sure that the micro-flow policer does not overlap with the reflexive access-list. Reapply the feature for the flow mask request to succeed, and re-enable the NetFlow shortcut installation for the feature.

    MCAST-2-IGMP_SNOOP_DISABLE

    Problem

    The switch reports the error message:

    This example shows the console output that is displayed when this problem occurs:

    Description

    IGMP snooping is disabled, but the system receives multicast traffic. This situation forces multicast packets to be directed to the route processor and possibly floods it. IGMP snooping can be disabled automatically due to excessive multicast traffic. IGMP snooping basically looks at the control packets that are exchanged between routers and hosts and, based on the joins, leaves, and queries, updates which ports receive the multicast traffic.

    This message normally occurs because the route processor receives a much higher than expected rate of IGMP join packets or normal multicast packets destined to reserved Layer 3/Layer 2 multicast address ranges. Therefore, the switch runs out of resources and, as the logging message reports, mitigates by disabling IGMP snooping for a short period.

    Workaround

    You can enable the multicast rate-limiting feature and set the threshold to a higher value.

    Rate limiting is the preferred method because it prevents the queue from being overrun, which means that valid IGMP packets are less likely to be dropped and the snooping process on the switch can keep its state up to date.
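
    On PFC3-based supervisors, IGMP traffic punted to the route processor can be rate limited in hardware. The sketch below is an illustration only; the packets-per-second and burst values are placeholders that must be sized for your multicast load.

    configure terminal
     mls rate-limit multicast ipv4 igmp 1000 10
    end
    show mls rate-limit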

    Complete these steps in order to troubleshoot this issue (a sketch of the command sequence follows the steps):

    Disable IGMP snooping with the command no ip igmp snooping.

    Set up a SPAN session on the management VLAN interface of your Catalyst 6500 in order to identify the MAC address of the source that sends the excessive traffic.

    Look into the CAM table in order to identify the source, and remove that source.

    Re-enable IGMP snooping.
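
    The sketch below walks through the four steps, assuming the management VLAN is VLAN 1 and that GigabitEthernet2/1 is used as the SPAN destination; both are placeholders. On older releases the MAC table command is show mac-address-table rather than show mac address-table.

    configure terminal
     no ip igmp snooping
     monitor session 1 source vlan 1 rx
     monitor session 1 destination interface GigabitEthernet2/1
    end
    show mac address-table dynamic vlan 1
    ! Identify and remove or isolate the offending source, then re-enable snooping:
    configure terminal
     ip igmp snooping
    end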

    C6KERRDETECT-2-FIFOCRITLEVEL: System detected an unrecoverable resources error on the active supervisor pinnacle

    Problem

    The switch reports these error messages. The error message can be one of these two types:

    Description

    The root cause of this error is possibly a defective or mis-seated module. It can also be a chassis problem with this particular slot. If the module is mis-seated, the issue can be transient.

    These messages indicate that the system detected an unrecoverable resource error, caused by a First In, First Out (FIFO) problem, on the indicated Pinnacle ASIC or port ASIC.

    Workaround

    Issue the remote command switch show platform hardware asicreg pinnacle slot 1 port 1 err command in order to resolve this error, and configure the switch to run enhanced hardware tests with these steps (a hypothetical session follows the steps):

    Note: Type the entire command and hit the Enter key. You cannot write the command with the Tab key.

    Issue the diagnostic bootup level complete command in order to set the diagnostic level to complete, and save the configuration.

    Reseat the supervisor and insert it firmly.

    Once the supervisor comes online, issue the show diagnostic command in order to monitor the switch and check whether the error message persists.
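
    A hypothetical session for a supervisor in slot 1 (slot and port numbers are placeholders taken from the command in the workaround):

    remote command switch show platform hardware asicreg pinnacle slot 1 port 1 err
    configure terminal
     diagnostic bootup level complete
    end
    copy running-config startup-config
    ! After the supervisor has been reseated and is back online:
    show diagnostic result module 1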

    %C6KERRDETECT-SP-4-SWBUSSTALL: The switching bus is experiencing stall for 3 seconds

    Problem

    The switch reports these error messages:

    %C6KERRDETECT-SP-4-SWBUSSTALL: The switching bus is experiencing stall for 3 seconds

    %C6KERRDETECT-SP-4-SWBUSSTALL_RECOVERED: The switching bus stall is recovered and data traffic switching continues

    Description

    The %C6KERRDETECT-SP-4-SWBUSSTALL message indicates the switching bus is stalled and data traffic is lost.

    The %C6KERRDETECT-SP-4-SWBUSSTALL_RECOVERED message indicates that the switching bus is no longer stalled, and data traffic can continue.

    Basically, if any module on the system bus hangs, the supervisor detects a timeout and tries to recover on its own. If a module was being installed at the time, that is a very likely cause of these messages, because the insertion can stall the bus while the module seats into the backplane.

    SP-RP Ping Test[7]: Test skipped due to high traffic/CPU utilization

    Problem

    This error message is received when inband ping tests fail due to high CPU utilization:

    Description

    The SP-RP inband ping is an online diagnostic test, and the message that the SP-RP ping test failed is purely informational. It indicates high CPU utilization and can be the result of heavy traffic directed to the route processor or of switching traffic that flows to the switch processor. This can also happen during route updates. It is normal for route processor CPU utilization to reach 100 percent at times.

    Workaround

    The error message is purely informational and does not have any impact on the device performance.

    SW_VLAN-4-MAX_SUB_INT

    Problem

    The switch reports this error message:

    This example shows the console output that is displayed when this problem occurs:

    Description

    The number of Layer 3 sub-interfaces is limited by the internal VLANs in the switch. The Catalyst 6500 series has 4094 VLANs that are used for various purposes. Issue the show platform hardware capacity vlan command in order to check the current VLAN availability.

    Workaround

    The recommended limit is 1000 sub-interfaces per interface and 2000 per module. Reduce the number of sub-interfaces allocated to the interface that has exceeded the recommended limit.
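
    To see how much internal VLAN space remains before you trim sub-interfaces, commands such as the following can be used; the exact output fields differ between releases.

    show platform hardware capacity vlan
    show vlan internal usage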

    Note: The console can get locked up due to the flood of these messages that are displayed at switch reload. This issue is documented in Cisco bug ID CSCek73741 (registered customers only) and the issue is resolved in Cisco IOS Software Releases 12.2(18)SXF10 and Cisco IOS Software Releases 12.2(33)SXH or later.

    MCAST-6-L2_HASH_BUCKET_COLLISION

    Problem

    The switch reports this error message:

    This example shows the console output that is displayed when this problem occurs:

    This error message is normally seen along with this message:

    Description

    This message indicates that a Layer 2 entry was not installed in the hardware because there is not enough space in the hash bucket. Multicast packets are flooded on the incoming VLAN because the Layer 2 entry installation failed. When the limit is exceeded, flooding occurs for additional group MAC addresses.

    Workaround

    If you do not use multicast, you can disable IGMP snooping. Otherwise, you can increase the hash entry limit with the ip igmp snooping l2-entry-limit command.
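
    A sketch of raising the limit is shown below; the value 32000 is only a placeholder, and the supported maximum depends on the platform and software release.

    configure terminal
     ip igmp snooping l2-entry-limit 32000
    end
    show ip igmp snooping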

    %QM-4-AGG_POL_EXCEEDED: QoS Hardware Resources Exceeded : Out of Aggregate policers

    Problem

    The switch reports this error message:

    Description

    Only a limited number of aggregate policers can be supported. On EARL7-based switches, this limit is 1023.

    Workaround

    Instead of port-based QoS, you can configure VLAN-based QoS. Complete these steps (a configuration sketch follows the steps):

    Apply the service-policy to each VLAN configured on the Layer 2 switchport.

    Remove the service-policy from each port that belongs to the specific VLAN.

    Configure each Layer 2 switchport for VLAN based QoS with the mls qos vlan-based command.
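
    A configuration sketch for one VLAN, assuming VLAN 10, a policy map named POLICE-FLOWS, and switchport GigabitEthernet3/1; all of these names are placeholders.

    configure terminal
     interface GigabitEthernet3/1
      no service-policy input POLICE-FLOWS
      mls qos vlan-based
     exit
     interface Vlan10
      service-policy input POLICE-FLOWS
    end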

    %EC-SP-5-CANNOT_BUNDLE2: Gi2/2 is not compatible with Gi2/1 and will be suspended (MTU of Gi2/2 is 1500, Gi2/1 is 9216)

    Problem

    The switch reports this error message:

    %EC-SP-5-CANNOT_BUNDLE2: Gi2/2 is not compatible with Gi2/1 and will be suspended (MTU of Gi2/2 is 1500, Gi2/1 is 9216)

    Description

    This error message indicates that the MTU of a port-channel member does not match that of the other members, which causes the attempt to add the port to the port channel to fail. By default, all interfaces use an MTU of 1500 bytes. Because of the MTU mismatch, the port cannot be added to the port channel.

    Workaround

    Configure the same MTU on all of the member ports.
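
    For example, to bring both members up to the jumbo MTU named in the message (9216 is taken from the message text; use whatever value your design requires):

    configure terminal
     interface range GigabitEthernet2/1 - 2
      mtu 9216
    end
    show interfaces GigabitEthernet2/2 | include MTU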

    %EC-SP-5-CANNOT_BUNDLE2: Gi1/4 is not compatible with Gi6/1 and will be suspended (flow control send of Gi1/4 is off, Gi6/1 is on)

    Problem

    The switch reports this error message:

    %EC-SP-5-CANNOT_BUNDLE2: Gi1/4 is not compatible with Gi6/1 and will be suspended (flow control send of Gi1/4 is off, Gi6/1 is on)

    Description

    This error message indicates a speed or flow-control mismatch, which causes the attempt to add the port to the port channel to fail.

    Workaround

    Verify the configuration of the interfaces that participate in the port channel, and make sure that their speed and flow-control settings match.
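
    A sketch that compares and then aligns the flow-control settings of the two members named in the message; setting send off on Gi6/1 is only one way to make them match.

    show interfaces GigabitEthernet1/4 flowcontrol
    show interfaces GigabitEthernet6/1 flowcontrol
    configure terminal
     interface GigabitEthernet6/1
      flowcontrol send off
    end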

    %CFIB-7-CFIB_EXCEPTION: FIB TCAM exception, Some entries will be software switched

    Problem

    The switch reports this error message:

    Description

    The error message indicates that the number of installed route entries is about to reach the hardware FIB capacity or the maximum-routes limit set for the specified protocol. If the limit is reached, some prefixes are dropped.

    Workaround

    Reload the router in order to exit the exception mode. Enter the mls cef maximum-routes command in global configuration mode in order to increase the maximum number of routes for the protocol. By default, a PFC3 has a capacity of 192K entries, but the mls cef maximum-routes 239 command gives the option to utilize the maximum available TCAM entries. Use the show mls cef maximum-routes command in order to check the configured maximums, and use the show mls cef summary command, which shows a summary of the CEF table information, in order to check the current usage.
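
    A sketch of the check-and-adjust sequence; the exact keyword form of the maximum-routes command (for example, whether a protocol keyword such as ip is required) varies by software release, and the new limit typically takes effect only after a reload.

    show mls cef summary
    show mls cef maximum-routes
    configure terminal
     mls cef maximum-routes ip 239
    end
    copy running-config startup-config
    reload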

    Module Fails the TestMatchCapture Test

    Problem

    Module 5 (the supervisor) fails the TestMatchCapture diagnostic test, as indicated in this output from show diagnostic result module module_# :

    Description

    The TestMatchCapture test is a combination of the TestProtocolMatchChannel and the TestCapture tests as described here:

    TestProtocolMatchChannel — The TestProtocolMatchChannel test verifies the ability to match specific Layer 2 protocols in the Layer 2 forwarding engine. When you run the test on the supervisor engine, the diagnostic packet is sent from the inband port of the supervisor engine and performs a packet lookup with the Layer 2 forwarding engine. For DFC-enabled modules, the diagnostic packet is sent from the inband port of the supervisor engine through the switch fabric and is looped back from one of the DFC ports. The Match feature is verified during the diagnostic packet lookup by the Layer 2 forwarding engine.

    TestCapture — The TestCapture test verifies that the capture feature of Layer 2 forwarding engine is working properly. The capture functionality is used for multicast replication. When you run the test on the supervisor engine, the diagnostic packet is sent from the inband port of the supervisor engine and performs a packet lookup with the Layer 2 forwarding engine. For DFC-enabled modules, the diagnostic packet is sent from the inband port of the supervisor engine through the switch fabric and is looped back from one of the DFC ports. The Capture feature is verified during the diagnostic packet lookup by the Layer 2 forwarding engine.

    Workaround

    Reseat the module whenever you get an opportunity. Since these are minor errors, they can be ignored if you do not see any impact on performance.

    %CONST_DIAG-SP-3-HM_PORT_ERR: Port 5 on module 2 failed 10 consecutive times. Disabling the port

    Problem

    The switch reports this error message:

    This example shows the console output that is displayed when this problem occurs:

    Description

    The error message indicates that the data path that corresponds to the port has failed. The port is put into the errdisable state.

    Workaround

    Reset the line card in order to see if the problem resolves itself.
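
    A sketch of resetting the module and recovering the errdisabled port from the example message (module 2, port 5); the interface type shown is an assumption.

    hw-module module 2 reset
    ! After the module is back online, recover the errdisabled port:
    configure terminal
     interface GigabitEthernet2/5
      shutdown
      no shutdown
    end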

    %CONST_DIAG-SP-4-ERROR_COUNTER_WARNING: Module 7 Error counter exceeds threshold, system operation continue

    Problem

    The switch reports this error message:

    Description

    Check the diagnostic results:

    The TestErrorCounterMonitor monitors the errors/interrupts on each module in the system by periodically polling for the error counters maintained in the line card.

    This error message appears when an ASIC on the line card receives packets with a bad CRC. The issue can be local to this module or can be triggered by some other faulty module in the chassis. It can also be caused by frames with a bad CRC that the Pinnacle ASIC receives from the DBUS. In other words, the error messages imply that bad packets are received across the bus on module 7.

    One of the reasons for the error messages is the inability of the module to communicate properly with the backplane of the chassis because the module is mis-seated. The problem lies with the line card (a mis-seated module), the supervisor, or the data bus. However, it is not possible to say which component corrupts the data and causes the bad CRC.

    Workaround

    First, reseat module 7 and make sure the screws are tightened well. Before the reseat, set the diagnostics level to complete with the diagnostic bootup level complete command.

    Once the reseat is done, full diagnostics run on the module, and you can then confirm whether there are any hardware issues on module 7.
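
    A hypothetical verification sequence for module 7 around the reseat:

    configure terminal
     diagnostic bootup level complete
    end
    copy running-config startup-config
    ! Reseat module 7, then once it is back online:
    show diagnostic result module 7 detail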

    %SYS-3-PORT_RX_BADCODE: Port 3/43 detected 7602 bad code error(s) in last 30 minutes

    Problem

    The switch reports this error message:

    This example shows the console output that is displayed when this problem occurs:

    Description

    This error message indicates that a port has been affected by an unknown protocol error; for example, a Catalyst 6500 series switch receives frames with a protocol that it does not recognize. The first [dec] is the module number, [chars] is the port number, and the second [dec] is the number of inbound packets with unknown protocols encountered in the last 30 minutes.

    These are the possible causes of the error message:

    Mismatched speed and duplex settings.

    CDP enabled on one end of the link but not on the other.

    DTP, which is enabled by default on switch interfaces; because routers do not understand DTP, this can cause issues.

    Workaround

    Check the runts counter on the interface. If it increments, there could be a duplex mismatch on the interfaces.
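
    To watch the relevant counters on the port from the example message (3/43), commands such as the following can be used; the interface type is an assumption.

    show interfaces GigabitEthernet3/43 | include runts
    show interfaces GigabitEthernet3/43 counters errors
    show interfaces GigabitEthernet3/43 status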
