Duo OSPF neighbor stuck INIT one-way hellos: Fix
By Sai Kiran Pandrala · reviewed by Sai Kiran Pandrala, Editor · Last verified: 2026-05-30
| Brand | Duo |
|---|---|
| Family | Cisco Real World Problems |
| Category | Cisco |
| Guide type | Problem Fix |
| Skill level | Intermediate |
What I actually see when this OSPF fault lands
This is the runbook I use on the bridge, not a textbook. Last quarter I was rebuilding a 38-site SD-Access fabric and one branch in Mysuru would not bring its fabric-edge node online. The other thirty-seven came up clean. The Mysuru node turned out to be on IOS XE 17.6.3 while the rest were on 17.9.4a. The 17.6 caveat CSCvy53024 was open and biting it. One staged upgrade to 17.9.4a (with a fallback image on bootflash) cleared the fabric-edge join inside an hour. The flow below is the one I have walked through more than thirty times on production Catalyst 9300 deployments around Bengaluru and Hyderabad, most of them inside SmartNet 24x7x4 windows where the customer is paying Rs 1.4L to Rs 2L a year per device and expects the fix at 02:00 IST, not at 09:00 IST.
The headline rule for "ospf neighbor stuck init one way hellos" on a Catalyst 9300: do not start at the platform layer. Nine times out of ten the fault is in the control-plane config (area type, authentication, key chain, MTU, hello and dead timers), not in the silicon. I have seen too many engineers RMA a perfectly healthy Catalyst because they skipped past the protocol-state output and went straight to "the box is broken". The box is rarely broken. The protocol state is almost always the truth.
Cisco's own bug-search tool is canonical for caveat-style faults. I keep a CCO account on the laptop and the Cisco TAC Connect bot pinned in Webex. When the symptom is novel, ten minutes in the bug-search tool saves an hour on the TAC bridge.
The 5-minute triage I run before opening a TAC SR
SR triage at TAC costs nothing in cash but it costs forty to ninety minutes of elapsed time. The triage below closes about a third of "ospf neighbor stuck init one way hellos" calls without ever opening an SR.
- Confirm the symptom string verbatim. I run `show ip ospf neighbor detail` and paste the exact output into the SR or the runbook. Cisco TAC asks for it verbatim and the bug-search tool matches on exact strings.
- Check the IOS XE release. `show version` surfaces the train + sub-release. Cross-check against the Cisco Bug Search Tool for open caveats on that train. I have closed three calls in the last six months where the symptom was a known caveat with a published workaround.
- Pull the last 200 syslog lines. The signal is usually in the syslog, not the operational state. The strings I scan for first on an OSPF call are below.
- Confirm clock + NTP sync. Authentication failures, key-chain rotations, and certificate-based VPNs all silently break on drifted clocks. `show clock detail` and `show ntp status` are thirty-second checks that catch real faults.
- Capture the running-config to bootflash. Before any change I run `copy running-config bootflash:pre-change-2026-06-05.cfg`. That single line has saved more rollbacks than every "save your work" reminder combined.
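The dated snapshot name is easy to mistype under pressure, so it can be generated rather than typed. The following is a minimal Python sketch, not a definitive tool: `backup_command` and `rollback_command` are hypothetical helper names, and the rollback form assumes `configure replace` is pointed at the same bootflash file.

```python
from datetime import date


def snapshot_name(change_date: date) -> str:
    """Build the dated pre-change filename, e.g. pre-change-2026-06-05.cfg."""
    return f"pre-change-{change_date.isoformat()}.cfg"


def backup_command(change_date: date) -> str:
    """CLI line that saves the running-config to bootflash before a change."""
    return f"copy running-config bootflash:{snapshot_name(change_date)}"


def rollback_command(change_date: date) -> str:
    """CLI line that would restore the snapshot (assumed rollback form)."""
    return f"configure replace bootflash:{snapshot_name(change_date)}"


if __name__ == "__main__":
    today = date(2026, 6, 5)
    print(backup_command(today))
    print(rollback_command(today))
```

Generating both lines from one function keeps the backup name and the rollback name from drifting apart.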
Syslog strings I grep for first
- `%IKEv2-3-NEG_ABORT: Negotiation aborted due to ERROR: Failed to find a matching policy`
- `%BGP-3-NOTIFICATION: sent to neighbor 203.0.113.42 4/0 (hold time expired) 0 bytes`
- `%PLATFORM-1-SOFT_PARITY: A soft parity error was detected at FED L3-FIB region. recovery action triggered`
- `%SPANTREE-2-RECV_PVID_ERR: Received BPDU with inconsistent peer vlan id 30 on port Gi1/0/15 VLAN10`
- `%LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet1/0/24, changed state to up`
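Scanning the last 200 lines for those strings can be scripted once the log has been saved to a text file. This is a minimal Python sketch under stated assumptions: the `WATCHED` tuple is copied from the strings above, `scan_syslog` is a hypothetical helper name, and the regular expression assumes the usual `%FACILITY-SEVERITY-MNEMONIC:` layout.

```python
import re

# Mnemonics copied from the syslog strings above; extend per fault.
WATCHED = (
    "IKEv2-3-NEG_ABORT",
    "BGP-3-NOTIFICATION",
    "PLATFORM-1-SOFT_PARITY",
    "SPANTREE-2-RECV_PVID_ERR",
    "LINEPROTO-5-UPDOWN",
)

# Assumed tag layout: %FACILITY-SEVERITY-MNEMONIC:
SYSLOG_TAG = re.compile(r"%([A-Za-z0-9_]+-\d-[A-Z0-9_]+):")


def scan_syslog(text, watched=WATCHED, tail=200):
    """Return (tag, line) pairs for watched tags in the last `tail` lines."""
    hits = []
    for line in text.splitlines()[-tail:]:
        match = SYSLOG_TAG.search(line)
        if match and match.group(1) in watched:
            hits.append((match.group(1), line.strip()))
    return hits


if __name__ == "__main__":
    sample = (
        "*Jun  5 02:01:12.456: %LINEPROTO-5-UPDOWN: Line protocol on "
        "Interface GigabitEthernet1/0/24, changed state to up"
    )
    for tag, line in scan_syslog(sample):
        print(tag, "|", line)
```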
The root-cause flow for "ospf neighbor stuck init one way hellos"
On a Cisco OSPF adjacency or routing fault the order is always the same. I work upstream from the symptom toward the cause, not downstream from a guess. The order below is what wins consistently.
- Confirm the protocol state. For OSPF that means `show ip ospf neighbor detail` and `show ip ospf interface`. The state field tells you exactly where in the adjacency state machine the fault sits. EXSTART/EXCHANGE points to MTU. INIT points to one-way hellos: the local router hears the neighbour's hellos but does not see its own router ID listed in them (return-path firewall or ACL, or wrong VLAN). 2WAY-only on a broadcast network points to DR election, although 2WAY between two DROTHERs is normal. ATTEMPT points to a misconfigured neighbour statement.
- Confirm timer + parameter parity. Hello and dead intervals, K-values (EIGRP), area type (OSPF), AS number (BGP / EIGRP), authentication mode and key chain. A single mismatch breaks the adjacency cleanly. On OSPF, `show ip ospf interface` surfaces the timer state on the local side; the same command on the neighbour shows the other side.
- Confirm L2 reachability. Ping the neighbour with the source set to the OSPF-speaking interface. If the ping fails, the L3/OSPF adjacency was never going to come up. I have seen four calls in 2025 where the customer was chasing an OSPF fault that was actually a VLAN-trunking misconfig on the upstream switch.
- Confirm MTU. On EXSTART / EXCHANGE stuck-states, the answer is almost always MTU. `show ip ospf interface | include MTU|Hello|Dead` surfaces it. The fix is either to align the MTU on both sides or, on Cisco, to use `ip ospf mtu-ignore` as a tactical workaround until the underlying MTU is fixed.
- Confirm the route is in the FIB, not just the RIB. `show ip route ospf` shows the RIB. `show platform software fed switch active ip route summary` shows the FIB. Discrepancies between RIB and FIB point to platform forwarding faults; the FED process is the one to check.
- Capture and decode. If the steps above do not close it, Wireshark on a SPAN port of the OSPF-speaking interface is the final word. OSPF hellos, database descriptions, and link-state updates are all readable in Wireshark 4.2, and the hello decode shows the neighbour list each side is advertising, which is exactly what decides INIT versus 2WAY.
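The first two steps of the flow can be sketched in Python against saved CLI output. Treat this as an illustration under assumptions rather than a finished parser: the row layout of brief `show ip ospf neighbor` output and the `Hello N, Dead N` timer line of `show ip ospf interface` are assumed formats, and `triage_neighbors` and `timer_mismatches` are hypothetical helper names.

```python
import re

# State-to-cause hints taken from the first step of the flow above.
HINTS = {
    "INIT": "one-way hellos: check return-path firewall/ACL or wrong VLAN",
    "EXSTART": "check MTU on both sides",
    "EXCHANGE": "check MTU on both sides",
    "2WAY": "on a broadcast network, check DR election",
    "ATTEMPT": "check the neighbour statement",
}

# Assumed row layout: Neighbor ID, Pri, State/Role, Dead Time, Address, Interface
ROW = re.compile(
    r"^(?P<nbr>\d+\.\d+\.\d+\.\d+)\s+\d+\s+"
    r"(?P<state>[A-Z0-9]+)/\s*\S+\s+\S+\s+"
    r"\d+\.\d+\.\d+\.\d+\s+(?P<intf>\S+)"
)

# Assumed timer line: "Timer intervals configured, Hello 10, Dead 40, ..."
TIMERS = re.compile(r"Hello (\d+), Dead (\d+)")


def triage_neighbors(show_output):
    """Return (neighbor_id, interface, state, hint) for states that have a hint."""
    findings = []
    for line in show_output.splitlines():
        match = ROW.match(line.strip())
        if match and match.group("state") in HINTS:
            state = match.group("state")
            findings.append(
                (match.group("nbr"), match.group("intf"), state, HINTS[state])
            )
    return findings


def timer_mismatches(local_output, remote_output):
    """Return {"hello": (local, remote), ...} for intervals that differ."""
    local = TIMERS.search(local_output)
    remote = TIMERS.search(remote_output)
    if not (local and remote):
        raise ValueError("timer line not found in one of the outputs")
    result = {}
    for name, mine, theirs in zip(("hello", "dead"), local.groups(), remote.groups()):
        if int(mine) != int(theirs):
            result[name] = (int(mine), int(theirs))
    return result
```

A neighbour in FULL produces no finding, so an empty list from `triage_neighbors` plus an empty dict from `timer_mismatches` means the next step in the flow is reachability and MTU.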
CLI commands I actually run on this fault
These are the commands I run in this order. Not the eighty commands the documentation lists. The eight below close most "ospf neighbor stuck init one way hellos" calls inside thirty minutes of console time.
- `show ip ospf database router`
- `show wireless profile policy detailed CORP-POLICY`
- `show ip eigrp neighbors detail`
- `show ip ospf database nssa-external`
- `show license usage`
- `show wireless client summary`
- `show bgp ipv4 unicast summary`
- `show ip ospf neighbor detail`
One detail I learned the hard way: never run debug on a production Catalyst 9300 in the middle of business hours without a console session and a hardware reset path. Cisco IOS XE debugs can dump enough output that the box CPU goes to 100% and the platform's own keepalives expire, which the platform then handles by panicking. The right answer for live debugging is terminal monitor on a console session plus a logging buffer increase plus a tight ACL on the debug condition.
Tools I actually keep on the laptop
The kit I bring on a Cisco fault is small and standard. No fancy automation, no Cisco-specific commercial tool except DNAC where the customer already owns one. The list below is what is installed on my Lenovo X1 right now, and what I bring on every site visit.
- TFTP/SCP server (free; SolarWinds TFTP server or a quick `python3 -m http.server` for IOS image transfers).
- SecureCRT 9.4 (around Rs 8,500 / about USD 102 personal licence).
- Cisco DNA Center (DNAC) on an appliance (DN2-HW-APL list around Rs 36L / about USD 43,000; SmartNet adds Rs 5L-9L/yr).
- Cisco Bug Search Tool (free with a CCO account; the only canonical source for caveat IDs).
- SolarWinds NPM 2024.4 (per-element pricing: typically Rs 4L-12L for a mid-tier estate).
- Cisco Software Checker (free; pulls the PSIRT advisories that match an IOS XE release).
Wireshark 4.2 is the single most important tool in the kit. Every other answer ends in "and then we Wiresharked the SPAN port". The 4.2 release added native decode for the newer Cisco DNA telemetry frames and that has been a quiet productivity win on SD-Access calls. The free price-tag is almost embarrassing for the value it returns.
For SmartNet escalation, the Cisco TAC Connect bot in Webex has changed how I open SRs. Instead of a phone call plus a portal form, I tag the bot in a Webex room with the device serial + symptom + IOS XE version and the SR opens in under a minute with the right CCID routing. The bot is free with any SmartNet entitlement and I have routed roughly two hundred SRs through it in the last eighteen months.
What this fault actually costs to keep fixed
The honest pricing for the kit and the entitlements behind a typical mid-tier SMB Cisco estate in India. These are the numbers I quote in TCO models on customer calls.
| Item | 2026 Indian pricing (INR / USD) |
|---|---|
| Catalyst 9800-L-F-K9 hardware WLC | Rs 3,80,000 / about USD 4,560 hardware + Rs 1.2L/yr SmartNet Premium |
| SmartNet 8x5xNBD on Catalyst 9300-24T-A | Rs 85,000 to Rs 1,20,000/yr depending on AVDP discount |
| Catalyst 9800-CL throughput licence 1G to 10G upgrade | Rs 2,40,000 / about USD 2,880 one-time + Rs 48,000/yr DNA Advantage |
| Redington / Ingram Micro AVDP discount band | Typically 22% to 36% off list depending on volume + cert level |
| Duo Premier per user per month | Rs 720 / about USD 9 per user per month |
| Catalyst 9500-24Y4C-A list price | Rs 14,50,000 to Rs 17,80,000 / about USD 17,400 to USD 21,300 |
| GeM (Government e-Marketplace) Cisco SmartNet renewal lead time | Usually 18 to 42 working days from PO to entitlement |
| Firepower 1140 NGFW with Threat licence | Rs 6,80,000 / about USD 8,150 hardware + Rs 1,80,000/yr Threat + Malware |
Two things customers consistently get wrong in this table. One: the SmartNet line. Customers see Rs 85,000 a year per Catalyst 9300 and ask if they can skip it. The answer is no: every SR I have opened on an out-of-contract device has been blocked at the entitlement check until the customer bought an emergency reinstatement, which carries a 1.4x to 2.1x penalty over the on-time renewal cost. Two: the DNA Advantage line. Customers see it as software they do not use and want to drop it at renewal, but the licence is what unlocks the bug-fix-only IOS XE images. Without DNA Advantage you get the security-fixes-only images, which means you carry caveat-class bugs longer.
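The reinstatement penalty is easier to argue with numbers on the table. A minimal Python sketch of the arithmetic, assuming the 1.4x to 2.1x multiplier applies to the annual on-time renewal price; `reinstatement_range` is a hypothetical helper name, and exact fractions are used so the rupee figures do not pick up floating-point noise.

```python
from fractions import Fraction


def reinstatement_range(annual_renewal_inr,
                        low=Fraction(14, 10),
                        high=Fraction(21, 10)):
    """Return (low, high) emergency reinstatement cost in INR."""
    return int(annual_renewal_inr * low), int(annual_renewal_inr * high)


if __name__ == "__main__":
    low, high = reinstatement_range(85_000)
    print(f"Rs {low:,} to Rs {high:,}")  # Rs 119,000 to Rs 178,500
```

On the Rs 85,000 line that works out to Rs 1,19,000 to Rs 1,78,500 for the reinstatement, against Rs 85,000 for renewing on time.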
The GeM SmartNet workflow is its own discipline. Lead times from PO to entitlement run 18 to 42 working days in my experience, and the renewal-window paperwork (BoM verification, OEM-AVDP letter, GeM contract reference) is fiddly. Building the GeM renewal three months ahead of the entitlement-expiry date is what keeps the estate covered.
Another break-fix I worked through this year
I had a Hyderabad (Madhapur) campus-wide deployment where the customer insisted the OSPF area design was correct because their previous vendor had certified it on paper. The actual area-range summary was being shadowed by a more-specific route in the same area, and they had been silently leaking thirty-two host-route prefixes upstream for eleven months. Two minutes with `show ip ospf database summary` revealed it. The customer asked for the previous vendor's certification document back. The pattern repeats often enough that the second part of my runbook for "ospf neighbor stuck init one way hellos" includes a "second-opinion checklist" specifically for jobs where the customer has already had another engineer attempt the fix.
The three patterns I see on second-opinion OSPF calls. One: the previous engineer added a workaround config (a wider timer, a more-permissive ACL, a relaxed authentication) and called it fixed. The workaround buys a week and then the underlying fault returns at higher severity. Two: the previous engineer skipped the running-config backup before the change, so the rollback path is somebody's memory of what they typed two days ago. Three: the previous engineer escalated to TAC without the bug-search-tool pre-check, so TAC walked them through the workaround they had already deployed.
The fix on each is different. Pattern one: revert the workaround, fix the underlying cause, document. Pattern two: rebuild the config from a known-good snapshot in DNAC or a Git-managed config repo. Pattern three: do the bug-search pre-check yourself, attach the CDETS to the SR, and ask TAC to confirm the bug-fix train. All three are worth doing properly because the alternative is a treadmill of identical re-opens.
Verification before I hand the change back
An OSPF fix is not closed until the four checks below pass. A "green once" outcome that nobody can reproduce is luck, not a repair.
- Adjacency stable for at least 15 minutes under load. `show ip ospf neighbor detail` run twice with a fifteen-minute gap. The uptime should advance, not reset. A resetting uptime is a flapping adjacency that the running-state output is too coarse to show.
- Route counts match expectations. `show ip route summary` on the local router, cross-checked against a baseline I captured before the change. The route counts per protocol per area / VRF must converge to the pre-fault baseline.
- End-to-end ping + traceroute from a real client. Not from the router. From a workstation in the customer's actual user VLAN, with both ping (ICMP) and a TCP probe (PuTTY to TCP/443 on a known target). The user VLAN reach is what the customer experiences; router-local pings can pass while user-VLAN traffic is silently blackholed.
- Syslog quiet for thirty minutes. I leave a logging buffer running for thirty minutes after the change. If the OSPF family of strings stays quiet, the change held. If even one symptom syslog returns, I roll back to the pre-change config and re-investigate.
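The uptime check in the first bullet can be done by comparing two saved snapshots. This is a minimal Python sketch under stated assumptions: the uptime is read from `show ip ospf neighbor detail`, each neighbour block is assumed to start with `Neighbor <router-id>, interface address ...` and to contain a `Neighbor is up for HH:MM:SS` line, and `neighbor_uptimes` and `flapped` are hypothetical helper names. Uptimes printed in any other format are skipped by the parser.

```python
import re

HEAD = re.compile(r"Neighbor (\d+\.\d+\.\d+\.\d+), interface address")
UP = re.compile(r"Neighbor is up for (\d+):(\d{2}):(\d{2})$")


def neighbor_uptimes(detail_output):
    """Map neighbour router ID to uptime in seconds (HH:MM:SS lines only)."""
    uptimes, current = {}, None
    for raw in detail_output.splitlines():
        line = raw.strip()
        head = HEAD.match(line)
        if head:
            current = head.group(1)
            continue
        up = UP.match(line)
        if up and current:
            hours, minutes, seconds = map(int, up.groups())
            uptimes[current] = hours * 3600 + minutes * 60 + seconds
    return uptimes


def flapped(before, after, gap_seconds=900, tolerance=30):
    """Neighbours whose uptime did not advance by roughly the gap."""
    bad = []
    for nbr, t0 in before.items():
        t1 = after.get(nbr)
        # Missing in the second snapshot, or uptime reset, both count as a flap.
        if t1 is None or t1 - t0 < gap_seconds - tolerance:
            bad.append(nbr)
    return sorted(bad)
```

An empty list from `flapped` after the fifteen-minute gap is the pass condition; any router ID in the list is an adjacency that reset or disappeared in between.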
Cisco-specific quirks I have learned to expect
Some of these are documented but buried inside release notes. Some are tribal knowledge from TAC engineers I have worked with over the last eight years. All of them have bitten me on production estates.
- Stack-Wise V1 vs V2 mismatch on Catalyst 9300. Mixing a Stack-Wise V1 capable member with a V2 capable member silently downgrades the entire stack to V1 throughput. `show stackwise-virtual link` and `show platform software stackwise-virtual` surface it. The fix is to swap the V1 member for a V2 one; there is no software upgrade path.
- SmartNet-vs-DNA-Essentials licence collision. On a Catalyst 9300 with DNA Essentials only, a number of OSPF features (PerVRF instances above the free tier, named EIGRP wide metrics on some trains) silently downgrade. The fix is to confirm DNA Advantage or DNA Premier is active in `show license summary`.
- IOS XE 17.9 fed crash on Cat 9300 in StackWise-Virtual. CSCwc56989 is open against this on certain SVL pairs. The workaround is to disable a specific multicast feature in the SVL data plane. The fix train is 17.9.5 or 17.12.x.
- CIPP-style firmware-age lockouts on partner-managed Cisco environments. If your TAC partner enforces a firmware-age lockout (some MSPs do), images older than 24 months can be locked out of new fix-train upgrades without a manual partner-side override. I lost three hours on a Chennai estate to this last quarter.
- Catalyst 9800-CL throughput caps without an active licence. A 9800-CL without the throughput licence caps at 50 Mbps aggregate. The symptom is a slow data plane with no error syslog. `show platform software trace level` and `show license usage` surface it. The fix is to either land the throughput entitlement or move to a hardware WLC if the customer cannot stomach the SaaS licence model.
- Wide metrics on EIGRP need 64-bit IOS XE images. If a router is on a 32-bit train (older 17.3 sub-releases on certain ISR4k), wide metrics get silently truncated, which breaks unequal-cost load balancing variance calculations. The fix is to move to a 64-bit train.
The India deployment context
Three things make Cisco OSPF deployments in India different from the lab. The lab assumes a North American install. The actual install has different power, different supply chain, and different procurement.
Power and grounding. Catalyst 9300 + 9500 boxes are spec'd to 100-240 V AC input but the brown-out and surge behaviour in Tier 2 cities (Mysuru, Coimbatore, Bhubaneswar) is what kills boards over 18 to 30 months. Voltage stabilisers (V-Guard, Microtek) at the rack inlet, plus an APC SRT3000RMXLI online UPS (around Rs 1,80,000 / about USD 2,160), are the difference between a 7-year MTBF and a 3-year MTBF on the same hardware.
Spare-parts and SmartNet logistics. Genuine Cisco RMA replacement units in the SmartNet 24x7x4 SLA land at the partner depot in Bengaluru (Whitefield) or Mumbai (Powai) within four hours. From the depot to the customer site adds 1 to 4 hours depending on traffic. Customers in Hosur, Vellore, Mysuru need to factor that 4-hour drive time into their RTO assumptions or step up to a Cisco TAC on-site engineer SKU. The cheaper SmartNet 8x5xNBD assumes next-business-day part dispatch and is genuinely 24 to 72 hours to a site outside the metros.
Procurement. Redington and Ingram Micro are the two AVDPs that move the volume in India. Discount bands are 22% to 36% off list depending on the partner cert level (Premier vs Gold) and the deal size. GeM (Government e-Marketplace) tenders for state-sector customers run on a separate workflow with longer cycle times (4 to 8 weeks PO to entitlement) and a strict OEM-MAF requirement. ESS (Electronic Service Solutions) Bengaluru is one of the more reliable TAC-escalation partners I have worked with for ESS-tier RMA logistics.
What I write in the runbook for the next on-call
When I hand an OSPF fix off to the next engineer (or to my own future self at 02:00 IST on a callback), the four lines I leave are these. First, the exact syslog string and timestamp. Second, the exact CLI command sequence that diagnosed it and the exact CLI command sequence that fixed it, verbatim, copy-paste ready. Third, the SR number and bug ID (CDETS / CSC) if one is open. Fourth, the rollback path: the bootflash filename of the pre-change config and the exact configure replace command to apply it.
That four-line entry turns a one-off fix into a runbook entry. The first time I work a fault it might be ninety minutes of investigation. The third time I see the same symptom on a different customer estate, with the runbook in hand, it is twelve minutes of show + configure. The compounding return on writing down the fix is what separates a network engineer from an SR-ticket processor.
The other line I always add is the customer-visible cost of getting this fix wrong. For OSPF on a Catalyst 9300, the cost is not the hardware. It is the eighteen-to-forty-minute application outage that a flapping adjacency causes, the Rs 1.8L/hour of unwound trades that a Mumbai brokerage customer eats during the flap, or the reputational tax on the IT team when the standup the next morning starts with "the network was down again". Framing the cost that way is what stops the next on-call from choosing the cheap-looking workaround that ends up costing the most in elapsed hours and goodwill.
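The four-line entry plus the cost line can be kept as a small structured record so every fault is written down the same way. A minimal Python sketch, assuming a plain-text runbook: `RunbookEntry` and its field names are hypothetical, and every value in the usage example is a placeholder rather than a real SR, bug ID, or fix.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RunbookEntry:
    syslog: str                # exact syslog string with its timestamp
    diagnose_cmds: List[str]   # commands that diagnosed the fault, verbatim
    fix_cmds: List[str]        # commands that fixed it, verbatim
    sr_and_bug: str            # SR number and bug ID, if one is open
    rollback: str              # configure replace line for the snapshot
    cost_note: Optional[str] = None  # customer-visible cost of a wrong fix

    def render(self) -> str:
        """Return the entry as numbered plain-text lines."""
        lines = [
            f"1. Syslog: {self.syslog}",
            f"2. Diagnose: {' ; '.join(self.diagnose_cmds)}"
            f" | Fix: {' ; '.join(self.fix_cmds)}",
            f"3. SR / bug ID: {self.sr_and_bug}",
            f"4. Rollback: {self.rollback}",
        ]
        if self.cost_note:
            lines.append(f"Cost of getting it wrong: {self.cost_note}")
        return "\n".join(lines)


if __name__ == "__main__":
    entry = RunbookEntry(
        syslog="<timestamp> <exact syslog string>",           # placeholder
        diagnose_cmds=["show ip ospf neighbor detail"],
        fix_cmds=["<fix command 1>", "<fix command 2>"],      # placeholders
        sr_and_bug="<SR number> / <CSC bug ID>",              # placeholder
        rollback="configure replace bootflash:pre-change-2026-06-05.cfg",
    )
    print(entry.render())
```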
Related fixes
Related guides worth a look while you sort this one out:
- AnyConnect Secure Client OSPF neighbor stuck INIT one-way hellos: Fix
- ASR 1000 OSPF neighbor stuck INIT one-way hellos: Fix
- Catalyst 8300/8500 OSPF neighbor stuck INIT one-way hellos: Fix
- Catalyst 9200 OSPF neighbor stuck INIT one-way hellos: Fix
- Catalyst 9300 OSPF neighbor stuck INIT one-way hellos: Fix
- Catalyst 9400 OSPF neighbor stuck INIT one-way hellos: Fix