Performance Issue Outage

Updates

Resolved
May 17, 2024 at 1:05:24 PM
Hello, we are providing an update: we have continued to collaborate with our data center / private cloud provider on this escalation bridge and continue to see service and platform stability.
Once production routes were redirected back to edge routers and active infrastructure, a brief firewall session reset was required due to connection issues observed on identified traffic. Rival5 continues to closely monitor our firewall, network, and virtualization platforms in our data centers, both ourselves and with the help of our data center / private cloud provider.

Please notify our support team should you continue to experience connection or stability issues and we will be more than happy to dive in and investigate as needed.
Thank you for your continued patience and support; we hope to bring this to a final resolution quickly. Some underlying fabric switching work remains to be completed, and it will not be service-impacting. Once an RCA (root cause analysis) and the overall event/incident investigation have been completed by our provider and reviewed both internally and with them, our management and engineering teams will prepare an "RFO" (Reason for Outage) / service incident report that can be made available to customers upon request.
Please let us know if you have any questions and notify us if there are any concerns.
Thank you,
Rival5 Support Center
Update
May 17, 2024 at 9:37:34 AM
Hello, Rival5 has been undergoing emergency storage fabric and logical storage maintenance with the assistance of our data center / private cloud provider. We are working quickly to restore services. Once customers' instances are restored, we will resume monitoring them and verify registration recovery as well as instance health and availability.
Our provider's escalation bridge and management teams have made numerous attempts to reach out to and engage with Broadcom (f/k/a VMware) support. Broadcom's failure to reply has delayed the troubleshooting needed to pinpoint the exact root cause; hardware and signaling errors have been observed on the data center fabric from the primary hypervisor compute hosts. Emergency change maintenance has been scheduled on the network-based storage fabric switches and is not expected to impact operations.

Rival5 engineering and management teams have been working on an escalation bridge with pertinent teams within our data center / private cloud provider since 09:30 AM on 05/16, working through their shift changes.

We will provide updates as they are available here and on already submitted customer-facing tickets. Thank you for your patience.
Monitoring
May 16, 2024 at 2:52:17 PM
Our engineering and management teams are currently engaged in an incident conference with our private cloud and connectivity provider, troubleshooting the issue, isolating certain hosts, and rolling back yesterday's change, returning to our previous production storage platform. Storage availability and read/write latency have been identified as the incident's primary symptoms and are affecting the reliable operation of our services.

An escalation has been opened with the virtualization vendor (Broadcom/VMware) to verify operation and configurations. The networking assets and infrastructure between our virtualization cluster and the storage platform are also being investigated. We are continuing to monitor customers' connections and will communicate updates on this status page, as well as on customer-initiated tickets with our support team.
Investigating
May 16, 2024 at 1:18:24 PM
We are currently investigating this incident. We are seeing degraded performance after status changes last night at our data center. We are currently in the process of switching over to our recovery site to restore service to our customers.
