Changes

kreno · c20c4976
--- a/Operations-of-Compass,-Failover-and-failure-handling.md
+++ b/Operations-of-Compass,-Failover-and-failure-handling.md
+Compass has multiple paths for handling both short-term and long-term failures, at both the hardware and the software level.
+### Hardware Failure:
+Compass relies on the Nutanix (https://www.nutanix.com/) "cloud" system to handle hardware failures. Specifically, we have 3 hardware nodes on-site, with enough reserved capacity to absorb 1 full node failure. This gives us hardware-level redundancy.
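The reserved-capacity rule above comes down to simple arithmetic. The following is a minimal sketch, not Compass's or Nutanix's actual tooling: the function name `tolerates_node_failures` and the capacity and load figures are hypothetical, chosen only to illustrate a 3-node cluster that keeps one node's worth of capacity free.

```python
def tolerates_node_failures(node_capacities, total_load, failures=1):
    """Return True if the load still fits after losing the `failures` largest nodes.

    All values use the same arbitrary unit (for example GiB of RAM).
    """
    if failures >= len(node_capacities):
        return False
    # Worst case: the largest nodes are the ones that fail.
    surviving = sorted(node_capacities)[: len(node_capacities) - failures]
    return total_load <= sum(surviving)


# Hypothetical figures: 3 equal nodes, load kept within 2 nodes' worth.
nodes = [512, 512, 512]
print(tolerates_node_failures(nodes, total_load=1000))              # one node lost
print(tolerates_node_failures(nodes, total_load=1000, failures=2))  # two nodes lost
```

With these assumed numbers the load fits on the two surviving nodes after a single failure, but not on one node after a double failure, which is the case the vendor support contract covers.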
+At the software layer, we use HashiCorp Nomad (https://www.hashicorp.com/products/nomad) to handle software-level failures. If a software component fails, Nomad automatically reschedules and reruns it.
+Lastly, for complete failures, we use a tested near-term and far-term backup and restore system. Compass uses point-in-time recovery (PITR), meaning we continuously back up the system(s) to our near-term backup location (a shared disk). These PITR backups are automatically restored every day as part of our development and testing procedures, which gives us confidence that our backup procedures work and that the backups are directly usable. The near-term backups are in turn backed up regularly by the YETC IT Department using traditional file backup strategies.
+Thanks to this failure handling, we have not had a single user-facing hardware outage in the past 5 years.
+If more than 1 node fails (which we cannot tolerate while staying online), our contract with Nutanix provides 8x5 service to get our systems back up and running quickly.
+### Overall system error handling
+Compass follows best practices for error handling in both software and hardware. For software-level errors, the system automatically captures the source line number and a traceback (stack trace) of the error and creates a ticket/case in our trouble ticketing system. Additionally, if the error happens in a user-facing program (i.e. the GUI), Compass also captures a screenshot of the user's desktop (just the Compass windows, to minimize privacy issues) and displays an error window prompting the user for more information, such as what they were doing at the time of the error. The information gathered from the user's computer, together with the screenshot, is then attached to the source line number and traceback in our trouble ticketing system. This gives us detailed, useful error reports for everything from backend system failures to end-user-facing errors. For the odd case where a user-facing issue is not detected as a software error, the client GUI also lets the user create a case/ticket in our trouble ticketing system directly from the application. This path likewise takes a screenshot and attaches it, along with some other captured environment information, to the user-supplied description of the issue.
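The automatic capture of a source line number and traceback described above can be sketched as follows. This is an illustration under stated assumptions, not Compass's implementation: the function `build_error_report`, the report field names, and the `parse_port` example are all hypothetical, and the step that submits the report to the ticketing system is left out because its API is not described here.

```python
import traceback


def build_error_report(exc):
    """Collect source file, line number, and traceback text for an exception."""
    frames = traceback.extract_tb(exc.__traceback__)
    last = frames[-1] if frames else None  # innermost frame, where the error was raised
    return {
        "summary": f"{type(exc).__name__}: {exc}",
        "file": last.filename if last else None,
        "line": last.lineno if last else None,
        "function": last.name if last else None,
        "traceback": "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
        # Hypothetical slot: a real client would add the screenshot
        # and the user's comments here before filing the ticket.
        "attachments": [],
    }


def parse_port(value):
    return int(value)  # raises ValueError on bad input


try:
    parse_port("not-a-number")
except ValueError as exc:
    report = build_error_report(exc)
    print(report["summary"])
    print(report["function"], report["line"])
```

Keeping the report as a plain dictionary means the same structure could serve both the backend path and the user-facing path, with the latter only appending attachments.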
+Since all issues are tracked in our trouble ticketing system, we track and close every single issue. This gives us clear insight into the problems affecting users.
+Attached are 2 screenshots from our ticketing system showing some examples of the data collected on errors.
+* Case 62556 is an example of a backend process crashing.
+![case62556-auditExample](uploads/44853ff97e7447efd607b3f974f0840b/case62556-auditExample.png)
+* Case 62686 is an example of a user-facing process crashing. The black bars were added after the fact to remove anything potentially sensitive. Note that the case includes a link to our logs system; clicking it takes the developer/troubleshooter to the server-side log entries that relate to this case. The screenshot of the user's desktop is a 1920 × 1017 pixel PNG image.
+![case62686-auditExample](uploads/6e338c654df10144ec1c181e3eabf46f/case62686-auditExample.png)