# IT:AD:Non-Functional Requirements:Monitoring #
* [[./|(UP)]]
{{indexmenu>.#2|nsort tsort}}
## Requirements ##
^ID^Cat(s)^SH^Level^Priority^Requirement^Details^RFP Response/Comments^
|NFR:xxxx|Cxxx|SHxxx|`MUST`|1||Rationale:\\ \\ Information: \\ \\ Resources:\\ \\ ||
* [NFRX:8r6gm] Define the availability of the application, in terms of % available for any 24-hour period.
* [NFRX:8r6gn] Define the maximum amount of time a search will take, 95% of the time.
* [NFRX:8r6go] Define the maximum amount of time a search will take, 99% of the time.
* [NFR:] The application must be able to report its % availability over a 24-hour period, so that it can be compared against the availability NFRs.
* [NFR:] The application must be able to report whether it meets its NFRs, based on the duration of defined Operations.
  * See: [:8r6gn] and [:8r6go].
* [NFR:8rjtq:MUST] The application must be monitored from an external source to ensure the application is publicly available.
  * Rationale: meet Mean Time Between Failure (MTBF) requirements.
  * Consider using a service similar to http://www.siteuptime.com/
* [NFR:8qy5i:MUST:Availability,Security] Certificates used to secure the application must be kept up to date.
  * Rationale: the security of the application relies on secure communication.
* [NFR:8qy5j:MUST:Availability,Security] An automatic process to ensure certificates are up to date (see [NFR:8qy5i]) should be instituted.
  * Rationale: automated processes ensure humans don't drop the ball.
  * If there isn't an automated protocol already in place, consider using a service such as https://snitch.io/
* [NFR:8rjqf:SHOULD:Responsiveness] The application's operations should be tracked with Performance Counters.
  * Rationale: tracking normal performance allows for alerts on abnormal behavior.
  * The Service Facade is generally the most appropriate location to track operations, rather than the Application Layer, as the Service Facade methods are not re-entrant -- whereas the Application Layer methods may call themselves several times.
  * Useful counters for each operation are:
    * Simple counters:
      * XXXXOperationsInvoked (a NumberOfItems32 PerformanceCounter, incremented by one each time the operation is invoked).
      * XXXXOperationsInvokedWithoutException (a NumberOfItems32 PerformanceCounter, incremented by one each time the operation is invoked without an Exception being raised).
      * XXXXOperationsInvokedWithException (a NumberOfItems32 PerformanceCounter, incremented by one each time the operation is invoked and an Exception is raised).
    * Counters per second:
      * XXXXOperationsInvokedPerSecond (a RateOfCountsPerSecond32 PerformanceCounter, incremented by one each time the operation is invoked).
      * XXXXOperationsInvokedWithoutExceptionPerSecond (a RateOfCountsPerSecond32 PerformanceCounter, incremented by one each time the operation is invoked without an Exception being raised).
      * XXXXOperationsInvokedWithExceptionPerSecond (a RateOfCountsPerSecond32 PerformanceCounter, incremented by one each time the operation is invoked and an Exception is raised).
        * Note: this is usually the counter to use to alert infrastructure personnel if the value passes a known threshold.
    * Durations:
      * XXXXOperationDuration (a NumberOfItems64 PerformanceCounter holding the duration, in ticks, of the operation).
      * XXXXOperationAverageDuration (an AverageTimer32 PerformanceCounter of the average duration, in ticks, of the operation).
        * This is backed by an XXXXOperationAverageDurationBASE of type AverageBase that is incremented by one each time the operation is invoked.
* [NFR:8rjqg:SHOULD:Responsiveness] Invocations of external services should be tracked with Performance Counters.
  * Rationale: tracking the duration of external service requests helps find bottlenecks and removes ambiguity as to what may be causing slower performance.
  * Some services to consider tracking are: SSO, Db, Search Service (e.g. [[IT/AD/Lucene/]]), SMTP, etc.
  * Use a structure similar to the one described in [NFR:8rjqf].
  * The most important PerformanceCounters in this case will be those tracking Duration.
* [NFR:8rjqh:SHOULD:Responsiveness] The application should be able to render performance metrics on a View of the UI.
  * Rationale: looking at a UI of live graphs in the application itself is an easier means of keeping an eye on the proper functioning of an application -- especially at the beginning, when operation limits have not yet been determined.
  * Consider creating a PerformanceCounterReportService to return JSON.
* [NFR:8rjqi:SHOULD:Responsiveness] The application should render Database metrics on a UI View.
  * Rationale: looking at a UI of live performance reports provides an easy means of spotting potential bottlenecks early.
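
The per-operation counters described above can be illustrated with a minimal sketch. It is written in Python for brevity and keeps the counters in memory; it does not use the .NET PerformanceCounter API, and the rate-per-second counters are left out. The names `tracked`, `performance_report`, and `search` are hypothetical, and the JSON report only hints at what a PerformanceCounterReportService might return.

```python
import functools
import json
import time
from collections import defaultdict

# Hypothetical in-memory stand-in for the per-operation counters.
_counters = defaultdict(lambda: {
    "invoked": 0,
    "invoked_without_exception": 0,
    "invoked_with_exception": 0,
    "duration_ns_total": 0,  # numerator of the average duration
    "duration_base": 0,      # denominator, incremented once per invocation
})


def tracked(operation_name):
    """Decorator for a Service Facade method (assumed not re-entrant)."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            entry = _counters[operation_name]
            entry["invoked"] += 1
            started = time.perf_counter_ns()
            try:
                result = func(*args, **kwargs)
            except Exception:
                entry["invoked_with_exception"] += 1
                raise
            else:
                entry["invoked_without_exception"] += 1
                return result
            finally:
                # Runs on both paths, so failed calls are timed as well.
                entry["duration_ns_total"] += time.perf_counter_ns() - started
                entry["duration_base"] += 1
        return wrapper
    return decorate


def performance_report():
    """Return all counters as a JSON string, with the average duration derived."""
    report = {}
    for name, entry in _counters.items():
        base = entry["duration_base"]
        average = entry["duration_ns_total"] / base if base else 0.0
        report[name] = dict(entry, average_duration_ns=average)
    return json.dumps(report, sort_keys=True)


@tracked("Search")
def search(term):
    """Hypothetical facade operation used only to exercise the counters."""
    if not term:
        raise ValueError("empty search term")
    return [term.upper()]
```

Wrapping at the facade keeps one invocation equal to one counter increment; applying the same decorator to a recursive Application Layer method would count every nested call.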
### Technical Design Requirements ###
* [MUST] Log Files should be rotated on a regular basis (max 1 week).
* [MUST] If using WCF Logging, ensure Logging is rotated.
* [TDR:8ru5s:SHOULD:Availability] The application should have a `MonitoringOperationService` which invokes `MonitoringOperation`s in order to check known failure points. `MonitoringOperation`s that fail will raise an alert.
* [TDR:8ru5t:SHOULD:Availability] A `MonitoringOperation` (see [TDR:8ru5s]) should be created to invoke a HardDriveSpaceService in order to monitor the size of specific folders and the space available on specific hard-drives.
  * Rationale: a hard-drive that is full can cause logging to fail, making the whole application unavailable.
* [TDR:8ru5x:SHOULD:Availability] A `MonitoringOperation` should be created to ensure the certificates used in the application to secure transport and messages are up to date. See [NFR:8qy5i].
  * Rationale: there are online services available to monitor SSL certificates, but no internet service that can monitor other Certificates, such as those used to secure message integrity. There are internal infrastructure monitoring solutions -- but they are expensive in themselves, and they raise operating costs because procedures must be put in place to find the right people to update the application and possibly redeploy it.
* [TDR:8ru5u:SHOULD:Performance] The `TracingService` should not be set to flush to the hard-drive after every operation.
  * Rationale: flushing would tie the application's responsiveness directly to the hard-drive's IO speed. Even an SSD is far slower than memory operations.
* [TDR:8ru5v:SHOULD] The application should have available a `PerformanceCounterApiController` exposing the functionality of a `PerformanceCounterReportService` in order to return JSON feeds of performance Counter values.
* [TDR:8ru5w:SHOULD] The application should provide a `DatabasePerformanceApiController` exposing the functionality of a `DatabasePerformanceService` returned as JSON messages.
  * Consider using the following [link](http://www.codeproject.com/Articles/799053/Web-based-real-time-SQL-Server-Performance-Dashboa) (and [this](http://www.codeproject.com/Articles/822859/Real-time-Oracle-Database-Monitoring-Dashboard-in) one as well) as a starting point.
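
As an illustration of the `MonitoringOperationService` idea above, the following Python sketch runs a list of checks and raises an alert for each one that fails. It is only a sketch under stated assumptions: the class shapes, the `alert` callback, the 30-day warning window, and the way a certificate's expiry date is supplied are all hypothetical and not taken from any existing implementation.

```python
import shutil
from datetime import datetime, timedelta, timezone


class MonitoringOperation:
    """Base class: check() returns None when healthy, or a failure message."""
    name = "unnamed"

    def check(self):
        raise NotImplementedError


class DiskSpaceOperation(MonitoringOperation):
    """Fails when the drive holding `path` has less than `min_free_bytes` free."""

    def __init__(self, path, min_free_bytes):
        self.name = f"disk-space:{path}"
        self.path = path
        self.min_free_bytes = min_free_bytes

    def check(self):
        free = shutil.disk_usage(self.path).free
        if free < self.min_free_bytes:
            return f"only {free} bytes free on {self.path}"
        return None


class CertificateExpiryOperation(MonitoringOperation):
    """Fails when the certificate expires within `warn_within` (assumed 30 days)."""

    def __init__(self, label, not_after, warn_within=timedelta(days=30)):
        self.name = f"certificate:{label}"
        self.not_after = not_after  # timezone-aware expiry timestamp
        self.warn_within = warn_within

    def check(self):
        remaining = self.not_after - datetime.now(timezone.utc)
        if remaining <= self.warn_within:
            return f"expires in {remaining.days} day(s)"
        return None


class MonitoringOperationService:
    """Runs every operation and reports failures through the alert callback."""

    def __init__(self, operations, alert):
        self.operations = operations
        self.alert = alert  # callable(name, message), hypothetical hook

    def run(self):
        failures = 0
        for operation in self.operations:
            try:
                message = operation.check()
            except Exception as error:  # a check that crashes is also a failure
                message = f"check raised {error!r}"
            if message is not None:
                failures += 1
                self.alert(operation.name, message)
        return failures
```

Because the certificate check only needs an expiry timestamp, the same operation could cover message-signing certificates as well as SSL certificates, which is the gap the rationale above describes.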
### Unclassified ###
* Monitoring should be in place for:
  * DDoS
  * Certificate Expiration
  * Hard-drive
  * CPU
  * IO
  * Operation Duration
  * Success to Failure ratio
  * Login Attempts
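
Several of these items reduce to simple arithmetic over collected samples. The Python sketch below shows one way to derive an availability percentage from external probe results, a nearest-rank percentile for operation durations (as needed for "95% of the time" and "99% of the time" targets), and a success-to-failure ratio. The function names and the one-probe-per-minute sampling are assumptions made for illustration.

```python
import math


def availability_percent(probe_results):
    """Share of successful external probes in the window, as a percentage."""
    if not probe_results:
        raise ValueError("no probe results in the window")
    successes = sum(1 for ok in probe_results if ok)
    return 100.0 * successes / len(probe_results)


def percentile(durations, percent):
    """Nearest-rank percentile: smallest sample covering `percent` % of samples."""
    if not durations:
        raise ValueError("no durations recorded")
    ordered = sorted(durations)
    rank = max(1, math.ceil(percent * len(ordered) / 100))
    return ordered[rank - 1]


def success_to_failure_ratio(successes, failures):
    """Successful operations per failed operation; infinite when nothing failed."""
    return math.inf if failures == 0 else successes / failures


# Assumed sampling: one probe per minute over 24 hours, two of them failed.
probes = [True] * 1438 + [False] * 2
daily_availability = availability_percent(probes)
```

Comparing `percentile(durations, 95)` and `percentile(durations, 99)` against the agreed maximum search times gives a direct pass or fail answer for each target.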