Conclusion

Scope

This document covers only mobile app metrics and mobile security metrics. It does not suggest any change to Keycloak metrics, Sync metrics, or the advanced analytics in Kibana. In a nutshell, mobile devices generate events such as sdkInit or securityCheck and push them to a backend. Grafana then visualizes the received data.

Current architecture:
+---------------------------+                    +--------------+
|  AeroGear Metrics Server  |   Saves data to    |   Postgres   |
|                           +------------------->|              |
+-------------+-------------+                    +-------+------+
              ^                                          ^
              | Sends data to                            | Consumes
              |                                          |
+-------------+-------------+                    +-------+------+
|       Mobile Device       |                    |   Grafana    |
+---------------------------+                    +--------------+

Suggested Architecture
+---------------------------+                    +--------------+
|  Prometheus Push Gateway  |      Scrapes       |  Prometheus  |
|                           +<-------------------+              |
+-------------+-------------+                    +-------+------+
              ^                                          ^
              | Sends data to                            | Consumes
              |                                          |
+-------------+-------------+                    +-------+------+
|       Mobile Device       |                    |   Grafana    |
+---------------------------+                    +--------------+

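To make the "Sends data to" arrow in the suggested architecture more concrete, here is a minimal sketch of how a client could build a push request for the gateway. It relies on the Push Gateway's HTTP API (a POST to `/metrics/job/<job>/<label>/<value>` with a body in the Prometheus text exposition format). The gateway address, the helper names, and the choice to put `clientId` into the grouping key (so that pushes from different devices do not overwrite each other) are assumptions for illustration, not part of the proposal.

```python
from urllib.parse import quote
from urllib.request import Request

# Hypothetical gateway address; the document does not name one.
GATEWAY_URL = "http://pushgateway.example.com:9091"


def escape_label_value(value: str) -> str:
    """Escape a label value for the Prometheus text exposition format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def build_push_request(metric: str, client_id: str, labels: dict,
                       value: int = 1, job: str = "gateway") -> Request:
    """Build (but do not send) a push request for one event.

    Assumption: clientId is part of the grouping key in the URL path, so each
    device gets its own metric group on the gateway.
    """
    label_text = ",".join(
        f'{name}="{escape_label_value(val)}"' for name, val in sorted(labels.items())
    )
    body = f"{metric}{{{label_text}}} {value}\n"
    url = (f"{GATEWAY_URL}/metrics/job/{quote(job, safe='')}"
           f"/clientId/{quote(client_id, safe='')}")
    return Request(url, data=body.encode("utf-8"), method="POST",
                   headers={"Content-Type": "text/plain; version=0.0.4"})


request = build_push_request(
    "sdkInitEvent",
    client_id="i2",
    labels={"appId": "com.example.someApp", "platform": "ios"},
)
```

Passing `request` to `urllib.request.urlopen` would perform the actual push; that call is left out here so the sketch runs without a gateway.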
Architecture diffs

In the current architecture, a metrics server simply receives events from mobile devices and saves them to the database. Grafana then visualizes the data in the database.

In the new architecture, we suggest replacing the metrics server with the Prometheus Push Gateway. The gateway essentially does what the metrics server does, but instead of writing to a database, it keeps the data in memory and waits for Prometheus to scrape it.

We already have a Prometheus instance in our current architecture, but it is not used for mobile app metrics. Instead, it monitors the server instances of a number of other services. So the new architecture will not introduce any new component other than the push gateway.

Limitations

In the current architecture, we use Postgres with its JSON/JSONB features. It is powerful and flexible, and we can do any aggregation with it. In the new architecture, however, the metrics will be stored in Prometheus, which is not a full-featured database. It provides good aggregations, but some of the existing graphs won't be possible. For example, the graph "Unique Clients Per App Id" will not be possible. To illustrate, assume we have these entries in the Prometheus series sdkInitEvent:
sdkInitEvent{appId="com.example.someApp",appVersion="1.2.3",clientId="i2",framework="native",job="gateway",platform="ios",platformVersion="11.3",sdkVersion="3.0.0"} 1
sdkInitEvent{appId="com.example.someApp",appVersion="256",clientId="a2",framework="native",job="gateway",platform="android",platformVersion="28",sdkVersion="3.0.0"} 1
sdkInitEvent{appId="com.example.someApp",appVersion="4.5.6",clientId="i1",framework="native",job="gateway",platform="ios",platformVersion="10.1",sdkVersion="2.0.0"} 1
sdkInitEvent{appId="com.example.someOtherApp",appVersion="257",clientId="a1",framework="cordova",job="gateway",platform="android",platformVersion="28",sdkVersion="3.0.0"} 1
sdkInitEvent{appId="com.example.someOtherApp",appVersion="7.8.9",clientId="i1",framework="cordova",job="gateway",platform="ios",platformVersion="10.1",sdkVersion="2.0.0"} 1

Finding the unique client ids is possible in 2 steps:
Query: count by (clientId)(sdkInitEvent)

{clientId="a1"} 1
{clientId="i2"} 1
{clientId="a2"} 1
{clientId="i1"} 2

The query above gives the number of events per client id. Counting those results in a second aggregation gives the number of unique client ids.
Query: count (count by (clientId)(sdkInitEvent))

{} 4

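The two-step aggregation can be mimicked outside Prometheus to check the numbers. The following is a minimal Python sketch, not PromQL: the `count_by` helper is a hypothetical stand-in for `count by (...)`, and each sample series is reduced to the three labels that matter here.

```python
from collections import Counter

# The five sample sdkInitEvent series, reduced to the labels used in the queries.
SERIES = [
    {"appId": "com.example.someApp", "clientId": "i2", "platform": "ios"},
    {"appId": "com.example.someApp", "clientId": "a2", "platform": "android"},
    {"appId": "com.example.someApp", "clientId": "i1", "platform": "ios"},
    {"appId": "com.example.someOtherApp", "clientId": "a1", "platform": "android"},
    {"appId": "com.example.someOtherApp", "clientId": "i1", "platform": "ios"},
]


def count_by(series, *labels):
    """Hypothetical stand-in for `count by (labels)`: series per label combination."""
    return Counter(tuple(entry[label] for label in labels) for entry in series)


# Step 1: count by (clientId)(sdkInitEvent)
per_client = count_by(SERIES, "clientId")

# Step 2: count (count by (clientId)(sdkInitEvent)), i.e. the number of groups from step 1
unique_clients = len(per_client)

print(dict(per_client))
print(unique_clients)
```

Calling the same helper with `"appId"`, or with `"clientId", "appId"`, gives the other groupings used in the example queries of this section.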
However, splitting the first step further, for example per app id, is not possible. Some example queries:
Query: count by (appId)(sdkInitEvent)

{appId="com.example.someApp"} 3
{appId="com.example.someOtherApp"} 2

We lost the client id information and can't aggregate again on that.
Query: count by (clientId)(sdkInitEvent)

{clientId="a1"} 1
{clientId="i2"} 1
{clientId="a2"} 1
{clientId="i1"} 2

We lost the app id information and can't aggregate again on that.
Query: count by (clientId, appId)(sdkInitEvent)

{appId="com.example.someOtherApp",clientId="i1"} 1
{appId="com.example.someApp",clientId="i2"} 1
{appId="com.example.someApp",clientId="a2"} 1
{appId="com.example.someApp",clientId="i1"} 1
{appId="com.example.someOtherApp",clientId="a1"} 1

This won't help either. So some of the charts in the existing dashboards will be gone. However, we think this is a good trade-off.

Failures

Prometheus Push Gateway instances store the data in memory, so if an instance fails for any reason, the metrics for that period will be lost. However, Prometheus scrapes the gateway often enough to make this problem negligible.

Scalability

As there is no persistence in the gateway, we can scale horizontally. We can create a K8s service with a number of pods and make Prometheus scrape each pod. Prometheus will store the metrics from each pod, and when we aggregate, it will aggregate the data that came from all pods. We already have experience with this kind of setup from scraping AeroGear mobile services.

Gateways

This document talks about the Prometheus Push Gateway, which is nothing but a metrics cache. However, there is also the "WeaveWorks aggregation gateway", which is a fork of the Prometheus Push Gateway. It claims to aggregate data and let Prometheus scrape the aggregated data. Aggregating on the gateway has the potential to solve the limitation mentioned above, but there is no documentation at all on how to configure the aggregations. Even basic instructions are missing. I asked the WeaveWorks community about this (mailing list and Slack) and received no answer. I would not recommend this component.