
Background Job & Queue — Pushing Millions of Notifications per Hour

Tuan Nguyen · CodeX · Jan 30, 2021


In this article, I present a case study of adopting the background job and queue model in a real system. Starting from the current system architecture, an improved version is proposed, and the design of such a system is detailed from different aspects. If the whole content does not make much sense to you, I highly recommend reading the other articles in this series first.

Input

A notification system written in Python and MySQL has several performance issues:

  • High latency that slows down the services calling it
  • Overload during peak times, resulting in pushed notifications being lost
  • A high failure ratio on connections to Firebase, due to the large number of call requests
  • Inefficient token management, with duplicates and many inactive tokens that slow down the push process
  • Poorly normalized data, so every push request needs to join 3 tables
  • Wasted resources, in terms of CPU and RAM

In addition, the system tracks neither the success ratio of push requests nor the requests coming in from other services. The current architecture is shown in the following figure. Upon receiving a request, the Push API gets the list of the requesting user's tokens from the Token store and sends a push request to the Firebase API. That's it.

Solution

Here is the new design adopting Queue and Background Job:

Database and API Design

Basically, the database stores 2 main kinds of data:

  • Users' push tokens: rarely updated, frequently read in lists, with lots of extra data about the device or OS.
  • Push logs: log information and push results, with frequent inserts and updates but few reads.

Based on these data characteristics, the database is migrated from MySQL to MongoDB, so that we can leverage MongoDB's better performance on insert/update operations.
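
To make the new data model concrete, here is a minimal sketch of what the two collections could look like as Go structs with BSON tags; the collection layout and field names are my assumptions, not taken from the original system:

```go
package worker

import "time"

// PushToken documents: rarely updated, read in lists, and carrying
// extra device/OS metadata. Field names are illustrative.
type PushToken struct {
	UserID    string    `bson:"user_id"`
	Token     string    `bson:"token"`
	Platform  string    `bson:"platform"` // e.g. "android", "ios"
	OSVersion string    `bson:"os_version"`
	Active    bool      `bson:"active"`
	UpdatedAt time.Time `bson:"updated_at"`
}

// PushLog documents: inserted and updated on every push, rarely read.
type PushLog struct {
	RequestID string    `bson:"request_id"`
	UserID    string    `bson:"user_id"`
	Payload   string    `bson:"payload"`
	Status    string    `bson:"status"` // "queued", "sent", "failed"
	CreatedAt time.Time `bson:"created_at"`
	UpdatedAt time.Time `bson:"updated_at"`
}
```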

Push API is responsible for receiving request traffic from other services. So, to accelerate the process and shorten the waiting time of those services, the whole message flow is simplified to the following (a sketch follows the list):

  • Validate the incoming request
  • Push a job into the queue
  • Return a response to the calling service immediately
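
A minimal sketch of such a Push API in Go is shown below; the request shape and the queue (an in-memory channel standing in for a real broker, which the article does not name) are assumptions of mine:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// PushRequest is an assumed request shape; the article does not
// specify the real payload.
type PushRequest struct {
	UserID  string `json:"user_id"`
	Title   string `json:"title"`
	Message string `json:"message"`
}

// jobQueue stands in for the real queue; in production this would be
// a broker such as Redis, RabbitMQ, or Kafka.
var jobQueue = make(chan PushRequest, 10000)

// handlePush validates, enqueues, and answers immediately; all heavy
// work (DB writes, Firebase calls) is left to the background workers.
func handlePush(w http.ResponseWriter, r *http.Request) {
	var req PushRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil || req.UserID == "" {
		http.Error(w, "invalid request", http.StatusBadRequest)
		return
	}
	select {
	case jobQueue <- req:
		w.WriteHeader(http.StatusAccepted) // 202: queued for async processing
	default:
		http.Error(w, "queue full", http.StatusServiceUnavailable)
	}
}

func main() {
	http.HandleFunc("/push", handlePush)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

With the heavy lifting removed, the handler only decodes, validates, and enqueues, so its latency no longer depends on Firebase or the DB.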

You can see that by shifting some functions out of the Push API and re-designing the database, 3 issues are addressed: API latency, token management, and the cost of obtaining tokens for each request.

Job Worker

Straightforwardly, the implementation of the job worker can start with the following simple steps (a naive sketch follows the list):

  • Insert push log to the DB
  • Query token of the user from the DB
  • Construct a push request based on the list of tokens
  • Call Firebase API
  • Update the response of push request to the DB
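
Before adding batching, a naive worker could look like the sketch below; the helper functions are stubs standing in for the real MongoDB and Firebase calls, and their names and signatures are mine:

```go
package main

import "log"

// Job mirrors the push request taken from the queue.
type Job struct {
	UserID  string
	Title   string
	Message string
}

// Stubs for the real DB and Firebase calls; illustrative only.
func insertPushLog(j Job) string                { return "log-id" }
func queryTokens(userID string) []string        { return []string{"token-1"} }
func callFirebase(tokens []string, j Job) error { return nil }
func updatePushLog(logID string, err error)     {}

// runWorker walks the five steps above for every job, one job at a
// time; each iteration costs roughly 300-400ms in the original system.
func runWorker(jobs <-chan Job) {
	for j := range jobs {
		logID := insertPushLog(j)       // step 1: insert push log
		tokens := queryTokens(j.UserID) // step 2: query the user's tokens
		err := callFirebase(tokens, j)  // steps 3-4: build request, call Firebase
		updatePushLog(logID, err)       // step 5: record the push result
		if err != nil {
			log.Printf("push failed for user %s: %v", j.UserID, err)
		}
	}
}

func main() {
	jobs := make(chan Job, 1)
	jobs <- Job{UserID: "42", Title: "hi", Message: "hello"}
	close(jobs)
	runWorker(jobs)
}
```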

Finishing all of these steps takes about 300–400ms, so the system needs to scale out the worker nodes to benefit from parallel computing. However, there are 2 things that adversely affect system performance:

  • Inserting logs and updating push results increase the workload on the database
  • Calling the API of an external service, i.e. Firebase, suffers limitations in terms of network connectivity and latency

To overcome these 2 issues, batch processing is applied in each worker instance. In particular, all requests that share a common operation are grouped into a batch, resulting in only 1 call to the DB or the external API. For example, with Golang, Muster is a library that facilitates batch processing by modeling a batch as a bucket. Incoming requests fill the bucket one by one, and an action is taken on the whole bucket once the number of requests reaches a threshold, or simply on timeout.
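
A minimal sketch of this bucket pattern with Muster (github.com/facebookgo/muster) is shown below, applied to the insert-log step; the batch size, timeout, and item shape are illustrative assumptions:

```go
package main

import (
	"fmt"
	"time"

	"github.com/facebookgo/muster"
)

// logEntry is one insert-log request waiting in the bucket.
type logEntry struct{ UserID, Payload string }

// logBatch is the bucket; Muster fills it and fires it for us.
type logBatch struct{ items []logEntry }

func (b *logBatch) Add(item interface{}) {
	b.items = append(b.items, item.(logEntry))
}

// Fire runs when the bucket is full or the timeout expires: one bulk
// DB call replaces len(b.items) individual inserts.
func (b *logBatch) Fire(notifier muster.Notifier) {
	defer notifier.Done()
	fmt.Printf("bulk inserting %d logs in one DB call\n", len(b.items))
	// e.g. pushLogs.InsertMany(ctx, docs) would go here.
}

func main() {
	c := muster.Client{
		MaxBatchSize:        1000,                  // fire at 1000 queued requests...
		BatchTimeout:        50 * time.Millisecond, // ...or after 50ms, whichever comes first
		PendingWorkCapacity: 10000,
		BatchMaker:          func() muster.Batch { return &logBatch{} },
	}
	if err := c.Start(); err != nil {
		panic(err)
	}
	defer c.Stop() // Stop flushes any partially filled bucket

	for i := 0; i < 2500; i++ {
		c.Work <- logEntry{UserID: fmt.Sprint(i), Payload: "hello"}
	}
}
```

The same pattern is reused for the Firebase and update-DB buckets, only with different Fire implementations and thresholds.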

Below are all the steps a worker walks through when processing a job, together with the processing time of each step, where n is the number of push requests the system receives.

  • Step 1: Insert-log job (less than 1ms x n)
    - Fill the bucket with the insert-log-to-DB request
  • Step 2: Batch processing (about 10ms x n/1000)
    - Trigger a batch of insert-log-to-DB jobs
    - Schedule the push job
  • Step 3: Push job (about 5ms x n)
    - Query the user's tokens from the DB
    - Construct a push request based on the list of tokens
    - Fill the bucket (a different one from Step 1) with the call-Firebase request
  • Step 4: Batch processing (about 500ms x n/500)
    - Trigger a batch of call-Firebase jobs
    - Fill the bucket with the update-DB request
  • Step 5: Batch processing (about 20ms x n/1000)
    - Trigger a batch of update-DB jobs

Thanks to MongoDB's support for bulk write operations, and to the Firebase API accepting batches of up to 500 messages per call, this approach significantly reduces the waiting time for API calls (by reducing the number of requests) as well as the resources allocated to the DB (by reducing the number of queries). It allows the system to reach up to 1000 notification pushes per second, or 3.6 million pushes every hour, with only 1 worker and constrained resources.
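
As an illustration of Steps 4 and 5 combined, the sketch below sends one bucket of up to 500 tokens in a single call with the Firebase Admin Go SDK and records all results in a single MongoDB bulk write; mapping exactly one log record per token is a simplification of mine:

```go
package worker

import (
	"context"

	"firebase.google.com/go/v4/messaging"
	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
)

// firePushBatch sends one bucket of tokens (at most 500, the FCM limit)
// in a single Firebase call, then records every result with a single
// MongoDB bulk write instead of one update per token.
func firePushBatch(ctx context.Context, fcm *messaging.Client, logs *mongo.Collection,
	logIDs, tokens []string, title, body string) error {

	resp, err := fcm.SendMulticast(ctx, &messaging.MulticastMessage{
		Tokens:       tokens,
		Notification: &messaging.Notification{Title: title, Body: body},
	})
	if err != nil {
		return err
	}

	// Responses are ordered like Tokens, so result i belongs to logIDs[i].
	models := make([]mongo.WriteModel, 0, len(resp.Responses))
	for i, r := range resp.Responses {
		status := "sent"
		if !r.Success {
			status = "failed"
		}
		models = append(models, mongo.NewUpdateOneModel().
			SetFilter(bson.M{"request_id": logIDs[i]}).
			SetUpdate(bson.M{"$set": bson.M{"status": status}}))
	}
	if len(models) == 0 {
		return nil
	}
	_, err = logs.BulkWrite(ctx, models)
	return err
}
```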

Conclusion and Acknowledgment

Implementing the batch processing technique is not a simple task, however straightforward its principle is. It is important to consider issues related to errors, retries, and reporting: any issue that makes a step fail now adversely affects a whole batch of push jobs instead of 1, and can require a huge effort to resolve. But as long as you can handle that (yes, it is a matter of if), the trade-off is worth accepting.
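
As one possible illustration of that handling (my sketch, not from the original post), failed items can be re-enqueued individually with a retry counter and a dead-letter queue, so one bad batch does not silently drop every job it contained:

```go
package worker

// retryJob wraps a push job with a retry counter so that one failed
// batch does not silently drop every job it contained.
type retryJob struct {
	Tokens  []string
	Payload string
	Retries int
}

const maxRetries = 3

// handleBatchFailure re-enqueues only the failed items of a batch;
// items past maxRetries go to a dead-letter queue for reporting.
func handleBatchFailure(failed []retryJob, requeue, deadLetter chan<- retryJob) {
	for _, j := range failed {
		j.Retries++
		if j.Retries > maxRetries {
			deadLetter <- j // inspect/report later instead of retrying forever
			continue
		}
		requeue <- j
	}
}
```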

This is the last part of my series about the background job and queue concepts. Once again, I would like to send my big thanks to Quang Minh (a.k.a. Minh Monmen) for permission to translate his original post.
