How a Supervisor Helps Handle Big Data in Small Projects

Written by Sathyakrishnan (Web Team Lead)

We are living in a new era of software development. Since COVID-19 in particular, the world relies on the virtual landscape for almost everything. Even children have experienced virtual schooling and are making the most of it. Picture this scenario from the past year:

Suddenly, parents are keen to find the best learning opportunities for their kids. They log into an application to run a comparison report across five major schools in the state. They want to see the number of students who graduated in the last decade, the courses available, the extracurricular activities offered, and so on. Of course, the data is there, supplied by third-party data providers. But can we process data that big?

Let’s take another example. A stockholder wants to run a report on a security they want to purchase. They want to see the security's performance over at least two decades, the major investors, the short selling, the latest closing prices, intra-day sales performance, and so on. However many checkboxes they can tick on their tiny mobile screen, we need to show all of them, and more! When they swipe up on their mobile device, we need to be ready to fetch and display the next 50 rows. We have data providers, but can we process this much data?

Data locality:

Instead of relying on third-party data providers for every piece of data, we should maintain a historical copy of the data ourselves. Maintaining local data is key to the performance of our application. Third-party providers also deliver a lot of unneeded data along with the data we asked for. For example, if you need the closing price of a company, you will be showered with the opening price, high and low prices, the average price of the day, total sales made, etc. From an optimistic perspective it is good to have all the data at once, but in our case we need only “the closing price”. So we need to save just the data we want on our own server, that is, in our database.
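As a rough illustration of keeping only what we need, here is a minimal Python sketch. The field names (`symbol`, `date`, `close`) and the sample row are assumptions made up for this example; a real provider's payload will look different.

```python
from typing import Any

# Hypothetical field names: the only columns we want in our own database.
WANTED_FIELDS = ("symbol", "date", "close")


def keep_wanted(record: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the provider record holding only the wanted fields."""
    return {field: record[field] for field in WANTED_FIELDS}


# Made-up provider row carrying far more than the closing price.
provider_row = {
    "symbol": "ACME",
    "date": "2021-03-01",
    "open": 10.0,
    "high": 11.5,
    "low": 9.8,
    "close": 11.2,
    "average": 10.7,
    "volume": 120000,
}

local_row = keep_wanted(provider_row)
print(local_row)
```

Only `local_row` would be written to the database; everything else in the provider row is dropped before it reaches our server.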

But that brings a problem. Our copy is unlikely to stay an exact match of the data held by the third-party providers, so we need some mechanism to check and update the database frequently. I admit that is no cakewalk. You have to compare the third-party data with what you have and copy over the missing pieces. It’s easier said than done, and it is also time-consuming. We cannot ask our customers to be patient and wait for the update process, so while we continue to serve them at the front end, we have to update the database at the back end. To manage that, we have to split the whole job into chunks and queue those chunks as background processes.
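The compare-and-chunk step can be sketched in a few lines of Python. This is only a sketch under simple assumptions: each record is identified by a date string, and both key sets fit in memory. The function names `missing_keys` and `chunked` are invented for this example.

```python
from typing import Iterator


def missing_keys(provider_keys: set[str], local_keys: set[str]) -> list[str]:
    """Keys the provider has that our local copy lacks, in sorted order."""
    return sorted(provider_keys - local_keys)


def chunked(items: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Made-up sample data: trading dates known to the provider and to us.
provider_dates = {"2021-03-01", "2021-03-02", "2021-03-03", "2021-03-04", "2021-03-05"}
local_dates = {"2021-03-01", "2021-03-03"}

todo = missing_keys(provider_dates, local_dates)
batches = list(chunked(todo, 2))
print(batches)
```

Each batch is small enough to be handed to one background worker, so a failure costs only that batch and not the whole update.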

Supervisor:

Supervisor does this tough job for you. It can control a number of processes on your system: it starts them, watches them, and restarts them in case of failure. The server-side component is called supervisord and the client-side component is called supervisorctl.

Supervisord is responsible for starting child programs at its own invocation, responding to commands from clients, restarting crashed or exited subprocesses, logging the stdout and stderr output of its subprocesses, and generating and handling “events” corresponding to points in subprocess lifetimes. Supervisorctl provides a shell-like interface to the features provided by supervisord. From supervisorctl, a user can connect to different supervisord processes (one at a time), get the status of the subprocesses a supervisord controls, stop and start those subprocesses, and list its running processes.

Supervisord's primary purpose is to create and manage processes based on data in its configuration file. It does this by creating subprocesses. Each subprocess it spawns is managed by supervisord for the entirety of its lifetime (supervisord is the parent process of each process it creates). When a child dies, supervisord is notified of its death via the SIGCHLD signal and performs the appropriate operation.
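To make the restart idea concrete, here is a toy Python sketch of a parent process that reruns a child when it fails. It is not how supervisord is implemented: this version blocks on a single child and restarts only on a non-zero exit code, whereas supervisord manages many children at once and reacts to SIGCHLD. The name `run_with_restarts` is invented for this example.

```python
import subprocess
import sys


def run_with_restarts(command: list[str], max_restarts: int = 3) -> int:
    """Run `command`, starting it again after each failed exit.

    Stops after a clean exit (code 0) or once `max_restarts` restarts
    have been used. Returns how many times the command was started.
    """
    starts = 0
    while True:
        starts += 1
        # Blocks until the child exits, then inspects its exit code.
        exit_code = subprocess.run(command).returncode
        if exit_code == 0 or starts > max_restarts:
            return starts


# A child that always fails, and one that always succeeds.
failing = [sys.executable, "-c", "import sys; sys.exit(1)"]
succeeding = [sys.executable, "-c", "pass"]

print(run_with_restarts(failing, max_restarts=2))
print(run_with_restarts(succeeding))
```

The failing child is started once and then restarted twice before the loop gives up; the succeeding child is started exactly once.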

Configuring Supervisor:

supervisord.conf

[program:YourProgram1]

command=php /var/www/html/your/cake/file/path/bin/cake.php YourProgramShell1

process_name=%(program_name)s.%(process_num)02d

numprocs=5

directory=/var/www/html/your/cake/file/path

autostart=true

autorestart=true

startretries=3

stderr_logfile=/var/www/html/your/cake/file/path/logs/your_program1_error.log

stdout_logfile=/var/www/html/your/cake/file/path/logs/your_program1_debug.log

user=www-data

[program:YourProgram2]

command=php /var/www/html/your/cake/file/path/bin/cake.php YourProgramShell2

process_name=%(program_name)s.%(process_num)02d

numprocs=5

directory=/var/www/html/your/cake/file/path

autostart=true

autorestart=true

startretries=3

stderr_logfile=/var/www/html/your/cake/file/path/logs/your_program2_error.log

stdout_logfile=/var/www/html/your/cake/file/path/logs/your_program2_debug.log

user=www-data

The above is a Supervisor configuration for a CakePHP application. With autostart=true, supervisord starts both programs when it starts, each as five processes (numprocs=5), and runs them side by side; it does not hold YourProgram2 back until YourProgram1 has finished. Using the supervisorctl command, we can see the process ID of each running process and stop any of them. Overall, Supervisor is a simple and effective tool that starts your processes, keeps them in order, and keeps an eye on them.
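The process_name setting in the configuration is a Python string expression, so the names Supervisor gives the five processes can be previewed with ordinary Python formatting. The sketch below assumes the default numbering, which starts at 0; the helper `expand_process_names` is invented for this example.

```python
# Same template as the process_name line in the configuration.
PROCESS_NAME_TEMPLATE = "%(program_name)s.%(process_num)02d"


def expand_process_names(
    program_name: str, numprocs: int, numprocs_start: int = 0
) -> list[str]:
    """Expand the template once per process, as numprocs would."""
    return [
        PROCESS_NAME_TEMPLATE % {"program_name": program_name, "process_num": num}
        for num in range(numprocs_start, numprocs_start + numprocs)
    ]


names = expand_process_names("YourProgram1", 5)
print(names)

# supervisorctl addresses a single process as group:name, where the
# group name for a [program:x] section is the program name itself.
print("YourProgram1:" + names[0])
```

These are the names that appear in the supervisorctl status listing, and the group:name form is what you pass when stopping or starting a single process.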

Do you face similar situations in your projects? Share your insights on how you handle them!
