Poor Man’s Parallelization for Batch Processing Jobs

One problem I’ve seen time and time again with batch processing jobs (generally cronjobs) is that they run quickly when they’re first written, but their workload grows over time until they become unacceptably slow. Most simple jobs are single-threaded and follow this pattern:

  • Query a database to get a list of objects to work on (users, pages, customers, etc)
  • Process or derive some data for each object

As an example let’s use a simple script that just iterates over a list of users.

<?php

$userIds = range(1,10);

foreach ($userIds as $userId) {
    echo "working on userId=$userId". PHP_EOL;
}

If you run this script it’ll just output:

working on userId=1
working on userId=2
working on userId=3
...
working on userId=10

To separate the jobs in a simple, consistent way we can use the modulus operator, which calculates the remainder when one number is divided by another. It’s a common arithmetic operator in almost all languages, so this technique is pretty portable.
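To make the remainder idea concrete, here is a minimal sketch that assigns each ID to a bucket. The helper name bucketFor is made up for illustration, and the type declarations assume PHP 7 or later.

```php
<?php

// Hypothetical helper: the bucket (0 to $buckets - 1) that an id falls into.
// Assumes positive integer ids, as in the examples in this post.
function bucketFor(int $id, int $buckets): int
{
    return $id % $buckets;
}

foreach (range(1, 6) as $userId) {
    echo "userId=$userId falls in bucket " . bucketFor($userId, 2) . PHP_EOL;
}
```

With two buckets the odd IDs land in bucket 1 and the even IDs in bucket 0, and a given ID always lands in the same bucket.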

Let’s allow two arguments to be passed into the script: which job this one is, and the total number of jobs to run in parallel. Here’s a simple way of doing it:

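<?php

// Usage: php myScript.php <job> <jobs>
// <jobs> is the total number of parallel jobs, <job> is this one's number (1 to <jobs>).
$job = (int)($argv[1] ?? 0);
$jobs = (int)($argv[2] ?? 0);

// Reject missing or out-of-range arguments (this also avoids a modulo by zero).
if ($jobs < 1 || $job < 1 || $job > $jobs) {
    fwrite(STDERR, "usage: php myScript.php <job> <jobs>" . PHP_EOL);
    exit(1);
}

echo "running job $job of $jobs". PHP_EOL;

$userIds = range(1,10);

foreach ($userIds as $userId) {
    // Skip every id whose remainder belongs to a different job.
    if ($userId % $jobs != $job - 1) {
        continue;
    }
    echo "working on userId=$userId". PHP_EOL;
}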

Running this on the command line as php myScript.php 1 2 will output:

running job 1 of 2
working on userId=2
working on userId=4
working on userId=6
working on userId=8
working on userId=10

…and running it with php myScript.php 2 2 will output:

running job 2 of 2
working on userId=1
working on userId=3
working on userId=5
working on userId=7
working on userId=9

Nice! Now we can run the even and odd jobs separately and at the same time.

Have a play with the script and try running it split three ways, with the options "1 3", "2 3" and "3 3".
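If you want to check a split before trusting it, something like the following sketch may help. It applies the same filter as the script to list the IDs each job would take. The helper name idsForJob is hypothetical, and PHP 7 or later is assumed.

```php
<?php

// Hypothetical helper mirroring the script's filter: the ids that
// job number $job (1-based) out of $jobs would work on.
function idsForJob(array $userIds, int $job, int $jobs): array
{
    return array_values(array_filter(
        $userIds,
        function ($userId) use ($job, $jobs) {
            return $userId % $jobs === $job - 1;
        }
    ));
}

$userIds = range(1, 10);
$jobs = 3;

for ($job = 1; $job <= $jobs; $job++) {
    echo "job $job of $jobs: " . implode(', ', idsForJob($userIds, $job, $jobs)) . PHP_EOL;
}
```

Every ID should appear under exactly one job. If an ID is missing or shows up twice, the job numbers and the remainders are out of step.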

You can use the modulus operator in SQL queries too, which could be useful if you’re pulling your list of IDs from the database. For example, this query selects only the odd-numbered users (job 2 of 2 in the script above):

SELECT userId FROM users WHERE userId % 2 = 1;
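As a rough sketch, the query above can take the job number and job count as bound parameters rather than hard-coded values. This example assumes PDO with the pdo_sqlite extension and uses an in-memory SQLite database as a stand-in for the real one. The function name fetchUserIdsForJob is made up for illustration. The users table and userId column come from the query above.

```php
<?php

// Assumption: an in-memory SQLite database stands in for the real one.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE users (userId INTEGER PRIMARY KEY)');
foreach (range(1, 10) as $userId) {
    $db->exec("INSERT INTO users (userId) VALUES ($userId)");
}

// Hypothetical helper: fetch only the ids that job $job (1-based) of $jobs should process.
function fetchUserIdsForJob(PDO $db, int $job, int $jobs): array
{
    $stmt = $db->prepare(
        'SELECT userId FROM users WHERE userId % :jobs = :remainder ORDER BY userId'
    );
    $stmt->bindValue(':jobs', $jobs, PDO::PARAM_INT);
    $stmt->bindValue(':remainder', $job - 1, PDO::PARAM_INT);
    $stmt->execute();

    // Cast to int so the result is the same whichever driver returns it.
    return array_map('intval', $stmt->fetchAll(PDO::FETCH_COLUMN));
}

echo implode(', ', fetchUserIdsForJob($db, 2, 2)) . PHP_EOL;
```

Filtering in the query means each job only pulls its own rows instead of loading every ID and skipping most of them. Some databases spell the operator as a MOD() function rather than %.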

It’s worth noting that this hack won’t always work. You need to be able to identify each iteration of your loop with some unique integer, and you need sufficient free resources on your machine(s) to run the scripts in parallel. In some cases the proper way to go is a full-on async setup using something like Gearman, but if you’re in a hurry or the code is trivial this is a great little five-minute fix.

About James Cohen
LAMP geek with interests in building scalable web applications
