When processing large amounts of data in PHP, I see a lot of developers loading the entire dataset into memory. When reading a file, they read the whole thing at once with file() and then process the result. That'd look something like this:

function processFile($filePath) {
    $lines = file($filePath, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    foreach ($lines as $line) {
        processLine($line);
    }
}

$filePath = 'largefile.txt';
processFile($filePath);

This has a very obvious disadvantage: it loads the entire file into memory, which can blow past the memory limit for larger files. It also adds a noticeable delay, as the whole file has to be read before the first line can be processed.
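
You can see the cost for yourself with memory_get_peak_usage(); a quick sketch (largefile.txt is a placeholder path):

processFile('largefile.txt');

// With file(), peak memory grows with the file size;
// with the generator version below, it stays roughly constant.
printf("Peak memory: %.2f MB\n", memory_get_peak_usage(true) / 1024 / 1024);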

Luckily, there is a very easy solution to this: Use a generator.

How do generators work?

Using a generator is extremely easy. The official documentation describes the concept very clearly:

A generator allows you to write code that uses foreach to iterate over a set of data without needing to build an array in memory, which may cause you to exceed a memory limit, or require a considerable amount of processing time to generate. Instead, you can write a generator function, which is the same as a normal function, except that instead of returning once, a generator can yield as many times as it needs to in order to provide the values to be iterated over.
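
In other words: a function that contains yield returns a Generator object, and its body only runs while you iterate over it. A tiny sketch to illustrate:

function countTo($max) {
    for ($i = 1; $i <= $max; $i++) {
        yield $i; // execution pauses here until the next value is requested
    }
}

foreach (countTo(3) as $number) {
    echo $number, "\n"; // prints 1, 2 and 3, one value at a time
}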

In practice, the code above would look like this, adapted to use a generator:

// Generator function: yields one line at a time instead of building an array
function readFileLineByLine($filePath) {
    $handle = fopen($filePath, 'r');
    if ($handle === false) {
        throw new RuntimeException("Could not open the file: $filePath");
    }

    try {
        while (($line = fgets($handle)) !== false) {
            yield $line;
        }
    } finally {
        // Runs even if the caller stops iterating early
        fclose($handle);
    }
}

function processFile($filePath) {
    foreach (readFileLineByLine($filePath) as $line) {
        processLine($line);
    }
}

$filePath = 'largefile.txt';
processFile($filePath);

The generator simply pauses when it reaches the yield, and resumes from that exact point when the next value is requested.
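
You can watch this pausing happen by adding some output around each yield; a small sketch:

function numbers() {
    echo "generator started\n";
    yield 1;
    echo "resumed after first yield\n";
    yield 2;
    echo "resumed after second yield\n";
}

foreach (numbers() as $n) {
    echo "consumed: $n\n";
}

// The output interleaves producer and consumer:
// generator started
// consumed: 1
// resumed after first yield
// consumed: 2
// resumed after second yield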

When to use generators?

When you are working on any of the following, using a generator should pop up in your mind:

  • Reading large files (think of CSV and log files)
  • Streaming data (processing large API responses)
  • Database results
  • Generating infinite sequences (see the sketch after this list)
  • Handling pagination
  • Batch processing (ETL pipelines)
  • Handling uploads or downloads
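
To illustrate the infinite sequences case: because values are only produced on demand, a generator can describe an endless series without ever materializing it. A minimal sketch:

// An endless Fibonacci sequence; only the values you consume are computed
function fibonacci() {
    [$a, $b] = [0, 1];
    while (true) {
        yield $a;
        [$a, $b] = [$b, $a + $b];
    }
}

foreach (fibonacci() as $n) {
    if ($n > 100) {
        break; // stop consuming; the rest is never computed
    }
    echo $n, "\n";
}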

Further reading