Knapsack - Collection Pipeline for Fun and Profit!

PHP has this powerful data structure called array, which we can use to handle any collection of data. We can then traverse them using for or foreach and process the data. The introduction of Closures in PHP 5.3 made the collection processing a bit more interesting with functions like array_map, array_filter, etc. But these functions are limited to native arrays only. We cannot use them to process Iterable objects or Generators. In this article, we will look at Knapsack, a library which can make working with collections a lot more fun.

Knapsack is a great Collection Pipeline library for PHP, which provides almost all the commonly used collection operations. Collection Pipeline is a pattern where we organize collection processing as a sequence of operations. The result of each operation is passed as the input to the next operation. If you use the Unix command line, you know how powerful this pattern is.

Knapsack can be used to process arrays, traversable objects, and Generators. Most of the methods in Knapsack return lazy collections by using Generators, so we can process a large amount of data with less memory usage. Without further ado, let's check out this library.

Installation

First, let us install the Knapsack library using Composer.

composer require dusank/knapsack

Usage

We can use Knapsack to process anything that is iterable. To use this library, we first need to convert our data into a Knapsack Collection object.

require 'vendor/autoload.php';

use DusanKasan\Knapsack\Collection;

$collection = new Collection([1,2,3]);

// or
$collection = Collection::from([1,2,3]);

Collection::from internally calls the constructor, but is more convenient as we can directly call the methods on the result. The constructor can take an array, traversable object or a Generator function as the input and returns a lazy collection. The values inside the collection are not realized until we actually need them.
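To get a feel for what "not realized until we actually need them" means, here is a minimal sketch using plain PHP generators, without Knapsack. The helpers naturals, lazy_map and lazy_take are hypothetical names made up for this illustration; they only mimic the idea and are not how Knapsack is implemented.

```php
<?php
// Hypothetical helpers for illustration only; not part of Knapsack.

// An endless stream of natural numbers: 1, 2, 3, ...
function naturals(): Generator
{
    $n = 1;
    while (true) {
        yield $n++;
    }
}

// Lazily applies $fn to each item; nothing runs until the result is iterated.
function lazy_map(iterable $items, callable $fn): Generator
{
    foreach ($items as $item) {
        yield $fn($item);
    }
}

// Lazily yields at most $n items and then stops pulling from the source.
function lazy_take(iterable $items, int $n): Generator
{
    if ($n <= 0) {
        return;
    }
    foreach ($items as $item) {
        yield $item;
        if (--$n === 0) {
            return;
        }
    }
}

$calls = 0;
$squares = lazy_map(naturals(), function ($x) use (&$calls) {
    $calls++;
    return $x * $x;
});

// Building the pipeline has done no work yet: $calls is still 0 here.
$firstThree = iterator_to_array(lazy_take($squares, 3), false);
// $firstThree is [1, 4, 9] and the callback ran only 3 times.
```

Even though the source is endless, only the three values we asked for are ever computed.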

Collection Methods

Once we have a collection object, we can call the methods on it. Operations in Knapsack are immutable, which means they return a new collection instead of modifying the existing one.

Explaining all the methods available in Knapsack is beyond the scope of this article, but let's see a few methods that are common in all collection libraries.

Map

The map function creates a new collection by applying a mapping function to each item in the collection. Let's say we want to double each item in the collection.

$doubled = $collection->map(function($item) {
    return $item * 2;
});
// [2, 4, 6]

Here the mapping function doubles whatever it receives as the input. So the result will be a collection with the values doubled.

Reduce

The reduce function reduces a collection into a single value. It takes two parameters, a callback function and an initial value, and returns the reduced value. For example, we can use reduce to find the sum of the items in a collection.

$sum = $collection->reduce(function($carry, $item) {
    return $carry + $item;
}, 0); //6

Here we have provided 0 as the initial value, which will be passed as the $carry in the first iteration. The result of the callback is carried over in all subsequent iterations and the result of the last iteration is returned as the result.
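The way the carry moves through the iterations can be sketched in plain PHP. The helper reduce_iterable below is a hypothetical name used only for this illustration, and Knapsack's own reduce may differ in its details. Unlike array_reduce, this version also accepts a Generator.

```php
<?php
// Hypothetical helper for illustration; not Knapsack's implementation.
function reduce_iterable(iterable $items, callable $fn, $initial)
{
    $carry = $initial;
    foreach ($items as $item) {
        // The result of each call becomes the carry for the next item.
        $carry = $fn($carry, $item);
    }
    return $carry;
}

// A Generator as input, which array_reduce would not accept.
$numbers = (function () {
    yield 1;
    yield 2;
    yield 3;
})();

$sum = reduce_iterable($numbers, function ($carry, $item) {
    return $carry + $item;
}, 0);
// The carry goes 0 -> 1 -> 3 -> 6, so $sum is 6.
```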

Filter and Reject

filter returns a collection containing only the items for which the predicate function returns true.

Imagine we have a collection of cities with name and country code and we want to find the cities in the US. Here is how we can do it using filter:

$cities = new Collection([
    ['name' => 'Chicago', 'country' => 'US'],
    ['name' => 'Tokyo', 'country' => 'JP'],
    ['name' => 'New York', 'country' => 'US'],
    ['name' => 'New Delhi', 'country' => 'IN']
]);

$result = $cities->filter(function($city) {
    return ($city['country'] == 'US');
})->toArray(); 
// [['name' => 'Chicago', 'country' => 'US'],['name' => 'New York', 'country' => 'US']]

filter will iterate over the collection and pass each item to the predicate function. The $city is included in the result if the country code is US.

reject, on the other hand, excludes the items that match the predicate condition from the result collection.

$result = $cities->reject(function($city) {
    return ($city['country'] == 'US');
})->toArray(); 
// [['name' => 'Tokyo', 'country' => 'JP'],['name' => 'New Delhi', 'country' => 'IN']]

Here the cities in the US are excluded from the result.
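One way to think about the pair is that reject is filter with the predicate negated. Here is a small sketch of that relationship using PHP's built-in array_filter rather than Knapsack; the $isUs and $not helpers are hypothetical names introduced only for this example.

```php
<?php
$cities = [
    ['name' => 'Chicago', 'country' => 'US'],
    ['name' => 'Tokyo', 'country' => 'JP'],
    ['name' => 'New York', 'country' => 'US'],
    ['name' => 'New Delhi', 'country' => 'IN'],
];

// Hypothetical predicate: is the city in the US?
$isUs = function (array $city): bool {
    return $city['country'] === 'US';
};

// Hypothetical helper: wraps a predicate and returns its negation.
$not = function (callable $predicate): callable {
    return function ($item) use ($predicate) {
        return !$predicate($item);
    };
};

// array_filter preserves keys, so array_values reindexes the result.
$inUs = array_values(array_filter($cities, $isUs));
$notInUs = array_values(array_filter($cities, $not($isUs)));
// $inUs holds Chicago and New York; $notInUs holds Tokyo and New Delhi.
```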

GroupBy and GroupByKey

groupBy groups the items in the collection based on the return value of the callback function. Let's say we want to group the cities collection based on country.

$result = $cities->groupBy(function($city) {
    return $city['country'];
})->toArray(); 
//  [
//      'US' => [
//          ['name' => 'Chicago', 'country' => 'US'],['name' => 'New York', 'country' => 'US']
//      ],
//      'JP' => [
//          ['name' => 'Tokyo', 'country' => 'JP']
//      ],
//      'IN' => [
//          ['name' => 'New Delhi', 'country' => 'IN']
//      ]
//  ]

groupByKey makes the grouping easier by taking the key by which we want to group the collection. The above example can be rewritten as:

$result = $cities->groupByKey('country')->toArray();
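If it helps to see what grouping does under the hood, this is roughly the loop we would otherwise write by hand. It is a plain PHP sketch; group_by is a hypothetical helper for this illustration and not Knapsack's implementation.

```php
<?php
// Hypothetical helper: groups items under the key returned by $keyFn.
function group_by(iterable $items, callable $keyFn): array
{
    $groups = [];
    foreach ($items as $item) {
        // Append the item to the bucket for its key, creating it if needed.
        $groups[$keyFn($item)][] = $item;
    }
    return $groups;
}

$cities = [
    ['name' => 'Chicago', 'country' => 'US'],
    ['name' => 'Tokyo', 'country' => 'JP'],
    ['name' => 'New York', 'country' => 'US'],
    ['name' => 'New Delhi', 'country' => 'IN'],
];

$byCountry = group_by($cities, function (array $city) {
    return $city['country'];
});
// Keys appear in first-seen order: US, JP, IN.
```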

Take, TakeNth and TakeWhile

take($n) will return the first $n elements from a collection.

$first = $articles
    ->take(1)
    ->toArray();
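// [['content' => '...', 'categories' => ['php', 'javascript']]]

The above example gets the first article from the $articles collection, which is defined in the Mapcat section below. Note that the result is still a list, here with a single item.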

takeNth($n) will return a collection by taking every $n-th element in the collection. Let's create a list of odd numbers from 1 to 10.

$oddNumbers = Collection::range(1,10)
    ->takeNth(2)
    ->toArray(); // [1,3,5,7,9]

Collection::range() creates a lazy collection of numbers from 1 to 10. Then we take every second element, starting with the first one (the elements at indexes divisible by 2), which gives us all the odd numbers.

takeWhile takes a predicate function and returns a collection by taking the items till it reaches an item that does not meet the predicate.

$numbers = Collection::range(1,10)
    ->takeWhile(function($item) {
        return ($item < 5);
    })
    ->toArray(); // [1,2,3,4]

Here we have a predicate function which checks if the item is less than 5. So the resulting collection will be of numbers that are less than 5.

Similarly, we have drop* methods which skip items from the collection. I leave that to you as an exercise.
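As a starting point for that exercise, here is how the take/drop pairing can be sketched with plain PHP generators. The functions take_while and drop_while are hypothetical helpers written for this illustration; they are not Knapsack's methods, only the same idea in miniature.

```php
<?php
// Hypothetical helper: yields items until the predicate first fails.
function take_while(iterable $items, callable $predicate): Generator
{
    foreach ($items as $item) {
        if (!$predicate($item)) {
            return;
        }
        yield $item;
    }
}

// Hypothetical helper: skips items while the predicate holds, then yields the rest.
function drop_while(iterable $items, callable $predicate): Generator
{
    $dropping = true;
    foreach ($items as $item) {
        if ($dropping && $predicate($item)) {
            continue;
        }
        // Once one item fails the predicate, every later item is kept.
        $dropping = false;
        yield $item;
    }
}

$lessThanFive = function ($n) {
    return $n < 5;
};

$taken = iterator_to_array(take_while(range(1, 10), $lessThanFive), false);
$dropped = iterator_to_array(drop_while(range(1, 10), $lessThanFive), false);
// $taken is [1, 2, 3, 4]; $dropped is [5, 6, 7, 8, 9, 10].
```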

Zip

zip combines two collections and creates a collection of pairs (tuples) by taking one element from the first collection and the corresponding element from the second collection.

$names = ['John Doe', 'Peter', 'George'];
$ages = [30, 20, 35];

$combined = Collection::from($names)
    ->zip($ages)
    ->toArray();
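For comparison, native arrays can be paired up with array_map by passing null as the callback. This is a plain PHP sketch of the pairing idea only; the exact shape of Knapsack's zip result may differ, since its items are themselves collections.

```php
<?php
$names = ['John Doe', 'Peter', 'George'];
$ages = [30, 20, 35];

// With null as the callback, array_map builds one pair per position.
$pairs = array_map(null, $names, $ages);
// [['John Doe', 30], ['Peter', 20], ['George', 35]]
```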

Mapcat

mapcat is a really useful method, but many of us don't know how or when to use it. mapcat calls map on each item in the collection and flattens the result by one level. Let's see it through an example.

Imagine we want to create a list of distinct categories from a collection of articles.

$articles = Collection::from([
    ['content' => '...', 'categories' => ['php', 'javascript']],
    ['content' => '...', 'categories' => ['apache', 'mysql']],
    ['content' => '...', 'categories' => ['php7', 'php']],
]);

$categories = $articles->mapcat(function($article) {
    return $article['categories'];
})
->distinct()
->values()
->toArray();
// ["php","javascript","apache","mysql","php7"]

Here, mapcat goes through the articles collection and returns the categories of each article. The resulting collection will have duplicate values, so we are calling distinct on it.

Update: As Dušan Kasan commented, in this case we are calling values() for the below reason.

It's there because Knapsack preserves the original keys of the items, which in the example will result in all of the items having 0 as a key (since they were each at first position in the original arrays). So when you attempt to call toArray(), it will only have the last element, as the others will be overwritten, since they occupy the same index. The values() simply reindexes everything, so that toArray() can output what you would expect.
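A similar key collision can be reproduced with plain PHP generators, which may make the explanation easier to picture. This sketch does not use Knapsack; categories_of is a hypothetical helper, and the example only shows the analogous effect of preserved keys when a lazy sequence is converted to an array.

```php
<?php
// Hypothetical helper: lazily emits every category of every article.
function categories_of(array $articles): Generator
{
    foreach ($articles as $article) {
        // yield from keeps each inner array's own keys (0, 1, ...).
        yield from $article['categories'];
    }
}

$articles = [
    ['content' => '...', 'categories' => ['php', 'javascript']],
    ['content' => '...', 'categories' => ['apache', 'mysql']],
    ['content' => '...', 'categories' => ['php7', 'php']],
];

// Keeping the keys: later items overwrite earlier ones at keys 0 and 1.
$withKeys = iterator_to_array(categories_of($articles), true);
// [0 => 'php7', 1 => 'php']

// Ignoring the keys is the plain PHP counterpart of reindexing.
$reindexed = iterator_to_array(categories_of($articles), false);
// ['php', 'javascript', 'apache', 'mysql', 'php7', 'php']
```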

Example - Markov Chain Text Generator

Now let's see the fun part. Let's create a Markov Chain text generator using Knapsack. A Markov Chain is a process that moves from one state to another, where each state depends solely on the previous state. This technique can be used to generate random text from any given text. We are going to create a function which generates gibberish text that looks like the real thing.

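function markov_generator($text, $length)
{
    // Split the text into words and wrap them in a collection.
    $collection = Collection::from(explode(' ', $text));

    // Pair up neighbouring words, then group the pairs by their first word.
    $word_list = $collection->partition(2, 1)
        ->groupBy(function ($item) {
            return $item->first();
        })
        ->toArray();

    // Generator that starts at the first word and keeps picking a random successor.
    $markovGenerator = function ($word_list) {
        $current = key($word_list);
        yield $current;
        while ($current) {
            // Fall back to the first group if the current word has no entry in the mapping.
            $next = isset($word_list[$current]) ? $word_list[$current] : reset($word_list);
            // Pick a random pair and take its second word.
            $current = $next->shuffle()
                ->first()
                ->last();
            yield $current;
        }
    };

    $markov = Collection::from($markovGenerator($word_list));

    // Take only as many words as requested and join them with spaces.
    return implode(' ', $markov->take($length)->toArray());
}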

The markov_generator function takes a string and the number of words to be generated as input parameters and returns a random text of the given length. This function may not be the optimal implementation of a Markov Chain, but it demonstrates how we can use the Knapsack library to write clean and concise code. Let's look inside it.

To generate a Markov Chain, we need a mapping of words where each word X in the text has a list of the words that follow X (the possible transitions from state X).

First we split the input text using spaces. Then $collection->partition(2,1) partitions the collection into sub-collections of length two. For example, partitioning ["php", "is", "great"] will return [["php", "is"], ["is", "great"]]. The first parameter is the size of the sub-collection and the second is the step. Here we are taking one step at a time.

Now, using groupBy, we group the partitioned collection by the first item in each sub-collection. This generates the required mapping, in which we will look up the next word.
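To make the shape of that mapping concrete, here is the same idea as a plain PHP loop, without Knapsack. The function build_transitions and the sample sentence are hypothetical, made up for this illustration, and for simplicity the sketch stores only the following words rather than the full pairs.

```php
<?php
// Hypothetical helper: maps each word to the list of words that follow it.
function build_transitions(string $text): array
{
    $words = explode(' ', $text);
    $transitions = [];

    // Walk over every pair of neighbouring words (window of 2, step of 1).
    for ($i = 0; $i < count($words) - 1; $i++) {
        $transitions[$words[$i]][] = $words[$i + 1];
    }

    return $transitions;
}

$table = build_transitions('php is great and php is fun');
// [
//     'php'   => ['is', 'is'],
//     'is'    => ['great', 'fun'],
//     'great' => ['and'],
//     'and'   => ['php'],
// ]
```

A word that appears several times, like "is" here, ends up with several possible successors, and that is where the randomness of the generated text comes from.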

$markovGenerator is a Generator function which takes our mapping array and gives one word at a time. It first yields key($word_list), the first key of the mapping, which is the first word in the input string.

Inside the while loop, we look for the next word. $next will be a collection of word pairs from the input string whose first word is the current word. In order to randomize, we shuffle $next and take the first item. This is again a collection of two consecutive words that we generated earlier, so we take its second element using the last method.

We then wrap the Generator returned by $markovGenerator in a lazy collection using Collection::from. $markov can produce any number of words from the input text, but we are only taking the first $length words. Finally, we combine all the words to create our Markov text.

Disclaimer: This example is inspired by this talk by Michael Feathers.

Summary

Knapsack is a simple Collection library that brings the Collection Pipeline pattern to PHP. The Collection Pipeline technique not only makes our code more concise and cleaner but also encourages single responsibility. Instead of fitting everything inside a for loop, we can split the logic into small methods or functions. Taking full advantage of this library might be a bit challenging in the beginning, but you will start loving it once you see its real power.

We saw how we can easily create a Markov Chain text generator using Knapsack. What are the other complex problems that you would like to solve using this library?

Tags: php, Knapsack