Finding Median by Stream

This small waterfall is one of the many intermediate
operations of the stream. Photo taken in Algonquin, ON

The Java's DoubleStream class provides a few methods for common statistics including min(), max(), sum(), and average(). For average(), what does it actually mean? Here is the description in the specification: The average() method returns an OptionalDouble describing the arithmetic mean of elements of the stream, or an empty optional if the stream is empty.

While the arithmetic mean method is handy, we likely want a median method for different needs as well. The DoubleStream class doesn't provide a median() method but the sorted() method. We can use the sorted() method to assist in finding the median value.

Building the Median Function

Given each of existing min(), max(), and average() will return an OptionalDouble, having a median() method to return the same type is preferable. There are different ways to create a median method, and the goal here is to create one by functional programming with Stream.

Below is the ListStats class that can be instantiated by given it a List of Double's. This class will have a number of Lambda functions, including median, arithmeticMean, max, min, geometricMean and so on. The Lambda function of median is included here:

class ListStats {

    private final List<Double> doubleList;

    ListStats(List<Double> doubleList) {
        this.doubleList = Collections.unmodifiableList(doubleList);
    }

    Function<List<Double>, OptionalDouble> median = list -> {
        if (list.isEmpty() || list.contains(null))
            return OptionalDouble.empty();

        DoubleStream sortedStream = list.stream()
                .mapToDouble(Double::doubleValue)
                .sorted();

        return list.size() % 2 == 0 ?
                sortedStream.skip((list.size() / 2) - 1).limit(2).average() :
                sortedStream.skip(list.size() / 2).findFirst();
    };

    OptionalDouble findMedian() {
        return median.apply(doubleList);
    }
}

The Lambda function allows us to see clearly that the operation within the function only relies on the parameter given. In this case, the parameter is of type List<Double>. The operation does not rely on or mutate any variables outside of the function's local scope, and does not throw an Exception. This function can potentially be enabled for parallel Stream processing.

The findMedian() method is a wrapper that passes the parameter to the median Lambda function. Here is how to call to findMedian():

List<Double> doubleList = Arrays.asList(5d, 6d, 1d, 2d, 3d, 4d, 7d);

ListStats listStats = new ListStats(doubleList);
System.out.print("Median is: ");
listStats.findMedian()
        .ifPresentOrElse(System.out::println, () -> System.out.println("not available"));

Notice that ifPresentOrElse requires Java 9 or beyond. If your machine is still with Java 8, adjust accordingly.

Output from the above example:

Median is: 4.0

Sequential Processing vs. Parallel Processing

To see how much time can be saved by parallel processing, a large number of objects in the list is necessary. It won't be beneficial to enable parallel processing if there are only a small number of objects. This is due to the overhead cost of creating multiple threads internally to partition the main stream into sub-streams then process them. For a small number of objects, sequential processing can be faster.

The example list (5d, 6d, 1d, 2d, 3d, 4d, 7d) as shown in the code snippet above apparently has the median 4. The list is created for easy understanding but not for measuring how much time can be saved when parallel processing is enabled.

Let's create a list of 100 million unordered numbers, and have that pass to the median function.

Change the code

from:

List<Double> doubleList = Arrays.asList(5d, 6d, 1d, 2d, 3d, 4d, 7d);

to:

List<Double> doubleList = DoubleStream
        .generate(Math::random)
        .limit(100_000_000)
        .boxed()
        .collect(Collectors.toList());

This code generates a list of 100,000,000 numbers randomly by using Math::random for each number. By doing so, for each number x such that: 0 < x < 1.

The machine for measurement

Different machines will produce different measurement results due to the variety of hardware and software factors. All those factors are not important here. The comparison of sequential and parallel processing is on my PC with a CPU with multiple cores that is typical nowadays. We will see whether parallel is actually faster than sequential on the same machine.

I compiled all the codes first so that the .class files were ready to run. The measurement did not include the compile time.

Time required to complete the sequential processing

By running the above codes from main(), I got this result:

Median is: 0.50002787383783

It took about 18 seconds to get the result.

Time required to complete the parallel processing

Now, let's enable parallel stream in the median function by adding .parallel(), as shown here:

DoubleStream sortedStream = list.stream()
        .parallel()
        .mapToDouble(Double::doubleValue)
        .sorted();

By re-compiling the codes, then running the codes from main(), I got this result:

Median is: 0.4999533384298516

It took about 11 seconds to get the result. Given the generation of random numbers evenly distributed between 0 and 1, the median always approach 0.5.

Time Saved

On my machine, parallel processing was 7 seconds faster in this particular case. That was 39% faster than sequential.

Search This Blog

App Development Blog