lazysorted 0.1.1

Creator: bradpython12

Last updated:

Add to Cart

Description:

lazysorted 0.1.1

lazysorted is a Python extension module for sorting sequences lazily. It
presents the programmer with the abstraction that they are actually
working with a sorted list, when in fact the list is only physically
sorted when the programmer requests elements from it, and even then it
is only sorted partially, just enough to return whatever was requested.
The LazySorted object has a constructor that implements the same
interface as the builtin sorted(...) function, and it supports most
of the non-mutating methods of a python list.
Since the LazySorted object only sorts as much as necessary, it can be
faster than using the builtin sorted(...) for tasks that do not
require the entire data to be sorted, like:

Computing medians
Computing truncated
means
Quickly iterating through the first few sorted elements of a list
Computing the deciles or quartiles of some data


How to use it
You can use LazySorted in much the same way you use the sorted(...)
function and the python lists it produces:
from lazysorted import LazySorted
from math import floor, ceil


def median(xs):
"""An expected linear time median function"""
ls = LazySorted(xs)
n = len(ls)
if n == 0:
raise ValueError("Need a non-empty iterable")
elif n % 2 == 1:
return ls[n//2]
else:
return sum(ls[(n/2-1):(n/2+1)]) / 2.0


def top_k(xs, k, key=None, reverse=False):
"""Efficiently computes the top k elements of xs using the given key, or
the bottom k if reverse=True"""
ls = LazySorted(xs, key=key, reverse=reverse)
return ls[0:k]


def trimmed_mean(xs, alpha=0.05):
"""Computes the mean of the xs from the alpha to (1-alpha) quantiles
in expected linear time. More robust than the ordinary sample mean."""
if not 0 <= alpha < 0.5:
raise ValueError("alpha must be in [0, 0.5)")

ls = LazySorted(xs)
n = len(ls)
if n == 0:
raise ValueError("Need a non-empty iterable")
lower = int(floor(n * alpha))
upper = int(ceil(n * (1 - alpha)))
return sum(ls.between(lower, upper)) / (upper - lower)
In addition to the __len__ and __getitem__ methods demostrated
above, LazySorted also supports the __iter__, __contains__,
index, and count methods, just like a regular python list:
>>> import random
>>> from lazysorted import LazySorted
>>> xs = list(range(1000)) + 5 * [1234]
>>> random.shuffle(xs)
>>> ls = LazySorted(xs)
>>> for x in ls:
... print(x)
... if x >= 3:
... break
0
1
2
3
>>> 1235 in ls
False
>>> ls.index(821)
821
>>> ls.count(1234)
5
Although the LazySorted constructor pretends to be equivalent to the
sorted function, and the LazySorted object pretends to be equivalent
to a sorted python list, there are a few differences between them:

LazySorted objects are immutable, while python lists are not.
Sorting with the builtin sorted function is guaranteed to be
stable, (ie, preserve the original order of elements that compare
equal), while LazySorted sorting is not stable.
The LazySorted object has a between(i, j) method, which returns a
list of all the items whose sorted indices are in range(i, j),
but not necessarily in order. This is useful, for example, for
throwing away outliers when computing an alpha-trimmed mean.

When the APIs differ between python2.x and python3.x, lazysorted
implements the python3.x version. So the LazySorted constructor does not
support the cmp argument that was removed in python3.x, and the
LazySorted object does not support the __getslice__ method that was
also removed in python3.x.
All of the LazySorted methods have pretty good documentation, which can
be accessed through the builtin help(...) function.
I’ve tested lazysorted and found it to work for CPython versions 2.5,
2.6, 2.7, and 3.1, 3.2, and 3.3. I haven’t tested 3.0.


How it works
In short, LazySorted works by using quicksort partitions lazily and
keeping track of the indices used as pivots.
`quicksort <http://en.wikipedia.org/wiki/Quicksort>`_ sorts a list
by picking an element of the list to be the “pivot”, and then
partitioning the data into the part that’s greater than or equal to the
pivot and the part that’s less than the pivot. These two parts are then
recursively sorted with quicksort,
`quickselect <http://en.wikipedia.org/wiki/Quickselect>`_ finds the
kth smallest element of a list by picking a pivot element and
partitioning the data, as in quicksort. Then the algorithm recurses into
the larger or smaller part of the list, depending on whether k is larger
or smaller than the index of the pivot element.
There are two key observations to make from these algorithms: First of
all, if we are only interested in part of a sorted list, we only need to
recurse into the part we are interested in after doing a partition.
Second of all, after doing some partitions, the list is partially
sorted, with the pivots all in their sorted order and the elements
between two pivots guaranteed to be bigger than the pivot to their left
and smaller than the pivot to their right.
So whenever some data is queried from a LazySorted object, we first look
through the pivots to see which pivots delimit the data we want. Then we
partition sublist(s) as necessary and recurse into the side(s) that our
data is in.
There are also some implementation details that help lazysorted to run
quickly: First of all, pivots elements are chosen to be the median of
three randomly elements, which makes the partition likely to be more
balanced and guarantees average case O(n log n) behavior.
Second of all, for sufficiently small lists, lazysorted uses insertion
sort instead of quicksort, which is faster on small lists. Both of these
tricks are well-known to speed up quicksort implementations.
Thirdly, since it’s important to find the pivots that bound an index
quickly, lazysorted stores the pivots in a binary search tree, so that
these sorts of lookups occur in O(log n) expected time. The BST
lazysorted uses is a Treap,
selected for its overall expected speed, especially in insertion and
deletion.
lazysorted also makes a big effort to delete irrelevant pivots from the
BST; for example, if there are three pivots at indices 5, 26, and 42,
and both the data (between 5 and 26) and (between 26 and 42) is sorted,
then we can remove the irrelevant pivot 26, and just say that the data
between indices 5 and 42 is sorted.


Installation
lazysorted requires the python headers, (Python.h). I believe they ship
with OSX, but if you don’t have them they can be installed on a
debian-like system with
$ sudo apt-get install python-dev
Then you can install lazysorted with
$ sudo python setup.py install
Alternatively, you can install lazysorted from pypi with
$ easy_install --user lazysorted
or
$ pip install lazysorted
though you’ll still need the python headers for it to build properly.


Testing
I’ve put in a fair bit of effort to test that lazysorted actually does
what it’s supposed to. You can test it yourself (after installing it)
with
$ python test.py


FAQ
Doesn’t numpy have a median and percentile function?
Yes, but it’s implemented by sorting the entire array and then reading
off the requested values, not with quickselect or another O(n) selection
algorithm. And LazySorted is empirically faster, as you can see from
benchmark.py
Isn’t python3.4 going to have a statistics module with a median
function?
Yes, and I’m really excited about it! This is
PEP450. Unfortunately, the
current implementation is in pure python, and computes the median by
sorting the data and picking off the middle element.
Doesn’t the standard library have a heapq module?
Yes, but it lacks the full generality of this module. For example, you
can use it to get the k smallest elements in O(n log k) time, but not k
arbitrary contiguous elements. This module represents a different
paradigm: you’re allowed to program as if your list was sorted, and let
the data structure deal with the details.
How is lazysorted licensed?
lazysorted is BSD-licensed. So you can use it pretty much however you
like! See LICENSE for details.
What should I not use lazysorted for?

Applications requiring a stable sort; the quicksort partitions make
the order of equal elements in the sorted list undefined.
Applications requiring guaranteed fast worst-case performance.
Although it’s very unlikely, many operations in LazySorted run in
worst case O(n^2) time.
Applications requiring high security. The random number generator is
insecure and seeded from system time, so an (ambitious) attacker
could reverse engineer the random number generator and feed
LazySorted pathological lists that make it run in O(n^2) time.
Sorting entire lists: The builtin sorted(...) is very
impressively designed and implemented. It also has the advantage of
running faster than O(n log n) on lists with partial structure.

How does lazysorted work at scale?
Unfortunately, only okay. This turns out to be primarily due to the fact
that CPython deals with python objects by passing around pointers to
them, causing cache misses when the list and its elements no longer fit
in cache. The gory details can be found in a blog post I wrote about
Memory Locality and Python
Objects.
However, this effect doesn’t kick in until lists grow larger than about
100K values, and even past that lazysorted remains faster than complete
sorting.


Contact me!
If you use this software and feel so inclined, I’d greatly appreciate
hearing what you are using it for! You can hit me up on Twitter
@naftaliharris, or at my email
address on my contact page.

License

For personal and professional use. You cannot resell or redistribute these repositories in their original state.

Customer Reviews

There are no reviews.