Flatten A Nested List Of Variable Sized Sublists Into A Scipy Array
How can I use numpy/scipy to flatten a nested list with sublists of different sizes? Speed is very important and the lists are large. lst = [[1, 2, 3, 4],[2, 3],[1, 2, 3, 4, 5],[4
Solution 1:
How about np.fromiter:
In [49]: %timeit np.hstack(lst*1000)
10 loops, best of 3: 25.2 ms per loop
In [50]: %timeit np.array(list(chain.from_iterable(lst*1000)))
1000 loops, best of 3: 1.81 ms per loop
In [52]: %timeit np.fromiter(chain.from_iterable(lst*1000), dtype='int')
1000 loops, best of 3: 1 ms per loop
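The session above does not show its imports or the definition of lst, so here is a minimal self-contained sketch of the fastest variant. The sample list is assumed to be the one from the hstack example in Solution 2, and np.int64 is chosen only for illustration:

```python
from itertools import chain

import numpy as np

# Nested list with sublists of different lengths (assumed sample data).
lst = [[1, 2, 3, 4], [2, 3], [1, 2, 3, 4, 5], [4, 1, 2]]

# chain.from_iterable lazily yields every element of every sublist;
# np.fromiter consumes that iterator directly into a 1-D array,
# without building an intermediate Python list.
flat = np.fromiter(chain.from_iterable(lst), dtype=np.int64)
print(flat)
```

Note that np.fromiter needs an explicit dtype, so this approach fits best when all elements share one numeric type.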
Solution 2:
You can try numpy.hstack
>>> lst = [[1, 2, 3, 4],[2, 3],[1, 2, 3, 4, 5],[4, 1, 2]]
>>> np.hstack(lst)
array([1, 2, 3, 4, 2, 3, 1, 2, 3, 4, 5, 4, 1, 2])
Solution 3:
The fastest way to create a numpy array from an iterator is to use numpy.fromiter:
>>> %timeit numpy.fromiter(itertools.chain.from_iterable(lst), numpy.int64)
100000 loops, best of 3: 3.76 us per loop
>>> %timeit numpy.array(list(itertools.chain.from_iterable(lst)))
100000 loops, best of 3: 14.5 us per loop
>>> %timeit numpy.hstack(lst)
10000 loops, best of 3: 57.7 us per loop
As you can see, this is faster than converting to a list, and much faster than hstack.
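If the total number of elements is cheap to compute, numpy.fromiter also accepts a count argument, which lets it allocate the output array once instead of growing it on demand. The timings above do not cover this variant, so treat the following as a sketch to benchmark on your own data rather than a guaranteed win (the sample list is assumed):

```python
import itertools

import numpy as np

# Assumed sample data: sublists of different lengths.
lst = [[1, 2, 3, 4], [2, 3], [1, 2, 3, 4, 5], [4, 1, 2]]

# Total element count, computed up front so fromiter can
# preallocate the result.
total = sum(len(sub) for sub in lst)

flat = np.fromiter(
    itertools.chain.from_iterable(lst),
    dtype=np.int64,
    count=total,
)
print(flat.shape)
```

The extra pass over lst to compute total costs one len() call per sublist, which is usually small compared with iterating over every element.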
Solution 4:
How about trying:
np.hstack(lst)
Solution 5:
Use chain.from_iterable:
vec = sp.array(list(chain.from_iterable(lst)))
This avoids using * (argument unpacking), which is quite expensive to handle if the iterable has many sublists.
Another option might be to sum the lists:
vec = sp.array(sum(lst, []))
Note however that this will cause quadratic reallocation. Something like this performs much better:
def sum_lists(lst):
    if len(lst) < 2:
        return sum(lst, [])
    else:
        half_length = len(lst) // 2
        return sum_lists(lst[:half_length]) + sum_lists(lst[half_length:])
On my machine I get:
>>> L = [[random.randint(0, 500) for _ in range(x)] for x in range(10, 510)]
>>> timeit.timeit('sum(L, [])', 'from __main__ import L', number=1000)
168.3029818534851
>>> timeit.timeit('sum_lists(L)', 'from __main__ import L,sum_lists', number=1000)
10.248489141464233
>>> 168.3029818534851 / 10.248489141464233
16.422223757114615
As you can see, a 16x speed-up. The chain.from_iterable is even faster:
>>> timeit.timeit('list(itertools.chain.from_iterable(L))', 'import itertools; from __main__ import L', number=1000)
1.905594825744629
>>> 10.248489141464233 / 1.905594825744629
5.378105042586658
Another 5x speed-up.
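To reproduce this comparison, a script along the following lines could be used. The input is deliberately smaller and the repetition count lower than in the session above (both are assumptions made to keep the runtime short), so the absolute numbers and exact ratios will differ:

```python
import itertools
import random
import timeit


def sum_lists(lst):
    """Concatenate sublists by recursively splitting the list in half."""
    if len(lst) < 2:
        return sum(lst, [])
    half_length = len(lst) // 2
    return sum_lists(lst[:half_length]) + sum_lists(lst[half_length:])


# Smaller than the L used above (assumption) so the script finishes quickly.
L = [[random.randint(0, 500) for _ in range(x)] for x in range(10, 110)]

candidates = {
    "sum(L, [])": lambda: sum(L, []),
    "sum_lists(L)": lambda: sum_lists(L),
    "chain.from_iterable": lambda: list(itertools.chain.from_iterable(L)),
}

# Check that all three approaches produce the same flat list.
expected = [x for sub in L for x in sub]
results = {name: fn() for name, fn in candidates.items()}
assert all(result == expected for result in results.values())

# Time each approach; timeit accepts a zero-argument callable.
for name, fn in candidates.items():
    seconds = timeit.timeit(fn, number=20)
    print(f"{name:22s} {seconds:.4f} s")
```

Checking the outputs for equality before timing guards against benchmarking a variant that silently returns something different.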
I looked for a "pure-Python" solution, not knowing numpy. I believe Abhijit/unutbu/senderle's solution is the way to go in your case.