Numpy.concatenate On Record Arrays Fails When Array Has Different Length Strings

When concatenating record arrays that share a string field whose length differs between the arrays, the concatenation fails. As you can see in the following example, concatenate works

Solution 1:

To post a complete answer: as Pierre GM suggested, the module

import numpy.lib.recfunctions

gives a solution. The function that does what you want, however, is:

numpy.lib.recfunctions.stack_arrays((a,b), autoconvert=True, usemask=False)

(usemask=False just avoids creating a masked array, which you are probably not using. The important part is autoconvert=True, which forces the conversion of a's dtype from "|S3" to "|S5".)
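As a minimal sketch of the call above: the arrays a and b here are stand-ins built for illustration (the question's original arrays are not shown), and they use unicode fields ("U3", "U5") as Python 3 would produce, rather than the byte strings "|S3" and "|S5" mentioned in the answer.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Illustrative inputs (assumed): same field names, different string widths.
a = np.array([(1, "one"), (2, "two")],
             dtype=[("f0", "i4"), ("f1", "U3")])
b = np.array([(3, "three"), (4, "four"), (5, "three")],
             dtype=[("f0", "i4"), ("f1", "U5")])

# autoconvert=True lets the shared field take the wider of the two dtypes;
# usemask=False returns a plain ndarray instead of a masked array.
stacked = rfn.stack_arrays((a, b), autoconvert=True, usemask=False)

print(stacked.dtype)
print(stacked)
```

If a record array is preferred over a plain structured array, stack_arrays also accepts asrecarray=True.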


Solution 2:

Would numpy.lib.recfunctions.merge_arrays work for you? recfunctions is a little-known subpackage that hasn't been advertised much. It's a bit clunky, but it can be useful sometimes.


Solution 3:

When you do not specify the dtype, np.rec.fromarrays (aka np.core.records.fromarrays) tries to guess the dtype for you. Hence,

In [4]: a = np.core.records.fromarrays( ([1,2], ["one","two"]) )

In [5]: a
Out[5]: 
rec.array([(1, 'one'), (2, 'two')], 
      dtype=[('f0', '<i4'), ('f1', '|S3')])

Notice the dtype of the f1 column is a 3-byte string.

np.concatenate( (a,b) ) fails because numpy sees that the dtypes of a and b are different, and it does not widen the dtype of the shorter string to match the longer one.
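The mismatch described above can also be reconciled by hand. The following is only a sketch of a different technique from the one in this answer: it casts the narrower array to the wider dtype with astype before concatenating. The arrays are illustrative stand-ins, and the approach assumes both arrays have the same field names in the same order and that b holds the wider string field. Newer NumPy releases may handle this promotion on their own, so the cast may be unnecessary there.

```python
import numpy as np

# Illustrative inputs (assumed): the dtypes differ only in string width.
a = np.array([(1, "one"), (2, "two")],
             dtype=[("f0", "i4"), ("f1", "U3")])
b = np.array([(3, "three"), (4, "four"), (5, "three")],
             dtype=[("f0", "i4"), ("f1", "U5")])

# The two dtypes are not equal, which is what trips up the concatenation.
dtypes_match = (a.dtype == b.dtype)

# Cast the narrower array to the wider dtype, then concatenate as usual.
a_wide = a.astype(b.dtype)
combined = np.concatenate((a_wide, b))

print(dtypes_match)
print(combined)
```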

If you know a maximum string size that would work with all your arrays, you could specify the dtype from the beginning:

In [9]: a = np.rec.fromarrays( ([1,2], ["one","two"]), dtype = [('f0', '<i4'), ('f1', '|S8')])

In [10]: b = np.core.records.fromarrays( ([3,4,5], ["three","four","three"]), dtype = [('f0', '<i4'), ('f1', '|S8')])

and then concatenation will work as desired:

In [11]: np.concatenate( (a,b))
Out[11]: 
array([(1, 'one'), (2, 'two'), (3, 'three'), (4, 'four'), (5, 'three')], 
      dtype=[('f0', '<i4'), ('f1', '|S8')])
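When the maximum is not hard-coded but all the input strings are available up front, the width can be computed first. This is a sketch under that assumption; the variable names (names_a, names_b, width) are illustrative, and it uses a unicode field ("U") rather than the "|S8" shown above. Picking a width that is too small truncates the longer strings without raising an error, so it is safer to measure than to guess.

```python
import numpy as np

# Illustrative inputs (assumed): the raw columns for each array.
ids_a, names_a = [1, 2], ["one", "two"]
ids_b, names_b = [3, 4, 5], ["three", "four", "three"]

# Width of the longest string across every input.
width = max(len(s) for s in names_a + names_b)

# One shared dtype, so both record arrays are directly compatible.
common = [("f0", "<i4"), ("f1", "U%d" % width)]
a = np.rec.fromarrays([ids_a, names_a], dtype=common)
b = np.rec.fromarrays([ids_b, names_b], dtype=common)

combined = np.concatenate((a, b))
print(combined)
```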

If you do not know in advance the maximum length of the strings, you could specify the dtype as 'object':

In [35]: a = np.core.records.fromarrays( ([1,2], ["one","two"]), dtype = [('f0', '<i4'), ('f1', 'object')])

In [36]: b = np.core.records.fromarrays( ([3,4,5], ["three","four","three"]), dtype = [('f0', '<i4'), ('f1', 'object')])

In [37]: np.concatenate( (a,b))
Out[37]: 
array([(1, 'one'), (2, 'two'), (3, 'three'), (4, 'four'), (5, 'three')], 
      dtype=[('f0', '<i4'), ('f1', '|O4')])

This will not be as space-efficient as a dtype of '|Sn' (for some integer n), but it will at least let you perform the concatenation.

