Numpy.concatenate On Record Arrays Fails When Array Has Different Length Strings

June 19, 2022

When concatenating record arrays that share a string field whose length differs between the arrays, np.concatenate fails. Concatenation works only when the string fields have the same length.

Solution 1:

To post a complete answer: as Pierre GM suggested, the module

import numpy.lib.recfunctions

gives a solution. The function that does what you want, however, is:

numpy.lib.recfunctions.stack_arrays((a,b), autoconvert=True, usemask=False)

(usemask=False just avoids creating a masked array, which you are probably not using. The important part is autoconvert=True, which forces the conversion from a's dtype "|S3" to "|S5".)
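A minimal sketch of that call, assuming Python 3, where the inferred string dtype is a Unicode type such as '<U3' rather than '|S3'. The arrays a and b here are illustrative stand-ins for the ones in the question:

```python
import numpy as np
import numpy.lib.recfunctions as rfn

# Illustrative arrays: the string field f1 is 3 characters wide in a, 5 in b.
a = np.rec.fromarrays(([1, 2], ["one", "two"]))
b = np.rec.fromarrays(([3, 4, 5], ["three", "four", "three"]))

# autoconvert=True widens f1 to the larger of the two string dtypes;
# usemask=False returns a plain ndarray instead of a masked array.
stacked = rfn.stack_arrays((a, b), autoconvert=True, usemask=False)

print(stacked.dtype)
print(stacked["f1"].tolist())
```

The field f1 of the result should carry the wider of the two string dtypes, so no value is truncated.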

Solution 2:

Would numpy.lib.recfunctions.merge_arrays work for you? recfunctions is a little-known subpackage that hasn't been advertised much. It's a bit clunky, but it can be useful sometimes.

Solution 3:

When you do not specify the dtype, np.rec.fromarrays (aka np.core.records.fromarrays) tries to guess the dtype for you. Hence,

In [4]: a = np.core.records.fromarrays( ([1,2], ["one","two"]) )

In [5]: a
Out[5]: 
rec.array([(1, 'one'), (2, 'two')], 
      dtype=[('f0', '<i4'), ('f1', '|S3')])

Notice the dtype of the f1 column is a 3-byte string.

If b is a second record array whose f1 column holds longer strings, np.concatenate( (a,b) ) fails because numpy sees that the dtypes of a and b are different and does not widen the shorter string dtype to match the longer one.
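The mismatch can be checked directly. The following is a sketch under assumptions: it uses Python 3 (so the inferred field is Unicode rather than '|S3'), b is an illustrative second array, and the try/except is there because the outcome of np.concatenate on mismatched structured dtypes may differ between NumPy versions:

```python
import numpy as np

a = np.rec.fromarrays(([1, 2], ["one", "two"]))
b = np.rec.fromarrays(([3, 4, 5], ["three", "four", "three"]))

# The inferred widths of f1 differ, so the two record dtypes are not equal.
print(a.dtype["f1"], b.dtype["f1"])
print(a.dtype == b.dtype)

try:
    joined = np.concatenate((a, b))
except TypeError as exc:
    # The failure described in the question.
    joined = None
    print("concatenate failed:", exc)
```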

If you know a maximum string size that would work with all your arrays, you could specify the dtype from the beginning:

In [9]: a = np.rec.fromarrays( ([1,2], ["one","two"]), dtype = [('f0', '<i4'), ('f1', '|S8')])

In [10]: b = np.core.records.fromarrays( ([3,4,5], ["three","four","three"]), dtype = [('f0', '<i4'), ('f1', '|S8')])

and then concatenation will work as desired:

In [11]: np.concatenate( (a,b))
Out[11]: 
array([(1, 'one'), (2, 'two'), (3, 'three'), (4, 'four'), (5, 'three')], 
      dtype=[('f0', '<i4'), ('f1', '|S8')])
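For Python 3, the same idea can be sketched with a fixed-width Unicode field. Using 'U8' in place of '|S8' is an assumption made here so that str values are stored without a bytes conversion:

```python
import numpy as np

# Shared dtype chosen up front; 'U8' holds strings of up to 8 characters.
common = [("f0", "<i4"), ("f1", "U8")]

a = np.rec.fromarrays(([1, 2], ["one", "two"]), dtype=common)
b = np.rec.fromarrays(([3, 4, 5], ["three", "four", "three"]), dtype=common)

# Both arrays have identical dtypes, so concatenation needs no conversion.
joined = np.concatenate((a, b))
print(joined.dtype)
print(joined["f1"].tolist())
```

Strings longer than the chosen width would be truncated on assignment, so the width has to cover the longest value you expect.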

If you do not know in advance the maximum length of the strings, you could specify the dtype as 'object':

In [35]: a = np.core.records.fromarrays( ([1,2], ["one","two"]), dtype = [('f0', '<i4'), ('f1', 'object')])

In [36]: b = np.core.records.fromarrays( ([3,4,5], ["three","four","three"]), dtype = [('f0', '<i4'), ('f1', 'object')])

In [37]: np.concatenate( (a,b))
Out[37]: 
array([(1, 'one'), (2, 'two'), (3, 'three'), (4, 'four'), (5, 'three')], 
      dtype=[('f0', '<i4'), ('f1', '|O4')])

This will not be as space-efficient as a dtype of '|Sn' (for some integer n), but at least it will allow you to perform the concatenate operation.
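A runnable sketch of the object-dtype variant, written with np.rec.fromarrays and assuming Python 3:

```python
import numpy as np

# The object field stores references to Python strings of any length.
flexible = [("f0", "<i4"), ("f1", object)]

a = np.rec.fromarrays(([1, 2], ["one", "two"]), dtype=flexible)
b = np.rec.fromarrays(([3, 4, 5], ["three", "four", "three"]), dtype=flexible)

joined = np.concatenate((a, b))
print(joined["f1"].tolist())
```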
