How To Select Some Rows From Sparse Matrix Then Use Them Form A New Sparse Matrix

I have a very large sparse matrix (100000 columns and 100000 rows). I want to select some of the rows of this sparse matrix and then use them to form a new sparse matrix. I tried to

Solution 1:

I added some tags that would have helped me see your question sooner.

When asking about an error, it's a good idea to provide some or all of the traceback, so we can see where the error is occurring. Information on the inputs to the problem function call can also help.

Fortunately I can recreate the problem fairly easily, and with a reasonably sized example. No need to make a 100000 x 100000 matrix that no one can look at!

Make a modest size sparse matrix (assuming the usual import numpy as np and from scipy import sparse):

In [126]: M = sparse.random(10,10,.1,'csr')                                                              
In [127]: M                                                                                              
Out[127]: 
<10x10 sparse matrix of type '<class 'numpy.float64'>'
    with 10 stored elements in Compressed Sparse Row format>

I can do a whole matrix row sum, just as with a dense array. The sparse code actually uses matrix-vector multiplication to do this, producing a dense matrix.

In [128]: M.sum(axis=1)                                                                                  
Out[128]: 
matrix([[0.59659958],
        [0.80390719],
        [0.37251645],
        [0.        ],
        [0.85766909],
        [0.42267366],
        [0.76794737],
        [0.        ],
        [0.83131054],
        [0.46254634]])

It's sparse enough that some rows have no nonzero values. With floats, especially in the 0-1 range, I'm not going to get rows where the nonzero values cancel out.
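To make the row-sum step reproducible without random data, here is a minimal sketch with a small hand-built matrix. The values and the name row_sums are mine, chosen only for illustration. It flattens the dense (n, 1) result of sum(axis=1) into a 1-D array, which is easier to compare and index with later.

```python
import numpy as np
from scipy import sparse

# Hand-built 4x5 CSR matrix (illustrative values); rows 1 and 3 are left empty.
data = np.array([1.0, 2.0, 3.0])
rows = np.array([0, 0, 2])
cols = np.array([1, 4, 2])
M = sparse.csr_matrix((data, (rows, cols)), shape=(4, 5))

# sum(axis=1) gives a dense (4, 1) result; flatten it to a 1-D array.
row_sums = np.asarray(M.sum(axis=1)).ravel()
print(row_sums)
```

The empty rows show up as exact zeros in row_sums, so a simple comparison against 0 is enough to find them.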

Or using your row by row calculation:

In [133]: alist = [np.sum(row.toarray()[0]) for row in M]                                                
In [134]: alist                                                                                          
Out[134]: 
[0.5965995802776853,
 0.8039071870427961,
 0.37251644566924424,
 0.0,
 0.8576690924353791,
 0.42267365715276595,
 0.7679473651419432,
 0.0,
 0.8313105376003095,
 0.4625463360625408]

And selecting the rows that do sum to zero (in this case empty ones):

In [135]: alist = [row for row in M if np.sum(row.toarray()[0])==0]
In [136]: alist
Out[136]: 
[<1x10 sparse matrix of type '<class 'numpy.float64'>'
    with 0 stored elements in Compressed Sparse Row format>,
 <1x10 sparse matrix of type '<class 'numpy.float64'>'
    with 0 stored elements in Compressed Sparse Row format>]

Note that this is a list of sparse matrices. That's what you got too, right?

Now if I try to make a matrix from that, I get your error:

In [137]: sparse.csr_matrix(alist)                                                                       
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-137-5e20e6fc2524> in <module>
----> 1 sparse.csr_matrix(alist)

/usr/local/lib/python3.6/dist-packages/scipy/sparse/compressed.py in __init__(self, arg1, shape, dtype, copy)
     86                                  "".format(self.format))
     87             from .coo import coo_matrix
---> 88             self._set_self(self.__class__(coo_matrix(arg1, dtype=dtype)))
     89 
     90         # Read matrix dimensions given, if any

/usr/local/lib/python3.6/dist-packages/scipy/sparse/coo.py in __init__(self, arg1, shape, dtype, copy)
    189                                          (shape, self._shape))
    190 
--> 191                 self.row, self.col = M.nonzero()
    192                 self.data = M[self.row, self.col]
    193                 self.has_canonical_format = True

/usr/local/lib/python3.6/dist-packages/scipy/sparse/base.py in __bool__(self)
    285             return self.nnz != 0
    286         else:
--> 287             raise ValueError("The truth value of an array with more than one "
    288                              "element is ambiguous. Use a.any() or a.all().")
    289     __nonzero__ = __bool__

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().

OK, this error doesn't tell me a whole lot (at least without more reading of the code), but it's clearly having problems with the input list. But read the csr_matrix docs again! Do they say we can give it a list of sparse matrices?
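For comparison, here is a short sketch of input forms that, as far as I know, the csr_matrix constructor does document: a dense 2-D array, another sparse matrix, a (data, (row, col)) tuple, and a bare shape. The variable names and values are just for illustration. A list of sparse matrices is not one of these forms.

```python
import numpy as np
from scipy import sparse

dense = np.array([[0.0, 1.0],
                  [2.0, 0.0]])

A = sparse.csr_matrix(dense)   # from a dense 2-D array
B = sparse.csr_matrix(A)       # from another sparse matrix
# from (data, (row_indices, col_indices)) plus an explicit shape
C = sparse.csr_matrix(([1.0, 2.0], ([0, 1], [1, 0])), shape=(2, 2))
D = sparse.csr_matrix((2, 2))  # empty matrix of the given shape

print(A.nnz, B.nnz, C.nnz, D.nnz)
```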

But there is a sparse.vstack function that does work with a list of matrices (modeled on np.vstack):

In [140]: sparse.vstack(alist)                                                                           
Out[140]: 
<2x10 sparse matrix of type '<class 'numpy.float64'>'
    with 0 stored elements in Compressed Sparse Row format>

We get more interesting results if we select the rows that don't sum to zero:

In [141]: alist = [row for row in M if np.sum(row.toarray()[0])!=0]
In [142]: M1=sparse.vstack(alist)                                                                        
In [143]: M1                                                                                             
Out[143]: 
<8x10 sparse matrix of type '<class 'numpy.float64'>'
    with 10 stored elements in Compressed Sparse Row format>
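The same list-then-stack approach as a standalone sketch, again on a small hand-built matrix (the values and the names selected and M1 are only for illustration). It slices rows with M[i, :] instead of iterating over M directly, and asks vstack for CSR output explicitly.

```python
import numpy as np
from scipy import sparse

# 4x5 CSR matrix with rows 1 and 3 empty (illustrative values).
data = np.array([1.0, 2.0, 3.0])
rows = np.array([0, 0, 2])
cols = np.array([1, 4, 2])
M = sparse.csr_matrix((data, (rows, cols)), shape=(4, 5))

# Collect the 1x5 row slices whose sum is nonzero.
selected = [M[i, :] for i in range(M.shape[0]) if M[i, :].sum() != 0]

# Stack the list of sparse rows into one sparse matrix.
M1 = sparse.vstack(selected, format="csr")
print(M1.shape, M1.nnz)
```

This works, but it still loops over rows in Python, so for a matrix with 100000 rows it will be slow compared with indexing the matrix directly.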

But I showed before that we can get the row sums without iterating. Applying where to Out[128], I get the row indices (of the nonzero rows):

In [151]: idx=np.where(M.sum(axis=1))                                                                    
In [152]: idx                                                                                            
Out[152]: (array([0, 1, 2, 4, 5, 6, 8, 9]), array([0, 0, 0, 0, 0, 0, 0, 0]))
In [153]: M2=M[idx[0],:]                                                                                 
In [154]: M2                                                                                             
Out[154]: 
<8x10 sparse matrix of type '<class 'numpy.float64'>'
    with 10 stored elements in Compressed Sparse Row format>
In [155]: np.allclose(M1.A, M2.A)
Out[155]: True
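Putting the no-iteration version into a reusable form, here is a minimal sketch. select_rows is a hypothetical helper name of mine, not a scipy function. It also shows one caveat: a row whose values cancel sums to zero even though it is not empty, so if "empty" is what you mean, counting stored elements per row with np.diff(M.indptr) may be the safer test (it counts explicitly stored zeros too).

```python
import numpy as np
from scipy import sparse

def select_rows(M, mask):
    """Hypothetical helper: rows of M where mask is True, as a CSR matrix."""
    M = sparse.csr_matrix(M)       # make sure row indexing is efficient
    idx = np.flatnonzero(mask)     # boolean mask -> integer row indices
    return M[idx, :]

# Row 0 sums to 3, rows 1 and 3 are empty, row 2 holds 1 and -1 (sums to 0).
data = np.array([1.0, 2.0, 1.0, -1.0])
rows = np.array([0, 0, 2, 2])
cols = np.array([1, 4, 0, 3])
M = sparse.csr_matrix((data, (rows, cols)), shape=(4, 5))

# Select by row sum: drops the cancelling row 2 as well as the empty rows.
row_sums = np.asarray(M.sum(axis=1)).ravel()
M2 = select_rows(M, row_sums != 0)

# Select by number of stored elements per row: keeps row 2.
stored_per_row = np.diff(M.indptr)
M3 = select_rows(M, stored_per_row > 0)

print(M2.shape, M3.shape)
```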

====

I suspect the In [137] error was produced while trying to find the nonzero (np.where) elements of the input, or of the input cast as a numpy array:

In [159]: alist = [row for row in M if np.sum(row.toarray()[0])==0]
In [160]: np.array(alist)
Out[160]: 
array([<1x10 sparse matrix of type '<class 'numpy.float64'>'
    with 0 stored elements in Compressed Sparse Row format>,
       <1x10 sparse matrix of type '<class 'numpy.float64'>'
    with 0 stored elements in Compressed Sparse Row format>], dtype=object)
In [161]: np.array(alist).nonzero()                                                                      
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-161-832a25987c15> in <module>
----> 1 np.array(alist).nonzero()

/usr/local/lib/python3.6/dist-packages/scipy/sparse/base.py in __bool__(self)
    285             return self.nnz != 0
    286         else:
--> 287             raise ValueError("The truth value of an array with more than one "
    288                              "element is ambiguous. Use a.any() or a.all().")
    289     __nonzero__ = __bool__

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().

np.array on a list of sparse matrices produces an object dtype array of those matrices. Calling nonzero on it asks each matrix for its truth value, and that is what raises the error.
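A small sketch of both pieces, using a hand-built matrix (values and names are only for illustration): the object dtype that np.array produces, and the ValueError that a sparse matrix with more than one element raises when something asks for its truth value.

```python
import numpy as np
from scipy import sparse

M = sparse.csr_matrix(np.array([[0.0, 1.0, 0.0],
                                [0.0, 0.0, 0.0],
                                [2.0, 0.0, 3.0]]))
row_list = [M[0, :], M[1, :]]   # list of 1x3 sparse matrices

# numpy does not unpack the matrices; it stores them as opaque objects.
arr = np.array(row_list)
arr_dtype = arr.dtype

# Asking a 1x3 sparse matrix for its truth value is ambiguous and raises.
try:
    bool(row_list[0])
    message = None
except ValueError as exc:
    message = str(exc)

print(arr_dtype, message)
```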
