
Keep CSV Feature Labels for LDA/PCA

I am trying to use the 2000 topics' top-20 frequency data at https://github.com/wwbp/facebook_topics/tree/master/csv. I would like to perform RandomizedPCA on the data. From the doc

Solution 1:

PCA doesn't discard or retain individual features, and the resulting components don't map back to the original features either. (Given features x, y, z and n_components=2, the two resulting components won't correspond to any one of x, y, or z exactly; each is a weighted combination of all three.) If you want to retain feature names as part of dimensionality reduction, you might want to explore other approaches (sklearn has a whole section on feature selection for this).
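To make the "components are combinations, not features" point concrete, here is a minimal sketch assuming the CSV has already been loaded into a pandas DataFrame of numeric topic-frequency columns (the column names and values below are hypothetical, not taken from the linked data):

```python
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical numeric topic-frequency features.
df = pd.DataFrame({
    "topic_1": [0.2, 0.1, 0.4],
    "topic_2": [0.3, 0.5, 0.1],
    "topic_3": [0.5, 0.4, 0.5],
})

pca = PCA(n_components=2)
scores = pca.fit_transform(df)

# Each component is a weighted combination of *all* input features, so the
# original labels survive only as loading weights, never as 1:1 mappings.
loadings = pd.DataFrame(pca.components_, columns=df.columns, index=["PC1", "PC2"])
print(loadings)
```

Inspecting the loadings table is usually the closest you can get to "keeping" the CSV feature labels after PCA.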

Chuck Ivan is correct that an encoder or vectorizer is called for before you can do PCA. I like his OrdinalEncoder suggestion, but you may also consider the sklearn text utilities on this list: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_extraction.text
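If you go the text-utilities route, a sketch might look like the following; it assumes one column of the CSV holds raw topic words as strings (the example documents are made up), and it uses TruncatedSVD, which plays the role of PCA for the sparse matrices that the vectorizers produce:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Hypothetical documents standing in for a string-valued CSV column.
docs = ["dog cat fish", "dog dog bird", "cat bird bird"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)          # sparse matrix, one column per term

# TruncatedSVD handles sparse input directly, unlike PCA.
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(X)

# The vectorizer retains the term labels, so component loadings can still
# be read against them.
print(vec.get_feature_names_out())
print(svd.components_)
```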

Solution 2:

PCA works by solving an optimization problem that requires your features to be numeric. The code in the question is trying to perform PCA on non-numeric data, so you will need to factorize (encode) the strings into numbers first. sklearn.preprocessing.OrdinalEncoder and sklearn.preprocessing.OneHotEncoder handle that.
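A small sketch of that encoding step, using hypothetical string-valued columns (not the actual CSV), so the categorical labels stay recoverable via get_feature_names_out():

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.decomposition import PCA

# Hypothetical categorical data in place of the real CSV columns.
X_str = np.array([["red", "small"],
                  ["blue", "large"],
                  ["red", "large"]])

enc = OneHotEncoder()
X_num = enc.fit_transform(X_str).toarray()   # dense numeric array, PCA-ready

pca = PCA(n_components=2)
reduced = pca.fit_transform(X_num)

print(enc.get_feature_names_out())   # encoded feature labels, e.g. "x0_red"
print(reduced.shape)
```

OrdinalEncoder works the same way but maps each category to a single integer column, which keeps the matrix small at the cost of imposing an arbitrary ordering on the categories.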

Charles Landau's feature extraction solution looks very relevant to the question.
