COUNT_VECTORIZER
The COUNT_VECTORIZER node receives a collection (matrix, vector, or dataframe) of text documents and converts it to a matrix of token counts.
Params:
Returns:
tokens : DataFrame
    Holds all the unique tokens observed in the input.
word_count_vector : Vector
    Contains the occurrences of these tokens in each sentence.
Python Code
from typing import TypedDict

import pandas as pd
from flojoy import flojoy, DataFrame, Matrix, Vector
from sklearn.feature_extraction.text import CountVectorizer


class CountVectorizerOutput(TypedDict):
    tokens: DataFrame
    word_count_vector: Vector


@flojoy(deps={"scikit-learn": "1.2.2"})
def COUNT_VECTORIZER(default: DataFrame | Matrix | Vector) -> CountVectorizerOutput:
    """The COUNT_VECTORIZER node receives a collection (matrix, vector or dataframe) of text documents and converts it to a matrix of token counts.

    Returns
    -------
    tokens: DataFrame
        Holds all the unique tokens observed in the input.
    word_count_vector: Vector
        Contains the occurrences of these tokens in each sentence.
    """
    # Extract the underlying array from whichever container was received.
    if isinstance(default, DataFrame):
        data = default.m.values
    elif isinstance(default, Vector):
        data = default.v
    else:
        data = default.m

    # Flatten to a 1D sequence of documents, then learn the vocabulary and count tokens.
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(data.flatten())

    # One row per unique token.
    x = pd.DataFrame({"tokens": vectorizer.get_feature_names_out()})
    # Dense 2D array: one row per document, one column per token.
    y = X.toarray()  # type: ignore
    return CountVectorizerOutput(tokens=DataFrame(df=x), word_count_vector=Vector(v=y))
Example
In this example, the READ_CSV node loads a local file. The COUNT_VECTORIZER node then transforms the received dataframe of text into a matrix of token (word) counts. It returns a DataFrame that contains the unique words and the token counts for each sentence (the word_count_vector output, which the code above declares as a Vector).