Natural Language Processing: Basic Data Wrangling

With so many libraries available now, it is easier than ever to build your NLP model in comfort. I discuss some basic text-wrangling steps below, so feel free to check them out.

1 > Case conversion
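Case conversion usually means lowercasing, so that "The" and "the" end up as the same token. Here is a minimal sketch using only Python's built-in string methods; the sample sentence is made up for illustration.

```python
# Sample text (made up for illustration)
text = "The Quick Brown Fox"

lowered = text.lower()   # the usual choice before tokenizing
uppered = text.upper()
titled = text.title()    # capitalizes the first letter of every word

# casefold() is a more aggressive lower(), meant for caseless matching
folded = "Straße".casefold()

print(lowered)
print(folded)
```

Lowercasing throws away information that some tasks need (for example, proper nouns), so whether to apply it depends on your model.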

2> Tokenization

NLTK provides a sentence tokenizer for this. Wow, right?


>> import nltk

>> nltk.sent_tokenize(doc)

This returns a Python list of sentences.

For word tokenization, we have:

>> nltk.word_tokenize(doc)
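To see roughly what the two tokenizers give back, here is a small stand-in built on regular expressions. It is not NLTK's algorithm, only a rough approximation; the function names `rough_sent_tokenize` and `rough_word_tokenize` and the sample `doc` are made up for this sketch.

```python
import re

def rough_sent_tokenize(doc):
    # Split after ".", "!" or "?" when followed by whitespace.
    # This breaks on abbreviations like "Dr. Smith", which NLTK handles better.
    return [s for s in re.split(r"(?<=[.!?])\s+", doc.strip()) if s]

def rough_word_tokenize(sentence):
    # A token is a run of word characters (optionally with inner apostrophes)
    # or a single punctuation character.
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", sentence)

doc = "NLP is fun. Tokenizers split text! Do they handle questions?"
sentences = rough_sent_tokenize(doc)
words = rough_word_tokenize(sentences[0])

print(sentences)
print(words)
```

For real work, prefer the NLTK calls above; the regex version is only meant to show the shape of the output (a list of sentences, then a list of tokens).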

3> What if you have HTML tags
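Scraped text often still carries HTML tags, and they should go before tokenizing. One possible way, assuming you want to stay inside the standard library, is a small `html.parser` subclass that keeps only the text nodes. The names `TagStripper` and `strip_html` are made up for this sketch.

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Collects only the text nodes of an HTML string (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs defaults to True, so "&amp;" arrives as "&"
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(html):
    stripper = TagStripper()
    stripper.feed(html)
    stripper.close()
    # Collapse the whitespace runs left behind by removed tags
    return " ".join("".join(stripper.parts).split())

clean = strip_html("<p>Hello <b>world</b> &amp; friends</p>")
print(clean)
```

Note that this sketch also keeps the contents of `<script>` and `<style>` elements, so you may want to drop those separately. Many people use BeautifulSoup's `get_text()` for this step instead.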

4> Removing Accented Characters

Use unicodedata:

import unicodedata

Normalize with NFKD (compatibility decomposition):

< Characters are decomposed by compatibility, and multiple combining characters are arranged in a specific order. >

Then encode to ASCII and ignore whatever cannot be encoded, which drops the accents:

unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("utf-8", "ignore")
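The normalization step above can be wrapped in a small helper. This is a minimal sketch; the function name `remove_accents` and the sample strings are made up for illustration.

```python
import unicodedata

def remove_accents(text):
    # NFKD splits a character like "é" into "e" plus a combining accent mark
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding to ASCII with "ignore" drops the combining marks.
    # Caution: it also drops every other non-ASCII character.
    return decomposed.encode("ascii", "ignore").decode("utf-8", "ignore")

cleaned = remove_accents("Sómě Áccěntěd těxt")
print(cleaned)
```

Because everything non-ASCII is dropped, this suits English text; for other scripts it would delete real content.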

5>> Removing Special characters

For example, to keep only letters, numbers, and whitespace:

import re

re.sub(r'[^a-zA-Z0-9\s]', '', text)
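Sometimes you want to drop the digits too. Here is a small sketch of the same substitution with an optional switch; the function name `remove_special_characters` and the `remove_digits` flag are assumptions made for this example.

```python
import re

def remove_special_characters(text, remove_digits=False):
    # Keep letters and whitespace; keep digits too unless remove_digits is set
    pattern = r"[^a-zA-Z\s]" if remove_digits else r"[^a-zA-Z0-9\s]"
    return re.sub(pattern, "", text)

sample = "Well this was fun! 123#@!"
kept_digits = remove_special_characters(sample)
no_digits = remove_special_characters(sample, remove_digits=True)

print(kept_digits)
print(repr(no_digits))  # repr shows the trailing space that is left behind
```

Run this after removing accented characters. Otherwise letters such as "é" are deleted outright instead of being turned into "e".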

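6 >> Contractions

Contractions are shortened forms such as "don't" or "I'm". Expanding them back to full words helps later steps. The contractions package does this:

import contractions

contractions.fix(text)

e.g. I ain't going >> I am not going.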
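If you want to see what happens under the hood, contraction expansion is basically a dictionary lookup. Here is a minimal sketch with a tiny hand-written map; the map entries and the names `CONTRACTION_MAP` and `expand_contractions` are assumptions for illustration, and a real list is much longer.

```python
import re

# Tiny illustrative map (assumed entries, not a complete list)
CONTRACTION_MAP = {
    "ain't": "am not",
    "can't": "cannot",
    "won't": "will not",
    "i'm": "i am",
    "it's": "it is",
}

# One alternation pattern that matches any key as a whole word
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTION_MAP) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions(text):
    # Normalize the curly apostrophe so both quote styles match
    text = text.replace("\u2019", "'")

    def _expand(match):
        found = match.group(0)
        expanded = CONTRACTION_MAP[found.lower()]
        # Keep a leading capital letter if the original had one
        return expanded.capitalize() if found[0].isupper() else expanded

    return _PATTERN.sub(_expand, text)

expanded = expand_contractions("I ain't going")
print(expanded)
```

A fixed map cannot resolve ambiguous cases: "ain't" can also mean "is not" or "are not", and "it's" can mean "it has", so treat the output as a best guess.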

7>> Stemming

Stemmers: PorterStemmer, SnowballStemmer, LancasterStemmer

Different forms of a word are reduced to one stem, such as played, playing >> play

from nltk.stem import PorterStemmer

PorterStemmer().stem("playing")

result >> play

8>> Lemmatization

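Unlike stemming, lemmatization maps each word to its dictionary form (the lemma), so the result is always a real word.

from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()

wnl.lemmatize("playing", pos="v")

result >> play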


9>> Continued in the next story

From Adarsha Regmi

Working to build an AI that works for us to protect the Earth from dissolving. Join me in this mission.


For reference, check this Colab notebook:


