Natural Language Processing: Basic Data Wrangling

Adarsha Regmi
2 min read · Sep 23, 2021

With so many libraries available now, it is easier than ever to build your NLP model. I discuss some basic text-cleaning steps below, so feel free to check them out.

1> Case conversion

You can simply use the lower(), upper(), and title() string methods provided by Python.
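A quick sketch of the three methods on a made-up sample string (the string and variable names are just for illustration):

```python
# Sample text is an assumption, chosen to show mixed casing.
text = "natural Language PROCESSING"

lowered = text.lower()  # every letter in lower case
uppered = text.upper()  # every letter in upper case
titled = text.title()   # first letter of each word in upper case, the rest lower

print(lowered)
print(uppered)
print(titled)
```

For most NLP pipelines lower() is the usual choice, since it makes "Hello" and "hello" count as the same token.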

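2> Tokenization

doc = 'my name is Hello. You are also Hello.'

NLTK provides a sentence tokenizer for this. Wow, right?

code

import nltk

nltk.download('punkt')  # one-time download of tokenizer data (newer NLTK versions may ask for 'punkt_tab')

nltk.sent_tokenize(doc)

This returns a Python list of sentences.

For word tokenization we have nltk.word_tokenize(doc), which returns a list of words and punctuation tokens.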

3> What if you have HTML tags

That's the easy part: use BeautifulSoup, it is beginner friendly. I find it very easy to work with, so why not you? The code is just: initialise BS4 and call soup.li. Easy, right? If you want more on this, I will post a video or an exercise in the upcoming days.
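Here is a minimal sketch of that idea, assuming the bs4 package is installed. The HTML snippet and variable names are made up for illustration, and "html.parser" is Python's built-in parser, so no extra parser library is needed:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, used only for this example.
html = "<html><body><h1>Title</h1><ul><li>first item</li><li>second item</li></ul></body></html>"

soup = BeautifulSoup(html, "html.parser")

# soup.li gives the first <li> tag; get_text() returns its text content.
first_li = soup.li.get_text()

# find_all collects every <li> tag in the document.
all_items = [li.get_text() for li in soup.find_all("li")]

# Strip all tags and keep only the visible text, joined with spaces.
plain_text = soup.get_text(separator=" ", strip=True)

print(first_li)
print(all_items)
print(plain_text)
```

For cleaning text before NLP, the last call is usually the one you want, because it drops every tag in one go.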

4> Removing Accented Characters

Ever seen a letter with a mark such as ~ above it, like ñ or é? If yes, this is the noise I am talking about. Let's clean it.

Use unicodedata, which ships with Python.

import unicodedata

Normalize with NFKD (compatibility decomposition).

< Characters are decomposed by compatibility, and multiple combining characters are arranged in a specific order.>

code

unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("utf-8", "ignore")
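The one-liner above can be wrapped in a small helper. This is only a sketch: the function name remove_accents and the sample sentence are my own choices, not from any library.

```python
import unicodedata


def remove_accents(text: str) -> str:
    """Strip accents by decomposing characters and dropping non-ASCII parts."""
    # NFKD splits "é" into "e" plus a separate combining accent mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding to ASCII with "ignore" silently drops the combining marks.
    ascii_bytes = decomposed.encode("ascii", "ignore")
    return ascii_bytes.decode("ascii")


cleaned = remove_accents("café señor naïve")
print(cleaned)
```

Keep in mind that this also drops any character with no ASCII equivalent, so it suits English text best.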

5> Removing Special characters

Using a regular expression makes this much easier, right?

For example, to keep only letters, numbers, and whitespace:

import re

re.sub(r'[^a-zA-Z0-9\s]', '', text)
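As a rough sketch, the same pattern can go into a helper with a switch for dropping digits too. The function name, the keep_digits parameter, and the sample sentence are assumptions made for this example:

```python
import re


def remove_special_characters(text: str, keep_digits: bool = True) -> str:
    """Remove everything except letters, whitespace, and (optionally) digits."""
    # The ^ inside [] negates the set: match any character NOT listed.
    pattern = r"[^a-zA-Z0-9\s]" if keep_digits else r"[^a-zA-Z\s]"
    return re.sub(pattern, "", text)


with_digits = remove_special_characters("Well, this is #1!!!")
without_digits = remove_special_characters("Well, this is #1!!!", keep_digits=False)

print(repr(with_digits))
print(repr(without_digits))
```

Removed characters can leave extra spaces behind, so it is common to follow this with a whitespace cleanup step.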

6> Contractions

Contractions such as won't and isn't, and slang, are noise, so we need to expand them.

code

import contractions

contractions.fix(s)  # s is your input string

e.g. I ain’t going >> I am not going.

7> Stemming

Handling every word form with if/else cases is not a good approach, so use a stemmer.

Stemmers: PorterStemmer, SnowballStemmer, LancasterStemmer

Different forms of a word are reduced to one stem, such as played, playing >> play

code

from nltk.stem import PorterStemmer

porter_stemmer = PorterStemmer()

porter_stemmer.stem('playing')

result >> play
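To compare the three stemmers side by side, here is a small sketch, assuming NLTK is installed (the stemmers themselves need no extra data download). The word list and dictionary names are just for illustration:

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

# One instance of each stemmer; Snowball needs a language name.
stemmers = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),
    "lancaster": LancasterStemmer(),
}

words = ["played", "playing", "plays"]

# Map each stemmer name to the list of stems it produces for the words.
stems = {name: [stemmer.stem(word) for word in words] for name, stemmer in stemmers.items()}

for name, result in stems.items():
    print(name, result)
```

Lancaster is generally the most aggressive of the three, so its stems can be shorter and less readable on other words.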

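8> Lemmatization

Lemmatization checks a dictionary (WordNet), so the result is a real word. You should pass the part of speech, such as "n" for noun or "v" for verb.

from nltk.stem import WordNetLemmatizer

whl = WordNetLemmatizer()  # needs the WordNet data once: nltk.download('wordnet')

whl.lemmatize("cars", "n")

result >> car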

9> To be continued in the next story

From Adarsha Regmi

Working to build an AI that works for us to protect the Earth from dissolving. Join me in this mission.

Thanks

For reference, check this Colab notebook:

https://colab.research.google.com/drive/1gx5xy1fTZYs1YSybKfwoUVmRBUf--3yp#scrollTo=IZxnwuqrVIZx
