Natural Language Processing: Basic Data Wrangling

1> Case Conversion

You can simply use the lower(), upper(), and title() string methods that Python provides.
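Here is a minimal sketch of those three methods. The sample sentence is just an assumption for illustration.

```python
# Sample text (made up for this example)
text = "natural LANGUAGE processing"

lowered = text.lower()   # every character in lower case
uppered = text.upper()   # every character in upper case
titled = text.title()    # first letter of each word capitalised

print(lowered)
print(uppered)
print(titled)
```

Lower-casing is usually the one you want before counting or comparing words, so that "Hello" and "hello" are treated as the same token.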

2> Tokenization

doc = 'my name is Hello. You are also Hello.'

code
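A minimal sketch of sentence and word tokenization on the doc above, using only the standard library re module. The two regular expressions are my own simplification, not the only way to do it. NLTK's sent_tokenize and word_tokenize are a sturdier alternative, but they need extra tokenizer data downloaded first.

```python
import re

doc = "my name is Hello. You are also Hello."

# Sentence tokenization: split on whitespace that follows ., ! or ?
sentences = re.split(r"(?<=[.!?])\s+", doc)

# Word tokenization: runs of word characters, or single punctuation marks
words = re.findall(r"\w+|[^\w\s]", doc)

print(sentences)
print(words)
```

This simple splitter will trip over abbreviations such as "Dr." or "e.g.", so treat it as a starting point only.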

3> What If You Have HTML Tags

That's the easy part: use BeautifulSoup, it is beginner friendly. I find it very easy to work with, so why wouldn't you? The code just initialises BS4 and calls soup.li. Haha, easy, right? If you want to know more about this, I will post a video or an exercise in the upcoming days.
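A minimal sketch of that idea, assuming the bs4 package is installed. The HTML snippet is made up for illustration, and I use Python's built-in "html.parser" so no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Made-up HTML snippet for illustration
html = "<ul><li>First item</li><li>Second <b>item</b></li></ul>"

# Initialise BeautifulSoup with the built-in parser
soup = BeautifulSoup(html, "html.parser")

# soup.li returns the first <li> tag; get_text() drops the markup
first_text = soup.li.get_text()

# Strip every tag in the document and join the text pieces with spaces
plain = soup.get_text(separator=" ", strip=True)

print(first_text)
print(plain)
```

Note that soup.li only gives you the first matching tag. To loop over all of them, soup.find_all("li") is the usual choice.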

4> Removing Accented Characters

Ever seen a letter with a ~ or an accent mark above it, like ñ or é? If yes, this is the noise I am talking about. Let's clean it.
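One common way to do this, sketched with the standard library unicodedata module: decompose each character into its base letter plus accent marks, then drop everything that is not plain ASCII. The function name is hypothetical.

```python
import unicodedata

def remove_accents(text):
    """Replace accented letters with their unaccented base letters."""
    # NFKD splits "é" into "e" plus a separate combining accent mark
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding to ASCII with errors ignored throws the accent marks away
    return decomposed.encode("ascii", "ignore").decode("ascii")

cleaned = remove_accents("café naïve São Paulo")
print(cleaned)
```

Be aware that this also silently drops any character with no ASCII base form (for example ß or ø), so it fits English-centric text best.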

5> Removing Special Characters

Using a regular expression makes this much easier, right?
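A minimal sketch with the re module. The pattern keeps letters, digits, and whitespace and removes everything else. Which characters count as "special" is an assumption here, so adjust the pattern to your data.

```python
import re

def remove_special_characters(text):
    """Keep only ASCII letters, digits, and whitespace."""
    return re.sub(r"[^a-zA-Z0-9\s]", "", text)

cleaned = remove_special_characters("Hello!!! How are you? #NLP @2024")
print(cleaned)
```

If numbers are also noise for your task, drop 0-9 from the character class.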

6> Contractions

Noise such as contractions (won't, isn't) and slang needs to be expanded, or "de-contracted", into its full form.
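A minimal sketch of expanding contractions with a lookup table and a regular expression. The tiny CONTRACTIONS dictionary and the function name are hypothetical. A real project would need a much longer list or a dedicated library.

```python
import re

# Tiny illustrative mapping (an assumption, far from complete)
CONTRACTIONS = {
    "won't": "will not",
    "isn't": "is not",
    "can't": "cannot",
    "i'm": "i am",
}

# One pattern that matches any key in the mapping as a whole word
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions(text):
    """Replace known contractions with their expanded form."""
    # Normalise curly apostrophes so that won’t and won't both match
    text = text.replace("\u2019", "'")
    return PATTERN.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

expanded = expand_contractions("she won't go and it isn't fair")
print(expanded)
```

The replacement text is always lower case, so this step works best after case conversion.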

7> Stemming

Writing your own if/else rules to strip suffixes is not a good approach. Use a ready-made stemmer instead.

code

from nltk.stem import PorterStemmer
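A minimal sketch of putting the Porter stemmer to work, assuming the nltk package is installed. The word list is made up for illustration.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Made-up sample words
words = ["running", "runs", "cats"]

# Reduce each word to its stem by stripping common suffixes
stems = [stemmer.stem(word) for word in words]

print(stems)
```

A stem is not guaranteed to be a real dictionary word, because the stemmer only applies suffix rules. That trade-off is what lemmatization addresses.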

8> Lemmatization

Lemmatization checks a dictionary so that the result is a real word, but you should tell it the part of speech: verb (v) or noun (n).
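A minimal sketch using NLTK's WordNetLemmatizer, assuming nltk is installed and the WordNet corpus has been downloaded once with nltk.download("wordnet"). The helper name lemmatize_tokens is hypothetical.

```python
from nltk.stem import WordNetLemmatizer

def lemmatize_tokens(tokens, pos="n"):
    """Lemmatize each token, treating it as the given part of speech.

    pos is "n" for noun (the default) or "v" for verb.
    """
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(token, pos=pos) for token in tokens]
```

Calling lemmatize_tokens(["running"], pos="v") should give "run", while the default noun setting leaves "running" unchanged. That is why the part of speech matters. Without the corpus on disk the call raises a LookupError.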
