Natural Language Processing: Basic Data Wrangling
--
With so many libraries available now, it is easier than ever to build your NLP model in comfort. I discuss the basic text-cleaning steps below, so feel free to check them out.
1> Case conversion
You can simply use the lower(), upper(), and title() string methods that Python provides.
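Here is a minimal sketch of the three methods in action. The sample sentence is just something I made up for illustration.

```python
# A made-up sample sentence for illustration
text = "The quick Brown fox"

lowered = text.lower()   # every letter in lower case
uppered = text.upper()   # every letter in upper case
titled = text.title()    # first letter of each word in upper case

print(lowered)
print(uppered)
print(titled)
```

Lower-casing is usually the one you want before counting or matching words, so that "Fox" and "fox" are treated as the same token.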
2> Tokenization
doc = 'my name is Hello. You are also Hello.'
NLTK provides a sentence tokenizer for this. Wow, right?
code
>> import nltk
>> nltk.sent_tokenize(doc)
This returns a Python list of sentences. (NLTK may ask you to download its tokenizer data first with nltk.download().)
For word tokenization we have
>> nltk.word_tokenize(doc)
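To get a feel for what a tokenizer gives back without installing anything, here is a rough regex-based stand-in. To be clear, this is not NLTK's algorithm: simple_sent_tokenize and simple_word_tokenize are hypothetical helper names of mine, and the rules are far cruder than the real thing (they will trip over abbreviations like "Dr.").

```python
import re

doc = "my name is Hello. You are also Hello."

def simple_sent_tokenize(text):
    # Hypothetical helper: split after ., ! or ? when followed by whitespace
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def simple_word_tokenize(text):
    # Hypothetical helper: runs of word characters, or single punctuation marks
    return re.findall(r"\w+|[^\w\s]", text)

sentences = simple_sent_tokenize(doc)
words = simple_word_tokenize(doc)

print(sentences)
print(words)
```

The shape of the result is the point here: a list of sentence strings, and a list of word and punctuation tokens.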
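3> What if you have HTML tags?
That's the easy part: use BeautifulSoup, it is beginner friendly. I find it very easy to work with, so why wouldn't you? The code just initialises a BeautifulSoup object (from the bs4 package) and calls soup.li to grab the first <li> tag, or soup.get_text() to keep only the text. Haha, easy, right? If you want to know more about this, I will post a video or an exercise in the coming days.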
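If you cannot install BeautifulSoup, the same idea can be sketched with Python's built-in html.parser module. This is a swapped-in technique, not BeautifulSoup itself, and TextExtractor and strip_tags are hypothetical names I picked for the example.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Hypothetical helper: collects the text between tags and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called by the parser for every run of text outside a tag
        self.parts.append(data)

def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join the text pieces and collapse repeated whitespace
    return " ".join(" ".join(parser.parts).split())

cleaned = strip_tags("<ul><li>First item</li><li>Second <b>item</b></li></ul>")
print(cleaned)
```

Joining the pieces with a space keeps neighbouring list items apart, at the cost of splitting a word that has a tag in the middle of it. For real pages BeautifulSoup handles messy markup far more gracefully.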
4> Removing Accented Characters
Ever seen a t with a ~ or an arrow above it? If yes, this is the noise I am talking about. Let's clean it.
Use unicodedata
import unicodedata
Normalize with NFKD (compatibility decomposition)
<Characters are decomposed by compatibility, and multiple combining characters are arranged in a specific order.>
<code>
unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("utf-8", "ignore")
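Wrapped in a small function, the one-liner above could look like this. The function name remove_accents and the sample string are my own choices for illustration.

```python
import unicodedata

def remove_accents(text):
    # NFKD splits an accented letter into a base letter plus combining marks
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding to ASCII with "ignore" drops the combining marks that remain
    return decomposed.encode("ascii", "ignore").decode("ascii")

cleaned = remove_accents("café naïve résumé")
print(cleaned)
```

Keep in mind that this also throws away any character with no ASCII base form, so it only suits text where plain ASCII is an acceptable target.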
5> Removing Special Characters
Using a regular expression makes this much easier, right?
For example, to keep only letters, numbers, and whitespace:
import re
re.sub(r'[^a-zA-Z0-9\s]', '', text)
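A small sketch of the same substitution as a reusable function. The keep_digits switch is an assumption of mine, added in case you also want the numbers gone.

```python
import re

def remove_special_characters(text, keep_digits=True):
    # keep_digits is a hypothetical option: drop the 0-9 range to remove numbers too
    pattern = r"[^a-zA-Z0-9\s]" if keep_digits else r"[^a-zA-Z\s]"
    # Every character NOT in the allowed set is replaced with nothing
    return re.sub(pattern, "", text)

cleaned = remove_special_characters("Well this was fun! 123#@!")
letters_only = remove_special_characters("Room 42 is open!", keep_digits=False)
print(cleaned)
print(letters_only)
```

Run the accent removal step before this one, otherwise accented letters are deleted outright instead of being mapped to their plain form.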
6> Contractions
Noise such as won't, isn't, and slang needs to be expanded (de-contracted).
code>
import contractions
contractions.fix(s)
e.g. I ain’t going >> I am not going.
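To show what is going on under the hood, here is a toy dictionary-based expander. It is not the contractions library, only a sketch of the idea: CONTRACTION_MAP and expand_contractions are hypothetical names, and the map holds just a handful of entries.

```python
import re

# Hypothetical, deliberately tiny map; a real one has hundreds of entries
CONTRACTION_MAP = {
    "won't": "will not",
    "isn't": "is not",
    "can't": "cannot",
    "i'm": "I am",
}

def expand_contractions(text, mapping=CONTRACTION_MAP):
    # Longest keys first so a longer contraction wins over a shorter one
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(k) for k in keys) + r")\b",
        flags=re.IGNORECASE,
    )
    # Look each match up in lower case, since the map keys are lower case
    return pattern.sub(lambda m: mapping[m.group(0).lower()], text)

expanded = expand_contractions("It isn't ready and I won't wait")
print(expanded)
```

This sketch only matches the straight apostrophe ('), so text with curly apostrophes would need them replaced first.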
7> Stemming
<Handling every word form with if/else cases is not a good approach>
Stemmers >> PorterStemmer, SnowballStemmer, LancasterStemmer
Different forms are converted into one, such as played, playing >> play
code
from nltk.stem import PorterStemmer
PorterStemmer().stem('playing')
result >> play
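To see why hand-written if/else rules are not a good approach, here is a deliberately naive suffix stripper. naive_stem is a hypothetical function written only to show the failure mode; it is not how the Porter stemmer works.

```python
def naive_stem(word):
    # Hypothetical if/else stemmer: chop off a few common suffixes
    if word.endswith("ing"):
        return word[:-3]
    elif word.endswith("ed"):
        return word[:-2]
    elif word.endswith("s"):
        return word[:-1]
    return word

samples = ["playing", "played", "plays", "sing", "bed", "bus"]
results = {word: naive_stem(word) for word in samples}

for word, stem in results.items():
    print(word, ">>", stem)
```

The first three words come out as "play", but "sing", "bed", and "bus" get mangled. Every fix needs yet another special case, which is exactly the work a real stemmer's rule set has already done for you.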
8> Lemmatization
Checks a dictionary (WordNet) for the word's base form, but the part of speech, for example "n" for noun or "v" for verb, should be passed.
from nltk.stem import WordNetLemmatizer
whl = WordNetLemmatizer()
whl.lemmatize("cars", "n")
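Why does the part of speech matter? A toy lookup table makes it visible. This is only an illustration of the idea, not WordNet: TOY_LEMMAS and toy_lemmatize are hypothetical, and the entries are hand-picked.

```python
# Hypothetical toy dictionary: (word, part of speech) -> base form
TOY_LEMMAS = {
    ("cars", "n"): "car",
    ("saw", "n"): "saw",
    ("saw", "v"): "see",
    ("meeting", "n"): "meeting",
    ("meeting", "v"): "meet",
}

def toy_lemmatize(word, pos="n"):
    # Fall back to the word itself when the pair is not in the dictionary
    return TOY_LEMMAS.get((word.lower(), pos), word)

noun_form = toy_lemmatize("meeting", "n")
verb_form = toy_lemmatize("meeting", "v")
print(noun_form, verb_form)
```

The same spelling maps to different base forms depending on the tag, which is why passing the right part of speech gives better lemmas than relying on a default.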
9> ... Continued in the next story
From Adarsha Regmi
Working to build an AI that works for us to protect the Earth from dissolving. Join me in this mission.
Thanks
For reference, check this Colab notebook:
https://colab.research.google.com/drive/1gx5xy1fTZYs1YSybKfwoUVmRBUf--3yp#scrollTo=IZxnwuqrVIZx