spaCy in Python for Natural Language Processing (NLP) Example
Notebook and Code from https://github.com/jeffheaton/t81_558_deep_learning
import codecs
import urllib.request

url = "https://data.heatonresearch.com/data/t81-558/datasets/sonnet_18.txt"

# Stream the file and decode each line as UTF-8
with urllib.request.urlopen(url) as urlstream:
    for line in codecs.iterdecode(urlstream, 'utf-8'):
        print(line.rstrip())
import spacy

# Load the small English pipeline; install it first with:
#   python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')

# Tokenize the last line read from the sonnet above
doc = nlp(line.rstrip())
for token in doc:
    print(token.text)
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion")
for token in doc:
    print(token.text)
You can also obtain the part of speech for each word. Common parts of speech include nouns, verbs, pronouns, and adjectives.
for word in doc:
    print(word.text, word.pos_)
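The part-of-speech codes that spaCy prints are terse. If a tag is unfamiliar, spacy.explain returns a short human-readable description; a minimal sketch over the same doc as above:
# Print each word with its tag and the tag's description
for word in doc:
    print(word.text, word.pos_, spacy.explain(word.pos_))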
spaCy includes token attributes that check whether a token looks like a number, a URL, an email address, or another kind of entity.
for word in doc:
    print(f"{word} is like number? {word.like_num}")
import spacy
from spacy import displacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("This is a sentence")
displacy.serve(doc, style="dep")
Note that you will have to stop the above cell manually: displacy.serve starts a web server and blocks until it is interrupted.
print(doc)
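Inside a notebook it is usually more convenient to call displacy.render, which draws the parse inline and returns immediately instead of starting a blocking server:
# Render the dependency parse inline in the notebook (non-blocking)
displacy.render(doc, style="dep", jupyter=True)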
The following code shows how to reduce words to their lemmas. Lemmatization reduces each word to its base dictionary form; for example, "striped" becomes "stripe."
import spacy

# Initialize the spaCy English model, keeping only the tagger
# component needed for lemmatization
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

sentence = "The striped bats are hanging on their feet for best"

# Parse the sentence using the loaded model object `nlp`
doc = nlp(sentence)

# Extract the lemma for each token and join back into a string
" ".join([token.lemma_ for token in doc])
spaCy also includes a built-in list of English stop words: very common words such as "the" and "is" that are often removed before analysis.
from spacy.lang.en.stop_words import STOP_WORDS
print(STOP_WORDS)
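A common preprocessing step is to drop these stop words from text before further analysis. One way is to use the is_stop flag on each token; a minimal sketch that reuses the nlp pipeline loaded above:
# Remove stop words using each token's is_stop flag
doc = nlp("The striped bats are hanging on their feet for best")
print(" ".join(token.text for token in doc if not token.is_stop))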