Paragraph analysis using NLP and Python

admin

6 years ago

As part of a text analysis, we often want to find the length of a paragraph (its number of characters) and the number of times each word is repeated. Both can be done in a few lines with Python and basic NLP (natural language processing) techniques.

To perform this analysis, we need a few operations on the text: tokenization, stop word removal, sorting and counting. Let us discuss each step in detail.

Step 1

The first step is to create a text or paragraph and assign it to a variable.

Python Code:

txt1 = "Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects Python is dynamically typed and garbage-collected. It supports multiple programming paradigms, including structured (particularly, procedural), object-oriented, and functional programming. Python is often described as a batteries included language due to its comprehensive standard library. Python was conceived in the late 1980s as a successor to the ABC language. Python 2.0, released in 2000, introduced features like list comprehensions and a garbage collection system capable of collecting reference cycles. Python 3.0, released in 2008, was a major revision of the language that is not completely backward-compatible, and much Python 2 code does not run unmodified on Python 3."

Step 2

We can find the length of the paragraph simply by using the len() function. For a string, len() returns the total number of characters, including spaces and punctuation.

Python Code:

len(txt1)

Output:
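Note that len() counts every character, not only letters. If the number of letters alone is wanted, one possible approach is to count the characters for which str.isalpha() is true. The following is a minimal sketch on a short sample string; the name sample is a hypothetical stand-in for txt1.

```python
# Short stand-in for txt1 (hypothetical sample, not the article's paragraph)
sample = "Python 3.0, released in 2008."

# len() counts every character: letters, digits, spaces and punctuation
total_chars = len(sample)

# Count only alphabetic characters
letters = sum(1 for ch in sample if ch.isalpha())

print(total_chars)  # 29
print(letters)      # 16
```

The same two lines can be applied to txt1 unchanged.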

Step 3

Step 3 is tokenization, which means splitting the paragraph into separate word strings. Libraries such as nltk provide dedicated tokenizers, but here the built-in string method split() is used, which splits the text on whitespace.

Python Code:

# splitting the text into words
tokenized_text = txt1.split()
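Because split() only breaks on whitespace, punctuation stays attached to the words, so 'language' and 'language.' end up as different tokens. One way to avoid this, shown here only as a sketch, is a regular expression that keeps runs of letters and digits together. The pattern below is an assumption chosen for illustration, not the article's method.

```python
import re

# Hypothetical short sample, not the article's paragraph
sample = "Python 3.0 is a language. Python's object-oriented design is clear, logical."

# split() leaves punctuation attached: 'language.', 'clear,', 'logical.'
whitespace_tokens = sample.split()

# Letters/digits, optionally joined by an internal hyphen, apostrophe or dot,
# so "object-oriented", "Python's" and "3.0" stay whole
token_pattern = re.compile(r"[A-Za-z0-9]+(?:[-'.][A-Za-z0-9]+)*")
regex_tokens = token_pattern.findall(sample)

print(whitespace_tokens)
print(regex_tokens)
```

With this pattern, trailing commas and full stops are dropped, so repeated words are counted together in the later steps.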

Step 4

In this step, we remove stop words from the text. To do so, we import the English stop word list from the nltk corpus library. Any token that appears in this list is dropped, and the remaining tokens are stored in a list.

Python Code:

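import nltk
from nltk.corpus import stopwords
# one-time download of the stop word corpus (skipped if already present)
nltk.download('stopwords')
# removing stop words from tokenized_text
sw = stopwords.words('english')
final_words = [word for word in tokenized_text if word not in sw]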
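The comparison above is case-sensitive, and the NLTK English stop word list is lower case, so capitalized stop words such as 'It' and 'Its' are not removed. A possible refinement is to lower-case each token before the lookup. The sketch below uses a tiny hand-written stop word set (a hypothetical stand-in for the NLTK list) so that it runs without any download.

```python
# Tiny stand-in stop word set (hypothetical; the article uses NLTK's English list)
stop_words = {"it", "its", "is", "a", "the", "and"}

tokens = ["It", "is", "a", "language", "and", "Its", "library", "is", "large"]

# Case-sensitive test: "It" and "Its" slip through
case_sensitive = [w for w in tokens if w not in stop_words]

# Lower-casing each token before the lookup also catches capitalized stop words
case_insensitive = [w for w in tokens if w.lower() not in stop_words]

print(case_sensitive)    # ['It', 'language', 'Its', 'library', 'large']
print(case_insensitive)  # ['language', 'library', 'large']
```

Storing the stop words in a set rather than a list also makes each membership test faster on longer texts.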

Step 5

This step is sorting. The sort() method sorts the list of tokens in place. Strings are compared character by character, so tokens starting with punctuation or digits come first, followed by upper case and then lower case words in alphabetical order.

Python Code:

final_words.sort()
print(final_words)

Output:

['(particularly,', '1980s', '1991,', '2', '2.0,', '2000,', '2008,', '3.', '3.0,', 'ABC', 'Created', 'Guido', 'It', 'Its', 'Python', 'Python', 'Python', 'Python', 'Python', 'Python', 'Python', 'Python', "Python's", 'Rossum', 'aim', 'approach', 'backward-compatible,', 'batteries', 'capable', 'clear,', 'code', 'code', 'code', 'collecting', 'collection', 'completely', 'comprehensions', 'comprehensive', 'conceived', 'constructs', 'cycles.', 'described', 'design', 'due', 'dynamically', 'emphasizes', 'features', 'first', 'functional', 'garbage', 'garbage-collected.', 'general-purpose', 'help', 'high-level,', 'included', 'including', 'interpreted,', 'introduced', 'language', 'language', 'language', 'language.', 'language.', 'large-scale', 'late', 'library.', 'like', 'list', 'logical', 'major', 'much', 'multiple', 'notable', 'object-oriented', 'object-oriented,', 'often', 'paradigms,', 'philosophy', 'procedural),', 'programmers', 'programming', 'programming', 'programming.', 'projects', 'readability', 'reference', 'released', 'released', 'released', 'revision', 'run', 'significant', 'small', 'standard', 'structured', 'successor', 'supports', 'system', 'typed', 'unmodified', 'use', 'van', 'whitespace.', 'write']
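The ordering in this output follows from how Python compares strings: by the Unicode code point of each character. If a plain alphabetical order that ignores case is preferred, a key function can be passed to the sort. A small sketch on a hypothetical handful of tokens:

```python
# Hypothetical sample tokens taken from the kinds of words seen above
words = ["write", "Python", "2000,", "(particularly,", "ABC", "aim"]

# Default sort compares Unicode code points:
# "(" comes first, then digits, then upper case, then lower case
default_order = sorted(words)

# key=str.lower compares lower-cased copies, so case no longer affects the order
case_insensitive_order = sorted(words, key=str.lower)

print(default_order)
print(case_insensitive_order)
```

sorted() returns a new list and leaves the original untouched, whereas list.sort() reorders the list in place; both accept the same key argument.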

Step 6

In step 6, we count how many times each word occurs in the paragraph. To do this, we convert the sorted list into a set to get the unique words, create an empty dictionary, and store each word as a key with its number of occurrences as the value. The list method count() is used to count the occurrences of a word.

Python Code:

temp=set(final_words)
result={}
for i in temp:
    result[i]=final_words.count(i)
print(result)

Output:

{'language.': 2, 'introduced': 1, 'programming': 2, 'clear,': 1, 'dynamically': 1, 'aim': 1, '2': 1, 'Python': 8, 'comprehensive': 1, 'much': 1, 'standard': 1, 'collection': 1, 'procedural),': 1, 'reference': 1, 'capable': 1, 'run': 1, 'unmodified': 1, 'philosophy': 1, 'released': 3, 'programmers': 1, '1980s': 1, 'programming.': 1, 'notable': 1, 'like': 1, 'described': 1, 'design': 1, 'often': 1, 'revision': 1, 'comprehensions': 1, '3.': 1, 'It': 1, 'garbage': 1, 'collecting': 1, 'features': 1, 'constructs': 1, 'system': 1, 'completely': 1, 'significant': 1, 'whitespace.': 1, 'code': 3, 'typed': 1, 'supports': 1, '2008,': 1, 'paradigms,': 1, 'ABC': 1, 'cycles.': 1, 'late': 1, '(particularly,': 1, 'projects': 1, 'help': 1, 'large-scale': 1, 'emphasizes': 1, 'small': 1, 'Rossum': 1, 'Guido': 1, "Python's": 1, 'library.': 1, 'multiple': 1, 'successor': 1, 'logical': 1, 'garbage-collected.': 1, 'object-oriented': 1, 'first': 1, '2.0,': 1, '3.0,': 1, 'Created': 1, 'due': 1, 'van': 1, 'object-oriented,': 1, '2000,': 1, 'language': 3, 'approach': 1, 'interpreted,': 1, 'conceived': 1, '1991,': 1, 'included': 1, 'high-level,': 1, 'Its': 1, 'batteries': 1, 'list': 1, 'including': 1, 'functional': 1, 'use': 1, 'write': 1, 'general-purpose': 1, 'backward-compatible,': 1, 'structured': 1, 'readability': 1, 'major': 1}
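The loop above calls count() once per unique word, and each call scans the whole list. For longer texts, the standard library class collections.Counter produces the same kind of word-to-count mapping in a single pass. A minimal sketch, where final_words_sample is a hypothetical stand-in for final_words:

```python
from collections import Counter

# Hypothetical short stand-in for final_words
final_words_sample = ["Python", "code", "language", "Python", "code", "Python"]

# Counter tallies every element in one pass over the list
counts = Counter(final_words_sample)

print(counts["Python"])       # 3
print(counts.most_common(2))  # [('Python', 3), ('code', 2)]
```

A Counter is a dict subclass, so it can be used wherever the result dictionary is used.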

Step 7

Finally, we display the words together with their number of repetitions. This can be done by converting the above dictionary to a data frame using pandas. The output is sorted in ascending order of the number of repetitions of each word.

Python Code:

import pandas as pd
word_Counts=pd.DataFrame({'NUMBER of REPETITIONS':result})
word_Counts.sort_values('NUMBER of REPETITIONS')

Output:

	NUMBER of REPETITIONS
(particularly,	1
object-oriented,	1
object-oriented	1
notable	1
multiple	1
much	1
major	1
…	…
language.	2
programming	2
released	3
language	3
code	3
Python	8
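Many words share the same count, and the order among such ties may not be predictable. If a reproducible listing is wanted, or pandas is not available, one option is to sort the dictionary items with a two-part key: count first, then the word itself. This is only a sketch; result_sample is a hypothetical stand-in for the result dictionary.

```python
# Hypothetical small stand-in for the result dictionary
result_sample = {"code": 3, "Python": 8, "much": 1, "language": 3, "major": 1}

# Sort by count first, then by word, so ties always appear in the same order
rows = sorted(result_sample.items(), key=lambda item: (item[1], item[0]))

# Print a simple two-column listing, word left-aligned in 12 characters
for word, count in rows:
    print(f"{word:<12}{count}")
```

Passing reverse=True to sorted() would list the most frequent words first instead.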