# Imports - these are all the imports needed for the assignment
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Import nltk package
# PennTreeBank word tokenizer
# English language stopwords
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import warnings
warnings.filterwarnings('ignore')
# scikit-learn imports
# SVM (Support Vector Machine) classifier
# Vectorizer, which transforms text data into bag-of-words features
# TF-IDF Vectorizer, which first removes widely used words in the dataset and then transforms text data
# Metrics functions to evaluate performance
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import classification_report, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
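As a quick aside, here is a minimal, self-contained sketch of what the two vectorizers imported above produce. The toy corpus and variable names are made up for this illustration and are not part of the assignment:

# Toy corpus, invented purely for this illustration
toy_corpus = ["the cat sat", "the cat ran", "a dog ran"]
bow = CountVectorizer().fit_transform(toy_corpus)    # raw word counts per document
tfidf = TfidfVectorizer().fit_transform(toy_corpus)  # counts reweighted so widely used words matter less
print(bow.toarray())    # rows = documents, columns = vocabulary words
print(tfidf.toarray())  # same shape, but TF-IDF weights instead of raw counts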
Run the following cell to download the NLTK English tokenizer and the stopwords of all languages.
[2]: nltk.download('punkt')
nltk.download('stopwords')
[nltk_data] Downloading package punkt to /home/v1lu/nltk_data…
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/v1lu/nltk_data…
[nltk_data] Package stopwords is already up-to-date!
[2]: True
Finally, run the following cell to define a fixed random state. This will ensure that the results you
generate are the same each time you run the code/do sampling.
[3]: # define random state for use
random_state = 10
3.2 Part 1: Data, EDA, and Readability Index of Newsgroups (3 pts)
In this first set of questions you will:
- read the dataset in and do minimal cleaning
- use the data provided to calculate the metrics needed
- carry out some basic EDA
3.2.1 1a) Import Data
Read the CSV file 'data/20news_data.csv', storing its contents into a dataframe news_df. Set
the column names as ‘category’, ‘subject’ and ‘message’.
The dataset we’ll be using here consists of email discussions from newsgroups. The category being
discussed is in the first column, the subject of the email thread is in the second, and the message
contents are in the third. Note that these are from actual discussions. Some sensitive topics and
discussions are included; the statements within these datasets do not necessarily reflect the
sentiments of the course staff.
[4]: news_filepath = "data/20news_data.csv"
# YOUR CODE HERE
news_df = pd.read_csv(news_filepath, names=['category','subject','message'])
[5]: # take a look at 5 random rows from the dataset
# note that b/c random_state is set, you will get same 5 rows each time
news_df.sample(5, random_state=random_state)
[5]: category subject \
16150 politics National Crime Survey
256 politics Re: Europe vs. Muslim Bosnians
8355 recreational Re: wife wants convertible
9609 religion Re: After 2000 years, can we say that Christia…
8288 recreational Re Aftermarket A/C units
message
16150 Well, I dropped by the library yesterday, and …
256 I like what Mr. Joseph Biden had to say yester…
8355 : : : HELP!!! : my wife has informed me tha…
9609 I'll take a wild guess and say Freedom is obje…
8288 | I looked into getting a/c installed on my 19…
[6]: assert isinstance(news_df, pd.DataFrame)
assert len(news_df) == 18773
assert list(news_df.columns) == ['category', 'subject', 'message']
3.2.2 1b) Remove null observations
Given that we are carrying out a classification/prediction task, we’re going to drop null values from
the dataframe. Drop any rows containing null values in any of the three columns of news_df. Store
this back into news_df.
[7]: # YOUR CODE HERE
news_df = news_df.dropna()
[8]: assert news_df.isna().sum().sum() == 0
assert news_df.shape == (18731, 3)
3.3 Readability Index
Now, it’s time to calculate the readability index of each text entry. This is the measure of how easy
or hard it might be to read each message. We will use the Dale-Chall method to calculate it in
the next few steps, using the following:
Readability Score of a text = 0.1579 * (Percentage of Difficult Words not on the Dale–Chall word list) + 0.0496 * (Average Sentence Length) + 3.6365
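As a rough sketch of how these pieces combine (avg_sentence_len is the function you will write below; pct_difficult_words is a hypothetical helper, not defined in this assignment, that would return the percentage of words missing from the Dale–Chall word list):

# Sketch only: pct_difficult_words is a hypothetical helper, not defined here
def dale_chall_score(text):
    return (0.1579 * pct_difficult_words(text)
            + 0.0496 * avg_sentence_len(text)
            + 3.6365)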
3.3.1 1c) Average sentence length
We will create a function (avg_sentence_len) to calculate the average sentence length across a
piece of text.
This function should take text as an input parameter.
Within this function:

1. sentences: Use the split() string method to split the input text at every '.'. This will split the text into a list of sentences. Store this in the variable sentences. To keep things simple, we will consider every “.” a sentence separator. (This decision could lead to misleading answers. For example, “Hello Dr. Jacob.” is actually a single sentence, but our function will consider it 2 separate sentences.)
2. words: Use the split() method to split the input text into a list of separate words, storing this in words. Again, to limit complexity, we will assume that all words are separated by a single space (" "). (So, while “I am going.to see you later” actually has 7 words, there is no space after the “.”, so our function will treat it as 6 separate words.)
3. Calculate the average sentence length, returning this from your function:
   - if the last value in sentences is an empty string: the average sentence length should be the number of words divided by len(sentences) - 1.
   - otherwise, the average sentence length should be the number of words divided by the number of sentences.

For the “I am going.to see you later” example, your function should return 3.0: splitting on " " gives 6 words, splitting on '.' gives 2 sentences (the last is not an empty string), and 6 / 2 = 3.0.
[9]: # YOUR CODE HERE
def avg_sentence_len(text):
    # split into sentences on '.' and into words on single spaces
    sentences = text.split('.')
    words = text.split(" ")
    # if the text ends with '.', the last split element is an empty
    # string and should not be counted as a sentence
    if sentences[-1] == '':
        return len(words) / (len(sentences) - 1)
    return len(words) / len(sentences)
[10]: assert avg_sentence_len("each sentence. has two. words right?.") == 2
assert avg_sentence_len("a. a. a") == avg_sentence_len("a. a. a.")
assert avg_sentence_len("a. a. a") == 1
assert avg_sentence_len("hello Dr. Jacob.") == 1.5
assert avg_sentence_len(news_df['message'][0])//1 == 16
assert avg_sentence_len("one. two.") != avg_sentence_len("one.two.")
3.3.2 1d) Calculate Average Sentence Length
Apply the above function to the 'message' column of news_df, creating a new column in news_df
called 'ASL'. This column should store the average sentence length of each message.
[11]: # YOUR CODE HERE
news_df['ASL'] = news_df['message'].apply(avg_sentence_len)
[12]: assert "ASL" in news_df.columns
assert np.isclose(news_df.loc[0,'ASL'], 16.07, 0.01)
[13]: # look at output
news_df.head()
[13]: category subject \
0 politics Re: The Stage is Being Set
1 politics Re: Zionism - racism
2 politics The Armenians did not form a distinct race.
3 politics Re: As Muslim women and children were being ma…
4 politics Re: Israel's Expansion II
message ASL
0 Srinivas Suder writes: If the Haitian people's… 16.066667
1 From: Center for Policy Research