# Imports - these are all the imports needed for the assignment
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Import nltk package
# PennTreeBank word tokenizer
# English language stopwords
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import warnings
warnings.filterwarnings('ignore')
# scikit-learn imports
# SVM (Support Vector Machine) classifier
# Vectorizer, which transforms text data into bag-of-words features
# TF-IDF Vectorizer, which first removes widely used words in the dataset and then transforms text data
# Metrics functions to evaluate performance
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import classification_report, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
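As a quick aside, here is a minimal, self-contained sketch of what the two vectorizers imported above produce. The toy corpus and variable names are made up for this illustration and are not part of the assignment:

# Toy corpus, invented purely for this illustration
toy_corpus = ["the cat sat", "the cat ran", "a dog ran"]
bow = CountVectorizer().fit_transform(toy_corpus)    # raw word counts per document
tfidf = TfidfVectorizer().fit_transform(toy_corpus)  # counts reweighted so widely used words matter less
print(bow.toarray())    # rows = documents, columns = vocabulary words
print(tfidf.toarray())  # same shape, but TF-IDF weights instead of raw counts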
Run the following cell to download the NLTK English tokenizer and the stopwords of all languages.
[2]: nltk.download('punkt')
nltk.download('stopwords')
[nltk_data] Downloading package punkt to /home/v1lu/nltk_data…
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/v1lu/nltk_data…
[nltk_data] Package stopwords is already up-to-date!
[2]: True
Finally, run the following cell to define a fixed random state. This will ensure that the results you
generate are the same each time you run the code/do sampling.
[3]: # define random state for use
random_state = 10
3.2 Part 1: Data, EDA, and Readability Index of Newsgroups (3 pts)
In this first set of questions you will:
- read the dataset in and do minimal cleaning
- use the data provided to calculate the metrics needed
- carry out some basic EDA
3.2.1 1a) Import Data
Read the CSV file 'data/20news_data.csv', storing its contents into a dataframe news_df. Set
the column names as ‘category’, ‘subject’ and ‘message’.
The dataset we’ll be using here consists of email discussions from newsgroups. The category being
discussed is in the first column, the subject of the email thread is in the second, and the message
contents are in the third. Note that these are from actual discussions. Some sensitive topics and
discussions are included; the statements within these datasets do not necessarily reflect the
sentiments of the course staff.
[4]: news_filepath = "data/20news_data.csv"
# YOUR CODE HERE
news_df = pd.read_csv(news_filepath, names=['category','subject','message'])
[5]: # take a look at 5 random rows from the dataset
# note that b/c random_state is set, you will get same 5 rows each time
news_df.sample(5, random_state=random_state)
[5]: category subject \
16150 politics National Crime Survey
256 politics Re: Europe vs. Muslim Bosnians
8355 recreational Re: wife wants convertible
9609 religion Re: After 2000 years, can we say that Christia…
8288 recreational Re Aftermarket A/C units
message
16150 Well, I dropped by the library yesterday, and …
256 I like what Mr. Joseph Biden had to say yester…
8355 : : : HELP!!! : my wife has informed me tha…
9609 I'll take a wild guess and say Freedom is obje…
8288 | I looked into getting a/c installed on my 19…
[6]: assert isinstance(news_df, pd.DataFrame)
assert len(news_df) == 18773
assert list(news_df.columns) == ['category', 'subject', 'message']
3.2.2 1b) Remove null observations
Given that we are carrying out a classification/prediction task, we’re going to drop null values from
the dataframe. Drop any rows containing null values in any of the three columns of news_df. Store
this back into news_df.
[7]: # YOUR CODE HERE
news_df = news_df.dropna()
[8]: assert news_df.isna().sum().sum() == 0
assert news_df.shape == (18731, 3)
3.3 Readability Index
Now, it’s time to calculate the readability index of each text entry. This is the measure of how easy
or hard it might be to read each message. We will use the Dale-Chall method to calculate it in
the next few steps, using the following:
Readability Score of a text = 0.1579 * (Percentage of Difficult Words not on the Dale–Chall word list) + 0.0496 * (Average Sentence Length) + 3.6365
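As a rough sketch of how these pieces combine (avg_sentence_len is the function you will write below; pct_difficult_words is a hypothetical helper, not defined in this assignment, that would return the percentage of words missing from the Dale–Chall word list):

# Sketch only: pct_difficult_words is a hypothetical helper, not defined here
def dale_chall_score(text):
    return (0.1579 * pct_difficult_words(text)
            + 0.0496 * avg_sentence_len(text)
            + 3.6365)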
3.3.1 1c) Average sentence length
We will create a function (avg_sentence_len) to calculate the average sentence length across a
piece of text.
This function should take text as an input parameter.
Within this function:

1. sentences: Use the split() string method to split the input text at every '.'. This will split the text into a list of sentences. Store this in the variable sentences. To keep things simple, we will consider every “.” a sentence separator. (This decision could lead to misleading answers. For example, “Hello Dr. Jacob.” is actually a single sentence, but our function will consider it 2 separate sentences.)
2. words: Use the split() method to split the input text into a list of separate words, storing this in words. Again, to limit complexity, we will assume that all words are separated by a single space (" "). (So, while “I am going.to see you later” actually has 7 words, there is no space after the “.”, so our function will treat it as 6 separate words.)
3. Calculate the average sentence length, returning this from your function:
   - if the last value in sentences is an empty string: the average sentence length should be the number of words divided by len(sentences) - 1.
   - otherwise, the average sentence length should be the number of words divided by the number of sentences.

For the “I am going.to see you later” example, your function should return 3.0: splitting on " " gives 6 words, splitting on '.' gives 2 sentences (the last is not an empty string), and 6 / 2 = 3.0.
[9]: # YOUR CODE HERE
def avg_sentence_len(text):
    # split into sentences on '.' and into words on single spaces
    sentences = text.split('.')
    words = text.split(" ")
    # if the text ends with '.', the last split element is an empty
    # string and should not be counted as a sentence
    if sentences[-1] == '':
        return len(words) / (len(sentences) - 1)
    return len(words) / len(sentences)
[10]: assert avg_sentence_len("each sentence. has two. words right?.") == 2
assert avg_sentence_len("a. a. a") == avg_sentence_len("a. a. a.")
assert avg_sentence_len("a. a. a") == 1
assert avg_sentence_len("hello Dr. Jacob.") == 1.5
assert avg_sentence_len(news_df['message'][0])//1 == 16
assert avg_sentence_len("one. two.") != avg_sentence_len("one.two.")
3.3.2 1d) Calculate Average Sentence Length
Apply the above function to the 'message' column of news_df, creating a new column in news_df
called 'ASL'. This column should store the average sentence length of each message.
[11]: # YOUR CODE HERE
news_df['ASL'] = news_df['message'].apply(avg_sentence_len)
[12]: assert "ASL" in news_df.columns
assert np.isclose(news_df.loc[0,'ASL'], 16.07, 0.01)
[13]: # look at output
news_df.head()
[13]: category subject \
0 politics Re: The Stage is Being Set
1 politics Re: Zionism - racism
2 politics The Armenians did not form a distinct race.
3 politics Re: As Muslim women and children were being ma…
4 politics Re: Israel's Expansion II
message ASL
0 Srinivas Suder writes: If the Haitian people's… 16.066667
1 From: Center for Policy Research