Removing Punctuation From A List In Python
Solution 1:
You don't need regex at all. string.punctuation contains all of the ASCII punctuation characters. Just iterate over each string and skip those characters.
>>> import string
>>> ["".join(j for j in i if j not in string.punctuation) for i in lst]
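To make the one-liner above easy to try, here is a minimal sketch on made-up data. The sample contents of `lst` and the helper name `remove_punctuation` are assumptions for illustration, not part of the original question.

```python
import string

def remove_punctuation(lines):
    """Return a new list with every ASCII punctuation character dropped."""
    return ["".join(ch for ch in line if ch not in string.punctuation)
            for line in lines]

# Hypothetical sample data standing in for the question's list
lst = ["Hello, world!", "It's a test.", "no-punctuation here"]
print(remove_punctuation(lst))
# ['Hello world', 'Its a test', 'nopunctuation here']
```

Note that apostrophes and hyphens are in string.punctuation too, so "It's" becomes "Its" and hyphenated words are joined.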
Solution 2:
Taking a look at get_text(), it appears we need to modify a few things before we can remove any punctuation. I've added some comments in here.
import requests
from bs4 import BeautifulSoup

def get_text():
    str_lines = []  # create an empty list
    url = 'http://www.gutenberg.org/files/1155/1155-h/1155-h.htm'
    r = requests.get(url)
    data = r.text
    soup = BeautifulSoup(data, 'html.parser')
    text = soup.find_all('p')  # finds all of the text between <p> tags
    i = 0
    for p in text:
        i += 1
        line = p.get_text()
        if i < 10:  # skip the first nine paragraphs
            continue
        str_lines.append(line)  # append the current line to the list
    return str_lines  # return the list of lines
First, I uncommented your str_lines variable and set it to an empty list. Next, I replaced the print statement with code to append the line to the list of lines. Finally, I changed the return statement to return that list of lines.
For strip_text(), we can reduce it to a few lines of code:
import re

def strip_text():
    list_words = get_text()
    list_words = [re.sub("[^a-zA-Z]", " ", s.lower()) for s in list_words]
    return list_words
There is no need to operate on a per-word basis because we can look at the entire line and remove all punctuation, so I removed the split(). Using a list comprehension, we can alter every element of the list in a single line, and I also put the lower() method in there to condense the code.
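One detail worth keeping in mind: the pattern `[^a-zA-Z]` replaces every non-letter (punctuation, digits, and whitespace) with a space instead of deleting it, so its output is not identical to the string.punctuation approach. A small sketch on a made-up line shows the difference; the sample string is an assumption for illustration.

```python
import re
import string

sample = ["Chapter 1: It's late."]  # hypothetical input line

# Regex version: every character that is not a letter becomes a space
regex_result = [re.sub("[^a-zA-Z]", " ", s.lower()) for s in sample]

# string.punctuation version: punctuation is deleted, digits and spaces are kept
punct_result = ["".join(c.lower() for c in s if c not in string.punctuation)
                for s in sample]

print(regex_result)  # ['chapter    it s late ']
print(punct_result)  # ['chapter 1 its late']
```

If the lines are later split into words, the extra spaces from the regex version disappear with a plain split(), but contractions such as "it's" end up as two tokens.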
To implement the answer provided by @AhsanulHaque, you just need to substitute that second line of the strip_text() function with it, as shown:
import string

def strip_text():
    list_words = get_text()
    list_words = ["".join(j.lower() for j in i if j not in string.punctuation)
                  for i in list_words]
    return list_words
For fun, here is that translate method I mentioned earlier implemented for Python 3.x, as described here:
import string

def strip_text():
    list_words = get_text()
    translator = str.maketrans({key: None for key in string.punctuation})
    list_words = [s.lower().translate(translator) for s in list_words]
    return list_words
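As an aside, the same translation table can also be built with the three-argument form of str.maketrans, where the third argument lists the characters to delete. The sketch below uses a hypothetical helper name, `strip_lines`, and made-up input so it runs without fetching anything.

```python
import string

# Three-argument form: characters in the third argument are mapped to None
translator = str.maketrans("", "", string.punctuation)

def strip_lines(lines):
    """Lowercase each line and delete ASCII punctuation (hypothetical helper)."""
    return [s.lower().translate(translator) for s in lines]

print(strip_lines(["Hello, World!", "It's done."]))
# ['hello world', 'its done']
```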
Unfortunately I cannot time any of these for your particular code because Gutenberg blocked me temporarily (too many runs of the code too quickly, I suppose).
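If you want to compare the three approaches yourself without hitting Gutenberg, a rough sketch with timeit on local data might look like this. The sample sentence, the list size, the repeat count, and the function names are all assumptions; results will vary by machine and input, so no timings are quoted here.

```python
import re
import string
import timeit

# Hypothetical stand-in for the downloaded paragraphs
lines = ["It's a truth, universally acknowledged; isn't it?"] * 1000

translator = str.maketrans("", "", string.punctuation)

def with_join(data):
    return ["".join(c.lower() for c in s if c not in string.punctuation)
            for s in data]

def with_regex(data):
    return [re.sub("[^a-zA-Z]", " ", s.lower()) for s in data]

def with_translate(data):
    return [s.lower().translate(translator) for s in data]

# Time each approach on the same in-memory data
for func in (with_join, with_regex, with_translate):
    elapsed = timeit.timeit(lambda: func(lines), number=20)
    print(f"{func.__name__}: {elapsed:.4f} s")
```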