Removing Punctuation From A List In Python

I know this is a common question, but I haven't found an applicable answer. I'm trying to remove the punctuation from a list of words that I got by scraping an HTML page.

Solution 1:

You don't need regex at all. string.punctuation contains all of the ASCII punctuation characters. Just iterate over the characters of each word and skip those.

>>> import string
>>> ["".join(j for j in i if j not in string.punctuation) for i in lst]
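
As a quick check of the comprehension above, here is a self-contained sketch. The sample list `lst` is made up for illustration and is not the data from the question.

```python
import string

# Hypothetical sample data standing in for the scraped words.
lst = ["Hello,", "world!", "it's", "(quoted)", "plain"]

# Keep only the characters that are not ASCII punctuation.
cleaned = ["".join(ch for ch in word if ch not in string.punctuation)
           for word in lst]

print(cleaned)  # ['Hello', 'world', 'its', 'quoted', 'plain']
```

Note that string.punctuation only covers ASCII, so typographic characters such as curly quotes would pass through unchanged.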

Solution 2:

Taking a look at get_text(), it appears we need to modify a few things before we can remove any punctuation. I've added some comments in here.

import requests
from bs4 import BeautifulSoup

def get_text():
    str_lines = []  # create an empty list
    url = 'http://www.gutenberg.org/files/1155/1155-h/1155-h.htm'
    r = requests.get(url)
    data = r.text
    soup = BeautifulSoup(data, 'html.parser')
    text = soup.find_all('p')  # finds all of the text between <p> tags
    i = 0
    for p in text:
        i += 1
        line = p.get_text()
        if i < 10:
            continue  # skip the first nine paragraphs
        str_lines.append(line)  # append the current line to the list
    return str_lines  # return the list of lines

First, I uncommented your str_lines variable and set it to an empty list. Next, I replaced the print statement with code to append the line to the list of lines. Finally, I changed the return statement to return that list of lines.

For strip_text(), we can reduce it to a few lines of code:

import re

def strip_text():
    list_words = get_text()
    # replace every character that is not a letter with a space
    list_words = [re.sub("[^a-zA-Z]", " ", s.lower()) for s in list_words]
    return list_words

There is no need to operate on a per-word basis because we can look at the entire line and remove all punctuation at once, so I removed the split(). Using a list comprehension, we can alter every element of the list in a single line, and I also put the lower() call in there to condense the code. Note that the regex replaces every non-letter character (digits included) with a space rather than deleting it.
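
To see what that substitution does to a single line, here is a small sketch. The sample string is invented for illustration; the pattern is the one used in strip_text().

```python
import re

# Hypothetical line standing in for one scraped paragraph.
line = "Don't panic: it's 1922, after all!"

# Every character that is not an ASCII letter becomes a single space,
# so the result has the same length as the input.
spaced = re.sub("[^a-zA-Z]", " ", line.lower())
print(repr(spaced))

# Splitting on whitespace afterwards collapses the runs of spaces.
words = spaced.split()
print(words)  # ['don', 't', 'panic', 'it', 's', 'after', 'all']
```

Because apostrophes become spaces, contractions are split into two pieces, and the digits disappear as well. Whether that is acceptable depends on what you do with the words afterwards.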

To implement the answer provided by @AhsanulHaque, you just need to substitute that second line of the strip_text() function with it, as shown:

import string

def strip_text():
    list_words = get_text()
    list_words = ["".join(j.lower() for j in i if j not in string.punctuation)
                  for i in list_words]
    return list_words

For fun, here is the translate method I mentioned earlier, implemented for Python 3.x:

import string

def strip_text():
    list_words = get_text()
    # map every punctuation character to None so translate() deletes it
    translator = str.maketrans({key: None for key in string.punctuation})
    list_words = [s.lower().translate(translator) for s in list_words]
    return list_words
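
A self-contained sketch of the translate approach follows. It uses the three-argument form of str.maketrans, whose third argument lists the characters to delete, as an alternative to the dict form. The sample `lines` list is invented and stands in for the output of get_text().

```python
import string

# Hypothetical lines standing in for the output of get_text().
lines = ["Hello, World!", "It's a test."]

# Third argument of str.maketrans: characters to delete.
translator = str.maketrans("", "", string.punctuation)
stripped = [s.lower().translate(translator) for s in lines]

print(stripped)  # ['hello world', 'its a test']
```

Unlike the regex version, this deletes punctuation instead of replacing it with a space, so "it's" becomes "its" rather than "it s".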

Unfortunately I cannot time any of these for your particular code because Gutenberg blocked me temporarily (too many runs of the code too quickly, I suppose).
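
If you want to compare them yourself without hitting Gutenberg, a rough sketch along these lines should work. It runs the three variants on an in-memory list instead of the scraped page. The sample sentence, the repeat counts and the function names are all made up for illustration, and the numbers will vary from machine to machine.

```python
import re
import string
import timeit

# Hypothetical sample lines, repeated to give the timer some work.
LINES = ["The quick brown fox, surprised, jumped: over the lazy dog!"] * 1000

TRANSLATOR = str.maketrans("", "", string.punctuation)

def with_regex(lines):
    # replaces every non-letter with a space
    return [re.sub("[^a-zA-Z]", " ", s.lower()) for s in lines]

def with_join(lines):
    # drops ASCII punctuation character by character
    return ["".join(j.lower() for j in i if j not in string.punctuation)
            for i in lines]

def with_translate(lines):
    # drops ASCII punctuation via a translation table
    return [s.lower().translate(TRANSLATOR) for s in lines]

timings = {}
for func in (with_regex, with_join, with_translate):
    # total time in seconds for 20 runs over the whole list
    timings[func.__name__] = timeit.timeit(lambda: func(LINES), number=20)

for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s")
```

Keep in mind that the regex variant produces different output from the other two (spaces instead of deletions), so this compares speed only, not results.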
