
Python: Parsing Numeric Values From String Using Regular Expressions

I'm writing Python code to parse different types of numbers from a string using regular expressions and have run into an annoying problem that I don't understand. My code is as f

Solution 1:

Firstly, I suspect that the period in the first part of the regex should be escaped with a leading backslash if it is intended to match a decimal point. Currently it matches any character, which is why you get a match containing a space: '$26 '.

The 2,333 is therefore matched by the first part of your regex (the , is matched by the unescaped .), which is why the ,450 part of that number was left out of the match.
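To make the difference concrete, here is a minimal sketch. The question's original regex isn't shown here, so the two patterns below are hypothetical stand-ins that differ only in whether the dot is escaped:

```python
import re

sample = "bob $26 and 2,333,450"

# Hypothetical patterns for illustration only (not the asker's actual regex).
unescaped = r'\$?[0-9]+.?[0-9]*'   # '.' matches any character, including ' ' and ','
escaped = r'\$?[0-9]+\.?[0-9]*'    # '\.' matches only a literal decimal point

loose = re.findall(unescaped, sample)
strict = re.findall(escaped, sample)

print(loose)   # ['$26 ', '2,333', '450']
print(strict)  # ['$26', '2', '333', '450']
```

With the unescaped dot, the trailing space is swallowed into '$26 ' and the first comma is swallowed into '2,333'. Escaping the dot fixes both, although the comma-separated number then needs a pattern of its own.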

Whilst your (corrected) regex works with your limited sample data, which might be good enough, it may be too broad for general use; for instance, it matches ($1267.3%. You could build up a bigger regex out of smaller parts, although this can get ugly fast:

import re

test_string = "Distributions $54.00 bob $26 and 0.30 5% ($0.23) 2,333,450"
test_string += " $12,354.00 43 43.12 1234,12 ($123,456.78"

COMMA_SEP_NUMBER = r'\d{1,3}(?:,\d{3})*'  # require groups of 3
DECIMAL_NUMBER = r'\d+(?:\.\d*)?'
# are commas used after the decimal point?
COMMA_SEP_DECIMAL = COMMA_SEP_NUMBER + r'(?:\.(?:\d{3},)*\d{0,3})?'

# Alternatives are tried left to right, so the more specific ones come first.
regex_items = [
    r'\$' + COMMA_SEP_DECIMAL,   # dollar amounts with thousands separators
    r'\$' + DECIMAL_NUMBER,      # plain dollar amounts
    COMMA_SEP_DECIMAL + '%',     # percentages with thousands separators
    DECIMAL_NUMBER + '%',        # plain percentages
    COMMA_SEP_DECIMAL,
    DECIMAL_NUMBER,
]

r = re.compile('|'.join(regex_items))

print(r.findall(test_string))

Note that this doesn't account for parentheses around the numbers, and it fails on 1234,12 (which should probably be interpreted as two numbers, 1234 and 12) because 123 is matched by the COMMA_SEP_NUMBER pattern.

This is an inherent problem with this technique: the alternatives are tried in order, so if the DECIMAL_NUMBER pattern comes first, COMMA_SEP_NUMBER will never be matched.
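The ordering issue can be seen with just the two building blocks. The sketch below also tries one possible refinement, which is my own assumption and not part of the answer's code: STRICT_COMMA_SEP requires at least one comma group and uses lookarounds so that it cannot start or stop in the middle of a run of digits or commas (it also ignores decimal parts).

```python
import re

COMMA_SEP_NUMBER = r'\d{1,3}(?:,\d{3})*'
DECIMAL_NUMBER = r'\d+(?:\.\d*)?'

# Same alternatives, different order.
decimal_first = re.compile(DECIMAL_NUMBER + '|' + COMMA_SEP_NUMBER)
comma_first = re.compile(COMMA_SEP_NUMBER + '|' + DECIMAL_NUMBER)

print(decimal_first.findall("2,333,450"))  # ['2', '333', '450']
print(comma_first.findall("2,333,450"))    # ['2,333,450']
print(comma_first.findall("1234,12"))      # ['123', '4', '12']

# Assumed refinement (hypothetical name): at least one ",ddd" group, and no
# digit or comma immediately before or after the match.
STRICT_COMMA_SEP = r'(?<![\d,])\d{1,3}(?:,\d{3})+(?![\d,])'
guarded = re.compile(STRICT_COMMA_SEP + '|' + DECIMAL_NUMBER)

print(guarded.findall("1234,12 2,333,450 43.12"))
# ['1234', '12', '2,333,450', '43.12']
```

Because the strict pattern only matches well-formed comma-separated numbers, it can safely come first, and everything else falls through to DECIMAL_NUMBER.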

Finally, Debuggex is a nice tool for visualising regexes. Here is the COMMA_SEP_DECIMAL pattern, which you can paste into it:

\d{1,3}(?:,\d{3})*(?:\.(?:\d{3},)*\d{0,3})?

Solution 2:

How about merging the two parts into one?

>>> import re
>>> test_string = "Distributions $54.00 bob $26 and 0.30 5% ($0.23) 2,333,450"
>>> re.findall(r'\(?\$?\d+(?:,\d+)*\.?\d*%?\)?', test_string)
['$54.00', '$26', '0.30', '5%', '($0.23)', '2,333,450']
  • Replaced . with \. to match a dot literally instead of matching any character.
  • Replaced [0-9] with \d (\d matches a digit).
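One caveat worth checking before relying on the merged pattern: because every decoration is optional, it also accepts some malformed input. A quick sketch, where the tricky string and the balanced filter are my own illustrative assumptions:

```python
import re

MERGED = r'\(?\$?\d+(?:,\d+)*\.?\d*%?\)?'

# Malformed inputs: mixed $ and %, unbalanced parentheses, odd comma grouping.
tricky = "($1267.3% 1234,12 ($123,456.78"

matches = re.findall(MERGED, tricky)
print(matches)  # ['($1267.3%', '1234,12', '($123,456.78']

# Hypothetical post-filter: keep only matches whose parentheses are balanced.
balanced = [m for m in matches if m.startswith('(') == m.endswith(')')]
print(balanced)  # ['1234,12']
```

Whether that matters depends on how clean your data is; for the sample string in the question the single pattern is enough.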
