Skip to content Skip to sidebar Skip to footer

Can I Create List From Regular Expressions?

I'm making a crawler. User can specify regular expression string to download data. When user input form is: http://xxx/abc[x-z]/image(9|10|11).png I want to download these. http:/

Solution 1:

If you are wanting to take a user's given regex as an input and generate a list of strings you can use the library sre_yield:

However, be very aware that trying to parse every possible string of a regex can get out of hand very quickly. You'll need to be sure that your users are aware of the implications that wildcard characters and open ended or repeating groups can have on the number of possible matching strings.

As an example, your regex string: http://xxx/abc[x-z]/image(9|10|11).png does not escape the ., which is a wildcard for any character, so it will generate a lot of unexpected strings. Instead we'll need to escape it as seen in the example below:

>>>import sre_yield>>>links = []>>>for each in sre_yield.AllStrings(r'http://xxx/abc[x-z]/image(9|10|11)\.png'):
        links.append(each)

Or more simply links = list(sre_yield.AllStrings(r'http://xxx/abc[x-z]/image(9|10|11)\.png'))

The result is:

>>> links

['http://xxx/abcx/image9.png', 'http://xxx/abcy/image9.png', 
'http://xxx/abcz/image9.png', 'http://xxx/abcx/image10.png', 
'http://xxx/abcy/image10.png', 'http://xxx/abcz/image10.png', 
'http://xxx/abcx/image11.png', 'http://xxx/abcy/image11.png', 
'http://xxx/abcz/image11.png']

Solution 2:

You can use product() from the itertools builtin:

from itertools import product

for x, y in product(['x', 'y', 'z'], range(9, 12)):
    print'http://xxx/abc{}/image{}'.format(x, y)

To build your list you can use a comprehension:

links = ['http://xxx/abc{}/image{}'.format(x, y) for x, y in product(['x', 'y', 'z'], range(9, 12))]

Solution 3:

Simple try may be-alternative to the previous answers

lst = ['http://xxx/abc%s/image%s.png'%(x,y) for x, y in [(j,i) for i in (9,10,11) for j in ('x', 'y', 'z')]]

Omitted range and format function for quicker performance.


Analysis- I compared my way and the way posted by Jkdc

I ran both way 100000 times but mean shows that itertools approach is faster in terms of execution time-

from itertools import product
import time
from matplotlib import pyplot as plt
import numpy as np

prodct = []
native = []

deftest():
    start = time.clock()
    lst = ['http://xxx/abc{}/image{}'.format(x, y) for x, y in product(('x', 'y', 'z'), range(9, 11))]
    end = time.clock()

    print'{0:.50f}'.format(end-start)
    prodct.append('{0:.50f}'.format(end-start))

    start1 = time.clock()
    lst = ['http://xxx/abc%s/image%s'%(x,y) for x, y in [(j,i) for i in (9,10,11) for j in ('x', 'y', 'z')]]
    end1 = time.clock()

    print'{0:.50f}'.format(end1-start1)
    native.append('{0:.50f}'.format(end1-start1))


for i inrange(1,100000):
    test()


y = np.dot(np.array(native).astype(np.float),100000)
x= np.dot(np.array(prodct).astype(np.float),100000)

print np.mean(y)
print np.mean(x)

graph and getting result for native(no module) and itertools-product as below

for native 2.1831179834 for itertools-product 1.60410432562

Post a Comment for "Can I Create List From Regular Expressions?"