
Read From Line To Line Yelp Dataset By Python

I want to change this code so that it reads only lines 1400001 to 1450000. What modification is needed? The file is composed of a single object type, one JSON object per line. I want al

Solution 1:

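If the file has one JSON object per line:

import json

revu = []
with open("review.json", 'r', encoding="utf8") as f:
    # Expensive statement: readlines() loads the whole file, so depending
    # on the file size this might make you run out of memory.
    # Slice indices are 0-based and end-exclusive, so [1400000:1450000]
    # selects lines 1400001 through 1450000 (counting from 1).
    revu = [json.loads(s) for s in f.readlines()[1400000:1450000]]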

Running the same thing on the /etc/passwd file makes it easy to test (that file is not JSON, of course, so the parsing is left out):

revu = []
with open("/etc/passwd", 'r') as f:
    # expensive statement
    revu = [s for s in f.readlines()[5:10]]

print(revu)  # prints lines 6 to 10 (0-based indices 5 to 9)
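For a check that does not depend on /etc/passwd and does include the JSON parsing, here is a minimal sketch that builds a small JSON-lines file in a temporary directory and slices it. The file name `sample.json` and the `id` field are made up for this illustration; they are not part of the Yelp dataset.

```python
import json
import os
import tempfile

# Build a small JSON-lines file: one object per line, ids 1..20.
# The "id" field and the file name are hypothetical, for illustration only.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.json")
with open(sample_path, "w", encoding="utf8") as f:
    for n in range(1, 21):
        f.write(json.dumps({"id": n}) + "\n")

# Lines 6 through 10 (counting from 1) are the slice [5:10]:
# the start index is 0-based and the end index is exclusive.
with open(sample_path, "r", encoding="utf8") as f:
    sample = [json.loads(s) for s in f.readlines()[5:10]]

print([r["id"] for r in sample])  # [6, 7, 8, 9, 10]
```

Because the ids match the line numbers, any off-by-one mistake in the slice shows up directly in the printed list.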

Or iterate over the lines one at a time, which avoids the memory issue:

import json

revu = []
with open("...", 'r') as f:
    # start=1 makes i the 1-based line number
    for i, line in enumerate(f, start=1):
        if 1400001 <= i <= 1450000:
            revu.append(json.loads(line))

# process revu
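A line-by-line loop still reads the file to the end even after the last wanted line has passed. One possible variant, sketched here with `itertools.islice` from the standard library, stops as soon as the range is exhausted. The helper name `read_line_range` is hypothetical, and the line numbers are assumed to be 1-based and inclusive.

```python
import json
from itertools import islice


def read_line_range(path, first, last):
    """Return parsed JSON objects from lines first..last (1-based, inclusive)."""
    with open(path, "r", encoding="utf8") as f:
        # islice skips the first (first - 1) lines and stops after line
        # `last`, so nothing beyond the requested range is read.
        return [json.loads(line) for line in islice(f, first - 1, last)]
```

For the range in the question this would be called as `revu = read_line_range("review.json", 1400001, 1450000)`. If the file is shorter than `last`, the function simply returns whatever lines exist in the range.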

To write the result to CSV:

import pandas as pd
import json

def mylines(filename, _from, _to):
    """Yield parsed JSON objects from lines _from.._to (1-based, inclusive)."""
    with open(filename, encoding="utf8") as f:
        for i, line in enumerate(f, start=1):
            if _from <= i <= _to:
                yield json.loads(line)

df = pd.DataFrame(list(mylines("review.json", 1400001, 1450000)))
df.to_csv("/tmp/whatever.csv")
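If pandas is not available, or holding 50,000 records in a DataFrame is undesirable, the same export can be sketched with only the standard library, writing each row as it is read. The function name `lines_to_csv` is hypothetical, and the sketch assumes every record is a flat object with the same keys as the first one in the range (nested values would be written as their Python string representation).

```python
import csv
import json
from itertools import islice


def lines_to_csv(src, dst, first, last):
    """Copy JSON objects from lines first..last (1-based, inclusive) of
    src into dst as CSV rows. Returns the number of rows written."""
    count = 0
    with open(src, "r", encoding="utf8") as fin, \
            open(dst, "w", encoding="utf8", newline="") as fout:
        writer = None
        for line in islice(fin, first - 1, last):
            record = json.loads(line)
            if writer is None:
                # Assumption: the first record's keys define the CSV columns.
                writer = csv.DictWriter(fout, fieldnames=list(record))
                writer.writeheader()
            writer.writerow(record)
            count += 1
    return count
```

A call such as `lines_to_csv("review.json", "/tmp/whatever.csv", 1400001, 1450000)` would mirror the pandas version, except that no index column is written. A record with a key the first record lacks makes `csv.DictWriter` raise `ValueError`, which is a useful signal that the single-object-type assumption does not hold.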
