Using A Custom Object In Pandas.read_csv()
Solution 1:
One way to make a file-like object in Python 3 is by subclassing io.RawIOBase. Using Mechanical snail's iterstream, you can convert any iterable of bytes into a file-like object:
import tempfile
import io
import pandas as pd

def iterstream(iterable, buffer_size=io.DEFAULT_BUFFER_SIZE):
    """
    http://stackoverflow.com/a/20260030/190597 (Mechanical snail)
    Lets you use an iterable (e.g. a generator) that yields bytestrings as a
    read-only input stream.

    The stream implements Python 3's newer I/O API (available in Python 2's io
    module).

    For efficiency, the stream is buffered.
    """
    class IterStream(io.RawIOBase):
        def __init__(self):
            self.leftover = None

        def readable(self):
            return True

        def readinto(self, b):
            try:
                l = len(b)  # We're supposed to return at most this much
                chunk = self.leftover or next(iterable)
                output, self.leftover = chunk[:l], chunk[l:]
                b[:len(output)] = output
                return len(output)
            except StopIteration:
                return 0    # indicate EOF

    return io.BufferedReader(IterStream(), buffer_size=buffer_size)
class DataFile(object):
    def __init__(self, files):
        self.files = files

    def read(self):
        for file_name in self.files:
            with open(file_name, 'rb') as f:
                for line in f:
                    yield line
def make_files(num):
    filenames = []
    for i in range(num):
        with tempfile.NamedTemporaryFile(mode='wb', delete=False) as f:
            f.write(b'''1,2,3\n4,5,6\n''')
            filenames.append(f.name)
    return filenames
# hours = ['file1.csv', 'file2.csv', 'file3.csv']
hours = make_files(3)
print(hours)
data = DataFile(hours)
df = pd.read_csv(iterstream(data.read()), header=None)
print(df)
prints
0 1 2
0 1 2 3
1 4 5 6
2 1 2 3
3 4 5 6
4 1 2 3
5 4 5 6
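Note that make_files creates temporary files with delete=False, so they are not removed automatically. A minimal cleanup sketch (not part of the original answer):

import os

# remove the temporary files created by make_files
for filename in hours:
    os.unlink(filename)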
Solution 2:
The documentation mentions the read method, but pandas is actually checking whether the argument is_file_like (that's where the exception is thrown). That function is actually very simple:
def is_file_like(obj):
    if not (hasattr(obj, 'read') or hasattr(obj, 'write')):
        return False
    if not hasattr(obj, "__iter__"):
        return False
    return True
So it also needs an __iter__ method.
But that's not the only problem. Pandas requires that the object actually behaves like a file, so the read method should accept an additional argument for the number of bytes to read (which means you can't make read a generator - it has to be callable with 2 arguments and should return a string).
So for example:
class DataFile(object):
    def __init__(self, files):
        self.data = """a b
1 2
2 3
"""
        self.pos = 0

    def read(self, x):
        nxt = self.pos + x
        ret = self.data[self.pos:nxt]
        self.pos = nxt
        return ret

    def __iter__(self):
        yield from self.data.split('\n')
will be recognized as valid input.
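For instance, a hypothetical call (not part of the original answer) that assumes the class above and that pandas accepts the whitespace-separated data:

import pandas as pd

reader = DataFile([])           # the files argument is unused by this example class
df = pd.read_csv(reader, sep=' ')
print(df)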
However, it's harder for multiple files. I hoped that fileinput could provide some appropriate routine, but it doesn't seem like it:
import fileinput
pd.read_csv(fileinput.input([...]))
# ValueError: Invalid file path or buffer object type: <class 'fileinput.FileInput'>
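One possible workaround (my own sketch, not from the original answer) is to combine fileinput with the iterstream helper from Solution 1: open the files in binary mode so the iterable yields bytestrings, which iterstream can wrap into a file-like object:

import fileinput
import pandas as pd

# 'hours' is the list of CSV paths; iterstream is defined in Solution 1
byte_lines = fileinput.input(hours, mode='rb')
df = pd.read_csv(iterstream(byte_lines), header=None)
print(df)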
Solution 3:
How about this alternative approach:
def get_merged_csv(flist, **kwargs):
    return pd.concat([pd.read_csv(f, **kwargs) for f in flist], ignore_index=True)
df = get_merged_csv(hours)
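Any extra keyword arguments are forwarded to pd.read_csv, so for the header-less temporary files created in Solution 1 you would presumably call it like this (hypothetical usage):

df = get_merged_csv(hours, header=None)
print(df)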