Problem description
R's data.table library has a function, fread, that can take a shell command as input to construct a dataframe.
This is highly useful for reading data formats that require a small amount of wrangling (without creating additional files). There are many examples of these formats in bioinformatics (sam, bam, gff, etc.)
As far as I can tell, there is no (straightforward) equivalent in pandas.
Is there interest in a pull request for an additional function to add this functionality?
Example solution
import io
import pandas
import subprocess
def read_shell(command, shell=False, **kwargs):
    """
    Takes a shell command as a string and reads the result into a Pandas DataFrame.
    Additional keyword arguments are passed through to pandas.read_csv.
    :param command: a shell command that returns tabular data
    :type command: str
    :param shell: passed to subprocess.Popen
    :type shell: bool
    :return: a pandas dataframe
    :rtype: :class:`pandas.DataFrame`
    """
    proc = subprocess.Popen(command,
                            shell=shell,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    # Wait for the command to finish and collect both streams as bytes.
    output, error = proc.communicate()
    if proc.returncode == 0:
        with io.StringIO(output.decode()) as buffer:
            return pandas.read_csv(buffer, **kwargs)
    else:
        message = ("Shell command returned non-zero exit status: {0}\n\n"
                   "Command was:\n{1}\n\n"
                   "Standard error was:\n{2}")
        raise IOError(message.format(proc.returncode, command, error.decode()))
Expected usage
command = "samtools view example.bam | head | cut -f 1,2,3,4,5,6,7 -d '\t'"
read_shell(command, shell=True, sep='\t', header=None) # note options passed to pandas.read_csv
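One possible refinement, sketched here only as an illustration: read_shell above collects the whole output in memory with communicate() before parsing it. The variant below hands the process's stdout directly to pandas.read_csv so the data is parsed as it arrives. The name read_shell_streaming is hypothetical, and sending stderr to a temporary file (so a chatty command cannot block on a full pipe) is my assumption about a reasonable design, not part of the proposal above.

```python
import io
import subprocess
import tempfile

import pandas


def read_shell_streaming(command, shell=False, **kwargs):
    """Hypothetical streaming variant of read_shell (illustrative sketch only).

    Extra keyword arguments are passed through to pandas.read_csv.
    """
    # stderr goes to a temporary file so it can be reported on failure
    # without risking a deadlock on an unread pipe.
    with tempfile.TemporaryFile() as err_file:
        proc = subprocess.Popen(command,
                                shell=shell,
                                stdout=subprocess.PIPE,
                                stderr=err_file)
        frame = None
        parse_error = None
        # Wrap the binary pipe as text; UTF-8 matches bytes.decode()'s default.
        with io.TextIOWrapper(proc.stdout, encoding="utf-8") as stream:
            try:
                frame = pandas.read_csv(stream, **kwargs)
            except Exception as exc:  # re-raised below if the command succeeded
                parse_error = exc
        returncode = proc.wait()
        if returncode != 0:
            # A failed command takes priority over any parsing problem.
            err_file.seek(0)
            message = ("Shell command returned non-zero exit status: {0}\n\n"
                       "Command was:\n{1}\n\n"
                       "Standard error was:\n{2}")
            raise IOError(message.format(returncode, command,
                                         err_file.read().decode()))
        if parse_error is not None:
            raise parse_error
        return frame
```

Usage is the same as for read_shell, e.g. read_shell_streaming(command, shell=True, sep='\t', header=None). One caveat of this sketch: if read_csv stops before consuming all output (for example with nrows=...), closing the pipe early may make the command exit with a non-zero status, which would then be reported as an error.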