ENH: Construct dataframe from shell command #16846

Closed
@timothymillar

Problem description

R's data.table library has a function fread that can take a shell command as input to construct a dataframe.

This is highly useful for reading data formats that require a small amount of wrangling (without creating additional files). There are many examples of these formats in bioinformatics (SAM, BAM, GFF, etc.).

As far as I can tell, there is no straightforward equivalent in pandas.

Is there interest in a pull request for an additional function to add this functionality?

Example solution

import io
import pandas
import subprocess

def read_shell(command, shell=False, **kwargs):
    """
    Runs a shell command and reads its standard output into a pandas DataFrame.

    Additional keyword arguments are passed through to pandas.read_csv.

    :param command: a command that writes tabular data to standard output
        (a string when shell=True, otherwise a list of arguments)
    :type command: str or list
    :param shell: passed to subprocess.Popen
    :type shell: bool

    :return: a pandas DataFrame
    :rtype: :class:`pandas.DataFrame`
    """
    proc = subprocess.Popen(command, 
                            shell=shell,
                            stdout=subprocess.PIPE, 
                            stderr=subprocess.PIPE)
    output, error = proc.communicate()
    
    if proc.returncode == 0:
        with io.StringIO(output.decode()) as buffer:
            return pandas.read_csv(buffer, **kwargs)
    else:
        message = ("Shell command returned non-zero exit status: {0}\n\n"
                   "Command was:\n{1}\n\n"
                   "Standard error was:\n{2}")
        raise IOError(message.format(proc.returncode, command, error.decode()))
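
A possible variation on the function above, for outputs too large to hold in memory twice: hand the process's stdout pipe directly to pandas.read_csv instead of decoding everything into a StringIO first. This is only a sketch; the name read_shell_streaming is hypothetical, and it assumes a pandas version whose read_csv accepts binary file handles. Note that stderr is not captured here, and that if the command fails before writing anything, read_csv may raise its own error before the exit status is checked.

```python
import subprocess

import pandas


def read_shell_streaming(command, shell=False, **kwargs):
    """Hypothetical variant of read_shell that parses stdout as it arrives."""
    # stderr is inherited from the parent instead of piped: an unread stderr
    # pipe could fill up and block the child while read_csv consumes stdout.
    proc = subprocess.Popen(command, shell=shell, stdout=subprocess.PIPE)
    try:
        frame = pandas.read_csv(proc.stdout, **kwargs)
    finally:
        # Always release the pipe and reap the child, even if parsing failed.
        proc.stdout.close()
        returncode = proc.wait()

    if returncode != 0:
        raise IOError(
            "Shell command returned non-zero exit status: {0}\n\n"
            "Command was:\n{1}".format(returncode, command)
        )
    return frame
```

The trade-off is that a failing command is only detected after parsing has finished, so the original version remains the safer choice for small outputs.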

Expected usage

command = "samtools view example.bam | head | cut -f 1,2,3,4,5,6,7 -d '\t'"

read_shell(command, shell=True, sep='\t', header=None)  # note options passed to pandas.read_csv
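
For comparison, a pipeline like the one in the usage example can also be built without shell=True by chaining subprocess.Popen objects, which avoids passing a command string through the shell. The sketch below is an illustration only: the two Python one-off scripts are stand-ins (assumptions, not real tools) for a producer such as samtools view and a filter such as grep -v '^@', so that it runs without any bioinformatics software installed.

```python
import subprocess
import sys

import pandas

# Stand-in producer: prints one SAM-style header line, then three
# tab-separated rows.
producer_code = r"""
print('@HD\tVN:1.6')
for i in range(3):
    print('read%d' % i, i * 16, 'chr1', sep='\t')
"""

# Stand-in filter: drops header lines starting with '@'.
filter_code = """
import sys
for line in sys.stdin:
    if not line.startswith('@'):
        sys.stdout.write(line)
"""

producer = subprocess.Popen([sys.executable, "-c", producer_code],
                            stdout=subprocess.PIPE)
consumer = subprocess.Popen([sys.executable, "-c", filter_code],
                            stdin=producer.stdout,
                            stdout=subprocess.PIPE)
# Close the parent's copy of the pipe so only the filter holds the read end.
producer.stdout.close()

with consumer.stdout:
    frame = pandas.read_csv(consumer.stdout, sep="\t", header=None)

# Reap both processes and fail loudly if either step reported an error.
for name, proc in (("filter", consumer), ("producer", producer)):
    if proc.wait() != 0:
        raise IOError("{0} exited with status {1}".format(name, proc.returncode))

print(frame)
```

This is more verbose than a single shell string, which is part of the argument for a convenience function like read_shell.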

Labels: IO Data (IO issues that don't fit into a more specific label)
