SURFRAD site & date-range download #1155

Open
@mikofski

Description

Is your feature request related to a problem? Please describe.
The current SURFRAD iotools function only reads a single-day .dat file from either a URL or the filesystem, e.g.:

# read from url
pvlib.iotools.read_surfrad('ftp://aftp.cmdl.noaa.gov/data/radiation/surfrad/Bondville_IL/2021/bon21001.dat')
# read from file
pvlib.iotools.read_surfrad('bon21001.dat')

Unfortunately, I can't quickly read an entire year or any arbitrarily large date range. I can use pvlib.iotools.read_surfrad in a loop, but it takes a long time to serially read an entire year. Maybe it would be faster if I already had the files downloaded. It takes about 1 second to read a single 111 kB file, so 10,000 files would take about 3 hours, which is too long if I have to read 7 sites.

%%timeit
bon95 = [
    pvl.iotools.read_surfrad(r'ftp://aftp.cmdl.noaa.gov/data/radiation/surfrad/Bondville_IL/1995/bon95%03d.dat' % (x+1))
    for x in range(16)]  # read in 16 files

14.4 s ± 295 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

That's 14.4 s / 16 files = 0.9 s per file. I tried to use threading, but then I got connection errors; I think there's a limit of 5 simultaneous connections to the NOAA FTP server from one computer. Capping at 5 connections should bring it down to about 30 minutes, hmm, maybe I didn't try hard enough? Anyway, I went a different way.
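For reference, a minimal sketch of what a capped thread pool around read_surfrad could look like, assuming concurrent.futures.ThreadPoolExecutor with max_workers=5 to stay under that apparent connection limit (the 5-worker cap and the read_day helper are my assumptions, not anything in pvlib):

# hypothetical sketch: read a range of daily files with at most 5
# concurrent connections, to respect the apparent NOAA FTP limit
from concurrent.futures import ThreadPoolExecutor
import pvlib

URL_FMT = ('ftp://aftp.cmdl.noaa.gov/data/radiation/surfrad/'
           'Bondville_IL/1995/bon95%03d.dat')

def read_day(day_of_year):
    # each call opens its own connection via read_surfrad
    return pvlib.iotools.read_surfrad(URL_FMT % day_of_year)

with ThreadPoolExecutor(max_workers=5) as pool:
    bon95 = list(pool.map(read_day, range(1, 17)))  # first 16 days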

Describe the solution you'd like
The current read_surfrad uses Python's urllib.request.urlopen for each connection. I have found that opening a long-lived FTP connection using Python's ftplib allows downloading many more files by reusing the same connection. However, this download is still serial, so in addition I have found that using Python threading lets me open up to 5 simultaneous connections; any more and I get a 421 FTP error, too many connections.
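A minimal sketch of that approach, assuming anonymous login, one persistent ftplib.FTP connection per worker thread, and 5 workers; the paths and the download_chunk helper here are illustrative, the actual script is in the gist linked below:

# sketch: download one site's files over a few persistent FTP connections
from concurrent.futures import ThreadPoolExecutor
from ftplib import FTP
import os

HOST = 'aftp.cmdl.noaa.gov'
REMOTE_DIR = 'data/radiation/surfrad/Bondville_IL/1995'

def download_chunk(filenames, dest='.'):
    # one long-lived connection reused for this thread's share of files
    with FTP(HOST) as ftp:
        ftp.login()  # anonymous login
        ftp.cwd(REMOTE_DIR)
        for fname in filenames:
            with open(os.path.join(dest, fname), 'wb') as f:
                ftp.retrbinary('RETR ' + fname, f.write)

filenames = ['bon95%03d.dat' % d for d in range(1, 366)]
chunks = [filenames[i::5] for i in range(5)]  # split files across 5 workers
with ThreadPoolExecutor(max_workers=5) as pool:
    list(pool.map(download_chunk, chunks))  # list() surfaces any errors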

Describe alternatives you've considered
I was able to open the FTP site directly in Windows, but that was also a serial connection, so downloading about 10,000 files (roughly 1 GB) would have taken about 4 hours. By contrast, using ftplib and threading I can download all of the data for a single site in about 25 minutes.

Additional context
#590
#595
gist of my working script: https://gist.github.com/mikofski/30455056b88a5d161598856cc4eedb2c
