Fetching millions of rows in Python w/ psycopg2
I ran into an issue with psycopg2 getting "stuck" on querying a large result set. cursor.execute("...") would run fine, and so would cursor.fetchone(), but cursor.fetchmany(BATCH_SIZE) with a moderate value of BATCH_SIZE=100 would get stuck.
It seemed to take an awful lot of time to fetch just 100 rows of the result set. The query wasn't too complex, so it should have executed fairly quickly. Looking at the RDS logs, everything looked fine.
Turned out that psycopg2 fetches the entire result set to the client by default, unless you use a named cursor (a server-side cursor). With a named cursor, the result set is maintained on the server side, allowing you to fetch rows as necessary. You can also control the number of records fetched over the network in each round trip when iterating over the cursor by setting the cursor.itersize property to a reasonable value (the default is 2000).
#!/usr/bin/python

import psycopg2


def main():
    conn_url = 'postgresql://{username}:{password}@{host}:{port}/{dbname}'.format(
        username='',
        password='',
        host='',
        port='',
        dbname='')

    conn = psycopg2.connect(conn_url)

    # a named cursor creates a server-side cursor
    cursor = conn.cursor(name='fetch_large_result')
    cursor.execute('SELECT * FROM <large_table>')

    while True:
        # consume the result over a series of iterations,
        # with each iteration fetching 2000 records
        records = cursor.fetchmany(size=2000)

        if not records:
            break

        for r in records:
            pass  # do something with the record here

    cursor.close()  # don't forget to clean up
    conn.close()


if __name__ == "__main__":
    main()
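
If you'd rather iterate over the cursor directly instead of calling fetchmany() in a loop, itersize is where the batching happens: it controls how many rows psycopg2 pulls from the server per network round trip while iterating. A minimal sketch of that variant, with the connection URL and table name as placeholders just like above:

import psycopg2

conn = psycopg2.connect('postgresql://{username}:{password}@{host}:{port}/{dbname}')

# named (server-side) cursor, consumed via iteration
cursor = conn.cursor(name='fetch_large_result')
cursor.itersize = 5000  # rows fetched per network round trip while iterating
cursor.execute('SELECT * FROM <large_table>')

for row in cursor:
    pass  # do something with the row here

cursor.close()
conn.close()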