Using pandas with a persistent Postgres container (with Docker Toolbox on Windows)

Gabriele Calvo
3 min read · Sep 10, 2018


I wanted to move some tabular data from a bunch of CSV files scattered across different folders into a single Postgres database that I can query more easily, without having to parse the CSV files with pandas every time. And I want to use Docker because, among other reasons, I’ll probably have to move this solution onto a different machine once I’m done prototyping.

So let’s get started!

Let’s pull the official Postgres image and spin up a container we can connect to. From the Docker console:

docker run \
-v /c/Users/gabe/pgdata:/var/lib/postgresql/data \
-e POSTGRES_USER=myusername \
-e POSTGRES_PASSWORD=mypassword \
-e POSTGRES_DB=mydatabase \
-p 5432:5432 \
--name pgtest \
-d postgres

If you are not too familiar with docker:

  • The line starting with “-v” maps a folder on the Windows filesystem to the default Postgres data directory inside the container. This way the data doesn’t get destroyed if the container is deleted, and if we rerun the same command the newly spun-up container will pick up the existing data.
  • The lines starting with “-e” set environment variables that the database uses to create the admin user and password and to name the database. Obviously you should change them if running this in production.
  • “-p” maps a port on the Docker host to a port inside the container. In our case we set them both to the default Postgres port “5432”. Note: since we are using Docker Toolbox, the Docker host is a VirtualBox VM, so its IP is not localhost (the computer’s own IP); by default it should be something like 192.168.99.100.
  • “--name” is just a convenient way to name the container so we can reference it later without having to look up its randomly assigned container id.
  • “-d” runs the container in the background (daemonised mode).

Now that we have the container up and running, we can use Python to do the rest!

Because of the settings we used to run the container, our connection string will be the following:

my_connection_string = 'postgresql://myusername:mypassword@192.168.99.100:5432/mydatabase'
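
The format is just postgresql://user:password@host:port/database, i.e. the values we passed to docker run plus the Docker Toolbox IP. If you’d rather not hard-code the whole thing, you can build it from its parts (a tiny sketch, using the same values as above):

# same connection string, assembled from the values we passed to docker run
user, pwd = 'myusername', 'mypassword'
host, port, db = '192.168.99.100', 5432, 'mydatabase'
my_connection_string = 'postgresql://{}:{}@{}:{}/{}'.format(user, pwd, host, port, db)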

And we can set up the engine we’ll use to connect to the database:

from sqlalchemy import create_engine

engine = create_engine(my_connection_string)

print(engine.table_names())  # this should list the current tables in the
                             # database; as it is a new one it should
                             # return an empty list "[]".
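
If you want to double-check that the connection actually works before going further, you can ask Postgres for its version (a quick sanity check, using nothing beyond what we’ve already set up):

# optional sanity check: open a connection and ask postgres for its version
with engine.connect() as conn:
    print(conn.execute('select version()').scalar())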

We can now use pandas to create a new table, query it, edit the table and delete items from it.

Let’s grab the periodic table from a project I have on GitHub and add it as a table named ‘elements’ in the db.

import pandas as pd

df = pd.read_csv(
    'https://raw.githubusercontent.com/gabrielecalvo/kivy_periodic_table/master/element_properties.csv',
    encoding='latin',
)

# saving all the data to an 'elements' table
df.to_sql('elements', engine)
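
One thing to watch out for: if you rerun the script, to_sql will raise an error because the ‘elements’ table already exists. The if_exists parameter controls what happens in that case, for example:

# overwrite the existing table on reruns, and skip writing the DataFrame index
df.to_sql('elements', engine, if_exists='replace', index=False)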

I stumbled upon a quicker way to bulk-upload DataFrame data to the database, but the standard to_sql call will do for this example.
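
For the curious, one well-known trick (I won’t promise it’s the exact one I found) is to stream the DataFrame through Postgres’ COPY command via the raw psycopg2 connection underneath the engine. A rough sketch, assuming the target table already exists with matching column order and that table_name is trusted, since it’s formatted straight into the SQL:

import io

def copy_dataframe(df, table_name, engine):
    # dump the DataFrame to an in-memory CSV buffer
    buf = io.StringIO()
    df.to_csv(buf, index=False, header=False)
    buf.seek(0)
    # grab the raw psycopg2 connection underneath the sqlalchemy engine
    raw_conn = engine.raw_connection()
    try:
        with raw_conn.cursor() as cur:
            cur.copy_expert('COPY {} FROM STDIN WITH CSV'.format(table_name), buf)
        raw_conn.commit()
    finally:
        raw_conn.close()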

Once the data is in the database you can read it back into a pandas DataFrame using the read_sql_table method, or use read_sql to run queries written in SQL.

# reading the whole 'elements' table back into a DataFrame
pd.read_sql_table('elements', engine)

# querying all the values in column 'Name' using SQL
pd.read_sql('select "Name" from elements', engine)
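
read_sql also accepts query parameters, which comes in handy once you start filtering. For instance, using psycopg2’s %(name)s placeholder style (the like pattern here is just an illustration):

# parameterised query: all elements whose name starts with 'H'
pd.read_sql(
    'select "Name" from elements where "Name" like %(pattern)s',
    engine,
    params={'pattern': 'H%'},
)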

From this point on I’ll hand you over to the capable pandas documentation, where you’ll find a much broader overview of pandas’ SQL input/output capabilities.

Happy coding!


Written by Gabriele Calvo

Energy Engineer working as a Data Scientist with a passion for coding, travelling and learning.
