
Using PostgreSQL to Shape and Prepare Scientific Data


This time we are going to walk through some of the preliminary data shaping steps in data science using SQL in Postgres. I have a long history of working in data science, including my Master's degree (in Forestry) and Ph.D. (in Ecology), and throughout this work I would routinely get raw data files that I had to get into shape to run analysis.

Any time you start to do something new there is always some discomfort. That "why is this so hard" feeling usually stops me from trying something new, but not this time! I tried to do most of my data prep and statistical analysis inside PostgreSQL. I didn't allow myself any copy and paste in Google Sheets, nor could I use Python for scripting. I know this is hardcore and in the end may not be the "best" way to do this. But if I didn't force myself to do it, I would never understand what it would be like.

The Project

Next I needed to pick a modeling project. We have some interesting data sets in the Crunchy Data Demo repository, but nothing that looked right for statistical modeling. I thought about looking for something a bit more relevant to my life.


Photo by keppet on flickr

You may have noticed that we are having quite a few fires here in Northern California. Quite a few Crunchy people live here. I'm in Santa Cruz (with a fire that was less than 7 miles away), Craig lives close to Berkeley, Daniel and Will live in SF. We have compared notes on the sky (fun pic here), the amount of ash on cars, and PurpleAir values.

I thought it would be interesting to see if logistic regression could predict the probability of a fire based on the historical weather and fire patterns. The full details for how we obtained historical California fire data are here, and here is the weather data.

Here's a preview of the fire data:

And here is a preview of the weather data:

Preliminary Import of Data

You can see on the pages above some of the preliminary cleaning we did. Most of it is minor though necessary things like renaming columns, taking a separate date and time column and combining them into a timestamp, or changing the value on an obvious typo.

Let's dig into concatenating the date and time into a timestamp for the weather data (step 5). The old me would have put all the data in a spreadsheet and then used a concatenate function to put the two together. I would make one cell with the formula and then copy and paste it into all the cells.

Instead, this is all we need:

UPDATE weather SET date_time = to_timestamp((date || ' ' || time || '-7'), 'YYYYMMDD HH24:MI');

One of the obvious benefits you should see right away is that we have just one line of code to do what would take many manual steps. Because it's in code, if we need to reimport the data or adapt it for another data set, we just need to reuse this line. We could even use this as part of an automated script.

The other thing you should notice is that we were able to actually add a time zone to the data (the -7). And finally, we were able to act on the data as a whole unit. That means we do the operation quickly without fat-fingering the data or forgetting to paste some set of rows. The data stays together and all the operations happen at "the same time". If we wanted to be even more careful we could have wrapped the statement in a transaction.
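For comparison, here is a minimal Python sketch of the same shaping step, assuming the same 'YYYYMMDD' date and 'HH24:MI' time formats as the SQL above (the function name and sample values are hypothetical):

```python
from datetime import datetime, timedelta, timezone

def combine_date_time(date_str: str, time_str: str) -> datetime:
    """Combine a 'YYYYMMDD' date and 'HH:MM' time into a
    timezone-aware timestamp at UTC-7, mirroring the SQL UPDATE."""
    naive = datetime.strptime(date_str + " " + time_str, "%Y%m%d %H:%M")
    return naive.replace(tzinfo=timezone(timedelta(hours=-7)))

# One function call per row instead of a spreadsheet of copy-pasted formulas:
ts = combine_date_time("20200817", "14:30")
print(ts.isoformat())  # 2020-08-17T14:30:00-07:00
```

The difference, of course, is that the SQL version applies to every row in the table in a single statement with no loop at all.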

Subsetting Our Data

The first thing we are going to do to shape our data is subset the fires. Our weather station is in Northern California, so it probably doesn't make sense to try to predict Southern California fire probability given weather in the north. Here again is a one-liner to subset our data:

SELECT fire19.* INTO ncalfire FROM fire19, fire19_region WHERE st_covers(fire19_region.geom, fire19.geom) AND fire19_region.region = 'Northern';

We selected all the columns from the original data where the Northern Region covers the fire geometry. We use ST_Covers instead of ST_Contains because of some unexpected behaviour in ST_Contains. By using `st_covers`, our operation will not include any fires whose boundaries go outside the Northern Region, for example if the fire spread into Oregon.
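The covers/contains distinction is easiest to see in one dimension. This is a toy plain-Python sketch on intervals, not PostGIS: in the OGC definitions, "contains" fails when the inner geometry lies entirely in the boundary of the outer one, while "covers" only requires that no point fall outside.

```python
def interval_contains(outer, inner):
    """OGC-style contains on 1-D intervals: inner must touch some
    interior point of outer, so lying only on the boundary fails."""
    a1, a2 = outer
    b1, b2 = inner
    inside = a1 <= b1 and b2 <= a2
    only_boundary = b1 == b2 and (b1 == a1 or b2 == a2)
    return inside and not only_boundary

def interval_covers(outer, inner):
    """Covers: no point of inner lies outside outer; boundary is fine."""
    a1, a2 = outer
    b1, b2 = inner
    return a1 <= b1 and b2 <= a2

region = (0.0, 10.0)
print(interval_covers(region, (10.0, 10.0)))    # True: point on the edge
print(interval_contains(region, (10.0, 10.0)))  # False: boundary only
```

The real ST_Contains/ST_Covers semantics are richer than this sketch, but the edge-case flavor is the same.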

Preparing for Logistic Regression

Logistic regression needs the response variable (did a fire occur on that day) to be coded as 1 = fire occurred, 0 = no fire. When I started to examine the data, as I should have known from living in California, I found that multiple fires can start on the same day. So we need to aggregate multiple fires into a single entry per day. With SQL we don't have to lose any information. Here is the one statement that does what we need:

WITH grouped_fire AS (
    SELECT alarm_date, count(*) AS numfires, string_agg(fire_name, ', ') AS names,
       st_collect(geom)::geometry(geometrycollection, 3310) AS geom
    FROM ncalfire GROUP BY alarm_date
)
SELECT w.*, grouped_fire.*, 1 AS hasfire INTO fire_weather FROM weather w, grouped_fire WHERE grouped_fire.alarm_date = w.date_time::date;

The first part of this query, starting with "WITH grouped_fire" and running until the ), is known as a CTE or common table expression. We do all our aggregation in the CTE based on alarm_date (specified by the "GROUP BY alarm_date"). Our output will be, for each alarm date:

  1. The number of fires that started on that day.
  2. All the fire names aggregated into one string separated by ", " (we could have also used an array of strings).
  3. The geometries of all the different fires on that day collected into one geometry collection.

Now we can use the data from the CTE like a table in the second part of the query. In this part we keep all the columns from the CTE along with all the columns from the original table, and then we just put a 1 in for all entries in an output column named "hasfire".

As a side note, I'm a big fan of bringing clarity to boolean columns by prefixing them with "is" or "has" so it's clear what the true state means.

Then we join the CTE data to the original table by matching dates. We truncate the timestamp column by casting it to a date. Notice we are selecting into a brand new table: the output of this command will actually go into a table instead of displaying on the screen. With that we now have a table with all the weather data, the dates of fires, and a 1 in the hasfire column.
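If it helps to see the aggregation and join outside of SQL, here is a toy plain-Python sketch of what the CTE and the second half of the query do (the rows and the column subset are made up for illustration):

```python
from collections import defaultdict
from datetime import date

# Toy rows standing in for ncalfire: (alarm_date, fire_name)
fires = [
    (date(2019, 8, 1), "CARMEL"),
    (date(2019, 8, 1), "RIVER"),
    (date(2019, 8, 3), "WALKER"),
]
# Toy rows standing in for weather: date -> average temperature
weather = {date(2019, 8, 1): 31.2, date(2019, 8, 2): 28.0, date(2019, 8, 3): 33.5}

# The CTE: group the fires by alarm_date.
grouped = defaultdict(list)
for alarm_date, name in fires:
    grouped[alarm_date].append(name)

# The join: weather columns + per-day aggregates + a literal hasfire = 1.
fire_weather = [
    {"date": d, "avg_temp": weather[d], "numfires": len(names),
     "names": ", ".join(names), "hasfire": 1}
    for d, names in grouped.items() if d in weather
]
print(fire_weather[0]["names"])  # CARMEL, RIVER
```

In the SQL version, `count(*)` and `string_agg` do the counting and name-joining, and the `INTO fire_weather` clause persists the result instead of leaving it in memory.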

Fixing our Geometry

You may notice above that we had to make a geometrycollection to group the fire geometries together. Most desktop and other GIS applications don't know what to do with a geometrycollection geometry type. To fix this problem we are going to cast the geometries to multipolygons. A multipolygon allows a single row to contain multiple polygons. The simplest example of this is a city or property that includes islands. A multipolygon allows the city, along with its islands, to be seen as a single entry in the table. Here is the call we use:

ALTER TABLE fire_weather ALTER COLUMN geom TYPE geometry(multipolygon,3310) USING st_collectionextract(geom, 3);

ST_CollectionExtract takes a multigeometry, collects all the geometries of the type you ask for, and returns a multigeometry of the requested type. The 3 in st_collectionextract is the "magic number" meaning polygons.
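A toy sketch of what that extraction does, with the magic numbers spelled out (this is an illustration in plain Python, not the PostGIS implementation):

```python
# The ST_CollectionExtract "magic numbers":
# 1 = points, 2 = linestrings, 3 = polygons.
EXTRACT_TYPES = {1: "Point", 2: "LineString", 3: "Polygon"}

def collection_extract(geoms, type_code):
    """Toy stand-in for ST_CollectionExtract: keep only the members
    of a geometry collection matching the requested base type."""
    wanted = EXTRACT_TYPES[type_code]
    return [g for g in geoms if g[0] == wanted]

# A pretend collection of (type, coordinates) pairs:
collection = [("Polygon", "..."), ("Point", "..."), ("Polygon", "...")]
print(len(collection_extract(collection, 3)))  # 2
```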

Creating Our Non-fire Data

We created our fire data and fixed its geometry. Now we have to go back to the weather data and assign 0 to all the days where there was no fire:

WITH non_fire_weather AS (
    SELECT weather.* FROM weather WHERE id NOT IN (SELECT id FROM fire_weather)
)
SELECT non_fire_weather.*, null::date AS alarm_date, 0::bigint AS numfires, null::text AS names, null::geometry(multipolygon,3310) AS geom, 0 AS hasfire INTO non_fire_weather FROM non_fire_weather;

Again we start with a CTE, but this time the first query is used to find all the days that are NOT in the new fire + weather table above. This gives us a "table" with just weather data on days where a fire didn't occur. Then in the second part of the query we add in all the columns that we added to our fire_weather table. Finally we put the results into a brand new table called "non_fire_weather".
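The NOT IN pattern here is an anti-join, which is easy to sketch with sets. A toy Python version, with made-up ids and the same NULL-padding idea:

```python
# Toy ids standing in for the weather and fire_weather tables.
weather_ids = {1, 2, 3, 4, 5}
fire_weather_ids = {2, 4}

# The CTE: weather rows whose id is NOT IN fire_weather.
non_fire_ids = sorted(weather_ids - fire_weather_ids)

# The second half of the query: pad the fire columns with
# NULL-like placeholders and a literal hasfire = 0.
non_fire_weather = [
    {"id": i, "alarm_date": None, "numfires": 0, "names": None,
     "geom": None, "hasfire": 0}
    for i in non_fire_ids
]
print(non_fire_ids)  # [1, 3, 5]
```

Padding the non-fire rows with typed NULLs is what lets the two tables line up column-for-column in the UNION that comes next.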

Now we combine the fire_weather and non_fire_weather tables into one master table called alldata:

SELECT * INTO alldata FROM non_fire_weather UNION SELECT * FROM fire_weather;

We did this merging with the full data set in case we want to do different types of analyses with this information. For example, we have counts of fires per day and we also have the total area burned (from the polygons). These variables may lead to other interesting analyses down the road.
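As a small aside, UNION (without ALL) also removes exact duplicate rows while combining the two sets. A toy sketch of that behavior with made-up rows, where the two inputs are disjoint by construction:

```python
# Toy (date, hasfire) rows standing in for the two tables.
fire_rows = [("2019-08-01", 1), ("2019-08-03", 1)]
non_fire_rows = [("2019-08-02", 0)]

# UNION: combine both row sets, dropping exact duplicates.
alldata = sorted(set(fire_rows) | set(non_fire_rows))
print(len(alldata))  # 3
```

Since our fire and non-fire days cannot overlap, UNION ALL would have worked just as well here.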

Wrap Up

Let's wrap up and look at all the good work we were able to do using SQL for data science. We took some tables imported from raw data, turned dates and times into more usable formats, aggregated the fire events, kept the geometries intact, created a code for event versus non-event, and created a large master data set.

All of this was done with lines of code, no manual steps. Doing our data shaping this way makes our process easily repeatable if we need to do it again. We could very well put it in a script and automate the importing of new data. On a personal note, it makes my life so easy when I have to build a new version of the dataset or import the data to a new server.

I hope you found this code and the examples useful for your data science work (here is the GitHub repo for it). What's your experience with using SQL to shape your data before doing analysis? Do you have some tips or tricks you would like to share? Leave us a comment below or on the Crunchy Data Twitter account. Have fun with your analysis, and code on!

Cover image by Annette Spithoven, NL
