A colleague has built a web app with a PHP framework in which we can configure API calls to other systems. These run overnight and pull new data into a Postgres database. Since Postgres is an OLTP database and not built for analytics, I started reading about Redshift. But I just can't figure out how all of this fits together.
Oh, and for analytics we would look at Power BI, which could use DirectQuery with Redshift. As far as I can see, there is no such option for Postgres.
So for my question I will split everything into four parts: the current setup and three possible solutions.
| Solution | Application | Userdata | Data     | Datawarehouse |
| -------- | ----------- | -------- | -------- | ------------- |
| Now      | PHP         | Postgres | Postgres |               |
| 1.       | PHP         | Postgres | Postgres | Redshift      |
| 2.       | PHP         | Postgres |          | Redshift      |
| 3.       | PHP         | Redshift |          | Redshift      |
So the question is: which of these solutions is the "right" one? I could keep the infrastructure we already have and just add Redshift, but then I double the cost for storage. Alternatively, I could keep the application data in a smaller database and write the data from the APIs directly into Redshift, or use Redshift as the only database.
Your question is not very clear on how you intend to use the databases; however, the best recommendation is to try using a "normal" database (in your case, PostgreSQL) for everything.
If you discover that your analytics queries are taking too long and you have millions or billions of rows in the database, you could then consider adding Amazon Redshift for faster analytical queries. If your queries are read-only, you could also consider Amazon Athena, which can read data directly out of files stored in Amazon S3.
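If the nightly jobs already dump the raw API output to S3, the Athena route can be as simple as the sketch below. The bucket, database, and table names are made-up placeholders, and the files would first need to be registered as an Athena table (e.g. via a Glue crawler or a `CREATE EXTERNAL TABLE` statement):

```python
import time
import boto3

# Assumption: files in a hypothetical bucket "my-api-dumps" have been registered
# as the Athena table "api_data.orders". All names here are placeholders.
athena = boto3.client("athena", region_name="eu-west-1")

query = athena.start_query_execution(
    QueryString="SELECT order_date, count(*) FROM orders GROUP BY order_date",
    QueryExecutionContext={"Database": "api_data"},
    ResultConfiguration={"OutputLocation": "s3://my-api-dumps/athena-results/"},
)
query_id = query["QueryExecutionId"]

# Poll until the query finishes, then print the result rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])
```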
What purpose does the Postgres database serve in this scenario?
I'd suggest writing the output of the API calls directly to S3 and loading it into Redshift from there.
If these API responses are in JSON (likely), you may want to flatten them into CSVs for loading into Redshift; Redshift's JSON loading is quite limited.
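A minimal sketch of that pipeline, assuming the nightly job hands the parsed JSON records to a loader like the one below. The bucket, table, cluster endpoint, and IAM role are hypothetical placeholders, and psycopg2 is used because Redshift speaks the Postgres wire protocol:

```python
import csv
import io
import boto3
import psycopg2

def load_api_response(records):
    """Flatten a list of JSON objects to CSV, stage it on S3, COPY into Redshift."""
    # 1. Flatten to CSV in memory (assumes flat-ish JSON; nested fields would
    #    need to be expanded or dropped before this step).
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=sorted(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)

    # 2. Stage the file on S3 (bucket and key are placeholders).
    boto3.client("s3").put_object(
        Bucket="my-staging-bucket",
        Key="api/orders/2024-01-01.csv",
        Body=buf.getvalue().encode("utf-8"),
    )

    # 3. COPY into Redshift; endpoint, table, and IAM role are placeholders.
    conn = psycopg2.connect(
        host="my-cluster.abc123.eu-west-1.redshift.amazonaws.com",
        port=5439, dbname="analytics", user="loader", password="placeholder",
    )
    with conn, conn.cursor() as cur:
        cur.execute("""
            COPY staging.orders
            FROM 's3://my-staging-bucket/api/orders/2024-01-01.csv'
            IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
            CSV
            IGNOREHEADER 1;
        """)
```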
The two systems have different backend infrastructure and serve very specific purposes. They might be used interchangeably when dealing with small amounts of data, but the difference becomes drastic once bulk reads and writes are involved.
Here I assume that when you say you're using Postgres, you mean a row-oriented setup.
For writing bulk data, a row-oriented DB is preferred since it handles writes well, whereas a columnar DB is used when your operations query across many rows (a typical requirement for analytics). The best mix is usually to keep the transactional data in the row-oriented DB, migrate the tables required for analytics to the columnar DB, and run the analytical queries there, as sketched below. This might sound redundant and expensive, but it is exactly how some companies operate when they do not want to compromise on either transactional or analytical data.
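As a sketch of that split (not the only way to do it): a nightly job could copy just the tables needed for analytics out of the row-oriented Postgres into the columnar store. All connection details, bucket, and table names below are placeholders.

```python
import io
import boto3
import psycopg2

def sync_table_to_redshift(table):
    """Export one Postgres table to CSV on S3 and reload it into Redshift."""
    # Export from the row-oriented transactional DB (connection details are placeholders).
    pg = psycopg2.connect(host="postgres.internal", dbname="app",
                          user="app", password="placeholder")
    buf = io.StringIO()
    with pg, pg.cursor() as cur:
        cur.copy_expert(f"COPY {table} TO STDOUT WITH CSV HEADER", buf)

    # Stage the export on S3.
    key = f"sync/{table}.csv"
    boto3.client("s3").put_object(Bucket="my-staging-bucket", Key=key,
                                  Body=buf.getvalue().encode("utf-8"))

    # Reload into the columnar DB, replacing yesterday's copy.
    rs = psycopg2.connect(host="my-cluster.abc123.eu-west-1.redshift.amazonaws.com",
                          port=5439, dbname="analytics",
                          user="loader", password="placeholder")
    with rs, rs.cursor() as cur:
        cur.execute(f"TRUNCATE analytics.{table};")
        cur.execute(f"""
            COPY analytics.{table}
            FROM 's3://my-staging-bucket/{key}'
            IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
            CSV
            IGNOREHEADER 1;
        """)
```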
If yours is a product-based company handling heavy (financial) transactions and you also capture user-persona data, split the two across row-oriented and column-oriented stores respectively.
A row-oriented DB is built for write-intensive work. When the application issues bulk transactional writes, they have to land in the tables without any lag. You will most likely have a master-slave configuration as well, so the data also has to be replicated to the replicas, and in near real time.
One now has to understand that analytical data is very different from transactional data. Transactional data is not voluminous - say, a row created in an orders table that maps a user_id to some basic order_details for each order placed. Analytics data - click patterns on the screen, details of notifications sent, etc. - is generated every time a user lands on the app; it is voluminous and cannot be stored the same way we store transactional data.
A columnar orientation (as in Amazon Redshift) is read-intensive - a typical requirement for analytical data, since a large number of rows is retrieved for a given user set: details of all the notifications sent, or all the screens browsed/clicked by the user. A columnar DB is tailor-made for such requirements.
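For illustration, a query of that shape typically touches only a couple of columns of a wide event table, which is exactly what a columnar engine is optimized for; the events table and its columns here are invented for the example:

```python
import psycopg2

# Hypothetical wide "events" table in Redshift; the query scans only two of its
# many columns, which a columnar store can read without touching the rest.
conn = psycopg2.connect(host="my-cluster.abc123.eu-west-1.redshift.amazonaws.com",
                        port=5439, dbname="analytics",
                        user="analyst", password="placeholder")
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT user_id, count(*) AS screens_viewed
        FROM analytics.events
        WHERE event_type = 'screen_view'
          AND event_time >= current_date - 7
        GROUP BY user_id
        ORDER BY screens_viewed DESC
        LIMIT 100;
    """)
    for user_id, screens_viewed in cur.fetchall():
        print(user_id, screens_viewed)
```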
Bulk writes into a columnar DB are slow, but since it deals mostly with analytical data, not having the data in real time is not critical. Analytics takes time anyway, and data up to current_date - 1, or with a lag of n hours, can always be used to draw a user persona.
For a large company with a voluminous dataset, this tradeoff needs to be maintained. I hope you now have a rough idea of how to go about it.