Storing 6 billion floats for easy file access

I need to save 250 data files an hour, each with 36000 small arrays of [date, float, float, float], in Python, and I need to be able to read them somewhat easily with PHP. This needs to run for 10 years minimum, on 6 TB of storage.

What is the best way to save these individual files? I am thinking of Python's struct, but it starts to look like a bad fit once the data amounts get this large.

Example of the data:

a = [["2016:04:03 20:30:00", 3.423, 2.123, -23.243], ["2016:23:.....], ......]

Edit: Space is more important than unpacking speed and computation, since space is the real limiting factor.

So you have 250 data providers of some kind, which are providing 10 samples per second of (float, float, float).

Since you didn't specify what your limitations are, there are several options.


Binary files

You could write each file as a fixed array of 3 * 36000 floats with struct; at 4 bytes each that gets you 432,000 bytes per file. You can encode the hour in the directory name and the data provider's id in the file name.
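For illustration, a minimal sketch of that fixed layout using struct; the directory and file naming scheme here is my own assumption:

    import struct

    def write_hour_file(path, samples):
        """Write one hour of samples (36000 tuples of 3 floats) as raw binary.

        Timestamps are implied by the file/directory name and the sample
        position, so they are not stored in the file itself.
        """
        packer = struct.Struct("<3f")          # three little-endian 4-byte floats
        with open(path, "wb") as f:
            for sample in samples:
                f.write(packer.pack(*sample))  # 12 bytes per sample, 432,000 per file

    # hypothetical layout: data/2016-04-03/20/provider_042.bin

On the PHP side the same 12-byte records can be decoded with unpack(), as long as both ends agree on the byte order.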

If your data isn't too random, a decent compression algorithm should shave off enough bytes, but you would probably need some sort of delayed compression if you don't want to lose data.

numpy

An alternative to packing with struct is numpy.tofile, which stores the array directly to file. It is fast, but it always stores the data in C format, so you need to take care if the endianness on the target machine is different. With numpy.savez_compressed you can store a number of arrays in one npz archive and compress it at the same time.
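A rough sketch of both numpy variants (file names are made up); pinning the dtype to little-endian float32 keeps the byte layout predictable across machines:

    import numpy as np

    data = np.random.rand(36000, 3).astype("<f4")   # force little-endian float32

    # Raw dump: exactly 432,000 bytes, no header; the reader must know shape and dtype
    data.tofile("provider_042_hour_20.bin")
    restored = np.fromfile("provider_042_hour_20.bin", dtype="<f4").reshape(-1, 3)

    # Compressed archive: one or more named arrays per .npz, zlib-compressed
    np.savez_compressed("provider_042_hour_20.npz", samples=data)
    loaded = np.load("provider_042_hour_20.npz")["samples"]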

JSON, XML, CSV

Any of the mentioned formats is a good option. Also worth mentioning is the JSON Lines format, where each line is a JSON-encoded record. This enables streaming writes, since the file stays in a valid format after each write.

They are simple to read, and the syntactic overhead goes away with compression. Just don't build them by string concatenation; use a real serializer library.
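A minimal JSON Lines sketch using the standard json module, with a made-up file name and the record layout from the example data above:

    import json

    records = [
        ["2016:04:03 20:30:00", 3.423, 2.123, -23.243],
        ["2016:04:03 20:30:00", 3.424, 2.125, -23.241],
    ]

    # One JSON document per line: the file is valid after every write,
    # so a crash mid-hour loses at most the last record.
    with open("provider_042_hour_20.jsonl", "a", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")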

(SQL) Database

Seriously, why not use a real database?

Obviously you will need to do something with the data. At 10 samples per second, no human will ever need that much raw data, so you will have to do aggregations: minimum, maximum, average, mean, sum, etc. Databases already have all of this, and in combination with their other features they can save you a ton of time that you would otherwise spend writing oh so many scripts and abstractions over files. Not to mention just how cumbersome the file management becomes.

Databases are extensible and supported by many languages. You save a datetime into the database with Python and you read a datetime back with PHP. No hassle over how you are going to encode your data.

Databases support indexes for faster lookup.

My personal favourite is PostgreSQL, which has a number of nice features. It supports the BRIN index, a lightweight index type that is perfect for huge datasets with naturally ordered fields, such as timestamps. If you're low on disk space, you can extend it with cstore_fdw, a column-oriented data store that supports compression. And if you still want to use flat files, you can write a foreign data wrapper (also possible in Python) and still use SQL to access the data.
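As a sketch of what that could look like from Python, assuming psycopg2, a PostgreSQL 9.5+ server, and table/column names of my own invention; the BRIN index goes on the timestamp column:

    import psycopg2

    conn = psycopg2.connect("dbname=sensors user=collector")  # connection details are assumptions
    cur = conn.cursor()

    cur.execute("""
        CREATE TABLE IF NOT EXISTS samples (
            provider_id integer     NOT NULL,
            ts          timestamptz NOT NULL,
            a real, b real, c real
        )
    """)
    # BRIN stays tiny and works well because rows arrive in timestamp order
    cur.execute("CREATE INDEX IF NOT EXISTS samples_ts_brin ON samples USING brin (ts)")

    cur.executemany(
        "INSERT INTO samples (provider_id, ts, a, b, c) VALUES (%s, %s, %s, %s, %s)",
        [(42, "2016-04-03 20:30:00", 3.423, 2.123, -23.243)],
    )
    conn.commit()

    # Aggregation comes for free, e.g. hourly averages per provider:
    cur.execute("""
        SELECT provider_id, date_trunc('hour', ts), avg(a), min(b), max(c)
        FROM samples GROUP BY 1, 2
    """)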

Unless you're consuming the files in the same language, avoid language-specific formats and structures. Always.

If you're going between two or more languages, use a common, plain-text data format like JSON or XML that can be easily (often natively) parsed by most languages and tools.

If you follow this advice and you're storing plain text, then use compression on the stored files; that's how you conserve space. Typical well-structured JSON tends to compress really well (assuming simple text content).

Once again, choose a compression format like gzip that's widely supported by languages or their core libraries. PHP, for example, has a native gzopen() function, and Python has the gzip module in its standard library.
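A small sketch combining gzip with the JSON Lines idea above (file name is made up):

    import gzip
    import json

    records = [["2016:04:03 20:30:00", 3.423, 2.123, -23.243]]

    # Write JSON Lines straight into a gzip stream.
    with gzip.open("provider_042_hour_20.jsonl.gz", "at", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

On the PHP side the same file can be read line by line with gzopen()/gzgets() and json_decode().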

I doubt it is possible without extremely efficient compression.

6 TB / 10 years / 365 days / 24 hours / 250 files ≈ 270 KB per file.

That is the ideal case; in the real world, the filesystem cluster size matters too.

If you have 36,000 “small arrays” to fit into each file, you have only about 7 bytes per array, which is not enough to store even a proper datetime object alone.
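Spelled out (decimal units, ignoring cluster size):

    TOTAL_BYTES = 6 * 10**12                 # 6 TB
    files = 10 * 365 * 24 * 250              # 21,900,000 files over 10 years
    per_file = TOTAL_BYTES / files           # ~274,000 bytes, roughly 270 KB
    per_array = per_file / 36000             # ~7.6 bytes per [date, float, float, float]
    print(round(per_file), round(per_array, 1))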

One idea that comes to mind if you want to save space: store only the values and discard the timestamps. Produce files containing only data and build a kind of index (a formula) that, given a timestamp (year/month/day/hour/min/sec...), gives you the position of the data inside the file (and of course which file to look in). If you think it through, you will find that with a "smart" naming scheme for the files you can avoid storing the year/month/day/hour at all, since part of the index can be the file name itself. It all depends on how you implement your "index" system, but pushed to the extreme you can forget about timestamps entirely and focus only on the data; see the sketch below.
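A minimal sketch of such an index, assuming one file per provider per hour, fixed 12-byte records (three float32 values, no timestamp), one sample every 0.1 s, and a naming scheme of my own invention:

    from datetime import datetime

    RECORD_SIZE = 12          # 3 * 4-byte floats, timestamp not stored
    SAMPLES_PER_SECOND = 10   # assumption: one sample every 0.1 s

    def locate(provider_id, ts: datetime):
        """Map a timestamp to (file path, byte offset). The path encodes
        year/month/day/hour, so none of that is stored inside the file."""
        path = f"data/{ts:%Y/%m/%d/%H}/provider_{provider_id:03d}.bin"
        second_of_hour = ts.minute * 60 + ts.second
        sample_index = second_of_hour * SAMPLES_PER_SECOND + ts.microsecond // 100000
        return path, sample_index * RECORD_SIZE

    # locate(42, datetime(2016, 4, 3, 20, 30, 0))
    # -> ('data/2016/04/03/20/provider_042.bin', 216000)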

Regarding the data format, as mentioned above, I would definitely go for a language-independent format such as XML or JSON... Who knows which languages and possibilities you will have in ten years ;)