How to optimize this legacy webapp with a bad database, without redoing the database

Question modified slightly to improve understandability

My goal is to optimize a web application which has a very bad DB design and whose DB I can't touch (I can't alter tables or introduce a new DB). I can operate on the code itself, on the filesystem, or via a proxy. By "optimizing" I mean: reduce the requests that hit the webapp (as opposed to those served directly from the filesystem), keep DB queries to a minimum, and reduce the number of distinct URLs called (with caching in mind).

Let me try to construct a fictitious example, just to provide something to talk upon. Let us imagine this scenario:

  • I have a PHP webapp which exposes a database of a million different people. Each person decided at some point whether they are happy or sad
  • when I visit person.php?id=x (x = 1, ..., 1000000), the page creates a link to show_picture_of_person.php?id=x. show_picture_of_person.php looks inside the DB of a million rows, and the DB tells me whether the person is sad or happy by returning an image. I don't know what this image is unless I extract it from the DB; once extracted, I can analyze it and work out whether it's a sad face or a happy face. show_picture_of_person.php actually outputs an image, and the DB stores the image itself in a BLOB. Images are always either sad.jpg or happy.jpg.

What I would like to have, instead of a million links to show_picture_of_person.php?id=x, is two links: one for sad.jpg and one for happy.jpg. Possible solutions in my mind:

  1. I write a script which calls all the possible combinations of show_picture_of_person.php, saves all the images, works out which ones are identical, and then writes a lookup table. I put that lookup table in a PHP function make_sensible_url("show_picture_of_person.php?id=x") -> happy.jpg, which is called from the person.php script. I am worried here about the performance of the PHP engine itself (an array like that would by itself be a 50+ MB file!)
  2. Same as above, but instead of constructing an array in PHP, I create a filesystem of a million text files, and inside each text file I put the name of the actual static image file (avoiding repetitions). The function make_sensible_url("show_picture_of_person.php?id=x") simply reads and outputs the content of the file (see the sketch after this list). I like this approach as it makes no DB calls, and reading from the fs should be faster than the huge PHP array in solution 1.
  3. I change person.php so that there is no link to show_picture_of_person.php, and instead I inline the images as data: URIs. The issue with this is that, if I have x calls to person.php, I still have 2x calls to the DB (one for person.php's own data and one to fetch the image that show_picture_of_person.php used to serve); I would like to have only x calls to the DB. This also increases the size of the page (in my real case I have ~20 images on one page, so lots of queries and lots of bytes)
  4. don't know what else..
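
For concreteness, here is a minimal sketch of what solution 2's make_sensible_url could look like (the lookup/ directory layout is invented and error handling is omitted):

// Sketch of solution 2: one tiny text file per id under lookup/,
// each containing the name of a static image (e.g. "happy.jpg")
function make_sensible_url ( string $url ): string {
    parse_str((string)parse_url($url, PHP_URL_QUERY), $params);
    $id = (int)($params['id'] ?? 0);
    return trim(file_get_contents(__DIR__ . "/lookup/{$id}.txt"));
}

// person.php would then emit:
echo '<img src="' . make_sensible_url("show_picture_of_person.php?id=42") . '">';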

How would you solve this? Thanks!


For completeness' sake, here was the original question:

This is the scenario:

  • a database with various tables whose data is not properly indexed (let's say, for argument's sake, that we have 5000 unique objects represented in around 50,000 rows - so duplicates are present)
  • we are in a situation in which the database is not modifiable (this also means that we can't add another table)
  • we have a php app exposing these objects
  • there exist around 1 million PHP calls (all legitimate), each of which returns one of those 5000 objects (e.g.: bad_url.php?id=bar, bad_url.php?id=foo, ..)
  • there is no easy way to programmatically decide which of the 5000 objects will be returned

Our goal is to somehow convert the million+ calls into calls of the form giveme.php?id=x, where x is one of the 5000 unique objects.

Just to give you an idea of a theoretical approach:

  • we could index all the million calls and map each one to the distinct object it returns
  • we could create a hash table or something similar, plus a PHP function which would work as give_me_nice_url("bad_url.php?....").
  • my feeling is that such a solution would result in a 50-100 MB array .. I'm not sure how performant it would be running in real time under load.

My question is: which approach would you use to solve this issue and handle the large data set? Is there a better way than a lookup table as in my solution? Remember that I can't use a database in the final production setup.

I would cache the results of show_picture_of_person.php?id=x to the file system, which is similar to your approach #2.

However, you may want to consider using a caching library instead of rolling your own. Some frameworks such as Laravel come with caching options, or you could use a third-party caching library such as https://github.com/tedious/Stash

Here's an example for Stash:

// show_picture_of_person.php
require 'vendor/autoload.php'; // Composer autoloader for the Stash library

if (!isset($_GET['image_path'])) {
    // Create Driver with default options
    $driver = new Stash\Driver\FileSystem(array());
    // Inject the driver into a new Pool object.
    $pool = new Stash\Pool($driver);
    // Get a cache item.
    $item = $pool->getItem('image_path_' . $_GET['id']);

    // Attempt to get the data
    $image_path = $item->get();

    // Check to see if the data was a miss.
    if($item->isMiss())
    {
        // Let other processes know that this one is rebuilding the data.
        $item->lock();

        // Run intensive code
        $image_path = codeThatTakesALongTime(); // save image to disk here

        if ($image_path) {
            // Store the expensive to generate data.
            $pool->save($item->set($image_path));
        }
    }
}
else {
    $image_path = $_GET['image_path'];
}

// Continue as normal.
useDataForStuff($image_path);

In person.php?id=x you can now check the cache above for key x. If it's populated, render show_picture_of_person.php?image_path=[sad/happy].jpg; if it's not populated, render show_picture_of_person.php?id=x, which will populate the cache once clicked.
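
As a sketch, assuming the same Stash file-system pool configuration as above, person.php could do something like this:

// person.php (sketch)
require 'vendor/autoload.php';

$pool = new Stash\Pool(new Stash\Driver\FileSystem(array()));
$item = $pool->getItem('image_path_' . (int)$_GET['id']);
$image_path = $item->get();

if (!$item->isMiss()) {
    // Cache hit: link straight to the already-known static image
    echo '<img src="show_picture_of_person.php?image_path=' . urlencode($image_path) . '">';
}
else {
    // Cache miss: fall back to the DB-backed URL, which fills the cache once clicked
    echo '<img src="show_picture_of_person.php?id=' . (int)$_GET['id'] . '">';
}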

Now that I understand what you want to do, I think there are two logical cases.

One case is that you really just want to cache the "happy/sad" logic. Here I'd still suggest Varnish or a similar solution.

In this case, the HTML your app outputs doesn't change (<img src=show_picture_of_person.php?id=x>, presumably), and it's cached at every level - from browser through to your reverse proxy.

Modify your PHP code to set a really long TTL for the images coming from show_picture_of_person.php?id=x - depending on the business logic (do people go from happy to sad? do you need to invalidate the cache?), this could be hours, days, weeks or decades.
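
For example, a long TTL can be set from PHP with standard HTTP caching headers (the one-year value below is only an illustration):

// show_picture_of_person.php - before any image bytes are output
$ttl = 365 * 24 * 3600; // one year, as an example
header('Cache-Control: public, max-age=' . $ttl);
header('Expires: ' . gmdate('D, d M Y H:i:s', time() + $ttl) . ' GMT');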

Then create a rule in Varnish to honour that TTL when handling requests for the images.

This way, the request hits your PHP code only when the item cannot be found in the cache.

The other option is that you want to move the logic "is person x happy or sad" away from retrieving an image through an expensive database call, and instead bake that into some other data storage mechanism, and your HTML code instead becomes <img src=happy.jpg>.

As you cannot store this in the database, you're looking at other options - effectively storing this in the filesystem, either as one huge array, or as lots of tiny files.

There are a couple of issues with this - the best place to store the data is in the database; separating this out into a different location creates maintenance and understandability challenges. One huge array doesn't really scale. Storing and accessing lots of tiny files is slow, and some filesystems limit the number of files in a directory, or slow down beyond a certain limit.

The biggest challenge with storing the data as files on the hard drive is that your application will become I/O bound from a scalability point of view. It's hard to say whether this will be a real problem - I'd run some fairly serious scalability testing using something like Apache JMeter.

My recommendation would be to use one of the many off-the-shelf PHP caching solutions, rather than bake your own. Which one depends on what you can install on your environment.
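
As one example, if the APCu extension happens to be available on the host, an in-memory lookup is only a few lines (expensiveDbLookup() is a hypothetical stand-in for the real query):

// Sketch using APCu, if installed; expensiveDbLookup() is hypothetical
$key = 'person_mood_' . $id;
$mood = apcu_fetch($key, $hit);
if (!$hit) {
    $mood = expensiveDbLookup($id); // the real DB call goes here
    apcu_store($key, $mood, 86400); // cache for a day
}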

In some SQL dialects there is the concept of a 'CURSOR', similar to returning 10 or 50 search results per page.

If you use this, you can query for a suitably sized (or even user-selectable) range of rows to show on a page, and you can also calculate arbitrary ranges of the same number of rows to present a list of links for browsing to the next or previous results.

Calculating the total number of such pages is more expensive.
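
As a sketch of the same idea using the more common LIMIT/OFFSET pagination (table and column names are invented, and $pdo is assumed to be an existing PDO connection):

$perPage = 50;
$page = max(1, (int)($_GET['page'] ?? 1));

$stmt = $pdo->prepare('SELECT id, name FROM people ORDER BY id LIMIT :lim OFFSET :off');
$stmt->bindValue(':lim', $perPage, PDO::PARAM_INT);
$stmt->bindValue(':off', ($page - 1) * $perPage, PDO::PARAM_INT);
$stmt->execute();
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);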

Your option of having a file for each record would be easy, but also very bloated. If you know the image is one of 2 images, you can cut the overhead substantially: use a file of 8-bit characters, and let each of those 8 bits represent either happy (set to 1) or sad (set to 0). That already gives you a factor-of-8 reduction, so for a million entries your file will be at most around 125 KB.

It's then a case of reading and writing individual bits to the file. A couple of small functions to do this and some test code...

<?php
error_reporting ( E_ALL );
ini_set ( 'display_errors', 1 );

function getState ( int $user )  {
    $fp = fopen ( "users.dat", "c+" );
    // Find the byte in the file (position is $user >> 3,
    // which is effectively / 8)
    fseek($fp, $user >> 3, SEEK_SET );
    // Read the single char from the file
    $flagByte = fread($fp,1);
    // Extract the bit needed
    //  ord() converts a char to an ascii value ( 0-255)
    //  If the byte hasn't been read - use PHP 7 ?? to set it to a 0
    //  $user & 7 gets the bit position and shifts this bit to position 0
    //  & 1 extracts just this bit
    $flag = (ord($flagByte[0]??chr(0)) >> ($user & 7 )) & 1;
    fclose($fp);
    return $flag;
}

function setState ( int $user, bool $status )  {
    $fp = fopen ( "users.dat", "c+" );
    fseek($fp, $user >> 3, SEEK_SET );
    // Fetch the existing data
    $flagByte = fread($fp,1);
    $flagByte = ord($flagByte[0]??chr(0));
    // Get position of flag
    $flag = 1 << ($user & 7 );
    // Either set or unset the appropriate bit
    if ( $status )  {
        $flagByte |= $flag;
    }
    else    {
        $flagByte &= ~$flag;
    }
    fseek($fp, $user >> 3, SEEK_SET );
    fwrite($fp, chr($flagByte));
    fclose($fp);
}

setState(1, false);
setState(2, true);
setState(3, true);
setState(4, false);
setState(71, true);
setState(600100, false);
setState(600102, true);

echo "User: 1:".getState(1).PHP_EOL;
echo "User: 71:".getState(71).PHP_EOL;
echo "User: 600100:".getState(600100).PHP_EOL;
echo "User: 3:".getState(3).PHP_EOL;
echo "User: 600102:".getState(600102).PHP_EOL;
echo "User: 4:".getState(4).PHP_EOL;
echo "User: 871:".getState(871).PHP_EOL;
echo "User: 3:".getState(3).PHP_EOL;

I'm sure there are things you can improve in the code. In particular, if it were put in a class (as sketched below), you could open and close the file once rather than on every call; though if only one record is going to be tested per request, it won't make much difference.
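
For example, a minimal class-based variant (names invented) that keeps the handle open for the lifetime of the object:

class UserStateFile {
    private $fp;

    public function __construct ( string $path )    {
        $this->fp = fopen($path, "c+");
    }

    public function getState ( int $user ): int {
        fseek($this->fp, $user >> 3, SEEK_SET);
        $flagByte = fread($this->fp, 1);
        return (ord($flagByte[0] ?? chr(0)) >> ($user & 7)) & 1;
    }

    public function __destruct ()   {
        fclose($this->fp);
    }
}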

Update: Assuming that you want to keep track of an image against a user, this method has a slower add cycle (as it checks whether the image is already there), but access is much more direct. The concept uses two files: one for the list of image names and one for the image associated with each user. The main thing is that when adding a new image, the code checks whether the image is already there; if so, it returns that image's position in the file, and if not, it simply appends it at the EOF. All names are just terminated by a PHP_EOL, so there is no limit on the image name and no space is wasted on fixed-size blocks. The user file just holds a pointer into this image file; for simplicity this is a 4-byte unsigned integer, so for a million users that is about 4 MB - not that much.

function imageIndex ( string $addImage ): int {
    $images = fopen ( "images.dat", "c+" );
    while ( true )  {
        $pos = ftell($images);
        $image = fgets($images);
        if ( $image === false || rtrim($image, PHP_EOL) == $addImage )  {
            break;
        }
    }

    if ( $image === false ) {
        fseek($images, 0, SEEK_END);
        $pos = ftell($images);
        fwrite($images, $addImage.PHP_EOL);
    }
    fclose ( $images);
    return $pos;
}

function addUserImage ( int $userID, string $image )    {
    // Note: this users.dat stores 4 bytes per user - a different layout
    // from the bit-flag users.dat in the earlier example
    $users = fopen ( "users.dat", "c+" );
    // Fetch image location
    $image = imageIndex($image);
    // Locate user indicator (4 bytes per user)
    $loc = $userID << 2;
    fseek($users, $loc);
    // Write the location as an unsigned integer (4 bytes)
    fwrite($users, pack("L", $image));
    fclose ( $users);
}

function fetchUserImage ( int $userID ): string {
    $users = fopen ( "users.dat", "c+" );
    $images = fopen ( "images.dat", "c+" );
    // Locate user indicator
    $loc = $userID << 2;
    fseek($users, $loc);
    $imgRef = fread($users,4);
    // Convert the 4 chars to a PHP integer
    $imgLoc = unpack("Lloc", $imgRef);
    fseek($images, $imgLoc["loc"]);
    $image = fgets($images);
    fclose ( $users);
    fclose ( $images);

    return rtrim($image,PHP_EOL);
}

// Create 4000 users with some image
// for ( $i=0; $i<2000; $i++ )    {
//     addUserImage($i,"Image{$i}.jpg");
// }
// for ( $i=0; $i<2000; $i++ )    {
//     $ino = 2000 - $i;
//     addUserImage($i+2000,"Image{$ino}.jpg");
// }

// Fetch the image for 2000 users
for ( $i=0; $i < 4000; $i+=2) {
    echo "User {$i} image=".fetchUserImage($i).PHP_EOL;
}