Handling binary data and mb_ function overloading?

I have a piece of code that I need either reassurance or a firm "no no no!" about: am I thinking about this in the right way, or entirely the wrong way?

It involves cutting a variable of binary data at a specific byte offset while the multi-byte overloaded functions are active. For example, substr is actually mb_substr, strlen is mb_strlen, and so on.

Our server is set to UTF-8 internal encoding, so there's this weird little thing I do to circumvent it for the binary data manipulation:

// $binary_data is the incoming variable holding the binary data
// $clip_size is generally 16, 32 or 64 etc.
$curenc = mb_internal_encoding(); // this should be "UTF-8"
mb_internal_encoding('ISO-8859-1'); // single-byte charset, so the overloaded functions count bytes
if (strlen($binary_data) >= $clip_size) {
    $first_hunk = substr($binary_data, 0, $clip_size);
    $rest_of_it = substr($binary_data, $clip_size);
} else {
    // skip, since it's shorter than expected
}
mb_internal_encoding($curenc); // put this back now

I can't really show input and output results, since it's binary data. But tests using the above appear to work just fine, and nothing is breaking...

However, parts of my brain are screaming "what are you doing... this can't be the way to handle this"!

Notes:

  • The incoming binary data is a concatenation of those two parts to begin with.
  • The first part's size is always known (but it changes).
  • The second part's size is entirely unknown.
  • This is pretty darn close to encrypting something, stuffing the IV on the front and ripping it off again (oddly enough, I found some old code which does this same thing lol ugh).

So, I guess my question is:

  • Is this actually fine to be doing?
  • Or is there something super obvious I'm overlooking?

MY SOLUTION TO THE WORRY

I dislike answering my own questions... but I wanted to share what I have decided on nonetheless.

Although what I had "worked", I still wanted to get rid of the hack of temporarily altering the charset encoding. It was old code, I admit, but for some reason I never looked at bin2hex and hex2bin for doing this. So I decided to change it to use those.

The resulting new code:

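// $clip_size keeps the same value for continuity later;
// it is only adjusted on the spot here, because each byte becomes two hex digits (hence the *2).
$hex_data   = bin2hex($binary_data);
$first_hunk = hex2bin(substr($hex_data, 0, $clip_size * 2));
$rest_of_it = hex2bin(substr($hex_data, $clip_size * 2));
// compare against '' rather than using empty(): empty("0") is true in PHP,
// so a remainder consisting of the single byte "0" would be skipped
if ($rest_of_it !== false && $rest_of_it !== '') { /* process the result for reasons */ }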

Using the hex functions turns the mess into plain ASCII, which mb will not screw with either way. A benchmark loop of 1 million iterations showed the overhead wasn't anything to worry about (and it's safer to run in parallel with itself than the mb_internal_encoding mangle method, since it changes no global state).
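To illustrate why the hex detour is safe, here is a minimal sketch of the same idea wrapped in a function. The name split_via_hex is made up for this example and is not part of the original code. The point is that after bin2hex every byte is exactly two ASCII characters, so a character-based substr and a byte-based substr cut in the same place.

```php
<?php
// Hypothetical helper (name invented for illustration): split binary data
// into the first $clipSize bytes and the remainder, going through hex.
function split_via_hex(string $binaryData, int $clipSize): array
{
    // Every input byte becomes exactly two characters from [0-9a-f].
    $hex = bin2hex($binaryData);

    // The hex string is pure ASCII, so the offsets are the same whether
    // substr counts bytes or characters.
    $first = hex2bin(substr($hex, 0, $clipSize * 2));
    $rest  = hex2bin((string) substr($hex, $clipSize * 2));

    return [$first, $rest];
}

// Bytes that are not valid UTF-8 survive the round trip unchanged.
list($head, $tail) = split_via_hex("\xff\x00\x80abc", 3);
```

The cost is a temporary copy that is twice the size of the input, which is worth keeping in mind for large payloads.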

So I'm going with this. It sits better in my mind, and resolves my question for now... until I revisit this old code again in a few years and go "what was I thinking ?!".

> However, parts of my brain are screaming "what are you doing... this can't be the way to handle this"!

Your brain is right: you shouldn't be doing that in PHP in the first place. :)

> Is this actually fine to be doing?

It depends on the purpose of your code.

I can't see any reason off the top of my head to cut a binary string like that. So my first instinct would be "no no no!": use unpack() to properly parse the binary into usable variables.
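As a rough sketch of what that could look like for the two-part layout described in the question: the function name split_with_unpack and the keys head and rest are invented for this example. It relies on the "a" format code, which reads raw bytes and (since PHP 5.5) keeps trailing NUL bytes, and on "a*", which takes everything that is left.

```php
<?php
// Hypothetical helper (name invented for illustration): split binary data
// into a fixed-size head and a variable-size rest using unpack().
function split_with_unpack(string $binaryData, int $clipSize): ?array
{
    if ($clipSize < 1) {
        return null;
    }

    // "a{N}head" reads N raw bytes into key "head";
    // "a*rest" reads all remaining bytes into key "rest".
    // unpack() warns and returns false when the input is too short,
    // so the warning is suppressed and false is mapped to null.
    $parts = @unpack("a{$clipSize}head/a*rest", $binaryData);
    if ($parts === false) {
        return null;
    }

    return [$parts['head'], $parts['rest']];
}

$result = split_with_unpack("\x01\x02\x03\x04rest\x00", 4);
```

unpack() is not one of the functions replaced by mbstring overloading, so it works on bytes regardless of the internal encoding.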

That being said, if you just need to split your binary because reasons, then I guess this is fine. As long as your tests confirm that the code works for you, I can't see any problem.

As a side note, I don't use mbstring overloading, precisely because of this kind of use case, i.e. whenever you need the default byte-oriented string functions.
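If overloading has to stay enabled, one common workaround is to call the mb_ functions directly with the '8bit' encoding, which makes them count bytes without touching the global internal encoding. The helper names byte_strlen and byte_substr below are made up for this sketch. Note also that mbstring.func_overload was deprecated in PHP 7.2 and removed in PHP 8.0, so on current versions plain strlen and substr are always byte-based.

```php
<?php
// Hypothetical helpers (names invented for illustration) that always
// operate on bytes, whether or not mbstring overloading is active.
function byte_strlen(string $data): int
{
    // With an explicit '8bit' encoding, mb_strlen counts bytes.
    return function_exists('mb_strlen') ? mb_strlen($data, '8bit') : strlen($data);
}

function byte_substr(string $data, int $start, ?int $length = null): string
{
    if (function_exists('mb_substr')) {
        // A null length means "up to the end of the string".
        return mb_substr($data, $start, $length, '8bit');
    }
    return $length === null ? substr($data, $start) : substr($data, $start, $length);
}

// "\xC3\xA9" is one UTF-8 character but two bytes.
$data = "\xC3\xA9\xFF\x00tail";
$len  = byte_strlen($data);
$head = byte_substr($data, 0, 3);
$rest = byte_substr($data, 3);
```

Because the encoding is passed per call, nothing has to be saved and restored afterwards, which avoids the global-state problem of the mb_internal_encoding approach.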