Every time I do a regex split in PHP, the first and last strings in the returned array are empty

I am building a platform for learning Japanese, and I have over 2000 hiragana, katakana and kanji with their respective romaji (the romanized spelling of how each one is pronounced) that I want to insert into a MySQL database. The problem is that I have them in a string like this (these are just the katakana; imagine over 2000 more characters on top of this):

    $string = "a    ア   ka  カ   sa  サ   ta  タ   na  ナ
    i   イ   ki  キ   shi シ   chi チ   ni  ニ
    u   ウ   ku  ク   su  ス   tsu ツ   nu  ヌ
    e   エ   ke  ケ   se  セ   te  テ   ne  ネ
    o   オ   ko  コ   so  ソ   to  ト   no  ノ
    ha  ハ   ma  マ   ya  ヤ   ra  ラ   wa  ワ
    hi  ヒ   mi  ミ           ri  リ   (wi)    ヰ
    fu  フ   mu  ム   yu  ユ   ru  ル   n   ン
    he  ヘ   me  メ           re  レ   (we)    ヱ
    ho  ホ   mo  モ   yo  ヨ   ro  ロ   (w)o    ヲ   ga  ガ   za  ザ   da  ダ   ba  バ   pa  パ
    gi  ギ   ji  ジ   ji  ヂ   bi  ビ   pi  ピ
    gu  グ   zu  ズ   zu  ヅ   bu  ブ   pu  プ
    ge  ゲ   ze  ゼ   de  デ   be  ベ   pe  ペ
    go  ゴ   zo  ゾ   do  ド   bo  ボ   po  ポ

    kya キャ  sha シャ  cha チャ  hya ヒャ  pya ピャ
    kyu キュ  shu シュ  chu チュ  hyu ヒュ  pyu ピュ
    kyo キョ  sho ショ  cho チョ  hyo ヒョ  pyo ピョ

    gya ギャ  ja  ジャ  nya ニャ  bya ビャ  mya ミャ
    gya ギュ  ju  ジュ  nyu ニュ  byu ビュ  my  ミュ
    gyo ギョ  jo  ジョ  nyo ニョ  byo ビョ  myo ミョ
    rya リャ  ryu リュ  ryu リョ  (ja)    ヂャ  (ju)    ヂュ";

So far I have managed to split the string into Japanese characters and romaji, but the tabs also end up in the result, and the first and last elements of the array are empty strings.

You should consider exploding the string into an array, using the tab as delimiter. Once you have the array you can loop through it separating out the characters. That's how I'd start.

php.net will be a great resource for you; check out the explode() function there.
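One caveat with explode(): it takes a single literal delimiter, so with rows separated by newlines and columns separated by repeated whitespace it tends to leave empty strings in the result. A minimal sketch of one alternative, assuming the columns are separated by tabs or runs of spaces, is preg_split() with the PREG_SPLIT_NO_EMPTY flag. The shortened sample string below is only for illustration:

```php
<?php
// Shortened sample of the data from the question (assumption: columns are
// separated by tabs or runs of spaces, rows by newlines).
$string = "a\tア\tka\tカ\n"
        . "i   イ   ki  キ\n"
        . "(wi)    ヰ\n";

// Split on any run of whitespace. PREG_SPLIT_NO_EMPTY drops the empty
// strings that leading or trailing whitespace would otherwise produce.
$tokens = preg_split('/\s+/u', $string, -1, PREG_SPLIT_NO_EMPTY);

// Group the flat token list into [romaji, kana] pairs.
$pairs = array_chunk($tokens, 2);

print_r($pairs);
```

Without the flag, the trailing newline in the sample would produce an empty last element, which is the symptom described in the question.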

Try

    preg_match_all('/(\S+)\s+(\S+)\s*/u', $string, $matches, PREG_SET_ORDER);
    print_r($matches);

This searches for the pattern: a run of non-whitespace characters, whitespace, another run of non-whitespace characters, optional whitespace. The pattern is then matched repeatedly across the entire string.
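With PREG_SET_ORDER, each entry of $matches holds the whole match at index 0 and the two captured groups at indexes 1 and 2. A minimal sketch of turning that into rows ready for insertion, assuming a pattern of the form /(\S+)\s+(\S+)/u and a shortened sample string (the 'romaji' and 'kana' key names are only illustrative):

```php
<?php
// Shortened sample of the question's data.
$string = "a    ア   ka  カ\ni   イ   ki  キ\n";

preg_match_all('/(\S+)\s+(\S+)/u', $string, $matches, PREG_SET_ORDER);

$rows = [];
foreach ($matches as $m) {
    // $m[0] is the whole match, $m[1] the romaji, $m[2] the kana.
    $rows[] = ['romaji' => $m[1], 'kana' => $m[2]];
}

print_r($rows);
```

The rows are kept as a list rather than an array keyed by romaji, because the question's data contains repeated romaji (ji and zu each appear twice) and a keyed array would silently overwrite the earlier entry.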

I'm not sure what kind of output you want from your regex, but this gives you a 2D array in which each sub-array holds two elements (every two words read become one new sub-array of the main array). Note that the pattern keeps the parentheses around entries such as (ja) and (ju); let me know if you need them stripped. It is also fragile: words are paired purely by position, so a missing or extra word in $string misaligns every pair after it, and a trailing unpaired word is silently dropped. Let me know if you need that changed:

    $arr = array();
    // Collect every run of non-whitespace characters (the u flag treats the input as UTF-8).
    preg_match_all('/(?<=^|\s)\S+(?=\s|$)/mu', $string, $arr);
    $count = (int)(count($arr[0])/2);
    // Overwrite element $i with the pair built from words 2*$i and 2*$i+1.
    for($i = 0; $i < $count; $i++)
        $arr[0][$i] = array($arr[0][$i*2], $arr[0][$i*2+1]);
    // Keep only the first $count elements, which now hold the pairs.
    $arr = array_slice($arr[0], 0, $count);

    echo $arr[0][0].': '.$arr[0][1];      // Outputs "a: ア"
    echo $arr[107][0].': '.$arr[107][1];  // Outputs "(ju): ヂュ"
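If the position-based pairing is a concern, one possible variation is to refuse input with an odd number of words and to make the handling of parentheses explicit. This is only a sketch under those assumptions; the function name pairTokens and its $stripParens parameter are hypothetical, not part of PHP:

```php
<?php
/**
 * Hypothetical helper: pair whitespace-separated tokens as [romaji, kana].
 */
function pairTokens(string $string, bool $stripParens = false): array
{
    // Collect every run of non-whitespace characters.
    preg_match_all('/\S+/u', $string, $m);
    $tokens = $m[0];

    // An odd token count means at least one pair would be misaligned.
    if (count($tokens) % 2 !== 0) {
        throw new InvalidArgumentException('Odd number of tokens; pairs would be misaligned.');
    }

    $pairs = array_chunk($tokens, 2);

    if ($stripParens) {
        foreach ($pairs as &$pair) {
            // Remove every "(" and ")" from the romaji, so "(w)o" becomes "wo".
            $pair[0] = str_replace(['(', ')'], '', $pair[0]);
        }
        unset($pair); // break the reference left over from the loop
    }

    return $pairs;
}

$pairs = pairTokens("(ja)    ヂャ  (ju)    ヂュ", true);
echo $pairs[1][0] . ': ' . $pairs[1][1], PHP_EOL; // Outputs "ju: ヂュ"
```

The odd-count check only catches a single missing word. Two missing words would still pass, so spot-checking a few known pairs after parsing is still worthwhile.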

Try this:

    <?php
    $string =
       "a    ア   ka  カ   sa  サ   ta  タ   na  ナ
        ...";

    // |<-----------------------GRP#0------------------------>|
    // |GRP#01|      |<--------------GRP#02-------------->|
    //                                         |<-GRP#03->|
    // romans spaces non-spaces ignored-spaces '('romans')' opt-spaces
    preg_match_all('/([a-z]+)[ \t]+([^ \t]+(?:[ \t]+)(\([a-z]+\))?)[ \t]*/', $string, $matches, PREG_SET_ORDER);
    print_r($matches);

You should get an array of 103 elements and the last element should look like this:

    Array
    (
        [0] => ryu リョ  (ja)    
        [1] => ryu
        [2] => リョ  (ja)
        [3] => (ja)
    )

I think this is self-explanatory; if not, let me know.

Hope this helps.