I am making a platform to learn Japanese, and I have over 2000 hiragana, katakana and kanji with their respective romaji (the sound each one makes when you pronounce it) that I want to insert into a MySQL database. The problem is that I have them in a string like this (these are just the katakana; imagine over 2000 more Asian characters!):
$string = "a ア ka カ sa サ ta タ na ナ
i イ ki キ shi シ chi チ ni ニ
u ウ ku ク su ス tsu ツ nu ヌ
e エ ke ケ se セ te テ ne ネ
o オ ko コ so ソ to ト no ノ
ha ハ ma マ ya ヤ ra ラ wa ワ
hi ヒ mi ミ ri リ (wi) ヰ
fu フ mu ム yu ユ ru ル n ン
he ヘ me メ re レ (we) ヱ
ho ホ mo モ yo ヨ ro ロ (w)o ヲ ga ガ za ザ da ダ ba バ pa パ
gi ギ ji ジ ji ヂ bi ビ pi ピ
gu グ zu ズ zu ヅ bu ブ pu プ
ge ゲ ze ゼ de デ be ベ pe ペ
go ゴ zo ゾ do ド bo ボ po ポ
kya キャ sha シャ cha チャ hya ヒャ pya ピャ
kyu キュ shu シュ chu チュ hyu ヒュ pyu ピュ
kyo キョ sho ショ cho チョ hyo ヒョ pyo ピョ
gya ギャ ja ジャ nya ニャ bya ビャ mya ミャ
gya ギュ ju ジュ nyu ニュ byu ビュ my ミュ
gyo ギョ jo ジョ nyo ニョ byo ビョ myo ミョ
rya リャ ryu リュ ryu リョ (ja) ヂャ (ju) ヂュ";
So far I have managed to split the string into Asian characters and romaji, but the tabs end up in the result as well, and there are blank elements at the start and end of the array.
You should consider exploding the string into an array, using the tab as delimiter. Once you have the array you can loop through it separating out the characters. That's how I'd start.
php.net is going to be a great resource for you; check out the explode() function.
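As a rough sketch of that approach, assuming the data really is tab-separated with one row per line (the short sample string and the variable names are made up for illustration):

```php
<?php
// Hypothetical sample: two rows of tab-separated romaji/character pairs.
$string = "a\tア\tka\tカ\ni\tイ\tki\tキ\n";

// Treat line breaks as delimiters too, then split on tabs.
$tokens = explode("\t", str_replace(["\r\n", "\n"], "\t", trim($string)));

// Drop empty entries left by doubled delimiters and reindex.
$tokens = array_values(array_filter($tokens, 'strlen'));

// Walk the list two items at a time: romaji first, character second.
$pairs = [];
for ($i = 0; $i + 1 < count($tokens); $i += 2) {
    $pairs[] = [$tokens[$i], $tokens[$i + 1]];
}

print_r($pairs);
```

Trimming the string first and filtering out empty entries is what avoids the blank elements at the start and end of the array.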
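Try:
preg_match_all('/(\S+)\s+(\S+)\s*/u', $string, $matches, PREG_SET_ORDER);
print_r($matches);
This searches for the pattern: non-whitespace, whitespace, non-whitespace, optional whitespace, repeated across the entire string. The u modifier makes the pattern treat the string as UTF-8. In each match, element 1 is the romaji and element 2 is the character.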
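To illustrate what PREG_SET_ORDER returns, here is a small sketch on a made-up two-line sample, with the pattern written using \s+ between the two groups and the u modifier. The $pairs array and its keys are only illustrative names:

```php
<?php
// Hypothetical sample in the same layout as the question's string.
$string = "a ア ka カ\ni イ (wi) ヰ";

// Group 1 = romaji, group 2 = character; /u treats the input as UTF-8.
preg_match_all('/(\S+)\s+(\S+)\s*/u', $string, $matches, PREG_SET_ORDER);

// With PREG_SET_ORDER each match is its own row: [whole match, group 1, group 2].
$pairs = [];
foreach ($matches as $m) {
    $pairs[] = ['romaji' => $m[1], 'character' => $m[2]];
}

echo count($pairs), "\n";                                        // 4
echo $pairs[3]['romaji'], ' => ', $pairs[3]['character'], "\n";  // (wi) => ヰ
```

Note that parenthesised romaji such as (wi) come through with their parentheses intact, since \S+ matches any non-whitespace.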
I'm not sure what kind of output you want from your regex, but if you use this you'll get a 2D array in which each sub-array contains two elements (every two words read become a new pair in the main array). Note that, as written, \S+ keeps the parentheses around (ja) and (ju); let me know if you need those stripped. It's also fairly fragile: if there is an odd number of words in $string, the last word has no partner and is silently dropped. Let me know if you need that changed:
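$arr = array();
// Collect every whitespace-delimited word; /u treats the input as UTF-8.
preg_match_all('/(?<=^|\s)\S+(?=\s|$)/mu', $string, $arr);
// Number of complete (romaji, character) pairs.
$count = (int)(count($arr[0]) / 2);
// Overwrite the first $count slots with pairs built from words 2i and 2i+1.
for ($i = 0; $i < $count; $i++) {
    $arr[0][$i] = array($arr[0][$i * 2], $arr[0][$i * 2 + 1]);
}
// Keep only the pairs and drop the leftover single words.
$arr = array_slice($arr[0], 0, $count);
echo $arr[0][0].': '.$arr[0][1]; // Outputs "a: ア"
echo $arr[107][0].': '.$arr[107][1]; // Outputs "(ju): ヂュ"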
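If you do want the parentheses removed and the unpaired trailing word reported, one possible variation is to split on whitespace and chunk the words in twos. This is only a sketch and not the code above; the sample string and variable names are made up:

```php
<?php
// Hypothetical sample with an unpaired trailing word ("ru").
$string = "(ja) ヂャ (ju) ヂュ ru";

// Split on any run of whitespace, skipping empty pieces.
$words = preg_split('/\s+/u', $string, -1, PREG_SPLIT_NO_EMPTY);

// An odd count means the last word has no partner; set it aside.
$leftover = (count($words) % 2 === 1) ? array_pop($words) : null;

$pairs = [];
foreach (array_chunk($words, 2) as [$romaji, $character]) {
    // Remove only the parentheses, e.g. "(ja)" -> "ja".
    $pairs[] = [str_replace(['(', ')'], '', $romaji), $character];
}

print_r($pairs);
if ($leftover !== null) {
    echo "Unpaired word: $leftover\n";
}
```

Be aware that stripping parentheses this way also turns (w)o into wo, which may or may not be what you want.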
Try this:
<?php
$string =
"a ア ka カ sa サ ta タ na ナ
...";
// The pattern below expects tabs or newlines between the tokens.
// |<-----------------------GRP#0------------------------>|
// |GRP#01| |<--------------GRP#02-------------->|
// |<-GRP#03->|
// romans spaces non-spaces ignored-spaces '('romans')' opt-spaces
preg_match_all('/([a-z]+)[\n\t]+([^\n\t]+(?:[\n\t]+)(\([a-z]+\))?)[\n\t]*/',
    $string, $matches, PREG_SET_ORDER);
print_r($matches);
You should get an array of 103 elements and the last element should look like this:
Array ( [0] => ryu リョ (ja) [1] => ryu [2] => リョ (ja) [3] => (ja) )
I think this is self-explanatory; if not, let me know.
Hope this helps.