I have a problem: I want to fill a list with the names of all pages in my wiki. My script:
$TitleList = [];
$nsList = [];
$nsURL = 'wiki/api.php?action=query&meta=siteinfo&siprop=namespaces|namespacealiases&format=json';
$nsJson = file_get_contents($nsURL);
$nsJsonD = json_decode($nsJson, true);
foreach ($nsJsonD['query']['namespaces'] as $ns)
{
if ( $ns['id'] >= 0 )
array_push ($nsList, $ns['id']);
}
# populate the list of all pages in each namespace
foreach ($nsList as $n)
{
$urlGET = 'wiki/api.php?action=query&list=allpages&apnamespace='.$n.'&format=json';
$json = file_get_contents($urlGET);
$json_b = json_decode( $json ,true);
foreach ($json_b['query']['allpages'] as $page)
{
echo("
".$page['title']);
array_push($TitleList, $page["title"]);
}
}
But about 35% of the pages that I can visit on my wiki are still missing (tested with "random site"). Does anyone know why this could happen?
The MediaWiki API doesn't return all results at once; it returns them in batches. A default batch is only 10 pages; you can specify aplimit to change that (500 max for users, 5,000 max for bots).
To get the next batch, you need to specify the continue= parameter; in each batch, you will also get a continue property in the returned data, which you can use to ask for the next batch. To get all pages, you must loop as long as a continue element is present.
For example, on the English Wikipedia, this would be the first API call: https://en.wikipedia.org/w/api.php?action=query&list=allpages&apnamespace=0&format=json&aplimit=500&continue=
...and the continue object will be this: "continue":{ "apcontinue":"\"Cigar\"_Daisey", "continue":"-||" }
(Updated according to comment by OP, with example code)
You would now want to flatten the continue array into URL parameters, for example using http_build_query().
See the more complete explanation here: https://www.mediawiki.org/wiki/API:Query#Continuing_queries
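To make the flattening step concrete, here is a minimal sketch that takes the continue object from the example response above and turns it into a query string. The variable names ($continue, $next, $url) are only illustrative; the base URL is the Wikipedia call shown earlier.

```php
<?php
// The continue object from the example response, as a PHP array
// (this is what json_decode($json, true)['continue'] would hold).
$continue = [
    'apcontinue' => '"Cigar"_Daisey',
    'continue'   => '-||',
];

// http_build_query() URL-encodes each value and joins the pairs with "&".
$next = http_build_query($continue);
echo $next, "\n"; // apcontinue=%22Cigar%22_Daisey&continue=-%7C%7C

// Append the result to the base query to request the next batch.
$url = 'https://en.wikipedia.org/w/api.php?action=query&list=allpages'
     . '&apnamespace=0&format=json&aplimit=500&' . $next;
```

The encoding matters here: the quotes and pipes in the continue values would not survive as raw characters in a URL.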
A working version of your code should look like this (tested against Wikipedia with slightly different code):
# populate the list of all pages in each namespace
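foreach ($nsList as $n) {
// Build the URL inside the loop so that $n is defined. Increase aplimit if you are using a bot, up to 5,000
$baseUrl = 'wiki/api.php?action=query&list=allpages&apnamespace='.$n.'&format=json&aplimit=500&';
$next = 'continue='; // an empty continue= asks for the first batch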
while ( isset( $next ) ) {
$urlGET = $baseUrl . $next;
$json = file_get_contents($urlGET);
$json_b = json_decode($json, true);
foreach ($json_b['query']['allpages'] as $page)
{
echo("
".$page['title']);
array_push($TitleList, $page["title"]);
}
if (isset($json_b['continue'])) {
$next = http_build_query($json_b['continue']);
} else {
$next = null; // no continue element: this was the last batch, so leave the loop
}
}
}
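If you prefer something reusable, the same continuation loop can be wrapped in a function. This is only a sketch under a few assumptions: fetchAllTitles() and the $fetch callable are hypothetical names, not part of MediaWiki or PHP, and the fetcher is passed in so the loop can be tried without a live wiki.

```php
<?php
// Sketch: collect every page title in one namespace by following "continue".
// $fetch is a hypothetical callable (URL string in, decoded JSON array out).
function fetchAllTitles(string $apiUrl, int $namespace, callable $fetch): array
{
    $titles = [];
    $params = [
        'action'      => 'query',
        'list'        => 'allpages',
        'apnamespace' => $namespace,
        'aplimit'     => 500,   // up to 5,000 for bots
        'format'      => 'json',
        'continue'    => '',    // empty value asks for the first batch
    ];

    do {
        $data = $fetch($apiUrl . '?' . http_build_query($params));
        foreach ($data['query']['allpages'] ?? [] as $page) {
            $titles[] = $page['title'];
        }
        // Overwrite the previous parameters with the returned continue values.
        $more = isset($data['continue']);
        if ($more) {
            $params = array_merge($params, $data['continue']);
        }
    } while ($more);

    return $titles;
}

// A fetcher built from the same calls used in the code above.
$fetch = function (string $url): array {
    $json = file_get_contents($url);
    return $json === false ? [] : (json_decode($json, true) ?? []);
};
```

You would then call fetchAllTitles('wiki/api.php', $n, $fetch) once per namespace ID in $nsList and merge the results into $TitleList.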