Symfony2 DomCrawler and an FB2 book format parser

Hi all!

How do I correctly parse the XML file described below with the Symfony2 DomCrawler component?

I need to split the book into sections and, for each section, collect only the inner tags (epigraph, p, poem, etc.) that belong to that section itself.

I have a standard FB2 book in the XML format shown below:

<?xml version="1.0" encoding="utf-8"?>
<FictionBook xmlns="http://www.gribuser.ru/xml/fictionbook/2.0" xmlns:l="http://www.w3.org/1999/xlink">
<description></description>
<body>
<section>
    <title><p><strong>Level 1, section 1</strong></p></title>
    <section>
        <title><p><strong>Level 2, section 2</strong></p></title>
        <section>
            <title><p><strong>Level 3, section 3</strong></p></title>
            <p>Level 3, section 3, paragraph 1</p>
            <poem>
                <stanza>
                    <v>bla-bla-bla 1</v>
                    <v>bla-bla-bla 2</v>
                    <v>bla-bla-bla 3</v>
                </stanza>
            </poem>
            <p>Level3, section 3, paragraph 2</p>
            <subtitle><strong>x x x</strong></subtitle>
        </section>
        <section>
            <title><p><strong>Level 3, section 4</strong></p></title>
            <p>Level 3, section 4, paragraph 1</p>
            <p>Level 3, section 4, paragraph 2</p>
            <subtitle><strong>x x x</strong></subtitle>
        </section>
        <section>
            <title><p><strong>Level 3, section 5</strong></p></title>
            <p>Level 3, section 5, paragraph 1</p>
            <p>Level 3, section 5, paragraph 2</p>
            <p>Level 3, section 5, paragraph 3</p>
            <empty-line/>
            <subtitle>This file was created</subtitle>
            <subtitle>with BookDesigner program</subtitle>
            <subtitle>bookdesigner@the-ebook.org</subtitle>
            <subtitle>22.04.2004</subtitle>
        </section>
    </section>
</section>
</body>
</FictionBook>

The code below does not work, so could somebody help me solve this? By the way, the title is parsed correctly, but the section's inner tags are not.

private function loadBookSections(Crawler $crawler)
{
    $sections = $crawler->filter('section')->each(function(Crawler $node) {
        $c = $node->filter('section')->reduce(function(Crawler $node, $i) {
            return ($i == 0);
        });

        return array(
            'title' => $node->filter('title')->text(),
            'inner' => $c->html(),
        );
    });

    echo "*******************************************\n";

    foreach ($sections as $section) {
        echo ">>> ".$section['title']."\n";
        echo "!!! ".$section['inner']."\n";
    }
}

Thanks for your help!

After four days, I found a solution using XPath:

private function loadBookSections(Crawler $crawler)
{

    $sections = $crawler->filter('section')->each(function(Crawler $node) {
        return array(
            'title' => $node->filter('title')->text(),
            'inner' => $node->filterXPath("//*[not(section)]")->html(),
        );
    });

    foreach ($sections as $section) {
        echo "TITLE: ".$section['title']."\n";
        echo "INNER: ".$section['inner']."\n";
    }
}
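
To show what "only the tags that belong to this section" can look like in XPath, here is a minimal sketch that uses PHP's built-in DOM extension rather than DomCrawler. It is not the code above: it uses a different expression, `*[not(self::fb:section) and not(self::fb:title)]`, to pick a section's direct children while skipping nested sections and the title. The `fb` prefix, the helper name `collectSections` and the shortened sample book are assumptions made for this example only.

```php
<?php
// Hypothetical helper: returns one entry per <section>, holding its title and
// the markup of its own direct children (nested sections are left out).
function collectSections(string $xml): array
{
    $doc = new DOMDocument();
    $doc->loadXML($xml);

    $xpath = new DOMXPath($doc);
    // FB2 declares a default namespace, so XPath needs an explicit prefix for it.
    // The prefix name "fb" is an arbitrary choice for this sketch.
    $xpath->registerNamespace('fb', 'http://www.gribuser.ru/xml/fictionbook/2.0');

    $result = [];
    foreach ($xpath->query('//fb:section') as $section) {
        // First <title> that is a direct child of this section, if any.
        $title = $xpath->query('fb:title', $section)->item(0);

        $inner = '';
        // Direct children only, excluding nested sections and the title.
        $children = $xpath->query('*[not(self::fb:section) and not(self::fb:title)]', $section);
        foreach ($children as $child) {
            $inner .= $doc->saveXML($child);
        }

        $result[] = [
            'title' => $title !== null ? trim($title->textContent) : '',
            'inner' => $inner,
        ];
    }

    return $result;
}

// Shortened stand-in for the book in the question.
$xml = <<<XML
<FictionBook xmlns="http://www.gribuser.ru/xml/fictionbook/2.0">
<body>
<section>
    <title><p>Level 1</p></title>
    <section>
        <title><p>Level 2</p></title>
        <p>Paragraph 1</p>
        <p>Paragraph 2</p>
    </section>
</section>
</body>
</FictionBook>
XML;

$sections = collectSections($xml);

foreach ($sections as $section) {
    echo "TITLE: ", $section['title'], "\n";
    echo "INNER: ", $section['inner'], "\n";
}
```

In this sketch the outer section ends up with an empty 'inner' value, because its only children are a title and a nested section, while the nested section keeps its two paragraphs.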

If you reduce your XML file quite a bit, you get something like this:

<section>
    <section>
        <!-- ... -->
    </section>
    <section>
        <!-- ... -->
    </section>
    <section>
        <!-- ... -->
    </section>
</section>

You want to capture the child section elements, not the parent one.

Currently you are iterating only over the list of parent section elements, which means you only get the HTML of the parent section element.

To iterate over the children, you need to select section section instead of section.
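
The difference between the two selectors can be checked without DomCrawler. The sketch below uses PHP's built-in DOM extension and the XPath expressions `//section` and `//section//section`, which I am assuming are rough equivalents of the CSS selectors `section` and `section section`. The sample XML (without a namespace) and the variable names are made up for illustration.

```php
<?php
// Reduced sample: one parent <section> with three child sections.
$xml = <<<XML
<body>
  <section>
    <section><title>A</title></section>
    <section><title>B</title></section>
    <section><title>C</title></section>
  </section>
</body>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

// Roughly what the CSS selector "section" matches: every section at any depth.
$allSections = $xpath->query('//section');

// Roughly what "section section" matches: only sections nested inside another section.
$nestedSections = $xpath->query('//section//section');

echo "all: ", $allSections->length, "\n";       // all: 4
echo "nested: ", $nestedSections->length, "\n"; // nested: 3
```

Note that with the three-level book from the question, `//section//section` would also match the level 2 section, since it is itself nested inside the level 1 section.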


A side note to further improve your code: instead of the awkward reduce() call, just use ->first() to get the first element of the node list.


Altogether, your code becomes:

$sections = $crawler->filter('section section')->each(function(Crawler $node) {
    $c = $node->filter('section')->first();

    return array(
        'title' => $node->filter('title')->text(),
        'inner' => $c->html(),
    );
});