Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share

I'd like to parse Wikimedia's .xml.bz2 dumps without extracting the entire file or performing any XML validation:

var filename = "enwiki-20160820-pages-articles.xml.bz2";

var settings = new XmlReaderSettings()
{
    ValidationType = ValidationType.None,
    ConformanceLevel = ConformanceLevel.Auto // Fragment ?
};

using (var stream = File.Open(filename, FileMode.Open))
using (var bz2 = new BZip2InputStream(stream))
using (var xml = XmlTextReader.Create(bz2, settings))
{
    xml.ReadToFollowing("page");
    // ...
}

The BZip2InputStream itself works: if I wrap it in a StreamReader, I can read the XML line by line. But when I use XmlTextReader, the read fails with:

System.Xml.XmlException: 'Unexpected end of file has occurred. The following elements are not closed: mediawiki. Line 58, position 1.'

The bzip stream is not at EOF. Is it possible to open an XmlTextReader on top of a BZip2 stream? Or is there some other means to do this?

1 Answer

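This should work. I used a combination of XmlReader and LINQ to XML. The sample below reads an uncompressed abstract dump directly from its URL and loads each doc element as an XElement, which you can then parse as needed.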

using System;
using System.Xml;
using System.Xml.Linq;


namespace ConsoleApplication29
{
    class Program
    {
        const string URL = @"https://dumps.wikimedia.org/enwiki/20160820/enwiki-20160820-abstract26.xml";
        static void Main(string[] args)
        {
            // Stream the document so that only one <doc> element is in memory at a time.
            using (XmlReader reader = XmlReader.Create(URL))
            {
                while (!reader.EOF)
                {
                    // ReadFrom leaves the reader on the node after the element it consumed,
                    // so only skip ahead when we are not already on a <doc> start tag.
                    if (reader.NodeType != XmlNodeType.Element || reader.Name != "doc")
                    {
                        reader.ReadToFollowing("doc");
                    }
                    if (!reader.EOF)
                    {
                        XElement doc = (XElement)XNode.ReadFrom(reader);
                        // Process doc here, e.g. read its child elements.
                    }
                }
            }
        }
    }
}
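
The same read loop can be layered over a decompression stream instead of a URL, which is closer to what the question asks. The following is only a minimal sketch under stated assumptions: GZipStream from System.IO.Compression is used purely as a stand-in for the BZip2InputStream in the question so that the example needs no third-party library, and the names XmlStreaming, StreamElements, and the sample XML are hypothetical. It shows that XmlReader can pull elements straight from a decompressing stream; it does not show whether BZip2InputStream delivers the whole dump.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Text;
using System.Xml;
using System.Xml.Linq;

static class XmlStreaming
{
    // Hypothetical helper: lazily yields every element called `name`
    // from a forward-only reader, one element in memory at a time.
    public static IEnumerable<XElement> StreamElements(XmlReader reader, string name)
    {
        reader.MoveToContent();
        while (!reader.EOF)
        {
            if (reader.NodeType == XmlNodeType.Element && reader.Name == name)
            {
                // ReadFrom consumes the element and leaves the reader on the
                // following node, so Read() must not be called again here.
                yield return (XElement)XNode.ReadFrom(reader);
            }
            else
            {
                reader.Read();
            }
        }
    }
}

class Demo
{
    static void Main()
    {
        // Made-up sample data, not the real dump format.
        const string xml = "<feed><doc><title>A</title></doc><doc><title>B</title></doc></feed>";

        // Compress the sample in memory (stand-in for the .bz2 file on disk).
        var buffer = new MemoryStream();
        using (var gzip = new GZipStream(buffer, CompressionMode.Compress, leaveOpen: true))
        {
            byte[] bytes = Encoding.UTF8.GetBytes(xml);
            gzip.Write(bytes, 0, bytes.Length);
        }
        buffer.Position = 0;

        // Decompress and parse in one pass, without materializing the full XML.
        using (var decompressed = new GZipStream(buffer, CompressionMode.Decompress))
        using (var reader = XmlReader.Create(decompressed))
        {
            foreach (XElement doc in XmlStreaming.StreamElements(reader, "doc"))
            {
                Console.WriteLine((string)doc.Element("title")); // prints A, then B
            }
        }
    }
}
```

To try this against the dump in the question, the decompressing GZipStream would be replaced by the BZip2InputStream over the file, with "page" as the element name.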
