Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

For a couple of days I have been working on a WebBrowser-based web scraper. After a couple of prototypes using Threads and DocumentCompleted events, I decided to see if I could make a simple, easy-to-understand web scraper.

The goal is to create a web scraper that doesn't involve actual Thread objects. I want it to work in sequential steps (i.e. go to url, perform action, go to other url, etc.).

This is what I have so far:

public static class Webscraper
{
    private static WebBrowser _wb;
    public static string URL;

    //WebBrowser objects have to run in a Single-Threaded Apartment for some reason.
    [STAThread] 
    public static void Init_Browser()
    { 
        _wb = new WebBrowser();
    }


    public static void Navigate_And_Wait(string url)
    {
        //Navigate to a specific url.
        _wb.Navigate(url);

        //Wait till the url is loaded.
        while (_wb.IsBusy) ;

        //Loop until current url == target url. (In case a website loads urls in steps)
        while (!_wb.Url.ToString().Contains(url))
        {
            //Wait till next url is loaded
            while (_wb.IsBusy) ;
        }

        //Place URL
        URL = _wb.Url.ToString();
    }
}

I am a novice programmer, but I think this is pretty straightforward code. That's why I detest the fact that, for some reason, the program throws a NullReferenceException at this piece of code:

 _wb.Url.ToString().Contains(url)

I just called the _wb.Navigate() method, so the null reference can't be the _wb object itself. The only thing I can imagine is that the _wb.Url object is null. But the while (_wb.IsBusy) loop should prevent that.

So what is going on and how can I fix it?


1 Answer

Busy waiting (while (_wb.IsBusy) ;) on the UI thread is not advisable. If you use the async/await features introduced with .NET 4.5, you can get the sequential flow you want (i.e. go to url, perform action, go to other url, etc.):

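using System.Threading.Tasks;
using System.Windows.Forms;

public static class SOExtensions
{
    // Starts a navigation and returns a Task that completes
    // when the next DocumentCompleted event is raised.
    public static Task NavigateAsync(this WebBrowser wb, string url)
    {
        TaskCompletionSource<object> tcs = new TaskCompletionSource<object>();
        WebBrowserDocumentCompletedEventHandler completedEvent = null;
        completedEvent = (sender, e) =>
        {
            // Unsubscribe first so the handler runs only once per call.
            wb.DocumentCompleted -= completedEvent;
            tcs.SetResult(null);
        };
        wb.DocumentCompleted += completedEvent;

        // Hide script error dialogs raised by the page.
        wb.ScriptErrorsSuppressed = true;
        wb.Navigate(url);

        return tcs.Task;
    }
}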



async void ProcessButtonClick()
{
    await webBrowser1.NavigateAsync("http://www.stackoverflow.com");
    MessageBox.Show(webBrowser1.DocumentTitle);

    await webBrowser1.NavigateAsync("http://www.google.com");
    MessageBox.Show(webBrowser1.DocumentTitle);
}
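
The extension method above bridges a one-shot event to a Task through TaskCompletionSource. To see that pattern in isolation, here is a minimal console sketch that runs without WinForms. FakeLoader, its Completed event, and LoadAsync are hypothetical stand-ins for WebBrowser, DocumentCompleted, and NavigateAsync; they are not part of any library. It assumes C# 7.1 or later for the async Main.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical stand-in for WebBrowser: raises Completed once per Load call.
public class FakeLoader
{
    public event EventHandler Completed;
    public string Current { get; private set; }

    public void Load(string url)
    {
        // Simulate a load that finishes a little later on another thread.
        Task.Delay(50).ContinueWith(_ =>
        {
            Current = url;
            Completed?.Invoke(this, EventArgs.Empty);
        });
    }
}

public static class FakeLoaderExtensions
{
    // Same shape as NavigateAsync: subscribe, start the work, return the Task.
    public static Task LoadAsync(this FakeLoader loader, string url)
    {
        var tcs = new TaskCompletionSource<object>();
        EventHandler handler = null;
        handler = (sender, e) =>
        {
            loader.Completed -= handler; // one-shot: detach before completing
            tcs.TrySetResult(null);
        };
        loader.Completed += handler;
        loader.Load(url);
        return tcs.Task;
    }
}

public static class Program
{
    public static async Task Main()
    {
        var loader = new FakeLoader();

        // Each await resumes only after the corresponding Completed event.
        await loader.LoadAsync("first");
        Console.WriteLine(loader.Current);

        await loader.LoadAsync("second");
        Console.WriteLine(loader.Current);
    }
}
```

The steps read top to bottom like the original busy-wait version, but nothing spins while waiting: the calling thread is released at each await and the code after it resumes once the event has fired.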
