Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share

Hi guys.


I'm trying to create an app that finds the most frequently used words in a string. In my case, the string is HTML. I can already get the HTML from a URI, for example "https://www.bbc.com/news/world-middle-east-57327591".


var url = "https://www.bbc.com/news/world-middle-east-57327591";
var httpClient = new HttpClient();
var html = await httpClient.GetStringAsync(url);

The html variable now holds the same HTML as the page source, so that part works.

But how do I get rid of all the styles, scripts, and other additional information, and get only the plain text into a string variable?

I want my application to work not only for BBC pages but for any HTML I can fetch from the web. My idea is to extract the text from every element, such as <div>, <p>, <b>, <i>, and <a>, because not all of the text is stored in <p> elements.


1 Answer

As per this answer, try the following:


var url = "https://www.bbc.com/news/world-middle-east-57327591";
var httpClient = new HttpClient();
var html = await httpClient.GetStringAsync(url);
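// Requires: using System.Net.Http; using System.Text.RegularExpressions;
// Regex pattern that matches every HTML tag, including tags that span several lines
string pattern = @"<(.|\n)*?>";
// Replace every matched tag with an empty string, leaving only the text between the tags
return Regex.Replace(html, pattern, string.Empty);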
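
Note that a tag-only regex leaves the contents of <script> and <style> blocks in the result, which the question wants removed. Below is a minimal sketch of one way to handle that and then count words. The class name HtmlTextSketch and the helpers ToPlainText and TopWords are hypothetical names made up for illustration, and the regexes assume reasonably well-formed markup (for example, no ">" inside attribute values), so treat it as a starting point, not a robust solution.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlTextSketch
{
    // Hypothetical helper: reduces an HTML string to its visible text.
    public static string ToPlainText(string html)
    {
        // Drop <script> and <style> blocks together with their contents.
        string noCode = Regex.Replace(
            html,
            @"<(script|style)\b[^>]*>.*?</\1\s*>",
            " ",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // Drop HTML comments.
        string noComments = Regex.Replace(
            noCode, @"<!--.*?-->", " ", RegexOptions.Singleline);

        // Replace each remaining tag with a space so adjacent words do not merge.
        string noTags = Regex.Replace(noComments, @"<[^>]+>", " ");

        // Decode entities such as &amp; and collapse runs of whitespace.
        string decoded = WebUtility.HtmlDecode(noTags);
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }

    // Hypothetical helper: the `count` most frequent words, ignoring case.
    // Ties are broken alphabetically so the result is deterministic.
    public static List<KeyValuePair<string, int>> TopWords(string text, int count)
    {
        return Regex.Matches(text.ToLowerInvariant(), @"\p{L}+")
            .Cast<Match>()
            .GroupBy(m => m.Value)
            .Select(g => new KeyValuePair<string, int>(g.Key, g.Count()))
            .OrderByDescending(p => p.Value)
            .ThenBy(p => p.Key, StringComparer.Ordinal)
            .Take(count)
            .ToList();
    }
}
```

With the html variable from the snippet above, usage would look like var top = HtmlTextSketch.TopWords(HtmlTextSketch.ToPlainText(html), 10);. For arbitrary pages from the web, a real HTML parser (HtmlAgilityPack and AngleSharp are two commonly used ones) is likely to be more reliable than regexes, since it copes with malformed markup.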
