Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share

I want to retrieve the product names from the website, so I wrote the code below. But the result includes some unwanted characters such as \n and \t. Can someone help me delete them? Code:

# retrieve name

reddoturl <- 'http://red-dot.de/pd/online-exhibition/?lang=en&c=163&a=0&y=2013&i=0&oes='
library(XML)
doc <- htmlParse(reddoturl)

# review data

reviews <- xpathSApply(doc, '//div[@class="work_container_headline"]', xmlValue)

results: [1] "VZ-C6 / VZ-C3D Document Camera " (the gaps are runs of "\n" and "\t" characters, not plain spaces)

1 Answer

I worry a bit about removing all tabs but this would do it:

> reviews <- "VZ-C6 / VZ-C3D
																		
										Document Camera
									
																	" 
> reviews <- gsub( "\\\t", "", reviews)
> reviews
[1] "VZ-C6 / VZ-C3D\n\nDocument Camera\n\n"

Read ?regex and understand that extra backslashes are needed because both R and regex use "\" as an escape character, so there are two levels of character parsing on the way to a pattern. That's not the case in the replacement argument, though, so you don't need to use doubled escapes there. So if you then wanted to replace those "\n\n"s with just one "\n" you could use:

> reviews <- gsub( "\\\n\\\n", "\n", reviews)
> reviews
[1] "VZ-C6 / VZ-C3D\nDocument Camera\n"
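If the goal is just a clean one-line product name, the two steps above can probably be folded into one. The following is only a sketch under some assumptions: the sample string uses illustrative tab counts rather than the exact ones from the page, and the variable name `clean` is made up for this example. It also shows the two parsing levels in action: R itself turns "\t" into a single tab character, while "\\t" reaches the regex engine as a backslash followed by "t", and both patterns end up matching a tab.

```r
# Sample value shaped like the scraped string (tab counts are illustrative).
reviews <- "VZ-C6 / VZ-C3D\n\t\t\n\t\t\tDocument Camera\n\t\t\n\t\t"

# Two levels of parsing: R converts "\t" to one tab character before the
# regex engine sees it; "\\t" is passed on as backslash + t.
nchar("\t")   # 1
nchar("\\t")  # 2

# Both patterns match a tab, so both calls give the same result.
same <- identical(gsub("\t", "", reviews), gsub("\\t", "", reviews))

# Collapse every run of whitespace (spaces, tabs, newlines) into one space,
# then trim leading and trailing whitespace.
clean <- trimws(gsub("[[:space:]]+", " ", reviews))
clean
# [1] "VZ-C6 / VZ-C3D Document Camera"
```

Since xpathSApply returns a character vector and gsub and trimws are vectorised, the same two calls should work unchanged on the full set of scraped names. Note that trimws needs R 3.2.0 or later.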
