Nutch: can anyone explain what the status names in the readdb stats output indicate?
1. db_redir_perm 2. db_unfetched 3. db_fetched 4. db_gone 5. db_redir_temp 6. db_duplicate 7. db_notmodified
Nutch stores all the metadata for a URL in a CrawlDatum object, which is kept at the /crawldb/*/part-*/data location.
As per the source code of CrawlDatum, each status name maps to one of these constants:
/** Page was not fetched yet. */
db_unfetched --> public static final byte STATUS_DB_UNFETCHED = 0x01;
/** Page was successfully fetched. */
db_fetched --> public static final byte STATUS_DB_FETCHED = 0x02;
/** Page no longer exists. */
db_gone --> public static final byte STATUS_DB_GONE = 0x03;
/** Page temporarily redirects to other page. */
db_redir_temp --> public static final byte STATUS_DB_REDIR_TEMP = 0x04;
/** Page permanently redirects to other page. */
db_redir_perm --> public static final byte STATUS_DB_REDIR_PERM = 0x05;
/** Page was successfully fetched and found not modified. */
db_notmodified --> public static final byte STATUS_DB_NOTMODIFIED = 0x06;
/** Page was marked as being a duplicate of another page */
db_duplicate --> public static final byte STATUS_DB_DUPLICATE = 0x07;
The CrawlDatum field private byte status; takes one of the values listed above, depending on the state of the URL. (There are many other status flags that I'm not discussing here.)
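To make the mapping concrete, here is a minimal standalone sketch that turns a status byte into the label shown by readdb stats. It copies the constant values quoted above and does not use any Nutch classes; the class name DbStatusNames and the helper nameOf are hypothetical names chosen for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Standalone sketch: mirrors the constants quoted above, not the Nutch class itself. */
final class DbStatusNames {
    static final byte STATUS_DB_UNFETCHED = 0x01;
    static final byte STATUS_DB_FETCHED = 0x02;
    static final byte STATUS_DB_GONE = 0x03;
    static final byte STATUS_DB_REDIR_TEMP = 0x04;
    static final byte STATUS_DB_REDIR_PERM = 0x05;
    static final byte STATUS_DB_NOTMODIFIED = 0x06;
    static final byte STATUS_DB_DUPLICATE = 0x07;

    // Insertion-ordered so that printing follows the numeric order of the constants.
    private static final Map<Byte, String> NAMES = new LinkedHashMap<>();
    static {
        NAMES.put(STATUS_DB_UNFETCHED, "db_unfetched");
        NAMES.put(STATUS_DB_FETCHED, "db_fetched");
        NAMES.put(STATUS_DB_GONE, "db_gone");
        NAMES.put(STATUS_DB_REDIR_TEMP, "db_redir_temp");
        NAMES.put(STATUS_DB_REDIR_PERM, "db_redir_perm");
        NAMES.put(STATUS_DB_NOTMODIFIED, "db_notmodified");
        NAMES.put(STATUS_DB_DUPLICATE, "db_duplicate");
    }

    /** Hypothetical helper: returns the label for a status byte, or "unknown". */
    static String nameOf(byte status) {
        return NAMES.getOrDefault(status, "unknown");
    }

    public static void main(String[] args) {
        // Print every known status byte with its label.
        for (Map.Entry<Byte, String> e : NAMES.entrySet()) {
            System.out.printf("0x%02x -> %s%n", e.getKey(), e.getValue());
        }
    }
}
```

Any byte outside 0x01 to 0x07 falls through to "unknown" in this sketch; the real class defines further status values that are not covered here.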
When does the status value of a CrawlDatum object change?
There are many flows in which it can take one of the states mentioned above. I will explain the few flows I know well.
The first is the InjectReducer.reduce method:
for (CrawlDatum val : values) {
  if (val.getStatus() == CrawlDatum.STATUS_INJECTED) {
    injected.set(val);
    injected.setStatus(CrawlDatum.STATUS_DB_UNFETCHED);
    injectedSet = true;
  } else {
    old.set(val);
    oldSet = true;
  }
}
Setting this status to db_unfetched lets the generator phase pick up only the URLs that have not been fetched yet.
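The effect of that loop can be sketched without Hadoop or Nutch on the classpath. The class below is a simplified stand-in, not the real reducer: the name InjectSketch and the method mergeStatuses are hypothetical, the numeric value used for STATUS_INJECTED is an assumption, and so is the rule that an existing crawldb entry wins over a newly injected one (I believe that is the default behaviour, but check it against your Nutch version).

```java
import java.util.Arrays;
import java.util.List;

/** Simplified stand-in for the inject merge step; not the actual InjectReducer. */
final class InjectSketch {
    static final byte STATUS_DB_UNFETCHED = 0x01;
    static final byte STATUS_DB_FETCHED = 0x02;
    // Assumed value, used here for illustration only.
    static final byte STATUS_INJECTED = 0x42;

    /** Hypothetical helper: returns the status kept in the crawldb for one URL. */
    static byte mergeStatuses(List<Byte> values) {
        Byte injected = null;
        Byte old = null;
        for (byte val : values) {
            if (val == STATUS_INJECTED) {
                // A freshly injected URL enters the crawldb as db_unfetched.
                injected = STATUS_DB_UNFETCHED;
            } else {
                // Anything else is an entry that was already in the crawldb.
                old = val;
            }
        }
        // Assumption: an existing entry takes precedence over the injected one.
        if (old != null) {
            return old;
        }
        if (injected != null) {
            return injected;
        }
        throw new IllegalArgumentException("no values for this URL");
    }

    public static void main(String[] args) {
        // New URL: only the injected record exists.
        System.out.println(mergeStatuses(Arrays.asList(STATUS_INJECTED)));
        // Known URL injected again: the stored status is kept.
        System.out.println(mergeStatuses(Arrays.asList(STATUS_DB_FETCHED, STATUS_INJECTED)));
    }
}
```

So a URL that is new to the crawldb shows up under db_unfetched in readdb stats right after injection, while a URL that is already known keeps whatever status it had.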
case ProtocolStatus.MOVED: // redirect
case ProtocolStatus.TEMP_MOVED:
  int code;
  boolean temp;
  if (status.getCode() == ProtocolStatus.MOVED) {
    code = CrawlDatum.STATUS_FETCH_REDIR_PERM;
    temp = false;
  } else {
    code = CrawlDatum.STATUS_FETCH_REDIR_TEMP;
    temp = true;
  }
  output(fit.url, fit.datum, content, status, code);
  String newUrl = status.getMessage();
  Text redirUrl = handleRedirect(fit, newUrl, temp, Fetcher.PROTOCOL_REDIR);
  if (redirUrl != null) {
    fit = queueRedirect(redirUrl, fit);
  } else {
    // stop redirecting
    redirecting = false;
  }
  break;

case ProtocolStatus.EXCEPTION:
  logError(fit.url, status.getMessage());
  int killedURLs = ((FetchItemQueues) fetchQueues)
      .checkExceptionThreshold(fit.getQueueID());
  if (killedURLs != 0)
    context.getCounter("FetcherStatus",
        "AboveExceptionThresholdInQueue").increment(killedURLs);
  /* FALLTHROUGH */
case ProtocolStatus.RETRY: // retry
case ProtocolStatus.BLOCKED:
  output(fit.url, fit.datum, null, status, CrawlDatum.STATUS_FETCH_RETRY);
  break;

case ProtocolStatus.GONE: // gone
case ProtocolStatus.NOTFOUND:
case ProtocolStatus.ACCESS_DENIED:
case ProtocolStatus.ROBOTS_DENIED:
  output(fit.url, fit.datum, null, status, CrawlDatum.STATUS_FETCH_GONE);
  break;

case ProtocolStatus.NOTMODIFIED:
  output(fit.url, fit.datum, null, status,
      CrawlDatum.STATUS_FETCH_NOTMODIFIED);
  break;

default:
  if (LOG.isWarnEnabled()) {
    LOG.warn("{} {} Unknown ProtocolStatus: {}", getName(),
        Thread.currentThread().getId(), status.getCode());
  }
  output(fit.url, fit.datum, null, status, CrawlDatum.STATUS_FETCH_RETRY);
if (redirecting && redirectCount > maxRedirect) {
  ((FetchItemQueues) fetchQueues).finishFetchItem(fit);
  if (LOG.isInfoEnabled()) {
    LOG.info("{} {} - redirect count exceeded {}", getName(),
        Thread.currentThread().getId(), fit.url);
  }
  output(fit.url, fit.datum, null,
      ProtocolStatus.STATUS_REDIR_EXCEEDED,
      CrawlDatum.STATUS_FETCH_GONE);
}
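The fetcher code above only writes STATUS_FETCH_* values, while readdb stats reports db_* names. The sketch below condenses the switch into a small lookup and then adds the second hop. The first method follows the cases shown above. The second method is an assumption on my part: as far as I understand, the later updatedb step translates each fetch status into the matching db status, and a retry stays db_unfetched until the retry limit is reached. The class, enum, and method names are all hypothetical.

```java
/** Standalone sketch of the status flow; the enums stand in for the Nutch constants. */
final class StatusFlowSketch {
    /** Protocol outcomes that appear in the fetcher switch above. */
    enum Protocol {
        MOVED, TEMP_MOVED, EXCEPTION, RETRY, BLOCKED,
        GONE, NOTFOUND, ACCESS_DENIED, ROBOTS_DENIED, NOTMODIFIED
    }

    /** Fetch statuses written by the fetcher in the switch above. */
    enum Fetch { REDIR_PERM, REDIR_TEMP, RETRY, GONE, NOTMODIFIED }

    /** Follows the case labels of the fetcher switch quoted above. */
    static Fetch toFetchStatus(Protocol p) {
        switch (p) {
            case MOVED:
                return Fetch.REDIR_PERM;
            case TEMP_MOVED:
                return Fetch.REDIR_TEMP;
            case GONE:
            case NOTFOUND:
            case ACCESS_DENIED:
            case ROBOTS_DENIED:
                return Fetch.GONE;
            case NOTMODIFIED:
                return Fetch.NOTMODIFIED;
            case EXCEPTION:
            case RETRY:
            case BLOCKED:
            default:
                // Unknown protocol statuses are also treated as a retry.
                return Fetch.RETRY;
        }
    }

    /** ASSUMPTION: how updatedb is believed to translate fetch statuses. */
    static String toDbStatus(Fetch f) {
        switch (f) {
            case REDIR_PERM:
                return "db_redir_perm";
            case REDIR_TEMP:
                return "db_redir_temp";
            case GONE:
                return "db_gone";
            case NOTMODIFIED:
                return "db_notmodified";
            case RETRY:
                // Assumed: stays unfetched while the retry limit is not exceeded.
                return "db_unfetched";
            default:
                throw new IllegalStateException("unhandled fetch status: " + f);
        }
    }

    public static void main(String[] args) {
        // Print the full chain for every protocol outcome in the sketch.
        for (Protocol p : Protocol.values()) {
            Fetch f = toFetchStatus(p);
            System.out.println(p + " -> " + f + " -> " + toDbStatus(f));
        }
    }
}
```

Read this way, a 404 or a robots.txt denial ends up counted under db_gone, and a URL that exceeds the redirect limit takes the same path because the fetcher records it as STATUS_FETCH_GONE.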