Nutch: can anyone explain what the status names in readdb stats indicate?

asked Jan 31, 2022 in Technique[技术] by 深蓝 (71.8m points)

Can anyone explain what the following status names in the Nutch readdb stats output indicate?

1. db_redir_perm 2. db_unfetched 3. db_fetched 4. db_gone 5. db_redir_temp 6. db_duplicate 7. db_notmodified

1 Answer

深蓝 · Answer 1 · 2022-01-31T07:21:31+0000

Nutch stores the metadata for each URL in a CrawlDatum object. These objects are stored under /crawldb/*/part-*/data.

According to the CrawlDatum source code, each readdb status name corresponds to one constant:

   /** Page was not fetched yet. */
   db_unfetched --> public static final byte STATUS_DB_UNFETCHED = 0x01;
   /** Page was successfully fetched. */
   db_fetched --> public static final byte STATUS_DB_FETCHED = 0x02;
   /** Page no longer exists. */
   db_gone --> public static final byte STATUS_DB_GONE = 0x03;
   /** Page temporarily redirects to other page. */
   db_redir_temp --> public static final byte STATUS_DB_REDIR_TEMP = 0x04;
   /** Page permanently redirects to other page. */
   db_redir_perm --> public static final byte STATUS_DB_REDIR_PERM = 0x05;
   /** Page was successfully fetched and found not modified. */
   db_notmodified --> public static final byte STATUS_DB_NOTMODIFIED = 0x06;
   /** Page was marked as being a duplicate of another page */
   db_duplicate --> public static final byte STATUS_DB_DUPLICATE = 0x07;

The private byte status field of CrawlDatum takes one of the values listed above, depending on the state of the URL. (There are many other status values that I am not covering here.)
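To make the mapping concrete, here is a minimal, self-contained sketch that turns a status byte into the name reported by readdb stats. The byte values are copied from the listing above; the class StatusNames and its nameOf method are hypothetical helpers written for this answer and are not part of Nutch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Nutch: mirrors the db_* constants listed above.
class StatusNames {
    private static final Map<Byte, String> NAMES = new LinkedHashMap<>();
    static {
        NAMES.put((byte) 0x01, "db_unfetched");
        NAMES.put((byte) 0x02, "db_fetched");
        NAMES.put((byte) 0x03, "db_gone");
        NAMES.put((byte) 0x04, "db_redir_temp");
        NAMES.put((byte) 0x05, "db_redir_perm");
        NAMES.put((byte) 0x06, "db_notmodified");
        NAMES.put((byte) 0x07, "db_duplicate");
    }

    // Returns the stats name for a status byte, or "unknown" for values not listed here.
    static String nameOf(byte status) {
        return NAMES.getOrDefault(status, "unknown");
    }

    public static void main(String[] args) {
        for (Map.Entry<Byte, String> e : NAMES.entrySet()) {
            System.out.printf("0x%02x -> %s%n", e.getKey(), e.getValue());
        }
    }
}
```

Values outside the seven listed ones fall back to "unknown" in this sketch, since the other CrawlDatum statuses are not covered here.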

When does the status value of a CrawlDatum object change?

Many flows can move a URL into one of the states listed above. I will explain the few that I know well.

When we inject URLs into Nutch, the crawldb folder is created, and each injected URL gets a CrawlDatum object with the status db_unfetched. See the code below from the InjectReducer.reduce method of the Injector class.

for (CrawlDatum val : values) {
  if (val.getStatus() == CrawlDatum.STATUS_INJECTED) {
    injected.set(val);
    injected.setStatus(CrawlDatum.STATUS_DB_UNFETCHED);
    injectedSet = true;
  } else {
    old.set(val);
    oldSet = true;
  }
}

Setting this status lets the Generator phase select the URLs that have not been fetched yet.
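The effect of that loop can be sketched without any Nutch classes. The model below is a simplification that tracks only the status byte per value. InjectMergeSketch and merge are hypothetical names, the value used for STATUS_INJECTED is a placeholder for this sketch, and the rule that an existing CrawlDb entry wins over a fresh injection is an assumption about the default behaviour rather than something shown in the excerpt.

```java
import java.util.List;

// Hypothetical, simplified model of the reduce step; not the Nutch classes.
class InjectMergeSketch {
    static final byte STATUS_DB_UNFETCHED = 0x01; // from the listing above
    static final byte STATUS_DB_FETCHED = 0x02;   // from the listing above
    static final byte STATUS_INJECTED = 0x42;     // placeholder value for this sketch

    // Returns the status stored for one URL after merging all values seen for it.
    static byte merge(List<Byte> values) {
        Byte injected = null;
        Byte old = null;
        for (byte val : values) {
            if (val == STATUS_INJECTED) {
                injected = STATUS_DB_UNFETCHED; // a newly injected URL starts as unfetched
            } else {
                old = val; // entry that was already present in the CrawlDb
            }
        }
        // Assumption: an existing entry is kept in preference to a fresh injection.
        if (old != null) {
            return old;
        }
        if (injected != null) {
            return injected;
        }
        throw new IllegalArgumentException("no values for URL");
    }
}
```

Under these assumptions, a URL seen only as injected ends up as db_unfetched, while re-injecting a URL that is already db_fetched leaves it unchanged.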

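In the Fetcher phase, FetcherThread sets the CrawlDatum status according to the ProtocolStatus of the response, which reflects the HTTP status code returned for the URL (an HTTP status code reference is useful background). Note that the fetcher records STATUS_FETCH_* values in the segment; the later updatedb step turns them into the corresponding db_* statuses in the CrawlDb. The relevant part of the FetcherThread source code: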

    case ProtocolStatus.MOVED: // redirect
    case ProtocolStatus.TEMP_MOVED:
      int code;
      boolean temp;
      if (status.getCode() == ProtocolStatus.MOVED) {
        code = CrawlDatum.STATUS_FETCH_REDIR_PERM;
        temp = false;
      } else {
        code = CrawlDatum.STATUS_FETCH_REDIR_TEMP;
        temp = true;
      }
      output(fit.url, fit.datum, content, status, code);
      String newUrl = status.getMessage();
      Text redirUrl = handleRedirect(fit, newUrl, temp,
          Fetcher.PROTOCOL_REDIR);
      if (redirUrl != null) {
        fit = queueRedirect(redirUrl, fit);
      } else {
        // stop redirecting
        redirecting = false;
      }
      break;
    case ProtocolStatus.EXCEPTION:
      logError(fit.url, status.getMessage());
      int killedURLs = ((FetchItemQueues) fetchQueues).checkExceptionThreshold(fit
          .getQueueID());
      if (killedURLs != 0)
        context.getCounter("FetcherStatus",
            "AboveExceptionThresholdInQueue").increment(killedURLs);
      /* FALLTHROUGH */
    case ProtocolStatus.RETRY: // retry
    case ProtocolStatus.BLOCKED:
      output(fit.url, fit.datum, null, status,
          CrawlDatum.STATUS_FETCH_RETRY);
      break;
    case ProtocolStatus.GONE: // gone
    case ProtocolStatus.NOTFOUND:
    case ProtocolStatus.ACCESS_DENIED:
    case ProtocolStatus.ROBOTS_DENIED:
      output(fit.url, fit.datum, null, status,
          CrawlDatum.STATUS_FETCH_GONE);
      break;
    case ProtocolStatus.NOTMODIFIED:
      output(fit.url, fit.datum, null, status,
          CrawlDatum.STATUS_FETCH_NOTMODIFIED);
      break;
    default:
      if (LOG.isWarnEnabled()) {
        LOG.warn("{} {} Unknown ProtocolStatus: {}", getName(),
            Thread.currentThread().getId(), status.getCode());
      }
      output(fit.url, fit.datum, null, status,
          CrawlDatum.STATUS_FETCH_RETRY);
    } // end of the switch over the protocol status code

    if (redirecting && redirectCount > maxRedirect) {
      ((FetchItemQueues) fetchQueues).finishFetchItem(fit);
      if (LOG.isInfoEnabled()) {
        LOG.info("{} {} - redirect count exceeded {}", getName(),
            Thread.currentThread().getId(), fit.url);
      }
      output(fit.url, fit.datum, null,
          ProtocolStatus.STATUS_REDIR_EXCEEDED,
          CrawlDatum.STATUS_FETCH_GONE);
    }
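The grouping in that switch is easier to see in a condensed form. The sketch below restates only the cases that appear in the excerpt, using hypothetical enums (Protocol, Fetch) in place of the int and byte constants that Nutch actually uses; the success case is not part of the excerpt and is therefore left out. FetchOutcomeSketch and outcome are names invented for this illustration.

```java
// Hypothetical enums that only reuse the names from the excerpt above;
// the real Nutch classes use int and byte constants instead.
class FetchOutcomeSketch {
    enum Protocol {
        MOVED, TEMP_MOVED, EXCEPTION, RETRY, BLOCKED,
        GONE, NOTFOUND, ACCESS_DENIED, ROBOTS_DENIED, NOTMODIFIED, OTHER
    }

    enum Fetch {
        FETCH_REDIR_PERM, FETCH_REDIR_TEMP, FETCH_RETRY, FETCH_GONE, FETCH_NOTMODIFIED
    }

    // Maps a protocol status to the fetch status written by the fetcher thread.
    static Fetch outcome(Protocol p) {
        switch (p) {
            case MOVED:
                return Fetch.FETCH_REDIR_PERM;   // permanent redirect
            case TEMP_MOVED:
                return Fetch.FETCH_REDIR_TEMP;   // temporary redirect
            case GONE:
            case NOTFOUND:
            case ACCESS_DENIED:
            case ROBOTS_DENIED:
                return Fetch.FETCH_GONE;         // page is treated as gone
            case NOTMODIFIED:
                return Fetch.FETCH_NOTMODIFIED;
            case EXCEPTION:
            case RETRY:
            case BLOCKED:
            default:
                return Fetch.FETCH_RETRY;        // unknown codes are retried as well
        }
    }
}
```

Read this way, db_gone in the stats covers more than missing pages: URLs blocked by robots rules or denied access land in the same bucket.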

In the deduplication phase, if a URL is found to be a duplicate based on its MD5 hash, its status is set to STATUS_DB_DUPLICATE, and the Generator will not pick it in the next iteration.
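The idea behind hash-based deduplication can be illustrated with a small standalone sketch. It hashes page content with MD5 and marks every later page with an already seen hash as a duplicate. DedupSketch and markDuplicates are hypothetical names, and keeping the first page encountered is an assumption made for simplicity; the rule Nutch uses to choose which copy survives is not described in this answer.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of hash-based deduplication; not the Nutch job itself.
class DedupSketch {
    static final byte STATUS_DB_FETCHED = 0x02;   // from the listing above
    static final byte STATUS_DB_DUPLICATE = 0x07; // from the listing above

    // Computes an MD5 digest of the content, encoded as text so it can be used as a set key.
    static String md5(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content.getBytes(StandardCharsets.UTF_8));
            return Base64.getEncoder().encodeToString(digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Assumption: the first URL seen for a given hash is kept, later ones are duplicates.
    // Pass a map with a stable iteration order (for example a LinkedHashMap).
    static Map<String, Byte> markDuplicates(Map<String, String> urlToContent) {
        Set<String> seen = new HashSet<>();
        Map<String, Byte> status = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : urlToContent.entrySet()) {
            boolean first = seen.add(md5(e.getValue()));
            status.put(e.getKey(), first ? STATUS_DB_FETCHED : STATUS_DB_DUPLICATE);
        }
        return status;
    }
}
```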
