javascript - To exceed the ImportXML limit on Google Spreadsheet

Question

asked Oct 24, 2021 in Technique by 深蓝 (71.8m points)

I am stuck on a scraping problem right now. Specifically, I want to extract the author's name from a webpage into a Google spreadsheet. The function =IMPORTXML(A2,"//span[@class='author vcard meta-item']") works, but once I increase the number of links to scrape, the cells just keep loading endlessly.

I did some research and found out that this is caused by a limit on Google's side.

Does anybody know how to get around the limit, or have a script that I could easily copy? I really have no coding experience.

1 Answer

深蓝 · Answer 1 · 2021-10-23T17:48:32+0000

I created a custom import function that gets around the limits of IMPORTXML. I have a sheet using it in about 800 cells and it works great.

It makes use of Google Sheets' custom scripts (Tools > Script editor…) and searches through the content using a regex instead of XPath.

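function importRegex(url, regexInput) {
  var output = '';
  // muteHttpExceptions returns error responses instead of throwing on 4xx/5xx
  var fetchedUrl = UrlFetchApp.fetch(url, {muteHttpExceptions: true});
  if (fetchedUrl) {
    var html = fetchedUrl.getContentText();
    if (html.length && regexInput.length) {
      // match() returns null when nothing matches, so guard before indexing
      var match = html.match(new RegExp(regexInput, 'i'));
      if (match && match[1] !== undefined) {
        output = match[1];
      }
    }
  }
  // Grace period to not overload
  Utilities.sleep(1000);
  return output;
}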

You can then use this function like any function.

=importRegex("https://example.com", "<title>(.*)</title>")

Of course, you can also reference cells.

=importRegex(A2, "<title>(.*)</title>")
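For the author element from the question, the XPath //span[@class='author vcard meta-item'] could be approximated with a pattern along the following lines. This is only a sketch: the sample markup is made up, the helper names firstGroup and stripTags are not part of the answer, and real pages may order attributes differently or nest more tags inside the span, in which case the pattern needs adjusting. Testing a pattern against a plain string like this is a quick way to check it before putting it in a cell.

```javascript
// Hypothetical markup; the real page may differ.
var sampleHtml =
  '<div class="meta">' +
  '<span class="author vcard meta-item"><a href="/author/jane">Jane Doe</a></span>' +
  '</div>';

// Capture everything inside the span, non-greedily, up to the closing tag.
var authorPattern = '<span class="author vcard meta-item">(.*?)</span>';

// Same matching logic as importRegex, minus the network fetch.
function firstGroup(html, pattern) {
  var match = html.match(new RegExp(pattern, 'i'));
  return match ? match[1] : '';
}

// Remove any tags left inside the captured text.
function stripTags(fragment) {
  return fragment.replace(/<[^>]*>/g, '').trim();
}

var author = stripTags(firstGroup(sampleHtml, authorPattern));
console.log(author);
```

In a sheet, double quotes inside the pattern have to be doubled, so the call would look something like this:

=importRegex(A2, "<span class=""author vcard meta-item"">(.*?)</span>")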

If you don’t want to see HTML entities in the output, you can use this function.

var htmlEntities = {
  nbsp:  '\u00A0', // non-breaking space
  cent:  '¢',
  pound: '£',
  yen:   '¥',
  euro:  '€',
  copy:  '©',
  reg:   '®',
  lt:    '<',
  gt:    '>',
  mdash: '\u2014', // em dash
  ndash: '\u2013', // en dash
  quot:  '"',
  amp:   '&',
  apos:  '\''
};

function unescapeHTML(str) {
    return str.replace(/&([^;]+);/g, function (entity, entityCode) {
        var match;

        if (entityCode in htmlEntities) {
            // Named entity, e.g. &amp;
            return htmlEntities[entityCode];
        } else if (match = entityCode.match(/^#x([\da-fA-F]+)$/)) {
            // Hexadecimal entity, e.g. &#xA9;
            return String.fromCharCode(parseInt(match[1], 16));
        } else if (match = entityCode.match(/^#(\d+)$/)) {
            // Decimal entity, e.g. &#169;
            return String.fromCharCode(parseInt(match[1], 10));
        } else {
            // Unknown entity: leave it untouched
            return entity;
        }
    });
}
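One detail about the numeric branches: String.fromCharCode only handles values up to 0xFFFF, so entities above that range (emoji, for example) come out wrong. A possible variant is sketched below. It assumes the script runs on a runtime that supports String.fromCodePoint, and the helper name decodeNumericEntities is made up for this example.

```javascript
// Decode only numeric entities (&#169; or &#xA9;); named ones are left alone.
function decodeNumericEntities(str) {
  return str.replace(/&#(x[\da-f]+|\d+);/gi, function (entity, code) {
    var isHex = code.charAt(0).toLowerCase() === 'x';
    var codePoint = isHex ? parseInt(code.slice(1), 16) : parseInt(code, 10);
    // Leave anything outside the Unicode range untouched.
    if (codePoint > 0x10FFFF) {
      return entity;
    }
    return String.fromCodePoint(codePoint);
  });
}

console.log(decodeNumericEntities('&#169; 2021 &#x1F600; &amp;'));
```

Named entities such as &amp; pass through unchanged here, so this would sit alongside the htmlEntities lookup rather than replace it.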

All together…

function importRegex(url, regexInput) {
  var output = '';
  // muteHttpExceptions returns error responses instead of throwing on 4xx/5xx
  var fetchedUrl = UrlFetchApp.fetch(url, {muteHttpExceptions: true});
  if (fetchedUrl) {
    var html = fetchedUrl.getContentText();
    if (html.length && regexInput.length) {
      // match() returns null when nothing matches, so guard before indexing
      var match = html.match(new RegExp(regexInput, 'i'));
      if (match && match[1] !== undefined) {
        output = match[1];
      }
    }
  }
  // Grace period to not overload
  Utilities.sleep(1000);
  return unescapeHTML(output);
}

var htmlEntities = {
  nbsp:  '\u00A0', // non-breaking space
  cent:  '¢',
  pound: '£',
  yen:   '¥',
  euro:  '€',
  copy:  '©',
  reg:   '®',
  lt:    '<',
  gt:    '>',
  mdash: '\u2014', // em dash
  ndash: '\u2013', // en dash
  quot:  '"',
  amp:   '&',
  apos:  '\''
};

function unescapeHTML(str) {
    return str.replace(/&([^;]+);/g, function (entity, entityCode) {
        var match;

        if (entityCode in htmlEntities) {
            // Named entity, e.g. &amp;
            return htmlEntities[entityCode];
        } else if (match = entityCode.match(/^#x([\da-fA-F]+)$/)) {
            // Hexadecimal entity, e.g. &#xA9;
            return String.fromCharCode(parseInt(match[1], 16));
        } else if (match = entityCode.match(/^#(\d+)$/)) {
            // Decimal entity, e.g. &#169;
            return String.fromCharCode(parseInt(match[1], 10));
        } else {
            // Unknown entity: leave it untouched
            return entity;
        }
    });
}
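With several hundred cells, each cell still triggers its own fetch and its own one-second pause. One possible way to cut down the number of calls is to pass a whole range to a single custom function and fetch the URLs together with UrlFetchApp.fetchAll. The following is a rough sketch and not part of the function above: the name importRegexBatch is made up, it assumes the range is a single column of URLs, and Apps Script quotas and the execution time limit for custom functions still apply, so very large ranges may need to be split.

```javascript
// Sketch: takes a range such as A2:A50 (Sheets passes it as a 2D array)
// and returns one result row per input row.
function importRegexBatch(urls, regexInput) {
  // A single cell arrives as a plain value, so normalise to a 2D array.
  var rows = Array.isArray(urls) ? urls : [[urls]];
  var regex = new RegExp(regexInput, 'i');

  // Build requests only for non-empty cells, remembering their row index.
  var requests = [];
  var rowIndexes = [];
  rows.forEach(function (row, i) {
    var url = String(row[0] || '').trim();
    if (url) {
      requests.push({url: url, muteHttpExceptions: true});
      rowIndexes.push(i);
    }
  });

  // Blank rows stay blank in the output.
  var results = rows.map(function () { return ['']; });
  if (requests.length === 0) {
    return results;
  }

  // fetchAll sends the requests together and returns responses in order.
  var responses = UrlFetchApp.fetchAll(requests);
  responses.forEach(function (response, j) {
    if (response.getResponseCode() !== 200) {
      return; // leave the cell empty on HTTP errors
    }
    var match = response.getContentText().match(regex);
    if (match && match[1] !== undefined) {
      results[rowIndexes[j]] = [match[1]];
    }
  });
  return results;
}
```

Entered once next to the first URL, it fills the cells below it:

=importRegexBatch(A2:A50, "<title>(.*)</title>")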
