Tutorial: Scraping Blog Content and Updating Gainsight Community Federated Search with Google Scripts



Introduction

This is my first tutorial on using Gainsight Community. I needed something that could automatically index our blogs in the Community Federated Search. Since I didn’t have much budget for this, I decided to use Google Apps Script. I found it difficult to find tutorials or demos on doing this kind of thing with Gainsight’s APIs, so I thought I would share this so that others can benefit.

In this tutorial, we will create a Google Script to scrape blog content from a website and update the Gainsight Community federated search index. This process involves fetching blog URLs, extracting content, and sending it to Gainsight's federated search API. I’ll break down the entire process step by step so that it is as easy to follow as possible. I would LOVE to see any enhancement suggestions as well 😉
 

Step 1: Creating a Google Script

  1. Go to Google Apps Script:

  2. Create a New Project:

    • Click on the + New Project button. This will open a new script editor.
  3. Name Your Script:

    • Click on Untitled project in the top left corner and give your project a name, such as Blog Scraper.
  4. Set Up Script Properties:

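    • In the script editor, click on File > Project properties > Script properties (in the newer editor this lives under Project Settings > Script Properties) and add a new property with the key accessToken and your Gainsight access token as the value. This helps manage your credentials securely. Note that the script as listed requests a fresh token in getAccessToken() and does not read this property, so you only need it if you adapt the code to use it.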
    • Note: In this example, the client_id and client_secret are hardcoded for simplicity. For a production script, consider storing these values in Script Properties for better security and easier management.
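
As a rough sketch of the production approach mentioned in the note, the credentials could be read from Script Properties instead of being hardcoded. The helper name getCredentials and the property keys GAINSIGHT_CLIENT_ID and GAINSIGHT_CLIENT_SECRET are placeholders of my own, not anything Gainsight requires. The optional store argument is there only so the function can be tried outside Apps Script.

```javascript
/*
Hypothetical helper: reads the client_id and client_secret from Script Properties.
The property keys are assumptions; use whatever names you saved in your project.
*/
function getCredentials(store) {
  // Inside Apps Script, call getCredentials() with no argument to use the real store.
  var props = store || PropertiesService.getScriptProperties();
  var clientId = props.getProperty('GAINSIGHT_CLIENT_ID');
  var clientSecret = props.getProperty('GAINSIGHT_CLIENT_SECRET');

  // getProperty returns null for a key that has not been set
  if (!clientId || !clientSecret) {
    throw new Error('Missing GAINSIGHT_CLIENT_ID or GAINSIGHT_CLIENT_SECRET in Script Properties');
  }

  return { clientId: clientId, clientSecret: clientSecret };
}
```

In getAccessToken(), the hardcoded client_id and client_secret values would then be replaced by the clientId and clientSecret fields of the returned object.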

Step 2: Writing the Script

Next, we will write the functions required to fetch blog content and update the Gainsight Community federated search. The code is shown completely below, so you can copy it if you like. You will need to modify it to suit your environment. Here’s an overview of what each function does:

  1. runScript1(), runScript2(), runScript3() Functions:

    • These functions start the content fetching process for different page ranges. Each one acquires an access token and fetches new content. runScript1() also clears the previously indexed federated search content for the source, which only needs to be done once. There are three functions because a Google Script execution can run for only around 6 minutes; running the three in order lets you safely get all the content in. Three may not be the right number for your content: you may need fewer or more, so experiment to see what suits your environment. You can also have Google Scripts create triggers that call the next function automatically. That added extra complexity to this example, so I left it out, but I have done it here and can explain if required.
  2. fetchAllBlogUrls(startPage, endPage) Function:

    • This function fetches the pages between startPage and endPage, extracting blog URLs from each page.
    • Note that the blog listing page I am working with shows a number of blogs per page. Each page is reached via a URL like this:

      https://www.company.com/blog/?start_page=

      So keep that in mind when looking through the code.

  3. extractBlogUrls(htmlContent) Function:

    • This function extracts the blog URLs from the HTML content of a page. It uses a regular expression tailored to your blog page structure.
  4. fetchContentFromUrls(startPage, endPage, accessToken) Function:

    • This function drives the process. It collects blog URLs, fetches each blog page, extracts the content, and sends it to Gainsight's API.
  5. extractArticleText(htmlContent) Function:

    • This function extracts the main content of a blog post from the HTML. It looks for <article> tags or <main class="long-form"> tags.
  6. extractTitle(content) Function:

    • This function extracts the title of a blog post from its content, looking for <h1> tags.
  7. sanitizeAndExtractText(htmlContent) Function:

    • This function cleans up the HTML content by removing <script> and <style> tags and normalizing whitespace.
  8. decodeHtmlEntities(text) Function:

    • This function replaces HTML character codes with their corresponding characters.
  9. constructJsonString(title, content, url, source) Function:

    • This function builds the JSON structure required by the Gainsight API using the blog title, content, URL, and source.
  10. getAccessToken() Function:

    • This function acquires the Gainsight access token. In a real-world scenario, you should store your client_id and client_secret in Script Properties.
  11. makeApiRequest(accessToken, jsonString) Function:

    • This function sends the constructed JSON to Gainsight's API to update the federated search index.
  12. clearFederatedSearch(accessToken, federatedSource) Function:

    • This function clears the previous content from the specified federated search source in Gainsight.
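
Because of the roughly 6-minute execution limit mentioned for the runScript functions, one option is to stop a run early and report where it got to, rather than letting Google terminate it partway through. The following is a minimal sketch of that idea and is not part of the script below. The name processWithTimeBudget, its parameters, and the 5-minute default are all hypothetical.

```javascript
/*
Hypothetical helper: processes items one at a time until a time budget is used up.
Returns the index of the first unprocessed item (items.length when everything was done).
*/
function processWithTimeBudget(items, handler, budgetMs, now) {
  // `now` is injectable for testing; by default the real clock is used
  var clock = now || function () { return Date.now(); };
  var limit = budgetMs || 5 * 60 * 1000; // assumed safety margin under the ~6 minute limit
  var start = clock();

  for (var i = 0; i < items.length; i++) {
    if (clock() - start >= limit) {
      return i; // out of time: resume from this index on the next run
    }
    handler(items[i], i);
  }
  return items.length;
}
```

The returned index could be logged or saved so that a later run picks up from that URL instead of relying on fixed page ranges.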

 


/*
The first function to call in the process.
This acquires the access token for Gainsight, clears the previous federated search "source"
for this content and fetches the first few pages of content (pages 0 to 10)
*/
function runScript1() {
//get access token
var accessToken = getAccessToken();
//Clear blog content
clearFederatedSearch(accessToken, "bigpanda_blog");
fetchContentFromUrls(0, 10, accessToken);

}

/*
The second function to call in the process.
This acquires the access token for Gainsight and fetches pages 11 to 20.
*/
function runScript2() {
//get access token
var accessToken = getAccessToken();
fetchContentFromUrls(11, 20, accessToken);
}

/*
The third function to call in the process.
This acquires the access token for Gainsight and fetches pages 21 to 30.
*/
function runScript3() {
//get access token
var accessToken = getAccessToken();
fetchContentFromUrls(21, 30, accessToken);
}


/*
This function fetches each page from startPage to endPage (inclusive). It then
extracts the URLs for blogs on that page using the extractBlogUrls function (this will
need to be tailored to your environment). It stops early if a page contains no blog URLs.
*/
function fetchAllBlogUrls(startPage, endPage) {
var baseUrl = 'https://www.company.com/blog/?start_page='; //Url will need changing to your environment
var blogUrls = [];
var hasMorePosts = true;

while (hasMorePosts && startPage <= endPage) {
var pageUrl = baseUrl + startPage;
var response = UrlFetchApp.fetch(pageUrl);
var htmlContent = response.getContentText();

// Extract blog post URLs from the current page
var urls = extractBlogUrls(htmlContent);
if (urls.length > 0) {
blogUrls = blogUrls.concat(urls);
startPage++;
} else {
hasMorePosts = false;
}
}

Logger.log(blogUrls);
return blogUrls;
}



/*
This function extracts the blog URLs from the page. It will need to be tailored to the HTML of your
blog page.
*/
function extractBlogUrls(htmlContent) {
var blogUrls = [];
// This regex will need to be tailored to your environment
var regex = /<a href="(https:\/\/www\.company\.com\/blog\/[^"]+)"[^>]*>/g;
var match;

while ((match = regex.exec(htmlContent)) !== null) {
blogUrls.push(match[1]);
}

return blogUrls;
}



/*
This function drives the process. It calls the fetchAllBlogUrls function to collect blog URLs and then
iterates over those URLs to fetch each of the blog pages. With the extracted HTML, it uses the extractArticleText
function to extract the contents in plain text. It then creates some JSON for the API to send to Gainsight with your
blog content to be added to the federated search.

There is commented-out code here for debugging. I have left it in so that you can write what is extracted to a Google Document.
*/
function fetchContentFromUrls(startPage, endPage, accessToken) {
try {


// Create a new Google Document
//var doc = DocumentApp.create('Fetched Blog Content');
//var docId = doc.getId();
//var docBody = doc.getBody();
//Logger.log('Created document with ID: ' + docId);



// Add a delay to ensure the document is fully accessible
//Utilities.sleep(5000);

var urls = fetchAllBlogUrls(startPage, endPage);

for (var i = 0; i < urls.length; i++) {
var url = urls[i];
if (url) {
try {
var response = UrlFetchApp.fetch(url);
var html = response.getContentText();

// Extract text content from the HTML
var text = extractArticleText(html);


//Logger.log('Extracted text from URL ' + url + ': ' + text);

// Index the post only if some text was extracted
if (text.content.trim()) {
//docBody.appendParagraph('Content from URL: ' + url);
//docBody.appendParagraph('Title: ' + text.title);

// The source key should match the one cleared in runScript1()
let jsonString = constructJsonString(text.title, text.content, url, "bigpanda_blog");

//appendTextInChunks(docBody, jsonString, 2000); // Append text in chunks of 2000 characters
//docBody.appendPageBreak();

makeApiRequest(accessToken, jsonString);

Logger.log('Text size: ' + text.content.length);
} else {
Logger.log('No text extracted from URL: ' + url);
}
} catch (e) {
Logger.log('Error fetching URL: ' + url + ' - ' + e.toString());
//docBody.appendParagraph('Error fetching URL: ' + url);
//docBody.appendPageBreak();
}
}
}

//Logger.log('Document created with ID: ' + docId);

} catch (e) {
Logger.log('Error: ' + e.toString());
}
}



/*
This function is used to extract the blog content from the blog page. It will need tailoring to your system. If you
look at the "match" code, I am looking for <article> tags and, failing that, <main class="long-form"> tags, since this is
where blog content is contained on our blog pages. You will need to work out what your own pages require here.
*/
function extractArticleText(htmlContent) {
// Extract content inside the first <article> tag
var articleMatch = htmlContent.match(/<article[^>]*>([\s\S]*?)<\/article>/i);
var content, title;

if (articleMatch && articleMatch[1]) {
content = articleMatch[1];
title = extractTitle(content);
} else {
// If no <article> tag, try to extract content inside the first <main class="long-form"> tag
var mainMatch = htmlContent.match(/<main class="long-form"[^>]*>([\s\S]*?)<\/main>/i);
if (mainMatch && mainMatch[1]) {
content = mainMatch[1];
title = extractTitle(content);
} else {
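// Return an empty result (not a string) so the caller's text.content check still works
Logger.log('No <article> or <main class="long-form"> element found in the HTML content.');
return { title: 'No title found', content: '' };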
}
}

// Sanitize the extracted content
var sanitizedContent = sanitizeAndExtractText(content);
// Decode HTML entities
var decodedContent = decodeHtmlEntities(sanitizedContent);

// Return the decoded content along with the title
return {
title: title,
content: decodedContent
};
}

/*
This function is used to extract the blog title. It is called by the extractArticleText function. Again,
this may need tailoring to your blog page configuration.
*/
function extractTitle(content) {
var titleMatch = content.match(/<h1[^>]*>([\s\S]*?)<\/h1>/i);
if (titleMatch && titleMatch[1]) {
var sanitizedTitle = sanitizeAndExtractText(titleMatch[1]);
return decodeHtmlEntities(sanitizedTitle);
}
return 'No title found';
}


/*
This function is a cleanup function. You may not need it. It removes any <script> and <style> blocks, strips all
remaining HTML tags and normalizes white space. When testing, you may want to tweak this.
*/
function sanitizeAndExtractText(htmlContent) {
// Remove script and style content
htmlContent = htmlContent.replace(/<script[^>]*>([\s\S]*?)<\/script>/gi, '');
htmlContent = htmlContent.replace(/<style[^>]*>([\s\S]*?)<\/style>/gi, '');

// Remove all HTML tags
var textContent = htmlContent.replace(/<\/?[^>]+(>|$)/g, ' ');

// Normalize whitespace
textContent = textContent.replace(/\s+/g, ' ').trim();

return textContent;
}


/*
This function is to replace any HTML character codes that may be in your HTML page. You may need to add to this.
*/
function decodeHtmlEntities(text) {
var entities = {
'&amp;': '&',
'&lt;': '<',
'&gt;': '>',
'&quot;': '"',
'&apos;': "'",
'&#8217;': "'",
'&#8216;': "'",
'&#8220;': '"',
'&#8221;': '"',
// Add more entities as needed
};

return text.replace(/&[^;]+;/g, function(match) {
return entities[match] || match;
});
}

/*
This function builds the JSON for the API using the title, content, url and source (federated search key)
*/
function constructJsonString(title, content, url, source) {
var data = {
"batch": [{
"title": title,
"content": content,
"url": url,
"source": source
}]
};

// Convert the data object to a JSON string
var jsonString = JSON.stringify(data);
return jsonString;
}


/*
This function is used only when uncommenting the write to Document code. It is used to stop the script
falling over due to trying to load too much at once to the Google Document. Left here so that the code works
uncommented.
*/
function appendTextInChunks(docBody, text, chunkSize) {
var textElement = docBody.appendParagraph('').editAsText();
for (var i = 0; i < text.length; i += chunkSize) {
var chunk = text.substring(i, i + chunkSize);
textElement.appendText(chunk);
}
}




/*
This function acquires the Gainsight Access Token. Depending on your API endpoint location, you will need to
change that. Also the client_id and client_secret will need to be generated by you and changed here. These are better
set in the Google Script project settings, but I have left them here for ease of demonstration. The id and secret used here
are just examples.
*/
function getAccessToken() {
var url = "https://api2-us-west-2.insided.com/oauth2/token";
var payload = {
"grant_type": "client_credentials",
"client_id": "b1a67F767-9A56-F672-ac67-985C8787FC5651",
"client_secret": "2ed99f123ab8765d791c10be717af1f986d0e9cc87293f50b3cd06436f94e6a",
"scope": "write"
};

var options = {
"method": "post",
"contentType": "application/x-www-form-urlencoded",
"payload": payload
};

var response = UrlFetchApp.fetch(url, options);
var json = JSON.parse(response.getContentText());
var accessToken = json.access_token;

return accessToken;
}

/*
This function is used to call the indexing API. The accessToken and the jsonString are provided to this. As above, the
endpoint may need to be changed for your environment.
*/
function makeApiRequest(accessToken, jsonString) {

var url = "https://api2-us-west-2.insided.com/external-content/index";

var headers = {
"Authorization": "Bearer " + accessToken,
"Content-Type": "application/json"
};

var options = {
"method": "post",
"headers": headers,
"payload": jsonString
};

var response = UrlFetchApp.fetch(url, options);
Logger.log(response.getContentText());
}

/*
This function is used to clear the previous federated search index for the source specified. You may not want to always
do this. But given that web content can change, it might make sense for you to clear this each time and load from scratch.
This function is called in the runScript1 function. Again, the endpoint may need to be tweaked.
*/
function clearFederatedSearch(accessToken, federatedSource) {
var url = "https://api2-us-west-2.insided.com/external-content/clear";

var headers = {
"Authorization": "Bearer " + accessToken,
"Content-Type": "application/json"
};

var payload = {
"source": federatedSource
};

var options = {
"method": "post",
"headers": headers,
"payload": JSON.stringify(payload),
"muteHttpExceptions": true // Ensure errors don't stop the script
};

var response;

try {
response = UrlFetchApp.fetch(url, options);
var responseCode = response.getResponseCode();
var responseContent = response.getContentText();

// Log the response code and content
Logger.log("Response Code: " + responseCode);
Logger.log("Response Content: " + responseContent);

return {
code: responseCode,
content: responseContent
};
} catch (e) {
Logger.log("Error: " + e.message);

if (e.response) {
var errorCode = e.response.getResponseCode();
var errorContent = e.response.getContentText();

Logger.log("Error Code: " + errorCode);
Logger.log("Error Content: " + errorContent);

return {
code: errorCode,
content: errorContent
};
}

return {
code: 'unknown',
content: e.message
};
}
}
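
The entity table in decodeHtmlEntities() above has to be extended by hand for every new code. As one possible enhancement, numeric entities such as &#8217; or &#x2019; could be decoded generically. This is only a sketch: decodeNumericEntities is a hypothetical name, it assumes the V8 runtime (for String.fromCodePoint), and it handles numeric entities only, so named ones like &amp; still need the table.

```javascript
/*
Hypothetical helper: decodes decimal (&#8217;) and hexadecimal (&#x2019;) character references.
Named entities such as &amp; are left untouched.
*/
function decodeNumericEntities(text) {
  return text.replace(/&#(x[0-9a-f]+|[0-9]+);/gi, function (match, code) {
    var isHex = code.charAt(0) === 'x' || code.charAt(0) === 'X';
    var codePoint = parseInt(isHex ? code.slice(1) : code, isHex ? 16 : 10);

    // Leave anything outside the valid Unicode range as it was
    if (isNaN(codePoint) || codePoint > 0x10FFFF) {
      return match;
    }
    return String.fromCodePoint(codePoint);
  });
}
```

One difference to be aware of: the table above maps &#8217; to a straight apostrophe, whereas this sketch produces the actual curly quote character, so pick whichever suits your search index.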

 

Step 3: Running the Script

To execute the script, follow these steps:

  1. Run runScript1():

    • This starts the process by clearing the previously indexed content, fetching pages 0 to 10 of the blog listing, and updating Gainsight.
  2. Run runScript2():

    • This continues the process by fetching the next set of pages (11 to 20) and updating Gainsight.
  3. Run runScript3():

    • This completes the process by fetching pages 21 to 30 and updating Gainsight.
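
If running the three functions by hand becomes tedious, each run could schedule the next one with a one-off time-based trigger. The sketch below is one way this might look and is not necessarily how I set it up in my own project. The helper name scheduleNextRun and its parameters are hypothetical, and the optional scriptApp argument exists only so the function can be tried outside Apps Script.

```javascript
/*
Hypothetical helper: creates a one-off trigger that calls the named function after a delay.
*/
function scheduleNextRun(functionName, delayMs, scriptApp) {
  // Inside Apps Script, omit scriptApp to use the built-in ScriptApp service
  var app = scriptApp || ScriptApp;
  return app.newTrigger(functionName)
    .timeBased()
    .after(delayMs)
    .create();
}
```

For example, the last line of runScript1() could be scheduleNextRun('runScript2', 60 * 1000), and runScript2() could schedule runScript3() in the same way. Triggers created like this may need cleaning up afterwards (ScriptApp.deleteTrigger can do that), since the number of triggers per script is limited.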

Conclusion

By following this tutorial, you will have created a Google Script that automates scraping blog content from a website and updating the Gainsight Community federated search index. As I have said throughout this tutorial, you will need to tweak bits and pieces to suit your environment (blog listing page, blog pages, etc.). This script can also be adjusted for other content you may wish to scrape and index, but that will need a bit of thought about how your environment works. Remember to handle your credentials securely and consider using Script Properties for sensitive information.
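
One place where extra error handling may pay off is makeApiRequest(). As written, UrlFetchApp throws on a failed HTTP response, and the caller logs that as an error fetching the blog URL, which can be misleading. A hedged sketch of a more explicit check follows. The name postJson, its return shape, and the injectable fetcher argument are my own assumptions for illustration.

```javascript
/*
Hypothetical helper: posts JSON and reports the HTTP status instead of throwing.
*/
function postJson(url, accessToken, jsonString, fetcher) {
  // Inside Apps Script, omit fetcher to use UrlFetchApp
  var doFetch = fetcher || function (u, o) { return UrlFetchApp.fetch(u, o); };

  var response = doFetch(url, {
    method: 'post',
    contentType: 'application/json',
    headers: { Authorization: 'Bearer ' + accessToken },
    payload: jsonString,
    muteHttpExceptions: true // get the response back even for 4xx/5xx codes
  });

  var code = response.getResponseCode();
  return {
    ok: code >= 200 && code < 300,
    code: code,
    body: response.getContentText()
  };
}
```

The caller can then log the status code and body for any result where ok is false, and decide whether to skip that post or stop the run.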

Feel free to add additional logging or error handling to suit your needs. Happy scripting!


3 replies

This is amazing @rhall 

Appreciate you sharing it!

Thanks @revathimenon. I thought it might help to have some examples of API usage here. Zapier is super easy when working with APIs, but there are limitations as to what you can do out of the box. A bit of code gives you so much more freedom to experiment.

This is awesome thanks for creating this tutorial @rhall 🙌
