Webcrawl a blog to retrieve all entries locally: RSS on steroids
Today’s sample shows how to create a web crawler that can run in the background. This crawler starts with a web page, looks for all links on that page, and follows them. The links are filtered to my blog, but generalizing the code to crawl some other site, or the entire web, is trivial (if you have enough disk space <g>). (VB.Net version to appear soon on this blog.)
I was doing a search on my blog for “ancestors” via the Search box on the sidebar on the left, and there were no results. Strange, I thought, so I used MSN search for my site:
http://search.msn.com/results.aspx?FORM=TOOLBR&q=ancestors+site%3Ahttp%3A%2F%2Fblogs.msdn.com%2FCalvin_Hsia%2F
That search succeeded: it came up with the expected blog entry.
This incident reminded me that I’ve done a lot of work to create my blog, but I depend on a third party to maintain it. There are hundreds of code samples, with links to references. If the blog server were to disappear for some reason, so would all my content. I wanted to retrieve all my blog content into a local table, so I can manipulate it any way I want.
In particular, suppose I want to read my entire blog. I would have to do a lot of manual clicking to get to the month/day of each post, and I might still miss something because I’m crawling by hand. That’s pretty cumbersome. With a local copy, I can also have the entire blog available offline and update it when connected.
So I wrote the code sample below, which crawls my blog, looks for all the blog posts, and shows them in a form that has search capability. Because it’s all local, searching and navigating from post to post is extremely fast. Each entry is displayed in a web control, so the page looks just like it would online and the hyperlinks are all live.
You can start a web crawl by pushing the Crawl button. You can interrupt the web crawl by typing ‘Q’ (<esc> will cancel the automation of the IE SaveAs dialog). The next time the crawl runs, it will resume where it left off. Crawling acts as if you were subscribed to my blog via RSS: once you have all current content, crawling again later will just add any new content. The saved content is the entire blog entry web page, including any comments. As an exercise, readers are encouraged to make the web crawling execute on a background thread!
A crawl starts at the main page http://blogs.msdn.com/Calvin_Hsia, which shows any new content and has links on the sidebar to the other posts. The page is loaded and then parsed for links. Any links pointing to my blog are inserted into a table if they’re not there already. Then the table is scanned for any unfollowed links and the process repeats. If a page is a leaf node (currently any link with 8 forward slashes, such as http://blogs.msdn.com/calvin_hsia/archive/2004/06/28/168054.aspx) then the publication date is parsed, and the page is saved in the MHT field of the table. The link parsing was a little complicated due to some measures for reducing comment spam and some links that broke when the blog host server switched software.
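The crawl loop described above can be sketched in a few lines. This is a minimal Python illustration, not the FoxPro code from this post: the names `crawl`, `is_leaf`, `is_blog_link` and the `fetch_links` callback are hypothetical, a plain dictionary stands in for the Blogs table, and re-marking the start page as unfollowed on every run is my assumption about how a later crawl would pick up new posts.

```python
BLOG_PREFIX = "http://blogs.msdn.com/calvin_hsia"  # prefix used to filter links


def is_blog_link(url, prefix=BLOG_PREFIX):
    """True if the link points into the blog (case-insensitive prefix match)."""
    return url.lower().startswith(prefix.lower())


def is_leaf(url):
    """A post URL such as .../archive/2004/06/28/168054.aspx has 8 slashes."""
    return url.count("/") == 8


def crawl(start_url, fetch_links, table=None, max_pages=1000):
    """Follow unfollowed links until none remain.

    table maps link -> followed flag; pass in a saved table to resume.
    fetch_links(url) is a caller-supplied function returning the links on a page.
    """
    if table is None:
        table = {}
    table[start_url] = False  # assumption: always revisit the main page
    pages = 0
    while pages < max_pages:
        pending = [url for url, followed in table.items() if not followed]
        if not pending:
            break
        for url in pending:
            if pages >= max_pages:
                break
            for link in fetch_links(url):
                if is_blog_link(link):
                    table.setdefault(link, False)  # insert only if not there
            table[url] = True
            pages += 1
    return table


# Usage with a tiny in-memory "site" instead of real HTTP requests.
START = "http://blogs.msdn.com/Calvin_Hsia"
MONTH = "http://blogs.msdn.com/calvin_hsia/archive/2004/06.aspx"
POST = "http://blogs.msdn.com/calvin_hsia/archive/2004/06/28/168054.aspx"
site = {START: [MONTH, "http://example.com/elsewhere"], MONTH: [POST]}

table = crawl(START, lambda url: site.get(url, []))
print([url for url in table if is_leaf(url)])
```

Keeping the followed flag per link is what makes the crawl resumable: an interrupted run leaves some links unfollowed, and the next run simply continues with those.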
You will probably have to modify the code if you want to do the same for other blogs. For example, some blogs may have the Publication date in a different place. Others may have archive links elsewhere or in a different format.
I experimented with fetching a page via an HTTP GET request and showing the saved file in IE:
cTempFile=ADDBS(GETENV("TEMP"))+SYS(3)+".htm" && unique temp file name
LOCAL oHTTP as "winhttp.winhttprequest.5.1"
LOCAL cHTML
oHTTP=NEWOBJECT("winhttp.winhttprequest.5.1")
oHTTP.Open("GET","http://blogs.msdn.com/calvin_hsia/archive/2004/06/28/168054.aspx",.f.) && .f. = synchronous request
oHTTP.Send()
STRTOFILE(oHTTP.ResponseText,cTempFile) && save the raw HTML to the temp file
oIE=CREATEOBJECT("InternetExplorer.Application")
oIE.Visible=1
oIE.Navigate(cTempFile) && show the saved page in IE
But the content looked pretty bad, because of the CSS references, pictures, etc.
Being able to automate IE was helpful, but how do you parse the HTML for the links to each blog entry? I thought about using an XSLT, but that was fairly complex. Instead, I used the IE document object model (IHTMLDocument) to search through the HTML nodes for links.
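To illustrate the idea of walking the HTML for links and keeping only those that point to the blog, here is a minimal Python sketch. It swaps in Python's standard `html.parser` module for the IE document model that the code in this post uses, and the names `LinkCollector` and `extract_blog_links` are hypothetical. Dropping the `#fragment` part of each link is my own simplification, so that a post and its comments anchor count as one page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links into absolute ones.
                self.links.append(urljoin(self.base_url, value))


def extract_blog_links(html, base_url, blog_prefix):
    """Return unique links under blog_prefix, in the order they appear."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    prefix = blog_prefix.lower()
    result = []
    for link in collector.links:
        link = link.split("#", 1)[0]  # ignore in-page anchors
        if link.lower().startswith(prefix) and link not in result:
            result.append(link)
    return result


# Usage on a small hand-written page.
sample_html = (
    '<a href="/calvin_hsia/archive/2004/06/28/168054.aspx">post</a>'
    '<a href="http://example.com/">elsewhere</a>'
    '<A HREF="http://blogs.msdn.com/calvin_hsia/archive/2004/06/28/168054.aspx#comments">comments</A>'
)
links = extract_blog_links(
    sample_html,
    "http://blogs.msdn.com/Calvin_Hsia",
    "http://blogs.msdn.com/calvin_hsia",
)
print(links)
```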
IE has a feature that saves a web page to a single file: "Web Archive, single file (*.mht)" from the File->SaveAs menu option. So I used Windows Script Host to automate this feature.
Making the code run in a background thread is trivial: just use the ThreadManager class (see the link in the cmdCrawl.Click code below).
See also:
Write your own RSS News/Blog aggregator in <100 lines of code
Use a simple XSLT to read the RSS feed from a blog
Do you like reading a blog author? Retrieve all blog entries locally for reading/searching using XML, XSLT, XPATH
Generating VBScript to read a blog
CLEAR ALL
CLEAR
#define WAIT_TIMEOUT 258
#define ERROR_ALREADY_EXISTS 183
#define WM_USER 0x400
SET EXCLUSIVE OFF
SET SAFETY OFF
SET ASSERTS ON
PUBLIC oBlogForm as Form
oBlogForm=CREATEOBJECT("BlogForm","blogs.msdn.com/Calvin_Hsia")
oBlogForm.Visible=1
DEFINE CLASS BlogForm AS Form
    Height=_screen.Height-80
    Width = 900
    AllowOutput=0
    left=170
    cBlogUrl=""
    oThreadMgr=null
    ADD OBJECT txtSearch as textbox WITH width=200
    ADD OBJECT cmdSearch as CommandButton WITH left=210,caption="\<Search"
    ADD OBJECT cmdCrawl as CommandButton WITH left=310,caption="\<Crawl"
    ADD OBJECT cmdQuit as CommandButton WITH left=410,caption="\<Quit"
    ADD OBJECT oGrid as Grid WITH ;
        width = thisform.Width,;
        top=20,;
        ReadOnly=1,;
        Anchor=15
    ADD OBJECT oWeb as cWeb WITH ;
        top=230,;
        height=thisform.Height-250,;
        width = thisform.Width,;
        Anchor=15
    ADD OBJECT lblStatus as label WITH top = thisform.Height-18,width = thisform.Width,anchor=4,caption=""
    PROCEDURE Init(cUrl as String)
        this.cBlogUrl=cUrl
        IF !FILE("blogs.dbf")
            CREATE table Blogs(title c(250),pubdate t,link c(100),followed i, Stored t,mht m)
            INDEX on link TAG link
            INDEX on pubdate TAG pubdate DESCENDING
            INSERT INTO Blogs (link) VALUES (cUrl) && jump start the table with a link
            INSERT INTO blogs (link) VALUES ('http://blogs.msdn.com/vsdata/archive/2004/03/18/92346.aspx') && early blogs
            INSERT INTO blogs (link) VALUES ('http://blogs.msdn.com/vsdata/archive/2004/03/31/105159.aspx')
            INSERT INTO blogs (link) VALUES ('http://blogs.msdn.com/vsdata/archive/2004/04/05/107986.aspx')
            INSERT INTO blogs (link) VALUES ('http://blogs.msdn.com/vsdata/archive/2004/05/12/130612.aspx')
            INSERT INTO blogs (link) VALUES ('http://blogs.msdn.com/vsdata/archive/2004/06/16/157451.aspx')
        ENDIF
        USE blogs SHARED && reopen shared
        this.RequeryData()
        this.RefrGrid
    PROCEDURE RequeryData
        LOCAL cTxt, cWhere
        cTxt=ALLTRIM(thisform.txtSearch.value)
        cWhere= "!EMPTY(mht)"
        IF LEN(cTxt)>0
            cWhere=cWhere+" and ATC(cTxt, mht)>0"
        ENDIF
        SELECT * FROM blogs WHERE &cWhere ORDER BY pubdate DESC INTO CURSOR Result
        thisform.lblStatus.caption="# records ="+TRANSFORM(_tally)
        WITH this.oGrid
            .RecordSource= "Result"
            .Column1.FontSize=14
            .Column1.Width=this.Width-120
            .RowHeight=25
        ENDWITH
        thisform.refrGrid
    PROCEDURE RefrGrid
        cFilename=ADDBS(GETENV("temp"))+SYS(3)+".mht"
        STRTOFILE(mht,cFilename)
        thisform.oWeb.Navigate(cFilename)
    PROCEDURE oGrid.AfterRowColChange(nColIndex as Integer)
        IF this.rowcolChange=1 && row changed
            thisform.RefrGrid
        ENDIF
    PROCEDURE cmdQuit.Click
        thisform.Release
    PROCEDURE cmdCrawl.Click
        thisform.txtSearch.value=""
        fBackgroundThread=.t. && if you want to run on background thread
        IF this.Caption = "\<Crawl"
            thisform.lblStatus.caption= "Blog crawl start"
            CreateCrawlProc()
            IF fBackgroundThread
                this.Caption="Stop \<Crawl"
                *Get ThreadManager from http://blogs.msdn.com/calvin_hsia/archive/2006/05/23/605465.aspx
                thisform.oThreadMgr=NEWOBJECT("ThreadManager","threads.prg")
                thisform.oThreadMgr.CreateThread("MyThreadFunc",thisform.cBlogUrl,"oBlogForm.CrawlDone")
                thisform.lblStatus.caption= "Background Crawl Thread Created"
            ELSE
                LOCAL oBlogCrawl
                oBlogCrawl=NEWOBJECT("BlogCrawl","MyThreadFunc.prg","",thisform.cBlogUrl) && the class def resides in MyThreadFunc.prg
                thisform.CrawlDone
            ENDIF
        ELSE
            this.Caption="\<Crawl"
            IF fBackgroundThread AND TYPE("thisform.oThreadMgr")="O"
                thisform.lblStatus.caption= "Attempting thread stop"
                thisform.oThreadMgr.SendMsgToStopThreads()
            ENDIF
        ENDIF
    PROCEDURE CrawlDone
        thisform.oThreadMgr=null
        thisform.cmdCrawl.caption="\<Crawl"
        thisform.lblStatus.caption= "Crawl done"
this