Linux shell script to count # hyperlinks. (Read 5882 times)
Rad.Test
Technoluster
***
Offline


Rad's non-Admin test-profile
in Firefox

Posts: 108


Linux shell script to count # hyperlinks.
Aug 29th, 2010 at 3:39pm
 
I am curious about the number of hyperlinks the site uses.

I was planning to begin (test) counting the number of links contained in the current blog directory:

http://mt5.radified.com/blog/2010/08/internet-vs-world-wide-web-plus-origins.htm...

.. which lies inside:

/mt5/blog/2010

Inside this directory are all the monthly directories, such as /01 and /02 etc.

This script gives me an error:

Code:
find mt5/blog/2010 -name '*.html' -execdir sed 's/\<a href/AHREF\n/gi' \{\} \; | grep AHREF | wc -l 



Error msg:
Code:
find: sed terminated by signal 13 



Ideas?
 
 
 

Rad.Test
Re: Linux shell script to count # hyperlinks.
Reply #1 - Aug 29th, 2010 at 3:51pm
 
Update. I tried to go straight to the monthly directory for January, which contains 5 *.html files.

Code:
find mt5/blog/2010/01 -name '*.html' -execdir sed 's/\<a href/AHREF\n/gi' \{\} \; | grep AHREF | wc -l 

I get the same error mentioned above, 5 separate times, one for each file it would seem.
 
 
 
Rad.Test
Re: Linux shell script to count # hyperlinks.
Reply #2 - Aug 29th, 2010 at 8:45pm
 
I think I might have figured it out.

The final option is -l (the letter L), which I had read as -1 (the digit one).

Links in January 2010 blog: 464.

Links in Movable Type blog for all of 2010 to date: 4270.
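
For anyone who lands here with the same error: signal 13 is SIGPIPE on Linux. A plausible reading of what happened (an assumption, the thread does not confirm the mechanism) is that "wc -1" exited at once on the invalid option, the pipe lost its reader, and each sed that find started was killed when it tried to write. A minimal sketch of the difference between the two options:

```shell
# Name of signal 13 (prints PIPE in bash and dash on Linux).
kill -l 13

# "-1" (digit one) is not a wc option, so wc exits with an error straight away.
if ! printf 'one\ntwo\n' | wc -1 2>/dev/null; then
    echo "wc -1 failed"
fi

# "-l" (letter L) counts lines.
printf 'one\ntwo\n' | wc -l
```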
 
 
 
Rad.Test
Re: Linux shell script to count # hyperlinks.
Reply #3 - Aug 29th, 2010 at 9:28pm
 
When I ran the script for ALL *.html files in the entire site, I got 100,103 links.

When I ran it for *.htm, I got 159,765.

That doesn't sound right, because I have WAY more *.html pages than *.htm, which I stopped using long ago.

And this includes guides I did not write, such as those by Magoo & NightOwl.

I could query those directories and subtract, but not a big number.

I don't think the forums are included, because those are stored as *.txt files, which I believe the forum script uses to create the web pages.

Does a quarter million links (not counting the forums) sound reasonable?
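
One way to leave out guides written by others, without a separate query-and-subtract step, is to have find skip those directories. A rough sketch, assuming GNU find and grep; the function name and the directory name passed to it are hypothetical, not taken from the site:

```shell
# count_links_excluding ROOT SKIPDIR
# Counts "<a href" occurrences (any case) in *.html and *.htm files under ROOT,
# skipping every directory named SKIPDIR.
count_links_excluding() {
    find "$1" -type d -name "$2" -prune -o \
        -type f \( -name '*.html' -o -name '*.htm' \) \
        -exec grep -ohi '<a href' {} + | wc -l | tr -d ' '
}
```

Called as, say, count_links_excluding /path/to/site guides, this covers both extensions in one pass, so the two totals do not have to be added by hand.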
 
 
 
Rad.Test
Re: Linux shell script to count # hyperlinks.
Reply #4 - Aug 30th, 2010 at 12:47am
 
1. The pages contained in Ye Olde Rad Blog v4 contain 4,270 links (.. as of August 30, 2010).
2. The pages contained in Ye Olde Rad Blog III contain 18,973 links.
3. The pages contained in Ye Olde Rad Blog II contain 12,806 links.
4. The pages contained in Ye Olde Rad Blog contain 20,264 links.
TOTAL: 56,313 .. not counting the guides and daily entries not converted to blog entries (.. with Movable Type).
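
A quick check of that total with shell arithmetic (figures copied from the four items above, commas dropped):

```shell
# v4 + III + II + original blog
echo $((4270 + 18973 + 12806 + 20264))   # prints 56313
```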
 
 
 
MrMagoo
Übermensch
*****
Offline


Resident Linux Guru

Posts: 1026
Phoenix, AZ (USA)


Re: Linux shell script to count # hyperlinks.
Reply #5 - Aug 30th, 2010 at 6:15pm
 
Glad you figured out the script. That's a win on its own. You've learned a lot since last year ;)

I'm not sure how accurate a number the script can produce. What exactly do you want to count? The number of all hyperlinks? The number of things *you* have linked to? The number of links to external sites? All would give you a different answer.

Movable Type produces a lot of links on each page automatically, and I'm not sure if your script ends up counting many of those. I think it would. That's probably also why your .htm pages list so many - some .htm file has a bunch of links that were auto-generated.

I think you could probably find a bot happy to crawl your site and give you far more detailed and interesting statistics about your site.
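
As an illustration of how one of those questions changes the command, here is a rough sketch that counts only links to other sites. It assumes GNU find and grep, and that external links are written as absolute http(s) URLs in double quotes; relative links and single-quoted attributes are ignored. The function name is made up, and the dots in the domain argument are treated as regex wildcards, so this is an approximation, not an exact filter:

```shell
# count_external_links ROOT OWN_DOMAIN
# Counts <a ... href="http(s)://..."> occurrences whose host is not
# OWN_DOMAIN or one of its subdomains.
count_external_links() {
    find "$1" -type f \( -name '*.html' -o -name '*.htm' \) \
        -exec grep -ohiE '<a [^>]*href="https?://[^"]+"' {} + |
        grep -viE "https?://([^/\"]*\.)?$2" | wc -l | tr -d ' '
}
```

For example: count_external_links mt5/blog/2010 radified.com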
 
 

Rad
Radministrator
*****
Offline


Sufferin' succotash

Posts: 4090
Newport Beach, California


Re: Linux shell script to count # hyperlinks.
Reply #6 - Sep 1st, 2010 at 12:13am
 
MrMagoo wrote on Aug 30th, 2010 at 6:15pm:
That's a win on its own. You've learned a lot since last year 

Yeah, thanks. :)

I *really* would like to count the number of links that I've personally created, but that seems improbable.

Next best thing would be just the number of links.

Comments from the guy who sent me the script:

Quote:
Basically what happens is that the script reads all the web pages line by line, and whenever it finds "<a href" (case insensitive, due to the "i" at the end of the "sed" command) it replaces it with the "AHREF" text that I use as a marker, and force a line break just in case there's another "<a href" in the remainder of the line.

"Grep" reads all the lines, and because of the extra newlines we know that each line contains either 0 or 1 hyperlink (signalled by the presence of the "AHREF" marker).

I could have used the original "<a href" text, but I'd have to escape the "<" in that with a "\". You could use any string that's unlikely to appear in your web pages, as long as you use the same string in both the "sed" command and the "grep" command. I used caps out of habit, analogous to the #define statement in C, where the convention is to use caps for the names of the things you #define. Using a string like "qqqqqqq" would have been better (less chance of it appearing in the text), but AHREF carries more meaning.
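
To make the quoted explanation concrete, here is a small self-contained run of the same sed | grep | wc idea on a throwaway file. It assumes GNU sed (for "\n" in the replacement and the "i" flag); the sample file and the count_marked name are made up for illustration. One detail differs from the command earlier in the thread: in GNU sed and grep, "\<" is a start-of-word anchor rather than an escaped "<", and a plain "<" needs no escaping, so the sketch writes "<a href" directly.

```shell
# count_marked FILE
# Marks each "<a href" (any case) with AHREF plus a newline, then counts marker lines.
count_marked() {
    sed 's/<a href/AHREF\n/gi' "$1" | grep AHREF | wc -l | tr -d ' '
}

sample=$(mktemp)
cat > "$sample" <<'EOF'
<p><a href="a.html">A</a> and <A HREF="b.html">B</a></p>
<p>No link here</p>
<a href="c.html">C</a>
EOF

count_marked "$sample"                     # marker approach
grep -oi '<a href' "$sample" | wc -l       # same count without a marker
rm -f "$sample"
```

Both commands should report 3 for this sample. The grep -o form prints each match on its own line, which does the job of the forced line break.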


AND

Quote:
You could replace the "grep AHREF | wc -l" with "less" to see the unfiltered output. If there's any "<a href=.......>" in there then there's something wrong with the parameters of "sed". If there's no "<a href" stuff in there, then the script is working properly.

An alternate, automated way of checking that is to replace "grep AHREF" with "grep href" to see if sed is missing hyperlinks (and hence doesn't replace them with the AHREF placeholders that get counted).

A very rough sanity check is to replace the entire "sed.....gi" bit with "cat" and replace "grep AHREF" with "grep href" to remove all filtering. If there's two or more hyperlinks in a single line they'll get counted as 1 link, but the result might be interesting.

BTW, I didn't test any of this for lack of time, but it might point you in the right direction. The man pages are your friend :)
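
The caveat about two links on one line is easy to see on a tiny sample. A sketch, assuming a grep that supports -c and -o (GNU and BSD grep both do); the file contents and variable names are invented:

```shell
sample=$(mktemp)
printf '%s\n' '<a href="1">x</a> <a href="2">y</a>' '<a href="3">z</a>' > "$sample"

# Number of lines that contain "href" (what "grep href | wc -l" measures).
lines_with_href=$(grep -ci 'href' "$sample")

# Number of individual "href" occurrences, one match per output line.
href_occurrences=$(grep -oi 'href' "$sample" | wc -l | tr -d ' ')

echo "lines: $lines_with_href, occurrences: $href_occurrences"
rm -f "$sample"
```

Here the line count is 2 and the occurrence count is 3, so the gap between the two numbers gives a feel for how much the rough check undercounts.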
 
 