Radified Community Forums
http://radified.com/cgi-bin/yabb2/YaBB.pl
Rad Community Non-Technical Discussion Boards >> YaBB Forum Software + Rad Web Site >> Linux shell script to count # hyperlinks.
http://radified.com/cgi-bin/yabb2/YaBB.pl?num=1283114394

Message started by Rad.Test on Aug 29th, 2010 at 3:39pm

Title: Linux shell script to count # hyperlinks.
Post by Rad.Test on Aug 29th, 2010 at 3:39pm
I am curious about the number of hyperlinks the site uses.

I was planning to begin (test) counting the number of links contained in the current blog directory:

http://mt5.radified.com/blog/2010/08/internet-vs-world-wide-web-plus-origins.html

.. which lies inside:

/mt5/blog/2010

Inside this directory are all the monthly directories, such as /01 and /02 etc.

This script gives me an error:


Code:
find mt5/blog/2010 -name '*.html' -execdir sed 's/\<a href/AHREF\n/gi' \{\} \; | grep AHREF | wc -l


Error msg:

Code:
find: sed terminated by signal 13


Ideas?

Title: Re: Linux shell script to count # hyperlinks.
Post by Rad.Test on Aug 29th, 2010 at 3:51pm
Update. I tried to go straight to the monthly directory for January, which contains 5 *.html files.


Code:
find mt5/blog/2010/01 -name '*.html' -execdir sed 's/\<a href/AHREF\n/gi' \{\} \; | grep AHREF | wc -l

I get the same error mentioned above, 5 separate times, one for each file it would seem.

Title: Re: Linux shell script to count # hyperlinks.
Post by Rad.Test on Aug 29th, 2010 at 8:45pm
Think I might have figured it out.

The final option is -l (the letter l), which I had read as -1 (the number one).

Links in January 2010 blog: 464.

Links in Movable Type blog for all of 2010 to date: 4270.
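
For anyone hitting the same error, here is a minimal sketch of the working pipeline on a throwaway directory. The sample files and their contents are made up for illustration, and the sketch uses -exec rather than -execdir so it does not depend on PATH settings. It assumes GNU sed (for "\n" in the replacement and the "i" flag). The likely reason for the earlier message: "wc -1" is an invalid option, so wc exits at once, the pipe closes, and sed is killed by SIGPIPE, which is signal 13 on Linux.

```shell
#!/bin/sh
# Hypothetical sample tree standing in for mt5/blog/2010/01.
dir=$(mktemp -d)
mkdir -p "$dir/01"
printf '<p><a href="a.html">A</a> and <A HREF="b.html">B</a></p>\n' > "$dir/01/one.html"
printf '<p><a href="c.html">C</a></p>\n' > "$dir/01/two.html"

# Pipeline from the thread, ending in "wc -l" (letter l: count lines).
# -exec is used here instead of -execdir (an assumption for portability).
count=$(find "$dir" -name '*.html' -exec sed 's/\<a href/AHREF\n/gi' {} \; | grep AHREF | wc -l)

echo "links: $count"
rm -rf "$dir"
```

The two sample files hold three links between them, so the count should come out as three.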

Title: Re: Linux shell script to count # hyperlinks.
Post by Rad.Test on Aug 29th, 2010 at 9:28pm
When I ran the script for ALL *.html files in the entire site, I got > 100,103 links.

When I ran it for *.htm, I got > 159,765.

That doesn't sound right, because I have WAY more *.html pages than *.htm, which I stopped using long ago.

And this includes guides I did not write, such as those by Magoo & NightOwl.

I could query those directories and subtract, but not a big number.

I don't think the forums are included, because those are stored as *.txt files, which I believe the forums script uses to create the web pages.

Does a quarter million links (not counting the forums) sound reasonable?

Title: Re: Linux shell script to count # hyperlinks.
Post by Rad.Test on Aug 30th, 2010 at 12:47am
1. The pages contained in Ye Olde Rad Blog v4 contain 4,270 links (.. as of August 30, 2010).
2. The pages contained in Ye Olde Rad Blog III contain 18,973 links.
3. The pages contained in Ye Olde Rad Blog II contain 12,806 links.
4. The pages contained in Ye Olde Rad Blog contain 20,264 links
TOTAL: 56,313 .. not counting the guides and daily entries not converted to blog entries (.. with Movable Type).

Title: Re: Linux shell script to count # hyperlinks.
Post by MrMagoo on Aug 30th, 2010 at 6:15pm
Glad you figured out the script.  That's a win on its own.  You've learned a lot since last year ;)

I'm not sure how accurate of a number the script can produce.  What exactly do you want to count?  The number of all hyperlinks?  The number of things *you* have linked to?  The number of links to external sites?  All would give you a different answer.

Movable Type produces a lot of links on each page automatically, and I'm not sure if your script ends up counting many of those.  I think it would.  That's probably also why your .htm pages list so many - some .htm file has a bunch of links that were auto-generated.

I think you could probably find a bot happy to crawl your site and give you far more detailed and interesting statistics about your site.
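
To make the "all links" versus "external links" distinction above concrete, here is a rough sketch on a single made-up line of HTML. It assumes hrefs are double-quoted and come right after "<a ", and it treats anything on radified.com as internal. Both are simplifications, and the sample URLs are hypothetical.

```shell
#!/bin/sh
# Hypothetical page fragment: one relative link, one absolute link to
# the site itself, one link to another site.
page='<a href="/blog/x.html">rel</a> <a href="http://radified.com/y.html">own</a> <a href="http://example.com/">ext</a>'

# All links: grep -o prints every match on its own line.
all=$(printf '%s\n' "$page" | grep -oi '<a href' | wc -l)

# External links: absolute URLs only, minus those pointing at radified.com.
external=$(printf '%s\n' "$page" | grep -oiE '<a href="https?://[^"]*' | grep -vci 'radified\.com')

echo "all: $all, external: $external"
```

Neither number says who created a link, so auto-generated links would still be counted.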

Title: Re: Linux shell script to count # hyperlinks.
Post by Rad on Sep 1st, 2010 at 12:13am

MrMagoo wrote on Aug 30th, 2010 at 6:15pm:
That's a win on its own. You've learned a lot since last year 

Yeah, thanks. :)

I *really* would like to count the number of links that I've personally created. But seems improbable.

Next best thing would be just the number of links.

Comments from the guy who sent me the script:


Quote:
Basically what happens is that the script reads all the web pages line by line, and whenever it finds "<a href" (case insensitive, due to the "i" at the end of the "sed" command) it replaces it with the "AHREF" text that I use as a marker, and force a line break just in case there's another "<a href" in the remainder of the line.

"Grep" reads all the lines, and because of the extra newlines we know that each line contains either 0 or 1 hyperlink (signalled by the presence of the "AHREF" marker).

I could have used the original "<a href" text, but I'd have to escape the "<" in that with a "\". You could use any string that's unlikely to appear in your web pages, as long as you use the same string in both the "sed" command and the "grep" command. I used caps out of habit, analogous to the #define statement in C, where the convention is to use caps for the names of the things you #define. Using a string like "qqqqqqq" would have been better (less chance of it appearing in the text), but AHREF carries more meaning.
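
A small sketch of why the forced line break matters, using one made-up line that holds two links. It assumes GNU sed. One side note: in GNU sed, "\<" is a word-boundary anchor rather than an escaped "<", so the pattern in the script matches "a href" at the start of a word and leaves the "<" in place. The count comes out the same either way. The sketch uses a plain "<", which needs no escaping inside single quotes. The grep -o variant at the end is an alternative, not part of the script that was sent.

```shell
#!/bin/sh
# Hypothetical line with two links on it.
line='<a href="x.html">X</a> <a href="y.html">Y</a>'

# Counting matching lines undercounts: both links sit on one line.
per_line=$(printf '%s\n' "$line" | grep -ci '<a href')

# Marker plus forced newline: at most one marker per output line.
with_marker=$(printf '%s\n' "$line" | sed 's/<a href/AHREF\n/gi' | grep -c AHREF)

# Alternative: grep -o prints each match on its own line.
with_o=$(printf '%s\n' "$line" | grep -oi '<a href' | wc -l)

echo "per line: $per_line, marker: $with_marker, grep -o: $with_o"
```

The per-line count sees one line, while the other two methods see both links.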


AND


Quote:
You could replace the "grep AHREF | wc -l" with "less" to see the unfiltered output. If there's any "<a href=.......>" in there then there's something wrong with the parameters of "sed". If there's no "<a href" stuff in there, then the script is working properly.

An alternate, automated way of checking that is to replace "grep AHREF" with "grep href" to see if sed is missing hyperlinks (and hence doesn't replace them with the AHREF placeholders that get counted).

A very rough sanity check is to replace the entire "sed.....gi" bit with "cat" and replace "grep AHREF" with "grep href" to remove all filtering. If there are two or more hyperlinks in a single line they'll get counted as 1 link, but the result might be interesting.

BTW, I didn't test any of this for lack of time, but it might point you in the right direction. The man pages are your friend :)
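
The checks described in the quote might look something like this on a throwaway file. The file name and contents are made up, GNU sed is assumed, and the sed pattern uses a plain "<" rather than the "\<" from the script.

```shell
#!/bin/sh
# Hypothetical test page: two links on the first line, none on the second.
dir=$(mktemp -d)
printf '<a href="a.html">A</a> <a href="b.html">B</a>\n<p>no links</p>\n' > "$dir/page.html"

# The normal count, for comparison.
marked=$(sed 's/<a href/AHREF\n/gi' "$dir/page.html" | grep -c AHREF)

# Check 1: after sed, no "<a href" should be left over.
# grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
leftover=$(sed 's/<a href/AHREF\n/gi' "$dir/page.html" | grep -ci '<a href' || true)

# Check 2: rough count with cat in place of sed; this counts lines, not links.
rough=$(cat "$dir/page.html" | grep -ci 'href')

echo "marked: $marked, leftover: $leftover, rough: $rough"
rm -rf "$dir"
```

A leftover count of zero suggests sed caught every link. The rough count comes in lower than the marked count here because both links share one line.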
