Posted in the Data Extractor Forum.




title and meta tags from HTML page

How can I extract title and meta tags from a HTML page?
by Saurabh Shyara on Mar 14 2007 1:21am Reply

title and meta tags from HTML page

We've added the script here:
http://www.iconico.com/Da...taTags.txt

You may install it by doing the following:
1. Create a new rule in Data Extractor: select the second tab and click 'New'.
2. Type a descriptive name for your rule and click OK.
3. Then click 'Edit Rule details'.
4. Select the 'HTML Webpage script' option on the right.
5. Copy and paste the rule into the 'Rule Details' text area.
by Nico Westerdale on Mar 14 2007 7:00pm Reply
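
For readers who cannot reach the linked rule, here is a minimal sketch of the general idea in plain JavaScript. It is not the Iconico rule and does not use the Data Extractor scripting interface; the function name `extractTitleAndMeta` and the returned object shape are assumptions made for illustration. It reads the `<title>` text and the name/content pairs of `<meta>` tags with regular expressions, which is good enough for a quick audit but is not a full HTML parser.

```javascript
// Hypothetical helper, not the Iconico rule: pulls the <title> text and the
// name/content pairs of <meta> tags out of an HTML string.
function extractTitleAndMeta(html) {
  // [\s\S] also matches line breaks, so the title may span several lines.
  const titleMatch = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);
  const title = titleMatch ? titleMatch[1].trim() : null;

  const meta = {};
  // Naive tag match: assumes no '>' character inside an attribute value.
  const metaTagPattern = /<meta\b[^>]*>/gi;
  let tag;
  while ((tag = metaTagPattern.exec(html)) !== null) {
    // The backreference \1 makes the value end at the same quote that opened it.
    const name = /\sname\s*=\s*(["'])([\s\S]*?)\1/i.exec(tag[0]);
    const content = /\scontent\s*=\s*(["'])([\s\S]*?)\1/i.exec(tag[0]);
    if (name && content) {
      meta[name[2].toLowerCase()] = content[2];
    }
  }
  return { title: title, meta: meta };
}
```

Calling `extractTitleAndMeta(pageSource)` on a page's source returns an object with a `title` string (or `null` when there is no title) and a `meta` map keyed by lowercase meta name. Meta tags without both a quoted `name` and a quoted `content` attribute are skipped.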

title and meta tags from HTML page

Thanks.

Now how can I extract the body and title at the same time?

by Saurabh Shyara on Mar 14 2007 9:07pm Reply

title and meta tags from HTML page

If you want to extract the entire text of a page then you are better off using our HTML text extractor:
http://www.iconico.com/HTMLExtractor
by Nico Westerdale on Mar 14 2007 9:48pm Reply

title and meta tags from HTML page

I tried inserting this script, and it did not recover any meta tags or title tags. When I make my own rule to find titles using the search by text option and simply putting the title tags with the * wildcard, it does find the title tags. What is wrong?
by eric l on Apr 26 2007 3:21pm Reply

title and meta tags from HTML page

Hard to say what's wrong, Eric; the rule should work fine. Please make sure the source has line breaks and is not all on one line. You may need to 'View Source', depending on your browser.
by Nico Westerdale on Apr 27 2007 4:55am Reply

title and meta tags from HTML page

The code has the appropriate line breaks. It will scan all the files in my website, but does not extract anything at all, even though most pages have title tags and meta tags. I am scanning californialung.org to do a content audit. Is there another script you offer that does the same thing?
by eric on Apr 27 2007 7:22am Reply

title and meta tags from HTML page

Eric,

I just realized there was a missing '}' at the end of the rule. I've updated the rule, please try again and it should work for you. Sorry about that!
by Nico Westerdale on Apr 28 2007 6:39am Reply

title and meta tags from HTML page

Thanks, that helped. It now finds all the meta tags, but it also now tries to open every file, including .pdf and .mov files. Now either the application stalls when opening a .mov, or I get this error:

-begin error-

Data Extractor Script Error:
TypeError: -2147024891
Access is Denied

Please check that your script is correct

-end error-

Can I restrict which files it will actually open? Also, does the error above mean it's running into password-protected directories? Normally the program just asks for my username and password and continues gathering metadata.

Thanks for your help!

Eric
by Eric on May 4 2007 9:26am Reply

title and meta tags from HTML page

The error is occurring not because of password-protected directories or files, but because the Data Extractor is trying to open something that's not a webpage.

How are you setting up the Data Extractor on the first tab? Are you giving it a list of files, or are you extracting from an entire site? If you could send your exact settings, I can see if I can reproduce the error here and get you an answer.
by Nico Westerdale on May 4 2007 10:01am Reply

title and meta tags from HTML page

On the first tab, I enter the home page URL and check the box to scan all linked web files. On tab 2, I have the metadata code you kindly posted here. Then I click the 'Extract Now' button. The site does have video files, and that seems to be where it stops most often and generates the error.

Eric
by Eric on May 4 2007 11:38am Reply

title and meta tags from HTML page

What's the URL? I need it to try duplicating the problem here.
by Nico Westerdale on May 4 2007 11:54am Reply

title and meta tags from HTML page

californialung.org
by Eric on May 4 2007 12:09pm Reply

title and meta tags from HTML page

Eric,

Looks like you're right that the application is stopping when it finds a video. There's a quick fix for this, which is to edit the file 'DEscript.txt', located in the C:\Program Files\Data Extractor folder.

You can add different file extensions to be ignored by changing the line that's 9 lines from the bottom. Please change it to the following and the extraction should run all the way through, although you will need to click 'Cancel' on the security popups as they appear.

if ((ext != '.doc') && (ext != '.xls') && (ext != '.xml') && (ext != '.pdf') && (ext != '.txt') && (ext != '.csv') && (ext != '.mp3') && (ext != '.mov') && (ext != '.wma') && (ext != '.wmv')) {
by Nico Westerdale on May 10 2007 9:12pm Reply
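
The long chain of `ext != ...` comparisons above can also be written as a lookup in a list, which is easier to extend when more file types need skipping. The sketch below is plain JavaScript and only illustrates the idea; it is not the contents of DEscript.txt. How that file actually computes `ext` is not shown in this thread, so the helpers `getExtension` and `shouldOpen`, and the constant `SKIPPED_EXTENSIONS`, are hypothetical names.

```javascript
// Extensions to skip; the list mirrors the ones named in the post above.
const SKIPPED_EXTENSIONS = [
  '.doc', '.xls', '.xml', '.pdf', '.txt',
  '.csv', '.mp3', '.mov', '.wma', '.wmv'
];

// Hypothetical helper: returns the lowercase extension of a URL's last path
// segment (for example '.mov'), or '' when the URL has no file extension.
function getExtension(url) {
  const path = url
    .split(/[?#]/)[0]                                  // drop query string and fragment
    .replace(/^[a-z][a-z0-9+.-]*:\/\/[^\/]*/i, '');    // drop scheme and host name
  const lastSegment = path.substring(path.lastIndexOf('/') + 1);
  const dot = lastSegment.lastIndexOf('.');
  return dot === -1 ? '' : lastSegment.substring(dot).toLowerCase();
}

// True when the URL does not end in one of the skipped extensions.
function shouldOpen(url) {
  return SKIPPED_EXTENSIONS.indexOf(getExtension(url)) === -1;
}
```

With this shape, skipping another type such as '.avi' means adding one entry to the list rather than another `&&` clause. The extension is lowercased before the lookup, so 'clip.MOV' and 'clip.mov' are treated the same.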

title and meta tags from HTML page

For me this script returns multiple identical results per page. Why does it do that? Is there a way to make it stop searching the page after it's found 1 title already? I am just using the part with the URL and title tags, as that's all I need. Thanks!
by Laura Wagner on Dec 23 2009 2:55pm Reply

title and meta tags from HTML page

It shouldn't! You can always click the 'Remove Duplicates' button.
by Nico User on Dec 23 2009 4:54pm Reply

title and meta tags from HTML page

Interesting. I wonder why mine is doing that then. Like if I put in your site here, I get four identical results which all say "Iconico.com Software."

'Remove Duplicates' is fine, but the processing time is long. I'm wondering if that's because it keeps looking for more results or just because it has to load the page.
by Laura Wagner on Dec 23 2009 5:10pm Reply
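
The thread does not establish why the rule returns repeated results, so the following is only a sketch of the two ideas raised above, written in plain JavaScript rather than as a Data Extractor rule. The function names `firstTitle` and `removeDuplicates` are hypothetical. The first function stops at the first title it finds; the second drops repeated results after the fact.

```javascript
// Returns only the first <title> text, or null when there is none.
// A regular expression without the g flag stops at the first match.
function firstTitle(html) {
  const match = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);
  return match ? match[1].trim() : null;
}

// Removes repeated results while keeping the order in which they first appeared.
function removeDuplicates(results) {
  return Array.from(new Set(results));
}
```

Neither function shortens the time spent downloading a page. If loading is the slow part, stopping after the first title would not be expected to help much.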
