{"id":1138,"date":"2009-03-26T09:46:57","date_gmt":"2009-03-26T14:46:57","guid":{"rendered":"http:\/\/www.webliminal.com\/webliminalblog\/?p=1138"},"modified":"2009-03-26T10:14:12","modified_gmt":"2009-03-26T15:14:12","slug":"1138","status":"publish","type":"post","link":"https:\/\/www.webliminal.com\/webliminalblog\/interesting-web-sites\/1138","title":{"rendered":"Links for Web Archiving"},"content":{"rendered":"<p><a href=\"http:\/\/webliminal.com\/images\/sidepics\/sp119.jpg\"><img loading=\"lazy\" decoding=\"async\" class=\"alignleft\" style=\"margin-left: 22px; margin-right: 22px;\" src=\"http:\/\/webliminal.com\/images\/sidepics\/cwdata\/sp119.jpg\" alt=\"marigolds, front garden, home, falmouth, Virginia, US\" hspace=\"22\" width=\"150\" height=\"112\" align=\"left\" \/><\/a><\/p>\n<h3>Possible Applications to Try:<\/h3>\n<ul>\n<li><a href=\"http:\/\/www.websiteoptimization.com\/speed\/18\/\" target=\"_new\">List of a lot of compression tools.<\/a><\/li>\n<li><a href=\"http:\/\/sourceforge.net\/projects\/mod-gzip\/\" target=\"_new\">Mod_GZip<\/a><\/li>\n<li><a href=\"http:\/\/httpd.apache.org\/docs\/2.0\/mod\/mod_deflate.html\" target=\"_new\">Mod_Deflate<\/a><\/li>\n<\/ul>\n<h3>Information about Website compression:<\/h3>\n<ul>\n<li><a href=\"http:\/\/www.websiteoptimization.com\/speed\/tweak\/compress\/\" target=\"_new\">Website Compression at Website Optimization<\/a><\/li>\n<li><a href=\"http:\/\/www.15seconds.com\/issue\/020314.htm\" target=\"_new\">Web Site Compression By Wayne Berry<\/a><\/li>\n<\/ul>\n<h3>Tools:<\/h3>\n<p><a href=\"http:\/\/leknor.com\/code\/gziped.php?url=http%3A%2F%2Fpaprika.umw.edu\" target=\"_new\">Web tool that examines the size gains of website compression.<\/a> <a href=\"http:\/\/www.vigos.com\/products\/website-analyzer\/\" target=\"_new\">Downloadable Website 
Analyser<\/a><\/p>\n<p>http:\/\/webarchivist.org\/resources.htm<\/p>\n<p><strong><span style=\"font-size: medium;\">Below<\/span><\/strong>: Some annotated text taken from <a href=\"http:\/\/hul.harvard.edu\/ois\/projects\/webarchive\/resources.html\">Web Archiving Resources<\/a>, Office for Information Systems, Harvard University Libraries<\/p>\n<h2>Harvesting Services<\/h2>\n<p>Archive-It<\/p>\n<p>A subscription harvesting service provided by the Internet Archive. Through a web-based interface, users can capture, catalogue and archive their institution&#8217;s own web site or build additional collections, and then search and browse the collection when complete.<br \/>\n<a href=\"http:\/\/www.archive-it.org\/\">http:\/\/www.archive-it.org\/<\/a><\/p>\n<h2><a name=\"software\"><\/a>Harvesting Software<\/h2>\n<p><strong>Open source harvesting software<\/strong><\/p>\n<p>Combine Harvesting Robot<br \/>\n<a href=\"http:\/\/www.lub.lu.se\/combine\/\">http:\/\/www.lub.lu.se\/combine\/<\/a><br \/>\nHarvesting and indexing software written in Perl and C++ and released under the GPL license. Once used (and possibly still in use) by Swedish, Danish and Austrian archives. It is unclear whether it is still actively developed.<\/p>\n<p>GNU Wget<br \/>\n<a href=\"http:\/\/www.gnu.org\/software\/wget\/wget.html\">http:\/\/www.gnu.org\/software\/wget\/wget.html<\/a><br \/>\nA non-interactive command-line tool under the GPL license that can be used from scripts and other programs.<\/p>\n<p>Heritrix, Internet Archive and Nordic National Libraries<br \/>\n<a href=\"http:\/\/crawler.archive.org\/\">http:\/\/crawler.archive.org\/<\/a><br \/>\nA robust web archiving harvester under the LGPL license. Has very flexible means to configure and control the harvest. Designed to be extensible by writing new Java modules. Configurable through a web interface. 
This work is sponsored by the IIPC (International Internet Preservation Consortium).<\/p>\n<p>HTTrack<br \/>\n<a href=\"http:\/\/www.httrack.com\/\">http:\/\/www.httrack.com\/<\/a><br \/>\nOffline browsers under the GPL license that can be used from a graphical interface or the command line.<\/p>\n<p>Nalanda iVia Focused Crawler (NIFC)<\/p>\n<p><a href=\"http:\/\/ivia.ucr.edu\/projects\/Nalanda\/\">http:\/\/ivia.ucr.edu\/projects\/Nalanda\/<\/a><\/p>\n<p>Designed to find Web resources with the same topic as a seed set of known resources. NIFC was created by Dr. Soumen Chakrabarti at the Indian Institute of Technology (Bombay), and further developed in collaboration with the iVia team.<\/p>\n<p>Nedlib Harvester, Center for Scientific Computing &#8211; the Finnish IT Center for Science<br \/>\n<a href=\"http:\/\/www.csc.fi\/sovellus\/nedlib\/\">http:\/\/www.csc.fi\/sovellus\/nedlib\/<\/a><br \/>\nDeveloped as a part of the Nedlib project funded by the European Union. Written in C and dependent on the MySQL database. No longer supported or developed.<\/p>\n<p><strong>Commercial harvesting software<\/strong><\/p>\n<p>Internet Researcher, Zylox Software<br \/>\n<a href=\"http:\/\/www.zylox.com\/\">http:\/\/www.zylox.com\/<\/a><br \/>\nA Windows-only offline browsing tool with a graphical interface.<\/p>\n<p>Offline Explorer and Mass Downloader, MetaProducts Software Corporation<br \/>\n<a href=\"http:\/\/www.metaproducts.com\/mp\/mpProducts_List.asp\">http:\/\/www.metaproducts.com\/mp\/mpProducts_List.asp<\/a><br \/>\nVarious offline browsing tools for Windows. 
The MetaProducts Offline Explorer Pro 2.1 is used by DACHS (Digital Archive for Chinese Studies) - <a href=\"http:\/\/www.sino.uni-heidelberg.de\/dachs\/\">http:\/\/www.sino.uni-heidelberg.de\/dachs\/<\/a><\/p>\n<p>RafaBot, Spadix Software<br \/>\n<a href=\"http:\/\/www.spadixbd.com\/rafabot\/\">http:\/\/www.spadixbd.com\/rafabot\/<\/a><br \/>\nA Windows-only offline browsing tool with a graphical interface. In addition to accepting a list of URLs, it can be given search terms, and RafaBot will use search engines to download all matching web sites.<\/p>\n<p>SuperBot, Sparkleware<br \/>\n<a href=\"http:\/\/www.sparkleware.com\/dl.html\">http:\/\/www.sparkleware.com\/dl.html<\/a><br \/>\nA simple Windows-only offline browsing tool with a graphical interface.<\/p>\n<p>SurfSaver, askSam Systems<br \/>\n<a href=\"http:\/\/www.surfsaver.com\/\">http:\/\/www.surfsaver.com\/<\/a><br \/>\nAn add-on to Microsoft Internet Explorer.<\/p>\n<p>Teleport Webspiders<br \/>\n<a href=\"http:\/\/www.tenmax.com\/teleport\/home.htm\">http:\/\/www.tenmax.com\/teleport\/home.htm<\/a><br \/>\nVarious sophisticated Windows-only versions with different interfaces (graphical, console, scriptable) and feature sets.<\/p>\n<p>WebCopier, MaximumSoft Corp.<br \/>\n<a href=\"http:\/\/www.maximumsoft.com\/index.html\">http:\/\/www.maximumsoft.com\/index.html<\/a><br \/>\nAn offline browsing tool with a graphical interface, available in multiple versions for different operating systems and performance\/feature levels.<\/p>\n<h2><a name=\"discovery\"><\/a>Discovery, Display and Access Software<\/h2>\n<p>ARC Access Tools<br \/>\n<a href=\"http:\/\/archive-access.sourceforge.net\/\">http:\/\/archive-access.sourceforge.net\/<\/a><br \/>\nInternet Archive&#8217;s list of tools for processing and accessing content in ARC files.<\/p>\n<p>Kea<\/p>\n<p><a href=\"http:\/\/www.nzdl.org\/Kea\/\">http:\/\/www.nzdl.org\/Kea\/<\/a><\/p>\n<p>A 
GPLed tool for automatic keyword extraction from text documents. Originally written in a combination of Perl, C and Java; now available in an all-Java version. From the New Zealand Digital Library at the University of Waikato, New Zealand.<\/p>\n<p>libiViaMetadata<\/p>\n<p><a href=\"http:\/\/ivia.ucr.edu\/manuals\/libiViaMetadata\/current\/\">http:\/\/ivia.ucr.edu\/manuals\/libiViaMetadata\/current\/<\/a><\/p>\n<p>A GPLed C++ library for assigning descriptive metadata to web files. Developed under the iVia Project. Includes the PhraseRate program which is described at <a href=\"http:\/\/ivia.ucr.edu\/projects\/PhraseRate\/\">http:\/\/ivia.ucr.edu\/projects\/PhraseRate\/<\/a><\/p>\n<p>NutchWAX (Nutch + Web Archive eXtensions), Internet Archive and Nordic National Libraries<br \/>\n<a href=\"http:\/\/archive-access.sourceforge.net\/projects\/nutch\/gettingstarted.html\">http:\/\/archive-access.sourceforge.net\/projects\/nutch\/gettingstarted.html<\/a><br \/>\nA tool for indexing and searching web archives. Currently works only with the Arc format<br \/>\n(<a href=\"http:\/\/www.archive.org\/web\/researcher\/ArcFileFormat.php\">http:\/\/www.archive.org\/web\/researcher\/ArcFileFormat.php<\/a>).<br \/>\nImplemented as a Java servlet. Add parsers to handle different formats, e.g. xpdf for PDF files. 
<strong>This work is sponsored by the IIPC (International Internet Preservation Consortium)<\/strong>.<\/p>\n<p>Wayback, Internet Archive<\/p>\n<p><a href=\"http:\/\/archive-access.sourceforge.net\/projects\/wayback\/\">http:\/\/archive-access.sourceforge.net\/projects\/wayback\/<\/a><\/p>\n<p>The open source version of the Internet Archive&#8217;s proprietary search and display interface, the &#8220;Wayback Machine&#8221; (listed next).<\/p>\n<p>Wayback Machine, Internet Archive<br \/>\n<a href=\"http:\/\/www.archive.org\/web\/web.php\">http:\/\/www.archive.org\/web\/web.php<\/a><br \/>\nA proprietary interface to the Internet Archive&#8217;s huge collection of web pages archived from 1996 to the present.<\/p>\n<p>WERA (Web Archive Access), Internet Archive and National Library of Norway<br \/>\n<a href=\"http:\/\/nwa.nb.no\/\">http:\/\/nwa.nb.no\/<\/a><br \/>\nAn archive viewer application that gives Internet Archive Wayback Machine-like access to web archive collections, as well as full text search and easy navigation between different versions of a web page. WERA is based on, and replaces, the NwaToolset. It uses the NutchWAX search engine and is written in PHP and Java. <strong>This work is sponsored by the IIPC (International Internet Preservation Consortium)<\/strong>.<\/p>\n<h2><a name=\"management\"><\/a>General Web Archiving Suites<\/h2>\n<p><strong>Software that is more of a system of web archiving tools rather than individual applications<\/strong><\/p>\n<p>DataFountains<\/p>\n<p><a href=\"http:\/\/ivia.ucr.edu\/manuals\/DataFountains\/1.0.0\/\">http:\/\/ivia.ucr.edu\/manuals\/DataFountains\/1.0.0\/<\/a><\/p>\n<p>A tool for discovering, harvesting and describing web resources. 
Developed under the iVia Project.<\/p>\n<p>PANDAS (PANDORA Digital Archiving System), National Library of Australia<br \/>\n<a href=\"http:\/\/pandora.nla.gov.au\/pandas.html\">http:\/\/pandora.nla.gov.au\/pandas.html<\/a><br \/>\nTools for controlling the harvest, conducting quality assurance checking, initiating archiving processes, managing the metadata including access restrictions, and producing management reports. Uses the HTTrack harvester. PANDAS was created to enable very selective harvesting and is not intended for large-scale automated harvests. The developers of this software are re-engineering PANDAS to use IIPC tools like Heritrix and WERA, and to be better integrated with their digital repository.<\/p>\n<p>WebArchivist Software Suite, SUNY Institute of Technology and University of Washington<br \/>\n<a href=\"http:\/\/www.webarchivist.org\/resources.htm\">http:\/\/www.webarchivist.org\/resources.htm<\/a><br \/>\nTools for entering metadata, searching, analyzing and displaying archived sites. The software isn&#8217;t licensed yet, but according to the product&#8217;s website, the plan is to make this software available to other organizations. Used for the Library of Congress&#8217; Election 2002 (<a href=\"http:\/\/lcweb4.loc.gov\/elect2002\/\">http:\/\/lcweb4.loc.gov\/elect2002\/<\/a>) and September 11 (<a href=\"http:\/\/september11.archive.org\/\">http:\/\/september11.archive.org\/<\/a>)<br \/>\nweb archives as well as the Asian Tsunami Web Archive (<a href=\"http:\/\/tsunami.archive.org\/\">http:\/\/tsunami.archive.org\/<\/a>).<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Possible Applications to Try: List of a lot of compression tools. 
Mod_GZip Mod_Deflate Information about Website compression: Website Compression at Website Optimization Web Site Compression By Wayne Berry Tools: Web tool that examines the size gains of website compression. Downloadable Website Analyser http:\/\/webarchivist.org\/resources.htm Below: Some annotated text taken from Web Archiving Resources Office for Information Systems, Harvard University Libraries [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[9,12],"tags":[353,354],"class_list":["post-1138","post","type-post","status-publish","format-standard","hentry","category-interesting-web-sites","category-teaching","tag-compression","tag-web-archiving"],"_links":{"self":[{"href":"https:\/\/www.webliminal.com\/webliminalblog\/wp-json\/wp\/v2\/posts\/1138","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.webliminal.com\/webliminalblog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.webliminal.com\/webliminalblog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.webliminal.com\/webliminalblog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.webliminal.com\/webliminalblog\/wp-json\/wp\/v2\/comments?post=1138"}],"version-history":[{"count":5,"href":"https:\/\/www.webliminal.com\/webliminalblog\/wp-json\/wp\/v2\/posts\/1138\/revisions"}],"predecessor-version":[{"id":1140,"href":"https:\/\/www.webliminal.com\/webliminalblog\/wp-json\/wp\/v2\/posts\/1138\/revisions\/1140"}],"wp:attachment":[{"href":"https:\/\/www.webliminal.com\/webliminalblog\/wp-json\/wp\/v2\/media?parent=1138"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":
"https:\/\/www.webliminal.com\/webliminalblog\/wp-json\/wp\/v2\/categories?post=1138"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.webliminal.com\/webliminalblog\/wp-json\/wp\/v2\/tags?post=1138"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}