The Internet - getting the most out of your searches [Archive]

DMac

9th December 2011, 08:18 AM

Most people know internet searching only through major search sites such as Google, Bing, Yahoo et al. What most people don't know is that the entire web is much bigger than the amount of data the mainstream search engines are able to collect.

Take for example Google. Google uses a web crawling technology that follows public link after link to create an index of the web. The other mainstream search engines use similar technologies.
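As a rough illustration of the link-following idea described above, here is a minimal sketch of a crawler. It is not how Google or any real engine is implemented; the `PAGES` dictionary is a made-up in-memory "web" and the names `LinkExtractor` and `crawl` are hypothetical. It only shows why a page that nothing links to never ends up in the index.

```python
from collections import deque
from html.parser import HTMLParser
from typing import Dict, List

# Hypothetical in-memory "web": URL path -> HTML body (an assumption for this sketch).
PAGES: Dict[str, str] = {
    "/": '<a href="/about">About</a> <a href="/articles">Articles</a>',
    "/about": '<a href="/">Home</a>',
    "/articles": '<a href="/articles/1">First article</a>',
    "/articles/1": "<p>No outgoing links.</p>",
    "/search?q=boats": "<p>Generated on demand; nothing links here.</p>",
}


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(pages: Dict[str, str], start: str) -> List[str]:
    """Breadth-first crawl: visit a page, queue every unseen link found on it."""
    seen = {start}
    queue = deque([start])
    indexed: List[str] = []
    while queue:
        url = queue.popleft()
        indexed.append(url)
        extractor = LinkExtractor()
        extractor.feed(pages[url])
        for link in extractor.links:
            if link in pages and link not in seen:
                seen.add(link)
                queue.append(link)
    return indexed


if __name__ == "__main__":
    index = crawl(PAGES, "/")
    print("indexed:", index)
    print("never reached:", sorted(set(PAGES) - set(index)))
```

In this toy setup the `/search?q=boats` page exists but is never reached, because no public link points to it.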

Something to keep in mind, though, is that there are ways to keep a website and its data from being indexed by web crawlers such as Google's.

This information is not necessarily private and it is accessible by any given anonymous user on the web.
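One common way a site asks crawlers to stay out is a robots.txt file. The sketch below uses Python's standard `urllib.robotparser` to check which URLs a well-behaved crawler would skip. The robots.txt content, the `ExampleBot` user agent, the example.com URLs and the helper name `allowed_urls` are all made up for illustration. Note that the excluded pages remain reachable by anyone who has the address, which is the point made above.

```python
from typing import Dict, List
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption); a real site serves it at /robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /members/
Disallow: /search
"""


def allowed_urls(robots_txt: str, user_agent: str, urls: List[str]) -> Dict[str, bool]:
    """Return, for each URL, whether the rules permit this user agent to fetch it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse rules from text, no network access
    return {url: parser.can_fetch(user_agent, url) for url in urls}


if __name__ == "__main__":
    urls = [
        "http://example.com/about",
        "http://example.com/members/profile",
        "http://example.com/search?q=boats",
    ]
    for url, ok in allowed_urls(ROBOTS_TXT, "ExampleBot", urls).items():
        print("crawl" if ok else "skip ", url)
```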

Enter the "deep web" or invisible internet.

Let's take a look at what Wikipedia has to say about the deep web (sometimes loosely called the darknet) for a basic introduction to the subject:

Introduction

The Deep Web (also called Deepnet, the invisible Web, DarkNet, Undernet, or the hidden Web) refers to World Wide Web content that is not part of the Surface Web, which is indexed by standard search engines.

Mike Bergman, founder of BrightPlanet, credited with coining the phrase, has said that searching on the Internet today can be compared to dragging a net across the surface of the ocean: a great deal may be caught in the net, but there is a wealth of information that is deep and therefore missed. Most of the Web's information is buried far down on dynamically generated sites, and standard search engines do not find it. Traditional search engines cannot "see" or retrieve content in the deep Web—those pages do not exist until they are created dynamically as the result of a specific search. The deep Web is several orders of magnitude larger than the surface Web.

Size

Estimates based on extrapolations from a study done at University of California, Berkeley in the year 2000, speculate that the deep Web consists of about 91,000 terabytes. By contrast, the surface Web (which is easily reached by search engines) is about 167 terabytes; the Library of Congress, in 1997, was estimated to have 3,000 terabytes. More accurate estimates are available for the number of resources in the deep Web: He et al. detected around 300,000 deep web sites in the entire Web in 2004, and, according to Shestakov, around 14,000 deep web sites existed in the Russian part of the Web in 2006.

That's a tremendous amount of data hidden from most people's view! It is also interesting to note that these statistics are over 5 years old. The amount of data currently hidden from the mainstream public is likely much higher in today's world.
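To put the quoted figures in perspective, here is a quick back-of-the-envelope calculation. It uses only the two terabyte estimates from the passage above; the function name `size_ratio` is just a label for this sketch.

```python
import math

DEEP_WEB_TB = 91_000   # deep Web estimate quoted above, in terabytes
SURFACE_WEB_TB = 167   # surface Web estimate quoted above, in terabytes


def size_ratio(deep_tb: float, surface_tb: float) -> float:
    """How many times larger the deep Web estimate is than the surface Web estimate."""
    return deep_tb / surface_tb


ratio = size_ratio(DEEP_WEB_TB, SURFACE_WEB_TB)
orders_of_magnitude = math.log10(ratio)
surface_share = SURFACE_WEB_TB / (DEEP_WEB_TB + SURFACE_WEB_TB)

if __name__ == "__main__":
    print(f"deep / surface ratio: {ratio:.0f}x")
    print(f"orders of magnitude:  {orders_of_magnitude:.2f}")
    print(f"surface share of all: {surface_share:.2%}")
```

By these estimates the deep Web comes out roughly 545 times the size of the surface Web, so between two and three orders of magnitude.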

Here are some search engines that use different technology from the mainstream ones and can be used to explore different sections of the 'darknet':

10 Search Engines to Explore the Invisible Web (http://www.makeuseof.com/tag/10-search-engines-explore-deep-invisible-web/)

Infomine (http://infomine.ucr.edu/)

Infomine has been built by a pool of libraries in the United States. Some of them are University of California, Wake Forest University, California State University, and the University of Detroit. Infomine ‘mines’ information from databases, electronic journals, electronic books, bulletin boards, mailing lists, online library card catalogs, articles, directories of researchers, and many other resources.

You can search by subject category and further tweak your search using the search options. Infomine is not only a standalone search engine for the Deep Web but also a staging point for a lot of other reference information. Check out its Other Search Tools and General Reference links at the bottom.

The WWW Virtual Library (http://vlib.org/)

This is considered to be the oldest catalog on the web and was started by Tim Berners-Lee, the creator of the web. So, isn’t it strange that it finds a place in the list of Invisible Web resources? Maybe, but the WWW Virtual Library lists quite a lot of relevant resources on quite a lot of subjects. You can go vertically into the categories or use the search bar. The screenshot shows the alphabetical arrangement of subjects covered at the site.

Intute (http://www.intute.ac.uk/)

Intute is UK-centric, but it has some of the most esteemed universities of the region providing the resources for study and research. You can browse by subject or do a keyword search for academic topics ranging from agriculture to veterinary medicine. The online service has subject specialists who review and index other websites that cater to the topics for study and research.

Intute also provides over 60 free online tutorials for learning effective internet research skills. The tutorials are step-by-step guides arranged around specific subjects.

Complete Planet (http://aip.completeplanet.com/)

Complete Planet calls itself the ‘front door to the Deep Web’. This free and well designed directory resource makes it easy to access the mass of dynamic databases that are cloaked from a general purpose search. The databases indexed by Complete Planet number around 70,000 and range from Agriculture to Weather. Also thrown in are databases like Food & Drink and Military.

For a really effective Deep Web search, try out the Advanced Search options where among other things, you can set a date range.

Infoplease (http://www.infoplease.com/index.html)

Infoplease is an information portal with a host of features. Using the site, you can tap into a good number of encyclopedias, almanacs, an atlas, and biographies. Infoplease also has a few nice offshoots like Factmonster.com for kids and Biosearch, a search engine just for biographies.

DeepPeep (http://www.deeppeep.org/)

DeepPeep aims to enter the Invisible Web through forms that query databases and web services for information. Typed queries open up dynamic but short lived results which cannot be indexed by normal search engines. By indexing databases, DeepPeep hopes to track 45,000 forms across 7 domains.

The domains covered by DeepPeep (Beta) are Auto, Airfare, Biology, Book, Hotel, Job, and Rental. Being a beta service, there are occasional glitches as some results don’t load in the browser.

IncyWincy (http://www.incywincy.com/)

IncyWincy is an Invisible Web search engine and it behaves as a meta-search engine by tapping into other search engines and filtering the results. It searches the web, directory, forms, and images. With a free registration, you can track search results with alerts.

DeepWebTech (http://www.deepwebtech.com/)

DeepWebTech gives you five search engines (and browser plugins) for specific topics. The search engines cover science, medicine, and business. Using these topic specific search engines, you can query the underlying databases in the Deep Web.

Scirus (http://www.scirus.com/srsapp/)

Scirus has a pure scientific focus. It is a far reaching research engine that can scour journals, scientists’ homepages, courseware, pre-print server material, patents and institutional intranets.

TechXtra (http://www.techxtra.ac.uk/index.html)

TechXtra concentrates on engineering, mathematics and computing. It gives you industry news, job announcements, technical reports, technical data, full text eprints, teaching and learning resources along with articles and relevant website information.

Just like general web search, searching the Invisible Web is also about looking for the needle in the haystack. Only here, the haystack is much bigger. The Invisible Web is definitely not for the casual searcher. It is deep but not dark, because if you know what you are searching for, enlightenment is a few keywords away.

DMac

9th December 2011, 08:21 AM

Continuing the subject...

There are other search sites worth mentioning:

Wolfram|Alpha (http://www.wolframalpha.com/)

Answer questions, do math, instantly get facts, create plots, calculators, unit conversions, scientific data and statistics, help with homework—and much more.

DEVON Technologies:
This one caters to Mac OS users and includes several downloadable applications for searching:
http://www.devontechnologies.com/

Dogman

9th December 2011, 08:22 AM

Have known about this for years, thanks for the links and the very good reminder.

DMac

9th December 2011, 08:24 AM

100 Useful Tips and Tools to Research the Deep Web (http://www.online-college-blog.com/features/100-useful-tips-and-tools-to-research-the-deep-web/)

intro:

Experts say that typical search engines like Yahoo! and Google only pick up about 1% of the information available on the Internet. The rest of that information is considered to be hidden in the deep web, also referred to as the invisible web. So how can you find all the rest of this information? This list offers 100 tips and tools to help you get the most out of your Internet searches.

Meta-Search Engines

Meta-search engines use the resources of many different search engines to gather the most results possible. Many of these will also eliminate duplicates and classify results to enhance your search experience.
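The merge-and-deduplicate step described above can be sketched roughly as follows. This is only an illustration and not how any particular meta-search engine works; the engine names, the sample URLs and the function names `normalize_url` and `merge_results` are all hypothetical, and real engines also rank and classify results.

```python
from typing import Dict, List
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Lower-case scheme and host, drop the fragment and any trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def merge_results(results_by_engine: Dict[str, List[str]]) -> List[str]:
    """Combine result lists in engine order, keeping the first copy of each URL."""
    seen = set()
    merged: List[str] = []
    for urls in results_by_engine.values():
        for url in urls:
            key = normalize_url(url)
            if key not in seen:
                seen.add(key)
                merged.append(url)
    return merged


if __name__ == "__main__":
    # Made-up results from two made-up engines.
    sample = {
        "engine_a": ["http://Example.com/page/", "http://example.org/x"],
        "engine_b": ["http://example.com/page", "http://example.net/y#top"],
    }
    for url in merge_results(sample):
        print(url)
```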

This site includes many search engines for scouring the deep web and also contains some helpful tips.

Sample:

Tips and Strategies

Searching the deep web should be done a bit differently, so use these strategies to help you get started on your deep web searching.

Don’t rely on old ways of searching. Become aware that approximately 99% of content on the Internet doesn’t show up on typical search engines, so think about other ways of searching.

Search for databases. Using any search engine, enter your keyword alongside "database" to find any searchable databases (for example, "running database" or "woodworking database").

Get a library card. Many public libraries offer access to research databases for users with an active library card.

Stay informed. Reading blogs or other updated guides about Internet searches on a regular basis will ensure you are staying updated with the latest information on Internet searches.

Search government databases. There are many government databases available that have plenty of information you may be seeking.

Bookmark your databases. Once you find helpful databases, don’t forget to bookmark them so you can always come back to them again.

Practice. Just like with other types of research, the more you practice searching the deep web, the better you will become at it.

Don’t give up. Researchers agree that most of the information hidden in the deep web is some of the best quality information available.
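The "search for databases" tip above is easy to script if you want to try it across a list of topics. This is a small sketch only; the `https://search.example/search` endpoint and its `q` parameter are placeholders, not a real search engine, and `database_query` and `search_url` are made-up helper names.

```python
from typing import List
from urllib.parse import urlencode

# Placeholder endpoint (an assumption); substitute your preferred search engine's URL.
SEARCH_ENDPOINT = "https://search.example/search"


def database_query(keyword: str) -> str:
    """Pair a keyword with the word 'database', as the tip above suggests."""
    return f"{keyword.strip()} database"


def search_url(keyword: str, endpoint: str = SEARCH_ENDPOINT) -> str:
    """Build a search URL for the keyword-plus-'database' query."""
    return f"{endpoint}?{urlencode({'q': database_query(keyword)})}"


if __name__ == "__main__":
    topics: List[str] = ["running", "woodworking"]
    for topic in topics:
        print(search_url(topic))
```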

banjo

9th December 2011, 08:39 AM

Interesting stuff, thanks for posting.

DMac

9th December 2011, 08:42 AM

Continuing, there is a dark part of the deep web that is only accessible by using the Tor Browser. This is the modern world of hackers, the real kind. I do not plan on giving too much information on this area of the deep web because, frankly, 'what has been seen cannot be unseen.'

There are networks of people out there on the dark net selling drugs, death and all sorts of mayhem. Custom trojans, viruses and all sorts of nasty stuff. Not joking.

For further research, for example, there is the 'hidden wiki'. Pipe that into Google for some basic information. Also, see .onion (http://en.wikipedia.org/wiki/.onion):

.onion is a pseudo-top-level domain host suffix (similar in concept to such endings as .bitnet and .uucp used in earlier times) designating an anonymous hidden service reachable via the Tor network. Such addresses are not actual DNS names, and the .onion TLD is not in the Internet DNS root, but with the appropriate proxy software installed, Internet programs such as Web browsers can access sites with .onion addresses by sending the request through the network of Tor servers. The purpose of using such a system is to make both the information provider and the person accessing the information more difficult to trace, whether by one another, by an intermediate network host, or by an outsider.

Addresses in the .onion pseudo-TLD are opaque, non-mnemonic, 16-character alpha-semi-numeric hashes which are automatically generated based on a public key when a hidden service is configured. These 16-character hashes can be made up of any letter of the alphabet, and decimal digits beginning with 2 and ending with 7, thus representing an 80-bit number in base32.

The "onion" name refers to onion routing, the technique used by Tor to achieve a degree of anonymity.
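The address format described above (16 characters, letters a-z and digits 2-7, encoding an 80-bit number in base32) can be checked with a few lines of Python. The quoted text only says the address is generated from a public key; the specific derivation used here, the first 80 bits of a SHA-1 hash of the key, is my understanding of the 16-character scheme and should be treated as an assumption. The key bytes are a placeholder rather than a real key, and the function names are made up for this sketch.

```python
import base64
import hashlib
import re

# 16 characters drawn from a-z and 2-7, as described in the quoted text.
ONION_LABEL_RE = re.compile(r"[a-z2-7]{16}")


def is_onion_label(label: str) -> bool:
    """True if the label matches the 16-character base32 format."""
    return ONION_LABEL_RE.fullmatch(label) is not None


def onion_label_from_key(public_key_bytes: bytes) -> str:
    """Assumed derivation: base32 of the first 10 bytes (80 bits) of SHA-1(key)."""
    digest = hashlib.sha1(public_key_bytes).digest()
    return base64.b32encode(digest[:10]).decode("ascii").lower()


def label_to_int(label: str) -> int:
    """Decode the base32 label back into the 80-bit number it represents."""
    return int.from_bytes(base64.b32decode(label.upper()), "big")


if __name__ == "__main__":
    placeholder_key = b"not a real public key"  # stand-in bytes for illustration
    label = onion_label_from_key(placeholder_key)
    print(label + ".onion", is_onion_label(label), label_to_int(label).bit_length())
```

Ten bytes is 80 bits, and base32 packs 5 bits per character, which is where the 16-character length comes from.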

Child porn scumbags use networks like this to stay outside LEOs' main areas of interest. Government agencies and the various LEO organizations have a presence in the darknet, but they are way behind the times on the whole.

Ponce

9th December 2011, 09:26 AM

Man oh man, am I in trouble now........thanks DMac.