Metadata and searching for non-text items

There is a fundamental difference between searching for non-text items and searching for text items such as documents and other Web pages. The difference is not in the techniques we use. Because we usually use text for our searches, a search for a non-text item is based on secondary information, often descriptive in nature. Almost all of our search tools use text to match items in a database. This is entirely natural and a long-standing practice, especially since documents and other text items are what people have traditionally searched for when researching a topic. When we use text to search for an item in text format, we are doing a direct or primary search in the same medium as what we are looking for.

Imagine what it would be like if we could search for images using images. We could submit one image to the search engine and have it find a collection of images that are similar in one way or another to the one we submitted. Thinking through what it means for one image to be similar to another helps us understand what people have had to do to make searching for text successful. Would we consider one image to be similar to another if it only had the same colors? The same number of primary objects? In many cases we'd like the search engine to return items that have similar content, and being able to express that notion of similar content is the issue. If we have an image of red and black butterflies, is there software available that can determine that and find other images with red and black butterflies? What if it turned up images of red and black kites or boats with red and black sails?
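To make the "same colors" question concrete, here is a minimal sketch of a color-only comparison. It assumes each image has already been reduced to a list of color names, one per pixel; the function name `color_similarity` and the three sample "images" are made up for illustration, and real image search tools are far more involved than this.

```python
from collections import Counter
from math import sqrt

def color_similarity(pixels_a, pixels_b):
    """Cosine similarity between the color counts of two images.

    1.0 means the same mix of colors; 0.0 means no colors in common.
    """
    counts_a, counts_b = Counter(pixels_a), Counter(pixels_b)
    # A Counter returns 0 for a color it has never seen.
    dot = sum(counts_a[color] * counts_b[color] for color in counts_a)
    norm_a = sqrt(sum(v * v for v in counts_a.values()))
    norm_b = sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical images, each reduced to 100 pixels described by color name.
butterflies = ["red"] * 40 + ["black"] * 40 + ["green"] * 20
kite = ["red"] * 40 + ["black"] * 40 + ["blue"] * 20
forest = ["green"] * 90 + ["brown"] * 10

score_kite = color_similarity(butterflies, kite)
score_forest = color_similarity(butterflies, forest)
print(f"butterflies vs kite:   {score_kite:.2f}")
print(f"butterflies vs forest: {score_forest:.2f}")
```

Judged by color alone, the kite scores as a much closer match to the butterflies than the forest does, even though it shows a completely different subject. That is exactly the problem with treating matching colors as matching content.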

What makes two audio files similar? The same audio frequencies? The same rhythm? If one file is a comedy sketch, should a similar audio file be a collection of jokes by the same person? How can software determine whether the content of an audio file is funny?

Attempting to derive characteristic or descriptive information directly from non-text files raises a number of interesting issues, many of which are very difficult to deal with. Sorry to say, but we are not there yet!

Because we have not developed software that can quickly analyze and classify non-text items, we rely on secondary, descriptive text associated with the item. This secondary information that describes an item is called metadata.

You have likely heard about or thought about this issue before. For a file containing text, the name of the file, the date it was created, the date it was last modified, the size of the file, and the name of the site that hosts or publishes the file are all considered metadata. A list of key terms or words about the content would also be considered metadata.
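As a small illustration, most of this file-level metadata can be read without opening the file's contents at all. The sketch below uses Python's standard `os.stat`; the helper name `file_metadata` and the throwaway sample file are made up for the example. The creation date is left out because operating systems do not report it in a consistent way.

```python
import os
import tempfile
from datetime import datetime, timezone

def file_metadata(path):
    """Collect basic descriptive facts about a file without reading its contents."""
    info = os.stat(path)
    return {
        "name": os.path.basename(path),
        "size_bytes": info.st_size,
        "modified": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc),
    }

# Create a throwaway file so the example runs anywhere.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
    handle.write("hydrangeas")
    sample_path = handle.name

sample = file_metadata(sample_path)
os.remove(sample_path)

for key, value in sample.items():
    print(f"{key}: {value}")
```

Nothing here says what the file is about. It only describes the container, which is why a list of key terms or a description is such a valuable addition.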

Some metadata is rather technical. Examples include the sampling rate or bit rate associated with an audio file, and the exposure information, the encoding process, and so on for pictures taken with digital cameras. The extension on a file name, such as .jpg, .mp3, .wav, or .mov, is metadata too, because it tells us the type of file in which the information is stored, and that in turn tells us about the type of encoding or compression used.
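Software leans on the extension in the same way. As a rough sketch, Python's standard `mimetypes` module guesses a file's type purely from its name; the wrapper `guess_kind` and the file names below are made up for illustration, and the exact answers can vary a little with the system's type tables.

```python
import mimetypes

def guess_kind(filename):
    """Guess a file's media type from its name alone (the file is never opened)."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    return mime_type  # None when the extension is not recognized

for name in ["paris.jpg", "sketch.mp3", "interview.wav", "holiday.mov"]:
    print(name, "->", guess_kind(name))
```

Because the guess comes from the name and not the contents, renaming a picture so that it ends in .mp3 would fool it. Metadata is only as reliable as whoever supplied it.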

Still, we use the same techniques for searching for text and non-text files, but in the case of non-text files we rely on text metadata in the form of titles, descriptions, and tags associated with those files. The issue here is that we are relying on information that someone provides about the file's contents, not the contents of the file itself, to help us locate it. Since these descriptions, titles, and tags are provided without strict rules, the coverage can be spotty or inaccurate. That is to be expected. What you think is funny may not give me much of a chuckle, although it is reasonable to expect that you and I will use the tag 'butterfly' in pretty much the same way.
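A toy example shows how spotty coverage plays out. The catalog, file names, and `search_by_tag` function below are all hypothetical; the point is only that the search sees the titles and tags, never the pictures themselves.

```python
# Hypothetical catalog: each record is metadata someone typed in by hand.
catalog = [
    {"file": "img_001.jpg", "title": "Monarch on milkweed",
     "tags": {"butterfly", "orange", "garden"}},
    {"file": "img_002.jpg", "title": "Beach afternoon",
     "tags": {"kite", "red", "black"}},
    # This one is also a butterfly photo, but nobody described it.
    {"file": "img_003.jpg", "title": "Untitled", "tags": set()},
]

def search_by_tag(records, term):
    """Return files whose tags or title mention the term (case-insensitive)."""
    term = term.lower()
    return [
        record["file"]
        for record in records
        if term in record["tags"] or term in record["title"].lower()
    ]

matches = search_by_tag(catalog, "butterfly")
print(matches)
```

Only the tagged photo comes back. The untagged butterfly picture is invisible to the search, no matter how good the image is.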

The collection of tools and techniques we've developed for finding text items, such as including phrases in quotes and specifying appropriate key words, will work for us in finding non-text items too. Sometimes, though, we have to be more persistent and flexible, because we are really searching through the metadata associated with the information. For example, when searching for images using Google Image Search, we can specify the size of the image to show, its type (such as news, face, clip art, line drawing, or photo), and the dominant color in the photo.
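Here is a rough sketch of how a quoted phrase and a few filters might combine over image metadata. It is not how Google Image Search works internally; the records, field names, and `find_images` function are assumptions made up to illustrate the idea.

```python
# Hypothetical image records; every field is text or a number supplied as metadata.
images = [
    {"file": "a.png", "description": "red and black butterflies on a leaf",
     "kind": "photo", "color": "red", "width": 1600},
    {"file": "b.gif", "description": "butterflies, red kite and black sails",
     "kind": "clip art", "color": "red", "width": 400},
    {"file": "c.jpg", "description": "red and black butterflies sketch",
     "kind": "line drawing", "color": "black", "width": 800},
]

def find_images(records, phrase, kind=None, color=None, min_width=0):
    """Match an exact phrase in the description, then apply optional filters."""
    phrase = phrase.lower()
    results = []
    for record in records:
        if phrase not in record["description"].lower():
            continue  # the quoted phrase must appear word for word
        if kind is not None and record["kind"] != kind:
            continue
        if color is not None and record["color"] != color:
            continue
        if record["width"] < min_width:
            continue
        results.append(record["file"])
    return results

phrase_only = find_images(images, "red and black butterflies")
photos_only = find_images(images, "red and black butterflies", kind="photo")
print(phrase_only)
print(photos_only)
```

The exact phrase drops the record that merely contains the same words in a different order, and the type filter narrows the results further. Both steps work entirely on the descriptive text and fields, not on the images.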
