The Challenges of Machine Searching

Guest post by Lorna Campbell, Nalanda Technology

Searching disparate sources carries many variables and risks: the sources differ in standards, quality and volume, and may contain duplicated material. If the average workforce spends 20% of its time searching, are they finding the material they are looking for in a precise and cost-effective manner?

In the past we used libraries for research and would ask academics for help. To do so, we needed to give precise requirements and variables; if we were not precise, there was a risk that the researcher would not provide a helpful response. With the growth in unstructured data we need, in general terms, to be just as precise, or else risk having too many results to manage. Some of the challenges of unstructured data growth come from data format, data quality, update frequency and language. As a researcher, however, you don’t have much control over these variables. And they are variables, because you may be looking at more than one source system of unstructured data: e-mails, CMS, SharePoint, RMS, databases and third-party documents, for example.

Data format and data quality are two distinct problems that, in the structured data world, are often addressed with a data management tool. Performing data quality routines to standardize, manipulate and change the data is expensive, even against constrained fields, and the challenge is amplified in unstructured data that has no constraints. That said, one of the key points to highlight with data management solutions is that you are changing data, which you may not own. In manipulating data to standardize it, you may be artificially adding value where there was none originally. Another approach is to deploy a search index that can manage the data standards and data quality challenges itself: a search solution that supports the user with algorithms for spelling, stemming and hyponyms.
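As a rough illustration of handling data quality in the index rather than in the data, the sketch below normalizes words only when building the index and tolerates misspellings at query time, leaving the source documents untouched. It is a minimal sketch, not a description of any particular product: the suffix-stripping `stem` function, the `build_index` and `search` helpers, the sample documents and the 0.8 similarity cutoff are all assumptions made for this example. A production index would use a proper stemmer and spelling model.

```python
import difflib

def stem(word: str) -> str:
    """Crude suffix stripping (illustrative only, not a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each stemmed token to the ids of the documents containing it."""
    index: dict[str, set[str]] = {}
    for doc_id, text in docs.items():
        for token in text.lower().split():
            token = token.strip(".,;:!?\"'()")
            if token:
                index.setdefault(stem(token), set()).add(doc_id)
    return index

def search(index: dict[str, set[str]], term: str, cutoff: float = 0.8) -> set[str]:
    """Look up a term, accepting index keys that are close in spelling."""
    key = stem(term.lower())
    close_keys = difflib.get_close_matches(key, list(index), n=3, cutoff=cutoff)
    hits: set[str] = set()
    for close_key in close_keys:
        hits |= index[close_key]
    return hits

# Hypothetical sample data; the source text is never modified.
docs = {
    "a": "Patient reported headaches.",
    "b": "Invoices were archived.",
}
index = build_index(docs)
print(search(index, "hedaches"))  # misspelled query
print(search(index, "archive"))   # different word form
```

The normalization lives entirely in the index, so nothing is added to or changed in data you may not own.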

For a lot of users, the experience of searching is frustrating: it takes too long and is not always fruitful. Searching should be more than looking for a person’s name or putting a single term into a search bar and hoping that the search solution can infer other criteria. Of all the issues that can be solved by a search solution, the user is not one of them. The concept of a case or a project is not a new one, so why not use the information from the user’s navigation history or project to help fill in the blanks? If a user is working on a legal case, for example, they probably want information to come back that is relevant to the case. Using third-party information such as their history can help filter or rank results specific to the case that they are working on.
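One simple way to picture this is to re-rank results by how many words they share with the user's current case. The sketch below is only an illustration under stated assumptions: the `rerank` function, the result records and the case terms are all hypothetical, and a real system would weight terms rather than count them.

```python
def rerank(results: list[dict], context_terms: list[str]) -> list[dict]:
    """Order results by how many case-context terms each one contains."""
    context = {term.lower() for term in context_terms}

    def overlap(result: dict) -> int:
        words = set(result["text"].lower().split())
        return len(words & context)

    # sorted() is stable, so results with equal overlap keep their original order.
    return sorted(results, key=overlap, reverse=True)

# Hypothetical search results and case context.
results = [
    {"id": "memo-1", "text": "canteen menu for friday"},
    {"id": "brief-7", "text": "witness statement for the smith contract dispute"},
    {"id": "note-3", "text": "contract renewal reminder"},
]
case_terms = ["Smith", "contract", "dispute", "witness"]

for result in rerank(results, case_terms):
    print(result["id"])
```

The user typed nothing extra here; the case itself supplied the missing criteria.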

Imprecise searching will highlight two problems: the search criteria and the search results. The search criteria require a level of ambiguity to allow for the various permutations of the result across numerous unconnected source systems that do not share a standard. But the results should not be so numerous that users struggle to find the relevant ones. After all, you do not want your patient record to be missed by a clinician because the results list is too long, or because your record of “blood pressure was elevated” was not returned by a search for “high blood pressure”. Equally, the ability to use domain-specific search language will help find relevant results by equating ‘high’ with ‘elevated’ and equating ‘symptoms’ with a clinical term such as ‘hypertension’.
We would suggest that precision is a combination of:
• 100% accuracy to the search terms
• A level of ambiguity based on a lack of data standards between the sources
• Returning results where the terms are at a separation level within the document that implies a relationship. Terms merely appearing in the same document are a weaker signal than those terms being in the same sentence, paragraph or section. Sentence-level separation of terms implies context between the search terms.

To help with precision, a good quality search index and search tool should use Natural Language Processing (NLP) techniques to support the user. Natural Language Processing is the field of Computer Science that is concerned with the interactions between computers and natural human languages. Some of the top vendors in the world have commercial solutions to help humans be more productive. In essence, they work by simplifying human speech (verbal or written), translating it into machine-usable code and instructions, and producing an output that is related to the initial request. As far as sciences go, it’s pretty complex.

However, it is not all about chatbots and Siri. NLP has enormous use cases in both the structured and unstructured world. The ability to scan large volumes of unstructured data and produce a usable value is where NLP can help. Summarization and Categorisation techniques have evolved to a level of precision that enables a consumer to quickly achieve a high level of understanding without reading every word or every document. These techniques become invaluable in providing decision support. For example, a professional deciding whether to intervene for the safety of a vulnerable person faces a time-consuming and risk-laden task, and we can’t forget how emotions play into the equation. The human bases their judgment on emotions and personal opinion; a machine does not. The human uses inference and third-party data to influence decisions; the machine does not. Search solutions should have the ability to scan large volumes of unstructured data and highlight topics and risks in any given field or market sector. There will always be a need for a human to support decisions in cases where the judgment is “gray” and not black or white.

Understanding the sentiment of a statement, document or sentence is an excellent way of scoring and graphing data. Sentiment Analysis can provide insight where the meaning is hidden; handling sarcasm, however, is a lot more difficult. There are vendors that use NLP techniques to monitor social media for market analysis, usually looking at brand or product related messages. And while understanding trends is valuable, there are many factors, such as peer pressure and influencers, that will take consumers to social media to vent their frustrations. But NLP techniques can be used to monitor more than customer relationships in fast food outlets. Imagine being able to run Sentiment Analysis on a government report or parliamentary debates and see whether, and how, the world events that happened on a given day might have influenced a state decision in a certain country. The same techniques can be used to demonstrate why the graph has a peak or a trough and, more importantly, what influenced that event. Influencers do not often originate in the source data. NLP techniques can compare data sources to highlight where there is a third-party influence, thus adding value to the original data.
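
To make the scoring idea concrete, here is a deliberately naive word-list scorer. It is a toy sketch under stated assumptions: the `POSITIVE` and `NEGATIVE` word lists and the `sentiment` function are invented for this example, and real Sentiment Analysis relies on trained models rather than a handful of words.

```python
import re

# Hypothetical, tiny word lists purely for illustration.
POSITIVE = {"good", "great", "welcome", "progress"}
NEGATIVE = {"bad", "crisis", "fail", "concern"}

def sentiment(text: str) -> float:
    """Score text from -1.0 (negative) to 1.0 (positive); 0.0 if no cue words."""
    words = re.findall(r"[a-z]+", text.lower())
    positive = sum(word in POSITIVE for word in words)
    negative = sum(word in NEGATIVE for word in words)
    total = positive + negative
    return 0.0 if total == 0 else (positive - negative) / total

print(sentiment("The committee raised concern over the crisis"))
print(sentiment("Oh great, the system is down again"))
```

The second sentence is sarcastic, yet a word-list approach scores it as positive, which is exactly why sarcasm is hard to manage.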

The way that unstructured data has grown has left researchers in a quandary: how to search and achieve precise and usable results quickly. Many technologies achieve precise results and many put performance at the top of the list, but few can do both. It’s not uncommon for developers and vendors to make a conscious decision to trade performance against precision. As a consumer, you might not be aware of these decisions. As a consumer, you want both! Discover more about the precision data search solution ‘Nalytics’, which can do both, at www.nalandatechnology.com.