WebMD

The WebMD project is a set of scripts that records all the outlinks from the main page of a particular website (in this case WebMD) and constructs a network graph from the subpages and their relationships. Treating a large, complicated website as a complex network makes it possible to ask what the underlying messages of the website might be, as indicated by its physical structure. Below is a case study of its usage.

Cyberchondria refers to unfounded health concerns perpetuated by medical information found online. WebMD is a popular website and often a top search result for people seeking to self-diagnose conditions and symptoms. Its tendency to heighten concern about potential conditions and exaggerate the seriousness of symptoms is a frequent subject of jokes. In particular, articles online have pointed out how easy it is to arrive at a cancer diagnosis on the website.

We cannot determine the validity of the entire WebMD site by fact-checking the answers given by each page, but we can perhaps answer this question: given a symptom, how far away is a person from a diagnosis of cancer on WebMD?

So here is an experiment that attempts to use the physical properties (text and links) of a website to determine its message. The goal is to investigate the structure and content of webmd.com in order to determine whether, and how much, it perpetuates the diagnosis of cancer.

The site is a big nest of links, so the scope is limited to the A-Z common topics page. This section lists 482 health-related topic pages, from Acid Reflux to Zoster (Herpes) Virus. The content examined is further limited to the main article of each of the conditions.

The experiment looks at each page’s center content section for 2 things: cancer-related words (a limited list I found on the internet), and all the outlinks from that section of the page. It continues to search through the pages until it arrives at either a page with cancer keywords, a page with no links, or a page that is outside of WebMD.
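The stopping rule above can be sketched as a breadth-first crawl. This is a minimal illustration, not the project's actual scraper: the keyword set, the status labels, and the `fetch` helper (which is assumed to return a page's main text and the outlinks of its center section) are all hypothetical stand-ins.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical keyword list; the list used in the experiment is not given here.
CANCER_WORDS = {"cancer", "tumor", "carcinoma"}


def is_internal(url, domain="webmd.com"):
    """True if the URL's host is the domain itself or one of its subdomains."""
    host = urlparse(url).netloc.lower()
    return host == domain or host.endswith("." + domain)


def mentions_cancer(text):
    """True if any keyword appears in the page's main text."""
    lowered = text.lower()
    return any(word in lowered for word in CANCER_WORDS)


def crawl(start_urls, fetch):
    """Follow links until a cancer page, a dead end, or an external page.

    fetch(url) is an assumed helper returning (main_text, outlinks).
    Returns a status label per visited URL and the list of (source, target) edges.
    """
    status = {}
    edges = []
    queue = deque(start_urls)
    while queue:
        url = queue.popleft()
        if url in status:
            continue  # already visited
        if not is_internal(url):
            status[url] = "external"  # outside the site: stop here
            continue
        text, links = fetch(url)
        if mentions_cancer(text):
            status[url] = "cancer"  # keyword found: stop here
            continue
        if not links:
            status[url] = "dead_end"  # no outlinks: stop here
            continue
        status[url] = "no_cancer"
        for link in links:
            edges.append((url, link))
            queue.append(link)
    return status, edges
```

Passing `fetch` in as a parameter keeps the traversal separate from the HTML parsing, so the stopping logic can be tried on a small in-memory set of pages before pointing it at a live site.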

Using this method, the simple web scraper picked up 9714 web pages. Of these,

  • 7976 pages do not have cancer-related keywords on them.
  • 726 pages are cancer-related conditions because keywords were found in the main content.
  • 1012 information pages had either no outlinks, such as liver, or outlinks that redirected to a sponsored page like this.

A rat’s nest of a directed network graph was made with a force-directed layout from the resulting pages, where each page is a node and each link between pages is an edge. The cancer-related pages here are colored in red. It is not immediately noticeable which categories of pages have more prominence. However, it is clear that there are central nodes in the network to which almost every page eventually leads.
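With pages as nodes and links as edges, the question of how far a symptom is from a cancer page becomes a shortest-path question. Here is a small sketch of that idea using breadth-first search; the function name, the adjacency-dict format, and the page names in the usage example are made up for illustration and are not taken from the project.

```python
from collections import deque


def clicks_to_cancer(start, links, cancer_pages):
    """Return the fewest clicks from start to any page in cancer_pages.

    links maps each page to the list of pages it links to.
    Returns None if no cancer page can be reached.
    """
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, dist = queue.popleft()
        if page in cancer_pages:
            return dist
        for nxt in links.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


# Toy graph with invented page names.
toy_links = {
    "headache": ["migraine", "stress"],
    "migraine": ["brain-tumor"],
    "stress": [],
}
print(clicks_to_cancer("headache", toy_links, {"brain-tumor"}))
```

Because breadth-first search visits pages in order of distance, the first cancer page it meets is guaranteed to be one of the nearest.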

PageRank, the best-known part of the Google search algorithm and one of the algorithms that determine the order of search results, measures the relative importance of a page based on the links pointing to it. Below are the top 1000 pages by PageRank, in descending order. We can see that pages with cancer do not have the highest scores, and are distributed throughout the ranking.
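For readers who have not seen PageRank computed, here is a minimal sketch of the textbook power-iteration version. It is not necessarily how the scores in this post were produced; the damping factor of 0.85, the iteration count, and the toy page names are assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook power-iteration PageRank.

    links maps each page to the list of pages it links to.
    Returns a dict of page -> score; the scores sum to 1.
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by pages with no outlinks is spread evenly over all pages.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        new_rank = {
            node: (1 - damping) / n + damping * dangling / n for node in nodes
        }
        # Every other page splits its rank equally among its outlinks.
        for node, targets in links.items():
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Toy graph with invented page names: three topic pages all link to one hub.
toy_links = {
    "a": ["hub"],
    "b": ["hub"],
    "c": ["hub", "a"],
    "hub": [],
}
scores = pagerank(toy_links)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 3))
```

In the toy graph the page that everything links to ends up with the highest score, which is the same pattern as the central nodes in the network described above.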

Unfortunately, this is a much more complicated project than I expected, so I can only tell you that, given what I have seen of the network, cancer-related pages do not act differently from, or hold prominence over, other topic pages. However, it is not clear whether the scope of the website’s conditions covers cancer-related topics proportionally more than it should. Nor is it clear, when a cancer diagnosis does occur, how much of it is driven by the behavior of the medical advice seeker, who may tend to travel the path toward the worst scenarios.

If WebMD is not about diagnosing cancer, then where are the most likely places that any given WebMD query will lead? A few pages with significantly higher centrality and PageRank stood out far from the rest, and these pages focus on 2 things: policy and medicine.

The page to which every page eventually leads is, as expected, the disclaimer, which states that WebMD's contents “are for informational purposes only. The Content is not intended to be a substitute for professional medical advice, diagnosis, or treatment…”

An equally prominent page is a tool to identify medication. The drug index comes in 3rd, but has the most user input on the website, with its thousands of reviews of specific drugs.

https://github.com/jjjiia/webmd
