Tech Blog

What's in a Claim? (Part 2): Gaps, Anomalies and Analytical Challenges

Matt O'Neill

March 17, 2025 at 7:29 PM

image generated by Midjourney — a computer analyst line drawing in the style of Maurice Sendak

If you haven’t read part 1 of our introduction to claims data, please take the time to peruse it; I promise that what follows will make a lot more sense. And before we dive into a specific use case, let’s take a moment to review some nuances common to this sort of data. As Richard McElreath said, “It isn’t my job to disappoint people, but I’m good at it.” While we would all love a dataset that is complete, consistent, and error-free, that is not, unfortunately, what a claims dataset is.

Keep the following in mind as you read:

The discussion below is operating under the same principles as your high school physics class, i.e., “For the purposes of this problem, assume a perfect vacuum, no friction, and no other environmental variables exist.” We will cover core concepts that are general truths, but the devil is in the details of any specific analysis.
YMMV (your mileage may vary). It is worthwhile to maintain a skeptic’s view of your results. Just because your query returned an answer doesn’t mean the answer makes any sense. Compare your answer to macro-economic trends, other public datasources, and any other set you have to attempt to verify it. An analysis that works swimmingly for one entity may not work at all for another, due to the vagaries in claims submission and data collection (again, see part 1).
Even after all this reading, your journey has only begun. If you remember the discussion of NPPES from the Care and Feeding of Healthcare Datasets article, you know that most of the information about entities and providers isn’t reported on a claim, usually only the NPI. We will cover the importance of a Provider Directory in a future article.

Core concepts for claims datasets

Referring providers

The “referring provider” field would be valuable if it were always filled in, but the raw “fill rate” is ~30%. While its presence can be indicative of a referral, its absence is not indicative of an absence of a referral.
Not only is the “fill rate” low for referring NPI, it may not be what it appears when it is filled in. Sometimes it is a Type 1, and sometimes it is a Type 2. While CMS guidance heavily implies that 17, 17.a and 17.b (specifics here) are meant to be a Type 1 provider (not an entity), your mileage may vary 🤷. If there are multiple claims filed for the same patient within a system or provider group, often the referring provider is simply the last provider who treated the patient.

Address standardization and normalization

Addresses are a difficult problem. Fundamentally, it is important to remember that at some point a person is inputting an address onto the claim. As you would expect, the inconsistency of input is reflective of that fact. The following street addresses are effectively the same:

* 123 N 15th Ave.
* 123 North 15th Avenue
* 123 15th Ave. N
* 123 N Fifteenth Ave.

If your analysis centers around finding volumes at a specific location, and you’re searching for 123 N 15th Ave., you will miss out on a significant amount of data where the address varies slightly. Further compounding this issue is the use of "vanity" addresses. It can be difficult to find official documentation on how "vanity" and "preferred" addresses come into existence, but it is not uncommon. Apple's headquarters in Cupertino is located at "1 Infinite Loop". Similarly, there are plenty of hospitals with an address that looks something like "1 Medical Center Drive"; however, not every supplier, provider and payer is necessarily going to use the "vanity" address. Even more confusing (and sometimes amusing) are "preferred" city names. For example: 5025 HARDING PIKE BELLE MEADE TN 37205-2801 is a valid address in Nashville, TN, and mail will be delivered correctly whether Belle Meade or Nashville is specified as the city. Here is a good reference on preferred city names. USPS actually calls this out as a problem within their own systems.

Is the address supplied relevant?

In short, maybe. Remember our study of the 1500 form above? Field 32.a (specifics here) is only filled out if the service facility NPI is different than the billing provider NPI. Pragmatically, this means that for the vast majority of professional claims the provider simply enters the billing address and the billing NPI, which is often not where the service was actually rendered. To give an example - a surgeon employed by an orthopedic group renders service at a local hospital. Someone who works for that surgeon is going to file a claim, which may or may not list the billing address and billing NPI of the orthopedic group instead of the facility address. Why, you may ask? The reality is that over the years providers have learned to include the minimal amount of information that will result in the claim being paid. If that information is incorrect but the claim is paid, there is simply no benefit to the orthopedic group to provide that information — just cost. This can be an even more pernicious problem with the “remote” procedures, labs, image interpretation, etc.

Provider / entity specific volume

OK. We’ve established that overall claims “completeness” is difficult to judge (see part 1). Let’s say, however, that you’re fairly confident that you have enough volume to be representative. You can just count procedures for specific NPIs, right? I think you know where this is going… No! Remember that claims are a reimbursement for services rendered, and can be more representative of the transaction of services than care delivered. For example: if I want to know how many patients are being seen at an imaging center, I might be tempted to simply count the number of procedures per day. This might give a misleading result, though: often times different “views” during imaging are billed as separate procedures, as are body parts (usually noted with modifiers). If you do a simple sum of procedures, you’ll end up with a much higher count than you are looking for.

Dates of service

Dates of Service on a claim are very important in almost any analysis. Generally, the dates of service that are recorded in 24.a (specifics here) are the most reliable. However, remember that it is a date range that is recorded. So if you have a procedure tied to a date range, there is very little to tell you on what date that procedure was actually performed. This isn't a huge problem — usually it's safe to just use the earlier date in the range.

Place of service codes

These are notoriously unreliable, to the point where I would advise caution relying on them for almost any analysis. This report from 2009 details the issue, and little has changed since then. The summary of findings has been copied below (emphasis mine):

“Physicians did not always correctly code nonfacility places of service on Part B claims submitted to and paid by Medicare contractors. Physicians correctly coded the claims for 17 of the 100 services that we sampled. However, physicians incorrectly coded the claims for 83 sampled services by using nonfacility place-of-service codes for services that were actually performed in hospital outpatient departments or ASCs. The incorrect coding resulted in overpayments totaling $2,979.”

Evaluation and management procedure codes

E&M codes are not specific to a service line. Let’s say you want to gain some insight into office visits for your surgeon group compared to other surgeon groups in your market. Sometimes those office visits might take place at a standalone building, but often they take place in either a multi-office building or a medical center. You look up the CPT codes for evaluation and management (99202 - 99499), and start querying your dataset. Oh no! Turns out most service lines use that same range of CPT codes for E&M visits, so your numbers don't make any sense at all. This is a very common use case and can be somewhat difficult to solve depending on the analysis. If you have a finite set of NPIs, perfect - you can also filter by the rendering provider.

Finally! Let’s analyze!

Now that we’ve covered some things you can’t necessarily rely on in claims data, let’s pivot to an analysis you CAN reliably perform. The preceding paragraphs aren’t meant to dissuade you from claims data. Like everything in healthcare, it’s complicated; as long as you understand the nature of the dataset you’re using, you can still extract an incredible amount of value.

Tony Miller (founder of Lemhi Ventures, Surest, etc.) would always say “Whenever we have an interaction with a customer, I want to know what they did five minutes before that interaction, and five minutes after that interaction”. If we take that advice to heart, one very useful way to leverage a claims dataset is to look at longitudinal claims for a patient or cohort of patients. That is, we look at what happened before they arrived at our facility, and we look at what happened after they left our facility.

The key in this sort of analysis is constructing the cohort of patients carefully. Maybe you are working with an orthopedics group, and they are trying to determine the percentage of patients that have already had physical therapy before arriving at your group. If you are confident in the set NPIs for your organization, you can collect a distinct list of patients that have had an interaction with your group, and the first date of service amongst those claims (per patient). You then can “look back” 60 or 90 days for specific physical therapy procedure evaluation codes CPTs: 97161–97163 for each patient. If you were doing this in SQL, it might look something like this:

-- create a table of patients that have had an interaction with 
-- the specified list of providers in 2022, 
-- and the date of first interaction 
CREATE TABLE patient_cohort AS 
SELECT
  patient_id,
  MIN(date_of_service) AS first_dos,
  MAX(date_of_service) AS last_dos 
  -- we'll use the MAX date in a little bit, 
  -- but it is useful to collect it now
FROM
  claims
WHERE
    rendering_physician_npi IN (1234567890, 0987654321, 1357913579, ...)
    AND date_part('YEAR', date_of_service) = 2022
GROUP BY
  patient_id;


-- for every patient in patient_cohort, 
-- collect all claims that occurred between 90 and 1 days 
-- prior to first_dos
CREATE TABLE patient_cohort_prior_pt_claims AS 
SELECT
    c.*
FROM
  claims c
  INNER JOIN patient_cohort pc ON c.patient_id = pc.patient_id 
WHERE
    c.date_of_service BETWEEN dateadd(day, pc.first_dos, -90) 
    AND dateadd(day, pc.first_dos, - 1)
    AND procedure_code in (97161, 97162, 97163)

Now we’re getting somewhere! We have a table that has physical therapy claims for patients before they arrived at our orthopedic group. We could mine this table for different things, but let’s just calculate the share of patients that have had physical therapy versus those that haven’t.

-- this will return the number of distinct patients in the cohort
SELECT 
    count(distinct patient_id) as total_patients
FROM
    patient_cohort


-- this will return the number of distinct patients who had PT evals
SELECT 
    count(distinct patient_id) as total_pt_patients
FROM
    patient_cohort_prior_pt_claims

-- total_pt_patients / total_patients = % of patients that had PT evals

Excellent! With that information in hand, we can compare our orthopedic group’s interactions with patients who did have prior physical therapy versus that who did not.

Now that we’ve gleaned some information about what is happening before patients arrive, let’s see what is happening after our interaction. Is there opportunity for better outcomes? Let’s look for patients in our cohort that had an emergency department visit within 30 days of their last interaction with us. CPT codes 99281-99285 represent the 5 levels of emergency department visits. Remember when we calculated the max(date_of_service) as last_dosearlier? We’ll use that now.

-- for every patient in patient_cohort, 
-- collect all claims that occurred between 1 and 30 days 
-- after the last_dos
CREATE TABLE patient_cohort_future_ed_claims AS 
SELECT
    c.*
FROM
  claims c
  INNER JOIN patient_cohort pc ON c.patient_id = pc.patient_id 
WHERE
    (c.date_of_service BETWEEN dateadd(day, pc.last_dos, 1)
    AND dateadd(day, pc.last_dos, 30))
    AND procedure_code in (99281, 99282, 99283, 99284, 99285)

Now we can focus on the group of patients that were admitted to an emergency department (ED) within 30 days of seeing our orthopedic group. You might be thinking, “So what?! What if they were involved with a burning jet-ski accident (V90.27) that had nothing to do with their knee replacement?” Well, your suspicions would be well founded. However, there isn’t a link in claims data that says “this claim is related to that other one”, so it is up to you to decide how to make that determination. Perhaps you know of common diagnoses that result in an ED admission after an orthopedic surgery and look for those specifically.

Let’s summarize what we just did.

Find the cohort of patients that our orthopedic group saw in 2022.
From that cohort, find all the patients that had physical therapy before coming to see our orthopedic group.
From that cohort, find all the patients that admitted to an ED within 30 days after seeing our orthopedic group.

Summary

Your head is probably swimming a bit from all the possibilities combined with all the caveats. That’s OK! Claims data is complex, and it takes time to get comfortable with it. The key is to understand the limitations of the data and to be careful about the conclusions you draw from it. In upcoming articles, we’ll explore what we do at Trilliant Health to systemically reduce the caveats and complexity in the data. Next up we’ll talk about the importance of a Provider Directory.

Happy analyzing!

Topics