Monday, October 6, 2008

Measurement in Clinical Practice and Research (Part I): Interview with Dr. Mary Rauktis

[Episode 43] Today’s podcast is the first in a two-part series on measurement for clinical practice and research. In today’s podcast I speak with Dr. Mary Rauktis about how she became interested in measurement; some key concepts needed to understand measurement, including reliability, validity, and error; and how to understand measures used in research articles.

In part two of the podcast we talk about the difference between measurement in the field and measurement in research settings. We talk about some of the ways that social workers can think about measurement as a tool to improve clinical practice, and some ways that social workers in the field can develop measures that will really benefit their clients. We talk about some of the challenges social workers have using measurement tools because of how rarely measures are integrated into social work courses. We talk about some ideas for how to better integrate measurement into social work education, particularly beyond the required research classes. We end Part II with a discussion of some resources for social workers interested in learning more about measurement.

Download MP3 [34:50]


Mary Elizabeth Rauktis, Ph.D., is an Assistant Professor of Research in the Child Welfare Research and Training Program. She is a 1993 graduate of the University of Pittsburgh School of Social Work.

Prior to her appointment at the University of Pittsburgh, she was the Director of Research and Evaluation at Pressley Ridge, an international nonprofit provider of services to children and families. She was an adjunct at the University of Pittsburgh School of Social Work and at the Robert Morris University business school (nonprofit management), and a visiting professor at the University of Minho, Institute of Child and Family Studies, in Portugal.

Contact Information
The University of Pittsburgh
Room 2326 Cathedral of Learning
Pittsburgh PA 15260
Office: 412.648.1225
Cell: 412.716.9061
Fax: 412.624.1159


Hello and welcome. You’ve found the Social Work Podcast. My name’s Jonathan Singer and I’ll be your host as we explore all things social work.

Today’s podcast is the first in a two-part series with Mary Rauktis about measurement. In today’s podcast we talk a little bit about how Mary got interested in measurement, the purposes of measurement, and some of the key concepts, including reliability, validity, and error. We also talk about what to look for in a research article with regard to measurement.

In the second part of our interview we talk about the difference between measurement in the field and measurement in research settings. We talk about some of the challenges social workers have using measurement because of training, and we talk about some ideas that might be useful for schools of social work that are interested in incorporating measurement into classes other than research. We also talk about some resources that social workers can follow up on. We end with some ideas on how social workers can develop their own measures and think about measurement as a tool for improving their own practice in the field.

Without further ado, on to the interview with Mary about measurement.


Jonathan Singer: So, Mary thanks so much for being here today and talking with us about measurement. And the first question is – why are you interested in measurement?

Mary Rauktis: Well, Jonathan, I became interested in measurement, I believe, through my earlier work as a nurse. You learn very quickly in nursing that clear communication is vital. And much of the way that you communicate is through measurement. Some think of it as objectifying patients, but in fact I believe it’s nursing language, in which we talk about someone in terms of measurement – their status. And so that probably started my interest in measurement. And then I was fortunate to study here at Pitt with Gary Koeske and work with him on his measure of burnout for social workers. And that continued my interest in measurement. And without really realizing that my career was moving in this direction, I began to work with some other people – John Lyons from Northwestern University on measuring child and adolescent needs and strengths, and some people from Vanderbilt on measuring alliance – and somewhere along the line I found myself knowing more about measurement than I probably ever thought I would.

Jonathan Singer: What’s the purpose of measurement?

Mary Rauktis: Well, the purpose of measurement is to collect information about a concept in a way that is valid and reliable. And in our work in social work, again, as we communicate with each other, it is really around these concepts – depression, self-esteem, something like the restrictiveness of a living setting for an adolescent. We use these terms with each other, and we sometimes need to operationalize them so that we really understand what we’re talking about. And so, measurement is really an entire process by which we take a concept, we begin to operationalize it, and then begin to collect information about it in a way that we can collect information about it again and find the same results. And more importantly, we know that what we’re collecting is really what it is supposed to be.

Jonathan Singer: And when you say “operationalize,” what does that mean?

Mary Rauktis: Well, what it means is, and I’ll actually, if you don’t mind, give you an example.

Jonathan Singer: Yeah, that would be great.

Mary Rauktis: Operationalizing means several things. One is that you need to come up with a conceptual definition, which is very important. People sort of skip over the definition part and get to the numbers section right away. But in fact, you need to think about: what is a definition? So that when I say “restriction,” you say, “oh, you mean the way adults in a child’s life structure the settings for their safety and their developmental needs.” So the first piece of operationalizing is coming up with a definition.

Jonathan Singer: And I could see why that would be important because you get two people in the room and they see restriction as very different things. And what you’re doing is you’re saying, “this is what we mean when we say restriction and so this is then how we’re going to measure it”.

Mary Rauktis: Exactly. We think of restriction as a way in which, again, someone in the child’s life is limiting who they see, what they do, and where they go, for treatment reasons but also for developmental reasons. For example, Jonathan, you have, what, a six-month-old daughter, an eight-month-old daughter?

Jonathan Singer: Yep

Mary Rauktis: Okay, well, you know she doesn’t have the run of the house.

Jonathan Singer: No, no.

Mary Rauktis: Of course not, you know, she’s crawling. You have to restrict her environment in some way, for her developmental needs, for her safety. I have to restrict the environment of my daughter, who is twelve, in a different way for her developmental needs. So, we needed to come up with an operational definition so that we understood, as clinicians, when we talk about one environment, a residential treatment program, being more restrictive than living at home, what that meant. There was a definition. Then from the definition we needed to think about, okay, what are the ways adults restrict the environment for developmental, safety, and treatment reasons? And that led to, well, they restrict the environment in terms of where the child can go, what they can do, who they can be with. Okay, so then that narrows it down a little bit more. You go from this rather broad definition to something like, alright, restriction around these domains. Did we come up with this out of nowhere? No. We went to the literature to see what research has told us about restriction for adults and for children. Alright, so we had these broad areas. How do we take this down even more? Do I ask you as a clinician, well, give it a number, you know, one to ten, how restrictive it is? Or do I come up with discrete items that people then begin to assign a number to? And that was a big decision, because originally the ROLES was an ordered list of settings from least restrictive to most, based on expert opinion. And we were taking a rather radically different approach to it. In terms of our reconceptualization of it, we felt that in fact it should be empirically based. Again, you are looking to the people who do the work to provide you with the information. But you are not asking them to rank the settings but in fact to describe the settings in numeric ways.
For example, where you can go in the community is anchored between a one and a five, with a one being that there are no restrictions on where you can go in the community and a five being that you are not allowed to go anywhere in the community. And then two, three, and four were numbers that represented something else. I have to tell you, this was very difficult to do. It required a lot of work, a lot of going to experts; it was a three-year process, actually. So, my advice to anyone out there who is thinking about creating a new measure is: are you sure you want to devote several years of your life to it? And so, that is a rather long answer to your question about operationalizing: a conceptual definition, coming up with general content areas, and then taking it down even further, in this case using numbers to define various aspects.

Jonathan Singer: Okay, so operationalizing is having this concept and then narrowing it down and then assigning some numbers, but it’s not just randomly assigning some numbers. It’s based on, you said, “empirical,” not just expert, and by that you mean that you actually collected data and you compared this to that, not just relying on folks out in the field.

Mary Rauktis: Yes, exactly. Early on in the process, we went to experts, people who knew about this work, and asked them some questions about a general approach, gave them our definition, and got their idea of whether we were heading in the right direction. Then we drafted an early draft and sent it back to them for their comment. And then we had this basis to go on, and we went to people in the field and asked them to react to it: this is our general definition of the ways in which the environment is restricted – do you think we are tapping into the right areas? And then we got some additional comments. Then what we did was something that I’m going to talk about a little bit now: cognitive testing. Cognitive testing is a process, and it’s great when you’re creating a new measure. What we did is we drafted it (by this time we were at around four) and then we asked people – I’d ask you, Jonathan, “Okay, you completed this for a typical child in your care. And you wrote number two for television. Could you tell me what you were thinking when you selected that answer? How did you choose a two rather than a one or a three? How well does this answer fit what is true for this child in terms of their television viewing?” And so we were trying to tap into what people were thinking when they were filling this out. Which is very important, because it’s sort of the black box, if you will: when people look at a question and they’re trying to select the answer, they’re operationalizing this concept as well. It’s very helpful to know what they’re thinking as they’re filling that out. And we actually found the cognitive testing to be an excellent use of time. It helped us to improve not only the items and the responses – it also helped us to improve the instructions.
For example, the community-based people were really confused about the purpose because they would say, “well, I don’t control his television watching, that’s his parent.” Well, it’s not about who controls it, it’s about what is. So, we had to make sure we were very clear in our instructions. We also found the salience factor: sometimes people were thinking about the worst kid when they were filling it out, or the kid they had just seen, as opposed to the typical child. So, it wasn’t perfect, but it really gave us a window into people’s thinking as they were taking this construct themselves and thinking it through. It was time very well spent that led to more refining. When social workers are reading articles about any study, not just a measurement study, it’s wonderful if people talk about the process of operationalizing, but they don’t always. In a measurement article, they will. If you’re reading a study about, perhaps, using cognitive behavior therapy in a certain setting, they may mention the measure that they used, or the tool (there are several words that are often used), but they won’t always tell you how they operationalized it, so that’s one of the things you want to look for, particularly in a measurement article. What I think is the gold standard, and some people may disagree, this is my opinion: you need a conceptual definition from the very beginning. There should be some expert review. There should be review by certain groups. We, for example, had youth who had transitioned out of foster care, foster care alumni basically, look at this, as well as experts in race, culture, and gender.

Jonathan Singer: If you’re reading a journal article, it’s one thing to know, oh, they used the ROLES; it’s another thing to understand exactly what the ROLES is measuring. And it could be that the way I define restriction would be different than the way that the ROLES defines restriction. But unless I know that, I won’t know how to interpret what they find in their research. And that’s one of the important things to know about that.

Mary Rauktis: Yes, yes. Now, I’m not suggesting that every time you read a research article you go back three citations to try to find out about this. But if you’re interested, or if you have some concerns or questions about what they found, I would suggest that you do that: go back to see the instrument that they used to measure whatever concept they were looking at, whatever the dependent or independent variables are. If you have some questions or interest, go back to the source articles. It’s sort of a bit of a detective trail, but I think it can tell you a lot. Remember, these are human beings; you know, this is a group of four or five of us that came together to work on this. We aren’t the experts necessarily, so it was influenced also by our backgrounds and our thinking. So, there’s certainly that aspect. I think the other question you asked me was about what is important for social workers to know. And oftentimes people throw out the terms “reliability” and “validity” a lot.

Jonathan Singer: Yeah, those are biggies.

Mary Rauktis: Yes, they are. And I’ve often had people say to me, particularly with a Likert measure, “well, that’s not reliable” or “how valid is that?” When I’ve worked in practice settings, clinicians would question me about that. And those are important concepts to understand, and I’ll do my best in a short time to explain them quickly, but there are some excellent texts. I know that there was an earlier podcast by Dr. Rubin, correct?

Jonathan Singer: Yes, that’s correct. Allen Rubin, the author of the Rubin and Babbie research text.

Mary Rauktis: Which is a classic text I use in my foundations class, and they do an excellent job, I believe, in the measurement area. Plenty of examples. So, I highly recommend those chapters on measurement. But there are also some other great books about measurement that I can talk about a little bit more if you’re looking for particular measures. There are two concepts: reliability and validity. Let me start with validity, because I think in some ways it’s probably the most critical. Are you really measuring this concept, whatever it is: depression, self-esteem, once again restriction, child strengths, therapeutic alliance – are you really measuring it? Or are you measuring something like it, but not quite it? And so, I like to think of validity, because you know I’m a lumper... There are lumpers and splitters, and I tend to be a little bit of a lumper. So, I think of it in sort of three large lumps. One is the whole area of face validity, which sometimes people downplay, but it is critical for clinicians because it is very difficult to ask clinicians to do a measure that does not have face validity for them. They will not do it, or they’ll lose it, or they’ll find ways of not finding the time to do it. So, it’s very important. How does it look to people? Does it look like it’s tapping into restriction of environment or therapeutic alliance with a client? That can be done by people looking at it, or you can ask a group of experts to give you their opinion. That’s sort of the lowest level of validity. It’s important though, it’s very important. And then you have the idea of criterion validity, which is sort of the next level: how well does this compare to some sort of external criterion? Does it predict something? Or is it close to something that’s current?
So, if you’re looking at criterion validity for a test, for example, how well does it predict that student’s first-year grade point average in college? The SATs are the typical example. That’s important. Again, it’s taking it a little bit closer. Or how well does it distinguish something from an external criterion?

Jonathan Singer: Since we were talking about restriction before, what would be an example of criterion validity for the restriction measure that you were talking about?

Mary Rauktis: Normalization.

Jonathan Singer: Normalization, yeah, okay talk to me about that.

Mary Rauktis: And this came up again in this marathon process of working on this measure: we found that, okay, restriction is similar to normalization but not the same. Normalization, and again I’m trying to think of a good conceptual definition, is the ways in which an environment is typical of a home environment. How normal is your environment or your school environment compared to what is typical? But I believe it’s different than restriction: normalization is an element, perhaps, of restriction, but restriction is bigger. So, I think of restriction of environment as being the larger construct, with normalization being an aspect of it. So, for example, in testing this new measure of restriction, which by the way we called the REM-Y, which we probably need to come up with a new name for at some point.

Jonathan Singer: The REM-Y?

Mary Rauktis: The REM-Y! Yes. I didn’t say we were too good at names. Restriction Environmental Measure for Youth. What we would do is, for example, I would perhaps go to a group home that you are part of, and I would ask you to fill out the REM-Y and another measure, perhaps, that taps into just normalization. And I would look to see how they correlate. They probably will correlate. As you see an increase in restriction, you may see a decrease in the normalization score. We’ve not done that, to be honest. Because it’s still such a new measure, we have the face and the content validity, but we haven’t started to do the criterion validity. But that would be, I believe, the first place we would start. So, we want to see if it’s like normalization, if it correlates, and how well it correlates. If there is no correlation at all, then I would be a little concerned, because I believe that normalization is a piece of restriction, but not the whole thing.
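
A minimal sketch of the check Mary describes here, assuming each group home has one total score on the REM-Y and one on a separate normalization measure. The scores are made up for illustration (she notes that this comparison had not yet been done), and `pearson_r` is a hypothetical helper name, not part of either measure.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squared deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when all scores are identical")
    return cov / sqrt(var_x * var_y)


# Invented scores for six group homes (illustration only, not real data).
restriction = [12, 18, 25, 31, 36, 40]    # higher = more restrictive
normalization = [44, 40, 33, 29, 21, 18]  # higher = more home-like

r = pearson_r(restriction, normalization)
print(f"correlation between restriction and normalization: {r:.2f}")
```

With these invented numbers the correlation is strongly negative, which is the pattern she expects: more restriction goes with less normalization. A correlation near zero would be the result that prompts a closer look.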

Jonathan Singer: So, the criterion validity requires you to have an idea of what it’s related to and if it’s not, then you have to find out why.

Mary Rauktis: Exactly. And is it because of some sort of error? And once again I refer you to Rubin and Babbie’s excellent book, but error can be systematic error, or it can be random error. And my experience has been that when you have a problem with criterion validity, it may be because of systematic error. That is, there is something about the question that people are answering differently in a systematic way, as opposed to some sort of random goofiness, because perhaps they don’t want to admit that their environment is that restrictive, and so there’s a social desirability bias happening. And so, they’re making their environment look really normal when it isn’t. So, is there some sort of bias happening here, or is it being answered in a different way because of cultural reasons? So, oftentimes, if your validity is not looking the way you might like it to, you might want to go back and look at sources of systematic error. And I typically look for social desirability, and I look for some sort of cultural bias that might be happening.

Jonathan Singer: So, you mentioned three types of validity: face, criterion…

Mary Rauktis: And the last is construct.

Jonathan Singer: Construct validity…

Mary Rauktis: Right. Which is sort of the highest level. It’s what you aspire to. And oftentimes it takes measures many, perhaps many, years of being used in order to achieve this. And this is the sense that what you’re doing is you really are measuring whatever this concept is.

Jonathan Singer: So, for example I will just throw something out. So, you would have depression, right. Which is a construct, right? You can’t point to something and say, “oh look there’s depression sitting over there at the curb,” right? So, this construct validity is to say that really is depression and it’s not fear.

Mary Rauktis: It’s not anxiety, it’s not fear. Right.

Jonathan Singer: It is actually depression.

Mary Rauktis: And it often takes time for people to use... Say you have this new measure of depression. It has some face validity; clinicians like it, they use it. It seems to be able to predict when people are depressed. It seems to discriminate when people are depressed versus anxious. And then it gets used by lots of people. And then you begin to build this body of research around using this measure, and you find that it really begins to discriminate when people are very depressed or mildly depressed. And if you use certain treatments, it is very sensitive to change. And basically, construct validity is building this whole sort of conceptual universe. Boy, that sounds sort of vague.

Jonathan Singer: But it sounds like you’re saying it builds up enough evidence to say that in this situation and in this and in this – we are able to say that the way we’ve defined depression, it fits.

Mary Rauktis: Exactly.

Jonathan Singer: Okay.

Mary Rauktis: And so, as you can see, that takes time. It’s not something that happens the first year that something is out there being used. It can take several years, sometimes a decade or more. But there are several tools or measures or instruments that people think of as really having reached this level. In the children’s field I think of the CAFAS – the Child and Adolescent Functional Assessment Scale, Kay Hodges’ work. Certainly, the Beck Depression Scale for adults. The SCL-90, the symptoms checklist. These have been out a long time. They have been used in many, many studies. They are reported in the literature. So, it takes time to do that. It’s what you aspire to in any measurement. And it’s what you should look for when you’re reading an article: how it’s been used in past studies, how it’s discriminated and how it’s not, how predictive it is, how it’s shown change, whether it’s been used in evaluating outcomes. That’s why people say to you, don’t create a new tool or a new measure. It’s very creative and a lot of fun to come up with something new. It’s great. It actually is a lot of fun. But what you have is something that is brand new. You really don’t know for sure what it is you’re measuring, because you need more evidence, and one study isn’t enough. And so that’s why we always suggest you look for what’s out there first before you go and create something new.

Jonathan Singer: There are a couple of things that you brought up that I want to touch on. One of them is what you just said, is that you go out and look for existing measures. And I want to ask you about where you would look for existing measures. You also used a couple of words: measurement, measure, instrument, survey and I don’t know if there is a difference between those or if they’re used interchangeably. And also, I want to give you the opportunity to talk about reliability. Which is something we haven’t talked about yet.

Mary Rauktis: Well, let me address reliability, because it’s sort of part of this whole. Reliability is really about whether, if you did it again, it would be consistent. It’s really around consistency. And so, if I give you this measure of depression and we do it again, is it similar? We wouldn’t expect that in a day you would change that drastically. So, are you scoring roughly in the same area? How reliable is it? Now, sometimes reliability can be interrater reliability. And that’s consistency too. So, if I’m observing you in a setting and I’m scoring you on your ability to follow directions, it would be good to know if someone else is seeing the same thing. And so oftentimes you’ll have what we call interrater reliability: there’s agreement that what we’re seeing is actually happening, that we are seeing Jonathan following directions in a group home. Reliability can also be about consistency within a set of questions. Oftentimes multi-item scales, scales about self-esteem, depression, again many concepts, will have several items that are basically asking the same thing. And you want the answers to be similar. So, in other words, if I’m asking you if you have energy and you say, no, I don’t, and then on another question, “when I wake up in the morning I feel like I can go out and conquer the world,” you said yes, absolutely, then I would say there might be some reliability problems within that scale, because you’re responding one way to one item and then in a very different way on a similar item. So, you can measure reliability, and that’s done by statistically looking at how well items intercorrelate with each other. And so, if I have a multi-item scale on depression, self-esteem, or alliance, for example, and I have four questions on alliance that are asking in similar ways about mutual goal setting, I want to make sure that people are answering those four questions very similarly.
So, reliability is very important to look at. And oftentimes in journal articles it will be mentioned as an alpha coefficient. An alpha coefficient of point-something; it’s usually less than one. And something in an article that is somewhere in the nineties or eighties range is, generally speaking, a good range. If it’s less than that, then it may be the result of some items that need to leave, or perhaps there are too few items and you need to add more. So, that’s an important thing to look at as you read a journal article. Now, you asked me the other question. So, reliability is very important. And reliability and validity are really important. And you always want to look for those as you’re reading an article. I’ll give you a very short example, though, since I mentioned face validity, of why it’s so important. I was once working with a group of both parents and professionals on measuring children’s strengths and needs. And I found something called the Child and Adolescent Needs and Strengths, which I thought was great and would really work. And the parents looked at it and they said, this isn’t it. We don’t like how it is set up. It’s one more thing done to us. We need to change this. And so we did. We worked with John Lyons, who is the creator of it, and we came up with a parent-friendly version that became a conversation as well as something measurement related. And then the professionals looked at it and said, this doesn’t look valid or reliable to us. This doesn’t look like anything we’ve ever used before. And so that’s the perfect example of how you really need to pay attention to that. Again, if you’re introducing something into the workplace and clinicians, family members, or youth look at it and say, this just doesn’t look right to me, then you have a problem. And so it’s an important thing to keep in mind, very much so in the practice area. So, validity and reliability.
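
The alpha coefficient reported in journal articles is usually Cronbach's alpha, which compares the variance of the individual items with the variance of the total score. Below is a minimal sketch in Python, assuming a four-item scale scored 1 to 5. The responses are invented for illustration, and `cronbach_alpha` is a hypothetical helper name.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    if len(responses) < 2:
        raise ValueError("need at least two respondents")
    k = len(responses[0])
    if k < 2 or any(len(r) != k for r in responses):
        raise ValueError("every respondent needs the same number of items (2+)")
    # Transpose so each tuple holds all respondents' answers to one item.
    items = list(zip(*responses))
    item_variance_sum = sum(variance(item) for item in items)
    total_variance = variance([sum(r) for r in responses])
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


# Invented answers from five respondents to four similar items (1 to 5 scale).
responses = [
    [1, 2, 1, 2],
    [2, 2, 3, 2],
    [3, 3, 3, 4],
    [4, 5, 4, 4],
    [5, 4, 5, 5],
]

alpha = cronbach_alpha(responses)
print(f"alpha = {alpha:.2f}")
```

Because each invented respondent answers the four items in a similar way, alpha lands in the nineties range described above. If respondents answered similar items in very different ways, the item variances would make up a larger share of the total variance and alpha would drop.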

Jonathan Singer: Okay, that’s great. And so, this question about instruments, measures, questionnaires: are they the same thing? Are they different?

Mary Rauktis: Yeah, the language is really confusing, and I don’t know if I can shed a great deal of clarity on it. This is my view of it: you have something that’s called a survey, and when I think of a survey, I think of it as several measures. I believe measures are discrete tools that are systematically operationalizing some sort of concept. But on a survey, you could have several of those as well as demographic information. So, a survey is a…

Jonathan Singer: It’s a collection.

Mary Rauktis: It’s a collection, that’s a really great term. It’s a collection of measures as well as, perhaps, other things. So, that’s a survey. I often think of measures and tools interchangeably, and so I don’t try to get too caught up in that. Instrument, I suppose, again being a lumper, I think of as being in the measurement, tool, instrument lump.

Jonathan Singer: Okay, well and I think that it’s likely there are some people out there who could say that they’re different. But for social workers who are in the field or even researchers who are looking for something, it sounds like there’s not a generally agreed upon distinction that people need to know about.

Mary Rauktis: No.

Jonathan Singer: Okay, were there other key concepts that folks need to know about to understand what measurement is?

Mary Rauktis: Well, I did mention error. And I mentioned systematic error. I’d also like to talk about random error.

Jonathan Singer: Okay.

Mary Rauktis: Random error I sometimes call goofiness. With random error, people aren’t systematically answering a question incorrectly because of bias or social desirability or cultural misunderstanding; you just can’t predict it – error is happening. It’s the unpredictable, and that’s what can really interfere with reliability, the consistency. And so, there are ways to statistically measure error, and that does happen. But I feel that the onus is on the person creating the measure, tool, or instrument to try to reduce this random error. And you do this by what I described earlier, the cognitive interviewing: trying to find out if people are answering questions differently or inconsistently, because you are responding to something differently than me, and so you answered that question differently because there is something in the question that’s triggering it. So, I think error is very important. It’s important that people know that there’s error. You can control for it in a statistical manner, but you can also control for it in the measurement piece, by careful testing ahead of time to see where error might be introduced systematically or randomly. There’s always going to be someone who has a bad day and picks up your survey and does the scales, and the next day perhaps they get a better night’s sleep and they do it differently. You can’t control for that, but you can do your best. People who develop measures should do their best to test for it, to see at least if there is an instruction or an item that people are responding to differentially, and what they can do to change that. So, I think those are the important concepts: reliability, validity, and error.
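
The difference between the two kinds of error can be illustrated with a small simulation. This is a sketch under invented assumptions: 1,000 hypothetical respondents with a known "true" restriction score, random error modeled as noise that averages out, and systematic error modeled as everyone under-reporting by about 3 points (for example, because of social desirability). None of these numbers come from the REM-Y.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical "true" restriction scores for 1,000 respondents.
true_scores = [rng.uniform(10, 40) for _ in range(1000)]

# Random error only: unpredictable noise centered on zero.
random_only = [t + rng.gauss(0, 2) for t in true_scores]

# Systematic error: a consistent 3-point under-report, plus the same kind of noise.
biased = [t - 3 + rng.gauss(0, 2) for t in true_scores]

# Average gap between what was reported and the true score.
random_shift = mean(o - t for o, t in zip(random_only, true_scores))
biased_shift = mean(o - t for o, t in zip(biased, true_scores))

print(f"average shift with random error only: {random_shift:+.2f}")
print(f"average shift with systematic error:  {biased_shift:+.2f}")
```

Random error makes individual scores less consistent but leaves the average close to the truth, which is why it mainly threatens reliability. Systematic error pushes every score in the same direction, so the measure can look consistent while still being off, which is why it threatens validity.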

Jonathan Singer: I’m Jonathan Singer and thanks for being with me today for another episode of the Social Work Podcast. If you missed an episode or have suggestions for future episodes, please visit the Social Work Podcast website. If you like what you hear, please leave a review on Apple Podcasts. To join the global community of Social Work Podcast listeners, please follow us on Twitter and on Facebook. To all the social workers out there, keep up the good work. We’ll see ya next time at the Social Work Podcast.

End Transcript.
Transcript generously provided by Sara Sparks, Master of Social Work Student, Auburn University

References and Resources
  • Fischer, J., & Corcoran, K. (2006). Measures for clinical practice and research: Volume 1: Couples, families and children (4th ed.). New York: Oxford University Press.

  • Lyons, J., Howard, K., O’Mahoney, M., & Lish, J. (1997). The measurement and management of clinical outcomes in mental health. NJ: Wiley Publishing Company.

  • Maruish, M. E. (2004). The use of psychological testing for treatment planning and outcomes assessment: Instruments for children and adolescents (3rd ed., volume 2). New Jersey: Lawrence Erlbaum Associates.

APA (6th ed) citation for this podcast:

Singer, J. B. (Producer). (2008, October 6). Measurement in clinical practice and research (Part I): Interview with Dr. Mary Rauktis [Episode 43]. Social Work Podcast [Audio podcast]. Retrieved from
