How to do case insensitive counts of hashtags or other texts…
In digital methods research you may sometimes end up with a list of hashtags where small variations in case and capitalisation are counted as separate items, such as in the following table:
date | item | value |
---|---|---|
2021 | ESEAHM2021 | 179 |
2021 | ESEAhm2021 | 46 |
2021 | eseahm2021 | 44 |
2021 | ESEA | 16 |
2021 | Earth2Air | 13 |
2021 | ESEAHM | 13 |
2021 | ESEAHeritageMonth | 11 |
2021 | NIW2021 | 10 |
2021 | TheMSGpod | 8 |
2021 | MoongateMix | 8 |
This recipe explores simple approaches to changing cases.
There are many ways of changing cases of items in datasets with different tools and scripts.
Here’s one simple approach using spreadsheets…
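- Add a new column next to your hashtags and give it a name (e.g. `lower` if you will change hashtags to lowercase).
- In the first data cell of the new column, type `=LOWER()` and add the reference for the first item between the brackets. For example, if the new cell is `C2` and your original case sensitive hashtag is in `B2`, then you can type `=LOWER(B2)` in cell `C2`.
- Copy the formula down the column so that every hashtag gets a lowercase version.

As a simple alternative to the approach above, you can use OpenRefine…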
There are many other ways to do this (e.g. using pandas).
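As one scripted alternative, here is a minimal sketch in plain Python, using only the standard library rather than pandas. It reads a few sample rows copied from the first table and adds a `lower` column. The `RAW` sample and the `add_lower_column` helper are illustrative names, not part of any tool mentioned in this recipe; in practice you would read your own CSV file instead.

```python
import csv
import io

# Sample rows copied from the first table in this recipe.
# Assumption: your real data lives in a CSV file with the same column names.
RAW = """date,item,value
2021,ESEAHM2021,179
2021,ESEAhm2021,46
2021,eseahm2021,44
2021,ESEA,16
"""


def add_lower_column(rows):
    """Return copies of the rows with an extra `lower` column.

    The new column holds the lowercased version of `item`, which is
    what =LOWER() does in the spreadsheet approach.
    """
    return [{**row, "lower": row["item"].lower()} for row in rows]


rows = list(csv.DictReader(io.StringIO(RAW)))
rows_with_lower = add_lower_column(rows)

for row in rows_with_lower:
    print(row["date"], row["item"], row["lower"], row["value"])
```

If your hashtags include non-English scripts, Python's `str.casefold()` may be a better fit than `str.lower()`, as it applies more aggressive case folding.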
Now you should have a table something like the following:
date | item | lower | value |
---|---|---|---|
2021 | ESEAHM2021 | eseahm2021 | 179 |
2021 | ESEAhm2021 | eseahm2021 | 46 |
2021 | eseahm2021 | eseahm2021 | 44 |
2021 | ESEA | esea | 16 |
2021 | Earth2Air | earth2air | 13 |
2021 | ESEAHM | eseahm | 13 |
2021 | ESEAHeritageMonth | eseaheritagemonth | 11 |
2021 | NIW2021 | niw2021 | 10 |
2021 | TheMSGpod | themsgpod | 8 |
2021 | MoongateMix | moongatemix | 8 |
You’ll see that the first three hashtags are still separate. How can we combine these?
To recount the new column you can:
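- Create a pivot table from your data.
- Add the new column (in this case `lower`) for the rows.
- Add the column with counts (in this case `value`) for the values.
- Summarise the values by `SUM` of the column with counts in (in this case `value`).

If you have a `date` column and multiple different years for the hashtags, you can also add the date to the pivot table to ensure that multiple years are not merged together.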
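The same recount can be sketched in plain Python, as a rough stand-in for the pivot table. The sketch sums `value` for each combination of `date` and lowercased hashtag, so different years stay separate. The `ROWS` sample and the `case_insensitive_counts` helper are illustrative names only, and the rows are a small excerpt of the first table.

```python
from collections import defaultdict

# Rows as (date, item, value), copied from the first table in this recipe.
ROWS = [
    ("2021", "ESEAHM2021", 179),
    ("2021", "ESEAhm2021", 46),
    ("2021", "eseahm2021", 44),
    ("2021", "ESEA", 16),
    ("2021", "Earth2Air", 13),
    ("2021", "ESEAHM", 13),
]


def case_insensitive_counts(rows):
    """Sum values per (date, lowercased item), like a pivot table with SUM."""
    totals = defaultdict(int)
    for date, item, value in rows:
        # Keying on the date as well keeps different years apart.
        totals[(date, item.lower())] += value
    return dict(totals)


totals = case_insensitive_counts(ROWS)

# Print the largest totals first.
for (date, lower), total in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(date, lower, total)
```

With this excerpt, the three spellings of `eseahm2021` combine to 179 + 46 + 44 = 269. Totals computed from an excerpt need not match totals computed from a full dataset, which may contain further case variants with smaller counts.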
You should now have a table with case insensitive counts of your hashtags, which you can export into a new csv file for further analysis. 🎊
Note the difference in counts between this new table and the tables above (which is why case insensitive counts can matter for analysis)! 😲
date | lower | SUM of value |
---|---|---|
2021 | eseahm2021 | 270 |
2021 | esea | 18 |
2021 | eseahm | 17 |
2021 | earth2air | 13 |
2021 | eseaheritagemonth | 12 |
2021 | niw2021 | 10 |
2021 | beseakidlit | 9 |
2021 | themsgpod | 8 |
2021 | moongatemix | 8 |
2021 | londonpodfest | 7 |