A recipe to extract a subset of tweets based on one or more hashtags using OpenRefine
This recipe presents a method for exporting small subsets of a full Twitter dataset based on a selection of hashtags. It shows how to query the dataset with a single hashtag, or with two or more hashtags combined with basic Boolean operators (OR, AND). The resulting selection of tweets can then be analysed in spreadsheet software.
This recipe starts from a full Twitter dataset exported from the Twitter Capture and Analysis Toolset (DMI-TCAT). It can, however, be used with any Twitter data, as long as there is one column with the tweet text and one column with the hashtags (all hashtags of a tweet in the same cell, separated by semicolons).
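To make the expected input shape concrete, here is a minimal Python sketch of such a file. The column names `id`, `text` and `hashtags`, the sample rows, and the helper `split_hashtags` are assumptions made for illustration; a real DMI-TCAT export may name or order its columns differently.

```python
import csv
import io

# Hypothetical two-row sample; the column names are assumptions,
# not necessarily those of a real DMI-TCAT export.
SAMPLE_CSV = """id,text,hashtags
1,Save the whales #greenpeace #oceans,greenpeace;oceans
2,Good morning everyone,
"""


def split_hashtags(cell):
    """Turn a semicolon-separated cell into a list of lower-cased hashtags."""
    return [tag.strip().lower() for tag in cell.split(";") if tag.strip()]


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
for row in rows:
    # Tweets without hashtags simply yield an empty list.
    print(row["id"], split_hashtags(row["hashtags"]))
```

If your own file keeps the hashtags in a differently named column, or uses another separator, the filters described in the steps still apply, but the column to filter on and the separator will differ.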
📃 Steps
Installing OpenRefine
- Download the latest version of OpenRefine from this link
- Follow the installation instructions for your OS (Mac or Windows)
- Refer to the official documentation to troubleshoot any installation errors
- Once installed, double-click the OpenRefine icon
- OpenRefine opens directly in your browser
- If the browser does not open, you can type this URL in your browser bar
Opening the full dataset with OpenRefine
- Select the file from your computer and press [next]
- Click [create project] in the top right corner
Creating a subset based on 1 hashtag
This step can be used to select tweets containing one specific hashtag, for example to extract all the tweets with #greenpeace.
- On the header of the column containing the hashtags, click the small arrow next to the column name, and choose [Text filter]
- In the [hashtags] panel in the top left corner, type the hashtag you want to filter on (e.g. greenpeace)
- You can make the filter case-sensitive by checking the box at the bottom of the panel
- Note that above the table you can see how many tweets match your criteria (in this case, 80 tweets)
- Click [Export] in the top right corner to export a CSV or Excel file
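For readers who want to check the result outside OpenRefine, the single-hashtag filter can be approximated in Python. This is only a sketch: it assumes the text filter behaves as a substring match that ignores case unless the case-sensitive box is checked, and the function name `text_filter` and the sample rows are hypothetical.

```python
def text_filter(rows, column, query, case_sensitive=False):
    """Keep rows whose `column` contains `query` as a substring."""
    if not case_sensitive:
        query = query.lower()
    kept = []
    for row in rows:
        value = row[column] if case_sensitive else row[column].lower()
        if query in value:
            kept.append(row)
    return kept


# Hypothetical sample rows; "hashtags" is an assumed column name.
tweets = [
    {"id": "1", "hashtags": "Greenpeace;oceans"},
    {"id": "2", "hashtags": "climate"},
    {"id": "3", "hashtags": "greenpeaceuk"},
]

matches = text_filter(tweets, "hashtags", "greenpeace")
print(len(matches), "matching tweets")
```

Because the match is on substrings, a query for greenpeace also keeps a tweet tagged greenpeaceuk in this sketch, so it is worth scanning the filtered rows before exporting.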
Creating a subset based on two or more hashtags (OR)
This step can be used to filter tweets based on more than one hashtag. The resulting subset contains all tweets mentioning at least one of the selected hashtags, for example all the tweets with #greenpeace or #extinctionrebellion.
- On the header of the column containing the hashtags, click the small arrow next to the column name, and choose [Text filter]
- In the [hashtags] panel in the top left corner, check the [regular expression] option
- In the same panel, type the hashtags separated by a pipe character (|), which means "or" in a regular expression:
greenpeace|extinctionrebellion
- You can make the filter case-sensitive by checking the box at the bottom of the panel
- Note that above the table you can see how many tweets match your criteria (in this case, 478 tweets)
- Click [Export] in the top right corner to export a CSV or Excel file
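The OR query above can be mirrored with Python's `re` module as a rough cross-check. This is a sketch under assumptions: OpenRefine evaluates regular expressions with its own engine, but for a plain alternation such as this one the behaviour should be the same. The function name `regex_filter` and the sample rows are hypothetical.

```python
import re


def regex_filter(rows, column, pattern, case_sensitive=False):
    """Keep rows whose `column` matches `pattern` anywhere in the cell."""
    flags = 0 if case_sensitive else re.IGNORECASE
    compiled = re.compile(pattern, flags)
    return [row for row in rows if compiled.search(row[column])]


# Hypothetical sample rows; "hashtags" is an assumed column name.
tweets = [
    {"id": "1", "hashtags": "greenpeace;oceans"},
    {"id": "2", "hashtags": "ExtinctionRebellion"},
    {"id": "3", "hashtags": "climate"},
    {"id": "4", "hashtags": "greenpeace;extinctionrebellion"},
]

# The pipe means "or": a row is kept if either hashtag appears.
either = regex_filter(tweets, "hashtags", "greenpeace|extinctionrebellion")
print([row["id"] for row in either])
```

A tweet carrying both hashtags is kept once, not twice, since the filter tests each row a single time.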
Creating a subset based on two or more hashtags (AND)
This step can be used to filter tweets that contain several hashtags at once. The resulting subset contains only the tweets mentioning all of the selected hashtags, for example all the tweets with both #greenpeace and #extinctionrebellion.
- On the header of the column containing the hashtags, click the small arrow next to the column name, and choose [Text filter]
- In the [hashtags] panel in the top left corner, type the first hashtag you want to filter on (e.g. greenpeace)
- On the header of the column containing the hashtags, click the small arrow next to the column name again, and choose [Text filter]: a second filtering panel will appear
- In the second panel, type the second hashtag you want to filter on (e.g. extinctionrebellion)
- Note that above the table you can see how many tweets match both criteria (in this case, only 2 tweets)
- Click [Export] in the top right corner to export a CSV or Excel file
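Stacking two text filters amounts to applying the second filter to the output of the first. The Python sketch below illustrates that logic and writes the result as CSV text. The helper name `contains`, the sample rows and the column names are hypothetical, and the filter is assumed to be a case-insensitive substring match.

```python
import csv
import io


def contains(rows, column, query):
    """Case-insensitive substring filter, standing in for one filter panel."""
    query = query.lower()
    return [row for row in rows if query in row[column].lower()]


# Hypothetical sample rows; "hashtags" is an assumed column name.
tweets = [
    {"id": "1", "hashtags": "greenpeace;oceans"},
    {"id": "2", "hashtags": "extinctionrebellion"},
    {"id": "4", "hashtags": "greenpeace;extinctionrebellion"},
]

# Each filter narrows the previous result, which gives AND semantics.
first_pass = contains(tweets, "hashtags", "greenpeace")
both = contains(first_pass, "hashtags", "extinctionrebellion")

# Write the subset as CSV text (to a file-like buffer in this sketch).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "hashtags"])
writer.writeheader()
writer.writerows(both)
print(buffer.getvalue())
```

The order of the two filters does not change the result; only the tweets that pass both remain.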