This is a narrative guide outlining how to start a search and a filter and combine the results once the event is over. We’re going to run this on a recent news event about the Governor of Florida, but any topic will work.
Before You Start
Filter and Search
Dehydrate
Combine
Rehydrate
Deduplicate
Analysis
Before starting this guide, make sure you have twarc installed and set up.
Next, run twarc filter, which collects tweets from the Twitter stream matching the filter criteria, and twarc search, which collects tweets from the past seven days matching the search criteria. There are a couple of ways to do this, but the simplest is to run each command in its own command line window.
twarc filter desantis > desantis_filter.jsonl
twarc search desantis > desantis_search.jsonl
The search command will finish first; the filter will keep running until it is stopped manually. Once the search has finished and the filter has been stopped, we can work on combining the two JSONL files.
We will start by dehydrating the two collected datasets.
twarc dehydrate desantis_filter.jsonl > desantis_filter.txt
twarc dehydrate desantis_search.jsonl > desantis_search.txt
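For readers curious about what dehydration amounts to, here is a minimal Python sketch. It assumes each line of the JSONL file is one tweet object carrying an id_str field, and it is not twarc's own implementation. The function name dehydrate and its arguments are hypothetical, chosen only for this illustration.

```python
import json


def dehydrate(jsonl_path, ids_path):
    """Write one tweet ID per line for every tweet found in jsonl_path.

    Assumes each non-blank line is a JSON object with an "id_str" field.
    Returns the number of IDs written.
    """
    count = 0
    with open(jsonl_path, encoding="utf-8") as src, \
            open(ids_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            tweet = json.loads(line)
            dst.write(tweet["id_str"] + "\n")
            count += 1
    return count
```

Calling dehydrate("desantis_filter.jsonl", "desantis_filter.txt") would, under those assumptions, produce a file comparable to the one written by the twarc command above.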
Now that the datasets have been dehydrated, we can use the Python program combine.py to combine them.
python utils/combine.py
And enter the input requests as follows:
Enter the name of your filter txt: desantis_filter.txt
Enter the name of your search txt: desantis_search.txt
Enter the name of your output txt: desantis_fs.txt
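The internals of combine.py are not shown in this guide, so the following is only a sketch of what a merge of two ID files might look like, assuming the script simply concatenates them. The function name combine_ids is hypothetical. Duplicates are deliberately left in place, since they are removed in a later step.

```python
def combine_ids(filter_path, search_path, output_path):
    """Concatenate two files of tweet IDs (one ID per line) into one file.

    Blank lines are skipped. Duplicate IDs are kept on purpose.
    Returns the total number of IDs written.
    """
    total = 0
    with open(output_path, "w", encoding="utf-8") as out:
        for path in (filter_path, search_path):
            with open(path, encoding="utf-8") as src:
                for line in src:
                    tweet_id = line.strip()
                    if tweet_id:
                        out.write(tweet_id + "\n")
                        total += 1
    return total
```

With the file names used in this guide, the call would be combine_ids("desantis_filter.txt", "desantis_search.txt", "desantis_fs.txt").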
Now that we have our merged dataset, we can rehydrate the dataset.
twarc hydrate desantis_fs.txt > desantis_fs.jsonl
Then, we can run deduplicate.py to remove any overlap from the merging of the two datasets.
python utils/deduplicate.py desantis_fs.jsonl > desantis.jsonl
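To illustrate the idea behind this step, here is a minimal sketch of deduplicating tweets by ID. It assumes each line is a JSON tweet with an id_str field and keeps the first occurrence of each ID. It is an illustration only, not the code of deduplicate.py, and the function name deduplicate is hypothetical.

```python
import json


def deduplicate(lines):
    """Yield each tweet line once, keyed on its "id_str" field.

    The first occurrence of an ID is kept; later repeats are dropped.
    """
    seen = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet_id = json.loads(line)["id_str"]
        if tweet_id not in seen:
            seen.add(tweet_id)
            yield line
```

Because the function takes any iterable of lines, it can be fed an open file handle for desantis_fs.jsonl and its output written line by line to desantis.jsonl.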
All of the usage is displayed in the command line.
Now that we have our merged dataset without duplicate IDs, we can perform analysis using the Python utilities provided with twarc. See the twarc page for more information and links to the repository.
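As one example of the kind of analysis the deduplicated file supports, the sketch below counts hashtags. It assumes the hydrated tweets follow the Twitter v1.1 JSON layout, where hashtags sit under entities.hashtags with a text field. The function name count_hashtags is hypothetical and this is not one of twarc's bundled utilities.

```python
import json
from collections import Counter


def count_hashtags(lines):
    """Count hashtags, case-insensitively, across an iterable of tweet JSON lines.

    Tweets without an "entities" or "hashtags" entry contribute nothing.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        for tag in tweet.get("entities", {}).get("hashtags", []):
            counts[tag["text"].lower()] += 1
    return counts
```

Passing an open handle for desantis.jsonl and then calling most_common(10) on the result would list the ten most frequent hashtags in the collection.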
You can download the DeSantis files from the twitter repo.