far-right/
|--info-source/
   |--daily
      |--<date>
         |--<Source1>
         |--<Source2>
      |--<date2>
         |--<Source1>
         |--<Source2>
      |--<date3>
         |--<Source1>
         |--<Source2>
   |--archive
      |--breitbart
         |--<historic_file1.json>
         |--<historic_file2.json>
         |--<historic_file3.json>
      |--stormfront
         |--<historic_file1.json>
         |--<historic_file2.json>
         |--<historic_file3.json>
To list the contents of the bucket, run:
aws s3 ls s3://far-right/info-source/ --profile <credential_profile>
See aws configure for setting up a credential profile.
For read-only AWS credentials, ping @bstarling, @nick, @sjackson, @wwymak, or @divya.
daily has one folder per day (UTC time). Inside each day's folder there is one folder per source.
- breitbart: the folder has one file per news category, containing the most recent ~30-35 articles for that category. Originally posted every four hours, now posted every twelve hours.
  - Naming convention: <category>_<time>.json, where time is the UTC time the snapshot was taken.
  - Since this process dumps the most recent articles without knowledge of the previous snapshot, big-government_1200.json and big-government_1600.json could be the same if no new articles were published between 1200 hours and 1600 hours (UTC).
  - These files will need to be combined and de-duped before real analysis can start.
- stormfront: TBD, still gathering historic activity.
- radix: running 2x daily (noon and midnight UTC).
- americanren: running 2x daily (noon and midnight UTC).
- 4chan: running 2x daily (noon and midnight UTC).
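The combine-and-de-dupe step for the daily breitbart snapshots could look roughly like the sketch below. It is only an illustration under assumptions that this README does not confirm: each snapshot file is taken to be a JSON list of article objects, the "url" field is a hypothetical unique identifier for an article, and the time part of the file name is assumed to be four digits (as in 1200 and 1600). The function names are made up for this example.

```python
import json
import re
from pathlib import Path

# Matches the <category>_<time>.json convention, e.g. big-government_1200.json.
# ASSUMPTION: <time> is always four digits (HHMM, UTC).
SNAPSHOT_NAME = re.compile(r"^(?P<category>.+)_(?P<time>\d{4})\.json$")


def parse_snapshot_name(filename):
    """Split '<category>_<time>.json' into a (category, time) pair."""
    match = SNAPSHOT_NAME.match(filename)
    if match is None:
        raise ValueError(f"unexpected snapshot name: {filename}")
    return match.group("category"), match.group("time")


def combine_snapshots(folder, category, key="url"):
    """Merge every snapshot of one category found in `folder`.

    ASSUMPTION: each file holds a JSON list of article objects and
    `key` (hypothetically "url") identifies an article uniquely.
    The first copy of each article is kept, in snapshot-time order.
    """
    seen = set()
    combined = []
    # Sorting by name orders same-category files by their zero-padded time.
    for path in sorted(Path(folder).glob("*.json")):
        file_category, _ = parse_snapshot_name(path.name)
        if file_category != category:
            continue
        with path.open(encoding="utf-8") as handle:
            articles = json.load(handle)
        for article in articles:
            if article[key] not in seen:
                seen.add(article[key])
                combined.append(article)
    return combined
```

Calling combine_snapshots on a local copy of one day's breitbart folder with category "big-government" would return a single list in which an article present in both the 1200 and 1600 snapshots appears once. If the real files use a different identifier field, pass it through the key argument.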
archive folder contains historic results (much larger files).
- breitbart: one file per news category. The entire collection is ~240k articles starting in early 2012.
  - Should not have dupes unless an article appeared in multiple categories (TBD).
- stormfront: in progress, new sources posted daily. One file per sub-forum.
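For the breitbart archive, the open question of whether articles repeat across categories could be checked with something like the following sketch. As before, this rests on unconfirmed assumptions: each archive file is taken to be a JSON list of article objects, and "url" is a hypothetical identifying field. Because the archive files are large, loading each one fully into memory as done here may not be practical for every file.

```python
import json
from collections import defaultdict
from pathlib import Path


def find_cross_category_duplicates(folder, key="url"):
    """Return articles that appear in more than one archive file.

    ASSUMPTION: each file is a JSON list of article objects and `key`
    (hypothetically "url") identifies an article uniquely.
    The result maps each duplicated key to the sorted file names it was seen in.
    """
    files_by_key = defaultdict(set)
    for path in sorted(Path(folder).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            for article in json.load(handle):
                files_by_key[article[key]].add(path.name)
    return {
        article_key: sorted(names)
        for article_key, names in files_by_key.items()
        if len(names) > 1
    }
```

An empty result on a local copy of the archive folder would suggest there are no cross-category dupes, at least under the chosen identifier.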