Merged
@@ -8,22 +8,21 @@ Browser profiles are saved instances of a web browsing session that can be used

Pre-configure a social media site to be logged in so that the crawler can access content that can only be viewed by logged-in users.

!!! tip "Best Practices: Use an account created specifically for archiving a website"

Websites may require user registration to view content at URLs that are otherwise public. This practice is sometimes referred to as a login wall. Login walls are commonly used by social media and publishing platforms.

We recommend creating dedicated accounts when archiving any website that requires a user account. Although dedicated accounts are not required to benefit from browser profiles, they can address the following potential issues:
### Hide Popup Prompts

- While usernames and passwords are never saved by Browsertrix, the private tokens that enable access to logged in content _are_ stored. Thus, anyone with access to your Browsertrix account, intentional or malicious, may be able to access the logged in content.
Websites may prompt users for a number of reasons before displaying the rest of the page, such as for age verification, informed consent requirements, or geographical location. Configure a browser profile to accept, dismiss, or otherwise hide these dialogs so that the content behind them is visible to the crawler.

- Some websites may rate limit or lock accounts for reasons they deem to be suspicious, such as logging in from a new geographical location or if the site determines crawls to be robot activity.
## Best Practices

- Personalized data such as cookies, location, etc. may be included in the resulting crawl.
### Use logins dedicated to web archiving

- The logged in interface may display unwanted personally identifiable information such as a username or profile picture.
Websites may require user registration to view content at URLs that are otherwise public. This practice is sometimes referred to as a login wall. Login walls are commonly used by social media and publishing platforms.

An exception to this practice is if your goal is to archive personalized or private content accessible only from designated accounts. In these instances we recommend changing the account's password after crawling is complete.
We highly recommend avoiding the use of your personal accounts when logging into websites during the profile creation process. Instead, sign up for a new account dedicated to archiving and use that dedicated account in your browser profile. Although dedicated accounts are not necessary to benefit from browser profiles, they can address the following potential issues:

### Hide Popup Prompts
- While usernames and passwords are never saved by Browsertrix, the private tokens that enable access to logged-in content _are_ stored in WACZ files. Thus, anyone with access to your Browsertrix account or WACZ files, intentional or malicious, may be able to view and use the token to log in to your account.
- Some websites may rate limit or lock accounts for reasons they deem to be suspicious, such as logging in from a new geographical location or if the site determines crawls to be robot activity.
- Personalized data such as cookies, location, etc. may be included in the resulting crawl.
- The logged-in interface may display unwanted personally identifiable information such as a username or profile picture.

Websites may prompt users for a number of reasons before displaying the rest of the page, such as for age verification, informed consent requirements, or geographical location. Configure a browser profile to accept, dismiss, or otherwise hide these dialogs so that the content behind them is visible to the crawler.
An exception to this practice is if your goal is to archive personalized or private content accessible only from designated accounts. In these instances, we recommend changing the account's password or logging out of all active sessions after crawling is complete.
16 changes: 12 additions & 4 deletions frontend/docs/docs/user-guide/workflow-setup.md
@@ -287,12 +287,24 @@ Configure the browser used to visit URLs during the crawl.

Sets the [_Browser Profile_](browser-profiles/browser-profiles-overview.md) to be used for this crawl.

!!! Tip "Best Practices: Use login profiles dedicated to crawling"
We highly recommend avoiding the use of your personal accounts when logging into websites during the profile creation process. Crawling with a browser profile that uses your personal account may expose you to risks such as compromised private tokens and unwanted sharing of user preferences. Although accounts dedicated to crawling are not necessary to benefit from browser profiles, they can address these potential issues and more. [Continue reading about dedicated accounts](browser-profiles/browser-profiles-overview.md#use-logins-dedicated-to-web-archiving)

### Fail Crawl if Not Logged In

When enabled, the crawl will fail if a [page behavior](#page-behavior) detects the presence or absence of content on supported pages indicating that the browser is not logged in.

For details about which websites are supported and how to add this functionality to your own [custom behaviors](#use-custom-behaviors), see the [Browsertrix Crawler documentation for Fail on Content Check](https://crawler.docs.browsertrix.com/user-guide/behaviors/#fail-on-content-check).
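To illustrate the idea behind a content check (this is a plain DOM-style sketch, not the actual Browsertrix Crawler behavior API — see the linked documentation for the real custom-behavior format; the selectors are hypothetical):

```typescript
// Hypothetical login-state check; selector names are illustrative only.
interface PageLike {
  // Minimal stand-in for a DOM query interface.
  querySelector(selector: string): unknown | null;
}

// Returns true when the page looks logged in: an account menu is present
// and the login form is absent.
function looksLoggedIn(page: PageLike): boolean {
  const accountMenu = page.querySelector("[data-testid='account-menu']");
  const loginForm = page.querySelector("form[action*='login']");
  return accountMenu !== null && loginForm === null;
}

// A fake page object simulating a logged-out session: only the login
// form selector matches, so the check reports "not logged in".
const loggedOutPage: PageLike = {
  querySelector: (sel) => (sel.includes("login") ? {} : null),
};

console.log(looksLoggedIn(loggedOutPage)); // false: the crawl should fail
```

With this setting enabled, a check like the above returning false on a supported page causes the crawl to fail rather than silently archiving the logged-out view.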

### Include Browser Storage Data

When enabled, instructs the crawler to save the browser's `localStorage` and `sessionStorage` data for each page in the web archive as part of the `WARC-JSON-Metadata` field. Enabling this option is recommended to properly archive and replay certain websites, provided the privacy and security implications have been reviewed.

!!! Warning "Privacy & security implications when used with browser profiles"
Websites can use browser storage to store arbitrary data. During the browser profile creation process, some websites may save sensitive data such as login information and user-identifying preferences in browser storage. Since every website can implement browser storage differently, Browsertrix does not attempt to detect whether the information stored is potentially sensitive.

Use caution when sharing WACZ files created with this option enabled, especially if you’re crawling pages that require login. We always recommend creating dedicated website logins to be used only for crawling to mitigate the risk of compromised login information.
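To make the warning concrete, here is a minimal TypeScript sketch of capturing a page's storage; the `StorageLike` stand-in and the record shape are illustrative, not the crawler's actual implementation:

```typescript
// Hypothetical sketch of per-page storage capture; the actual record
// shape written to WARC-JSON-Metadata may differ.
type StorageSnapshot = Record<string, string>;

// Minimal stand-in for the browser's Storage interface.
interface StorageLike {
  length: number;
  key(index: number): string | null;
  getItem(key: string): string | null;
}

// Copies every key/value pair out of a Storage-like object.
function snapshotStorage(storage: StorageLike): StorageSnapshot {
  const out: StorageSnapshot = {};
  for (let i = 0; i < storage.length; i++) {
    const key = storage.key(i);
    if (key !== null) out[key] = storage.getItem(key) ?? "";
  }
  return out;
}

// Simulated sessionStorage holding a login token: exactly the kind of
// sensitive value the warning above is about.
const fakeSession: StorageLike = {
  length: 1,
  key: (i) => (i === 0 ? "authToken" : null),
  getItem: (k) => (k === "authToken" ? "secret-123" : null),
};

console.log(JSON.stringify({ sessionStorage: snapshotStorage(fakeSession) }));
// {"sessionStorage":{"authToken":"secret-123"}}
```

Anything a site has placed in storage, tokens included, ends up in the archive verbatim, which is why shared WACZ files made with this option deserve extra scrutiny.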

### Crawler Proxy Server

!!! Info "This setting will be shown if the organization supports multiple proxies."
@@ -320,10 +332,6 @@ Sets the release channel of [Browsertrix Crawler](https://github.com/webrecorder

Will prevent any content from the domains listed in [Steven Black's Unified Hosts file](https://github.com/StevenBlack/hosts) (ads & malware) from being captured by the crawler.

### Save Local and Session Storage

When enabled, instructs the crawler to save the browser's `localStorage` and `sessionStorage` data for each page in the web archive as part of the `WARC-JSON-Metadata` field. This option may be necessary to properly archive and replay certain websites. Use caution when sharing WACZ files created with this option enabled, as the saved browser storage may contain sensitive information.

### User Agent

Sets the browser's [user agent](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent) in outgoing requests to the specified value. If left blank, the crawler will use the Brave browser's default user agent. For a list of common user agents see [useragents.me](https://www.useragents.me/).
2 changes: 1 addition & 1 deletion frontend/src/components/ui/config-details.ts
@@ -292,7 +292,7 @@ export class ConfigDetails extends BtrixElement {
seedsConfig?.blockAds,
)}
${this.renderSetting(
msg("Save Local and Session Storage"),
labelFor["saveStorage"],
seedsConfig?.saveStorage,
)}
${this.renderSetting(
38 changes: 24 additions & 14 deletions frontend/src/features/crawl-workflows/workflow-editor.ts
@@ -2061,11 +2061,21 @@ https://archiveweb.page/images/${"logo.svg"}`}
this.updateFormState({
browserProfile: profile ?? null,
proxyId: profile?.proxyId ?? null,
saveStorage:
profile && this.formState.saveStorage === undefined
? true
: this.formState.saveStorage,
});
}}
></btrix-select-browser-profile>
`)}
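The defaulting logic in the callback above can be sketched in isolation (`defaultSaveStorage` is a hypothetical helper, not a function in this codebase): `saveStorage` flips to `true` only when a profile is selected and the user has not yet touched the checkbox.

```typescript
// Standalone sketch of the saveStorage defaulting rule shown above.
function defaultSaveStorage(
  profileSelected: boolean,
  current: boolean | undefined,
): boolean | undefined {
  // Only default to true when a profile is chosen and the user has not
  // explicitly set the checkbox yet; otherwise preserve their choice.
  return profileSelected && current === undefined ? true : current;
}

console.log(defaultSaveStorage(true, undefined)); // true
console.log(defaultSaveStorage(true, false)); // false (user opted out)
console.log(defaultSaveStorage(false, undefined)); // undefined
```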
${this.renderHelpTextCol(infoTextFor["browserProfile"])}
${this.renderHelpTextCol(html`
${infoTextFor["browserProfile"]}
${this.renderUserGuideLink({
hash: "browser-profile",
content: msg("More details"),
})}
`)}
${when(
this.formState.browserProfile,
() => html`
@@ -2088,6 +2098,19 @@
)}
`,
)}
${inputCol(html`
<sl-checkbox name="saveStorage" ?checked=${this.formState.saveStorage}>
${labelFor["saveStorage"]}
</sl-checkbox>
`)}
${this.renderHelpTextCol(
html`${infoTextFor["saveStorage"]}
${this.renderUserGuideLink({
hash: "include-browser-storage-data",
content: msg("More details"),
})}.`,
false,
)}
${proxies?.servers.length
? [
inputCol(html`
@@ -2164,19 +2187,6 @@
</sl-checkbox>
`)}
${this.renderHelpTextCol(infoTextFor["blockAds"], false)}
${inputCol(html`
<sl-checkbox name="saveStorage" ?checked=${this.formState.saveStorage}>
${msg("Save local and session storage")}
</sl-checkbox>
`)}
${this.renderHelpTextCol(
html`${infoTextFor["saveStorage"]}
${this.renderUserGuideLink({
hash: "save-local-and-session-storage",
content: msg("Implications for shared archives"),
})}.`,
false,
)}
${inputCol(html`
<sl-input
name="userAgent"
1 change: 1 addition & 0 deletions frontend/src/pages/org/workflows-new.ts
@@ -151,6 +151,7 @@ export class WorkflowsNew extends BtrixElement {
blockAds: org.crawlingDefaults?.blockAds ?? undefined,
lang: org.crawlingDefaults?.lang,
customBehaviors: org.crawlingDefaults?.customBehaviors || [],
saveStorage: org.crawlingDefaults?.profileid ? true : undefined,
},
crawlTimeout: org.crawlingDefaults?.crawlTimeout,
maxCrawlSize: org.crawlingDefaults?.maxCrawlSize,
27 changes: 18 additions & 9 deletions frontend/src/strings/crawl-workflows/infoText.ts
@@ -34,9 +34,11 @@ export const infoTextFor = {
pageExtraDelaySeconds: msg(
`Waits on the page after behaviors are complete before moving onto the next page. Can be helpful for rate limiting.`,
),
browserProfile:
msg(`Choose a custom profile to make use of saved cookies and logged-in
accounts. Note that websites may log profiles out after a period of time.`),
browserProfile: html`${msg(`Choose a custom profile to make use of saved cookies and logged-in
accounts.`)}<br /><br />
${msg(
"For websites that require login, we always recommend using a profile that's logged-in to an account created specifically for crawling.",
)} `,
crawlerChannel: msg(
`Choose a Browsertrix Crawler release channel. If available, other versions may provide new or experimental crawling features.`,
),
@@ -79,12 +81,19 @@
customBehavior: msg(
`Enable custom page actions with behavior scripts. You can specify any publicly accessible URL or public Git repository.`,
),
failOnContentCheck: msg(
`Fail the crawl if a page behavior detects the browser is not logged in on supported pages.`,
),
saveStorage: msg(
`Include data from the browser's local and session storage in the web archive.`,
),
failOnContentCheck: html`${msg(
"Fail the crawl if a page behavior detects the browser is not logged in on supported pages.",
)}
${msg("Note that websites may log profiles out after a period of time.")}`,
saveStorage: html`${msg(
"During a crawl, websites may store data in the browser itself, e.g. to persist logins.",
)}
${msg(
"Checking this will include data from the browser’s local and session storage in the archive.",
)}
${msg(
"This can improve replay quality, but may come with security implications.",
)}`,
useRobots: msg(
`Check for a robots.txt file for each host and skip any disallowed pages.`,
),
1 change: 1 addition & 0 deletions frontend/src/strings/crawl-workflows/labels.ts
@@ -13,4 +13,5 @@ export const labelFor = {
selectLinks: msg("Link Selectors"),
clickSelector: msg("Click Selector"),
dedupeType: msg("Crawl Deduplication"),
saveStorage: msg("Include browser storage data"),
} as const satisfies Partial<Record<FormStateField, string>>;