public class WebcrawlerConnector extends BaseRepositoryConnector
Modifier and Type | Class | Description |
---|---|---|
protected static class | WebcrawlerConnector.CanonicalizationPolicies | Class representing a list of canonicalization rules |
protected static class | WebcrawlerConnector.CanonicalizationPolicy | Class representing a URL regular expression match, for the purposes of determining canonicalization policy |
protected class | WebcrawlerConnector.DocumentURLFilter | This class describes the URL filtering information (for crawling and indexing) obtained from a digested DocumentSpecification. |
protected static class | WebcrawlerConnector.EvaluatorToken | Evaluator token. |
protected static class | WebcrawlerConnector.EvaluatorTokenStream | Token stream. |
protected class | WebcrawlerConnector.FeedContextClass | |
protected class | WebcrawlerConnector.FeedItemContextClass | |
protected static class | WebcrawlerConnector.FetchStatus | |
protected static class | WebcrawlerConnector.MappingRule | Class representing a mapping rule |
protected static class | WebcrawlerConnector.MappingRules | Class that represents all mappings |
protected static class | WebcrawlerConnector.NameValue | Name/value class |
protected class | WebcrawlerConnector.OuterContextClass | This class handles the outermost XML context for the feed document. |
protected class | WebcrawlerConnector.ProcessActivityHTMLHandler | Class that describes HTML handling |
protected class | WebcrawlerConnector.ProcessActivityLinkHandler | This class is the handler for links that get added into an IProcessActivity object. |
protected class | WebcrawlerConnector.ProcessActivityRedirectionHandler | Class that describes redirection handling |
protected class | WebcrawlerConnector.ProcessActivityXMLHandler | Class that describes XML handling |
protected class | WebcrawlerConnector.RDFContextClass | |
protected class | WebcrawlerConnector.RDFItemContextClass | |
protected class | WebcrawlerConnector.RSSChannelContextClass | |
protected class | WebcrawlerConnector.RSSContextClass | |
protected class | WebcrawlerConnector.RSSItemContextClass | |
protected class | WebcrawlerConnector.UrlsetContextClass | |
protected class | WebcrawlerConnector.UrlsetItemContextClass | |
Modifier and Type | Field | Description |
---|---|---|
static String | _rcsid | |
static String | ACTIVITY_FETCH | |
static String | ACTIVITY_LOGON_END | |
static String | ACTIVITY_LOGON_START | |
static String | ACTIVITY_PROCESS | |
static String | ACTIVITY_ROBOTSPARSE | |
protected static DataCache | cache | This is where we keep data around between the getVersions() phase and the processDocuments() phase. |
protected int | connectionTimeoutMilliseconds | Connection timeout, milliseconds. |
protected CookieManager | cookieManager | The cookie manager used by this instance |
protected CredentialsDescription | credentialsDescription | The credentials description |
protected DNSManager | dnsManager | The DNS manager currently used by this instance |
protected static String | FETCH_LOGIN | |
protected static String | FETCH_ROBOTS | |
protected static String | FETCH_STANDARD | |
protected String | from | The email address for this connector instance |
protected static String[] | interestingMimeTypeArray | This represents a list of the mime types that this connector knows how to extract links from. |
protected static Set<String> | interestingMimeTypeMap | |
protected boolean | isInitialized | This flag is set when the instance has been initialized |
protected static List<String> | potentiallyExcludedHeaders | |
protected String | proxyAuthDomain | Proxy auth domain |
protected String | proxyAuthPassword | Proxy auth password |
protected String | proxyAuthUsername | Proxy auth user name |
protected String | proxyHost | Proxy host |
protected int | proxyPort | Proxy port |
static String | REL_LINK | |
static String | REL_REDIRECT | |
protected static Set<String> | reservedHeaders | |
protected static int | RESULT_NO_DOCUMENT | |
protected static int | RESULT_NO_VERSION | |
protected static int | RESULT_RETRY_DOCUMENT | |
protected static int | RESULT_VERSION_NEEDED | |
protected static int | RESULTSTATUS_FALSE | |
protected static int | RESULTSTATUS_NOTYETDETERMINED | |
protected static int | RESULTSTATUS_TRUE | |
protected static int | ROBOTS_ALL | |
protected static int | ROBOTS_DATA | |
protected static int | ROBOTS_NONE | |
protected RobotsManager | robotsManager | The robots manager currently used by this instance |
protected int | robotsUsage | Robots usage flag |
protected static int | SESSIONSTATE_LOGIN | We're in 'login mode' |
protected static int | SESSIONSTATE_NORMAL | Normal fetch of content document. |
protected int | socketTimeoutMilliseconds | Socket timeout, milliseconds |
protected ThrottleDescription | throttleDescription | The throttle description |
protected String | throttleGroupName | Throttle group name |
protected TrustsDescription | trustsDescription | The trusts description |
protected static Set<String> | understoodProtocols | |
protected String | userAgent | The user-agent for this connector instance |
Fields inherited from class BaseConnector: currentContext, params
Fields inherited from interface IRepositoryConnector: GLOBAL_DENY_TOKEN, JOBMODE_CONTINUOUS, JOBMODE_ONCEONLY, MODEL_ADD, MODEL_ADD_CHANGE, MODEL_ADD_CHANGE_DELETE, MODEL_ALL, MODEL_CHAINED_ADD, MODEL_CHAINED_ADD_CHANGE, MODEL_CHAINED_ADD_CHANGE_DELETE, MODEL_PARTIAL
Constructor | Description |
---|---|
WebcrawlerConnector() | Constructor. |
Modifier and Type | Method | Description |
---|---|---|
String | addSeedDocuments(ISeedingActivity activities, Specification spec, String lastSeedVersion, long seedTime, int jobMode) | Queue "seed" documents. |
protected String[] | calculateDocumentEvents(INamingActivity activities, String documentIdentifier) | Calculate events that should be associated with a document. |
String | check() | Check status of connection. |
protected int | checkFetchAllowed(String documentIdentifier, String protocol, String hostIPAddress, int port, PageCredentials credential, IKeystoreManager trustStore, String hostName, String[] binNames, long currentTime, String pathString, IVersionActivity versionActivities, int connectionLimit, String proxyHost, int proxyPort, String proxyAuthDomain, String proxyAuthUsername, String proxyAuthPassword) | Check robots to see if fetch is allowed. |
void | clearThreadContext() | Clear out any state information specific to a given thread. |
protected static void | compileList(List<Pattern> output, List<String> input) | Compile all regexp entries in the passed in list, and add them to the output list. |
void | deinstall(IThreadContext threadContext) | Uninstall the connector. |
void | disconnect() | Close the connection. |
protected String | doCanonicalization(WebcrawlerConnector.DocumentURLFilter filter, WebURL url) | Code to canonicalize a URL. |
protected String | documentIdentifiertoFileName(String documentIdentifier) | Convert a document identifier to a filename. |
protected static String | extractContentType(String contentType) | |
protected static String | extractEncoding(String contentType) | |
protected boolean | extractLinks(String documentIdentifier, IProcessActivity activities, WebcrawlerConnector.DocumentURLFilter filter) | Code to extract links from an already-fetched document. |
protected static String | extractMimeType(String contentType) | |
protected static Set<String> | findExcludedHeaders(Specification spec) | Read a document specification to get a set of excluded headers |
protected FormData | findHTMLForm(String currentURI, LoginParameters lp) | Find matching HTML form data, if present. |
protected String | findHTMLLinkURI(String currentURI, LoginParameters lp) | Find HTML link URI, if present, making sure specified preference is matched. |
protected static List<WebcrawlerConnector.NameValue> | findMetadata(Specification spec) | Read a document specification to yield a list of name/value pairs for metadata |
protected String | findPreferredRedirectionURI(String currentURI, LoginParameters lp) | Find a preferred redirection URI, if it exists |
protected String | findRedirectionURI(String currentURI) | Find a redirection URI, if it exists |
protected String | findSpecifiedContent(String currentURI, LoginParameters lp) | Find existence of specific content on the page (never finds a URL) |
protected static String[] | getAcls(Specification spec) | Grab the forced ACLs out of the document specification. |
String[] | getActivitiesList() | Return the list of activities that this connector supports. |
String[] | getBinNames(String documentIdentifier) | Get the bin name string for a document identifier. |
int | getConnectorModel() | Tell the world what model this connector uses for getDocumentIdentifiers(). |
String | getFormCheckJavascriptMethodName(int connectionSequenceNumber) | Obtain the name of the form check javascript method to call. |
String | getFormPresaveCheckJavascriptMethodName(int connectionSequenceNumber) | Obtain the name of the form presave check javascript method to call. |
int | getMaxDocumentRequest() | Get the maximum number of documents to amalgamate together into one batch, for this connector. |
protected PageCredentials | getPageCredential(String documentIdentifier) | Get the page credentials for a given document identifier (URL) |
String[] | getRelationshipTypes() | Return the list of relationship types that this connector recognizes. |
protected SequenceCredentials | getSequenceCredential(String documentIdentifier) | Get the sequence credentials for a given document identifier (URL) |
protected void | getSession() | Start a session |
protected IKeystoreManager | getTrustStore(String documentIdentifier) | Get the trust store for a given document identifier (URL) |
protected void | handleHTML(String documentURI, IHTMLHandler handler) | Handle document references from HTML |
protected static void | handleIOException(IOException e, String context) | |
protected void | handleRedirects(String documentURI, IRedirectionHandler handler) | Handle extracting the redirect link from a redirect response. |
protected void | handleXML(String documentURI, IXMLHandler handler) | Handle document references from XML. |
void | install(IThreadContext threadContext) | Install the connector. |
protected boolean | isContentInteresting(IFingerprintActivity activities, String documentIdentifier, int response, String contentType) | Code to check if data is interesting, based on response code and content type. |
protected boolean | isDocumentText(String documentURI) | Is the document text, as far as we can tell? |
protected static boolean | isStrange(byte x) | Check if character is not typical ASCII or UTF-8. |
protected static boolean | isText(byte[] beginChunk, int chunkLength) | Test to see if a document is text or not. |
protected static boolean | isWhiteSpace(byte x) | Check if a byte is a whitespace character. |
protected void | loginAndFetch(WebcrawlerConnector.FetchStatus fetchStatus, IProcessActivity activities, String documentIdentifier, SequenceCredentials sessionCredential, String globalSequenceEvent) | |
protected int | lookupIPAddress(String documentIdentifier, IVersionActivity activities, String hostName, long currentTime, StringBuilder ipAddressBuffer) | Look up an IP address given a non-canonical host name. |
protected String | makeDNSEventName(INamingActivity activities, String hostNameKey) | Calculate the event name for DNS access. |
protected String | makeDocumentIdentifier(String parentIdentifier, String rawURL, WebcrawlerConnector.DocumentURLFilter filter) | Convert an absolute or relative URL to a document identifier. |
protected String | makeRobotsEventName(INamingActivity versionActivities, String robotsKey) | Construct a name for the global web-connector robots event. |
protected static String | makeRobotsKey(String protocol, String hostName, int port) | Construct the robots key for a host. |
protected String | makeSessionLoginEventName(INamingActivity activities, String sequenceKey) | Calculate the event name for session login. |
void | outputConfigurationBody(IThreadContext threadContext, IHTTPOutput out, Locale locale, ConfigParams parameters, String tabName) | Output the configuration body section. |
void | outputConfigurationHeader(IThreadContext threadContext, IHTTPOutput out, Locale locale, ConfigParams parameters, List<String> tabsArray) | Output the configuration header section. |
void | outputSpecificationBody(IHTTPOutput out, Locale locale, Specification ds, int connectionSequenceNumber, int actualSequenceNumber, String tabName) | Output the specification body section. |
void | outputSpecificationHeader(IHTTPOutput out, Locale locale, Specification ds, int connectionSequenceNumber, List<String> tabsArray) | Output the specification header section. |
void | poll() | This method is periodically called for all connectors that are connected but not in active use. |
String | processConfigurationPost(IThreadContext threadContext, IPostParameters variableContext, Locale locale, ConfigParams parameters) | Process a configuration post. |
protected void | processDocument(IProcessActivity activities, String documentIdentifier, String versionString, boolean indexDocument, Map<String,Set<String>> metaHash, Map<String,Set<String>> metaHash2, String[] acls, WebcrawlerConnector.DocumentURLFilter filter) | |
void | processDocuments(String[] documentIdentifiers, IExistingVersions statuses, Specification spec, IProcessActivity activities, int jobMode, boolean usesDefaultAuthority) | Process a set of documents. |
String | processSpecificationPost(IPostParameters variableContext, Locale locale, Specification ds, int connectionSequenceNumber) | Process a specification post. |
protected static List<String> | stringToArray(String input) | Read a string as a sequence of individual expressions, URLs, etc. |
void | viewConfiguration(IThreadContext threadContext, IHTTPOutput out, Locale locale, ConfigParams parameters) | View configuration. |
void | viewSpecification(IHTTPOutput out, Locale locale, Specification ds, int connectionSequenceNumber) | View specification. |
Methods inherited from class BaseRepositoryConnector: addSeedDocuments, addSeedDocuments, addSeedDocuments, getDocumentIdentifiers, getDocumentIdentifiers, getDocumentVersions, getDocumentVersions, getDocumentVersions, getDocumentVersions, getDocumentVersions, getDocumentVersions, getDocumentVersions, getRemainingDocumentIdentifiers, outputSpecificationBody, outputSpecificationBody, outputSpecificationHeader, outputSpecificationHeader, outputSpecificationHeader, processDocuments, processDocuments, processDocuments, processDocuments, processSpecificationPost, processSpecificationPost, releaseDocumentVersions, releaseDocumentVersions, requestInfo, viewSpecification, viewSpecification
Methods inherited from class BaseConnector: connect, getConfiguration, isConnected, outputConfigurationBody, outputConfigurationHeader, outputConfigurationHeader, pack, packFixedList, packList, packList, processConfigurationPost, setThreadContext, unpack, unpackFixedList, unpackList, viewConfiguration
Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface IConnector: connect, getConfiguration, isConnected, setThreadContext
public static final String _rcsid
protected static final int RESULTSTATUS_FALSE
protected static final int RESULTSTATUS_TRUE
protected static final int RESULTSTATUS_NOTYETDETERMINED
protected static final String[] interestingMimeTypeArray
protected static final int ROBOTS_NONE
protected static final int ROBOTS_DATA
protected static final int ROBOTS_ALL
public static final String REL_LINK
public static final String REL_REDIRECT
public static final String ACTIVITY_FETCH
public static final String ACTIVITY_PROCESS
public static final String ACTIVITY_ROBOTSPARSE
public static final String ACTIVITY_LOGON_START
public static final String ACTIVITY_LOGON_END
protected static final String FETCH_ROBOTS
protected static final String FETCH_STANDARD
protected static final String FETCH_LOGIN
protected int robotsUsage
protected String userAgent
protected String from
protected int connectionTimeoutMilliseconds
protected int socketTimeoutMilliseconds
protected String throttleGroupName
protected ThrottleDescription throttleDescription
protected CredentialsDescription credentialsDescription
protected TrustsDescription trustsDescription
protected RobotsManager robotsManager
protected DNSManager dnsManager
protected CookieManager cookieManager
protected boolean isInitialized
protected static DataCache cache
protected String proxyHost
protected int proxyPort
protected String proxyAuthDomain
protected String proxyAuthUsername
protected String proxyAuthPassword
protected static final int SESSIONSTATE_NORMAL
protected static final int SESSIONSTATE_LOGIN
protected static final int RESULT_NO_DOCUMENT
protected static final int RESULT_NO_VERSION
protected static final int RESULT_VERSION_NEEDED
protected static final int RESULT_RETRY_DOCUMENT
public int getConnectorModel()
getConnectorModel
in interface IRepositoryConnector
getConnectorModel
in class BaseRepositoryConnector
public void install(IThreadContext threadContext) throws ManifoldCFException
install
in interface IConnector
install
in class BaseConnector
threadContext
- is the current thread context.
ManifoldCFException
public void deinstall(IThreadContext threadContext) throws ManifoldCFException
deinstall
in interface IConnector
deinstall
in class BaseConnector
threadContext
- is the current thread context.
ManifoldCFException
public String[] getActivitiesList()
getActivitiesList
in interface IRepositoryConnector
getActivitiesList
in class BaseRepositoryConnector
public String[] getRelationshipTypes()
getRelationshipTypes
in interface IRepositoryConnector
getRelationshipTypes
in class BaseRepositoryConnector
public void clearThreadContext()
clearThreadContext
in interface IConnector
clearThreadContext
in class BaseConnector
protected void getSession() throws ManifoldCFException
ManifoldCFException
public void poll() throws ManifoldCFException
poll
in interface IConnector
poll
in class BaseConnector
ManifoldCFException
public String check() throws ManifoldCFException
check
in interface IConnector
check
in class BaseConnector
ManifoldCFException
public void disconnect() throws ManifoldCFException
disconnect
in interface IConnector
disconnect
in class BaseConnector
ManifoldCFException
public String[] getBinNames(String documentIdentifier)
getBinNames
in interface IRepositoryConnector
getBinNames
in class BaseRepositoryConnector
documentIdentifier
- is the document identifier.
public String addSeedDocuments(ISeedingActivity activities, Specification spec, String lastSeedVersion, long seedTime, int jobMode) throws ManifoldCFException, ServiceInterruption
addSeedDocuments
in interface IRepositoryConnector
addSeedDocuments
in class BaseRepositoryConnector
activities
- is the interface this method should use to perform whatever framework actions are desired.
spec
- is a document specification (that comes from the job).
seedTime
- is the end of the time range of documents to consider, exclusive.
lastSeedVersion
- is the last seeding version string for this job, or null if the job has no previous seeding version string.
jobMode
- is an integer describing how the job is being run, whether continuous or once-only.
ManifoldCFException
ServiceInterruption
public void processDocuments(String[] documentIdentifiers, IExistingVersions statuses, Specification spec, IProcessActivity activities, int jobMode, boolean usesDefaultAuthority) throws ManifoldCFException, ServiceInterruption
processDocuments
in interface IRepositoryConnector
processDocuments
in class BaseRepositoryConnector
documentIdentifiers
- is the set of document identifiers to process.
statuses
- are the currently-stored document versions for each document in the set of document identifiers passed in above.
activities
- is the interface this method should use to queue up new document references and ingest documents.
jobMode
- is an integer describing how the job is being run, whether continuous or once-only.
usesDefaultAuthority
- will be true only if the authority in use for these documents is the default one.
ManifoldCFException
ServiceInterruption
protected void loginAndFetch(WebcrawlerConnector.FetchStatus fetchStatus, IProcessActivity activities, String documentIdentifier, SequenceCredentials sessionCredential, String globalSequenceEvent) throws ManifoldCFException, ServiceInterruption
protected void processDocument(IProcessActivity activities, String documentIdentifier, String versionString, boolean indexDocument, Map<String,Set<String>> metaHash, Map<String,Set<String>> metaHash2, String[] acls, WebcrawlerConnector.DocumentURLFilter filter) throws ManifoldCFException, ServiceInterruption
protected static void handleIOException(IOException e, String context) throws ManifoldCFException, ServiceInterruption
public int getMaxDocumentRequest()
getMaxDocumentRequest
in interface IRepositoryConnector
getMaxDocumentRequest
in class BaseRepositoryConnector
public void outputConfigurationHeader(IThreadContext threadContext, IHTTPOutput out, Locale locale, ConfigParams parameters, List<String> tabsArray) throws ManifoldCFException, IOException
outputConfigurationHeader
in interface IConnector
outputConfigurationHeader
in class BaseConnector
threadContext
- is the local thread context.
out
- is the output to which any HTML should be sent.
parameters
- are the configuration parameters, as they currently exist, for this connection being configured.
tabsArray
- is an array of tab names. Add to this array any tab names that are specific to the connector.
ManifoldCFException
IOException
public void outputConfigurationBody(IThreadContext threadContext, IHTTPOutput out, Locale locale, ConfigParams parameters, String tabName) throws ManifoldCFException, IOException
outputConfigurationBody
in interface IConnector
outputConfigurationBody
in class BaseConnector
threadContext
- is the local thread context.
out
- is the output to which any HTML should be sent.
parameters
- are the configuration parameters, as they currently exist, for this connection being configured.
tabName
- is the current tab name.
ManifoldCFException
IOException
public String processConfigurationPost(IThreadContext threadContext, IPostParameters variableContext, Locale locale, ConfigParams parameters) throws ManifoldCFException
processConfigurationPost
in interface IConnector
processConfigurationPost
in class BaseConnector
threadContext
- is the local thread context.
variableContext
- is the set of variables available from the post, including binary file post information.
parameters
- are the configuration parameters, as they currently exist, for this connection being configured.
ManifoldCFException
public void viewConfiguration(IThreadContext threadContext, IHTTPOutput out, Locale locale, ConfigParams parameters) throws ManifoldCFException, IOException
viewConfiguration
in interface IConnector
viewConfiguration
in class BaseConnector
threadContext
- is the local thread context.
out
- is the output to which any HTML should be sent.
parameters
- are the configuration parameters, as they currently exist, for this connection being configured.
ManifoldCFException
IOException
public String getFormCheckJavascriptMethodName(int connectionSequenceNumber)
getFormCheckJavascriptMethodName
in interface IRepositoryConnector
getFormCheckJavascriptMethodName
in class BaseRepositoryConnector
connectionSequenceNumber
- is the unique number of this connection within the job.
public String getFormPresaveCheckJavascriptMethodName(int connectionSequenceNumber)
getFormPresaveCheckJavascriptMethodName
in interface IRepositoryConnector
getFormPresaveCheckJavascriptMethodName
in class BaseRepositoryConnector
connectionSequenceNumber
- is the unique number of this connection within the job.
public void outputSpecificationHeader(IHTTPOutput out, Locale locale, Specification ds, int connectionSequenceNumber, List<String> tabsArray) throws ManifoldCFException, IOException
outputSpecificationHeader
in interface IRepositoryConnector
outputSpecificationHeader
in class BaseRepositoryConnector
out
- is the output to which any HTML should be sent.
locale
- is the locale the output is preferred to be in.
ds
- is the current document specification for this job.
connectionSequenceNumber
- is the unique number of this connection within the job.
tabsArray
- is an array of tab names. Add to this array any tab names that are specific to the connector.
ManifoldCFException
IOException
public void outputSpecificationBody(IHTTPOutput out, Locale locale, Specification ds, int connectionSequenceNumber, int actualSequenceNumber, String tabName) throws ManifoldCFException, IOException
outputSpecificationBody
in interface IRepositoryConnector
outputSpecificationBody
in class BaseRepositoryConnector
out
- is the output to which any HTML should be sent.
locale
- is the locale the output is preferred to be in.
ds
- is the current document specification for this job.
connectionSequenceNumber
- is the unique number of this connection within the job.
actualSequenceNumber
- is the connection within the job that has currently been selected.
tabName
- is the current tab name. (actualSequenceNumber, tabName) form a unique tuple within the job.
ManifoldCFException
IOException
public String processSpecificationPost(IPostParameters variableContext, Locale locale, Specification ds, int connectionSequenceNumber) throws ManifoldCFException
processSpecificationPost
in interface IRepositoryConnector
processSpecificationPost
in class BaseRepositoryConnector
variableContext
- contains the post data, including binary file-upload information.
locale
- is the locale the output is preferred to be in.
ds
- is the current document specification for this job.
connectionSequenceNumber
- is the unique number of this connection within the job.
ManifoldCFException
public void viewSpecification(IHTTPOutput out, Locale locale, Specification ds, int connectionSequenceNumber) throws ManifoldCFException, IOException
viewSpecification
in interface IRepositoryConnector
viewSpecification
in class BaseRepositoryConnector
out
- is the output to which any HTML should be sent.
locale
- is the locale the output is preferred to be in.
ds
- is the current document specification for this job.
connectionSequenceNumber
- is the unique number of this connection within the job.
ManifoldCFException
IOException
protected String makeSessionLoginEventName(INamingActivity activities, String sequenceKey)
protected String makeDNSEventName(INamingActivity activities, String hostNameKey)
protected int lookupIPAddress(String documentIdentifier, IVersionActivity activities, String hostName, long currentTime, StringBuilder ipAddressBuffer) throws ManifoldCFException, ServiceInterruption
ManifoldCFException
ServiceInterruption
protected static String makeRobotsKey(String protocol, String hostName, int port)
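A robots.txt file applies to one combination of protocol, host, and port, so a key built from those three values lets one parsed robots file serve every document on that origin. The following is a minimal sketch of such a key, not the connector's actual implementation: the class name `RobotsKeySketch` and the `protocol://host:port` layout are assumptions made for illustration.

```java
import java.util.Locale;

// Hypothetical illustration only; the real key format used by
// WebcrawlerConnector.makeRobotsKey is not documented on this page.
final class RobotsKeySketch {

    // Build a cache key from protocol, host name, and port.
    // Protocol and host are lower-cased because both are case-insensitive in URLs.
    static String makeRobotsKey(String protocol, String hostName, int port) {
        return protocol.toLowerCase(Locale.ROOT)
            + "://"
            + hostName.toLowerCase(Locale.ROOT)
            + ":"
            + port;
    }

    public static void main(String[] args) {
        // Two spellings of the same origin map to the same key.
        System.out.println(makeRobotsKey("HTTP", "Example.COM", 80)); // prints "http://example.com:80"
        System.out.println(makeRobotsKey("http", "example.com", 80)); // prints "http://example.com:80"
    }
}
```

Including the port in the key keeps `http://example.com:80` and `http://example.com:8080` apart, which matters because each may serve a different robots file.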
protected String makeRobotsEventName(INamingActivity versionActivities, String robotsKey)
protected int checkFetchAllowed(String documentIdentifier, String protocol, String hostIPAddress, int port, PageCredentials credential, IKeystoreManager trustStore, String hostName, String[] binNames, long currentTime, String pathString, IVersionActivity versionActivities, int connectionLimit, String proxyHost, int proxyPort, String proxyAuthDomain, String proxyAuthUsername, String proxyAuthPassword) throws ManifoldCFException, ServiceInterruption
ManifoldCFException
ServiceInterruption
protected String makeDocumentIdentifier(String parentIdentifier, String rawURL, WebcrawlerConnector.DocumentURLFilter filter) throws ManifoldCFException
parentIdentifier
- the identifier of the document in which the raw url was found, or null if none.
rawURL
- the starting, un-normalized, un-canonicalized URL.
filter
- the filter object, used to remove unmatching URLs.
ManifoldCFException
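To illustrate what turning an absolute or relative URL into a document identifier can involve, here is a minimal sketch built only on `java.net.URI`. It is an assumption-laden approximation rather than the connector's code: the class name `DocumentIdentifierSketch` is hypothetical, the `filter` argument is left out entirely, returning `null` for unusable URLs is an assumed convention, and restricting the result to `http` and `https` is an assumption as well.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical sketch; the real makeDocumentIdentifier also consults a
// DocumentURLFilter and canonicalization rules, which are omitted here.
final class DocumentIdentifierSketch {

    // Resolve rawURL against parentIdentifier (if any), normalize the path,
    // and drop the fragment. Returns null when the URL cannot be used.
    static String makeDocumentIdentifier(String parentIdentifier, String rawURL) {
        try {
            URI uri = new URI(rawURL.trim());
            if (parentIdentifier != null) {
                // Relative references are resolved against the parent document.
                uri = new URI(parentIdentifier).resolve(uri);
            }
            uri = uri.normalize();
            String scheme = uri.getScheme();
            if (scheme == null
                || !(scheme.equalsIgnoreCase("http") || scheme.equalsIgnoreCase("https"))) {
                return null; // assumed: only web protocols are crawlable
            }
            // A fragment never changes which document is fetched, so strip it.
            String text = uri.toString();
            int hash = text.indexOf('#');
            return hash >= 0 ? text.substring(0, hash) : text;
        } catch (URISyntaxException | IllegalArgumentException e) {
            return null; // unparseable URLs are skipped
        }
    }

    public static void main(String[] args) {
        System.out.println(makeDocumentIdentifier(
            "http://example.com/a/b/index.html", "../c/page.html#top"));
        System.out.println(makeDocumentIdentifier(null, "mailto:someone@example.com"));
    }
}
```

Stripping the fragment and normalizing dot segments means several spellings of a link collapse to one identifier, so the same page is not queued more than once.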
protected String doCanonicalization(WebcrawlerConnector.DocumentURLFilter filter, WebURL url) throws ManifoldCFException, URISyntaxException
protected boolean isContentInteresting(IFingerprintActivity activities, String documentIdentifier, int response, String contentType) throws ServiceInterruption, ManifoldCFException
protected String documentIdentifiertoFileName(String documentIdentifier) throws URISyntaxException
documentIdentifier
- is the document identifier to convert to a filename.
URISyntaxException
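One plausible way to derive a filename from a URL-style document identifier is to take the last segment of its path. The sketch below shows that idea under stated assumptions: the class name `FileNameSketch`, the corrected camel-case method name, and the `index.html` fallback for directory-style URLs are all hypothetical and are not taken from the connector's source.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical sketch of a document-identifier-to-filename conversion.
final class FileNameSketch {

    // Return the last path segment of the identifier, or "index.html"
    // (an assumed default) when the path is empty or ends with a slash.
    static String documentIdentifierToFileName(String documentIdentifier)
            throws URISyntaxException {
        String path = new URI(documentIdentifier).getPath();
        if (path == null || path.isEmpty() || path.endsWith("/")) {
            return "index.html";
        }
        return path.substring(path.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) throws URISyntaxException {
        // The query string is not part of the path, so it is not in the name.
        System.out.println(documentIdentifierToFileName("http://example.com/docs/report.pdf?x=1"));
        System.out.println(documentIdentifierToFileName("http://example.com/"));
    }
}
```

As with the documented method, a malformed identifier surfaces as a `URISyntaxException` rather than being swallowed.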
protected String findRedirectionURI(String currentURI) throws ManifoldCFException
ManifoldCFException
protected FormData findHTMLForm(String currentURI, LoginParameters lp) throws ManifoldCFException
ManifoldCFException
protected String findPreferredRedirectionURI(String currentURI, LoginParameters lp) throws ManifoldCFException
ManifoldCFException
protected String findSpecifiedContent(String currentURI, LoginParameters lp) throws ManifoldCFException
ManifoldCFException
protected String findHTMLLinkURI(String currentURI, LoginParameters lp) throws ManifoldCFException
ManifoldCFException
protected boolean extractLinks(String documentIdentifier, IProcessActivity activities, WebcrawlerConnector.DocumentURLFilter filter) throws ManifoldCFException, ServiceInterruption
protected void handleRedirects(String documentURI, IRedirectionHandler handler) throws ManifoldCFException
ManifoldCFException
protected void handleXML(String documentURI, IXMLHandler handler) throws ManifoldCFException, ServiceInterruption
protected void handleHTML(String documentURI, IHTMLHandler handler) throws ManifoldCFException
ManifoldCFException
protected boolean isDocumentText(String documentURI) throws ManifoldCFException
ManifoldCFException
protected static boolean isText(byte[] beginChunk, int chunkLength)
protected static boolean isStrange(byte x)
protected static boolean isWhiteSpace(byte x)
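The three helpers above suggest a byte-level heuristic: classify each byte of the first chunk of a document and treat the document as text when few bytes look unusual. The sketch below is one way such a heuristic could work. It is not the connector's implementation: the class name `TextSniffSketch`, the 30% threshold, and the exact definition of a "strange" byte are assumptions chosen for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of a "does this look like text?" heuristic.
final class TextSniffSketch {

    // Assumed cutoff: at least this fraction of strange bytes means "binary".
    private static final double STRANGE_RATIO_LIMIT = 0.30;

    // Tab, line feed, carriage return, and space count as whitespace.
    static boolean isWhiteSpace(byte x) {
        return x == 0x09 || x == 0x0a || x == 0x0d || x == 0x20;
    }

    // Assumed definition: control bytes (other than whitespace), DEL,
    // and any byte with the high bit set are "strange".
    static boolean isStrange(byte x) {
        int value = x & 0xff; // treat the byte as unsigned
        return (value < 0x20 || value >= 0x7f) && !isWhiteSpace(x);
    }

    // Examine the first chunkLength bytes; an empty chunk is treated as text.
    static boolean isText(byte[] beginChunk, int chunkLength) {
        if (chunkLength == 0) {
            return true;
        }
        int strange = 0;
        for (int i = 0; i < chunkLength; i++) {
            if (isStrange(beginChunk[i])) {
                strange++;
            }
        }
        return ((double) strange) / chunkLength < STRANGE_RATIO_LIMIT;
    }

    public static void main(String[] args) {
        byte[] text = "Hello, world\n".getBytes(StandardCharsets.US_ASCII);
        byte[] binary = {0, 1, 2, 3, 'a'};
        System.out.println(isText(text, text.length));     // prints "true"
        System.out.println(isText(binary, binary.length)); // prints "false"
    }
}
```

Counting high-bit bytes as strange is a simplification: a document that is mostly multi-byte UTF-8 would be misclassified by this sketch, so a real heuristic would need to be more lenient there.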
protected static List<String> stringToArray(String input)
protected static void compileList(List<Pattern> output, List<String> input) throws ManifoldCFException
ManifoldCFException
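stringToArray and compileList naturally work as a pair: a multi-line configuration value is first split into individual expressions, and those expressions are then compiled into `java.util.regex.Pattern` objects. The sketch below shows that pipeline under stated assumptions. The class name `PatternListSketch` is hypothetical, skipping blank lines and lines starting with `#` is an assumed convention, and `IllegalArgumentException` stands in for `ManifoldCFException` so that the example compiles without the ManifoldCF libraries.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical sketch of splitting a text block into regular expressions
// and compiling them.
final class PatternListSketch {

    // One entry per non-blank line; lines starting with '#' are treated
    // as comments (an assumption, not documented behavior).
    static List<String> stringToArray(String input) {
        List<String> result = new ArrayList<>();
        for (String line : input.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            result.add(trimmed);
        }
        return result;
    }

    // Compile every expression in input and append it to output.
    // IllegalArgumentException is a stand-in for ManifoldCFException.
    static void compileList(List<Pattern> output, List<String> input) {
        for (String expression : input) {
            try {
                output.add(Pattern.compile(expression));
            } catch (PatternSyntaxException e) {
                throw new IllegalArgumentException(
                    "Bad regular expression '" + expression + "': " + e.getMessage(), e);
            }
        }
    }

    public static void main(String[] args) {
        String configured = "# include rules\n^https?://example\\.com/.*\n\n\\.pdf$\n";
        List<Pattern> patterns = new ArrayList<>();
        compileList(patterns, stringToArray(configured));
        System.out.println(patterns.size()); // prints "2"
    }
}
```

Passing the output list in, rather than returning a new one, mirrors the documented signature and lets a caller accumulate patterns from several configuration fields into one list.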
protected PageCredentials getPageCredential(String documentIdentifier)
protected SequenceCredentials getSequenceCredential(String documentIdentifier)
protected IKeystoreManager getTrustStore(String documentIdentifier) throws ManifoldCFException
ManifoldCFException
protected static String[] getAcls(Specification spec)
spec
- is the document specification.
protected static List<WebcrawlerConnector.NameValue> findMetadata(Specification spec) throws ManifoldCFException
ManifoldCFException
protected static Set<String> findExcludedHeaders(Specification spec) throws ManifoldCFException
ManifoldCFException
protected String[] calculateDocumentEvents(INamingActivity activities, String documentIdentifier)