public class RSSConnector extends BaseRepositoryConnector
Modifier and Type | Class and Description |
---|---|
protected static class | RSSConnector.CanonicalizationPolicies: Class representing a list of canonicalization rules |
protected static class | RSSConnector.CanonicalizationPolicy: Class representing a URL regular expression match, for the purposes of determining canonicalization policy |
protected static class | RSSConnector.EvaluatorToken: Evaluator token. |
protected static class | RSSConnector.EvaluatorTokenStream: Token stream. |
protected class | RSSConnector.FeedAuthorContextClass |
protected class | RSSConnector.FeedContextClass |
protected class | RSSConnector.FeedItemContextClass |
protected static class | RSSConnector.Filter: Class that handles parsing and interpretation of the document specification. |
protected static class | RSSConnector.MappingRule: Class representing a mapping rule |
protected static class | RSSConnector.MappingRules: Class that represents all mappings |
protected static class | RSSConnector.NameValue: Name/value class |
protected class | RSSConnector.OuterContextClass: This class handles the outermost XML context for the feed document. |
protected class | RSSConnector.RDFContextClass |
protected class | RSSConnector.RDFItemContextClass |
protected class | RSSConnector.RSSChannelContextClass |
protected class | RSSConnector.RSSContextClass |
protected class | RSSConnector.RSSItemContextClass |
protected static class | RSSConnector.ThrottleSpec: The throttle specification class. |
protected class | RSSConnector.UrlsetContextClass |
protected class | RSSConnector.UrlsetItemContextClass |
Modifier and Type | Field and Description |
---|---|
static String | _rcsid |
static String | ACTIVITY_FETCH |
static String | ACTIVITY_PROCESS |
static String | ACTIVITY_ROBOTSPARSE |
protected static DataCache | cache |
static int | CHROMED_METADATA_ONLY: Chromed suppression mode - index metadata only if dechromed content not available |
static int | CHROMED_SKIP: Chromed suppression mode - skip documents if dechromed content not available |
static int | CHROMED_USE: Chromed suppression mode - use chromed content if dechromed content not available |
static int | DECHROMED_CONTENT: Dechromed content mode - content field |
static int | DECHROMED_DESCRIPTION: Dechromed content mode - description field |
static int | DECHROMED_NONE: Dechromed content mode - none |
protected ThrottledFetcher | fetcher: The throttled fetcher used by this instance |
protected static Map<String,ThrottledFetcher> | fetcherMap: Storage for fetcher objects |
protected String | from: The email address for this connector instance |
protected boolean | isInitialized: Flag indicating whether session data is initialized |
protected int | maxOpenConnectionsPerServer: The maximum open connections |
protected double | minimumMillisecondsPerBytePerServer: The minimum milliseconds between bytes |
protected long | minimumMillisecondsPerFetchPerServer: The minimum milliseconds between fetches |
protected String | proxyAuthDomain: Proxy auth domain |
protected String | proxyAuthPassword: Proxy auth password |
protected String | proxyAuthUsername: Proxy auth username |
protected String | proxyHost: The proxy host |
protected int | proxyPort: The proxy port |
protected Robots | robots: The robots object used by this instance |
protected static int | ROBOTS_ALL |
protected static int | ROBOTS_DATA |
protected static int | ROBOTS_NONE |
protected static Map | robotsMap: Storage for robots objects |
protected int | robotsUsage: Robots usage flag |
protected static String | rssThrottleGroupType |
protected String | throttleGroupName: The throttle group name |
protected static Map | understoodProtocols |
protected String | userAgent: The user-agent for this connector instance |
protected static Set<String> | xmlContentTypes |
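The dechromed and chromed mode constants listed above describe a fallback decision: prefer dechromed content when it is available, otherwise act according to the chromed suppression mode. The following is a minimal sketch of that decision only. The enums, the `Action` type, and the `decide` method are hypothetical stand-ins invented for illustration (the connector itself uses `int` constants whose values are not shown here), and treating `DECHROMED_NONE` as "dechromed content is never available" is an assumption.

```java
// Hypothetical sketch: none of these types exist in RSSConnector.
final class ChromeModeSketch {
  // Stand-ins for DECHROMED_NONE, DECHROMED_DESCRIPTION, DECHROMED_CONTENT.
  enum DechromedMode { NONE, DESCRIPTION, CONTENT }

  // Stand-ins for CHROMED_USE, CHROMED_SKIP, CHROMED_METADATA_ONLY.
  enum ChromedMode { USE, SKIP, METADATA_ONLY }

  // What to do with a document (illustrative only).
  enum Action { INDEX_DECHROMED, INDEX_CHROMED, INDEX_METADATA_ONLY, SKIP_DOCUMENT }

  private ChromeModeSketch() {}

  static Action decide(DechromedMode dechromed, ChromedMode chromed, boolean dechromedAvailable) {
    // Assumption: with mode NONE, dechromed content is never used.
    if (dechromed != DechromedMode.NONE && dechromedAvailable) {
      return Action.INDEX_DECHROMED;
    }
    // Dechromed content is not available: fall back per the chromed mode.
    switch (chromed) {
      case SKIP:
        return Action.SKIP_DOCUMENT;
      case METADATA_ONLY:
        return Action.INDEX_METADATA_ONLY;
      default:
        return Action.INDEX_CHROMED;
    }
  }
}
```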
Fields inherited from class BaseConnector: currentContext, params
Fields inherited from interface IRepositoryConnector: GLOBAL_DENY_TOKEN, JOBMODE_CONTINUOUS, JOBMODE_ONCEONLY, MODEL_ADD, MODEL_ADD_CHANGE, MODEL_ADD_CHANGE_DELETE, MODEL_ALL, MODEL_CHAINED_ADD, MODEL_CHAINED_ADD_CHANGE, MODEL_CHAINED_ADD_CHANGE_DELETE, MODEL_PARTIAL
Constructor and Description |
---|
RSSConnector(): Constructor. |
Modifier and Type | Method and Description |
---|---|
String | addSeedDocuments(ISeedingActivity activities, Specification spec, String lastSeedVersion, long seedTime, int jobMode): Queue "seed" documents. |
String | check(): Check status of connection. |
protected static void | compileList(ArrayList output, ArrayList input): Compile all regexp entries in the passed in list, and add them to the output list. |
void | connect(ConfigParams configParams): Connect. |
void | disconnect(): Close the connection. |
protected static String | doCanonicalization(RSSConnector.CanonicalizationPolicy p, WebURL url): Code to canonicalize a URL. |
String[] | getActivitiesList(): Return the list of activities that this connector supports. |
String[] | getBinNames(String documentIdentifier): Get the bin name string for a document identifier. |
int | getConnectorModel(): Tell the world what model this connector uses for getDocumentIdentifiers(). |
protected ThrottledFetcher | getFetcher(): Given the current parameters, find the correct throttled fetcher object (or create one if not there). |
String | getFormCheckJavascriptMethodName(int connectionSequenceNumber): Obtain the name of the form check javascript method to call. |
String | getFormPresaveCheckJavascriptMethodName(int connectionSequenceNumber): Obtain the name of the form presave check javascript method to call. |
int | getMaxDocumentRequest(): Get the maximum number of documents to amalgamate together into one batch, for this connector. |
protected Robots | getRobots(ThrottledFetcher fetcher): Given the current parameters, find the correct robots object (or create one if none found). |
protected void | getSession(): Establish a session |
protected static void | handleIOException(IOException e, String context) |
protected void | handleRSSFeedSAX(String documentIdentifier, IProcessActivity activities, RSSConnector.Filter filter): Handle an RSS feed document, using SAX to limit the memory impact |
protected static String | makeDocumentIdentifier(RSSConnector.CanonicalizationPolicies policies, String parentIdentifier, String rawURL): Convert an absolute or relative URL to a document identifier. |
void | outputConfigurationBody(IThreadContext threadContext, IHTTPOutput out, Locale locale, ConfigParams parameters, String tabName): Output the configuration body section. |
void | outputConfigurationHeader(IThreadContext threadContext, IHTTPOutput out, Locale locale, ConfigParams parameters, List<String> tabsArray): Output the configuration header section. |
void | outputSpecificationBody(IHTTPOutput out, Locale locale, Specification ds, int connectionSequenceNumber, int actualSequenceNumber, String tabName): Output the specification body section. |
void | outputSpecificationHeader(IHTTPOutput out, Locale locale, Specification ds, int connectionSequenceNumber, List<String> tabsArray): Output the specification header section. |
void | poll(): This method is periodically called for all connectors that are connected but not in active use. |
String | processConfigurationPost(IThreadContext threadContext, IPostParameters variableContext, Locale locale, ConfigParams parameters): Process a configuration post. |
void | processDocuments(String[] documentIdentifiers, IExistingVersions statuses, Specification spec, IProcessActivity activities, int jobMode, boolean usesDefaultAuthority): Process a set of documents. |
String | processSpecificationPost(IPostParameters variableContext, Locale locale, Specification ds, int connectionSequenceNumber): Process a specification post. |
protected static ArrayList | stringToArray(String input): Read a string as a sequence of individual expressions, urls, etc. |
void | viewConfiguration(IThreadContext threadContext, IHTTPOutput out, Locale locale, ConfigParams parameters): View configuration. |
void | viewSpecification(IHTTPOutput out, Locale locale, Specification ds, int connectionSequenceNumber): View specification. |
Methods inherited from class BaseRepositoryConnector: addSeedDocuments, addSeedDocuments, addSeedDocuments, getDocumentIdentifiers, getDocumentIdentifiers, getDocumentVersions, getDocumentVersions, getDocumentVersions, getDocumentVersions, getDocumentVersions, getDocumentVersions, getDocumentVersions, getRelationshipTypes, getRemainingDocumentIdentifiers, outputSpecificationBody, outputSpecificationBody, outputSpecificationHeader, outputSpecificationHeader, outputSpecificationHeader, processDocuments, processDocuments, processDocuments, processDocuments, processSpecificationPost, processSpecificationPost, releaseDocumentVersions, releaseDocumentVersions, requestInfo, viewSpecification, viewSpecification
Methods inherited from class BaseConnector: clearThreadContext, deinstall, getConfiguration, install, isConnected, outputConfigurationBody, outputConfigurationHeader, outputConfigurationHeader, pack, packFixedList, packList, packList, processConfigurationPost, setThreadContext, unpack, unpackFixedList, unpackList, viewConfiguration
Methods inherited from class Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface IConnector: clearThreadContext, deinstall, getConfiguration, install, isConnected, setThreadContext
public static final String _rcsid
protected static final String rssThrottleGroupType
protected static final int ROBOTS_NONE
protected static final int ROBOTS_DATA
protected static final int ROBOTS_ALL
public static final int DECHROMED_NONE
public static final int DECHROMED_DESCRIPTION
public static final int DECHROMED_CONTENT
public static final int CHROMED_USE
public static final int CHROMED_SKIP
public static final int CHROMED_METADATA_ONLY
protected int robotsUsage
protected String userAgent
protected String from
protected long minimumMillisecondsPerFetchPerServer
protected int maxOpenConnectionsPerServer
protected double minimumMillisecondsPerBytePerServer
protected String throttleGroupName
protected String proxyHost
protected int proxyPort
protected String proxyAuthDomain
protected String proxyAuthUsername
protected String proxyAuthPassword
protected ThrottledFetcher fetcher
protected Robots robots
protected static Map<String,ThrottledFetcher> fetcherMap
protected static Map robotsMap
protected boolean isInitialized
protected static DataCache cache
protected static final Map understoodProtocols
public static final String ACTIVITY_FETCH
public static final String ACTIVITY_ROBOTSPARSE
public static final String ACTIVITY_PROCESS
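The `minimumMillisecondsPerFetchPerServer` field above is described as the minimum milliseconds between fetches. As a rough illustration of what such a limit implies, the sketch below spaces out fetch start times per server. It is not the connector's `ThrottledFetcher`: the class `FetchThrottleSketch` and its `reserve` method are hypothetical, and the caller passes in the clock value so that the behaviour is deterministic.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of per-server fetch spacing; not the ThrottledFetcher class.
final class FetchThrottleSketch {
  private final long minimumMillisecondsPerFetchPerServer;
  // Scheduled start time of the most recent fetch, keyed by server name.
  private final Map<String, Long> lastFetchStart = new HashMap<>();

  FetchThrottleSketch(long minimumMillisecondsPerFetchPerServer) {
    this.minimumMillisecondsPerFetchPerServer = minimumMillisecondsPerFetchPerServer;
  }

  /**
   * Reserves the next fetch slot for a server.
   * Returns how many milliseconds the caller should wait, measured from nowMillis.
   */
  synchronized long reserve(String server, long nowMillis) {
    Long last = lastFetchStart.get(server);
    long start = (last == null)
        ? nowMillis
        : Math.max(nowMillis, last + minimumMillisecondsPerFetchPerServer);
    lastFetchStart.put(server, start);
    return start - nowMillis;
  }
}
```

In this sketch each server has its own schedule, so a slow interval on one server never delays fetches from another.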
protected void getSession() throws ManifoldCFException
ManifoldCFException
public String[] getActivitiesList()
getActivitiesList
in interface IRepositoryConnector
getActivitiesList
in class BaseRepositoryConnector
public int getConnectorModel()
getConnectorModel
in interface IRepositoryConnector
getConnectorModel
in class BaseRepositoryConnector
public void connect(ConfigParams configParams)
connect
in interface IConnector
connect
in class BaseConnector
configParams
- are the configuration parameters for this connection.
Note well: There are no exceptions allowed from this call, since it is expected to mainly establish connection parameters.
public void poll() throws ManifoldCFException
poll
in interface IConnector
poll
in class BaseConnector
ManifoldCFException
public String check() throws ManifoldCFException
check
in interface IConnector
check
in class BaseConnector
ManifoldCFException
public void disconnect() throws ManifoldCFException
disconnect
in interface IConnector
disconnect
in class BaseConnector
ManifoldCFException
public String[] getBinNames(String documentIdentifier)
getBinNames
in interface IRepositoryConnector
getBinNames
in class BaseRepositoryConnector
documentIdentifier
- is the document identifier.
public String addSeedDocuments(ISeedingActivity activities, Specification spec, String lastSeedVersion, long seedTime, int jobMode) throws ManifoldCFException, ServiceInterruption
addSeedDocuments
in interface IRepositoryConnector
addSeedDocuments
in class BaseRepositoryConnector
activities
- is the interface this method should use to perform whatever framework actions are desired.
spec
- is a document specification (that comes from the job).
seedTime
- is the end of the time range of documents to consider, exclusive.
lastSeedVersion
- is the last seeding version string for this job, or null if the job has no previous seeding version string.
jobMode
- is an integer describing how the job is being run, whether continuous or once-only.
ManifoldCFException
ServiceInterruption
protected static String makeDocumentIdentifier(RSSConnector.CanonicalizationPolicies policies, String parentIdentifier, String rawURL) throws ManifoldCFException
policies
- are the canonicalization policies in effect.
parentIdentifier
- the identifier of the document in which the raw url was found, or null if none.
rawURL
- is the raw, un-normalized and un-canonicalized url.
ManifoldCFException
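To illustrate the idea of converting an absolute or relative URL into a document identifier, here is a minimal sketch built only on `java.net.URI`. It resolves `rawURL` against `parentIdentifier` when one is given, normalizes the path, and drops the fragment. It does not apply canonicalization policies, and the class name, the method name `makeIdentifier`, and the choice to return null for unusable URLs are all assumptions made for this example rather than the connector's actual behaviour.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical sketch; the real makeDocumentIdentifier also applies canonicalization policies.
final class DocumentIdentifierSketch {
  private DocumentIdentifierSketch() {}

  static String makeIdentifier(String parentIdentifier, String rawURL) {
    try {
      URI raw = new URI(rawURL.trim());
      // Relative URLs are resolved against the document they were found in.
      URI resolved = (parentIdentifier == null) ? raw : new URI(parentIdentifier).resolve(raw);
      // Assumption: only absolute URLs with a host make usable identifiers.
      if (!resolved.isAbsolute() || resolved.getHost() == null) {
        return null;
      }
      resolved = resolved.normalize(); // removes "." and ".." path segments
      StringBuilder sb = new StringBuilder();
      sb.append(resolved.getScheme()).append("://").append(resolved.getRawAuthority());
      if (resolved.getRawPath() != null) {
        sb.append(resolved.getRawPath());
      }
      if (resolved.getRawQuery() != null) {
        sb.append('?').append(resolved.getRawQuery());
      }
      // The fragment is intentionally left out: it does not identify a separate document.
      return sb.toString();
    } catch (URISyntaxException e) {
      return null; // malformed URL
    }
  }
}
```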
protected static String doCanonicalization(RSSConnector.CanonicalizationPolicy p, WebURL url) throws ManifoldCFException, URISyntaxException
public void processDocuments(String[] documentIdentifiers, IExistingVersions statuses, Specification spec, IProcessActivity activities, int jobMode, boolean usesDefaultAuthority) throws ManifoldCFException, ServiceInterruption
processDocuments
in interface IRepositoryConnector
processDocuments
in class BaseRepositoryConnector
documentIdentifiers
- is the set of document identifiers to process.
statuses
- are the currently-stored document versions for each document in the set of document identifiers passed in above.
activities
- is the interface this method should use to queue up new document references and ingest documents.
jobMode
- is an integer describing how the job is being run, whether continuous or once-only.
usesDefaultAuthority
- will be true only if the authority in use for these documents is the default one.
ManifoldCFException
ServiceInterruption
protected static void handleIOException(IOException e, String context) throws ManifoldCFException, ServiceInterruption
public void outputConfigurationHeader(IThreadContext threadContext, IHTTPOutput out, Locale locale, ConfigParams parameters, List<String> tabsArray) throws ManifoldCFException, IOException
outputConfigurationHeader
in interface IConnector
outputConfigurationHeader
in class BaseConnector
threadContext
- is the local thread context.
out
- is the output to which any HTML should be sent.
parameters
- are the configuration parameters, as they currently exist, for this connection being configured.
tabsArray
- is an array of tab names. Add to this array any tab names that are specific to the connector.
ManifoldCFException
IOException
public void outputConfigurationBody(IThreadContext threadContext, IHTTPOutput out, Locale locale, ConfigParams parameters, String tabName) throws ManifoldCFException, IOException
outputConfigurationBody
in interface IConnector
outputConfigurationBody
in class BaseConnector
threadContext
- is the local thread context.
out
- is the output to which any HTML should be sent.
parameters
- are the configuration parameters, as they currently exist, for this connection being configured.
tabName
- is the current tab name.
ManifoldCFException
IOException
public String processConfigurationPost(IThreadContext threadContext, IPostParameters variableContext, Locale locale, ConfigParams parameters) throws ManifoldCFException
processConfigurationPost
in interface IConnector
processConfigurationPost
in class BaseConnector
threadContext
- is the local thread context.
variableContext
- is the set of variables available from the post, including binary file post information.
parameters
- are the configuration parameters, as they currently exist, for this connection being configured.
ManifoldCFException
public void viewConfiguration(IThreadContext threadContext, IHTTPOutput out, Locale locale, ConfigParams parameters) throws ManifoldCFException, IOException
viewConfiguration
in interface IConnector
viewConfiguration
in class BaseConnector
threadContext
- is the local thread context.
out
- is the output to which any HTML should be sent.
parameters
- are the configuration parameters, as they currently exist, for this connection being configured.
ManifoldCFException
IOException
public String getFormCheckJavascriptMethodName(int connectionSequenceNumber)
getFormCheckJavascriptMethodName
in interface IRepositoryConnector
getFormCheckJavascriptMethodName
in class BaseRepositoryConnector
connectionSequenceNumber
- is the unique number of this connection within the job.
public String getFormPresaveCheckJavascriptMethodName(int connectionSequenceNumber)
getFormPresaveCheckJavascriptMethodName
in interface IRepositoryConnector
getFormPresaveCheckJavascriptMethodName
in class BaseRepositoryConnector
connectionSequenceNumber
- is the unique number of this connection within the job.
public void outputSpecificationHeader(IHTTPOutput out, Locale locale, Specification ds, int connectionSequenceNumber, List<String> tabsArray) throws ManifoldCFException, IOException
outputSpecificationHeader
in interface IRepositoryConnector
outputSpecificationHeader
in class BaseRepositoryConnector
out
- is the output to which any HTML should be sent.
locale
- is the locale the output is preferred to be in.
ds
- is the current document specification for this job.
connectionSequenceNumber
- is the unique number of this connection within the job.
tabsArray
- is an array of tab names. Add to this array any tab names that are specific to the connector.
ManifoldCFException
IOException
public void outputSpecificationBody(IHTTPOutput out, Locale locale, Specification ds, int connectionSequenceNumber, int actualSequenceNumber, String tabName) throws ManifoldCFException, IOException
outputSpecificationBody
in interface IRepositoryConnector
outputSpecificationBody
in class BaseRepositoryConnector
out
- is the output to which any HTML should be sent.
locale
- is the locale the output is preferred to be in.
ds
- is the current document specification for this job.
connectionSequenceNumber
- is the unique number of this connection within the job.
actualSequenceNumber
- is the connection within the job that has currently been selected.
tabName
- is the current tab name. (actualSequenceNumber, tabName) form a unique tuple within the job.
ManifoldCFException
IOException
public String processSpecificationPost(IPostParameters variableContext, Locale locale, Specification ds, int connectionSequenceNumber) throws ManifoldCFException
processSpecificationPost
in interface IRepositoryConnector
processSpecificationPost
in class BaseRepositoryConnector
variableContext
- contains the post data, including binary file-upload information.
locale
- is the locale the output is preferred to be in.
ds
- is the current document specification for this job.
connectionSequenceNumber
- is the unique number of this connection within the job.
ManifoldCFException
public void viewSpecification(IHTTPOutput out, Locale locale, Specification ds, int connectionSequenceNumber) throws ManifoldCFException, IOException
viewSpecification
in interface IRepositoryConnector
viewSpecification
in class BaseRepositoryConnector
out
- is the output to which any HTML should be sent.
locale
- is the locale the output is preferred to be in.
ds
- is the current document specification for this job.
connectionSequenceNumber
- is the unique number of this connection within the job.
ManifoldCFException
IOException
protected void handleRSSFeedSAX(String documentIdentifier, IProcessActivity activities, RSSConnector.Filter filter) throws ManifoldCFException, ServiceInterruption
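The summary for `handleRSSFeedSAX` notes that SAX is used to limit the memory impact of feed parsing. The sketch below shows the general technique with the JDK's own SAX parser: the handler keeps only a small amount of state and collects the `<link>` of each `<item>` as the parser streams through the feed, so no document tree is built. The class `RssLinkSketch` and the method `extractItemLinks` are hypothetical, and the sketch handles only plain RSS `<item>` elements, not the RDF, Atom, or urlset variants that the nested context classes suggest the connector supports.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch of streaming link extraction; not the connector's own parser.
final class RssLinkSketch {
  private RssLinkSketch() {}

  static List<String> extractItemLinks(String xml) {
    final List<String> links = new ArrayList<>();
    DefaultHandler handler = new DefaultHandler() {
      private boolean inItem = false;
      private StringBuilder text = null; // non-null only while inside an item's <link>

      @Override
      public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (qName.equals("item")) {
          inItem = true;
        } else if (inItem && qName.equals("link")) {
          text = new StringBuilder();
        }
      }

      @Override
      public void characters(char[] ch, int start, int length) {
        if (text != null) {
          text.append(ch, start, length);
        }
      }

      @Override
      public void endElement(String uri, String localName, String qName) {
        if (qName.equals("item")) {
          inItem = false;
        } else if (text != null && qName.equals("link")) {
          links.add(text.toString().trim());
          text = null;
        }
      }
    };
    try {
      SAXParserFactory.newInstance().newSAXParser()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
    } catch (Exception e) {
      throw new IllegalArgumentException("Could not parse feed", e);
    }
    return links;
  }
}
```

The channel-level `<link>` is ignored here because it sits outside any `<item>`; only item links become candidate document references.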
public int getMaxDocumentRequest()
getMaxDocumentRequest
in interface IRepositoryConnector
getMaxDocumentRequest
in class BaseRepositoryConnector
protected ThrottledFetcher getFetcher()
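`getFetcher()` is summarized as finding the correct throttled fetcher for the current parameters, or creating one if none exists, and `fetcherMap` is described as static storage for fetcher objects. A get-or-create lookup of that kind might look like the sketch below. The nested `Fetcher` class is a hypothetical stand-in for `ThrottledFetcher`, and keying the map by throttle group name is an assumption suggested by the `Map<String,ThrottledFetcher>` type, not something this page confirms.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a shared get-or-create registry.
final class FetcherRegistrySketch {
  // Stand-in for ThrottledFetcher.
  static final class Fetcher {
    final String throttleGroupName;

    Fetcher(String throttleGroupName) {
      this.throttleGroupName = throttleGroupName;
    }
  }

  // Shared across all connector instances, like the static fetcherMap field.
  private static final Map<String, Fetcher> fetcherMap = new HashMap<>();

  private FetcherRegistrySketch() {}

  static Fetcher getFetcher(String throttleGroupName) {
    // Lock the map so that two threads cannot create duplicates for one key.
    synchronized (fetcherMap) {
      Fetcher fetcher = fetcherMap.get(throttleGroupName);
      if (fetcher == null) {
        fetcher = new Fetcher(throttleGroupName);
        fetcherMap.put(throttleGroupName, fetcher);
      }
      return fetcher;
    }
  }
}
```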
protected static ArrayList stringToArray(String input)
protected static void compileList(ArrayList output, ArrayList input) throws ManifoldCFException
ManifoldCFException
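`stringToArray` and `compileList` together suggest a two-step pipeline: split a block of text into individual expressions, then compile each one as a regular expression. The sketch below shows one plausible version using typed lists. The one-entry-per-line format, the skipping of blank lines and of lines starting with `#`, and the use of `IllegalArgumentException` in place of `ManifoldCFException` are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical sketch of the split-then-compile pipeline.
final class RegexpListSketch {
  private RegexpListSketch() {}

  // Assumption: one expression per line; blank lines and '#' comment lines are ignored.
  static List<String> stringToList(String input) {
    List<String> output = new ArrayList<>();
    for (String line : input.split("\\r?\\n")) {
      String trimmed = line.trim();
      if (trimmed.isEmpty() || trimmed.startsWith("#")) {
        continue;
      }
      output.add(trimmed);
    }
    return output;
  }

  // Compiles every entry of input and appends the results to output.
  static void compileList(List<Pattern> output, List<String> input) {
    for (String expression : input) {
      try {
        output.add(Pattern.compile(expression));
      } catch (PatternSyntaxException e) {
        throw new IllegalArgumentException(
            "Bad regular expression '" + expression + "': " + e.getDescription(), e);
      }
    }
  }
}
```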
protected Robots getRobots(ThrottledFetcher fetcher)