An accessor to an inverted file.
#include <CAcIFFileSystem.h>
Public Member Functions | |
bool | operator() () const |
for testing if the inverted file is correctly constructed | |
CAcIFFileSystem (const CXMLElement &inCollectionElement) | |
This opens an existing inverted file, and then initializes this structure. | |
bool | init (bool) |
called by constructors | |
~CAcIFFileSystem () | |
Destructor. | |
string | IDToURL (TID inID) const |
Translate a DocumentID to a URL (for output) | |
virtual pair< bool, TID > | URLToID (const string &inURL) const |
Translate a URL to its document ID. | |
void | getAllIDs (list< TID > &) const |
List of the IDs of all documents present in the inverted file. | |
void | getAllAccessorElements (list< CAccessorElement > &) const |
List of triplets (ID,imageURL,thumbnailURL) of all the documents present in the inverted file. | |
void | getRandomIDs (list< TID > &, list< TID >::size_type) const |
Get a given number of random document IDs. | |
void | getRandomAccessorElements (list< CAccessorElement > &outResult, list< CAccessorElement >::size_type inSize) const |
For drawing random sets. | |
int | size () const |
The number of images in this accessor. | |
TID | getMaximumFeatureID () const |
This is interesting for browsing. | |
list< TID > * | getAllFeatureIDs () const |
Getting a list of all features contained in this. | |
virtual pair< bool, CAccessorElement > | IDToAccessorElement (TID inID) const |
Translate a DocumentID to an accessor Element. | |
operator bool () const | |
is this well constructed? | |
The proper inverted file access | |
CDocumentFrequencyList * | FeatureToList (TFeatureID) const |
List of documents containing the feature. | |
CDocumentFrequencyList * | URLToFeatureList (string inURL) const |
List of features contained by a document. | |
CDocumentFrequencyList * | DIDToFeatureList (TID inDID) const |
List of features contained by a document with ID inDID. | |
Accessing information about features | |
double | FeatureToCollectionFrequency (TFeatureID) const |
Collection frequency for a given feature. | |
unsigned int | getFeatureDescription (TID inFeatureID) const |
What kind of feature is the feature with ID inFeatureID? | |
Accessing additional document information | |
double | DIDToMaxDocumentFrequency (TID) const |
returns the maximum document frequency for one document ID | |
double | DIDToDFSquareSum (TID) const |
Returns the document-frequency square sum for a given document ID. | |
double | DIDToSquareDFLogICFSum (TID) const |
Returns this function for a given document ID. | |
bool | generateInvertedFile () |
Generating an inverted File, if there is none. | |
bool | newGenerateInvertedFile () |
Generating an inverted File, if there is none. | |
bool | checkConsistency () |
Check the consistency of the inverted file system accessed by this accessor. | |
bool | findWithinStream (TID inFeatureID, TID inDocumentID, double inDocumentFrequency) const |
Is the Document with inDocumentID contained in the document frequency list of the feature inFeatureID and is the associated document frequency the same? | |
Protected Types | |
typedef HASH_MAP< TID, streampos > | CIDToOffset |
map from feature id to the offset for this feature | |
Protected Member Functions | |
void | writeOffsetFileElement (TID inFeatureID, streampos inPosition, ostream &inOpenOffsetFile) |
add a pair of FeatureID,Offset to the open offset file (helper function for inverted file construction) | |
CDocumentFrequencyList * | getFeatureFile (string inFileName) const |
loads a *.fts file. | |
Protected Attributes | |
CMutex | mMutex |
the mutex for multi threading | |
CSelfDestroyPointer< CAcURL2FTS > | mURL2FTS |
In order to have just one parent, I have to limit myself to single inheritance. | |
TID | mMaximumFeatureID |
the maximum feature ID arising in this file | |
string | mInvertedFileBuffer |
A buffer, if the inverted file is to be held in ram. | |
string | mTemporaryIndexingFileBase |
Some place for putting temporary indexing data. | |
CSelfDestroyPointer< istream > | mInvertedFile |
The inverted file. | |
ifstream | mOffsetFile |
Feature -> Offset in inverted file. | |
ifstream | mFeatureDescriptionFile |
File of feature descriptions. | |
string | mInvertedFileName |
Name of the inverted file. | |
string | mOffsetFileName |
Name of the Offset file. | |
string | mFeatureDescriptionFileName |
Name for the file with the feature description. | |
CIDToOffset | mIDToOffset |
map from feature id to the offset for this feature | |
HASH_MAP< TID, double > | mFeatureToCollectionFrequency |
map from feature to the collection frequency | |
for fast access... | |
HASH_MAP< TID, unsigned int > | mFeatureDescription |
map from the feature ID to the feature description | |
CADIHash | mDocumentInformation |
additional information about the document, e.g. the Euclidean length of the feature list. |
An accessor to an inverted file.
This access is done "by hand".
For a long time we wanted to move to memory-mapped files (like SWISH++), but currently I think this is not the best idea.
CAcIFFileSystem::CAcIFFileSystem | ( | const CXMLElement & | inCollectionElement | ) |
This opens an existing inverted file, and then initializes this structure.
After that it is fully usable.
As a parameter it takes an XMLElement which contains a "collection" element and its content.
If the attribute cui-generate-inverted-file is true, then a new inverted file will be generated using the parameters given in inCollectionElement. In that case you will NOT be able to use *this afterwards.
Like every accessor, this accessor takes a <collection> MRML element as input (
cui-base-dir: the directory containing the following files
cui-inverted-file-location: the location of the inverted file
cui-offset-file-location: a file containing offsets into the inverted file
cui-feature-file-location: the location of the "url2fts" file which translates URLs to feature file names.
bool CAcIFFileSystem::checkConsistency | ( | ) | [virtual] |
Check the consistency of the inverted file system accessed by this accessor.
Implements CAcInvertedFile.
bool CAcIFFileSystem::findWithinStream | ( | TID | inFeatureID, |
TID | inDocumentID, | ||
double | inDocumentFrequency | ||
) | const |
Is the Document with inDocumentID contained in the document frequency list of the feature inFeatureID and is the associated document frequency the same?
inFeature<id | the |
Reimplemented from CAcInvertedFile.
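The check described above can be sketched as follows. This is only an illustration, not the actual implementation: the record layout (a count followed by document ID / frequency pairs in text form), the names findWithinStreamSketch and OffsetMap, and the comparison tolerance are all assumptions made for the example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <unordered_map>

using TID = unsigned long;
// Hypothetical stand-in for CIDToOffset: feature ID -> offset in the inverted file.
using OffsetMap = std::unordered_map<TID, std::streampos>;

// Assumed record layout (NOT the real file format): at each feature's offset the
// stream holds "<count> <docID> <df> <docID> <df> ..." as whitespace-separated text.
// Returns true if inDocumentID is listed for inFeatureID with (nearly) the given frequency.
bool findWithinStreamSketch(std::istream& inFile, const OffsetMap& inOffsets,
                            TID inFeatureID, TID inDocumentID,
                            double inDocumentFrequency) {
    assert(inDocumentFrequency >= 0.0);  // frequencies are assumed non-negative

    const auto found = inOffsets.find(inFeatureID);
    if (found == inOffsets.end()) return false;  // feature has no list in the file

    inFile.clear();              // reset eof/fail bits left by earlier reads
    inFile.seekg(found->second); // jump to the start of this feature's list

    std::size_t count = 0;
    if (!(inFile >> count)) return false;
    for (std::size_t i = 0; i < count; ++i) {
        TID documentID = 0;
        double frequency = 0.0;
        if (!(inFile >> documentID >> frequency)) return false;  // truncated list
        if (documentID == inDocumentID)
            return std::fabs(frequency - inDocumentFrequency) < 1e-9;
    }
    return false;  // document not contained in this feature's list
}
```

The offset map keeps the lookup to one seek plus a scan of a single list, which is the reason for keeping feature offsets in memory.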
bool CAcIFFileSystem::generateInvertedFile | ( | ) | [virtual] |
Generating an inverted File, if there is none.
Fast but naive in-memory method. This method is very fast if the whole inverted file (and a bit more) fits in memory at run time. If it does not, extensive swapping results, virtually halting the inverted file creation.
Implements CAcInvertedFile.
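A minimal sketch of what an in-memory inversion looks like, assuming the per-document feature lists are already loaded. The type names (ForwardIndex, InvertedIndex, Posting) and the function invertInMemory are hypothetical and not part of this class; the real method reads feature files and writes the inverted file and the offset file.

```cpp
#include <cassert>
#include <map>
#include <utility>
#include <vector>

using TID = unsigned long;
// One posting: (document ID, document frequency of the feature in that document).
using Posting = std::pair<TID, double>;
// Forward index: document ID -> list of (feature ID, document frequency).
using ForwardIndex = std::map<TID, std::vector<std::pair<TID, double>>>;
// Inverted index: feature ID -> postings of all documents containing the feature.
using InvertedIndex = std::map<TID, std::vector<Posting>>;

// Builds the whole inverted index in RAM in one pass over all documents.
// Memory use grows with the total number of (document, feature) pairs,
// which is why this approach breaks down once the index no longer fits in memory.
InvertedIndex invertInMemory(const ForwardIndex& inDocuments) {
    InvertedIndex result;
    for (const auto& document : inDocuments) {
        for (const auto& feature : document.second) {
            assert(feature.second >= 0.0);  // frequencies are assumed non-negative
            result[feature.first].emplace_back(document.first, feature.second);
        }
    }
    return result;
}
```

Because the forward index is iterated in ascending document ID order, each posting list comes out sorted by document ID without an extra sort.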
list<TID>* CAcIFFileSystem::getAllFeatureIDs | ( | ) | const [virtual] |
Getting a list of all features contained in this.
This function is necessary, because in the present system only about 50 percent of the features are really used.
A feature is considered used if it arises in mIDToOffset.
Implements CAcInvertedFile.
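The "used feature" rule above amounts to collecting the keys of the offset map. A small sketch, assuming an std::unordered_map as a stand-in for CIDToOffset; the function name collectUsedFeatureIDs is hypothetical, and it returns the list by value rather than through a raw pointer as the real method does.

```cpp
#include <cassert>
#include <ios>
#include <list>
#include <unordered_map>

using TID = unsigned long;
// Hypothetical stand-in for CIDToOffset: feature ID -> offset in the inverted file.
using OffsetMap = std::unordered_map<TID, std::streampos>;

// A feature counts as used exactly when it has an entry in the offset map.
std::list<TID> collectUsedFeatureIDs(const OffsetMap& inOffsets) {
    std::list<TID> result;
    for (const auto& entry : inOffsets) result.push_back(entry.first);
    result.sort();  // hash maps iterate in unspecified order, so sort for stable output
    assert(result.size() == inOffsets.size());  // keys are unique, nothing is dropped
    return result;
}
```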
CDocumentFrequencyList* CAcIFFileSystem::getFeatureFile | ( | string | inFileName | ) | const [protected] |
void CAcIFFileSystem::getRandomAccessorElements | ( | list< CAccessorElement > & | outResult, |
list< CAccessorElement >::size_type | inSize | ||
) | const [virtual] |
For drawing random sets.
Why is this part of a CAccessorImplementation? The way the accessor is organised might influence the way random sets can be drawn. At present everything happens in RAM, but we do not want to be fixed on that.
inoutResultList | the list which will contain the result |
inSize | the desired size of the inoutResultList |
Implements CAccessor.
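For the all-in-RAM case mentioned above, drawing a random set can be sketched as a shuffle followed by a cut. This is an assumption about one reasonable approach, not the code used by this class; drawRandomIDs and its generator parameter are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <list>
#include <random>
#include <vector>

using TID = unsigned long;

// Draws up to inSize distinct IDs from inAllIDs, without replacement.
// If fewer than inSize IDs exist, all of them are returned in random order.
void drawRandomIDs(const std::list<TID>& inAllIDs, std::list<TID>& outResult,
                   std::list<TID>::size_type inSize, std::mt19937& ioGenerator) {
    // std::shuffle needs random-access iterators, so copy the list into a vector.
    std::vector<TID> pool(inAllIDs.begin(), inAllIDs.end());
    std::shuffle(pool.begin(), pool.end(), ioGenerator);

    const std::size_t count = std::min<std::size_t>(inSize, pool.size());
    outResult.assign(pool.begin(),
                     pool.begin() + static_cast<std::ptrdiff_t>(count));
    assert(outResult.size() == count);
}
```

Passing the generator in by reference lets the caller seed it once and get reproducible draws in tests.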
void CAcIFFileSystem::getRandomIDs | ( | list< TID > & | , |
list< TID >::size_type | |||
) | const [virtual] |
Get a given number of random document IDs.
inoutResultList | the list which will contain the result |
inSize | the desired size of the inoutResultList |
Implements CAccessor.
virtual pair<bool,CAccessorElement> CAcIFFileSystem::IDToAccessorElement | ( | TID | inID | ) | const [virtual] |
Translate a DocumentID to an accessor Element.
Implements CAccessor.
bool CAcIFFileSystem::newGenerateInvertedFile | ( | ) |
Generating an inverted File, if there is none.
Employs the two-way merge method described in "Managing Gigabytes", chapter 5.2, "Sort-based inversion" (page 181).
Reimplemented from CAcInvertedFile.
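The general idea of sort-based inversion with a two-way merge can be sketched as below. This is a simplified in-memory illustration under stated assumptions: in a real sort-based inversion the runs would be temporary files rather than vectors, and the Triple layout and the names sortBasedInversion and lessByFeatureThenDocument are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <tuple>
#include <utility>
#include <vector>

using TID = unsigned long;

// One (feature, document, frequency) record; the layout is an assumption.
struct Triple {
    TID featureID;
    TID documentID;
    double frequency;
};

using Run = std::vector<Triple>;

// Order by feature first, then by document, so each feature's list ends up contiguous.
bool lessByFeatureThenDocument(const Triple& a, const Triple& b) {
    return std::tie(a.featureID, a.documentID) < std::tie(b.featureID, b.documentID);
}

// Sorts each run, then merges neighbouring runs pairwise until one sorted run remains.
Run sortBasedInversion(std::vector<Run> runs) {
    for (Run& run : runs)
        std::sort(run.begin(), run.end(), lessByFeatureThenDocument);

    while (runs.size() > 1) {
        std::vector<Run> merged;
        for (std::size_t i = 0; i + 1 < runs.size(); i += 2) {
            Run out;
            std::merge(runs[i].begin(), runs[i].end(),
                       runs[i + 1].begin(), runs[i + 1].end(),
                       std::back_inserter(out), lessByFeatureThenDocument);
            merged.push_back(std::move(out));
        }
        // An odd run out has no partner in this pass and is carried over unchanged.
        if (runs.size() % 2 == 1) merged.push_back(std::move(runs.back()));
        runs = std::move(merged);
    }

    Run result = runs.empty() ? Run() : std::move(runs.front());
    assert(std::is_sorted(result.begin(), result.end(), lessByFeatureThenDocument));
    return result;
}
```

Only two runs need to be read at a time during a merge, which is what keeps the memory footprint bounded when the runs live on disk.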
virtual pair<bool,TID> CAcIFFileSystem::URLToID | ( | const string & | inURL | ) | const [virtual] |
Translate a URL to its document ID.
Implements CAcInvertedFile.
CADIHash CAcIFFileSystem::mDocumentInformation [protected] |
Additional information about the document, e.g. the Euclidean length of the feature list.
Reimplemented from CAcInvertedFile.
CSelfDestroyPointer<CAcURL2FTS> CAcIFFileSystem::mURL2FTS [protected] |
In order to have just one parent, I have to limit myself to single inheritance.
I cannot use virtual base classes, because then I cannot downcast.
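The design choice above can be illustrated with a small sketch: the helper is held through an owning pointer member instead of being a second base class, so the accessor keeps a single, non-virtual parent and a static_cast downcast stays legal. All class names here are hypothetical, and std::unique_ptr is used as an assumed stand-in for CSelfDestroyPointer.

```cpp
#include <cassert>
#include <memory>

// Single (non-virtual) base class, standing in for the accessor hierarchy.
struct AccessorBase {
    virtual ~AccessorBase() = default;
    virtual int size() const = 0;
};

// Stand-in for the URL-to-feature-file helper that is NOT inherited from.
class UrlTable {
public:
    explicit UrlTable(int inCount) : mCount(inCount) {}
    int size() const { return mCount; }
private:
    int mCount;
};

// The helper is a member (composition), so there is exactly one parent class.
class FileSystemAccessor : public AccessorBase {
public:
    explicit FileSystemAccessor(int inCount) : mUrlTable(new UrlTable(inCount)) {}
    int size() const override {
        assert(mUrlTable);          // the helper is created in the constructor
        return mUrlTable->size();   // delegate to the owned helper
    }
private:
    std::unique_ptr<UrlTable> mUrlTable;  // destroyed together with the accessor
};

// With a non-virtual base, static_cast can downcast; with a virtual base,
// this cast would not compile and dynamic_cast would be needed instead.
int sizeViaDowncast(const AccessorBase& inBase) {
    return static_cast<const FileSystemAccessor&>(inBase).size();
}
```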