CSProX Description
CSProX is an integrated system for processing statistical data. It combines a powerful interactive data capture system with a batch module for analyzing micro data. Both modules share the same data dictionaries and the same language, although each has specialized functions oriented to the tasks it was built for.
CSProX is derived from CSPro, a public domain system. SERPRO played a fundamental part in CSPro's development, not only by developing software but also by providing the source code of the predecessor system, ISSA-X, which formed the core of the first version of CSPro. On that basis, SERPRO's software development contract stated that it could use the source code to expand the public system's capabilities and produce a commercial version from it.
One of the most distinctive characteristics of CSProX is its ability to perform data capture in a client/server environment over the internet or an intranet, with full multi-user support. The captured data are stored in a relational database. So far Oracle, SQL Server, SyBase and MySQL have been tested, although in theory any relational database system with an ODBC driver should work.
The batch module of CSProX currently works as a stand-alone system, although in the near future the client/server platform will be extended to this module as well. One of the most important features of this module will be the analysis of micro data over the internet and, more importantly, the option for multiple institutions or companies that hold micro data of common interest to share their data without losing the privacy of the primary information (the micro data will never leave the institution's or company's server).
Client/Server Features
The client/server system has been carefully designed to meet the most demanding requirements of the statistical community. The software runs under Windows 2000/XP/NT/2003 and uses industry-standard components such as SOAP (GSOAP), Internet Information Server, and the Secure Sockets Layer (SSL).
During system development, the following concepts were considered crucial and were always kept in view:
Security: Since CSProX allows clients to connect over the internet, security was a major concern. The response was to implement the Secure Sockets Layer (SSL), a protocol developed by Netscape for transmitting private information. As you are probably aware, most of the information gathered by surveys is confidential and protected by statistical secrecy laws. SSL encrypts the information using two keys: a public key known to everyone and a private key known only to the recipient of the message.
Quick Response: The system is normally used to capture data for large and sophisticated surveys. Because each answer can be checked against any previous answers (online consistency checking), transmitting the data piece by piece could be too expensive in terms of response time, especially when the connection is slow. The alternative chosen was therefore a client player, or runtime driver, that executes all the application logic locally, as opposed to a simple internet browser with the application logic executing on the server.
On-Line and Off-Line Data Capture: As already mentioned, CSProX is, among other things, an intelligent data capture system. As such, its most important capabilities are oriented to CAPI (Computer Aided Personal Interview) surveys. We therefore cannot require clients to be connected to the server at all times, since they will frequently be in remote locations with no internet connection available, and the ability to work offline for extended periods was a basic requirement. The client (runtime) module of CSProX has the built-in ability to transmit and synchronize the gathered data automatically as soon as the client connects to the server. This is one more reason to have the runtime module resident on the client's computer, since it must also run as a stand-alone system.
Clients Grouped in Clusters: Given the nature of the software, clients need to be classified into clusters or groups according to one or more criteria. Groups can in turn be clustered into higher-level groups, creating a tree structure. The important consequence is that clients inherit the characteristics of their group (e.g. access rights to databases or dictionaries), although these can be altered individually by the project administrator. This lets each member of a group, given the right permissions, have the same access rights to the data/cases captured by other members of the same group. The main objective is to allow any member of the group to modify or finish partially completed interviews, removing the dependence on individual interviewers or operators. The second objective is that cases captured by one group are invisible to other groups (as if they did not exist) unless particular clients, such as administrators or supervisors, have special rights.
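As a minimal sketch of how inherited rights in such a group tree might be resolved, the following Python models groups and clients. The class names (`Group`, `Client`), the right names (`edit_cases`, `supervise`) and the visibility rule are all assumptions made for illustration; they are not taken from CSProX, which may resolve rights differently.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Group:
    """A node in the group tree (hypothetical model, not CSProX's own)."""
    name: str
    parent: Optional["Group"] = None
    rights: set[str] = field(default_factory=set)

    def effective_rights(self) -> set[str]:
        # Rights accumulate from the root of the tree down to this group.
        inherited = self.parent.effective_rights() if self.parent else set()
        return inherited | self.rights


@dataclass
class Client:
    name: str
    group: Group
    granted: set[str] = field(default_factory=set)  # per-client additions
    revoked: set[str] = field(default_factory=set)  # per-client removals

    def effective_rights(self) -> set[str]:
        # Start from the group's rights, then apply the administrator's
        # individual overrides for this client.
        return (self.group.effective_rights() | self.granted) - self.revoked


def can_access_case(client: Client, case_owner: Client) -> bool:
    """Assumed rule: a case is visible inside the owner's group, or to supervisors."""
    rights = client.effective_rights()
    if "supervise" in rights:
        return True
    return client.group is case_owner.group and "edit_cases" in rights
```

With this model, two interviewers in the same group can finish each other's cases, a client in a sibling group cannot see them, and a client granted `supervise` at a higher level can see both.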
Automatic Update of Clients' Application: Each project or set of applications is stamped with an MD5 (Message Digest algorithm 5) digest. When the client establishes a connection with the server, the MD5 digests of the server's copy and the client's copy of the project are compared to make sure that both are identical (within some probabilistic error limit). Whenever the digests differ, the applications are automatically refreshed on the client's computer. This feature guarantees that any changes made to the project's applications are rapidly passed on to the clients.
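The digest comparison described above can be sketched in Python with the standard `hashlib` module. How CSProX actually computes the project stamp is not documented here, so the way the files are combined into one digest, the function names `project_digest` and `needs_refresh`, and the file names in the usage note are assumptions.

```python
import hashlib


def project_digest(files: dict[str, bytes]) -> str:
    """Return one MD5 hex digest for a whole project.

    `files` maps a relative file name to its contents (a real client would
    read them from disk). Names are sorted so the digest does not depend on
    the order in which the files are listed.
    """
    md5 = hashlib.md5()
    for name in sorted(files):
        data = files[name]
        md5.update(name.encode("utf-8") + b"\0")
        # A length prefix keeps adjacent files from blending into each other.
        md5.update(len(data).to_bytes(8, "big"))
        md5.update(data)
    return md5.hexdigest()


def needs_refresh(client_files: dict[str, bytes], server_digest: str) -> bool:
    """True when the client's copy differs from the server's stamped digest."""
    return project_digest(client_files) != server_digest
```

For example, if the server holds `{"survey.dcf": b"dictionary v2"}` and the client still has `b"dictionary v1"` under the same (hypothetical) file name, `needs_refresh` returns `True` and the client would download the project again.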
CSProX and Relational Databases: All data transmitted to the server are sent to a relational database through the ODBC driver. In addition, all data captured by the clients are also written locally in the CSProX format (an ASCII file), which is fully compatible with the CSPro file structure. Subsequent versions of CSProX will let the administrator decide whether or not the case is written locally. The software has tools for automatically generating the necessary relational tables from any CSProX data dictionary. In addition, data can be exported from CSProX data files to the relational database, or vice versa, at the click of a button. The system has been tested with Oracle, SyBase, SQL Server and MySQL.
CSProX Flexibility: A CSProX server can be configured to run with or without secure mode, and either as a stand-alone server or as an ISAPI server.
THE DATA CAPTURE MODULE
This module can be used in two different ways, depending on the final objectives of the individual application or project. The first is intelligent data entry, generally including online consistency checks (the amount of online editing depends entirely on the case and on the application developer) and normally aimed at revealing errors introduced by the data entry (DE) operator rather than those that originated during data capture. The second use, probably the more important one, focuses on capturing the data directly from the information source into the computer, with the software guiding the flow of the data capture session (CAPI and CATI). This approach differs from the first not only in the strategy followed by the application developer but also in the amount of relevant information that the system can give the respondent in order to obtain a better and more accurate answer.
DATA ENTRY SYSTEM: In this approach, CSProX offers several possibilities depending on the specific requirements. The first option is no online editing at all, leaving error detection to a second stage in which the data are key-entered again and each field/variable is checked against the original data (online verification). During online verification the original data are held in the background, and whenever a difference is detected by direct comparison of the two values, the operator is obliged to fix it before continuing.
The second option is to include as much online consistency checking as wanted or needed, alerting the data entry operator to any error found. However, the application lets the operator decide whether or not to fix the inconsistency. With this option, some errors can be fixed in a second stage, where more qualified personnel can analyze the problems and decide how to fix them. Normally these errors originate during the data capture stage and require a more delicate and refined analysis before they are fixed (if they are fixed at all). In both options, the system automatically executes unconditional as well as conditional skips without operator intervention, and the skips cannot be overridden unless the application allows it.
The third option combines the features mentioned above with more relaxed system control that allows DE operators to freely override the application's skips with a single mouse click on the desired target field.
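The online verification used in the first option (re-keying a case and comparing it field by field with the original) can be illustrated with a short Python sketch. The function name `first_mismatch` and the field names in the usage note are hypothetical, and the sketch only locates the field where the operator would be stopped; it does not model how the difference is then resolved.

```python
from typing import Optional


def first_mismatch(original: dict[str, str], rekeyed: dict[str, str]) -> Optional[str]:
    """Return the first field whose re-keyed value differs from the original.

    Fields are visited in the order in which they appear in `original`,
    mirroring an operator who keys the case from top to bottom. A field that
    has not been re-keyed yet counts as a difference. Returns None when every
    field matches.
    """
    for name, value in original.items():
        if rekeyed.get(name) != value:
            return name
    return None
```

For instance, with an original case `{"age": "34", "sex": "2"}` and a second keying of `{"age": "43", "sex": "2"}`, the function returns `"age"`, the field at which the operator would have to stop and resolve the difference.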
The main features of the data entry system (common to the data capture in general) are as follows:
Data Dictionaries (DD) are easily defined using a friendly interface. The DD can be stored in two different formats: the normal CSPro format, and a format fully compliant with the DDI (Data Documentation Initiative), which "is an effort to establish an international XML-based standard for the content, presentation, transport, and preservation of documentation for datasets in the social and behavioral sciences". For more information regarding the DDI, please visit http://www.icpsr.umich.edu/DDI/index.html.
The forms or data entry screens can be generated automatically by the system from the data dictionary definition. Starting from the default forms, the application developer can enhance them with special fonts and colors and edit the default texts to end up with a high-quality data entry screen. If the project has existing paper forms or questionnaires, they can be scanned and used as the background of the data entry screens, and fields can be overlaid on the corresponding spots with a simple drag-and-drop action.
The rich and powerful CSProX language, combined with the concepts on which the system has been developed, makes it possible:
To create a simple data entry application in a few minutes with the confidence that at least the data file will be structurally correct. The application developer can also opt for a sophisticated application using all the system resources.
To include online consistency checks for fields as they are key-entered. The checks can be of a wide variety, the most important being: (i) applying logical or arithmetic relationships among the current and previously entered variables; (ii) applying algorithms of any sophistication (e.g. a check digit); (iii) checking against related tables, either from the same database or from different databases. Among the algorithms the application developer can apply, the system has a powerful and useful one that performs "automatic coding" of gloss/text based on keywords. This algorithm is extremely useful for coding variables such as Occupation and Industry.
The data entry system has two distinct modes of behavior. The first, and the most commonly used, is a rigid mode also known as "system's controlled", in which all skips are strictly handled by the application. This defines a data path that the DE operator cannot alter unless the data entered are modified: whenever the operator moves back to a previous field, the system makes sure that the path followed when moving forward is also followed when moving backward (no skipped field can be reached by mouse clicking or back-tabbing). The second is a relaxed mode also known as "operator's controlled", which differs from the first in that the operator can override any skip, either by clicking on a skipped field or by back-tabbing to it. This mode does not guarantee the integrity of the data structure, but in some cases it might be useful.
The system allows data to be entered into different files from the same application without any operator intervention.
Both data modification and data verification are performed with the same application used to enter the data. This implies that the same consistency checks and the same logic are applied in all instances (data entry, modification and verification). The cases to be modified or verified are accessed directly by their keys, all of which are displayed in the tree on the left side of the screen.
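Two of the kinds of online consistency check listed among the features above, a check-digit algorithm and an arithmetic relationship between fields, can be sketched as follows. This is Python rather than the CSProX language, the Luhn algorithm is used only as a well-known example of a check digit (the source does not say which scheme a given survey would use), and the field names are hypothetical.

```python
def luhn_is_valid(number: str) -> bool:
    """Validate an identifier whose last digit is a Luhn check digit."""
    if not number.isdigit():
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit and
    # subtract 9 when the doubled value has two digits.
    for position, ch in enumerate(reversed(number)):
        digit = int(ch)
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def check_age_against_schooling(age: int, years_of_schooling: int) -> list[str]:
    """Cross-field check: a value entered now is tested against an earlier one."""
    errors = []
    if years_of_schooling > age:
        errors.append("years of schooling exceeds age")
    return errors
```

In an interactive session, a check like `luhn_is_valid` would run as soon as the identifier field is keyed, while `check_age_against_schooling` would run when the later of the two fields is entered.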
DATA CAPTURE: It is in this arena that the system shows outstanding features that set it apart from other systems on the market. There are basically two reasons for this: (i) once again, the powerful system language plays an important role; (ii) the creativity embedded in the software makes communication between the application developer and the respondent a simple task. Both characteristics have a decisive impact on the quality of the information gathered.
The factors that most heavily affect the quality of the information can be summarized as follows:
Precision in the questions and in the directions given to respondents, together with contextual online help on the specific topic being captured. Questions and directions to the respondent can be generated dynamically using information already gathered in previous questions. In the same way, the possible answers displayed, from which the respondent has to choose, can be constrained by previous answers. Online help for any question can be a combination of text and contextual information based on answers to previous questions.
The use of algorithms and of information stored in the same or another database helps to detect errors in the information at the moment it is entered. Thus, the respondent is able to fix the problem immediately.
The first factor mentioned above is supported by a special module called "Questions and Text Editor", which can work either stand-alone or in conjunction with the application designer. This module is a full RTF (Rich Text Format) editor with many features oriented to the specific task. Parameters can be embedded in the text and replaced at run time by their respective values (dynamic formulation of questions), and multiple questions can be formulated for the same issue (conditional formulation of questions).
All the features mentioned for the Data Entry System are common to Data Capture. In fact, the same module is used for both approaches; only the degree of sophistication and the system resources used vary. Naturally, more sophisticated data capture requires more system functionality.
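Dynamic and conditional formulation of questions can be illustrated with a small Python sketch. The `$name` placeholder syntax belongs to Python's `string.Template`, not to the Questions and Text Editor, and the function name `render_question`, the field names and the question wordings are assumptions made for the example.

```python
from string import Template
from typing import Callable

# Each variant pairs a condition on the previous answers with a question text.
Variant = tuple[Callable[[dict[str, str]], bool], str]


def render_question(variants: list[Variant], answers: dict[str, str]) -> str:
    """Pick the first variant whose condition holds and fill in its parameters."""
    for condition, text in variants:
        if condition(answers):
            # Placeholders such as $name are replaced by earlier answers.
            return Template(text).substitute(answers)
    raise LookupError("no question variant matches the answers given so far")


children_question: list[Variant] = [
    (lambda a: a["sex"] == "F", "How many children has $name given birth to?"),
    (lambda a: True, "How many children does $name have?"),
]
```

Calling `render_question(children_question, {"name": "Maria", "sex": "F"})` selects the first wording and inserts the respondent's name, while any other value of `sex` falls through to the general wording.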
BATCH MODULE
The functionality of the batch module is oriented to three main objectives of statistical data processing:
ENTRY in BATCH. This allows the data entry or data capture application to be run in batch mode. Using as input the same data entered through the interactive module (data entry or capture), it provides an organized listing of all the inconsistencies that were detected by the application but ignored by the data entry operator. In addition, the application developer might have restricted some consistency checks to run only in batch mode, so those errors were not revealed during data entry. Errors detected at this stage can be analyzed and eventually fixed by more qualified personnel, letting the operator concentrate on fixing only the errors that originated during the entry session.
Missing Data and Imputation. This is an important step prior to the analysis of the micro data. Once again, the system has a rich language for imputing missing data or variables that fail consistency checks. Moreover, census data and many other sources are increasingly captured via scanners or other optical reader devices, in which case the batch module is the only way to check data consistency and perform other analyses of the micro data (e.g. completeness and data structure). When data are input through optical devices, they are normally stored in relational databases, from which CSProX can import them into CSProX format (the batch module cannot presently process data directly from the relational database).
Micro-Data Analysis. This is without doubt the end goal of statistical data processing, and so a great effort has been devoted to building a powerful tabulation system into the batch module. Currently there is a powerful graphical interface that makes it easy to obtain all kinds of sophisticated tables using drag and drop. Some basic statistics are also simple to obtain, such as mean, median and percentiles, minimum, maximum, mode, standard deviation, variance, and percents (row, column, layer and table total).
The system has a variety of batch utilities that perform the following tasks in a simple way: (i) exporting data to SPSS, SAS and STATA, or simply to ASCII with a specific character delimiting each field; (ii) producing marginal and conditional frequencies; (iii) sorting data files in which each case has a variable number of records; (iv) reformatting data; (v) generating data indexes.
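To make two of these utilities concrete, the sketch below shows what marginal frequencies, conditional frequencies and a delimited ASCII export compute. It is plain Python over a list of cases held as dictionaries, not the CSProX utilities themselves; the function names and the variables `sex` and `employed` in the usage note are hypothetical.

```python
from collections import Counter

Case = dict[str, str]


def marginal_frequencies(cases: list[Case], variable: str) -> Counter:
    """Count how often each value of `variable` occurs over all cases."""
    return Counter(case[variable] for case in cases)


def conditional_frequencies(cases: list[Case], variable: str, given: str) -> dict[str, Counter]:
    """Count the values of `variable` separately for each value of `given`."""
    table: dict[str, Counter] = {}
    for case in cases:
        table.setdefault(case[given], Counter())[case[variable]] += 1
    return table


def export_delimited(cases: list[Case], variables: list[str], delimiter: str = "\t") -> str:
    """Render the cases as ASCII text with one delimiter between fields."""
    lines = [delimiter.join(variables)]
    for case in cases:
        lines.append(delimiter.join(case[v] for v in variables))
    return "\n".join(lines)
```

Given three cases with `sex` and `employed` fields, `marginal_frequencies(cases, "sex")` counts each sex once over the whole file, while `conditional_frequencies(cases, "employed", given="sex")` returns one count of employment status per sex.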